The architecture of the Wybe compiler exhibits the following structure: `wybemk.hs` holds the top level of the compiler, which delegates the bulk of the work to `Builder.hs`. That module oversees the phases of compilation, as well as determining which files actually need to be (re-)compiled. All of the compiler phases are built upon the basic data structures and operations, which are defined in `AST.hs`.

The body of the compiler lies in the semantic analysis, compilation, and optimisation. This part can be seen in more detail as several separate passes:
The source files in this directory and their purposes are:
File | Purpose |
---|---|
AliasAnalysis.hs | Alias analysis for a single module |
Analysis.hs | Entry point of all kinds of analysis for a single module |
AST.hs | Wybe Abstract Syntax Tree and LPVM representation |
ASTShow.hs | Show Wybe intermediate representation |
BinaryFactory.hs | Deriving AST Types to be Binary instances |
Blocks.hs | Transform a clausal form (LPVM) module to LLVM |
BodyBuilder.hs | A monad to build up a procedure Body, with copy propagation |
Builder.hs | Handles compilation at the module level |
Callers.hs | Find all callers for each proc and count static calls per caller |
Clause.hs | Convert Wybe code to clausal (LPVM) form |
Codegen.hs | Generate and emit LLVM from basic blocks of a module |
Config.hs | Configuration for the Wybe compiler |
Emit.hs | Emit LLVM code |
Expansion.hs | Replace certain procedure calls with others |
Flatten.hs | Flatten function calls (expressions) into procedure calls |
Macho.hs | Extended version of parser for Mach-O object format files |
Normalise.hs | Convert parse tree into an AST |
ObjectInterface.hs | Parse and edit an object file |
Optimise.hs | Framework to optimise a single module |
Options.hs | Handle compiler options/switches |
Parser.hs | Parser for the Wybe language using Parsec |
Resources.hs | Resource checker for Wybe |
Scanner.hs | Lexical scanner for the Wybe language |
Snippets.hs | Convenience functions for generation of Wybe AST |
Transform.hs | Transform LPVM after alias analysis |
Types.hs | Type checker/inferencer for Wybe |
Unbranch.hs | Turn loops and conditionals into separate procedures |
Unique.hs | The unique typing system for Wybe |
UnivSet.hs | Provide a set type supporting the universal set |
Util.hs | Various small utility functions |
wybemk.hs | Wybe compiler/builder main code |
The Wybe compiler handles module dependencies and builds executables by itself, without the need for build tools like `make`. The function `buildTarget` is responsible for determining which source files need to be compiled, and for building the final outputs (whether executable, object file, archive, etc.).
Wybe allows a module to be defined by a single `.wybe` file, or by a directory containing multiple submodules. In the latter case, the directory must contain a (possibly empty) file named `_.wybe`, which should contain all of that module's source code other than its submodules.
The compiler can generate all these output formats, using the file extension to work out which type of file to build: executable, object code, LLVM bitcode, LLVM assembly code, or archive file. In general, it can build any type of output from either type of input, but note that on Unix-style systems, both directories and executables have empty file extensions, so it is not possible to build an executable from a directory. Asking to build an object, LLVM bitcode, or LLVM assembler file for a directory module is interpreted as asking to recursively build that type of file for each Wybe source file and module directory within that directory.
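The extension dispatch can be sketched as follows. This is a hypothetical illustration, not the compiler's actual code: the function name, the result strings, and the `.a` archive extension are all assumptions.

```haskell
import System.FilePath (takeExtension)

-- Hypothetical sketch: map a target file's extension to the kind of
-- output to build.  An empty extension means an executable (or, on
-- Unix-style systems, possibly a directory, which is ambiguous).
targetType :: FilePath -> Maybe String
targetType path = case takeExtension path of
    ""    -> Just "executable"
    ".o"  -> Just "object code"
    ".bc" -> Just "LLVM bitcode"
    ".ll" -> Just "LLVM assembly"
    ".a"  -> Just "archive"     -- assumed extension for archive files
    _     -> Nothing
```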
The compiler stores its internal data structures in the object files it generates, and extracts that information from the object file, rather than loading and compiling the source file, when the object file is newer than the source file. In the case of a directory module, the nested `_.o`, `_.bc`, or `_.ll` file holds all of the contents of that module that come from the `_.wybe` file. That is, the `_.wybe` file is compiled much like other modules, except that its module name is that of the parent directory, rather than the file itself.
To keep compile times manageable while supporting optimisation, we compile modules bottom-up, ensuring that all of a module's imports are compiled before compiling the module itself. In the case of circular module dependencies, each strongly-connected component in the module dependency graph is compiled as a unit. This is handled by the `compileModule` function, which includes the functionality for finding the SCCs in the module dependency graph. Each SCC is then compiled by the `compileModSCC` function.
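The bottom-up ordering can be sketched with the standard `Data.Graph` library. This is an illustration of the idea, not the actual `compileModule` code; the module names are represented here simply as strings.

```haskell
import Data.Graph (flattenSCC, stronglyConnComp)

-- Sketch: given each module paired with the modules it imports, group
-- the modules into strongly-connected components and return them
-- bottom-up, so every SCC appears after the SCCs it imports.
-- stronglyConnComp returns SCCs in reverse topological order, which is
-- exactly the bottom-up compilation order described above.
compileOrder :: [(String, [String])] -> [[String]]
compileOrder mods =
    map flattenSCC (stronglyConnComp [(m, m, imps) | (m, imps) <- mods])
```

For example, with mutually dependent modules A and B and a module C importing A, the A/B component is compiled first, as a unit, and C afterwards.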
One shortcoming of the bottom-up approach is that some optimisations depend upon the way a procedure is used, which cannot be determined from the code itself; such analyses are best performed top-down. For example, if we can determine that a structure will not be referenced after a call to a procedure, that procedure may be compiled to destructively modify or reuse that structure. We address this with multiple specialisation: we compile different versions of the code for calls that may reference that argument again and for calls that cannot. The bottom-up analysis produces "requests", which indicate what top-down analysis results would allow more efficient specialisations of the code. Then, when generating the final executable, when all the code is available, we determine how beneficial each specialisation would be, select the most useful specialisations to actually produce, and generate code for any that have not already been generated.
Ensuring that all compiler phases happen in the right order is subtle, particularly in the face of mutual module dependencies. Following are the ordering dependencies.
- Types: the types a type depends on need to have been processed before the type itself, so that sizes are known. In the case of recursive or mutually recursive type dependencies, all types in the recursion must be implemented as pointers. Types are transformed into (sub)modules, and constructors, deconstructors, accessors, mutators, and auxiliary type procedures (equality tests, plus eventually comparisons, printers, pretty printers, etc.) are all generated as procedures within those submodules. Therefore, these must be handled as submodules are.
- Resources: the resources a resource depends on must have been processed before the resource itself. (We currently don't support defining resources in terms of others, but the plan is to support that.) The types in the module that defines a resource, and all module dependencies, must have been processed at least enough to know they have been defined before processing the resource declaration.
- Top-level statements in a module: these are transformed to statements in a special procedure whose name is the empty string as the statements are processed, so their dependencies are the same as for statements in ordinary procedure bodies.
- Functions: functions and function calls are transformed to procedures and procedure calls without reference to anything external to the functions themselves, so function dependencies behave exactly like procedure dependencies.
- Procedures: the procedures a procedure calls must have been type and mode checked before they can themselves be type/mode checked, and must be analysed and optimised before they can themselves be analysed/optimised. All the procedures in the (sub)modules that define each type a procedure uses, as either a parameter or local variable type, must have been processed the same way before processing the procedure itself.
- Nested submodules: the submodules defined within a module file, including the types, must be processed as mutual dependencies of that module, which they are. The nested submodules of a module (including types) have access to all public and private members of the parent module, and the parent has access to all public members of its submodules, so they are mutual dependencies.
- Directory modules: a directory containing a file named `_.wybe` is also a module, with all the contained `.wybe` files as submodules. However, these submodules do not automatically import all the other modules in that directory, although they can explicitly import any sibling modules they need.
This means only minimal processing can be done before module dependencies are noted and read in. So we handle all these dependencies by initially reading a module to be compiled and handling its contents as follows:

- Types: create and enter the submodule, note that the parent imports it, and process its constructors and other contents.
- Submodules: create and enter the submodule, note that the parent imports it, and process its contents.
- Resources: record for later processing.
- Pragmas: record for later processing.
- Constructors and low-level (representation) types: record for later type layout, code generation, etc.
- Top-level statements: add statements to the special "" proc in the module, creating it if necessary.
- Procs and functions: record them for later normalisation, analysis, optimisation, etc.
Once we have finished loading the sources for the specified targets, and all their dependencies, we compute the SCCs in the module dependency graph. Then for each SCC, in topological order, we call `compileModSCC`, which does the following:
- Traverse all recorded type submodules in the module list, finding all type dependencies; topologically sort them and identify SCCs. For each SCC:
    - Determine the type representation for all constructors.
    - Record the primitive representation of the type.
    - Generate and record all constructor, accessor, mutator, and utility procs.

  This is handled in the Normalise module.
- Check all resource imports and exports. (Resources.hs)
- Normalise all recorded procs in their modules, including generated constructors, etc. (Normalise.hs)
- Validate the types of exported procs. (Types.hs)
- Check proc argument types, resolve overloading, and mode check all procs. (Types.hs)
- Check proc resources and transform them to args. (Resources.hs)
- Transform away branches, loops, and nondeterminism. (Unbranch.hs)
- Topologically sort the proc call graph and identify SCCs. For each SCC, bottom-up, do the following:
    - Compile procs to clausal form (Clause)
    - Optimise procs (Optimise)
We transform away all expression types except for constants and variables. Where, let, conditional, and function call expressions are turned into statements that bind a variable, and then the variable is used in place of the expression. All function calls are transformed into procedure calls by adding an extra argument corresponding to the function result.
Function call expressions can take one of three forms. Expressions where all arguments are inputs are turned into a procedure call with a fresh temporary variable as an output, which is called before the statement in which that function call appears. The function call itself is then replaced by a reference to the temporary variable. For example, `p(f(x,y),z)` is replaced by `f(x,y,?t); p(t,z)`.
A function call containing some output arguments, and perhaps some inputs, is transformed into a fresh input variable, with a later proc call to that function with that variable as an added input. For example, `p(f(?x,y),z)` is transformed to `p(?t,z); f(?x,y,t)`. Finally, a function call containing some input-output arguments, and perhaps some input arguments, is transformed into an input-output variable, plus two procedure calls, one to compute the initial value of the expression, and a second to assign it the specified new value. For example, a statement `p(f(!x,y),z)` is transformed to `f(x,y,?t); p(!t,z); f(!x,y,t)`.
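The all-inputs case can be sketched as follows. This is a toy illustration using a made-up `Exp` type and string-rendered statements, not the compiler's real AST or the Flatten module's actual code.

```haskell
import Data.List (intercalate)

-- Illustrative expression type: a variable or a function call.
data Exp = Var String | Call String [Exp]

-- Sketch of the all-inputs case: rewrite nested calls bottom-up,
-- introducing a fresh temporary (?tN) as each call's output.  Returns
-- the next fresh-variable counter, the generated procedure-call
-- statements (as strings), and the variable replacing the expression.
flatten :: Int -> Exp -> (Int, [String], String)
flatten n (Var v) = (n, [], v)
flatten n (Call f args) =
    let step (m, ss, vs) a =
            let (m', ss', v) = flatten m a in (m', ss ++ ss', vs ++ [v])
        (n', stmts, vars) = foldl step (n, [], []) args
        tmp = "t" ++ show n'
        call = f ++ "(" ++ intercalate "," (vars ++ ["?" ++ tmp]) ++ ")"
    in (n' + 1, stmts ++ [call], tmp)
```

Flattening `p(f(x,y),z)` (treating `p` itself as a call) produces the statements `f(x,y,?t0)` and `p(t0,z,?t1)`, matching the transformation described above.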
There are three passes related to resources within the Wybe compiler.

The first pass canonicalises the `use` declarations in procedure prototypes. This resolves the module to which each resource refers.

The second, performed after type checking, transforms resource usage into regular parameters/arguments. Here, each procedure is modified to include parameters corresponding to each resource and its respective flow. Each call is also transformed to include the resources as arguments. The order of arguments is dictated by the ordering of the `ResourceSpec` type, ordering by module then name. `use` blocks are also transformed to contain LPVM `move` instructions for the respectively used resources, to save and subsequently restore the value of each resource variable.
The final pass transforms resources into global variables. This pass must be performed after uniqueness checking, as uniqueness checking assumes only the second pass has been performed. The additional parameters and arguments introduced in the second pass are removed here, as they are now passed in global variables. Each reference to an in-scope resource by name is transformed into a load (input), a store (output), or both (in/out). Finally, `use` blocks are also transformed to save (load) and then restore (store) each out-of-scope resource.
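The load/store mapping for in-scope resource references can be sketched as follows. The `Flow` type and the string-rendered instructions are illustrative only, not the compiler's real types.

```haskell
-- Illustrative flow annotation for a resource use.
data Flow = FlowIn | FlowOut | FlowInOut

-- Sketch: a reference to resource r becomes a load (input), a store
-- (output), or a load followed by a store (in/out), as described above.
resourceAccess :: String -> Flow -> [String]
resourceAccess r FlowIn    = ["load " ++ r]
resourceAccess r FlowOut   = ["store " ++ r]
resourceAccess r FlowInOut = ["load " ++ r, "store " ++ r]
```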
The final two passes assume that type and mode checking have occurred. Type checking ensures that resources have the correct types, and mode checking ensures that resources, where applicable, are in scope.
Our type inference is flow sensitive, that is, types flow from callees to callers, but not vice versa. Therefore, types must be uniquely determined by proc definitions.
Because submodules in a file automatically have access to all items (even private ones) in their supermodule, submodules in that file are considered to depend on their supermodules. Likewise, supermodules automatically import everything exported by their submodules in the same file, so supermodules depend on their submodules. This means all modules in a given file are always in the same module dependency SCC. Since SCCs are type checked in topological order, this ensures that all proc calls can only refer to procs that have already been type checked or are defined in the current SCC.
Type checking is responsible for overloading resolution; therefore, during type checking there may be multiple possible procs that could be referenced by an individual call. To support this, we use a type `RoughProcSpec`, which represents a proc as best we are able to identify it. This is only used during type checking to determine potential call graph SCCs. Type checking/inference is then performed bottom-up by potential call graph SCC.
Handling of resources here is a little tricky, because resources in lower SCCs will have been transformed into parameters, but resources in the current SCC will not have been transformed. This problem is unavoidable because types must be determined (so that overloading can be resolved) before resources can be transformed. Therefore, type checking must be prepared to handle both calls that have had resources transformed to parameters and calls that haven't.
This code transforms loops into fresh recursive procs, and ensures that all conditionals are the last statements in their respective bodies. Note that conditionals can be nested, but at the end of the conditional, they must return to the caller. This is handled by introducing a fresh continuation proc for any code that follows the conditional. The reason for this transformation is that a later pass will convert to a logic programming form which implements conditionals with separate clauses, each of which returns on completion.
Loops are a little more complicated. `do {a b} c d` would be transformed into `next1`, where `next1` is defined as `def next1 {a b next1}`, and `break1` is defined as `def break1 {c d}`. Then `Next` and `Break` are handled so that they cancel all the following code in their clause body. For example, `Next a b` would be transformed to just `next1`, where `next1` is the next procedure for that loop. Similarly `Break a b` would be transformed to just `break1`, where `break1` is the break procedure for that loop. Inside a loop, a conditional must be handled specially, to support breaking out of the loop. Inside a loop, `if {a:: b | else:: c} d e` would be transformed to a call to `gen1`, where `gen1` is defined as `def gen1 {a:: b gen2 | else:: c gen2}` and `gen2` is defined as `def gen2 {d e}`. So for example `do {a if {b:: Break} c} d e` would be transformed into `next1`, which is defined as `def next1 {a gen1}`, `gen1` is defined as `def gen1 {b:: break1 | else:: gen2}`, `gen2` is defined as `def gen2 {c next1}`, and `break1` is defined as `def break1 {d e}`.
The tricky part of all this is handling the arguments to these generated procedures. For each generated procedure, the input parameters must be a superset of the variables used in the body of the procedure, and must be a subset of the variables defined prior to the generated call. Similarly, the output parameters must be a subset of the variables defined in the generated procedure, and must be a superset of the variables that will be used following the generated call. Ideally, we would make these the smallest sets satisfying these constraints, but later compiler passes remove unnecessary parameters, so instead we go with the largest such sets. This is determined by the type/mode checker, which already keeps track of the variables known to be defined at each program point.
This code also eliminates other Wybe language features, such as transforming `SemiDet` procs into `Det` procs with a Boolean output param. Following unbranching, code will be compiled to LPVM form by the Clause module; this requires a much simplified input AST form with the following requirements:

- All procs, and all proc calls, are `Det`.
- All statements but the last in a body are either `ProcCall`s or `ForeignCall`s or `Nop`s.
- The final statement in a body can be any of these same statement types, or a `Cond` whose condition is a single `TestBool`, and whose branches are bodies satisfying these same conditions.
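These invariants can be sketched as a recursive check over a hypothetical simplified statement type. The `Stmt` constructors below are illustrative, not the compiler's actual AST.

```haskell
-- Illustrative simplified statement type: simple statements, plus a
-- conditional whose two branches are themselves bodies.
data Stmt = ProcCall | ForeignCall | Nop | Cond [Stmt] [Stmt]

-- Sketch of the invariant: a Cond may only appear as the last statement
-- of a body, and both of its branches must satisfy the same conditions.
validBody :: [Stmt] -> Bool
validBody []             = True
validBody [Cond thn els] = validBody thn && validBody els
validBody (Cond _ _ : _) = False        -- Cond before the end: invalid
validBody (_ : rest)     = validBody rest
```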
This document is assembled from the source code of the Wybe compiler by the top-level Makefile with the command `make src/README.md`. This command should be rerun whenever new modules are added to the compiler, or when the documentation is updated, and the resulting src/README.md file should be pushed to GitHub.
Haskell source files with extension `.hs` should contain a line beginning with `-- Purpose :` followed by a one-line description.

In Haskell source files, any text between marker lines of the form `-- BEGIN MAJOR DOC` and `-- END MAJOR DOC` will be included in this README.md file. One or more spaces may separate the `--` from the `BEGIN` or `END` text.

The documentation should be written in markdown notation. Of course, `--` comment markers at the beginnings of lines, and up to two following spaces, will be stripped before being interpreted as markdown.