Copyright © 2004-2006,2008 by Marcin 'Qrczak' Kowalczyk (QrczakMK@gmail.com)
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included here.
Let’s first introduce some concepts used by the generated code. All values except small integers are represented by pointers to objects, where the first word of each object is a pointer to its descriptor. Descriptors are usually allocated statically, except for objects of locally defined types. A descriptor contains:
For most types there is a single descriptor of objects of that type. Some types use several descriptors, e.g. a lazy variable has separate descriptors depending on whether its value has been computed. Functions defined at different places in the source have separate descriptors (of course all local functions defined at one place share the same descriptor).
Kokogut compiles a module at once. Compilation of a module is split into several stages. First the source of the module is read as a string.
The Lexer splits the string into a list of tokens. Each token is paired with its source location.
The Parser transforms the list of tokens into a Source syntax tree. It’s a recursive descent one, implemented by hand.
The Source represents names as symbols, without any contextual information. All nodes store a location. The tree doesn’t distinguish between expressions, patterns and definitions, and doesn’t enforce contextual invariants. Parentheses are not stored, and some basic cases of syntactic sugar are already expanded.
The Expander transforms the Source tree into an Abstract form by resolving names, interpreting compound expressions which use builtin macros, and expanding any remaining syntactic sugar. The Abstract form is designed to represent the core semantics of the language, no matter how it’s represented in the source nor how it will be realized.
The Expander produces the Abstract representation of the module and an interface to be written into an interface file. It reads interfaces of used modules when it needs them. The format of interfaces is internal to the compiler. Reading interfaces uses the same Lexer as the normal source but a different parser.
An Abstract module consists of a module name and global definitions which bind global names and use expressions and patterns.
Names, both those translated from the source and those introduced by the Expander for internal purposes, are stored as objects which many fields:
#local
, for locally named objects,
#global
, for global pointers to computed objects,
#static
, for objects allocated statically,
#fun
, for global functions,
#label
, for targets of jumps in later stages;
Each name in this sense has exactly one point of definition, i.e. they are never shadowed.
The Realizer resolves forward references by introducing special objects, called forwards, which are allocated when a name needs a forward reference, and filled just after the corresponding definition executed. Code before a definition takes the value of the defined name from the forward.
In some cases forwards can be avoided. If the definition has not executed yet, but we are sure that it will execute before the code we are currently processing (because we are inside a function body and the function will not be applied before the definition executes), we don't have to check whether the forward is filled. In addition if the name is global, we may refer to the name directly and a forward will not be needed at all.
The Realizer translates some constructs to lower level code.
Descriptors of types being defined are made explicit.
Specializations of SingletonName, RecordConstructor and
RecordFields are added. Specializations and try
expressions are translated to applications of appropriate
builtin functions.
The Realizer reports warnings about mismatched arities, about definitions or imports of names which are never used, about forward references which will always fail, about applications which will always fail due to the type of the applied object, or when the result of an expression is ignored while the kind of the expression makes this suspicious.
Some optimizations are performed, e.g. direct applications of
anonymous functions which don't refer to themselves are
translated to case
expressions.
The Real representation is quite similar to Abstract, except that forwards, descriptors etc. are explicit.
This stage is run optionally, when the -O
option is
turned on.
Optimization is split into three phases:
Some standard functions are recognized and replaced with inline code. In particular arithmetic operators and comparison operators check for small ints before falling back to generic functions.
Names with trivial definitions (defined as other names or literals) are optimized out within a module.
We will perform two optimizations which are internally similar:
This phase scans definitions and usages of functions and variables, and decides which objects will be optimized in this way.
Also, remaining substitutions are performed. The previous phase performed a substitution only for occurrences seen after it has decided about the substitution.
Optimizations of functions and variables determined in the previous phase are performed.
Functions and structure descriptors are lifted to the toplevel and allocated statically when possible.
The Planner transforms the Real module into a Plan form where the order of operations is explicit. A Plan module consists of unordered global definitions and an initialization statement. Definitions describe global and static objects, static descriptors, and functions.
A function definition, besides the function name, contains a local name of the object being applied, names of its free local variables, local names of parameters, and the statement to execute. The Plan representation uses the same names as the Abstract and Real representations.
A statement can be a proper statement or a pair of statements to be executed in order. Statements can bind local names which are used in the following statements. Execution of most statements is followed by the execution of the next statement in sequence, except a few special statements:
PS_RETURN
ends the execution of the function,
PS_CALL
calls another function and resumes
execution when it returns,
PS_TAIL_CALL
ends this function and jumps to
another function,
PS_FAIL
throws an exception,
PS_IF
and PS_CCOND
conditionally
execute one of two statements (which may either continue
after the conditional or end the execution of the function),
PS_LABEL
is a target of jumps,
PS_JUMP
jumps to a label in the same function.
PS_CCASE
contains a list of pairs of a label
and a statement.
Statements use names and expressions. An expression evaluates to a value, never fails, doesn’t use statements, and doesn’t transfer control to other functions. A Sequential expression can always be implemented as a C expression.
The same local name can be used in several functions, with an independent definition in each function.
The CCoder transforms the Plan module into CCode form, an abstract syntax of C. In particular it allocates registers and stack offsets for local names. A subsequence of statements between calls is translated into a separate C function.
C names corresponding to global Kogut names are derived from the (module,symbol,extension) triples, so the binary interface of a module changes only after predictable changes to the source. Other names include the unique identifier, but also the symbol and extension if they are available, so it’s easier to relate them to the source.
Each internal representation (Source, Abstract, Real, Plan, and CCode) comes with its pretty-printer. The printers for Source and CCode use the actual syntax of Kogut and C respectively, other printers use some ad-hoc textual representation useful for debugging.