Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included here.

Kokogut Internals

Let’s first introduce some concepts used by the generated code. All values except small integers are represented by pointers to objects, where the first word of each object is a pointer to its descriptor. Descriptors are usually allocated statically, except for objects of locally defined types. A descriptor contains:

its own descriptor, so the garbage collector can treat descriptors as objects;
the pointer to a C function which moves the object to a new address and marks its fields which point to other objects;
an optional pointer to a finalization function;
the type of the object;
entries, pointers to code which implement application of the object; there are fast entries for a known number of arguments between 0 and 5, and a list entry which takes arguments in a list.

For most types there is a single descriptor of objects of that type. Some types use several descriptors, e.g. a lazy variable has separate descriptors depending on whether its value has been computed. Functions defined at different places in the source have separate descriptors (of course all local functions defined at one place share the same descriptor).

Kokogut compiles a module at once. Compilation of a module is split into several stages. First the source of the module is read as a string.

Lexer: String → Tokens

The Lexer splits the string into a list of tokens. Each token is paired with its source location.

Parser: Tokens → Source

The Parser transforms the list of tokens into a Source syntax tree. It’s a recursive descent one, implemented by hand.

The Source represents names as symbols, without any contextual information. All nodes store a location. The tree doesn’t distinguish between expressions, patterns and definitions, and doesn’t enforce contextual invariants. Parentheses are not stored, and some basic cases of syntactic sugar are already expanded.

Expander: Source → Abstract

The Expander transforms the Source tree into an Abstract form by resolving names, interpreting compound expressions which use builtin macros, and expanding any remaining syntactic sugar. The Abstract form is designed to represent the core semantics of the language, no matter how it’s represented in the source nor how it will be realized.

The Expander produces the Abstract representation of the module and an interface to be written into an interface file. It reads interfaces of used modules when it needs them. The format of interfaces is internal to the compiler. Reading interfaces uses the same Lexer as the normal source but a different parser.

An Abstract module consists of a module name and global definitions which bind global names and use expressions and patterns.

Names, both those translated from the source and those introduced by the Expander for internal purposes, are stored as objects which many fields:

unique identifier (serial number) for hashing;
module name in case it’s a global name;
symbol, taken from the source if there is a corresponding source name;
name extension, to distinguish several internal names supporting the given source name, such that (module,symbol,extension) uniquely identifies every global name;
storage, which is one of:
- #local, for locally named objects,
- #global, for global pointers to computed objects,
- #static, for objects allocated statically,
- #fun, for global functions,
- #label, for targets of jumps in later stages;
whether a global name is public, available from other modules;
whether the name has been used in addition to being defined;
some type information if we statically know the type;
the descriptor name if we statically know the descriptor;
entry names if we statically know the entries;
whether the object is known to not need the self pointer on application.

Each name in this sense has exactly one point of definition, i.e. they are never shadowed.

Realizer: Abstract → Real

The Realizer resolves forward references by introducing special objects, called forwards, which are allocated when a name needs a forward reference, and filled just after the corresponding definition executed. Code before a definition takes the value of the defined name from the forward.

In some cases forwards can be avoided. If the definition has not executed yet, but we are sure that it will execute before the code we are currently processing (because we are inside a function body and the function will not be applied before the definition executes), we don't have to check whether the forward is filled. In addition if the name is global, we may refer to the name directly and a forward will not be needed at all.

The Realizer translates some constructs to lower level code. Descriptors of types being defined are made explicit. Specializations of SingletonName, RecordConstructor and RecordFields are added. Specializations and try expressions are translated to applications of appropriate builtin functions.

The Realizer reports warnings about mismatched arities, about definitions or imports of names which are never used, about forward references which will always fail, about applications which will always fail due to the type of the applied object, or when the result of an expression is ignored while the kind of the expression makes this suspicious.

Some optimizations are performed, e.g. direct applications of anonymous functions which don't refer to themselves are translated to case expressions.

The Real representation is quite similar to Abstract, except that forwards, descriptors etc. are explicit.

Optimizer: Real → Real

This stage is run optionally, when the -O option is turned on.

Optimization is split into three phases:

Some standard functions are recognized and replaced with inline code. In particular arithmetic operators and comparison operators check for small ints before falling back to generic functions.

Names with trivial definitions (defined as other names or literals) are optimized out within a module.
We will perform two optimizations which are internally similar:
- lifting some local functions to the toplevel, transforming their free variables into additional parameters
- unboxing of some plain variables
This phase scans definitions and usages of functions and variables, and decides which objects will be optimized in this way.

Also, remaining substitutions are performed. The previous phase performed a substitution only for occurrences seen after it has decided about the substitution.
Optimizations of functions and variables determined in the previous phase are performed.

Functions and structure descriptors are lifted to the toplevel and allocated statically when possible.

Planner: Real → Plan

The Planner transforms the Real module into a Plan form where the order of operations is explicit. A Plan module consists of unordered global definitions and an initialization statement. Definitions describe global and static objects, static descriptors, and functions.

A function definition, besides the function name, contains a local name of the object being applied, names of its free local variables, local names of parameters, and the statement to execute. The Plan representation uses the same names as the Abstract and Real representations.

A statement can be a proper statement or a pair of statements to be executed in order. Statements can bind local names which are used in the following statements. Execution of most statements is followed by the execution of the next statement in sequence, except a few special statements:

PS_RETURN ends the execution of the function,
PS_CALL calls another function and resumes execution when it returns,
PS_TAIL_CALL ends this function and jumps to another function,
PS_FAIL throws an exception,
PS_IF and PS_CCOND conditionally execute one of two statements (which may either continue after the conditional or end the execution of the function),
PS_LABEL is a target of jumps,
PS_JUMP jumps to a label in the same function.
PS_CCASE contains a list of pairs of a label and a statement.

Statements use names and expressions. An expression evaluates to a value, never fails, doesn’t use statements, and doesn’t transfer control to other functions. A Sequential expression can always be implemented as a C expression.

The same local name can be used in several functions, with an independent definition in each function.

CCoder: Plan → CCode

The CCoder transforms the Plan module into CCode form, an abstract syntax of C. In particular it allocates registers and stack offsets for local names. A subsequence of statements between calls is translated into a separate C function.

C names corresponding to global Kogut names are derived from the (module,symbol,extension) triples, so the binary interface of a module changes only after predictable changes to the source. Other names include the unique identifier, but also the symbol and extension if they are available, so it’s easier to relate them to the source.

Printing

Each internal representation (Source, Abstract, Real, Plan, and CCode) comes with its pretty-printer. The printers for Source and CCode use the actual syntax of Kogut and C respectively, other printers use some ad-hoc textual representation useful for debugging.