shithub: mc

Info • Files • Log • Branches
ref: d2abc97bfa136ff2dcdf970457b1434cfd7cc712
dir: /doc/compiler.txt/
                Structure of the Myrddin Compiler
                            Aug 2012
                          Ori Bernstein

TABLE OF CONTENTS:

    1. OVERVIEW
    2. PARSING
        2.1. Lexing
        2.2. Parsing
        2.3. Type checking
    3. FLATTENING
    4. OPTIMIZATION
    5. CODE GENERATION
    6. TUTORIAL: ADDING A STATEMENT

1. OVERVIEW:

    The Myrddin compiler suite consists of a set of binaries, written in C,
    which translate Myrddin source code to the assembly most appropriate for
    the target platform, and subsequently invoke the native assembler on it.
    The linker is not invoked by the compiler, and the final output is an
    object file for the target platform.

    The compilers are named with a single character for the target platform,
    with a single character for the language being compiled. A table of the
    compilers and their names is below:

        Compiler        Platform
        -------------------------
        6m              x86-64


    The compilation is divided into a small number of phases. The first phase
    is parsing. The first phase is parsing, where the source code is first
    tokenized, parsed, and semantically checked. The second phase is the
    machine dependent tree flattening. In this phase, the tree is decomposed
    function by function into simple operations that are relatively close to
    the machine. Sizes are fixed, and all loops, if statements, etc are
    replaced with gotos. The next phase is a machine independent optimizer,
    which currenty does nothing other than simply folding trees. In the final
    phase, the instructions are selected and the registers are allocated.

    So, to recap, the phases are as follows:

        parse           Tokenize, parse and analyze the source.
        flatten         Rewrite the complex nodes into simpe ones
        opt             Optimize the flattened source trees
        gen             Generate the assembly code


2. PARSING:

    This phase takes in a source file, and outputs a tree that is guaranteed
    to be valid.

    2.1. Lexing:
        
        Lexing occurs in parse/tok.c. Because we desire to use this lexer from
        within yacc, the entry point to this code is in 'yylex()'. As required
        by yacc, 'yylex()' returns an integer defining the token type, and
        sets the 'tok' member of yylval to the token that was taken from the
        input stream. In addition, to allow for better error messages, the
        global variable 'curtok' is set to the value of 'yylval.tok'. This
        allows yyerror to print the last token that was seen.

        The tokens that are allowable are generated by Yacc from the '%token'
        definiitions in parse/gram.y, and are placed into the file
        'parse/gram.h'. The lexer and parser code is the only code that
        depends on these token constants.

        The lexer is initalized through 'tokinit(char *file)'. This function
        will open the file passed in, read all the data from it in one go
        and set up the internal data for the tokenizer. The tokenizing is then
        done while the whole file is in memory, which means that this code
        will work poorly on files that are larger than the address space
        available to the compiler. If this is a problem, you deserve all the
        pain that is caused.

        The file data is stored in the three global variables 'fidx', 'fbuf',
        and 'fbufsz'. The actual tokenization happens through 'toknext()' and
        its callees, which operate on these data structures character by
        character, matching the values read, and shoving them into the 'Tok'
        data structure.

    2.2. Parsing:

        The parser used is a traditional Yacc based parser. It is generated
        from the source in parse/gram.y. The starting production is 'file',
        which fills in a global 'file' tree node. This 'file' tree node must
        be initialized before yyparse() is called.


    2.3. Type Checking:

        Type checking is done through unification of types. It's implemented
        in parse/infer.c 

    2.4. Speciaization:

    2.4. Serialization:

    2.5. Use Files:

3. FLATTENING:

    This phase is invoked repeatedly on each top level declaration that we
    want to generate code for. 

4. OPTIMIZATION:

    4.1. Constant Folding:

5. CODE GENERATION:

    5.1. Instruction Selection:
    5.2. Register Allocation:

6: TUTORIAL: ADDING A STATEMENT:

    6.1. Stubbing in the node types:

    6.2. Parsing:
    
    6.3. Flattening:

    6.4. Optimization:

    6.5. Instruction Selection: