ref: d2abc97bfa136ff2dcdf970457b1434cfd7cc712
dir: /doc/compiler.txt/
Structure of the Myrddin Compiler Aug 2012 Ori Bernstein TABLE OF CONTENTS: 1. OVERVIEW 2. PARSING 2.1. Lexing 2.2. Parsing 2.3. Type checking 3. FLATTENING 4. OPTIMIZATION 5. CODE GENERATION 6. TUTORIAL: ADDING A STATEMENT 1. OVERVIEW: The Myrddin compiler suite consists of a set of binaries, written in C, which translate Myrddin source code to the assembly most appropriate for the target platform, and subsequently invoke the native assembler on it. The linker is not invoked by the compiler, and the final output is an object file for the target platform. The compilers are named with a single character for the target platform, with a single character for the language being compiled. A table of the compilers and their names is below: Compiler Platform ------------------------- 6m x86-64 The compilation is divided into a small number of phases. The first phase is parsing. The first phase is parsing, where the source code is first tokenized, parsed, and semantically checked. The second phase is the machine dependent tree flattening. In this phase, the tree is decomposed function by function into simple operations that are relatively close to the machine. Sizes are fixed, and all loops, if statements, etc are replaced with gotos. The next phase is a machine independent optimizer, which currenty does nothing other than simply folding trees. In the final phase, the instructions are selected and the registers are allocated. So, to recap, the phases are as follows: parse Tokenize, parse and analyze the source. flatten Rewrite the complex nodes into simpe ones opt Optimize the flattened source trees gen Generate the assembly code 2. PARSING: This phase takes in a source file, and outputs a tree that is guaranteed to be valid. 2.1. Lexing: Lexing occurs in parse/tok.c. Because we desire to use this lexer from within yacc, the entry point to this code is in 'yylex()'. As required by yacc, 'yylex()' returns an integer defining the token type, and sets the 'tok' member of yylval to the token that was taken from the input stream. In addition, to allow for better error messages, the global variable 'curtok' is set to the value of 'yylval.tok'. This allows yyerror to print the last token that was seen. The tokens that are allowable are generated by Yacc from the '%token' definiitions in parse/gram.y, and are placed into the file 'parse/gram.h'. The lexer and parser code is the only code that depends on these token constants. The lexer is initalized through 'tokinit(char *file)'. This function will open the file passed in, read all the data from it in one go and set up the internal data for the tokenizer. The tokenizing is then done while the whole file is in memory, which means that this code will work poorly on files that are larger than the address space available to the compiler. If this is a problem, you deserve all the pain that is caused. The file data is stored in the three global variables 'fidx', 'fbuf', and 'fbufsz'. The actual tokenization happens through 'toknext()' and its callees, which operate on these data structures character by character, matching the values read, and shoving them into the 'Tok' data structure. 2.2. Parsing: The parser used is a traditional Yacc based parser. It is generated from the source in parse/gram.y. The starting production is 'file', which fills in a global 'file' tree node. This 'file' tree node must be initialized before yyparse() is called. 2.3. Type Checking: Type checking is done through unification of types. It's implemented in parse/infer.c 2.4. Speciaization: 2.4. Serialization: 2.5. Use Files: 3. FLATTENING: This phase is invoked repeatedly on each top level declaration that we want to generate code for. 4. OPTIMIZATION: 4.1. Constant Folding: 5. CODE GENERATION: 5.1. Instruction Selection: 5.2. Register Allocation: 6: TUTORIAL: ADDING A STATEMENT: 6.1. Stubbing in the node types: 6.2. Parsing: 6.3. Flattening: 6.4. Optimization: 6.5. Instruction Selection: