Compiler Construction

Compiler construction is an area of computer science that deals with the theory and practice of developing programming languages and their associated compilers.

The theoretical portion is primarily concerned with syntax, grammar and semantics of programming languages. One could say that this gives this particular area of computer science a strong tie with linguistics. Some courses on compiler construction will include a simplified grammar of a spoken language that can be used to form a valid sentence for the purposes of providing students with an analogy to help them understand how grammar works for programming languages.

The practical portion covers actual implementation of compilers for languages. Students will typically end up writing the front end of a compiler for a simplistic teaching language, such as Micro.

External links


Wikibooks

From Wikibooks, the open-content textbooks collection

Contents

Preface

The purpose of this Wikibook is to provide practical advice on writing a compiler, together with some working examples of both compilers and interpreters. Some theory is unavoidable, but has been kept to a minimum. If you search the Web for "compiler construction" you will find lots of information and many different approaches. All this book can do is demonstrate a few of the many possible ways of constructing a compiler. After going through this book you should be better able to judge other methods.

This is a Wiki. Readers, please contribute: fix a spelling mistake, rephrase a sentence, expand a paragraph, add a section, write a complete chapter. When you make a non-trivial contribution, remember to add your name to the list of contributors.

This book has been written with the following goals:

The book must be

  • useful and practical - there are several good compiler books available to us, in particular the Dragon Book. However, while technologies for scanners and syntax analysers are well developed and widely known, we believe that there is a need for a book that also covers, for example, scripting languages or sophisticated dynamic compilation with byte-code.
  • non-trivial - this is related to the first goal.
  • good in its coverage of object-oriented languages.
  • rich in references.
  • kept up to date - one advantage of an online textbook is that it can be updated to cover new languages or implementation technologies.

Overview

A compiler is a computer program that translates source code into its equivalent machine-readable instructions. The process of translation is called compilation. We compile the source program to create the compiled program. The compiled program can then be run (or executed) to do what was specified in the original source program.

An interpreter is a computer program which executes a source program more or less directly, instead of translating it into another language. It may be easier to understand an interpreter than a compiler, since you don't need to worry about the target language.

For basic information, see the Wikipedia articles on compilers and on interpreters. A program which does a lot of calculation or internal data manipulation will generally run faster in compiled form than when interpreted. But a program which does a lot of input/output and very little calculation or data manipulation may well run at about the same speed in either case.

The source language is normally a high-level language, written using some mixture of English words and mathematical notation. The target language is normally a low-level language such as assembly, written with somewhat cryptic abbreviations for machine instructions. The target language may instead be machine code for some actual or virtual computer e.g. bytecode for the Java Virtual Machine.

Being themselves computer programs, both compilers and interpreters must be written in some implementation language. Up until the early 1970s, most compilers were written in assembly language for some particular type of computer. The advent of C and Pascal compilers, each written in their own source language, led to the more general use of high-level languages for writing compilers. For the first version of a compiler written in its own source language you have a bootstrapping problem. Once you get a simple version working, you can then use it to improve itself.

Note that a compiler is a non-trivial computer program; when written completely by hand, a non-optimizing compiler for a simple source language is likely to be upwards of 3000 lines long. Some compiler-writing tools are available which can reduce this size, but they add corresponding dependencies.

Links to chapters and sections will be added as and when they are written; the detailed table of contents serves as a roadmap for where this book may be going, though the map will almost certainly change as development proceeds. The first two chapters (Introduction, Dealing with Errors) are now more or less done; work is progressing on the next chapter (A Simple Interpreter).

Introduction

Completion status: Development stage: 100% (as of Feb 3, 2005)

If you are not familiar with terminology or basic concepts in compilers, you should read most of this introductory chapter since the rest of the book will assume it as background information.

Assumptions

We have to start somewhere.

Copyright and Licences

This book and any programs contained therein are copyrighted, but you have considerable freedom to copy, modify, and distribute them. The text of this book is licensed under the GNU Free Documentation License.

Requirements

What a compiler must do, and what it must not do.

Overall Structure

  • Front end
    • lexical analysis
    • syntax analysis
    • semantic analysis
    • some global/high-level optimisation
  • Middle end
    • Interprocedural analysis
    • Alias analysis
    • Loop nest optimisation
    • SSA optimisation
    • Automatic parallelisation
  • Back end
    • some local optimisation
    • register allocation
    • peep-hole optimisation
    • code generation
    • instruction scheduling

Performance

The compiler itself needs to be reasonably fast.

Describing a Programming Language

We need a precise description of the source language to start with.

Dealing with Errors

Completion status: Development stage: 100% (as of Jan 27, 2005)

How does/should a compiler deal with errors?

Even experienced programmers make mistakes, so they appreciate any help a compiler can provide in identifying the mistakes. Novice programmers may make lots of mistakes, and may not understand the programming language very well, so they need clear, precise, and jargon-free error reports.

Case Study 1 - a Simple Interpreter

Completion status: Development stage: 25% (as of Feb 15, 2005)

The purpose of this case study is to provide a gentle introduction to some compilation methods while avoiding some of the complications. The language being interpreted is a very small subset of Basic, even smaller than Tiny Basic.

Very Tiny Basic: Specification

Line Mode IDE: Implementation

Very Very Tiny Basic: an Interpreter

Lexical Analysis

Completion status: Development stage: 0% (as of Jan 31, 2005)

This is alternatively known as scanning or tokenization. Lexical analysis is roughly the equivalent of splitting ordinary text written in a natural language (e.g. English) into a sequence of words and punctuation symbols.

The purpose of lexical analysis is to convert a source program expressed as characters (arranged on lines) into a sequence of valid tokens. Note that this is not the same as a valid sequence of tokens.

What is a token

In computing, a token is a categorized block of text, usually consisting of indivisible characters known as lexemes. A lexical analyzer initially reads in lexemes and categorizes them according to function, giving them meaning. This assignment of meaning is known as tokenization. A token can look like anything: an English word, an operator symbol, anything; it just needs to be a useful part of the structured text.

Consider the following table:

lexeme      token
sum         IDENT
=           ASSIGN_OP
3           NUMBER
+           ADD_OP
2           NUMBER
;           SEMICOLON

Tokens are frequently defined by regular expressions, which are understood by a lexical analyser such as lex. The lexical analyser reads in a stream of lexemes and categorises them into tokens. This is called "tokenizing." If the lexer finds an invalid token, it will report an error.

Following tokenizing is parsing. From there, the interpreted data may be loaded into data structures, for general use, interpretation, or compiling.

Consider a text describing a calculation: "46 - number_of(cows);". The lexemes here might be: "46", "-", "number_of", "(", "cows", ")", and ";". The lexical analyser will classify the lexeme "46" as a number, "-" as an operator character, and "number_of" as a separate identifier token. Even the lexeme ";" in some languages (such as C) has a special meaning.

The whitespace lexemes are sometimes ignored later by the syntax analyser. A token doesn't need to be valid in order to be recognized as a token: "cows" may be nonsense to the language, and "number_of" may be nonsense, but they are tokens nonetheless in this example.

Simple Hand-Written Scanning
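
As a concrete illustration, here is a minimal hand-written scanner sketch in C for the "46 - number_of(cows);" example above. The token names, the fixed-size lexeme buffer and the single-character operator set are assumptions made purely for this sketch, not part of any particular language definition.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

typedef enum { IDENT, NUMBER, LPAREN, RPAREN, MINUS, SEMICOLON, END } TokenKind;

typedef struct {
    TokenKind kind;
    char lexeme[64];          /* the matched characters */
} Token;

/* Read one token starting at *p and advance *p past it. */
static Token next_token(const char **p) {
    Token t = { END, "" };
    while (isspace((unsigned char)**p)) (*p)++;   /* skip whitespace */
    const char *start = *p;
    if (**p == '\0') return t;                    /* end of input */
    if (isdigit((unsigned char)**p)) {            /* a number: one or more digits */
        while (isdigit((unsigned char)**p)) (*p)++;
        t.kind = NUMBER;
    } else if (isalpha((unsigned char)**p) || **p == '_') {   /* an identifier */
        while (isalnum((unsigned char)**p) || **p == '_') (*p)++;
        t.kind = IDENT;
    } else {                                      /* single-character tokens */
        switch (**p) {
        case '-': t.kind = MINUS;     break;
        case '(': t.kind = LPAREN;    break;
        case ')': t.kind = RPAREN;    break;
        case ';': t.kind = SEMICOLON; break;
        default:  fprintf(stderr, "unexpected character '%c'\n", **p); break;
        }
        (*p)++;
    }
    size_t n = (size_t)(*p - start);
    if (n >= sizeof t.lexeme) n = sizeof t.lexeme - 1;
    memcpy(t.lexeme, start, n);
    t.lexeme[n] = '\0';
    return t;
}

int main(void) {
    const char *p = "46 - number_of(cows);";
    for (Token t = next_token(&p); t.kind != END; t = next_token(&p))
        printf("lexeme \"%s\" -> token %d\n", t.lexeme, t.kind);
    return 0;
}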

Table-Driven Hand-Written Scanning

Scanning via a Tool - lex/flex

Wikipedia Flex lexical analyser article.

Scanning via a Tool - JavaCC

In JavaCC's terminology the scanner/lexical analyser is called the token manager, and the generated class that contains the token manager is called ParserNameTokenManager. Of course, following the usual Java file name requirements, the class is stored in a file called ParserNameTokenManager.java. The ParserName part is taken from the input file. In addition, JavaCC creates a second class, called ParserNameConstants. That second class, as the name implies, contains definitions of constants, especially token constants. JavaCC also generates a boilerplate class called Token. That one is always the same, and is the class used to represent tokens. One also gets a class called ParseError. This is an exception which is thrown if something goes wrong.

It is possible to instruct JavaCC not to generate the ParserNameTokenManager, and instead provide your own, hand-written token manager. Usually (this holds for all the tools presented in this chapter) a hand-written scanner/lexical analyser/token manager is much more efficient. So, if you figure out that your generated compiler gets too large, give the generated scanner/lexical analyser/token manager a good look. Rolling your own token manager is also handy if you need to parse binary data and feed it to the parsing layer.

Since, however, this section is about using JavaCC to generate a token manager, and not about writing one by hand, this is not discussed any further here.

Defining Tokens in the JavaCC Grammar File

A JavaCC grammar file usually starts with code which is relevant to the parser, and not the scanner. For simple grammar files it looks similar to:

options { LOOKAHEAD=1; }
PARSER_BEGIN(ParserName)
public class ParserName {
    // code to enter the parser goes here
}
PARSER_END(ParserName)

This is usually followed by the definitions for tokens. These definitions are the information we are interested in in this chapter. Four different kinds, indicated by four different keywords, are understood by JavaCC when it comes to the definition of tokens:

TOKEN
Regular expressions which specify the tokens the token manager should be able to recognize.
SPECIAL_TOKEN
SPECIAL_TOKENs are similar to TOKENs, except that the parser ignores them. This is useful to specify e.g. comments, which are supposed to be recognized but have no significance to the parser.
SKIP
Tokens (input data) which are supposed to be completely ignored by the token manager. This is commonly used to ignore whitespace. A SKIP token still breaks up other tokens. E.g. if one skips white space, has a token "else" defined, and the input is "el se", then the token is not matched.
MORE
This is used for an advanced technique, where a token is built gradually. MORE tokens are put in a buffer until the next TOKEN or SPECIAL_TOKEN matches. Then all the data (the accumulated tokens in the buffer, as well as the last TOKEN or SPECIAL_TOKEN) is returned.
One example where MORE tokens are useful is constructs where one would like to match some start string, arbitrary data, and some end string. Comments or string literals in many programming languages match this form. E.g. to match string literals delimited by ", one would not return the first " found as a token. Instead, one would accumulate more tokens until the closing " of the string literal is found; then the complete literal would be returned. See Comment Example for an example where this is used to scan comments.

Each of the above mentioned keywords can be used as often as desired. This makes it possible to group the tokens, e.g. in a list for operators and another list for keywords. All sections of the same kind are merged together as if just one section had been specified.

Every specification of a token consists of the token's symbolic name, and a regular expression. If the regular expression matches, the symbol is returned by the token manager.

Simple Example

Let's see an example:

//
// Skip whitespace in input
// if the input matches space, carriage return, new line or a tab,
// it is just ignored
//
SKIP: { " " | "\r" | "\n" | "\t" }

//
// Define the tokens representing our operators
//
TOKEN: { < PLUS: "+" > | < MINUS: "-" > | < MUL: "*" > | < DIV: "/" > }

//
// Define the tokens representing our keywords
//
TOKEN: { < IF: "if" > | < THEN: "then" > | < ELSE: "else" > }

All the above token definitions use simple regular expressions, where just constants are matched. It is recommended to study the JavaCC documentation for the full set of possible regular expressions.

When the above file is run through JavaCC, a token manager is generated which understands the above declared tokens.

Comment Example

Eliminating (ignoring) comments in a programming language is a common task for a lexical analyser. Of course, when JavaCC is used, this task is usually given to the token manager, by specifying special tokens.

Basically, a standard idiom has evolved for ignoring comments with JavaCC. It combines tokens of kind SPECIAL_TOKEN and MORE with a lexical state. Let's assume we have a programming language where comments start with a -- and either end with another -- or at the end of the line (this is the comment scheme of ASN.1 and several other languages). Then a way to construct a scanner for this would be:

//
// Start of comment. Don't return as token, instead
// shift to special lexical state.
//
SPECIAL_TOKEN: {  <"--"> : WithinComment }

//
// While inside the comment lexical state, look for the end
// of the comment (either another '--' or EOL). Don't return
// the accumulated data. Instead switch back to the normal
// lexical state.
//
<WithinComment> SPECIAL_TOKEN: { <("--" | "\n" )> : DEFAULT }

//
// While inside the comment state, accumulate all contents
//
<WithinComment> MORE: { <~[]> }

TODO: Document lexical states; Token manager declaration; Details about the regular expressions; internal (#) regular expressions.

Syntax Analysis

This is alternatively known as parsing. It is roughly the equivalent of checking that some ordinary text written in a natural language (e.g. English) is grammatically correct (without worrying about meaning).

The purpose of syntax analysis or parsing is to check that we have a valid sequence of tokens. Note that this sequence need not be meaningful; as far as syntax goes, a phrase such as "true + 3" is valid, but it doesn't make any sense in most programming languages.

Recursive Descent Parsing

A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the production rules of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.

A predictive parser is a recursive descent parser with no backup. Predictive parsing is possible only for the class of LL(k) grammars, which are the context-free grammars for which there exists some positive integer k that allows a recursive descent parser to decide which production to use by examining only the next k tokens of input. (The LL(k) grammars therefore exclude all ambiguous grammars, as well as all grammars that contain left recursion. Any context-free grammar can be transformed into an equivalent grammar that has no left recursion, but removal of left recursion does not always yield an LL(k) grammar.) A predictive parser runs in linear time.

Recursive descent with backup is a technique that determines which production to use by trying each production in turn. Recursive descent with backup is not limited to LL(k) grammars, but is not guaranteed to terminate unless the grammar is LL(k). Even when they terminate, parsers that use recursive descent with backup may require exponential time.

Although predictive parsers are widely used, programmers often prefer to create LR or LALR parsers via parser generators without transforming the grammar into LL(k) form.

Some authors define recursive descent parsers as the predictive parsers. Other authors use the term more broadly, to include backed-up recursive descent.[citation needed]

Example parser

The following EBNF grammar (for Niklaus Wirth's PL/0 programming language, from Algorithms + Data Structures = Programs) is in LL(1) form (for simplicity, ident and number are assumed to be terminals):

program = block "." .

block =
    ["const" ident "=" number {"," ident "=" number} ";"]
    ["var" ident {"," ident} ";"]
    {"procedure" ident ";" block ";"} statement .

statement =
    [ident ":=" expression
    | "call" ident
    | "begin" statement {";" statement} "end"
    | "if" condition "then" statement
    | "while" condition "do" statement
    ] .

condition =
    "odd" expression
    | expression ("="|"#"|"<"|"<="|">"|">=") expression
    .

expression = ["+"|"-"] term {("+"|"-") term} .

term = factor {("*"|"/") factor} .

factor = ident | number | "(" expression ")" .

Terminals are expressed in quotes (except for ident and number). Each nonterminal is defined by a rule in the grammar.

Notice how closely the predictive parser below mirrors the grammar above. There is a procedure for each nonterminal in the grammar. Parsing descends in a top-down manner, until the final nonterminal has been processed. The program fragment depends on a global variable, sym, which contains the next symbol from the input, and the global function getsym, which updates sym when called.

typedef enum {ident, number, lparen, rparen, times, slash, plus,
   minus, eql, neq, lss, leq, gtr, geq, callsym, beginsym, semicolon,
   endsym, ifsym, whilesym, becomes, thensym, dosym, constsym, comma,
   varsym, procsym, period, oddsym} Symbol;

Symbol sym;
void getsym(void);
void error(const char msg[]);
void expression(void);

int accept(Symbol s) {
   if (sym == s) {
       getsym();
       return 1;
   }
   return 0;
}

int expect(Symbol s) {
   if (accept(s))
       return 1;
   error("expect: unexpected symbol");
   return 0;
}

void factor(void) {
   if (accept(ident)) {
       ;
   } else if (accept(number)) {
       ;
   } else if (accept(lparen)) {
       expression();
       expect(rparen);
   } else {
       error("factor: syntax error");
       getsym();
   }
}

void term(void) {
   factor();
   while (sym == times || sym == slash) {
       getsym();
       factor();
   }
}

void expression(void) {
   if (sym == plus || sym == minus)
       getsym();
   term();
   while (sym == plus || sym == minus) {
       getsym();
       term();
   }
}

void condition(void) {
   if (accept(oddsym)) {
       expression();
   } else {
       expression();
       if (sym == eql || sym == neq || sym == lss ||
           sym == leq || sym == gtr || sym == geq) {
           getsym();
           expression();
       } else {
           error("condition: invalid operator");
           getsym();
       }
   }
}

void statement(void) {
   if (accept(ident)) {
       expect(becomes);
       expression();
   } else if (accept(callsym)) {
       expect(ident);
   } else if (accept(beginsym)) {
       do {
           statement();
       } while (accept(semicolon));
       expect(endsym);
   } else if (accept(ifsym)) {
       condition();
       expect(thensym);
       statement();
   } else if (accept(whilesym)) {
       condition();
       expect(dosym);
       statement();
   }
}

void block(void) {
   if (accept(constsym)) {
       do {
           expect(ident);
           expect(eql);
           expect(number);
       } while (accept(comma));
       expect(semicolon);
   }
   if (accept(varsym)) {
       do {
           expect(ident);
       } while (accept(comma));
       expect(semicolon);
   }
   while (accept(procsym)) {
       expect(ident);
       expect(semicolon);
       block();
       expect(semicolon);
   }
   statement();
}

void program(void) {
   getsym();
   block();
   expect(period);
}

Implementation in functional languages

Recursive descent parsers are particularly easy to implement in functional languages such as Haskell or ML.

See Functional Pearls: Monadic Parsing in Haskell

See also

  • JavaCC - a recursive descent parser generator
  • Coco/R - a recursive descent parser generator
  • ANTLR - a recursive descent parser generator
  • CodeCodex: Recursive descent parsing - sample recursive descent parser generator code for C, Java and OCaml
  • Tail Recursive Parser - a variant of the recursive descent parser

Table-Driven Parsing

Operator Precedence Parsing

Operator precedence parsing is a form of shift-reduce parsing. An operator grammar is a grammar in which no production has two nonterminal symbols in sequence on the right-hand side. An operator grammar can be parsed using shift-reduce parsing, with precedence relations between terminal symbols used to find handles. This strategy is known as operator precedence.
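
As a minimal sketch of how precedence relations between terminals can drive a shift-reduce style parse, the following C fragment evaluates arithmetic expressions using a value stack and an operator stack. It handles only integer literals and the four binary operators, omits parentheses and error handling, and all names in it are invented for this illustration.

#include <ctype.h>
#include <stdio.h>

/* Precedence of a terminal (binary operator); 0 means "not an operator". */
static int prec(char op) {
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;
    }
}

static int  vals[64]; static int nv;   /* value (reduced nonterminal) stack */
static char ops[64];  static int no;   /* operator (terminal) stack */

/* Reduce one handle: pop "value op value" and push the result. */
static void reduce(void) {
    int b = vals[--nv], a = vals[--nv];
    switch (ops[--no]) {
    case '+': vals[nv++] = a + b; break;
    case '-': vals[nv++] = a - b; break;
    case '*': vals[nv++] = a * b; break;
    case '/': vals[nv++] = a / b; break;
    }
}

static int parse(const char *p) {
    nv = no = 0;
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }
        if (isdigit((unsigned char)*p)) {           /* shift a number */
            int v = 0;
            while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
            vals[nv++] = v;
        } else {                                    /* an operator terminal */
            /* While the operator on the stack has higher (or, for left
               associativity, equal) precedence than the lookahead, the top
               of the stack is a handle: reduce before shifting. */
            while (no > 0 && prec(ops[no - 1]) >= prec(*p))
                reduce();
            ops[no++] = *p++;                       /* shift the operator */
        }
    }
    while (no > 0) reduce();                        /* reduce what remains */
    return vals[0];
}

int main(void) {
    printf("%d\n", parse("2 + 3 * 4 - 5"));   /* prints 9 */
    return 0;
}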

Parsing via a Tool - yacc/bison

Wikipedia GNU bison article.

Parsing via a Tool - JavaCC

Please refer to Lexical Analysis: Scanning via a Tool - JavaCC for information on generating a token manager.

Using the generated Token manager

In order to use the generated token manager, an instance of it has to be created. When doing so, the constructor expects a Reader as the source of the input data (other sources are also possible, see the JavaCC documentation).

Once created, the token manager object can be used to get tokens via the

Token ParserNameTokenManager.getNextToken() throws ParseError;

method.

Token objects are returned by the getNextToken() method. Such a Token object provides a field kind, which identifies the token (ParserNameConstants.java defines the corresponding constants). It also provides a field image, which simply holds the matched input data as a String.

Semantic Analysis

This is roughly the equivalent of checking that some ordinary text written in a natural language (e.g. English) actually means something (whether or not that is what it was intended to mean).

The purpose of semantic analysis is to check that we have a meaningful sequence of tokens. Note that a sequence can be meaningful without being correct; in most programming languages, the phrase "x + 1" would be considered to be a meaningful arithmetic expression. However, if the programmer really meant to write "x - 1", then it is not correct.

Name Table

Type Checking
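
As a small illustration of one kind of semantic check, the sketch below computes the type of a tiny expression tree and rejects an arithmetic operator applied to a boolean operand (the "true + 3" case mentioned earlier). The node and type representations are invented for this example and are not taken from any particular compiler.

#include <stdio.h>

typedef enum { TYPE_INT, TYPE_BOOL, TYPE_ERROR } Type;

typedef struct Expr {
    enum { LIT_INT, LIT_BOOL, ADD } kind;
    struct Expr *left, *right;     /* used only by ADD nodes */
} Expr;

/* Compute the type of an expression, reporting a semantic error when
   an arithmetic operator is applied to non-integer operands. */
static Type check(const Expr *e) {
    switch (e->kind) {
    case LIT_INT:  return TYPE_INT;
    case LIT_BOOL: return TYPE_BOOL;
    case ADD: {
        Type l = check(e->left), r = check(e->right);
        if (l == TYPE_INT && r == TYPE_INT) return TYPE_INT;
        fprintf(stderr, "semantic error: '+' needs integer operands\n");
        return TYPE_ERROR;
    }
    }
    return TYPE_ERROR;
}

int main(void) {
    Expr three = { LIT_INT,  NULL,   NULL };
    Expr truth = { LIT_BOOL, NULL,   NULL };
    Expr sum   = { ADD,      &truth, &three };   /* represents "true + 3" */
    return check(&sum) == TYPE_ERROR ? 1 : 0;    /* exits 1: error reported */
}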

Intermediate Representation

The form of the internal representation among different compilers varies widely. If the back end is called as a subroutine by the front end then the intermediate representation is likely to be some form of annotated parse tree, possibly with supplementary tables. If the back end operates as a separate program then the intermediate representation is likely to be some low-level pseudo assembly language or some register transfer language (it could be just numbers, but debugging is easier if it is human-readable).

See Compiler Construction/Stack-based representation.
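
As an example of what a low-level intermediate form might look like when the back end runs as a separate pass, here is a sketch of a three-address "quadruple" record in C. The field names and opcodes are assumptions made purely for illustration.

/* A minimal three-address ("quadruple") instruction: result := arg1 op arg2.
   Operands are kept as symbolic names here; a real compiler would more
   likely store indices into a name table. */
typedef enum { IR_ADD, IR_SUB, IR_MUL, IR_DIV, IR_COPY } IrOp;

typedef struct {
    IrOp op;
    const char *arg1;
    const char *arg2;     /* unused for IR_COPY */
    const char *result;
} IrQuad;

/* The expression (a + b) * 3 represented as two quadruples: */
static const IrQuad example[] = {
    { IR_ADD, "a",     "b", "temp1" },
    { IR_MUL, "temp1", "3", "temp2" },
};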

Low-Level Pseudo Code

Annotated Parse Tree

Optimization

An extensive list of optimizations can be found on Wikipedia in the compiler optimization article. Optimization is a very rich and complex topic, so this chapter will only attempt to introduce the basics.

Optimization within a compiler is concerned with improving the generated object code in some way while ensuring that the result of running it stays the same. Technically, a better name for this chapter might be "Improvement", since compilers only attempt to improve the operations the programmer has requested. Optimizations fall into three categories:

  • Speed; improving the runtime performance of the generated object code. This is the most common optimisation
  • Space; reducing the size of the generated object code
  • Safety; reducing the possibility of data structures becoming corrupted (for example, ensuring that an illegal array element is not written to)

Unfortunately, many "speed" optimizations make the code larger, and many "space" optimizations make the code slower -- this is known as the space-time tradeoff.

Common Optimization Algorithms

Common optimization algorithms deal with specific cases:

  1. Dead Code Elimination
  2. Common Sub-expression Elimination
  3. Copy Propagation
  4. Code Motion
  5. Induction Variable Elimination
  6. Reduction In Strength
  7. Function Chunking

Dead Code Elimination

Dead code elimination is a size optimization (although it also produces some speed improvement) that aims to remove logically impossible statements from the generated object code. Dead code is code which will never execute, regardless of input.

Consider the following program:

a = 5
if (a != 5) {
  // Some complicated calculation
}
...

It is obvious that the complicated calculation will never be performed; since the last value assigned to a before the if statement is a constant, we can calculate the result at compile time. Simple substitution of arguments produces if (5 != 5), which is false. Since the body of an if (false) statement will never execute, it is dead code, and we can rewrite the code:

a = 5
// Some statements

The algorithm has been used to identify and remove the section of dead code.
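
A minimal sketch of this idea in C is shown below: the condition of the guarded block is folded at compile time, and the body is simply not emitted when the condition can never be true. The statement structure and field names are invented for this illustration.

#include <stdio.h>
#include <stdbool.h>

/* A toy statement node: either a plain statement, or a guard of the
   form "if (const != const) { body }". */
typedef struct Stmt {
    const char *text;            /* what to emit for a plain statement */
    bool is_if;                  /* true for the guarded form */
    int lhs, rhs;                /* the two compile-time constants compared */
    struct Stmt *body;           /* statements guarded by the condition */
    struct Stmt *next;
} Stmt;

/* Emit a statement list, folding constant conditions and dropping
   branches that can never execute (dead code elimination). */
static void emit(const Stmt *s) {
    for (; s; s = s->next) {
        if (!s->is_if) {
            printf("%s\n", s->text);
        } else if (s->lhs != s->rhs) {   /* condition folded at compile time */
            emit(s->body);
        }
        /* else: condition is always false, body is dead code, emit nothing */
    }
}

int main(void) {
    Stmt calc   = { "/* some complicated calculation */", false, 0, 0, NULL, NULL };
    Stmt guard  = { NULL, true, 5, 5, &calc, NULL };   /* if (5 != 5) { ... } */
    Stmt assign = { "a = 5", false, 0, 0, NULL, &guard };
    emit(&assign);   /* prints only: a = 5 */
    return 0;
}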

Common Sub-expression Elimination

Common sub-expression elimination is a speed optimization that aims to reduce unnecessary recalculation by identifying, through code-flow, expressions (or parts of expressions) which will evaluate to the same value: the recomputation of an expression can be avoided if the expression has previously been computed and the values of the operands have not changed since the previous computation.

Consider the following program:

a = b + c
d = e + f
g = b + c

In the above example, the right-hand sides of the first and last statements are identical, and the values of the operands do not change between the two statements; thus the expression b + c is a common sub-expression.

The recomputation of the common sub-expression can be avoided by storing its value in a temporary variable which caches the result. After applying this common sub-expression elimination technique the program becomes:

t0 = b + c
a = t0
d = e + f
g = t0

Thus in the last statement the re-computation of the expression b + c is avoided.
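
One common way to find such common sub-expressions within a single basic block is local value numbering. The sketch below keeps a small table of computations already emitted, keyed on the operator and its operands, and reuses the temporary when the same combination appears again. It assumes operands are not reassigned within the block, and all names and structures are invented for this example.

#include <stdio.h>
#include <string.h>

#define MAX_EXPRS 64

/* One remembered computation: "temp := left op right". */
typedef struct {
    char op;
    char left[16], right[16];
    char temp[16];
} Avail;

static Avail table[MAX_EXPRS];
static int   n_avail;
static int   n_temps;

/* Return the temporary holding "left op right", emitting the computation
   only if it has not been seen before in this basic block. */
static const char *value_of(char op, const char *left, const char *right) {
    for (int i = 0; i < n_avail; i++)
        if (table[i].op == op &&
            strcmp(table[i].left, left) == 0 &&
            strcmp(table[i].right, right) == 0)
            return table[i].temp;              /* reuse: common sub-expression */

    Avail *a = &table[n_avail++];
    a->op = op;
    snprintf(a->left,  sizeof a->left,  "%s", left);
    snprintf(a->right, sizeof a->right, "%s", right);
    snprintf(a->temp,  sizeof a->temp,  "t%d", n_temps++);
    printf("%s = %s %c %s\n", a->temp, left, op, right);
    return a->temp;
}

int main(void) {
    /* a = b + c;  d = e + f;  g = b + c; */
    printf("a = %s\n", value_of('+', "b", "c"));
    printf("d = %s\n", value_of('+', "e", "f"));
    printf("g = %s\n", value_of('+', "b", "c"));   /* reuses t0, no recomputation */
    return 0;
}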

Copy Propagation

Code Motion


This optimization technique mainly aims to reduce[citation needed] the number of source code lines in the program. For example, consider the code below:

for (i = 0; i < n; ++i) {
    x = y + z;
    a[i] = 6 * i + x * x;
}

The calculation x = y + z and the expression x * x can be moved outside the loop since they are loop-invariant (they do not change over the iterations of the loop), so our optimized code will be something like this:

x = y + z;
t1 = x * x;
for (i = 0; i < n; ++i) {
    a[i] = 6 * i + t1;
}

This code can be optimized further. For example, strength reduction could remove the two multiplications inside the loop (6*i and a[i]).

Another example of code motion:

for(i=0;i<10;i++)
{
  a = a + c;
}

In the above code, a = a + c can be moved out of the 'for' loop: since the loop simply adds c to a ten times, the new code is

a = a + 10*c;

Induction Variable Elimination

Reduction In Strength

Function Chunking

Function chunking is a compiler optimization for improving code locality. Profiling information is used to move rarely executed code outside of the main function body. This allows for memory pages with rarely executed code to be swapped out.

Code Generation

A compiler is usually designed to output an executable program that can be run directly by the processor, without an intermediary interpreter as in the interpretation process. To get there, the compiler transforms the instructions of your specific programming language into assembler code, which is then sent to an assembler tool to create object code, which is then linked together with specific libraries to create your executable. For now, we only really need to worry about transforming the instructions into assembler code; this process is what we will deal with in this section. You will need to be well versed in the assembler language you wish to output: if you intend your programs to run on the x86 architecture, you need to be familiar with x86 assembler code, and so on.

Code generation occurs after semantic analysis is done, which gives us enough information to generate more primitive, concrete code. The general idea behind code generation is to decompose the tree structure of the syntax tree into a sequence of instructions, whatever the instruction set may be. In this stage, since we are done with semantic analysis, we are not interested in the syntactic and semantic structure of programs but in the order of execution of instructions.

It is often beneficial to produce some sort of intermediate code before generating actual machine code. The benefits of this are:

  1. it is easier to generate more abstract code, without bothering too much about things like register allocation,
  2. optimization independent of the machine architecture can be done, and
  3. compiler bugs can be spotted more easily.

However, it may be simpler for your compiler to output assembler code directly, though you then lose the above advantages. See the next section for more techniques on this.

In this chapter, we shall use the three-address format to represent intermediate code. The format is useful because it is analogous to actual machine instructions in some architectures and, more importantly, allows us to easily change the execution order of instructions, which is a huge advantage over stack-based intermediate code like the byte code of Java.

Although it is not a complex problem to reuse names after they have already been used, it is actually beneficial to allocate a new name every time one is needed, because this allows us to form a call graph and optimize easily, as we will see later. For this reason, we only briefly mention methods for reusing names. You can find more on the allocation of names in the optimization chapter.

Three-address code, as the name suggests, consists of three addresses and an opcode, which tells what kind of operation is to be done. For example, the expression (a + b) * 3 can be transformed into:

temp1 := a + b
temp2 := temp1 * 3

In the first line, temp1, a and b are addresses and + is an opcode; the second line is similar to the first. Unlike with load-store machines, it is unnecessary to load variables into registers and store them back. This shows why three-address code is easy to handle.

Choosing portable, flexible and expressive instructions is critical; not having enough instructions can complicate the generated code, since several instructions must be combined to achieve one operation, while having too many may make maintenance a more daunting task. Probably the best way to choose is to examine existing machine code: it is more straightforward to transform code that is close to the underlying machine code than code that is more abstract.

Expression

Algebraic expressions can be translated into three-address code in a very straightforward manner. This can be done recursively as follows: assume two sub-expressions left and right combined by an operation op; then the result should be:

   code for left
   code for right
   temp = place for left op place for right
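
A minimal recursive rendering of this scheme in C is sketched below: it walks a small binary expression tree and prints three-address code in the := format used earlier in this chapter, generating a fresh temporary for each interior node. The tree representation and the naming of temporaries are assumptions made only for this example.

#include <stdio.h>

/* An expression is either a leaf (a variable or constant name)
   or an operator node with two children. */
typedef struct Expr {
    const char *name;          /* non-NULL for leaves */
    char op;                   /* '+', '*', ... for interior nodes */
    struct Expr *left, *right;
} Expr;

static int temp_count;

/* Generate code for the left operand, then the right operand, then one
   instruction combining their places, exactly as in the scheme above.
   Returns the "place" (name) holding the expression's value. */
static const char *gen(const Expr *e, char *buf, size_t bufsize) {
    if (e->name) return e->name;                 /* a leaf is its own place */

    char lbuf[16], rbuf[16];
    const char *lplace = gen(e->left,  lbuf, sizeof lbuf);
    const char *rplace = gen(e->right, rbuf, sizeof rbuf);

    snprintf(buf, bufsize, "temp%d", ++temp_count);
    printf("%s := %s %c %s\n", buf, lplace, e->op, rplace);
    return buf;
}

int main(void) {
    /* (a + b) * 3 */
    Expr a     = { "a", 0, NULL, NULL };
    Expr b     = { "b", 0, NULL, NULL };
    Expr three = { "3", 0, NULL, NULL };
    Expr sum   = { NULL, '+', &a, &b };
    Expr prod  = { NULL, '*', &sum, &three };

    char buf[16];
    gen(&prod, buf, sizeof buf);   /* prints: temp1 := a + b
                                              temp2 := temp1 * 3 */
    return 0;
}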

Control Structures

The general idea behind generating code for control structures is the same as when coding directly in assembly language: an if statement, for instance, is converted into a chunk of code using conditional and unconditional jumps.
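
For instance, an if-else statement can be lowered into a test followed by jumps. The C fragment below uses goto purely to make the conditional and unconditional jumps explicit; it is an illustration of the shape of the generated code, not of any particular code generator.

#include <stdio.h>

/* Source form:            Jump form produced by the compiler:
       if (a < b)              if not (a < b) goto else_part
           x = 1;              x = 1
       else                    goto end_if
           x = 2;          else_part:
                               x = 2
                           end_if:                                  */
int main(void) {
    int a = 1, b = 2, x;

    if (!(a < b)) goto else_part;   /* conditional jump on the negated test */
    x = 1;
    goto end_if;                    /* unconditional jump over the else part */
else_part:
    x = 2;
end_if:
    printf("x = %d\n", x);          /* prints x = 1 */
    return 0;
}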

Run-time Considerations

Error Reporting

Storage Management - Stack

Storage Management - Heap

Storage Management - Garbage Collector

In computing, new tools are often waiting to emerge until developers find technology capable of supporting them. One good example is that 50 years ago John McCarthy came up with an idea to automatically reclaim the memory of objects that are no longer needed during the execution of Lisp programs. It was the origin of the garbage collection concept in computer science.

However, it took 35 years, until a company developed a new language based on this technology and computers were fast enough, to finally convince the world that garbage collection might be a good thing. That language was Java.

Hated by some, loved by others, ignored by many, garbage collection is one of the cornerstones of several modern software solutions. Web browsers, office suites, games and anything written in most modern computer languages such as Java, C#, Python, PHP, etc. rely on it. Even programmers using older languages (C, C++) are often forced to implement a similar mechanism in their software.

Although one can argue that any program can be written without automatic memory management, garbage collection can reduce the time needed to develop software: engineers can work at a higher level of abstraction in their designs and save time solving memory-related bugs.

Furthermore, there are moments when it is not so easy to decide when to free an object. It can happen in graphical applications, distributed programs or any program with several threads accessing the same object.

Input/Output

Development Environment

Testing

Glossary

Completion status: Development stage: 75% (as of Feb 15, 2005)

This glossary is intended to provide definitions of words or phrases which relate particularly to compiling. It is not intended to provide definitions of general computing jargon, for which a reference to Wikipedia may be more appropriate.

References

  • Aho, Alfred & Sethi, Ravi & Ullman, Jeffrey. Compilers: Principles, Techniques, and Tools ISBN 0201100886 The Classic Dragon book.
  • Appel, Andrew Modern Compiler Implementation in C/Java/ML (respectively ISBN 0-521-58390-X,ISBN 0-521-58388-8, ISBN 0-521-58274-1) is a set of cleanly written texts on compiler design, studied from various different methodological perspectives.
  • Brown, P.J. Writing Interactive Compilers and Interpreters ISBN 047127609X Useful practical advice, not much theory.
  • Fischer, Charles & LeBlanc, Richard. Crafting A Compiler ISBN 0805332014 Uses an Ada-like pseudo-code.
  • Holub, Allen Compiler Design in C ISBN 0131550454 Extensive examples in "C".
  • Hunter, R. The Design and Construction of Compilers ISBN 0471280542 Several chapters on theory of syntax analysis, plus discussion of parsing difficulties caused by features of various source languages.
  • Pemberton, S. & Daniels, M.C. Pascal Implementation. The P4 Compiler ISBN 0853123586 (Discussion) and ISBN 085312437X (Compiler listing) Complete listing and readable commentary for a Pascal compiler written in Pascal.
  • Weinberg, G.M. The Psychology of Computer Programming: Silver Anniversary Edition ISBN 0932633420 Interesting insights and anecdotes.
  • Wirth, Niklaus Compiler Construction ISBN 0201403536 From the inventor of Pascal, Modula-2 and Oberon-2, examples in Oberon.

External Links

Wikipedia

Wikibooks Resources

other wiki

World Wide Web

Contributors

  • Murray Langton Complete restructure of apparently abandoned book.
  • TakuyaMurata; I occasionally add some content. I believe that while it likely takes years to complete the book, we can still produce pieces of work useful for those interested in compilers.
  • psanand; I'd like to contribute something to this whenever I get time.

Case studies

Since this book is about compilers, the following case studies do not go into much detail about syntax or language design, but concentrate on implementation techniques. If you are familiar with the languages below, it may help you understand compilers by seeing how familiar concepts (like dynamic binding) are implemented.

Furthermore, we don't consider the practical implications of language design decisions. Some tricky situations, like problems of reference dependency or the precedence of operators, may not occur most of the time, but they are often of interest to language implementors.

Appendices

Appendix A: Brief summary of C++

A brief explanation of the C++ constructions used in this book.

Appendix:

The following is the Java code used in the book.

Appendix Z: Previous partial book

This chapter is included for now so as not to lose any previous work. It will gradually vanish as the earlier work is incorporated into the main chapters.

The emphasis of this book is on techniques that can be used in practice for analysing popular languages. In other words, the lexical analysis and parsing techniques are covered only briefly.

We hope this book helps you create a compiler or any tool employing compiler technology.

