Friday, October 29, 2010

C# Compiler Overview

Programs are among the most complicated engineering artifacts people build. A compiler is a special type of program that validates, optimizes and transforms programs into executable code. These articles dive into our understanding of compilers.

C# Compiler

When you compile a C# program, the program is translated into an abstract binary format called intermediate language (IL). This format is not directly executable, so it must be translated again, into machine code, when the program runs. This short overview describes some steps of this process to aid conceptualization of the C# language compiler.

Compiler phases

First, compiler theory divides the compilation of programs into several phases. The program is read from the text file, and sequences of characters are grouped into lexemes. The term lexeme refers to the textual representation of a token. The term token refers to a structure that combines a lexeme with information about that lexeme, such as its kind. After the tokens are determined, the compiler parses them and builds internal data structures called intermediate representations, which it can analyze and transform so that the resulting program is more efficient.
Note: Lexical refers to the text representation of programs.
      Lexeme refers to the text representation of keywords and more.
      Tokens combine lexemes and symbolic information about lexemes.
      The symbol table stores information about identifiers, such as their types and scopes.
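To make the lexeme and token distinction concrete, here is a minimal sketch of a lexer for a toy language. The names TokenKind, Token and ToyLexer are hypothetical and chosen for illustration; this is not how the C# compiler itself is implemented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token kinds for a toy language; not the C# compiler's own set.
enum TokenKind { Identifier, Number, Symbol }

// A token pairs a lexeme (the raw text) with information about it (its kind).
struct Token
{
    public TokenKind Kind;
    public string Lexeme;
    public Token(TokenKind kind, string lexeme) { Kind = kind; Lexeme = lexeme; }
}

static class ToyLexer
{
    public static List<Token> Tokenize(string text)
    {
        var tokens = new List<Token>();
        int i = 0;
        while (i < text.Length)
        {
            char c = text[i];
            if (char.IsWhiteSpace(c)) { i++; continue; } // whitespace separates lexemes

            int start = i;
            TokenKind kind;
            if (char.IsLetter(c) || c == '_')
            {
                while (i < text.Length && (char.IsLetterOrDigit(text[i]) || text[i] == '_')) i++;
                kind = TokenKind.Identifier;
            }
            else if (char.IsDigit(c))
            {
                while (i < text.Length && char.IsDigit(text[i])) i++;
                kind = TokenKind.Number;
            }
            else
            {
                i++; // any other single character is treated as a symbol
                kind = TokenKind.Symbol;
            }
            tokens.Add(new Token(kind, text.Substring(start, i - start)));
        }
        return tokens;
    }
}

static class Program
{
    static void Main()
    {
        foreach (Token t in ToyLexer.Tokenize("count = count + 42;"))
            Console.WriteLine(t.Kind + " '" + t.Lexeme + "'");
    }
}
```

A real lexer would also recognize keywords, string literals, comments and multi-character operators, and it would record source positions for error messages.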

C# compiler phases

Here, we apply the compiler phases at a high level to the C# compiler typically used with the .NET Framework. When you compile a C# program in Visual Studio, the csc.exe program is invoked on the program text. According to the rules of the language specification, all the compilation units are combined in a preliminary step so that every part of the program is discovered. The C# compiler tries to detect as many errors in your program as it can before the program runs, and these are termed compile-time errors.
Note: Programs are analyzed at compile-time and runtime.
      Compile-time analysis is static, meaning not dynamic.
      Runtime analysis is dynamic.
      Static analysis does not impact performance of execution.
      Runtime analysis can slow programs down.
Compile-time errors. For example, the C# compiler uses a process called definite assignment analysis to prove that local variables are not read before they are assigned. This step alone removes a whole class of bugs, and the security problems that can follow from reading uninitialized memory; definite assignment analysis raises program quality because programs are checked more thoroughly at compile-time.
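As a small illustration of definite assignment, consider the sketch below. The class and method names are hypothetical; the point is only that the compiler accepts the read of result because every path assigns it first.

```csharp
using System;

static class DefiniteAssignmentDemo
{
    public static int Classify(int value)
    {
        int result; // declared, but not yet assigned

        if (value > 0)
        {
            result = 1;
        }
        else
        {
            result = -1;
        }

        // Every path above assigns result, so the compiler accepts this read.
        // Without the else-branch there would be a path on which result is
        // unassigned, and the compiler would reject the read with
        // error CS0165: Use of unassigned local variable 'result'.
        return result;
    }

    static void Main()
    {
        Console.WriteLine(Classify(5));  // 1
        Console.WriteLine(Classify(-5)); // -1
    }
}
```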
Type inference. The C# compiler also applies inferential logic at compile-time, and because this work is not repeated at runtime, it has no penalty at execution. For example, the compiler infers the type of a var local from its initializer, and it uses overload resolution to pick the best overloaded method based on the number and types of the arguments at the call site.
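The following sketch shows both kinds of compile-time decision. The Describe overloads are hypothetical; each returns the name of the parameter type the compiler selected, so the choice is visible in the output.

```csharp
using System;

static class OverloadDemo
{
    public static string Describe(int value)    { return "int"; }
    public static string Describe(long value)   { return "long"; }
    public static string Describe(double value) { return "double"; }
    public static string Describe(object value) { return "object"; }

    static void Main()
    {
        short small = 7;
        var inferred = 7L; // type inferred as long from the literal suffix

        Console.WriteLine(Describe(small));    // int: the closest widening from short
        Console.WriteLine(Describe(inferred)); // long: exact match
        Console.WriteLine(Describe(7.5f));     // double: float widens to double
        Console.WriteLine(Describe("seven"));  // object: the only applicable overload
    }
}
```

All of these choices are fixed in the compiled program; nothing is searched for when the calls execute.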
Numeric promotion. At the C# compilation stage, certain number transformations are also applied. Numbers are "promoted" to larger representations so that the predefined operators can be applied to them; for example, adding two byte values produces an int. Also, some conversions that are not present in the program text are inserted by the C# compiler. This is done to enable shorter and clearer high-level source code, and to ensure an accurate lower-level implementation.
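A short sketch makes the promotions visible by printing the runtime type of each result. The class and method names here are hypothetical.

```csharp
using System;

static class PromotionDemo
{
    // Returns the type the compiler gave to the sum of two bytes.
    public static Type ByteSumType()
    {
        byte a = 200;
        byte b = 100;
        var sum = a + b; // both operands are promoted to int
        return sum.GetType();
    }

    static void Main()
    {
        byte a = 200;
        byte b = 100;
        int i = 1;
        long l = 2;
        float f = 1.5f;

        Console.WriteLine(ByteSumType());      // System.Int32
        Console.WriteLine(a + b);              // 300, which would not fit in a byte
        Console.WriteLine((i + l).GetType());  // System.Int64
        Console.WriteLine((i + f).GetType());  // System.Single

        // byte c = a + b;        // does not compile: no implicit int-to-byte conversion
        byte c = (byte)(a + b);   // an explicit cast is required to narrow the result
        Console.WriteLine(c);     // 44, because 300 wraps around in the default unchecked context
    }
}
```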
If-statements and loops. The C# compiler also rewrites conditional statements and loops, both of which are implemented with branch (jump) instructions. For this reason, your code will often be compiled to branch instructions that do not mirror your source text exactly. For example, a while-loop and an equivalent for-loop typically compile to the same pattern of branches. The compiler works on a tree of nodes rather than on text, which lets it transform loops and nested expressions into efficient representations.
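To check this for yourself, you could compile the two hypothetical methods below and compare their intermediate language in a disassembler such as ildasm. We would expect the loop bodies to use very similar branch instructions, although the exact output depends on the compiler version and settings.

```csharp
using System;

static class LoopDemo
{
    public static int SumWithWhile(int n)
    {
        int total = 0;
        int i = 0;
        while (i < n)
        {
            total += i;
            i++;
        }
        return total;
    }

    public static int SumWithFor(int n)
    {
        int total = 0;
        for (int i = 0; i < n; i++)
        {
            total += i;
        }
        return total;
    }

    static void Main()
    {
        Console.WriteLine(SumWithWhile(5)); // 10
        Console.WriteLine(SumWithFor(5));   // 10
    }
}
```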
Constant folding. In compiler theory, some levels of indirection can be eliminated by evaluating constant expressions at compile-time and injecting the resulting values directly into the representation of the program. This is termed constant folding, and the author's benchmarks have shown that constant values do provide performance benefits over variables. If you look at your compiled program, the value of each const is embedded directly in the methods where it was referenced.
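Here is a minimal sketch of constant folding. The names are hypothetical; the expression for SecondsPerDay is evaluated by the compiler, so the compiled method works with the value 86400 directly.

```csharp
using System;

static class ConstantDemo
{
    const int SecondsPerMinute = 60;

    // A constant expression: the compiler computes 60 * 60 * 24 itself.
    const int SecondsPerDay = SecondsPerMinute * 60 * 24;

    public static int DaysToSeconds(int days)
    {
        // The folded value is embedded here; no field is read at runtime.
        return days * SecondsPerDay;
    }

    static void Main()
    {
        Console.WriteLine(SecondsPerDay);    // 86400
        Console.WriteLine(DaysToSeconds(2)); // 172800
    }
}
```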
String literals. In the C# compiler, string literals are pooled together in a string heap in the compiled program, and each use of a literal is compiled to an instruction that refers to the pooled entry. Therefore, the literals themselves are not located where you use them in methods; each literal is transformed into a reference to the pooled data, and identical literals share one entry.
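One observable effect of pooling is that two identical literals in the same assembly refer to the same string object, while a string built at runtime does not. The sketch below uses hypothetical method names and assumes the standard runtime behavior for literals.

```csharp
using System;

static class LiteralDemo
{
    public static bool SameLiteral()
    {
        string a = "compiler";
        string b = "compiler";
        // Both variables refer to the single pooled literal.
        return object.ReferenceEquals(a, b);
    }

    public static bool BuiltAtRuntime()
    {
        string a = "compiler";
        // new string(...) allocates a fresh object with equal contents.
        string b = new string("compiler".ToCharArray());
        return object.ReferenceEquals(a, b);
    }

    static void Main()
    {
        Console.WriteLine(SameLiteral());    // True
        Console.WriteLine(BuiltAtRuntime()); // False
    }
}
```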

Metadata

In the .NET Framework, your C# program is compiled into an assembly whose structure is described by a relational database called the program metadata. This is also considered an abstract binary representation. The metadata is an efficient encoding of the types, members and references in the program text, and it is generated by the C# compiler alongside the intermediate language. The metadata is stored on disk as part of the assembly, and it does not contain the comments in your source code.
Relational database. The metadata is divided into many different tables, and these tables contain records that point to different tables and different records. It is not typically important to study the metadata format unless you are writing a compiler.
Book: See "Expert .NET 2.0 IL Assembler" by Serge Lidin.
      This book explains the metadata and assembly format for .NET.
      It is excellent, and the author writes with humor.
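You can glimpse the table structure from ordinary code through metadata tokens, which reflection exposes. In a token, the top byte identifies a metadata table and the lower three bytes give the row number. The sketch below uses hypothetical names (TokenDemo, TableOf, RowOf) and prints the table number for a type, a field and a method.

```csharp
using System;
using System.Reflection;

static class TokenDemo
{
    public static int Counter;

    public static void Touch() { Counter++; }

    // The top byte of a metadata token identifies the table.
    public static int TableOf(int metadataToken) { return (metadataToken >> 24) & 0xFF; }

    // The low three bytes give the row number within that table.
    public static int RowOf(int metadataToken) { return metadataToken & 0x00FFFFFF; }

    static void Main()
    {
        Type type = typeof(TokenDemo);
        FieldInfo field = type.GetField("Counter");
        MethodInfo method = type.GetMethod("Touch");

        // TypeDef is table 0x02, Field is table 0x04, MethodDef is table 0x06.
        Console.WriteLine("type:   table 0x{0:X2}, row {1}", TableOf(type.MetadataToken), RowOf(type.MetadataToken));
        Console.WriteLine("field:  table 0x{0:X2}, row {1}", TableOf(field.MetadataToken), RowOf(field.MetadataToken));
        Console.WriteLine("method: table 0x{0:X2}, row {1}", TableOf(method.MetadataToken), RowOf(method.MetadataToken));
    }
}
```

The row numbers depend on what else the assembly contains, so they will differ from one build to another.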
Method representation. Structured programming, which represents logic as procedure calls, uses methods extensively. In the metadata, method bodies do not store the names of their local variables; this information is dropped at compile-time and survives only in separate debugging symbols, if those are generated. Parameter names are retained. Omitting the local names eliminates unneeded information and reduces disk usage.
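Reflection shows this difference directly. In the sketch below (hypothetical names), parameters can be listed by name, while local variables are exposed only as an index and a type. How many locals appear depends on the compiler's optimization settings, so no particular output is promised for them.

```csharp
using System;
using System.Reflection;

static class MethodInfoDemo
{
    public static int Area(int width, int height)
    {
        int product = width * height; // the name "product" is not kept in the metadata
        return product;
    }

    // Returns the parameter names recorded in the metadata for Area.
    public static string[] ParameterNames()
    {
        ParameterInfo[] parameters = typeof(MethodInfoDemo).GetMethod("Area").GetParameters();
        string[] names = new string[parameters.Length];
        for (int i = 0; i < parameters.Length; i++)
        {
            names[i] = parameters[i].Name;
        }
        return names;
    }

    static void Main()
    {
        foreach (string name in ParameterNames())
            Console.WriteLine("parameter: " + name);

        // LocalVariableInfo offers LocalIndex and LocalType, but no name.
        MethodBody body = typeof(MethodInfoDemo).GetMethod("Area").GetMethodBody();
        foreach (LocalVariableInfo local in body.LocalVariables)
            Console.WriteLine("local #" + local.LocalIndex + ": " + local.LocalType);
    }
}
```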

Runtime

At this point, we have taken a high-level C# source text and translated it into intermediate language and a relational database called metadata. When you execute the compiled program, the Common Language Runtime for the .NET Framework is started, which incurs some startup overhead. Then, as the program runs, each method is read from the assembly the first time it is called, and its intermediate language code is translated into machine-level code.
Just-in-time compilation. The just-in-time (JIT) compiler in the Common Language Runtime (CLR) applies several optimizations to the methods. It will sometimes insert the body of a small method at its call site in an optimization called function inlining. The system will also rearrange instructions to improve efficiency and eliminate unnecessary indirections. This matters because each pointer dereference costs time; by removing a dereference, fewer instructions are needed, and fewer clock cycles are then required at runtime.
Note: The JIT system does cause a slowdown the first time a method is called.
      Therefore, it is most beneficial on long-running programs.
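A rough way to observe this cost is to time the first and second calls to the same method. The sketch below uses hypothetical names and the Stopwatch class; the numbers vary by machine and runtime, and a single measurement like this is only an indication, not a benchmark.

```csharp
using System;
using System.Diagnostics;

static class JitDemo
{
    public static long SumOfSquares(int n)
    {
        long total = 0;
        for (int i = 0; i < n; i++)
        {
            total += (long)i * i;
        }
        return total;
    }

    // Times one call to SumOfSquares and returns the elapsed Stopwatch ticks.
    public static long TimeCall(int n, out long result)
    {
        Stopwatch watch = Stopwatch.StartNew();
        result = SumOfSquares(n);
        watch.Stop();
        return watch.ElapsedTicks;
    }

    static void Main()
    {
        long firstResult;
        long secondResult;

        // The first call usually includes the time to JIT-compile SumOfSquares.
        long first = TimeCall(1000, out firstResult);
        // The second call reuses the machine code that was already generated.
        long second = TimeCall(1000, out secondResult);

        Console.WriteLine("first call:  " + first + " ticks");
        Console.WriteLine("second call: " + second + " ticks");
    }
}
```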

Learn more

In this part, we point to resources for learning these principles in more depth. The book Compilers: Principles, Techniques, and Tools provides overviews and theoretical models from the preceding thirty years of compiler theory, and this theory is the foundation of the C# compiler. For more specific information about the .NET Framework, the book Expert .NET 2.0 IL Assembler provides copious help. The book Structure and Interpretation of Computer Programs takes you through each step of program interpretation, and by the end you will have built a compiler.
Tip: Compilers are used to develop processors and chips.
     Special-purpose compilers enable amazing simulations.
     Compiler theory has driven CPU development and its direction.

Summary

In this article, we explored the C# compiler in particular and related it to compiler theory in general. This article gives an overview of the elaborate series of phases, at compile-time and at runtime, that each C# program is taken through. Compiler theory sits at the core of nearly all modern software.

.NET Framework Content

The C# language and the VB.NET language are most often executed with the .NET Framework. By understanding the .NET Framework's implementations, such as the intermediate language, we can write better high-level programs. These articles reveal aspects of the .NET Framework.

Intermediate language

The intermediate language (IL) for the C# language and the VB.NET language is specified by a standard. In these articles, we look at the IL for various types such as arrays and the singleton pattern; we also cover the intermediate language in general.

Callvirt instruction

The callvirt instruction is part of the intermediate language. It is used in some unexpected places in the C# language, though, as this article demonstrates and elaborates upon.

Performance

I think performance is important, and one of the best things about the developers behind the .NET Framework is that they do too. Microsoft's teams are very careful to maintain good performance or improve performance where it may be lacking. In this article, I benchmarked the .NET Framework 4.0 and found an overall improvement over .NET 3.5 SP1.


 

 
