The aim of this repository is to keep the applications in the books which I've been studying on about the field of compilers while I trying to improve myself. Since this repository is related to the compilers area, the applications here will be made with the LLVM library.
[WARNING] Most of the information in this repository is a summary of what I have read and understood myself. The information here may not be completely accurate. For more detailed information on compilers, please study the Aho's Dragon book and Kai Nacke's Learn LLVM 12 book.
Before explaining the steps of a compiler, it is better to first explain what a compiler is and what it does.
A Compiler is a program that converts the code in the source file (files such as .java, .cpp, .cs, etc.) into machine code that the computer can understand.
A compiler consists of 3 steps.
These steps are Front-End, Optimizer and Back-End and these steps consist of sub-steps within themselves.
You can think of the code in the source file(source files such as .c, .java, .cs, etc.) as an ordinary text in a text document (like .txt file) at this stage.
The Lexer reads one by one characters in the source file until it detects the characters like '+' ,'=', ';' etc. expression and creates a special object called Token with the characters it reads.
An example of code and its split into Token objects are in below.
# Token Code Example;
x1 = a3 + 5;
x'2;
# Tokens of the code example;
x1 is a Token (identifier)
= is a Token (assignment operator)
a3 is a Token (identifier)
+ is a Token (+)
5 is a Token (number)
; is a Token (end of statement)
x'2 is an Invalid Token
; is a Token (end of statement)
Note: a3 + 5 also called an EXPRESSION
The Parser makes statements with tokens just like people make sentences with words. Tokens obtained in the previous section (Lexer sect.) are used to establish the statements. The statements are then converted into a special data structure called the Abstract Syntax Tree (A.S.T.).
The aim of Parser is to understand whether the grammar created is correct or not and to debug in A.S.T..
Grammar is the rules given to programmer for compiler have a valid statement.
The code that each token is used correctly but the grammar is wrong in below.
# Invalid Grammar Example;
num = a*b +/;
Grammatically correct code is in below.
# Valid Grammar Example;
num = a*b + y/z;
Grammatically correct code is shown as an A.S.T. graph in below.
graph TD;
linkStyle default interpolate basis;
assign --> num;
assign --> +;
+ --> *;
+ --> /;
* --> a;
* --> b;
/ --> y;
/ --> z;
The Semantic Analyzer checks whether the written code is meaningful even if the code written in the source file is grammatically correct.
There are examples where a sentence or code may be meaningless even if it is grammatically correct in the below.
Sentence example;
His home is going to he.
Explanation: This sentence is grammatically correct but meaningless.
-----------------
Code example;
5 = x;
Explanation: A value of 5 is an real value and x is a local value(represents the memory address).
A real value can be assigned to a local variable, but the opposite is meaningless.
Compilers only must generate meaningful code.
Now that it's understood that a code can be meaningless even if it is grammatically correct, let's take a real-life example.
Animal a1, a2;
Person p;
Computer c;
int x;
x = p - c;
When we write code in a source file in real life, Symbol Table is called for each of our declarations and each variables of the declarations.
As it can be understood, declared variables in our source code in Symbol Table, their scope fields etc. information is available.
The Compiler can understand that the above example contains meaningless codes by checking the Symbol Table in the Semantic Analyzer during compilation.
The Optimizer tries to optimize the A.S.T. data structure obtained at the end of the front-end stage of the compiler independently of the target machine(architectures such as x86, arm, gpu, etc.).
Our aim is to hope to make the code faster in optimization.
There are different methods and approaches for these optimization processes. The most common of these is the Command Subexpression Elimination(CSE). To summarize the CSE approach roughly and to give a simple example, if there are more than one repetitive operations, those operations are assigned to a variable. When those operations are needed, they are used from the variables they are assigned.
Optimizer and Compiler solutions are a kind of heuristic algorithms.
Unlike the optimizer stage the compiler tries to optimize our code according to the specifications of our target machine(x86, arm, etc. architecture) at this stage.
The compiler converts the AST data structure obtained from the The Optimizer part to the Abstract Assembly model in order to optimize our code in the Back-End part.
Abstract Syntax Tree = High-Level Intermediate Representation
Abstract Assembly = Low-Level Intermediate Representation
An example for " num = a*b + y/z " created as an abstract assembly model in below.
# Abstract Assembly Model;
load a -> r1
load b -> r2
mult r1, r2 -> r3
load y -> r4
load z -> r5
div r4, r5 -> r6
add r3, r6 -> r7
store r7 -> num
There are different processors that can perform different operations in different ways at nowadays.
While some processors take the value from memory one by one and load them into registers one by one (like the abstract assembly model in above), some other processors can directly access and process the data in memory with operations such as memory operands.
For example; compiler can select
mult[a], [b] -> r1 instead of
load a -> r1
load b -> r2
mult r1, r2 -> r3
Compiler tries to choose the most performing option from the available options for our target machine in Instruction Selection.
Compiler reorganizes the instructions so that our program can run most efficiently, taking into account the processing times of the instructions, the number of registers of the target machine, etc. in Instruction Scheduling.
As an example, we can think that the abstract assembly model in the instruction selection part is reorganized as follows to work more efficiently.
# Reorganized Abstract Assembly Model;
load a -> r1
load b -> r2
load y -> r4
load z -> r5
mult r1, r2 -> r3
div r4, r5 -> r6
add r3, r6 -> r7
store r7 -> num
Compiler acts as if it has an infinite number of registers until the register allocation stage and creates the abstract assembly model with these virtual registers. The r1, r2, etc. shown in the examples above they are all virtual registers and the registers resulting from the operation with two virtual registers are temporary virtual registers.
For example; mult r1, r2 -> r3
r3 is temporary virtual register in here.
Compiler takes the virtual registers in the abstract assembly model and maps them with the real registers in the target machine in register allocation part.
As you can imagine, the number of registers of a target machine is not infinite. Compiler map the virtual registers with the highest priority for the mapping to the real registers, and the virtual registers that cannot be mapped to the real registers are stored in memory again in this stage.
In each section folder, there is a sample project folder and a Readme.md file that contains a summary of the grammar rules used in that project.
All you need to do to open each project is to enter the relevant section folder and open the project folder in the section folder with VsCode.
-
You can download and install the VsCode editor to run the sample projects from the link here https://code.visualstudio.com/download
-
The LLVM 12 library will be used mainly in the examples. You can download and install any version of the LLVM library from the link here https://releases.llvm.org/download.html
You can also download LLVM 12 with the tools provided by your operating system and ide, apart from the link above.
For example, homebrew tool in MacOs operating system or Visual C++ Package Manager(vcpkg) of Visual Studio ide.
-
Required for building and compiling C++ projects in the VsCode editor. You can install CMake tool from the link here https://cmake.org/install/