Lexical analysis

An ultra-simplified explanation 👍

Primarily written for my own benefit, but hopefully helpful to someone else as well. Here I try to make them stick in to your mind (and maybe mine) by explaining them in the simplest way possible. Contributions are always welcome.

Table of content

Introduction
Lexical analysis
Syntax analysis
Semantic analysis

Introduction

A programmer is just a tool that converts caffeine into code. A compiler is just a tool that converts code into machine language.

Wikipedia says:

A compiler is a computer program that translates computer code written in one programming language (the source language) into another language (the target language)

This article will use Power Fx (open-source language written in C#) as a practical example of how compilers work and their common phases.

Lexical analysis

Wikipedia says:

The process of converting a sequence of characters into a sequence of lexical tokens (strings with an assigned and thus identified meaning).

Let's take the following sentence:

BOB DROVE HOME.

We, humans, identify some elements of English grammar and American culture. Now, imagine that instead of having the full sentence upfront, you are reading one character per turn. Let's do a lexical analysis:

B - Doesn't mean anything.
BO - Doesn't mean anything.
BOB - Doesn't mean anything.
**BOB ** - Wait! This is a blank space, meaning we finish a word. BOB is a word. Plus, a blank space is punctuation.

Moving forward:

D - Doesn't mean anything.
DR - Doesn't mean anything.
DRO - Doesn't mean anything.
DROV - Doesn't mean anything.
DROVE - Doesn't mean anything.
**DROVE ** - Wait! This is a blank space, meaning that we finish a word. DROVE is a word. Plus, a blank space is punctuation.

Moving forward:

H - Doesn't mean anything.
HO - Doesn't mean anything.
HOM - Doesn't mean anything.
HOME - Doesn't mean anything.
HOME. - Wait! This is a period, meaning that we finish a word. HOME is a word. Plus, a period is punctuation.

If we organize the results in a table-like structure, it would be something like the following:

Token	Kind
BOB	Word
(blank space)	Punctuation
DRIVE	Word
(blank space	Punctuation
HOME	Word
.	Punctuation

We just performed a basic lexical analysis, or in other words, TOKENIZATION. The product of this analysis is a list containing all tokens found in the input sentence, categorized by each of their types (word, punctuation, etc).

Note that at this phase we don't care if there is any grammar error or if a word is unknown. The tokenization job is simply to split the sentence into tokens, categorize them, and nothing more.

Now, let's move on to a real-world scenario.
Abs(-10) is a valid Power Fx expression. We can tokenize it by calling the following code:

using Microsoft.PowerFx;

var engine = new Engine();
var tokens = engine.Tokenize("Abs(-10)");

The result will be a similar table as the example above:

Token	Kind
Abs	Ident
(	ParenOpen
-	Sub
10	DecLit
)	ParenClose

Syntax analysis

Coming soon.

Semantic analysis

Coming soon.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
LICENSE		LICENSE
README.md		README.md
cover.png		cover.png
lexical-fx.md		lexical-fx.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

An ultra-simplified explanation 👍

Table of content

Introduction

Lexical analysis

Syntax analysis

Semantic analysis

About

Releases

Packages

License

anderson-joyle/compilers-for-humans

Folders and files

Latest commit

History

Repository files navigation

An ultra-simplified explanation 👍

Table of content

Introduction

Lexical analysis

Syntax analysis

Semantic analysis

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages