This is a toy x86 assembler written from scratch. It was written to gain a better understanding of how machine code and executable files work. Its syntax uses a lot of emojis because why not?
Using the print example:
$ cargo run -- examples/print.jas hi!
🖊LINUX_SYSCALL $128 # ... ❗ LINUX_SYSCALL
# I'm a comment 🦘= ✉exit
⚫ ⬅ $8
Load 8 into ⚫.
🔴 ⬅ 🔵
Copies data from 🔵 into 🔴.
📗my_number 3 # ... 🔴 ⬅ my_number
This loads 3 into 🔴.
🔴 ⬅ $0~🔵
This loads the value at the address contained in 🔵 into 🔴.
🔴 ⬅ $4~🔵
Or alternatively with a constant:
🖊ST_ARG $8 # ... 🔴 ⬅ ST_ARG~🔵
This is similar to indirect addressing except that it adds a constant offset to the address in 🔵.
🦘 ✉exit # ... 📪exit: ⚪ ⬅ $1 ❗ LINUX_SYSCALL
Labels are defined by prefixing them with 📪 and ending them with a
:
. To refer to a label prefix it with ✉ instead.
📗numbers 3, 67, 34, 222, 45 # ... 🔵 ⬅ numbers
Data sections start with 📗 and can be referred to later by just their name.
The main high-level function which processes a file is process. First
the code is broken up into separate lines. Each line is then tokenized
into a vector of TokenType
. ConstantReferences
are replaced by
their constants and the vector is compiled into a vector of
IntermediateCode
. Intermediate code consists of bytes and
displacements. We need this intermediate step because e.g. a jump to
an instruction further down the program can not be encoded, when we
encounter a jump to a next instruction we don’t know yet how far to
jump. After this we iterate through the IntermediateCode
and replace
the displacements with bytes. This is done by keeping track of the
byte offset of each instruction in the program during the first step.
After this an ELF binary is built. Its layout is as follows (the multiple data sections example was used here):
$ readelf -a a.out ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: Intel 80386 Version: 0x1 Entry point address: 0x804b000 Start of program headers: 52 (bytes into file) Start of section headers: 148 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 5 Section header string table index: 4 Section Headers: [Nr] Name Type Addr Off Size ES Flg Lk Inf Al [ 0] NULL 00000000 000000 000000 00 0 0 0 [ 1] pi PROGBITS 08049000 001000 000014 00 WA 0 0 1 [ 2] euler PROGBITS 0804a000 002000 000014 00 WA 0 0 1 [ 3] .code PROGBITS 0804b000 003000 000019 00 AX 0 0 1 [ 4] .shstrtab STRTAB 00000000 000400 00001a 00 0 0 1 ... Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align LOAD 0x003000 0x0804b000 0x0804b000 0x00019 0x00019 R E 0x1000 LOAD 0x001000 0x08049000 0x08049000 0x00014 0x00014 RW 0x1000 LOAD 0x002000 0x0804a000 0x0804a000 0x00014 0x00014 RW 0x1000 Section to Segment mapping: Segment Sections... 00 .code 01 pi 02 euler ...
There’s a program header entry for each data section (📗) and for the executable code. Everything is padded to 4 KB (=virtual page size). To allow for linking a correct section header is also generated.
Symbol | Name |
---|---|
⚪ | %eax |
🔴 | %ebx |
🔵 | %ecx |
⚫ | %edx |
◀ | %esp |
⬇ | %ebp |
Symbol | Example | Description |
---|---|---|
↩ | ↩ | Return from a function |
📞 | 📞 fn | Call function |
➕ | ⚪ ➕ ⚫ | ⚪ += ⚫ |
➖ | ⚪ ➖ ⚫ | ⚪ -= ⚫ |
✖ | ⚪ ✖ ⚫ | ⚪ *= ⚫ |
⬅ | 🔴 ⬅ $1 | Move into register |
❗ | ❗ $128 | Interrupt |
⚖ | ⚖ ⚫, ⚪ | Compare ⚫ to ⚪ |
🦘= | 🦘= ✉exit | Jump if equal |
🦘≠ | 🦘≠ ✉exit | Jump if not equal |
🦘< | 🦘< ✉exit | Jump if less than |
🦘≤ | 🦘≤ ✉exit | Jump if less or equal |
🦘> | 🦘> ✉exit | Jump if greater than |
🦘≥ | 🦘≥ ✉exit | Jump if greater or equal |
🦘 | 🦘 ✉exit | Unconditional jump |
📥 | 📥 $8 | Push onto stack |
📤 | 📤 🔵 | Pop from stack |
🖊 | 🖊c $4 | Define constant c to be 4 |
📪 (ends with :) | 📪exit: | Define a label with name exit |
📗 | 📗pi 3, 1, 4 | Define a data section pi containing 3 integers |
✉ | ✉exit | Refer to a previously defined (📪) exit label |
$ | $1 | 1 is a number |
# | # hi! | hi! is a comment |
[0-9]+ | 1 | 1 is a memory address |
[aA-zZ]+ | constant | constant is a previously defined (🖊, 📗) constant |