\documentclass[11pt]{book}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{lmodern}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage[english]{babel}
\usepackage{listings}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{amssymb}
\usepackage{natbib}
\usepackage{stmaryrd}
\usepackage{xypic}
\usepackage{semantic}
\usepackage{wrapfig}
\usepackage{multirow}
\usepackage{color}
\definecolor{lightgray}{gray}{1}
\newcommand{\black}[1]{{\color{black} #1}}
\newcommand{\gray}[1]{{\color{lightgray} #1}}
\newcommand{\tagbits}{3} % For consistency
\newcommand{\tagmask}{7} % 0b111
%% For pictures
\usepackage{tikz}
\usetikzlibrary{arrows.meta}
\tikzset{baseline=(current bounding box.center), >/.tip={Triangle[scale=1.4]}}
% Computer Modern is already the default. -Jeremy
%\renewcommand{\ttdefault}{cmtt}
\definecolor{comment-red}{rgb}{0.8,0,0}
\if{0}
% Peanut gallery comments:
\newcommand{\rn}[1]{{\color{comment-red}{(RRN: #1)}}}
\newcommand{\margincomment}[1]{\marginpar{#1}}
\else
\newcommand{\rn}[1]{}
\newcommand{\margincomment}[1]{}
% \newcommand{\margincomment}[1]{}
\fi
\lstset{%
language=Lisp,
basicstyle=\ttfamily\small,
escapechar=|,
columns=flexible,
moredelim=[is][\color{red}]{~}{~}
}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{constraint}[theorem]{Constraint}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{exercise}[theorem]{Exercise}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% 'dedication' environment: To add a dedication paragraph at the start of book %
% Source: http://www.tug.org/pipermail/texhax/2010-June/015184.html %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newenvironment{dedication}
{
\cleardoublepage
\thispagestyle{empty}
\vspace*{\stretch{1}}
\hfill\begin{minipage}[t]{0.66\textwidth}
\raggedright
}
{
\end{minipage}
\vspace*{\stretch{3}}
\clearpage
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Chapter quote at the start of chapter %
% Source: http://tex.stackexchange.com/a/53380 %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\makeatletter
\renewcommand{\@chapapp}{}% Not necessary...
\newenvironment{chapquote}[2][2em]
{\setlength{\@tempdima}{#1}%
\def\chapquote@author{#2}%
\parshape 1 \@tempdima \dimexpr\textwidth-2\@tempdima\relax%
\itshape}
{\par\normalfont\hfill--\ \chapquote@author\hspace*{\@tempdima}\par\bigskip}
\makeatother
\input{defs}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{\Huge \textbf{Essentials of Compilation} \\
\huge An Incremental Approach}
\author{\textsc{Jeremy G. Siek, Ryan R. Newton} \\
%\thanks{\url{http://homes.soic.indiana.edu/jsiek/}} \\
Indiana University \\
\\
with contributions from: \\
Carl Factora \\
Andre Kuhlenschmidt \\
Michael M. Vitousek \\
Cameron Swords
}
\begin{document}
\frontmatter
\maketitle
\begin{dedication}
This book is dedicated to the programming language wonks at Indiana
University.
\end{dedication}
\tableofcontents
\listoffigures
%\listoftables
\mainmatter
\if{0}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter*{Preface}
The tradition of compiler writing at Indiana University goes back to
programming language research and courses taught by Daniel Friedman in
the 1970's and 1980's. Dan had conducted research on lazy evaluation
in the context of Lisp~\citep{McCarthy:1960dz} and then studied
continuations and macros in the context of
Scheme~\citep{Sussman:1975ab}, a dialect of Lisp. One of the students of
those courses, Kent Dybvig, went on to build Chez
Scheme~\citep{Dybvig:2006aa}, a production-quality and efficient
compiler for Scheme. After completing his Ph.D. at the University of
North Carolina, Kent returned to teach at Indiana University.
Throughout the 1990's and 2000's, Kent continued development of
Chez Scheme and taught the compiler course.
The compiler course evolved to incorporate novel pedagogical ideas
while also including elements of effective real-world compilers. One
of Dan's ideas was to split the compiler into many small passes over
the input program and subsequent intermediate representations, so that
the code for each pass would be easy to understand in isolation. (In
contrast, most compilers of the time were organized into only a few
monolithic passes for reasons of compile-time efficiency.) Kent and
his students, Dipanwita Sarkar and Andrew Keep, developed
infrastructure to support this approach and evolved the course, first
to use micro-sized passes and then into even smaller nano
passes~\citep{Sarkar:2004fk,Keep:2012aa}. I took this compiler course
in the early 2000's, as part of my Ph.D. studies at Indiana
University. Needless to say, I enjoyed the course immensely.
\rn{I think that 1999 when I took it was the first micropass semester, and that
that approach preceded the infrastructure work by Dipa.}
One of my classmates, Abdulaziz Ghuloum, observed that the
front-to-back organization of the course made it difficult for
students to understand the rationale for the compiler
design. Abdulaziz proposed an incremental approach in which the
students build the compiler in stages; they start by implementing a
complete compiler for a very small subset of the input language, then
in each subsequent stage they add a feature to the input language and
add or modify passes to handle the new feature~\citep{Ghuloum:2006bh}.
In this way, the students see how the language features motivate
aspects of the compiler design.
After graduating from Indiana University in 2005, I went on to teach
at the University of Colorado. I adapted the nano pass and incremental
approaches to compiling a subset of the Python
language~\citep{Siek:2012ab}. Python and Scheme are quite different
on the surface but there is a large overlap in the compiler techniques
required for the two languages. Thus, I was able to teach much of the
same content from the Indiana compiler course. I very much enjoyed
teaching the course organized in this way, and even better, many of
the students learned a lot and got excited about compilers.
It is now 2016 and I too have returned to teach at Indiana University.
In my absence the compiler course had switched from the front-to-back
organization to a back-to-front organization. Seeing how well the
incremental approach worked at Colorado, I started porting and
adapting the structure of the Colorado course back into the land of
Scheme. In the meantime Indiana had moved on from Scheme to Racket, so
the course is now about compiling a subset of Racket to the x86
assembly language and the compiler is implemented in
Racket~\citep{plt-tr}.
This is the textbook for the incremental version of the compiler
course at Indiana University (Spring 2016) and it is the first
open textbook for an Indiana compiler course. With this book I hope to
make the Indiana compiler course available to people who have not had
the chance to study in Bloomington in person. Many of the compiler
design decisions in this book are drawn from the assignment
descriptions of \cite{Dybvig:2010aa}. I have captured what I think are
the most important topics from \cite{Dybvig:2010aa} but I have omitted
topics that I think are less interesting conceptually and I have made
simplifications to reduce complexity. In this way, this book leans
more towards pedagogy than towards the absolute efficiency of the
generated code. Also, the book differs in places where I saw the
opportunity to make the topics more fun, such as in relating register
allocation to Sudoku (Chapter~\ref{ch:register-allocation}).
\section*{Prerequisites}
The material in this book is challenging but rewarding. It is meant to
prepare students for a lifelong career in programming languages. I do
not recommend this book for students who want to dabble in programming
languages. Because the book uses the Racket language both for the
implementation of the compiler and for the language that is compiled,
a student should be proficient with Racket (or Scheme) prior to
reading this book. There are many other excellent resources for
learning Scheme and
Racket~\citep{Dybvig:1987aa,Abelson:1996uq,Friedman:1996aa,Felleisen:2001aa,Felleisen:2013aa,Flatt:2014aa}. It
is helpful but not necessary for the student to have prior exposure to
x86 (or x86-64) assembly language~\citep{Intel:2015aa}, as one might
obtain from a computer systems
course~\citep{Bryant:2005aa,Bryant:2010aa}. This book introduces the
parts of x86-64 assembly language that are needed.
%\section*{Structure of book}
% You might want to add short description about each chapter in this book.
%\section*{About the companion website}
%The website\footnote{\url{https://github.com/amberj/latex-book-template}} for %this file contains:
%\begin{itemize}
% \item A link to (freely downlodable) latest version of this document.
% \item Link to download LaTeX source for this document.
% \item Miscellaneous material (e.g. suggested readings etc).
%\end{itemize}
\section*{Acknowledgments}
Need to give thanks to
\begin{itemize}
\item Bor-Yuh Evan Chang
\item Kent Dybvig
\item Daniel P. Friedman
\item Ronald Garcia
\item Abdulaziz Ghuloum
\item Ryan Newton
\item Dipanwita Sarkar
\item Andrew Keep
\item Oscar Waddell
\end{itemize}
\mbox{}\\
\noindent Jeremy G. Siek \\
\noindent \url{http://homes.soic.indiana.edu/jsiek} \\
\noindent Spring 2016
\fi{} %% End Preface
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Preliminaries}
\label{ch:trees-recur}
In this chapter, we review the basic tools that are needed for implementing a
compiler. We use abstract syntax trees (ASTs), which refer to data structures in
the compiler's memory, rather than programs as they are stored on disk, in
\emph{concrete syntax}.
%
ASTs can be represented in many different ways, depending on the programming
language used to write the compiler.
%
Because this book uses Racket (\url{http://racket-lang.org}), a descendant of
Scheme, we use S-expressions to represent programs (Section~\ref{sec:ast})
and pattern matching to inspect individual nodes in an AST
(Section~\ref{sec:pattern-matching}). We use recursion to construct
and deconstruct entire ASTs (Section~\ref{sec:recursion}).
\section{Abstract Syntax Trees}
\label{sec:ast}
The primary data structure that is commonly used for representing
programs is the \emph{abstract syntax tree} (AST). When considering
some part of a program, a compiler needs to ask what kind of part it
is and what sub-parts it has. For example, the program on the left,
represented by an S-expression, corresponds to the AST on the right.
\begin{center}
\begin{minipage}{0.4\textwidth}
\begin{lstlisting}
(+ (read) (- 8))
\end{lstlisting}
\end{minipage}
\begin{minipage}{0.4\textwidth}
\begin{equation}
\begin{tikzpicture}
\node[draw, circle] (plus) at (0 , 0) {\key{+}};
\node[draw, circle] (read) at (-1, -1.5) {{\footnotesize\key{read}}};
\node[draw, circle] (minus) at (1 , -1.5) {$\key{-}$};
\node[draw, circle] (8) at (1 , -3) {\key{8}};
\draw[->] (plus) to (read);
\draw[->] (plus) to (minus);
\draw[->] (minus) to (8);
\end{tikzpicture}
\label{eq:arith-prog}
\end{equation}
\end{minipage}
\end{center}
We shall use the standard terminology for trees: each circle above is
called a \emph{node}. The arrows connect a node to its \emph{children}
(which are also nodes). The top-most node is the \emph{root}. Every
node except for the root has a \emph{parent} (the node it is the child
of). If a node has no children, it is a \emph{leaf} node. Otherwise
it is an \emph{internal} node.
When deciding how to compile the above program, we need to know that
the root node operation is addition and that it has two children:
\texttt{read} and a negation. The abstract syntax tree data structure
directly supports these queries and hence is a good choice. In this
book, we will often write down the textual representation of a program
even when we really have in mind the AST because the textual
representation is more concise. We recommend that, in your mind, you
always interpret programs as abstract syntax trees.
\section{Grammars}
\label{sec:grammar}
A programming language can be thought of as a \emph{set} of programs.
The set is typically infinite (one can always create larger and larger
programs), so one cannot simply describe a language by listing all of
the programs in the language. Instead we write down a set of rules, a
\emph{grammar}, for building programs. We shall write our rules in a
variant of Backus-Naur Form (BNF)~\citep{Backus:1960aa,Knuth:1964aa}.
As an example, we describe a small language, named $R_0$, of
integers and arithmetic operations. The first rule says that any
integer is an expression, $\Exp$, in the language:
\begin{equation}
\Exp ::= \Int \label{eq:arith-int}
\end{equation}
%
Each rule has a left-hand-side and a right-hand-side. The way to read
a rule is that if you have all the program parts on the
right-hand-side, then you can create an AST node and categorize it
according to the left-hand-side.
%
A name such as $\Exp$ that is
defined by the grammar rules is a \emph{non-terminal}.
%
The name $\Int$ is also a non-terminal; however,
we do not define $\Int$ because the
reader already knows what an integer is.
%
Further, we make the simplifying design decision that all of the languages in
this book only handle machine-representable integers. On most modern machines
this corresponds to integers represented with 64 bits, i.e., in the range
$-2^{63}$ to $2^{63}-1$.
%
However, we restrict this range further to match the Racket \texttt{fixnum}
datatype, which allows 63-bit integers on a 64-bit machine.
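%
As a quick check (this snippet is ours, for illustration), the Racket
predicate \texttt{fixnum?} reports whether an integer is
machine-representable; the exact bound depends on the Racket
implementation.
\begin{lstlisting}
(fixnum? 8)            ; #t, small integers are fixnums
(fixnum? (expt 2 70))  ; #f, too large to be machine-representable
\end{lstlisting}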
The second grammar rule is the \texttt{read} operation that receives
an input integer from the user of the program.
\begin{equation}
\Exp ::= (\key{read}) \label{eq:arith-read}
\end{equation}
The third rule says that, given an $\Exp$ node, you can build another
$\Exp$ node by negating it.
\begin{equation}
\Exp ::= (\key{-} \; \Exp) \label{eq:arith-neg}
\end{equation}
Symbols such as \key{-} in typewriter font are \emph{terminal} symbols
and must literally appear in the program for the rule to be
applicable.
We can apply the rules to build ASTs in the $R_0$
language. For example, by rule \eqref{eq:arith-int}, \texttt{8} is an
$\Exp$, then by rule \eqref{eq:arith-neg}, the following AST is
an $\Exp$.
\begin{center}
\begin{minipage}{0.25\textwidth}
\begin{lstlisting}
(- 8)
\end{lstlisting}
\end{minipage}
\begin{minipage}{0.25\textwidth}
\begin{equation}
\begin{tikzpicture}
\node[draw, circle] (minus) at (0, 0) {$\text{--}$};
\node[draw, circle] (8) at (0, -1.2) {$8$};
\draw[->] (minus) to (8);
\end{tikzpicture}
\label{eq:arith-neg8}
\end{equation}
\end{minipage}
\end{center}
The following grammar rule defines addition expressions:
\begin{equation}
\Exp ::= (\key{+} \; \Exp \; \Exp) \label{eq:arith-add}
\end{equation}
Now we can see that the AST \eqref{eq:arith-prog} is an $\Exp$ in
$R_0$. We know that \lstinline{(read)} is an $\Exp$ by rule
\eqref{eq:arith-read} and we have shown that \texttt{(- 8)} is an
$\Exp$, so we can apply rule \eqref{eq:arith-add} to show that
\texttt{(+ (read) (- 8))} is an $\Exp$ in the $R_0$ language.
If you have an AST for which the above rules do not apply, then the
AST is not in $R_0$. For example, the AST \texttt{(- (read) (+ 8))} is
not in $R_0$ because there are no rules for \key{+} with only one
argument, nor for \key{-} with two arguments. Whenever we define a
language with a grammar, we implicitly mean for the language to be the
smallest set of programs that are justified by the rules. That is, the
language only includes those programs that the rules allow.
The last grammar rule for $R_0$ states that there is a \key{program} node
to mark the top of the whole program:
\[
R_0 ::= (\key{program} \; \Exp)
\]
The \code{read-program} function provided in \code{utilities.rkt}
reads programs in from a file (the sequence of characters in the
concrete syntax of Racket) and parses them into the abstract syntax
tree. The concrete syntax does not include a \key{program} form; that
is added by the \code{read-program} function as it creates the
AST. See the description of \code{read-program} in
Appendix~\ref{appendix:utilities} for more details.
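As a rough sketch of its use (the file name here is hypothetical), a
program stored on disk in concrete syntax can be parsed as follows.
\begin{lstlisting}
;; Suppose "tests/r0_1.rkt" contains the concrete syntax (+ (read) (- 8)).
;; read-program parses it and adds the program wrapper:
(read-program "tests/r0_1.rkt")   ; => '(program (+ (read) (- 8)))
\end{lstlisting}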
It is common to have many rules with the same left-hand side, such as
$\Exp$ in the grammar for $R_0$, so there is a vertical bar notation
for gathering several rules, as shown in
Figure~\ref{fig:r0-syntax}. Each clause separated by a vertical bar is
called an {\em alternative}.
\begin{figure}[tp]
\fbox{
\begin{minipage}{0.96\textwidth}
\[
\begin{array}{rcl}
\Exp &::=& \Int \mid ({\tt \key{read}}) \mid (\key{-} \; \Exp) \mid
(\key{+} \; \Exp \; \Exp) \\
R_0 &::=& (\key{program} \; \Exp)
\end{array}
\]
\end{minipage}
}
\caption{The syntax of $R_0$, a language of integer arithmetic.}
\label{fig:r0-syntax}
\end{figure}
\section{S-Expressions}
\label{sec:s-expr}
Racket, as a descendant of Lisp, has
convenient support for creating and manipulating abstract syntax trees
with its \emph{symbolic expression} feature, or S-expression for
short. We can create an S-expression simply by writing a backquote
followed by the textual representation of the AST. (Technically
speaking, this is called a \emph{quasiquote} in Racket.) For example,
an S-expression to represent the AST \eqref{eq:arith-prog} is created
by the following Racket expression:
\begin{center}
\texttt{`(+ (read) (- 8))}
\end{center}
To build larger S-expressions one often needs to splice together
several smaller S-expressions. Racket provides the comma operator to
splice an S-expression into a larger one. For example, instead of
creating the S-expression for AST \eqref{eq:arith-prog} all at once,
we could have first created an S-expression for AST
\eqref{eq:arith-neg8} and then spliced that into the addition
S-expression.
\begin{lstlisting}
(define ast1.4 `(- 8))
(define ast1.1 `(+ (read) ,ast1.4))
\end{lstlisting}
In general, the Racket expression that follows the comma (splice)
can be any expression that computes an S-expression.
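For instance, the spliced expression may be a call to a helper function
that builds part of the AST; the \texttt{neg} helper below is our own
illustration rather than part of the book's support code.
\begin{lstlisting}
(define (neg e) `(- ,e))              ; build a negation node around e
(define ast1.5 `(+ (read) ,(neg 8)))  ; splice in the computed S-expression
;; ast1.5 is '(+ (read) (- 8))
\end{lstlisting}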
\section{Pattern Matching}
\label{sec:pattern-matching}
As mentioned above, one of the operations that a compiler needs to
perform on an AST is to access the children of a node. Racket
provides the \texttt{match} form to access the parts of an
S-expression. Consider the following example and the output on the
right.
\begin{center}
\begin{minipage}{0.5\textwidth}
\begin{lstlisting}
(match ast1.1
[`(,op ,child1 ,child2)
(print op) (newline)
(print child1) (newline)
(print child2)])
\end{lstlisting}
\end{minipage}
\vrule
\begin{minipage}{0.25\textwidth}
\begin{lstlisting}
'+
'(read)
'(- 8)
\end{lstlisting}
\end{minipage}
\end{center}
The \texttt{match} form takes AST \eqref{eq:arith-prog} and binds its
parts to the three variables \texttt{op}, \texttt{child1}, and
\texttt{child2}. In general, a match clause consists of a
\emph{pattern} and a \emph{body}. The pattern is a quoted S-expression
that may contain pattern-variables (preceded by a comma).
%
The pattern is not the same thing as a quasiquote expression used to
\emph{construct} ASTs; however, the similarity is intentional: constructing and
deconstructing ASTs use similar syntax.
%
While the pattern uses a restricted syntax,
the body of the match clause may contain any Racket code whatsoever.
A \texttt{match} form may contain several clauses, as in the following
function \texttt{leaf?} that recognizes when an $R_0$ node is
a leaf. The \texttt{match} proceeds through the clauses in order,
checking whether the pattern can match the input S-expression. The
body of the first clause that matches is executed. The output of
\texttt{leaf?} for several S-expressions is shown on the right. In the
\texttt{match} below, we see another form of pattern: \texttt{(?
fixnum?)} applies the predicate \texttt{fixnum?} to the input
S-expression to see if it is a machine-representable integer.
\begin{center}
\begin{minipage}{0.5\textwidth}
\begin{lstlisting}
(define (leaf? arith)
(match arith
[(? fixnum?) #t]
[`(read) #t]
[`(- ,c1) #f]
[`(+ ,c1 ,c2) #f]))
(leaf? `(read))
(leaf? `(- 8))
(leaf? `(+ (read) (- 8)))
\end{lstlisting}
\end{minipage}
\vrule
\begin{minipage}{0.25\textwidth}
\begin{lstlisting}
#t
#f
#f
\end{lstlisting}
\end{minipage}
\end{center}
\section{Recursion}
\label{sec:recursion}
Programs are inherently recursive in that an $R_0$ $\Exp$ AST is made
up of smaller expressions. Thus, the natural way to process an
entire program is with a recursive function. As a first example of
such a function, we define \texttt{R0?} below, which takes an
arbitrary S-expression, {\tt sexp}, and determines whether or not {\tt
sexp} is in $R_0$. Note that each match clause corresponds to
one grammar rule for $R_0$ and the body of each clause makes a
recursive call for each child node. This pattern of recursive function
is so common that it has a name, \emph{structural recursion}. In
general, when a recursive function is defined using a sequence of
match clauses that correspond to a grammar, and each clause body makes
a recursive call on each child node, then we say the function is
defined by structural recursion.
%
\begin{center}
\begin{minipage}{0.7\textwidth}
\begin{lstlisting}
(define (R0? sexp)
(define (exp? ex)
(match ex
[(? fixnum?) #t]
[`(read) #t]
[`(- ,e) (exp? e)]
[`(+ ,e1 ,e2)
(and (exp? e1) (exp? e2))]))
(match sexp
[`(program ,e) (exp? e)]
[else #f]))
(R0? `(+ (read) (- 8)))
(R0? `(- (read) (+ 8)))
\end{lstlisting}
\end{minipage}
\vrule
\begin{minipage}{0.25\textwidth}
\begin{lstlisting}
#t
#f
\end{lstlisting}
\end{minipage}
\end{center}
Indeed, the structural recursion follows the grammar itself. We can generally
expect to write a recursive function to handle each non-terminal in the
grammar.\footnote{If you took the \emph{How to Design Programs} course
\url{http://www.ccs.neu.edu/home/matthias/HtDP2e/}, this principle of
structuring code according to the data definition is probably quite familiar.}
You may be tempted to write the program like this:
\begin{center}
\begin{minipage}{0.5\textwidth}
\begin{lstlisting}
(define (R0? sexp)
(match sexp
[(? fixnum?) #t]
[`(read) #t]
[`(- ,e) (R0? e)]
[`(+ ,e1 ,e2) (and (R0? e1) (R0? e2))]
[`(program ,e) (R0? e)]
[else #f]))
\end{lstlisting}
\end{minipage}
\end{center}
%
Sometimes such a trick will save a few lines of code, especially when it comes
to the {\tt program} wrapper. Yet this style is generally \emph{not}
recommended, because it can get you into trouble.
%
For instance, the above function is subtly wrong:
\lstinline{(R0? `(program (program 3)))} will return true, when it
should return false.
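For example, with the flattened definition above we would observe the
following; the nested-helper version, in contrast, has no clause that
accepts a \key{program} node inside an expression.
\begin{lstlisting}
;; Using the flattened definition of R0? from above:
(R0? `(program (program 3)))   ; #t, even though this AST is not in R0
\end{lstlisting}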
%% NOTE FIXME - must check for consistency on this issue throughout.
\section{Interpreters}
\label{sec:interp-R0}
The meaning, or semantics, of a program is typically defined in the
specification of the language. For example, the Scheme language is
defined in the report by \cite{SPERBER:2009aa}. The Racket language is
defined in its reference manual~\citep{plt-tr}. In this book we use an
interpreter to define the meaning of each language that we consider,
following Reynolds's advice in this
regard~\citep{reynolds72:_def_interp}. Here we will warm up by writing
an interpreter for the $R_0$ language, which will also serve as a
second example of structural recursion. The \texttt{interp-R0}
function is defined in Figure~\ref{fig:interp-R0}. The body of the
function is a match on the input program \texttt{p} and
then a call to the \lstinline{exp} helper function, which in turn has
one match clause per grammar rule for $R_0$ expressions.
The \lstinline{exp} function is naturally recursive: clauses for internal AST
nodes make recursive calls on each child node. Note that the recursive cases
for negation and addition are a place where we could have made use of the
\key{app} feature of Racket's \key{match} to apply a function and bind the
result. The two recursive cases of \lstinline{interp-R0} would become:
\begin{minipage}{0.5\textwidth}
\begin{lstlisting}
[`(- ,(app exp v)) (fx- 0 v)]
[`(+ ,(app exp v1) ,(app exp v2)) (fx+ v1 v2)]))
\end{lstlisting}
\end{minipage}
Here we use \lstinline{(app exp v)} to recursively apply \texttt{exp} to the
child node and bind the \emph{result value} to variable \texttt{v}. The
difference between this version and the code in Figure~\ref{fig:interp-R0} is
mainly stylistic, although if side effects are involved the order of evaluation
can become important. Further, when we write functions with multiple return
values, the \key{app} form can be convenient for binding the resulting values.
\begin{figure}[tbp]
\begin{lstlisting}
(define (interp-R0 p)
(define (exp ex)
(match ex
[(? fixnum?) ex]
[`(read)
(let ([r (read)])
(cond [(fixnum? r) r]
[else (error 'interp-R0 "input not an integer" r)]))]
[`(- ,e) (fx- 0 (exp e))]
[`(+ ,e1 ,e2) (fx+ (exp e1) (exp e2))]))
(match p
[`(program ,e) (exp e)]))
\end{lstlisting}
\caption{Interpreter for the $R_0$ language.
\rn{Having two functions here for prog/exp wouldn't take much more space.
I'll change that once I get further.. but I also need to know what the story
is for running this code?}}
\label{fig:interp-R0}
\end{figure}
Let us consider the result of interpreting some example $R_0$
programs. The following program simply adds two integers.
\begin{lstlisting}
(+ 10 32)
\end{lstlisting}
The result is \key{42}, as you might have expected. Here we have written the
program in concrete syntax, whereas the parsed abstract syntax would be
slightly different: \lstinline{(program (+ 10 32))}.
The next example demonstrates that expressions may be nested within
each other, in this case nesting several additions and negations.
\begin{lstlisting}
(+ 10 (- (+ 12 20)))
\end{lstlisting}
What is the result of the above program?
\noindent
If we interpret the AST \eqref{eq:arith-prog} and give it the input
\texttt{50}
\begin{lstlisting}
(interp-R0 ast1.1)
\end{lstlisting}
we get the answer to life, the universe, and everything:
\begin{lstlisting}
42
\end{lstlisting}
Moving on, the \key{read} operation prompts the user of the program
for an integer. Given an input of \key{10}, the following program
produces \key{42}.
\begin{lstlisting}
(+ (read) 32)
\end{lstlisting}
We include the \key{read} operation in $R_1$ so that a compiler for
$R_1$ cannot be implemented simply by running the interpreter at
compilation time to obtain the output and then generating the trivial
code to return the output.
(A clever student did this in a previous version of the course.)
The job of a compiler is to translate a program in one language into a
program in another language so that the output program behaves the
same way as the input program. This idea is depicted in the following
diagram. Suppose we have two languages, $\mathcal{L}_1$ and
$\mathcal{L}_2$, and an interpreter for each language. Suppose that
the compiler translates program $P_1$ in language $\mathcal{L}_1$ into
program $P_2$ in language $\mathcal{L}_2$. Then interpreting $P_1$
and $P_2$ on their respective interpreters with input $i$ should yield
the same output $o$.
\begin{equation} \label{eq:compile-correct}
\begin{tikzpicture}[baseline=(current bounding box.center)]
\node (p1) at (0, 0) {$P_1$};
\node (p2) at (3, 0) {$P_2$};
\node (o) at (3, -2.5) {$o$};
\path[->] (p1) edge [above] node {compile} (p2);
\path[->] (p2) edge [right] node {interp-$\mathcal{L}_2$($i$)} (o);
\path[->] (p1) edge [left] node {interp-$\mathcal{L}_1$($i$)} (o);
\end{tikzpicture}
\end{equation}
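Diagram \eqref{eq:compile-correct} can also be read as a recipe for
testing. The following higher-order helper is only a sketch of that
recipe (the function and its name are ours, not part of the support
code); Section~\ref{sec:partial-evaluation} instantiates the same idea
with \texttt{test-pe}.
\begin{lstlisting}
;; Sketch: does compiling P1 preserve its behavior on input i?
;; interp-src and interp-tgt interpret a program on a given input;
;; compile translates source programs to target programs.
(define (check-compiler interp-src interp-tgt compile P1 i)
  (equal? (interp-src P1 i)
          (interp-tgt (compile P1) i)))
\end{lstlisting}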
In the next section we see our first example of a compiler, which is
another example of structural recursion.
\section{Example Compiler: a Partial Evaluator}
\label{sec:partial-evaluation}
In this section we consider a compiler that translates $R_0$
programs into $R_0$ programs that are more efficient, that is,
this compiler is an optimizer. Our optimizer will accomplish this by
trying to eagerly compute the parts of the program that do not depend
on any inputs. For example, given the following program
\begin{lstlisting}
(+ (read) (- (+ 5 3)))
\end{lstlisting}
our compiler will translate it into the program
\begin{lstlisting}
(+ (read) -8)
\end{lstlisting}
Figure~\ref{fig:pe-arith} gives the code for a simple partial
evaluator for the $R_0$ language. The output of the partial evaluator
is an $R_0$ program, which we build up using a combination of
quasiquotes and commas. (Though no quasiquote is necessary for
integers.) In Figure~\ref{fig:pe-arith}, the normal structural
recursion is captured in the main \texttt{pe-arith} function whereas
the code for partially evaluating negation and addition is factored
into two separate helper functions: \texttt{pe-neg} and
\texttt{pe-add}. The input to these helper functions is the output of
partially evaluating the children nodes.
\begin{figure}[tbp]
\begin{lstlisting}
(define (pe-neg r)
(cond [(fixnum? r) (fx- 0 r)]
[else `(- ,r)]))
(define (pe-add r1 r2)
(cond [(and (fixnum? r1) (fixnum? r2)) (fx+ r1 r2)]
[else `(+ ,r1 ,r2)]))
(define (pe-arith e)
(match e
[(? fixnum?) e]
[`(read) `(read)]
[`(- ,(app pe-arith r1))
(pe-neg r1)]
[`(+ ,(app pe-arith r1) ,(app pe-arith r2))
(pe-add r1 r2)]))
\end{lstlisting}
\caption{A partial evaluator for $R_0$ expressions.}
\label{fig:pe-arith}
\end{figure}
Our code for \texttt{pe-neg} and \texttt{pe-add} implements the simple
idea of checking whether the inputs are integers and if they are, to
go ahead and perform the arithmetic. Otherwise, we use quasiquote to
create an AST node for the appropriate operation (either negation or
addition) and use comma to splice in the child nodes.
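For example, running the partial evaluator on the earlier program
reproduces the residual program shown at the start of this section.
\begin{lstlisting}
(pe-arith `(+ (read) (- (+ 5 3))))   ; => '(+ (read) -8)
\end{lstlisting}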
To gain some confidence that the partial evaluator is correct, we can
test whether it produces programs that get the same result as the
input program. That is, we can test whether it satisfies Diagram
\eqref{eq:compile-correct}. The following code runs the partial
evaluator on several examples and tests the output program. The
\texttt{assert} function is defined in Appendix~\ref{appendix:utilities}.
\begin{lstlisting}
(define (test-pe p)
(assert "testing pe-arith"
(equal? (interp-R0 p) (interp-R0 (pe-arith p)))))
(test-pe `(+ (read) (- (+ 5 3))))
(test-pe `(+ 1 (+ (read) 1)))
(test-pe `(- (+ (read) (- 5))))
\end{lstlisting}
\rn{Do we like the explicit whitespace? I've never been fond of it, in part
because it breaks copy/pasting. But, then again, so do most of the quotes.}
\begin{exercise}
\normalfont % I don't like the italics for exercises. -Jeremy
We challenge the reader to improve on the simple partial evaluator in
Figure~\ref{fig:pe-arith} by replacing the \texttt{pe-neg} and
\texttt{pe-add} helper functions with functions that know more about
arithmetic. For example, your partial evaluator should translate
\begin{lstlisting}
(+ 1 (+ (read) 1))
\end{lstlisting}
into
\begin{lstlisting}
(+ 2 (read))
\end{lstlisting}
To accomplish this, we recommend that your partial evaluator produce
output that takes the form of the $\itm{residual}$ non-terminal in the
following grammar.
\[
\begin{array}{lcl}
\Exp &::=& (\key{read}) \mid (\key{-} \;(\key{read})) \mid (\key{+} \; \Exp \; \Exp)\\
\itm{residual} &::=& \Int \mid (\key{+}\; \Int\; \Exp) \mid \Exp
\end{array}
\]
\end{exercise}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Compiling Integers and Variables}
\label{ch:int-exp}
This chapter concerns the challenge of compiling a subset of Racket,
which we name $R_1$, to x86-64 assembly code~\citep{Intel:2015aa}.
(Henceforth we shall refer to x86-64 simply as x86). The chapter
begins with a description of the $R_1$ language (Section~\ref{sec:s0})
and then a description of x86 (Section~\ref{sec:x86}). The
x86 assembly language is quite large, so we only discuss what is
needed for compiling $R_1$. We introduce more of x86 in later
chapters. Once we have introduced $R_1$ and x86, we reflect on
their differences and come up with a plan for breaking down the
translation from $R_1$ to x86 into a handful of steps
(Section~\ref{sec:plan-s0-x86}). The rest of the sections in this
chapter give detailed hints regarding each step
(Sections~\ref{sec:uniquify-s0} through \ref{sec:patch-s0}). We hope
to give enough hints that the well-prepared reader can implement a
compiler from $R_1$ to x86 while at the same time leaving room for
some fun and creativity.
\section{The $R_1$ Language}
\label{sec:s0}
The $R_1$ language extends the $R_0$ language
(Figure~\ref{fig:r0-syntax}) with variable definitions. The syntax of
the $R_1$ language is defined by the grammar in
Figure~\ref{fig:r1-syntax}. The non-terminal \Var{} may be any Racket
identifier. As in $R_0$, \key{read} is a nullary operator, \key{-} is
a unary operator, and \key{+} is a binary operator. In addition to
variable definitions, the $R_1$ language includes the \key{program}
form to mark the top of the program, which is helpful in some of the
compiler passes. The $R_1$ language is rich enough to exhibit several
compilation techniques but simple enough so that the reader can
implement a compiler for it in a week of part-time work. To give the
reader a feeling for the scale of this first compiler, the instructor
solution for the $R_1$ compiler consists of 6 recursive functions and
a few small helper functions that together span 256 lines of code.
\begin{figure}[btp]
\centering
\fbox{
\begin{minipage}{0.96\textwidth}
\[
\begin{array}{rcl}
\Exp &::=& \Int \mid (\key{read}) \mid (\key{-}\;\Exp) \mid (\key{+} \; \Exp\;\Exp) \\
&\mid& \Var \mid \LET{\Var}{\Exp}{\Exp} \\
R_1 &::=& (\key{program} \; \Exp)
\end{array}
\]
\end{minipage}
}
\caption{The syntax of $R_1$, a language of integers and variables.}
\label{fig:r1-syntax}
\end{figure}
The \key{let} construct defines a variable for use within its body
and initializes the variable with the value of an expression. So the
following program initializes \code{x} to \code{32} and then evaluates
the body \code{(+ 10 x)}, producing \code{42}.
\begin{lstlisting}
(program
(let ([x (+ 12 20)]) (+ 10 x)))
\end{lstlisting}
When there are multiple \key{let}'s for the same variable, the closest
enclosing \key{let} is used. That is, variable definitions shadow
prior definitions. Consider the following program with two \key{let}'s
that define variables named \code{x}. Can you figure out the result?
\begin{lstlisting}
(program
(let ([x 32]) (+ (let ([x 10]) x) x)))
\end{lstlisting}
For the purposes of showing which variable uses correspond to which
definitions, the following shows the \code{x}'s annotated with subscripts
to distinguish them. Double check that your answer for the above is
the same as your answer for this annotated version of the program.
\begin{lstlisting}
(program
(let ([x|$_1$| 32]) (+ (let ([x|$_2$| 10]) x|$_2$|) x|$_1$|)))
\end{lstlisting}
The initializing expression is always evaluated before the body of the
\key{let}, so in the following, the \key{read} for \code{x} is
performed before the \key{read} for \code{y}. Given the input
\code{52} then \code{10}, the following produces \code{42} (and not
\code{-42}).
\begin{lstlisting}
(program
(let ([x (read)]) (let ([y (read)]) (- x y))))
\end{lstlisting}
Figure~\ref{fig:interp-R1} shows the interpreter for the $R_1$
language. It extends the interpreter for $R_0$ with two new
\key{match} clauses for variables and for \key{let}. For \key{let},
we will need a way to communicate the initializing value of a variable
to all the uses of a variable. To accomplish this, we maintain a
mapping from variables to values, which is traditionally called an
\emph{environment}. For simplicity, here we use an association list to
represent the environment. The \code{interp-R1} function takes the
current environment, \code{env}, as an extra parameter. When the
interpreter encounters a variable, it finds the corresponding value
using the \code{lookup} function (Appendix~\ref{appendix:utilities}).
When the interpreter encounters a \key{let}, it evaluates the
initializing expression, extends the environment with the result bound
to the variable, then evaluates the body of the \key{let}.
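The following is only a sketch of how such a lookup over an association
list might be written; see \code{utilities.rkt} for the version used by
the support code.
\begin{lstlisting}
;; Sketch of an association-list lookup (the provided lookup may differ).
(define (lookup x env)
  (cond [(assq x env) => cdr]
        [else (error 'lookup "variable not bound:" x)]))

(lookup 'y '((x . 32) (y . 10)))   ; => 10
\end{lstlisting}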
\begin{figure}[tbp]
\begin{lstlisting}
(define (interp-R1 env)
(lambda (e)
(define recur (interp-R1 env))
(match e
[(? symbol?) (lookup e env)]
[`(let ([,x ,(app recur v)]) ,body)
(define new-env (cons (cons x v) env))
((interp-R1 new-env) body)]
[(? fixnum?) e]
[`(read)
(define r (read))
(cond [(fixnum? r) r]
[else (error 'interp-R1 "expected an integer" r)])]
[`(- ,(app recur v))
(fx- 0 v)]
[`(+ ,(app recur v1) ,(app recur v2))
(fx+ v1 v2)]
[`(program ,e) ((interp-R1 '()) e)]
)))
\end{lstlisting}
\caption{Interpreter for the $R_1$ language.}