Optimization.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
	<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=windows-1252">
	<TITLE>SoftWire - Linear Optimization</TITLE>
	<META NAME="GENERATOR" CONTENT="OpenOffice.org 2.0  (Win32)">
	<META NAME="CREATED" CONTENT="20060513;15350571">
	<META NAME="CHANGED" CONTENT="20060513;15351478">
</HEAD>
<BODY LANG="en-US" DIR="LTR">
<P ALIGN=JUSTIFY><STRONG><U>Introduction</U></STRONG></P>
<P ALIGN=JUSTIFY>For script compilers and Just-In-Time-compilers
(JIT), compilation time is critical, but we also want code that is
not entirely unoptimized. When code optimization is performed, most
algorithms however take a significant amount of processing time and
extra memory. In this article I will present algorithms that require
no new passes and take virtually no extra memory. The focus is on the
implementation of the back-end of the compiler. It is intended to be
of practical nature, targeted at developers who wish to apply the
presented ideas quickly in their own projects. Basic knowledge of
assembly language and compiler technology is assumed.</P>
<P ALIGN=JUSTIFY>The compiler back-end I'll use is SoftWire [1]. It
was developed as a run-time x86 code generator to construct
processing pipelines for 3D applications without specialized hardware
[2]. It is also used to compile shaders, short stream processing
programs for programmable lighting effects. Although most of the
presented algorithm were originally created for the purpose of
efficient 3D rendering, here we'll focus on SoftWire's use&nbsp;in a
generic x86&nbsp;JIT-compiler. It should be noted that&nbsp;they are
even more widely applicable than x86 code, and can be an inspiration
for anyone interested in optimization techniques.</P>
<P ALIGN=JUSTIFY>SoftWire's low-level interface consists of functions
with the same name as assembly mnemonics, which put the corresponding
code in a buffer. I call these functions run-time intrinsics. For
example, <FONT FACE="Courier New">add(eax, ebx);</FONT> will generate
the binary code for <FONT FACE="Courier New">add eax, ebx</FONT> and
place it in an internal temporary buffer. This way we can construct
functions one instruction at a time. When finished, we can ask
SoftWire for a callable function. For this it places the instructions
in a final buffer and resolves label addresses. It returns a pointer
to this newly created function, which can directly be called. All
this is exposed via the <FONT FACE="Courier New">Assembler</FONT>
class. It is highly recommended to read the SoftWire introductory
tutorial [3] first to understand this article better.</P>
<P ALIGN=JUSTIFY>This way of creating functions at run-time is very
powerful, but for a compiler back-end we need an abstraction of the
underlying architecture. We have to be able to compile intermediate
code in a convenient, flexible way, while still keeping high
performance.</P>
<P ALIGN=JUSTIFY><STRONG><U>Linear Register Allocation</U></STRONG></P>
<P ALIGN=JUSTIFY>The first step to achieve flexibility is register
allocation. It assigns variables to registers, so we can keep the
most actively used data right in the processor core instead of in
memory. This allows us to abstract the register set of the x86
processor, so we can work with generic variables everywhere. In this
article I will only discuss 32-bit integer variables as to not make
things more complicated. Below is an illustration of what register
allocation tries to achieve. Intermediate instructions use only
variables, while the assembly code has to use registers:</P>
<DL>
	<DD STYLE="text-align: justify"><FONT FACE="Courier New">b = a;</FONT></DD><DD STYLE="margin-bottom: 0.2in; text-align: justify">
	<FONT FACE="Courier New">c += b;</FONT></DD></DL>
<P ALIGN=JUSTIFY>
Will translate into something like:</P>
<DL>
	<DD STYLE="text-align: justify"><FONT FACE="Courier New">mov eax,
	[a]&nbsp;&nbsp; <FONT COLOR="#008000">// Allocate a to eax</FONT></FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">mov ebx, [b]&nbsp;&nbsp; <FONT COLOR="#008000">//
	Allocate b to ebx</FONT></FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">mov ebx, eax</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">mov ecx, [c]&nbsp;&nbsp; <FONT COLOR="#008000">//
	Allocate c to ecx</FONT></FONT></DD><DD STYLE="margin-bottom: 0.2in; text-align: justify">
	<FONT FACE="Courier New">add ecx ebx</FONT></DD></DL>
<P ALIGN=JUSTIFY>
Here we can clearly see the primary task of the register allocator.
It has to assign registers to variables, load the variables from
memory into the registers, and keep using the registers in further
instructions. Another task is to write registers back to memory. This
is required when we attempt to allocate a new variable, but there are
no more free registers. Just like pouring more water into a full
glass, this is called spilling. An allocated variable gets written
back to memory and a new variable takes its place in the same
register. So the problem to solve is when to allocate, what register
to use, and when to spill.</P>
<P ALIGN=JUSTIFY>We have one big restriction to perform good register
allocation. SoftWire has no look-ahead. Since binary code is created
when run-time intrinsics are called, there is no way to look at what
instructions or variables will be used next. It does not have access
to the list of intermediate code instructions (SoftWire does not
perform instruction selection). Hence it is not possible to directly
determine when a variable will not be used any more. So we can't know
in advance when a register is free, to avoid spilling. The limitation
of having no look-ahead is the main issue this article addresses.
Eventually it is a big advantage for reduced compile time and memory
requirement.</P>
<P ALIGN=JUSTIFY>SoftWire's register allocator is inspired by the
linear-scan register allocation algorithm [4]. It works quite
straight-forward and just as illustrated above: it keeps assigning
any newly used variable to a new register, until we're out of
registers. When this happens, we spill the register, which then
becomes available for the new variable. One crucial step is to
determine which register to spill, as we can choose practically any
of the registers. SoftWire uses the least recently used. It's most
likely that the variable that is allocated to this register is
actually not used again. While this doesn't work entirely optimally,
it does yield very good results, and I'll show you in the rest of the
article how the allocation and spilling process can be improved.</P>
<P ALIGN=JUSTIFY>The actual implementation is simple. SoftWire keeps
a small list of allocation states, one per register. Each of these
contains a reference to a variable. A reference can be anything to
uniquely identify a variable's memory location. In practice, this
is&nbsp;often a simple&nbsp;pointer. But it can also be a stack
location, which I'll explain in the next section. Allocating a
register is done by filling in the variable's reference in the
corresponding position in the allocation list. Then a mov run-time
intrinsic is called, to generate the code that loads the variable
from memory into the register. Spilling is just as simple, we call
the run-time intrinsic that writes the register to its allocated
variable's reference. Freeing can be done by nullifying the
reference. This is also how free registers are identified internally.</P>
<P ALIGN=JUSTIFY>SoftWire conveniently wraps all this functionality
into one function, <FONT FACE="Courier New">r32()</FONT>. It is
defined in the <FONT FACE="Courier New">RegisterAllocator</FONT>
class, derived from <FONT FACE="Courier New">Assembler</FONT>.
Similar functions are available for other sizes and types than 32-bit
integers. As argument, it takes the reference to a variable, the
output is a register. So we can use it to write code this way:</P>
<DL>
	<DD STYLE="text-align: justify"><FONT FACE="Courier New">static int
	a;</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">static int b;</FONT></DD><DD STYLE="text-align: justify">
	&nbsp;</DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">mov(r32(&amp;b), r32(&amp;a));</FONT></DD><DD STYLE="margin-bottom: 0.2in; text-align: justify">
	<FONT FACE="Courier New">add(r32(&amp;c), r32(&amp;b));</FONT></DD></DL>
<P ALIGN=JUSTIFY>
The right <FONT FACE="Courier New">r32()</FONT> gets called first,
then the left one, and then the run-time intrinsic. When executed, it
will produce the assembly code given above. So this provides the
first abstraction we required to generate assembly code from
intermediate instructions. Because we also would like to make it
somewhat processor independent, and because there's only a limited
set of intermediate code instructions, we can abstract it further
like this:</P>
<DL>
	<DD STYLE="text-align: justify"><FONT FACE="Courier New"><FONT COLOR="#0000ff">void</FONT>
	emitAssign(<FONT COLOR="#0000ff">const int</FONT> &amp;a, <FONT COLOR="#0000ff">const
	int</FONT> &amp;b)</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">{</FONT></DD><DL>
		<DD STYLE="text-align: justify">
		<FONT FACE="Courier New">mov(r32(&amp;a), r32(&amp;b));</FONT></DD></DL>
	<DD STYLE="text-align: justify">
	<FONT FACE="Courier New">}</FONT></DD><DD STYLE="text-align: justify">
	&nbsp;</DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New"><FONT COLOR="#0000ff">void</FONT>
	emitUnaryAddition(<FONT COLOR="#0000ff">const int</FONT> &amp;a,
	<FONT COLOR="#0000ff">const int</FONT> &amp;b)</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">{</FONT></DD><DL>
		<DD STYLE="text-align: justify">
		<FONT FACE="Courier New">add(r32(&amp;a), r32(&amp;b));</FONT></DD></DL>
	<DD STYLE="margin-bottom: 0.2in; text-align: justify">
	<FONT FACE="Courier New">}</FONT></DD></DL>
<P ALIGN=JUSTIFY>
There's still one thing that highly influences register allocation:
jumps. In C++, every if, else, for, etc. creates jumps. Now imagine
that inside an if-block we access a variable for the first time. So
we would call <FONT FACE="Courier New">r32()</FONT>, and this
allocates a new register. Using SoftWire, the code is generated
linearly. It is not aware of the if-block. We just keep calling
run-time intrinsics and it adds the corresponding binary code to the
buffer. The jump instruction is treated no different. However, when
we execute the code, we might actually jump over the if-block. If a
new variable is allocated there, then we skipped that and there's a
big chance that the register doesn't contain the value we expected
after the jump. At function calls we also expect (some) register
content to be preserved.</P>
<P ALIGN=JUSTIFY>The brute-force solution is to write back all
registers to memory before the jump, and before the target address of
the jump. This way, all registers are free and they are re-allocated
both inside the block and after it. We've seen this operation before.
It's exactly the same as register spilling, except that we don't
allocate a new variable in their place. We can call&nbsp;SoftWire's&nbsp;
<FONT FACE="Courier New">spill()</FONT> function for every register
to do this. There's also a <FONT FACE="Courier New">spillAll()</FONT>
function to make this more convenient.</P>
<P ALIGN=JUSTIFY>So that's it! With these tools we can create a
compiler back-end in a very convenient way. Make sure you've read the
older SoftWire tutorial [3] to see a few more examples on how to get
started. There's still a lot to be done though. The generated code is
far from optimal and I promised to present some new algorithms to
optimize the code and keep the linear nature of SoftWire.</P>
<P ALIGN=JUSTIFY><STRONG><U>Local Variables</U></STRONG></P>
<P ALIGN=JUSTIFY>Note how I used static data in the previous
examples. However, local variables are usually stored on the stack,
so that memory space can be reused. To enable this, SoftWire uses a
more generic reference; OperandREF. This is an internally used
structure that can hold the pointer to an integer, but also the x86
specific base register, index register and offset. This needs a bit
of illustration. Memory references&nbsp;can be written in this
general format:</P>
<DL>
	<DD STYLE="margin-bottom: 0.2in; text-align: justify"><FONT FACE="Courier New">dword
	ptr [baseRegister + {1, 2, 4, 8} * indexRegister + offset]</FONT></DD></DL>
<P ALIGN=JUSTIFY>
Although this is x86 specific, most other architectures have memory
addressing modes with a base register and offset, to access the
stack. The stack pointer is stored in the esp register, and ebp is
commonly&nbsp;used to construct a local frame. So, in SoftWire, esp
and ebp are excluded from the register allocation process, and can be
used freely by higher layers, such as the <FONT FACE="Courier New">CodeGenerator</FONT>
class. It is inherited from the <FONT FACE="Courier New">RegisterAllocator</FONT>
class, so it also exposes the run-time intrinsics. The added
functionality is that it provides the mechanism for creating and
destroying local variables on the stack. It has child-classes that
automatically behave like local variables. This is possible thanks to
C++ constructors, destructors, and cast operators. One of these
classes is Int, with capital. When constructed, it looks for a
location on the stack to be stored. This is a memory reference in the
form of <FONT FACE="Courier New">[ebp + offset]</FONT>. When used in
a run-time intrinsic, it will automatically cast itself to a
register. To do this it simply calls the <FONT FACE="Courier New">r32()</FONT>
function and uses its stack location as argument. This allows for
example to implement binary intermediate instructions very
conveniently:</P>
<DL>
	<DD STYLE="text-align: justify"><FONT FACE="Courier New">Int
	emitBinaryAddition(<FONT COLOR="#0000ff">const</FONT> Int &amp;a,
	<FONT COLOR="#0000ff">const</FONT> Int &amp;b)</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">{</FONT></DD><DL>
		<DD STYLE="text-align: justify">
		<FONT FACE="Courier New">Int t;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <FONT COLOR="#008000">//
		Temporary, so we don't modify a</FONT></FONT></DD><DD STYLE="text-align: justify">
		<FONT FACE="Courier New">mov(t, a);</FONT></DD><DD STYLE="text-align: justify">
		<FONT FACE="Courier New">add(t, b);</FONT></DD><DD STYLE="text-align: justify">
		<FONT FACE="Courier New"><FONT COLOR="#0000ff">return</FONT> t;</FONT></DD></DL>
	<DD STYLE="margin-bottom: 0.2in; text-align: justify">
	<FONT FACE="Courier New">}</FONT></DD></DL>
<P ALIGN=JUSTIFY>
Two more C++ operators had to be implemented to make this work. First
of a copy constructor was needed. The implementation is very simple,
using a mov run-time intrinsic just like <FONT FACE="Courier New">emitAssign()</FONT>
. The other operator is the destructor. Here, the local variable can
be removed from the stack, and the register freed. Removing it from
the stack is&nbsp;done by comparing all the stack pointers that have
been used as reference, and using the offset of the first unused
stack location as the new stack top. Actually freeing the register
can be done by nullifying its reference. This is implemented in the
<FONT FACE="Courier New">free()</FONT> function.</P>
<P ALIGN=JUSTIFY>Setting up the stack is done by the
<FONT FACE="Courier New">CodeGenerator::prologue()</FONT> <FONT FACE="Arial">function.
As argument it takes the number of arguments that are used by the
generated function.&nbsp;These arugments can then be accessed with
the </FONT><FONT FACE="Courier New">argument()</FONT> <FONT FACE="Arial">function,
which takes an integer index and returns the memory location of the
argument. SoftWire will automatically keep track of how many
variables require stack space.&nbsp;Ending the function can be done
conveniently with the <FONT FACE="Courier New">epilogue()</FONT>
function.&nbsp;Summing this up, the general framework&nbsp;for
generating a function looks like this:</FONT></P>
<DL>
	<DD STYLE="text-align: justify"><FONT FACE="Courier New"><FONT COLOR="#0000ff">void</FONT>(*)(<FONT COLOR="#0000ff">int</FONT>,<FONT COLOR="#0000ff">int</FONT>)
	emitFunctionX()&nbsp;&nbsp; <FONT COLOR="#008000">// Returns a
	function pointer</FONT></FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">{</FONT></DD><DL>
		<DD STYLE="text-align: justify">
		<FONT FACE="Courier New">prologue(2);&nbsp;&nbsp; <FONT COLOR="#008000">//
		Generated function takes two arguments</FONT></FONT></DD><DD STYLE="text-align: justify">
		<BR><FONT FACE="Courier New">Int a;</FONT></DD><DD STYLE="text-align: justify">
		<FONT FACE="Courier New">Int b;</FONT></DD><DD STYLE="text-align: justify">
		<BR><FONT FACE="Courier New">mov(a, argument(0));&nbsp;&nbsp; <FONT COLOR="#008000">//
		They are memory operands</FONT></FONT></DD><DD STYLE="text-align: justify">
		<FONT FACE="Courier New">mov(b, argument(1));<BR><BR><FONT COLOR="#008000">/*
		Call other funcions to generate the desired&nbsp;function
		*/</FONT><BR><BR>epilogue();</FONT></DD><DD STYLE="text-align: justify">
		<FONT FACE="Courier New"><BR><FONT COLOR="#0000ff">return</FONT>&nbsp;(<FONT COLOR="#0000ff">void</FONT>
		(*)(<FONT COLOR="#0000ff">int</FONT> , <FONT COLOR="#0000ff">int</FONT>))callable();</FONT></DD></DL>
	<DD STYLE="margin-bottom: 0.2in; text-align: justify">
	<FONT FACE="Courier New">}</FONT></DD></DL>
<P ALIGN=JUSTIFY>
You might have noticed that using these classes for local variables
can be made even more powerful, by implementing more C++ operators.
We can have an operator+, which gets the implementation of
<FONT FACE="Courier New">emitBinaryAddition()</FONT>. This way we can
generate code for all arithmetic intermediate code instructions, by
simply performing the operation on the variables. Also, generating
code for whole C++ expressions is no problem:</P>
<DL>
	<DD STYLE="text-align: justify"><FONT FACE="Courier New"><FONT COLOR="#0000ff">struct</FONT>
	Vector</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">{</FONT></DD><DL>
		<DD STYLE="text-align: justify">
		<FONT FACE="Courier New">Int x;</FONT></DD><DD STYLE="text-align: justify">
		<FONT FACE="Courier New">Int y;</FONT></DD><DD STYLE="text-align: justify">
		<FONT FACE="Courier New">Int z;</FONT></DD></DL>
	<DD STYLE="text-align: justify">
	<FONT FACE="Courier New">};</FONT></DD><DD STYLE="text-align: justify">
	&nbsp;</DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">Int emitDotProduct(<FONT COLOR="#0000ff">const</FONT>
	Vector &amp;v1, <FONT COLOR="#0000ff">const</FONT> Vector &amp;v2)</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">{</FONT></DD><DL>
		<DD STYLE="text-align: justify">
		<FONT FACE="Courier New"><FONT COLOR="#0000ff">return</FONT> v1.x *
		v2.x + v1.y * v2.y + v1.z * v2.z;</FONT></DD></DL>
	<DD STYLE="margin-bottom: 0.2in; text-align: justify">
	<FONT FACE="Courier New">}</FONT></DD></DL>
<P ALIGN=JUSTIFY>
Don't confuse this to be a function which computes a dot product, all
it does is generate the code to compute a dot product, and keep that
internally in SoftWire. Only after you retrieve a pointer to the code
it can be used as a function to compute a dot product. 
</P>
<P ALIGN=JUSTIFY><STRONG><U>Intermediate Code Free</U></STRONG> 
</P>
<P ALIGN=JUSTIFY>As mentioned before, SoftWire's biggest restriction
is that it has no look-ahead, so it can't detect that variables will
no longer be used in further code. Freeing registers only happens
when they are spilled and reallocated, and when the <FONT FACE="Courier New">free()</FONT>
function is explicitly called for local variables. However, this
usually happens later than the actual point at which the variable is
no longer used. This can be too late to avoid unnecessary spilling.</P>
<P ALIGN=JUSTIFY>The compiler front-end doesn't have that
restriction. It can do complex analysis on the code, and determine
the last use of variables. Hence, it can insert <FONT FACE="Courier New">free()</FONT>
commands into the intermediate code. For JIT-compilers, compiling the
source code to intermediate byte code is not as time-critical as code
generation and optimization. Note that byte code doesn't use generic
variables, but virtual registers. Either way, the same register
allocation algorithms apply since the x86 architecture has only six
general purpose registers, while there can be many more virtual
registers in the byte code.</P>
<P ALIGN=JUSTIFY>So this simple technique gives SoftWire the
look-ahead capabilities from a higher level. It shows that its
restriction isn't really a restriction, and the end result will be a
faster code generator that uses no extra memory. The next three
sections will focus on optimizations that can also be done close to
optimal without any need for look-ahead, but also without help of the
compiler front-end.</P>
<P ALIGN=JUSTIFY><STRONG><U>Load Elimination</U></STRONG></P>
<P ALIGN=JUSTIFY>Let's go back to generating binary code for
intermediate instructions. Take the most trivial example, an
assignment of a variable <FONT FACE="Courier New">a</FONT> to a
variable <FONT FACE="Courier New">b</FONT>:</P>
<DL>
	<DD STYLE="margin-bottom: 0.2in; text-align: justify"><FONT FACE="Courier New">b
	= a;</FONT></DD></DL>
<P ALIGN=JUSTIFY>
Recall that with run-time intrinsics and the allocation functions,
this can be implemented as:</P>
<DL>
	<DD STYLE="text-align: justify"><FONT FACE="Courier New">Int
	&amp;CodeGenerator::Int::<FONT COLOR="#0000ff">operator</FONT>=(<FONT COLOR="#0000ff">const</FONT>
	Int &amp;x)</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">{</FONT></DD><DL>
		<DD STYLE="text-align: justify">
		<FONT FACE="Courier New">mov(*this, x);</FONT></DD><DD STYLE="text-align: justify">
		<FONT FACE="Courier New"><FONT COLOR="#0000ff">return</FONT> *<FONT COLOR="#0000ff">this</FONT>;</FONT></DD></DL>
	<DD STYLE="margin-bottom: 0.2in; text-align: justify">
	<FONT FACE="Courier New">}</FONT></DD></DL>
<P ALIGN=JUSTIFY>
The assembly code that gets generated is:</P>
<DL>
	<DD STYLE="text-align: justify"><FONT FACE="Courier New">mov eax,
	[a]&nbsp;&nbsp; <FONT COLOR="#008000">// Allocate a to eax</FONT></FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">mov ebx, [b]&nbsp;&nbsp; <FONT COLOR="#008000">//
	Allocate b to ebx</FONT></FONT></DD><DD STYLE="margin-bottom: 0.2in; text-align: justify">
	<FONT FACE="Courier New">mov ebx, eax&nbsp;&nbsp; <FONT COLOR="#008000">//
	Assign</FONT></FONT></DD></DL>
<P ALIGN=JUSTIFY>
Obviously, the second instruction is redundant. The value of ebx gets
overwritten, so we didn't have to load it with <FONT FACE="Courier New">b</FONT>.
Because again, SoftWire has no look-ahead, this could not be
prevented. There is no convenient way to know in advance that ebx
will get overwritten. But the story doesn't end here. While SoftWire
doesn't have look-ahead, looking 'back' is less of a problem. So the
idea is that when a register gets overwritten, we erase the load
instruction. This is easy to implement, just keep a pointer to the
load instruction for every register. This gets stored together with
the variable reference in the register's allocation state. So when
the load can be eliminated,&nbsp;SoftWire uses this pointer to notify
that this instruction can be erased from the internal buffer.</P>
<P ALIGN=JUSTIFY>But wait, we can't just eliminate all load
instructions. Assume that the next intermediate operation assigns <FONT FACE="Courier New">b</FONT>
to <FONT FACE="Courier New">a</FONT>. Then the load instruction of a
would be erased, while it's actually required. To prevent this, every
time a register is read, we set the pointer to the load instruction
of that register to null. Note that for arithmetic x86 instructions,
the first and second operand is read, while for mov only the source
operand is read. So we make an exception for mov only, by overloading
this run-time intrinsic function in the RegisterAllocator class. The
end result is that every time a register gets allocated and
overwritten <EM>before</EM> it is used in another instruction, the
load instruction gets eliminated. The same elimination is possible
when the register is freed before used.</P>
<P ALIGN=JUSTIFY><STRONG><U>Copy Propagation</U></STRONG></P>
<P ALIGN=JUSTIFY>The approach of looking back is really powerful. I
will show you that it does not only allow to identify and eliminate
mov instructions that load a value from memory into a register, but
also mov instructions that copy from one register to another, and mov
instructions that stores a register back in memory. Eliminating copy
instructions is called copy propagation, while eliminating store
instructions is called spill elimination. The latter will be
explained in the next section, and I'll also make it clear why it
isn't called store elimination.</P>
<P ALIGN=JUSTIFY>Copy propagation is a classical optimization in
compiler technology. When a register is copied to another register,
and the first register is never used again, then this copy was
redundant. In this case we could have done all the calculations with
the first register. This isn't trivial to implement in SoftWire,
because we can't/won't keep a pointer to every instruction that uses
a certain register, nor is it even feasible to adjust all the
registers in the binary format. The only look-back we have is to
eliminate instructions.</P>
<P ALIGN=JUSTIFY>The solution is quite simple in nature. Similar to
load elimination, we keep a pointer to the copy instruction per
register. Then, we 'blindly' assume that the first variable won't be
used again, so we keep working with the first register. This can be
achieved by swapping the references of both registers. So when the
second variable is used, <FONT FACE="Courier New">r32()</FONT> will
return the first register. When the first variable is freed before
being used again, we know for certain that the copy instruction can
be eliminated. We then set the pointer back to null. Here's an
illustration of how it works:</P>
<DL>
	<DD STYLE="text-align: justify"><FONT FACE="Courier New">b = a;</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">c += b;</FONT></DD><DD STYLE="margin-bottom: 0.2in; text-align: justify">
	<FONT FACE="Courier New">free(a);</FONT></DD></DL>
<P ALIGN=JUSTIFY>
Would get translated into:</P>
<DL>
	<DD STYLE="text-align: justify"><FONT FACE="Courier New">mov(eax,
	a);&nbsp;&nbsp;&nbsp;&nbsp; <FONT COLOR="#008000">// Allocate a to
	eax</FONT></FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">mov(ebx, b);&nbsp;&nbsp;&nbsp;&nbsp; <FONT COLOR="#008000">//
	Allocate b to ebx, load eliminated!</FONT></FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">mov(ebx, eax);&nbsp;&nbsp; <FONT COLOR="#008000">//
	Copy propagation, assign b to eax, a to ebx</FONT></FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">mov(ecx, c);&nbsp;&nbsp;&nbsp;&nbsp; <FONT COLOR="#008000">//
	Allocate c to ecx</FONT></FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">add(ecx, eax);</FONT></DD><DD STYLE="margin-bottom: 0.2in; text-align: justify">
	<FONT FACE="Courier New">free(ebx);&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
	<FONT COLOR="#008000">// Eliminate ebx's copy instruction</FONT></FONT></DD></DL>
<P ALIGN=JUSTIFY>
Note that if <FONT FACE="Courier New">a</FONT> was still used, then
ebx would be returned, and its copy instruction pointer nullified, to
prevent elimination. What would have effectively happened is that eax
and ebx switched place. The copy instruction makes them perfectly
equivalent, so at that point it doesn't really matter which register
is associated to which variable. This nice property enabled copy
propagation with simple look-back. Other compiler back-ends use
complicated graphs to perform this optimization!</P>
<P ALIGN=JUSTIFY><STRONG><U>Spill Elimination</U></STRONG></P>
<P ALIGN=JUSTIFY>Copy propagation works like a charm, but there's
still a significant problem. In the above example, ebx was never
used, but <FONT FACE="Courier New">a</FONT> was associated to it
anyway. Imagine that we didn't have any free registers before
generating code for these intermediate instructions. Then we would
have spilled eax, ebx and ecx. We can't avoid that eax and ecx have
to be spilled, but spilling ebx was redundant, since it never got
used. The cost of this isn't only the spill instruction, but also the
load instruction that follows, to get useful data into ebx again.
Ideally, we want to avoid the spill, and the load.</P>
<P ALIGN=JUSTIFY>This can be achieved by keeping the previous
reference of a variable for a register that has been spilled, and a
pointer to the spill instruction. The idea is that when we detect a
redundant spill, we eliminate the instruction, and restore the
previous reference, as if the register was never associated to
another variable.</P>
<P ALIGN=JUSTIFY>The implementation is trickier than that. Recall
that for branches we have to spill the registers ourselves, to store
their value in memory. It is not uncommon, that after this some
registers are freed. After all, a spill makes the register available
for allocation again, so it shouldn't make any difference when it is
also freed afterwards. Unfortunately, with the current approach for
spill elimination it does. At the free, the register is considered to
no longer be used. So the spill is eliminated. But when this register
was copy propagated, this means that also the previously allocated
variable, which is the actual contained value, would not be written
back to memory. For the branch, this is absolutely required.</P>
<P ALIGN=JUSTIFY>The actual problem is that there are two types of
spilling. One, which really deserves the name 'spill', is to write
registers back to memory, to allocate another variable in its place.
The other type of spill is to store the value in memory, but leave
the register free. It can take several instructions before this
register is then back allocated. It is this latter type of spill that
we use at branches, so it should be available as a public function.
Spill elimination only targets the former type of spill. And it
should happen only when we try to allocate a variable and find no
more free registers. This variable can then be copy propagated, and
when the associated register is unused, the spill can be eliminated
and the previous variable re-allocated at the same variable.</P>
<P ALIGN=JUSTIFY>To preserve the previous interface and semantic of
the <FONT FACE="Courier New">spill()</FONT> function, the internal
behavior has to be altered. Keeping the pointer to the spill
instruction should only be done at the <FONT FACE="Courier New">r32()</FONT>
function. At the <FONT FACE="Courier New">spill()</FONT> function, we
always write the register back to memory, no exceptions.</P>
<P ALIGN=JUSTIFY>We only restore the previous allocation when the
register is free and the reference of the previous allocation, which
I call the spill allocation, matches the reference passed to <FONT FACE="Courier New">r32()</FONT>.
When a register is used in another instruction, we erase the spill
allocation information because the current allocation proves to be
useful. Copy instructions skip this rule because we apply copy
propagation and the register might not be used effectively after all,
as described in the begin of this section.</P>
<P ALIGN=JUSTIFY><STRONG><U>Unmodified Registers</U></STRONG></P>
<P ALIGN=JUSTIFY>The previous section dealt with the type of spilling
that happens when we're out of registers and need to allocate a new
variable. The other type of spill can be optimized as well. When a
register is only read, never written to, then we never have to write
it back to memory. This is one of the simplest optimizations, but
very effective. For every register, we just have to keep a flag to
indicate whether it has been modified. For x86 instructions, all
destination operands can be considered modified, while source
operands are only read. Note that copy propagation makes the
implementation a little more complicated.</P>
<P ALIGN=JUSTIFY><STRONG><U>Minimal Register Restore</U></STRONG></P>
<P ALIGN=JUSTIFY>Spilling all registers at branches isn't efficient.
Most code consists of jumps every dozen instructions or so, which
would result in code where hardly any variables are kept in
registers. A more intelligent approach is to spill and reallocate
only registers to restore the previous allocation state. For the
example of a simple if-statement, it's even likely that all registers
can keep their current allocation state. Situations are very similar
for forward and backward jumps. In the former case, we wish to
capture the register allocation state before the jump, and restore it
to that state before the target label. For the latter case we'll
capture the state before the label, and restore it before the jump.
SoftWire implements <FONT FACE="Courier New">capture()</FONT> and
<FONT FACE="Courier New">restore()</FONT> functions with exactly this
functionality. <FONT FACE="Courier New">CodeGenerator::capture()</FONT>
<FONT FACE="Arial">returns a structure of the <FONT FACE="Courier New">CodeGenerator::State</FONT>
type, and <FONT FACE="Courier New">CodeGenerator::restore()</FONT>&nbsp;takes
such state as argument. They are used in practice like this:</FONT></P>
<DL>
	<DD STYLE="text-align: justify"><FONT FACE="Courier New"><FONT COLOR="#0000ff">if</FONT>(x
	== true)</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">{</FONT></DD><DL>
		<DD STYLE="text-align: justify">
		<FONT FACE="Courier New">a++;</FONT></DD></DL>
	<DD STYLE="text-align: justify">
	<FONT FACE="Courier New">}</FONT></DD><DD STYLE="text-align: justify">
	&nbsp;</DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">b = a;</FONT></DD><DD STYLE="margin-bottom: 0.2in">
	&nbsp;</DD></DL>
<P ALIGN=JUSTIFY>
Could be&nbsp;translated using:</P>
<DL>
	<DD STYLE="text-align: justify"><FONT FACE="Courier New">cmp(x, 1);</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">State x = capture();</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">jne(&quot;ifBranch&quot;);</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">inc(a);</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">restore(x);</FONT></DD><DD STYLE="text-align: justify">
	<FONT FACE="Courier New">label(&quot;ifBranch&quot;);</FONT></DD><DD STYLE="margin-bottom: 0.2in; text-align: justify">
	<FONT FACE="Courier New">mov(b, a);</FONT></DD></DL>
<P ALIGN=JUSTIFY>
Assuming that <FONT FACE="Courier New">a</FONT> was already
allocated, the allocation state before and after the branch are the
same, so no spills or reallocations would be made at all. Compared to
spilling the whole register set this is a gigantic improvement!</P>
<P ALIGN=JUSTIFY>The implementation is again quite compact but
tricky. <FONT FACE="Courier New">capture()</FONT> takes a copy of the
whole allocation state at the moment of the call. But it also
disables all load elimination, copy propagation and spill
elimination. This is required because all these optimizations assume
we're working in a 'basic block' where code is executed stricktly
linearly. The <FONT FACE="Courier New">restore()</FONT> function
first also disables all optimizations, then it compares the current
allocation state for every register with the one stored in the passed
state. All registers which have different allocation states are
spilled. After all the spills, the allocation states are compared
again and the old variables are allocated to the registers. Note that
spilling and reallocating registers one at a time wouldn't work.</P>
<P ALIGN=JUSTIFY>There are two&nbsp;more optimizations possible here.
Very often, variables are just allocated to other registers inside a
conditional block. So at the <FONT FACE="Courier New">restore()</FONT>
we don't have to spill two registers and load them again. We can just
copy the value&nbsp;from the register that has the allocation state
we're trying to attain. The second optimization is to avoid that
variables get reallocated to different registers at all. At
every&nbsp;spill operation, we can remember which variable was last
allocated at that register.&nbsp;When the variable is reallocated, we
can prefer the same register, if it's available.&nbsp;So not every
free register is equivalent. Nothing is left at chance.</P>
<P ALIGN=JUSTIFY>This optimization makes branches a lot more
attractive. In any case it attempts to keep a maximum number of
variables allocated. Other optimizations are not possible accross
basic blocks, but in many cases formulas are limited to one basic
block so the impact of this is minimal.</P>
<P ALIGN=JUSTIFY><STRONG><U>Memory Operands</U></STRONG></P>
<P ALIGN=JUSTIFY>Until now we've always required that a variable gets
allocated to a register before it is used. While it is safe to assume
that registers are faster than memory accesses, this strategy can
actually work adversely. On an x86 processor we only have six
general-purpose registers, so we have to use them sparingly. In
functions with more than six variables, which is quickly the case, we
will be spilling very frequently to keep them in registers, but this
way we actually miss our goal.</P>
<P ALIGN=JUSTIFY>A more efficient approach is to require only
destination operands to be in registers.&nbsp;An
arithmetic&nbsp;x86&nbsp;instruction with a&nbsp;destination operand
in memory&nbsp;will cause the processor core to read the value of the
destination operand, apply the operation to it, and write it back. So
the destination operand is both read and written while the source
operand is only read. Hence it's best to make the destination operand
a register, while the source operand can remain in memory. This is
also the way it's actually implemented in SoftWire's local variable
classes' operators. If the source operand is not yet allocated in a
register, we read the value from memory, else we use the register.
For this we use SoftWire's <FONT FACE="Courier New">m32()</FONT>
function.</P>
<P ALIGN=JUSTIFY>This time having no look-ahead is a hard limitation.
If we knew in advance which variables will be used intensively, even
if only as source operand, we could force it to be allocated to a
register. But look-back approaches can't help us here. The compiler
front-end, as always, can give hints though. But in my experience
this is hardly worth it. Modern processors deal with frequent
accesses to the same memory quite efficiently. Also, frequently used
variables will often be used as destination operand as well, so they
will quickly be allocated to a register.</P>
<P ALIGN=JUSTIFY>Either way we can also use some hybrid approach.
When sufficient registers are available, we can assume that it's
worth it to allocate variables to registers even when they are first
used as source operand. Real code benchmarks would be required to
determine the optimal decision parameters.</P>
<P ALIGN=JUSTIFY><STRONG><U>Practical Results</U></STRONG></P>
<P ALIGN=JUSTIFY>Several key optimizations have been presented.
Implementing them all successfully was sometimes very complicated. I
do not know of any previous work like this so it was all very
experimental for me. So even though the above explanations might
sound very logical, making it all work in&nbsp;every situation is
very tricky. The allocation state of one register consists of more
than a dozen components, so at every related function they have to be
updated correctly.</P>
<P ALIGN=JUSTIFY>The first approach to deal with this complexity was
of course divide-and-conquer. Just as the optimizations are described
one by one, they were implemented one by one. This resulted in adding
several functions in the <FONT FACE="Courier New">RegisterAllocator</FONT>
class to switch the specific optimizations on and off. This allowed
also to test them separately, or in combinations, until finally they
all worked together.</P>
<P ALIGN=JUSTIFY>The actual testing was done by automatically
generating very long functions, with a random combination of
arithmetic operations, copy operations, introducing new variables,
and spilling. Using various distributions it was possible to simulate
extreme situations such as having multiple copy propagations of the
same registers. Errors were detected by executing the same
operations, but as a regular C++ function. Differences in output
values were analyzed by reducing the situation to a minimum number of
operations, and looking at what the last couple of instructions did
wrong.</P>
<P ALIGN=JUSTIFY>This technique of stress-testing was very efficient,
but not flawless. The ultimate testing was done by swShader [2], the
project for which SoftWire was created in the first place. It creates
several functions at run-time, which all depend on each other's
output. When an optimization caused a minor&nbsp;error, this was
immediately visible, literally, in the produced 3D images. Sometimes
it also caused crashes because of bad memory accesses, which made it
easier to track the bug.</P>
<P ALIGN=JUSTIFY>So, although any code can never be fully flawless, I
am confident that SoftWire's new optimizations can be used in many
applications without trouble. In case you do find errors, and believe
it's caused by SoftWire internally, please contact me to have a look
at the issue!</P>
<P ALIGN=JUSTIFY>There are currently no hard theoretical proofs of
these algorithms. Although feasible, they are more complicated than
one would expect. For example, load elimination is very simple in
concept, but becomes harder when combined with copy propagation,
spill elimination, etc. To prove it, we would have to identify the
dependencies between the techniques, and show that for every
operation related to register allocation, the total allocation state
remains valid. Formal proofs can be expected in future 'academic'
versions of this article. For now, their application in complex
dynamically generated functions in swShader should be convincing that
these proofs do exist.</P>
<P ALIGN=JUSTIFY><STRONG><U>Conclusion</U></STRONG></P>
<P ALIGN=JUSTIFY>In this article I've presented new algorithms that
allow a compiler back-end to generate vastly optimized code without
the need for multiple passes or extra memory requirements. This makes
the approach very interesting for fast JIT-compilers and native
script compilers. But it doesn't end here, as there are many more
optimizations that could be tried out.&nbsp;Since this is a work in
progress, any kind of feeback is highly appreciated! Don't hesitate
to <A HREF="mailto:Nicolas@Capens.net">Contact Me</A> and discuss
this article.</P>
<P>Enjoy! <BR><BR><A HREF="mailto:Nicolas@Capens.net">Nicolas Capens</A>
</P>
<P>[1] SoftWire: <A HREF="https://gna.org/projects/softwire/">https://gna.org/projects/softwire/
</A><BR>[2] swShader: <A HREF="http://sw-shader.sourceforge.net/">http://sw-shader.sourceforge.net</A><BR>[3]
SoftWire Tutorial: Included in SoftWire package. <BR>[4] Linear-Scan
Register Allocation: <A HREF="http://citeseer.ist.psu.edu/poletto99linear.html">http://citeseer.ist.psu.edu/poletto99linear.html</A>
</P>
<P>Copyright &copy; 2004-2005 Nicolas Capens. All rights reserved.</P>
<P><BR><BR>
</P>
</BODY>
</HTML>