[In this technical article, part of Microsoft's XNA-related Gamasutra microsite, XNA Developer Connection staffer and Interplay co-founder Becky Heineman gives tips on avoiding the 'Load-Hit-Store' performance-killer when making games.]
"90%
of the time is spent in 10% of the code, so make that 10% the fastest code it
can be."
One of the most common problems encountered in creating computer games is
performance. Issues like disk access, GPU performance, CPU performance, race
conditions, and memory bandwidth (or lack thereof) can cause stalls or delays that
may turn a 30-frames-per-second game into a 9-frames-per-second game.
This article will describe one of the most common CPU performance
killers, the Load-Hit-Store, and give tips and tricks on how to avoid it.
Load-Hit-Store
Ask any Xbox 360 performance engineer about Load-Hit-Store and they usually go into
a tirade. The sequence of a memory read operation (the Load), the assignment of
the value to a register (the Hit), and the actual writing of the value into a
register (the Store) is usually hidden away in stages of the pipeline, so these
operations cause no stalls. However, if the memory location being read was
recently written to by a previous write operation, it can take as many as 40
cycles before the "Store" operation can complete.
Example:
stfs fr3,0(r3)      ;Store the float
lwz r9,0(r3)        ;Read it back into an integer register
oris r9,r9,0x8000   ;Force to negative
The first instruction writes
a 32-bit floating-point value into memory, and the following instruction reads
it back. What's interesting is that the load instruction isn't where the stall
occurs; it's the "oris" instruction. That instruction can't complete until
the "store" into r9 finishes, and it's waiting for the L1 cache to update.
What's going on? The first
instruction stores the data and marks the L1 cache as "dirty". It takes about 40
cycles for the data to be written into the L1 cache and become available for the CPU to
use. During this window of time, an instruction requests that data from the
cache and then "hits" r9 for a "store". Since the last instruction can't
execute until the store is complete, you've got a stall.
The Microsoft tool, PIX, can
locate these issues. Since it's confusing to tag the "oris"
instruction as the cause of the stall (which it is), PIX flags the load
instruction that started the chain of events so the programmer has a better
chance of fixing the issue.
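In C++ terms, the stfs/lwz/oris sequence above is what a sign-bit trick on a float tends to look like. The following is a minimal sketch, not code from the original article: both function names are hypothetical, and whether a given compiler keeps the second version entirely in floating-point registers (for example by using the PowerPC fnabs instruction) is an assumption that should be checked in the generated assembly.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: mirrors the store/load/oris pattern. The float's bits
// travel through memory into an integer register so the sign bit can be set.
inline float ForceNegativeViaInteger(float fValue)
{
    std::uint32_t uBits;
    std::memcpy(&uBits, &fValue, sizeof(uBits));   // float bits -> integer
    uBits |= 0x80000000u;                          // set the sign bit
    std::memcpy(&fValue, &uBits, sizeof(fValue));  // integer bits -> float
    return fValue;
}

// Hypothetical helper: the same result expressed as float math, which gives
// the compiler the chance to keep the value in the floating-point unit.
inline float ForceNegativeViaFloat(float fValue)
{
    return -std::fabs(fValue);
}
```

Both functions return the same value for ordinary inputs, so the float-only form can replace the bit trick without changing behavior.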
Three CPUs in One Thread
Think of the PowerPC as
three completely separate CPUs, each with its own instruction set, register set,
and ways of performing operations on the data. The first is the integer unit
with its 32 integer registers, which is considered the workhorse, handling a
large percentage of the operations.
The second is the floating-point unit with
its 32 floating-point registers, handling all of the simple mathematics. Finally,
the third is the VMX unit with its 128 registers dealing with complex vector
operations.
Why think of the units as
three CPUs that share a common instruction stream? These units have no way of
directly transferring data between one another internally. Due to the lack of
an instruction to move the contents of an integer register to a floating-point register,
the CPU must write the integer value to memory, and then load it into a
floating-point register using a memory read instruction. That pattern of
operation is, by nature, a Load-Hit-Store.
Moving data from the integer
unit to the floating-point unit is as simple as...
Example:
int iTime;
float fTime;
fTime = static_cast<float>(iTime);
This is extremely simple
code and very common, but on the PowerPC, an instruction is generated to store
the integer value to memory such that a floating-point instruction can be
executed to load from memory into a floating-point register. A fix-up
instruction follows that converts the integer representation into a floating-point
representation, and the sequence is complete.
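One way to sidestep the conversion in a loop is to keep a parallel float counter instead of casting the integer index on every pass. This is a sketch under stated assumptions rather than a technique taken from the article: the function names are hypothetical, and the float counter is only exact while the count stays below 2^24 (16,777,216), the limit of consecutive integers a 32-bit float can represent.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: converts the loop index on every pass, so each
// static_cast<float>(i) is an integer-to-float transfer.
inline std::vector<float> TimesWithCast(int iCount, float fStep)
{
    std::vector<float> Times;
    Times.reserve(static_cast<std::size_t>(iCount));
    for (int i = 0; i < iCount; ++i)
    {
        Times.push_back(static_cast<float>(i) * fStep);
    }
    return Times;
}

// Hypothetical example: the index used for math lives in a float from the
// start, so no int-to-float conversion happens inside the loop.
inline std::vector<float> TimesWithFloatCounter(int iCount, float fStep)
{
    std::vector<float> Times;
    Times.reserve(static_cast<std::size_t>(iCount));
    float fIndex = 0.0f;
    for (int i = 0; i < iCount; ++i)
    {
        Times.push_back(fIndex * fStep);
        fIndex += 1.0f;  // exact for counts below 2^24
    }
    return Times;
}
```

The integer counter still controls the loop, and the float counter feeds the arithmetic, so each value stays in the unit that uses it.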
A common way to generate a
Load-Hit-Store is using member variables or values reached through references
or pointers as counters in tight loops.
Example:
for (int i=0;i<100;++i)
{
    m_iData++;
}
Seldom are compilers smart
enough to figure out that the above loop resolves into m_iData+=100 and optimize it into a single operation. Most will happily
load m_iData at runtime, increment it, and store it back into memory
referenced by the "this" pointer. The first pass of the loop will run at full
speed, but once it loops back, the m_iData value will incur a
Load-Hit-Store from the write operation of the previous pass through the loop.
Since registers incur no
penalty, the code can be rewritten to work on a local copy:
int iData = m_iData;
for (int i=0;i<100;++i)
{
    iData++;
}
m_iData = iData;
Not only will the code run
much faster, since the operations are all in registers, but you also increase the
chances that the compiler will reduce this to iData+=100 and remove any chance of a
Load-Hit-Store bottleneck.
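The same local-copy approach applies when the counter is reached through a reference parameter rather than the "this" pointer. The sketch below is illustrative only; the function names and the even-number test are hypothetical and not from the original article.

```cpp
// Hypothetical example: accumulates directly through the reference, so every
// pass may load, increment, and store the value in memory.
inline void CountEvenThroughReference(const int* pValues, int iCount, int& iResult)
{
    for (int i = 0; i < iCount; ++i)
    {
        if ((pValues[i] & 1) == 0)
        {
            ++iResult;
        }
    }
}

// Hypothetical example: accumulates in a local, which the compiler can keep
// in a register, and writes the result back once after the loop.
inline void CountEvenThroughLocal(const int* pValues, int iCount, int& iResult)
{
    int iLocal = iResult;
    for (int i = 0; i < iCount; ++i)
    {
        if ((pValues[i] & 1) == 0)
        {
            ++iLocal;
        }
    }
    iResult = iLocal;
}
```

The two versions agree as long as iResult does not alias an element of pValues. That possible aliasing is one reason a compiler may decline to make this transformation on its own.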
Example that works:
XMVECTOR Radius = { 1.0f, 2.0f, 3.0f, 4.0f };
float Z;
Z = Radius.z; //LHS
__stvewx(__vspltw(Radius,2), &Z,2); //Avoids LHS.
Cheers,
Will