*[In a detailed Intel-sponsored technical article, part of its Visual Computing section on Gamasutra, veteran game programmer and architect Mike Abrash (Quake) goes in-depth on how the rendering pipeline for Intel's upcoming Larrabee architecture deals intelligently with rasterization.]*

You don't tug on Superman's cape

You don't spit into the wind

You don't pull the mask off that old Lone Ranger

And you don't mess around with Jim

The
first time you heard the chorus of "You Don't Mess Around with Jim," you no
doubt immediately wondered why Jim Croce left out "And you don't rasterize in
software if you're trying to do GPU-class graphics, even on a high-performance
part such as Larrabee." That's just common sense, right? It was obvious to me,
anyway, that Larrabee was going to need a hardware rasterizer. But maybe Jim
knew something I didn't, because it turned out I was wrong, and therein lies a
tale.

First,
though, let's back up a step and get some context.

**Parallel programming for semi-parallel tasks**

In
*A First Look at the Larrabee New Instructions (LRBni)*,
I presented an overview of Larrabee, an upcoming multithreaded, vectorized,
many-core processor from Intel. It may not be entirely apparent from that
article, but LRBni contains everything you need to do a great job running
code that's designed to be vectorized, such as HLSL shaders. But what about
tasks that aren't inherently scalar (in particular, where the iterations aren't
serially dependent), but at the same time can't be efficiently parallelized
in any obvious way, and for which the usual algorithms aren't easily
vectorized?

In
the process of implementing the standard graphics pipeline for Larrabee, we
(the Larrabee team at Intel's partner RAD Game Tools - Atman Binstock, Tom Forsyth, Mike Sartain, and me)
have gotten considerable insight into that question, because while the pipeline
is largely vectorizable, there are a few areas in which the conventional
approaches are scalar, and where some degree of serialization is unavoidable.

This article takes a close look at how we applied the Larrabee New Instructions
to one of those problem areas - rasterization - in the process redesigning our
implementation so that it was far more vectorizable and took better advantage
of the strengths of Larrabee's CPU-based architecture. Needless to say, performance
with the Larrabee New Instructions will vary greatly from one application to
the next - there's not much to be done with purely scalar code - but diving
into Larrabee rasterization in detail gives you a good sense of what it's like
to apply the Larrabee New Instructions to at least one sort of non-obvious
application.

To
avoid potential confusion, let me define "rasterization": For our present
purposes, it's the process of determining which pixels are inside a triangle,
and nothing more. In this article, I won't discuss other aspects of the
rendering pipeline, such as depth, stencil, shading, or blending, in any
detail.

**Rasterization Was the Problem Child - But Less
Problematic Than We Thought**

When
we look at applying the Larrabee New Instructions to the graphics pipeline, for
the most part, it's easy to find 16-wide work for the vector unit to do. Depth,
stencil, pixel shading, and blending all fall out of processing 4×4 blocks,
with writemasking at the edges of triangles. Vertex shading is not quite a
perfect fit, but 16 vertices can be shaded and cached in parallel, which works
well because vertex usage tends to be localized. Setting up 16 triangles at a
time also works pretty well; there's some efficiency loss due to culling, but
it still yields more than half of maximum parallelism.

That
leaves rasterization, which is definitely not easy to vectorize with enough
performance to be competitive with hardware. As an aside, the Larrabee
rasterizer is not, of course, the only rasterizer that could be written for
Larrabee. There are other rasterizers for points and lines, and of course,
there may be other future rasterizers for interesting new primitive types. In
fact, I suggest you write your own - it's an interesting problem! But as it
doesn't have a more formal name, I'll just refer to the current general-purpose
triangle rasterizer as "the Larrabee rasterizer" in this article.

The
Larrabee rasterizer was at least the fifth major rasterizer I've written, but
while the others were loosely related, the Larrabee rasterizer belongs to a
whole different branch of the family - so different, in fact, that it almost
didn't get written. Why? Because, as is so often the case, I wanted to stick
with the mental model I already understood and was comfortable with, rather
than reexamining my assumptions. Long-time readers will recall my fondness for
the phrase "Assume nothing"; unfortunately, I assumed quite a lot in this case,
proving once again that it's easier to dispense wise philosophy than to put it
into action!

Coming
up with better solutions is all about trying out different mental models to see
what you might be missing, a point perfectly illustrated by my efforts about 10
years ago to speed up a texture mapper. I had thrown a lot of tricks at it and
had made it a lot faster, and figured I was done. But just to make sure, I ran
it by my friend David Stafford, an excellent optimizer. David said he'd think
about it and would let me know if he came up with anything.

When
I got home that evening, there was a message on the answering machine from
David, saying that he had gotten two cycles out of the inner loop. (It's been a
long time, and my memories have faded, so it may not have been exactly two
cycles, but you get the idea.) I called him back, but he wasn't home (this was
before cellphones were widespread). So all that evening I tried to figure out
how David could have gotten those two cycles. As I ate dinner, I wondered; as I
brushed my teeth, I wondered; and as I lay there in bed not sleeping, I wondered.
Finally, I had an idea - but it only saved one cycle, not two. And no matter
how much longer I racked my brain, I couldn't get that second cycle.

The
next morning, I called David, admitted that I couldn't match his solution, and
asked what it was. To which he replied: "Oh, sorry, it turned out my code
doesn't work."

It's
certainly a funny story, but more to the point, the only thing keeping my code
from getting faster had been my conviction that there were no better solutions
- that my current mental model was correct and complete. After all, the only
thing that changed between the original solution and the better one was that I
acquired the belief that there was a better solution - that is, that there was
a better model than my current one. As soon as I had to throw away my original
model, I had a breakthrough.

Software
rasterization is a lot like that.

The
previous experience that Tom Forsyth and I had with software rasterization was
that it was much, much slower than hardware. Furthermore, there was no way we
could think of to add new instructions to speed up the familiar software
rasterization approaches, and no obvious way to vectorize the process, so we
concluded that a hardware rasterizer would be required. However, this turned
out to be a fine example of the importance of reexamining assumptions.

In
the early days of the project, as the vector width and the core architecture
were constantly changing, one of the hardware architects, Eric Sprangle, kept
asking us whether it might be possible to rasterize efficiently in software
using vector processing. We kept patiently explaining that it wasn't. Then, one
day, he said: "So now that we're using the P54C core, do you think software
rasterization might work?" To be honest, we thought this was a pretty dumb
question because the core didn't have any significant impact on vector
performance; in fact, we were finally irritated enough to sit down to prove in
detail - with code - that it wouldn't work.

And
we immediately failed. As soon as we had to think hard about how the inner loop
could be structured with vector instructions, and wrote down the numbers - that
is, as soon as our mental model was forced to deal with reality - it became
clear that software rasterization was in fact within the performance ballpark.
After that, it was just engineering, much of which we'll see shortly.

First,
though, it's worth pointing out that, in general, dedicated hardware will be
able to perform any specific task more efficiently than software; this is true
by definition because dedicated hardware requires no multifunction capability,
so in the limit, it can be like a general-purpose core with the extraneous
parts removed. However, by the same token, hardware lacks the flexibility of
CPUs, and that flexibility can allow CPUs to close some or all of the
performance gap. Hardware needs worst-case capacity for each component, so it
often sits at least partly idle; CPUs, on the other hand, can just switch to
doing something different, so ALUs are never idle. CPUs can also implement
flexible, adaptive approaches, and that can make a big difference, as we'll see
shortly.

**Why Rasterization Was the Problem Child**

I
can't do a proper tutorial on rasterization here, so I'll just run through a
brief refresher. For our purposes, all rasterization will be of triangles.
There are three edges per triangle, each defined by an equation Bx+Cy relative
to any point on the edge, with the sign indicating whether a point is inside or
outside the edge. Both x and y are in 15.8 fixed-point format, with a range of
[-16K, +16K). The edge equations are tested at pixel or sample centers, and for
cases where a pixel or sample center is right on an edge, well-defined fill
rules must be observed (in this case, top-left fill rules, which are generally
implemented by subtracting 1 from the edge equation for left edges and flat-top
edges). Rasterization is performed with discrete math, and must be exact, so
there must be enough bits to represent the edge equation completely. Finally,
multisampled antialiasing must be supported.
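To make the refresher concrete, here's a minimal sketch (not the actual Larrabee code; the `Edge`, `SetUpEdge`, and `EvalEdge` names are mine) of setting up and evaluating one edge's equation, including the top-left fill-rule bias described above. The winding and axis conventions - clockwise vertex order in a y-down coordinate system, so that a left edge has decreasing y and a flat-top edge runs in the +x direction - are illustrative assumptions.

```cpp
#include <cstdint>

// Set up one triangle edge's equation from two 15.8 fixed-point vertices.
// Per the text: B = (y1 - y0), C = (x0 - x1), points on the edge evaluate
// to 0, the inside of the triangle is the negative side, and the top-left
// fill rule is applied by subtracting 1 for left and flat-top edges.
struct Edge {
    int32_t b, c;     // coefficients of B*x + C*y
    int32_t bias;     // -1 on left and flat-top edges, else 0
    int32_t x0, y0;   // a vertex on the edge; the equation is 0 here
};

Edge SetUpEdge(int32_t x0, int32_t y0, int32_t x1, int32_t y1)
{
    Edge e;
    e.b = y1 - y0;
    e.c = x0 - x1;
    e.x0 = x0;
    e.y0 = y0;
    // Assumed conventions: a left edge rises (y decreasing, given clockwise
    // winding with y down); a flat-top edge is horizontal and runs in +x.
    bool leftEdge    = (e.b < 0);
    bool flatTopEdge = (e.b == 0) && (e.c < 0);
    e.bias = (leftEdge || flatTopEdge) ? -1 : 0;
    return e;
}

// Evaluate the edge equation at (x, y); a negative result means inside.
// The product is widened to 64 bits: rasterization must be exact, and with
// 15.8 coordinates the full edge equation doesn't fit in 32 bits.
int64_t EvalEdge(const Edge& e, int32_t x, int32_t y)
{
    return (int64_t)e.b * (x - e.x0) + (int64_t)e.c * (y - e.y0) + e.bias;
}
```

With the bias, a point lying exactly on a left or flat-top edge evaluates to -1 rather than 0, so it counts as inside - which is exactly what the top-left fill rule requires.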

Let's
look at a quick example of applying the edge equation. Figure 1 shows an edge
from (12, 8) to (4, 24). The B coefficient of the edge equation is simply the
edge's y length: (y1 - y0). The C coefficient is the negation of the edge's x
length: (x0 - x1). Thus, the edge equation in Figure 1 is (24 - 8)x + (12 - 4
)y. Since we only care about the sign of the result (which indicates inside or
outside), not the magnitude, this can be simplified to 2x + 1y, where the x and
y values used in the equation are measured relative to any point at which the
equation is known to be zero (which is to say, any point
on the line). Usually a vertex is used; for example, the vertex at (12, 8)
is used in Figure 1. All points on the edge have the value 0, as can be seen in
Figure 1 for the point on the line at (8, 16).

Points
on one side of the edge will have positive values, as in Figure 2 for the point
at (12, 16), which has a value of 8.

Points
on the other side of the edge will have negative values, as in Figure 3 for the
point at (4, 12), which has a value of -12.
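The values in Figures 1-3 are easy to check with a few lines of code. This minimal sketch (the `EdgeValue` name is mine) hard-codes the simplified equation 2x + 1y, measured relative to the vertex at (12, 8):

```cpp
// Simplified edge equation 2x + 1y from Figures 1-3: the point of interest
// is given in absolute coordinates and measured relative to the vertex at
// (12, 8), where the equation is known to be zero.
int EdgeValue(int x, int y)
{
    return 2 * (x - 12) + 1 * (y - 8);
}

// EdgeValue(8, 16)  evaluates to   0  (on the edge, Figure 1)
// EdgeValue(12, 16) evaluates to   8  (positive side, Figure 2)
// EdgeValue(4, 12)  evaluates to -12  (negative side, Figure 3)
```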

Simple
though it is, the edge equation is the basis upon which the Larrabee rasterizer
is built. By applying the three edge equations at once, it is possible to
determine which points are inside a triangle and which are not. Figure 4 shows
an example of how this works; the pixels shown in green are considered to be
inside the triangle formed by the edges, because their centers are inside all
three edges. As you can see, the edge equation is negative on the side of each
edge that's inside the triangle; in fact, it gets more negative the farther you
get from the edge on the inside, and more positive the farther you get from the
edge on the outside.
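The three-edge inside test can be sketched in a few lines of scalar code (illustrative names, not the Larrabee implementation). The vertices are assumed to be in clockwise order in a y-down coordinate system, so that the inside of every edge is its negative side, matching the sign convention of Figure 4; fill-rule biases are omitted for clarity.

```cpp
// A point is inside the triangle when it's on the negative side of all
// three edges. Coordinates here are plain integers rather than 15.8
// fixed point, to keep the sketch simple.
struct Pt { int x, y; };

// Edge equation for the edge from a to b, evaluated at p:
// B*(px - ax) + C*(py - ay), with B = (by - ay) and C = (ax - bx).
int Edge(Pt a, Pt b, Pt p)
{
    return (b.y - a.y) * (p.x - a.x) + (a.x - b.x) * (p.y - a.y);
}

// Vertices a, b, c assumed in clockwise order (y down), so inside is
// negative on every edge.
bool Inside(Pt a, Pt b, Pt c, Pt p)
{
    return Edge(a, b, p) < 0 && Edge(b, c, p) < 0 && Edge(c, a, p) < 0;
}
```

For the right triangle with vertices (0, 0), (16, 0), (0, 16), the point (4, 4) passes all three tests, while (12, 12) - beyond the hypotenuse - fails the second one.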

Vectorization
is an essential part of Larrabee performance - capable of producing a speedup
of an order of magnitude or more - so the knotty question is how we can perform
the evaluation in Figures 1-4 using vector processing. More accurately, the
question is how we can efficiently perform the evaluation using vector
processing; obviously, we could use vector instructions to evaluate every pixel
on the screen for every triangle, but that would involve a lot of wasted work.
What's needed is some way of using vector instructions to quickly narrow in on
the work that's really needed.
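One way to see why vectorization is at least plausible: the per-pixel test is just an evaluation of the edge equation, so a 4×4 block of pixel centers maps naturally onto 16 vector lanes. The scalar sketch below (hypothetical names, not LRBni code) packs one edge's inside/outside results for a 4×4 block into a 16-bit mask; each loop iteration stands in for one lane of a single 16-wide vector compare.

```cpp
#include <cstdint>

// Evaluate one edge equation at the 16 pixel centers of a 4x4 block and
// pack the results into a 16-bit inside mask, one bit per pixel.
uint16_t EdgeMask4x4(int blockX, int blockY, // top-left pixel of the block
                     int vx, int vy,         // a vertex on the edge (value 0)
                     int b, int c)           // edge coefficients B and C
{
    uint16_t mask = 0;
    for (int lane = 0; lane < 16; lane++) {
        int x = blockX + (lane & 3);   // lane n maps to pixel (n % 4, n / 4)
        int y = blockY + (lane >> 2);
        int value = b * (x - vx) + c * (y - vy);
        if (value < 0)                 // negative side = inside
            mask |= (uint16_t)1 << lane;
    }
    return mask;
}
```

ANDing the masks from all three edges gives the block's coverage, and the result doubles as a writemask for the 16-wide pixel work downstream - but doing this for every block on the screen is exactly the wasted work that needs to be avoided.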

We
considered a lot of approaches. Let's take a look at a couple so you can get a
sense of what a different tack we had to take in order to vectorize a task
that's not an obvious candidate for parallelization - and to leverage the
unique strengths of CPUs.