[In a detailed Intel-sponsored technical article, part of its Visual Computing section on Gamasutra, veteran game programmer and architect Mike Abrash (Quake) goes in-depth on how Intel's upcoming Larrabee architecture pipeline deals intelligently with rasterization.]
You don't tug on Superman's cape
You don't spit into the wind
You don't pull the mask off that old Lone Ranger
And you don't mess around with Jim
The first time you heard the chorus of "You Don't Mess Around with Jim," you no doubt immediately wondered why Jim Croce left out "And you don't rasterize in software if you're trying to do GPU-class graphics, even on a high-performance part such as Larrabee." That's just common sense, right? It was obvious to me, anyway, that Larrabee was going to need a hardware rasterizer. But maybe Jim knew something I didn't, because it turned out I was wrong, and therein lies a tale.
First, though, let's back up a step and get some context.
In Look at the Larrabee New Instructions (LRBni), I presented an overview of Larrabee, an upcoming multithreaded, vectorized, many core processor from Intel. It may not be entirely apparent from that article, but the LRBni contains everything you need to do a great job running code that's designed to be vectorized, such as HLSL shaders. But what about tasks that aren't inherently scalar (in particular, where the iterations aren't serially dependent), and that at the same time can't be efficiently parallelized in any obvious way, and for which the usual algorithms aren't easily vectorized?
In the process of implementing the standard graphics pipeline for Larrabee, we (the Larrabee team at Intel's partner RAD Game Tools - Atman Binstock, Tom Forsyth, Mike Sartain, and me) have gotten considerable insight into that question, because while the pipeline is largely vectorizable, there are a few areas in which the conventional approaches are scalar, and where some degree of serialization is unavoidable.
This article takes a close look at how we applied the Larrabee New Instructions to one of those problem areas - rasterization - in the process, redesigning our implementation so it was far more vectorizable so it took advantage of the strengths of Larrabee's CPU-based architecture. Needless to say, performance with the Larrabee New Instructions will vary greatly from one application to the next - there's not much to be done with purely scalar code - but diving into Larrabee rasterization in detail gives you a good sense of what it's like to apply the Larrabee New Instructions to at least one sort of non-obvious application.
To avoid potential confusion, let me define "rasterization": For our present purposes, it's the process of determining which pixels are inside a triangle, and nothing more. In this article, I won't discuss other aspects of the rendering pipeline, such as depth, stencil, shading, or blending, in any detail.
When we look at applying the Larrabee New Instructions to the graphics pipeline, for the most part, it's easy to find 16-wide work for the vector unit to do. Depth, stencil, pixel shading, and blending all fall out of processing 4×4 blocks, with writemasking at the edges of triangles. Vertex shading is not quite a perfect fit, but 16 vertices can be shaded and cached in parallel, which works well because vertex usage tends to be localized. Setting up 16 triangles at a time also works pretty well; there's some efficiency loss due to culling, but it still yields more than half of maximum parallelism.
That leaves rasterization, which is definitely not easy to vectorize with enough performance to be competitive with hardware. As an aside, the Larrabee rasterizer is not, of course, the only rasterizer that could be written for Larrabee. There are other rasterizers for points and lines, and of course, there may be other future rasterizers for interesting new primitive types. In fact, I suggest you write your own - it's an interesting problem! But as it doesn't have a more formal name, I'll just refer to the current general-purpose triangle rasterizer as "the Larrabee rasterizer" in this article.
The Larrabee rasterizer was at least the fifth major rasterizer I've written, but while the others were loosely related, the Larrabee rasterizer belongs to a whole different branch of the family - so different, in fact, that it almost didn't get written. Why? Because, as is so often the case, I wanted to stick with the mental model I already understood and was comfortable with, rather than reexamining my assumptions. Long-time readers will recall my fondness for the phrase "Assume nothing"; unfortunately, I assumed quite a lot in this case, proving once again that it's easier to dispense wise philosophy than to put it into action!
Coming up with better solutions is all about trying out different mental models to see what you might be missing, a point perfectly illustrated by my efforts about 10 years ago to speed up a texture mapper. I had thrown a lot of tricks at it and had made it a lot faster, and figured I was done. But just to make sure, I ran it by my friend David Stafford, an excellent optimizer. David said he'd think about it and would let me know if he came up with anything.
When I got home that evening, there was a message on the answering machine from David, saying that he had gotten two cycles out of the inner loop. (It's been a long time, and my memories have faded, so it may not have been exactly two cycles, but you get the idea.) I called him back, but he wasn't home (this was before cellphones were widespread). So all that evening I tried to figure out how David could have gotten those two cycles. As I ate dinner, I wondered; as I brushed my teeth, I wondered; and as I lay there in bed not sleeping, I wondered. Finally, I had an idea - but it only saved one cycle, not two. And no matter how much longer I racked my brain, I couldn't get that second cycle.
The next morning, I called David, admitted that I couldn't match his solution, and asked what it was. To which he replied: "Oh, sorry, it turned out my code doesn't work."
It's certainly a funny story, but more to the point, the only thing keeping my code from getting faster had been my conviction that there were no better solutions - that my current mental model was correct and complete. After all, the only thing that changed between the original solution and the better one was that I acquired the belief that there was a better solution - that is, that there was a better model than my current one. As soon as I had to throw away my original model, I had a breakthrough.
Software rasterization is a lot like that.
The previous experience that Tom Forsyth and I had with software rasterization was that it was much, much slower than hardware. Furthermore, there was no way we could think of to add new instructions to speed up the familiar software rasterization approaches, and no obvious way to vectorize the process, so we concluded that a hardware rasterizer would be required. However, this turned out to a fine example of the importance of reexamining assumptions.
In the early days of the project, as the vector width and the core architecture were constantly changing, one of the hardware architects, Eric Sprangle, kept asking us whether it might be possible to rasterize efficiently in software using vector processing. We kept patiently explaining that it wasn't. Then, one day, he said: "So now that we're using the P54C core, do you think software rasterization might work?" To be honest, we thought this was a pretty dumb question because the core didn't have any significant impact on vector performance; in fact, we were finally irritated enough to sit down to prove in detail - with code - that it wouldn't work.
And we immediately failed. As soon as we had to think hard about how the inner loop could be structured with vector instructions, and wrote down the numbers - that is, as soon as our mental model was forced to deal with reality - it became clear that software rasterization was in fact within the performance ballpark. After that, it was just engineering, much of which we'll see shortly.
First, though, it's worth pointing out that, in general, dedicated hardware will be able to perform any specific task more efficiently than software; this is true by definition because dedicated hardware requires no multifunction capability, so in the limit, it can be like a general-purpose core with the extraneous parts removed. However, by the same token, hardware lacks the flexibility of CPUs, and that flexibility can allow CPUs to close some or all of the performance gap. Hardware needs worst-case capacity for each component, so it often sits at least partly idle; CPUs, on the other hand, can just switch to doing something different, so ALUs are never idle. CPUs can also implement flexible, adaptive approaches, and that can make a big difference, as we'll see shortly.
I can't do a proper tutorial on rasterization here, so I'll just run through a brief refresher. For our purposes, all rasterization will be of triangles. There are three edges per triangle, each defined by an equation Bx+Cy relative to any point on the edge, with the sign indicating whether a point is inside or outside the edge. Both x and y are in 15.8 fixed-point format, with a range of [-16K, +16K). The edge equations are tested at pixel or sample centers, and for cases where a pixel or sample center is right on an edge, well-defined fill rules must be observed (in this case, top-left fill rules, which are generally implemented by subtracting 1 from the edge equation for left edges and flat-top edges). Rasterization is performed with discrete math, and must be exact, so there must be enough bits to represent the edge equation completely. Finally, multisampled antialising must be supported.
Let's look at a quick example of applying the edge equation. Figure 1 shows an edge from (12, 8) to (4, 24). The B coefficient of the edge equation is simply the edge's y length: (y1 - y0). The C coefficient is the negation of the edge's x length: (x0 - x1). Thus, the edge equation in Figure 1 is (24 - 8)x + (12 - 4 )y. Since we only care about the sign of the result (which indicates inside or outside), not the magnitude, this can be simplified to 2x + 1y, where the x value used in the equation is the distance between the point of interest and any point at which the equation is known to be zero (which is to say, any point on the line). Usually, a vertex is used; for example, as the vertex at (12, 8) is used in Figure 1. All points on the edge have the value 0, as can be seen in Figure 1 for the point on the line at (8, 16).
Points on one side of the edge will have positive values, as in Figure 2 for the point at (12, 16), which has a value of 8.
Points on the other side of the edge will have negative values, as in Figure 3 for the point at (4, 12), which has a value of -12.
Simple though it is, the edge equation is the basis upon which the Larrabee rasterizer is built. By applying the three edge equations at once, it is possible to determine which points are inside a triangle and which are not. Figure 4 shows an example of how this works; the pixels shown in green are considered to be inside the triangle formed by the edges, because their centers are inside all three edges. As you can see, the edge equation is negative on the side of each edge that's inside the triangle; in fact, it gets more negative the farther you get from the edge on the inside, and more positive the farther you get from the edge on the outside.
Vectorization is an essential part of Larrabee performance - capable of producing a speedup of an order of magnitude or more - so the knotty question is how we can perform the evaluation in Figures 1-4 using vector processing. More accurately, the question is how we can efficiently perform the evaluation using vector processing; obviously, we could use vector instructions to evaluate every pixel on the screen for every triangle, but that would involve a lot of wasted work. What's needed is some way of using vector instructions to quickly narrow in on the work that's really needed.
We considered a lot of approaches. Let's take a look at a couple so you can get a sense of what a different tack we had to take in order to vectorize a task that's not an obvious candidate for parallelization - and to leverage the unique strengths of CPUs.