Gamasutra: The Art & Business of Making Games
Sponsored Feature: Rasterization on Larrabee -- Adaptive Rasterization Helps Boost Efficiency
November 13, 2009 (Page 2 of 6)
The Pixomatic 1 Rasterization Approach

Pixomatic version 1 took a rasterization approach common among scalar software rasterizers: decompose each triangle into one or two trapezoids, then step down the two edges simultaneously, on pixel centers, emitting the span of pixels covered on each scan line, as in Figure 5.
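As a rough illustration of that scalar style (a hypothetical sketch, not Pixomatic's actual code), the following rasterizes a flat-bottomed triangle by stepping a left and a right edge one scan line at a time and emitting the covered span on each line; it assumes integer vertex coordinates and ignores proper fill-rule tie-breaking:

```python
import math

def flat_bottom_spans(apex, left, right):
    # apex = (xa, ya) sits above left = (xl, yb) and right = (xr, yb)
    xa, ya = apex
    xl, xr, yb = left[0], right[0], left[1]
    dy = yb - ya
    dxl = (xl - xa) / dy          # left-edge x step per scan line
    dxr = (xr - xa) / dy          # right-edge x step per scan line
    spans = []
    xl_cur, xr_cur = float(xa), float(xa)
    for y in range(ya, yb):
        # emit the span of covered pixel centers on this scan line
        spans.append((y, math.ceil(xl_cur), math.floor(xr_cur)))
        xl_cur += dxl             # step both edges down one scan line
        xr_cur += dxr
    return spans
```

The inner loop is just two adds and a span emit per scan line, which is exactly why this was fast as scalar code, and exactly why it resists vectorization: each scan line depends serially on the one above it.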

This approach was efficient for scalar code, but it doesn't lend itself to vectorization. There were several other reasons it didn't suit Larrabee well (for example, it emits pixel-high spans, but for vectorized shading you want 4×4 blocks, both to generate 2D texture gradients and because a square aspect ratio gives the highest utilization of the vector units). But the most important reason was that I could never come up with a way to get good results out of vectorizing edge stepping.

Sweep Rasterization

Another approach, often used by hardware, is sweep rasterization. An example of this is in Figure 6. Starting at a top vertex, a vector stamp of 4×4 pixels is swept left, then right, then down, and the process is repeated until the whole triangle has been swept. The edge equation is evaluated directly at each of the 16 pixels for each 4×4 block that's swept over.
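To make the stamp idea concrete, here is a scalar sketch of evaluating the three edge equations at the 16 pixel centers of one 4×4 stamp position (hypothetical code with a simplified inside test; real hardware, and a 16-wide vector unit, would evaluate all 16 lanes in parallel):

```python
def edge(ax, ay, bx, by, px, py):
    # signed-area edge equation: >= 0 means (px, py) is on or inside
    # the edge from (ax, ay) to (bx, by), given consistent winding
    return (px - ax) * (by - ay) - (py - ay) * (bx - ax)

def stamp_mask(v0, v1, v2, bx, by):
    # evaluate all three edge equations at the 16 pixel centers of the
    # 4x4 block whose top-left pixel is (bx, by); return a 16-bit mask
    mask = 0
    for i in range(16):
        px, py = bx + (i % 4) + 0.5, by + (i // 4) + 0.5
        inside = (edge(*v0, *v1, px, py) >= 0 and
                  edge(*v1, *v2, px, py) >= 0 and
                  edge(*v2, *v0, px, py) >= 0)
        mask |= inside << i
    return mask
```

Each stamp position costs a fixed amount of arithmetic regardless of coverage, so the cost of sweeping lies mostly in how many positions you visit and in the branching needed to steer the sweep.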



Sweep rasterization is more vectorizable than the Pixomatic 1 approach because evaluating the pixel stamp is well-suited to vectorization; but on the other hand, it requires lots of badly predicted branching, as well as a significant amount of work to decide where to descend. It also fails to take advantage of the ability of CPUs to make smart, flexible decisions, which was our best bet for being competitive with hardware rasterization. So we decided sweep rasterization wasn't the right answer.

A High-level View of Larrabee Rasterization

Larrabee takes a substantially different approach, one better suited to vectorization. In the Larrabee approach, we evaluate 16 blocks of pixels at a time to figure out which blocks are even touched by the triangle, then descend into each block that's at least partially covered, evaluating 16 smaller blocks within it, and continue to descend recursively until we have identified all the pixels inside the triangle. Here's an example of how that might work for our sample triangle.

As I'll discuss shortly, the Larrabee renderer uses a chunking architecture. In a chunking architecture, the largest rasterization target at any one time is a portion of the render target called a "tile"; for this example, let's assume the tile is 64×64 pixels, as in Figure 7.

First, we test which of the 16×16 blocks that make up the tile (there are 16 of them; we check 16 things at a time whenever possible in order to leverage the 16-wide vector units) are touched by the triangle, as in Figure 8.

We find that only one 16×16 block is touched - the block shown in yellow. So we descend into that block to determine exactly what is touched by the triangle, subdividing it into 16 4×4 blocks (once again, we check 16 things at a time to be vector-friendly), and evaluate which of those are touched by the triangle, as in Figure 9.

We find that five of the 4×4s are touched, so we process each of them separately, descending to the pixel level to generate masks for the covered pixels. The pixel rasterization for the first block is in Figure 10.
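The descent just described can be sketched as a short recursion. Here `test_block` is a hypothetical stand-in for the 16-wide triangle/block intersection test, and each level splits a block into a 4×4 grid of 16 sub-blocks (64×64 tile into 16×16 blocks, 16×16 blocks into 4×4 blocks):

```python
def descend(test_block, x, y, size, out):
    # test_block(x, y, s) reports whether the s-by-s block at (x, y)
    # is touched by the triangle; in the real rasterizer all 16
    # sub-block tests at each level happen in one vector operation
    if size == 4:
        out.append((x, y))        # touched 4x4 block: pixel masks are generated here
        return
    sub = size // 4               # 64 -> 16 -> 4
    for i in range(16):           # one lane per sub-block
        sx, sy = x + (i % 4) * sub, y + (i // 4) * sub
        if test_block(sx, sy, sub):
            descend(test_block, sx, sy, sub, out)
```

Blocks the triangle never touches are rejected 16 at a time at the coarsest possible level, so the per-pixel work is confined to the few 4×4 blocks that actually need it.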

Figure 11 shows the final result.

As you can see, the Larrabee approach processes 4×4 blocks, like the sweep approach. But unlike the sweep approach, it doesn't have to make many decisions in order to figure out which blocks are touched by the triangle, thanks to the single 16-wide test performed before each descent. Consequently, this rasterization approach typically does somewhat less work than the sweep approach to determine which 4×4 blocks to evaluate. The real win, however, is that it takes advantage of CPU smarts by not-rasterizing whenever possible. I'll have to walk through the Larrabee rasterization approach in order to explain what that means, but as an introduction, let me tell you another optimization story.

Many years ago, I got a call from a guy I had once worked for. He wanted me to do some consulting work to help speed up his new company's software. I asked him what kind of software it was, and he told me it was image processing software, and that the problem lay in the convolution filter, running on a Sparc processor. I told him I didn't know anything about either convolution filters or Sparcs, so I didn't think I could be of much help. But he was persistent, so I finally agreed to take a shot at it.

He put me in touch with the engineer who was working on the software, who immediately informed me that the problem was that the convolution filter involved a great many integer multiplies, which the Sparc did very slowly because, at the time, it didn't have a hardware integer multiply instruction. Instead, it had a partial product instruction, which had to be executed for each significant bit in the multiplier. In compiled code, this was implemented by calling a library routine that looped through the multiplier bits, and that routine was where all the time was going.

I suggested unrolling that loop into a series of partial product instructions, and jumping into the unrolled loop at the right point to do as many partial products as there were significant bits, thereby eliminating all the loop overhead. However, there was still the question of whether to make the pixel value or the convolution kernel value the multiplier. The smaller the multiplier, the fewer partial products would be needed, so we wanted to pick whichever of the two was smaller on average.
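A shift-and-add emulation makes that trade-off concrete (hypothetical code, not the original Sparc library routine): one "partial product" step is needed per significant bit of the multiplier, so picking the operand with fewer significant bits minimizes the step count:

```python
def shift_add_mul(a, b):
    # emulate multiplication as one partial-product step per significant
    # bit of the multiplier; choose the operand with fewer bits as the
    # multiplier, since that is what determines the number of steps
    multiplier, multiplicand = (a, b) if a.bit_length() <= b.bit_length() else (b, a)
    product, steps = 0, 0
    while multiplier:
        if multiplier & 1:
            product += multiplicand
        multiplicand <<= 1
        multiplier >>= 1
        steps += 1                # one partial-product step per bit
    return product, steps
```

Multiplying by 3 takes two steps no matter how large the other operand is, which is why the average magnitude of the two value streams mattered.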

When I asked which was smaller, though, the engineer said there was no difference. When I persisted, he said they were random. When I said that I doubted they were random (since randomness is actually hard to come by), he grumbled. I don't know why he was reluctant to get me that information - I guess he thought it was a waste of time - but he finally agreed to gather the data and call me back.

He didn't call me back that day, though. And he didn't call me back the next day. When he hadn't called me back the third day, I figured I might as well get it over with and called him. He answered the phone and, when I identified myself, he said, "Oh, Hi. I'm just standing here with my managers, watching. We're all really happy."

When I asked what exactly he was happy about, he replied, "Well, when I looked at the data, it turned out 90% of the values in the convolution kernel were zero, so I just put an if-not-zero around the multiply, and now the whole program runs three times faster!" Not-rasterizing is a lot like that, as we'll see shortly.
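The fix amounted to guarding each multiply with a zero test so that zero-valued kernel entries cost almost nothing. A minimal sketch of such a convolution loop (hypothetical code, not the original image-processing software):

```python
def convolve_skip_zero(image, kernel):
    # valid-region 2D convolution that skips the multiply entirely when
    # the kernel entry is zero -- the "if-not-zero" guard from the story
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for ky in range(kh):
                for kx in range(kw):
                    k = kernel[ky][kx]
                    if k:                      # skip the expensive multiply for zeros
                        acc += k * image[y + ky][x + kx]
            out[y][x] = acc
    return out
```

With 90% of the kernel entries zero, 90% of the expensive multiplies simply never happen, and that is the spirit of not-rasterizing: the fastest work is the work you can prove you don't need to do.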

