Gamasutra is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.   Contents Latest Jobs
October 21, 2020 Latest Blogs
October 21, 2020 Press Releases
October 21, 2020
Games Press About Gama Network
If you enjoy reading this site, you might also want to check out these UBM Tech sites: Features

# Sponsored Feature: Rasterization on Larrabee -- Adaptive Rasterization Helps Boost Efficiency

November 13, 2009  Page 5 of 6 # Descending the Rasterization Hierarchy

We're now done rasterizing trivially accepted 16×16 blocks, but we still have to handle partially covered 16×16 blocks, and by this point, it should be obvious how that works. We descend into each partial 16×16 to evaluate the 4×4 blocks it contains, just as we descended into the 64×64 tile to evaluate the 16×16s it contained. Again, we put trivially accepted 4×4s directly into the bin. Partially accepted 4×4s, however, need to be processed into pixel masks. This is done with a vector add of a precalculated, position-independent table for each edge that steps from the 4×4 trivial reject corner to the center of the pixel itself. The sign bits can then be ANDed together to form the pixel mask.

Figure 25 illustrates the calculation of the pixel mask. From the trivial reject corner of the 4×4, shown in red, a single add yields the equation value for the black edge at the 16 pixel centers. The red arrows show how the 16 values in the precalculated, position-independent step table for each edge are added to the trivial reject corner of the 4×4 to evaluate the edge equation at the 16 pixel centers. The pixels shown in blue have negative values and are inside the edge. Figure 25 shows the actual mask that is generated, with a 1-bit for each pixel inside the edge, and a 0-bit for each pixel that's outside. Figure 25 is a good demonstration of how vector efficiency can fall off with partially vectorizable tasks, and why Larrabee uses 16-wide vectors rather than something larger. We will often do all the work needed to generate the pixel mask for a 4×4, only to find that many or most of the pixels are uncovered, so much of the work has been wasted. And the wider the vector, the lower the ratio of covered to uncovered becomes, on average, and the more work is wasted.

Listing Three shows code for calculating the pixel masks for all the partially covered 4×4 blocks in a partially covered 16×16 block. Note that this is the only case in which per-pixel work is performed; all solidly covered-block cases are handled without generating a pixel mask. Listing Three first stores the trivial reject corner values for the 16 4×4 blocks for the three edges. These are the values we'll step relative to in order to generate the final pixel masks for each partial 4×4 and that were generated earlier by the 16×16 code that ran just before this code. This is done with the three vstored instructions.

Next, the step tables for the three edges are loaded with the three vloadd instructions. (Actually, these will probably just be preloaded into registers when the triangle is set up and remain there throughout the rasterization of the triangle, but loading them here makes it clearer what's going on.)

Once that set-up is complete, the code scans through the partial accept mask, which was also generated earlier by the 16×16 code. First, the mask is copied to eax with the kmov instruction, and then bsfi is used to find each partially accepted 4×4 block in turn.

For each partially covered 4×4 found, the code does three vector compares to evaluate the three edge equations at the 16 pixel centers. Note that each vcmpgtpi uses the index of the current 4×4 to retrieve the trivial reject corner value for that 4×4 from the vector generated at the 16×16 level. While we could directly add and then test the signs of the edge equations, as I described earlier, it's more efficient to instead rearrange the calculations by flipping the signs of the step tables so that the testing can be done with a single compare per edge. Put another way, instead of adding two values and seeing if the result is less than zero:

m + n < 0

it's equivalent and more efficient to compare one value directly to the negation of the other value:

m < -n

(In case you're wondering, we couldn't use this trick at the 16×16 and 64×64 levels of the hierarchy because in those cases, in addition to getting the result of the test, we also need to keep the result of the add, so we can pass it down the hierarchy as the new corner value.)

The result of the compares is the final composite pixel mask for that 4×4, which can be stored in the rasterization output queue. And that's it. Once things are set up, the cost to rasterize each partial 4×4 is three vector instructions, a mask instruction, and a handful of scalar instructions. Once again, if the whole tile was trivially accepted against one or two edges, then proportionately fewer vector instructions would be required.

One nice feature of the Larrabee rasterization approach is that MSAA ("multisample antialiasing") falls out for one extra vector compare per 16 samples per edge because it's just a different step from the trivial reject corner. (That's for partially accepted 4×4 blocks; for all trivially accepted blocks, there is no incremental cost for rasterization of MSAA.) In Figure 26, each pixel is divided into four samples, and the step shown is to one of the four samples rather than to the center of each pixel. This takes one compare, so a total of four compares would be required to generate the full MSAA sample mask for 16 pixels.   Page 5 of 6 ### Top Stories 