**Descending the Rasterization Hierarchy**

We're
now done rasterizing trivially accepted 16×16 blocks, but we still have to
handle partially covered 16×16 blocks, and by this point, it should be obvious
how that works. We descend into each partial 16×16 to evaluate the 4×4 blocks
it contains, just as we descended into the 64×64 tile to evaluate the 16×16s it
contained. Again, we put trivially accepted 4×4s directly into the bin.
Partially accepted 4×4s, however, need to be processed into pixel masks. This
is done with a vector add of a precalculated, position-independent table for
each edge that steps from the 4×4 trivial reject corner to the center of the
pixel itself. The sign bits can then be ANDed together to form the pixel mask.

Figure
25 illustrates the calculation of the pixel mask. From the trivial reject
corner of the 4×4, shown in red, a single add yields the equation value for the
black edge at the 16 pixel centers. The red arrows show how the 16 values in
the precalculated, position-independent step table for each edge are added to
the trivial reject corner of the 4×4 to evaluate the edge equation at the 16
pixel centers. The pixels shown in blue have negative values and are inside the
edge. Figure 25 shows the actual mask that is generated, with a 1-bit for each
pixel inside the edge, and a 0-bit for each pixel that's outside.

Figure
25 is a good demonstration of how vector efficiency can fall off with partially
vectorizable tasks, and why Larrabee uses 16-wide vectors rather than something
larger. We will often do all the work needed to generate the pixel mask for a
4×4, only to find that many or most of the pixels are uncovered, so much of the
work has been wasted. And the wider the vector, the lower the ratio of covered
to uncovered becomes, on average, and the more work is wasted.

Listing
Three shows code for calculating the pixel masks for all the partially covered
4×4 blocks in a partially covered 16×16 block. Note that this is the only case
in which per-pixel work is performed; all solidly covered-block cases are
handled without generating a pixel mask.

Listing
Three first stores the trivial reject corner values for the 16 4×4 blocks for
the three edges. These are the values we'll step relative to in order to
generate the final pixel masks for each partial 4×4 and that were generated
earlier by the 16×16 code that ran just before this code. This is done with the
three **vstored** instructions.

Next,
the step tables for the three edges are loaded with the three **vloadd**
instructions. (Actually, these will probably just be preloaded into registers
when the triangle is set up and remain there throughout the rasterization of
the triangle, but loading them here makes it clearer what's going on.)

Once
that set-up is complete, the code scans through the partial accept mask, which
was also generated earlier by the 16×16 code. First, the mask is copied to **eax**
with the **kmov** instruction, and then **bsfi** is used to find each
partially accepted 4×4 block in turn.

For
each partially covered 4×4 found, the code does three vector compares to
evaluate the three edge equations at the 16 pixel centers. Note that each **vcmpgtpi**
uses the index of the current 4×4 to retrieve the trivial reject corner value
for that 4×4 from the vector generated at the 16×16 level. While we could
directly add and then test the signs of the edge equations, as I described
earlier, it's more efficient to instead rearrange the calculations by flipping
the signs of the step tables so that the testing can be done with a single
compare per edge. Put another way, instead of adding two values and seeing if
the result is less than zero:

m
+ n < 0

it's
equivalent and more efficient to compare one value directly to the negation of
the other value:

m
< -n

(In
case you're wondering, we couldn't use this trick at the 16×16 and 64×64 levels
of the hierarchy because in those cases, in addition to getting the result of
the test, we also need to keep the result of the add, so we can pass it down
the hierarchy as the new corner value.)

The
result of the compares is the final composite pixel mask for that 4×4, which
can be stored in the rasterization output queue. And that's it. Once things are
set up, the cost to rasterize each partial 4×4 is three vector instructions, a
mask instruction, and a handful of scalar instructions. Once again, if the
whole tile was trivially accepted against one or two edges, then
proportionately fewer vector instructions would be required.

One
nice feature of the Larrabee rasterization approach is that MSAA ("multisample
antialiasing") falls out for one extra vector compare per 16 samples per edge
because it's just a different step from the trivial reject corner. (That's for
partially accepted 4×4 blocks; for all trivially accepted blocks, there is no
incremental cost for rasterization of MSAA.) In Figure 26, each pixel is
divided into four samples, and the step shown is to one of the four samples
rather than to the center of each pixel. This takes one compare, so a total of
four compares would be required to generate the full MSAA sample mask for 16
pixels.