A Compact Method for Backface Culling

By Osnat Levi

Backface culling is an important part of how a 3D engine performs visibility checks. Its purpose is to detect polygons that are invisible in a particular scene – that is, polygons that face away from the viewer. Detecting back-facing polygons allows us to eliminate them from an engine's rendering pipeline at an early stage, thus reducing the amount of computation and memory traffic. There are various ways to detect whether a polygon faces towards or away from the viewer, and each method has its own level of complexity and accuracy.

This article describes a new culling technique that reduces the storage requirements for facet normals when performing backface culling. This highly accurate technique, which performs accurate backface culling in object space, is particularly applicable to front-end culling, and requires only half as much storage as the standard facet-normal technique. Included at the end of this article is an appendix containing C++ implementations of both the traditional method and the compact method, both of which come in scalar and SIMD modes. The code for these methods is contained in a downloadable archive.

Motivation for front-end culling

Backface culling can be performed at several stages in the 3D pipeline. Though we could just sit back and let the rasterizer cull for us, it's advantageous to perform the culling step earlier. After all, the earlier we get rid of irrelevant data, the less data needs to be pushed around the system (saving bandwidth) and the fewer calculations that must be performed (saving CPU load). Cull may be performed during one of three stages:

1. Before transformation and lighting

2. After transformation but before lighting

3. In the rasterizer (after transformation and lighting).

Culling during stages 2 and 3 is typically performed in screen space by checking the clockwise/counterclockwise order of a polygon's vertices. Front-end culling (stage 1) is typically performed by calculating the dot product of the viewing vector and the facet normal of the polygon. The facet normal can be calculated on the fly or precalculated and stored with the data set. In any case, culling up front (stage 1) typically is faster than the other culling strategies (stages 2 or 3) since it saves bandwidth and requires fewer calculations.

The real expense of front-end culling is that we either have to calculate the facet normal on the fly or use a precalculated facet normal, which increases the size of our model database. However, this increase can be reduced by half, as we show below. And although the precalculated facet normals add to the model size, the method accesses memory sequentially, with is advantageous to us.

On the other hand, with the culling based on vertex-order testing (stages 2 or 3), the culling procedure loops through the triangles and accesses the vertices for each triangle. The vertices of neighboring triangles (and even of the same triangle) can be spread all over the vertex pool, causing random memory accesses during the culling. These random accesses are slow and lead to poor cache utilization.

The culling technique based on facet normals also loops through the triangles, but it accesses their facet normals instead of their vertices. Since the facet normals are stored on per-triangle basis, they are fetched sequentially in the culling loop. Therefore the sequential memory accesses are fast, they utilize the cache effectively, and they can be further accelerated using prefetch.

Our work with game developers has shown that front-end culling often increases performance substantially (a 10 to 20 percent boost in the frame-rate) if done properly. In the following sections, we show how to implement efficient front-end culling, using precalculated facet normals stored in a compact form with only half the storage requirements of the standard form.


Accurate front-end culling using facet normals

This technique performs a per-polygon test using the camera position and the polygon's facet normal and position to determine whether it is front or back facing. Its main advantages are its accuracy — the method culls the maximum number of polygons — and the fact that its calculations are done in object space, which eliminates vertices in the pretransform phase.

The classic implementation of this technique requires us to precalculate and store the following information about the polygon:

• The normal of the polygon (facet normal) stored in object space. (Some tools will export the normal, but otherwise it must be calculated using the polygon's vertices.)

• A position on the plane of the polygon in object space — either one of the vertices or the polygon's center.

• A list holding the indices of the polygon's vertices (indices list)

To determine whether a polygon is front or back facing, we must first transform the viewer position (or camera position) to object space, calculate the viewing vector (the vector in which the viewer is "looking" at the polygon), and then calculate the dot product between the viewing vector and the polygon normal. If the result is positive, then the polygon is facing backwards.

Here's an example of the traditional implementation for indexed triangles. The structure we're assuming for each face is 32 bytes in size, and is as follows:

struct {

 

float x,y,z;

} Vector;

struct {

 

Vector position;
// position on polygon
// 12 bytes

Vector normal;
// normal of polygon
// 12 bytes

WORD p1, p2, p3;
// indices to vertex pool
// 6 bytes

WORD stub; // needed to avoid 2 byte alignment
// stalls and filling a cache line.

} faceData; //= total size 32 bytes

Here's pseudo-code for the culling (please refer to the appendix for a C++ implementation):

for ( each vertex in pool )
{

 

mark vertex "not visible"

}

Transform Vw (viewer in world space) to Vo (viewer in object space) using the world-to-object transformation matrix.

for ( each face in array )
{

 

Calculate vector T = face->position – Vo

T is the viewing vector in object space.

Calculate the dot product between T and the face normal.

(T * face->normal) is the cosine between the face normal and the viewing vector.

If the result <=0 then the polygon is front facing.

if ( dot product <= 0)
// polygon is front facing
{

face->p1, face->p2 and face->p3 are indices to the vertex pool.

Use them to mark the appropriate vertices "visible"

}

}


Compact culling

Compact culling is derived from front-end culling and provides an algorithm that is accurate and is done in object space. Applying the following two optimizations reduces the storage per polygon by half:

1. Precalculate the dot product. Let's assume that:

P = position on polygon (in object space)

N = normal of polygon (in object space)

V = viewing vector (transformed to object space)

The visibility check calculates (P-V)*N, which can also be written as (P*N) - (V*N).

We can precalculate (P*N) and store the result. By storing the dot product, we've eliminated the need to store the polygon's position.

2. Store the normal in fixed-point representation. Because the polygon normal is normalized, its components are within the range –1 to +1. If we multiply the normal by the maximum value for signed word, it will be within the range of –32767 to +32767. We can now store the normal as a 2-byte signed word instead of a 4-byte float.

Our face structure is now

struct {

 

float pn; // precalculated position*normal
// 4 bytes

signed short normal[3]; // normal of polygon
// 6 bytes

WORD p1, p2, p3; // indices to vertex pool
// 6 bytes

} faceData; // = total size 16 bytes

The size of the compact structure is 50 percent of the size of the standard structure. Our culling loop now becomes:

for ( each face in the array)

{

Expand the normal back to float from its fixed-point representation.

Calculate face->tn – (face->normal * Vo)

Previously we calculated (P-V)*N here, but since we have tn = P*N (precalculated) we now need to calculate (tn – V*N)

if ( result <= 0) // polygon is front-facing

{

face->p1, face->p2 and face->p3 are indices to the vertex pool.

Use them to mark the appropriate vertices "visible"

 

}

}


Performance results

The performance improvements that compact culling offers depends on the implementation. However, even the most straightforward implementation should have some important benefits. For one, compact culling offers better memory utilization. Because the compact structure is half the standard size, there will be fewer cache misses. Furthermore, compact culling uses front-end culling for more objects. If we have both a heavy content application and memory size restrictions, we can use front-end culling for objects that we otherwise wouldn't be able to include in our scene.

The appendix contains a dowloadable test application that compares culling for two kinds of objects. Foreground and background objects are represented by a sphere and a terrain, respectively. Most game objects correspond to one of these types with regard to the number of culled triangles. On average, about of 50 percent of the triangles in a foreground object are back facing. Background objects have fewer back-facing triangles, depending on the kind of background.

The test application contains four culling implementations in C++:

1. Compact culling – Scalar implementation.

2. Compact culling – SIMD implementation.

3. Compact culling – SIMD implementation with prefetch instructions.

4. Non-compact (standard) culling implementation.

The first is a scalar implementation with a compact structure, as mentioned previously. The second and third use a compact structure expanded for SIMD (single instruction, multiple data); each component of the structure becomes an array of 4. These implementations allow us to use SIMD instructions, including MMX technology and Streaming SIMD Extensions, and to process four elements in parallel. The third implementation is identical to the second, with the addition of memory prefetch instructions. The fourth is the standard non-compact culling implementation.

Performance ratios are as follows. Note that a higher triangle count yields higher ratios because relatively more time is spent in the optimized kernel.

SPHERE OBJECT

# triangles

400

6400

20,000

Compact Scalar /
Non-Compact

1.15

1.2

1.2

Compact SIMD /
Non-Compact

1.45

1.6

1.7

Compact SIMD+prefetch /
Non-Compact

2.3

2.7

2.7

The facet normal-based front-end culling technique increases the total application performance by eliminating the excessive triangles early in the 3D pipeline. It also offers the advantage of the sequential memory access. The model size increase due to pre-calculated facet normals can be limited using the compact culling technique described in this paper. Our measurements have shown that using MMX technology and Streaming SIMD Extensions can double the performance of the compact culling.

Osnat Levi has a B.SC. (1991) in computer science from The Technion - Israel Institute of Tehnology. Her areas of concentration are in 3D graphics and optimization techniques. Osnat works in the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel. Before joining Intel, Osnat was developing software for embedded real time military systems. She can be contacted at [email protected].

Ronen Zohar has a B.A. (1998) in computer science from The Technion - Israel Institute of Technology. His areas of concentration are in 3D graphics, geometric modeling and computer vision. Ronen currently works in the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel. He can be reached at [email protected].

Haim Barad has a BSEE and BSCS from Tulane Univ (1981). He also has an MSEE (1983) and PhD (1987) from Univ of Southern California. His areas of concentration are in 3D graphics, video and image processing. Haim currently leads the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel.

Alex Klimovitski is a senior application engineer with Intel Developer Relations and Engineering Group Europe. Alex works on software libraries and tools for the latest Intel processors, most recently for Pentium®III. He assisted several leading software vendors in porting their products to PentiumIII. Before joining Intel, Alex was developing signal processing and scientific visualization software.


Appendix

This appendix contains a C++ implementation with all four methods: non-compact culling, compact scalar culling, compact SIMD culling, and compact SIMD culling with prefetch instructions.

The traditional, non-compact structure for storing triangle data is 32 bytes and is the following:

struct STriangleData {

 

float px,py,pz;
// position on the triangle

float nx,ny,nz;
// the normal of the triangle

WORD v1,v2,v3;
// indices to the vertex pool

WORD stub;
// padding to fill a cache line and
// avoid alignment stalls

};

The compact structure is is only 16 bytes. We store the normal in fixed-point representation instead of float. The tn field is a precalculated dot product of the position*normal of the triangle. Here it is:

struct SCompactTriangleData {

 

float tn; // precalculated dot product
// (position*normal)

signed short nx,ny,nz;
// the normal of the triangle

WORD v1,v2,v3;
// indices to the vertex pool

};

The compact SIMD structure has the same fields as the compact structure, except that each is four elements wide. The F32vec4 and Is16vec4 classes are defined in the Intel Vector Class Libraries. These libraries provide a C++ interface for using MMX Technology and Streaming SIMD Extensions. Here is the compact SIMD structure:

struct SCompactSIMDTriangleData {

 

F32vec4 tn; // 4 items of precalculated dot
// product (position*normal)

Is16vec4 nx,ny,nz;
// 4 items of the triangle's normal

WORD v[4][3];
// 4 items of indices to the vertex pool

};

The traditional backface culling routine is as follows:

int C3DObject::backfaceCulling()

{

 

// With 1 bit per vertex calculate number of
// dwords in cull flag array

DWORD cullFlagsLen = (m_nVerts + 31) / 32;
int i;
int ff = 0; // front facing triangles

// Mark all vertices as "invisible"
for (i=0 ; i<cullFlagsLen ; i++) {

 

m_vertsCulled[i] = 0;

}

}

for (i=0; i<m_nTris; i++) // for each triangle

{

 

float Tx,Ty,Tz;
float TN;

// Calculate viewing vector
Tx = m_tris[i].px - VOx;
Ty = m_tris[i].py - VOy;
Tz = m_tris[i].pz - VOz;

// Calculate dot product (between viewing vector
// and triangle normal)
TN = Tx*m_tris[i].nx + Ty*m_tris[i].ny + Tz*m_tris[i].nz;

if (TN <= 0.0f) // is front facing
{

 

ff++;

// mark the triangle 3 vertices as
// "visible"
SELECT_VTX(m_vertsCulled,m_tris[i].v1);
SELECT_VTX(m_vertsCulled,m_tris[i].v2);
SELECT_VTX(m_vertsCulled,m_tris[i].v3);

}

}
return ff;

}

The compact backface culling routine

The compact backface culling routine is much the same as the previous one, except for the fixed-point calculations and the use of the precalculated dot product.

int CCompactCull3DObject::backfaceCulling()
{

 

DWORD nVerts = getVertsCount();
DWORD nTris = getTrisCount();
DWORD *vertsCulled = getVertsCulledBitArray();
DWORD cullFlagsLen = (nVerts + 31) / 32;
int i;
int ff = 0; // front facing triangles

// Mark all vertices as "invisible"
for (i=0;i<cullFlagsLen;i++) {

 

vertsCulled[i] = 0;

}

for (i=0 ; i<nTris ; i++) // for each triangle
{

 

float Nx,Ny,Nz;
float TN;
// Convert the normal from fixed point to float

// normal components become floats between -SHRT_MAX to +SHRT_MAX
Nx = (float)m_triData[i].nx;
Ny = (float)m_triData[i].ny;
Nz = (float)m_triData[i].nz;

// Calculate dot product

// view*normal is multiplied by
// 1/SHRT_MAX to adjust to original
// normal values
TN = m_triData[i].tn - (VOx*Nx + VOy*Ny + VOz*Nz)*OOSHRT_MAX;

if (TN <= 0.0f) // is front facing
{

 

ff++;

// mark the triangle 3 vertices
// as "visible"
SELECT_VTX(vertsCulled,m_triData[i].v1); SELECT_VTX(vertsCulled,m_triData[i].v2); SELECT_VTX(vertsCulled,m_triData[i].v3);

 

 

}

return ff;

}

}

The compact SIMD backface culling routine

The compact SIMD backface culling routine is much the same as the compact culling routine, except for the use of the vector classes for parallelism. Note that we can't use an if (TN <= 0) expression here because the results may not be the same for all four elements. Instead we collect the sign bits of the dot product and check for each of the four elements separately.

int CCompactSIMDCull3DObject::backfaceCulling()
{

 

static F32vec4 _OOSHRT_MAX_ = F32vec4(OOSHRT_MAX);
DWORD nVerts = getVertsCount();
DWORD nTris = (getTrisCount() + 3)/4;
DWORD *vertsCulled = getVertsCulledBitArray();
DWORD cullFlagsLen = (nVerts + 31) / 32;
int i;
int ff = 0; // front facing triangles

// Mark all vertices as "invisible"
for (i=0;i<cullFlagsLen;i++)
{

 

vertsCulled[i] = 0;

}

// Duplicate each of the components of the
// viewing vector 4 times
F32vec4 _VOx_ = VOx, _VOy_ = VOy, _VOz_ = VOz;

for (i=0; i<nTris; i++) // each triangle
{

 

F32vec4 Nx,Ny,Nz;
F32vec4 TN;
int mask;

// Convert triangle normal from fixed
// point (16-bit integer) to float
Nx = Is16vec4ToF32vec4( m_triData[i].nx);
Ny = Is16vec4ToF32vec4( m_triData[i].ny);
Nz = Is16vec4ToF32vec4( m_triData[i].nz);

// Calculate dot product
TN = m_triData[i].tn - (_VOx_*Nx + _VOy_*Ny + _VOz_*Nz)*_OOSHRT_MAX_;

// Collect sign bits of dot products

// If all dot products are positive mask // gets the value of 0 otherwise it gets
// a value between 0x1-0xF
mask = move_mask(TN);

if (mask > 0)
// any of the 4 dot products is negative
{

 

for (int j=0; j<4; j++)
// do 4 times
{

 

if (mask & 1)
// dot product is negative
{

 

ff++;
// mark the
// triangle 3
// vertices as
// "visible"
SELECT_VTX(vertsCulled, m_triData[i].v[j][0]);

SELECT_VTX(vertsCulled, m_triData[i].v[j][1]);

SELECT_VTX(vertsCulled, m_triData[i].v[j][2]);

}

// shift right
mask >>= 1;

}

}

}
_m_empty();
return ff;

}

Return to the full version of this article
Copyright © UBM Tech, All rights reserved