Real-Time
Procedural textures are rarely used in real-time, hardware-based rendering engines. Procedural textures use the basic Perlin noise algorithm, which has many variants and is non-standard. In addition, each texture requires a different hardware circuit to implement it, whereas regular texture mapping uses the same circuit and simply loads different textures.
Procedural textures are also rarely used in real-time software rendering engines, primarily because the calculations are time-consuming. The Perlin gradient noise function interpolates random values that are precomputed for each lattice point in the object space. This computation is floating-point intensive and requires many table reads for each texel. In addition, the calculations required for turbulence and sine wave evaluation make the Perlin method even more time-consuming.
These problems seem to imply that the procedural texture method cannot produce the many megapixels per second required by real-time hardware and software game engines. However, we present an MMX implementation of Perlin noise that produces fast procedural textures, which perform competitively with regular texture mapping methods.
The following lists summarize the strengths and weaknesses of the two texture mapping methods:
Procedural Textures:
Traditional Texture Mapping:
Most procedural
texture mapping techniques are based on a noise function, such as Perlin
noise. Generally speaking, a noise function assigns each location in space
a pseudo-random value, but in a controllable way. The values are assigned
at the integer lattice points and are interpolated for all other points. The function
can be defined for any dimension (e.g. 1D, 2D, 3D, 4D...) and sampled at
arbitrary resolution.
The following
diagram shows how the image is built. Seven noise-function outputs are
scaled and added together. Noted under each of the images are the zoom
factors along with the amplitude modification factor. In practice, experiments
show octaves beyond 3 are essentially unneeded.
Since the
textures in Figures 3.1 and 3.2 are based on fBm, Appendix
A shows the code segment used.
Wood texture
can be computed using the relative distance of a point from the tree's
axis to construct rings of similar color, like wood rings. The algorithm
calculates the radius and perturbs it with the turbulence, which is the
fractional Brownian motion discussed in Section 3.0. Thus, the wood at
point (u,v) is evaluated at
The algorithm for wood at texel location x = (u,v) contains four steps:
Wood[]
is an array of gradient colors, based on the RenderMan wood function (see
"The RenderMan Companion: A Programmer's Guide to Realistic Computer Graphics",
by Steve Upstill, Addison-Wesley). For each point, this wood function
uses the fractional part of the perturbed radial distance from the tree
axis as an interpolation parameter in the range [0, 1], going smoothly
from black to brown. Since MMX technology lacks a fast, parallel
square root evaluation, step 1 reads the square root values from a table.
Another table is used in step 4 for RenderMan's wood function.
All the
points (u,v) which obey the equation,
The code in Listing 1 implements the wood texture algorithm.
Listing 1. Wood Texture Algorithm.
The SIMD_Octave() procedure pre-calculates the turbulence values and stores them in a buffer. The rest of the algorithm is implemented in the two wood procedures, SIMD_Wood_Linear() and SIMD_Wood_Sqrt().
The table containing the square root values is initialized as follows:
for (i = 0; i < 2048; i++)
    sqrtTable[i] = (unsigned __int16)floor(sqrt((double)(i << 10)));
This initialization section is used to set up the smooth gradient wood colors. The colors start off black and smoothly change to brown.
for (i = 0; i < 6000; i++)
{
    //The equation for "r" is just a linear one. If graphed,
    //a line with positive slope results. This gives us the backbone
    //for the smooth gradients that will be used for the wood color.
    r = (float)4.0 * i;
    r *= 1.0 / 512.0;
    r -= (float)floor(r);
    r = smoothstep((float)0, (float)0.83, r) - smoothstep((float)0.83,
        (float)1.0, r);
    comp_r = 1 - r;
    //Once r is calculated, the individual red, green, and blue components
    //are found. These components are on a scale from 0.0 to 1.0.
    wood_red = r * (float)0.30 * 2.0 + comp_r * (float)0.050 * 2.0;
    wood_green = r * (float)0.12 * 2.0 + comp_r * (float)0.010 * 2.0;
    wood_blue = r * (float)0.03 * 2.0 + comp_r * (float)0.005 * 2.0;
    red = ((long)(wood_red * 255)) & 0xF8;
    green = ((long)(wood_green * 255)) & 0xFC;
    if (FORMAT565) //for 565 bits per pixel
For the
MMX technology source code listings of SIMD_Octave() and SIMD_Wood(),
see Appendix A and Appendix
B respectively. For the linear approximation of the sqrt() version
of the code, see Appendix G.
Marble has
a fractal-like appearance, which can be approximated by evaluating
sin(x + turbulence(x)) and applying a perturbation based on turbulence(x)
to the object's normals during the lighting procedure.
Figure
5.1 illustrates the first four steps of the marble texture algorithm.
The fifth step is part of the lighting procedure and will be explained
in Section 7.2.
Since the algorithm uses fixed point arithmetic, the inputs to the sine in step 2 are known in advance. Therefore, the complicated calculation of steps 2, 3 and 4 can be performed at setup time and stored in a table. At rendering time the algorithm indexes into this table. Although the reads are serial, they take only a small amount of time compared to the rest of the parallel computation. The actual marble algorithm at point x = (u,v) is as follows:
The content
of the Marble table can be replaced with a different variant every few
frames, without impacting overall performance. With image texture mapping,
the same change would require loading a new texture from memory.
In addition, the original steps 2, 3 and 4 can be replaced with any texture
calculation based on the location (u) and the turbulence of (u,v). When
calculating the turbulence, the number of octaves used is critical. As
more octaves are used, the computation time increases but the end result
is better. See Figures 5.4 to 5.8.
The code in Listing 2 implements the marble texture algorithm.
Listing 2. The Marble Texturing Algorithm.
void marblePassMMX(unsigned long u_init, unsigned long v_init,
//Clear out the turbulence buffer.
//Calculate the turbulence
//Using the averaging scheme for the even pixels while the odd pixels are calculated,
//Calculate the marble colors for the scanline.
memcpy(screen_buffer, turbulenceBuf, sizeof(__int16) * Num_Pix);
The SIMD_Octave()
procedure calculates the turbulence values and stores them in a buffer.
The rest of the algorithm is implemented in the SIMD_Marble()
procedure. SIMD_Marble()
uses the values of the turbulence buffer filled by SIMD_Octave
with several octaves of noise. Four pixels are calculated in each iteration.
for (i = 0; i < 5000; i++)
{
    val = (double)i / 256.0;
    sin_val = (sin(val * Pi) + 1.0) * 0.5;
    red = ((long) ((0.33 + 0.66 * sin_val) * 256)) & 0xF8;
    blue = ((long) ((0.60 + 0.39 * sin_val) * 256)) & 0xF8;
    if (FORMAT565) //for 565 bits per pixel
    {
        green = ((long) ((0.27 + 0.72 * sin_val) * 256)) & 0xFC;
        MarbleTable[i] = (unsigned __int16)((red << 8) |
            (green << 3) | (blue >> 3));
    }
    else //for 555 bits per pixel
    {
        green = ((long) ((0.27 + 0.72 * sin_val) * 256)) & 0xF8;
        MarbleTable[i] = (unsigned __int16)((red << 7) |
            (green << 2) | (blue >> 3));
    }
}
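The bit layout behind the two branches above can be isolated into a pair of helpers. The following C sketch uses the same masks and shifts as the listing; the function names `pack_rgb565` and `pack_rgb555` are hypothetical and do not appear in the original code.

```c
#include <stdint.h>

/* Pack 8-bit channels into a 16-bit 5-6-5 pixel: red in bits 15-11,
   green in bits 10-5, blue in bits 4-0. */
uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}

/* Pack 8-bit channels into a 16-bit 5-5-5 pixel: red in bits 14-10,
   green in bits 9-5, blue in bits 4-0; the top bit is unused. */
uint16_t pack_rgb555(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r & 0xF8) << 7) | ((g & 0xF8) << 2) | (b >> 3));
}
```

The only difference between the formats is that 565 keeps six bits of green (mask 0xFC) and shifts red one position higher.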
With games
that use a perspective viewing frustum instead of an orthogonal view,
drawing perspective-corrected textures can be difficult. The mathematics
required to draw perfect perspective textures is generally too much for
a PC to handle in real time. Algorithms can approximate perspective without
many viewing artifacts. One algorithm, quadratic approximation, involves
finding the per-pixel change in du and dv across each scanline, known
as ddu and ddv respectively.
Using ddu and ddv to update du and dv across each scanline poses another problem for the new procedural texture scanline algorithms. Since four pixels are calculated in parallel, DU, DV, DDU, and DDV must be calculated in parallel as well, and the assembly code needed to set up these parameters is expensive to execute. As an alternative, the procedural textures developed in this application don't use ddu and ddv to update du and dv for each pixel drawn.
As a result,
if the polygons are too big, gross artifacts develop during the texturing
process. There are ways to get around this. One is to keep the polygons
small with small scanlines. When only drawing a few pixels, there isn't
enough time for errors to accumulate. The other technique is to sub-divide
each long scanline into shorter segments. Many short line segments can
be put end-to-end to construct a longer segment. For example, if a scanline
is 600 pixels long, it can be drawn as 37 16-pixel scanlines, with eight
pixels left over. At the start of each sub-scanline, the du, dv parameters
are recalculated to remove error accumulation. This technique works well,
but in some instances a ripple artifact can be seen in the textures.
This is because errors accumulate as each pixel is drawn. After
pixel N, the du and dv values are recalculated to be exact. As more pixels are
drawn, the errors begin to accumulate again until, at pixel 2N, the du and dv
values are recalculated once more. Repeated across the scanline,
this cycle of drift and correction shows up as a ripple.
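The subdivision itself is simple bookkeeping, as the following C sketch shows. The function name `subdivide_scanline` and the `span_fn` callback type are assumptions made for this example; in a real rasterizer the callback would recompute exact du and dv values for its starting pixel before drawing.

```c
#include <stddef.h>

/* Hypothetical callback: draws `count` pixels starting at pixel `start`,
   after recomputing exact du/dv for that position. */
typedef void (*span_fn)(size_t start, size_t count, void *user);

/* Split a scanline of `length` pixels into spans of at most `span_len`
   pixels and hand each one to `draw`. Returns the number of spans. */
size_t subdivide_scanline(size_t length, size_t span_len,
                          span_fn draw, void *user)
{
    size_t spans = 0;
    if (span_len == 0)
        return 0;
    for (size_t start = 0; start < length; start += span_len) {
        size_t remaining = length - start;
        size_t count = remaining < span_len ? remaining : span_len;
        if (draw)
            draw(start, count, user);   /* error is reset at each span start */
        spans++;
    }
    return spans;
}
```

For a 600-pixel scanline and 16-pixel spans this yields 38 calls: 37 full spans followed by one 8-pixel remainder.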
Table 1
Table 2
The MMX registers containing the initial U and V values need to be set up as shown below:
;Note: UV values are stored in 10.22 fixed integer format.
;This sets up the U parameters for pixels 1 and 3 in register MM0 and
;V in MM1. After setup, the registers will contain:
;       |--------- 32 bit ------------|
;      +-------------------------------------------------------------------+
;MM0 = | U texel for pix #1 = u + du | U texel for pix #3 = u + 3du + 3ddu |
;      +-------------------------------------------------------------------+
;      +-------------------------------------------------------------------+
;MM1 = | V texel for pix #1 = v + dv | V texel for pix #3 = v + 3dv + 3ddv |
;      +-------------------------------------------------------------------+
The code
in Appendix D shows how
this is done.
Table 3
As shown in the above table, the MMX registers used to contain the initial DU and DV values need to be set up as shown below.
;Note: du dv texel values are stored in 10.22 fixed integer format.
;This sets up the du parameters for pixels 1 and 3 in MM0 register and
;dv parameter in MM1 register. After setup, the registers will contain:
;       |--------- 32 bit --------------|
;      +---------------------------------------------------------------+
;MM0 = | DU texel for p1 = 4du + 10ddu | DU texel for p3 = 4du + 18ddu |
;      +---------------------------------------------------------------+
;      +---------------------------------------------------------------+
;MM1 = | DV texel for p1 = 4dv + 10ddv | DV texel for p3 = 4dv + 18ddv |
;      +---------------------------------------------------------------+
To determine what the DDU and DDV values should be, the change in the DU and DV values is measured when moving from pixel to pixel. Applying the formula DDU = Next DU - Previous DU to the previous table produces the following table of values:
This table shows that the initial values for variables DDU and DDV should be set up in the MMX registers as shown in the following:
;Note: ddu ddv texel values are stored in 10.22 fixed integer format.
;This sets up the ddu parameters for pixels 1 and 3 in MM0 register and
;ddv parameter in MM1 register. After setup, the registers will contain:
;       |--------- 32 bit ---------|
;      +-----------------------------------------------------+
;MM0 = | DDU texel for p1 = 16ddu | DDU texel for p3 = 16ddu |
;      +-----------------------------------------------------+
;      +-----------------------------------------------------+
;MM1 = | DDV texel for p1 = 16ddv | DDV texel for p3 = 16ddv |
;      +-----------------------------------------------------+
Since
the DDU and DDV terms are constant, no additional calculations are required
across the scanline.
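The register values above can be checked against ordinary per-pixel forward differencing. The C sketch below is an illustration only: it assumes the scalar recurrence is u += du followed by du += ddu, and it generalizes the diagram entries for pixels 1 and 3 to a starting pixel `first` by a formula derived for this example (4du + (4*first + 6)ddu). The function names are hypothetical.

```c
#include <stdint.h>

/* Scalar reference: u after n single-pixel steps of quadratic forward
   differencing (u += du; du += ddu). */
int64_t u_scalar(int64_t u, int64_t du, int64_t ddu, int n)
{
    for (int i = 0; i < n; i++) {
        u  += du;
        du += ddu;
    }
    return u;
}

/* Four-pixel stride for the lane that starts at pixel `first`
   (1 or 3 in the register diagrams). Returns u at pixel first + 4*steps. */
int64_t u_stride4(int64_t u, int64_t du, int64_t ddu, int first, int steps)
{
    /* Initial U: u + first*du + first*(first-1)/2 * ddu
       (u + du for pixel 1, u + 3du + 3ddu for pixel 3). */
    int64_t U = u + first * du + (int64_t)first * (first - 1) / 2 * ddu;
    /* Initial DU: 4du + 10ddu for pixel 1, 4du + 18ddu for pixel 3. */
    int64_t DU = 4 * du + (4 * (int64_t)first + 6) * ddu;
    /* DDU is the same for every lane and never changes: 16ddu. */
    int64_t DDU = 16 * ddu;
    for (int s = 0; s < steps; s++) {
        U  += DU;
        DU += DDU;
    }
    return U;
}
```

Stepping a lane five times with `u_stride4` lands on the same value as 21 (or 23) single-pixel steps of `u_scalar`, which confirms that the three register setups are mutually consistent.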
Gouraud
shading calculates the color at each vertex of the polygon and interpolates
it for each internal pixel. The Phong method calculates the color at each
internal pixel by interpolating the normal, but it's an expensive calculation.
Therefore, most graphics systems implement the Gouraud method, often without
the specular part.
For the second effect, the 8-9 bit turbulence value is divided by 64. The result's fraction is multiplied by the interpolated 'specular' component and used in the lighting equation. At pixel P, having turbP, the final color is calculated as follows:
The above
images show what is possible when using noise to perturb color and normals.
Figure 7.1 is the normal lighted image. Figures 7.2 and 7.3 show what
is possible when using the above techniques.
Under the ANSI C standard, when an application converts a number from floating
point to integer, the number is truncated. On Pentium and Pentium II processors,
this truncation is expensive because it involves changing the floating
point control word. During the rendering process there are many places
where ftol is called: in the polygon setup part and when converting
the output of the lighting to RGB integer values. To save the extra cycles
wasted on truncation, the fast_ftol procedure presented here 'rounds
to nearest'.
result dd 0 ;(in the data section)
PUBLIC _fast_ftol
_TEXT SEGMENT
_d$ = 4
_fast_ftol PROC NEAR
fld DWORD PTR _d$[esp]
fistp DWORD PTR result
mov eax , DWORD PTR result
ret 0
_fast_ftol ENDP
_TEXT ENDS
Sometimes
objects require a Z-buffer in the rendering process. A fixed point 16-bit
representation for Z values enables MMX technology to process four data
elements (words) in parallel. Using the "compare" instruction (instead of
branches) prevents possible stalls after branch misprediction on the Pentium
and Pentium II processors.
Unlike in a conventional texture mapping engine, Z-Buffering becomes a problem as each new texture is developed and written in assembly. The programmer must incorporate optimized Z-Buffer code for each procedural texture developed. This is difficult and tedious to do, but there are two solutions to this problem. One is to come up with a standard Z-Buffer code template that can be dropped into the appropriate section of the texture mapping code. The other is to come up with a separate function callable by procedural texture mappers. As with most engineering decisions, tradeoffs are involved. Integrating the Z-Buffer with each procedural texture function is clearly the fastest choice but requires more work from the developer.
The algorithm used for Z-Buffer integration is based on the application note 3D Z-Buffer Using MMX Technology. This algorithm removes the jump/compare per pixel typically needed.
The Z-Buffer integration can be broken up into four sections. The first is the initialization. The next section draws four 16 bit pixels at a time to the display. For scan lines whose length is not a multiple of four, the third section handles the initialization of registers that will be used to draw the three or fewer end pixels. The last section draws these pixels.
Section #1 is the initialization section of the standard Z-Buffer code template. This part should be included outside of the main rasterization loop. Code is optimized to compute Z values for four 16 bit pixels at a time. Two 64 bit MMX registers are split up to accommodate four 32 bit Z-Buffer values. 16 bits are used for the integer part while 16 bits are used for the fractional part. For the Z-Buffer write to the depth surface, a 64 bit write accommodates four pixels at a time (this is because the 16 bit fractional part of each Z-value is discarded).
Variable definitions:
"z_start" and "dz" are two variables passed in at the beginning of the procedure. The following code segment shows how the variables "high_z", "low_z", and "z_inc" are calculated.
MOVD MM0, z_start
MOVD MM2, dz
PUNPCKLDQ MM0, MM0
PSLLQ MM2, 32
PADDD MM0, MM2
MOVQ low_z, MM0
PUNPCKHDQ MM2, MM2
PSLLD MM2, 1
PADDD MM0, MM2
MOVQ high_z, MM0
PSLLD MM2, 1
MOVQ z_inc, MM2
After initialization, the variables hold the following information:
Note: The following are what the values look like when stored in a register.
|------- 32 bits ------|
+---------------------------------------------+
MM0 = high_z = | z_start + 3dz | z_start + 2dz |
+---------------------------------------------+
|------- 32 bits ------|
+---------------------------------------------+
MM1 = low_z = | z_start + 1dz | z_start |
+---------------------------------------------+
|------- 32 bits ------|
+---------------------------------------------+
MM2 = z_inc = | 4dz | 4dz |
+---------------------------------------------+
Once the memory
write occurs, this is what the first 8 bytes will look like:
|--- 16 bits ---|
+---------------------------------------------------------------+
Z_Buffer = | z_start + 0dz | z_start + 1dz | z_start + 2dz | z_start + 3dz |
+---------------------------------------------------------------+
Address 0 1|2 3|4 5|6 7
Section #2: After initialization, this section draws pixels in multiples of four.
PUSH ESI
MOV ESI, z_buffer ;ESI = pointer to four Z values being looked at in Z-Buffer.
;Get the new Z-Buffer values for the four pixels being drawn.
MOVQ MM4, low_z ;Move two rightmost Z-Buffer values into MM4
PSRAD MM4, 16 ;Discard the fractional part of the two Z values
MOVQ MM2, high_z ;Move the leftmost Z-Buffer values into MM2
PSRAD MM2, 16 ;Discard the fractional part of the two Z values
PACKSSDW MM4, MM2 ;Mesh all four Z-Buffer values into one register
;Update the four pixel screen values.
MOVQ MM2, [ESI] ;MM2 = the old Z values currently in the Z-Buffer.
PCMPGTW MM2, MM4 ;Perform a compare between the old and the new Z values.
MOVQ MM3, MM2 ;Save a copy of MM2 register.
PAND MM1, MM2 ;MM1 = colors of the current 4 pixels to be drawn.
PANDN MM3, [EDI] ;[EDI] = Pointer to existing 4 pixels in the screen buffer.
POR MM1, MM3 ;"OR" old and new contents together for the 4 pixel colors.
MOVQ [EDI], MM1 ;Write out the 4 pixels to video memory.
;Update the four Z-Buffer values.
MOVQ MM3, MM2 ;Save a copy of MM2 register.
PAND MM2, MM4
PANDN MM3, [ESI] ;[ESI] = Pointer to existing 4 Z-Buffer values.
POR MM2, MM3 ;"OR" old and new contents together for the 4 Z values.
MOVQ [ESI], MM2 ;Update the Z-Buffer with the 4 new values.
;Update "high_z" components. This is Z = Z + Z_inc
MOVQ MM0, z_inc
PADDD MM0, high_z
MOVQ high_z, MM0 ;Add Delta_Z to the High Z components.
;Update "low_z" components. This is Z = Z + Z_inc
MOVQ MM0, z_inc
PADDD MM0, low_z
MOVQ low_z, MM0 ;Add Delta_Z to the Low Z components.
;Update the Z-Buffer pointer by four pixels.
ADD z_buffer, 8 ;z_buffer pointer is incremented eight bytes (4 pixels).
;Restore ESI
POP ESI
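The compare-and-mask idea in Section #2 can be written out in scalar C to make the data flow easier to follow. This is an illustrative sketch, not a replacement for the MMX code: the function name `zselect4` and its array parameters are hypothetical, and a C compiler may or may not vectorize the loop.

```c
#include <stdint.h>

/* Process four 16-bit pixels: where the new Z is nearer than the stored Z,
   take the new colour and depth; elsewhere keep what is already stored.
   The selection is done with masks, not with a per-pixel branch. */
void zselect4(uint16_t color[4], int16_t zbuf[4],
              const uint16_t new_color[4], const int16_t new_z[4])
{
    for (int i = 0; i < 4; i++) {
        /* All ones when old Z > new Z, otherwise zero (mirrors PCMPGTW). */
        uint16_t mask = (uint16_t)-(zbuf[i] > new_z[i]);
        uint16_t keep = (uint16_t)~mask;
        /* (new AND mask) OR (old AND NOT mask), as PAND / PANDN / POR do. */
        color[i] = (uint16_t)((new_color[i] & mask) | (color[i] & keep));
        zbuf[i]  = (int16_t)(((uint16_t)new_z[i] & mask) |
                             ((uint16_t)zbuf[i] & keep));
    }
}
```

Pixels whose new depth equals the stored depth are left untouched, because the comparison is strictly greater-than.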
Section #3: For the three or fewer pixels at the end of the scanline, the following code template can be used. It initializes certain registers and variables and therefore shouldn't be put into the main loop. This part is used to point ESI to the Z-Buffer where the pixel write is going to occur. CX will contain the current Z-depth value.
MOVQ MM2, low_z ;We want the starting Z-Buffer value
PSRLD MM2, 16 ;Truncate the 16 bit fractional part.
MOVD ECX, MM2 ;Copy the Z-value to CX
MOV ESI, z_buffer ;ESI points to the Z-Buffer
Section #4: This section handles drawing the pixels and Z-Buffer update for the three or fewer pixels at the end of the scanline. This code is based on traditional Z-Buffering. A compare is made and a branch is taken depending on the results of the compare. The code is self-explanatory so no explanation will be given.
end_pixels:
CMP CX, [ESI] ;Compare new Z value against old value in Z-Buffer.
JGE skip_pix ;If new Z value is greater than old then skip the pixel write.
MOVD EAX, MM3 ;Move the previous color to eax
MOV [EDI], AX ;Write 16 bit color to video buffer.
MOV [ESI], CX ;Write new Z value to Z-Buffer.
skip_pix:
ADD EDI, 2 ;Increment the pointer to the video buffer.
ADD ESI, 2 ;Increment the pointer to the Z-Buffer.
PSRLQ MM3, 16 ;Shift to the next color.
DEC EDX ;Decrement the end pixel counter.
JNZ end_pixels ;Repeat if there are more pixels to draw.
Programmer dilemmas:
The function
then runs through each of the pixels in the scanline and determines whether
or not the pixel should be drawn based on the calculated Z-values. This
allows the programmer to put any information into the off-screen scanline
buffer. Then the Z-Buffer function writes pixels to the display depending
on the Z-depth values.
void z_buffer(unsigned __int16* screen_pointer,
              unsigned __int16* temp_buffer,
              signed __int16* z_pointer, long z_start, long dz,
              unsigned long num_pixels)
{
    unsigned long index;
    for (index = 0; index < num_pixels; index++)
    {
        //Compare the integer part of the 16.16 fixed point Z value
        //against the value stored in the Z-Buffer.
        if ((z_start >> 16) < *(z_pointer))
        {
            *(z_pointer) = (signed __int16)(z_start >> 16);
            *(screen_pointer) = temp_buffer[index];
        }
        z_pointer++;
        screen_pointer++;
        z_start += dz;
    }
}
More optimized
versions can be written by converting the above into assembly using aligned
64 bit writes with MMX technology. See Appendix
E for a better full featured Z-Buffer scanline algorithm, fully optimized
for the Pentium and Pentium II processors.
The table below gives clock cycle information on the various code samples
in this document. These results were obtained through Intel's
VTune profiler utility.
The procedures/code
segments in the first table are meant to be called outside of the main
rasterization loop. Therefore only the number of clocks required for one
pass is given. These values are the number of clock cycles required to
calculate four pixel values. To find clks/pix, divide by four. Because
these routines are called far less often than the others, memory stalls occur more
often. This significantly drives up the clock/pixel ratio.
This article
and the earlier application note, Using
MMX Instructions for Procedural Texture Mapping, present a new approach
for implementing procedural textures using MMX technology. Using the Perlin
noise function as a building block, wood, marble and grass textures were
developed. Based on one octave of noise, marble takes 40 clocks, wood
takes 44 clocks, while simple grass takes 30 clocks, as measured on the
Pentium II processor. Perspective correction and z-buffering add more
cycles.