|
Features

Building a Million-Particle System
Particle Simulation
on Graphics Hardware
The following sections describe
the algorithm of a state-preserving particle system
on a GPU in detail. After a brief overview, the used
graphics hardware features, the storage and then the
processing of particles is described.
Algorithm Overview
The state-preserving particle system
stores the velocities and positions of all particles
in textures. These textures are also render targets.
In one rendering pass the texture with particle velocities
is updated according to the stream processing model
(cf. "General-purpose computation on graphics
hardware"). The update performs one time step
of an iterative integration which applies acceleration
forces and collision reactions. Another rendering
pass updates the position textures in a similar way,
using the just computed velocities for the position
integration. Depending on the integration method it
is possible to skip the velocity update pass, and
directly integrate the position from accelerations.
Optionally, the particle positions
can be sorted depending on the viewer distance to
avoid rendering artifacts later on. The sorting performs
several additional rendering passes on textures that
contain the particle distance and a reference to the
particle itself.
Then the particle positions are
transferred from the position texture to a vertex
buffer. Finally this geometry data is rendered to
the screen in a traditional way – as point sprites
or primitive triangles or quads.
Hardware Requirements
The simulation on the GPU requires
functionality that has only recently become available
in PC graphics hardware. One key functionality is
a floating point programmable pixel pipeline. The
other one is a mechanism for communicating pixel data
to the vertex pipeline. The latter is already common
on video game consoles with unified memory access,
whereas it is quite new in PC graphics hardware. "Transfer
texture data to vertex data" discusses two approaches
to this problem.
The algorithm's application to video
game console hardware is only limited, though. The
Microsoft Xbox offers 8-bit precision in the pixel
pipeline. It is only capable of performing a low precision
simulation, which limits the maximum extents of the
particle system. Compared to PC hardware, the Sony
PlayStation 2 is architecturally quite different and
it does not offer a programmable pixel pipeline. Its
programmable geometry unit is however capable of running
the stateless particle simulation (cf. "Stateless
particle systems") quite efficiently.
As details about the upcoming generation
of video game consoles are not publicly available
at this time, one can only speculate about their capabilities.
Considering the trend towards increased parallel computation,
it can be assumed that a parallel physical particle
simulation is at least conceptually suitable for these
devices.
Particle Data
Storage
The most important attributes of
a particle are its position and velocity. The positions
of all active particles are stored in a floating point
texture with three color components that will be treated
as x, y and z coordinates. Each texture is conceptually
treated as a one-dimensional array, texture coordinates
representing the array index. The actual textures
however need to be two-dimensional due to the size
restrictions of current hardware. The texture itself
is also a render target, so it can be updated with
the computed positions. In the stream processing model
(covered in the section "General-purpose computation
on graphics hardware"), it represents either
the input or the output data stream. As a texture
cannot be used as input and output at the same time,
we use a pair of these textures and a double buffering
technique to compute new data from the previous values
(see Figure 1).
A pair of velocity textures can
be created in the same way as the position textures.
Due to their reduced precision requirements it is
usually sufficient to store the velocity coordinates
in 16-bit floating point format. Depending on the
integration algorithm there is no need to store the
velocity explicitly (cf. "Update positions").
If the velocity is not stored in textures, you need
a third position texture, which basically forms a
triple buffer.
If other particle attributes (like
orientation, size, color, and opacity) were to be
simulated with the iterative integration method, they
would need texture double buffers as well. However,
since these attributes typically follow simple computation
rules or are even static, we can take a simpler approach.
An algorithm similar to the stateless particle system
(cf. "Stateless particle systems") can be
used to compute these values only from the relative
age of the particle and a function description, e.g.
initial and differential values or a set of keyframes.
To be able to evaluate this function, we need to store
two static values for each particle: its time of birth
and a reference to a set of attribute parameters for
its particle type. They are stored in a further texture,
but the double buffer approach is not necessary.
We assume that the particles can
be grouped by a particle type in order to minimize
the amount of static attribute parameters that need
to be uploaded during the final rendering of the particles.
This particle type can either be directly coupled
in a one-to-one relationship to the particle emitter
or a group of emitters emits all particles of the
same type.
The mass of a particle needs to
be known to calculate accelerations from forces. Possible
approaches are: treating all particles as having equal
mass, uploading a mass value or function as particle
type parameters, or storing the mass of each particle
in the static data texture described above.
To sum up: a single particle consists
of data values spread between several textures, but
placed at equal texture coordinates in all those textures.
According to demand particle type parameters also
allow the computation of further particle values.
Simulation and Rendering Algorithm
The algorithm consists of six basic
steps:
1. Process birth and death
2. Update velocities
3. Update positions
4. Sort for alpha blending (optional)
5. Transfer texture data to vertex data
6. Render particles
1. Process
Birth and Death
The particles in a system can either
exist permanently or only for a limited time. A static
number of permanently existing particles represents
the simplest case for the simulation, as it only requires
uploading all initial particle data to the particle
attributes textures once. As this case is rather rare,
we assume a varying number of short-living particles
for the rest of the discussion. The particle system
must then process the birth of a new particle, i.e.
its allocation, and the death of a particle, its deallocation.
The birth of a particle requires
associating new data with an available index in the
attribute textures. Since allocation problems are
serial by nature, this cannot be done efficiently
with a data-parallel algorithm on the GPU. Therefore
an available index is determined on the CPU via traditional
fast allocation schemes. The simplest allocation method
uses a stack filled with all available indices. A
more complex allocator uses a heap data structure
that is optimized to always return the smallest available
index. Its advantage is that if a particle system
has a highly varying number of particles, the particles
remain packed in the first portion of the system.
The following simulation and rendering steps only
need to update that portion of data, then. After the
index has been determined, the new particle's data
is rendered as single pixel into the attribute textures.
This initial particle data is determined on the CPU
and can use complex algorithms, e.g. various probability
distributions for initial starting positions and directions
etc. (cf. [McAllister2000])
A particle's death is processed
independently on the CPU and GPU. The CPU registers
the death of a particle and adds the freed index to
the allocator. The GPU does an extra pass over the
particle data: The death of a particle is determined
by the time of birth and the computed age. The dead
particle's position is simply moved to invisible areas,
e.g. infinity. As particles at the end of their lifetime
usually fade out or fall out of visible areas anyway,
the extra pass rarely really needs to be done. It
is basically a clean-up step to increase rendering
efficiency.
2. Update Velocities
The first part of the simulation
updates the particles' velocity. The actual program
code for the velocity simulation is a pixel shader
which is used with the stream processing algorithm
described in "General-purpose computation on
graphics hardware". The shader is executed for
each pixel of the render target by rendering a screen-sized
quad. The current render target is set to one of the
double buffer velocity textures. The other texture
of the double buffer is used as input data stream
and contains the velocities from the previous time
step. Other particle data, either from inside the
attribute textures or as general constants, is set
before the shader is executed.
There are several velocity operations
that can be combined as desired (cf. [Sims1990]
and [McAllister2000]):
global forces (e.g. gravity, wind), local forces (attraction,
repulsion), velocity dampening, and collision responses.
For our GPU-based particle system these operations
need to be parameterized via pixel shader constants.
Their dynamic combination is a typical problem of
real-time graphics. It is comparable to the problem
of light sources and material combinations and can
be solved in similar ways. Typical operation combinations
are to be prepared in several variations beforehand.
Other operations can be applied in separate passes,
as all operations are completely independent.
Global forces, e.g. gravity, influence
particles regardless of their position with a constant
acceleration in a specific direction. The influence
of local forces however depends on the particle's
position which is read from the position texture.
A magnet attracting or repelling a particle has a
local acceleration towards a point. This force can
fall off with the inverse square of the distance or
it can stay constant up to a maximum distance (cf.
[McAllister2000]).
A particle can also be accelerated towards its closest
point on a line, leading to a vortex-like streaming.
A more complex local force can
be extracted from a flow field texture. Since texture
look-ups are very cheap on the GPU, it is quite efficient
to map the particle position into a 2D or 3D texture
containing flow velocity vectors. This sampled flow
vector vfl can be used with Stoke's law
of a drag force Fd on a sphere:

where n is the flow viscosity,
r the radius of the sphere (in our case the
particle) and
the particle's velocity. The constants can all be
combined to a single constant c , for efficient
computation and for simpler manual adjustment.
Global and local forces are accumulated
into a single force vector. The acceleration can then
be calculated with Newtonian physics:

where a is the acceleration
vector, F the accumulated force and m
the particle's mass. If all particles have unit mass,
forces have the same value as accelerations and can
be used without further computation.
The velocity is then updated from
the acceleration with a simple Euler integration in
the form:

where v is the current velocity,
the
previous velocity and
the time step. Another simple velocity operation is
dampening, i.e. a scaling of the velocity vector,
which imitates viscous materials or air resistance.
This is basically a special case of equation 1 with
a flow velocity of zero. The reverse operation, an
undampening, can be used to imitate self-propelled
objects, e.g. a bee swarm.
A more important operation is collision.
Collision with complex polygonal geometry is hardly
practical on current GPUs, whereas collision with
a plane or bounding spheres is rather cheap. The real
strength of collision on the GPU however is collision
against texture-based height fields that are typically
used to model terrain. By sampling the height field
three times a normal can be computed which is then
used for calculating the reflection vector. The normal
can also be stored in the height field, basically
making it a scaled normal map.
If a collision has been detected,
the collision reaction, i.e. the velocity after the
collision, has to be computed (cf. [Sims1990]).
First the current velocity has to be split into a
normal and a tangential component. If n is the normal
of the collider at the collision point, these can
be computed as:
where vbc is the velocity
computed so far, i.e. before the collision occurs,
vn is the normal component of the velocity
and vt the tangential one. The velocity
after the collision can now be computed with two further
parameters describing material properties. Dynamic
friction
reduces the tangential component, and resilience
scales the reflected normal component. The new velocity
is computed as:

This default handling of the collision
however has two problems which cause visual artifacts.
The slow-down effect of the dynamic friction will
lead to situations where the velocity is (very close
to) zero. Since in our case the collision is processed
after acceleration forces like gravity, this might
lead to particles hanging in the air. They virtually
seem to be attached to the side of a collider, e.g.
at the equator of a sphere collider with respect to
the global gravity. Therefore the friction slow-down
should not be applied if the overall velocity is smaller
than a given threshold.
The second problem is caused by
particles getting caught inside a collider. A collider
with sharp edges, e.g. a height field, or two colliders
close to each other might push particles into a collider.
This can be avoided by trying to push a caught particle
out of the collider. Normally, the collision detection
is done with the expected next particle position to
avoid the particle from entering an object for the
time of one integration step (cf. figure 2a). The
expected particle position pbc is computed
as

where
is the previous position. Doing the collision detection
twice, once with the previous and once with the expected
position, allows differentiating between particles
that are about to collide and those having already
penetrated (see Figure 2b). The latter can then be
pushed out of the collider, either immediately or
by applying the collision velocity without any slow-down.
The direction of the shortest way out of the collider
can be guessed from the normal component of the velocity:

3. Update Positions
The second part of the particle
system simulation updates the position of all particles.
Here we are going to discuss the possible integration
methods in detail that have been mentioned earlier
(cf. "Particle data storage"). For the integration
of large data sets on the GPU in real-time only simple
integration algorithms can be used. Two good candidates
for particle simulation are Euler and Verlet integration.
Euler integration has already been used in the previous
section to integrate the velocity by using the acceleration.
The computed velocity can be applied to all particles
in just the same way. This leads to

where p is the current position
and the previous
position.
In some ways even simpler than Euler
integration is Verlet integration (cf. [Verlet1967]).
Verlet integration for a particle system (cf. [Jakobsen2001])
does not store the velocity explicitly. Instead, the
velocity is implicitly deduced by comparing the previous
position to the one before. The great advantage of
this handling for the particle simulation is that
it reduces memory consumption and removes the velocity
update rendering pass.
If we assume the time step is constant,
the combination of the velocity and position update
rules from the Euler integration can be combined to
a position update rule based only on the acceleration:

where
is the position two time steps ago. Verlet integration
handles simple (global or local) acceleration forces
quite efficiently. However, complex velocity operations,
like the collision reaction discussed above, require
position manipulations to implicitly change the velocity
in the following frames. Alternatively, collision
can be handled more efficiently with position constraints
that simply move the position out of the collider.
Due to the deduction of velocity based on this constraint
movement, an implicit reflection velocity is set.
Other constraints can be used to limit the distance
between a pair or group of particles, which is useful
for particle simulation of cloth or hair (cf. [Jakobsen2001]
for more details).
4. Sort for Alpha Blending
If particles are alpha blended,
a distance-based sorting should be applied or else
particles in the front will not be blended correctly
with particles behind them. This error might be intolerable
depending on the size and amount of the partially
transparent pixels of each particle. Due to the cost
of sorting, first examine whether these particles
can instead be rendered with a commutative blending
function, e.g. additive or multiplicative.
A particle system on the GPU can
be sorted quite efficiently with the parallel sorting
algorithm “odd-even merge sort” [Batcher1968].
It is independent of the data's sortedness, i.e. it
always has a constant number of iterations for a data
set of a given size. This is important as the parallel
hardware execution makes it inefficient to check whether
all data is already in sequence. This algorithm also
guarantees that with each iteration the sortedness
never decreases. If you assume a high frame-to-frame
coherence, this property allows you to distribute
the whole sorting sequence over 20 - 50 frames. Especially
game applications often know the maximum velocity
with which the viewer and the particle objects can
move, so assumptions about the rate of change in the
sortedness can be made.
The basic principle of odd-even
merge sort is to divide the data into two halves,
to sort these and then to merge the two halves (for
further details cf. [Lang2003]).
The algorithm is commonly written recursively and
in a serial way, but a closer look at the resulting
sorting network shows its parallel nature. Figure
3 shows the sorting network for eight values, an arrow
marking a comparison pair. The first arrow indicates
that element 0 is compared to element 1 and possibly
swapped in case the order is not fulfilled. You can
see that several consecutive comparisons are independent
of each other. They can be grouped for parallel execution,
which is indicated here by the vertical lines.
|
|
float4 mergeSort1DEnd(float _Current :
TEXCOORD0,
uniform int _Step) :
COLOR
{
float currentSample
= (float)texRECT(_SortData,
(float2)_Current);
float direction = (fmod(_Current
/ _Step, 2.0) < 1.0 ? 1.0 :
-1.0);
float otherSample =
(float)texRECT(_SortData,
(float2)(_Current
+ direction * _Step));
if (direction >=
0)
return
max(currentSample, otherSample);
else
return
min(currentSample, otherSample);
}
float4 mergeSort1DRecursion(float _Current
: TEXCOORD0,
uniform int _Step, uniform
int _Count) : COLOR
{
float currentSample
= (float)texRECT(_SortData, (float2)_Current);
int modulus = fmod(_Current
/ _Step, (float)_Count);
if (modulus >= 1
&& modulus < _Count - 1)
{
if
(fmod((float)modulus, 2.0) > 1.0)
return
max(currentSample,
(float)texRECT(_SortData,
(float2)(_Current + _Step)));
else
return
min(currentSample,
(float)texRECT(_SortData,
(float2)(_Current - _Step)));
}
else
return
currentSample;
}
|
 |
 |
 |
Figure
4: Cg code for odd-even merge sorting
a one-dimensional texture
|
Figure 4 shows the Cg (cf. [Mark2003])
code for the sort passes. The code can be ported to
Microsoft HLSL (cf. [Microsoft2002])
by simply mapping the texRECT
instruction onto a tex2D
instruction. The code is slightly simplified for readability
and sorts only a one dimensional texture. To enhance
the shader for sorting two dimensional textures the
texel index needs to be split into u and v coordinates
before look-up. The sorting has two alternative sort
shaders, a “recursion” and an “end” step. The pseudo
code to trigger the sort passes on the CPU is shown
in figure 5.
The sorting requires
passes, where n is the number of elements to
sort. For a 1024x1024 texture this leads to 210 rendering
passes. Just like in the particle simulation, they
always render the full texture into a render target
using a double buffer. As mentioned earlier, running
all 210 passes each frame is far too expensive on
current hardware, but spreading the whole sorting
sequence over 50 frames, i.e. 1-2 seconds, reduces
the workload to only four passes each frame, which
results in acceptable performance.
This general sorting algorithm is
applied to the particle simulation in the following
way: The sorting data textures contain the particle-viewer
distance and the index of the particle. The distance
in this texture is updated after the position simulation.
After sorting the rendering step (cf. next two sections)
looks up the particle attributes via the index in
this texture.
5. Transfer Texture Data to Vertex
Data
The copying of position data from
a texture to vertex data is a hardware feature that
is only just coming up in PC GPUs. Currently there
are two approaches to this problem:
1. DirectX and OpenGL offer vertex
textures with the vertex shader (VS) version 3.0 (cf.
[Microsoft2002]) rsp. the ARB_vertex_shader extension
(cf. [OpenGL2003]).
While vertex textures would be rather optimal for
this technique, at the time of writing there is no
hardware supporting this feature.
2. The alternative solution uses"uber-buffers"
(also called super buffers; cf. [Percy2003]) which
basically are a data-agnostic storage of vertex or
pixel data in a buffer. This concept is already available
in current GPUs, but it is up to now only supported
by the OpenGL API. The current implementation uses
OpenGL with the vendor specific NV_pixel_data_range
extension (cf. [NVIDIA2002])
which offers accelerated asynchronous copying of data
inside the GPU memory. Since the data can be copied
from pixel to vertex memory, this is a basic implementation
of the uber-buffer concept.
The particles can be rendered as
point sprites, triangles or quads (cf. next section).
If multiple vertices per particle are to be generated,
the currently used uber-buffer concept for data transferal
requires manual replication of the particle position
before rendering. In a further rendering step the
positions need to be rendered three or four times
into pixels lying next to each other in a texture.
To avoid this overhead, the current implementation
uses point sprites.
Two upcoming concepts in future
hardware can handle this replication problem more
efficiently: The so-called "vertex stream frequency"
is about to be introduced on hardware supporting the
VS3.0 version. This feature allows reducing the update
frequency of vertex input data to a vertex shader.
Basically, one entry in a vertex stream can be used
for a range of several consecutive vertices, whereas
other vertex streams change at a different rate. The
replication problem will also be solved once vertex
textures are used, as they are actively read by a
shader that otherwise uses static vertex data. This
static vertex data contains the index of a particle
to be read from the texture; it can already be replicated
several times.
6. Render Particles
Finally, the transferred vertex
positions are used to render primitives to the frame
buffer in a traditional way. For the reasons mentioned
in the previous section and in order to reduce the
workload of the vertex unit, particles are currently
rendered as point sprites. Compared to rendering single
triangles or quads this reduces the number of vertices
to a third or a quarter. The disadvantage though is
that particles are always axis-aligned and do not
have a 2D rotation about the screen space z axis.
To overcome this limitation, a 2D rotation is applied
to the texture coordinates inside the pixel shader.
The decision about using point sprites or generating
triangle/quad geometry is to be made depending on
the distribution of workload between vertex and pixel
units of a particular application.
During the rendering other attributes
(e.g. color and size) are computed inside the vertex
shader from the particle type parameters (cf. "Particle
data storage"). Some of them take into account
the relative age of the particle or a pseudo-random
function.
The current implementation uses
the following computation rules for these particle
attributes: The size of a particle is random within
a range that is defined by the particle type. The
initial orientation and a rotation velocity (2D in
screen space) are also determined randomly within
a defined type-based range.
Based on the relative age of the
particle the color and opacity are interpolated from
four keyframes. These keyframes are defined for each
particle type and need not be equidistant. They define
three linear function segments that are first converted
into a form for efficient GPU evaluation and then
uploaded with the particle type data.
It is not possible to switch
textures on a per-particle basis while rendering.
Thus it is necessary to combine different textures
into tiles of a larger 2D texture. Each particle then
modifies the texture coordinates appropriately. If
point sprites are used, the texture coordinates will
be generated automatically by the rasterizer in the
range [0..1]2. The sub-texture selection
then needs to be done in the pixel shader. Fortunately,
the texture coordinate transformation for the 2D rotation
we described above will do the sub-texture selection
basically for free, as a transformation by a 2×2 or
by a 3×2 matrix both need two vector instructions.
______________________________________________________
|