Algorithms for Vector Quantization
The main problem in VQ is choosing the vectors for the codebook so that the mean quantization error is minimal; after the codebook is known, mapping input vectors to it is a trivial matter of finding the best match. In applications where the quantization is performed in real time, a trivial approach to this latter step might prove too slow, but in our case it was orders of magnitude faster than finding an optimal codebook.
We experimented with two algorithms for VQ: the classical GLA (generalized
Lloyd algorithm, sometimes called K-means clustering), and Anthony
Dekker's Neuquant. Both of them are extremely computationally
expensive, basically using brute force to find a general solution
to the problem. Other, much faster algorithms exist, but they
achieve speed by restricting the generality of the codebook (for
example, tree-structured VQ), which would lead to greater quantization
error. For our purposes—compression as a preprocessing stage
for some of the art assets—compression times of a few hours
were acceptable, and that was well within the reach of the brute
force algorithms.
Generalized Lloyd Algorithm
Each iteration of the GLA consists of two phases: the codebook assignment
phase and the codebook adjustment phase. During the former, each
of the vectors from the input set is assigned to the nearest vector
from the codebook. During the latter, each of the codebook vectors
is replaced with the centroid (in this case, average) of all input
vectors assigned to it. This process is convergent, and minimizes
the mean square error of quantization.
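In code, one GLA iteration might look like the following sketch (we use a plain float-vector type here for clarity; the templated vector classes discussed later hide these details):

#include <cstddef>
#include <limits>
#include <vector>

const int DIM = 12;              // e.g., a 4x1 block of RGB pixels
typedef std::vector<float> Vec;  // a single DIM-dimensional vector

static float dist2(const Vec &a, const Vec &b)
{
    float d = 0.0f;
    for (int i = 0; i < DIM; ++i) {
        float t = a[i] - b[i];
        d += t * t;
    }
    return d;
}

// One GLA iteration: assign each input vector to its nearest codebook
// vector, then move every codebook vector to the centroid of its cell.
void GLAIteration(const std::vector<Vec> &input, std::vector<Vec> &codebook)
{
    std::vector<Vec> sum(codebook.size(), Vec(DIM, 0.0f));
    std::vector<int> count(codebook.size(), 0);

    // assignment phase
    for (size_t i = 0; i < input.size(); ++i) {
        size_t best = 0;
        float bestDist = std::numeric_limits<float>::max();
        for (size_t j = 0; j < codebook.size(); ++j) {
            float d = dist2(input[i], codebook[j]);
            if (d < bestDist) { bestDist = d; best = j; }
        }
        for (int k = 0; k < DIM; ++k) sum[best][k] += input[i][k];
        ++count[best];
    }

    // adjustment phase; empty cells (count == 0) stay in place here,
    // which is exactly the problem discussed below
    for (size_t j = 0; j < codebook.size(); ++j)
        if (count[j] > 0)
            for (int k = 0; k < DIM; ++k)
                codebook[j][k] = sum[j][k] / count[j];
}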
There are two problems with GLA: what to do with "empty cells" and how to choose the initial codebook. You have an empty cell when some vector from the codebook gets no input vectors assigned to it during the assignment phase (its Voronoi cell is empty). It will not move during the adjustment phase, and therefore its cell will probably remain empty in all subsequent iterations. In this case, you should remove it from the codebook, and you need some kind of heuristic to come up with a prospective replacement.
You could split the codebook vector with the greatest number of
assigned input vectors into two close vectors, and let several
iterations pull them apart; or you could split the one whose assigned
vectors are most distant. The first heuristic aims to minimize
the mean error, while the second minimizes maximum error.
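For illustration, replacing a dead vector under the second heuristic could look like this sketch (reusing the Vec type from the previous listing; the per-cell maximum distances are assumed to have been collected during the assignment phase):

// Replace a dead codebook vector by splitting the cell with the largest
// maximum distance from its codebook vector to an assigned input vector.
void ReplaceDeadVector(std::vector<Vec> &codebook,
                       const std::vector<float> &maxDist,  // per-cell max distance
                       size_t deadIndex)
{
    size_t worst = 0;
    for (size_t j = 1; j < codebook.size(); ++j)
        if (maxDist[j] > maxDist[worst]) worst = j;

    // Place the dead vector right next to the one we split; subsequent
    // GLA iterations will pull the two copies apart.
    codebook[deadIndex] = codebook[worst];
    codebook[deadIndex][0] += 0.01f;  // tiny perturbation
}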
If you have a satisfying solution to the dead vector problem, the choice of initial codebook does not matter: you could start out with random vectors, or with all vectors at the origin, and GLA will eventually move them into position and eliminate improper ones. This, however, can take tens of iterations, work that would be spared with a more careful starting position.
Neuquant
Neuquant is the name Anthony Dekker gave to his application of Kohonen's self-organizing maps (SOMs), a special type of neural network, to color quantization. We found it quite suitable for quantizing higher-dimensional vectors as well.
Imagine our codebook as a string of connected points (neurons) in
the vector space. Each neuron has two neighbors (except, of course,
for the first and the last) in this string. Now, with each vector
in the input set, you "stimulate" the string: you find
the neuron closest to the stimulus vector, and you pull it in
the direction of the stimulus, say, one-half the distance towards
it. Its neighbors in the string also get pulled towards the stimulus,
but by a lesser amount, say, one-fourth. Their neighbors are influenced
only by one-eighth and so on, until at some distance along the
string the reaction to the stimulus stops.
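A single stimulation step might look like this sketch, again reusing the types from the GLA listing (the fixed radius and the halving influence are illustrative; practical Neuquant implementations also shrink the pull strength and the radius over time):

const int RADIUS = 4;  // how far along the string a stimulus is felt

void Stimulate(std::vector<Vec> &neurons, const Vec &stimulus)
{
    // find the neuron closest to the stimulus
    size_t best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (size_t j = 0; j < neurons.size(); ++j) {
        float d = dist2(stimulus, neurons[j]);
        if (d < bestDist) { bestDist = d; best = j; }
    }

    // pull it halfway towards the stimulus, its neighbors by 1/4,
    // theirs by 1/8, and so on up to RADIUS positions away
    for (int offset = -RADIUS; offset <= RADIUS; ++offset) {
        int j = (int)best + offset;
        if (j < 0 || j >= (int)neurons.size()) continue;
        int hops = offset < 0 ? -offset : offset;
        float amount = 0.5f / (float)(1 << hops);
        for (int k = 0; k < DIM; ++k)
            neurons[j][k] += amount * (stimulus[k] - neurons[j][k]);
    }
}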
When you feed the input set as stimuli, the string moves, turns, and stretches itself to occupy those parts of the vector space where the input data is dense, which is precisely what you want from a good VQ codebook. With larger codebooks, however, Neuquant sometimes wastes codebook entries.
We found that, in general, Neuquant gives decent results faster than GLA (in fewer iterations), but when given enough time GLA tends to produce better codebooks, which adapt well to "stray" vectors. If you can live with the longer compression times, GLA is definitely recommended.
VQ for Texture Compression
VQ compression is highly asymmetric in processing time: choosing an optimal codebook takes huge amounts of calculation, but decompression is lightning-fast, requiring only one table lookup per vector. This makes VQ an excellent choice for data which, once created, never changes, like most art assets for a game, and which can be kept in compressed form even in memory, right up to the moment it's used.
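In code, the decompressor is little more than the following sketch (the names and the exact 16-bit pixel layout are ours, chosen to match the 4x1-pixel scheme described later):

#include <cstddef>

typedef unsigned short Pixel;  // a 16-bit high-color pixel
const int VECPIXELS = 4;       // pixels per codebook vector

// Decompress 'count' vectors: one table lookup per vector,
// followed by a small fixed-size copy.
void Decompress(const unsigned char *indices, size_t count,
                const Pixel *codebook,  // 256 entries x VECPIXELS pixels
                Pixel *out)
{
    for (size_t i = 0; i < count; ++i) {
        const Pixel *entry = codebook + indices[i] * VECPIXELS;
        for (int k = 0; k < VECPIXELS; ++k)
            out[i * VECPIXELS + k] = entry[k];
    }
}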
Texture images are a prime candidate for VQ: they are often of limited
color gamut (for example, in a grass texture you might have hundreds
of shades of green, but only a few different reds, blues, and
whites) and have a lot of similar, same-frequency features. Several
hardware vendors have recognized the suitability of textures for
VQ compression. For example, the Dreamcast's video chip supports
rendering directly from textures compressed to 2 bits/pixel. The vectors in this case are 2x2 blocks of pixels, which makes them 12-dimensional (four pixels of three color components each); the codebook has 256 entries, so each index fits in a single byte.
Roll Your Own: Practical Issues
The rest of this article is a detailed account of our practical experience
with VQ compression for our current project, a software-rendered
2.5D RTS. The source code accompanying the article is very close
to what we have in the actual game, so you can easily experiment
with ideas and trade-offs discussed below.
For our game we didn't have the luxury of building special hardware
for VQ, so we had to design our VQ scheme around the software
blitter/blender. Since it uses MMX to process four adjacent 16-bit
pixels, we chose to quantize 12-dimensional vectors too, but taken
from a 4x1 block of pixels. This leads to slightly worse results
compared to the 2x2 blocks, because the pixels in the tightly
packed square block are more closely correlated (that is, likely
to have similar values).
Both VQ algorithms work on vectors via a bunch of operators and don't
care about their dimensionality or internal representation. This
makes them a perfect fit for generic implementations in the form
of templated C++ classes, with the vector types left as template
parameters for the quantizer classes. The vector class should provide +=, -=, and *= operators with their usual meanings, plus a function returning the distance between two vectors according to the metric of choice (Euclidean distance works just fine). Neuquant needs
an additional function, shifting a vector towards another vector
by a specified amount.
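As an illustration, the expected interface could look something like this sketch (the names are ours, not necessarily those of the accompanying source code):

// Sketch of the vector concept the quantizer templates expect.
// N is the dimensionality; the quantizers never look inside.
template <int N>
struct Vector
{
    float v[N];

    Vector &operator+=(const Vector &o) { for (int i = 0; i < N; ++i) v[i] += o.v[i]; return *this; }
    Vector &operator-=(const Vector &o) { for (int i = 0; i < N; ++i) v[i] -= o.v[i]; return *this; }
    Vector &operator*=(float s)         { for (int i = 0; i < N; ++i) v[i] *= s;      return *this; }

    // squared Euclidean distance, enough for nearest-match comparisons
    friend float Distance(const Vector &a, const Vector &b)
    {
        float d = 0.0f;
        for (int i = 0; i < N; ++i) { float t = a.v[i] - b.v[i]; d += t * t; }
        return d;
    }

    // the extra function Neuquant needs: move towards 'target' by 'amount' (0..1)
    void ShiftTowards(const Vector &target, float amount)
    {
        for (int i = 0; i < N; ++i) v[i] += amount * (target.v[i] - v[i]);
    }
};

typedef Vector<12> PixelVector;  // a 4x1 block of RGB pixels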
Because almost all of the time in both algorithms is spent in these vector operations, they are a good candidate for SIMD optimization.
Writing SSE and 3DNow versions of the vector classes took us a
couple of hours. They both run at about the same speed, roughly
twice as fast as their scalar counterparts on the respective processors;
greater speedup is probably possible with careful hand-tuning.
The plain old x87 version of the vector class can also be implemented
in a generic manner without sacrificing performance, with the
dimensionality as a template parameter.
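For example, an SSE version of the distance function for 12-dimensional vectors might look like the following sketch (assuming 16-byte-aligned data; the hand-tuned versions in a real codebase would differ):

#include <xmmintrin.h>

// Squared Euclidean distance between two 12-dimensional vectors,
// processed as three 4-float SSE registers.
float DistanceSSE(const float *a, const float *b)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < 12; i += 4) {
        __m128 d = _mm_sub_ps(_mm_load_ps(a + i), _mm_load_ps(b + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }
    // horizontal sum of the four partial sums
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}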
We had to spend much more time tackling the empty cell problem with GLA. We found that splitting the cell with the largest population of input vectors results in neglecting colors outside the general color gamut of the image; for example, small patches of brown dirt in a grass texture almost completely disappeared because the heuristic allocated all available vectors to the greens. Splitting the largest cells (measured by the maximal distance from a codebook vector to an input vector mapped to it) works much better and preserves color variations more faithfully.
Another issue is the initial codebook configuration and the stop condition for the GLA algorithm. Depending on the input data, sometimes GLA never stops updating the codebook vectors, falling into a stable loop of several iterations. We arbitrarily stop the process when the last N iterations have each updated fewer than 1 percent of the codebook vectors, or when they have all updated the same number of input vectors. It might be possible to come across an image for which the algorithm falls into a loop that updates a different number of vectors on each iteration but still never finishes; we haven't found such an image, but we still put an arbitrary limit on the total number of iterations.
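A sketch of such a stop condition (the thresholds and the iteration cap are as arbitrary as described above; for brevity, N is collapsed to comparing two consecutive iterations):

const int MAX_ITERATIONS = 500;  // arbitrary safety cap

// 'updatedNow' and 'updatedBefore' count the codebook vectors updated
// in the current and the previous iteration, respectively.
bool ShouldStop(int iteration, int updatedNow, int updatedBefore,
                int codebookSize)
{
    if (iteration >= MAX_ITERATIONS) return true;      // hard limit
    if (updatedNow == updatedBefore) return true;      // stable loop
    if (updatedNow * 100 < codebookSize) return true;  // under 1 percent
    return false;
}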
As for the initial configuration, we kept the naive solution of starting with all codebook vectors having the same values. It takes about 10-15 iterations just to fill up the codebook with "active" vectors, but the end results are better than starting with a random sample of vectors from the image, with a random codebook, or with a few iterations of Neuquant as a "seed" for GLA.
Because we let the compression run overnight anyway, for each image we
generated compressed versions with different codebook sizes in
the range 256-8,192. Then we examined them and chose the best
trade-off between visual fidelity and size.
For the compressed images we keep the codebook in high-color format (because that's what our blitter needs; the sample code works with 24-bit true color) and, for each image vector, an 8-bit or 16-bit index into the codebook. Reasonable codebook sizes (up to about 8,192 entries) don't use all 16 bits of the indices, but tight packing of the bits would slow down decompression, which in our case matters more than the additional compression gains. Even so, you should always use the smallest codebook possible to minimize memory traffic; ideally it should fit into the CPU's L1 data cache.
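The resulting in-memory layout is roughly the following sketch (field names are ours):

// A VQ-compressed image: a high-color codebook plus one index
// per 4x1-pixel vector.
struct CompressedImage
{
    int             width, height;  // in pixels; width is a multiple of 4
    int             codebookSize;   // up to about 8,192 entries
    unsigned short *codebook;       // codebookSize x 4 16-bit pixels
    unsigned short *indices;        // (width / 4) * height entries;
                                    // an 8-bit variant would use unsigned char
};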
If you ignore the memory for the codebook, this corresponds to 2 or 4 bits/pixel: an 8-bit index shared by a 4x1-pixel vector amounts to 2 bits per pixel, and a 16-bit index to 4. While a bit on the low side for general image compression, this is excellent for an algorithm whose decompression is practically free in terms of CPU time.
When we put the VQ compression described above into our renderer, we expected slightly worse performance but were ready to live with it, because it took about 20MB off our memory footprint. However, profiling showed that rendering actually gained about 15 to 20 percent in performance, probably due to decreased memory traffic.