Algorithms for Vector Quantization
The main problem in VQ is choosing the vectors for the codebook so that the mean quantization error is minimal; once the codebook is known, mapping input vectors to it is a trivial matter of finding the best match. In applications where the quantization is performed in real time, a naive approach to this latter step might prove too slow, but in our case it was orders of magnitude faster than finding an optimal codebook.
We experimented with two algorithms for VQ, the classical GLA (generalized Lloyd algorithm, sometimes called K-means clustering), and Anthony Dekker's Neuquant. Both of them are extremely computationally expensive, basically using brute force to find a general solution to the problem. Other, much faster algorithms exist, but they achieve speed by restricting the generality of the codebook (for example, tree-structured VQ), which would lead to greater quantization error. For our purposes—compression as a preprocessing stage for some of the art assets—compression times of a few hours were acceptable, and that was well within the reach of the brute force algorithms.
Generalized Lloyd Algorithm
Each iteration of the GLA consists of two phases: the codebook assignment phase and the codebook adjustment phase. During the former, each of the vectors from the input set is assigned to the nearest vector from the codebook. During the latter, each of the codebook vectors is replaced with the centroid (in this case, average) of all input vectors assigned to it. This process is convergent, and minimizes the mean square error of quantization.
FIGURE 5. A grass texture (uncompressed.)
There are two problems with GLA: what to do with "empty cells," and how to choose the initial codebook. You have an empty cell when some vector from the codebook gets no input vectors assigned to it during the assignment phase (its Voronoi cell is empty). Such a vector will not move during the adjustment phase, so its cell will probably remain empty in all subsequent iterations. In this case, you should remove it from the codebook, and you need some kind of heuristic to come up with a replacement. You could split the codebook vector with the greatest number of assigned input vectors into two close vectors, and let several iterations pull them apart; or you could split the one whose assigned vectors are most distant. The first heuristic aims to minimize the mean error, while the second minimizes the maximum error.
If you have a satisfying solution to the dead-vector problem, the choice of initial codebook does not matter—you could start out with random vectors, or with all vectors at the origin; GLA will eventually move them into position and eliminate improper ones. This, however, can take tens of iterations—work that could be spared with a more careful starting position.
Neuquant
Neuquant is the name Anthony Dekker gave to his application of Kohonen's self-organizing maps (SOMs)—a special type of neural network—to color quantization. We found it quite suitable for quantizing vectors of higher dimensions.
Imagine our codebook as a string of connected points—neurons—in the vector space. Each neuron has two neighbors (except, of course, for the first and the last) in this string. Now, with each vector in the input set, you "stimulate" the string: you find the neuron closest to the stimulus vector, and you pull it in the direction of the stimulus, say, one-half the distance towards it. Its neighbors in the string also get pulled towards the stimulus, but by a lesser amount, say, one-fourth. Their neighbors are influenced only by one-eighth and so on, until at some distance along the string the reaction to the stimulus stops.
When you feed the whole input set as stimuli, the string moves, turns, and stretches itself to occupy those parts of the vector space where the input data is dense—which is precisely what you want from a good VQ codebook. With larger codebooks, however, Neuquant sometimes wastes codebook entries.
We found that, in general, Neuquant gives decent results faster than GLA (in fewer iterations), but when given enough time GLA tends to produce better codebooks, which adapt well to "stray" vectors. If you can live with the longer compression times, GLA is definitely recommended.
VQ for Texture Compression
VQ compression is highly asymmetric in processing time: choosing an optimal codebook takes huge amounts of calculations, but decompression is lightning-fast—only one table lookup per vector. This makes VQ an excellent choice for data which once created will never change, like most art assets for a game, but which can be kept in a compressed form even in memory right up to the moment when it's used.
Texture images are a prime candidate for VQ—they are often of limited color gamut (for example, in a grass texture you might have hundreds of shades of green, but only a few different reds, blues, and whites) and have a lot of similar, same-frequency features. Several hardware vendors have recognized the suitability of textures for VQ compression. For example, the Dreamcast's video chip supports rendering directly from textures compressed to 2 bits/pixel. The vectors in this case are 2x2 blocks of pixels, or 12-dimensional; the codebook has 256 entries for single-byte indices.
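Gathering a texture into such vectors is straightforward. This sketch collects 2x2 blocks of an RGB image into 12-dimensional vectors, the input set for the quantizer; the image is assumed row-major with even width and height, and the names are ours:

```cpp
#include <vector>
#include <cstddef>

typedef std::vector<float> Vec;

// Gather 2x2 pixel blocks of a row-major RGB image into 12-dimensional
// vectors (4 pixels x 3 channels). Width and height are assumed even.
std::vector<Vec> gatherBlocks(const std::vector<float>& rgb,
                              std::size_t width, std::size_t height) {
    std::vector<Vec> out;
    for (std::size_t y = 0; y + 1 < height; y += 2) {
        for (std::size_t x = 0; x + 1 < width; x += 2) {
            Vec v;
            v.reserve(12);
            for (std::size_t dy = 0; dy < 2; ++dy) {
                for (std::size_t dx = 0; dx < 2; ++dx) {
                    std::size_t p = ((y + dy) * width + (x + dx)) * 3;
                    v.push_back(rgb[p]);
                    v.push_back(rgb[p + 1]);
                    v.push_back(rgb[p + 2]);
                }
            }
            out.push_back(v);
        }
    }
    return out;   // one 12-dimensional vector per 2x2 block
}
```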
Roll Your Own: Practical Issues
The rest of this article is a detailed account of our practical experience with VQ compression for our current project, a software-rendered 2.5D RTS. The source code accompanying the article is very close to what we have in the actual game, so you can easily experiment with ideas and trade-offs discussed below.
For our game we didn't have the luxury of building special hardware for VQ, so we had to design our VQ scheme around the software blitter/blender. Since it uses MMX to process four adjacent 16-bit pixels, we chose to quantize 12-dimensional vectors too, but taken from a 4x1 block of pixels. This leads to slightly worse results compared to the 2x2 blocks, because the pixels in the tightly packed square block are more closely correlated (that is, likely to have similar values).
Both VQ algorithms work on vectors through a handful of operators and don't care about their dimensionality or internal representation. This makes them a perfect fit for generic implementation as templated C++ classes, with the vector type left as a template parameter for the quantizer classes. The vector class should provide +=, -=, and *= operators with their usual meaning, plus a function returning the distance between two vectors according to the metric of choice (Euclidean distance works just fine). Neuquant needs an additional function that shifts a vector towards another vector by a specified amount.
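The interface described above might look like this, with the dimensionality as a template parameter as suggested later in the article; the class and member names are illustrative, not the article's actual code:

```cpp
#include <cstddef>

// Sketch of the vector interface both quantizers rely on.
template <std::size_t N>
struct Vector {
    float v[N];

    Vector& operator+=(const Vector& o) {
        for (std::size_t i = 0; i < N; ++i) v[i] += o.v[i];
        return *this;
    }
    Vector& operator-=(const Vector& o) {
        for (std::size_t i = 0; i < N; ++i) v[i] -= o.v[i];
        return *this;
    }
    Vector& operator*=(float s) {
        for (std::size_t i = 0; i < N; ++i) v[i] *= s;
        return *this;
    }
    // squared Euclidean distance; monotonic in the true distance,
    // which is all the nearest-match searches need
    friend float distance(const Vector& a, const Vector& b) {
        float d = 0.0f;
        for (std::size_t i = 0; i < N; ++i) {
            float t = a.v[i] - b.v[i];
            d += t * t;
        }
        return d;
    }
    // shift this vector toward target by fraction t (Neuquant's extra op)
    void shiftTowards(const Vector& target, float t) {
        for (std::size_t i = 0; i < N; ++i)
            v[i] += t * (target.v[i] - v[i]);
    }
};
```

Because `N` is a compile-time constant, the loops unroll well, and SIMD specializations can be dropped in for the dimensions the game actually uses.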
Because with both algorithms almost all of the time is spent in these vector operations, they are a good candidate for SIMD optimization. Writing SSE and 3DNow versions of the vector classes took us a couple of hours. Both run at about the same speed, roughly twice as fast as their scalar counterparts on the respective processors; greater speedup is probably possible with careful hand-tuning. The plain old x87 version of the vector class can also be implemented in a generic manner without sacrificing performance, with the dimensionality as a template parameter.
FIGURE 6. VQ-compressed grass texture, codebook size 256, compression time 2 minutes, 2.1 bits/pixel.
We had to spend much more time tackling the empty cell problem with the GLA algorithm. We found that splitting the cell with the largest population of input vectors results in neglecting colors outside the general color gamut of the image; for example, small patches of brown dirt in a grass texture almost completely disappeared because the heuristic allocated all available vectors to the greens. Splitting the largest cells (measured by the maximal distance from a codebook vector to an input vector mapped to it) works much better and preserves color variations more faithfully.
Another issue is the initial codebook configuration and the stop condition for the GLA algorithm. Depending on the input data, sometimes GLA never stops updating the codebook vectors, falling into a stable loop of several iterations. We arbitrarily stop the process when the last N iterations have updated less than 1 percent of the codebook vectors, or when they have updated the same number of vectors. It might be possible to come across an image for which the algorithm falls into a loop that updates a different number of vectors on each iteration but still never finishes; we haven't encountered such an image, but we put an arbitrary limit on the total number of iterations just in case.
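The stop logic can be sketched as a small driver loop. The iteration step is passed in as a callback so the sketch stays self-contained; `runGla`, `IterationFn`, and the thresholds are our illustration of the conditions described above:

```cpp
#include <cstddef>

// Returns the number of codebook entries the iteration updated.
typedef std::size_t (*IterationFn)(void* state);

// Run GLA iterations until an iteration updates fewer than 1 percent of the
// codebook entries, two consecutive iterations update the same number, or a
// hard cap on iterations is reached. Returns how many iterations ran.
std::size_t runGla(IterationFn step, void* state,
                   std::size_t codebookSize, std::size_t maxIterations) {
    std::size_t prevUpdated = (std::size_t)-1;
    std::size_t iterations = 0;
    while (iterations < maxIterations) {
        std::size_t updated = step(state);
        ++iterations;
        if (updated * 100 < codebookSize) break;   // under 1 percent moved
        if (updated == prevUpdated) break;         // stable loop detected
        prevUpdated = updated;
    }
    return iterations;
}

// Hypothetical step for the usage example below: reports 100 updated entries
// on the first call and 50 on every later call.
static std::size_t fakeStep(void* state) {
    std::size_t* calls = (std::size_t*)state;
    ++(*calls);
    return (*calls == 1) ? (std::size_t)100 : (std::size_t)50;
}
```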
As for the initial configuration, we kept the naive solution of starting with all codebook vectors having the same values. It takes about 10-15 iterations just to fill up the codebook with "active" vectors, but the end results are better than starting with a random sample of vectors from the image, starting with a random codebook or using a few iterations of Neuquant as a "seed" for GLA.
Because we let the compression run overnight anyway, for each image we generated compressed versions with different codebook sizes in the range 256-8,192. Then we examined them and chose the best trade-off between visual fidelity and size.
For the compressed images we keep the codebook in high-color format (because that's what our blitter needs; the sample code works with 24-bit true color) and, for each image vector, an 8-bit or 16-bit index into the codebook. Reasonable codebook sizes (up to about 8,192) don't use all 16 bits of the indices, but tight packing of the bits would slow down decompression, and that is more important in our case than the additional compression gains. Even so, you should always use the smallest codebook possible to minimize memory traffic; ideally it should fit into the CPU's L1 data cache.
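Decompression then really is just one table lookup per vector. This sketch uses plain 32-bit pixels instead of the high-color format the blitter needs, and the names are ours:

```cpp
#include <vector>
#include <cstddef>

typedef unsigned int Pixel;

// VQ decompression: one codebook lookup per 4-pixel block. The codebook
// stores 4 pixels per entry; indices selects one entry per block.
void decompress(const std::vector<unsigned short>& indices,
                const std::vector<Pixel>& codebook,
                std::vector<Pixel>& out) {
    out.resize(indices.size() * 4);
    for (std::size_t i = 0; i < indices.size(); ++i) {
        const Pixel* entry = &codebook[(std::size_t)indices[i] * 4];
        for (std::size_t p = 0; p < 4; ++p)
            out[i * 4 + p] = entry[p];   // the only work: a table lookup
    }
}
```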
If you ignore the memory for the codebook, this corresponds to 2 or 4 bits/pixel. While a bit on the low side for general image compression, this is excellent for an algorithm whose decompression is practically free in terms of CPU time.
When we put the VQ compression described above into our renderer, we expected slightly worse performance and were ready to live with it, because it took about 20MB off our memory footprint. However, profiling showed that rendering actually gained about 15 to 20 percent in performance, probably due to decreased memory traffic.