Breaking the 64 Spatialized Sources Barrier


May 29, 2003

Spatialized soundtracks and sound effects are standard elements in today's games. However, although 3D audio modeling and content-creation tools (e.g., Creative Labs' EAGLE [4]) provide some help to game audio designers, the number of available 3D audio hardware channels remains limited, usually ranging from 16 to 64 at best. While one can question whether more hardware channels are actually required, it is clear that large numbers of spatialized sources may be needed to render a realistic environment.

This problem becomes even more significant if extended sound sources are to be simulated. For instance, consider a train that is far too long to be represented as a point source. Since current hardware and APIs implement only point-source models or limited extended-source models [2,3,5], a large number of such sources would be required to achieve a realistic effect (see Example 1). Finally, 3D audio channels might also be used for a restitution-independent representation of surround music tracks, leaving the generation of the final mix to the audio rendering API but requiring the programmer to assign some of the precious 3D channels to the soundtrack.

Also, the dynamic allocation schemes currently available in game APIs (e.g., DirectSound 3D [2]) remain very basic. As a result, game audio designers and developers must spend considerable effort to best map a potentially large number of sources to the limited number of channels. In this article, we provide some answers to this problem by reviewing and introducing several automatic techniques that achieve efficient hardware mapping of complex dynamic audio scenes with currently available hardware resources.


Figure 1. A traditional hardware-accelerated audio rendering pipeline. 3D audio channels process the audio data to reproduce distance, directivity, occlusion, Doppler shift and positional audio effects depending on the 3D location of the source and listener. Additionally, a mix of all signals is generated to feed an artificial reverberation or effect engine.

We show that clustering strategies, some of them relying on perceptual information, can be used to map a larger number of sources to a limited number of channels with little impact on the perceived audio quality. The required pre-mixing operations can be implemented very efficiently on the CPU and/or the GPU (graphics processing unit), outperforming current 3D audio boards with little overhead. These algorithms simplify the task of the audio designer and audio programmer by removing the limitation on the number of spatialized sources. They permit rendering of extended sources or discrete sound reflections beyond current hardware capabilities. Finally, they integrate well with existing APIs and can be used to drive automatic resource allocation and level-of-detail schemes for audio rendering.

In the first section, we present an overview of clustering strategies to group several sources for processing through a single hardware channel. We will call such clusters of sources auditory impostors. In the second section, we describe recent techniques developed in our research group that incorporate perceptual criteria in hardware channel allocation and clustering strategies. The third section is devoted to the actual audio rendering of auditory impostors. In particular, we present techniques maximizing the use of all available resources including the CPU, APU (audio processing unit) and even the GPU which we turn into an efficient audio processor for a number of operations. Finally, we demonstrate the described concepts on several examples featuring a large number of dynamic sound sources.

Clustering Sound Sources


Figure 2. Clustering techniques group sound sources (blue dots) into clusters and use a single representative per cluster (colored dots) to render or spatialize the aggregate audio stream.

The process of clustering sound sources is very similar in spirit to the level-of-detail (LOD) or impostor concept introduced in computer graphics [13]. Such approaches render complex geometries using a smaller number of textured primitives and can scale or degrade to fit specific hardware or processing power constraints while limiting visible artifacts. Similarly, sound-source clustering techniques (Figure 2) aim to replace large sets of point sources with a limited number of representative point sources, possibly with more complex characteristics (e.g., an impulse response). Such impostor sound sources can then be mapped to audio hardware to benefit from dedicated (and otherwise costly) positional audio or reverberation effects (see Figure 3). Clustering schemes can be divided into two main categories: fixed clustering, which uses a predefined set of clusters, and adaptive clustering, which attempts to construct the best clusters on the fly.

Two main problems in a clustering approach are the choice of a good clustering criterion and of a good cluster representative. The answers to these questions largely depend on the available audio spatialization and rendering back-end, and on whether the necessary software operations can be performed efficiently. Ideally, the clustering and rendering pipeline should work together to produce the best result at the ears of the listener. Indeed, audio clustering is linked to human perception of multiple simultaneous sound sources, a complex problem that has received a lot of attention in the acoustics community [7]. This problem is also actively studied in the community of auditory scene analysis (ASA) [8,16]. However, ASA attempts to solve the dual and more complex problem of segregating a complex sound mixture into discrete, perceptually relevant components.


Figure 3. An audio rendering pipeline with clustering. Sound sources are grouped into clusters which are processed by the APU as standard 3D audio buffers. Constructing the aggregate audio signal for each cluster must currently be done outside the APU, either using the CPU and/or GPU.

Fixed-Grid Clustering

The first instance of source clustering was introduced by Herder [21,22], who grouped sound sources into cones in direction space around the listener, whose size was chosen based on available psycho-acoustic data on the spatial resolution of human 3D hearing. He also discussed the possibility of grouping sources by distance or relative speed to the listener. However, it is unclear whether relative speed, which might vary a lot from source to source, is a good clustering criterion. One drawback of fixed-grid clustering approaches is that they cannot be targeted to fit a specified number of (non-empty) clusters. Hence, they can end up being sub-optimal (e.g., all sources fall into the same cluster while sufficient resources are available to process all of them independently) or might produce too many non-empty clusters for the system to render (see Figure 5).
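
To make the fixed-grid idea concrete, here is a minimal sketch in the spirit of Herder's direction-space cones: sources are binned into equal azimuth sectors around the listener. The Source struct, sector count, and helper names are our own hypothetical illustration, not Herder's implementation, and elevation is ignored in this planar version.

```cpp
#include <cmath>
#include <map>
#include <vector>

// Hypothetical fixed-grid, direction-space clustering sketch:
// bin sources into equal azimuth sectors around the listener.
struct Source { float x, y, z; }; // position relative to the listener (y ignored here)

// Map a source direction to one of numSectors equal azimuth sectors.
int azimuthSector(const Source& s, int numSectors)
{
    const float kPi = 3.14159265f;
    float azimuth = std::atan2(s.z, s.x);      // in [-pi, pi]
    float t = (azimuth + kPi) / (2.0f * kPi);  // in [0, 1]
    int sector = static_cast<int>(t * numSectors);
    return (sector < numSectors) ? sector : numSectors - 1;
}

// Group sources by sector. Note the drawback discussed above: the number
// of non-empty bins cannot be forced to match a channel budget.
std::map<int, std::vector<const Source*>>
fixedGridClusters(const std::vector<Source>& sources, int numSectors)
{
    std::map<int, std::vector<const Source*>> clusters;
    for (const Source& s : sources)
        clusters[azimuthSector(s, numSectors)].push_back(&s);
    return clusters;
}
```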


Figure 4. "Virtual surround". A virtual set of speakers (here located at the vertices of an icosahedron surrounding the listener) are used to spatialize any number of sources using 18 3D audio channels.

It is, however, possible to design a fixed-grid clustering approach that works well by using direction-space clustering, and more specifically, "virtual surround". Virtual surround renders the audio from a virtual rig of loudspeakers (see Figure 4). Each loudspeaker can be mapped to a dedicated hardware audio buffer and spatialized according to its 3D location. This technique is widely used for headphone rendering of 5.1 surround film soundtracks (e.g., in software DVD players), simulating a planar array of speakers. Extended to 3D, it shares some similarities with directional sound-field decomposition techniques such as Ambisonics [34]. However, since it does not rest on comparably rigorous mathematical foundations, it is less accurate, but it requires no specific audio encoding/decoding and fits well with existing consumer audio hardware. Common techniques (e.g., amplitude panning [31]) can be used to compute the gains that must be applied to the audio signals feeding the virtual loudspeakers in order to obtain smooth transitions between neighboring directions.

The main advantage of such an approach is its simplicity: it can be implemented very easily on top of an API such as DirectSound 3D (DS3D). The main application is responsible for the pre-mixing of signals and panning calculations, while the actual 3D sound restitution is left to DS3D. Although there is no way to enforce perfect synchronization between DS3D buffers, the method appears to work very well in practice. Example 2 and Example 3 feature a binaural rendering of up to 180 sources using two different virtual speaker rigs (respectively an octahedron and an icosahedron around the listener) mapped to 6 and 18 DS3D channels. One drawback of a direction-space approach is that reverberation-based cues for distance rendering (e.g., in EAX [3], which implements automatic reverberation tuning based on source-to-listener distance) can no longer be used directly. One workaround is to use several virtual speaker rigs located at different distances from the listener, at the expense of more 3D channels.
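
The sketch below illustrates the pre-mixing side of virtual surround: computing per-speaker gains for one source. The article cites amplitude panning [31] (e.g., VBAP); the simple cosine panning law with constant-power normalization used here is our own illustrative stand-in, not the authors' exact method.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hedged sketch of virtual-surround pre-mixing: per-speaker gains for
// one source, using a cosine panning law with constant-power normalization.
struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

static Vec3 normalize(const Vec3& v)
{
    float len = std::sqrt(dot(v, v));
    return { v.x / len, v.y / len, v.z / len };
}

// speakers: unit direction of each virtual loudspeaker (octahedron,
// icosahedron...). The source signal scaled by gains[i] is pre-mixed into
// speaker i's buffer; a 3D hardware channel then spatializes that buffer.
std::vector<float> panGains(const Vec3& sourceDir,
                            const std::vector<Vec3>& speakers)
{
    std::vector<float> gains(speakers.size(), 0.0f);
    Vec3 d = normalize(sourceDir);
    float sumSq = 0.0f;
    for (std::size_t i = 0; i < speakers.size(); ++i) {
        float g = std::max(0.0f, dot(d, speakers[i])); // cosine falloff
        gains[i] = g;
        sumSq += g * g;
    }
    float norm = (sumSq > 0.0f) ? 1.0f / std::sqrt(sumSq) : 0.0f;
    for (float& g : gains)
        g *= norm; // constant-power normalization across the rig
    return gains;
}
```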

Adaptive Positional Clustering

In contrast to fixed-grid methods, adaptive clustering aims at grouping sound sources based on their current 3D location (including incoming direction and distance to the listener) in an "optimal" way. Adaptive clustering has several advantages: 1) it can produce a requested number of non-empty clusters, 2) it automatically refines the subdivision where needed, and 3) it can be controlled by a variety of error metrics. Several clustering approaches exist and can be used for this purpose [18,19,24]. For instance, global k-means techniques [24] start with a single cluster and progressively subdivide it until a specified error criterion or number of clusters has been met. This approach constructs a subdivision of space that is locally optimal according to the chosen error metric. Example 4 shows the result of such an approach applied to a simple scene where three spatially extended sources are modeled as a set of 84 point sources. Note the progressive de-refinement when the number of clusters is reduced, and the adaptive refinement when the listener moves closer to or farther from the large "line source". In this case, the error metric was a combination of distance and incident direction onto the listener, and cluster representatives were constructed as the centroid, in polar coordinates, of all sources in the cluster.
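
As a rough illustration of the mechanics involved, here is a plain Lloyd-style k-means loop over an "azimuth / log-distance" feature space around the listener, echoing the error metric just described. The article's global k-means variant subdivides clusters progressively instead; this fixed-k sketch only shows the assignment and centroid steps. Azimuth wrap-around is ignored for brevity, and at least k sources are assumed.

```cpp
#include <cstddef>
#include <vector>

// Hedged sketch of adaptive clustering via fixed-k Lloyd iterations.
struct Feature { float azimuth, logDist; };

static float dist2(const Feature& a, const Feature& b)
{
    float da = a.azimuth - b.azimuth;
    float dd = a.logDist - b.logDist;
    return da * da + dd * dd;
}

// Returns a cluster label per source; each per-cluster centroid becomes
// the position of the impostor source fed to one hardware channel.
std::vector<int> clusterSources(const std::vector<Feature>& f, int k, int iters)
{
    std::vector<Feature> centers(f.begin(), f.begin() + k); // naive seeding
    std::vector<int> label(f.size(), 0);
    for (int it = 0; it < iters; ++it) {
        // Assignment step: nearest center in feature space.
        for (std::size_t i = 0; i < f.size(); ++i) {
            int best = 0;
            float bestD = dist2(f[i], centers[0]);
            for (int c = 1; c < k; ++c) {
                float d = dist2(f[i], centers[c]);
                if (d < bestD) { bestD = d; best = c; }
            }
            label[i] = best;
        }
        // Update step: move each center to its cluster's centroid.
        std::vector<Feature> sum(k, Feature{0.0f, 0.0f});
        std::vector<int> count(k, 0);
        for (std::size_t i = 0; i < f.size(); ++i) {
            sum[label[i]].azimuth += f[i].azimuth;
            sum[label[i]].logDist += f[i].logDist;
            ++count[label[i]];
        }
        for (int c = 0; c < k; ++c)
            if (count[c] > 0)
                centers[c] = { sum[c].azimuth / count[c],
                               sum[c].logDist / count[c] };
    }
    return label;
}
```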


Figure 5. Fixed-grid and adaptive clustering illustrated in 2D: (a) regular grid clustering (ten non-empty clusters), (b) non-uniform "azimuth-log(1/distance)" grid (six non-empty clusters), (c) adaptive clustering. Contrary to fixed-grid clustering, adaptive clustering can "optimally" fit a predefined cluster budget (four in this case).

Perceptually-Driven Source Prioritization And Resource Allocation

So far, very few approaches have attempted to include psycho-acoustic knowledge in the audio rendering pipeline. Most of the effort has been dedicated to reducing the signal-processing cost of spatialization (e.g., spatial sampling of HRTFs or filtering operations for headphone rendering [6,28]). However, existing approaches never consider the characteristics of the input signal. On the other hand, important advances in audio compression, such as MPEG-1 Layer 3 (MP3), have shown that exploiting both input signal characteristics and our perception of sound can provide unprecedented quality-versus-compression ratios [15,25,26]. Thus, taking signal characteristics into consideration during audio rendering might help to design better prioritization and resource allocation schemes, perform dynamic sound-source culling, and improve the construction of clusters. However, contrary to most perceptual audio coding (PAC) applications (which encode once and play back many times), the soundscape in a game is highly dynamic, and perceptual criteria have to be recomputed at each processing frame for a possibly large number of sources. We propose a perceptually-driven audio rendering pipeline with clustering, illustrated in Figure 6.


Figure 6. A perceptually-driven audio rendering pipeline with clustering. Based on pre-computed information on the input signals, the system dynamically evaluates masking thresholds and perceptual importance criteria for each sound source. Inaudible or masked sources are discarded. Remaining sources are clustered. Perceptual importance is used to select better cluster representatives.

Prioritizing Sound Sources

Assigning priorities to sound sources is a fundamental aspect of resource allocation schemes. Currently, APIs such as DS3D use basic schemes based on distance to the listener for on-the-fly allocation of hardware channels. Obviously, such schemes would benefit from the ability to sort sources by perceptual importance or "emergence" criteria. As we already mentioned, this is a complex problem, related to the segregation of complex sound mixtures as studied in the ASA field. We experimented with a simple, loudness-based model that appears to give good results in practice in our test examples. Our model uses power spectral density information pre-computed on the input waveforms (for instance, using a short-time Fast Fourier Transform) for several frequency bands. This information does not represent a significant overhead in terms of memory or storage space. At run-time, we access this information, modify it to account for distance attenuation, source directivity, etc., and map the value to perceptual loudness space using available loudness-contour data [11,32]. We use this loudness value as a priority criterion for further processing of sound sources.
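
A minimal sketch of such a priority model appears below. Per-band power spectra are assumed pre-computed offline; at run-time they are attenuated for distance and folded into one scalar. The Stevens-style power-law mapping here is a simple stand-in for the equal-loudness-contour lookup described above, not the authors' exact model.

```cpp
#include <cmath>
#include <vector>

// Hedged sketch of loudness-based source prioritization.
struct SourceBands {
    std::vector<float> bandPower; // pre-computed power per frequency band
    float distance;               // current distance to the listener
};

float priority(const SourceBands& s, float refDistance = 1.0f)
{
    // Inverse-square distance attenuation of each band's power.
    float att = (refDistance * refDistance) / (s.distance * s.distance);
    float total = 0.0f;
    for (float p : s.bandPower)
        total += p * att;
    // Stevens-style power law: perceived loudness grows roughly as
    // intensity^0.3 (an approximation, not measured contour data).
    return std::pow(total, 0.3f);
}
```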

Dynamic Sound Source Culling

Sound source culling aims at reducing the set of sound sources to process by identifying and discarding inaudible sources. A basic culling scheme determines whether the source amplitude is below the absolute threshold of hearing (or below a 1-bit amplitude threshold). Using dynamic loudness evaluation for each source, as described in the previous section, makes this process much more accurate and effective.
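
In code, basic culling reduces to a filter over the dynamically evaluated loudness values; the sketch below is our own illustration, with the threshold parameter standing in for the absolute threshold of hearing (or a 1-bit amplitude floor) rather than calibrated data.

```cpp
#include <algorithm>
#include <vector>

// Hedged sketch of basic culling: drop sources whose loudness falls
// below an audibility floor.
struct Candidate { int id; float loudness; };

void cullInaudible(std::vector<Candidate>& sources, float threshold)
{
    sources.erase(std::remove_if(sources.begin(), sources.end(),
                                 [threshold](const Candidate& c) {
                                     return c.loudness < threshold;
                                 }),
                  sources.end());
}
```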

Perceptual culling is a further refinement whose aim is to discard sources that are perceptually masked by others. Such techniques have recently been used to speed up modal/additive synthesis [29,30], e.g., for contact-sound simulation. To exploit them for sampled sound signals, we pre-compute another characteristic of the input waveform: its tonality. Based on PAC techniques, an estimate of the tonality in several sub-bands can be calculated using a short-time FFT. This tonality index [26] estimates whether the signal in each sub-band is closer to a noise or a tone. Masking thresholds, which typically depend on such information, can then be dynamically evaluated.
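
One common way to estimate such a tonality index, sketched below, is the spectral flatness measure (SFM) used in classic perceptual audio coding: an SFM near 0 dB indicates a noise-like band, while a strongly negative SFM indicates a tone-like one. The -60 dB mapping follows common PAC practice [26]; this is an illustration, not the article's exact computation.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hedged sketch of a per-band tonality estimate via spectral flatness.
float tonalityIndex(const std::vector<float>& binPower)
{
    double logSum = 0.0, sum = 0.0;
    for (float p : binPower) {
        double v = std::max(static_cast<double>(p), 1e-12); // avoid log(0)
        logSum += std::log(v);
        sum += v;
    }
    double n = static_cast<double>(binPower.size());
    double geometricMean = std::exp(logSum / n);
    double arithmeticMean = sum / n;
    double sfmDb = 10.0 * std::log10(geometricMean / arithmeticMean); // <= 0 dB
    // Map to [0, 1]: 1 = tone-like, 0 = noise-like.
    return static_cast<float>(std::min(sfmDb / -60.0, 1.0));
}
```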

Our perceptual culling algorithm sorts the sources by perceptual importance, based on their loudness and tonality, and progressively inserts them into the current mix. The process stops when the sum of remaining sources is masked by the sum of already inserted sources. In a clustering pipeline, sound source culling could be used either before or after clusters are formed. Both solutions have their own advantages and drawbacks.
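
The loop structure of this algorithm might look as follows. The scalar maskRatio test here stands in for the per-band, MP3-style masking thresholds the article actually evaluates [26]; the Ranked struct and function names are our own.

```cpp
#include <algorithm>
#include <vector>

// Hedged sketch of progressive perceptual culling: sort by importance,
// accumulate sources into the mix, and stop once the summed power of
// what remains is masked by what has already been inserted.
struct Ranked { int id; float importance; float power; };

std::vector<int> selectAudible(std::vector<Ranked> sources, float maskRatio)
{
    std::sort(sources.begin(), sources.end(),
              [](const Ranked& a, const Ranked& b) {
                  return a.importance > b.importance;
              });
    float remaining = 0.0f;
    for (const Ranked& s : sources)
        remaining += s.power;

    std::vector<int> kept;
    float mixPower = 0.0f;
    for (const Ranked& s : sources) {
        // Everything left is masked by the current mix: stop inserting.
        if (mixPower > 0.0f && remaining < maskRatio * mixPower)
            break;
        kept.push_back(s.id);
        mixPower += s.power;
        remaining -= s.power;
    }
    return kept;
}
```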

Performing culling first reduces the load on the entire subsequent pipeline. However, culling in this case must be conservative or take into account more complex effects, such as spatial unmasking: sources we can still hear distinctly because of our spatial audio cues, although one would mask the other if both were at the same location [23]. Unfortunately, little data is available to quantify this phenomenon. Performing culling on a per-cluster basis reduces this problem, since sources within a cluster are likely to be close to each other. However, the culling process will then be less efficient, since it does not consider the entire scene. In the following train-station example, we experimented with the first approach, without spatial unmasking, using standard MP3 masking calculations [26].

