| |
| | ||||
![]() | ||||||
| | | |||||
|
Breaking the 64 Spatialized Sources Barrier Perceptually Weighted Clustering The adaptive clustering scheme presented in the previous section can also benefit from psycho-acoustic metrics. For instance, dynamic loudness estimation can be used to weight the error metric in the clustering process so that clusters containing louder sources get refined first. It can also be used to provide a better estimate for the representative of the cluster as a loudness-weighted average of the location of all sources in the cluster. Rendering Auditory Impostors
The second step of the pipeline is to render the groups of sound sources resulting from the clustering process. Although we can replace a group of sound sources by a single representative for localization purposes, a number of operations still have to be performed individually for each source. Such "pre-mixing" operations, usually available in 3D audio APIs, include variable delay lines, resampling and filtering (e.g., occlusions, directivity functions and distance attenuation). For a clustering-based rendering pipeline to remain efficient, these operations must be kept simple and implemented very efficiently since they can quickly become the bottleneck of the approach. Another reason why one would want to keep as much per-source processing as possible is to avoid sparseness in the impulse response and comb-filtering effects that would result from using a single delay and attenuation for each cluster (see Figure 7). Efficient Pre-Mixing Using CPU and GPU
Pre-mixing operations can be efficiently implemented on the CPU and even on the GPU. For instance, pre-mixing for Examples 3 and 4 consists of a linear interpolation for Doppler shifting and resampling (which works well, especially if the input signals are over-sampled beforehand [33]), and three additions and multiplications per sample for gain control and panning on the triplet of virtual speakers closest to the sound's incoming direction. We implemented all operations in assembly language using 32-bit floating-point arithmetic. In our examples, we used audio processing frames of 1024 samples at 44.1KHz. Pre-mixing 180 sound sources required 38% of the audio frame in CPU time on a Pentium 4 mobility 1.8GHz, 70% on a Pentium 3 1GHz. For the train-station application presented in the next section, pre-mixing consisted of a variable delay line implemented using linear interpolation plus 3-band equalization (input signals were pre-filtered) and accumulation. Equalization was used to reproduce frequency-dependent effects, like source directivity and distance attenuation. For this application we experimented with GPU audio premixing. By loading audio data into texture memory (see Figure 8), it is possible to use standard graphics rendering pipelines and APIs to perform premixing operations. Signals pre-filtered in multiple frequency sub-bands are loaded into multiple color components of texture data. Audio premixing is achieved by blending several textured line segments and reading back the final image. Re-equalization can be achieved through color modulation. Texture resampling hardware allows resampling of audio data and the inclusion of Doppler shift effects (Figure 9).
Our test implementation currently only supports 8-bit mixing due to limitations of frame buffer depth and blending operations on the hardware at our disposal. However, recent GPUs support extended resolution frame-buffers and accumulation could be performed using 32-bit floating-point arithmetic using pixel shaders. With performance comparable to optimized (and often non-portable) software implementations, GPU pre-mixing can be implemented using multi-platform 3D graphics APIs. When possible, using the GPU for audio processing will reduce the load on the main CPU and help balance the load between CPU, GPU and APU. Applications Rendering Complex Scenes With Extended Sound Sources Using DS3D The techniques discussed above directly apply to audio rendering of complex, dynamic, 3D scenes containing numerous point sources. They also apply to rendering of extended sources, modeled as a collection of point sources such as the train in Figure 10. In this train-station example with 160 sound sources, we were able to render both the visuals (about 70k polygons) and pre-mix the audio on the GPU (an ATI Radeon mobility 5700 on a Compaq laptop). Pre-mixing with the CPU (Pentium 4 mobility 1.8GHz), using a C++ implementation, resulted in degraded performance (i.e., slower frame rate) but improved audio quality. Perceptual criteria used for loudness evaluation, source culling and clustering were pre-computed on the input signals using three sub-bands (0-500 Hz,500-2000 Hz,2000+ Hz) and short audio frames of 1024 samples at 44.1kHz. For more information and technical details, we refer the reader to [36].
Audio rendering was implemented using DS3D accelerated by the built-in SoundMax chipset (32 3D audio channels). A drawback of the approach is increased bus traffic, since the audio signals for each cluster are pre-mixed outside the APU and must be continuously streamed to the hardware channels. Also, since aggregate signals and representative location for each cluster are continuously updated at each audio frame to best-fit the current soundscape, care must be taken to avoid artefacts when switching the position of the audio channel with DS3D. Switching must happen in-sync with the playback of each new audio frame and can be implemented through the DS3D notification mechanism. On certain hardware platforms, perfect synchronization cannot be achieved but artefacts can be minimized by enforcing spatial coherence of the audio channels from frame to frame (i.e., making sure a channel is used for clusters whose representatives are as close to each other as possible). View Example 5 : train-station rendered with GPU pre-mixing View Example 6 : train-station rendered with CPU pre-mixing Another application that requires spatialization of numerous sound sources is the simulation of early reflected or diffracted paths from walls and objects in the scene [20,27]. Commonly used techniques, based on geometrical acoustics, use ray or beam tracing to model the indirect contributions as a set of virtual image-sources [14,27]. The number of image-sources grows exponentially with the reflection order, limiting such approaches to a few early reflections or diffractions (i.e., reaching the listener first). Obviously, this number further increases with the number of actual sound sources present in the scene, making this problem a perfect candidate for clustering operations. Spatial Audio Bridges Voice communication, as featured on the Xbox Live! system, adds a new dimension to massively multi-player online gaming but is currently limited to monaural audio restitution. Next generation online games or chat rooms will require dedicated spatial audio servers to handle real-time spatialized voice communication between a large number of participants. The various techniques discussed in this paper can be implemented on a spatial audio server to dynamically build clustered representations of the soundscape for each participant, adapting the resolution of the process to the processing power of each client, server load and network load. In such applications, an adaptive clustering strategy could be used to drive a multi-resolution binaural cue coding scheme [35], compressing the soundscape and including incoming voice signals as a single or a small collection of monaural audio streams and corresponding time-varying 3D positional information. Rendering the spatialized audio scene could be done either on the client side (if the client supports 3D audio rendering) or on the server side (e.g., if the client is a mobile device with low processing power). Conclusions We presented a set of techniques aimed at spatialized audio rendering of large numbers of sound sources with limited hardware resources. This techniques will hopefully simplify the work of the game audio designer and developer by removing limitations imposed by the audio rendering hardware. We believe these techniques can be used to leverage the capabilities of current audio hardware while enabling novel effects, such as the use of extended sources. They could also drive future research in audio hardware and audio rendering API design to allow for better rendering of complex dynamic soundscapes. Acknowledgements The author would like to thank Yannick Bachelart, Frank Quercioli, Paul Tumelaire, Florent Sacré and Jean-Yves Regnault for the visual design, modeling and animations of the train-station environment. Christophe Damiano and Alexandre Olivier-Mangon designed and modeled the various elements of the countryside environment. Some of the voice samples in the train-station example were originally recorded by Paul Kaiser for the artwork "TRACE" by Kaiser and Tsingos. This work was partially funded by the 5th Framework IST EU project "CREATE" IST-2001-34231. References and further reading Game Audio Programming and APIs [1]
Soundblaster, Creative Labs. http://www.soundblaster.com Books [6]
D.R. Begault. 3D Sound for Virtual Reality and Multimedia. Academic
Press Professional, 1994. Papers [14]
J. Borish. Extension of the image model to arbitrary polyhedra. J. of
the Acoustical Society of America, 75(6), 1984.
________________________________________________________
|
|||||||||||||||||||||||||||||||||||||||||||||||||||
|
|