By Nicolas Tsingos, Emmanuel Gallo and George Drettakis
Gamasutra
May 29, 2003

Breaking the 64 Spatialized Sources Barrier

Perceptually Weighted Clustering

The adaptive clustering scheme presented in the previous section can also benefit from psycho-acoustic metrics. For instance, dynamic loudness estimation can be used to weight the error metric in the clustering process so that clusters containing louder sources get refined first. It can also be used to provide a better estimate for the representative of the cluster as a loudness-weighted average of the location of all sources in the cluster.
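
The two uses of loudness described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Source` type, the per-frame `loudness` value and the choice of a loudness-weighted squared distance as the error metric are hypothetical stand-ins, not the metric used in our implementation.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Source:
    position: Vec3
    loudness: float  # hypothetical per-frame loudness estimate (non-negative)


def representative(sources: Sequence[Source]) -> Vec3:
    """Loudness-weighted average of the source positions in one cluster."""
    total = sum(s.loudness for s in sources)
    if total <= 0.0:
        # All sources are silent: fall back to the plain centroid.
        weights = [1.0 / len(sources)] * len(sources)
    else:
        weights = [s.loudness / total for s in sources]
    return tuple(
        sum(w * s.position[axis] for w, s in zip(weights, sources))
        for axis in range(3)
    )


def weighted_error(sources: Sequence[Source]) -> float:
    """Assumed metric: loudness-weighted squared distance to the representative."""
    rep = representative(sources)
    return sum(
        s.loudness * sum((p - r) ** 2 for p, r in zip(s.position, rep))
        for s in sources
    )


def next_cluster_to_refine(clusters: Sequence[Sequence[Source]]) -> int:
    """Index of the cluster with the largest weighted error (split it first)."""
    return max(range(len(clusters)), key=lambda i: weighted_error(clusters[i]))
```

With this weighting, two clusters with the same geometry are no longer treated equally: the one holding the louder sources reports the larger error and is refined first, and its representative is pulled toward its loudest members.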

Rendering Auditory Impostors


Figure 7. Rendering clusters of sound sources and corresponding energy distribution through time (echograms). Using a single delay/attenuation for each cluster results in a sparse impulse response and comb-filtering effects (top echogram). Per-source pre-mixing solves this problem (bottom echogram).

The second step of the pipeline is to render the groups of sound sources resulting from the clustering process. Although we can replace a group of sound sources by a single representative for localization purposes, a number of operations still have to be performed individually for each source. Such "pre-mixing" operations, usually available in 3D audio APIs, include variable delay lines, resampling and filtering (e.g., occlusions, directivity functions and distance attenuation). For a clustering-based rendering pipeline to remain efficient, these operations must be kept simple and implemented very efficiently since they can quickly become the bottleneck of the approach. Another reason why one would want to keep as much per-source processing as possible is to avoid sparseness in the impulse response and comb-filtering effects that would result from using a single delay and attenuation for each cluster (see Figure 7).

Efficient Pre-Mixing Using CPU and GPU


Figure 8. Audio signals must be pre-processed in order to be pre-mixed by the GPU. They are split into three frequency bands (low, medium, high), shown as the red, green and blue plots on the lower graph, and stored as one-dimensional RGB texture chunks (one pair for the positive and negative parts of the signal).

Pre-mixing operations can be efficiently implemented on the CPU and even on the GPU. For instance, pre-mixing for Examples 3 and 4 consists of a linear interpolation for Doppler shifting and resampling (which works well, especially if the input signals are over-sampled beforehand [33]), and three additions and multiplications per sample for gain control and panning on the triplet of virtual speakers closest to the sound's incoming direction. We implemented all operations in assembly language using 32-bit floating-point arithmetic. In our examples, we used audio processing frames of 1024 samples at 44.1 kHz. Pre-mixing 180 sound sources took 38% of the audio frame duration in CPU time on a Pentium 4 mobility 1.8 GHz, and 70% on a Pentium 3 1 GHz.
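
As an illustration of the per-source work described above, here is a minimal Python sketch of one pre-mixing pass: linear-interpolation resampling followed by three multiply-adds per sample. The function name, the `rate` parameter (read-pointer step per output sample, with 1.0 meaning no Doppler shift) and the three-bus layout are assumptions made for this example; our implementation was written in assembly language and is not reproduced here.

```python
from typing import List, Sequence


def premix_source(
    signal: Sequence[float],
    start: float,
    rate: float,
    gains: Sequence[float],
    buses: List[List[float]],
) -> float:
    """Resample one source and accumulate it into three speaker buses.

    `start` is the (non-negative) fractional read position in `signal`,
    `rate` the read-pointer step per output sample, and `gains` the three
    panning gains for the speaker triplet. Returns the read position to
    use for the next frame.
    """
    frame_len = len(buses[0])
    pos = start
    for n in range(frame_len):
        i = int(pos)
        frac = pos - i
        # Samples past the end of the input are treated as silence.
        a = signal[i] if i < len(signal) else 0.0
        b = signal[i + 1] if i + 1 < len(signal) else 0.0
        sample = a + frac * (b - a)  # linear interpolation
        # Three multiply-adds per sample: gain control and panning.
        for bus, gain in zip(buses, gains):
            bus[n] += gain * sample
        pos += rate
    return pos
```

Calling this once per source, with the same three buses, mixes all sources that share a speaker triplet. For reference, a frame of 1024 samples at 44.1 kHz lasts about 23.2 ms, so the 38% figure quoted above corresponds to roughly 8.8 ms of CPU time per frame for the 180 sources.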

For the train-station application presented in the next section, pre-mixing consisted of a variable delay line implemented using linear interpolation, plus 3-band equalization (input signals were pre-filtered) and accumulation. Equalization was used to reproduce frequency-dependent effects, like source directivity and distance attenuation. For this application we experimented with GPU audio pre-mixing. By loading audio data into texture memory (see Figure 8), it is possible to use standard graphics rendering pipelines and APIs to perform pre-mixing operations. Signals pre-filtered in multiple frequency sub-bands are loaded into multiple color components of texture data. Audio pre-mixing is achieved by blending several textured line segments and reading back the final image. Re-equalization can be achieved through color modulation. Texture resampling hardware allows resampling of audio data and the inclusion of Doppler shift effects (Figure 9).


Figure 9. Pre-mixing audio signals with the GPU. Signals are rendered as textured line segments. Resampling is achieved through texturing operations and re-equalization through color modulation. Signals for all sources in the cluster are rendered with blending turned on, resulting in the desired mix.

Our test implementation currently only supports 8-bit mixing due to limitations of frame buffer depth and blending operations on the hardware at our disposal. However, recent GPUs support extended-resolution frame buffers, and accumulation could be performed with 32-bit floating-point arithmetic in pixel shaders. GPU pre-mixing offers performance comparable to optimized (and often non-portable) software implementations, yet it can be implemented using multi-platform 3D graphics APIs. When possible, using the GPU for audio processing will reduce the load on the main CPU and help balance the load between CPU, GPU and APU.
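
The data flow of Figures 8 and 9, including the 8-bit limitation, can be emulated on the CPU. The sketch below is not GPU code and makes several assumptions: the helper names are hypothetical, samples are taken to be band-filtered values in [-1, 1], color modulation is modeled as a per-band gain in [0, 1], and additive blending saturates at 255 as a fixed-point frame buffer would.

```python
from typing import List, Sequence, Tuple

Texel = Tuple[int, int, int]  # 8-bit R, G, B = low, mid, high band
Bands = Tuple[float, float, float]


def to_8bit(value: float) -> int:
    """Quantize a value in [0, 1] to an 8-bit color component."""
    return max(0, min(255, int(value * 255.0 + 0.5)))


def pack(samples: Sequence[Bands]) -> Tuple[List[Texel], List[Texel]]:
    """Split band-filtered samples into a positive and a negative texture."""
    pos = [tuple(to_8bit(max(v, 0.0)) for v in s) for s in samples]
    neg = [tuple(to_8bit(max(-v, 0.0)) for v in s) for s in samples]
    return pos, neg


def blend(target: List[List[int]], texture: Sequence[Texel],
          color: Sequence[float]) -> None:
    """Additive blending of a color-modulated texture, saturating at 255."""
    for pixel, texel in zip(target, texture):
        for c in range(3):
            pixel[c] = min(255, pixel[c] + to_8bit(texel[c] / 255.0 * color[c]))


def premix(sources: Sequence[Sequence[Bands]],
           band_gains: Sequence[Sequence[float]]) -> List[float]:
    """Mix all sources of a cluster and read back one signal."""
    n = len(sources[0])
    pos_fb = [[0, 0, 0] for _ in range(n)]  # frame buffer, positive parts
    neg_fb = [[0, 0, 0] for _ in range(n)]  # frame buffer, negative parts
    for samples, gains in zip(sources, band_gains):
        pos, neg = pack(samples)
        blend(pos_fb, pos, gains)  # gains act as the modulation color
        blend(neg_fb, neg, gains)
    # Read back: recombine the two halves and sum the three bands.
    return [sum(p[c] - q[c] for c in range(3)) / 255.0
            for p, q in zip(pos_fb, neg_fb)]
```

The emulation makes the cost of 8-bit mixing visible: each texel carries a quantization error of up to half a step, and a few loud sources in the same band are enough to saturate the accumulation, which is why extended-resolution or floating-point buffers are attractive.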

Applications

Rendering Complex Scenes With Extended Sound Sources Using DS3D

The techniques discussed above directly apply to audio rendering of complex, dynamic, 3D scenes containing numerous point sources. They also apply to rendering of extended sources, modeled as a collection of point sources such as the train in Figure 10. In this train-station example with 160 sound sources, we were able to render both the visuals (about 70k polygons) and pre-mix the audio on the GPU (an ATI Radeon mobility 5700 on a Compaq laptop). Pre-mixing with the CPU (Pentium 4 mobility 1.8 GHz), using a C++ implementation, resulted in degraded performance (i.e., slower frame rate) but improved audio quality. Perceptual criteria used for loudness evaluation, source culling and clustering were pre-computed on the input signals using three sub-bands (0-500 Hz, 500-2000 Hz and above 2000 Hz) and short audio frames of 1024 samples at 44.1 kHz. For more information and technical details, we refer the reader to [36].
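
To make the sub-band pre-computation concrete, the following sketch splits one audio frame into the three bands quoted above and returns the energy in each. It is only an assumption about how such per-frame data could be obtained: it uses a naive DFT for clarity (a real tool would use an FFT or a filter bank), and it stops at band energies rather than implementing a loudness model. The names `BAND_EDGES_HZ` and `band_energies` are hypothetical.

```python
import cmath
import math
from typing import List, Sequence

BAND_EDGES_HZ = (500.0, 2000.0)  # low | mid | high, as quoted in the text


def band_energies(frame: Sequence[float],
                  sample_rate: float = 44100.0) -> List[float]:
    """Energy of one frame in the low, mid and high bands (naive DFT)."""
    n = len(frame)
    energies = [0.0, 0.0, 0.0]
    for k in range(n // 2 + 1):  # non-negative frequencies only
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        freq = k * sample_rate / n
        # Band index = number of band edges at or below this frequency.
        band = sum(freq >= edge for edge in BAND_EDGES_HZ)
        energies[band] += abs(coeff) ** 2
    return energies
```

Running this once per frame for every input signal, offline, yields a small table of three values per frame that can be looked up at run time instead of analyzing the audio on the fly.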


Figure 10. (a) An application of the perceptual rendering pipeline to a complex train-station environment. (b) Each pedestrian acts as two sound sources (voice and footsteps). Each wheel of the train is also modeled as a point sound source to get the proper spatial rendering for this extended source. Overall, 160 sound sources must be rendered (magenta dots). (c) Colored lines represent direct sound paths from the sources to the listener. All lines in red represent perceptually masked sound sources while yellow lines represent audible sources. Note how the train noise masks the conversations and footsteps of the pedestrians. (d) Clusters are dynamically constructed to spatialize the audio. Green spheres indicate representative location of the clusters. Blue boxes are bounding boxes of all sources in each cluster.

Audio rendering was implemented using DS3D accelerated by the built-in SoundMax chipset (32 3D audio channels). A drawback of the approach is increased bus traffic, since the audio signals for each cluster are pre-mixed outside the APU and must be continuously streamed to the hardware channels. Also, since the aggregate signal and representative location of each cluster are updated at every audio frame to best fit the current soundscape, care must be taken to avoid artefacts when switching the position of the audio channel with DS3D. Switching must happen in sync with the playback of each new audio frame and can be implemented through the DS3D notification mechanism. On certain hardware platforms, perfect synchronization cannot be achieved, but artefacts can be minimized by enforcing spatial coherence of the audio channels from frame to frame (i.e., making sure a channel is used for clusters whose representatives are as close to each other as possible).

View Example 5 : train-station rendered with GPU pre-mixing

View Example 6 : train-station rendered with CPU pre-mixing
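
The frame-to-frame spatial coherence mentioned above amounts to a matching problem between hardware channels and clusters. Below is a minimal sketch, assuming a greedy nearest-pair strategy; the function name and data layout are hypothetical, no DS3D call is shown, and a greedy match is not guaranteed to minimize the total displacement.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def assign_channels(previous: Sequence[Optional[Vec3]],
                    clusters: Sequence[Vec3]) -> List[Optional[int]]:
    """Map each hardware channel to a cluster index (None if unused).

    `previous[ch]` is the representative position played on channel `ch`
    during the last frame, or None if the channel was idle.
    """
    assignment: List[Optional[int]] = [None] * len(previous)
    # All (distance, channel, cluster) pairs, closest first.
    pairs = sorted(
        (math.dist(prev, rep), ch, cl)
        for ch, prev in enumerate(previous) if prev is not None
        for cl, rep in enumerate(clusters)
    )
    used = set()
    for _, ch, cl in pairs:
        if assignment[ch] is None and cl not in used:
            assignment[ch] = cl
            used.add(cl)
    # Clusters that found no previously active channel take idle ones.
    leftovers = [cl for cl in range(len(clusters)) if cl not in used]
    for ch in range(len(previous)):
        if assignment[ch] is None and leftovers:
            assignment[ch] = leftovers.pop(0)
    return assignment
```

Each channel then keeps playing a cluster whose representative is near the one it played during the previous frame, so an imperfectly synchronized position switch moves the channel by as little as possible.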

Another application that requires spatialization of numerous sound sources is the simulation of early reflected or diffracted paths from walls and objects in the scene [20,27]. Commonly used techniques, based on geometrical acoustics, use ray or beam tracing to model the indirect contributions as a set of virtual image-sources [14,27]. The number of image-sources grows exponentially with the reflection order, limiting such approaches to a few early reflections or diffractions (i.e., reaching the listener first). Obviously, this number further increases with the number of actual sound sources present in the scene, making this problem a perfect candidate for clustering operations.
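
The exponential growth is easy to quantify. The sketch below shows how a first-order image source is obtained by mirroring a source across a planar surface, and gives the usual upper bound on the number of image sources when every image is mirrored again across every surface except the one that generated it. Function names are hypothetical, and the bound ignores the visibility and validity tests that prune many candidates in practice.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def mirror(point: Vec3, plane_point: Vec3, unit_normal: Vec3) -> Vec3:
    """First-order image source: reflect `point` across a planar surface."""
    # Signed distance from the point to the plane along the unit normal.
    d = sum((p - q) * n for p, q, n in zip(point, plane_point, unit_normal))
    return tuple(p - 2.0 * d * n for p, n in zip(point, unit_normal))


def max_image_sources(num_surfaces: int, max_order: int,
                      num_sources: int = 1) -> int:
    """Upper bound on image sources up to `max_order`, without pruning."""
    per_source = sum(
        num_surfaces * (num_surfaces - 1) ** (order - 1)
        for order in range(1, max_order + 1)
    )
    return num_sources * per_source
```

For a simple six-surface room, the bound is 6 image sources at first order, 36 up to second order and 186 up to third order, for each actual sound source in the scene.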

Spatial Audio Bridges

Voice communication, as featured on the Xbox Live! system, adds a new dimension to massively multi-player online gaming but is currently limited to monaural audio reproduction. Next-generation online games or chat rooms will require dedicated spatial audio servers to handle real-time spatialized voice communication between a large number of participants. The various techniques discussed in this paper can be implemented on a spatial audio server to dynamically build clustered representations of the soundscape for each participant, adapting the resolution of the process to the processing power of each client, the server load and the network load. In such applications, an adaptive clustering strategy could be used to drive a multi-resolution binaural cue coding scheme [35], compressing the soundscape, including the incoming voice signals, into a single monaural audio stream or a small collection of them, together with the corresponding time-varying 3D positional information. Rendering the spatialized audio scene could be done either on the client side (if the client supports 3D audio rendering) or on the server side (e.g., if the client is a mobile device with low processing power).

Conclusions

We presented a set of techniques aimed at spatialized audio rendering of large numbers of sound sources with limited hardware resources. These techniques will hopefully simplify the work of the game audio designer and developer by removing limitations imposed by the audio rendering hardware. We believe these techniques can be used to leverage the capabilities of current audio hardware while enabling novel effects, such as the use of extended sources. They could also drive future research in audio hardware and audio rendering API design to allow for better rendering of complex dynamic soundscapes.

Acknowledgements

The authors would like to thank Yannick Bachelart, Frank Quercioli, Paul Tumelaire, Florent Sacré and Jean-Yves Regnault for the visual design, modeling and animations of the train-station environment. Christophe Damiano and Alexandre Olivier-Mangon designed and modeled the various elements of the countryside environment. Some of the voice samples in the train-station example were originally recorded by Paul Kaiser for the artwork "TRACE" by Kaiser and Tsingos. This work was partially funded by the 5th Framework IST EU project "CREATE" IST-2001-34231.

References and further reading

Game Audio Programming and APIs

[1] Soundblaster, Creative Labs. http://www.soundblaster.com
[2] DirectX homepage, Microsoft. http://www.microsoft.com/windows/directx/default.asp
[3] Environmental audio extensions: EAX, Creative Labs. http://www.soundblaster.com/eaudio, http://developer.creative.com
[4] EAGLE, Creative Labs. http://developer.creative.com
[5] ZoomFX, MacroFX, Sensaura. http://www.sensaura.co.uk

Books

[6] D.R. Begault. 3D Sound for Virtual Reality and Multimedia. Academic Press Professional, 1994.
[7] J. Blauert. Spatial Hearing: The Psychophysics of Human Sound Localization. M.I.T. Press, Cambridge, MA, 1983.
[8] A.S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. M.I.T. Press, Cambridge, MA, 1990.
[9] A. Gersho and R.M. Gray. Vector quantization and signal compression, The Kluwer International Series in Engineering and Computer Science, 159. Kluwer Academic Publisher, 1992.
[10] K. Steiglitz. A DSP Primer with applications to digital audio and computer music. Addison Wesley, 1996.
[11] E. Zwicker and H. Fastl. Psychoacoustics: Facts and Models. Springer, 1999.
[12] M. Kahrs and K. Brandenburg Ed., Applications of Digital Signal Processing to Audio and Acoustics. Kluwer Academic Publishers, 1998.
[13] D. Luebke, M. Reddy, J.D. Cohen, A. Varshney, B. Watson, and R. Huebner. Level of Detail for 3D Graphics. Morgan Kaufmann Publishing, 2002.

Papers

[14] J. Borish. Extension of the image model to arbitrary polyhedra. J. of the Acoustical Society of America, 75(6), 1984.
[15] K. Brandenburg. MP3 and AAC explained. AES 17th International Conference on High-Quality Audio Coding, September 1999.
[16] D.P.W. Ellis. A perceptual representation of audio. Master's thesis, Massachusetts Institute of Technology, 1992.
[17] H. Fouad, J.K. Hahn, and J.A. Ballas. Perceptually based scheduling algorithms for real-time synthesis of complex sonic environments. Proceedings of the 1997 International Conference on Auditory Display (ICAD'97), Xerox Palo Alto Research Center, Palo Alto, USA, 1997.
[18] P. Fränti, T. Kaukoranta, and O. Nevalainen. On the splitting method for vector quantization codebook generation. Optical Engineering, 36(11):3043-3051, 1997.
[19] P. Fränti and J. Kivijärvi. Randomised local search algorithm for the clustering problem. Pattern Analysis and Applications, 3:358 - 369, 2000.
[20] T. Funkhouser, P. Min, and I. Carlbom. Real-time acoustic modeling for distributed virtual environments. ACM Computer Graphics, SIGGRAPH'99 Proceedings, pages 365-374, August 1999.
[21] J. Herder. Optimization of sound spatialization resource management through clustering. The Journal of Three Dimensional Images, 3D-Forum Society, 13(3):59-65, September 1999.
[22] J. Herder. Visualization of a clustering algorithm of sound sources based on localization errors. The Journal of Three Dimensional Images, 3D-Forum Society, 13(3):66-70, September 1999.
[23] I.J. Hirsh. The influence of interaural phase on interaural summation and inhibition. J. of the Acoustical Society of America, 20(4):536-544, 1948.
[24] A. Likas, N. Vlassis, and J.J. Verbeek. The global k-means clustering algorithm. Pattern Recognition, 36(2):451-461, 2003.
[25] E. M. Painter and A. S. Spanias. A review of algorithms for perceptual coding of digital audio signals. DSP-97, 1997.
[26] R. Rangachar. Analysis and improvement of the MPEG-1 audio layer III algorithm at low bit-rates. Master's thesis, Arizona State Univ., December 2001.
[27] N. Tsingos, T. Funkhouser, A. Ngan, and I. Carlbom. Modeling acoustics in virtual environments using the uniform theory of diffraction. ACM Computer Graphics, SIGGRAPH'01 Proceedings, pages 545-552, August 2001.
[28] W. Martens, Principal Components Analysis and Resynthesis of Spectral Cues to Perceived Direction, Proc. Int. Computer Music Conf. (ICMC'87), pages 274-281, 1987.
[29] K. van den Doel, D. K. Pai, T. Adam, L. Kortchmar and K. Pichora-Fuller, Measurements of Perceptual Quality of Contact Sound Models, Proceedings of the International Conference on Auditory Display (ICAD 2002), Kyoto, Japan, pages 345-349, 2002.
[30] M. Lagrange and S. Marchand, Real-time Additive Synthesis of Sound by Taking Advantage of Psychoacoustics, Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-01), Limerick, Ireland, December 6-8 2001.
[31] V. Pulkki, Virtual Sound Source Positioning Using Vector Base Amplitude Panning, J. Audio Eng. Soc., 45(6), pages 456-466, June 1997.
[32] B. C. J. Moore, B. Glasberg and T. Baer, A Model for the Prediction of Thresholds, Loudness and Partial Loudness, J. Audio Eng. Soc., 45(4), pages 224-240, 1997.
[33] E. Wenzel, J. Miller, and J. Abel. A software-based system for interactive spatial sound synthesis. Proceedings of ICAD 2000, Atlanta, USA, April 2000.
[34] York University Ambisonics homepage. http://www.york.ac.uk/inst/mustech/3d_audio/ambison.htm
[35] C. Faller and F. Baumgarte, Binaural Cue Coding Applied to Audio Compression with Flexible Rendering, Proc. AES 113th Convention, Los Angeles, USA, Oct. 2002.
[36] N. Tsingos, E. Gallo and G. Drettakis. Perceptual audio rendering of complex virtual environments, INRIA Technical Report #4734, February 2003.





Copyright © 2003 CMP Media LLC
