Sound Synthesis for Virtual Reality and Computer Games by Cornells Pieter van den Doel Ph.D., University of California at Santa Cruz, 1984 M.Sc, University of British Columbia, 1994 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF T H E REQUIREMENTS FOR T H E D E G R E E OF Doctor of Philosophy in T H E FACULTY OF G R A D U A T E STUDIES (Department of Computer Science) we accept this thesis as conforming to the required standard The GnivefcSity of British Columbia November 1998 © Cornelis Pieter van den Doel, 1998 In p r e s e n t i n g t h i s t h e s i s i n p a r t i a l f u l f i l m e n t of the requirements f o r an a d v a n c e d d e g r e e a t t h e U n i v e r s i t y o f B r i t i s h C o l u m b i a , T a g r e e t h a t t h e L i b r a r y s h a l l make i t f r e e l y a v a i l a b l e f o r r e f e r e n c e and s t u d y . I f u r t h e r agree t h a t p e r m i s s i o n f o x e x t e n s i v e c o p y i n g of t h i s t h e s i s f o r s c h o l a r l y p u r p o s e s may be g r a n t e d by t h e h e a d o f my d e p a r t m e n t o r by h i s o r h e r r e p r e s e n t a t i v e s . I t i s u n d e r s t o o d t h a t c o p y i n g o r p u b l i c a t i o n o f t h i s t h e s i s f o r f i n a n c i a l g a i n s h a l l not be a l l o w e d w i t h o u t my w r i t t e n p e r m i s s i o n . Dep; i r t m e n t of (^M^^^<" The U n i v e r s i t y o f B r i t i s h V a n c o u v e r , Canada Sc^y^Oi-^ Columbia Date lof2 3/31/99 1:27 PI* Abstract T h e synthesis of audio in real-time computer simulations is investigated. A physics based parameterized v i b r a t i o n model for physical objects is constructed, and a realtime synthesis algorithm is developed which allows the synthesis of the sound made by such objects under any kind of interaction force. M e t h o d s for o b t a i n i n g the parameters of such models are investigated. W e s t u d y m a t h e m a t i c a l models w i t h simple geometries, parameter fitting to measured d a t a , and e m p i r i c a l models. M o d e l s for interaction forces o c c u r r i n g d u r i n g contacts between rigid bodies such as impact and sliding interactions are developed, as well as models for the d r i v i n g forces for combustion engines and avalanches. Studies were conducted of several objects which were successfully modeled w i t h these techniques. Several computer programs were w r i t t e n for the testing of models, for the construction of models, and for the demonstration of the level of realism that can be achieved w i t h this type of synthesis. It is concluded t h a t this type of synthesis can generate realistic, interactive audio using only a small fraction of available C P U power on modern personal computers. 11 Contents Abstract ii Contents iii List of Tables vii List of Figures viii Acknowledgements 1 2 xii Audio in Virtual Reality 1 1.1 Goals and Background 1 1.2 Example of an Audio Simulation 4 1.3 Contributions of this Thesis 9 Background 12 2.1 Sound Perception 12 2.2 Representation and Rendering of Computer Sound 14 2.3 Environment Modeling 17 2.4 Computer Music 19 m 3 4 5 6 Physics of Sound 22 3.1 Sound Propagation 23 3.2 Fundamental Equations for Gases 23 3.3 Linearized Equations for Acoustic Waves 26 3.4 Interaction of Acoustic Waves with Solids 29 3.5 Sound Generation 30 3.6 Vibration Theory 34 Contact Sounds 38 4.1 Overview 40 4.2 Vibration Modes from Shape 40 4.3 Mode Amplitudes from Impact Location 44 4.4 Sound Sources from Vibrating Shapes 48 4.5 Sounds and Material Properties 49 4.6 The Sound Map 53 4.7 Real-time Synthesis 55 Creating Vibration Models 61 5.1 Computing Model Parameters Analytically 62 5.2 Fitting Model Parameters to Empirical Data 65 5.3 Designing Model Parameters by Hand 80 Interaction Force Models 82 6.1 Impact Forces • 6.2 Continuous Contact Forces 84 6.3 Live Data Streams 86 6.4 Engine Forces 86 iv 82 7 Implementations 7.1 7.2 7.3 8 90 General System Architecture 90 7.1.1 SynthesisEngine 94 7.1.2 Controller . Authoring Tools • . . . : 99 103 7.2.1 Off-line Model Tester 103 7.2.2 Vibration Model Generators 104 Demonstration Programs 105 7.3.1 Sonic Explorer 1.0 105 7.3.2 Sonic Explorer 2.0 106 7.3.3 Sonified Objects 108 7.3.4 Avalanche Sounds 110 7.3.5 Location Dependent Sounds of Mathematical Shapes Ill 7.3.6 Virtual Bell Tower 112 7.3.7 Real-time Model Tester 113 Conclusions and Extensions 121 8.1 Summary of Results 121 8.2 Extensions and Future Work 123 8.2.1 Faster Synthesis 123 8.2.2 Integrate Waveguide Models 123 8.2.3 Improve Parameter Fitting 124 8.2.4 Improve Interaction Models 125 8.2.5 Non-linear Models 125 Bibliography 128 v Appendices 134 Appendix A File Format for Vibration Models 135 Appendix B Parameter Fitting Data 139 Appendix C Audio Synthesis Techniques for Music 162 vi List of Tables 1.1 Graphics-Audio analogies 3 7.1 Performance of synthesis algorithm at a sampling rate of 22,050 Hz on a 233 Mhz Pentium P C with M M X using a C implementation and a Java implementation 98 vii List of Figures 1.1 The stages of modeling needed to compute audio 4 1.2 Interactive environment: a hammer hitting a bar 6 1.3 Interactive environment: a hammer hitting a bar at different locations 8 4.1 First nine eigenfunctions of a square membrane 4.2 Excited frequencies of a square membrane struck near the corner. . . 4.3 Excited frequencies of a square membrane struck near the middle of 43 -46 a side 47 4.4 Excited frequencies of a square membrane struck near the center. . . 47 4.5 Ratio of frequency/damping for the first 100 modes sorted by amplitude (left) and frequency (right) of a struck metal vase. Reconstructed with a Fourier window of 4096 4.6 51 Ratio of frequency/damping for the first 100 modes sorted by amplitude (left) and frequency (right) of a struck hockey stick. Reconstructed with a Fourier window of 1024 vm 52 4.7 Ratio of frequency/damping for the first 100 modes sorted by amplitude (left) and frequency (right) of a struck computer tower box. Reconstructed with a Fourier window of 1024 4.8 52 Ratio of frequency/damping for the first 100 modes sorted by amplitude (left) and frequency (right) of a struck sword (sword I). Reconstructed with a Fourier window of 2048 5.1 53 Sonogram of recorded (left) and reconstructed (right) impulse response of vase. The window size of the Fourier transform was 4096 and the sampling rate was 44100 Hz 5.2 . Identified modes for vase. Shown are the logarithmic amplitudes of the frequency peaks and their linear fits 5.3 73 Identified modes for vase. Shown are the frequencies corresponding to Figure 5.2 5.4 71 74 Identified modes for vase. The frequency response is shown. The bottom figure shows a subset of the frequency range of the upper figure 75 5.5 Frequency response of vase for three regions 79 6.1 Sample driving four stroke one cylinder engine 87 6.2 Sample driving four stroke one cylinder engine with fan 88 6.3 Sample driving four cylinder engine 89 7.1 System architecture for applications with real-time audio synthesis. . 91 7.2 Off-line model tester 103 7.3 A room modeled with the Sonic Explorer 105 7.4 A xylophone modeled with the Sonic Explorer 107 ix 7:5 A vase modeled with the Sonic Explorer 7.6 Plastic swords with embedded contact mikes used to create metal > 107 sword sounds. 109 7.7 Applet to generate avalanche sounds Ill 7.8 Applet to create location dependent sounds for a variety of objects. . 112 7.9 Applet for ringing a bell tower 114 7.10 Applet for ringing a bell tower 115 7.11 Real-time model tester. Main control panel 116 7.12 Real-time model tester. Interaction windows to scrape and hit objects. 117 7.13 Real-time model tester. Engine model editor 117 8.1 Pendulum is non-linear when colliding with the walls 127 B.l Top of vase with a window of 1024 140 B.2 Top of vase with a window of 2048 141 B.3 Top of vase with a window of 4096 142 B.4 Top of vase with a window of 8192 ! 143 B.5 Top of vase with a window of 1024: Material constant 144 B.6 Top of vase with a window of 2048: Material constant 144 B.7 Top of vase with a window of 8192: Material constant 145 B.8 Santur with a window of 1024 146 B.9 Santur with a window of 2048 147 B.10 Santur with a window of 4096 148 B . l l Santur with a window of 8192 149 B.12 Piano with a window of 1024 150 B.13 Piano with a window of 2048 151 x B.14 Piano with a window of 4096 152 B.15 Piano with a window of 8192 153 B.16 Computer tower with a window of 512 154 B.17 Computer tower with a window of 1024 155 B.18 Computer tower with a window of 2048 . . . 156 B.19 Computer tower with a window of 4096 157 B.20 Computer tower with a window of 512: Material constant 158 B.21 Computer tower with a window of 2048: Material constant 158 B.22 Computer tower with a window of 4096: Material constant . . . . . 159 B.23 Bell with a window of 1024: Material constant 160 B.24 Bell with a window of 2048: Material constant 160 B.25 Bell with a window of 4096: Material constant 161 B.26 Bell with a window of 8192: Material constant 161 xi Acknowledgements I would like to thank my thesis supervisor, Dinesh K. Pai, for his active participation in this research and his support. I appreciate feedback and comments from the other members of my thesis committee, Alain Fournier and Uri Ascher. Thanks to IRIS, ASI, and IVL Technologies Ltd., for financial support. C O R N E L I S The University November of British Columbia 1998 xii P I E T E R V A N D E N D O E L Chapter 1 Audio in Virtual Reality The research described in this thesis was conducted to create a methodology for the rendering of audio in virtual reality environments such as simulations and computer games. The focus is on a topic which has not been investigated very much at the time of writing, namely the computation (synthesis) and rendering of "natural" sounds produced by physical objects. Most research in sound synthesis has concentrated on "computer music," where the primary interest is to create interesting sounds, or to reproduce musical sounds, rather than to faithfully reproduce existing naturally occurring sounds. 1.1 Goals and Background Audio techniques [37] used in games and virtual reality (VR) are more inspired by movie and television sound effects, than based on a physical simulation approach. Unfortunately, because of the interactivity of games and V R , these techniques are of limited use in such environments. A common technique to generate audio in a game or simulation is to play 1 back a prerecorded sound (a "sample") when a certain event, such as one solid object colliding with another, or a player of a video game killing the final monster, takes place. For "effects" such as a triumphant fanfare after winning the game this is appropriate, but for more subtle interactive sounds caused by physical objects it is quite tedious to have the same sounds repeated over and over again. Other types of environmental sounds, such as those caused by objects sliding and rolling, are continuous and are driven by the momentary state of the physical objects that cause them. They can not be prerecorded and some type of synthesis is necessary. Synthesis is a form of modeling or simulation, so it is useful to compare audio synthesis with simulation in a more general context. A lot of research in "reality simulation", i.e., the creation of a sensory illusion with the computer, has focused on computer graphics rather than on computer sound. As a result, many techniques and software packages exist to display and animate a scene using computer graphics at many levels of realism, but similar techniques to recreate the sounds of the scene are still underdeveloped. The study of creating sound with a computer is analogous to computer graphics, just as the study of interpreting sound with a computer is analogous to computer vision. However, the analogy between sound and graphics is not straightforward, because vision and hearing are very different sensory modes. Nevertheless, it is useful to try to draw analogies between the two. In Table 1.1 we have indicated some possible connections one can make at various levels. The correspondences indicated in the table are intended to stimulate the mind, rather than to show actual connections between sound and graphics. For each of the correspondences one can find arguments against and for the analogy. The modeling and simulation of rigid bodies with contacts has been inves- 2 Sound Graphics Sample Sampled sound Stereo sound Spatial sound F M synthesis Sound emission Vibration analysis Distance attenuation Impact sounds Scraping, rolling Material Room acoustics Pixel Digital picture Perspective Stereo display Color-map animation Shading Surface modeling Haze or fog Flashes Animation Color Raytracing and radiosity Table 1.1: Graphics-Audio analogies. tigated widely [78, 79, 47, 42, 10, 9, 22] recently, and this type of simulation has a lot in common with the simulation of audio. In both cases the task is to update the state of the objects/audio-buffers in a manner consistent with the (simplified) physical laws governing the simulation. A rigid body simulation would also be an ideal environment to integrate with an audio simulation, as the information about body contact forces, collisions, and contacts, which is usually directly available from the simulation, can be used to drive the audio synthesis. What are the physical factors determining the generation of audio? Sound is most commonly created by accelerated bodies and their vibrating surfaces: Heat and turbulence can also act as sources of sound. These bodies interact with the medium (air, usually), and generate pressure waves that propagate through the air, and scatter from other objects. The sound is scattered at the ear and passes through the ear canal and is detected at the eardrum. It is often possible to regard the sound production as occurring in several 3 Impact Vibration Emission Propagation Listener Figure 1.1: The stages of modeling needed to compute audio. stages that can be modeled independently. A force such as an impact is applied to a material object, which vibrates. The object emits sound waves, which propagate through the environment and are detected by the human ears. We have depicted this audio "pipeline" in Figure 1.1. To produce realistic simulated sounds with a computer we need to understand the physics behind the sound generating phenomena as well as the nature of human perception of sound. A simplified model of the complete physics of the sound generation should be created, so that the associated sounds can be computed and rendered in real-time. The computation should be based on physical laws, not on "hacked" algorithms. This provides the possibility of systematic refinement, based on knowledge of physical laws. 1.2 Example of an Audio Simulation When we strike an object such as a metal bar, we hear a brief transient or'click followed by a more or less sustained sound, which then decays. The click or onset has some role in identifying the sound. For example, try listening to a recording of a flute played backwards. The sound is no longer as clearly recognizable as a flute, even though the sustained part of the sound is unchanged. See also [15, 51]. Nevertheless, a lot of information about the nature of the object is present in the sustained part. To obtain this, we need to compute the vibrations of an object when it is struck, and compute the resulting sound emitted. The sound made by a struck object depends on many factors, of which we consider the following: 1. The shape of the object. The propagation of vibrations in the object depends on its geometry. This is why a struck gong sounds very different from a struck piano string, for example. Shape and material together determine a characteristic frequency spectrum. 2. The location of the impact. The timbre of the sound, i.e., the amplitudes of the frequency components, depends on where the object is struck. For example, a table struck at the edges makes a different sound than when struck at the center. 3. The material of the struck object. The harder the material, the brighter the sound. We also relate the material to the decay rate of the frequency compo-^ nents of the sound through the internal friction parameter, see below. 4. The force of the impact. Typically, the amplitude of the emitted sound is proportional to the square root of the energy of the impact. A l l these factors give important cues about the nature of an object and about what is happening to an object in the simulation. As an illustrative example, consider an environment where a user interacts with a metal bar by hitting it with a (virtual) hammer. When the user hits the metal bar, we want to synthesize the appropriate sound. Obviously, whatever processing we do after the impact will have to be done very fast, or unrealistic latency will appear. (Typically, for musicians, 20ms latency is considered acceptable, 3ms would be ideal. ) 1 'Andy Schloss, private communication. 5 Figure 1.2: Interactive environment: a hammer hitting a bar. A simple approach is to record a sample of an actual bar being struck, and load this sample in memory. When a strike event occurs the sample is then sent to the audio hardware. But in this approach the sound will always be the same, no matter how hard or soft the user strikes the bar. One solution would be to record a set of samples for various levels of impact. A t the impact event the sample that corresponds best with the actual force will be rendered. Another solution would be to change the amplitude of the sample in real-time before rendering it, to correspond to the volume of the sound for a particular strike force. This is generally such a low cost operation that it can be done in software even on very modest equipment. This is a very simple example of manipulating a sound by processing it with a parameterized "effect", in this case the volume. A single "prototype" sound is changed in real-time to create many sounds. This is also the simplest example of parameterized synthesis. Synthesis by physical modeling is presently a very active topic of research in computer music [20, 67], but the name is somewhat misleading. The focus of most musical instruments research is not so much on the physical modeling, but in parameterizing sound with physical events. The parameters of interest to musicians for a simulated wind instrument are breath pressure, finger holes, speed of closure of finger holes, mouth position (for flute), and tongue position (for ney). A "physical model" of a wind instrument in this context is a synthesizer that can create sounds that are parameterized by this set of physical characteristics. In this case the parameter is the force of the impact, which is translated into audio volume. One may argue whether this deserves the name "synthesis", but it should be clear that there is a continuum between "samples with effects" and "synthesis". In this case we are synthesizing the volume, using a sample for the basic waveform. When using more complicated effects such as 3D spatialization and reverberation effects, we add more synthesis, but still use some recorded data. However, by doing this we have sacrificed some level of realism, because we assume that the only effect of the force of the impact is a change in volume. This may be true in a first approximation, but non-linear effects cause the "timbre" of the sound to change also with the strike force. So we have sacrificed some level of detail by reusing a sample. Let us continue with the example of the virtual bar. What if we allow the bar to be moved or allow the user to listen from different locations? This could be incorporated in the synthesis by filtering the output of our synthesizer through a set of HRTF filters (see 2.3). This can be done with off the shelf hardware or software. It can be thought of as an independent "multiplexer". We now have three parameters: volume, azimuth, and elevation. (We are ignoring room acoustics here.) What if there were multiple bars in the scene? If the bars differ only in length, their sound spectra are approximately related through a simple scale factor in time, i.e., we can change the "pitch" of the stored sample depending oh the length of the struck bar. Many digital sampler synthesizers emulate musical instruments based on this principle. Typically, a group of tones will be represented by one sample, which is 7 Figure 1.3: Interactive environment: a hammer hitting a bar at different locations. played back at various rates. Because the timbral characteristics of instruments vary a lot over the entire range of the instruments, the recorded sample can only be transposed (scaled) to a certain extent, before it starts to sound "unnatural". So far, we have been successful in synthesizing the sounds of different sizes of bars, at different spatial locations, which can be struck with any force. A l l these sounds can be obtained from a single recorded sample. What if we strike the bar in the middle, and then near an end, as depicted in Figure 1.3? The timbre of the sound should change, depending on the strike point, because this will excite different resonance modes in different proportions. The same effect is well known in music; for example a violin will sound much more nasal when bowed near the bridge (which supports the strings). A linear model (developed in detail in this thesis) predicts that the intensities of the partials (constituent sinusoidal waves) of the sound change as the strike point is changed. However, their frequencies remain the same, because the same vibration modes are excited. To deal with this, we can record some samples of the sounds corresponding to various impact locations, and play back the closest one, or do a linear interpolation between two. 2 This could be called "sound morphing". It can be done in the time domain because the spectral content of the two sounds is the same. 2 8 But what if we now want to have different hammers, made out of metal, wood, and rubber? The sound will be different when hitting the bars with these different sticks. Should we now also pre-record the samples of the bar being struck by different sticks? If we want to have 10 different materials, 10 impact locations, and 10 different length bars, we would need 1000 samples! This clearly pushes' straightforward sample playback to its limits. The sample based approach does not require much modeling, and allows the use of recorded sounds, which will always (or at least for a very long time) be better than synthesized sounds. On the other hand, since recorded samples are not available to everyone, and are difficult to obtain, it restricts a designer of a virtual reality environment. A synthesis approach has the advantage that no external data/ is needed, but it also requires detailed physical modeling. Especially for complicated, structures, not enough data may be available to model the vibrations accurately. However, in such a case one can use an experimental approach, and use recorded sounds to extract the relevant synthesis parameters empirically from real objects. If we also want realistic continuous sounds when we scrape the bar with a stick, with velocity, direction, and speed under real-time user control, we can no longer use sample playback directly. One can imagine looping a sample at different speed/volume depending on the interaction, and this has been tried with limited success, but this does not take into account the resonances of the touched object. For these types of sound we need to synthesize the sounds in real-time. 1.3 Contributions of this Thesis The work presented in this thesis is the first coherent effort to develop a physics based real-time audio synthesis methodology for environmental sounds caused by 9 interacting solid objects. We develop a real-time synthesis technique specifically tailored to the synthesis of sounds of solid objects under external forces. We address model rendering (audio synthesis), model authoring, interaction modeling, and software and system design issues. Using the techniques developed here, it is possi- ble to generate realistic, interactive audio in simulation or games applications using only a small fraction of available C P U power on modern personal computers. The remainder of this thesis is organized as follows. In Chapter 2 we discuss . general background material relevant to audio synthesis and environmental-sound simulation, and we describe related work in this field. In Chapter 3 we summarize the physical laws which govern acoustics and ,. which are the foundation upon which audio synthesis by physical modeling is built. In Chapter 4 we derive a physics based parameterized vibration model for physical objects from the linearized vibration equations for solid bodies. A real• • .time synthesis algorithm is developed which allows the synthesis of the sound of such objects under any kind of interaction force. In Chapter 5 we investigate methods for obtaining the parameters of the vibration models from mathematical models of simple geometries, by automatic parameter fitting to measured data, or by empirical means. Studies were conducted of several objects which were successfully modeled with these techniques. In Chapter 6, models for interaction forces during contacts between rigid bodies (impulsive forces, and sliding interactions) are developed, as well as models for the driving forces for combustion engines and avalanches. In Chapter 7 we describe the software tools that were built to implement the methodology developed. A n object-oriented application programming interface (API) is developed to interface the audio synthesis to user code, and implementations 10 in C + + and in Java are described. Several computer programs were developed for the testing of models, for the construction of models, and for the demonstration of the level of realism that can be achieved with this type of synthesis. In Chapter 8 we present our conclusions and outline directions for future research. Some technical details such as the numerical results of parameter fitting, and file formats are relegated to the appendices. It is hard to describe audio verbally or graphically, and many of our results should be heard in order to be appreciated. We have made several audio examples as well as some interactive applications available online on [4]. This material is also available on the accompanying C D . 11 [ Chapter 2 Background In this Chapter we will discuss some general topics that are relevant to sound simulation, and describe related work in this field. The topics considered are • Sound perception; • Digital representation of sound; • Environment modeling; • Computer Music. 2.1 Sound Perception To simulate sounds, we need to know some things about how humans perceive sounds, since we do not want to waste effort on aspects of sound that are not important perceptually. Knowledge of what is perceptually important may focus sonic modeling on relevant phenomena. The human ear can detect frequencies in the range 20 — 20,000 Hz and is most sensitive in the 500 — 6,000 Hz region. The threshold of hearing is around ldB 12 (see Section 3.3 for a definition of dB), the threshold of pain is at about 140^5.: A normal conversation at a distance of 1 meter is about 50dB. The sensitivity of the ear at various levels of loudness can be depicted in sensitivity curves, see for example [73]. When a sound is heard, various attributes are usually associated with it by the human brain. For example one may hear it as a liquid sound, or as a distant animal call, or as a metallic percussive sound coming from a specific direction. An attempt at classifying sounds into categories (like liquid, metallic, etc.) was made in [30]. The study of how sounds are processed by humans and grouped into auditory streams, is called auditory scene analysis, and there exists a standard work on the subject [15]. Considerable recent attention has focused on the localization of sound in three dimensional space, i.e., on how we perceive sounds as coming from a specific direction. A good review of the subject is [11]. In [94] psychophysical experiments are described which are used to investigate how inter-aural time delays provide cues for left-right localization. A good review of psychological aspects of sound localization by humans can be found in [50]. For wavelengths greater than the size of the human head, the horizontal position cues come from a phase difference in the waves at the ears. These waves don't "see" the head and are therefore unaffected by it. Smaller waves are scattered by the head and by the pinna. The primary cue for position at the onset of sound with small wavelengths comes from the inter-aural delay time, and the manner in which the sound is altered (filtered) by the head and the pinna. The brain has learned to correlate these scattering effects with the position of the source. The vertical position of a sound can also be perceived, though not as accurately. Cues 13 for this come entirely from the filtering of the sound through the head and pinna. Percussive sounds are often perceived as having a distinct "onset", and a sustained part. Localization cues obtained at the onset tend to be important. The reverberations of the room also have a role in sound localization. Without reverberations it is harder to localize sounds. Different frequencies decay(attenuate) at different rates in the atmosphere. This is why sounds coming from a great distance can be heard as such, as they sound "muffled". Note that the propagation of sound through the atmosphere is strongly influenced by the weather, giving a distinct aural "feel" to various kinds of weather. Blind people report that they can "feel" the presence of objects like a wall; as a pressure on the face. I can reproduce this sensation also to some extent by moving my head close to a wall in the dark. What actually happens is that small sounds like that of your breathing reflect off nearby surfaces, which is detected by your ears but somehow perceived as a kinesthetic sensation. 2.2 Representation and Rendering of Computer Sound Various techniques exist to record and reproduce sound. For example, the sound can be stored as a physical image of the vibrations in a gramophone, or as a pattern of magnetization on a tape. These are analogue techniques, as they map pressure variations directly to some other physical characteristics. It has even been speculated that sound can leave imprints by accident, for example on drying paint, and might be reconstructed [93]. A different method of storing sound is by digital sampling. The signal is sampled at some rate and encoded as a set of numbers, which are typically stored on some electronic device. Digitally stored sounds can be manipulated algorithmically 14 1 and recordings can be duplicated without any loss in quality. A good introduction to digital audio techniques is [72]. The signal at each ear will be represented by a scalar function of time which, apart from a multiplicative factor, specifies completely the perception at the ear. Two such signals would give a complete description of a sound as perceived by a human. There are also phenomena which are usually classified as "sound" that are perceived not directly through the ears. For example, the deepest tones of a large church organ are "felt" in the chest cavity. The signal at the ear could represent the pressure at the entrance to the ear canal, the deviation from equilibrium of the eardrum; or perhaps other relevant parameters. How this function is precisely to be translated into a physical rendering of the sound, with the aid of speaker, headphones, or other means, will depend on the setup of the speakers or on the headphone characteristics. But it should be clear that in principle any sound can be created in this manner. With loudspeakers we have the complication of "leaks" between the speakers. A simple minded setup with a speaker for the left ear and a speaker for the right ear will not work, as the right ear will also pick up the signal from the left speaker. But one could for example cancel out the "leak" from the left speaker to the right ear with an additional signal from the right speaker that had the opposite phase. Unfortunately, this requires a knowledge of the position of the listeners ear. Various techniques, sometimes called "Surround Sound", "Dolby Surround", or "3D sound", exist to partially overcome this problem. The signal function is represented by a digital sample. The values of the function are stored at time intervals of I/SR, where SR is the sampling rate. The values can be given as floats or as 16 or 8 bit signed integers. 15 The method of sampling is usually indicated by specifying the sample width, which is the number of bits used to represent the value, and the sampling rate. To represent sounds with frequencies up to / Hz, the Shannon Sampling Theorem [61] tells us that we must have SR > 2 / , for an accurate representation of the function. If SR < 2 / , frequencies above / will "fold back" into low frequencies and produce distortions. The human ear can perceive frequencies of up to 20,000 Hz, [73]. A common sampling rate for sound , (used also for commercial C D quality recordings) is 44,100 Hz. The lowest sampling .rate in common use is 8,000 Hz, which is used often in Internet applications. A wide variety of file formats is available for the storage of digital samples. Some widely used formats are au Used by S U N and currently the only format supported by Java. (Java version 1.1.) This format stores the wave form logarithmically. wave This format can use a variety of encoding techniques, but most commonly is used with linear encoding. It is used on P C ' s . A I F C The format used by SGI. A linear encoding with a very complicated header structure with descriptions of looping etc. This is an elaboration of the older A I F F format. I F F The format used by Amiga. Most computers have some hardware to play a stereo sample in real time i.e., to output an electric signal that can be used with speakers or headphones to 16 ! generate sound. The sample will typically be located at some memory location of the computer. Depending on the available hardware, it is possible to dynamically manipulate a sample and change certain aspects of it in real time. Many things can be done in software, but often one uses special D S P hardware to do the signal processing in real time. Examples of such processors are the readily available "reverb" units, which add simulated room acoustics and other effects to an input signal in real time. Also emerging are "spatialization" engines, which modify a signal to position it at a specified location in space. With increasing processor speeds there is a tendency to assign more and more D S P tasks to the main C P U . 2.3 Environment Modeling Suppose we have a virtual reality environment in which we want to compute the sound perceived by an observer. That is, we want to generate a signal that drives speakers or headphones that will lead to a correct perception of the sound. In principle one should model the sound generating objects, compute the resulting sound field at the eardrums of the simulated observer, and generate a signal at the headphones or speakers such that the eardrum of the human subject is exposed to the sound field computed. To make this a somewhat manageable task, it is usually assumed that the sources of the sound are sufficiently far from the observation point that we can approximate the acoustic wave by a plane wave. We can then replace the sound source with a suitable point source. We can then divide the simulation in three parts: source modeling, sound propagation (reverberation), and sound reception. By "reception" we mean the scattering of the sound waves around the head and 17 ears, which needs to be simulated if we want the sound to appear to come from definite spatial locations. The vibrational modeling to compute the point source is discussed in Sections 3.5 and 3.6. However, we can also record a sample of a real sound. This is an approach often taken, as more is known about modeling the propagation (reverberation) and reception (the scattering around the head and ears) than about modeling the sound production. Reverberation modeling involves computing reflections of the sound from surfaces along the path from the source to the observer. For this we need to have a good model of scattering of sound from materials. Some of the relevant physics will be discussed in Section 3. Off-line computation of acoustical properties of performance halls, in the context of graphical visualization techniques was investigated in [74]. When the sources and observers are not stationary, the Doppler shift changes the rate at which the wave arrives at the ear, which can be simulated by changing the effective sampling rate of the stored sample. The filtering of the sound at the ears and the head, which is direction dependent, can be summarized by a set of direction dependent filters that simulate the ears. These are called Head Related Transfer Functions (HRTF) and have received much attention recently. The basic idea is to measure the sound inside the ear for a standard source (usually white noise) positioned at various locations with respect to the subject. The relation between the source and the measured sound is then extracted as a finite impulse response filter (FIR) for every direction. These are the H R T F filters. On playback, a given "dry" sound can then be spatialized by convolving it with the appropriate H R T F . A review of the scientific and technological issues of auditory displays can 18 be found in [26]. Takaia and Hahn introduced the concept of sound rendering for computer animation [80]. They associated a characteristic sound with each object which could then be rendered after filtering the sound to model the environmental effects. Recently [31] they proposed "Timbre Trees," which, like shade trees, provide a scene description language for sounds. In [17] complex sounds such as breaking events are: analyzed and decomposed into individual components which can then be reconstructed with adjustable parameters to obtain: a form of parameterized synthesis. 2.4 Computer Music Computer music has been around for a long time, but has traditionally been more concerned with imitating the sounds of musical instruments, or with the creation of totally new sounds, than with the reproduction of everyday sounds. Surprisingly little can be learned from computer music that can be used in the synthesis of simple sounds such as emitted by an object of known material and geometry being struck or scraped. Several synthesis techniques have been developed or are currently being investigated, which have some relevance to the synthesis of natural sounds: A comprehensive overview of musical synthesis techniques is given by Herbert Janfien, on the W W W site [7]. As this material is not published anywhere else and may be removed from this W W W site, we include a compilation of relevant material from that source with the author's permission in Appendix C . ' In [36] audio synthesis is discussed from a more theoretical point of view and synthesis techniques are analysed using various criteria. Many of the synthesis techniques described there are based on specific al19 gorithms or specific hardware configurations to create periodic oscillations. Some methods are general enough that they could be used to generate sound for some of our purposes. One such technique is additive synthesis, which builds a complex sound from sinusoidal components with controllable amplitudes. Unfortunately,, there is no obvious way to compute the amplitudes of the sinusoidal components of contact sounds, except for struck objects, where they decay exponentially. Sample playbackxs i! currently the only technique used for environmental sounds in games and virtual reality. It has the capability of producing "photo-realistic" sounds, but with little or no interactivity. For discrete sounds such as impacts this is an excellent method. Granular synthesis : [82] superimposes many short fragments of recorded sam- ples in a controllable stochastic manner. It has some applications in the generation of crowd sounds in computer games and seems to be well suited for other sounds which involve a large number of events, such as rain. Waveguides [67] provide models of one dimensional structures such as vibrat- ing strings and air columns. Because such structures have a harmonic spectrum of resonance frequencies, waveguides (a.k.a. comb filters) provide a very efficient means of synthesis and are used for musical instruments. The disadvantage of waveguides is that one has no control over the bandwidths and amplitudes of the resonances. This synthesis method is not suited for the sound of vibrating solid bodies, which usually have a non-harmonically spaced frequency spectrum. Modal synthesis [52] has been used for percussive instruments with a non- harmonic frequency spectrum such as marimba and bells. Because each resonance requires additional computation, this synthesis technique is not well suited for har- 20 monic musical instruments which usually need many resonances. However it is very well suited for our purposes, and in the following chapter we shall show that this synthesis technique can be derived as an implementation of a real-time solution to the linearized vibration equations for solid bodies. Several research groups are actively pursuing the physical modeling approach to music: • The Center for Computer Music and Acoustics, C C R M A , at Stanford is developing physical models of musical instruments. They use mainly wave-guides for their modeling, but they also have done some work on modal synthesis. Musical instruments are modeled as filter banks to which an appropriate stimulus is applied. A n overview of work at C C R M A can be found on their W W W site [1]. • The Center for New Music and Audio Technologies, C N M A T , at Berkeley is developing a synthesis technique based on additive synthesis. A sound is described in frequency space where a number of dominant partials are identified with a given time evolution. A n efficient algorithm based on the F F T allows this description to be inverted in real time for sound synthesis [24]. A n overview of the research at C N M A T can be found at the W W W site [3]. • The Institut de Recherche et Coordination Acoustique/Musique, I R C A M , in their Modalys program has investigated synthesis techniques using modal models. Their interest is primarily in the synthesis of musical sounds and in the specification of control algorithms, rather than in real-time techniques. Their web site is [6] and their scientific publications are available on-line at [2]. 21 Chapter 3 Physics of Sound In this chapter the laws governing the propagation of pressure waves in gases and the interaction of solids with these waves will be derived from fundamental physical laws. This allows a clear understanding of the nature of the approximations involved in the derivation, and thus shows under what conditions these simplified laws are valid and where they might break down. We will not consider any conditions under which the simplest linear wave model of sound in air breaks down in this thesis, but this is a possible direction for future research. The material in this chapter indicates how the construction of more complicated equations for pressure wave propagation might proceed. It provides enough background material to allow the reader to follow such a derivation. Material in this chapter can be found in many standard works on acoustics, see for example [13, 54, 45, 81]. We follow the notation of [81] here. In the last section of this chapter we will give a brief introduction to the theory of vibrating solids. 22 3.1 Sound Propagation For sound propagation through fluids and gases we shall use the continuum model for fluids. Such a model will be applicable when the Knudsen number is small enough. The Knudsen number is defined by Kn = A / / , (3.1) where A is the mean free path, which is the average distance that a molecule travels between collisions with other molecules and / is the length scale of the phenomenon of interest, such as the wavelength of sound. For example, a sound wave with a frequency of 20,000 Hz (at the limit of the audible frequency range) has a wavelength of 1.7 cm. The mean free path in air is 5 . 9 5 1 0 m . The Knudsen number is 3.5 X 10~ , so the continuum model is 6 _8 applicable in this case. 3.2 Fundamental Equations for Gases We shall present the fundamental equations that govern the propagation of sound. Derivations and more detailed discussions of the thermodynamic background can be found in [81]. The local state of a simple fluid (whose composition is uniform) can be specified by two quantities. For example, at low pressures all gases obey the ideal gas law, p = RpT (3.2) where R is the gas constant, T the absolute temperature, p the mass density of the gas, and p the pressure. The local state of the gas can be described by p and - 23 T, or by p and p, etc. For gases in motion, we can formulate the law of conservation of mass in terms of the mass density field p(x,t) and the velocity field u(x,t) as (3.3) To derive equations of motion for the gas we must specify a model for the forces and stresses acting on an infinitesimal volume of the gas. External forces on the gas are described by a vector field F, which represents the force per unit volume acting on the gas. Stresses are described by the symmetric stress tensor cr of rank two. The component <r,-j is interpreted as the i-th component of the force per unit area acting on an infinitesimal surface element perpendicular to a vector in the j direction. For a Newtonian fluid, the fluid is governed by the Navier-Stokes equation plus thermodynamic equations. Most gases and simple fluids can be accurately described by the Newtonian fluid model. Fluids with complicated molecular structure can not, but are not considered here. For a Newtonian fluid, the stress tensor is given by o~ij — P&ij where S{j is the unit tensor. The deviatoric The quantity P is called the mechanical ~\- (3.4) dij, stress tensor dij pressure. is traceless (i.e., da = 0). For small velocity gradients, the deviatoric stress tensor is given by d^ = 2p(eij - -ASij) (3.5) where p, is the coefficient of viscosity (also called shear viscosity) of the gas and e;j 24 is the rate-of-strain tensor, given by l.dui e duj - = 2 f e ^ + ( 3 - 6 ) The local rate of expansion A is defined as A = e„-. (3.7) The difference between the mechanical pressure P and the thermodynamic pressure p is related to the local rate of expansion through p-P where p is the expansion coefficient v = pA (3.8) v of viscosity. For reference, we will now give the complete set of equations for a Newtonian gas assuming that temperature variations are small and that p is constant. v 1. Continuity: | + ^ ) = 0. (3.9) 2. Navier-Stokes: + = ^ (3.10) + A/*+ 3. Energy: VS Pi VT_0TVP p P i p 1 8 pdx* Pi BT y dx ' 1 & p K ' 4. State: p = p(p,S). 25 (3.12) Here we have defined £=4+ Vt dt dx' v (3.13) ; The viscous dissipation function $ is defined by d> = ^ ( ^ - I A 2 ) . (3.14) P The specific heat is denoted by c and A; is the thermal conductivity of the gas. The p coefficient of thermal expansion, (3, is defined by with v* = 1/p. (3.16) For the important case of an ideal fluid, the stress force is assumed to be perpendicular to an infinitesimal surface element, and to have the same magnitude independent of the direction of the normal to the surface element. In this case the equations for the gas can be written as and where c is the local speed of sound. Thermodynamic considerations lead to c (Ip) ' '^ 3.3 Linearized Equations for Acoustic Waves 5 n e S r a v 2 = i t y vector is given by g\ For acoustic waves in an ideal gas we linearize Equations 3.17 and 3.18. We consider small fluctuations of pressure p and density p around the background values p and 0 26 po. If the length scale for variations in velocity is much smaller than Cg/g, with Co the velocity of sound and g the absolute value of the gravity vector, the linearized equations are " 1 + ^ P = 0, Po{x)cl(x) dt dx du 1 dp % + - ^ 7 dt po ox (3.19, l , = 0. 1 n . 3.20 For uniform gases we can rewrite the linearized Equations 3.19 and 3.20 in vorticityu), a different form. First, observe that the .dv? dx hi is conserved. e^ k k v is the antisymmetric rank-3 tensor (with e 123 ; = 1). If we assume that the gas is at rest initially, we will have UJ = 0. For velocity fields with zero vorticity, we can write «' = |4 ox 1 where (j) is the velocity potential. (3^) The acoustics equations can now be formulated in terms of a single wave equation for <f>, - ^ with = + c W=0 2 (3.23) The pressure fluctuation p can also be obtained from the velocity potential by P=~Po^- (3-24) The energy density of the acoustic field is given by E=E + E k p (3.25) where the kinetic energy is given by E k = ipouV 27 (3.26) and the potential energy is given by 1 2p c'o 0 In terms of the velocity potential c/>, this becomes 1 d<p dcf> ^ - ^ c V c V ( 3 - 2 8 ) and The energy-flux vector (the energy flow per unit area) is given by la P = ( U 3 - 3 0 ) and its time average (3.31) I=<q > a is called the acoustic Its dimension is power per area, e.g. w a t t / m . It intensity. 2 is customary to express the acoustic intensity in decibels (dB). For this we define a reference level I ref = 10 - 1 2 watt/m . 2 (3.32) It corresponds to the lowest intensity at which a sine wave of 1000 Hz can be heard by the average human. The intensity level in decibels is given by IL = 1 0 1 o g ( / / / ) . 10 re/ (3.33) We note that the linearized equations for acoustic waves can not deal with dissipation. For this we must use the Newtonian fluid model, described in Section 3.2. For sound waves in air, the decay rate is proportional to the frequency. The amplitude decay over 1000 wavelengths (340m at 1000 Hz) is only 1-2% in air, so it can be neglected in many applications. 28 3.4 Interaction of Acoustic Waves with Solids Sound waves are reflected (scattered) by solids. Vibrating solids generate sound waves. Both phenomena require an understanding of the fluid dynamics at the boundary between the two media. When a sound wave hits a surface, the surface will move in response. If the surface is part of the boundary of a solid there will be waves passing through the solid, and the solid will emit sound from its entire boundary. However, in many situations the propagation of sound inside the solid can be ignored and we have > to deal only with reflection and absorption at the surface. How much of the wave is absorbed and how much is reflected depends on the material properties of the' surface. Since in general we do not know the dynamical equations of the surface, some sort of phenomenological model is necessary. The specific acoustic impedance of a boundary is used to characterize a surface as follows. We assume, that the velocity of the boundary depends on the pressure variation of the air at the boundary'with some time lag. For a monochromatic incident wave we assume the relation (3.34) u =p/z , n a where u is the surface normal component of the velocity of the surface and p is n the pressure fluctuation at the boundary. The specific impedance z depends on a the frequency of the incoming wave and is complex in general (inducing a phase difference, i.e., a time lag, between the surface response and the pressure). Some classical topics that can be solved analytically are the sound field in a piston driven tube and transmission of waves through tubes as a function of the tube diameter. An example of the latter is the human vocal tract. The vowel sounds 29 used in human speech are produced by changing the shape of the vocal tract with the throat and tongue, thereby changing the transmission of the sound radiated by the vocal chords, see [21]. The wave equation (Equation 3.23) has been studied thoroughly. The study of spherical waves leads to an exact solution for the sound field of a pulsating sphere. The sound field of a bursting balloon can be computed analytically. This is an important topic, because one may consider a small pulsating sphere as a "pointsource" of sound leading to a field theory approach of sound. Such a point source plays a similar role as the point charge and point mass do in electromagnetism and gravity. Just as a finite body can be considered as an infinite collection of point masses, so a complex sound source can be considered as an infinite collection of point sound sources. The advantage of this approach is its similarity to the well studied fields of electromagnetism and gravity. 3.5 Sound Generation Vibrating bodies generate sound waves through boundary conditions imposed on the wave equation (Equation 3.23) at the surface of a solid body. For an ideal fluid the boundary condition is u.n = U ij.n with n the surface normal and U b (3.35) the velocity of the surface. However, real gases "stick" to the surface, in which case we have u = U. b (3.36) For an ideal gas, the condition given in Equation 3.35 has to be used, as viscous effects are not taken into account in the ideal gas model. 30 A particularly useful formulation of the equations governing sound generation is the formulation in terms of Green's functions. A n exact computation of the far field generated by a pulsating sphere leads to the following simple source potential for a point source (i.e., a source of volume) that is pulsating (with frequency.u) with k = u/c 0 and r the distance from the source (which is located at the origin). The quantity Q 0 represents the volume flow out of a small sphere enclosing the source. The intensity of the acoustic wave described by Equation 3.37 is (3-38) and the total acoustic power is = ^ M . n (3.39) The solution 3.37 can be viewed as the solution of the inhomogeneous wave equation + c V cf> = clQ(x,t) --^ 2 (3.40) 2 0 which should be compared with Equation 3.23. The general solution to Equation 3.40 can be given as 1 = r Oix / t - \ ~ '\) x x (3.41) jf—!-dV(x). \X — X \ 47T JV A n important case is a surface distribution of the source Q. For a plane, the result is ik\x-x' \-iwt \x-x'\ e ^ t ) = -Yj s f r fa Tn where | ^ is just the normal velocity of the surface. 31 d M x ) ' " (3 42) Unfortunately, Equation 3.42 can not be used directly to compute the sound emitted by a vibrating polyhedron. The problem is that the sound emitted by one facet of the polyhedron is dependent on the overall shape of the polyhedron. Some special problems can be analyzed analytically. A n example is the sound field of an oscillating piston in an infinite wall. The problem of interest for simulating the sound field of vibrating solids leads to an integral equation for <f>. This equation, known in slightly modified form as the Kirchhoff-Helmholtz equation [46] plays a fundamental role for an analysis of the emission of sound by vibrating bodies. It is derived in a particularly clear way in [81]. The end of [27] is devoted to a discussion of problems and numerical techniques associated with solving the Kirchoff-Helmholtz equation. equation is that it breaks down if there is resonancein One problem with the the system. A t certain driving frequencies the amplitudes of the waves become infinite (as we have not incorporated dissipation in our model) and the solution of the integral equation diverges. resonance, dissipation must be modeled. At In this case the amplitudes often grow large enough that nonlinear effects become important. In such a case, adequate physical models are often not available. There is a section in [81] on nonlinear effects. Nonlinear acoustics is an important field of research, see for example [12]. The study of nonlinear waves (see for example [90]) is anextensive research field. We will now state the Kirchoff-Helmholtz equation. The free-field Green's function (which is essentially the sound field of a harmonic point volume source with frequency u) is defined by Jk\x\ G{xM = j-r,-r (3-43) where k — OJ/CO. We consider a region V bounded by surface S, which does not have to be 32 connected. We can think of S as consisting of an outer boundary like the walls.of a room and some boundaries of objects in the room which produce sound. Let n(x) denote the outward normal (i.e., pointing into the region V) at point x on S. We assume that every point y on S is oscillating harmonically with frequency to and velocity v(y). The velocity potential is written as <j> ( )e -%L. <f>(x t) = r (3.44) iut u x t The time independent potential (f> satisfies u &,(*) = J p ( y ) ^ G ( x - y) - Mx)n k s {y)- d S(y) 2 dy k (3.45) where n(y) is the outward normal at point y on S, and d S(y) is an infinitesimal 2 surface element on S. The boundary condition for an ideal gas, Equation 3.35, gives us the normal derivative of <f> on S, which appears in the right side of Equation 3.45, n (y)^f L k = n(y).v(y) (3-46) with v(y) the velocity of S at y. We can consider this a given quantity (computed from a model of the physics of the vibrations of the surface S). The value of (f>ui{y) on S is related to the pressure fluctuation Pu(x) (in the frequency domain) through PUJ(X) = -ipou<j>u{x) (3.47) which follows from Equations 3.24 and 3.44. If the pressure on the boundary was known, Equation 3.45 would give us the acoustic field on V . By restricting x on S in Equation 3.24, and dividing S in n elements, we formally obtain a system of n equations for the n values of p on S. However, there are some problems with this approach. 33 Observe that if x lies on S in Equation 3.45, the Green's function defined in Equation 3.43 is not defined for x = 0. In this case one should take the principal .values of the integrals at the singular points. This entails a regularization of the Green's function, by defining G (x) = 0 for |x| < e and letting e —» 0 at the end. c If the volume V is finite, there may be no unique solution for the pressure on the surface, for some values of the frequency. resonance The physical reason is that at the acoustic field grows without limit. In this case a separate analysis is required to identify the resonance frequencies. We refer to [27] for more details on numerical approaches to the Kirchoff-Helmholtz equation. 3.6 Vibration Theory Vibrating bodies are sources of acoustic waves and produce sound through the mechanism described in Section 3.5. In this section, we describe the equations governing their behaviour. Suppose the configuration of a vibrating system can be described by a vector q(i) of n real numbers. A t equilibrium, without motion, we assume q — 0. A typical: example is a set of rigid bodies connected through (damped) springs. A solid body such as a bar can be thought of as an infinite collection of infinitesimal rigid bodies connected through generalized springs. They can be approximately described by a discrete set of n coordinates q, provided n is large enough. Another route to this conclusion is to note that the solution of the partial differential equation for the vibration of a bar can be written as a discrete sum of basic functions (a generalization of a Fourier transform series), which can be interpreted as the q in the above. The vibrating system, for sufficiently small amplitudes q, is described by the 34 linear equation of motion Mq + Cq + Kq — F (3.48) where the dots denote differentiation with respect to time and F(t) is the external force on the system. The matrices M, C, K, and the vector F can be found for a given system by various means, depending on the system. A case of particular interest to us is a system of many Finite Elements [58] that approximates a given solid. The continuous system is divided into a number of elements, each with appropriate elasticity properties and degrees of freedom. The elements are then connected to approximate the continuous object. The degrees of freedom corresponding to the finite elements obey an equation of the same structure as Equation 3.48, in the linear approximation. The last Chapter in [66] discusses Finite Element methods in vibration analysis. We solve Equation 3.48 by writing / (3.49) y = The equation of motion 3.48 becomes (3.50) y = By + r with B 0 1 ^ -M~ K (3.51) -M~ C 1 X ) and \ 0 (3.52) r — \ M~ F l ) The solutions to Equation 3.50 with F — 0 are y = ae 35 lit (3.53) where the eigenvectors a (of dimension 2n) and the eigenvalues /J, satisfy (B - fil)a (3.54) = 0. There are 2n eigenvalues, each representing an oscillation with frequency fi/(2n). The reason we have two solutions for every degree of freedom is that there is also a phase associated with this degree of freedom. In the presence of an external force F the solution to Equation 3.50 can be written as y = T(t)y +f T(t-T)r(r)dr where the state transition matrix (3.55) t 0 Jo T is given by -l (3.56) where (3.57) Ti = 0 and the 2n x 2n modal matrix * 0 is given by = [A, A 2 (3.58) »-2n with A i the i-th eigenvector a, i.e., a solution of Equation 3.54. y in Equation 3.55 0 is given by the initial conditions on the system. We conclude that a vibrating system can be characterized by a discrete set of eigenvalues which correspond to the natural frequencies and their associated decay rates. When the system is excited by an external force of finite duration some of 36 these frequencies will be excited. The relative amplitudes after the application of the force depend on the nature of the excitation. When an object is struck, the applied force is of very short duration and contains many frequencies. Very shortly after the strike, many frequencies of the system will be excited, but most decay very rapidly, leaving only a few. The object will be perceived as having a definite "pitch" if one frequency decays much slower than the others, such as is the case for a tuning fork, for example, or if the object has a harmonic spectrum, as is the case for many melodic percussive musical instruments such as the piano. The sound emitted when an object is struck is perceived as containing a "click" of very short duration, which is a mixture of many frequencies, and a sustained part, which contains only a few frequencies. Continuous systems such as bars and plates lead to partial differential equations. For some simple systems exact solutions can be found. The classical problems that can be tackled analytically are the vibrating string, the rectangular and the circular membrane and plate, and bars, under various boundary conditions. 37 Chapter 4 Contact Sounds In this chapter we investigate the computation and rendering of sounds emitted b y solid bodies in contact. We will construct a parameterized vibration model that' lends itself to real-time synthesis. Some of the material in this chapter has been published previously by Dinesh Pai and the author in [84]. The computation of sound emitted by an object to which a time dependent driving force is applied can be done in principle as follows. 1. Formulate the equations of motion of the object under external forces. In general, this will be a partial differential equation. 2. Solve the resulting system of equations in the presence of the driving force. 3. Determine the surfaces that are exposed to air, where sound will be emitted. Using the theory described in Section 3.5, determine the acoustic field at relevant locations. Usually this will be at the ears of the (virtual) observer. Note that we also have to take bounding surfaces such as walls into account. We also have to model the scattering of sound at the pinna (the outer ear), 38 and at the head and shoulders. To make this into a manageable task, we can make a number of simplifying assumptions: • Often the observer will be "far" from the source, so we only have to compute the distant field. • Often we can bypass the entire computation of the acoustic field and just replace the sound source with a point source. This point source oscillates in a manner which is obtained from some combination of the vibrations of the surfaces of the object which constitutes a reasonable approximation to the full emission theory. • Reflections of sound at the walls adds important reality value. Commercially available digital reverb processors are available for this. • Filters that model the scattering at the pinna (HRTF) have been measured widely and are available on the Internet. There are commercial "spatializers" available, that can model the scattering of sound at the pinna in real time. A t the time of writing, low cost soundcards for P C ' s come equipped with hardware support for 3D sound and the free DirectSound3D A P I has built in spatialization support. • The reverberation and H R T F contributions to the sound are, under certain conditions, independent of the emission computation and can therefore be dealt with separately and independently. 39 4.1 Overview Based on material and shape properties, we do a pre-computation of the relevant characteristic frequencies of each object in Section 4.2. In Section 4.3 we then divide the boundary of the object into small regions and determine the amplitudes of the excitation modes if an impulsive force is applied to a point in this region. This is similar to the tesselation of a surface for graphics rendering. The whole procedure is analogous to assigning a color to a surface and rendering it with some shading model. In Section 4.4, we normalize the energies of the vibrations associated with the different impact points to some constant value, and scale them proportional to the impact energy when rendered. In Section 4.5 we discuss the material properties and their effect on sound. The decay rate of each mode is assumed to be determined by the internal friction parameter, which is an approximate material property [91, 43]. In effect, the decay rate of a component is assumed to be proportional to the frequency, with the constant determined by the internal friction parameter. After the preprocessing, a sound parameter map is attached to an object, which allows us to render sounds resulting from forces on the body. We discuss the structure of this map and a possible approach to reduce its storage requirements in Section 4.6. In Section 4.7 we will construct an algorithm to synthesize the sound under any type of interaction in real-time. 4.2 Vibration Modes from Shape We now introduce the framework for modeling vibrating objects. We will illustrate it with a rectangular membrane, but the framework is quite general; we have used 40 it to generate sounds of strings, bars, plates and other objects. The framework is based on the well developed models in the literature on vibration or acoustics, for example [54] - for the calculus involved we refer to [75]. The vibration of the object is described by a function u(x,t), which reprer sents the deviation from equilibrium of the surface, defined on some region S, which defines the shape of the object. We assume that p obeys a wave equation of the form { A - ^ M x , t ) = F(x,t) (4.1) with c a constant (related to the speed of sound in the material), and A is a selfadjoint differential operator, under the boundary conditions on OS. E x a m p l e : For the rectangular membrane we consider a rectangle [0 — L , 0 — L ] spanned by a membrane under uniform tension. For this x y case the operator A is given by A = dx dy 2 2 The boundary conditions are that p(x,y,t) is fixed on the boundary of the membrane, i.e., the membrane is attached to the rectangular frame. We will take the following initial value conditions. p(x,0) = y {x), 0 i.e., the surface is initially in configuration yo{x), and dfi(x, 0) v (x), 0 dt where v (x) is the initial velocity of the surface 0 41 The solution to Equation 4.1, in the absence of external forces, is written as oo fi(x,t) = ^2{a sin(u ct) + n n b c o s ( u > n n c t ) ( 4 - 2 ) 71=1 where a and b are arbitrary real numbers, to be determined by the initial value n n conditions. The u „ are related to the eigenvalues of the operator A under the • ^appropriate boundary conditions (which we specify below), and the functions ^ (x) n are the corresponding eigenfunctions. That is, we have (A + u )V (x) 2 n n = 0. (4.3) The spectrum of a self-adjoint operator A is discrete, and the eigenfunctions are orthogonal. Their norm is written as E x a m p l e : For the rectangular membrane, the eigenfunctions and eigenvalues are most naturally labeled by two positive integers n x n and and are given by y ^n n {x,y) x y = sm(nn x/L )sm(Kn y/L ), x x y y and In Figure 4.1, we show the first 9 eigenfunctions on a square membrane. As Equation 4.3 is linear, we can normalize the eigenfunctions \ P ( x ) such n that a n is independent of n, which often simplifies some of the algebra. Using the 42 Figure 4.1: First nine eigenfunctions of a square membrane. 43 orthogonality of the eigenfunctions we can find the coefficients in the expansion given in Equation 4.2 as a n = / Js rf*,, (4.4) (x)d x>, (4.6) ca co n n and Js ct n The time averaged energy of the vibration is given by E = constantX ^^OM)^ J < p k where p(x) is the mass density of the vibrating object. The <> indicates an average over time. If the mass is distributed uniformly, we have oo E = constant x ^ a u>l(a n n + b ). 2 n (4.7) 71=1 4.3 Mode Amplitudes from Impact Location Next we compute the vibrations resulting from an impact at some point p, when the body is initially at rest. The initial value conditions are taken to be y (x) 0 = 0, (4.8) = 5(x-p), (4.9) and v (x) 0 with S(x) the fc-dimensional Dirac delta function. We note that Equation 4.9 is not strictly correct as an initial value condition. The reason is that the expression for the energy, given in Equation 4.6, involves the square of the time derivative of n(x,t). But the integral of the square of the Dirac 44 delta function is infinite. One symptom of this is that the infinite sum appearing in Equation 4.2 does not converge. A mathematically more correct method would replace the delta function in the initial value conditions by some strongly peaking external force function, representing the impact on a small region of the object over a finite region and over a small but finite extension in time. However, this would complicate things quite a bit, and we would gain little in terms of more realistic sounds. Therefore we shall just assume an appropriate frequency cutoff in the infinite sum appearing in Equations 4.7 and 4.2. Typically, we will only use the frequencies in the audible range. For more details and a more rigorous treatment of this problem for the special cases of the ideal string and the circular membrane, see [54]. Using Equations 4.8 and 4.9, and substituting them in Equations 4.4 and 4.5 we obtain the amplitudes of the vibration modes as a function of the impact location as a n = (4.10) ca u n n and 6„ = 0. The energy of the vibration is determined by the impact strength. It will be used to scale the amplitudes of Equation 4.10. The energy is given by E = constant X J2 71=1 where nj is determined by the frequency cutoff mentioned above. E x a m p l e : In Figures 4.2 to 4.4 we show the amplitudes a , graphed n against the frequency of the modes (i.e., u ) for a square membrane n 45 0.16i 0.14 'm 0.12 0.1 o.oe 0.04 0.02 1000 2000 3000 4000 5000 6000 7000 Frequency in Hz 8000 9000 10000 Figure 4.2: Excited frequencies of a square membrane struck near the corner. struck at the points (0.1,0.1), (0.1,0.4), and (0.5,0.4), using a cartesian coordinate system with the origin (0, 0) in the corner and the opposite corner at (1,1). We have taken the lowest frequency to be 500 Hz and taken the first 400 modes into account. We can see clearly that the higher frequencies become relatively more excited for strike points near the boundary of the membrane. In other words, the membrane sounds dull when struck near the center, and bright (or sharp) when struck near the rim. The method outlined above is very general, and allows the computation of the vibrations under impact of any object governed by a differential equation of the form given in Equation 4.1. The frequency spectrum u> and the eigenfunctions *& (x) n n can be computed analytically in a number of cases. In general one has to resort to numerical methods. For membranes, the problem reduces to the solution of the Laplace equation on a given domain, which is a well studied problem. We mention the method of particular solutions [28], which we have adapted for the example of the L-shaped membrane, 46 1000 2000 3000 4000 5000 6000 7000 Frequency in Hz 8000 9000 10000 Figure 4.3; Excited frequencies of a square membrane struck near the middle of a side. 1000 2000 3000 4000 5000 6000 7000 Frequency in Hz 8000 9000 10000 Figure 4.4: Excited frequencies of a square membrane struck near the center. 47 described below in Section 5.1. For plates, the operator A is fourth order, and a more general finite element method can be used. See for example [38]'. 4.4 Sound Sources from Vibrating Shapes After obtaining the frequency spectrum and the eigenfunctions, using methods described in Section 4.3, we have to model the relation between the vibration of the object and the sound emitted. In general the sound-field around a vibration body is very complicated and non-uniform. However, it is clear that the sound emitted can be described as a sum of monochromatic components with frequencies u c, and n amplitudes , which will depend on the location of the observer with respect to the object, as well as on the environment. As a first approximation, we will identify the coefficients with the vibra- tion amplitudes a , scaled with the inverse of the distance to the observer, as the n amplitude decays inversely proportional to the distance. This is not strictly correct, but we argue that it is reasonable as follows. > Consider a vibrating plate. A t some point above the plate, waves emerging from all locations on the plate arrive at this point. Some will be in phase, and some will be out of phase. This interference will depend very sensitively on the location of the observation point. However, in most real situations, the sound will not only arrive directly from the source, but also from reflections from the walls and other objects in the room. The total effect of this is to average out the phase differences, making the sound-field less sensitive to the locations of the listener. As a heuristic, we assume that the intensity (i.e., the energy) of the sound 48 emitted in frequency u n I n , I , is given by n = Ej n Vl(x) = constant x tf* (p). This seems reasonable, as it integrates the intensity of the vibration, but not the phase. This means, we can identify the a^, which are the amplitudes of the heard sound, with the a given in Equation 4.10, omitting the factor a . Note that since n n we assumed that the eigenfunctions are normalized so that the a n are independent of n, this does not matter. Finally, we obtain the amplitudes af as 5 _ ffimpact^n u Q{p)d ' N with d the distance from the sound source, Q ( P ) = X impact i > (AT\\ (p) 2 ( P ( 4 - U j the energy of the impact, and ) , \ i=l with nj. a suitable frequency cutoff. Of course, the are only defined up to a multiplicative constant (corresponding to the volume setting of the audio hardware). For a more detailed treatment of the radiation of vibrating plates, we refer to books on vibration analysis [65, 66, 27]. 4.5 Sounds and Material Properties When the object is struck, each frequency mode is excited with an initial amplitude a{, which depends on where the object is struck. The relative magnitudes of the amplitudes a; determines the "timbre" of the sound. Each mode is assumed to decay exponentially, with decay time 1 nfi tan < 49 (4.12) where <f> is the internal friction parameter. The internal friction parameter is roughly invariant over object shape, and depends on the material only. In [91] a method was proposed to identify the material type from the sound emitted by a struck object, by extracting the internal friction parameter of the material via Equation 4.12. Such a model is also used in [80] to simulate object sounds. Some experiments were reported in [43], where it was concluded that a rough characterization of material was indeed possible. However, the internal friction parameter is only approximately invariant over object shape. See also [87]. To emulate external damping of the object, we add an overall decay factor of e * / ° . - This also allows us to adjust the length of the emitted sound, while T maintaining its "material character", which is determined by (j). So we assume the sound-wave ps(t) to be given for t > 0 (for t < 0 it is zero) by p (t) = e - ' / s T0 y^afe-^ t a n ^sin(27r/ i), t (4.13) i=l • with the amplitudes af given in Equation 4.11, and / j ~ 2TT' with the u determined by Equation 4.1. n Although this simple one-parameter characterization of material works perceptually reasonably well, there is no advantage to restricting the damping coefficients in any way. When dealing with model parameters acquired from measurements we will allow the damping coefficients to take on values independent of the frequencies. In this case we will write the impulse response as p (t) = M5>.-e ''), i=l in s with Qi = u>i + idi, where di are the dampings. 50 (4-14) 70 80 90 100 10 20 30 40 50 70 80 90 Figure 4.5: Ratio of frequency/damping for the first 100 modes sorted by amplitude (left) and frequency (right) of a struck metal vase. Reconstructed with a Fourier window of 4096. We have investigated the extent to which the ratio of frequency and damping is constant in real objects, and found that this relation is only very approximately valid. In Figure 4.5 we plot the ratio f/d for the first 100 modes of a metal vase reconstructed with the techniques developed in Chapter 5. The average ratio f/d is 477, characterizing metal, but still varies considerably, with a standard deviation of 262. In Figure 4.6 we plot the ratio f/d for the first 100 modes of a wooden hockey stick. The average ratio f/d is about 35 which, as an order of magnitude, seems a good characterization of wood. The standard deviation is 22. In Figure 4.7 we plot the ratio f/d for the first 100 modes of a metallic computer tower box. This is an extremely complex object with rattling parts inside, and this seems to be reflected in the great variance in the ratio for this object. In Figures 4.8 we plot the ratio f/d for the first 100 modes of a metal sword. The average ratio f/d is 1005. The standard deviation 949 is very high. 51 100 0 10 20 30 40 50 60 70 BO 90 100 0 10 20 30 40 50 60 70 SO 90 100 Figure 4.6: Ratio of frequency/damping for the first 100 modes sorted by amplitude (left) and frequency (right) of a struck hockey stick. Reconstructed with a Fourier' window of 1024. . • 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 Figure 4.7: Ratio of frequency/damping for the first 100 modes sorted by amplitude (left) and frequency (right) of a struck computer tower box. Reconstructed with a Fourier window of 1024. 52 100 10 20 30 40 50 60 70 60 90 100 10 20 30 40 50 60 70 BO 90 Figure 4.8: Ratio of frequency/damping for the first 100 modes sorted by amplitude (left) and frequency (right) of a struck sword (sword I). Reconstructed with a Fourier window of 2048. 4.6 The Sound Map In the preprocessing stage we first compute the frequency and damping spectrum, and then the excitation spectrum a,- under a suitably normalized impact (i.e., with fixed energy) for a number of locations on the surface. This gives us all the in- formation needed to compute the sound under any kind of external force during the real time simulation. This is somewhat analogous to texture mapping in computer graphics. A n interpolation of the timbre spectrum af between pre-computed locations is also obvious to implement. One may ask how many points on the surface need to be computed. In general, the timbre of the sound changes non-uniformly over the surface. For example, a string sounds "dull" when plucked near the center, and becomes dramatically brighter when excited near the endpoints. In this case, one would need a denser set of points near the ends. Given two sounds Si and 52, a measure d(Si,S2) 53 is needed, that tells us 100 how different sound Si "sounds" from sound S , such that if d(Si,S ) 2 < do, with 2 do a threshold (depending on the individual), S i and S can not be distinguished. 2 Perception of timbre is a complex subject, see for example [15, 51] for a discussion, so we can not expect to be able to formulate such a sound-distance measure easily and accurately. A complication is that a very important distinguishing factor between pitched sounds is the perceived pitch, which is the same in our case, so our timbre measure depends on subtle and poorly understood aspects of audio psychology. As an initial proposal we take the sonic distance d(S\, S ) between two sounds 2 to be d (S ,S ) 2 1 2 = J2S(mH(\og(E}/Eo)) - H(\og(E?/E ))) , 0 2 where E\ denotes the energy contribution of the i-th mode of sound r (= 1,2), i.e., E\ = ( a i ) / , . We take the logarithm of the energy, as the human ear is sensitive S 2 2 to the logarithm of intensity (measured in decibels). The function H(x) is zero for. x < 0, and x otherwise. The constant Eo represents the lowest sound-level that can be heard, so the term H (\og(E![ / Eo) vanishes if E± < Eo- The function S(f) models the sensitivity of the ear to frequency. Without loss of generality we take 0 < S(f) < 1. The function S(f) has to be determined psychoacoustic experiments. We have ignored the "masking" effect, which changes the sensitivity curve of the ear in the presence of other stimuli. One could also argue that the threshold energy Eo depends on frequency. We leave a refinement of the measure d as a topic for future research. In addition to encoding the location dependency of the interaction on the surface, one can also extend this by taking into account the direction of the interaction at a given point. It is also possible to have different coupling coefficients at different locations and orientation around the object, thereby encoding the spatial 54 field of sound around the object. 4.7 Real-time Synthesis The vibration modeling developed thus far in this Chapter generalizes to more general interactions besides impulsive forces. Because the model is linear, any interaction force can be represented as the sum of an infinite number of "impulses", and the resulting equations lead to an efficient real time synthesis algorithm for the synthesis of the sound of an object under any kind of external force. We shall now derive the algorithm, show how it can be implemented most efficiently, and we then show that it can be viewed as a discretization of a continuous time spring-damper system. According to our linear model, the sound produced by an impulsive force of magnitude F at time s can be described by the imaginary part of the complex wave form • y(i)= E n ' a e n n ( '"* ) f f (*- ) '' s F ( 4 / 1 5 ) n where the sum is over the complex eigenfrequencies Q (the imaginary part detern mines the damping of a mode). H(t) = 0 for t < 0 and H(t) = 1 for t > 0. A continuous stimulus force F(t) can be represented formally as an infinite sum of infinitesimal impulses F(t) /•oo = / Jo S(t- s)F{s)ds, where 8(t) is the Dirac delta distribution, assuming the force is zero for negative times. Using the principle of linearity, the output of the model driven by this force can be written as a sum of infinitesimal contributions from each of these impulses: /•oo y(t)= dsTa e ^H(t-s)F(s). n 55 tn Discretizing this equation in time, with sampling rate SR, gives k 1=0 n with y(k) = y{t ) and t = k/SR. This can be rewritten as a recursion by defining k k the functions y {k), n one for each partial. The complex signal is written as a sum of modal contributions y( ) = J2vn(k). k n For the partials y (k) we have n J/n(0) = a F(.0) n and the recursion relation y (k) n = e *y (k s determines the audio signal Im(y). stable. - 1) + n a F(k) n (4.16) As | e « | < 1 , the recursion relation is always t s Equation 4.16 requires 5 multiplications per sample point, which can be' reduced to 3 as we will now derive. To simplify the notation, let us drop the subscripts n, labeling the mode, and define a two-component vector / y = Re(y) X The recursion 4.16 can now be written as y(k) = Ay(k-1) / aF(k) + where C r \ C{ 56 C-> c r J ^ (4.17) with c = e- d/SR cos(w/5 ), d = e~ d/SR sin(w/5fl), r d = fi Im(Q), and u = Re{Q). Let us change variables to z by writing y = Qz. In terms of z, the recursion 4.17 reads z(k) = Bz(k- 1) aF{k) + Q- l 0 where B = Q~ AQ. l Since B has the same eigenvalues as A, this does not affect the stability of the recursion. There are several (equivalent) choices for Q that reduce the number of multiplications to 3, and we choose Q 1 J l - c \ 0 c, \ J Defining C+ = C + yjl - C 2 r and c_ = c, 57 we arrive at the following equation for B, B = (*- -A V 1 C W In terms of the real variables u and v z = the recursion becomes u(t) = c-u(t v(t) = u(t-l) - 1) - v(t - + aaF(t) 1) + c+v(t-l). We have now only 3 multiplications per sample point (as ac,- can be pre-computed), but because the physical quantity of interest, Im(y), Im(y(t)) - is related to v by Civ(t) it appears that we need another multiply per sample point to obtain this.. We can avoid this by multiplying a with c; in the preprocessing phase. Assuming that we have initially silence, i.e., u(t) = v(t) = 0 for t < 0, the linearity of the system guarantees that multiplying the input signal with a factor c- will multiply the output 2 signal with this factor also. We therefore arrive at the following synthesis recursion for the individual modal contribution v(t): u{k) c-u(k v{k) u(k — 1) — — 1) + with 58 v(k c v(k + l) + 1) aF{k) (4.18) The intuitive picture of a set of spring-damper systems driven by an external force is verified by considering the behaviour of u and i; for large SR. In that case, a continuous time differential equation should emerge, which describes a springdamper system under an external force. B y substitution of u(k — 1) = v(k) — c v(k — 1) + in Equation 4.18 we obtain a second order recursion for v(k): v(k + 1) - (c+ + c_)v(k) + (1 + c_c )v(k - 1) = + aF(k). We now make the connection to the continuous time system by expanding v- to second order in 1/SR. If we denote the continuous time limit of v by V, with V(t) = v(tS ) R wherever v is defined, we obtain v(k + 1) = V(t) + V(t)/S + R V"(t)/2S , 2 R (dropping the arguments t) v {k - 1) = V - V/SR c+ = 2- CI/SR - u /S 2 + V/2S ~ 2 R 2 R} d /2S , 2 2 R and c_ = -d/S - R d /2S . 2 2 R W i t h some algebra we obtain V + 2dV + {to + d )V 2 2 = dS F, 2 R which is precisely the equation for a spring-damper system driven by an external force aS F. R In some applications we have found that the quantity u + v sometimes 59 produces a better sound, especially for car engine sounds. The quantity w = u + v satisfies in the continuum limit W + 2dW + (u 2 + d )W 2 = aS {dF R - F). This means that by using w instead of v for the audio signal, we can obtain the same sound (up to a volume factor) as obtained by using v with input signal (dF — F). Because of the derivative of F, this generally gives a brighter driving signal. To synthesize the sound in real-time, we repeatedly compute an audio buffer of length T. The synthesis algorithm fetches the values of the coefficient arrays c+, c_, and a, as well as the external force F for the time interval T. Equation 4.18 is then used to sequentially add contributions of the modes v until all modes have n been added or until a certain deadline has been passed. If the modes are sorted in a decreasing order of importance, this allows for a graceful degradation in the quality of the synthesized sound, when the time available for audio synthesis is not constant. This algorithm is explained in more detail in Section 7.1. 60 Chapter 5 Creating Vibration Models The linear vibration model presented in Chapter 4 parameterizes an object's acoustic response by a set of frequencies, dampings, and a set of amplitude functions on the surface that determine the coupling between an external force and the modes at that location. The amplitudes will be sampled on a discrete set of locations and when we are not interested in the location dependency of the sound, the coefficients a will be a set of numbers. The number of parameters needed depends on the nature of the model. Dense sounds like that of drums, or complex musical instruments will require a large amount of modes, whereas sounds with a less dense spectra, such as bars and plates, require much less. The model parameters are stored in an ASCII file of a format that we denote by sy, which is used by all our implementations. The file format is designed to be human readable, and also contains some extra information which allows the user to change overall characteristics of the model (such as the frequency scale of the object) easily. This file format is explained in Appendix A . The model parameters can be acquired from mathematical modeling of the 61 material and shape, or by parameter identification methods, or by "manual" means. 5.1 Computing Model Parameters Analytically For simple shapes and simple material properties it is possible to write down an explicit partial differential equation ( P D E ) for the vibration of the object, and for very simple shapes it is even possible to solve the resulting P D E analytically for some boundary conditions. For more complicated shapes and materials, a finite element model [38, 58] could be used to compute the vibration modes and frequencies. However, in many cases not enough information about the object's geometry and material characteristics is available to do a sensible computation from first principles. We therefore have not pursued the finite element modeling approach in this work. We have explicitly computed the model parameters for a number of simple geometries, which allow an analytic solution of the vibration equations. These shapes already provide a fairly rich set of objects for use in simulations and are very useful if the goal is to provide generic interactive audio rather than the sound of specific physical objects (which are usually too complicated to consider modeling from first principles). We have obtained analytic model parameters for the following shapes, which can be generated using the software described in Chapter 7. 1. The taut string. This is the simplest example of a vibrating system. The eigenfunctions are simple sine functions. The sound becomes brighter for impacts near the ends of the string. The frequency spectrum is harmonic, i.e., all frequencies are integer multiples of the lowest (fundamental) frequency. The amplitudes a are inversely proportional to n, for large n, in contrast to a n 62 plucked string, considered in [80], where they decay as 1/n . This is one factor 2 accounting for the difference between a piano and a guitar sound, for example. A derivation of the eigenfunctions and eigenfrequencies can be found in [54], Chapter III. The nonlinear behaviour of the string (making it less suitable for this type of modeling) was investigated in [16]. 2. The rigid bar. For the rigid bar the operator A appearing in Equation 4.1 is given by A ~ Ox*' As A is a fourth order operator, we need to specify 4 boundary conditions. We have computed a clamped-clamped bar, i.e., the bar is rigidly attached at both ends, and a clamped-free bar. The boundary conditions are ••'«=»"("=(^L=(^L= < -» o 5 for the clamped-clamped case and ••««-(^L-(^L-(^L- ™ for the clamped-free bar, which is assumed to be clamped at x = 0 and free at x = 1. The frequency spectrum is less dense than for the string, and not harmonic, due to the different nature of the restoring forces on a bar. The sparse spectrum makes this shape very suitable for efficient modeling with the synthesis methods considered here. For details, see [54], Chapter IV. 3. The rectangular membrane. This is the simplest two-dimensional geometry. This shape has been used as an illustrative example in Chapter 4. The sound spectrum is extremely dense, giving a rich complex sound. Because of this density, it does not lend itself well to the synthesis algorithm described in 63 this thesis, as many modes are needed to obtain a good sound. For details on the rectangular membrane, see [54], Chapter V , Section 18. 4. The circular membrane. This corresponds to the vibrations of a drum, ignoring the effects of the surrounding air on the drum membrane. The eigenfunctions are Bessel functions, and the eigenfrequencies can be computed,as• the zeros of Bessel functions. The sound is also very dense, and is therefore not well suited for our type of synthesis for real-time applications. The rectangular and the circular membrane can be thought of as idealized models for. drums, which fall outside the scope of this work. For details on the circular membrane, see [54], Chapter V , Section 19. 5. The circular plate. This is one of the few cases where the two-dimensional plate equations can be separated, which allows an analytic solution. The eigenfunctions are a combination of Bessel functions and modified Bessel functions. We have considered a plate clamped rigidly at the boundary. The spectrum is much less dense than for the circular membrane. This is due, as for the bar, to the larger restoring forces in a plate, compared to a membrane. Therefore this model leads itself very well to our synthesis method. For details on the circular plate, see [54], Chapter V , Section 21. 6. The simply supported rectangular plate. This is another of the few cases where the two-dimensional plate equations can be separated, which allows an analytic solution. The eigenfunctions are products of sine functions. spectrum is much less dense than for the rectangular membrane. The This is due, as for the bar, to the larger restoring forces in a plate, compared to a membrane. Therefore this model leads itself very well to our synthesis method. 64 For details on the rectangular plate, see [55], page 389. 7. The L shaped membrane. A membrane supported by a domain consisting of three unit squares in the shape of an L does not allow an analytic solution of the wave equation. This problem has received some attention in the literature, as the resulting boundary value problem requires some refined numerical methods. We have computed the eigenfunctions and the spectrum with an adaptation of the method of partial solutions, see [28], which is available for the L shaped membrane from within M A T L A B . As an aside, we note that the first eigenfunction features prominently on the cover of the M A T L A B reference guide [8]. 5.2 Fitting Model Parameters to Empirical Data An experimental approach to obtaining sound model parameters is to record sounds of real objects and fit the model parameters to the recorded actual sounds. Because of the linearity of the model, all we need is the response of the object to an impulsive force. By striking an object and recording the sound, we can fit the recorded waveform with a function of the form given in Equation 4.15. We can think of this as designing a digital filter of a specific type with a given impulse response (the recording). Because recordings have a lot of noise in them (we want a method that is applicable for "home users"), and objects can't be excited with a true impulse, we need a robust parameter estimation method. The extraction of sinusoidal signals from time-series data has attracted a lot of attention in the statistics and signal processing literature. For a general introduction see [61]. For more recent work, see [44, 56, 14, 70, 69]. 65 Experimentation in M A T L A B with the Prony method [23], and other filter design methods such as the maximum entropy method [71] used in Linear Predictive Coding gave very unsatisfactory results. We tried constructing IIR filters from recorded sounds of several objects and then compared the reconstructed impulse response to the originals. In most cases the reconstructed sounds were very distorted, and often the damping coefficients were completely wrong (too large). We believe this is because the actual sounds are only partially described by the linear model, and non-linearities and external noise are known to cause problems with these methods. Therefore we have not pursued this approach any further. Similar difficulties were reported in [41] when trying to construct an IIR filter for the impulse response of the body of a guitar. It was found that autoregressive (AR) modeling using the autocorrelation method of linear prediction (LP) [48], as well as the pole-zero ( A R M A ) model using Prony's method [57], require extremely high order filters in order to accurately predict the decay rates reasonably well. The general consensus seems to be that for complex noisy data and rough models (as we are considering here), methods based on spectral analysis are more appropriate. Better results were obtained with a more robust approach using windowed Fourier transforms. We construct a spectrogram from the recorded audio signal, ; and use this to determine the dominant frequencies, their decay rates, and their amplitudes. We will now describe the algorithm. The input consists of a recorded impulse response of a real object. The recordings were made at a sampling rate of 44,100 Hz and encoded as a 16 bit wav file. A brief fragment of silence before the strike is (optionally) used to determine the noise level. The analysis consists of the following steps: 66 To be computed: Arrays / , d, and a corresponding to the frequencies, dampings, and coupling amplitudes of a modal model. Perform the windowed D F T : 1. Set the following constants: fs = sampling rate (44100) N = size of D F T window (1024 - 8192) M = desired number of modes (100) X = signal to noise ratio (10) Q = window overlap factor (4) 2. Load the sample y(i) from disk. Store as a floating point array of length L. Total duration of the sample is L/fs3. Compute the overlapping windowed discrete Fourier transforms W (i), k i = 0 , . . . , N/2 — 1. Wk is obtained by windowing the vector y(i) over the interval [kN/Q, kN/Q + N - 1] with a Hanning window [25, 72, 61], and taking the D F T . The index k which labels the D F T ' s lies in the range k = 0,...,[{L-N)Q/N - 1J. Identify the part of the signal used for analysis 4. Compute the intensities N/2-1 J2 A= k 5. Find the k m a x i^wi- for which A is maximal, and define ko = k k is the start of the impulse response. 67 m a x + 1. This 6. Analyze the amplitudes A on the "silence" fragment k = 0 , . . . , k k max and compute the average A, and the standard deviation cr(A). smallest k, k > ko, such that A < A + Xo~(A), k k den —Q Find the and call this value This is the end of the "signal". The constant A , together with the background noise level determines this. 7. Define log\W {i)\, Vjt(i) = ko+k where k •= 0 , . . . , K — 1, with K = k d — ko. This is the part of the signal en we will use to extract the model parameters. Estimate the frequency modes 8. Define an array of N/2 — 1 "bins" , B(i), i = 0 , . . . , N/2 — 1, each corre: sponding to a frequency of fs(i + 1 ) / A . Initialize them to zero. 9. For k = 0 , . . . , K — 1, find the set of indices i, call them {i }, for which max V (i) k is a local maximum, i.e., V (i) > V (i + 1) and V (i) > V (i — 1). k (There is a different set {i } max i ™ , . . . , i^ 1 ax for k k k each value of k.) Select the M indices with maximal values ofV (i), k and add 1 to B(i™ ) ax for each of these indices. We say these bins have been voted for as candidates for a resonance mode. There are K voting rounds. 10. Select the M bins B(i) with most votes, call them i = obtain the estimated frequencies fi = fs(h + l)/N, for z = 0 , . . . , A f - 1. Estimate the damping coefficients of the modes 68 IQ, .. .,IM-I and 11. For each /,-, fit Vk{U) as a function of k, k = 0 , . . . , K — 1, with a linear function —a{k + /?;, using a least squares algorithm. The damping coefficients d{ are identified as di = for i = 0, ...,M- Qfsai/N, 1. Estimate the coupling amplitudes 12. The amplitudes a; are obtained as °< = with pi = T ^ 7 < diN/f S 13. Normalize the amplitudes a; so that max(a,) = 1, and sort /,-, di, and a; by the value of a;. The sample y(«)> ' = 0, — 1 is divided in a number of (overlapping) •• windows of size N. The window size N is typically N = 1024,2048,4096,8192, or about 25 — 200ms. The windows overlap, and the overlap of about 75%, corresponding to Q = 4, was found to work well by trial and error. For each window, a discrete Fourier transform W(j), j = 0 , . . . , N — 1 is computed, using a Hanning window [25, 72, 61], in step 3. The value of determines the spectral resolution as A / = f /N, where f is the sampling rate. For s of about 20Hz. s For a typical pitch of about 400Hz = 2048 this gives a resolution this corresponds to a perceived pitch error of (12/log(2)) log((400 + 20)/400) = .13 semitones, or 13 cents. This is clearly audible, but in the types of sound we are concerned with, absolute pitch is not an important factor. If so desired, an overall fine-tuning of the pitch can be made 69 later, either manually or by using a specific pitch detection algorithm [68; 62, 60]. See Figure 5.1 for an example of the recording of the sound of a metal vase struck at the top, and its reconstruction using the parameter fitting described here. Displayed is the sonogram based on N = 4096 of the actual recording on the left, and its reconstruction using 40 partials on the right. The best window size for a given sound was determined experimentally, by reconstructing the impulse response and comparing it "by ear" with the original. There seems to be no obvious way to automate this process. Often reconstructions with different window sizes would differ audibly, but it was not possible to say which one was "closer" to the original sound. Until we have a better understanding of timbre perception, the best approach seems to be to provide interactive tools with humans making perceptual judgements. For each window the norm of the Fourier transform is summed over all frequencies, giving the average intensity of the signal as A k The window W ko after = Y^iHo 1 l^fcWI m s * P 4e the window with maximum intensity A is chosen as the start k of the impulse response in step 5. Note that if the signal starts somewhere inside an F F T window, this window may or may not register as the maximum intensity window. In order to avoid artifacts from this, we throw away the first window and start the analysis from the next one, which is guaranteed to contain only signal. The beginning of the sample (before the impulse response) is analyzed in step 6 for the average level (the background noise) and the standard deviation. This section ends with the window indexed by kmax k m a x — Q because the window indexed by contains the beginning of the impulse response, and we therefore have to go back Q windows to obtain the previous non-overlapping window. This information is used to determine the end of the impulse response, which is set at the point where the signal amplitude falls below the background level plus some reasonable number 70 6400 Hz 5120 Hz + -3840 Hz + + + I 6400 Hz + + 5120 Hz + -3840 Hz + + 1280 Hz -256ini7~=F~ 1280 Hz 0.2 sec 0.4 sec 0.6 sec lT2*sec™ 0.4 sec 0.6 sec Figure 5.1: Sonogram of recorded (left) and reconstructed (right) impulse response of vase. The window size of the Fourier transform was 4096 and the sampling rate was 44100 Hz. 71 (empirically set to 10) times the background standard deviation. This corresponds roughly to what would be obtained by a visual inspection and cutting of the signal when it "looks like noise". The logarithms of the absolute values of the windows containing a signal are identified in step 7 to extract frequencies, amplitudes, and dampings. For each time-slice we find the frequency bins which are local maxima in step 9, and for the M largest peaks we increase the counts of those bins by one (they are initialized to zero, of course). After doing this for all time-slices, we keep the M bins with the most votes and these are the frequencies identified in step 10. For each of those bins i we fit V~k(i) as a function of k with a linear function, and we extract the damping parameters in step 10 and the amplitudes in step 11. In Figure 5.2 we show the first 100 frequency peak functions and their fits. The plots are sorted left to right and top to bottom by the number of votes. We have set the vertical scale for each subplot separately. Note that the strongest peaks in amplitude don't necessarily get the most votes. We do it like this in order not to miss weak frequencies, because perceptually the strongest are not always the most audible ones. The corresponding frequencies are depicted in Figure 5.3. The resulting vibration model can be visualized by plotting its frequency response in Figure 5.4. Note that the highest peaks do not necessarily correspond to the largest coupling coefficients, as the damping also plays a role in the spectral response. Note that some modes seem to behave very linearly whereas others are almost completely random. In a refinement of the parameter fitting one could reject modes that did not produce an acceptable fit with a linear function. However, it is by no means clear that this would improve the model. Some of the linear fits to the modes (such as the mode in the lower left corner, 72 V \ \ \ \ . \ \ \ ^ \ ^ V \ v. ^ < Figure 5.2: Id entified frequency peaks and their linear fits. 73 4425 4005 3822 3758 3467 3424 3391 3316 2993 2649 2390 2261 2078 1992 183 4996 4576 4177 3973 3865 3790 3208 2229 2121 2046 1109 6105 5846 5028 4619 4242 3941 3908 3714 3499 3165 3058 3036 2595 ,1518 1270 301 97 65 8656 7558 6611 6298 5922 5814 5060 4113 3650 3230 2961 2778 2186 484 10444 9109 8829 8742 8355 6826 6686 6643 6578 6492 6385 6072 5394 5243 5179 5136 4953 4037 3112 2175 1324 678 624 420 388 10993 10476 9615 8775 8269 8236 8053 7203 6891 6460 6223 5986 5717 5663 5599 5523 5426 Figure 5.3: Identified modes for vase. Shown are the frequencies corresponding to Figure 5.2. 74 Figure 5.4: Identified modes for vase. The frequency response is shown. The bottom figure shows a subset of the frequency range of the upper figure. 75 one from the bottom) seem to be wrong. This is an artifact of the least squares algorithm we used, which only works correctly if the best fit has negative slope,' which corresponds to a positive damping coefficient. Negative damping modes are artifacts caused by trying to fit noise. In all cases their amplitudes were so small that they do not contribute to the resulting sound. We preferred to have the algorithm produce positive damping always, even when a least-squares fit would result in a negative damping, as an undetected negative damping coefficient causes instability in the synthesis algorithm, and must be removed before using the data for synthesis. This is straightforward but the result of forgetting this can be rather unpleasant. 1 ; A very weak but positively damped spurious mode is harmless, on the other hand." Some modes may appear to be oscillating, which can occur easily if two frequencies are approximately degenerate. Degeneracy in the spectrum is closely related to geometrical symmetries, with the exact degeneracy being broken by higher order effects, and occurs frequently. Beating between the frequency doublet causes apparent oscillations in a frequency. Such behaviour has been observed in bells [35] •, kettledrums [53], and in timpani [77]. Such oscillating behaviour could be detected and one could attempt to fit such spectral lines with two frequencies. For the purpose of this research we have not found a compelling reason to pursue this further. : Various other refinements of the fitting method are possible, but have not been pursued in the context of this research. For example, one could work with a spline interpolation of the windowed discrete Fourier transforms, to obtain a more precise estimation of the frequencies. This would necessitate some form of frequency tracking, as we don't have discrete bins. For this an adaptation of the McAulayQuatieri algorithm [49] for frequency tracking could be used. For rough models of ' i t results in a horribly loud squeak which may damage the ear. 76 "common" objects little seems to be gained by this, as an accurate determination! of the frequencies is not so important (unlike in musical instruments). Polynomial interpolated Hanning windowed Fourier transforms were used successfully in [77] to identify the spectral lines for timpani sounds. After acquiring the model parameters we usually have to make some manual adjustments. Often background noise such as the hum of a refrigerator will show • up as a spurious mode with zero decay constant. Often these modes are easily' recognized by a visual inspection (or by some pruning code) and can be removed easily. They are also very audible, so when importing the constructed model to the. model tester described in Chapter 7, these spurious modes will be discovered. In Appendix B we show results for the vase at different values of the window size, as well as results for three other objects: a santur, a piano, and a metal computer tower. The piano and the santur parameters did not produce a good vibration? model. This was expected as musical instruments are much more complicated than ! "common" objects. In particular the santur is a very complicated instrument. It consists of a set of strings on a wooden frame, three strings per note. The strings arestruck with wooden hammers. Because of the coupling between the frame and the strings if a single string is struck, the other strings resonate. The sound is therefore extremely complex and thousands of modes would be needed in order to approximate even the linear response. We have included the analysis of this instrument to indicate where our method breaks down. The piano string, whose data was obtained by playing a note on a piano, produces a reasonable model, though hardly musically satisfying. The vase and the computer tower produced very realistic sounding vibration models for a modest number of modes, around 5 — 10. The original sounds and the reconstructed impulse responses can be accessed on-line on [4], and on the 77 accompanying C D . Our reconstruction assumed the recorded sound is the impulse response of the system. However in reality the impulsive force will have some finite duration. One could account for this if the exact force profile of the strike was known, by correcting the amplitudes a obtained by the algorithm with a factor which depends on the exact force profile of the interaction, as follows. The impulse response yo(t) and the response J/F(0 to a finite duration force F(t), (non-zero only for 0 < t < r) are related by •y (t) F = [ Jo T yo{t- s)F(s)ds. If the impulse response yo(t) is assumed to be of the form Vo(t) = ± a l e ^ \ k=l and the response yF{t) is assumed to be of the same form with different coefficients ajT, we can relate a£ and a° by k a - Ik fc) a where the correction coefficients 7 ^ are given by 7jf y/( J T e ^sin(u t)F(t)dt) d k + 2 (e ^cos(u t)F(t)dt) . d k 2 We can now use the algorithm described previously, and divide the coupling coefficients by the correction coefficients 7 ^ to obtain the impulse response. For this the objects would need to be hit with a hammer equipped with a force sensor. We attempted to use the initial part of the recorded impulse response as a profile for the contact force, however this did not improve the results and we conclude that this is not an acceptable approximation. The resulting reconstructed sounds can be heard for the vase and the santur on the web page and on the C D . 78 90 80 h 0 1000 2000 3000 Hz 4000 5000 6000 Figure 5.5: Frequency response of vase for three regions. We also have tested the models by applying various input forces to the models taking into account different number of vibration modes and a selection of these sounds can also be accessed on the web page and C D . In Figure 5.5 we show the frequency response, as reconstructed by our method, for three different locations on the vase at the top, middle, and bottom. As can be seen, many frequencies and dampings (which show up as the widths of the resonances) are shared but the amplitudes differ, as expected. 79 5.3 Designing Model Parameters by Hand In addition to computing or measuring the parameters, there are also circumstances where a less automated approach is desirable. For example, we have been able to synthesize the sound of avalanches by a set of random resonances with frequencies and dampings within a specified range. The avalanche models can be generated on-line and heard on the Java applet on [4], and on the accompanying C D . Also, the sound of "generic" objects such as unspecified structures of a certain type of material can be generated by choosing a set of resonances and controlling their distribution by stochastic methods. For example on the web page referred to above there is an object denoted by "pseudo string" which consists of a set of resonances which are spaced almost harmonically as in the ideal string model, but with some random perturbation. To generate the sounds of combustion engines, described in more detail in Chapter 6, a stochastic model for the resonances is also found to be very effective. In an application described in Chapter 7 we constructed a xylophone object, which was obtained by adding resonances with frequencies corresponding to the seven white piano keys (using just intonation) and adding a bar-like spectrum for each on top with coupling and damping coefficients adjusted by hand until it "sounded right". The xylophone was examined in great detail in [18], with particular attention to the mallet-bar interaction. We have also tried to extract the model parameters from a recorded sound of a church bell "by hand". For this we used a software package to analyze the sound, do Fourier transforms, plot spectrograms etc., in order to identify important frequencies. After a time consuming analysis, it was found that the first 20 or so modes identified were also found by our parameter fitting software, and the 80 weaker modes were not audible. This provides a nice validation of the parameter identification method. Although the "manual" approach did not give us a better model, in general one can take more extensive measurements of vibrating structures under a variety of interactions to arrive at a modal model. For example, steel drums [63], harpsichords [64], and guitar bodies [41] have been investigated using a variety of measurements. 81 Chapter 6 Interaction Force Models To create sounds we need an interaction force model as well as a vibration model. In this chapter we consider four types of interaction models: • Impact forces • Continuous contact forces (sliding and rolling) • Engine forces • Live data streams 6.1 Impact Forces When two solid bodies collide, large forces are applied for a short period of time. The precise details of the contact force will depend on the shape of the contact areas as well as on the elastic properties of the involved materials. For example, a rubber ball colliding with a concrete floor will experience a contact force which will increase faster than linearly with the compression of the ball, because the contact area also 82 increases during the collision. A generic model of contact forces based on the Hertz model and the radii of curvatures of the surfaces in contact was considered in [39]. A Hertzian model was also used to create a detailed model of the interaction forces between the mallet and the bars of a xylophone [18]. To create the simplest model of a collision force to drive the audio synthesis, we assume that the two most important distinguishing characteristics of an impact on an object is the energy transfer in the strike and the "hardness" of the contact/ A psychophysical study of perceived mallet hardness [29] of xylophones showed that this is indeed a very perceptible parameter of an acoustic event. The hardness translates directly in the duration of the force, and the energy transfer translates directly in the magnitude of the force profile. We have experimented with a number of force profiles and found that the exact details of the shape is relatively unimportant. A phenomenological model of a finite duration impact has been constructed and implemented. The model approximates the contact force by a function of the form (6.1) for 0 < t < T , with T the total duration of the contact. This function has the qualitative correct form for a contact force. The force increases slowly in the beginning, representing a gradual increase in contact area, and then rises rapidly, representing the elastic compression of the materials. A n implementation of this contact profile can be found on the interactive demo's on the accompanying C D and on [4]and is found to be quite effective. The sounds of soft contacts (with large T) are recognizable as such, which shows that this model can produce this perception. We have also adopted this contact force model in a real time synthesis program described in Chapter 7, as well in other testing modules. 83 The spatial extension of the contact area will have an influence on the resulting sound. This effect can be incorporated easily in the linear model by averaging the amplitudes a over a region, which will have only a small effect on the sound for small contact areas. This is not an important effect and it will be ignored. More complex interactions during contact such as damping effects may have an important effect in some cases. For example, the hammers in a piano are covered with felt, so during the contact between the hammer and the string they damp the higher modes of vibration. How this actually occurs is quite complicated [32, 33, 34, 76] and potentially important, especially for high quality musical instrument modeling. Another example is a ringing sword. When the sword is sliding over a surface, the continuous contact will provide extra damping, compared to a free ringing sword. This can be taken into account by adjusting the damping parameters appropriately during continuous contact. 6.2 Continuous Contact Forces A n important ingredient in synthesizing realistic scraping and rolling sounds is a surface interaction model. A lot of research has been conducted on models of contact interactions between solids [78, 79, 47, 42, 10, 9], but they usually focus on predicting forces at a coarser time scale than needed for our purposes. A n recent exception is [83]. Nevertheless a rigid body simulator would be able to provide information about the contact force magnitudes and the friction forces at the contact areas which may be used as inputs for a high sampling rate contact force model. This model should be able to generate contact forces for a specific type of contact depending on the contact force and the sliding speed. The roughness profile of both surfaces will determine the effective force stimulus to the object 84 and therefore have an important effect on the sound. Initial experiments indicated that bandwidth limited white noise and Gaussian noise (in the time domain) are' reasonably satisfactory, but they lack a surface property parameter. We have also experimented with scaling (fractal) noise. This is noise with a spectral content which behaves as f a (at least over a sizable region), for some constant a. Such noise (of which white noise is a simple example with a = 0) sounds the same when played back at any speed, and is extremely common in nature [89]. Especially 1 / / noise [85, 88] occurs frequently, and is afield of study by itself. The exponent a can be used as a roughness parameter, a = 0 being a very rough surface, and a = 1 representing a smoother surface. The most satisfactory solution we found is to use a looping digital sample with pitch shifting and volume control to adjust the speed and force of the contact. With a small set of short samples representing a variety of textures a great variety of contact surface profiles can be imposed upon the vibration models. Surface profiles were created by simply scraping a real object with a contact microphone. See the accompanying C D or [4]for a palette of scraping sounds and their use. The physical picture behind this contact model is that the sample encodes the shape of the surface and the object scraping it is following this surface profile exactly. In reality microscopic deformations and breaking of contacts will complicate the mechanism whereby the interaction force is generated. These interactions are so complex that it seems hopeless to try to model the physics behind this in real time. We therefore believe the pitch-shifting sample playback approach is the best way to model these type of contacts. If it is necessary to take into account that the contact force does not simply scale in time at different speeds, one can have a numbers of samples for different contact speeds and cross-fade between them. In essence, this is 85 the technique used presently for wave table synthesis of musical instrument sounds, but used here for the contact force. Viewed in this way, our synthesis can be thought of as an "effect" added to a recorded sample, in this case representing an interaction force, instead of a sound. 6.3 Live Data Streams A n interesting interface to this type of synthesizer is a sensor that measures real interaction forces. We have implemented this very simple (and cheap) with a contact microphone (designed to be attached to a guitar body) attached to a real object. When touching and scraping real objects the audio signal is sent to our real-time simulator and the audio signal is interpreted as a force to whatever vibration model is currently loaded. We can then scrape some interface object and transfer the measured signal to the audio synthesis to create the impression of touching a virtual object. Another type of application is to use the output of an electrical guitar as the driving force for some virtual guitar body. In Chapter 7 we will describe some applications written around this idea. 6.4 Engine Forces Engine sounds are very difficult to achieve in computer games, as they are continuous sounds. Racing games are very popular and a properly modeled car sound that responds in a realistic way to input parameters would greatly enhance the audio in such games. It is not obvious that combustion engines can be modeled with our techniques, as the sources of sound are explosions and very complicated gaseous phe- 86 Figure 6.1: Sample driving four stroke one cylinder engine. nomena. Rather surprisingly, we found that very good sounding interactive models can be made by driving some resonance object (a lumped model of everything that is vibrating) with a rather simple-minded model of a combustion engine. The first model we created is a 4-stroke engine. The driving force is obtained by constructing a looping audio sample divided into four regions which represent the 4 stages of the engine. The sample depicted in Figure 6.1 contains an intake stroke, a compression stroke (silence), a combustion stroke, and an exhaust stroke. The intake stroke was modeled as white noise enveloped with a bell curve. The exhaust stroke is modeled as white noise, rapidly decaying in time, inspired by a high pressure gas mixture being released when the valve opens. The combustion stroke consists of an enveloped burst of 1// noise. It was found, after trying various l/f a noises, that this gives the most realistic sound. The reason is probably that the combustion takes place inside the cylinder, so the shock wave is transmitted to the mass of metal of the engine block, which acts as a low pass filter. The sample is looped at adjustable rate, corresponding to the running speed of the engine, but we keep track of the four stages in the sample and allow the volume of the intake, combustion, and exhaust stages to be set in real-time independently of each other. This gives extra dimensions of control to map for example to driving 87 Figure 6.2: Sample driving four stroke one cylinder engine with fan. uphill, with a large load, etc. This simple driving force model can generate a variety of engine sounds by coupling it to various vibration models. Models with relatively high frequency resonances with high damping give a "lawn mower" effect, whereas a low frequency object gives a very convincing motorcycle sound. The best motorcycle sound was obtained by using a mathematical model of a rectangular plate with dimensions 7 •by 1 and lowest resonance at 50Hz. About 7 modes gives an optimal result. A slightly different sound is produced by addinga background pitched sound to the sample at all four stages. This simulates the sound of the fan, and perhaps other rotating parts. For the pitched sound, a short noisy note played on a 70cm long pipe with holes (a.k.a. a ney) was used, which is very satisfactory. The driving sample is depicted in Figure 6.2. A race car engine sound was obtained by creating a four cylinder version. We assume the four cylinders fire VT/2 out of phase and simply add the samples from . the one cylinder engine four times, with a relative phase shift of 7r/2. The resulting sample is depicted in Figure 6.3. If we just use this driving force as is the result is not very good. The reason is that in a real engine the four cylinders are attached to the intake manifold at different locations and therefore sound different. To incorporate this we adjust the volumes of the four one cylinder samples individually and when 88 t tIL± Figure 6.3: Sample driving four cylinder engine. they are set all different the resulting sound is very convincing. In a real game application one would probably spend more time on the actual modeling of the car engine than we have done here. But our simple models show that a convincing result can already be obtained using extremely simple models. To create richer engine sounds, different cylinders may be coupled to different exhaust manifolds or muffler pipes. It is interesting that the Harley Davidson motorcycles are designed specifically to generate their characteristic sound. Sub optimal engine performance is tolerated, just to get their characteristic sounds. A n interesting article on the audio aspects of the Harley Davidson motorcycle by Joel C . Moser appeared as the 1996 winner of the Streuben essay competition of the College of Engineering at the University of Wisconsin-Madison. The article is available on the W W W at [5]. 89 Chapter 7 Implementations The theory presented in the previous chapters was tested in practice in order to evaluate the quality of interactive audio created with the methodology in several applications. , In this chapter we will first describe the general architecture of applications with real-time audio synthesis and describe the software components for audio syn1 thesis that have been implemented in C + + and in Java. Following this, we w i l l , discuss authoring tools that we created, and finally we will describe several demonstration programs that have been constructed to illustrate what can be done with the audio synthesis. 7.1 General System Architecture In order to apply the theory developed here to the creation of interactive audio in software applications (such as simulations) we need to implement software components which will have to be integrated with the application. The low level audio synthesis and the control algorithms should be designed as orthogonal as possible. 90 Synthesis Engine Sampling rate Buffer size C P U load User Input Controller c_[] c+[] Vibration model a[] Geometry model Physical object force[] Interaction model Interaction state Real-time Application Offline Authoring Tools OS Audio API Model Database Mathematical shapes C + + classes Vibration models Parameter fitting Figure 7.1: System architecture for applications with real-time audio synthesis. We therefore have divided the runtime system into two main components: • SynthesisEngine. This component represents a real-time thread or process that continuously writes computed audio buffers to the audio rendering hardware. This object implements the core synthesis algorithm at the lowest level. • Controller. This controls the SynthesisEngine by feeding it the necessary parameters. This object encapsulates composite models of objects and interactions and computes the filter coefficients and input buffers from higher level parameters such as sliding speed, throttle settings, etc. The SynthesisEngine communicates with the audio hardware by repeatedly submitting a computed audio buffer of duration T using the audio A P I of the operat91 ing system. The operating system will provide an abstraction of the audio hardware in the form of a queue on which audio buffers can be placed. The SynthesisEngine must submit buffers at a rate of 1/T, or the queue will underflow, resulting in spurious noises. The buffer size T will determine the latency of the synthesis algorithm (the delay between an interaction and when it is heard) and the control overhead. We discuss this in more detail in Section 7.1.1. Before computing the next buffer, the SynthesisEngine will obtain information from the application about the object whose sound it should render and about the force applied to it. This information is obtained through the Controller class, which is responsible for making the coefficients c , c_, a, and the force F (see Equation 4.18) available for the duration T of + the buffer. The SynthesisEngine will then compute the modal contributions of each mode as defined by the parameters obtained from the Controller one by one. After each mode is added, it checks if it has time to compute another mode. If its allotted time has run out or if all modes have been computed, it will place the computed audio buffer on the output queue and start computing the next buffer. The time allowed for the computation of a single buffer should be less than T, in order to insure an uninterrupted audio stream. By allowing significantly less time than T for the completion of the buffer, one can effectively force the synthesis algorithm to use only a small fraction of the available computational resources of the system. If more resources become available (for example when a computationally intensive task in a game is finished), more modes can be added and the quality of the synthesized sounds will adjust dynamically to the load on the system. For a buffer length of T, the latency of the synthesis will be between T and 2T. The force F(t) will become available after a time interval T , and the synthesis 92 algorithm will have to finish its computation of the audio buffer after another time interval T , because after that the next buffer F will have to be processed. In practice, the audio synthesis will often be required to use only a small fraction of the computational resources of the host machine, and will be required to finish in a time much shorter than T. To achieve minimum latency, we should choose T to be as small as possible. However, there is additional overhead for each buffer, as the model parameters may change. For large values of T this overhead can be made as small as desired, at the expense of more latency. Let us denote the computational cost of a multiplication by C(mult) and the computational cost of an addition by C(add), and let SR be the sampling rate and n the number of modes. The computational load per second of the synthesis will be L s y n t h = (3C(mult) + 4C(add))n5R. (7.1) The control overhead is assumed to have a computational cost of C(control) per mode, and the control parameters are recomputed every T seconds. The control overhead will be -^control per second. = nC(control)/T (7.2) As an illustrative example let us consider an object whose overall frequency scale will be modified in real time. In this case the frequencies will have to be multiplied by a scale factor which will cost TiC(mult). Then the coefficients c + and c_ appearing in Equation 4.18 have to be recomputed giving a total cost per mode of C(control) = C(exp) + 2C(trig) + 2C(div) + 4C(mult) + 2C(add) + C(sqrt), where C(exp) is the cost of an exponentiation, C(trig) is the cost of computing a trigonometric function, C(div) is the cost of a division, and C(sqrt) is the cost of 93 taking a square root. The condition under which the control cost can be neglected, ^synth, reads ^control <C •. C(control)/T < (3C(mult) + 4 C ( a d d ) ) 5 ; R (7.3) In practice a buffer length of about 50 ms, corresponding to about 1000 samples at 22, 050 Hz was used. The interface to the audio hardware on the computer introduces additional latency and requires a certain minimum buffer length for the audio queues^ In practice this also determines the buffer length used in the synthesis implementation. In this case the condition given by Equation 7.3 becomes C(control) < (3308C(mult) + 4410C(add)), which is easily satisfied. The runtime components are implemented as C + + and/or Java classes. The application writer will create a properly configured SynthesisEngine object, which will start the low level synthesis thread, but not interact with it directly. The SynthesisEngine is given access to a Controller object, which encapsulates the task of translating physical parameters such as resonance frequencies and excitation forces into appropriate control parameters for the low level synthesis algorithm. The rela- • tion between the components and the application is depicted in Figure 7.1. We will now describe the run-time components in more detail. 7.1.1 SynthesisEngine The SynthesisEngine implements the audio synthesis algorithm defined by Equa- • tion 4.18. The class definition in simplified form is as follows: class SynthesisEngine { 94 private: < int samplingRate; int bufferSize; // =T*samplingRate double cpuFraction; ' SynthesisEngineController *sec; public: // Define controller object int setSynthesisEngineController(SynthesisEngineController *sec); // Set output buffer size int setBufferSize(int bufsz); // Set fraction of CPU load to use int setCPUFraction(double fraction); // sampling rate i n Hertz , int setSamplingRate(int samplingRate); // Create synthesis thread int run(); // Destroy synthesis thread int stopO ; }; When a SynthesisEngine object is created and its run() method is called, thread is started which implements the following pseudo code: buf[bufsz]; // Audio buffer to be computed uPrev[] = vPrev[] = 0 ; // Zero'd arrays while(not_stopped) { Obtain arrays a[] , c _ [ ] , and c+[] from Controller; 95 // See Equation 4.18 for these coefficients Obtain array F[bufsz] from Controller; N = length of c[] array, i e . , number of modes; buf[] =0; time_allowed = fraction * bufsz / srate; start_time = getTimeO; for(i=0;i<N && (getTime()-start_time)<time_allowed;i++) { // Add contribution of mode i to buf [] and // recover the state variables from the previous time s l i c e u_prev = uPrev[i]; v_new = vPrev[i]; // Load parameters for this mode i n register ccplus = c_ [ i ] ; ccminus = c+ [ i ] ; aa = a[i] ; // Add this mode to output buffer for(k=0;k<bufsz;k++) { u_new = ccminus * u_prev - v_new + aa * F[k]; v_new = ccplus * v_new + u_prev; u_prev = u_new; buf [k] += u_new; > // Save state variables for next buffer uPrev[i] = u_new; vPrev[i] = v_new; 96 } // Computed as many modes as time allows, now send to audio hardware. submitBufferToAudioHardware(buf); } The inner loop adds the contributions from the modes one by one, until either all modes are computed or the allotted time has run out. If other processes are running at high priority on the machine, less modes will be added in the time allotted than if the synthesis thread has more resources available. In this case there will be a graceful degradation of the audio, provided that there is enough time to computed the "essential" modes of the object. 1 When the computed buffer is submitted to the audio hardware, it will be put on a playback queue and the function submitBuf ferToAudioHardwareO is assumed to block if the queue is full, thereby 1 taking care of the timing of the synthesis loop. This function needs to be called at a rate of samplingRate/buf f erSize to keep the audio stream moving at the correct rate. How submitBuf ferToAudioHardwareO is implemented depends on the operating system on which the application is built. The operating system's audio A P I will also impose restrictions on latency, in addition to those associated with the value of the buffer size used in the actual synthesis algorithm. We have implemented the SynthesisEngine on top of the SGI IRIX A L audio A P I and on top of the Microsoft DirectSound A P I . On A L , we were able to achieve a total latency of about 50ms at a sampling rate of 22,050 Hz. Using DirectSound, the lowest latency we could achieve was about 100 ms. It is to be expected that the latter figures will come down when the DirectSound implementation matures. 'We assume that the granularity of the thread scheduling on the processor is fine enough. 97 Max. modes C P U load for 10 modes C 450 2.2% Java (Compiled) 250 4% Java (JIT) 150 6.7% Table 7.1: Performance of synthesis algorithm at a sampling rate of 22,050 Hz on a 233 Mhz Pentium P C with M M X using a C implementation and a Java implementation. The latency of the IRIX implementation is caused by the minimum length ofithe input queue buffer. If necessary, this buffer can be bypassed and the latency can be made as small as desired. As the achieved latency is quite satisfactory, we have not pursued this possibility. Bypassing this buffer also requires the process to run with root privileges, which severely limits its deployability. In order to test the speed of the algorithm, we timed the inner synthesis' loop on a Pentium 233 M M X P C . The results are summarized in Table 7.1. A t the sampling rate of 22,050 Hz, using a C implementation of the synthesis loop, we found that we can synthesize a maximum of 450 modes. This means we can compute about 5 modes utilizing 1% of the computational resources on average. We also benchmarked the algorithm in Java. 2 With the Symantec J I T compiler version 3.00.029(i) we were able to synthesize a maximum of 150 modes, about three times slower. Compiling the Java code into an executable allowed us to increase the performance to 250 modes, which is slightly better than twice as slow as the C implementation. We have not implemented the algorithm in assembly language, which could provide a significant performance boost. 2 This estimate assumes that there are no exceptional events such as page faults, context switches, etc. 98 7.1.2 Controller The Controller class is an abstract base class and is subclassed by the application writer to encapsulate concrete audio models of objects. The Controller will usually load data from a database of vibration models created with various authoring tools described in more detail below. It translates parameters such as throttle setting or contact sliding speed into the low level parameters that the SynthesisEngine needs. In bare-bones form the class is given by: class SynthesisEngineController { public: // Callback function to get current f i l t e r parameters . v i r t u a l RTSFilter *callBack( int *location_index,const double *force,int srate.int buflen) = 0; >; The function callBackO is called by the SynthesisEngine thread and returns a pointer to a struct, RTSFilter (real Time Synthesis Filter) which has the following form: ttdefine MAX_SPECTRUM 100 // Max. of 100 modes #define MAX_L0C 10 // Max. of 10 locations typedef struct { int nf; /* Number of components */ int np; /* Number of points on object*/ double c_plus[MAX_SPECTRUM]; double c_minus[MAX_SPECTRUM]; double a[MAX_L0C][MAX_SPECTRUM]; 99 } RTSFilter; This struct contains the parameters necessary to set the coefficients in Equation 4.18, for a set of "locations" on the object. A "location" is just an integer 0 < i < np, which indexes the arrays a[ ][}. The translation to this linear array of "locations" from geometrical concepts used in the application will be done at a higher level in the code layers. The floating point array force [ ] represents the force applied to the object. The parameter buflen indicates the number of samples to be returned in the force buffer and is set by the SynthesisEngine thread. Similarly the parameter srate is used to inform the application of the current sampling rate used by the SynthesisEngine. The member function SynthesisEngineController::callBack() is overridden by the user (application programmer) of the SynthesisEngine class and also returns the location index of the vibration model in the pointer variable int *location_index. Example: MouseTouchableObject The first example (used in the demonstration program described in Section 7.3.7) uses this base class to define a controller allowing the user to scrape virtual objects with the mouse: class MouseTouchableObject : public SynthesisEngineController { private: double m_scrapeSpeed; double m_scrapeForce; double *m_heightProfile; 100 SonicObject *sob; public: int setScrapeSpeed(double); int setScrapeForce(double); int setSonicObject(SonicObject *sob); int setSurfaceProfile(double *heightProfile,int buflen); }; // Implementation of callBackO RTSFilter *MouseTouchableObject:: callBack(int *location_index,const double *force,int srate.int buflen) { *location_index = 0 ; // Only one location // Maintain a circular pointer m_p i n the buffer m_heightProfile. Copy // a section of length buflen*m_scrapeSpeed into tmpbuf [], and advance // m_p by this amount. Resample tmpbuf [] // scale with a factor m_scrapeForce and to contain buflen samples, copy i n force []. This i s a // pitch-shifted segment of the height p r o f i l e buffer, return sob->getRTSFilter(srate); // return the f i l t e r parameters } The interface metaphor used in this class is a surface profile, whose shape is stored in the array mJieightProf i l e . This surface is touched by an object with force m_scrapeForce and speed m_scrapeSpeed, and the resulting excitation force can be obtained by looping through the circular array mJieightProf i l e at speed m_scrapeSpeed, resulting in a pitch shift of the buffer. 101 The SonicObject class is a wrapper around a parameter set for a vibration model of an object. It contains data members for frequencies, dampings, and amplitudes, and can be loaded and saved from file. It stores its data in the sy file format explained in Appendix A . The member function getRTSFilter(int srate) computes the actual coefficient arrays c_, c , and a (see Chapter 4) from the higher + level model parameters. 3 It also provides member functions to change the material properties by rescaling frequencies, changing the material constant, etc. Example: AudioSignalProcessor The second example (used in the demonstration programs described in Section 7.3.7, • and in Section 7.3.3) uses the base class to define a controller that will obtain the input force from the input channel of the audio hardware. In this manner, the SynthesisEngine can be viewed as an audio effects processor. class AudioSignalProcessor : public SynthesisEngineController {1 private: SonicObject *sob; public: int setSonicObject(SonicObject *sob); }; // Implementation of callBackO RTSFilter *AudioSignalProcessor:: callBack(int *location_index,const double *force,int srate,int buflen) { *location_index = 0 ; 3 // Only one location It caches the result and computes this only once, unless model parameters change. 102 getlnputBufferFromAudioHardware(force.buflen); // Get input from audio return sob->getRTSFilter(srate); // return the f i l t e r parameters } The function getlnputBufferFromAudioHardware(double *,int) uses the operating system's audio A P I to obtain a buffer representing the input audio signal. This can originate for example from a microphone. 7.2 Authoring Tools In this section we will discuss some authoring tools that were created to facilitate the creation of vibration models. 7.2.1 Off-line Model Tester onicObiect Tester; File O X Help nm Seconds of sound Damping Sampling Rate j ' | Strike 3 Number of modes 1 10 p i Interaction Type 144100 j j j Freq. scale jT Mallet Hardness Soft Figure 7.2: Off-line model tester. This utility allows the testing of a vibration model by applying various forces 103 to it and writing the resulting audio to disk as a wav file. This allows a user to fine-tune the model parameters and to analyze in detail the sounds made. The interface (shown in Figure 7.2) requires the user to select a vibration model from disk by loading an sy file. The vibration model can then be excited in three ways. The object can be struck with a virtual mallet with adjustable "hardness". This parameter corresponds to the width of the peak of the excitation force, which has the form given in Equation 6.1. The object can be scraped with white noise and the interaction force can also be loaded from disk as a wav file. The parameters of the vibration model can be modified in a number of ways. The dampings and frequencies can be scaled by constants, the sampling rate can be set, and the number of modes can be set. The sounds on the web page [4], and on the accompanying C D referred to in Section 5.2 were generated with this tool. The tool is written in portable C + + i with a command-line interface. A G U I written in Java was wrapped around this. 7.2.2 Vibration Model Generators Two command-line tools were used to construct vibration models. A parameter fitting module implemented in M A T L A B was discussed in detail in Section 5.2. We have used it to create models of vases, pots and pans, stringed instruments, hockey sticks, bells, boxes, plates, glasses, cups, basketballs, swords, and more. A t a sampling rate of 44,100 Hz we would typically generate models with 100 modes at window sizes of N = 1024,2048,4096,8192, or about 25 - 200ms. We would then try the resulting models with the off-line tester described in Section 7.2.1. Sometimes spurious modes need to be eliminated manually. These modes arise from constant background noise and lead to frequencies with zero decay rate. They are 104 immediately audible, and can also be recognized by inspecting the s y file. Usually one or two window sizes give results that are considerably better than the others and those would be selected. As expected, the results were best for objects with a rather simple spectrum. None of the sounds were "photo-realistic", except perhaps the hockey stick. But in an interactive setting the sounds were convincing. A tool to create vibration models of mathematical shapes as described in Section 5.1 was implemented in C . The shape, material, size, and other relevant parameters are given as command line arguments and the module produces a sequence of s y files. 7.3 7.3.1 Demonstration Programs Sonic Explorer 1.0 We have constructed a test bed application in C + + for SGI, called the "Sonic Explorer", which demonstrates the enhancement interactive audio can bring to a three dimensional graphics environment. Figure 7.3: A room modeled with the Sonic Explorer 105 The first version of the Sonic Explorer presents to the user a three dimensional model of a room with various objects as depicted in Figure 7.3. The user can view the room from any point by moving a virtual camera. The purpose of this program is to evaluate the quality and usefulness of simple impact sounds in a real environment. This demo does not use real-time synthesis, it is intended to test' impact sounds only. Several objects in the room were modeled by vibration models of mathematical shapes. The impulse responses of these objects were computed on a grid of locations using 10 points in each dimension and the resulting audio samples were stored on disk. The user can move a virtual drumstick around and hit objects at various locations and the corresponding audio sample is identified and played back. The material of the objects was modeled by the method described in Section 4 . 5 . : Even though the interactivity is very limited in the demo, the effect of changing sound as objects are struck at different locations and the material of the objects are conveyed quite well with these simple models. 7.3.2 Sonic Explorer 2.0 The second version of the Sonic Explorer implements a real-time synthesis engine as outlined in Section 7.1 using the IRIX audio library. Geometrical models of a vase (see Figure 7.5) and of a xylophone (see Figure 7.4) were created using the Open Inventor toolkit. A vibration model of the vase was constructed using the parameter fitting module discussed in Section 7.2.2. Data from three regions on the vase the top, middle, and bottom, was used and integrated manually into an sy file with three locations. The vibration models for the eight xylophone bars were constructed manually from the solutions of the free bar equations and the 106 107 base frequencies were tuned to form a "white key" scale of pitches tuned in just temperament. A special interaction device, called the Sonic Probe was constructed to allow the user to touch these objects by tapping and scraping. The probe utilizes a graphics tablet to move the mouse pointer. When the mouse pointer is on a the vase or on a xylophone bar the appropriate action is taken by the controller, as described in Section 7.1.2, and the SynthesisEngine will model the corresponding vibration model. The stimulus force on the virtual object is obtained from a contact microphone embedded in the pen. When the user touches the graphics tablet the local interaction, caused by tapping or scraping, is picked up by the contact mike and used as the input signal for the audio synthesis as described in Section 7.1.2. A small cover of rough wood was optionally placed over the graphics tablet. The, surface of the wood is rougher than the smooth plastic of the tablet and in this way we obtain a more interesting signal. In this manner we effectively map the surface structure of the wood to the object being modeled. A n additional feature is that the vase can be resized in the vertical direction, stretching or shrinking it, and the frequencies of the modes are scaled according to the length of the vase. Even though this is not the actual way the modes will change if the geometry of a vase is changed in this manner, it illustrates the concept quite well. 7.3.3 Sonified Objects This application uses the idea of picking up real interaction forces with contact microphones and mapping them on virtual sonic objects in a non-graphical context. The SynthesisEngine was extended to process two independent input streams and 108 Figure 7.6: Plastic swords with embedded contact mikes used to create metal sword sounds. apply the virtual forces to two independent objects and to render the audio to different speakers. We attached contact microphones to various objects and obtained interesting effects by feeding these data streams to various vibration models. Two plastic toy swords were equipped with contact mikes and two different metal sword audio models were used to render the sounds. See Figure 7.6. Amusing effects could be obtained by changing the model to church bells or other inappropriate sounds. We have also used this setup (with one data stream) to create a digital effect box for an electric guitar. Very interesting effects were obtained by hooking the guitar up to vibration models of plates, objects with randomized modal structures, 109 bells, etc. 7.3.4 A v a l a n c h e Sounds A n implementation of the synthesis algorithm was made in Java. Unfortunately Java 1.1 does not provide any method to render computed sound in real time. It even doesn't provide a method to render computed sounds by any means* whatsoever. Fortunately an "unofficial" audio A P I (the "sun.audio" library) does provide a minimal A P I to allow the playback of computed arrays of 8-bit audio data at the sampling rate 8,000 Hz. We have created a small package of classes to allow the creation of sonic objects and forces. The force can then be "applied" to the sonic object' and the resulting audio can be saved in a buffer and rendered later. A n advantage of this is that the sounds are created locally and need not be transmitted over the network. Viewed in this way the synthesis is the ultimate compression method. The class hierarchy and the complete documentation for the Java synthesis classes are available on the accompanying C D and on [4]. The classes were used to create parameterized avalanche sounds in the applet depicted in Figure 7.7. A n avalanche sound is generated by applying white noise to a vibration model containing an adjustable number of modes with frequencies randomly generated within a selectable frequency band. These are called the "rumbling frequencies". The stimulus force consists of modulated white noise whose time envelope can be controlled by four set-points. The avalanche sound is computed when the button "Compute" is pressed and played by pressing "Trigger". By repeatedly pressing the trigger button overlayed sections of the pre-computed sounds can generate a certain amount of interactivity. It is possible to re-compute the sounds while they are playing and 110 Trigger i Trigger avalanche (press repeatedly to layer) I Compute soundeffect (slightly indeterministic) Compute Minimum rumbling frequency 100.0 Maximum rumbling frequency 150.0 Rumbling damping factor 0.87 Number of modes (sound quality) 25 Time to full level 1.2 Time to start fade 3.0 Time to end avalanche 6.0 Time to end all sound 10.0 Figure 7.7: Applet to generate avalanche sounds. every time a new set of random modes will be created, providing never repeating sounds. By setting the maximum frequency to higher values other interesting sounds reminiscent of wind or even eerie musical effects can be created. 7.3.5 Location Dependent Sounds of Mathematical Shapes If the ideas presented here are to be used in a virtual reality environment, better tools will need to be provided to the user to specify the sonic attributes of objects. A graphical design tool that allows a user to define the types of sounds made by an object interactively would be very useful. For example, a user will describe an object as a circular metallic plate with a base frequency of 415 Hz. The user should then be allowed to poke and scratch the virtual object and adjust the model parameters until the object behaves as desired. A prototype has been implemented as a Java Applet, which allows a small class of objects to be created. See Figure 7.8. The object selected is depicted graphically and the user can hit it and scrape it 111 Object Force C C Hit 2 r Bart C Ptusk (~ Bat? P CtreularPhK* (~ HRI Scrapft(whftft note*) Grid EKB <• S-f*|:ft (0*J?SI*t> rwis*> f~" P$€l»dOSW«fi Duration Scrape time PtuckBmg Hammtf hardness Figure 7.8: Applet to create location dependent sounds for a variety of objects. and adjust the model parameters. Besides as a prototype of a sonic design tool, the program is also useful to test contact models. For example, the finite duration contact model described in Section 6.1 is currently implemented and this allows an immediate evaluation of the model. By repeatedly hitting an object, possibly at different locations, some interactive control can be exercised. Every mouse click generates an audio thread and many can be overlayed, through which dynamic effects can be created. 7.3.6 V i r t u a l B e l l Tower A third Java applet was created to show an application which is both entertaining, small, and musically interesting. The applet maintains models of eight bells (a bell tower) and allows the user to select among different bell models, retune the bells using various presets or custom, play them interactively with the mouse, change the hardness of the mallet used to strike the bells, and sequence "change ringing" patterns. English bell ringing [40, 92, 19] is a complex mathematical art-form based 112 on ringing tuned bells in a steady rhythm in such a way that the bells ring all, or a subset of, the possible permutations of the bells. Typically six, eight, ten or twelve bells are used, but as few as three or as many as sixteen tower bells can also be rung to "changes". A mathematical notation system to describe a "method" exists. This "place notation" system can be viewed as an algorithm for generating permutations. The applet implements a place notation parser to allow the interested virtual change ringer to try out methods. A collection of ancient traditional methods can also be selected from the pull-down menu. Apart from the synthesized bell sounds, it is also possible to use wave table playback. For this purpose, eight recordings of actual church bells have to be loaded. Though the quality of the sound is somewhat better, all interactivity is lost, as the bells can not be tuned, or struck with different mallets etc. Under Netscape 4, the digital samples are not loaded from the server until the samples are actually used. The very noticeable delay while the audio samples are loaded illustrates the advantage of creating the sounds locally rather than downloading them over the network. 7.3.7 Real-time Model Tester This demonstration program runs on Windows 95/98 using the DirectSound audio A P I . It is intended as a test-bed for object models, interaction models, and A P I design, as well as a demonstration program of the versatility of the audio synthesis methodology. The main control window is depicted in Figure 7.11. The check boxes allow the selection of various operation modes. In the basic mode, when all boxes are unchecked, except possibly the "3D" 113 .*Jtnfc.«.. .AjarMK.lt. . 4j*n—.4 *jf ^ ^ ^ P u J P ^ ^ ^ ^ W 4 P ^ ^ ^ ^^^M^^^ ^ ^ * W M p " ^ ^ ^ * W J p ^ ^ liMirHiiRi ^^^K^**^ ^^^^4p*"" 12345878 12346678 21438587 12483857 21648375 26143857 62413375 62148735 26417853 624 d Start Faster Slower C Belli (• Bell2B C Bell 2 f Sample Ed* pitch Irregular c B A E D C Method Modes (1-30) Sharp 20 187.5 K G F strike 200.0 scRaMbLe 150.0 I* I* K K 133.33 Apply 125.0 Ready Dull Regular X.38.X. 14.X. 1258.X.36 X . 14.X.58.X. 18 .X.78 . X . 16 .X.58 . X . 14.X.: Cambridge Surprise Major Tuning Major iCamtiridge Surprise Major Grandsire Doubles Plain Bob Minor Little Bob Minor Reverse Canterbury Pleasure Place Doubles Norwich Surprise Minor Birmingham Carter Triples Figure 7.9: Help Applet for ringing a bell tower. 114 3 .jjuiraujL. «kK». 12345678 12345678 21436587 12463857 21848375 28143857 82418375 82148735 28417853 824 - —^ F aster Start Slower I Irregular - - A - - G - Method J Belli (• Bell2B C Bell 2 f Sample :1 — strike C B Is/ Sharp 20 187.5 F 133.33 E 125.0 D 112.5 C 100.0 scRaMbLe Appfc Ready Dull X.38.X. 14.X. 1258.X.38 X . 14.X.58 X . 16 X.78.X. 16 .X.58 X . 14.X.; Cambridge Surprise Major Tuning Major Major For You the Bells Toll (If you press STA Dorian Phrygian Lyd i a n Mixo-Lydian Figure 7.10: Applet for ringing a bell tower. 115 Help Modes (1-30) 166.66 Regular .&J*M».4. JL*«k^. C Edit pitch ~ : ±J jLa<>tJL w 1 jL* S0 IP S y n t h e s i s E n g i n e Object- Scrape randobjectl .txt randstringl .txt ;rectplate1 .txt beil.sy desklamp.sy guitar.sy vase.sy computertower. < stick, sy -Hit Vol Restart Stop F F r i f Force S p e e d Frequency Contact Model Min Force Elast i Damping 10.0 Modes r Object Car Joystick _ Contact Model Aspect 30 100 Realtime Modifiers Freq. Damp. Max 10K . j - white.wav 1 overt. 5.wav 1 overf.wav r Mike 3D 25 - r mmrnrn wood.wav metal.wav J plastic.wav sandpaper.wav grid.wav Fr 50 50 8060 1.07 T' 1 7.05 Figure 7.11: Real-time model tester. Main control panel. 116 Fwrpl Slow Scrape.. .Fast Scrape Figure 7.12: Real-time model tester. Interaction windows to scrape and hit objects. Car Engine Parameter Intake Combustion Exhaust - i Beatupness Figure 7.13: Real-time model tester. Engine model editor. 117 box, the program allows the user to interact with several virtual objects. The object is selected from the "object" list, or can be loaded from disk as an "sy" file (see Appendix A ) or as a "txt" file. The sy file is expected to contain a vibration model of an object with one location. If a "txt" file is selected, the object is specified by its shape, base frequency, damping constant, maximum frequency, number of modes, and its aspect ratio (if applicable). We currently have implemented the ideal string, the rectangular plate and a "random" object, which creates randomized resonances and dampings. A n example of a "txt" file specifying a rectangular plate is rectplate base_frequency: 50.000000 damping: 5.559740 maximum_frequency: 8060.745000 number_of_modes: 13 npoints_x((must_be_l_for_now)or_npoints_in_l_dimension): 1 npoints_y((must_be_l_for_now)unused_in_l_dimension): 1 aspect_ratio_rectangle(unused_for_others): 7.052800 After a model is loaded from a "txt" or "sy" file, the user can alter its 118 properties with the sliders labeled "Frequency", "Damping", "Modes", "Aspect". This requires a restart of the synthesis by pressing the button in the upper right. The three sliders labeled "Realtime Modifiers" can be adjusted during synthesis: They allow the frequencies and the dampings to be scaled uniformly, and the slider labeled " F w r p l " applies a more complicated transformation to the vibration model. This transformation is intended to demonstrate the capability of real-time "sound morphing" with this synthesis method. After the vibration model is selected, the user can interact with it in various manners. If the "Mike" option is selected, the audio input channel is interpreted as the applied force, as in the demo described in Section 7.3.3. A problem with DirectSound causes very high latency of up to half a second, making this mode: very awkward. Nevertheless we have used it to test vibration models stimulated by real-time interactions. ; > If the "Mike" option is not checked, the user can stimulate the object in various other manners. If the "Car" and "Joystick" modes are not selected, 'the user interacts with the object by selecting a contact model from the second list box or from disk. The contact model is stored as a wav file and represents the surface texture of the virtual object, according to the model explained in Chapter 6. Mathematical models using scaling noise, as well as recorded dry scraping sounds have been used. Once the object and the contact model are selected, the user can scrape the object by moving the mouse over the upper interaction window depicted in Figure 7.12. The position of the mouse on the interaction rectangle determines the scraping normal force (top to bottom) and the scraping speed (left to right). The range of the force and speed can be adjusted with the "Scrape" sliders. If the "3D" option is selected, the horizontal position of the mouse is mapped to left-right 119 spatialization of the resulting sound. The second interaction window depicted in Figure 7.12 allows the user to hit the object by clicking the mouse at various points on the interaction window. The vertical position is mapped to force, while the horizontal position is mapped to the hardness of the virtual hammer, using the contact model for hitting described in Chapter 6. Sounds of a four stroke combustion engine are obtained by selecting the "Car" option. The interaction with the engine models is similar to the interaction with scraping models, except that one now controls the engine speed through the upper window. For this purpose, several wav files representing models for combustion engines as discussed in Chapter 6 can be selected. If the mouse pointer leaves the window, the engine sound continues at this rpm. If the "Joystick" option is selected, the user can control the rpm and the spatial position of the sound with the joystick. In this case the idling speed, and the maximum rpm can also be adjusted in real time with the joystick buttons. In "Car" mode, a third interaction window is present, depicted in Figure 7.13, which allows the user to adjust the volumes of the four stages of the four stroke engine model independently, to create different sounds interactively. The "Realtime Modifiers" allow the user to change the resonance properties of the engine and exhaust system while running the engine interactively with the joystick. Very convincing sounds of motorcycles, race cars, lawn mowers, and chain saws, can be created using relatively simple vibration and interaction models. 120 Chapter 8 Conclusions and Extensions 8.1 Summary of Results In this thesis we have constructed a methodology to synthesize audio in real-time in simulation or game environments. We have shown how the physics of vibrating objects (using linear approximations) leads to a parameterized vibration model suitable for real-time synthesis of interactive audio. The synthesis algorithm maps the eigenvalues of the associated P D E to resonances and maps the mode shapes of physical objects to coupling coefficients. We believe the connection between the physics and the synthesis method is important as it allows a refinement of the method if so desired. In contrast to synthesis methods such as F M , our method provides a clear path to extension and improvement: include more physics. We have investigated mathematical models of simple geometries and computed the vibration models directly. A one parameter model for the material was found to convey the perception of wood versus metal versus glass quite well. A parameter fitting algorithm was implemented, which reconstructs a vibra- 121 tion model from a recording of a struck object. A large number of vibration models of objects have been reconstructed and the method performs satisfactorily. We have investigated interaction models for collisions, continuous contact, engines, and more abstract forces such as avalanches. It was concluded that a simple one parameter model of impact suffices to convey the perception of the "hardness" of a virtual mallet. For continuous contacts, we conclude that looped wave-table. playback of a small sample (representing knowledge about the surface contact) is the best solution. It conveys the perception of "roughness" and of contact speed quite well and it has a solid physical justification to be used as the first order approximation for contact force. The synthesis algorithm was implemented on several platforms and its performance in terms of speed, latency, and quality was analyzed in some detail, both experimentally and theoretically. It was found that the synthesis is efficient enough to allow a real-time synthesis of reasonably complex objects using just a few percent of C P U cycles on modern personal computers. We have designed and implemented an A P I for synthesis around the idea of a SynthesisEngine (which implements the low level synthesis) and a SynthesisController (which implements models of interaction). The A P I was implemented in C + + and Java and was used to create several demonstration programs, some of which can be found on the accompanying C D or W W W site [4]. From the performance and quality of the sound obtained in the demonstration programs we conclude that this type of audio synthesis can be used to significantly enhance the feeling of immersion in interactive environments. Current wave-table technologies are not able to provide the level of interactivity required in audio for continuous sounds. The synthesis method presented here is the most 122 efficient and realistic method to obtain contact sounds of interacting objects. The method also produces very good engine and rumbling sounds. 8.2 Extensions and Future Work Several directions for future work present themselves naturally. Some are obvious and straightforward, some are more challenging. 8.2.1 Faster Synthesis The performance of the synthesis algorithm can be improved in several ways. A significant performance boost can be expected by implementing the inner loop of the synthesis algorithm in assembly language instead of C . Another possible improvement is to implement the algorithm at multiple resolutions. For a resonance mode of 100 Hz, a sampling rate of 200 Hz should be high enough to accurately compute the contribution of this mode to the total vibration. It is perhaps more efficient to compute each resonance at the minimum sampling rate for that specific resonance and "up-sample" the result when adding all the modes together. It remains to be seen if the overhead of down-sampling the interaction force once for every mode and up-sampling the resulting modal contribution is computationally more efficient than a straightforward computation at a fixed sampling rate. 8.2.2 Integrate Waveguide Models For linear structures such as strings and air-tubes (exhaust pipes!), waveguide models [67] provide a more efficient method to model resonance properties. The reason is that the resonances in those structures are linearly spaced and are much denser than in solid structures. These systems can be modeled with the modal techniques 123 pursued in this thesis, but this will be computationally expensive because each resonance takes a fixed percentage of the C P U . Waveguides provide a large set of harmonic resonances for the price of a single mode. Though the amplitudes and dampings can not be individually controlled as in modal synthesis, the performance gain is considerable. Extending the entire methodology presented here to incorporate waveguide models of vibrating structures is completely straightforward. 8.2.3 Improve Parameter Fitting How can the linear models be improved? The parameter fitting module produces a set of vibration models which depend on how several parameters such as the window size for the windowed Fourier transform, but there is no systematic or automatic way to chose the "best". As discussed in Section 4.6, this requires a measure on perceptual audio space that would allow us to determine if sound A is a better approximation to sound C than sound B . This appears to be a very difficult problem, which probably requires more understanding of audio perception. A less ambitious project would be to formally evaluate the quality of the models by psycho-acoustic experiments. It would be interesting to see if a more objective classification of physical objects with respect to their suitability to be modeled with modal vibration models can be obtained. On the technical level, as discussed in Section 5.2, several improvements and refinements of the parameter fitting algorithm could be attempted. 124 8.2.4 Improve Interaction Models The interaction models presented in Chapter 6 are all "first-cut" models and could potentially be improved. The one-parameter impulse model could be refined to include properties relating to the detailed shape of the force profile. The continuous contact model essentially assumes that the surface profile of the object is followed exactly during contact. In reality there are deformation, hysteresis effects, and contact is frequently broken when one object slides across another. Unfortunately the actual physics of contact is prohibitively complex, and it is not clear how to improve the model in an obvious and simple manner. We suggest investigation of a model where one object frequently "goes ballistic", i.e., at higher contact speeds the objects break contact frequently. Perhaps a simple inelastic collision model would be computationally tractable enough to refine the contact models driving scraping and sliding sounds. The engine excitation models were constructed in an ad-hoc manner, using simple-minded ideas about explosions and hissing sounds in a four stroke cylinder. Many different types of engines such as diesel engines, two stroke engines,- multicylinder configurations with accurately modeled firing sequences, could be modeled and investigated. 8.2.5 Non-linear Models The resonance models are strictly linear. Though non-linear effects can be obtained by coupling vibration models with feedback loops, objects in the real world do not behave in a linear fashion to various degrees. If you hit an object harder and harder, the resulting sounds are not related by just a volume factor. This is partly due to the interaction, which changes non-linearly, and partly due to the actual vibration 125 equation of the object, which is linear only in first order approximation. A general theory of non-linear vibration does not exist. In general, nonlinear phenomena are extremely complicated and no universal unified method for analysis exists [90]. Some work has been done on non-linear extensions to harmonic oscillators [59], but a physical motivation seems to be lacking. A n interesting observation is that many objects, such as glasses, behave approximately linearly when struck, but then suddenly change their behaviour completely: they break. The interesting thing to note is that after the breaking event we again have an approximately linear system of glass fragments that fall within the scope of the present method. After the breaking event we will have to model many objects and their collisions in order to correctly synthesize the sound of the glass fragments falling. Presumably, some stochastic method of generating the impulsive forces and the distribution of the sizes of the fragments could drive the synthesis. If we consider the breaking characteristic of the glass as part of the glass model, we can view this as a piecewise linear simulation. Before the glass breaks we use a linear model and after the glass break we have a (different) linear model. See also [30, 17, 86] for other work on this topic. This type of piecewise linear behaviour occurs very frequently in reality. For example, an old rattly car consists of many parts which collide against each other as a result of the oscillations of the different parts. This line of reasoning would lead to a generalization of the modal audio synthesis model where different components are allowed to collide and impulsive forces are generated internally by collisions. A simple example of a non-linear model of this kind is a constrained pendulum, depicted in Figure 8.1. The pendulum behaves approximately linearly as long as the amplitude of the oscillation is small, but as soon as the pendulum starts 126 Figure 8.1: Pendulum is non-linear when colliding with the walls. colliding with the walls, it is only piecewise linear. Such a constrained pendulum could perhaps be used as a generalization of the spring-damper systems on which the audio synthesis described in this thesis is built. 127 Bibliography [1] http://ccrma-www.stanford.edu. [2] http://mediatheque.ircam.fr/articles/index-e.html. [3] http://www.cnmat.berkeley.edu. [4] http://www.cs.ubc.ca/nest/lci/thesis/kvdoel. [5] http://www.engr.wisc.edu/epd/steuber/motorcycle.html. [6] http://www.ircam.fr. [7] http://www.neuroinformatik.ruhr-uni-bochum.de/ini/people/heja/sy- • prog/nodel.html. [8] MATLAB Reference Guide. The MathWorks, Inc., 1992. Computer [9] D . Baraff. Dynamic simulation of non-penetrating flexible bodies. Graphics, 26(2):303-308, 1992. [10] D . Baraff. Dynamic Simulation Cornell University, 1992. of Non-Penetrating Rigid [11] Durand R. Begault. 3-D Sound for Virtual and Multimedia. Reality Bodies. P h D thesis, Academic Press, London, 1994. [12] R. T . Beyer. Nonlinear Acoustics. Department of the Navy, Sea Systems Command, Washington, D . C , 1974. [13] W . G . Bickley and A . Talbot. Systems. An Introduction to the Theory of Vibrating Oxford University Press, London, 1961. [14] S. Braun and Y . M . Ram. Determination of structural modes via the prony model: System order and noise induced poles. J. Acoust. 1459, 1987. 128 Soc. Am., 81(5):1447- [15] Albert S. Bregman. Auditory Scene Analysis. M I T Press, London, 1990. [16] G . F . Carrier. On the non-linear vibration problem of the elastic string. Q. Appl. Math, 3:157-165, 1945. [17] Michael Anthony Casey. Auditory Basis Methods for Structured Group Audio. Theory with Applications to Statistical P h D thesis, Massachusetts Institute of Technology, 1998. [18] Antoine Chaigne and Vincent Doutaut. Numerical simulations of xylophones, i . time domain modeling of the vibrating bars. J. Acoust. Soc. Am., 101(1):539557, 1997. [19] A . W . T. Cleaver. The theory and Co., Oxford, 1976. of change ringing: an introduction. J . Hannon [20] Perry R. Cook. Physically informed sonic modeling (phism): Percussive synthesis. In Proceedings of the International Computer Music Conference, pages 228-231, Hong Kong, 1996. [21] Perry Raymond Cook. Identification Vocal Tract Model, with Applications of Control Parameters to the Synthesis in an of Singing. Articulatory P h D thesis, Stanford University, 1991. [22] J . Cremer and A . J . Stewart. The architecture of newton, a general-purpose dynamics simulator. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 1806 - 1811, 1989. [23] Baron Gaspard Riche de Prony. Essai experimental et analytique: sur les lois de la dilatabilite de fluides elastique et sur celles de la force expansive de la vapeur de Palkool, a differentes temperatures. Journal de VEcole Polytechnique, l(22):24-76, 1795. [24] P. Depalle and X . Rodet. Synthese additive par F F T inverse. Technical report, IRC A M , Paris, 1990. [25] James F . Doyle. An FFT-Based Verlag, New York, 1989. Spectral Analysis [26] N . I. Durlach and A . S. Mavor, editors. Virtual logical challenges. [27] Frank Fahy. Response. Methodology. Reality, Scientific Springer- and techno- National Academy Press, Washington, D . C , 1995. Sound and Structural Vibration. Academic Press, London, 1985. 129 Radiation, Transmission and [28] L . Fox, P. Henrici, and C . Moler. Approximations and bounds for eigenvalues of elliptical operators. SIAM J. Num. Analy., 4:89-102, 1967. [29] Daniel J . Fried. Auditory correates of perceived mallet hardness for a set of recorded percussive sound events. J. Acoust. Soc. Am., 87(1):311-321, 1990. [30] W . W . Gaver. What in the world do we hear?: A n ecological approach to auditory event perception. Ecological Psychology, 5 ( l ) : l - 2 9 , 1993. [31] James K . Hahn, Joe Geigel, Jong Won Lee, Larry Gritz, Tapio Takala, and Suneil Mishra. A n integrated approach to motion and sound. Journal of Visualization and Computer Animation, 6(2):109-123, 1992. [32] Donald E . Hall. Piano string excitation in the case of small hammer mass. J. Acoust. Soc. Am., 9(1):141-147, 1986. [33] Donald E . Hall. Piano string excitation II: General solution for a hard narrow hammer. J. Acoust. Soc. Am., 81(2):535-545, 1987. [34] Donald E . Hall. Piano string excitation III: General solution for a soft narrow hammer. J. Acoust. Soc. Am., 81(2):547-555, 1987. [35] H . L . F . Helmholtz. On the Sensations of Tone. Dover, New York, 1954. [36] David Jaffe. Ten criteria for evaluating synthesis and processing techniques. Computer Music Journal, 19(l):76-87, 1995. [37] David Javelosa. Sound and Music for Multimedia. Henry Holt & Company' Inc., New York, 1997. [38] Claes Johnson. Numerical solutions of partial differential equations element method. Cambridge University Press, Cambridge, 1987. [39] K . L . Johnson. Contact Mechanics. by the finite Cambridge University Press, Cambridge, 1985. [40] Ron Johnston and Allsopp Graham. An atlas of Bells. Blackwell Reference, Cambridge, Mass., 1990. [41] M a t t i Karjalainen and Julius Smith. Body modeling techniques for string instrument synthesis. In Proceedings of the International Computer Music Conference, pages 232-239, Hong Kong, 1996. [42] J . B . Keller. Impact with friction. Journal 1986. 130 of Applied Mechanics, 53(l):l-4, [43] Eric Krotkov and Roberta Klatzky. Robotic perception of material: Experiments with shape-invariant acoustic measures of material type. In Preprints of the Fourth International Symposium on Experimental Robotics, ISER '95, Stanford, California, 1995. [44] Debassis Kundu. Estimating the parameters of undamped exponential signals. Technometrics, 35(2):215-218, 1993. [45] Horace Lamb. The Dynamical Theory [46] Strutt Lord Rayleigh. The Theory of Sound. of Sound. Edward Arnol, London, 1910. Dover, New York, 1896. ' [47] Per L0tstedt. Numerical simulation of time-dependent contact and friction problems in rigid body mechanics. SIAM J. Sci. Stat. Comput., 5(2):370-393, 1984. [48] J . D . Markel and A . H . Gray. Linear New York, 1976. Preciction of Speech. Springer Verlag, [49] Robert J . McAulay and Thomas F . Quatieri. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acous., Speech, Signal Processing, ASSP-34(4):744-754, 1986. [50] John C . Middlebrooks and David M . Green. Sound localization by human listeners. Annu. Rev. Psychol, 42:135-159, 1991. [51] Brian C . J . Moore. An Introduction Press, London, 1986. to the Psychology of Hearing. 1 Academic [52] J . D . Morrison and J . - M . Adrien. Mosaic: A framework for modal synthesis. Computer Music Journal, 17(1), 1993. [53] P. M . Morse and K . U . Ingard. Theorectical Acoustics. Princeton University Press, Princeton, N J , 1986. [54] Philip Morse. Vibration and Sound. American Institute of Physics for the Acoustical Society of America, fourth edition, 1976. [55] D . E . Newland. Mechanical Vibration Analysis and Computation. Longman Scientific and Technical, New York, 1989. [56] M . R. Osborne and G . K . Smyth. A modified prony algorithm for fitting functions defined by difference equations. SIAM 382,1991. 131 J. Sci. Stat. Comput., 12(2):362- [57] T . W . Parks and C . S. Burrus. Digital Filter Design. John Wiley and Sons, Inc, New York, 1987. [58] Maurice Petyt. Introduction to Finite University Press, Cambridge, 1990. Element Vibration Analysis. Cambridge [59] John R. Pierce and Scott A . van Duyne. A passive nonlinear digital filter design which facilitates physics-based sound synthesis of highly nonlinear musical instruments. J. Acoust. Soc. Am., 101(2):1120-1122, 1997. [60] L . R. Rabiner, M . J . Cheng, A . E . Rosenberg, and C . A . McGonegal. A comparative performance study of several pitch detection algorithms. IEEE Trans, on Acoustics, Speech and Signal Processing, 24(5):399-418, 1976. [61] Clifford T . Mullis Richard A . Roberts. Wesley, Reading, Massachusetts, 1987. Digital Signal Processing. Addison- [62] M . J . Ross, H . L . Schaffer, A . Cohen, R. Freudberg, and H . J . Manley. Average maginitude difference function pitch extractor. IEEE Trans, on Acoustics, Speech and Signal Processing, 22(5):353-362, 1974. [63] Thomas D . Rossing, D . Scott Hampton, and Uwe J . Hansen. Music from oil drums: the acoustics of the steel pan. Physics Today, March:24-29, 1996. [64] William R . Savage, Edward L . Kottick, Thomas J . Hendrickson, and Ken-, neth D . Marshall. A i r and structural modes of the harpsichord. J. Acoust. Soc. Am., 91(4):2180-2189, 1992. [65] A . A . Shabana. Theory of Vibration, Volume I: An Introduction. Springer- Verlag, London, 1991. [66] A . A . Shabana. Systems. Theory of Vibration, Volume II: Discrete [67] J . O . Smith. Physical modeling using digital waveguides. Journal, Continuous Computer Music 16(4):75-87, 1992. [68] M . M . Sondhi. New methods of pitch extraction. IEEE Electro and Springer-Verlag, London, 1991. Acoustics, Trans, on Audio and 16(2):262^266, 1968. [69] William M . Steedly, Chinghui J . Ying, and Randolph L . Moses. Statistical analysis of tls-based prony techniques. Automatica, Special Issue on Statistical Processing and Control, 1994. 132 William M . Steedly, Chinghui J . Ying, and Randolph L . Moses. A modified tlsprony method using data decomation. IEEE Transactions on Signal Processing, 42(9):2292-2303,1995. Ken Steiglitz. Classic maximum entropy. In: Maximum Methods. Kluwer Academic, Norwell, M A , 1989. Ken Steiglitz. A Digital Audio and Computer Signal Music. Processing Primer Entropy and with Applications Bayesian to Digital Addison-Wesley, New York, 1996. R. W . B . Stephens. Acoustics lishers) L t d . , London, 1966. and Vibrational Physics. Edward Arnold (Pub- Adam Stettner and Donald P. Greenberg. Computer graphics visualization for acoustic simulation. Proc. SIGGRAPH'89, Computer Graphics, 23(3):195-206, 1989. Gilbert Strang. Press, 1986. Introduction to Applied mathematics. Wellesley-Cambridge Anatoli Stulov. Hysteretic model of the grand piano hammer felt. J. Soc. Am., 97(4):2577-2585, 1995. Acoust. Donald L . Sullivan, accurate frequency tracking of timpani spectral lines. J. Acoust. Soc. Am., 101(l):530-538, 1997. Frank W . Sinden Suresh Goyal, Elliot N . Pinson. Simulation of Dynamics of Interacting Rigid Bodies Including Friction I: General Problem and Contact Model. Engineering with Computers, 10:162-174, 1994. Frank W . Sinden Suresh Goyal, Elliot N . Pinson. Simulation of Dynamics of Interacting Rigid Bodies Including Friction II: Software System Design and Implementation. Engineering with Computers, 10:175-195, 1994. Tapio Takala and James Hahn. Sound rendering. Proc. SIG'GRAPH'92, Computer Graphics, ACM 26(2):211-220, 1992. Samuel Temkin. Elements of Acoustics. John Wiley and Sons, Inc., New York, 1981. Barry Truax. Real-time granular synthesis with a digital signal processor. puter Music Journal, 12(2):14-26, 1988. 133 Com- C. Ullrich and Dinesh K . Pai. Contact response maps for real time dynamic simulation. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 1019 - 1025, Leuven, May 1998. Presence, Kees van den Doel and Dinesh K . Pai. The sounds of physical shapes. 7(4):382-395, 1998. Richard F . Voss and John Clarke. 1/f noise in music: Music from 1/f noise. J. Acoust. Soc. Am., 63(l):258-263, 1978. R. M . Warren and R. R. Verbrugge. Auditory perception of breaking and bouncing events: A case study in ecological acoustics. Journal of Experimental Psychology: Human Perception and Performance, 10:704-712, 1984. C. A . Wert. Internal friction in solids. Journal of Applied Physics, 60(6):1888- 1895,1986. Bruce J . West and Michael Shlesinger. On the ubiquity of 1/f noise. Journal of Modern Physics B, 3(6):795-819, 1989. Inernational Bruce J . West and Michael Shlesinger. The noise in natural phenomena. ican Scientist, Amer- 78:40-45, 1990. G . B . Whitham. Linear and Nonlinear Waves. Wiley, New York, 1974. Richard P. Wildes and Whitman A . Richards. Recovering material properties from sound. In Whitman Richards, editor, Natural Computation, Cambridge, Massachusetts, 1988. The M I T Press. Wilfred G . Wilson. Change ringing on Richard Woodbridge. Acoustic Recordings from Antiquity. In Proceedings of church and hand the I.E.E.E., bells. ringing : the art and science of change Faber and Faber, London, 1965. pages 1465-1466, 1965. P. M . Zurek. The precedence effect and its possible role in the avoidance of interaural ambiguities. J. Acoust. Soc. Am., 134 67(3):952-964, 1980. Appendix A File Format for V i b r a t i o n Models We give the specification of the "sy" file format used by our applications to define a vibration model of an object. The lines with text are comments, indicating the meaning of the following data. The nactive_freq: field describes how many modes should be used by the synthesis software to render the sound. It should be less than or equal to n_freq: which is the number of modes stored. The n_points: field indicates how many discrete "locations" are available for the amplitudes a. These "locations" may correspond to actual geometric locations on the contact surface of the objects, or to directional parameters. The fields frequencyjscale:, damping_scale:, and amplitude_scale: multiply the following frequency, damping, and amplitude parameters by a constant scale factor. These fields allow a quick manual edit by shifting all frequencies, increasing the damping, or boosting the coupling strength of the model. After the frequencies: header field, the frequencies are listed in Hertz, after the 135 damping: header the damping coefficients (the factors d in the impulse response a e (-dt+2irift) Q f a n individual mode), and after the amplitudes[point][freq]: the amplitudes; first the amplitudes of the first "location", then the second location, etc. The end of file is indicated by the string E N D . Below we give a short example of an sy-file for an ideal string with coupling data for three locations. nactive_freq: 10 n_freq: 10 n_points: - 3 frequency_scale: 1.000000 damping_scale: 1.000000 amplitude_scale: 1.000000 frequencies: 440.000000 880.000000 1320.000000 1760.000000 2200.000000 2640.000000 3080.000000 3520.000000 3960.000000 136 4400.000000 dampings: 1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 amplitudes[point][freq] : 0.258819 0.250000 0.235702 0.216506 0.193185 0.166667 0.137989 0.108253 0.078567 0.050000 0.707107 0.500000 0.235702 0.000000 0.141421 137 0.166667 0.101015 0.000000 0.078567 0.100000 0.965926 0.250000 0.235702 0.216506 0.051764 0.166667 0.036974 0.108253 0.078567 0.050000 END Appendix B Parameter F i t t i n g D a t a In this section we present the results of the models obtained by parameter fitting for several objects. The corresponding sounds can be heard on the accompanying C D or on [4]. Each figure shows the linear fits overtime to the identified frequency peaks (in d B ) , the estimated frequencies in Hz, and the frequency response of the resulting model on two different scales. The sampling rate was 44100 Hz and we show the results of various size Fourier windows. The material constant plots show the ratio of frequency and damping for the first 100 modes (sorted by amplitude), which should be constant according to the simplest material model. 139 \ ^ N_ W \ V \ x~ "S* \ S v \ \ >— W ^_ \ S>» \ V- S** ^ V^, 3316 1507 1120 172 4436 474 4006 2412 3790 2584 7537 4737 4565 4264 1249 1981 61 IS 10724 9733 6288 5814 3058 8742 ' 7192 5383 4996 3445 15590 12575 11628 9087 8269 7321 6891 6632 1134 3402 2196 13652 12489 11757 10422 8053 7020 5512 3015 2239 18217 17786 17485 16710 16150 13135 12877 12705 12317 9647 9259 8355 7B81 5642 3833 2627 18906 14815 13351 13264 12059 11413 10982 10465 10034 9432 9130 8096 7752 684B 5943 5657 5211 5082 4608 43 , 15805 15719 15418 15159 14427 14384 14126 14040 13480 13006 11B43 11542 11370 11283 11240 1000 2000 3000 4000 Figure B . l : Top of vase with a window of 1024 140 5000 7000 6000 ^ ^ \ v^. \ S« \ 3424 3316 2606 1507 4436 2390 2003 1120 49 S 4005 x 2261 172 1270 301 6115 4177 3165 2067 5922 4587 3876 5383 4759 10573 9755 6223 5814 3790 2778 2649 11003 2175 4996 4242 8656 7558 7321 9109 8829 8742 7214 6998 6891 5663 5233 4350 3611 3704 3510 3208 3058 1034 10444 10034 9022 6355 8053 7773 7687 7515 6632 6611 6417 5986 5620 5534 5340 5146 5060 4845 4134 3639 3036 2326 2196 1680 1615 775 624 388 43 16710 11348 10939 10788 10702 10422 9668 8979 8506 8419 v^. v. V 689 Figure B.2: Top of vase with a window of 2048 141 \ \ \ ^ X. \ \ \ \ V \ x x \ \ \ 3790 3208 2229 2121 2046 1109 6105 5846 5028 4619 X X \ \ X \ \ 4242 3941 3908 3714 3499 3165 3058 3036 2595 1518 \ ^ ^ x 1270 301 97 65 7558 6611 6298 5922 5814 X. X x~ X 5060 4113 3650 3230 2778 2186 484 10444 9109 X 8829 8742 8355 6826 6643 6578 6492 6385 6072 5394 5243 5179 5136 4953 4037 3112 2175 1324 678 624 420 388 10993 10476 961S 8775 8269 8236 8053 7203 6891 6460 6223 5986 5717 5663 5599 5523 5426 7000 aoco X X X \ \ X V. X \ X X X \ X x \ X X ^ X- ^ X \ X V X X 4425 4005 3822 3758 3467 3424 3391 3316 2993 2649 2390 2261 2078 1992 163 4996 4576 4177 3973 3865 6686 Figure B.3: Top of vase with a window of 4096 142 502B 4177 4005 3978 3908 3871 3822 3795 375B 3424 3391 3079 3063 3031 2993 2600 2579 2390 2261 2186 2078 1997 1513 1114 7558 6616 6422 6298 6110 Figure B.4: Top of vase with a window of 8192 143 3467 10 20 30 40 50 n 60 70 80 90 100 Figure B.5: Top of vase with a window of 1024: Material constant 3000 25001- 2000 k Figure B.6: Top of vase with a window of 2048: Material constant 144 145 \. V- W X. V, s- V. >»- V V. W V •v. X* v. V. s» W ft* w *1 S V 2713 689 345 2369 2024 3747 1680 1034 7450 3058 4823 4479 5556 5943 3402 1335 5428 17270 4134 5211 6503 16279 9819 1652 7063 11154 8613 12B77 10508 15590 6675 7838 8226 43 9991 62B8 172 15805 6158 1895 16581 14212 13308 12532 6848 5771 11542 10250 9044 13738 11585 8398 3919 16624 10680 5814 12403 12274 9388 861 13824 6115 1378 16925 152B9 13049 12618 5728 2541 ' 1206 14901 14686 14556 11929 11499 10379 B18 16451 15935 15719 11843 10939 10637 10207 9561 6891 4091 3445 1163 474 18863 18734 18131 17743 15073 14858 13523 10810 10164 9690 8r— Figure B.8: Santur with a window of 1024 146 V 1012 3424 3079 2046 1357 2390 1873 1701 668 4479 2735 3747 4134 883 452 5211 1104 G5 5556 4414 14S4 1120 17270 1572 538 5426 4070 129 9819 4823 8613 7063 11S20 818 10508 5922 k. 16279 12B77 1H76 7838 603 15590 6675 1249 10659 v. 5103 258 12920 12855 11133 B226 5943 10487 4845 9970 9044 6288 1787 840 172 16258 16042 x. w. ^ 585 2713 2153 193B 1637 8247 7795 6848 6309 6115 5448 4113 3725 2821 2239 2175 495 237 13394 1443 775 22 Figure B.9: Santur with a window of 2048 147 \ \ K w K X - V 334 4479 2379 1012 678 3736 1357 4081 3424 7450 4124 3779 2035 1873 3068 484 1701 1120 452 V, s. fcv s. V. 2724 s. V "s. <** \ V. •»» X. V- \ \ "»» w S» s. V, i— \. 118 883 54 5932 5200 4425 1314 1249 592 183 B6 7074 5416 1572 1464 624 1647 1637 1443 1174 840 12886 10497 9819 8613 6492 5566 1195 947 549 538 420 301 258 ii 17270 11574 11176 10659 8236 5803 5556 4834 3391 1475 969 829 818 388 237 65 15579 11520 9830 5114 1916 1690 1184 1087 1066 894 775 743 506 269 22 16290 16268 15590 11133 7429 5760 5448 5426 5211 3079 2089 1755 1744 1669 7000 BOOO Figure B.10: Santur with a window of 4096 148 \ ^. \- \ \ IV V X \. k \ \ •\ K > \ v. \ \ \ \ \ \ \ V V V, S- v. \ \. s, \ \ \ V ^. X \ h X 3758 2708 2385 1696 678 3424 1674 1249 1233 1120 1017 883 834 4124 3052 1873 1448 1319 1303 624 452 178 92 59 7074 7052 5932 4829 4479 1626 1195 818 544 479 382 301 6498 5206 4420 1642 1475 1470 1044 974 904 592 275 „ 12866 10497 7429 5566 '5561 4129 3391 3047 2089 1647 1577 1211 1206 1179 1098 1039 969 899 700 635 608 436 210 188 167 156 32 5421 5416 5200 4064 1933 7000 8000 ¥ X \ V 312 264 258 237 215 16268 9819 8236 7849 6691 Figure B . l l : Santur with a window of 8192 149 X. N \ \ >v \ S. v., »«• *~ »m Km W ** >*» ** •* Sir. >» *r 1" ft* »» *~ m fV. 1766 1593 1421 661 689 517 345 172 2326 1938 1249 2110 2885 2670 2498 1034 3445 3058 3833 3230 3618 4608 5383 4221 5943 6158 3273 9776 4996 4760 4005 11025 5556 4436 4048 10207 8829 4953 15332 15246 6675 43 16322 13178 12059 9948 8613 6288 17528 t5676 11800 11197 10724 8226 8053 7838 7278 6417 3144 19078 17657 15073 14083 14040 13781 13264 8441 7537 7020 6546 6503 5211 5168 4393 18777 14815 14212 13910 11972 11671 10637 9862 9173 9001 B742 7795 7666 7623 7192 6804 6374 6331 6115 1077 18519 16882 16064 15590 14901 14772 »10* Hz Figure B.12: Piano with a window of 1024 150 8000 V \ X. \ \ Vv. tl V V. "n >•* > s» S. N *r M fa *k *t ^ « fa fa 1593 1227 883 345 172 1400 711 517 2864 3252 1938 105S 2498 65 2692 3833 3058 3639 4587 4027 9778 6137 4221 4996 4436 >* V* *•* 1766 2304 258 5383 5190 5577 2799 5556 4802 4242 1960 15892 14707 10573 9109 7795 6934 6503 5426 5082 3575 3553 3316 3122 2670 2067 16107 13738 12425 10357 10164 10056 9388 8786 8269 7106 6998 6891 6675 5900 5685 3984 3919 2218 1120 18691 17227 16042 15827 15676 15052 14750 14643 13910 13501 12985 12511 11951 11929 11779 10831 10336 9819 9755 9647 Figure B.13: Piano with a window of 2048 151 \ V, N s. N \ \ X. Sr v. X s. \ s W V. W> V* -> V "-r \ \ V, 231S \ s *• T- '—i *t ton V 1410 1227 700 528 172 1949 1583 872 355 3252 2638 2498 1055 3058 3833 3639 2B75 3445 65 4231 2078 6137 5383 1873 V- 4791 2864 301 9776 9765 5566 4996 4597 2455 118 54 22 5954 4393 3435 2175 1680 1130 s. v. s- M <-* ** 3>V| IN 1766 >• w« V. 2132 h> 441 398 280 248 129 108 11 7644 4985 3682 3564 3305 2379 2272 1303 1195 1044 947 592 463 420 291 269 226 97 32 16645 12705 11326 11197 1092B 10810 10250 9991 9841 9033 6557 6514 6180 5792 4436 4425 4188 4081 3962 Figure B.14: Piano with a window of 4096 152 \ X X N \ X x X x X \ x X x. X X V x TN "•>. W. H~ V< V- A* 700 350 3252 2498 1233 1039 528 9771 3440 3058 4592 3833 2869 2681 1949 1410 172 59 2638 2132 2309 1879 140 4990 2315 371 318 226 124 70 4430 4027 3634 X S 2482 2277 1750 1308 1249 910 565 501 312 X 108 38 ii 9528 6508 5572 6383 5184 4791 s~ 3558 2126 2105 2078 2073 1620 1292 1184 1168 743 732 554 463 425 393 328 291 280 269 248 242 221 215 210 205 129 97 75 43 22 5 17636 16511 16145 15832 14713 w- S X. 877 X X, X. 1588 x v. X 1766 x x- M 7000 Figure B.15: Piano with a window of 8192 153 8000 v. »l y >i V \ \ V i. *« k, s \ M f h ^ v, \ V \ V s V, \. \ \ \ v, h k, \ \ \. v. \. v. \ V \ 1. V, \ \ V V ll v. \ \ s k 0 \ \ > >» \ V. i \ 20586 18519 10680 7149 6115 4048 3790 1550 1895 21792 1292 517 16710 16193 16107 15504 15246 14987 14384 14040 13867 12834 9905 8958 8527 6891 6029 5512 5254 3101 2929 2670 1723 861 21275 21016 19638 19466 18863 17916 14729 14212 13954 13178 12575 12489 11972 11886 11628 ,„„ 10250 10164 9647 9130 20930 8096 5599 3445 2412 2326 1378 775 21533 21447 V. 2075B 20413 19552 19380 18949 18002 17140 17054 16882 16796 \ 16538 16451 15935 15848 15676 15332 14643 13695 13351 13006 6000 7000 o.z 000 2000 3000 4000 5000 Figure B.16: Computer tower with a window of 512 154 8000 155 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 1895 7429 5146 4974 2390 538 474 280 7257 6546 6115 6051 5857 5642 5469 5383 5254 5190 4845 4694 4479 4199 3854 3704 3575 3316 3230 2993 2713 2649 2498 1960 1680 1615 1357 1055 818 775 689 8204 7687 7192 7149 7041 6891 6804 6740 6266 6094 5965 5922 5900 5749 5728 5663 5577 5534 5340 5297 5103 4931 4910 4780 4759 4716 4651 4587 4565 4393 \ 4350 4221 4156 4113 4091 4005 3790 3747 3639 3510 \ 3424 3402 3338 3187 3165 3122 3101 3036 2842 2821 \ \ \ \ V \ \ V \ \ V \ .V 2907 \ \ \ \ \ \ \ \ V \ \ \ \ V \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Figure B.18: Computer tower with a window of 2048 156 2B64 1680 614 5146 4963 4920 4587 4091 3790 3338 3316 3090 2896 2832 2659 2616 2541 2487 2444 2412 2304 2239 2207 1884 1830 1615 1540 1497 1475 1357 1281 1087 1012 528 420 398 248 10336 10271 10142 10088 9442 7978 7644 7429 7418 7257 6331 6105 6094 6062 6051 5243 807 11940 5329 5297 5276 5254 4845 4802 4791 4716 10034 9905 9701 6740 6621 6546 6492 5965 5911 5846 5545 5437 5233 5179 5103 4974 4877 4651 4630 4554 4522 4479 Figure B.19: Computer tower with a window of 4096 157 11402 1500 10 20 30 40 50 n 60 70 80 90 100 Figure B.20: Computer tower with a window of 512: Material constant 1500 r 1 1 1 1 1 1 1 1 r 10 20 30 40 50 n 60 70 80 90 1000^ 500 k 0 100 Figure B.21: Computer tower with a window of 2048: Material constant 158 159 g 4000 Figure B.23: Bell with a window of 1024: Material constant Figure B.24: Bell with a window of 2048: Material constant 160 g 4000y Figure B.25: Bell with a window of 4096: Material constant 2 4000 Figure B.26: Bell with a window of 8192: Material constant 161 Appendix C Audio Synthesis Techniques for Music A comprehensive overview of musical audio synthesis techniques is given by Herbert JanBen, on the W W W site [7], from which the following material is compiled. • Additive Synthesis, Fourier Synthesis Any sound, however complex it may be, can be described as a mixture of a number of sine wave components with different phases and amplitudes. These are the partials of a sound, which are also called harmonics if their frequencies are an integer multiple of the fundamental frequency. The method to generate a complex sound spectrum as the sum of (many) simple sine waves is called Fourier synthesis, after Joseph Fourier who found its mathematical basis. The more general term additive synthesis can also be used if the waveforms added are not sine waves. Ideally, a lot of sine oscillators are needed for Fourier synthesis. How many 162 depends on the required range and brightness: a bright bass note, think of a slap bass, may need more than hundred, while a high pitched harmonic sound will probably need only a dozen. For dynamical sounds and expressive play of an additive synth very many parameters are needed: ideally, each oscillator should have its own amplitude envelope, pitch envelope, velocity sensitivity and modulation routing. Although this may sound like the hardware is the limiting factor, the usability is even more so. Most of the many parameters have only little influence on the sound and generally it is very hard to estimate how the spectrum of a desired sound looks like. Thus the simulation of acoustic instruments seems to be impossible without appropriate analysis hardware and software. • Subtractive Synthesis This is just the classical method of synthesis used in most analog synths and in most sample playback synths and samplers. Subtractive synthesis means that you take a sound (preferably a spectral rich one like a sawtooth or a square/pulse wave, or a sample of a grand piano) and route it trough a modulatable filter and amplifier to change its timbre. This way you reduce the level of some partials of the original spectrum and hence the term. The terminology is a bit fuzzy for the real world, since almost any synth uses filters. The general usage of the term tends to refer to the classical "oscillator/filter/amplifier" trinity though. What is nice about subtractive synthesis is that by selecting the oscillator waveform or sample, the basic timbre is rather well determined and the usual filter and amplifier parameters allow to effectively tweak it to make it brighter, 163 duller, more percussive etc. The main problem with subtractive synthesis is that its tools are rather bold: the oscillator waveforms or samples have a distinct character that is hard to overcome and the usual filters do not allow for very subtle changes. • Analog Synthesis Analog Synthesis is not really a synthesis method, but rather a hardware issue: analog synths use analog instead of digital electronics to create their sounds. What most of them can do in terms of synthesis methods is quite simple subtractive synthesis. However some more advanced machines and especially modular synths may use a great variety of synthesis methods including F M , wave shaping, vector synthesis and others. Analog synths are considered "cool" nowadays, because of their supposedly "warm" and "fat" sound. So what is so special about analog synthesis then? First, most analog synths have an extensive user interface with dedicated knobs and switches for every function. This gives a very intuitive access and results in instant gratification for sound tweaking. This is possible because typical analog synths offer a limited number of control parameters, but those parameters are highly effective. Second, rather simple analog circuitry can perform the functions needed for subtractive synthesis rather well: the resulting sound will be artificial but with slight variations and instability. Thus, the sound quality is lively in a way similar to acoustic instruments. On the other hand most digital synths compromise in sound quality to achieve the high number of voices many buyers seem to be fond of today. 164 Modular analog synths have the additional advantage that there is no distinction between audio and control signals. Everything is just a control voltage which results in a vast number of patching possibilities. These beasts are very rare and pricy nowadays, but are probably still the best way to learn about synthesizing sounds, and can be used as a musical instrument of exceptional power. • Sample Playback ( P C M , A W M , A W M 2 , A I , ...) This is a form of subtractive synthesis that is also called P C M (Pulse Code Modulation), A W M (Advanced Wave Memory), A W M 2 (Advanced Wave.Memory Version 2) A I (?) by manufacturers. Usually all those terms refer to basically the same thing: A n audio signal e.g. a miked acoustic instrument or an electrical or electronical instrument is sampled (digitized) and the recording is stored in R A M or R O M . If a device is able to sample and store the result in R A M or to disk, it is called a sampler. A device that can playback samples (from R A M , R O M or disk) at different pitches is called a sample playback synth. Most samplers and sample-based synths use subtractive synthesis although there are some samplers and synths that offer only very limited processing. The term P C M refers to the coding technique which is used in virtually all digital instruments. The terms A W M and A W M 2 are Yamaha marketing slang for 16-bit sample playback and 16-bit sample playback with filters. Korg uses the term A I synthesis for their M l , which is just another sample playback synth, synthesis wise. Sample playback is what has made synths realistic sounding. On the other 165 hand sampling per se offers less options for expressive play than almost any other synthesis scheme. Samples that have a lot of inherent character or are easily recognized as an acoustic instrument are hard to shape, so filters and other processing options will merely adjust the timbre of the sample. There are however unique possibilities in sample synthesis, but these are often not implemented in commercial synths. I'm talking about modulation of the sample playback parameters itself: extreme transposition of multisamples, modulation of sample start and sample loop length, multiple sample loops and more. Among others, various Ensoniq and Emu synths and samplers are capable of some of these. On many synths (including the SYs) even transposition (the changing of sample playback rate) can only be achieved with the trick of a constantly biased pitch envelope. • Frequency Modulation ( F M , A F M ) Frequency Modulation is usually abbreviated F M or A F M (for Advanced Frequency Modulation). This is the family of synthesis methods that brought a breakthrough for commercial digital instruments in the eighties. Basically it means that you control the frequency of an audio oscillator by the frequency of another audio oscillator. The interesting aspect sound-wise is that you can generate a very wide variety of spectra plus many transient sound characteristics with F M (and not only the never ending variations of electric pianos and bells). F M was "invented" by John M . Chowning at Stanford and used in the academic computer music scene long before Yamaha marketed it. The commercial Yamaha implementation introduced some restrictions, but also some useful ex- 166 tensions like feedback. F M exists in many different flavours: some analog synths resp. digital/analog hybrids are able to do a very basic F M . But F M relies mainly on the frequency ratios of the oscillators involved, and therefore requires very high tuning stability. Also F M becomes a versatile synthesis technique only if you have multiple oscillators with multiple envelopes to control their amplitude which results in a big number of components/modules needed in the analog realm. Maybe this is why it was and is not as popular with analog synths. Yamahas digital F M implementations use custom chips to reduce cost. In case of digital F M , there are also many variants: depending on the number of oscillators (minimum is 2, most synths use 4 or 6, I recall to have heard of 10 in some Yamaha organs), whether there is a real envelope per oscillator (some very simple Yamaha sound chips like the one used in the old Atari STs miss them) and of course how variable the routing between the oscillators is (number of "algorithms, modulation and feedback paths). • R C M (Real-Time Convolution and Modulation) Synthesis This is another marketing term by Yamaha and is mainly an extension to F M . The background is that you use a whole AWM2-element as modulator input for an AFM-operator, which also means that you can apply the filter on the sample before you put it through the FM-process. The former fact may be used to motivate the term modulation, the latter the term convolution (one possible algorithm for a filter is the convolution of the signal with a kernel). In my opinion the term R C M is ill defined and misleading. In lack of a better term, I will use R C M to denote the capability to use feed the A W M section 167 into the A F M section and vice versa. Phase Distortion (PD) and Interactive Phase Distortion (iPD) The terms phase distortion and interactive phase distortion were used by Casio for their synths (CZ and V Z series). Actually the two methods seem to be very different. The phase distortion synths (the C Z series) offer eight basic waveforms (saw, pulse, resonance, etc) . 1 Each of them can be morphed continuously from and to a sine wave via an eight-stage envelope, thus emulating the use of a low-pass filter on a basic analog synthesizer. Another two envelopes are used for pitch and amplitude and for two-oscillator patches a ring modulator can be used. Wave Shaping Wave shaping refers to a sound manipulation (not generation) technique which applies a (nonlinear) function on the original signal (i.e. the output of an oscillator). This scheme is similar in principle to analog distortion in a guitar amp or fuzz unit, but offers much more sound variation possibilities including resonance-like effects. Wave shaping can be used as an advanced synthesis method in a way similar to F M . L A (Linear Arithmetic) Synthesis This buzzword was used by Roland to describe their approach to digital sound synthesis in the eighties. It is based on the observation that the attack transient of a sound is its most important part with respect to human perception. Therefore the LA-synths (D-50, M T 32 and others, but most notably not the D-70) used a combination of sampled attack transients and simple digital oscillators with only sawtooth and pulse waveforms to generate the sustained part of the sound. 168 Ring Modulation, Amplitude Modulation Ring modulation and amplitude modulation are not complete synthesis methods, but rather processing techniques that are quite common on advanced analog and digital synths. Sometimes these features are wrongly named or used when the actual implementation is quite different. Manufacturer specific terminology for similar schemes includes the terms cross modulation and F X M (frequency cross modulation). Ring modulation is the multiplication of two signals. The output of a ring modulator will contain the sum and the difference of all available input frequency pairs. Amplitude modulation is the multiplication of two signals, where one signal is always positive. Vector Synthesis The first synth to implement this paradigm was the SCI Prophet V S . The V S can mix four oscillators with different waveforms in real-time via a joystick controller and a multistage envelope. While this is a really simple concept, it is effective for expressive play and nice evolving sounds. The Korg Wavestations and the Yamaha SY22, TG33 and SY35 are other "vectorized" synths. The Yamahas can mix up to two F M and two sample elements, while the Wavestations mix up to four sample based wave sequencing oscillators. In principle most synths can do real-time vector synthesis, when fed with M I D I joystick data to cross-fade oscillators. If you like to try that you can rewire a P C game joystick to fit your synths pedal jacks. 169 Wave Table Synthesis This term is used for two completely different things: Many sound card companies call their R A M based sample playback capabilities like this (because the samples are stored in a table in R A M ) . For the P P G Wave and the Waldorf Microwave and Wave synths this term is used to describe the ability to produce a sound by sequencing through a table of different waveforms during the duration of a single note. For the wave tables and waves there is a preset R O M area as well as a user loadable R A M area provided. Which entry of the wave table is selected may be controlled by an envelope, L F O or any other modulation source in real-time. Also these synths can interpolate between subsequent waveforms in the wave table thus smoothing the timbral change. The waveforms are single cycle ones, so realistic acoustic emulations are out of reach for this technique, but the vastly improved •' modulation capabilities, compared to sample playback, more than make up for this. Wave Sequencing This term means that a sequence of different sample segments can be used to generate a sound. Korg implemented this on their famous Wavestation synths. The Wavestations oscillators can sequence through programmable patterns of samples. Each of the patterns consists of a number of individually tunable sample snippets and each sample in the sequence is assigned its own level and duration. Typical for the Wavestation (and rather easy to program) are "rhythmic" wave sequences in which an oscillator steps through a number of samples in a predefined periodic rhythm. The Wavestations also combine this 170 with vector synthesis capabilities. Granular Synthesis Granular Synthesis means sequencing through very many very short sound (sample) snippets. The difference to wave sequencing is that the single samples are played for such a short time, that the sequencing is heard more as a timbre than as a rhythm. Granular synthesis has been developed in the academic computer music scene and has not found its way into commercial products so far. Physical Modeling Synthesis Physical modeling (PM) is a whole class of synthesis methods that do not synthesize sound based on an abstract mathematical description, like the Fourier transform for additive synthesis or by classical signal processing means like filtering for subtractive synthesis, but rather tries to model the diverse instruments themselves: e.g. the bow, string and resonance corpus of a cello or the plucking finger, string and body for an acoustical guitar. There are many different physical modeling algorithms including relatively simple ones like Karplus-Strong synthesis and rather complex ones like the waveguide [67] approach which uses multiple delay lines (comb filters) to model strings or air columns. The biggest advantage of physical modeling is the real-time control it offers. While other synthesis methods offer some algorithm specific and rather arbitrary control parameters like filter cutoff or modulation index, physical modeling enables the use of control parameters that are more musical and have a more complex influence on the timbre and phrasing. 171 Examples for such parameters are embouchure or tonguing. Another, advantage of P M is that the sound generation is context sensitive: a note on a clarinet model will sound different if it is played with legato binding to its predecessor or with a little pause in between. The dependency is much more complex than with the traditional synthesizer portamento or glide function. Another example: the pitch bend of a clarinet patch will not just linearly shift the frequency of the note, but the synth will respond in a similar way to a real clarinet, i.e., it will shift the frequency and timbre for a while but then jump to the octave. P M has its disadvantages though: Instrument models have to be designed with great care and a lot of knowledge of both instrument acoustics and the necessary math. On the available P M instruments by Korg and Yamaha, editing is only possible via macro parameters of the otherwise hard coded instrument models, probably the only practicable way to provide sound programming to the "normal" user. • Karplus-Strong Synthesis This synthesis method uses a percussive sound, like a noise burst or a single pulse, which excites a delay unit with feedback. If the feedback is high enough (90-99%) an exponentially decaying sound with definite pitch will result. It is the delay time that determines the pitch in this case. To be exact the delay time equals the period of the resulting periodical wave. This synthesis method is particularly well suited for emulating plucked strings and other percussive harmonic sounds. To make the decay more realistic, one can include a low-pass filter in the feedback path, so that higher harmonics are damped faster. 172 Karplus-Strong synthesis is actually a very simple form of physical modeling and shares the most important physical modeling advantage: since the perr cussive sound acts as an exciter for the delay loop that produces the harmonic sound, one can change the plucking/fingering of the model-led string by chang- > ing the percussive sound, and change the string by controlling the delay line parameters. • Modal Synthesis This synthesis techniques uses a bank of resonators with individually adjustable frequencies and dampings, which is driven by some input signal. Non-harmonic percussive instruments such as bells and marimba have been successfully modeled with this technique [20]. For general musical instrument synthesis this technique has the disadvantage that a resonator is needed for every partial, and the partials are spaced linearly for most musical instruments, with the exception of non-harmonic percussive instruments like bells. The many resonators needed leads to a large computational cost, which can be avoided by using waveguides or comb filters, which by themselves already have a complete harmonic spectrum and are therefore better building blocks. 173
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Sound synthesis for virtual reality and computer games
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Sound synthesis for virtual reality and computer games Doel, Cornelis Pieter van den 1998
pdf
Page Metadata
Item Metadata
Title | Sound synthesis for virtual reality and computer games |
Creator |
Doel, Cornelis Pieter van den |
Date Issued | 1998 |
Description | The synthesis of audio in real-time computer simulations is investigated. A physics based parameterized vibration model for physical objects is constructed, and a realtime synthesis algorithm is developed which allows the synthesis of the sound made by such objects under any kind of interaction force. Methods for obtaining the parameters of such models are investigated. We study mathematical models with simple geometries, parameter fitting to measured data, and empirical models. Models for interaction forces occurring during contacts between rigid bodies such as impact and sliding interactions are developed, as well as models for the driving forces for combustion engines and avalanches. Studies were conducted of several objects which were successfully modeled with these techniques. Several computer programs were written for the testing of models, for the construction of models, and for the demonstration of the level of realism that can be achieved with this type of synthesis. It is concluded that this type of synthesis can generate realistic, interactive audio using only a small fraction of available CPU power on modern personal computers. |
Extent | 11598899 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-07-03 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051161 |
URI | http://hdl.handle.net/2429/10055 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1999-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_1999-389936.pdf [ 11.06MB ]
- Metadata
- JSON: 831-1.0051161.json
- JSON-LD: 831-1.0051161-ld.json
- RDF/XML (Pretty): 831-1.0051161-rdf.xml
- RDF/JSON: 831-1.0051161-rdf.json
- Turtle: 831-1.0051161-turtle.txt
- N-Triples: 831-1.0051161-rdf-ntriples.txt
- Original Record: 831-1.0051161-source.json
- Full Text
- 831-1.0051161-fulltext.txt
- Citation
- 831-1.0051161.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051161/manifest