Open Collections

UBC Theses and Dissertations

Design of two psychoacoustic models for real time implementation of a wideband audio codec Koch, Anthony C. 1993

DESIGN of TWO PSYCHOACOUSTIC MODELS for REAL TIME IMPLEMENTATION of a WIDEBAND AUDIO CODEC

by

Anthony C. Koch

B. A. Sc. (Electrical Engineering), The University of Waterloo, 1991

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE

in

THE FACULTY OF GRADUATE STUDIES
DEPARTMENT OF ELECTRICAL ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
October 1993
© Anthony C. Koch, 1993

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
Vancouver, Canada

Abstract

Implementation of an audio codec (coder/decoder) is sought which meets a new standard proposed by the ISO/MPEG (International Standards Organization / Moving Pictures Experts Group) committee. The standard aims to encode wideband audio signals, achieving data compression by employing psychoacoustic modelling. By exploiting the properties of the human auditory system, psychoacoustic modelling shapes quantization noise spectrally to render it inaudible.
This thesis provides the design of two new psychoacoustic models used to effect the real time implementation of the new ISO/MPEG audio codec on an existing hardware platform at MPR Teltech, a subsidiary of the British Columbia Telephone Co.

The two new models are named: spreading function model and attenuation model. Both models overcome real time implementation problems of existing psychoacoustic models found in the literature by introducing new methods of calculating the global masking threshold and signal to mask ratios. Listening tests and simulation analyses show that while both models attain very high audio coding quality, the attenuation model is superior to the spreading function model, and is considered for implementation in the new ISO/MPEG codec on the existing hardware platform.

TABLE of CONTENTS

Abstract
List of Tables
List of Figures
Acknowledgment
Chapter 1 Introduction
    Section 1 Background: Audio Coding Techniques
    Section 2 Applications
    Section 3 Thesis Outline
Chapter 2 Software and Hardware Development Platforms
    Section 1 Software Development Platform
    Section 2 Hardware Development Platform
Chapter 3 Real Time Modelling of Psychoacoustic Principles
    Section 1 Critical Bands and the Bark Scale
        Topic 1 Critical Bands
        Topic 2 Bark Scale
        Topic 3 Spectral Warping into Bark Scale
    Section 2 Spectral Masking of Single Targets
        Topic 1 Inter-Band Masking: Spreading Function
        Topic 2 Intra-Band Masking
    Section 3 Spectral Masking of Multiple Targets
        Topic 1 Spreading Function Modelling
        Topic 2 Attenuation Modelling
    Section 4 Absolute Threshold of Hearing
    Section 5 Summary
Chapter 4 Multirate Signal Decomposition
    Section 1 Multirate Filter Banks
    Section 2 Lapped Transforms
    Section 3 Block Transforms
Chapter 5 Implementation of Two Psychoacoustic Models
    Section 1 Global Masking Threshold and Signal to Mask Ratios (SMR)
    Section 2 Spreading Function Model Algorithm
        Topic 1 Calculation of Global Masking Threshold
        Topic 2 Calculation of SMR
    Section 3 Attenuation Model Algorithm
        Topic 1 Calculation of Global Masking Threshold
        Topic 2 Calculation of SMR
    Section 4 Summary
Chapter 6 Experiments and Results
    Section 1 Meeting Real Time Constraints
    Section 2 Listening Tests
        Topic 1 Audio Test Material
        Topic 2 Apparatus
        Topic 3 Methodology
        Topic 4 Results
    Section 3 Analyses
Chapter 7 Summary and Suggestions
    Section 1 Summary
    Section 2 Suggestions
BIBLIOGRAPHY

List of Tables

Table 1.1 Coding Techniques: From Speech to Wideband Audio
Table 1.2 Wideband Audio Applications and Bit Rates
Table 3.1 Critical Band Bandwidths
Table 5.1 Differences Between Proposed Model and New Attenuation Model
Table 5.2 Attenuation Model Algorithm Description: Determination of Global Masking Threshold
Table 6.1 EBU Recommended Audio Programme Test Material
Table 6.2 Descriptions of Encoded Signal Impairments Observed by Some Listeners

List of Figures

Figure 2.1 ISO/MPEG Audio Codec Software Development Platform
Figure 2.2 ISO/MPEG Audio Codec Hardware Development Platform
Figure 2.3 Audio Codec Encoder and Decoder Cards
Figure 3.1 Experimental Determination of Critical Bands
Figure 3.2 Masking of Pure Tones by Narrow Band Noise
Figure 3.3 Approximated Spreading Function of Inter-Band Masking: Centered at Critical Band 9
Figure 3.4 Global Masking Threshold Calculation: Spreading Function Modelling
Figure 3.5 Global Masking Threshold Calculation: Attenuation Modelling
Figure 3.6 Absolute Threshold of Hearing
Figure 4.1 Critically Sampled Multirate Signal Decomposition Types
Figure 5.1 ISO/MPEG Audio Encoder Software Platform
Figure 5.2 Comparison of ISO/MPEG Encoder with Paillard Encoder
Figure 5.3 Comparison of Filter Responses of P-QMF and MLT
Figure 5.4 Attenuation Model: Determination of Global Masking Threshold
Figure 6.1 DSP Partitioning of Tasks for Audio Encoder Implementation
Figure 6.2 Real Time Encoder DSP0 CPU Usage
Figure 6.3 Comparison of DSP1 CPU Usage: Attenuation Model and Spreading Function Model
Figure 6.4 Listening Test Setup and Environment
Figure 6.5 Listening Test Methodology
Figure 6.6 MOS Scores: Attenuation Model
Figure 6.7 MOS Scores: Spreading Function Model
Figure 6.8 Coding Quality Comparison of the Attenuation Model with the Spreading Function Model
Figure 6.9 Comparison of Attenuation Model with Spreading Function Model Quantization Noise Levels for Harpsichord

Acknowledgment

I would like to thank my supervisor, Dr. M. P. Beddoes, for his helpful suggestions and fruitful discussions, as well as for his invaluable assistance in preparing this thesis. Thanks to Tim Woinoski for his very giving attitude and his persistence in resolving several codec implementation issues. Thanks to Dr. Rong Peng for his technical support. The team efforts of both have resulted in the final operating product: an ISO/MPEG codec. Thanks to Rick Beaton who had the foresight to bring the whole project together and vision to see its applications. Thanks to Albert Chau for his "golden ears" and technical consultations. Thanks to Bruno Paillard for his careful explanations of his psychoacoustic model.

Very special thanks to my great father for his wise words of guidance and excellent advice as to the focus and presentation of this thesis. Also thanks to my mother for her caring and uplifting spirits.
Special thanks to my brother for indirectly motivating me by working so diligently to prepare for his final accounting exams.

I also wish to acknowledge financial support from the Science Council of British Columbia, and equipment support from MPR Teltech.

Chapter 1 Introduction

A new audio codec standard has been proposed by the ISO/MPEG (International Standards Organization / Moving Pictures Experts Group) committee. The standard aims to encode wideband audio signals, achieving data compression by employing psychoacoustic modelling. By exploiting the properties of the human auditory system, psychoacoustic modelling shapes quantization noise spectrally to render it inaudible. Real time implementation of the ISO/MPEG codec standard requires a hardware platform. At MPR Teltech, a subsidiary of the British Columbia Telephone Co., a hardware platform exists for general signal processing applications. Our objective is to effect the real time implementation of the ISO/MPEG codec on the existing hardware platform.

Because the ISO/MPEG codec employs psychoacoustic modelling which is open to the user's choice, one must find a suitable psychoacoustic model. Psychoacoustic models found in the literature lead to considerable computational complexity, making real time implementation infeasible on the existing hardware platform. Furthermore, some models are ill-suited for direct use in the ISO/MPEG standard. By making modifications and simplifications to models found in the literature, this thesis designs two new psychoacoustic models that overcome the two problems described above. The result of the present work is that real time implementation of the ISO/MPEG codec is realized on the existing hardware platform. The two models are named: spreading function model and attenuation model. Both models introduce new methods of calculating the global masking threshold and signal to mask ratios.
Listening test results show a high quality of audio coding.

In the next two Sections, some background and applications of the two psychoacoustic models are described; and in the last Section, the thesis outline is presented.

1.1 Background: Audio Coding Techniques

Over the last few decades, different audio coding techniques have evolved. Four typical examples of these techniques are summarized and compared in Table 1.1. Low complexity coding techniques 1) are limited to encoding only narrow signal bandwidths and 2) attain low data compression ratios. At the cost of increased coding complexity, the benefits attained are coding of the full audio spectrum, 20 Hz to 20 kHz, and attaining a large data compression ratio of 6 to 1. In the following paragraphs, the four audio coding techniques listed in Table 1.1, ranging from very low to high complexity, are described.

Log-companding PCM coding is commercially available and is characterized by zero delay and very low complexity. This time domain technique encodes speech signals bandlimited to 4 kHz at a rate of 8 bits per sample [Jayant and Noll 1984]. An example of such a coding technique is CCITT (English translation: International Telegraph and Telephone Consultative Committee) 24-channel μ-law. No data compression is achieved.

ADPCM coding encodes 4 kHz bandlimited speech signals by removing input signal redundancies in the time domain. At the expense of increased coding complexity and delay, toll quality speech is attained at 32 kbps, or 4 bits per sample. An example of ADPCM coding is specified in recommendation CCITT G.721 [Jayant and Noll 1984]. ADPCM coding attains a data compression ratio of 2 to 1.

Two-Band ADPCM QMF coding employs a two-band subband ADPCM encoder, implemented via a quadrature mirror filter, or QMF (see Chapter 4).
An example of this coding technique is Recommendation CCITT G.722, which encodes 14-bit PCM speech at 4 bits per sample [Smyth, McCanny and Challenger 1979]. This Recommendation represents a gain over ADPCM coding since the coded signal bandwidth increases from 4 kHz to 7 kHz. It is noted that both time and frequency domain techniques are used in this coding method. A data compression ratio of 3.5 to 1 is attained.

Table 1.1 Coding Techniques: From Speech to Wideband Audio
[Table 1.1 is synthesized from the references found in Section 1.1.]

Audio Coding Technique             Signal      Sampling                  Input    Coded Output    Data          Data Rate    Coding
                                   Bandwidth   Freq.                     Bits/    Bit Rate        Compression   (per mono    Complexity
                                                                         Sample   (bits/sample)   Ratio         channel)
Column Label                       (1)         (2)                       (3)      (4)             (5)           (6)          (7)
Calculation Method                 given       given                     given    (6)/(2)         (2)*(3)/(6)   given

Log-companding PCM                 4 kHz       8 kHz                     8        8               1 : 1         64 kbps      very low
(e.g. CCITT μ-law)
ADPCM                              4 kHz       8 kHz                     8        4               2 : 1         32 kbps      low
(e.g. CCITT G.721)
2-band ADPCM QMF                   7 kHz       16 kHz                    14       4               3.5 : 1       64 kbps      medium
(e.g. CCITT G.722)
Psychoacoustically based           20 kHz      44.1 kHz (CD quality)     16       4               4 : 1         192 kbps     high
subband coding                                 or
(e.g. ISO/MPEG standard)                       48 kHz (Studio quality)   16       2.7             6 : 1         128 kbps     high

Psychoacoustically based subband coding is a frequency domain technique which aims to encode wideband audio signals having bandwidth of 20 kHz. In contrast to the three previous lower complexity coding techniques, this technique employs high complexity multi-band subband coding and psychoacoustic modelling. Data compression is achieved by modelling the properties of human perception. Using psychoacoustic modelling, the coefficients of the subbands are quantized so that the resultant quantization noise is spectrally shaped to be inaudible. The cost of employing subband coding and psychoacoustic modelling is high computational complexity.
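The two derived columns of Table 1.1 follow from the given ones: column (4) is the data rate divided by the sampling frequency, and column (5) is the uncompressed rate (sampling frequency times input bits per sample) divided by the data rate. The sketch below recomputes them from the table's given values; the function and variable names are illustrative and do not come from the thesis.

```python
# Given columns of Table 1.1:
# (technique, sampling frequency in kHz, input bits/sample, data rate in kbps).
# The row labels are shortened here for illustration.
ROWS = [
    ("Log-companding PCM", 8.0, 8, 64.0),
    ("ADPCM (CCITT G.721)", 8.0, 8, 32.0),
    ("2-band ADPCM QMF (CCITT G.722)", 16.0, 14, 64.0),
    ("ISO/MPEG, CD quality", 44.1, 16, 192.0),
    ("ISO/MPEG, studio quality", 48.0, 16, 128.0),
]


def coded_bits_per_sample(fs_khz, rate_kbps):
    """Column (4) = (6) / (2): coded output bits per input sample."""
    return rate_kbps / fs_khz


def compression_ratio(fs_khz, input_bits, rate_kbps):
    """Column (5) = (2) * (3) / (6): uncompressed rate over coded rate."""
    return fs_khz * input_bits / rate_kbps


def derived_columns(rows=ROWS):
    """Return (technique, column 4, column 5) for every row."""
    return [
        (name, coded_bits_per_sample(fs, rate), compression_ratio(fs, bits, rate))
        for name, fs, bits, rate in rows
    ]


if __name__ == "__main__":
    for name, bits_out, ratio in derived_columns():
        print(f"{name:32s} {bits_out:5.2f} bits/sample  {ratio:5.2f} : 1")
```

Evaluated this way, the CD-quality row gives about 4.35 bits per sample and about 3.7 to 1, which Table 1.1 lists rounded as 4 and 4 : 1; the other rows reproduce the tabulated values to the precision shown.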
An example of a psychoacoustically based subband coding technique is specified by the new ISO/MPEG (International Standards Organization / Moving Pictures Experts Group) standard. This newly proposed standard defines the coding of moving pictures and associated wideband audio for digital transmission and storage media [Dehery 1991] [Stoll 1992] [ISO/MPEG Draft International Standard 1992]. The standard leaves the psychoacoustic modelling open to the user's choice. This thesis deals with the designs of two new psychoacoustic models for use in the ISO/MPEG standard. The goal of the standard is to code 16-bit PCM CD (Compact Disc) and studio quality audio signals with very high audio quality, while attaining a data compression ratio of 6 to 1. In the next section, applications of psychoacoustically based subband coding are presented.

1.2 Applications

Typical applications of psychoacoustically based subband coding currently being explored are listed in Table 1.2 [Dehery 1992] [Stoll 1992] [Dehery, Lever, and Urcun 1991] [Veldhuis 1992]. The bit rates listed range from 192 kbps to 64 kbps per mono channel. Applications that demand very high audio fidelity, such as Digital Compact Cassettes and professional studio audio processing, require bit rates of 192 kbps. Applications such as broadcast distribution links between studio and transmitter stations require very high audio fidelity which is provided at 128 kbps.
Applications such as reporting links and videoconferencing can afford some loss of signal fidelity and require bit rates of 64 kbps.

Table 1.2 Wideband Audio Applications and Bit Rates

Bit Rate              Typical Application
(per mono channel)
192 kbps              Professional studio sound recording, playback, editing, and postprocessing.
192 kbps              Digital Compact Cassette (DCC) which can record a stereophonic sound on a mini cassette in digital sound compressed mode.
192-128 kbps          Primary distribution links from studio to transmitter stations.
128 kbps              Radio sound programme emission, Digital Audio Broadcasting (DAB) based on Coded Orthogonal Frequency Division Multiplexing (COFDM).
64 kbps               Stereophonic transmission via narrowband ISDN for reporting links and videoconferencing.

Psychoacoustic principles can be exploited to achieve the bit rates ranging from 192 kbps to 64 kbps listed in Table 1.2. Because it uses psychoacoustic principles, the ISO/MPEG standard is a potential candidate for the applications of Table 1.2. If the standard becomes internationally accepted, an audio codec conforming to this standard could potentially be usable in a broad market. The motivation of this thesis, to develop two psychoacoustic models for implementation in the ISO/MPEG audio codec, is driven by the fact that a wide range of potential applications exist for the codec.

1.3 Thesis Outline

Considerations ranging from human perception to standard electrical engineering techniques are used to develop an audio codec system.
Each of these matters is considered separately as shown in the list of Chapters (Chapter 2 to Chapter 7) below:

Chapter 2 is an overview of the existing hardware and software platforms used to develop and test a psychoacoustically based audio codec.
Chapter 3 reviews the psychoacoustic principles used to model human perception in real time.
Chapter 4 is a review of spectral analysis techniques employed in the design of the ISO/MPEG audio codec.
Chapter 5 presents the real time implementation of two psychoacoustic models for use in the new ISO/MPEG audio codec.
Chapter 6 analyzes the results of real time measurements and listening tests.
Chapter 7 summarizes the results of Chapters 5 and 6, and presents appropriate conclusions and recommendations.

Chapter 2 Software and Hardware Development Platforms

The goal of this thesis is the development and implementation of two psychoacoustic models in the new ISO/MPEG audio codec using an existing hardware platform. In Section 2.1, the ISO/MPEG codec software platform is presented. In Section 2.2, the existing hardware platform is presented.

2.1 Software Development Platform

Shown in Fig.2.1 [Stoll 1992] is the ISO/MPEG audio codec software platform. The codec platform consists of two parts: the audio encoder and decoder.

The encoder consists of two parts:
1. Coarse spectral analysis via a pseudo-QMF filter bank, generating 32 subband coefficients.
2. Fine spectral analysis via a 1024 point FFT.

[Figure 2.1 ISO/MPEG Audio Codec Software Development Platform: audio encoder block diagram and audio decoder block diagram. Recoverable labels: digital audio signal (705 kbps), coded audio signal (192-64 kbps), ISO/MPEG bitstream, 32 quantized subband coefficients, synthesis filter bank (32 subbands).]

The filter bank decomposes the digital audio input signal into 32 decorrelated subband coefficients. These subband coefficients are quantized, formatted into the ISO/MPEG bitstream, and transmitted.
The linear quantization of the subband coefficients is controlled by a dynamic bit allocation scheme. This scheme shapes quantization noise spectrally, based on the auditory masking threshold determined by the psychoacoustic model. Only one psychoacoustic model is required by the encoder. Two psychoacoustic models are designed in this thesis. Each is suited for use in the encoder.

The decoder simply consists of an inverse transform filter bank. The inverse transform reconstructs the digital (PCM) audio signal based on the quantized subband coefficients.

2.2 Hardware Development Platform

Our objective is to use the existing hardware platform provided by MPR Teltech to implement the ISO/MPEG software platform discussed in Section 2.1. The existing platform is shown in Fig.2.2. An IBM compatible PC houses the encoder and decoder cards of the audio codec (coder/decoder). The encoder receives 16-bit PCM digital audio from a Digital Audio Tape at a rate of 705 kbps. The rate and format of the digital audio are defined by the Audio Engineering Society/European Broadcasting Union (AES/EBU). An example of the AES/EBU audio format is the digital output available on some CD players.

[Figure 2.2 ISO/MPEG Audio Codec Hardware Development Platform: personal computer with keyboard and headphones, with AES/EBU digital PCM interfaces (705 kbps) at input and output. Note: 705 kbps = 16 bit PCM x 44.1 kHz sampling frequency.]

[Figure 2.3 Audio Codec Encoder and Decoder Cards: encoder card and decoder card, each with an AES/EBU interface, linked by Serial Communication.]

Using the psychoacoustic models developed in Chapter 5, the encoder compresses the 705 kbps PCM digital audio into a bitstream of 64, 128, or 192 kbps, as desired by the user. The format of the compressed bitstream is defined by the ISO/MPEG standard. The encoder communicates the ISO/MPEG bitstream to the decoder via a DSP (Digital Signal Processor) interface. The decoder then uncompresses the ISO/MPEG bitstream back to the 705 kbps AES/EBU format.
This PCM bitstream is input to a digital to analog converter (DAC). A pair of high quality headphones is used to listen to the reconstructed audio signal.

The engine used to achieve data compression of the digital audio in real time consists of two Motorola DSP56000s, which are digital signal processors. Both DSP56000s of the encoder card are shown in Fig.2.3. This card is housed within the IBM compatible PC, as shown in Fig.2.2. The two DSPs share a common 1024 bytes of RAM. The compressed bit stream is uncompressed in real time on the decoder via a single Motorola DSP56000. The decoder card is also housed within the IBM compatible PC, as shown in Fig.2.2. The interface between encoder and decoder cards is a Serial Communication Interface between two DSP56000s.

Chapter 3 Real Time Modelling of Psychoacoustic Principles

Psychoacoustics describes the non-linear hearing processes of the human auditory system. These processes govern the way we perceive different audio phenomena. The main impetus to model psychoacoustics stems from the fact that it leads to audio coding bit rate reduction, or data compression, which is necessary to meet the newly proposed ISO/MPEG codec standard. Data compression is achieved by spectrally shaping quantization noise, based on psychoacoustic principles, to render the noise inaudible.

The aim of Chapter 3 is twofold:
1. To present psychoacoustic principles that determine the spectral shape required of quantization noise so that it is rendered inaudible, or masked.
2. To reveal approximations and assumptions made in the modelling of these principles which are required to allow real time implementation.

In Sections 3.1 to 3.4, three psychoacoustic principles are discussed: critical bands, spectral masking, and the absolute threshold of hearing. These principles are used in the designs of the two new psychoacoustic models of Chapter 5.
Finally, a summary is presented in Section 3.5.

3.1 Critical Bands and the Bark Scale

3.1.1 Critical Bands

It has been determined through listening experiments [Greenwood 1961] that the human auditory system perceives sounds by grouping the energy content into certain frequency regions. These regions are defined as critical bands. If the spectrum of an event is within one critical band, sensations beyond this critical band are perceived separately [Kapust 1992]. Fig.3.1 [Zwicker and Fastl 1990] depicts the results of two experiments used to determine critical bands:
a. Two tonal maskers of 50 dB intensity and separated by Δf, render a narrow band noise target centered at 2 kHz inaudible.
b. Two narrow band noise maskers of 50 dB intensity and separated by Δf, render a tonal target located at 2 kHz inaudible.

[Figure 3.1 Experimental Determination of Critical Bands. (a) Two tones at 50 dB masking narrow band noise at 2 kHz. (b) Two narrow bands of noise at 50 dB masking a tone at 2 kHz. Horizontal axis: Δf (Hz), 50 to 2000. Annotation: critical band = 300 Hz.]

In both experiments, the goal is to render inaudible, or mask, the target by using the two maskers separated by Δf. For a given Δf, there exists a maximum intensity of the target so that it is just masked. If this maximum target intensity is exceeded, the target becomes audible. This maximum intensity of the target, so that it is just masked, is plotted as a function of Δf, the frequency separation of the two maskers. Fig.3.1(a) reveals that for small Δf, the maximum intensity of the noise target, so that it is masked, remains independent of Δf. Beyond a certain Δf the two tonal maskers become less effective in masking the noise target. Thus, the maximum intensity of the noise target decreases so that the noise remains masked by the two tones. The critical band is defined as the crossing point of the horizontal and decaying slopes.
In this example, the critical band is approximately 300 Hz. Fig.3.1(b) reveals the same results using noise maskers and a tonal target. Beyond the critical band, the masking ability of the two maskers decreases.

3.1.2 Bark Scale

Modelling the critical band in real time is accomplished by dividing the auditory spectrum (20 Hz to 20 kHz) into 25 critical bands. The partitioning is an approximation of the bandwidths of critical bands as a function of frequency. Experiments show that the bandwidth of critical bands increases as a function of frequency [Scharf 1970]. Table 3.1 [Scharf 1970] reveals the 25 critical bands and the approximate lower and upper frequencies that define their bandwidths. The Bark scale is defined such that each Bark unit equals one critical bandwidth.

Table 3.1 Critical Band Bandwidths

Critical Band No.   Lower Edge   Upper Edge   Bandwidth
(Bark Scale)        (Hz)         (Hz)         (Hz)
 1                      0          100          100
 2                    100          200          100
 3                    200          300          100
 4                    300          400          100
 5                    400          510          110
 6                    510          630          120
 7                    630          770          140
 8                    770          920          150
 9                    920         1080          160
10                   1080         1270          190
11                   1270         1480          210
12                   1480         1720          240
13                   1720         2000          280
14                   2000         2320          320
15                   2320         2700          380
16                   2700         3150          450
17                   3150         3700          550
18                   3700         4400          700
19                   4400         5300          900
20                   5300         6400         1100
21                   6400         7700         1300
22                   7700         9500         1800
23                   9500        12000         2500
24                  12000        15500         3500
25                  15500        24000         8500

3.1.3 Spectral Warping into Bark Scale

In Chapter 5, the ability of two tones to mask narrow band noise, as observed in Fig.3.1(a), is modelled in the critical band domain, shown in Table 3.1 [Scharf 1970]. In order to model the masking observed in Fig.3.1(a) in the critical band domain, a transformation is required from the frequency domain, in the Hertz scale, to the critical band domain, in the Bark scale. The transformation is referred to as spectral warping into the Bark scale. The Bark scale divides the frequency spectrum into intervals, such as those listed in Table 3.1.
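As an illustration of how the intervals of Table 3.1 partition the Hertz scale, the sketch below maps a frequency to its critical band number and counts how many spectral lines fall in each band. The function names are hypothetical, and pairing a 1024 point FFT with a 44.1 kHz sampling frequency is an assumption made here for the example (both figures appear in Chapter 2, but the thesis does not state this pairing at this point). Treating each lower edge as inclusive is also a choice of the sketch.

```python
from bisect import bisect_right

# Upper edges (Hz) of the 25 critical bands of Table 3.1; band 1 starts at 0 Hz.
UPPER_EDGES_HZ = [
    100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
    1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400,
    7700, 9500, 12000, 15500, 24000,
]


def critical_band(freq_hz):
    """Return the critical band number (1..25) that contains freq_hz.

    Lower edges are treated as inclusive and upper edges as exclusive,
    which is an assumption of this sketch.
    """
    if not 0 <= freq_hz < UPPER_EDGES_HZ[-1]:
        raise ValueError("frequency outside the range covered by Table 3.1")
    return bisect_right(UPPER_EDGES_HZ, freq_hz) + 1


def lines_per_band(n_fft=1024, fs_hz=44100.0):
    """Count the FFT lines 0 .. n_fft/2 - 1 that fall in each critical band.

    n_fft and fs_hz are illustrative defaults, not values fixed by the text.
    """
    counts = [0] * len(UPPER_EDGES_HZ)
    for k in range(n_fft // 2):
        counts[critical_band(k * fs_hz / n_fft) - 1] += 1
    return counts


if __name__ == "__main__":
    for band, count in enumerate(lines_per_band(), start=1):
        print(f"critical band {band:2d}: {count:3d} spectral lines")
```

Because the bands widen with frequency, the low bands collect only a few spectral lines each while the high bands collect many, which is why a per-band value has to be computed from a varying number of frequency samples.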
For each interval, a critical band power density value is computed from the samples of the spectral power density in the corresponding interval on the Hertz scale.

The sampled critical band power density, X[l], is computed as follows [Beerends and Stemerdink 1992]:

$$X[l] = \left[ \frac{f(z_l + \Delta z/2) - f(z_l - \Delta z/2)}{\Delta z} \right] \left[ \frac{1}{N_l} \sum_{k \in R_l} X_f[k] \right] \qquad (3.1)$$

where index l is in the Bark scale domain, index k is in the frequency domain, Δz is the Bark unit, Δf is the frequency unit, and N_l is the number of frequency lines in region R_l, bounded from f(z_l - Δz/2) to f(z_l + Δz/2).

The second square bracket term of equation (3.1) represents the average spectral energy over the frequency range, R_l, defined by the Bark scale interval, Δz. The Bark scale interval defines the resolution used to warp signal energy from the frequency domain into the critical band domain. For example, Table 3.1 maps signal energy in the frequency domain into single critical bands; thus Δz = 1 and critical band 1 contains the first 100 Hz of signal energy. Doubling the critical band resolution implies warping signal energy from the frequency domain into every half critical band; thus Δz = 0.5, and critical band 1 is split into two regions, one containing signal energy ranging from 0-50 Hz, the other 50-100 Hz. The first square bracket term of equation (3.1) is the ratio of the width of the frequency range in Hertz to the critical band interval. It represents a weighting of the average energy. A critical band interval that covers a wide frequency range results in a high weighting of the average energy. Thus spectral warping from the Hertz scale into the Bark scale preserves total signal energy.

3.2 Spectral Masking of Single Targets

Examples of spectral masking are common in everyday life. For example, a conversation on a quiet road requires little speech power by the speakers. In contrast, conversation near a busy boulevard requires the speakers to raise their voices. In effect, their speech must mask the noise of passing traffic.
This is an example of a noisy target being masked. Similarly, in a musical performance, a faint instrument may be masked by another that is louder. This is an example of a more tonal target being masked. When the louder one pauses, the faint instrument becomes audible again.

The main impetus to model the masking of targets stems from the fact that this modelling leads to data compression. Data compression is achieved by a reduction in the required coding bit rate. The result of the bit rate reduction is the introduction of quantization noise into the signal. Can the introduced quantization noise be masked by the signal so that the noise is not perceived? This section begins to answer this question by modelling two types of masking of single targets:
1. Inter-Band Masking.
2. Intra-Band Masking.

In Section 3.3, the question is more completely answered by modelling the masking of multiple targets. In Sections 3.2.1 and 3.2.2, modelling of inter-band and intra-band masking, respectively, is presented.

3.2.1 Inter-Band Masking: Spreading Function

Inter-band masking refers to the ability of a masker, located in one critical band, to render inaudible, or mask, a target located in another critical band. Inter-band masking has been analyzed by experiments that determine the ability of a narrow band noise masker to mask a single pure tone [Zwicker and Fastl 1990]. Fig.3.2 shows results of some of these experiments. The left-most narrow band noise masker is centered at 250 Hz and has bandwidth of 100 Hz. The maximum intensity of a single tone target so that it remains masked by the narrow band noise masker is named the masking threshold intensity. Above the masking threshold intensity, the pure tone is no longer masked and becomes audible.
Fig.3.2 plots the masking threshold intensity of a single pure tone masked by the 100 Hz wide noise masker as a function of frequency. The plot is the left-most curve labelled 'Noise Masker: BW=100 Hz, f=250 Hz'.

[Figure 3.2 Masking of Pure Tones by Narrow Band Noise. Three noise maskers: BW = 100 Hz centered at 250 Hz, BW = 160 Hz centered at 1 kHz, BW = 700 Hz centered at 4 kHz. Horizontal axes: frequency (kHz), 0.02 to 20, and critical band (Bark scale).]

Three observations are made:
1. The ability of the narrow band noise to mask a single tone spreads across a certain frequency region. This frequency region covers several critical bands. In this case, the narrow band of noise located at 250 Hz can mask pure tones from 0.1 to 1 kHz. Fig.3.2 shows that this covers critical bands 1 to 9.
2. As the pure tone target becomes more spectrally distant from the noise masker, the decrease in masking threshold intensity of the pure tone is interpreted as an increased difficulty for the noise to mask the pure tone.
3. Ascending from lower frequencies, the masking threshold intensity of the pure tone shows a steep increase. After reaching the maximum, near 250 Hz, a less steep decrease is observed.

Fig.3.2 also plots the masking threshold intensity of a single pure tone masked by noise maskers centered at 1 kHz and 4 kHz as a function of frequency. The same three observations listed above are made. In examining all three curves of Fig.3.2, a fourth observation is made:
4. Tonal targets located at higher frequencies are more difficult to mask. The peak threshold intensities for the three tones fall below the noise level by 2 dB, 3 dB and 5 dB respectively.

The smooth slopes of the masking threshold intensities observed in Fig.3.2 are approximated by coarse curves in the critical band domain. For example, the smooth curve centered at 1 kHz, or critical band 9, is approximated by the curve shown in Fig.3.3 [ISO/MPEG Draft International Standard 1992]. Experiments have determined that the smooth slopes in Fig.3.2 change as a function of the masker intensity [Zwicker and Fastl 1990]. This effect is not accounted for in the coarse curve of Fig.3.3. The advantage of not taking into account this effect is a decrease in the computational complexity required to determine the masking threshold intensity of a target. This facilitates real time implementation.

[Figure 3.3 Approximated Spreading Function of Inter-Band Masking: Centered at Critical Band 9. Horizontal axis: critical band number (Bark), 4 to 18.]

3.2.2 Intra-Band Masking

Intra-band masking, in contrast to inter-band masking, refers to the ability of a masker in one critical band to mask a target in the same critical band. Intra-band masking is characterized by two factors:
1. The frequency of the masker.
2. The tonal or noise-like nature of the masker.

Both factors are observed from experimental results [Zwicker and Fastl 1990]. Firstly, Fig.3.2 reveals the dependence of masking ability on the frequency of the masker. The intensity difference between the noise masker and the peak masking threshold of a pure tone increases from 2 dB to 5 dB as the centre frequency of the noise masker increases from 250 Hz to 4 kHz. Secondly, Fig.3.1 reveals the dependence of masking ability on the tonality of the masker. Fig.3.1(a) reveals that an intensity difference of at least 16 dB is required between the two tone maskers and the narrow band noise target in order to mask the noise target. In contrast, Fig.3.1(b) reveals that an intensity difference of at least 5 dB is required between the two narrow band noise maskers and the pure tone target in order to mask the tone target.

Experiments reveal that the two factors affecting masking within a critical band can be approximated by the following equation [Hellman 1972] [Kapust 1992]:

$$M[z] = \alpha[z] \, (14.5 + z) + (1 - \alpha[z]) \cdot 5.5 \quad [\mathrm{dB}] \qquad (3.2)$$

where M[z] is the masking threshold intensity in critical band z, and the coefficient of tonality, α[z], approximates the noiselike or tonelike nature of the signal in critical band z. The number '5.5' in equation (3.2) should actually be a function of critical band [Kapust 1992]. However, in order to facilitate real time implementation, the change in this number as a function of critical band is not accounted for.

3.3 Spectral Masking of Multiple Targets

In Section 3.2, the following question was posed: Can a signal mask, or hide, introduced quantization noise so that the noise is not perceived? Section 3.2 began to answer this question by modelling the masking of single targets. Section 3.3 more completely answers the question by modelling the masking of multiple targets. Why must multiple targets be considered? The ISO/MPEG audio encoder introduces different levels of quantization noise into 32 subbands. These subband noise energies represent multiple targets that must be masked by the original signal. Two techniques which model the masking of multiple targets are discussed in this section:
1. Spreading function modelling: The individual masking curves obtained by Zwicker (see Fig.3.2) are summed to form a global masking threshold.
2. Attenuation modelling: Maintaining a quantization noise level that is an attenuated version of the original spectrum will result in the noise being inaudible.

These techniques are at the core of the two new psychoacoustic models designed in Chapter 5.

3.3.1 Spreading Function Modelling

Spreading function modelling is so named because it takes into account the masking curves which spread across critical bands, as discussed in Section 3.2. Fig.3.4 depicts calculation of the global masking threshold as the sum of individual energies spread across critical bands. The spreading function used to spread the signal energies across critical bands is shown in Fig.3.3. The global masking threshold represents a maximum energy level, below which multiple targets are masked, or rendered inaudible, by multiple maskers. Above the global masking threshold, targets become audible.

[Figure 3.4 Global Masking Threshold Calculation: Spreading Function Modelling. Tonal maskers, their individual masking thresholds (dotted lines) and the global masking threshold, plotted against critical band (Bark).]

To facilitate calculation of the global masking threshold in real time, three assumptions are made [Veldhuis 1992]:
1. The individual spreading functions (dotted lines of Fig.3.4) derived from Zwicker's experiments describe narrow bands of noise masking tones. However, the goal of audio compression is that the harmonic energy of the signal should mask narrow bands of subband quantization noise. It is assumed that the masking thresholds for tonal targets can be used for the quantization noise targets. This assumption is verified informally [Veldhuis 1992].
2. Fig.3.4 reveals that an addition law for individual spreading functions is required to calculate the global masking threshold. Experiments have shown [Beerends and Stemerdink 1992] [Lufti 1983] that a power law is a useful approximation:

$$M_{global}[z] = \left( \sum_{i=1}^{n} M_i^{\alpha}[z] \right)^{1/\alpha}, \quad \alpha < 2 \qquad (3.3)$$

where z is the critical band scale, and M_i
[zj is the individual masking threshold dueto the jth masker. For real time implementation, the global mask is approximated asthe sum of the individual thresholds [Veldhuis 1992].3. The spreading functions obtained from Zwicker’s experiments result from the analysis of stationary targets and maskers. Transient behaviour, temporal masking andbinaural effects are not considered. Not accounting for these principles reduces thecomputational complexity required to calculate the global mask. This facilitates realtime implementation.3.3.2 Attenuation ModellingDetection of quantization noise occurs if, in any critical band, the sum of the originalsignal energy level and coding noise, differs by more than u dB from the signal energylevel alone [Zwicker 1970]. Given this constraint, how should the global maskingthreshold be calculated? It is shown that due to the linearity of the transformation fromfrequency domain to critical band domain, the global masking threshold is calculated asan attenuated version of the original spectrum. Simulations and listening tests show thatthe amount of attenuation from the original spectrum be approximately 13 dB [Paillard1992]. In other words, quantization noise that is 13 dB below the original signal spectrumfor all frequencies is inaudible. This observation is made by other authors [Brandenburgand Sporer 1992]. Fig.3.5 depicts calculation of the global masking threshold as anattenuated version of the original signal spectrum.28-v(1)a)4//Chapter 3—— Original Signal SpectrumGlobal Masking ThresholdFigure 3.5 Global Masking Threshold Calculation: Attenuation Modelling/29Chapter 33.4 Absolute Threshold of HearingThe absolute threshold of hearing, Tabs, defines a curve below which sound eventscannot be perceived. This curve is determined from experimental results and modelledapproximately by equation (3.4) [Terhardt 19791:/ f \0.8= 3.64i—) — 6.5exP(_O.6(_— 3.3)2)+ 1O—3Q_.) [dB](3.4)Fig.3.6 plots the absolute threshold of hearing. 
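As an illustration, equation (3.4) can be evaluated directly and tabulated once per FFT line. The following is a minimal sketch, not the thesis implementation: the function names, the 44.1 kHz sampling rate default, and the 20 Hz floor used for the DC line (where the formula diverges) are assumptions made for the example, and the values stay in dB rather than being converted to the energy units of the encoder.

```python
import math


def absolute_threshold_db(f_hz):
    """Equation (3.4): absolute threshold of hearing in dB for f_hz in Hz."""
    khz = f_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)


def build_threshold_table(sample_rate_hz=44100.0, fft_size=1024, min_hz=20.0):
    """Tabulate the threshold at each of the fft_size/2 + 1 FFT line frequencies.

    min_hz is an assumed floor: the formula diverges at 0 Hz, so any line
    below min_hz (in practice the DC line) is evaluated at min_hz instead.
    """
    bin_hz = sample_rate_hz / fft_size
    return [absolute_threshold_db(max(k * bin_hz, min_hz))
            for k in range(fft_size // 2 + 1)]


# One entry per spectral line of a 1024 pt FFT (513 entries).
table = build_threshold_table()
```

Computing the table ahead of time keeps the exponential and power operations out of the per-frame processing.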
Real time incorporation of the absolute threshold of hearing is accomplished by storing this data in a look-up table.

[Figure 3.6 Absolute Threshold of Hearing. Horizontal axis: Frequency (Hz), 0 to 24000.]

3.5 Summary

The modelling of three psychoacoustic principles (critical bands, spectral masking, and the absolute threshold of hearing) has been presented. The main impetus to model these principles stems from the fact that it leads to audio coding bit rate reduction, or data compression, which is necessary to meet the new ISO/MPEG codec standard. Data compression is achieved by spectrally shaping quantization noise to render the noise inaudible. Incorporating the psychoacoustic principles of this Chapter into the two new psychoacoustic models developed in Chapter 5 is achievable in a real time implementation by making several assumptions and approximations.

Chapter 4 Multirate Signal Decomposition

Throughout this thesis, different multirate signal decomposition techniques are used to perform spectral analysis. For example, Fig.2.1 shows a pseudo-QMF filter bank used in the ISO/MPEG audio encoder software platform. Fig.5.2 shows an MLT (Modulated Lapped Transform) structure in the Paillard audio encoder. Fig.5.4 shows an FFT used in the attenuation psychoacoustic model. The advantages and disadvantages of these decomposition techniques are discussed in Sections 4.1 to 4.3. The techniques discussed in these sections have four common objectives:

1. Data compression: By removing the redundancies of an audio signal, fewer bits are required to encode the signal.
2. Perfect reconstruction: Reconstruction of the original signal from the spectral representation should not introduce distortion in the absence of quantization noise.
3. Critical sampling: The audio signal should be critically sampled. Given N input audio samples, N spectral coefficients are generated. This minimizes the bit rate required to encode and transmit the signal.
4. Minimum delay: The computational complexity of the analysis technique should be minimized to reduce encoding delays.

Fig.4.1 is an overview of three critically sampled, multirate signal decomposition techniques:

1. Multirate Filter Banks.
2. Lapped Transforms.
3. Block Transforms.

The three techniques vary in computational complexity and ability to achieve the objectives outlined above. (Fig.4.1 was created by synthesizing information from the following references: [Galand 1984] [Malvar 1988] [Malvar 1990] [Nussbaumer and Vetterli 1984] [Preuss 1982] [Rothweiler 1983] [Vaidyanathan 1988] [Vetterli and Le Gall 1989] [Wang 1985]).

[Figure 4.1 Critically Sampled Multirate Signal Decomposition Types. Multirate Filter Banks (L > M): direct implementation is the Quadrature Mirror Filter (QMF) bank; polyphase implementation is the Pseudo-QMF bank, 35% savings over QMF; example P-QMF with L=512, M=32. Lapped Transforms (L = 2M): direct implementation is the Lapped Orthogonal Transform (LOT), 100% overhead over DCT; polyphase implementation is the Modulated Lapped Transform (MLT), 40% savings over LOT; example LOT with L=16, M=8, 21 dB rejection. Block Transforms (L = M): Karhunen-Loeve Transform (KLT), Discrete Cosine Transform (DCT), Discrete Fourier Transform (DFT); real DFT is 50% savings over DCT; example DCT with L=8, M=8, 10 dB rejection.]

4.1 Multirate Filter Banks

The most general type of signal decomposition technique is that of multirate filter banks.
This technique involves splitting the signal to be analyzed into different frequency bands, or subbands, via a uniform bank of FIR bandpass filters. The filter outputs are downsampled; these samples are named subband coefficients or subband samples. Removal of signal redundancy is achieved by decomposing the signal into a set of decorrelated subband coefficients. The signal energy is redistributed into as few spectral coefficients as possible. This energy compaction results in fewer bits required to encode the original signal [Akansu and Haddad 1992].

Multirate filter banks are characterized by 1) FIR filter lengths that exceed the number of subbands, 2) high bandstop rejection, and 3) high computational complexity. The ISO/MPEG filter bank implemented in this thesis (see Part 2, Chapter 2.1) decomposes the signal into 32 subbands, M, using an FIR analysis filter of order 512, L. The resulting bandstop rejection is 90 dB [Stoll 1992].

Fig.4.1 reveals that the responses of the individual bandpass filters overlap, for all three decomposition techniques. This results in frequency domain aliasing components. Theoretically, for perfect reconstruction, the analysis and synthesis stages of the decomposition technique satisfy aliasing cancellation requirements. In practice, however, there are two causes for non-cancelled aliasing components to exist in the reconstructed signal:

1. Due to psychoacoustic principles, some subband coefficients are not encoded and thus are not available in the reconstruction stage.
2. The encoding system introduces quantization noise into the subband coefficients, impeding perfect alias cancellation.

An optimal filter bank structure, therefore, minimizes aliasing energy components. This is achieved with large bandstop rejection, such as the 90 dB bandstop rejection of the ISO/MPEG filter bank.

Two methods of implementing multirate filter banks exist: direct and polyphase structure.
The direct method is based on a QMF (Quadrature Mirror Filter) binary tree structure. A 32 subband filter bank requires a five stage binary tree structure. A more computationally efficient method is named pseudo-QMF and uses a polyphase structure. The bandpass filters are derived by frequency shifting a single prototype lowpass filter, resulting in a 35% saving over the direct method [Rothweiler 1983], [Nussbaumer and Vetterli 1984]. The saving is realized by the ability to implement polyphase structures using generalized transform techniques.

An advantage of the filter bank structure over the transform methods, described in Sections 4.2 and 4.3, is the ability to optimize the prototype lowpass filter characteristics to meet desired requirements. The ISO/MPEG filter bank is optimized to have sharp cutoffs so that quantization noise is confined to one subband and its two adjacent neighbors [Dehery 1991]. The goal is to allow independent control of noise levels in each subband. This goal is realizable given the 90 dB bandstop rejection of the ISO/MPEG filter bank. The goal is less realizable given the 21 dB and 10 dB bandstop rejection of the lapped and block transforms, respectively.

4.2 Lapped Transforms

Transform coding is a special case of multirate filter bank coding. The transform is equivalent to a bank of orthonormal (i.e. energy preserving) filters with subsampled outputs [Akansu and Haddad 1992]. Subband coding achieves signal decorrelation by filtering serial data, whereas transform coding uses a transformation.

Lapped transforms are characterized by a transform analysis length, L, twice the number of subbands, M (i.e. L = 2M). Both direct and polyphase implementations exist, respectively named Lapped Orthogonal Transform (LOT) and Modulated Lapped Transform (MLT). The MLT is a 40% saving over the LOT [Malvar 1990]. The performance of block transforms (see Section 4.3) is known to degrade significantly at low bit rates [Akansu and Haddad 1992].
The "blocking" effect manifests itself by the discontinuities at the block boundaries, and is perceived as periodic clicks. By overlapping adjacent analysis windows, the effect is reduced in both lapped transforms and multirate filter banks.

The reduced complexity of lapped transforms compared to multirate filter banks is obtained at the cost of a decrease in bandstop rejection, from 90 dB to 21 dB [Malvar 1988]. Thus data compression is less efficient since there is greater correlation between subband coefficients. However, the decorrelation of lapped transforms is close to that of block transforms of length 2M [Malvar 1990]. This is achieved with twice the computational complexity of a block transform [Malvar 1990].

4.3 Block Transforms

Block transforms are a particular case of lapped transforms such that the transform analysis length, L, and number of subbands, M, are equal. Examples of transform coding include the DFT (Discrete Fourier Transform), DCT (Discrete Cosine Transform) and the KLT (Karhunen-Loeve Transform). Both DFT and DCT can be implemented via fast algorithms. For example, the commonly used FFT can be used to implement both the DCT and DFT. The decrease in computational complexity is at the expense of poorer spectral resolution, resulting in bandstop rejection of only 10 dB [Malvar 1990].

The KLT is optimal in setting the upper bound for signal decorrelation for transform coding. However, the upper bound for filter banks is set by ideal filter banks with zero aliasing, leading to perfect interband decorrelation. Thus, multirate filter banks that approach the ideal filter bank exhibit improved signal decorrelation over the KLT [Akansu and Haddad 1992].

Chapter 5 Implementation of Two Psychoacoustic Models

The new ISO/MPEG audio codec standard purposefully does not rigidly define a psychoacoustic model to be used in its software platform (see Fig.2.1).
The following quote points to the reason for this: "There is an on-going race for the best coding algorithm regarding audio signal quality, data rate, implementation complexity and other criteria" [Brandenburg and Seitzer 1989]. The success in meeting these criteria is largely determined by the psychoacoustic model employed by the ISO/MPEG codec. In Chapter 5, two new psychoacoustic models, named spreading function model and attenuation model, are designed for real time implementation of the ISO/MPEG codec on the existing hardware platform. Both models represent modifications and simplifications to existing models found in the literature. The models found in the literature suffer from either being computationally too complex for real time implementation or being ill-suited for direct use in the ISO/MPEG codec standard. The two new models overcome these problems by introducing new methods for the calculation of the global masking threshold and the signal to mask ratios. In this Chapter, Section 5.1 presents the global masking threshold and signal to mask ratios in context of the ISO/MPEG codec. Section 5.2 presents the implementation of the spreading function psychoacoustic model. Section 5.3 presents the implementation of the attenuation psychoacoustic model.

5.1 Global Masking Threshold and Signal to Mask Ratios (SMR)

The purpose of the global masking threshold and signal to mask ratios (SMR) in the ISO/MPEG software platform, shown in Fig.5.1, is to determine the spectral shape required of quantization noise to render it inaudible. Quantization noise is introduced into an audio signal when the signal is encoded with the aim of achieving data compression. Data compression is accomplished by coarsely quantizing the ISO/MPEG subband coefficients. In other words, a few bits are used to code the subbands. This results in quantization noise in the subband, or spectral, domain.
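The link between the number of bits and the quantization noise level can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it uses a plain uniform quantizer on a synthetic sine rather than the ISO/MPEG subband quantizer with scale factors, and the names `quantize` and `snr_db`, the test amplitude and the test frequency are all hypothetical. For such a quantizer, each additional bit is expected to lower the noise by roughly 6 dB.

```python
import math


def quantize(x, bits):
    """Uniform quantizer for x in [-1, 1] with 2**bits output levels."""
    step = 2.0 / (2 ** bits)
    level = round(x / step)
    max_level = 2 ** (bits - 1) - 1
    # Clamp to the representable range of a signed integer code.
    level = max(-max_level - 1, min(max_level, level))
    return level * step


def snr_db(bits, n=4096):
    """Signal to quantization noise ratio (dB) of a test sine coded with `bits` bits."""
    signal = [0.9 * math.sin(0.1234567 * k) for k in range(n)]
    noise = [s - quantize(s, bits) for s in signal]
    p_signal = sum(s * s for s in signal) / n
    p_noise = sum(e * e for e in noise) / n
    return 10.0 * math.log10(p_signal / p_noise)


for bits in (4, 8, 12):
    print(bits, "bits:", round(snr_db(bits), 1), "dB")
```

Fewer bits give a lower ratio, which is why the noise introduced by coarse quantization has to be kept where the signal can mask it.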
Both the spreading function model and attenuation model, developed in Sections 5.2 and 5.3, have a common objective: to spectrally shape subband quantization noise so that it becomes inaudible.

How is this objective achieved? Fig.5.1 shows an example of a psychoacoustic model within the ISO/MPEG audio encoder. The 513 spectral energy coefficients output from the 1024 pt FFT are input into the psychoacoustic model. The psychoacoustic model uses these energy coefficients to calculate the global masking threshold. The global masking threshold represents the maximum intensity of quantization noise for which the noise is inaudible. In other words, when quantizing the subband coefficients, the introduced quantization noise intensity must be kept below the global masking threshold. The psychoacoustic model uses the global masking threshold to determine 32 SMR (Signal to Mask Ratios), one for each of the 32 subbands. The SMR are used to dynamically allocate bits to encode the analysis filter bank coefficients of the 32 subbands of the ISO/MPEG codec. Dynamic bit allocation to the subbands achieves the objective of spectrally shaping quantization noise since a small bit allocation results in large quantization noise, and a large bit allocation results in small quantization noise.

Both the spreading function model and attenuation model determine the global masking threshold and signal to mask ratios in real time. In Sections 5.2 and 5.3, new methods for the calculation of the global masking threshold and signal to mask ratios are discussed for the spreading function and attenuation psychoacoustic models. These new methods facilitate real time implementation on the given hardware platform.

[Figure 5.1 ISO/MPEG Audio Encoder Software Platform.]

5.2 Spreading Function Model Algorithm

This Section develops the algorithm for a new spreading function psychoacoustic model used for real time implementation in the ISO/MPEG audio codec.
The following topics are discussed in the next two sections:

1. Calculation of global masking threshold.
2. Calculation of signal to mask ratios (SMR).

The goal is to reveal particular features embodied in the new spreading function model that lead to real time implementation on the existing hardware platform.

5.2.1 Calculation of Global Masking Threshold

The function of the spreading function model is to account for the individual masking abilities of the multiple maskers in an audio signal. As discussed in Sections 3.2.1 and 3.3.1, the masking abilities spread across several critical bands and are summed to create a global masking threshold. An algorithm is proposed in the ISO/MPEG audio codec standard [ISO/MPEG Draft International Standard 1992]. However, the proposed model suffers from being computationally too complex for real time implementation on the existing hardware platform. Fig.2.3 reveals that the encoder card consists of two DSP56000s; one DSP is reserved to implement the psychoacoustic model. The DSP56000 is a fixed point digital signal processor which cannot efficiently perform trigonometric and exponential arithmetic. However, the algorithm proposed in the literature requires trigonometric and exponential calculations to determine noise-like and tonal features of the audio signal. The new spreading function model represents a simplification to the model proposed in the literature by making an assumption: the audio signal being encoded consists only of tonal energy components, and thus no noise-like components. At the expense of no longer accounting for tonal and noise-like signal components, the advantage gained by the assumption is a reduction in the computational complexity of the model, which facilitates real time implementation.

The steps involved in the new spreading function model algorithm are outlined below. Step 4 represents a simplification to the model proposed in the literature.
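To make the tonal-only assumption concrete, the following sketch evaluates equation (3.2) with the coefficient of tonality fixed at 1. It is a hedged illustration rather than the thesis code: the function names, the count of 25 critical bands, and the numbering of bands from 1 are assumptions chosen for the example. With the tonality fixed, the offset depends only on the band number, so it could be tabulated in advance instead of being computed per frame.

```python
def masking_offset_db(z, tonality):
    """Equation (3.2): offset in dB for critical band z and tonality in [0, 1]."""
    return tonality * (14.5 + z) + (1.0 - tonality) * 5.5


def tonal_offsets_db(num_bands=25):
    """Offsets under the tonal-only assumption (tonality fixed at 1).

    Bands are numbered from 1 to num_bands; both choices are assumptions.
    """
    return [masking_offset_db(z, 1.0) for z in range(1, num_bands + 1)]


def apply_offsets(spread_energy, offsets_db):
    """Attenuate spread critical band energies (linear units) by the dB offsets."""
    return [e * 10.0 ** (-off / 10.0)
            for e, off in zip(spread_energy, offsets_db)]


offsets = tonal_offsets_db()
print(masking_offset_db(9, 1.0), masking_offset_db(9, 0.0))
```

For critical band 9, the tonal case attenuates by 23.5 dB and the fully noise-like case by 5.5 dB, so treating all energy as tonal yields the lower, more conservative threshold estimate.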
In each critical band, the masking ability of the signal energy is a function of the noise-like or tone-like nature of the signal, as discussed in Section 3.2.2. The new spreading function model assumes, in Step 4, that the signal energy is tonal.

Steps to spreading function model calculation of global masking threshold:

1. The spectral energy of the audio signal is mapped from the frequency domain into the critical band domain using equation (3.1).
2. For each critical band, the energy in the critical band is spread across critical bands, using the spreading function discussed in Section 3.2.1. This accounts for the masking effect across critical bands.
3. The energies, which have been spread across critical bands, are summed together as depicted in Fig.3.4.
4. In each critical band, the masking ability is a function of the critical band number. The energy in each critical band is attenuated by the intra-band masking function of Section 3.2.2. The intra-band masking function assumes that the signal energy is tonal. The attenuated energy forms an initial estimate of the masking threshold in the critical band domain.
5. The estimated masking threshold is mapped from the critical band domain back to the frequency domain.
6. In the frequency domain, the absolute threshold of hearing intensity (Fig.3.6) is compared to the estimated masking threshold intensity (derived in Step 5). For each frequency line, the global masking threshold is determined as the larger of the two intensities.
7. The global masking threshold (derived in Step 6) is mapped from the frequency domain into the subband domain. For each of the 32 subbands, the minimum global masking threshold value within a subband is chosen as the mask for that subband.

5.2.2 Calculation of SMR

The new spreading function model calculates SMR (Signal energy to mask energy ratios) in the same manner proposed in the literature [ISO/MPEG Draft International Standard 1992]. The SMR are calculated in the subband domain. Fig.5.1 shows that 32 SMR, one for each subband, are calculated by the psychoacoustic model. For each subband, the signal energy is calculated as the average spectral energy contained in the subband. The mask energy in each subband is derived in Section 5.2.1, Step 7. The 32 SMR are calculated as the ratio of the signal and mask energies in each subband.

5.3 Attenuation Model Algorithm

This Section develops the algorithm for a new attenuation psychoacoustic model used for real time implementation in the ISO/MPEG audio codec. The following topics are discussed in the next two sections:

1. Calculation of global masking threshold.
2. Calculation of signal to mask ratios (SMR).

The goal is to reveal particular features embodied in the new attenuation model that lead to real time implementation on the existing hardware platform.

5.3.1 Calculation of Global Masking Threshold

The basis of the attenuation model is that the global masking threshold is derived from the original spectrum attenuated by 13 dB, as discussed in Section 3.3.2. In the literature, an algorithm is proposed [Paillard 1992]. However, the algorithm suffers from the problem of being ill-suited for direct use within the ISO/MPEG audio codec standard. Fig.5.2 [ISO/MPEG Draft International Standard 1992 and Paillard 1992] compares and contrasts the ISO/MPEG audio encoder structure with the audio encoder structure proposed in the literature [Paillard 1992]. Table 5.1 reveals the two following modifications made by the new attenuation model to the algorithm proposed in the literature:

1. The signal decorrelation method used for encoding is changed from 128 pt MLT (Modulated Lapped Transform) to 32 subband P-QMF (Pseudo-QMF).
2. The spectral analysis method used for masking threshold calculation is changed from 1024 pt MLT (Modulated Lapped Transform) to 1024 pt FFT (Fast Fourier Transform). The former has a spectral resolution of 1024 points, the latter has a spectral resolution of 513 points.

[Figure 5.2 Comparison of ISO/MPEG Encoder with Paillard Encoder. Upper panel: ISO/MPEG audio encoder block diagram. Lower panel: Paillard audio encoder block diagram (128 subbands, linear quantizer, coded audio signal).]

Table 5.1 Differences Between Proposed Model and New Attenuation Model

Modification | Parameter | Algorithm Proposed in Literature [Paillard] | New Attenuation Model Algorithm
Signal Decorrelation Method | Method | Modulated Lapped Transform | Pseudo-QMF Filter bank
Signal Decorrelation Method | # of Channels | 128 | 32
Spectral Analysis Method | Method | Modulated Lapped Transform | FFT
Spectral Analysis Method | Resolution (# of Frequency Lines) | 1024 points | 513 points

By making the two modifications above, the new attenuation model is suited for direct use in the ISO/MPEG audio codec standard. In the next two sections, the two modifications are discussed in more detail.

5.3.1.1 Signal Decorrelation Method

Fig.5.3 [Paillard 1992 and Vaidyanathan 1988] compares the filter responses of the Pseudo-Quadrature Mirror Filter (P-QMF) with the Modulated Lapped Transform (MLT). The former is used by the ISO/MPEG audio encoder, the latter by the Paillard encoder (see Fig.5.2). The P-QMF is the preferred method of signal decorrelation, over the MLT, since the P-QMF exhibits a 90 dB reduction in aliasing components, whereas the MLT shows only a 21 dB reduction. In theory, the aliasing components are perfectly cancelled in the reconstructed signal only if perfectly transmitted. In practice, quantization of the frequency components results in imperfect aliasing cancellation.
This problem is reduced if the aliasing components are minimized. This is better achieved with the high aliasing rejection of the P-QMF compared to the lower aliasing rejection of the MLT.

[Figure 5.3 Comparison of Filter Responses of P-QMF and MLT. Pseudo Quadrature Mirror Filter (P-QMF), L=512, M=32: 0 dB to -90 dB. Modulated Lapped Transform (MLT), L=256, M=128: 0 dB to -21 dB.]

5.3.1.2 Spectral Analysis Method for Masking Threshold Calculation

Fig.5.2 shows that the Paillard encoder uses a 1024 pt MLT for spectral analysis as input to the psychoacoustic model. The preferred method of spectral analysis is the 1024 pt FFT. There are two reasons for the preference:

1. Frequency domain resolution.
2. Computational complexity.

Firstly, the spectral resolution of 513 coefficients at a sampling rate of 44.1 kHz (CD quality) is approximately 85 Hz. This is less than the width of the most narrow critical band, 100 Hz (see Table 3.1). There is no need to double the spectral resolution, to 42 Hz, by using a 1024 pt MLT. Secondly, the MLT is computationally more complex than the FFT. The MLT would approximately double the load on the DSP implementing the psychoacoustic model, making real time implementation less feasible. Another benefit of using the FFT is that library routines for its efficient, real time implementation exist. This is not true of the MLT.

Steps to attenuation model calculation of global masking threshold:

A detailed description of the real time calculation of the global masking threshold is depicted in Fig.5.4 and described in Table 5.2. The time domain input block consists of 1152 samples, a parameter defined by the ISO/MPEG standard. The subband domain output is a vector of 32 masking threshold values derived from the global masking threshold. These are used to calculate the 32 signal to mask ratios (see Section 5.3.2). The seven steps required to obtain the 32 masking threshold values are described below (refer to Fig.5.4 and Table 5.2):

1. Two overlapping time domain data blocks are both windowed using 1024 pt Hanning windows.
2. A 1024 pt FFT is performed on each of the overlapping windowed data blocks.
3. Two energy spectra are generated, consisting of 513 coefficients each.
4. The minimum of each spectral line of the two spectra is selected.
5. The minimum spectrum is attenuated by 13 dB.
6. The absolute threshold of hearing is compared to the attenuated minimum spectrum. The global masking threshold is determined as the larger of the absolute threshold of hearing intensity or the attenuated minimum spectrum intensity for each of the 513 spectral lines.
7. The global masking threshold (derived in Step 6) is mapped from the frequency domain into the subband domain. For each of the 32 subbands, the minimum global masking threshold value within a subband is chosen as the mask for that subband.

[Figure 5.4 Attenuation Model: Determination of Global Masking Threshold. Flow: two 1024 pt Hanning windows (A) and (B) applied over 480 old samples and 1152 new samples (1 frame = 1152); 1024 pt FFT of both windowed blocks; calculation of two 513 pt energy spectra; selection of the minimum spectrum; attenuation of the minimum spectrum; inclusion of the absolute threshold of hearing; mapping of the masking threshold into the 32 subbands.]

Table 5.2 Attenuation Model Algorithm Description: Determination of Global Masking Threshold

N | INPUT | PROCESS DESCRIPTION | OUTPUT
1 | 1152 new samples; 480 old samples | 2 x 1024 pt Hanning Windows: window samples 1 to 1024 (A); window samples 576 to 1600 (B) | 1024 windowed samples (A); 1024 windowed samples (B)
2 | 1024 windowed samples (A); 1024 windowed samples (B) | 2 x 1024 pt Real FFTs: real FFT on windowed samples (A); real FFT on windowed samples (B) | 513 pt complex spectrum (A); 513 pt complex spectrum (B)
3 | 513 pt complex spectrum (A); 513 pt complex spectrum (B) | Calculate 2 Energy Spectra: generate energy spectra (A); generate energy spectra (B) | 513 pt energy spectrum (A); 513 pt energy spectrum (B)
4 | 513 pt energy spectrum (A); 513 pt energy spectrum (B) | Determine Minimum Spectrum: Minimum Spectrum[i] = min{Spectrum(A[i]), Spectrum(B[i])}, 1 <= i <= 513 | 513 pt Minimum Spectrum
5 | 513 pt Minimum Spectrum | Attenuate Minimum Spectrum: Attenuated Minimum Spectrum = Minimum Spectrum * -13 dB | 513 pt Attenuated Minimum Spectrum
6 | 513 pt Attenuated Minimum Spectrum | Include Absolute Threshold of Hearing: Masking Threshold[i] = max{Absolute Threshold Hearing[i], Minimum Attenuated Spectrum[i]} | 513 pt Masking Threshold
7 | 513 pt Masking Threshold | Map 513 pt Masking Threshold to 32 Subbands: 1) divide 513 points into 32 subbands (17 masking threshold energies per subband, 1 sample overlap); 2) subband masking threshold = min{17_masking_threshold_energies} | 32 pt Masking Threshold

5.3.2 Calculation of SMR

Calculation of SMR (Signal energy to mask energy ratios) is performed in the subband domain. Fig.5.1 shows that 32 SMR, one for each subband, are calculated by the psychoacoustic model.

Calculation of the SMR requires determination of the signal energy in each of the 32 subbands. The algorithm proposed in the literature [Paillard 1992] calculates average signal energy based on the coefficients of the 128 subbands. If this method were used in a stereo ISO/MPEG codec application, two blocks of 1152 subband coefficients would be required for the signal energy calculation.
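For reference, the mask energy half of the SMR, that is, the seven steps of Table 5.2, can be sketched in floating point as follows. This is a minimal sketch under stated assumptions and not the fixed point DSP56000 implementation: the function names are hypothetical, the second window is assumed to start 576 samples after the first, the absolute threshold is assumed to be supplied already converted to the same energy units as the FFT output, and a small recursive FFT stands in for an optimized library routine.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def energy_spectrum(block):
    """Steps 1 to 3: Hanning window, FFT, and n/2 + 1 energy coefficients."""
    n = len(block)
    windowed = [s * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
                for i, s in enumerate(block)]
    return [abs(c) ** 2 for c in fft(windowed)[: n // 2 + 1]]


def subband_masks(buffer, abs_threshold, atten_db=13.0, hop=576, size=1024):
    """Steps 1 to 7 of Table 5.2 for one 1632-sample buffer (480 old + 1152 new).

    abs_threshold holds 513 energies in the same (assumed) units as the
    FFT energy spectrum; the calibration from dB to those units is not shown.
    """
    spec_a = energy_spectrum(buffer[0:size])            # window (A)
    spec_b = energy_spectrum(buffer[hop:hop + size])    # window (B), assumed offset
    gain = 10.0 ** (-atten_db / 10.0)                   # -13 dB applied to energy
    mask = [max(t, gain * min(a, b))                    # steps 4 to 6
            for a, b, t in zip(spec_a, spec_b, abs_threshold)]
    # Step 7: 17 lines per subband, adjacent subbands sharing one line.
    return [min(mask[16 * k: 16 * k + 17]) for k in range(32)]
```

Taking the minimum of the two spectra and then the minimum within each subband makes every stage err towards a lower, more conservative mask.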
The ISO/MPEG standard specifies that audio data be divided into blocks of 1152 samples by grouping the coefficients of the 32 subbands into blocks of 36. Given the existing hardware platform, as shown in Fig.2.3, the 1024 bytes of common RAM make passing the two blocks of 1152 subband samples between the two DSPs infeasible. This information would have to be passed because the calculation of the SMR resides on DSP1 while the 1152 subband coefficients reside on DSP0. Furthermore, passing only the 32 average subband signal energies implies that the average signal energy calculation is done on DSP0. It is later shown in Fig.6.2 that there is no CPU time available on DSP0 for this calculation. In order to overcome the problem of calculating the average signal energy per subband, a new method is used by the new attenuation model. The method consists of simply using the spectral energy derived from the 1024 pt FFT. This spectral energy information is available on DSP1, the same DSP used for the SMR calculation. Thus, the problem of passing 1152 subband samples between DSP0 and DSP1 is overcome by the new attenuation model.

Calculation of the SMR also requires determination of the mask energy in each subband. These values are determined in Section 5.3.1, Step 7. The 32 SMR are calculated as the ratio of the signal and mask energies in each subband.

5.4 Summary

Two new psychoacoustic models are designed: the spreading function model and the attenuation model. Both models employ new methods in the calculation of a global masking threshold and of signal to mask ratios. The new models, therefore, represent modifications to models found in the literature. The modifications successfully overcome problems with the models found in the literature. These problems were either that the models were too computationally complex for feasible real time implementation, or that they were ill-suited for the ISO/MPEG audio codec.
By overcoming these problems, the two new psychoacoustic models are feasible for real time implementation of the new ISO/MPEG audio codec on the existing hardware platform.

Chapter 6 Experiments and Results

In order to evaluate the two psychoacoustic models developed in Chapter 5, it is necessary to determine the answers to the following two questions:
1. Do the two psychoacoustic models meet real time constraints when implemented in the ISO/MPEG audio codec using the existing hardware platform?
2. Is the audio coding of high quality?
Measurements and listening tests are performed to answer these two questions. In the following sections, the measurements and test results are presented and discussed.

6.1 Meeting Real Time Constraints

The goal of this section is to show that the two new psychoacoustic models developed in Chapter 5 meet real time constraints. To review, implementation of the ISO/MPEG audio codec is achieved using the hardware platform shown in Fig.2.2 and Fig.2.3. Fig.6.1 shows the partitioning of audio encoder tasks between the two DSPs of the encoder. DSP0 is responsible for the following tasks:
1. Generation of the subband coefficients via a P-QMF analysis filter bank.
2. Scaling the subband coefficients by appropriate scalefactors.
3. Quantizing the subband coefficients based on the bit allocation determined by the psychoacoustic model.
4. Assembling the scalefactors and quantized subband coefficients into the ISO/MPEG bitstream.

[Figure 6.1 DSP Partitioning of Tasks for Audio Encoder Implementation. Block diagram: on DSP0, the digital audio signal (705 kbps PCM) passes through the filter bank (32 subbands), the linear quantizer and the bitstream assembly to give the coded audio signal; on DSP1, a 1024 point FFT supplies 513 spectral energy coefficients to the psychoacoustic model, which supplies 32 signal to mask ratios to the dynamic bit allocation; the 32 bit allocation values are passed to DSP0 through the common RAM (1024 bytes).]

[Figure 6.2 Real Time Encoder DSP0 CPU Usage. Bar chart of the percentage of CPU time used by each DSP0 task.]

Fig.6.2 shows the CPU usage for a stereo implementation. The filter bank analysis requires 63% of available real time. Obtaining the scalefactors requires 7%. Quantization requires 18%. Finally, the bitstream assembly requires 12%. The total is 100%. Thus there is no remaining real time for other tasks to be performed by DSP0. For this reason, the psychoacoustic model is implemented on DSP1.

DSP1 is responsible for the following tasks:
1. Implementation of the psychoacoustic model which generates the global masking threshold and signal to mask ratios.
2. Determination of the dynamic bit allocation based on the signal to mask ratios.
Fig.6.3 compares the CPU usage of the spreading function model with that of the attenuation model. In both cases, the dynamic bit allocation requires 15% of real time. The spreading function model requires 80% of available real time. This leaves 5% of real time unused. The less computationally complex attenuation model requires only 65%, leaving 20% of real time unused. These figures show that both psychoacoustic models meet real time constraints.

[Figure 6.3 Comparison of DSP1 CPU Usage: Attenuation Model and Spreading Function Model. Two bar charts of the percentage of DSP1 CPU time used by the psychoacoustic model and by the dynamic bit allocation, and of the time left unused.]

6.2 Listening Tests

The two psychoacoustic models developed in Chapter 5 are evaluated using listening tests. The goal is to assess the coding quality of the models. The models are implemented in a real time ISO/MPEG audio codec operating at 128 kbps per mono channel.
The input audio is a CD quality stereo recording at a bit rate of 705 kbps per channel. The signal thus undergoes compression of about 5.5:1 (705/128). The following sections detail the evaluation procedure:
1. Audio Test Material.
2. Apparatus.
3. Methodology.
4. Results.

6.2.1 Audio Test Material

The European Broadcasting Union (EBU) recommends certain audio programme signals that are appropriate test materials for the sound quality assessment of a codec system. Seven of these signals, representing critical audio signals which are difficult to code, are described in Table 6.1.

These test signals are used to evaluate the two psychoacoustic models developed in Chapter 5. The signals are digital recordings of CD quality, and have durations of approximately 10 to 20 seconds.

Table 6.1 EBU Recommended Audio Programme Test Material

Instrument | “Song Title” or Description | Abbreviation
Violins | “Asa Jinder” | ASA
Harpsichord | Solo Arpeggio | HPS
Castanets | Solo | CAS
Male | German Speech | GSP
Bass Clarinet | Solo Arpeggio | BCL
Bagpipes | “Amazing Grace” | BAG
Violins | “Vi Salde Vara Hemman” | VIS

6.2.2 Apparatus

Listening tests are performed in an anechoic chamber. Removing background noise makes it easier to detect small differences between the original and encoded test signals. Fig.6.4 depicts the setup and environment used to conduct the listening tests. The computer containing the sound files is located outside the anechoic chamber so that its cooling fan cannot be heard. The sound files are input to a D/A converter and a pair of high-quality headphones is used for listening. A monitor displays the test methodology discussed in the next section.

6.2.3 Methodology

The test methodology involves two sessions:
1. Training Session.
2. Test Session.
The listening test environment of Fig.6.4 shows a psychoacoustic model evaluation system displayed on a monitor.
The actual evaluation system displayed to the listener is shown in more detail in Fig.6.5. During the training session, the listener is informed that clicking buttons A, B and C will result in hearing the following signals respectively: the original signal, the signal coded using the attenuation model, and the signal coded using the spreading function model. The listener is asked to write down the times at which differences can be heard between the original and the coded signals. Picking out differences between A, B and C is facilitated by being able to switch arbitrarily between A, B and C using the mouse pointer. Typical training sessions lasted approximately two hours for the seven test signals.

[Figure 6.4 Listening Test Setup and Environment. Diagram: inside the anechoic chamber, a monitor showing the psychoacoustic model evaluation system and a pair of headphones; outside the chamber, the computer and the D/A converter.]

The test session involved scoring the two psychoacoustic models using the MOS (Mean Opinion Score) method. Fig.6.5 shows the MOS scale. The listener can judge the quality of a signal as:
1. Having perceptible noise that is very annoying.
2. Having perceptible noise that is annoying.
3. Having perceptible noise that is slightly annoying.
4. Having perceptible noise, but not annoying.
5. Excellent.

[Figure 6.5 Listening Test Methodology. Screen of the psychoacoustic model evaluation system: elapsed time, audio file name, and MOS score scales for signals B and C (5.0 Excellent; 4.0 Perceptible, But not Annoying; 3.0 Slightly Annoying; 2.0 Annoying; 1.0 Very Annoying). Button A plays the original signal (706 kbps); buttons B and C each play either the encoded signal (128 kbps) or the original signal (706 kbps).]

In contrast to the training session, the listener is informed that clicking button A results in hearing the original signal and that buttons B and C contain either the original signal or the signal encoded with a psychoacoustic model. The listener is asked to score audio signals B and C. Both the attenuation model and the spreading function model were evaluated during the same session. This allows for direct comparison between the two models. Five people volunteered their time to evaluate the two models. Two more also underwent the training session but, on account of the very high coding quality, could not differentiate between the original and coded audio signals.

6.2.4 Results

6.2.4.1. Results of Attenuation Model

The MOS scores resulting from the attenuation model are presented in Fig.6.6. The abbreviations used for the test signals are listed in Table 6.1. In presenting these results, the goal is to reveal the coding quality of the attenuation model.

Differentiating Between Original and Encoded Signals

For each of the seven signals, the left bar represents the mean MOS score of the original signal; the right bar represents the mean score of the encoded signal. The standard deviation is shown by the error bar. The mean MOS scores of the original signal approach 5.0 and exhibit a small standard deviation. These results have two implications:
1. Some listeners could not differentiate between the original signal and the encoded signal. In this case, both signals (buttons B and C of Fig.6.5) were given scores of 5.0, or approaching 5.0. The small standard deviations reflect the uncertainty as to which is the original signal and which is the coded signal.
2. Some listeners could differentiate between the original and the encoded signal.
The original signal was scored as 5.0, and the coded signal was scored appropriately.

[Figure 6.6 MOS Scores: Attenuation Model. Bar chart titled "Subjective Evaluation of Audio Codec: Attenuation Model" showing, for each test signal (GSP, CAS, ASA, BCL, HPS, VIS, BAG), the mean MOS score of the original signal and of the encoded signal, with error bars.]

Recognizing the original signal was facilitated during the test session since the listener was able to compare buttons B and C to reference button A, as shown in Fig.6.5, and the listener was informed that button A was the original. It is concluded from the two implications above that one should expect the original signals to show MOS scores near 5.0 and have a small standard deviation. This is true since if a listener could differentiate between the coded signal and the original signal, a score of 5.0 was entered for the original signal; and if the listener could not differentiate between the coded signal and the original signal, a score of 5.0 was entered for both signals.

Assessment of Audio Quality

Fig.6.6 shows that the standard deviations of the encoded signals are larger than those of the original signals. This reflects the fact that impairments were perceived in the encoded signals by some listeners, whereas for the original signals, scores approaching 5.0 were entered. It is noted that scores above 5.0 and below 1.0 could not be entered by the listener. The fact that some error bars (mean plus one standard deviation) extend above 5.0 reflects that a standard deviation assumes a symmetric distribution about the mean.
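The effect described above is easy to reproduce. The following sketch uses hypothetical scores for five listeners (not data from these listening tests) and shows that a symmetric error bar can extend above the 5.0 ceiling of the MOS scale even though no individual score does. The use of the sample standard deviation and the function name are assumptions made for illustration.

```python
import statistics

MOS_MIN, MOS_MAX = 1.0, 5.0  # limits of the MOS scale shown in Fig.6.5


def mos_summary(scores):
    """Return (mean, sample standard deviation) of a list of MOS scores."""
    if any(s < MOS_MIN or s > MOS_MAX for s in scores):
        raise ValueError("MOS scores must lie between 1.0 and 5.0")
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical scores from five listeners for one encoded signal.
scores = [5.0, 5.0, 5.0, 4.5, 4.0]
mean, std = mos_summary(scores)
upper_bar = mean + std  # top of a symmetric error bar

print(f"mean={mean:.2f} std={std:.2f} upper={upper_bar:.2f}")
print("error bar exceeds scale:", upper_bar > MOS_MAX)
```

For these illustrative scores the mean is 4.70 and the top of the error bar is about 5.15, which lies above the highest score a listener could enter.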
In actuality, the probability distributions of the MOS scores should be truncated below 1.0 and above 5.0. Table 6.2 characterizes perceptible impairments observed by some of the listeners. It is noted that other listeners heard no impairments.

Table 6.2 Descriptions of Encoded Signal Impairments Observed by Some Listeners

Audio Test Signal | Impairment Description
German Speech | No signal degradation.
Castanets | Faint “chirp” at attack of some notes.
“Asa Jinder” | Some faint “jingles”.
Bass Clarinet | “Breathy” modulation of last note.
Harpsichord | Loss of crispness at the attack of some notes.
“Vi Salde Vara Hemman” | Certain sections exhibit slight “chirping”.
Bagpipes | Certain sections exhibit soft “scratches”.

Fig.6.6 shows that all seven coded signals have mean MOS scores above 4.0. The implication is that on average, the coded signals contained no audible impairment, or contained such slightly perceptible impairment that it was judged as not annoying. It is concluded that the attenuation model results in a very high coding quality for all seven test signals.

6.2.4.2. Results of Spreading Function Model

The MOS scores resulting from the spreading function model are presented in Fig.6.7. The abbreviations used for the test signals are listed in Table 6.1. In presenting these results, the goal is to reveal the coding quality of the spreading function model.

Differentiating Between Original and Encoded Signals

Examining the MOS scores of the original signals in Fig.6.7, the results of the spreading function model are similar to those of the attenuation model. The same two implications are made:
1. Some listeners could not differentiate between the original signal and the coded signal. In this case, both signals (buttons B and C of Fig.6.5) were given scores of 5.0, or approaching 5.0. The small standard deviations reflect the uncertainty as to which is the original signal and which is the coded signal.
2. Some listeners could differentiate between the original and the coded signal. The original signal was scored as 5.0, and the coded signal was scored appropriately.

Assessment of Audio Quality

As with the attenuation model, some listeners could hear impairments in the encoded signals, as listed in Table 6.2. Others heard no impairments. Fig.6.7 shows that six of the seven encoded signals resulted in mean MOS scores above 4.0. For these six signals, the implication is that on average, the coded signal contained no audible impairment, or contained such slightly perceptible impairment that it was judged as not annoying. For the castanet signal, labelled as ‘CAS’, the impairment, on average, was judged as slightly annoying. It is concluded that for six of the seven test signals, the coding quality of the spreading function model is very high.

[Figure 6.7 MOS Scores: Spreading Function Model. Bar chart titled "Subjective Evaluation of Audio Codec: Spreading Function Model" showing, for each test signal (GSP, CAS, ASA, BCL, HPS, VIS, BAG), the mean MOS score of the original signal and of the encoded signal, with error bars.]

6.2.4.3. Comparison of Attenuation Model with Spreading Function Model

Both the attenuation model and the spreading function model were assessed during the same listening test session, as described in Section 6.2.3. This allows for a direct comparison of the two models. The mean MOS scores of the encoded signals, using the two models, are shown in Fig.6.8. For 5 of 7 encoded signals, the mean score of the attenuation model exceeds that of the spreading function model. Also, none of the mean scores for the attenuation model falls below 4.0. These results have two implications:
1. For the majority of the test signals, the coding quality of the attenuation model surpasses that of the spreading function model.
2. The coding quality of the attenuation model is very high for all seven test signals.
From these implications, it is concluded that the attenuation model is superior to the spreading function model in that it is capable of coding, with high quality, more test signals than the spreading function model. For this reason, the attenuation model is considered the prime candidate for implementation in the ISO/MPEG codec.

[Figure 6.8 Coding Quality Comparison of the Attenuation Model with the Spreading Function Model. Plot titled "Comparison of Spreading Function and Attenuation Models" showing the mean MOS score of the encoded signal under each model for the test signals GSP, CAS, ASA, BCL, HPS, VIS and BAG.]

6.3 Analyses

The conclusion reached in Section 6.2.4 is that the attenuation model is superior to the spreading function model. Why is this so? This question is answered by analysing the quantization noise levels introduced by both models.

Both the attenuation model and the spreading function model introduce quantization noise into the audio signal being coded. It has been determined through listening experiments that the perceived loudness of this quantization noise is proportional to the number of critical bands in which the noise exceeds the global masking threshold [Schroeder, Atal and Hall 1979]. In other words, the greater the number of critical bands in which noise exceeds the masking threshold, the more audible is the noise.

Fig.6.9 compares the quantization noise level of the attenuation model with the quantization noise level of the spreading function model for the harpsichord signal (see ‘HPS’ in Table 6.1). The quantization noise of the spreading function model for subbands 1 to 6 exceeds both the global masking threshold and the noise of the attenuation model. Subbands 1 to 6 include the first 4 kHz of the audio spectrum and therefore contain 18 critical bands (see Table 3.1, critical bands 1 to 18).
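The count of critical bands covered by a group of subbands can be checked with a short sketch. It assumes a 44.1 kHz sampling rate (so that each of the 32 subbands is about 689 Hz wide), uses the commonly tabulated critical band edges as a stand-in for Table 3.1, and counts a critical band when the midpoint of its edges falls inside the subband range. These choices, and the function name, are illustrative assumptions rather than the procedure used in this thesis.

```python
# Commonly tabulated critical band edges in Hz (assumed to match Table 3.1).
BAND_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
                 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
                 9500, 12000, 15500]

SAMPLE_RATE_HZ = 44100.0  # assumed CD sampling rate
NUM_SUBBANDS = 32
SUBBAND_WIDTH_HZ = SAMPLE_RATE_HZ / 2 / NUM_SUBBANDS


def critical_bands_in_subbands(first, last):
    """Critical band numbers (1-based) whose midpoint lies in subbands first..last."""
    lo = (first - 1) * SUBBAND_WIDTH_HZ
    hi = last * SUBBAND_WIDTH_HZ
    bands = []
    edges = zip(BAND_EDGES_HZ, BAND_EDGES_HZ[1:])
    for number, (f_lo, f_hi) in enumerate(edges, start=1):
        if lo <= (f_lo + f_hi) / 2 < hi:
            bands.append(number)
    return bands


low = critical_bands_in_subbands(1, 6)
print(len(low), low[0], low[-1])
```

Under these assumptions, subbands 1 to 6 cover critical bands 1 to 18, in agreement with the count quoted above. The same function can be applied to any other range of subbands.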
The quantization noise of the attenuation model for subbands 7 to 17 exceeds both the global masking threshold and the noise of the spreading function model. Subbands 7 to 17 cover the frequency range from 4 kHz to 12 kHz and therefore contain only 5 critical bands (see Table 3.1, critical bands 19 to 23). Thus, in the first 18 critical bands, the noise level of the spreading function model exceeds that of the attenuation model. In the next 5 critical bands, the noise of the attenuation model exceeds that of the spreading function model. It is concluded that for the harpsichord signal, quantization noise is more audible for the spreading function model since there is a greater number of critical bands (18 vs 5) in which the noise exceeds both the noise of the attenuation model and the global masking threshold.

[Figure 6.9 Comparison of Attenuation Model with Spreading Function Model Quantization Noise Levels for Harpsichord. Plot of average subband noise energy against subband number, showing the global masking threshold, the attenuation model noise and the spreading function model noise.]

The observations made in Fig.6.9 for the HPS signal are also made for the GSP, ASA, BCL and VIS signals. Fig.6.8 confirms that for the GSP, ASA, HPS and VIS signals, the attenuation model MOS scores exceed those of the spreading function model. Why are the attenuation model MOS scores lower than the spreading function model MOS scores for the BAG and BCL signals? The BAG signal shows a noise level trend opposite to Fig.6.9: the noise level of the attenuation model shows a greater number of critical bands in which the noise exceeds both the noise of the spreading function model and the global masking threshold. Thus the noise of the attenuation model is more audible than that of the spreading function model for the BAG signal. The BCL signal exhibits the same noise level trend as the HPS signal shown in Fig.6.9.
However, the lower score of the attenuation model for the BCL signal is attributed to a software bug in the attenuation model algorithm, which caused the coding quality degradation. Informal listening tests indicate that fixing the bug removes the audio degradation.

It is concluded that the better performance of the attenuation model compared to the spreading function model is due to the quantization noise introduced by the spreading function model being spectrally located such that a greater number of critical bands have noise in excess of both the attenuation model noise and the global masking threshold.

Chapter 7 Summary and Suggestions

7.1 Summary

This thesis provides the design of two new psychoacoustic models used to effect the real time implementation of the new ISO/MPEG audio codec on an existing hardware platform at MPR Teltech, a subsidiary of the British Columbia Telephone Co.

The two new models are named the spreading function model and the attenuation model. The spreading function model is so named because calculation of the global masking threshold takes into account the masking curves which spread signal energies across critical bands. The attenuation model derives the global masking threshold from the original spectrum attenuated by 13 dB. The new models represent modifications to models found in the literature. Models found in the literature suffer from either being too computationally complex for feasible real time implementation, or being ill-suited for direct use in the ISO/MPEG audio codec standard. The new models overcome these problems by introducing new methods in the calculation of a global masking threshold and signal to mask ratios.

Measurements and listening tests were carried out with both psychoacoustic models. The test results point to the following observations:
1. The problem of achieving real time implementation on the existing hardware is successfully overcome. Measurements of CPU usage show that both models meet real time constraints.
Fig.6.3 reveals that CPU headroom of 5% remains in the implementation of the spreading function model. The less computationally complex attenuation model shows 20% CPU headroom remaining.
2. In assessing the coding quality of the attenuation model, Fig.6.6 shows that all seven encoded signals have mean MOS scores above 4.0. The implication is that on average, the encoded signal contained no audible impairment, or contained such slightly perceptible impairment that it was judged as not annoying. It is concluded that the attenuation model results in a very high coding quality for all seven test signals.
3. In assessing the coding quality of the spreading function model, Fig.6.7 shows that six of the seven encoded signals resulted in mean MOS scores above 4.0. For the castanet signal, the impairment, on average, was judged as slightly annoying. It is concluded that for six of the seven test signals, the coding quality of the spreading function model is very high.
4. In comparing the attenuation model with the spreading function model, Fig.6.8 shows that for 5 of 7 test signals, the mean scores of the attenuation model exceed those of the spreading function model. Also, none of the mean scores for the attenuation model falls below 4.0. It is concluded that the attenuation model is superior to the spreading function model in that it is capable of coding, with high quality, more test signals than the spreading function model. For this reason, the attenuation model is considered the prime candidate for implementation in the ISO/MPEG codec.

The two psychoacoustic models developed in this thesis have been designed for real time implementation of the ISO/MPEG audio codec on the existing hardware platform. Several potential applications exist for this audio codec.
The new models initiated in this thesis show promising results in furthering the commercial development of the new ISO/MPEG audio codec.

7.2 Suggestions

1. The two models initiated in this thesis do not use up all available CPU time. Fig.6.3 reveals that there is some real time remaining. Therefore, there is room for further refinements of the two models. For example, the next step in improving the spreading function model is to develop a means of accounting for noise-like and tone-like features of a signal, in real time. Modelling to account for these features is certainly expected to affect the coding quality of the ISO/MPEG audio codec. One possible method is to calculate the predictability of the subband coefficients. If the coefficients of a subband are highly predictable, the signal energy of that subband exhibits a more tonal nature and requires a large bit allocation. In contrast, if the coefficients of a subband show low predictability, the signal energy of that subband is noise-like and thus requires a lower bit allocation.
2. Certain audio applications, such as studio post-processing, editing and mixing, require that an audio codec be tandemmed. Tandemming refers to the process of repeatedly coding and decoding an audio signal. As an audio signal is tandemmed, the introduced quantization noise level increases. The behaviour of the two psychoacoustic models, under tandem conditions, should be investigated. It is desired that the introduced quantization noise be shaped so that its increase due to tandemming is minimized.

BIBLIOGRAPHY

[1] Akansu, A.N. and Haddad, R.A. Multiresolution Signal Decomposition: Transforms, Subbands, and Wavelets. Academic Press, Inc., 1992.
[2] Beerends, J.G. and Stemerdink, J.A. “A Perceptual Audio Quality Measure Based on a Psychoacoustic Sound Representation”. Journal of Audio Engineering Society, vol. 40, no. 7, pp. 963-978, December 1992.
[3] Brandenburg, K. and Seitzer, D. “Low Bit Rate Coding of High Quality Digital Audio: Algorithms and Evaluation of Quality”. Proceedings of the AES 7th International Conference, pp. 201-209, 1988.
[4] Brandenburg, K. and Sporer, T. “NMR and Masking Flag: Evaluation of Quality Using Perceptual Criteria”. Proceedings of the AES 11th International Conference, pp. 169-179, 1992.
[5] Chau, A.K. “Test Procedures For ISO Layer II Audio Codecs”. Master’s thesis, Simon Fraser University, 1993.
[6] Crochiere, R.E. and Rabiner, L.R. Multirate Digital Signal Processing. Prentice-Hall, Inc., 1983.
[7] Davidson, G., Fielder, L. and Antil, M. “High Quality Audio Transform Coding at 128 kbps”. IEEE Proceedings of ICASSP, pp. 1117-1120, 1990.
[8] Dehery, Y.F. “MUSICAM Source Coding”. Proceedings of the AES 10th International Conference, pp. 71-79, 1991.
[9] Dehery, Y.F., Lever, M. and Urcun, P. “A MUSICAM Source Coding Codec for Digital Audio Broadcasting and Storage”. IEEE Proceedings of ICASSP, pp. 3605-3608, 1991.
[10] Fielder, L.D. “Evaluation of the Audible Distortion and Noise Produced by Digital Audio Converters”. Journal of Audio Engineering Society, vol. 35, no. 7/8, pp. 517-535, July/August 1987.
[11] Furui, S. and Sondhi, M. Advances in Speech Signal Processing. Marcel Dekker, 1991.
[12] Galand, C.R. “New Quadrature Mirror Filter Structures”. IEEE Transactions on ASSP, vol. ASSP-32, no. 3, pp. 522-530, June 1984.
[13] Gersho, A. and Gray, R.M. Vector Quantization and Signal Compression. Kluwer Academic Publishers, 1992.
[14] Greenwood, D.D. “Auditory Masking and the Critical Band”. Journal of the Acoustical Society of America, vol. 33, no. 4, pp. 484-502, 1961.
[15] Hellman, R.P. “Asymmetry of Masking Between Noise and Tone”. Perception and Psychophysics, pp. 241-246, 1972.
[16] Herre, J., Eberlein, E., Schott, H. and Schmidmer, C. “Analysis Tool for Realtime Measurements Using Perceptual Criteria”. Proceedings of the AES 11th International Conference, pp. 180-190, 1992.
[17] ISO/IEC JTC1/SC29/WG11 MPEG 92/0. Draft International Standard. “Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to 1.5 Mbit/s”. May 1992.
[18] Jayant, N.S. and Noll, P. Digital Coding of Waveforms: Principles and Applications to Speech and Video. Prentice-Hall, Inc., 1984.
[19] Johnston, J.D. “Estimation of Perceptual Entropy Using Noise Masking Criteria”. IEEE Proceedings of ICASSP, pp. 2524-2527, April 1988.
[20] Kapust, R. “A Human Ear Related Objective Measurement Technique Yields Audible Error and Error Margin”. Proceedings of the AES 11th International Conference, pp. 191-202, 1992.
[21] Lee, H.S. “On the Performance of Speech Waveform Coders with Noise Spectral Shaping”. IEEE Transactions on Communications, vol. 33, no. 7, pp. 742-746, July 1985.
[22] Lutfi, R.A. “Additivity of Simultaneous Masking”. Journal of the Acoustical Society of America, vol. 73, pp. 262-267, 1983.
[23] Malvar, H.S. “Lapped Transforms for Efficient Transform/Subband Coding”. IEEE Transactions on ASSP, vol. 38, no. 6, pp. 969-978, 1990.
[24] Masson, J. and Picel, Z. “Flexible Design of Computationally Efficient Nearly Perfect QMF Filter Banks”. IEEE Proceedings of ICASSP, pp. 541-544, March 1985.
[25] Nussbaumer, H.J. and Vetterli, M. “Computationally Efficient QMF Filter Banks”. IEEE Proceedings of ICASSP, pp. 11.3.1-11.3.4, 1984.
[26] Paillard, B. Codage Perceptuel des Signaux Audio de Haute Qualite. PhD thesis, Universite de Sherbrooke, Fevrier 1992.
[27] Paillard, B., Mabilleau, P. and Morissette, S. “Transparent Coding of a Monophonic Audio Signal at 100 Kb/s”. 92nd AES Convention, March 1992, Preprint # 3224, pp. 1-28.
[28] Paillard, B., Mabilleau, P., Morissette, S. and Soumagne, J. “PERCEVAL: Perceptual Evaluation of the Quality of Audio Signals”. Journal of Audio Engineering Society, vol. 40, pp. 21-31, 1992.
[29] Preuss, R.D. “Very Fast Computation of the Radix-2 Discrete Fourier Transform”. IEEE Transactions on ASSP, vol. ASSP-30, no. 4, pp. 595-607, August 1982.
[30] Princen, J.P. and Bradley, A.B. “Analysis/Synthesis Filter Bank Design Based on Time Domain Aliasing Cancellation”. IEEE Transactions on ASSP, vol. 34, no. 5, pp. 1153-1161, October 1986.
[31] Rabiner, L.R. “On the Use of Symmetry in FFT Computation”. IEEE Transactions on ASSP, vol. ASSP-27, no. 3, pp. 233-239, June 1979.
[32] Rabiner, L.R. and Schafer, R.W. Digital Processing of Speech Signals. Prentice-Hall, Inc., 1978.
[33] Ramstad, T.A. “Considerations on Quantization and Dynamic Bit-Allocation in Subband Coders”. IEEE Proceedings of ICASSP, pp. 841-844, 1986.
[34] Rothweiler, J.H. “Polyphase Quadrature Filters: A New Subband Coding Technique”. IEEE Proceedings of ICASSP, pp. 1280-1283, 1983.
[35] Scharf, B. “Critical Bands”. Foundations of Modern Auditory Theory. Academic, pp. 159-202, 1970.
[36] Schroeder, M.R., Atal, B.S. and Hall, J.L. “Optimizing Digital Speech Coders by Exploiting Masking Properties of the Human Ear”. Journal of the Acoustical Society of America, vol. 66, pp. 1647-1652, December 1979.
[37] Smyth, S.M.F. and Challenger, P. “An Efficient Coding Scheme for the Transmission of High Quality Music Signals”. British Telecom Technological Journal, vol. 6, no. 2, pp. 60-70, April 1988.
[38] Smyth, S.M.F., McCanny, J.V. and Challenger, P. “An Independent Evaluation of the Performance of the CCITT G.722 Wideband Coding Recommendation Using Music Signals”. IEEE Proceedings of ICASSP, pp. 2532-2535, April 1988.
[39] Stoll, G. “Source Coding for DAB and the Evaluation of its Performance: A Major Application of the new ISO Audio Coding Standard”. 1st International Symposium on DAB, pp. 83-97, 1992.
[40] Teh, D., Tan, A. and Koh, S. “Subband Coding of High-Fidelity Quality Audio Signals at 128 kbps”. IEEE Proceedings of ICASSP, vol. 2, pp. 197-200, 1992.
[41] Terhardt, E. “Calculating Virtual Pitch”. Journal of the Acoustical Society of America, vol. 1, pp. 155-182, 1979.
[42] Terhardt, E., Stoll, G. and Seewann, M. “Algorithm for Extraction of Pitch and Pitch Salience from Complex Tonal Signals”. Journal of the Acoustical Society of America, vol. 71, no. 3, pp. 678-688, March 1982.
[43] Treichler, J.R., Johnson, C.R. and Larimore, M.G. Theory and Design of Adaptive Filters. John Wiley & Sons, 1987.
[44] Vaidyanathan, P.P. “A Tutorial on Multirate Digital Filter Banks”. IEEE International Symposium on Circuits and Systems, pp. 2241-2248, 1988.
[45] Vandendorpe, L. “Optimized Quantization for Image Subband Coding”. Signal Processing: Image Communication, vol. 4 (1), pp. 65-80, November 1991.
[46] Veldhuis, R.N.J., Breeuwer, M. and van der Waal, R. “Subband Coding of Digital Audio Signals Without Loss of Quality”. IEEE Proceedings of ICASSP, pp. 2009-2012, 1989.
[47] Vetterli, M. and Le Gall, D. “Perfect Reconstruction FIR Filter Banks: Lapped Transforms, Pseudo-QMF’s and Paraunitary Matrices”. IEEE International Symposium on Circuits and Systems, vol. 2, pp. 2249-2253, May 1989.
[48] Wang, Z. “On Computing the Discrete Fourier and Cosine Transforms”. IEEE Transactions on ASSP, vol. ASSP-33, no. 4, pp. 1341-1344, October 1985.
[49] Wiese, D. and Stoll, G. “Bit Rate Reduction of High Quality Audio Signals by Modeling the Ears Masking Thresholds”. 89th AES Convention, 1990, Preprint # 2970, pp. 1-16.
