ANALYSIS AND EVALUATION OF AN ADAPTIVE SILENCE DELETION ALGORITHM FOR COMPRESSION OF TELEPHONE SPEECH

By Clifford Loo
B.A.Sc. (Electrical Engineering with Computer Engineering Option), University of British Columbia

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF ELECTRICAL ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April 1997
© Clifford Loo, 1997

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
2356 Main Mall
Vancouver, Canada V6T 1Z4

Abstract

This thesis is concerned with the analysis and evaluation of adaptive silence deletion as a means to compress telephone voice signals bandlimited to the range 200-3400 Hz. Speech is accompanied by noise arising from various environmental factors such as poor reception, interference of radio signals from mobile or cordless units, audible mechanical or social activities in the surroundings, and the conventional crosstalk and hum in the telephone system.

A speech compression system based on significant modifications to an existing silence deletion algorithm has been implemented. Effects of the various system parameters on the operation of the system, as applied to telephone speech samples, are studied and analyzed graphically. Quality of the speech compression is assessed with subjective listening tests. With minimal algorithmic complexity and delay, the application of silence coding together with 4-bit ADPCM speech coding can compress uncoded telephone speech from an original bit rate of 128 kbps down to 16 kbps.

Analysis of system performance shows that a processing frame size of 8 to 16 milliseconds yields the best combination of speech quality and compression efficiency. A set of system parameters is found to give robust performance in a wide range of operating environments, with different or varying speech and noise levels. Good playback quality resulting from compressed speech recorded in quiet and also in noisy environments is achieved at 50 percent compression, equivalent to half the bit rate of ADPCM.

Table of Contents

Abstract
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Speech Compression
  1.2 Silence Compression
  1.3 Background
  1.4 Outline of Thesis
2 Detection of Silence Intervals
  2.1 Decision Criteria
    2.1.1 Signal Energy
    2.1.2 Zero-crossing Rate
    2.1.3 Average Magnitude Factor
  2.2 Adaptive Thresholds
    2.2.1 Long-term Averages
    2.2.2 Threshold Hysteresis and Hangover
  2.3 Silence Detector
3 A Speech Compression System Based on Silence Deletion
  3.1 Silence Deletion Algorithm
    3.1.1 System Initialization
    3.1.2 Calculation of Long-term Averages
    3.1.3 System Parameters
  3.2 Coding of Speech and Silence Frames
    3.2.1 Speech Coding
    3.2.2 Silence Coding
  3.3 Silence Insertion and Decoding
  3.4 Peripherals
    3.4.1 Filtering and Sampling
    3.4.2 Quantization and Coding
4 Analysis of System Performance
  4.1 Frame Size and Delay
  4.2 Zero-crossing Rate
  4.3 Signal Energy Versus Average Magnitude Factor
  4.4 Segmentation Analysis
  4.5 Initialization and Recovery from Detection Error
  4.6 Rate of Compression
    4.6.1 Effects of Detection Threshold
    4.6.2 Effects of Frame Size
    4.6.3 Effects of Background Noise Level
5 Subjective Listening Evaluation
  5.1 Objectives
  5.2 Scoring
  5.3 Organization and Results
    5.3.1 Speech Coding
    5.3.2 Amount of Compression
    5.3.3 Amount of Re-inserted Silence
    5.3.4 Energy Level of Re-inserted Silence
    5.3.5 Segmentation Size
    5.3.6 Background Noise
6 Conclusions
  6.1 Summary of Findings
  6.2 Future Work
Bibliography
A Listening Tests
  A.1 Instructions
  A.2 Scores
B System Software
  B.1 The Simulator Program
  B.2 The Speech Compression System
    B.2.1 TMS320C30 Assembly Codes
    B.2.2 C Codes

List of Tables

2.1 Various frame sizes with matching averaging periods
3.1 The silence detection threshold modes
3.2 Adjustable parameters of the silence deletion algorithm
3.3 Maximum lengths of silence allowed by the silence code
3.4 Mapping of a G.711 8-bit μ-law PCM coder
4.1 Mean segmentation size of silence deletion using the signal energy criterion E
4.2 Mean segmentation size of silence deletion using the AMF criterion
4.3 Combinations of parameters that produce a segmentation size of 181.8182 ms
5.1 Quality rating scale for an absolute category rating (ACR) test
5.2 Quality rating scale for a degradation category rating (DCR) test
5.3 Variables subjectable to evaluation by listening tests
5.4 Results of Subjective Listening Test Part 1 (samples produced from a relatively clear original)
5.5 Results of Subjective Listening Test Part 2 (samples produced from a relatively noisy original)
A.1 Individual scores for Subjective Listening Test Part 1
A.2 Individual scores for Subjective Listening Test Part 2

List of Figures

1.1 Waveform of the phrase "the navy attacked"
1.2 Waveform of the phrase "two factors here"
2.1 The phrase "the navy attacked" (a) sampled at 8 kHz, with (b) its signal energy, (c) zero-crossing rate, and (d) average magnitude factor calculated for a frame size of 4 ms
2.2 The phrase "two factors here" (a) sampled at 8 kHz, with (b) its signal energy, (c) zero-crossing rate, and (d) average magnitude factor calculated for a frame size of 4 ms
2.3 The μ-law encoder function with 16-bit signed integer input x and 8-bit unsigned integer output y
2.4 The one's complement function with 8-bit unsigned integer input y and 8-bit signed integer output z
2.5 The magnitude factor (MF) function with 16-bit signed integer input x and 8-bit signed integer output z
2.6 A speech sample of (a) large amplitude, with its (b) signal energy, (c) zero-crossing rate, and (d) average magnitude factor, all calculated for a frame size of 4 ms
2.7 A speech sample of (a) small amplitude, with its (b) signal energy, (c) zero-crossing rate, and (d) average magnitude factor, all calculated for a frame size of 4 ms
2.8 The E-Threshold hysteresis
2.9 The AMF-Threshold hysteresis
3.1 Speech compression system with silence coding
3.2 The silence deletion algorithm
3.3 The μ-law compressor characteristic
3.4 Resolution of the μ-law function: 16-bit signed integer input x encoded to 8-bit unsigned integer, which is then decoded to 16-bit signed integer r
3.5 Silence code format
3.6 Calculation of the lowest frequency f that can produce the silence code for μ-law PCM (with no overloading in the signal)
3.7 Decoding of silence-compressed data
4.1 The phrase "the porch steps" with its short-time zero-crossing rate Z calculated for K = 4, 8, 16, 32, 64, 128
4.2 The phrase "the wide road" with its short-time average magnitude E calculated for K = 4, 8, 16, 32, 64, 128
4.3 The phrase "the wide road" with its AMF calculated for K = 4, 8, 16, 32, 64, 128
4.4 Silence deletion of the phrase "the porch steps" using the energy and the zero-crossing criteria
4.5 Silence deletion of the phrase "the porch steps" using the energy criterion alone
4.6 Silence deletion of speech with a background hum
4.7 The hum reducing the zero-crossing rate
4.8 Silence deletion of the phrase "the porch steps" with added noise, using the energy and the zero-crossing criteria
4.9 Silence deletion of the phrase "the porch steps" using the AMF criterion
4.10 Mean segmentation size of silence deletion using the signal energy criterion E
4.11 Mean segmentation size of silence deletion using the AMF criterion
4.12 Segmentation profiles for silence deletion using the energy criterion E at the settings K = 4, M, N = 10; K = 16, M, N = 4; K = 128, M, N = 1, and for silence deletion using the AMF criterion at the settings K = 4, M, N = 7; K = 8, M, N = 5; K = 32, M, N = 4
4.13 Segmentation profile for silence deletion using the AMF criterion at the setting of K = 32, M, N = 2
4.14 50% silence deletion of the sample "DEC23A_x.dat" using the signal energy criterion, at K = 4 and M, N = 10
4.15 50% silence deletion of the sample "DEC23A_x.dat" using the signal energy criterion, at K = 128 and M, N = 1
4.16 50% silence deletion of the sample "DEC23A_x.dat" using the AMF criterion, at K = 32 and M, N = 4
4.17 Silence deletion of sample "apr18c.dat" with preset initialization values for the long-term averages
4.18 Silence deletion of sample "jul18A.dat" with preset initialization values for the long-term averages
4.19 Silence deletion of sample "jul18A.dat" with run-time initialization values for the long-term averages
4.20 Silence deletion of sample "apr18c.dat" with run-time initialization values for the long-term averages (i)
4.21 Silence deletion of sample "apr18c.dat" with run-time initialization values for the long-term averages (ii)
4.22 Silence deletion of sample "abrupt.dat" with run-time initialization, at a frame size of K = 128
4.23 Silence deletion of sample "abrupt.dat" with run-time initialization, at a frame size of K = 64
4.24 Silence deletion of sample "abrupt.dat" with run-time initialization, at a frame size of K = 16
4.25 Silence deletion of sample "abrupt.dat" with run-time initialization, at a frame size of K = 4
4.26 Compression rate against threshold factors, for a frame size of K = 64 (i)
4.27 Compression rate against threshold factors, for a frame size of K = 64 (ii)
4.28 Compression rate against threshold factors, for a frame size of K = 64 (iii)
4.29 Compression rate against threshold factors, for a frame size of K = 4 (i)
4.30 Compression rate against threshold factors, for a frame size of K = 4 (ii)
4.31 Compression rate against threshold factors, for a frame size of K = 4 (iii)
4.32 Compression characteristics of progressively noisier speech samples based on the sample "DEC23A_x.dat" (compressed at K = 64, M = N = 1)
4.33 Compression characteristics of progressively noisier speech samples based on the sample "DEC23D_s.dat" (compressed at K = 64, M = N = 1)
5.1 Silence compressed speech recovered with different speech codings: μ-law PCM and ADPCM
5.2 Comparison of MOS between ADPCM and μ-law PCM
5.3 Comparison of MOS between 60% and 50% silence compression
5.4 Comparison of MOS between 50% and 100% silence expansion
5.5 Comparison of MOS between 50% and 100% silence energy in expansion
5.6 Comparison of MOS between 4 frame sizes
5.7 Comparison of MOS between clear and noisy samples
A.1 On-screen instructions for the subjective listening tests

Acknowledgements

I want to thank my supervisor, Dr. R. W. Donaldson, without whose continued guidance and encouragement completion of this work would not have been possible. I also want to thank my family and friends who have supported me in various ways throughout the years of work on this thesis, making it an enjoyable task. Last but not least, my thanks and appreciation go to the 29 individuals who volunteered their precious time to help me finish the subjective listening tests.

Chapter 1
Introduction

Speech and audio signals comprise a significant portion of the content of telecommunication transmission and storage. Recent years, in particular, have witnessed rapid growth in the demand for voice communication, spurred mostly by the popularization of cellular mobile communication and an upsurge of interest in multimedia content in data networks. More than ever, there is a need to conserve bandwidth in both wired and wireless telecommunication networks, and to conserve disk space in voice storage systems. There has been a large increase in research and development in the coding of wideband speech (7-kHz bandwidth) for audioconference applications, and of high-fidelity audio signals (20-kHz bandwidth) for the coding of compact disc (CD) quality material. Telephone speech, however, limited to a bandwidth of 3.2 kHz (200 Hz to 3.4 kHz), remains the most widely used and most in-demand audio source.
For conventional data, primarily in the form of text, many effective lossless coding (or compression) schemes exist [1, 2]. Speech, however, is difficult material to compress, and all practical speech coding schemes are lossy. As Shannon showed in his seminal work on information and coding theory [3, 4], a signal source can be coded with zero error at any data rate equal to or greater than its entropy (a measure of the information content of the source). Speech and audio signals are examples of infinite-alphabet, analog sources, for which the encoding error tends to zero only at an infinite bit rate.

It is possible to achieve significant compression of speech at the expense of some distortion. Compression of audio signals always involves tradeoffs between signal quality, coding efficiency, complexity, and delay; these are the attributes most often considered in evaluating the performance of any speech coding system [5, 6]. Signal quality is that perceived by a human receiver, often measured on a five-point absolute quality or relative impairment scale. Efficiency, the main objective of coding, is expressed as a reduction in bandwidth (in Hz) or bit rate (in bits per second) of the original signal source. Complexity of the coding algorithm is the computational effort required to implement the encoding and decoding processes in signal processing hardware, typically measured in terms of arithmetic capability (in instructions per second) and memory requirement (in bytes). Finally, it takes time to buffer (algorithmic delay), to encode and decode (processing delay), and to transmit speech (communication delay), all of which contribute to the one-way system delay [6]. In the absence of echo control, one-way delay should not exceed 25 ms for network telephony.

1.1 Speech Compression

Audio telephony signals (bandlimited to 200-3400 Hz) are sampled at 8 kHz (compared with 16 kHz for teleconferencing and 44.1 kHz for compact discs). The analog-to-digital converter output is often a linear pulse-code-modulated (PCM) signal, with a resolution of 16 bits per sample (65536 levels, divided into 32767 positive and 32768 negative levels of uniform step size). The resulting bit rate of 128 kbps serves as a reference bit rate for uncoded speech [7]. Taking advantage of the redundancies inherent in speech signals, many methods of digital speech coding have been developed, and standards created.

To better tailor the quantization levels to the non-Gaussian dynamics of speech, a compressor/expandor (compandor) system for quantization was standardized in CCITT Recommendation G.711 in 1972. (The CCITT, the International Telephone and Telegraph Consultative Committee, is the predecessor of the Telecommunication Standardization Sector of the International Telecommunication Union, ITU-T.) The 64 kbps 8-bit companded PCM (μ-law in North America and Japan, A-law in the rest of the world) is generally taken as the standard for toll or network quality representation of the speech waveform [8, 9].

By exploiting the strong correlation between speech samples, and further adapting the quantization step size to the nonstationary nature of speech, adaptive differential PCM (ADPCM) systems were developed. The G.721 4-bit ADPCM coder was standardized in 1984 (and revised in 1986) [10], followed by the 5-bit, 3-bit, and 2-bit versions (G.723 and G.726).
Embedded ADPCM, which enables bit dropping and hence the option of a variable bit rate for congestion control, became a standard in 1990 (G.727) [11]. While 5-bit and 4-bit ADPCM systems provide toll quality speech at 40 and 32 kbps, respectively, the 3-bit and 2-bit versions at 24 kbps and 16 kbps do not have as consistent a quality.

Combining the techniques of linear predictive coding (LPC) with the principle of vector quantization (essentially the block source coding scheme of Shannon [4]), code-excited linear prediction (CELP) provides toll quality speech at 16 kbps or even lower rates [12, 13]. In 1992, the CCITT adopted the 16 kbps low-delay CELP (LD-CELP) as a standard (G.728) [14]. Between 1995 and 1996, 8 kbps conjugate-structure algebraic-code-excited linear prediction (CS-ACELP), along with 6.3 and 5.3 kbps speech coders, were approved as ITU Recommendations G.729 and G.723.1, respectively [6]. Current research is now focused on achieving toll quality speech at and below 4 kbps, employing techniques such as sinusoidal coding and waveform interpolation [7].

1.2 Silence Compression

Speech coding at medium to low bit rates (16 kbps and lower) often entails considerable algorithmic complexity and processing delay. Many of the newer applications of speech coders, including cellular telephony and voice storage systems, do not require a fixed bit rate. One simple and effective way of compressing speech for these applications involves removal of the numerous silence intervals which are present in speech between sentences, phrases, words, and even syllables. These silences contain insignificant acoustic energy or speech content, appropriately defined.

Of course, a simple mapping between linguistic units in a spoken language and the written language does not always exist [15], and silence intervals in speech do not necessarily correspond to textual word boundaries. Separate words may coalesce into one when spoken, and intra-word silence intervals also exist, particularly before the stop consonants /p/, /b/, /k/, /g/, /t/, and /d/. That is partly why endpoint detection for isolated word recognition is such a challenging problem. Figures 1.1 and 1.2 show the waveforms of the phrases "the navy attacked" and "two factors here," respectively, sampled at 8 kHz and linearly quantized to a precision of 16 bits per sample. The silence intervals seem to dictate a grouping of the spoken words rather as "the na vyat ta cked" and "twofac torshere."

Figure 1.1: Waveform of the phrase "the navy attacked."

Figure 1.2: Waveform of the phrase "two factors here."

Regardless of where the silence intervals occur, a reduction in bandwidth or bit rate of the speech signal can normally be achieved by removing silence intervals before transmission or storage. Decompression of the signal at the point of reception or retrieval involves expanding or regenerating the silence intervals at the appropriate energy level.

In a time-sharing system that assigns channels, on demand, to active sources, silence intervals are those periods when the channel is assigned to another, non-silent source. A statistical multiplexor allocates the use of a channel according to the activity of the various users. There is no need to record the position and length of a silence interval; silence begins when a user is switched off from the channel, and it ends when the user is switched back on.
Assuming that the background noise is stationary and its general characteristics are known, a random "comfort noise" [7] can be supplied at the receiving end, to approximate the original silence intervals, when the channel is re-assigned to another user.

In general, however, the length and energy level are transmitted (or stored) at the occurrence of a silence interval. This coding of the silence interval into non-speech data is called silence compression or silence coding. Silence coding is akin to run-length coding [1], where a special marker is inserted to indicate a run of silence, followed by two numbers indicating the silence length and energy level. There is overhead associated with this information, but by coding only silence intervals greater than a minimum length, a significant reduction in average bit rate can be guaranteed.

1.3 Background

Silence detection, also known variously as speech detection, voice activity detection (VAD), and endpointing, has long been an essential component of many speech processing systems. Detecting the presence or level of speech in a background of noise is a problem common to variable rate speech coding [16, 17] and speech recognition [18, 19, 20], while packet-switched transmission systems [21] and voice mail systems [22, 23, 24] can benefit from the reduction in bit rate and storage space, respectively, that results from its use. More recently, VAD has found much application in cellular and satellite mobile telephony [25, 26, 27, 28], particularly since the introduction of code division multiple access (CDMA) for digital cellular telephony [7, 29, 30].

The first studies on silence intervals in telephone conversations date back to the work of Norwine and Murphy in 1938 [31]. Since then, efforts to increase channel capacity for voice communications have resulted in time assignment speech interpolation (TASI) systems [32] in the 1960s and later, combined with digital speech coding techniques, in digital speech interpolation (DSI) systems [33]. Studies have shown that, in a typical two-way telephone conversation, speech is present at most 50% of the time in each direction of transmission [34, 35]. By efficiently using the silence intervals, the channel capacity can be doubled in circuit-switched voice transmission [35]. In speech interpolation systems the speech detector generates on-off patterns to enable assignment of channels to active users.

Early silence detection algorithms detect silence or speech by comparing the signal level, the signal energy or its envelope, the zero-crossing rate, or combinations of these with preset threshold values [36, 37, 38]. Over time, silence detectors have grown in sophistication. Both Fariello [39] and Jankowski [40] employed peak detection. Un and Lee [41] developed a speech/silence discriminator based on counting bit alternations of the bit stream from linear delta modulation. Lamel's improved endpoint detector for isolated word recognition uses four energy thresholds to define the presence of an "energy pulse," a speech-like burst of energy [42]. Yatsuzuka [43] designed a speech detector for DSI-ADPCM systems that utilizes the periodicity of the sign bit sequences of the input signal. The logarithmic energy is used in Hahn's speech detector [44]. In 1976, Atal and Rabiner proposed a statistical pattern recognition approach to the voiced-unvoiced-silence classification of speech [45, 46].
Since then, many researchers have taken a statistical-decision approach to the problem of silence detection [47, 20, 48]. Parameters pertinent to the characteristics of the speech signal, such as speech energy, autocorrelation, and even predictor coefficients, are estimated in a training phase. Actual measurements of these parameters are then made for each data frame. Classification of the data frame is made on a statistical basis, by deciding how likely the measured parameters are to deviate from the pattern estimates. These estimates are updated regularly, in order for the detector to adapt to the nonstationary nature of both the signal and the ambience.

As computing power increases, more computationally demanding techniques have been employed to detect voice activity. Huang [49] used the Walsh spectra of speech data as the basis for detecting endpoints of isolated utterances. Haigh [50] reported on the success of a cepstral-based algorithm. Exploiting the nonlinearities in the speech production model, Rangoussi developed a speech detector based on the non-zero third-order statistics of speech signals [51, 52]. The Teager energy measure is used in Ying's endpoint detection algorithm [53]. Amidst all this research into increasingly complex silence detectors, Gan [23, 54], Savoji [18], Rose [24, 31], Taboada [55], and Jacobs [56] have illustrated the feasibility and effectiveness of a simple design, using the conventional short-time energy and zero-crossing rate, the average magnitude factor (AMF), and a few adaptive factors.

1.4 Outline of Thesis

This thesis is concerned with the performance of an adaptive silence compression algorithm as applied to speech recorded from the telephone network, which may include various kinds of noise ranging from radio interference to crosstalk. With low computation demands, an average bit rate of around 16 kbps can be achieved. Our silence deletion algorithm is based in large part on the one previously used by Rose [24, 31]. The modifications introduced accommodate the real-time coding of speech and silence frames, enable further savings in memory usage and computation, and adapt to the more erratic noise characteristics of actual telephone recordings. While focusing on algorithmic simplicity, the present thesis develops an in-depth understanding of silence detection in various noisy environments and brings insight to the choice of system parameters under such conditions, emphasizing particularly the segmentation size considerations for speech and silence intervals. Informal subjective listening tests have been performed to evaluate the speech quality of the system.

Chapter 2 describes the components of an adaptive silence detector. Chapter 3 discusses implementation issues for a proposed speech compression system based on silence deletion. Chapter 4 is a study of the internal dynamics as well as the compression characteristics of the system, with speech samples recorded in various situations from telephone lines. Chapter 5 presents the organization and results of subjective listening tests conducted to evaluate the perceived quality of the speech which has been compressed, transmitted, and reconstituted. Chapter 6 summarizes the findings of this thesis and suggests some topics for future work.

Chapter 2
Detection of Silence Intervals

Silence detection really involves identification of those periods of time when voice activity is absent; background noise, however, may be present [20].
Speech in general, and telephone conversations in particular, consists of intermittent talkspurts separated by pauses or periods of silence. The process of identifying when talkspurts occur is called voice activity detection (VAD) [7]. A silence detector in general measures specific characteristics of the speech signal (producing short-time statistics), compares these to threshold values, and declares an acoustic segment as speech or silence according to whether a statistic is below or above a threshold. Depending on the speaker and the communication environment, telephone speech signals can undergo variations in characteristics from call to call, or even within the duration of a single call. An adaptive silence detector maintains dynamic thresholds that track the current state of the speech signal.

2.1 Decision Criteria

Based on the general characteristics exhibited by the presence and absence of voice activity, two standard statistics have been developed for use as decision criteria for discriminating speech from background noise: the signal energy and the zero-crossing rate [57, 9]. A third criterion, the average magnitude factor (AMF), was adopted with slight modification from the work of Jacobs, Eleftheriadis and Anastassiou [56] for consideration in this thesis.

The silence detector processes the input stream of samples by dividing it into non-overlapping data frames, each of size K. The three statistics are evaluated for each frame. If the signal energy E, the zero-crossing rate Z, or the AMF exceeds a certain threshold, the frame is declared speech; otherwise, the frame is dismissed as silence. (When the silence detector is incorporated into the silence deletion algorithm, the decision of whether to code the frame as speech, or to discard it, is made also in accord with the current state of the silence deletion algorithm.)

2.1.1 Signal Energy

In voiced speech, the signal x(t) generally exhibits relatively large signal energy, which can be characterized by its short-time energy. Because of the squaring operation in its calculation (\sum x^2(t)), the short-time energy is particularly sensitive to large signal levels; this problem is often solved by using the short-time average magnitude instead [9, 31, 58]. Eliminating the squaring operation also simplifies the calculation, which may be critical in a real-time implementation [31]. Since the short-time average magnitude gives a good indication of the short-time energy, it will be loosely referred to as the signal energy in this thesis. It is calculated as

    E = \frac{1}{K} \sum_{j=1}^{K} |x_j|    (2.1)

where x_j is the j-th sample in a frame.

Typically, speech remains stationary over frames on the order of 20 milliseconds [58]. For speech signals sampled at 8 kHz, therefore, the frame size K should not exceed 8 x 20 = 160 samples.

Figures 2.1(b) and 2.2(b) are plots of the signal energy of the speech waveforms shown in Chapter 1 (calculated for a frame size of K = 32). The signal energy, when above threshold, indicates the presence of speech; the threshold has a value related to the silence energy.

Figure 2.1: The phrase "the navy attacked" (a) sampled at 8 kHz, with (b) its signal energy, (c) zero-crossing rate, and (d) average magnitude factor calculated for a frame size of 4 ms.

Figure 2.2: The phrase "two factors here" (a) sampled at 8 kHz, with (b) its signal energy, (c) zero-crossing rate, and (d) average magnitude factor calculated for a frame size of 4 ms.
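As a minimal illustration of Equation 2.1, the short-time average magnitude can be computed per frame as sketched below. The function name and types are illustrative and are not taken from the thesis software in Appendix B.

```c
/* Short-time average magnitude (Eq. 2.1): a hypothetical helper, not the
 * thesis code itself.  x points to one frame of K linear 16-bit samples. */
double frame_average_magnitude(const short *x, int K)
{
    long sum = 0;
    for (int j = 0; j < K; j++) {
        sum += (x[j] >= 0) ? x[j] : -(long)x[j];   /* accumulate |x_j| */
    }
    return (double)sum / K;                        /* E = (1/K) * sum |x_j| */
}
```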
2.1.2 Zero-crossing Rate

The zero-crossing rate can be regarded as an indicator of the frequency content of the input signal. For unvoiced fricatives and stop consonants, the short-time energy is relatively low and concentrated mostly in the high frequency region. In a background of ambient noise arising from air movement, the high frequencies in fricatives produce a higher zero-crossing rate. Gan [23] found the zero-crossing rate to be effective in detecting the presence of speech in a noisy background. The zero-crossing rate [9, 58] is calculated as

    Z = \frac{1}{K} \sum_{j=1}^{K} \frac{|sgn(x_j) - sgn(x_{j-1})|}{2}    (2.2)

where

    sgn(x) = +1 if x >= 0, -1 if x < 0    (2.3)

and x_j is the j-th sample in a speech frame. The rate Z is the average number of zero-crossings per speech sample.

Figures 2.1(c) and 2.2(c) are plots of the zero-crossing rate of the example speech waveforms, calculated for a frame size of K = 32. In the latter figure, a peak in zero-crossing rate can be found in the interval 10900-11000 ms, which corresponds to the fricative sound in the word "factors" (see Figure 1.2). As a criterion for silence detection, the zero-crossing rate indicates the presence of speech when it rises above a Z-Threshold, Z_T.
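A corresponding sketch for the zero-crossing rate of Equations 2.2 and 2.3 follows; as before, the function is illustrative only.

```c
/* Zero-crossing rate (Eqs. 2.2-2.3): a hypothetical helper, not the thesis
 * code.  x_prev is the last sample of the previous frame (plays x_0). */
double frame_zero_crossing_rate(const short *x, int K, short x_prev)
{
    int crossings = 0;
    int prev_sign = (x_prev >= 0) ? 1 : -1;
    for (int j = 0; j < K; j++) {
        int sign = (x[j] >= 0) ? 1 : -1;        /* sgn(x_j), Eq. 2.3 */
        if (sign != prev_sign) crossings++;     /* |sgn(x_j) - sgn(x_{j-1})| / 2 */
        prev_sign = sign;
    }
    return (double)crossings / K;               /* crossings per sample */
}
```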
2.1.3 Average Magnitude Factor

The variability in the signal-to-noise ratio, a result of the differing environments in which speech is recorded, makes it difficult, and sometimes impossible, to select an optimal energy threshold. With a noisy channel or weak signal reception, the long-term averages cannot reliably track the speech and silence energies. This will be discussed further in Section 4.5. Meanwhile, we consider an alternative detection criterion, the average magnitude factor (AMF).

The application of the AMF in a silence detection algorithm was conceived with packets of μ-law PCM encoded speech data in mind [56, 59]. The AMF is a low-complexity calculation which operates directly in the non-linear μ-law domain. In data networks and in the telephone backbone network in North America, speech samples are conveniently available in μ-law PCM format.

The encoder function of an ITU G.711 μ-law PCM quantizer appears in Figure 2.3. (Section 3.2.1 describes μ-law PCM.)

Figure 2.3: The μ-law encoder function with 16-bit signed integer input x and 8-bit unsigned integer output y.

The one's complement function takes y as an input and produces the output z according to

    z = \begin{cases} y, & 0 \le y < 128 \\ y - 255, & 128 \le y \le 255 \end{cases}    (2.4)

In binary representation, z is found by inverting each bit of y. The one's complement function is plotted in Figure 2.4. The function that results from cascading μ-law encoding with Equation 2.4, plotted in Figure 2.5, is called the magnitude factor (MF) of x [56]. A mathematical description of the MF is derived from Figure 2.5 and Equation 3.11:

    MF_j = \begin{cases} -Y_{max} \dfrac{\log(1 + \mu|x_j|/X_{max})}{\log(1+\mu)}, & x_j \ge 0 \\ Y_{max}\left(1 - \dfrac{\log(1 + \mu|x_j|/X_{max})}{\log(1+\mu)}\right) - 1, & x_j < 0 \end{cases}    (2.5)

where x_j is the j-th sample in a frame, μ = 255, and X_max and Y_max are the maximum values of the input and the output, respectively.

Figure 2.5: The magnitude factor (MF) function with 16-bit signed integer input x and 8-bit signed integer output z.

Here we deviate from Jacobs [56] by defining the AMF as the MF averaged over a speech frame (K speech samples), as opposed to the low-pass filtered MF signal:

    AMF = \frac{1}{K} \sum_{j=1}^{K} MF_j    (2.6)

(The low-pass filter achieves a similar averaging effect.) This simplifies calculation and speeds up processing, with no noticeable detriment to the silence-detecting power of the AMF. Figures 2.1(d) and 2.2(d) are plots of the AMF, which indicates the presence of speech when it drops below a threshold. This threshold value is related to the AMF of the silence intervals.

The approach taken by Jacobs et al., based on the small- and large-signal behaviour of the speech waveform in the μ-law (i.e., logarithmic) domain, offers a feasible means of handling the large variability of speech amplitude levels. Using the logarithms of the actual data values reduces the variability exponentially. In the linear domain, the values of speech and silence energies can vary tenfold or more from one speech sample to another because of noise or bad signal transmission. (In one extreme case, a hundredfold difference was observed between two speech samples.) Performing calculations in the logarithmic domain can be seen as an extension of the concept of reducing sensitivity to signal level variation, previously applied in using the short-time average magnitude instead of the short-time energy.

Figures 2.6 and 2.7 show two speech samples of contrasting signal amplitudes, each accompanied by its short-time statistics. Note that while the range of the signal energy spans a few tens of thousands in Figure 2.6 and a few hundreds in Figure 2.7, that of the AMF remains within ±100 in both cases.

Figure 2.6: A speech sample of (a) large amplitude, with its (b) signal energy, (c) zero-crossing rate, and (d) average magnitude factor, all calculated for a frame size of 4 ms.

Figure 2.7: A speech sample of (a) small amplitude, with its (b) signal energy, (c) zero-crossing rate, and (d) average magnitude factor, all calculated for a frame size of 4 ms.
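A sketch of the AMF computation is given below. It assumes a G.711 μ-law encoder is available elsewhere in the system software; the function name g711_ulaw_encode is hypothetical, and the one's complement mapping follows Equation 2.4 as stated above.

```c
#include <stdint.h>

/* Hypothetical G.711 mu-law encoder returning the 8-bit unsigned codeword y
 * of Figure 2.3; assumed to be provided elsewhere, not reproduced here. */
uint8_t g711_ulaw_encode(int16_t x);

/* Average magnitude factor (Eqs. 2.4-2.6): apply the one's complement
 * mapping of Eq. 2.4 to each mu-law byte and average over the frame.
 * A sketch, not the thesis implementation. */
double frame_amf(const int16_t *x, int K)
{
    long sum = 0;
    for (int j = 0; j < K; j++) {
        int y  = g711_ulaw_encode(x[j]);     /* 8-bit unsigned code y */
        int mf = (y < 128) ? y : y - 255;    /* MF_j via Eq. 2.4      */
        sum += mf;
    }
    return (double)sum / K;                  /* AMF, Eq. 2.6 */
}
```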
2.2 Adaptive Thresholds

As noted earlier [45, 31], a fixed detection threshold does not give reliable and consistent compression results when variation occurs in speech or background noise levels. For the silence detector to be useful in different speech environments or under different noise conditions, the detection thresholds must adapt to the varying speech characteristics. Adaptation is accomplished through maintenance of good estimates of the long-term averages of the energy and AMF statistics, and by using threshold hysteresis.

2.2.1 Long-term Averages

As described earlier, the signal energy E and the AMF are compared to specific thresholds, and these comparisons are then used as criteria for silence detection. For these thresholds to adapt to fluctuations in speech and background noise levels, they must be calculated from long-term energy and AMF averages. A larger energy threshold factor (hence a higher threshold) or a lower AMF threshold factor generally lowers the probability that a frame will qualify as speech. The relationships between the thresholds and their respective long-term averages are as follows:

    E-Threshold = E_T \cdot MAX(SilAvg, SphAvg/13)    (2.7)
    AMF-Threshold = L_T \cdot MIN(SilAMF, SphAMF + 59)    (2.8)

where

    MAX(x, y) = \begin{cases} x, & x \ge y \\ y, & \text{otherwise} \end{cases}    (2.9)
    MIN(x, y) = \begin{cases} x, & x \le y \\ y, & \text{otherwise} \end{cases}    (2.10)

SilAvg and SilAMF are the long-term averages of the signal energy and AMF for silence, and SphAvg and SphAMF are those for speech.

In most circumstances, tracking the silence averages would be sufficient. A threshold should reflect the level of a criterion expected of silence; exceeding the threshold indicates sufficient deviation from the pattern of silence that speech is probably present. Rose [31] suggested adapting the energy threshold to the current speech level too, so that silence could still be detected in speech with low background noise and a widely fluctuating speech level. The constant 13 in Equation 2.7 was called the SpeechScale [31], and was chosen experimentally without much explanation. In fact this SpeechScale factor can be treated as a rough upper bound on the signal-to-noise ratio (SNR) of speech energy relative to noise energy (in this case, 20 log10(13) ≈ 22 dB). Equation 2.7 dictates that, if the long-term silence energy SilAvg drops below a specified level, the energy threshold will adapt to the speech energy instead, so as to maintain a satisfactory amount of silence compression.

In the same spirit, the SpeechScale factor has now been incorporated into the calculation of the AMF-Threshold. From the μ-law characteristic equation (Equation 3.11),

    |y| = |c(x)| = Y_{max} \frac{\log(1 + \mu|x|/X_{max})}{\log(1+\mu)} \approx Y_{max} \frac{\log(\mu|x|/X_{max})}{\log(1+\mu)}

where the approximation on the right can be made for large signals (\mu|x| \gg X_{max}) [60]. Taking x' = x/13,

    |y'| = |c(x')| \approx Y_{max} \frac{\log(\mu|x|/X_{max}) - \log(13)}{\log(1+\mu)} = |y| - Y_{max} \frac{\log(13)}{\log(1+\mu)}

For μ = 255 and Y_max = 128, |y'| ≈ |y| − 59. Comparing Figures 3.3 and 2.5, one sees that in the MF function, through the combined effects of the G.711 encoder and the one's complement function, the μ-law characteristic gets shifted (for negative x) and flipped (for positive x). A decrease in the μ-law magnitude is thus translated into an increase in AMF. Hence the rationale for adding 59 to SphAMF in the calculation of the AMF-Threshold (Equation 2.8).

In all, four long-term averages have to be evaluated over the entire speech sample during compression: two for silence and two for speech. The period over which an average is calculated has to be small enough to track the variability of the characteristic and large enough to withstand short-term fluctuations. The period length therefore determines the sensitivity of the statistic. It has been found [31] that the long-term statistics for speech should be averaged over a period of approximately one second, and those for background noise or silence over a 128 ms period. Table 2.1 lists the various combinations of frame sizes and averaging periods (in number of frames); values are given in powers of two to facilitate real-time implementation on a digital signal processor.

Table 2.1: Various frame sizes with matching averaging periods.

    Frame Size (K)    SilAvgPrd    SphAvgPrd
    128               8            64
    64                16           128
    32                32           256
    16                64           512
    8                 128          1024
    4                 256          2048

In general, the silence and speech averaging periods, SilAvgPrd and SphAvgPrd, can be calculated from the frame size K as follows:

    SilAvgPrd = 1024/K    (2.11)
    SphAvgPrd = 8192/K    (2.12)
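The threshold and averaging-period relations above (Equations 2.7, 2.8, 2.11, and 2.12) amount to a few arithmetic operations per frame, sketched below. The variable names follow the text; the function itself is illustrative and not drawn from the thesis software.

```c
/* Adaptive thresholds (Eqs. 2.7-2.8) and averaging periods (Eqs. 2.11-2.12).
 * A sketch only. */
#define SPEECH_SCALE 13.0   /* rough upper bound on speech-to-noise energy ratio */

static double max2(double a, double b) { return (a >= b) ? a : b; }
static double min2(double a, double b) { return (a <= b) ? a : b; }

void update_thresholds(int K, double E_T, double L_T,
                       double SilAvg, double SphAvg,
                       double SilAMF, double SphAMF,
                       double *E_Threshold, double *AMF_Threshold,
                       int *SilAvgPrd, int *SphAvgPrd)
{
    *SilAvgPrd = 1024 / K;                                       /* Eq. 2.11 */
    *SphAvgPrd = 8192 / K;                                       /* Eq. 2.12 */
    *E_Threshold   = E_T * max2(SilAvg, SphAvg / SPEECH_SCALE);  /* Eq. 2.7  */
    *AMF_Threshold = L_T * min2(SilAMF, SphAMF + 59.0);          /* Eq. 2.8  */
}
```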
The importance of maintaining good estimates of the various long-term averages for silence (and for speech too) cannot be overemphasized. These long-term averages are constantly updated: all frames classified as silence and speech contribute to the long-term averages for silence and speech, respectively. Since the speech/silence classification is based on threshold values, and hence on the long-term averages, defining the long-term averages on the basis of the classification risks the danger of a vicious circle (or positive feedback), as warned by Southcott [26]. Whether or not this affects the performance of the adaptive thresholds in practice will be discussed in Section 4.5.

In calculating the speech energy average, many frames with uncharacteristically low energy or zero-crossing rate are included, along with those more representative of speech, because a frame is classified as speech if any one of its criteria exceeds the threshold. While the speech average is not really crucial to the operation of the silence deletion algorithm, the silence average is critical, and does not tolerate being corrupted in similar ways. To prevent possible corruption of the silence energy and AMF averages by the inclusion of spurious high-energy frames, a critical factor E_C can be used [54, 31]. The more stringent qualification is that the frame energy statistic E must fall below a critical level, E_C · SilAvg, in order for the frame to be included in the calculation of the long-term silence averages. The included frame must also meet the other criteria, of course.

2.2.2 Threshold Hysteresis and Hangover

Speech onset often shows a larger signal energy than does the termination [31]. To exploit this characteristic, hysteresis in the energy threshold can be implemented; in Figure 2.8, two energy threshold factors, E_0 and E_1, are used in place of E_T. The higher one (E_1) is used during silence intervals, for detecting speech onset, to prevent triggering by noise. To minimize end-clipping, the lower one (E_0) is used during speech, for detecting the termination of an utterance. Similarly, L_T can be replaced with the dual factors L_0 and L_1; the lower factor L_0, though, is used to detect speech onsets, while the higher one, L_1, is for detecting silence (Figure 2.9).

Figure 2.8: The E-Threshold hysteresis.

Figure 2.9: The AMF-Threshold hysteresis.

To facilitate detection of weak stop consonants, which sometimes trail a speech utterance, it is sometimes beneficial for the speech thresholds (E_0 and L_1) to hang over for H frames. It has been found that a non-zero hangover H is useful for large frame sizes, where the consonant energy becomes blurred in the surrounding silence, but may compromise silence detection with small frame sizes [31]. While the hysteresis and the hangover are included in our implementation, we will not spend time duplicating or refuting previous findings. These fine-tuning parameters will instead be left at their default values (E_1 = E_0 and E_C = 1.5), which suffice in our study.
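One possible realization of the hysteresis is sketched below: the pair of factors in force depends on the detector's internal mode (cf. Table 3.1), and the hangover simply keeps the speech-mode factors active for H extra frames after the last speech frame. The exact interaction of the hangover counter with the mode switch is an assumption here, not a detail taken from the thesis.

```c
/* Mode-dependent threshold factors (hysteresis with hangover).  A sketch;
 * the caller maintains hangover_left and reloads it to H on each speech frame. */
typedef enum { MODE_SILENCE, MODE_SPEECH } ThresholdModeEnum;

void select_threshold_factors(ThresholdModeEnum mode, int hangover_left,
                              double E0, double E1, double L0, double L1,
                              double *E_factor, double *L_factor)
{
    if (mode == MODE_SPEECH || hangover_left > 0) {
        *E_factor = E0;   /* lower energy factor: catch utterance tails  */
        *L_factor = L1;   /* higher AMF factor: easier to stay in speech */
    } else {
        *E_factor = E1;   /* higher energy factor: resist noise triggers */
        *L_factor = L0;   /* lower AMF factor during silence             */
    }
}
```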
2.3 Silence Detector

The decision of the silence detector is based on one or more of the three criteria presented above. In a speech communication system, it is usually more important to transmit information than to conserve bandwidth by removing redundancy. Therefore, to minimize clipping, speech should be declared whenever one or more of the criteria (short-time statistics) exceeds its threshold. The signal energy and the AMF are never used in conjunction, however, because there is much duplication of functionality between them: they are different measures of the same thing, namely the signal amplitude. Instead they will be evaluated one against the other in silence compression performance, and their effectiveness in detecting speech/silence will be studied in Sections 4.3 and 4.6.

In summary, the operation of the adaptive silence detector is controlled by eight adjustable parameters: the frame size K, the threshold factors E_0,1, Z_T, and L_0,1, the critical factor E_C, and the hangover H.

Chapter 3
A Speech Compression System Based on Silence Deletion

This chapter deals with the implementation of a speech compression system employing silence coding. Based on the silence detector described in the previous chapter, a speech compression system has been built using an IBM PC/AT compatible computer, a Spectrum TMS320C30 DSP board, and a simple signal amplifier. The system can record speech from the telephone line to a disk file, and encode, decode, and play back the audio data file through a loudspeaker or a telephone handset. The DSP board has built-in A/D and D/A converters, and is responsible for data acquisition and playback. The computer controls data flow and buffering, performs disk operations, and is responsible for encoding and decoding the audio files. Software implementing the speech compression system is listed in Appendix B.

The functional blocks of the system can be visualized using the data flow diagram in Figure 3.1. By their functionality, the blocks can be grouped into three: algorithmic control, coding and I/O (input/output), and analog-to-digital interfacing.

Figure 3.1: Speech compression system with silence coding.

3.1 Silence Deletion Algorithm

The speech compression system described in this thesis is based in part on a previous silence deletion algorithm [24, 31]. Modifications have been made to enable real-time coding of speech and silence frames, to demonstrate further savings in memory usage and computation, and to adapt to the more erratic noise characteristics of actual telephone recordings. Figure 3.2 shows a control flow diagram of this new silence deletion algorithm.
Following is a legend for some of the undefined variable names in the figure:

Statistic(s): one or more of the short-time statistics, namely the signal energy E, the zero-crossing rate Z, and the AMF
Threshold(s): one or more of their corresponding thresholds, namely E-Threshold, Z-Threshold (= Z_T), and AMF-Threshold
SpeechAvg(s): the long-term averages for speech, SphAvg and SphAMF
SilenceAvg(s): the long-term averages for silence, SilAvg and SilAMF
Energy: short-time energy E or long-term silence energy average SilAvg, depending on context
UnclassifiedSum(s): accumulated sums for the unclassified frames, UnclassifiedSum and UnclassifiedAMF

Other variable names either have been defined in the previous sections or will be defined later.

The silence detector or voice activity detector (VAD) is at the heart of the speech compression system, dictating whether the incoming signal is to be treated as speech or as silence. As described in the previous chapter, its basic function is to compare the short-time statistics to thresholds (which are based on long-term averages), and then to decide: speech or silence. The algorithm operates on a frame-by-frame basis, that is, with a minimal algorithmic delay [6] of one frame (K/8 ms).

Figure 3.2: The silence deletion algorithm.

In the context of the silence deletion algorithm, two additional constraints have to be satisfied before the final decision can be made: the minimum silence and minimum speech durations [31, 54]. M consecutive silence frames must be detected before the system changes state to SILENCE and finally declares them all to be silence frames; similarly, N consecutive speech frames have to be recognized by the detector before the system switches to the SPEECH state and classifies them all as speech frames.

The minimum speech constraint imposes a limit on the shortest duration of speech that is considered to convey useful information, while the minimum silence constraint puts a similar limit on the shortest run of silence that can be efficiently coded. (As noted earlier, there is an overhead associated with each silence code.)
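The following is a compact sketch of the state logic summarized in Figure 3.2 together with the minimum-duration constraints. It is deliberately simplified: the long-term average updates, the critical-energy bookkeeping, the ThresholdMode handling, and the end-of-input flush are omitted, and all names are illustrative rather than taken from the thesis code.

```c
/* Minimum-duration state logic (cf. Figure 3.2): a simplified sketch. */
typedef enum { ST_SILENCE, ST_SPEECH } SystemState;
typedef enum { ACT_NONE, ACT_EMIT_SPEECH, ACT_DISCARD_SILENCE } Action;

typedef struct {
    SystemState state;
    int speech_len;    /* consecutive detector-speech frames still pending  */
    int silence_len;   /* consecutive detector-silence frames still pending */
} SdState;

/* Classify one frame given the raw detector decision.  On ACT_EMIT_SPEECH or
 * ACT_DISCARD_SILENCE, *n_frames tells the caller how many buffered frames
 * (including the current one) to encode as speech or to drop as silence. */
Action classify_frame(SdState *s, int detected_speech, int M, int N, int *n_frames)
{
    if (detected_speech) {
        s->speech_len++;
        if (s->state == ST_SPEECH || s->speech_len >= N) {
            /* talkspurt continues, or enough evidence of speech onset:
             * pending silence frames (< M of them) are coded as speech too */
            *n_frames = s->speech_len + s->silence_len;
            s->speech_len = s->silence_len = 0;
            s->state = ST_SPEECH;
            return ACT_EMIT_SPEECH;
        }
    } else {
        s->silence_len++;
        if (s->state == ST_SILENCE || s->silence_len >= M) {
            /* silence continues, or enough evidence of silence:
             * pending speech frames (< N of them) are discarded as silence */
            *n_frames = s->silence_len + s->speech_len;
            s->speech_len = s->silence_len = 0;
            s->state = ST_SILENCE;
            return ACT_DISCARD_SILENCE;
        }
    }
    *n_frames = 0;
    return ACT_NONE;                       /* keep the frame buffered */
}
```

A real implementation must also flush any frames still buffered when the input ends, and update the long-term averages of Section 3.1.2 according to the final decision rather than the raw detector output.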
It will be shown in Section 4.4 that increasing M and N has an effect similar to increasing the frame size K.

The values of M and N must be well matched. While a low talkspurt frequency can be accommodated by using a larger M, setting M too much higher than N can lead to a very low percentage of declared (as opposed to detected) silence, and hence a very low compression rate. Similarly, excessive clipping of speech will result if N is too much larger than M. By choosing M and N judiciously, the speech compression system can be tailored to suit different sensitivity requirements for speech and for silence.

By allowing the algorithm to reverse a decision of the detector, the minimum speech and silence constraints provide an escape from the positive feedback (refer to Section 2.2.1), breaking the inter-dependency between the long-term averaging and the silence detector. The worst problem for the silence detector would be to mistake a speech frame for silence and to subsequently update the silence averages with the statistics of the mistaken frame; this could cause the thresholds to adapt to the characteristics of speech, thereby making it easier and easier for speech frames to pass as silence: a vicious circle. Granted that it is much harder for a speech frame to pass unnoticed by the detector, the problem is made even less likely by updating the long-term averages according to the final decision of the silence deletion algorithm, which may differ from that of the silence detector module if M > 1 and/or N > 1.

The internal state of the silence detector (ThresholdMode) may therefore differ from the state of the system (State). The dual thresholds of the detector, when in use, are invoked by ThresholdMode according to Table 3.1.

Table 3.1: The silence detection threshold modes.

    ThresholdMode    Energy Threshold Factor    AMF Threshold Factor
    SPEECH           E_0                        L_1
    SILENCE          E_1                        L_0

3.1.1 System Initialization

In the first iteration, all long-term averages are initialized to the values of the statistics of the first data frame.

Simulation has shown that it is necessary to initialize the long-term averages to meaningful values, rather than to arbitrary estimates hard-coded into the system. Differences in communication environments render any hard-coded estimates useless at best, and very often detrimental to system performance. For example, if SilAvg has been initialized to a predetermined value which is too low for the noise level at hand, background noise will be mistaken for speech. If this situation persists, SilAvg will not have the opportunity to be updated to any higher value, while SphAvg will track the signal energy of the input signal, regardless of whether it is speech or silence. Eventually, zero compression will be the likely result (see Section 4.5).

At the start of each execution, therefore, the system initializes the long-term averages SilAvg, SphAvg, SilAMF, and SphAMF to the short-time energy and AMF statistics of the first data frame. The initial State is set to SILENCE.
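In code, the run-time initialization amounts to seeding the four averages from the first frame's statistics. The structure and function names below are illustrative.

```c
/* Run-time initialization (Section 3.1.1): a sketch with illustrative names. */
typedef struct {
    double SilAvg, SphAvg;   /* long-term energy averages (silence, speech) */
    double SilAMF, SphAMF;   /* long-term AMF averages    (silence, speech) */
} LongTermAverages;

void init_long_term_averages(LongTermAverages *lt,
                             double first_frame_energy, double first_frame_amf)
{
    lt->SilAvg = lt->SphAvg = first_frame_energy;
    lt->SilAMF = lt->SphAMF = first_frame_amf;
    /* The system state also starts as SILENCE (Section 3.1.1). */
}
```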
3.1.2 Calculation of Long-term Averages

In calculating the long-term signal averages (for tracking the varying speech and background noise levels), Rose [31] proposes measures to minimize both the calculation time and the memory storage requirement. The long-term average at time t over n data frames can be calculated as

    A(t) = \frac{S(t)}{n}    (3.1)

where

    S(t) = \sum_{i=t-n+1}^{t} E_i    (3.2)

is the sum of the n values for the individual frames. A simplification noted by Rose [31] is

    A(t+1) = \frac{E_{t+1} + S(t) - E_{t+1-n}}{n}    (3.3)

This requires that all n samples contained in the average be temporarily stored in memory. A block averaging scheme is then proposed, grouping every m samples into a block and saving memory by a factor of m [31].

The present thesis proposes an efficient averaging calculation that dispenses with all the memory storage for either the individual frame averages or the block averages. Note that, in updating the long-term average for each new data frame,

    A(t+1) = \frac{E_{t+1} + S(t) - E_{t+1-n}}{n} = \frac{E_{t+1} + nA(t) - E_{t+1-n}}{n} \approx \frac{E_{t+1} + (n-1)A(t)}{n}    (3.4)

for a sufficiently large n. This makes it unnecessary to store the n frame averages E_{t+1-n}, ..., E_t.

With the minimum silence/speech constraints (M and N) and the critical threshold requirement (E_C), the number of frames to be included in each averaging calculation may differ from one. E_{t+1} in Equation 3.4 is replaced by the sum of all unclassified frames with energy below the critical level, and the single frame is replaced by the total number of unclassified frames minus those that exceed the critical energy level. Values for the averaging period n have been given in Table 2.1 and Equations 2.11 and 2.12. The four long-term averages of the system are thus updated as follows:

    SilAvg = \frac{(UnclassifiedSum - CriticalSum) + SilAvg \times (SilAvgPrd - UnclassifiedLen + CriticalLen)}{SilAvgPrd}    (3.5)
    SilAMF = \frac{(UnclassifiedAMF - CriticalAMF) + SilAMF \times (SilAvgPrd - UnclassifiedLen + CriticalLen)}{SilAvgPrd}    (3.6)
    SphAvg = \frac{UnclassifiedSum + SphAvg \times (SphAvgPrd - UnclassifiedLen)}{SphAvgPrd}    (3.7)
    SphAMF = \frac{UnclassifiedAMF + SphAMF \times (SphAvgPrd - UnclassifiedLen)}{SphAvgPrd}    (3.8)

UnclassifiedLen is determined from the counters SilenceLen and SpeechLen in Figure 3.2, which count the number of frames judged by the detector as silence and speech, respectively; silence frames remain unclassified until SilenceLen reaches M, and speech frames until SpeechLen reaches N. Therefore, in the four boxes in Figure 3.2 where SpeechAvg(s) and SilenceAvg(s) are updated, the value of UnclassifiedLen is, sequentially from left to right in the figure, SilenceLen + 1, SpeechLen, SpeechLen + 1, and SilenceLen.

The minimum speech and silence constraints ensure that SilenceLen ≤ M and SpeechLen ≤ N always hold in the above operation (i.e., when updating the long-term averages). Therefore, the choice of M and N is restricted by the following inequalities:

    SilAvgPrd ≥ MAX(M, N)    (3.9)
    SphAvgPrd ≥ MAX(M, N)    (3.10)

where the values of SilAvgPrd and SphAvgPrd are given in Table 2.1 and Equations 2.11 and 2.12.

CriticalLen is the counter for unclassified frames whose energy level and AMF could possibly corrupt the silence averages. Instead of saving the individual frame statistics, these are accumulated in CriticalSum and CriticalAMF for later subtraction from the UnclassifiedSum(s).

Experimental results have shown that the above approximations (Equations 3.5, 3.6, 3.7, and 3.8) produce good estimates of the long-term averages, comparable to those calculated from the stored samples or blocks.
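The memoryless update of Equation 3.4, generalized to a batch of unclassified frames as in Equations 3.5-3.8, can be sketched as follows. It reuses the LongTermAverages structure sketched in Section 3.1.1 above; names and decomposition are illustrative, and floating point is used deliberately (see the discussion of truncation error below).

```c
/* Memoryless running-average update (Eq. 3.4), batch form (Eqs. 3.5-3.8).
 * A sketch, not the thesis implementation. */
double update_average(double avg, double period,
                      double batch_sum, double batch_len)
{
    /* batch_sum plays the role of E_{t+1}; batch_len replaces the single frame */
    return (batch_sum + avg * (period - batch_len)) / period;
}

/* Example: the silence averages exclude frames above the critical level. */
void update_silence_averages(LongTermAverages *lt, int SilAvgPrd,
                             double UnclassifiedSum, double UnclassifiedAMF,
                             double CriticalSum, double CriticalAMF,
                             int UnclassifiedLen, int CriticalLen)
{
    int len = UnclassifiedLen - CriticalLen;
    lt->SilAvg = update_average(lt->SilAvg, SilAvgPrd,
                                UnclassifiedSum - CriticalSum, len);   /* Eq. 3.5 */
    lt->SilAMF = update_average(lt->SilAMF, SilAvgPrd,
                                UnclassifiedAMF - CriticalAMF, len);   /* Eq. 3.6 */
}
```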
Since floating long-term averages are the sole means by which the system adapts to the varying speech and noise levels, it is of vital importance that quantization effects in their calculations be minimized. The precision of fixed-point data varies with the size of the stored data or the resultant value after an operation. Rounding or truncation can engender significant errors, and this becomes an issue especially in the calculation of long-term averages of the AMF and of the silence energy, mainly due to their smaller values (in the range of a few tens to a few hundreds).

In one instance, with a relatively noise-free speech recording, the already-small silence average energy (around 100) consistently gets updated to smaller and smaller values until it can no longer serve to differentiate between silence and speech frames. This is the effect of accumulated truncation errors in fixed-point division.

To solve this problem, either intermediate scalings (to make better use of the machine precision) or floating-point arithmetic must be implemented. The latter approach has been taken in this implementation.

3.1.3 System Parameters

Now that the silence deletion algorithm has been fully described, a summary of all the adjustable parameters is presented in Table 3.2.

Table 3.2: Adjustable parameters of the silence deletion algorithm.

  Name                     Symbol   Type and Range                     Suggested Values
  Frame Size               K        integer ≥ 2                        powers of 2
  E-Threshold Factors      E_0,1    real number > 0, E_1 ≥ E_0         2.0
  Z-Threshold Factor       Z_T      real number > 0                    0.7
  AMF-Threshold Factors    L_0,1    real number > 0, L_1 ≥ L_0         0.75
  Critical Energy Factor   E_c      real number > 0                    1.5
  Hangover                 H        integer ≥ 1                        1
  Minimum Silence          M        integer ≥ 1                        1
  Minimum Speech           N        integer ≥ 1                        1

3.2 Coding of Speech and Silence Frames

Silence compression results in variable-rate coding. There are two coding modes, one for silence and background noise, and one for active speech. This section goes into more detail about the two coding modes.

3.2.1 Speech Coding

Two speech coding formats are supported in this speech compression system: the G.711 64-kbps µ-law PCM, and the G.726 32-kbps 4-bit ADPCM.

µ-Law PCM

In linear quantization, the step size Δ is chosen to accommodate the dynamic range of the speech signal. Much resolution is thus wasted in coding the relatively few peaks. As a stationary random process, the signal amplitude of speech is far from uniformly distributed [9]; it is in fact closer to the Laplace or the Gamma distribution. The result is that, with linear or uniform quantization, low-level signals such as fricatives will have a relatively large quantization error. It turns out that logarithmically spaced quantization levels give a near-optimal signal-to-quantization noise ratio [60].

From a speech perception point of view, more quantization noise is perceived for signals of small amplitude than for signals of large amplitude, due to a masking effect [61] in the perception of the human ear. A louder signal masks the quantization noise. Quantizing the signal samples on a logarithmic scale exploits this masking effect, such that the step size between quantization levels becomes progressively larger with increasing amplitude.

Historically, non-uniform quantization was achieved by first compressing the signal x using
a non-uniform compressor characteristic c(·), quantizing the compressed signal y = c(x) employing a uniform quantizer, and then expanding the quantized version of the compressed signal using a non-uniform transfer characteristic c^{-1}(·) that is inverse to that of the compressor [60]. Hence the name companding: compressing and expanding.

A standard logarithmic (or pseudo-logarithmic [60]) coding characteristic is the µ-law compander:

    c(x) = sgn(x) · X_max · \frac{\log(1 + µ|x|/X_max)}{\log(1 + µ)},        (3.11)

where the function sgn(x) is as defined in Equation 2.3. The µ-law characteristic function with µ = 255 is plotted in Figure 3.3.

Figure 3.3: The µ-law compressor characteristic (plotted against x/X_max).

In an actual implementation, such as the one defined in the ITU Recommendation G.711, piecewise linear approximations to the µ-law characteristic are used in the conversion between linear PCM and µ-law PCM formats, and µ = 255 has been chosen to provide a good approximation to the piecewise characteristics for an 8-bit resolution [62, 60, 63]. As such, the µ-law code consists of a sign bit, a 3-bit segment number (to identify the piece of linear approximation), and a 4-bit level number (the level within a segment), and all the bits are inverted in transmission. That is why the actual G.711 µ-law coder produces a different mapping (Figure 2.3) to the one given in Equation 3.11 and Figure 3.3.

Relative to linear quantization, µ-law quantization yields a 24-dB reduction in quantization noise power [57]; as a result, 8-bit µ-law PCM produces speech of quality comparable to that of 12-bit linear PCM [9] for nominal input levels. From another point of view, accurate resolution of large signals is sacrificed for improved resolution of low-level signals (see Figure 3.4).

Figure 3.4: Resolution of the µ-law function: 16-bit signed integer input x encoded to 8-bit unsigned integer, which is then decoded to 16-bit signed integer r.

ADPCM

Variance of the quantization error is a function of the input speech signal variance. Unfortunately, in coding speech, the exact value of the input variance is not known in advance; moreover, it tends to change with time.

With a fixed set of near-optimal quantization step sizes, µ-law PCM yields a good signal-to-quantization noise ratio over a broad range of input variances [60]. Since speech in the long term is not a stationary stochastic process, a time-invariant quantizer is still not ideal. For a more efficient waveform coding, we turn to adaptive quantization.

An adaptive quantizer has a time-varying step size Δ(n) that adapts to the changing input variance σ_x²(n) [60]. A feedforward adaptive quantizer adjusts its step size for each signal sample based on a short-term temporal estimate of the input speech signal variance (for example, using the short-term autocorrelation estimator), so that

    Δ(n+1) = Δ(n) σ̂_x(n+1),                                     (3.12)

where σ̂_x²(n+1) is an estimate of the variance for the next sample at time n+1 (and σ̂_x(n+1) estimates the standard deviation).
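To make the feedforward idea concrete, the sketch below estimates a short-term standard deviation from a small window of recent input samples and scales a uniform quantizer's step size accordingly. This is a conceptual illustration of the step-size adaptation only, not the G.726 quantizer; the window length, loading factor, and function names are arbitrary assumptions.

#include <math.h>

enum { WIN = 32, BITS = 4 };   /* illustrative window length and word size */

/* Short-term standard deviation estimate (speech assumed zero-mean). */
static double short_term_sigma(const short *x, int n)
{
    double ss = 0.0;
    for (int i = 0; i < n; i++)
        ss += (double)x[i] * (double)x[i];
    return sqrt(ss / n);
}

/* Quantize one sample with a step size proportional to the estimate. */
int quantize(short sample, const short *recent /* last WIN input samples */)
{
    double sigma = short_term_sigma(recent, WIN);
    double delta = 4.0 * (sigma > 1.0 ? sigma : 1.0) / (1 << BITS);
    int    code  = (int)lround(sample / delta);
    int    max_c = (1 << (BITS - 1)) - 1;

    if (code >  max_c)     code =  max_c;      /* saturate to the codebook */
    if (code < -max_c - 1) code = -max_c - 1;
    return code;
}

In a feedforward scheme the step size (or the variance estimate) must also be conveyed to, or identically derived by, the decoder; the feedback variant described next avoids this by adapting from the quantizer output itself.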
A feedback adaptive quantizer employs the output of the quantizer in the adjustment of the step size:

    Δ(n+1) = Δ(n) α(n),                                         (3.13)

where the scale factor α(n) depends on the previous quantizer output.

To further exploit the redundancy in speech waveforms, differential coding or predictive coding can be used. By exploiting the inter-sample correlation in speech waveforms, it is possible to achieve an increased SNR at a given bit rate; or equivalently, a reduced bit rate for a given requirement of SNR. In differential pulse code modulation (DPCM) a signal s(n) is represented by the difference samples

    e(n) = s(n) − s(n−1).                                       (3.14)

The variance in the error signal e(n) can further be reduced with the use of linear prediction (LP), which produces an estimate of the current sample

    ŝ(n) = \sum_{i=1}^{P} a(i) s(n−i)                           (3.15)

based on the past P samples. The prediction error samples

    e(n) = s(n) − ŝ(n)                                          (3.16)

will be transmitted instead. As a refinement, incorporating the error samples in the prediction yields an even better estimate

    ŝ(n) = \sum_{i=1}^{P} a(i) s(n−i) + \sum_{i=1}^{Q} b(i) e(n−i).          (3.17)

The two sets of coefficients {a(i)} and {b(i)} are selected to minimize some function of the error sequence e(n), such as the mean squared error (MSE) [58]. Making the predictor adaptive improves performance still more. In adaptive DPCM (ADPCM), the coefficients of the predictor can be changed periodically to reflect the changing signal statistics of the source.

The ITU Recommendation G.726 has established a standard for a 32-kbps 4-bit ADPCM scheme, which employs an adaptive feedback quantizer and a predictor with two poles and six zeros (P = 2 and Q = 6 in Equation 3.17). A gradient algorithm is employed to adaptively adjust the coefficients of the pole-zero predictor [10]. In our implementation, to facilitate the coding of 4-bit data on the digital computer, a constraint has been placed on the frame size, such that K must be an even number.

3.2.2 Silence Coding

To allow for the insertion of artificially generated silence frames with matching signal energy during expansion of compressed speech, each deleted silence interval is replaced by a silence code of a fixed length. This silence code adds compression overhead, sometimes making the removal of short silence intervals counterproductive. This can be controlled by adjusting the frame size K and the minimum silence/speech parameters M and N.

The coding of silence intervals follows previous practice [31]. In addition, allowance has been made for ADPCM speech coding. The format of the silence code is shown in Figure 3.5.

Figure 3.5: Silence code format: [Silence Code Flag | 16-Bit Silence Length | 16-Bit Silence Energy].

The length in number of frames is recorded in an unsigned 16-bit integer, which puts an upper limit on the duration of uninterrupted silence. As shown in Table 3.3, a reasonable length of silence can be coded for all practical frame sizes.

Table 3.3: Maximum lengths of silence allowed by the silence code.

  Frame Size (K)   Maximum Length of Codable Silence
  0.5 ms (4)       32.76 s
  1 ms (8)         1 m 5.53 s
  2 ms (16)        2 m 11.07 s
  4 ms (32)        4 m 22.14 s
  8 ms (64)        8 m 44.28 s
  16 ms (128)      17 m 28.56 s

Immediately following the silence length is another unsigned 16-bit integer, the silence energy, which indicates the energy level (in 16-bit linear PCM) of the background noise in the silence interval.
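The silence code of Figure 3.5 is simple enough to emit with a few writes. The sketch below is illustrative only: it uses the two-sample flag defined for linear PCM in the next subsection, and it fixes a big-endian byte order on disk purely as an assumption (as noted later, byte order is machine-dependent and must be handled consistently).

#include <stdio.h>
#include <stdint.h>

/* Write one 16-bit value in a fixed (big-endian) byte order. */
static void put_u16(FILE *out, uint16_t v)
{
    fputc((v >> 8) & 0xFF, out);
    fputc(v & 0xFF, out);
}

/* Emit a silence code: flag, frame count, background-noise energy. */
void write_silence_code(FILE *out, uint16_t n_frames, uint16_t energy)
{
    put_u16(out, 0x7FFF);    /* flag sample 1:  32767                     */
    put_u16(out, 0x8001);    /* flag sample 2: -32767                     */
    put_u16(out, n_frames);  /* length of the deleted silence, in frames  */
    put_u16(out, energy);    /* average noise energy (16-bit linear PCM)  */
}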
The inclusion of this value makes it possible to supply a random "comfort noise" [7] during playback, to approximate the original silence intervals which have been discarded during coding.

Each silence code is prefixed by a flag to distinguish it from ordinary speech data. To be detected from within a stream of codewords for speech data, this flag should take on the value of an impossible or otherwise unused combination of speech codewords.

Silence Code Flag for Linear PCM

For linear PCM coded speech data on the telephone network, the concatenation of the most positive 16-bit signed integer with its negative complement forms a very unlikely sequence. The ordered 2-tuple (32767, −32767), or 7FFF8001¹ in hexadecimal, can thus serve as the silence code flag.

Speech on the general telephone network is bandlimited to between 200 and 3400 Hz. The chosen silence code flag, at a sampling rate of 8000 times per second, corresponds to successive samples separated by half a period of a 4000-Hz monotone signal at maximum amplitude, which can never occur naturally in telephone speech.

¹Note that byte order is machine-dependent. Care should be exercised in reading and writing order-specific data such as the silence code flag on computer systems: in general, PCs are little endian, while UNIX workstations are big endian.

Silence Code Flag for µ-law PCM

By the same argument, the pair of 8-bit unsigned integers (128, 0) (8000 in hexadecimal) can be used as the silence code flag for µ-law PCM coded speech. However, due to the piecewise approximation in digital µ-law companding, a range of linear PCM values besides (32767, −32767) are mapped into the µ-law PCM 2-tuple (128, 0). Table 3.4 lists part of the mapping between the input and output of an ITU G.711 µ-law PCM coder.

Table 3.4: Mapping of a G.711 8-bit µ-law PCM coder.

  16-bit linear input   8-bit µ-law output
  -32768                0
  -31612                0
  -31611                1
   31611                129
   31612                128
   32767                128

Therefore successive samples of a 4000-Hz monotone signal at 31612/32767 of the maximum amplitude can also produce the silence code flag at the output of the µ-law PCM coder. So can successive samples of a maximum-amplitude monotone at frequency f given by

    32767 sin(π f / 8000) = 31612, or                           (3.18)
    f = 3321.9 Hz,                                              (3.19)

as shown in Figure 3.6.

Figure 3.6: Calculation of the lowest frequency f that can produce the silence code for µ-law PCM (with no overloading in the signal).

Even though a 3321.9 Hz signal falls within the bandwidth of telephone speech, the probability of a natural occurrence of the silence code flag has been found to be negligible [31]. A special marker could be appended as an indicator—emulating the practice of character-based and bit-oriented framing in data networks [64]—to the end of any naturally occurring silence code flag. For our purposes (analysis and evaluation of system performance), this has not been implemented, and no error has been encountered either.

A signal which exceeds the maximum input range of the analog-to-digital (A/D) converter could produce the silence code flag at an even lower frequency. According to the following relationship for a 16-bit A/D converter,

    A sin(π f / 8000) = 32767,                                  (3.20)

the frequency f can be made arbitrarily small by pushing the amplitude A above 32767.
But of course, signal overloading of the A/D would have occurred long before the amplitude reached those high levels. To avoid severe degradation in speech quality, out-of-range amplitudes are often clipped by some protection circuit before A/D conversion. The silence code flag being produced by overloaded signals of low frequency is therefore even less likely.

Silence Code Flag for ADPCM

For ADPCM coded speech, the strategy of combining the largest and smallest valued codes to produce the flag no longer applies. Due to the nature of differential coding, the "most positive" and "most negative" numbers represent values of the difference signal rather than those of the speech signal itself, and their occurrence is as frequent as that of any other codeword. Yet, an impossible combination of speech values presents itself: examination of arbitrary streams of 4-bit ADPCM coded speech data shows that there is no zero codeword present. In fact, the revision of CCITT Recommendation G.721 in 1986 eliminated the all-zero codeword², thus changing the quantizer from 16 levels to 15 [10]. The 4-bit zero codeword can therefore act as the silence code flag; to facilitate data transfer, a full zero byte (that is, the concatenation of two 4-bit zero ADPCM codewords) is used to indicate silence.

²The reason was that North American networks could not handle long strings of 0s.

3.3 Silence Insertion and Decoding

Compared to the silence deletion algorithm, decompression or expansion of silence-compressed speech is relatively simple. It consists of differentiating between silence codes and frames of speech data, and decoding them accordingly.

Figure 3.7 shows a flowchart of the silence insertion algorithm.

Figure 3.7: Decoding of silence-compressed data (flowchart: check the input for a silence code; if one is found, read the Length and Energy and expand the silence; otherwise read, decode, and write out a data frame; repeat until the end of data).

Before each frame is read, the silence code detector looks ahead for any "impossible sequence" of speech code, which signifies a silence code. If a silence code is found, the next four bytes will be read as two 16-bit unsigned integers, the first representing the number of frames of silence to be re-generated, and the second, the average energy level (in 16-bit linear PCM) of the background noise. The decoder then proceeds to generate the specified amount of background noise (or silence)—the "comfort noise"—at the specified energy level, using a white noise generator based on the linear congruential method [31, 65]; a small code sketch of this noise generation is given below.

In this study, the amount of background noise, and the level of the noise energy, can optionally be varied. The possibility and feasibility of inserting random noise of shorter duration and/or of lower energy than in the original will be explored by subjective listening in Chapter 5.

3.4 Peripherals

This section covers the analog-to-digital (A/D) and digital-to-analog (D/A) interface of the system. Before it is fed to the DSP I/O port, the signal from the telephone line is amplified by a differential amplifier with matching input impedance (600 Ω), from the millivolt range to a few volts.
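The comfort-noise half of the decoder in Figure 3.7 (Section 3.3) can be sketched as follows. This is a minimal illustration under stated assumptions: the linear congruential constants are common textbook values used here only as placeholders for the generator of [31, 65], and the transmitted energy value is treated simply as an amplitude scale for uniform noise.

#include <stdint.h>

static uint32_t lcg_state = 1u;

/* Linear congruential update (placeholder constants, not those of [65]). */
static uint32_t lcg_next(void)
{
    lcg_state = 1664525u * lcg_state + 1013904223u;
    return lcg_state;
}

/* Regenerate n_frames * frame_size samples of "comfort noise" scaled to
 * the noise level carried by the silence code. */
void insert_comfort_noise(short *out, int n_frames, int frame_size,
                          int energy /* level from the silence code */)
{
    for (int i = 0; i < n_frames * frame_size; i++) {
        double u = (double)lcg_next() / 4294967296.0;   /* uniform [0,1) */
        out[i] = (short)((2.0 * u - 1.0) * energy);     /* scale to level */
    }
}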
3.4.1 Filtering and Sampling

For many signals, including speech, the highest frequency component is not distinctly known, and it is therefore necessary to band-limit the signal by filtering it prior to digitization. The analog filter employed in such a situation is normally called an anti-aliasing or pre-sampling filter.

Speech in general contains frequency components with significant energies up to about 10 kHz; however, most energy is below 5 kHz. Depending on the application, the sampling rate f_s for speech will normally lie in the range 6–20 kHz. Telephone speech (bandlimited to the range of 200–3400 Hz) is typically sampled at f_s = 8 kHz, and a pre-sampling (lowpass) filter is therefore required to remove frequency components above the Nyquist frequency f_s/2.

The analog inputs and outputs are provided with variable 4th-order lowpass filters on the Spectrum TMS320C30 DSP board [66]. Equal resistor values are used to give a Butterworth (maximally flat) response with a 24 dB per octave roll-off in the stop band. The cutoff frequency f_c and the resistor values R are related as follows:

    f_c = 61.2 / R                                              (3.21)

(with f_c in kHz and R in kΩ). The filters are adjusted to have a cut-off frequency at 3.44 kHz (with 17.8 kΩ resistors).

Since signals coming from the telephone line are already bandlimited to below 3.4 kHz, the input lowpass filter acts primarily as a pre-sampling safeguard against extraneous noise. The output lowpass filter, on the other hand, is responsible for smoothing out the quantized waveform produced by the D/A converter, which contains an abundance of high frequency components.

3.4.2 Quantization and Coding

The amplitude of the sample values is converted into digital form using a finite number of binary digits (bits). This process of representing a real value in the continuous domain by a discrete value with a finite step size is called quantization. The reconstructed signal is therefore bound to differ from the original signal. The amount of distortion introduced is quantization noise, which is reduced as the number of bits increases. The number of bits used affects the speech quality as well as the number of bits per second (bit rate) required to store or transmit the digital signal.

The on-board A/D produces linearly quantized 16-bit PCM samples, and at a sampling rate of 8 kHz, this results in a bit rate of 128 kbps, which can be interpreted as a reference bit rate for uncoded speech [7].

Chapter 4

Analysis of System Performance

This chapter presents an analysis of the performance of the speech compression system as applied to speech samples recorded from the telephone line.

The internal dynamics of the silence deletion algorithm will be studied, including the intricacy of choosing system parameters, how silence detection is affected by noise, threshold adaptation, and the segmentation and compression of speech data.

Rabiner and Sambur [37] showed that visual examination of the speech waveform alone does not always lead to correct discrimination of the speech and silence intervals. When the various system variables and statistics of the speech signal (energy, zero-crossing, AMF) are plotted along with the waveform, however, visual examination can provide many useful insights into the operation of the silence deletion algorithm. The following analyses will therefore be supported strongly by graphical aids.
4.1 Frame Size and Delay

In a speech coding system, the time spent in buffering data is called the algorithmic delay, while that spent in computing a result is called the processing delay [6]. A larger frame buffer gives rise to a longer algorithmic delay. A smaller frame size, on the other hand, means that there are more individual frames to be considered for speech/silence classification, and to be input and output (I/O); all these lead to a longer processing delay overall. The choice of the frame size is therefore of the utmost importance to the performance of the system.

Although previous work has favoured the small frame size of K = 4 (equivalent to 0.5 ms) [31], this will be shown to be a poor choice, in terms of both compression efficiency and speech quality.

Short-time statistics are sensitive to frame size. The frame is essentially a rectangular window, and good temporal resolution requires a short window while good frequency resolution calls for a long window [9]. If K is too small—on the order of a pitch period or less—the short-time statistics will fluctuate very rapidly depending on exact details of the waveform. If K is too large—on the order of several pitch periods—the short-time statistics will change very slowly and thus will not adequately reflect the changing properties of the speech signal. Unfortunately this implies that no single value of K is entirely satisfactory, because the duration of a pitch period varies from about 16 samples (at an 8-kHz sampling rate) for a high-pitch female or a child, up to 200 samples for a very low-pitch male. With these shortcomings in mind, Rabiner [9, p. 122] suggests that a suitable practical choice for the frame size is 10–20 ms in duration; that is, for a sampling rate of 8 kHz, K = 80–160.

Figure 4.1 shows the waveform of the phrase "the porch steps" along with its short-time average zero-crossing rate calculated for the frame sizes of 4 (0.5 ms), 8 (1 ms), 16 (2 ms), 32 (4 ms), 64 (8 ms), and 128 (16 ms). As noted in [31], the zero-crossing rate is counter-productive for silence deletion at small frame sizes. This is as expected since, as shown in the figure, only at larger frame sizes does the zero-crossing rate display any useful features that correspond to the speech waveform. At small frame sizes where K = 4 or 8, the zero-crossing curve is rather jagged and shapeless, and it is difficult to make use of the erratic jumps for silence/speech discrimination. Occasional peaks could indicate either noise or fricative consonants. At K = 32 and above, the final 's' sound of "steps" begins to exhibit well-defined peaks between 5600 ms and 5800 ms.

Figures 4.2 and 4.3 show the short-time average magnitude and the AMF, respectively, at various frame sizes, for the phrase "the wide road." These short-time statistics fluctuate widely at smaller frame sizes. The AMF, in particular, appears to be usable only at frame sizes above K = 16.

Figure 4.2: The phrase "the wide road" with its short-time average magnitude E calculated for K = 4, 8, 16, 32, 64, 128.

Figure 4.3: The phrase "the wide road" with its AMF calculated for K = 4, 8, 16, 32, 64, 128.

Comparisons of speech quality for the four frame sizes K = 4, 16, 64, 128 are made by subjective listening described in Chapter 5.
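For reference, the per-frame statistics plotted in Figures 4.1–4.3 can be computed along the following lines. This is an illustrative sketch, not the thesis implementation: it covers the short-time average magnitude E and the zero-crossing rate Z (in crossings per sample); the AMF is derived from the frame similarly, but its definition (given in Chapter 2) is not repeated here.

#include <stdlib.h>

/* Compute short-time average magnitude and zero-crossing rate for one
 * frame of K samples.  Z is expressed per sample, so it can be compared
 * directly against a threshold such as Z_T = 0.7. */
void frame_stats(const short *frame, int K, double *avg_mag, double *zcr)
{
    double sum = 0.0;
    int crossings = 0;

    for (int i = 0; i < K; i++) {
        sum += abs(frame[i]);
        if (i > 0 && ((frame[i] >= 0) != (frame[i - 1] >= 0)))
            crossings++;                      /* sign change => crossing */
    }
    *avg_mag = sum / K;
    *zcr     = (double)crossings / K;
}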
4.2 Zero-crossing Rate

Speech includes frequency components with significant energies up to 10 kHz, with most of the energy below 5 kHz. Only unvoiced fricative and aspirated sounds exhibit significant spectral energy above 5 kHz [57]. The zero-crossing rate is sensitive to these high-frequency features of speech. The mean short-time zero-crossing rate is 49 and 14 per 10 ms for unvoiced and voiced speech, respectively [9]. In practice, some researchers consider speech frames above 3700 zero-crossings per second (or 0.46 per sample for a sampling rate of 8 kHz) as containing unvoiced sounds [67].

Rose [31] has reported that the zero-crossing criteria would cause faulty detection at small frame sizes, and the results of the previous section confirm this erratic behaviour. Zero-crossing criteria are therefore studied for larger frame sizes. For the present thesis, a Z-Threshold of 0.7 crossings per sample has produced good results. Setting it at a lower value would cause unwanted silence frames to be classified as speech.

Figure 4.4 shows how the zero-crossing criteria helps to detect the final 's' sound in the phrase "the porch steps", with the system parameters set at K = 64, E_0 = E_1 = 2, and Z_T = 0.7 (E_c = 1.5 and H = M = N = 1 by default). In the figure,

• the first row plots the speech signal with the decision of the silence deletion algorithm superimposed as a square wave,
• the second shows how the short-time average magnitude E exceeds E-Threshold when there is voice activity,
• the third shows the long-term energy averages SilAvg and SphAvg,
• the fourth, how the short-time average zero-crossing Z exceeds Z-Threshold when there is significant high-frequency content in the signal, and
• the fifth displays the contents of the counters SilenceLen and SpeechLen, the accumulated length of consecutive silence frames and speech frames.

Figure 4.4: Silence deletion of the phrase "the porch steps" using the energy and the zero-crossing criteria.

The utility of the zero-crossing criteria becomes doubtful when using the signal energy criteria by itself enables detection of all the speech and silence frames. Figure 4.5 is a similar series of plots, showing silence deletion at work on the same phrase ("the porch steps"), this time with the Z-Threshold disabled (Z_T is set at 0.95, beyond the reach of the present zero-crossing rate).

The zero-crossing rate is also strongly affected by DC offset in the analog-to-digital converter, 60-Hz hum in the signal, and any noise that may be present in the digitizing system [9]. Shown in Figure 4.6 is a run of the silence deletion algorithm on a speech sample with a noisy hum in the background. The zero-crossing rate experiences a drastic reduction, down to below 0.4 per sample. Figure 4.7 illustrates how a low-frequency periodic noise can "mask" a high-frequency signal of low amplitude. The hum provides an "envelope" to the low-amplitude signal, effectively preventing the signal from crossing zero more often than the fundamental frequency of the hum. Noise common in telephone speech has a similar effect on the zero-crossing rate.
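The masking mechanism can also be checked numerically. The small program below synthesizes a strong 60-Hz hum plus a much weaker 3-kHz component and counts sample-to-sample sign changes over one second; the amplitudes and duration are arbitrary illustration values and are not taken from the recordings of Figures 4.6 and 4.7.

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double PI = 3.14159265358979323846;
    const double fs = 8000.0;              /* telephone sampling rate */
    const int    n  = 8000;                /* one second of samples   */
    int crossings = 0;
    double prev = 0.0;

    for (int i = 0; i < n; i++) {
        double t = i / fs;
        double x = 10000.0 * sin(2 * PI * 60.0 * t)      /* strong hum   */
                 +   100.0 * sin(2 * PI * 3000.0 * t);   /* weak HF tone */
        if (i > 0 && ((x >= 0) != (prev >= 0)))
            crossings++;
        prev = x;
    }
    /* The count comes out close to twice the hum frequency (about 120),
     * far below the crossings the 3-kHz tone would produce on its own. */
    printf("zero crossings in 1 s: %d\n", crossings);
    return 0;
}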
Figure 4.8 shows the silence deletion of the same phrase "the porch steps", this time with pre-added noise recorded from telephone conversations on cordless handsets. The zero-crossing rate of the final 's' sound drops below the 0.7 threshold, escaping notice of the silence detector.

Figure 4.8: Silence deletion of the phrase "the porch steps" with added noise, using the energy and the zero-crossing criteria.

In an attempt to circumvent difficulty with varying zero-crossing rates, the zero-crossing threshold has been made to adapt to the changing environment—with disastrous results. Due to the fast-fluctuating nature of the short-time zero-crossing measure, an adaptive threshold often cannot respond quickly enough to changes and remains stuck at a low level, causing subsequent silence frames to be all classified as speech. Zero-crossing adaptation has subsequently been abandoned in our study.

The zero-crossing rate is used for refined endpointing of words with fricative beginnings and endings. It has been found to be ineffective for telephone speech [42], and our study confirms this judgement. The telephone speech bandwidth (200–3400 Hz) already limits the amount of energy carried by unvoiced speech sounds. Considering the levels and variety of noise in modern telephony, the zero-crossing criteria is problematic rather than helpful in silence delineation.

Figure 4.5: Silence deletion of the phrase "the porch steps" using the energy criteria alone.

Figure 4.6: Silence deletion of speech with a background hum.

Figure 4.7: The hum reducing the zero-crossing rate.

4.3 Signal Energy Versus Average Magnitude Factor

The signal energy and the AMF are different measures of the same property, namely the signal amplitude. They can be considered to work, respectively, in the linear and logarithmic domains. This section illustrates their differences and similarities.

Figure 4.9 shows silence deletion of the phrase "the porch steps" using the AMF criteria, with the parameters K = 64 and L_0 = L_1 = 0.9. As short-time statistics, the AMF and the signal energy seem to possess different temporal resolutions. The AMF tends to be more prone to follow small fluctuations in the signal: it manages to convince the AMF-Threshold of the existence of a few more speech frames after the phrase proper has ended, by dipping a few more times below the threshold. Sensitivity to temporal variations in the signal is of course related to the frame size K. This suggests that at a larger frame size, the AMF may have an even more similar performance to the signal energy. (The effect of the frame size on segmentation, which
differs slightly for the energy criteria and for the AMF criteria, will be studied in Section 4.4.) But other than the few extraneous mistaken speech frames, the AMF criteria produces a deletion profile similar to that of the signal energy; all the major voice activity regions are reported (compare Figure 4.9 to Figure 4.5, with the rows on SilenceLen and SpeechLen perhaps giving a better indication of their resemblance).

Figure 4.9: Silence deletion of the phrase "the porch steps" using the AMF criteria.

4.4 Segmentation Analysis

The silence deletion algorithm segments a speech signal into one of two categories: silence and speech. Segments of each category are coded differently: speech coding (µ-law PCM or ADPCM) is performed on speech segments, while each silence segment is compressed into a silence code. Even though a "comfort noise" with an adaptive energy level is supplied during decoding, it is only an approximation to the original silence interval characteristics. There is often a perceptible transition between speech and silence segments. The frequency of this transition is a function of the segmentation size, which is therefore an important parameter for speech quality.

The frame size K has a major influence on segmentation. Because classification of speech samples is done on a frame basis, K defines the smallest unit for segmentation; for example, when K = 128 (16 ms), no segment can possibly be less than 16 ms.

The other key parameters relative to segmentation size are the minimum numbers of contiguous silence and speech frames, M and N. The limiting segmentation size for K = 4 and M = N = 8 is (4)(8) = 32 samples (i.e. 4 ms); however, the mean segmentation size for these values is apparently larger than that for a setting of K = 32 and M = N = 1. Both alternatives give the same actual minimum segment size. But the probability of 8 consecutive speech (or silence) frames at K = 4 is smaller than that of one speech (or silence) frame at K = 32: short-time statistics exhibit a greater variability at smaller window (frame) sizes.

Simulations run on a six-second speech sample ("DEC23A_x.dat"), at various combinations of K, M and N, produced a 50% deletion of silence frames. For each run, the mean segmentation
size, based solely on the energy criteria, is tabulated in Table 4.1. Also listed is the value of the threshold factor E_0,1 at which a 50% silence deletion is achieved. In a few cases, the threshold factor has been adjusted to the precision of a number of decimal places, in an attempt to reduce deviation from 50% to less than 0.1%.¹ Thus, for a 6000-frame speech sample, 3000 ± 30 data frames are deleted as silence. Achieving a 50–50 division of speech and silence frames helps to give an overall unbiased mean segmentation size. The speech sample has a speech/silence content ratio of approximately 1:1.

¹An asterisk (*) in the table indicates where such an attempt has failed.

Table 4.1: Mean segmentation size of silence deletion using the signal energy criteria E. Each entry gives Mean Segmentation Size (ms) / E_0,1.

  M,N   K = 4           K = 8           K = 16          K = 32          K = 64          K = 128         K = 256
  1     6.1920/1.85     15.8311/1.9     33.8983/2       60.6061/2.2     146.3415/2.3    181.8182/2.3    222.2222/2.68
  2     19.5440/1.9     36.8098/2       80/2.1          146.3415/2.2    171.4286/2.25   222.2222/2.3    285.7143/3.15
  3     36.3636/1.9     72.2892/2.1     146.3415/2.2    193.5484/2.2    206.8966/2.213  260.8696/2.3    315.7895/3.36*
  4     59.4059/1.9     105.2632/2.05   181.8182/2.2    193.5484/2.2    240/2.1         285.7143/2.6
  5     82.1918/1.9     162.1622/2.05   193.5484/2.2    206.8966/2.2    240/2           285.7143/2.6
  6     109.0909/1.9    181.8182/2.05   193.5484/2.1    222.2222/2.05   260.8696/2.042  315.7895/3.3
  7     127.6596/1.9    181.8182/2.05   206.8966/2.1    222.2222/2.05   285.7143/2.5    352.9412/3.4
  8     162.1622/1.9    181.8182/2.05   206.8966/2.1    240/2.05        285.7143/3.15   400/3.35
  9     162.1622/1.9    181.8182/2.1    206.8966/1.9    240/2.05        285.7143/3.15   400/3.6
  10    181.8182/1.9    193.5484/2.05   222.2222/1.85   240/2.05        285.7143/3.15
  11    181.8182/1.9    206.8966/2.05   222.2222/1.85   240/2.05        285.7143/3.15
  12    193.5484/1.9    222.2222/2.05   222.2222/1.85   260.8696/2.3
  13    206.8966/1.9    222.2222/1.8    222.2222/1.85   260.8696/2.43
  14    206.8966/1.9    222.2222/1.8    222.2222/1.85   285.7143/2.43
  15    222.2222/1.9    222.2222/1.9    240/1.7         285.7143/2.43
  16    222.2222/1.9    222.2222/1.9    240/1.7         285.7143/2.43
  17    222.2222/1.9    240/1.9         240/1.75        285.7143/3
  18    222.2222/1.9    240/1.9         240/1.75        285.7143/3
  19    222.2222/2.0    240/1.9         240/1.75
  20    222.2222/2.0    240/1.93        240/1.8
  21    222.2222/1.9    240/1.93        260.8696/1.8
  22    222.2222/1.8    240/1.93        260.8696/1.8
  23    240/1.65        240/1.95        260.8696/1.95
  24    240/1.65        240/1.96
  25    240/1.65        240/1.96
  26    240/1.65        240/1.96
  27    240/1.7         240/1.96
  28    240/1.7
  29    240/1.7
  30    240/1.7
  31    240/1.64
  32    240/1.64
  33    240/1.64
  34    240/1.64
  35    240/1.7
  36    240/1.7

The threshold factor E_0,1 tends to increase across the columns (i.e. for increasing frame size K)—at least while M, N are small. Not much of a trend, though, can be discerned down the rows (i.e. for increasing M, N); the factor value fluctuates, generally ending on a high value, when any further increase in segmentation size would be too large to allow a 50% silence deletion. Of course, by its very nature a larger frame size produces a larger segmentation size, and therefore the columns toward the right side of the table end sooner.

The mean segmentation sizes in Table 4.1 are plotted in Figure 4.10. The mean segmentation sizes of performing silence deletion on the same speech sample, this time using the AMF criteria, are tabulated in Table 4.2 and plotted in Figure 4.11.
Table 4.2: Mean segmentation size of silence deletion using the AMF criteria. Each entry gives Mean Segmentation Size (ms) / L_0,1.

  M,N   K = 4            K = 8           K = 16          K = 32          K = 64          K = 128         K = 256
  1     1.2823/.95       2.4783/.85      9.1047/.88*     25.5319/.89*    72.2892/.87     133.3333/.8     206.8966/.78
  2     5.3619/.9567*    14.8883/.88     38.7097/.92     98.3607/.92*    153.8462/.88*   222.2222/.81    260.8696/.78
  3     23.9044/.95*     56.0748/.92     113.2075/.92    162.1622/.94*   206.8966/.91*   260.8696/.83    315.7895/.752
  4     75.9494/.95*     139.5349/.92    171.4286/.93    181.8182/.94*   260.8696/.9     285.7143/.78
  5     109.0909/.92     181.8182/.94    222.2222/.93    240/.94*        240/.88*        285.7143/.77
  6     153.8462/.92     222.2222/.91    240/.94*        240/.94*        285.7143/.88    285.7143/.74*
  7     181.8182/.94     240/.86         240/.91         285.7143/.89*   285.7143/.85
  8     222.2222/.98     240/.86         240/.91         285.7143/.88*   285.7143/.85
  9     240/.88
  10    240/.88
  11    240/.86

Figure 4.10: Mean segmentation size of silence deletion using the signal energy criteria E.

Figure 4.11: Mean segmentation size of silence deletion using the AMF criteria.

The AMF criteria, as pointed out earlier (Section 4.1), is too unstable at small frame sizes (K = 4 and 8) to be of use in silence deletion. However, when it is coupled with larger values of M and N, surprisingly, the segmentation results are not very different from those achieved with larger frame sizes. The minimum silence and speech constraints put a lower limit on the segment size, making silence deletion using small frame sizes more feasible. Even so, maintaining a 50% compression as M, N increases becomes difficult, and is impossible above M, N = 8. (This situation is slightly better with frame sizes of K > 16.)

Although the AMF initially (at M, N = 1) gives smaller segmentation sizes than does the signal energy, moving down each column shows that the AMF sizes soon overtake the energy sizes. This further supports the claim that the AMF, as a criterion, is less flexible than the signal energy; the AMF surely provides a narrower choice of segmentation sizes.

Compression of other speech samples may yield mean segmentation sizes different from those in Tables 4.1 and 4.2; segmentation size varies with the speech sample. The tables do display a recurrence of the same numbers, in different columns, and even between the two tables. Apparently, segmentation occurs at discrete levels of segmentation sizes. The same values occurring in different columns and in different tables suggest that similar results of silence deletion are achieved at those settings. For example, consider the mean segmentation size of 181.8182 ms, achievable at the various settings shown in Table 4.3.

Table 4.3: Combinations of parameters that produce a segmentation size of 181.8182 ms.

  Detection Criteria   K     M,N
  Signal energy        4     10, 11
  AMF                  4     7
  Signal energy        8     6, 7, 8, 9
  AMF                  8     5
  Signal energy        16    4
  AMF                  32    4
  Signal energy        128   1

Visualize a speech sample as recorded on a magnetic tape, with the tape then being cut into segments of speech and silence; the segments are then aligned along the x-axis with the speech segments pointing up and the silence segments pointing down, to yield a segmentation profile. The similarities among the results of silence deletion at the above settings can be seen by comparing the segmentation profiles as shown in Figure 4.12. Aside from the exact same number of segments detected, these profiles also have very similar features; for instance, they all show a long segment pointing downward near the middle (which corresponds to the approximately one-second pause in the original speech sample), and a similar combination of speech and silence segments near the end.
By contrast, Figure 4.13 shows a very different segmentation profile of the same speech sample for a 50% deletion resulting in a mean size of 98.3607 ms.

Figures 4.14, 4.15, and 4.16 show the analytical details in the silence deletion of three of the above settings, which have all resulted in a mean segmentation size of 181.8182 ms. Note, in particular, how M and N compensate for the instability of the criteria at small frame size, by imposing a minimum requirement on SilenceLen and SpeechLen before the system changes state.

Compression of other speech samples has shown a similar correspondence of segmentation size between settings of small K and large M, N, and those of large K and small M, N. While informal listening tests have verified their similar compression quality, they have also indicated some occasional end-clippings in settings with large M, N.

In conclusion, it is found that in silence deletion, the effective segmentation size, rather than the frame size, determines achievable speech quality. With a suitable selection of the minimum silence and speech constraints, M and N, the same segmentation size is achievable at different frame sizes. All things being equal, it is therefore wiser to use the largest K possible for the desired segmentation size, by setting M, N = 1, which gives the smallest processing delay (due to a smaller number of frames and therefore fewer I/O operations). Non-unity (and differing) values of M and N are recommended only for applications where different segmentation sizes for silence and speech are desired.

Figure 4.12: Segmentation profiles for silence deletion using the energy criteria E at the settings K = 4, M,N = 10; K = 16, M,N = 4; K = 128, M,N = 1; and for silence deletion using the AMF criteria at the settings K = 4, M,N = 7; K = 8, M,N = 5; K = 32, M,N = 4.

Figure 4.13: Segmentation profile for silence deletion using the AMF criteria at the setting of K = 32, M,N = 2.

4.5 Initialization and Recovery from Detection Error

Silence detection depends on a good estimate of the speech and background noise (silence) levels. Previous work has used preset values for initializing SilAvg and SphAvg [31]. Initialization is an issue particularly when the speech compression system has to deal with speech with atypical levels of energy or background noise. Judging by the variability of energy levels in different telephone conversations, preset values are often poor estimates.

Figure 4.17 shows a poorly initialized execution of the silence deletion algorithm, with disastrous results. The silence, or more accurately here, noise energy has been grossly underestimated. The SpeechScale-factored SphAvg helps to sustain the energy threshold at a reasonable level for a short while, and detects two silence intervals; however, it does not give enough time for the SilAvg to adapt to the significantly higher noise level (i.e.
higher than the preset value for SilAvg). As a result, the system regards the remaining signal as speech.

Ultimately, this problem is associated with the fault tolerance of the silence detector: how many wrong decisions can it take before its stability falters. Mistaking silence frames for speech not only reduces the compression rate; such an error can also cause subsequent failure in recognizing silence frames by lowering the long-term speech energy average. This effect is illustrated in Figure 4.17. Southcott [26] also cautioned about the danger of updating speech and silence averages based on silence detector decisions.

Figure 4.14: 50% silence deletion of the sample "DEC23A_x.dat" using the signal energy criteria, at K = 4 and M,N = 10.

Figure 4.15: 50% silence deletion of the sample "DEC23A_x.dat" using the signal energy criteria, at K = 128 and M,N = 1.

Figure 4.16: 50% silence deletion of the sample "DEC23A_x.dat" using the AMF criteria, at K = 32 and M,N = 4.

Figure 4.17: Silence deletion of sample "apr18c.dat" with preset initialization values for the long-term averages.

The AMF criteria is no less sensitive to improper initialization; in fact, results seem worse. It could also happen that the speech energy has been set initially too high, resulting in total deletion of the entire speech sample. Figure 4.18 shows such a case, where the speech has little noise, but a low energy level.

While no amount of fine-tuning seems able to find preset values that suit all situations, one simple solution is to initialize the long-term averages to values of the first data frame; that is, to initialize the averages at run-time, instead of prior to program execution. This is found to be a most reliable initialization strategy. The same speech sample ("jul18A.dat"), compressed with silence deletion initialized at run-time, is shown in Figure 4.19. The low speech energy level is no longer a problem. The energy averages evidently adapt easily, albeit with delay, to current speech and noise levels.

This dynamic initialization also allows recovery from detection errors. The noisy speech sample "apr18c.dat", previously shown in Figure 4.17, is compressed in Figures 4.20 and 4.21 with run-time initialization.
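The run-time initialization strategy itself amounts to a few assignments (see Section 3.1.1). The sketch below is illustrative: the struct and field names are placeholders, not the thesis code.

/* Run-time initialization: seed all four long-term averages from the
 * statistics of the first data frame and start in the SILENCE state. */
enum state { SILENCE, SPEECH };

struct detector {
    double sil_avg, sph_avg;   /* long-term energy averages */
    double sil_amf, sph_amf;   /* long-term AMF averages    */
    enum state state;
};

void init_from_first_frame(struct detector *d,
                           double first_frame_energy,
                           double first_frame_amf)
{
    d->sil_avg = d->sph_avg = first_frame_energy;
    d->sil_amf = d->sph_amf = first_frame_amf;
    d->state   = SILENCE;
}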
Although SilAvg is now initialized to a higher-than-nominal value, leading to the misclassification of several major speech activity regions, it manages to return eventually to the appropriate level, enabling the algorithm to get back on track (at 2600 ms). This is far more desirable than an all-on or all-off consequence of an unrecoverable detection error.

Initializing the long-term averages according to the first data frame works well in 99% of cases, because most applications would not begin a transmission in the middle of a voice activity region. From the curve for the long-term averages, however, it takes about one second for SpeechAvg to adjust and adapt to the nominal speech level. We will investigate next how long it takes for the silence deletion system to adapt, at initialization, to the speech and noise levels in the worst-case scenario—that is, when compression begins in the middle of a phrase.

Figure 4.18: Silence deletion of sample "jul18A.dat" with preset initialization values for the long-term averages.

Figure 4.19: Silence deletion of sample "jul18A.dat" with run-time initialization values for the long-term averages.

Figure 4.20: Silence deletion of sample "apr18c.dat" with run-time initialization values for the long-term averages (i).

Figure 4.21: Silence deletion of sample "apr18c.dat" with run-time initialization values for the long-term averages (ii).

Figures 4.22, 4.23, 4.24, and 4.25 show the silence deletion of an abruptly begun speech sample, at frame sizes ranging from K = 128 to K = 4 (M = N = 1 and all other system
It takes approximately one second, regardless of segmentation size, for either the speech or the silence average to come to its nominal level. Correct detection of speech begins somewhat earlier, fortunately. 4.6 Rate of Compression Together with playback speech quality, the most important performance measure of a speech compression system is the amount of compression achieved. Compression with silence deletion results in a variable bit rate, because the amount of compression depends on the amount of silence in the speech signal, which varies with the speaker's activity. Studies [34, 35] have shown that for a normal casual two-person telephone conversation the speech and silence periods are about 40% and 60% respectively, and implementations of silence deletion algorithms [54, 23, 31, 24] suggest that, for typical speech samples, optimal compression occurs with removal of 40-50% of the original acoustic material. The following sections discuss how the compression rate is affected by various factors. A choice between the two detection criteria, the signal energy and the A M F , will then be presented. 4.6.1 Effects of Detection Threshold Whatever the frame (segmentation) size, the detection threshold ultimately determines the amount of compression achieved. There is no guarantee that a particular threshold level will always achieve a prespecified compression rate. The relationship between compression rate and threshold value varies with the speech sample. In silence compression, it is desirable to remove all silence intervals between sentences, between words, and if possible within words (intra-word silences), as long as there is not much Chapter 4. Analysis of System Performance 76 x 10"' abrupt.dat (K=128): Signal and Decision Logic (Low:SILENCE, High:SPEECH) 4000 3000 CO T3 i 2000 a. E < 1000 E(_)&E-Threshold(_J a- 1.5 1h 3 °-5 rr 500 co -500 h -1000 \ — "~ \ V ^ 7 \zJ-1 1 1 1 1 1 I 1 1 SilAvg (J, SphAvg ( ) & SphAvg/SphScale (_.J Z L) & Z-Threshold (__ SilenceLen (-ve) & SpeechLen (+ve) 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time (ms) Figure 4.22: Silence deletion of sample "abrupt.dat" with run-time initialization, at a frame size of K = 128. Chapter 4. Analysis of System Performance 77 x 10 abrupt.dat (K=64): Signal and Decision Logic (Low:SILENCE, High:SPEECH) =5 0 4000 3000 CD i 2000 C L E < 1000 E (_) & E-Threshold ( ) 1 1 i 1 1 :1 \ : — / _ _ ;_ -ll-*- — — — — i i I I i I I i i SilAvg (J, SphAvg (_ J & SphAvg/SphScale (_.J Z(JSZ-Threshold ( ) Figure 4.23: Silence deletion of sample "abrupt.dat" with run-time initialization, at a frame size of K = 64. Chapter 4. Analysis of System Performance 78 Figure 4.24: Silence deletion of sample "abrupt.dat" with run-time initialization, at a frame size of K = 16. Chapter 4. Analysis of System Performance 79 E (J & E-Threshold ( ) SilAvg (J, SphAvg (_ J & SphAvg/SphScale (_•_) 2000 1500 CD X ) u 1000 "a. E < 500 f Z (J & Z-Threshold ( ) 21 1 i 1 1 1 1 1 1 r SilenceLen (-ve) & SpeechLen (+ve) 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Time (ms) Figure 4.25: Silence deletion of sample "abrupt.dat" with run-time initialization, at a frame size of K = 4. Chapter 4. Analysis of System Performance 80 of an adverse effect on speech quality. Speech quality assessment is the topic of the next chapter. Here we present data to show the compression rates and how to control these. Silence deletion has been performed on a diverse selection of speech samples, with different threshold settings (and no speech coding; i.e. 
speech frames remain in linear PCM form). The signal energy and the AMF criteria have been used independently, and the resulting compression rates (for a frame size of K = 64) are plotted in Figures 4.26, 4.27, and 4.28. In the figures, the ratio of silence frames to total frames is plotted in solid lines, the ratio of silence codes (i.e. the number of silence intervals) to total frames with plus signs, and the actual reduction in size in dotted lines. The compression characteristics of the energy criteria are shown in the left column, and those of the AMF criteria in the right. This is a more revealing plot than simply plotting the percentage of compression, because silence codes can sometimes make a noticeable contribution to the final size (especially at small frame sizes); at the same time, the silence frames ratio generally gives a good estimate of the compression rate. For the present case (K = 64), the silence frames ratio approximates the reduction curve so well that the dotted line almost always coincides with the solid line. This is because the physical size of a silence code (6 bytes)—the overhead—is negligible compared to that of a data frame (128 bytes).

Figure 4.26: Compression rate against threshold factors, for a frame size of K = 64 (i).

As shown, the energy threshold has a more predictable, consistent pattern, with a useful range of 1.5 to 2.5, normally yielding a compression rate of approximately 40% to 60%. The variation is mostly attributable to characteristics of the speech sample; for example, "apr16.dat" is a noisier sample, "lmiz237.dat" has a rather low energy level, "sep14d.dat" contains few talkspurts in the midst of silence, and "nov01.dat" is a highly concentrated recorded message with few pauses for breath. It can be seen from the compression curves that, for all applications, E_0,1 = 2.0 is a good threshold.

By contrast, the AMF threshold behaves rather erratically. Its range of useful values is 0.7 to 0.9, but even the rather conservative value of 0.8 does not yield any consistent compression percentage. At a threshold factor of 0.8, the compression rate drops to 0% for one sample and rises to 90% for another (see "apr16" and "DEC03G", respectively, in Figure 4.26), while often
staying at an ineffectively low level (see "aug04" in Figure 4.27, and "princess" and "sep14d" in Figure 4.28). In all those cases the energy criterion gives a good and consistent compression rate. It can be concluded from its compression characteristics alone that the AMF is not very reliable.

Figure 4.27: Compression rate against threshold factors, for a frame size of K = 64 (ii).

Figure 4.28: Compression rate against threshold factors, for a frame size of K = 64 (iii).

4.6.2 Effects of Frame Size

While the energy threshold factor determines the general rate of compression, the frame size exerts its influence, through the segmentation size, on the number of segments, and hence on the number of silence codes generated. With a smaller frame size, more silence codes can be expected, which imposes a larger compression overhead.

The compression characteristics for silence deletion at K = 4 are plotted in Figures 4.29, 4.30, and 4.31. The figures show the compression rates versus the threshold factors for the same 15 speech samples as shown in Figures 4.26, 4.27, and 4.28. A brief comparison between the two sets reveals one major difference: for the smaller frame size (K = 4), the reduction curves (in dotted lines) are significantly lower than the silence frames ratios. The compression overhead is much higher because the silence code size is large compared to the size of a deleted segment (a ratio of 6:8).
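To make the overhead bookkeeping concrete, the sketch below estimates the size of a silence-compressed recording from the number of speech frames, silence frames, and silence codes. The 6-byte silence code, the 2-byte linear PCM samples, and the per-frame byte counts are taken from the text; the function name, variable names, and the counts used in main() are purely illustrative.

#include <stdio.h>

/* Illustrative estimate of the silence-compression overhead discussed
 * above.  Assumptions (from the text): 16-bit linear PCM samples
 * (2 bytes each), a 6-byte silence code per deleted silence interval,
 * and a frame of K samples stored verbatim when classified as speech.
 */
static double compressed_ratio(long speech_frames, long silence_frames,
                               long silence_codes, int K)
{
    const int sample_bytes = 2;            /* linear PCM sample      */
    const int code_bytes   = 6;            /* one silence code       */
    long total_frames  = speech_frames + silence_frames;
    double original    = (double)total_frames  * K * sample_bytes;
    double compressed  = (double)speech_frames * K * sample_bytes
                       + (double)silence_codes * code_bytes;
    return compressed / original;          /* fraction of original   */
}

int main(void)
{
    /* At K = 64 a data frame is 128 bytes, so the 6-byte codes barely
     * matter; at K = 4 a deleted 8-byte segment costs a 6-byte code.
     * The frame and code counts below are made-up examples.           */
    printf("K=64: %.3f\n", compressed_ratio(500, 500, 50, 64));
    printf("K=4 : %.3f\n", compressed_ratio(500, 500, 400, 4));
    return 0;
}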
For silence deletion using the energy threshold, the actual reduction in size is typically 5-10% lower than that achievable at K = 64, and sometimes even 20% lower. The compression overhead is even larger if the AMF criterion is used instead. Silence compression at a small frame size (i.e. a small segmentation size) is inefficient.

Speech frames encoded by µ-law PCM or ADPCM will actually experience an overall increase in size. For a frame size of K = 4 (0.5 ms), a coded speech frame takes up (4)(1) = 4 bytes for µ-law PCM, and (4)(0.5) = 2 bytes for ADPCM, whereas a silence code is 6 bytes long. Small silence frames are often deleted individually; a segment that could have been coded as speech in 2 or 4 bytes is then represented by a 6-byte silence code. This is obviously a misuse of silence deletion.

Figure 4.29: Compression rate against threshold factors, for a frame size of K = 4 (i).

Figure 4.30: Compression rate against threshold factors, for a frame size of K = 4 (ii).
Figure 4.31: Compression rate against threshold factors, for a frame size of K = 4 (iii).

4.6.3 Effects of Background Noise Level

Crosstalk, radio reception from the cordless or mobile unit, and noise in a sometimes changing background are some of the factors that can conspire to add to silence detection difficulties. Rarely is noise absent in any telephone conversation, especially with the advent of wireless phones for domestic use and cellular phones. Users of such telephones are familiar with the noise associated with poor radio signal reception, which is not unlike that encountered in FM radio broadcasts. Noise from machinery or other sources can also be present, such as in a car, on the road, or in a public place. When the speech signal originates from a poorly grounded telephone handset, noise in the form of a low hum may intrude. In one experimental setting, a 60 Hz hum, possibly coming from the power lines, is found in the speech signal on the telephone line.

Figures 4.32 and 4.33 plot the compression characteristics of silence deletion on speech with progressively more background noise. In each figure, the compression curve of the original speech sample is presented at the top, followed underneath by those of progressively noisier samples (made by artificially adding noise to the original). The left and right hand columns apply to compression with the energy and AMF criteria, respectively.

In both figures, as the speech gets progressively noisier, there is an increase in compression rate using the energy criterion; as the SNR drops from about 22 dB in the original to 7 dB or 6 dB in the noisiest sample, the compression rate increases by almost 20% (at a threshold factor of 2.0). With the AMF criterion, however, the compression rate decreases; at the AMF threshold factor of 0.8, compression declines by over 30% as the noise increases.

In most applications involving noisy speech, the detection characteristics of the energy criterion are preferable to those of the AMF criterion. Speech corrupted by noise has lost much information already; preserving corrupted information with precious storage or transmission capacity would be uneconomical under most circumstances. And it will be shown in Chapter 5 that noisy speech can withstand over-compression without much noticeable degradation.
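The exact decision rule and adaptive averaging are defined earlier in the thesis and are not repeated here; the following is only a minimal sketch of the energy criterion, assuming the detection threshold is the running silence average scaled by the threshold factor WE and that the silence average is exponentially smoothed. All names and constants in the sketch are illustrative.

#include <stdlib.h>

/* Minimal sketch of an energy-based speech/silence decision.  It is an
 * assumption-laden simplification: the frame "energy" E is taken as the
 * short-time average magnitude, and the threshold as WE times the running
 * silence average.  The full detector also tracks a speech average
 * (SphAvg) and applies the minimum speech and silence run lengths M and N,
 * which are omitted here.
 */
typedef struct {
    double sil_avg;  /* running average magnitude of silence frames;
                        seeded by the run-time initialization           */
    double we;       /* energy threshold factor, e.g. 2.0               */
} energy_detector;

static double avg_magnitude(const short *frame, int K)
{
    double sum = 0.0;
    for (int n = 0; n < K; n++)
        sum += abs(frame[n]);
    return sum / K;
}

/* Returns 1 if the frame is classified as speech, 0 if silence. */
static int classify_frame(energy_detector *d, const short *frame, int K)
{
    double e = avg_magnitude(frame, K);
    if (e > d->we * d->sil_avg)
        return 1;                               /* speech frame          */
    d->sil_avg = 0.99 * d->sil_avg + 0.01 * e;  /* adapt to noise level  */
    return 0;                                   /* silence frame         */
}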
Figure 4.32: Compression characteristics of progressively noisier speech samples based on the sample "DEC23A_x.dat" (compressed at K = 64, M = N = 1).

Figure 4.33: Compression characteristics of progressively noisier speech samples based on the sample "DEC23D_s.dat" (compressed at K = 64, M = N = 1).

Chapter 5

Subjective Listening Evaluation

As the amount of compression increases, playback quality of the compressed speech deteriorates. There comes a point at which the loss in quality or intelligibility ceases to be a reasonable cost for the amount of compression achieved; optimal compression occurs somewhere before this point, when both the compression factor and the speech quality are satisfactory.

Speech quality has many perceptual dimensions, the most important ones being intelligibility and naturalness of the speech. Perceived speech quality ultimately has to be determined subjectively by the listener.

In recent years, in an attempt to develop inexpensive methods for the characterization of communication systems and to supplement or possibly replace conventional subjective assessments, efforts have been made to derive objective measures of speech quality [68, 69]. Notable ones include those based on the cepstral distance, on the coherence function, on the concept of mutual information, on pattern recognition concepts [70, 71], on wider-band (such as 4-tone) SNR measurements, and on amplitude jitter [72]. As yet, it has not been possible to devise an objective criterion that correlates well with speech quality for a variety of speech coders and input signals; consequently, listening tests conducted with human subjects, yielding the mean opinion score (MOS), remain the standard procedure for the evaluation of speech quality [7, 24, 23, 73, 26, 25, 6, 5, 74, 11].
5.1 Objectives

Conducting subjective tests is expensive and time consuming. In designing the tests, we must first have our goals clearly identified, partly in anticipation of the possible test outcomes. The tests serve, on the one hand, to discover the perceived quality of reproduced speech, and on the other, to confirm expected results.

In our evaluation of the present speech compression system, we are concerned with the quality of speech that results from the application of silence deletion to telephone speech, as affected by varying coding and decoding parameters, and by differing speech quality of the source.

On the application level, there are five adjustable parameters in a silence compression system: the amount of compression, the amount of silence re-inserted on playback, the energy level of the re-inserted silence, the coding method for speech frames, and the segmentation size.

While it can be expected that optimal compression, with full re-insertion of all silence intervals at the original noise level, will result in the best quality of compressed speech, we will investigate how much degradation is perceptible with over-compression, shortened silence intervals upon playback, and reduced noise levels in playback.

As suggested earlier, speech quality is affected to a large extent by the segmentation size, and hence by the frame size. Compression using a range of frame sizes will be evaluated. In addition, the effect of silence deletion on speech coding, in particular on ADPCM, will be investigated. It is strongly suspected that ADPCM, because of its adaptive nature, is adversely affected by small frame sizes. A few tests will be conducted to confirm this suspicion.

Finally, all comparisons will be made between the compression results of a clear original sample and those of a noisy one. The robustness of the system against background noise can thus be evaluated.

5.2 Scoring

The most commonly performed tests are the absolute category rating (ACR) listening tests [75, 6], in which the subjects listen to short stimuli (typically eight seconds each) and rate the quality of these speech samples on the five-point scale shown in Table 5.1. Responses selected by each listener are coded with the rating numbers, and the arithmetic average over all scores for a given test condition yields the mean opinion score (MOS).

Table 5.1: Quality rating scale for an absolute category rating (ACR) test.

    Description    Rating
    Excellent      5
    Good           4
    Fair           3
    Poor           2
    Bad            1

Since our evaluation includes speech with high levels of background noise, a degradation category rating (DCR) test appears more appropriate for our purposes. Following the practice of other researchers [76, 75, 6], a set of modified MOS ratings (Table 5.2) has been used.

Table 5.2: Quality rating scale for a degradation category rating (DCR) test.

    Description                                     Rating
    As good as, or better than the "Good" sample    5
    Good, but not as good as the "Good" sample      4
    Fair (somewhere between 4 and 2)                3
    Bad, but not as bad as the "Poor" sample        2
    As bad as, or worse than the "Poor" sample      1

A slight modification to the standard DCR test has been made, in providing two reference samples instead of one; the "Good" reference is the original unprocessed sample, while one of the over-compressed and under-expanded samples has been chosen as the "Poor" reference. This serves to normalize the five-point scale, so as to fully utilize its entire range.
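For concreteness, the short routine below shows the reduction applied to each test condition's ratings: the arithmetic mean gives the MOS, and the spread is reported as a standard deviation. It is a generic calculation with made-up scores, not code taken from the thesis system, and it uses the population form of the standard deviation.

#include <math.h>
#include <stdio.h>

/* MOS and standard deviation for one test condition, from the
 * individual 1-5 ratings (29 listeners per sample in this study).
 * Function name and the example ratings are illustrative only.
 */
static void mos_stats(const int *rating, int n, double *mean, double *sdev)
{
    double sum = 0.0, sq = 0.0;
    for (int i = 0; i < n; i++) {
        sum += rating[i];
        sq  += (double)rating[i] * rating[i];
    }
    *mean = sum / n;
    *sdev = sqrt(sq / n - (*mean) * (*mean));
}

int main(void)
{
    int example[] = { 4, 3, 5, 4, 4, 3, 4, 5, 4, 3 };  /* made-up scores */
    double mos, sd;
    mos_stats(example, 10, &mos, &sd);
    printf("MOS = %.2f, std dev = %.2f\n", mos, sd);
    return 0;
}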
5.3 Organization and Results

There are five variables in total to be studied for each of the clear and noisy samples. The variables and the values they take on are shown in Table 5.3. Permutations of the first four amount to 2 x 2 x 2 x 4 = 32 samples. Add to that the four ADPCM-coded samples, one for each frame size, and we have 36 samples altogether.

Table 5.3: Variables subject to evaluation by listening tests.

    Variable                              Values
    Amount of compression                 50%, 60%
    Amount of re-inserted silence         100%, 50%
    Energy level of re-inserted silence   100%, 50% (of original)
    Frame size (K)                        4, 16, 64, 128
    Speech coding                         µ-law PCM, ADPCM

Two sets of 36 processed samples each have been generated, one for the clear original and one for the noisy original. Both original speech samples contain the following sentences:

    The wide road shimmered in the hot sun.
    Place a rosebush near the porch steps.

which have been chosen from a set of phonetically balanced Harvard sentences recommended by the IEEE [77]. The first sentence is spoken by a male, the second by a female, both of whom are Western Canadian native speakers of American English.

The listening test is divided into two sessions, the first presenting samples generated from the clear sample, and the second from the noisy sample. The setup is installed on the same AT-compatible personal computer as the speech compression system. In each part, the 36 samples are completely randomized in their order of presentation. At the beginning the listener hears a "Good" sample and a "Poor" sample for reference and orientation. The listening test proper then begins, with the processed samples presented one by one. After hearing each processed sample, the listener can choose to

• replay the processed sample any number of times, or
• replay the "Good" reference sample any number of times, or
• replay the "Poor" reference sample, again any number of times, or
• grade the processed sample with a number according to Table 5.2.

Following grade assignment, the test will proceed with the next processed sample, until the end of the session. The option to replay the reference speech samples any number of times allows a more accurate grading for each sample.

For each part of the test, the original unprocessed speech sample serves as the "Good" reference. The "Poor" sample is the one compressed at 60%, with 50% re-inserted silence at 50% of the original noise level, processed at a frame size of K = 4; this combination is believed to produce the worst quality, which is confirmed unanimously by all test subjects.

There were 29 subjects who took the listening test, consisting of native and non-native English speakers. These include staff and student members of the Electrical and Computer Engineering Department, and others from the university campus. Appendix A includes the instructions they received, and their individual test sample scores.

Tables 5.4 and 5.5 summarize the parameters subject to evaluation by listening tests, their permutations, and the corresponding sample numbers. A mean opinion score (MOS) and the associated standard deviation, calculated from the 29 scores, are shown at the end of each row of the tables. Standard deviations are less than 1.0 for 90% of all samples, suggesting general consistency of our listening group.

5.3.1 Speech Coding

ADPCM does not provide a static, constant mapping between coded values and samples of the uncoded waveform.
ADPCM basically records the difference between successive samples, with as little quantization error as possible. Because of an adaptive feedback mechanism, the error is kept small in the long term. But adaptation takes time, so the error is often large at the beginning of each adaptation period. Silence deletion divides a speech signal into segments of speech and silence, requiring the ADPCM coder to restart the adaptation process at the beginning of each speech segment. This multiplies the instances of large errors.

Table 5.4: Results of Subjective Listening Test Part 1 (samples produced from a relatively clear original).

Table 5.5: Results of Subjective Listening Test Part 2 (samples produced from a relatively noisy original).

Figure 5.1 illustrates the success and failure of ADPCM in reproducing small segments of the original speech signal, after the application of silence deletion.

Figure 5.1: Silence-compressed speech recovered with different speech codings: µ-law PCM and ADPCM.

On top is the uncoded speech waveform, followed below by ones decoded from the silence-compressed waveforms with µ-law PCM and ADPCM speech coding, respectively.
The square wave indicates the speech and silence intervals; the low and high edges correspond to the silence-coded and speech-coded portions, respectively. It can be seen that, while µ-law PCM is able to reproduce virtually all the speech segments, ADPCM fails miserably on segments shorter than 2 ms. Apparently, it takes approximately 1 ms for the ADPCM coder to successfully follow the waveform.

One expects ADPCM coding to be less successful than PCM when incorporated into silence deletion, especially at frame sizes of less than 2 ms, or K < 16, which yield a smaller segmentation size and hence more segments. However, test results show a perceived degradation of around 0.2 MOS unit for ADPCM-coded samples at all frame sizes. (Compare Samples 1/3/5/7 with Samples 2/4/6/8 in both tables; see Figure 5.2.) The quality of speech compressed at small frame sizes suffers significantly from the smaller segmentation alone; the added degradation of ADPCM at small frame sizes is therefore not perceptible.

Figure 5.2: Comparison of MOS between ADPCM and µ-law PCM.

5.3.2 Amount of Compression

Over-compression (about 60%) is likely to result in speech of poorer quality than is optimal compression of around 50%; we are interested in quantifying the resulting incremental degradation. Comparison is made between Samples 1/3/5/7 and 9-12, 13-16 and 17-20, 21-24 and 25-28, and between 29-32 and 33-36 (plotted in Figure 5.3). One observes that noisy speech samples can withstand more compression with little perceptible degradation. Over-compressing the clear sample results in a decrease of 1.0-1.5 MOS units, while the noisy sample experiences a drop of only 0.5 MOS unit.

Figure 5.3: Comparison of MOS between 60% and 50% silence compression.

5.3.3 Amount of Re-inserted Silence

Re-insertion of half of the silence intervals (by expanding each silence interval to half its original length) results in faster playback. Comparing Samples 1/3/5/7 with 21-24, and 9-20 with 25-36 (shown in Figure 5.4), reveals that, as long as the quality of the original sample is good, speeded playback is generally acceptable. The frame size, however, has to be kept at K > 16; compression with smaller K produces an effect similar to over-compression, which usually requires full expansion of the compressed silences. When the re-inserted comfort noise is reproduced at half its original energy level, a faster playback can enhance speech quality. In particular, compare the scores between Samples 29/30/33/34 and 13/14/17/18 in Table 5.4.

Figure 5.4: Comparison of MOS between 50% and 100% silence expansion.

5.3.4 Energy Level of Re-inserted Silence

The re-insertion of comfort noise improves speech quality in playback [23, 7].
Re-inserting half as much noise energy as in the original sample engenders a degradation of less than half an MOS unit in most cases. (Compare Samples 1/3/5/7 to 13-16, 9-12 to 17-20, and 21-28 to 29-36, plotted in Figure 5.5.)

Figure 5.5: Comparison of MOS between 50% and 100% silence energy in expansion.

5.3.5 Segmentation Size

Contrary to previous findings [31], the smallest segmentation size does not produce the best reconstructed speech quality on telephone-recorded samples, especially with faster playback. In fact, it often gives the worst quality. Comparisons can be made among each subgroup of four samples in Tables 5.4 and 5.5 (shown in Figure 5.6).

Figure 5.6: Comparison of MOS between the four frame sizes (K = 4, 16, 64, 128).

Small frame sizes may excel over larger ones when over-compressing a noisy speech sample (compare Samples 19 and 20 with 17 and 18 in Table 5.5), but even this advantage rarely occurs. Generally, K = 64 (a frame size of 8 ms) gives the best all-round performance, with K = 128 following in second place.

5.3.6 Background Noise

Source quality is characterized by the signal-to-noise ratio (SNR). A higher level of background noise decreases the SNR of the speech signal while increasing the difficulty of silence (or voice activity) detection.

Speech samples with background noise can be recorded either from a noisy environment, or by adding prerecorded noise to clean speech samples. For noise levels at least 10 dB below the speech level, people's talking behaviour is not yet changed by the Lombard effect, and synthetically added noise can serve as a good approximation of the ambient noise [75]. The noisy sample used in this study is created from the clear sample by adding pre-recorded telephone noise. This has been found to provide the necessary control over the SNR for our investigations. While the clear sample is measured to have a 22-dB SNR, the noisy one has been created with an SNR of 10 dB.

Comparison between Table 5.4 and Table 5.5 (plotted in Figure 5.7) shows that the speech compression system works well with clear speech, and even better with noisy speech. A 50% compression can be achieved with a small degradation of about one rating unit.

Figure 5.7: Comparison of MOS between clear and noisy samples.

Chapter 6

Conclusions

A speech compression system based on adaptive silence deletion has been implemented, as a tool for analysis and evaluation. The design of the system emphasizes simplicity over state-of-the-art coding efficiency, but still enables reduction of telephone speech from an uncoded rate of 128 kbps down to 16 kbps with a little more than one MOS rating unit of degradation (see Samples 2 and 4 in Table 5.4). The following sections summarize the major findings of the analysis and evaluation, and suggest some topics for future research.
6.1 Summary of Findings

Three low-complexity criteria for silence detection have been studied: the short-time average magnitude (or energy) E, the short-time average zero-crossing rate Z, and the average magnitude factor (AMF). Our findings confirm that the zero-crossing rate is useful only for relatively noise-free speech processed at large frame sizes (over 2 ms), and it is therefore generally ineffective for telephone speech. The AMF has been considered as an alternative to the energy criterion. Due to its finer temporal resolution, it behaves rather erratically at small frame sizes, but it offers detection performance comparable to the energy criterion when longer frames are processed. Use of the AMF, however, is handicapped by inconsistent compression performance: no threshold factor can be found that yields reliable compression for all levels of background noise. The signal energy therefore remains the best criterion in the arena of low-complexity silence detection.

Efficient calculations of the long-term averages of key parameters have been implemented, resulting in a reduction in storage requirement (for buffering) and an algorithmic simplification.

Other aspects of silence compression have been studied as well. Dynamic initialization at run-time has been found to be much more reliable and robust than static initialization with preset values. With run-time initialization, the long-time averages are set to values that are never too far away from the actual speech and noise levels. In the worst case, it takes about one second for the averages to adapt to any new levels of speech and noise.

Segmentation of the compressed speech has been studied. The segmentation size is found to be dependent on the frame size (K), the minimum length of silence (M), and the minimum length of speech (N). The best results are obtained with M = N = 1. Contrary to previous findings [31], speech quality and compression efficiency have been found to peak at a frame size of between K = 64 and K = 128; that is, between 8 and 16 ms.

Based on the compression characteristics of various speech samples recorded from the telephone network, the optimal energy threshold factor is WE = 2.0, which typically yields a compression of 40% to 60%, depending on the speech content of the sample. It works well in a wide range of operating environments, with different or varying speech and noise levels.

6.2 Future Work

The present work has left open several areas for future research.

With recent advances in speech coding techniques, some low bit rate coders have been standardized (e.g. LD-CELP and CS-ACELP). How well silence deletion can be integrated into these systems remains to be studied.

The silence compression algorithm does not include security measures against bit errors or malicious attacks. The effect of bit errors on a silence compression system can be investigated and quantified. One serious pitfall with the present system lies in silence expansion: one six-byte silence code could generate up to seventeen minutes of silence, which would use large amounts of storage space or channel capacity on the receiving end. Computer hackers could easily exploit this loophole to hoard disk space in such a voice storage system, or could deactivate a voice channel by flooding it with noise expanded from fraudulent silence codes.
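To put the scale of this pitfall in perspective, the short calculation below shows how a bounded counter in a silence code translates into playback time at the 8000 Hz telephone sampling rate. The layout assumed here (a 16-bit count of deleted frames of K = 128 samples) is hypothetical and is not the thesis's actual silence-code format, but any 6-byte code with a counter of similar width admits silences of roughly the duration quoted above.

#include <stdio.h>

/* Hypothetical illustration of the silence-expansion pitfall.  Assumed
 * layout (NOT the actual silence-code format): a 16-bit count of deleted
 * frames, with K = 128 samples per frame at an 8000 Hz sampling rate.
 */
int main(void)
{
    const unsigned max_frames = 0xFFFF;   /* largest 16-bit frame count */
    const unsigned K  = 128;              /* samples per frame          */
    const double   fs = 8000.0;           /* samples per second         */

    double seconds = max_frames * (double)K / fs;
    printf("Maximum silence from one code: %.0f s (about %.1f minutes)\n",
           seconds, seconds / 60.0);      /* roughly 17.5 minutes       */
    return 0;
}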
A limit could be imposed on the amount of silence that can be coded, but this strategy ultimately defeats the purpose of silence compression. Alternatively, an error-detection mechanism or digital authentication scheme could be proposed for use with the silence compression system.

Finally, silence compression creates a compressed source with a variable bit rate. The issues of buffering, delay, and bit rate control could usefully be addressed.

Bibliography

[1] J. Aronson, "Data compression — a comparison of methods," Special Publication 500-12, Institute for Computer Sciences and Technology, National Bureau of Standards, Jun. 1977.
[2] M. Crochemore and W. Rytter, Text Algorithms. New York: Oxford University Press, 1994.
[3] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379-423, 623-656, 1948.
[4] C. E. Shannon, "Coding theorems for a discrete source with a fidelity criterion," IRE National Convention Record, vol. 4, pp. 142-163, 1959.
[5] N. Jayant, "Signal compression: Technology targets and research directions," IEEE Journal on Selected Areas in Communications, vol. 10, pp. 796-818, June 1992.
[6] R. V. Cox and P. Kroon, "Low bit-rate speech coders for multimedia communication," IEEE Communications Magazine, pp. 34-41, December 1996.
[7] W. B. Kleijn and K. K. Paliwal, eds., Speech Coding and Synthesis. Elsevier, 1995.
[8] A. S. Spanias, "Speech coding: A tutorial review," Proceedings of the IEEE, vol. 82, pp. 1541-1582, October 1994.
[9] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Englewood Cliffs: Prentice-Hall, 1978.
[10] N. Benvenuto, G. Bertocci, W. R. Daumer, and D. K. Sparrell, "The 32-kb/s ADPCM coding standard," AT&T Technical Journal, vol. 65, pp. 12-22, Sep.-Oct. 1986.
[11] M. H. Sherif, D. O. Bowker, G. Bertocci, B. A. Orford, and G. A. Mariano, "Overview and performance of CCITT/ANSI embedded ADPCM algorithms," IEEE Transactions on Communications, vol. 41, pp. 391-399, Feb. 1993.
[12] A. M. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems. John Wiley & Sons, 1994.
[13] B. S. Atal, V. Cuperman, and A. Gersho, eds., Speech and Audio Coding for Wireless and Network Applications. Kluwer, 1993.
[14] A. Gersho, "Advances in speech and audio compression," Proceedings of the IEEE, vol. 82, pp. 900-918, June 1994.
[15] F. Coulmas, The Writing Systems of the World. Basil Blackwell, 1989.
[16] G. Wu and J. W. Mark, "Multiuser variable rate subband coding incorporating DSI and buffer control," IEEE Transactions on Communications, vol. 38, pp. 2159-2165, Dec. 1990.
[17] Y. Shoji, O. Noguchi, K. Horiguchi, and S. Tsukagoshi, "A speech processing LSI for ATM network subscriber circuits," in IEEE International Symposium on Circuits and Systems, (New Orleans, LA, USA), pp. 2897-2990, May 1990.
[18] M. H. Savoji, "A robust algorithm for accurate endpointing of speech signals," Speech Communication, vol. 8, pp. 45-60, Mar. 1989.
[19] E. S. Dermatas, N. D. Fakotakis, and G. K. Kokkinakis, "Fast endpoint detection algorithm for isolated word recognition in office environment," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, (Toronto, ON, Canada), pp. 733-736, Apr. 1991.
[20] P. de Souza, "A statistical approach to the design of an adaptive self-normalizing silence detector," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.
31, pp. 678-684, Mar. 1983. . • • [21] M . E . M . Nasr, " A multiple user variable rate speech coding for packet transmission sys-tems," in Mediterranean Electrotechnical Conference Proceedings (MELECON), (Lisbon, Portugal), pp. 221-224, Apr. 1989. [22] L . M . Lundheim and T. A . Ramstad, "Variable rate coding for speech storage," in Pro-ceedings of IEEE International Conference on Acoustics} Speech, and Signal Processing, vol. 4, (Tokyo, Japan), pp. 369-372, 1986. [23] C. K . Gan and R. W. Donaldson, "Adaptive silence deletion for speech storage and voice mail applications," IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 924-927, June 1988. [24] C. Rose and R. W . Donaldson, "Real-time implementation and evaluation of an adaptive silence deletion algorithm for speech compression," in IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pp. 461-468, May 1991. [25] D. K . Freeman, G. Cosier, C. B . Southcott, and I. Boyd, "The voice activity detector for the Pan-European digital cellular mobile telephone service," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, (Glasgow, Scotland), May 1989. [26] C. B . Southcott, D . Freeman, G. Cosier, D. Sereno, A . V . der Krogt, A . Gilloire, and H . J . Braun, "Voice control of the Pan-European digital mobile radio system," in IEEE Global Telecommunication Conference and Exhibition: Communication Technology for the 1990s and Beyond, vol. 2, (DaUas, T X , USA), pp. 1070-1074, Nov. 1989. [27] A . M . Kondoz and B . G . Evans, " A high quality voice coder with integrated echo canceller and voice activity detector for V S A T systems," in 3rd European Conference on Satellite Communications, (Manchester, U K ) , pp. 196T200, NOV. 1993. Bibliography 110 M . R. Suddle, A . M . Kondoz, and B. G . Evans, "DSP implementation of low bit-rate C E L P based speech coders," in IEE 6th International Conference on Digial Processing of Signals in Communications, (Loughborough, U K ) , pp. 309-314, Sept. 1991. A . Gersho and E . Paksoy, " A n overview of variable rate speech coding for cellular net-works," in Proceedings of IEEE International Conference on Selected Topics in Wireless Communications, (Vancouver, B C , Canada), pp. 172-175, June 1992. E . Paksoy and A . Gersho, "Variable rate speech coding for multiple access wireless net-works," in Mediterranean Electrotechnical Conference Proceedings (MELECON), (Antalya, Turkey), pp. 47-50, 1994. C. Rose, "Real-time implementation and evaluation of an adaptive silence deletion al-gorithm for speech compression," Master's thesis, Department of Electrical Engineering, University of British Columbia, Aug. 1991. K . Bullington and J . M . Fraser, "Engineering aspects of TASI," The Bell System Technical Journal, vol. 38, pp. 353-364, Mar. 1959. S. J . Campanella, "Digital speech interpolation," COMSAT Technical Review, vol. 6, pp. 127-158, Spring 1976. P. T. Brady, " A statistical analysis of on-off patterns in 16 conversations," The Bell System Technical Journal, vol. 47, pp. 73-92, Jan. 1968. H . H . Lee and C. K . Un, " A study of on-off characteristics of conversational speech," IEEE Transactions on Communications, vol. 34, pp. 630-637, June 1986. H . Miedema and M . G. Schachtman, "TASI quality—Effect of speech detectors and inter-polation," The Bell System Technical Journal, vol. 38, pp. 353-364, Mar. 1959. L. R. Rabiner and M . R. 
Sambur, "An algorithm for determining the endpoints of isolated utterances," The Bell System Technical Journal, vol. 54, pp. 297-315, Feb. 1975. P. G. Drago, A . M . Molinari, and F. C. Vagliani, "Digital dynamic speech detectors," IEEE Transactions on Communications, vol. 26, pp. 140-145, Jan. 1978. E . Fariello, " A novel digital speech detector for improving effective satellite capacity," IEEE Transactions on Communications, vol. 20, pp. 55-60, Feb. 1972. J . A . Jankowski, Jr., " A new digital voice activated switch," COMSAT Technical Review, vol. 6, pp. 159-178, Spring 1976. C. K . Un and H . H . Lee, "Voiced/unvoiced/silence discrimination of speech by delta mod-ulation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 398-407, Apr. 1980. Bibliography 111 [42] L. F . Lamel, L . R. Rabiner, A . E . Rosenberg, and J . G . Wilpon, " A n improved endpoint de-tector for isolated word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, pp. 777-785, Aug. 1981. [43] Y . Yatsuzuka, "Highly sensitive speech detector and high-speed voiceband data discrimi-nator in D S I - A D P C M systems," IEEE Transactions on Communications, vol. 30, pp. 739-750, Apr. 1982. [44] M . Hahn and C. K . Park, "An improved speech detection algorithm for isolated Korean utterances," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, (San Francisco, C A , USA) , pp. 525-528, Mar. 1992. [45] B . S. Atal and L . R. Rabiner, " A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, pp. 201-212, June 1976. [46] L. R. Rabiner, C. E . Schmidt, and B. S. Atal , "Evaluation of a statistical approach to voiced-unvoiced-silence analysis for telephone quality speech," The Bell System Technical Journal, vol. 56, pp. 455-482, Mar. 1977. [47] L. J . Siegel and A . C. Bessey, "Voiced/unvoiced/mixed excitation classification of speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, pp. 451-460, Mar. 1982. [48] G . Bruno, M . D. Di Benedetto, M . G . Di Benedetto, A . Gilio, and P. Mandarini, " A Bayesian-adaptative decision method for the v/uv/s classification of segments of a speech signal," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, pp. 556-559, Apr. 1987. [49] J . Huang and B . -D . Tseng, " A Walsh Transform based endpoint detection of isolated utterances," in Conference Record of the 25th Asilomar Conference on Signals, Systems and Computers, (Pacific Grove, C A , USA) , pp. 335-338, Nov. 1991. [50] J . A . Haigh and J . S. Mason, "Robust voice activity detection using cepstral features," in Proceedings of IEEE Region 10 International Conference on Computers, Communications and Automation (TENCON), (Beijing, China), Oct. 1993. [51] M . Rangoussi, A . Delopoulos, and M . Tsatsanis, "On the use of.higher order statistics for robust endpoint detection of speech," in IEEE Signal Processing Workshop on Higher-Order Statistics, (South Lake Tahoe, C A , USA) , pp. 56-60, June 1993. [52] M . Rangoussi and G . Carayannis, "Higher order statistics based Gaussianity test applied to on-line speech processing," in Conference Record of the 28th Asilomar Conference on Signals, Systems and Computers, (Pacific Grove, C A , USA) , pp. 303-307, 1994. [53] G . S. Ying, C. D. Mitchell, and L. H . 
Jamieson, "Endpoint detection of isolated utterances based on a modified Teager energy measurement," in Proceedings of IEEE International Bibliography 112 Conference on Acoustics, Speech, and Signal Processing, (Minneapolis, M N , USA) , pp. 732-735, Apr. 1993. C. K . Gan, "Efficient speech storage via compression of silence periods," Master's thesis, Department of Electrical Engineering, University of British Columbia, December 1984. J . Taboada, S. Feijoo, R. Balsa, and C. Hernandez, "Explicit estimation of speech bound-aries," IEE Proceedings. Science, Measurement and Mechnology, vol. 141, pp. 153-159, May 1994. S-. Jacobs, A . Eleftheriadis, and D. Anastassiou, "Silence detection for multimedia commu-nication." On the W W W http: / /www.ctr .columbia.edu/~elef t /papers/mmsj96.html, June 1995. F. J . Owens, Signal Processing of Speech. Macmillan New Electronics Series, London: Macmillan, 1993. J . R. D. Jr., J . G. Proakis, and J . H . L . Hansen, Discrete-Time Processing of Speech Signals. Macmillan: Prentice-Hall, 1993. A . Eleftheriadis, S. Pejhan, and D. Anastassiou, "Algorithms and performance evalua-tion of the Xphone multimedia communication system," in ACM Multimedia Conference, (Anaheim, C A , USA) , pp. 311-320, August 1993. N . S. Jayant and P. Noll, Digital Coding of waveforms. Signal Processing Series, Englewood Cliffs: Prentice-Hall, 1984. G . J . Borden, K . S. Harris, and L . J . Raphael, Speech Science Primer. Baltimore: Williams and Wilkins, 3rd ed., 1994. J . Bellamy, Digital Telephony. Wiley Series in Telecommunications, New York: Wiley, 1991. G . E . Pelton, Voice Processing. New York: McGraw-Hill , 1993. D. Bertsekas and R. Gallager, Data Networks. Prentice-Hall, 2nd ed., 1992. D. E . Knuth, Seminumerical Algorithms, vol. 2 of The Art of Computer Programming. Addison-Wesley, 1969. S P E C T R U M Signal Processing Inc., Burnaby, B C , Canada, TMS320C30 System Board Technical Reference Manual, May 1990. Z. Goh and S.-N. Koh, "Speech coding by wavelet representation of residual signal," in Pro-ceedings of IEEE International Conference on Circuits and Systems, (Singapore), pp. 860-864,1994. T. Yamazaki and H . Irii, "Objective estimation of speech quality degradation due to trans-mission error," in SUPERCOMM/ICC '92: Discovering a New World of Communications, (Chicago, IL, USA) , pp. 66-70, June 1992. Bibliography 113 S. Wang, A . Sekey, and A . Gersho, " A n objective measure for predicting subjective quality of speech coders," IEEE Journal on Selected Areas in Communications, vol. 10, pp. 819— 829, June 1992. H . Irii, "Comparison of four objective speech quality assessment methods based on inter-national subjective evaluations of universal codecs," in Proceedings of IEEE International Conference on Communications, (Denver, CO, USA) , pp. 1726-1730, June 1991. R. Kubichek, D. J . Atkinson, S. Voran, J . Lansford, H. L i , and J . Schroeder, "Advances in objective voice quality assessment," in IEEE 42nd Vehicular Technology Conference, (Denver, CO, USA) , pp. 155-158, May 1992. S. Dimolitsas, F . L . Corcoran, and J . G. Phipps Jr., "Estimation of digital low-rate en-coded voice link performance from instrumental measurements," IEEE Transactions on Instrumentation and Measurement, vol. 42, pp. 799-805, Aug. 1993. S. Dimolitsas, F . L . Corcoran, and M . R. Baraniecki, "Transmission quality of North American cellular, personal communications, and public switched telephone networks," IEEE Transactions on Vehicular Technology, vol. 43, pp. 245-251, May 1994. L. R. 
Rabiner, "Applications of voice processing," Proceedings of the IEEE, vol. 82, pp. 199-228, Feb. 1994. P. Kroon, "Evaluation of speech coders," in Speech coding and synthesis (W. B . Kleijn and K . K . Paliwal, eds.), ch. 13, pp. 467-494, Elsevier, 1995. R. G . Dorbolo, "Design and test of a real-time voice communications system for power line communication channels," Master's thesis, Department of Electrical Engineering, Univer-sity of British Columbia, January 1994. I E E E , " I E E E recommended practice for speech quality measurement," IEEE Transactions on Audio and Electroacoustics, vol. 17, pp. 227-246, Sept. 1969. T. W . Parsons, Voice and Speech Processing. New York: McGraw-Hill , 1987. Institution of Electrical Engineers, International Conference on Speech Input/Output: Techniques and Applications, (London, U K ) , Mar. 1986. D. G. Childers, M . Hahn, and A . E. Rosenberg, "Silent and voiced/unvoiced/mixed exci-tation (four-way) classification of speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 777-785, Apr. 1981. B. Reaves, "Comments on ' A n improved endpoint detector for isolated word recognition'," IEEE Transactions on Signal Processing, vol. 39, pp. 526-527, Feb. 1991. J . F . Lynch, Jr., J . G . Josenhans, and R. E . Crochiere, "Speech/silence segmentation for real-time coding via rule based adaptive endpoint detection," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 87, (Dallas, T X , USA), pp. 1348-1351, Apr. 1987. Bibliography 114 [83] W . Gang and J . Z. Ou Yang, "Speech signal processing basing upon delta modulation," in Proceedings of IEEE International Conference on Circuits and Systems, (Shenzhen, China), pp. 6-9, Institute of Electrical and Electronics Engineers, June 1991. [84] G. Stoll, G . Theile, and M . Link, " M A S C A M : Using psychoacoustic masking effects for low-bit-rate coding of high quality complex sounds," in Structure and Perception of Elec-troacoustic Sound and Music: Proceedings of the Marcus Wallenberg Symposium held in Lund Sweden, on 21-28 August 1988 (S. Nielzen and 0 . Olsson, eds.), pp. 161-180, Ex-cerpta Medica, 1989. [85] H . Nakada and K.- I . Sato, "Variable rate speech coding for asynchronous transfer mode," IEEE Transactions on Communications, vol. 38, pp. 277-284, Mar. 1990. [86] K . Konstantinides, "Fast subband filtering in M P E G audio coding," IEEE Signal Process-ing Letters, vol. 1, pp. 26-28, Feb. 1994. [87] M . Kumar and M . Zubair, " A high performance software implementation of M P E G au-dio encoder," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1049-1052, May 1996. [88] Y . Yatsuzuka, "High-gain digital speech interpolation with adaptive differential P C M en-coding," IEEE Transactions on Communications, vol. 30, pp. 750-761, Apr. 1982. [89] B . Mak, J .-C. Junqua, and B . Reaves, " A robust speech/non-speech detection algorithm using time and frequency-based features," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, (San Francisco, C A , USA) , pp. 269-272, Mar. 1992. [90] S.-W. Park, "Speech compression using A R M A model and wavelet transform," in Pro-ceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, (Adelaide, SA, Australia), pp. 209-212, 1994. [91] X . - Y . Wang, M. -Q . Kong, and J.-F. 
Li, "Embedded ADPCM for variable bit rate coding in ATM," in Proceedings of IEEE Region 10 International Conference on Computers, Communications and Automation (TENCON), (Beijing, China), pp. 314-316, Oct. 1993.
[92] D. W. Lin, V. K. Varma, and J. L. Dixon, "Design of a medium rate linear-predictive speech coder for digital portable radio communications," in IEEE 40th Vehicular Technology Conference, (Orlando, FL, USA), pp. 326-330, May 1990.
[93] P. Norton, Inside the IBM PC and PS/2. New York: Brady, 3rd ed., 1990.

Appendix A

Listening Tests

A.1 Instructions

A session of the subjective listening test consists of two parts, each presenting 36 test samples to the subject for evaluation. Prior to Part 1 of the test, the subject is given the instructions shown in the following screen dump (Figure A.1). Part 2 has exactly the same instructions.

    SILENCE COMPRESSION SUBJECTIVE LISTENING TEST 1

    INSTRUCTIONS:

    Thank you for taking part in this listening test. For your orientation,
    you will first hear the "Good" sample and the "Poor" sample through the
    telephone handset. A series of 36 speech samples will then be played
    back to you one at a time. After hearing each sample you can do one of
    the following.

    • Rank the quality of the speech sample with a number:
        5 - As good as, or better than the "Good" sample
        4 - Good, but not as good as the "Good" sample
        3 - Fair (somewhere between 4 and 2)
        2 - Bad, but not as bad as the "Poor" sample
        1 - As bad as, or worse than the "Poor" sample

    • Or issue a one-letter command:
        G - Play the "Good" reference sample again
        P - Play the "Poor" reference sample again
        R - Replay the current sample

    Press any key to proceed

Figure A.1: On-screen instructions for the subjective listening tests.

A.2 Scores

A total of 29 subjects took the test, and their contributions are recorded in Tables A.1 and A.2.

Table A.1: Individual scores for Subjective Listening Test Part 1.
Table A.2: Individual scores for Subjective Listening Test Part 2.
Table A.2: Individual scores for Subjective Listening Test Part 2.

[Each row of Table A.2 lists one of the 36 test samples of Part 2, the score (1 to 5) assigned to it by each of the 29 subjects, and the resulting per-sample mean and standard deviation.]

Appendix B

System Software

The majority of the computer code has been written in the C programming language. This includes a prototyping simulator and the software for the speech compression system proper.

B.1 The Simulator Program

A simulator program has been written that runs on UNIX (SunOS) and on MS-DOS. Comparisons of the diagnostic data show occasional minor discrepancies between program executions on the two different computing environments (SunOS on a Sparc IPX, and MS-DOS on a PC486), possibly due to different machine precisions in floating-point numeric representations. The simulator can also produce analytical data, which have been used extensively in Chapter 4.

There are three simulator programs:

• silence deletion based on the energy criteria (sildel.c)
• silence deletion based on the AMF criteria (mulawdel.c)
• silence insertion (silins.c)

Byte order conversion is done by rotate16.c. G.711 μ-law conversion is provided by g711.c of Section B.2.2.

B.2 The Speech Compression System

The speech compression system consists of software written in TMS320C30 assembly code and in the C programming language.

B.2.1 TMS320C30 Assembly Codes

The A/D and D/A functions of the Spectrum TMS320C30 DSP board are controlled by software written in TMS320C30 assembly code (playrec.asm). These routines are loaded onto the DSP board by the speech compression system, which exercises control and passes data through memory-mapped addressing.

B.2.2 C Codes

The code for the speech compression system has been modularized into the following:

• main module (scs.c)
• command interpreter (user interface) (shell.c)
• user interface helper routines (gui.c)
• processing routines (process.c)
• I/O routines (data_io.c)
• speech coding routines (g711.c, g72x.c, g721.c)
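The energy-based silence deletion of sildel.c, described in Section B.1, classifies speech on a frame-by-frame basis by comparing a short-time energy measure against decision thresholds. The fragment below is only a schematic sketch of such a frame-energy test and is not taken from the thesis software: the frame length, the fixed threshold, and all identifiers are illustrative assumptions.

    /* frame_energy.c -- illustrative sketch only; not the algorithm of
     * sildel.c.  Reads 16-bit linear PCM samples (native byte order,
     * binary-mode standard input assumed) one frame at a time and labels
     * each frame "active" or "silent" according to its average energy. */
    #include <stdio.h>

    #define FRAME_LEN        64      /* 8 ms at an 8 kHz sampling rate */
    #define ENERGY_THRESHOLD 4.0e5   /* arbitrary fixed value for illustration */

    /* Return 1 if the frame is classified as active speech, 0 if silent. */
    static int frame_is_active(const short frame[FRAME_LEN])
    {
        double energy = 0.0;
        int i;

        for (i = 0; i < FRAME_LEN; i++)
            energy += (double)frame[i] * frame[i];
        energy /= FRAME_LEN;             /* average energy per sample */

        return energy > ENERGY_THRESHOLD;
    }

    int main(void)
    {
        short frame[FRAME_LEN];

        while (fread(frame, sizeof(short), FRAME_LEN, stdin) == FRAME_LEN)
            printf("%s\n", frame_is_active(frame) ? "active" : "silent");
        return 0;
    }

In the thesis system the decision thresholds are adaptive rather than fixed constants; the sketch fixes one only to keep the example self-contained.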
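Byte order conversion is needed because the two development hosts store 16-bit words differently: the Sparc is big-endian and the PC is little-endian. The filter below is a generic sketch of that kind of conversion and is not the source of rotate16.c.

    /* swap16.c -- generic 16-bit byte-swapping filter; a sketch of the kind
     * of conversion rotate16.c performs, not the thesis source code.
     * Copies standard input to standard output with the two bytes of every
     * 16-bit word exchanged (binary-mode standard streams assumed). */
    #include <stdio.h>

    int main(void)
    {
        int b1, b2;

        while ((b1 = getchar()) != EOF && (b2 = getchar()) != EOF) {
            putchar(b2);
            putchar(b1);
        }
        return 0;
    }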
