UBC Theses and Dissertations

Adaptive predictive delta coder combining syllabic adaptation and a self-adaptive quantizer Schellenberg, Walter Ulrich 1974

AN ADAPTIVE PREDICTIVE DELTA CODER COMBINING SYLLABIC ADAPTATION AND A SELF-ADAPTIVE QUANTIZER

by

Walter Ulrich Schellenberg
Dipl. El. Ing., Swiss Federal Institute of Technology, Zurich, 1969

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE
in the Department of Electrical Engineering

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April 1974

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
Vancouver 8, Canada

Abstract

The purpose of this thesis is to evaluate an adaptive differential encoder for digital communication channels. The redundancy reduction technique used in this coder is not restricted to speech signals only. However, it was optimized for such signals with the objective of keeping the bit-rate as low as possible. In the transmitter, the signal redundancy is reduced in two steps. First, taking advantage of the quasi-periodicity of speech signals, the current signal value is predicted from the value one period before.
Secondly, the difference between this prediction and the true value is estimated by a prediction based on the two previous differences. The error of this second prediction is quantized and transmitted to the receiver. The receiver produces a replica of the original signal by adding the received error signal to the predicted value. The adaptation of the quantizer step size has an exponential characteristic and contains a delay of one sampling period. These two features give rise to instabilities and poor reproduction of high signal frequencies, the latter being an inherent characteristic of this quantizer. The stability problem could not be solved theoretically because the nonlinearity and delay in the quantizer render the system mathematically intractable. By restricting the maximum quantizer level and adding a direct feedback of the step size to the second predictor, the instabilities are restricted to acceptable limits. The coder was simulated on a digital computer and optimized for sampling rates of 8, 12, and 16 kHz using objective calculations of the signal-to-quantization noise ratio as well as subjective preference tests. In some cases the calculated S/Q ratios allow no distinction in performance, but the subjective evaluations exhibit strong differences in preference. This observation emphasizes the necessity to examine the subjective performance of such voice systems.
Comparisons with speech from a log PCM encoder indicate that at a sampling rate of 8 kHz, the subjective quality of the reconstructed speech is slightly superior to that of log PCM encoded at 3 bits per sample and the same sampling frequency. It is estimated that at a sampling rate of 8 kHz, an additional 3800 bits/sec are required to transmit the four coder parameters. The suggested final transmission bit-rate is therefore 11.8 kbits/sec. Thus, a data compression of approximately two to one has been achieved. Included are suggestions for reducing the bit-rate to 9.6 kbits/sec with minimal degradation in speech quality below that achieved using 11.8 kbits/sec.

TABLE OF CONTENTS

I. INTRODUCTION
II. REVIEW OF IMPORTANT PREVIOUS RESULTS
2.1 An Adaptive Delta Modulator Using a One-Bit Memory
2.1.1 Response to a Step Function
2.1.2 Bounds on P and Q
2.1.3 Performance With Speech Input
2.2 An Adaptive Predictive Coder Using Pitch and Formant Structure
III. THE ADAPTIVE PREDICTIVE CODER WITH A SELF-ADJUSTING QUANTIZER LEVEL
3.1 The First Loop: Reduction of the Pitch Redundancy
3.2 The Second Loop: Reduction of the Formant Redundancy
IV. ON MEASURING AND COMPARING SPEECH QUALITY
V. COMPUTER SIMULATION AND RESULTS
5.1 Preparation of the Data
5.2 Subjective Test Procedure
5.3 Results and Discussion
5.3.1 Variation of the Step Multiplier P
5.3.2 Variation of the Direct Feedback Coefficient y
5.3.3 Variation of the Frame Length
5.3.4 Variation of the Learning Period
5.3.5 Variation of the Switch Threshold
5.3.6 Variation of the Product of the Step Multipliers P and Q
5.3.7 Comparison with Logarithmic PCM
VI. ADDITIONAL RESULTS AND RECOMMENDATIONS
6.1 Improvement of the Formant Redundancy Reduction
6.2 Improvement of the Reproduction of Voiceless Sounds
6.3 Effect of Channel Noise
6.4 Quantization of the Predictor Parameters
VII. SUMMARY AND CONCLUSIONS
REFERENCES

LIST OF ILLUSTRATIONS

Figure
2.1 Jayant's adaptive delta modulator
2.2 Response to a step function
2.3 Adaptive predictive coder [6]
3.1 Detailed block diagram of the Adaptive Predictive Delta Modulator
3.2 Relative spectral levels of a test sentence and the associated quantization noise
3.3 Configuration and impulse response to a pulse of height 1 at t=0 of a filter with an exponentially decreasing impulse response and no overshoot
3.4 Frequency response of the filter in Figure 3.3
3.5 Input x(nT) and output y(nT) of the filter in Figure 3.3
3.6 Filter with a rectangular impulse response
3.7 The recursive filter in the second loop
3.8 a) Stability range for the coefficients and b) Truncated stability range
3.9 Geometrical interpretation of the denominator of (3.8)
3.10 The nonlinearity in the second loop
5.1 Dependence of the S/Q ratio on the step multiplier
5.2 a) The S/Q ratio as a function of the coefficient y b) Subjective preference as a function of the coefficient y
5.3 Dependence of the S/Q ratio on the frame length
5.4 The S/Q ratio as a function of the learning period for a sampling frequency of 8 kHz
5.5 a) Dependence of the S/Q ratio on the switch threshold set by the number of zero-crossings b) Dependence of the S/Q ratio on the switch threshold set by the signal energy
5.6 S/Q ratio as a function of the product P·Q
5.7 Comparison of the Adaptive Predictive Delta Modulator to logarithmic PCM
6.1 Histograms of the coder parameters, calculated for all four sentences

ACKNOWLEDGEMENT

I wish to thank Dr. R. W. Donaldson, my supervisor, for suggestions that provided the motivations for this thesis. I am very thankful to Mr. M. E. Koombes who wrote the programs to digitize the voice signals. The financial support received in the form of U.B.C. Graduate Fellowships from 1972 to 1974, and under the National Research Council Grant 67-3308 is gratefully acknowledged. Finally, I would like to thank my wife Ruth for moral support and endless patience while I completed this project.

I. INTRODUCTION

Transmission of speech in digital form offers several advantages over conventional analog techniques. An important benefit of digital channels is their relatively low susceptibility to noise and crosstalk. Distorted or weak pulses can be regenerated by repeaters without being cumulatively degraded, thereby maintaining good quality over long distances. Also, some communication links require reliable speech encryption, a rather difficult task in an analog system but digitally realizable with comparatively simple means. Another point that favours digital techniques is the tremendous technological progress in manufacturing digital circuitry, which makes inexpensive, mass-produced, and miniaturized subsystems available off the shelf.

The immediate price for these advantages is a higher bandwidth requirement. However, digital techniques also offer ways to compress speech by removal of some redundancy, thus reducing bandwidth. The quality of seven-bit, logarithmic Pulse-Code Modulation (PCM) with a sampling frequency of 8 kHz is generally accepted as good telephone quality. But the necessary bit-rate of 56 kbits/sec compares very unfavourably with the actual information content of speech. The entropy of the written equivalent of speech is only of the order of 50 bits/sec ([1], p. 4). Experiments also indicate that the human is not able to process information at rates in excess of about 50 bits/sec ([1], section 1.3).
On the other hand, a conventional analog voice channel requires a bandwidth of at least 3000 Hz and a signal-to-noise ratio of 30 db in order to transmit speech satisfactorily. According to Shannon's theorem such a channel has a capacity of approximately 30,000 bits/sec ($C = B \log_2(1 + S/N) = 3000 \cdot \log_2 1001 \approx 29{,}900$ bits/sec). Evidently, voice signals contain a lot of redundancy, but their actual information content is not known exactly.

Several systems for transmitting speech intelligibly at low bit-rates have been reported. For a good summary and bibliography see Flanagan [1], chapter VIII, and for some recorded samples see Bayless [2] and Atal [3]. However, high intelligibility is not always sufficient; for example, when the listener would also like to be able to recognize the speaker and the speaker's emotions. Therefore a certain subjective fidelity criterion is specified, and any coding method, although aiming at a reduction of the channel capacity required to transmit the signal, must not impair this criterion severely.

There are mainly two ways to reduce signal redundancy. In the approach taken by vocoders, the speech spectrum is analyzed. The extracted parameters are transmitted and used in the receiver to synthesize a replica of the original signal spectrum. Since they are based on idealized models of speech generation and perception, and discard the information that cannot be parameterized, vocoders usually suffer from unnaturalness of the synthesized speech. The other approach is predictive coding.
As in vocoder systems, some characteristics of speech such as pitch and formants are used for a partial parameterization of the signal. But unlike the vocoders, predictive coders transmit the difference between the parameterized signal and the actual input. On the basis of these parameters, the receiver predicts the signal from its past, using the transmitted difference as a correction. Since the entropy of the difference signal is smaller than the entropy of the original signal, it needs fewer bits for its encoding.

The present thesis deals with predictive coding. The main features of the speech encoders previously published by Atal [6] and Jayant [4] were combined to form a new adaptive predictive coding system.

II. REVIEW OF IMPORTANT PREVIOUS RESULTS

The two systems that are related most closely to the coder described here, and that provided the primary motivations for this project, shall be reviewed briefly. The first section summarizes results obtained by Jayant [4] in a simulation of an adaptive delta modulator. In the second section the speech encoder by Atal and Schroeder [6] using pitch and formant redundancy reduction is described in short.

2.1 An Adaptive Delta Modulator Using a One-Bit Memory

Figure 2.1 shows this system described by Jayant [4]; its simplicity is very attractive. As in ordinary delta modulation, a two-level quantizer is used. The transmitted channel symbol represents the polarity of the difference between the sampled, bandlimited input signal $S_n$ and the latest approximation to it, P.
The system is adaptive in the sense that, at every sampling instant $nT$, the stepsize $L_n$ is modified on the basis of a comparison between the two latest transmitted channel symbols, $C_n$ and $C_{n-1}$. The adaptation logic is very simple; the new stepsize $L_n$ is calculated by multiplying the latest stepsize $L_{n-1}$ by either $+P$ or $-Q$, depending on whether $C_n$ and $C_{n-1}$ are equal or not. Conventional delta modulation results for $P = Q = 1$.

Figure 2.1 Jayant's adaptive delta modulator. Adaptation logic: $A = P$ if $C_n = C_{n-1}$, $A = -Q$ if $C_n \neq C_{n-1}$; $L_n = A \cdot L_{n-1}$; $C_n = \operatorname{sign}(D_n)$.

2.1.1 Response to a Step Function

Figure 2.2 shows the staircase approximation of a step input. The smallest stepsize is assumed to be 1; P and Q are set at 1.5 and 0.66, respectively. The response has two different phases. In the first phase, the system hunts the input. Due to the exponentially increasing stepsize, this hunting period is considerably shorter than for a conventional delta modulator. In other words, the coder will be less susceptible to slope overload than a nonadaptive scheme. After the staircase approximation has caught up with the input, the oscillating phase starts. The most important feature of this state is the fact that the stepsize does not always assume its smallest possible value. The system may oscillate around the constant input with a nonminimum stepsize, resulting in an increase of the quantization noise during quiet input periods. This is of importance when the minimum allowable stepsize for such a system is chosen.

Figure 2.2 Response to a step function.
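The adaptation rule of Figure 2.1 and the two phases of the step response can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Jayant's simulation: the function name `adm_step_response`, the step height of 22 units, the zero initial estimate, and the clamping of the stepsize magnitude to the minimum value are choices made for the example. P = 1.5, Q = 0.66 and the minimum stepsize of 1 are the values quoted above.

```python
def adm_step_response(target, n_samples, P=1.5, Q=0.66, min_step=1.0):
    """Track a constant input with the adaptation logic of Figure 2.1.

    Returns a list of (estimate, stepsize magnitude) pairs, one per sample.
    The signed stepsize L_n is multiplied by +P when the two latest channel
    symbols agree and by -Q when they differ.
    """
    estimate = 0.0          # assumed initial staircase value
    step = 0.0              # signed stepsize L_n
    prev_symbol = 0         # no channel symbol transmitted yet
    history = []
    for _ in range(n_samples):
        symbol = 1 if target >= estimate else -1      # C_n = sign(D_n)
        if prev_symbol == 0:
            step = symbol * min_step                  # first sample
        elif symbol == prev_symbol:
            step = P * step                           # A = +P
        else:
            step = -Q * step                          # A = -Q
        if abs(step) < min_step:                      # assumed lower clamp
            step = symbol * min_step
        estimate += step
        prev_symbol = symbol
        history.append((estimate, abs(step)))
    return history


if __name__ == "__main__":
    for n, (estimate, size) in enumerate(adm_step_response(22.0, 16)):
        print(f"n={n:2d}  estimate={estimate:8.3f}  stepsize={size:7.3f}")
```

With these settings the staircase passes the input on the seventh sample, where a fixed stepsize of 1 would need more than twenty, and it then keeps oscillating around the input with a stepsize well above the minimum, which is the behaviour described for the second phase.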

2.1.2 Bounds on P and Q

Under the assumption that stepsize adaptations using the multipliers P and Q are equally probable, Jayant derived an optimum condition for P and Q, namely:

$$(P \cdot Q)_{opt} = 1. \tag{2.1}$$

In addition, minimization of the mean-square error between input and output yields an upper bound for $P_{opt}$. Together with the requirement that $P > 1$, his theoretical bounds are

$$1 < P_{opt} = 1/Q_{opt} < 2. \tag{2.2}$$

When calculating the mean-square error in computer simulations of the coder, using speech as well as video input, the optimum values for P and Q were found to lie within the predicted range; that is,

$$P_{opt} = 1.5 \quad \text{and} \quad Q_{opt} = 0.6. \tag{2.3}$$

2.1.3 Performance With Speech Input

Using computer simulations, Jayant calculated the signal-to-quantization-noise ratios (S/Q ratios) for the sentence "Have you seen Bill?" for different sampling rates. He found that the adaptive delta modulator outperformed logarithmic PCM at bit-rates lower than 40 kbits/sec. The S/Q ratios obtained were

Sampling rate in kHz    20    40    60
S/Q ratio in db         18    28    34

The subjective optimum differed from these values and occurred for P = 1.2 [5].

2.2 An Adaptive Predictive Coder Using Pitch and Formant Structure

Any efficient source encoder relies on the statistical properties of the signal source. The more the available channel capacity is restricted, the more signal redundancy has to be removed, and therefore the better the signal characteristics have to be known. Unfortunately, the statistics of voice signals are non-stationary.
However, the voiced parts of the speech signal contain most of the signal energy and also show the highest correlation between the signal samples [7]. Therefore any redundancy reduction techniques applied to these parts will be most effective. The main sources of redundancy in voiced speech parts are the quasi-periodic nature of the signal and the shape of the spectral envelope. The latter is also called the formant structure and has its origin in acoustic resonances of the vocal tract and the vocal cord. Over short periods of time, such as for the duration of a vowel, the physical shape of the vocal tract hardly changes. Consequently, the formant structure may be considered quasi-stationary. This implies that in order to be optimum, the redundancy reduction process has to be adapted to the momentarily stationary character of speech. Figure 2.3 shows such an adaptive predictor [6].

Figure 2.3 Adaptive predictive coder [6].

The input signal $S_n$ is a sample of a bandlimited voice signal. The first loop takes care of the quasi-periodic nature of speech and consists merely of a delay and a gain adjustment. It estimates the present signal value from the value one pitch period before. The second loop forms a linear combination of $K$ past values of the first loop output, removing formant information from the spectral envelope [6], [3].

Ideally, the predictor error should be minimized by optimizing both loops at the same time. However, this procedure proves mathematically very difficult.
Instead, a suboptimum solution can be found by treating the two loops separately. The coefficients for the first predictor $P_1(z)$ are obtained by minimizing the output power of the first loop during a certain period of time $N \cdot T$ for which the predictor is to be optimum. The interval $N \cdot T$ is called the learning period of the system. The output power is related to the sum of the squares of the first loop output, that is,

$$\sum_{i=1}^{N} D_i^2 = \sum_{i=1}^{N} \left(S_i - \beta S_{i-M}\right)^2. \tag{2.4}$$

The minimum is given by

$$\beta_{opt} = \frac{\sum_{i=1}^{N} S_i S_{i-M}}{\sum_{i=1}^{N} S_{i-M}^2}, \qquad M = M_{opt}, \tag{2.5}$$

where $M_{opt}$ maximizes the normalized correlation

$$\rho(M) = \frac{\sum_{i=1}^{N} S_i S_{i-M}}{\left[\sum_{i=1}^{N} S_i^2 \cdot \sum_{i=1}^{N} S_{i-M}^2\right]^{1/2}}, \qquad M > 0. \tag{2.6}$$

This optimization involves considerable computation for each recalculation of the parameter $\beta$.

The second loop forms a linear estimate of the difference signal using $K$ previous samples, that is,

$$\hat{D}_n = \sum_{k=1}^{K} \alpha_k D_{n-k}. \tag{2.7}$$

The mean-square error (MSE) of the predictor output is given by

$$[e_n^2]_{av} = [(D_n - \hat{D}_n)^2]_{av}, \tag{2.8}$$

where $[x]_{av}$ stands for the sample mean of $x$ over the learning period $N \cdot T$. Thus

$$[x]_{av} = \frac{1}{N} \sum_{i=1}^{N} x_i.$$

The MSE can be minimized by setting the partial derivatives of $[e_n^2]_{av}$ with respect to $\alpha_j$ equal to zero, that is,

$$\frac{\partial [e_n^2]_{av}}{\partial \alpha_j} = -2 \left[\left(D_n - \sum_{k=1}^{K} \alpha_k D_{n-k}\right) D_{n-j}\right]_{av} = 0, \qquad j = 1, \ldots, K. \tag{2.9}$$

Equation (2.9) yields $K$ simultaneous linear equations for the parameters $\alpha_1$ to $\alpha_K$, which can be solved using well known computational procedures [8]. Since there are typically three formants and one vocal cord resonance in speech lowpass filtered at 4 kHz, Atal and Schroeder set $K$ equal to 8.
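The two separate optimizations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in [6]: the function names `pitch_predictor` and `two_tap_predictor` are hypothetical, the lag search range is passed in by the caller, and the second loop is solved only for K = 2 by Cramer's rule to keep the example short (Atal and Schroeder used K = 8). The common factor 1/N of the sample means cancels and is omitted.

```python
import math


def pitch_predictor(s, start, n_samples, m_min, m_max):
    """Return (M_opt, beta_opt) for the learning period s[start:start+n_samples].

    M_opt maximizes the normalized correlation (2.6); beta_opt follows (2.5).
    Requires start >= m_max so that s[i - M] is always defined.
    """
    window = range(start, start + n_samples)
    energy = sum(s[i] ** 2 for i in window)
    best = (m_min, 0.0, -math.inf)                 # (M, beta, rho)
    for M in range(m_min, m_max + 1):
        cross = sum(s[i] * s[i - M] for i in window)
        delayed_energy = sum(s[i - M] ** 2 for i in window)
        if energy == 0.0 or delayed_energy == 0.0:
            continue                               # silent window, no estimate
        rho = cross / math.sqrt(energy * delayed_energy)
        if rho > best[2]:
            best = (M, cross / delayed_energy, rho)
    return best[0], best[1]


def two_tap_predictor(d, start, n_samples):
    """Solve the normal equations (2.9) for K = 2; requires start >= 2."""
    window = range(start, start + n_samples)

    def r(j, k):
        # Unnormalized sample correlation of d[n-j] and d[n-k].
        return sum(d[n - j] * d[n - k] for n in window)

    r11, r12, r22 = r(1, 1), r(1, 2), r(2, 2)
    r01, r02 = r(0, 1), r(0, 2)
    det = r11 * r22 - r12 * r12
    alpha1 = (r01 * r22 - r02 * r12) / det
    alpha2 = (r02 * r11 - r01 * r12) / det
    return alpha1, alpha2


if __name__ == "__main__":
    # Synthetic test signal (an assumption): period 100 samples, amplitude
    # shrinking by a factor 0.8 per period.
    signal = [0.8 ** (i / 100) * math.sin(2 * math.pi * i / 100) for i in range(400)]
    print(pitch_predictor(signal, start=150, n_samples=200, m_min=20, m_max=146))
```

For a signal that repeats every 100 samples with its amplitude scaled by 0.8 per period, the search should return a lag of 100 and a gain close to 0.8. The nested sums make the cost per learning period grow with the product of window length and lag range, which illustrates the computational burden the text refers to.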
In their system, the difference between the predictor output and the true value of the signal was transmitted using a one-bit adaptive quantizer. To minimize the quantization noise, the quantizer level and the predictor parameters $\beta$, $M$, and $\alpha_1$ to $\alpha_8$ were readjusted every 5 msec. The computational effort involved in the optimization of the coder is high, but this effort was rewarded by a very good system performance. The authors report subjective comparisons with logarithmic PCM that rate the quality of their coder only slightly inferior to the quality of a 6-bit log PCM*. Both encoders operated at a sampling frequency of 6.67 kHz. No attempt was made to quantize the predictor parameters, but the authors' conjecture is that an additional 3 kbits/sec will suffice to transmit all the parameters.

*In another simulation of this scheme by Cummiskey [5], these results could not be duplicated.

III. THE ADAPTIVE PREDICTIVE CODER WITH A SELF-ADJUSTING QUANTIZER LEVEL

This section describes in detail the adaptive delta modulator that was simulated on the IBM 370/168 at the University of British Columbia. The results of the simulations are discussed in chapter V.

The scheme implemented combines the two main features of the coders outlined in the previous chapter. Jayant's system is attractive because of its simplicity of implementation. However, the incoming signal has to be highly oversampled in order to be reproduced at the receiver output with an acceptable quality.
The other coder yields very good results at a low sampling rate, but requires a large number of computations to remove enough signal redundancy.

One objective of this project was to keep the transmission bit-rate of the encoder as low as possible, preferably below 10 kbits/sec. Lowpass filtered speech of telephone quality must be sampled at about 7 kHz. Consequently, in order not to exceed the intended maximum bit-rate, only one bit per signal sample may be transmitted (adaptive delta modulation, ADM). The remaining 3 kbits/sec are required to transmit the predictor parameters. The desired rate of 10 kbits/sec or less makes efficient redundancy reduction a necessity. For this reason, it was decided after some preliminary simulations of a simpler scheme that the pitch information should also be extracted by the predictor. Consequently, the coder complexity increased considerably.

Some modifications were made to reduce the number of computations required. The second loop was simplified by using only the previous two difference signals $D_{n-1}$ and $D_{n-2}$ instead of the previous eight as in [6] (see also Fig. 2.3). Of course this modification affects the amount of formant information removed by the predictor. However, other studies report only minor improvements when the number of predictor taps is increased from two to eight ([12], p. 130, [3]).
Furthermore, to avoid the computations required to optimize the quantizer level for each new interval for which all parameters are recalculated, Jayant's exponential self-adjustment of the level was adopted. This modification also has the advantage that it is unnecessary to transmit parameters to readjust the quantizer level in the receiver. Another modification concerns the first loop; it was found that by inserting a very simple filter the speech quality could be improved noticeably.

Figure 3.1 shows a detailed block diagram of the coder simulated. Note that in order to maintain synchronous tracking, the receiver has to be the exact inverse of the transmitter. The state shown reflects the contents of all registers after the $n$-th sample has entered the coder and the corresponding channel symbol $C_n$ has been transmitted. The system is ready to process the next sample $S_{n+1}$. The time delays $T$ equal to $1/f_s$, where $f_s$ is the sampling frequency, are included to clarify the sequence of calculations.

3.1 The First Loop: Reduction of the Pitch Redundancy

The gain coefficient $\beta$ and the delay parameter $M$ were calculated according to (2.5) and (2.6). In order to fit the pitch of different speakers, $M$ was given a range of 20 to 146. This choice was also influenced by the fact that $M$ has to be transmitted to the receiver over a digital channel. It is therefore convenient to let the range of $M$ be a power of 2, minus one.
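The remark on the range of $M$ can be checked with a short sketch. The range 20 to 146 contains 127 = 2^7 - 1 values, so a 7-bit word covers it with one code word to spare. The offset coding below, the function names `encode_lag` and `decode_lag`, and the choice to leave the all-zero word unused are assumptions made for illustration; the thesis does not state how $M$ is actually coded.

```python
M_MIN, M_MAX = 20, 146            # lag range given in the text
N_BITS = 7                        # 2**7 - 1 = 127 lag values


def encode_lag(M):
    """Map a lag M to a 7-bit word 0000001..1111111 (assumed offset code)."""
    if not M_MIN <= M <= M_MAX:
        raise ValueError(f"lag {M} outside {M_MIN}..{M_MAX}")
    return format(M - M_MIN + 1, f"0{N_BITS}b")


def decode_lag(word):
    """Inverse of encode_lag; the all-zero word is left unused in this sketch."""
    value = int(word, 2)
    if not 1 <= value <= M_MAX - M_MIN + 1:
        raise ValueError(f"word {word!r} does not encode a lag")
    return value + M_MIN - 1
```

Every lag in the range survives an encode/decode round trip, and the number of distinct lags is exactly one less than the number of 7-bit words.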

Figure 3.1 Detailed block diagram of the Adaptive Predictive Delta Modulator

Since the purpose of the first predictor loop is to reduce the pitch redundancy, this portion should be operative only during voiced stretches of the input. A switch that sets the new prediction in Fig. 3.1 to zero during unvoiced or quiet periods was added to the loop. The decision voiced/unvoiced is based on the zero-crossing rate of the input signal for the previous learning period. The noise-like, unvoiced fricatives have a much higher zero-crossing rate than the voiced sounds. During pauses in the input speech $\beta$ should be set to zero to minimize idle noise generation by the coder itself. This is done by observing the input energy level during the previous learning period. If said energy drops below a certain level, $\beta$ is set to zero. Even if the input signal energy is small but finite, for example if background noise is present during speech pauses, setting $\beta$ to zero is acceptable since such signals are not periodic. However, if the receiver were switched off completely, background noise would be absent during short speech pauses, giving the listener the impression of having been cut off from the voice circuit. The simulations showed that both the zero-crossing rate and the energy level were fairly noncritical and could be varied over a wide range without affecting the performance (see section 5.3.5). The coder complexity is not severely increased by the above modifications. A hardware implementation of the switch is quite simple.
The zero-crossings can be detected and counted using a limiter followed by a resettable counter. A rectifier and a resettable integrator are all that are required to determine the energy level.

The second addition to the first predictor loop is the digital filter G preceding the register H. The filter smooths the estimate of the input before it is stored in the pitch register H. Signal $\hat{S}_n$ is identical with the receiver output and therefore contains the transmitter input signal distorted by some quantization noise added by the coder. Figure 3.2 shows the relative spectral levels of the input signal ("Joe took father's shoe bench out") and the quantization noise at the receiver output. The spectrum of the noise is, as expected, flatter than that of the input voice signal. Because only a small portion of the entire noise energy lies outside the input signal bandwidth (0 - 3740 Hz), filtering does not improve the signal-to-quantization noise ratio considerably. Nevertheless, other considerations that will be explained in the next paragraph made it advisable to insert a filter into the first loop.

Figure 3.2 Relative spectral levels of a test sentence and the associated quantization noise (horizontal axis: frequency in kHz)

The coder approximates the input signal by a staircase with a variable stepsize. If the input contains some fairly rapid changes the approximation is usually much more coarse than for a slowly varying input.
The output $D_n$ of the first loop then oscillates even more rapidly and, compared to quieter periods, at a relatively high amplitude. Since the stepsize is readjusted only for the next prediction, the feedback of the second loop contains a delay, and the predictor cannot keep up with the fast input variations. In these cases it was observed that the stepsize was decreased instead of increased, making the tracking even worse. This situation can be improved only by reducing the error amplitude superimposed on the input signal of the pitch register H; this is the purpose of the filter G.

In realizing filter G, a recursive Chebyshev lowpass filter of fourth order with a cutoff frequency of 3700 Hz was first tried, followed by a Butterworth filter of the same type. The results were not very encouraging, in that the S/Q ratio at the receiver output improved by much less than one dB. The reason for this is that these filters have an unfavourable impulse response. Their transfer functions are of the form

$$G(s) = \frac{1}{\prod_i \left[(s + a_i)^2 + b_i^2\right]} \tag{3.1}$$

with impulse response

$$g(t) = \sum_i c_i \, e^{-a_i t} \sin(b_i t). \tag{3.2}$$

The response is exponentially decaying, but has some overshoot due to the terms $\sin(b_i t)$. Although the filtered signal will contain less noise, the nature of this noise will be altered. The negative overshoot of the response to a positive input pulse might, for example, add to the negative response to a negative input pulse. Therefore a filter of this type does not really improve the output of the first loop in the sense of making it easier for the second loop to track.
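To make the overshoot argument concrete, the following minimal sketch samples a single term of (3.2). The decay constant, the oscillation frequency and the 8 kHz sampling rate are illustrative assumptions chosen for this example; they are not the parameters of the Chebyshev or Butterworth designs that were actually tried.

```python
import math

def damped_sine_response(a, b, fs, n_samples):
    """Samples of g(t) = exp(-a t) * sin(b t), one term of (3.2), at t = n / fs."""
    return [math.exp(-a * n / fs) * math.sin(b * n / fs) for n in range(n_samples)]

# Hypothetical values: decay a = 1000 1/s, oscillation close to the 3700 Hz cutoff.
g = damped_sine_response(a=1000.0, b=2 * math.pi * 3700.0, fs=8000.0, n_samples=16)

# A positive pulse produces a response that also swings below zero.
has_overshoot = min(g) < 0.0
```

Because the response changes sign, the contributions of neighbouring pulses of opposite polarity can reinforce each other, which is the effect described above.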
A filter with an exponentially decaying impulse response and no overshoot seems to be the answer. Such a filter is shown in Figure 3.3. The multiplicative factor a is arbitrary so long as it is less than unity. Experimentally it was found that best results were obtained for a = 0.5.

Figure 3.3 Configuration and impulse response to a pulse of height 1 at t=0 of a filter with an exponentially decreasing impulse response and no overshoot

The transfer function of the filter in Fig. 3.3 is readily found using z-transforms:

$$Y(z) = \tfrac{1}{2}\left[Y(z)\,z^{-1} + X(z)\right]. \tag{3.3}$$

Hence,

$$Y(z) = \frac{1}{2 - z^{-1}}\,X(z) = H(z)\,X(z). \tag{3.4}$$

To find the frequency response, let $z = e^{j\omega T}$, in which case

$$H(\omega) = \frac{1}{2 - e^{-j\omega T}}. \tag{3.5}$$

The response for frequencies between 0 and 4 kHz is shown in Fig. 3.4. The attenuation curve is almost sinusoidal and has a maximum of 9.5 dB.

Figure 3.4 Frequency response of the filter in Figure 3.3 (abscissa: frequency in kHz)

The filter is explained best as a device that averages the input over a short period of time. Rewriting (3.3) in the time domain, we find

$$y(nT) = \tfrac{1}{2}\left[y(nT - T) + x(nT)\right]. \tag{3.6}$$

The new filter output is the mean of the old output and the new input. Therefore, no overshoot is possible. Figure 3.5 illustrates this property.

Figure 3.5 Input x(nT) and output y(nT) of the filter in Fig. 3.3

Because of the time lag introduced by any physically realizable lowpass filter, the filtered output is delayed even more with respect to the true signal than is the input. A time shift by T to the left would result in a much better fit.
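The averaging filter of (3.6) and its frequency response (3.5) can be checked numerically with a short sketch. The function names are hypothetical, and the 8 kHz sampling rate is an assumption, so that 4 kHz corresponds to half the sampling rate.

```python
import cmath
import math

def smooth(x, a=0.5):
    """y(nT) = a * [y(nT - T) + x(nT)]; a = 0.5 gives equation (3.6)."""
    y, prev = [], 0.0
    for sample in x:
        prev = a * (prev + sample)
        y.append(prev)
    return y

def attenuation_db(f_hz, fs_hz):
    """Attenuation in dB of H(w) = 1 / (2 - exp(-j w T)), equation (3.5)."""
    h = 1.0 / (2.0 - cmath.exp(-2j * math.pi * f_hz / fs_hz))
    return -20.0 * math.log10(abs(h))

# Response to a unit pulse: every value is positive and half the previous one.
impulse_response = smooth([1.0, 0.0, 0.0, 0.0])

# Attenuation at half the (assumed) 8 kHz sampling rate, where |H| = 1/3.
max_attenuation = attenuation_db(4000.0, 8000.0)
```

The impulse response never changes sign, and the attenuation at 4 kHz evaluates to 20 log10(3), roughly 9.5 dB, in agreement with the maximum quoted for Fig. 3.4.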
The benefit of such a time shift was also verified experimentally. Inserting this filter into the first loop and replacing the delay parameter M by M-1, to take into account the above mentioned forward time shift, resulted in a S/Q ratio improvement of 1 to 2 dB. The subjective quality also improved noticeably (see section 5.3.1).

Another filter using one previous sample only is given by

$$y(nT) = \tfrac{1}{2}\left[x(nT - T) + x(nT)\right]. \tag{3.7}$$

This filter and its impulse response are shown in Figure 3.6.

Figure 3.6 Filter with a rectangular impulse response

This filter and several others whose outputs depended on more than just one previous sample were tried, but none of them gave better results than the one described by (3.6). Generally, filters using more than one previous sample have a smaller but more slowly decaying impulse response, i.e. they average the input signal over a longer period of time. Such heavy lowpass filtering reduces the pitch redundancy reduction achieved by the first predictor loop, and is therefore not desirable.

3.2 The Second Loop: Reduction of the Formant Redundancy

The second loop produces a prediction $P_{2,n}$ of the output of the first loop, $D_n$. The error $e_n$ is then quantized to two levels, encoded, and transmitted as a channel symbol. The quantizer level logic is the same as described in Figure 2.1. The two coefficients $\alpha_1$ and $\alpha_2$ are calculated using (2.9) with K equal to two. From (2.9) we find

$$\frac{\partial [e_n^2]_{av}}{\partial \alpha_1} = -2\left[\left(D_n - (\alpha_1 D_{n-1} + \alpha_2 D_{n-2})\right) D_{n-1}\right]_{av} = 0, \tag{3.8a}$$

and

$$\frac{\partial [e_n^2]_{av}}{\partial \alpha_2} = -2\left[\left(D_n - (\alpha_1 D_{n-1} + \alpha_2 D_{n-2})\right) D_{n-2}\right]_{av} = 0, \tag{3.8b}$$

where $[x]_{av}$ denotes the sample mean of x over the learning period N·T. Equations (3.8a) and (3.8b) can now be solved for $\alpha_1$ and $\alpha_2$. If K previous samples are used, K linear equations for $\alpha_1$ to $\alpha_K$ are obtained from (2.9). They can be rewritten in matrix notation as

$$\Delta \cdot A = \Theta \tag{3.9}$$

where $\Delta$ is a K by K covariance matrix with its (ij)th element given by

$$\delta_{ij} = [D_{n-i} \cdot D_{n-j}]_{av} = \frac{1}{N} \sum_{n=1}^{N} D_{n-i} \, D_{n-j}, \tag{3.10}$$

A is a column vector of dimension K containing the coefficients $\alpha_1$ to $\alpha_K$, and $\Theta$ is a correlation vector of dimension K whose j-th component $\theta_j$ is given by

$$\theta_j = [D_n \cdot D_{n-j}]_{av} = \frac{1}{N} \sum_{n=1}^{N} D_n \, D_{n-j}. \tag{3.11}$$

If (3.9) is solved for A using matrix inversion, a solution is impossible if the matrix $\Delta$ is singular. In this case, a unique solution can always be forced by adding a small quantity to each diagonal element of $\Delta$. For larger values of K, one should take advantage of the symmetry of $\Delta$ and apply methods that require less computation and that do not involve matrix inversion [8]. It should also be noted that ultimately the coefficients will have to be quantized, which encourages the use of iterative techniques to solve (3.9).

The main problem of the second loop is its stability. It is desirable that a feedback system be fully controllable, that is, that each subsystem be stable. In particular, stability is a necessity for the recursive filter in the second loop.
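Before turning to stability, the coefficient calculation of (3.9) to (3.11) for K = 2 can be sketched as follows. This is a minimal illustration rather than the simulation program used in the thesis: the function name, the choice of summation range, the singularity test and the size of the diagonal loading term are all assumptions.

```python
def predictor_coefficients(d, loading=1e-9):
    """Solve the 2-by-2 system of (3.9) for (alpha_1, alpha_2) from samples d."""
    idx = range(2, len(d))          # samples that have two predecessors
    n = len(d) - 2
    d11 = sum(d[k - 1] * d[k - 1] for k in idx) / n   # delta_11
    d12 = sum(d[k - 1] * d[k - 2] for k in idx) / n   # delta_12 = delta_21
    d22 = sum(d[k - 2] * d[k - 2] for k in idx) / n   # delta_22
    t1 = sum(d[k] * d[k - 1] for k in idx) / n        # theta_1
    t2 = sum(d[k] * d[k - 2] for k in idx) / n        # theta_2
    det = d11 * d22 - d12 * d12
    if abs(det) <= 1e-12 * d11 * d22:
        # (Nearly) singular matrix: add a small quantity to each diagonal element.
        extra = loading * max(d11, d22) or loading
        d11 += extra
        d22 += extra
        det = d11 * d22 - d12 * d12
    alpha1 = (t1 * d22 - t2 * d12) / det
    alpha2 = (t2 * d11 - t1 * d12) / det
    return alpha1, alpha2

# A sequence that obeys d[k] = 1.2 d[k-1] - 0.5 d[k-2] exactly.
seq = [1.0, 0.5]
for _ in range(38):
    seq.append(1.2 * seq[-1] - 0.5 * seq[-2])
alpha1, alpha2 = predictor_coefficients(seq)
```

For a sequence generated by an exact second-order recursion, the prediction error is zero and the recovered coefficients match the generating ones up to rounding. An all-zero learning period exercises the singular branch and yields zero coefficients.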
Figure 3.7 shows this part of the coder again, where X(z) and Y(z) correspond to $L_{n+1}$ and $P_{2,n+1}$, respectively (see Figure 3.1). The dashed line in Figure 3.1 is omitted and will be discussed later.

Figure 3.7 The recursive filter in the second loop

The requirements for stability can be derived easily using z-transforms. The system equation is

$$Y(z) = \alpha_1\left[X(z) + Y(z)\,z^{-1}\right] + \alpha_2 z^{-1}\left[X(z) + Y(z)\,z^{-1}\right].$$

Hence,

$$Y(z) = \frac{\alpha_1 + \alpha_2 z^{-1}}{1 - \alpha_1 z^{-1} - \alpha_2 z^{-2}}\,X(z),$$

or

$$Y(z) = \frac{\alpha_1 z^2 + \alpha_2 z}{z^2 - \alpha_1 z - \alpha_2}\,X(z) = H(z)\,X(z). \tag{3.12}$$

In order for the filter to be stable, its transfer function H(z) has to be finite for all $z = e^{j\omega T}$, that is, the poles $z_i$ of H(z) have to lie within the unit circle. Thus, we require

$$|z_i| \le 1 - \varepsilon, \qquad \varepsilon > 0. \tag{3.13}$$

The poles $z_i$ are

$$z_{1,2} = \alpha_1/2 \pm \left[\alpha_1^2/4 + \alpha_2\right]^{1/2} = r\,e^{\pm j\sigma}, \tag{3.14}$$

where $\sigma = \arg(z_1)$ and

$$r = |z_{1,2}| = \left[\alpha_1^2/4 + \left|\alpha_1^2/4 + \alpha_2\right|\right]^{1/2}. \tag{3.15}$$

If the poles $z_i$ are complex, $\alpha_1^2/4 + \alpha_2 < 0$, and

$$\left|\alpha_1^2/4 + \alpha_2\right| = -\left(\alpha_1^2/4 + \alpha_2\right). \tag{3.16}$$

Using (3.16) in (3.15) and observing (3.13), we obtain the following bound for $\alpha_2$:

$$\alpha_2 \ge -(1 - \varepsilon)^2. \tag{3.17}$$

Another restriction is found by selecting real poles $z_{1,2}$ such that $0 \le z_{1,2} \le 1 - \varepsilon$. From (3.14) and (3.13), we obtain

$$0 \le \alpha_1/2 \pm \left[\alpha_1^2/4 + \alpha_2\right]^{1/2} \le 1 - \varepsilon. \tag{3.18}$$

Re-arranging (3.18), we get two equations

$$\left[\alpha_1^2/4 + \alpha_2\right]^{1/2} \le 1 - \varepsilon - \alpha_1/2, \tag{3.19a}$$

and

$$-\left[\alpha_1^2/4 + \alpha_2\right]^{1/2} \le 1 - \varepsilon - \alpha_1/2. \tag{3.19b}$$

Since both sides in (3.19a) are positive, (3.19a) may be squared. After reordering, the following bound on $\alpha_1$ and $\alpha_2$ is found:

$$(1 - \varepsilon)\,\alpha_1 + \alpha_2 \le (1 - \varepsilon)^2 \approx 1 - 2\varepsilon. \tag{3.20a}$$

The third constraint is obtained by assuming real poles $z_{1,2}$ such that $-(1 - \varepsilon) \le z_{1,2} \le 0$ and following the same steps as above.
The result is

$$-(1 - \varepsilon)\,\alpha_1 + \alpha_2 \le (1 - \varepsilon)^2 \approx 1 - 2\varepsilon. \tag{3.20b}$$

A geometrical interpretation of (3.17), (3.20a), and (3.20b) is shown in Figure 3.8a. The dashed lines represent the boundaries of the stability range for $\alpha_1$ and $\alpha_2$ as given by (3.17), (3.20a), and (3.20b). Points P initially outside the triangle are moved in a straight line towards the origin, terminating on the stability boundary. The solid lines of the outer triangle are obtained for ε = 0.

Figure 3.8 a) Stability range for the coefficients $\alpha_1$ and $\alpha_2$ (regions of real and complex poles marked) b) Truncated stability range

The value of ε can be found as follows. The magnitude of the filter transfer function H(z) is

$$|H(z)| = \frac{\left|\alpha_1 z^2 + \alpha_2 z\right|}{|z - z_0| \cdot |z - z_0^*|},$$

where $z_0$ and $z_0^*$ are complex conjugate poles as given by (3.14). The denominator can be interpreted geometrically (see Figure 3.9) as the product of the lengths of the vectors $(z - z_0)$ and $(z - z_0^*)$.

Figure 3.9 Geometrical interpretation of the denominator of |H(z)|

The minimum of this product corresponds to the maximum of |H(z)| and occurs for $z = e^{j\sigma}$. For this value of z, the length of the vector $(z - z_0)$ is the shortest possible, and equal to ε.

The coefficients $\alpha_1$ and $\alpha_2$ are calculated such that the filter transfer function H(z) approximates the spectral envelope of the input signal. In order to avoid some additional degradation of the reconstructed speech, the minimum bandwidth of the filter transfer function should be restricted to a value comparable to that of the bandwidth of spectral peaks of the input signal.
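The stability triangle defined by (3.17), (3.20a), and (3.20b), together with the rule of moving outside points in a straight line towards the origin, can be sketched as below. The function names are hypothetical. The bandwidth helper uses the common approximation B = -(fs/π) ln r for a pole of radius r, and the 8 kHz sampling rate in the example is an assumption.

```python
import cmath
import math

def stabilize(alpha1, alpha2, eps=0.024):
    """Scale (alpha1, alpha2) towards the origin until all three bounds hold."""
    rho = 1.0 - eps
    bound = rho * rho
    # Left-hand sides of (3.17), (3.20a), (3.20b), each written as "value <= bound".
    worst = max(-alpha2, rho * alpha1 + alpha2, -rho * alpha1 + alpha2)
    if worst <= bound:
        return alpha1, alpha2
    scale = bound / worst
    return alpha1 * scale, alpha2 * scale

def poles(alpha1, alpha2):
    """Roots of z^2 - alpha1 z - alpha2, as in (3.14)."""
    root = cmath.sqrt(alpha1 * alpha1 / 4.0 + alpha2)
    return alpha1 / 2.0 + root, alpha1 / 2.0 - root

def min_bandwidth_hz(eps, fs_hz):
    """Approximate bandwidth of a resonance whose pole radius is 1 - eps."""
    return -fs_hz / math.pi * math.log(1.0 - eps)

a1, a2 = stabilize(2.0, -0.5)          # real poles, initially outside the triangle
b1, b2 = stabilize(0.0, -1.5)          # complex poles, initially outside the triangle
bandwidth = min_bandwidth_hz(0.024, 8000.0)
```

Because all three bounds are linear in the coefficients and the triangle contains the origin, a single scale factor is enough to place an outside point exactly on the violated boundary. With ε = 0.024 and the assumed 8 kHz rate, the bandwidth helper returns a value of roughly 60 Hz.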
Analysis of the spectral envelope of voice signals shows that the bandwidth of the first formant usually lies between approximately 40 and 70 Hz ([1], p. 182). In our simulation

$$\varepsilon = 0.024 \tag{3.21}$$

was selected, corresponding to a minimum bandwidth of H(z) of 60 Hz. Other bandwidth restrictions of H(z) can be imposed easily by a suitable choice of ε.

Since the above stability analysis neglects the nonlinearity contained in the second loop, (3.17), (3.20a,b), and (3.21) are not sufficient conditions to guarantee stability for this loop. In fact, it was observed that some instabilities were still present. Figure 3.10 shows this loop again. The quantizer level calculator is represented by the nonlinear part NL.

Figure 3.10 The nonlinearity in the second loop. The block NL computes L(nT) = A·L(nT-T), with A = +P if sgn[e'(nT)·e'(nT-T)] = +1 and A = -Q if sgn[e'(nT)·e'(nT-T)] = -1.

This kind of nonlinearity with memory makes an analytical solution very difficult. Experimentally, it was found that the system was stable for all coefficients α restricted to the truncated stability region shown in Figure 3.8b by dashed lines. Negative values of α were set to zero, while points outside the area in the first quadrant were again moved on a straight line through the origin as shown. The unfortunate result was an almost complete loss of fricatives in the reconstructed speech.
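The quantizer level rule of the block NL in Figure 3.10 can be sketched as a single update function. The function name is hypothetical. The default multipliers and the level limits of 1 and 65 units are illustrative values, and treating a zero error as positive is an assumption of this example.

```python
import math

def next_step(step, err, prev_err, P=1.15, Q=1.0 / 1.15, lo=1.0, hi=65.0):
    """Signed level L(nT) = A * L(nT - T): A = +P for equal error signs, -Q otherwise."""
    same_sign = (err >= 0) == (prev_err >= 0)
    factor = P if same_sign else -Q
    new = factor * step
    # Keep the magnitude of the level between the lower and upper limits.
    magnitude = min(max(abs(new), lo), hi)
    return math.copysign(magnitude, new)

grown = next_step(10.0, 1.0, 1.0)       # same sign: level grows by the factor P
flipped = next_step(10.0, -1.0, 1.0)    # sign change: level shrinks and reverses
capped = next_step(60.0, 2.0, 3.0)      # growth is limited to the upper level
floored = next_step(1.0, -1.0, 1.0)     # shrinking is limited to the lower level
```

The memory in this rule (the previous level and the previous error sign) is what makes the loop hard to treat with the linear analysis above.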
Some negative coefficients, needed to reproduce these sounds, gave rise to instabilities; attempts were made to exclude such coefficient values, but no completely satisfactory solution could be found. By limiting the quantizer level to a maximum of 1/8 of the peak signal amplitude the overall error could be reduced considerably, but the larger errors during the instabilities appear as quite audible clicks.

Common to all instabilities was the phenomenon that the coefficients $\alpha_1$ and $\alpha_2$ had values such that the prediction $P_2$ could never reach the input signal D even though the quantizer level was steadily increasing in magnitude. To avoid this situation, a more direct feedback of the quantizer level was introduced into the second loop, as shown by the dashed line in Figure 3.1. Now, since the newly calculated stepsize always has the sign of the difference between the input signal and its prediction, and is increasing exponentially, this additional feedback must eventually catch any run-away situation. How long it takes to reach that condition depends upon the values of $\alpha_1$ and $\alpha_2$ as well as on the feedback coefficient γ. During this time, the error might still build up considerably, but at least it cannot grow without bound. Although adding this direct feedback to the simulation program did not improve the S/Q ratio, the subjective impression of the reproduced speech was considerably better (see section 5.3.2).
It was also observed that the reproduction of fricative sounds became worse again with increasing values of the feedback coefficient γ.

IV. ON MEASURING AND COMPARING SPEECH QUALITY

In the introduction it was mentioned that the information content of speech is not known exactly, and that high intelligibility of a voice system does not necessarily mean that the subjective quality rating will be equally high. Sound perception is very complex and not yet fully understood. It is not possible to establish a mathematically defined, absolute quality measure that is equivalent to subjective quality evaluation.

To evaluate the performance of a system under development using subjective listening tests exclusively is too tedious. It seems intuitively satisfying that the closer the waveforms of the generated replica are to the true signal, the better the subjective quality will be. A convenient measure for this difference is the mean square error (MSE). The relation between MSE and subjectively rated quality has been studied elsewhere [9], [5]. These results show that the MSE is a useful quality measure, provided that the nature of the error between the original and the regenerated signal is the same for all the speech samples compared. For another type of error the subjective degradation might be very different, even though the MSE is the same. In any delta modulation system, both quantization and slope overload contribute to the error. Usually these two degradations are difficult to analyze separately.
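As an illustration of an MSE-based measure, the sketch below computes a signal-to-quantization-noise ratio in dB from an original and a reconstructed sample sequence. The exact definition used in the simulations is not restated here, so the formula (signal energy divided by error energy) and the function name are assumptions.

```python
import math

def sq_ratio_db(original, reconstructed):
    """10 log10 of signal energy over error energy, for equally long sequences."""
    signal = sum(s * s for s in original)
    noise = sum((s - r) ** 2 for s, r in zip(original, reconstructed))
    if noise == 0:
        return math.inf          # perfect reconstruction
    return 10.0 * math.log10(signal / noise)

# Signal energy 14, error energy 1.
example = sq_ratio_db([1.0, 2.0, 3.0], [1.0, 2.0, 2.0])
```

Such a number summarizes the total error energy only; it does not distinguish between granular noise and slope overload, which is why it cannot replace listening tests.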
Therefore, since a subjective test is always the ultimate quality measure if quality factors besides intelligibility are also tested, it is often convenient to compare the voice system under test with some standard signal that is easy to reproduce. N-bit logarithmic PCM as described by Smith [10] meets these requirements and was used in the present study as a reference signal for paired preference tests. Other standard quality rating methods are discussed in [11].

V. COMPUTER SIMULATION AND RESULTS

5.1 Preparation of the Data

Three sets of data were prepared, one for each of the sampling frequencies 8, 12, and 16 kHz. A set consists of the following two groups of two sentences: "Joe took father's shoe bench out. Should we chase those young outlaw cowboys?", and "We were away a year ago. May we all learn a yellow lion roar." The four sentences have a total length of 10.75 sec, and are believed to be reasonably representative of conversational speech. The first group of two sentences contains most of the voiceless phonemes of English speech, whereas the second pair of sentences consists of voiced sounds only. A description of the sounds of English speech of General American dialect can be found in Flanagan [1], chapter 2.2. Note that the theory on which the coder is based neglects voiceless sounds. By choosing two groups of sentences, the expected differences in performance could be studied more conveniently.

All the sentences were spoken by a 35-year old male university professor with a western Canadian accent.
The recordings were carried out in an anechoic chamber using an AKG D-200E dynamic microphone and a full-track Scully 280 tape recorder at a recording speed of 15 in/sec. Subsequently, the signal was lowpass filtered, sampled and digitized with an accuracy of 10 bits per sample and stored on digital magnetic tape. The analog tapes required for the listening tests were obtained by the reverse process.

5.2 Subjective Test Procedure

The tests were conducted in a quiet room. The same listeners, eleven male and one female university students, took part in all sessions; five had previous experience with listening tests. The test tapes were played back using the Scully 280 tape recorder and Sharpe HA-10-MK-II headphones, each equipped with an independent external volume control. Prior to a listening session, a few samples were played in order to permit the listeners to adjust their individual volume controls. Subsequently, no change in the loudness level was allowed.

The tests were based on preference only, and consisted of a series of pairs of utterances. For each pair of utterances presented, the subjects were asked to select the utterance they would prefer to listen to in a telephone conversation. The utterances of each pair were processed in different ways, but consisted of the same group of sentences. It is known that such tests exhibit a slight psychological bias for choosing the second utterance of a pair. For this reason the pairs were also played in reverse order.
Since each pair was played twice only, the total number of comparisons was too small to allow an arbitrary choice in case a listener was undecided as to which utterance to prefer. Therefore, six listeners were given the instruction to vote in favour of the first utterance in case they could not come to a decision. The other six were told to give their votes to the second utterance.

5.3 Results and Discussion

In this section the dependence of the S/Q ratio on the system parameters is studied. Only one parameter is varied at a time; all others are held at constant values. The standard set of parameters was

    Step multipliers                                P = 1.15, Q = 1/P
    Direct step feedback coefficient                γ = 0.2
    Maximum stepsize (quantizer level)              65 units
    Minimum stepsize (quantizer level)              1 unit
    Learning period NT                              5 msec
    Interval between re-calculation of the
      predictor parameters (frame length)           5 msec
    Switch threshold (zero-crossings)               15 per 5 msec
    Switch threshold (signal energy)                15,000 units per 5 msec
    Input signal amplitude range                    -511 ... +512 units
    RMS signal value for the first two sentences    66 units
    RMS signal value for the second two sentences   81 units

5.3.1 Variation of the Step Multiplier P

Figures 5.1a, c, and d show the calculated S/Q ratio at different sampling rates. The values for P = 1.0 were obtained in simulations with a fixed stepsize that was optimized in terms of MSE.

It is evident that the filter G added to the first predictor loop improved the performance. The gain is about 1 dB for the sentences containing unvoiced sounds, and approximately 1.5 dB for those consisting of voiced sounds only. In addition, a formal listening test confirmed a subjective improvement of the quality when the filter was used. The test was conducted for P = 1.15 and P = 1.05. All other parameters were set to their standard values. Between 69 and 71% of the listeners preferred the quality of the coder containing the filter G.

Figure 5.1 Dependence of the S/Q ratio on the step multiplier P (abscissa: step multiplier P, 1.0 to 2.0). Curves are shown with and without filter G for the sentences "We were away a year ago. May we all learn a yellow lion roar." and "Joe took father's shoe bench out. Should we chase those young outlaw cowboys."

In another listening test the relation between the subjective quality and the step multiplier P was measured. The subjective preference curve, Figure 5.1b, was obtained by comparing sentences processed with various coefficients P, ranging from 1.0 to 2.0, to a reference sentence generated with P = 1.15. In a previous, informal listening test, the reference sentence was selected among all processed sentences as the one that yielded the best subjective quality. The ordinate in Figure 5.1b indicates the percentage of votes in favour of the quality of the speech encoder for a given value of the coefficient P.
When comparing two reference sentences, each one should obtain the same number of votes (see last paragraph in section 5.2), corresponding to a subjective preference score of 100% in Figure 5.1b.

The calculated curves, Figures 5.1a, c, and d, do not show a very pronounced maximum. In particular, the S/Q ratio for a fixed step size is, on the average, about the same as for the best performance with a variable quantizer level. However, the subjective rating in Figure 5.1b shows a very distinct preference of step size multipliers P between 1.05 and 1.15 over all other values of P. In particular the quality achieved with a constant step size is judged very poor, which is an expected result. The quantization error is due to either overload distortion or granular noise, depending on whether the step size is too small or too large relative to the signal being processed. The adaptation follows a certain time-invariant rule and will therefore yield smaller errors for small input signals, and larger errors for large signals. Thus, what can be gained is an increase in dynamic range rather than an inherent S/Q ratio advantage over a nonadaptive scheme. The adaptive system would be preferred by a listener because the error is signal amplitude dependent.

The preference curve Figure 5.1b was obtained in a formal listening test for a sampling rate of 8 kHz. An informal listening test using a sampling frequency of 12 kHz revealed the same preference of step multipliers P between 1.05 and 1.15.
Therefore, Figure 5.1b is believed to be representative for several sampling rates. The subjectively best performance of Jayant's system was obtained for P = 1.2 [5], a value that coincides approximately with the values of P for best subjective quality of the present coder.

5.3.2 Variation of the Direct Feedback Coefficient γ

Figure 5.2a shows that no increase in the S/Q ratio is gained by adding the direct feedback of the step size to the second predictor loop. However, the preference curve in Figure 5.2b exhibits a distinct subjective preference for the quality achieved by all processors containing the direct feedback of the step size. Figure 5.2b was obtained in the same way as outlined for Figure 5.1b in section 5.3.1. With the exception of γ, which was set to 0.4, the reference sentence was generated using the standard set of parameters.

The subjective preference is attributed to a reduction of crackling noise in the reconstructed speech. It was also observed that the high-frequency content of the reproduced speech decreased as the feedback coefficient γ approached unity. If γ is set to 0.2 this lowpass effect is hardly noticeable; nevertheless, Figure 5.2b clearly shows a subjective preference for this value of γ over γ = 0. For these two reasons, the standard set of parameters contains the value 0.2 for the step-feedback coefficient, even though the subjective optimum occurs for γ = 0.4.

Figure 5.2 a) The S/Q ratio as a function of the coefficient γ b) Subjective preference as a function of the coefficient γ

5.3.3 Variation of the Frame Length

The interval between re-calculation of the predictor parameters is called the frame length of the coder. After each frame, the new predictor coefficients are transmitted to the receiver. The dependence of the S/Q ratio on the frame length is shown in Figure 5.3 for different sampling rates. As expected, the performance deteriorates when the readjustments occur less often.

Figure 5.3 Dependence of the S/Q ratio on the frame length, panels a), b) (12 kHz) and c) (16 kHz) (abscissa: interval between recalculation of parameters in msec). Curves are shown for the sentences "We were away a year ago. May we all learn a yellow lion roar." and "Joe took father's shoe bench out. Should we chase those young outlaw cowboys."

Individual phonemes have durations of about fifty to several hundred milliseconds. For most voiced sounds, the spectrum is nearly constant during this time, and the predictor coefficients need not be changed drastically. For other speech sounds, particularly during transitions from one phoneme to another, the spectrum may vary comparatively rapidly. For best performance, the coder must be able to adapt to these changes as quickly as possible.
However, since the transitions are relatively short compared to the duration of voiced sounds, there is not much to be gained by readjusting the predictor in intervals considerably shorter than the average duration of a transition (10 - 20 msec). This effect is reflected in the curves for the sentences consisting of voiced sounds only, which show that there is only little or no improvement if the predictor is readjusted in intervals shorter than 5 msec. In the sentences containing voiceless sounds, a much larger number of periods with changing signal statistics occur. Therefore, these curves show a further improvement for frame lengths shorter than 5 msec.

5.3.4 Variation of the Learning Period

Figure 5.4 shows the dependence of the S/Q ratio on the length of the learning period for a sampling rate of 8 kHz. Curves for higher sampling frequencies were not measured, but are expected to look the same. A learning period of 40 input samples (5 msec) seems to be sufficient to determine the short-term signal statistics with adequate accuracy. This corresponds to results previously obtained by Davisson [13].

Figure 5.4 The S/Q ratio as a function of the learning period for a sampling frequency of 8 kHz (abscissa: learning period in msec)

5.3.5 Variation of the Switch Threshold

Figures 5.5a and 5.5b indicate that the variation of the threshold for the switch in the first predictor loop (see Figure 3.1) has very little influence on the overall S/Q ratio.
However, it was found in some informal listening tests that introducing this switch removed some of the noise during quiet input periods and some of the distortion at the beginning of words, especially after such quiet periods.

Figure 5.5a shows the S/Q ratio as a function of the number of zero-crossings per 5 msec which causes the first loop to be disabled by opening the switch. If the threshold is set to zero, the pitch prediction is never in operation. Thus, the 1.5 dB increase of the S/Q ratio between the switch thresholds of 0 and 10 zero-crossings per 5 msec reflects the improvement achieved by the first predictor loop.

Figure 5.5 a) Dependence of the S/Q ratio on the switch threshold set by the number of zero-crossings (sampling rate 8 kHz); b) Dependence of the S/Q ratio on the switch threshold set by the signal energy (sampling rate 8 kHz)

Figure 5.5b shows the S/Q ratio versus the switch threshold which is dependent on the signal energy. A number corresponding to the signal energy is calculated by summing the squares of the signal samples during the last learning period. If this sum is less than the threshold, the switch in the pitch predictor is opened.

Both of the above curves were also calculated for sampling frequencies of 12 and 16 kHz, but are not shown because they were very similar to the above.
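The two switch criteria described in this section can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names, the default thresholds, the use of "greater than or equal" for the zero-crossing test, and the combination of both tests with a logical OR are hypothetical choices, not taken from the simulation program.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples (zeros ignored: assumption)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)


def signal_energy(frame):
    """Sum of the squares of the samples in the learning period."""
    return sum(x * x for x in frame)


def pitch_loop_enabled(frame, zc_threshold=10, energy_threshold=10000):
    """Return False when the switch in the first predictor loop should be opened.

    Assumed rule: open the switch if the zero-crossing count reaches the
    threshold, or if the energy falls below the energy threshold.
    """
    if zero_crossings(frame) >= zc_threshold:
        return False  # many zero-crossings: treated as voiceless
    if signal_energy(frame) < energy_threshold:
        return False  # low energy: treated as a quiet period
    return True
```

With a zero-crossing threshold of zero the first test always succeeds, so the pitch loop is never in operation, which agrees with the behaviour described for Figure 5.5a.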
5.3.6 Variation of the Product of the Step Multipliers P and Q

In Figure 5.6 the dependence of the S/Q ratio on the product of the quantizer level multipliers P and Q is shown. The maximum occurs for values of P·Q between 1.0 and 1.05.

Figure 5.6 S/Q ratio as a function of the product P·Q

Jayant's bound, which is

$(P \cdot Q)_{opt} = 1$,    (2.1)

(see section 2.1) cannot be applied exactly, because the assumption that step size adaptations using the multipliers P and Q are equally probable is not satisfied for the present coder. The upper and lower limits imposed on the quantizer level affect the probabilities for P-type and Q-type adaptations. Also, the direct step-feedback in the second loop tends to make the prediction of $D_n$ exceed the true value, and thus favours the use of the multiplier Q. Nevertheless, the requirement that the step size should not tend to increase beyond limits or to decay to zero still applies.

Jayant considers the ratio R(N) of the magnitudes of $L_{T'+N}$, the step size at the sampling instant T'+N, and $L_{T'}$, the step size at the sampling instant T'. Denoting the number of P-type and Q-type adaptations in the interval N by $N p_0$ and $N q_0$ respectively, Jayant writes [4]

$R(N) = |L_{T'+N}| / |L_{T'}| = P^{N p_0} \cdot Q^{N q_0} = (P^{p_0} \cdot Q^{q_0})^N$.

For $N \to \infty$, $p_0$ and $q_0$ tend to the probabilities $p_{opt}$ and $q_{opt}$ respectively, for an optimum adaptation. Thus,

$\lim_{N \to \infty} R_{opt}(N) = \lim_{N \to \infty} (P^{p_{opt}} \cdot Q^{q_{opt}})^N$.

It was mentioned above that the step size must not show a tendency to increase beyond limits or decay to zero. Therefore, the asymptotic ratio $R_{opt}(\infty)$ must be finite and non-zero. A necessary and sufficient condition is

$P^{p_{opt}} \cdot Q^{q_{opt}} = 1$.    (5.1)

Experimental results confirmed this requirement. The values $p_0$ and $q_0$ for the two sentences containing only voiced sounds were 0.400 and 0.600 respectively. The optimum predictor performance, in terms of S/Q ratio, was achieved for P·Q = 1.05, where P = 1.15 and Q = 0.913043 (see Figure 5.6). Application of the above condition yields

$1.15^{0.4} \cdot 0.913043^{0.6} = 1.00129$,

which satisfies (5.1).

5.3.7 Comparison with Logarithmic PCM

The same four test sentences were encoded using 3-, 4-, and 5-bit log PCM with a sampling rate of 8 kHz. They were then compared in formal listening tests to the quality achieved by the adaptive predictive delta modulator (APDM). The compression characteristic for a log PCM quantizer is defined by

$y = V \cdot \frac{\log(1 + \mu |x| / V)}{\log(1 + \mu)} \cdot \mathrm{sgn}(x)$,

where y represents the output voltage corresponding to an input signal voltage x, μ is a dimensionless parameter which determines the degree of compression, and V is the compressor overload voltage [10]. Since the four test sentences were digitized with an accuracy of 10 bits per sample, V was set to 512. The parameter μ was chosen equal to 100 to make the listening tests compatible with most other such tests in the literature.

Figure 5.7 Comparison of the Adaptive Predictive Delta Modulator to logarithmic PCM (3-bit, 4-bit, and 5-bit log PCM)

The results of the subjective tests (see Figure 5.7) show that the quality of the speech processed by the APDM was slightly better than that of 3-bit log PCM. The crossover point occurs at 3.1 bits per sample.
Even though the granular noise of the APDM system compares favourably to the granular noise of 4-bit log PCM, many listeners still preferred the quality of 3-bit log PCM to the quality of the APDM coder because of the greater high-frequency content of 3-bit log PCM. Also, 3-bit log PCM reproduces voiceless sounds better despite its very coarse quantization.

VI. ADDITIONAL RESULTS AND RECOMMENDATIONS

This chapter contains information concerning additional experiments and observations, as well as some recommendations for further research.

6.1 Improvement of the Formant Redundancy Reduction

Increasing the number of previous samples used in the formant predictor should result in an improvement of the coder performance. A formant predictor based on eight previous samples was implemented on the computer, but it turned out to be unstable. The poles of the filter transfer function were restricted in the same way as outlined in section 3.2, but this did not solve the problem completely. In addition, finding the poles of the filter transfer function requires a fair amount of computation.

A better and computationally simpler solution is indicated in Haskew et al. [14]. Instead of finding the poles of the filter transfer function, Haskew et al. [14] calculated the area function of an acoustic tube which corresponds to the filter defined by the predictor coefficients. Unstable predictor coefficients result in an area function with some negative values. Thus, such instabilities can be detected easily.
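The idea of detecting instability through the area function can be sketched as follows. This is not the procedure of [14] itself but a common textbook variant, given under stated assumptions: the prediction is taken as the sum of a[k-1] times the k-th previous sample, the reflection coefficients are obtained with the standard step-down recursion, and the orientation of the area ratio is one of two usual conventions. All function names are hypothetical.

```python
def reflection_coefficients(a):
    """Step-down recursion for the predictor polynomial 1 - sum_k a[k-1] z^-k."""
    a = [float(c) for c in a]
    ks = []
    for order in range(len(a), 0, -1):
        k = a[order - 1]              # highest coefficient is k_order
        denom = 1.0 - k * k
        if denom == 0.0:
            raise ValueError("reflection coefficient of magnitude 1")
        ks.append(k)
        # Coefficients of the predictor of the next lower order.
        a = [(a[j] + k * a[order - 2 - j]) / denom for j in range(order - 1)]
    ks.reverse()                      # ks[0] is k_1
    return ks


def area_function(ks, first_area=1.0):
    """Tube section areas; a ratio is negative exactly when |k| > 1."""
    areas = [first_area]
    for k in ks:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas


def predictor_is_stable(a):
    """True when every section area is positive (all |k| < 1)."""
    try:
        ks = reflection_coefficients(a)
    except ValueError:
        return False
    return all(area > 0.0 for area in area_function(ks))
```

For a predictor of order p this needs on the order of p squared multiplications, which is far less work than solving for the p poles of the transfer function.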
6.2 Improvement of the Reproduction of Voiceless Sounds

The adaptive predictive delta modulator is based on theories that consider only voiced parts of speech. Furthermore, the spectrum of a voice signal falls off at 6 to 12 dB per octave at frequencies above 500 Hz. These are the two main factors contributing to the poor reproduction of voiceless sounds.

In order to enhance the high-frequency content of the reproduced signal, the input signal spectrum was shaped using a digital filter. The pre-emphasis implemented began at 600 Hz and increased linearly with frequency to +12 dB at 3000 Hz. It was found that, for the reasons explained in section 3.1, the second predictor loop of the encoder was not able to handle the additional high-frequency content of the input signal. The delay contained in this part of the coder represents a severe problem when the input signal amplitude is fluctuating rapidly.

These difficulties could probably be circumvented by increasing the number of quantizer levels. This would allow the instantaneous selection of several step sizes, thus partially bypassing the delay problem. Extensive research concerning coders using adaptive multi-level quantizers has been carried out by Cummiskey [12]. Another approach is described in [14], where vocoder techniques are used to transmit voiceless sounds. Unfortunately, the improvements mentioned above may require that the bit-rate be increased.

6.3 Effect of Channel Noise

The effect of transmission errors due to channel noise was not studied because the digital channels presently used commercially are virtually error-free.
The error rates are of the order of one bit per 10^[illegible] bits. It is known that the performance of Jayant's scheme degrades rapidly when the error probability becomes considerably larger than 10^[illegible] [15]. It is expected that the APDM coder will show a similar behaviour.

6.4 Quantization of the Predictor Parameters

Figures 6.1a, b, and c show histograms of the gain parameter β, the delay coefficient M, and the two predictor coefficients α₁ and α₂. All three graphs exhibit a clustering of the parameter values. Therefore, to minimize the number of bits required to transmit the coder parameters, nonlinear quantization or variable length coding should be used. Note, however, that the delay coefficient M takes on integer values only and may not be quantized any further. Variable length encoding is possible but requires additional buffer storage, thus increasing the coder complexity.

Table 6.1 gives an estimate of the number of bits required to transmit the predictor parameters. The delay parameter is given a range of 20 to 146, thus requiring 7 bits for its encoding. The largest number, 146, is reserved for transmitting the voiced/unvoiced decision. Parameter M becomes irrelevant when the decision requires disabling of the first predictor loop by opening the switch. In this case, the transmitter sets M equal to 146, which tells the receiver to switch the pitch prediction off. This procedure saves one bit that would otherwise be necessary to transmit the switch operation separately.
The estimate of the bit requirement for the gain coefficient β is based on results of a nonlinear quantization of this parameter in Kelly et al. [16].

Parameter    Bit requirement per frame
M            7
β            3
α₁           5
α₂           4
total        19 bits/frame

Figure 6.1 Histograms of the coder parameters, calculated for all four sentences: a) histogram of the gain coefficient β; b) histogram of the delay coefficient M

An experimental quantization of the coefficients α₁ and α₂ to 5 and 4 bits respectively reduced the S/Q ratio of the two sentences containing no voiceless sounds by 0.65 dB. The corresponding degradation for the other two sentences was 0.8 dB. It is expected that a proper nonlinear quantization, using the same number of bits, would yield considerably better results.

Table 6.1 indicates that at 200 frames per second, 3800 bits per second are required to transmit the predictor coefficients (200 frames × 19 bits). The overall bit-rate will therefore reach 11.8 kbits/sec when the input signal is sampled at 8 kHz.

VII. SUMMARY AND CONCLUSIONS

A new adaptive predictive delta modulator was simulated on a digital computer and optimized for speech signals. The main objective was to keep the transmission bit-rate as low as possible. Therefore, an efficient redundancy reduction technique had to be applied.
Speech signals contain segments with low correlation between the samples, such as voiceless sounds, transition periods between two sounds, and pauses; but the signal energy is mainly concentrated in voiced segments with high correlation between the samples. Redundancy reduction methods therefore concentrate on these segments. The main sources of redundancy in such parts of the speech signal are peaks in the short-term spectral envelope, also called formants, and the pitch, that is, a quasi-periodicity of the signal.

The first stage of the coder calculates the difference between the true signal value and a prediction derived from the value one pitch period before, thus reducing the pitch redundancy. Subsequently, the difference signal is filtered in the second stage of the coder, where a recursive filter removes some formant redundancy. The remaining, now less correlated error signal is quantized to two levels and transmitted.

Since speech signals lack periodicity in voiceless sounds and during quiet periods, the pitch predictor contains a switch, activated by a threshold dependent on the zero-crossing rate and the signal energy. This addition resulted in a subjectively better performance during quiet signal periods and at the beginning of words. An improvement of 1 to 2 dB of the S/Q ratio was achieved by a filter inserted into the pitch predictor to eliminate some quantization noise fed back to this part of the coder.
For the pitch predictor, a delay and a gain coefficient are all that are required. The number of parameters used in the second stage depends on the number of formants to be removed. The coder described here reduces redundancy due to the first formant only. Two parameters are required for this purpose. Assuming that the signal statistics are stationary over short periods of time, all four parameters are re-calculated in intervals of 5 msec (frame length of the coder) and optimized in terms of mean square error over the last 5 msec (learning period of the coder). Measurements of the S/Q ratio for different lengths of the learning period showed that 5 msec are sufficient to determine the signal statistics with adequate accuracy. Similar calculations for the frame length indicated that no substantial improvement of the S/Q ratio was achieved with frame lengths shorter than 5 msec.

In contrast to the four coder parameters, which are readjusted only every 5 msec, the quantizer level changes each time a new input sample has been processed. The adaptation rule is simple: according to whether or not the new error signal has the same sign as the previous one, the quantizer level is multiplied by a constant factor greater or less than one. During periods of slope overload the quantizer level increases exponentially, and when the signal energy drops to low values the quantizer level decreases accordingly. Thus, for any signal level, the quantizer adjusts its level to the optimum value.
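The adaptation rule can be sketched in a few lines. The sketch below is a toy one-bit delta modulator without the pitch and formant prediction loops, so it is only an illustration of the step-size rule and not the simulated coder. The multipliers P = 1.15 and Q = 0.913043 are the values quoted in section 5.3.6, and the upper limit of 64 corresponds to one eighth of the overload value of 512; the function names, the lower limit, and the starting level are hypothetical.

```python
def adapt_level(level, bit, prev_bit, p=1.15, q=0.913043,
                min_level=1.0, max_level=64.0):
    """Multiply the level by P if the sign repeats, by Q otherwise, then clamp."""
    factor = p if bit == prev_bit else q
    return min(max(level * factor, min_level), max_level)


def delta_encode(samples, start_level=1.0):
    """Toy encoder (assumption: no predictor loops, estimate starts at zero)."""
    estimate = 0.0
    level = start_level
    prev_bit = 1
    bits, levels = [], []
    for x in samples:
        bit = 1 if x >= estimate else -1     # sign of the error signal
        level = adapt_level(level, bit, prev_bit)
        estimate += bit * level              # receiver applies the same update
        bits.append(bit)
        levels.append(level)
        prev_bit = bit
    return bits, levels
```

Feeding a constant input well above the initial estimate produces a run of equal bits, and the level grows by the factor P at every sample until the upper limit is reached, which is the exponential increase during slope overload described above.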
The computer simulations of the system revealed some difficulties originating from this exponential self-adaptation. Due to the delay of one sampling period, the adaptation was sometimes not able to follow fast fluctuating input signals. The adaptation was disturbed, causing instabilities and poor reproduction of the high-frequency content of such signals. This effect was observed in particular for voiceless sounds. Since the adaptation scheme is nonlinear and has memory, it becomes mathematically intractable. Experimentally, no completely satisfactory solution of the stability problem was found. However, limiting the maximum allowable quantizer level to one eighth of the peak signal amplitude and direct feedback of the quantizer level to the second stage of the coder improved the stability considerably. Occasionally, some instabilities still occur. They result in clicks in the reproduced speech signal.

The coder was simulated on a digital computer and optimized for sampling rates of 8, 12, and 16 kHz, using objective calculations of the S/Q ratio as well as subjective listening tests. In some cases where the measured S/Q ratios were almost identical and did not allow any distinction in performance, the subjective tests exhibited strong preferences. This observation underlines the necessity of evaluating the performance of such systems by subjective methods in addition to the objective calculation of the mean-square errors.
Comparisons with speech of the same bandwidth that was encoded by a 3-bit and a 4-bit log PCM system with the same sampling rate of 8 kHz showed that at this sampling rate the subjective quality of the adaptive predictive delta modulator was equivalent to the subjective quality of 3.1-bit log PCM. Since the granular noise of the adaptive coder is similar to that of 4-bit log PCM, it is expected that if the reproduction of high frequencies could be improved, the adaptive coder would compare favourably to 4-bit log PCM.

The quantization and encoding of the four coder parameters that have to be transmitted to the receiver was not studied in detail. However, preliminary results show that when the voice signal is sampled at 8 kHz, an additional 3800 bits/sec are required to transmit the parameters. This suggests a final transmission bit-rate of 11.8 kbits/sec. A PCM signal with the equivalent subjective quality requires 24 kbits/sec. Thus, a data compression of 2:1 has been achieved.

One could probably reduce the required bit-rate from 11.8 kbits/sec to 9.6 kbits/sec (a standard rate for telephone traffic) without seriously degrading speech quality. Reduction of the sampling rate from 8 kHz to 7 kHz should not affect the quality significantly, provided the input speech bandwidth is reduced from 3.75 kHz to 3.4 kHz [17]. Increasing the frame length from 5.0 msec to 7.5 msec should reduce the S/Q ratio by less than 1/2 dB (see Figure 5.3).
Finally, additional work to improve high frequency fidelity might well result in a 9.6 kbits/sec coder whose quality rivals that of 4-bit log PCM.

REFERENCES

1. Flanagan, J.L., "Speech Analysis, Synthesis, and Perception", Second Edition, Springer, 1972.
2. Bayless, J.W., Campanella, S.J., and Goldberg, A.J., "Voice Signals: bit-by-bit", IEEE Spectrum, pp. 28-34, Oct. 1973.
3. Atal, B.S., and Hanauer, S.L., "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave", JASA, vol. 50, pp. 637-655, 1971.
4. Jayant, N.S., "An Adaptive Delta Modulation with a One-Bit Memory", BSTJ, vol. 49, no. 3, pp. 321-342, March 1970.
5. Jayant, N.S., and Rabiner, A.E., "The Preference of Slope Overload to Granularity in the Delta Modulation of Speech", BSTJ, vol. 50, no. 10, pp. 3117-3125, Dec. 1971.
6. Atal, B.S., and Schroeder, M.R., "Adaptive Predictive Coding of Speech Signals", BSTJ, vol. 49, no. 65, pp. 1973-1986, Oct. 1970.
7. Schroeder, M.R., "Vocoders: Analysis and Synthesis of Speech", Proc. IEEE, vol. 54, no. 5, pp. 720-734, May 1966.
8. Faddeev, D.K., and Faddeeva, V.N., "Computational Methods of Linear Algebra", English Translation by R.C. Williams, San Francisco: W.H. Freeman, 1963, pp. 144-147.
9. Levitt, H., McGonegal, C.A., and Cherry, L.L., "Perception of Slope-Overload Distortion in Delta-Modulated Speech Signals", IEEE Trans. Audio Electroac., vol. AU-18, no. 3, pp. 240-247, Sept. 1970.
10. Smith, B., "Instantaneous Companding of Quantized Signals", BSTJ, vol. 36, pp. 653-709, May 1957.
11. "IEEE Recommended Practice for Speech Quality Measurements", IEEE Trans. Audio Electroac., vol. AU-17, pp. 227-246, Sept. 1969.
12. Cummiskey, P., "Adaptive Differential PCM for Speech Processing", Newark College of Engineering, D. Eng. Sc. Thesis, 1973.
13. Davisson, L.D., "Theory of Adaptive Data Compression", Recent Advances in Communication Systems, vol. 2, A.V. Balakrishnan Ed., New York: Academic Press, 1966.
14. Haskew, J.R., Kelly, J.M., Kelly, R.M., and McKinney, T.H., "Results of a Study of the Linear Prediction Vocoder", IEEE Trans. Commun. Technol., vol. COM-21, no. 9, pp. 1008-1014, Sept. 1973.
15. Chang, K.Y., and Donaldson, R.W., unpublished work.
16. Kelly, J.M., et al., Final report on predictive coding of speech signals, contract no. DAAB03-69-C-0338, Bell Laboratories, June 1970.
17. Donaldson, R.W., and Chan, D., "Analysis and Subjective Evaluation of Differential Pulse-Code Modulation Voice Communication System", IEEE Trans. Commun. Technol., vol. COM-17, pp. 10-19, Feb. 1969.
