UBC Theses and Dissertations

Synthesis of stops, fricatives, liquids and vowels by a computer controlled electronic vocal tract analog Spencer, Kenneth Albert 1971

SYNTHESIS OF STOPS, FRICATIVES, LIQUIDS AND VOWELS BY A COMPUTER CONTROLLED ELECTRONIC VOCAL TRACT ANALOG

by

KENNETH A. SPENCER
B.A.Sc., University of British Columbia, 1967

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
in the Department of Electrical Engineering

We accept this thesis as conforming to the required standard

Research Supervisor
Members of the Committee
Head of the Department

Department of Electrical Engineering
THE UNIVERSITY OF BRITISH COLUMBIA
December, 1971

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
Vancouver 8, Canada

Date: December 21, 1971

ABSTRACT

The speech synthesizer described in this thesis is a computer-controlled solid-state analog of the human articulatory system.
The synthesizer's design overcomes two serious difficulties which have previously hampered vocal tract analog realization: first, the realization of time-varying inductors and capacitors which require a minimum of computer service and whose important parameters are relatively insensitive to large variations in terminal voltage and current, signal frequency, and instantaneous element value; and second, the realization of a control scheme which requires a minimum of computer storage capacity and service to the synthesizer. Present software provides control for the articulatory analog, allows on-line construction, evaluation, and modification of words, displays articulatory parameters during speech production, and provides for automatic checkout of synthesizer hardware.

The analog was used to synthesize vowels, semi-vowels, fricatives, stops, and English words. Spectrograms of synthesized phonemes were compared to published data. Articulator positions and timing parameters for synthesized phoneme sequences and English words were evaluated using subjective listening tests. Fifty English words were synthesized to demonstrate intelligibility, structure and complexity of synthesis rules, and to demonstrate that the synthesizer's accent can be learned quickly.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENT
1. INTRODUCTION
  1.1 Uses of Speech Synthesizers
  1.2 Past Work
  1.3 Scope of the Thesis
2. ELECTRONIC SIMULATION OF THE HUMAN ARTICULATORY PROCESS
  2.1 Description of the Vocal Mechanism
  2.2 How Sounds are Produced
  2.3 An Electrical Analog of the Human Vocal Tract
  2.4 Overall Description of the Vocal Tract Analog
3. HARDWARE AND SOFTWARE
  3.1 Time-Varying Conductor
  3.2 Time-Varying Inductors
  3.3 Time-Varying Capacitors
  3.4 Digital Interpolator
  3.5 Glottal Source
  3.6 Noise Source
  3.7 Control
  3.8 Overall Hardware Synthesizer
  3.9 Software
    Supervisor
    Display and Light Pen
    Teletype
    Vocal Tract Analog
    A-D Handler
  3.10 Sequence of Events During Synthesis of a Word
4. TESTING THE VOCAL TRACT ANALOG
  4.1 Introduction to the Vowel Tests
  4.2 Description of the Vowel Listening Test
  4.3 Results of the Vowel Listening Test
  4.4 Frequency Response and Formant Frequency Measurements
  4.5 Formant Bandwidths of the Vowels
  4.6 Introduction to the Fricatives
  4.7 Listening Test for the Fricatives
  4.8 Results of the Fricative Test
  4.9 Frequency Response of the Fricative Configurations
5. SYNTHESIS OF THE STOPS
  5.1 Introduction to the Stops
  5.2 Methods of Closing the Vocal Tract
  5.3 Target Configurations for the Stops
  5.4 Methods of Producing Initial Stops
  5.5 Stop-Vowel Test I
  5.6 Results of Stop-Vowel Test I
  5.7 Fricative-Stop-Vowel Tests
  5.8 Results of Fricative-Stop-Vowel Tests
  5.9 Stop-Vowel Test II
  5.10 Results of Stop-Vowel Test II
  5.11 Spectrographic Study of the Voiced Stops
  5.12 Summary of the Stop Synthesis Results
6. THE ENGLISH WORD TEST
  6.1 Introduction to the English Word Tests
  6.2 Synthesis Rules for the English Words
  6.3 The English Word Tests
  6.4 Results of the English Word Tests
7. CONCLUDING REMARKS
  7.1 Realization, Cost and Capability of a Computer Controlled Vocal Tract Analog
  7.2 Further Work
  7.3 Summary
APPENDIX: PHONETIC TRANSCRIPTION SYSTEM
BIBLIOGRAPHY

LIST OF FIGURES

2.1 (a) Midsagittal section of the human vocal mechanism (b) Functional diagram of the human vocal mechanism
2.2 Vocal tract profiles for the principal phonemes in the English language
2.3 (a) Right cylinder (b) Electrical analogs for a short right cylinder
2.4 Schematic of MINERVA
3.1 (a) Discrete-state time-varying conductance and its control circuitry (b) Stepwise approximation G(t) to continuous time function G(t)
3.2 Circuit of Riordan for simulating a time-invariant inductor
3.3 (a) Ungrounded time-varying inductor (b) Series combination of the kth resistor and switch in G(t)
3.4 (a) Q of steady state inductor plotted against control number B and frequency f (b) Steady state inductance for the circuit of Fig. 3.3 vs. frequency f and control number B (c) Same as (b) for L = 1.0 H
3.5 Circuit used to simulate capacitor
3.6 Final realization of time-varying capacitors
3.7 Input capacitance C and quality factor Q vs. variable resistance R for capacitor in Fig. 3.6
3.8 Input capacitance C and quality factor Q vs. binary control number B for capacitor in Fig. 3.6
3.9 Quality factor Q vs. frequency for capacitor in Fig. 3.6 for various values of C and R
3.10 Control number N as a function of time using up-down counter, comparator and variable rate clock
3.11 (a) Glottal excitation source (b) Waveforms
3.12 Spectrum of voicing source used and spectrum of source used by Stevens et al.
3.13 Variable rate pulse generator used to simulate changing inflection at voicing source
3.14 Variable rate clock pulse produced by circuit in Fig. 3.13
3.15 Noise generator used to simulate friction and aspiration
3.16 Illustrating modulation of the noise source by the glottal source
3.17 Digital comparator and clock memory matrix
3.18 Overall hardware facility
3.19 Function of program blocks
3.20 Timing information and source amplitudes for the word LEASH
4.1 Timing used for vowel test
4.2 Data used in vowel tests
4.3 Frequency responses of vocal tract analog for 10 selected vowel configurations
4.4 First formant-second formant (F1-F2) data
4.5 Formant frequency of vowel configurations from Stevens et al.
4.6 First formant bandwidth plotted against first formant frequency
4.7 (a) Vocal tract configurations for synthesizing fricatives (b) Frequency response for configurations in (a)
4.8 Timing controls used in fricative-vowel tests
4.9 Timing control for /ho/ and /ha/ in fricative-vowel test
4.10 Timing used in pilot fricative-vowel test CVC
5.1 Configurations used to simulate the stops
5.2 Control timing used in stop-vowel test I
5.3 Control timing used in fricative-stop-vowel test
5.4 Synthetic spectrograms of stop-vowel combinations

LIST OF TABLES

4.1 Results of vowel listening test
4.2 Signal-to-noise ratio for 10 selected vowel configurations used in Fig. 4.3
4.3 Bandwidths of vowels measured by Dunn and bandwidths of vowels synthesized on MINERVA
4.4 Results of vowel-fricative test
4.5 Results of pilot test with fricative-vowel combinations CVC
5.1 Results of stop-vowel test I
5.2 Results of fricative-stop-vowel test
5.3 Results of stop-vowel test II
5.4 The maximum response rate for each stop with various vowels
6.1 Results of English word test
6.2 Percentage scores in English word test for each word
6.3 Incorrect responses for each word and percentage of time each was made during tests

ACKNOWLEDGEMENT

I wish to acknowledge the assistance I received during the years of work on my thesis. I thank:

Dr. Robert W. Donaldson, who supervised the thesis and provided encouragement throughout the project.
Dr. Andre-Pierre Benguerel, who helped me through the phonetics and provided subjects for listening tests.
Jean-Pierre Steger, who quickly and efficiently built MINERVA.
Gordon McConnell, who did the maintenance and changes for MINERVA.
Linda Morris, who quickly typed the thesis.
The National Research Council, who supported me and the project.
My wife, Sharon, who gave encouraging support to me throughout the thesis, typed the first draft and spent many hours proofreading.

1. INTRODUCTION

1.1 USES OF SPEECH SYNTHESIZERS

Speech synthesis has been studied since the late eighteenth century, when Kratzenstein [1] and Von Kempelen [2] made mechanical talking machines. Potential uses of speech synthesizers today are manifold. One of the main uses would be as an output device for a computer. If computers could speak their answers as well as print them and display them graphically, the use of digital computers could be broadened. The present cost of putting computer terminals in homes and offices is still relatively high, but if a terminal consisted of the telephone, then vast amounts of information could be available to virtually anyone having access to a telephone. Other typical uses of speech synthesizers are: reading machines for the blind, teaching machines, communication with the computer in spacecraft and aircraft without a bulky terminal or console, and automatic information services.

Another important use of speech synthesizers is in speech research.
As Mattingly [3] points out, when developing a theory for language, the phonetician can never be sure that the utterances he receives from speakers have followed the rules that he has set down. Having a speech synthesizer can help in learning about the speech process. The closer in concept this synthesizer is to the human vocal tract, the more one might learn about how humans produce speech.

As Holmes et al. [4] in 1964 and Flanagan et al. [5] in 1970 pointed out, synthesized speech need not be identical to a natural utterance for many applications. Recognition as a possible utterance in the dialect is sufficient. The machine can be permitted to have a machine accent and still be useful.

1.2 PAST WORK

Speech synthesizers fall broadly into two classes, terminal analogs and vocal tract analogs. Each type can be realized physically using hardware, or can be simulated on a general purpose digital computer.

A terminal analog attempts to reproduce the spectral properties of the speech being synthesized. Generators of various frequencies may be used to give a certain spectrum, or a source rich in harmonics may be filtered. The filters may be in series or in parallel.

The spectrum of speech contains prominent peaks, called formants. Reasonably intelligible speech can be synthesized by using three bandpass filters to reproduce the first three formants, although terminal analogs have difficulty reproducing speech where the spectral characteristics change rapidly. A complete study of both terminal analogs and vocal tract analogs was given by Flanagan in 1965.

One of the earliest terminal analogs was the pattern playback device developed at Haskins Speech Laboratory in 1950. The control was painted on a belt, and this control intensity-modulated harmonically related frequencies as the belt ran over photocells.
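The terminal-analog idea described above, a harmonic-rich source shaped by three formant filters, can be illustrated with a minimal sketch. Nothing in it comes from the thesis: the function names, the single-peak resonance shape, the parallel (summed) filter arrangement, the 120 Hz fundamental, the 100 Hz bandwidth and the example formant values are all assumptions chosen only to make the idea concrete.

```python
import math


def resonance_gain(freq_hz, center_hz, bandwidth_hz):
    """Gain of an assumed single-peak resonance: 1 at the centre, 1/sqrt(2) at +/- bandwidth/2."""
    detune = (freq_hz - center_hz) / (bandwidth_hz / 2.0)
    return 1.0 / math.sqrt(1.0 + detune ** 2)


def synthesize_harmonic_vowel(formants_hz, f0_hz=120.0, duration_s=0.05,
                              sample_rate=8000, bandwidth_hz=100.0):
    """Sum harmonics of f0, each weighted by three parallel formant resonances."""
    n_harmonics = int((sample_rate / 2) // f0_hz)  # stay below the Nyquist frequency
    weights = [
        sum(resonance_gain(k * f0_hz, f, bandwidth_hz) for f in formants_hz)
        for k in range(1, n_harmonics + 1)
    ]
    n_samples = round(duration_s * sample_rate)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        samples.append(sum(w * math.sin(2.0 * math.pi * k * f0_hz * t)
                           for k, w in enumerate(weights, start=1)))
    peak = max(abs(s) for s in samples)
    return [s / peak for s in samples]  # normalise to unit peak amplitude


# Illustrative formant values only; they are not taken from the thesis.
wave = synthesize_harmonic_vowel((700.0, 1100.0, 2500.0))
```

Because the spectral envelope is specified directly, a sketch like this reproduces steady vowels easily, but says nothing about how the envelope should move between sounds, which is the weakness of terminal analogs noted above.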
Since this early attempt many terminal analogs have been developed, including PAT [7], OVE I [8], OVE II [9], SPASS [10], and Glace-Holmes [11]. Some terminal analogs, such as OVE III [12] and one developed by Cooper and Epstein [13], are computer controlled, giving them a great deal more flexibility.

A vocal tract analog attempts to model the equations for pressure and volume velocity of the air in the human vocal tract during speech production. The model usually assumes that the human vocal tract can be modelled by a series of right circular cylinders of various areas. In 1950 Dunn made a static analog with twenty-five sections, each modelling .5 cm of length and 6 cm² of area. A constriction corresponding to the tongue hump could be inserted, dividing the vocal tract into two cavities. By moving the position where the tongue was inserted, different vowels were produced. Stevens et al. in 1953 made a thirty-five section analog, with each section simulating .5 cm of length. The area of each section could be varied from .17 cm² to 17 cm² by means of a bank of switches. The analog produced the vowels and the fricative consonants /f, s, ʃ/*. Both the above analogs were static in that they could only produce one sound continuously. To produce another sound the bank of switches had to be changed manually.

In 1956 Rosen [16] built a fifteen section analog that could store two configurations. Changes from one configuration to another were made by a specially designed control system. This vocal tract analog produced fricative-vowel and vowel-fricative combinations. Hecker in 1962 added a nasal analog to this vocal tract analog and produced nasals.

In 1962 Kelly and Lochbaum [18] developed a program that simulated the vocal tract. The output was a digital tape that represented the sound pressure waveform.
This program took about forty seconds to produce one second of speech. Matsui et al. [19] have built a vocal tract analog that runs from a computer. The computer calculates the parameters for speech every five msec and generates a control tape. The control tape then runs a vocal tract analog off-line. This synthesizer is capable of producing short stories.

Rice [20] in 1970 developed a software analog. In this analog one configuration is stored in a computer for every twelve msec of speech. The excitation of the vocal tract is also stored. From the configuration and the input, the program calculates the output. At present, this program produces vowels and other voiced sounds. There are plans to extend it to include sounds that use other excitation. This program takes one hundred seconds to calculate one second of speech.

Not all synthesizers are distinctly terminal or vocal tract analogs. One of the synthesizers used by Flanagan et al. is a mixture of the two. The starting point is a vocal tract configuration and the computer solves for eigenfrequencies of the configuration every ten msec. This solution information is then used to run a terminal analog.

The problem of automatic speech synthesis can be divided into two main areas. One is the development of higher level rules to produce sentences from phonemic inputs. Rabiner and his co-workers [5, 21, 22, 23], Mattingly [24] and Tatham [25] are among the workers in this area. The other problem is to produce better synthesizers that lend themselves to being controlled by simple rules.

* IPA or modified symbols are used throughout the thesis and are explained in Appendix A and Fig. 2.2.

1.3 SCOPE OF THE THESIS

The work presented here is a continuation of Rosen's work. Rosen's analog had some serious drawbacks. Because it was made of vacuum tubes, the parameters drifted so much that the analog required frequent adjustment.
The inductors used were saturable reactors which required a non-linear control. The value of the inductance could be changed over a range of 100:1, whereas a range of 250:1 or more is desirable. The analog could produce two sounds in sequence, and manual resetting of the control system was necessary to synthesize a different pair.

The synthesizer described in this thesis is a hardware vocal tract analog (hereafter referred to as MINERVA). Vocal tract configurations in MINERVA can be set by computer using an interactive graphics display. The programs allow long sequences of speech to be synthesized. In fact, the maximum length of a speech utterance that can be synthesized is a function only of the computer and the program. The synthesizer operates in real-time*. The synthesizer control programs incorporate any changes that were made in configuration shapes or timing control since the same utterance was last synthesized. The effect of these changes can be heard immediately, as opposed to some synthesizers where a long time elapses before the output is heard. The real-time feature is convenient in designing synthesis rules because a sequence can be heard, changes can be made, and the effect of these changes can be heard immediately.

There is always a trade-off between flexibility and simplicity of calculations. The software simulated analogs are very flexible in that changes only involve a change in the program, but they require considerable computer time for calculations; so much, in fact, that they cannot operate in real-time. Even hardware synthesizers controlled directly by a computer would keep a small computer fully occupied if signals had to be generated every five or ten msec.
In MINERVA, a digital hardware interpolator was designed to operate between the computer and the vocal tract analog. The computer loads the interpolator with parameter values and rates of change. The interpolator then controls the analog for up to 500 msec, thereby freeing the computer for other tasks. This free computer time is presently used to run interactive software so that the user can easily make on-line modifications in control data. Also, the free time is used to display the spectral characteristics of the synthesized speech. This hardware interpolating scheme vastly reduces the number of times that the analog has to be serviced without detracting from the quality of the synthesized speech. Any small loss in total system flexibility is more than compensated for by the additional computer time and memory available for other tasks. The scheme also allows real-time synthesis.

This thesis was mainly concerned with development of a good real-time synthesizer and some word synthesis rules. Higher level sentence rules which would be needed for a synthesis by rule scheme were not considered. The purpose of the research in this thesis was to

1. Prove that a vocal tract analog could be built from digitally controlled inductors and capacitors, and develop inductors, capacitors and computer interface circuits for the above analog.
2. Show that the above analog could produce continuous speech in real-time using a small digital computer for control purposes.
3. Study the production of synthetic stops in the initial and middle positions. Production of vowels, liquids and fricatives was also studied.
4. Initiate development of rules needed to control the synthesizer in synthesizing English words.
5. Provide a reliable vocal tract analog for speech synthesis and analysis research.

Chapter 2 of the thesis describes the electronic simulation of the vocal tract and briefly describes the various kinds of sounds encountered in English. Chapter 3 describes the inductors and capacitors used in the simulation, the hardware interpolator, the excitation sources used, the interactive software and the overall facility. Chapter 4 shows how the vocal tract analog's capabilities were verified in the static mode and also in the synthesis of fricative-vowel and vowel-fricative combinations. Chapter 5 presents a study of stop consonants. Chapter 6 describes the rules used to synthesize 50 English words. Chapter 7 includes a discussion of the speed and memory capacity of a computer needed to control a final version of MINERVA, and some suggestions for further work.

* We define real-time synthesis as follows: immediately upon command, an acoustic utterance from a selected list can be obtained from a sequence of stored control signals, or from stored rules for generating such control signals. Thus synthesis by lowpass filtering of stored samples of a previously sampled acoustic waveform would not be considered real-time synthesis. Although careful examination reveals that our definition is not as precise as one might wish, it seems reasonable and is satisfactory for our purposes.

2. ELECTRONIC SIMULATION OF THE HUMAN ARTICULATORY PROCESS

2.1 DESCRIPTION OF THE VOCAL MECHANISM

The major parts of the vocal apparatus are the lungs, trachea, vocal cords, pharynx, larynx, velum, tongue, oral cavity, nasal cavity and teeth, as shown in Fig. 2.1(a). The vocal tract is an acoustical tube which is non-uniform in area, terminated at one end by the lips and at the other by the vocal cords, which separate the trachea from the larynx. In an average adult male the vocal tract is about 17 cm long.
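As a rough plausibility check on the 17 cm figure, the tract can be idealised as a uniform tube closed at the vocal cords and open at the lips, which resonates at odd multiples of a quarter wavelength. The sketch below is not from the thesis: the uniform-tube idealisation, the function name and the assumed sound speed of 35 000 cm/s are illustrative assumptions.

```python
def uniform_tube_resonances(length_cm, count=3, speed_cm_per_s=35000.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    The n-th resonance is (2n - 1) * c / (4 * L); the sound speed is an assumed value.
    """
    return [(2 * n - 1) * speed_cm_per_s / (4.0 * length_cm)
            for n in range(1, count + 1)]


# First three resonances of an idealised 17 cm tract.
formants = uniform_tube_resonances(17.0)
```

Under these assumptions the first three resonances fall near 515, 1544 and 2574 Hz. A real tract is non-uniform, so its resonances shift with the area function, which is what the analog sections are meant to capture.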
The cross-sectional area is determined by placement of the jaw, lips, velum, tongue and pharyngeal wall and can vary from a full aperture of twenty cm² to zero for complete closure.

The nasal tract is an additional path for sound transmission beginning at the velum and terminating at the nostrils. In an adult male the nasal tract is about 12 cm in length and has a volume of about 60 cm³; its shape does not vary with time.

Figure 2.1 (a) Midsagittal section of the human vocal mechanism, labelling the nasal cavity, velum, lips, teeth, vocal cords, trachea, lungs, chest cavity, diaphragm, oral cavity, tongue, pharynx cavity, larynx and esophagus. (b) Functional diagram of the human vocal mechanism, showing the nose output, mouth output, nasal cavity, velum, vocal cords, lung volume, tongue hump, pharynx cavity, trachea, oral cavity and diaphragm force.

There are three ways of exciting the vocal tract for the production of sound. The most common excitation is voicing. Voiced sounds normally are produced by using air from the lungs and forcing it through an opening between the vocal folds called the glottis. As the air is forced through, if the vocal folds are put in proper position and under adequate muscular tension, they vibrate. This vibration periodically interrupts the air flow, causing roughly triangular pulses to excite the vocal tract.

A second way of exciting the tract is by friction. To produce unvoiced friction the glottis is opened so that the air passes freely through it. At some point in the tract there will be a narrow constriction. If this constriction is sufficiently small, a turbulence is created, causing a sound similar to white noise.
Friction can be voiced; that is, the air passing through the glottis can be interrupted periodically as in voiced sounds while the turbulence is caused further up the tract.

A third method of exciting the tract is to completely close off the tract and build up pressure behind the closure. Opening the closure abruptly causes a burst of air to come out. A further burst of air can come from the glottis; when this occurs, the sound is called aspirated.

By controlled movement of the vocal tract and proper excitation of the sources, speech is produced.

2.2 HOW SOUNDS ARE PRODUCED

The various classes of sounds and how they are produced are described by Flanagan, and a short summary is presented here.

Sounds which have voicing and no other excitation are the vowels, liquids, glides and diphthongs. The vowels are produced with voicing excitation and by putting the vocal tract into a configuration which is quasi-stationary for the duration of the vowel. Different vowels are produced by using different configurations of the vocal tract. Typical shapes of the vocal tract for various sounds are shown in Fig. 2.2. The semi-vowels /l/ and /r/ are similar to vowels except that the vocal tract is more constricted and the tongue tip is not down. Glides /w/ and /y/ also are produced with voicing, but the vocal tract moves during the production of these sounds. The vocal tract is put into a configuration and moves toward the configuration of the vowel which follows the glide. Diphthongs are voiced sounds that rely on vocal tract motion for their production. The vocal tract changes from one vowel position to another and is in constant motion.

Fig. 2.2 Vocal tract profiles for the principal phonemes in the English language (adapted from Potter, Kopp and Green). Panels: fricative consonants f (FOR), θ (THIN), s (SEE), ʃ (SHE), h (HE), v (VOTE), ð (THEN), z (ZOO), ʒ (AZURE), where the short pairs of lines drawn on the throat represent vocal cord operation; vowels i (EVE), ɪ (IT), e (HATE), ɛ (MET), æ (AT), ɑ (FATHER), ɔ (ALL), o (OBEY), ʊ (FOOT), u (BOOT), ʌ (UP), ɝ (BIRD); stop consonants p (PAT), t (TO), k (KEY); and the nasal consonants.

The nasal consonants are voiced sounds. In producing nasals the velum is lowered and the oral tract is usually completely closed off at some point ahead of the velum. The closed oral cavity acts as a side resonator and can influence the sound radiated at the nostrils.

Another class of sounds is the fricative consonants, of which there are two types, voiced and unvoiced. The unvoiced fricatives are produced by friction as described in the previous section. The fricative /f/ is produced with the upper teeth close to the inner surface of the lower lip. The air stream passes through this narrow constriction to cause turbulence. Another fricative, /θ/, is produced with the tip of the tongue close to or touching the upper incisors. The third fricative, /s/, is produced with the tongue raised to approach the back of the gums of the upper teeth. The final fricative, /ʃ/, is produced with the tongue raised to the front of the hard palate. /h/ is considered to be an unvoiced fricative but is the subject of some controversy. It is thought by many that the turbulence is caused in the larynx. The voiced fricatives are produced by simultaneous voicing and friction as described in the previous section. The voiced fricative /v/ and the unvoiced fricative /f/ have essentially the same configuration.
Similarly, so do /ð/ and /θ/, /z/ and /s/, and /ʒ/ and /ʃ/. There is no voiced counterpart to /h/ in English.

Plosive or stop sounds are produced by getting a burst as described in section 2.1. Since the vocal tract is closed off for a short time, voiceless stops are always preceded by a silent interval. In a voiced stop, however, during this silent interval there might be a low frequency hum radiating from the throat, not through the oral or nasal aperture but through the walls of the vocal tract, jaw and cheeks. The voice source is turned on at some time during the plosive sound or during the transition to the following vowel. Plosives are usually divided into two groups, unvoiced /p, t, k/ and voiced /b, d, g/. In English, whether a stop is voiced or unvoiced is usually thought to depend on the time the voicing starts, with the voicing starting sooner in voiced sounds than in unvoiced sounds. Plosives /p/ and /b/ are produced by causing the closure to be at the very front of the vocal tract, the lips. Plosives /t/ and /d/ are produced by closing the tract off at the back of the teeth. The last two plosives, /k/ and /g/, are produced by closing the tract off further back with the back of the tongue against the palate, although for these two stops the exact place of closure is strongly influenced by the neighbouring sounds, the vowels in particular.

2.3 AN ELECTRICAL ANALOG OF THE HUMAN VOCAL TRACT

The human articulatory system is illustrated in Fig. 2.1(b). In an electrical analog of this system, current is analogous to air volume velocity, and voltage is analogous to sound pressure.
If the cross-dimension of the vocal tract is appreciably less than a wavelength, and as long as the tube doesn't flare too rapidly, the acoustical system can be approximated by the Webster equation

$$\frac{1}{A(x)}\,\frac{\partial}{\partial x}\left[A(x)\,\frac{\partial p}{\partial x}\right] = \frac{1}{c^{2}}\,\frac{\partial^{2} p}{\partial t^{2}} \qquad (2.1)$$

where A(x) is the cross-sectional area, p the sound pressure and c the velocity of sound in air. This equation is analogous to the equation for voltage on a transmission line.

The vocal tract can be approximated by a series of short right uniform cylindrical tubes of various areas such as the one shown in Fig. 2.3(a). It was shown by Fant [26] that equivalent circuits can be found for such a tube. Two equivalent circuits appear in Fig. 2.3(b). Inductance L, capacitance C, resistance R and conductance G represent, respectively, air inertia, air compliance, viscous friction at the tube wall and heat loss at the tube wall. An arbitrary constant corresponding to impedance scaling can also be included in the equations. These equivalent circuits will be valid if the cross-dimension of the tract and the length of the cylinder are small with respect to a wavelength. By cascading the π or T sections in Fig. 2.3(b), a vocal tract analog consisting of short circular tubes of various areas can be modelled.

Figure 2.3 (a) Right cylinder; length ℓ, cross-sectional area A, circumference S. (b) Electrical analogs for a short right cylinder. Symbols defined in the figure: ρ, air density; μ, viscosity coefficient of air; ω, radian frequency; c_p, specific heat of air at constant pressure; c, speed of sound in air; η, adiabatic constant of air; λ, coefficient of heat conduction.

Figure 2.4 Schematic of MINERVA. Elements with double arrows are controlled directly from a digital computer. Capacitors with single arrows are controlled directly by hardware averaging numbers used to control adjacent inductors. The elements shown with fixed values (L1 = 0.156 H, L2 = 0.136 H, C1 = 2.75 nF, C2 = 5.87 nF, Rr = 470 Ω, Lr = 270 mH) are time-invariant. Constants ρ and c are defined in Fig. 2.3. Constant k is arbitrary. The table shows the simulated length ℓ for each section, and inductance L and capacitance C which result when corresponding control number B = 1. Also shown for each inductor and capacitor are the respective number of bits n and m in each control number B. The output voltage is v0(t). A noise source could be inserted in series with any of the inductors.

Section i     1     2     3     4-10   11     12     13     14
ℓ (cm)        1.0   1.0   1.5   1.5    1.0    0.5    -      -
L (H)         -     -     2.14  2.14   5.68   2.84   1.49   2.0
n (bits)      0     0     6     6      8      8      8      8
C (nF)        -     -     0.9   0.3    0.525  0.112  0.425  0.887
m (bits)      0     0     6     6      8      8      8      8

2.4 OVERALL DESCRIPTION OF THE VOCAL TRACT ANALOG

The vocal tract analog that was constructed is shown in Fig. 2.4. It consists of 14 cascaded π sections that simulate 14 right cylindrical tubes. Series and shunt conductors as shown in Fig. 2.3(b) are not included, as the losses in the inductors and capacitors are ample. The arrangement of the section lengths is similar to that used by Rosen [16]. The first two sections, representing the laryngeal cavity, have a fixed length of 1.0 cm and fixed radii of .85 cm and .91 cm respectively. Sections 3 to 10 have a fixed length of 1.5 cm and have a variable area that can vary over a range of 64:1 with the radius ranging from .28 cm to 2.24 cm.
Sections 11 and 12 have fixed lengths of 1.0 cm and .5 cm respectively and can vary over a range of 255:1, with the radius varying from .14 cm to 2.24 cm. Sections 13 and 14 have a variable area and a variable length because the length of the vocal tract in front of the teeth can change.

$L_r$ and $R_r$ simulate the radiation load of the mouth on the vocal tract. Flanagan has shown that this radiation impedance can be simulated by an inductance in parallel with a resistance as follows:

[Equation (2.2), giving $R_r$ and $L_r$; not legible in the source text.]

In (2.2) k is the impedance scale factor, r is the radius of opening at the mouth and c is the velocity of sound. Therefore the resistance is constant and the inductance is proportional to the radius.

The glottal end of the tract is excited by a constant current generator which generates: 1) triangle shaped pulses for voicing, 2) noise for aspiration or 3) a combination of both. For production of fricatives and stops, noise sources can also be inserted at any point along the tract to simulate turbulence.

The inductors and capacitors should permit simulation of areas ranging from complete closure to 20 cm². Since complete closure represents an arbitrarily large inductance, it was decided that the range of elements for the front part of the tract should be 250:1 or greater, giving an area variation of 250:1 or greater.
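The area ranges quoted for the variable sections follow from the radius limits, since the area of a circular section varies as the square of its radius. The short check below is only an illustrative sketch; the function names are not from the thesis. It also suggests that the quoted maximum radius of 2.24 cm corresponds to an area of about 16 cm².

```python
import math


def area_cm2(radius_cm):
    """Cross-sectional area of a circular tube section, in cm^2."""
    return math.pi * radius_cm ** 2


def area_ratio(r_min_cm, r_max_cm):
    """Ratio of the largest to the smallest area for the given radius limits."""
    return area_cm2(r_max_cm) / area_cm2(r_min_cm)


# Radius limits quoted in the text.
mid_ratio = area_ratio(0.28, 2.24)    # sections 3 to 10, quoted as 64:1
front_ratio = area_ratio(0.14, 2.24)  # sections 11 and 12, quoted as 255:1
max_area = area_cm2(2.24)             # largest simulated area

print(round(mid_ratio), round(front_ratio), round(max_area, 2))
```

The radii give a ratio of about 256 for the front sections; the quoted figure of 255:1 is what one would expect if an 8-bit control number runs from 1 to 255, which is an interpretation rather than something stated at this point in the text.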
The elements should be frequency independent over the range 200 Hz to 5000 Hz and of sufficiently high Q to give proper formant bandwidths.

Setting the impedance level of the tract involves the following factors:
1. the range over which the simulated area is to vary,
2. the minimum and maximum values of variable inductors available,
3. the minimum and maximum values of variable capacitors available.

From Fig. 2.3 it follows that $LC = (l/c)^2$. Therefore LC is constant for any given length of cylinder and independent of the impedance scale factor k. Since the front sections near the teeth and lips can have a smaller aperture than the rear sections behind the tongue, the variable that is common to all sections, regardless of how much they vary, is the maximum A. The maximum A was chosen at 16 cm². The maximum A sets the minimum L and the maximum C. From the inductors and capacitors that are described in the following chapter, a reasonable value for the maximum C was 57.6 nF. For $l$ = 1.5 cm this gives the minimum L a value of 33.3 mH. Using these values, which are realizable with the simulated inductors and capacitors, the table of values given in Fig. 2.4 can be developed. These values give the impedance scale factor k a value of $3.22 \times 10^{-3}$ gm cm$^{-4}$ sec$^{-1}$ ohm$^{-1}$. From equation 2.2, $R_r \approx 460\ \Omega$ and $L_r = 79.5\ \text{mH} \times r$. For $R_r$, 470 Ω was used.

3. HARDWARE AND SOFTWARE

3.1 TIME-VARYING CONDUCTOR

To simulate time-varying inductors as described in the next sections, a digitally controllable time-varying conductor was needed. The circuit chosen appears in Fig. 3.1(a). If the k-th switch is closed when $b_k = 1$ and open when $b_k = 0$, then

$$G = gB \quad \text{where} \quad B = \sum_{k=0}^{n-1} b_k 2^{k}$$

and G is the conductance between terminals x-x'. If the switches are opened and closed in proper time sequence, then any time function such as $G(t)$ in Fig. 3.1(b) can be approximated arbitrarily closely by a stepwise function $\hat{G}(t)$, as the following argument shows.

In Fig. 3.1(b), $r(t)$ is a piecewise linear approximation to $G(t)$, and $s(t)$ is a stepwise approximation to $r(t)$; the step height is g. Errors between $\hat{G}(t)$ and $s(t)$ result from resistor inaccuracies. Let $\epsilon_1 = G - r$, $\epsilon_2 = r - s$, $\epsilon_3 = s - \hat{G}$ and $\epsilon = G - \hat{G}$. Since $\epsilon = \epsilon_1 + \epsilon_2 + \epsilon_3$,

$$|\epsilon| \le |\epsilon_1| + |\epsilon_2| + |\epsilon_3|.$$

Let $G_m$ and $G'_m$ be the maximum magnitudes of $G(t)$ and $dG/dt$ respectively for $a \le t < b$. For $a \le t < b$,

$$|\epsilon_1(t)| \le |t-a| \left| \frac{G(t)-G(a)}{t-a} - \frac{G(b)-G(a)}{b-a} \right|.$$

Each difference quotient is bounded in magnitude by $G'_m$ and $|t-a| \le T$, where $T = b - a$ is the spacing between break points, so it follows that $|\epsilon_1| \le 2TG'_m$. Since $(2^n-1)g = G_m$ and $|\epsilon_2| \le g$, $|\epsilon_2| \le G_m/(2^n-1)$. Let $\Delta g_k$ and $\Delta G$ be the variations in $g_k$ and G about their nominal values. Then

$$\Delta G = \frac{G_m}{2^n-1} \sum_{k=0}^{n-1} \frac{g_k}{g}\,\frac{\Delta g_k}{g_k}\, b_k$$

if $\Delta g_k/g_k$ is small. If $|\Delta g_k/g_k| \le \Delta$ for all k, then $|\epsilon_3| \le G_m \Delta$, and

$$\left|\frac{\epsilon(t)}{G_m}\right| \le \frac{2TG'_m}{G_m} + \frac{1}{2^n-1} + \Delta.$$

It follows that $\epsilon(t)$ can be made arbitrarily small by making T, 1/n, and Δ small.

Figure 3.1 (a) Discrete-state time-varying conductance and its control circuitry: switched conductances $g, 2g, 4g, \ldots, 2^{n-1}g$ driven by bits 0 to n-1 of a digital interpolating system, which receives a direction-of-count control signal, a clock pulse generator input, and data from a digital computer or other digital storage device. (b) Stepwise approximation $\hat{G}(t)$ to continuous time function $G(t)$. All steps in $s(t)$ are of equal height g. Any difference between $s(t)$ and $\hat{G}(t)$ results from inaccuracies in the resistors of conductance $2^k g$ in (a).

Figure 3.2 Circuit of Riordan [28] for simulating time-invariant inductor. If the element denoted by $Z_1$ is a time-varying conductance, if $Z_3$, $Z_4$ and $Z_5$ are resistive, and if $Z_2$ is a capacitor, then the impedance from 1 to ground is a time-varying inductance.
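The relation G = gB and the error bound derived in Section 3.1 can be evaluated numerically. The following is a minimal sketch, not part of the original hardware or software; the function names and the example numbers (break-point spacing, slope, tolerance) are assumptions chosen only for illustration.

```python
def control_number(bits):
    """Return B = sum(b_k * 2**k); bits[k] is b_k, least significant bit first."""
    return sum(b << k for k, b in enumerate(bits))


def conductance(bits, g):
    """Conductance G = g * B of the switched ladder for unit conductance g."""
    return g * control_number(bits)


def relative_error_bound(T, slope_max, g_max, n, tol):
    """Bound on |eps(t)/G_m|: 2*T*G'_m/G_m + 1/(2**n - 1) + Delta."""
    return 2.0 * T * slope_max / g_max + 1.0 / (2 ** n - 1) + tol


# Example (assumed values): bits b0=1, b1=0, b2=1 give B = 5.
B = control_number([1, 0, 1])
G = conductance([1, 0, 1], g=1e-5)  # siemens, hypothetical unit conductance

# Assumed: 1 ms between break points, G'_m/G_m = 10 per second,
# an 8-bit ladder and 1 % resistor tolerance.
bound = relative_error_bound(T=1e-3, slope_max=10.0, g_max=1.0, n=8, tol=0.01)
print(B, G, round(bound, 4))
```

With these assumed numbers the three terms contribute 0.02, about 0.004 and 0.01 respectively, so the break-point spacing dominates the bound.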
3.2 TIME-VARYING INDUCTORS

As discussed in the previous chapter, time-varying inductors were needed. These inductors had to be variable over a range of 250:1, had to be frequency independent over the range 200 Hz < f < 5 kHz and had to have Q > 25. In 1967 Donaldson and Wickwire [27] developed such an inductor from a time-invariant inductor proposed by Riordan [28]. The basic circuit is shown in Fig. 3.2. The input impedance between terminal 1 and ground is

$$Z_{in} = \frac{Z_1 Z_3 Z_5}{Z_2 Z_4} \qquad (3.1)$$

If $Z_1$, $Z_3$, $Z_4$ and $Z_5$ are resistive and $Z_2 = 1/(Cs)$, then $Z_{in} = Ks/G$, which simulates an inductor, where K is a constant and $G = 1/Z_1$.

Since a floating inductance rather than an inductance to ground was desired, a symmetrical circuit such as the one in Fig. 3.3(a) had to be employed. If the gains of the differential amplifiers are large and the admittance of C is large compared to [illegible in source], then

$$v(t) = L(t)\,\frac{di(t)}{dt} \quad \text{where} \quad L(t) = \frac{G_4 C_2}{G_5 G_3 G_1(t)}.$$

Since $v(t) = \frac{d}{dt}\left[L(t)\,i(t)\right]$ for a true time-varying inductor, the circuit simulates an inductance if

$$i(t)\,\frac{dL(t)}{dt} \ll L(t)\,\frac{di(t)}{dt}.$$

If $G_1$ is constant and one of the pairs $G_3 G_3'$, $G_4 G_4'$ or $G_5 G_5'$ is varied simultaneously, then $v(t) = \frac{d}{dt}\left[L(t)\,i(t)\right]$. With $i(t) = I\cos(2\pi f t)$ and A and $l$ as defined in Fig. 2.3, the ratio of the two terms is, in magnitude,

$$\frac{i(t)\,dL(t)/dt}{L(t)\,di(t)/dt} \approx \frac{l}{2\pi f A}\,\frac{d(A/l)}{dt}$$

and was small over the frequency range of interest, 200 Hz ≤ f ≤ 5 kHz, unless $A/l$ changed suddenly and by a large amount. Since the assumptions of the analog were questionable during this time, and since much more elaborate circuitry was necessary to vary the conductors simultaneously, the circuit of Fig. 3.3 was used.

One bit in the time-varying conductor is shown in Fig. 3.3(b). With $G_1(t)$ in Fig. 3.3(a) constant, the circuit simulates a linear time-invariant inductor. Let $\bar{L}$ be the inductance which results when control number B = 1. Note that since $G_1 = gB$, $L = \bar{L}/B$.
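The dependence of the simulated inductance on the control number can be checked with a few lines of arithmetic. This sketch evaluates $L = G_4 C_2/(G_5 G_3 G_1)$ with $G_1 = gB$. The component values are hypothetical, chosen only so that the B = 1 inductance comes out at 10 H with a 0.01 µF capacitor; they are not taken from the actual circuit.

```python
def gyrator_inductance(c2, g1, g3, g4, g5):
    """L = G4*C2 / (G5*G3*G1); conductances in siemens, C2 in farads, L in henries."""
    return g4 * c2 / (g5 * g3 * g1)


def inductance_for_control_number(b, g_unit, c2, g3, g4, g5):
    """Inductance when the switched conductor is set to G1 = g_unit * B."""
    if b < 1:
        raise ValueError("B = 0 opens every switch, i.e. an unbounded inductance")
    return gyrator_inductance(c2, g_unit * b, g3, g4, g5)


# Hypothetical component values (assumptions, not from the thesis).
C2 = 0.01e-6          # 0.01 uF
G_UNIT = 1e-5         # unit conductance g, i.e. 100 kilohm
G3 = G4 = G5 = 1e-4   # 10 kilohm each

l_bar = inductance_for_control_number(1, G_UNIT, C2, G3, G4, G5)
l_min = inductance_for_control_number(255, G_UNIT, C2, G3, G4, G5)
print(l_bar, l_min)
```

Because L varies as 1/B, an 8-bit control number spans a 255:1 inductance range, and equal steps in B correspond to equal steps in simulated area rather than in inductance.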
With $\bar{L}$ = 10.0 H (C = .01 µF) the inductor's Q exceeds 200 for the range of frequencies 100 Hz ≤ f ≤ 20 kHz and 10 ≤ B ≤ 255, as shown in Fig. 3.4(a). The Q exceeded 50 for 100 Hz ≤ f ≤ 5 kHz and 1 ≤ B ≤ 9. Similar behaviour of Q was observed for $\bar{L}$ = 1.0 H (C = .1 µF). For $\bar{L}$ = 10.0 H and again for $\bar{L}$ = 1.0 H, $LB/\bar{L}$ deviated appreciably from unity over the frequency range 200 Hz ≤ f ≤ 10 kHz only for B ≥ 75, as shown in Fig. 3.4(b and c). Fortunately, B ≤ 63 for most inductors in Fig. 2.4, and B < 75 for the remaining inductors most of the time.

Figure 3.3 (a) Ungrounded time-varying inductor. A grounded inductor results if A and B are grounded. (b) Series combination of the k-th resistor and switch in $G_1(t)$. When $v_k$ = 12 volts, $b_k$ = 1. When $v_k$ = -6 volts, $b_k$ = 0.

Figure 3.4 (a) Q of steady state inductor plotted against control number B and frequency f. (b) Steady state inductance for the circuit of Figure 3.3 vs. frequency f and control number B. $\bar{L}$ equals the inductance for B = 1. $\bar{L}$ = 10 H. (c) Same as (b) for $\bar{L}$ = 1.0 H.

The Q of radiation inductor $L_r$ in Fig. 2.4 was between 5 and 10 for 200 Hz ≤ f ≤ 5 kHz, independent of B. In this case a low Q can be tolerated, as the inductor was in parallel with a 470 Ω resistor. The ratio $LB/\bar{L}$ was within 3% of its theoretical value for 200 Hz ≤ f ≤ 5 kHz for 1 < B < 15.

Transient tests were done and are described in Wickwire's thesis [29] and in Donaldson and Wickwire [27].

3.3 TIME-VARYING CAPACITORS

If the circuit of Fig. 3.2 is used with [subscripts illegible in source] resistive and [subscript illegible in source] capacitive, then a simulated time-varying capacitor is realized, since

$$i(t) = \frac{d}{dt}\left[C(t)\,V(t)\right] \quad \text{with} \quad C(t) = R\,G(t)\,C_f \qquad (3.3.1)$$

Although this circuit was tested by Wickwire [29], when it was connected to the simulated inductor the combination was found to be unstable, in the sense that the outputs of the operational amplifiers oscillated between saturation in the positive direction and saturation in the negative direction.

A more stable capacitor was then realized by modifying the circuit in Fig. 3.5, which was originally used by Antoniou [30]. The modification resulted when the time-varying resistor $R_3(t)$ in Fig. 3.6 replaced the fixed resistor $Z_3$ in Fig. 3.5. In Fig. 3.6, $i(t)$ and $V(t)$ are related by 3.3.1 as is shown below:

[The derivation, which expresses $i(t)$ in terms of the amplifier output voltages $V_{o1}(t)$ and $V_{o2}(t)$ and concludes that $i(t) = \frac{d}{dt}\left[V(t)\,C(t)\right]$, is not legible in the source text.]

Figure 3.5 Circuit used to simulate capacitor. A time-varying capacitor is realized if $Z_3$ is a time-varying conductor.

Fig. 3.7 shows the measured capacitance C and quality factor Q vs $R_3$. Fig. 3.8 shows C and Q vs control number, B, when the resistor ladder with FET switches was substituted for $R_3$. In Fig. 3.8 good linearity of C with respect to B is observed. The Q was highest for mid-control numbers and for higher values of $C_f$, and was lowest for small numbers and small $C_f$. The Q was always above 25 and was usually greater than 100. Fig. 3.9 shows Q plotted against frequency for various values of $R_3$ and $C_f$. The input capacitance (not shown) was within 2% of the calculated value for 200 Hz ≤ f ≤ 10 kHz.
The frequency range of interest is between the two arrows in Fig. 3.9. The Q is always above 30 except when $C_f$ = 2.2 nF, $R_3$ = 100 kΩ and f ≤ 500 Hz. Because only one capacitor of the 12 variable capacitors in the vocal tract analog has $C_f$ = 2.2 nF, the circuit in Fig. 3.6 was considered acceptable for use in MINERVA.

Figure 3.7 Input capacitance C and quality factor Q vs. variable resistance $R_3$ (ohms) for capacitor in Fig. 3.6. (1) $C_f$ = 2.2 nF, (2) $C_f$ = 4.7 nF, (3) $C_f$ = 10 nF, (4) $C_f$ = 22 nF. The measurements were taken at a frequency of 1 kHz.

Figure 3.8 Input capacitance C and quality factor Q vs. binary control number B for capacitor in Fig. 3.6. (1) $C_f$ = 2.2 nF, (2) $C_f$ = 4.7 nF, (3) $C_f$ = 10 nF. The measurements were taken at a frequency of 1 kHz.

3.4 DIGITAL INTERPOLATOR

It was desirable to minimize the time required for the computer to calculate and transfer control numbers to the time-varying resistors in MINERVA. Accordingly a peripheral hardware digital interpolation system was designed, with the result that the computer needed to supply control numbers only at the break points. For each inductor or capacitor to be controlled, a register of semiconductor memory, a digital comparator, an up-down counter and a variable rate clock were needed. Fig. 3.1 shows the control scheme used to control each element in a digital piecewise linear fashion. The value of the element was controlled by an up-down counter, which counted at a rate determined by the variable rate clock. The output of the up-down counter was connected to a digital comparator which compared the number in its memory with the number in the up-down counter.
If the number in the up-down counter was larger than the number in the memory, the comparator enabled the down count and the counter decremented. Similarly, if the number in the counter was less than the number in the memory, then the counter incremented. If the numbers were equal then the counter was disabled.

The variable rate clock consisted of a down counter and a memory register. The memory register contents were loaded into the counter, which counted at a fixed rate. When the counter reached zero, an output pulse was generated and the contents of the memory register were loaded into the counter again. If the number C was loaded into the memory register and the clock was driven with a square wave of period τ, the output would be a pulse train of period Cτ. This period would be maintained until the memory register in the clock was changed by the computer.

Figure 3.9 Quality factor Q vs. frequency (100 Hz to 10 kHz) for capacitor in Fig. 3.6 for various values of $C_f$ and $R_3$. (1) $C_f$ = 2.2 nF, (3) $C_f$ = 10 nF, (6) $C_f$ = 22 nF. For case (6) with $R_3$ = 10 kΩ, Q > 1000 and is not shown on the graph.

This control scheme utilized the idea of target configurations for the vocal tract. The vocal tract would move from one position to another and then hold that position.
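The behaviour of the counter, comparator and variable rate clock described above can be imitated in software. The following is a minimal sketch under stated assumptions: one call to `tick()` stands for one basic clock period τ, and the class and method names are illustrative only and do not correspond to anything in the MINERVA hardware or programs.

```python
class VariableRateClock:
    """Down counter reloaded from a memory register; pulses every `reload` ticks."""

    def __init__(self, reload):
        self.load(reload)

    def load(self, reload):
        if reload < 1:
            raise ValueError("reload value must be at least 1")
        self.reload = reload  # contents of the clock's memory register
        self.count = reload   # down counter

    def tick(self):
        """Advance one basic period; return True when an output pulse occurs."""
        self.count -= 1
        if self.count == 0:
            self.count = self.reload
            return True
        return False


class Interpolator:
    """Up-down counter steered towards a target by a digital comparator."""

    def __init__(self, value):
        self.counter = value  # up-down counter, i.e. the element's control number
        self.target = value   # comparator memory register
        self.clock = VariableRateClock(1)

    def load(self, target, reload):
        """What the computer supplies at a break point: new target and rate."""
        self.target = target
        self.clock.load(reload)

    def tick(self):
        """Advance one basic period and return the current control number."""
        if self.clock.tick():
            if self.counter > self.target:
                self.counter -= 1  # comparator enables the down count
            elif self.counter < self.target:
                self.counter += 1  # comparator enables the up count
            # equal: counter disabled, value held
        return self.counter


ip = Interpolator(10)
ip.load(target=14, reload=3)           # 4 steps, one every 3 basic periods
history = [ip.tick() for _ in range(15)]
print(history)
```

In this example the control number moves from 10 to 14 in 12 basic periods and then holds, so the computer only has to intervene again at the next break point.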
For example, assume that the number a was in the up-down counter and the number b was loaded into the digital comparator's memory register at time $t_1$, and at that same time the number $T/|b-a|$ was loaded into the clock. The up-down counter, and hence the element, would reach the value of number b at time $t_1 + T$, and would hold that number until the memory register in the comparator was changed at time $t_2$, as shown in Fig. 3.10.

Figure 3.10 Control number N as a function of time using up-down counter, comparator and variable rate clock.

The basic period for each variable rate clock was .125 msec, therefore the maximum rate of change was one step every .125 msec. The slowest rate was one step every 128 msec.

An interrupt clock was built by modifying a variable rate clock so that the counting could be disabled and enabled. The enabling pulse to the interrupt clock synchronized the astable multivibrator. When the counter reached zero an output pulse was communicated to the digital computer. The output pulse also disabled the astable multivibrator. The effect of the interrupt clock was as follows: if the number C was loaded into it, the computer would receive an interrupt Cτ later, where τ is the basic period of the astable multivibrator. In MINERVA τ was .5 msec and C was a 10 bit number, therefore times of .5 msec to 512 msec could be set. The purpose of the interrupt clock was to allow the computer to calculate the time of the next interrupt and to load the interrupt clock. When MINERVA required the next service, it would request it by raising an interrupt.

3.5 GLOTTAL SOURCE

As noted in Chapter 2, vocal cord excitation is caused by periodic interruption of the flow of air through the trachea.
The source has a high acoustical impedance and its electrical analog is a current source. In MINERVA the glottal source could generate a periodic buzz to simulate voicing and/or an aperiodic noise to simulate aspiration. A control word allowed voicing only, aspiration only or both simultaneously.

The circuitry for the buzz source, which produced a periodic triangular shaped pulse, is shown in Fig. 3.11(a). If $G(t)$ was a digital time-varying conductor as described in section 3.1, then a voltage $v(t)$ as shown in Fig. 3.11(b) appears at the output if $G(t)$ increases with time. If square pulses $p(t)$, as shown, excite the base of transistor Q, then a train of pulses of increasing amplitude would appear at the collector of transistor Q. The pulse generator was either a fixed frequency pulse generator or a variable rate pulse generator which will be described later. The fixed frequency pulse generator is an astable multivibrator. The pulse shaping filter is a lossy integrator which will give the waveform $i_g(t)$ shown in Fig. 3.11(b). This wave is then summed with the aspiration and applied to a voltage to current converter, making the glottal source appear as a current source.

The spectrum of the buzz waveform is shown in Fig. 3.12 along with the spectrum used by Stevens et al [15] in 1953 on a static vocal tract analog. The two spectra are within 4 db when 100 Hz ≤ f ≤ 5 kHz. Stevens et al chose their spectrum by listening to synthesized vowels and adjusting the source spectrum for optimum vowel quality.

Figure 3.11 (a) Glottal excitation source, consisting of a pulse generator $p(t)$, variable gain amplifier, pulse shaping filter (time constant τ = RC), summing amplifier and voltage-to-current converter feeding the vocal tract analog. (b) Waveforms. In MINERVA α = 200 µsec, β = 8 msec, W = 0.5 msec. Pulse generator $p(t)$ is periodic, with period β.

Figure 3.12 Spectrum of voicing source used and spectrum of source used by Stevens et al.

To simulate changing pitch during speech production, a variable period pulse generator was added to the glottal source, and fixed or variable pitch was selectable by a switch. The pulse generator, shown in a block diagram in Fig. 3.13, was a variable rate clock, described in Section 3.4, in which the memory register has been removed and replaced by the output of an up-down counter. After each output pulse the modified clock's counter was loaded with the contents of the up-down counter. If the period of the output of the variable rate clock toggling the up-down counter is long compared to the period of the astable multivibrator in the modified clock, then the effect will be pulses coming at a slowly varying period, as shown in Fig. 3.14. Again a comparator, clock, and up-down counter combination were used to set the new period of the pulse generator and the rate at which to move to the new period without further service from the computer.
Although the pulse generator had an effective frequency varying from 33 Hz to 8 kHz, the range 100 Hz to 150 Hz was used during most speech synthesis studies.

Figure 3.13 Variable rate pulse generator used to simulate changing inflection at voicing source.

Figure 3.14 Variable rate clock pulse produced by circuit in Fig. 3.13.

3.6 NOISE SOURCE

Pseudonoise was used to simulate aspiration and friction. The pseudonoise generator, suggested by A. C. Davies [31,32] and shown in Fig. 3.15, generated the noise by loading a 6 bit digital to analog (d-a) converter with the binary output of a 15 bit m-sequence shift register which had a 1 MHz clock rate. The noise had a uniform amplitude probability density function, and its short term spectrum was effectively white over the range 0 < f < 10 kHz. The output of the d-a converter was connected to a variable gain amplifier, like the one in Fig. 3.11, whose gain was varied by varying $G(t)$. The gain $G(t)$ was a trapezoidal function except when voicing and friction occurred together. In the latter case $G(t)$ varied at the voicing frequency between $b_n(t)$ and the larger of $\{0,\ b_n - b_s\}$, as shown in Fig. 3.16. The numbers $b_n(t)$ and $b_s(t)$ were 6 bit binary numbers proportional to the amplitude of the noise source and of the voicing part of the glottal source respectively. This alternation simulated the periodicity of the air in the tract coming through the vocal folds and passing through a constriction which caused turbulence. In MINERVA the location selector was used to select the proper transformer for inserting the noise into the vocal tract.

For aspiration the output of the d-a converter was connected to a separate variable gain amplifier, then summed with the voicing component. This sum then excited the voltage to current converter.

Figure 3.15 Noise generator used to simulate friction and aspiration. Blocks shown include the glottis clock and comparator, the friction comparator, a 1 MHz multivibrator, up-down counters, the 15 bit shift register, an inverter and adder selecting the larger of $\{0,\ b_n - b_s\}$, a noise gate driven from the glottis pulse generator, a storage register, the 6 bit d-a converter, the variable gain amplifier and the location selector feeding the vocal tract transformers. Transformers allowed noise to be inserted in series with any inductor in Fig. 2.4.

Figure 3.16 Illustrating modulation of the noise source by the glottal source. Numbers $b_s$ and $b_n$ are six-bit binary numbers proportional to the amplitudes of the glottal source and noise source, respectively.

Figure 3.17 Digital comparator and clock memory matrix.

3.7 CONTROL

Fig. 3.17 shows the matrix arrangement of the J-K flip-flop memories of the digital comparators and variable rate clocks. The i-th row corresponded to the i-th computer controlled element in MINERVA, and the j-th column to the j-th bit of a binary word. The J and K terminals of all flip-flops in the j-th column were connected to the ZERO and ONE output, respectively, of the j-th flip-flop in output buffer register No. 2. All toggle terminals T in the i-th row were connected to the i-th output terminal of the address decoder.
This decoder directed the flag register's pulse to the i-th toggle line if and only if the number i was stored in the 6 bit output buffer register No. 1. To load binary number b from the computer into the i-th row of the matrix, the computer first loaded number i into output buffer register No. 1, which was located in a general purpose interface. Next the number b was loaded into output buffer register No. 2. A pulse then appeared at the flag register which is associated with register No. 2. This pulse was delayed 1 µsec and was placed on the i-th toggle line by the address decoder, causing the contents of output buffer register No. 2, b, to be loaded into the i-th row of the J-K matrix, leaving all other rows undisturbed. Using this scheme any element's target value could be updated without affecting those of the other elements.

Twenty-four 18-bit binary words were used to control MINERVA. One word controlled the interrupt clock, one the voicing part of the glottal source, one the position of insertion of the noise, one the inflection clock, and 19 words the values and rates of change of the elements. Only 4 bits were used to locate the position of the noise, with the remaining bits being used to control miscellaneous functions that were used in various experiments.

Since each capacitor in Fig. 2.4 represents half the capacitance for 2 adjacent π sections in Fig. 2.3,

$$C_i = \frac{A_{i-1} l_{i-1} + A_i l_i}{2k\rho c^{2}} \qquad (3.6.1)$$

$$L_i = \frac{k\rho l_i}{A_i} = \frac{\bar{L}_i}{B_i} \qquad (3.6.2)$$

Solving (3.6.2) for $A_i$ yields $A_i = k\rho l_i B_i/\bar{L}_i$. Substituting into 3.6.1 for i and i-1 yields

$$C_i = \frac{1}{2c^{2}}\left(\frac{l_{i-1}^{2} B_{i-1}}{\bar{L}_{i-1}} + \frac{l_i^{2} B_i}{\bar{L}_i}\right).$$

In capacitors $C_4$ to $C_{10}$,

$$\frac{l_{i-1}^{2}}{\bar{L}_{i-1}} = \frac{l_i^{2}}{\bar{L}_i} \quad \text{so that} \quad C_i = \frac{l_i^{2}}{2c^{2}\bar{L}_i}\left(B_{i-1} + B_i\right).$$

Capacitors $C_4$ to $C_{10}$ can therefore be controlled by averaging the binary numbers for the adjacent inductors. The averaging was accomplished by dropping the least significant bit of the output of a binary adder. $C_3$ is controlled by a weighted average of $B_3$ and the control number $B_2$ which would correspond to the fixed inductor $L_2$. Capacitors $C_{11}$ to $C_{14}$ are controlled directly because of the difficulty in calculating the control numbers when $\bar{L}_{i-1} \ne \bar{L}_i$ and $l_{i-1} \ne l_i$.

In the control words used to control the elements, 8 bits were used to control the value of the element through the digital comparators, and 10 bits were used to control the rate of change of the element through the variable rate clock. Where the element had a 6 bit variation, only 6 bits were used to control the element and the other 2 bits were unused.

3.8 OVERALL HARDWARE SYNTHESIZER

The circuits for MINERVA fit into a standard 72" by 19" rack, with power supplies in another rack. The vocal tract analog was constructed with resistor-transistor logic for the digital part and discrete components for the analog part. MINERVA was controlled by a digital computer through a general purpose interface.
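As an illustration of the kind of data that passes through this interface, the sketch below packs an 8 bit element value and a 10 bit rate into one 18-bit word, and forms a capacitor control number by adding two inductor control numbers and dropping the least significant bit, as described in Section 3.7. The bit layout (value in the low 8 bits, rate in the high 10 bits) is an assumption made for the example; the actual arrangement of bits within the word is not given in the text, and the function names are hypothetical.

```python
VALUE_BITS = 8   # element value, via the digital comparator
RATE_BITS = 10   # rate of change, via the variable rate clock


def pack_control_word(value, rate):
    """Pack value and rate into one 18-bit word (assumed layout: rate high, value low)."""
    if not 0 <= value < (1 << VALUE_BITS):
        raise ValueError("value must fit in 8 bits")
    if not 0 <= rate < (1 << RATE_BITS):
        raise ValueError("rate must fit in 10 bits")
    return (rate << VALUE_BITS) | value


def unpack_control_word(word):
    """Inverse of pack_control_word; returns (value, rate)."""
    return word & ((1 << VALUE_BITS) - 1), word >> VALUE_BITS


def averaged_capacitor_number(b_prev, b_next):
    """Binary adder output with the least significant bit dropped."""
    return (b_prev + b_next) >> 1


word = pack_control_word(value=200, rate=1023)
print(word, unpack_control_word(word), averaged_capacitor_number(10, 13))
```

Dropping the least significant bit truncates rather than rounds, so an odd sum such as 10 + 13 yields 11; the resulting half-step error is within the quantization already present in the inductor control numbers.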
The digital computer used was a Digital Equipment Corporation PDP-9, which has a 1 µsec cycle time, 16,384 words of 18 bit core memory, one dectape control with three transports, a type 30 display with light pen, and an extended arithmetic capability which allows hardware multiplications and divisions.

Figure 3.18 Overall hardware facility, showing the vocal tract analog, the interface and the speech amplitude meter. A = audio lines, C = control lines, D = data lines.

An additional facility developed for speech work included a 32 channel spectrum analyzer which was an improved solid-state version of one built by Hughes and Hendel [33] in 1965. The spectrum analyzer is described in detail by Ito [34,35]. The spectrum analyzer's output could be displayed on the computer's display.

A speech amplitude meter similar to one of Peterson and McKinney [36] was built. This meter rectified the speech signal, low pass filtered the output of the rectifier, and had both a linear and a logarithmic output. A Scully model 280 tape recorder could be stopped and started on command from the computer. A Kay Electric spectrograph, which makes time-amplitude-frequency displays of speech, was also used. Time and frequency are the coordinates and amplitude is represented by the darkness of points.

The overall hardware facility is shown in Fig. 3.18.

3.9 SOFTWARE

The main objective of the software was to make it as interactive as possible. One of the outstanding advantages of the hardware vocal tract analog was that it could be operated in real time, and the software was designed to take advantage of this feature. With interactive software, a word could be synthesized, listened to, changed and resynthesized in a matter of seconds.
Several words could be synthesized one right after another, allowing for easy comparison of small differences in synthesized speech samples.

Timing information for synthesizing 80 words and target configurations for 30 phones could be stored in core memory. At any one time a set of software pointers was pointing at the data for one particular word, called the active word. The active word could be changed by typing a 2 digit number on the teletype or by light-penning the digits on the display. The active target configuration during static operation of MINERVA was the target configuration being synthesized. It could be changed by typing the 2 letter code which was used to represent each target configuration or by light-penning these same 2 letters on the display. The role of the active target configuration in dynamic operation will be explained later.

The programs can be broken into 5 main segments for synthesis of speech and 3 segments for auxiliary operations as in Fig. 3.19.

Supervisor

The supervisor took care of decoding and encoding phoneme codes, setting pointers to the active word and active target configurations, reading data from and writing data on mass storage, and handling the interrupts from the various devices.
It also calculated the control numbers for the comparators, rates for the clocks, and time of the next interrupt. In addition the supervisor started and stopped the audio tape recorder, moved data between the a-d handler and display table, duplicated parts of the data table into another part and gave control to other programs.

Figure 3.19 Function of program blocks.

DATA TABLE

HARDWARE CHECKOUT
-Interface
-Up-down counters
-Averagers
-Comparators

SUPERVISOR
-Decodes and encodes
-Decides active word and active configuration
-Calculates control parameters, rates for clocks, time for next interrupt
-Handles interrupts from all devices
-Dectape handler

DISPLAY
-Vocal tract configurations
-Word timing
-Spectrum

LIGHT PEN
-Choice of displays
-Change timing
-Change configuration
-Start vocal tract analog
-Change active word
-Change active configuration
-Build timing for words

TELETYPE
-Command decoder
-Choice of displays
-Change configuration
-Start vocal tract analog
-Change active word
-Change active configuration
-Build timing for words
-Printout timing lists
-Printout configurations
-Process spectrum
-Move data to bulk storage
-Move data blocks

VOCAL TRACT ANALOG
-Takes data and loads comparators and clocks

A-D HANDLER
-Gets data from each channel of spectrum analyzer during speech synthesis and stores in a table

Display and Light Pen

The display program would display the active target configuration or the timing information for the active word or both of these at the same time. Also the spectrum of a spoken word (human or machine synthesized) could be stored and displayed. The light pen allowed fast changes in timing and configurations, as well as changes of the active word or the active target configuration, changes in what was being displayed, commands to the supervisor to synthesize a word and the building of parameter lists for words.

Teletype

The teletype allowed for slower interaction but a wider range of communication.
In addition to being able to do all of the things that could be done with the light pen, the teletype could print out configurations or timing lists, give commands to the supervisor to move data in the data buffer and to move data to and from tape.

Vocal Tract Analog

The supervisor could call a loading program to move the data from memory into MINERVA and to load the interrupt clock.

A-D Handler

The supervisor would call the a-d handler each time it received an interrupt from the a-d converter. The handler took the data, stored it appropriately in a table, set the multiplexer to the next channel and started the next conversion. This program was interleaved with the loading of the vocal tract and the display.

3.10 SEQUENCE OF EVENTS DURING SYNTHESIS OF A WORD

The timing information was stored in 12 lists as shown in Fig. 3.20 for the word "leash".

DURCF - contained the time to elapse between changes in target configurations.
TRACF - contained the time to change from one target configuration to another.
DURVS - contained the time to elapse between service calls to the glottal source.
TRAVS - contained the time to change from one amplitude of voicing to another.
HTTVS - contained the amplitude of the voicing part of the glottal source.
TRAAP - contained the time to change from one amplitude of aspiration to another.
DURNS - contained the time between services to the noise source.
TRANS - contained the time to move from one amplitude of noise to the next.
HTTNS - contained the amplitude of the noise source.

In addition there was a list which stored pointers to the various target configurations used in a word.

MINERVA would synthesize the i-th word if the i-th word was made the active word and "GO" was pointed at by the light pen or if the "GO" button was pushed on the console. In the example in Fig.
3.20 for the word "leash", /liʃ/, after "GO" has been signalled 250 msec was loaded into the interrupt clock and the target configuration for /l/ was loaded into the vocal tract analog.

Figure 3.20 Timing information and source amplitudes for the word LEASH. [Plot: amplitude of noise, glottal source frequency, amplitude of voicing and configuration changes (vocal tract moving or stationary) against time, 0 to 600 ms, with the point where the word begins marked.]

TIMING LIST (in ms)
dur cf  tra cf  dur vs  tra vs  htt vs  dur ns  tra ns  htt ns  tra fr  htt fr
375 50 250 0 0 615 0 0 125 100 240 50 125 50 20 125 125 10 240 112 0 75 240 50 35 0 100 0 50 135 0 0 50 0 0 140

The first DURVS and TRACF were arbitrary because nothing was heard until the first interrupt occurred. Before the interrupt occurred the supervisor looked ahead and calculated the necessary control information for the first interrupt. When the interrupt occurred, 20 was loaded for the amplitude of the voicing part of the glottal source and 112 for the voicing frequency. The appropriate rates were loaded into their respective clocks so that they would reach these values in 50 msec and 125 msec respectively. The interrupt clock was loaded with 125 msec. After the 125 msec elapsed, the second interrupt occurred and the next target configuration was loaded along with the respective clock rates. The clock rates for the configurations were set so that all sections arrived at their positions at the same time. Also at the same time the new heights and rates were loaded into the voicing part of the glottal source as before and the time until the next interrupt was also loaded. This process continued until the supervisor recognized that all durations left were 0.

To load all the parameters into MINERVA and to calculate the control numbers for the next service required 625 µsec. For synthesized words, to date, the average time between service calls was 180 msec.
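The rate calculation and the processor load implied by the two figures just quoted can be checked with a short sketch. The function names are hypothetical, and the rates are expressed in counts per millisecond purely for illustration; the actual clock-rate encoding used by the supervisor is not reproduced here.

```python
def transition_rates(current, target, transition_ms):
    """Rate for each element so that all of them finish at the same time.

    Assumes every element moves linearly from its current control number
    to its target over the same transition time (in msec).
    """
    if transition_ms <= 0:
        raise ValueError("transition time must be positive")
    return [abs(t - c) / transition_ms for c, t in zip(current, target)]


def service_load(service_usec, interval_msec):
    """Fraction of processor time spent servicing the synthesizer."""
    return service_usec / (interval_msec * 1000.0)


if __name__ == "__main__":
    # Made-up control numbers for three sections, moved in 50 msec.
    print(transition_rates([10, 40, 200], [30, 40, 100], 50))
    # 625 usec of work every 180 msec on average.
    print(f"{100 * service_load(625, 180):.2f}% of processor time")
```

A section that moves further simply gets a proportionally faster clock, which is what lets every section arrive together.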
The computer spent only 0.35% of the time during speech production in calculating and loading control numbers. The remaining 99.65% of the time was available for other tasks. In our work this extra time was spent calculating spectra and running the display.

IV. TESTING THE VOCAL TRACT ANALOG

4.1 INTRODUCTION TO THE VOWEL TESTS

After MINERVA was designed and constructed it was decided that the synthesizer should be checked in the static mode by studying the vowels. Synthesizing the vowels allowed locations of formants and their bandwidths to be studied, which in turn gave some measure of the usefulness of the synthesizer. Also since vowels are the most common speech sounds, any other sounds would probably be tested in combination with some of the vowels.

Area functions of vowels from Fant [26], Stevens et al [15], Rosen, Chiba and Kajiyama [37] and Perkell [38] were obtained. The area functions had to be adapted to fit the structure of MINERVA. With the data from Rosen, no changes were necessary as the section lengths were the same as the ones in MINERVA. The data from Stevens et al and from Fant gave the cross-sectional area for every 0.5 cm of length. Because MINERVA simulated section lengths of 0.5 cm, 1.0 cm and 1.5 cm, the average area was used if a section length in MINERVA covered more than 1 section of Stevens' et al or Fant's data. With the data from Chiba and Kajiyama the configuration for MINERVA was fit as closely as possible by averaging where sections in MINERVA overlapped sections in their data. The data from Perkell was in the form of tracings of the vocal tract as viewed by X-ray from the side of the head. Copies were made of a tracing which had the configurations of seven vowels on it. The co-ordinate system developed by Ohman [39] was superimposed on the tracings. The tract was then divided into section lengths corresponding to the ones in MINERVA.
The average diameter was taken and a form factor given by Fant [26] to compensate for non-circular cross sections was applied to the diameter to give a vocal tract configuration. Unfortunately, the data given by Perkell is relative and was not intended to be used in an absolute form.

The configurations from the above sources of data were put into the synthesizer, informal listening tests were done, and spectrograms from the Kay Electric spectrograph and from the spectrum analyzer described in Chapter 3 were made. On the basis of these preliminary results, the configurations obtained from the data of Rosen and from the tracings of Perkell were found to be unsuccessful in MINERVA. Some area functions were generated by trial and error techniques and added to the set. Selected vowels from the above sources of data were then put in a formal listening test described in the next section. In addition, the frequency responses of several of the configurations were taken so that formants, bandwidths and dynamic range could be studied.

The features of speech can be divided into two classes, prosodic and segmental. Segmental features are the ones by which people distinguish one sound from another. The shape of the speech wave, or the spectrum in the frequency domain, presence or absence of voicing and rounding of the lips are some of the segmental features in English. Prosodic features are those features which are not necessary in distinguishing a sound but give "colour" to speech. Some prosodic features in English are the fundamental frequency of the voicing, intensity, inflection and duration. Duration can also be a segmental feature as sometimes it is used to distinguish sounds. In synthesizing speech, the important features were the segmental ones and these features were varied in tests. The prosodic features were usually chosen somewhat arbitrarily.
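The averaging used in this section to fit 0.5 cm area data to MINERVA's longer sections can be sketched as follows. This is only an illustration: the function name, the example areas and the three-section layout are assumptions, not the section layout actually used in MINERVA.

```python
def resample_area(areas, section_lengths_cm, step_cm=0.5):
    """Average fine-grained cross-sectional areas over coarser sections.

    `areas` holds one area per `step_cm` of tract length, starting at the
    glottis. Each output value is the mean of the input areas covered by
    one synthesizer section.
    """
    result = []
    index = 0
    for length in section_lengths_cm:
        count = round(length / step_cm)
        if count < 1 or abs(count * step_cm - length) > 1e-9:
            raise ValueError("section length must be a multiple of the step")
        chunk = areas[index:index + count]
        if len(chunk) < count:
            raise ValueError("not enough area samples for the sections")
        result.append(sum(chunk) / count)
        index += count
    return result


if __name__ == "__main__":
    # Six made-up area samples (cm^2) at 0.5 cm spacing.
    fine = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    # Hypothetical sections of 0.5, 1.0 and 1.5 cm.
    print(resample_area(fine, [0.5, 1.0, 1.5]))
```

A 0.5 cm section takes its area unchanged, while a 1.5 cm section takes the mean of three neighbouring samples.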
4.2 DESCRIPTION OF THE VOWEL LISTENING TEST

The listening test for the vowels was given to ten phonetically trained subjects. Seven of these had one year of training in phonetics and three had several years training. The tests were given in a quiet room on a Revox model 77A tape recorder with Sharpe HA10 earphones. Eighteen vowels from the various origins described above were synthesized and put in random order. Each vowel was given twice in succession and each of these vowel pairs occurred 4 times throughout the test, twice with a duration of 230 msec and twice with a duration of 290 msec.

In this test the configuration of the vocal tract was constant so the test is considered static. However, it was desired to give the vowels a definite length for the perception tests so the amplitude of the voicing was changed from zero to some amplitude and back to zero so as to give length to the vowel. The inflection was also changing throughout the test. In all cases the rise and fall time of the voicing amplitude was 50 msec. The inflection is a prosodic feature and can be chosen arbitrarily. If a person gives a list of items verbally, the inflection is usually rising on each item except the last one.
For this reason it was decided to vary the frequency of the voicing source from 100 Hz to 140 Hz over the duration of the vowel, with timing as shown in Fig. 4.1. The area functions are shown in Fig. 4.2.

Figure 4.1 Timing used for vowel test. [Plot: voicing frequency rising from 100 Hz and amplitude of voicing against time, for durations of 290 or 230 ms.]

The test was a forced choice test and the following instructions were written at the top of each response form.

    Transcribe the given vowels. Each vowel is given twice followed by an 8 second space. Respond once for each pair. Choose the vowel from the following list:

    feet  fit  met  fad  bird  but  father  bought  book  boot  go  hate

    The first three pairs are practice and should be answered in A, B, C.

Since some of the subjects indicated on the response form that they could hear no difference between /ɑ/ and /ɔ/ in their dialect (Western Canadian) and would have difficulty in doing so in the test, it was decided to put the /ɑ/ and /ɔ/ responses into one classification.

4.3 RESULTS OF THE VOWEL LISTENING TEST

The results of the vowel listening test are shown in Table 4.1 in the form of a confusion matrix. The top number in each cell is the percentage response for the longer duration and the lower entry is the percentage response for the shorter duration.
Figure 4.2 Data used in vowel tests. Circle represents radius of mouth opening. Horizontal axis: distance from glottis (cm). (a) Data adapted from Stevens et al. (b) Data from other sources. 1 = Fant [26]. 2 = Chiba and Kajiyama [37]. The rest were made on a trial and error basis.

Table 4.1 Results of vowel listening test. Stimuli are listed down left side, responses are listed across the top.

Responses: i  ɪ  e  ɛ  ɝ  æ  ɔ  ʊ  u  ʌ  o  none

Stimulus (durations 290 / 230 msec)   Responses (%)

STEVENS, KASOWSKI and FANT [15]
i   100 90 10
ɪ   50 75 40 10 10 15
ɛ   88 80 2 8 5 5 7 5
æ   5 10 75 85 5 5 5 10
ɑ   60 20 5 35 70 5 5
ɔ   100 100
ʌ   5 7 5 80 63 10 30
ʊ   10 10 10 90 80
u   10 100 90

FANT [26]
i   90 90 10 10
a   95 88 5 12

CHIBA and KAJIYAMA [37]
i   80 75 20 25
e   5 5 25 40 70 55

OTHER
e   5 15 32 85 58 5
e   5 90 85 5 10 5
ʊ   90 95 10 5
u   5 95 90 5 5
o   55 30 45 70

The data obtained from configurations found in Stevens et al (SKF) give good results for all vowels with the exception of /ɑ/, /ʊ/ and /u/. However, since /ɑ/ and /ɔ/ were treated as the same response, /ɑ/ can be made from the data for /ɔ/. Although the data for /ʊ/ and /u/ did not give good results, /ʊ/ and /u/ were synthesized by trial and error and 95% recognition was obtained in their respective categories.
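One row of a confusion matrix of this kind, including the merging of /ɑ/ and /ɔ/ into a single response class, could be tallied as in the sketch below. The response list is invented for the example (20 answers for one stimulus at one duration), and the function and variable names are hypothetical.

```python
from collections import Counter

# Responses that are pooled into one class before scoring.
MERGED_CLASSES = {"ɔ": "ɑ"}


def confusion_row(responses):
    """Percentage of responses in each class for a single stimulus."""
    if not responses:
        raise ValueError("at least one response is required")
    pooled = [MERGED_CLASSES.get(r, r) for r in responses]
    counts = Counter(pooled)
    total = len(pooled)
    return {label: 100.0 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    # Invented data: 12 answers of "father", 6 of "bought", 2 of "but".
    answers = ["ɑ"] * 12 + ["ɔ"] * 6 + ["ʌ"] * 2
    for label, percent in sorted(confusion_row(answers).items()):
        print(label, percent)
```

With ten subjects and two presentations per duration, each response is worth 5% of a cell.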
Various sources of data were used and if the score is taken for each different vowel, a total of nine vowels can be synthesized with a recognition rate of 75% or greater and with an average recognition rate of 89%. In addition /o/ was synthesized with a recognition rate of 100% but using the data that Stevens et al used for /u/.

4.4 FREQUENCY RESPONSE AND FORMANT FREQUENCY MEASUREMENTS

Frequency response measurements were made for the nine synthesized vowels mentioned previously. The glottal source was removed and a sine wave with a high impedance was injected, so as to simulate the closed glottis condition. The response was taken twice, once at normal levels and once at -20 db with respect to normal levels. This was done to check the dynamic range of the synthesizer. The output was measured with a Hewlett Packard 3400A RMS voltmeter and the frequency of the generator was measured with a Monsanto model 101A counter. The frequency response was done in the usual manner except that the exact frequencies of peaks and half power points were recorded. The normalized response curves for the 10 vowels are shown in Fig. 4.3. The solid line is the normal signal, the dotted line is the response at -20 db. For /ʊ/ and /u/ use of normal signal levels caused the vocal tract analog amplifiers to overload.
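The quantities recorded in these measurements are related in a simple way, sketched below. The helper names and the example frequencies are assumptions chosen for illustration and are not measured values from MINERVA.

```python
import math


def half_power_level(peak_volts):
    """RMS voltage at the half power points, 3 db below the peak."""
    return peak_volts / math.sqrt(2.0)


def bandwidth_and_q(f_low_hz, f_peak_hz, f_high_hz):
    """Bandwidth between the half power points and the resulting Q."""
    bandwidth = f_high_hz - f_low_hz
    if bandwidth <= 0:
        raise ValueError("upper half power point must exceed the lower one")
    return bandwidth, f_peak_hz / bandwidth


def db_to_voltage_ratio(level_db):
    """Voltage ratio corresponding to a level in db."""
    return 10.0 ** (level_db / 20.0)


if __name__ == "__main__":
    # Invented formant peak at 500 Hz with half power points at 450 and 600 Hz.
    bw, q = bandwidth_and_q(450.0, 500.0, 600.0)
    print(f"bandwidth {bw:.0f} Hz, Q {q:.2f}")
    # The second measurement pass used one tenth of the normal drive voltage.
    print(db_to_voltage_ratio(-20.0))
```

Recording the half power frequencies directly gives both the formant bandwidth and its Q from the same sweep.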
Figure 4.3 Frequency responses of vocal tract analog for 10 selected vowel configurations. Solid line is response at normal signal levels, dotted line is response at -20 db with respect to normal signal levels. SKF refers to data from Stevens et al [15] and Fant refers to data from Fant [26].

TABLE 4.2 Signal-to-noise ratio for 10 selected vowel configurations used in Fig. 4.3.

VOWEL   SIGNAL-TO-NOISE RATIO (db)
i       30.0
ɪ       31.5
e       31.6
ɛ       34.5
æ       34.0
ɑ       34.4
o       34.2
ʌ       33.7
ʊ       19.1
u       17.3

The signal-to-noise ratios for the vowels are given in Table 4.2. The signal-to-noise ratios were obtained by measuring the RMS voltage with a normal level of voicing at the glottis and by measuring the RMS voltage with the voicing source turned to zero. Most of the noise comes from cross-talk from the variable rate clocks and if shielded cable were used on the audio lines then the signal-to-noise ratio would likely improve considerably.

Except for /ʊ/ and /u/ the shape of the response curve 20 db below normal level was very close to the curve at normal level. The slight differences that can be seen at the minima were probably due to background noise. The reason for the differences in /i/ is that the first minimum is extremely low so that noise looks large after normalization.

Figure 4.4 First formant-second formant (F1-F2) data. F1-F2 data of Peterson and Barney [40] shown in large areas, F1-F2 data of selected synthesized vowels shown by crosses. Horizontal axis: frequency of F1 in Hertz.

In Fig. 4.4 the first two formants for the synthesized vowels are plotted in the first formant-second formant (F1-F2) plane with values from the data of Peterson and Barney [40]. They collected formant information for 10 words and 76 speakers. The synthesized vowels are seen to lie in the proper regions.

Fig.
4.5 shows the formants as measured using vocal tract configurations obtained from Stevens et al. Also shown are the formants obtained when Stevens et al used the same data in their static vocal tract analog. Our first formant for /u/ was found to be higher than that of Stevens et al and our second formant for /u/ was lower. The difference might be caused by the fact that MINERVA was a 15 section analog while Stevens et al had a 35 section analog. It can be seen that /u/ and /ʊ/ were the two vowels with the longest vocal tract and this could indicate a problem in MINERVA's ability to produce vowels with rounded lips. For the other vowels, the agreement between formant frequencies for the two sets of data is very close.

4.5 FORMANT BANDWIDTHS OF THE VOWELS

It was predicted by Flanagan [6] and confirmed by the results of Fujimura and Lindqvist [41] in 1970 that formant bandwidth should decrease as formant frequency increases. The measured bandwidth for the first formant vs the first formant frequency for the nine synthesized vowels mentioned previously is plotted in Fig. 4.6 along with the results of Fujimura and Lindqvist [41]. It can be seen that for the lower first formants, the two sets of bandwidths are similar but for MINERVA the first formant bandwidth increases approximately linearly with formant frequency as would be expected from a system with almost constant Q. Adding the components for wall losses would complicate MINERVA's realization and while it would cause the bandwidths to decrease with increasing frequency, it would make the existing bandwidths even larger. Table 4.
3 shows the bandwidths of various vowels as measured by Dunn [42] in 1961 and the bandwidths of vowels synthesized on MINERVA.

Figure 4.5 Formant frequency of vowel configurations from Stevens et al [15]. Solid line for data in their analog, dotted line for data in MINERVA. Left side shows configuration used in MINERVA. When no dotted line appears it is coincident with solid line. Axes: distance from glottis (cm); formant frequency in Hertz.

Figure 4.6 First formant bandwidth plotted against first formant frequency (Hz). Solid line shows results of Fujimura and Lindqvist [41] for male speakers, dotted line for female speakers. Crosses show results for MINERVA.

Table 4.3 Bandwidths (Hz) of vowels measured by Dunn [42] and bandwidths of vowels synthesized on MINERVA.

          1st Formant                  2nd Formant                  3rd Formant
VOWEL   Av   Extremes   Synthesized   Av   Extremes   Synthesized   Av    Extremes   Synthesized
i       38   30-80      48            66   30-120     130           171   60-300     260
ɪ       42   30-100     75            71   40-120     166           142   60-300     205
ɛ       65   30-120     105           72   30-140     240           126   50-300     253
æ       60   30-140     145           90   40-200     294           156   50-300     265
ɑ       47   30-160     204           50   30-80      210           102   40-300     187
ɔ       50   30-120     146           50   30-200     232           98    40-240     220
ʊ       51   30-120     83            58   30-200     85            107   50-200     116
u       56   30-100     121           61   30-140     95            90    40-200     161
ʌ       46   30-140     147           63   30-140     295           102   50-300     252

It can be seen that for formants 2 and 3, the bandwidths of the synthesized vowels are of the order of 2 or 3 times the average values found by Dunn, but with the exception of /ɑ/, the first formant bandwidths are close to the upper extreme found by Dunn.
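The "2 or 3 times" comparison can be checked against the second and third formant columns of Table 4.3 with a short calculation. The sketch assumes the table is read with four entries per formant (average, lower extreme, upper extreme, synthesized), and the variable names are chosen for this example only.

```python
# Average bandwidths found by Dunn and bandwidths synthesized on MINERVA,
# in Hz, listed in the row order of Table 4.3 (assumed column reading).
DUNN_AVERAGE_F2 = [66, 71, 72, 90, 50, 50, 58, 61, 63]
SYNTHESIZED_F2 = [130, 166, 240, 294, 210, 232, 85, 95, 295]
DUNN_AVERAGE_F3 = [171, 142, 126, 156, 102, 98, 107, 90, 102]
SYNTHESIZED_F3 = [260, 205, 253, 265, 187, 220, 116, 161, 252]


def mean_ratio(synthesized, reference):
    """Mean of the per-vowel ratios synthesized / reference."""
    if len(synthesized) != len(reference):
        raise ValueError("both lists must cover the same vowels")
    ratios = [s / r for s, r in zip(synthesized, reference)]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    print(f"formant 2: {mean_ratio(SYNTHESIZED_F2, DUNN_AVERAGE_F2):.2f}")
    print(f"formant 3: {mean_ratio(SYNTHESIZED_F3, DUNN_AVERAGE_F3):.2f}")
```

The per-vowel ratios vary widely, so the mean is only a rough summary of how much wider the synthesized formants are.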
The large formant bandwidths didn't prevent the listeners from recognizing the vowels, although it could be argued that the results would have been better if the bandwidths had been narrower. Narrower bandwidths are preferred by listeners as shown by House [43] in 1960. It can be seen that the Q of the formants is 3 to 5 while the Q of the components was greater than 25. The low Q in the formants was not due to the radiation resistor, as removing this resistor did not improve the formant Q. The original requirements on the Q's of the components were apparently too low.

The closeness of the formant frequencies with the data from Stevens et al and the scores in the listening tests showed this analog to be capable of synthesizing static vowels. It remained to determine MINERVA's dynamic capabilities.

4.6 INTRODUCTION TO THE FRICATIVES

After the study of the vowels was completed and it was shown that the vocal tract analog could produce sounds in the static mode, MINERVA's dynamic mode capabilities were studied by synthesizing unvoiced and voiced fricatives in combination with the vowels. Several different target configurations for each fricative from various sources of data and different timing parameters that gave the best synthesized speech from MINERVA were found.
This information would be useful in determining a synthesis by rule scheme.

Consonant-vowel (CV) and vowel-consonant (VC) sequences were synthesized initially. The configurations used were taken from Rosen and Fant [26]. Changes between the configurations of fricatives and the vowels /i/ and /ɑ/ were made using various timing rules. Informal listening tests and spectrographic studies were then made. Vowels /i/ and /ɑ/ were chosen as typical closed and open vowels. From this set of data, four fricative target configurations and three sets of timing parameters were chosen. The configurations used are shown in Fig. 4.7(a). The arrow shows the place at which the noise was inserted to simulate friction. The three sets of timing waveforms are shown in Fig. 4.8.

Unvoiced fricatives were synthesized by partially closing the tract at some location and inserting a noise source close to the constriction to simulate turbulence. For voiced fricatives, voicing is applied at the glottis and the noise source is modulated at the glottal frequency as explained in Chapter 2. Phoneme /h/ was produced by putting the tract in a neutral configuration and applying a noise current source at the glottis.

4.7 LISTENING TEST FOR THE FRICATIVES

The four configurations chosen in the previous section were each paired with the vowels /i/ and /ɑ/ and given as stimuli in a set of formal listening tests. In half the stimuli, the fricative came before the vowel (CV) and in the other half, after the vowel (VC). In half of the stimuli the voicing was turned on at the same time as the friction and in half it was started when the friction began to decay for CV, and stopped with the end of the vowel for VC.
This difference in timing for the voicing was intended to give the voiced-voiceless distinction. Three different timing schemes were used giving a total of 96 stimuli. In addition /hi/ and /hɑ/ were added giving a total of 98 stimuli. The timing for /hi/ and /hɑ/ is given in Fig. 4.9.

Figure 4.7 (a) Vocal tract configurations for synthesizing fricatives. Arrow shows place of insertion of noise. Horizontal axis: distance from glottis (cm). (b) Frequency response for configurations in (a). Horizontal axis: frequency (Hertz), 0 to 10K.

Figure 4.8 Timing controls used in fricative-vowel tests (timings 1, 2 and 3), CV for fricative-vowel sequences, VC for vowel-fricative sequences. For friction and voicing, amplitude of source is represented. For configuration, horizontal lines indicate the vocal tract was stationary, diagonal lines indicate the vocal tract was moving. Time scale: 50 ms per division.

Figure 4.9 Timing control for /hi/ and /hɑ/ in fricative-vowel test. For friction and voicing, amplitude of source is represented. For configuration, horizontal lines indicate the vocal tract was stationary, diagonal lines indicate the vocal tract was moving.

The 98 stimuli were placed at random into 3 subjective listening tests of 34, 32 and 32 different stimuli each. In each test the stimuli were placed in random order. Each stimulus was given twice in succession, 2 seconds apart, followed by an 8 second space. The subjects were untrained phonetically since no phonetically trained subjects were available for the tests.
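The count of 98 stimuli and their division into three tests can be reproduced with the sketch below. The configuration labels, the tuple layout and the use of a seeded shuffle are assumptions made for illustration; the text does not say how the random assignment was actually carried out.

```python
import random
from itertools import product


def build_stimuli():
    """Enumerate the fricative-vowel stimuli described in the text."""
    configurations = ["config_1", "config_2", "config_3", "config_4"]  # hypothetical labels
    vowels = ["i", "ɑ"]
    orders = ["CV", "VC"]
    voicing = ["unvoiced", "voiced"]
    timings = [1, 2, 3]
    stimuli = list(product(configurations, vowels, orders, voicing, timings))
    # The two /h/ stimuli have their own timing and are added separately.
    stimuli.append(("h", "i", "CV", "unvoiced", None))
    stimuli.append(("h", "ɑ", "CV", "unvoiced", None))
    return stimuli


def split_into_tests(stimuli, sizes=(34, 32, 32), seed=0):
    """Shuffle the stimuli and deal them into tests of the given sizes."""
    if sum(sizes) != len(stimuli):
        raise ValueError("test sizes must add up to the number of stimuli")
    shuffled = list(stimuli)
    random.Random(seed).shuffle(shuffled)
    tests, start = [], 0
    for size in sizes:
        tests.append(shuffled[start:start + size])
        start += size
    return tests


if __name__ == "__main__":
    all_stimuli = build_stimuli()
    print(len(all_stimuli))
    print([len(t) for t in split_into_tests(all_stimuli)])
```

The factorial part gives 4 x 2 x 2 x 2 x 3 = 96 stimuli, and the two /h/ items bring the total to 98.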
They were all male graduate students between the ages of 22 and 29 whose native language was English. All subjects had one previous exposure to synthesized speech from MINERVA, but none had heard the fricatives. The following instructions appeared at the top of the response form:

    Test 1

    You are about to hear synthesized nonsense words. Each word is given twice in succession followed by an 8 second space. Respond once for each pair. There are 2 phonemes to each word. One of the phonemes is a vowel, either ee or a. The other phoneme is a consonant from the list below. If the sound doesn't sound like any of the sounds below respond none. The first three are practice, respond in a, b, and c.

    F   fee
    H   hee
    V   Van
    S   see
    Z   zee
    SH  shee
    ZH  pleasure
    TH  bath
    DH  then

In the second and third tests the words "The first three are practice, respond in a, b, and c." were eliminated. The subjects were given the tests in a quiet room with Sharpe HA10 headphones and a Scully 280 tape recorder on which were recorded the original samples from MINERVA.

4.8 RESULTS OF THE FRICATIVE TEST

The test results are shown in Table 4.4. For the unvoiced case, /fɑ/ and /ɑf/ were difficult to synthesize and had identification rates of 40% to 50%. Sequence /θi/ was almost never selected, possibly because this is not a common combination in English. Sequences /θɑ/ and /iθ/ reached a 60% identification rate while /ɑθ/ was selected rarely. The combination /sɑ/ was selected 70% of the time and the remaining combinations were identified 80% of the time or more.

The voiced fricatives were more difficult to synthesize. Phonemes /ʒ/ and /z/ yielded high scores but /ð/ and /v/ were more difficult.
Sequences /vi/ and /iv/ could be synthesized to yield scores of 60% but /va/ and /av/ were consistently identified less than half of the time. Phoneme /ð/ was not selected at all; /ð/ and /θ/ are very similar to /v/ and /f/ respectively and are difficult to synthesize and to recognize by machine. Many workers in synthesis and analysis do not attempt /ð/ and /θ/. Phoneme /h/ was produced by using the configuration for /a/, which is a fairly open vowel.

Table 4.4 (a) Results of vowel-fricative test. Stimuli are listed down left side, responses are shown across the top. Results shown are when the fricative was unvoiced and the vowel /i/ was used.
[In all four parts of Table 4.4 the blank cells were lost in extraction; each row lists its remaining entries (percent) in order for timings 1, 2 and 3, so column positions are not shown.]
Response columns: f θ s ʃ v ð z ʒ h none
f  CV   1: 70 10 20   2: 100   3: 100
f  VC   1: 80 10 10   2: 70 20   3: 70 30
θ  CV   1: 70 10 10 10   2: 80 10 10   3: 60 10 10 20
θ  VC   1: 50 40   2: 40 60   3: 70 10
s  CV   1: 40 60   2: 40 60   3: 90 10
s  VC   1: 40 10 10 10 30   2: 80 20   3: 80 10 10
ʃ  CV   1: 90 10   2: 100   3: 100
ʃ  VC   1: 100   2: 100   3: 90 10
h  CV   70 30

Table 4.4 (b) Results of vowel-fricative test. Stimuli are listed down left side, responses are shown across the top. Results shown are when the fricative was unvoiced and the vowel /a/ was used.
Response columns: f θ s ʃ v ð z ʒ h none
f  CV   1: 10 30 30 20 10   2: 40 10 30 10 10   3: 30 10 30 10 10 10
f  VC   1: 50 20 10 20   2: 40 20 40 10   3: 40 30 10 10
θ  CV   1: 20 60 20   2: 20 60 20   3: 20 60 20
θ  VC   1: 50 40   2: 60 40   3: 40 30 10
s  CV   1: 10 80 10   2: 50 50   3: 70 20 10
s  VC   1: 80 10 10   2: 70 10 20   3: 70 10 10
ʃ  CV   1: 10 40 50   2: 90 10   3: 90 10
ʃ  VC   1: 100   2: 90 10   3: 90 10
h  CV   100

Table 4.4 (c) Results of vowel-fricative test. Stimuli are listed down left side, responses are shown across the top. Results shown are when the fricative was voiced and the vowel /i/ was used.
Response columns: f θ s ʃ v ð z ʒ h none
v  CV   1: 60 10 30   2: 30 10 60   3: 50 20 30
v  VC   1: 60 10 30   2: 60 10 30   3: 30 70
ð  CV   1: 20 30 50   2: 40 40 20   3: 10 90
ð  VC   1: 30 30 10 10 20   2: 60 10 10 10   3: 60 20 20
z  CV   1: 10 10 70 10   2: 100   3: 80 10 10
z  VC   1: 10 60 10 20   2: 80 10 10   3: 40 60
ʒ  CV   1: 60 30 10   2: 10 70 20   3: 20 70 10
ʒ  VC   1: 20 50 10 20   2: 20 70 10   3: 20 10 60 60 10

Table 4.4 (d) Results of vowel-fricative test. Stimuli are listed down the left side, responses are shown across the top. Results shown are when the fricative was voiced and the vowel /a/ was used.
Response columns: f θ s ʃ v ð z ʒ h none
v  CV   1: 20 40 10 30   2: 20 80   3: 30 20 50
v  VC   1: 50 10 40   2: 10 40 10 10 30   3: 20 10 20 10 40
ð  CV   1: 20 20 60   2: 30 30 40   3: 10 10 10 70
ð  VC   1: 20 30 10 40   2: 20 70 10   3: 30 30 10 30
z  CV   1: 60 40   2: 60 40   3: 80 10 10
z  VC   1: 10 60 10 20   2: 10 80 10   3: 50 40 10
ʒ  CV   1: 60 30 10   2: 10 80 10   3: 10 80 10
ʒ  VC   1: 10 10 70 10   2: 90 10   3: 20 70 10

Table 4.5 shows scores obtained from a pilot test conducted prior to the formal fricative test. The test results are presented here to compare the results of phonetically trained subjects with those for untrained subjects. In the pilot test there were four subjects, all phonetically trained. Timing as shown in Fig. 4.10 was used and the format was CVC, where the two C's were the same. Each stimulus was repeated twice as in the formal test and the timing is similar to timing 2 of the formal fricative test. As can be seen, the pilot test scores are generally much higher than those in Table 4.4. The higher scores could be caused by three factors: (1) in the formal test naive listeners were trying to take a test that requires phonetically trained people, (2) the phonetically trained people had more experience with synthetic speech and (3) CVC gives more clues than CV or VC, although the listeners were not told that the C's were the same.

Using phonetically trained subjects it can be seen that /v/ scores improved, particularly with other vowels. Also the scores of /f/ with /a/ improved. However /θ/ and /ð/ were still not synthesized successfully.

Figure 4.10 Timing used in pilot fricative-vowel test CVC. For friction and voicing, amplitude of source is represented. For configuration, horizontal lines indicate the vocal tract was stationary, diagonal lines indicate the vocal tract was moving.

Table 4.5 Results of pilot test with fricative-vowel combinations CVC.
[Entries are listed in the order extracted, in four blocks; the vowel labels and the row and column positions are not recoverable here.]
Block 1: 88 75 50 12 12 25 25 75 25 75 63 63 50 25 25 100 25 25 12 50 100
Block 2: 75 88 88 25 25 12 100 37 63 38 12 37 100 100 100
Block 3: 100 100 75 25 100 25 75 100 88 25 25 12 50 100 12 88
Block 4: 75 63 63 25 25 37 12 25 100 25 75 75 50 25 25 100 25 75

4.9 FREQUENCY RESPONSE OF THE FRICATIVE CONFIGURATIONS

The frequency response curves were measured by inserting a sine wave of various frequencies in place of the noise source and measuring the output signal's amplitude. The curves obtained are shown in Fig. 4.7(b). If these curves are compared with the curves obtained by Hughes and Halle in 1956 [44], it can be seen that they are similar in general appearance. The data of Hughes and Halle is for many contexts; however, there are some general trends. For /f/ there are quite a few low frequency peaks and then a high frequency peak between 8 kHz and 9 kHz; similarly the synthesized /f/ had low frequency peaks at 415 Hz and 1295 Hz with a lot of smaller peaks and a high frequency peak at 8 kHz.
If the average of the first formant for the 3 speakers and 6 contexts for /ʃ/ is taken from Hughes and Halle it is 2683 Hz, with a range from 2016 Hz to 3250 Hz; the first formant of the synthesized /ʃ/ was 2603 Hz. Also most of the Hughes and Halle responses have a double peak between 5000 Hz and 7000 Hz and the synthesized /ʃ/ had a peak at 6572 Hz.

The average first formant from Hughes and Halle for /s/ is 4550 Hz, with a range from 3666 Hz to 5533 Hz. The first formant from the synthesized /s/ was 4288 Hz. The Hughes and Halle response has the second formant between 8000 Hz and 9000 Hz while the synthesized version has a second formant at 9680 Hz.

The static responses of the fricatives /f, ʃ, s/ were fairly close to the expected responses, so any difficulty in synthesizing fricatives in combination with vowels is probably a result of the parameters that join them together. It can be seen that timing is a critical parameter, but simple rules can be generated that give fair recognition rates. Timing 2 gave the best results for all fricatives except /s/, for which timing 3 gave more unvoiced responses. In a synthesis by rule scheme, context dependent rules would have to be used. The results discussed in the previous sections show some target configurations and timing rules that can be used for synthesis of vowel-fricative combinations.

5. SYNTHESIS OF THE STOPS

5.1 INTRODUCTION TO THE STOPS

The last phoneme class synthesized was the stops, including the unvoiced stops /p, t, k/ and the voiced stops /b, d, g/.
Stops were studied last because they were considered the most difficult sounds to synthesize; it was thought that experience gained in synthesizing vowels and fricatives would be useful in attempting stop synthesis.

Lack of precise articulatory data greatly hampered stop synthesis. There have been some studies of the voiced-voiceless distinction done at Haskins Laboratory [45,46,47,48,49] and some studies of transitional cues in the acoustical domain by Stevens and House in 1956 [50], Harris et al in 1958 [51], Wang in 1959 [52] and Delattre et al in 1955 [53]. However, there has been little work done on the movement of the articulators during production of stops, and little work done on synthesis using vocal tract analogs. Two sets of data were found useful for the present work. The work of Perkell in 1964 [39] was studied, although the data was not intended for use in an absolute form and hence no absolute dimensions were given. This data was used to indicate approximate articulatory timing patterns and was used in generating some initial vocal tract configurations for stops. However, when these configurations were given in an informal listening test, they were found unsatisfactory. The second set of data that was useful was that of Fujimura in 1961 [54] on bilabial stops.
This data showed lip movement in the production of /p/ and /b/. Unfortunately, the data did not show the configuration for the rest of the vocal tract or for other sounds. The data of Houde in 1969 [55] was not used because it showed only the motion of the tongue body without showing the motion of the rest of the vocal tract. It was found too difficult to correlate this data with either the data of Perkell or Fujimura because of differences in speakers and contexts.

Most of the vocal tract configurations and timing parameters for synthesizing the stops with MINERVA were arrived at by trial and error. A configuration and timing would be selected, synthesized, and given in informal listening tests. The results would be used to suggest modifications which would then be evaluated by more informal listening tests. This process continued until changes did not seem to improve the synthesized stops.

5.2 METHODS OF CLOSING THE VOCAL TRACT

The minimum cylinder radius of 0.14 cm in the front sections and 0.28 cm for the middle and rear sections made complete closure of the simulated vocal tract impossible. Therefore, 3 different methods of simulating closure were tried. One method was to physically open circuit the electrical analog by opening a relay contact that was inserted in series with the inductor in the section where closure was to be simulated. Another method was to electrically open circuit the tract with a field effect transistor (FET) switch at the point of closure. With the FET switch the rate of opening could be controlled. A third method was to merely simulate the smallest cylinder possible and consider the tract completely closed.
Because the relay added audible clicks that were large enough to saturate the operational amplifiers in MINERVA, this first method was discarded. The FET switch had a high resistance which substantially lowered the effective Q of the series inductor. To lower this resistance the FET switch was put on the primary winding of a 4:1 transformer. The secondary winding was inserted in series with an inductor so that, when the switch was closed connecting the tract, the effective resistance of the switch was 1/16 of the resistance of the FET switch (the resistance seen through the transformer scales with the square of the 4:1 turns ratio). Although this method prevented the effective Q of the section from being lowered, the simulated inductors and capacitors sometimes exhibited instability during closure. It was found that the method of making the cylinder radius as small as possible gave results that were as good as those obtained when the tract was electrically opened. This method was finally adopted, since it caused no instabilities.

5.3 TARGET CONFIGURATIONS FOR THE STOPS

In synthesizing stops, the vocal tract analog must be put in a configuration with closure at some point. The pressure which builds up behind a closure must be simulated and then the configuration of the vocal tract analog must move to the target configuration of the following phoneme. The placement of the closure, the initial configuration of the tract, the rate of movement of the various sections after release, and the placement of the noise source to simulate friction are all important parameters in determining the character of the stop. In stops, more than in any other phoneme class, these parameters are context dependent.

Several starting configurations were tried based on data from Fant [26], Perkell [38], Flanagan et al [5] and Potter et al [56]. These configurations were context independent.
In addition some context dependent configurations were generated by taking the target configuration for a vowel and modifying it by closing one or two adjacent sections either at the front, middle or rear of the vocal tract. These sections were made to simulate the smallest cylinder radius and the remaining parts of the vocal tract were left in the same position as they were for the vowel. This configuration was called a context dependent stop configuration and it was only used as a starting configuration for a stop when it was followed by the vowel from which the configuration was generated.

5.4 METHODS OF PRODUCING INITIAL STOPS

The synthesis of stops in the initial position used the following time sequence. First the vocal tract analog was moved to the appropriate stop configuration an arbitrary time ahead of the release of the stop. Then the noise source was turned on to simulate the pressure behind the closure and, if the stop was to be aspirated, the aspiration was also turned on. At the time of release, the vocal tract started to move toward the configuration of the phoneme following the stop. The time that voicing started depended on whether a voiced or unvoiced stop was being synthesized.

The configuration used for the stop, the place of inserting the noise, the rate of decay of the noise burst, whether or not aspiration was added, the rate of change from the stop configuration to the vowel configuration, and the time that the voicing started are the parameters that were varied in the formal listening tests described in the next section.

Many other parameters in addition to those above were varied in informal listening tests. Informal tests used two or three subjects, one at a time, with the subject hearing the direct output of MINERVA. The set of stimuli was presented in random order and the subject could hear each stimulus as many times as he desired before making a response.
All configurations and timing parameters that did not produce stop-like sounds or did not produce sounds that resembled the English stops were discarded after the informal listening tests. For this reason the formal listening tests show some incompleteness in values of parameters. For example the test used transition times of 50 and 75 msec to move from one configuration to the following one when the closure was in front or rear and times of 20 and 50 msec when the closure was in the middle. This was done because the informal tests had shown the /t/ sounds were more recognizable when the transition was faster. There are risks in not including all values of all variables in the formal listening tests, but these risks must be traded off against the risk of fatigue on the part of the subjects if too many similar sounding stimuli are presented.

5.5 STOP-VOWEL TEST I

Stops were first synthesized in MINERVA in CV sequences. The results of some informal listening tests indicated that the configurations from Potter et al and the context dependent configurations described in section 5.3 should be put into a formal listening test. The listening test used CV sequences with /i/ or /a/ as the vowel. Nine different starting configurations for the stops were used. Three of these configurations come from measuring the diameters from tracings of the articulators for /p/, /t/ and /k/ shown in Potter et al. The sections of the vocal tract analog were put in positions which best fit the measured diameters. Another 3 starting configurations were generated by taking the vowel /i/ and closing off the front, middle or rear sections as described in section 5.3.
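The construction of a context dependent stop configuration described in section 5.3 can be sketched as follows. This is an illustration under stated assumptions: the number of sections, the vowel radii, the choice of which indices count as front sections, and the function name are all hypothetical; only the minimum radii of 0.14 cm and 0.28 cm are taken from section 5.2.

```python
# Minimum cylinder radii quoted in section 5.2 (cm).
MIN_RADIUS_FRONT_CM = 0.14
MIN_RADIUS_MID_REAR_CM = 0.28


def context_dependent_stop(vowel_radii, closed_sections, front_sections):
    """Copy a vowel configuration and close one or two adjacent sections.

    vowel_radii     -- section radii in cm (hypothetical ordering, glottis first)
    closed_sections -- indices of the one or two adjacent sections to close
    front_sections  -- indices treated as front sections (an assumption)
    """
    closed = sorted(closed_sections)
    if len(closed) not in (1, 2):
        raise ValueError("close one or two sections")
    if len(closed) == 2 and closed[1] - closed[0] != 1:
        raise ValueError("closed sections must be adjacent")
    stop_radii = list(vowel_radii)  # all other sections stay as in the vowel
    for index in closed:
        if index in front_sections:
            stop_radii[index] = MIN_RADIUS_FRONT_CM
        else:
            stop_radii[index] = MIN_RADIUS_MID_REAR_CM
    return stop_radii


# Hypothetical 10-section vowel configuration (radii in cm).
vowel_i = [0.5, 0.6, 0.8, 1.0, 1.1, 0.9, 0.7, 0.5, 0.4, 0.6]
front_stop = context_dependent_stop(vowel_i, closed_sections=[8, 9],
                                    front_sections={8, 9})
print(front_stop[-2:])  # [0.14, 0.14]
```

Because only the closed sections differ from the vowel, the movement after release is confined to those sections, which is why such a configuration is tied to the vowel it was generated from.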
"Rear section" is relative; in fact it refers to sections 7 and 8, which are about midway between the simulated glottis and lips. These 3 configurations were only used when the vowel /i/ followed the stop. The last 3 starting configurations were similarly generated from the vowel /a/. The 9 configurations are shown in Fig. 5.1.

Figure 5.1 Configurations used to simulate the stops (columns: front, mid and rear constriction; rows: fixed configuration, context dependent configuration from /i/, context dependent configuration from /a/; horizontal axis: distance from glottis, cm). The first 3 configurations were from Potter et al [56], the next 3 were generated from /i/, and the last 3 were generated from /a/.

Figure 5.2 Control timing used in stop-vowel test I (traces: friction with F decay and S decay, aspiration for aspirated and non-aspirated stimuli, voicing, and configuration with a transition of 20, 50 or 75 ms). For friction, aspiration, and voicing, amplitude of source is represented. For configuration, horizontal lines indicate the vocal tract was stationary, diagonal lines indicate the vocal tract was moving.

The synthesized nonsense words used two different rates of decay of the inserted noise, as illustrated in Fig. 5.2. In one case the amplitude decayed from maximum to zero in 6 msec. In the second case it decayed to zero in 50 msec. Noise was inserted at one of three places, shown by the arrows in the plots of the configurations in Fig. 5.1. The position of insertion of the noise source was one of the parameters studied in the test. It could be added in front for front closure, in the middle for middle closure and either at the rear or in the middle for rear closure.
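The two noise-burst decay rates can be pictured with a small sketch. A linear decay is assumed here purely for illustration; the text states only that the amplitude fell from maximum to zero in 6 msec or in 50 msec. The function name, the normalised peak amplitude and the time origin (the start of the decay) are hypothetical.

```python
FAST_DECAY_MS = 6   # first case in the text
SLOW_DECAY_MS = 50  # second case in the text


def noise_amplitude(t_ms, decay_ms, peak=1.0):
    """Noise amplitude t_ms after the decay starts, assuming a linear ramp."""
    if decay_ms <= 0:
        raise ValueError("decay_ms must be positive")
    if t_ms <= 0:
        return peak
    if t_ms >= decay_ms:
        return 0.0
    return peak * (1.0 - t_ms / decay_ms)


for t in (0, 3, 6, 25, 50):
    print(t, noise_amplitude(t, FAST_DECAY_MS), noise_amplitude(t, SLOW_DECAY_MS))
```

With these assumptions the fast burst is over after 6 msec, while the slow burst is still at half amplitude 25 msec into the decay.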
Voicing was turned on at one of two times, either at the time of release of the stop or 60 msec after release. The voicing rose to its maximum value in 50 msec. Aspiration, which was simulated by injecting noise at the glottis, was added either as shown in Fig. 5.2 or not at all. Aspiration was never added when voicing was started at the time of release.

Two rates of vocal tract movement were used for each different stop configuration. When the starting configuration had the closure in the front or rear, the time taken to move from the starting configuration to the following vowel configuration was either 50 or 75 msec. When the starting configuration had the closure in the middle, the time taken to move to the following vowel was either 50 or 20 msec. The different rates for different positions of closure were due to the fact that middle closure simulated closure by the tongue tip, which has a small inertia and moves quickly.

In summary there were 192 stimuli. Eight initial stop configurations were used with each of two vowels, giving 16 combinations. A choice of either voicing, noise and no aspiration; noise and no aspiration; or noise and aspiration gave three combinations of excitation.
In addition the noise source's amplitude had two rates of decay and the movement of the vocal tract had two different rates, so that 16 x 3 x 2 x 2 = 192. The 192 stimuli were randomly ordered to yield three tests of 64 stimuli per test.

The subjects were all male graduate students between the ages of 22 and 29 whose native language was English. They all reported no hearing abnormalities. The three tests were given at the same time on consecutive days. The tests were given in a quiet room using a Scully model 280 tape recorder and Sharpe HA10 earphones. Five of the subjects were classified as experienced as they had taken part in some informal listening tests. The other five subjects were classified as naive in that they had not heard synthesized speech before. The subjects were given the following instructions for Test 1:

You are about to hear synthesized nonsense words. Each word is given twice in succession followed by an 8 second space. Respond once for each pair. Choose the closest one of the following for each pair or none.

pea   bee            paw   baw
tea   dee            taw   daw
key   gee (hard g)   kaw   gaw (hard g)

The first three are practice pairs and should be answered in A, B, and C.

The instructions for Tests 2 and 3 were the same except there were no practice pairs.

5.6 RESULTS OF STOP-VOWEL TEST I

The results of stop-vowel test I are shown in Table 5.1. A response of none, a voiced response to an unvoiced stimulus, or an unvoiced response to a voiced stimulus is not shown. No significant differences in responses were found between naive and experienced listeners.
The characteristics of the stimuli are shown in the left column and at the top of Table 5.1 and the responses are shown across the top. Table 5.1(a) shows the results when the stimuli contained the vowel /i/ and Table 5.1(b) shows the results when the stimuli contained the vowel /a/. The data from Potter et al is labelled fixed because the same configuration was used with both vowels. The configurations that were generated from the vowels as described previously are labelled context dependent.

Table 5.1 (a) Results of stop-vowel test I for vowel /i/. Stimuli characteristics are listed down left side, responses are listed across the top. Unvoiced responses to voiced stimuli and voiced responses to unvoiced stimuli are not shown. Configuration characteristics refer to Fig. 5.1. The columns showing 50, 75, 20, F, and S refer to symbols in Fig. 5.2.
Column groups: Voiced (b d g), No Voicing (p t k), Aspirated (p t k). Row labels within each configuration: 50 F, 75 F, 50 S, 75 S (50 F, 20 F, 50 S, 20 S for mid constriction).
[Entries are listed in the order extracted; row and column alignment is not recoverable here.]
Front Constriction, Fixed, Front Noise: 7 0 1 4 0 4 7 0 1 4 1 2 10 0 0 8 0 1 10 0 0 9 0 0 10 0 0 10 0 0 10 0 0 9 1 0
Front Constriction, Context Dependent, Front Noise: 7 2 1 10 0 0 6 2 1 7 0 3 7 0 3 6 1 2 2 8 0 8 0 1 3 6 1 3 7 0 1 8 1 3 7 0
Mid Constriction, Fixed, Mid Noise: 2 0 1 5 1 1 2 1 3 0 9 1 3 1 4 4 3 1 0 8 1 0 9 1 0 9 1 0 10 0 0 9 1 0 10 0
Mid Constriction, Context Dependent, Mid Noise: 0 9 1 0 9 1 0 9 1 0 9 1 0 7 3 0 4 0 0 9 0 0 9 1 0 9 1 3 1 2 0 10 0 0 10 0
Rear Constriction, Fixed, Rear Noise: 1 1 4 1 2 2 1 0 7 1 0 4 5 0 2 3 1 2 5 0 2 3 0 1 4 1 3 5 2 1 4 1 2 2 1 0
Rear Constriction, Context Dependent, Rear Noise: 0 10 0 0 10 0 0 9 0 0 8 0 0 10 0 0 10 0 0 10 0 0 9 1 0 10 0 0 10 0 0 10 0 0 10 0
Rear Constriction, Fixed, Mid Noise: 3 0 2 0 0 6 3 1 2 1 0 2 7 1 0 6 0 0 7 1 1 4 0 1 4 5 1 6 0 2 4 2 4 3 4 2
Rear Constriction, Context Dependent, Mid Noise: 0 9 0 0 10 0 0 10 0 0 10 0 0 10 0 0 10 0 0 10 0 0 10 0 0 10 0 0 10 0 0 10 0 0 10 0

Table 5.1 (b) Same as (a) except the vowel /a/ was used.
Column groups: Voiced (b d g), No Voicing (p t k), Aspirated (p t k). Row labels as in (a).
Front Constriction, Fixed, Front Noise: 0 4 6 1 3 8 1 4 5 1 4 4 6 0 4 1 0 4 1 0 9 1 0 8 2 5 3 1 7 2 1 6 3 2 6 2
Front Constriction, Context Dependent, Front Noise: 6 0 3 8 0 2 6 1 3 6 0 3 7 0 1 7 0 0 4 2 3 5 2 1 2 6 2 0 6 4 1 7 2 2 4 4
Mid Constriction, Fixed, Mid Noise: 0 7 3 0 6 4 0 7 2 0 3 3 1 6 2 3 6 1 0 8 2 0 8 2 0 8 2 0 8 1 0 7 2 0 9 1
Mid Constriction, Context Dependent, Mid Noise: 1 4 3 3 3 4 5 1 3 2 6 1 1 8 1 0 10 0 0 10 0 0 8 1 0 10 0 0 9 1 0 10 0 0 9 1
Rear Constriction, Fixed, Rear Noise: 0 0 10 0 0 10 0 0 9 2 0 8 0 0 9 0 0 9 0 0 9 0 0 10 0 1 9 1 1 8 0 5 4 0 0 10
Rear Constriction, Context Dependent, Rear Noise: 0 9 1 0 7 1 0 8 2 0 4 1 0 10 0 0 10 0 0 8 2 0 9 1 0 9 1 0 4 6 0 8 2 0 8 2
Rear Constriction, Fixed, Mid Noise: 0 0 9 0 0 10 0 2 8 0 0 9 3 2 4 0 3 6 1 0 8 0 1 9 0 3 7 0 1 9 0 3 8 0 0 10
Rear Constriction, Context Dependent, Mid Noise: 0 9 1 0 5 1 0 9 1 0 8 1 0 9 1 0 8 1 0 8 1 0 8 1 0 9 1 0 8 1 0 9 1 0 8 1

The results show that MINERVA could synthesize a /t/ in the initial position with /i/ and /a/ following. If the context dependent configuration was used for the stop, then stimuli with the constriction at either the middle or rear and the noise at either the middle or rear had a large percentage of /ti/ and /ta/ responses in their respective categories.
For /t/'s voiced counterpart /d/ it can be seen that there was a large percentage of /d/ responses when voicing was added to the above stimuli. One exception was when the initial stop configuration used middle closure and middle noise with the vowel /i/; in this case the responses were fairly evenly distributed between /b, d, g/.

As can be seen from the results, stimuli consisting of fixed configurations and front closure have large percentages of /pi/ responses and slightly lower percentages of /bi/ responses, but the responses for the same configuration with /a/ are not /pa/ or /ba/. The context dependent front closure stimuli had larger percentages of /bi/, /ba/, and /pa/ responses than the fixed configuration did. These results indicate that /p/ is highly context dependent when used with the vowels /i/ and /a/.

It can be seen that /ki/ and /gi/ were seldom given as responses and hence were not synthesized successfully. The largest percentage of responses in these categories was not much above random guessing. Sequences /ka/ and /ga/ have much higher percentages of responses, with the fixed configuration giving /ka/ and /ga/ percentages of 90% and 100% respectively. If the results for rear closure are studied it can be seen that the context dependent configurations have high percentages of responses in the /ka/ and /ga/ categories, but when used with the vowel /i/ the rates were distributed fairly evenly. These results would suggest that the reason /ki/ and /gi/ were seldom given as responses was that the right starting configuration was not used. Results using different vowels are discussed in section 5.10.

In general the differences in rates of decay of noise and differences in rate of movement of the vocal tract configuration did not change the results significantly, although in a few isolated cases these parameters seem to be important.
One such case was a context dependent frontal closure with /i/; a significantly higher number of /pi/ responses resulted when the vocal tract configuration changed in 75 msec and the noise decayed to zero amplitude in 6 msec. A few other isolated cases of timing parameters making differences can be found by examining Table 5.1.

The addition of aspiration (noise at the glottis) had no significant effect on the results. A minor effect was that there were more /t/ responses than there were in the non-aspirated case. It should be noted that subjects did not distinguish between aspirated stops and non-aspirated stops. The term aspirated in these results refers to the stimuli, not to the responses.

5.7 FRICATIVE-STOP-VOWEL TESTS

The second set of tests on synthesized stops used a C1C2V combination. The consonant C1 was the unvoiced fricative /s/, C2 was an unvoiced stop, and V was either the vowel /i/ or /a/. The fricative /s/ was used because it is the only fricative that precedes a stop in English. The voiced stops were not tested because there was no facility in the analog to simulate the low-frequency hum that precedes a voiced stop, as described in Chapter 2. The vocal tract analog model breaks down because in the human speaker the sound radiates through the walls of the vocal tract, but there is no electrical analog of this effect. Attempts to simulate the voicing by not completely closing the tract failed to produce the desired effect.

Figure 5.3 Control timing used in fricative-stop-vowel test (traces: friction with decay, aspiration for aspirated and non-aspirated stimuli, voicing, and configuration with a transition of 20, 50 or 75 ms; time marker 100 ms). For friction, aspiration and voicing, amplitude of source is represented. For configuration, horizontal lines indicate the vocal tract was stationary, diagonal lines indicate the vocal tract was moving.
The timing parameters used in this test are shown in Fig. 5.3. The parameters that were varied in stop-vowel test I, described in the previous section, were also varied in this test. However, the combination of rear closure and middle friction was not used, because the previous test showed that the results were not significantly different from the combination of rear closure and rear friction. Because voiced stops were not being synthesized, the case of early turn on of the voicing source was not used. A parameter that was added to this test that was not needed with stops in initial position was the duration of the silent interval between the end of the noise in /s/ and the release of the stop. Two values, 70 msec and 150 msec, were used for the silent period, as indicated by the data of Ito in 1971 [34].

Using 6 configurations for the stop, 2 vowels, 2 rates of decay for the noise, 2 rates of change during release, aspirated or nonaspirated excitation and 2 durations of silent interval gave a total of 192 stimuli. The 192 stimuli were randomly ordered to yield 3 tests of 64 stimuli per test.

The listening tests were given to the same ten subjects that had taken stop-vowel test I a week earlier. The equipment and the testing conditions were the same. The following instructions were written at the top of the response form for test 1:

You are about to hear synthesized nonsense words. Each word is given twice in succession followed by an 8 second space. Respond once for each pair. Choose the closest one of the following for each pair or none.

spee   spa
stee   sta
skee   ska

The first three are practice pairs and should be answered in A, B, and C.
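The stimulus count for the fricative-stop-vowel test can be checked by enumerating the factorial design. The sketch below rests on several assumptions: both aspirated and non-aspirated excitation are included (as the column headings of Table 5.2 suggest), the noise decay times are taken to be the 6 and 50 msec values of stop-vowel test I, the transition times per place of closure follow section 5.5, and all names are hypothetical.

```python
import itertools

# Stop configurations as labelled in the rows of Table 5.2: (closure, kind).
CONFIGURATIONS = [
    ("front", "fixed"), ("front", "context dependent"),
    ("mid", "fixed"), ("mid", "context dependent"),
    ("rear", "fixed"), ("rear", "context dependent"),
]
VOWELS = ["i", "a"]
NOISE_DECAY_MS = [6, 50]        # assumed: same values as stop-vowel test I
ASPIRATED = [False, True]       # assumed from the Table 5.2 column headings
SILENT_INTERVAL_MS = [70, 150]


def transition_times_ms(closure):
    """Two transition times per place of closure; the middle closure is faster."""
    return (20, 50) if closure == "mid" else (50, 75)


def enumerate_stimuli():
    """Return one tuple per stimulus in the full factorial design."""
    stimuli = []
    for (closure, kind), vowel, decay, aspirated, pause in itertools.product(
            CONFIGURATIONS, VOWELS, NOISE_DECAY_MS, ASPIRATED, SILENT_INTERVAL_MS):
        for transition in transition_times_ms(closure):
            stimuli.append((closure, kind, vowel, decay, transition, aspirated, pause))
    return stimuli


stimuli = enumerate_stimuli()
print(len(stimuli))       # 192
print(len(stimuli) // 3)  # 64 stimuli per test
```

The transition time is handled in an inner loop because its two values depend on the place of closure, so it cannot be a plain factor in the product.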
Tests 2 and 3 were given at the same time on consecutive days. The instructions were the same except there were no practice pairs.

5.8 RESULTS OF FRICATIVE-STOP-VOWEL TESTS

The results of the tests are shown in Table 5.2.

Table 5.2(a) Results of fricative-stop-vowel test with vowel /i/. Stimuli characteristics are listed down left side, responses are listed across the top. Unvoiced responses to voiced stimuli and voiced responses to unvoiced stimuli are not shown. Configuration characteristics refer to Fig. 5.1. The columns showing 50, 75, 20, F and S refer to symbols in Fig. 5.3.

Response columns: NON ASPIRATED (short pause p t k; long pause p t k), ASPIRATED (short pause p t k; long pause p t k).

Front constriction, fixed, front noise (50 F, 75 F, 50 S, 75 S):
  3 0 7   2 0 8   1 0 9   0 0 10   5 0 5   2 0 7   1 0 9   0 0 10   0 1 9   1 1 8   0 0 9   0 1 9   2 0 8   0 0 8   0 0 10   0 0 10
Front constriction, context dependent, front noise (50 F, 75 F, 50 S, 75 S):
  1 0 9   1 0 9   0 0 10   0 0 10   3 0 7   0 0 10   1 0 9   0 0 10   0 4 6   0 8 2   0 5 5   0 2 8   0 4 6   0 7 3   0 2 8   0 3 7
Mid constriction, fixed, mid noise (50 F, 20 F, 50 S, 20 S):
  5 0 5   2 4 3   2 3 5   0 4 6   5 0 4   9 1 0   1 3 5   0 4 6   0 5 5   0 7 2   1 3 6   0 5 5   1 3 6   1 3 6   0 3 6   3 1 5
Mid constriction, context dependent, mid noise (50 F, 20 F, 50 S, 20 S):
  8 2 0   6 2 2   5 2 2   3 4 3   8 0 2   9 0 1   8 2 0   1 4 5   0 9 1   0 7 3   1 8 1   0 9 1   0 8 1   0 4 5   1 7 2   3 3 4
Rear constriction, fixed, rear noise (50 F, 75 F, 50 S, 75 S):
  0 0 9   0 0 10   0 0 10   0 0 10   1 0 9   0 0 10   0 0 10   0 0 10   0 0 10   0 0 10   0 0 10   0 0 10   0 0 10   0 0 10   0 0 10   0 0 10
Rear constriction, context dependent, rear noise (50 F, 75 F, 50 S, 75 S):
  0 6 4   0 2 8   0 0 10   0 1 9   1 3 6   0 3 7   0 0 10   0 0 10   0 0 10   0 1 9   0 1 9   0 1 9   0 2 8   0 2 8   0 1 8   0 2 8

Table 5.2(b) Same as Table 5.2(a) except the vowel /a/ was used.

Front constriction, fixed, front noise (50 F, 75 F, 50 S, 75 S):
  8 0 2   6 0 3   6 0 4   7 1 2   6 0 3   7 1 2   7 0 3   6 0 4   5 0 4   4 0 6   8 0 2   9 0 1   10 0 0   9 0 1   6 1 3   10 0 0
Front constriction, context dependent, front noise (50 F, 75 F, 50 S, 75 S):
  2 0 8   3 0 7   1 0 8   2 0 8   4 0 6   3 0 7   2 2 6   5 0 4   2 3 4   4 3 3   1 1 8   3 1 5   1 5 4   4 4 2   0 2 4   3 5 1
Mid constriction, fixed, mid noise (50 F, 20 F, 50 S, 20 S):
  0 4 6   0 5 5   0 3 7   0 6 4   3 1 6   0 5 4   0 6 3   0 8 2   0 7 3   0 7 3   0 7 3   0 9 1   1 7 1   0 8 1   2 4 3   0 8 1
Mid constriction, context dependent, mid noise (50 F, 20 F, 50 S, 20 S):
  0 4 6   0 5 5   0 5 5   0 3 7   0 5 5   0 6 4   0 0 10   0 5 4   0 8 2   0 8 1   0 8 2   1 6 2   0 7 3   0 7 3   0 4 6   0 8 2
Rear constriction, fixed, rear noise (50 F, 75 F, 50 S, 75 S):
  2 0 6   2 1 7   5 0 4   3 0 7   5 0 5   2 1 7   6 0 4   2 0 7   3 2 5   3 0 6   3 1 6   3 0 7   4 0 5   5 0 4   3 0 5   4 0 6
Rear constriction, context dependent, rear noise (50 F, 75 F, 50 S, 75 S):
  0 4 5   0 7 3   0 3 7   0 4 6   0 6 4   0 6 4   0 8 2   0 7 3   0 7 2   0 8 2   0 6 3   0 8 2   0 7 2   0 6 3   0 8 2   0 5 5

It can be seen that one stimulus had 100% /ski/ responses, whereas /ki/ had a low percentage of responses in the previous test. The combination /sti/ had a low number of responses while /ti/ had a large percentage of responses in the previous test. The combination /spa/ had a low percentage response to all stimuli and /sta/ had a large percentage response only to a few stimuli. These results show that the stops are highly context dependent and that different configurations and timing parameters might have to be used for each different stop in each different context.

The differences in percentage responses between a short silent interval and a long silent interval were not significant, with two exceptions. One is the case of aspirated fixed front constriction with /i/, where the long silent interval gave more /p/ responses.
The other is the case of the context dependent rear constriction with /i/, where the long silent interval gave more /t/ responses and fewer /k/ responses. There were no significant changes in response rates for the different decay rates of the noise, or for different rates of change of the configurations. Adding aspiration gave significantly more /t/ responses. It can be seen from the stop-vowel and the fricative-stop-vowel tests that the stop configuration was a very critical parameter while the timing parameters were less critical.

5.9 STOP-VOWEL TEST II

The next test that was done using synthesized stops was a CV test with a stop in initial position followed by one of the four vowels /ɪ, e, æ, ɔ/. Since the rate of decay of the injected noise and the rate of moving from one target configuration to the next were not critical in stop-vowel test I, it was decided to use only one set of values. The rates used were 50 msec for moving from the stop configuration to the vowel configuration and 6 msec for the rate of decay of the noise. Two configurations were used for each place of closure: one from Potter et al. [56] and one context dependent configuration generated from the vowel that followed the stop. There was a choice of three parameters in excitation: voiced with no aspiration, unvoiced with no aspiration, and unvoiced with aspiration. These parameters gave a total of 72 stimuli.

The stimuli were put in random order, synthesized and recorded on a Scully model 280 tape recorder. The test was played back to the subjects on the same tape recorder using Sharpe HA10 earphones. The subjects were the same ones that had participated in the other stop listening tests 2 weeks earlier. The subjects were given the following instructions:

    You are about to hear synthesized nonsense words. Each word is given twice in succession followed by an 8 second space. Respond once for each pair. The words are a consonant followed by a vowel. From the following list choose the consonant you think was synthesized.

    p   b   t   d   k   g (hard g)   none

5.10 RESULTS OF STOP-VOWEL TEST II

The results of stop-vowel test II are shown in Table 5.3 along with the results that used the same parameters for /i/ and /a/ in stop-vowel test I. The responses of none and an unvoiced response to a voiced stimulus are not shown. It can be seen that some stimuli have a large percentage of responses in a given category while others have about an equal number of responses in all categories. Table 5.4 shows the largest percentage of responses for each combination. It can be seen that /d/ and /k/ are the hardest to synthesize using the described schemes.

Table 5.3 Results of stop-vowel test II. Characteristics of stimuli are shown down left side; C denotes constriction, F denotes fixed, CD denotes context dependent, and N is an abbreviation for noise, as shown in Fig. 5.1. Responses are shown across the top.

                                Voiced          No voicing      Aspirated
Vowel  Stimulus                 b    d    g     p    t    k     p    t    k
ɔ      Front C, F,  Front N     80   0    20    10   0    0     90   0    0
       Front C, CD, Front N     90   0    0     20   0    0     80   0    0
       Mid C,   F,  Mid N       0    60   30    30   0    0     10   40   0
       Mid C,   CD, Mid N       60   10   30    20   60   0     0    60   40
       Rear C,  F,  Rear N      0    0    100   30   0    10    20   0    80
       Rear C,  CD, Rear N      10   0    80    20   0    80    10   0    90
æ      Front C, F,  Front N     90   0    0     60   0    10    30   70   0
       Front C, CD, Front N     90   0    0     60   0    0     40   40   20
       Mid C,   F,  Mid N       60   40   0     40   30   0     10   70   10
       Mid C,   CD, Mid N       70   10   10    20   0    0     50   40   10
       Rear C,  F,  Rear N      40   0    50    70   0    0     20   40   40
       Rear C,  CD, Rear N      0    10   90    0    60   40    10   50   40
e      Front C, F,  Front N     40   10   20    80   0    10    40   60   0
       Front C, CD, Front N     90   0    0     40   0    10    50   20   20
       Mid C,   F,  Mid N       10   40   20    40   10   20    20   70   10
       Mid C,   CD, Mid N       80   0    10    50   0    20    30   20   40
       Rear C,  F,  Rear N      0    10   90    60   0    20    10   10   80
       Rear C,  CD, Rear N      0    20   80    0    50   50    10   20   70
ɪ      Front C, F,  Front N     80   0    0     50   20   10    80   20   0
       Front C, CD, Front N     90   0    0     70   10   0     10   50   20
       Mid C,   F,  Mid N       70   0    10    20   40   10    10   90   0
       Mid C,   CD, Mid N       100  0    0     50   0    0     80   10   10
       Rear C,  F,  Rear N      50   0    30    80   0    10    60   20   20
       Rear C,  CD, Rear N      0    20   80    0    80   10    0    60   40
a      Front C, F,  Front N     0    40   60    60   0    40    20   50   30
       Front C, CD, Front N     60   0    30    70   0    10    20   60   20
       Mid C,   F,  Mid N       0    70   30    10   60   20    0    80   20
       Mid C,   CD, Mid N       10   40   30    10   80   10    0    100  0
       Rear C,  F,  Rear N      0    0    100   0    0    90    0    10   90
       Rear C,  CD, Rear N      0    90   10    0    90   10    0    80   20
i      Front C, F,  Front N     70   0    10    100  0    0     100  0    0
       Front C, CD, Front N     70   20   10    70   0    30    30   60   10
       Mid C,   F,  Mid N       20   0    10    30   10   40    0    90   10
       Mid C,   CD, Mid N       0    90   10    0    70   30    0    90   10
       Rear C,  F,  Rear N      10   10   40    50   0    20    40   10   30
       Rear C,  CD, Rear N      0    100  0     0    100  0     0    100  0

Table 5.4 The maximum response rate for each stop with various vowels.

Context   b     d     g     p     t     k
i         70    100   40    100   100   40
ɪ         90    30    80    80    80    40
e         90    40    40    80    70    80
æ         90    40    90    60    70    40
a         60    90    100   70    100   90
ɔ         90    60    100   90    60    90
s + i     -     -     -     100   80    100
s + a     -     -     -     90    80    100

5.11 SPECTROGRAPHIC STUDY OF THE VOICED STOPS

Since some stop-vowel combinations did not have a high percentage of responses to any stimuli, it was decided to take spectrograms of the synthesized voiced stops and to study the formant patterns. In Fig. 5.4 are synthetic spectrograms showing second formant transitions that produce voiced stops with various vowels. Shown are a set from Delattre et al. [53] in 1955 and a set from the synthesized voiced stops.
The spectrograms confirm the results of the listening test. For example, looking at the second formant for /ba/, it would be expected that the context dependent configuration would give better results than the fixed configuration because the second formant comes from below in the context dependent configuration. In fact the listening tests show better results in the context dependent case. There are numerous other examples of spectrograms that agree with the results of the listening test. The reason /gi/ was not successfully synthesized was that the second formant should have started above the final value. However, there are some exceptions to the agreement between spectrograms and results of the listening tests. For example, /ge/ and /gæ/ had a large number of responses and the formant did not start above the final value as it did in the data of Delattre et al. The failure to get many /d/ responses with /ɪ, e, æ/ would seem to be caused by the formant starting too far below the final value.

[Figure 5.4: spectrogram panels. Panel titles: Front Constriction, Fixed; Front Constriction, Context Dependent; Mid Constriction, Fixed; Mid Constriction, Context Dependent; Rear Constriction, Fixed; Rear Constriction, Context Dependent; Delattre et al.]

Figure 5.4 Synthetic spectrograms of stop-vowel combinations. The first 6 are taken from the synthesizer, the last 3 are reproductions from data of Delattre et al. Data on left side refer to configurations in Fig. 5.1.

5.12 SUMMARY OF THE STOP SYNTHESIS RESULTS

It can be concluded that the vocal tract analog was capable of synthesizing stops in the initial position and following the fricative /s/. To obtain all contexts with all stops, different starting configurations for each stop with each vowel would have to be used.
The timing and sequence of events was important but the rate of noise decay and the rate of change of the vocal tract were not critical. The fact that the vocal tract was not completely closed off did not preclude the successful synthesis of the stops.

6. THE ENGLISH WORD TESTS

6.1 INTRODUCTION TO THE ENGLISH WORD TESTS

The goal of virtually any speech synthesizer scheme is that it be understood by listeners. In some applications, listeners may have little or no opportunity to learn the accent of a particular synthesizer. In many other applications, however, such learning would be possible. It was felt that to evaluate MINERVA fairly, tests should determine how accurately synthetic speech could be understood on first hearing and how quickly listeners would become accustomed to the synthesizer's accent. Accordingly, 50 different English words were synthesized and presented to listeners whose responses indicated both intelligibility and naturalness of the words.

6.2 SYNTHESIS RULES FOR ENGLISH WORDS

The timing parameters for some simple English words were taken from spectrograms of human speech. These parameters were then modified by trial and error. For some of the first English words synthesized these parameters exhibited some general trends. On the basis of these trends the following timing rules were developed and used for all words.

DURATION* RULES

1. Changes in phoneme target configuration and source amplitude target configuration all begin at the same time except at the beginning of a word or when a stop is being synthesized. Opening of the vocal tract for a stop begins 10 msec after the source change begins.
2. Tense vowel duration equals 240 msec.
3. Lax vowel duration equals 180 msec.
4. Semi-vowel and fricative duration equals 125 msec.
5. Stop burst duration is 30 msec, followed by 30 msec of no source excitation. Non-initial stops are preceded by a 70 msec quiet interval.
6. Rules 2, 3, and 4 apply except when a voiced fricative-vowel sequence occurs, in which case the starting of the change of the configuration and the starting of the voicing are both delayed by 20 msec.
7. Aspiration durations and inflection durations equal voicing durations.

*DURATION of a phoneme target configuration or amplitude is the time difference between when the vocal tract or source starts to move toward a new value and the time it started to move toward the current value.

TRANSITION RULES

I Transition times for changing configuration
1. First entry is arbitrary; 50 msec was used.
2. Semi-vowel-vowel and fricative-vowel transitions require 50 msec.
3. Vowel-semi-vowel transitions require 100 msec.
4. Vowel-fricative transitions require 75 msec.

II Transition times for voicing amplitude
1. First entry in list is not used.
2. Final transition in any word requires 100 msec.
3. For voiced fricatives, 125 msec in initial transitions and 50 msec in final transitions are used, unless the fricative is the last phone in the word, in which case 100 msec is used for final transitions.
4. Initial and final transitions of voiced stops are 6 msec.
5. For all situations not covered by Rules 2, 3, and 4, 50 msec was used.

III Transition times for noise source
1. First entry not used.
2. Initial transition for fricative requires 125 msec.
3. Final transition for fricative is 40 msec unless the fricative is the last phone in the word, in which case 100 msec was used.
4. Initial and final transitions for stop bursts require 6 msec.

IV Transition times for aspiration source
1. First entry not used.
2. All other transitions for stops require 6 msec.

V Inflection
The voicing frequency increased from 100 to 140 Hz during a contiguous voiced portion of any word.

In the above rules vowel durations were found by studying those measured by House [57] in 1961 and Peterson and Lehiste [58] in 1960. Unvoiced fricative durations were the same as Rosen's. Voiced fricatives were assigned the same duration as the corresponding unvoiced fricative. Semi-vowel and stop durations were obtained initially from spectrograms and later modified by trial and error. Rule 6 of the duration rules was added to assist in preventing unvoiced fricatives being perceived as voiced fricatives in fricative-vowel sequences.

Initially, all transition times were 50 msec except for initial transitions, which were 125 msec in unvoiced fricatives. Changes from these initial choices were made on the basis of listening tests.

Excitation source amplitudes were obtained by examining the output of the synthesized speech with the speech amplitude meter described in Chapter 3 and comparing the results with the results obtained from human speech.

New words could be made directly from these rules. Over 60% of the words in the listening test described in the next section were made directly from these rules. The rules should be regarded as preliminary since some of the synthesized words were not identified with high accuracy, and would undoubtedly be more intelligible if modifications were made in their timing signals.

6.3 THE ENGLISH WORD TESTS

Subjective evaluation of MINERVA's capabilities resulted from five listening tests. Tape recorded synthesized words were presented to nine male graduate students and one male faculty member, all of whose native language was English and whose ages ranged from 21 to 31 years.
All listeners reported no hearing abnormalities. Five were classified as naive in that they had not heard synthesized speech before, and the remaining five were classified as experienced in that they had taken part in the stop tests described in Chapter 5. The experienced group had not heard the English words synthesized before. Subjects were tested in groups of 5 except for test 4, in which they were tested individually. They were tested in a quiet room using a Scully 280 tape recorder and Sharpe HA10 earphones.

All five tests consisted of the same fifty synthesized words in random order given 10 seconds apart. The order was different in each test. Test 1 was preceded by three practice words that were chosen at random from the test. The practice words enabled volume controls to be set. Test 1 was given in the morning, test 2 that afternoon and test 3 the following morning. A week later test 4 was given in the morning and test 5 the same afternoon. In test 4 the word was given, the subject wrote his response and then a card with the intended word written on it was turned over by the tester.

The instructions for test 1 were as follows:

    You are about to hear simple English words synthesized by a computer. Write down the word you think has been synthesized and evaluate the naturalness with a number between 1 and 5. 5 would correspond to completely natural speech, 1 to completely machine-like, unnatural speech. The first three are practice and should be answered in A, B, and C.

Tests 2 and 3 were the same except the words "The first three are practice and should be answered in A, B, and C." were omitted. For test 4 the instructions were:

    You are about to hear simple English words synthesized by a computer. Write down the word you think has been synthesized. After your response a card will be turned over showing the intended word.
The instructions for test 5 were the first two sentences of the instructions for test 4.

6.4 RESULTS OF THE ENGLISH WORD TESTS

Table 6.1 shows the percentage scores for each subject on each test. The test numbers are across the top and the subject numbers are down the left side. Subjects 1 to 5 were experienced while subjects 6 to 10 were naive. As can be seen from Table 6.1, the experienced subjects had higher average scores than the naive ones did, although the variance between subjects is quite large. The results of the first four tests are without feedback to the subjects except that in test 4 the answers were given after each response, possibly affecting responses to the words following in that test. The results in Table 6.1 indicate that the feedback did affect the results of both tests 4 and 5. Test 3 showed a leveling off, with the results close to those of test 2 for 8 of the 10 subjects, while test 4 shows improvement by most subjects relative to test 3. The results indicate that the subjects learned the synthesizer's accent rather quickly.

Table 6.1 Results of English word tests. Scores are percentage correct for each subject in each test.

                          Test No.
Group         Subject No.   1     2     3     4     5
Experienced   1             66    74    76    78    88
subjects      2             92    98    94    94    96
              3             80    74    74    80    86
              4             90    94    96    96    96
              5             82    88    88    94    96
              Average       82    85    85    88    92
Naive         6             90    94    94    96    98
subjects      7             68    76    76    76    84
              8             78    86    86    90    92
              9             56    64    72    72    70
              10            70    76    68    76    74
              Average       72    79    79    82    84

Table 6.2 shows the list of English words given in the test.
Shown with each word is the percentage score obtained in each test, the percentage score obtained over all five tests and the average naturalness rating from the first three tests.

Table 6.2 Percentage scores in English word tests for each word. Shown on right is random order used in test 4.

WORD    TEST 1  TEST 2  TEST 3  TEST 4  TEST 5  AVERAGE  NATURALNESS  ORDER IN TEST 4
FISH    100     100     100     100     100     100      3.2          8
LASH    100     100     100     100     100     100      3.2          49
RAW     100     100     100     100     100     100      3.9          15
REEL    100     100     100     100     100     100      3.6          20
SASH    100     100     100     100     100     100      3.5          33
SAUCE   100     100     100     100     100     100      4.0          7
SAW     100     100     100     100     100     100      4.1          39
SEE     100     100     100     100     100     100      3.5          18
SHE     100     100     100     100     100     100      3.9          6
SHELL   100     100     100     100     100     100      3.4          45
TEA     100     100     100     100     100     100      3.5          24
TOSS    100     100     100     100     100     100      3.5          40
LAY     100     90      100     100     100     98       3.6          44
PEEL    100     90      100     100     100     98       3.5          4
LEE     90      100     100     90      100     96       3.4          23
SEAL    90      100     90      100     100     96       3.2          27
EEL     100     90      90      100     90      94       3.0          31
FEEL    90      90      90      100     100     94       3.3          43
BOSS    90      90      80      100     100     92       3.1          32
SELL    90      80      100     90      90      90       3.6          36
SOLE    80      100     90      90      90      90       3.0          16
ILL     80      90      90      90      90      88       2.6          41
SKI     70      90      90      90      90      86       3.0          34
ASH     90      90      60      90      80      84       3.1          14
LEASE   60      80      90      90      100     84       2.6          5
LEASH   70      90      80      90      90      84       2.8          29
STEAL   90      80      80      80      90      84       3.0          21
ASS     70      80      90      80      100     84       2.8          25
FEE     60      80      90      80      100     82       2.6          48
PEA     80      90      70      90      90      82       3.0          1
PEACE   70      100     70      90      80      82       3.4          22
SPILL   70      80      80      90      80      80       2.6          42
TEAL    90      60      90      90      70      80       2.3          12
IF      70      70      70      80      90      78       2.7          9
PASS    50      70      90      70      90      76       3.2          17
SPIEL   70      70      70      80      90      76       2.8          3
TEASE   60      70      70      90      90      72       2.7          10
EASE    80      70      60      60      80      70       2.8          47
GASH    50      70      70      80      80      70       3.1          35
OFF     70      60      50      80      90      70       2.8          30
PAUSE   70      70      70      70      70      70       2.9          19
GAS     60      70      70      70      70      68       2.9          28
SEIZE   70      70      60      60      70      66       2.9          26
IS      60      60      60      70      70      64       2.6          13
SAYS    60      70      60      70      60      64       3.2          2
CEASE   50      60      50      70      80      62       3.5          50
PEAS    40      60      70      60      70      60       2.4          11
REEF    40      50      70      50      80      58       2.6          37
ZEAL    50      50      60      40      60      52       2.5          46
BEE     40      50      60      40      70      52       2.1          38
As can be seen from Table 6.2, twelve words had perfect recognition scores in that they were recognized correctly by every subject in each of the 5 tests. Twenty-one words had 100% in test 5. Forty-eight of the fifty words had scores of 70% or higher in test 5. The average naturalness rating for each correct word is shown. This naturalness rating averaged over the words was 3.0 for test 1, 3.1 for test 2 and 3.1 for test 3. The overall naturalness rating for the correct words was 3.1.

Table 6.3 shows some of the incorrect responses and the percentage of the time that each was made in the tests. If an incorrect response was made only once or if no response at all was made, it is included in the column "OTHERS". If the results of Table 6.3 are studied, one sees that of the total number of errors made, 18% resulted from mistaking a voiced sound when its unvoiced counterpart was intended. Another error that was very common was responding /l/ when /z/ was intended, as in "eel" for "ease". The third common error was confusion in the stops. Not including the mistake of a voiced stop for its unvoiced counterpart, the wrong stop gave 15% of the errors. The remaining errors showed no general trends.

It can be seen that most of the words have high scores by the time that they are heard the fifth time. However, there are a few words whose scores are not much above 50% even in test 5. Indications are that the rules that were used to generate the timing parameters are the cause of this problem, since the target configurations are recognized in other contexts. The exceptions to the above could be the voiced stops /b/ and /g/.
The stop /b/ is recognized with a high score in "boss" but it could be timing or it could be the target configuration that gives it a low score in "bee". The voiced stop /g/ appears in two words and it was mistaken most of the time for some other stop.

Table 6.3 Incorrect responses for each word and percentage of time each was made during tests. OTHERS includes incorrect responses that were only made once and when no response was made.

STIMULI  CORRECT  ERRORS                           OTHERS
BEE      52       PEA 20, EE 10, ME 4              14
ZEAL     52       LEAL 32, VEAL 14                 2
REEF     58       REEVE 40                         2
PEAS     60       PEEL 26, PEEVE 4                 10
CEASE    62       SEES 38                          -
SAYS     64       SELL 28, ZELL 4                  4
IS       64       ILL 22, ELL 4                    10
SEES     66       SEAL 18, ZEAL 8, CEASE 6         2
GAS      68       PASS 24, ASS 4                   4
PAUSE    70       PALL 10, PAARL 10, PALM 10       -
OFF      70       ARE 10, OTH 8, HOT 4             8
GASH     70       BASH 10, MASH 10, DASH 4         6
EASE     70       EEL 20, EELS 6, EVE 4            -
TEASE    72       TEAL 26                          2
SPIEL    76       STEAL 8, SKEAL 8, SPILL 4        4
PASS     76       TASS 8, ASS 4, TASK 4            8
IF       78       ECH 6, ITH 6, EH 4               6
TEAL     80       PEAL 10, KEAL 4, FEEL 4          2
SPILL    82       SKILL 8, SCALE 6, SPELL 4        2
PEACE    82       PEAS 16                          2
PEA      82       KEY 8, BEE 4, FEE 4              2
FEE      82       FEED 10, VEE 8                   -
ASS      84       PASS 8, AS 4                     4
STEAL    84       SKEEL 12, SKILL 4                -
LEASH    84       LEIGE 12, LEASE 4                -
LEASE    84       LEES 8, LEAVES 8                 -
ILL      86       ELL 6                            8
SKI      86       SPEE 12                          2
ASH      84       HASH 12                          4
SOLE     90       SOIL 10                          -
SELL     90       SAIL 10                          -
BOSS     92                                        8
FEEL     94       VEAL 4                           2
EEL      94                                        6
SEAL     96       ZEAL 4                           -
LEE      96                                        4
LAY      98                                        2
PEEL     98                                        2

The rules used to generate these words are preliminary. Refinement is needed to give a consistently high score for all words. The English word test described above indicates that MINERVA is capable of synthesizing isolated words, and that the timing rules for synthesis are not prohibitively complex.

7. CONCLUDING REMARKS

7.1 REALIZATION, COST AND CAPABILITY OF A COMPUTER CONTROLLED VOCAL TRACT ANALOG

Using the control scheme developed in this thesis, a final version of MINERVA would be controlled by 27 18-bit words and would require computer service for less than 1% of the speech production time.
If all configuration data were stored in core memory, MINERVA could be used in a scheme that stored the timing parameters on bulk storage. Use of context free phoneme target configurations would likely be satisfactory in many applications provided each significant allophone has its own target configuration. To store 100 different target configurations would require approximately 12,000 bits of core. Assuming that, on the average, each duration list needs four 9-bit words, each transition list needs five 8-bit words and each source amplitude list needs five 6-bit words, 428N bits of timing information would be needed for an N word vocabulary. To store target configurations and timing information for 250 words would require 6,700 18-bit words.

Inexpensive bulk storage is available on magnetic tape. However, access time to tape is relatively slow and increases with vocabulary size. Alternatively, 4,000 words of 18-bit core memory and a 250,000 word disk with a maximum access time of 32 msec could be used. If the target configurations were stored in the core memory and the timing data for any word was stored in sequence on the disk, access to data and transfer of the information would never exceed 33 msec. Information for more than 10,000 words could be stored on the disk and be used to synthesize speech in real-time. A much smaller disk or another form of bulk storage would be adequate in many applications.

In synthesizing large vocabularies, the present synthesis strategy would not be limited by the required amount of storage, but by the amount of work to determine the lists of timing parameters. A second method of generating timing control signals becomes more attractive as the vocabulary grows.
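The storage figures quoted above can be checked with a little arithmetic. The sketch below is only illustrative: the constant and function names are assumptions made here, while the 12,000-bit and 428N-bit figures, the 18-bit word length and the 250,000 word disk are taken from the text. It gives 6,612 words for a 250 word vocabulary, in line with the rounded figure of 6,700, and shows that a disk of that size holds timing data for a little over 10,000 words.

```python
import math

WORD_BITS = 18                # word length of the core memory and disk
CONFIG_BITS = 12_000          # about 100 target configurations (from the text)
TIMING_BITS_PER_WORD = 428    # timing information per vocabulary word


def storage_bits(vocabulary_size):
    """Bits for target configurations plus timing lists for N words."""
    return CONFIG_BITS + TIMING_BITS_PER_WORD * vocabulary_size


def storage_words(vocabulary_size):
    """The same requirement expressed in whole 18-bit words."""
    return math.ceil(storage_bits(vocabulary_size) / WORD_BITS)


def disk_vocabulary(disk_words):
    """Largest vocabulary whose timing data fits on a disk of this size."""
    return (disk_words * WORD_BITS) // TIMING_BITS_PER_WORD


if __name__ == "__main__":
    print(storage_bits(250))          # 119000
    print(storage_words(250))         # 6612
    print(disk_vocabulary(250_000))   # 10514
```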
Rabiner and his co-workers [21, 22, 23], Coker [59], Mattingly [24] and Tatham [25] have been developing rules for generating timing signals from input phoneme strings. Much of their work could be applied to MINERVA. Such a synthesis by rule scheme would use more memory for storage of programs and more of the computer's time than the stored list scheme, but would require considerably less memory for the storage of data.

Co-articulation is the configuration of the vocal tract for one phone affecting the configuration for an adjacent phone. Quite often in the production of speech, the vocal tract will start to move toward a target configuration, but before it arrives the mind will anticipate the following phone and start the articulators moving toward the configuration for the following phone, so that the vocal tract never reaches some of the target configurations it started towards. In synthesizing speech, co-articulation must be incorporated in the synthesis scheme. MINERVA is particularly well suited to handle this problem. In MINERVA, when the configuration is moving toward a particular target configuration, it is easy to load a new target configuration into the memories and the vocal tract will immediately start moving toward the new configuration.

In the present synthesis scheme, when moving from one target configuration to the next target configuration, all sections start moving at the same time and arrive at their position for the new configuration at the same time. They travel at different rates depending on how far they have to move. Hecker [17] in 1961 indicated that such a scheme was a serious limitation and that it would be better if each section could be controlled individually. The hardware in MINERVA is capable of such control, although control program changes would be required to effect control for each individual section.
Use of this latter scheme would require the storage of "timing lists" for each section with a resulting increase in storage cost. This fine control was not considered necessary at the present stage of development.

7.2 FURTHER WORK

Further work on speech synthesis by vocal tract analog should include a study of the nasals and glides, and synthesis of stop consonants in word-final positions. Some work might be done on improvement of the target configurations and timing parameters for phonemes considered in the present study. When such work is completed, a vocal tract analog synthesizer controlled by a computer with a disk would be capable of producing a large vocabulary of isolated words as described in the previous section.

Further work after the above is the finding of rules to generate timing parameters. Much work remains to be done in this field and a vocal tract analog will be of great use in evaluating rules. The final goal of a speech synthesis scheme is that the computer and synthesizer would be capable of synthesizing speech from only a list of phonemes as input data.

7.3 SUMMARY

This thesis has described the design, construction and evaluation of an analog of the human vocal tract, as well as its use in a study of synthesis of vowels, liquids, fricatives and stops. Glides and diphthongs were synthesized but not evaluated subjectively. Nasals were not considered in our study; however, nasals have been synthesized by Hecker [17] in some contexts on an analog similar in concept to ours. A study was made of various timing parameters and target configurations for stop-vowel and fricative-stop-vowel sequences. Fifty English words were synthesized to demonstrate intelligibility, structure and complexity of synthesis rules, and to demonstrate that the synthesizer's accent can be learned quickly.
A control scheme has been developed which results in computation times and storage requirements that are substantially less than those required by other synthesis strategies.

APPENDIX

PHONETIC TRANSCRIPTION SYSTEM

Phonetic Key              Phonetic Key
Symbol        Word        Symbol  Word
i             feet        b       bad
ɪ             fit         d       dive
e             let         g       give
æ             bat         p       [illegible]
ʌ             but         t       toy
ɑ             far         k       cat
ɔ             law
ʊ             book        z       zero
u             boot        ʒ       vision
ɝ             bird        v       very
[illegible]   pain        ð       that
o             go          h       hat
                          f       fat
                          j       you
                          θ       thing
                          ʃ       shed
                          w       we
                          l       late
                          s       sat
                          r       rate

BIBLIOGRAPHY

Note: JASA = Journal of the Acoustical Society of America.

1. Kratzenstein, L.G., "Sur la Naissance de la Formation des Voyelles", J. Physics 21, pp. 358-380, 1782.
2. Von Kempelen, W., "Le Mechanisme de la Parole, Suivi de la Description d'une Machine Parlante", Vienna, J.V. Degen, 1791.
3. Mattingly, I.G., "Synthesis by Rule as a Tool for Phonological Research", Haskins Laboratory Report SR-3, New York City, pp. 61-76, 1965.
4. Holmes, J.N., Mattingly, I.G., Shearme, J.N., "Speech Synthesis by Rule", Language and Speech, vol. 3, pp. 127-143, 1964.
5. Flanagan, J.L., Coker, C.H., Rabiner, L.R., Schafer, R.W., Umeda, N., "Synthetic Voices for Computers", I.E.E.E. Spectrum, vol. 7, No. 10, pp. 22-45, 1971.
6. Flanagan, J.L., "Speech Analysis Synthesis and Perception", Academic Press, New York, 1965.
7. Lawrence, W., "The Synthesis of Speech from Signals Which Have a Low Information Rate", Communication Theory, ed. by W. Jackson, pp. 460-471, Butterworths, London.
8. Chistovich, L., Fant, G., and De Serpa-Leitao, A., "Mimicking and Perception of Synthetic Vowels", Stockholm Speech Transmission Lab., QPR-3, 1966.
9. Fant, C.G.M., Martony, J., Rengman, V. and Risberg, A., "OVE II Synthesis Strategy", Proceedings of the Speech Communications Seminar, F5, Speech Transmission Laboratory, Royal Institute of Technology, Stockholm, 1962.
10. Tomlinson, R.S., "SPASS - An Improved Terminal Analog Speech Synthesizer", MIT QPR No. 80, pp. 148-205, 1966.
11. Glace, D.A., "Parallel Resonance Synthesizer for Speech Research", JASA, vol. 44, p. 391, 1968.
12. Liljencrants, J.C.W.A., "The OVE III Speech Synthesizer", I.E.E.E. Trans. AU-16, pp. 137-140, 1968.
13. Epstein, R., "A Transistorized Formant Type Synthesizer", Haskins Laboratory Report SR-1, New York City.
14. Dunn, H.K., "The Calculation of Vowel Resonances and an Electrical Vocal Tract", JASA, vol. 22, pp. 740-793, 1950.
15. Stevens, K.N., Kasowski, S. and Fant, C.G.M., "An Electrical Analog of the Vocal Tract", JASA, vol. 25, pp. 734-742, 1953.
16. Rosen, G., "Dynamic Analog Speech Synthesizer", Technical Report 353, MIT Research Laboratory of Electronics, Cambridge, Mass., 1960.
17. Hecker, M.H., "Studies of Nasal Consonants with an Articulatory Speech Synthesizer", JASA, vol. 34, pp. 179-188, 1962.
18. Kelly, J.L., and Lochbaum, C.C., "Speech Synthesis", 4th International Congress on Acoustics, G42, Copenhagen, 1962.
19. Matsui, E., Suzuki, T., Umeda, N., Omura, H., "Synthesis of Fairy Tales Using an Analog Vocal Tract", 6th International Congress on Acoustics, pp. B159-B162, Tokyo, 1968.
20. Rice, L., "A New Line Analog Speech Synthesizer for the PDP-12", Working Papers in Phonetics, No. 17, UCLA, pp. 58-75, 1971.
21. Rabiner, L.R., Levitt, H., and Rosenberg, A.E., "Investigation of Stress Patterns for Speech Synthesis by Rule", JASA, vol. 45, pp. 92-101, 1969.
22. Rabiner, L.R., and Levitt, H., "New Results in Speech Synthesis by Rule", 6th International Congress on Acoustics, pp. B203-B206, Tokyo, 1969.
23. Rabiner, L.R., "Speech Synthesis by Rule: An Acoustic Domain Approach", Bell System Technical Journal, pp. 17-37, January 1968.
24. Mattingly, I.G., "Synthesis by Rule of General American English", Haskins Laboratory Report SR-7/8, July-Dec., 1966.
25. Tatham, M.A.A., "A Speech Production Model for Synthesis-by-rule", Working Papers in Linguistics No. 6, The Ohio State University, pp. 67-87, 1970.
26. Fant, C.G.M., "Acoustic Theory of Speech Production", Mouton and Co., The Hague, the Netherlands, 1960.
27. Donaldson, R.W., Wickwire, K.F., "A Method for Realizing Time-Varying Circuit Elements", Proceedings of the I.E.E.E., vol. 57, pp. 103-104, 1969.
28. Riordan, R.H.S., "Simulated Inductors Using Differential Amplifiers", Electronics Letters, vol. 3, pp. 50-51, 1967.
29. Wickwire, K.F., "The Design and Testing of Time-Varying Inductors and Capacitors for an Electrical Speech Synthesizer", M.A.Sc. Thesis, University of B.C., Vancouver, B.C., 1967.
30. Antoniou, A., "Stability Properties of Some Gyrator Circuits", Electronics Letters, vol. 4, pp. 510-512, 1968.
31. Davies, A.C., "The Design of Feedback Shift-Registers and Other Synchronous Counters", The Radio and Electronic Engineer, pp. 1-11, 1969.
32. Davies, A.C., Personal Communication.
33. Hughes, G.W., and Hemdal, J.F., "Speech Analysis", Purdue Research Foundation, Lafayette, Ind., Rept. TR-EE65-9, July 1965.
34. Ito, M.R., "Investigation of Time Domain Measurements for Analysis and Machine Recognition of Speech", Ph.D. Thesis, University of B.C., Vancouver, B.C., 1971.
35. Ito, M.R. and Donaldson, R.W., "Zero Crossing Measurements for Analysis and Recognition of Speech Sounds", I.E.E.E. Trans. AU-19, pp. 235-242, 1971.
36. Peterson, G.E., and McKinney, N.P., "The Measurement of Speech Power", Phonetica, vol. 7, pp. 65-84, 1961.
37. Chiba, T., and Kajiyama, M., "The Vowel - Its Nature and Structure", Phonetic Society of Japan, Tokyo, 1958.
38. Perkell, J.S., "Physiology of Speech Production: Results and Implications of a Quantitative Cineradiographic Study", MIT Press, Cambridge, Mass., 1969.
39. Ohman, S.E.G., "Numerical Model of Coarticulation", JASA, vol. 41, pp. 310-320, 1967.
40. Peterson, G.E. and Barney, H.L., "Control Methods Used in a Study of the Vowels", JASA, vol. 24, pp. 175-184, 1952.
41. Fujimura, O., and Lindqvist, J., "Sweep Tone Measurements of Vocal Tract Characteristics", JASA, vol. 49, pp. 541-558, 1971.
42. Dunn, H.K., "Methods of Measuring Vowel Formant Bandwidths", JASA, vol. 33, pp. 1737-1746, 1961.
43. House, A.S., "Formant Bandwidths and Vowel Preference", J. Speech and Hearing Research, vol. 3, pp. 3-8, 1960.
44. Hughes, G.W., and Halle, J., "Spectral Properties of Fricative Consonants", JASA, vol. 28, pp. 303-310, 1956.
45. Lisker, L. and Abramson, A., "Voice Onset Time in the Production and Perception of English Stops", Haskins Reports SR-1, pp. 3.1-3.15, New York, 1965.
46. Lisker, L. and Abramson, A., "A Cross-Language Study of Voicing in Initial Stops: Acoustical Measurements", Word, vol. 20, pp. 384-422, 1964.
47. Lisker, L. and Abramson, A., "Some Effects of Context on Voice Onset Time in English Stops", Language and Speech, vol. 10, pp. 1-28, 1969.
48. Lisker, L., "Closure Duration and the Intervocalic Voiced-Voiceless Distinction in English", Language, vol. 33, 1957.
49. Lorge, B., "A Study of the Relationship Between Production and Perception of Initial and Intervocalic /t/ and /d/ in Individual English Speaking Adults", Haskins Reports SR-9, pp. 3.1-3.5, 1967.
50. Stevens, K.N., and House, A.S., "Studies of Formant Transitions Using a Vocal Tract Analog", JASA, vol. 28, pp. 578-585, 1956.
51. Harris, K.S., Hoffman, H.S., Liberman, A.M., Delattre, P.C. and Cooper, F.S., "Effect of Third Formant Transitions on the Perception of the Voiced Stop Consonants", JASA, vol. 30, pp. 122-126, 1958.
52. Wang, W.S.Y., "Transition and Release as Perceptual Cues for Final Plosives", J. Speech and Hearing Research, vol. 2, pp. 66-73, 1959.
53. Delattre, P.C., Liberman, A.M., Cooper, F.S., "Acoustic Loci and Transitional Cues for Consonants", JASA, vol. 27, pp. 769-773, 1955.
54. Fujimura, O., "Bilabial Stop and Nasal Consonants: a Motion Picture Study and its Acoustical Implications", J. Speech and Hearing Research, vol. 4, pp. 233-247, 1967.
55. Houde, R., "A Study of Tongue Body Motion During Selected Sounds", SCRL, Monograph No. 2, Aug. 1968.
56. Potter, R.K., Kopp, G.A., and Kopp, H.G., "Visible Speech", Dover Publications, New York, 1966.
57. House, A.S., "On Vowel Duration in English", JASA, vol. 33, pp. 1174-1178, 1961.
58. Peterson, G.E., and Lehiste, I., "Duration of Syllable Nuclei in English", JASA, vol. 32, pp. 693-703, 1960.
59. Coker, C.H., "Synthesis by Rule from Articulatory Parameters", 1967 Conference on Speech Communication and Processing, Cambridge, Mass., pp. 52-53, 1967.
