UBC Theses and Dissertations

Speech amplitude and zero crossing for automated identification of human speakers Wasson, Douglas Arnold 1974

SPEECH AMPLITUDE AND ZERO CROSSING FOR AUTOMATED IDENTIFICATION OF HUMAN SPEAKERS

by Douglas Arnold Wasson
B.Sc.E., University of New Brunswick, 1972

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in the Department of Electrical Engineering

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
May 1974

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department
The University of British Columbia
Vancouver 8, Canada

ABSTRACT

This thesis involves an investigation of the usefulness of speech amplitude, low pass zero crossing rate (ZCR) and high pass zero crossing rate for speaker recognition. Speech samples were recorded from ten speakers and preprocessed to yield time quantized preprocessed waveforms of SPEECH AMPLITUDE, LOW PASS ZCR and HIGH PASS ZCR. These preprocessed waveforms were then averaged for each speaker to form three sets of averaged waveforms. These waveforms were then expanded in an n-dimensional feature space, where n is the number of speakers used in testing the system. The orthogonal functions describing the feature space were derived from the average preprocessed waveforms by using the Gram Schmidt orthogonalization technique.
The average preprocessed waveforms were expanded in the feature space to provide a reference feature vector for each speaker. The recognition procedure was based on measuring the Euclidean distance between the feature vector derived from a sample preprocessed waveform for the unknown speaker and the reference feature vectors. The speaker whose reference vector was closest to the test vector was chosen as the speaker of that utterance. A variety of combinations of the preprocessed waveforms were tested. The best results showed an average percentage of incorrect decisions of 3.4% when one of ten speakers was identified on the basis of a single spoken sentence. The results indicate that speech amplitude and zero crossing rate are useful for speaker identification.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgement
I INTRODUCTION
II SPEAKER RECOGNITION SYSTEM
    2.1 Basic System
    2.2 Preprocessing
    2.3 Feature Extraction
    2.4 Decision Making
III DATA ACQUISITION AND INITIAL PROCESSING
    3.1 Speech Recording
    3.2 Digitizing of the Speech Samples
    3.3 Locating the Speech Boundaries
    3.4 Preprocessing Algorithm
IV SYSTEM TESTS AND RESULTS
    4.1 Test Procedure
    4.2 Test Results
        4.2.1 Individual Preprocessed Waveforms
        4.2.2 Concatenated Preprocessed Waveforms
        4.2.3 Concatenated Feature Vectors
    4.3 Measurement of the Variances of the Preprocessed Waveforms
    4.4 Discussion of Results
V CONCLUSION
REFERENCES

LIST OF TABLES

4.1 Measurements F, G, and F/G made on the feature space when individual preprocessed waveforms were used
4.2 Results of all possible combinations of three preprocessed waveforms
4.3 Measurements F, G and F/G made on the feature space when concatenated preprocessed waveforms were used
4.4 Measurements F, G, and F/G made on the feature space when concatenated feature vectors were used
4.5 Normalized variances for all three preprocessed waveforms

LIST OF FIGURES

Block Diagram of Speaker Recognition System
Typical Preprocessor Output Waveforms
Speech Boundary Determination
Typical preprocessed waveforms for a speech sample of candidate CS
Two typical 2-dimensional feature vector plots for 2 candidates
Performance results for the individual preprocessed waveforms
Performance results for the concatenated preprocessed waveforms tested
Performance results for the concatenated feature vectors tested

ACKNOWLEDGEMENT

I wish to thank my supervisor, Dr. R. W. Donaldson, for the valuable suggestions and constant encouragement which I received during the course of my project. I would also like to thank Mr. M. Koombes for the technical assistance provided in the digitization of the data. The financial support received from the National Research Council in the form of a 1967 Science Scholarship and under Grant 67-3308 is gratefully acknowledged.

CHAPTER 1 INTRODUCTION

The ability of human listeners to identify speakers from their voices has long been known. Virtually everyone has recognized friends on the telephone solely on the basis of the word "Hello". Studies to obtain estimates of the reliability with which human listeners can recognize human speakers [1,2] show that listeners can recognize speakers with accuracies of 94-98%. The speaker recognition capabilities of human listeners motivate attempts to design automatic (computer based) systems for speaker recognition.

Many different uses exist for a speaker recognition system. It could provide controlled access to a facility or information to selected individuals. A reliable speaker identification system could supply clues to a speaker's identity when other clues are either missing or highly ambiguous.

It has been shown that automatic speaker recognition on populations of 10-20 persons is feasible [3-10]. In most previous studies, the speech signal was transformed into spectral form and the resulting time-frequency-energy (spectrographic) patterns were used to identify or verify the speaker. Spectral transformations yield large amounts of data which have to be calculated and analyzed using computationally expensive techniques.

B. S. Atal developed a system [9] in which he used pitch contours to recognize speakers.
Pitch contours were used to avoid the dependency of spectral data on variable transmission characteristics.

The purpose of this study was to determine the usefulness of speech amplitude, low pass zero-crossing rate and high pass zero-crossing rate in speaker recognition. These features have been used by others [11] for recognition of speech sounds. It has been noted [11] that zero-crossing rate measurements are well suited to digital processing, virtually independent of speaker volume, and also result in much less data per speech sample than when spectra are used. Unfortunately, these features may be less speaker-dependent than spectral data [11].

The task of speaker recognition could involve either identification of a person from a population of several known speakers or verification of a person's claimed identity. Since this study was conducted to see if the features chosen were sufficiently speaker dependent to be useful, the speaker identification task was selected for conducting the speaker recognition test. The study consists generally of two parts. One was the collection of data, the analog-to-digital conversion and the extraction of the preprocessed waveforms. The other part involved the evaluation of the features for speaker recognition.

In Chapter 2 the basic system for speaker recognition is outlined. A description of how the preprocessed waveforms are combined and expanded in feature space is given. The decision making process is outlined.

The recording of speech samples, the analog-to-digital conversion and the preprocessing are described in Chapter 3.
The performance tests of the speaker recognition system and results appear in Chapter 4. Conclusions and ideas for further research are outlined in Chapter 5.

CHAPTER 2 SPEAKER RECOGNITION SYSTEM

2.1 Basic System

The speaker recognition system can be represented as shown in Figure 2.1. First the analog speech sample is sampled and digitized. The digitized speech sample is then preprocessed to yield three waveforms, which are then operated on by the feature extractor to yield a feature vector. This feature vector is then handed on to the decision making algorithm which produces a decision as to the speaker's identity.

2.2 Preprocessing

The preprocessor operates on the digitized speech sample to produce time-quantized waveforms of speech amplitude, low pass zero-crossing rate and high pass zero-crossing rate. Time quantization results from averaging each preprocessed waveform over adjacent 10 msec time windows. Figure 2.2 shows the preprocessor output waveforms which are time normalized to T = 1600 msec.

2.3 Feature Extraction

Feature extraction results from the expansion of one or more of the preprocessed output waveforms in n dimensional space where n is the number of speakers in the experiment. If all three preprocessed waveforms in Figure 2.2 are used in the recognition system there are several ways in which the orthogonal expansion can be done.

One approach is to concatenate two or three preprocessed waveforms to form a composite waveform of length 2T or 3T which is then expanded in the n dimensional orthogonal space to yield a feature vector of dimensionality n.
Another approach is to expand each preprocessed waveform in its own n dimensional orthogonal space; the vectors of coefficients for each of the preprocessed waveforms can then be concatenated to form an overall feature vector. The dimensionality of the feature vector is 2n or 3n depending on whether 2 or 3 feature waveforms were used.

Figure 2.1 Block Diagram of Speaker Recognition System (incoming speech sample; preprocessed waveforms; feature extractor; feature vectors; decision: speaker is candidate i)

Figure 2.2 Typical Preprocessor Output Waveforms (panels: SPEECH AMPLITUDE, LOW PASS ZCR, HIGH PASS ZCR; horizontal axis: time in seconds)

Next is a description of how to derive the functions which describe the n dimensional orthogonal space. First, several samples of the sentence to be used in the speaker recognition system are recorded from each speaker. Then the preprocessed waveforms for each one of these samples are derived. Let $x_{ij}$ represent the speech amplitude waveform of the $j$th utterance of the $i$th speaker, $y_{ij}$ represent the low pass zero-crossing rate waveform for the same utterance, and $z_{ij}$ represent the high pass zero-crossing rate waveform for the same utterance; $i$ can vary from 1 to $n$, where $n$ is the total number of speakers, and $j$ can vary from 1 to $m$, where $m$ is the total number of speech samples recorded per speaker.

The average preprocessed waveforms for each person are calculated as follows:

$$x_i = \frac{1}{m}\sum_{j=1}^{m} x_{ij} \qquad (2.1)$$

$$y_i = \frac{1}{m}\sum_{j=1}^{m} y_{ij} \qquad (2.2)$$

$$z_i = \frac{1}{m}\sum_{j=1}^{m} z_{ij} \qquad (2.3)$$

Next the preprocessed waveform $s_i$, $i = 1, \ldots, n$, is formed, where $s_i$ is the average of the preprocessed waveform or concatenated waveforms to be transformed, i.e.

$$s_i = x_i \ \text{or}\ y_i \ \text{or}\ z_i \qquad (2.4)$$

$$\text{or}\quad s_i = (x_i, y_i) \ \text{or}\ (x_i, z_i) \ \text{or}\ (y_i, z_i) \qquad (2.5)$$

$$\text{or}\quad s_i = (x_i, y_i, z_i) \qquad (2.6)$$

The n preprocessed waveforms $s_i$ are now transformed into an independent orthogonal set of functions $\phi_i$, $i = 1, \ldots, n$, using the Gram Schmidt technique [12] given by the equations below:

$$\phi_1 = s_1 \qquad (2.7)$$

$$\phi_j = s_j - \sum_{k=1}^{j-1} C_{kj}\,\phi_k, \qquad j = 2, 3, \ldots, n \qquad (2.8)$$

where

$$C_{kj} = \frac{\int s_j\,\phi_k\,dt}{\int \phi_k^2\,dt} \qquad (2.9)$$

In using the Gram Schmidt technique the assumption is made that the preprocessed waveforms $s_i$ are independent. If the preprocessed waveforms $s_i$ are not independent then one of the functions $\phi_i$ will be zero.

Now that the orthogonal functions have been derived, the preprocessed waveforms $s_{ij}$, where

$$s_{ij} = x_{ij} \ \text{or}\ y_{ij} \ \text{or}\ z_{ij} \qquad (2.10)$$

$$\text{or}\quad s_{ij} = (x_{ij}, y_{ij}) \ \text{or}\ (x_{ij}, z_{ij}) \ \text{or}\ (y_{ij}, z_{ij}) \qquad (2.11)$$

$$\text{or}\quad s_{ij} = (x_{ij}, y_{ij}, z_{ij}) \qquad (2.12)$$

can be expanded in the orthogonal space described by the functions $\phi_k$. The preprocessed waveforms $s_{ij}$ used correspond to the average preprocessed waveforms $s_i$ used in generating the orthogonal set. The equations describing the expansion in the feature space are

$$s_{ij} = \sum_{k=1}^{n} u_{ij}(k)\,\phi_k \qquad (2.13)$$

where

$$u_{ij}(k) = \frac{1}{A_k}\int_{-\infty}^{\infty} s_{ij}(t)\,\phi_k(t)\,dt \qquad (2.14)$$

and

$$A_k = \int_{-\infty}^{\infty} \phi_k^2(t)\,dt \qquad (2.15)$$

$u_{ij}(k)$ is the coordinate value of the preprocessed waveform $s_{ij}$ along the axis described by the function $\phi_k$.

The feature vector for the $j$th utterance of the $i$th speaker is $\vec{u}_{ij} = \{u_{ij}(1), \ldots, u_{ij}(k), \ldots, u_{ij}(n)\}$, which is just the coordinates of $s_{ij}$ expanded in the n dimensional orthogonal space. When the preprocessed waveform $s_i$ is expanded in the n dimensional space it results in the feature $\vec{u}_i = \{u_i(1), \ldots, u_i(k), \ldots, u_i(n)\}$. The feature $\vec{u}_i$ represents the average location of candidate $i$ in the n dimensional space.
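The orthogonalization and expansion described above can be sketched numerically. The following is a minimal illustration, not the thesis's original program: it assumes the waveforms are sampled vectors, so the integrals in (2.9), (2.14) and (2.15) become sums over time samples. The function names `gram_schmidt` and `expand` and the random test data are hypothetical.

```python
import numpy as np

def gram_schmidt(S):
    """Orthogonalize the rows of S (one averaged waveform s_i per row).

    Follows eqs. (2.7)-(2.9) with integrals replaced by sums over samples.
    The returned functions are orthogonal but not normalized. The rows of S
    are assumed to be linearly independent.
    """
    S = np.asarray(S, dtype=float)
    phi = np.zeros_like(S)
    for j in range(S.shape[0]):
        phi[j] = S[j]
        for k in range(j):
            c_kj = (S[j] @ phi[k]) / (phi[k] @ phi[k])  # eq. (2.9)
            phi[j] -= c_kj * phi[k]                     # eq. (2.8)
    return phi

def expand(s, phi):
    """Coordinates u(k) of waveform s along each phi_k, eqs. (2.14)-(2.15)."""
    A = np.sum(phi * phi, axis=1)                       # A_k, eq. (2.15)
    return (phi @ np.asarray(s, dtype=float)) / A

# Illustrative data only: 3 "speakers", 160 time samples each.
rng = np.random.default_rng(0)
S = rng.normal(size=(3, 160))
phi = gram_schmidt(S)
U = np.array([expand(s, phi) for s in S])  # row i is the feature of s_i
print(np.round(U, 3))
```

Under these assumptions each row of `U` has a 1 in position i and zeros after it, because each averaged waveform lies in the span of the first i orthogonal functions.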
In the second method of using two or three preprocessed waveforms, each preprocessed waveform is expanded in its own n dimensional space. This method requires two or three sets of the orthogonal functions $\phi_k$ to be generated, depending upon the preprocessed waveforms used. In this case, for each set of $\phi_k$'s, $s_i = x_i$ or $s_i = y_i$ or $s_i = z_i$, depending upon which waveform is being considered. As before the preprocessed waveforms $s_{ij}$, in this case $s_{ij} = x_{ij}$ or $s_{ij} = y_{ij}$ or $s_{ij} = z_{ij}$, are expanded in their corresponding n dimensional space. This is done for each preprocessed waveform used. This results in the corresponding feature vectors $\vec{u}^{\,x}_{ij}$ or $\vec{u}^{\,y}_{ij}$ or $\vec{u}^{\,z}_{ij}$. The feature vector $\vec{u}^{\,x}_{ij}$ represents the coefficients of the speech amplitude preprocessed waveform $x_{ij}$ expanded in its own n-dimensional space; similarly for $\vec{u}^{\,y}_{ij}$ and low pass zero-crossing rate, and $\vec{u}^{\,z}_{ij}$ and high pass zero-crossing rate. The sets of preprocessed waveforms $s_i$ can also be expanded in their corresponding n dimensional space, which results in the feature vectors $\vec{u}^{\,x}_{i}$, $\vec{u}^{\,y}_{i}$, $\vec{u}^{\,z}_{i}$. Using this method the new composite feature vector is $\vec{u}_{ij}$, where $\vec{u}_{ij}$ can be

$$\vec{u}_{ij} = (\vec{u}^{\,x}_{ij}, \vec{u}^{\,y}_{ij}) = \{u^{x}_{ij}(1), \ldots, u^{x}_{ij}(n), u^{y}_{ij}(1), \ldots, u^{y}_{ij}(n)\} \qquad (2.16)$$

$$\text{or}\quad \vec{u}_{ij} = (\vec{u}^{\,x}_{ij}, \vec{u}^{\,z}_{ij}) \qquad (2.17)$$

$$\text{or}\quad \vec{u}_{ij} = (\vec{u}^{\,y}_{ij}, \vec{u}^{\,z}_{ij}) \qquad (2.18)$$

$$\text{or}\quad \vec{u}_{ij} = (\vec{u}^{\,x}_{ij}, \vec{u}^{\,y}_{ij}, \vec{u}^{\,z}_{ij}) \qquad (2.19)$$

There is also the average composite feature vector $\vec{u}_i$, which equals the corresponding combination of $\vec{u}^{\,x}_{i}$, $\vec{u}^{\,y}_{i}$ and $\vec{u}^{\,z}_{i}$ as $\vec{u}_{ij}$ has in (2.16)-(2.19).

For the operation of the system the orthogonal functions describing the n dimensional space(s) have to be stored. Also the features $\vec{u}_i$ ($i = 1, \ldots, n$), the average locations of the candidates in the n dimensional space, have to be retained.
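As a rough illustration of this second method, the sketch below builds one orthogonal basis per waveform type and concatenates the coefficient vectors, as in (2.16). It assumes sampled waveforms with integrals replaced by sums; the names `orthogonal_basis`, `coords` and `composite_feature`, the dictionary keys and the random data are all hypothetical.

```python
import numpy as np

def orthogonal_basis(S):
    """Gram Schmidt orthogonalization of the rows of S (sums replace integrals)."""
    phi = []
    for s in np.asarray(S, dtype=float):
        r = s.copy()
        for p in phi:
            r -= (s @ p) / (p @ p) * p
        phi.append(r)
    return np.array(phi)

def coords(s, phi):
    """Coordinates of s along each orthogonal function in phi."""
    return (phi @ s) / np.sum(phi * phi, axis=1)

def composite_feature(waveforms, bases):
    """Concatenate the coefficient vectors of each waveform type.

    waveforms and bases are dicts keyed by a waveform label such as 'x' or 'y'.
    """
    return np.concatenate([coords(waveforms[k], bases[k]) for k in bases])

# Illustrative data only: n = 4 speakers, 160 samples per averaged waveform.
rng = np.random.default_rng(1)
X_avg = rng.normal(size=(4, 160))   # stands in for averaged amplitude waveforms
Y_avg = rng.normal(size=(4, 160))   # stands in for averaged low pass ZCR waveforms
bases = {"x": orthogonal_basis(X_avg), "y": orthogonal_basis(Y_avg)}

# Composite reference feature of the third speaker: dimensionality 2n = 8.
u = composite_feature({"x": X_avg[2], "y": Y_avg[2]}, bases)
print(u.shape)
```

The composite vector has 2n entries here; adding a third waveform type to both dictionaries would give 3n.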
The Gram Schmidt orthogonalization technique has the advantage that if more speakers are added to the system, thus requiring more orthogonal functions to be generated, only the additional orthogonal functions have to be calculated, with the existing functions remaining unaltered. The new orthogonal functions are generated using the new independent preprocessed waveforms and all the previous orthogonal functions (see eqn. (2.8)). Another advantage is that the average feature vectors have to be calculated for the new speakers only. The average feature vectors for the earlier speakers are easily altered by adding zeroes to the end of the feature vector to give it the correct dimensionality.

The above statement follows from the expansion technique. Equation 2.8 can be written as

$$s_i = \phi_i + \sum_{k=1}^{i-1} C_{ki}\,\phi_k = C_{1i}\,\phi_1 + C_{2i}\,\phi_2 + \cdots + C_{i-1,i}\,\phi_{i-1} + \phi_i \qquad (2.20)$$

Comparing (2.20) with (2.13) it can be seen that

$$\vec{u}_1 = (1, 0, 0, \ldots, 0)$$

$$\vec{u}_2 = (C_{12}, 1, 0, \ldots, 0)$$

$$\vec{u}_i = (C_{1i}, C_{2i}, \ldots, C_{i-1,i}, 1, 0, \ldots, 0)$$

$$\vec{u}_n = (C_{1n}, C_{2n}, \ldots, C_{n-1,n}, 1)$$

Each vector has n coordinates. The vector $\vec{u}_i$, $1 \le i \le n$, has non-zero values for the coordinates between 1 and $i-1$, the coordinate $i$ has the value 1, and the coordinates between $i+1$ and $n$ are zero.

2.4 Decision Making

In system operation an utterance is recorded from a speaker. A set of preprocessed waveforms is chosen and the utterance is processed, resulting in a feature vector $\vec{u}$. Once the feature $\vec{u}$ is generated, the distance between this feature and the average feature vector of each speaker is calculated using

$$d_k = \lVert \vec{u} - \vec{u}_k \rVert, \qquad k = 1, \ldots, n \qquad (2.21)$$

The distances $d_k$ are then sorted to locate the smallest one. The system then decides that the utterance came from the $k$th speaker for which $d_k$ is the smallest.
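The decision rule of (2.21) amounts to a nearest-reference search. A minimal sketch follows, assuming Euclidean distance as stated in the abstract; the function name `identify` and the numerical values are made up for illustration, and `argmin` stands in for sorting the distances.

```python
import numpy as np

def identify(u, references):
    """Return (index of the nearest reference vector, all distances d_k).

    Implements eq. (2.21) with the Euclidean norm. Indices are zero-based,
    so index 1 corresponds to the second speaker.
    """
    d = np.linalg.norm(np.asarray(references, dtype=float)
                       - np.asarray(u, dtype=float), axis=1)
    return int(np.argmin(d)), d

# Illustrative reference vectors with the triangular structure shown above.
references = [
    [1.0, 0.0, 0.0],
    [0.5, 1.0, 0.0],
    [0.2, 0.3, 1.0],
]
u = [0.45, 0.9, 0.1]   # hypothetical feature vector of an unknown utterance
k, d = identify(u, references)
print(k, np.round(d, 3))
```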
A correct decision is made if the utterance did actually come from the $k$th speaker chosen by the system.

CHAPTER 3 DATA ACQUISITION AND INITIAL PROCESSING

3.1 Speech Recording

The first step in the project was to collect data for the testing of the speaker recognition system. The speech samples were recorded in a quiet room using a high quality audio recording system consisting of an AKG D200-E microphone and a Scully Model 280 tape recorder. A quiet room was used for recording to simulate a favourable atmosphere in which a real system would operate.

There were ten speakers in all, seven male and three female. Each speaker read a set of five sentences written on a card. Only the third sentence of the five, "We were away a year ago", was used in the speaker recognition experiment. This sentence is an entirely voiced sentence whose duration varied from 1.1 to 2.1 seconds. The other sentences were read in order to avoid any special emphasis on the key sentence. The speakers were not given any instructions concerning the manner in which they should read the sentences. They were told the recordings would be used in future speaker recognition tests.

The recording sessions were held once in the morning and once in the afternoon on five different days. A total of 100 different samples of the set of five sentences were recorded, 10 samples per speaker.

3.2 Digitizing of the Speech Samples

The analog speech signal was prefiltered, digitized and stored on digital magnetic tape. The prefilters used included low pass filters of 8 kHz and 1 kHz bandwidth and a band-pass filter having a 1-8 kHz passband. The analog speech signal was sampled at 16 kHz and quantized uniformly to 10 bits.
This process was accomplished using a Data General Super Nova computer. The low-pass and band-pass filters were Khronkite #3342 R filters having a 48 dB/octave cutoff rate. The remainder of the processing of the digitized speech samples was done on an IBM Model 360/67 duplex computer.

For convenience, the 8 kHz low pass filtered speech will be referred to as normal speech, the 1 kHz low pass filtered speech as low pass speech and the 1-8 kHz band pass filtered speech as high pass speech.

3.3 Locating the Speech Boundaries

Initially the speech samples were digitized and stored on magnetic tape. Before the preprocessed waveforms could be extracted from the speech sample, the beginning and ending of the speech sample had to be located. The method used was one in which a measure on the speech was calculated; when this measure exceeded a threshold the beginning of the speech sample was said to be found, and when the measure later dropped below a threshold the ending of the speech sample was said to be located.

The measure used was the energy passed through a time window. The energy was simply the sum of the squares of the speech samples seen through the window, which was moved along the speech samples with the energy being calculated at each move. When the energy exceeded a threshold the beginning of the speech sample had been found. Once the beginning of the speech sample was found, scanning for the end began. When the energy coming through the window dropped below a threshold the ending of the speech sample had been located. This technique was used earlier by B. Gold [13].
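The energy-window search described above can be sketched as follows. This is an illustration under stated assumptions rather than the original program: window widths are given in samples, the window is advanced one sample at a time, and the function names, thresholds and synthetic test signal are hypothetical.

```python
import numpy as np

def window_energy(x, width):
    """Sum of squares over a window of `width` samples starting at each sample."""
    sq = np.asarray(x, dtype=float) ** 2
    csum = np.concatenate(([0.0], np.cumsum(sq)))
    return csum[width:] - csum[:-width]

def find_boundaries(x, start_width, end_width, start_thresh, end_thresh):
    """Return (begin, end) sample indices, or None if no speech is found.

    begin: first window position whose energy exceeds start_thresh.
    end:   first later window position whose energy drops below end_thresh
           (or the end of the record if that never happens).
    """
    e_start = window_energy(x, start_width)
    above = np.nonzero(e_start > start_thresh)[0]
    if above.size == 0:
        return None
    begin = int(above[0])
    e_end = window_energy(x, end_width)
    below = np.nonzero(e_end[begin:] < end_thresh)[0]
    end = begin + int(below[0]) if below.size else len(x)
    return begin, end

# Synthetic record: silence, a unit-amplitude burst on samples 300..699, silence.
x = np.zeros(1000)
x[300:700] = np.where(np.arange(400) % 2 == 0, 1.0, -1.0)
begin, end = find_boundaries(x, start_width=10, end_width=100,
                             start_thresh=5.0, end_thresh=5.0)
print(begin, end)
```

Using a cumulative sum keeps the cost linear in the record length, since each window energy is obtained as a difference of two running totals.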
Figure 3.1 Speech Boundary Determination (speech value versus time, showing the sentence beginning located with a 10 msec scanning window and the sentence ending located with a 100 msec scanning window)

The width of the window used for scanning for the beginning of the speech sample was chosen as 10 msec. Since speakers tend to speak more softly at the end of the sentence than at the beginning, as illustrated in Figure 3.1, a wider window was used to locate the ending of the speech sample. The window width for locating the ending of the speech sample was 100 msec.

The two thresholds, one for locating the beginning of the speech sample and one for locating the ending of the speech sample, were determined experimentally. The same thresholds were used for locating the beginning and ending of normal speech samples and low pass speech samples. Another set of energy thresholds was determined for locating the beginning and ending of the high pass speech samples.

One problem arose in locating the ending of the speech samples. Some speakers paused long enough between the words "away" and "a" in the sentence "We were away a year ago" to cause the algorithm to place the ending of the speech sample following "away". This problem was solved by taking into consideration the fact that the minimum speech sample length was 1.1 seconds. The algorithm was altered such that the scanning for the ending of the speech sample did not start until 1 second of the speech sample had passed since the beginning of the speech sample had been located. This change resulted in the ending of the speech sample being located correctly in all cases.

3.4 Preprocessing Algorithm

The next step was the preprocessing of the speech samples prior to feature extraction. The preprocessor yielded speech amplitude, low pass zero-crossing rate and high pass zero-crossing rate waveforms.
As the first step in preprocessing, the digitized speech sample was divided into adjacent 10 msec intervals. The amplitude A assigned to any 10 msec interval was obtained by averaging the absolute value of the amplitudes of the 160 samples in the interval. Thus the speech amplitude waveform is represented by the amplitudes of all the 10 msec intervals in the speech sample.

The zero-crossing rate (ZCR) was determined by the number of zero-crossings in a 10 msec interval. ZCR was calculated using

$$\mathrm{ZCR} = \sum_{k=1}^{n} \frac{1 - \operatorname{sgn}(x_{k+1})\,\operatorname{sgn}(x_k)}{2} \qquad (3.1)$$

where $x_k$ is the amplitude of sample $k$ in the interval and $n = 160$. Therefore the low pass ZCR waveform is represented by all the ZCR's calculated using the low pass speech and the high pass ZCR waveform is represented by all the ZCR's calculated using high pass speech.

Durations of the extracted preprocessed waveforms varied from 1.1 seconds to 2.1 seconds. The average length was 1.6 seconds. Therefore all the preprocessed waveforms were time normalized to 1.6 seconds using linear scaling. The amplitude waveform was also amplitude normalized such that its total energy equalled a preselected constant value. A typical example of the three normalized waveforms appears in Figure 3.2.

Figure 3.2 Typical preprocessed waveforms for a speech sample of candidate CS

CHAPTER 4 SYSTEM TESTS AND RESULTS

4.1 Test Procedure

In this chapter a description of the different tests performed on the speaker recognition system and their results is presented. The same test procedure was used to test all combinations of the preprocessed waveforms used.
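Referring back to the preprocessing algorithm of Section 3.4, the amplitude and zero-crossing computations can be sketched as below. Several choices here are assumptions, not taken from the thesis: sign products are formed only between samples inside the same 10 msec frame (so 159 pairs rather than the 160 terms of (3.1)), an exactly zero sample contributes half a crossing through sgn(0) = 0, and "linear scaling" of the time axis is realized with linear interpolation. All function names and the test tone are hypothetical.

```python
import numpy as np

FRAME = 160  # samples per 10 msec interval at a 16 kHz sampling rate

def frames(x, frame=FRAME):
    """Split x into adjacent frames, dropping any incomplete final frame."""
    x = np.asarray(x, dtype=float)
    n = len(x) // frame
    return x[: n * frame].reshape(n, frame)

def amplitude_waveform(x):
    """Average absolute amplitude in each 10 msec frame."""
    return np.abs(frames(x)).mean(axis=1)

def zcr_waveform(x):
    """Zero-crossing count per frame, in the spirit of eq. (3.1)."""
    s = np.sign(frames(x))
    return np.sum((1 - s[:, 1:] * s[:, :-1]) / 2, axis=1)

def time_normalize(w, length=160):
    """Rescale waveform w to `length` points (1.6 s of 10 msec frames)."""
    old = np.linspace(0.0, 1.0, len(w))
    new = np.linspace(0.0, 1.0, length)
    return np.interp(new, old, w)

def energy_normalize(w, energy=1.0):
    """Scale w so that the sum of its squares equals `energy`."""
    return w * np.sqrt(energy / np.sum(w ** 2))

# Illustrative input: a 500 Hz tone, 1.2 s long, sampled at 16 kHz.
t = np.arange(int(1.2 * 16000))
x = np.sin(np.pi * t / 16 + 0.1)
a = energy_normalize(time_normalize(amplitude_waveform(x)))
z = time_normalize(zcr_waveform(x))
print(a.shape, z.shape)
```

In use, `zcr_waveform` would be applied separately to the low pass and the high pass filtered versions of the same utterance to obtain the two ZCR waveforms.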
The test procedure used was to test the system with 2, 4, 6, 8 and 10 candidates separately using five different random groupings for each number of candidates, providing a total of 25 tests in all. The candidate grouping remained the same for all tests on the different preprocessed waveform combinations. For the case of ten candidates the different groupings were merely different orderings of the same candidates (different orderings affect the orthogonal functions used).

The performance measure for the system was the percentage of incorrect decisions (i.e. the wrong person was identified) in the total number of decisions made. As described in Section 2.3, to set up the system the preprocessed waveforms from all ten speech samples for each candidate to be used in testing the system were averaged to form average preprocessed waveforms, which in turn were transformed into orthogonal functions which describe the feature space. To test the system the preprocessed waveform(s) from each speech sample of each candidate used in testing the system were in turn expanded in the feature space. Using each resulting feature vector a decision was made as to the identity of the speaker of that particular speech sample. The decision was then checked for its validity, keeping track of the incorrect decisions. The total number of decisions made in a system test equalled ten times the number of candidates used in the test.

There were also three other measurements made on the feature vectors in the feature space. These were the quantities F, G and F/G defined as follows:

$$F = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \ne i}}^{N} \lvert \vec{u}_i - \vec{u}_j \rvert \qquad (4.1)$$

$$G = \frac{1}{10N} \sum_{j=1}^{N} \sum_{i=1}^{10} \lvert \vec{u}_{ij} - \vec{u}_j \rvert \qquad (4.2)$$

where $\vec{u}_j$ is the average feature vector for the $j$th speaker, $\vec{u}_{ij}$ is the feature vector derived from the $i$th utterance of the $j$th speaker and N is the number of speakers used in the system test. F is a measure of the average distance between the average feature vectors of any two candidates in the feature space. G is a measure of the average distance between the individual feature vectors and their average feature vector for all vectors in the feature space. F/G is a figure of merit for the system; the larger F/G, the better the performance of the system should be.

The results of the different system tests are presented in graphical form with percentage of incorrect decisions versus number of candidates. For a fixed number of candidates the test results varied from one group to another. The variation of results is illustrated on the graph by a bar at that number of candidates covering the range of results. A circle on the bar is used to denote the average value of the performance of the system for that number of candidates.

As stated in Section 2.3 the dimensionality of the feature space equalled the number of candidates used in the speaker recognition tests. Therefore for the case of two speakers the feature vectors can be plotted to show how the vectors group in the feature space. These plots were made for each group of two candidates. Two examples of the plots are shown in Figure 4.1.

4.2 Test Results

4.2.1 Individual Preprocessed Waveforms

First the individual preprocessed waveforms were evaluated separately with the speaker recognition system. The results are presented in Figure 4.2. As can be seen from the curves the individual preprocessed waveform LOW PASS ZCR performed best, with an average error rate of 8.4% with ten candidates.
The preprocessed waveform SPEECH AMPLITUDE ranked second with an average error rate of 12.5% under the same conditions. The preprocessed waveform HIGH PASS ZCR was last with an average error rate of 21.9% in the same situation. In all cases the error increased approximately linearly with the number of candidates.

Figure 4.1 Two typical 2-dimensional feature vector plots for 2 candidates (M marks the mean). Panels: feature vector plot for candidates ND (X) and GB (O), concatenated preprocessed waveform used: HIGH PASS ZCR, SPEECH AMPLITUDE; feature vector plot for candidates DW (X) and JM (O), preprocessed waveform used: LOW PASS ZCR.

Figure 4.2(a) Performance results of the preprocessed waveform SPEECH AMPLITUDE (percentage of incorrect decisions versus number of candidates)

Figure 4.2(b) Performance results of the preprocessed waveform LOW PASS ZCR (percentage of incorrect decisions versus number of candidates)

Figure 4.2(c) Performance results of the preprocessed waveform HIGH PASS ZCR (percentage of incorrect decisions versus number of candidates)

The measurements F, G and F/G made on the feature vectors derived from the preprocessed waveforms are recorded in Table 4.1. The values are averaged over the five different groups for each number of candidates used in the system test. The preprocessed waveform LOW PASS ZCR showed the best range of the figure of merit F/G, ranging from 6.4 to 2.1 as the number of candidates increased from 2 to 10. The range of F/G for the preprocessed waveform SPEECH AMPLITUDE was from 4.9 to 1.6 and for the preprocessed waveform HIGH PASS ZCR from 4.3 to 1.5.
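The measures F, G and F/G of (4.1) and (4.2) can be computed as in the sketch below. It assumes Euclidean distance and takes each speaker's average feature vector to be the mean of that speaker's feature vectors, which agrees with expanding the averaged waveform because the expansion (2.14) is linear. The function name `separation_measures` and the toy data are hypothetical.

```python
import numpy as np

def separation_measures(U):
    """Return (F, G, F/G) for features U of shape (N, M, d).

    N speakers, M utterances per speaker (M = 10 in the thesis),
    d-dimensional feature vectors.
    """
    U = np.asarray(U, dtype=float)
    N, M, _ = U.shape
    means = U.mean(axis=1)                              # average feature vectors
    diff = means[:, None, :] - means[None, :, :]
    # The i == j terms are zero, so summing all pairs matches eq. (4.1).
    F = np.linalg.norm(diff, axis=2).sum() / (N * (N - 1))
    G = np.linalg.norm(U - means[:, None, :], axis=2).sum() / (M * N)  # eq. (4.2)
    return F, G, F / G

# Toy data: two speakers, two utterances each, 2-dimensional features.
U = [
    [[-0.1, 0.0], [0.1, 0.0]],   # speaker 1, mean (0, 0)
    [[0.9, 0.0], [1.1, 0.0]],    # speaker 2, mean (1, 0)
]
F, G, ratio = separation_measures(U)
print(F, G, ratio)
```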
4.2.2 Concatenated Preprocessed Waveforms

The speaker recognition system was used to evaluate concatenated preprocessed waveforms as explained in Section 2.3. Since there are two ways to concatenate two preprocessed waveforms A and B, that is A,B and B,A, and six ways to concatenate three preprocessed waveforms A, B and C, it was decided to test all possible combinations of two and three preprocessed waveforms using the same set of ten candidates. The results are presented in Table 4.2.

The four distinct sets of concatenated preprocessed waveforms which performed the best under the above test were chosen for complete evaluation. The sets that were chosen were HIGH PASS ZCR, SPEECH AMPLITUDE; LOW PASS ZCR, SPEECH AMPLITUDE; HIGH PASS ZCR, LOW PASS ZCR; and LOW PASS ZCR, HIGH PASS ZCR, SPEECH AMPLITUDE. The results of the above sets are presented in Figure 4.3. The best performance was obtained from the concatenated preprocessed waveform LOW PASS ZCR, HIGH PASS ZCR, SPEECH AMPLITUDE with an average error rate of 5.2% with ten candidates. This performance was only slightly better than that of the concatenated preprocessed waveform HIGH PASS ZCR, SPEECH AMPLITUDE, with an average error rate of 5.4% under the same conditions. The concatenated preprocessed waveform LOW PASS ZCR, SPEECH AMPLITUDE ranked next with an average error rate of 9.8% with ten candidates and HIGH PASS ZCR, LOW PASS ZCR was last with an average error rate of 18.6% under similar conditions.

Table 4.1: Measurements F, G and F/G made on the feature space when individual preprocessed waveforms were used

Preprocessed waveform: SPEECH AMPLITUDE
No. of candidates    F        G        F/G
 2                   1.002    .2041    4.91
 4                   1.183    .3536    3.35
 6                   1.263    .5988    2.12
 8                   1.378    .7233    1.91
10                   1.436    .8899    1.62

Preprocessed waveform: LOW PASS ZCR
No. of candidates    F        G        F/G
 2                   1.012    .1576    6.42
 4                   1.156    .3439    3.36
 6                   1.247    .4790    2.60
 8                   1.311    .5702    2.35
10                   1.395    .6686    2.09

Preprocessed waveform: HIGH PASS ZCR
No. of candidates    F        G        F/G
 2                    .986    .2306    4.28
 4                   1.156    .5630    2.05
 6                   1.302    .7644    1.70
 8                   1.430    .9431    1.52
10                   1.557    1.044    1.49

Table 4.2: Results of all possible combinations of three preprocessed waveforms

Combinations of Preprocessed Waveforms (number of candidates: 10)
Concatenated preprocessed waveform                   % error
SPEECH AMPLITUDE, LOW PASS ZCR                       10.0
LOW PASS ZCR, SPEECH AMPLITUDE                        7.0
SPEECH AMPLITUDE, HIGH PASS ZCR                       7.0
HIGH PASS ZCR, SPEECH AMPLITUDE                       5.0
LOW PASS ZCR, HIGH PASS ZCR                          19.0
HIGH PASS ZCR, LOW PASS ZCR                          18.0
SPEECH AMPLITUDE, LOW PASS ZCR, HIGH PASS ZCR        10.0
SPEECH AMPLITUDE, HIGH PASS ZCR, LOW PASS ZCR         7.0
LOW PASS ZCR, SPEECH AMPLITUDE, HIGH PASS ZCR         6.0
HIGH PASS ZCR, SPEECH AMPLITUDE, LOW PASS ZCR         6.0
LOW PASS ZCR, HIGH PASS ZCR, SPEECH AMPLITUDE         6.0
HIGH PASS ZCR, LOW PASS ZCR, SPEECH AMPLITUDE         6.0

[Figure 4.3(a): Performance results of the concatenated preprocessed waveform HIGH PASS ZCR, SPEECH AMPLITUDE (percentage of incorrect decisions versus number of candidates).]

[Figure 4.3(b): Performance results of the concatenated preprocessed waveform LOW PASS ZCR, SPEECH AMPLITUDE (percentage of incorrect decisions versus number of candidates).]

[Figure 4.3(c): Performance results of the concatenated preprocessed waveform HIGH PASS ZCR, LOW PASS ZCR (percentage of incorrect decisions versus number of candidates).]

[Figure 4.3(d): Performance results of the concatenated preprocessed waveform LOW PASS ZCR, HIGH PASS ZCR, SPEECH AMPLITUDE (percentage of incorrect decisions versus number of candidates).]

The measurements F, G and F/G for these four sets of concatenated preprocessed waveforms are presented in Table 4.3. The two concatenated waveforms LOW PASS ZCR, HIGH PASS ZCR, SPEECH AMPLITUDE and HIGH PASS ZCR, SPEECH AMPLITUDE have approximately the same F/G values, 6.02 to 2.00 and 5.98 to 1.97 respectively, as the number of candidates increases from 2 to 10. The F/G values for the last two concatenated waveforms, LOW PASS ZCR, SPEECH AMPLITUDE and HIGH PASS ZCR, LOW PASS ZCR, range from 5.68 to 1.72 and 4.60 to 1.68 respectively, as the number of candidates increases from 2 to 10.

4.2.3 Concatenated Feature Vectors

In this section the evaluation results of the concatenated features, obtained by concatenating the feature vectors derived from individual preprocessed waveforms, are presented. Since it was the feature vectors that were concatenated and not the preprocessed waveforms as in the last section, it made no difference in which order the feature vectors were concatenated. The results for the concatenated feature vectors using the different preprocessed waveforms are presented in Figure 4.4. The concatenated feature vector derived from the preprocessed waveforms SPEECH AMPLITUDE, LOW PASS ZCR and HIGH PASS ZCR performed best with an average error rate of 3.4% with ten candidates. The concatenated feature vector derived from the preprocessed waveforms SPEECH AMPLITUDE, LOW PASS ZCR was close behind with an average error rate of 3.6% with ten candidates. The other two concatenated feature vectors, derived from the preprocessed waveforms SPEECH AMPLITUDE, HIGH PASS ZCR and LOW PASS ZCR, HIGH PASS ZCR, performed approximately the same, each having an average error rate of 8.6% with ten candidates. Table 4.4 contains average values for the measurements F, G and F/G made on the feature space created using the concatenated feature vectors.
Note that the concatenated feature vector derived from the preprocessed waveforms SPEECH AMPLITUDE, LOW PASS ZCR had F/G values ranging from 4.86 to 1.77 as the number of candidates increases from 2 to 10, which is better than the F/G values for the concatenated feature vector derived from all three preprocessed waveforms, which range from 4.27 to 1.59 under the same conditions. The F/G values for the two concatenated feature vectors derived from SPEECH AMPLITUDE, HIGH PASS ZCR and LOW PASS ZCR, HIGH PASS ZCR range from 4.17 to 1.49 and 4.42 to 1.65 respectively as the number of candidates increases from 2 to 10.

Table 4.3: Measurements F, G and F/G made on the feature space when concatenated preprocessed waveforms were used

Concatenated waveform: HIGH PASS ZCR, SPEECH AMPLITUDE
No. of candidates    F        G       F/G
 2                   1.001    .167    5.98
 4                   1.159    .365    3.48
 6                   1.233    .512    2.41
 8                   1.323    .605    2.19
10                   1.385    .702    1.97

Concatenated waveform: LOW PASS ZCR, SPEECH AMPLITUDE
No. of candidates    F        G       F/G
 2                   1.001    .176    5.68
 4                   1.164    .425    2.73
 6                   1.251    .604    2.07
 8                   1.362    .708    1.92
10                   1.448    .841    1.72

Concatenated waveform: HIGH PASS ZCR, LOW PASS ZCR
No. of candidates    F        G       F/G
 2                   1.002    .218    4.60
 4                   1.151    .492    2.34
 6                   1.291    .663    1.95
 8                   1.429    .824    1.73
10                   1.530    .906    1.68

Concatenated waveform: LOW PASS ZCR, HIGH PASS ZCR, SPEECH AMPLITUDE
No. of candidates    F        G       F/G
 2                   1.001    .166    6.02
 4                   1.158    .361    3.21
 6                   1.232    .506    2.43
 8                   1.322    .597    2.21
10                   1.384    .693    2.00

[Figure 4.4(a): Performance results of concatenated feature vectors derived from the preprocessed waveforms SPEECH AMPLITUDE and LOW PASS ZCR (percentage of incorrect decisions versus number of candidates).]

[Figure 4.4(b): Performance results of concatenated feature vectors derived from the preprocessed waveforms SPEECH AMPLITUDE and HIGH PASS ZCR (percentage of incorrect decisions versus number of candidates).]

[Figure 4.4(c): Performance results of concatenated feature vectors derived from the preprocessed waveforms LOW PASS ZCR and HIGH PASS ZCR (percentage of incorrect decisions versus number of candidates).]

[Figure 4.4(d): Performance results of concatenated feature vectors derived from the preprocessed waveforms SPEECH AMPLITUDE, LOW PASS ZCR and HIGH PASS ZCR (percentage of incorrect decisions versus number of candidates).]

Table 4.4: Measurements F, G and F/G made on the feature space when concatenated feature vectors were used

Preprocessed waveforms used: SPEECH AMPLITUDE, LOW PASS ZCR
No. of candidates    F        G        F/G
 2                   1.426     .293    4.86
 4                   1.638     .626    2.61
 6                   1.770     .858    2.06
 8                   1.900     .973    1.95
10                   2.023    1.142    1.77

Preprocessed waveforms used: SPEECH AMPLITUDE, HIGH PASS ZCR
No. of candidates    F        G        F/G
 2                   1.416     .339    4.17
 4                   1.643     .783    2.10
 6                   1.818    1.055    1.72
 8                   2.003    1.263    1.59
10                   2.144    1.440    1.49

Preprocessed waveforms used: LOW PASS ZCR, HIGH PASS ZCR
No. of candidates    F        G        F/G
 2                   1.426     .323    4.42
 4                   1.631     .695    2.35
 6                   1.811     .924    1.96
 8                   1.971    1.130    1.74
10                   2.105    1.274    1.65

Preprocessed waveforms used: SPEECH AMPLITUDE, LOW PASS ZCR and HIGH PASS ZCR
No. of candidates    F        G        F/G
 2                   1.743     .408    4.27
 4                   2.006     .892    2.25
 6                   2.206    1.195    1.85
 8                   2.402    1.414    1.70
10                   2.565    1.613    1.59
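The order-independence noted in Section 4.2.3 follows from the geometry of concatenation. The sketch below is a minimal illustration, assuming that feature vectors are compared by Euclidean distance; the names and numbers are hypothetical and are not taken from the experiments. Concatenating two n-dimensional feature vectors gives a 2n-dimensional vector, the distance in the concatenated space is the root-sum-square of the distances in the individual spaces, and swapping the order of concatenation leaves that distance unchanged.

```python
import math


def concatenate(*vectors):
    """Join several feature vectors end to end into one longer vector."""
    joined = []
    for vector in vectors:
        joined.extend(vector)
    return joined


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 3-dimensional feature vectors (n = 3 candidates):
# one utterance and one candidate's average, for two waveform types.
amp_utterance = [0.9, 0.1, 0.0]
amp_average = [1.0, 0.0, 0.0]
zcr_utterance = [0.7, 0.3, 0.1]
zcr_average = [0.8, 0.2, 0.0]

d_amp = distance(amp_utterance, amp_average)
d_zcr = distance(zcr_utterance, zcr_average)

# Concatenate in both orders; the feature space now has 2n = 6 dimensions.
utterance_ab = concatenate(amp_utterance, zcr_utterance)
d_ab = distance(utterance_ab, concatenate(amp_average, zcr_average))
d_ba = distance(
    concatenate(zcr_utterance, amp_utterance),
    concatenate(zcr_average, amp_average),
)

print(len(utterance_ab), d_ab, d_ba, math.hypot(d_amp, d_zcr))
```

Concatenating the waveforms before the feature vectors are derived does not share this property, because the junction between the two waveforms then becomes part of the signal that is expanded in the feature space.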
4.3 Measurement of the Variances of the Preprocessed Waveforms

In the initial processing ten preprocessed waveforms of each type were generated for each candidate. Each set of ten preprocessed waveforms was averaged to form three average preprocessed waveforms for each candidate, to be used in the calculation of the basis functions in the Gram-Schmidt orthogonalization. The variance between each individual preprocessed waveform and its average was calculated and then normalized by the root mean square (RMS) value of the average preprocessed waveform. The normalized variances, the average normalized variance for each set of waveforms and the RMS value of the average preprocessed waveform, for all the speakers and for all the waveform types of SPEECH AMPLITUDE, LOW PASS ZCR and HIGH PASS ZCR, are presented in Table 4.5. The overall average normalized variances, that is, averaged over all the preprocessed waveforms of one type of all the speakers, of the preprocessed waveforms SPEECH AMPLITUDE, LOW PASS ZCR and HIGH PASS ZCR are .414, .299 and .257 respectively.

4.4 Discussion of Results

The speaker recognition system was tested using individual preprocessed waveforms, concatenated preprocessed waveforms and concatenated feature vectors derived from individual preprocessed waveforms. The best performance resulted from the cases in which concatenated feature vectors were used. The lowest average error rate obtained was 3.4% with ten candidates, achieved with concatenated feature vectors derived from all three preprocessed waveforms. In all cases the error rate increased approximately linearly with the number of candidates.
In using either concatenated preprocessed waveforms or concatenated feature vectors, the improvement achieved from using all three preprocessed waveforms seems minimal since a reduction of only 0.2% in error rate was achieved with ten candidates in each case.

The method of using concatenated feature vectors has the advantage that the dimensionality of the feature space is increased to 2n or 3n, where n is the number of candidates used in the system tests, depending upon whether 2 or 3 preprocessed waveforms were used. For individual and concatenated waveforms the dimensionality of the feature space is n.

[Table 4.5: Normalized variances for all three preprocessed waveforms. For each preprocessed waveform (SPEECH AMPLITUDE, LOW PASS ZCR, HIGH PASS ZCR) and each speaker (ND, BC, AS, BA, JD, DW, CS, GE, JM, GB), the table lists the normalized variance of each of the ten waveforms, the mean normalized variance, and the RMS value of the mean waveform.]
The method of using concatenated feature vectors requires 2 or 3 times as much storage for the coefficients of the average feature vectors as the method of concatenating preprocessed waveforms, and a similar increase in the time needed to do the calculations to arrive at a decision, depending upon whether 2 or 3 preprocessed waveforms were used. For the additional storage and time requirements an increase in system performance was obtained.

It was observed that the order in which the preprocessed waveforms were concatenated had some effect on system performance. This effect was probably due to the discontinuity where the two preprocessed waveforms joined. When the discontinuity was large there could be more variation between different samples than when the discontinuity was small, resulting in poorer performance. One example of this is the preprocessed waveforms SPEECH AMPLITUDE and LOW PASS ZCR; the concatenated waveform SPEECH AMPLITUDE, LOW PASS ZCR performed better than the concatenated waveform LOW PASS ZCR, SPEECH AMPLITUDE as can be seen in Table 4.2.
The reason is that the preprocessed waveform SPEECH AMPLITUDE usually starts with a large value (40) and ends with a smaller value (15-20), and the preprocessed waveform LOW PASS ZCR starts and ends with around the same values (4-8); thus the discontinuity is smaller having SPEECH AMPLITUDE first and LOW PASS ZCR second.

The error rate depended upon the ordering of the candidates when ten candidates were used to test the system. To see the reason for this effect one has to look at the Gram-Schmidt orthogonalization procedure, which transforms a set of n independent signals $x_1(t), x_2(t), \ldots, x_n(t)$ into a set of n orthogonal signals $y_1(t), y_2(t), \ldots, y_n(t)$ using equations (2.7-2.9). In the system the $x(t)$'s are the average preprocessed waveforms of each candidate and the $y(t)$'s are the orthogonal functions which describe the n-dimensional feature space. This procedure can generate many different orthogonal sets (up to n!, one for each ordering), depending upon the order in which the $x_i(t)$'s are processed. Each different set of orthogonal functions generates a different set of transformation coefficients $C_{ij}$. As shown in Section 2.3, the location of the average feature vector for each candidate depends upon the $C_{ij}$'s.
Therefore the distances between candidates vary from one ordering of the candidates to the next, which results in some candidates being closer together in some cases than in others. In all cases the variation in performance was a matter of only 3 or 4 incorrect decisions out of a total of 100 decisions.

The performance curve for the preprocessed waveform HIGH PASS ZCR has a drop in percent error from eight to ten candidates, as a result of the fact that the system made no more errors for ten candidates than for eight. The effect is also noticeable in the performance curve of the concatenated preprocessed waveform HIGH PASS ZCR, LOW PASS ZCR.

The preprocessed waveforms resulted in much better system performance when they were combined than when they were used alone. The preprocessed waveforms had large variances, with SPEECH AMPLITUDE having the largest average normalized variance of 0.414. SPEECH AMPLITUDE also had three or four samples with normalized variances in the range 0.9 to 1.8, which meant these samples were very atypical. The HIGH PASS ZCR preprocessed waveform had the lowest variance although it performed the worst. Generally the normalized variance of each sample in a set for one candidate was quite close to the average normalized variance of the set. The large variances of the preprocessed waveforms indicate that there is considerable variation in the preprocessed waveforms derived from two different speech samples from the same speaker. The preprocessed waveforms from different speakers seem quite similar, as was noticed when only two candidates were used to test the system.
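The dependence on candidate ordering can be seen in a small sketch of the Gram-Schmidt recursion. It assumes the standard unnormalized form, in which each new orthogonal function is the next waveform minus its projections on the functions already found; this is taken here to correspond to equations (2.7-2.9). The sampled waveforms and the function names are hypothetical.

```python
def inner(a, b):
    """Discrete inner product, standing in for the integral of a(t) b(t) dt."""
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(signals):
    """Unnormalized Gram-Schmidt: return orthogonal functions, in processing order."""
    basis = []
    for signal in signals:
        residual = list(signal)
        for phi in basis:
            coeff = inner(signal, phi) / inner(phi, phi)
            residual = [r - coeff * p for r, p in zip(residual, phi)]
        basis.append(residual)
    return basis


def coefficients(signal, basis):
    """Expansion coefficients of a signal on each orthogonal function."""
    return [inner(signal, phi) / inner(phi, phi) for phi in basis]


# Two similar "average waveforms" (hypothetical samples).
s1 = [1.0, 2.0, 3.0, 2.0]
s2 = [1.0, 2.0, 3.0, 3.0]

basis_12 = gram_schmidt([s1, s2])  # candidate 1 processed first
basis_21 = gram_schmidt([s2, s1])  # candidate 2 processed first

coeffs_12 = coefficients(s2, basis_12)
coeffs_21 = coefficients(s2, basis_21)
print(coeffs_12, coeffs_21)
```

Here the second waveform has coefficients of approximately (10/9, 1) when the waveforms are processed in the order s1, s2, but (1, 0) in the reverse order, so its location in the feature space changes with the ordering. The first coefficient stays close to one because the two waveforms are so similar.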
Considering (2.7 - 2.9) with only two candidates the equations become

$$\phi_2 = \bar{s}_2 - C_{11}\,\phi_1 = \bar{s}_2 - C_{11}\,\bar{s}_1$$

$$C_{11} = \frac{\int \bar{s}_2\,\phi_1\,dt}{\int \phi_1^2\,dt} = \frac{\int \bar{s}_2\,\bar{s}_1\,dt}{\int \bar{s}_1^2\,dt}$$

Therefore the more similar $\bar{s}_1$ and $\bar{s}_2$ are, the closer $C_{11}$ approaches one. $\bar{s}_i$ represents the average preprocessed waveform and $\phi_i$ represents the orthogonal functions. In many cases $C_{11}$ equalled approximately .95, which indicates the two average preprocessed waveforms differ by only about 5%.

For individual and concatenated preprocessed waveforms the values of F and G generally fell in the range of 1 to 1.5 and 0.2 to 1.0 respectively as the number of candidates increased from 2 to 10. In all cases G increases faster than F as the dimensionality of the feature space increases, which results in F/G decreasing. For the case of concatenated feature vectors the values of F and G are equal to the square root of the sum of the squared values of F and G for the individual feature vectors used. The figure of merit, F/G, indicates general trends; for example, as F/G decreases the error rate increases. In some cases F/G was smaller than in other cases but the error rate was also smaller. Thus the relationship between F/G and % error is, strictly speaking, not monotonic over small intervals, but tends to be monotonic for large changes in F/G.

CHAPTER 5

CONCLUSION

The purpose of this thesis was to investigate the usefulness of SPEECH AMPLITUDE, LOW PASS ZCR and HIGH PASS ZCR for identifying different speakers. A system as described in Chapter 2 was simulated and different combinations of the preprocessed waveforms were evaluated for recognition.
It was shown that the preprocessed waveforms could be combined such that an average error rate as low as 3.4% could be achieved with ten candidates. The system error rate seemed to increase linearly with the number of candidates in all cases. The results suggest that the three measurements made on the speech are very effective in identifying speakers.

The preprocessed waveforms used had the advantages that they were easily calculated and required relatively small amounts of storage to characterize the speech samples. The major disadvantage was less speaker dependence than one might wish. Nonetheless these preprocessed waveforms yielded encouraging results. From the point of view of variances and system performance the preprocessed waveform LOW PASS ZCR seemed the best for separating speakers. It also has to be noted that the sentence used was entirely voiced. If a sentence with unvoiced segments had been used, the preprocessed waveform HIGH PASS ZCR might well have been more useful than was the case in the present study.

The orthogonal functions that described the n-dimensional feature space were derived from the average preprocessed waveforms.
The technique used to obtain the orthogonal functions has the advantage that additional speakers are easily added to the system, since only the additional orthogonal functions and average feature vectors have to be calculated.

Further investigation might include use of a sentence that contains unvoiced segments in order to improve the performance of the preprocessed waveform HIGH PASS ZCR. The system could also be examined using more speakers in order to determine how the error rate depends on the number of candidates when this number exceeds ten. More samples from each speaker could be obtained to see whether the variance and the average of the preprocessed waveforms change.

The measurements made on the speech sample appeared quite effective for speaker recognition. Perhaps the system performance could be improved if the speaker was required to speak more than one sentence, particularly if the initial sentence did not yield results having a sufficient level of confidence.
