AUTOMATIC INFANT CRY ANALYSIS AND RECOGNITION

by

QIAOBING XIE

B. A. Sc. (Computer Application), China Textile University, 1982
M. A. Sc. (Computer Science and Application), China Textile University, 1985

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE STUDIES
THE DEPARTMENT OF ELECTRICAL ENGINEERING

We accept this thesis as conforming to the required standard

June 1993

© Qiaobing Xie, 1993

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

The University of British Columbia
Vancouver, Canada

Abstract

This dissertation is a report of my investigation on introducing modern speech processing/recognition techniques to the field of infant cry research, and on developing efficient and effective methodologies for automatic assessment of the physical/emotional situation of infants using the information derived from the cry signals.

I first identify some problems facing present infant cry research, especially those obstructing the practical applications of the results generated from basic research. By demonstrating the similarities between infant cry generation and adult speech generation, I establish the theoretical foundation for the development of my new automatic cry processing/analysis techniques.

In particular, I develop the new concept of cry phonemes as an effective method for representing cry signals for automatic cry analysis.
Based on the cry phonemes, I further define a composite parameter, the H-value, which can be calculated from the cry signal, and is found to be a reliable indicator of the distress level of the infant.

Using these new concepts, I design two automatic infant cry analysis systems. One system is based on my newly developed nonparametric VQ-kernel classifier, and the other system is based on the Hidden Markov Model technique. Each of these systems estimates the H-value from the cry signal automatically. This, in turn, is utilized in the automatic assessment of the infant’s distress level.

The performance of these two systems was evaluated with cries uttered by 36 infants. I found that both systems give assessments of infants’ distress levels consistent with the perceptions of experienced parents who listened to the recording of the same cries. This demonstrates the effectiveness of my newly developed techniques.

In addition, the methodologies developed in this research can be easily generalized and applied to other problems of normal and abnormal infant cry analysis.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments

1 INTRODUCTION
  1.1 The Overall Research Plan

2 INFANT CRY STUDY: HISTORY, ADVANCES AND CHALLENGES
  2.1 History
  2.2 Major Directions in the Study of Infant Cries
  2.3 Techniques and Instruments
    2.3.1 Auditory Analysis
    2.3.2 Non-acoustical Analysis
    2.3.3 Time Domain Acoustic Analysis
    2.3.4 Sound Spectrographic Analysis
    2.3.5 Computer-aided Analysis
  2.4 Remaining Problems
    2.4.1 Controversy on How Much Information Infant Cry Conveys
    2.4.2 Problems Facing Modern Computerized Infant Cry Analysis
  2.5 Summary

3 AN OVERVIEW OF APPLICABLE VOICE AND SPEECH PROCESSING TECHNIQUES
  3.1 Human Vocalization Organs and Speech Production
  3.2 The Digital Model of Speech Production and Linear Prediction Coding (LPC) of Speech Signals
    3.2.1 Time-varying Filter Model of Speech Generation
    3.2.2 Linear Prediction of Speech Signal
  3.3 Pattern Recognition
  3.4 Automatic Speech/Speaker Recognition
    3.4.1 Automatic Speech Recognition
    3.4.2 Speaker Recognition
  3.5 Vector Quantization
    3.5.1 Some Details of Vector Quantization
  3.6 Hidden Markov Models
    3.6.1 Description of HMMs
    3.6.2 Three Problems
    3.6.3 Speech Recognition with HMMs

4 FEASIBILITY STUDIES ON APPLYING SPEECH PROCESSING TECHNIQUES TO CRY ANALYSIS
  4.1 Investigating the Physiological Similarities between Infant Crying and Adult Speech
    4.1.1 A Survey: Physiology of Infant Crying and a Physioacoustic Model of Cry Production
  4.2 Formant Frequency Estimate Using an LPC Model
    4.2.1 Theoretical Analysis
    4.2.2 Experiment
    4.2.3 Results
  4.3 Some Technical Considerations of Automatic Cry Analysis
    4.3.1 Training of the Automatic Cry Analyzer
    4.3.2 Feature Selection

5 THE H-VALUE: A MEASURE FOR THE AUTOMATIC ASSESSMENT OF NORMAL INFANT DISTRESS-LEVELS FROM THE CRY SIGNAL
  5.1 The Level of Distress (LOD) Ratings of Normal Infant Cries
  5.2 Definitions of “Cry Phonemes” for Normal Infant Cries
  5.3 The Relationship between “Cry Phonemes” and the Parents’ LOD Ratings
    5.3.1 Experiments
    5.3.2 Results
  5.4 The Introduction of the H-value and Its Application in Quantifying LOD Assessments
  5.5 Discussions and Remarks

6 CRY ANALYZER I: A NONPARAMETRIC STATISTICAL CLASSIFIER-BASED METHOD
  6.1 A Survey of Traditional Statistical Classification Methods and Their Limitations
  6.2 The Development of the VQ-based Nonparametric Classification Approach
    6.2.1 Development of the Algorithms
    6.2.2 Experiments of the Application of Our VQ-based Classifiers
    6.2.3 Summary
  6.3 Automatic Cry Analyzer Based on the VQ-kernel Methods
    6.3.1 System Design
    6.3.2 Determination of VQ-kernel Classifier Parameters
    6.3.3 Training of VQ-kernel Classifier
  6.4 Results and Discussions

7 CRY ANALYZER II: A HIDDEN MARKOV MODEL (HMM)-BASED METHOD
  7.1 System Design
    7.1.1 Data Preprocessing and Feature Extraction
    7.1.2 Vector Quantization of the Feature Vectors
    7.1.3 Segmentation of the Signal into Recognition Units
    7.1.4 Configuration of the HMM-based Analyzer and the Calculation of the H-value
  7.2 Training of the HMMs
  7.3 Experiment Results and Discussion

8 CONCLUSIONS
  8.1 Accomplishments and Contributions
  8.2 Some Topics for Future Study on Automatic Infant Cry Analysis
    8.2.1 Refining the Techniques and Methods We Have Developed
    8.2.2 Extending Our New Methodology to Other Infant Cry Research Topics
    8.2.3 Developing New Products Based on Our Computerized Infant Cry Analysis Techniques

Appendix A Linear Predictive Coding (LPC) Algorithms
Appendix B Linde-Buzo-Gray’s Algorithm of Vector Quantizer Design
Appendix C Bayes’ Theorem for Statistical Classifier Design
Appendix D The HMM Computation and Training Algorithms
Appendix E Instruction to the Parents in the Infant LOD Rating Experiment
Bibliography

List of Tables

5.1 Characteristics of the 36 infants
6.1 CPU time used for finding the reduced set in the speech data experiment. (*including CPU time used for classification)
7.1 Distribution of the micro-segments in our training set

List of Figures

3.1 Schematic diagram of the human vocal system. (after Flanagan et al.)
3.2 A typical glottal pulse train
3.3 Spectra in speech generation: (a) idealized spectrum of glottal pulse train, (b) frequency response of vocal tract, where the peaks correspond to formants, and (c) spectrum of the resultant speech
3.4 A two-stage speech generation model
3.5 The digital time-varying filter model of speech production
3.6 A pattern recognition system
3.7 A 3-state HMM with 4 output symbols
4.1 A simplified view of the cry production model
4.2 (A) Spectrum of an idealized periodic source; (B) spectrum of a turbulence source; (C) vocal tract transfer function; (D) spectrum of an idealized radiation characteristic; and (E) spectrum of an idealized output. (after Golub and Corwin)
4.3 LPC formant estimate experiment of infant cries
4.4 (a) Estimated formant frequencies by using the LPC-based method on a pain cry. Estimations marked with circles indicate the presence of the hyperphonation. (b) Spectrogram of the same cry. In both graphs, the trajectories of the first formant, second formant, and third formant are labeled by A, B, and C, respectively.
5.1 The time-frequency patterns of the 10 “cry phonemes” shown in the computer-derived spectrograms. They are: (1) trailing, (2) flat, (3) falling, (4) double harmonic break, (5) dysphonation, (6) rising, (7) hyperphonation, (8) inhalation, (9) vibration, and (10) weak vibration, respectively
5.2 An example of labeling a pain elicited cry into the “cry phonemes.” Numbers on the top indicate the mode or type of the “cry phoneme” of each segment of the cry signal. The curve below indicates the energy of the signal
5.3 The relationship of parents’ LOD ratings to the percentage occurrences of different “cry phonemes”
5.4 Relationship between the H-value and the parents’ LOD ratings
6.1 The classification error rates of our VQ-based classifiers and other traditional reduced data classifiers
6.2 The classification error rates of our VQ-based classifiers, the traditional reduced data NN classifiers, and Fukunaga’s reduced Parzen classifier for the speech data
6.3 Automatic infant cry analysis system based on the VQ-kernel classifier
6.4 The generation of the reduced training sets for the VQ-kernel classifier
6.5 Determination of the decision threshold, t
6.6 The estimated classification error rate vs. the decision threshold t
6.7 The H-values which are estimated by our VQ-kernel classifier-based cry analyzer versus the actual measurements. The estimation errors appear as the vertical distances from the circles to the diagonal dashed line
6.8 Relationship between the H-values estimated by the VQ-kernel classifier-based cry analyzer and the parents’ LOD ratings
7.1 The block diagram of our HMM based cry distress level analysis system
7.2 Vector quantization
7.3 The segmentation of a cry signal
7.4 Topology of the hidden Markov models for micro-segment identification
7.5 The structure of the H-HMM (the same topology is used for the E-HMM)
7.6 The HMMs-based classifier
7.7 The computer estimations versus the actual measurements of the H-values of the 58 cries in the testing set. The estimation errors appear as the vertical distances from the circles to the diagonal dashed line
7.8 Computer estimated H-value versus parents’ LOD rating

Acknowledgments

First, my greatest debt of gratitude goes to my supervisors, Professor Rabab K. Ward and Professor Charles A.
Laszlo, who introduced and guided me into this exciting field of research, and provided sound advice and invaluable critical feedback at every stage of this project. Without their support, this thesis could not have been written. I would also like to thank Dr. M. J. Yedlin, who served as an examining committee member in my proposal defence, departmental defence, and final oral examination, for his advice and suggestions.

I am very grateful to Dr. Ruth V. E. Grunau of BC Children’s Hospital for allowing me to use the infant cry data she collected in her previous research.

Thanks also go to the parents who participated in our experiment and provided the important LOD ratings.

I also wish to acknowledge Pingnan Shi and Hui Peng for their friendship and encouragement over these four long years.

Finally, I would like to acknowledge that this project was partly supported by NSERC grants to Dr. Rabab K. Ward and Dr. Charles A. Laszlo. The collection of the infant cry data which Dr. Grunau supplied was supported by a grant to Dr. K. D. Craig from NSERC.

I dedicate this work to Mrs. Wei-Lie Zeng and Mr. Shou Jiang Xie, my mother and father.

Chapter 1
INTRODUCTION

This work presents our research in applying modern digital computer-based speech processing/recognition techniques to the automatic analysis of infant vocalizations.

Crying and other forms of non-verbal vocalization play a very important role in the communication of infants with their caretakers. It has been speculated since the last century that such vocalizations also contain information about the infants’ physical and emotional situations, and that the proper interpretation of that information can lead to effective and efficient methods of monitoring the infants’ well-being.
In turn, this could lead to the diagnosis of some diseases and abnormalities.

During the past three decades significant progress has been made in understanding the infant cry generation mechanism and the various relationships between infants’ physical situations and different cry attributes. Nevertheless, many important questions remain unanswered, and no widely accepted applications of infant cry analysis are found in practice. It is particularly striking that many years of effort to convert the findings of this fundamental research into practical applications have brought few results. This, in our opinion, is because highly effective and efficient methods of analysis have not been used in infant cry research. We believe that more engineering-oriented methodologies should be introduced into infant cry research so that the findings and results of past cry research can be transferred into a form suitable for practical applications.

In recent years, automatic speech processing/recognition technology has gone through rapid development. Many new techniques, such as vector quantization and hidden Markov modeling, have been introduced and successfully applied to automatic speech processing/recognition. If we consider infant cry signals as signals closely related to speech, there is an excellent opportunity to investigate the introduction of modern speech processing/recognition techniques into infant cry research. As a result of introducing these techniques, we expect to obtain new analysis methods which are effective and efficient from the viewpoint of engineering-oriented application. Our plan of research is detailed in the following.

1.1 The Overall Research Plan

We plan to achieve the following objectives in our investigation.

1. To identify the problems facing infant cry application research: We will review the various theories, methodologies, techniques, and instrumentation developed for infant cry research in the past three decades. We will also examine the large body of findings and results on infant crying in the literature. In particular, we will try to identify the most difficult problems and the obstacles that hinder the improvement of the effectiveness and efficiency of the techniques employed in the study of infant cries. This will allow us to evaluate the potential of modern signal and speech processing methods in infant cry research and clinical applications.

2. To investigate the theoretical feasibility of applying speech processing/recognition technology to infant cry research: It is important to note that the theory and methodology of modern speech processing are heavily based on specific physioacoustic models of speech generation. The success of our introduction of modern speech processing technology into infant cry research therefore depends fundamentally on the physiologic and acoustic similarities between the mechanisms of speech generation and cry generation. This phase of our work is designed to compare the speech and cry generation mechanisms, and to determine the applicability of existing physioacoustic models to the cry sound.

3. To address the special problem of assessing normal infant distress levels from cry signals: Provided that we can show that speech processing technology and methodology are applicable to cry sounds, we will then concentrate our investigation on one special problem. This problem is the automatic assessment of the physical and/or emotional situation of normal (clinically healthy) infants from the cry sounds emitted by them. In particular, we plan to develop effective and efficient methods to characterize and represent normal infant cry sounds in a way suitable to automated analysis.
Based on this signal representation technique, we will then try to find parameters to measure and quantify the distress level of normal infants. These parameters should be able to effectively indicate the infants’ physical/emotional situations. Equally important, it must be possible to estimate these parameters reliably from the cry sound by means of modern signal processing/recognition techniques.

4. To develop automatic infant cry analysis systems based on our previous investigations and evaluate their performance on computer: In this stage of our research, we will work on the system design of the automatic infant cry analyzer. The system to be designed should be capable of automatically analyzing the cry sound and giving a reliable estimate of the distress level of the infant. Besides the effectiveness and accuracy of the analysis, we will also pay great attention to the efficient implementation of the cry analysis system, efficient in the sense of both engineering and clinical practice.

In Chapter 2, we review the history of infant cry study, survey the technology developments in this field, and discuss the main problems facing current infant cry research. Chapter 3 presents a survey of modern speech processing/recognition techniques, with emphasis on those methods which have the most potential for automatic infant cry analysis systems. The feasibility of applying speech processing technology to infant cry analysis, and some technical considerations related to such application, are studied in Chapter 4. In Chapter 5, we discuss the problem of computer-based assessment of normal infant distress levels from the cry signal and present a “cry phoneme” system to effectively characterize normal infant cry signals. A cry signal-derived indicator of infant distress level, the H-value, is introduced and examined in this chapter.
In Chapters 6 and 7, we present two different implementations of our automatic infant cry analyzer: one based on our newly developed vector quantization kernel nonparametric classifier, and the other on the hidden Markov modeling technique. The performance of both systems is tested with real cry data. In Chapter 8, we summarize the accomplishments of the current phase of our research, and discuss future directions of automatic infant cry research.

Chapter 2
INFANT CRY STUDY: HISTORY, ADVANCES AND CHALLENGES

2.1 History

Does infant crying mean anything more than the infant’s need for attention? This question has stimulated curiosity among physicians, parents, and scientists for a very long time.

As early as the nineteenth century, scientists began to believe that infant cry sounds contained information about the child, about his or her physical and emotional well-being, and about the cry-provoking situation. In 1832, Gardiner described the infant cry with reference to its location on a piano keyboard and as having an up-down melodic pattern. Also, when showing a series of photographs depicting various grimaces in the expressions of crying infants, Charles Darwin hinted at the notion that crying contains meaningful information.

In the early twentieth century, researchers started to use the International Phonetic Alphabet to note their perceptions of infant cries. It was not until the invention of the magnetic tape recorder and the sound spectrograph in the early 1950s that more systematic acoustic investigations of infant crying were able to attract wide attention. Lynip [65], in 1951, was probably the first scientist to investigate infant cries with a sound-spectrographic device, the sonograph. He measured the fundamental frequency of the infants’ early utterances and searched for different vowel-like patterns in the cry sound. In 1962, Wasz-Höckert et al. [122] reported the first findings of their sound-spectrographic analysis of birth, pleasure, hunger and pain cries.
This actually marked the beginning of modern acoustic cry research. Important fundamental facts were published by Truby, Bosma, and Lind [8, 114] in 1965. Using sound spectrography, cineradiography, and recordings of the intraesophageal air pressure, they thoroughly described various acoustic phenomena in pain-elicited cries and the motions of different respiratory organs while these cries were recorded. In 1968, a monograph on the statistical analysis of the cries of both healthy and abnormal infants was published by Wasz-Höckert et al. [118]. This work is a milestone in infant cry research, since it set up the basic methodology that was followed by many researchers for over two decades. Michelsson et al. [69, 70, 72–78] began using more sophisticated spectrographic techniques in the early 70s, and conducted a series of investigations on the cries of infants with various diseases or abnormalities. They found that some characteristics of the cries of abnormal infants differ significantly from those found in normal infants.

Starting in the middle 70s, significant advances in electronics, signal processing theory, and computer technology occurred. These technological developments gradually had their influence on infant cry research. Tenold et al. [113] in 1974 investigated the variability of the fundamental frequency F0 and the cry spectra of full-term and premature infants’ cries using cepstral and stationarity analysis. Golub and Corwin [33–35] in the early 80s reported their results on computer-aided cry analysis, which utilized computer signal processing techniques. Their findings suggest that the analysis of the infant cry holds promise for detecting a number of abnormalities. They also introduced a physioacoustic model of cry production in an attempt to relate the acoustic properties of the cry to the anatomic and neurophysiologic functioning of the infant. In 1986 and 1988, Fuller et al.
[26, 27] reported their findings on the differing acoustic characteristics of four types of infant vocalizations. A PDP 11/34 computer was employed in their study. In 1989, Ludge and Gips [63], using a microcomputer with specifically designed hardware, studied jitter in the cries uttered by newborn infants.

Recent years have also seen the beginning of the use of some sophisticated signal processing techniques in cry research. For example, Fast Fourier Transform (FFT) analysis was employed in both normal and abnormal infant cry studies by Rapisardi et al. (1989) [103], Vohr et al. (1989) [116], and Fuller (1991) [25].

Nevertheless, in general, the tremendous recent improvements in electronics, signal processing, and computer technology have not yet been fully appreciated and exploited by researchers in the infant cry field. Some researchers seem reluctant to utilize modern signal processing theories and to replace the labor of “reading” spectrograms with advanced computer-based analysis techniques. It is somewhat surprising that the enthusiasm for infant cry research has even faded a little since the 60s and 70s.

To summarize, despite the many years of systematic research and the numerous publications on infant cry research exemplified above, our knowledge about the infant cry is still surprisingly limited. Furthermore, no widely accepted practical application has emerged from the work of the past three decades in this field. While much of this is apparently due to the tremendous complexity of the infant cry phenomenon, we believe that it is also due to the inadequacy of the tools which have been employed in the past.
In the next section, a detailed survey of past work and an analysis of the major problem areas facing infant cry research are presented.

2.2 Major Directions in the Study of Infant Cries

In the past, infant cry research has generally approached the problem from one of five viewpoints.

1. Psychological Investigations on the Subjective Perceptions of Infant Cries

Cries from both normal (clinically healthy) and abnormal infants are used in such work. In some cases the cries are grouped according to the (sometimes presumed) type of stimulus that caused the cry (e.g., “birth”, “pleasure”, “hunger”, “cooing”, “pain”). The investigations usually aim to obtain information about:

• the perception of adults of infant cries [9, 10, 39, 83, 115, 118], e.g., the parents’ ability to distinguish different cry-types and to identify the cries from abnormal infants;
• the responses of adults to infant cries [12, 19, 20, 81, 82, 112], e.g., the verbal and behavioral responses of adults with varying temperament to various types of cries; and
• infant care and nurse training [44, 79, 111, 121].

2. Research on Physiological and Developmental Aspects

It has been postulated that infant crying is a reflection of a variety of complex neurophysiological functions [34, 51, 114]. Many studies have therefore aimed at investigating the infants’ physiological and developmental status by analyzing the cry signals. Such investigations include:

• the cries and the infants’ linguistic development [54, 55]; and
• the cries and the infants’ respiratory system physiology, and infants’ neurological and psychological development [8, 35, 43, 49, 56, 57, 68, 96, 114, 129].

3. Assessment of Emotional/Physical Situations of the Infant

The importance of assessing the emotional/physical situations of infants is obvious. More demanding is the problem of medical assessment of pain in infants.
Approaches to assessment based on the analysis of such parameters as facial expression, heart rate, respiratory rate, transcutaneous oxygen level, body movement, and vocal behavior have been suggested [11, 26, 27, 37, 38, 52, 89]. Among them, methods based on cry analysis have the potential to be the most convenient.

4. Research on Abnormal Infant Cries for Diagnostic Purposes

Most studies in this group utilize spectrographic methods, sometimes with the aid of a computer, to obtain potentially diagnostically valuable information from the cry sounds of infants with specific diseases or abnormalities. Usually only pain cries, which are commonly induced by stimuli such as a pinch on the infant’s arm or ear, are studied, in an attempt to control and standardize the intensity of the stimulation. Investigations have been carried out to examine the correlations between various spectral and temporal attributes and particular medical problems. These include oropharyngeal anomalies, asphyxia neonatorum, symptomless low birth weight, herpes encephalitis, congenital hypothyroidism, hyperbilirubinemia, bacterial meningitis, hydrocephalus, bradycardia, various forms of brain damage, malnutrition, genetic defects, and sudden infant death syndrome (SIDS) [34, 59, 60, 69–78, 90, 103, 109, 116].

5. Detection and Recognition of Infant Cries as a Kind of Warning Signal for Hearing Impaired Parents

It is very important to alert deaf and hard of hearing parents whenever their baby is crying. Their constant awareness of the baby’s emotional and physical situation is critical for properly caring for the infant. Cry research has been expected to provide data and methods which can lead to the building of reliable alerting devices for this purpose.

Unfortunately, little work has been done in this area. Vuorenkoski in 1970 [117] introduced an infant cry detection and analysis device, the Cry Detector.
The device was actually designed for collecting and analyzing cry samples in a clinical neonatal ward. It compares measurements of signal energy from two different channels, the total channel with a frequency range of 150–7000 Hz and the abnormally high pitch channel with a frequency range of 1000–7000 Hz, with preset thresholds. Any acoustic signal longer than 400 msec is considered a cry. The Cry Detector was manufactured by Special Instruments, Sweden, in 1971, but objective evaluation has indicated that its usefulness in practice was limited [119, page 86].

In 1986, Lundh [64] reported a new baby-alarm based on tenseness information [footnote a] of the cry signal. Unlike the conventional baby-alarm for hearing impaired parents, which only activates a flash or a vibrator when the sound emitted by the baby exceeds a predetermined threshold, this new device can give suggestions about the baby’s feelings (happy, crying, or distressed) by illuminating a picture on its panel which shows either a smiling face, a tearful face, or a screaming face. After being tested in ten deaf families, the device was judged to be useful and helpful, but it suffered from incorrect identification of the baby’s “feelings” and from frequent false alarms.

Footnote a: In Lundh’s definition, tenseness is represented by two types of signal: tense and relaxed. Tense sounds are strident, and a high intensity of upper harmonics can be observed in the associated spectrograms.

2.3 Techniques and Instruments

In this section, we review the evolution of the techniques and instruments employed in cry analysis over the past three decades and evaluate the development of the underlying technology.

2.3.1 Auditory Analysis

The most readily available means for cry analysis is the human ear, and the art of diagnostic listening was already described in ancient times.
Flatau and Gutzmann in 1906 used a graphophone [footnote b] to record infant vocalizations, and listened to recordings of the cries of 30 neonates. They noted 3 infants with highly pitched phonations. Fairbanks [15] in 1942 listened to gramophone [footnote c] records to study the frequency characteristics of the “hunger wails” of one infant over a period of 9 months. Wasz-Höckert et al. in 1964 [120] found from their tape recordings that hunger, pain, pleasure, and birth cries can be identified auditorily. Valanne et al. in 1967 [115] reported that mothers can recognize the vocalizations of their own infants. Also, Partanen et al. in 1967 demonstrated that the pain cries of healthy infants could be differentiated from the cries of sick babies with certain diseases. They also showed that after a training period of approximately 2 hours, 82 pediatricians could distinguish normal from pathological cries very accurately [92].

These reports confirm the common-sense experience that some basic information can be obtained by simply listening to cries.

Footnote b: A type of phonograph using wax to record.
Footnote c: Another type of phonograph, which records and reproduces sounds by using a metal disk (instead of a wax cylinder) covered with a thin coat of oil or grease.

2.3.2 Non-acoustical Analysis

To understand the anatomy of the vocal tract of infants, the functioning of the different respiratory organs during cries, and the relationship between crying and the infants’ neurophysiological state, many cry studies have utilized methods which extract information from sources other than the acoustic signal of the cry.

For example, Truby and Bosma et al. [8, 114] in 1965 used cineradiography, spirography, and intraesophageal pressure recordings together with sound spectrograms to investigate the relation of infants’ cry-sound to cry-act. In 1980 Langlois et al.
[49] investigated the respiratory behavior of infants during both cry and non-cry vocalizations with impedance pneumography.

2.3.3 Time Domain Acoustic Analysis

To obtain time domain information about the cry signal, early researchers used direct writing oscillographs and other devices that could graph the sound magnitude or waveform of the cry as a function of time on a paper chart.

Fisichelli and Karelitz, in the early 60s, used such a device to examine infant cries. They found that infants with diffuse brain damage required a greater stimulus to produce 1 minute of crying [45], and that the mean latency period between the pain stimulus and the onset of the cry was significantly longer for abnormal infants (2.6 sec) than for healthy infants (1.6 sec) [16]. Wolff (1967, 1969) [124, 125], using a similar device, measured inspiratory as well as expiratory phonations and found that in pain-induced cries, the cry units (each being one expiratory phonation) are longer at the beginning of the cry than at the end.

Time domain instruments are usually easy to operate, inexpensive, and reliable. However, they can provide only gross information about the cry signal.

2.3.4 Sound Spectrographic Analysis

Sound spectrographs can provide a permanent visual record of the sound, showing the distribution of energy in different frequency bands versus time. The technique was originally invented at Bell Laboratories in the late 1940s in an attempt to present speech visually to deaf people. While this goal was not accomplished, due to the excessive time-frequency complexity of speech signals, the technique itself has become very useful and important in many areas of acoustic signal research.

After its invention, the sound spectrographic technique quickly found its way into infant cry studies and became quite popular after 1950. A Scandinavian research team, headed by O. Wasz-Höckert and I.
Lind, is most noted for setting standards for the study of the infant cry with spectrographic techniques. In particular, it was they who introduced the definitions of many of the commonly used spectrographic features. The following are short descriptions of these features as given by Golub and Corwin [35]:

1. Duration Features:
• Latency period: The time between the pain stimulus applied to the child and the onset of the cry sound. The onset of cry is defined as the first phonation lasting more than 0.5 seconds.
• Duration: This feature is measured from the onset of the cry to the end of the signal and consists of the total vocalizations occurring during a single expiration or inspiration. The boundaries are determined by the point on the spectrogram where the sound “seems” to end.
• Second pause: The time interval between the end of the signal and the following inspiration.

2. Fundamental Frequency Features:
• Maximum pitch: The highest value of the fundamental frequency F0 on the spectrogram over the entire expiratory period (footnote d).
• Minimum pitch: The lowest value of the F0 on the spectrogram over the entire expiratory period.
• Pitch of shift: Frequency after a rapid increase in F0 seen on the spectrogram.
• Glottal roll or vocal fry: Aperiodic phonation of the vocal folds, usually occurring at the end of an expiratory phonation when the signal becomes very weak and the F0 becomes very low.
• Vibrato: Defined to occur when there are at least four rapid up-and-down movements of F0.
• Melody type: Either falling, rising/falling, rising, falling/rising, or flat.
• Continuity: A measure of whether the cry was entirely voiced, partly voiced, or voiceless.
• Double harmonic break: A simultaneous parallel series of harmonics in between the harmonics of the fundamental frequency.
• Biphonation: An apparent double series of harmonics of two fundamental frequencies. Unlike double harmonic break, these two series seem to be independent of each other.
• Gliding: A very rapid up and/or down movement of F0, usually of short duration.
• Noise concentration: High energy peak at 2000-2300 Hz, found both in voiced and voiceless signals.
• Furcation: Term used to denote a “split” in the F0 where a relatively strong cry signal suddenly breaks into a series of weaker ones, each one of which has its own F0 contour. It is seen mainly in pathological cries.
• Glottal plosives: Sudden release of pressure at the vocal folds producing an impulsive expiratory sound.

Footnote d: In spectrographic analysis, usually only one expiratory phase of the cry is chosen and analyzed for each infant.

A wealth of data and findings, especially concerning abnormal infant cries, have been reported in past sound spectrographic cry studies. Undoubtedly, sound spectrography has been a useful tool and has contributed significantly to recent advancements in infant cry study. However, there are severe limitations to the technique that have hindered the widespread application of spectrographic analysis in medical practice. First, the spectrogram has poor dynamic range and often inadequate frequency resolution. Secondly, visual inspection is required to extract acoustic information from the spectrogram. This is usually a long and tedious process that requires much expertise. The results are also subject to the expertise and biases of the inspector. As a result, it is not easy to analyze a large set of cry samples quickly, accurately, and consistently.

2.3.5 Computer-aided Analysis

It is becoming more and more evident that computerized methods and highly efficient signal processing techniques are the future of infant cry research. Recently, researchers have begun to apply computer-aided analysis methods to infant cry study [25-27, 33, 63, 103, 116]. The computer and signal processing techniques used by these researchers, however, are usually simple and basic.
In some cases, the electromechanical spectrography device was simply replaced by a computer that calculates the spectrogram with the Fast Fourier Transform (FFT) algorithm and displays the results electronically on a terminal screen, instead of printing them on the traditional paper charts. While this approach improves the dynamic range of the spectrogram and provides flexibility in observing the signals, the essence of these “new” methods is in fact identical to that of the traditional approach. The concepts of spectrographic cry analysis are unchanged, the same types of acoustic features are measured, and the computer system serves only as an improved spectrography machine. This will be discussed in greater detail in the next section.

2.4 Remaining Problems

By any standard, infant cry research is itself still in its infancy! As recently stated by Lester [51], “. . . despite many years of programmatic research and numerous published articles on infant crying, we know surprisingly little about the topic.” In this section, we will explore some of the unsolved and controversial issues, and summarize the problems facing infant cry research.

2.4.1 Controversy on How Much Information Infant Cry Conveys

It has become a widely accepted belief that infant crying carries meaningful information that a listener can use to estimate the infant’s physiological and/or emotional state. However, it is important to point out that this is still no more than a postulate.
There are still some who argue that infant crying may contain too little information that can be extracted and used for the above purpose.

After surveying and comparing the data and results from a number of previous studies, Hollien (1980) [43, page 28] concludes that:

“it would appear that the cries of normal infants carry too little perceptual information to permit auditors to identify the condition that evoked them. Therefore, it might be hypothesized that, within the normal home situation, the cry generally acts simply to alert the mother and most (if not all) of her suppositions concerning the situation that evoked the crying behavior must be based on additional environmental cues.”

As to the perception of infants’ health states, Hollien’s conclusion is rather negative. He says:

“it must be concluded that the health of the neonate probably cannot be deduced from perceptual analysis of his or her cries.” (ibid., page 31)

He also surveyed previously published results of acoustic and spectral analyses of infant crying, and indicates that:

“it appears possible that neonates may exhibit different types of cries and that these cry classes might [be] related to some behavioral or physiological event/condition. Unfortunately, however, there is little or no quantitative (spectral) evidence to indicate what these cries might be or relative to how they might differ, one from another. Hence, the resulting data provide little guidance when the acoustic analyses of neonatal crying are to be considered.” (ibid., page 39)

Hollien’s opinions are shared by Gardoski (1980) [28], who also discusses the meaningfulness of infant cry. He says (on pages 108-109):

“Previous research does not provide unequivocal support for the contention that the vocalizations of an infant are sufficient to specify the cause of distress. . . . Every cry is not necessarily induced by hunger or some immediately obvious external cause. . . .
If difficulties arise in specifying the exact cause of a cry, even with the observer’s knowledge of the circumstances of the cry-provoking situation, inferring the probable cause of a cry must be even more difficult without that additional knowledge. The task may become impossible if neonatal crying is initially an innate mechanism that serves primarily to promote proximity. . . . It has not been shown that the audible signal alone can be indicative of anything more than the presence of gross abnormality.”

In her book The Developing Child [5, page 156], Helen Bee also says:

“Infants may have several cries with somewhat different sound patterns, but those different sounds do not seem to be related to different kinds of discomforts or problems. Parents often feel that they can distinguish between a ‘hunger cry’ and a ‘wet diaper cry’, but when children’s cries are recorded and played back to parents, the parents are unable to tell the cause of the cry.”

These comments vividly illustrate the difficulties infant cry research now faces. At the least, it seems clear that either the infant imposes little situation-related information on his/her cry sounds besides a desire for attention, or the perceptual and traditional spectrographic approaches have failed to extract enough information to reveal the relationship between the cry production and its possible cause(s). It can also be inferred from Gardoski’s and Bee’s comments that the relationship between the cause(s) and the crying may not be as straightforward as suggested by the classification of cry types (e.g., pain, hunger, cooing) used in the past.

2.4.2 Problems Facing Modern Computerized Infant Cry Analysis

As we pointed out earlier, computerized analysis is a trend of present infant cry research.
It appears to be the logical avenue for extracting detailed information from cry signals so that the complex relationships between the cause(s) and the cry production can be truly understood. Furthermore, computerization is crucial for bringing the results of infant cry research into clinical applications, a goal that is just as important as the research itself. Therefore, we view the development of computerized automatic infant cry analysis techniques as one of the objectives of present-day cry research.

Unfortunately, many past studies were motivated by academic interest. Thus, the emphasis was on fundamental investigation, with little thought given to clinical applications or to the engineering problems associated with the design of an automatic cry analyzer.

To realize a practical application, which by necessity must involve computerized automatic analysis of infant cries, modifications to the traditional methodology must be made. The following issues need to be reconsidered.

2.4.2.1 Excessive editing of the cry signal before analysis

In many past investigations, only the first expiratory phase of the cry signal was retained for analysis while the rest of the signal was usually discarded. This kind of editing usually deletes short signals of under 0.4 seconds duration, such as the glottal plosives. Such editing was necessary in the 1960s since it drastically reduced the complexity and variation of the cry signal. It also made it possible to handle the usually long cry signals with a spectrograph, or sonagraph, which could analyze only a few seconds of sound recording at a time.
Obviously, this kind of excessive manual editing is not compatible with computer-based automatic cry analysis.

2.4.2.2 Traditional measures not suitable or efficient for automatic real-time analysis

Latency, duration, fundamental frequency (pitch), and formant frequencies have been the most commonly investigated measures. Their introduction into cry research in the 1960s was mainly due to the availability of instrumentation and the development of spectrographic techniques. Such measures are the ones a spectrogram reader can most conveniently take directly from the paper charts. Selecting measures to describe the cry on this basis does not guarantee that they will be suitable and/or efficient for modern computerized signal analysis.

2.4.2.3 Lack of unique definitions for many commonly used measures

Different researchers postulate different definitions for commonly used measures such as the duration and the fundamental frequency (F0). For example, some researchers exclude the hyperphonated periods of the cry signal when they calculate the F0-related measures while others include them. This is likely the cause of some conflicting reports. For example, both Wasz-Höckert et al. (1968) [118] and Fuller et al. (1986) [26] carried out investigations on the mean F0 of 2-6 month-old normal infants. Wasz-Höckert reported a mean F0 of 530 Hz for pain cries, 500 Hz for hunger cries, and 440 Hz for pleasure sounds, while Fuller obtained 450 Hz for pain, 490 Hz for hunger, and 355 Hz for pleasure sounds. So Wasz-Höckert concluded that pain cries have the highest mean F0, but Fuller’s data show that the highest mean F0 is from hunger cries!

2.4.2.4 Past findings hard to use directly in engineering applications

Most findings in the cry-study literature are reported in terms of statistical analysis.
A commonly employed tool is the analysis of variance (ANOVA), and findings often report the statistical differences, at a certain significance level, among the different quantities investigated. This kind of presentation of results is popular in fields such as the social and behavioral sciences, but its usefulness in signal analysis is very limited.

In engineering applications, according to the theory of pattern recognition [22, 91], it is always desirable that the feature variables possess a separable distribution in the feature space. This enables the variables to be used as reliable indicators to distinguish and classify the different situations. Equally, if not more, important is that there must be an effective and efficient approach to measuring the variables of interest electronically. Ideally, such feature variables should not correlate highly with each other. Showing statistical differences at a certain significance level, as is usually done in the cry-study literature, does not necessarily imply that the variables investigated meet the above criteria, especially the practical requirement of having effective and efficient methods to estimate/extract them electronically from the original signal. To sum up, while past findings from the study of cries are definitely valuable in guiding our research into automatic cry analysis, they are not in a form that engineers can directly use to design an automatic cry analysis device.

2.5 Summary

In this chapter, we have reviewed the history of infant cry research and summarized the methodological and technological diversity of modern cry study. As we have seen, infant cry research has now attracted tremendous enthusiasm from various disciplines of both science and engineering.

Since infant crying is a very complex phenomenon, it is not surprising to see that many problems still remain unsolved after so many years of systematic investigation.
These problems can probably be partially attributed to the lack of precise and objective analysis methods in past cry research. The techniques used were usually unable to extract subtle differences from the cry signal, which may be characteristic of the conditions that evoked the cry, and which will therefore be crucial for identifying such conditions from the cry itself. Also, older techniques often relied on human observers for the final interpretation of the measurements, making the interpretations subjective and difficult to reproduce.

The introduction of computerized signal analysis techniques to cry research holds great promise, since computerized techniques can often achieve accurate and fast detection of the acoustic attributes in the cry signal, and enable the extraction of very subtle diagnostic information that would otherwise be unattainable. Among the many advanced signal processing/analysis techniques, we are most interested in those of digital voice and speech processing, and of computer pattern recognition. Tremendous technical, as well as theoretical, advancements have been achieved in these two fields recently. In the following chapter, we will review some of the most important and applicable aspects of voice and speech processing and pattern recognition techniques.

Chapter 3

AN OVERVIEW OF APPLICABLE VOICE AND SPEECH PROCESSING TECHNIQUES

Like infant cry research, voice and speech processing is also an interdisciplinary subject. It draws on acoustics, linguistics, physiology, psychology, computer science, and engineering. The very purpose of voice and speech processing is to extract information carried in the voice and speech signal. It was not until the introduction of sophisticated digital signal processing technology that voice and speech processing was able to grow into the realm of practical engineering applications.
In this chapter, we will briefly outline some important aspects of current voice and speech processing theory and technology. These aspects are chosen because we intend to apply them to infant cry analysis. We will start with an introduction to the human vocalization organs and speech production, since most speech processing theories are based on physiological models. Then we will describe the computer speech production model which has guided the development of many modern speech processing algorithms. The rest of this chapter will discuss the basic principles of pattern recognition, and will survey some important speech processing techniques, including vector quantization (VQ), hidden Markov modeling (HMM), and speech/speaker recognition techniques.

3.1 Human Vocalization Organs and Speech Production

As shown in the schematic diagram in Figure 3.1 [18], the human vocal system can be functionally divided into three main subsystems: 1. lungs and trachea/bronchi, 2. larynx (vocal cords), and 3. vocal and nasal tracts. The lungs and trachea/bronchi are the power supply of the vocal system: air is compressed by the lungs and delivered to the system by way of the bronchi and trachea. They also control the loudness of the resulting sound. The larynx contains the principal sound-generating mechanism. Finally, the sound is modulated and enhanced by the vocal tract, and sometimes by the nasal tract.

[Figure 3.1 Schematic diagram of the human vocal system (after Flanagan et al. [18]). Labels: muscle force, lungs, trachea/bronchi, vocal cords, vocal tract, mouth, nasal tract, nostril.]

In Figure 3.1, the lungs are represented by the air reservoir at the left. The force of the rib-cage muscles raises the air in the lungs to subglottal pressure $P_S$. This pressure expels a flow of air with velocity $U_G$ through the vocal cord orifice. The vocal cords are represented as a mechanical oscillator composed of a mass, spring, and viscous damping.
The passing air flow causes the cords to vibrate and hence interrupt the air flow. The interrupted flow produces quasiperiodic, broad-spectrum pulses, the glottal pulses (see Figure 3.2), which excite the vocal tract. After modulation in the vocal tract, voiced sounds are produced. If at the same time the nasal tract is coupled to the vocal tract by opening the trapdoor of the velum, we hear nasal sounds; otherwise non-nasal sounds are produced. Unvoiced sounds, like fricative sounds, are generated by forming a constriction at some point in the vocal tract, usually toward the mouth end, and forcing air through the constriction to produce turbulence. A source of noise-like sound is thereby created.

In voiced sounds, the repetition rate of the pulses generated by the interrupted air flow at the vocal cords is defined as the pitch or fundamental frequency (F0).

[Figure 3.2 A typical glottal pulse train.]

Excluding its power source (the lungs etc.), the operation of the vocal system comprises two basic functions, excitation and modulation [91], as shown schematically in Figure 3.4. The excitation takes place mostly at the glottis, which is the most important sound-generating organ in the larynx, although other points may also contribute, as mentioned earlier. Modulation is performed by the various organs in the vocal and nasal tracts.

The principal means of modulation is filtering [91, 101, 107]. In the case of voiced sounds, the glottal pulses can introduce an abundance of harmonics, and the vocal tract, like any acoustical tube, has natural frequencies which are functions of its shape.
These natural resonances are called formants, and they are the most important way of modulating the voice. Formants account for the generation of all the vowels and some of the consonants.

Mathematically, if we let the glottal pulse waveform be $g(t)$ and the vocal tract impulse response be $h(t)$, then the resulting speech signal will be the convolution of $g(t)$ with $h(t)$. In the frequency domain, if the spectrum of $g(t)$ is $G(f)$ and the vocal tract transfer function is $H(f)$, the spectrum of the output will be $G(f)H(f)$. Figure 3.3 shows this schematically.

[Figure 3.3 Spectra in speech generation: (a) idealized spectrum of glottal pulse train, (b) frequency response of vocal tract, where the peaks correspond to formants, and (c) spectrum of the resultant speech. Horizontal axes: frequency.]

[Figure 3.4 A two-stage speech generation model.]

The excitation-modulation model shown in Figure 3.4 serves as the basis of modern speech and voice processing theories.

3.2 The Digital Model of Speech Production and Linear Prediction Coding (LPC) of Speech Signals

3.2.1 Time-varying Filter Model of Speech Generation

In computerized digital speech signal processing, the two-stage conceptual model shown in Figure 3.3 and Figure 3.4 has led to the introduction of the following digital time-varying filter model of speech generation, as shown in Figure 3.5 [101, 107].

[Figure 3.5 The digital time-varying filter model of speech production. Blocks: impulse train generator (controlled by the pitch period), random number generator, amplitude G, excitation u(n), time-varying digital filter (controlled by the digital filter coefficients, i.e. the vocal tract parameters), output s(n).]

In this digital model, excitation is represented by an impulse train generator and a random noise generator.
For voiced speech, the digital filter is excited by the impulse train generator, which creates a quasiperiodic impulse train in which the spacing between impulses corresponds to the fundamental period of the glottal excitation. For unvoiced sounds, the filter is excited by the random number generator, which produces flat-spectrum noise. In both cases, an amplitude control, $G$, regulates the intensity of the input to the digital filter, and thus the volume of the resultant speech.

Modulation is furnished by a time-varying digital filter whose coefficients change with time so that the transfer function of the filter always approximates the overall spectral characteristics of the transmission properties of the vocal and nasal tracts, and the spectral properties of the glottal pulse train.

Usually, such a time-varying system is very hard to manipulate in practice because of its high computational complexity. Fortunately, there is a way to simplify the problem drastically. Since the vocal/nasal tract changes its shape relatively slowly during speech production, it is possible, and also necessary, to assume that over a very short time interval (10-20 msec) the characteristics of the vocal/nasal tracts are almost constant. Thus, within this short time period, the time-varying digital system in Figure 3.5 can be treated as a much simpler time-invariant system [91, 101, 107]. The parameters of this speech generation model, i.e. the digital filter coefficients, pitch period, voiced/unvoiced control, and amplitude, are assumed fixed during each of the short intervals.

3.2.2 Linear Prediction of Speech Signal

The digital filter in Figure 3.5 can be simply chosen as a $p$th-order all-pole system (in statistics, a $p$th-order autoregressive model). Provided $p$ is properly chosen, this assumption is adequate and reasonable for most practical speech applications.
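As a rough illustration of the model in Figure 3.5, the sketch below generates one short frame by driving a fixed all-pole filter with either an impulse train (voiced) or flat-spectrum noise (unvoiced). It assumes the usual all-pole recursion s(n) = sum_k a_k s(n-k) + G u(n) with zero initial conditions. The function names, the uniform noise source, and the example coefficients are illustrative assumptions and are not taken from this work.

```python
import random


def impulse_train(num_samples, pitch_period):
    """Voiced excitation: unit impulses spaced pitch_period samples apart."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]


def white_noise(num_samples, seed=0):
    """Unvoiced excitation: uniform white noise (flat spectrum on average)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]


def synthesize_frame(excitation, coeffs, gain):
    """All-pole filter with fixed coefficients over one short frame.

    Implements s(n) = sum_{k=1..p} a_k * s(n-k) + G * u(n),
    treating samples before the frame as zero.
    """
    output = []
    for n, u in enumerate(excitation):
        sample = gain * u
        for k, a_k in enumerate(coeffs, start=1):
            if n - k >= 0:
                sample += a_k * output[n - k]
        output.append(sample)
    return output


# Hypothetical example: a 160-sample frame (20 msec at 8 kHz) with an
# 80-sample pitch period and an arbitrary stable 2nd-order filter.
voiced = synthesize_frame(impulse_train(160, 80), [1.2, -0.8], gain=1.0)
unvoiced = synthesize_frame(white_noise(160), [1.2, -0.8], gain=0.1)
```

Holding `coeffs`, `gain`, and the pitch period fixed inside one call mirrors the short-time assumption above; a full synthesizer would update them from frame to frame.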
Detailed discussion of the all-pole model and its application to speech analysis can be found in [67, 101, 107]. More fundamental theoretical analysis of the selection of $p$ can be found in [1, 93].

Thus, the digital filter has the transfer function

$$H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}} \tag{3.1}$$

where $G$ is the gain parameter which controls the volume of the output speech, and the $a_k$ are the coefficients of the filter. The speech output $s(n)$ is related to the excitation $u(n)$ by the following difference equation:

$$s(n) = \sum_{k=1}^{p} a_k s(n-k) + G u(n). \tag{3.2}$$

For each short-time interval, we can estimate the gain parameter $G$ and the filter coefficients $a_k$ from the speech samples in a straightforward and computationally efficient manner, as discussed below.

The method is called linear predictive coding (LPC) analysis. Suppose that we predict the speech signal at time $n$ with a $p$th-order all-pole linear predictor with prediction coefficients $\alpha_k$, i.e.,

$$\hat{s}(n) = \sum_{k=1}^{p} \alpha_k s(n-k). \tag{3.3}$$

Then the prediction error, $e(n)$, is defined as

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} \alpha_k s(n-k). \tag{3.4}$$

It can be seen by substituting equation (3.2) into equation (3.4) that, if $\alpha_k = a_k$ and the speech signal really does obey the model of equation (3.2), then $e(n) = G u(n)$. This means that, between the excitation impulses of voiced speech, the prediction error should be very small if the predictor coefficients $\alpha_k$ are equal to the actual parameters $a_k$ of the vocal tract transfer function.

Therefore, by minimizing the average squared prediction error

$$E_n = \sum_{m} e_n^2(m) = \sum_{m} \Big( s_n(m) - \sum_{k=1}^{p} \alpha_k s_n(m-k) \Big)^2 \tag{3.5}$$

where $s_n(m)$ is a segment of the speech waveform that has been selected in the vicinity of sample $n$, i.e.,

$$s_n(m) = s(m+n), \tag{3.6}$$

we can obtain estimates of the predictor coefficients, $\hat{\alpha}_k$, and of the gain parameter, $\hat{G}$.
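One common way to carry out this minimization is the autocorrelation method, solved with the Levinson-Durbin recursion. The sketch below is a minimal pure-Python version under that assumption; the text here does not say which variant is intended, and `autocorrelation`, `lpc`, and the test frame are hypothetical names and data. The gain is taken as the square root of the residual error energy, which is the usual convention for this method.

```python
import math


def autocorrelation(frame, max_lag):
    """Short-time autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [
        sum(frame[m] * frame[m + lag] for m in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc(frame, order):
    """Estimate predictor coefficients and gain for one frame.

    Returns (alpha, gain), where alpha[k-1] multiplies s(n-k) in the
    predictor s_hat(n) = sum_k alpha_k * s(n-k).
    """
    r = autocorrelation(frame, order)
    alpha = [0.0] * (order + 1)  # alpha[0] is unused padding
    error = r[0]                 # prediction error energy of order 0
    for i in range(1, order + 1):
        if error <= 0.0:
            break                # silent or degenerate frame
        # Reflection coefficient for order i.
        k_i = (r[i] - sum(alpha[j] * r[i - j] for j in range(1, i))) / error
        updated = alpha[:]
        updated[i] = k_i
        for j in range(1, i):
            updated[j] = alpha[j] - k_i * alpha[i - j]
        alpha = updated
        error *= 1.0 - k_i * k_i
    return alpha[1:], math.sqrt(max(error, 0.0))


# Hypothetical check: the impulse response of a first-order all-pole
# filter with a_1 = 0.5 and G = 1 should give back roughly those values.
frame = [0.5 ** n for n in range(50)]
alpha, gain = lpc(frame, order=2)
```

On this synthetic frame the first coefficient comes out close to 0.5, the second close to 0, and the gain close to 1, which is what equations (3.2) and (3.4) predict when the signal really obeys the model.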
Thus,

$$\hat{H}(z) = \frac{\hat{G}}{1 - \sum_{k=1}^{p} \hat{\alpha}_k z^{-k}} \tag{3.7}$$

will be a good approximation to the vocal tract transfer function [91, 101, 107].

Recently, LPC coefficients have been used with excellent results to represent speech signals in many areas of speech processing. This is partly due to the availability of very efficient algorithms for computing the LPC coefficients [91, 101]. Still more important is that from the LPC coefficients we can economically derive many very useful speech parameters, such as the formant frequencies, cepstrum coefficients, PARCOR coefficients, log area ratio coefficients, and others.

3.3 Pattern Recognition

Any problem concerning discrimination or classification of signals may be thought of as a pattern recognition problem. In its simplest form, a pattern recognition system consists of two major function blocks: feature extraction and classification, as shown in Figure 3.6.

[Figure 3.6 A pattern recognition system. Input label: measurements.]

Prior to feature extraction, we have to determine which features should be measured. The selection of proper features is usually crucial to the success of a pattern recognizer. Features are supposed to be [21, 91, 108]:

1. easy to measure (automatically) and computationally feasible;
2. varying widely from class to class;
3. invariant or insensitive to extraneous variables and the common distortions of the patterns;
4. not correlated with other features; and
5. low in redundancy.

Unfortunately, at present, there is very little available in the way of a general theory for the selection of features. The decision of what to measure to obtain features is usually application-dependent. This decision is often based on either the importance of the features in characterizing the patterns, or on the contribution of the features to the performance of the recognizer (i.e., the accuracy of recognition) [21, 108].
In some cases, simulation may aid in the choice of appropriate features.

The classification problem can be defined mathematically as follows. Suppose that $N$ features are extracted from each input pattern, as shown in Figure 3.6. These $N$ features compose a vector $\mathbf{x}$, called the feature vector. This vector corresponds to a point in the $N$-dimensional feature space $\Omega$. The problem of classification is to assign each possible feature vector, or equivalently the corresponding point in the feature space, to a proper pattern class. This can be interpreted as the partition of the feature space $\Omega$ into mutually exclusive regions, where each region corresponds to a particular pattern class. The boundaries of the partition are called the decision boundaries, which are described by hypersurfaces (hyperplanes in the simplest, linear case) in the $N$-dimensional feature space $\Omega$. Thus, solving the classification problem is equivalent to specifying these boundaries. If all the a priori probability information is known, we can determine the boundaries analytically; otherwise, we have to obtain that information by training the classifier with patterns whose classification is already known [21, 108].

3.4 Automatic Speech/Speaker Recognition

3.4.1 Automatic Speech Recognition

Speech recognition problems can be divided into the following categories [91, 101]:

1. isolated-word recognition: recognition of words separated by pauses;
2. word spotting: the detection of occurrences of a specified word in continuous speech;
3. continuous-speech recognition: recognition of words without pauses in between; and
4. speech understanding: the extraction of the meaning of an utterance with the use of stored information about the language being spoken.

Among the above, isolated-word recognition is the least difficult; the pauses between words significantly simplify the problem.
However, many of the techniques developed for the isolated-word problem have been carried over into word spotting and continuous-speech recognition.

The following technical aspects are important in the design of speech recognition systems [101].

1. Feature selection.
The choice of features is usually application-dependent. Commonly used features include amplitude, zero-crossing rate, gross spectrum balance, LPC coefficients, etc.

2. Word boundary detection.
This is especially crucial in isolated-word recognition and word spotting, since these recognizers usually treat each word in the vocabulary as their basic recognition unit for pattern matching.

3. Time normalization.
The traditional method of time normalization is dynamic time warping (DTW), which can efficiently normalize the non-uniform distortion on the time axis of the unknown pattern. In recent years, newly developed HMM techniques have achieved the same result with higher efficiency [50, 91, 99].

3.4.2 Speaker Recognition

There are two related but different areas of speaker recognition: speaker verification (SV) and speaker identification (SI) [88, 91]. Given a speech signal, the former involves verifying whether the speaker is who he or she claims to be, while the latter involves finding the identity of the person most likely to have spoken. Both SV and SI use a stored database of reference patterns for N known speakers, and employ analysis and decision-making techniques similar to those used in speech recognition. A speaker recognizer (either SV or SI) can work in two different ways: text-dependent or text-independent. A text-dependent recognizer is easier to implement since the templates and the unknown patterns come from the same words [88, 91, 104].

Since the meaning of the sentence spoken by the speaker is not important for speaker recognition, the criterion of feature selection may differ from that of speech recognition.
In particular, more emphasis could be placed on the features which convey information about the acoustic differences between different speakers. With this in mind, many features have been explored as potential indicators of a speaker’s identity [106, 123], including:
1. vowel formant frequencies and bandwidths and glottal source poles;
2. location of pole frequencies in nasal consonants;
3. pitch contours over a selected sentence; and
4. timing characteristics, specifically the rate of change of the second formant over a selected period of a sentence.

As acoustic cues to a speaker’s identity are spread throughout each of his or her utterances, many systems utilize templates of averaged parameters [88]. This statistical approach is most useful in text-independent cases, since the time sequences of training and testing utterances do not correspond. Recently, vector quantization (VQ), combined with LPC-related parameters such as predictor coefficients, cepstral coefficients, reflection coefficients, orthogonal LPC parameters, and log-area ratios, has been successfully used in both text-independent and text-dependent recognizers [91, page 338]. The advantage of using these parameters is that very efficient computer algorithms are available to extract them from the speech signal.

3.5 Vector Quantization

Vector quantization (VQ) is a coding technique initially developed for communications. Recently, its applications have been successfully extended into many other fields of digital signal processing, such as classification [46, 126], image and speech data compression [31, 32, 84], and automatic speech/speaker recognition [50, 88, 91, 100, 110].

Generally, a vector quantizer, composed of a code-book (or reproduction alphabet) and a definition of distance measure, can map a real input vector onto a discrete symbol [61]. The code-book consists of a set of fixed prototype vectors with the same dimensions as the input vector. To perform the mapping, the distance between the input vector and each prototype vector in the code-book is calculated using the pre-defined distance measure. The prototype vector which has the smallest distance to the input vector is found and becomes the representation of the input vector. A more detailed description of VQ is given in the next section.

3.5.1 Some Details of Vector Quantization

An M-level d-dimension quantizer is a mapping, $q(\cdot)$, that assigns to each input vector, $x = (x_0, \ldots, x_{d-1})$, a reproduction vector, $y_k = q(x)$, drawn from a finite reproduction alphabet (or code-book), $A = \{y_i;\ i = 1, \ldots, M\}$. The $q(\cdot)$ is usually chosen as a minimum-distance mapping. This means that the reproduction vector $y_k$ is chosen such that

$$d(x, y_k) = \min_{\text{all } i} d(x, y_i) \tag{3.8}$$

where $d(\cdot,\cdot)$ is any nonnegative distance measure defined on the d-dimension space, and $d(x, y_k)$ is called the distortion of the quantization. Obviously, a minimum-distance mapping, which is completely defined by the reproduction alphabet A along with the definition of the distance measure, uniquely describes a Dirichlet partition, $S = \{S_i;\ i = 1, \ldots, M\}$, of the sample space.

An M-level quantizer is said to be optimal if it minimizes the expected distortion $D(q) = E\{d(x, q(x))\}$, that is, $q^*$ is optimal if for any other quantizer q having M reproduction vectors, $D(q^*) \le D(q)$.

In speech recognition, a feature vector is usually extracted from each frame of speech signal. Then vector quantization is applied to convert the feature vector into an integer (the index of the nearest prototype vector) which is often only one or two bytes long.
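As a minimal sketch of the minimum-distance mapping in equation (3.8), the following Python fragment assigns each frame's feature vector to the index of its nearest prototype. The squared Euclidean distance, the function name `vq_encode`, and the small code-book are illustrative assumptions; the text allows any nonnegative distance measure.

```python
import numpy as np

def vq_encode(features, codebook):
    """Map each feature vector to the index of its nearest prototype.

    features: array of shape (T, d), one d-dimensional vector per frame.
    codebook: array of shape (M, d), the M prototype (reproduction) vectors.
    Returns (indices, distortions), each of shape (T,).
    """
    features = np.asarray(features, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance between every frame and every prototype
    # (an assumed distance measure, chosen only for illustration).
    diff = features[:, None, :] - codebook[None, :, :]
    dist = np.sum(diff ** 2, axis=2)            # shape (T, M)
    indices = np.argmin(dist, axis=1)           # minimum-distance mapping
    distortions = dist[np.arange(len(features)), indices]
    return indices, distortions

# Hypothetical 2-dimensional code-book with M = 3 prototypes.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
frames = np.array([[0.1, -0.2], [0.9, 1.2], [3.5, 0.4], [2.0, 2.0]])
symbols, distortion = vq_encode(frames, codebook)
```

Each frame is thereby reduced to a single small integer, which is the symbol form that a discrete recognizer can consume.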
In many cases, this drastically compresses the original speech data with a minimal information loss, and significantly simplifies the design of the speech recognition system.

3.6 Hidden Markov Models

The hidden Markov model (HMM) is a stochastic process that has been found to be particularly useful in automatic speech recognition (ASR). As one of the most important pattern-matching approaches used in ASR, the HMM is capable of dealing with the variations in both the temporal structure and the spectral patterns of speech signals.

3.6.1 Description of HMMs

An HMM is a collection of states connected by transitions. Each transition from state i to state j has a transition probability $a_{ij}$. When state i is reached at time t, a symbol $O_t$ is emitted, where $O_t$ is from a finite symbol set $\{v_1, v_2, \ldots, v_M\}$, i.e., $O_t \in \{v_1, v_2, \ldots, v_M\}$, where M is the number of all possible symbols in the HMM. Each state is associated with an output symbol probability distribution, which defines the conditional probability of emitting a certain observation symbol given that the state is reached.

Figure 3.7 shows an example of an HMM. The three circles represent the three states of the HMM. At a discrete time instance t, the model reaches one of the three states and emits an observation symbol $O_t \in \{v_1, v_2, v_3, v_4\}$. At instance t + 1, the model moves to a new state (sometimes itself) by taking a transition and emits another observation symbol, and so on. This continues until a final terminating state is reached at time T. All the transition probabilities are tabulated in an N x N transition matrix, $A = [a_{ij}]$, where N is the number of states in the model, and $a_{ij}$ is the probability of occupying state j at time t + 1, given state i at time t.

$$A = [a_{ij}] = \begin{bmatrix} 0.1 & 0.6 & 0.3 \\ 0.0 & 0.3 & 0.7 \\ 0.0 & 0.0 & 1.0 \end{bmatrix}, \qquad B = [b_{jk}] = \begin{bmatrix} 0.1 & 0.2 & 0.1 & 0.6 \\ 0.5 & 0.3 & 0.1 & 0.1 \\ 0.3 & 0.4 & 0.1 & 0.2 \end{bmatrix} \ (\text{columns } v_1, v_2, v_3, v_4), \qquad \pi = [\pi_i] = [\,1.0 \ \ 0.0 \ \ 0.0\,]$$

Figure 3.7 A 3-state HMM with 4 output symbols.
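The generation process just described can be sketched in a few lines of Python, assuming the parameter values shown in Figure 3.7. States and symbols are numbered from zero, the sequence length T is fixed in advance instead of waiting for a terminating state, and the function name `generate` and the random seed are hypothetical choices made only for illustration.

```python
import numpy as np

# Parameters read from Figure 3.7 (states and symbols are 0-indexed here).
A = np.array([[0.1, 0.6, 0.3],
              [0.0, 0.3, 0.7],
              [0.0, 0.0, 1.0]])            # a_ij: state i -> state j
B = np.array([[0.1, 0.2, 0.1, 0.6],
              [0.5, 0.3, 0.1, 0.1],
              [0.3, 0.4, 0.1, 0.2]])       # b_jk: state j emits symbol v_k
pi = np.array([1.0, 0.0, 0.0])             # initial state distribution

def generate(A, B, pi, T, rng):
    """Draw a state sequence and an observation sequence of length T."""
    n_states, n_symbols = B.shape
    states, symbols = [], []
    state = rng.choice(n_states, p=pi)                  # state at t = 1
    for _ in range(T):
        states.append(int(state))
        # Emit a symbol according to the current state's output distribution.
        symbols.append(int(rng.choice(n_symbols, p=B[state])))
        # Take a transition according to the current row of A.
        state = rng.choice(n_states, p=A[state])
    return states, symbols

rng = np.random.default_rng(0)   # fixed seed, chosen only for repeatability
states, symbols = generate(A, B, pi, T=10, rng=rng)
```

Because the initial distribution puts all probability on the first state and A is upper triangular, every sampled state sequence starts in the first state and never moves back to an earlier one.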
Matrix B is composed of the output symbol probability distributions of all states. Its element $b_{jk}$ is defined as the probability of emitting symbol $v_k$ when state j is reached. The initial state distribution $\pi = [\pi_i]$ determines the probability of occupying state i at time t = 1. A convenient notation can be used to indicate an HMM and its parameters, $\lambda = (A, B, \pi)$.

3.6.2 Three Problems

An HMM can be used either as a generator of symbol observation sequences, or as a model for analyzing how a given observation sequence was generated. To make HMMs useful in automatic speech recognition, there are three problems of particular interest [99]:

Problem 1: Given an observation sequence $O = O_1 O_2 \cdots O_T$, where $O_t \in \{v_1, v_2, \ldots, v_M\}$, and a certain hidden Markov model $\lambda = (A, B, \pi)$, how can $P(O \mid \lambda)$, the generation probability of the observation sequence, be computed efficiently?

Problem 2: Since an observation sequence O can be emitted by a model $\lambda = (A, B, \pi)$ passing through one of many different state sequences $Q = q_1 q_2 \cdots q_T$, given the observation sequence O and the model $\lambda$, how can the optimal (i.e., the most likely) state sequence Q be determined?

Problem 3: How can the model parameters $\lambda = (A, B, \pi)$ be adjusted to maximize $P(O \mid \lambda)$?

Using the methodology of dynamic programming, solutions to these problems have been found, and efficient computer algorithms exist as well (see Appendix D, and also [95, 99] for more discussions on the algorithms).

3.6.3 Speech Recognition with HMMs

The following postulate is the key to applying HMMs to speech recognition: an utterance is produced by the vocal organs passing through an ordered sequence of stationary states of different durations; the output from each state (which actually forms the utterance) can be regarded as a probabilistic function of the state.
This is a crude and drastically simplified description of the complexities of speech; speech is a smooth and continuous process and does not jump from one articulatory position to another. However, the success of HMMs in speech recognition demonstrates that if the hidden Markov model is correctly chosen and its parameters are properly set, it captures enough of the underlying mechanism to be effective.

As we discussed earlier, speech signals are usually represented as sequences of feature vectors, and each vector corresponds to a very short time interval of the speech. The HMMs described above can deal with only a finite number of symbols, but the feature vectors are real vectors with continuously valued elements. Therefore the feature vectors need to be quantized so that HMMs can be applied. Usually the vector quantization technique is used for this purpose.

The following example shows how HMMs work in isolated-word recognition. Assume we have a vocabulary of K words to be recognized. For each word, we have a training set of L tokens. Also, we can have an independent testing set. Before applying HMMs, each of the L tokens and each of the words in the testing set is represented by a sequence of symbols. To build the recognizer, we must:
1. design (the topology of) one or more HMMs for each of the K words in the vocabulary, and
2. estimate the parameters (transition and output probabilities) of each HMM, using the sequences of symbols representing the L tokens of this word (this process is called the training of the HMMs).

To perform the recognition we then carry out the following steps:
1. for each unknown word in the testing set, characterized by its sequence of symbols, calculate the probability of producing this sequence from each of the trained HMMs, and
2. choose the word whose model has the highest probability of generating the unknown word.

Chapter 4
FEASIBILITY STUDIES ON APPLYING SPEECH PROCESSING TECHNIQUES TO CRY ANALYSIS

As was already mentioned in Chapter 2, the analysis of infant cry has long been considered as having the potential to provide diagnostic information in a clinical setting. Thus, many studies have been undertaken to recognize the characteristics of cries uttered by healthy infants in different situations and by ill infants with various diseases or abnormalities. These studies used various techniques, such as auditory and acoustic analysis, spectrography, and computer-aided analysis. However, extracting situational and diagnostic information from infants’ cries using these methods is usually a long and tedious process that requires much expertise, and is sometimes vulnerable to the subjective bias of the observers. Accordingly, there are still many unsolved problems and unanswered questions about such analysis of the infant cry, and successful clinical applications are extremely rare.

We will show in this chapter that work over the past three decades has revealed many speech-like elements in infant cries. Furthermore, if it can be shown that models of the generation of infant cries possess enough similarities to those of speech generation, an excellent opportunity exists to apply techniques developed for automatic speech processing/recognition to the analysis and classification of infant cries.

Many new and powerful techniques have been developed recently in the field of automatic speech recognition, including LPC analysis, hidden Markov modeling (HMM), vector quantization (VQ), and others. With the help of these techniques, accurate automatic recognition of continuous speech is becoming a reality [50]. We believe that it is also feasible to develop highly efficient methods for infant cry analysis.
In this chapter, we study this feasibility using both theoretical analysis and experimental methods.

4.1 Investigating the Physiological Similarities between Infant Crying and Adult Speech

This investigation is focused on answering the following question: Do infants vocalize in ways which are similar to the ways adults generate speech? Because we intend to apply speech processing/recognition techniques to analyze infant cries, it is crucial to show theoretically or experimentally that these techniques are valid and applicable to infant cries.

In the following literature survey, we establish the physiological similarities between infant vocalizations and adult speech by describing a physioacoustic model of infant crying which is compatible with the production model of adult speech.

4.1.1 A Survey: Physiology of Infant Crying and a Physioacoustic Model of Cry Production

Infant crying is the result of complex interactions between many anatomic structures and physiological mechanisms. These interactions involve the central and peripheral nervous system, the respiratory system, and a variety of muscle groups [35].

4.1.1.1 Respiration patterns of crying infants

The functions of the vocal system are closely integrated with functions of the respiratory system, which supplies oxygen to the blood and removes carbon dioxide from the body. Thanks to the cooperation of the many organs and muscular groups in our vocal and respiratory systems, we do not have to stop breathing to speak. These organs and muscles often work for both vocalization and respiration simultaneously. We learn to speak and breathe at the same time as young children. The breathing that accompanies speaking is called speech breathing [49]. The chief physiological requirement for speech breathing is to have short inhalations, and long, controlled exhalations. It is similar for infant vocalizations.
Infants must be able to manage their respiratory system to meet the basic criteria of speech breathing. They have to shorten the inspiration and extend the expiration, to quickly take in an adequate volume of air, and to augment or diminish the relaxation pressure generated by the chest wall’s elastic rebound. They also have to constantly adjust the activity of the respiratory muscles to maintain a high and constant pulmonic pressure during expiration. Observations on infant cry respiration suggest that “at a very early age the infant can very well negotiate a short inspiration and a prolonged expiration. . . . The infants’ respiration system is capable of meeting those criteria for adult speech breathing” [49]. It was also pointed out by Lieberman (1985) [55] that “three aspects of the intonation pattern of normal human newborn cry are similar to the patterns that adult speakers usually use” and “many of the linguistically salient aspects of human speech can be seen in the vocal behavior of infants.”

4.1.1.2 Models of infant cry generation

Many scientists believe that at birth the infant’s supralaryngeal vocal tract most closely resembles a uniform cross-section tube with a length of approximately 7.5 cm, open at both ends during the production of sound. Although, anatomically speaking, the infant’s vocal apparatus is significantly different from that of the adult, in both the position of various organs and their absolute mobility, from the acoustical viewpoint they both operate on similar principles of voice generation [56, 57].

Based on detailed investigations with both sound spectrographs and cineradiographs, Truby and Lind in 1965 [114] suggested an infant vocalization model that is very similar to one commonly used to model adult speech production in speech processing. Their primary concept is: infant vocalization derives from sound source plus resonance, i.e., from the amplification of simple or complex quasiperiodic physical excitation.
It can be formulated as:

Vocalization = Source + Resonance. (4.1)

They divided infant vocalization into three categories: (1) phonation or “basic cry”, resulting from more or less simple glottal oscillation, with adequate amplification; (2) dysphonation or “turbulence”, caused by sometimes simple, sometimes complex glottal excitation, with or without supraglottal excitation, with or without frication; (3) hyperphonation or “shift”, resulting from certain constraints on the source plus or minus simple and/or complex glottal participation, with or without supraglottal excitation, with or without frication.

Figure 4.1 A simplified view of the cry production model. (Recoverable labels from the diagram: radiation [R(f)]; cry sound from mouth, R(f) T(f) [N(f) + S(f)].)

Extending Truby and Lind’s model, Golub and Corwin introduced in 1985 a physioacoustic model of infant cry production which is even closer to modern speech generation concepts [35]. As shown in Figure 4.1, their model divides cry production into four parts. The first part is the subglottal (respiratory) system, which supplies the necessary air pressure ($P_s(t)$) below the glottis for driving the vocal folds. The second part is the sound source located at the larynx. Mathematically, the sound source can be described, in the frequency domain, as either a periodic source (S(f)) or a turbulence noise source (N(f)). Frequently these sources operate simultaneously. S(f) results from the vibration of the infant’s vocal folds, and the turbulence noise is most likely produced by forcing air through a small opening left by incomplete closure of the vocal folds [35]. The third part of the cry model consists of the vocal and nasal tracts located above the larynx. This part functions as an acoustic filter that has a transfer function T(f), determined by the shape and length of the vocal and nasal tracts and the degree of nasal coupling.
The fourth part of the system is the radiation characteristic R(f) that describes the acoustic transmission between the mouth of the infant and the auditor.

Figure 4.2 shows the spectra of idealized sound sources, vocal tract transfer function, radiation function, and the spectrum of the output cry sound. The amplitude of the sound is directly related to the subglottal pressure $P_s(t)$.

Figure 4.2 (A) Spectrum of an idealized periodic source; (B) spectrum of a turbulence source; (C) vocal tract transfer function; (D) spectrum of an idealized radiation characteristic; and (E) spectrum of an idealized output. Horizontal axis: frequency (kHz). (after Golub and Corwin [35])

As to the three different types of infant cries proposed by Truby and Lind, Golub and Corwin explained them in terms of their new cry production model. They supposed that phonation is produced by the vocal folds vibrating fully at an F0 range of approximately 250-700 Hz; hyperphonation results from a “falsetto”-like vibration pattern of the vocal folds with an F0 range of about 1000-2000 Hz (likely only a thin portion of the vocal ligament is involved in this mode); and dysphonation contains both a periodic and an aperiodic sound source and occurs when turbulence noise is generated at the vocal folds.

In summary, Golub and Corwin’s cry production model can be described in the frequency domain as:

Output = Source x Filter. (4.2)

It is not difficult to see that equations (4.1) and (4.2) describe exactly the same principle as the excitation-modulation speech production model (see Figure 3.4, Chapter 3).
Furthermore, by comparing Figure 3.5 of Chapter 3 with Figure 4.1, we find that, if we set the transfer function of the digital filter of Figure 3.5 to be S(f)T(f)R(f) (see footnote e), then the digital model in Figure 3.5 will be equivalent to the cry production model of Figure 4.1.

The above comparisons between the models clearly show that not only does the same excitation-modulation principle apply to infant cry production as it does to adult speech generation, but also that the modulation process can be similarly formulated in terms of a time-varying transfer function. This suggests that the speech generation model shown in Figure 3.5 can also be used to describe the production of infant crying, provided the parameters in the model are appropriately chosen. This will allow the application of signal processing/recognition techniques, which were initially developed for adult speech analysis, to the analysis of infant cries.

Footnote e: See Figure 4.1, where S(f), T(f), and R(f) are the glottal source spectrum, the transfer function of the supraglottal system, and the mouth radiation function, respectively.

4.2 Formant Frequency Estimate Using an LPC Model

To further examine the validity of representing infant crying with models conforming to the adult speech generation theory discussed in Section 3.1, we conducted the following formant frequency estimation experiment.

We first hypothesized that infant cry generation can be adequately modeled by the systems depicted in Figures 3.4 and 3.5, and further, that the digital filter in Figure 3.5 can be defined as having an all-pole transfer function. Then, based on these assumptions, we used the linear prediction coding method to estimate the formant frequency tracks of the cry signal. Finally, we compared the LPC analysis results with the formant frequencies observed on the spectrogram of the same cry. We expected that the LPC results would match the spectrogram observation if our two above-mentioned assumptions were correct.
Our experimental results supported this expectation.

4.2.1 Theoretical Analysis

By definition, formant frequencies or formants are the resonances of the vocal tract. Each of the formants will cause a local maximum in the magnitude curve of the frequency response of the vocal tract at the frequency of that formant, as shown schematically in Figure 3.3 (b). The locations of these local maxima or peaks are exclusively determined by the shape of the vocal tract. Therefore, the formant estimate method used in this experiment involves 1) estimating the transfer function of the infants’ vocal tract, and 2) locating the peaks in the magnitude response derived from this transfer function. The locations (frequencies) of the peaks then will be our estimates of the formants.

Suppose that a p-th order all-pole digital filter with a transfer function of

$$H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}} \tag{4.3}$$

is used to model the generation of the cry signal. G is the system gain constant and has no effect on the formants. To determine this transfer function, we need to estimate the values of all the $a_k$’s. As discussed in Section 3.2.2, we can use the LPC method to determine the $a_k$’s from the cry signal, and obtain an estimate of the transfer function

$$\hat{H}(z) = \frac{\hat{G}}{1 - \sum_{k=1}^{p} \alpha_k z^{-k}} \tag{4.4}$$

where $\hat{G}$ and $\alpha_k$ are estimates of G and $a_k$, respectively.

The vocal tract magnitude response can then be determined from the estimated transfer function with the procedure described below.

To establish the procedure, first let

$$A(z) = 1 - \sum_{k=1}^{p} \alpha_k z^{-k}. \tag{4.5}$$

Then we can rewrite equation (4.4) as

$$\hat{H}(z) = \frac{\hat{G}}{A(z)}. \tag{4.6}$$

The magnitude of the frequency response of the vocal tract is then given by

$$|\hat{H}(e^{j\omega})| = \left. \frac{\hat{G}}{|A(z)|} \right|_{z = e^{j\omega}}. \tag{4.7}$$

We evaluate $|\hat{H}(e^{j\omega})|$ for N points over $0 \le \omega < 2\pi$, that is, we compute

$$|\hat{H}(k)| = |\hat{H}(e^{j 2\pi k / N})|, \quad k = 0, 1, \ldots, N - 1. \tag{4.8}$$

This can be conveniently effected by finding $\hat{H}^{-1}(k)$ for $k = 0, 1, \ldots, N - 1$ first. From equation (4.4), we have

$$\hat{H}^{-1}(k) = \frac{1}{\hat{G}} A(e^{j 2\pi k / N}), \quad k = 0, 1, \ldots, N - 1. \tag{4.9}$$

By defining $\beta_0 = 1$ and $\beta_i = -\alpha_i$ for $1 \le i \le p$, we can rewrite equation (4.5) as

$$A(z) = 1 - \sum_{i=1}^{p} \alpha_i z^{-i} = \sum_{i=0}^{p} \beta_i z^{-i}. \tag{4.10}$$

Then we have

$$A(e^{j 2\pi k / N}) = \sum_{i=0}^{p} \beta_i e^{-j 2\pi k i / N} \tag{4.11a}$$

$$\phantom{A(e^{j 2\pi k / N})} = \sum_{i=0}^{N-1} \beta_i e^{-j 2\pi k i / N} \tag{4.11b}$$

where $\beta_i = 0$ for $i > p$. It should be noted that the right-hand side of equation (4.11b) is simply the Discrete Fourier Transform (DFT) of the sequence $[\beta_0, \ldots, \beta_p, 0, 0, \ldots, 0]$. Therefore, we evaluate the magnitude of the frequency response of the vocal tract transfer function in the following steps:
1. estimate a set of p LPC coefficients $[\alpha_1, \ldots, \alpha_p]$;
2. compute the DFT of the N-point sequence $[1, -\alpha_1, \ldots, -\alpha_p, 0, \ldots, 0]$ with the FFT algorithm. This gives $A(e^{j 2\pi k / N})$, i.e., $\hat{H}^{-1}(k)$ up to the constant gain $\hat{G}$, for $k = 0, 1, \ldots, N - 1$; and
3. compute the magnitude of this result for $k = 0, 1, \ldots, N - 1$ and then take the reciprocal to obtain $|\hat{H}(k)|$, $k = 0, 1, \ldots, N - 1$, again up to the gain, which does not affect the locations of the peaks.

4.2.2 Experiment

In this experiment, our samples of infant cries came from a phonograph record found in Wasz-Höckert’s monograph (1968) [118]. Before the LPC analysis, the cry signals were low-pass-filtered with a cut-off frequency of 7.5 kHz and digitized at a sampling frequency of 15 kHz with 12-bit resolution. Figure 4.3 shows the flow diagram of our LPC formant estimation experiment.

Figure 4.3 LPC formant estimate experiment of infant cries. (Flow diagram; recoverable block labels: cry signals, digitization, segmentation, silence detection, pre-emphasis, Hamming window, LPC predictor coefficients computation, 512-point FFT, amplitude normalization, second derivation, peak detection, results display.)

The digitized signal is segmented into frames, each with fixed length (256 points, or 17.1 msec). Silence periods are detected by checking the short-time energy and are excluded from the subsequent analysis. Pre-emphasis removes dc components and flattens the spectrum so as to reduce the effect of the glottal waveform and lip radiation characteristics [18, 36].
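The front-end steps described so far, together with steps 2 and 3 of the procedure in Section 4.2.1, can be sketched as follows. This is an illustration under stated assumptions and not the program used in the experiment: the non-overlapping frames, the relative energy threshold, the pre-emphasis coefficient, the simple local-maximum peak picker, and all function names are hypothetical, and the gain is set to 1 since it does not move the peaks. Step 1, the estimation of the predictor coefficients, is not shown; a synthetic two-pole resonator near 1 kHz stands in for them.

```python
import numpy as np

FS = 15000          # sampling frequency in Hz (Section 4.2.2)
FRAME_LEN = 256     # samples per frame, about 17.1 ms at 15 kHz

def split_frames(signal, frame_len=FRAME_LEN):
    """Cut the signal into consecutive, non-overlapping frames (assumed)."""
    n_frames = len(signal) // frame_len
    return np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))

def drop_silence(frames, rel_threshold=0.01):
    """Keep frames whose short-time energy exceeds a fraction of the maximum.

    The relative threshold is a hypothetical choice, not taken from the text.
    """
    energy = np.sum(frames ** 2, axis=1)
    return frames[energy > rel_threshold * energy.max()]

def pre_emphasize(frame, mu=0.95):
    """First-order pre-emphasis y[n] = x[n] - mu * x[n-1].

    mu is an assumed value; with mu < 1 the dc level is strongly
    attenuated, and only mu = 1 would remove it completely.
    """
    return np.append(frame[0], frame[1:] - mu * frame[:-1])

def lpc_magnitude_response(alpha, n_fft=512):
    """Steps 2 and 3 of Section 4.2.1, with the gain set to 1."""
    beta = np.concatenate(([1.0], -np.asarray(alpha, dtype=float)))
    a_k = np.fft.fft(beta, n=n_fft)      # zero-pads beta to n_fft points
    return 1.0 / np.abs(a_k)

def pick_peaks(magnitude, fs=FS):
    """Frequencies (Hz) of local maxima in the lower half of the spectrum."""
    half = magnitude[: len(magnitude) // 2]
    is_peak = (half[1:-1] > half[:-2]) & (half[1:-1] > half[2:])
    bins = np.flatnonzero(is_peak) + 1
    return bins * fs / len(magnitude)

# Usage on synthetic data: predictor coefficients of a two-pole resonator
# with pole radius r and pole angle theta, i.e. A(z) = 1 - a1/z - a2/z^2.
r, f0 = 0.95, 1000.0
theta = 2 * np.pi * f0 / FS
alpha = [2 * r * np.cos(theta), -r ** 2]
peaks = pick_peaks(lpc_magnitude_response(alpha))
```

On the synthetic resonator the picker returns a single peak close to 1 kHz, with a resolution limited by the 15000/512 (about 29.3 Hz) spacing of the FFT bins.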
Since the autocorrelation method (see Appendix A) is used for the LPC analysis in our experiment, windowing is necessary to reduce the Gibbs effect which is due to the segmentation of the signal [101, 107]. A 14th-order predictor is used in our experiment, i.e., p = 14 in equation (4.3). Thus, after the LPC analysis we obtain a set of 14 LPC coefficients from each frame of the signal. Using the procedure described in Section 4.2.1, we estimate the magnitude response $|\hat{H}(\omega)|$ of the vocal tract transfer function (also called the spectrum envelope of the voice signal) from the predictor coefficients. Then, a peak-picking program locates all the local maxima or peaks of the magnitude response, and the possible formant frequencies corresponding to this frame of the signal are estimated from the locations of the peaks. The above analysis procedure is repeated for each frame of the signal, and at the end of the experiment the trajectories of the estimated formants are plotted.

4.2.3 Results

The results from our experiment show that the formant frequencies estimated with the LPC method are consistent with those observed on the spectrogram of the cry signal. Figure 4.4 (a) and Figure 4.4 (b) are examples: Figure 4.4 (a) shows the estimated formant trajectories of a pain cry obtained with the above LPC method, and Figure 4.4 (b) shows the spectrogram of the same cry.

Figure 4.4 (a) Estimated formant frequencies by using the LPC-based method on a pain cry. Estimations marked with circles indicate the presence of the hyperphonation. (b) Spectrogram of the same cry. In both graphs, the trajectories of the first formant, second formant, and third formant are labeled by A, B, and C, respectively. (Axes: frequency in Hz versus time in seconds.)

These results serve as evidence that:
1) normal infant crying can be modeled by a system similar to the adult speech generation system depicted in Figures 3.4 and 3.5, and
2) an all-pole model, as well as the accompanying LPC method, is effective in analyzing normal infant crying.

This conclusion is important experimental support for our theoretical analysis of infant cry modeling in Section 4.1.

In addition, Figure 4.4 shows an example of hyperphonation. The hyperphonation occurs during 1.2-1.45 seconds, identifiable by a fundamental frequency (F0) as high as 1-2 kHz. To indicate the presence of the hyperphonation, we mark the estimated formant frequencies in that period with circles in Figure 4.4 (a). During that period, the formant frequency estimates are not reliable because the harmonics of the extraordinarily high F0 dominate the LPC-derived spectrum envelope and overshadow the peaks corresponding to the actual formants.

4.3 Some Technical Considerations of Automatic Cry Analysis

As stated in Chapter 1, the goal of our infant cry research is to develop automatic devices which can analyze and monitor the infants’ physical/emotional situation. In this section we discuss some technical concerns about the design of such an automatic infant cry analyzer. In many respects, the automatic infant cry analysis problem is analogous to the problem of automatic speech recognition. We can consider cries corresponding to different physical/emotional situations as different “cry sentences” uttered by the infant, and the different time-frequency patterns in different segments of the cry signal as different “cry words”.
To determine the physical/emotional situation of the infant, we therefore need to recognize those “cry words” or “cry sentences.” Similar to speech recognition systems, the following technical aspects are of interest in the design of our automatic cry analyzer.

4.3.1 Training of the Automatic Cry Analyzer

As with any pattern recognition system, proper training is the key to the success of our infant cry analyzer. Generally, the training process in a pattern recognition system involves establishing a base of standard exemplars to which an unknown (unclassified) input can be compared and then classified. To establish this exemplar base, a set of classified or labeled inputs, called the training set, is needed. The identity information from each labeled input is analyzed, extracted, and stored to form the exemplar base.

As a compromise between recognition performance and application convenience, two different training strategies are used in automatic speech recognition systems, namely speaker-dependent and speaker-independent training. The speaker-dependent strategy achieves higher accuracy but requires that the system be trained on-site by the specified user(s) before it is used for recognition. The speaker-independent system is pre-trained and then applied to other speakers without on-site training. The convenience offered by the speaker-independent system comes at some cost in recognition accuracy, because the system is not trained on the user’s own voice. In order to achieve the required accuracy, the structure of speaker-independent systems is usually more complicated than that of speaker-dependent systems.

The same training problem exists in the automatic infant cry analyzer. We can similarly design the automatic cry analyzer to be infant-dependent or infant-independent. But on-site training of a cry analyzer is in most cases inconvenient or even unacceptable.
Therefore, a practical and clinically useful automatic infant cry analyzer should be infant-independent.

The implementation of a totally infant-independent cry analysis system may pose tremendous technical difficulties, especially if the system is to achieve acceptable accuracy. Some trade-off thus must be made. One possible solution is to introduce age-dependence in an infant-independent cry analyzer. It has been noticed by many researchers that some patterns and acoustic attributes in the cries change with the age of the infant, especially in the first week after birth [51, 97, 118]. If we designed and trained the cry analyzer for only a group of infants whose ages are within a specified range, the variation in the training data may be significantly reduced, and thus the analysis accuracy of the cry analyzer could be improved. We can also train an exemplar base for each of the different age groups of infants, and integrate all exemplar bases into the cry analyzer. In use, the operator of this cry analyzer only needs to specify the age of the infant whose cry is to be analyzed. The analyzer will then match the input cry signal to the exemplar base trained for infants of this age group.

4.3.2 Feature Selection

As we discussed earlier in Section 3.3, there is no generally applicable theory to guide the selection of effective and efficient features. In particular, feature selection is mainly conducted on a trial-and-error basis. Nevertheless, it is reasonable to draw on the published results and findings of infant cry researchers. In our case, we need to examine both the features commonly used in speech processing and those used in “traditional” cry research.

Some of the features which, we believe, are worthy of investigation are listed below:

1. short-time energy, zero-crossing rate, LPC coefficients and prediction error
These are the traditional features used in speech processing/recognition research [101]. For most of these features, efficient algorithms exist to extract them from the voice signal.

2. LPC-derived cepstrum coefficients (see footnote f)
LPC-derived cepstrum coefficients are reported to yield the highest recognition accuracy in speaker recognition [88], and are also the most commonly used in recent speech recognition systems [50, 102, 104]. Cepstrum coefficients have the desirable characteristic of being insensitive to any fixed frequency-response distortions in the recording apparatus and in the transmission system [2, 88].
A highly efficient method exists to compute the cepstrum coefficients. This is to derive them from the linear predictor coefficients by using the following recursive relationships [2],

$$c_1 = a_1, \qquad c_n = \sum_{k=1}^{n-1} \left(1 - \frac{k}{n}\right) a_k c_{n-k} + a_n, \quad 1 < n \le p \tag{4.12}$$

where p is the order of the predictor, and $c_i$ and $a_i$ are the i-th cepstrum coefficient and the i-th linear predictor coefficient, respectively.

Footnote f: In the simplest form, the (real) cepstrum of the signal x(t) is defined as the function c(t), which is the inverse Fourier transform of $C(\omega)$, where $C(\omega) = \ln |X(\omega)|$.

3. fundamental frequency, formant frequencies, melody types, cry duration, and other attributes used in past infant cry research
These features, as well as their relationship to various physical/emotional situations of the infant, have been the focus of infant cry research in the past (see Chapter 2). Many published results seem to suggest that certain distinguishable connections exist between these cry attributes and specific physical/emotional conditions of the infant. However, to use those attributes as features in the cry analyzer, we must establish effective and efficient methods to electronically extract such features from the cry signal.
This is not a trivial problem even for the fundamental and formant frequencies, which have been carefully studied from the very beginning of speech processing research. Also, the relationships between these cry attributes and the infants’ physical/emotional situations need to be reexamined, since our objective is not restricted to the analysis of the cry, but involves the building of an automatic analyzer/monitor of infant cries.

Chapter 5
THE H-VALUE: A MEASURE FOR THE AUTOMATIC ASSESSMENT OF NORMAL INFANT DISTRESS-LEVELS FROM THE CRY SIGNAL

In the previous chapters we have reviewed infant cry research, and discussed the importance and feasibility of introducing modern speech and voice processing/recognition techniques into this field. We found that many problems in infant cry research remain unsolved because highly efficient and precise analytical approaches have not yet been explored. We concluded that the success of applying infant cry analysis techniques in practice will heavily depend on the use of computer-based automatic signal processing methodologies and systems. Through both theoretical analysis and experimental investigations, we have shown that the mechanism or model governing infant cry generation is comparable both physiologically and acoustically to that of adult speech production. Thus, the methodology and technology developed for advanced voice signal processing and automatic speech recognition are likely applicable to infant cry analysis after some modifications.

These perspectives and findings suggest that a broad range of opportunities exists for investigating the use of signal processing techniques to extract reliable information from the cry signal, and to assess the physical/emotional situation of the infant based on such information.
In addition, new computerized techniques may provide opportunities for transferring accumulated findings and results on infant crying into practical applications.

In the rest of this dissertation, we will demonstrate our concept of automatic infant cry analysis by studying one particular issue which has long challenged infant cry research: the assessment of the levels of distress (LOD) of infants from their cries [127]. In the following chapters we will first present a “cry phoneme” system which effectively represents various time-frequency patterns commonly found in normal infants’ cries. Then we shall report the design of two automatic systems which estimate the LOD of the infant from cry sounds. These systems are based on (1) statistical classification and (2) hidden Markov modeling techniques.

In this chapter, we discuss the concept of infant LOD, introduce our “cry phoneme” system, and define an effective indicator, the H-value, to measure infant LOD from the cry signal.

5.1 The Level of Distress (LOD) Ratings of Normal Infant Cries

It is an important prospect to be able to assess the infants’ physical and/or emotional situation by systematically analyzing their cries. In real life, this assessment is made by parents relying on their intuition and experience formed by everyday dealings with their infants. However, it is still far from being understood how parents assess the infants’ situation from the cry sounds, or what attributes in the cries influence parents’ assessments most.
It is therefore of tremendous theoretical and practical value to investigate the process of parents’ assessment of the infants’ situation, and to develop effective and systematic methods to make such assessments automatically from the infants’ cry sounds.

In the research on assessing the infants’ physical/emotional situation, one commonly used approach is to classify the cries into different cry types according to the (presumed) stimulus that caused the cry, e.g., hunger and pain. Then, based on the definitions of these cry types, investigations are usually carried out that aim at finding the relationship between the cry attributes and the defined cry types, or at having caretakers subjectively recognize these cry types [9, 10, 26, 27, 118, 120]. The pain, hunger, pleasure, birth, and fussy cries are the most often used cry types [26, 27, 118]. However, this approach implies that the observed cries are a single-variable function of the often uncertain and presumed stimulus. It is unlikely that such a straightforward functional relationship always exists. The sleep/waking state, differences in neurophysiological maturity, individual sensitivity, and other uncontrollable internal and external factors may also have an important influence on the cry attributes. Without a thorough and proven knowledge of the complex cause-effect relationship of normal infant cry generation, it is not possible to ensure either the meaningfulness of these cry-type definitions or the reliability of the results obtained. For this reason we choose not to follow the above approach.

We chose a different path in our research. Instead of relating the cry parameters to the possible cause of the cry, we relate them to parents’ perceptions of the infants’ physical/emotional situation. This involves subjective observations and judgements since, after hearing a cry, parents must report their best guess as to the infants’ level of distress.
The parents’ reports are then correlated with the different parameters of the cry. The advantages of this approach are: (1) it is independent of any presumed cry generation mechanism or model, and is therefore free of the uncertainties or risks involved in trying to relate the cry parameters to the cause of the cry, and (2) it can lead to the discovery of how experienced caretakers assess the infants’ situations and what information in the cry contributes most to their decision. After identifying the parameters which show the most consistency with the parents’ perceptions, it may become possible to build a device that can automatically estimate those parameters and make “humanlike” assessments of normal infant cries.

To evaluate how adult caretakers assess the infants’ situation from their cries, several different systems of ratings have been investigated. Notable among these are the aversiveness rating items originally used by Zeskind and Lester [130] and the more traditional semantic differential items [87]. The eight aversiveness rating items are urgent, distressing, sick, arousing, grating, discomforting, piercing, and aversive. The semantic differential items include unpleasant, sharp, rugged, awful, fast, heavy, bad, active, and hard. Gustafson and Green [39] cross-checked all 17 of these rating items, and found virtually identical results for all ratings. They concluded that, “... at least when applied to the cries of normal infants, the individual items typically used to assess cry perception do not have discrete meanings, instead they may be measures of a single underlying dimension”. It was also concluded by Murray [82] that, rather than differing by cry type according to their causes, cries differ along a continuum of intensity. Murray also observed that it is the severity of distress that observers largely discern.
Therefore, we decided to use a single item, the level of distress (LOD), to describe the perceptual assessment of the normal infants’ emotional situation during the cry.

5.2 Definitions of “Cry Phonemes” for Normal Infant Cries

In a fashion analogous to the phoneme definitions in linguistics, we introduce here the concept of “cry phonemes”. We use ten “cry phonemes” to encode normal infant cries into cry phoneme sequences, or cry phoneme scripts. These scripts will later be used to calculate a single measure, the H-value.

These ten “cry phonemes” are chosen to satisfy two criteria. The first is that the “cry phonemes” should constitute a basis for normal infant cry signals in the sense that together they cover most time-frequency patterns or variations commonly found in normal infant cries. Thus, any cry from a normal infant can be divided into a sequence of segments, each of which belongs to one of these cry phonemes. The second criterion is that these phonemes should be detectable and distinguishable using computer signal analysis.

First we describe our definitions of the ten “cry phonemes” as they are observed as different time-frequency patterns.

1. Trailing (glottal roll): Usually occurs at the end of a long and powerful expiratory phonation. It is characterized by a) a very low, gradually decreasing, and vibrating fundamental frequency F0, and b) a gradually decreasing total energy level.
2. Flat: The basic expiratory phonation. Characterized by a) a smooth and steady F0, b) clearly observable harmonics, and c) little energy distribution in between the harmonics.
3. Falling: Similar to the flat except for a descending F0.
4. Double harmonic break: A simultaneous parallel series of harmonics in between the harmonics of F0. The in-between harmonics occur suddenly and are usually weaker than the primary ones.
5. Dysphonation: Characterized by a) an unstructured energy distribution over the whole frequency range, sometimes with a tendency of higher concentration over the middle to high (1-5 kHz) frequency range, or b) an unstructured energy distribution imposed on or in between the barely distinguishable harmonics.
6. Rising: Similar to the flat except for an ascending F0.
7. Hyperphonation: Phonation with an extraordinarily high F0 (often over 1 kHz).
8. Inhalation: The sound produced by the infant’s rapid breathing in of air. Usually occurs after an exhaustive expiratory phase.
9. Vibration: Characterized by a) clearly observable harmonics but with a vibrating F0, b) no unstructured energy distribution in between harmonics, and c) a normally high total energy level.
10. Weak vibration: Similar to the vibration except that the total energy level is significantly lower than the normal level.

Figure 5.1 shows the spectrograms of the 10 “cry phonemes”. In each spectrogram, the frequency ranges from 0 Hz at the bottom to 4 kHz at the top.

The above choice of the ten “cry phonemes” is conceptually based on the cry generation models suggested by Truby et al. in 1965 [114] and Golub et al. in 1985 [35]. However, our choice is also governed by the two criteria stated at the beginning of the present section. In particular, as mentioned in Subsection 4.1.1.2, three different cry modes are suggested by Golub and Corwin [35]. They are: phonation, hyperphonation, and dysphonation.
To lay the basis for reliable computer-aided analysis, in our “cry phoneme” definitions we further divide the phonation cry type into five sub-types, namely flat, rising, falling, vibration, and weak vibration, according to the melody types and/or the short-time energy level of the signal. We also define “cry phonemes” for trailing (or glottal roll) and double harmonic break, because of their distinctive time-frequency patterns and their frequent occurrence in normal infant cries. There is no direct explanation of these two phenomena in the cry generation model in [35]. The same applies to the sound created during the inspiratory phases, which we define as a separate cry phoneme: inhalation.

Figure 5.1 The time-frequency patterns of the 10 “cry phonemes” shown in the computer-derived spectrograms. They are: (1) trailing, (2) flat, (3) falling, (4) double harmonic break, (5) dysphonation, (6) rising, (7) hyperphonation, (8) inhalation, (9) vibration, and (10) weak vibration, respectively.

Figure 5.2 shows an example of labeling a three-second segment of a pain-elicited cry with the above-defined “cry phonemes”. On the top of the spectrogram in Figure 5.2, which is generated by our specifically designed graphics software, the segmentation of the signal is shown with the number indicating the mode or type of the “cry phoneme” during the interval. The curve at the bottom shows the short-time energy of the cry. There are basically two expiratory vocalizations and one inspiratory vocalization in this cry signal. The first expiratory phase started with a rising phonation (mode 6) followed by a dysphonation (mode 5), then a falling phonation (mode 3), and after a short pause it ended with another weak dysphonation (mode 5), which can be distinguished from the background noise by means of the accompanying energy curve. After the inspiratory phase (mode 8) and a short pause, the second expiratory phase started with a rising phonation (mode 6) followed by a short harmonic break (mode 4), and then ended with a short falling phonation (mode 3).

Figure 5.2 An example of labeling a pain-elicited cry into the “cry phonemes”. Numbers on the top indicate the mode or type of the “cry phoneme” of each segment of the cry signal. The curve below indicates the energy of the signal.

5.3 The Relationship between “Cry Phonemes” and the Parents’ LOD Ratings

To determine the relationship between each of our “cry phonemes” and the parents’ assessment of the LODs of the cries, we conduct some experiments. We also deduce a single measure, the H-value, to quantify the LOD of the infant.

5.3.1 Experiments

Subjects

The subjects include 36 newborn infants recruited from the labor and delivery unit of a major metropolitan maternity hospital (see footnote g). Criteria for inclusion were spontaneous vaginal delivery or planned caesarian section, gestational age 37-42 weeks, birth weight above 2500 g, 5 min Apgar of 8, 9, or 10, and infant judged clinically healthy (normal). Table 5.1 shows the characteristics of the infants [38].

Variable             Mean      S.D.     Range
Age (min)            130.25    40.00    57-253
Birthweight (g)      3540.83   532.26   2540-4900
Birth length (cm)    51.25     1.83     48-55
Gestation (weeks)    38.94     1.01     37-41
Apgar at 5 min       9.28      0.57     8-10

Table 5.1 Characteristics of the 36 infants.

Procedure

The data collection was conducted in a quiet room near the nursery. A color camera with a unidirectional microphone was used to record both the facial activities and the vocalizations of the infant on 3/4-inch videotape (the recorded video signal was not used in our study).
During the recording the microphone was suspended 8 inches from the infant’s mouth, a distance that was maintained for all infants.

Each infant responded by crying to the following three procedures: injection of vitamin K into the thigh, application of triple dye disinfectant solution to the umbilical cord, and swabbing the thigh opposite the one receiving the injection. The order of the procedures was randomly set for each infant.

Footnote g: The infant recruitment, criteria setting, and data collection were conducted by Dr. R. V. B. Grunau et al. of the B.C. Children’s Hospital and University of British Columbia, and were supported by a grant to Dr. K. D. Craig from NSERC. For more details on data collection, see [38].

The cry recordings were first transferred onto 2-track audiocassettes and then digitized at a sampling rate of 8 kHz after lowpass filtering at 4 kHz. A 12-bit resolution was used for the digitization.

After deleting some noise-corrupted samples and splitting those long cry samples which were often interrupted by periods of silence, we finally collected 103 cries from the recordings. Amongst these 103 cries, 58 were randomly chosen as our testing sample set and were used in this experiment. The remaining 45 cries form our training set, which was later used in designing our cry analyzers.

Then, each of the 58 cries in the testing set was divided and labeled into a cry phoneme sequence as discussed in the previous section. This was accomplished with the aid of our computer graphics program, which provides a convenient work space consisting of a high-resolution spectrogram and an energy curve of the signal (as seen in Figure 5.2). To compute the spectrogram and the short-time energy, the signal was cut into frames by a 32-msec-wide sliding Hamming window which advanced in steps of 10 msec.
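As a rough illustration of the framing just described, the following Python sketch computes a short-time energy curve using a 32 msec Hamming window advanced in 10 msec steps at the 8 kHz sampling rate. The function names, and the choice of defining the frame energy as the sum of squared windowed samples, are assumptions made for this illustration; the exact energy formula and software used in the experiment are not specified here.

```python
import math


def hamming(n):
    """Return an n-point Hamming window as a list of floats."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def short_time_energy(signal, fs=8000, win_ms=32, hop_ms=10):
    """Short-time energy of `signal` (a list of samples).

    Assumed definition: sum of squared Hamming-windowed samples per frame.
    With fs = 8000 Hz the window is 256 samples and the hop is 80 samples.
    Only complete frames are used; a trailing partial frame is dropped.
    """
    win_len = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = hamming(win_len)
    energies = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len]
        energies.append(sum((s * w) ** 2 for s, w in zip(frame, window)))
    return energies


# Hypothetical usage: one second of a 500 Hz tone sampled at 8 kHz.
tone = [math.sin(2.0 * math.pi * 500.0 * n / 8000.0) for n in range(8000)]
energy_curve = short_time_energy(tone)
```

One energy value is produced per 10 msec step, so the resulting curve can be plotted on the same time axis as a spectrogram computed with the same framing.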
Then, for each cry we calculated the percentage of the total time of each cry phoneme out of the entire vocalization time of the cry.

The same cries were later evaluated by 20 parents (10 married couples with at least one young child each). These parents were recruited from the graduate student population of our university. The parents were instructed to listen repeatedly to the recordings of the 58 infant cries, and to report their assessments of the infants’ level of distress. They were asked to rate each cry into one of five LOD grades. Grade 1 to grade 5 in the rating represents an increasing degree of LOD from the least to the most. To prevent preconceptions, the precise causes of the cries were not disclosed to the parents before or during the experiment.

5.3.2 Results

As mentioned above, each of the 58 cries was segmented and labeled into sequences of “cry phonemes” with the aid of our computer graphics tool. Also, the cries were separately assigned by the parents to one of the five LOD grades. Then, to find the correlation between the “cry phonemes” and the parents’ LOD ratings, the scatter graphs in Figure 5.3 were computed. Each scatter graph contains 58 dots corresponding to the 58 cries and shows the relationship between a specific “cry phoneme” and the parents’ LOD ratings. The abscissa in a scatter graph is the percentage duration of the respective cry phoneme in each cry sample. Each cry in a scatter graph is represented by a line with a dot in its middle. The dot indicates the mean LOD assigned by the parents and the line shows the standard deviation of the ratings given to that cry.
Please note that a dot on the vertical axis of a scatter graph indicates that the corresponding “cry phoneme” is not present in the cry signal represented by the dot.

Amongst the 10 “cry phonemes”, the dysphonation, hyperphonation, and inhalation show a strong positive relationship with the parents’ LOD ratings, and the flat and weak vibration show a strong negative relationship. Also, the trailing and double harmonic break show a weaker yet still clear positive relationship. No strong correlation is observed with the falling, rising, and vibration.

Figure 5.3 The relationship of parents’ LOD ratings to the percentage occurrences of different “cry phonemes”. [Ten scatter graphs, one per “cry phoneme”, with horizontal axes: Type 1 - Trailing (%), Type 2 - Flat (%), Type 3 - Falling (%), Type 4 - Harmonic break (%), Type 5 - Dysphonation (%), Type 6 - Rising (%), Type 7 - Hyperphonation (%), Type 8 - Inhalation (%), Type 9 - Vibration (%), Type 10 - Weak vibration (%).]

The dysphonation “cry phoneme” shows the most consistent positive relation with the parents’ ratings. In our experimental data, all the cries containing more than 10% of dysphonation were given a mean rating between the middle and the highest LOD grade. This is consistent with the findings of Gustafson and Green [39], who analyzed the cry on the basis of single expiratory phases, and also agrees with the definition and descriptions of dysphonation in Truby and Lind [114].

5.4 The Introduction of the H-value and Its Application in Quantifying LOD Assessments

For expediency and efficiency in the study of the LOD rating, we wish to characterize the 10 “cry phonemes” by a single composite parameter, the H-value. The H-value of a “cry phoneme” sequence is defined in terms of the phonemes showing positive correlations with the parents’ LOD ratings. We propose that the H-value be defined as the percentage of the whole vocalization time of the cry that is occupied by the following five “cry phonemes”: trailing, double harmonic break, dysphonation, hyperphonation, and inhalation, i.e.,

\[
H = \frac{D_T + D_B + D_D + D_H + D_I}{D_{total}} \times 100 \ (\%), \tag{5.1}
\]

where $D_T$, $D_B$, $D_D$, $D_H$, and $D_I$ are the durations of the above five “cry phonemes”, respectively, and $D_{total}$ is the total time of vocalization in the whole cry. Figure 5.4 shows the correlation between the H-values of the 58 cries and the parents’ LOD ratings.

It is evident from Figure 5.4 that the H-value shows a very high consistency with the parents’ LOD ratings, i.e., high H-values always correspond to high LOD ratings of the parents while low H-values always correspond to low LOD ratings. This makes it possible to use the H-value as a reliable measure for evaluating the LOD of the normal infant.
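To make the definition of the H-value concrete, the following Python sketch evaluates it for a labeled “cry phoneme” script. The representation of a script as a list of (mode, duration) pairs, the function name, and the example durations are all assumptions introduced for illustration; only the mode numbering and the choice of the five contributing “cry phonemes” come from the text.

```python
# Mode numbers follow the list in Section 5.2:
# 1 trailing, 4 double harmonic break, 5 dysphonation,
# 7 hyperphonation, 8 inhalation.
H_MODES = {1, 4, 5, 7, 8}


def h_value(script):
    """Return the H-value (in percent) of a cry phoneme script.

    `script` is assumed to be a list of (mode, duration) pairs covering
    only the vocalized segments of the cry, so that the summed duration
    equals the total vocalization time. Pauses are assumed to have been
    excluded beforehand. Any consistent time unit may be used.
    """
    total = sum(duration for _, duration in script)
    if total <= 0:
        raise ValueError("script has no vocalization time")
    contributing = sum(duration for mode, duration in script if mode in H_MODES)
    return 100.0 * contributing / total


# Hypothetical script, loosely following the mode sequence of Figure 5.2.
# The durations (in msec) are invented for illustration only.
example_script = [
    (6, 300), (5, 400), (3, 250), (5, 150),  # first expiratory phase
    (8, 200),                                # inspiratory phase
    (6, 300), (4, 100), (3, 200),            # second expiratory phase
]
example_h = h_value(example_script)
```

Because the H-value is a ratio of durations, it does not depend on the time unit or on the order of the segments, only on how the vocalization time is divided among the ten modes.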
An automated cry analyzer may thus calculate the H-value of the cry under consideration and then give an estimation of the LOD of the cry.

Figure 5.4 Relationship between the H-value and the parents’ LOD ratings. [Scatter graph; horizontal axis: H-value, from 0 to 100.]

5.5 Discussions and Remarks

We follow the approach of correlating the cry attributes to subjective observations by parents or experienced caretakers. At present, we believe this is a more feasible approach than that of classifying the cry into types corresponding to the different stimuli that may have caused the cry. The latter approach will become more productive after we gain a thorough and proven understanding of the cause-result relationship in the infants’ cry generation process.

In this chapter, we introduced (1) the use of the “cry phonemes” and the “cry phoneme” sequences or scripts to characterize normal infant cries, and (2) a single parameter extracted from these “cry phoneme” scripts, the H-value, to evaluate the underlying LOD of the infant during the cry. In our experiment with 36 newborn infants and 20 parents, we found that the H-values derived from the cry signals show clear correlation with the parents’ ratings of the LOD of the infants. The next challenge is to find effective and efficient methods to automatically estimate the H-value from the cry signal. Then it will become feasible to build reliable and robust devices which can make humanlike assessments of the infants’ LOD by automatically analyzing their cries. In the following two chapters, we describe two such methods, one of which is based on a nonparametric statistical classifier, and the other on the hidden-Markov-model technique.

Chapter 6

CRY ANALYZER I: A NONPARAMETRIC STATISTICAL CLASSIFIER-BASED METHOD

In this chapter, first we construct a cry analysis system based upon the traditional statistical classifier.
Then, we evaluate the performance of this cry analysis system using cry samples from normal infants, by comparing the estimated values of the infants’ LOD which this system yields with the LOD reported by parents who listened to the same cries.

The idea behind the design of this cry analyzer is to use statistical classification methods to automatically identify the 10 different “cry phonemes” of the cry signal defined in Chapter 5. Then we calculate the H-value of the cry by using the definition given in Section 5.4. The H-value serves as an estimation of the infants’ LOD inferred from the cry, as discussed in the previous chapter.

6.1 A Survey of Traditional Statistical Classification Methods and Their Limitations

Theoretically, a classifier assigns an object whose value is observed or known, but whose class is unknown, to one of many predetermined classes or groups. Traditionally, statistical classifiers are based on Bayes’ Theorem (see Appendix C). In the form used for classification, it is stated as follows:

assign the object to group $\omega_k$ if

\[
p(x \mid \omega_k) P(\omega_k) \ge p(x \mid \omega_j) P(\omega_j) \quad \text{for all } j, \tag{6.1}
\]

where $\omega_k$ ($k = 1, \ldots, N$) are the $N$ different groups to which the unknown observation $x$ is to be assigned, $p(x \mid \omega_k)$ is the conditional probability density function of observation $x$ given group $\omega_k$, and $P(\omega_k)$ is the a priori probability of group $\omega_k$. A more detailed discussion about Bayes’ Theorem is found in Appendix C.

To build a classifier, it is thus important to determine or estimate the conditional probability density function (pdf) $p(x \mid \omega_k)$ for each group $\omega_k$, $k = 1, 2, \ldots, N$. In some cases, we can analytically deduce these pdfs from some a priori knowledge about the problem. However, in most practical situations, the pdfs can only be estimated from a set of samples with known group assignments.
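The decision rule (6.1) can be sketched in a few lines of Python. In this illustration the class-conditional pdfs are assumed to be known one-dimensional Gaussians, which is purely an assumption made to keep the example small; the function names and the numerical values are likewise hypothetical and are not taken from the cry data.

```python
import math


def gaussian_pdf(mean, std):
    """Return a one-dimensional Gaussian density as a function of x."""
    def pdf(x):
        z = (x - mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return pdf


def bayes_assign(x, densities, priors):
    """Return the index k that maximizes p(x | w_k) * P(w_k), as in (6.1).

    `densities` is a list of functions p(x | w_k) and `priors` is the
    list of the corresponding a priori probabilities P(w_k).
    """
    scores = [pdf(x) * prior for pdf, prior in zip(densities, priors)]
    return max(range(len(scores)), key=scores.__getitem__)


# Two hypothetical groups with unit-variance Gaussian densities.
densities = [gaussian_pdf(0.0, 1.0), gaussian_pdf(4.0, 1.0)]

# With equal priors the observation x = 2.5 is closer to the second group.
group_equal = bayes_assign(2.5, densities, [0.5, 0.5])

# A strong prior in favour of the first group can reverse the decision.
group_skewed = bayes_assign(2.5, densities, [0.99, 0.01])
```

The sketch shows that both the estimated pdfs and the a priori probabilities enter the decision, which is why the quality of the pdf estimates matters for the resulting classifier.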
This approach of estimating pdfs is very commonly used, and is referred to as the training of the classifier.

The pdf estimation procedures, as well as the resultant classifiers, are divided into two different classes, parametric and nonparametric, depending on whether or not the functional form of the pdfs is predetermined prior to the training. For parametric classifiers, such as the linear classifiers and the quadratic classifiers [22, 40], the functional form of the pdfs is predetermined. Hence we only need to estimate the parameters of the pdfs.

On the other hand, nonparametric classifiers are built without any assumption about the functional form of the pdfs. This makes nonparametric classification one of the most useful approaches in statistical pattern recognition [22, 40]. When dealing with problems with unknown or complicated distributions, nonparametric classifiers frequently show much higher classification accuracy than that achieved by parametric classification approaches. Even when the data are from normally distributed populations, the nonparametric algorithms sometimes outperform their parametric counterparts [85, 86].

Generally, when we do not possess enough knowledge to allow us to make reasonable assumptions about the functional form of the pdfs, we must give the nonparametric methods first consideration. This is exactly the case in our infant cry analysis problem, where little or no knowledge is available about the distributions of the features of infant cries in either the temporal or the spectral domain. We, therefore, decided to use the nonparametric classification approach to build our automatic infant cry analyzer.

For all its advantages and power, the application of nonparametric classifiers also suffers from various difficulties, especially as the size of the problem increases.

The common disadvantages of classical nonparametric approaches (kernel estimator, kNN classifier, etc.)
are their computational complexity and their requirement of large amounts of computer storage to hold the design sets. Large design sets are always desirable when nonparametric approaches are used. This is because the feature vectors in the problems requiring the use of nonparametric methods usually possess very complex or unknown probability distributions. This means that a large sample set is required to obtain sufficient statistical information. Because of these difficulties, nonparametric classification methods are usually complex and slow, and their on-line application is rare. Their uses are often limited to situations where the computational time is not a crucial factor, such as in the estimation of the Bayes error and data structure analysis [22].

A solution to the above problems is to reduce the size of the design set while insisting that the classifiers built upon the reduced design set perform as well, or nearly as well, as the classifiers built upon the original design set. This idea has been explored for various purposes over a period of time and has resulted in the development of many algorithms for kNN classifier design using reduced design sets. Examples of this type of classifier are the condensed NN (CNN) [42], the reduced NN (RNN) [29], and the edited NN (ENN) [13]. In these algorithms, iterative processes are used to test the effect on the classification performance of each individual vector in the design set as the vector is moved in and out of the design set. Then only the “pertinent” vectors are retained in the reduced design set. For very large design sets, these methods are often tedious and difficult to implement, since a new classifier is in fact built and evaluated every time a vector is moved in or out of the design set. The most serious disadvantage is that the resulting reduction rate is usually low (i.e., not as good as hoped).
Also, the reduction rate is usually not under the control of the algorithms, but depends entirely on the nature of the design set to be reduced.

Recently, two nonparametric data reduction algorithms for the Parzen kernel classifier and the NN classifier design have been proposed by Fukunaga et al. [22-24]. Their algorithms find the optimal reduced design set in the sense that the difference between the estimated probability density functions of the reduced set and the original set is minimized. Bearing some similarities to the traditional reduced data kNN algorithms, their algorithms iteratively move each individual vector in and out of a tentatively chosen reduced design set and test the resultant effect on the criterion function. To avoid an exhaustive search of all possible subsets, the optimization scheme used in Fukunaga’s algorithms achieves a local optimum. The computational complexities of these algorithms are considerable. Moreover, the initial guess of the reduced sample set is of crucial importance in Fukunaga’s reduced NN algorithm. So far, an intuitively developed initial assignment procedure for only the 2:1 reduction rate case has been published [24].

6.2 The Development of the VQ-based Nonparametric Classification Approach

We introduce a new approach for nonparametric data reduction using the vector, or block, quantization technique. In a recent publication, “Vector Quantization Technique for Nonparametric Classifier Design”, we have already demonstrated the effectiveness of this approach [126].

While our VQ-based method is developed for the purpose of realizing efficient automatic infant cry analysis, it should be noted that the method is, in fact, a general purpose nonparametric classification method. Hence it is applicable to other statistical pattern recognition problems.
Therefore, in this section we will present, investigate, and discuss such properties as the accuracy, speed and computational complexity of this new method within the broader context of statistical pattern recognition. Then in Section 6.3 we will apply this approach to the design of an automatic cry analyzer system.

Vector quantization is a mathematical process which has already been widely used in various areas of engineering, such as digital signal processing, communication, and speech recognition, as was mentioned in Section 3.5. In particular, VQ has been used as an effective data and image compression technique. Our research indicates that this technique can also be effectively utilized to perform the desired data reduction in the design of nonparametric classifiers. Combining vector quantization with each of the classical Parzen kernel and kNN approaches, we develop two new algorithms of reduced nonparametric classifier design, which we shall call the VQ-kernel and the VQ-kNN methods.

In these new algorithms, we first construct an optimal vector quantizer for each class of the training data. For each class, its corresponding original design set is used as the training sequence to determine the optimal reproduction alphabet of its vector quantizer. This reproduction alphabet, or code-book, is then retained as the reduced design set to represent the original set of this class. Using the reduced sets, a classifier utilizing either the Parzen kernel or the kNN method is then built.

For our purposes, we find that the reproduction alphabet serves as a good representative of the original design set. The obtainable reduction rate can be significantly high and may be preset freely in the algorithms. As illustrated by the examples given later, large original design sets containing hundreds of vectors can be well represented by a reproduction alphabet with only a few vectors.
Despite the high reduction rate, our VQ-based classifiers perform as well as or better than classical reduction methods in terms of classification error rate. This remarkable classification performance probably results from the fact that our VQ-based method does not restrict the search for the members of the representative (reduced) design set to vectors in the original design set. Instead, our VQ-based method creates a representative set which retains information from all the vectors in the original set. This separates our VQ-based method from the traditional methods, which always restrict a representative vector to be one of the original design set vectors.

In our VQ-based methods, we use the algorithm developed by Linde et al. [61] to design the vector quantizer. This algorithm has the advantage that it is well developed and widely applied in various fields, and its efficiency in implementation and its convergence properties are proven. A summary of Linde’s algorithm [61] is found in Appendix B.

As an alternative, the learning vector quantization (LVQ) algorithm, which was proposed recently by Kohonen [46, 47], may be used to generate the quantizer. The overall performance of the LVQ technique has been shown to be comparable to that of Linde’s VQ algorithm in image compression applications [84].

6.2.1 Development of the Algorithms

Since our new algorithms use vector quantization as the first stage, it is helpful to repeat briefly some of the definitions of vector quantization presented in Section 3.5. Then in Subsection 6.2.1.2 we will discuss an important property of vector quantization. Our algorithms are then described in Subsection 6.2.1.3.

6.2.1.1 Definition of vector quantization

An $M$-level $d$-dimension quantizer is a mapping, $q(\cdot)$, that assigns to each input vector, $x = (x_0, \ldots, x_{d-1})$, a reproduction vector, $y_k = q(x)$, drawn from a finite reproduction alphabet (or code-book), $A = \{y_i;\ i = 1, \ldots, M\}$.
The mapping $q(\cdot)$ is usually chosen as a minimum-distance mapping. This means that the reproduction vector $y_k$ is chosen such that

$$d(x, y_k) = \min_{i} d(x, y_i) \tag{6.2}$$

where $d(\cdot,\cdot)$ is any nonnegative distance measure defined on the d-dimension space, and $d(x, y_k)$ is called the distortion of the quantization. Obviously, a minimum-distance mapping, which is completely defined by the reproduction alphabet A along with the definition of the distance measure, uniquely describes a Dirichlet partition, $S = \{S_i;\ i = 1, \ldots, M\}$, of the sample space.

An M-level quantizer is said to be optimal if it minimizes the expected distortion $D(q) = E\{d(x, q(x))\}$, that is, $q^*$ is optimal if for any other quantizer q having M reproduction vectors, $D(q^*) \le D(q)$.

6.2.1.2 Distribution Property of the Reproduction Vectors and the VQ Data Reduction Method

In the following, we argue that the distribution of the reproduction vectors in an optimal vector quantizer possesses desirable properties that make the VQ technique a promising approach for nonparametric data reduction. Before doing so, however, it is helpful to point out a generalization concerning the nonparametric classifier design.

The philosophy guiding the development of most traditional nonparametric classification methods is as follows: the statistical information contained in a set of pre-classified samples (or design set) is used for finding a good approximation of the actual underlying probability density function, $p(x|\omega_k)$. Then the classifier is built by applying the Bayesian rule (see Section 6.1). This philosophy is also explicitly employed in the development of Fukunaga's reduced Parzen and NN classifiers. In these classifiers, as mentioned earlier, the reduced design set is selected in such a way that the difference between the density estimate $\hat{p}(x|\omega_k)$ obtained from this reduced set and that obtained from the original design set is minimized.
However, for achieving high classification performance, this approximation to $p(x|\omega_k)$, while it is obviously sufficient, is not necessary. An illustration of this is that any good approximation to $[p(x|\omega_k)]^{\alpha}$, where the constant $\alpha > 0$, will yield the same Bayesian classifier as that achieved by approximating $p(x|\omega_k)$ itself.

In [30] Gersho addresses the properties of the reproduction vectors of an optimal vector quantizer. The density function of the reproduction vectors in a d-dimensional quantizer is defined as

$$g_M(x) = \frac{1}{M\,V(S_i)}, \quad \text{if } x \in S_i, \text{ for } i = 1, \ldots, M \tag{6.3}$$

where $V(S_i)$ is the volume of $S_i$. Gersho shows that for an optimal quantizer, in the asymptotic situation where M is sufficiently large, $g_M(x)$ will closely approximate a continuous density function $\lambda(x)$ which is proportional to $[f(x)]^{1/(1+\beta)}$, where $f(x)$ is the actual underlying density function of the input random vector and $\beta$ is a constant determined by the dimension d and the distance measure. Therefore, if we generate the optimal quantizer by using the training vectors from class $\omega_k$, the density function of the resulting reproduction alphabet (of class k) will closely approximate a function that is proportional to $[p(x|\omega_k)]^{1/(1+\beta)}$.

This finding, along with the general consideration discussed at the beginning of this section, strongly indicates that the reproduction alphabet in an optimal quantizer could be used as an effective design set for building classifiers, provided the level of the quantizer M is sufficiently high.

Using VQ algorithms such as Linde's and Kohonen's, we can estimate an either locally or globally optimal reproduction alphabet from the training data set. This leads us to the following data reduction approach for the nonparametric classifier design: using the original design set as the training data, find the optimal reproduction alphabet of this set and use it as the reduced design set.
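The data reduction step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in our experiments: it follows the general splitting-plus-iteration idea of [61] with a squared Euclidean distortion, and the function names (`design_codebook`, `lloyd`), the perturbation `eps`, and the stopping tolerance `tol` are hypothetical choices made for the example. The number of levels is assumed to be a power of two.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def nearest(x, codebook):
    """Index of the reproduction vector closest to x (minimum-distance mapping)."""
    return min(range(len(codebook)), key=lambda i: sq_dist(x, codebook[i]))


def lloyd(train, codebook, tol=1e-6, max_iter=100):
    """Alternate partition and centroid steps until the distortion stops improving."""
    prev = float("inf")
    for _ in range(max_iter):
        cells = [[] for _ in codebook]
        total = 0.0
        for x in train:
            i = nearest(x, codebook)
            cells[i].append(x)
            total += sq_dist(x, codebook[i])
        distortion = total / len(train)
        # An empty cell keeps its old reproduction vector (an assumption).
        codebook = [centroid(c) if c else y for c, y in zip(cells, codebook)]
        if prev - distortion <= tol * max(distortion, 1e-12):
            break
        prev = distortion
    return codebook


def design_codebook(train, levels, eps=1e-3):
    """Grow a codebook by repeated splitting; `levels` is assumed to be 2**n."""
    codebook = [centroid(train)]
    while len(codebook) < levels:
        split = []
        for y in codebook:
            split.append([v * (1 + eps) + eps for v in y])
            split.append([v * (1 - eps) - eps for v in y])
        codebook = lloyd(train, split)
    return codebook


# A toy design set of eight 2-D vectors reduced to two representatives.
design_set = [[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]]
reduced_set = design_codebook(design_set, 2)
```

Note that neither of the two representatives in `reduced_set` is a member of `design_set`: they are centroids created from the data, which is the property that distinguishes this reduction from subset-selection methods.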
In the following section we describe our data reduction methods in full detail.

The most important difference between our data reduction method and the other traditional methods is that our reduced set is not necessarily a subset of the original design set. This is because in our VQ-based method, the vectors in the reduced set are created (rather than selected) by surveying all the information carried by every known vector. These vectors are created so that they best represent all the vectors in the original set in the sense of minimizing the average distortion of the quantization.

6.2.1.3 Classifier Design Algorithms Using Vector Quantization

The VQ-kernel Classifier

In this method, we propose that vector quantization be first applied to each of the original design sets of all classes. The reproduction alphabets of the resultant optimal quantizers are then retained as the reduced design sets. Then the kernel method is applied as usual except that the reduced design sets are used.

The following steps explain this method in greater detail. Considering an N-class problem in d dimensions, assume that for each class $\omega_i$ ($i = 1, \ldots, N$) an original design set $\{x_{ij};\ j = 0, \ldots, n_i - 1\}$ is given.

1. For class $\omega_i$, as in the algorithm in [61], we start with an initial $M_i$-level reproduction alphabet $A_0$. This initial alphabet can be generated by using the "splitting" approach proposed in [61].

2. Find the $M_i$-level optimal quantizer for class $\omega_i$ using the quantizer design method in [61]. Suppose the final reproduction alphabet for class $\omega_i$ is $A_i = \{y_{ij};\ j = 1, \ldots, M_i\}$.

3. Apply the Parzen's kernel method to the reproduction alphabet $A_i$ to get an estimate of $p(x|\omega_i)$, i.e.

$$\hat{p}(x|\omega_i) = \frac{1}{M_i} \sum_{j=1}^{M_i} K_i\left(x - y_{ij}\right) \tag{6.4}$$

where $K_i$ represents the kernel function of class $\omega_i$.

4. Repeat step 1 to step 3 for each class and get estimates of $\hat{p}(x|\omega_i)$ for $i = 1, \ldots, N$.

5. Finally a Bayes classifier is built upon all the estimated class-conditional pdfs $\hat{p}(x|\omega_i)$. The Bayes classifier assigns the unknown observation x to class $\omega_m$ if $\hat{p}(x|\omega_m)P(\omega_m) \ge \hat{p}(x|\omega_l)P(\omega_l)$ for all $l \ne m$, where $P(\omega_i)$ is the known a priori probability of class $\omega_i$.

For the special case when the levels of the quantizers are chosen equal to the number of vectors in the original design sets, i.e. $M_i = n_i$ ($i = 1, \ldots, N$), the above VQ-kernel method becomes equivalent to the traditional kernel method.

It should be noted that the selection of the kernel function as well as its parameters is important, yet difficult. This disadvantage is inherent to the classical kernel approach. Theories and conclusions developed in the literature on the classical kernel methods can be directly used to guide the selection of proper kernels and smoothing factors. A thorough discussion on this topic is found in [22, 41].

The VQ-kNN Classifier

This algorithm combines the VQ technique and the kNN method.

For an N-class problem, we assume that for each class $\omega_i$ ($i = 1, \ldots, N$), a design set $\{x_{ij};\ j = 0, \ldots, n_i - 1\}$ and an initial $M_i$-level reproduction alphabet $A_0$ are given (for the selection of $A_0$ see [61]).

1. For each class $\omega_i$ ($i = 1, \ldots, N$), find the $M_i$-level optimal quantizer using the algorithm in [61]. Suppose the final reproduction alphabets are $A_i = \{y_{ij};\ j = 1, \ldots, M_i\}$ ($i = 1, \ldots, N$).

2. Combine the reproduction alphabets $A_i$ of all classes into one single set A of M vectors. That is, $A = \bigcup_{i=1}^{N} A_i$ and $M = \sum_{i=1}^{N} M_i$.

3. The classification rule is as follows. Assume x is the new observation point to be classified.

a) For x, find the k nearest reproduction vectors in A. Suppose that amongst these k reproduction vectors there are $k_m$ vectors from class $\omega_m$. Then the classification rule is: assign x to class $\omega_m$ if

$$\frac{k_m}{M_m} P(\omega_m) \ge \frac{k_l}{M_l} P(\omega_l) \quad \text{for all } l \ne m. \tag{6.5}$$

b) If the a priori probability $P(\omega_i)$ is unknown, we can approximate it from the design set as

$$\hat{P}(\omega_i) = \frac{n_i}{\sum_{l=1}^{N} n_l}. \tag{6.6}$$

The classification rule, then, becomes: assign x to class $\omega_m$ if

$$\frac{n_m}{M_m} k_m \ge \frac{n_l}{M_l} k_l \quad \text{for all } l \ne m. \tag{6.7}$$

The classical kNN method becomes a special case of our new method when $M_i$ is chosen to be equal to $n_i$ ($i = 1, \ldots, N$). If k = 1 is used, we get the VQ-NN classifier, which is computationally very efficient since, when all the a priori probabilities are equal, the unknown observation is simply assigned to the class to which its nearest reproduction vector belongs.

6.2.2 Experiments of the Application of Our VQ-based Classifiers

We now give two examples to demonstrate the use of our VQ-kernel and VQ-NN classifiers. The data used in the first example has a known probability distribution, in particular a mixture of Gaussian distributions. In the other example, we test our VQ classifiers with real speech data of unknown probability distribution. For comparison, we also test the performances of traditional reduction algorithms including the CNN, RNN, ENN, as well as Fukunaga's reduced Parzen method.

6.2.2.1 Example 1 (synthetic data from Gaussian mixture)

We adopt Fukunaga's 8-dimension data model from [22] (page 555, Experiment 11-7). In this model, each of the two classes consists of two Gaussian distributions. The a priori probabilities are equal for the two classes. The class-conditional pdfs are $p(x|\omega_1) = 0.5N(\mu_1, I) + 0.5N(\mu_2, I)$ and $p(x|\omega_2) = 0.5N(\mu_3, I) + 0.5N(\mu_4, I)$ respectively, where the covariance matrices are equal to the identity matrix I, and the mean vectors are $\mu_1 = [0\ 0\ \ldots\ 0]^T$, $\mu_2 = [6.58\ 0\ \ldots\ 0]^T$, $\mu_3 = [3.29\ 0\ \ldots\ 0]^T$ and $\mu_4 = [9.87\ 0\ \ldots\ 0]^T$ respectively.

The Bayes error of this data model is 7.5%. The Euclidean metric was used as the distance measure. The original design set contains 150 vectors for each class, and similarly the testing set has 150 vectors from each class.

Fig.
6.1 shows the result of our experiment in which we tested our VQ-kernel and VQ-NN classifiers as well as other traditional reduced data classifiers, including the CNN [42], the RNN [29], and the ENN and the edited-condensed nearest neighbor (EC-NN) [13]. Fukunaga et al. reported the performance of their reduced Parzen classifier with the same data model in [22]. We include their results in the graph as well. The same normal kernel function, with a constant covariance matrix of 1.52 x I, which was used in Fukunaga's reduced Parzen classifiers, was also used in our VQ-kernel method. Each point in the graph represents the average result after 10 trials.

In both Fukunaga's and our methods, all 1-level reduced data classifiers failed due to the fact that the underlying probability density of each class is composed of two separate Gaussian distributions. When 2 or more representatives are used in the reduced sets, both our new VQ-based classifiers showed excellent performance. In particular, our VQ-kernel classifier achieved the best performance amongst all the classifiers. It gave a classification error rate extremely close to the Bayes error, the theoretical lower bound, at almost all the reduction rates, and its performance showed little correlation to the reduction rate.

Figure 6.1 The classification error rates of our VQ-based classifiers and other traditional reduced data classifiers.

Our VQ-NN classifier evidently also outperformed all other modified NN classifiers in terms of both the reduction rate and the classification accuracy. This is with the exception of the ENN, which showed a better accuracy but at a low reduction rate of only 1.67:1.
It is very interesting to notice that at high reduction rates (M<40), our VQ-NN classifier showed significantly better performance than it did at the lower reduction rates. We believe that this is due to the specific structure of the distribution density functions (mixtures of two Gaussian distributions) underlying the experimental data.

[Figure 6.1 plot: classification error rate (%) versus M, the number of representatives per class; curves for VQ-kernel, VQ-NN, Fukunaga's reduced Parzen, RNN and EC-NN, with reference levels for the Bayes error, the basic NN and the basic kernel classifiers.]

6.2.2.2 Example 2 (speech data with unknown distribution)

This example demonstrates the classification performance of our new VQ-based classifiers with real data extracted from speech signals.

The 12-dimension speech data measured the first 10 cepstrum coefficients, the short-time zero-crossing rate, and the short-time energy of the voice signals.

The data for class 1 (2247 vectors in total) came from male speaker 1 (Chuck), and the data in class 2 (2256 vectors in total) were for male speaker 2 (Gray). For each trial in the experiments, we first randomly drew a 500-vector design set from each class, and used the rest to form the test set.

For our VQ-kernel classifier and Fukunaga's reduced Parzen classifier, we used the Euclidean metric and normal kernel functions with their covariance matrices estimated from the original design sets, that is, the kernel function in equation (6.4) becomes

$$K_i(x - y) = (2\pi)^{-d/2}\,|\Sigma_i|^{-1/2} \exp\left[-\tfrac{1}{2}(x - y)^T \Sigma_i^{-1} (x - y)\right], \quad i = 1, 2 \tag{6.8}$$

with $\Sigma_i = C_i$, where $C_i$ is the covariance matrix of the original design set of class i. For our VQ-NN and the traditional reduced data NN classifiers, the variance-weighted distance measure was used, i.e.

$$d(x, y) = \sum_{l} \frac{|x_l - y_l|^2}{\sigma_l^2}, \tag{6.9}$$

where $x_l$ and $y_l$ are the $l$th elements of x and y, respectively, and $\sigma_l^2$ is the variance of the $l$th element of the training vectors.

Fig. 6.2 shows the result of our experiments.
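For readers who wish to reproduce the NN-type experiments, the variance-weighted distance of equation (6.9) and the VQ-kNN decision rule of Subsection 6.2.1.3 can be sketched as follows. This is a minimal illustration rather than the code used to produce our results. The function names, the dictionary-based data layout, and the use of the population (rather than sample) variance are assumptions of the example, and every feature element is assumed to have non-zero variance.

```python
def variances(vectors):
    """Per-element variance of the training vectors (population form, an assumption)."""
    n = len(vectors)
    columns = list(zip(*vectors))
    means = [sum(col) / n for col in columns]
    return [sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)]


def weighted_dist(x, y, var):
    """Variance-weighted squared distance in the form of equation (6.9)."""
    return sum((xl - yl) ** 2 / s for xl, yl, s in zip(x, y, var))


def vq_knn_classify(x, codebooks, design_sizes, var, k=1):
    """Assign x to the class that maximises k_m * n_m / M_m, as in rule (6.7).

    codebooks    -- dict mapping a class label to its list of reproduction vectors
    design_sizes -- dict mapping a class label to n_i, the original design set size
    var          -- per-element variances used by the distance measure
    """
    # Pool the reproduction vectors of all classes and rank them by distance to x.
    pooled = [(weighted_dist(x, y, var), label)
              for label, book in codebooks.items() for y in book]
    pooled.sort(key=lambda pair: pair[0])

    # Count how many of the k nearest reproduction vectors come from each class.
    counts = {label: 0 for label in codebooks}
    for _, label in pooled[:k]:
        counts[label] += 1

    scores = {label: counts[label] * design_sizes[label] / len(codebooks[label])
              for label in codebooks}
    return max(scores, key=scores.get)
```

With k = 1 and equal design set sizes, the rule reduces to picking the class of the single nearest reproduction vector, which is the VQ-NN classifier used in this example.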
The best performances were produced by our VQ-kernel classifier at all the reduction rates. The flatness of the curve shows that the classification accuracy of our VQ-kernel classifier is almost independent of the reduction rate. Compared to Fukunaga's reduced Parzen classifier, the improvement in the classification accuracy of our VQ-kernel classifier is obvious, especially at the high reduction rates (> 25 : 1, or M < 20).

[Figure 6.2 plot: classification error rate versus M, the number of representatives per class; curves for VQ-kernel, VQ-NN, Fukunaga's reduced Parzen and EC-NN, with a reference level for the basic kernel classifier (M=500).]

Figure 6.2 The classification error rates of our VQ-based classifiers, the traditional reduced data NN classifiers, and Fukunaga's reduced Parzen classifier for the speech data.

When operating in the same range of reduction rate (for M ≥ 16), our VQ-NN classifier significantly outperformed the other reduced-data NN classifiers. When the reduction rate used in our VQ-NN classifier becomes very high (62.5 : 1 to 500 : 1, i.e., 1 ≤ M ≤ 8), our VQ-NN classifier still achieved classification accuracies comparable to those of the other reduced-data NN classifiers operating at much lower reduction rates, ranging from 1.75:1 for the ENN to 23:1 for the EC-NN. Please note that, for the other reduced-data NN classifiers (CNN, RNN, ENN, and EC-NN), the reduction rate is not selectable, but depends solely on the distribution of the original design sets. Only our VQ-NN classifier allows different reduction rates to be chosen.

In addition to the classification accuracy and the reduction rate, the computational demands of Fukunaga's algorithms are much larger than those of our VQ-kernel algorithm.
The computational complexity of finding the reduced set in Fukunaga's reduced Parzen algorithm can be shown to be of the order of $rN^2k^2$ while that of our VQ-kernel algorithm is of the order of $rNk$, where r is the size of the reduced set, N is the size of the original set, and k is the dimension of the data. In our above speech data experiment, a single trial of our VQ-kernel classifier took only about 3.92 minutes of CPU time, while a trial of Fukunaga's reduced Parzen classifier for exactly the same data and on the same SUN Sparc 2 computer took over 56 hours of CPU time! Amongst the NN-based algorithms, our VQ-NN classifier was also found to be the fastest. Table 6.1 shows the CPU time used in a single trial by each of the tested algorithms for constructing the reduced design sets in the above speech data experiment. The CPU time for our VQ-NN was measured with M=128.

Table 6.1 CPU time used for finding the reduced set in the speech data experiment. (* including CPU time used for classification)

6.2.3 Summary

In this section we introduced the vector quantization technique into the area of nonparametric classifier design and showed that vector quantization is an extremely effective approach to data reduction.

Using the vector quantization data reduction technique, two methods of nonparametric classifier design, namely the VQ-kernel and the VQ-kNN, are proposed and tested with both synthetic and real data.
Compared to other known nonparametric data reduction algorithms, the new methods are found:

1) to give much better results in terms of both the classification accuracy and the data reduction rate;
2) to have significantly less computational complexity in general;
3) to allow control over the reduction rate; and,
4) to achieve a classification accuracy which is only moderately dependent on the reduction rate.

Theoretically, for highly nonparametric data the classification accuracy of our VQ-based classifiers will increase as M, the number of representatives in the reduced set, is increased. When the value of M approaches the size of the original design set, the performance of our VQ-based classifiers will approach that of the basic (without reduction) Parzen and NN classifiers. This tendency is shown clearly in our experiments above. Therefore, the selection of M becomes a trade-off: a larger M generally yields higher accuracy but a lower reduction rate, and a smaller M yields lower accuracy but a higher reduction rate.

6.3 Automatic Cry Analyzer Based on the VQ-kernel Methods

In this section, we build a system which automatically estimates the H-value of the cry signal. The H-value is a measure which indicates the level of distress of the infant who uttered the cry. We detail the system design and important technical considerations of our automatic cry analysis system. This system is built around the new VQ-based nonparametric classifier developed in the previous section. In particular, we choose the VQ-kernel classifier because of its superior classification accuracy over other nonparametric classifiers, as illustrated in the previous section.

6.3.1 System Design

Figure 6.3 shows the block diagram of our automatic cry analysis system.
The system can be readily divided into three functional parts: preprocessing/feature extraction, VQ-kernel classifier, and H-value calculation.

6.3.1.1 Data preprocessing and feature extraction

As illustrated in Figure 6.3, the signal of the cry sound is first digitized at a sampling frequency of 8 kHz and a resolution of 12 bits per sample. Then a pre-emphasis filter with a transfer function of $1 - 0.95z^{-1}$ is applied to the digitized signal to remove its dc-component and to flatten the spectrum in order to remove the effect of glottal waveform and lip radiation characteristics [17, 36]. Finally, at the end of the data preprocessing, the cry signal is divided into overlapping frames by a 32 msec-wide sliding Hamming window which advances every 10 msec. Every frame has 256 samples.

The feature vector used in our automatic cry analyzer is composed of 24 elements. The first twelve elements are the zero-crossing rate, the energy, and the first ten LPC-derived cepstrum coefficients (see Subsection 4.3.2). To take into account the sequential characteristics of the cry attributes, the feature vector also includes the temporal derivatives of the first twelve elements. For reasons of simplicity, we approximate the derivative of each attribute by its first-order difference. Thus two consecutive frames are used to create the temporal derivatives of the original twelve attributes, i.e.,

$$\left.\frac{d\mathbf{a}(t)}{dt}\right|_{t=t_k} \approx \mathbf{a}(t_k) - \mathbf{a}(t_{k-1}) \tag{6.10}$$

where $\mathbf{a}(t_k)$ is the vector of the first 12 attributes at a discrete time $t_k$.

[Figure 6.3 block diagram; recoverable block labels: cry samples, zero-crossing rate, short-time energy, feature vector, reduced design sets, VQ-kernel classifier (H-type, E-type), H-value calculation, H-values.]

Figure 6.3 Automatic infant cry analysis system based on the VQ-kernel classifier.

Thus, the final feature vector used in our present system has 24 elements, as shown in Figure 6.3.
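The preprocessing chain described in this subsection can be sketched as below. This is an illustrative outline only. The function names are hypothetical, the extraction of the twelve static attributes themselves is not shown, the first frame's difference is set to zero because it has no predecessor, and the difference is taken against the previous frame; the last two points are assumptions of the sketch.

```python
import math

FS = 8000          # sampling frequency in Hz, as stated in the text
FRAME_LEN = 256    # 32 msec at 8 kHz
FRAME_STEP = 80    # 10 msec at 8 kHz


def pre_emphasis(signal, coeff=0.95):
    """Apply y[n] = x[n] - coeff * x[n-1], i.e. the filter 1 - 0.95 z^-1."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frames(signal, length=FRAME_LEN, step=FRAME_STEP):
    """Cut the signal into overlapping windowed frames; a partial last frame is dropped."""
    window = hamming(length)
    return [[s * w for s, w in zip(signal[start:start + length], window)]
            for start in range(0, len(signal) - length + 1, step)]


def add_deltas(static):
    """Extend each static feature vector with its first-order difference.

    The first frame has no predecessor, so its difference is set to zero.
    """
    out = []
    for k, vec in enumerate(static):
        prev = static[k - 1] if k > 0 else vec
        out.append(list(vec) + [a - b for a, b in zip(vec, prev)])
    return out
```

Applied to twelve static attributes per frame, `add_deltas` yields the 24-element vectors used by the classifier. One second of signal at 8 kHz produces 97 complete frames with these settings.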
This means that each frame of 256 samples, or equivalently 32 msec, of the cry signal will be represented by a 24-element feature vector.

6.3.1.2 The VQ-kernel classifier and H-value calculation

The 24-element feature vectors obtained from the data preprocessing/feature extraction stage are used as inputs to the VQ-kernel classifier. The latter is designed so as to classify two types of input vectors, the E-type and H-type vectors, which will be defined below. The classification results are used in estimating the H-value of the cry.

We define the H-type vectors as feature vectors representing frames of one of the trailing, double harmonic break, dysphonation, hyperphonation, and inhalation cry phonemes. This definition is based on our H-value definition in equation (5.1) in Section 5.4. Similarly, the E-type vectors are vectors representing frames from the remaining five cry phonemes, i.e. the flat, falling, rising, vibration, and weak vibration cry phonemes (see Chapter 5 for the definitions of all cry phonemes).

With the above H-type vector definition and our definition of H-value in Section 5.4, it is clear that the H-value can be estimated by counting the number of H-type vectors (or, equivalently, the number of H-type frames) and then dividing it by the total number of non-silence frames (or vectors, accordingly) in the cry, i.e.

$$\text{H-value} = \frac{\text{the number of H-type vectors}}{\text{the total number of non-silence vectors}} \tag{6.11}$$

The non-silence frames are easily detected by the energy in the feature extraction phase.

[Figure 6.4 diagram; recoverable label: VQ reduced training set.]

Figure 6.4 The generation of the reduced training sets for the VQ-kernel classifier.

To design the VQ-kernel classifier, two training sets are generated; the first is composed of all the H-type vectors from the 45 training cry samples (see Chapter 5), and the second set is composed of all the remaining E-type vectors.
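As a small worked illustration of equation (6.11), the H-value estimate reduces to a ratio of frame counts once every frame has been labelled. The label strings and the function name below are hypothetical. The result is expressed as a percentage on the assumption that the 0 to 100 axes used in our H-value plots denote percent.

```python
def h_value(labels):
    """Estimate the H-value (in percent) from per-frame labels.

    labels -- iterable of "H" (H-type), "E" (E-type) or "S" (silence);
              these label names are an assumption of this sketch.
    """
    non_silence = [lab for lab in labels if lab != "S"]
    if not non_silence:
        raise ValueError("the cry contains no non-silence frames")
    h_count = sum(1 for lab in non_silence if lab == "H")
    return 100.0 * h_count / len(non_silence)
```

For example, a cry with two H-type frames, two E-type frames and one silent frame has an estimated H-value of 50, since the silent frame is excluded from the denominator.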
Then vector quantization is applied to each of these sets to obtain the reduced training (design) sets. The reduced sets are then used in the design of the VQ-kernel classifier, as discussed in the previous section. Figure 6.4 shows the generation of the two reduced training sets.

6.3.2 Determination of VQ-kernel Classifier Parameters

Like all kernel classifiers, the VQ-kernel classifier requires the determination of the kernel function (refer to Subsection 6.2.1.3). Other important parameters that need to be determined for the VQ-kernel classifier are the decision threshold of the classifier [22] and the size of the code-books.

6.3.2.1 The kernel function

The VQ-kernel classifier uses the normal kernel function. Thus, the equation for the kernel function is

$$K(x) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\left[-\tfrac{1}{2}\, x^T \Sigma^{-1} x\right], \tag{6.12}$$

where d is the dimension of the sample space, which is equal to 24 in our case, and $\Sigma$ is the covariance matrix. The value of the covariance matrix has a significant effect on the properties of the kernel function [22]. We use two different covariance matrices, $\Sigma_1$ and $\Sigma_2$, for the two different classes of feature vectors: the H-type and the E-type vectors. To choose the values of the two covariance matrices we conducted a series of experiments.
We found that the highest classification accuracy was given by using the covariance matrix of the vectors of each training data set as the $\Sigma_1$ and $\Sigma_2$ of the respective class, i.e., for the H-type signal we use $\Sigma_1 = \Theta_H$, where $\Theta_H$ is the covariance matrix of the H-type vectors in the original (before reduction) training data, and similarly, for the E-type signal we use $\Sigma_2 = \Theta_E$.

Thus, we get the following estimated conditional pdfs (see Subsection 6.2.1.3):

$$\hat{p}(x|\text{H-type}) = \frac{1}{(2\pi)^{d/2} N_H |\Theta_H|^{1/2}} \sum_{j=1}^{N_H} \exp\left[-\tfrac{1}{2}(x - y_{Hj})^T \Theta_H^{-1} (x - y_{Hj})\right] \tag{6.13a}$$

$$\hat{p}(x|\text{E-type}) = \frac{1}{(2\pi)^{d/2} N_E |\Theta_E|^{1/2}} \sum_{j=1}^{N_E} \exp\left[-\tfrac{1}{2}(x - y_{Ej})^T \Theta_E^{-1} (x - y_{Ej})\right] \tag{6.13b}$$

where $N_H$ and $N_E$ are the numbers of training vectors in the reduced H-type training set and the reduced E-type training set, respectively, and $y_{Hj}$ and $y_{Ej}$ are the $j$th feature vectors in the reduced H-type training set and the reduced E-type training set, respectively.

6.3.2.2 The decision threshold

As indicated in [22, Chapter 7], when the kernel density estimate method is used, the likelihood ratio classifier becomes

$$-\ln \frac{\hat{p}(x|\text{H-type})}{\hat{p}(x|\text{E-type})} \;\underset{\text{H-type}}{\overset{\text{E-type}}{\gtrless}}\; t \tag{6.14}$$

where t is the decision threshold. Note that when the a priori probabilities P(H-type) = P(E-type) (no information suggesting otherwise in our case), the likelihood ratio classifier is equivalent to the Bayes classifier (see [22]).

The decision threshold t is simply a means of compensating for the bias inherent in the density estimation procedure. In [22], Fukunaga suggests methods for determining t in the Parzen classifier with a normal kernel function. To determine the threshold t in our VQ-kernel classifier, we use the following approach.

First, the reduced training sets are generated via the vector quantization method. Then we use the resulting VQ-kernel classifier with a fixed value of t to classify each training vector in the original (not the reduced) design sets into the H-type or E-type class. The classification accuracy is computed after all the training vectors have been classified. The above is repeated for different values of the decision threshold t.
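The kernel density estimates and the threshold search described here can be sketched as follows. Several simplifications are assumed and should not be read as the method actually implemented in our system: the covariance matrices are taken to be diagonal (represented by per-element variances) so that the sketch needs no matrix inversion, the computation is carried out in the log domain for numerical stability, and the sign convention of the test statistic (H-type when the negative log-likelihood ratio falls below t) is taken from the form of the likelihood ratio classifier in [22]. All function names are hypothetical.

```python
import math


def log_kernel_density(x, codebook, var):
    """Log of a normal-kernel density estimate built on a reduced design set.

    var holds per-element variances, i.e. a diagonal covariance matrix,
    which is a simplification of the full matrix used in the text.
    """
    d = len(x)
    log_norm = -0.5 * (d * math.log(2 * math.pi) + sum(math.log(v) for v in var))
    exponents = [-0.5 * sum((xi - yi) ** 2 / v for xi, yi, v in zip(x, y, var))
                 for y in codebook]
    top = max(exponents)
    # The log-sum-exp form avoids underflow in high-dimensional feature spaces.
    log_sum = top + math.log(sum(math.exp(e - top) for e in exponents))
    return log_norm + log_sum - math.log(len(codebook))


def decide(x, book_h, var_h, book_e, var_e, t=0.0):
    """Return "H" when -ln(p_H / p_E) < t and "E" otherwise (assumed convention)."""
    statistic = (log_kernel_density(x, book_e, var_e)
                 - log_kernel_density(x, book_h, var_h))
    return "H" if statistic < t else "E"


def best_threshold(samples, book_h, var_h, book_e, var_e, candidates):
    """Pick the candidate t giving the fewest errors on labelled training vectors.

    samples -- list of (vector, label) pairs drawn from the original design sets
    """
    def errors(t):
        return sum(decide(x, book_h, var_h, book_e, var_e, t) != label
                   for x, label in samples)
    return min(candidates, key=errors)
```

Because the candidate thresholds are scored on the same vectors that produced the code-books, the error count minimised by `best_threshold` is a resubstitution estimate and is optimistic compared with the error on unseen data.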
The decision threshold which achieves the highest classification accuracy is chosen for our classifier (6.14). This procedure is illustrated in Figure 6.5.

[Figure 6.5 block diagram; recoverable labels: reduced training set, decision threshold t.]

Figure 6.5 Determination of the decision threshold t.

Figure 6.6 shows the curve of the classification error rate vs. t, as t is varied from -4.5 to 4.5. Figure 6.6 is the result of applying the above procedure to the 45 cry samples in our training set. In this experiment a code-book of 256 vectors was used for each reduced training set. As shown in Figure 6.6, the lowest error rate (14.71%) occurs at t = 1.5. We then use this decision threshold in our final VQ-kernel classifier.

[Figure 6.6 plot: classification error rate versus decision threshold; horizontal axis from -5 to 5.]

Figure 6.6 The estimated classification error rate vs. the decision threshold t.

It is worth pointing out that while this procedure can find a reasonably good value for t, the estimated classification error rate (as shown in Figure 6.6) is in fact the resubstitution error as defined in [22].

6.3.2.3 The size of code-books

Besides the parameters of the ordinary kernel method, as discussed in Section 6.2, the size of the code-books, i.e., the number of representative vectors in each of the reduced training sets, is the only additional parameter that needs to be predetermined in our VQ-kernel classifier. Theoretically, the larger the size of the code-book M, the higher the classification accuracy, and the higher the computational complexity of the classification operation. However, as demonstrated by the examples in Section 6.2, the increase in the classification accuracy "saturates" as the size of the code-book reaches some point.
Beyond this point, the accuracy will no longer show any significant improvement no matter how much the size of the code-book is increased.

To determine the appropriate code-book size of the VQ-kernel classifier in our infant cry analyzer, we tested the VQ-kernel classifier with code-book sizes of 32, 64, 128, 256, and 512 on each of our two training data sets. We found that the classification accuracy "saturates" when 256-vector code-books were used, and no further significant increase in the accuracy was observed when the code-book sizes were increased to 512. We therefore decided to use code-books of 256 vectors each.

6.3.3 Training of VQ-kernel Classifier

The determination of parameters such as t, M, and the covariance matrices $\Sigma_1$ and $\Sigma_2$ in fact constitutes part of our training process. However, most of the computational effort in the training process is consumed by the estimation of the 256-level optimal code-books.

After the preprocessing/feature extraction phase, about 18,000 feature vectors for each of the two classes (H-type and E-type) are obtained from the original training data. Each feature vector is of dimension 24. Without our VQ-kernel algorithm, it would be impossible to apply the kernel method to such a huge training data set. We used the procedure described in Subsection 6.2.1.3 to condense each original training set, containing 18,000 vectors, into a reduced design set which has only 256 vectors. It took about 100 hours of CPU time on a SUN SparcStation 2 computer to generate the two reduced sets. It is obvious from Table 6.1 that the computer time consumption would have been prohibitive (orders of magnitude larger) if Fukunaga's reduced Parzen method had been used to find the reduced sets.

6.4 Results and Discussions

We tested our infant cry analyzer with the cry samples in the testing set (see Chapter 5). Our infant cry analyzer gives as its output an estimated H-value after each of the cry samples in the testing set is processed.
Figure 6.7 shows the relationship between the measured H-values (equation 5.1) and the H-values which are automatically estimated by our infant cry analyzer. Each small circle in the diagram represents a cry in the testing set. The vertical distance from a circle to the diagonal dashed line is the error of the estimated H-value. The mean error between the estimated H-value and the measured H-value is 14%. It is seen in Figure 6.7 that the automatic infant cry analyzer gives H-value estimates very close to the measured ones when the measured values are high, and gives estimates higher than the measured values when the latter are low (< 20).

[Figure 6.7 scatter plot; horizontal axis: measured H-value, 0 to 100; vertical axis ticks from 10 to 100.]

Figure 6.7 The H-values which are estimated by our VQ-kernel classifier-based cry analyzer versus the actual measurements. The estimation errors appear as the vertical distances from the circles to the diagonal dashed line.

In Figure 6.8, the H-values which are automatically estimated by the infant cry analyzer are related to the parents' LOD ratings of the same cry samples in the testing set. Comparing Figure 6.8 with Figure 5.4, we can easily see the effect of the errors in the low H-value range mentioned above, which appears to "scatter" the points in that region and move some of them towards the higher H-value region (ideally, the data points should lie around the diagonal from the lower left corner to the upper right corner of the scatter graph).

[Figure 6.8 scatter plot; horizontal axis: estimated H-value by VQ-classifier-based method, 0 to 100.]

Figure 6.8 Relationship between the H-values estimated by the VQ-kernel classifier-based cry analyzer and the parents' LOD ratings.

As expected, Figure 6.8 clearly shows a consistency between the parents' LOD assessments of the cries and the H-values estimated by our automatic infant cry analyzer. This result is preliminary but is of significant importance because it establishes for the first time a relationship between the parents' perceptual assessment of the infants' LOD conveyed in the cry sound and the automatically derived parameter (the H-value) of the LOD in the cry signal. More important is that we now have an effective and efficient method, which is based on digital signal processing and pattern recognition techniques, and which automatically extracts the H-value parameter from the cry sound. At the same time, the above results are a strong confirmation of the physical meaningfulness of 1) our use of the cry phonemes to represent the time-frequency patterns of normal infant cries and 2) our use of the H-value to quantify the parents' perception of the infants' physical/emotional situations from the cry sound.

The performance of our automatic infant cry analyzer is expected to improve if the various parameters, especially those in the feature vectors, are fine-tuned, and if the sequential characteristics of the cry patterns are further exploited. In the next chapter, we present another approach for designing our proposed automatic infant cry analyzer, which is based on the hidden Markov modeling technique, and which gives even better performance with our present infant cry data base.

Chapter 7
CRY ANALYZER II: A HIDDEN MARKOV MODEL (HMM)-BASED METHOD

In this chapter, we apply the hidden Markov model (HMM) technique to normal infants' cry sound analysis [128]. As in Chapter 6, the purpose here is the automatic estimation of the infant LOD from the cry signal. However, we now base our estimation on the HMM technique. As discussed in Chapter 3, the HMM technique is a powerful stochastic modeling approach.
Recently it has been successfully applied in many automatic speech recognition systems as an alternative to the traditional dynamic time warping (DTW) method. In particular, the HMM is capable of dealing with the randomness in both the temporal and spectral structure of the signal.

Infant crying is a special kind of human utterance, and its generation can be modeled in a way similar to the generation of adult speech (see Chapter 4). This leads us to believe that the hidden Markov modeling technique can be used in our proposed automatic infant cry analysis system.

The main characteristic which distinguishes our HMM-based cry analysis system from the VQ-kernel classifier-based system described in the previous chapter is that here the recognition of the different time-frequency patterns is based on the structural information of the signal rather than on its statistical information.

In the following, we first describe the structure of our HMM-based cry analyzer. Then we discuss the training and recognition procedures of the HMMs and of the system as a whole. Finally, we present experimental results and some concluding remarks.

7.1 System Design

Our HMM-based infant cry analysis system is composed of the functional blocks shown in Figure 7.1.
These are 1) data preprocessing/feature extraction, which converts the signal into a long sequence of feature vectors, 2) vector quantization, which encodes the sequence of feature vectors into a long sequence of symbols, 3) segmentation, which divides the long symbol sequence into short segments suitable for the HMMs, 4) the application of HMMs to recognize the different cry phonemes, and finally 5) the H-value calculation.

Figure 7.1 The block diagram of our HMM-based cry distress level analysis system. (Blocks, in order: cry signals; Preprocessing (Feature Extraction); Vector Quantization; Segmentation; HMMs; H-value calculation; Estimated H-values.)

7.1.1 Data Preprocessing and Feature Extraction

The data preprocessing/feature extraction is similar to that in our VQ-kernel classifier-based cry analyzer discussed in the previous chapter. As before, after the cry sound signal is lowpass filtered and digitized at 8 kHz, a pre-emphasis filter with transfer function $1 - 0.95z^{-1}$ is applied to flatten the spectrum so that the effect of the glottal waveform and lip radiation characteristics can be removed. Then, a sliding Hamming window with a width of 32 msec is applied to extract frames of 256 points each from the digitized signal. The Hamming window’s sliding step is 10 msec, so consecutive frames overlap.

Figure 7.2 Vector quantization. (Input: feature vector; output: code index.)

From this point, the procedure diverges from the feature vector construction in the VQ-kernel classifier-based cry analyzer. In particular, we now extract only a twelve-dimensional feature vector from each signal frame. The feature vector is composed of the normalized short-time energy, the zero-crossing rate, and the first 10 cepstrum coefficients. The other 12 features used in the previous cry analyzer, which represent the time difference, are not included.

Silence periods or pauses in the cry sound are detected by comparing the short-time energy to a preset energy threshold.
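As an illustration of the preprocessing steps described above, the following Python sketch applies the pre-emphasis filter, extracts overlapping Hamming-windowed frames, and flags silence frames with an energy threshold. It is a minimal sketch and not the implementation used in this work: the function names, the plain (unnormalized) mean-square energy, and the threshold value supplied by the caller are assumptions, and the cepstrum coefficients are omitted.

```python
import math

SAMPLE_RATE = 8000   # Hz, as stated in the text
FRAME_LEN = 256      # 32 msec at 8 kHz
FRAME_STEP = 80      # 10 msec at 8 kHz
PRE_EMPHASIS = 0.95  # coefficient of the filter 1 - 0.95 z^-1


def pre_emphasize(signal, coeff=PRE_EMPHASIS):
    """Apply y[n] = x[n] - coeff * x[n-1], taking x[-1] as 0."""
    out = []
    prev = 0.0
    for x in signal:
        out.append(x - coeff * prev)
        prev = x
    return out


def hamming(n):
    """Return a Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frames(signal, length=FRAME_LEN, step=FRAME_STEP):
    """Yield overlapping windowed frames; an incomplete tail is dropped."""
    window = hamming(length)
    for start in range(0, len(signal) - length + 1, step):
        yield [signal[start + i] * window[i] for i in range(length)]


def short_time_energy(frame):
    """Mean-square energy of one frame (normalization is an assumption)."""
    return sum(v * v for v in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0.0) != (b < 0.0))
    return crossings / (len(frame) - 1)


def analyze(signal, energy_threshold):
    """Return one (energy, zcr, is_silence) tuple per frame."""
    result = []
    for frame in frames(pre_emphasize(signal)):
        energy = short_time_energy(frame)
        result.append((energy, zero_crossing_rate(frame), energy < energy_threshold))
    return result
```

For a one-second 440 Hz tone followed by one second of zeros (16000 samples in total), `analyze` returns 197 frames, and the last frame is flagged as silence.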
A frame whose short-time energy is below the energy threshold is marked as a silence frame.

7.1.2 Vector Quantization of the Feature Vectors

Vector quantization is an important technique for reducing data redundancy in HMM-based voice processing systems. In particular, it allows the use of discrete HMMs, which greatly simplifies the implementation of the system [50, 95, 98].

We use a code book with 128 prototype vectors, or code-words, to quantize every feature vector. The code book was obtained by using Linde et al.’s vector quantizer design algorithm [61]. As shown in Figure 7.2, after quantization each feature vector (representing a frame) is represented by a code-word index. Thus, at the output of the vector quantizer, the cry signal is transformed into a sequence of code-word indices and silence indicators which mark the pauses in the cry signal.

7.1.3 Segmentation of the Signal into Recognition Units

In speech recognition, the nature of spoken sentences provides an opportunity for using a hierarchical decomposition approach. This approach is based on the fact that there is always some grammar governing the generation of all the allowable sentences. A sentence can always be decomposed into a sequence of words, and a word can in turn be decomposed into a sequence of phonemes. This hierarchical structure is very suitable for, and efficient in, the implementation of HMM-based systems [50, 98].

Unfortunately, this kind of clear hierarchical structure (grammatical, syntactic, phonetic) is not available in infant crying. It is not easy to use the cry phonemes as the basic recognition units on which to base our HMM system. The first reason for this difficulty is that we have no effective method to detect the start and end points of cry phonemes, making it very difficult to separate them in a long continuous cry signal.
Secondly, even if we could effectively isolate each cry phoneme in the cry signal, it would still be difficult to establish an effective pattern analysis method for recognizing the cry phoneme. This is because the duration of a cry phoneme often varies widely (by up to a factor of ten), and cry phonemes can be interrupted by unpredictable short pauses or uncharacteristic distortions in the time-frequency pattern.

To cope with the above problems, we propose dividing each cry phoneme into more fundamental recognition units. We found that, despite great variations in duration and short interruptions or distortions within a cry phoneme, the time-frequency characteristics of a cry phoneme show a “pseudo-stationary” pattern. Therefore, if we divide a cry phoneme into short “micro-segments” of fixed length, these micro-segments show great similarity to each other in terms of their time-frequency characteristics. On the other hand, micro-segments from different types of cry phonemes will be dissimilar, because different cry phonemes are defined on the basis of distinguishable time-frequency characteristics. Since all the micro-segments are of the same duration, it is easier and more reliable to analyze a micro-segment and to identify the cry phoneme to which it belongs.

Figure 7.3 The segmentation of a cry signal. (Upper panel: cry phonemes c.p. 0 to c.p. 4 along the time axis, with the phoneme boundaries and a distortion/interruption marked. Lower panel: the same signal after segmentation into micro-segments; * marks micro-segments affected by a phoneme boundary or distortion. c.p.: cry phoneme.)

This concept of segmentation is illustrated in Figure 7.3, which shows an example of a part of a cry signal containing five cry phonemes (c.p. 0, c.p. 1, c.p. 2, c.p. 3, and c.p. 4), each with a different duration. After segmentation, the signal is divided into micro-segments with fixed durations.
We choose the duration of the micro-segments to be short enough to ensure that even the shortest cry phonemes can still be divided into a series of micro-segments.

The boundaries between cry phonemes need not be detected in this scheme. As shown in Figure 7.3, a phoneme boundary only affects the micro-segment in which the boundary falls. Such an affected micro-segment will likely be misrecognized, since it contains mixed information from two different cry phonemes. Similarly, if a distortion or interruption occurs within a cry phoneme (as illustrated in c.p. 1 in Figure 7.3), the few micro-segments containing this part of the signal will be affected and will likely be misrecognized later. However, these corrupted micro-segments constitute only a tiny portion of the entire signal. The majority of the remaining micro-segments will be reliably recognized.

As shown in Figure 7.1, the above segmentation scheme is implemented after performing vector quantization, i.e., we apply our segmentation scheme to the code-word indices representing the feature vectors of the overlapping frames of the cry signal. The overlapping frames are shifted by 10 msec intervals, so there is one code-word index for every 10 msec. From our experiment, we observed that the duration of a cry phoneme is typically around 1 second (but may vary between 0.5 and 3 seconds). Thus, each cry phoneme is typically represented by 100 code-word indices. The sequence of code-word indices is segmented into micro-segments (silence periods are skipped). Each micro-segment contains ten code-word indices. Since even a 0.5-second cry phoneme yields about 50 code-word indices, this ensures that a cry phoneme is segmented into at least five micro-segments.
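The segmentation step described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the silence marker value, the function name, and the treatment of pauses (a silence frame is dropped and also discards any incomplete run, so that no micro-segment spans a pause) are all hypothetical choices, since the text only states that silence periods are skipped.

```python
SILENCE = -1            # hypothetical marker for a silence frame
MICRO_SEGMENT_LEN = 10  # ten code-word indices per micro-segment


def micro_segments(indices, length=MICRO_SEGMENT_LEN, silence=SILENCE):
    """Split a code-word index sequence into fixed-length micro-segments.

    Assumption: a silence frame is skipped and ends the current run, and
    leftover indices that do not fill a whole micro-segment are discarded.
    """
    segments = []
    run = []
    for idx in indices:
        if idx == silence:
            run = []  # drop the incomplete run at a pause
            continue
        run.append(idx)
        if len(run) == length:
            segments.append(run)
            run = []
    return segments
```

Under these assumptions, a 0.5-second cry phoneme (50 code-word indices at a 10 msec step) yields five micro-segments, which matches the lower bound stated in the text.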
The micro-segments are then used as the basic recognition units in our HMM-based recognizer.

7.1.4 Configuration of the HMM-based Analyzer and the Calculation of the H-value

To identify the cry phoneme to which each micro-segment belongs, one HMM is assigned to each of the ten cry phonemes (the definitions and basic operations of HMMs are briefly explained in Section 3.6 and in Appendix D; see also [95, 98] for a full description). A 3-state left-to-right topology, as shown in Figure 7.4, is used for every HMM. Each state in an HMM has two out-going transitions, one to the state itself and the other to the next state. At any instant of time only one state is occupied or reached. When a state is reached, a symbol is emitted. The symbol is one of the 128 code-word indices of the prototype vectors in the VQ code book. The decision as to which symbol is emitted is governed by the output probability density function of the occupied state. In our HMMs, the output probability density function of a state is represented by a table, which consists of 128 probabilities corresponding to the 128 code-word indices. Each of the two out-going transitions is also associated with a transitional probability. If the model occupies state i at time t, the transitional probability defines the probability of the model reaching state j at time t + 1.

Figure 7.4 Topology of the hidden Markov models for micro-segment identification. (The states are arranged left to right between Start and Stop.)

The values of all the above model parameters, such as the transitional probabilities and the output probabilities, are determined by training. Every HMM is trained with only the micro-segments associated with the cry phoneme the HMM represents.
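To make the roles of the transitional probabilities and the output probability tables concrete, the Python sketch below builds a 3-state left-to-right discrete HMM over 128 symbols and evaluates the probability that it emits a given symbol sequence, using the standard forward recursion. It is a hedged sketch rather than the procedure of Appendix D: the dictionary layout, the function names, the uniform parameter values, and the convention that the model starts in the first state and must leave the last state towards Stop after the final symbol are all assumptions.

```python
NUM_STATES = 3     # 3-state left-to-right topology
NUM_SYMBOLS = 128  # size of the VQ code book


def uniform_hmm(num_states=NUM_STATES, num_symbols=NUM_SYMBOLS):
    """Hypothetical model: equal stay/advance odds, uniform output tables."""
    return {
        # stay[i] is the self-loop probability of state i; 1 - stay[i] is the
        # probability of moving to the next state (or to Stop from the last).
        "stay": [0.5] * num_states,
        # emit[i][k] is the probability that state i emits code-word index k.
        "emit": [[1.0 / num_symbols] * num_symbols for _ in range(num_states)],
    }


def generation_probability(hmm, symbols):
    """Forward recursion: probability that the model emits `symbols`."""
    stay, emit = hmm["stay"], hmm["emit"]
    n = len(stay)
    if not symbols:
        return 0.0
    # The model is assumed to start in the first state.
    alpha = [0.0] * n
    alpha[0] = emit[0][symbols[0]]
    for sym in symbols[1:]:
        nxt = [0.0] * n
        for j in range(n):
            reach = alpha[j] * stay[j]                       # stayed in j
            if j > 0:
                reach += alpha[j - 1] * (1.0 - stay[j - 1])  # advanced to j
            nxt[j] = reach * emit[j][sym]
        alpha = nxt
    # Assumed convention: finish by leaving the last state towards Stop.
    return alpha[-1] * (1.0 - stay[-1])
```

With this convention a sequence shorter than three symbols has zero probability, because every state must be visited at least once. A ten-symbol micro-segment can be scored against several such models and assigned to the model that gives the largest value.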
The HMM representing trailing, for example, is trained with all the micro-segments of all the trailing cry phonemes in the training set.

Given an HMM and a micro-segment, the probability of the model generating this micro-segment is determined by using algorithms such as the forward-backward procedure in [95, 98]. This probability is known as the model generation probability (see Appendix D). As a result of the training, the parameters of each HMM are set such that the model generation probability over all the micro-segments belonging to the cry phoneme represented by this HMM has the highest mean value.

Figure 7.5 The structure of the H-HMM (the same topology is used for the E-HMM).

Since we want to estimate the H-value (and not to identify each micro-segment), we divide the ten HMMs into two groups. This is done after each HMM is individually trained, as described above. The first group is composed of the five HMMs corresponding to the five cry phonemes involved in the definition of the H-value (namely the trailing, double harmonic break, dysphonation, hyperphonation, and inhalation cry phonemes; see Section 5.4). The other group is composed of the remaining five HMMs, representing the flat, falling, rising, vibration, and weak vibration cry phonemes. Figure 7.5 shows how the first group of HMMs is assembled to form a big HMM. We call it the H-HMM.
The other big HMM, which we call the E-HMM, has the same topology as that of Figure 7.5, but is composed of the other group of HMMs, representing the flat, falling, rising, vibration, and weak vibration cry phonemes.

When a micro-segment and a big HMM are presented to the forward-backward procedure (see Appendix D and [95, 98] for a detailed description of the procedure and the corresponding algorithms), the probability of this big HMM generating the micro-segment is determined. Because of the topology of our big HMMs, it is straightforward to show that if the micro-segment belongs to one of the five cry phonemes in this big HMM, then this big HMM will more likely give a higher model generation probability than the other big HMM.

Figure 7.6 The HMMs-based classifier. (Labels recoverable from the figure: input from Segmentation, micro-segment S, P(S | H-HMM), output to H-value Calculation.)

Our cry analyzer is built around the H-HMM and the E-HMM, as shown in Figure 7.6. Each micro-segment is presented to both the H-HMM and the E-HMM simultaneously. The likelihood of a micro-segment being generated by each of the H-HMM and E-HMM is computed using the forward-backward procedure. The micro-segment is then classified as either H-type or E-type according to which model generation likelihood is larger.

The classification results from the HMM analyzer are collected to calculate the H-value of the cry signal. From its definition in Section 5.4, the H-value is estimated as

$$\text{H-value} = \frac{\text{the number of H-type micro-segments}}{\text{the total number of micro-segments in the whole signal}}$$

7.2 Training of the HMMs

The training of our HMM-based cry analyzer system takes two steps: first, the training of each of the ten individual HMMs, and then the training of the H-HMM and E-HMM.

There are a total of 2804 micro-segments in the 45 cries in our training set. Table 7.1 shows the distribution of these micro-segments over the ten different cry phoneme groups in the training data.

Table 7.1 Distribution of the micro-segments in our training set.

    Cry phoneme                    Micro-segments
     1. Trailing                      142
     2. Flat                          728
     3. Falling                       164
     4. Double harmonic break         181
     5. Dysphonation                  732
     6. Rising                        202
     7. Hyperphonation                 15
     8. Inhalation                    346
     9. Vibration                     139
    10. Weak vibration                155
    TOTAL                            2804

Uniform initialization is used for all the transitional probabilities in all ten individual HMMs, i.e., all transitions from a state are considered equally likely. The output probability density table of each of the three states in an HMM is initialized as follows: each of the micro-segments in the group corresponding to that HMM is divided into three parts (the beginning, middle, and end parts), and the frequency of occurrence of each of the 128 code-word indices in these three parts is used as the initial value for that symbol in states 1, 2, and 3 of that HMM, respectively.

After initialization, the forward-backward algorithm (see Appendix D) was used to optimize the parameters in the HMMs. For each HMM, we ran the forward-backward algorithm iteratively with the micro-segments belonging to the cry phoneme of this HMM as the training data. The training of an HMM ended when the averaged model generation probability over this group of micro-segments showed no further significant increase.

In the second step of the training, the H-HMM and E-HMM were initialized with the optimized parameters from the ten individually trained HMMs. Using the same training procedure described above and a joint training data set from the group of five cry phonemes which defines the H-value, the parameters in the H-HMM, consisting of all the parameters in the five individual HMMs, were refined.
The same procedure was applied to refine the parameters in the E-HMM, except that the joint training data set from the other group of five cry phonemes was used.

This second training step proved important to our HMM-based cry analyzer: our experiment shows that the accuracy of classifying the individual micro-segments in the testing set increased from 80.15% to 82.67% after the second training.

7.3 Experiment Results and Discussion

Using a procedure similar to the one we used to evaluate our VQ-kernel classifier-based infant cry analyzer, we tested the performance of our HMM-based infant cry analyzer with the 58 cries in the testing set. The system automatically gave an estimate of the H-value after each of the 58 cries was analyzed.

Figure 7.7 The computer estimations versus the actual measurements of the H-values of the 58 cries in the testing set. The estimation errors appear as the vertical distances from the circles to the diagonal dashed line. (Horizontal axis: Actual Measurement of the H-value, 0 to 100.)

Figure 7.7 shows the 58 estimated H-values versus their true measurements. Each small circle in the diagram represents a cry in the testing set. The vertical distance from a circle to the diagonal dashed line is the error of the estimated H-value. The mean absolute error between the HMM-based system estimation and the actual measurements is 12.9%. This demonstrates that our HMM-based method is effective in estimating the H-values from the cry sounds.

Figure 7.8 Computer estimated H-value versus parents’ LOD rating.

Figure 7.8 shows the relationship between the H-value estimated with our HMM system and the parents’ level-of-distress (LOD) rating.
All the estimated H-values, except the one whose coordinates are [69, 1.05], show a clear trend of consistency with the LOD ratings given by the parents.

Comparing this result (Figure 7.8) with that of our VQ-kernel classifier cry analyzer reported in the previous chapter (Figure 6.8), we can see that the HMM-based method gives better estimations when the cry samples under analysis have low LOD ratings. Figure 7.8 shows this in that cries with low LOD ratings (below 3) are consistently assigned low H-values (below 50) by our HMM-based cry analyzer. However, for cries with high LOD ratings (above 3), the VQ-kernel classifier-based method shows slightly better performance, as shown in Figure 6.8. Overall, the performance of our HMM-based method is better than that of our VQ-kernel classifier-based method. The mean error of the estimated H-value is 12.9% in our HMM-based system, versus 14.0% in our VQ-kernel classifier system. We believe that the better performance of our HMM-based system is due to the more effective exploitation of the sequential (structural) information of the time-frequency characteristics in the cry sound.

Chapter 8
CONCLUSIONS

8.1 Accomplishments and Contributions

Since the 19th century, scientists have believed that infant cry sounds convey information about the baby’s physical, physiological and emotional situation. Generations of research have produced a wealth of observations and findings about infant crying. However, as our literature survey in Chapter 2 shows, practical applications of these achievements remain scarce.

The recent advancements in digital signal processing theories and techniques, coupled with the rapid development of computer technology, lead us to believe that we are now in a position to investigate the development of methods and means by which infant cry analysis could be automated.
In this work, we began such development by combining what we already know about infant crying with modern signal analysis techniques.

A fundamental accomplishment of our research is the establishment of a platform, based on modern speech processing/recognition theories and techniques, for the automatic analysis of infant cry signals.

In doing so, we first identified the problems and difficulties facing the study of infant cries, particularly those hindering the use of automated analysis methods. Then, to solve or alleviate these problems, we proposed some modifications to the traditional methodology of infant cry analysis. These included the use of modern signal analysis approaches, as well as the introduction of an efficient measure to characterize the cry signals. We argued that efficiency here must be defined in terms of reliable detection and recognition of this measure by automated means.

In particular, our contributions are the following. We have:

1. Demonstrated the compatibility of infant crying with adult speech generation models

This is important since most successfully applied modern speech processing techniques are based on models of adult speech generation. In showing this compatibility, we discussed the physiological similarities of infant crying to adult speech, and surveyed and analyzed infant cry generation theories and models proposed previously by others. We also compared these models with the speech generation models commonly used in modern digital speech processing/recognition theories. To further support this compatibility, we conducted experiments to estimate the formant frequencies of infant cries using the linear prediction coding (LPC) method. LPC is a method which relies heavily on the definition of the signal generation model and which is well proven in speech processing applications.
The results of the experiments clearly showed that the LPC-based method gives good estimates of the formant frequencies of the infant cries (see Chapter 4).

Our investigation concluded that: a) the same excitation-modulation voice generation principle which applies to adult speech also applies to infant crying, b) the modulation process of infant cry generation can be formulated as a time-varying filter system, in a fashion similar to that of adult speech generation, and c) an all-pole digital model, as well as the LPC method based on such a model, is effective in analyzing and representing normal infant cries.

This investigation set up the basis for applying the techniques and methods originally developed for speech signals to the analysis and processing of infant cry sounds.

2. Discussed the problems of the automatic cry analysis and the automatic assessment of normal infants’ distress levels

We addressed one particular issue of normal infant cry research: the automatic assessment of the infants’ distress level by analyzing the infants’ cry signals. We argued that, to build a device which can assess the infants’ physical/emotional situations, a new approach is required. The conventional approach classifies cries according to some probable or presumed cry stimuli, such as pain, fussiness, and hunger. At present, this approach is not as reliable and practical as that of correlating the attributes of the cry signal with the parents’ and other caretakers’ subjective perceptions of the infants’ general situation conveyed in the cry. We also suggested the use of a single parameter, the level-of-distress (LOD), to evaluate the cry-implied physical/emotional situation of the infant.

3. Introduced the cry phonemes concept and the H-value parameter to characterize the cry sound, and to evaluate the infants’ LOD conveyed in the cry

To effectively and efficiently characterize and represent the varieties of infant cry signals, we introduced a set of “cry phonemes”. This set is based on the infant cry generation models and the typical time-frequency patterns which we have identified in normal infant cries. Our set of cry phonemes was chosen to meet two important criteria: a) any normal infant cry signal can be divided into, and thus represented by, a sequence of cry phonemes belonging to this set; and b) the cry phonemes in the set can be reliably recognized by means of automatic signal processing and recognition techniques.

With our definition of the cry phoneme set, we conducted experiments to analyze the relationship between the individual cry phonemes and the LOD perceptions of parents who listened to recordings of the cries. Based on the results of our experiments using recorded cries from 36 newborns, we defined an indicator, the H-value, as a quantitative measure of the infants’ LOD conveyed in the cry. From the experiments we found that the H-value measured from a cry signal always shows clear consistency with the parents’ LOD rating of that cry.

4. Developed and tested two automatic infant cry analyzers: the VQ-kernel classifier-based cry analyzer and the HMM-based cry analyzer

Based on our cry phoneme set and the H-value, we developed two automatic infant cry analyzers which are capable of making humanlike assessments of normal infants’ LOD.

The principle behind both designs is to estimate the H-value of the cry by detecting and calculating all occurrences and durations of each of the ten different cry phonemes in the cry signal.

In the first system, we used a nonparametric classification method to classify cry phonemes of the cry signal after dividing the signal into frames.
Because of the huge amount of data involved, we developed a set of highly efficient vector quantization-based nonparametric classification algorithms. These new algorithms, while retaining a high classification accuracy, can greatly reduce the storage requirement and drastically increase the classification speed. The new algorithms are general in nature, and are therefore suitable for any large-scale problem which may cause difficulties for traditional nonparametric methods.

Our second automatic infant cry analyzer uses the hidden Markov modeling (HMM) technique. This technique has become very popular recently and has been shown to possess many advantages when applied to automatic speech recognition. Compared with our nonparametric classifier-based analyzer, the HMM-based approach puts more emphasis on the sequential structure of the time-frequency patterns in the cry signal. In the HMM-based cry analyzer we introduced an approach based on “micro-segments” of the signal representation. This makes it possible to associate the cry signal with a hierarchical structure which allows the efficient implementation of the HMM method.

The performance of our two automatic infant cry analyzers was tested and evaluated with a set of 58 cry samples from normal infants. The test results show that both analyzers perform well. In terms of the estimation accuracy of the H-value, the HMM-based analyzer achieves an accuracy of 87.1%, while the nonparametric classifier-based analyzer gives an accuracy of 86.0%. When the automatic estimations of the H-value given by these analyzers are related to the LOD ratings given by the parents, clear consistency between them can be observed.

In conclusion, our research clearly shows that it is feasible to use efficient modern digital signal analysis methods to detect subtle characteristics and features in the infant cry signals.
These features and distinctions are otherwise undetectable with the traditional methods used in the past. Our research also establishes a relationship between the objective measurements of information conveyed in the cry signals (determined automatically) and the subjective human perceptions of that information.

The above achievements and contributions show that applying modern signal processing techniques to the study of infant cries promises success. We have only opened the door to a very exciting area of research, where there is much room for improvement of methods and there are many important applications. Some of these are further discussed in the following section.

8.2 Some Topics for Future Study on Automatic Infant Cry Analysis

8.2.1 Refining the Techniques and Methods We Have Developed

We have shown, both theoretically and experimentally, the feasibility of automating infant cry analysis by using modern speech processing/recognition techniques, and developed two systems to demonstrate this concept. However, time and scope limitations have prevented us from further refining many technical aspects of the design and the implementation of the infant cry analysis systems reported in this dissertation.

Opportunities for technical refinements and further investigations exist in the following areas:

Feature selection and extraction:

In our infant cry systems, there are still many promising feature candidates that may be studied. Examples of such features are the fundamental frequency, the formant frequencies, and the transitional cepstrum coefficients.
Note that while some very coarsely defined transitional cepstrum coefficients (equation 6.10) are used in our VQ-kernel classifier-based infant cry analyzer, none is used in our HMM-based cry analyzer. It is worthwhile investigating whether or not these features could provide improved performance over those used in the systems reported in this thesis.

More sophisticated model structures and topologies for the HMMs-based system:

In our HMM-based infant cry analysis system, for simplicity we use the basic structure and basic model operation of the HMMs. More sophisticated model structures, however, are worth investigating in the future to obtain better system performance and efficiency. Among the many other models, the continuous density HMMs and the HMMs with explicit state duration density [98] deserve special attention. In the continuous density HMMs, the observations are not symbols chosen from a finite alphabet. Instead, they are real vectors with continuous values. This eliminates the possibility of performance degradation caused by the quantization process in discrete HMM-based systems. The HMMs with explicit state duration density attempt to overcome the major weakness of conventional HMMs, which is the modeling of the state duration. Each of these two alternative versions of the conventional HMM has been successfully used in many speech recognition applications [102, 105].

Further refinement may also be possible in the selection of HMM topologies used in our infant cry analysis system, especially the one we chose for the ten individual HMMs.

Improved system training and performance evaluation:

A larger infant cry sample database is always desirable so as to train and test our infant cry analysis systems better. As with any other classification/recognition system, the performance of our systems depends greatly upon the training process.
The larger the training sample database, the better trained our cry systems will be.

Another key factor in the success of our cry analyzer design is the performance evaluation. Since we judge the system performance solely against the parent rating database, the quality of this rating database becomes crucial. The performance evaluation of our infant cry analysis systems will be improved by using a larger testing sample set and an improved parent rating database formed from a larger number of parents.

New techniques for the representation and processing of signals:

Many new theories and technologies have recently been applied to research in speech/voice processing, image processing, pattern recognition, computer vision, and other signal processing fields. Especially noteworthy are neural networks [62, 94, 80], fuzzy logic, the wavelet transformation [48, 58, 66], and the instantaneous frequency (IF) representation theories [6, 7]. Most of these techniques have been shown to possess some advantages in one aspect or another when applied to voice and speech processing related problems. The feasibility and merits of incorporating these new technologies into infant cry analysis are certainly worth pursuing.

8.2.2 Extending Our New Methodology to Other Infant Cry Research Topics

Although our automatic infant cry analysis systems are designed and trained with the cry data from a specific age group of normal infants, it is important to note that the methodology we have developed is not restricted to this case. Our methodology can be generalized and applied to other age groups of infants and to problems other than the automatic assessment of infants’ LOD.
In particular, our methodology applies to the problem of automated diagnosis of specific diseases and/or abnormalities in sick infants.

8.2.2.1 Developing analysis/recognition/diagnosis systems for abnormal infants’ cry signals
With the sound spectrograph and, recently, with some computer-aided analysis methods, many scientists have carried out investigations in the area of abnormal infant cry analysis. Cries from abnormal infants with various diseases, especially those which affect the baby’s central nervous system, have been studied (see the survey in Chapter 2).
These investigations have generated a wealth of findings about abnormal infant crying. On many occasions, differences between various attributes of the cries of sick infants and those of clinically normal infants are reported [34, 63, 69-78, 103, 109, 116]. Thus, it will be worthwhile to extend our signal processing-based infant cry analysis technique to the analysis of abnormal infant cries.
From a technical viewpoint, the problem of abnormal infant cry analysis is better defined than that of normal infant cry analysis. The physical situation of the infant (either the disease, or the abnormality, or both) can usually be determined/diagnosed by using conventional (though more costly, complex, or inconvenient) medical methods. In addition, in abnormal infant cry research it is the “pain cry” that is of interest. By studying the “pain cry” only, it is possible to standardize and unify the situations under which the cry signals are acquired and the stimuli inducing the cries. It has also been pointed out that pain is the maximal stimulus to the nervous system of the infant, and hence may significantly overshadow the influence of other factors in the process of cry generation.
Because of these factors, we expect the methodology we developed in our two automatic infant LOD assessment systems to be effective, especially the approach of identifying typical time-frequency patterns and defining the corresponding cry-phonemes. Different diseases or abnormalities may need different definitions of cry-phoneme sets to best characterize the particular problem.
In these types of studies, cooperation with pediatricians and medical staff will be essential. This is especially true for rare and difficult-to-diagnose diseases.

8.2.3 Developing New Products Based on Our Computerized Infant Cry Analysis Techniques
We believe it is now feasible to develop automatic cry analyzing devices based on the techniques we have developed, by using modern advanced VLSI and computer hardware techniques. In the following, we list a few possible products:

1. Intelligent baby alarm systems with the ability to assess the baby’s LOD
Our automatic infant LOD assessment techniques could be incorporated into ordinary baby alarm systems. In ordinary baby alarm systems, the device is usually triggered when the sound level in the baby’s room exceeds a preset threshold. This simple strategy is not only subject to false alarms, but is also unable to provide the parents with more detailed information about the infant’s physical/emotional situation. Incorporating our automatic infant LOD assessment techniques into ordinary baby alarm systems will give the alarm device some intelligence; it will be able to distinguish infant cries from various background noise sources so as to avoid false alarms, and to estimate and report the infant’s LOD situation to the parents. This kind of device will be particularly helpful to deaf and hard of hearing parents.

2. Monitoring systems used in hospital wards which can detect the cries, and automatically analyze and record the LOD situation of each infant
This kind of equipment can make the supervision of infants more efficient by supplying the caretaking environment with accurate information about each infant’s situation, and automatically registering that information for later analysis.

3. Computer cry analysis systems to assist physicians in the diagnosis of infant diseases
This has been one of the main objectives of research into abnormal infant cries. Infant disease diagnosis by cry analysis will facilitate diagnostic procedures, and make them more convenient, more economical, and faster. Such a system may contain different sets of parameters and databases for different kinds of diseases, and for different age groups of infants. This will increase the accuracy of analysis and, at the same time, enable the system to diagnose different diseases and handle infants of different ages.
The development of such systems obviously requires further research on abnormal infant cries.

Appendix A
Linear Predictive Coding (LPC) Algorithms

Linear predictive analysis is one of the most powerful speech analysis techniques developed in recent years. The importance of this method lies both in its ability to provide accurate estimates of the speech parameters, and in its computational efficiency.
We have already summarized the basic principles of linear predictive analysis in Section 3.2. Here we will outline some commonly used algorithms in LPC analysis.
As indicated in Section 3.2, to get an estimate of the vocal transfer function

$$H(z) = \frac{G}{1 - \sum_{k=1}^{p} \alpha_k z^{-k}}, \tag{A.1}$$

we need to estimate the coefficients $\alpha_k$ by minimizing the average squared prediction error

$$E_n = \sum_m e_n^2(m) = \sum_m \left( s_n(m) - \hat{s}_n(m) \right)^2 \tag{A.2}$$

where $s_n(m)$ is a segment of the speech waveform that has been selected in the vicinity of sample $n$, i.e., $s_n(m) = s(m + n)$.
Substitute

$$\hat{s}_n(m) = \sum_{k=1}^{p} \alpha_k s_n(m - k) \tag{A.3}$$

into equation (A.2) and set $\partial E_n / \partial \alpha_i = 0$, $i = 1, 2, \ldots, p$, so as to find the values of $\alpha_k$ that minimize $E_n$. We obtain

$$\sum_m s_n(m - i)\, s_n(m) = \sum_{k=1}^{p} \alpha_k \sum_m s_n(m - i)\, s_n(m - k), \qquad 1 \le i \le p \tag{A.4}$$

If we define

$$\phi_n(i, k) = \sum_m s_n(m - i)\, s_n(m - k) \tag{A.5}$$

then equation (A.4) can be written more compactly as

$$\sum_{k=1}^{p} \alpha_k \phi_n(i, k) = \phi_n(i, 0), \qquad i = 1, 2, \ldots, p \tag{A.6}$$

This set of $p$ equations in $p$ unknowns (the predictor coefficients $\{\alpha_k\}$) can be solved in an efficient manner.
One approach to solving this equation set, called the autocorrelation method, assumes that $s_n(m) = s(m + n)\, w(m)$, where $w(m)$ is a finite length window (e.g., a Hamming window) that is identically zero outside the interval $0 \le m \le N - 1$. This approach also defines the average squared prediction error as

$$E_n = \sum_{m=-\infty}^{+\infty} e_n^2(m). \tag{A.7}$$

Since $s_n(m)$ is nonzero only for $0 \le m \le N - 1$, from equations (A.2) and (A.3) we can see that $e_n(m)$, for a $p$th order predictor, will be nonzero only for $0 \le m \le N - 1 + p$. Thus,

$$E_n = \sum_{m=-\infty}^{+\infty} e_n^2(m) = \sum_{m=0}^{N-1+p} e_n^2(m). \tag{A.8}$$

With similar analysis, we can show that

$$\phi_n(i, k) = \sum_{m=0}^{N-1-(i-k)} s_n(m)\, s_n(m + i - k) \tag{A.9}$$

We define the autocorrelation function of $s_n(m)$ as

$$R_n(k) = \sum_{m=0}^{N-1-k} s_n(m)\, s_n(m + k) \tag{A.10}$$

Thus, the right-hand side of equation (A.9) can be seen as the autocorrelation of $s_n(m)$ for lag $(i - k)$. That is,

$$\phi_n(i, k) = R_n(i - k). \tag{A.11}$$

Since $R_n(k)$ is an even function, equation (A.6) can be expressed as

$$\sum_{k=1}^{p} \alpha_k R_n(|i - k|) = R_n(i), \qquad 1 \le i \le p \tag{A.12}$$

or in matrix form as

$$\begin{bmatrix} R_n(0) & R_n(1) & R_n(2) & \cdots & R_n(p-1) \\ R_n(1) & R_n(0) & R_n(1) & \cdots & R_n(p-2) \\ R_n(2) & R_n(1) & R_n(0) & \cdots & R_n(p-3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ R_n(p-1) & R_n(p-2) & R_n(p-3) & \cdots & R_n(0) \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \vdots \\ \alpha_p \end{bmatrix} = \begin{bmatrix} R_n(1) \\ R_n(2) \\ R_n(3) \\ \vdots \\ R_n(p) \end{bmatrix} \tag{A.13}$$

The $p \times p$ matrix of autocorrelation values is symmetric and is also a Toeplitz matrix.
In the following, we give a recursive algorithm for solving equation (A.13) by exploiting its Toeplitz nature. This procedure, called Levinson-Durbin’s recursive procedure, is one of the most efficient methods known for solving this particular system of equations [14, 53].
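The autocorrelation method and the recursive solution of equation (A.13) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the function names, the synthetic test frame, and the predictor order are assumptions made for the example, and the recursion is the textbook Levinson-Durbin update (reflection coefficient, coefficient update, error update).

```python
import math


def autocorrelation(frame, max_lag):
    """Short-time autocorrelation R(k) = sum_{m=0}^{N-1-k} s(m) s(m+k), k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[m] * frame[m + k] for m in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve sum_k alpha_k R(|i-k|) = R(i), i = 1..order, by Levinson-Durbin recursion.

    Returns (alpha, err): alpha[j-1] is the predictor coefficient alpha_j,
    err is the final prediction error energy.
    """
    alpha = [0.0] * (order + 1)   # alpha[j] holds alpha_j; index 0 is unused
    err = r[0]                    # zeroth-order error is R(0)
    for i in range(1, order + 1):
        if err <= 0.0:
            raise ValueError("prediction error vanished; frame is degenerate")
        # Reflection coefficient k_i from the order-(i-1) solution.
        k = (r[i] - sum(alpha[j] * r[i - j] for j in range(1, i))) / err
        prev = alpha[:]
        alpha[i] = k
        for j in range(1, i):
            alpha[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k        # error energy of the order-i predictor
    return alpha[1:], err


# Hypothetical test frame: two sinusoids plus a small deterministic ripple,
# shaped by a Hamming window of length N (all values are made up).
N, P = 64, 4
frame = [
    (0.54 - 0.46 * math.cos(2.0 * math.pi * m / (N - 1)))
    * (math.sin(0.3 * m) + 0.5 * math.sin(1.1 * m) + 0.1 * ((7 * m) % 5 - 2))
    for m in range(N)
]
R = autocorrelation(frame, P)
alpha, err = levinson_durbin(R, P)

# Residual of each row of the normal equations; should be close to zero.
residuals = [
    sum(alpha[k - 1] * R[abs(i - k)] for k in range(1, P + 1)) - R[i]
    for i in range(1, P + 1)
]
```

Substituting the result back into the normal equations, as `residuals` does, is a convenient sanity check: the recursion needs on the order of $p^2$ operations, whereas a general linear solver would need on the order of $p^3$.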
The procedure, which solves the system $\sum_{k=1}^{p} \alpha_k R_n(|i - k|) = R_n(i)$, $1 \le i \le p$, can be stated as follows:

$$E_n^{(0)} = R_n(0) \tag{A.14}$$

$$k_i = \left( R_n(i) - \sum_{j=1}^{i-1} \alpha_j^{(i-1)} R_n(i - j) \right) \Big/ E_n^{(i-1)}, \qquad 1 \le i \le p \tag{A.15}$$

$$\alpha_i^{(i)} = k_i \tag{A.16}$$

$$\alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\, \alpha_{i-j}^{(i-1)}, \qquad 1 \le j \le i - 1 \tag{A.17}$$

$$E_n^{(i)} = \left( 1 - k_i^2 \right) E_n^{(i-1)} \tag{A.18}$$

Equations (A.14)-(A.18) are solved recursively for $i = 1, 2, \ldots, p$ and the final solution is given as

$$\alpha_j = \alpha_j^{(p)}, \qquad 1 \le j \le p \tag{A.19}$$

Therefore, at time $n$, we obtain estimates of the digital filter coefficients $\{\alpha_j\}$ of the vocal transfer function stated in equation (A.1). These estimates are optimal in the sense of minimizing the average squared error of linear prediction over the speech segment $\{s(m + n),\ 0 \le m \le N - 1\}$. More detailed discussions of LPC algorithms and applications can be found in [91, 101].

Appendix B
Linde-Buzo-Gray’s Algorithm of Vector Quantizer Design

Linde-Buzo-Gray’s (LBG) method actually contains two different algorithms: one for problems with a known distribution function, and the other for problems in which the distribution properties of the input random vectors are unknown. Since we are only interested in the latter case, the algorithm for known distribution is omitted in this appendix.
The LBG algorithm (for unknown distribution) is stated as follows:
(0) Initialization: Given $N$, the number of levels, the distortion threshold $\epsilon \ge 0$, an initial $N$-level reproduction alphabet $A_0$, and a training sequence $\{x_j;\ j = 0, \ldots, n - 1\}$. Set $m = 0$ and $D_{-1} = \infty$.
(1) Given $A_m = \{y_i;\ i = 1, \ldots, N\}$, find the minimum distortion partition $P(A_m) = \{S_i;\ i = 1, \ldots, N\}$ of the training sequence: $x_j \in S_i$ if $d(x_j, y_i) \le d(x_j, y_l)$ for all $l$, where $d(x, y)$ is the distortion measure (or distance) between $x$ and $y$. Compute the average distortion

$$D_m = D(\{A_m, P(A_m)\}) = \frac{1}{n} \sum_{j=0}^{n-1} \min_{y \in A_m} d(x_j, y). \tag{B.1}$$

(2) If $(D_{m-1} - D_m)/D_m \le \epsilon$, halt with $A_m$ as the final reproduction alphabet. Otherwise continue.
(3) Find the optimal reproduction alphabet $\hat{x}(P(A_m)) = \{\hat{x}(S_i);\ i = 1, \ldots, N\}$ for $P(A_m)$.
Here, $\hat{x}(S_i)$ can simply be the centroid of $S_i$ (under the distance definition $d(x, y)$). Set $A_{m+1} = \hat{x}(P(A_m))$. Replace $m$ by $m + 1$ and go to (1).

For the choice of the initial alphabet $A_0$, Linde et al. suggest a “splitting” approach, as stated below.
(0) Initialization: Set $M = 1$ and define $A_0(1) = \hat{x}(A)$, the centroid of the training sequence.
(1) Given the reproduction alphabet $A_0(M)$ containing $M$ vectors $\{y_i;\ i = 1, \ldots, M\}$, “split” each vector $y_i$ into two close vectors $y_i + \epsilon$ and $y_i - \epsilon$, where $\epsilon$ is a fixed perturbation vector. The collection $A$ of $\{y_i + \epsilon,\ y_i - \epsilon;\ i = 1, \ldots, M\}$ has $2M$ vectors. Replace $M$ by $2M$.
(2) Is $M = N$? If so, set $A_0 = A(M)$ and halt. $A_0$ is then the initial reproduction alphabet for the $N$-level quantization algorithm. If not, run the algorithm for an $M$-level quantizer on $A(M)$ to produce a good reproduction alphabet $A_0(M)$, and then return to step (1).
For a detailed discussion of the above algorithms, see [61].

Appendix C
Bayes’ Theorem for Statistical Classifier Design

Consider an $N$-dimensional classification problem with a feature vector $X = [x_1, x_2, \ldots, x_N]$. For each class $\omega_j$, $j = 1, \ldots, m$, assume that the conditional probability density function of $X$, $p(X|\omega_j)$, and the probability of occurrence of $\omega_j$, $P(\omega_j)$, are known. On the basis of the a priori information $p(X|\omega_j)$ and $P(\omega_j)$, $j = 1, \ldots, m$, the fundamental function of a classifier is to perform the classification task by minimizing the risk resulting from its decisions (in the simplest case, the probability of misrecognition).
Let $L_X(\omega_i, d_j)$ be the loss incurred by the classifier if the decision $d_j$ is made when the input vector, $X$, is actually from $\omega_i$. The conditional loss (also called conditional risk) is

$$r(\omega_i, d) = \int_{\Omega_X} L_X(\omega_i, d)\, p(X|\omega_i)\, dX \tag{C.1}$$

where $\Omega_X$ is the $N$-dimensional sample space. For a given set of a priori probabilities $P = \{P(\omega_1), P(\omega_2), \ldots, P(\omega_m)\}$, the average loss (or average risk) is

$$R(P, d) = \sum_{i=1}^{m} P(\omega_i)\, r(\omega_i, d) \tag{C.2}$$

Substitute equation (C.1) into equation (C.2) and let

$$r_X(P, d) = \frac{\sum_{i=1}^{m} L_X(\omega_i, d)\, P(\omega_i)\, p(X|\omega_i)}{p(X)} \tag{C.3}$$

then equation (C.2) becomes

$$R(P, d) = \int_{\Omega_X} p(X)\, r_X(P, d)\, dX \tag{C.4}$$

with $r_X(P, d)$ being defined as the a posteriori conditional average loss of the decision $d$ for given feature measurements $X$.
The problem is to choose a proper decision $d_j$, $j = 1, \ldots, m$, to minimize the average loss $R(P, d)$. The optimal rule which minimizes the average loss is called the Bayes’ rule. From equation (C.4), it is clear that $R(P, d)$ can be minimized if for each particular feature measurement $X$ we choose a decision $d^*$ so as to minimize the a posteriori conditional average loss $r_X(P, d)$. That is, for a given $X$, we choose $d^*$ so that

$$r_X(P, d^*) \le r_X(P, d_j), \qquad j = 1, 2, \ldots, m \tag{C.5}$$

or,

$$\sum_{i=1}^{m} L_X(\omega_i, d^*)\, P(\omega_i)\, p(X|\omega_i) \le \sum_{i=1}^{m} L_X(\omega_i, d_j)\, P(\omega_i)\, p(X|\omega_i), \qquad j = 1, 2, \ldots, m \tag{C.6}$$

A simple yet very useful loss function is the (0,1) loss function, i.e.,

$$L_X(\omega_i, d_j) = 1 - \delta_{ij} \tag{C.7}$$

where $\delta_{ij}$ is the Kronecker delta. In such cases, the average loss is essentially the probability of misrecognition. The Bayes’ decision rule then becomes

$$d^* = d_i, \ \text{i.e.,}\ X \sim \omega_i, \quad \text{if} \quad P(\omega_i)\, p(X|\omega_i) \ge P(\omega_j)\, p(X|\omega_j) \ \text{for all}\ j = 1, 2, \ldots, m \tag{C.8}$$

This is the decision rule we used in our classifier designs in Chapter 6.

Appendix D
The HMM Computation and Training Algorithms

First we define a set of more formal notations. An HMM can be defined as $\lambda = (A, B, \pi)$. $A$ is the state transition probability distribution matrix, defined as

$$A = \{a_{ij}\}, \qquad 1 \le i, j \le N \tag{D.1}$$

where $N$ is the number of states in the HMM and $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$, i.e., the probability of entering state $S_j$ at time $t + 1$, given the HMM is in state $S_i$ at time $t$. The observation symbol probability distribution matrix, $B$, is defined as

$$B = \{b_j(k)\}, \qquad 1 \le j \le N, \quad 1 \le k \le M \tag{D.2}$$

where $b_j(k) = P(v_k \text{ at } t \mid q_t = S_j)$, i.e., the probability of emitting the $k$th observation symbol, $v_k$, in the finite alphabet of $M$ distinct symbols, given the HMM is in state $S_j$ at time $t$.
The initial state distribution is $\pi = \{\pi_i\}$, where $\pi_i = P(q_1 = S_i)$, $1 \le i \le N$.
Before presenting the solutions, we list below once again the three general problems in HMM operations and training:
Problem 1: Given the observation sequence $O = O_1 O_2 \cdots O_T$ and a model $\lambda = (A, B, \pi)$, how to efficiently compute $P(O|\lambda)$, the probability of generating the observation sequence from the model?
Problem 2: Given the observation sequence $O = O_1 O_2 \cdots O_T$ and a model $\lambda = (A, B, \pi)$, how to determine a state sequence $Q = q_1 q_2 \cdots q_T$ which is optimal in some meaningful sense?
Problem 3: How to adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O|\lambda)$?

Solution to Problem 1
The method is called the forward-backward procedure. For solving Problem 1, however, only one part of the algorithm, either the forward part or the backward part, is needed.
Define the forward variable $\alpha_t(i)$ as

$$\alpha_t(i) = P(O_1 O_2 \cdots O_t,\ q_t = S_i \mid \lambda) \tag{D.3}$$

i.e., the probability of the partial observation sequence, $O_1 O_2 \cdots O_t$, (until time $t$) and state $S_i$ at time $t$, given the model $\lambda$. We can solve for $\alpha_t(i)$ inductively, as follows:
1. Initialization:

$$\alpha_1(i) = \pi_i\, b_i(O_1), \qquad 1 \le i \le N. \tag{D.4}$$

2. Induction:

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(O_{t+1}), \qquad 1 \le t \le T - 1, \quad 1 \le j \le N \tag{D.5}$$

3. Termination:

$$P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i) \tag{D.6}$$

The computation requirement of the above procedure is on the order of $N^2 T$ calculations.

Solution to Problem 2
Different ways of defining the “optimal” state sequence associated with the given observation sequence lead to different solutions to this problem.
The most widely used optimization criterion is to find the single best state sequence (path), i.e., to maximize $P(Q \mid O, \lambda)$. A formal technique for finding this best path exists, based on dynamic programming methods, and is called the Viterbi algorithm.
First we need to define the quantity

$$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1 q_2 \cdots q_t = i,\ O_1 O_2 \cdots O_t \mid \lambda) \tag{D.7}$$

i.e., $\delta_t(i)$ is the best score (highest probability) along a single path, at time $t$, which accounts for the first $t$ observations and ends in state $S_i$.
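As a brief aside before continuing with the Viterbi derivation, the forward procedure of equations (D.4)-(D.6) can be sketched in Python. This is an illustrative sketch under stated assumptions, not the implementation used in our system: the function names and the two-state toy model are made up for the example, states and symbols are represented by zero-based indices, and no scaling is applied, so the running products would underflow for long observation sequences.

```python
import itertools


def forward_probability(pi, A, B, obs):
    """Return P(O | lambda) for a discrete HMM using the forward procedure.

    pi[i]   : initial probability of state i
    A[i][j] : probability of moving from state i to state j
    B[j][k] : probability of emitting symbol k while in state j
    obs     : list of symbol indices O_1 .. O_T
    """
    n_states = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][symbol]
            for j in range(n_states)
        ]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)


def brute_force_probability(pi, A, B, obs):
    """Sum P(O, Q | lambda) over every state sequence Q (exponential in T; for checking only)."""
    n_states = len(pi)
    total = 0.0
    for path in itertools.product(range(n_states), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total


# Made-up two-state, two-symbol model, for illustration only.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 1, 0]

p_forward = forward_probability(pi, A, B, obs)
p_brute = brute_force_probability(pi, A, B, obs)
```

The two functions should agree on any short sequence; the difference is cost. The brute-force sum visits all $N^T$ state sequences, while the forward procedure needs on the order of $N^2 T$ operations, as noted above.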
By induction we have

$$\delta_{t+1}(j) = \left[ \max_i \delta_t(i)\, a_{ij} \right] b_j(O_{t+1}) \tag{D.8}$$

To actually retrieve the state sequence, we need to keep track of the argument which maximized equation (D.8), for each $t$ and $j$. We do this via the array $\psi_t(j)$. The complete procedure can be stated as follows:
1. Initialization:

$$\delta_1(i) = \pi_i\, b_i(O_1), \qquad 1 \le i \le N \tag{D.9}$$

2. Recursion:

$$\begin{aligned} \delta_t(j) &= \max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(O_t), & 2 \le t \le T, \quad 1 \le j \le N \\ \psi_t(j) &= \arg\max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right], & 2 \le t \le T, \quad 1 \le j \le N \end{aligned} \tag{D.10}$$

3. Termination:

$$\begin{aligned} P^* &= \max_{1 \le i \le N} \left[ \delta_T(i) \right] \\ q_T^* &= \arg\max_{1 \le i \le N} \left[ \delta_T(i) \right] \end{aligned} \tag{D.11}$$

4. Path (state sequence) backtracking:

$$q_t^* = \psi_{t+1}(q_{t+1}^*), \qquad t = T - 1, T - 2, \ldots, 1. \tag{D.12}$$

At the end of the procedure, the best path (state sequence) is found as $Q^* = \{q_1^*, q_2^*, \ldots, q_T^*\}$, and $P^*$ is the probability of the model $\lambda$ taking the best state sequence $Q^*$ and emitting the observation sequence $O$.

Solution to Problem 3
In a manner similar to equation (D.3), we first define a backward variable $\beta_t(i)$ as

$$\beta_t(i) = P(O_{t+1} O_{t+2} \cdots O_T \mid q_t = S_i,\ \lambda) \tag{D.13}$$

i.e., the probability of the partial observation sequence from $t + 1$ to the end, given state $S_i$ at time $t$ and the model $\lambda$. We also need to define $\xi_t(i, j)$, the probability of being in state $S_i$ at time $t$, and state $S_j$ at time $t + 1$, given the model and the observation sequence, i.e.,

$$\xi_t(i, j) = P(q_t = S_i,\ q_{t+1} = S_j \mid O, \lambda) \tag{D.14}$$

From the definitions of the forward and backward variables, we can write $\xi_t(i, j)$ in the form

$$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O|\lambda)} = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)} \tag{D.15}$$

where the numerator term is, in fact, $P(q_t = S_i,\ q_{t+1} = S_j,\ O \mid \lambda)$ and the division by $P(O|\lambda)$ gives the desired probability measure. Another definition necessary for solving Problem 3 is the probability of being in state $S_i$ at time $t$, given the observation sequence and the model, $\gamma_t(i)$. $\gamma_t(i)$ can be related to $\xi_t(i, j)$ as

$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j) \tag{D.16}$$

The following interpretations are necessary to the solution we are going to present later:

$$\begin{aligned} \sum_{t=1}^{T-1} \gamma_t(i) &= \text{expected number of transitions from } S_i \\ \sum_{t=1}^{T-1} \xi_t(i, j) &= \text{expected number of transitions from } S_i \text{ to } S_j. \end{aligned} \tag{D.17}$$

Using equations (D.15)-(D.17), we can have a set of reasonable re-estimation formulas for $\pi$, $A$, and $B$ as follows:

$$\begin{aligned} \bar{\pi}_i &= \text{expected frequency (number of times) in state } S_i \text{ at time } (t = 1) = \gamma_1(i) \\ \bar{a}_{ij} &= \frac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \\ \bar{b}_j(k) &= \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j} = \frac{\sum_{t=1,\ O_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \end{aligned} \tag{D.18}$$

Equation (D.18) gives us a re-estimated model $\bar{\lambda} = (\bar{A}, \bar{B}, \bar{\pi})$. It has been proven by Baum et al. [3, 4] that either 1) the initial model $\lambda$ defines a critical point of the likelihood function, in which case $\bar{\lambda} = \lambda$; or 2) model $\bar{\lambda}$ is more likely than model $\lambda$ in the sense that $P(O|\bar{\lambda}) > P(O|\lambda)$, i.e., we have found a new model $\bar{\lambda}$ from which the observation sequence is more likely to have been produced.

Appendix E
Instruction to the Parents in the Infant LOD Rating Experiment

Dear participants:
We are working in a project which is aimed to use computer technology to automatically evaluate the degree of distress of an infant by analyzing his/her cry sound.
In the tape, we recorded dozens of cries uttered by different babies. We would like you to listen to each of them and grade your feeling about the baby’s distress in 5 levels. Level 1 can be used to indicate that you feel the baby is crying with little distress, while level 5 can be used to indicate that the segment of cry sounds to you as if the infant is trying to express the greatest distress. Level 2 to level 4 can be used if you feel the baby’s distress situation is between the above two extremes.
Please also note:
1. You can, and are highly encouraged, to listen to the cry recording repeatedly if you find it is necessary for you to make your assessments.
2.
Please make your assessment based on ONLY the segment of cry you just listened, but NOT try to predict the tendency of the baby’s distress situation before or after the recorded cry segment.
Thank you very much for your participation.
Qiaobing Xie
Graduate Student, EE Department, UBC

Bibliography
[1] H. Akaike. Autoregressive model fitting for control. Ann. Inst. Statist. Math., 23:163-80, 1971.
[2] B. S. Atal. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J. Acoust. Soc. Am., 55:1304-12, 1974.
[3] L. E. Baum and J. A. Egon. An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology. Bull. Amer. Meteorol. Soc., 73:360-3, 1967.
[4] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat., 37:1554-63, 1966.
[5] H. L. Bee. The Developing Child. Harper and Row, New York, 2nd edition, 1978.
[6] B. Boashash. Estimating and interpreting the instantaneous frequency of a signal - part 1: Fundamentals. Proceedings of IEEE, 80(4):520-38, April 1992.
[7] B. Boashash. Estimating and interpreting the instantaneous frequency of a signal - part 2: Algorithms and applications. Proceedings of IEEE, 80(4):540-68, April 1992.
[8] J. F. Bosma, H. M. Truby, and J. Lind. Cry motions of the newborn infant. Acta Paediatr. Scand. Suppl., 163:62-92, 1965.
[9] M. Brennan and J. Kirkland. Discrimination of infants’ cry-signals. Perceptual and Motor Skills, 48:683-6, 1979.
[10] M. Brennan and J. Kirkland. Perceptual dimensions of infant cry signals: A semantic differential analysis. Perceptual and Motor Skills, 57:575-81, 1983.
[11] K. D. Craig, R. V. E. Grunau, and J. Aquan-Assee. Judgment of pain in newborns: Facial activity and cry as determinants. Canad. J. Behav. Sci./Rev. Canad. Sci. Comp., 20(4):442-51, 1988.
[12] H. P. Crowe and P. S. Zeskind. Psychophysiological and perceptual responses to infant cries varying in pitch: Comparison of adults with low and high scores on the child abuse potential inventory. Child Abuse & Neglect, 16:19-29, 1992.
[13] P. A. Devijver and J. Kittler. On the edited nearest neighbor rule. Proc. 5th Int. Conf. Pattern Recognition, 1980.
[14] J. Durbin. Efficient estimation of parameters in Moving-Average models. Biometrika, 46:306-16, 1959.
[15] G. Fairbanks. An acoustical study of the pitch of infant hunger wails. Child Development, 13:227, 1942.
[16] V. R. Fisichelli and S. Karelitz. The cry latencies of normal infants and those with brain damage. J. Pediatrics, 62:742, 1963.
[17] J. L. Flanagan. Speech Analysis, Synthesis and Perception. Academic, New York, 1965.
[18] J. L. Flanagan, C. H. Coker, L. R. Rabiner, R. W. Schafer, and N. Umeda. Synthetic voices for computers. IEEE Spectrum, 7(10):22-45, October 1970.
[19] A. Frodi. When empathy fails: Aversive infant crying and child abuse. In B. M. Lester and C. F. Z. Boukydis, editors, Infant Crying: Theoretical and research perspectives. Plenum Press, New York and London, 1985.
[20] A. Frodi and M. Senchak. Verbal and behavioral responsiveness to the cries of atypical infants. Child Development, 61:76-84, 1990.
[21] K. S. Fu, editor. Digital pattern recognition. Springer-Verlag, Berlin, Heidelberg, New York, 1976.
[22] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, Inc., New York, 1990.
[23] K. Fukunaga and R. R. Hayes. The reduced Parzen classifier. IEEE Trans. on Pattern Anal. Machine Intel., 11:423-425, April 1989.
[24] K. Fukunaga and J. M. Mantock. Nonparametric data reduction. IEEE Trans. on Pattern Anal. Machine Intel., 6:115-118, January 1984.
[25] B. F. Fuller. Acoustic discrimination of three types of infant cries. Nursing Research, 40(3):156-60, 1991.
[26] B. F. Fuller and Y. Horii. Differences in fundamental frequency, jitter, and shimmer among four types of infant vocalizations. J. Commun. Disord., 19:441-7, 1986.
[27] B. F. Fuller and Y. Horii. Spectral energy distribution in four types of infant vocalizations. J. Commun. Disord., 21:251-61, 1988.
[28] T. Gardoski, P. Ross, and S. Singh. Acoustic characteristics of the first cries of infants. In T. Murry and J. Murry, editors, Infant Communication: Cry and Early Speech. College-Hill Press, Houston, Texas, 1980.
[29] G. W. Gates. The reduced nearest neighbor rule. IEEE Trans. on Inform. Theory, 18:431-433, 1972.
[30] A. Gersho. Asymptotically optimal block quantization. IEEE Trans. on Inform. Theory, 25(4):373-80, July 1979.
[31] A. Gersho and V. Cuperman. Vector quantization: A pattern-matching technique for speech coding. IEEE Commun. Mag., December 1983.
[32] A. Gersho and B. Ramamurthi. Image coding using vector quantization. IEEE Int. Conf. on Acoust., Speech, and Signal Proc., April 1982.
[33] H. L. Golub. A physioacoustic model of infant cry production. PhD thesis, Massachusetts Institute of Technology, Cambridge, Mass, 1980.
[34] H. L. Golub and M. J. Corwin. Infant cry: A clue to diagnosis. Pediatrics, 69:197-201, 1982.
[35] H. L. Golub and M. J. Corwin. A physioacoustic model of the infant cry. In B. M. Lester and C. F. Z. Boukydis, editors, Infant Crying: Theoretical and research perspectives. Plenum Press, New York and London, 1985.
[36] A. H. Gray, Jr. and J. D. Markel. A spectral flatness measure for studying the autocorrelation method of linear prediction of speech analysis. IEEE Trans. on ASSP, 22:207-17, 1974.
[37] R. V. E. Grunau and K. D. Craig. Facial activity as a measure of neonatal pain expression. In D. C. Tyler and E. J. Krane, editors, Advances in Pain Research Therapy, volume 15. Raven Press, Ltd., New York, 1990.
[38] R. V. E. Grunau, C. C. Johnston, and K. D. Craig. Neonatal facial and cry responses to invasive and non-invasive procedures. Pain, 42:295-305, 1990.
[39] G. E. Gustafson and J. A. Green. On the importance of fundamental frequency and other acoustic features in cry perception and infant development. Child Development, 60:772-80, 1989.
[40] D. J. Hand. Discrimination and Classification. John Wiley & Sons, New York, 1981.
[41] D. J. Hand. Kernel Discriminant Analysis. Research Studies Press, New York, 1982.
[42] P. E. Hart. The condensed nearest neighbor rule. IEEE Trans. on Inform. Theory, 14:515-516, 1968.
[43] H. Hollien. Developmental aspects of neonatal vocalizations. In T. Murry and J. Murry, editors, Infant Communication: Cry and Early Speech. College-Hill Press, Houston, Texas, 1980.
[44] J. Downey. Perinatal information on infant crying. Child Care, Health and Development, 16(2):113-21, 1990.
[45] S. Karelitz and V. R. Fisichelli. The cry thresholds of normal infants and those with brain damage. J. Pediatrics, 61:679, 1962.
[46] T. Kohonen. Learning vector quantization. Neural Networks, 1(Suppl. 1):303, 1988.
[47] T. Kohonen, G. Barna, and R. Chrisley. Statistical pattern recognition with neural networks: Benchmarking studies. IEEE Proc. of ICNN’88, 1:61-68, 1988.
[48] R. Kronland-Martinet, J. Morlet, and A. Grossman. Analysis of sound patterns through wavelet transforms. J. Pattern Recog. and Art. Intell., 1:273-301, 1987.
[49] A. Langlois, R. Baken, and C. Wilder. Pre-speech respiratory behavior during the first year of life. In T. Murry and J. Murry, editors, Infant Communication: Cry and Early Speech. College-Hill Press, Houston, Texas, 1980.
[50] K. F. Lee. Automatic Speech Recognition: The Development of the SPHINX System. Kluwer Academic Publishers, Boston, London, 1989.
[51] B. M. Lester. Introduction: There’s more to crying than meets the ear. In B. M. Lester and C. F. Z. Boukydis, editors, Infant Crying: Theoretical and research perspectives. Plenum Press, New York and London, 1985.
[52] J. Levine and N. C. Gordon. Pain in prelingual children and its evaluation by pain-induced vocalization. Pain, 14:85-93, 1982.
[53] N. Levinson. The Wiener RMS error criterion in filter design and prediction. J. Math. Phys., 25:261-78, 1947.
[54] M. M. Lewis. Infant Speech: A Study of the Beginnings of Language. Arno Press, New York, 1975.
[55] P. Lieberman. The physiology of cry and speech in relation to linguistic behavior. In B. M. Lester and C. F. Z. Boukydis, editors, Infant Crying: Theoretical and research perspectives. Plenum Press, New York and London, 1985.
[56] P. Lieberman, E. S. Crelin, and D. H. Klatt. Phonetic ability and related anatomy of the newborn and adult human, neanderthal man, and the chimpanzee. American Anthropologist, 74:287-307, 1972.
[57] P. Lieberman, K. S. Harris, P. Wolff, and L. H. Russell. Newborn infant cry and nonhuman primate vocalization. J. of Speech and Hearing Research, 14:718-27, 1971.
[58] J. S. Lienard and C. d’Allessandro. Wavelets and granular analysis of speech. Proc. Int. Conf. on Wavelets, Time-Frequency Methods and Phase Space, 1987.
[59] J. Lind, V. Vuorenkoski, G. Rosberg, T. Partanen, and O. Wasz-Höckert. Spectrographic analysis of vocal response to pain stimuli in infants with Down’s syndrome. Dev. Med. Child Neurol., 12:478-86, 1970.
[60] J. Lind, O. Wasz-Höckert, V. Vuorenkoski, and E. Valanne. The vocalizations of a newborn, brain-damaged child. Annales Paediatriae Fenniae, 11:32-7, 1965.
[61] Y. Linde, A. Buzo, and R. Gray. An algorithm for vector quantizer design. IEEE Trans. on Commun., 28:84-95, 1980.
[62] R. P. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, April 1987.
[63] W. Ludge and P. Gips. Microcomputer-aided studies of cry jitter uttered by newborn children based upon high-resolution analysis of fundamental frequencies. Comp. Meth. and Prog. in Biomed., 28:151-6, 1989.
[64] P. Lundh. A new baby-alarm based on tenseness of the cry signal. Scand. Audiol., 15:191-6, 1986.
[65] A. W. Lynip. The use of magnetic devices in the collection and analysis of the preverbal utterances of an infant. Genet. Psychol. Monogr., 44:221-60, 1951.
[66] S. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. on Pat. Anal. and Machine Intell., 11(7):674-93, 1989.
[67] J. D. Markel and A. H. Gray, Jr. Linear Prediction of Speech. Springer-Verlag, New York, 1976.
[68] D. McCarthy. Organismic interpretation of infant vocalizations. Child Development, 23(4):273-80, 1952.
[69] K. Michelsson. Cry analyses of symptomless low birth weight neonates and of asphyxiated newborn infants. Acta Paediatrica Scand., 19:309-15, 1971.
[70] K. Michelsson, H. Kaskinen, R. Aulanko, and A. Rinne. Sound spectrographic cry analysis of infants with hydrocephalus. Acta Paedia. Scand., 73:65-8, 1984.
[71] K. Michelsson, J. Raes, and A. Rinne. Cry score - an aid in infant diagnosis. Folia Phoniatrica, 36:219-24, 1984.
[72] K. Michelsson, J. Raes, C. J. Thodén, and O. Wasz-Höckert. Sound spectrographic cry analysis in neonatal diagnostics: An evaluative study. J. of Phonetics, 10:79-80, 1982.
[73] K. Michelsson and P. Sirvio. Cry analysis in herpes encephalitis. Proceedings of the 5th Scand. Congress in Perinatal Medicine, 1975.
[74] K. Michelsson and P. Sirvio. Cry analysis in congenital hypothyroidism. Folia Phoniatrica, 28:40-7, 1976.
[75] K. Michelsson, P. Sirvio, M. Koivisto, A. Sovijarvi, and O. Wasz-Höckert. Spectrographic analysis of pain cry in neonates with cleft palate. Biology of the Neonate, 26:353-8, 1975.
[76] K. Michelsson, P. Sirvio, and O. Wasz-Höckert. Pain cry in full term asphyxiated newborn infants correlated with late findings. Acta Paedia. Scand., 66(a):611, 1977.
[77] K. Michelsson, P. Sirvio, and O. Wasz-Höckert. Sound spectrographic cry analysis of infants with bacterial meningitis. Develop. Med. and Child Neuro., 19(b):309-15, 1977.
[78] K. Michelsson, N. Tuppurainen, and P. Auld. Cry analysis of infants with karyotype abnormality. Neuropediatrics, 11:365-76, 1980.
[79] K. Michelsson et al. Crying, feeding and sleeping patterns in 1 to 12-month-old infant. Child Care, Health and Development, 16(2):99-111, 1990.
[80] P. Mueller and J. Lazzaro. A machine for neural computation of acoustical patterns with application to real-time speech recognition. In J. S. Denker, editor, AIP Conference Proceedings 151, Neural Networks for Computing. Snowbird, Utah, 1986.
[81] E. Muller, H. Hollien, and T. Murry. Perceptual response to infant crying. J. of Child Language, 1:89-95, 1974.
[82] A. D. Murray. Infant crying as an elicitor of parental behavior: An examination of two models. Psychological Bulletin, 86:191-215, 1979.
[83] A. D. Murray. Aversiveness is in the mind of the beholder: Perception of infant crying by adults. In B. M. Lester and C. F. Z. Boukydis, editors, Infant Crying: Theoretical and research perspectives. Plenum Press, New York, 1985.
[84] N. M. Nasrabadi and Y. Feng. Vector quantization of images based upon the Kohonen self-organizing feature maps. IEEE Proc. of ICNN’88, 1988.
[85] J. W. Van Ness. On the effects of dimension in discriminant analysis for unequal covariance populations. Technometrics, 21:119-127, 1979.
[86] J. W. Van Ness and C. Simpson. On the effects of dimension in discriminant analysis. Technometrics, 18:175-187, 1976.
[87] C. E. Osgood, G. J. Suci, and P. H. Tannenbaum. The measurement of meaning. University of Illinois Press, Urbana, 1957.
[88] D. O’Shaughnessy. Speaker recognition. IEEE ASSP Magazine, October 1986.
[89] M. E. Owens. Pain in infancy: Conceptual and methodological issues. Pain, 20:213-30, 1984.
[90] A. H. Parmelee. Infant crying and neurologic diagnosis. J. of Pediat., 61:801-2, 1962.
[91] T. Parsons. Voice and Speech Processing. McGraw Hill Book Company, New York, 1987.
[92] T. Partanen, O. Wasz-Höckert, V. Vuorenkoski, K. Theorell, E. Valanne, and J. Lind. Auditory identification of pain cry signals of young infants in pathological conditions and in sound spectrographic basis. Annales Paediatriae Fenniae, 13:56-63, 1967.
[93] E. Parzen. Some recent advances in time series modeling. IEEE Trans. on Auto. Control, AC-19:723-30, 1974.
[94] S. M. Peeling, R. K. Moore, and M. J. Tomlinson. The multi-layer perceptron as a tool for speech pattern processing research. Proc. IOA Autumn Conf. on Speech and Hearing, 1986.
[95] J. Picone. Continuous speech recognition using hidden Markov models. IEEE ASSP Magazine, July 1990.
[96] R. Prescott. Infant cry sound; developmental features. J. of Acoust. Soc. Am., 57(5):1186-91, 1975.
[97] R. Prescott. Cry and maturation. In T. Murry and J. Murry, editors, Infant Communication: Cry and Early Speech. College-Hill Press, Houston, Texas, 1980.
[98] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of IEEE, 77(2):257-85, February 1989.
[99] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, January 1986.
[100] L. R. Rabiner, K. C. Pan, and F. K. Soong. On the performance of isolated word speech recognition using vector quantization and temporal energy contours. AT&T Bell Tech. J., 63:1245-60, 1984.
[101] L. R. Rabiner and R. W. Schafer. Digital processing of speech signals. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1978.
[102] L. R. Rabiner, J. G. Wilpon, and F. K. Soong. High performance connected digit recognition using hidden Markov models. IEEE Trans. on ASSP, 37:1214-25, 1989.
[103] G. Rapisardi, B. Vohr, W. Cashore, M. Peucker, and B. Lester. Assessment of infant cry variability in high-risk infants. Intern. J. of Pediatric Otorhinol., 17:19-29, 1989.
[104] A. E. Rosenberg and F. K. Soong. Evaluation of a vector quantization talker recognition system in text independent and text dependent modes. Proc. ICASSP 86, IEEE, 1986.
[105] M. J. Russell and R. K. Moore. Explicit modeling of state occupancy in hidden Markov models for automatic speech recognition. Proc. ICASSP 85, IEEE, pages 5-8, March 1985.
[106] M. R. Sambur. Selection of acoustic features for speaker identification. IEEE Trans. on ASSP, 23:176-82, 1975.
[107] R. W. Schafer and L. R. Rabiner. Digital representations of speech signals. Proceedings of IEEE, 63:662-77, 1975.
[108] R. Schalkoff. Pattern Recognition: Statistical, Structural and Neural Approaches. John Wiley & Sons, Inc., New York, 1992.
[109] P. Sirviö and K. Michelsson. Sound-spectrographic cry analysis of normal and abnormal newborn infants; a review and a recommendation for standardization of the cry characteristics. Folia Phoniatrica, 28:161-73, 1976.
[110] F. K. Soong and A. E. Rosenberg. On the use of instantaneous and transitional spectral information in speaker recognition. ICASSP 86, IEEE, 1986.
[111] I. St James-Roberts and T. Halil. Infant crying pattern in the first year: Normal community and clinical findings. J. of Child Psychol. Psychiat., 32(6):951-68, 1991.
[112] P. R. Stasiewicz et al. Effects of infant cries on alcohol consumption in college males at risk for child abuse. Child Abuse & Neglect, 13(4):463-70, 1989.
[113] J. Tenold, D. Crowell, R. Jones, T. Daniel, D. McPherson, and A. Popper. Cepstral and stationarity analyses of full-term and premature infants’ cries. J. Acoust. Soc. Am., 56:975-80, 1974.
[114] H. M. Truby and J. Lind. Cry sound of the newborn infant. Acta Paediatr. Scand. Suppl., 163:8-59, 1965.
[115] E. Valanne, V. Vuorenkoski, T. Partanen, J. Lind, and O. Wasz-Höckert. The ability of human mothers to identify the hunger cry signals of their newborn infants during the lying-in period. Experientia, 23:1, 1967.
[116] B. R. Vohr, B. Lester, G. Rapisardi, C. O’Dea, L. Brown, M. Peucker, W. Cashore, and W. Oh. Abnormal brain-stem function (brain-stem auditory evoked response) correlates with acoustic cry features in term infants with hyperbilirubinemia. J. of Pediatrics, 115(2):303-8, 1989.
[117] V. Vuorenkoski, M. Kaunisto, P. Tjernlund, and L. Vesa.
Cry detector: A clinical apparatusfor surveillance of pitch and activity in the crying of a newborn infant. Child Psychiatry,Parallel session C III a: Neurology, 1970.[118] 0. Wasz-Höckert, J. Lind, V. Vuorenkoski, T. Partanen, and B. Valanne. The Infant Cry— A Spectrographic and Auditory Analysis. Spastics International Medical Publicationsin Association with William Heinemann Medical Books Ltd., 1968.[119] 0. Wasz-Höckert, K. Micheisson, and J. Lind. Twenty-five years of Scandinavian cryresearch. In B. M. Lester and C. F. Z. Boukydis, editors, Infant Crying: Theoretical andBibliography 159research perspectives. Plenum Press, New York and London, 1985.[120] 0. Wasz-Höckert, T. Partanen, V. Vuorenkoski, and E. Valanne. The identification of somespecific meanings in the newborn and infant vocalization. Experientia, 20:154, 1964.[121] 0. Wasz-Höckert, T. Partanen, V. Vuorenkoski, E. Valanne, and K. Micheisson. Effectof training on ability to identify pre-verbal vocalizations. Developmental Medicine andChild Neurology, 6:393—6, 1964.[122] 0. Wasz-Hockert, V. Vuorenkoski, E. Valanne, and K. Michelsson. Tonspektrographischeuntersuchungen des sauglinggeschreis. Experientia, 18:583, 1962.[123] J. J. Wolf. Efficient acoustic parameters for speaker recognition. JASA, 51:2030—43, 1972.[124] P. H. Wolff. The role of biological rhythms in early psychological development. Bulletinof the Menninger Clinic, 31:197, 1967.[125] P. H. Wolff. The natural history of crying and other vocalizations in early infancy. InB. M. Foss, editor, Determinants of infant behavior, volume 4. Methuen, London, 1969.[126] Q. Xie, C. A. Laszlo, and R. K. Ward. Vector quantization technique for nonparametricclassifier design. IEEE Trans. on Pattern Anal. Machine Intel., 1992. (in press).[127] Q. Xie, R. K. Ward, and C. A. Laszlo. Characterization of normal infants’ level-of-distressby a single parameter derived from cry sounds. submitted to IEEE Trans. 
on Speech andAudio Processing, February 1993.[1281 Q. Xie, R. K. Ward, and C. A. Laszlo. A hidden Markov model method for estimatingnormal infant distress level from cry sounds. submitted to IEEE Trans. on Speech andAudio Processing, May 1993.Bibliography 160[129] P. S. Zeskind. A developmental perspective of infant crying. In B. M. Lester and C. F. Z.Boukydis, editors, Infant Crying: Theoretical and research perspectives. Plenum Press,New York and London, 1985.[130] P. S. Zeskind and B. H. Lester. Acoustic features and auditory perceptions of the cries ofnewborns with prenatal and perinatal complications. Child Development, 49:580—9, 1978.
Title | Automatic infant cry analysis and recognition |
Creator | Xie, Qiaobing |
Date Issued | 1994 |
Description | This dissertation is a report of my investigation on introducing modern speech processing/recognition techniques to the field of infant cry research, and on developing efficient and effective methodologies for automatic assessment of the physical/emotional situation of infants using the information derived from the cry signals. I first identify some problems facing present infant cry research, especially those obstructing the practical applications of the results generated from basic research. By demonstrating the similarities between infant cry generation and adult speech generation, I establish the theoretical foundation for the development of my new automatic cry processing/analysis techniques. In particular, I develop the new concept of cry phonemes as an effective method for representing cry signals for automatic cry analysis. Based on the cry phonemes, I further define a composite parameter, the H-value, which can be calculated from the cry signal, and is found to be a reliable indicator of the distress level of the infant. Using these new concepts, I design two automatic infant cry analysis systems. One system is based on my newly developed nonparametric VQ-kernel classifier, and the other system is based on the Hidden Markov Model technique. Each of these systems estimates the H-value from the cry signal automatically. This, in turn, is utilized in the automatic assessment of the infant’s distress level. The performance of these two systems was evaluated with cries uttered by 36 infants. I found that both systems give assessments of infants’ distress levels consistent with the perceptions of experienced parents who listened to the recording of the same cries. This demonstrates the effectiveness of my newly developed techniques. In addition, the methodologies developed in this research can be easily generalized and applied to other problems of normal and abnormal infant cry analysis. |
Extent | 2652749 bytes |
Genre | Thesis/Dissertation |
Type | Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-04-08 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0065020 |
URI | http://hdl.handle.net/2429/6931 |
Degree | Doctor of Philosophy - PhD |
Program | Electrical and Computer Engineering |
Affiliation | Applied Science, Faculty of; Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1994-05 |
Campus | UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |