Open Collections

UBC Theses and Dissertations

An eclectic cry research tool for the automatic estimation of an infant’s level of distress Black, John Scott 1997

Full Text

AN ECLECTIC CRY RESEARCH TOOL FOR THE AUTOMATIC ESTIMATION OF AN INFANT'S LEVEL OF DISTRESS

by

JOHN SCOTT BLACK
B.Sc.E., University of New Brunswick, 1995

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE
in
THE FACULTY OF GRADUATE STUDIES
Department of Electrical Engineering

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
JUNE 1997
© John Scott Black, 1997

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

The University of British Columbia
Vancouver, Canada

ABSTRACT

The infant cry has received strong research interest from scientists, medical doctors and engineers for a number of decades. As an infant's primary vehicle for communication, it is believed that the cry contains valuable information about the infant's physical and emotional well-being. As a result, the cry has been proposed as a useful means for monitoring an infant's level of distress (LOD). Recently, a novel approach for estimating the LOD has been introduced based on the proven concept of the cry being composed of recognizable cry-words. Using an established set of ten cry-words, a single-value indication of an infant's LOD, called the H-value, is obtainable. Previous methods for demonstrating the usefulness of this measure required complex algorithms and a complete tool was not developed. In this thesis the development of such a tool is reported.
This tool uses no single recognition technique but rather an eclectic variety of correlation methods and simple decision rules in conjunction with common features and techniques such as the fundamental frequency, FFT, short-time energy and a modified power spectral density. Without computationally intensive routines, the complexity of the system was reduced relative to previous methods, resulting in a practical cry research tool. Testing of this system made use of actual cry data. Comparisons are given with previous systems and experienced parents' LOD ratings. The tool was also implemented on both a SUN and a PC to demonstrate its portability between different systems.

TABLE of CONTENTS

Abstract
Table of Contents
List of Figures
List of Tables
Acronyms and Abbreviations
Acknowledgements
1 INTRODUCTION
2 INFANT CRYING
  2.1 History of Cry Research
  2.2 Techniques and Features
    2.2.1 Techniques
    2.2.2 Features
  2.3 Speech and Cry Models
  2.4 Cry Information Debate
  2.5 Summary
3 SUMMARY of XIE's WORK
  3.1 Cry-Words
  3.2 H-Value
  3.3 Nonparametric Statistical Classifier-Based Method
    3.3.1 VQ-Kernel Automatic Cry Analyzer
    3.3.2 VQ-Kernel Results
  3.4 Hidden Markov Model
    3.4.1 HMM Automatic Cry Analyzer
    3.4.2 HMM Recognition Results
4 DIRECTION of STUDY
  4.1 Xie's Contributions
  4.2 Unopened Doors
  4.3 Motivation for Direction of Study
  4.4 Spectrograms
  4.5 Wavelets
  4.6 Summary
5 DEVELOPMENT of the SYSTEM
  5.1 Introduction
  5.2 Relevant Parameters
    5.2.1 F0 and st-Energy
    5.2.2 Duration and Sequencing Information
    5.2.3 Voiced Versus Unvoiced
    5.2.4 Frequency of Occurrence
  5.3 Fundamental Frequency Detection and Estimation
    5.3.1 Cepstral Analysis
    5.3.2 Pitch Detection Using the stRC
  5.4 Improved System
6 TESTING and RESULTS
  6.1 Testing Overview
  6.2 System Accuracy
    6.2.1 Comparisons with our MVE Results
    6.2.2 Comparison with Xie's MVE Results
    6.2.3 Comparison with Parental Assessments
  6.3 System Time
  6.4 System Complexity
7 CONCLUSIONS and RECOMMENDATIONS
  7.1 Conclusions
  7.2 Recommendations
BIBLIOGRAPHY

LIST of FIGURES

2.1 A typical model for adult speech generation
2.2 Simple model for infant cry generation
3.1 Spectrograms (time-frequency patterns) of the ten cry-words proposed by Xie [Xie, 1993]
3.2 The HMM-based classifier as shown in [Xie, 1993]
4.1 Spectrogram for a typical sample of an infant cry
5.1 Plot of typical average ranges for F0 and normalized energy for the ten cry-words
5.2 Differences between voiced and unvoiced cry samples. The upper plots are spectrograms of typical voiced and unvoiced signals while the lower plots are the instantaneous squared magnitudes of the FFTs at time t = 0.05 sec
5.3 Description of computation of stRC (borrowed from [Deller et al., 1993])
5.4 Illustration of cry signal, stRC and high-time liftered stRC
5.5 Flowchart providing an overview of the new recognition system. Each of the sections numbered (1) through (7) will be referred to in the discussion of the system
5.6 The results of applying the modified power spectral density to both a voiced and unvoiced cry sample
5.7 Outline of the steps followed in Pxx test
5.8 Expansion of the overview flowchart (Fig. 5.5) showing more detail describing the new recognition system
5.9 Detailed flowchart (part 1). This flowchart provides more detail than Fig. 5.5 and Fig. 5.8. A, B and C connect with corresponding sections in Fig. 5.10
5.10 Detailed flowchart (part 2). A, B and C connect with corresponding sections in Fig. 5.9
6.1 The subjective MVE H-value estimations versus the new system automatic H-value estimations
6.2 Automatic computer estimated H-values versus parents' LOD ratings. The top plot shows results for our system using a non-overlapping analysis window.
    The bottom plot shows results for Xie's HMM based system

LIST of TABLES

5.1 The "Monster": simple visual display of a number of features so patterns can be determined to identify each of the cry-words
5.2 Average results from examining a number of different samples of the ten cry-words. Note that "tot" refers to the average for the total sequence while "beginning" ("end") refers to the averages over roughly the first (last) quarter or third of the sample. 1/4-1/2-3/4 refer to roughly the 1/4, 1/2 and 3/4 points of the sequence length where the F0 values were calculated
5.3 Distribution of the micro-segments used in Xie's training set (information from [Xie, 1993])
6.1 Mean absolute differences in H-values when subjective classifications are compared with estimated values from the automatic recognition systems
6.2 Cross-correlation coefficients between subjective classifications and estimated H-values from the automatic recognition systems

ACRONYMS and ABBREVIATIONS

ANN -- Adaptive Neural Network
BPF -- Band Pass Filter
CC -- Complex Cepstrum
DFT -- Discrete Fourier Transform
DHB -- Double Harmonic Break
DSP -- Digital Signal Processing
FFT -- Fast Fourier Transform
F0 -- Fundamental frequency
FT -- Fourier Transform
HMM -- Hidden Markov Model
IFT -- Inverse Fourier Transform
LOD -- Level Of Distress
LP -- Linear Prediction
MVE -- Manual Visual Estimate
PDF -- Probability Density Function
Pxx -- Modified power spectral density
RC -- Real Cepstrum
st-energy -- Short Time energy
STFT -- Short Time Fourier Transform
stRC -- Short Time Real Cepstrum
VQ -- Vector Quantization
WT -- Wavelet Transform

ACKNOWLEDGEMENTS

I would like to take this opportunity to thank my parents and family for their continued support and prayer throughout university. A special thanks also goes to my best friend Marsalie whose patience, encouraging words and smiling face always surfaced when they were needed most. I would like to thank Dr. C. A. Laszlo and Dr. R. K.
Ward for allowing me the opportunity to work on this project. Their guidance, feedback and willingness to sit down and help work through various problems proved very valuable. Thanks too to Q. Xie who was willing to answer any questions I had about his work. Finally, I would like to thank C. Jaeger for her general discussions of different aspects of the project and for her overall help. This project was supported by an NSERC research grant.

To God be the glory

Chapter 1 INTRODUCTION

"Is that infant crying for a reason other than solely to receive attention?" This is an age-old question to which many are willing to offer their answers. Unfortunately, with little objective data available, numerous subjective opinions are all we have. Implementing a standardized system that recognizes an infant's level of distress from its cry may not only help answer this important question, but would also be very useful for infant monitoring and cry research purposes.

Before infants can speak, crying is their earliest vehicle for communication. Initially crying is nothing more than a reflex, but over time an infant learns from others' responses and crying becomes more volitional. As an example, a baby may cry when he/she feels the discomfort associated with hunger and in turn may get fed. The baby then learns that this crying invokes a response in the caregivers which curbs the discomfort. The cry-and-get-food process is then stored in memory and becomes a purposeful means of communication.

Just as speech recognition engineers desire to learn more about speech, a number of professionals desire to learn more about infant cries. It has been the view of many scientists, medical doctors and engineers for some time that an infant cry contains many indications of an infant's physical and emotional state [Murry, 1980; Michelson & Wasz-Höckert, 1980; Tenold et al., 1974].
However, despite years of research, exactly how much information is contained in the cries remains unknown.

With such a broad spectrum of professionals working in this field, there is little wonder that many different aspects of the cry are examined. Some common research interests include trying to determine the cause-effect relationship of a cry or searching for patterns which might relate cries of unhealthy infants to their diseases. As mentioned above, there is also a need for an automatic estimation of an infant's distress level. To date, a highly effective detection/recognition system has not been developed. However, recent work by Xie introduced a platform on which such a system could be based [Xie, 1993]. Unfortunately, while the platform was introduced, a practical tool was not. Our research is intended to address this issue.

The objective of our work is two-fold. First, we plan to develop a practical tool to provide an automatic estimation of an infant's level of distress. This tool will use the concepts and platform introduced by Xie, but not necessarily the same methods. Our second objective is to develop this tool using a less complex, less computationally intensive method than the systems suggested by Xie without compromising system accuracy or analysis time. We believe that a more detailed study of some common features such as the fundamental frequency and energy distribution of the FFT will reveal useful patterns or trends which could assist in recognition.

In Chapter 2, we provide a history of cry research over the past two centuries emphasizing the features and techniques which have proven most useful during this period. We also demonstrate the similarities between adult speech and the infant cry through the use of established generation models.
Chapter 3 reviews the work of Xie and examines both his established set of ten cry-words and definition for the H-value as a single value indicator of an infant's level of distress. Much of our work will be built on his contributions. In Chapter 4 we discuss other avenues for future research based on Xie's work and then present our direction of study. An outline of our work is given in Chapter 5 which details our new proposed H-value estimation tool. Here we discuss why each feature or method is selected and how it fits into our overall recognition scheme. At the end of the chapter, a summary of the tool is presented in a system flowchart. Details of our testing strategy are given in Chapter 6. Testing makes use of actual cry data and comparisons are made with previous systems based on system accuracy, time and complexity. Finally, Chapter 7 summarizes our work, in particular our contributions to cry research. Recommendations for future work are also provided.

Chapter 2 INFANT CRYING

2.1 HISTORY of CRY RESEARCH

The art of "diagnostic-listening" dates back to ancient Greece in the days of Hippocrates [Petroni et al., 1995; Golub and Corwin, 1985]. The usefulness of such a practice was not really explored, however, until the nineteenth century. In 1838, Gardiner's work described an infant's cry as having an up-and-down melodic pattern and that each cry could be located between middle A and E on a piano. Also in the mid-1800's, Darwin attempted to show that an infant had a different facial expression for different types of cries. Even though both scientists were limited by the non-existence of recording equipment, their interpretations set the tone for the years to follow.

Early in the twentieth century, advances in technology permitted acoustic investigations of the infant cry. One of the first developments was the graphophone which allowed the recording of infant vocalizations.
Flatau and Gutzman used this device in 1906 and determined that an infant with an extremely high pitch had a breathing problem, a discovery that encouraged researchers in the years that followed to investigate the cries of both normal and abnormal infants. Other advances in the early 1900's included the development of both the gramophone and tape recorder which improved cry analysis techniques by allowing repeated listening to a cry. Still, these first cry studies were strictly based on auditory analysis and researchers remained limited in the tests they could perform.

In the late 1940's, analysis was revolutionized with the development of the spectrograph at the Bell Laboratories. Initially designed to aid the deaf with a visual display of speech, the spectrograph produced a permanent visual record of the distribution of energy in both time and frequency called the spectrogram. While it failed in its intended purpose because of the high complexity of speech, it became a useful tool for many areas of signal processing including speech processing and cry analysis. In 1951, Lynip was probably the first to use sound-spectrographic methods for cry analysis, measuring the fundamental frequency (F0) and searching for vowel-like patterns in the cry sound [Lynip, 1951]. In 1962, the Scandinavian research team of Wasz-Höckert et al. presented their first findings of sound spectrographic methods with their analysis of four basic cry types, namely birth, hunger, pain and pleasure. Later studies showed that experienced persons such as midwives, nurses and experienced parents were better at discerning between these cry types [Wasz-Höckert et al., 1964a & 1964b]. This marked the beginning of modern acoustic cry research. The team of Wasz-Höckert et al.
is also credited with another milestone in infant cry analysis when they presented a statistical analysis of cries in 1968 which set guidelines for many researchers in the twenty years that followed [Wasz-Höckert et al., 1968].

In the 1970's, major developments in other fields also resulted in advances in cry analysis research. This era saw significant improvements in electronics, signal processing techniques and the use of computers. This enabled techniques and algorithms to be employed which had previously been too cumbersome and complex. One technique which had not seen much use until then was cepstral analysis which Tenold et al. utilized, along with stationarity analysis, to study the variability of F0 [Tenold et al., 1974].

Improvements to computers and digital signal processing (DSP) techniques continued into the 1980's. Also by this time there was a greater understanding of the complex interactions between the many anatomical structures and physiological mechanisms involved with crying. In the early 1980's, Golub and Corwin made use of these improvements in technology combined with the greater understanding of the cry and applied computer signal processing techniques to cry analysis [Golub and Corwin, 1982 & 1985]. Their work suggested that cry analysis could be useful as an indicator of abnormalities in infants. They also developed a physioacoustic model of cry production which could be used to correlate acoustical measurements with medical abnormalities. This agreed with past reports of many researchers outlined in [Golub and Corwin, 1985] where cries of infants with various forms of brain damage, chromosomal abnormalities, low birth weight and malnutrition displayed differences from the cries of normal infants.

Later work by Lester supported the use of cry analysis as a non-invasive measure to detect developmental outcome of an infant at risk [Lester, 1987].
His work highlighted F0 and the first formant as being mediated by neural mechanisms while features such as duration and amplitude were more related to respiratory control. These can be related to the physioacoustic model of Golub and Corwin.

In the late 1980's and early 1990's, even greater improvements in DSP and computing capabilities allowed much more complex algorithms and technologies to be utilized. For instance, the fast Fourier transform (FFT) algorithm which had proven over the years to have many uses in signal processing could now be used efficiently with the increased computing power. Many researchers began taking advantage of this including Rapisardi et al. [Rapisardi et al., 1989] and Fuller [Fuller, 1991]. At present, the FFT algorithm is computed so efficiently that it can be used with little worry of its complexity, a luxury not available even ten years ago.

With professionals from multiple disciplines researching cry analysis, it has been very difficult to establish a standardized nomenclature which can be understood and used by all. For the same reason, the development of a common approach to cry analysis has remained elusive. In 1993, Xie proposed a systematic scheme whereby cry analysis was treated as speech analysis [Xie, 1993; Xie et al., 1993a]. In particular, cries were broken into a set of cry-words in a fashion analogous to the English phonemes. Identification then made use of a statistical analysis method or the hidden Markov model, techniques which could not have been used without the advances in computing power. A further discussion of Xie's work is presented in the next chapter.

Petroni et al. also made use of complex algorithms requiring powerful computer processing abilities. In 1994 they outlined a new, robust method for F0 extraction which required creating a three dimensional plot called a cross-correlogram [Petroni et al., 1994].
The same group showed in 1995 that adaptive neural networks (ANN) are useful for classification and discrimination of certain cries [Petroni et al., 1995].

Without question, cry analysis has evolved dramatically over the years. From the early days of diagnostic listening, to listening to recorded cry sounds, to looking at crude spectrographic representations, to the advanced algorithms and spectrograms using the latest computers, professionals have long sought to understand exactly how much information is contained in an infant's cry. Despite years of collaborative work, however, knowledge of the infant cry remains quite limited. This is due largely to the complexity of the cry itself, resulting in uncertainties as to how to take full advantage of modern signal processing capabilities.

2.2 TECHNIQUES AND FEATURES

2.2.1 Techniques

Throughout the evolution of cry analysis, technology has continually improved, allowing for more advanced equipment. These equipment improvements have in turn translated to developments in analysis techniques. In general, the evolution of acoustic cry analysis can be grouped into five techniques for extracting and identifying acoustical data from an infant's cry, as outlined by Golub and Corwin [Golub and Corwin, 1985].

1. Auditory Analysis

The earliest and most obvious means for cry analysis has been the human ear. From Hippocrates in ancient times through to the 1960's with work by Wasz-Höckert et al. [Wasz-Höckert et al., 1964a & 1964b] and others, this was also the only method available. As technology progressed, researchers benefited from developments such as the graphophone, gramophone and magnetic tape recorder which allowed a closer, repeated study of the cries. But while medical information could be obtained using this method, and analysis could be improved from experience or by training, the use of auditory analysis was obviously limited.
Significantly more information was available in the cries which would require more sophisticated analysis techniques.

2. Time Domain Analysis

A technique which received some attention in the 1960's was time domain analysis. For this technique, sound magnitudes or waveforms were recorded as a function of time on a paper chart using a direct writing oscillograph or other such device. Though results using this method were not plentiful, notable findings by Fisichelli and Karelitz showed that abnormal infants required a significantly longer mean latency period between pain stimulus and cry onset [Fisichelli and Karelitz, 1963] and also that infants with brain damage required increased pain stimulation to produce a 1 minute cry duration [Karelitz and Fisichelli, 1962].

Even more than with auditory analysis, use of the above method was limited. It was easy to use, inexpensive and reliable, but it was a rather tedious process and, more significantly, timing information constitutes only a small portion of the available information contained in the cry.

3. Frequency Domain Analysis

A second technique used in the 60's with less popularity and success was frequency domain analysis. Studies focussed on breaking the cry signal into a coarse representation of frequency bands using a bank of one-third or one-half octave band pass filters. Without timing information and with large, inflexible bandwidths for the frequency ranges, very little useful information was obtained with this method. Major developments in cry research generally do not include studies grouped under this technique.

4. Spectrographic Analysis

In the late 1940's the sound spectrograph was invented at the Bell Laboratories. While the time domain and frequency domain analysis techniques provided limited information when used by themselves, spectrographs proved very useful as they provided a permanent record of the distribution of energy in both the temporal and spectral domains. The development of this instrument would have to be considered one of the greatest developments in the history of cry
The development of this instrument would have to be considered one of the greatest developments in the history of cry Chapter 2: INFANT CRYING 10 analysis. Since its inception to cry research in the early 1950's, most studies in the 20-30 years that followed made use of this technology. The most significant contributions to cry research using spectrographic methods were those from the Scandinavian research team headed by Wasz-Hockert and Lind. Among other discoveries, it was this team which developed many of the commonly used spectrographically based cry parameters. A brief summary of these features, as given by Golub and Corwin [Golub and Corwin, 1985] are given below: A. Durational Features • Latency Period: The time between the pain stimulus applied to the child and the onset of the crysound. The onset of crying was defined as the first phonation lasting more than 0.5 seconds. • Duration: The feature is measured from the onset of the cry to the end of the signal and consists of the total vocalizations occurring during a single expiration or inspiration. The boundaries were determined by the point on the spectrogram where the sound "seems" to end. B. Fundamental Frequency Features • Maximum pitch: The highest measurable point of the fundamental frequency seen on the spectrogram • Minimum pitch: The lowest measurable point in the FQ contour seen on the spectrogram • Pitch of shift: Frequency after a rapid increase in the F 0 seen on the spectrogram • Glottal roll or vocal fry: Unperiodic phonation of the vocal folds usually occurring at Chapter 2: INFANT CRYING 11 the end of an expiratory phonation when the signal becomes weak and F„ becomes very low. • Vibrato: Defined to occur when there are at least four rapid up-and-down movements of F c . • Melody type: Either falling, rismg/falling, rising, falling/rising, or flat. • Continuity: A measure of whether a cry was entirely voiced, partly voiced or voiceless. 
• Double harmonic break: A simultaneous parallel series of harmonics in between the harmonics of the fundamental frequency.
• Biphonation: An apparent double series of harmonics of two fundamental frequencies. Unlike double harmonic break, these two series seem to be independent of each other.
• Gliding: A very rapid up and/or down movement of F0, usually of short duration.
• Noise concentration: High energy peak at 2000 - 3000 Hz, found both in voiced and voiceless signals; this attribute is clearly audible.
• Furcation: Term used to denote a "split" in the F0 where a relatively strong cry signal suddenly breaks into a series of weaker ones, each one of which has its own F0 contour. It is seen mainly in pathological cries.
• Glottal plosives: Sudden release of pressure at the vocal folds producing an impulsive expiratory sound.

Using the above mentioned features, researchers have been able to make many advances in the understanding of cries. Spectrographic analysis has proven particularly useful for relating cries of abnormal infants to those of normal infants because both domains can be displayed simultaneously for comparison. However, this method has not been without fault. A poor dynamic range, inadequate frequency resolution and the tedious process of viewing and interpreting the data have kept this method from achieving widespread use in the medical field.

5. Computer-Based Signal Processing

A major shortcoming of the techniques mentioned thus far was that feature extraction required off-line examination of the data which was often slow and tedious. In many cases resolution was quite low as well. The advent of the computer began a new era for cry analysis as digital signal processing (DSP) techniques made possible the use of more sophisticated recognition techniques. This allowed data extraction to be performed automatically.
In addition to the possibility of new techniques, computers also allowed the older techniques to be used more efficiently. For example, the spectrogram is computed far more quickly and with more resolution now as computers determine the spectrogram using the FFT. The same acoustical features as before are therefore computed more accurately and there is also an ability to extract new information which was unobtainable before.

The most recent developments in cry research are beginning to make use of even more sophisticated techniques such as the hidden Markov model and adaptive neural networks, techniques widely used in the field of speech recognition. It can be shown that some of the techniques useful for speech recognition could be effective for infant cry recognition as well. At this point, exactly which techniques will be most beneficial for infant cry research and how they will be used are unknown, but continued improvements to recognition abilities are expected with these methods.

2.2.2 Features

The previous section outlined the evolution of cry analysis over the past two centuries from the point of view of the techniques used to analyze the signal. With professionals from many fields studying different aspects of infant cries, these techniques have been used to study many different features. As outlined in [Xie, 1993], infant cry research has generally been approached from one of five viewpoints.

1. Psychological Investigations on Subjective Perceptions of Infant Cries

The team of Wasz-Höckert et al. showed in 1964 that experienced persons such as nurses, midwives and experienced parents were better at discerning between cry types [Wasz-Höckert et al., 1964a & 1964b]. Their work is representative of this first type of research. Focus is placed on studying the perception of infant cries and responses of caregivers to an infant's cry for both normal and abnormal cries.

2. Physiological and Developmental Aspects

It is the view of many that cries may be useful indicators of an infant's neurophysiological functioning [Tenold et al., 1974; Golub and Corwin, 1982]. Much research has been performed tracking the infant's physiological and developmental status using cry analysis of linguistic, respiratory, neurological and psychological developments.

3. Emotional/Physical Situation Assessment

The assessment of the emotional/physical situation of an infant may be one of the most important uses of cry analysis, particularly for medical assessment. Of the parameters researched for determining the medical state of an infant, including facial expression, heart rate, respiratory rate, body movement and vocalizations, the vocalizations or cries will likely be the most convenient means for monitoring the infant.

4. Abnormal Infant Cries for Diagnostic Purposes

A common practice among researchers is to correlate differences between cries of normal infants with the cries of those infants diagnosed with a particular abnormality. Research in this area has tested cries for such abnormalities as asphyxia neonatorum, symptomless low birth weight, Down's syndrome, cleft palate, herpes encephalitis, congenital hypothyroidism, hyperbilirubinemia, bacterial meningitis, hydrocephalus, bradycardia, various forms of brain damage, malnutrition, genetic defects and sudden infant death syndrome. Normally only pain-cries are used to control and standardize testing procedures.

5. Detection and Recognition of Infant Cries as a Kind of Warning System

Though this type of research has received little attention, developing a warning system which detects and recognizes distressful infant cries would be beneficial in a number of environments. Hospitals and homes could use such systems so that caregivers would know when an infant was crying for something serious instead of crying solely to be heard.
It would also prove useful for alerting hard of hearing or deaf parents who could not otherwise properly monitor their infant.

To date, a highly effective detection/recognition system has not been developed. The Scandinavian research team of Vuorenkoski et al. developed one system in 1971 called the Cry Analyzer which was manufactured by the Swedish company Special Instruments [Vuorenkoski et al., 1970; Wasz-Höckert et al., 1985]. The system was developed for everyday use in a neonatal ward to collect samples from all babies and would determine the number of cries that were below and above 1000 Hz. Unfortunately, to be recorded cries were required to be longer than 400 msec, and since cries of this duration with a pitch greater than 1000 Hz are infrequent, effective usage of this system was limited.

A better system developed by Lundh in 1986 relied on tenseness information from the infant's cry [Lundh, 1986]. While other alarm systems simply alerted caregivers when cries were above a certain threshold with a flash, buzzer or vibrator (for deaf parents), this system displayed a different face depending on the infant's feeling as determined by the system, i.e. happy face, tearful face or screaming face. In-home tests showed that this system was useful but frequently displayed false alarms or improper identification of the baby's feeling.

In 1993, research at the University of British Columbia [Xie, 1993; Xie et al., 1993a] led to a very promising method for cry detection but did not produce a system or tool. Their work made use of parents' perceived levels of the stress of the infants' cries to train a system with "human-like" qualities to recognize different types of cries. This method is discussed in the next chapter.

2.3 SPEECH AND CRY MODELS

It was mentioned in section 2.2.1 that the same techniques used successfully for speech recognition could also be applied to infant cry analysis.
Before discussing this possibility, it is important to substantiate this statement by showing that speech and infant cries are very similar in nature.

Speech is an acoustic sound pressure waveform resulting from controlled movements of the human speech production systems. Fortunately, the various organs and muscles of the vocal and respiratory systems work in unison so that even while speaking we are able to breathe. This process is called speech breathing [Langlois et al., 1980]. The so-called power supply of the system is the lungs and trachea (windpipe), which force air through the remainder of the system. The forced air first passes through the larynx (vocal cords), which is the primary sound generating mechanism. It provides the periodic excitation for voiced sounds as a result of periodic vibration of the vocal folds. The waveform then proceeds through the vocal and/or nasal tracts where the speech is filtered or modulated. Finally, it passes through the mouth and lips, which add a radiation effect before the speech is uttered. This lip radiation may significantly affect the speech signal and is usually filtered out for speech analysis.

To gain further insight into the nature of speech, use is often made of a speech model. Many models have been proposed for speech, and a common factor among them is that the separate components are modelled as filters. The vocal system overall, then, can be viewed as a multi-component filter. A typical speech model is given as

S(z) = U(z)H(z)R(z)    (2.1)

where U(z) models the excitation waveform, H(z) models the dynamics of the vocal tract, and R(z) models the radiation effects of the lips [Deller et al., 1993]. The waveform U(z) can either be periodic, corresponding to voiced sounds, or non-periodic for unvoiced sounds. Unvoiced sounds are created by forcing air through a constriction in the vocal folds which creates a turbulence. This non-periodic signal is often modelled using a random noise generator.
Figure 2.1 presents a block diagram for this model.

[Figure 2.1: A typical model for adult speech generation. Blocks: impulse train generator, voice source gain, glottal pulse model G(z), random noise generator, noise source gain, vocal-tract model H(z), radiation model R(z). Output: speech S(z) = U(z)H(z)R(z).]

As with adult speech, infant crying is a result of complex interactions involving many mechanisms. Briefly mentioned in the discussion of adult speech was the concept of speech breathing. This requires short inhalations and long, controlled exhalations. Infant vocalizations are based on the same phenomenon. Studies have shown that even very young infants, "... can very well negotiate a short inspiration and a prolonged expiration... The infant's respiration system is capable of meeting those criteria for adult speech breathing." [Langlois et al., 1980]

In 1985, Golub and Corwin introduced a physioacoustic model for infant cry generation. This model, like the one for adult speech, comprised four main sections. The first section was the subglottal or respiratory system, which was the system "power source". It provided the air pressure to drive the vocal folds. The second section was the sound source located in the larynx, which generated either a periodic signal, S(z), or a turbulence noise, N(z). The vocal and nasal tracts comprised the third section and acted as an acoustic filter, as was the case with the speech model. This acoustic filter is shown as T(z) in Figure 2.2. The final section of this model was the acoustic transmission between the mouth, lips and the auditor, modelled as R(z) in Figure 2.2. Though the output of this model was quite different, it is evident that both speech and cries are modelled in a very similar fashion.
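As a rough illustration of the source-filter structure just described, the Python sketch below synthesizes a signal of the form R(z)T(z)(S(z) + N(z)). All numeric choices are assumptions made for illustration only and are not taken from Golub and Corwin's model: an 8 kHz sampling rate, a 450 Hz fundamental, a single two-pole resonance standing in for T(z), and a first difference standing in for R(z).

```python
import math
import random

FS = 8000  # sampling rate in Hz (assumed for illustration)

def periodic_source(n, f0, fs=FS):
    """S: impulse train with one unit pulse per pitch period."""
    period = max(1, round(fs / f0))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def noise_source(n, gain, seed=0):
    """N: zero-mean uniform noise standing in for turbulence."""
    rng = random.Random(seed)
    return [gain * rng.uniform(-1.0, 1.0) for _ in range(n)]

def tract_filter(x, fc, bandwidth, fs=FS):
    """T(z): one two-pole resonator (hypothetical stand-in for the tract)."""
    r = math.exp(-math.pi * bandwidth / fs)
    a1 = -2.0 * r * math.cos(2.0 * math.pi * fc / fs)
    a2 = r * r
    y = []
    for i, xi in enumerate(x):
        y1 = y[i - 1] if i >= 1 else 0.0
        y2 = y[i - 2] if i >= 2 else 0.0
        y.append(xi - a1 * y1 - a2 * y2)
    return y

def radiation(x):
    """R(z) = 1 - z^-1: first difference as a crude radiation model."""
    return [x[i] - (x[i - 1] if i >= 1 else 0.0) for i in range(len(x))]

def synth_cry(n, f0=450.0, noise_gain=0.05):
    """Compute R(z)T(z)(S(z) + N(z)) in the time domain."""
    s = periodic_source(n, f0)
    nz = noise_source(n, noise_gain)
    excitation = [a + b for a, b in zip(s, nz)]
    return radiation(tract_filter(excitation, fc=1100.0, bandwidth=200.0))

cry = synth_cry(800)  # 100 ms of synthetic signal at 8 kHz
```

Raising `noise_gain` relative to the pulse amplitude moves the output from a harmonic, phonation-like signal toward an unstructured, noise-dominated one, which is the distinction the two sources in the model are meant to capture.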
[Figure 2.2: Simple model for infant cry generation. Blocks: subglottal system, turbulence source N(z), periodic source S(z), vocal and nasal tract T(z), radiation characteristic from mouth R(z). Output: cry sound R(z)T(z)(S(z) + N(z)).]

It will also be shown in the following chapter that, as in speech, the infant cry can be broken into smaller units, each having distinct time-frequency characteristics. In speech these are called phonemes and for infant vocalizations they are termed cry modes or cry-words. Since phonemes are often used as a basis for recognition in speech and because the models representing both are so similar, it would seem reasonable that, using the cry-words, the techniques used for adult speech recognition might also be suitably applied to infant cry analysis.

2.4 CRY INFORMATION DEBATE

While many scientists, medical doctors, engineers and other professionals agree that the infant cry contains valuable information about the infant's physical and emotional well-being, there remains some debate as to the amount of information. It is the goal of some research to show that the infant cry follows a definite cause-effect pattern. This group of researchers feels that infants will have different distinguishable cries for different causes such as hunger, sickness, pain, pleasure and other states or emotions. Other researchers feel that the cry by itself can be used for detecting various diseases. For the most part, however, their assumptions remain postulations.

Not all researchers are under the impression that the cry contains quite so much information. A common feeling is that the cry is more of a distress signal. For instance, Hollien states that, "... it would appear that the cries of infants carry too little perceptual information to permit auditors to identify the condition that evoked them... it might be hypothesized that...
the cry generally acts simply to alert the mother and most (if not all) of her suppositions that evoked the crying behaviour must be based on additional environmental cues" [Hollien, 1980, pg. 28]. Hollien also questioned whether the cry could be the sole indicator for many diseases but did agree it should be able to help in diagnosis, which is a belief shared by many researchers.

Few researchers will question that the use of cry analysis will result in many benefits to child monitoring and in the understanding of the infant in general. As Golub and Corwin state [Golub and Corwin, 1985, pg. 80], "Cry analysis, if proven to be reliable, is particularly well suited to its proposed role as a newborn screening test. It is noninvasive, and, by utilizing the automated analysis system, it can be efficiently and economically performed on large numbers of infants without significantly disrupting normal hospital procedures."

At this point there is no clear-cut answer to the debate over how much information is contained in the infant cry. Until there is, a more likely application will be to use the cry in conjunction with other diagnostic observations for identifying diseases, or as part of a distress system with no need of recognizing the cause of the cry.

2.5 SUMMARY

Infant cry analysis has come a long way over the past century and especially since the 1960s. This chapter has presented the history of infant cry research and has provided a brief description of those techniques and features especially useful in the study of infant cries. Because cry research is such an interdisciplinary field, it was not possible to mention all advances in cry research. The major developments were highlighted to provide the necessary background on the directions of research in this field. The future of cry research will undoubtedly rely on computer-aided signal processing.
Despite many years of research, little is known about the nature of the infant cry, leaving many avenues for research open in the years to come. Researchers will need to discover reliable indicators to distinguish and classify different cries. They will also need to develop an effective and efficient approach to electronically measure the variables of interest. The cry-words presented by Xie may be suitable features, but an efficient approach to measure them has not yet been provided. At this point there is also no clear indication of which parameters will prove most useful. Despite the uncertainty, there is little question that the continued use of cry analysis will be important.

Chapter 3 SUMMARY OF XIE'S WORK

Infant cry research has often focused on the cause-effect relationship and on recognizing a particular type of cry, be it hunger, pain, pleasure, birth or others [Wasz-Hockert et al., 1964a & 1964b]. The difficulty with this approach is that an observed cry may not be the result of a single stimulus only. There are many factors such as sleep/waking state, age of the infant, individual sensitivity and many others which undoubtedly alter the cause-effect relationship.

As a result of the inherent uncertainty in selecting the cry stimulus, Xie developed a novel method of infant cry recognition [Xie, 1993; Xie et al., 1995] where he correlated the cry data with parental assessments. It is generally accepted that parents are able to distinguish or get a feel for what an infant is trying to communicate based on their past experience and intuition [Wasz-Hockert et al., 1964b, 1985]. Using parental assessments therefore offered the benefit of developing a system independent of the cry generation mechanism and avoided the uncertainties of determining the cry cause. It was proposed that modelling those characteristics which parents found important might lead to human-like recognition systems.
3.1 CRY-WORDS

American English contains 42 phonemes which are the basic building blocks for all spoken words [Deller et al., 1993]. Each phoneme is like a separate code delineated by the type and location of sound excitation and the position and movement of the articulators (e.g. tongue placement). As a result of these differences, there are also differing temporal and spectral characteristics. Speech-recognition engineers use these differences to differentiate between phonemes in the recognition stage, and in turn it is the different phonemes that are used to recognize different utterances.

Unlike infant vocalizations, adult speech also contains linguistic constraints that may help with recognition. For example, language has definite words satisfying a lexical constraint, and they must fit together grammatically in a prescribed pattern which makes syntactical sense [Deller et al., 1993]. Unfortunately for infant cries, there are no structures or guidelines to follow, so recognition is a more difficult task.

One of Xie's first contributions was to define a set of ten cry-words which closely paralleled the idea of phonemes for adult speech. In developing the cry-word set, Xie did not simply select the units blindly. Earlier work by Golub and Corwin suggested three cry modes called phonation, hyperphonation and dysphonation [Golub and Corwin, 1982]. Still, a cry-mode "set" had not been developed which would satisfy two important criteria. One was that cry-words, like phonemes, must cover most variations in vocalizations. In speech all words are combinations of the 42 phonemes. Similarly, all cries need to be combinations of the cry-words. The second criterion was that cry-words should be detectable and distinguishable in an automated fashion, i.e. by computer.
To develop the set of cry-words, Xie took the three modes of Golub and Corwin and then separated phonation into more specific units, namely flat, rising, falling, vibration and weak vibration, identified by differing melody types and/or short-time energy content. In addition, trailing, double harmonic break (DHB), and inhalation were added because of their distinct time-frequency representations. Given below is Xie's list of ten cry-words which satisfy these criteria, with a brief description of each. Figure 3.1, which follows, shows spectrograms of the different modes which help clarify visually the differences between cry-words.

(1) Trailing (glottal roll) - usually found at the end of long and powerful expiratory phonations. The fundamental frequency, F0, is low and vibrating, and gradually decreases, as does the total energy level.
(2) Flat - the most basic phonation. F0 is smooth and steady and there are clearly observable harmonics with little energy distribution between them.
(3) Falling - very similar to flat except F0 is decreasing.
(4) Double Harmonic Break - characterized by a parallel series of weaker harmonics in between the stronger, primary harmonics at F0.
(5) Dysphonation - basically an unstructured energy distribution over all frequencies.
(6) Rising - similar to flat except that F0 is ascending.
(7) Hyperphonation - characterized by an extremely high F0, usually > 1 kHz.
(8) Inhalation - the sound produced by the rapid intake of air.
(9) Vibration - the harmonics are clearly observable, but the F0 is vibrating. There is normally a high total energy level but no unstructured energy between the harmonics.
(10) Weak Vibration - basically the same as vibration except that the total energy is significantly lower.

[Figure 3.1: Spectrograms (time-frequency patterns) of the ten cry-words proposed by Xie [Xie, 1993]. Horizontal axis: time (s).]

3.2 H-VALUE

Infants cry as the result of many stimuli, with pain, pleasure and birth being a few of the more common [Hollien, 1980]. As mentioned previously, while these cries do have noticeable differences, these differences cannot always be explained or identified. This is why Xie set out to study a different attribute, the level of distress (LOD). In earlier work, Murray developed two models for crying with this focus [Murray, 1979]. The first model was that the cry was a releaser of parental behaviour and was a type of distress signal for the infant. For this model, cry recognition would be based more on cry intensity than cry stimulus. Her second model was that the cry was an indicator of emotions. Cry intensity for this model was said to increase with greater emotion, so again emphasis was placed not on the stimulus but on the intensity or LOD.

Xie combined the idea of estimating the LOD of cries based on recognizing the ten cry-words with parental inputs to develop a human-like recognition system. In an experiment, 20 experienced parents were asked to listen to a set of 58 different cries. The cry samples were originally recorded on video tape, but the parents were only permitted to listen to the cries since facial expression would also serve as an indicator of LOD [Grunau et al., 1990]. The cry stimulus, although known for these samples, was not made known to the parents.
After listening to these cries, the parents were asked to rate the cries on their perceived LOD on a scale of one to five. Level one was the least severe and five corresponded to the highest LOD.

The same 58 cries were also decomposed into their respective cry-words off-line using a computer graphics program. When computing the spectrogram for identifying the different cry-words, use was made of a 32 msec sliding Hamming window which proceeded in steps of 10 msec. This allowed accurate identification of the presence and duration of the various cry-words. For each cry-word, the sum of its time duration in the entire cry was measured. This summed duration was then expressed as a percentage of the duration of the entire cry.

To determine the correlation between the parents' perceived LOD ratings and percentage duration, scattergraphs were plotted for each. It was determined that there was a positive correlation between the LODs and the dysphonation, hyperphonation and inhalation cry-words. The strongest correlation was with dysphonation, which was not a surprise. Gustafson and Green found that adults' ratings of cries are most correlated with duration, dysphonation, and low and high frequency energy [Gustafson and Green, 1989]. Other findings from the comparisons showed weaker but still positive correlations for trailing and double harmonic break. Strong negative correlations were observed for the flat and weak vibration cry-modes. No strong correlations were observed for falling, rising or vibration.

Finally, Xie decided that a single indicator based on the correlations between parental perceived LODs and cry-mode time would allow an efficient and expedient means of analysis.
Using the sum of time durations of those cry-words which were positively correlated, the indicator was called the H-value and was defined as

$$H = \frac{D_T + D_D + D_I + D_B + D_H}{D_{Total}} \times 100\% \tag{3.1}$$

where the time durations of the trailing, dysphonation, inhalation, double harmonic break and hyperphonation cry-words are given as $D_T$, $D_D$, $D_I$, $D_B$ and $D_H$, respectively. $D_{Total}$ is the total duration of the cry, not including silence periods.

When H-values were calculated for each of the cries based on the off-line data and compared again with the parents' perceived LOD, a strong positive correlation resulted. High H-values corresponded to high parental LOD ratings and low H-values corresponded with low ratings. It was then proposed that an automatic cry analyzer could be developed based on calculating the H-value as an estimate of the LOD. This was the motivation behind Xie's work with infant cry recognition.

For a more thorough description of Xie's experimental work, results and scattergraphs, refer to [Xie, 1993; Xie et al., 1995]. In the following two sections we will briefly examine Xie's methods for automatically determining the H-value, specifically a nonparametric statistical classifier and the hidden-Markov-model.

3.3 NONPARAMETRIC STATISTICAL CLASSIFIER-BASED METHOD

The first cry analysis system Xie developed was based on a nonparametric statistical classifier. The idea behind this method was to assign one of a number of predetermined classes to an observed data value or sequence. A nonparametric classifier was chosen because no assumption of the form of the probability distribution functions (PDF) found in the Bayes Theorem formula was required. Nonparametric classifiers also tend to provide better classification than parametric classifiers [Xie, 1993; Xie et al., 1993b]. With only minimal a priori knowledge of the infant cry feature distributions, this choice was justifiable.
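Returning briefly to the H-value of equation 3.1, its arithmetic can be made concrete with a minimal Python sketch. The dictionary keys, the function name `h_value` and the example durations are all hypothetical choices made for illustration; only the formula itself comes from the text.

```python
# Cry-words whose durations appear in the numerator of equation 3.1.
H_TYPE = ("trailing", "dysphonation", "inhalation", "dhb", "hyperphonation")

def h_value(durations):
    """Return the H-value in percent from per-cry-word durations (seconds).

    `durations` maps cry-word names to their summed duration within one cry.
    Silence is assumed to be excluded already, so the denominator is simply
    the summed duration of all cry-words present.
    """
    total = sum(durations.values())
    if total <= 0:
        raise ValueError("cry contains no non-silence duration")
    h_time = sum(durations.get(name, 0.0) for name in H_TYPE)
    return 100.0 * h_time / total

# Invented example: 3.0 s of cry, of which 1.5 s is H-type.
example = {
    "flat": 1.2,
    "falling": 0.3,
    "dysphonation": 0.9,
    "inhalation": 0.3,
    "hyperphonation": 0.3,
}
print(round(h_value(example), 1))  # 50.0
```

A cry made up entirely of flat and weak vibration cry-words would score 0%, and one made up entirely of dysphonation would score 100%, matching the direction of the correlations reported above.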
Despite the advantage of possibly better classification, nonparametric classifiers are not without fault. In particular, these classifiers are very computationally complex and require large amounts of computer storage for the large design sets. Because of their complexity and slow speed, their use is generally reserved for off-line classification. This would not be useful for infant cry analysis. To realize the nonparametric benefits and avoid the complexity and speed problems, the size of the design set needed to be reduced. A new method was developed to accomplish the required nonparametric data reduction based on vector quantization (VQ) [Xie, 1993; Xie et al., 1993b].

VQ is a minimum distance mapping that assigns a reproduction vector drawn from a code book to each input vector. It is a popular technique which has been used for many years in digital signal processing, communication and speech recognition [Deller et al., 1993; Makhoul et al., 1985; Rabiner, 1989]. Xie combined VQ for the first time with both the Parzen's kernel and kNN methods. These were named the VQ-kernel and VQ-kNN nonparametric methods, respectively.

In comparison to other nonparametric data reduction algorithms, the new VQ-based methods gave significantly better results in terms of classification accuracy and data reduction rate. This not only led to a system which was significantly less computationally intensive as a result of reduced data sets, but the reduction rate was controllable and did not heavily influence the classification accuracy. The VQ-kernel also seemed to outperform the VQ-kNN method and was selected for use in Xie's cry analyzer system.

3.3.1 VQ-Kernel Automatic Cry Analyzer

The VQ-kernel automatic cry analyzer was divided into three main parts: preprocessing/feature extraction, VQ-kernel classification and H-value calculation.
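The minimum-distance mapping that defines VQ can be sketched in a few lines of Python. This is only an illustration of the nearest-code-vector idea, with a toy two-dimensional code book invented for the example; it is not Xie's VQ-kernel algorithm, and it uses squared Euclidean distance as an assumed distance measure.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(vector, code_book):
    """Return the index of the code-book vector closest to `vector`."""
    return min(
        range(len(code_book)),
        key=lambda k: squared_distance(vector, code_book[k]),
    )

# Toy code book (hypothetical values, two features per vector).
code_book = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
inputs = [(0.2, -0.1), (0.9, 1.3), (3.0, 0.5)]
indices = [quantize(v, code_book) for v in inputs]
print(indices)  # [0, 1, 2]
```

Replacing every design vector by the index of its nearest code vector is what allows a large design set to be represented by a much smaller code book.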
Each cry sample was first input to the preprocessing/feature extraction segment where it was digitized at 12 bits and 8 kHz and pre-emphasis filtered with a $1 - 0.95z^{-1}$ filter to remove any DC component as well as lip radiation [Deller et al., 1993]. Recall from section 2.3 that the lip radiation can significantly affect signal processing if it is left in the signal. The filtered signal was then segmented using a 32 msec (256 point) Hamming window. From each 256 point window, 24 elements were selected for the feature vector. In particular, those 24 elements consisted of the zero-crossing rate, energy, first ten LPC-derived cepstrum coefficients, and the temporal derivatives of the twelve elements already mentioned. The output of this section was then a 24-element feature vector.

The second part of this system was the VQ-kernel classifier. This segment was trained to recognize one of two types of vectors, H-type or E-type. H-type vectors are defined as vectors representing trailing, double harmonic break, dysphonation, hyperphonation or inhalation cry-words. E-type vectors represent the five remaining cry-words. Tests showed that an original design set of some 18,000 vectors could be reduced to a code book of only 256 vectors using this algorithm. When the 24-element feature vector from the preprocessing/feature extraction stage was passed to the classifier, it was assigned one of the 256 code book vectors, which was either of H-type or E-type.

The final segment of the analyzer was the calculation of the H-value. From equation 3.1 and the definition of H-type vectors we see that the H-value is equal to the sum of the durations of the H-type vectors compared to the total time. In other words, an estimate of the H-value was given by

$$\text{H-value} = \frac{\text{Number of H-type vectors}}{\text{Total number of non-silence vectors}} \tag{3.2}$$

3.3.2 VQ-Kernel Results

When the VQ-kernel based analyzer was tested, results were very promising.
It was observed that estimating the H-value using the VQ-based classifier provided a mean absolute error of 14% when compared with the H-values computed manually earlier. A strong correlation was also observed when the parents' LOD ratings were plotted versus the VQ-classifier based H-value estimates. This not only showed the usefulness of the VQ-based classifier, but it provided further evidence that cry-words and the H-value were a useful means for infant cry recognition.

3.4 HIDDEN-MARKOV MODEL

In recent years the hidden-Markov-model (HMM) has proven very useful for speech recognition [Rabiner, 1989]. One of its biggest advantages is that it is capable of dealing with both temporal and spectral pattern variations of a signal. Since infant crying is a type of human utterance and can be modelled in a similar fashion, Xie decided to base his second automatic cry analyzer on the HMM.

In the simplest sense, an HMM is a, "...'stochastic finite automaton' - a type of abstract 'machine' - used to model a speech utterance." [Deller et al., 1993]. An HMM can be viewed as a collection of states connected by transitions. For each possible transition there exists a transitional probability indicating the likelihood of a transition between state i and state j. The HMM, in other words, bases recognition on structural information of the different time-frequency patterns rather than the statistical information used with the VQ-classifier method. A more thorough description of the HMM can be found in [Rabiner, 1989; Xie, 1993].

3.4.1 HMM Automatic Cry Analyzer

Xie's HMM system was divided into five steps. As was the case with the VQ-kernel classifier, the first step was the data preprocessing/feature extraction, which took the analog signal and converted it into a long sequence of feature vectors. The signal was first lowpass filtered and digitized at 8 kHz, followed by a $1 - 0.95z^{-1}$ pre-emphasis filter to remove lip radiation effects.
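The front-end operations common to both analyzers (pre-emphasis, Hamming-windowed framing, short-time energy and zero-crossing rate) can be sketched in Python as follows. The 256-point window and the 0.95 coefficient are the values quoted in the text; the 80-sample step corresponds to the 10 msec slide at 8 kHz. The test tone, the function names and the particular energy and zero-crossing definitions are assumptions of this sketch, and the LPC-derived cepstrum coefficients are deliberately left out.

```python
import math

FS = 8000  # sampling rate in Hz

def pre_emphasis(x, alpha=0.95):
    """Apply y[n] = x[n] - alpha * x[n-1], i.e. the filter 1 - alpha z^-1."""
    return [x[n] - (alpha * x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

def hamming(length):
    """Standard Hamming window coefficients."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def frames(x, length=256, step=80):
    """Yield windowed frames: 256 points = 32 ms, 80 points = 10 ms at 8 kHz."""
    window = hamming(length)
    for start in range(0, len(x) - length + 1, step):
        yield [x[start + n] * window[n] for n in range(length)]

def short_time_energy(frame):
    """Sum of squared samples in the frame (one common definition)."""
    return sum(v * v for v in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0.0) != (b >= 0.0))
    return crossings / (len(frame) - 1)

# Hypothetical input: 0.5 s of a 500 Hz tone in place of a recorded cry.
signal = [math.sin(2.0 * math.pi * 500.0 * n / FS) for n in range(FS // 2)]
features = [(short_time_energy(f), zero_crossing_rate(f))
            for f in frames(pre_emphasis(signal))]
```

Each entry of `features` is one two-element slice of what would be a twelve-element (or, with time derivatives, 24-element) feature vector in the systems described here.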
A 32 msec (256 point) sliding Hamming window was again used to divide the signal into frames, with a sliding step of 10 msec. The main difference in the preprocessing was that only twelve features were extracted: short-time energy, zero-crossing rate and the first ten cepstrum coefficients. The twelve time-derivatives used in the VQ-kernel system were not needed here.

The second step in the HMM system was vector quantization (VQ) of the feature vectors. VQ quantized a block of data at a time to an element in a prescribed code book. VQ not only reduced data redundancy as a result, but it also allowed discrete HMMs to be used. Since speech and crying are both smooth, continuous processes and not a series of states, this was an important step if discrete HMMs were to be used. The Linde-Buzo-Gray algorithm was used for the quantization [Deller et al., 1993; Xie, 1993]. Output from this stage was a sequence of code-word indices.

Unfortunately for infant crying, a clear hierarchical structure does not exist to aid in recognition. With speech, spoken sentences provide a solid starting point for feature isolation. Sentences can then be broken into phrases, phrases into words, and words into phonemes. Without this hierarchical structure for infant crying, recognition is a much more arduous task because it is difficult to isolate cry-words. Xie proposed as a third step to further segment the code-word sequence into "micro-segments" to avoid this isolation problem. Given that an average cry-word lasts 1 second and the sliding window shifts in 10 msec increments, about 100 code-word indices were expected for a typical cry-word. A single micro-segment contained ten of these code-word indices so that an average cry-word would be represented by ten micro-segments. It also meant that even the shortest cry-words, of duration about 0.5 second, would be represented by five micro-segments.
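The micro-segment arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and dropping an incomplete final block is an assumption of this sketch rather than something stated in the source.

```python
def micro_segments(indices, size=10):
    """Split a code-word index sequence into consecutive blocks of `size`.

    A trailing block shorter than `size` is dropped (assumed behaviour).
    """
    return [indices[i:i + size]
            for i in range(0, len(indices) - size + 1, size)]

def expected_segments(duration_s, step_s=0.010, size=10):
    """Approximate micro-segment count for a cry-word of given duration."""
    return round(duration_s / step_s) // size

codes = list(range(100))           # stand-in for about 1 s of code-word indices
segments = micro_segments(codes)
print(len(segments))               # 10
print(expected_segments(0.5))      # 5
```

With a 10 msec step and ten indices per block, each micro-segment spans roughly 100 msec of signal, which is why a single misclassified segment at a cry-word boundary has little effect on the overall count.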
When a boundary between cry-words was encountered, the micro-segment might be deemed unrecognized, but because of the large number of micro-segments this would not affect results significantly. These micro-segments then became the basic recognition units in the HMM-based recognizer.

The steps mentioned to this point served to prepare the data for the HMM. The actual HMM began with a simple 3-state left-to-right topology typical for speech models. This structure was repeated ten times, once for each cry-word. Using the appropriate data, each of the ten HMMs was then trained using the forward-backward algorithm [Xie, 1993; Deller et al., 1993]. The HMMs which were trained with the five H-type cry-words used for the calculation of the H-value were then grouped in parallel into one large HMM called the H-HMM. Similarly, the five remaining HMMs comprised another large HMM called the E-HMM. Both of these large HMMs were then combined to form the HMM-based classifier as shown in Figure 3.2.

[Figure 3.2: The HMM-based classifier as shown in [Xie, 1993]. A micro-segment S from the segmentation stage is passed to the H-HMM and E-HMM probability computations, giving P(S|H-HMM) and P(S|E-HMM); the two probabilities are compared (H or E?) and the decision is passed to the H-value calculation.]

The micro-segments from the previous step were next applied to the two HMMs. Depending on which structure had the higher probability of occurrence at the output, each micro-segment was classified as either an H-type or an E-type micro-segment. The final step was then to calculate an estimate for the H-value. This was simply the number of recognized H-type micro-segments divided by the total number of non-silence micro-segments.

3.4.2 HMM Recognition Results

As with the VQ-kernel based recognition system, a set of 58 cries was tested by the HMM-based recognizer and estimates for the H-value were obtained for each.
When compared with the manual, off-line H-value estimates, an absolute error of 12.9% was observed. This helped demonstrate the validity of using the HMM-based system for estimating the H-value. When compared to the parents' perceived LOD there was again a strong positive correlation, i.e. high calculated H-values corresponded to high parental LOD ratings and low H-value estimates corresponded to low parental LOD ratings. This provided a further indication that the calculation of the H-value was a meaningful approach for human-like analysis of infant cries.

Comparing the VQ-kernel results with the HMM results, we find that the VQ-kernel seemed to provide better classification for high LOD ratings (LOD ≥ 3). The HMM, however, was better for low LOD ratings and overall slightly outperformed the VQ-kernel method (12.9% error instead of 14.0%). Xie believed the better performance of the HMM was, "...due to the more effective exploitation of the sequential (structural) information of the time-frequency characteristics in the cry sound." [Xie, 1993, pg. 120]

Chapter 4 DIRECTION OF STUDY

4.1 XIE'S CONTRIBUTIONS

The previous chapter provided an overview of the work accomplished by Xie in the early 1990s. The motivation for his work was to develop a systematic approach for cry analysis which earlier work had been unable to provide. Using information known about infant crying at the time, Xie first determined the particular difficulties which might hinder the use of an automated analysis method. To overcome these obstacles he then modified the existing methodologies and introduced modern signal analysis approaches which had proven useful in speech recognition. The result was a platform for the automatic analysis of infant cry signals based on the recognition of cry-words with distinct time-frequency characteristics. In developing this platform, Xie made many contributions to infant cry research.
Most notable were the established set of ten cry-words and the introduction of the H-value as a single, quantitative value indicating the LOD of an infant which provided humanlike assessment of cries. But while Xie was able to prove the concept of the H-value, his work finished before he could further explore other facets of its use, leaving plenty of room for future work.

4.2 UNOPENED DOORS

As Xie states, "By any standard, the infant cry research is still at the infancy stage itself!" [Xie, 1993, pg. 16] While he was referring to infant cry research in general, this clever play on words relates to his proposed systematic approach for the study of infant cries, as well. Though there is evidence that this platform provides a viable approach for assessing infant cries with humanlike capabilities, a great deal of work remains. If this method is to become universally accepted and used, further tests will need to be performed and other avenues studied. While many possibilities exist for further tests, we can identify four extensions immediately.

1. Additional features
Xie's approach for determining/recognizing each cry-word made use of many features including zero-crossing rate, short-time energy and cepstrum coefficients. However, many other features were not examined, such as fundamental frequency (F0), formant frequencies and transitional cepstrum coefficients, which might allow the H-value to be computed more easily or efficiently. The use of F0, in particular, might be worthwhile to examine as the majority of cry-word definitions from section 3.1 made some reference to F0.

2. Larger data set
An infant's cry changes dramatically over the first few weeks and months of life as the vocal and respiratory systems grow and develop. As a result there are also changes to the characteristics of the cries (e.g. the F0 will change). To control his experiments Xie used a limited data set of cries from 36 healthy neonates.
It was then proposed that the results obtained for this group could be generalized to include all other infants. At some point, larger and different data sets will need to be examined, encompassing infants of different ages, both healthy and unhealthy, and from more than 36 neonates, to verify this generalization.

3. New techniques

The HMM and VQ-kernel statistical classifier methods were selected for recognition because they had been useful for speech recognition and/or other areas of signal processing. While both methods provided respectable results, they may not be the best methods to use for this approach. In the future, other techniques which have shown promise in signal processing might also be applied, including neural networks, wavelets, fuzzy logic or instantaneous frequency representation theories.

4. Practical, easy to use tool

While Xie developed a novel approach for infant cry recognition, he did not provide a practical tool to simply and efficiently compute the H-value. Though the HMM and VQ-kernel classifier are powerful recognition techniques, the manner in which they were used in this application was complicated and clumsy. If this method is to receive acceptance in a very interdisciplinary field, an easy to use tool will need to be developed to allow any user to find the H-value of a cry without knowledge of the HMM, VQ-kernel or whatever recognition scheme is used. This tool should also be able to calculate the H-value accurately and quickly.

4.3 MOTIVATION FOR DIRECTION OF STUDY

Though only four "unopened doors" for further research were discussed in the previous section, many others exist. These particular four were mentioned because they were obvious and could provide immediate benefits. With time, as improvements to the approach are discovered, other areas of research using this method will undoubtedly develop. How successful it becomes depends on what improvements are made.
Fortunately the four possible areas for refinement mentioned above did not have to be investigated independently. The ultimate goal of our research was to expand on the existing method and develop a practical tool to compute the H-value. In doing so, additional features and new techniques were examined in an attempt to recognize the cry-words simply and efficiently. A larger data set was not used, however. The same data used by Xie was again used to provide some basis for comparison. As with Xie's work, if it can be demonstrated that the improved system works effectively for this set of neonates, it will be assumed that the same methodology could form a basis for recognition for infants of other ages, as well.

A key assumption in developing the tool was to accept the ten cry-words proposed by Xie as a complete set defining all types of cries. This provided a starting point for this research and maintained a consistency with Xie's work. Since the goal of this work was to develop a practical tool, re-definition of the H-value was not examined. Once again this allowed for a comparison with Xie's work and further justified the use of Xie's established cry-word set which was used for the H-value definition.

With the developments of Xie acting as our starting point, our approach was based on the following observation. In Chapter 3, Figure 3.1 showed spectrograms of the different cry-words to clarify the differences between modes visually. It should be obvious from the figure that the cry types are distinguishable from observation of the spectrogram. If the cries can be detected visually, they should be able to be detected with the computer, as well. With this in mind, the focus of our project was to discover how to detect the different cry-words from the spectrogram. To accomplish this, no single recognition technique was employed.
Instead, a more eclectic recognition scheme was utilized making use of the FFT, cepstral analysis for F0 estimation, modified power spectral density, other correlation methods and simple decision rules. While one combination of techniques could be used to extract data for recognition of one type of cry-word, a different combination may be needed to recognize another. The different combinations could then be used in unison so that one system can recognize all cry-words.

4.4 SPECTROGRAMS

As discussed in Chapter 2, the Scandinavian teams headed by Wasz-Hockert in the 1960's made great advances in the understanding of infant cries through in-depth studies of the spectrograph. As a result, their work helped popularize the use of the spectrograph in the field of cry research. In the nearly 40 years that followed, it has remained one of the most widely used tools for cry analysis. For this research, the spectrograph was again used as an effective method for studying the infant cry, this time as an integral part of the development of the practical automatic recognition tool.

For all spectrographs, the input is a signal represented as amplitude versus time. In most cases this signal is a speech sample or, in this case, an infant cry. The purpose of the spectrogram is then to display the signal amplitudes in various bands of frequencies using a fixed filter bank for the duration of the signal. The resulting output or spectrogram is then a 2-D representation of the signal in frequency versus time.

In the earliest years, spectrograms were obtained mechanically by repeatedly playing a recorded sample through a variable band pass filter (BPF) [Flanagan, 1965]. The output was a physical representation of the spectrogram, often on paper. The early spectrographs were very slow, however, and did not see widespread use in the medical field. In addition to the time issue, systems suffered from a poor dynamic range and an inadequate frequency range.
Using a narrowband, 50 Hz BPF helped the resolution problem somewhat, but this further slowed things down. Another difficulty concerned the interpretation of the data. Not only did it require visual inspection, which was a slow, tedious process, but a certain expertise was needed to read the spectrograms.

In recent years, computers have enabled researchers to obtain spectrograms much more efficiently. Now, instead of using a physical bank of filters, spectrograms can be computed using the short-time Fourier transform (STFT), which essentially acts as the bank of filters. In using the STFT, it is assumed that the input signal $x(t)$ will be short-time stationary over the duration of a small window $g(t)$. The definition of the STFT is then given as

$$ \mathrm{STFT}(\tau, f) = \int x(t)\, g^{*}(t - \tau)\, e^{-j 2 \pi f t}\, dt \tag{4.1} $$

where $g^{*}(t - \tau)$ refers to the complex conjugate of the window. (Our work has used a 32 msec Hamming window.) Once the STFT is computed, the spectrogram is then simply defined as the squared modulus of the result, or

$$ \mathrm{SPEC}(\tau, f) = \left| \mathrm{STFT}(\tau, f) \right|^{2} \tag{4.2} $$

The newer systems offer the benefits of improved frequency range, greater efficiency and shorter computation time, and most importantly they allow the spectrogram to be determined using computer-based signal processing. One advantage of this is that, in comparison with the earlier systems, no complicated hardware is required, just a computer. This also makes possible a spectrogram which can be interpreted by the computer, allowing non-experts to use and interpret results.

An important consideration when using the spectrogram is to realize that because it is defined as the squared modulus of the STFT there will be no phase information. For the most part this will not cause difficulties as the majority of relevant information is contained in the signal magnitude.
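A minimal NumPy sketch of the discrete form of the STFT and spectrogram definitions may make the computation concrete. The 32 msec Hamming window is taken from the text, and the 8 kHz sampling rate is inferred from the 4 kHz Nyquist frequency quoted for Figure 4.1. The function name, the non-overlapping hop and the synthetic 500 Hz test tone are illustrative assumptions, not details of the thesis software.

```python
import numpy as np

def spectrogram(x, fs=8000, win_ms=32.0, hop=None):
    """Squared-magnitude STFT of a 1-D signal x sampled at fs Hz."""
    n = int(round(fs * win_ms / 1000.0))    # 32 ms at 8 kHz -> 256 samples
    hop = n if hop is None else hop         # assumption: non-overlapping frames
    window = np.hamming(n)                  # real window, so g* equals g
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[i * hop:i * hop + n] * window for i in range(n_frames)])
    stft = np.fft.rfft(frames, axis=1)      # one spectrum per windowed frame
    spec = np.abs(stft) ** 2                # squared modulus: phase is discarded
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)  # 0 Hz up to the Nyquist frequency
    times = (np.arange(n_frames) * hop + n / 2) / fs
    return times, freqs, spec

# Usage on a synthetic 500 Hz tone (a stand-in for a real cry recording).
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 500 * t)
times, freqs, spec = spectrogram(tone, fs)
peak_hz = freqs[spec.mean(axis=0).argmax()]  # strongest frequency bin on average
```

With these settings each frame holds 256 samples and the frequency bins are 31.25 Hz apart, so the time and frequency resolution trade off through the window length alone.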
If for some applications it would be beneficial to include the phase, computer programs would only require minor modifications to compute the phase as well.

Figure 4.1 provides an example of a spectrogram. (Figure 3.1 also provides spectrograms of the ten cry-words.) As with most spectrograms, the horizontal axis in this figure represents the time while the vertical axis is frequency. Normally the frequencies are shown from 0 to the Nyquist frequency, which was 4 kHz for the sample shown here. Simply observing the signal provides limited information, but a number of features are immediately obvious from observation of the spectrogram, including the melody pattern (i.e. rising, flat, or falling) and a quick estimate of the F0. The spectrogram in Figure 4.1 clearly contains silence near t=0 followed by segments of rising, dysphonation, DHB, flat and falling.

[Figure 4.1 Spectrogram for a typical sample of an infant cry. Horizontal axis: time (s), 0 to 1; vertical axis: frequency, 0 to 4 kHz.]

4.5 WAVELETS

In recent years another method for decomposing signals based on wavelets has become very popular. Many consider wavelet analysis an extension and improvement to Fourier analysis. Like the STFT, the wavelet transform (WT) maps a time function into a 2-dimensional function, but with wavelets the mapping is scale versus time instead of frequency versus time. Unlike the STFT, however, the WT uses shorter windows for higher frequencies and longer windows for lower frequencies while the STFT uses a constant window size for all frequencies. The result is an analysis which is localized in both time and frequency; better time resolution is achieved at higher frequencies and better frequency resolution is obtained at the lower frequencies. Another difference between analysis methods is that the basis functions for the STFT are sines and cosines which are of infinite duration.
With the WT, the wavelet basis functions all have finite durations, making them better for detecting sharp discontinuities. Since speech and infant cries are both short-time transient signals, it appears that wavelets more closely resemble the signals in question and would therefore offer a better representation.

The fundamental idea of using wavelets is to analyze signals according to scale. As mentioned earlier, various window sizes are used depending on the signal's local frequency content. Analysis is still based on the idea of BPFs but this time there is a constant relative bandwidth. Wavelet analysis is therefore often considered a constant-Q analysis. This allows the user "... to see the woods and the trees." [Vetterli and Herley, 1992], which is definitely an advantage in most signal processing applications.

While wavelets are relatively new to the field of signal processing, they have received much attention in recent years. Some applications for which they have proven particularly useful include signal analysis and synthesis, denoising of noisy data [Graps, 1995], signal compression [Mallat, 1989] and F0 detection [Kadambe and Boudreaux-Bartels, 1992; de Bruin and du Preez, 1994]. The WT has also been used for speech analysis but is really just beginning to see use for speech recognition. Studies by Favero indicate that the WT can be a successful front end preprocessor for speech recognition [Favero, 1994], but widespread use of wavelets for speech recognition has not yet been observed.

With limited results of wavelets for speech recognition, it is not surprising that wavelets have not yet been applied to infant cry recognition. In the early stages of this research both 1-dimensional and 2-dimensional wavelets were experimented with in hopes they would help in the development of the proposed tool because of their benefits and potential abilities.
However, with the complexity of the infant cry, the selection of appropriate scale-time features was difficult and little information was obtained. Another major difficulty was the selection of the best mother wavelet, of which there are many. Most likely a new one will have to be created, although this assumption will obviously require further testing.

The use of wavelets was abandoned for other methods for this research, but not because wavelets will not be useful for infant cry research in the future. Just as the HMM first proved useful for speech recognition before Xie applied it to infant cry recognition, work with the WT will first need to be completed on speech recognition. With a better understanding of speech, and because of linguistic constraints which offer an advantage for speech recognition, the application of wavelets to speech processing should be more obvious initially. Once the proven use of wavelets becomes established for this purpose, it will be a natural extension to apply wavelets to infant cry research as well.

4.6 SUMMARY

With the knowledge gained through Xie's work, the next step was to develop the actual method of data extraction and recognition for the new automatic infant cry recognizer. As mentioned in section 4.3, the development of the tool makes use of a variety of methods and features. The next chapter will provide more details of how and why each method was used. In the end, a tool will be outlined for automatic infant cry recognition which will compute the H-value based on a combination of methods.

Chapter 5 DEVELOPMENT of the SYSTEM

5.1 INTRODUCTION

The introduction of a standardized approach for assessing an infant's distress level based on recognizing ten different cry-words and computing the H-value has great potential for infant monitoring and cry research.
The main objective of our work was to develop a tool which would automatically recognize and classify each cry-word simply and efficiently and determine this distress measure. Instead of using complex algorithms as used in past work, our system based recognition on an eclectic variety of methods. As a result, a detailed study of each cry type was first required to determine the best method for isolating each cry-word.

For the earlier systems developed by Xie, only a rudimentary understanding of each cry type was necessary. After developing his set of ten cry-words, Xie really did not need to further understand the detailed characteristics of each type at all. To implement the HMM, for example, his system simply computed the appropriate input parameters and recognition was based on the trained system. For the most part, the basis for recognition for the various cry types remained transparent to the user. Such was not the case for our system, however. The information gained from our detailed study has led to a system in which the basis for recognition is known. Of particular interest for this system were the F0, short-term energy (st-energy), modified power spectral density, FFT, a small amount of sequencing information and the interrelationships between parameters.

5.2 RELEVANT PARAMETERS

As mentioned in the introduction to this chapter, a detailed study of each cry type was required to determine suitable features which would allow us to distinguish between cry-words. The results of this study are presented in Table 5.1. Because of the relative size of the table and amount of information contained therein, we have called this table the "Monster". The purpose of the "Monster" was to provide a simple visual display of a number of features simultaneously so that patterns might be determined which would allow us to isolate or identify each of the cry-words.
Of particular interest were the identifiable features for the H-type cry-words since these were required for the H-value calculation. The H-type cry-words are flagged in the table with an HT (for H-Type) at the top of the respective columns. While not all of the information from the "Monster" was used, the average F0, average st-energy, modified PM information greater than 500 units, and sequencing information were all helpful for recognition and will be further discussed in the subsections that follow.

| Feature | 1 Trail | 2 Flat | 3 Falling | 4 DHB | 5 Dys. | 6 Rising | 7 Hyper. | 8 Inh. | 9 Vib. | 10 W.V. |
|---|---|---|---|---|---|---|---|---|---|---|
| F0 - Ave. | L | M | M | L | - | M | H | - | M | L/M |
| F0 - Trend | F | Lv | F | Lv | - | R | Lv | - | V | V |
| Energy - Ave. | L | H | H | M | H | H | H | M | H | M |
| Energy - Trend | F | Lv | F | Lv | F | R | Lv | F | Lv | Lv |
| Duration - Ave. | X | Lo | M | M | M | Sh | M | Sh | Lo | M |
| Mod. PM > 500 units | Y | Y | Y | Y | N | Y | Y | N | Y | Y |
| In-Between Energy | O | N | N | O | Y | N | N | Y | N | O/N |
| Energy Content | A | B | B | B | B | B | B | B | B | L |
| Sequencing Info | 3-1-S | - | - | * | - | S-6-# | - | S-8-S | - | - |

KEY: L = Low; M = Medium; H = High; F = Falling; Lv = Level; R = Rising; V = Vibrating; X = Extra Long; Lo = Long; Sh = Short; Y = Yes; N = No; O = Occasionally; A = All frequencies; B = Both high and low; S = Silence.

Table 5.1 The "Monster" - simple visual display of a number of features so patterns can be determined to identify each of the ten cry-words. (The original table places an HT flag above five of the ten columns; the column positions of those flags are not recoverable from this text.)

5.2.1 F0 and st-Energy

One of the most important features used for discriminating between the different cry-words shown in the "Monster" was the fundamental frequency, F0. Also recall from the descriptions for each cry-word in section 3.1 that nine of ten definitions made reference to the F0 in some way. (Mention of the fundamental was not given for inhalation.) A careful look at the descriptions will reveal that F0 information differs for each cry-word and thus should be very useful for recognition purposes. Hyperphonation, for instance, is characterized as having an extremely high F0 while both trailing and double harmonic break (DHB) have much lower fundamentals if the in-between harmonics are considered for DHB. Dysphonation basically has an unstructured energy pattern, as does inhalation, although one or two significant components or short sequences of hyperphonation or flat could exist during an inhalation sequence. The remaining cry-words are characterized as having distinct harmonics with little in-between energy and are defined by their trend in F0 movement, i.e. either flat, falling, rising or vibrating.

In a number of cases the descriptions also made reference to the st-energy. For example, weak vibration is similar to vibration except it has a considerably lower st-energy, while rising is defined as having both an increasing F0 and st-energy. While Xie used the st-energy as an input parameter for his systems, he did not use the F0, and the relationships between these parameters and the different cry-words were not explored. We have also seen from the "Monster" that trailing is distinct from the other cry-words because of its lower st-energy. A logical first step in developing this system was therefore to examine the values and behaviour of the F0 and the st-energy for each individual cry-word and try to determine if any unique relationships existed.

To provide an idea of the trend of F0 and st-energy relationships for each cry-word type, a number of cry samples were examined. Typically a cry sample contains a series of a few different cry-words.
To help maintain a consistency between data and to reduce the chance of obtaining uncorrelated results caused simply by examining different infants' cries, the same infant's cry sample was used for as many cry-word types as possible. For example, sample #1 may have contained sections of trailing, flat, falling, DHB, dysphonation, rising, inhalation and vibration but not of hyperphonation or weak vibration. This sample would then be used for as many of the cry types as possible, i.e. the eight it contained.

The available data set from which the samples were extracted consisted of 103 cry samples collected from 36 infants under controlled conditions. This was the same data set used by Xie. In his case, the cries were further broken into two subsets, the first containing 45 samples for training and the remaining 58 for testing. Unfortunately, while the 58 sample testing set was separated, the other set contained the entire 103 samples with no indication provided of which samples were used for training. It is therefore likely that some of the cries selected at random from the larger set and examined in this section were part of the testing set as well.

For each cry-word sample, the F0 and st-energy were calculated and recorded over various sections of the sample sequence. The F0 was measured off-line from the spectrogram for each sample at roughly the 1/4, 1/2 and 3/4 points of the sequence length in order to gather information on how the F0 changes during a particular cry-word type. The average results appear in Table 5.2 along with the average fundamental frequency estimations over the complete duration of the cry sample. The average st-energy was measured over approximately the first quarter or third of the sequence and again for a comparable length at the end of the sequence to see if a change had occurred.
The average st-energy for each of these periods was also calculated and is shown in Table 5.2 as "tot (beginning - end)" where "tot" refers to the average st-energy for the entire or total sample, "beginning" refers to the average st-energy over the first quarter or third of the sequence and "end" refers to the average st-energy over the final quarter or third.

As expected, the phonation types (flat, falling and rising) were true to their name. For these cry types the average normalized st-energy and F0 were either relatively steady for flat, decreasing for falling or increasing for rising. For vibration and weak vibration the vibrating F0 was more noticeable visually from the spectrogram than from the calculated results. Also as expected, trailing and DHB had relatively low fundamentals while hyperphonation had a very high F0.

While Table 5.2 helps convey the differences between cry-words, the most useful and interesting results were obtained for this data when the F0 was plotted against the st-energy. Using the information provided in the table, a range of possible F0 and st-energy combinations was first determined and plotted for each cry-word. The results are illustrated in Figure 5.1 with the H-type cry-words shaded for easy identification. Dysphonation and inhalation were not included in the graph since they have unstructured energy and therefore no fundamental, but are shown just below the graph so they will not be forgotten. These are both H-type cry-words, as well. From this figure it is clearly shown that the E-type words are all clustered in one tight area while the remaining H-type cry-words are distinct from each other. It should be immediately obvious that the H-type cry-words are linearly separable from the E-type using these parameters. The initial assumption drawn from the "Monster" that F0 information might be useful is certainly demonstrated in this figure.
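The "tot (beginning - end)" summary described above can be sketched as follows. This is only an illustration under stated assumptions: st-energy is taken here as the mean squared amplitude of non-overlapping 256-sample frames, the "beginning" and "end" portions are fixed at one quarter of the frames, and no attempt is made to reproduce the normalized units used in the thesis. The function names and the synthetic decaying tone are hypothetical.

```python
import numpy as np

def st_energy(x, frame=256):
    """Mean squared amplitude of each non-overlapping frame (assumed definition)."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame
    frames = x[:n_frames * frame].reshape(n_frames, frame)
    return (frames ** 2).mean(axis=1)

def energy_summary(x, frame=256, fraction=0.25):
    """Return (tot, beginning, end) average st-energy for one cry-word segment."""
    e = st_energy(x, frame)
    k = max(1, int(round(len(e) * fraction)))  # frames in the first/last portion
    return e.mean(), e[:k].mean(), e[-k:].mean()

# Usage: a synthetic tone whose amplitude decays, loosely imitating a
# "falling" energy trend (not real cry data).
fs = 8000
n = np.arange(4096)
decaying = np.linspace(1.0, 0.2, n.size) * np.sin(2 * np.pi * 500 * n / fs)
tot, beginning, end = energy_summary(decaying)
```

A falling energy trend then shows up as beginning > tot > end, which is the pattern reported for the falling and trailing rows of Table 5.2.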
| CRY TYPE | AVERAGE LENGTH (samples) | AVERAGE ENERGY tot (beginning - end) | AVERAGE F0 (Hz) 1/4 - 1/2 - 3/4 (tot) |
|---|---|---|---|
| Trailing | 9530 | 186 (244 - 138) | 319 - 241 - 173 (244) |
| Flat | 4530 | 563 (571 - 564) | 500 - 501 - 500 (500) |
| Falling | 3050 | 560 (641 - 375) | 495 - 456 - 400 (450) |
| DHB | 2210 | 479 (503 - 461) | 250 - 247 - 244 (247) |
| Dysphonation | 2960 | 555 (600 - 411) | (Unstructured energy) |
| Rising | 1265 | 561 (319 - 628) | 427 - 473 - 533 (478) |
| Hyperphonation | 2684 | 550 (544 - 485) | 1125 - 1210 - 1152 (1162) |
| Inhalation | 916 | 405 (423 - 158) | (Unstructured energy) |
| Vibration | 3680 | 635 (575 - 513) | 475 - 398 - 412 (428) |
| Weak Vibration | 3114 | 339 (272 - 296) | 415 - 404 - 435 (418) |
| Total Averages | 3394 | 483 (469 - 403) | 536 - 491 - 481 (527) |

Table 5.2 Average results from examining a number of different samples of the ten cry-words. Note that "tot" refers to the average for the total sequence while "beginning" ("end") refers to the averages over roughly the first (last) quarter or third of the sample. 1/4 - 1/2 - 3/4 refer to roughly the 1/4, 1/2 and 3/4 points of the sequence length where the F0 values were calculated.

[Figure 5.1 Plot of typical average ranges for F0 and normalized energy for the ten cry-words. Vertical axis: fundamental frequency (Hz), 100 to 1200; horizontal axis: energy (normalized units). H-type cry-words are shaded. Dysphonation and inhalation are shown below the graph; both have unstructured F0.]

Figure 5.1 shows that the ten cry-words are clearly and simply separable. However, this figure is somewhat misleading since the areas defined for each cry-word were determined from averages of data. In practice, if all data are considered, there may be some overlap between cry-word boundaries. While this overlap will not be significant, separation will not always be as distinct.
In particular, trailing data may overlap in frequency with some instances of weak vibration and on occasion with early sections of rising which follow sequences of silence. In most cases the lower st-energy can be used to help distinguish trailing from weak vibration, but there may be times when the energy values will overlap as well. Additional information will generally be required for discerning between trailing and the early stages of rising before the F0 and st-energy have risen sufficiently.

Hyperphonation will also not be quite as distinct as shown in the figure if all the data are considered. While the average F0 shown in the figure is greater than 1100 Hz, samples of hyperphonation in the 750-800 Hz range are also possible. Hyperphonation should still be separable from the other cry-words, however, since the E-type cry-words seldom have frequencies above 700 Hz.

It is also important to consider that these averages were calculated from a small set of data. By no means can this be considered an exhaustive study of average values of F0 or st-energy, nor was it meant to be. The results in Table 5.2 and Figure 5.1 simply illustrate the potential usefulness of these two parameters for recognition.

5.2.2 Duration and Sequencing Information

The next features we looked at from Table 5.1 were the average durations and sequencing information for each cry type. The duration information was also already presented in Table 5.2 under the heading Average Length (samples). With the exception of trailing, which was significantly longer, the remaining durations did not provide any useful recognition clues and thus duration information was not included in the new recognition system. However, though it was not useful for recognition, the duration information was still useful for showing that an analysis window of 256 points will generally result in a minimum of almost four non-overlapping analysis windows for any cry-word.
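The "almost four windows" figure can be checked directly from the average lengths in Table 5.2. The short calculation below simply divides each average length by the 256-point window; the dictionary and variable names are illustrative.

```python
# Average cry-word lengths in samples, copied from Table 5.2.
AVERAGE_LENGTH = {
    "Trailing": 9530, "Flat": 4530, "Falling": 3050, "DHB": 2210,
    "Dysphonation": 2960, "Rising": 1265, "Hyperphonation": 2684,
    "Inhalation": 916, "Vibration": 3680, "Weak Vibration": 3114,
}
WINDOW = 256  # analysis window length in samples

# Number of non-overlapping windows that fit in an average-length cry-word.
windows_per_word = {name: n / WINDOW for name, n in AVERAGE_LENGTH.items()}
shortest = min(windows_per_word, key=windows_per_word.get)

for name, count in sorted(windows_per_word.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} {count:5.2f} windows")
```

The shortest average, inhalation at 916 samples, gives 916 / 256, or about 3.6 windows, and the next shortest, rising at 1265 samples, gives just under five.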
Again the values for length in Table 5.2 are averaged results, so in practice there will be durations of cry-word occurrences which are shorter than the averages, particularly for dysphonation and for DHB and flat, where short sequences are often present in between sequences of dysphonation. For the most part, however, the durations will be significantly larger than 256 samples.

Sequencing information was another parameter that was examined during data collection, but only to a limited extent. With a few exceptions, there did not appear to be any clear sequences that would substantially aid in cry-word type recognition. One exception was for DHB. DHB is characterized as a sudden series of in-between harmonics. As a result, the F0 will suddenly halve from one frame to the next when DHB occurs. When used with the low F0 value as discussed in section 5.2.1, this allowed for easy identification of DHB. Another exception involved inhalation. In most, but not all, cases inhalation was separated on both sides by silence. Trailing and a few other sequences were also occasionally separated by silence, albeit much more infrequently. The majority of the time trailing followed falling and led into silence. From the cry-word descriptions in Chapter 3 we see that trailing usually occurs at the end of a long, exhaustive expiration so this falling-trailing-silence sequence is understandable. The final sequencing observation was that on occasion rising occurred after silence, although it was noticeable in other situations as well.

Though not a great deal of attention was given to sequencing information, these few observations were useful for decision making purposes. It might be appropriate in later research to further pursue this avenue.

5.2.3 Voiced Versus Unvoiced

In the cry-word descriptions in Chapter 3, eight of ten cry-words were defined as having clearly observable harmonics with little in-between energy (see Figure 3.1).
A convenient representation of energy is the squared magnitude of the Fourier transform components. If, from the spectrogram, the instantaneous squared magnitude of the FFT was plotted for a particular time for one of these eight cry-words, the resulting squared magnitude versus frequency plot was periodic in nature. It was already seen in Figure 5.1 that the two remaining cry-words, namely dysphonation and inhalation, have unstructured energy and therefore no real fundamental frequency. The instantaneous squared magnitude of the FFT for a particular time from the spectrogram for these unvoiced cry-words would simply appear as random white noise.

To illustrate the differences between voiced and unvoiced cry types, we plot spectrograms for typical samples of voiced and unvoiced cry sequences in Figure 5.2. The upper plots in this figure display the spectrograms of the respective samples while the lower plots display the instantaneous squared magnitudes of the FFTs at time t=0.05 seconds. Taking advantage of the differences shown in this figure, one method for distinguishing these unvoiced samples is to use a modified version of the power spectral density (PM) derived from the instantaneous squared magnitude of the FFT, with thresholding in two dimensions followed by some simple pattern recognition. A further description of this method is given in section 5.4.

5.2.4 Frequency of Occurrence

The frequency of occurrence of particular cry-words might also potentially be a useful recognition parameter although it was not included in the "Monster". It could possibly be used if some type of weighted decision scheme were to be implemented. While use of this parameter was not examined for our system, Xie's decompositions of the 45 cries used in his training set into their respective cry-words are shown in Table 5.3.
Most significantly from this table we see that of the 45 cries, less than 1% of the total was hyperphonation while over 50% were either flat or dysphonation, which coincides with earlier remarks that hyperphonation is very infrequent while flat is the most common E-type cry-word. With such a limited data set, however, there was no way of knowing if this was indicative of all cries, which was why this information was not incorporated into our new system. If a larger data set were available, this could also be a useful avenue for further research.

[Figure 5.2 Differences between voiced and unvoiced cry samples. The upper plots are spectrograms of typical voiced and unvoiced signals while the lower plots are the instantaneous squared magnitudes of the FFTs at time t=0.05 sec.]

| CRY-WORD | # of µ-SEGMENTS | % of TOTAL |
|---|---|---|
| Trailing | 142 | 5.1% |
| Flat | 728 | 26.0% |
| Falling | 164 | 5.8% |
| DHB | 181 | 6.5% |
| Dysphonation | 732 | 26.1% |
| Rising | 202 | 7.2% |
| Hyperphonation | 15 | 0.5% |
| Inhalation | 346 | 12.3% |
| Vibration | 139 | 5.0% |
| Weak Vibration | 155 | 5.5% |
| Totals | 2804 | 100% |

Table 5.3 Distribution of the micro-segments used in Xie's training set (information from [Xie, 1993])

5.3 FUNDAMENTAL FREQUENCY DETECTION and ESTIMATION

The F0 is probably the single most important feature required for cry recognition with our new method. Accurate detection of the F0 was therefore essential for a quality automatic recognition system. F0 detection has received considerable attention in the field of speech processing and recognition, and to a lesser extent for infant cry analysis. The major difference between adult speech F0 and infant F0 detection is the range of frequencies under consideration. The average male adult has an F0 in the range 50-250 Hz and an adult female generally has an F0 between 120-500 Hz [Deller et al., 1993].
An infant's cry, on the other hand, has a considerably higher fundamental, typically between 250 and 1500 Hz. With a few minor modifications to account for the higher fundamentals, the same techniques used successfully for adult F0 detection can also be applied to infant cry F0 detection. Because the fundamental frequency can vary significantly during an expiration, the F0 is usually determined for short, windowed utterances which are considered short-time stationary. The F0 is then tracked as the window slides to another section.

While many methods exist for the detection and estimation of the fundamental frequency for each window, the easiest method simply uses the short-time Fourier transform (STFT) of the signal. The STFT effectively acts as a filter bank and breaks the signal into its frequency components from 0 to the Nyquist frequency. Peak detection is then used to determine the location of the first spectral spike or maximum, which is defined as the fundamental. Since the fundamental is repeated in harmonics at multiples of the F0, an alternative method is to determine the average spectral distance between a number of peaks. Though the STFT is conceptually the simplest method for fundamental detection and the easiest to implement, it is seldom the best. If the signals are not perfectly "clean", automatic peak detection can be difficult due to in-between harmonics and pitch fluctuations, in addition to the difficulties resulting from corruption by noise. One way to improve resolution and allow a more accurate estimation of the F0, which has proven very useful in speech processing, is to use cepstral analysis.

To provide some context for the use of cepstral analysis, it is worthwhile to discuss the motivation for its development. To do so requires mention of Fourier analysis, which has been very useful for many signal processing applications.
The basic concept of Fourier analysis is that all stationary signals can be approximated by a series of sinusoids of differing frequencies. If a signal is a linear function of some components, e.g. composed of additive components each with different frequencies, the frequency spectrum will provide a representation of the signal with the linearly combined components separated in the spectrum. As an example, a low-frequency signal, $s_1(n)$, may be corrupted by noise, $w(n)$, whose frequencies are higher than those of $s_1(n)$, so that

$$s(n) = s_1(n) + w(n) \tag{5.1}$$

Observation of the spectrum of $s(n)$ will show that the two signal components are in fact separable, i.e. located at different places in the spectrum. Eliminating the high frequency noise simply requires low-pass filtering of the signal or the subtraction of the high frequency components from the spectrum. Unfortunately, not all signals are comprised of additive components which are clearly separable. In some cases there will be significant overlap of frequencies, which limits the usefulness of Fourier analysis.

An additional complexity with speech and infant cry signals is that the signal $s(n)$ does not solely include additive components. Looking at one speech production model [Deller et al., 1993] and extending the model to infant cry generation as shown in Chapter 2, we see that voiced (periodic) signals are composed of an excitation sequence, $e(n)$, convolved with the vocal system impulse response, $\theta(n)$:

$$s(n) = e(n) * \theta(n) \tag{5.2}$$

Unlike the previous signal, the components of this signal are no longer linearly combined, so direct application of the Fourier transform (FT) will not be useful for separating $e(n)$ from $\theta(n)$. Cepstral analysis was designed to take this into account.
Similar to the spectrum, it will be shown that the "cepstrum" represents a transformation of a signal with two properties:

(1) Components will be separated in the cepstrum
(2) The separated components will be linearly combined in the cepstrum

Note that this method is only useful for voiced signals. No useful information will be obtained for unvoiced signals, which are modelled using a random noise generator.

5.3.1 Cepstral Analysis

In general there are two types of cepstra used in signal processing. The first type is called the complex cepstrum (CC). This is frequently used for homomorphic signal processing which "... is generally concerned with the transformation of signals combined in nonlinear ways to a linear domain in which they can be treated with conventional techniques, and then the retransformation of the results to the original nonlinear domain." [Deller et al., 1993, pg. 386] The second type of cepstrum is the real cepstrum (RC), which is computed from the magnitude of the spectrum alone and therefore contains no phase information. As mentioned earlier, use of phase information is seldom required for speech processing since the majority of information is contained in the signal magnitude. For this reason, the RC is most often used for speech processing and was used for our system as well. It should be noted that due to the loss of phase information the RC is not homomorphic, i.e. it does not form a "round trip" and cannot be returned to the nonlinear domain.

The RC is a powerful tool for speech processing because it is able to separate the convolved components into linearly separated entities. To do so, the FT is used in conjunction with basic laws of signal processing and mathematics. Given the signal in equation 5.2, the first useful law is that convolution in the time domain is equivalent to multiplication in the frequency domain, as shown in (5.3).
$$s(n) = e(n) * \theta(n) \iff S(\omega) = E(\omega)\,\Theta(\omega) \tag{5.3}$$

Taking the logarithm of the magnitudes we get

$$\log|S(\omega)| = \log|E(\omega)\,\Theta(\omega)| = \log|E(\omega)| + \log|\Theta(\omega)| \tag{5.4}$$

Finally, defining $\log|X(\omega)|$ as $C_x(\omega)$ we get

$$C_s(\omega) = C_e(\omega) + C_\theta(\omega) \tag{5.5}$$

Now that the new signal $C_s(\omega)$ is composed of the linear, additive components $C_e(\omega)$ and $C_\theta(\omega)$, linear techniques can again be applied to separate the components as described for the earlier case. A common practice is to apply the Fourier transform again. Since the new signal is already in the frequency domain, the new domain for the transformed signal is termed the "quefrency" domain. The resulting output is the "cepstrum" (real cepstrum in this case), which is analogous to the spectrum of the frequency domain. From the above description, the definition of the RC is given as

$$c_s(n) = \mathcal{F}^{-1}\{\log|\mathcal{F}\{s(n)\}|\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log|S(\omega)|\, e^{j\omega n}\, d\omega \tag{5.6}$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ correspond to the FT and inverse FT (IFT), respectively.

Because the F0 can vary significantly during an expiration, it is more desirable to compute the RC over shorter, windowed sequences. The short-term RC (stRC) is therefore often used, which provides a means for tracking important features of cry or speech signals, such as formant frequencies or pitch, over successive frames. For our system the stRC was computed for a 256 point (32 msec) sliding analysis window. Since the RC disregards all phase information, the phase delay of the window was lost as well. In order to eliminate aliasing, it was also important to zero pad the windowed sequence to a higher power of two when computing the stRC, usually to 512 or 1024 samples in total. A block diagram describing the computation of the stRC is given below.

[Figure 5.3 block diagram: windowed frame s(n)w(m-n) -> zero padding -> STFT -> log|.| -> inverse STFT (convert to "quefrency" domain, giving the "cepstrum") -> high-time lifter (remove formant frequency information) -> peak detection for F0 estimation]

Figure 5.3 Description of computation of stRC (borrowed from [Deller et al., 1993])

5.3.2 Pitch Detection Using the stRC

The stRC has many uses in speech and infant cry analysis including signal recognition, pitch detection and formant detection [Deller et al., 1993; Noll, 1967]. For our work its main use was for pitch period detection, from which the fundamental frequency can be determined. The stRC was first used for speech processing in 1967 [Noll, 1967]. Though the stRC had been used in previous years for other applications, Noll was the first to realize that it would be useful for separating the convolved components of speech. A closer look at the stRC applied to speech reveals that the log spectrum is comprised of two parts: a slowly varying part or envelope associated with the vocal tract, $|\Theta(\omega)|$, and a fast varying part from the excitation signal, $|E(\omega)|$. When the IFT is performed, the cepstrum will contain the $c_\theta(n)$ component at low quefrencies and the $c_e(n)$ component at high quefrencies. Isolating one component simply requires low-time or high-time "liftering", which is analogous to filtering but is applied in the quefrency domain. A low-time lifter isolates the characteristics of the vocal tract, which is useful for determining the formant frequencies. A high-time lifter isolates the excitation signal. For this high-time case, "rahmonics" will be visible, which are essentially the harmonic components of $C_s(\omega)$ resulting from the periodicity of the excitation signal. These rahmonics are repeated at the pitch period, P, which is detected in order to estimate the F0. While Noll's work was specifically concerned with speech, the same characteristics can also be expected for infant cries. To estimate the fundamental frequency using the stRC, the pitch period must first be determined from the high-time liftered case.
Figure 5.4 illustrates a typical output of a cry sample showing the cry signal, stRC, and high-time liftered stRC isolating the excitation component.

Figure 5.4 Illustration of cry signal, stRC and high-time liftered stRC.

Once the excitation component has been isolated, the pitch period can be determined in two ways. The method most often used, as well as the easiest, is simply to determine where the first peak is located. In most cases this will be well separated from the low-quefrency signal. As an alternative, the second method determines the pitch period using the distance between successive peaks. This method is more prone to error, however, due to the continual attenuation of successive peaks. The first method was chosen for the F0 detection routine of our system. After the pitch period has been determined, the fundamental is calculated simply by inverting the pitch period. For example, the normalized pitch period in Figure 5.4 was 15 norm-sec (samples), which is equivalent to an actual pitch period of roughly 1.9 msec when divided by the sampling frequency of 8 kHz. The F0 is then the inverse of the pitch period, or 533 Hz.

To test this method for use with infant cries, the F0 was calculated for roughly 40 windowed samples of each of the voiced cry-words. For frequencies up to 727 Hz, the cepstral method correctly estimated the F0 to within about 5% over 95% of the time. While the above method is accurate in most cases, a unique problem with infant cries results from the higher values of F0 (those greater than about 700 Hz), which translate to much shorter pitch periods. It is possible that the first peak, and thus the pitch period, could be buried in the low-quefrency formant information. To help alleviate this problem, an additional method was required for detection of frequencies greater than 727 Hz, in particular for the hyperphonation cry-words.
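The cepstral pitch estimate described above can be illustrated with a minimal sketch in Python. It is not the thesis implementation: the function names, the naive DFT, the search range of 11 to 80 samples (roughly 727 Hz down to 100 Hz at 8 kHz) and the synthetic test frame are all assumptions made for illustration. The sketch also picks the largest cepstral value in the search range rather than the first peak, which is a simplification of the first-peak rule chosen in the text.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) DFT; adequate for a single short analysis frame."""
    n = len(x)
    twiddle = [cmath.exp(-2j * math.pi * i / n) for i in range(n)]
    return [sum(x[k] * twiddle[(m * k) % n] for k in range(n)) for m in range(n)]


def real_cepstrum(frame, n_fft=512):
    """Real cepstrum of one frame, zero padded to n_fft samples."""
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    # The small offset guards against log(0) at exact spectral nulls.
    log_mag = [math.log(abs(v) + 1e-12) for v in dft(padded)]
    # log_mag is real and even for a real frame, so its inverse DFT
    # equals its forward DFT divided by n_fft.
    return [v.real / n_fft for v in dft(log_mag)]


def estimate_f0(frame, fs=8000, min_period=11, max_period=80):
    """High-time lifter (ignore quefrencies below min_period), then
    take the strongest cepstral value as the pitch period in samples."""
    ceps = real_cepstrum(frame)
    period = max(range(min_period, max_period + 1), key=lambda q: ceps[q])
    return fs / period


def synthetic_frame(period=15, n_pulses=17, decay=0.8, resp_len=16):
    """Hypothetical test signal: a pulse train convolved with a short
    decaying response, giving 256 samples for the default arguments."""
    frame = [0.0] * ((n_pulses - 1) * period + resp_len)
    for p in range(n_pulses):
        for i in range(resp_len):
            frame[p * period + i] += decay ** i
    return frame


f0 = estimate_f0(synthetic_frame())  # pitch period of 15 samples at 8 kHz
```

With the default 15-sample period the estimate corresponds to the worked example in the text (8000 / 15, about 533 Hz). Setting `min_period=11` reflects the 727 Hz ceiling quoted for the cepstral method, since shorter periods fall into the low-quefrency region that the lifter discards.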
Detection of these higher frequencies makes use of the STFT peak detection which was discussed at the beginning of section 5.3. Since frequencies in this range are very rare, the possibly lower accuracy is not a major concern. It will be shown in a later section that precision in estimating F0 values greater than 740 Hz was not required, as long as the system accurately determines when the F0 is over this threshold value.

5.4 IMPROVED SYSTEM

The previous sections have provided the majority of the information needed for the development of the improved, more practical recognition system. Having obtained the necessary information for accurate recognition, careful thought and much deliberation were required to determine the most appropriate method for packaging this information into the most efficient system. Figure 5.5 illustrates the logical framework of the final system we have chosen. To further explain the workings of this framework, each section in the flowchart has been labelled and will be discussed individually. A more complete system flowchart is given in Figure 5.8, which is then further detailed in Figures 5.9 and 5.10.

(1) High-Pass Filter: Before the digitized cry sample actually reaches the recognition system, it is first passed through a high-pass filter. The purpose of this filter is to eliminate any DC component and/or low frequency artifact. The effects of these components are generally quite small, so a 2nd-order Butterworth filter with a cutoff of 75 Hz is sufficient for their removal. In many voice recognition systems, a $1 - 0.95z^{-1}$ pre-emphasis filter is often used instead, which has a 3 dB point between 1500 and 2000 Hz. Xie made use of this filter for his infant cry recognition systems. The choice of this filter was justified partly because it provided an accurate model of the lip radiation effects.
It also emphasized high frequencies, which helped maintain numerical stability when computing linear prediction (LP) parameters [Deller et al., 1993]. Since LP parameters were not used for our new system, numerical stability was not an issue. The emphasis of high frequencies was also not desired since a number of signals contained predominantly low frequencies.

Figure 5.5 Flowchart providing an overview of the new recognition system. Each of the sections numbered (1) through (7) will be referred to in the discussion of the system.

(2) Silence Detection: After filtering out any DC component and low frequency noise, the first classification by our system is neither for H-type nor E-type. The initial system decision is to determine whether a windowed sequence is silence or non-silence. An accurate silence detection scheme is essential for an accurate H-value estimate since, by definition, the H-value is the ratio of the total duration of H-type cry-words to the total non-silence time. If silence is not properly identified, its recognition as non-silence ultimately leads to it being recognized as unvoiced because of its lack of structure, which falsely increases the H-value.

The main method used for silence detection relies on energy information. Observation of an actual cry signal reveals that silence periods and crying sequences differ considerably in signal amplitude. To determine a suitable energy threshold for silence detection, average st-energies were measured for roughly 360 silence samples. From this data, a threshold of 85 (normalized units) was chosen as the mean plus two standard deviations. When compared to the normalized energies in Table 5.1, we see that the silence threshold of 85 is substantially lower than the average energies for any of the cry-words. The only cry-word with energies which might fall below this threshold is trailing.
As a result, consideration was given to trailing sequences which might fall below this threshold at the tail end.

In some cases, a large signal component between 100 Hz and 150 Hz caused the st-energy of a silence sequence to exceed the silence threshold. An additional method was therefore required for this special case. Because samples with low F0 values in the same range were possible, it was decided not to simply filter out this 100-150 Hz component. Instead, the sums of signal energies in the 0-500 Hz and 500-4000 Hz ranges were compared. If the majority of significant energy was below 500 Hz, the sequence was considered silence. Signals which were not silence were seldom if ever considered silence by this method, since even signals with only low frequency content typically had signal energy between 1500 and 2000 Hz.

Once an analysis sequence has been recognized as silence, regardless of which method was used for detection, the routine ceases further processing of that sequence and slides the analysis window to test the next. If three consecutive sequences are recognized as silence, the silence sequence is flagged, which is useful for later recognition decisions.

(3) Voiced/Unvoiced Detection: In section 5.2.3, it was mentioned that voiced and unvoiced signals should be distinguishable using what we called the modified power spectral density followed by thresholding and simple pattern recognition. Such a detection method is of considerable importance to the system since only dysphonation and inhalation exhibit unvoiced, and therefore unstructured, properties, and both cry-words are H-type. If a sequence is detected as unvoiced, processing of the current analysis window is complete and the cry-type is noted as H-type before moving to the next analysis frame.

At this point it is important to describe how the modified power spectral density is derived from the instantaneous squared magnitude of the FFT.
To do so first requires an understanding of the unmodified or normal power spectral density. Given an energy signal, $s(t)$, a measure of the similarity between the signal and a replica of the signal delayed by an amount $\tau$ is obtainable. This measure is termed the autocorrelation and is defined as

$$R_s(\tau) = \int_{-\infty}^{\infty} s(t)\, s^*(t-\tau)\, dt \tag{5.7}$$

where $s^*(t-\tau)$ is the delayed complex conjugate of $s(t)$. The power spectral density, $\Psi_s(f)$, is then simply defined as the FT of the autocorrelation, or

$$\Psi_s(f) = \int_{-\infty}^{\infty} R_s(\tau)\, e^{-j2\pi f\tau}\, d\tau \tag{5.8}$$

For our modified power spectral density, we have altered the signal for which the autocorrelation is measured. We have selected as our signal a cross-section of the spectrogram at a given time. The signal which we have selected, $g(\omega)$, is therefore represented as the squared modulus of the STFT at a given time versus frequency, rather than as a signal in amplitude versus time.

The modified power spectral density can therefore be summarized as follows. We begin with $s(t)$ and compute its FT as

$$S(f) = \int_{-\infty}^{\infty} s(t)\, e^{-j2\pi ft}\, dt \tag{5.9}$$

From equation 4.2 we see that the spectrogram is then the modulus squared of the STFT, which is simply a windowed version of 5.9. Taking the cross-section of the spectrogram for a given time we have

$$g(\omega) = |S(f)|^2 \tag{5.10}$$

where $\omega$ indexes frequency along the cross-section. The same approaches as in equations 5.7 and 5.8 can still be applied, but it is important to realize that the domains now differ and the final output will NOT be represented as power versus frequency. To compute the autocorrelation, in this case in frequency rather than time, we have

$$R_g(\Omega) = \int_{-\infty}^{\infty} g(\omega)\, g^*(\omega-\Omega)\, d\omega \tag{5.11}$$

Finally, the modified power spectral density, which we call $P_M$, is defined as

$$P_M(u) = \int_{-\infty}^{\infty} R_g(\Omega)\, e^{-j2\pi\Omega u}\, d\Omega \tag{5.12}$$

Our modified power spectral density is therefore a function of $u$, which is measured in scaled time units that we call modified time units. Unlike the normal power spectral density of equation 5.8, which yields a signal in power versus frequency, our modified power spectral density is a function of $P_M$ amplitude versus these modified time units.
While the modified power spectral density is by no means a standard approach, it provides a useful output with clear patterns distinguishing unvoiced from voiced cry samples. From the "Monster" we see that when the modified power spectral density (P_M) is calculated, the output will be different for voiced and unvoiced signals. If the P_M of a spectrogram cross-section for a voiced signal is calculated using MATLAB's "psd" routine, the result is a signal which exhibits a series of harmonics. For the cry samples examined for this work, the P_M generally had four or five significant peaks with the latter three decreasing in amplitude. The first peak usually occurs near 500 modified time units and subsequent peaks are repeated at multiples of the first. Conversely, when the unvoiced case is examined, the resulting P_M is not periodic. The output typically consists of one significant spike at a very low value of modified time with no other significant components. Typical outputs are presented in Figure 5.6. With this in mind, voiced and unvoiced signals can be distinguished from modified power spectral density information using simple pattern recognition methods.

The P_M test employed in this system is best described with the flowchart shown in Figure 5.7. Beside the flowchart are sketches of what the signal will look like at the various stages of the routine.

[Figure 5.6 panels, for a voiced and an unvoiced sample: spectrogram (frequency versus time in s), instantaneous |FFT|^2 (versus frequency in kHz), and Pxx (versus modified time units)]

Figure 5.6 The results of applying the modified power spectral density to both a voiced and unvoiced cry sample.

The first step for this test is to compute the spectrogram of the cry signal, which provides a 2-D representation of the signal in time versus frequency.
The cross-section of the spectrogram is then determined at a particular time, resulting in a signal in squared magnitude versus frequency. Once the cross-section has been obtained, the modified power spectral density can be determined. As shown in Figure 5.7, the calculation depends on the distribution of the energy content of the signal. It was observed that a better P_M representation was gained for signals with predominantly low frequencies (majority of energy in the 0-2 kHz range) if a shorter discrete Fourier transform (DFT) was used for the calculation. Signals with energy distributed over the entire Nyquist range have better P_M representations when computed with a longer DFT length.

Once the P_M of the signal has been calculated, the routine makes use of thresholds in both P_M amplitude and modified time. The P_M amplitude threshold first gates the P_M signal at 30% of the maximum value so that all values below the threshold are assigned a value of zero. This value was chosen after a number of tests which showed that this threshold eliminated any stray peaks in dysphonation and inhalation except those below about 500 modified time units, while leaving most of the peaks for the voiced signals. After amplitude thresholding, a peak detection routine is used to select only the modified times where peaks occur. The modified time threshold is then implemented at 465 modified time units, chosen because dysphonation and inhalation only have significant energy below this value and 465 is a convenient value for the resolution used. The final part of this routine then tests for signal energy which is greater than 30% of the maximum amplitude and also located above 465 modified time units. If only zero components are observed in this range, the signal under consideration is an unvoiced signal, i.e. dysphonation or inhalation. If any non-zero components remain, the signal is considered a voiced signal.
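The gating and peak test just described can be sketched in Python as follows. This is only an illustration under stated assumptions, not the routine used in the thesis: it computes the transform of the (circular) autocorrelation of a cross-section directly as the squared magnitude of its DFT, whereas the thesis used MATLAB's "psd" routine, and the scaling of the "modified time units" is not reproduced. The parameter `min_lag` is a hypothetical bin-index stand-in for the 465-unit threshold, and the function names are invented for this example.

```python
import cmath
import math


def modified_psd(cross_section):
    """Squared magnitude of the DFT of a spectrogram cross-section.

    By the Wiener-Khinchin relation this equals the transform of the
    circular autocorrelation of the cross-section. Only bins
    0..N/2 are returned because the input is real.
    """
    n = len(cross_section)
    out = []
    for u in range(n // 2 + 1):
        acc = sum(cross_section[k] * cmath.exp(-2j * math.pi * u * k / n)
                  for k in range(n))
        out.append(abs(acc) ** 2)
    return out


def is_unvoiced(cross_section, amp_frac=0.30, min_lag=4):
    """Return True if no gated peak survives beyond min_lag."""
    pm = modified_psd(cross_section)
    gate = amp_frac * max(pm)
    # Amplitude threshold: zero everything below 30% of the maximum.
    gated = [v if v >= gate else 0.0 for v in pm]
    # Peak detection: interior local maxima of the gated signal.
    peaks = [u for u in range(1, len(gated) - 1)
             if gated[u] > 0.0
             and gated[u] >= gated[u - 1]
             and gated[u] >= gated[u + 1]]
    # Modified-time threshold: only peaks beyond min_lag count as voiced.
    return not any(u > min_lag for u in peaks)


# Hypothetical cross-sections: a harmonic comb (voiced-like) and a
# flat, structureless spectrum (unvoiced-like).
comb = [1.0 if k % 8 == 0 else 0.0 for k in range(128)]
flat = [1.0] * 128
voiced_result = is_unvoiced(comb)    # False: harmonics give spaced peaks
unvoiced_result = is_unvoiced(flat)  # True: only the lowest bin survives
```

A regularly spaced harmonic structure in the cross-section produces peaks well away from the origin, while a structureless cross-section concentrates its energy at the lowest bins, which is the distinction the two thresholds are intended to expose.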
From the flowchart we see that this process is repeated three times, at the 1/4, 1/2 and 3/4 points of the analysis window. To avoid the possibility of simply sampling the spectrogram at a bad time, two of the three iterations must be considered unvoiced before the cry-word in the current analysis window is deemed unvoiced.

Initial testing of this routine with relatively "clean" samples of the various cry-words was quite promising. With the exception of hyperphonation, for which fewer samples were available, roughly 20-40 samples were tested for each of the other cry-types, with this routine correctly separating dysphonation and inhalation from the others 98% of the time. 100% separation is not expected since periods of hyperphonation or flat occasionally occur in the middle of an inhalation sequence.

The samples initially tested were deliberately selected to be clearly one type of cry-word or another. Further testing has revealed that this routine does not separate voiced from unvoiced quite as well for data where ambiguities exist. In particular, signals with very low fundamental frequencies, such as some occurrences of DHB or trailing, are at times considered unvoiced when their harmonics are not clearly separated. This is especially noticeable in the later stages of trailing when the F0 can be 100 Hz or less. The other cry types are generally not misinterpreted because their fundamentals are further separated, so that some in-between energy does not usually influence the test significantly. Since DHB and trailing are both H-type cry-words, and the ultimate goal is simply to compute the H-value, these misclassifications are not significant to the calculation of the H-value and can therefore be accepted.

[Figure 5.7 flowchart: cry signal -> spectrogram -> instantaneous |FFT|^2 at t = 1/4 of frame -> compute energy in 0-2 kHz and 2-4 kHz -> compute P_M with 128 point DFT or with 256 point DFT -> threshold at 30% of max -> modified time threshold near 500 modified time units -> UV(n) = 1 or UV(n) = 0]

Figure 5.7 Outline of the steps followed in the P_M test.

(4) F0 Detection: The F0 detection methods used for this system have already been discussed in detail in section 5.3. It is worth reiterating, however, that the F0 is the most important parameter for this recognition system. For the system to operate efficiently and with any precision, an accurate estimation of the F0 is imperative.

(5) Cry-Word Decisions: At this point in the flowchart, the H-type cry-words dysphonation and inhalation, as well as silence, will have been detected. While the majority of cry-words remain to be recognized, the information required for separating them has already been calculated.

The most easily identified of the remaining cry-words is hyperphonation because of its extremely high F0. To determine a suitable frequency threshold to separate hyperphonation from the other cry-words, fundamental frequency data was measured at the beginning, middle and end of 32 hyperphonation samples. The roughly 100 measurements had a mean F0 of 1028 Hz and a standard deviation of just over 190 Hz. The threshold was chosen as approximately the mean minus one and a half standard deviations, or 740 Hz. Since the cepstral method only measures F0 values up to 727 Hz, the STFT peak detection method is needed to measure frequencies above this threshold. If the F0 is determined to be greater than 740 Hz, the cry sample is considered hyperphonation, which is H-type, and processing moves to the next analysis frame.

Continuing with strictly F0 information, the next decision looks for fundamentals between 381 Hz and 740 Hz, this time using the more accurate cepstral analysis method. (The threshold of 381 Hz was chosen because of the resolution in F0 estimation.) If the F0 lies in this range it is considered an E-type sample.
Again, since the ultimate purpose of the system is to measure the H-value, additional identification of which particular E-type sample is being examined is not needed and would only add to the computational complexity. E-type cry-words may exist below the lower threshold limit, but the remaining H-type cry-words, trailing and DHB, will generally not lie above it.

Below 381 Hz, the majority of cry-words will be H-type, in particular trailing or DHB. However, since it is possible for E-type cry-words to be below 381 Hz as well, additional criteria besides F0 must be considered. The most frequently encountered E-type cry-words in this range are rising (early stages), weak vibration and possibly the latter stages of falling. Recognition strategy at this point could follow two paths: recognize the sample as H-type or recognize the sample as E-type. Both options needed to be considered.

As shown in Figure 5.5, the routine first tests for DHB. An important consideration for detecting DHB is that it occurs suddenly and will contain in-between harmonics. As a result, the F0 will appear as if it has been halved from the previous sample, so knowledge of the preceding F0 values is required. At this point the fundamental should not exceed 740 Hz, so DHB detection looks for F0 values below 370 Hz which are less than 60% of the previous voiced F0. This method of checking the previous sample to see if the current F0 is approximately half will obviously not work for a second, consecutive DHB sample, however. For this reason, a flag is used so that if the F0 continues to be in this range the sample will also be labelled as DHB. Once the frequency returns above 381 Hz, the flag is removed.

If a sample with a fundamental frequency below 381 Hz is not recognized as DHB, it is probably trailing. To check whether it is instead one of the E-type cry-words, energy information can be used in addition to the F0.
From Table 5.1 and Table 5.2, trailing is unique since it is the only cry-word which can be separated in both frequency and energy. For fundamental frequencies between 300 and 381 Hz, the signal must also be below an energy threshold before it is considered trailing. Below 300 Hz it will automatically be considered trailing since it was not already recognized as DHB, and overlapping E-type cry-words will not typically have F0 values below about 320-350 Hz. Also note that for these lowest frequencies trailing may often be perceived as unvoiced. This is not a concern, though, since the unvoiced cry-words and trailing are all H-type.

This strategy of energy and F0 thresholds for trailing recognition will be accurate in most cases. However, as discussed earlier, the thresholds were chosen based on averages and some overlap is possible. One circumstance for which overlap may be encountered in both amplitude and frequency is the early stage of rising, before the signal energy has risen sufficiently. In this case it is necessary to determine whether the sample is rising, which requires use of sequencing information. In circumstances where the signal is rising yet the F0 and energy are still below the set thresholds, the signal will have just followed a silence sequence. Silence sequences are therefore flagged to allow for this special consideration. After the flagged silence sequence, if the first couple of samples are considered trailing or DHB while the third sequence is clearly E-type once the energy is above the threshold, the routine goes back and re-labels the previous samples as E-type as well.

(6) H or E-Type: Once each sample is recognized as either H-type or E-type, the results must be recorded. This section of the system determines whether an analysis frame has been recognized as H-type, E-type or silence and records the information accordingly.
Providing analysis has not reached the end of the cry signal, the 32 msec sliding window will then shift to the next sequence.

(7) H-Value: Upon reaching the end of the cry signal, the final segment of this routine calculates an estimate of the H-value. This estimation is based on Xie's equation [Xie, 1993]

$$\text{H-Value} = \frac{\text{Number of H-type sequences}}{\text{Total number of non-silence sequences}} \tag{5.13}$$

where the H-type cry-words are trailing, DHB, dysphonation, hyperphonation and inhalation. More detailed flowcharts for this system are given in Figures 5.8 to 5.10. Testing details will be discussed in the following chapter.

[Figure 5.8 flowchart: (1) cry signal -> HPF -> 256 pt. (32 msec) sliding window -> energy calculation -> (2) SILENCE; spectrogram -> (3) P_M test -> UNVOICED (dysphonation or inhalation) or VOICED -> (4) F0 calculation -> 381 Hz and 740 Hz decisions -> (5) HYPERPHONATION, E-TYPE, DHB (flag D_fl = 1, unflag D_fl = 0), rising test -> (6) H or E type -> slide window, or at end of signal -> (7) H-VALUE]

Figure 5.8 Expansion of the overview flowchart (Fig. 5.5) showing more detail describing the new recognition system.

[Figure 5.9 flowchart: (1) cry signal -> HPF -> 256 pt. (32 msec) sliding window -> (2) energy calculation -> test for consecutive silence frames -> SILENCE; (3) spectrogram -> P_M test -> UNVOICED (dysphonation or inhalation) or VOICED -> (4) F0 calculation; slide window]

Figure 5.9 Detailed flowchart (part 1). This flowchart provides more detail than Fig. 5.5 and Fig. 5.8. A, B and C connect with corresponding sections in Fig. 5.10.

Figure 5.10 Detailed flowchart (part 2). A, B and C connect with corresponding sections in Fig. 5.9.
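As a rough summary of decisions (5) to (7), the following Python sketch applies the F0 thresholds quoted in the text (740, 381, 370 and 300 Hz and the 60% DHB ratio) to voiced frames and then forms the H-value ratio. It is a simplified illustration rather than the thesis routine: the function names are invented, the trailing energy threshold `trailing_energy_max` is a hypothetical parameter whose value is not given in this section, and the re-labelling of early rising samples after a flagged silence sequence is omitted.

```python
def label_voiced_frame(f0, prev_f0, energy, dhb_flag, trailing_energy_max):
    """Return (label, dhb_flag) for one voiced frame.

    label is 'H' or 'E'. prev_f0 is the previous voiced F0 (or None).
    """
    if f0 > 740:
        return "H", False            # hyperphonation
    if f0 >= 381:
        return "E", False            # E-type; DHB flag is removed
    # DHB: F0 below 370 Hz and roughly halved, or flag still set.
    if f0 < 370 and (dhb_flag or (prev_f0 is not None and f0 < 0.6 * prev_f0)):
        return "H", True
    # Trailing: automatically below 300 Hz, otherwise only at low energy.
    if f0 < 300 or energy < trailing_energy_max:
        return "H", dhb_flag
    return "E", dhb_flag             # low-F0 E-type (e.g. early rising)


def h_value(labels):
    """H-value: H-type frames divided by non-silence frames.

    labels is a sequence of 'H', 'E' or 'S' (silence) per frame.
    """
    non_silence = [lab for lab in labels if lab != "S"]
    if not non_silence:
        raise ValueError("no non-silence frames in the signal")
    return sum(1 for lab in non_silence if lab == "H") / len(non_silence)


# Hypothetical frame labels: two H-type, two E-type, two silence.
example = h_value(["S", "H", "E", "E", "H", "S"])  # 2 / 4 = 0.5
```

Silence and unvoiced frames are assumed to have been labelled 'S' and 'H' respectively by the earlier stages, so only voiced frames reach `label_voiced_frame`.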
Chapter 6 TESTING and RESULTS

6.1 TESTING OVERVIEW

The new system outlined in the previous chapter represents a complete research recognition tool that can be used by cry researchers, based on Xie's proven concept that cries are composed of recognizable words. It has been proposed that our new system is an improvement on Xie's HMM system developed in 1993 [Xie, 1993; Xie et al., 1995]. To provide a basis for comparison, the same test data used by Xie for his system was again used to test our tool. Unfortunately, one of the 58 cry samples in the testing set was corrupted. Rather than create an additional sample from the training data, for which results based on Xie's methods were unavailable, we simply used the remaining 57 non-corrupted samples. Before testing the tool, each of the 57 samples was first decomposed into its respective cry-words using a computer spectrogram program. By manually analyzing small segments at a time, durations of each cry-word were visually detected and recorded. These decompositions were subsequently used to calculate an H-value based on equation 3.1. For the remainder of the discussion, we will call the 57 H-values obtained off-line by this method the Manual Visual Estimates, or MVE results. While many features may be useful for comparing our system with Xie's, our efforts have been focussed on system accuracy, time and complexity. The following sections will detail these three parameters and demonstrate the improvements of our new system. Though Xie developed two methods for recognition, namely a VQ-statistical classifier and an HMM based system, comparisons will only be provided for the latter system, which was considered the better of the two.

6.2 SYSTEM ACCURACY

6.2.1 Comparisons with our MVE Results

The purpose of implementing this automatic recognition system was to provide an automatic computer estimation of the H-value of a cry.
Initial tests were therefore aimed specifically at the accuracy of the system. Using the 57 cries from the test set, the new system estimated H-values for each of the samples using a 32 msec, non-overlapping, sliding analysis window. When we compared the results of our automatic estimation method to our MVE results, we found that our new system had a mean absolute error of 5.9%. Figure 6.1 illustrates the differences between our off-line MVE H-values and the automatically estimated H-values. The 5.9% error was calculated as the average vertical distance between each data point and the straight line with slope equal to one. From this figure we see that the automatic H-value estimations tend to be slightly higher than the MVE values, especially for H-values below about 40. With the exception of three or four points, however, the estimations are generally quite accurate, verifying the use of this new system for automatic H-value estimations. The few cases where automatic estimations were less accurate can be accounted for as having lower than normal average fundamental frequencies, excessively high noise causing many samples to be estimated as unvoiced, or low signal energies. On an individual basis, thresholds could be modified to account for these errors.

[Figure 6.1: The subjective MVE H-value estimations versus the new system automatic H-value estimations. Horizontal axis: MVE H-Value Estimation (0 to 100).]

As most studies reported in the literature use overlapping analysis windows, additional tests examined the accuracy of using a 32 msec sliding window with either a 50% overlap or a 10 msec sliding step size as used by Xie.
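The three windowing schemes just mentioned differ only in how far the 256-point window advances between frames. The sketch below is a hedged Python illustration: the function name is hypothetical, and the step sizes of 128 and 80 samples assume an 8 kHz sampling rate, which is inferred from 256 points spanning 32 msec rather than stated in this passage.

```python
# Illustrative sketch only; the function name is hypothetical and the 8 kHz
# sampling rate is inferred from "256 pt. (32 msec)".
def frame_starts(n_samples, window=256, step=256):
    """Return the start index of every full analysis window in the signal."""
    if n_samples < window:
        return []
    return list(range(0, n_samples - window + 1, step))


one_second = 8000  # samples in one second at the inferred 8 kHz rate
n_no_overlap = len(frame_starts(one_second, step=256))    # non-overlapping
n_half_overlap = len(frame_starts(one_second, step=128))  # 50% overlap
n_10ms_step = len(frame_starts(one_second, step=80))      # 10 msec step
```

For one second of signal this gives 31, 61 and 97 frames respectively, so the overlapping schemes process roughly two and three times as many frames as the non-overlapping window.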
When compared again to our MVE H-values for the 57 sample test set, the 50% overlapping window H-values demonstrated a slight improvement in mean absolute error at 5.8%, while the 10 msec step size was a little less accurate with a mean absolute error of 6.1%. Though a very marginal accuracy improvement using the 50% overlapping window was noticed, its benefit was heavily outweighed by the doubling of processing time. For this reason, the non-overlapping analysis window was selected for our final system. Early in the development of the system it was noted that if a small enough analysis window was used, we could assume the data would be short-time stationary. Ideally then, the use of these non-overlapping windows was justifiable. Later results will further support this choice. As a final segment in this phase of testing, we compared our MVE results to Xie's HMM automatic estimation results and found that his system had a mean absolute error of 13.3%. While it would seem that our system was over twice as accurate, it will be shown that this is largely due to the subjectivity involved in obtaining the MVE results. This will be further discussed in the following sub-section.

6.2.2 Comparison with Xie's MVE Results

Reviewing Xie's work, we see that an initial step in his testing was to obtain his own MVE results. When compared with his automatic HMM results, it was determined that his system was accurate to roughly 13%. Again, from this value, it would appear that our new system was over twice as accurate. However, when our new system H-value results were compared with Xie's MVE results, errors were 16-17%. Based on Xie's MVE H-values, our new system was actually slightly less accurate. The main reason for the difference in errors was the individual interpretation of the spectrogram for the off-line estimates of the H-values.
In fact, when our MVE results were compared with Xie's, it was determined that even these values differed by a mean error of 13.7%. This clearly shows one of the difficulties in off-line assessment, that being the inherent errors from the subjective assessment of data. It also strengthens the argument that a standard means for automatic recognition is required.

6.2.3 Comparison with Parental Assessments

At this point it is impossible to judge whose results are more accurate in an absolute sense. A more useful comparison was needed which was independent of any single person. When Xie set out to develop his initial system based on the recognition of the different cry-words, one of his objectives was to model the system on parental input so that the system would provide a human-like assessment of cries. A comparison based on the parental assessments of the same cries therefore provided a more meaningful measure of the system accuracy. Figure 6.2 illustrates the differences when the H-value estimations of our new system and Xie's HMM system were plotted against the parental LOD assessments. For the most part, both plots clearly show a positive correlation between estimated H-values and the parents' LOD ratings, although a few errant points for both systems are noticeable. When the mean absolute errors were measured, calculated as the vertical distance between the mean parental value and a straight line with slope equal to 4, it was determined that both systems had a mean absolute error of 17.5%. Further analysis measuring the cross-correlation coefficients between the parental assessments and automatically estimated H-values shows that the HMM system had a slightly higher correlation factor of 0.731 compared with 0.681 for our system.

[Figure 6.2: Automatic computer estimated H-values versus parents' LOD ratings. The top plot shows results for our new system using a non-overlapping analysis window. The bottom plot shows the results for Xie's HMM based system. Horizontal axes: Computer H_Value Estimation (New System) and Computer H_Value Estimation (HMM System).]

One of our initial objectives was to develop a complete system tool which would have comparable or better accuracy than Xie's system. Despite the slightly lower cross-correlation coefficient, our first objective was still met, as evidenced by the 17.5% mean error for both systems. Tables 6.1 and 6.2 summarize the results of this section. Also note that when compared to the parental assessments, the mean absolute error and cross-correlation coefficient for the non-overlapping analysis window case were a little better than for the overlapping cases, thereby further defending its selection.

Subjective             New System     New System      New System            Xie's HMM
Classifications        (No overlap)   (50% overlap)   (10 ms slide step)    System
MVE (ours)             5.9%           5.8%            6.1%                  13.3%
MVE (Xie)              16.4%          16.7%           16.9%                 13.1%
Parental Assessment    17.5%          17.8%           18.0%                 17.5%

Table 6.1 Mean absolute differences in H-values when subjective classifications are compared with estimated values from the automatic recognition systems.

Subjective             New System     New System      New System            Xie's HMM
Classifications        (No overlap)   (50% overlap)   (10 ms slide step)    System
MVE (ours)             0.934          0.930           0.923                 0.853
MVE (Xie)              0.824          0.818           0.813                 0.880
Parental Assessment    0.681          0.678           0.671                 0.738

Table 6.2 Cross-correlation coefficients between subjective classifications and estimated H-values from the automatic recognition systems.
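The two summary statistics reported in Tables 6.1 and 6.2 can be sketched in Python as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the "cross-correlation coefficient" is assumed here to be the ordinary Pearson correlation coefficient, and the error is assumed to be the mean absolute difference between paired values on a common scale. Any rescaling of parental ratings onto the H-value scale is not shown.

```python
import math


# Illustrative sketch only; function names are hypothetical and the choice of
# the Pearson coefficient is an assumption about the reported statistic.
def mean_absolute_error(estimates, references):
    """Average absolute difference between paired values."""
    if len(estimates) != len(references) or not estimates:
        raise ValueError("need two equal-length, non-empty sequences")
    total = sum(abs(e - r) for e, r in zip(estimates, references))
    return total / len(estimates)


def pearson_r(x, y):
    """Pearson correlation coefficient between two paired sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

Reporting both figures is useful because they capture different things: a system can track the reference ordering well (high correlation) while still being offset from it (large mean absolute error), which is the pattern visible in the rows comparing each system against the other author's MVE results.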
6.3 SYSTEM TIME

In developing the new automatic H-value estimation system, the main effort was placed on implementing a complete estimation tool which was easier to use and simpler conceptually. It was felt that the time issue, in particular having an analysis time approaching real time, could be resolved once the tool had been developed. While it was not a major consideration when creating the tool, the time issue was not completely ignored either. It was believed from the beginning that developing a less complicated system might decrease time accordingly. When testing the accuracy of the new system, the computation time was recorded for the 57 test samples. It was determined that the average signal length of the test cries was 66654 samples, or roughly 8.3 seconds. Using a SUN SparcStation 10, the system computed the H-value in an average of 129.5 seconds. In more useful terms, this new tool analyzed and computed an H-value for one second of cry data in just over 15.5 seconds. Unfortunately, the computation times for Xie's work were not recorded. In a recent personal correspondence he does however state that, "For a few seconds of data,... the response time [of the HMM system] should be no longer than 1 or 2 min." [Xie, 1997]. From this statement it would seem that our new system was at least as fast and probably faster already, even without modifications aimed at speeding up the processing. These results appear to verify our assumption that a less complicated system might also be faster. In addition to the computation time, a significant time issue concerns the required time for training the systems. In general, the training of an HMM system is quite time consuming, depending on the complexity of the model and amount of training data. In Xie's case, training of the HMMs took in the range of tens of hours on a SUN SparcStation 2, with some training requiring upwards of 40 hours.
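The per-second processing figure quoted above for the new system follows directly from the reported averages. The short Python sketch below reproduces the arithmetic; the function name is hypothetical and the 8 kHz sampling rate is inferred from the 256-point, 32 msec analysis window rather than stated in this section.

```python
# Illustrative sketch only; the function name is hypothetical and the sampling
# rate is inferred from 256 points spanning 32 msec.
SAMPLE_RATE_HZ = 8000


def seconds_per_signal_second(n_samples, compute_seconds,
                              sample_rate_hz=SAMPLE_RATE_HZ):
    """Computation time needed for each second of recorded signal."""
    duration_s = n_samples / sample_rate_hz
    return compute_seconds / duration_s


mean_duration_s = 66654 / SAMPLE_RATE_HZ               # about 8.3 seconds
factor = seconds_per_signal_second(66654, 129.5)       # about 15.5
```

A factor of 1.0 would correspond to real-time operation, so this ratio is a convenient single number for tracking progress toward the real-time goal.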
An obvious time saving improvement with our new tool is the absence of this required training process. Future use of this tool could benefit from this difference, as will be explained in the next section. Finally, time was also considered when selecting the appropriate sliding window for use in the new tool. As mentioned in 6.2.1, the overlapping windows were not selected because they offered no real accuracy improvement and required double or triple the computation time. While the non-overlapping analysis window tests required about 16 seconds of analysis time for one second of cry data, the 50% overlapping window required 30 seconds and the 10 msec sliding step size window 48 seconds to estimate the H-value for the same one second of cry data.

6.4 SYSTEM COMPLEXITY

Testing and comparing the complexity of these two systems was not an entirely obvious process. Conceptually, our new system was definitely simpler. With even a rudimentary understanding of a few simple concepts such as fundamental frequency, a user could follow the system flowchart (Figure 5.7) and understand the basis for recognition. Without question, the HMM would be a much more difficult concept to convey to a non-expert. While we get some indication of the decreased complexity based on the computation time, quantifying the difference can be difficult. One method for demonstrating the lower complexity of the new system was to implement it on a different platform. Use of the HMM system required significant computing power and as a result it had to be operated on a SUN. The ability of our new system to be operated on the SUN has already been demonstrated. The same system was then tested on a 486 DX, 66 MHz PC. As expected, identical analysis results were obtained and the analysis time on the less powerful system was longer.
The first ten samples of the test set were tested, with the mean time for analyzing approximately 1 second of cry data increasing from 16 seconds on the SUN to 28 seconds on the PC. Though analysis time was increased 1.75 times, these results show the feasibility of operating the new system using a PC. With much faster Pentium PCs now available which have speeds up to 300 MHz, these differences in analysis time would be reduced. These results also demonstrate that our new system is easily portable between different machines. This is a definite advantage and could ultimately contribute to an overall reduced recognition system cost. A second means for comparing the system complexities was to examine the ease of use or user friendliness. Had Xie continued in his work he would most likely have created a tool based on the HMM. But as mentioned before, this tool was not created. To compute the H-value first required training the HMMs. Once trained they could be used repeatedly, but a considerable initial time investment was nonetheless required before recognition could be performed. After training the HMMs, the first step in his system was to obtain feature vectors for the input cry. Since the system was not connected as a tool, the feature vectors would then have to be applied separately to the HMMs. The final output after these steps was the H-value. Not only was more time required, but more user steps were necessary. For our new system, the cry was again used as the input to the system and the output this time was simply the H-value with no intermediate steps. If more information was desired, the system also kept track of each frame's F0, st-energy and cry-word determination for future reference. Using this information and the flowchart in Figure 5.7, the user would be able to gain a better understanding of why a particular cry sample was given a particular H-value.
In most cases however, this information would remain hidden from the user unless it was requested. Another factor in comparing the ease of operation for the two systems was retraining. Returning to Figures 6.1 and 6.2, we see the systems' estimated H-values were well correlated with MVE values. However, in both figures there were a few stray samples which were not recognized as accurately. A closer look at these samples reveals their mis-recognition was often caused by a lower than average F0 or low signal energy. On an individual basis, Xie's system would take many hours on a powerful computer to retrain and, as previously mentioned, a non-expert could not do it. With this new system, modifying the system to individually account for these abnormalities would simply require thresholds to be adjusted. Instead of taking tens of hours, likely tens of minutes would be needed to determine a suitable lower (or higher) threshold, and persons with little computing expertise could make the adjustments. It has been shown that it is feasible to operate this system on a PC, so this could even be done at home. While it is difficult to quantify the decrease in system complexity, the demonstrated ability to operate our new tool on a PC, decreased analysis time and much easier retraining all point to an improved system which satisfies our final objective that the new system be less computationally intensive and easier for the user.

Chapter 7 CONCLUSIONS and RECOMMENDATIONS

7.1 CONCLUSIONS

The infant cry has both intrigued and puzzled scientists, medical doctors and engineers for some time. Many of these professionals believe that the infant cry contains important information about an infant's physical and emotional state. While there is some debate as to the amount of information conveyed in the cry, few would dispute that an infant's cry is a very good indicator of the infant's level of distress (LOD).
In 1993, Xie introduced a novel approach for measuring the LOD. Modelling his methods on proven speech recognition principles, Xie showed that the infant's cry is composed of connected segments of ten different recognizable words, each with distinct time-frequency characteristics. Xie's work also showed that by recognizing these cry-words and recording their sum durations, a single value indicator of the infant's distress level, called the H-value, could be measured. Tests verified that this distress measure was well correlated with parental LOD ratings. However, as mentioned before, while a platform was established, a practical tool for automatic H-value estimation was not. Based on Xie's proven concepts, our work has gone one step further and developed such a tool. A key assumption in the development of this tool was that since each cry-word could be distinguished by a human observer by eye from the spectrogram, they should also be distinguishable using a computer. One of our accomplishments was finding suitable features which would permit this identification accurately and efficiently. It was determined that F0 was the most useful feature for recognizing each cry-word. In particular, cry samples with relatively high or low F0 values were generally H-type cry-words. A second useful observation was that only two of the ten cry-words exhibited unvoiced characteristics, and both of these were H-type cry-words. Information from the power spectral density was useful for recognizing which cry-words exhibited these unstructured energy patterns. The definition of the H-value also provided some flexibility in determining its estimation. In some cases, our new tool would incorrectly identify particular cry-words. However, the definition of the H-value as the number of H-type analysis frames divided by the total number of non-silence analysis frames allowed some room for error.
For example, the H-type cry-words trailing or DHB were on occasion misrecognized as unvoiced because they had extremely low fundamentals and thus their harmonics were not clearly separated. Because the unvoiced cry-words were H-type as well, this would not affect the H-value calculation. Since the goal of the system was to accurately estimate the H-value, it was acceptable to have a small number of H-type cry-words misrecognized as another, but also H-type, cry-word. In comparison with the earlier systems developed by Xie, our new system provided many improvements. In particular, benefits from our work include:

(1) Complete tool: First and foremost, our work satisfied our primary objective and provided a complete H-value estimation tool with similar accuracy to, and comparable or better time than, Xie's systems. Our new system also correlated as well with parental assessments as Xie's. In testing our tool, ambiguities experienced when comparing the systems' accuracy also strengthened the argument that a standardized means of cry analysis is necessary. The development of this tool could therefore be very useful for future cry research.

(2) Decreased complexity: Without question, the HMM and VQ-statistical classifier are powerful mathematical methods and have many uses. One of their drawbacks, though, is that they are very computationally intensive. A second advantage of our new system is that it is based on simple principles and as a result is less complex and does not require as much computing power. This is a significant development as it leads to a practical tool. The remaining system improvements also stem from this improvement.

(3) Machine portability: As a result of the decrease in complexity, we were able to demonstrate the feasibility of operating our tool on both a SUN and a PC.
This ability was not examined in Xie's work, partly because he was trying to prove a concept but also because of the required computing power, particularly for training. While it should be possible to operate Xie's system on a PC given the necessary software modifications, it would not be as practical as in our case. The ability to operate our tool on a PC should ultimately lead to a reduced overall system cost, as well.

(4) Easy to use: If a system is to be employed by a broad spectrum of users, it is important that the system be easy to use. By this we do not simply mean that the system should have a user friendly interface. We felt that having a system which was simpler conceptually and for which the basis for recognition was known would contribute to a better overall system which the user could understand. To re-iterate an example provided in section 6.4, to re-train Xie's system for an individual purpose would require an expert many hours with a high powered computer. With our system, a general understanding of a few basic concepts would allow non-experts to adjust thresholds in minutes.

As expected, a detailed study of common cry features has led to this improved research tool. While Xie's earlier systems used complex mathematical models, our new tool is based on simple decisions involving predetermined thresholds for such cry features as the fundamental frequency, st-energy, a small amount of sequencing information and the power spectral density. The result: a cry research tool which is simple enough to use on a PC and is much easier to use in general.

7.2 RECOMMENDATIONS

The use of an established level of distress estimation tool could be very beneficial to a wide variety of users. An obvious group of users would include hospitals and homes, where the ability to have an expedient measure of an infant's distress level could greatly influence the monitoring of infants.
Psychologists, medical doctors, engineers and others would also benefit from such a tool for cry research. Having a tool which offered a standardized measure of a cry could, as an example, assist them in detecting abnormalities for infants with certain diseases. The tool we developed is definitely a step towards the universal tool described above. At this point, however, our tool is more suited for cry research than for monitoring purposes. Further work will be required before this general tool is to become a reality. Specifically, future work should include:

(1) Larger data sets: With such a small data set, making any generalizations about our system is both difficult and uncertain. Though Xie has proven that cries are composed of recognizable words and we have developed a tool for their recognition, at this point testing has been limited to a small group of test samples. The next step with this work should include larger data sets encompassing infants of a variety of ages, both healthy and unhealthy.

(2) Additional features: While we have demonstrated that our new system determines the H-value with acceptable accuracy and computation time, we cannot be certain that other features will not improve our system. Formant frequencies, additional sequencing information and frequency of occurrence may also prove useful. Before these can be accurately examined, however, the larger data sets mentioned in (1) should be obtained for best results.

(3) Optimize: Our system was capable of analyzing one second of cry data and estimating its H-value in just under 16 seconds. For the tool to receive considerable attention this processing time will need to be optimized. An obvious improvement would be to replace the MATLAB platform on which our system operated with only the necessary routines.
Much of the information stored by the system, such as the F0 and st-energy, may also not be needed for all analysis frames. Eliminating unused data would reduce both computation time and memory requirements. Finally, the use of the latest machines with 300 MHz clock speeds and parallel processors would undoubtedly bring analysis closer to real-time on PCs.

(4) Further develop/package tool: It was mentioned at the beginning of the section that the tool is currently more suited for cry research than for infant monitoring. We say this because (a) the system does not yet have real-time capabilities and (b) the tool still needs to be "packaged". Presently it is used off-line for recorded cry-signals. For it to be useful for more universal purposes, the tool will need to be further developed to reduce the analysis time and then packaged with the appropriate hardware for its use as a monitoring system.

BIBLIOGRAPHY

Brennan, M. and J. Kirkland. "Discrimination of infants' cry-signals," Perceptual and Motor Skills, vol. 48, pp. 683-6, 1979.
de Bruin, J. C., and J. A. du Preez. "Automatic language recognition based on discriminating features in pitch contour," IEEE South African Symp. on Comm. and Sig. Proc., Jan Smuts Airport, South Africa, pp. 133-8, 1993.
Deller, J. R., J. G. Proakis, and J. H. Hansen. Discrete-Time Processing of Speech Signals. New York: Macmillan Publishing Company, 1993.
Favero, R. F. "Compound wavelets: wavelets for speech recognition," IEEE-SP Int'l Symposium on Time-Frequency and Time-Scale Analysis, Philadelphia, Pa., pp. 600-3, 1994.
Flanagan, J. L. Speech Analysis Synthesis and Perception. New York: Springer-Verlag, 1965.
Fuller, B. F. "Acoustic discrimination of three types of infant cries," Nursing Research, vol. 40, no. 3, pp. 156-60, May/June 1991.
Gardosik, T. A., P. J. Ross, and S. Singh. "Acoustic characteristics of the first cries of infants," in Infant Communication: Cry and Early Speech (T. Murry and J. Murry, eds.), ch. 5.
Houston, Texas: College-Hill Press, 1980.
Golub, H. L. and M. J. Corwin. "A physioacoustic model of the infant cry," in Infant Crying: Theoretical and Research Perspectives (B. M. Lester and C. F. Z. Boukydis, eds.), ch. 3. New York: Plenum Press, 1985.
---. "Infant cry: a clue to diagnosis," Pediatrics, vol. 69, no. 2, pp. 197-201, Feb. 1982.
Graps, A. "An introduction to wavelets," IEEE Computational Science and Engineering, vol. 2, no. 2, pp. 50-61, Summer 1995.
Grunau, R. V., C. C. Johnston and K. D. Craig. "Neonatal facial and cry responses to invasive and non-invasive procedures," Pain, vol. 42, pp. 295-305, 1990.
Gustafson, G. E. and J. A. Green. "On the importance of fundamental frequency and other acoustic features in cry perception and infant development," Child Development, vol. 60, pp. 772-80, 1989.
Hollien, H. "Developmental aspects of neonatal vocalizations," in Infant Communication: Cry and Early Speech (T. Murry and J. Murry, eds.), ch. 2. Houston, Texas: College-Hill Press, 1980.
Illingworth, R. S. "The development of communication in the first year and the factors which affect it," in Infant Communication: Cry and Early Speech (T. Murry and J. Murry, eds.), ch. 1. Houston, Texas: College-Hill Press, 1980.
Ismaelli, A., G. Rapisardi, G. P. Donzelli, M. Moroni, and P. Bruscaglioni. "A new device for computerized infant cry analysis in the NICU," 16th Annual IEEE Engineering in Medicine and Biology Conf., Baltimore, Md., vol. 2, pp. 854-5, 1994.
Kadambe, S., and G. F. Boudreaux-Bartels. "Application of the WT for pitch detection of speech signals," IEEE Trans. on Info. Theory, vol. 38, no. 2, pp. 917-24, Mar. 1992.
Karelitz, S., and V. R. Fisichelli. "The cry thresholds of normal infants and those with brain damage," J. Pediatrics, vol. 61, pg. 679, 1962.
Langlois, A., R. J. Baken, and C. N. Wilder. "Pre-speech respiratory behaviour during the first year of life," in Infant Communication: Cry and Early Speech (T. Murry and J.
Murry, eds.), ch. 3. Houston, Texas: College-Hill Press, 1980.
Lester, B. M. "Introduction: There's more to crying than meets the ear," in Infant Crying: Theoretical and Research Perspectives (B. M. Lester and C. F. Z. Boukydis, eds.), ch. 1. New York: Plenum Press, 1985.
---. "Developmental outcome prediction from acoustic cry analysis in term and preterm infants," Pediatrics, vol. 80, no. 4, pp. 529-34, Oct. 1987.
Lieberman, P. Speech Physiology and Acoustic Phonetics. New York: Macmillan Publishing Co., 1977.
Lundh, P. "A new baby-alarm based on tenseness of the cry signal," Scand. Audiol., vol. 15, pp. 191-6, 1986.
Lynip, A. W. "The use of magnetic devices in the collection and analysis of the preverbal utterances of an infant," Genetic Psychology Monographs, vol. 44, pp. 221-62, 1951.
Makhoul, J., S. Roucos, and H. Gish. "Vector quantization in speech coding," Proc. of the IEEE, vol. 73, no. 11, pp. 1551-88, Nov. 1985.
Mallat, S. G. "A theory for multiresolution signal decomposition: the wavelet representation," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 11, no. 7, pp. 674-93, July 1989.
Michelsson, K. "Cry characteristics in sound spectrographic cry analysis," in Infant Communication: Cry and Early Speech (T. Murry and J. Murry, eds.), ch. 4. Houston, Texas: College-Hill Press, 1980.
Michelsson, K., and O. Wasz-Hockert. "The value of cry analysis in neonatology and early infancy," in Infant Communication: Cry and Early Speech (T. Murry and J. Murry, eds.), ch. 7. Houston, Texas: College-Hill Press, 1980.
Murray, A. D. "Infant crying as an elicitor of parental behaviour: An examination of two models," Psychol. Bull., vol. 86, pp. 191-215, 1979.
Murry, T. "Introduction," in Infant Communication: Cry and Early Speech (T. Murry and J. Murry, eds.). Houston, Texas: College-Hill Press, 1980.
Noll, A. M. "Cepstrum pitch determination," Journal of the Acoustical Society of America, vol. 41, pp. 293-309, Feb. 1967.
Petroni, M., A. S.
Malowany, C. C. Johnson, and B. J. Stevens. "A new, robust vocal fundamental frequency (F0) determination method for the analysis of infant cries," 7th Annual IEEE Symposium on Computer Based Medical Systems, Winston-Salem, N.C., pp. 223-8, 1994.
---. "Classification of infant cry vocalizations using artificial neural networks (ANNs)," ICASSP '95: Acoustics, Speech and Sig. Proc. Conf., Detroit, Mi., vol. 5, pp. 3475-8, 1995.
Prescott, R. "Infant cry sound: Developmental features," Journal of the Acoustical Society of America, vol. 57, pp. 1186-91, 1975.
Rabiner, L. R. "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. of the IEEE, vol. 77, no. 2, pp. 257-85, Feb. 1989.
Tenold, J. L., D. H. Crowell, R. H. Jones, T. H. Daniel, D. F. McPherson, and A. N. Popper. "Cepstral and stationarity analyses of full-term and premature infants' cries," JASA, vol. 56, pp. 975-80, 1974.
Thoden, C. J., A. L. Jarvenpaa, and K. Michelsson. "Sound spectrographic cry analysis of pain cry in prematures," in Infant Crying: Theoretical and Research Perspectives (B. M. Lester and C. F. Z. Boukydis, eds.), ch. 5. New York: Plenum Press, 1985.
Vetterli, M., and C. Herley. "Wavelets and filter banks: theory and design," IEEE Trans. on Sig. Proc., vol. 40, no. 9, pp. 2207-32, Sept. 1992.
Vuorenkoski, V., M. Kaunisto, P. Tjernlund, and L. Vesa. "Cry detector: a clinical apparatus for surveillance of pitch and activity in the crying of a newborn infant," Child Psychiatry, Parallel session C III a: Neurology, 1970.
Wasz-Hockert, O., J. Lind, V. Vuorenkoski, T. Partanen, and E. Valanne. The Infant Cry: A Spectrographic and Auditory Analysis, Spastics International Medical Publications in Association with William Heinemann Medical Books Ltd., 1968.
Wasz-Hockert, O., K. Michelsson, and J. Lind. "Twenty-five years of Scandinavian cry research," in Infant Crying: Theoretical and Research Perspectives (B. M. Lester and C. F. Z. Boukydis, eds.), ch. 4.
New York: Plenum Press, 1985.
Wasz-Hockert, O., T. J. Partanen, V. Vuorenkoski, K. Michelsson, and E. Valanne. "The identification of some specific meanings in infant vocalizations," Experientia, vol. 20, pg. 154, 1964a.
Wasz-Hockert, O., T. J. Partanen, V. Vuorenkoski, E. Valanne, and K. Michelsson. "Effect of training on ability to identify preverbal vocalisations," Develop. Med. Child. Neurol., vol. 6, pp. 393-6, 1964b.
Xie, Q. Personal correspondence, May 23, 1997.
---. Automatic Infant Cry Analysis and Recognition. Ph.D. thesis, University of British Columbia, Vancouver, 1993.
Xie, Q., R. K. Ward, and C. A. Laszlo. "Automatic assessment of infants' levels-of-distress from the cry signals," submitted for publication to IEEE Trans. on Speech and Audio Processing, July 1995.
---. "Determining normal infants' level-of-distress from cry sounds," Proc. of the 1993 Canadian Conf. on Elec. and Comp. Eng., Vancouver, B.C., pp. 1094-96, 1993a.
Xie, Q., C. A. Laszlo, and R. K. Ward. "Vector quantization technique for nonparametric classifier design," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 15, no. 12, pp. 1326-30, December 1993b.

