UBC Theses and Dissertations


In search of isochrony : compensating for durational warping in speech production Caulfield, Anne Jeanette 1985

IN SEARCH OF ISOCHRONY: COMPENSATING FOR DURATIONAL WARPING IN SPEECH PRODUCTION

by Anne Jeanette Caulfield

B.A., University of British Columbia, 1981

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES, Faculty of Medicine, School of Audiology and Speech Sciences

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
SEPTEMBER 1985
© Anne Jeanette Caulfield, 1985

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Audiology and Speech Sciences
The University of British Columbia
1956 Main Mall
Vancouver, Canada V6T 1Y3

ABSTRACT

The rhythmic organization of speech into regular intervals (i.e. isochrony) is a strong perceptual phenomenon. However, investigators have been unable to demonstrate the existence of isochrony in production data. It is hypothesized in this study that the intended rhythm of a speaker is in fact isochronous, but that this is obscured by several distorting influences which introduce durational irregularity at the syllable level, e.g. intrinsic duration, stress, position of the syllable in a phrase and number of syllables in a phrase. It is proposed that removing the predictable durational irregularities will yield a more regular signal, reflecting the (hypothesized) intended isochronous rhythm of the speaker.
The latter two sources of distortion introduce progressive durational irregularity or "warping" which can be readily incorporated into an automated "dewarping" procedure. A computer program was devised to compensate, at the syllable level, for these two sources of distortion. The former two sources are not amenable to such an automated procedure, and were therefore not included. The "dewarping" program was run on the speech amplitude envelopes of two speakers, one French and one English. The results indicate that, for the French speaker, dewarping does remove some of the durational irregularity, yielding a more regular amplitude envelope. For the English speaker, no such improvement in regularity is obtained. This indicates that the dewarping used, which presumes the syllable as "unit" of dewarping, is appropriate for syllable-timed languages such as French, but inappropriate for stress-timed languages such as English. It therefore provides some support for isochrony in French at the syllable level. Finally, the results also give support to the hypothesis that the degree of warping perceived as regular in speech perception studies corresponds to the degree of dewarping which, conversely, yields the most regular speech amplitude envelope; however, further experimentation is necessary to determine the optimum values of the parameters of the dewarping function.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENT
Chapter 1 INTRODUCTION
Chapter 2 LITERATURE REVIEW
2.1 Rhythm
2.2 Duration of syllables
Chapter 3 OUTLINE OF EXPERIMENTAL AIMS
Chapter 4 EXPERIMENTAL TECHNIQUE
4.1 Evaluation of trends in existing data
4.2 Preliminary investigation I
4.3 Preliminary investigation II
4.4 Dewarping algorithm
4.5 Experimental study
Chapter 5 RESULTS
5.1 Histograms
5.2 ANOVA results
5.3 Probability graphs
Chapter 6 DISCUSSION
BIBLIOGRAPHY

LIST OF TABLES

Table 1 Parameter values for DW function
Table 2 Mean and s.d. - de Gaulle
Table 3 Mean and s.d. - Nixon
Table 4 Analysis of variance - de Gaulle
Table 5 Analysis of variance - Nixon
Table 6 Newman-Keuls Test - de Gaulle
Table 7 Newman-Keuls Test - Nixon
Table 8 Analysis of variance - de Gaulle, sections 3-7
Table 9 Newman-Keuls Test - de Gaulle, sections 3-7

LIST OF FIGURES

Figure 1 D-function for 6-syllables
Figure 2 Duration of stressed syllables as a function of position in phrase and number of syllables per word. From Lindblom and Rapp (1973), Figure 7.
Figure 3 Syllable durations found in preliminary investigation 1
Figure 4 Computed values of syllable durations for preliminary investigation 1
Figure 5 Relationship between D-function and DW-function
Figure 6 Flow chart of dewarping computer programs
Figure 7 Histogram of de Gaulle - Pauseout
Figure 8 Histogram of de Gaulle - Warpout 1
Figure 9 Histogram of de Gaulle - Warpout 2
Figure 10 Histogram of de Gaulle - Warpout 3
Figure 11 Histogram of de Gaulle - Warpout 4
Figure 12 Histogram of Nixon - Pauseout
Figure 13 Histogram of Nixon - Warpout 1
Figure 14 Histogram of Nixon - Warpout 2
Figure 15 Histogram of Nixon - Warpout 3
Figure 16 Histogram of Nixon - Warpout 4

ACKNOWLEDGEMENT

As my graduate work draws to a close I would like to express my appreciation to all those who have helped me in my course and clinical work, and, in particular, in the preparation of this thesis.
Special thanks to Andre-Pierre Benguerel for his patience, advice, assistance, and continued encouragement. Thanks also to John Gilbert, for serving on my committee, and John Nicholl for his assistance with the PDP-12. I would also like to thank each of my professors, instructors and clinical supervisors for providing me with the educational base to begin a rewarding career. I would especially like to thank Noelle Lamb for all the time and support she has given me. Finally, I would like to thank all my family and friends who have put up with me during the last few months and still managed to encourage me. Special thanks to my father, for sacrificing his Macintosh for this thesis, and especially to my husband Steve, for all his love and support. This research was supported by a Medical Research Council of Canada Studentship.

CHAPTER 1: INTRODUCTION

One of the behaviours which separate humans from animals is the ability to communicate through speech. This unique ability has long fascinated and puzzled humans, and has provided a varied and interesting field of study. The applications for information about speech are many; in the clinical area, they include the enhancement of production, reception and understanding of speech under strained conditions or by disordered individuals; in the technical and industrial area, they include the synthesis and the recognition of speech by computers.

The extreme complexity of the many factors involved in a particular speech event has necessitated the division of this field into several major areas for the purposes of study: Phonetics, Phonology, Morphology, Semantics, Syntax and Pragmatics. Phonetics, which is the area of interest to this thesis, is the study of the sounds constituting the speech event.
It is concerned with segmentals, which are the individual sounds of speech, and with suprasegmentals, which are the patterns of stress, intonation or timing superimposed on the sequences of individual sounds constituting a speech utterance.

While a particular researcher may concentrate his studies on one of the major areas, it is important to remember that these areas overlap and are inter-related; together they deal with the entire speech event. No single area can be considered in total isolation from the others. The area of phonology influences, and is influenced by, the other areas of study. For example, the particular sounds selected, their order in a sequence and the patterns of stress, intonation and timing are determined not only by phonology but also by the semantics, morphology, syntax and pragmatics of the message to be conveyed.

As an example of the complex interrelationships between the major areas of study, let us consider the simple act of requesting an action, such as closing the window. The familiarity of the addressee, the formality of the situation, and other situational factors will determine if the request takes the form of a command, a question, an indirect statement, or a polite request. The form chosen may then be expressed by one of several different syntactic structures, upon which intonation and stress patterns will be superimposed. The pragmatics of the situation will also determine the specific vocabulary used. Several possible utterances are: "Shut the window!", "Don't you think it's cold in here?", "I'm really cold", "Could I trouble you to close the window please?". Articulatory constraints may then modify the actual realization of the intended sequences. For example, rapid articulation of the adjacent segments /t/ and /ju/ in "Don't you..." may alter the pronunciation of these segments to /tju/. Similarly, coarticulation effects will lead to different pronunciation of the segment /n/ preceding a 161 versus preceding a Iql.
Conversely, alterations of sounds or patterns at the production level may modify the semantics or pragmatics of the message produced. For example, stress on "window" can modify the meaning to "close the window, not the door", while stress on "close" can modify the meaning to "close the window, don't open it". Similarly the intonation contour and rhythmic structure can change a statement into a question, a sarcastic comment, a command, a complaint or a request, while a "slip of the tongue" during production may change the utterance from "Shut the window" to "Show the windut".

In studying phonetics, one can examine the acoustic parameters of speech and their articulatory and perceptual correlates. The acoustic parameters of a signal, of a speech signal in particular, are amplitude, frequency and duration. These acoustic parameters correlate, at the psychoacoustic level, with the parameters of loudness, pitch and subjective duration (timing). However, they are also influenced by each other and by the phonetic context of the utterance. The psychoacoustic parameters in turn correlate with the linguistic prosodic parameters of stress, intonation and rhythm. At the segmental level, complex interactions of the three acoustic parameters also contribute to the phonetic quality perceived.

This thesis is concerned with the study of rhythm in speech production, specifically with the regularity of occurrence, or isochrony, of the units of speech. This area has received little attention in the past, but as its importance in speech perception and production is recognized, more research is undertaken. The importance of rhythm in speech production and perception can be best observed in situations (described in greater detail in Section 2.1.4) in which communication through speech is somehow constrained. For example, rhythm has been shown to enhance intelligibility of the speech signal in cases where segmental information in a speech signal is distorted, e.g.
in noisy conditions, over poor transmission (radio or telephone) lines, or under hearing-impairment. Conversely, intelligibility of the speech signal is often reduced in situations in which the rhythm is distorted, e.g. in the speech of aphasics, of deaf people or of foreign speakers of a language. In these situations, the listener generally notes an "unnatural" quality in the speech and may be distracted by this. Studies of cases in which therapy has been provided to improve the naturalness of the rhythm in the speech of deaf and aphasic individuals have also shown, in some cases, a concomitant increase in fluency and intelligibility.

The above-mentioned situations indicate that rhythm may play an important role when speech intelligibility becomes borderline. The extent of the influence of rhythm on speech perception and production, and the precise role it plays in the transmission and reception of information, need to be studied in more detail. Developing a greater understanding of the nature of rhythm in speech perception and production may enable researchers to find ways of enhancing communication under strained conditions, of helping speakers with abnormal rhythm, such as aphasics and the hearing impaired, as well as improving computer programs for the recognition and synthesis of speech.

CHAPTER 2: LITERATURE REVIEW

2.1 Rhythm

2.1.1 General. Rhythm has not received as much attention in the past as stress and intonation. The complex interrelation of all the prosodic parameters of speech poses serious problems to the researcher. However, rhythm is probably the most complex and difficult of these parameters to study. There are three central questions which face researchers wishing to study rhythm: (1) What is rhythm? (2) What should be measured? (3) How should it be measured? The third question is a little easier to address than the first two. Rhythm is primarily a timing phenomenon.
Therefore the parameters of time - rate and duration - are of interest to the researcher. Tools for the physical measurement of time, and for the production of stimuli for the perceptual measurement of time, were slow and imprecise until quite recently. New developments in the equipment used, particularly in computer analysis and stimulus synthesis, have made the measurement of time considerably easier.

The first two questions still confront researchers today. A precise definition of what rhythm is and what variables are important to rhythm has not yet been universally agreed upon. That rhythm exists is not disputed. It is a strong sensation underlying our perception of many activities and sensory stimuli (primarily visual, auditory, tactile and kinesthetic). Nonetheless it has proven very difficult to define. This difficulty has been recognized as far back as the time of Plato (Fraisse, 1982), who proposed the definition of rhythm as "the order in the movement". Fraisse added to this: "there is rhythm when we can predict on the basis of what is perceived". Martin (1982) defined rhythm as "temporal patterning". These very general definitions can be applied to many different examples of rhythm, from rapid, repetitive motor movements, such as heartbeat, respiration, walking and talking, to slow cycles of nature such as growth, and the planetary motions creating the days, lunar months, and years. Martin and Fraisse both refer to physiological rhythms of parts of the nervous system and of motor activities. Other rhythms of slowly unfolding events are, according to Fraisse, conceived or inferred rather than directly perceived.

Martin proposes that rhythmic behaviours are hierarchically organized. Further, he postulates a central timing mechanism which controls the production of behaviours in sequences. The timing of each element or behaviour in a sequence is then relative to all the others in that sequence.
Martin states that the alternative to a rhythmic sequence is a concatenated or successive sequence. Each element in the sequence is then related only to adjacent elements. Such a sequence is controlled by peripheral mechanisms. Speech is a rhythmic behaviour which appears to have a central planning mechanism controlling the temporal pattern of a sequence. However there also may be some local or peripheral influences on the final temporal pattern of speech, such as coarticulation effects and motor constraints. These are discussed in section 2.2.

From a perceptual point of view, there are different levels of temporal organization which can be described as rhythmic. At the simplest level, the perception of rhythm may result from the repetition of a single event at constant intervals in time. Some examples of this are the click of a metronome, and spontaneous activities such as tapping and walking. Spontaneous activities may have their own characteristic tempo. The spontaneous tempo has been measured as the speed of tapping, or other repetitive motor behaviour, produced at the most "natural" rate. Corresponding to this is the preferred tempo, which is the measure of the speed of a series of stimuli (generally lights or sounds) perceived to be the most "natural". These two measures are highly correlated, but not identical, for an individual. They are also often connected, for example in the case of a motor behaviour produced in synchrony with a perceived rhythm. Examples of such synchrony can be found in behaviours such as dancing, rocking or tapping to music, marching in unison, or musicians playing together.

In general, there is a time lag between perception of a cue stimulus and production of a motor response. This is known as reaction time. Production of a behaviour simultaneously with a stimulus can only occur when the stimulus has been anticipated before it occurs. The cue for the response is the anticipation of the stimulus, not the stimulus itself.
The ability to anticipate or predict events is, by our earlier definition, a characteristic of rhythm.

The perception and production of rhythm is evident from a very early age. Newborn infants show a very stable spontaneous rate in rhythmic sucking. Later, infants show the ability to synchronize activities, such as rocking, with a rhythmic sound, indicating the perception of rhythm.

The strongly perceptual nature of rhythm is shown in the phenomenon of "subjective rhythmization" (Fraisse, 1982). Simple regular repetitions of a stimulus are not perceived as a succession of single stimuli but are subjectively perceived to occur in groups of twos or threes. An example of this is the regular ticking of a clock which is subjectively grouped into repetitions of "tick-tock". Similarly, sequences which are not strictly rhythmic may be perceived as rhythmic. Allen (1975) suggests that if the durations in a sequence vary within a small range, they will be perceived as equal. The shorter durations are overestimated, and the longer durations are underestimated.

At a higher level, the perception of rhythm may result from the repetition, at regular intervals, of groups of pulses where the pulses differ from each other in duration, intensity and/or pitch; or it may result from the repetition of pulses of equal duration, intensity and pitch, but where the intervals between pulses differ within a group although they are the same from group to group. The rhythmic grouping of such stimuli is called "objective rhythmization" (Fraisse, 1982). The ability to perceive the rhythmic groupings depends on both the number of elements in a group and the duration or time over which they are spread. As the number of elements in a group increases it becomes more difficult to maintain a perception of them as a unit. Sub-grouping of elements can increase the total number of elements perceived together. There is however a limit to the duration of pauses or intervals between groups or patterns.
If the intervals exceed 1800 ms (Koffka, 1909, cited in Fraisse, 1982), the groups are not perceived as a sequence or chain, but as isolated cases. There is also a limit to the total duration a group can have and maintain the perception of unity. As the number of elements in a group increases, this maximum total duration decreases; it is related to the limits of short-term memory storage, and has been named "psychological present" (Woodrow, 1951).

It is also known that rhythm can aid memory. Strings of digits, such as telephone numbers, are remembered more easily when spoken in clusters or rhythmic groupings than when spoken in sequences of single digits. Advertisers also know that rhythm or a tune, such as a jingle, enhances recall of words or slogans.

Differences in intensity or pitch can serve to accent certain elements in a sequence of repeated stimuli, dividing the sequence into groups and subgroups. Duration also is important in the grouping of elements.

"An important lengthening of a sound leads it to play the role of a pause. A slight lengthening of the duration of a sound makes it appear more intense and confers upon it the role of an accent. It then, most often, becomes the first element of a pattern." (Fraisse, 1982, p. 159)

The duration of the elements and of the intervals between them is the acoustic feature which contributes the most to the perception of rhythm. When studying and measuring rhythm, it is the parameter most often considered.

2.1.2 Rhythm in speech production. The subjective judgement that there is a regular rhythm underlying speech is both pervasive and persuasive. This judgement is quite consistent across many different studies, and appears to be easily made, even by naive subjects, suggesting that it is a prominent feature of speech in many languages. It demonstrates that the units of speech are generally perceived to occur at more or less regular intervals.
The concept of isochrony, the topic of this thesis, is defined as this regularity of occurrence of speech units. Some researchers have proposed a classification scheme of languages according to the subjective judgement of what the units of isochrony or regularity are (Pike, 1945; Abercrombie, 1967). This has led to the categories of syllable-timed and stress-timed languages. Some examples of these are French and Italian, and English and German, respectively. Mora-timed (the mora is the smallest speech unit in Japanese) and tone-timed languages have been added more recently.

The interdependent concepts of isochrony and stress- or syllable-timing are not universally accepted (Pointon, 1978; Wenk and Wioland, 1982). They have been the subject of many debates between proponents of regularity in speech, and those who contend that no such regularity exists. This disagreement has been a bonus to the field of rhythm, because it has spawned new research.

The stringent criteria for isochrony require that specific speech events occur at regular intervals. For example, in a language such as English, since it is classified as stress-timed, the stressed syllables (or perhaps their beat) would be equally spaced. Measurements of the durations of inter-stress or inter-syllable peak intervals have been made for many languages, and have shown that absolute equality of the durations does not exist (Classe, 1939; Shen & Peterson, 1962; Bolinger, 1965; Uldall, 1978; Hoequist, 1983). The variability seen in the durations of speech units is great, even under restricted conditions favoring isochrony (O'Connor, 1965). Some authors have taken the lack of strict isochrony in production to be evidence against the existence of isochrony. Other authors have adopted less stringent criteria for isochrony, which allow for some variability in the duration of "equal" intervals.
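
The kind of interval measurement described above can be illustrated with a minimal sketch. The onset times are invented for illustration and come from none of the studies cited; the coefficient of variation is used here as one convenient summary of interval variability, which is an assumption of this sketch rather than the measure used by the cited authors.

```python
from statistics import mean, pstdev

def intervals(onsets_ms):
    """Durations between successive onsets (e.g. stressed-syllable peaks)."""
    return [later - earlier for earlier, later in zip(onsets_ms, onsets_ms[1:])]

def coefficient_of_variation(durations):
    """Population standard deviation divided by the mean.

    A value of 0.0 corresponds to strict isochrony; larger values
    indicate more variable intervals.
    """
    return pstdev(durations) / mean(durations)

# Hypothetical onset times in milliseconds (not data from any cited study).
strict_onsets = [0, 500, 1000, 1500, 2000]
loose_onsets = [0, 450, 1010, 1480, 2000]

print(coefficient_of_variation(intervals(strict_onsets)))
print(round(coefficient_of_variation(intervals(loose_onsets)), 3))
```

Under the strict criterion only the first sequence would count as isochronous; under a "tendency toward isochrony" criterion, some small nonzero value would be tolerated, and where that tolerance should be set is exactly what the authors cited here disagree about.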
Classe (1939) and Hoequist (1983a, b), for example, refer to a tendency toward isochrony in the production and perception of speech.

As noted above, the inter-stress intervals in a stress-timed language are expected to be equal or near equal, regardless of the number of intervening unstressed syllables. The intervening unstressed syllables are presumed to be compressible, thus maintaining the duration of the whole unit relatively constant (Pike, 1945). Lehiste (1971), studying English, compared the duration of base words alone with base words in longer composite words and found that the stressed base portion of the word was also compressed as the number of syllables in the word increased. However, she found that the total duration of the word did increase with increasing number of syllables. Several other researchers have also shown that inter-stress duration (Lea, 1974a), and single-stress word duration (Nakatani, O'Connor and Aston, 1981), increase almost linearly as a function of number of syllables.

Lea (1974a) suggested that the hypothesis of alternating stress may explain some of the apparent inconsistencies between the notion of stress-timing and the experimental data. In this hypothesis, proposed by Chomsky and Halle (1968), stressed and unstressed syllables alternate in speech and delineate objective rhythmic groupings. These rhythmic groupings would be the units of isochrony, and would therefore be expected to be of constant duration. Lea proposed that the alternation pattern may reassert itself when the number of intervening unstressed syllables increases. This would result in some intervening syllables taking on secondary stress, dividing the inter-primary stress interval into several sub-intervals. Each sub-interval would have a duration isochronous with all other stress groups (containing primary or secondary stress).
The total duration between primary stresses with several intervening syllables would then be greater than that expected with no stress on the intervening syllables. Lea found support for this reasoning in his study. Syllables perceived as stressed by the listeners, or classified as stressed by his stressed-syllable location algorithm, occurred in inter-stress intervals containing several syllables.

Uldall (1978) found a similar effect in her study measuring the duration of rhythmic feet in rapid R.P. (Received Pronunciation, the accepted standard for British English). She found that the duration of feet increased very slightly with an increase in the number of syllables up to three syllables. However the duration of feet containing four syllables showed a significant increase. She proposed that the four-syllable feet operated as if they were divided into two two-syllable feet, despite carrying only one main stress.

Martin (1982) presents a slightly different view of rhythm and timing in speech. He proposes that all languages are stress-timed. The perceived difference between languages which are categorized as stress-timed and syllable-timed reflects, according to him, the perceptual prominence of the underlying rhythmic structure. For example in a language such as English, word stress is fixed in such a way that there are widely varying numbers of intervening syllables "crushed" together, which increases the perceptual prominence of stressed syllables. Conversely, in a language such as French, in which stress is claimed to be fixed and falling on the final syllable of each word, there are few (or no) unstressed syllables between stressed ones, thus decreasing the perceptual prominence of the stressed syllables and giving the impression of evenly spaced syllables.
Several other researchers have rejected the classifications of syllable- and stress-timing in favor of the concept that all languages are produced and perceived in rhythmic groups (Dauer, 1982; Wenk and Wioland, 1982). Wenk and Wioland propose that, in French, "lengthening of what is perceived as the final syllable in each group, whose vowel is generally unmarked by any intensity increment" serves as the accent to define rhythmic groups. They therefore proposed to characterize French as being "trailer-timed". In English, the "regular occurence of stronger syllables at the beginning of each group" defines the division into rhythmic groups. Therefore, English is proposed to be "leader-timed" (ibid, p. 214). Thus Wenk and Wioland appear to agree with Martin that the perceived rhythmic difference between French and English is due to the perceptual prominence of stress in each, though they propose a different reason for the difference in saliency.

Hoequist (1983b) presents a slightly different view in which the classification of languages is proposed to be based not on the type, or the unit, of isochrony used, but rather on the type of control of duration used. English is thus classified as a duration-compensating language, and Japanese is classified as a duration-controlling language.

Some of the above-mentioned papers describe studies yielding production data for the investigation of isochrony. In general, isochrony in production is not well supported by the durational data. There are some other possible influences on the duration of intervals which may offer a partial explanation for the irregularity seen in timing. Some influences affecting the duration of speech segments include phonetic quality, stress, and position in the utterance. These will be discussed in greater detail in section 2.2. Variation in the tempo of speech has also been suggested as a possible influence on segmental duration (Allen, 1973; Martin, 1982).
In general, an increase in tempo is correlated with a decrease in duration, and vice versa. However the relationship between duration and tempo is not quite so simple in speech, since the duration of pauses between segments must also be considered.

"Changes in speaking rate exert a complex influence on the durational patterns of a sentence. When speakers slow down, a good fraction of the extra duration goes into pauses (Goldman-Eisler, 1968). On the other hand, increases in speaking rate are accompanied by phonological and phonetic simplifications as well as differential shortening of vowels and consonants. There is a complex reorganization of the motor commands to the articulators such that consonantal gestures are strengthened as speaking rate is increased, but the motor commands for vowels are not enhanced (Gay et al., 1974)." (Klatt, 1976, p. 1210)

Consequently, changes in tempo over the course of an utterance may contribute to some of the durational irregularities seen in speech production data. Martin (1982) suggests that tempos can change within a rhythmic pattern without altering the perception of its rhythm. He states that this probably occurs in most cases, even though the change of tempo may be large enough to be recognized. He proposes that durational data be reanalyzed, comparing the actual syllable durations to durations postulated for a constant tempo, to uncover any systematic changes in tempo. The speech style of the speaker has also been found to have an influence on both the duration of segments and the duration of pauses (Duez, 1982).

2.1.3 Rhythm in speech perception. Perceptual experiments on natural speech are more supportive of the phenomenon of isochrony than the production data. Listeners tend to perceive speech as having a regular rhythm. It has been proposed that isochrony exists as a perceptual phenomenon (Lehiste, 1977). That is, despite actual objective differences, the subjective duration of syllables may be perceived as being regular.
Lehiste suggests that the durational differences seen in the production data may be below the perceptual threshold for duration. She found that judgements of the relative durations of sequences were more accurate for non-speech than speech stimuli. She took this as evidence that small durational differences in speech are often not perceived and that therefore, isochrony is a language-bound perceptual phenomenon. Other experimenters have not found significant differences between the perceptual judgements of speech and non-speech stimuli (D'Arcy, 1984).

Allen (1975) found that subjects judged stress intervals to be more regular than they actually were. He relates this to his findings with non-speech data which showed a tendency for listeners to overestimate short intervals and underestimate long intervals. Conversely, he also found a tendency to impose subjective rhythm on regular sequences (see section 2.1.1).

Morton, Marcus and Frankish (1976) used the technique of "imposed isochrony", in which subjects are asked to adjust the sequences to yield perceptual isochrony. They found that listeners judged acoustically isochronous digit sequences to be anisochronous. The results showed that the perceptually isochronous sequences which were created by the subjects were acoustically anisochronous. Fowler (1979) compared these perceptual measurements with the subjects' production of subjectively isochronous sequences. She found that the acoustic anisochrony in the sequences produced was the same as the acoustic anisochrony required to hear an utterance as perceptually isochronous. In a further study, Tuller and Fowler (1980) correlated articulatory measures with acoustic and perceptual measures of isochrony. Their results support Morton et al.'s proposal (1976) that judgements of isochrony made by a listener are based on the talker's articulation, as reflected in the acoustic signal.
D'Arcy (1984) also showed that acoustically decelerated non-speech and speech signals were perceived as being regular, while acoustically isochronous signals were perceived as being slowed down or speeded up. The acoustic signals in her study were decelerated or accelerated in a progressive fashion, following an exponential function. Different degrees of acceleration or deceleration were achieved by altering a single variable in this exponential function. D'Arcy was then able to establish more precisely the degree of deceleration which corresponds to the perception of isochrony. She also obtained an estimate of the increment or decrement from this degree of deceleration which is required for subjects to notice a departure from regularity, that is, the difference limen for systematic irregularity.

It is important to note here that different experimental methodologies may predispose subjects to employ different temporal processing mechanisms. Hibi (1983) proposed two processing mechanisms for the perception and production of temporal sequences. In the first, called "ongoing processing", the timing of an event is predicted and then checked against the actual event. This involves one-by-one processing of events and can occur only at low rates of succession (from 1.0 to 2.5 times per second). In the second, called "holistic processing", a regular timing pattern for a sequence is postulated and compared against the actual overall temporal pattern. This involves pattern-matching and can occur at higher rates of succession (from 4.0 to 7.0 times per second). In Morton et al.'s study subjects listened repeatedly to an utterance, concentrating on a single aspect of the utterance in order to adjust it to become isochronous. This task thus corresponds to Hibi's ongoing processing mechanism. Conversely, in D'Arcy's study subjects listened to short, rapid sequences once in order to make a judgement of regularity.
This task corresponds to Hibi's holistic processing mechanism.

2.1.4 Importance of rhythm. Rhythm, as previously noted, is a strongly perceptual phenomenon which develops very early in life. Indeed, infants learn to mimic the intonation and rhythmic patterns of speech very early, thereby indicating that it may play a role in speech perception; however, the importance of rhythm in speech is not completely understood. Martin (1982) suggests that patterning in speech provides redundant linguistic cues, important in speech intelligibility. Due to the rhythmic organization, anticipation of elements occurs and allows tracking of the sequence without continuous monitoring.

Suppose that some elements in a sequence are more informative than others. If these informative elements are nonadjacent and temporally predictable, then certain efficient perceptual strategies (e.g., attention cycling between input and processing) might be facilitated. Perception of concatenated sounds, on the other hand, would seem to require continuous attention. (Martin, 1982, p. 488)

The location of rhythmic stress beats and pauses in the utterance may also provide cues to syntactic bracketing, occurrences of coordination and subordination, and specific semantic structures (Lea, 1974b). According to Kozhevnikov and Chistovich (1965), the prosody of an utterance is important in the division of the utterance into syntagmas. Syntagmas are defined by these authors as the units of meaning at the phrase or sentence level, which, in articulatory terms, are distinguished from each other by pauses.

The linguistic importance of rhythm is generally difficult to resolve under normal conditions because of its redundancy with more obvious linguistic cues, such as the phonetic content. Distortion of the signal, either at the production or at the perception level, removes some of the information and more clearly demonstrates the potential contribution of rhythm to speech intelligibility.
Rhythm and intelligibility have been studied by Cherry and Wiley (1967) and by Holloway (1970), who replaced weakly-voiced segments with white noise to simulate speech under noisy conditions. This procedure maintained rhythmic continuity while removing part of the phonetic information. Despite the loss of segments, the speech was found to be intelligible. In contrast, if no white noise was added, resulting in rhythmic discontinuity, intelligibility of the speech was very low. Blesser (1972) spectrally inverted the speech signal, distorting segmentals while keeping the suprasegmentals intact. Subjects were then asked to attempt to communicate with each other, hearing only the distorted speech. He found that some subjects learned to communicate quite well in continuous discourse, which carries a lot of rhythmic information. This success did not extend to isolated syllables, words and sentences, which contain fewer rhythmic cues. It appears then that rhythm carries some of the linguistic message, but that under normal conditions rhythm is redundant.

The studies mentioned in the previous paragraph show the influence of normal rhythm on the intelligibility of speech. Studies of the effect of abnormal rhythm on intelligibility have also been undertaken. The speech of the deaf has been studied by many authors and consistently found to have errors of timing (Stark, 1979). A study of rhythm in the speech of the deaf by Hudgins and Numbers (1942) found that the most common errors were: unusual breath groups or inappropriate pausing, misplaced word accents or accenting of normally unaccented syllables, extraneous syllables, and omission of syllables from polysyllabic words. A study of the speech of the deaf (Levitt, Smith and Stromberg, 1974) showed a significant effect of abnormal rhythm on intelligibility. Another study, by John and Howarth (1965), showed improvement in the intelligibility of deaf children who were trained to speak with more normal timing patterns.
However, it may be difficult to separate, in the speech of the deaf, the effects of other errors, such as phonetic errors, from the effects of rhythm errors. Similarly, the intelligibility of non-native speakers may be affected by errors of both phonetics and rhythm. Huggins (1976, cited in Stark, 1979) artificially manipulated the timing of natural speech. His study also showed a marked reduction in intelligibility with increasingly unnatural rhythm.

Studies of aphasics have also pointed to the importance of rhythm and stress in speech production and perception. Goodglass, Fodor, and Schuloff (1967) studied imitation of unaccented words in three-word phrases and found that Broca's aphasics had greater difficulty with phrase-initial than with phrase-final unaccented words. The authors concluded that, for function words, prosody played a larger role than grammar in determining whether a word was retained, and that accented syllables were important for initiating and maintaining speech. The observation that rhythm or music can facilitate speech led to the development of the Melodic Intonation Therapy technique (Sparks, 1973). This technique, which uses exaggerated rhythm, has been found to be quite successful in improving the speech production of aphasics who are non-fluent but have good auditory comprehension. Mothers speaking to babies or young children also exaggerate their rhythm and intonation, in addition to using simple grammatical structures and frequent repetition. This appears to enhance the child's comprehension and facilitate language learning.

The linguistic importance of rhythm has implications for the fields of automatic recognition and synthesis of speech as well. The rules governing the frequency patterns of phonetic segments have been more precisely defined than the rules governing the parameters of rhythm and stress. Consequently, speech synthesizers which can reproduce phonetic patterns with reasonable accuracy and intelligibility are available.
However, without adequate rules for rhythm, most speech synthesizers simply concatenate phonemes, ignoring rhythm. The resulting sequences are perceived to be unnatural, but are in general fairly intelligible, due to the high redundancy of linguistic information. Algorithms for use in automatic speech recognition models such as analysis-by-synthesis (Bell, Fujisaki, Heinz, Stevens and House, 1961; Halle and Stevens, 1962) require rules for the analysis of sound segments and patterns, and language rules for extracting meaning from the sound segments. As noted above, the rules for phonetic segments are more precise than those for rhythm; consequently, limited success has been achieved under ideal conditions in which phonetic information alone is sufficient. As demonstrated above, in less than ideal situations in which speech is distorted, linguistic information provided by rhythm is important for intelligibility. Rules allowing the recognition and use of well-defined rhythmic patterns may help to extend the use of automatic recognition devices to speech obtained under imperfect conditions or from abnormal speakers. The enormous task of describing the rules of language concisely, but with enough detail to permit speech recognition, is likely to be the greatest barrier to the extended use of automatic recognition.

2.1.5 Measurement of rhythm. As outlined in section 2.1.1, the "measurement" of rhythm is a difficult task, hinging on the concept of what rhythm is. The definitions of rhythm previously given refer in general to patterns in time. From a subjective, perceptual viewpoint, at high rates of succession it is the overall pattern which is perceived, but at low rates of succession it is the individual units which are perceived (Hibi, 1983); thus judgements of rhythm can be made directly or indirectly. From an objective, physical viewpoint, the units making up the pattern, and the relationships between the units, can be measured.
Thus, rhythm in an utterance can be indirectly studied by measuring the time patterns of the rhythm of the acoustic units or of articulatory motions correlated with the acoustic units; specifically, measuring the duration of individual units and the tempo of sequences of units. Measurements of rhythm thus obtained may then be used to verify theories of the presence or absence of isochrony in speech perception and speech production. Most studies to date have been unable to find evidence for isochrony in speech production data, therefore apparently providing support for the lack of isochrony in speech. However the problem may not be so simple. It is hypothesized in this study that "distortion" of the intended utterance takes place during production. Consequently, the rhythm intended by a speaker may be obscured and not easily detected in the production data. Only when the rhythm in production data can be studied without this "distortion", that is by removing the effects of "distortion", can the question of the presence or absence of isochrony in speech be adequately addressed. Some of the factors known to distort the timing pattern of speech, and some ways of compensating for these distortions, are discussed in section 2.2. Researchers have used subjective measures of rhythm of events a great deal. These measures may be judgement of the overall rhythmicity (Fowler, 1979; D'Arcy, 1984), the adjustment of vowel duration in a synthesized word (Nooteboom, 1973), or the location of stressed syllables or stress beats by tapping or click matching (Allen, 1973a and b). In general, subjective judgements are quite consistent by a subject over time, and between subjects. Consequently, subjective judgements are regarded by many researchers as reliable measurements of rhythm. 
Researchers investigating rhythm in production have studied the duration or tempo of the acoustic signal and its articulatory correlates (Ladefoged, Draper, and Whitteridge, 1958 and Cooker, 1963 cited in Allen, 1972a; Benguerel, 1970; Tuller and Fowler, 1980). Some of the articulatory measures used include sub-glottal pressure, air flow, laryngeal tension, degree of jaw opening, and electro-myographic potentials of lip muscles and respiratory muscles. These can, in some cases, be compared with the subjective judgements of the same stimuli.

Direct measurement of the duration of these units, or the intervals between them, can be accomplished by various time measurement techniques. In the case of rhythms in sound, the spectrogram, which displays spectral distribution of energy versus time, and the mingogram, which plots the oscillogram of the speech signal (as well as several derived measures) versus time, have proved useful tools. The duration is determined by measuring, on the spectrogram or mingogram, the distance corresponding to a sound, or to the interval between sounds, and converting this distance to time. The resulting durations can be plotted in a histogram. Any regularity or periodicity present in the signal will be seen as a peak, or several "periodic peaks", in the histogram, corresponding to the periodicity present in the original signal and its multiples. This procedure is relatively easy for sounds which can be easily segmented, that is, sounds which have sharp boundaries. However, speech sounds, particularly sonorants and fricatives, do not always have sharp boundaries and can be very difficult to segment. Similarly, articulatory measures, such as jaw movement or chest muscle activity, may not be easily segmented due to the continuous nature of speech production. Coarticulatory effects are particularly pronounced with rapid tempo, making the divisions between the different sounds even less distinct.
Consequently, the direct measurement of rhythm may not have the degree of precision desired.

Dreher (1969) proposed the use of the autocorrelation function as an indirect measure of rhythm. Autocorrelation is a statistical measure which can be used to measure, or at least estimate, periodicity in a signal. In this procedure, the signal is correlated with itself, for successive values of the time-shift between the two copies. This procedure can be simply described by considering the case of a sine wave. The first stage of the autocorrelation is done with the two identical sine waves in phase with each other. As they are exact copies of each other, each point on the first sine wave will be perfectly and positively correlated with its copy. The value of the correlation is then 1. One of the sine waves is then shifted, by a constant increment, relative to its copy and the correlation is recomputed. The sine waves will now be slightly out of phase with each other; consequently the correlation will be less than one. This procedure is repeated for each value of the shift. When the shift is equal to one-quarter of the period of the sine wave, the sine waves will be 90° out of phase, with a correlation equal to zero. When the shift is equal to half the period of the sine wave, the sine waves will be 180° out of phase. The correlation will now be perfect but negative, with a value of -1. When the shift is equal to the period of the sine wave, the sine waves will again be in phase, with a perfect positive correlation of 1. The value of the correlation will thus vary between 1 and -1 as the shift increases through even and odd multiples of the half-period. The periodicity of the autocorrelation function corresponds, in this example, to the periodicity of the original signal. A histogram can be plotted showing the interval between peaks, if any, of the autocorrelation function.
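The sine-wave case just described can be checked numerically. The following is a minimal sketch, not Dreher's (1969) procedure: the period, window length and maximum shift are illustrative assumptions, the function name is hypothetical, and the comparison window is deliberately chosen as a whole number of periods so that the correlations come out at the values given in the text.

```python
import math

PERIOD = 100     # samples per cycle of the test sine wave (illustrative assumption)
WINDOW = 400     # comparison window; a whole number of periods (assumption)
MAX_SHIFT = 350  # largest time-shift examined, in samples (assumption)

# Sine wave long enough to hold the window at every shift.
signal = [math.sin(2 * math.pi * n / PERIOD) for n in range(WINDOW + MAX_SHIFT)]


def correlation_at_shift(x, shift, window):
    """Pearson correlation between x[0:window] and the copy x[shift:shift+window]."""
    a = x[:window]
    b = x[shift:shift + window]
    mean_a = sum(a) / window
    mean_b = sum(b) / window
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


# Autocorrelation function: one correlation value per shift.
autocorr = [correlation_at_shift(signal, s, WINDOW) for s in range(MAX_SHIFT + 1)]

# Interior local maxima of the autocorrelation function, and the intervals between them.
peaks = [s for s in range(1, MAX_SHIFT)
         if autocorr[s] > autocorr[s - 1] and autocorr[s] > autocorr[s + 1]]
intervals = [later - earlier for earlier, later in zip(peaks, peaks[1:])]

for shift in (0, PERIOD // 4, PERIOD // 2, PERIOD):
    print(f"shift {shift:3d} samples: correlation {autocorr[shift]:+.3f}")
print("peak positions:", peaks)
print("between-peak intervals:", intervals)
```

With these settings the correlation is 1 at zero shift, 0 at a quarter period, -1 at half a period and 1 again at a full period, and the peaks of the autocorrelation function are spaced one period (100 samples) apart.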
For a sine wave, which is perfectly periodic, all the between-peak intervals will be of equal length, therefore all values will fall in the same bin of the histogram. The shift value for this bin will be equal to the length of the period.

This procedure can be applied similarly to rectified speech amplitude envelopes. In a rectified envelope, there are no negative values of the amplitude, therefore the value of the correlation will vary between +1 and 0. The periodicity (or lack thereof) of the speech amplitude envelope will be reflected in the histogram of the autocorrelation peak intervals. The more periodic the original signal, the narrower the histogram. Conversely, the less periodic the signal, the broader the histogram. The autocorrelation function thus provides a tool to obtain information on the periodicity of the speaker's utterances. It allows computer analysis of long sequences of speech, rather than the time-consuming measurement of individual durations which are "polluted" anyway by other factors. The use of this technique on the speech amplitude envelope also bypasses the need for precise segmentation, since no measurement of the interval between specific syllables is required. The use of the autocorrelation function over long strings also minimizes the influence of small, insignificant variations in duration on the overall measure of rhythm.

2.2 Duration of syllables

2.2.1 Introduction. Duration is the primary acoustic parameter involved in the perception of rhythm, therefore it is measured when studying the rhythm of a speech signal. However, studies have shown that the duration of a syllable also varies according to a number of influences. It is proposed that these influences may be represented by the following structure. First, linguistic programming of the duration of each segment must occur in the formulation of the utterance.
While this is a logical notion, there is no direct evidence for its existence, nor any clues as to what form it may take. Consequently, it is proposed here as a stage at which pre-planned elements of speech may be organized. It is at this stage that the overall grammatical structure, individual words, and individual segments may be selected. Intrinsic duration (Peterson and Lehiste, 1960; Klatt, 1975, 1976), durational influences of syntactic structure (Benguerel, 1970; Lindblom and Rapp, 1973; Oller, 1973; Klatt, 1975, 1976) and emphatic stress (Benguerel, 1970; Fry, 1958; Oller, 1973; Klatt, 1975, 1976) are encoded in the pre-programming. Second, the neural execution of each segment in the linguistic programming (Ohala, 1973) takes place. This may follow an underlying, programmed rhythm or may be simply sequential. Finally, the individual segments which form the spoken sequence are realized articulatorily. At this stage the segmental durations realized may be influenced by phonetic environment and limitations imposed by the articulators (i.e. coarticulatory influences) (Peterson and Lehiste, 1960; House, 1961; Klatt, 1975, 1976), by the number of syllables in a word (Nooteboom, 1972; Lindblom and Rapp, 1973; Lehiste, 1975; Oller, 1973) and by random timing errors. The influence of each of these three stages is discussed further below.

2.2.2 Linguistic programming. It appears that some of the factors influencing duration may be introduced during the linguistic programming of the intended message. One such factor is the intrinsic duration of phonemes. Peterson and Lehiste (1960), in a study of the duration of vowels and diphthongs in American English, measured the average duration of syllable nuclei in minimal pairs differing in the voicing of the final consonant. Their results showed two distinct groups of syllable nuclei: intrinsically short syllable nuclei and intrinsically long syllable nuclei.
A factor analysis of durational data collected by Klatt (1975) "showed that 56% of the variance in vowel duration is explained by assigning different inherent durations to each vowel category" (p. 136). Klatt (1976) found that there are also small differences in the inherent duration of consonants, seen as a function of place of articulation. It appears that there are consistent, pre-set values for the duration of vowels and consonants. These values may be programmed at the linguistic level when a particular phoneme is selected.

The syntactic structure of an utterance is also programmed at the linguistic level, and also influences the duration of syllables. Benguerel (1970) studied unemphatic and emphatic stress in French.

Linguistically, unemphatic stress is generally described as being a prominence of some kind falling on the last syllable of an isolated word, or on the last syllable of each group of words in the case of a longer utterance. These groups of words are called rhythmic groups, sense groups or breath groups. Emphatic stress on the other hand is a supplementary stress which may occur in addition to the unemphatic stress. Its purpose is to give prominence to a particular word in the utterance. It falls on a syllable other than the last one when the word or group of words has two syllables or more. (p. 8)

In his study, Benguerel found that the best correlate of unemphatic stress was lengthening of the last syllable of each rhythmic group. He proposed that this final lengthening phenomenon, which is generally called "stress" or "accent", is misnamed and would be better described as "prominence". Oller (1973) measured the duration of vowels and consonants, in nonsense words pronounced by English speakers, as a function of the position of the syllable and the type of sentence. He also found a strong tendency for vowel duration to increase toward the end of the word, with the greatest lengthening on the final syllable.
In addition to this, his data showed a tendency for a slight lengthening of the initial consonant of a word relative to the medial consonant. Experimenters studying several languages, such as English (Lehiste, 1971; Klatt, 1975), Dutch (Nooteboom, 1972) and Swedish (Lindblom and Rapp, 1973), have also found syllable lengthening in final position in a word and in final position in a phrase or sentence.

Benguerel (1970) also noted the influence of intonation on unemphatic stress or phrase-final lengthening. In his study, he found less lengthening on the final syllable for rising intonation than for falling intonation, this effect being greatest for open syllables. He proposed that non-emphatic stress in French is marked by increased duration when the intonation is falling, but is marked in part by the intonation when the intonation is rising. Conversely, Oller (1973), studying English, found greater final-syllable lengthening with rising intonation than with falling intonation.

Klatt (1975) noted several possible explanations for the lengthening found at syntactic boundaries. The lengthening may provide acoustic cues to the listener for the location of phrase and word boundaries. Another explanation is that the lengthening is simply a side-effect of constraints on the processing employed by the speaker. Lindblom (1974, cited in Klatt, 1975) hypothesized that processing involved a phrase buffer. The number of words remaining in the buffer may determine the speaking rate. Oller (1973) hypothesized that planning of the next phrase occurs during the production of the current phrase. He proposed that production of the phrase slows down to give the speaker time to process the next phrase.

The use of duration for semantic distinctions and emphasis, which may modify the meaning of an utterance, may also be introduced in the linguistic programming stage. In English, stress is also used to make semantic distinctions.
An example is the noun "subject" (stressed on the first syllable) versus the verb "subject" (stressed on the second syllable). Similarly, the location of the syntactic boundaries, as signalled by segment duration, distinguishes the two (three) possible meanings of the phrase "Two thousand year old horses", and the meaning of "I scream" versus "ice cream" (Lindblom and Rapp, 1973). Fry (1958) manipulated duration in synthesized speech stimuli and found that it acted as a strong cue in judgements between nouns and verbs, such as the previous examples.

Emphatic stress, as stated above, is the use of stress on a word not usually stressed by the regular rhythm of speech, for the purpose of emphasis. It is correlated with an increase in acoustic intensity, fundamental frequency, and syllable duration (Benguerel, 1970; Lindblom and Rapp, 1973; Klatt, 1976). For example, the sentence "John is in Hawaii" can have several slightly different interpretations, reflected in the location of emphatic stress, depending on what question is being answered by this statement. Possible questions include "Where is John?" or "Who is in Hawaii?".

2.2.3 Neural organization. Ohala (1973) hypothesized three possible types of neural organization. The first hypothesis is that of a sequential organization in which there is no central control of timing or rhythm. Segments are simply produced in succession, or concatenated. The afferent feedback indicating the beginning of articulation of a syllable is the trigger for the execution of the next syllable (Kozhevnikov and Chistovich, 1965). This has also been called the chain model (Ohala, 1973) or the closed-loop model (Kozhevnikov and Chistovich, 1965). The second hypothesis is that of scheduling (Ohala, 1973), in which there is an underlying, pre-established time schedule, which is not necessarily isochronous or periodic. In this model, afferent feedback has no influence on the execution of the program.
This has also been called the comb model (Ohala, 1973) or the open-loop model (Kozhevnikov and Chistovich, 1965). The third hypothesis also involves scheduling, but in this case there is a regular or periodic rhythm. Again, afferent feedback has no influence. This is the organization which would be necessary for the concepts of isochrony and syllable- or stress-timing to be valid. A number of studies have been undertaken to provide evidence for this type of organization; however, results have been negative so far.

Allen (1973) proposed two levels of speech at which rhythmic organization operates. First, there is the short-term or phrase level, at which the rhythmic beats are associated with the syllables or feet of the phrase. At this level the rhythmic organization influences the duration of the segments. An articulatory correlate of the syllabic or speech rate, and consequently of the average duration of syllables, is the frequency of movement of the articulators (lip, tongue or velum) during syllable production. The maximum rate of syllable production ranges from a mean of 6.7 syllables per second for the velum or tongue to 8.2 syllables per second for the tip of the tongue (Hudgins and Stetson, 1937, cited in Lehiste, 1970). This may impose articulatory limitations on the range of short-term rhythmic rates. The second level proposed by Allen (1973) is the long-term or discourse level, at which the rhythmic beats are associated with the pauses in speech. At this level the rhythmic organization influences the rate of speech, and extends over a large number of utterances produced by the speaker. An articulatory correlate of the rate of pauses is the rate of breathing, which has a low frequency corresponding to durations of 1.5 to 7.2 seconds for the expiratory phase during speech (Fónagy and Magdics, 1960); it may impose articulatory limitations on the long-term rhythmic rate and the duration of breath groups.
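As a rough arithmetic aid, the maximum articulation rates quoted above can be restated as minimum mean syllable durations (duration = 1 / rate). The sketch below is illustrative only: the function name is hypothetical, and the conversion assumes that syllables follow one another without intervening pauses.

```python
def mean_duration_ms(rate_per_second):
    """Mean duration in ms of one unit produced at a steady rate (no pauses assumed)."""
    return 1000.0 / rate_per_second


# Maximum syllable rates reported by Hudgins and Stetson (1937, cited in Lehiste, 1970).
for rate in (6.7, 8.2):
    print(f"{rate} syllables/s -> about {mean_duration_ms(rate):.0f} ms per syllable")
```

On these figures, the mean syllable duration at maximum rate is roughly 149 ms for the slower articulators and 122 ms for the tip of the tongue, which gives a sense of the articulatory limit on short-term rhythmic rates.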
The syllabic rate affects the duration of segments at the short-term level. As the syllabic rate increases, in order to produce the same number of syllables in a shorter amount of time, either the syllable durations, or the pause durations, or both, must decrease. Similarly, the duration of segments affects the rate of speech. This counter-influence must be taken into account, that is, investigators should attempt to hold one variable constant when making measurements at the other level.

A number of investigators have attempted to produce experimental evidence to determine which type of organization characterizes speech (Kozhevnikov and Chistovich, 1965; Allen, 1969; Lehiste, 1971 and 1972; Ohala, 1973); however, the results obtained have been generally inconclusive. Ohala (1973) stated that there is evidence for the perceptual significance of timing at the short-term level, but not at the long-term level. He therefore proposed a hybrid model of rhythmic organization in which short-term timing follows the comb model, while long-term timing follows the chain model.

2.2.4 Articulatory realization. The duration of segments in the final speech output depends on the ability of the articulators to carry out the neural instructions that they receive. Physical limitations and the influence of adjacent segments on the articulators may affect the final realization of the segments. Coarticulation is "the overlapping of adjacent articulations" (Ladefoged, 1975, p. 48). The overlapping influence on a segment may be anticipatory, that is backward from following segments, or perseverative, that is forward from preceding segments.
It is proposed that there are three types of coarticulatory influences: those which are physiologically determined, and are therefore universal and innate; those which are specific to a particular language; and those influences which have become part of the phonology of a particular language or dialect, and must therefore be learned by speakers of the language. Peterson and Lehiste studied the effect of preceding and following consonants and found

... that the durations of all syllable nuclei in English are significantly affected by the nature of the consonants that follow the syllable nuclei; the influence of the initial consonants upon the durations of the syllable nuclei appears to be negligible. In general, the syllable nucleus is shorter when followed by a voiceless consonant, and longer when followed by a voiced consonant. In a large number of minimal pairs of CNC (consonant-syllable nucleus-consonant) words differing in the voicing of the final consonant, the ratio of the durations of the vowels was approximately 2:3, the syllable nucleus before the voiced consonant being longer in every case. As a class, plosives are preceded by the shortest syllable nuclei; nasals had approximately the same influence as voiced plosives. Syllable nuclei were longest before voiced fricatives. (Peterson and Lehiste, 1960, p. 702)

Lengthening of vowels preceding voiced consonants in English was also found by House (1961) and Klatt (1975). House (1961) divided the lengthening influences on English vowels into two groups, "primary lengthening" and "secondary lengthening", on the basis of degree of lengthening. "Primary lengthening" describes the lengthening of tense vowels versus lax vowels, and of vowels before voiced consonants versus vowels before voiceless consonants. "Secondary lengthening" describes the lengthening of open vowels versus closed vowels, and of vowels before fricatives versus vowels before stops and affricates.
House proposes, on the basis of some cross-linguistic studies, that the primary lengthening influences in English are a part of the phonological system of the language and must be learned; secondary lengthening influences in English, on the other hand, are a function of the articulatory process itself, and are thus universal.

The number of syllables in a word or phrase may influence segmental duration by straining the accuracy with which the articulators can produce a syllable in a long sequence. Nooteboom (1972), Lindblom and Rapp (1973) and Lehiste (1975) studied segment and syllable duration as a function of the number of syllables in the word by comparing the duration of a syllable occurring in isolation with the duration of the same syllable occurring as the root of a word. All of these authors found that the duration of a syllable decreased when the number of following syllables increased. Nakatani et al. (1981) found a linear increase in word duration as a function of number of syllables. Their interpretation of this finding was that the addition of a syllable adds a constant amount to the duration of the word. However, since data on the duration of syllables as a function of number of syllables were not presented, it is difficult to make comparisons with the data of the previous authors. Oller's study (1973) did not show any consistent changes in duration with the number of syllables. The discrepancy between his results and those of Lehiste may be explained in part by the differences in the data studied, such as the use of nonsense words versus real words, and the comparison of the duration of initial segments versus comparison of final segments in words of different lengths. However, Oller notes that his data show that final syllable lengthening occurs even in monosyllabic words. Consequently, in comparisons between one-syllable and two-syllable words, an apparent decrease in syllable duration may be seen. The data of Nakatani et al.
(1981) also show final lengthening in monosyllabic words. However, this explanation cannot entirely account for the decrease in duration of the first segment seen in words of three or more syllables.

2.2.5 Experimental techniques.

Some of the above influences on duration can be avoided or compensated for by careful experimental techniques. This permits the experimenter to remove as many as possible of the variables that are irrelevant to the particular study, making the effects of the relevant variables more distinct. For the purposes of this study, durational effects related to the overall rhythm of an utterance are relevant, while lower level or segmental durational effects are irrelevant. In order to avoid the effects of the intrinsic duration of vowels and the influence of the following vowel, Oller (1973) used nonsense words made of repetitions of the same CV or CVC syllable. Similarly, Nakatani et al. (1981) used reiterant speech in which a constant syllable is substituted for the actual syllables of real words, preserving the prosodic features of normal speech but eliminating phonemic durational influences. In a study by Benguerel (1970) using natural speech, the words were carefully chosen to ensure that the specific syllables to be compared have the same phonetic structure. Similarly, Lehiste (1975) compared the duration of a phonetic segment as a single word with the same phonetic segment as the base of a longer word.

2.2.6 Perception of duration.

As previously discussed, there are variables which have a predictable influence on the duration of segments. The question now is: are the differences in duration caused by these influences perceived by the listener? It may be that some of the variability in duration is below perceptual threshold, or that the perceptual system has correction factors which it applies to the acoustic duration of segments when determining the perceptual duration of these segments.
Klatt (1976) suggests that durational effects due to production constraints may be compensated for by the listener, or may be below the perceptual threshold for duration, but that lengthening which serves as a perceptual cue to syntactic boundaries may be left intact by this compensation. Huggins (1968, 1972) found that changes in the duration of phonetic segments of approximately 20 ms for vowels, and 40 ms for consonants, were possible before subjects reported an unnatural sentence timing pattern. He suggested that this means speakers must time the duration of vowels more accurately than the duration of stop gaps. Klatt (1976) studied the just-noticeable-difference (jnd) for duration and found it to be around 25 ms. Lehiste (1978) measured the durational change, as a function of position of interval, required for a segment to be reliably judged as "longer" or "shorter" in both speech and non-speech data. She found that durational changes between 30 and 100 ms were required for consistent judgements, the smallest durational change being required in medial position, and larger changes in initial, then final position. Hibi (1983) found that the degree of temporal distortion in a sequence which is detected 50% of the time is about 6% at rates slower than 3 Hz, and 7.6 to 8.9% at rates faster than 4 Hz. This corresponds to a durational change of approximately 20 ms.

While there is some variation between the values of jnd reported by the above experimenters, the smallest jnd appears to be in the range of 20 to 30 ms. This will be taken as the lower limit of jnd. Some of the durational differences found in studies of production data are smaller than the jnd and therefore would not be expected to be perceptually significant. However, many of the durational differences which are greater than the jnd do not result in perceived differences in timing. This indicates that there must be some compensation in the perceptual system for these differences.
Lehiste and Peterson (1959), in their study of the relationship between the amplitude of vowels and the perceived intensity of those vowels, found that the amplitude of a vowel was greatly influenced by its phonetic quality. When different vowels were produced with equal effort, they were not of equal acoustic amplitude. It appears that certain vowels have greater intrinsic amplitude than others. Listeners rated the vowels as being of equal loudness if they were produced with equal effort, regardless of the actual acoustic amplitude. Lehiste and Peterson suggest that perceptual judgements of intended stress may be made by applying a correction factor to the vowel amplitude, according to its phonetic quality. Thus, the perceptual system must compensate for the consistent or predictable irregularities in the production of speech.

Tuller and Fowler (1980) looked at the relationship between articulatory, acoustic and perceptual measures of isochrony. They found that listeners' judgements of isochrony are better correlated with articulatory isochrony than with acoustic isochrony. They suggested that perceptual judgements of isochrony are thus "articulation-referential". There appears to be a general trend that a listener's identification of segments matches articulatory descriptions of the segments more closely than acoustic descriptions, for example, the judgement of equal effort described in the previous paragraph (Lehiste and Peterson, 1959). The motor theory of speech perception (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967) has proposed that listeners identify sounds by accessing their own production systems. This procedure may be described as a kind of analysis-by-synthesis. While this particular model is not accepted by all authors, the characterization of speech perception as articulation-referential is more widely accepted (Tuller and Fowler, 1980).
Another possibility is that articulation and perception have a common or overlapping neural processing system. Constraints on the production system at the motor level (articulators) or at the neural level may influence perception. Similarly, constraints on the perceptual system at the receptor (auditory system) or at the neural level may influence production. The acoustic signal is the "middle man" between the production and perception of a speech signal. A trend found in the perception of an acoustic signal is likely to be a reflection of, or a compensation for, production influences on the acoustic signal.

The current study proposes to demonstrate isochrony in speech production data. The durational influences of phrase-final lengthening and number of syllables will be compensated for in an attempt to expose the underlying rhythm of speech. In compensating for phrase-final lengthening, a measure of the degree of deceleration in speech production will be obtained. If the concept of articulation-referential perception is valid, it is expected that the degree of deceleration as measured by this study will be close to the degree of deceleration found by D'Arcy (1984) in her study on the perception of isochrony.

Chapter 3: Outline of Experimental Aims

3.1 Description of problem and proposed solution.

The phenomenon of isochrony has been quite well established in speech perception. As there appear to be many cases in which the production and perception of a behaviour are somehow inter-related, one expects parallel expression of the underlying speech structure in speech production and speech perception. Consequently the existence of isochrony in speech production is an appealing hypothesis. However, to date, acoustic studies of speech production have not satisfactorily proven or disproven this hypothesis.
In speech production data, distorting influences on the duration of segments obscure the underlying speech structure, preventing the resolution of the hypothesis of isochrony in speech production. A number of studies have attempted to remove or compensate for some of the durational influences, such as intrinsic duration (Oller, 1973; Nakatani et al., 1981). However, these attempts have not provided conclusive findings. One influence which has not yet been compensated for in studies of rhythm in production data is phrase-final lengthening. As discussed in section 2.3.2, the duration of the final syllable of a phrase is as much as 50% greater than the duration of the penultimate syllable. This distortion, or time-warping, may therefore account for a great deal of the variability found in the duration of units proposed to be isochronous. Consequently, removing this distortion, or "de-warping" speech production data, will have a significant effect on the time pattern of the data, possibly revealing the rhythmic nature of speech.

Therefore, in this study the distorting influence of primary interest is phrase-final lengthening, in which the duration of a syllable is a function of its position in a phrase. Of secondary interest is the influence of word or phrase size, in which the duration of a syllable is a function of the number of syllables in the phrase. The other distorting influences discussed in section 2.3, such as intrinsic duration, will not be taken into account. They will be assumed to have a minor effect, or to be averaged out in the current study.

In order to determine the de-warping required, it was necessary first to determine the degree of warping present in speech production. Production studies which give data relating to the two influences mentioned above, i.e. phrase-final lengthening and phrase size, were examined and the data compiled with data from this study's preliminary investigations.
The studies examined represent several different languages: English, French, Swedish and Dutch. Each shows evidence of phrase-final lengthening, indicating that it may be a universal phenomenon. Therefore it seems appropriate to study the data as a group, disregarding language differences. A function to define phrase-final warping was developed, and a dewarping algorithm, which was the inverse of the warping function, was devised to compensate for the effects of the two influences mentioned.

The first objective of this study is to find stronger evidence for isochrony in speech production. In an attempt to achieve this, the distortion caused by phrase-final lengthening, and by word or phrase size, will be removed. The second objective is to determine if a difference in timing-type exists between French (traditionally categorized as syllable-timed) and English (traditionally categorized as stress-timed). The third objective is to compare the degree of dewarping required to restore (acoustic) regularity, found in the current speech production study, with the degree of acoustic warping required for the perception of regularity, found in D'Arcy's (1984) perception study.

The speech data in the main study consist of recordings of political speeches by two prominent politicians: Charles de Gaulle, in French, and Richard Nixon, in English. Political speeches were chosen because of the greater length of silent pauses, which facilitates division into breath groups, and the near absence of non-silent pauses (Duez, 1982). These recordings will be rectified, lowpass filtered and sampled onto digital computer tapes, yielding speech amplitude envelopes. A dewarping algorithm, derived from the warping function used by D'Arcy (1984), will be applied to the speech amplitude envelopes. This algorithm is a function of the syllable position in the utterance, the degree of warping a and the number of syllables b. Several pairs of a and b values will be tested.
A measure of the periodicity of the resulting dewarped amplitude envelope will then be determined from the autocorrelation function of this envelope. The use of the autocorrelation function will eliminate the tedious task of locating syllable boundaries and making direct measurements of the syllable durations. The amplitude envelope will be peak-clipped to remove some of the irrelevant amplitude variability due to influences such as phonetic quality and stress. Finally, a period histogram will be compiled for each combination of speaker and pair of a and b, and statistical analysis of these will be carried out to determine the most regular data.

Chapter 4: Experimental Technique

4.1 Evaluation of trends in existing data.

4.1.1 Phrase-final lengthening.

In her study, D'Arcy (1984) proposed a formula to compute the durations of syllables, as a function of position in utterance, based on the production data of Oller (1973). The exponential function arrived at, hereafter called the D-function, was:

$D = N_s \exp(a \, x^{n z})$    (eq. 1)

where
D = syllable duration in samples
x = a parameter depending on the position of the syllable in the utterance, which varies linearly and discretely between 0 (first syllable) and 1 (last syllable). That is, $x = i / (\text{number of syllables} - 1)$, where i = the position in utterance of each syllable, relative to the first syllable.
Ns = number of samples in first syllable
a = scale factor for the exponent of the exponential
n = exponent of x (when n*z = 2, D has the form of the Gaussian function)
z = adjusts the value of the exponential in such a way that, for x = 1 (i.e. last syllable), it will always have the same value for a given a-value, regardless of the value of n

Thus D is a function of the position of the syllable in the sequence (i) and of the degree of final-syllable lengthening (a). Using the values n = 5.75 and z = .693, the value of a was varied to give different degrees of warping.
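The D-function can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the exponent is read as x raised to the power n*z, the name `d_function` is hypothetical, and the example values (six syllables, a = 0.194, Ns = 60 samples) are those reported in the surrounding text for Figure 1.

```python
import math

def d_function(num_syllables, a, ns, n=5.75, z=0.693):
    """Predicted duration (in samples) of each syllable in an utterance.

    Assumed reading of eq. 1: D = Ns * exp(a * x**(n*z)), where x runs
    linearly from 0 (first syllable) to 1 (last syllable).
    """
    if num_syllables < 2:
        raise ValueError("x is undefined for fewer than two syllables")
    durations = []
    for i in range(num_syllables):
        x = i / (num_syllables - 1)          # relative position, 0..1
        durations.append(ns * math.exp(a * x ** (n * z)))
    return durations

# Six-syllable sequence with the mean a-value perceived as isochronous.
durations = d_function(num_syllables=6, a=0.194, ns=60)
for position, duration in enumerate(durations):
    print(position, round(duration, 1))
```

Under this reading the first syllable keeps the duration Ns and the last syllable is lengthened by the factor exp(a), whatever the values of n and z.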
The mean value of a which yielded sequences of nonsense syllables or clicks perceived as isochronous by subjects in D'Arcy's study was 0.194. The estimate of the difference limen (DL) for acceleration and deceleration she obtained was Δa = .30. Figure 1 shows the syllable durations of a six-syllable sequence obtained for a = .194. The value of Ns was determined by dividing the sampling rate to be used in the present study (400 Hz) by the average syllable rate of conversational speech (approximately seven syllables per second according to Gerber, 1974), thus Ns = 60 samples.

The production data of Lindblom and Rapp (1973), Nooteboom (1973), Oller (1973) and Klatt (1975; 1976) were plotted as a function of position in utterance. The data followed the general shape of the D-function fairly well. The D-function was also shown by D'Arcy's study to be appropriate for perceptual experiments. It was therefore decided to use the D-function as the basis for the dewarping algorithm for the current study.

4.1.2 Number of syllables.

Lehiste (1971), Nooteboom (1972) and Lindblom and Rapp (1973) found a trend toward increasing duration of syllables as a function of decreasing number of syllables per word and number of main stresses per phrase. The results of Lindblom and Rapp, who measured the duration of syllables in words one to four syllables long, and in phrases with one to four main stresses, are replotted in Figure 2. This graph shows clearly the two simultaneous influences on syllable duration. The curves with positive slopes join successive syllables from the same utterance and demonstrate the influence of phrase-final lengthening (corresponding to the variable a). The dashed curves with negative slopes join syllables occupying the same position relative to the end of an utterance and show the influence of number of syllables.
This influence will be incorporated into the D-function as the variable b; however, the magnitude and range of its influence must be determined in order to establish the form of the b variable, and the values to be used.

FIGURE 1: D-function for 6 syllables. (Axes: syllable position; syllable duration in samples.)

FIGURE 2: Duration of stressed syllables as a function of position in phrase and number of syllables per word. From Lindblom and Rapp (1973), Figure 7. (Key: words with one, two, three and four syllables; extrapolation.)

The data of Lindblom and Rapp were extrapolated, as indicated in Figure 2, to approximate the syllable durations for words with 6 and 8 syllables. The production data of Lehiste and Nooteboom were studied to determine the magnitude of the effect of increasing number of syllables. Due to the many gaps in the data base from the preceding studies, investigations of simplified speech data (described in Section 4.2) and of connected French speech (described in Section 4.3) were carried out to estimate the influence of word length. The data from these five sources were combined to approximate the magnitude of the word-length (or phrase-length) influence and derive the dewarping formula (Section 4.4).

4.2 Preliminary investigation I.

4.2.1 Data collection.

In order to further investigate the influence of number of syllables, a preliminary study was carried out. In this study a native speaker of French repeated the syllable /ta/ in sequences of two to twelve syllables. Repetition of a single syllable was employed to avoid the effects of intrinsic duration. A voiceless consonant was used in the syllable to make the segmentation of the rectified speech wave easier. The speaker attempted to maintain constant pitch, stress and timing throughout the recording session.
To promote use of regular rhythm, the speaker listened intermittently to a 7-Hz click through headphones during the recording session. Recording was carried out in a quiet room, using high-fidelity equipment; the microphone was held approximately 6 inches from the mouth. At least ten tokens of each utterance length were obtained.

4.2.2 Data analysis.

The speech was then rectified, lowpass filtered at 50 Hz to smooth the wave form and sampled, at a rate of 400 Hz, onto digital tapes by a PDP-12 digital computer, using a set of computer programmes written by Lloyd Rice at the U.C.L.A. Phonetics Laboratory. Using these computer programmes, the resulting speech amplitude envelopes were inspected visually. The durations, in samples, of each pause and each syllable were measured manually. A pause was taken to be a stretch of the envelope, at least 140 samples (i.e. 350 ms) long, for which the amplitude fell below a threshold value representing the level of background noise. The boundaries between syllables were easily located since each syllable consisted of a voiceless consonant [t] and a vowel; thus the amplitude of the speech envelope was at a maximum during voicing of the vowel, and at a minimum during closure of the consonant. The between-syllable boundaries were thus taken to be the point at which the amplitude decreased below the threshold, indicating the end of voicing of the vowel. It was not possible, however, to determine the onset of the consonant of the first syllable of each group, since the closure of the voiceless stop cannot be distinguished from the silence during pauses. Consequently, the durations of the initial syllables were not measured directly, but were estimated.

4.2.3 Results.

The syllable durations were plotted as a function of both position in utterance and utterance length. As can be seen in Figure 3, they follow the same basic trends as the data of Lindblom & Rapp for both syllable position and number of syllables.
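The pause criterion of section 4.2.2 (a stretch of at least 140 samples, i.e. 0.35 s at the 400 Hz sampling rate, during which the envelope stays below a noise threshold) was applied manually in this study. A minimal Python sketch of the same criterion follows; the function name `find_pauses` and the toy envelope are hypothetical and are not part of the original programmes.

```python
def find_pauses(envelope, threshold, min_len=140):
    """Return (start, end) index pairs of pauses; end is exclusive.

    A pause is a run of at least min_len consecutive samples whose
    amplitude is below threshold.
    """
    pauses = []
    start = None                      # start index of the current quiet run
    for idx, value in enumerate(envelope):
        if value < threshold:
            if start is None:
                start = idx
        else:
            if start is not None and idx - start >= min_len:
                pauses.append((start, idx))
            start = None
    # A quiet run may extend to the very end of the envelope.
    if start is not None and len(envelope) - start >= min_len:
        pauses.append((start, len(envelope)))
    return pauses

# Toy envelope: speech, a 150-sample silence, speech, a 20-sample stop closure.
envelope = [5] * 50 + [0] * 150 + [5] * 50 + [0] * 20 + [5] * 30
pauses = find_pauses(envelope, threshold=1)
print(pauses)
```

In the toy example only the 150-sample silence qualifies as a pause; the 20-sample dip is too short and would be treated as part of an articulated sequence, in the way a stop closure between syllables is.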
FIGURE 3: Syllable durations found in preliminary investigation I. (Axis: syllable position. Key: 3-, 4-, 5-, 6-, 7-, 8- and 9-syllable words.)

FIGURE 4: Computed values of syllable durations for preliminary investigation I.

4.3 Preliminary investigation II.

4.3.1 Data collection and analysis.

This investigation was carried out to verify the warping influences on syllable durations in the data to be studied, and to estimate the syllable durations of the unwarped syllables (that is, phrase-initial syllables) for use as the target syllable durations in the dewarping program. Tapes of French political figures Charles de Gaulle and Jean-Louis Tixier-Vignancour were selected. The tapes were screened auditorily by the investigator to select sections which sounded "regular", had low levels of background noise and no interruptions. These sections were then analyzed following the same procedure as the one employed in section 4.2.2.

4.3.2 Results.

The average phrase-initial syllable duration was 85 samples for the first speaker, and 66 samples for the second speaker. The effects of number of syllables and position in utterance appeared to roughly follow the trends seen in the previous data. Tapes of the first speaker, Charles de Gaulle, will be used in the main study. The average phrase-initial syllable duration, 85 samples, will be used for the variable Ns in the dewarping function applied to this data.

4.4 Dewarping algorithm.

4.4.1 Modified warping function.

The D-function was modified to include the utterance length factor, resulting in the formula:

$D = N_s \left(1 + \frac{b}{s}\right) \exp(a \, x^{n z})$    (eq. 2)

where
s = number of syllables in the breath group (word or phrase), and
b = slope of word-length function

This formula can then be used to compute the syllable durations as a function of both number of syllables and position in phrase, using the constants n = 5.75 and z = .96. The values of a, b and Ns were varied to find the closest fit to the syllable durations in the speech data. Figure 4 shows the syllable durations calculated by the D-function to fit the production data of preliminary investigation I. This graph quite closely matches Figure 3, the graph of the actual data for the preliminary investigation.

Values of the parameters a and b of the D-function were selected to match data from studies in the literature; these values are displayed in Table 1. It should be noted that the first set of values was obtained from perception data while all of the other values were obtained from production data; this is particularly evident when one looks at the a values. The first value of a is the average value for the perception of regularity, in the original D-function, found by D'Arcy (1984). The values of the other two parameters for the D'Arcy data were selected for this study. The remaining sets of values were obtained from the production data of Lindblom and Rapp (1973), preliminary investigation I, and Oller (1973), respectively. These values were used in the dewarping algorithm in the current investigation.

Data source                  a      b      Ns
D'Arcy                       .194   .25    64
Lindblom & Rapp              .56    .25    64
Prelim. investigation I      .48    .55    66
Oller                        .30    .11    78

Table 1: Parameter values for D-function to fit the data of four sources.

4.4.2 Dewarping algorithm.

The preceding section gives the D-function (eq. 2) and the parameter values which describe the degree of warping found in the production data examined. In order to undo this warping taking place at production time, a dewarping function having the inverse effect of the D-function on syllable duration is required.
This is illustrated in Figure 5. Ns is the duration (measured in samples) of each syllable before the warping function is applied, assuming that the intended syllable durations are isochronous. The duration of each syllable after warping is given by D[i,s] for the values of i (syllable position) and s (number of syllables in the breath group). These two parameters are related to the parameter x by the relation

$x = \frac{i}{s - 1}$

The number of samples that must be discarded for each syllable is given by the difference between D[i,s] and Ns; this represents the amount of warping which is to be compensated for. This difference may be represented as DW[i,s], where:

$DW[i,s] = D[i,s] - N_s$    (eq. 3)

that is:

$DW[i,s] = N_s \left[ \left(1 + \frac{b}{s}\right) \exp(a \, x^{n z}) - 1 \right]$    (eq. 4)

where DW[i,s] is the number of samples removed from the syllable to dewarp it.

The production data indicated that the phrase-final warping extends (backward) no further than the final five syllables. Therefore the DW-function was applied to utterances in a reverse direction, that is, on syllables s to s - 4, with no dewarping on earlier syllables. Thus the parameter x took the form:

$x = \frac{5 - (s - 1 - i)}{5}$

FIGURE 5: Relationship between D-function and DW-function. (Curves: D-function, DW-function, dewarped syllable durations. Axes: syllable position; syllable duration in samples; number of samples discarded.)

In this equation, (s-1-i) is the position in utterance relative to the end of the utterance. The final form of the DW-function used in the dewarping algorithm was:

$DW = N_s \left[ \left(1 + \frac{b}{s}\right) \exp(a \, x^{n z}) - 1 \right]$    (eq. 5)

4.4.3 Dewarping computer program.

A set of dewarping programs written for the PDP-12 digital computer were modified to fit the implementation of the DW-function. Figure 6 is a flow chart illustrating the order of application of these programs.
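As an illustration of the DW-function, the following Python sketch computes the number of samples to discard from each syllable of a breath group. It rests on several assumptions that should be checked against the original thesis: the word-length term is read as (1 + b/s), the exponent as x raised to n*z, syllable positions i are counted from 0, and the function name `dw_function` is hypothetical. The example parameter values are those reported in the text for the Lindblom and Rapp data (a = .56, b = .25, Ns = 64), with n = 5.75 and z = .96.

```python
import math

def dw_function(s, a, b, ns, n=5.75, z=0.96, span=5):
    """Samples to discard from each syllable of an s-syllable breath group.

    Assumed reading of eq. 5:
        DW = Ns * ((1 + b/s) * exp(a * x**(n*z)) - 1)
    with x = (span - (s - 1 - i)) / span for the final `span` syllables
    and no dewarping (0 samples) on earlier syllables.
    """
    discards = []
    for i in range(s):
        from_end = s - 1 - i              # 0 for the final syllable
        if from_end >= span:
            discards.append(0.0)          # outside the phrase-final region
            continue
        x = (span - from_end) / span      # 1/5 ... 1 over the last five
        warp = (1 + b / s) * math.exp(a * x ** (n * z))
        discards.append(ns * (warp - 1))
    return discards

# Eight-syllable breath group, Lindblom & Rapp parameter set.
discards = dw_function(s=8, a=0.56, b=0.25, ns=64)
for position, count in enumerate(discards):
    print(position, round(count, 1))
```

With these values the first three syllables are left untouched and the amount removed grows toward the end of the group, most of it coming from the final syllable.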
POLIT is a program which peak clips the speech amplitude envelope and determines the durations of the pauses and of the articulated sequences of the waveform. The two thresholds for peak clipping, high and low, are selected by visual inspection of the amplitude envelope before running the program. This study is concerned only with the sections of the amplitude envelope which correspond to times when the speaker is actually speaking, that is, the articulated sequences or breath groups.

PAUSEOUT recopies the articulated sequences of the original speech amplitude envelope consecutively, sample by sample, without the intervening pauses. It does not alter the syllable durations in any way. This unmodified amplitude envelope is used as the base for comparisons of the effectiveness of the WARPOUT programs.

WARPOUT first removes pauses in the same manner as PAUSEOUT; in addition, it removes the time-warping due to phrase-final lengthening and number of syllables according to one set of parameter values. To accomplish this, the program defines 5 (or s if s<5) intervals working backwards from the last sample. The size of each interval is calculated by D[i,s], to be the predicted duration of a syllable in that position. The program then discards DW[i,s] samples in each of the intervals. This is the number of deleted samples required to reduce the syllable duration to the average "unwarped" duration (Ns).

FIGURE 6: Flow chart of computer programs.
  Speech sequence (approximately 40 syllables)
  POLIT: measures pause and articulated sequence durations; peak clips
  PAUSEOUT: recopies articulated sequences without intervening pauses; no modification of syllable durations
  WARPOUT: recopies articulated sequences without intervening pauses; dewarps syllables, using four sets of parameter values
  DREHER: autocorrelates amplitude envelope
  PERIOD: calculates inter-peak intervals, produces histogram
  Output: 5 histograms per speech sequence
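The sample-discarding step of WARPOUT can be sketched as follows. The original PDP-12 program is not reproduced in the thesis, so this is only one plausible way of removing a given number of samples spread evenly over an interval; the function name `discard_evenly` and the rule for choosing which indices to drop are assumptions.

```python
def discard_evenly(interval, n_discard):
    """Return the interval with n_discard samples removed at even spacing.

    The interval is divided into n_discard equal stretches and the
    sample nearest the middle of each stretch is dropped.
    """
    n_discard = int(round(n_discard))
    if n_discard <= 0:
        return list(interval)             # nothing to remove
    if n_discard >= len(interval):
        return []                         # cannot keep any sample
    step = len(interval) / n_discard      # spacing between dropped samples
    drop = {int(k * step + step / 2) for k in range(n_discard)}
    return [v for idx, v in enumerate(interval) if idx not in drop]

# Remove 2 samples from a 10-sample interval.
shortened = discard_evenly(list(range(10)), 2)
print(shortened)
```

Dropping samples at regular spacing shortens the syllable without cutting out any one contiguous portion of the envelope, so the overall shape of the syllable is preserved.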
The samples to be discarded are selected evenly over the whole interval. WARPOUT is run four times on the speech amplitude envelopes, each time for a different set of parameter values, as displayed in Table 1. For Warpout 1, the parameter values were a=.194 and b=.25; for Warpout 2, a=.56 and b=.25; for Warpout 3, a=.48 and b=.55; for Warpout 4, a=.3 and b=.11. This results in four de-warped (Warpout) amplitude envelopes in addition to the unmodified (Pauseout) amplitude envelope.

Each one of the amplitude envelopes processed by PAUSEOUT or by WARPOUT was duplicated in a second file, and DREHER, an autocorrelation function program, was then run on each pair of (identical) files. This program correlates the second half of the dewarped amplitude envelope with its copy for successive shifts of four samples, back to the beginning of the file. In this program, the autocorrelation is taken as the sum of the products of the amplitudes of corresponding samples, for every fourth sample of the envelope. This sum is normalized by dividing it by the value of the correlation for a zero shift, yielding a set of values between one and zero. There are no negative values for the correlation in this particular case because the amplitude envelope is rectified, and is thus never negative. (For a further description of the autocorrelation function see Section 2.2.5.) Due to the periodic content of the amplitude envelope, the autocorrelation function waveform is also periodic in shape, with peaks and valleys corresponding to points of maximum and minimum correlation between the amplitude envelopes.

The successive maxima and minima values of the autocorrelation are located by the program HISTO. This program also computes the interval between successive peaks and between successive valleys, and plots the results as a period histogram. The distribution of the histogram will display the regularity of the periodicity of the waveform.
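The autocorrelation and peak-interval steps can be sketched in Python as follows. This is a simplified reading of the description above, not the original DREHER or histogram programs: the second half of the envelope is correlated with the envelope shifted back in steps of four samples, using every fourth sample, and normalized by the zero-shift value. The function names and the pseudo-data (rectified half-periods of a sine wave, 40 samples per "syllable") are assumptions made for illustration.

```python
import math

def autocorrelation(envelope, step=4):
    """Normalized autocorrelation of the second half of the envelope.

    Lags run from 0 to half the envelope length in increments of `step`;
    each value is a sum of products over every `step`-th sample, divided
    by the zero-lag sum.
    """
    n = len(envelope)
    half = n // 2

    def corr(lag):
        return sum(envelope[t] * envelope[t - lag] for t in range(half, n, step))

    zero = corr(0)
    return [corr(lag) / zero for lag in range(0, half + 1, step)]

def peak_intervals(acf, step=4):
    """Intervals (in samples) between successive maxima of the autocorrelation."""
    peaks = [k for k in range(1, len(acf) - 1)
             if acf[k - 1] < acf[k] >= acf[k + 1]]
    return [(q - p) * step for p, q in zip(peaks, peaks[1:])]

# Pseudo-data: perfectly regular "syllables" of 40 samples each.
envelope = [abs(math.sin(math.pi * t / 40)) for t in range(800)]
acf = autocorrelation(envelope)
intervals = peak_intervals(acf)
print(intervals)
```

For perfectly regular pseudo-data every inter-peak interval equals the syllable duration, so the period histogram collapses to a single bin; durational irregularity in real speech spreads the intervals over several bins.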
A relatively narrow histogram indicates regularity, while a widely distributed histogram indicates variability in the periodicity.

The programs were tested and debugged on pseudo-data and on the /ta/ data of preliminary study I. The pseudo-data consisted of sequences of rectified half-periods of sine waves, representing the syllables, whose durations were computed by the D-function. The resulting amplitude envelopes had the durational characteristics hypothesized for the warped syllable sequences. The WARPOUT program was therefore expected to completely undo the warping and yield perfectly regular amplitude envelopes.

4.5 Experimental study.

4.5.1 Data collection.

The two speech samples used for the main study were obtained from LP recordings of political speeches made by Charles de Gaulle and by Richard Nixon during their respective political careers. These recordings were dubbed onto magnetic tapes. Most of the recordings were of high quality; however, some of the speeches were made outdoors or in large halls, resulting in high levels of background noise. The tapes were therefore screened auditorily to select long sections of tape with low levels of background noise and no audience noise or applause. The recorded selections were full-wave rectified, lowpass filtered and sampled onto digital tapes using the procedure described in section 4.2.2. Lowpass filtering at 50 Hz was used and the sampling frequency was 400 Hz. The resulting speech amplitude envelopes were viewed to determine the thresholds for peak clipping, and the number of syllables per breath group. Visual segmentation of the amplitude envelope to determine the number of syllables was relatively easy for syllables containing voiceless consonants because the consonants have lower amplitudes than vowels and are thus clearly distinguished. However, segmentation of syllables containing voiced consonants, which generally have large amplitudes, was quite difficult.
For this reason, the recorded tape was also listened to in order to facilitate the determination of the number of syllables for each breath group.

4.5.2 Data manipulation.

The data for each subject were broken up into sections approximately 40 syllables long. For de Gaulle, analysis was performed on 7 sections, with a total of 283 syllables; for Nixon, 8 sections, with a total of 372 syllables. Each section was then processed by the computer programs, as described above and summarized in Figure 6. For each section of the data of each speaker, five histograms were obtained, for a total of 75 histograms. For each speaker, the histograms of the sections for the PAUSEOUT program were combined, as were the histograms for each WARPOUT run, yielding five grouped histograms for each speaker.

CHAPTER 5: RESULTS

5.1 Histograms.

As described in section 4.5.2, the data were pooled for each speaker and for each processing condition, and a frequency histogram was plotted for each speaker-by-condition combination. The resulting histograms and the respective means and standard deviations (s.d.) are shown in Figures 7 through 16. It should be noted here that each inter-peak interval represents the distance between successive maxima of the autocorrelation function. In the case of the Pauseout condition, for which no dewarping of the amplitude envelope was performed, the units of the histograms can be considered to be, in some sense, milliseconds or centiseconds. However, in the case of the Warpout conditions, dewarping of the amplitude envelope was applied; consequently the units of the histograms are dewarped "milliseconds" or "centiseconds", henceforth "ms" or "cs".

Some preliminary observations can be made on the basis of visual inspection of these histograms. For the de Gaulle data, the distribution of the data points looks narrower for the four Warpout conditions than for the control (Pauseout) condition.
Conditions Warpout 1 and Warpout 2 (Figures 8 & 9) appear to have the narrowest distributions. Visual inspection of the histograms of the Warpout conditions for the Nixon data also seems to indicate narrowing of the distribution for the Warpout conditions vs. the control.

The histograms have, in general, somewhat irregular envelopes. The graphs were therefore smoothed, by plotting the mean of each pair of adjacent bins. This procedure did not alter the shapes of the histograms significantly, and therefore did not aid in interpretation of the results.

Tables 2 and 3 summarize the means and s.d.'s for each combination condition-by-section. The statistic of relevance to the current study is the s.d., which is a characterization of the distribution of the data, namely the peak-to-peak interval of the autocorrelation function of the speech amplitude envelopes.

[Figures 7 to 16: histograms of NUMBER OF OCCURRENCES versus INTER-PEAK INTERVAL]
FIGURE 7: Histogram of de Gaulle - Pauseout (mean = 26.8, s.d. = 11.1)
FIGURE 8: Histogram of de Gaulle - Warpout 1
FIGURE 9: Histogram of de Gaulle - Warpout 2 (mean = 23.9, s.d. = 9.0, b = .25)
FIGURE 10: Histogram of de Gaulle - Warpout 3 (mean = 23.0, s.d. = 9.3, a = .48, b = .55)
FIGURE 11: Histogram of de Gaulle - Warpout 4
FIGURE 12: Histogram of Nixon - Pauseout
FIGURE 13: Histogram of Nixon - Warpout 1
FIGURE 14: Histogram of Nixon - Warpout 2
FIGURE 15: Histogram of Nixon - Warpout 3 (mean = 21.3)
FIGURE 16: Histogram of Nixon - Warpout 4

TREATMENT        CDG1  CDG2  CDG3  CDG4  CDG5  CDG6  CDG7  GROUP
PAUSEOUT   MEAN  32.6  31.6  24.7  24.3  27    25.4  23.2  26.8
           S.D.  8.9   9.2   7.7   10.2  10.3  12.6  13.9  11.1
WARPOUT 1  MEAN  31.6  24    22    22.8  22.7  21.5  25.9  24.2
           S.D.  9.2   7.2   5.5   5.8   8     9     11    8.6
WARPOUT 2  MEAN  25.1  25.7  22.6  25.4  21.3  20.1  27.5  23.9
           S.D.  8.2   10.2  10.3  7.6   6.4   7.8   11.7  9
WARPOUT 3  MEAN  26.6  25.2  22.6  25.3  18.8  20.9  23.1  23
           S.D.  8.9   11.5  10    7.4   6.7   8.3   10.6  9.3
WARPOUT 4  MEAN  29.2  24.9  23.7  24.8  22.5  19.9  22.7  23.9
           S.D.  8.2   9.9   5.4   6.5   7.9   8.3   10.9  8.8
TABLE 2: MEAN AND S.D. - DE GAULLE DATA

TREATMENT        NIX1  NIX2  NIX3  NIX4  NIX5  NIX6  NIX7  NIX8  GROUP
PAUSEOUT   MEAN  26.2  20    22.5  21.5  18.9  30    19.6  25.1  22.4
           S.D.  7.4   9.5   8.3   10.7  9.6   9.6   7.9   11.5  9.9
WARPOUT 1  MEAN  23.9  21    20.3  21.8  20.9  24.1  23    24.2  22.3
           S.D.  10.6  6.1   6.6   9.5   10.8  8.7   10.3  5.6   8.9
WARPOUT 2  MEAN  21.8  23.2  21.1  17.1  21.3  20.3  22.4  27.8  21.4
           S.D.  9.9   7.9   6.9   4.7   13.3  12.4  13.5  8.9   10.3
WARPOUT 3  MEAN  20.9  21.6  19.1  18.3  21.1  23.4  22.4  26.9  21.3
           S.D.  9.2   7.3   6.5   5.8   11    11.1  10.4  10.2  9.3
WARPOUT 4  MEAN  24.7  19.9  20.6  22.6  20.6  26.1  21.8  24.1  22.4
           S.D.  10    5.6   6.3   9.1   10.5  9.1   10.6  6.1   8.8
TABLE 3: MEAN AND S.D. - NIXON DATA

Examination of the group s.d.'s for the de Gaulle data confirms the observation made in the previous paragraph: the s.d.'s of the Warpout conditions are each smaller than the s.d. of the control condition. This indicates that the experimental manipulation leads to a narrowing of the distribution, thus to an improvement in the periodicity of the amplitude envelope, when compared with the control condition.
Condition Warpout 1 has the smallest s.d., indicating the greatest improvement, while condition Warpout 3 has the largest s.d., indicating the least improvement.

Similarly, examination of the s.d.'s for the Nixon data seems to bear out the apparent improvement noted from visual inspection of the histograms. With the exception of Warpout 2, each Warpout condition has a smaller s.d. than the control condition, though the differences between s.d.'s are smaller than for the de Gaulle data. This again indicates an improvement in the periodicity of the amplitude envelope. In the case of Warpout 2, the s.d. is larger than that of the control condition, apparently indicating a more irregular amplitude envelope than the original. The differences between the s.d.'s of the conditions are small; therefore analysis of variance was performed to determine the significance of the differences.

5.2 ANOVA results

5.2.1 Conditions effect

A one-way analysis of variance (ANOVA) was performed, with condition as the factor. This was a repeated measures design, in that each condition was run on each section. The results of the ANOVA for the de Gaulle data and the Nixon data are summarized in Tables 4 and 5, respectively.

Source of variation  df  SS    MS    F          p
TREATMENT            4   26    6.5   3.1253562  <.05 SIG
BETWEEN SECTION      6   61.6  10.3  4.9359657  <.01 SIG
WITHIN SECTION       28  75.9  2.71  1.3036223  >.05 N.S.
ERROR                24  49.9  2.08
TOTAL                34  138
TABLE 4: ANALYSIS OF VARIANCE - DE GAULLE DATA

Source of variation  df  SS    MS    F          p
TREATMENT            4   9.14  2.28  0.6763446  >.05 N.S.
BETWEEN SECTION      7   83.8  12    3.544344   <.01 SIG
WITHIN SECTION       32  104   3.24  0.9595431  >.05 N.S.
ERROR                28  94.5  3.38
TOTAL                39  187
TABLE 5: ANALYSIS OF VARIANCE - NIXON DATA

In the de Gaulle data results, the F-value for the condition factor is significant at the .05 level, whereas the condition F-value for the Nixon data does not reach significance. These results indicate that the differences in the s.d. are significant for the de Gaulle data, but not for the Nixon data.
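The repeated measures design can be made concrete with a short sketch that recomputes the de Gaulle analysis from the section-level s.d. values of Table 2. The partition of the sums of squares below is the standard one for a one-way repeated measures layout and is assumed, not confirmed, to be the procedure used for Table 4; the function and variable names are illustrative only.

```python
# Section-level s.d. values transcribed from Table 2 (de Gaulle data),
# one row per condition, one column per section CDG1..CDG7.
SD = {
    "PAUSEOUT":  [8.9, 9.2, 7.7, 10.2, 10.3, 12.6, 13.9],
    "WARPOUT 1": [9.2, 7.2, 5.5, 5.8, 8.0, 9.0, 11.0],
    "WARPOUT 2": [8.2, 10.2, 10.3, 7.6, 6.4, 7.8, 11.7],
    "WARPOUT 3": [8.9, 11.5, 10.0, 7.4, 6.7, 8.3, 10.6],
    "WARPOUT 4": [8.2, 9.9, 5.4, 6.5, 7.9, 8.3, 10.9],
}


def repeated_measures_anova(table):
    """One-way repeated measures ANOVA: conditions (rows) by sections (columns)."""
    rows = list(table.values())
    k, n = len(rows), len(rows[0])          # number of conditions, sections
    correction = sum(map(sum, rows)) ** 2 / (k * n)
    ss_total = sum(v * v for row in rows for v in row) - correction
    ss_treatment = sum(sum(row) ** 2 for row in rows) / n - correction
    ss_section = sum(sum(col) ** 2 for col in zip(*rows)) / k - correction
    ss_error = ss_total - ss_treatment - ss_section
    df_treatment, df_section = k - 1, n - 1
    df_error = df_treatment * df_section
    ms_error = ss_error / df_error
    return {
        "SS_treatment": ss_treatment,
        "SS_section": ss_section,
        "SS_error": ss_error,
        "SS_total": ss_total,
        "F_treatment": (ss_treatment / df_treatment) / ms_error,
        "F_section": (ss_section / df_section) / ms_error,
    }


result = repeated_measures_anova(SD)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Run on these values, the sketch gives F ratios that agree with the treatment and between-section entries of Table 4, which suggests that the dependent variable of the ANOVA was the section-level s.d.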
However, from this table it is not possible to determine which condition(s) contribute most to the significant effect in the de Gaulle data. The Newman-Keuls test, which compares the differences between each possible pair of conditions, was carried out. The results of these comparisons are shown in Tables 6 and 7. As expected from the initial analysis, none of the comparisons between conditions for the Nixon data reach significance. For the de Gaulle data, the Pauseout-Warpout 1 comparison is significant at the .05 level. This latter condition has a smaller s.d. than the Pauseout condition, indicating increased regularity of the amplitude envelope. This is in agreement with the comparison between s.d.'s in Table 2, and with the visual inspection of the histograms, both of which indicated that condition Warpout 1 has the narrowest distribution. This suggests that, for the de Gaulle data, the dewarping program is performing the intended transformation. All other comparisons do not reach significance.

TREATMENTS         W1     W4     W2     W3     PAUS.
TOTAL              55.7   57.1   62.2   63.4   72.8
WARPOUT 1   55.7   -      1.4    6.5    7.7    17.1
WARPOUT 4   57.1   -      -      5.1    6.3    15.7
WARPOUT 2   62.2   -      -      -      1.2    10.6
WARPOUT 3   63.4   -      -      -      -      9.4
PAUSEOUT    72.8   -      -      -      -      -

# STEPS APART      2      3      4      5
q.99(r,24)         3.96   4.54   4.91   5.17
q.95(r,24)         2.92   3.53   3.9    4.17
(n*MSerror)^.5 = 4.08
critical values:
q.99               16.2   18.5   20     21.1
q.95               11.9   14.4   15.9   17
WARPOUT 1 vs PAUSEOUT significant at the .05 level, all other comparisons not significant.
TABLE 6: NEWMAN-KEULS TEST - DE GAULLE DATA

TREATMENTS         W4     W1     W3     PAUS.  W2
TOTAL              67.3   68.2   71.5   74.5   77.5
WARPOUT 4   67.3   -      0.9    4.2    7.2    10.2
WARPOUT 1   68.2   -      -      3.3    6.3    9.3
WARPOUT 3   71.5   -      -      -      3      6
PAUSEOUT    74.5   -      -      -      -      3
WARPOUT 2   77.5   -      -      -      -      -

# STEPS APART      2      3      4      5
q.99(r,28)         3.92   5      4.85   5.11
q.95(r,28)         2.9    3.51   3.86   4.13
(n*MSerror)^.5 = 5.2
critical values:
q.99               20.4   26     25.2   26.6
q.95               15.1   18.2   20.1   21.5
None of the comparisons reach significance.
TABLE 7: NEWMAN-KEULS TEST - NIXON DATA

5.2.2 Between sections

For both speakers, the between sections (of speeches) F values from Tables 4 and 5 are significant at the .01 level. This indicates that there is a great deal of variability in the syllable durations within a speaker. In the case of Nixon, the sections were taken from the same speech; in the case of de Gaulle, however, the sections were taken from several short speeches. Going back to Table 2, it is clear that the control condition means for sections CDG1 and CDG2 are considerably larger than the control condition means for each of the other sections. The mean inter-peak interval reflects the mean syllable duration of the amplitude envelope. Since the autocorrelation was carried out on every fourth sample, the syllable duration in samples corresponds to four times the mean inter-peak interval. The difference between the mean syllable durations of the first two sections (CDG1 and CDG2) and those of the other sections is illustrated by comparing the extreme values of sections CDG1 and CDG7. The mean inter-peak intervals of these two sections are 32.6 and 23.2 "cs", respectively, corresponding approximately to syllable durations of 130.4 samples (326 ms), and 92.8 samples (232 ms). In the Warpout program for the de Gaulle data, a value of 85 samples (212.5 ms) was used for the variable Ns, representing the "unwarped" syllable duration. If this value is significantly different from the target syllable duration, which is the unwarped phrase-initial syllable duration, then the Warpout program will not produce the degree of de-warping intended. For sections CDG1 and CDG2, this was the case. It was therefore decided to reanalyze the results of the de Gaulle data without the first two sections.
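The unit conversion used in this argument can be sketched as follows, using the 400 Hz sampling frequency of section 4.5.1 and the Pauseout means of Table 2. The ratio of each section's mean syllable duration to Ns is only a rough proxy introduced here for illustration: the target duration named in the text is the unwarped phrase-initial syllable duration, which is not tabulated.

```python
SAMPLING_HZ = 400  # sampling frequency stated in section 4.5.1
STEP = 4           # the autocorrelation was computed on every fourth sample
NS = 85            # "unwarped" syllable duration (samples) used for de Gaulle

# Control-condition (Pauseout) mean inter-peak intervals from Table 2.
PAUSEOUT_MEAN = {
    "CDG1": 32.6, "CDG2": 31.6, "CDG3": 24.7, "CDG4": 24.3,
    "CDG5": 27.0, "CDG6": 25.4, "CDG7": 23.2,
}


def to_samples(interval):
    """Inter-peak interval (autocorrelation lag units) to duration in samples."""
    return interval * STEP


def to_ms(samples):
    """Duration in samples to milliseconds."""
    return samples * 1000 / SAMPLING_HZ


# Illustrative proxy: mean syllable duration of each section relative to Ns.
ratios = {name: to_samples(v) / NS for name, v in PAUSEOUT_MEAN.items()}

for name, interval in PAUSEOUT_MEAN.items():
    samples = to_samples(interval)
    print(f"{name}: {samples:.1f} samples, {to_ms(samples):.1f} ms, "
          f"ratio to Ns = {ratios[name]:.2f}")
```

On this proxy, CDG1 and CDG2 lie furthest from Ns (ratios of about 1.5, against 1.1 to 1.3 for the other sections), which is consistent with treating them as the deviant sections.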
This manipulation of the data was justified by the use of different speeches for de Gaulle. The same justification could not be applied to the Nixon data; therefore it was not reanalyzed in this way. The results of the reanalysis are displayed in Tables 8 and 9. The results of the ANOVA are the same as for the whole group of data: the between section variances are still significant at the .01 level, and the treatments variance is significant at the .05 level. The Newman-Keuls test shows the Pauseout-Warpout differences to be significant at the .05 level for each of the Warpout conditions. The results of the initial analysis showed significant differences only for the Pauseout-Warpout 1 comparison. This increase in the significance of the results indicates that sections CDG1 and CDG2 do have a minimizing influence on the differences between Pauseout and most of the Warpout conditions, reducing them below significance. However, it is interesting to note that the difference between Pauseout and Warpout condition 1 remains significant even when the two deviant sections are included in the analysis. This suggests that this difference is the most robust, and that this condition provides the best degree of dewarping.

Source of variation  df  SS    MS    F          p
TREATMENT            4   32.5  8.13  3.9897932  <.05 SIG
BETWEEN SECTION      4   58.6  14.7  7.1926786  <.01 SIG
WITHIN SECTION       20  65.1  3.26  1.5979586  >.05 N.S.
ERROR                16  32.6  2.04
TOTAL                24  124
TABLE 8: ANALYSIS OF VARIANCE - DE GAULLE DATA SECTIONS 3-7

TREATMENTS         W4     W1     W3     W2     PAUS.
TOTAL              39     39.3   43     43.8   54.7
WARPOUT 4   39     -      0.3    4      4.8    15.7
WARPOUT 1   39.3   -      -      3.7    4.5    15.4
WARPOUT 3   43     -      -      -      0.8    11.7
WARPOUT 2   43.8   -      -      -      -      10.9
PAUSEOUT    54.7   -      -      -      -      -

# STEPS APART      2      3      4      5
q.99(r,16)         4.13   4.78   5.19   5.49
q.95(r,16)         3      3.65   4.05   4.33
(n*MSerror)^.5 = 3.19
critical values:
q.99               13.2   15.3   16.6   17.5
q.95               9.58   11.7   12.9   13.8
The differences between PAUSEOUT and each condition reach .05 significance. Comparisons between conditions are not significant.
TABLE 9: NEWMAN-KEULS TEST - DE GAULLE DATA SECTIONS 3-7

5.3 Probability graphs

The autocorrelation data for each combination speaker-by-condition was also plotted on probability paper. On this graph paper, the inter-peak interval is plotted (on the ordinate) versus the cumulative percent occurrence (on the abscissa). If the distribution of the data follows the normal distribution, the points will fall on a straight line whose slope is directly related to the s.d. This type of graph therefore provides information about the normality of distribution, in addition to information about the mean and s.d. of the distribution.

The data points for each graph were arranged along a curved line whose concavity was directed upward. The maximum curvature was at the extremities of the curve for each case, with greater or lesser degrees of curvature for the different plots. This shape is that obtained for probability plots of positively skewed distributions (King, 1971). The shape therefore indicates the presence of deviantly large inter-peak intervals, corresponding to lengthened syllable durations. The degree of curvature, particularly on the high side of the graph, was generally greater for the Nixon Warpout data than for the de Gaulle Warpout data, indicating greater skew for the Nixon data.

It was possible to fit a straight line to each graph for the data points falling within approximately one s.d. of the mean (i.e. approximately between inter-peak intervals 10 and 30 "cs"). The slope of this best fit line was measured for each graph, as an estimation of the s.d. For the de Gaulle data, the slopes of the Warpout conditions were each smaller than the slope of the control condition, indicating narrower distributions. The slope of Warpout condition 1 was the smallest.
These findings agree with the findings based on the s.d. of the whole distribution. For the Nixon data, the slopes of two of the Warpout conditions (Warpout 2 and 3) were smaller than the slope of the control condition. However, these two graphs showed the greatest degree of curvature, particularly in the region around the mean. The best fit lines were therefore difficult to determine, and may be suspect.

The Kolmogorov-Smirnov test for goodness of fit was carried out on the data, using the normal distribution with mean and s.d. equal to those of the distribution under investigation. The results were of little help, since all of the distributions showed significant departure from the normal distribution.

CHAPTER 6: DISCUSSION

The three objectives of this study were: (1) to find stronger evidence for isochrony in speech production data; (2) to determine if a difference can be shown between French and English due to timing type differences; (3) to determine values of a and b which give the best results and compare these values with the values of these parameters found for speech perception by D'Arcy (1984). The experimental findings of this thesis relating to each of these three objectives are summarized below.

Regarding the first objective, it was expected that the dewarping algorithm could produce an increase or a decrease in the regularity of the amplitude envelope, depending on the values of the parameters. The experimental finding that the s.d. of the inter-peak intervals of the autocorrelation function was larger than that of the control condition only in the case of the Warpout 2 condition for the Nixon data, and was smaller for all other Warpout conditions for both speakers, suggests that the values chosen for the dewarping algorithm were appropriate for compensating for some of the warping present. The decrease in s.d. was significant for the French data only, suggesting a difference in the effectiveness of the dewarping algorithm on French versus English speech. The relevance of this difference in relation to the second objective is discussed further below.

The increased regularity of the signal provides better evidence for the existence of isochrony (in French) than found in previous studies. It indicates that processing of the data in a systematic fashion to remove consistent sources of irregularity can better demonstrate the underlying rhythmic structure of speech. However, while the data showed an improvement in the degree of regularity compared to the original signal, the degree of regularity was still not sufficient to provide unequivocal evidence for isochrony. The most stringent criterion for isochrony would require that, after dewarping, all inter-peak intervals of the autocorrelation be equal, but this is certainly not the case in this study. A less stringent criterion for isochrony could be that the inter-peak intervals form a distribution in which the s.d. corresponds to a durational difference below the threshold for the perception of duration difference. In section 2.2.6, this threshold was taken to be 20 - 30 ms. The s.d.'s of the inter-peak intervals of the de Gaulle data for the control condition and for the best dewarping condition correspond to durational differences of 110 and 86 "ms" respectively. Both of these values obviously exceed the threshold for perception by a substantial amount, though the dewarping condition value is significantly smaller. Further investigation using the dewarping function with other values of a, b, and Ns is necessary to determine the optimal values of each of these variables. Such investigation may yield a greater degree of regularity, below the threshold for perception of durational differences, providing significant evidence for the existence of isochrony in French.
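The less stringent criterion can be checked with a few lines of arithmetic. The sketch below assumes that one inter-peak unit equals 10 ms (four samples at 400 Hz) and takes the upper end of the 20 - 30 ms range as the threshold; for the Warpout condition the result is in dewarped "ms", so the comparison is only indicative. The names are illustrative.

```python
MS_PER_UNIT = 10.0           # one inter-peak unit = 4 samples at 400 Hz = 10 ms
THRESHOLD_MS = (20.0, 30.0)  # perceptual threshold range taken from section 2.2.6

# Group s.d. values for the de Gaulle data, from Table 2.
GROUP_SD = {"Pauseout": 11.1, "Warpout 1": 8.6}


def sd_in_ms(sd_units):
    """Convert an s.d. expressed in inter-peak units to (dewarped) ms."""
    return sd_units * MS_PER_UNIT


def meets_criterion(sd_units, threshold_ms=THRESHOLD_MS[1]):
    """True if the s.d. falls at or below the perceptual threshold."""
    return sd_in_ms(sd_units) <= threshold_ms


for condition, sd in GROUP_SD.items():
    print(f"{condition}: {sd_in_ms(sd):.0f} ms, "
          f"meets criterion: {meets_criterion(sd)}")

# Relative narrowing of the distribution achieved by the best condition.
reduction = 1 - GROUP_SD["Warpout 1"] / GROUP_SD["Pauseout"]
print(f"relative reduction in s.d.: {reduction:.1%}")
```

The s.d. falls by a little over a fifth, from about 110 to 86 "ms", and would have to fall to roughly a third of its dewarped value to reach the 30 ms threshold.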
The above mentioned criteria for isochrony may not have been met in this study because only the systematic, progressive durational irregularities due to the position of the syllable in a phrase and to the number of syllables in the phrase were compensated for. Irregularities due to other sources of durational distortion were not accounted for. For example, as mentioned previously in Section 2.2, intrinsic duration, stress and coarticulatory influences each introduce systematic durational distortions. There also may be random sources of durational variability, such as those investigated by Hibi (1983) in his perception experiments. The use of speech amplitude envelopes in this study also means that variability in syllable amplitude will influence the results of the autocorrelation function. Intrinsic amplitude, stress and intensity of speaking will introduce variability in the amplitude envelope. As the current dewarping algorithm does not compensate for these sources of variability, it is unreasonable to expect it to yield perfectly regular amplitude envelopes. Further modification of the dewarping program will be necessary to compensate for the other sources of variability.

Regarding the second objective, the results of this study, as mentioned above, show a significant difference in the effectiveness of the dewarping program between the French data and the English data (at least for the speakers chosen). Significant improvement in regularity was seen for the French data only. This finding was expected, on the basis of the proposed timing-types for these two languages as outlined in Section 2.1.2; consequently, it provides some support for the proposed classification. The unit of dewarping used in the dewarping program is the syllable. French is usually considered to be syllable-timed, and consequently the dewarping program is appropriate for the French data. Conversely, English is usually considered to be stress-timed.
The dewarping of the data performed by the dewarping program therefore does not occur at the appropriate places for the English data, and in addition the program does not compensate for the variability in duration due to word stress. If the dewarping program could be modified to account for the warping in stress units, it may be possible to apply it to English and other stress-timed languages, to provide evidence for isochrony, but at the foot level this time.

It would be interesting to study the effectiveness of the dewarping program used in this study on other speakers of French and English, to confirm the differences found in this study. It would also be interesting to apply this dewarping program, and other programs modified for different timing-types, to other languages to demonstrate the existence of isochrony and to determine the particular timing type for those languages.

Regarding the third objective, it was found that, for the grouped data of de Gaulle, condition Warpout 1 gave the smallest s.d. and therefore the best (most regular) results. The value of a for the DW-function in condition 1 is the same as the value of a for the D-function which D'Arcy (1984) found to be perceived as most regular. This confirms the hypothesis, put forward by Tuller and Fowler (1980) (see Section 2.2.6), that perception and production of speech are somehow interrelated. That is, the degree of warping found by D'Arcy to be necessary for the perception of regularity is the same as the degree of dewarping required to give optimum (for this study) improvement in the regularity of speech production data. This value of a was the smallest value used. The value of a for condition 4 is the next smallest value, and is the value chosen to fit the production data of Oller. Condition 4 also had the second smallest s.d., but the difference between it and the control condition did not reach significance at the .05 level.
As previously mentioned, it is still necessary to determine the optimum value of a. Different values of this parameter, smaller than the value for condition 1 or between the values of conditions 1 and 4, should be tried in the dewarping program. Also, different values of b and Ns, combined with the optimum value of a, should be tried to determine the optimum value for each of these parameters. The optimum values thus found can then be compared with the values selected for the perceptual data of D'Arcy.

As mentioned in Chapter 1, these findings for rhythm in speech production may have clinical, industrial or technical application. In the clinical area, these findings may have some benefit in determining the degree of dewarping required for speech to be perceived as having natural rhythm. The dewarping program could be used as a test to determine if the speech of individuals falls into the normal range (to be determined by further experimentation) of warping. While this would provide interesting information about the reason for abnormal sounding rhythm, it would be very difficult to extend it to therapy and treatment of individuals thus identified as having abnormal rhythm. The complexity of the formula for the degree of warping present in speech, and the small size of the durational increments, would make conscious manipulation of the syllable durations extremely difficult to teach in therapy.

In industrial or technical areas, it may be possible to see practical application of the findings of this study in computer synthesis and recognition of speech. As outlined in Section 2.1.4, computer synthesized voices are often rated as having unnatural rhythm. This is not only esthetically displeasing and distracting, but it also may reduce intelligibility. It therefore would seem desirable to produce speech with natural rhythm, to enhance transfer of information.
If the warping algorithm used in this study, with the optimum values of the parameters, were incorporated into a speech synthesis program, the resulting speech would more closely follow the natural pattern of deceleration toward the end of a phrase, rather than the simple concatenation of syllables, which is sometimes the procedure currently used. The extraction of rules for the other influences on duration and amplitude, and for intonation patterns, would also help improve the naturalness of synthesized speech.

In the area of speech recognition systems, knowledge of the location of syllable boundaries may make the segmentation of speech input more accurate, resulting in more accurate interpretation of speech. The dewarping algorithm could thus be used to remove durational irregularities and allow analysis to be performed on a more regular, and therefore more easily segmented, speech signal.

BIBLIOGRAPHY

Abercrombie, D. (1967). Elements of General Phonetics. Edinburgh: Edinburgh University Press.
Allen, G.D. (1972a). The location of rhythmic stress beats in English: An experimental study I. Language and Speech, 15, 72-100.
Allen, G.D. (1972b). The location of rhythmic stress beats in English: An experimental study II. Language and Speech, 15, 179-195.
Allen, G.D. (1973). Segmental timing control in speech production. J. of Phonetics, 1, 219-237.
Allen, G.D. (1975). Speech rhythm: its relation to performance universals and articulatory timing. J. of Phonetics, 3, 75-86.
Bell, C.G., Fujisaki, H., Heinz, J.M., Stevens, K.N. and House, A.S. (1961). Reduction of speech spectra by analysis-by-synthesis techniques. J. Acoust. Soc. Amer., 33, 1725-1774.
Benguerel, A.P. (1970). Some physiological aspects of stress in French. Phonetics Laboratory Natural Language Studies No. 4. The University of Michigan, Ann Arbor, Michigan.
Blesser, B. (1972). Speech perception under conditions of spectral transformation: I. Phonetic characteristics. J.S.H.R., 15, 5-41.
Bolinger, D.L. (1965). Pitch accent and sentence rhythm. In: Forms of English. (I. Abe, T. Kanekiyo, Eds.) Cambridge: Harvard Press.
Cherry, C. & Wiley, R. (1967). Speech communication in very noisy environments. Nature, 214, 1164.
Chomsky, N. and Halle, M. (1968). The Sound Pattern of English. New York: Harper & Row.
Classe, A. (1939). The rhythm of English prose. Oxford: Basil Blackwell and Mott.
Cooker, H.S. (1963). Time Relationships of Chest Wall Movements and Intra-Oral Pressures during Speech. Ph.D. dissertation, State University of Iowa. University Microfilms, No. 64-3357.
D'Arcy, J.M. (1984). Perception of rhythm in sequences of clicks and of syllables. Unpublished M.Sc. Thesis, University of British Columbia.
Dauer, R.M. (1982). Stress-timing and syllable-timing reanalyzed. J. of Phonetics, 11, 51-62.
Dreher, J.J. (1969). Speaker rhythm by autocorrelation. Study of Sounds, 14, 41-56.
Duez, D. (1982). Pauses in three speech styles. Language and Speech, 25, 11-28.
Fonagy, I. and Magdics, K. (1960). Speed of utterance in phrases of different lengths. Language and Speech, 3, 179-192.
Fowler, C.A. (1979). "Perceptual centers" in speech production and perception. Perception & Psychophysics, 25, 375-388.
Fraisse, P. (1982). Rhythm and Tempo. In: The Psychology of Music. (D. Deutsch, Editor). California: Academic Press.
Fry, D.B. (1958). Experiments in the perception of stress. Language and Speech, 1, 126-152.
Goodglass, H., Fodor, I.G., & Schulhoff, C. (1967). Prosodic factors in grammar - Evidence from aphasia. J.S.H.R., 10, 5-20.
Halle, M. & Stevens, K.N. (1962). Speech Recognition: A model and a program for research. IRE Transactions on Information Theory, IT-8, 155-159.
Hibi, S. (1983). Rhythm perception in repetitive sound sequence. J. Acoust. Soc. Jpn. (E), 4, 2, 83-95.
Hoequist, C. Jr. (1983a). Durational correlates of linguistic rhythm categories. Phonetica, 40, 19-31.
Hoequist, C. Jr. (1983b). Syllable duration in stress-, syllable-, and mora-timed languages. Phonetica, 40, 203-237.
Holloway, C.M. (1970). Passing the strongly voiced components in noisy speech. Nature, 226, 178-179.
House, A. (1961). On vowel duration in English. J. Acoust. Soc. Amer., 33, 1174-1178.
Hudgins, C.V. and Numbers, F.C. (1942). An investigation of intelligibility of speech of the deaf. Genet. Psychol. Monogr., 25, 289-292.
Hudgins, C.V. and Stetson, R.H. (1937). Relative speed of articulatory movements. Archives néerlandaises de phonétique expérimentale, 13, 85-94.
Huggins, A.W.F. (1968). How accurately must a speaker time his articulations? IEEE Transactions on Audio and Electroacoustics, AU-16, No. 1, 112-117.
Huggins, A.W.F. (1972). Just noticeable differences for segment duration in natural speech. J. Acoust. Soc. Amer., 51, 1270-1278.
John, J.E. and Howarth, J.N. (1965). The effect of time distortion on the intelligibility of deaf children's speech. Language and Speech, 8, 127-134.
King, J.R. (1971). Probability charts for decision making. New York: Industrial Press Inc.
Klatt, D.H. (1975). Vowel lengthening is syntactically determined in a connected discourse. J. of Phonetics, 3, 129-140.
Klatt, D.H. (1976). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. J. of Acoust. Soc. of Amer., 59, 1208-1221.
Kozhevnikov, V.A. and Chistovich, L.A. (1965). Rech: artikulyatsiya i vospriyatie. Moscow-Leningrad, 1965. Translated as Speech: Articulation and Perception. Springfield, Va.: Joint Publications Research Service, U.S. Dept. of Commerce.
Ladefoged, P. (1975). A Course in Phonetics. San Francisco: Harcourt Brace Jovanovich, Inc.
Ladefoged, P., Draper, M.H. and Whitteridge, D. (1958). Syllables and stress. Miscell. Phonet., 3, 1.
Lea, W.A. (1974a). Prosodic aids to speech recognition: IV. A general strategy for prosodically-guided speech understanding. Univac Report # PX 10791. St. Paul, Minn.: Sperry Univac Defense Systems Div.
Lea, W.A. (1974b). Prosodic aids to speech recognition: V. A summary of results to date. Univac Report # PX 11087. St. Paul, Minn.: Sperry Univac Defense Systems Div.
Lea, W.A. (1980). Prosodic aids to speech recognition. In: Trends in Speech Recognition. W.A. Lea (ed). New Jersey: Prentice Hall.
Lehiste, I. (1970). Suprasegmentals. Cambridge, Mass.: M.I.T. Press.
Lehiste, I. (1971). Temporal organization of spoken language. In: Form and substance. L.L. Hammerich, R. Jakobson and E. Zwirner (Editors). E. Akademisk Forlag.
Lehiste, I. (1972). Rhythmic units and syntactic units in production and perception. J. of Acoust. Soc. of Amer., 54, 1228-1247.
Lehiste, I. (1975). The timing of utterances and linguistic boundaries. J. of Acoust. Soc. of Amer., 51, 2018-2024.
Lehiste, I. (1977). Isochrony reconsidered. J. of Phonetics, 5, 253-263.
Lehiste, I. (1978). The perception of duration within sequences of four intervals. J. of Phonetics, 7, 313-316.
Lehiste, I. and Peterson, G.E. (1959). Vowel amplitude and phonemic stress in American English. J. of Acoust. Soc. of Amer., 31, 428-435.
Levitt, H., Smith, C. and Stromberg, H. (1974). Acoustic articulatory and perceptual characteristics of the speech of deaf children. Proc. Speech Communication Seminar. Stockholm, Sweden, late submission, not paginated in Proceedings.
Liberman, A.M., Cooper, F.S., Shankweiler, D.P., and Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.
Lindblom, B.E.F. (1974). Chairman's Comments, Session on Speech Production and Synthesis by Rules. Speech Communication Seminar. Stockholm, Aug. 1-3, 1974. Uppsala: Almqvist and Wiksell.
Lindblom, B.E.F., and Rapp, K. (1973). Some temporal regularities of spoken Swedish. Papers of the Institute of Linguistics of the University of Stockholm. Publication 21.
Martin, J.G. (1982). Rhythmic (hierarchical) versus serial structure in speech and other behaviour. Psychological Review, 79, 487-509.
Morton, J., Marcus, S. and Frankish, C. (1976). Perceptual centers. Psychological Review, 83, 405-408.
Nakatani, L.H., O'Connor, K.D. and Aston, C.H. (1981). Prosodic Aspects of American English Speech Rhythm. Phonetica, 38, 84-106.
Nooteboom, S.G. (1972). Production and perception of vowel duration: a study of durational properties of vowels in Dutch. Doctoral dissertation, University of Utrecht.
Nooteboom, S.G. (1973). The perceptual reality of some prosodic durations. J. Phon., 1, 25-45.
O'Connor, J.D. (1965). The perception of time intervals. Phonetics Laboratory, University College, London, Progress Report 2, 11-15.
Ohala, J.J. (1973). The temporal regulation of speech. In: Proceedings of the Leningrad Symposium on Auditory Analysis and Perception of Speech. 516-538.
Oller, D.K. (1973). The effect of position in utterance on speech segment duration in English. J. Acoust. Soc. Amer., 54, 1235-1247.
Peterson, G.E. and Lehiste, I. (1960). Duration of Syllable Nuclei in English. J. of Acoust. Soc. of Amer., 32, 693-703.
Pike, K.L. (1946). The Intonation of American English. Ann Arbor: University of Michigan Press.
Pointon, G.E. (1980). Is Spanish really syllable-timed? J. of Phonetics, 8, 293-305.
Shen, Y. and Peterson, G.G. (1962). Isochronism in English. University of Buffalo, Studies in Linguistics, Occasional Papers 9, 1-36.
Sparks, R.W. (1973). Melodic Intonation Therapy. In: Language Intervention Strategies in Adult Aphasia. (R. Chapey, Editor). Baltimore: Williams & Wilkins.
Stark, R.E. (1979). Speech of the hearing-impaired child. In: Hearing and Hearing Impairment. (L.J. Bradford and W.G. Hardy, Editors). New York: Grune & Stratton.
Tuller, B. and Fowler, C.A. (1980). Some articulatory correlates of perceptual isochrony. Perception & Psychophysics, 27, 277-283.
Uldall, E.T. (1978). Rhythm in very rapid R.P. Language and Speech, 21, 397-402.
Wenk, B.J. and Wioland, F. (1982). Is French really syllable-timed? J. of Phonetics, 10, 193-216.
Woodrow, H. (1951). Time perception. In: Handbook of Experimental Psychology. (S. Stevens, Ed.). New York: Wiley.

