Computer generated corpus and lexical analysis of English language instructional materials prescribed for use in British Columbia Junior Secondary Grades

UBC Theses and Dissertations

Edwards, Peter

Abstract

The major purpose of the study was to capture a representative sample of natural language from the textbooks prescribed for use in the junior secondary curriculum for British Columbia schools, organize the sample for computer processing through the development of the needed programs, develop a lexical analysis, and describe the word and sentence characteristics of the samples organized by grades, subjects across grades, subjects within grades, and textbook corpora. A number of hypotheses related to the distribution of frequently occurring words and a sub-set of representative sentence lengths across the corpora were then tested, and a model was developed to aid in selecting lexically significant vocabulary from word lists based on samples from subject area textbooks. A stratified sampling model, applied to thirty-seven textbooks from seven subject areas, produced a Corpus of approximately a quarter million running words of natural language text based on 469 samples of approximately 500 words each. The results of the lexical analysis indicated that Grade 9 makes significantly greater reading demands in terms of volume of material (tokens) and vocabulary (word-types) than either Grade 8 or Grade 10. Considerable diversity was exhibited in type and token distribution by grades, subjects, and textbooks, but no apparent pattern emerged. However, use of Yule's K characteristic to determine the repeat rate frequency of word-types across the various corpora revealed great variation in redundancy of word-types, with the most striking differences exhibited in the samples from English textbooks and, to some extent, those from Home Economics and Commerce. Similar results were obtained in applying Yule's K as a measure of the repeat rate frequency for sentence lengths. Samples from English textbooks again exhibited exceptional variability in sentence length.
These results were further substantiated by the analysis of other measures of variability based on computation of standard deviations, coefficients of variation, Pearson's skew factor and, to a lesser degree, the average number of sentences per 500-word sample. In all instances, organization of the samples by gross grade groupings tended to mask the inherent variability of the samples organized by subjects and textbooks. Chi-square analyses of word and sentence distribution further substantiated the inherent variability revealed by the lexical analysis. Little uniformity was exhibited in the distribution of the most frequently occurring words in English and of a representative sub-set of sentence lengths when the samples were organized by grade levels, subjects across grades, and subjects within grades. Grouping by gross grade level again masked subject variations. The style and content characteristics of the print materials prescribed for use in the separate subject areas therefore significantly affect the frequency of occurrence of even the most common words in English and of a representative sub-set of sentence lengths. Further analysis of the word lists produced in the study substantiated the utility of developing an elimination technique, based on omission of the most frequently occurring words and the relatively rare words, to identify the significant vocabulary from word lists based on samples from texts in subject areas. The major conclusion of the study is that the print materials prescribed for use in junior secondary grades exhibit marked variability when examined on even the most straightforward of linguistic characteristics, such as word and sentence frequency. It is suggested that this variability would be even more pronounced if analyses were developed based on other syntactic and semantic variables.
The expertise of the subject area specialist and the reading specialist should be combined in developing instruction to maximize learning from print materials. Such instruction would best be based on materials organized by subjects across grades and by separate subjects within grades rather than on materials organized by gross grade groupings.
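
As an illustration of the repeat-rate measure named in the abstract, the following is a minimal Python sketch of Yule's K in its commonly cited form, K = 10^4 * (sum of squared frequencies - N) / N^2, where N is the number of tokens. It is not the thesis's own program. The function names `tokenize` and `yules_k` and the simple letters-and-apostrophes tokenizer are assumptions made for this example only.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def yules_k(items):
    """Yule's K for any sequence of hashable items.

    Works for word tokens and equally for a list of sentence lengths.
    Higher K means more repetition (greater redundancy).
    """
    freqs = Counter(items)              # frequency of each distinct item
    n = sum(freqs.values())             # total number of items (tokens)
    if n == 0:
        raise ValueError("cannot compute Yule's K for an empty sample")
    sum_sq = sum(f * f for f in freqs.values())
    return 1e4 * (sum_sq - n) / (n * n)


if __name__ == "__main__":
    sample = "The cat sat on the mat and the dog sat on the rug."
    print(yules_k(tokenize(sample)))
```

A sample in which every item occurs exactly once gives K = 0, and K rises as the same items recur, which is why the measure can serve as an index of redundancy for both word-types and sentence lengths.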

Item Metadata

Title	Computer generated corpus and lexical analysis of English language instructional materials prescribed for use in British Columbia Junior Secondary Grades
Creator	Edwards, Peter
Publisher	University of British Columbia
Date Issued	1974
Geographic Location	British Columbia
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2010-01-28
Provider	Vancouver : University of British Columbia Library
Rights	For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
DOI	10.14288/1.0055689
URI	http://hdl.handle.net/2429/19318
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Education
Affiliation	Education, Faculty of
Degree Grantor	University of British Columbia
Campus	UBCV
Scholarly Level	Graduate
Aggregated Source Repository	DSpace

Item Media

UBC_1974_A2 E38_8.pdf -- 16.04MB
