MODELS COMPARING ESTIMATES OF SCHOOL EFFECTIVENESS BASED ON CROSS-SECTIONAL AND LONGITUDINAL DESIGNS

By MINSUK SHIM
B.A., Seoul National University, 1982

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF EDUCATIONAL PSYCHOLOGY AND SPECIAL EDUCATION

We accept this thesis as conforming to the required standard.

THE UNIVERSITY OF BRITISH COLUMBIA
April 1991
© Minsuk Shim, 1991

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Educational Psychology and Special Education
The University of British Columbia
Vancouver, Canada
Date: March 15, 1991

Abstract

The primary purpose of this study is to compare six models (cross-sectional, two-wave, and multiwave, each with and without controls) and determine which of them most appropriately estimates school effects. For a fair and adequate evaluation of school effects, this study considers the following requirements of an appropriate analytical model.

First, a model should have controls for students' background characteristics. Without controlling for the initial differences of students, one may not analyze between-school differences appropriately, as students are not randomly assigned to schools.

Second, a model should explicitly address individual change and growth rather than status, because students' learning and growth is the primary goal of schooling. In other words, studies should be longitudinal rather than cross-sectional. Most research, however, has employed cross-sectional models because empirical methods of measuring change have been considered inappropriate and invalid. This study argues that the discussions about measuring change have been unjustifiably restricted to the two-wave model. It supports a more recent longitudinal approach to the measurement of change: one can estimate individual growth more accurately using multiwave data.

Third, a model should accommodate the hierarchical characteristics of school data, because schooling is a multilevel process. This study employs a Hierarchical Linear Model (HLM) as the basic methodological tool to analyze the data.

The subjects of the study were 648 elementary students in 26 schools. Scores on three subtests of the Canadian Tests of Basic Skills (CTBS) were collected for this grade cohort across three years (grades 5, 6 and 7). Between-school differences were analyzed using the six models previously mentioned. Students' general cognitive ability (CCAT) and gender were employed as the controls for background characteristics.

Schools differed significantly in their average levels of academic achievement at grade 7 across the three subtests of the CTBS. Schools also differed significantly in their average rates of growth in mathematics and reading between grades 5 and 7. One interesting finding was that the bias of the unadjusted model against the adjusted model for the multiwave design was not as large as that for the cross-sectional design.
Because the multiwave model deals with student growth explicitly, and growth can be reliably estimated for some subject areas even without controls for student intake, this study concluded that multiwave models are a better design for estimating school effects. This study also discusses some practical implications and makes suggestions for further studies of school effects.

Table of Contents

Abstract
List of Tables
1. Introduction
  1.1. Background of the Problem
  1.2. Definition of Terms
  1.3. Identification of the Problem
  1.4. Research Questions
2. Review of Literature
  2.1 Measurement of Change
    Cross-Sectional Designs
    Two-Wave Designs
    Multiwave Designs
  2.2 Analysis of Multilevel Data
    Aggregation Bias
    Choosing the Appropriate Unit of Analysis
    Contextual Effect
    Specification of Appropriate Analytical Model
  2.3. Hierarchical Linear Model
    Application of the HLM for Studies of School Effectiveness
    Application of the HLM for Measuring Change
    A Three-level Hierarchical Linear Model
3. Research Methodology
  3.1 Subjects and Data Collection
  3.2 Data Analysis
    Model Ia (Cross-sectional Model without Controls)
    Model Ib (Cross-sectional Model with Controls)
    Model IIa (Two-wave Longitudinal Model without Controls)
    Model IIb (Two-wave Longitudinal Model with Controls)
    Model IIIa (Three-wave Longitudinal Model without Controls)
    Model IIIb (Three-wave Longitudinal Model with Controls)
    Estimation of Bias
4. Results
    School Differences in Grade 7 Status
    School Effects on the Rate of Growth
    Differences between Two-wave Model and Multiwave Model
    Homogeneity of Regression Slopes
    Bias of Models
5. Summary and Conclusion
References
Appendix

List of Tables

Ia. HLM Results for Cross-sectional Model Without Controls (Model Ia)
Ib. HLM Results for Cross-sectional Model With Controls (Model Ib)
IIa. HLM Results for Two-wave Model Without Controls (Model IIa)
IIb. HLM Results for Two-wave Model With Controls (Model IIb)
IIIa. HLM Results for Multiwave Model Without Controls (Model IIIa)
IIIb. HLM Results for Multiwave Model With Controls (Model IIIb)
IV. Homogeneity of Regression Slopes
V. Extent of Bias

1. Introduction

1.1. Background of the Problem

The evaluation of the effectiveness of school systems has received considerable attention from educators, policy makers and the public (Murnane, 1987). The public continues to demand educational accountability. They want to know whether they are receiving a good return for the money they invest in education, and they call for improved accountability procedures. To meet these demands, educators and policy makers are interested in monitoring student achievement and identifying 'effective' schools—those that enhance student achievement more than other schools. They have systematically collected 'performance indicators' of school effects while examining the following important questions: Which characteristics of teachers and schools are associated with student achievement? To what extent? How can limited resources be allocated most effectively? Many researchers have examined the relative efficiency of various school policies and management practices on student learning over time. They have asked whether educators should concentrate on certain types of policies or should finance others.
Data on performance indicators have been used for evaluation and policy decisions at different levels of the educational system, and recently have been used in school award programs in California, Florida, and South Carolina (Mandeville & Anderson, 1987). Attempts to identify 'effective schools' have created many controversies over the kind of data to be collected, the appropriate methodology, and the interpretation of results. Data on performance indicators should be carefully collected and interpreted when comparing schools.

Some school districts and provincial evaluation programs make only simple comparisons of schools' mean scores on standardized tests. Even though this is the easiest way to compare schools, it is not a fair and adequate evaluation. Through formal and informal selection procedures, schools differ in their student composition. School settings can be considered 'treatments' in a quasi-experimental design. Because random assignment to schools is not possible, every possible variation among schools associated with student background should be controlled in order to reject plausible rival hypotheses regarding between-school differences. Otherwise, the school (treatment) effect can be confounded with other unspecified factors (Campbell & Stanley, 1966). Without considering the differences in the characteristics of students when they enter school, such as their general cognitive ability, prior achievement, and family background, it does not make sense to identify schools that are exceptional in their performance. A comparison of school means will falsely suggest that schools with more advantaged students do better than those with less advantaged students. Moreover, unless all the variables associated with student background are controlled, estimates of school effects will be biased. Students' family background variables alone do not appear to be sufficient for controlling initial differences (Willms, 1985). A premeasure, parallel to the outcome measure of interest, is known to be by far the most important control variable for adjusting for student background (Alexander & Pallas, 1983). Therefore, many researchers have tended to move away from cross-sectional designs toward longitudinal designs (Willms, 1985; Gray, 1988).

However, true longitudinal designs—designs measuring individual change over time—have not been widely used until recently. Researchers who collected data at two or more points in time analyzed them as if they were separate cross-sectional studies. Many authors persuaded empirical researchers not to use the difference score, which is the natural way of measuring change, and to reframe questions about change as questions about status (Harris, 1963; Cronbach & Furby, 1970; Linn & Slinde, 1977). More recently, however, a few researchers have argued that the previous discussions about the difference score caused many misunderstandings about the measurement of change. They emphasize the advantages of multiwave longitudinal models over traditional cross-sectional models (Rogosa, Brandt, & Zimowski, 1982; Rogosa & Willet, 1983; Willet, 1988).

The development of appropriate analytical models of school effects is necessary for fair and adequate evaluation of school practices, and for effective decisions about school policies. Such a model should be able to take account of all the relevant factors that affect student achievement. It should also provide more valid estimates of school effects for comparing schools.
Until now, there have been discrepancies between the models researchers have used and the inferences they have drawn from them; this may be one of the reasons that research on teacher and school effectiveness has produced inconsistent findings.

1.2. Definition of Terms

The following terms, with their accompanying definitions, will be used throughout this study.

Cross-sectional design: A research design which employs data collected at a single point in time.

Longitudinal design (Growth design): A research design which employs data collected at more than one time point. This design includes both two-wave and multiwave designs.

Two-wave design: A research design based on measures at two time points, such as a Pretest-Posttest Design (Campbell & Stanley, 1966, p. 13), or a posttest design with prior measures of ability or academic achievement.

Multiwave design: A research design based on measures at more than two time points, such as a time series design in which the same individuals are assessed on three or more occasions.

Difference score: The simple gain score between two time points.

Rate of growth: The average increase in students' scores over a fixed period.

1.3. Identification of the Problem

The question of whether schooling has significant effects on student outcomes has attracted researchers for at least three decades. They have tried to assess causal relationships between school characteristics and student achievement using various models. The review by Bridge, Judd, & Moock (1979) summarizes the work done during the late sixties and seventies. Recently, several review articles have noted the conceptual and methodological problems which hampered the earlier studies of school effects. There have been two major methodological problems: the problem of measuring change (Bryk, 1980; Rogosa & Willet, 1983; Willet, 1988), and the problem of analyzing multilevel data (Burstein, 1980; Aitkin & Longford, 1986; Goldstein, 1987; Raudenbush & Bryk, 1988). There is no doubt that student learning and growth is of central interest to educational researchers and that such growth occurs in hierarchical settings.

Until now, measuring 'change' has not received the attention it deserves. Many influential authors condemned the difference score as unreliable or invalid. Cronbach & Furby (1970) suggested that questions of learning should be framed as questions of 'status' rather than questions of 'growth'. Rogosa and Willet (1983), among others, recently argued that this was simply an inappropriate conceptualization of 'change': it had been conceptualized as an 'increment' in the time period between premeasure and postmeasure rather than as a process of continuous development over time (Willet, 1988, p. 347). This misconceptualization suggested that measuring change was limited exclusively to two-wave designs. But recent longitudinal approaches describe 'change' as continuous growth based on more than two time points. By collecting additional waves of data, individual growth can be measured more appropriately and accurately (Willet, 1988; Bryk & Raudenbush, 1988).

The purpose of the present study is to compare various models for assessing school effects, and to demonstrate the benefits of modelling growth rather than status. The analyses provide estimates of school effects based on three research designs: cross-sectional, two-wave, and multiwave.
With each design, I compare a model that includes no adjustment for students' gender and ability with one that includes controls for students' gender and prior ability. The study is restricted to comparisons of academic performance and does not address broader questions about school effectiveness. The study has implications for those collecting indicators of school performance at the school, school district, or provincial level, because it examines how performance data should be collected and interpreted for a fair evaluation of school effects.

1.4. Research Questions

This study assesses differences among schools using the six models listed below:

Ia. Cross-sectional model of grade 7 status without controls for students' gender and prior ability.
Ib. Cross-sectional model of grade 7 status with controls for students' gender and prior ability.
IIa. Longitudinal model of growth based on measures at two points in time, without controls for students' gender and prior ability.
IIb. Longitudinal model of growth based on measures at two points in time, with controls for students' gender and prior ability.
IIIa. Longitudinal model of growth based on measures at three points in time, without controls for students' gender and prior ability.
IIIb. Longitudinal model of growth based on measures at three points in time, with controls for students' gender and ability.

In comparing these models, four research questions were addressed.

1. Are there significant differences among schools in their average levels of academic achievement at the end of grade 7, after controlling for students' gender and prior ability? (Model Ia & Model Ib)
2. Are there significant differences among schools in their students' average rates of growth in academic achievement between grades 5 and 7, after controlling for students' gender and prior ability? (Model IIIa & Model IIIb)
3. To what extent do estimates of average rates of growth (adjusted for gender and prior ability) based on a two-wave model differ from those based on a multiwave model? (Models IIa, IIb & Models IIIa, IIIb)
4. To what extent are estimates of average levels of achievement or average rates of growth biased if there is no adjustment for students' gender or prior ability?

The second chapter reviews the major literature related to two basic problems in most educational studies and discusses the procedures developed to resolve these problems. The Hierarchical Linear Model (HLM) that was employed as a key methodological tool of this study is also reviewed. Chapter 3 outlines the sample, the data collected, and the models used in the present study. The fourth and fifth chapters present the results and conclusions from the investigation.

2. Review of Literature

This chapter outlines two questions relevant to the current research questions: Why and how should we measure change? How can we analyze the multilevel data that describe many educational processes? These are not separate issues. As Burstein (1980) claimed, the specification of appropriate analytical models can be the answer to both questions. Among the several statistical models that have been developed to address these problems, this study reviews the Hierarchical Linear Model (HLM) in detail (Bryk & Raudenbush, 1988; Raudenbush & Bryk, 1986, 1988).
The HLM has been used in a variety of educational studies, such as studies of school effectiveness (Raudenbush & Bryk, 1986; Willms & Raudenbush, 1989) and studies of individual growth trajectories or the measurement of change (Bryk & Raudenbush, 1987, 1988; Willms & Jacobsen, 1990). This chapter also examines various applications of the HLM.

2.1 Measurement of Change

Research studies attempting to compare schools in terms of their performance indicators must address an important question: Should we use a 'snapshot' of achievement as an indicator of school quality, or should we use the rate of growth? The review comprises three categories of design: cross-sectional, two-wave, and multiwave designs.

Cross-Sectional Designs

In earlier studies, researchers tried to assess schooling effects at a single point in time with a cross-sectional design. They tried to explain the relationships between certain schooling inputs and student outcomes measured at one time point, after controlling for students' background characteristics. A major concern about the cross-sectional design was the problem of 'selection bias'. Unless all of the relevant variables were controlled, the estimates of school effects would be biased. This was one of the major criticisms of the historic report by Coleman, Hoffer, and Kilgore (1981). They analyzed data from the first wave of the High School and Beyond (HSB) study and attempted to estimate differences between public and private schools in their average performance levels. They controlled for differences between the two sectors in their background characteristics using student-level variables such as family income, parental education, race, ethnicity, and family structure. But the study was limited in that the most important control variable, a measure of prior achievement or aptitude before high school, was not available in the 1980 data. The accuracy of their estimates depended on their statistical model to control for selection bias. Many critics attempted to address this problem (Alexander & Pallas, 1983; Willms, 1985). Therefore, when the 1982 data became available, subsequent analyses were able to estimate the sector effects more accurately using a longitudinal design—see Haertel, James & Levin (1987) for a detailed review.

In addition, longitudinal designs have a conceptual advantage over cross-sectional designs, because they deal with individual 'growth' rather than 'status'. A cross-sectional design can at best represent the prevailing 'status' measured at one time, whereas the fundamental questions of educational research ought to be concerned with individual learning, which implies change and growth (Willet, 1988, p. 346).

Two-Wave Designs

The most frequently used longitudinal design has been the two-wave design, a traditional Pretest-Posttest design (Cook & Campbell, 1979). The typical pretest-posttest design consists of an assessment of individual performance using the same measure before and after treatment. If subjects are not randomly assigned to groups, we need to adjust for preexisting differences among groups in order to assess the treatment effect validly. There are several ways of adjusting for pre-treatment group differences—see Anderson, Auquier, Hauck, Oakes, Vandaele, & Weisberg (1980) for a detailed review. Analysis of covariance (ANCOVA) and the use of difference scores or gain scores are two of the most popular methods of analyzing pre- and postmeasure data. In the gain-score approach, we compute the mean gains across groups, and the treatment effect is simply the difference in these mean gains among treatment groups.
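To make the distinction concrete, here is a minimal sketch in Python contrasting the two adjustments on the same pretest-posttest data. This is my own illustration, not analysis from the thesis: the data are simulated and all names and values are hypothetical.

```python
# Gain-score vs. ANCOVA adjustment on simulated pretest-posttest data
# (illustrative only; data and parameter values are invented).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
group = rng.integers(0, 2, n)                      # 0 = control, 1 = treatment
pre = rng.normal(100, 15, n) + 5 * group           # groups already differ at intake
post = pre + 8 + 4 * group + rng.normal(0, 5, n)   # true treatment effect = 4
df = pd.DataFrame({"group": group, "pre": pre, "post": post})
df["gain"] = df["post"] - df["pre"]

# (1) Difference-score analysis: the treatment effect is the difference in mean gains
gain_effect = (df.loc[df.group == 1, "gain"].mean()
               - df.loc[df.group == 0, "gain"].mean())

# (2) ANCOVA: regress the posttest on group membership, adjusting for the pretest
ancova_effect = smf.ols("post ~ group + pre", data=df).fit().params["group"]
print(gain_effect, ancova_effect)
```

In this simulation the gain does not depend on the pretest, so both approaches recover the true effect; when groups differ at intake and growth depends on initial status, the two adjustments can diverge (Lord's paradox).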
Nevertheless, the statistical properties of the difference score have become the focus of criticism in theories of measurement. Many authors condemned the difference score as an unfair measure of individual growth because of its low reliability and the fact that it is negatively correlated with initial status (Bereiter, 1963; Linn & Slinde, 1977; Cronbach & Furby, 1970).

First, they argued that the difference score cannot be both reliable and valid simultaneously. Based on the classical test score model, they argued that the difference score cannot be interpreted as a valid estimate of individual growth unless the premeasure and postmeasure are highly correlated. They also argued that this requirement for construct validity makes the difference score intrinsically unreliable, according to the following expression for the reliability of the difference score D = X2 − X1:

ρ_D = (σ²_X1 ρ_X1 + σ²_X2 ρ_X2 − 2 σ_X1 σ_X2 ρ_X1X2) / (σ²_X1 + σ²_X2 − 2 σ_X1 σ_X2 ρ_X1X2)

where ρ_X1 and ρ_X2 are the reliabilities of the two measures and ρ_X1X2 is the correlation between them.

Second, in addition to low reliability, many empirical findings have shown that the correlation between the difference score and the pretest (r_X1D) is typically negative. Based on this, Linn and Slinde (1977) argued that the difference score is inappropriate for measuring individual change, because it gives "an advantage to persons with certain values of the pretest scores" (p. 125). Concerned with the negative bias of the sample correlation, some authors recommended the use of the residual change score, which is uncorrelated with initial status (Cronbach & Furby, 1970). But it is hardly interpretable in many cases and does not provide estimates of individual growth much different from those based on the simple difference score.

These purported problems of the difference score are basically due to measurement error, and are not problems of the difference score per se. The negative bias of r_X1D is caused by measurement error in the estimation of correlations. The sample correlation r_X1D is not a good estimate of the correlation between true change and true initial status, which is the only correlation of any real interest (Rogosa, Brandt & Zimowski, 1982). This correlation can be estimated if we can separate the true parameter variance from the observed variance.

The low reliability of the difference score is also related to measurement error. The important ideas are that reliability is a ratio of true variance to observed variance, and that difference scores are notoriously unreliable because the observed variance is exceedingly large: it includes the variance due to measurement error from both occasions. The true variance, however, includes the variance due to systematic differences among individuals. The classical test score framework overlooks the possibility of variation in the rates of growth among individuals: some might grow faster or slower than others, and individuals may change at different rates during different time periods. When true differences in the rates of growth exist, the true variance becomes larger, and the ratio of true variance to observed variance (reliability) can be high, even higher than the reliability of the scores themselves. In other words, the difference score is not intrinsically unreliable if there are substantial differences among individuals in their rates of growth.
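To see how this works numerically, the following sketch evaluates the classical reliability formula above for hypothetical values (my own illustration, not figures from this study). Holding the measurement quality of the two tests fixed, the reliability of the difference score rises as true growth rates become more heterogeneous, which lowers the correlation between the two occasions.

```python
# Reliability of the difference score D = X2 - X1 under the classical formula,
# evaluated for hypothetical values (not data from this study).
def diff_score_reliability(var1, var2, rel1, rel2, r12):
    """var1, var2: observed-score variances; rel1, rel2: reliabilities of
    the two measures; r12: correlation between the two measures."""
    cov = 2 * (var1 * var2) ** 0.5 * r12
    return (var1 * rel1 + var2 * rel2 - cov) / (var1 + var2 - cov)

# Reliable tests, highly correlated occasions: individuals grow at nearly
# the same rate, so true gains barely vary and reliability is low
print(diff_score_reliability(100, 100, 0.9, 0.9, 0.85))  # ~0.33

# Same tests, but the occasions correlate less because growth rates differ:
# true gain variance is now substantial, and reliability is high
print(diff_score_reliability(100, 100, 0.9, 0.9, 0.50))  # ~0.80
```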
Even when the estimated reliability is low, the difference score can be an accurate and useful measure of change, because "low reliability does not necessarily imply lack of precision" (Rogosa et al., 1982, p. 731). If individuals grow at the same rate, the reliability of the difference score will be low no matter how precisely the difference score is measured. That is, low reliability can reflect a lack of systematic variation among individuals as much as measurement error. Rogosa and Willet (1983) demonstrated the statistical properties of the difference score under different circumstances, and showed that the difference score is an unbiased estimate of individual growth.

Multiwave Designs

The difference score between two data points might be unreliable due to large measurement error and a lack of variation in individual growth. However, the precision of the difference score can be improved with data collected at more than two time points. The precision of the parameter estimates increases as the number of data points increases. Moreover, two-wave data require external information about precision, whereas precision can be estimated from multiwave data.

In addition, the two-wave design does not give enough information about individual growth. It fails to provide unambiguous evidence about the shape of individual growth: whether it is linear or nonlinear over time. With only two waves of data, growth must be assumed to be linear. Even if the constant-rate-of-growth (linear) model appears to be appropriate locally—quite likely in most cases—it is still interesting to see whether the rate of growth might be nonlinear in the long run. Hence, Willet (1988) suggested collecting additional waves of data instead of trying to fix the flaws of two-wave designs with other fallible statistical controls. With multiwave data, a researcher can more confidently specify an appropriate statistical model for assessing individual growth.

The alteration from a question of 'status' to a question of 'change', and the extension of a simple two-wave longitudinal design to a multiwave design, appear to be quite logical. But unfortunately, multiwave data have rarely been used in the literature on measuring change. Researchers have disregarded the advantages of multiwave data: if the simple difference score is intrinsically unreliable, why should one bother to collect additional data? But such arguments are no longer reasonable when the difference score can be both reliable and valid, and the multiwave design can improve on the precision of the growth parameters from a two-wave design.

2.2 Analysis of Multilevel Data

Another troublesome and long-standing methodological concern in educational research has been the problem of analyzing multilevel data. Schooling is a multilevel process. Students receive schooling in classrooms, classrooms are nested within schools, and schools are nested within school districts. Educational decisions are made at different levels of this hierarchy (Barr & Dreeben, 1983). But traditional models in educational research have employed single-level analyses. By ignoring the obvious multilevel structure, these single-level models did not match the nested multilevel educational processes (Haney, 1977). Thus, Burstein (1980) called for the development of an appropriate statistical model that could specify the processes occurring within each level of an hierarchy.
Such models should specify how schooling inputs measured at different levels of aggregation influence student outcomes. Burstein (1980) summarized the following four methodological difficulties of analyzing nested data: "(1) cross-level inference or aggregation bias, (2) choice of the appropriate unit of analysis, (3) contextual analysis, and (4) specification of appropriate analytical models for multilevel data" (p. 161).

Aggregation Bias

Issues of 'cross-level inference' arise when researchers make inferences about individual behavior from analyses based on school-level data. Aggregate analyses have been common in educational research because disaggregated data are usually difficult to obtain and complex to analyze. However, aggregate analyses are likely to provide biased estimates of student-level effects. If the purpose of the study is to evaluate educational effects on individual students' performance, it is not appropriate to use aggregate data, because the level of aggregation should match the level at which one wishes to make inferences (Haney, 1977).

Choosing the Appropriate Unit of Analysis

Concerns about 'aggregation bias' have been closely related to the issue of 'choosing the appropriate unit of analysis'. Researchers first noticed the problem of ignoring 'classroom' factors and choosing the 'student' as the unit of analysis. If 'treatment' is implemented in a classroom setting, ignoring the classroom (grouping) factor may violate the assumption of independent responses required for performing analysis of variance. Such violation will lead to liberal tests of significance and increase the probability of Type I error (Aitkin, Anderson & Hinde, 1981). To avoid this problem, some researchers (for example, Wiley, 1970; Glass & Stanley, 1970) suggested that the appropriate unit of analysis is the class, where instruction is received simultaneously by all students. On the other hand, researchers who supported the choice of students as the unit of analysis asserted that student reaction is individual and thus the focus of the evaluation should be on the individual student—see Wittrock & Wiley (1970) for a detailed discussion. Some researchers chose to analyze their data in both ways, that is, at both the student and classroom levels. But this strategy provided no guidance on how to proceed if the results diverged, as occurred in the evaluation of Project Follow Through (Haney, 1977). Burstein (1980), among others, argued that the 'choice of unit' is not the right question. Because relationships between variables at one level influence relationships at other levels, choosing a single correct unit is simply a digression. The emphasis should be on choosing an appropriate analytical model which allows the estimation of random variation at both levels, rather than on choosing a specific unit of analysis (Burstein, 1980, p. 196).

Contextual Effect

Another consideration for examining school effects is the 'contextual effect' of group membership. The contextual effect is "the effect that a group's aggregate characteristics have on outcomes, over and above the effects due to the individual-level characteristics" (Raudenbush & Willms, 1988). For example, the school mean of ability has been used to represent the ability context of the school. A contextual effect for ability occurs when school mean ability is related to individual outcomes after controlling for the effect of individual ability and other relevant individual-level factors. Individual ability affects performance.
Moreover, school mean ability may affect instructional practices by causing schools to adjust their instruction to the level of the students' ability. As a result, individual students within a school can be expected to learn more or less than they would have in other schools. In school effects research, the contextual effect is viewed as independent of one's own ability, whereas the so-called 'frog-pond' effect represents the interaction between school mean ability and individual ability. The two are not easily separated. Furthermore, the contextual effect includes the influence of wider social, economic and political factors of the community that are usually intractable, such as community support for education, local resources, and residential segregation (Dyer, 1970; Willms & Raudenbush, 1989). That is, the contextual effect includes all the effects that are not directly manipulable by the school after controlling for students' background characteristics. Therefore, contextual effects are somewhat difficult to estimate. Research, however, has shown that it is important to examine school effects with and without controls for contextual effects, that is, Type A and Type B effects (Raudenbush & Willms, 1988). And the contextual effects that are not easily manipulable by the school should be distinguished from substantive school policies and practices when the purpose is to hold teachers and principals accountable: they are not responsible for factors which lie beyond their control. With an appropriate model, we should be able to specify all the relevant group-level variables which affect individual outcomes.

Specification of Appropriate Analytical Model for Multilevel Data

All the above problems call for one solution: the specification of an appropriate analytical multilevel model (Raudenbush & Bryk, 1988). An appropriate statistical model should identify and sort out the effects attributable to the different levels within the educational system. If we can specify such a model, all the other problems which threaten valid inferences from multilevel data can be resolved. Several analytical models have been developed (Dyer, 1970; Dyer, Linn, & Patton, 1969; Marco, 1974; Aitkin & Longford, 1986; Goldstein, 1986; Raudenbush & Bryk, 1986). Among them, this review now turns to the recently developed hierarchical linear model.

2.3. Hierarchical Linear Model

Since the publication of Burstein's (1980) review, several groups of researchers have developed statistical models which can resolve the problems associated with multilevel data. These models have been called variance component models (Aitkin & Longford, 1986), mixed linear models (Goldstein, 1986), and hierarchical linear models (Raudenbush & Bryk, 1986, 1987). Despite differences in their algebraic operations, these methods share common properties. First, they provide explicit structural models for the nested processes occurring within each level. These models explicitly specify how explanatory variables at a higher level of aggregation influence outcomes at a lower level. Second, these methods allow researchers to specify and test the random effects of each unit. That is, a researcher can test whether the slopes of the control variables are homogeneous across schools. Whether or not the random effects prove statistically significant, the variation among subgroups and among schools is of interest.
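As a concrete illustration of these shared properties, the sketch below fits a two-level random-coefficient model to simulated school data, using statsmodels' MixedLM as a stand-in for dedicated HLM software (the data and parameter values are invented for the example). The random intercept corresponds to a school's adjusted mean, and the random slope to a school-specific effect of the control variable.

```python
# A minimal two-level random-coefficient model on simulated data (all values
# invented), using statsmodels' MixedLM as a stand-in for HLM software.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
school = np.repeat(np.arange(26), 25)          # 26 schools, 25 students each
ability = rng.normal(0, 13, school.size)       # centered ability (e.g., CCAT - 100)
u0 = rng.normal(0, 3.0, 26)[school]            # school deviations in intercept
u1 = rng.normal(0, 0.1, 26)[school]            # school deviations in ability slope
score = 75 + 0.5 * ability + u0 + u1 * ability + rng.normal(0, 8, school.size)
df = pd.DataFrame({"school": school, "ability": ability, "score": score})

# Random intercept and a random ability slope for each school; the fitted
# variance components estimate how much intercepts and slopes vary across schools
model = smf.mixedlm("score ~ ability", df, groups=df["school"], re_formula="~ability")
print(model.fit().summary())
```

The estimated variance components answer exactly the questions raised above: whether schools differ in adjusted means, and whether the slope of the control variable is homogeneous across schools.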
Among these methods, this study adopts the term 'hierarchical linear model', developed by Raudenbush and Bryk, which will be labelled HLM for convenience. This review first describes applications of the HLM to studies of school effectiveness and of individual growth. It then describes a three-level HLM, which is a promising approach to the combined problem of school effects on individual growth.

Application of the HLM for Studies of School Effectiveness (Multilevel Data)

The general framework for analyzing school effects is multiple regression, with the schooling outcome regressed on variables describing student intake and school characteristics. Although many researchers have noticed mismatches between single-level linear models and multilevel educational processes, such traditional linear models have been widely used because there have been no viable alternatives. One promising approach to resolving these problems was the 'slopes-as-outcomes' model (Burstein, Miller & Linn, 1979). This approach allows researchers to explore the effects of schooling inputs on the structural relationships within schools (slopes) as well as to examine the effects of schooling inputs on average performance (intercepts). In spite of these advantages, the method was not widely used because of a number of technical difficulties: the precision of the estimated slopes (unreliability of slopes), the distinction between parameter and sampling variance, and multiple slopes as outcomes (Raudenbush & Bryk, 1986). Recent advances in statistical theory, such as the EM (Expectation-Maximization) algorithm and Bayesian estimation, have helped to overcome some of these difficulties.

As a 'slopes-as-outcomes' model, the HLM utilizes both intercept and slope coefficients as outcome variables at the next stage. It provides a powerful tool that permits a separation of within-school variation from between-school variation, and therefore allows the researcher to distinguish parameter variance from observed variance.

A two-level HLM for analyzing school effects can be specified with two sets of equations. The first set comprises within-school equations, which describe the relationships between an outcome measure and students' background variables, such as socioeconomic status, home environment, previous achievement, and level of general ability. The second set comprises between-school equations, which describe the relationships between the regression coefficients from the first set (i.e., intercept and slopes) and various school-level variables, including school policy and context variables. Following the notation of Willms and Raudenbush (1989), the within-school regression model can be written as:

Y_ij = β_j0 + β_j1 X_ij1 + ... + β_jk X_ijk + ε_ij    (1)

where Y_ij is the outcome score for student i (i = 1, ..., n_j) in school j (j = 1, ..., J). There are k independent variables, X_ijk, which describe students' background characteristics. The β_jk are within-school regression coefficients and the ε_ij are student-level residuals. If the background variables are centered around their means for the entire sample, the estimates of the intercept, β_j0, are the background-adjusted school means. They describe how well a student with sample-average background characteristics can be expected to score in each school. The estimates of the slopes, β_j1, β_j2, ..., β_jk, describe the effect of each background variable on the outcome score.
When the Y_ij and X_ijk are standardized, the coefficients are equivalent to the partial correlations between the outcome and the background variables.

At the second stage, researchers are interested in whether these estimates of intercepts (β_j0) and slopes (β_j1, ..., β_jk) are a function of particular school policies and practices. The second-level equation can be described as:

β_jk = θ_0k + θ_1k P_j + θ_2k C_j + U_jk    (2)

where the β_jk are the regression coefficients from the within-school equations. Each β_jk is regressed on school policy variables, P_j, and on school context variables, C_j. The between-school regression slopes, θ_1k and θ_2k, capture the effects of school-level variables on the within-school structural relationships (β_jk). The between-school residuals, U_jk, denote the unique contribution of each school that is not explained by the school-level variables in the model. When we do not specify school-level variables in the model, the U_jk represent overall school differences. The variation of these residuals is particularly interesting because it represents the extent to which schools vary in their background-adjusted achievement.

Equations (1) and (2) can be combined into a single equation. Only two student-level variables and two school-level variables are considered here:

Y_ij = θ_00 (grand mean)
     + θ_01 X_ij1 + θ_02 X_ij2 (controls for student background)
     + θ_10 P_j + θ_20 C_j (effects of school-level variables)
     + θ_11 P_j X_ij1 + θ_12 P_j X_ij2 + θ_21 C_j X_ij1 + θ_22 C_j X_ij2 (interaction effects)
     + U_j1 X_ij1 + U_j2 X_ij2 + U_j0 (school-level residuals)
     + ε_ij (student-level residuals)    (3)

If the student background variables and school characteristic variables are centered around their means for the entire sample, the intercept coefficient, θ_00, represents the grand mean of the entire sample. The slope coefficients, θ_01 and θ_02, denote the effects of the student background variables, and θ_10 and θ_20 denote the effects of the school-level variables. The statistical test for the homogeneity of regression slopes among groups asks whether the variation of the slope residuals, U_j1 or U_j2, is statistically significant. In other words, researchers can test whether the effects of student-level variables differ across schools and, if so, whether the variation is associated with particular school-level variables. They might conclude that certain school policies are effective in some schools but not in others. Moreover, the above equation permits researchers to test whether school-level variables have the same effect on student outcomes across subgroups with different background characteristics. The terms θ_11 P_j X_ij1, θ_12 P_j X_ij2, θ_21 C_j X_ij1, and θ_22 C_j X_ij2 represent the effects of the interactions between school-level variables and student background variables (e.g., aptitude-treatment interaction). Researchers might conclude that certain school policies are more effective for some students than for others within a school. The equation also distinguishes the school-level residuals (U_j1 X_ij1, U_j2 X_ij2, U_j0) from the student-level residuals (ε_ij).

In equation (3), the distinction between policy variables and context variables is made in the sense that the first are endogenous to school systems whereas the latter are exogenous. Based on the interests of different groups, the emphasis can be either on the particular effects of school policy or on the overall effects of schooling, including both contextual and policy effects (Raudenbush & Willms, 1988). However, researchers often do not have measures of school policies and practices (P_j), or data describing contextual factors (C_j).
In this case, equation (3) simplifies to:

Y_ij = θ_00 (grand mean)
     + θ_01 X_ij1 + θ_02 X_ij2 (controls for student background)
     + U_j1 X_ij1 + U_j2 X_ij2 + U_j0 (school effects)
     + ε_ij (student-level residuals)    (4)

The school-level residuals then represent overall school effects (Type A effects).

Application of the HLM for Measuring Change

A hierarchical linear model can also be used for the measurement of change. In a longitudinal HLM, time is 'nested' within each subject, analogous to subjects nested within schools. Therefore, the first level of the HLM (within-subject) represents the initial status and growth rate for each subject. At the second level (between-subject), the growth parameters of each subject become outcome variables and are explained by subject-level background characteristics. The within-subject equation can be described with a set of regressions that model outcome scores on time:

Y_it = π_i0 + π_i1 a_it + u_it    (5)

where Y_it is the outcome score for student i at time t (t = 0, 1, 2, ...), and a_it is the age of student i at time t. If a_it is centered on the first occasion, then π_i0 describes the initial status of student i and π_i1 denotes the rate of growth of student i. The residual, u_it, includes both sampling and measurement error; the ability to separate these sources of error is a chief advantage of the HLM in growth applications. An important feature of equation (5) is the assumption that the growth parameters vary among individuals. The between-subject equation can be formulated to represent this variation. In this equation, the individual growth parameters are a function of students' background variables. The between-subject equation includes two sets of equations: one examines the relationships between initial status (π_i0) and students' background variables; the other examines the relationships between the rate of growth (π_i1) and students' background variables.

π_i0 = β_00 + β_01 X_i1 + β_02 X_i2 + e_i0 (initial status)    (6a)
π_i1 = β_10 + β_11 X_i1 + β_12 X_i2 + e_i1 (rate of growth)    (6b)

The HLM estimates how reliably these growth parameters are measured. By separating within-subject variance from between-subject variance, it allows researchers to partition the observed variance in the estimated regression coefficients into two components: sampling variance and parameter variance. Reliability is the ratio of the true parameter variance to the observed variance, analogous to reliability in classical test score theory. If most of the variability in growth trajectories were due to sampling error, it would be almost impossible to find any systematic relationships between these estimates and the background variables at the second stage. One could not detect significant relations, even if they existed, because the data would not provide enough power. In that case, the percentage of total variance explained may be very small even when the between-subject model is accounting for most of the explainable variance, that is, the parameter variance.

Along with the reliabilities, the HLM utilizes empirical Bayes methods for estimating the parameters of interest. The empirical Bayes methods correct the ordinary least squares (OLS) growth parameters (slopes) according to their reliabilities. If the slopes are not very reliable, the empirical Bayes estimates are based primarily on the estimated mean slope for the sample. Therefore, outliers resulting from large sampling error in the OLS slopes are controlled: the empirical Bayes method shrinks these outliers in toward the overall sample mean. As a result, the HLM enables a more accurate measurement of slopes than does OLS.
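The shrinkage logic can be sketched in a few lines. The following is a simplified, illustrative version of empirical Bayes estimation with hypothetical numbers, not the exact algorithm used by HLM software: each OLS slope is pulled toward the grand mean in proportion to its unreliability.

```python
# Empirical Bayes shrinkage of OLS growth-rate estimates, in the spirit of
# the HLM estimation described above (a simplified sketch; values invented).
import numpy as np

ols_slopes = np.array([1.2, 0.4, 2.1, 0.9, 1.6])   # per-student OLS growth rates
sampling_var = np.array([0.5, 0.1, 0.9, 0.2, 0.4]) # error variance of each estimate

grand_mean = ols_slopes.mean()
# Parameter (true-slope) variance: observed variance minus average sampling variance
param_var = max(ols_slopes.var(ddof=1) - sampling_var.mean(), 0.0)

# Reliability of each estimate = parameter variance / observed variance;
# less reliable slopes are pulled harder toward the grand mean
reliability = param_var / (param_var + sampling_var)
eb_slopes = reliability * ols_slopes + (1 - reliability) * grand_mean
print(eb_slopes)
```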
A Three-level Hierarchical Linear Model

Applications of the HLM to multilevel data and to measuring change can be combined to provide a promising resolution of these two methodological problems (Raudenbush & Bryk, 1988). For a cross-sectional design of school effects, a two-level hierarchical linear model provides a useful tool for conceptualizing the multilevel character of schooling processes. But the cross-sectional design represents the effects of schooling on students' status, which is one instance of accumulated growth over time. Conceptually, a model which estimates schooling effects on students' growth seems more appropriate than a model which estimates schooling effects on students' status. The problem with a two-wave longitudinal design is that the growth rate is measured less reliably; but reliability increases as the number of data points increases. A multiwave design provides a stronger basis for causal inference in nonexperimental studies than does a cross-sectional or a two-wave design (Cook & Campbell, 1979).

The two-level hierarchical linear model discussed earlier extends naturally to a three-level model when we examine school effects on individual growth. A three-level hierarchical linear model can be conceptualized in the following way. The first two levels of the model are identical to those of the two-level model for individual growth. The first level, the within-subject level, represents individual growth as a function of individual growth parameters plus random error:

Y_ijt = π_ij0 + π_ij1 a_ijt + u_ijt    (7)

where Y_ijt is the outcome score for student i in school j at time t. The second level of the model, the between-subject level, represents the relationships between the growth parameters (π_ij0 and π_ij1) and students' background characteristics (X_ij):

π_ij0 = β_00j + β_01j X_ij + e_ij0 (initial status)    (8a)
π_ij1 = β_10j + β_11j X_ij + e_ij1 (rate of growth)    (8b)

The third level of the model, the between-school level, enables specification of how school characteristics influence the distribution of growth within schools:

β_00j = θ_000 + θ_001 P_j + θ_002 C_j + U_00j (effects of school-level variables on the average initial status for each school)    (9a)
β_01j = θ_010 + θ_011 P_j + θ_012 C_j + U_01j (effects of school-level variables on the within-school slopes of status on background)    (9b)
β_10j = θ_100 + θ_101 P_j + θ_102 C_j + U_10j (effects of school-level variables on the average rate of growth for each school)    (9c)
β_11j = θ_110 + θ_111 P_j + θ_112 C_j + U_11j (effects of school-level variables on the within-school slopes of growth on background)    (9d)

The empirical Bayes estimation employed in the HLM makes it possible to estimate individual growth trajectories more precisely than does OLS. Therefore, the model is more sensitive to the effects of school-level variables at the next stage. Researchers using a three-level model with multiple time points per student are likely to discover important effects of schools on student growth—effects that could not be discovered with cross-sectional or two-wave data.
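A pragmatic approximation to this three-level model—the same two-stage strategy adopted in Chapter 3—can be sketched as follows on simulated data (all values are invented): an OLS growth line is fitted for each student, and the estimated rates of growth are then analyzed with a two-level model across schools.

```python
# Two-stage approximation to the three-level growth model on simulated data:
# stage 1 fits an OLS growth line per student; stage 2 fits a two-level model
# to the stage-1 slopes. (A sketch of the general strategy, not the thesis code.)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_schools, n_students = 26, 25
waves = np.array([-10.0, 0.0, 10.0])   # months of schooling, centered at grade 6
rows = []
for j in range(n_schools):
    school_growth = 1.0 + rng.normal(0, 0.1)        # school mean growth per month
    for i in range(n_students):
        status = rng.normal(69, 10)                 # grade 6 status, months metric
        rate = school_growth + rng.normal(0, 0.15)  # student's true growth rate
        for t in waves:
            rows.append((j, j * n_students + i, t, status + rate * t + rng.normal(0, 3)))
df = pd.DataFrame(rows, columns=["school", "student", "time", "score"])

# Stage 1: OLS slope (rate of growth) for each student
slopes = df.groupby(["school", "student"]).apply(
    lambda d: np.polyfit(d["time"], d["score"], 1)[0]).rename("rate").reset_index()

# Stage 2: do schools differ in their average rates of growth?
result = smf.mixedlm("rate ~ 1", slopes, groups=slopes["school"]).fit()
print(result.summary())
```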
3. Research Methodology

3.1 Subjects and Data Collection

The subjects of this study, a subsample of the data collected by Willms and Jacobsen (1990), comprised a large cohort of students enrolled in one school district who had completed their seventh grade during the 1987-1988 school year. There are 32 elementary schools in the district, which serves two cities and surrounding suburban areas. It also includes a large rural, agricultural area. The population is of mixed socio-economic status and includes several racial and ethnic groups. In the fall of 1988, when the data were collected from the district records, the students of this grade cohort had just begun their eighth grade. The majority of the students were the same age—born in 1974. But approximately 20 percent were one or two years older, as they had repeated one or two grades at some time during their elementary schooling. Approximately 2 percent of the students were one or two years younger than their cohort. The entire cohort included 1122 students.

In May of each year from 1984 to 1989, while the students were in grades 3 to 7, the district administered the Canadian Tests of Basic Skills (CTBS) and the Canadian Cognitive Abilities Test (CCAT). From the entire cohort of 1122 students, only the students who had complete CTBS data for grades 5 to 7 were selected. In addition, students were required to have complete CCAT data for grades 3, 4 and 5, because the average CCAT score across the three years was computed to denote a student's level of general cognitive ability. Based on these criteria, six schools were excluded from the analysis because they had fewer than ten students who met the selection criteria. Estimates of school effects based on small samples may not be reliable, making it difficult to interpret significant between-school differences. The achieved sample, therefore, included 648 students in 26 schools. The sample size for each school ranged from 12 to 55. A demographic description of the achieved sample is attached in the Appendix.

A concern about the resultant final sample was that it might be biased, because low-ability students would be more likely to have incomplete test data. Of the 474 students in the cohort who did not meet the selection criteria, CCAT data at grade 7 were available for 207 students. The difference in average CCAT scores between the excluded group and the final sample was small, even though it was statistically significant (p < .05): 103.195 compared with 105.997. The standard deviations of the two groups were almost the same: 12.65 compared with 13.00. Furthermore, the differences between the group means did not vary significantly by school. Therefore, attrition bias is unlikely to be critical, especially for the longitudinal models, because the loss of subjects with slightly lower ability would have little effect on the average rate of growth for each school.

The outcome measures in this study were the scores on three subtests of the CTBS: reading comprehension, vocabulary, and the composite score for mathematics (mathematics has three subtests—computation, concepts, and problem solving). The Canadian Tests of Basic Skills (CTBS) are norm-referenced achievement tests designed to reflect the continuous nature of skill development in the elementary school (King, 1982). They are concerned more with generalized intellectual skills that are crucial to educational development than with separate measures of achievement in content subjects. Grade-equivalent (GE) scores were used to express the continuous development of basic skills, which was necessary for measuring growth. Instead of the usual grade-equivalent scores, this study employs 'months-of-schooling' as the outcome metric, which is simply a multiple of the GE scores.
For the 'months-of-schooling' metric, GE scores were multiplied by 10 (i.e., 59 and 69 instead of 5.9 and 6.9, respectively) and variances by 100 (i.e., .121 instead of .00121). This makes no difference to the interpretation of individual growth, and it gives more accuracy to the estimates because more decimal places are retained.

The most important student-level background variable in this study was the level of general cognitive ability, measured by the Canadian Cognitive Abilities Test (CCAT). The CTBS and the CCAT were standardized on the same population of students and administered under the same conditions at approximately the same time. This made it possible to examine the relationships between achievement and ability under nearly ideal conditions. In addition, the average CCAT scores on the verbal, quantitative, and non-verbal subtests across three years were computed to obtain more reliable estimates of general ability. Gender was also employed as a control variable for students' background characteristics, because prior research suggested significant gender differences in student achievement (Willms & Jacobsen, 1990).

3.2 Data Analysis

The following models were employed to analyze the data in this study.

Model Ia (Cross-sectional Model without Controls)

This model has been frequently used in school award programs because of its simplicity. But an evaluation based purely on average school performance at a single point in time would not be fair to schools with more disadvantaged students in terms of their background characteristics. If we do not control for initial differences, the estimates of school effects will be biased. Therefore, for improved accountability procedures, it is an interesting question how much the estimates of school effects differ between the model without controls for students' background (Model Ia) and the model with controls (Model Ib).

In Models Ia and Ib, the outcome measures were grade 7 CTBS scores in three subject areas (mathematics, reading comprehension, and vocabulary). The data were analyzed using a two-level hierarchical linear regression model, which enables us to analyze the multilevel data as discussed in Chapter 2. The data can be described with two sets of equations. The first level of the model simply estimates the mean score for each school and assumes all other variation is random error. The within-school equations are as follows:

Y_7ij = β_0j + ε_ij

where Y_7ij is an outcome measure at grade 7 for student i (i = 1, 2, ..., n_j) in school j (j = 1, 2, ..., 26). The parameter β_0j simply represents the average performance of school j, and was employed as an outcome measure in the second level of the model:

β_0j = θ_00 + U_0j

where θ_00 is the grand mean, that is, the average performance of the 26 schools in the sample. The between-school residual term, U_0j, therefore includes the overall school effects—both the context and policy effects of each school—and school-level random error. Unfortunately, no variables describing specific school policies and practices were available in this study. Therefore, this analysis examined only Type A effects, that is, overall effects including both school policy and contextual effects (Willms & Raudenbush, 1988). Because the relevant variables at the school level were not specified, the estimates of school effects in this analysis can be considered upper limits.
Model Ib (Cross-sectional Model with Controls)

This model has been the most popular statistical model for school evaluation. Since schools do differ in their student composition, this model is designed to control for some of the most important differences in students' background characteristics. Two control measures, CCAT and gender, were employed in this investigation. No family background variables were controlled in this study, but many researchers have argued that measures of general ability are more powerful controls for students' background characteristics than family background variables (Alexander, Cook & McDill, 1978). In particular, the CCAT is a very powerful control for the CTBS, since the two tests were standardized on the same population at almost the same time. Gender was another control variable in this analysis, because gender differences have appeared to be significant in some content areas (Fennema, 1980; Martin & Hoover, 1987).

The first level of the model, therefore, examines the relationships between student outcomes and the two background variables:

Y_7ij = β_0j + β_1j(CCAT)_ij + β_2j(Gender)_ij + ε_ij

The intercept parameter, β_0j, denotes the expected grade 7 status for a 'typical' child, because the background variables were centered such that zero values represent what we refer to as a typical child: one with a CCAT score of 100 (the normalized standard mean score for the test). The slope parameters, β_1j and β_2j, denote the effects of CCAT and gender on student outcomes, respectively. All three parameters were employed as outcomes in the next level of the model. The between-school equations are as follows:

β_0j = θ_00 + U_0j (intercept)
β_1j = θ_01 + U_1j (slope for CCAT)
β_2j = θ_02 + U_2j (slope for gender)

where the residual terms, U_0j, U_1j, and U_2j, represent the overall school effects on student outcomes, because school-level variables are not specified in this model. The variation of these residuals is of interest in determining whether there are significant differences among schools. The variances of the school-level residuals were tested with a chi-square test of variance. In addition, a researcher can test whether the regression slopes for CCAT or gender are parallel across schools. This is one of the interesting questions a researcher might have but could not ask with the ordinary analysis of covariance method. If there are significant differences in slope coefficients among schools, the HLM allows them to vary; if not, the slopes can be fixed across schools as with OLS.
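For illustration, a simplified version of such a chi-square test of variance can be computed directly (a sketch with simulated data, not the exact statistic reported by HLM software): the weighted squared deviations of the school means from the grand mean are compared against a chi-square distribution with J − 1 degrees of freedom.

```python
# A simplified chi-square homogeneity test: do school means vary more than
# sampling error alone would predict? (Illustrative sketch; values invented.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_schools = 26
n_j = rng.integers(12, 56, n_schools)             # students per school, as in the study
true_means = 75 + rng.normal(0, 2, n_schools)     # schools genuinely differ here
school_means = true_means + rng.normal(0, 8 / np.sqrt(n_j))
sampling_var = 8.0 ** 2 / n_j                     # variance of each school mean

grand = np.average(school_means, weights=1 / sampling_var)
H = np.sum((school_means - grand) ** 2 / sampling_var)
p = stats.chi2.sf(H, df=n_schools - 1)
print(H, p)   # p below .05 suggests schools differ beyond sampling error
```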
Model IIa (Two-wave Longitudinal Model without Controls)

Difference scores (gain scores or change scores) have a long history in educational research because questions in education are naturally concerned with students' gain and change over time. Until recently, the two-wave model was the design used most frequently for measuring individual change. Model IIa is almost identical to Model Ia except that the student-level outcome measures are the difference scores between grades 6 and 7 instead of grade 7 scores. Individual difference scores were partitioned into two parts, average difference scores for schools and individual random errors, under the assumption that deviations from the average school gain score represent only random fluctuation among individuals. The within-school equation is:

$Y_{67ij} = \beta_{0j} + \epsilon_{ij}$

where $Y_{67ij}$ is the difference score between grades 6 and 7 for student i (i = 1, 2, ..., $n_j$) in school j (j = 1, 2, ..., 26), and the parameter $\beta_{0j}$ denotes the school's average difference score. The second level of the model is identical to that of Model Ia:

$\beta_{0j} = \theta_{00} + U_{0j}$

The between-school equation estimates the grand mean, the average difference score for the sample. The school-level residual $U_{0j}$ then represents everything other than the sample's average difference score, which includes both systematic differences among schools (overall school effects) and random error. Again, the chi-square test of variance was used to test the variation of the school-level residuals. The HLM also provides reliability estimates (the ratio of the parameter variance to the observed variance) by separating between-school variance from within-school variance. The reliabilities of the two-wave models, compared with those of the cross-sectional and three-wave models, are of interest in this study.

Model IIb (Two-wave Longitudinal Model with Controls)

This model is similar to Model Ib in that it also controls for differences in students' background characteristics; the key difference is that the difference score is used as the student outcome. The first level of the model is:

$Y_{67ij} = \beta_{0j} + \beta_{1j}(\mathrm{CCAT})_{ij} + \beta_{2j}(\mathrm{Gender})_{ij} + \epsilon_{ij}$

where $Y_{67ij}$ is the difference score between grades 6 and 7 for student i in school j, and the slope parameters $\beta_{1j}$ and $\beta_{2j}$ represent the effects of CCAT and gender, respectively. The intercept parameter $\beta_{0j}$ represents the expected difference score between grades 6 and 7 for a typical child with a CCAT score of 100. The second-level equations are identical to those of Model Ib:

$\beta_{0j} = \theta_{00} + U_{0j}$
$\beta_{1j} = \theta_{01} + U_{1j}$
$\beta_{2j} = \theta_{02} + U_{2j}$

As in Model Ib, the residual terms $U_{0j}$, $U_{1j}$, and $U_{2j}$ represent the overall school effects on student outcomes, and their variances were tested statistically. The slope coefficients $\theta_{01}$ and $\theta_{02}$ were also tested to see whether the effects of CCAT and gender are the same across schools.
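Both the two-wave outcome and the reliability ratio are simple to compute. The sketch below uses the same hypothetical columns as before and a rough method-of-moments split of the variance; the HLM obtains these quantities by maximum likelihood, so the numbers would differ somewhat.

```python
# Sketch of the two-wave difference-score outcome and of the reliability
# ratio (parameter variance / observed variance). Column names are
# hypothetical; this moment-based split only approximates the HLM estimates.
import numpy as np

data["gain67"] = data["ctbs7"] - data["ctbs6"]   # grade 6-to-7 gain score

by_school = data.groupby("school")["gain67"]
school_means = by_school.mean()

observed_var = school_means.var(ddof=1)              # variance of school means
within_var = by_school.var(ddof=1).mean()            # pooled within-school variance
sampling_var = within_var / by_school.size().mean()  # error variance of a school mean
parameter_var = max(observed_var - sampling_var, 0.0)

reliability = parameter_var / observed_var
print(f"reliability of school mean gains: {reliability:.3f}")
```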
Model IIIa (Three-wave Longitudinal Model without Controls)

In this model, three waves of data, the simplest form of a multiwave design, were employed. Ideally, one could employ a three-level hierarchical linear model as discussed in Chapter 2: between temporal points within subjects, between subjects within schools, and between schools. Unfortunately, computer programs that can estimate three-level hierarchical linear models have only recently been fully developed. An alternative is to use OLS to estimate the parameters of the first level; the usual two-level hierarchical linear model can then be applied to the two higher levels, using the growth parameters obtained by OLS as outcomes (Willms & Jacobsen, 1990). The potential weakness of this method is the precision of the individual growth parameters. The growth estimates here, however, were based on multiple time points, and as more time points are added, the precision of growth parameters estimated by OLS approaches that of HLM estimates. I compared the estimates of growth rate using three data points with those using four or more data points whenever they were available, and found the estimates quite stable. I therefore expect the results not to differ significantly from those of a three-level hierarchical model.

The first level of the model can be described as:

$Y_{ijt} = \pi_{ij0} + \pi_{ij1}(\mathrm{Time})_{it} + u_{ijt}$

where $Y_{ijt}$ is a growth measure for student i in school j at time t (t = 1, 2, 3). The variable $(\mathrm{Time})_{it}$ denotes the number of months of schooling that student i in school j had received before testing occasion t. In this study, $(\mathrm{Time})_{it}$ was set to zero at the second testing occasion (grade 6), since with OLS regression the estimates of expected values near the center of the data are more reliable than those at the extremes. The intercept parameter $\pi_{ij0}$ represents each student's status at the end of grade 6, which in this study is the GE score attained after 69 months of schooling (including kindergarten). The slope parameter $\pi_{ij1}$ represents the individual rate of growth over the three years. Both parameters can be used as outcome measures at the second level. This study, however, did not employ the grade 6 status as a second-level outcome, because the relationships between grade 6 status and the background variables were not of practical interest here. Using the rates of growth also allows a comparison with the results from the previous two-wave models.

The second (between-subject) level of the model can be described as:

$\pi_{ij1} = \beta_{0j} + \epsilon_{ij}$

where $\beta_{0j}$ denotes the average rate of growth for school j and $\epsilon_{ij}$ represents the random fluctuation among students. The third (between-school) level of the model becomes:

$\beta_{0j} = \theta_{00} + U_{0j}$

where $\theta_{00}$ denotes the grand mean, the average rate of growth for the sample. The between-school residual $U_{0j}$ includes the overall school effects as well as random fluctuation among schools. A hierarchical linear model allows one to partition the error variance into three parts (within subject, within school, and between school) and thereby provides more precise estimates of the growth parameters.

Model IIIb (Three-wave Longitudinal Model with Controls)

The first level of the model is identical to that of Model IIIa and estimates the grade 6 status and the rate of growth:

$Y_{ijt} = \pi_{ij0} + \pi_{ij1}(\mathrm{Time})_{it} + u_{ijt}$

The second level of the model investigates the relationships between the individual rate of growth and the background variables:

$\pi_{ij1} = \beta_{0j} + \beta_{1j}(\mathrm{CCAT})_{ij} + \beta_{2j}(\mathrm{Gender})_{ij} + \epsilon_{ij}$

where $\beta_{0j}$ represents the average rate of growth for a typical child, and $\beta_{1j}$ and $\beta_{2j}$ represent the slope parameters for CCAT and gender, respectively. Both the intercept and the slope parameters of the second level become school-level outcomes at the next level:

$\beta_{0j} = \theta_{00} + U_{0j}$
$\beta_{1j} = \theta_{10} + U_{1j}$
$\beta_{2j} = \theta_{20} + U_{2j}$

The school-level residual terms $U_{0j}$, $U_{1j}$, and $U_{2j}$ are of primary interest for examining between-school differences in average rates of growth. The slope parameters $\theta_{10}$ and $\theta_{20}$ were examined to see whether the regression slopes of the two background variables were parallel across schools.

Estimation of Bias

The difference between the estimate of a school's mean performance under an unadjusted model and the corresponding estimate under an adjusted model was taken as the bias between the two models:

$\mathrm{Bias}_j = S^{*}_{j} - S_{j}$

where $S^{*}_{j}$ denotes the estimate of school j's average performance from the unadjusted model and $S_{j}$ the estimate from the adjusted model. The mean absolute bias between the two models was then used to indicate the extent of the overall bias of the unadjusted model against the adjusted model:

$\text{Mean absolute bias} = \sum_{j=1}^{26} |S^{*}_{j} - S_{j}| / 26$

I examined the extent of bias of Models Ia, IIa, and IIIa against Models Ib, IIb, and IIIb, respectively. Furthermore, the biases of the two-wave models (Models IIa and IIb) against the multiwave model (Model IIIb) were also of interest.
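The first-level OLS step of Model IIIa and the bias summary above are both a few lines of code. A sketch, under two stated assumptions: scores are in the months-of-schooling metric, and the three testing occasions are spaced roughly ten months apart and centered at grade 6.

```python
# Sketch of the per-student OLS step of Model IIIa and of the mean absolute
# bias summary. Assumes GE scores in the months-of-schooling metric and
# testing occasions about ten months apart; both are illustrative assumptions.
import numpy as np

TIME = np.array([-10.0, 0.0, 10.0])   # months of schooling, centered at grade 6

def growth_parameters(scores):
    """OLS fit of score = pi0 + pi1 * time for one student.

    `scores` holds the grade 5, 6, and 7 GE scores, in that order.
    Returns (pi0, pi1): grade 6 status and growth per month of schooling."""
    pi1, pi0 = np.polyfit(TIME, scores, deg=1)  # polyfit returns slope first
    return pi0, pi1

def mean_absolute_bias(s_unadjusted, s_adjusted):
    """Mean absolute bias of unadjusted school estimates against adjusted ones."""
    s_star, s = np.asarray(s_unadjusted), np.asarray(s_adjusted)
    return np.mean(np.abs(s_star - s))

# A student scoring 59, 69, 79 grows exactly one month per month of schooling.
print(growth_parameters(np.array([59.0, 69.0, 79.0])))   # -> (69.0, 1.0)
```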
4. Results

This chapter reports the findings of the study. It compares the estimates of school effects based on the six models (two cross-sectional, two two-wave, and two multiwave) across the three CTBS subtests (mathematics, reading comprehension, and vocabulary). Tables Ia through IIIb show the HLM results for each model. The chapter also examines the between-school differences in average grade 7 status and in average rates of growth, the effects of the two control variables (CCAT and gender), and the reliability estimates of school means. Table IV shows whether the regression slopes of the two control variables are homogeneous across schools. Table V describes the extent of bias between adjusted and unadjusted models across the three subject areas.

School Differences in Grade 7 Status

Tables Ia and Ib summarize the HLM results of the cross-sectional model without and with controls for student intake (Model Ia vs. Model Ib) for mathematics, reading, and vocabulary.

Table Ia
HLM Results for the Cross-sectional Model without Controls (Model Ia)

Fixed effects          Mathematics         Reading             Vocabulary
                       Effect (S.E.)       Effect (S.E.)       Effect (S.E.)
Constant               80.761** (.777)     77.514** (.708)     78.224** (.645)

Random effects         Estimate (χ²(25))   Estimate (χ²(25))   Estimate (χ²(25))
Observed variance      15.979              13.425              11.211
Parameter variance     10.610** (78.693)   7.559** (60.119)    5.673** (53.921)
Reliability            .664                .563                .506

Notes: * p<.05  ** p<.01

Table Ib
HLM Results for the Cross-sectional Model with Controls (Model Ib)

Fixed effects          Mathematics         Reading             Vocabulary
                       Effect (S.E.)       Effect (S.E.)       Effect (S.E.)
Constant               77.809** (.641)     74.639** (.490)     75.670** (.413)
CCAT                   .588* (.024)        .573* (.026)        .529** (.026)
Gender                 -.095 (.605)        .382 (.671)         -2.414** (.674)

Random effects         Estimate (χ²(25))   Estimate (χ²(25))   Estimate (χ²(25))
Observed variance      10.401              6.041               4.363
Parameter variance     7.640** (275.83)    2.622** (195.93)    .890** (156.14)
Reliability            .735                .434                .204

Notes: * p<.05  ** p<.01

The first rows of Tables Ia and Ib show the average estimates of grade 7 status for the three subtests. The unadjusted estimates of grade 7 status (Table Ia) were close to or above the expected score of 79 in the months-of-schooling metric across all three subtests. The adjusted estimates (Table Ib), however, were below the expected score for all three subtests. The adjusted estimates were lower than the unadjusted estimates because the average CCAT of this sample (105.336) was higher than the national norm of 100. The constant represents the expected score for a 'typical' student in months of schooling. When we control for the higher CCAT of this sample, a 'typical' student had GE scores about one month below the expected score in mathematics (77.8 compared with 79), four months below in reading comprehension (74.6 compared with 79), and three months below in vocabulary (75.7 compared with 79).

The next two rows of Table Ib show the effects of the two background variables in Model Ib. The level of prior ability (CCAT) was significantly related to grade 7 status across all three subtests, and the effect sizes of CCAT on the three outcomes were almost identical.
Gender differences were negligible at this grade in both mathematics and reading; on the vocabulary subtest, however, males outperformed females by over two months of schooling. The results for mathematics are consistent with previous research on gender differences. Reviewing studies of mathematics achievement, Fennema (1980) reported that no significant differences appeared consistently between males and females during the elementary years of schooling. Martin and Hoover (1987) likewise reported no gender difference in ITBS mathematics composite scores from grades 3 to 8, although females did somewhat better in computation and males showed a better understanding of mathematical concepts. Even though gender differences are relatively small during elementary education, researchers generally agree that the gender gap in mathematics becomes larger during secondary education (Maccoby & Jacklin, 1974; Fennema, 1980; Willms & Jacobsen, 1990). Previous research on reading comprehension has shown a slight but consistent advantage for females in general reading ability (Martin & Hoover, 1987; Maccoby & Jacklin, 1974); the gender difference found in this study was small and not statistically significant.

The results for vocabulary were quite different from those for the other subtests: females had a significantly lower mean on vocabulary at the end of grade 7. Female superiority on verbal tasks at the earlier stages of development has generally been one of the most solidly established generalizations in the field of gender differences. Some research has found no consistent gender differences, but whenever a difference was found, it was usually females who obtained the higher scores (see the reviews by Maccoby and Jacklin, 1974). These differences favoring girls, however, appear to reverse during the later years of primary schooling. Martin and Hoover (1987) reported results very similar to those of this study. Instead of the consistent advantage for males or females found on the other subtests, they found a cross-over in the results for vocabulary: in grades 3 and 4, females had a slightly higher mean than males, but in the later grades males had the significantly higher mean, and the improvement in performance by males was consistent.

The bottom halves of the tables show the estimates of observed variance and parameter variance and the reliability of the estimates of school means. The parameter variance represents the variation among schools with the variance due to measurement and sampling errors removed; it is analogous to the true-score variance of classical test theory. The chi-square tests showed that the parameter variance was statistically significant (p<.01) across all subtests. In other words, there were significant differences between schools in their grade 7 status, even after controlling for the effects of students' background variables.

The last rows of the tables show the reliability coefficients of the intercept parameters. The primary purpose of this study was to examine between-school differences in average performance (intercepts) rather than between-school differences in structural relationships (slopes); therefore, only the reliability coefficients of the intercept parameters were examined here. The between-school differences were most reliably estimated for mathematics and least reliably for vocabulary: .664 and .735 for mathematics, .563 and .434 for reading, and .506 and .204 for vocabulary, without and with controls, respectively.
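The chi-square test of variance used throughout these tables is, in the Bryk and Raudenbush formulation, a homogeneity test that compares each school's estimate with a precision-weighted grand mean. A sketch under that assumption, with hypothetical inputs:

```python
# Sketch of a chi-square homogeneity test for between-school variance, in the
# spirit of the test reported by HLM software (Bryk & Raudenbush):
#   H = sum_j (b_j - grand)^2 / V_j, df = J - 1, under H0: parameter variance = 0
# where b_j is school j's estimate and V_j its sampling variance
# (both hypothetical inputs here).
import numpy as np
from scipy.stats import chi2

def variance_homogeneity_test(b, V):
    b, V = np.asarray(b, float), np.asarray(V, float)
    grand = np.sum(b / V) / np.sum(1.0 / V)   # precision-weighted grand mean
    H = np.sum((b - grand) ** 2 / V)          # chi-square statistic
    p_value = chi2.sf(H, df=len(b) - 1)       # upper-tail p-value
    return H, p_value
```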
School Effects on the Rate of Growth

The second research question (see Chapter 1) concerns students' rates of growth between grades 5 and 7 rather than their grade 7 status. The first rows of Tables IIa and IIb present the estimates of the average rate of growth during the last two years of elementary schooling, without and with controls for student background, for the three subtests.

Table IIa
HLM Results for the Two-wave Model without Controls (Model IIa)

Fixed effects          Mathematics         Reading             Vocabulary
                       Effect (S.E.)       Effect (S.E.)       Effect (S.E.)
Constant               8.516** (.387)      8.064** (.374)      6.514** (.315)

Random effects         Estimate (χ²(25))   Estimate (χ²(25))   Estimate (χ²(25))
Observed variance      3.997               4.001               2.852
Parameter variance     2.230** (63.558)    .952 (34.102)       .636 (34.072)
Reliability            .558                .238                .223

Notes: * p<.05  ** p<.01

Table IIb
HLM Results for the Two-wave Model with Controls (Model IIb)

Fixed effects          Mathematics         Reading             Vocabulary
                       Effect (S.E.)       Effect (S.E.)       Effect (S.E.)
Constant               8.519** (.398)      7.758** (.394)      6.312** (.337)
CCAT                   .070* (.019)        .063* (.025)        .047* (.021)
Gender                 .132 (.478)         -.952 (.629)        -1.408* (.534)

Random effects         Estimate (χ²(25))   Estimate (χ²(25))   Estimate (χ²(25))
Observed variance      3.997               3.943               2.897
Parameter variance     2.264** (65.492)    .934 (35.390)       .706 (34.900)
Reliability            .567                .236                .245

Notes: * p<.05  ** p<.01

Both the unadjusted and the adjusted two-wave models (Models IIa and IIb) suggested that the average rate of growth of this sample was below expectation, that is, less than ten months of schooling. The average rate of growth in vocabulary was approximately three and a half months below the norm for both models (6.5 and 6.3, respectively, compared with 10). The CCAT was significantly related to the rates of growth across the three subtests. Gender differences were not significant in mathematics and reading, but males outgrew females by about one and a half months of schooling on the vocabulary subtest. The effects of adjustment were not as large for the two-wave models as for the cross-sectional models. The chi-square tests suggested that the between-school differences in rates of growth between grades 6 and 7 were statistically significant for mathematics but not for reading or vocabulary. Because of the large measurement errors associated with two-wave data, the reliabilities of the estimates of average gain were lower than those of the cross-sectional models: .56 for mathematics, .24 for reading, and .25 for vocabulary.

Tables IIIa and IIIb present the HLM results of the multiwave models without and with controls (Model IIIa vs. Model IIIb). The first rows show the estimates of the average rate of growth between grades 5 and 7. For this particular sample, students seem to be levelling off in vocabulary and reading: the growth rate between grades 5 and 7 was much less than ten (8.9 and 9.3, respectively), whereas in mathematics it was close to ten (9.9).
Table IIIa
HLM Results for the Multiwave Model without Controls (Model IIIa)

Fixed effects          Mathematics         Reading             Vocabulary
                       Effect (S.E.)       Effect (S.E.)       Effect (S.E.)
Constant               9.896** (.306)      9.374** (.217)      8.885** (.171)

Random effects         Estimate (χ²(25))   Estimate (χ²(25))   Estimate (χ²(25))
Observed variance      2.459               1.296               .862
Parameter variance     1.874** (116.97)    .552** (46.654)     .121 (30.649)
Reliability            .762                .426                .140

Notes: * p<.05  ** p<.01

Table IIIb
HLM Results for the Multiwave Model with Controls (Model IIIb)

Fixed effects          Mathematics         Reading             Vocabulary
                       Effect (S.E.)       Effect (S.E.)       Effect (S.E.)
Constant               9.845** (.310)      9.295** (.227)      8.923** (.182)
CCAT                   .011 (.011)         .016 (.012)         -.002 (.012)
Gender                 -.173 (.279)        -.166 (.313)        -1.108* (.309)

Random effects         Estimate (χ²(25))   Estimate (χ²(25))   Estimate (χ²(25))
Observed variance      2.448               1.311               .853
Parameter variance     1.866** (117.26)    .562** (47.315)     .121 (31.225)
Reliability            .761                .431                .142

Notes: * p<.05  ** p<.01

One of the most interesting findings from this analysis was that the effects of the student background variables were no longer significant when growth rates based on multiwave data were employed as the outcome measure. Even the CCAT did not have significant effects on students' rates of growth across the three subtests. Considering the highly significant effects of CCAT in the cross-sectional models, this finding has practical implications: even when data on students' background characteristics are not available, a growth measure based on multiwave data can be a better indicator of school performance, without controls for student intake. 'Selection bias', which arises when not all the relevant variables are specified in the model, is not as serious a problem for a multiwave longitudinal model as it is for a cross-sectional model.

The significant gender difference in vocabulary growth rate (Table IIIb) was nevertheless noteworthy (p<.05): the female growth rate in vocabulary was significantly smaller than the male growth rate during the last three years of elementary schooling. Since boys tend to catch up in vocabulary during the later grades of elementary schooling, the average growth rate of boys between grades 5 and 7 may well exceed that of girls. One implication is that a non-linear model should be considered when modelling vocabulary growth between grades 3 and 7.

The test results for the parameter variances among schools are shown in the bottom portions of the tables. As we move toward the longitudinal models, the parameter variance becomes much smaller, because growth measures indirectly control for differences in prior achievement, that is, the grade 5 and 6 scores. For mathematics and reading, there were significant differences among schools in their effects on growth rates between grades 5 and 7, and the parameter variance for mathematics was much larger than that for reading. This suggests that mathematics is more closely tied to school instruction than is reading. The parameter variance among schools for vocabulary was the smallest and was not statistically significant at the .05 level. This is consistent with what one would expect given home influences on students' vocabulary: vocabulary is not as directly related to instruction at school as the other skills are, depending more on one's language experiences at home.
It is possible that schools do not differ significantly in their effects on students' vocabulary over a period of three years, even though the richness of language experiences in a school program may affect students' vocabulary skills over a longer period.

Another interesting finding concerns the reliability of the average growth measure based on the multiwave model. The reliability of the growth model (Model IIIb) was as high as that of the cross-sectional model (Model Ib): .761 compared with .735 for mathematics, .431 compared with .434 for reading, and .142 compared with .204 for vocabulary. The between-school differences in rates of growth based on the multiwave model (Model IIIb) were also more reliably estimated than those based on the two-wave model (Model IIb) for mathematics and reading: .761 compared with .567 for mathematics, and .431 compared with .236 for reading.

Differences between the Two-wave Model and the Multiwave Model

The two-wave model consistently underestimated the average rate of growth across the three subtests: 8.519 compared with 9.845 for mathematics, 7.758 compared with 9.295 for reading, and 6.312 compared with 8.923 for vocabulary (Tables IIb and IIIb). Moreover, the standard errors of these estimates were much larger for the two-wave model than for the multiwave model: .398 compared with .310 for mathematics, .394 compared with .227 for reading, and .337 compared with .182 for vocabulary. In other words, the average growth rate was less precisely estimated, and negatively biased, under the two-wave model. The effects of CCAT were weak for the multiwave model compared with the two-wave model. The gender difference in vocabulary was statistically significant at the .05 level; that is, males grew faster than females between grades 5 and 7 on the vocabulary subtest.

Compared with mathematics and reading, the reliability of the growth parameter estimates for vocabulary was disappointingly low (.142). For mathematics, the between-school differences in rates of growth were reliably estimated by the multiwave model, indeed more reliably than the between-school differences in grade 7 status were estimated by the cross-sectional model. For vocabulary, however, the reliability of the estimated between-school differences in growth rates was low, even lower than that based on the two-wave model. One of the factors that affects reliability is the variability among schools. For vocabulary, the growth rate did not vary greatly among schools after the background variables were controlled; schools do not have as direct an effect on students' vocabulary as they do on the other two outcomes. The small variability among schools in vocabulary growth rates may explain why the between-school differences in vocabulary growth were the most unreliably estimated.

Homogeneity of Regression Slopes

The analysis also determined whether the effects of the background variables on student outcomes varied significantly across schools. For example, some schools might be particularly effective for males but not for females, or vice versa, or some schools might be more effective for students with higher cognitive ability. This is one of the questions that can be addressed with an HLM analysis. Therefore, in a preliminary HLM analysis, all the slopes were allowed to be random. Nine models were examined: Models Ib, IIb, and IIIb for each of the three subtests.
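In a modern mixed-model package, allowing the covariate slopes to vary across schools, as in this preliminary analysis, is a small change to the earlier sketch; the column names remain hypothetical.

```python
# Preliminary analysis: let the CCAT and gender slopes vary across schools.
# re_formula adds random slopes for the listed covariates alongside the
# random intercept; omitting it fixes the slopes, as in the main analyses.
import statsmodels.formula.api as smf  # `data` as prepared in earlier sketches

random_slopes = smf.mixedlm(
    "ctbs7 ~ ccat_c + female",
    data,
    groups=data["school"],
    re_formula="~ ccat_c + female",
).fit()

# Variances of the school-level intercept and slope residuals (U_0j, U_1j,
# U_2j); if the slope variances are negligible, fix the slopes across schools.
print(random_slopes.cov_re)
```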
Table IV shows that the variances among schools in the effects of prior ability and gender were not statistically significant at the .01 level. It was not possible to say that some schools were more effective for males than for females, or for students with higher ability than for students with lower ability. Therefore, all the slopes were fixed to be parallel in the main analyses of this study. When slopes are fixed, the intercepts can be estimated more reliably.

Table IV
Homogeneity of Regression Slopes

Subtest       Model        Slope    Parameter variance   Chi-square   p
Mathematics   Model Ib     CCAT     .00713               40.647       .025
                           Gender   1.56930              23.104       >.5
              Model IIb    CCAT     .00227               28.943       .266
                           Gender   4.06588              42.960       .014
              Model IIIb   CCAT     .00092               32.037       .157
                           Gender   .38730               20.286       >.5
Reading       Model Ib     CCAT     .00102               22.872       >.5
                           Gender   1.79341              28.121       .302
              Model IIb    CCAT     .00375               33.039       .130
                           Gender   1.33986              19.867       >.5
              Model IIIb   CCAT     .00077               24.225       >.5
                           Gender   .06767               12.215       >.5
Vocabulary    Model Ib     CCAT     .00189               16.270       >.5
                           Gender   6.37617              36.098       .07
              Model IIb    CCAT     .00664               35.834       .074
                           Gender   1.76095              24.370       >.5
              Model IIIb   CCAT     .00108               25.765       .420
                           Gender   .50950               25.991       .408

Bias of Models

Table V
Extent of Bias

                         Mathematics   Reading    Vocabulary
Model Ia vs. Ib
  Min bias               -.2391        -.0624     -.6271
  Max bias               6.5342        6.6879     6.6692
  Mean absolute bias     2.9713        2.8802     2.6020
Model IIa vs. IIb
  Min bias               .0510         .1961      .0999
  Max bias               .7557         .4590      .3099
  Mean absolute bias     .3566         .3059      .2025
Model IIIa vs. IIIb
  Min bias               -.0144        .0259      -.0740
  Max bias               .1456         .1434      -.0051
  Mean absolute bias     .0531         .0783      .0374
Model IIa vs. IIIb
  Min bias               -4.3068       -2.9839    -3.2154
  Max bias               1.2205        -.0968     -1.9182
  Mean absolute bias     1.5248        1.2314     2.4087
Model IIb vs. IIIb
  Min bias               -4.5459       -3.2492    -3.4617
  Max bias               .8182         -.4675     -2.0573
  Mean absolute bias     1.7608        1.5373     2.6112

The first part of Table V shows the extent of bias of the unadjusted cross-sectional model (Model Ia) against the adjusted cross-sectional model (Model Ib) across the three subtests. The bias ranged from -.2391 to 6.5342 for mathematics, from -.0624 to 6.6879 for reading, and from -.6271 to 6.6692 for vocabulary. The mean absolute bias ranged from 2.6020 (vocabulary) to 2.9713 (mathematics). Estimates of school mean performance based on unadjusted models therefore differ from those based on adjusted models by about three months of schooling on average. Moreover, estimates based on an unadjusted model are biased in favor of schools with higher-ability students. For example, the school with the highest mean CCAT (115.509) had the highest unadjusted means across all outcomes (91.947 for mathematics, 88.000 for reading, and 88.632 for vocabulary). When differences in students' background characteristics were controlled, however, a school with a lower mean CCAT (109.231) had higher adjusted means, even though its simple means were much lower than those of the school with the highest mean CCAT (79.423 for mathematics, 77.346 for reading, and 79.115 for vocabulary). Mean differences among schools based on the unadjusted cross-sectional model are therefore not appropriate for comparing schools: a simple comparison of school means will falsely suggest that schools with advantaged students do better than those with less advantaged students.

The second part of Table V shows the extent of bias of the unadjusted two-wave model (Model IIa) against the adjusted two-wave model (Model IIb).
The bias ranged from .0510 to .7557 for mathematics, from .1961 to .4590 for reading, and from .0999 to .3099 for vocabulary. The mean absolute bias ranged from .2025 for vocabulary to .3566 for mathematics, that is, less than half of one month of schooling. The bias was thus far smaller than that of the unadjusted cross-sectional model; the bias between the unadjusted and adjusted two-wave models was not nearly as consequential as the bias between the two cross-sectional models.

The third part of Table V shows the extent of bias of the unadjusted multiwave model (Model IIIa) against the adjusted multiwave model (Model IIIb). The bias ranged from -.0144 to .1456 for mathematics, from .0259 to .1434 for reading, and from -.0740 to -.0051 for vocabulary. The mean absolute bias was the smallest among the three designs (.0531 for mathematics, .0783 for reading, and .0374 for vocabulary). This supports the earlier result that the effects of students' background characteristics were not significant for the multiwave models, and it suggests that one could use the rate of growth as a performance indicator of school quality and obtain fairly accurate estimates of school effects without controlling for students' background characteristics.

The last parts of Table V examine the extent of bias among the four longitudinal models, with the adjusted multiwave model (Model IIIb) taken as the best of the four. The bias ranged from -4.5459 to 1.2205 for mathematics, from -3.2492 to -.0968 for reading, and from -3.4617 to -1.9182 for vocabulary. The mean absolute bias of the unadjusted two-wave model (Model IIa) against Model IIIb ranged from 1.2314 to 2.4087, and that of the adjusted two-wave model (Model IIb) against Model IIIb from 1.5373 to 2.6112. These differences were substantial compared with the bias within the two-wave design or within the multiwave design. The two-wave models thus provide markedly biased estimates of students' growth compared with the multiwave models; the two-wave model is not an appropriate design for estimating students' rates of growth.

5. Summary and Conclusion

This study examined several statistical models designed to detect differences among schools in their effects on student outcomes. Almost all research on school effectiveness has been concerned with school effects on students' average performance, or on the variability of performance, at one particular time; that is, studies have been cross-sectional rather than longitudinal. Even though the goal is to assess student learning, which implies change and growth (Willet, 1988), most empirical methods of measuring change have been considered inappropriate and invalid (Harris, 1963; Cronbach & Furby, 1970). As a result, questions that should be addressed within a longitudinal framework have been analyzed with cross-sectional designs. This study supports the more recent longitudinal approach to the measurement of change. It holds not only that measuring change is important, but also that the empirical methods of measuring change can be improved by collecting additional waves of data. The discussion of measuring change has been unjustifiably restricted to the two-wave design, and the two-wave model is simply not a very good design for measuring individual growth; better estimates of individual growth can be obtained with multiwave data.
The primary purpose of this study, therefore, was to examine how much the multiwave models could improve the estimates of individual growth over the two-wave models. To accommodate the hierarchical character of educational processes and to obtain more precise estimates of growth, the hierarchical linear regression model (HLM) was employed.

The major findings of this study are as follows:

(1) Across the three subtests of the CTBS, schools differed significantly in their average levels of academic achievement at the end of grade 7, after controlling for students' prior ability and gender. The effects of students' prior ability were significant across all outcomes, but gender effects were not significant for mathematics or reading comprehension.

(2) There were significant differences among schools in their average rates of growth in academic achievement between grades 5 and 7 for mathematics and reading comprehension. For vocabulary, however, the between-school differences in average rates of growth were not significant. Prior ability and gender were not significantly related to growth rates in the multiwave models for any of the three subtests.

(3) The average bias of the unadjusted cross-sectional model against the adjusted cross-sectional model was substantial, about three months of schooling, whereas the average bias of the unadjusted multiwave model against the adjusted multiwave model was less than one-tenth of one month of schooling. The bias of the two-wave models against the multiwave models was also substantial.

(4) The estimates of school effects based on multiwave models were more reliable than those based on two-wave models. Although more reliable, they were still not highly reliable (.76 for mathematics, .43 for reading, and .14 for vocabulary). Within-school sample size, the reliability of the tests, and the variability among schools are some of the factors that affect reliability.

These findings lead to the following conclusions and practical implications.

(1) For cross-sectional analysis, a simple comparison of school means without controls for students' intake characteristics is not appropriate for a fair and adequate evaluation of school effects. The bias of the unadjusted model against the adjusted model was quite large: if one does not control for differences among schools in their students' background characteristics, the estimates of school effects will be biased against less advantaged schools. It may be equally important to control for relevant school-level variables, even though this study was not able to include them; unless all the relevant variables at both the student and school levels are specified, the estimates will be biased, especially for the cross-sectional models.

(2) The two-wave model did not estimate students' growth rates reliably, because of large measurement errors. The reliability estimates of adjusted differences between schools based on the two-wave model were lower than those based on the cross-sectional model. Nevertheless, this does not imply that the growth score is intrinsically unreliable. The reliability estimates based on multiwave models were much higher than those based on two-wave models; they were even higher than those based on cross-sectional models when there were considerable differences among schools in their true rates of growth, as was the case for mathematics and reading.
By reducing the measurement errors involved in the difference score, the multiwave model can estimate individual rates of growth more precisely and reliably. Since the low reliability of the difference score has been one of the key problems in measuring change, it was instructive to see how much difference in reliability a single additional data point can make. The precision of the estimates would increase further with additional waves of data.

(3) Students' background variables were not significantly related to rates of growth based on the multiwave model for any of the three outcomes. This suggests that one could use rates of growth as a performance indicator of school quality and obtain fairly accurate estimates of school effects without controlling for students' background characteristics. This is one of the conveniences of using multiwave models instead of cross-sectional models: it is practically easier for an evaluator to collect additional waves of data than to collect all the relevant background information related to selection into schools.

(4) The variance among schools in their effects was largest for mathematics and smallest for vocabulary; that is, schools vary most in their mathematics performance and least in their vocabulary performance. This is consistent with what one would expect given family influences on vocabulary. Since vocabulary is related more to students' home background than to specific school practices, the average rate of growth would not vary much among schools after controlling for students' background variables. Because schools varied little in their adjusted growth rates in vocabulary, the reliability estimates were low.

Because students' learning and growth is the primary goal of education, and because growth can be reliably estimated for some subject areas, those estimating between-school differences are advised to use rates of growth based on multiwave data as performance indicators instead of the commonly used adjusted school differences based on cross-sectional models.

This study examined the conceptual advantages of a multiwave model over cross-sectional and two-wave models, and the empirical results were consistent with expectations. Several points that were not addressed in this study deserve consideration. First, the study did not include a wide range of student background variables, such as SES, or any school-level variables describing specific school policies and practices. To make the comparisons as clear as possible, it employed the simplest form of between-school differences. One of the interesting results was that rates of growth based on a multiwave model can estimate school effects fairly accurately without controls for student background; a study that includes more student-level variables, especially SES, would be of interest. Moreover, this study estimated only Type A effects. A useful extension would be to estimate Type B effects as well; that is, a model including school-level variables describing school policy and context would help explain why schools vary.

Second, this study did not include the basic analysis-of-covariance model in which grade 6 scores are used as covariates to adjust grade 7 scores. Because of the low reliability of the difference score between two data points, researchers have preferred the ANCOVA model to the two-wave model.
The parameters of the ANCOVA model, however, could not be estimated with the HLM here because of the multicollinearity between grade 6 and grade 7 scores. A possible way around this problem is to center the grade 6 scores around each school mean, but that analysis was beyond the scope of this study. A comparison between the ANCOVA model and the multiwave model would be another interesting topic for future research.

Third, this study employed the simplest form of multiwave model, a three-wave model, and examined only the effects of schooling on students' rates of growth between grades 5 and 7. It can therefore show only the short-term effects of schooling on student outcomes. A future study with four or five waves of data could capture school effects based on the longer-term effects of schooling.

References

Aitkin, M. A., Anderson, D., & Hinde, J. (1981). Statistical modeling of data on teaching styles. Journal of the Royal Statistical Society A, 144, 419-461.

Aitkin, M. A., & Longford, N. T. (1986). Statistical modelling issues in school effectiveness studies. Journal of the Royal Statistical Society A, 149, 1-43.

Alexander, K. L., & Pallas, A. M. (1983). Private schools and public policy: New evidence on cognitive achievement in public and private schools. Sociology of Education, 56, 170-182.

Alexander, K. L., Cook, M., & McDill, E. L. (1978). Curriculum tracking and educational stratification. American Sociological Review, 43(3), 47-66.

Barr, R., & Dreeben, R. (1983). How schools work. Chicago: University of Chicago Press.

Bereiter, C. (1963). Some persisting dilemmas in the measurement of change. In C. W. Harris (Ed.), Problems in measuring change. Madison, Wisconsin: University of Wisconsin Press.

Bridge, R. G., Judd, C. M., & Moock, P. R. (1979). The determinants of educational outcomes. Cambridge, MA: Ballinger.

Bryk, A. S. (1980). Analyzing data from premeasure/postmeasure designs. In S. Anderson, A. Auquier, W. W. Hauck, D. Oakes, W. Vandaele, & H. I. Weisberg, Statistical methods for comparative studies. New York: Wiley.

Bryk, A. S., & Raudenbush, S. W. (1987). Application of hierarchical linear models to assessing change. Psychological Bulletin, 101(1), 147-158.

Bryk, A. S., & Raudenbush, S. W. (1988). Toward a more appropriate conceptualization of research on school effects: A three-level hierarchical linear model. American Journal of Education, 97(1), 65-108.

Burstein, L. (1980). The analysis of multilevel data in educational research and evaluation. In D. C. Berliner (Ed.), Review of Research in Education (pp. 158-231). Washington, DC: American Educational Research Association.

Burstein, L., Miller, M. D., & Linn, R. L. (1979). The use of within-group slopes as indices of group outcomes (CSE Report Series). Los Angeles: Center for the Study of Evaluation, University of California, Los Angeles.

Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.

Coleman, J. S., Hoffer, T., & Kilgore, S. (1981). Public and private schools. Report to the National Center for Educational Statistics. Chicago: National Opinion Research Center.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Chicago: Rand McNally.

Cronbach, L. J., & Furby, L. (1970). How should we measure "change" - or should we?
Psychological Bulletin, 74, 68-80.

Dyer, H. S. (1970). Toward objective criteria of professional accountability in the schools of New York City. Phi Delta Kappan, 52, 206-211.

Dyer, H. S., Linn, R. L., & Patton, M. J. (1969). A comparison of four methods of obtaining discrepancy measures based on observed and predicted school system means on achievement tests. American Educational Research Journal, 6, 591-605.

Fennema, E. (1980). Sex-related differences in mathematics achievement: Where and why. In L. H. Fox, L. Brody, & D. Tobin (Eds.), Women and the mathematical mystique. Baltimore, MD: The Johns Hopkins University Press.

Glass, G. V., & Stanley, J. C. (1970). Statistical methods in education and psychology. Englewood Cliffs, NJ: Prentice-Hall.

Goldstein, H. (1987). Multilevel models in educational and social research. New York: Oxford University Press.

Gray, J. (1988). Multilevel models: Issues and problems emerging from their recent application in British studies of school effectiveness. In D. R. Bock (Ed.), Multilevel Analyses of Educational Data (pp. 1-19). New York: Academic Press.

Haertel, E. H., James, T., & Levin, H. M. (1987). Comparing public and private schools, Vol. 2: School achievement. New York: The Falmer Press.

Haney, W. (1977). A technical history of the national Follow Through evaluation. Vol. V, The Follow Through planned variation experiment. Cambridge, MA: The Huron Institute.

Harris, C. W. (Ed.). (1963). Problems in measuring change. Madison: University of Wisconsin Press.

King, E. M. (1982). Canadian Test of Basic Skills (Teacher's Guide). Canada: Nelson Canada Limited.

Linn, R. L., & Slinde, J. A. (1977). The determination of the significance of change between pre- and post-testing periods. Review of Educational Research, 47, 121-150.

Maccoby, E. E., & Jacklin, C. N. (1974). The psychology of sex differences. Stanford, CA: Stanford University Press.

Mandeville, G. K., & Anderson, L. W. (1987). The stability of school effectiveness indices across grade levels and subject areas. Journal of Educational Measurement, 24(3), 203-216.

Marco, G. L. (1974). A comparison of selected school effectiveness measures based on longitudinal data. Journal of Educational Measurement, 11(4), 225-234.

Martin, D. J., & Hoover, H. D. (1987). Sex differences in educational achievement: A longitudinal study. Journal of Early Adolescence, 7(1), 65-83.

Murnane, R. J. (1987). Improving educational indicators and economic indicators. Educational Evaluation and Policy Analysis, 9, 101-116.

Raudenbush, S. W., & Bryk, A. S. (1986). A hierarchical model for studying school effects. Sociology of Education, 59(1), 1-17.

Raudenbush, S. W., & Bryk, A. S. (1988). Methodological advances in analyzing the effects of schools and classrooms on student learning. In E. Z. Rothkopf (Ed.), Review of Research in Education (pp. 423-475). Washington, DC: American Educational Research Association.

Raudenbush, S. W., & Willms, J. D. (1988). Sources of bias in the estimation of school effects. Edinburgh University: Centre for Educational Sociology, and University of British Columbia: Centre for Policy Studies in Education.

Rogosa, D. R., Brandt, D., & Zimowski, M. (1982). A growth curve approach to the measurement of change. Psychological Bulletin, 92, 726-748.

Rogosa, D. R., & Willet, J. B. (1983). Demonstrating the reliability of the difference score in the measurement of change. Journal of Educational Measurement, 20, 335-343.
Willet, J. B. (1988). Questions and answers in the measurement of change. In E. Z. Rothkopf (Ed.), Review of Research in Education (pp. 345-422). Washington, DC: American Educational Research Association.

Willms, J. D. (1985). Catholic-school effects on academic achievement: New evidence from the High School and Beyond follow-up study. Sociology of Education, 58, 98-114.

Willms, J. D., & Raudenbush, S. W. (1989). A longitudinal hierarchical linear model for estimating school effects and their stability. Journal of Educational Measurement, 26, 209-232.

Willms, J. D., & Jacobsen, S. (1990). Growth in mathematics skills during the intermediate years: Sex differences and school effects. International Journal of Educational Research, 14, 157-174.

Wittrock, M. C., & Wiley, D. E. (Eds.). (1970). The evaluation of instruction: Issues and problems. New York: Holt, Rinehart, & Winston.

Appendix
Demographic Description of the Sample

For each school, n is the number of students and Female the proportion of females; CCAT is the mean (SD) of general cognitive ability. The three rows of CTBS means (SD) for each school appear to correspond to grades 7, 6, and 5, respectively, consistent with the grade 7 means cited in the text.

School  n   Female  CCAT (SD)          Grade  CTBS Math (SD)    CTBS Reading (SD)  CTBS Vocab (SD)
1       19  .474    115.509 (10.678)   7      91.947 (7.982)    88.000 (8.888)     88.632 (10.377)
                                       6      84.105 (5.537)    76.947 (8.229)     80.158 (8.604)
                                       5      69.316 (9.019)    67.263 (9.291)     68.895 (8.906)
2       38  .575    109.382 (12.176)   7      82.576 (9.194)    80.342 (10.207)    82.105 (9.781)
                                       6      74.632 (8.704)    71.289 (9.392)     75.632 (8.952)
                                       5      60.868 (10.351)   60.632 (11.262)    62.316 (10.730)
3       26  .538    109.231 (13.955)   7      79.423 (13.949)   77.346 (13.532)    79.115 (9.425)
                                       6      70.577 (11.024)   68.731 (12.824)    70.962 (8.417)
                                       5      63.923 (10.677)   62.038 (10.698)    62.154 (9.922)
4       39  .462    108.739 (10.906)   7      83.026 (10.539)   79.359 (11.354)    80.103 (10.218)
                                       6      74.590 (11.255)   72.641 (9.598)     73.385 (7.308)
                                       5      62.667 (9.847)    59.846 (9.502)     63.000 (9.481)
5       32  .531    108.052 (14.618)   7      79.000 (12.297)   77.375 (12.045)    78.844 (11.274)
                                       6      69.656 (9.973)    71.594 (10.320)    74.750 (9.253)
                                       5      61.500 (11.769)   59.875 (9.366)     61.156 (10.287)
6       42  .524    107.960 (10.609)   7      82.786 (8.883)    80.095 (10.670)    81.238 (8.826)
                                       6      73.857 (7.216)    70.071 (9.506)     74.048 (8.052)
                                       5      61.905 (8.213)    62.548 (6.922)     63.738 (7.126)
7       26  .692    107.679 (13.685)   7      82.692 (9.277)    80.346 (11.070)    79.731 (11.557)
                                       6      72.115 (8.392)    71.077 (11.842)    72.423 (11.236)
                                       5      63.846 (7.588)    62.038 (9.718)     60.296 (12.321)
8       17  .647    107.422 (11.299)   7      80.412 (13.496)   77.118 (8.184)     79.353 (10.099)
                                       6      73.059 (9.430)    72.471 (8.676)     71.824 (7.519)
                                       5      60.059 (7.636)    59.706 (9.218)     62.765 (8.671)
9       38  .447    106.833 (14.873)   7      79.237 (11.790)   78.526 (11.747)    76.342 (13.581)
                                       6      71.842 (10.394)   71.184 (11.759)    70.842 (10.839)
                                       5      61.974 (10.265)   58.158 (11.890)    60.263 (10.894)
10      19  .526    106.360 (13.210)   7      78.263 (10.572)   75.211 (10.628)    76.474 (11.172)
                                       6      68.947 (10.926)   66.684 (10.149)    69.316 (11.036)
                                       5      64.158 (9.002)    61.000 (10.451)    60.632 (10.383)
11      24  .542    105.910 (12.621)   7      84.458 (11.018)   78.375 (12.272)    75.833 (14.743)
                                       6      77.167 (11.397)   71.250 (8.684)     71.833 (9.707)
                                       5      62.917 (9.908)    59.583 (10.866)    60.625 (8.821)
12      17  .647    105.216 (12.510)   7      80.294 (10.658)   76.706 (10.209)    78.118 (9.911)
                                       6      74.529 (10.248)   70.118 (10.222)    73.176 (11.293)
                                       5      59.882 (8.746)    56.824 (10.858)    59.000 (8.746)
13      16  .375    104.979 (11.818)   7      80.062 (10.109)   78.562 (10.106)    79.937 (11.958)
                                       6      72.312 (7.097)    69.312 (8.171)     69.625 (12.126)
                                       5      57.625 (8.801)    57.437 (10.282)    58.375 (7.429)
14      26  .423    104.917 (15.447)   7      74.346 (15.263)   73.269 (11.148)    79.192 (10.358)
                                       6      67.038 (12.456)   66.538 (12.238)    71.038 (9.493)
                                       5      59.615 (10.269)   58.885 (8.687)     59.000 (10.844)
15      34  .588    104.662 (16.167)   7      85.618 (9.474)    79.626 (13.660)    79.441 (11.368)
                                       6      79.029 (8.997)    74.471 (7.684)     74.412 (9.494)
                                       5      63.059 (10.245)   58.059 (13.071)    62.206 (12.715)
16      16  .438    103.615 (10.365)   7      81.812 (8.573)    78.437 (12.915)    77.937 (13.279)
                                       6      71.687 (8.822)    68.250 (7.611)     72.875 (6.292)
                                       5      61.437 (7.023)    58.062 (7.886)     60.812 (8.207)
17      12  .500    103.319 (14.344)   7      84.583 (10.613)   78.417 (9.020)     75.417 (10.630)
                                       6      71.083 (10.122)   68.750 (11.522)    69.000 (11.176)
                                       5      62.667 (10.057)   58.250 (8.476)     57.583 (12.280)
18      22  .635    102.977 (10.397)   7      80.909 (12.290)   77.409 (14.050)    78.409 (7.980)
                                       6      69.545 (12.046)   69.000 (12.024)    71.545 (9.277)
                                       5      59.864 (11.281)   57.273 (10.058)    63.273 (8.509)
19      16  .250    102.896 (15.933)   7      80.062 (9.740)    78.937 (10.096)    77.625 (10.059)
                                       6      70.375 (10.589)   69.250 (9.462)     70.750 (10.618)
                                       5      57.437 (11.314)   60.812 (12.117)    59.312 (9.707)
20      55  .455    102.730 (12.011)   7      82.400 (10.477)   78.545 (10.378)    78.800 (11.616)
                                       6      76.709 (8.425)    70.764 (9.355)     74.036 (10.331)
                                       5      60.836 (8.346)    58.673 (10.885)    60.509 (10.358)
21      36  .444    102.509 (10.118)   7      85.000 (8.380)    80.083 (8.673)     78.750 (9.646)
                                       6      72.917 (6.809)    68.778 (9.363)     70.278 (8.345)
                                       5      60.777 (8.901)    57.389 (9.986)     59.222 (9.469)
22      13  .538    101.308 (11.987)   7      75.231 (7.247)    71.231 (10.353)    75.692 (7.642)
                                       6      65.154 (10.953)   64.923 (8.311)     65.538 (7.891)
                                       5      56.000 (7.937)    55.231 (10.895)    55.923 (10.851)
23      12  .667    99.500 (10.519)    7      75.833 (7.987)    72.667 (13.845)    72.833 (10.945)
                                       6      65.417 (7.669)    63.750 (12.563)    67.583 (9.949)
                                       5      53.917 (10.945)   55.250 (8.497)     54.583 (8.775)
24      14  .500    99.405 (10.580)    7      73.929 (8.704)    72.071 (6.650)     72.929 (10.759)
                                       6      67.286 (6.696)    64.214 (6.577)     66.429 (6.734)
                                       5      55.214 (4.388)    55.429 (5.854)     55.214 (5.563)
25      19  .474    97.009 (10.261)    7      75.316 (11.076)   69.316 (9.995)     73.947 (8.714)
                                       6      67.316 (9.099)    64.947 (11.198)    68.316 (7.667)
                                       5      54.158 (7.603)    51.579 (7.904)     56.632 (8.355)
26      20  .450    96.742 (11.830)    7      76.100 (9.684)    71.450 (10.324)    70.300 (8.921)
                                       6      70.750 (8.595)    n/a (12.792)       65.600 (8.535)
                                       5      64.950 (7.783)    52.550 (13.221)    57.200 (7.811)

Note: For school 26, the grade 6 reading mean is illegible in the source; only its standard deviation (12.792) is shown.
