ON MODELLING CHANGE AND GROWTH WHEN THE MEASURES THEMSELVES CHANGE ACROSS WAVES: METHODOLOGICAL AND MEASUREMENT ISSUES AND A NOVEL NON-PARAMETRIC SOLUTION

by

JENNIFER ELIZABETH VICTORIA LLOYD

M.A., University of Victoria, 2002
B.Sc., University of Victoria, 1998

A THESIS COMPLETED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Measurement, Evaluation, and Research Methodology)

THE UNIVERSITY OF BRITISH COLUMBIA

© Jennifer Elizabeth Victoria Lloyd, 2006

Abstract

In the past 20 years, the analysis of individual change has become a key focus of research in education and the social sciences. There are several parametric methodologies that centre upon quantifying change. These varied methodologies, known as repeated measures analyses, are commonly used in three research scenarios: In Scenario 1, the exact same measure is used and re-used across waves (testing occasions). In Scenario 2, most of the measures' content changes across waves - typically commensurate with the age and experiences of the test-takers - but the measures retain one or more common items (test questions) across waves. In Scenario 3, the measures either vary completely across waves (i.e., there are no common items) or the sample being tested across waves is small or there is no norming group. Some researchers assert that repeated measures analyses should only occur if the measure itself remains unchanged across waves, arguing that it is not possible to link or connect the scores (either methodologically or conceptually) of measures whose content varies across waves.
Because it is not uncommon to face Scenarios 2 and 3 in educational and social science research settings, however, it is vital to explore more fully the problem of analysing change and growth with measures that vary across waves. To this end, the first objective of this dissertation is to weave together the (a) test linking and (b) change/growth literatures for the purpose of exploring this problem in a comprehensive manner. The second objective is to introduce a novel solution to the problem: the non-parametric hierarchical linear model (for multi-wave data) and the non-parametric difference score (for two-wave data). Two case studies that demonstrate the application of the respective solutions are presented, accompanied by a discussion of the novel solution's strengths and limitations. Also presented is a discussion about what is meant by 'change'.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
Chapter 1: Introduction
  Using Repeated Measures Analyses: Three Research Scenarios
  An Oft Overlooked, Oft Misunderstood Assumption of Repeated Measures Analyses
  What is Meant by "Same Dependent Variable"?
  What is Meant by "Equatable" Test Scores?
  The Motivating Problem: Analysing Change/Growth with Time-Variable Measures
  Objectives and Novel Contributions
  Importance of the Dissertation Topic
  Framework of the Dissertation
Chapter 2: Foundational Issues: Precisely What is Meant by 'Change'?
  Amount versus Quality of Change
  Is Constancy as Interesting as Change?
  Personal versus Stimulus Change
  Formulating a Research Design: Important Questions
  Common Interpretation Problems in Change Studies
  How Should One Conceptualise Change?
Chapter 3: Five Types of Test Linking
  Equating
    Two Equating Research Designs
    Six Types of Equating
    Two Alternative Types of Equating
  Calibration
  Statistical Moderation
  Projection
  Social Moderation
  Summary: Selecting the Appropriate Type of Test Linking Method
Chapter 4: Seven Current Strategies for Handling Time-Variable Measures
  Vertical Scaling
  Growth Scales
  Rasch Modelling
  Latent Variable or Structural Equation Modelling
  Multidimensional Scaling
  Standardising the Test Scores or Regression Results
  Converting Raw Scores to Age (or Grade) Equivalents Pre-Analysis
  Chapter Summary
Chapter 5: The Conover Solution: A Novel Non-Parametric Solution for Analysing Change/Growth with Time-Variable Measures
  Traditional Applications of the Rank Transformation
  Introducing the Conover Solution to the Motivating Problem
  How Applying the Conover Solution Changes the Research Question (Slightly)
  Within-Wave versus Across-Wave Ranking
  Establishing the Viability of the Conover Solution
  Primary Assumption of the Conover Solution: Commensurable Constructs
Chapter 6: Two Conover Solution Case Studies: The Non-Parametric HLM and the Non-Parametric Difference Score
  Choice of Statistical Software Packages
  Determining the Commensurability of Constructs
  Case 1: Non-Parametric HLM (Conover Solution for Multi-Wave Data)
    Description of the Data
    Specific Variables of Interest and Proposed Methodology
    Statistical Models and Equations
    Hypotheses Being Tested
    Explanation of the Statistical Output
  Case 2: Non-Parametric Difference Score (Conover Solution for Two-Wave Data)
    Description of the Data
    Specific Variables of Interest and Proposed Methodology
    Hypotheses Being Tested
    Explanation of the Statistical Output
  Chapter Summary
Chapter 7: Discussion and Conclusions
  Summary of the Preceding Chapters
  Strengths of the Conover Solution
  Limitations of the Conover Solution
  Suggestions for Future Research
  Conclusions
References
Appendix A: More about the Non-Parametric HLM (Case 1)
  A Brief Description of Mixed-Effect Modelling
  Fixed versus Random Effects
  Performing the Analysis using the Graphical User Interface (GUI)
  Performing the Analysis using Syntax
Appendix B: More about the Non-Parametric Difference Score (Case 2)
  Performing the Analysis using the Graphical User Interface (GUI)
  Performing the Analysis using Syntax
Appendix C: UBC Behavioural Research Ethics Board Certificate of Approval

List of Tables

Table 1: Test-Takers' Simple Difference Scores on Three Subtests, Grouped by Wave 1 Performance
Table 2: Example Conversion Table for Test X and Test Y Scores
Table 3: Kolen and Brennan's (2004) Comparison of the Similarities of Five Test Linking Methods on Four Test Facets
Table 4: Linn's (1993) Requirements of Different Techniques in Linking Distinct Assessments

List of Figures

Figure 1: A Hypothetical Test-Taker's Performance Across Three Waves of a Simulated Mathematics Assessment (Solid Line = Actual Scores, Dashed Line = Line of Best Fit)
Figure 2: Illustrating the Fit Function that Links the Scores on Two Versions of a Hypothetical 100-item Test (Modified from Kolen & Brennan, 2004). The Arrows Show that the Direction of the Linkage is Unimportant when Test Linking
Figure 3: Descriptive Statistics for Each of the Five Waves of SRDT Raw Scores Collected by Siegel
Figure 4: Histograms of the Siegel Study's Raw Scores across Five Waves: Grade 2 (Top Left), Grade 3 (Top Right), Grade 4 (Middle Left), Grade 5 (Middle Right), and Grade 6 (Bottom Left).
Note that Each of the Distributions is Skewed Negatively
Figure 5: Unconditional Model Output
Figure 6: Conditional Model Output
Figure 7: Entering the Data in SPSS (Step 1)
Figure 8: Rank Transforming the Data Within Wave in SPSS (Step 2a)
Figure 9: Rank Transforming the Data Within Wave in SPSS (Step 2b)
Figure 10: Rank Transforming the Data Within Wave in SPSS (Step 2c)
Figure 11: The New, Rank-Transformed Data Matrix (Step 2d)
Figure 12: Restructuring the Data in SPSS (Step 3a)
Figure 13: Restructuring the Data in SPSS (Step 3b)
Figure 14: Restructuring the Data in SPSS (Step 3c)
Figure 15: Restructuring the Data in SPSS (Step 3d)
Figure 16: Restructuring the Data in SPSS (Step 3e)
Figure 17: Restructuring the Data in SPSS (Step 3f)
Figure 18: The New, Restructured Data Matrix (Step 3g)
Figure 19: Entering the Data in SPSS (Step 1)
Figure 20: Rank Transforming the Data Within Wave in SPSS (Step 2a)
Figure 21: Rank Transforming the Data Within Wave in SPSS (Step 2b)
Figure 22: Rank Transforming the Data Within Wave in SPSS (Step 2c)
Figure 23: The New, Rank-Transformed Data Matrix (Step 2d)
Figure 24: Computing a Correlation Matrix (Step 4a)
Figure 25: Computing a Correlation Matrix (Step 4b)
Figure 26: The Resultant Correlation Output (Step 4c)
Figure 27: Computing the Ratio of Standard Deviations (Step 4d)
Figure 28: Computing the Ratio of Standard Deviations (Step 4e)
Figure 29: Computing the Ratio of Standard Deviations (Step 4f)
Figure 30: The Resultant Descriptive Statistics Output (Step 4g)
Figure 31: Computing the Residualised Change Score (Step 5a)
Figure 32: Computing the Residualised Change Score (Step 5b)
Figure 33: Computing the Residualised Change Score (Step 5c)
Figure 34: Computing the Residualised Change Score (Step 5d)
Figure 35: Computing the Residualised Change Score (Step 5e)
Figure 36: The Newly-Created Residualised Change Score (Step 5f)
Figure 37: Conducting the Independent Samples t-test (Step 6a)
Figure 38: Conducting the Independent Samples t-test (Step 6b)

Acknowledgements

This dissertation represents the efforts of many special people. First, I would like to thank my research supervisor, Dr. Bruno Zumbo. There are too few glowing words in my vocabulary to describe him, but I will give it a shot: kind, generous, inspiring, funny, encouraging, talented, and all heart. Thank you for serving as a beacon these past few years. I am honoured to call you my mentor and friend. Grazie! I would also like to thank Dr. Anita Hubley and Dr. Kimberly Schonert-Reichl. They are both wonderful role models, particularly for young women in academics: bright, accomplished, kind, fun, and class acts. I couldn't have imagined a better research committee. I would also like to thank Dr. Linda Siegel and the British Columbia Ministry of Education for allowing me to use their respective data sets in my dissertation. Dr. Siegel's data were collected at great personal expense to her. Her generosity is very much appreciated. I would also like to thank Edudata Canada (Dr. Victor Glickman) for not only disseminating the Ministry data to me, but also for offering me such rewarding employment during my doctoral program. Thanks also to the Human Early Learning Partnership (Dr. Clyde Hertzman) for welcoming me into such a vibrant and dynamic group of scholars. In addition, I extend my thanks to the Social Sciences and Humanities Research Council (SSHRC) of Canada for funding this research project. Their financial support has been a tremendous gift. I have been blessed with many wonderful friendships that have truly enriched my life. I thank Aeryn, Brian (Melanie), David, Debbie, James, Janine (Stephen), Nicki, Rachel, Sharon, Tanis, my church family, and especially Catrin.
Thanks also to my friends in the Measurement, Evaluation, and Research Methodology (MERM) program, and to the members of Dr. Zumbo's Edgeworth Laboratory. My family has played an integral role in my education. From a very early age, my parents encouraged me to read. They enrolled me in various sorts of lessons (piano, ballet, and even baton!?) and camps, when I am sure that the fees were sometimes more expensive than they could afford. They were always present at school events, plays, recitals, track meets, and awards ceremonies. They "encouraged" me to start working part-time from the age of 15, and these jobs taught me how to handle my money and my time. My father, Kelvin, is an unwaveringly hardworking and conscientious man, and has always shown me the importance of a good, solid work ethic. My mother, Janet, is the heart of our family, always encouraging me to climb mountains and to follow my dreams. What better influences can a little girl possibly have had growing up? I would also like to thank my brother, David. An accomplished cyclist, gymnast, and singer, Dave often reminds me about the importance of keeping balance in my life. I'm grateful to have a little brother who looks out for his big sister.

...but those who hope in the Lord will renew their strength. They will soar on wings like eagles; they will run and not grow weary, they will walk and not be faint. (Isaiah 40:31)

Dedication

For Mum, Dad, Dave, Nana, and Buster,
and for my extended family,
and for my much-loved grandfather, John Bow.
(Now, there are two doctors in the family.)

On Modelling Change and Growth When the Measures Themselves Change across Waves: Methodological and Measurement Issues and a Novel Non-Parametric Solution

Chapter 1: Introduction

In the past 20 years, the analysis of individual change has become a key focus of research in education and the social sciences.
If one thumbs through a handful of quantitative-based educational and social science research journals, conference proceedings, or grant applications, chances are good that one will spot the word "trajectories" nestled somewhere within the prose - which, loosely speaking, pertains to the amount by which individuals change, grow, mature, improve, and progress over time. Chapter 2 presents a more thorough discussion of what is meant by 'change'. Such individual change can occur naturally (e.g., an infant's learning to crawl and then to walk) or may be experimentally induced (e.g., improvement in test scores as a result of coaching).[1] In either case, "by measuring and charting changes [we] uncover the temporal nature of development" (Singer & Willett, 2003, p. 3). This temporal nature of development may be studied over diverse spans of time: hours, days, weeks, months, or even years. Measurement occasions or periods of data collection that punctuate these spans of time are generally referred to as waves.

[1] For the purpose of this dissertation, the words "measure", "form", "test", "scale", and "assessment" are used interchangeably.

Using Repeated Measures Analyses: Three Research Scenarios

There are several parametric methodologies that centre upon quantifying change. These varied methodologies are known as repeated measures analyses. Such methodologies include the paired samples t-test, the repeated measures analysis of variance (ANOVA), profile analysis, and mixed-effect modelling (individual growth modelling, hierarchical linear modelling, or simply HLM).[2] In addition to affording researchers the opportunity to study change, repeated measures analyses reduce or eliminate problems caused by characteristics that may vary from individual to individual and can influence or confound the obtained scores, such as age or gender (Gravetter & Wallnau, 2005).
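To make the simplest of these repeated measures analyses concrete, the paired samples t-test can be computed from first principles. The two waves of scores below are invented for illustration; only the Python standard library is used:

```python
import math
from statistics import mean, stdev

# Invented scores for eight students measured at two waves (testing occasions).
wave1 = [52, 61, 47, 55, 58, 63, 49, 57]
wave2 = [58, 65, 50, 60, 61, 70, 52, 62]

# Pairing each student with him- or herself removes stable between-student
# characteristics (e.g., age, gender) from the error term: the analysis is
# carried out on each student's difference score.
diffs = [b - a for a, b in zip(wave1, wave2)]
n = len(diffs)

# t = mean difference / standard error of the differences, with df = n - 1.
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
print(round(t_stat, 2))  # 8.42
```

Note that this logic presupposes the assumption discussed later in this chapter: the scores at Wave 1 and Wave 2 must be on the same (or commensurable) scale for the difference scores to be meaningful.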
In education and the social sciences, there are three distinct research scenarios in which repeated measures analyses may be used. Each of these scenarios is illustrated by means of an example:

Scenario 1: Exact same measure across waves. Imagine that a teacher wishes to investigate the amount by which her students' academic motivation changes across the span of one school year. She designs her study such that her students are assessed across three waves: the beginning of the school year, mid-way through the school year, and once again at the end of the school year. Motivation, she posits, can theoretically be measured using the exact same measure across all testing occasions, irrespective of the ever-changing age, cognitive development, and personal and scholarly experiences of her students (because she has no reason to believe that students' motivation changes commensurately with these developmental factors). As a result, she decides it is unnecessary to change the measure's wording or items (questions) across waves, and further decides that the scores collected at each wave can be interpreted in the same way (i.e., a test score of 50 at Wave 1 means the same "amount" of motivation as a score of 50 at Wave 2, and so forth).

Scenario 2: Time-variable measures that can be linked.[3] Now imagine that a professor in a university music program wishes to study progress in his concert piano students' musicality across their four-year undergraduate degree. He designs the study such that his students are assessed annually, commencing in Year 1 of the program and ending in Year 4.

[2] Mixed-effect/HLM modelling is described in more detail in Appendix A.
He conceives that musicality, irrespective of the year in which the student is assessed, represents some composite of three subscale scores: theory (i.e., the ability to recognise notes, intervals, keys, and scales), technique (i.e., rate of accuracy playing a year-specific piece), and artistry (i.e., ability to convey the appropriate emotion when playing a year-specific piece). He is aware that, as his students progress throughout the program years, their technique and artistry will improve relatively commensurately with the age, practice, and scholarly experiences of his students. Given that expert theory skills, however, are prerequisites for entry into the music program, the professor does not believe that students' theory skills should change across years: Rather, he postulates that these specific skills should remain at expert level for the duration of the program. As such, he designs the four measures such that each contains (a) year-specific technique items (i.e., items that are specific to the students' particular year of study), (b) year-specific artistry items, but (c) common theory items (i.e., test items that remain unchanged across years and whose subscale scores can purportedly be interpreted in the same way across waves) to ensure that students' theory skills are remaining at expert levels for the duration of the program. Items that remain unchanged across waves or test versions are called common items or anchor items. To this end, the items that are designed to assess students' technique (and artistry) are, by definition, worded differently, and have different response categories (and perhaps even different response formats) across the different waves of the study.

[3] For brevity, measures whose content, wording, response categories, etc., must vary across waves in repeated measures designs are referred to as 'time-variable'.
As a result, the test-level scores (and the respective subscale scores for technique and artistry) collected at the four waves cannot necessarily be interpreted in the same way, even though all of the measures are collectively thought to assess students' musicality (and their technique, artistry, and theory, in particular). As such, a total test score of 50 at Wave 1 does not necessarily mean the same "amount" of musicality as a score of 50 at Wave 2. Furthermore, a technique subscale score of 20 at Wave 1 does not necessarily mean the same "amount" of technique as a score of 20 at Wave 2, and so forth. Because there are common theory items across the four versions of the measure, however, it is possible to link (connect) the theory subscale scores across waves and to interpret those subscale scores in the same way.

Scenario 3: Time-variable measures that cannot be linked. Finally, imagine that a researcher is interested in exploring elementary students' achievement in mathematics over time. Her research design involves testing a cohort of students annually, beginning when the students are in Grade 1 and ending when the students are in Grade 7. Within the context of this example, there would be no coherent rationale for administering the exact same mathematics measure to her participants across the seven waves. The specific mathematics test that first-graders would have the cognitive ability and requisite training to complete could not, by definition, be the exact same mathematics test administered to the students when they are in later grades: The wording of the items, the complexity of the concepts presented, and the specific content domains tapped by each item must vary across the seven waves of the study (Mislevy, 1992). If not, the reliability and validity of the test scores are compromised, likely rendering the study useless (Singer & Willett, 2003).
As a result, the test-level scores collected at the seven waves cannot necessarily be interpreted in the same way (e.g., a test score of 50 at Wave 1 does not necessarily mean the same "amount" of mathematics achievement as a score of 50 at Wave 2, and so forth), even though all of the scale items are collectively thought to assess students' mathematics achievement.

Although Scenario 3 has been characterised in the previous example as the situation in which one's measures share no common items across waves, it is also possible to encounter Scenario 3 in two additional situations: when one's sample size is small or when one does not have the ability to compare the sample's scores to those of a norming group (discussed in more detail in a later chapter). In the case of small sample sizes, it is not necessarily advisable to link or connect the scores of measures, even if the measures share common items (as in Scenario 2). In part because the linking together of scores from different measures has traditionally been handled with item response theory techniques (as described more fully in a subsequent chapter), it is, in general, recommended that two or more measures' scores are only linked when one has a minimum sample size of 400 test-takers per measure (Kolen & Brennan, 2004).[4]

[4] This sample size is only a rule of thumb, and can vary depending on the method of test linking one chooses.

For the remainder of this document, the shorthand used for Scenario 3 is "time-variable measures that cannot be linked". Although this dissertation generally uses this phrase in reference to measures that are non-linkable due to a lack of common items across waves, please note this phrase also encompasses situations in which there are too few test-takers in a sample to link the measures' scores (even if the measures share common items) or situations in which one cannot refer to the scores of a norming group. In the next section,
a particular assumption of repeated measures analyses - one that has significant implications for the current dissertation's focus - is explained and discussed.

An Oft Overlooked, Oft Misunderstood Assumption of Repeated Measures Analyses

Repeated measures analyses are those in which one set of individuals is measured more than once on the same (or commensurable) dependent variable. Embedded in this definition is a special assumption that is often worded so subtly and succinctly that it fails to garner the attention of researchers - leading them either to overlook it completely or to deem it trivial. Doing so is regrettable, because the theoretical and practical implications of this particular assumption are profound. This special assumption is captured in this phrase: "the same (or commensurable) dependent variable". In many research contexts, particularly those involving repeated measures analyses of variance (ANOVA), this particular phrase is often understood to mean that the exact same measure must be used across all waves of the repeated measures design. For more about the meaning of "commensurable", please refer to Chapters 5 and 6.

As Scenario 1 above illustrates, certain constructs can indeed be measured using the exact same measure across all testing occasions - irrespective of the ever-changing age, cognitive development, and personal and scholarly experiences of the test-takers. In these situations, the test length, item wording, and response categories remain constant across all waves of the study. As Scenario 2 (time-variable measures that can be linked) and particularly Scenario 3 (time-variable measures that cannot be linked) depict, however, there are often situations in which one's construct of choice makes using and re-using the exact same measure across waves unreasonable - and even impossible.

What is Meant by "Same Dependent Variable"?
In a seminal article in which the authors make several recommendations for measurement in longitudinal studies, Willett, Singer, and Martin (1998) clarify what is actually meant by the phrase "the same dependent variable":

1. "At the very least, the attribute [must] be equatable over occasions of measurement, and must remain construct valid for the period of observation" (p. 397);
2. "Seemingly minor differences across occasions - even those invoked to improve data quality - will undermine equatability. Changing item wording, response category labels, or the setting in which instruments are administered can render responses nonequatable. In a longitudinal study, at a minimum, item stems and response categories must remain the same over time" (p. 411); and
3. "Whenever time varying variables are measured, their values must be equatable across all occasions of measurement" (p. 411).

Unfortunately, missing from Willett et al.'s (1998) article is an explicit explanation of what they mean by "equatable": It is difficult to ascertain if they use this word as is common in the English vernacular (e.g., as a synonym for "commensurable" or "comparable" or "linkable") or if they mean this word in a strict psychometric sense (described more fully in the next section and in Chapter 3). As such, in order to determine the specific conditions under which test scores are equatable, it is necessary to refer to the test linking literature.

What is Meant by "Equatable" Test Scores?

In general, test linking refers to the general problem of linking, connecting, or comparing the scores on different tests (Linn, 1993). Test equating, a special case of test linking, adjusts for differences in tests' difficulty, not differences in content (Kolen & Brennan, 2004). Hence, when the scores on various tests are equated, the measures may be used interchangeably for any purpose (von Davier, Holland, & Thayer, 2004).
In addition, any use or interpretation justified for scores on Test X is also justified on Test Y (Linn, 1993) - meaning a score of 26 on Test X means the same "amount" of a given construct as a score of 26 on Test Y (Kolen & Brennan, 2004). This chapter's mention of test equating is made so as to weave together the concepts presented by Willett et al. (1998) and von Davier et al. (2004); however, test linking is certainly not limited to test equating alone. In Chapter 3, the fuller spectrum of test linking methods is described.

Von Davier et al. (2004) state that the following five conditions must all be met in order to deem different measures as equatable:[5]

1. Equal Constructs: The tests must measure the same construct. This requirement is sometimes referred to as measurement invariance or measurement equivalence,[6] and is achieved when the items tap the same underlying construct or latent trait[7] at each wave (Johnson & Raudenbush, 2002). As Meade, Lautenschlager, and Hecht (2005) observe, "if measurement invariance does not hold over two or more measurement occasions, differences in observed scores are not directly interpretable" (p. 279).

2. Equal Reliability: The various tests' scores must not have different reliabilities, even if they measure the same construct. Any changes in psychometric properties of tests across waves can change the predictive validity of the test scores (Meade et al., 2005);

[5] These specific conditions are discussed in greater detail in subsequent sections of this dissertation.
[6] Measurement invariance is similar to the notion of commensurability - which is discussed in more detail in Chapters 5 and 6.
[7] A latent variable is an unobserved variable that accounts for the correlation among one's observed or manifest variables. Ideally, psychometricians design scales such that the latent variable that drives test-takers' responses is a representation of the construct of interest.

3.
Symmetry: The equating function for equating the scores of Test Y to Test X should be the inverse of the equating function for equating the scores of Test X to Test Y. In other words, the results should be the same regardless of the direction of the linkage (Pommerich, Hanson, Harris, & Sconing, 2004).

4. Equity: The actual test written should be a matter of indifference to the test-taker. In other words, students should not find one test harder or more complex than the other.

5. Population Invariance: The equating function used to link the scores of Tests X and Y should not be affected by the choice of sub-populations used to compute the function. More specifically, the function obtained for one sub-group of test-takers should be the same as the function obtained from a different sub-group of test-takers (Ercikan, 1997). Proper equating must be invariant against arbitrary changes in the group (Lord, 1982).

The Motivating Problem: Analysing Change/Growth with Time-Variable Measures

As has been described, Willett et al. (1998) state that test scores must be equatable in order for repeated measures analysis to be performed validly. In turn, for test scores to be equatable, the conditions of equal constructs, equal reliability, symmetry, equity, and population invariance must first be met (von Davier et al., 2004). Equating test scores used in repeated measures designs is best achieved by using the exact same measure across all waves of the study (as is the case in Scenario 1), because any changing of item wording, response category labels, or the setting in which tests are administered can conceivably render test scores non-equatable (Willett et al., 1998).

As Scenario 1 portrays, there are legitimate situations in which a particular construct can be measured longitudinally by using and re-using the exact same measure across waves.
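Of the five equating conditions, symmetry is perhaps the easiest to illustrate numerically. The sketch below uses the simplest possible linking function - a linear one that matches scores sitting at the same standardised position - with hypothetical means and standard deviations (all numbers here are invented for illustration, not drawn from this dissertation's data). Symmetry holds because the X-to-Y and Y-to-X functions are exact inverses:

```python
# Hypothetical score moments for two test forms.
mu_x, sd_x = 50.0, 10.0   # Test X mean and standard deviation
mu_y, sd_y = 60.0, 12.0   # Test Y mean and standard deviation

def x_to_y(x):
    """Linear linking: match scores at the same standardised position."""
    return mu_y + (sd_y / sd_x) * (x - mu_x)

def y_to_x(y):
    """The inverse function, taking Test Y scores back to the Test X scale."""
    return mu_x + (sd_x / sd_y) * (y - mu_y)

x = 26.0
print(round(x_to_y(x), 4))           # the Test X score re-expressed on the Test Y scale
print(round(y_to_x(x_to_y(x)), 4))   # the round trip returns the original 26.0
```

A regression of Y on X, by contrast, is not its own inverse, which is one reason a regression-based linkage does not satisfy the symmetry requirement.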
Other constructs, such as those presented in Scenario 2 and particularly Scenario 3, make the repeated use of the exact same measure across waves unreasonable - and even impossible. Kolen and Brennan (2004) remind readers that, when test forms cannot be made to be identical, it is unnecessary or even impossible to equate. So what is one to do, then, if the use of time-variable measures is necessary and unavoidable?

As such, the problem motivating this dissertation is that of analysing change and growth within the contexts presented in Scenario 2 (time-variable measures that can be linked) and Scenario 3 (time-variable measures that cannot be linked) - both of which are characterised by the measures changing across waves. This dissertation devotes specific attention to the latter of the two scenarios, given that this particular scenario has been relatively unaddressed in the test linking and change/growth literatures.

Objectives and Novel Contributions

The general aim of this dissertation is to investigate the problem of analysing change and growth with time-variable measures (particularly those that cannot be linked). There are two specific objectives of this dissertation. As is described below, these two objectives are also the dissertation's novel contributions.

Objective/novel contribution 1. The first objective is to weave together, or to bridge the gap between, the test linking and change/growth literatures in a comprehensive manner. Until now, the two literatures have either been largely disconnected or, at most, woven together in such a manner so as to situate the motivating problem primarily around vertical scaling techniques.
An approach to linking together the scores of measures with different difficulties for groups with different abilities (described in more detail in Chapters 3 and 4), vertical scaling has several limitations - most notably that it requires the presence of common items across measures (and, hence, cannot be used when the time-variable measures are non-linkable). Unfortunately, the gap between the test linking and change/growth literatures causes two problems. First, the two literatures often use different terms to refer to similar ideas. For example, the terms "measurement invariance" (from the test linking literature) and "commensurability" (from the change/growth literature) are often regarded as disconnected notions, when the two are, in fact, variations on a theme. Because the two literatures often do not "speak the same language", one's understanding of analysing change/growth with time-variable measures can be muddied unnecessarily. This problem is exacerbated by the fact that writers within even one of the named literatures often come from a variety of research backgrounds (e.g., nursing, psychology, education, economics) and often use different terms for similar ideas. A second problem caused by the gap between the two literatures is that one's understanding of the motivating problem may be severely curtailed if he or she is not familiar with both literatures. For example, recall from an earlier section that Willett et al. (1998) state in the change/growth literature that "At the very least, the attribute [must] be equatable over occasions of measurement, and must remain construct valid for the period of observation" (p. 397). If a reader were to limit his or her reading to the change/growth literature, then he or she may interpret the word "equatable" to mean that change/growth analyses are permissible simply if a similar construct is being measured across waves.
It is only by referring to the test linking literature that one is able to determine that the actual meaning of the word "equatable" is much more nuanced and demanding of the data (e.g., von Davier et al., 2004) than the meaning implied in the change/growth literature. Therefore, it is only by bridging the gap between the two literatures that one can analyse change/growth with time-variable measures in a rigorous fashion.

Footnote 8: The concept of commensurability is described in more detail in Chapters 5 and 6.

Objective/novel contribution 2. The second objective of this dissertation is to provide a novel solution to the problem of studying change and growth when one has time-variable measures (that can or cannot be linked). As Chapter 4 describes more fully, many of the strategies currently being used in the change/growth literature as means of handling the motivating problem are often only useful to large testing organisations that have at their fingertips very large numbers (sometimes tens of thousands) of test-takers and/or measures with hundreds of items. Unfortunately, researchers in everyday research settings often do not have the means to use such large sample sizes or item pools. Moreover, many of the strategies presented in Chapter 4 require the presence of common items across measures, which, as discussed earlier, is not always feasible (or warranted). Therefore, this dissertation introduces a workable solution that can be implemented easily in everyday research settings, and one that is particularly useful when one's measures cannot be linked. This dissertation has coined this solution the Conover solution in honour of the seminal work of Conover and Iman (1981) and Conover (1999), whose research not only inspired the novel solution, but also provided evidence for the solution's viability.
Although the specifics of the Conover solution are saved for subsequent sections of this dissertation, it is important to highlight that this solution is novel because it involves an innovative bridging of the gap between parametric and non-parametric statistical methods. Indeed, this gap has already been bridged in various other contexts (e.g., the Spearman correlation, described in Chapter 5) due, in large part, to the seminal work of Conover and Iman (1981) and Conover (1999). The bridge, however, has never before been extended to the problem of analysing change/growth, particularly with time-variable measures. It is by expanding upon the research of Conover and Iman (1981) and Conover (1999), and extending the parametric/non-parametric bridge to the context of the current dissertation, that two new change/growth analyses can be performed for the first time: the non-parametric hierarchical linear model (for multi-wave data) and the non-parametric difference score (for two-wave data). These new analyses are discussed in more detail later in the dissertation.

Importance of the Dissertation Topic

There are two primary reasons why it is important to address the motivating problem. First, as Willett et al.'s (1998) and von Davier et al.'s (2004) work describes, the rules about which tests are permissible for repeated measures designs are precise and strict. Given these conditions, it is necessary to investigate if and how repeated measures designs are possible when the measures are time-variable. Without an adequate solution to this problem, the very reliability, validity, and usefulness of past studies involving time-variable measures (particularly those that cannot be linked) are called into question. Furthermore, this problem, if left unsolved, implies that longitudinal analyses involving time-variable measures should altogether cease to occur.
As many educational psychologists, psychometricians, social scientists, and statisticians alike would agree, a 'non-solution' to this problem is hardly satisfactory. Second, there has been substantial growth in longitudinal large-scale achievement testing in the past decade, most notably in North America. Such testing is being practised zealously at the institutional level (e.g., schools and districts), within universities, at the public policy level (e.g., British Columbia's Foundation Skills Assessment, the Government of Canada's National Longitudinal Survey of Children and Youth, and America's No Child Left Behind mandates), and within the private sector (e.g., Educational Testing Service). In such contexts, it is common for questions about change over time to arise. Once again, without an adequate solution to the problem of handling repeated measures designs with time-variable measures (particularly those that cannot be linked), it is impossible to ascertain if the inferences these organisations are making about test score changes are accurate. As a practical matter, unless and until there is an adequate solution to this problem, much of the billions of dollars directed towards funding large-scale testing across North America each year will simply be wasted, and many of the policy-related decisions borne of these scores will be fundamentally unsound. As Kolen and Brennan (2004) put it so succinctly, "the more accurate the information, the better the decision" (p. 2). Although much is unknown about this particular topic, one thing is for sure: This is a rich and fertile area for research by educational psychologists, psychometricians, social scientists, and statisticians alike (Holland & Rubin, 1982).

Framework of the Dissertation

The process of change is deceptively complex, multifaceted, and nuanced - more so than many researchers first acknowledge.
Without a deep understanding of what is meant by change, it is possible to draw erroneous conclusions from studies or, worse still, to create fundamentally flawed research designs. As such, the purpose of Chapter 2 is to "unpack" the precise meaning of 'change', so as to set the context for the remainder of the dissertation.

Chapter 3 offers a brief description of each of the major types of test linking: (1) equating, (2) calibration, (3) statistical moderation, (4) projection, and (5) social moderation, respectively. This chapter also serves as a backdrop for the remainder of the dissertation, because many of the concepts and terms presented in this chapter are revisited in later chapters.

Chapter 4 weaves together, in a comprehensive manner, the test linking and change/growth literatures by presenting seven test linking strategies currently being used in the change and growth literature as means of handling the problem of analysing change and growth with time-variable measures (particularly those that cannot be linked). These seven strategies include: (1) vertical scaling, (2) growth scales, (3) Rasch modelling, (4) latent variable or structural equation modelling, (5) multidimensional scaling, (6) standardising the test scores or regression results, and (7) converting raw scores to age- or grade-equivalents pre-analysis, respectively. Each strategy is presented, where possible, with examples from real-life research settings.

In Chapter 5, the Conover solution is introduced as a means of handling the problem of analysing change/growth when none of the aforementioned seven strategies is able to be implemented (most notably in the case of time-variable measures that cannot be linked). As described earlier, in the case of two-wave data, the Conover solution is called the non-parametric difference score; in the case of multi-wave data, the Conover solution is called the non-parametric HLM.
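The core mechanic of the Conover solution - rank transforming test-takers' scores within each wave and carrying the ranks, rather than the raw scores, into subsequent analyses - can be sketched minimally as follows. The scores, sample size, and tie-handling choice (average ranks) here are invented for illustration; they are not drawn from the dissertation's case studies.

```python
# A minimal sketch of the Conover solution's first step: rank-transform
# test scores *within* each wave. Data are invented for illustration.
def rank_within_wave(scores):
    """Return average ranks (1 = lowest) for one wave's scores; tied
    scores receive the mean of the ranks they span."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend the block while the next sorted score ties the block's start.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean rank for the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

wave1 = [52, 75, 75, 60]   # raw scores on the Wave 1 measure
wave2 = [12, 30, 25, 18]   # raw scores on a *different* Wave 2 measure
print(rank_within_wave(wave1))   # [1.0, 3.5, 3.5, 2.0]
print(rank_within_wave(wave2))   # [1.0, 4.0, 3.0, 2.0]
```

Because ranking is done separately within each wave, the two waves' measures never need to share items or a common scale; only the within-wave ordering of test-takers enters the subsequent analysis.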
Although the Conover solution may be applied in either Scenario 2 (time-variable measures that can be linked) or Scenario 3 (time-variable measures that cannot be linked), it is particularly useful in the latter of the two - a scenario which has gone relatively unaddressed in the test linking and change/growth literatures. The Conover solution involves rank transforming (or ordering) individuals' longitudinal test scores within wave pre-analysis, and then using these rank scores in the place of raw or standardised scores in subsequent statistical analyses.

In Chapter 6, by way of two case studies involving real data, the step-by-step implementation of the two Conover solutions (the non-parametric HLM solution and the non-parametric difference score solution) is presented, respectively. An explanation of the resultant statistical models and statistical output is also offered.

Chapter 7 concludes by discussing the Conover solution's strengths and limitations and offers suggestions for future studies focussed on the analysis of change and growth with time-variable measures (particularly those that cannot be linked).

Chapter 2: Foundational Issues: Precisely What is Meant by 'Change'?

Change is an inexorable and pervasive part of human beings' daily lives. Because discussion of change and growth is central to this dissertation, it is important to "unpack", at the outset, many of the foundational issues surrounding the process of change. The investigation of change is of great interest to many educational and social science researchers. In particular, two groups of scholars have made the study of change an integral component of their research programs. First, developmental psychologists are concerned with both descriptions (i.e., depictions or representations) of change and explanations for change (i.e., the specification of the causes or antecedents of development).
In essence, the primary objectives of developmental psychologists are to describe and to explain what stays the same and what changes across life, and to account for the reasons for such change (Dixon & Lerner, 1999). Second, psychometricians often focus their research on quantifying the amount by which individuals change and grow over time. As introduced in the previous chapter, the path that summarises a test-taker's pattern of change over multiple waves is referred to as a trajectory (Singer & Willett, 2003). Although a number of developmental psychologists, psychometricians, and their counterparts pepper journal articles and book chapters with talk of change, there is a surprising dearth of discourse about the precise meaning of the term. Perhaps this dearth is attributable to the fact that, at first glance, researchers' basic understanding of change appears to be generally well in hand: Most human beings have at least a lay understanding of what change means - with or without a scholarly discussion of its precise meaning. Even so, the absence of a precise meaning of change in the literature introduces its fair share of problems. As this chapter details, change is a deceptively more complex, multifaceted, and nuanced process than many researchers first acknowledge. Without a deep understanding of what is meant by change - particularly in the change/growth and test linking contexts - it is possible to draw erroneous conclusions from studies or, worse still, to create fundamentally flawed research designs. To this end, the purpose of this chapter is to discuss the precise meaning of change.

Amount versus Quality of Change

Researchers generally agree that there are two broad types of change: quantity (amount) and quality. Imagine that a farmer has in his possession a small barrel, which he places outdoors in the centre of a field.
At the same time each morning, the farmer goes out to the barrel, sticks a metre stick into it, and measures the amount of rain that has fallen in the preceding 24 hours. After measuring the rainfall, he empties the barrel. The next day, he returns to the barrel, uses his metre stick to record the past day's amount of rainfall, and then empties the barrel once again. This process continues for several weeks during the crop season. By continually measuring the rainfall according to his metre stick, the farmer is measuring changes in the amount of rainfall over time. Moreover, by simply tracking the daily rainfall levels over time, he is able to determine the trajectory of rainfall in his field. It is important to note that he uses the exact same metre stick each morning - because he knows that, by changing the measuring stick, he may compromise the quality of the day-to-day comparisons. Psychometricians refer to this type of distortion as measurement error. Even though the farmer's metre-stick routine allows him to quantify the amount of rainfall change, it does not allow him to measure the quality of the change in rainfall. Although he is able to detect simple increases or decreases in the amount of rainfall relative to some baseline measure (e.g., the first day of the crop season), his particular metre-stick routine does not allow him to detect changes in the quality of the rainfall - such as levels of acidity in the rain or the nutrient composition of the collected water. As such, an important and necessary starting place in better understanding the nature of change is to draw the comparison between the amount and the quality of change. It is important to note that growth models, a relatively new methodological tool for analysing individual change, pertain only to amounts of change; they do not allow researchers to study qualities of change over time. The latter type of change is explored best using qualitative tools.
Given that amount of change is the focus of the current dissertation, the remainder of this chapter (and dissertation) focuses on this particular type of change.

Is Constancy as Interesting as Change?

According to Dixon and Lerner (1999), the opposite of change is constancy or continuity. Typically, when studying a sample of individual test-takers' growth curves, a researcher is not interested in the way in which the scores 'stay the same' across waves. A researcher does not generally elect to study a particular construct over time if she does not suspect a priori that the test-takers' scores will change, in some systematic or measurable way, across waves - particularly when one considers the great financial costs and complexities associated with many longitudinal research designs. Growth modelling, by definition, implies change - not constancy. Therefore, a researcher opts to study a particular construct over time typically because she expects that individual test-takers' scores will, on average, increase (e.g., improvement in mathematics scores as a result of coaching) or decrease (e.g., deterioration in memory scores as a result of age) across waves. Certain personality characteristics, such as intellectual mastery, demonstrate remarkable consistency across time (Conger & Galambos, 1997). Does their constancy mean that such characteristics should not be measured over time? Singer and Willett (2003) chide readers that "not every longitudinal study is amenable to the analysis of change" (p. 9), adding that there are three particular methodological features that make a study suited for longitudinal analysis:

1. three or more waves of data;

2. a sensible metric for clocking time; and

3. an outcome whose values change systematically over time (e.g., patients whose symptomatology differs before, during, and after therapy).
Constructs purported to remain constant over time are considered to be not particularly useful (or interesting) in longitudinal analyses. This issue is revisited in a later section of this chapter.

Personal versus Stimulus Change

There is one important, but altogether unwritten, assumption underlying any growth modelling analysis: It is that the test-taker (the person) stays, for all intents and purposes, 'the same' for the duration of the study (i.e., any changes in a test-taker's scores across waves are attributable to changes in the amount of a construct over time, and not to the test-taker's personal attributes). Hence, the only changeable aspect to the study is the amount of the test-taker's construct across waves. Imagine, for example, that Joe Smith's mathematics achievement is tested across three waves (Grades 4, 7, and 10). The typical assumption is that Joe Smith (the person) remains constant over the three waves of the study. Only the measure (the stimulus) changes across waves. Recall that, in an ideal world, psychometricians design measures such that the latent variable that drives test-takers' responses is a representation of the construct of interest. As such, any and all changes in Joe's across-wave test scores are attributed strictly to changes in his amount of mathematics achievement - as captured by the measures - and not to personal changes he may have experienced across the years. But is this assumption - that the person somehow remains constant across waves - a fair one to make?

It can be argued that this assumption is fundamentally flawed for this reason: How is it possible for any person, particularly a young test-taker assessed over several years, to remain 'the same' across years? Surely Joe Smith experiences some degree of personal change as the waves pass: He not only ages and matures, but his cognitive skills develop, his scholarly and recreational interests change, and he is confronted by a variety of personal and scholarly experiences, both positive and negative, from the world around him - and all of this personal change occurs irrespective of purported changes in his mathematics achievement (the stimulus change) across the three waves! To compound the 'person-stays-constant' problem, each and every test-taker in a study experiences distinctly unique personal change. As such, there is no possible way in which to account and control for such variation in personal change across test-takers, because they are experiencing such change in extremely unique ways.

Footnote 9: Because one's scale score is correlated to the underlying latent variable, and because the latent variable is, in turn, ideally thought to be an approximation of the construct of interest, the term 'stimulus' refers variably to (a) the measure itself or (b) the construct the measure is purported to measure.
Formulating a Research Design: Important Questions

Because personal change occurs among test-takers in unpredictable and variable ways - irrespective of the individuals' stimulus change - the reality is that researchers must devote serious and informed thought to the implications of such personal change when formulating research designs centred on quantifying stimulus change. Right from the outset of the study, researchers must ask themselves 'thought questions' such as:

1. What recommendations do the relevant theory and the existing literature make about the formulation of the research design?

2. Knowing that test-takers' personal changes will occur more and more frequently the longer the duration of the study, over how many waves should the test-takers be assessed? How much time should elapse between waves - hours, days, months, or years? According to Singer and Willett (2003), one should choose a metric for time that best reflects the cadence expected to be the most useful for the outcome.

3. How will test-takers' personal changes affect the inferences made about their stimulus changes? Are there any safeguards that can be implemented (e.g., from a test development perspective) that may mitigate the muddying effect that personal change can have on the inferences made about test-takers' stimulus change? Can the time-variable measures be designed with an eye toward lessening the impact of personal changes on stimulus changes? If so, how likely is it that these safeguards will protect researchers from making incorrect inferences about their test-takers' stimulus change?

Recall from Chapter 1 that some constructs (such as the academic motivation example presented in Scenario 1) are purported to be measurable using the exact same measure across all testing occasions, irrespective of the ever-changing age, cognitive development, and personal and scholarly experiences of the students.
In such cases, the item wording, content of the measure, response formats, and response categories are unchanged from wave to wave. In the context in which the exact same measure is used across waves, it is important that researchers ask themselves the following additional 'thought question':

4. Knowing that test-takers will all experience some degree of personal change, does it even make sense to use and re-use the exact same measure across waves? Is the use and re-use of the exact same measure, in fact, erroneous because it implies that test-takers' personal change has no relationship to or impact upon their stimulus change? Take, for example, one multiplication test item administered to a group of test-takers in Grade 4 and then again when they are in Grade 7: In the younger grade, this item may assess mathematics skill; in the later grade, however, the very same item may instead assess memory!

Unfortunately, due to the case-specific nature of change, only the researchers themselves can answer these questions. Nonetheless, it is imperative that researchers distinguish personal change from stimulus change, and that they are cognisant of the impact of both when formulating their research designs. It is important to highlight that the questions outlined above are not the specific research questions motivating this dissertation. Rather, they are 'thought questions' for researchers interested in studying change and growth, particularly via the use of time-variable measures (that can or cannot be linked).

Common Interpretation Problems in Change Studies

Regular growth models involve estimating the initial status (y-intercept) and the growth rate (rate of change) for each individual test-taker. If one reads the output from any growth modelling analysis, one portion of output allows one to determine if the value of the participants' mean baseline measure score is significantly different from zero.
This particular snippet is, generally, not of much interest to researchers. A separate snippet of output, however, allows one to determine if there is a significant main effect of time (wave) on the participants' across-wave scores. In growth modelling, one is generally interested only in phenomena whose scores show marked increases or decreases across time. Very often, however, certain test-takers' scores start high and stay high. Is this type of performance as interesting to researchers as that of test-takers who start low and end medium? Or those who start medium and end high? To illustrate this point, imagine that a researcher investigates test-takers' performance on three tests of achievement over two waves: (a) numeracy, (b) reading, and (c) writing. Based on their performance in the first wave of testing, test-takers are divided into one of three performance categories: low, medium, and high. After the second wave of testing, the researcher computes each person's simple difference score (described more fully in Chapter 6) and presents the change scores in a table (please refer to Table 1).

Footnote 10: Growth models, often referred to as mixed-effect or HLM models, are discussed in more detail in Appendix A.

Table 1
Test-Takers' Simple Difference Scores on Three Subtests, Grouped by Wave 1 Performance

Subtest             Low Performance   Medium Performance   High Performance
Numeracy Subtest         7.35               2.65                2.30
Reading Subtest         15.00               4.95               -0.07
Writing Subtest          5.06               1.85                0.25

Note. Table inspired by Gall, Borg, and Gall, 1996.

By glancing at the table of results across subtests, the researcher notices that the test-takers who scored the lowest at Wave 1 showed the greatest improvement over time - as reflected by their larger simple difference scores, relative to those of the "medium" and "high" performance groups.
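The two-wave bookkeeping behind a table like Table 1 - computing each test-taker's simple difference score and averaging within the Wave 1 performance groups - can be sketched as follows. All scores, group cut-offs, and resulting means here are invented for illustration and do not reproduce Table 1's values.

```python
# Hypothetical sketch: simple difference scores (Wave 2 minus Wave 1),
# averaged within Wave 1 performance groups. Data and cut-offs invented.
students = [
    # (Wave 1 score, Wave 2 score)
    (20, 38), (25, 36), (48, 55), (52, 56), (80, 81), (85, 84),
]

def group(w1):
    """Assign a Wave 1 performance category (cut-offs are assumptions)."""
    return "low" if w1 < 40 else ("medium" if w1 < 70 else "high")

sums, counts = {}, {}
for w1, w2 in students:
    g = group(w1)
    sums[g] = sums.get(g, 0) + (w2 - w1)   # simple difference score
    counts[g] = counts.get(g, 0) + 1

means = {g: sums[g] / counts[g] for g in sums}
print(means)   # {'low': 14.5, 'medium': 5.5, 'high': 0.0}
```

Even in this toy data set, the "low" group shows the largest mean gain - the very pattern whose interpretation problems (ceiling effects, regression toward the mean) are discussed next.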
Moreover, the test-takers who showed the best overall performance at Wave 1 showed the least amount of improvement over time, relative to the "low" and "medium" groups. How should the researcher interpret these data? Do the data mean that the students with the lowest initial achievement are likely to learn more than students in the "medium" and "high" groups? In some cases, answering the latter question with "yes" is correct. In other cases, however, such findings may simply be an artefact produced by measurement error across waves (Gall et al., 1996). Gall et al. (1996) outline five interpretation problems common to change studies, but particularly to those that involve two-wave data:

1. Ceiling Effect: This occurs when the range of difficulty of the test items is limited. Therefore, scores at the higher end of the possible score continuum are artificially restricted. In the example above, it is possible that the "high" performance group scored near the ceiling of the Wave 1 test. As such, they could only earn a small simple difference score across the two waves.

2. Regression Toward the Mean: This describes the tendency for test-takers who earn a higher score at Wave 1 to earn a lower score at Wave 2 (and vice versa). This phenomenon occurs because of measurement errors across waves, and because the test scores are correlated to each other (Zumbo, 1999).

3. Assumption of Equal Intervals: Many change studies assume that, on a hypothetical 100-point test, a gain from 90 to 95 points is equivalent to a gain from 40 to 45 points. In reality, it is likely much more difficult to make a gain of five points at the higher end of the score continuum, for example, than at the mid-point of the same continuum.

4. Different Ability Types: Very often, a given test score reflects different types and levels of ability for different test-takers.
Even though two students both earn a simple difference score of 15 on a two-wave mathematics assessment, it does not mean that they have the same pattern of strengths and weaknesses: one student may have improved his algebra over time, whereas the other may have improved his trigonometry. As such, researchers should be careful not to attribute equivalence to score trajectories that, in reality, do not represent the same thing.

5. Low Reliability: Another interpretation problem associated with change studies is that their scores are not reliable. For reasons outlined by Zumbo (1999), the higher the correlation between the scores across waves, the lower the reliability of the change scores. It should be noted, however, that Zumbo (1999) explains why this particular criticism of change scores can, in some cases, be unmerited.

Footnote 11: Due to the subjective nature of change studies, there is no consensus about what is considered low reliability. As a general rule of thumb, however, Singer and Willett (2003) state that reliability = 0.8 or 0.9 is "reliable enough" (p. 15) for the study of change.

According to Linn and Slinde (1977), test-takers who yield exceptionally high or low change scores are often identified so that they receive some sort of special treatment. That said - if not from a statistical but instead from a practical standpoint - is it not just as impressive for a high achiever to perform consistently well across waves as it is for a low achiever to show substantial change across waves? Recall from an earlier section that constancy of performance is often regarded as less important (or less interesting) than performance that shows systematic change. For reasons outlined in this section, perhaps educational and social science researchers' stance on what is considered exceptional performance across waves needs to be reconsidered.

How Should One Conceptualise Change?
Change relative to a score earned in the first wave - whether in two-wave or multi-wave contexts - is the typical way in which educational and social science researchers conceptualise change. Is this, however, the best way to think about change, particularly when one considers the various interpretation problems associated with change studies? Recall an earlier example in which the numeracy achievement of Joe Smith is studied across three waves: Grades 4, 7, and 10. Imagine that Joe's actual performance is plotted in a line graph, as depicted in Figure 1.

Figure 1. A hypothetical test-taker's performance across three waves (Grades 4, 7, and 10) of a simulated mathematics assessment (solid line = actual scores, dashed line = line of best fit).

Joe's actual test scores, plotted using a solid line, show that his mathematics achievement increased steadily across all three waves. If one "chunks down" his performance into two sets of two waves, however, it is possible to see that Joe's performance improved more between Grades 7 and 10 than between Grades 4 and 7. When conducting any change and growth analysis, the researcher must decide on the fit function she will use with the data. Oftentimes, the chosen fit function is that of a straight line - sometimes called a linear fit function. By fitting a linear fit function to Joe's overall performance, as shown by the dashed line, the across-wave improvement in his scores is indeed still visible; however, the spike in Joe's performance between Grades 7 and 10 is masked.

Footnote 12: There are several additional types of fit functions. Please refer to Singer and Willett (2003) for more detail.

Choosing one single fit function, particularly for data sets containing thousands of cases with variable trajectories, can often be a difficult task - and is discussed in more detail by Singer and Willett (2003).
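Joe's case can also be sketched numerically: fitting a linear fit function by ordinary least squares yields one overall growth rate, while "chunking down" into pair-wise slopes reveals the spike that the single line masks. Joe's scores below are invented for illustration; they are not the values plotted in Figure 1.

```python
# Hypothetical three-wave data for Joe (scores invented).
grades = [4, 7, 10]
scores = [40, 48, 70]

# Ordinary least-squares slope of a linear fit function.
n = len(grades)
mx = sum(grades) / n
my = sum(scores) / n
slope = sum((x - mx) * (y - my) for x, y in zip(grades, scores)) \
        / sum((x - mx) ** 2 for x in grades)
print(round(slope, 2))                        # overall rate: 5.0 points/grade

# Pair-wise ("chunked down") slopes expose the masked spike.
print(round((scores[1] - scores[0]) / 3, 2))  # Grade 4 -> 7:  2.67 points/grade
print(round((scores[2] - scores[1]) / 3, 2))  # Grade 7 -> 10: 7.33 points/grade
```

The single slope of 5.0 points per grade averages over two quite different growth phases, which is exactly the masking the pair-wise approach is meant to reveal.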
Before selecting a fit function, however, educational and social science researchers should first ask themselves one question: Should educational and social science researchers perpetually view change as that which is gained, on average, across all waves (as is typically done in growth modelling analyses), or is there some (or even more) utility in this pair-wise or "chunking down" approach? Once again, due to the case-specific nature of change, only researchers with intimate knowledge of a given data set can answer this question.

In summary, this chapter "unpacked" many of the foundational issues surrounding the process of change. Such issues involved distinguishing (a) 'amount' from 'quality' of change, (b) 'constancy' from 'change', and (c) 'personal' from 'stimulus' change. Also presented were various important 'thought questions' one should pose when conducting any study focussed on exploring change and growth (in particular, those using the exact same measure across waves), and a review of interpretation problems common in studies of change. This chapter concluded by asking if the current way in which many educational and social science researchers conceptualise change (i.e., that change is always relative to a score earned in the first of two or more waves) requires reconsideration. In the next chapter, each of the major types of test linking methods is described: (1) equating, (2) calibration, (3) statistical moderation, (4) projection, and (5) social moderation. Chapter 3 also serves as a backdrop for the remainder of the dissertation, because many of the concepts and terms presented in this chapter are revisited in later chapters.

Chapter 3: Five Types of Test Linking

As mentioned in Chapter 1, test linking refers to the process of systematically linking or connecting the scores of one test (Test X) to the scores of another (Test Y).
Figure 2 illustrates an example in which the scores of Tests X and Y are linkable linearly. In this example, a raw score of 50 on Test X can be interpreted in the same way as a raw score of 60 on Test Y, and so on along the score continua.

Figure 2. Illustrating the fit function that links the scores on two versions of a hypothetical 100-item test (modified from Kolen & Brennan, 2004). The arrows show that, as noted in Chapter 1, the direction of the linkage is unimportant when test linking.

The relationship between Test X's and Test Y's scores does not necessarily have to be linear, or even mathematical for that matter (so long as the chosen fit function reflects the form of the data). Although rarely stated explicitly, one ultimately performs test linking with an eye towards producing what is termed a conversion table (correspondence table), which indicates the specific Test Y score that is considered equivalent to a given score on Test X (and vice versa). Using a conversion table, such as the example provided in Table 2, one can easily spot that a score of 30 on Test X means the same thing as a score of 31 on Test Y.

Table 2
Example Conversion Table for Test X and Test Y Scores

Test X Raw Score    Test Y Raw Score
30                  31
31                  32
32                  33
33                  34
34                  35

The phrase test scaling is often used synonymously with test linking, but, in fact, the two phrases mean different things. Test scaling refers specifically to the process of transforming raw test scores into new sets of numbers with given attributes, such as a particular mean and standard deviation (Lissitz & Huynh, 2003), thus increasing the interpretability of the test scores (Kolen & Brennan, 2004).
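As a minimal sketch of the two ideas just distinguished, the snippet below builds the conversion table from Table 2 (assuming, purely for illustration, that the fit function over the shown range is simply Test Y = Test X + 1) and then applies a test scaling transformation to an arbitrary target mean and standard deviation (500 and 100 are invented reporting-scale values, not from the dissertation).

```python
# Test linking: a conversion (correspondence) table built from a hypothetical
# fit function. Over the range shown in Table 2, Y = X + 1 is assumed.
def x_to_y(x_score):
    return x_score + 1

conversion_table = {x: x_to_y(x) for x in range(30, 35)}
print(conversion_table)  # {30: 31, 31: 32, 32: 33, 33: 34, 34: 35}

# Test scaling, by contrast, maps raw scores onto a scale with a chosen mean
# and standard deviation (500 and 100 here are arbitrary illustrative values).
def scale(raw_scores, new_mean=500.0, new_sd=100.0):
    n = len(raw_scores)
    mean = sum(raw_scores) / n
    sd = (sum((r - mean) ** 2 for r in raw_scores) / n) ** 0.5
    return [new_mean + new_sd * (r - mean) / sd for r in raw_scores]

print(scale([30, 31, 32, 33, 34]))
```

The contrast is visible in the outputs: linking maps one test's score points onto another test's score points, while scaling merely re-expresses one test's scores with new distributional attributes.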
As Linn (1993) notes, there are numerous other terms used to refer to the basic concept of test linking: anchoring, benchmarking, calibration, concordance, consensus moderation, equating, prediction, projection, scaling, statistical moderation, social moderation, verification, and auditing. Unfortunately, not all of these terms have had precise definitions and technical meanings (Linn, 1993). It is, however, generally accepted that these numerous terms fall into five broad types of test linking (listed in decreasing order of statistical rigour): equating, calibration, statistical moderation, projection, and social moderation (Kolen & Brennan, 2004; Linn, 1993). Given that equating is the most rigorous and widespread of the five types, this chapter discusses equating in more detail than the other test linking methods.

Although Chapter 1 presents test linking in the context of repeated measures designs, it should be stressed that test linking is certainly not limited to longitudinal designs alone. Test linking may be used, for example, to connect the test scores of one district to the scores of another district (e.g., if the two districts administer completely different assessments) or in the creation of parallel measures.

As is elucidated in subsequent sections of this chapter, none of these five test linking methods proves to be a suitable solution to the problem of analysing change/growth with time-variable measures (particularly those that cannot be linked). Consequently, these five methods are presented here for the purpose of providing readers with a brief overview of test linking, and with specific reasons why each is insufficient in the study of change and growth with time-variable measures. By highlighting each linking method's shortcomings in terms of the motivating problem, the rationale and need for the Conover solution presented in Chapter 5 are established.
Equating

Equating, the first type of test linking discussed in this chapter, has the most stringent technical requirements (Linn, 1993; Linn & Kiplinger, 1995). When different tests have been equated, it is surmised that it is a matter of indifference to the test-takers which particular version of the measure they write (Kolen & Brennan, 2004). Although there are six different types of equating (described in a later section), each type involves following this general procedure (Kolen & Brennan, 2004; Mislevy, 1992):

1. One must decide on the purpose for equating.
2. One must construct alternate forms from the same test blueprint, ensuring the same content and statistical specifications. At this step, it is also important to define clearly the expected correspondence among the scores of the tests (Schumacker, 2005).
3. One must choose a data collection design and implement it. Equating requires that data are collected in order to determine how measures vary statistically (Kolen & Brennan, 2004).
4. One must choose one or more operational definitions of equating. Here, one makes the choice about which specific type of equating is the most appropriate.
5. One must choose one or more statistical estimation methods.
6. Finally, one must evaluate the results of equating. Kolen and Brennan (2004) offer various equating evaluation procedures.

As the above procedure suggests, test equating is rooted in the ways in which the tests themselves are constructed, and not in the statistical procedure alone (Mislevy, 1992). As Pommerich et al. (2004) note, if test scores are not considered equatable, "there is a higher probability of misuse of the linked scores because the interpretation and usage of results is less straightforward" (p. 248). Two types of error influence the interpretation of the results of equating.
The first type of error, random equating error, is unavoidable because samples are used to estimate population parameters (e.g., means and standard deviations). This type of error can be reduced, however, by using large samples of test-takers (please refer to Chapter 1 for a recommendation on minimum sample size) and by choosing carefully the type of equating design. The second type of error, systematic equating error, results from the violation of assumptions and conditions specific to the equating methodology one implements. Unlike random error, which can be quantified using standard error calculations, systematic error is more difficult to estimate (Schumacker, 2005).

Two Equating Research Designs

The most common data collection designs used for equating are the random groups design and the common-item non-equivalent groups design. The former utilises a 'spiralling' process in which alternate test-takers are administered alternate forms of the exam (Schumacker, 2005). For example, Test X may be given to the first test-taker, Test Y to the second test-taker, Test X to the third test-taker, and so forth. When random assignment of the two test forms to test-takers is used, the two groups (the Test X group and the Test Y group) are considered equivalent in proficiency. Any statistical difference between the two groups on the tests is interpreted as a difference in the test forms. For example, if the Test X group performed better overall than the Test Y group, it is assumed that Test X was easier than Test Y (Braun & Holland, 1982). Random groups designs are often preferable to single groups designs; however, much larger samples are necessary to obtain stable parameter estimates, hence limiting their use in many testing situations (Schumacker, 2005). If only one version of a test can be administered on a given date, the common-item non-equivalent groups design can be used.
In this case, the test forms are not assigned randomly to the two groups of test-takers; hence, it cannot be assumed that the two groups have the same overall proficiency. Therefore, it is necessary to identify any proficiency difference between the two groups. To do this, a subset of common test items (i.e., anchor items) is placed on both test forms. Because these common items are administered to all examinees, irrespective of group, they can be used to estimate differences in proficiency across the groups. Once the group differences have been identified, any remaining score differences can be interpreted as a difference in the difficulty of the two test forms and, hence, the scores are directly comparable (Haertel, 2004). Equating can then properly adjust for these test form differences.

Six Types of Equating

Kolen and Brennan (2004) discuss six types of equating, and the situations in which each is appropriate. The six methods - linear, equipercentile, identity, mean, two- or three-parameter logistic IRT, and Rasch equating - are reviewed in turn. It should be noted that all six types of equating can be implemented in both random groups and common-item non-equivalent groups designs.

(i) Linear equating. Linear equating, perhaps the most widely used equating function, allows for differences in difficulty between the two measures to vary along the score scale. For example, linear equating allows Test X to be more difficult than Test Y for low-achieving test-takers, but easier for high-achieving test-takers (Kolen & Brennan, 2004). Scores that are an equal (signed) distance from their respective means, in standard deviation (z-score) units, are considered equal. Thus, linear equating allows for the scale units, as well as the means, of the two measures to differ (Kolen & Brennan, 2004).
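The z-score matching rule just described corresponds to the standard textbook linear equating function, y = mu_Y + (sd_Y / sd_X)(x - mu_X). The sketch below applies it with invented moments; the means and standard deviations are illustrative assumptions, not values from any real form pair.

```python
# A sketch of the linear equating rule: scores an equal (signed) number of
# standard deviations from their form's mean are treated as equivalent, i.e.
#   y = mu_y + (sd_y / sd_x) * (x - mu_x).
def linear_equate(x, mu_x, sd_x, mu_y, sd_y):
    z = (x - mu_x) / sd_x   # standardise the score on Test X's scale
    return mu_y + sd_y * z  # re-express it on Test Y's scale

# Suppose (purely for illustration) Test X has mean 60 and SD 10,
# while Test Y has mean 65 and SD 12.
y = linear_equate(70, mu_x=60, sd_x=10, mu_y=65, sd_y=12)
print(y)  # 70 is +1 SD on Test X, so its equated Test Y value is 65 + 12 = 77.0
```

Because the ratio sd_y / sd_x rescales the units, the two forms' difficulties may differ by different amounts at different points along the score scale, which is exactly the flexibility the paragraph above attributes to linear equating.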
Two tests are considered equated when the mean and variance of the distributions of test scores are the same across tests (Bolt, 1999). In either equating research design, linear equating may be used when:

1. There are adequate control and standardisation procedures, and alternate forms are built to the same specifications;
2. There are small samples;
3. There are similar test form difficulties;
4. One desires simplicity in the conversion tables or equations, in conducting analyses, and in describing procedures to non-psychometricians; and
5. It is important to have accuracy of the results near the mean scale score.

(ii) Equipercentile equating. Equipercentile equating involves identifying scores on Test X that have the same percentile ranks as scores on Test Y (Kolen & Brennan, 2004). Equivalence is achieved when the scores from Tests X and Y are at the same quantile (points taken at regular vertical intervals from the cumulative distribution function of a random variable) of their respective distributions over the target population, rather than merely having the same z-scores (von Davier et al., 2004). In either equating research design, equipercentile equating may be used when:

1. There are adequate control and standardisation conditions, and alternate forms are built to the same specifications;
2. There are large samples;
3. The test forms can differ in difficulty more than for a linear method of equating;
4. One can tolerate complexity in the conversion tables or equations, in conducting analyses, and in describing procedures to non-psychometricians; and
5. The accuracy of results along all scale scores is important (Kolen & Brennan, 2004).

(iii) Identity equating. Identity equating is perhaps the easiest type of equating conceptually. In essence, a score on Test X is thought to be equivalent to the identical score on Test Y. For example, a score of 15 on Test X is thought to mean the same thing as a score of 15 on Test Y.
It should be noted that identity equating is thought to be the same as mean equating (described below) and linear equating if the two forms are identical in difficulty along the score scale (Kolen & Brennan, 2004). In either equating research design, identity equating can be used when:

1. There are poor quality control or standardisation conditions;
2. There are very small sample sizes, or no data at all (e.g., data are not available across the entire range of possible scores);
3. There are similar test form difficulties;
4. One desires simplicity in conversion tables and equations, in conducting analyses, and in describing procedures to non-psychometricians; and
5. One can tolerate possible inaccuracies in results.

(iv) Mean equating. In mean equating, Test X's difficulty differs from Test Y's by a constant amount along the score scale. For example, if Test X is three points easier than Test Y for higher-achieving test-takers, it is also three points easier for lower-achieving test-takers. In other words, mean equating involves the addition of a constant to all raw scores on Test X to find equated scores on Test Y (Kolen & Brennan, 2004). With mean equating, scores on the tests that are equal (signed) distances from their respective means are deemed to be equivalent (Kolen & Brennan, 2004). In either equating research design, mean equating is appropriate when:

1. There are adequate control and standardisation procedures, and alternate forms are built to the same specifications;
2. There are very small samples;
3. There are similar test form difficulties;
4. One desires simplicity in the conversion tables or equations, in conducting analyses, and in describing procedures to non-psychometricians; and
5. It is important to have accuracy of the results near the mean.

(v) Two- or three-parameter logistic IRT equating. Item response theory (IRT) deals with modelling the response of a test-taker to a test item.
Lord (1982) writes that IRT is useful for designing tests, for selecting items, for describing and evaluating items and tests, for optimal scoring of test-takers' responses, for predicting the test scores of test-takers and of groups of test-takers, and for interpreting and manipulating test scores. Each test item is described by a set of parameters that can be used to depict the relationship between an item score and some latent trait through the use of an item characteristic curve (ICC). When it is assumed that test-takers' ability is described by a single latent variable or dimension (referred to as 'theta'), two- or three-parameter logistic models can be used to equate tests whose items are scored dichotomously, such as the ubiquitous multiple-choice examination (where 0 = incorrect, 1 = correct). The use of one latent variable implies that the construct being measured by the tests is unidimensional (meaning, within the context of IRT, that the tests measure only one ability).

As is likely intuitive from the name, a three-parameter model focuses attention on three specific parameters. The a parameter (the discrimination parameter) represents the degree to which a given item discriminates between test-takers with different levels of theta. The b parameter (the difficulty or location parameter) determines how far left or right on the theta scale the ICC is positioned. With logistic models, b is often reported as representing the point on the ability scale where the probability of a correct response is 50%; more accurately, b represents the half-way point between the c parameter and 1.0. The c parameter (the lower-asymptote or pseudo-chance parameter) describes the probability that a test-taker with very low ability will correctly answer a given item.
This low end of the ability continuum is often influenced by test-takers' guessing on the given item (Hambleton, Swaminathan, & Rogers, 1991). A two-parameter model also involves these three specific parameters but, in this case, the c parameter is set to zero. For the purpose of equating, the three-parameter logistic model is favoured (Stroud, 1982), because it is the only one that "explicitly accommodates items which vary in difficulty, which vary in discrimination, and for which there is a nonzero probability of obtaining the correct answer by guessing" (Kolen & Brennan, 2004, p. 160).

As Kolen and Brennan (2004) note, when one uses IRT to equate with non-equivalent groups of test-takers, the parameters from the different tests need to be on a common IRT scale. Often, however, the parameter estimates that result from IRT are on different IRT scales. For example, imagine that the IRT parameters estimated for Test X are based on Population 1, and the parameters estimated for Test Y are based on a non-equivalent population, called Population 2. Statistical software packages often define the theta scale as having M = 0 and SD = 1 for the data being analysed. So, for this scenario, the abilities for each population would be scaled so that they both had thetas with M = 0 and SD = 1, even though the populations' respective abilities are different. For this reason, Kolen and Brennan (2004) provide details on transforming IRT scales. In either equating research design (random groups or common-item non-equivalent groups), two- or three-parameter logistic IRT equating is appropriate when:

1. There are adequate control and standardisation procedures, and alternate forms are built to the same specifications;
2. There are large samples;
3. Test forms differ in difficulty level more than for a linear method;
4.
One can tolerate complexity in the conversion tables, in parameter estimation, in conducting analyses, and in describing procedures to non-psychometricians;
5. One can tolerate a computationally-intensive item parameter estimation procedure (this problem is mitigated if the item parameter estimates are needed for other purposes - for example, test construction);
6. Accuracy of results is important all along the score scale; and
7. The IRT model assumptions hold reasonably well.

(vi) Rasch equating. The one-parameter logistic IRT model is often referred to as the Rasch model. Whereas the two- and three-parameter logistic models involve estimation of the a and b parameters and of the a, b, and c parameters, respectively, the Rasch model deals with just one item parameter: b (difficulty). In a sense, the Rasch model estimates item difficulties free of the effects of the abilities of the test-takers. Arguably the most widely used IRT model, the Rasch model is often favoured because of its simplicity (Bejar, 1983). Stroud (1982) notes that the one-parameter IRT method works best when equating a test to itself (e.g., the same test over time with different examinees), adding that it is "better than any other method when the anchor is either easier or harder than the test being equated" (p. 137). In either equating research design, Rasch equating is appropriate when:

1. There are adequate control and standardisation procedures, and alternate forms are built to the same specifications;
2. There are small samples;
3. There are similar test form difficulties;
4. One can tolerate complexity in the conversion tables, in parameter estimation, in conducting analyses, and in describing procedures to non-psychometricians;
5. Accuracy of results is important in the area that is not very far from the mean; and
6. The IRT model assumptions hold reasonably well.
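The logistic models discussed in this section share one item characteristic curve, which can be sketched directly. In the three-parameter form, P(theta) = c + (1 - c) / (1 + exp(-a(theta - b))); the Rasch model is the special case with a fixed discrimination and c = 0. The parameter values below are invented for illustration.

```python
import math

# The logistic item characteristic curve (ICC) underlying the IRT equating
# methods above. With a = 1 and c = 0 this reduces to the Rasch model.
def icc(theta, a=1.0, b=0.0, c=0.0):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Rasch: at theta == b, the probability of a correct response is exactly 0.5.
print(icc(theta=0.0, b=0.0))

# 3PL with a pseudo-chance parameter c = 0.2: at theta == b, the probability
# now sits half-way between c and 1.0, consistent with the description of the
# b parameter given earlier in this section.
print(icc(theta=0.0, a=1.2, b=0.0, c=0.2))
```

Seeing the curve as a function also clarifies why parameters estimated in different populations land on different theta scales: rescaling theta can be absorbed into a and b, so a linear transformation of the scale (as Kolen and Brennan describe) leaves the modelled probabilities unchanged.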
Two Alternative Types of Equating

In addition to the six types of equating already described, Kolen and Brennan (2004) state that two alternative types of test equating exist. These alternative types may only be applied, however, if one of two conditions is met: either when test-takers administered different scales are evaluated at the same time, or when score trends are to be evaluated over time. They caution readers, however, that these alternatives to equating are "typically unacceptable" (p. 6) for reasons elucidated below.

The first alternative is to report and to compare raw scores, irrespective of the actual form written by the test-taker. For example, imagine that a student earns a score of 27 on Test X administered in the first wave and a score of 30 on Test Y in the second wave. The obvious problem with this method is that it becomes difficult to discern if the apparent improvement in scores is attributable to differences in the two forms, to differences in the achievement of the student, or to some combination of the two (Kolen & Brennan, 2004).

The second alternative is to convert raw scores to other types of scores, such that given characteristics of the scores are transferable among administrations of the tests. For example, imagine one set of test-takers' scores on Test X are transformed to z-scores at Wave 1, and then again at Wave 2. Because each of the score distributions is purposefully set to M = 0 and SD = 1, it is impossible to track changes in the group's overall performance over time.

As Kolen and Brennan (2004) note, the problems associated with each of these two alternatives can be remedied with proper equating. Equating, they state, adjusts for differences in the difficulty of test forms, such that the test scores can be interpreted in the same way, irrespective of when the tests were administered or of the group of students who wrote each test.
Equating does not, however, adjust for differences in content. Because differences in content are inherent in the context of achievement testing (i.e., the items must change, by definition, commensurate with the grade level of the test-takers), it is clear that equating is ineffectual for the problem motivating this dissertation: analysing change and growth with time-variable measures (particularly those that cannot be linked).

Calibration

The second type of test linking is called calibration. When different tests are calibrated, they are purported to measure the same construct (e.g., mathematics achievement), but with different accuracy or in different ways (e.g., comparing the performance of students across grade levels). Similar to equating, calibration requires that the different tests all measure the same construct; however, whereas successful equating yields scale scores that can be used interchangeably for any purpose, successful calibration simply means that the results of each test are mapped to a common variable, matching up the most likely score of a given test-taker on all tests (Mislevy, 1992). Furthermore, unlike equating, the reliabilities of each test's scores may differ.

As Mislevy (1992) describes, there are three distinct cases of calibration, each having its own procedure:

1. Case 1: One should use the same content, format, and difficulty blueprint to construct tests, but with more or fewer items on each test. It is the expected percents correct that are calibrated.
2. Case 2: One should collect tests from a collection of items that fit an item response theory (IRT) model satisfactorily. Inferences from the test scores should be carried out in terms of the IRT proficiency variable. Mislevy (1992) offers a surprising example of this case: fourth- and eighth-grade geometry tests connected by an IRT scale with common items.
As has been highlighted previously, calibration has little use in the context of this dissertation, given that it is highly unlikely to find measures from different grade levels (e.g., Grade 4 and Grade 7 mathematics tests) that share any number of common items.

3. Case 3: One should collect judgements on a common, more abstractly-defined variable. The consistency of these judgements should then be verified.

Vertical scaling (vertical "equating") is a subcategory of calibration methods, and refers to the process of scaling tests with different difficulties for groups of individuals with different abilities, usually in different grades (Pomplun, Omar, & Custer, 2004). (In contrast, horizontal scaling involves equating tests of different forms, or tests administered at different times, within a single grade; Leung, 2003.) Put another way, "Vertical scales are created through administering an embedded subset of items to different students at two educational levels, typically one year apart, and linking all the items at the two levels to a common scale through the comparative performance of the two groups of students on the common items" (Schafer, 2006, p. 1). Braun and Holland (1982) describe vertical scaling as the placing of scores from tests of widely different difficulty on some sort of common metric, thus allowing the scores of test-takers at different grade levels to be compared. This common metric against which the test-takers' scores can be compared is often called a developmental scale. Because the content of each of the tests varies across groups of test-takers at different educational levels, vertical scaling cannot be used for the purpose of making the test forms themselves interchangeable (Kolen & Brennan, 2004). Leung (2003) describes an approach to vertical scaling that combines both linear and IRT equating.
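To make the common-item idea concrete, the sketch below performs a deliberately simplified, Tucker-style mean linking between two grade-level groups through an embedded anchor: the regression of total score on anchor score within each group is used to adjust the group means to a synthetic population, and the adjusted means supply a constant shift. All numbers are invented, and operational vertical scales are typically built with IRT methods rather than this mean-only adjustment.

```python
# A deliberately simplified, Tucker-style common-item linking sketch.
# Each group writes its own form plus a shared anchor; the anchor is used to
# separate group-proficiency differences from form-difficulty differences.
def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# (total score on own form, score on the anchor items) per test-taker; invented.
lower = [(30.0, 6.0), (33.0, 7.0), (36.0, 8.0), (33.0, 7.0)]
upper = [(40.0, 8.0), (43.0, 9.0), (46.0, 10.0), (43.0, 9.0)]

x, v1 = zip(*lower)
y, v2 = zip(*upper)

# Regression slopes of total score on anchor score within each group.
g1 = cov(x, v1) / cov(v1, v1)
g2 = cov(y, v2) / cov(v2, v2)

# Anchor mean in a synthetic population weighting the two groups equally.
vs = 0.5 * mean(v1) + 0.5 * mean(v2)

# Each group's total-score mean, adjusted to that synthetic population.
mu_x = mean(x) - g1 * (mean(v1) - vs)
mu_y = mean(y) - g2 * (mean(v2) - vs)

# Mean linking: place a lower-form score onto the upper form's scale.
def to_upper_scale(score):
    return score + (mu_y - mu_x)

print(to_upper_scale(33.0))
```

The anchor performance estimates how much more proficient the upper-grade group is; only the remainder of the total-score gap is treated as a form-difficulty difference, which is the logic of the common-item non-equivalent groups design described earlier in this chapter.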
At first glance, vertical scaling appears to be the solution to the very problem that has inspired this dissertation: analysing change and growth with time-variable measures. Recall, however, that differences in the measures' content, item wording, response categories, and so on are not permissible in the study of change and growth if one heeds the recommendations of Willett et al. (1998). Furthermore, Martineau (2006) warns that:

many assessment companies provide such vertical scales and claim that those scales are adequate for longitudinal [modeling]. However, psychometricians tend to agree that scales spanning wide grade/developmental ranges also span wide content ranges, and that scores cannot be considered exchangeable along the various portions of the scale (p. 35).

Martineau (2006) adds that vertical scaling can lead to "remarkable distortions" (p. 35) in the results, because "the calibration requirement that two tests measure the same thing is generally only crudely approximated with tests designed to measure achievement at different achievement levels" (Linn, 1993, p. 91). Schafer (2006, pp. 1-3) also discusses several deficiencies in vertical scales, noting that:

1. They unrealistically assume a unidimensional trait across grades;
2. Scale development includes out-of-level testing, and therefore lacks face validity;
3. Lower-grade tests that are eventually implemented will have invalid content representation for higher grades' curricula;
4. Scores for students in lower grade levels are overestimated due to lack of data about inabilities over contents at higher grade levels;
5. Average growth is uneven for different adjacent grade-level pairs;
6. Differences between achievement levels change from grade to grade;
7. Achievement-level growth is uneven for the same achievement level for different adjacent grade-level pairs;
8. Interval-level interpretations between grades are not grounded, either through norms or through criteria;
9.
Achievement-level descriptions of what students know and can do for identical scores are different for different grade levels;
10. Decreases in student scores from year to year are possible;
11. Comparable achievement-level cut-scores can be lower at a higher grade level;
12. If they come from different grades, students with the same scores have different growth expectations for the same instructional program;
13. The scale may be estimated from sparse data; and
14. The scale invites misinterpretations of comparability across grades.

For these reasons outlined by Martineau (2006) and Schafer (2006), it is clear that vertical scaling is not a viable solution to the problem of analysing growth with time-variable measures (particularly those that cannot be linked).

Statistical Moderation

The third type of test linking is called statistical moderation (scaling, anchoring). Statistical moderation involves making comparisons among scores obtained from different sources (e.g., teachers, parents) or different subject matters (e.g., English, mathematics, science), purportedly adjusting these scores so as to make them comparable. As is likely evident, the different tests do not necessarily measure the same construct and may advantage particular content areas or situations (Linn, 1993; Mislevy, 1992). This type of linking is implemented so that one may match up the distributions of the various tests' scores in real or hypothetical groups of students, thus facilitating the creation of conversion tables of comparable scores (Mislevy, 1992). As Linn (1993) notes, it is necessary for there to be some external examination or anchor measure for the adjustment of scores to be possible. Due to the time and financial constraints that researchers often face in real-life research settings, it is not always possible to have an (external) anchor measure to which to compare the scores of one's sample of test-takers.
As is likely obvious from the description of this type of test linking, statistical moderation cannot be used when one measures change and growth with time-variable measures (whether or not they can be linked), particularly because the constructs measured at each wave need not be the same.

Projection

The fourth type of test linking is called projection (prediction), which is regarded as one of the weakest forms of statistical linking (Linn, 1993). Similar to statistical moderation, the various tests do not measure the same construct. According to Mislevy (1992), after observing the scores on Test Y, one can then calculate what would be likely scores on Test X. This type of linking involves administering different measures to the same set of test-takers and then estimating the joint distribution among the scores. Projection has largely been criticised in the test linking literature. As Mislevy (1992) notes, "projection sounds precarious, and it is. The more assessments arouse different students' knowledge, skills, and attitudes, the wider the door opens for students to perform differently in different settings" (p. 63). With projection, one can neither equate nor calibrate the measures to a common metric because the time-variable measures are not purported to even measure the same construct. As such, it is evident that this type of test linking cannot be used when one measures change and growth with time-variable measures (whether or not they can be linked).

Social Moderation

The fifth type of test linking is called social moderation (concordance, consensus moderation, auditing, and verification), and is the least rigorous test linking method discussed.
As is the case with statistical moderation and projection, social moderation does not require that the construct tapped by each of the different measures is the same; rather, the scores from distinct tests that measure related, but different, constructs are linked (Pommerich & Dorans, 2004). This type of linking typically involves judges rating the performance on the distinct tests using some common framework and then interpreting the performances according to some common standard (Linn, 1993) for the purpose of determining which levels of performance across tests are to be treated as comparable (Mislevy, 1992). This common standard can be difficult to achieve, however, particularly if standards of performance vary across regions (e.g., across school districts or provinces).

As has been highlighted in the five descriptions above, only equating and calibration require that the construct measured is the same or similar across waves - thus taking statistical moderation, projection, and social moderation 'out of the running' as viable solutions to the motivating problem. Of equating and calibration, the former is regarded as the stronger and more rigorous, but neither is sufficient in the study of change and growth with time-variable measures (for reasons outlined previously and also summarised in the next section).

Summary: Selecting the Appropriate Type of Test Linking Method

In the previous sections of this chapter, five of the most common types of test linking methods were described: equating, calibration, statistical moderation, projection, and social moderation, respectively. Kolen and Brennan (2004), Linn (1993), and Mislevy (1992) compare and contrast these five test linking methods with regard to four specific test features: inferences, constructs, populations, and measure characteristics, respectively.
In this section, these features are reviewed one by one, with an eye to synthesising the information provided in earlier sections of this chapter and, hence, to assisting readers in selecting the appropriate test linking method:

1. Inferences: This test feature refers to the extent to which the scores for two tests are used to draw similar types of inferences. Put another way, this feature asks if the two tests share common measurement goals that are operationalised in scales intended to yield similar inferences (Kolen & Brennan, 2004);
2. Constructs: This test feature relates to the extent to which two tests measure the same construct. In other words, are the true scores for the two tests related functionally? In many test linking contexts, the two tests share common constructs, but they also assess unique constructs (Kolen & Brennan, 2004);
3. Populations: This test feature relates to the extent to which the two tests being linked are designed to be used with the same population. Two tests may measure essentially the same construct but are not necessarily appropriate for the same populations (Kolen & Brennan, 2004); and
4. Measure Characteristics: This test feature refers to the extent to which the two tests share common measurement characteristics or conditions. According to Kolen and Brennan (2004), these characteristics or conditions are often called facets. Such facets may include test length, test format, administration conditions, etc.

In Table 3, Kolen and Brennan (2004) summarise how each of the five test linking methods compares in terms of these four features. Linn (1993) provides a similar 'taxonomy table', which is presented in Table 4.
Kolen and Brennan (2004) acknowledge that the degrees of similarity of the test linking methods, as depicted in these tables, are sometimes ambiguous, cautioning that:

1. "context matters, and there is not a perfect mapping of the taxonomy categories and degrees of similarity" (p. 435);
2. "there is no one 'right' perspective, and uncritical acceptance of any set of linking categories is probably unwarranted" (p. 436); and
3. "the demarkation between categories can be very fuzzy, and differences are often matters of degree" (p. 436).

Table 3
Kolen and Brennan's (2004) Comparison of the Similarities of Five Test Linking Methods on Four Test Facets [16]

Test Linking Method      Inferences     Constructs     Populations    Measure Characteristics
Equating                 Same           Same           Same           Same
Calibration              Same           Same/Similar   Dissimilar     Same/Similar
Statistical moderation   Dis(similar)   Dis(similar)   Dis(similar)   Dis(similar)
Projection               Dis(similar)   Dis(similar)   Similar        Dis(similar)
Social moderation        Same           Similar        Same/Similar   Dis(similar)

Note. Test linking methods have been listed in decreasing order of statistical rigour.

[16] The contents of Tables 2 and 3 have been reworded and/or reformatted only very slightly from their original versions for clarity and consistency.

As Table 3 shows, equating requires that the inferences made from both versions of a measure are the same, that the constructs both measures are purported to tap are the same, that the populations used for each measure are the same, and that the characteristics of the measures are the same (and so on and so forth for the remainder of the linking methods represented along the vertical). This table highlights that only equating and calibration require that the construct measured over time is the same or similar across waves. Thus, statistical moderation, projection, and social moderation are not viable solutions to the motivating problem.
Of equating and calibration, only the latter allows one to use dissimilar populations - suggesting, at first glance, that it would be an appropriate strategy for dealing with time-variable measures (particularly those that cannot be linked). For reasons outlined above, however, neither equating nor calibration is sufficient in the study of change and growth with time-variable measures - primarily because equating places too many restrictions on the allowable content and psychometric properties of the measures and their respective scores, and because calibration can lead to distortions in the results [for reasons noted previously by Martineau (2006) and Schafer (2006)].

Table 4
Linn's (1993) Requirements of Different Techniques in Linking Distinct Assessments

Requirements for Assessment                                    E     C       SM1    P     SM2
1. Measure the same thing (construct)                          Yes   Yes     No     No    No
2. Equal reliability                                           Yes   No      No     No    No
3. Equal measurement precision throughout the range
   of levels of student achievement                            Yes   No      No     No    No
4. A common external examination                               No    No      Yes    No    No
5. Different conversion to go from Test X to Y than
   from Y to X                                                 No    Maybe   N/A    Yes   No
6. Different conversions for estimates for individuals
   and for group distributional characteristics                No    Yes     No     Yes   No
7. Frequent checks for stability over contexts, groups,
   and time required                                           No    Yes     Yes    Yes   Yes
8. Consensus on standards and on exemplars of
   performance                                                 No    No      No     No    Yes
9. Credible, trained judges to make results comparable         No    No      No     No    Yes

Note. E = Equating, C = Calibration, SM1 = Statistical Moderation, P = Projection, and SM2 = Social Moderation.

As Table 4 shows, equating requires that: (1) the two measures tap the same construct, (2) the measures' scores yield equal reliability, and (3) there is equal measurement precision throughout the range of levels of student achievement.
Equating does not, however, require: (4) a common external examination, (5) a different conversion to go from Test X to Y than from Y to X, (6) different conversions for estimates for individuals and for group distributional characteristics, (7) frequent checks for stability over contexts, groups, and time, (8) consensus on standards and on exemplars of performance, nor (9) credible, trained judges to make results comparable (and so on and so forth for the remainder of the linking methods represented along the horizontal). Like Table 3, Table 4 also highlights that only equating and calibration require that the construct measured over time is the same (or similar) across waves. Irrespective of the choice between the two, one still faces the need to change the item wording, response categories, etc., across waves - which Willett et al. (1998) state renders the test scores unusable for the purpose of studying change and growth.

In summary, Tables 3 and 4 provide the reader with helpful, but not definitive, descriptions of the various test linking methods (Kolen & Brennan, 2004) necessary when choosing a test linking method. Readers should also remain cognisant that, the further one departs from equating, the less statistically rigorous the study becomes. In other words, there is no free test linking lunch! As these tables and the previous sections of this chapter illustrate, none of the five test linking methods serves as a feasible solution to the problem motivating this dissertation (analysing change and growth when the measures are time-variable, particularly when the measures cannot be linked), primarily because each of the five test linking methods requires that the same or similar measures are used across waves.
Unfortunately, missing from Kolen and Brennan's (2004) and Linn's (1993) work is a precise explanation of what is meant by "same" or "similar" measures: It is unclear if they mean that there must simply be the same primary dimension or latent variable driving the students' responses across waves, or that there must be anchor items common to all versions of the measure. Because of this uncertainty, it is clear that a solution to the problem of analysing change and growth with time-variable measures (particularly those that cannot be linked) is necessary.

Chapter 4: Seven Current Strategies for Handling Time-Variable Measures

With an eye toward summarising how researchers are handling the problem of analysing change and growth with time-variable measures in real-life research settings, this chapter discusses seven strategies presented in the current literature as means of handling the motivating problem, each of which is accompanied by real-life examples. Whereas the previous chapter spoke about test linking in a general sense, this chapter focuses on the various strategies that educational researchers are using to handle the motivating problem specifically. As later sections of this chapter elucidate, many of these seven strategies can only be applied to time-variable measures that can be linked. Furthermore, the strategies that can be applied to non-linkable time-variable measures have either not been explained fully by the cited authors or have been criticised in the literature for various reasons (outlined in later sections). Therefore, the primary message of this chapter is that there is a need for more work targeted at finding a practicable solution to the problem of analysing change and growth with time-variable measures (particularly those that cannot be linked). As the reader will note, the seven strategies presented here do not necessarily adhere to the conditions laid out by Willett et al.
(1998) and von Davier et al. (2004), as described in Chapter 1. It would appear that both sets of authors would chide that, when dealing with time-variable measures (particularly those that cannot be linked), researchers are simply at an impasse: Repeated measures analyses of developmental data should altogether cease to occur, because there is no adequate way in which to equate the scores of the time-variable measures. As many educational psychologists, psychometricians, social scientists, and statisticians alike would agree, however, a 'non-solution' to this problem is hardly satisfactory and is, moreover, wholly impractical given the widespread need for longitudinal achievement assessment. As such, the seven strategies below are discussed in terms of both their underlying methodologies and data demands[17] for the purpose of showing their limitations in certain contexts - thus establishing the need for the Conover solution presented in Chapter 5. It should also be noted that not all of these seven strategies can be slotted into one of the five test linking methods described in Chapter 3.

Vertical Scaling

The first alternative for handling the problem of analysing change/growth with time-variable measures (specifically those that can be linked) is to use vertical scaling, which Chapter 3 describes as the process of scaling tests of different difficulties for groups of individuals with different abilities, usually in different grades (Pomplun et al., 2004). Braun and Holland (1982) describe vertical scaling as placing test scores from widely different difficulties on some sort of common metric, thus allowing the scores of test-takers at different grade levels to be compared (creating something akin to a developmental scale onto which the scores of test-takers can be placed across waves).
Because the content of each of the tests varies across groups of test-takers at different educational levels, vertical scaling cannot be used for the purpose of making the test forms themselves interchangeable (Kolen & Brennan, 2004). More specifically, scores from vertically-scaled levels cannot be used interchangeably because levels typically differ in content, and because an individual student would be measured with different precision at different levels (Kolen, 2001). Kolen (2001) discusses using vertical scaling to link tenth-graders' PLAN data to their scores on the ACT Assessment (a college entrance examination) in the twelfth grade and, more specifically, to determine the extent to which PLAN scores were on the same metric as ACT Assessment scale scores. Clemans (1993) also used vertical scaling, in this instance to link the California Achievement Tests (CATs) given annually to students from Grades 1 to 12. Unfortunately, these articles do not provide further explanation about the specifics of their chosen methodologies, so attempting to replicate their procedures with one's own data is impossible.

[17] Unfortunately, authors often do not report the specifics of their test linking methodologies, nor the data demands, making it challenging to provide details of these topics in this dissertation. Throughout this chapter, if a test linking method is not explained in technical detail, it is because the author of the cited paper does not provide the relevant details. Often the test linking is performed by large-scale testing companies, not by the authors themselves.

Growth Scales

The second alternative for handling the problem of analysing change/growth with time-variable measures (that can or cannot be linked) is to use growth scales.
Introduced as an alternative to vertical scaling, growth scales are designed to recognise the important role of vertically-moderated cut scores (passing scores) in many achievement testing contexts (Schafer, 2006). A cut score represents the specific point (score) on a given score scale at which scores at or above that point are considered to be in a different (better) performance category than scores below the point. Vertical moderation refers to the process of setting cut scores at any one grade such that they have consistent meaning in terms of growth from the prior grade, as well as expectations of growth to the next grade (Schafer, 2006).

Growth scales are best explained by means of two examples provided by Schafer (2006). First, the Texas Learning Index (TLI) consists of a two-digit, test-based, within-grade score that is anchored with a cut score equal to 70, "but whose other values depend on the distributional characteristics of the student scores at that grade" (Schafer, 2006, p. 4). The grade level of the test-taker (and, hence, of the test) is added before the two-digit test score to aid interpretation, so the final test score can be either three or four digits in length. In a second example, Washington State linearly transforms original, grade-specific test scores (in logit form) using the 'proficient' and 'advanced' cut points to anchor the scale. Scale scores for the other cut points appear wherever they fall under the transformation (Schafer, 2006). Schafer and Twing (2006) have combined Texas' and Washington's approaches by using grade-level tests and generating from them three- (or four-) digit scores (like Texas), but using relevant cut points to anchor the scale (like Washington). Imagine, for example, that a cut point of 40 = 'proficient' and a score of 60 = 'advanced'.
It is then possible to "transform the underlying logit scale of the test to arrive at the transformation to the scale for the full range of the underlying logit scale. If it does not transform to remain within two digits for all grades, then adjustments could be made to the arbitrary choices of 40 and 60" (Schafer, 2006, p. 4). The test-taker's grade level is then added before the two-digit score, such that a score of 440 at Grade 4 is just as 'proficient' as a score of 640 at Grade 6. For more about the development and usage of growth scales, please refer to Schafer (2006) and to Schafer and Twing (2006).

Rasch Modelling

The third alternative for handling the problem of analysing change/growth with time-variable measures (particularly those that can be linked) is to use Rasch modelling techniques. For example, Afrassa and Keeves (1999) studied the analysis and scaling of Australian mathematics achievement data cross-sectionally over a 30-year period by the use of the Rasch model. As described in Chapter 3, the Rasch model is thought to be the most robust of the item response models (Afrassa & Keeves, 1999), and identifies the dependent variable as the dichotomous response (i.e., the response score) for a particular person for a specified item or test question. The independent variables are each person's trait score (theta) and the item's difficulty level (b). Embretson and Reise (2000) note that the independent variables combine additively, and the item's difficulty is subtracted from the participant's ability score (theta). It is the relationship of this difference to item responses that is the key focus of this type of modelling.
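The defining property of the model just described is that only the difference between ability (theta) and difficulty (b) drives the response probability. A minimal sketch of the model itself (an illustration only, not Afrassa and Keeves's actual scaling procedure):

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model.

    theta is the person's ability; b is the item's difficulty.
    Only their difference, theta - b, enters the model.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When a person's ability equals the item's difficulty, the
# probability of a correct response is exactly 0.5.
print(rasch_probability(0.0, 0.0))  # 0.5

# Equal differences give equal probabilities, regardless of the
# absolute ability and difficulty values.
print(rasch_probability(2.0, 1.0) == rasch_probability(0.5, -0.5))  # True
```

The second print illustrates why Rasch scaling can place scores on a scale that is independent of the particular samples of persons and items: the response probabilities depend on ability and difficulty only through their difference.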
It should be noted that, because unidimensionality of the test scores is necessary for Rasch modelling, Afrassa and Keeves (1999) began by examining the results of confirmatory factor analyses, and concluded that the test scores were, in fact, unidimensional (and purportedly this one dimension or latent variable represented mathematics achievement). Using the Rasch model, Afrassa and Keeves (1999) were able to bring the mathematics achievement scores of each of the participants to a common scale - a scale independent of both the samples of participants tested and the samples of items used.

In another example, Jordan and Hanich (2003) used a similar strategy in their investigation of the reading and mathematics achievement and specific mathematical competencies of 74 participants tested across four waves during second and third grades. Participants' Woodcock-Johnson raw scores for each wave of data collection were transformed into Rasch-scaled scores, which were used for the longitudinal analyses to provide a common metric with equal interval properties. In yet another example, Notenboom and Reitsma (2003) used a similar strategy in their exploration of the spelling achievement of grade school students; however, rather than following one group of students longitudinally, they chose a cross-sectional design, using IRT to link the test scores of students in different grades at one point in time.

Ma and his colleagues also utilise this type of test linking in various studies. For example, using national data from the Longitudinal Study of American Youth (LSAY), Ma (2005) sought to explore changes in students' mathematics achievement over a six-year period, beginning when the participants were in the seventh grade (1987-1988) and ending when they were in the twelfth grade (1992-1993).
Ma used students' scores on the LSAY's National Assessment of Educational Progress (NAEP) test at each grade level to construct the outcome measure: the rate of growth in mathematics achievement during middle and high school. He adds, however, that "the LSAY staff calibrated test scores for each grade level through item response theory; therefore, test scores were comparable across middle school and high school years" (p. 79). Ma makes reference to this calibration in additional studies, as well (e.g., Ma & Ma, 2004; Ma & Xu, 2004). What is unclear, however, is precisely how the LSAY calibrated these scores.

Also using NAEP data, Ai (2002) investigated gender differences in mathematics achievement over time. A key feature of Ai's work is that it is one of only a rare few in the educational literature that adopts a three-level (also known as a simultaneous multilevel/individual growth model) approach: repeated measures (the Level 1 units) nested or grouped within students (the Level 2 units), who are, in turn, nested within schools (the Level 3 units). The LSAY study tracked children from two cohorts: Cohort 1 students (the older cohort) were tracked annually from Grades 10 to 12; Cohort 2 students (the younger cohort) were tracked annually from Grades 7 to 10. Ai, using the data from the younger cohort only, identified the outcome variable as the mathematics scores measured at each grade across the four waves, which were imputed using an IRT-based scale ranging from 0 to 100. Unfortunately, missing from Ai's paper is a description of this particular imputation to which he refers.

Latent Variable or Structural Equation Modelling

The fourth alternative for handling the problem of analysing change/growth with time-variable measures (that can or cannot be linked) is to use latent variable modelling or structural equation modelling.
Muthen and Khoo (1998) describe its use in their LSAY investigation of the change and growth in NAEP mathematics achievement of two cohorts of students: Cohort 1 students were tracked annually from Grades 10 to 12; Cohort 2 students were tracked annually from Grades 7 to 10. Mathematics achievement was presumed to be a function of background variables, such as gender, mother's education, and home resources.

Similar to path analysis, structural equation models (SEMs) can be used to test causal relationships specified by theoretical models. In essence, SEM incorporates information about both group and individual growth, providing a means by which to model change and growth as a factor of repeated measurement over time. In the context of this dissertation, time, then, is treated as a dimension along which individual growth varies (Ding, Davison, & Petersen, 2005). These types of models generally include the investigation of two types of variables: (1) latent variables, or unobserved variables that account for the covariances/correlations among observed variables and that, ideally, also represent the theoretical constructs that are of interest to the researcher; and (2) manifest variables, or those that are actually observed/measured by the researcher and that are used to define or to infer the latent variable or construct (Schumacker & Lomax, 2004).

Schumacker and Lomax (2004) describe four major reasons why SEM is used commonly:

1. SEM allows researchers to use and query multiple variables to better understand their construct(s) of interest;
2. SEM acknowledges to a greater extent than other methodologies the validity and reliability of observed scores (Gall et al., 1996), by taking measurement error into account explicitly, rather than treating measurement error and statistical analyses separately (Schumacker & Lomax, 2004). Furthermore, SEM involves the explicit estimation of error structures (Ding et al., 2005);
3. SEM allows for the analysis of advanced models, such as those including interaction terms and complex phenomena; and
4. SEM software (e.g., LISREL) is becoming increasingly user-friendly.

Muthen and Khoo (1998) warn researchers of two possible misuses of SEM. First, researchers may fail to check their raw data thoroughly for such features as the shape of individual growth curves and outliers. As Ding et al. (2005) note, many data sets simply do not meet the statistical assumptions or sample size requirements necessary for SEM (unfortunately, they do not clarify the specific aspects to which they refer by this statement). Second, it may be possible to have competing models that fit the means and covariances in much the same way but lead to different data interpretations, though they offer no explanation about when this situation might be encountered.

This strategy for handling the problem has also been used by various other researchers. For example, Guay, Larose, and Boivin (2004) explored children's academic self-concept, family socioeconomic status, family structure (single- versus two-parent families), and elementary school academic achievement as predictors of participants' educational attainment level in young adulthood within a ten-year longitudinal design. In another example, Rowe and Hill (1998) combined multilevel modelling with structural equation modelling to investigate educational effectiveness across two waves. Finally, Petrides, Chamorro-Premuzic, Frederickson, and Furnham (2005) used SEM methodologies in their exploration of the effects of various psychosocial variables on the scholastic achievement and behaviour of teen-aged students in Britain, though they seem to sidestep the added complexities associated with using time-variable measures by purposefully choosing to explore only one wave of otherwise longitudinal data.
Multidimensional Scaling

A fifth strategy for handling the problem of analysing change/growth with time-variable measures (particularly those that can be linked) is to use multidimensional scaling (MDS). MDS typically involves judging all possible pairs of stimuli for their similarity or dissimilarity. As Cliff (1993) notes, the model most commonly specifies that educational or psychological distance is a Euclidean distance in k-dimensional space. If distance is proportional to the judged dissimilarity, then it is possible to analyse the distances for the purpose of recovering the values of the stimuli on the underlying scales or dimensions.

Ding et al. (2005) describe the way in which they use MDS methods to analyse change in students' mathematics achievement over four waves: Grades 3, 4, 5, and 6. They begin by using MDS to obtain initial estimates of the scale values, which index the latent growth or change patterns. Once obtained, these estimates can be used to create either (1) a growth profile model, with the scale values reflecting the growth rate and the initial value representing the initial growth level; or (2) a change model, with the scale values reflecting change patterns and the intercept estimating the average score of each participant across waves (Ding et al., 2005).

Of possible interest to the reader, Ding et al. (2005) differentiate the terms "growth" and "change": the former is characterised by systematic or directional change (e.g., as one would perhaps expect with mathematics achievement over time), whereas the latter refers to linear or monotonic change. As Cliff (1993) observes, "it is more plausible to assume that the dissimilarities are only monotonically related to psychological distance, and that this monotonic relation is unknown a priori" (p. 83). MDS that assumes only such a monotonic relation is often referred to as non-metric MDS.
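The Euclidean distance model, and the monotonicity assumption that separates metric from non-metric MDS, can be illustrated with a small sketch (the stimulus coordinates are hypothetical; this is not Ding et al.'s procedure):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points in k-dimensional space,
    the distance model most commonly assumed in MDS."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def monotonically_related(dissimilarities, distances):
    """Check the non-metric MDS assumption: when stimulus pairs are
    ordered by judged dissimilarity, the modelled distances never
    decrease."""
    ordered = [d for _, d in sorted(zip(dissimilarities, distances))]
    return all(a <= b for a, b in zip(ordered, ordered[1:]))

# Hypothetical judged dissimilarities for three stimulus pairs, and the
# corresponding distances in a fitted 2-dimensional configuration.
judged = [1.0, 2.0, 3.0]
fitted = [euclidean((0, 0), (1, 0)),   # 1.0
          euclidean((0, 0), (0, 2)),   # 2.0
          euclidean((0, 0), (3, 4))]   # 5.0
print(monotonically_related(judged, fitted))  # True
```

Note that the fitted distances need not be proportional to the judged dissimilarities (here 1, 2, 5 versus 1, 2, 3); under non-metric MDS, only their ordering must agree.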
Standardising the Test Scores or Regression Results

The sixth alternative for handling the problem of analysing change/growth with time-variable measures (that can or cannot be linked) is to standardise the test scores pre-analysis or to use standardised results of the regression analyses. Goldstein (1995) suggests that some form of test score standardisation is sometimes needed before the scores can be modelled. It is common in such cases to standardise the scores of each test so that, at each occasion, they have the same population distribution. For example, researchers analysing longitudinal data on the same variable over time often begin by standardising the scores so that each wave's scores have a mean of zero and a standard deviation of one. Furthermore, educational researchers often use standardised regression coefficients, rather than raw or observed regression coefficients (Willett et al., 1998).

As an example, Flook, Repetti, and Ullman (2005) compared children's peer acceptance in the classroom to academic performance from fourth to sixth grades. Academic performance across the three grades was assessed by report card grades in two subjects: reading and mathematics. Because the grades were assigned by different teachers during the semester in which the participants were tested, the researchers standardised reading and mathematics grades within each school and cohort to M = 0 and SD = 1 (presumably within-wave). They then computed the average of the reading and mathematics grades at each wave in order to form an overall measure of academic performance for each participant during each wave.

Willett et al. (1998) describe the three popular rationales offered for standardising the test scores or regression results:

1. Enhanced test score interpretability: Placing test scores onto some common metric has some intuitive appeal, especially given that there are few educational and psychological variables that have well-accepted metrics of interpretation. Imagine, for example, that a Grade 4 version of a mathematics test has a mean score of 15 and the Grade 7 version of a mathematics test (three years later) has an observed score mean of 23. Without first standardising each test's scores, how is one to interpret this mean difference when the two measures were time-variable and designed, by definition, to be independent of one another? Unfortunately, z-transforming each test's raw scores does not necessarily mean that one is measuring the same construct in Grade 4 as in Grade 7 (e.g., a z-score of 1.0 in Grade 4 does not necessarily mean the same thing as a z-score of 1.0 in Grade 7): All z-transforming does is allow one to determine a given test-taker's performance relative to all other test-takers within the same wave; z-transforming, therefore, does not necessarily allow for across-wave comparisons.
2. Relative importance of predictors: A second rationale that researchers give for standardising test scores is that it helps in the identification of the relative importance of predictors in a regression model. More specifically, standardisation helps to eliminate difficulties encountered in comparing regression coefficients when predictors have been measured on different scales. The argument is that the predictor with the largest standardised regression coefficient is the "most important" predictor in the model (Willett et al., 1998).[18]
3. Comparison of findings across samples: A third rationale often offered for standardisation is that it facilitates the comparison of results across different samples, seemingly affording researchers the ability to investigate if other studies of the same construct detected effects of the same magnitude (Willett et al., 1998).
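Mechanically, within-wave standardisation is straightforward; a sketch with hypothetical scores (which also makes visible how any trend in the wave means is erased by construction):

```python
import statistics

def standardise_within_wave(scores):
    """Transform one wave's raw scores to z-scores (mean 0, SD 1).
    The population SD is used here; a sample SD is an equally
    common choice."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    return [(x - mean) / sd for x in scores]

# Hypothetical raw scores for the same four students at two waves.
wave_1 = [10, 12, 14, 16]
wave_2 = [20, 25, 30, 35]   # the group mean clearly grew between waves

z1 = standardise_within_wave(wave_1)
z2 = standardise_within_wave(wave_2)

# After standardising, every wave has mean 0 and SD 1 by construction,
# so the growth in the wave means no longer appears in the data.
print(abs(statistics.mean(z2)) < 1e-9, abs(statistics.pstdev(z2) - 1.0) < 1e-9)  # True True
```

Between-individual variation survives the transformation (students keep their relative standing within each wave), but, as the critique that follows makes plain, the mean trend does not.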
Despite these three seemingly appealing rationales for standardisation, this particular strategy for handling time-variable test scores is problematic for several reasons: 1. When test scores are standardised within-wave pre-analysis, there can be no expectation o f any trend i n either the mean or variance over time. If Test X ' s scores and Test Y ' s scores are transformed respectively to z-scores, then, by definition, both sets o f scores have identical measures o f central tendency, making within-person change and growth impossible to ascertain. It should be noted, however, that there can still be between-individual variation (Goldstein, 1995); 2. Standardising the outcome within-wave places constraints on its variation. For example, i f the group's individual growth curves fan out, standardising the outcome within-wave increases the amount o f outcome variation during early time periods and For a detailed description of the problems associated with this particular rationale, please refer to Thomas, Hughes, and Zumbo (1998). 18 67 diminishes outcome variation during later occasions. Therefore, the standardised growth trajectories fail to resemble those based on raw scores (Willett et al., 1998); 3. Longitudinal studies are afflicted with some degree o f attrition and drop-out. Thus, the standardising o f predictors within-waves is based on means and standard deviations that are estimated in a decreasing pool o f participants. If such attrition/drop-out is non-random, then the following samples used in the estimation o f measures o f central tendency w i l l be non-equivalent, hence making the standardised values o f the predictors unable to be compared from wave to wave (Willett et al., 1998); and 4. 
If the standard deviation of either the outcome variable or the predictor variable varies across samples, then samples with identical population parameters can yield "strikingly different standardized regression coefficients creating the erroneous impression that the results differ across studies" (Willett et al., 1998, p. 413).

Converting Raw Scores to Age (or Grade) Equivalents Pre-Analysis

A seventh strategy for handling the problem of analysing change/growth with time-variable measures is to convert the test scores to age equivalents (or mental-age equivalents) pre-analysis, and then use these age equivalents in the place of raw scores in subsequent analyses. This process involves assigning each possible raw score an age (e.g., in months or years) for which that particular score is the population mean or median (Goldstein, 1995). An age equivalent score, then, is the average score on a particular test that is earned by students of the same age. As a result of this process, it is possible to interpret a score of, for example, 57 on Test X as the average score earned by students aged 12.6 years (Gall et al., 1996). This strategy is often used in the field of special education. For example, Abbeduto and Jenssen Hagerman (1997) used mental-age equivalents in their study of the language and communication problems of individuals with Fragile X Syndrome (FXS), a genetic disorder resulting from a mutation on the X chromosome that is associated with various physical, behavioural, cognitive, and language problems. This particular strategy is ideal when test scores change smoothly with age, because the age equivalent metric is then more easily interpretable. In the United Kingdom, Plewis (2000) used a variation of this type of strategy in his investigation of reading achievement change and growth (Goldstein, 1995).
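The age-equivalent idea can be sketched in a few lines. The norm table below is entirely hypothetical (real norm tables, built from a norming sample, are far denser), and the linear interpolation between tabled scores is an illustrative assumption rather than a prescribed method:

```python
# Hypothetical norm table mapping each raw score to the age (in years)
# at which that score is the population median (Goldstein, 1995).
norm_table = {50: 11.8, 53: 12.1, 57: 12.6, 60: 13.0}  # raw score -> age

def age_equivalent(raw, table):
    """Return the age equivalent for a raw score, interpolating linearly
    between the nearest tabled scores when the exact score is absent."""
    if raw in table:
        return table[raw]
    xs = sorted(table)
    lo = max(x for x in xs if x < raw)
    hi = min(x for x in xs if x > raw)
    frac = (raw - lo) / (hi - lo)
    return table[lo] + frac * (table[hi] - table[lo])

# A score of 57 on Test X is read as the average score of students
# aged 12.6 years, as in the Gall et al. (1996) example above.
print(age_equivalent(57, norm_table))  # 12.6
```

Note that this conversion presupposes access to a norming sample, the requirement discussed later in this section.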
To establish a common metric for his four waves of reading achievement data, he began by computing the first principal component of the scales within-wave. Second, the component scores (akin to factor scores in factor analysis) were converted to z-scores. Third, each z-score was assigned a reading age equivalent score, such that the mean score at each age was the same as the mean chronological age, and the variance increased with age, as is to be expected in studies of change (Clemans, 1993; Plewis, 2000). Muijs and Reynolds (2003) adopted a similar approach in their study of the effects of student background and various teacher variables on students' longitudinal achievement in mathematics. Participants' National Foundation for Educational Research (NFER) numeracy subtest scores were collected twice a year for two school years: 1997-1998 and 1998-1999. For the purpose of the analyses, they (seemingly) converted students' raw scores into age equivalent scores based on a national sample of students in England.

Another strategy for dealing with repeated measures designs with time-variable tests is to convert the test scores to grade equivalents pre-analysis. A grade equivalent score is the average score on a test earned by students of the same grade level. The process is similar to that of converting test scores to age equivalents, but the interpretation of the scores is slightly different: a score of 57 on Test X may now be the average score of sixth-graders (Gall et al., 1996). Some authors advise researchers to avoid including equivalent scores in any descriptive or inferential statistical methods, given that equivalents often have unequal units. As a general rule, age and grade equivalents should always be presented alongside the original raw scores (Gall et al., 1996).
(For an excellent discussion of issues related to using equivalents in the place of raw scores, please refer to Zimmerman and Zumbo, 2005.) It should also be noted that this particular strategy has the disadvantage of requiring the scores of a norming group (normative sample): a large sample, ideally representative of a well-defined population, whose test scores provide a set of standards against which the scores of one's own sample can be compared (Gall et al., 1996). Unfortunately, due to time and financial constraints, it is not always possible to obtain the scores of a norming sample.

Chapter Summary

In summary, this chapter described seven strategies used in the current literature as means of handling the problem of analysing change/growth with time-variable measures (particularly those that can be linked). Not all educational research, however, can "fit" within the methodologies and data demands required of these strategies, particularly in cases where one's study involves time-variable measures that cannot be linked or where one has small sample sizes. Furthermore, it is not always possible to mimic the aforementioned methodologies in one's own research setting, given that many researchers fail to report or disclose the specific details of their linking methods. As such, the next chapter presents the Conover solution to the problem of analysing change and growth using time-variable measures, which can be implemented when none of the aforementioned strategies can be. Although this solution may be implemented in Scenario 2 (time-variable measures that can be linked) or Scenario 3 (time-variable measures that cannot be linked), the Conover solution is particularly useful for the latter scenario, especially when one considers that there is still no consensus on how to handle the problem of analysing change and growth with time-variable measures that cannot be linked.
In closing, rather than thinking of these seven strategies as being separate and distinct from one another, it may be more useful to cluster the related strategies. For example, (a) vertical scaling, (b) growth scales, (c) Rasch scaling/modelling, (d) standardising test scores or regression results, and (e) converting scores to age/grade equivalents can all be considered "linking/scaling approaches" (in the psychometric sense of the phrase), because each of these strategies is used for the purpose of putting the scores of different measures onto a common metric.

Chapter 5: The Conover Solution: A Novel Non-Parametric Solution for Analysing Change/Growth with Time-Variable Measures

The previous chapter described seven strategies used in the current literature as means of handling the problem of analysing change/growth with time-variable measures (particularly those that can be linked). Not all educational research, however, can "fit" within the methodologies and data demands required of these strategies, for reasons outlined in the previous chapter. Thus, this chapter expands upon previous rank transformation work undertaken most notably by Conover (1999) and Conover and Iman (1981), and introduces a novel solution to the motivating problem that can be used with time-variable measures whether or not they can be linked. As has been described in previous chapters, this novel solution is called the Conover solution. In the case of two-wave data, the Conover solution is called the non-parametric difference score; in the case of multi-wave data, it is called the non-parametric HLM. Both involve rank transforming (or ordering) individuals' longitudinal test scores within wave pre-analysis, and then using these rank scores in the place of raw or standardised scores in subsequent analyses. It is because the original scores are transformed into ranks that the Conover solution is non-parametric (Conover, 1999).
It should be noted that Domhof, Brunner, and Osgood's (2002) work relates tangentially to this novel solution, in that it discusses rank-based procedures for dealing with repeated measures data. Their research, however, uses rank-based procedures as a means of handling missing data, and presumes that the measure itself remains unchanged across waves. Zumbo (2005) has also considered the use of ranks in longitudinal analyses, and is credited with coining the terms "non-parametric HLM" and "non-parametric difference score".

A rank score depicts the position of a test-taker on a variable relative to the positions held by all other test-takers. Ranking, or rank transforming, refers to the process of transforming a test-taker's raw score to a rank relative to other test-takers: a one-to-one function f from the set {X1, X2, ..., XN}, the sample values, to the set {1, 2, ..., N}, the first N positive integers (Marascuilo & McSweeney, 1977; Zimmerman & Zumbo, 1993a). For example, if Student X earned a score of 12, Student Y earned a score of 13, and Student Z earned a score of 14, then the students' respective rank scores would be 1, 2, and 3 (where a rank of 1 is assigned to the test-taker with the lowest score). As Zimmerman and Zumbo (2005) remind readers, inasmuch as test-takers' scores are represented in terms of their position relative to other test-takers in the same wave, a rank score is similar to a percentile score. A percentile score is a type of rank score that represents a raw score as the percentage of test-takers in an external norming group whose scores fall below that score; unfortunately, referring to the scores of a norming sample is not always practical or possible. This issue is revisited in a later section.
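The rank transformation just described can be sketched in a few lines of code. The sketch below is a minimal pure-Python version that also handles ties by assigning midranks (the average of the tied positions), as most statistical packages do:

```python
# A minimal within-sample rank transformation (midranks for ties);
# the lowest score receives rank 1, as in the Student X/Y/Z example.
def rank_transform(scores):
    ranks = [0.0] * len(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in order[i:j + 1]:
            ranks[k] = midrank
        i = j + 1
    return ranks

# Students X, Y, and Z from the example above:
print(rank_transform([12, 13, 14]))      # [1.0, 2.0, 3.0]
print(rank_transform([12, 13, 13, 14]))  # [1.0, 2.5, 2.5, 4.0]
```

The one-to-one function f of the definition corresponds to the tie-free case; with ties, the midrank convention preserves the rank sum.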
In general, researchers can effectively use ranks in situations in which standard statistical assumptions (specifically, normality and homogeneity of variance) are not or cannot be met (Beasley & Zumbo, 2003; Zimmerman & Zumbo, 1993a), or when one's scale of measurement is ordinal rather than interval (Zimmerman & Zumbo, 1993a). Lamentably, researchers often underestimate or overlook the utility of rank scores, perhaps because introductory statistics instructors frequently isolate 'rank speak' into separate units that appear to students to be disconnected from the general flow of the course (Conover & Iman, 1981), or because textbook writers often convey the impression that parametric tests are more powerful than their non-parametric counterparts under normal theory (Zimmerman & Zumbo, 2005). (One can also assign rank scores so that the test-taker with the highest score receives a rank of 1; however, it is more intuitive for the test-taker with the highest score also to receive the highest rank value.) Zimmerman and Zumbo (1993a, 2005) note that transforming scores to ranks and using non-parametric methods often improves the validity and power of significance tests for non-normal distributions. Moreover, rank transformations often produce results similar to those of parametric tests (Zimmerman & Zumbo, 2005), and can be used in two-wave or multi-wave designs (described in more detail below). Using rank scores (or percentile scores) in the place of raw scores does not mean a 'free lunch' for the researcher, however. One primary criticism of the Conover solution is that "[d]ifferences between raw scores are not necessarily preserved by the corresponding ranks. For example, a difference between the raw scores corresponding to the 15th and the 16th ranks is not necessarily the same as the difference between the raw scores corresponding to the 61st and 62nd ranks in a collection of 500 test scores" (Zimmerman & Zumbo, 2005, p. 618).
Furthermore, for this solution to be feasible, one must first be satisfied that the construct being measured does not change over time (i.e., the scores do not vary across waves in terms of the major underlying dimension or latent variable). (A more thorough discussion of the strengths and limitations of the novel solution is reserved for Chapter 7.)

Traditional Applications of the Rank Transformation

A number of well-known non-parametric statistical procedures are rooted in rank transformations. The most common include Spearman's rho (ρ), the Mann-Whitney test, the Kruskal-Wallis test, the sign test, the Wilcoxon signed ranks test, and the Friedman test, each of which is described in turn in this section.

Spearman's rho: Arguably the most well-known use of the rank transformation is where the ubiquitous Pearson correlation, r, is applied to ranks (Conover, 1999). Whereas a perfect Pearson correlation requires both that the ordinal positions match across two variables and that the variables share a perfect linear relationship, a perfect Spearman correlation (ρ) requires only that the ordinal positions match across the samples (Cohen, 1996). Moreover, unlike the Pearson correlation, the Spearman correlation does not require a bivariate normal distribution, nor does it require that each variable follow its own normal distribution. It is for these reasons that the Spearman correlation is referred to as a "distribution-free" statistic. As Cohen (1996) observes, the only assumptions that must be met for a Spearman correlation are those for other tests of ordinal data: independent random sampling (i.e., each pair of observations should be independent of all other pairs) and continuous variables (i.e., both variables are assumed to have continuous underlying distributions). The Spearman correlation can be used in three cases. In the first case, both variables have been measured on an ordinal scale.
In the second case, one variable has been measured on an interval/ratio scale and the other on an ordinal scale. In the final case, both variables have been measured on an interval/ratio scale, but the raw scores for each variable have first been transformed into ranks. One may encounter this third case if (a) the distributional assumptions of the Pearson correlation are severely violated, (b) there are small sample sizes, or (c) the relationship is far from linear and one only wants to investigate the degree to which the relationship is monotonic (Cohen, 1996).

Mann-Whitney test: Akin to a non-parametric, distribution-free independent samples t-test, the Mann-Whitney test involves examining two random samples to see whether the populations from which they were drawn have equal means. The Mann-Whitney test is likely to be used in situations in which one faces small samples that are likely to differ in size and variability. If the populations' scores are distributed normally, the independent samples t-test is the most powerful; the Mann-Whitney test, however, can be used when the two populations' scores are not distributed normally (Conover, 1999). The Mann-Whitney test begins by pooling all participants (irrespective of subgroup) into one large group and then rank ordering the participants' dependent variable scores. Finally, the sums of the ranks for the separate subgroups are computed and compared. The Mann-Whitney test requires that two assumptions be met: that there is independent random sampling (i.e., each pair of observations should be independent of all other pairs) and that the dependent variable is a continuous/quantitative variable.

Kruskal-Wallis test: Whereas the Mann-Whitney test is akin to an independent samples t-test, the Kruskal-Wallis test is the non-parametric variation of the one-way analysis of variance (ANOVA) with three or more independent samples (Conover, 1999).
Like the Mann-Whitney test, the Kruskal-Wallis test begins by pooling all participants (irrespective of subgroup) into one large group and then rank ordering the participants' dependent variable scores. Finally, the sums of the ranks for the separate subgroups are computed and compared. The Kruskal-Wallis test has the same assumptions as the Mann-Whitney test, and is used in the same circumstances (Cohen, 1996).

Sign test: In situations in which the amount of the difference between a paired set of observations (i.e., the difference score in a repeated measures context) cannot be computed, but the direction (sign) of the difference can be, the sign test can be implemented. One may plan to have a non-quantified difference (because only the direction of change is of interest) or not (e.g., one aims to use a paired-samples t-test, but the sample size is small and the difference scores do not approximate a normal distribution) (Cohen, 1996). The sign test assigns each participant a positive sign (e.g., if the person's score improved across waves) or a negative sign (e.g., if the person's score declined across waves), and determines whether the difference in the counts of positive and negative scores across participants is likely to have occurred by chance (Conover, 1999). The sign test requires that three assumptions be met. First, the events must be dichotomous (each simple event being measured can fall into either a positive or a negative category only). Second, the events must be independent (the outcome of one trial does not influence the outcome of any other). Finally, the process must be stationary, meaning that the probabilities of each category remain the same across all trials of the experiment (Cohen, 1996).
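The sign test's logic is simple enough to sketch directly. The following is a minimal two-sided version under the null hypothesis that positive and negative changes are equally likely; ties are discarded, as is conventional, and the two-wave difference scores are hypothetical:

```python
from math import comb

def sign_test(diffs):
    """Return (n_positive, n_negative, two-sided p-value) under the
    null hypothesis that each sign is equally likely (p = 0.5)."""
    pos = sum(1 for d in diffs if d > 0)
    neg = sum(1 for d in diffs if d < 0)   # zeros (ties) are dropped
    n = pos + neg
    k = min(pos, neg)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return pos, neg, min(1.0, 2 * tail)

# Nine students improved across waves, one declined, one tied:
diffs = [2, 1, 3, 1, 2, 4, 1, 2, 5, -1, 0]
print(sign_test(diffs))  # (9, 1, ~0.021): unlikely to be chance alone
```

Only the directions of the differences enter the computation, which is exactly what makes the test usable when the size of the change cannot be quantified.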
Wilcoxon signed ranks test: If each participant in one's study has been measured on an interval or ratio scale twice (i.e., across two waves), but the difference score is not normally distributed in the population and the sample size is small, one can opt to use the Wilcoxon signed ranks test. Another distribution-free test, the Wilcoxon involves rank ordering the participants' simple difference scores while ignoring their signs (thus reflecting only absolute differences), and then summing the ranks separately for each sign (Conover, 1999).

Friedman test: The last rank-based test discussed in this section is the Friedman test, the distribution-free analogue to the one-way repeated measures ANOVA (Howell, 1995). The Friedman test is relatively easy to compute: rather than rank ordering all of the participants' difference scores with respect to each other, one need only rank order each participant's scores across waves and compare the sums of the ranks for each wave (Cohen, 1996). The Friedman test can be based on either (a) data that are ranked at the outset of data collection or (b) interval/ratio data that are transformed to ranks in situations in which the parametric assumptions required of a repeated measures ANOVA are not met (Cohen, 1996).

Introducing the Conover Solution to the Motivating Problem

As mentioned previously, there are invariably situations in which the implementation of the seven strategies for handling the problem of analysing change and growth using time-variable measures presented in Chapter 4 is not possible (as in the case of time-variable measures that cannot be linked).
When faced with such situations, readers are presented with this novel solution to the problem: rank transform (or order) individuals' longitudinal test scores within wave pre-analysis, and then use these rank scores in the place of raw or standardised scores in subsequent analyses. Put another way, this solution involves partitioning the observed data into subsets (waves), ranking each wave independently of the other waves (Conover & Iman, 1981), and then using the rank scores in the place of the original scores in subsequent parametric analyses. (Conover and Iman, 1981, refer to this type of rank transformation as RT-2; for additional methods of rank transforming, please refer to their article.) Recall from an earlier section that this dissertation's novel solution is named the Conover solution in honour of the seminal work of Conover (1999) and Conover and Iman (1981), whose research provided the groundwork and evidence for the novel solution's viability.

At first glance, the Conover solution may appear to resemble a strategy discussed by Zimmerman and Zumbo (2005), in which percentile scores are used in the place of raw or standardised scores in subsequent analyses. When one recalls, however, that percentile scores represent a given raw or standardised score on a measure as the percentage of individuals in a norming group whose scores fall below that score, the distinction between this dissertation's Conover solution and Zimmerman and Zumbo's (2005) solution becomes clear: whereas the percentile solution requires the inclusion of the scores of an (external) norming group, the Conover solution introduced in this dissertation does not. One of the major strengths of the Conover solution is its ease of use: it is often more convenient to use ranks in a parametric statistical program than it is to write a program for a non-parametric analysis (Conover & Iman, 1981).
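The core data step of the Conover solution can be sketched briefly. The two-wave scores below are hypothetical; for simplicity the sketch assumes no ties (a practical implementation would assign midranks, as in the earlier rank-transformation sketch):

```python
def within_wave_ranks(wave):
    """Rank one wave's scores; the lowest score receives rank 1.
    Assumes no tied scores in the wave."""
    ordered = sorted(wave)
    return [ordered.index(x) + 1 for x in wave]

# Hypothetical scores for four students on two unlinked tests:
wave1 = [12, 25, 18, 30]   # Test X, out of 50 points
wave2 = [40, 44, 55, 35]   # Test Y, out of 60 points

r1, r2 = within_wave_ranks(wave1), within_wave_ranks(wave2)

# Non-parametric difference scores: change in relative standing
# per student, computed on the ranks rather than the raw scores.
rank_change = [b - a for a, b in zip(r1, r2)]
print(r1)           # [1, 3, 2, 4]
print(r2)           # [2, 3, 4, 1]
print(rank_change)  # [1, 0, 2, -3]
```

The rank scores (or, in multi-wave designs, the full rank trajectories) then replace the raw scores in whatever parametric analysis follows.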
Furthermore, by rank transforming the data pre-analysis, one is able to bridge the gap between parametric and non-parametric statistical methods, thereby providing "a vehicle for presenting both the parametric and nonparametric methods in a unified manner" (Conover & Iman, 1981, p. 128). This last issue was discussed in detail in Chapter 1. (A more thorough discussion of the strengths and limitations of the novel solution is reserved for Chapter 7.) The Conover solution makes use of the ordinal nature of continuously scored data: a test-taker with a low raw score relative to the other test-takers in his or her wave will also yield a low relative rank score; similarly, a test-taker with a high test score will also yield a high rank score. (Statistical packages often assign a rank of one to each wave's smallest score. It is recommended that analysts retain this default setting, as it is logical for the lowest test score also to receive the lowest rank.) As a result, within-wave order among the students is preserved. By ranking test-takers' scores within-wave pre-analysis, it is possible to put the longitudinal test scores on a "common metric", thereby providing a standard against which test-takers' scores can be measured and compared.

How Applying the Conover Solution Changes the Research Question (Slightly)

As has been described in an earlier chapter, Chapter 6 provides a demonstration of the non-parametric HLM (the Conover solution for multi-wave data) and the non-parametric difference score (the Conover solution for two-wave data); Chapter 6 also specifies the research question being posed in each demonstration. It is important to note that, when one applies the Conover solution to the problem of analysing change/growth with time-variable measures (that can or cannot be linked), one changes slightly the research question being investigated. This change is highlighted best by way of an example. In Chapter 6's Case 1 (the multi-wave case), the research question posed is: Are there gender differences in the rank-based longitudinal reading achievement scores of a group of students? The operative phrase in this example is rank-based. As described above (and more fully in Chapter 7), the Conover solution makes use of the ordinal nature of the original scores; it must be cautioned, however, that the differences or gaps between the original scores are not necessarily preserved by the corresponding ranks. Therefore, when the Conover solution has been applied to a set of original scores (i.e., when the original scores have been transformed to ranks pre-analysis), the research question, the results, and the inferences made from the results must reflect the fact that the scores have been transformed.

Within-Wave versus Across-Wave Ranking

As mentioned in previous sections of this dissertation, the Conover solution involves ranking test-takers' raw scores within wave, rather than across waves, and then using the rank scores in the place of raw or standardised scores in subsequent analyses. The purpose of the current section is to compare and contrast within-wave and across-wave ranking, so as to elucidate why within-wave ranking is a key component of the Conover solution methodology.
Within-wave ranking involves assigning a rank score to a given test-taker's original score in a given wave (or column of the data matrix) relative to the scores of all other test-takers in the same wave. As a result, the rank scores resulting from within-wave ranking retain their wave-specificity and, hence, the temporal nature of the data collection is preserved. Such ranking allows one to track an individual student's progress, relative to the other test-takers, throughout the duration of the study. More specifically, it allows one to see how a given test-taker's rank performance, relative to the other test-takers, improves or declines across time.

In contrast, across-wave ranking, another type of rank transformation (please refer to Conover and Iman, 1981, for descriptions of additional methods of rank transformation), involves taking all of the test-takers' raw or standardised scores and aligning them vertically into one column, irrespective of the wave in which the scores were collected. Once all of the test-takers' multiple within-wave scores have been stacked, one on top of the other, into the one column, the scores are rank-ordered. Finally, the rank scores that result from this rank-ordering are put back into waves and used in the place of the original scores in subsequent parametric analyses. Across-wave ranking is an integral component of Zimmerman and Zumbo's (1993b) non-parametric repeated measures ANOVA (Zumbo & Forer, in press). Although across-wave ranking could certainly be extended to the context of change and growth with time-variable measures (that can or cannot be linked), it may in some cases be difficult to interpret the precise meaning of a given test-taker's rank score, given that ranks are assigned on a vertical column of, in essence, 'wave-less' scores.
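The contrast between the two ranking schemes can be sketched with toy data (the scores are hypothetical, and the no-ties simplification from earlier sketches applies):

```python
wave1 = [10, 20, 30]          # hypothetical Wave 1 raw scores
wave2 = [15, 25, 35]          # hypothetical Wave 2 raw scores

def ranks(values):
    """Rank scores, lowest = 1; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

# Within-wave: each column of the data matrix ranked independently
# (the Conover solution), preserving wave-specific standing.
within = ranks(wave1), ranks(wave2)

# Across-wave: stack all scores into one column, rank, then unstack.
pooled = ranks(wave1 + wave2)
across = pooled[:3], pooled[3:]

print(within)  # ([1, 2, 3], [1, 2, 3]) - standing within each wave
print(across)  # ([1, 3, 5], [2, 4, 6]) - standing in the pooled column
```

The across-wave ranks mix scores from measures with different possible ranges into a single 'wave-less' column, which is precisely the interpretive difficulty noted above.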
Imagine, for example, that the raw scores from a Wave 1 measure can range from 0 to 50 points, whereas the scores from a Wave 2 measure can range from 0 to 60 points. Because of the differences in possible score ranges inherent to the respective measures, an earned raw score of 45 does not necessarily represent the same "amount" of a given construct across measures. Nonetheless, under across-wave ranking both raw scores (Wave 1 = 45, Wave 2 = 45) receive the same rank score, even though the raw scores are, to some degree, an artefact of the range of possible scores on each measure. As such, readers are encouraged to use this method of ranking with caution when applying it to the context of change and growth (particularly with time-variable measures that can or cannot be linked), and especially if the total possible point values vary across waves.

Establishing the Viability of the Conover Solution

Such a solution for handling the problem is supported by statistical theory. Conover (1999), for example, writes:

Non-parametric tests that are equivalent to parametric tests computed on the ranks are easily computed using computer programs designed for parametric tests. Simply rank the data and use the parametric test on the ranks in situations where programs for the nonparametric tests are not readily available (p. 419).

Adding support for the viability of the Conover solution, Conover and Iman (1981) write that "least squares, forward or backward stepwise regression, or any other regression method may be applied to the ranks of the observations" (p. 27). Furthermore, they assert that rank-transformed data can be used in the place of raw or standardised scores in several situations in which satisfactory parametric procedures already exist:

• 2 or k independent samples
• paired samples
• randomised complete block design
• correlation
• regression
• discriminant analysis
• multiple comparisons, and
• cluster analysis.

Additionally, Zimmerman and Zumbo (1993a) outline how the Mann-Whitney test in its standard form (i.e., the large-sample normal approximation) is equivalent to an ordinary Student t-test (for independent samples) performed on the ranks of the observed scores, instead of the observed scores themselves. They add that:

1. "Apart from details of computation, it makes no difference whether a researcher performs a Wilcoxon test based on rank sums, or alternatively, pays no attention to W and simply performs the usual Student t test on the ranks" (p. 488);

2. "If the initial data in a research study are already in the form of ranks, it is immaterial whether one performs a t test or a Wilcoxon test" (p. 489); and

3. "For quite a few nonnormal distributions, [the] Wilcoxon-Mann-Whitney test holds a power advantage over the Student t test, both in the asymptotic limit and for small and moderate sample sizes. [This] power advantage is accounted for by reduction in the influence of outliers in conversion of measures to ranks" (p. 495).

This latter point is discussed in more detail in Chapter 7. In closing, despite traditional underestimation of or oversight about the utility of rank-based methods, the work of Conover (1999), Conover and Iman (1981), and Zimmerman and Zumbo (1993a) reminds readers of the viability of using rank scores in the place of original scores, hence providing evidence of the appropriateness and cogency of the Conover solution presented in this dissertation.

Primary Assumption of the Conover Solution: Commensurable Constructs

The primary assumption underlying the implementation of the Conover solution is that one is measuring a commensurable (similar) construct across all waves of the study. In other words, the researcher must be satisfied that the same primary dimension or latent variable is driving the test-takers' responses across waves.
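The rank-sum/U relationship that underlies Zimmerman and Zumbo's equivalence results can be sketched in a few lines (hypothetical, tie-free scores): the Wilcoxon rank-sum statistic W for one group and the Mann-Whitney U statistic are deterministic functions of each other, and both are computed from the same pooled ranks that a "t test on ranks" uses.

```python
group_a = [55, 60, 62, 70]     # hypothetical scores, group A
group_b = [48, 52, 58, 65]     # hypothetical scores, group B

# Pool both groups and assign ranks (no ties in this example).
pooled = sorted(group_a + group_b)
rank = {v: i + 1 for i, v in enumerate(pooled)}

w_a = sum(rank[v] for v in group_a)      # Wilcoxon rank sum for group A
n_a, n_b = len(group_a), len(group_b)
u_a = w_a - n_a * (n_a + 1) / 2          # Mann-Whitney U from W

# U counts the pairs (a, b) with a > b, so it can be verified directly:
u_direct = sum(1 for a in group_a for b in group_b if a > b)
print(w_a, u_a, u_direct)  # 22 12.0 12
```

A Student t-test applied to these same pooled ranks would use exactly the information summarised by W, which is the sense in which the two procedures coincide.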
Recall from an earlier section that a latent variable is any unobserved variable that accounts for the correlation among one's observed or manifest variables. Ideally, psychometricians design measures such that the latent variable that drives test-takers' responses is a representation of the construct of interest. What is an example of an inappropriate use of the Conover solution? Imagine that a testing company chooses to measure students' academic achievement across three waves, but the subject matter tested at each wave varies considerably (e.g., Wave 1 = mathematics achievement, Wave 2 = reading achievement, and Wave 3 = science achievement). Although these three constructs relate collectively to students' academic achievement, it is very likely that the latent variable driving test-takers' responses at each wave differs considerably from those of the other waves (e.g., due to the variability in subject matter across waves). In this example, implementation of the Conover solution would be inappropriate because the testing company would be attempting to compare the incomparable. The Conover solution is indeed an effective method of handling the problem of analysing change and growth with time-variable measures that are designed to assess commensurable constructs across waves; however, it is by no means a universal panacea to the problem, for reasons outlined in Chapter 7. Further discussion about what constitutes commensurability is presented in Chapter 6. In the next chapter, readers are taken through step-by-step implementations of the Conover solution by means of two comprehensive examples: Case 1 introduces the non-parametric HLM, which can be used in multi-wave research designs; Case 2 introduces the non-parametric difference score, which can be used in two-wave research designs.
Chapter 6: Two Conover Solution Case Studies: The Non-Parametric HLM and the Non-Parametric Difference Score

In the previous chapter, the Conover solution to the problem of analysing change/growth with time-variable measures was introduced. The solution is particularly useful when none of the seven strategies offered in Chapter 4 is executable (most notably when time-variable measures cannot be linked), and involves rank transforming individuals' longitudinal test scores within wave, pre-analysis, and then using these rank scores in the place of raw or standardised scores in subsequent analyses. The current chapter describes the implementation of the Conover solution using two distinct case studies of real data. It should be noted that, within the context of this dissertation, the term "case study" refers to a real-data demonstration of how to use and interpret the Conover solution wherein one is analysing change/growth with time-variable measures (particularly those that cannot be linked). In particular, this case study chapter: (a) offers a rationale for the preferred statistical software package (SPSS), (b) describes each case study's data and research objective, (c) discusses the specific variables of interest and the proposed methodology, (d) presents the statistical models/equations (for the non-parametric HLM case only), and (e) explains the resultant statistical output. The Conover solution's strengths and weaknesses are discussed in Chapter 7. It should also be noted that the steps one must follow when performing the respective analyses via either the graphical user interface (GUI) or syntax are presented in Appendix A (non-parametric HLM for multi-wave data) and Appendix B (non-parametric difference score for two-wave data). Also presented in Appendix A is a brief description of hierarchical linear modelling.

(Please note that information about the exemplar data's reliability and validity was not available because the Ministry does not disseminate item-level data; however, Lloyd, Walsh, and Shehni Yailaigh (2005) report that the coefficient alphas for the provincial population of fourth- and seventh-grade students' responses on the 2001 FSA numeracy subtests were .85 and .86, respectively.)

Choice of Statistical Software Packages

Mixed-effect modelling can be performed in several statistical packages, namely SPSS (Statistical Package for the Social Sciences), HLM (Hierarchical Linear Modelling), and MLwiN (created by the Centre for Multilevel Modelling team based at the University of Bristol, United Kingdom). Of these three packages, SPSS is used in the demonstrations. Because of its widespread use in educational and social science settings, its user-friendliness in terms of data handling, its ability to rank transform data within waves, and its ability to perform mixed-effect analyses, SPSS was deemed the suitable choice. It is important to note, however, that the Conover solution is not an SPSS solution per se; this solution could, in theory, be easily implemented using other popular statistical software packages.

Determining the Commensurability of Constructs

As is described more fully in subsequent sections, the first case study focuses on reading achievement scores collected from students in an urban school district in British Columbia across five annual waves: Grade 2 to Grade 6, inclusive. The second case study includes large-scale numeracy assessment data collected across two waves (Grades 4 and 7) by the British Columbia Ministry of Education.
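To illustrate the point that the Conover solution is not tied to SPSS, its core pre-analysis step - rank transforming scores within each wave - can be sketched in other software. The following Python sketch uses a purely hypothetical 4-student, 3-wave score matrix:

```python
# Sketch: the core pre-analysis step of the Conover solution - rank
# transforming scores WITHIN each wave - implemented outside SPSS.
# The 4-student, 3-wave raw score matrix is purely hypothetical.
import numpy as np
from scipy.stats import rankdata

# rows = students, columns = waves; the measures (and hence the raw
# score scales) may differ completely from one wave to the next
raw = np.array([
    [12., 30., 41.],
    [15., 22., 55.],
    [15., 28., 47.],
    [20., 19., 60.],
])

# Rank within wave (column by column); tied scores receive mean
# ranks, mirroring SPSS's default RANK behaviour.
ranks = np.column_stack([rankdata(raw[:, j]) for j in range(raw.shape[1])])
```

These within-wave rank scores would then replace the raw or standardised scores in all subsequent analyses, exactly as described above.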
Recall from Chapter 5 that the primary assumption underlying the implementation of the Conover solution is that one is measuring a commensurable (similar) construct across all waves of the study. Although there is no clearly-stated definition of the term in the repeated measures literature, commensurability is generally thought to mean that the same primary dimension or latent variable is driving the test-takers' responses across waves. In a sense, commensurability is analogous to comparability or similarity. The time-variable measures used in each respective case study were concluded to measure commensurable constructs across waves primarily because large-scale test developers generally construct their measures according to tables of specifications - test blueprint documents that (a) define the specific sub-domains of the construct of interest, (b) detail the specific sub-domain to which each scale item belongs, and (c) specify the proportion of scale items devoted to a specific sub-domain. In general, constructs are thought to be commensurable over time when there is parity in the tables of specifications across waves. This conceptualisation of commensurability relates to Sireci's (1998) definition of content validity: "There is consensus over the years that at least four elements of test quality define the concept of content validity: domain definition, domain relevance, domain representation, and appropriate test construction procedures" (p. 101). In the demonstration of the non-parametric difference score solution, the British Columbia Ministry of Education's (n.d.) numeracy subtest table of specifications reveals that the four numeracy sub-domains assessed on each version of the measure (number, patterns and relationships, shape and space, and statistics and probability) are defined in a consistent manner across grade levels - thus providing evidence for consistent domain definitions across waves.
Furthermore, the total proportions of items devoted to each of the four numeracy sub-domains are the same in the Grade 4 version of the numeracy subtest as in the Grade 7 version, providing evidence for consistent domain representation. The same is true for the data used to demonstrate the non-parametric HLM solution: The reading measures have all been designed to assess, in a consistent manner, four sub-domains across waves: phonetic analysis, vocabulary, comprehension, and scanning. Furthermore, extensive research has been conducted in terms of determining the various measures' psychometric properties (specifically the validity and reliability of the test scores) and, purportedly, Rasch modelling was used to equate, calibrate, and develop scale scores across forms and levels (though no specifics are provided in terms of how this modelling was undertaken) (Karlsen & Gardner, 1978-1996).

When tables of specifications have been employed. When several versions of a measure share no common items (as with time-variable measures that cannot be linked), the measures' content validity is the primary piece of evidence that one can use in determining the measures' commensurability. As stated above, it is generally thought that if there is parity in the measures' tables of specifications, then the measures are "content valid" and, in turn, commensurable.
If the measures share any number of items (e.g., Scenario 1 = exact same measure across waves; Scenario 2 = time-variable measures that can be linked), then it may be possible to conduct multi-group exploratory factor analyses (to explore how many factors there are, whether the factors are correlated, and which observed variables appear to best measure each factor) and multi-group confirmatory factor analyses (to specify the number of factors, which factors are correlated, and which observed variables measure each factor), in addition to investigations of content validity, in determining the measures' commensurability. Please refer to Schumacker and Lomax (2004) for more about these types of factor analyses.

When tables of specifications have not been employed. In instances in which time-variable measures (either linkable or non-linkable) have not been designed according to tables of specifications, it may be possible to instead define the measures' commensurability from a more general validity perspective. One such option could be to expand the usage of Campbell and Fiske's (1959) multitrait-multimethod matrix (MTMM) - an approach to assessing a measure's construct validity (the extent to which the inferences from a test's scores accurately reflect the construct that the test is purported to measure). In their seminal paper, Campbell and Fiske (1959) distinguish between two subcategories of construct validity: convergent validity, which refers to the degree to which concepts that should be related theoretically are interrelated in reality; and discriminant validity, which refers to the degree to which concepts that should not be related theoretically are, in fact, not interrelated in reality. The MTMM, simply a table of correlations, assumes one measures each of several traits (e.g., mathematics achievement) by each of several methods (e.g., a paper-and-pencil test, a direct observation, a performance measure).
In essence, a measure is purported to be "construct valid" if its scores are correlated with related traits (convergence) and uncorrelated with unrelated traits (discrimination). Extending Campbell and Fiske's (1959) concept of construct validity to the problem surrounding the current dissertation, it may be possible to conclude that one's time-variable measures are indeed commensurable if the pattern of wave-specific convergent and discriminant correlations is similar across all waves of one's study (i.e., the MTMM at Wave 1 is similar to the MTMM at Wave 2, and so forth). Given that the idea put forth in this dissertation of defining time-variable measures' commensurability from a general validity perspective is new and has not been explored in either the test linking or change/growth literatures, it most certainly requires investigation in future research.

So as to explain the rationale for choosing both a two-wave example and a multi-wave example with which to demonstrate the Conover solution, it is necessary to first elucidate the distinction between two-wave and multi-wave research designs. The next section therefore introduces the first case study (the multi-wave case). Included is a description of the selected data set, the specific variables of interest, the proposed methodology, and the statistical models/equations. The resultant statistical output is also explained.

Case 1: Non-Parametric HLM (Conover Solution for Multi-Wave Data)

Given the proliferation of criticisms of two-wave designs, research designs in which an individual is tested over multiple (i.e., three or more) occasions quickly became the new 'gold standard'. These multi-wave designs seemed to solve many of the problems associated with two-wave designs and were, hence, regarded as the 'one and only' way in which to study individual change. What if, however, time constraints, contextual factors, and/or financial limitations preclude a multi-wave design?
Should researchers abandon the study of change altogether? Although there is still no universal concurrence on the 'two-wave versus multi-wave' debate, the general consensus is that, where possible, one should indeed implement a multi-wave design (e.g., Willett et al., 1998). In situations in which multi-wave designs are not possible, however, one may choose between using simple difference scores or residualised change scores according to a decision tree offered by Zumbo (1999). Meade et al. (2005) warn, however, that if one chooses to implement a two-wave design, it is necessary to provide evidence that the two measurement occasions are equivalent psychometrically in order to ensure the valid measurement of change over two waves.

Description of the Data

The research question being posed in this particular case study is: Are there gender differences in the rank-based longitudinal reading achievement scores of a group of students? With permission from Dr. Linda Siegel at the University of British Columbia, a particular extract of longitudinal language and literacy assessment data for North Vancouver School District (District 44) students was obtained. Dr. Siegel's data source is rich: several types of demographic and assessment data were collected for the district's students across seven annual waves (kindergarten through to Grade 6, inclusive). For more about Dr. Siegel's research project, please refer to Chiappe, Siegel, and Gottardo (2002), Chiappe, Siegel, and Wade-Woolley (2002), and Lesaux and Siegel (2003).

Specific Variables of Interest and Proposed Methodology

Obtained were the raw scores on the Stanford Diagnostic Reading Test (SDRT), a standardised test of reading comprehension, for 653 children (n female = 336, n male = 317) tested across five of the seven waves: Grade 2, Grade 3, Grade 4, Grade 5, and Grade 6, inclusive.
As Figures 3 and 4 depict, the descriptive statistics for each wave of SDRT raw scores vary widely. (Recall from Chapter 1 that Kolen and Brennan (2004) state that most test linking strategies require a minimum sample size of 400 test-takers per form; using this rule of thumb, Case 1's sample size would be considered sufficient for test linking. For the purpose of this case study, students missing one or more waves of SDRT data were excluded from analyses.)

                N    Min.  Max.  Mean   SD      Skewness  Kurtosis
    grade2raw   653   12    40   36.24  3.997    -2.503     8.472
    grade3raw   653   10    45   35.99  6.183    -1.209     1.505
    grade4raw   653    8    54   41.48  7.394    -1.362     2.099
    grade5raw   653    4    54   44.46  6.473    -2.139     6.936
    grade6raw   653   11    54   41.17  8.337    -1.009      .741

    Note. Valid N (listwise) = 653; SE(skewness) = .096; SE(kurtosis) = .191.

Figure 3. Descriptive statistics for each of the five waves of SDRT raw scores collected by Siegel.

Figure 4. Histograms of the Siegel study's raw scores across five waves: Grade 2 (top left), Grade 3 (top right), Grade 4 (middle left), Grade 5 (middle right), and Grade 6 (bottom left). Note that each of the distributions is skewed negatively. [Histograms not reproduced here.]

The SDRT administration involves each child receiving a booklet, reading the short passages within the booklet, and providing responses to multiple-choice questions based on the reading within a prescribed time limit (Lesaux & Siegel, 2003). Because students' reading comprehension changes with time, the SDRT has purportedly been changed developmentally (i.e., across waves) by the test developer, Harcourt Assessment. Because this case study involves data collected over three or more waves, it is possible to conduct an HLM analysis of change.
To this end, rank-transformed test scores serve as the Level 1 outcome variable in the model. Also obtained was an encrypted student identification number that corresponds to each of the test scores (variable name = casenum), which serves as the Level 2 grouping variable, as well as each student's gender (coded female = 1, male = 0), which serves as the Level 2 predictor variable. In summary, this case study involves a two-level HLM model in which five waves of test scores at Level 1 (variable names = grade2rank, grade3rank, grade4rank, grade5rank, and grade6rank, respectively) are nested within students (the Level 2 units).

Statistical Models and Equations

When dealing with nested data, two sets of analyses are performed: unconditional and conditional. By doing so, one can then determine what improvement in the prediction of the outcome variable is made after the addition of the predictor variable(s) to the model (Singer & Willett, 2003). Unconditional HLM models (sometimes called baseline or null models) generally involve computing the proportion of variance in the outcome variable that can be explained simply by the nesting of the Level 1 outcome variable (in this case, the rank-based literacy score, Y_ij) within the Level 2 grouping units (in this case, the test-takers). Therefore, the Level 2 predictor variable, gender, has not been included in this model. The Level 1 model is a linear individual growth model, and represents the within-person (test-taker) variation. The Level 2 model expresses variation in parameters from the growth model as random effects unrelated to any test-taker-level predictors (Singer, 1998), and represents the between-person (test-taker) variation.
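Before the formal equations are presented, the within-person/between-person logic just described can be conveyed with a simple two-stage "slopes-as-outcomes" sketch: fit each student's own growth line across waves (Level 1), then summarise those person-specific intercepts and slopes by gender (Level 2). This is an illustrative approximation with simulated data, not the mixed-effect estimation actually used in the case study:

```python
# Sketch: a two-stage approximation to the two-level growth model.
# Level 1: per-student OLS growth line across waves; Level 2:
# summarise intercepts/slopes by gender. All data are simulated;
# the dissertation itself fits a mixed-effect model in SPSS.
import numpy as np

rng = np.random.default_rng(1)
n_students, n_waves = 100, 5
wave = np.arange(n_waves, dtype=float)           # 0, 1, 2, 3, 4
gender = rng.integers(0, 2, size=n_students)     # 1 = female, 0 = male

# Hypothetical longitudinal scores, one row per student: a true
# intercept and slope (both shifted upward for females) plus noise.
true_int = 50 + 10 * gender + rng.normal(0, 5, n_students)
true_slope = 1 + 0.5 * gender + rng.normal(0, 0.5, n_students)
scores = (true_int[:, None] + true_slope[:, None] * wave
          + rng.normal(0, 3, (n_students, n_waves)))

# Level 1: each student's OLS slope and intercept across waves.
slopes, intercepts = np.polyfit(wave, scores.T, deg=1)

# Level 2: do initial status and rate of change differ by gender?
int_gap = intercepts[gender == 1].mean() - intercepts[gender == 0].mean()
slope_gap = slopes[gender == 1].mean() - slopes[gender == 0].mean()
```

The HLM equations that follow estimate both levels simultaneously rather than in two stages, which is statistically more efficient, but the conceptual division of labour is the same.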
Using the notation of Bryk and Raudenbush (1992), in which each level is written as a series of separate but linked equations, the relevant models and notation are as follows (Singer, 1998; Singer & Willett, 2003):

Unconditional Model, Level 1 (Within-Person):

    Y_ij = π_0i + π_1i(wave_ij) + r_ij                                  (1)

where:
1. Y_ij = test-taker i's rank-based literacy score on measurement occasion j;
2. wave_ij = time or measurement occasion;
3. π_0i = test-taker i's true initial status (the value of the outcome when wave_ij = 0);
4. π_1i = test-taker i's true rate of change during the period under study;
5. r_ij = the portion of test-taker i's outcome that is unpredicted on occasion j (the within-person residual); and
6. r_ij ~ N(0, σ²).

Unconditional Model, Level 2 (Between-Person):

    π_0i = β_00 + u_0i
    π_1i = β_10 + u_1i                                                  (2)

where (u_0i, u_1i)' ~ N(0, T), with T = [τ_00 τ_01; τ_10 τ_11], and where:
1. π_0i = true initial status;
2. π_1i = true rate of change;
3. β_00 and β_10 = Level 2 intercepts (the population average initial status and rate of change, respectively); and
4. u_0i and u_1i = Level 2 residuals (representing those portions of initial status or rate of change that are unexplained at Level 2; in other words, they represent deviations of the individual change trajectories around their respective group average trends).

The HLM model in Equations 1 and 2 is expressed as the sum of two parts. The fixed part contains two fixed effects - for the intercept and for the effect of wave (time). The random part contains three random effects - for the intercept, the wave slope, and the within-test-taker residual (r_ij) (Singer, 1998). For a description of fixed and random effects, please refer to Appendix A.

Conditional Model, Level 1 (Within-Person):

    Y_ij = π_0i + π_1i(wave_ij) + r_ij                                  (3)

where:
1. Y_ij = test-taker i's rank-based literacy score on measurement occasion j;
2. wave_ij = time or measurement occasion;
3. π_0i = test-taker i's true initial status (the value of the outcome when wave_ij = 0);
4. π_1i = test-taker i's true rate of change during the period under study;
5. r_ij = the portion of test-taker i's outcome that is unpredicted on occasion j (the within-person residual); and
6. r_ij ~ N(0, σ²).

Conditional Model, Level 2 (Between-Person):

    π_0i = β_00 + β_01(gender_i) + u_0i
    π_1i = β_10 + β_11(gender_i) + u_1i                                 (4)

where (u_0i, u_1i)' ~ N(0, T), with T = [τ_00 τ_01; τ_10 τ_11], and where:
1. π_0i = true initial status;
2. π_1i = true rate of change;
3. gender_i = Level 2 predictor of both initial status and change;
4. β_00 and β_10 = Level 2 intercepts (the population average initial status and rate of change, respectively);
5. β_01 and β_11 = Level 2 slopes (representing the effect of gender on the change trajectories, and which provide increments or decrements to initial status and rates of change, respectively); and
6. u_0i and u_1i = Level 2 residuals (representing those portions of initial status or rate of change that are unexplained at Level 2; in other words, they represent deviations of the individual change trajectories around their respective group average trends).

Having already fit the unconditional model in Equations 1 and 2, Equations 3 and 4 involve an HLM model which explores whether or not variation in the intercepts and slopes is related to the Level 2 predictor, gender (Singer, 1998).

Hypotheses Being Tested

In this case, the null hypothesis being tested is that the males' and females' respective mean rank-based intercept scores are not significantly different from zero and are not significantly different from one another. A second null hypothesis is that the males' and females' respective mean rank-based rates of change are not significantly different from zero and are not significantly different from one another. A third null hypothesis is that there is no gender x wave interaction. The terms "intercept" and "rate of change" are discussed in more detail in a later section.

Explanation of the Statistical Output

Unconditional model. Figure 5 illustrates the output from the unconditional HLM analysis.
Mixed Model Analysis - Model Dimension (dependent variable: RANK of grade2raw):

    Fixed Effects:   Intercept (1 level, 1 parameter); wave (1 level, 1 parameter)
    Random Effects:  Intercept + wave (2 levels, unstructured covariance, 3 parameters; subject variable: casenum)
    Residual:        1 parameter
    Total:           6 parameters

Information Criteria (displayed in smaller-is-better form):

    -2 Restricted Log Likelihood           41816.625
    Akaike's Information Criterion (AIC)   41824.625
    Hurvich and Tsai's Criterion (AICC)    41824.638
    Bozdogan's Criterion (CAIC)            41852.987
    Schwarz's Bayesian Criterion (BIC)     41848.987

Type III Tests of Fixed Effects:

    Intercept:  F(1, 652.000) = 2514.582, p = .000
    wave:       F(1, 652.000) = .000, p = 1.000

Estimates of Fixed Effects:

    Intercept:  estimate = 327.0000, SE = 6.521009, df = 652.000, t = 50.146, p = .000, 95% CI [314.195287, 339.804713]
    wave:       estimate = -7E-013 (effectively zero), SE = 1.644396, df = 652.000, t = .000, p = 1.000, 95% CI [-3.228952, 3.228952]

(SPSS also notes that, as of version 11.5, the syntax rules for the RANDOM subcommand have changed, so SPSS 11 syntax may yield different results.)

Figure 5. Unconditional model output.

As Figure 5 shows, the parameter value 327.000 represents the estimate of the average intercept across test-takers (the average value of Y when wave = 0). Therefore, the average person began with a rank score of 327. The fact that this estimate is statistically significant (p = .000) simply means that this average intercept is significantly different from zero (which is not a particularly useful finding).
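It is worth noting that the intercept estimate of 327 is fixed by the design: with 653 complete cases, the mean within-wave rank is (653 + 1)/2 = 327 regardless of the underlying scores, which is one concrete way of seeing why testing the average intercept against zero is uninformative here. A quick arithmetic check (sketch, using arbitrary simulated scores):

```python
# Sketch: the mean of within-wave ranks is (n + 1)/2 no matter what
# the underlying scores are, so an average rank-based "initial
# status" of 327 for n = 653 is guaranteed by construction.
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
scores = rng.normal(size=653)        # any 653 scores whatsoever
mean_rank = rankdata(scores).mean()  # (653 + 1) / 2 = 327.0
```

For the same reason, the average within-wave rank is identical at every wave, which is consistent with the essentially zero average slope for wave seen in the output.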
Consistent with the p-value associated with wave not being statistically significant (p = 1.000), there was, on average, a nearly zero rate of change from Grade 2 (Wave 1) to Grade 6 (Wave 5) (in other words, the average slope across persons was nearly zero). This finding suggests that, on average, the test-takers in Dr. Siegel's sample did not tend to change their position over time, relative to the rest of the test-takers (Zumbo, 2005).

Conditional model. Figure 6 illustrates the output from the conditional HLM analysis. Conditional HLM models generally involve computing the proportion of variance in the outcome variable that can be explained not only by the nesting of the Level 1 scores within the Level 2 grouping units (in this case, the students), but also by the inclusion of the predictor variable(s) in the analysis. Therefore, the Level 2 predictor variable, gender, has been included in this analysis.

Mixed Model Analysis - Model Dimension (dependent variable: RANK of grade2raw):

    Fixed Effects:   Intercept, wave, gender, wave * gender (1 level, 1 parameter each)
    Random Effects:  Intercept + wave (2 levels, unstructured covariance, 3 parameters; subject variable: casenum)
    Residual:        1 parameter
    Total:           8 parameters

Information Criteria (displayed in smaller-is-better form):

    -2 Restricted Log Likelihood           41781.254
    Akaike's Information Criterion (AIC)   41789.254
    Hurvich and Tsai's Criterion (AICC)    41789.267
    Bozdogan's Criterion (CAIC)            41817.614
    Schwarz's Bayesian Criterion (BIC)     41813.614

Type III Tests of Fixed Effects:

    Intercept:      F(1, 651.000) = 1061.419, p = .000
    wave:           F(1, 651.000) = 1.097, p = .295
    gender:         F(1, 651.000) = 14.317, p = .000
    wave * gender:  F(1, 651.000) = 2.131, p = .145

Estimates of Fixed Effects:

    Intercept:      estimate = 301.8524, SE = 9.265121, df = 651.000, t = 32.579, p = .000, 95% CI [283.659238, 320.045494]
    wave:           estimate = -2.469243, SE = 2.358072, df = 651.000, t = -1.047, p = .295, 95% CI [-7.099589, 2.161103]
    gender:         estimate = 48.873229, SE = 12.916298, df = 651.000, t = 3.784, p = .000, 95% CI [23.510597, 74.235862]
    wave * gender:  estimate = 4.798856, SE = 3.287336, df = 651.000, t = 1.460, p = .145, 95% CI [-1.656205, 11.253917]

Figure 6. Conditional model output.

As Figure 6 shows, the average person began with a rank score of 302. This finding is similar to the result of the unconditional model, only now the model is controlling for the Level 2 predictor, gender. Once again, the fact that this estimate is statistically significant (p = .000) simply means that this average intercept is significantly different from zero (which is not a particularly useful finding). Figure 6 also highlights that, with the inclusion of the Level 2 predictor (gender) in the model, the p-value for wave is still not statistically significant (p = .295). This means that, on average, there is nearly zero rate of change from Grade 2 (Wave 1) to Grade 6 (Wave 5). This finding suggests that, on average, the test-takers in Dr. Siegel's sample did not tend to change their position over time, relative to the rest of the test-takers (Zumbo, 2005), which is similar to the result of the unconditional model. The coefficient for gender, 48.87, captures the relationship between initial status and this Level 2 predictor. Because there is a significant main effect of gender (p = .000), it is concluded that there is a relationship between initial status and gender.
This finding suggests that female test-takers, on average, begin with a rank score 48.87 higher than that of males (recalling that males are coded 0, females are coded 1). With respect to growth rates, the parameter estimate of 4.79 (wave x gender) indicates that individuals who differ by 1.0 with respect to gender have growth rates that differ by 4.79, though this is not a statistically significant result (p = .145) (Singer, 1998). In other words, females' rank-based longitudinal performance is, on average, better than that of males, though the gender difference in growth rates is not statistically significant.

The next section introduces the second case study (the two-wave case). Included is a description of the difference between two common indexes of change, the selected data set, the specific variables of interest, and the proposed methodology. The resultant statistical output is also explained.

Case 2: Non-Parametric Difference Score (Conover Solution for Two-Wave Data)

Two-wave designs (also known as pretest-posttest designs) are characterised by some comparison of an individual's score at the second wave of data collection to some baseline or initial measure score. The most common indexes of change involved in two-wave designs are (a) the simple difference score and (b) the residualised change score.

Simple difference score (change or gain score). As first introduced in Chapter 2, the simple difference score is the most common of all change indexes, and is calculated by simply subtracting an individual's observed score at Wave 1 from his or her observed score at Wave 2. A positive simple difference score typically indicates an increase in a given phenomenon of interest over time, whereas a negative score indicates diminishment over time.

Residualised change score. As Zumbo (1999) notes, it has been argued that simple difference scores are unfair because of their base-dependence (i.e., difference scores are correlated negatively with scores at Wave 1).
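The base-dependence argument can be illustrated with a small simulation (purely illustrative synthetic data): when both waves contain measurement error, simple difference scores tend to correlate negatively with the Wave 1 scores even when true standing is perfectly stable.

```python
# Sketch: illustrating base-dependence - simple difference scores
# (Wave 2 minus Wave 1) correlate negatively with Wave 1 scores when
# both waves contain measurement error. All data are simulated.
import numpy as np

rng = np.random.default_rng(11)
n = 2000
true_score = rng.normal(100, 10, size=n)   # stable latent standing

# Two observed waves = the same true score plus independent error.
wave1 = true_score + rng.normal(0, 5, size=n)
wave2 = true_score + rng.normal(0, 5, size=n)

simple_diff = wave2 - wave1
base_dependence = np.corrcoef(wave1, simple_diff)[0, 1]   # negative
```

Intuitively, a student whose Wave 1 score was inflated by error has nowhere to go but down, so high baselines are paired with negative apparent change.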
As such, the residualised change score was developed as an alternative to the simple difference score. Although there are different ways to create such scores, the most common residualised change score is estimated from the regression of the observed Wave 2 score on the observed Wave 1 score. In other words, the estimated Wave 2 score is subtracted from the observed Wave 2 score. The intrinsic fairness, usefulness, reliability, and validity of the two-wave research design have been debated widely for decades (Zumbo, 1999). In their seminal article, Cronbach and Furby (1970) (as cited in Zumbo, 1999) famously disparaged the use of two-wave designs, arguing that gain scores are rarely useful, no matter how they are adjusted or refined. The crux of their argument is that two-wave designs lead to inherently unreliable, inherently invalid, and inherently biased indexes of change. Their disdain for two-wave designs was so strong that they stated that researchers "who ask questions using gain scores would ordinarily be better advised to frame their questions in other ways" (Cronbach & Furby, 1970, as cited in Zumbo, 1999, p. 80). As Zumbo (1999) notes, it is somewhat puzzling that there has been such frequent avoidance of two-wave designs, given that variations of difference scores lie at the heart of various widely-used and commonly-accepted statistical tests, such as the paired samples t test.

Description of the Data

The research question being posed in this particular case study is: Are there gender differences in students' rank-based change scores, based on their performance across two waves of the Foundation Skills Assessment (FSA) numeracy subtest?
The FSA, a three-part annual assessment administered by the British Columbia Ministry of Education, is designed to measure the reading comprehension, writing, and numeracy skills of 4th- and 7th-grade students throughout British Columbia. The FSA is administered in public and in funded independent schools across the province in late April/early May of each year. Approximately 40,000 students per grade level write the FSA each year. The FSA relates to what students learn in the classrooms in two important ways. First, the FSA measures "critical skills that are part of the provincial curriculum. FSA represents broad skills that all students are expected to master. FSA only addresses skills that can be tested in a limited amount of time, using a pen-and-paper format. FSA does not measure specific subject knowledge or many of the more complex, integrated areas of learning" (British Columbia Ministry of Education, 2003, p. 20). Second, the FSA tests are designed to measure cumulative learning. This means that when, for example, 7th-grade students complete their version of the FSA, they are expected to use skills gained from kindergarten to Grade 7 (British Columbia Ministry of Education, 2003). The British Columbia Ministry of Education and the school districts use FSA results to: (a) report the results of student performance in various areas of the curriculum; (b) assist in curriculum improvement; (c) facilitate discussions on student learning; and (d) examine the performance of various student populations to determine if any require special attention. Schools use FSA data primarily to assist in the creation and modification of various school growth plans (e.g., plans for academic improvement).
It should be noted that written approval from both the University of British Columbia's Behavioural Research Ethics Board (Appendix C) and the BC Ministry of Education was received in order to conduct this particular case study.

Specific Variables of Interest and Proposed Methodology

Obtained were the standardised (scaled) numeracy subtest scores for the entire population of 41,675 students who wrote the FSA in both 1999/2000 (Wave 1, Grade 4) and 2002/2003 (Wave 2, Grade 7). (The Ministry has standardised students' FSA scores such that each wave's score distribution has M = 0 and SD = 1; for this reason, descriptive statistics and histograms are not presented in this particular case study. Willett et al. (1998) argue that standardised test scores should never be used in the place of raw scores in individual growth modelling analyses - readers are referred to their article for the specific reasons why. Recall, however, that ranks are actually being used in the place of the original test scores; thus, it is unimportant whether or not the original test scores come in the form of standardised scores. Furthermore, the Ministry of Education does not supply researchers with raw FSA scores - only standardised scores.) Of this population of students, a 10% random sample of 4097 students (n female = 2055; n male = 2042) was retained for analyses. Each student record included a unique (but arbitrarily assigned) case number and a gender flag (coded F = female, M = male).

To begin this case study, students' FSA scores were rank-transformed within wave. The (a) correlation between the Grade 4 and 7 rank scores and (b) ratio of the two standard deviations of each grade's rank scores were then computed. Zumbo (1999) writes that "one should utilize the simple difference score instead of the residualized difference if and only if ρ(X1,X2) > σX1/σX2" (p. 293).
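Zumbo's (1999) rule can be expressed directly in code, along with the residualised change score it selects for these data. The helper name below is illustrative (not from the dissertation), and the rank data used to demonstrate the residualised score are simulated; the decision itself uses the case study's reported quantities:

```python
# Sketch: Zumbo's (1999) decision rule for two-wave change indexes -
# use the simple difference score iff rho(X1, X2) > SD(X1)/SD(X2) -
# followed by computation of the residualised change score that the
# rule selects here. Helper name and rank data are illustrative.
import numpy as np

def choose_change_index(rho: float, sd1: float, sd2: float) -> str:
    """Apply Zumbo's (1999) rule to pick a two-wave change index."""
    return "simple difference" if rho > sd1 / sd2 else "residualised change"

# Case 2's reported quantities: the Grade 4/Grade 7 rank-score
# correlation and the two waves' rank-score standard deviations.
decision = choose_change_index(0.669, 1182.843, 1182.846)

# Residualised change = observed Wave 2 rank minus the Wave 2 rank
# predicted from Wave 1 by ordinary regression (simulated ranks).
rng = np.random.default_rng(5)
wave1_rank = rng.permutation(np.arange(1.0, 101.0))
wave2_rank = rng.permutation(np.arange(1.0, 101.0))
slope, intercept = np.polyfit(wave1_rank, wave2_rank, deg=1)
resid_change = wave2_rank - (intercept + slope * wave1_rank)
```

By construction, the residuals from this regression are uncorrelated with the Wave 1 scores, which is precisely what removes the base-dependence that motivates the residualised index.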
As such, because the correlation between the Grade 4 and 7 rank scores [ρ(X₁, X₂) = 0.669] was less than the ratio of the two standard deviations (1182.843/1182.846 = 0.999), it was necessary to use the residualised change score, rather than the simple difference score, as the index of change in this case study. Please refer to Appendix B for a more thorough description of how the residualised change score was computed in SPSS. In this case, the residualised change score represents an individual's Wave 2 rank score minus his or her Wave 2 rank score predicted from his or her Wave 1 rank score. The residualised change score serves as the dependent variable in the subsequent statistical analysis: an independent samples t-test (for gender). By implementing an independent samples t-test, it is possible to explore whether female and male students' longitudinal performance on the FSA differs statistically. [Footnote: Recall from Chapter 1 that Kolen and Brennan (2004) state that most test linking strategies require a minimum sample size of 400 test-takers per form. It is worth noting that, using this rule of thumb, Case 1's sample size would be considered sufficient for test linking.] [Footnote: For the purpose of this case study, students missing one wave of FSA data were excluded from analyses.]

Hypotheses Being Tested

In this case, the null hypothesis being tested is that the males' mean rank-based residualised change score is not significantly different from that of the female test-takers. In contrast, the alternative hypothesis is that the males' mean rank-based residualised change score differs significantly from that of the female test-takers. A two-tailed hypothesis was chosen because there was no strong directional expectation.

Explanation of the Statistical Output

The output from the independent samples t-test showed that the mean residualised change score for males was -7.6467 (SD = 882.31), as opposed to 7.5983 for females (SD = 876.71).
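A hedged sketch of the residualised change score and the gender t-test just described (simulated data and a hypothetical gender flag; the dissertation's own computation was done in SPSS, per Appendix B):

```python
# Residualised change = Wave 2 rank minus the Wave 2 rank predicted from
# the Wave 1 rank by ordinary least squares; an independent samples t-test
# then compares the two groups. Simulated data, for illustration only.
import numpy as np
from scipy.stats import rankdata, ttest_ind

rng = np.random.default_rng(1)
n = 400
wave1 = rng.normal(size=n)
wave2 = 0.7 * wave1 + rng.normal(size=n)
female = rng.integers(0, 2, size=n).astype(bool)   # hypothetical flag

r1, r2 = rankdata(wave1), rankdata(wave2)

# Regress Wave 2 ranks on Wave 1 ranks; the residuals are the change index
slope, intercept = np.polyfit(r1, r2, 1)
residualised_change = r2 - (intercept + slope * r1)

t_stat, p_value = ttest_ind(residualised_change[female],
                            residualised_change[~female])
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```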
So, the average Wave 2 rank score, less the Wave 2 rank score predicted from the Wave 1 rank score, is higher for females than for males. This means that females gained, on average, approximately 7.6 points in relative standing, whereas males' relative standing decreased, on average, approximately 7.6 points (Zumbo, 2005). Despite the mean differences in residualised change scores for males and females, the independent samples t-test results showed that there is no statistically significant gender difference in the residualised change scores, t(4095) = -.555, p = .579 (assuming equal variances; two-tailed). Thus, males' mean rank-based residualised change score did not differ significantly from that of the female test-takers. Put another way, males' and females' relative standing over time did not differ significantly. Even though no statistically significant gender difference was found, an effect size should still be computed, for reasons outlined by Zumbo and Hubley (1998). [Footnote: No effect size was calculated for the non-parametric HLM case, because there is still no consensus on how it should be calculated for hierarchical models, particularly when repeated measures are nested within test-takers. Some researchers are investigating the use of the intra-class correlation for this purpose (e.g., Singer & Willett, 2003).] In this case, the Cohen's d effect size was calculated by subtracting the mean residualised change score of one group (females) from that of the other group (males) and dividing that difference by the pooled standard deviation. The resultant effect size equals 0.02, which represents a small effect (Cohen, 1988).

Chapter Summary

In summary, this chapter illustrated the particulars of the Conover solution to the problem of analysing change and growth with time-variable measures, which is particularly useful when one's measures cannot be linked.
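The effect size reported for the two-wave case can be checked from the summary statistics alone; the group sizes below are those of the 10% sample, and the sign merely reflects which group's mean is subtracted from which:

```python
# Cohen's d from the reported group means, SDs, and sizes:
# d = (mean_f - mean_m) / pooled SD; this rounds to the 0.02 reported.
import math

mean_f, sd_f, n_f = 7.5983, 876.71, 2055
mean_m, sd_m, n_m = -7.6467, 882.31, 2042

pooled_sd = math.sqrt(((n_f - 1) * sd_f**2 + (n_m - 1) * sd_m**2)
                      / (n_f + n_m - 2))
d = (mean_f - mean_m) / pooled_sd
print(f"pooled SD = {pooled_sd:.2f}, d = {d:.3f}")
```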
The two case studies (the multi-wave case and the two-wave case) offered detailed descriptions of the data, the specific variables of interest, the proposed methodology, and the statistical models and equations, and explained the resultant statistical output. In the next chapter, the Conover solution's strengths and limitations are discussed, and readers are offered suggestions for future studies rooted in analysing change and growth with time-variable measures (particularly those that cannot be linked).

Chapter 7: Discussion and Conclusions

As was illustrated earlier by way of examples, repeated measures analyses are typically used in three scenarios: In Scenario 1, the exact same measure is used and re-used across waves. In Scenario 2, most of the measures' content changes across waves, typically commensurate with the age and experiences of the test-takers, but the measures retain one or more linkable items across waves. In Scenario 3, the measures vary completely across waves (i.e., there are no linkable items), or there are small sample sizes, or there is no norming group. Because Scenarios 2 and 3 are common in educational and social science research settings, it was vital to explore more fully this particular problem: analysing change and growth within the contexts presented in Scenario 2 (time-variable measures that can be linked) and Scenario 3 (time-variable measures that cannot be linked), both of which are characterised by the measures changing across waves. This dissertation devoted specific attention to the latter of the two scenarios, given that this particular scenario has gone relatively unaddressed in the test linking and change/growth literatures.

Summary of the Preceding Chapters

This dissertation had two objectives (and novel contributions):

Objective 1: To weave together the test linking and change/growth literatures.
The first objective was to weave together, or to bridge the gap between, the test linking and change/growth literatures in a comprehensive manner. Until now, the two literatures have either been largely disconnected or, at most, woven together in such a manner as to situate the motivating problem primarily around vertical scaling techniques. As was described in Chapter 1, the gap between the test linking and change/growth literatures causes several problems. Therefore, it is only by bridging the gap between the two literatures that one can analyse change/growth with time-variable measures in a rigorous fashion.

In Chapter 3, readers were provided an overview of the five broad types of test linking (equating, calibration, statistical moderation, projection, and social moderation; Kolen & Brennan, 2004; Linn, 1993) in order to highlight the methods that currently exist for linking or connecting the scores of one measure to those of another. As was elucidated in that chapter, none of these five test linking methods proved to be a suitable solution to the problem of analysing change/growth with time-variable measures that cannot be linked. Consequently, these five methods were presented for the purpose of providing readers with a brief overview of test linking, and with specific reasons why each is insufficient in the study of change and growth with time-variable measures.

With the aim of ascertaining how researchers are handling the problem of analysing change and growth with time-variable measures in real-life research settings, Chapter 4 offered a discussion of seven test linking strategies currently being used in the change and growth literature as means of handling the problem (particularly when the measures can be linked).
These seven strategies were: (1) vertical scaling, (2) growth scales, (3) Rasch modelling, (4) latent variable or structural equation modelling, (5) multidimensional scaling, (6) standardising the test scores or regression results, and (7) converting raw scores to age- or grade-equivalents pre-analysis. Each strategy was presented, where possible, with examples from real-life research settings. Whereas Chapter 3 spoke about test linking in a general sense, this chapter focussed on the various strategies that educational researchers are using to handle the motivating problem specifically. For reasons outlined in that chapter, none of these seven strategies proved to be an adequate solution to the problem of analysing change and growth with time-variable measures (particularly those that cannot be linked). By highlighting each test linking method's shortcomings in terms of the motivating problem, the rationale for this dissertation's Conover solution to the problem of analysing change/growth with time-variable measures (that cannot be linked) was established.

Objective 2: To introduce a novel solution to the problem.

Given the limitations of the test linking strategies presented in Chapters 3 and 4 (in terms of being viable solutions to the motivating problem), the second objective of the dissertation was to introduce a solution to the problem of handling time-variable measures that cannot be linked. As Chapters 1 and 4 described, many of the strategies currently being used in the change/growth literature as means of handling the motivating problem cannot be used by researchers in everyday research settings, often because they do not have the means to use large sample sizes or item pools. Moreover, many of the strategies presented in Chapter 4 require the presence of common items across measures which, as discussed previously, is not always feasible (or warranted).
Therefore, this dissertation introduced a workable solution that can be implemented easily in everyday research settings, and one that is particularly useful when one's measures cannot be linked. This solution was coined the Conover solution. Two case studies demonstrated the application of the Conover solution: the multi-wave case, involving Dr. Siegel's literacy data, illustrated the non-parametric HLM solution; the two-wave case, involving Foundation Skills Assessment data, described the non-parametric difference score solution. In the next sections of this chapter, some of the Conover solution's strengths and limitations are discussed.

Strengths of the Conover Solution

The Conover solution, offered as a means of handling the problem of analysing change and growth with time-variable measures (particularly those that cannot be linked), has several strengths, each of which is described below.

Ease of use. The first strength relates to the ease of implementation of the Conover solution. As Conover and Iman (1981) note, it is often more convenient to use ranks in a parametric statistical program than it is to write a program for a non-parametric analysis. Furthermore, all of the steps required for the implementation of the Conover solution (i.e., rank-transforming data within waves, conducting independent samples t-tests and mixed-effect analyses, restructuring the data matrix, etc.) can be easily performed using commonly used statistical software packages, such as SPSS. Given the widespread use of such software packages in educational and social science research settings, the Conover solution is particularly appealing from a practical standpoint.

Bridges the parametric/non-parametric gap.
Second, by rank-transforming the data pre-analysis, one is able to bridge the gap between parametric and non-parametric statistical methods, thereby providing "a vehicle for presenting both the parametric and nonparametric methods in a unified manner" (Conover & Iman, 1981, p. 128). Unfortunately, introductory statistics courses and textbooks very often treat the two methods as if they are completely distinct from one another when, in fact, there can be great strength in marrying or bridging the two. This particular strength was also discussed in Chapters 1 and 5.

Makes use of the ordinal nature of data. Third, the Conover solution makes use of the ordinal nature of continuous-scored data: a test-taker with a low raw score relative to other test-takers in his or her wave will also yield a low relative rank score; similarly, a test-taker with a high test score will also yield a high rank score. As a result, within-wave order among the students is preserved. By ranking test-takers' scores within wave pre-analysis, it is possible to put the longitudinal test scores on a "common metric", thereby providing a standard against which test-takers' scores can be measured and compared.

Makes fewer assumptions, may improve power, and may mitigate the effect of outliers. A fourth strength of using ranks in the place of raw or standardised scores in subsequent analyses is that doing so requires neither multivariate normality nor that each variable follow its own normal distribution. It is for this reason that rank-based methods are often referred to as "distribution-free". In his widely cited article, Micceri (1989) investigated the distributional characteristics of 440 large-sample achievement and psychometric measures' scores. He found that all 440 measures' scores were significantly non-normally distributed (p < 0.01).
As a result, "the underlying tenets of normality-assuming statistics appear fallacious for these commonly used types of data" (Micceri, 1989, p. 156). By utilising ranks in the place of raw or standardised scores in subsequent analyses, there may be an improvement in statistical power and a mitigation of the effects of outliers. [Footnote: Statistical packages often assign a rank of one (1) to each wave's smallest score. It is recommended that analysts retain this default setting, as it is logical to think that the lowest test score should also have the lowest rank.] Zimmerman and Zumbo (1993a) state:

For quite a few nonnormal distributions, [the] Wilcoxon-Mann-Whitney test holds a power advantage over the Student t test, both in the asymptotic limit and for small and moderate sample sizes. [This] power advantage is accounted for by reduction in the influence of outliers in conversion of measures to ranks (p. 495).

Furthermore, Micceri (1989) writes that:

Robustness of efficiency (power or beta) studies suggest that competitive tests such as the Wilcoxon rank sum exhibit considerable power advantages while retaining equivalent robustness of alpha in a variety of situations (p. 157).

In summary, the work of Micceri (1989) and Zimmerman and Zumbo (1993a) indicates that normality-assuming statistics may be relatively non-robust in non-normal contexts. As a result of the possible imprecision of statistical procedures dependent on the normality assumption, a fourth strength of the Conover solution is that it makes fewer assumptions about the distribution of the data, may improve power, and may mitigate the effect of outliers in certain contexts.

Requires no common/linkable items.
Unlike many of the test linking methods and strategies described in earlier chapters, the Conover solution can be implemented not only in situations in which one's study involves time-variable measures that can be linked, but also in situations in which the time-variable measures share no linkable items whatsoever. Hence, unlike vertical scaling, equating, and their linking counterparts, the Conover solution provides a means by which researchers can study change and growth whether or not the measures contain linkable items. It is anticipated that this particular feature of the Conover solution will likely prove appealing to researchers studying constructs thought to change developmentally (e.g., achievement). [Footnote: For detail about what can be considered an outlier or "wild observation", please refer to Kruskal (1960).]

Requires no norming group. Recall from an earlier section that a norming group (normative sample) is a large sample, ideally representative of a well-defined population, whose test scores provide a set of standards against which the scores of one's sample can be compared (Gall et al., 1996). Due to time and financial constraints, it is not always possible to compare the scores of one's sample to those of an external norming sample. As such, a sixth strength of the Conover solution is that it can be conducted using simply the scores of the immediate sample of test-takers.

Limitations of the Conover Solution

As with any methodological tool, the Conover solution has various limitations, each of which is described in this section.

Within-wave ranks are bounded. Recall from an earlier section that rank transforming refers to the process of converting a test-taker's original score to a rank relative to other test-takers, implying a one-to-one function f from the set {X₁, X₂, ..., X_N} (the sample values) to the set {1, 2, ..., N} (the first N positive integers) (Zimmerman & Zumbo, 1993a).
The value assigned by the function to each sample value in its domain is the number of sample values having lesser or equal magnitude. Consequently, the rank scores are bounded from above by N. As a result, "any outliers among the original sample values are not represented by deviant values in the rank" (Zimmerman & Zumbo, 1993a, p. 487). Imagine that, on a standardised test of intelligence, Student W earns a score of 100, Student X earns a score of 101, Student Y earns a score of 102, and Student Z earns a score of 167. Student Z's score, relative to the other test-takers, is exceptional. Despite her exceptional performance on the measure, her test score is masked by the application of ranks: Student W = 1, Student X = 2, Student Y = 3, and Student Z = 4. As a result, one limitation of the Conover solution is that there may be problems associated with the inherent restriction of range it places on data. Differences between any two ranks range between 1 and N - 1, whereas differences between original sample values range between 0 and infinity (Zimmerman & Zumbo, 1993a).

Difficulties associated with handling missing data. Recall from the two case studies (Chapter 6) that only those test-takers for whom data were available at each and every wave were retained in the analyses. As most educational and social science researchers will note, no discussion about change and growth is complete without a complementary discussion about one unavoidable problem: missing data. In longitudinal designs, particularly those that span months or years, it is extremely common to face problems associated with participant dropout and attrition, as well as with participants who join, or return to, the study in later waves. The complexity (even messiness!) of many longitudinally collected data sets can have serious implications for growth analyses.
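The Student W-Z example above can be reproduced in a few lines; SciPy's `rankdata` stands in here for whatever ranking routine one's statistical package provides:

```python
# Ranking preserves within-wave order but masks the size of the gaps:
# Student Z's 65-point lead over Student Y collapses to a rank gap of 1.
from scipy.stats import rankdata

scores = [100, 101, 102, 167]            # Students W, X, Y, Z
ranks = rankdata(scores)                 # smallest score gets rank 1

print(dict(zip("WXYZ", ranks)))
print("raw gap:", scores[3] - scores[2], "rank gap:", ranks[3] - ranks[2])
```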
Singer and Willett (2003) caution readers that, when fitting a growth model:

You implicitly assume that each person's observed records are a random sample of data from his or her true growth trajectory. If your design is sound, and everyone is assessed on every planned occasion, your observed data will meet this assumption. If one or more individuals are not assessed on one or more occasions, your observed data may not meet this assumption. In this case, your parameter estimates may be biased and your generalizations incorrect (p. 157).

The degree to which one's generalisations are incorrect is dependent on the degree and type of missing data. Although there is no general consensus on the categories of data 'missingness', Schumaker and Lomax (2004) outline three ways in which missing data can arise:

1. Missing completely at random (MCAR), which implies that data are missing in a manner statistically unrelated to the values that would have been observed. Put another way, the observed values are a random sample of all the values that could have been observed (ideally), had there been no missing data (Singer & Willett, 2003);

2. Missing at random (MAR), in which data values are missing conditional on other variables or a stratifying variable; and

3. Non-ignorable missing data, which implies probabilistic information about the values that would have been observed.

One possible strategy for circumventing, or at least mitigating the effect of, missing data is to impute the missing raw or standardised scores prior to rank-transforming the data within wave pre-analysis. Because a detailed discussion of the various imputation methods is beyond the scope of this dissertation, and because missing data discussion is largely case-dependent, this dissertation merely provides a general description of four methods of imputing missing data (Schumaker & Lomax, 2004): 1.
Mean substitution, which involves substituting the variable's mean for missing values; 2. Regression imputation, in which the missing value is substituted with a predicted value; 3. Maximum likelihood imputation, which involves finding the expected value for a missing datum based on maximum likelihood parameter estimation; and 4. Matching response pattern imputation, in which variables with missing data are matched to variables with complete data to determine a missing value.

Makes use of the ordinal nature of data. Recall that the Conover solution's use of the ordinal nature of continuous-scored data was previously identified as one of the solution's strengths. Unfortunately, precisely what the Conover solution wins by, it also loses by: because of the rank transformation of the raw or standardised scores, "differences between raw scores are not necessarily preserved by the corresponding ranks. For example, a difference between the raw scores corresponding to the 15th and the 16th ranks is not necessarily the same as the difference between the raw scores corresponding to the 61st and 62nd ranks in a collection of 500 test scores" (Zimmerman & Zumbo, 2005, p. 618).

Suggestions for Future Research

In this section, various suggestions for future research are presented. First, two of the test linking methods referred to in Chapter 3, concordance through scaling (i.e., social moderation) and projection, have been used by Cascallar and Dorans (2005) to link the scores of separate tests. They use these two methods specifically to link scores from different tests of similar content but of different languages, not to link scores from time-variable achievement tests; however, it is conceivable that these two solutions could be extended to the problem of analysing change and growth with time-variable scales that can or cannot be linked.
The first linking method, concordance through scaling, is used to establish an equating relationship between tests: (1) that are non-exchangeable (in the context of the current problem, non-exchangeability would mean that a fourth-grade student would likely not be able to complete the seventh-grade version of a mathematics test; conversely, a seventh-grader would find writing the fourth-grade version of a mathematics test too easy), but (2) whose content is highly related (e.g., fourth-grade and seventh-grade mathematics tests that are both heavy on statistics and probability concepts).

The second linking method, projection, is merely concerned with minimising imprecision in the predictions of one score from one or more scores, thus "doing the best job possible of predicting one set of scores from another" (Cascallar & Dorans, 2005, p. 343). It involves observing the scores on Test X, then predicting what would be likely scores on Test Y. This type of linking involves administering different measures to the same set of test-takers and then estimating the joint distribution among the scores. For example, seventh-graders' scores on a mathematics test could be estimated from their respective scores on earlier versions of the time-variable tests administered in each prior school year. Recall that projection is often regarded as the weakest form of statistical linking (Linn, 1993). It is suggested that future research examine the utility of concordance through scaling and projection in the context of change and growth, particularly when the measures are time-variable and cannot be linked.

Second, recall from an earlier section that a limitation of the Conover solution is that it is not designed to handle missing data. For this reason, various imputation methods are necessary to fill in the 'gaps' left by the missing data.
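A minimal sketch of the first two imputation methods listed earlier in this chapter (mean substitution and regression imputation), applied before any within-wave ranking; the data are simulated, and nothing here should be read as an endorsement of either method:

```python
# Impute missing Wave 2 scores before rank-transforming within wave.
# Method 1: mean substitution; Method 2: regression imputation from Wave 1.
import numpy as np

rng = np.random.default_rng(3)
wave1 = rng.normal(size=100)
wave2 = 0.7 * wave1 + rng.normal(scale=0.5, size=100)
wave2[::10] = np.nan                      # 10 of 100 values go missing

missing = np.isnan(wave2)

# 1. Mean substitution: fill with the observed Wave 2 mean
mean_imputed = np.where(missing, np.nanmean(wave2), wave2)

# 2. Regression imputation: predict missing Wave 2 values from Wave 1
slope, intercept = np.polyfit(wave1[~missing], wave2[~missing], 1)
reg_imputed = np.where(missing, intercept + slope * wave1, wave2)

print(f"imputed {int(missing.sum())} of {wave2.size} values")
```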
It is suggested that researchers try imputing missing data in the four different ways presented in this dissertation (mean substitution, regression imputation, maximum likelihood imputation, and matching response pattern imputation) to see the extent to which the results, and/or the inferences made about the results, vary across the methods of imputation.

Third, as was described in Chapter 6, it may be possible to conclude that one's time-variable measures are indeed commensurable if the pattern of wave-specific convergent and discriminant correlations (Campbell & Fiske, 1959) is similar across all waves of one's study (i.e., the MTMM at Wave 1 is similar to the MTMM at Wave 2, and so forth). Given that this idea is new and has not been explored in either the test linking or change/growth literatures, it most certainly requires investigation in future research.

Finally, very little research has explored the issue of sample size in non-parametric-based repeated measures designs, in part because the methodology is so new (e.g., Zimmerman & Zumbo, 1993b, and the current dissertation). Although there are general recommendations about what is considered a "minimally sufficient" sample size for general test linking strategies, these recommendations are typically situated around item response theory requirements (400 test-takers per measure; see Chapter 1). As such, a final suggestion for future research is to explore the results, and the inferences made from the results, of non-parametric HLM and non-parametric difference score analyses conducted on samples of various sizes.

Conclusions

There are two primary reasons why the problem of analysing change and growth with time-variable measures was investigated in this dissertation. First, as Willett et al. (1998) and von Davier et al. (2004) describe, the rules about which tests are permissible for repeated measures designs are precise and strict.
Given these conditions, it was necessary to investigate if and how repeated measures designs are possible, speaking both psychometrically and practically, when the measures themselves must change across waves. Second, given the substantial growth in longitudinal large-scale achievement testing in the past decade, it was (and is) necessary to find a viable and coherent solution to the problem so that researchers, educational organisations, policy makers, and testing companies can make the most accurate inferences possible about their test scores. In this dissertation, readers were introduced to a novel solution for handling the problem of analysing change and growth with time-variable measures (particularly those that cannot be linked). It should, however, be stressed again that the Conover solution is by no means a universal panacea. It is imperative that educational and social science researchers continue to carry the torch in terms of exploring the various ways (and contexts) in which the motivating problem can be solved. Linn (1993) notes that considering any one individual method as the ultimate solution to the problem of linking test scores is fundamentally unsound because:

The sense in which the scores for individual students can be said to be comparable to each other or to a fixed standard depends fundamentally on the similarity of the assessment tasks, the conditions of administration, and their cognitive demands. The strongest inferences that assume the interchangeability of scores demand high degrees of similarity. Scores can be made comparable in a particular sense for assessments that are less similar. Procedures that make scores comparable in one sense (e.g., the most likely score for a student on a second assessment) will not simultaneously make the scores comparable in another sense (e.g., the proportion of students that exceed a fixed standard).
Weaker forms of linkage are likely to be context, group, and time dependent, which suggests the need for continued monitoring of the comparability of scores (p. 100).

Because of the case-specific nature of the problem of analysing change and growth with time-variable measures (that can or cannot be linked), researchers are urged to make careful and trained judgements about their proposed measures right at the outset of the study. The longer one waits to make such judgements, the less accurate the inferences one makes from the tests' scores. To reiterate a sentiment offered by Kolen and Brennan (2004), "the more accurate the information, the better the decision" (p. 2).

References

Abbeduto, L., & Hagerman, R. (1997). Language and communication in fragile X syndrome. Mental Retardation and Developmental Disabilities Research Reviews, 3, 313-322.

Afrassa, T., & Keeves, J. P. (1999). Changes in students' mathematics achievement in Australian lower secondary schools over time. International Education Journal, 7(1), 1-21.

Ai, X. (2002). Gender differences in growth in mathematics achievement: Three-level longitudinal and multilevel analyses of individual, home, and school influences. Mathematical Thinking and Learning, 4, 1-22.

Beasley, T. M., & Zumbo, B. D. (2003). Comparison of aligned Friedman rank and parametric methods for testing interactions in split-plot designs. Computational Statistics and Data Analysis, 42, 569-593.

Bejar, I. I. (1983). Introduction to item response models and their assumptions. In R. K. Hambleton (Ed.), Applications of item response theory. Vancouver, BC: Educational Research Institute of British Columbia.

Bolt, D. M. (1999). Evaluating the effects of multidimensionality on IRT true-score equating. Applied Measurement in Education, 12(4), 383-407.

Braun, H. I., & Holland, P. W. (1982). Observed-score test equating: A mathematical analysis of some ETS equating procedures.
In P. W. Holland and D. B. Rubin (Eds.), Test equating. New York: Academic Press.

British Columbia Ministry of Education. (2003). Interpreting and communicating British Columbia Foundation Skills Assessment results 2002. Retrieved December 8, 2003, from http://www.bced.gov.bc.ca/assessment/fsa/02interpret.pdf

British Columbia Ministry of Education. (n.d.). FSA numeracy specifications. Retrieved August 30, 2006, from http://www.bced.gov.bc.ca/assessment/fsa/numeracy_specs.pdf

Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage Publications.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.

Cascallar, A. S., & Dorans, N. J. (2005). Linking scores from tests of similar content given in different languages: An illustration involving methodological alternatives. International Journal of Testing, 5(4), 337-356.

Chiappe, P., Siegel, L. S., & Gottardo, A. (2002). Reading-related skills of kindergartners from diverse linguistic backgrounds. Applied Psycholinguistics, 23, 95-116.

Chiappe, P., Siegel, L. S., & Wade-Woolley, L. (2002). Linguistic diversity and the development of reading skills: A longitudinal study. Scientific Studies of Reading, 6, 369-400.

Clemans, W. V. (1993). Item response theory, vertical scaling, and something's awry in the state of the test mark. Educational Assessment, 1(4), 329-347.

Cliff, N. (1993). What is and isn't measurement. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences, Volume 1: Methodological issues (pp. 59-93). Hillsdale, NJ: Lawrence Erlbaum.

Cohen, B. F. (1996). Explaining psychological statistics. Pacific Grove, CA: Brooks/Cole Publishing Company.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Conger, J. J., & Galambos, N. L. (1997). Adolescence and youth: Psychological development in a changing world (5th ed.). New York: Addison Wesley Longman.
Conover, W. J. (1999). Practical Nonparametric Statistics (3rd ed.). New York: John Wiley & Sons.
Conover, W. J., & Iman, R. L. (1981). Rank transformations as a bridge between parametric and nonparametric statistics. The American Statistician, 35, 124-129.
Cronbach, L. J., & Furby, L. (1970). How should we measure "change" - Or should we? Psychological Bulletin, 74, 68-80.
Ding, C. S., Davison, M. L., & Petersen, A. C. (2005). Multidimensional scaling analysis of growth and change. Journal of Educational Measurement, 42(2), 171-191.
Dixon, R. A., & Lerner, R. M. (1999). History and systems in developmental psychology. In M. H. Bornstein & M. E. Lamb (Eds.), Developmental psychology: An advanced textbook (4th ed.; pp. 3-45). Mahwah, NJ: Lawrence Erlbaum Associates.
Domhof, S., Brunner, E., & Osgood, D. W. (2002). Rank procedures with repeated measures with missing values. Sociological Methods and Research, 30(3), 367-393.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. New Jersey: Lawrence Erlbaum Associates.
Ercikan, K. (1997). Linking statewide tests to the NAEP: Accuracy of combining test results across states. Applied Measurement in Education, 10(2), 145-159.
Gall, M. D., Borg, W. R., & Gall, J. P. (1996). Educational research: An introduction (6th ed.). White Plains, NY: Longman.
Goldstein, H. (1995). Multilevel statistical models (2nd ed.). London: Arnold.
Golembiewski, R. T., Billingsley, K., & Yeager, S. (1976). Measuring change and persistence in human affairs: Types of change generated by OD designs. Journal of Applied Behavioral Sciences, 12, 133-157.
Gravetter, F. J., & Wallnau, L. B. (2005).
Essentials of Statistics for the Behavioral Sciences (5th ed.). Belmont, CA: Thomson Wadsworth.
Guay, F., Larose, S., & Boivin, M. (2004). Academic self-concept and educational attainment level: A ten-year longitudinal study. Self and Identity, 3(1), 53-68.
Haertel, E. H. (2004). The behavior of linking items in test equating. Retrieved February 12, 2006, from http://www.cse.ucla.edu/reports/R630.pdf
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Publications.
Holland, P. W., & Rubin, D. B. (Eds.). (1982). Test equating. New York: Academic Press.
Howell, D. C. (1995). Fundamental Statistics for the Behavioral Sciences (3rd ed.). Belmont, CA: Duxbury Press.
Johnson, C., & Raudenbush, S. W. (2002). A repeated measures, multilevel Rasch model with application to self-reported criminal behavior. Retrieved February 12, 2006, from http://www.ssicentral.com/hlm/techdocs/NotreDamePaper2c.pdf
Jordan, N. C., & Hanich, L. B. (2003). Characteristics of children with moderate mathematics deficiencies: A longitudinal perspective. Learning Disabilities Research and Practice, 18(4), 213-221.
Karlsen, B., & Gardner, E. F. (1978-1996). Stanford Diagnostic Reading Test, Fourth Edition. San Antonio, TX: Harcourt Brace Educational Measurement.
Kolen, M. J. (2001). Linking assessments effectively: Purpose and design. Educational Measurement: Issues and Practice, 20(1), 5-19.
Kolen, M. J., & Brennan, R. L. (2004). Test Equating, Scaling, and Linking: Methods and Practices (2nd ed.). New York: Springer-Verlag.
Kruskal, W. H. (1960). Some remarks on wild observations. Technometrics, 2, 1-3.
Lesaux, N. K., & Siegel, L. S. (2003). The development of reading in children who speak English as a Second Language (ESL). Developmental Psychology, 39(6), 1005-1019.
Leung, S. O. (2003).
A practical use of vertical equating by combining IRT equating and linear equating. Practical Assessment, Research & Evaluation, 8(3). Retrieved July 24, 2006, from http://pareonline.net/getvn.asp?v=8&n=23
Linn, R. L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6(1), 83-102.
Linn, R. L., & Kiplinger, V. L. (1995). Linking statewide tests to the National Assessment of Educational Progress: Stability of results. Applied Measurement in Education, 8, 135-155.
Linn, R. L., & Slinde, J. A. (1977). The determination of the significance of change between pre- and posttesting periods. Review of Educational Research, 47(1), 121-150.
Lissitz, R. W., & Huynh, H. (2003). Vertical equating for state assessments: Issues and solutions in determination of adequate yearly progress and school accountability. Practical Assessment, Research & Evaluation, 8(10). Retrieved November 3, 2005, from http://PAREonline.net/getvn.asp?v=8&n=10
Lloyd, J. E. V., Walsh, J., & Shehni Yailagh, M. (2005). Sex differences in mathematics: If I'm so smart, why don't I know it? Canadian Journal of Education, 28(3), 384-408.
Lord, F. M. (1982). Item response theory and equating - A technical summary. In P. W. Holland & D. B. Rubin (Eds.), Test equating. New York: Academic Press.
Ma, X. (2005). Growth in mathematics achievement during middle and high school: Analysis with classification and regression trees. Journal of Educational Research, 99, 78-86.
Ma, X., & Ma, L. (2004). Modeling stability of growth between mathematics and science achievement during middle and high school. Evaluation Review, 28, 104-122.
Ma, X., & Xu, J. (2004). Determining the causal ordering between attitude toward mathematics and achievement in mathematics. American Journal of Education, 110, 256-280.
Marascuilo, L. A., & McSweeney, M. (1977).
Nonparametric and distribution-free methods for the social sciences. Monterey, CA: Brooks-Cole.
Martineau, J. A. (2006). Distorting value added: The use of longitudinal, vertically scaled student achievement data for growth-based, value-added accountability. Journal of Educational and Behavioral Statistics, 31(1), 35-62.
Meade, A. M., Lautenschlager, G. J., & Hecht, J. E. (2005). Establishing measurement equivalence and invariance in longitudinal data with item response theory. International Journal of Testing, 5(3), 279-300.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105(1), 156-166.
Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. Princeton, NJ: Educational Testing Service.
Muijs, D., & Reynolds, D. (2003). Student background and teacher effects on achievement and attainment in mathematics. Educational Research and Evaluation, 9(1), 21-35.
Muthen, B., & Khoo, S. T. (1998). Longitudinal studies of achievement growth using latent variable modeling. Learning and Individual Differences, 10, 73-101.
Newsom, J. T. (n.d.). Distinguishing between random and fixed: Variables, effects, and coefficients. Retrieved February 17, 2005, from http://www.upa.pdx.edu/IOA/newsom/mlrclass/ho_randfixd.doc
Notenboom, A., & Reitsma, P. (2003). Investigating the dimensions of spelling ability. Educational & Psychological Measurement, 63, 1039-1059.
Petrides, K. V., Chamorro-Premuzic, T., Frederickson, N., & Furnham, A. (2005). Explaining individual differences in scholastic behavior and achievement. British Journal of Educational Psychology, 17, 239-255.
Plewis, I. F. (2000). Evaluating educational interventions using multilevel growth curves: The case of reading recovery. Educational Research and Evaluation, 6, 83-101.
Pommerich, M., & Dorans, N. J. (2004). Linking scores via concordance: Introduction to the special issue.
Applied Psychological Measurement, 28(4), 216-218.
Pommerich, M., Hanson, B. A., Harris, D. J., & Sconing, J. A. (2004). Issues in conducting linkages between distinct tests. Applied Psychological Measurement, 28(4), 247-273.
Pomplun, M., Omar, M. D. H., & Custer, M. (2004). A comparison of WINSTEPS and BILOG-MG for vertical scaling with the Rasch model. Educational and Psychological Measurement, 64(4), 600-616.
Rowe, K. J., & Hill, P. W. (1998). Modeling educational effectiveness in classrooms: The use of multi-level structural equations to model students' progress. Educational Research and Evaluation, 4, 307-347.
Schafer, W. D. (2006). Growth scales as an alternative to vertical scales. Practical Assessment, Research & Evaluation, 11(4), 1-6. Retrieved July 14, 2006, from http://pareonline.net/pdf/v11n4.pdf
Schafer, W. D., & Twing, J. S. (2006). Growth scales and pathways. In R. W. Lissitz (Ed.), Longitudinal and value added modeling of student performance. Maple Grove, MN: JAM Press.
Schumacker, R. E. (2005). Test equating. Retrieved November 4, 2006, from www.appliedmeasurementassociates.com/White%20Papers/TEST%20EQUATING.pdf
Schumacker, R. E., & Lomax, R. G. (2004). A Beginner's Guide to Structural Equation Modeling (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Serlin, R. C., Wampold, B. E., & Levin, J. R. (2003). Should providers of treatment be regarded as a random factor?: If it ain't broke, don't 'fix' it. Psychological Methods, 8, 524-534.
Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics, 24, 323-355.
Singer, J. D., & Willett, J. B. (2003). Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. New York: Oxford University Press.
Sireci, S. G. (1998). The construct of content validity. In B. D.
Zumbo (Ed.), Validity Theory and the Methods Used in Validation: Perspectives from the Social and Behavioral Sciences (pp. 83-117). Netherlands: Kluwer Academic Press.
SPSS. (2002). Linear mixed-effect modelling in SPSS: An introduction to the Mixed procedure (Publication No. LMEMWP-1002). Chicago, IL: Author.
Stroud, T. W. F. (1982). Discussion of "A test of the adequacy of linear score equating models". In P. W. Holland & D. B. Rubin (Eds.), Test equating. New York: Academic Press.
Tabachnick, B. G., & Fidell, L. S. (1996). Using Multivariate Statistics (3rd ed.). New York: Harper Collins College Publishers.
Thomas, D. R., Hughes, E., & Zumbo, B. D. (1998). On variable importance in linear regression. Social Indicators Research: An International Journal for Quality-of-Life and Interdisciplinary Measurement, 45, 253-275.
von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The Kernel Method of Equating. New York: Springer.
Willett, J. B., Singer, J. D., & Martin, N. C. (1998). The design and analysis of longitudinal studies of development and psychopathology in context: Statistical models and methodological recommendations. Development and Psychopathology, 10, 395-426.
Zimmerman, D. W., & Zumbo, B. D. (1993a). Relative power of parametric and nonparametric statistical methods. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences, Volume 1: Methodological issues (pp. 481-517). Hillsdale, NJ: Lawrence Erlbaum.
Zimmerman, D. W., & Zumbo, B. D. (1993b). Relative power of the Wilcoxon test, the Friedman test, and repeated-measures ANOVA on ranks. Journal of Experimental Education, 62, 75-86.
Zimmerman, D. W., & Zumbo, B. D. (2005). Can percentiles replace raw scores in statistical analysis of test data? Educational and Psychological Measurement, 65, 616-638.
Zumbo, B. D. (1999).
The simple difference score as an inherently poor measure of change: Some reality, much mythology. In Bruce Thompson (Ed.), Advances in Social Science Methodology, Volume 5 (pp. 269-304). Greenwich, CT: JAI Press.
Zumbo, B. D. (2005). Notes on a new statistical method for modeling change and growth when the measure changes over time: A nonparametric mixed method for two-wave or multi-wave data. Unpublished manuscript, University of British Columbia.
Zumbo, B. D. (n.d.). Thinking about robustness in general inferential strategies. Unpublished manuscript, University of British Columbia.
Zumbo, B. D., & Forer, B. (in press). Friedman test. In Neil J. Salkind (Ed.), Encyclopedia of Measurement and Statistics. Thousand Oaks, CA: Sage Press.
Zumbo, B. D., & Hubley, A. M. (1998). A note on misconceptions concerning prospective and retrospective power. Journal of the Royal Statistical Society, Series D (The Statistician), 47, 385-388.

Appendix A: More about the Non-Parametric HLM (Case 1)

This appendix relates to material presented in Chapter 6, particularly the sections related to the multi-wave case involving Dr. Siegel's literacy data. This appendix offers readers a brief description of mixed-effect modelling (or HLM), and provides step-by-step information about performing these analyses via SPSS's graphical user interface (GUI) and syntax.

A Brief Description of Mixed-Effect Modelling

Mixed-effect modelling is known by a plethora of other names: individual growth modelling, random coefficient modelling, multilevel modelling, and hierarchical linear modelling (HLM). Unlike other repeated measures analyses of change (e.g., the paired samples t-test, repeated measures ANOVA, profile analysis), mixed-effect models can handle complex and 'messy' data, unbalanced designs, time-related covariates, continuous predictors of rates of change, and unequal variances.
Furthermore, mixed-effect models allow researchers to account for the dependencies among observations that arise from the hierarchical or nested nature of educational and behavioural data.

Mixed-effect modelling has two broad, albeit interrelated, classes: multilevel modelling and individual growth modelling. The primary difference between these classes pertains to the way in which the data are nested or grouped. Within the context of educational research, a common three-level example of multilevel modelling is students (at Level 1 of the hierarchy) clustered within units such as classrooms (Level 2) which, in turn, may be grouped within schools (Level 3). A common two-level individual growth modelling example involves repeated measures (at Level 1) nested within students (Level 2).

It has been argued that, prior to the development of mixed-effect modelling techniques in the early 1980s, there was no appropriate quantitative equipment in the proverbial research toolbox to even allow for the rigorous investigation of change (Singer & Willett, 2003). The primary criticism of pre-mixed-effect modelling techniques concerns the assumptions about the error term in regular ordinary least squares (OLS) analyses: linearity, normality, homoscedasticity, and independence. According to Bryk and Raudenbush (1992), the latter two assumptions must be modified when the data show dependencies. Otherwise, standard errors are too small, leading to higher-than-appropriate rates of rejecting the null hypothesis and, hence, the attribution of statistical effects where none should exist. Mixed-effect modelling allows these data dependencies to be taken into account.

Fixed versus Random Effects

Mixed-effect modelling refers to special classes of regression techniques in which at least one independent variable (or factor) in a statistical model is considered fixed and at least one other independent variable in the same model is considered random.
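The consequence Bryk and Raudenbush describe - standard errors that are too small when dependencies are ignored - can be illustrated with a small simulation. The following sketch is not from the dissertation; it uses Python (numpy) with invented group sizes and variances purely to show the direction of the effect:

```python
import numpy as np

rng = np.random.default_rng(42)
n_groups, n_per = 50, 20  # e.g., 50 hypothetical classrooms of 20 students each

# Students in the same group share a common effect, so their scores are dependent.
group_effects = rng.normal(scale=2.0, size=n_groups)
scores = 1.0 + np.repeat(group_effects, n_per) + rng.normal(size=n_groups * n_per)

# A naive standard error of the mean treats all 1000 scores as independent:
se_naive = scores.std(ddof=1) / np.sqrt(scores.size)

# Honouring the clustering, only the 50 group means are truly independent:
group_means = scores.reshape(n_groups, n_per).mean(axis=1)
se_cluster = group_means.std(ddof=1) / np.sqrt(n_groups)

print(se_naive < se_cluster)  # the naive SE understates the true uncertainty
```

With a strong group effect, the naive standard error is several times too small, which is exactly the mechanism behind the inflated Type I error rates described above.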
Unfortunately, ascertaining the exact distinction between a fixed and a random factor is, at best, thorny (Newsom, n.d.). There are three possible reasons for this thorniness.

First, there exists an acute dearth of scholarly pieces that actually provide precise definitions of fixed and random factors - a particularly surprising finding given that these concepts characterise the very framework of mixed-effect techniques. It is possible that this dearth stems from the fact that many of the seminal writers in the field - Anthony Bryk, Harvey Goldstein, Stephen Raudenbush, and Doug Willms, to name a handful - tailor their articles to researchers already confident in the language of mixed modelling. This dearth may also relate to Singer's (1998) observation that, were it not for present-day, user-friendly statistical software packages, "few statisticians and even fewer empirical researchers would fit the kinds of sophisticated statistical models being promulgated today" (p. 350). Perhaps, lamentably, the ease with which such software packages may be used has enabled researchers to deem discussion of the fundamentals of mixed modelling as trivial.

Second, in the rare instance that a scholarly piece does, in fact, provide a distinction between fixed and random factors, the definitions tend to be worded vaguely. For example, Serlin, Wampold, and Levin (2003) state that a fixed factor is one that is related to population means, whereas SPSS (2002) defines a fixed factor as any variable that "affects the population mean" (p. 3). Clearly, such vague definitions are of little use to novice mixed modellers.

Third, a given factor can be considered fixed or random, depending on the context of the study (Newsom, n.d.). For the purpose of this primer, definitions of each type of factor have been provided and, at a minimum, they provide a useful starting point for thinking about this distinction.

Fixed factors.
Recall that an important assumption underlying traditional statistical analyses is that the independent factors in the model are fixed. Newsom (n.d.) defines a fixed factor as one that is:

1. assumed to be measured without error;
2. purported to contain all or most of the values that would be found in the same variable had the findings been generalised to a population; and
3. not necessarily invariant (equal) across subgroups.

Often the fixed component of the model is referred to as Model I (Serlin et al., 2003). Recall that, when dealing with fixed factors, the following assumptions are made about the errors (Zumbo, n.d.):

1. the errors have an average of zero;
2. the errors have the same variance across all individuals;
3. the errors are distributed identically with density function f;
4. the error density, f, is assumed to be distributed symmetrically; and
5. the error density, f, is assumed to be distributed normally.

Random factors. Conversely, Newsom (n.d.) defines a random factor as one that:

1. is assumed to be measured with measurement error (scores are a function of a true score and random error);
2. contains values that come from, and are intended to generalise to, a much larger population of possible values with a defined probability distribution; and
3. contains values that are smaller or narrower in scope than would be found in the same variable pertaining to a population.

A model's random component is often referred to as Model II (Serlin et al., 2003). In summary, it is useful to conceptualise the distinction between regular OLS-based regression and mixed-effect designs as follows: Traditional statistical approaches involve models that contain fixed independent factors only. Mixed-effect techniques are so-named because they involve models with mixtures of fixed and random components.

Performing the Analysis using the Graphical User Interface (GUI)

Step 1: Entering the data.
When conducting most repeated measures analyses, data are entered into the data matrix (spreadsheet) in person-level format, in which one row represents one individual and the time-related variables are represented along the horizontal of the spreadsheet. As Figure 7 shows, the variable names grade2raw, grade3raw, grade4raw, grade5raw, and grade6raw identify students' raw scores on each grade-specific version of the SDRT. As the data matrix shows, the student with casenum 5 is female and earned a raw score of 35 in Grade 2, and so forth.

Figure 7. Entering the data in SPSS (Step 1).

Step 2: Rank transforming the data, within wave. Once the data are entered, select the "rank cases" option under the "transform" menu. Please refer to Figure 8.
Figure 8. Rank transforming the data within wave in SPSS (Step 2a).

Once the "rank cases" dialogue box appears, move the variables one wishes to rank transform into the "variable(s)" portion of the box. Note that, by default, SPSS assigns a rank of one (1) to each variable's smallest score. It is recommended that analysts retain this default setting, as it is logical to think that the lowest test score should also have the lowest rank. Please refer to Figure 9.

Figure 9. Rank transforming the data within wave in SPSS (Step 2b).

Once the five original test score variables have been moved, click "ties" and retain the default setting ("mean"). Doing so will ensure that, in the event that two students share the same raw score within wave (i.e., if their scores are tied), they will both receive a mean rank in that wave. Please refer to Figure 10.

Figure 10. Rank transforming the data within wave in SPSS (Step 2c).

As Figure 11 illustrates, the original variables' scores have now been rank transformed. The five newly-created rank-based variables (grade2rank-grade6rank) now appear alongside the original raw scores (grade2raw-grade6raw) in the data matrix. Simple frequency analyses of each of the new rank variables confirm that the minimum possible rank is 1 (assigned to the lowest within-wave raw score), and the maximum rank is 653 (assigned to the highest within-wave raw score) because there are 653 participants in the sample. As Figure 11 shows, the student with casenum 5 earned a Grade 2 rank score of 148.5 (a rank she shared with others who earned the same Grade 2 raw score).
In other words, her Grade 2 raw score was the 148.5th lowest in that particular wave. Her raw score in Grade 3 earned a rank score of 257.5, meaning that her standing within this sample of test-takers increased very slightly from Grade 2 to Grade 3. In Grade 4, this student's raw score earned a rank of 363, suggesting her SDRT standing improved yet again from Grade 3 to Grade 4 (and so forth).

Figure 11. The new, rank-transformed data matrix (Step 2d).

Step 3: Restructuring the data. Recall that the data were originally entered into the data matrix in person-level format (one row per participant). When conducting any sort of individual growth modelling analysis, however, it is necessary to have an explicit wave/time variable (that represents the specific wave in which each test score is collected), which is missing if the data are left in person-level format. Therefore, analysts must restructure (transpose) the data into person-period format, in which each participant has multiple records (rows), one for each wave or time point. As such, select the "restructure" option from the "data" menu. Please refer to Figure 12.

Figure 12. Restructuring the data in SPSS (Step 3a).

One may wonder why, if HLM analyses require person-period data formatting, the data are ever entered in person-level format to begin with. It is necessary to first enter the data in person-level, not person-period, format so that the five original SDRT variables can be properly rank transformed within wave.
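The within-wave rank transformation described in Step 2 can also be sketched outside SPSS. The following Python (pandas) fragment is illustrative only - the four students and their raw scores are invented - but it follows the same conventions as the SPSS defaults (rank 1 for the lowest raw score; tied scores share the mean rank):

```python
import pandas as pd

# Four hypothetical students' raw scores on two waves (invented values).
wide = pd.DataFrame({
    "casenum":   [5, 15, 16, 22],
    "grade2raw": [35, 34, 39, 35],
    "grade3raw": [36, 29, 43, 17],
})

# Rank each wave's scores separately; method="average" mirrors SPSS's "mean"
# ties option, and ascending=True gives the lowest raw score rank 1.
for raw in ["grade2raw", "grade3raw"]:
    wide[raw.replace("raw", "rank")] = wide[raw].rank(method="average",
                                                      ascending=True)

print(wide[["casenum", "grade2rank", "grade3rank"]])
```

Note how the two tied Grade 2 scores (35) each receive the mean rank 2.5, in the same way that casenum 5 shared the rank 148.5 in the full data set.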
Once the "restructure data wizard" dialogue box opens, SPSS asks analysts how they would like the data formatted. In order to change a person-level formatted data file into a person-period format, select the first/default option: "restructure selected variables into cases". In SPSS language, "case" is a synonym for "row". Then click "next". Please refer to Figure 13.

Figure 13. Restructuring the data in SPSS (Step 3b).
Once the "restructure data wizard - step 2 of 7" dialogue box appears, SPSS asks analysts how many variable sets they would like to transpose. In this case, there is only one set of variables (the "set" being the five SDRT variables). As such, analysts may select the first/default option and then click "next". Please refer to Figure 14.

Figure 14. Restructuring the data in SPSS (Step 3c).

Once the "restructure data wizard - step 3 of 7" dialogue box appears, select "use selected variable" from the "case group identification" drop-down menu. Move the participant identification number (variable name = casenum) into the "variable" box. Type over "trans1" a nickname for the five SDRT-related variables (e.g., rankSDRT). Finally, move the five rank SDRT variables into the "target variable" area. Click "next". Please refer to Figure 15.

Figure 15. Restructuring the data in SPSS (Step 3d).

The "restructure data wizard - step 4 of 7" dialogue box requires that analysts identify the number of chosen index variables - variables that SPSS uses to create the new columns. In this case, it is necessary to format the data according to one particular variable: participants' casenum.
As such, select the first/default option "one" and click "next". Please refer to Figure 16.

Figure 16. Restructuring the data in SPSS (Step 3e).

In the "restructure data wizard - step 5 of 7" dialogue box, analysts may choose the name of the soon-to-be-created time/wave variable in the newly-formatted data file. Type wave into the "name" box, and click "next". Please refer to Figure 17.

Figure 17. Restructuring the data in SPSS (Step 3f).

Finally, in the "restructure data wizard - step 6 of 7" dialogue box, select the default settings and click "finish". The result is a person-period formatted data file, as illustrated in Figure 18. Note that each participant is now allotted five rows in the spreadsheet - one for each wave of the study. Also, notice that there is now an explicit time variable, wave, and a new variable, rankSDRT (named by the analyst in a previous command), that represents each participant's wave-specific rank score. After recoding (see below), the wave variable is coded as follows: Wave 0 = Grade 2, Wave 1 = Grade 3, Wave 2 = Grade 4, Wave 3 = Grade 5, and Wave 4 = Grade 6.

By default, SPSS codes the wave variable 1-5. Intercepts are calculated as if wave = 0, so the intercept would actually be calculated for a time that existed prior to the actual first wave of data collection. As such, for easier interpretation of the intercept term, the wave variable's values were recoded from 1-5 to 0-4, respectively, prior to running the non-parametric HLM.

Figure 18. The new, restructured data matrix (Step 3g).

Step 4: Performing the HLM analysis.
For brevity, the specific graphical user interface (GUI) steps required in order to perform the HLM analysis are not included in this appendix. In the next section, however, the HLM syntax is provided.

Performing the Analysis using Syntax

In this section of the chapter, the SPSS syntax required to perform the steps detailed in the previous section is presented. For the ease and convenience of the reader, the syntax has been typed in a distinct font.

Step 2: Rank transforming the data, within wave. In order to rank transform the data, within wave, use the following syntax:

    RANK VARIABLES=Grade2raw Grade3raw Grade4raw Grade5raw Grade6raw (A)
      /RANK
      /PRINT=YES
      /TIES=MEAN .
    RENAME VARIABLES
      (Rgrade2r = grade2rank)
      (Rgrade3r = grade3rank)
      (Rgrade4r = grade4rank)
      (Rgrade5r = grade5rank)
      (Rgrade6r = grade6rank).
    EXECUTE.

Step 3: Restructuring the data. In order to restructure the data, from person-level to person-period format, use the following syntax:

    VARSTOCASES
      /MAKE rankSDRT FROM Grade2rank Grade3rank Grade4rank Grade5rank Grade6rank
      /INDEX = wave(5)
      /KEEP = casenum gender Grade2raw Grade3raw Grade4raw Grade5raw Grade6raw
      /NULL = KEEP.

Step 4: Performing the HLM analysis. In order to perform the non-parametric HLM, use the following syntax.

    * Unconditional Model: No Level 2 (student) predictor variable. Just repeated measures at Level 1 nested within students.
    MIXED rankSDRT WITH wave
      /METHOD = REML
      /PRINT = SOLUTION TESTCOV R
      /FIXED = wave
      /RANDOM = INTERCEPT wave | SUBJECT(casenum) COVTYPE(UN).

    * Conditional Model: Gender added as a Level 2 predictor variable.
    MIXED rankSDRT WITH wave gender
      /METHOD = REML
      /PRINT = SOLUTION TESTCOV R
      /FIXED = wave gender wave*gender
      /RANDOM = INTERCEPT wave | SUBJECT(casenum) COVTYPE(UN).
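The restructuring performed by the wizard (Steps 3c-3g) and by the VARSTOCASES syntax above can also be sketched outside SPSS. The following Python sketch is an illustration only (the function name is hypothetical; the column names mirror those used in the text): it turns one person-level row holding five wave-specific rank columns into five person-period rows, with wave coded 0-4 as described in footnote 39.

```python
def to_person_period(person_level_rows):
    """Convert person-level (wide) rows into person-period (long) rows.

    Each input row holds five wave-specific rank scores; each output
    row holds one wave's score.  The wave index is coded 0-4 (not 1-5)
    so that the model intercept refers to the first actual wave."""
    rank_cols = ["grade2rank", "grade3rank", "grade4rank",
                 "grade5rank", "grade6rank"]
    long_rows = []
    for row in person_level_rows:
        for wave, col in enumerate(rank_cols):  # wave runs 0..4
            long_rows.append({"casenum": row["casenum"],
                              "gender": row["gender"],
                              "wave": wave,
                              "rankSDRT": row[col]})
    return long_rows

# One participant (rank scores as in the first case shown in Figure 18)
# becomes five rows.
wide = [{"casenum": 5, "gender": 1.0,
         "grade2rank": 9.0, "grade3rank": 10.0, "grade4rank": 15.5,
         "grade5rank": 8.0, "grade6rank": 7.0}]
long_rows = to_person_period(wide)
```

This mirrors what VARSTOCASES with /INDEX = wave(5) produces, apart from the /KEEP of the original raw-score columns.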
Appendix B: More about the Non-Parametric Difference Score (Case 2)

This appendix relates to material presented in Chapter 6, particularly the sections related to the two-wave case involving Foundation Skills Assessment data. This appendix provides readers with specific instructions about performing the said analyses via SPSS' graphical user interface (GUI) and syntax. In addition, this appendix provides a description of how the decision between using the simple difference score or the residualised change score in analyses is made. Please refer to Chapter 6 for a definition of these terms.

Performing the Analysis using the Graphical User Interface (GUI)

Step 1: Entering the data. As Figure 19 illustrates, variable names grade4scale and grade7scale identify students' standardised (scaled) scores on each grade-specific version of the FSA, respectively. The student with casenum 9 is a female who earned a standardised score of -1.18 in Grade 4 and -1.20 in Grade 7.

Figure 19. Entering the data in SPSS (Step 1).

Step 2: Rank transforming the data, within wave. Once the data are entered, select the "rank cases" option under the "transform" menu. Please refer to Figure 20.
Figure 20. Rank transforming the data within wave in SPSS (Step 2a).

Once the "rank cases" dialogue box appears, move the variables one wishes to rank transform into the "variable(s)" portion of the box. Note that, by default, SPSS assigns a rank of one (1) to each variable's smallest score. It is recommended that analysts retain this default setting, as it is logical to think that the lowest test score should also have the lowest rank. Please refer to Figure 21.

Figure 21. Rank transforming the data within wave in SPSS (Step 2b).

As with Case 1, once the two original test score variables have been moved, click "ties" and retain the default setting ("mean"). Doing so will ensure that, in the event that two students share the same raw score within wave (i.e., if their scores are tied), they will both receive a mean rank in that wave. Please refer to Figure 22.

Figure 22. Rank transforming the data within wave in SPSS (Step 2c).

As Figure 23 illustrates, the two original variables' scores have now been rank transformed within wave.
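The "ties = mean" behaviour described above can be stated precisely: tied raw scores all receive the mean of the ranks they would jointly occupy. A minimal Python sketch (a hypothetical helper for illustration, not SPSS code):

```python
def midranks(scores):
    """Rank scores ascending (rank 1 = smallest), assigning tied
    scores the mean of the ranks they jointly occupy - the SPSS
    "ties = mean" default described in the text."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

# Two students tied at 3.1 share the mean of ranks 2 and 3.
print(midranks([2.0, 3.1, 3.1, 5.4]))  # -> [1.0, 2.5, 2.5, 4.0]
```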
The two newly-created rank-based variables (grade4rank and grade7rank) now appear alongside the original scores (grade4scale and grade7scale) in the data matrix. Simple frequency analyses of each of the new rank variables confirm that the minimum possible rank is 1 (assigned to the lowest within-wave raw score), and the maximum rank is 4097 (assigned to the highest within-wave raw score) because there are 4097 participants in the convenience sample. As the data matrix shows, the student with casenum 9 earned a rank score of 576 for Grade 4. In other words, her Grade 4 standardised score was the 576th lowest in that particular wave. Her raw score in Grade 7 earned a rank score of 268, meaning that her standing relative to other test takers in the sample decreased from Grade 4 to Grade 7.

Figure 23. The new, rank-transformed data matrix (Step 2d).

Step 3: Restructuring the data. Recall that the data were originally entered into the data matrix in person-level format (one row per participant). Also recall that, when conducting any sort of individual growth modelling, one requires an explicit wave/time variable. Unlike the previous (multi-wave) case, the interest in the current study lies with each student's individual index of change score, not with the two observed scores themselves. As such, there is no need to restructure the data from person-level to person-period format.

Step 4: Determining the appropriate index of change. In order to determine which specific index of change serves as the dependent variable in this particular case, it is necessary to follow the guidelines of Zumbo (1999), who writes that "one should utilize the simple difference score instead of the residualized difference if and only if ρ(X₁,X₂) > σX₁/σX₂" (p. 293). To this end, it is necessary to calculate the Pearson correlation between the Grade 4 and 7 rank scores [ρ(X₁,X₂)] and the ratio of the standard deviations of the respective rank scores (σX₁/σX₂). It is important to stress that, when implementing the Conover solution for two-wave data, one's decision about using the simple difference or the residualised change score must be based on students' rank, not observed, scores. In order to calculate the Pearson correlation, select the "correlate" option under the "analyze" menu. Select "bivariate" as the correlation type. Please refer to Figure 24.
Figure 24. Computing a correlation matrix (Step 4a).

Next, move the two rank score variables (grade4rank and grade7rank) into the "variables" box. Click "ok". Please refer to Figure 25.

Figure 25. Computing a correlation matrix (Step 4b).

As Figure 26 illustrates, the resultant SPSS output shows that the correlation between students' Grade 4 and Grade 7 rank scores [ρ(X₁,X₂)] is equal to 0.669.

    Correlations
                                      grade4rank    grade7rank
    grade4rank  Pearson Correlation   1             .669**
                Sig. (2-tailed)                     .000
                N                     4097          4097
    grade7rank  Pearson Correlation   .669**        1
                Sig. (2-tailed)       .000
                N                     4097          4097
    **. Correlation is significant at the 0.01 level (2-tailed).

Figure 26. The resultant correlation output (Step 4c).

The next step is to compute the ratio of the standard deviations of the respective rank scores (σX₁/σX₂). To this end, select "descriptive statistics" from the "analyze" menu. Then choose "descriptives".
Please refer to Figure 27.

Figure 27. Computing the ratio of standard deviations (Step 4d).

Then move the two rank variables into the "variables" box. Then click on "options". Please refer to Figure 28.

Figure 28. Computing the ratio of standard deviations (Step 4e).

In the "options" dialogue box, check the "std. deviation" box. Then click "continue". When the "options" box closes, click "ok". Please refer to Figure 29.

Figure 29. Computing the ratio of standard deviations (Step 4f).

The resultant SPSS output shows the standard deviations of the Grade 4 and Grade 7 rank scores, respectively, as 1182.843 and 1182.846. Please refer to Figure 30.

    Descriptive Statistics
                         N       Std. Deviation
    grade4rank           4097    1182.843
    grade7rank           4097    1182.846
    Valid N (listwise)   4097

Figure 30. The resultant descriptive statistics output (Step 4g).

Because the correlation between the Grade 4 and 7 rank scores [ρ(X₁,X₂) = 0.669] is less than the ratio of the two standard deviations (1182.843/1182.846 = 0.999), it is necessary to use the residualised change score, rather than the simple difference score, in this case study (Zumbo, 1999).

Step 5: Computing the residualised change score. To compute the residualised change score, the dependent variable of the case analysis, begin by selecting "regression" under the "analyze" menu. Then choose "linear". Please refer to Figure 31.

Figure 31. Computing the residualised change score (Step 5a).

Next, regress grade7rank (the Wave 2 rank score) onto grade4rank (the Wave 1 rank score) as follows. Then click "save". Please refer to Figure 32.

Figure 32. Computing the residualised change score (Step 5b).

Next, check the "unstandardized" box under "predicted values". Click "continue". Once the "save" dialogue box closes, click "ok". Please refer to Figure 33.
Figure 33. Computing the residualised change score (Step 5c).

As a result of the regression, the data matrix now contains a new variable: pre_1. This variable represents the Wave 2 (Grade 7) rank predicted from the Wave 1 (Grade 4) rank score. To finalise the computation of the residualised change score, select "compute" from the "transform" menu. Please refer to Figure 34.

Figure 34. Computing the residualised change score (Step 5d).

In the "target variable" area, type a name for the new residualised change score variable (e.g., residualised_chg_score). In the "numeric expression" area, subtract pre_1 from grade7rank. Click "ok". Please refer to Figure 35.

Figure 35. Computing the residualised change score (Step 5e).

As Figure 36 illustrates, the data matrix now contains a new variable, residualised_chg_score, which represents the Wave 2 rank score (grade7rank) less the Wave 2 rank score predicted from the Wave 1 rank score (pre_1).
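Steps 4 and 5 amount to a short computation: apply Zumbo's (1999) decision rule to the two rank variables and, when the rule does not favour the simple difference, form the residualised change score as the Wave 2 rank minus the Wave 2 rank predicted from the Wave 1 rank. A Python sketch with hypothetical data (illustration only; the function names are not from SPSS):

```python
from math import sqrt

def prefer_simple_difference(x1, x2):
    """Zumbo's rule: use the simple difference score if and only if
    the correlation rho(X1, X2) exceeds the ratio sd(X1)/sd(X2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s1 = sqrt(sum((a - m1) ** 2 for a in x1) / (n - 1))
    s2 = sqrt(sum((b - m2) ** 2 for b in x2) / (n - 1))
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
    rho = cov / (s1 * s2)
    return rho > s1 / s2

def residualised_change(x1, x2):
    """Residual of each Wave 2 rank after regressing it on the
    Wave 1 rank: x2 - (a + b * x1)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sxx = sum((a - m1) ** 2 for a in x1)
    b = sxy / sxx
    a = m2 - b * m1
    return [y - (a + b * x) for x, y in zip(x1, x2)]

# Hypothetical rank scores for four students at the two waves.
w1 = [1.0, 2.0, 3.0, 4.0]
w2 = [2.0, 1.0, 4.0, 3.0]
use_simple = prefer_simple_difference(w1, w2)  # False, as in the case study
change = residualised_change(w1, w2)
```

In the case study, ρ(X₁,X₂) = 0.669 while the standard-deviation ratio is 0.999, so the rule returns False and the residualised change score serves as the dependent variable. Note that regression residuals necessarily sum to zero.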
Figure 36. The newly-created residualised change score (Step 5f).

Step 6: Performing the Independent Samples t-Test. Now that the residualised change score has been computed, one can perform the requisite statistical analyses - in this case, an independent samples t-test. Under "analyze", select "compare means". Then select "independent-samples T test" (sic). Please refer to Figure 37.
Figure 37. Conducting the independent samples t-test (Step 6a).

Move the newly-created residualised_chg_score variable into the "test variable" area of the dialogue box. Move the gender variable into the "grouping variable" box. Then, click on "define groups", and type "M" (for male) and "F" (for female) into the "define groups" dialogue box. Then click "continue". When the "define groups" dialogue box closes, click "ok". Please refer to Figure 38.
Figure 38. Conducting the independent samples t-test (Step 6b).

Performing the Analysis using Syntax

In this section of the chapter, the SPSS syntax required to perform the steps detailed in the previous section is presented. For the ease and convenience of the reader, the syntax has been typed in a distinct font.

Step 2: Rank transforming the data, within wave. In order to rank transform the data, within wave, use the following syntax:

    RANK VARIABLES=grade4scale grade7scale (A)
      /RANK
      /PRINT=YES
      /TIES=MEAN .
    RENAME VARIABLES
      (Rgrade4s = grade4rank)
      (Rgrade7s = grade7rank).
    EXECUTE.
Step 4: Determining the appropriate index of change. In order to compute the necessary correlation and standard deviation values, use the following syntax:

    CORRELATIONS
      /VARIABLES=grade4rank grade7rank
      /PRINT=TWOTAIL NOSIG
      /MISSING=PAIRWISE .
    DESCRIPTIVES
      VARIABLES=grade4rank grade7rank
      /STATISTICS=MEAN STDDEV MIN MAX .

Step 5: Computing the residualised change score. In order to compute the residualised change score, use the following syntax:

    REGRESSION
      /MISSING LISTWISE
      /STATISTICS COEFF OUTS R ANOVA
      /CRITERIA=PIN(.05) POUT(.10)
      /NOORIGIN
      /DEPENDENT grade7rank
      /METHOD=ENTER grade4rank
      /SAVE PRED .
    COMPUTE residualised_chg_score = grade7rank - PRE_1.
    EXECUTE .

Step 6: Performing the Independent Samples t-Test. In order to compute the independent samples t-test, use the following syntax:

    T-TEST
      GROUPS = gender('M' 'F')
      /MISSING = ANALYSIS
      /VARIABLES = residualised_chg_score
      /CRITERIA = CI(.95) .
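The T-TEST command compares males' and females' mean residualised change scores. Its "equal variances assumed" line rests on the pooled-variance t statistic, which can be sketched in Python as follows (hypothetical data; a complete analysis would also consult the Levene test and the equal-variances-not-assumed line that SPSS prints):

```python
from math import sqrt

def pooled_t(group1, group2):
    """Independent-samples t statistic with pooled variance (the
    "equal variances assumed" row of SPSS t-test output).
    Returns the t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    ss1 = sum((x - m1) ** 2 for x in group1)
    ss2 = sum((x - m2) ** 2 for x in group2)
    sp2 = (ss1 + ss2) / (n1 + n2 - 2)       # pooled variance estimate
    se = sqrt(sp2 * (1 / n1 + 1 / n2))      # SE of the mean difference
    return (m1 - m2) / se, n1 + n2 - 2

# Hypothetical residualised change scores for two small groups.
t_stat, df = pooled_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```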