ON MODELLING CHANGE AND GROWTH WHEN THE MEASURES THEMSELVES CHANGE ACROSS WAVES: METHODOLOGICAL AND MEASUREMENT ISSUES AND A NOVEL NON-PARAMETRIC SOLUTION

by

JENNIFER ELIZABETH VICTORIA LLOYD

M.A., University of Victoria, 2002
B.Sc., University of Victoria, 1998

A THESIS COMPLETED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Measurement, Evaluation, and Research Methodology)

THE UNIVERSITY OF BRITISH COLUMBIA

© Jennifer Elizabeth Victoria Lloyd, 2006

Abstract

In the past 20 years, the analysis of individual change has become a key focus of research in education and the social sciences. There are several parametric methodologies that centre upon quantifying change. These varied methodologies, known as repeated measures analyses, are commonly used in three research scenarios: In Scenario 1, the exact same measure is used and re-used across waves (testing occasions). In Scenario 2, most of the measures' content changes across waves - typically commensurate with the age and experiences of the test-takers - but the measures retain one or more common items (test questions) across waves. In Scenario 3, the measures either vary completely across waves (i.e., there are no common items), or the sample being tested across waves is small, or there is no norming group. Some researchers assert that repeated measures analyses should only occur if the measure itself remains unchanged across waves, arguing that it is not possible to link or connect the scores (either methodologically or conceptually) of measures whose content varies across waves. Because it is not uncommon to face Scenarios 2 and 3 in educational and social science research settings, however, it is vital to explore more fully the problem of analysing change and growth with measures that vary across waves. To this end, the first objective of this dissertation is to weave together the (a) test linking and (b) change/growth literatures for the purpose of exploring this problem in a comprehensive manner. The second objective is to introduce a novel solution to the problem: the non-parametric hierarchical linear model (for multi-wave data) and the non-parametric difference score (for two-wave data). Two case studies that demonstrate the application of the respective solutions are presented, accompanied by a discussion of the novel solution's strengths and limitations. Also presented is a discussion about what is meant by 'change'.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
Chapter 1: Introduction
    Using Repeated Measures Analyses: Three Research Scenarios
    An Oft Overlooked, Oft Misunderstood Assumption of Repeated Measures Analyses
    What is Meant by "Same Dependent Variable"?
    What is Meant by "Equatable" Test Scores?
    The Motivating Problem: Analysing Change/Growth with Time-Variable Measures
    Objectives and Novel Contributions
    Importance of the Dissertation Topic
    Framework of the Dissertation
Chapter 2: Foundational Issues: Precisely What is Meant by 'Change'?
    Amount versus Quality of Change
    Is Constancy as Interesting as Change?
    Personal versus Stimulus Change
    Formulating a Research Design: Important Questions
    Common Interpretation Problems in Change Studies
    How Should One Conceptualise Change?
Chapter 3: Five Types of Test Linking
    Equating
    Two Equating Research Designs
    Six Types of Equating
    Two Alternative Types of Equating
    Calibration
    Statistical Moderation
    Projection
    Social Moderation
    Summary: Selecting the Appropriate Type of Test Linking Method
Chapter 4: Seven Current Strategies for Handling Time-Variable Measures
    Vertical Scaling
    Growth Scales
    Rasch Modelling
    Latent Variable or Structural Equation Modelling
    Multidimensional Scaling
    Standardising the Test Scores or Regression Results
    Converting Raw Scores to Age (or Grade) Equivalents Pre-Analysis
    Chapter Summary
Chapter 5: The Conover Solution: A Novel Non-Parametric Solution for Analysing Change/Growth with Time-Variable Measures
    Traditional Applications of the Rank Transformation
    Introducing the Conover Solution to the Motivating Problem
    How Applying the Conover Solution Changes the Research Question (Slightly)
    Within-Wave versus Across-Wave Ranking
    Establishing the Viability of the Conover Solution
    Primary Assumption of the Conover Solution: Commensurable Constructs
Chapter 6: Two Conover Solution Case Studies: The Non-Parametric HLM and the Non-Parametric Difference Score
    Choice of Statistical Software Packages
    Determining the Commensurability of Constructs
    Case 1: Non-Parametric HLM (Conover Solution for Multi-Wave Data)
        Description of the Data
        Specific Variables of Interest and Proposed Methodology
        Statistical Models and Equations
        Hypotheses Being Tested
        Explanation of the Statistical Output
    Case 2: Non-Parametric Difference Score (Conover Solution for Two-Wave Data)
        Description of the Data
        Specific Variables of Interest and Proposed Methodology
        Hypotheses Being Tested
        Explanation of the Statistical Output
    Chapter Summary
Chapter 7: Discussion and Conclusions
    Summary of the Preceding Chapters
    Strengths of the Conover Solution
    Limitations of the Conover Solution
    Suggestions for Future Research
    Conclusions
References
Appendix A: More about the Non-Parametric HLM (Case 1)
    A Brief Description of Mixed-Effect Modelling
    Fixed versus Random Effects
    Performing the Analysis using the Graphical User Interface (GUI)
    Performing the Analysis using Syntax
Appendix B: More about the Non-Parametric Difference Score (Case 2)
    Performing the Analysis using the Graphical User Interface (GUI)
    Performing the Analysis using Syntax
Appendix C: UBC Behavioural Research Ethics Board Certificate of Approval

List of Tables

Table 1: Test-Takers' Simple Difference Scores on Three Subtests, Grouped by Wave 1 Performance
Table 2: Example Conversion Table for Test X and Test Y Scores
Table 3: Kolen and Brennan's (2004) Comparison of the Similarities of Five Test Linking Methods on Four Test Facets
Table 4: Linn's (1993) Requirements of Different Techniques in Linking Distinct Assessments

List of Figures

Figure 1: A Hypothetical Test-Taker's Performance Across Three Waves of a Simulated Mathematics Assessment (Solid Line = Actual Scores, Dashed Line = Line of Best Fit)
Figure 2: Illustrating the Fit Function that Links the Scores on Two Versions of a Hypothetical 100-item Test (Modified from Kolen & Brennan, 2004). The Arrows Show that the Direction of the Linkage is Unimportant when Test Linking
Figure 3: Descriptive Statistics for Each of the Five Waves of SRDT Raw Scores Collected by Siegel
Figure 4: Histograms of the Siegel Study's Raw Scores across Five Waves: Grade 2 (Top Left), Grade 3 (Top Right), Grade 4 (Middle Left), Grade 5 (Middle Right), and Grade 6 (Bottom Left). Note that Each of the Distributions is Skewed Negatively
Figure 5: Unconditional Model Output
Figure 6: Conditional Model Output
Figure 7: Entering the Data in SPSS (Step 1)
Figure 8: Rank Transforming the Data Within Wave in SPSS (Step 2a)
Figure 9: Rank Transforming the Data Within Wave in SPSS (Step 2b)
Figure 10: Rank Transforming the Data Within Wave in SPSS (Step 2c)
Figure 11: The New, Rank-Transformed Data Matrix (Step 2d)
Figure 12: Restructuring the Data in SPSS (Step 3a)
Figure 13: Restructuring the Data in SPSS (Step 3b)
Figure 14: Restructuring the Data in SPSS (Step 3c)
Figure 15: Restructuring the Data in SPSS (Step 3d)
Figure 16: Restructuring the Data in SPSS (Step 3e)
Figure 17: Restructuring the Data in SPSS (Step 3f)
Figure 18: The New, Restructured Data Matrix (Step 3g)
Figure 19: Entering the Data in SPSS (Step 1)
Figure 20: Rank Transforming the Data Within Wave in SPSS (Step 2a)
Figure 21: Rank Transforming the Data Within Wave in SPSS (Step 2b)
Figure 22: Rank Transforming the Data Within Wave in SPSS (Step 2c)
Figure 23: The New, Rank-Transformed Data Matrix (Step 2d)
Figure 24: Computing a Correlation Matrix (Step 4a)
Figure 25: Computing a Correlation Matrix (Step 4b)
Figure 26: The Resultant Correlation Output (Step 4c)
Figure 27: Computing the Ratio of Standard Deviations (Step 4d)
Figure 28: Computing the Ratio of Standard Deviations (Step 4e)
Figure 29: Computing the Ratio of Standard Deviations (Step 4f)
Figure 30: The Resultant Descriptive Statistics Output (Step 4g)
Figure 31: Computing the Residualised Change Score (Step 5a)
Figure 32: Computing the Residualised Change Score (Step 5b)
Figure 33: Computing the Residualised Change Score (Step 5c)
Figure 34: Computing the Residualised Change Score (Step 5d)
Figure 35: Computing the Residualised Change Score (Step 5e)
Figure 36: The Newly-Created Residualised Change Score (Step 5f)
Figure 37: Conducting the Independent Samples t-test (Step 6a)
Figure 38: Conducting the Independent Samples t-test (Step 6b)

Acknowledgements

This dissertation represents the efforts of many special people. First, I would like to thank my research supervisor, Dr. Bruno Zumbo. There are too few glowing words in my vocabulary to describe him, but I will give it a shot: kind, generous, inspiring, funny, encouraging, talented, and all heart. Thank you for serving as a beacon these past few years. I am honoured to call you my mentor and friend. Grazi!

I would also like to thank Dr. Anita Hubley and Dr. Kimberly Schonert-Reichl. They are both wonderful role models, particularly for young women in academics: bright, accomplished, kind, fun, and class acts. I couldn't have imagined a better research committee.

I would also like to thank Dr. Linda Siegel and the British Columbia Ministry of Education for allowing me to use their respective data sets in my dissertation. Dr. Siegel's data were collected at great personal expense to her. Her generosity is very much appreciated. I would also like to thank Edudata Canada (Dr. Victor Glickman) for not only disseminating the Ministry data to me, but also for offering me such rewarding employment during my doctoral program. Thanks also to the Human Early Learning Partnership (Dr. Clyde Hertzman) for welcoming me into such a vibrant and dynamic group of scholars. In addition, I extend my thanks to the Social Sciences and Humanities Research Council (SSHRC) of Canada for funding this research project. Their financial support has been a tremendous gift.

I have been blessed with many wonderful friendships that have truly enriched my life. I thank Aeryn, Brian (Melanie), David, Debbie, James, Janine (Stephen), Nicki, Rachel, Sharon, Tanis, my church family, and especially Catrin. Thanks also to my friends in the Measurement, Evaluation, and Research Methodology (MERM) program, and to the members of Dr. Zumbo's Edgeworth Laboratory.

My family has played an integral role in my education. From a very early age, my parents encouraged me to read. They enrolled me in various sorts of lessons (piano, ballet, and even baton!?) and camps, when I am sure that the fees were sometimes more expensive than they could afford. They were always present at school events, plays, recitals, track meets, and awards ceremonies.
They "encouraged" me to start working part-time from the age of 15, and these jobs taught me how to handle my money and my time. My father, Kelvin, is an unwaveringly hardworking and conscientious man, and has always shown me the importance of a good, solid work ethic. My mother, Janet, is the heart of our family, always encouraging me to climb mountains and to follow my dreams. What better influences can a little girl have possibly had growing up? I would also like to thank my brother, David. An accomplished cyclist, gymnast, and singer, Dave often reminds me about the importance of keeping balance in my life. I'm grateful to have a little brother who looks out for his big sister.

...but those who hope in the Lord will renew their strength. They will soar on wings like eagles; they will run and not grow weary, they will walk and not be faint. Isaiah 40:31

Dedication

For Mum, Dad, Dave, Nana, and Buster, and for my extended family, and for my much-loved grandfather, John Bow. (Now, there are two doctors in the family.)

On Modelling Change and Growth When the Measures Themselves Change across Waves: Methodological and Measurement Issues and a Novel Non-Parametric Solution

Chapter 1: Introduction

In the past 20 years, the analysis of individual change has become a key focus of research in education and the social sciences. If one thumbs through a handful of quantitative-based educational and social science research journals, conference proceedings, or grant applications, chances are good that one will spot the word "trajectories" nestled somewhere within the prose - which, loosely speaking, pertains to the amount by which individuals change, grow, mature, improve, and progress over time. Chapter 2 presents a more thorough discussion of what is meant by 'change'. Such individual change can occur naturally (e.g., an infant's learning to crawl and then to walk) or may be experimentally induced (e.g., improvement in test scores as a result of coaching). [Footnote: For the purpose of this dissertation, the words "measure", "form", "test", "scale", and "assessment" are used interchangeably.] In either case, "by measuring and charting changes [we] uncover the temporal nature of development" (Singer & Willett, 2003, p. 3). This temporal nature of development may be studied over diverse spans of time: hours, days, weeks, months, or even years. Measurement occasions or periods of data collection that punctuate these spans of time are generally referred to as waves.

Using Repeated Measures Analyses: Three Research Scenarios

There are several parametric methodologies that centre upon quantifying change. These varied methodologies are known as repeated measures analyses. Such methodologies include the paired samples t-test, the repeated measures analysis of variance (ANOVA), profile analysis, and mixed-effect modelling (individual growth modelling, hierarchical linear modelling, or simply HLM). [Footnote: Mixed-effect/HLM modelling is described in more detail in Appendix A.] In addition to affording researchers the opportunity to study change, repeated measures analyses reduce or eliminate problems caused by characteristics that may vary from individual to individual and can influence or confound the obtained scores, such as age or gender (Gravetter & Wallnau, 2005).
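To make the first of these methodologies concrete before turning to the three scenarios, the short sketch below applies a paired samples t-test to two waves of scores from the same hypothetical measure. It is purely illustrative: the scores are invented, and the Python/SciPy code stands in for whatever statistical package a researcher might actually use (this dissertation's own analyses, documented in the appendices, are carried out in SPSS).

```python
# Minimal sketch of a repeated measures analysis: the same measure administered
# to the same test-takers at two waves, analysed with a paired samples t-test.
# The scores below are invented for illustration only.
from scipy import stats

wave1 = [42, 50, 38, 55, 47, 44, 52, 40]   # scores at the first wave
wave2 = [48, 53, 41, 60, 50, 49, 55, 46]   # the same test-takers at the second wave

t_stat, p_value = stats.ttest_rel(wave2, wave1)  # paired (dependent) samples t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Because the same measure - and hence the same score metric - is used at both waves, the mean difference tested here has a direct "amount of change" interpretation; it is precisely this property that the second and third scenarios below call into question.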
In education and the social sciences, there are three distinct research scenarios in which repeated measures analyses may be used. Each of these scenarios is illustrated by means of an example:

Scenario 1: Exact same measure across waves. Imagine that a teacher wishes to investigate the amount by which her students' academic motivation changes across the span of one school year. She designs her study such that her students are assessed across three waves: the beginning of the school year, mid-way through the school year, and once again at the end of the school year. Motivation, she posits, can theoretically be measured using the exact same measure across all testing occasions, irrespective of the ever-changing age, cognitive development, and personal and scholarly experiences of her students (because she has no reason to believe that students' motivation changes commensurately with these developmental factors). As a result, she decides it is unnecessary to change the measure's wording or items (questions) across waves, and further decides that the scores collected at each wave can be interpreted in the same way (i.e., a test score of 50 at Wave 1 means the same "amount" of motivation as a score of 50 at Wave 2, and so forth).

Scenario 2: Time-variable measures that can be linked. [Footnote: For brevity, measures whose content, wording, response categories, etc., must vary across waves in repeated measures designs are referred to as 'time-variable'.] Now imagine that a professor in a university music program wishes to study progress in his concert piano students' musicality across their four-year undergraduate degree. He designs the study such that his students are assessed annually, commencing in Year 1 of the program and ending in Year 4. He conceives that musicality, irrespective of the year in which the student is assessed, represents some composite of three subscale scores: theory (i.e., the ability to recognise notes, intervals, keys, and scales), technique (i.e., rate of accuracy playing a year-specific piece), and artistry (i.e., ability to convey the appropriate emotion when playing a year-specific piece). He is aware that, as his students progress through the program years, their technique and artistry will improve relatively commensurately with their age, practice, and scholarly experiences. Given that expert theory skills, however, are pre-requisites for entry into the music program, the professor does not believe that students' theory skills should change across years: Rather, he postulates that these specific skills should remain at expert level for the duration of the program. As such, he designs the four measures such that each contains (a) year-specific technique items (i.e., items that are specific to the students' particular year of study), (b) year-specific artistry items, but (c) common theory items (i.e., test items that remain unchanged across years and whose subscale scores can purportedly be interpreted in the same way across waves) to ensure that students' theory skills remain at expert levels for the duration of the program. Items that remain unchanged across waves or test versions are called common items or anchor items. To this end, the items that are designed to assess students' technique (and artistry) are, by definition, worded differently, and have different response categories (and perhaps even different response formats) across the different waves of the study.
As a result, the test-level scores (and the respective subscale scores for technique and artistry) collected at the four waves cannot necessarily be interpreted in the same way, even though all of the measures are collectively thought to assess students' musicality (and their technique, artistry, and theory, in particular). As such, a total test score of 50 at Wave 1 does not necessarily mean the same "amount" of musicality as a score of 50 at Wave 2. Furthermore, a technique subscale score of 20 at Wave 1 does not necessarily mean the same "amount" of technique as a score of 20 at Wave 2, and so forth. Because there are common theory items across the four versions of the measure, however, it is possible to link (connect) the theory subscale scores across waves and to interpret those subscale scores in the same way.

Scenario 3: Time-variable measures that cannot be linked. Finally, imagine that a researcher is interested in exploring elementary students' achievement in mathematics over time. Her research design involves testing a cohort of students annually, beginning when the students are in Grade 1 and ending when the students are in Grade 7. Within the context of this example, there would be no coherent rationale for administering the exact same mathematics measure to her participants across the seven waves. The specific mathematics test that first-graders would have the cognitive ability and requisite training to complete could not, by definition, be the exact same mathematics test administered to the students when they are in later grades: The wording of the items, the complexity of the concepts presented, and the specific content domains tapped by each item must vary across the seven waves of the study (Mislevy, 1992). If not, the reliability and validity of the test scores are compromised, likely rendering the study useless (Singer & Willett, 2003). As a result, the test-level scores collected at the seven waves cannot necessarily be interpreted in the same way (e.g., a test score of 50 at Wave 1 does not necessarily mean the same "amount" of mathematics achievement as a score of 50 at Wave 2, and so forth), even though all of the scale items are collectively thought to assess students' mathematics achievement.

Although Scenario 3 has been characterised in the previous example as the situation in which one's measures share no common items across waves, it is also possible to encounter Scenario 3 in two additional situations: when one's sample size is small or when one does not have the ability to compare the sample's scores to those of a norming group (discussed in more detail in a later chapter). In the case of small sample sizes, it is not necessarily advisable to link or connect the scores of measures, even if the measures share common items (as in Scenario 2). In part because the linking together of scores from different measures has traditionally been handled with item response theory techniques (as described more fully in a subsequent chapter), it is, in general, recommended that two or more measures' scores are only linked when one has a minimum sample size of 400 test-takers per measure (Kolen & Brennan, 2004). [Footnote: This sample size is only a rule of thumb, and can vary depending on the method of test linking one chooses.] For the remainder of this document, the shorthand used for Scenario 3 is "time-variable measures that cannot be linked".
Although this dissertation generally uses this phrase in reference to measures that are non-linkable due to a lack of common items across waves, please note that this phrase also encompasses situations in which there are too few test-takers in a sample to link the measures' scores (even if the measures share common items), or situations in which one cannot refer to the scores of a norming group. In the next section, a particular assumption of repeated measures analyses - one that has significant implications for the current dissertation's focus - is explained and discussed.

An Oft Overlooked, Oft Misunderstood Assumption of Repeated Measures Analyses

Repeated measures analyses are those in which one set of individuals is measured more than once on the same (or commensurable) dependent variable. Embedded in this definition is a special assumption that is often worded so subtly and succinctly that it fails to garner the attention of researchers - leading them to either overlook it completely or to deem it trivial. Doing so is regrettable, because the theoretical and practical implications of this particular assumption are profound. This special assumption is captured in the phrase "the same (or commensurable) dependent variable". In many research contexts, particularly those involving repeated measures analyses of variance (ANOVA), this phrase is often understood to mean that the exact same measure must be used across all waves of the repeated measures design. For more about the meaning of "commensurable", please refer to Chapters 5 and 6.

As Scenario 1 above illustrates, certain constructs can indeed be measured using the exact same measure across all testing occasions - irrespective of the ever-changing age, cognitive development, and personal and scholarly experiences of the test-takers. In these situations, the test length, item wording, and response categories remain constant across all waves of the study. As Scenario 2 (time-variable measures that can be linked) and particularly Scenario 3 (time-variable measures that cannot be linked) depict, however, there are often situations in which one's construct of choice makes using and re-using the exact same measure across waves unreasonable - and even impossible.

What is Meant by "Same Dependent Variable"?

In a seminal article in which the authors make several recommendations for measurement in longitudinal studies, Willett, Singer, and Martin (1998) clarify what is actually meant by the phrase "the same dependent variable":

1. "At the very least, the attribute [must] be equatable over occasions of measurement, and must remain construct valid for the period of observation" (p. 397);

2. "Seemingly minor differences across occasions - even those invoked to improve data quality - will undermine equatability. Changing item wording, response category labels, or the setting in which instruments are administered can render responses nonequatable. In a longitudinal study, at a minimum, item stems and response categories must remain the same over time" (p. 411); and

3. "Whenever time varying variables are measured, their values must be equatable across all occasions of measurement" (p. 411).
Unfortunately, missing from Willett et al.'s (1998) article is an explicit explanation of what they mean by "equatable": It is difficult to ascertain if they use this word as is common in the English vernacular (e.g., as a synonym for "commensurable" or "comparable" or "linkable") or if they mean this word in a strict psychometric sense (described more fully in the next section and in Chapter 3). As such, in order to determine the specific conditions under which test scores are equatable, it is necessary to refer to the test linking literature.

What is Meant by "Equatable" Test Scores?

In general, test linking refers to the general problem of linking, connecting, or comparing the scores on different tests (Linn, 1993). Test equating, a special case of test linking, adjusts for differences in tests' difficulty, not differences in content (Kolen & Brennan, 2004). Hence, when the scores on various tests are equated, the measures may be used interchangeably for any purpose (von Davier, Holland, & Thayer, 2004). In addition, any use or interpretation justified for scores on Test X is also justified on Test Y (Linn, 1993) - meaning a score of 26 on Test X means the same "amount" of a given construct as a score of 26 on Test Y (Kolen & Brennan, 2004). This chapter's mention of test equating is made so as to weave together the concepts presented by Willett et al. (1998) and von Davier et al. (2004); however, test linking is certainly not limited to test equating alone. In Chapter 3, the fuller spectrum of test linking methods is described.

Von Davier et al. (2004) state that the following five conditions must all be met in order to deem different measures equatable [Footnote: These specific conditions are discussed in greater detail in subsequent sections of this dissertation.]:

1. Equal Constructs: The tests must measure the same construct. This requirement is sometimes referred to as measurement invariance or measurement equivalence [Footnote: Measurement invariance is similar to the notion of commensurability, which is discussed in more detail in Chapters 5 and 6.], and is achieved when the items tap the same underlying construct or latent trait at each wave (Johnson & Raudenbush, 2002). [Footnote: A latent variable is an unobserved variable that accounts for the correlation among one's observed or manifest variables. Ideally, psychometricians design scales such that the latent variable that drives test-takers' responses is a representation of the construct of interest.] As Meade, Lautenschlager, and Hecht (2005) observe, "if measurement invariance does not hold over two or more measurement occasions, differences in observed scores are not directly interpretable" (p. 279).

2. Equal Reliability: The various tests' scores must not have different reliabilities, even if they measure the same construct. Any changes in the psychometric properties of tests across waves can change the predictive validity of the test scores (Meade et al., 2005).

3. Symmetry: The equating function for equating the scores of Test Y to Test X should be the inverse of the equating function for equating the scores of Test X to Test Y. In other words, the results should be the same regardless of the direction of the linkage (Pommerich, Hanson, Harris, & Sconing, 2004).

4. Equity: The actual test written should be a matter of indifference to the test-taker. In other words, students should not find one test harder or more complex than the other.

5. Population Invariance: The equating function used to link the scores of Tests X and Y should not be affected by the choice of sub-populations used to compute the function. More specifically, the function obtained for one sub-group of test-takers should be the same as the function obtained from a different sub-group of test-takers (Ercikan, 1997). Proper equating must be invariant against arbitrary changes in the group (Lord, 1982).
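The symmetry condition is perhaps easiest to see in the simplest, linear (mean-sigma) case. The expression below is a standard textbook form of a linear equating function (see, e.g., Kolen & Brennan, 2004) rather than a formula quoted from this dissertation; the means and standard deviations refer to the score distributions of the two tests.

```latex
% Linear (mean-sigma) equating of a Test X score x onto the scale of Test Y:
l_Y(x) = \frac{\sigma_Y}{\sigma_X}\left(x - \mu_X\right) + \mu_Y

% Symmetry: equating in the other direction uses the inverse of the same function,
% so the result does not depend on the direction of the linkage:
l_X(y) = \frac{\sigma_X}{\sigma_Y}\left(y - \mu_Y\right) + \mu_X = l_Y^{-1}(y)
```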
The Motivating Problem: Analysing Change/Growth with Time-Variable Measures

As has been described, Willett et al. (1998) state that test scores must be equatable in order for repeated measures analysis to be performed validly. In turn, for test scores to be equatable, the conditions of equal constructs, equal reliability, symmetry, equity, and population invariance must first be met (von Davier et al., 2004). Equating test scores used in repeated measures designs is best achieved by using the exact same measure across all waves of the study (as is the case in Scenario 1), because any changing of item wording, response category labels, or the setting in which tests are administered can conceivably render test scores non-equatable (Willett et al., 1998).
First, the two literatures often use different terms to refer to similar ideas. For example, the terms "measurement invariance" (from the test linking literature) and "commensurability" (from the change/growth literature) are often regarded as being disconnected notions, when the two are, in fact, variations on a theme. Because the two literatures often do not "speak the same language", one's.understanding of analysing change/growth with time-variable measures can be muddied unnecessarily. It should be noted that this problem is exacerbated by the fact that writers within even one of the named literatures often come from a variety of research backgrounds (e.g., nursing, psychology, education, economics) and often use different terms for similar ideas. A second problem caused by the gap between the two literatures is that one's understanding of the motivating problem may be severely curtailed i f he or she is not familiar with both literatures. For example, recall from an earlier section that Willett et al. (1998) state in the change/growth literature that " A t the very least, the attribute [must] be equatable over occasions of measurement, and must remain construct valid for the period of observation" (p. 397). I f a reader were to limit his or her reading to the change/growth literature, then one may interpret the word "equatable" to mean that change/growth analyses are permissible simply i f a similar construct is being measured across waves. It is only by referring to the test linking literature that one is able to determine that the actual meaning of 8 The concept of commensurability is described in more detail in Chapters 5 and 6. 12 the word "equatable" is much more nuanced and demanding o f the data (e.g., von Davier et al., 2004) than the meaning implied in the change/growth literature. Therefore, it is only by bridging the gap between the two literatures that one can analyse change/growth with time-variable measures in a rigorous fashion. Objective/novel contribution 2. The second objective of this dissertation is to provide a novel solution to the problem of studying change and growth when one has time-variable measures (that can or cannot be linked). A s Chapter 4 describes more fully, many of the strategies currently being used in the change/growth literature as means of handling the motivating problem are often only useful to large testing organisations that have at their fingertips very large numbers (sometimes tens of thousands) of test-takers and/or measures with hundreds of items. Unfortunately, researchers in everyday research settings often do not have the means to use such large sample sizes or item pools. Moreover, many o f the strategies presented in Chapter 4 require the presence of common items across measures which, as discussed earlier, is not always feasible (or warranted). Therefore, this dissertation introduces a workable solution that can be implemented easily in everyday research settings, and one that is particularly useful when one's measures cannot be linked. This dissertation has coined this solution the Conover solution in honour of the seminal work of Conover and Iman (1981) and Conover (1999), whose research not only inspired the novel solution, but also provided evidence for the solution's viability. 
Although the specifics of the Conover solution are saved for subsequent sections of this dissertation, it is important to highlight that this solution is novel because it involves an innovative bridging of the gap between parametric and non-parametric statistical methods. Indeed, this gap has already been bridged in various other contexts (e.g., the Spearman correlation, described in Chapter 5) due, in large part, to the seminal work of Conover and Iman (1981) and Conover (1999). The bridge, however, has never before been extended to the problem of analysing change/growth, particularly with time-variable measures. It is by expanding upon the research of Conover and Iman (1981) and Conover (1999), and extending the parametric/non-parametric bridge to the context of the current dissertation, that two new change/growth analyses can be performed for the first time: the non-parametric hierarchical linear model (for multi-wave data) and the non-parametric difference score (for two-wave data). These new analyses are discussed in more detail later in the dissertation.

Importance of the Dissertation Topic

There are two primary reasons why it is important to address the motivating problem. First, as Willett et al.'s (1998) and von Davier et al.'s (2004) work describes, the rules about which tests are permissible for repeated measures designs are precise and strict. Given these conditions, it is necessary to investigate if and how repeated measures designs are possible when the measures are time-variable. Without an adequate solution to this problem, the very reliability, validity, and usefulness of past studies involving time-variable measures (particularly those that cannot be linked) are called into question. Furthermore, this problem, if left unsolved, implies that longitudinal analyses involving time-variable measures should altogether cease to occur. As many educational psychologists, psychometricians, social scientists, and statisticians alike would agree, a 'non-solution' to this problem is hardly satisfactory.

Second, there has been substantial growth in longitudinal large-scale achievement testing in the past decade, most notably in North America. Such testing is being practised zealously at the institutional level (e.g., schools and districts), within universities, at the public policy level (e.g., British Columbia's Foundation Skills Assessment, the Government of Canada's National Longitudinal Survey of Children and Youth, and America's No Child Left Behind mandates), and within the private sector (e.g., Educational Testing Service). In such contexts, it is common for questions about change over time to arise. Once again, without an adequate solution to the problem of handling repeated measures designs with time-variable measures (particularly those that cannot be linked), it is impossible to ascertain whether the inferences these organisations are making about test score changes are accurate. As a practical matter, unless and until there is an adequate solution to this problem, much of the billions of dollars directed towards funding large-scale testing across North America each year will simply be wasted, and many of the policy-related decisions borne of these scores will be fundamentally unsound. As Kolen and Brennan (2004) put it so succinctly, "the more accurate the information, the better the decision" (p. 2).
Although much is unknown about this particular topic, one thing is for sure: This is a rich and fertile area for research by educational psychologists, psychometricians, social scientists, and statisticians alike (Holland & Rubin, 1982).

Framework of the Dissertation

The process of change is deceptively more complex, multifaceted, and nuanced than many researchers first acknowledge. Without a deep understanding of what is meant by change, it is possible to draw erroneous conclusions from studies or, worse still, to create fundamentally flawed research designs. As such, the purpose of Chapter 2 is to "unpack" the precise meaning of 'change', so as to set the context for the remainder of the dissertation.

Chapter 3 offers a brief description of each of the major types of test linking: (1) equating, (2) calibration, (3) statistical moderation, (4) projection, and (5) social moderation, respectively. This chapter also serves as a backdrop for the remainder of the dissertation, because many of the concepts and terms presented in this chapter are revisited in later chapters.

Chapter 4 weaves together, in a comprehensive manner, the test linking and change/growth literatures by presenting seven test linking strategies currently being used in the change and growth literature as means of handling the problem of analysing change and growth with time-variable measures (particularly those that cannot be linked). These seven strategies include: (1) vertical scaling, (2) growth scales, (3) Rasch modelling, (4) latent variable or structural equation modelling, (5) multidimensional scaling, (6) standardising the test scores or regression results, and (7) converting raw scores to age- or grade-equivalents pre-analysis, respectively. Each strategy is presented, where possible, with examples from real-life research settings.

In Chapter 5, the Conover solution is introduced as a means of handling the problem of analysing change/growth when none of the aforementioned seven strategies can be implemented (most notably in the case of time-variable measures that cannot be linked). As described earlier, in the case of two-wave data, the Conover solution is called the non-parametric difference score; in the case of multi-wave data, the Conover solution is called the non-parametric HLM. Although the Conover solution may be applied in either Scenario 2 (time-variable measures that can be linked) or Scenario 3 (time-variable measures that cannot be linked), it is particularly useful in the latter of the two - a scenario which has gone relatively unaddressed in the test linking and change/growth literatures. The Conover solution involves rank transforming (or ordering) individuals' longitudinal test scores within wave pre-analysis, and then using these rank scores in the place of raw or standardised scores in subsequent statistical analyses.

In Chapter 6, by way of two case studies involving real data, the step-by-step implementation of the two Conover solutions (the non-parametric HLM solution and the non-parametric difference score solution) is presented. An explanation of the resultant statistical models and statistical output is also offered.

Chapter 7 concludes by discussing the Conover solution's strengths and limitations and offers suggestions for future studies focussed on the analysis of change and growth with time-variable measures (particularly those that cannot be linked).
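As a forward-looking illustration of the within-wave ranking step just described for Chapter 5, the sketch below shows the core data manipulation in Python/pandas. It is a minimal, illustrative example with invented scores and variable names - not the SPSS procedure documented step by step in Appendices A and B.

```python
# Within-wave rank transformation at the heart of the Conover solution:
# scores are ranked separately within each wave, and the ranks (not the raw
# scores) are carried into the subsequent change/growth analysis.
import pandas as pd

scores = pd.DataFrame({
    "student": [1, 2, 3, 4, 1, 2, 3, 4],
    "wave":    [1, 1, 1, 1, 2, 2, 2, 2],
    "raw":     [12, 35, 20, 28, 40, 61, 55, 47],  # the measures differ across waves
})

# Rank within each wave; tied raw scores receive the average of their ranks.
scores["rank_within_wave"] = scores.groupby("wave")["raw"].rank(method="average")
print(scores)

# The rank_within_wave column would then replace the raw scores in, for example,
# a difference score analysis (two waves) or an HLM/mixed-effect model (multi-wave).
```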
Chapter 2: Foundational Issues: Precisely What is Meant by 'Change'?

Change is an inexorable and pervasive part of human beings' daily lives. Because discussion of change and growth is central to this dissertation, it is important to "unpack", at the outset, many of the foundational issues surrounding the process of change.

The investigation of change is of great interest to many educational and social science researchers. In particular, two groups of scholars have made the study of change an integral component of their research programs. First, developmental psychologists are concerned with both descriptions (i.e., depictions or representations) of change and explanations for change (i.e., the specification of the causes or antecedents of development). In essence, the primary objectives of developmental psychologists are to describe and to explain what stays the same and what changes across life, and to account for the reasons for such change (Dixon & Lerner, 1999). Second, psychometricians often focus their research on quantifying the amount by which individuals change and grow over time. As introduced in the previous chapter, the path that summarises a test-taker's pattern of change over multiple waves is referred to as a trajectory (Singer & Willett, 2003).

Although a number of developmental psychologists, psychometricians, and their counterparts pepper journal articles and book chapters with talk of change, there is a surprising dearth of discourse about the precise meaning of the term. Perhaps this dearth is attributable to the fact that, at first glance, researchers' basic understanding of change appears to be generally well in hand: Most human beings have at least a lay understanding of what change means - with or without a scholarly discussion of its precise meaning. Even so, the absence of a precise meaning of change in the literature introduces its fair share of problems. As this chapter details, change is a deceptively more complex, multifaceted, and nuanced process than many researchers first acknowledge. Without a deep understanding of what is meant by change - particularly in the change/growth and test linking contexts - it is possible to draw erroneous conclusions from studies or, worse still, to create fundamentally flawed research designs. To this end, the purpose of this chapter is to discuss the precise meaning of change.

Amount versus Quality of Change

Researchers generally agree that there are two broad types of change: quantity (amount) and quality. Imagine that a farmer has in his possession a small barrel, which he places outdoors in the centre of a field. At the same time each morning, the farmer goes out to the barrel, sticks a metre stick into it, and measures the amount of rain that has fallen in the preceding 24 hours. After measuring the rainfall, he empties the barrel. The next day, he returns to the barrel, uses his metre stick to record the past day's amount of rainfall, and then empties the barrel once again. This process continues for several weeks during the crop season. By continually measuring the rainfall according to his metre stick, the farmer is measuring changes in the amount of rainfall over time. Moreover, by simply tracking the daily rainfall levels over time, he is able to determine the trajectory of rainfall in his field. It is important to note that he uses the exact same metre stick each morning - because he knows that, by changing the measuring stick, he may compromise the quality of the day-to-day comparisons.
Psychometricians refer to this type of distortion as measurement error. Even though the farmer's metre-stick routine allows him to quantify the amount of rainfall change, it does not allow him to measure the quality of the change in rainfall. Although he is able to detect simple increases or decreases in the amount of rainfall relative to some baseline measure (e.g., the first day of the crop season), his particular metre-stick routine does not allow him to detect changes in the quality of the rainfall - such as levels of acidity in the rain or the nutrient composition of the collected water. As such, an important and necessary starting place in better understanding the nature of change is to draw the comparison between the amount and the quality of change. It is important to note that growth models, a relatively new methodological tool for analysing individual change, pertain only to amounts of change; they do not allow researchers to study qualities of change over time. The latter type of change is explored best using qualitative tools. Given that amount of change is the focus of the current dissertation, the remainder of this chapter (and dissertation) focuses on this particular type of change.

Is Constancy as Interesting as Change?

According to Dixon and Lerner (1999), the opposite of change is constancy or continuity. Typically, when studying a sample of individual test-takers' growth curves, a researcher is not interested in the way in which the scores 'stay the same' across waves. A researcher does not generally elect to study a particular construct over time if she does not suspect a priori that the test-takers' scores will change, in some systematic or measurable way, across waves - particularly when one considers the great financial costs and complexities associated with many longitudinal research designs. Growth modelling, by definition, implies change - not constancy. Therefore, a researcher opts to study a particular construct over time typically because she expects that individual test-takers' scores will, on average, increase (e.g., improvement in mathematics scores as a result of coaching) or decrease (e.g., deterioration in memory scores as a result of age) across waves.

Certain personality characteristics, such as intellectual mastery, demonstrate remarkable consistency across time (Conger & Galambos, 1997). Does their constancy mean that such characteristics should not be measured over time? Singer and Willett (2003) chide readers that "not every longitudinal study is amenable to the analysis of change" (p. 9), adding that there are three particular methodological features that make a study suited for longitudinal analysis:

1. three or more waves of data;

2. a sensible metric for clocking time; and

3. an outcome whose values change systematically over time (e.g., patients whose symptomatology differs before, during, and after therapy).

Constructs purported to remain constant over time are considered not particularly useful (or interesting) in longitudinal analyses. This issue is revisited in a later section of this chapter.

Personal versus Stimulus Change

There is one important, but altogether unwritten, assumption underlying any growth modelling analysis: It is that the test-taker (the person) stays, for all intents and purposes, 'the same' for the duration of the study (i.e., any changes in a test-taker's scores across waves are attributable to changes in the amount of a construct over time, and not to the test-taker's personal attributes).
Hence, the only changeable aspect of the study is the amount of the test-taker's construct across waves. Imagine, for example, that Joe Smith's mathematics achievement is tested across three waves (Grades 4, 7, and 10). The typical assumption is that Joe Smith (the person) remains constant over the three waves of the study. Only the measure (the stimulus) changes across waves. Recall that, in an ideal world, psychometricians design measures such that the latent variable that drives test-takers' responses is a representation of the construct of interest. As such, any and all changes in Joe's across-wave test scores are attributed strictly to changes in his amount of mathematics achievement - as captured by the measures and hence their scores - and not to personal changes he may have experienced across years. [Footnote: Because one's scale score is correlated with the underlying latent variable, and because the latent variable is, in turn, ideally thought to be an approximation of the construct of interest, the term 'stimulus' refers variably to (a) the measure itself or (b) the construct the measure is purported to measure.]

But is this assumption - that the person somehow remains constant across waves - a fair one to make? It can be argued that this assumption is fundamentally flawed for this reason: How is it possible for any person, particularly a young test-taker assessed over several years, to remain 'the same' across years? Surely Joe Smith experiences some degree of personal change as the waves pass: He not only ages and matures, but his cognitive skills develop, his scholarly and recreational interests change, and he is confronted by a variety of personal and scholarly experiences, both positive and negative, from the world around him - and all of this personal change occurs irrespective of purported changes in his mathematics achievement (the stimulus change) across the three waves! Compounding the 'person-stays-constant' problem is the fact that each and every test-taker in a study experiences distinctly unique personal change. As such, there is no possible way in which to account and control for such variation in personal change across test-takers, because they are experiencing such change in extremely unique ways.
Formulating a Research Design: Important Questions

Because personal change occurs among test-takers in unpredictable and variable ways - irrespective of the individuals' stimulus change - the reality is that researchers must devote serious and informed thought to the implications of such personal change when formulating research designs centred on quantifying stimulus change. Right from the outset of the study, researchers must ask themselves 'thought questions' such as:

1. What recommendations do the relevant theory and the existing literature make about the formulation of the research design?

2. Knowing that test-takers' personal changes will occur more and more frequently the longer the duration of the study, over how many waves should the test-takers be assessed? How much time should elapse between waves - hours, days, months, or years? According to Singer and Willett (2003), one should choose a metric for time that best reflects the cadence expected to be the most useful for the outcome.

3. How will test-takers' personal changes affect the inferences made about their stimulus changes? Are there any safeguards that can be implemented (e.g., from a test development perspective) that may mitigate the muddying effect that personal change can have on the inferences made about test-takers' stimulus change? Can the time-variable measures be designed with an eye toward lessening the impact of personal changes on stimulus changes? If so, how likely is it that these safeguards will protect researchers from making incorrect inferences about their test-takers' stimulus change?

Recall from Chapter 1 that some constructs (such as the academic motivation example presented in Scenario 1) are purported to be measurable using the exact same measure across all testing occasions, irrespective of the ever-changing age, cognitive development, and personal and scholarly experiences of the students. In such cases, the item wording, content of the measure, response formats, and response categories are unchanged from wave to wave. In the context in which the exact same measure is used across waves, it is important that researchers ask themselves the following additional 'thought question':

4. Knowing that test-takers will all experience some degree of personal change, does it even make sense to use and re-use the exact same measure across waves? Is the use and re-use of the exact same measure, in fact, erroneous because it implies that test-takers' personal change has no relationship to, or impact upon, their stimulus change? Take, for example, one multiplication test item administered to a group of test-takers in Grade 4 and then again when they are in Grade 7: In the younger grade, this item may assess mathematics skill; in the later grade, however, the very same item may instead assess memory!

Unfortunately, due to the case-specific nature of change, only the researchers themselves can answer these questions. Nonetheless, it is imperative that researchers distinguish personal change from stimulus change, and that they are cognisant of the impact of both when formulating their research designs. It is important to highlight that the questions outlined above are not the specific research questions motivating this dissertation. Rather, they are 'thought questions' for researchers interested in studying change and growth, particularly via the use of time-variable measures (that can or cannot be linked).
Common Interpretation Problems in Change Studies

Regular growth models (often referred to as mixed-effect or HLM models, and discussed in more detail in Appendix A) involve estimating the initial status (y-intercept) and the growth rate (rate of change) for each individual test-taker. If one reads the output from any growth modelling analysis, one portion of output allows one to determine if the value of the participants' mean baseline measure score is significantly different from zero. This particular snippet is, generally, not of much interest to researchers. A separate snippet of output, however, allows one to determine if there is a significant main effect of time (wave) on the participants' across-wave scores.

In growth modelling, one is generally interested only in phenomena whose scores show marked increases or decreases across time. Very often, however, certain test-takers' scores start high and stay high. Is this type of performance as interesting to researchers as test-takers who start low and end medium? Or those who start medium and end high? To illustrate this point, imagine that a researcher investigates test-takers' performance on various tests of achievement over two waves: (a) numeracy, (b) reading, and (c) writing. Based on their performance in the first wave of testing, test-takers are divided into one of three performance categories: low, medium, and high. After the second wave of testing, the researcher computes each person's simple difference score (described more fully in Chapter 6) and presents the change scores in a table (please refer to Table 1).

Table 1
Test-Takers' Simple Difference Scores on Three Subtests, Grouped by Wave 1 Performance

                     Low Performance   Medium Performance   High Performance
Numeracy Subtest           7.35               2.65                2.30
Reading Subtest           15.00               4.95               -0.07
Writing Subtest            5.06               1.85                0.25

Note. Table inspired by Gall, Borg, and Gall, 1996.

By glancing at the table of results across subtests, the researcher notices that the test-takers who scored the lowest at Wave 1 showed the greatest improvement over time - as reflected by their larger simple difference scores, relative to those of the "medium" and "high" performance groups. Moreover, the test-takers who showed the best overall performance at Wave 1 showed the least amount of improvement over time, relative to the "low" and "medium" groups. How should the researcher interpret these data? Do the data mean that the students with the lowest initial achievement are likely to learn more than students in the "medium" and "high" groups? In some cases, answering the latter with "yes" is correct. In other cases, however, such findings may simply be an artefact produced by measurement error across waves (Gall et al., 1996).

Gall et al. (1996) outline five interpretation problems common to change studies, but particularly to those that involve two-wave data:

1. Ceiling Effect: This occurs when the range of difficulty of the test items is limited. Therefore, scores at the higher end of the possible score continuum are artificially restricted. In the example above, it is possible that the "high" performance group scored near the ceiling of the Wave 1 test. As such, they could only earn a small simple difference score across the two waves.

2. Regression Toward the Mean: This describes the tendency for test-takers who earn a higher score at Wave 1 to earn a lower score at Wave 2 (and vice versa).
This phenomenon occurs because of measurement errors across waves, and because the test scores are correlated with each other (Zumbo, 1999).

3. Assumption of Equal Intervals: Many change studies assume that, on a hypothetical 100-point test, a gain from 90 to 95 points is equivalent to a gain from 40 to 45 points. In reality, it is likely much more difficult to make a gain of five points at the higher end of the score continuum, for example, than at the mid-point of the same continuum.

4. Different Ability Types: Very often, a given test score reflects different types and levels of ability for different test-takers. Even though two students both earn a simple difference score of 15 on a two-wave mathematics assessment, it does not mean that they have the same pattern of strengths and weaknesses: one student may have improved his algebra over time, whereas the other may have improved his trigonometry. As such, researchers should be careful not to attribute equivalence to score trajectories that, in reality, do not represent the same thing.

5. Low Reliability: Another interpretation problem associated with change studies is that their scores are not reliable. (Due to the subjective nature of change studies, there is no consensus about what is considered low reliability. As a general rule of thumb, however, Singer and Willett (2003) state that reliability = 0.8 or 0.9 is "reliable enough" (p. 15) for the study of change.) For reasons outlined by Zumbo (1999), the higher the correlation between the scores across waves, the lower the reliability of the change scores. It should be noted, however, that Zumbo (1999) explains why this particular criticism of change scores can, in some cases, be unmerited.

According to Linn and Slinde (1977), test-takers who yield exceptionally high or low change scores are often identified so that they receive some sort of special treatment. That said - if not from a statistical standpoint, then from a practical one - is it not just as impressive for a high achiever to perform consistently well across waves as it is for a low achiever to show substantial change across waves? Recall from an earlier section that constancy of performance is often regarded as less important (or less interesting) than performance that shows systematic change. For reasons outlined in this section, perhaps educational and social science researchers' stance on what is considered exceptional performance across waves needs to be reconsidered.

How Should One Conceptualise Change?

Change relative to a score earned in the first wave - whether it be in two-wave or multi-wave contexts - is the typical way in which educational and social science researchers conceptualise change. Is this, however, the best way to think about change, particularly when one considers the various interpretation problems associated with change studies? Recall an earlier example in which the numeracy achievement of Joe Smith is studied across three waves: Grades 4, 7, and 10. Imagine that Joe's actual performance is plotted in a line graph, as depicted in Figure 1.

Figure 1. A hypothetical test-taker's performance across three waves (Grades 4, 7, and 10) of a simulated mathematics assessment (solid line = actual scores, dashed line = line of best fit).

Joe's actual test scores, plotted using a solid line, show that his mathematics achievement increased steadily across all three waves.
If one "chunks down" his performance into two sets of two waves, however, it is possible to see that Joe's performance improved more between Grades 7 and 10 than between Grades 4 and 7. When conducting any change and growth analysis, the researcher must decide on the fit function she will use with the data. Oftentimes, the chosen fit function is that of a straight line - sometimes called a linear fit function. (There are several additional types of fit functions; please refer to Singer and Willett (2003) for more detail.) By fitting a linear fit function to Joe's overall performance, as shown by the dashed line, the across-wave improvement in his scores is indeed still visible; however, the spike in Joe's performance between Grades 7 and 10 is masked.

Choosing one single fit function, particularly for data sets containing thousands of cases with variable trajectories, can often be a difficult task - and is discussed in more detail by Singer and Willett (2003). Before selecting a fit function, however, educational and social science researchers should first ask themselves one question: Should educational and social science researchers perpetually view change as that which is gained, on average, across all waves (as is typically done in growth modelling analyses), or is there some (or even more) utility in this pair-wise or "chunking down" approach? Once again, due to the case-specific nature of change, only researchers with intimate knowledge of a given data set can answer this question.

In summary, this chapter "unpacked" many of the foundational issues surrounding the process of change. Such issues involved distinguishing (a) 'amount' from 'quality' of change, (b) 'constancy' from 'change', and (c) 'personal' from 'stimulus' change. Also presented were various important 'thought questions' one should pose when conducting any study focussed on exploring change and growth (and in particular those using the exact same measure across waves) and a review of interpretation problems common in studies of change. This chapter concluded by asking if the current way in which many educational and social science researchers conceptualise change (i.e., that change is always relative to a score earned in the first of two or more waves) requires reconsideration.

In the next chapter, each of the major types of test linking methods is described: (1) equating, (2) calibration, (3) statistical moderation, (4) projection, and (5) social moderation, respectively. Chapter 3 also serves as a backdrop for the remainder of the dissertation, because many of the concepts and terms presented in that chapter are revisited in later chapters.

Chapter 3: Five Types of Test Linking

As mentioned in Chapter 1, test linking refers to the process of systematically linking or connecting the scores of one test (Test X) to the scores of another (Test Y). (Recall from Chapter 1 that the direction of the test linkage should be unimportant.) Figure 2 illustrates an example in which the scores of Tests X and Y are linkable linearly. In this example, a raw score of 50 on Test X can be interpreted in the same way as a raw score of 60 on Test Y, and so on and so forth along the score continua.

Figure 2. Illustrating the fit function that links the scores on two versions of a hypothetical 100-item test (modified from Kolen & Brennan, 2004). The arrows show that the direction of the linkage is unimportant when test linking.
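Where the linking relationship is linear, as in Figure 2, the link can be written compactly. The intercept and slope shown below are illustrative assumptions only - they are one of many lines consistent with the single correspondence mentioned above (a raw score of 50 on Test X interpreted as a 60 on Test Y) - and are not taken from Kolen and Brennan (2004):

```latex
% A hypothetical linear linking function relating Test X raw scores to Test Y.
% The values a = 10 and b = 1 are illustrative assumptions, chosen only so that
% l_Y(50) = 60, matching the example correspondence described in the text.
\[
  l_Y(x) \;=\; a + b\,x, \qquad \text{e.g.,}\quad l_Y(x) = 10 + 1\cdot x
  \;\Rightarrow\; l_Y(50) = 60 .
\]
```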
The relationship between Test X's and Test Y's scores does not necessarily have to be linear, or even mathematical for that matter (so long as the chosen fit function reflects the form of the data). Although rarely stated explicitly, one ultimately performs test linking with an eye towards producing what is termed a conversion table (correspondence table), which indicates the specific Test Y score that is considered equivalent to a given score on Test X (and vice versa). Using a conversion table, such as the example provided in Table 2, one can easily spot that a score of 30 on Test X means the same thing as a score of 31 on Test Y.

Table 2
Example Conversion Table for Test X and Test Y Scores

Test X Raw Score    Test Y Raw Score
      30                  31
      31                  32
      32                  33
      33                  34
      34                  35

The phrase test scaling is often used synonymously with test linking but, in fact, the two phrases mean different things. Test scaling refers specifically to the process of transforming raw test scores into new sets of numbers with given attributes, such as a particular mean and standard deviation (Lissitz & Huynh, 2003), thus increasing the interpretability of the test scores (Kolen & Brennan, 2004).

As Linn (1993) notes, there are numerous other terms used to refer to the basic concept of test linking: anchoring, benchmarking, calibration, concordance, consensus moderation, equating, prediction, projection, scaling, statistical moderation, social moderation, verification, and auditing. Unfortunately, not all of these terms have had precise definitions and technical meanings (Linn, 1993). It is, however, generally accepted that these numerous terms fall into five broad types of test linking (listed in decreasing order of statistical rigour): equating, calibration, statistical moderation, projection, and social moderation, respectively (Kolen & Brennan, 2004; Linn, 1993). Given that equating is the most rigorous and widespread of the five types, this chapter discusses equating in more detail than the other test linking methods.

Although Chapter 1 presents test linking in the context of repeated measures designs, it should be stressed that test linking is certainly not limited to longitudinal designs alone. Test linking may be used, for example, to connect the test scores of one district to the scores of another district (e.g., if the two districts use completely different assessments) or in the creation of parallel measures.

As is elucidated in subsequent sections of this chapter, none of these five test linking methods proves to be a suitable solution to the problem of analysing change/growth with time-variable measures (particularly those that cannot be linked). Consequently, these five methods are presented here for the purpose of providing readers with a brief overview of test linking, and with specific reasons why each is insufficient in the study of change and growth with time-variable measures. By highlighting each linking method's shortcomings in terms of the motivating problem, the rationale and need for the Conover solution presented in Chapter 5 are established.

Equating

Equating, the first type of test linking discussed in this chapter, has the most stringent technical requirements (Linn, 1993; Linn & Kiplinger, 1995). When different tests have been equated, it is surmised that it is a matter of indifference to the test-takers which particular version of the measure they write (Kolen & Brennan, 2004).
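Once a linking function is in hand, a conversion (correspondence) table of the kind shown in Table 2 can be produced mechanically. The short sketch below is a hypothetical illustration only: the function name, the linear form of the link, and the one-point offset are assumptions chosen to reproduce Table 2, not a procedure drawn from the cited sources.

```python
def link_x_to_y(x_score):
    """Hypothetical linking function: a Test X raw score is assumed here to
    correspond to a Test Y raw score one point higher (as in Table 2)."""
    return x_score + 1

# Build a small conversion (correspondence) table for Test X raw scores 30-34.
conversion_table = {x: link_x_to_y(x) for x in range(30, 35)}

for x_score, y_score in conversion_table.items():
    print(f"Test X raw score {x_score}  <->  Test Y raw score {y_score}")
```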
Although there are six different types of equating (described in a later section), each type involves following this general procedure (Kolen & Brennan, 2004; Mislevy, 1992):

1. One must decide on the purpose for equating.

2. One must construct alternate forms from the same test blueprint, ensuring the same content and statistical specification. At this step, it is also important to define clearly the expected correspondence among the scores of tests (Schumacker, 2005).

3. One must choose a data collection design and implement it. Equating requires that data are collected in order to determine how measures vary statistically (Kolen & Brennan, 2004).

4. One must choose one or more operational definitions of equating. Here, one makes the choice about which specific type of equating is the most appropriate.

5. One must choose one or more statistical estimation methods.

6. Finally, one must evaluate the results of equating. Kolen and Brennan (2004) offer various equating evaluation procedures.

As the above procedure suggests, test equating is rooted in ways in which the tests themselves are constructed, and not in the statistical procedure (Mislevy, 1992). As Pommerich et al. (2004) note, if test scores are not considered equatable, "there is a higher probability of misuse of the linked scores because the interpretation and usage of results is less straightforward" (p. 248).

Two types of error influence the interpretation of the results of equating. The first type of error, random equating error, is unavoidable because samples are used to estimate population parameters (e.g., means and standard deviations). This type of error can be reduced, however, by using large samples of test-takers (please refer to Chapter 1 for a recommendation on one's minimum sample size) and by choosing carefully the type of equating design. The second type of error, systematic equating error, results from the violation of assumptions and conditions specific to the equating methodology one implements. Unlike random error, which can be quantified using standard error calculations, systematic error is more difficult to estimate (Schumacker, 2005).

Two Equating Research Designs

The most common data collection designs that are used for equating are the random groups design and the common-item non-equivalent groups design. The former utilises a 'spiralling process' in which alternate test-takers are administered alternate forms of the exam (Schumacker, 2005). For example, Test X may be given to the first test-taker, Test Y to the second test-taker, Test X to the third test-taker, and so forth. When the random assignment of the two test forms to test-takers is used, the two groups (Test X group and Test Y group) are considered equivalent in proficiency. Any statistical differences across the two groups on the tests are interpreted as a difference in the test forms. For example, if the Test X group performed better overall than the Test Y group, it is assumed that Test X was easier than Test Y (Braun & Holland, 1982). Random groups designs are often preferable to single groups designs; however, much larger samples are necessary to obtain stable parameter estimates, hence limiting their use in many testing situations (Schumacker, 2005).

If only one version of a test can be administered on a given date, the common-item non-equivalent groups design can be used. In this case, the test forms are not assigned randomly to the two groups of test-takers; hence, it cannot be assumed that the two groups have the same overall proficiency.
Therefore, it is necessary to identify any proficiency difference between the two groups. To do this, a subset of common test items (i.e., anchor items) is placed on both test forms. Because these common items are administered to all examinees, irrespective of group, they can be used to estimate differences in proficiency across the groups. Once the group differences have been identified, any remaining score differences can be interpreted as a difference in the difficulty of the two test forms and, hence, the scores are comparable directly (Haertel, 2004). Equating can then properly adjust for these test form differences.

Six Types of Equating

Kolen and Brennan (2004) discuss six types of equating, and the situations in which each is appropriate. The six methods - linear, equipercentile, identity, mean, two- or three-parameter IRT equating, and Rasch equating - are reviewed respectively. It should be noted that all six types of equating can be implemented in both random groups and common-item non-equivalent groups designs.

(i) Linear equating. Linear equating, perhaps the most widely used equating function, allows for differences in difficulty between the two measures to vary along the scale score. For example, linear equating allows Test X to be more difficult than Test Y for low-achieving test-takers, but easier for high-achieving test-takers (Kolen & Brennan, 2004). Scores that are an equal (signed) distance from their respective means, in standard deviation (z score) units, are considered equal. Thus, linear equating allows for the scale units, as well as the means, of the two measures to differ (Kolen & Brennan, 2004). Two tests are considered equated when the mean and variance of the distributions of test scores are the same across tests (Bolt, 1999). In either equating research design, linear equating may be used when:

1. There are adequate control and standardisation procedures, and alternate forms are built to the same specifications;
2. There are small samples;
3. There are similar test form difficulties;
4. One desires simplicity in the conversion tables or equations, in conducting analyses, and in describing procedures to non-psychometricians; and
5. It is important to have accuracy of the results near the mean scale score.

(ii) Equipercentile equating. Equipercentile equating involves identifying scores on Test X that have the same percentile ranks on Test Y (Kolen & Brennan, 2004). Equivalence is achieved when the scores from Tests X and Y are at the same quantile (points taken at regular vertical intervals from the cumulative distribution function of a random variable) of their respective distributions over the target population, rather than merely having the same z-scores (von Davier et al., 2004). In either equating research design, equipercentile equating may be used when:

1. There are adequate control and standardisation conditions, and alternate forms are built to the same specifications;
2. There are large samples;
3. The test forms can differ in difficulty more than for a linear method of equating;
4. One can tolerate complexity in the conversion tables or equations, in conducting analyses, and in describing procedures to non-psychometricians; and
5. The accuracy of results along all scale scores is important (Kolen & Brennan, 2004).

(iii) Identity equating. Identity equating is perhaps the easiest type of equating conceptually.
In essence, a score on Test X is thought to be equivalent to the identical score on Test Y. For example, a score of 15 on Test X is thought to mean the same thing as a score of 15 on Test Y. It should be noted that identity equating is thought to be the same as mean equating (described below) and linear equating if the two forms are identical in difficulty along the scale score (Kolen & Brennan, 2004). In either equating research design, identity equating can be used when:

1. There are poor quality control or standardisation conditions;
2. There are very small sample sizes, or no data at all (e.g., data are not available across the entire range of possible scores);
3. There are similar test form difficulties;
4. One desires simplicity in conversion tables and equations, in conducting analyses, and in describing procedures to non-psychometricians; and
5. One can tolerate possible inaccuracies in results.

(iv) Mean equating. In mean equating, Test X's difficulty differs from Test Y's by a constant amount along the scale score. For example, if Test X is three points easier than Test Y for higher-achieving test-takers, it is also three points easier for lower-achieving test-takers. In other words, mean equating involves the addition of a constant to all raw scores on Test X to find equated scores on Test Y (Kolen & Brennan, 2004). With mean equating, scores on the tests that are equal (signed) distances from their respective means are deemed to be equivalent (Kolen & Brennan, 2004). In either equating research design, mean equating is appropriate when:

1. There are adequate control and standardisation procedures, and alternate forms are built to the same specifications;
2. There are very small samples;
3. There are similar test form difficulties;
4. One desires simplicity in the conversion tables or equations, in conducting analyses, and in describing procedures to non-psychometricians; and
5. It is important to have accuracy of the results near the mean.

(v) Two- or three-parameter logistic IRT equating. Item response theory (IRT) deals with modelling the response of a test-taker to a test item. Lord (1982) writes that IRT is useful for designing tests, for selecting items, for describing and evaluating items and tests, for optimal scoring of the test-takers' responses, for predicting the test scores of test-takers and of groups of test-takers, and for interpreting and manipulating test scores. Each test item is described by a set of parameters that can be used to depict the relationship between an item score and some latent trait through the use of an item characteristic curve (ICC).

When it is assumed that test-takers' ability is described by a single latent variable or dimension (referred to as 'theta'), two- or three-parameter logistic models can be used to equate tests whose items are scored dichotomously, such as the ubiquitous multiple-choice examination (where 0 = incorrect, 1 = correct). The use of one latent variable implies that the construct being measured by the tests is unidimensional (meaning, within the context of IRT, that the tests measure only one ability). As is likely intuitive from the name, a three-parameter model focuses the attention on three specific parameters. The a parameter (the discrimination parameter) represents the degree to which a given item discriminates between test-takers with different levels of theta. The b parameter (the difficulty or location parameter) determines how far left or right on the theta scale the ICC is positioned.
With logistic models, b is often reported as representing the point on the ability scale where the probability of a correct response is 50%. More accurately, b represents the half-way point between the c parameter and 1.0. The c parameter (the lower-asymptote or pseudo-chance parameter) describes the probability that a test-taker with very low ability will correctly answer a given item. This low end of the ability continuum is often influenced by test-takers' guessing on the given item (Hambleton, Swaminathan, & Rogers, 1991). A two-parameter model also involves these three specific parameters but, in this case, the c parameter is set to zero. For the purpose of equating, the three-parameter logistic model is favoured (Stroud, 1982), because it is the only one that "explicitly accommodates items which vary in difficulty, which vary in discrimination, and for which there is a nonzero probability of obtaining the correct answer by guessing" (Kolen & Brennan, 2004, p. 160).

As Kolen and Brennan (2004) note, when one uses IRT to equate with non-equivalent groups of test-takers, the parameters from the different tests need to be on a common IRT scale. Often, however, the parameter estimates that result from IRT are on different IRT scales. For example, imagine that the IRT parameters estimated for Test X are based on Population 1, and the parameters estimated for Test Y are based on a non-equivalent population, called Population 2. Statistical software packages often define the theta scale as having M = 0 and SD = 1 for the data being analysed. So, for this scenario, the abilities for each population would be scaled so that they both had thetas with M = 0 and SD = 1, even though the populations' respective abilities are different. For this reason, Kolen and Brennan (2004) provide details on transforming IRT scales.

In either equating research design (random groups or common-item non-equivalent groups), two- or three-parameter logistic IRT equating is appropriate when:

1. There are adequate control and standardisation procedures, and alternate forms are built to the same specifications;
2. There are large samples;
3. Test forms differ in difficulty level more than for a linear method;
4. One can tolerate complexity in the conversion tables, in parameter estimation, in conducting analyses, and in describing procedures to non-psychometricians;
5. One can tolerate a computationally-intensive item parameter estimation procedure (this problem is mitigated if the item parameter estimates are needed for other purposes - for example, test construction);
6. Accuracy of results is important all along the score scale; and
7. The IRT model assumptions hold reasonably well.

(vi) Rasch equating. The one-parameter logistic IRT model is often referred to as the Rasch model. Whereas the two- and three-parameter logistic models involve estimations of the a and b parameters and of the a, b, and c parameters, respectively, the Rasch model deals with just one item parameter: b (difficulty). In a sense, the Rasch model estimates item difficulties free of the effects of the abilities of the test-takers. Arguably the most widely used IRT model, the Rasch model is often favoured because of its simplicity (Bejar, 1983). Stroud (1982) notes that the one-parameter IRT method works best when equating a test to itself (e.g., same test over time with different examinees), adding that it is "better than any other method when the anchor is either easier or harder than the test being equated" (p. 137).
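For readers who prefer to see the models written out, the three-parameter logistic and Rasch (one-parameter) item response functions are shown below. This is the conventional formulation found in standard IRT texts (e.g., Hambleton, Swaminathan, & Rogers, 1991), not a formulation specific to any of the equating studies cited in this chapter; the two-parameter model is obtained by fixing c to zero.

```latex
% Three-parameter logistic (3PL) model: the probability that a test-taker with
% ability theta answers item i (with discrimination a_i, difficulty b_i, and
% pseudo-chance parameter c_i) correctly.
\[
  P_i(\theta) \;=\; c_i + (1 - c_i)\,
  \frac{\exp\!\big[a_i(\theta - b_i)\big]}{1 + \exp\!\big[a_i(\theta - b_i)\big]} .
\]

% Rasch (one-parameter) model: c_i = 0 and a common discrimination for all
% items, so only the item difficulty b_i remains.
\[
  P_i(\theta) \;=\; \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)} .
\]
```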
In either equating research design, Rasch equating is appropriate when:

1. There are adequate control and standardisation procedures, and alternate forms are built to the same specifications;
2. There are small samples;
3. There are similar test form difficulties;
4. One can tolerate complexity in the conversion tables, in parameter estimation, in conducting analyses, and in describing procedures to non-psychometricians;
5. Accuracy of results is important in the area that is not very far from the mean; and
6. The IRT model assumptions hold reasonably well.

Two Alternative Types of Equating

In addition to the six types of equating already described, Kolen and Brennan (2004) state that two alternative types of test equating do exist. These alternative types may only be applied, however, if one of two conditions is met: either when test-takers administered different scales are evaluated at the same time, or when score trends are to be evaluated over time. They caution readers, however, that their alternatives to equating are "typically unacceptable" (p. 6) for reasons elucidated below.

The first alternative is to report and to compare raw scores, irrespective of the actual form written by the test-taker. For example, imagine that a student earns a score of 27 on Test X administered in the first wave and a score of 30 on Test Y in the second wave. The obvious problem with this method, however, is that it becomes difficult to discern if the apparent improvement in scores is attributable to differences in the two forms, to differences in the achievement of the student, or to some combination of the two (Kolen & Brennan, 2004).

The second alternative is to convert raw scores to other types of scores, such that given characteristics of the scores are transferable among administrations of the tests. For example, imagine one set of test-takers' scores on Test X are transformed to z-scores at Wave 1, and then again at Wave 2. Because each of the score distributions is purposefully set to M = 0 and SD = 1, respectively, it is impossible to track changes in the group's overall performance over time.

As Kolen and Brennan (2004) note, the problems associated with each of these two alternatives can be remedied with proper equating. Equating, they state, adjusts for differences in the difficulty of test forms, such that the test scores can be interpreted in the same way, irrespective of when the tests were administered or of the group of students who wrote each test. Equating does not, however, adjust for differences in content. Because differences in content are inherent in the context of achievement testing (i.e., the items must change, by definition, commensurate with the grade level of the test-takers), it is clear that equating is ineffectual for the problem motivating this dissertation: analysing change and growth with time-variable measures (particularly those that cannot be linked).

Calibration

The second type of test linking is called calibration. When different tests are calibrated, they are purported to measure the same construct (e.g., mathematics achievement), but with different accuracy or in different ways (e.g., comparing the performance of students across grade levels).
Similar to equating, calibration requires that the different tests all measure the same construct; however, whereas successful equating yields scale scores that can be used interchangeably for any purpose, successful calibration simply means that the results of each test are mapped to a common variable, matching up the most likely score of a given test-taker on all tests (Mislevy, 1992). Furthermore, unlike equating, the reliabilities for each test's scores may differ.

As Mislevy (1992) describes, there are three distinct cases of calibration - each having its own procedure:

1. Case 1: One should use the same content, format, and difficulty blueprint to construct tests, but with more or fewer items on each test. It is the expected percents correct that are calibrated.

2. Case 2: One should collect tests from a collection of items that fit an item response theory (IRT) model satisfactorily. Inferences from the test scores should be carried out in terms of the IRT proficiency variable. Mislevy (1992) offers a surprising example of this case: fourth- and eighth-grade geometry tests connected by an IRT scale with common items. As has been highlighted previously, calibration has little use in the context of this dissertation, given that it is highly unlikely to find measures from different grade levels (e.g., Grade 4 and Grade 7 mathematics tests) that share any number of common items.

3. Case 3: One should collect judgements on a common, more abstractly-defined variable. The consistency of these judgements should then be verified.

Vertical scaling (vertical "equating") is a subcategory of calibration methods, and refers to the process of scaling tests with different difficulties for groups of individuals with different abilities, usually in different grades (Pomplun, Omar, & Custer, 2004). (In contrast, horizontal scaling involves equating tests of different forms or at different times of a single grade; Leung, 2003.) Put another way, "Vertical scales are created through administering an embedded subset of items to different students at two educational levels, typically one year apart, and linking all the items at the two levels to a common scale through the comparative performance of the two groups of students on the common items" (Schafer, 2006, p. 1). Braun and Holland (1982) describe vertical scaling as the placing of scores from tests of widely different difficulty on some sort of common metric, thus allowing the scores of test-takers at different grade levels to be compared. This common metric against which the test-takers' scores can be compared is often called a developmental scale. Because the content of each of the tests varies across groups of test-takers at different educational levels, vertical scaling cannot be used for the purpose of making the test forms themselves interchangeable (Kolen & Brennan, 2004). Leung (2003) describes an approach to vertical scaling that combines both linear and IRT equating.

At first glance, vertical scaling appears to be the solution to the very problem that has inspired this dissertation - analysing change and growth with time-variable measures. Recall, however, that differences in the measures' content, item wording, response categories, etc. are not permissible in the study of change and growth if one heeds the recommendations of Willett et al. (1998). Furthermore, Martineau (2006) warns that:

many assessment companies provide such vertical scales and claim that those scales are adequate for longitudinal [modeling].
However, psychometricians tend to agree that scales spanning wide grade/developmental ranges also span wide content ranges, and that scores cannot be considered exchangeable along the various portions of the scale (p. 35).

Martineau (2006) adds that vertical scaling can lead to "remarkable distortions" (p. 35) in the results, because "the calibration requirement that two tests measure the same thing is generally only crudely approximated with tests designed to measure achievement at different achievement levels" (Linn, 1993, p. 91). Schafer (2006, pp. 1-3) also discusses several deficiencies in vertical scales, noting, as he states, that:

1. They unrealistically assume a unidimensional trait across grades;
2. Scale development includes out-of-level testing, and therefore lacks face validity;
3. Lower-grade tests that are eventually implemented will have invalid content representation for higher-grades' curricula;
4. Scores for students in lower grade levels are overestimated due to lack of data about inabilities over contents at higher grade levels;
5. Average growth is uneven for different adjacent grade-level pairs;
6. Differences between achievement-levels change from grade-to-grade;
7. Achievement-level growth is uneven for the same achievement level for different adjacent grade-level pairs;
8. Interval-level interpretations between grades are not grounded, either through norms or through criteria;
9. Achievement-level descriptions of what students know and can do for identical scores are different for different grade levels;
10. Decreases in student scores from year-to-year are possible;
11. Comparable achievement level cut-scores can be lower at a higher grade level;
12. If they come from different grades, students with the same scores have different growth expectations for the same instructional program;
A s is likely obvious from the description of this type of test linking, statistical moderation cannot be used when one measures change and growth with time-variable measures (that can or cannot be linked) particularly because the constructs measured at each wave need not be the same. 47 Projection The fourth type of test linking is called projection (prediction), which is regarded as one of the weakest forms of statistical linking (Linn, 1993). Similar to statistical moderation, the various tests do not measure the same construct. According to Mis levy (1992), after observing the scores on Test Y , one can then calculate what would be likely scores on Test X . This type of linking involves administering different measures to the same set of test-takers and then estimating the joint distribution among the scores. Projection has largely been criticised in the test linking literature. A s Mis levy (1992) notes, "projection sounds precarious, and it is. The more assessments arouse different students' knowledge, skills, and attitudes, the wider the door opens for students to perform differently in different settings" (p. 63). Wi th projection, one can neither equate nor calibrate the measures to a common metric because the time-variable measures are not purported to even measure the same construct. A s such, it is evident that this type of test linking cannot be used when one measures change and growth with time-variable measures (that can or cannot be linked). Social Modera t ion The fifth type of test linking is called social moderation (concordance, consensus moderation, auditing, and verification), and is the least rigorous test linking method discussed. A s is the case with statistical moderation and projection, social moderation does not require that the construct tapped by each of the different measures is the same; rather, the scores from distinct tests that measure related, but different, constructs are linked (Pommerich & Dorans, 2004). 48 This type of linking typically involves judges' rating the performance on the distinct tests using some common framework and then interpreting the performances according to some common standard (Linn, 1993) for the purpose of determining which levels of performance across tests are to be treated as comparable (Mislevy, 1992). This common standard can be difficult to achieve, however, particularly i f standards of performance vary across regions (e.g., across school districts or provinces). A s has been highlighted in the five descriptions above, only equating and calibration require that the construct measured is the same or similar across waves - thus taking statistical moderation, projection, and social moderation 'out of the running' as viable solutions to the motivating problem. O f equating and calibration, the former is regarded as the stronger and more rigorous, but neither is sufficient in the study of change and growth with time-variable measures (for reasons outlined previously and also summarised in the next section). Summary: Selecting the Appropr ia te Type of Test L i n k i n g Method In the previous sections of this chapter, five of the most common types of test linking methods were described: equating, calibration, statistical moderation, projection, and social moderation, respectively. Kolen and Brennan (2004), L inn (1993), and Mis levy (1992) compare and contrast these five test linking methods in regards to four specific test features: inferences, constructs, populations, and measure characteristics, respectively. 
In this section, these features are reviewed one by one, with an eye to synthesising the information provided in earlier sections of this chapter, and, hence, to assisting readers in selecting the appropriate test linking method: 1. Inferences: This test feature refers to the extent to which the scores for two tests are used to draw similar types of inferences. Put another way, this feature asks i f the two tests share common measurement goals that are operationalised in scales intended to yield similar inferences (Kolen & Brennan, 2004); 2. Constructs: This test feature relates to the extent to which two tests measure the same construct. In other words, are the true scores for the two tests related functionally? In many test linking contexts, the two tests share common constructs, but they also assess unique constructs (Kolen & Brennan, 2004); 3. Populations: This test feature relates to extent to which the two tests being linked are designed to be used with the same population. Two tests may measure essentially the same construct, but are not necessarily appropriate for the same populations (Kolen & Brennan, 2004); and 4. Measure Characteristics: This test feature refers to the extent to which the two tests share common measurement characteristics or conditions. According to Kolen and Brennan (2004), these characteristics or conditions are often called facets. Such facets may include test length, test format, administration conditions, etc. In table 3, Kolen and Brennan (2004) summarise how each o f the five test linking methods compare, in terms of these four features. L inn (1993) provides a similar 'taxonomy table', which has been presented in Table 4 . 1 6 Kolen and Brennan (2004) acknowledge that the degrees of similarity of the test linking methods, as depicted in these tables, are sometimes ambiguous, cautioning that: 1 6 The contents of Tables 2 and 3 have been reworded and/or reformatted only very slightly from their original versions for clarity and consistency. 50 1. "context matters, and there is not a perfect mapping of the taxonomy categories and degrees of similarity" (p. 435); 2. "there is no one 'right' perspective, and uncritical acceptance of any set of linking categories is probably unwarranted" (p. 436); and 3. "the demarkation between categories can be very fuzzy, and differences are often matters of degree" (p. 436). Table 3' Kolen and Brennan's (2004) Comparison of the Similarities of Five Test Linking Methods on Four Test Facets Test Linking Method Inferences Constructs Populations Measure Characteristics Equating Same Same Same Same Calibration Same Same/Similar Dissimilar Same/Similar Statistical moderation Dis(similar) Dis(similar) Dis(similar) Dis(similar) Projection Dis(similar) Dis(similar) Similar Dis(similar) Social moderation Same Similar Same/Similar Dis(similar) Note. Test linking methods have been listed in decreasing order of statistical rigour. A s Table 3 shows, equating requires that the inferences made from both versions of a measure are the same, that the constructs both measures are purported to tap are the same, that the populations used for each measure are the same, and that the characteristics of the measures are the same (and so on and so forth for the remainder of the linking methods 51 represented along the vertical). This table highlights that only equating and calibration require that the construct measured over time is the same or similar across waves. 
Thus, statistical moderation, projection, and social moderation are not viable solutions to the motivating problem. Of equating and calibration, only the latter allows one to use dissimilar populations - suggesting, at first glance, that it would be an appropriate strategy for dealing with time-variable measures (particularly those that cannot be linked). For reasons outlined above, however, neither equating nor calibration is sufficient in the study of change and growth with time-variable measures - primarily because equating places too many restrictions on the allowable content and psychometric properties of the measures and their respective scores, and because calibration can lead to distortions in the results [for reasons aforementioned by Martineau (2006) and Schafer (2006)].

Table 4
Linn's (1993) Requirements of Different Techniques in Linking Distinct Assessments

Requirements for Assessment                                            E      C      SM1    P      SM2
1. Measure the same thing (construct)                                  Yes    Yes    No     No     No
2. Equal reliability                                                   Yes    No     No     No     No
3. Equal measurement precision throughout the range of levels
   of student achievement                                              Yes    No     No     No     No
4. A common external examination                                       No     No     Yes    No     No
5. Different conversion to go from Test X to Y than from Y to X       No     Maybe  N/A    Yes    No
6. Different conversions for estimates for individuals and for
   group distributional characteristics                                No     Yes    No     Yes    No
7. Frequent checks for stability over contexts, groups, and time
   required                                                            No     Yes    Yes    Yes    Yes
8. Consensus on standards and on exemplars of performance             No     No     No     No     Yes
9. Credible, trained judges to make results comparable                No     No     No     No     Yes

Note. E = Equating, C = Calibration, SM1 = Statistical Moderation, P = Projection, and SM2 = Social Moderation.

As Table 4 shows, equating requires that: (1) the two measures tap the same construct, (2) the measures' scores yield equal reliability, and (3) there is equal measurement precision throughout the range of levels of student achievement. Equating does not, however, require: (4) a common external examination, (5) different conversion to go from Test X to Y than from Y to X, (6) different conversions for estimates for individuals and for group distributional characteristics, (7) frequent checks for stability over contexts, groups, and time, (8) consensus on standards and on exemplars of performance, nor (9) credible, trained judges to make results comparable (and so on and so forth for the remainder of the linking methods represented along the horizontal). Like Table 3, Table 4 also highlights that only equating and calibration require that the construct measured over time is the same (or similar) across waves. Irrespective of the choice between the two, one still faces the need to change the item wording, response categories, etc., across waves - which Willett et al. (1998) state renders the test scores unusable for the purpose of studying change and growth.

In summary, Tables 3 and 4 provide the reader with helpful, but not definitive, descriptions of the various test linking methods (Kolen & Brennan, 2004) necessary when choosing a test linking method. Readers should also remain cognisant that, the further one departs from equating, the less rigorous the study becomes statistically. In other words, there is no free test linking lunch!
As these tables and the previous sections of this chapter illustrate, none of the five test linking methods serves as a feasible solution for the problem motivating this dissertation (analysing change and growth when the measures are time-variable, particularly when the measures cannot be linked), primarily because each one of the five test linking methods requires that the same or similar measures are used across waves. Unfortunately, missing from Kolen and Brennan's (2004) and Linn's (1993) work is a precise explanation about what is meant by "same" or "similar" measures: It is unclear if they mean that there must simply be the same primary dimension or latent variable driving the students' responses across waves or if this means that there must be anchor items common to all versions of the measure. Because of this uncertainty, it is clear that a solution to the problem of analysing change and growth with time-variable measures (particularly those that cannot be linked) is necessary.

Chapter 4: Seven Current Strategies for Handling Time-Variable Measures

With an eye toward summarising how researchers are handling the problem of analysing change and growth with time-variable measures in real-life research settings, this chapter discusses seven strategies presented in the current literature as means of handling the motivating problem - each of which is accompanied with real-life examples. Whereas the previous chapter spoke about test linking in a general sense, this chapter focuses on various strategies that educational researchers are using to handle the motivating problem specifically. As later sections of this chapter elucidate, many of these seven strategies can only be applied to time-variable measures that can be linked. Furthermore, the strategies that can be applied to non-linkable time-variable measures have either not been explained fully by the cited author, or have been criticised in the literature for various reasons (outlined in later sections). Therefore, the primary message of this chapter is that there is a need for more work targeted at finding a practicable solution to the problem of analysing change and growth with time-variable measures (particularly those that cannot be linked).

As the reader will note, the seven strategies presented here do not necessarily adhere to the conditions laid out by Willett et al. (1998) and von Davier et al. (2004), as described in Chapter 1. It would appear that both sets of authors would chide that, when dealing with time-variable measures (particularly those that cannot be linked), researchers are simply at an impasse: Repeated measures analyses of developmental data should altogether cease to occur, because there is no adequate way in which to equate the scores of the time-variable measures. As many educational psychologists, psychometricians, social scientists, and statisticians alike would agree, however, a 'non-solution' to this problem is hardly satisfactory and is, moreover, wholly impractical given the widespread need for longitudinal achievement assessment. As such, the seven strategies below are discussed in terms of both their underlying methodologies and data demands, for the purpose of showing their limitations in certain contexts - thus establishing the need for the Conover solution presented in Chapter 5. (Unfortunately, authors often do not report the specifics about their test linking methodologies or the data demands, making it challenging to provide details of these topics in this dissertation. Throughout this chapter, if a test linking method is not explained in technical detail, it is because the author of the cited paper does not provide the relevant details; often the test linking is performed by large-scale testing companies, not by the authors themselves.) It should also be noted that not all of these seven strategies can be slotted into one of the five test linking methods described in Chapter 3.
Vertical Scaling

The first alternative for handling the problem of analysing change/growth with time-variable measures (specifically those that can be linked) is to use vertical scaling, which Chapter 3 describes as the process of scaling tests with different difficulties for groups of individuals with different abilities, usually in different grades (Pomplun et al., 2004). Braun and Holland (1982) describe vertical scaling as placing test scores from widely different difficulties on some sort of common metric, thus allowing the scores of test-takers at different grade levels to be compared (creating something akin to a developmental scale onto which the scores of test-takers can be compared across waves). Because the content of each of the tests varies across groups of test-takers at different educational levels, vertical scaling cannot be used for the purpose of making the test forms themselves interchangeable (Kolen & Brennan, 2004). More specifically, scores from vertically-scaled levels cannot be used interchangeably because levels typically differ in content, and because an individual student would be measured with different precision at different levels (Kolen, 2001).

Kolen (2001) discusses using vertical scaling to link tenth-graders' PLAN data to their scores on the ACT Assessment (a college entrance examination) in the twelfth grade, and, more specifically, to determine the extent to which PLAN scores were on the same metric as ACT Assessment scale scores. Clemans (1993) also used vertical scaling, in this instance to link the California Achievement Tests (CATs) given annually to students from Grades 1 to 12. Unfortunately, these articles do not provide further explanation about the specifics of their chosen methodologies, so attempting to replicate their procedure with one's own data is impossible.
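Because the cited articles do not describe their procedures, the sketch below is not a reconstruction of them; it is only a deliberately simplified illustration of the general idea behind a common-item vertical link, in which each grade's total score is expressed on the metric of the embedded anchor items so that growth can be read off a shared developmental scale. All data and function names are hypothetical, and operational vertical scales are typically built with IRT rather than the crude classical adjustment shown here.

```python
import numpy as np

def to_common_metric(total_scores, anchor_scores):
    """Express a grade-specific total score on the metric of the embedded
    anchor (common) items, via a within-grade mean-sigma relation.
    A crude classical illustration only."""
    total_scores = np.asarray(total_scores, dtype=float)
    anchor_scores = np.asarray(anchor_scores, dtype=float)
    z = (total_scores - total_scores.mean()) / total_scores.std()
    return anchor_scores.mean() + anchor_scores.std() * z

# Hypothetical data: each grade writes its own 40-item test, 10 items of which
# are common anchor items administered to both grades.
rng = np.random.default_rng(1)
g4_total = rng.normal(25, 5, 500)
g4_anchor = rng.normal(5.5, 1.5, 500)
g5_total = rng.normal(27, 5, 500)
g5_anchor = rng.normal(7.0, 1.5, 500)

g4_on_common = to_common_metric(g4_total, g4_anchor)
g5_on_common = to_common_metric(g5_total, g5_anchor)

# Average growth from Grade 4 to Grade 5 on the common (anchor) metric.
print(g5_on_common.mean() - g4_on_common.mean())  # roughly 1.5 anchor-score points
```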
The grade level of the test-taker (and, hence, of the test) is added before the two-digit test score to aid interpretation - so the final test-score can be either three or four digits in length. 58 In a second example, Washington State linearly transforms original, grade-specific test scores (in logit form ) using the 'proficient' and 'advanced' cut points to anchor the scale. Scale scores for the other cut points appear wherever they fell using the transformation (Schafer, 2006). Schafer and Twing (2006) have combined Texas' and Washington's approaches by using grade-level tests and generating from them three (or four) digit scores (like Texas), but using relevant cut points to anchor the scale (like Washington). Imagine, for example, that a cut point of 40 = 'proficient' and a score of 60 = 'advanced'. It is then possible to "transform the underlying logit scale of the test to arrive at the transformation to the scale for the full range of the underlying logit scale. If it does not transform to remain within two digits for all grades, then adjustments could be made to the arbitrary choices of 40 and 60" (Schafer, 2006, p. 4). The test-taker's grade level is then added before the two-digit score such that a score of 440 at Grade 4 is just as 'proficient' as a score of 640 at Grade 6. For more about the development and usage of growth scales, please refer to Schafer (2006) and to Schafer and Twing (2006). Rasch Mode l l ing The third alternative for handling the problem of analysing change/growth with time-variable measures (particularly those that can be linked) is to use Rasch modelling techniques. For example, Afrassa and Keeves (1999) studied the analysis and scaling of Australian mathematics achievement data cross-sectionally over a 30-year period of time by the use of the Rasch model. A s described in Chapter 3, the Rasch model is thought to be the most robust of the item response models (Afrassa & Keeves, 1999), and identifies the dependent variable as the dichotomous response (i.e., the response score) for a particular 59 person for a specified item or test question. The independent variable is each person's trait score (theta) and the item's difficulty level (b). Embretson and Reise (2000) note that the independent variables combine additively, and the item's difficulty is subtracted from the participant's ability score (theta). It is the relationship of this difference to item responses that is the key focus of this type of modelling. It should be noted that, because unidimensionality of the test scores is necessary for Rasch modelling, Afrassa and Keeves (1999) began by examining the results of confirmatory factor analyses, and concluded that the test scores were, in fact, unidimensional (and purportedly this one dimension or latent variable represented mathematics achievement). Using the Rasch model, Afrassa and Keeves (1999) were able to bring the mathematics achievement scores of each of the participants to a common scale - a scale independent of both the samples of participants tested and the samples of items used. In another example, Jordan and Hanich (2003) used a similar strategy in their investigation of the reading and mathematics achievement and specific mathematical competencies of 74 participants tested across four waves during second and third grades. 
It should be noted that, because unidimensionality of the test scores is necessary for Rasch modelling, Afrassa and Keeves (1999) began by examining the results of confirmatory factor analyses, and concluded that the test scores were, in fact, unidimensional (and purportedly this one dimension or latent variable represented mathematics achievement). Using the Rasch model, Afrassa and Keeves (1999) were able to bring the mathematics achievement scores of each of the participants to a common scale - a scale independent of both the samples of participants tested and the samples of items used.

In another example, Jordan and Hanich (2003) used a similar strategy in their investigation of the reading and mathematics achievement and specific mathematical competencies of 74 participants tested across four waves during second and third grades. Participants' Woodcock-Johnson raw scores for each wave of data collection were transformed into Rasch-scaled scores, which were used for the longitudinal analyses to provide a common metric with equal interval properties. In yet another example, Notenboom and Reitsma (2003) used a similar strategy in their exploration of the spelling achievement of grade school students; however, rather than following one group of students longitudinally, they chose a cross-sectional design - using IRT to link the test scores of students in different grades at one point in time.

Ma and his colleagues also utilise this type of test linking in various studies. For example, using national data from the Longitudinal Study of American Youth (LSAY), Ma (2005) sought to explore changes in students' mathematics achievement over a six-year period of time, beginning when the participants were in the seventh-grade (1987-1988) and ending when they were in the twelfth-grade (1992-1993). Ma used students' scores on the LSAY's National Assessment of Educational Progress (NAEP) at each grade level to construct the outcome measure: the rate of growth in mathematics achievement during middle and high school. He adds, however, that "the LSAY staff calibrated test scores for each grade level through item response theory; therefore, test scores were comparable across middle school and high school years" (p. 79). Ma makes reference to this calibration in additional studies, as well (e.g., Ma & Ma, 2004; Ma & Xu, 2004). What is unclear, however, is precisely how the LSAY calibrated these scores.

Also using NAEP data, Ai (2002) investigated gender differences in mathematics achievement over time. A key feature of Ai's work is that it is one of only a few in the educational literature to adopt a three-level (also known as a simultaneous multilevel/individual growth model) approach: repeated measures (the Level 1 units) nested or grouped within students (the Level 2 units), who are, in turn, nested within schools (the Level 3 units). The LSAY study tracked children from two cohorts: Cohort 1 students (the older cohort) were tracked annually from Grades 10 to 12; Cohort 2 students (the younger cohort) were tracked annually from Grades 7 to 10. Ai, using the data from the younger cohort only, identified the outcome variable as the mathematics scores measured at each grade across the four waves, which were imputed, using an IRT technique, onto a scale ranging from
In essence, S E M incorporates information about both the group and individual growth, providing a means by which to model change and growth as a factor of repeated measurement over time. In the context of this dissertation, time, then, is treated as a dimension along which individual growth varies (Ding, Davison, & Petersen, 2005). These types of models generally include the investigation of two types of variables: (1) latent variables (unobserved variables that account for the covariance/correlations among observed variables and that, ideally, also represent the theoretical constructs that are of interest to the researcher); and (2) manifest variables or those that are actually observed/measured by the researcher and that are used to define or to infer the latent variable or construct (Schumacker & Lomax, 2004). 62 Schumacker and Lomax (2004) describe four major reasons why S E M is used commonly: 1. S E M allows researchers to use and query multiple variables to better understand their construct(s) of interest; 2. S E M acknowledges to a greater extent than other methodologies the validity and reliability of observed scores (Gall et al., 1996), by taking measurement error into account explicitly, rather than treating measurement error and statistical analyses separately (Schumacker & Lomax, 2004). Furthermore, S E M involves the explicit estimation of error structures (Ding et al., 2005); 3. S E M allows for the analysis of advanced models, such as those including interaction terms and complex phenomena; and 4. S E M software (e.g., L I S R E L ) is becoming increasingly user friendly. Muthen and Khoo (1998) warn researchers of two possible misuses of S E M . First, researchers may fail to thoroughly check their raw data for such features as the shape of individual growth curves and outliers. A s Ding et al. (2005) note, many data sets simply do not meet the statistical assumptions or sample size requirements necessary for S E M (unfortunately they do not clarify the specific aspects to which they refer by this statement). Second, it may be possible to have competing models which fit the means and covariances in much the same way, but may lead to different data interpretations, though they offer no explanation about when this situation might possibly be encountered. This strategy for handling the problem has also been used by various other researchers. For example, Guay, Larose, and Bo iv in (2004) explored children's academic self-concept, family socioeconomic status, family structure (single- versus two-parent 63 families) and elementary school academic achievement as predictors of participants', educational attainment level in young adulthood within a ten-year longitudinal design. In another example, Rowe and H i l l (1998) combined multilevel modelling with structural equation modelling to investigate educational effectiveness across two waves. Finally, Petrides, Chamorro-Premuzic, Frederickson, and Furnham (2005) used S E M methodologies in their exploration of the effects of various psychosocial variables on scholastic achievement and behaviour of teen-aged students in Britain, though they seem to sidestep the added complexities associated with using time-variable measures by purposefully choosing to explore only one wave of otherwise longitudinal data. Mul t id imensional Scaling A fifth strategy for handling the problem of analysing change/growth with time-variable measures (particularly those that can be linked) is to use multidimensional scaling (MDS) . 
M D S typically involves judging observations for the similarity or dissimilarity of all possible pairs of stimuli. A s C l i f f (1993) notes, the model most commonly specifies that educational or psychological distance is a Euclidean distance in k-dimensional space. If distance is proportional to the judged dissimilarity, then it makes it possible to analyse the distances for the purpose of recovering the values of the stimuli on the underlying scales or dimensions. Ding et al. (2005) describe the way in which they use M D S methods to analyse change in students' mathematics achievement over four waves: Grades 3, 4, 5, and 6. They begin by using M D S to obtain initial estimates of the scale values, which index the latent growth or change patterns. Once obtained, these estimates can be used to create either (1) a growth profile model with the scale values reflecting the growth rate and the initial value 64 representing initial growth level; or (2) a change model with the scale values reflecting change patterns and the intercept estimating the average score of each participant across waves (Ding et al., 2005). O f possible interest to the reader, Ding et al. (2005) differentiate the terms "growth" and "change". The former term is characterised by systematic or directional changes (e.g., as one would perhaps expect with mathematics achievement over time), whereas the latter term refers to linear or monotonic change. A s C l i f f (1993) observes, "it is more plausible to assume that the dissimilarities are only monotonically related to psychological distance, and that this monotonic relation is unknown a priori" (p. 83). This latter term is often referred to as non-metric M D S . Standardising the Test Scores or Regression Results The sixth alternative for handling the problem of analysing change/growth with time-variable measures (that can or cannot be linked) is to standardise the test scores pre-analysis or to use standardised results of the regression analyses. Goldstein (1995) suggests that some form of test score standardisation is sometimes needed before the scores can be modelled. It is common in such cases to standardise the scores of each test so that, at each occasion, they have the same population distribution. For example, researchers analysing longitudinal data on the same variable over time often begin by standardising the scores so that each wave's scores have a mean of zero and a standard deviation of one. Furthermore, educational researchers often use standardised regression coefficients, rather than raw or observed regression coefficients (Willett et al., 1998). A s an example, Flook, Repetti, and Ullman (2005) compared children's peer acceptance in the classroom to academic performance from fourth to sixth grades. Academic 65 performance across the three grades was assessed by their report card grades in two subjects: reading and mathematics. Because the grades were assigned by different teachers during the semester in which the participants were tested, the researchers standardised reading and mathematics grades within each school and cohort to M = 0 and SD = 1 (presumably within-wave). They then computed the average of the reading and mathematics grades at each wave in order to form an overall measure of academic performance for each participant during each wave. Willett et al. (1998) describe the three popular rationales offered for standardising the test scores or regression results: 1. 
Enhanced test score interpretability: Placing test scores onto some common metric has some intuitive appeal, especially given that there are few educational and psychological variables that have well-accepted metrics of interpretation. Imagine, for example, that a Grade 4 version of a mathematics test has a mean score of 15 and the Grade 7 version of a mathematics test (three years later) has an observed score mean of 23. Without first standardising each test's scores, how is one to interpret this mean difference when the two measures were time-variable and designed, by definition, to be independent of one another? Unfortunately, z-transforming each test's raw scores does not necessarily mean that one is measuring the same construct in Grade 4 as in Grade 7 (e.g., a z-score of 1.0 in Grade 4 does not necessarily mean the same thing as a z-score of 1.0 in Grade 7): All z-transforming does is allow one to determine a given test-taker's performance relative to all other test-takers within the same wave; z-transforming, therefore, does not necessarily allow for across-wave comparisons.

2. Relative importance of predictors: A second rationale that researchers give for standardising test scores is that it helps in the identification of the relative importance of predictors in a regression model. More specifically, standardisation helps to eliminate difficulties encountered in comparing the regression coefficients when predictors have been measured on different scales. The argument is that the predictor with the largest standardised regression coefficient is the "most important" predictor in the model (Willett et al., 1998; for a detailed description of the problems associated with this particular rationale, please refer to Thomas, Hughes, and Zumbo, 1998).

3. Comparison of findings across samples: A third rationale often offered for standardisation is that it facilitates the comparison of results across different samples, seemingly affording researchers the ability to investigate if other studies of the same construct detected effects of the same magnitude (Willett et al., 1998).

Despite these three seemingly appealing rationales for standardisation, this particular strategy for handling time-variable test scores is problematic for several reasons:

1. When test scores are standardised within-wave pre-analysis, there can be no expectation of any trend in either the mean or variance over time. If Test X's scores and Test Y's scores are transformed respectively to z-scores, then, by definition, both sets of scores have identical measures of central tendency, making within-person change and growth impossible to ascertain. It should be noted, however, that there can still be between-individual variation (Goldstein, 1995);

2. Standardising the outcome within-wave places constraints on its variation. For example, if the group's individual growth curves fan out, standardising the outcome within-wave increases the amount of outcome variation during early time periods and diminishes outcome variation during later occasions. Therefore, the standardised growth trajectories fail to resemble those based on raw scores (Willett et al., 1998);

3. Longitudinal studies are afflicted with some degree of attrition and drop-out. Thus, the standardising of predictors within-waves is based on means and standard deviations that are estimated in a decreasing pool of participants.
If such attrition/drop-out is non-random, then the following samples used in the estimation of measures of central tendency w i l l be non-equivalent, hence making the standardised values of the predictors unable to be compared from wave to wave (Willett et al., 1998); and 4. If the standard deviation of either the outcome variable or the predictor variable varies across samples, then samples with identical population parameters can yield "strikingly different standardized regression coefficients creating the erroneous impression that the results differ across studies" (Willett et al., 1998, p. 413). Converting Raw Scores to Age (or Grade) Equivalents Pre-Analysis A seventh strategy for handling the problem of analysing change/growth with time-variable measures is to convert the test scores to age equivalents (or mental age equivalents) pre-analysis, and then use these age equivalents in the place of raw scores in subsequent analyses. This process involves assigning each possible raw score an age (e.g., in months or years) for which that particular score is the population mean or median (Goldstein, 1995). A n age equivalent score, then, is the average score on a particular test that is earned by various students of the same age. A s a result of this process, it is then possible to interpret a score of, for example, 57 on Test X as the average score earned by students aged 12.6 years (Gall et al., 1996). 68 This strategy is often used in the field of special education. For example, Abbeduto and Jenssen Hagerman (1997) used mental-age equivalents in their study of language and communication problems of individuals with Fragile X Syndrome (FXS) , a genetic disorder resulting from a mutation on the X chromosome which is associated with various physical, behavioural, cognitive, and language problems. This particular strategy is ideal when the test scores change smoothly with age, because then the age equivalent metric is more easily interpretable. In the United Kingdom, Plewis (2000) used a variation of this type of strategy in his investigation o f reading achievement change and growth (Goldstein, 1995). To establish a common metric for his four waves of reading achievement data, he began by computing the first principal component of the scales within-wave. Then component scores (akin to factor scores in factor analysis) were then converted to z-scores. Third, each z-score was assigned a reading age equivalent score, such that the mean score at each age was the same as the mean chronological age, and the variance increased with age, which is to be expected in studies of change (Clemans, 1993; Plewis, 2000). Muijs and Reynolds (2003) adopted a similar approach in their study of the effects of student background and various teacher variables on students' longitudinal achievement in mathematics. Participants' National Foundation for Educational Research (NFER) numeracy subtest scores were collected twice a year for two respective school years: 1997-1998 and 1998-1999. For the purpose of the analyses, they (seemingly) converted students' raw scores into age equivalent scores based on a national sample of students in England. Another strategy for dealing with repeated measures designs with time-variable tests is to convert the test scores to grade equivalents pre-analysis. A grade equivalent score is the 69 average score on a test earned by students of the same grade-level. 
The process is similar to that of converting test scores to age equivalents, but the interpretation of the scores is slightly different: A score of 57 on Test X may now be the average score of sixth-graders (Gall et al., 1996). Some authors advise researchers to avoid including equivalent scores in any descriptive or inferential statistical methods, given that equivalents often have unequal units. As a general rule, age- and grade-equivalents should always be presented with the original raw scores (Gall et al., 1996; for an excellent discussion about issues related to using equivalents in the place of raw scores, please refer to Zimmerman and Zumbo, 2005). It should also be noted that this particular strategy has the disadvantage of requiring the scores of a norming group (normative sample) - a large sample, ideally representative of a well-defined population, whose test scores provide a set of standards against which the scores of one's sample can be compared (Gall et al., 1996). Unfortunately, due to time and financial constraints, it is not always possible to obtain the scores of a norming sample.

Chapter Summary

In summary, this chapter described seven strategies used in the current literature as means of handling the problem of analysing change/growth with time-variable measures (particularly those that can be linked). Not all educational research, however, can "fit" within the methodologies and data demands required of these strategies, particularly in cases where one's study involves time-variable measures that cannot be linked or if one has small sample sizes. Furthermore, it is not always possible to mimic the aforementioned methodologies in one's own research setting, given that many researchers fail to report or disclose the specific details of their linking methods. As such, the next chapter presents the Conover solution to the problem of analysing change and growth using time-variable measures, a solution that can be used when none of the aforementioned strategies is feasible. Although this solution may be implemented in Scenario 2 (time-variable measures that can be linked) or Scenario 3 (time-variable measures that cannot be linked), the Conover solution is particularly useful for the latter scenario, particularly when one considers that there is still no consensus on handling the problem of analysing change and growth with time-variable measures that cannot be linked.

In closing, rather than thinking of these seven strategies as being separate and distinct from one another, it may be more useful to cluster some of the related strategies. For example, (a) vertical scaling, (b) growth scales, (c) Rasch scaling/modelling, (d) standardising test scores or regression results, and (e) converting scores to age/grade equivalents can all be considered "linking/scaling approaches" (in the psychometric sense of the phrase), because each of these strategies is used for the purpose of putting the scores of different measures onto a common metric.

Chapter 5: The Conover Solution: A Novel Non-Parametric Solution for Analysing Change/Growth with Time-Variable Measures

The previous chapter described seven strategies used in the current literature as means of handling the problem of analysing change/growth with time-variable measures (particularly those that can be linked). Not all educational research, however, can "fit" within the methodologies and data demands required of these strategies, for reasons outlined in the previous chapter.
Thus, this chapter expands upon previous rank transformation work undertaken most notably by Conover (1999) and Conover and Iman (1981), and introduces a novel solution to the motivating problem that can be used with time-variable measures that can or cannot be linked. As has been described in previous chapters, this novel solution is called the Conover solution. In the case of two-wave data, the Conover solution is called the non-parametric difference score; in the case of multi-wave data, the Conover solution is called the non-parametric HLM. Both involve rank transforming (or ordering) individuals' longitudinal test scores within wave pre-analysis, and then using these rank scores in the place of raw or standardised scores in subsequent analyses. It is because the original scores are transformed into ranks that the Conover solution is non-parametric (Conover, 1999). It should be noted that Domhof, Brunner, and Osgood's (2002) work relates tangentially to this novel solution, in that it discusses rank-based procedures for dealing with repeated measures data. Their research, however, uses rank-based procedures as a means of handling missing data, and presumes that the measure itself remains unchanged across waves. Zumbo (2005) has also considered the use of ranks in longitudinal analyses, and is credited for coining the terms "non-parametric HLM" and "non-parametric difference score".

A rank score depicts the position of a test-taker on a variable relative to the positions held by all other test-takers. Ranking or rank transforming refers to the process of transforming a test-taker's raw score to a rank relative to other test-takers - that is, a one-to-one function f from the set {X1, X2, ..., XN}, the sample values, to the set {1, 2, ..., N}, the first N positive integers (Marascuilo & McSweeney, 1977; Zimmerman & Zumbo, 1993a). For example, if Student X earned a score of 12, Student Y earned a score of 13, and Student Z earned a score of 14, then the students' respective rank scores would be 1, 2, and 3, where a rank of 1 is assigned to the test-taker with the lowest score. (One can also assign rank scores so that the test-taker with the highest score receives a rank of 1; however, it is easier to think of the student receiving the highest score as also receiving the highest rank value.) As Zimmerman and Zumbo (2005) remind readers, inasmuch as test-takers' scores are represented in terms of their position relative to other test-takers in the same wave, a rank score is similar to a percentile score. A percentile score is a type of rank score that represents a raw score as the percentage of test-takers in an external norming group whose score falls below that score; unfortunately, referring to the scores of a norming sample is not always practical or possible. This issue is revisited in a later section.

In general, researchers can effectively use ranks in situations in which regular statistical assumptions (specifically, normality and homogeneity of variance) are not or cannot be met (Beasley & Zumbo, 2003; Zimmerman & Zumbo, 1993a), or when one's scale of measurement is ordinal, and not interval (Zimmerman & Zumbo, 1993a). Lamentably, researchers often underestimate or overlook the utility of rank scores - perhaps because introductory statistics course instructors frequently isolate 'rank speak' into separate units that appear to students to be disconnected from the general flow of the course (Conover & Iman, 1981), or because textbook writers often convey the impression that parametric tests are more powerful than their non-parametric counterparts under normal theory (Zimmerman & Zumbo, 2005).
Zimmerman and Zumbo (1993a, 2005) note that transforming scores to ranks and using non-parametric methods often improves the validity and power of significance tests for non-normal distributions. Moreover, rank transformations often produce similar results to those of parametric tests (Zimmerman & Zumbo, 2005), and can be used in two-wave or multi-wave designs (described in more detail below).

Using rank scores (or percentile scores) in the place of raw scores does not mean a 'free lunch' for the researcher, however. One primary criticism of the Conover solution (a more thorough discussion of the strengths and limitations of the novel solution is reserved for Chapter 7) is that "[differences] between raw scores are not necessarily preserved by the corresponding ranks. For example, a difference between the raw scores corresponding to the 15th and the 16th ranks is not necessarily the same as the difference between the raw scores corresponding to the 61st and 62nd ranks in a collection of 500 test scores" (Zimmerman & Zumbo, 2005, p. 618). Furthermore, for this solution to be feasible, one must first be sure that the construct being measured does not change over time (i.e., the scores do not vary across waves in terms of the major underlying dimension or latent variable).

Traditional Applications of the Rank Transformation

A number of well-known non-parametric statistical procedures are rooted in rank transformations. The most common include Spearman's rho (ρ), the Mann-Whitney test, the Kruskal-Wallis test, the sign test, the Wilcoxon signed ranks test, and the Friedman test, each of which is described in turn in this section.

Spearman's rho: Arguably, the most well-known use of the rank transformation is where the ubiquitous Pearson correlation, r, is applied to ranks (Conover, 1999). Whereas a perfect Pearson correlation requires both that the ordinal positions match across two variables and that the variables share a perfect linear relationship, a perfect Spearman correlation (ρ) only requires that the ordinal positions match across the samples (Cohen, 1996). Moreover, unlike the Pearson correlation, a Spearman correlation does not require a bivariate normal distribution, nor does it require that each variable follow its own normal distribution. It is for these reasons that the Spearman correlation is referred to as a "distribution-free" statistic. As Cohen (1996) observes, the only assumptions that must be met for a Spearman correlation are those for other tests of ordinal data: independent random sampling (i.e., each pair of observations should be independent of all other pairs) and continuous variables (i.e., both variables are assumed to have continuous underlying distributions).

The Spearman correlation can be used in three cases. In the first case, both variables have been measured on an ordinal scale. In the second case, one variable is measured on an interval/ratio scale and the other variable has been measured on an ordinal scale. In the final case, both variables are measured on an interval/ratio scale, but the raw scores in each variable have first been transformed into ranks. One may experience this third case if (a) the distributional assumptions of the Pearson correlation are severely violated, (b) there are small sample sizes, or (c) the relationship is far from linear and one only wants to investigate the degree to which the relationship is monotonic (Cohen, 1996).
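The claim that Spearman's rho is simply Pearson's r applied to ranks can be verified directly. The short Python sketch below (simulated data, added for illustration only) computes both quantities; with average ranks assigned to any ties, the two values coincide.

    import numpy as np
    from scipy.stats import spearmanr, pearsonr, rankdata

    rng = np.random.default_rng(1)
    x = rng.normal(size=30)
    y = x + rng.normal(size=30)        # two related, interval-scaled variables

    rho, _ = spearmanr(x, y)                            # Spearman's rho on the raw scores
    r_ranks, _ = pearsonr(rankdata(x), rankdata(y))     # Pearson's r on the rank-transformed scores
    print(round(rho, 10) == round(r_ranks, 10))         # True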
Mann-Whitney test: Akin to a non-parametric, distribution-free independent-samples t-test, the Mann-Whitney test involves examining two random samples to see if the populations from which the two samples were drawn have equal means. The Mann-Whitney is likely to be used in situations in which one faces small sample sizes that are likely to differ in terms of size and variability. If the populations' scores are distributed normally, the independent-samples t-test is the most powerful; the Mann-Whitney test, however, can be used when the two populations' scores are not distributed normally (Conover, 1999). The Mann-Whitney test begins by pooling all participants (irrespective of subgroup) into one large group and then rank ordering the participants' dependent variable scores. Finally, the sums of the ranks for each separate subgroup are computed and compared. The Mann-Whitney test requires that two assumptions are met: that there is independent random sampling (i.e., each pair of observations should be independent of all other pairs) and that the dependent variable is a continuous/quantitative variable.

Kruskal-Wallis test: Whereas the Mann-Whitney test is akin to an independent-samples t-test, the Kruskal-Wallis is the non-parametric variation of the one-way analysis of variance (ANOVA) with three or more independent samples (Conover, 1999). Like the Mann-Whitney test, the test begins by pooling all participants (irrespective of subgroup) into one large group and then rank ordering the participants' dependent variable scores. Finally, the sums of the ranks for each separate subgroup are computed and compared. The Kruskal-Wallis has the same assumptions as the Mann-Whitney test, and is used in the same circumstances (Cohen, 1996).

Sign test: In situations in which the amount of the difference between a pair of scores (i.e., the difference score in a repeated measures context) cannot be computed, but the direction (sign) of the difference can be, the sign test can be implemented. One may plan to have a non-quantified difference (because only the direction of change is of interest) or not (e.g., one aims to use a paired-samples t-test, but the sample size is small and the difference scores do not approximate a normal distribution) (Cohen, 1996). The sign test assigns each of the participants a positive sign (e.g., if a person's score improved across waves) or a negative sign (e.g., if a person's score declined across waves), and determines if the difference in the counts of positive and negative scores across participants is likely to have occurred by chance (Conover, 1999). The sign test requires that three assumptions are met. First, the events must be dichotomous (each simple event being measured can fall into either a positive or a negative category only). Second, the events must be independent (the outcome of one trial does not influence the outcome of any other). Finally, the process must be stationary, meaning that the probabilities of each category remain the same across all trials of the experiment (Cohen, 1996).
Wilcoxon signed ranks test: If each participant in one's study has been measured on an interval or ratio scale twice (i.e., across two waves), but the difference score is distributed differently than a normal distribution in the population and the sample sizes are small, one can opt to use the Wilcoxon signed ranks test. Another distribution-free test, the Wilcoxon involves rank ordering the participants' simple difference scores (ignoring the sign of the difference scores, thus indicating only absolute differences); the ranks are then summed separately for each sign (Conover, 1999).

Friedman test: The last rank-based test discussed in this section is called the Friedman test, and it is the distribution-free analogue to the one-way repeated measures ANOVA (Howell, 1995). The Friedman test is relatively easy to compute: Rather than rank ordering all of the participants' difference scores with respect to each other, one need only rank order the scores within wave, and compare the sums of the ranks of each wave (Cohen, 1996). The Friedman test can be based on either (a) ranking the observed data at the outset of data collection or (b) data collected on an interval/ratio scale that are rank transformed in situations in which the parametric assumptions required of a repeated measures ANOVA are not met (Cohen, 1996).

Introducing the Conover Solution to the Motivating Problem

As mentioned previously, there are invariably situations in which the implementation of the seven strategies for handling the problem of analysing change and growth using time-variable measures presented in Chapter 4 is not possible (as in the case of time-variable measures that cannot be linked). When faced with such situations, readers are presented with this novel solution to the problem: Rank transform (or order) individuals' longitudinal test scores within wave pre-analysis, and then use these rank scores in the place of raw or standardised scores in subsequent analyses. Put another way, this solution involves partitioning the observed data into subsets (waves), then ranking each wave independently of the other waves (Conover and Iman, 1981, refer to this type of rank transformation as RT-2; please refer to that article for additional methods of rank transforming), and then using the rank scores in the place of the original scores in subsequent parametric analyses. Recall from an earlier section that this dissertation's novel solution is coined the Conover solution in this document in honour of the
seminal work of Conover (1999) and Conover and Iman (1981), whose research provided the groundwork and evidence for the novel solution's viability.

At first glance, the Conover solution may appear to resemble a strategy discussed by Zimmerman and Zumbo (2005), in which percentile scores are used in the place of raw or standardised scores in subsequent analyses. When one recalls, however, that percentile scores represent a given raw or standardised score on a measure as the percentage of individuals in a norming group whose score falls below that score, then the distinction between this dissertation's Conover solution and Zimmerman and Zumbo's (2005) solution is more clear: Whereas the percentile solution requires the inclusion of the scores of an (external) norming group, the Conover solution introduced in this dissertation does not.

One of the major strengths of the Conover solution is its ease of use (a more thorough discussion of the strengths and limitations of the novel solution is reserved for Chapter 7): It is often more convenient to use ranks in a parametric statistical program than it is to write a program for a non-parametric analysis (Conover & Iman, 1981). Furthermore, by rank transforming the data pre-analysis, one is able to bridge the gap between parametric and non-parametric statistical methods, thereby providing "a vehicle for presenting both the parametric and nonparametric methods in a unified manner" (Conover & Iman, 1981, p. 128). This last issue was discussed in detail in Chapter 1. The Conover solution makes use of the ordinal nature of continuous-scored data: A test-taker with a low raw score relative to other test-takers in his or her wave will also yield a low relative rank score. (Statistical packages often assign a rank of one (1) to each wave's smallest score. It is recommended that analysts retain this default setting, as it is logical to think that the lowest test score should also have the lowest rank.) Similarly, a test-taker with a high test score will also yield a high rank score. As a result, within-wave order among the students is preserved. By ranking test-takers' scores within-wave pre-analysis, it is possible to put the longitudinal test scores on a "common metric", thereby providing a standard against which test-takers' scores can be measured and compared.

How Applying the Conover Solution Changes the Research Question (Slightly)

As has been described in an earlier chapter, Chapter 6 provides a demonstration of the non-parametric HLM (the Conover solution for multi-wave data) and the non-parametric difference score (the Conover solution for two-wave data). Chapter 6 also includes the specific research question being posed in each demonstration. It is important to note that, when one applies the Conover solution to the problem of analysing change/growth with time-variable measures (that can or cannot be linked), one changes slightly the research question being investigated. This change is highlighted best by way of example. In Chapter 6's Case 1 (the multi-wave case), the research question posed is: Are there gender differences in the rank-based longitudinal reading achievement scores of a group of students? The operative phrase in this example is rank-based.
As described above (and more fully in Chapter 7), the Conover solution makes use of the ordinal nature of the original scores; it must be cautioned, however, that the differences or gaps between the original scores are not necessarily preserved by the corresponding ranks. Therefore, when the Conover solution has been applied to a set of original scores (i.e., when the original scores have been transformed to ranks pre-analysis), the research question, the results, and the inferences made from the results must reflect the fact that the scores have been transformed.

Within-Wave versus Across-Wave Ranking

As mentioned in previous sections of this dissertation, the Conover solution involves ranking test-takers' raw scores within wave, rather than across waves, and then using the rank scores in the place of raw or standardised scores in subsequent analyses. The purpose of the current section of this dissertation is to compare and contrast within-wave and across-wave ranking, so as to elucidate why within-wave ranking is a key component of the Conover solution methodology.

Within-wave ranking involves assigning a rank score to a given test-taker's original score in a given wave (or column in the data matrix) relative to the scores of all other test-takers in the same wave. As a result, the rank scores resulting from the within-wave ranking retain their wave-specificity and, hence, the temporal nature of the data collection is preserved. Such ranking allows one to track an individual student's progress, relative to other test-takers, throughout the duration of the study. More specifically, it allows one to see how a given test-taker's rank performance, relative to the other test-takers, improves or declines across time.

In contrast, across-wave ranking, another type of rank transformation (please refer to Conover and Iman, 1981, for descriptions of additional methods of rank transformation), involves taking all of the test-takers' raw or standardised scores and aligning them vertically into one column, irrespective of the wave in which the scores were collected. Once all of the test-takers' multiple within-wave scores have been stacked, one on top of the other, into the one column, the scores are then rank-ordered. Finally, the rank scores that result from this rank-ordering are put back into waves and are used in the place of the original scores in subsequent parametric analyses. Across-wave ranking is an integral component of Zimmerman and Zumbo's (1993b) non-parametric repeated measures ANOVA (Zumbo & Forer, in press).

Although across-wave ranking could certainly be extended to the context of change and growth with time-variable measures (that can or cannot be linked), it may be somewhat difficult to interpret the precise meaning of a given test-taker's rank score in some cases, given that ranks are assigned on a vertical column of, in essence, 'wave-less' scores. Imagine, for example, that raw scores from a Wave 1 measure have been designed to have a possible score range of 0 to 50 points, whereas the scores from a Wave 2 measure can range from 0 to 60 points.
Because of the differences in possible score ranges inherent to the respective measures, an earned raw score of 45 does not necessarily mean the same "amount" of a given construct across measures. Nonetheless, both raw scores (Wave 1 = 45, Wave 2 = 45) still receive the same rank score - even though the raw scores are, to some degree, an artefact of the range of possible scores on each measure. As such, readers are encouraged to use this method of ranking with caution when applying it to the context of change and growth (particularly with time-variable measures that can or cannot be linked), and especially if the total possible point values vary across waves.

Establishing the Viability of the Conover Solution

Such a solution for handling the problem has been supported by statistical theory. Conover (1999), for example, writes:

Non-parametric tests that are equivalent to parametric tests computed on the ranks are easily computed using computer programs designed for parametric tests. Simply rank the data and use the parametric test on the ranks in situations where programs for the nonparametric tests are not readily available (p. 419).

Adding support for the viability of the Conover solution, Conover and Iman (1981) write that "least squares, forward or backward stepwise regression, or any other regression method may be applied to the ranks of the observations" (p. 27). Furthermore, they assert that rank-transformed data can be used in the place of raw or standardised scores in several situations in which satisfactory parametric procedures already exist:

• 2 or k independent samples
• paired samples
• randomised complete block design
• correlation
• regression
• discriminant analysis
• multiple comparisons, and
• cluster analysis.

Additionally, Zimmerman and Zumbo (1993a) outline how the Mann-Whitney test in its standard form (i.e., the large-sample normal approximation) is equivalent to an ordinary Student t-test (for independent samples) performed on the ranks of observed scores, instead of the observed scores themselves. They add that:

1. "Apart from details of computation, it makes no difference whether a researcher performs a Wilcoxon test based on rank sums, or alternatively, pays no attention to W and simply performs the usual Student t test on the ranks" (p. 488);

2. "If the initial data in a research study are already in the form of ranks, it is immaterial whether one performs a t test or a Wilcoxon test" (p. 489); and

3. "For quite a few nonnormal distributions, [the] Wilcoxon-Mann-Whitney test holds a power advantage over the Student t test, both in the asymptotic limit and for small and moderate sample sizes. [This] power advantage is accounted for by reduction in the influence of outliers in conversion of measures to ranks" (p. 495).

This latter point is discussed in more detail in Chapter 7. In closing, despite traditional underestimation or oversight about the utility of rank-based methods, the work of Conover (1999), Conover and Iman (1981), and Zimmerman and Zumbo (1993a) reminds readers of the viability of using rank scores in the place of original scores - hence providing evidence of the appropriateness and cogency of the Conover solution presented in this dissertation.
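The equivalence just described is easy to check empirically. The short Python sketch below (simulated, skewed data; added for illustration and not drawn from the dissertation) runs an ordinary independent-samples t-test on the pooled ranks and a Mann-Whitney U test on the raw scores. Because the two procedures are, in effect, the same test, their p-values should agree very closely, differing only in approximation details such as tie and continuity corrections.

    import numpy as np
    from scipy.stats import rankdata, ttest_ind, mannwhitneyu

    rng = np.random.default_rng(7)
    group_a = rng.exponential(scale=1.0, size=40)   # skewed scores, group A
    group_b = rng.exponential(scale=1.5, size=40)   # skewed scores, group B

    # Pool the two groups, rank the pooled scores, then split the ranks back into groups.
    pooled_ranks = rankdata(np.concatenate([group_a, group_b]))
    ranks_a, ranks_b = pooled_ranks[:40], pooled_ranks[40:]

    t_stat, p_from_t_on_ranks = ttest_ind(ranks_a, ranks_b)
    u_stat, p_from_mann_whitney = mannwhitneyu(group_a, group_b, alternative='two-sided')
    print(p_from_t_on_ranks, p_from_mann_whitney)   # the two p-values should be very close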
Primary Assumption of the Conover Solution: Commensurable Constructs

The primary assumption underlying the implementation of the Conover solution is that one is measuring a commensurable (similar) construct across all waves of the study. In other words, the researcher must be satisfied that the same primary dimension or latent variable is driving the test-takers' responses across waves. Recall from an earlier section that a latent variable is any unobserved variable that accounts for the correlation among one's observed or manifest variables. Ideally, psychometricians design measures such that the latent variable that drives test-takers' responses is a representation of the construct of interest.

What is an example of the inappropriate use of the Conover solution? Imagine that a testing company chooses to measure students' academic achievement across three waves, but the subject matter tested at each wave varies considerably (e.g., Wave 1 = mathematics achievement, Wave 2 = reading achievement, and Wave 3 = science achievement). Although these three constructs relate collectively to students' academic achievement, it is very likely that the latent variable driving test-takers' responses at each distinct wave differs considerably from those of the other waves (e.g., due to the variability in subject matter across waves). In this example, implementation of the Conover solution would be inappropriate because the testing company is attempting to compare the incomparable. The Conover solution is indeed an effective method of handling the problem of analysing change and growth with time-variable measures that are designed to assess commensurable constructs across waves; however, it is by no means a panacea for the problem, for reasons outlined in Chapter 7. Further discussion about what constitutes commensurability is presented in Chapter 6.

In the next chapter, readers are taken through step-by-step implementations of this Conover solution by means of two comprehensive examples: Case 1 introduces the non-parametric HLM, which can be used in multi-wave research designs. Case 2 introduces the non-parametric difference score, which can be used in two-wave research designs.

Chapter 6: Two Conover Solution Case Studies: The Non-Parametric HLM and the Non-Parametric Difference Score

In the previous chapter, the Conover solution to the problem of analysing change/growth with time-variable measures was introduced. The solution is particularly useful when none of the seven strategies offered in Chapter 4 is executable (most notably when time-variable measures cannot be linked), and involves rank transforming individuals' longitudinal test scores within wave pre-analysis, and then using these rank scores in the place of raw or standardised scores in subsequent analyses. The current chapter describes the implementation of the Conover solution using two distinct case studies of real data. It should be noted that, within the context of this dissertation, the term "case study" refers to a real-data demonstration of how to use and interpret the Conover solution wherein one is analysing change/growth with time-variable measures (particularly those that cannot be linked). In particular, this case study chapter: (a) offers a rationale for the preferred statistical software package (SPSS), (b) describes each case study's data and research objective, (c) discusses the specific variables of interest and the proposed methodology, (d) presents the statistical models/equations (for the non-parametric HLM case only), and (e) explains the resultant statistical output. The Conover solution's strengths and weaknesses are discussed in Chapter 7. (Information about the exemplar data's reliability and validity was not available because the Ministry does not disseminate item-level data; however, Lloyd, Walsh, and Shehni Yailaigh (2005) report that the coefficient alphas for the provincial population of fourth- and seventh-grade students' responses on the 2001 FSA numeracy subtests were .85 and .86, respectively.)
It should also be noted that the steps one must follow when performing the respective analyses via either the graphical user interface (GUI) or syntax are presented in Appendix A (non-parametric H L M for multi-wave data) and Appendix B (non-parametric difference 2 7 Statistical models/equations are presented for the non-parametric HLM case only. 2 8 Please note that information about the exemplar data's reliability and validity was not available because the Ministry does not disseminate item-level data; however, Lloyd, Walsh, and Shehni Yailaigh (2005) report that the coefficient alphas for the provincial population of fourth- and seventh- grade students' responses on the 2001 FSA numeracy subtests were .85 and .86, respectively 86 score for two-wave data). Also presented in Appendix A is a brief description of hierarchical linear modelling. Choice of Statistical Software Packages Mixed-effect modelling can be performed in several statistical packages, namely SPSS (Statistical Package for the Social Sciences), H L M (Hierarchical Linear Modelling), and M L w i N (created by the Centre for Mult i level Modell ing team based at the University o f Bristol, United Kingdom). O f these three packages, SPSS is used in the demonstrations. Because of its widespread use in educational and social science settings, its user-friendliness in terms of data handling, its ability to rank transform data within waves, and its ability to perform mixed-effect analyses, SPSS was deemed the suitable choice. It is important to note, however, that the Conover solution is not an SPSS solution per se; this solution could, in theory, be easily implemented using other popular statistical software packages. Determining the Commensurability of Constructs A s is described more fully in subsequent sections, the first case study focuses on reading achievement scores collected from students in an urban school district in British Columbia across five annual waves: Grade 2 to Grade 6, inclusive. The second case study includes large-scale numeracy assessment data collected across two waves (Grades 4 and 7) by the British Columbia Ministry of Education. Recall from Chapter 5 that the primary assumption underlying the implementation of the Conover solution is that one is measuring a commensurable (similar) construct across all waves of the study. Although there is no clearly-stated definition of the term in the repeated measures literature, commensurability is generally thought to mean that the same primary 87 dimension or latent variable is driving the test-takers' responses across waves. In a sense, commensurability is analogous to comparability or similarity. The time-variable measures used in each respective case study were concluded to measure commensurable constructs across waves primarily because large-scale test developers generally construct their measures according to tables of specifications - test blueprint documents that (a) define the specific sub-domains of the construct of interest, (b) detail the specific sub-domain to which each scale item belongs, and (c) specify the proportion of scale items devoted to a specific sub-domain. In general, constructs are thought to be commensurable over time when there is parity in the tables of specifications across waves. 
This conceptualisation of commensurability relates to Sireci's (1998) definition of content validity: There is consensus over the years that at least four elements of test quality define the concept of content validity: domain definition, domain relevance, domain representation, and appropriate test construction procedures (p. 101). 1 In the demonstration of the non-parametric difference score solution, the British Columbia Ministry of Education's (n.d.) numeracy subtest table of specifications reveals that four numeracy sub-domains assessed on each version of the measure (number, patterns and relationships, shape and space, and statistics and probability) are defined in a consistent manner across grade levels - thus providing evidence for consistent domain definitions across waves. Furthermore, the total proportions of items devoted to each of the four numeracy sub-domains are the same in the Grade 4 version of the numeracy subtest as in the Grade 7 version, providing evidence for consistent domain representation. . 88 The same is true for the data used to demonstrate the non-parametric H L M solution: The reading measures have all been designed to assess, in a consistent manner, four sub-domains across waves: phonetic analysis, vocabulary, comprehension, and scanning. Furthermore, extensive research has been conducted in terms of determining the various measures' psychometric properties (specifically the validity and reliability of the test scores) and, purportedly, Rasch modelling was used to equate, calibrate, and develop scale scores across forms and levels (though no specifics are provided in terms of how this modelling was undertaken) (Karlsen & Gardner, 1978-1996). When tables of specifications have been employed. When several versions of a measure share no common items (as with time-variable measures that cannot be linked), the measures' content validity is the primary piece of evidence that one can use in determining the measures' commensurability. A s stated above, it is generally thought that i f there is parity in the measures' tables of specifications, then the measures are "content val id" and, in turn, commensurable. If the measures share any number of items (e.g., Scenario 1 = exact same measure across waves; Scenario 2 = time-variable measures that can be linked), then it may be possible to conduct multi-group exploratory factor analyses (to explore how many factors there are, whether the factors are correlated, and which observed variables appear to best measure each factor) and multi-group confirmatory factor analyses (to specify the number of factors, which factors are correlated, and which observed variables measure each factor), in addition to investigations of content validity, in determining the measures' commensurability. Please refer to Schumacker and Lomax (2004) for more about these types of factor analyses. 89 When tables of specifications have not been employed. In instances in which time-variable measures (either linkable or non-linkable) have not been designed according to tables of specifications, it may be possible to instead define the measures' commensurability from a more general validity perspective. One such option could be to expand the usage o f Campbell and Fiske's (1959) multitrait-multimethod matrix ( M T M M ) - an approach to assessing a measure's construct validity (the extent to which the inferences from a test's scores accurately reflect the construct that the test is purported to measure). 
In their seminal paper, Campbell and Fiske (1959) distinguish between two sub-categories o f construct validity: convergent validity which refers to the degree to which concepts that should be related theoretically are interrelated in reality; and discriminant validity which refers to the degree to which concepts that should not be related theoretically are, in fact, not interrelated in reality. The M T M M , simply a table of correlations, assumes one measures each of several traits (e.g., mathematics achievement) by each of several methods (e.g., a paper-and-pencil test, a direct observation, a performance measure). In essence, a measure is purported to be "construct val id" i f its scores are correlated with related traits (convergence) and uncorrelated with unrelated traits (discrimination). Extending Campbell and Fiske's (1959) concept of construct validity to the problem surrounding the current dissertation, it may be possible to conclude that one's time-variable measures are indeed commensurable i f the pattern of wave-specific convergent and discriminant correlations are similar across all waves of one's study (i.e., the M T M M at Wave 1 is similar to the M T M M at Wave 2, and so forth). Given that the idea put forth in this dissertation of defining time-variable measures' commensurability from a general 90 validity perspective is new and has not been explored in either the test linking or change/growth literatures, this idea most certainly requires investigation in future research. So as to explain the rationale for choosing both a two-wave example and a multi-wave example with which to demonstrate the Conover solution, it is necessary to first elucidate the distinction between two-wave and multi-wave research designs. Therefore, the next section introduces the first case study (the multi-wave case). Included is a description of the selected data set, the specific variables of interest, the proposed methodology, and the statistical models/equations. The resultant statistical output is also explained. Case 1: Non-Parametric H L M (Conover Solution for Multi-Wave Data) Given the proliferation of criticisms of two-wave designs, research designs in which an individual is tested over multiple (i.e., three or more) occasions quickly became the new 'gold standard'. These multi-wave designs seemed to solve many of the problems associated with two-wave designs and were, hence, regarded as the 'one and only' way in which to study individual change. What if, however, time constraints, contextual factors, and/or financial limitations preclude a multi-wave design? Should researchers abandon the study of change altogether? Although there is still no universal concurrence on the 'two-wave versus multi-wave' debate, the general consensus is that, where possible, one should indeed implement a multi-wave design (e.g., Willett et al., 1998). In situations in which multi-wave designs are not possible, however, one may choose between using simple difference scores or residualised change scores according to a decision tree offered by Zumbo (1999). Meade et al. (2005) warn, however, that i f one chooses to implement two-wave designs, it is.necessary to provide 91 evidence that the two measurement occasions are equivalent psychometrically in order to ensure the valid measurement of change over two waves. 
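To keep the two-wave terminology concrete before turning to the case studies, the following Python sketch (simulated data; the variable names wave1 and wave2 are hypothetical and not drawn from the dissertation's data) computes the two quantities named above: a simple difference score, and a residualised change score defined as the part of the Wave 2 score that cannot be predicted linearly from the Wave 1 score.

    import numpy as np

    rng = np.random.default_rng(3)
    wave1 = rng.normal(50, 10, size=100)                 # hypothetical Wave 1 scores
    wave2 = 0.8 * wave1 + rng.normal(12, 8, size=100)    # hypothetical Wave 2 scores

    simple_difference = wave2 - wave1                    # simple difference score

    # Residualised change score: residuals from regressing Wave 2 on Wave 1.
    slope, intercept = np.polyfit(wave1, wave2, deg=1)
    residualised_change = wave2 - (intercept + slope * wave1)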
Description of the Data

The research question being posed in this particular case study is: Are there gender differences in the rank-based longitudinal reading achievement scores of a group of students? With permission from Dr. Linda Siegel at the University of British Columbia, a particular extract of longitudinal language and literacy assessment data for North Vancouver School District (District 44) students was obtained. Dr. Siegel's data source is rich: several types of demographic and assessment data were collected for the district's students across seven annual waves (kindergarten through to Grade 6, inclusive). For more about Dr. Siegel's research project, please refer to Chiappe, Siegel, and Gottardo (2002), Chiappe, Siegel, and Wade-Woolley (2002), and Lesaux and Siegel (2003).

Specific Variables of Interest and Proposed Methodology

Obtained were the raw scores on the Stanford Diagnostic Reading Test (SDRT), a standardised test of reading comprehension, for 653 children (n_female = 336, n_male = 317) tested across five of the seven waves: Grade 2, Grade 3, Grade 4, Grade 5, and Grade 6, inclusive. Recall from Chapter 1 that Kolen and Brennan (2004) state that most test linking strategies require a minimum sample size of 400 test-takers per form; using this rule of thumb, Case 1's sample size would be considered sufficient for test linking. For the purpose of this case study, students missing one or more waves of SDRT data were excluded from the analyses. As Figures 3 and 4 depict, the descriptive statistics for each wave of SDRT raw scores vary widely.

Descriptive Statistics
Variable     N    Minimum  Maximum  Mean   Std. Deviation  Skewness (SE = .096)  Kurtosis (SE = .191)
grade2raw    653  12       40       36.24  3.997           -2.503                8.472
grade3raw    653  10       45       35.99  6.183           -1.209                1.505
grade4raw    653  8        54       41.48  7.394           -1.362                2.099
grade5raw    653  4        54       44.46  6.473           -2.139                6.936
grade6raw    653  11       54       41.17  8.337           -1.009                .741
Valid N (listwise) = 653

Figure 3. Descriptive statistics for each of the five waves of SDRT raw scores collected by Siegel.

[Figure 4 presented five histograms, one per wave, of the SDRT raw scores; the x-axis of each panel is the raw score (e.g., the Grade 6 panel spans approximately 10 to 60).]

Figure 4. Histograms of the Siegel study's raw scores across five waves: Grade 2 (top left), Grade 3 (top right), Grade 4 (middle left), Grade 5 (middle right), and Grade 6 (bottom left). Note that each of the distributions is skewed negatively.

The SDRT administration involves each child receiving a booklet, reading the short passages within the booklet, and providing responses to multiple-choice questions based on the reading within a prescribed time limit (Lesaux & Siegel, 2003). Because students' reading comprehension changes with time, the SDRT has purportedly been changed developmentally (i.e., across waves) by the test developer, Harcourt Assessment.

Because this case study involves data collected over three or more waves, it is possible to conduct an HLM analysis of change. To this end, rank-transformed test scores serve as the Level 1 outcome variable in the model. Also obtained were an encrypted student identification number corresponding to each of the test scores (variable name = casenum), which serves as the Level 2 grouping variable, and each student's gender (coded female = 1, male = 0), which serves as the Level 2 predictor variable.
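Although all of the analyses reported below were conducted in SPSS, the data-preparation steps just described (ranking the SDRT raw scores within wave and restructuring the file so that there is one record per student per wave) can be sketched compactly in other software. The following is a minimal, illustrative sketch in Python (pandas); the file name is hypothetical, and the variable names follow those used in this case study.

```python
# Minimal sketch: within-wave rank transformation and wide-to-long
# restructuring of the SDRT data prior to the HLM analysis.
import pandas as pd

wide = pd.read_csv("sdrt_wide.csv")          # hypothetical file; one row per student
raw_cols = ["grade2raw", "grade3raw", "grade4raw", "grade5raw", "grade6raw"]

# Rank each wave separately, assigning rank 1 to the smallest score and
# mid-ranks to tied scores (the usual convention).
for col in raw_cols:
    wide[col.replace("raw", "rank")] = wide[col].rank(method="average")

# Restructure to the "long" (person-period) format required at Level 1:
# one record per student per wave.
long = wide.melt(id_vars=["casenum", "gender"],
                 value_vars=[c.replace("raw", "rank") for c in raw_cols],
                 var_name="wave_label", value_name="rank_score")
long["wave"] = long["wave_label"].map(
    {"grade2rank": 1, "grade3rank": 2, "grade4rank": 3,
     "grade5rank": 4, "grade6rank": 5})   # Wave 1 = Grade 2, ..., Wave 5 = Grade 6
long = long.sort_values(["casenum", "wave"]).reset_index(drop=True)
print(long.head())
```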
In summary, this case study involves a two-level HLM model in which five waves of test scores at Level 1 (variable names = grade2rank, grade3rank, grade4rank, grade5rank, and grade6rank, respectively) are nested within students (the Level 2 units).

Statistical Models and Equations

When dealing with nested data, two sets of analyses are performed: unconditional and conditional. By doing so, one can then determine what improvement in the prediction of the outcome variable is made after the addition of the predictor variable(s) to the model (Singer & Willett, 2003).

Unconditional HLM models (sometimes called baseline or null models) generally involve computing the proportion of variance in the outcome variable that can be explained simply by the nesting of the Level 1 outcome variable (in this case, the rank-based literacy score, Y_ij) within the Level 2 grouping units (in this case, the test-takers). Therefore, the Level 2 predictor variable, gender, has not been included in this model.

The Level 1 model is a linear individual growth model, and represents the within-person (test-taker) variation. The Level 2 model expresses variation in the parameters from the growth model as random effects unrelated to any test-taker-level predictors (Singer, 1998), and represents the between-person (test-taker) variation. Using the notation of Bryk and Raudenbush (1992), in which each level is written as a series of separate but linked equations, the relevant models and notation are as follows (Singer, 1998; Singer & Willett, 2003):

• Unconditional Model, Level 1 (Within-Person)

Y_ij = π_0i + π_1i(wave_ij) + r_ij,          (1)

where:
1. Y_ij = test-taker i's rank-based literacy score on measurement occasion j;
2. wave = time or measurement occasion;
3. π_0i = test-taker i's true initial status (the value of the outcome when wave_ij = 0);
4. π_1i = test-taker i's true rate of change during the period under study;
5. r_ij = the portion of test-taker i's outcome that is unpredicted on occasion j (the within-person residual); and
6. r_ij ~ N(0, σ²).

• Unconditional Model, Level 2 (Between-Person)

π_0i = β_00 + u_0i
π_1i = β_10 + u_1i,          (2)

where [u_0i, u_1i]' ~ N(0, T), with T = [τ_00 τ_01; τ_10 τ_11], and where:
1. π_0i = true initial status;
2. π_1i = true rate of change;
3. β_00 and β_10 = level 2 intercepts (the population average initial status and rate of change, respectively); and
4. u_0i and u_1i = level 2 residuals (representing those portions of initial status or rate of change that are unexplained at level 2; in other words, they represent deviations of the individual change trajectories around their respective group average trends).

The HLM model in Equations 1 and 2 is expressed as the sum of two parts. The fixed part contains two fixed effects, for the intercept and for the effect of wave (time). The random part contains three random effects, for the intercept, the wave slope, and the within-test-taker residual (r_ij) (Singer, 1998). For a description of fixed and random effects, please refer to Appendix A.

• Conditional Model, Level 1 (Within-Person)

Y_ij = π_0i + π_1i(wave_ij) + r_ij,          (3)

where:
1. Y_ij = test-taker i's rank-based literacy score on measurement occasion j;
2. wave = time or measurement occasion;
3. π_0i = test-taker i's true initial status (the value of the outcome when wave_ij = 0);
4. π_1i = test-taker i's true rate of change during the period under study;
5. r_ij = the portion of test-taker i's outcome that is unpredicted on occasion j (the within-person residual); and
6. r_ij ~ N(0, σ²).

• Conditional Model, Level 2 (Between-Person)

π_0i = β_00 + β_01(gender_i) + u_0i
π_1i = β_10 + β_11(gender_i) + u_1i,          (4)

where [u_0i, u_1i]' ~ N(0, T), with T = [τ_00 τ_01; τ_10 τ_11], and where:
1. π_0i = true initial status;
2. π_1i = true rate of change;
3. gender = level 2 predictor of both initial status and change;
4. β_00 and β_10 = level 2 intercepts (the population average initial status and rate of change, respectively);
5. β_01 and β_11 = level 2 slopes (representing the effect of gender on the change trajectories, and which provide increments or decrements to initial status and rates of change, respectively); and
6. u_0i and u_1i = level 2 residuals (representing those portions of initial status or rate of change that are unexplained at level 2; in other words, they represent deviations of the individual change trajectories around their respective group average trends).

Having already fit the unconditional model in Equations 1 and 2, Equations 3 and 4 specify an HLM model that explores whether or not variation in the intercepts and slopes is related to the Level 2 predictor, gender (Singer, 1998).

Hypotheses Being Tested

In this case, the null hypothesis being tested is that the males' and females' respective mean rank-based intercept scores are not significantly different from zero and are not significantly different from one another. A second null hypothesis is that the males' and females' respective mean rank-based rates of change are not significantly different from zero and are not significantly different from one another. A third null hypothesis is that there is no gender x wave interaction. The terms "intercept" and "rate of change" are discussed in more detail in a later section.

Explanation of the Statistical Output

Unconditional model. Figure 5 illustrates the output from the unconditional HLM analysis.

Figure 5. Unconditional model output (SPSS Mixed Model Analysis; dependent variable = RANK of grade2raw). The model comprises two fixed effects (intercept and wave) and an unstructured covariance structure for the random intercept and wave slope, with casenum as the subject variable (6 parameters in total). Information criteria (displayed in smaller-is-better form): -2 restricted log likelihood = 41816.625; AIC = 41824.625; AICC = 41824.638; CAIC = 41852.987; BIC = 41848.987. The Type III tests of fixed effects and the fixed-effect estimates are as follows:

Type III tests of fixed effects
Source      Numerator df   Denominator df   F          Sig.
Intercept   1              652.000          2514.582   .000
wave        1              652.000          .000       1.000

Estimates of fixed effects (with 95% confidence intervals)
Parameter   Estimate    Std. Error   df        t        Sig.    Lower Bound   Upper Bound
Intercept   327.0000    6.521009     652.000   50.146   .000    314.195287    339.804713
wave        -7E-013     1.644396     652.000   .000     1.000   -3.228952     3.228952

As Figure 5 shows, the parameter value 327.000 represents the estimate of the average intercept across test-takers (the average value of Y when wave = 0). Therefore, the average person began with a rank score of 327. The fact that this estimate is statistically significant (p = 0.000) simply means that this average intercept is significantly different from zero (which is not a particularly useful finding).
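For readers working outside SPSS, the unconditional model in Equations 1 and 2 can also be fit with general-purpose mixed-model software. The following is a minimal, illustrative sketch in Python (statsmodels), using the long-format rank data assembled in the earlier sketch; it is an analogue of, not the procedure that produced, the output shown in Figure 5.

```python
# Minimal sketch: unconditional growth model (Equations 1 and 2) fit by REML,
# with a random intercept and a random wave slope for each student (casenum).
import pandas as pd
import statsmodels.formula.api as smf

long = pd.read_csv("sdrt_long.csv")   # hypothetical long-format file:
                                      # casenum, gender, wave, rank_score

# re_formula="~wave" requests random intercepts and random wave slopes,
# with a freely estimated (unstructured) covariance between them.
unconditional = smf.mixedlm("rank_score ~ wave", data=long,
                            groups=long["casenum"], re_formula="~wave")
uncond_fit = unconditional.fit(reml=True)
print(uncond_fit.summary())           # fixed effects: intercept and wave slope
```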
The p-value associated with wave is not statistically significant (p = 1.000), and the estimated rate of change from Grade 2 (Wave 1) to Grade 6 (Wave 5) was, on average, nearly zero (in other words, the average slope across persons was nearly zero). This finding suggests that, on average, the test-takers in Dr. Siegel's sample did not tend to change their position over time, relative to the rest of the test-takers (Zumbo, 2005).

Conditional model. Figure 6 illustrates the output from the conditional HLM analysis. Conditional HLM models generally involve computing the proportion of variance in the outcome variable that can be explained not only by the nesting of the Level 1 scores within the Level 2 grouping units (in this case, the students), but also by the inclusion of the predictor variable(s) in the analysis. Therefore, the Level 2 predictor variable, gender, has been included in this analysis.

Figure 6. Conditional model output (SPSS Mixed Model Analysis; dependent variable = RANK of grade2raw). The model comprises four fixed effects (intercept, wave, gender, and wave x gender) and an unstructured covariance structure for the random intercept and wave slope, with casenum as the subject variable (8 parameters in total). Information criteria (displayed in smaller-is-better form): -2 restricted log likelihood = 41781.254; AIC = 41789.254; AICC = 41789.267; CAIC = 41817.614; BIC = 41813.614. The Type III tests of fixed effects and the fixed-effect estimates are as follows:

Type III tests of fixed effects
Source          Numerator df   Denominator df   F          Sig.
Intercept       1              651.000          1061.419   .000
wave            1              651.000          1.097      .295
gender          1              651.000          14.317     .000
wave * gender   1              651.000          2.131      .145

Estimates of fixed effects (with 95% confidence intervals)
Parameter       Estimate     Std. Error   df        t        Sig.    Lower Bound   Upper Bound
Intercept       301.8524     9.265121     651.000   32.579   .000    283.659238    320.045494
wave            -2.469243    2.358072     651.000   -1.047   .295    -7.099589     2.161103
gender          48.873229    12.916298    651.000   3.784    .000    23.510597     74.235862
wave * gender   4.798856     3.287336     651.000   1.460    .145    -1.656205     11.253917

As Figure 6 shows, the average person began with a rank score of approximately 302. This finding is similar to the result of the unconditional model, only now the model is controlling for the Level 2 predictor, gender. Once again, the fact that this estimate is statistically significant (p = 0.000) simply means that this average intercept is significantly different from zero (which is not a particularly useful finding). Figure 6 also highlights that, with the inclusion of the Level 2 predictor (gender) in the model, the p-value for wave is still not statistically significant (p = .295). This means that, on average, there is nearly zero rate of change from Grade 2 (Wave 1) to Grade 6 (Wave 5). This finding suggests that, on average, the test-takers in Dr. Siegel's sample did not tend to change their position over time, relative to the rest of the test-takers (Zumbo, 2005), which is similar to the result of the unconditional model.
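A corresponding sketch of the conditional model in Equations 3 and 4 simply adds gender, and the wave x gender interaction, as fixed effects. Again, this is an illustrative analogue of the SPSS analysis reported in Figure 6, assuming the same long-format data frame as in the previous sketch.

```python
# Minimal sketch: conditional growth model (Equations 3 and 4). The formula
# "wave * gender" expands to wave, gender, and the wave x gender interaction,
# which correspond to beta_10, beta_01, and beta_11 in the Level 2 equations.
import pandas as pd
import statsmodels.formula.api as smf

long = pd.read_csv("sdrt_long.csv")   # hypothetical long-format file

conditional = smf.mixedlm("rank_score ~ wave * gender", data=long,
                          groups=long["casenum"], re_formula="~wave")
cond_fit = conditional.fit(reml=True)
print(cond_fit.summary())             # inspect the gender and wave:gender terms
```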
The coefficient for gender, 48.87, captures the relationship between initial status and this Level 2 predictor. Because there is a significant main effect of gender (p = 0.000), it is concluded that there is a relationship between initial status and gender. This finding suggests that female test-takers, on average, begin with a rank score 48.87 higher than that of males (recalling that males are coded 0 and females are coded 1). With respect to growth rates, the parameter estimate of 4.79 (wave x gender) indicates that individuals who differ by 1.0 with respect to gender have growth rates that differ by 4.79, though this is not a statistically significant result (p = .145) (Singer, 1998). In other words, females' rank-based, longitudinal performance is, on average, better than that of males, although the gender difference in growth rates does not reach statistical significance.

The next section introduces the second case study (the two-wave case). Included is a description of the difference between two common indexes of change, the selected data set, the specific variables of interest, and the proposed methodology. The resultant statistical output is also explained.

Case 2: Non-Parametric Difference Score (Conover Solution for Two-Wave Data)

Two-wave designs (also known as pretest-posttest designs) are characterised by some comparison of an individual's score at the second wave of data collection to some baseline or initial measure score. The most common indexes of change involved in two-wave designs are (a) the simple difference score and (b) the residualised change score.

Simple difference score (change or gain score). As first introduced in Chapter 2, the most common of all change indexes is the simple difference score, which is calculated by simply subtracting an individual's observed score at Wave 1 from his or her observed score at Wave 2. A positive simple difference score typically indicates an increase in a given phenomenon of interest over time, whereas a negative score indicates diminishment over time.

Residualised change score. As Zumbo (1999) notes, it has been argued that simple difference scores are unfair because of their base-dependence (i.e., difference scores tend to be correlated negatively with scores at Wave 1). As such, the residualised change score was developed as an alternative to the simple difference score. Although there are different ways to create such scores, the most common residualised change score is estimated from the regression analysis of the observed Wave 2 score on the observed Wave 1 score. In other words, the Wave 2 score estimated from that regression is subtracted from the observed Wave 2 score.

The intrinsic fairness, usefulness, reliability, and validity of the two-wave research design have been debated widely for decades (Zumbo, 1999). In their seminal article, Cronbach and Furby (1970, as cited in Zumbo, 1999) famously disparaged the use of two-wave designs, arguing that gain scores are rarely useful, no matter how they are adjusted or refined. The crux of their argument is that two-wave designs lead to inherently unreliable, inherently invalid, and inherently biased indexes of change. Their disdain of two-wave designs was so strong that they stated that researchers "who ask questions using gain scores would ordinarily be better advised to frame their questions in other ways" (Cronbach & Furby, 1970, as cited in Zumbo, 1999, p. 80).
A s Zumbo (1999) notes, it is somewhat puzzling that there has been such frequent avoidance of two-wave designs, given that variations of difference scores lie at the heart of various widely-used and commonly-accepted statistical tests, such as the paired samples Mest. Description of the Data The research question being posed in this particular case study is: Are there gender differences in students' rank-based change scores - based on their performance across two waves of the Foundation Skills Assessment (FS A ) numeracy subtest? The FS A , a three-part annual assessment test administered by the British Columbia Ministry of Education, is designed to measure the reading comprehension, writing, and numeracy skills of 4th- and 7th-grade students throughout British Columbia. The F S A is administered in public and in funded independent schools across the province in late April/early M a y of each year. Approximately 40,000 students per grade level write the F S A each year. The F S A relates to what students learn in the classrooms in two important ways. First, the F S A measures: critical skills that are part of the provincial curriculum. F S A represents broad skills that all students are expected to master. F S A only addresses skills that can be tested in a limited amount of time, using a pen-and-paper format. F S A 107 does not measure specific subject knowledge or many of the more complex, integrated areas of learning (British Columbia Ministry of Education, 2003, p. 20). Second, the F S A tests are designed to measure cumulative learning. This means that when, for example, 7th-grade students complete their version of the F S A , they are expected to use skills gained from kindergarten to Grade 7 (British Columbia Ministry of Education, 2003). The British Columbia Ministry of Education and the school districts use F S A results to: (a) report the results of student performance in various areas of the curriculum; (b) assist in curriculum improvement; (c) facilitate discussions on student learning; and (d) examine the performance o f various student populations to determine i f any require special attention. Schools use F S A data primarily to assist in the creation and modification of various school growth plans (e.g., plans for academic improvement). It should be noted that written approval from both the University of British Columbia's Behavioural Research Ethics Board (Appendix C) and the B C Ministry o f Education was received in order to conduct this particular case study. Specific Variables of Interest and Proposed Methodology Obtained was the entire population of standardised3 1 (scaled) numeracy subtest scores 3 2 o f 41,675 3 3 students who wrote the F S A in both 1999/2000 (Wave 1, Grade 4) and 3 1 The Ministry has standardised students' FSA scores such that each wave's score distribution has M=0 and SD = 1. For this reason, descriptive statistics and histograms are not presented in this particular case study. 3 2 Willett et al. (1998) argue that standardised test scores should never be used in the place of raw scores in individual growth modelling analyses (readers are referred to their article for the specific reasons why). Recall, however, that ranks are actually being used in the place of the original test scores. Thus, it is unimportant whether or not the original test scores come in the form of standardised scores. Furthermore, the Ministry of Education does not supply researchers with raw FSA scores - only standardised scores. 108 2002/2003 (Wave 2, Grade 7) 3 4 . 
Of this population of students, a 10% random sample of 4097 students (n_female = 2055; n_male = 2042) was retained for analyses. Each student record included a unique (but arbitrarily assigned) case number and a gender flag (coded F = female, M = male).

To begin this case study, students' FSA scores were rank-transformed within wave. The (a) correlation between the Grade 4 and Grade 7 rank scores and (b) ratio of the two standard deviations of each grade's rank scores were then computed. Zumbo (1999) writes that "one should utilize the simple difference score instead of the residualized difference if and only if ρ(X1, X2) > σ_X1 / σ_X2" (p. 293). As such, because the correlation between the Grade 4 and Grade 7 rank scores [ρ(X1, X2) = 0.669] was less than the ratio of the two standard deviations (1182.843/1182.846 = 0.999), it was necessary to use the residualised change score, rather than the simple difference score, as the index of change in this case study. Please refer to Appendix B for a more thorough description of how the residualised change score was computed in SPSS. In this case, the residualised change score represents an individual's Wave 2 rank score minus his or her rank score at Wave 2 predicted from his or her Wave 1 rank score. The residualised change score serves as the dependent variable in the subsequent statistical analysis: an independent samples t-test (for gender). By implementing an independent samples t-test, it is possible to explore whether female and male students' longitudinal performance on the FSA differs statistically. Recall from Chapter 1 that Kolen and Brennan (2004) state that most test linking strategies require a minimum sample size of 400 test-takers per form; using this rule of thumb, Case 2's sample size would be considered sufficient for test linking. For the purpose of this case study, students missing one wave of FSA data were excluded from the analyses.

Hypotheses Being Tested

In this case, the null hypothesis being tested is that the males' mean rank-based residualised change score is not significantly different from that of the female test-takers. In contrast, the alternative hypothesis is that males' mean rank-based residualised change score differs significantly from that of the female test-takers. A two-tailed hypothesis was chosen, because there was no strong directional expectation.

Explanation of the Statistical Output

The output from the independent samples t-test showed that the mean residualised change score for males was -7.6467 (SD = 882.31), as opposed to 7.5983 for females (SD = 876.71). So, the average Wave 2 rank score, less the rank score at Wave 2 predicted from the Wave 1 rank score, is higher for females than for males. This means that females gained, on average, approximately 7.6 points in relative standing, whereas boys' relative standing decreased, on average, by approximately 7.6 points (Zumbo, 2005). Despite the mean differences in residualised change scores for males and females, the independent samples t-test results showed that there is no statistically significant gender difference in the residualised change scores, t(4095) = -.555, p = .579 (assuming equal variances; two-tailed). Thus, males' mean rank-based residualised change score did not differ significantly from that of the female test-takers. Put another way, males' and females' relative standing over time did not differ significantly.
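The full two-wave procedure just described (within-wave ranking, Zumbo's decision rule, the residualised change score, the independent samples t-test, and the Cohen's d effect size discussed immediately below) can be sketched as follows. The actual computations for this case study were performed in SPSS (see Appendix B); the Python sketch below is illustrative only, and the file and variable names are hypothetical.

```python
# Minimal sketch of the two-wave Conover procedure described above.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("fsa_two_wave.csv")          # hypothetical file name

# Step 1: rank-transform within wave (rank 1 = lowest score; mid-ranks for ties).
df["rank_g4"] = df["grade4_std"].rank(method="average")
df["rank_g7"] = df["grade7_std"].rank(method="average")

# Step 2: Zumbo's (1999) decision rule - use the simple difference score if and
# only if the between-wave correlation is at least the ratio of the wave SDs.
r = df["rank_g4"].corr(df["rank_g7"])
sd_ratio = df["rank_g4"].std() / df["rank_g7"].std()

# Step 3: compute the chosen change index.
if r >= sd_ratio:
    df["change"] = df["rank_g7"] - df["rank_g4"]          # simple difference score
else:
    # Residualised change: observed Wave 2 rank minus the Wave 2 rank
    # predicted from the Wave 1 rank (ordinary least squares).
    slope, intercept, *_ = stats.linregress(df["rank_g4"], df["rank_g7"])
    df["change"] = df["rank_g7"] - (intercept + slope * df["rank_g4"])

# Step 4: independent-samples t-test on the change index by gender, plus
# Cohen's d computed with the pooled standard deviation.
females = df.loc[df["gender"] == "F", "change"]
males = df.loc[df["gender"] == "M", "change"]
t_stat, p_value = stats.ttest_ind(females, males, equal_var=True)
pooled_sd = np.sqrt(((len(females) - 1) * females.var(ddof=1) +
                     (len(males) - 1) * males.var(ddof=1)) /
                    (len(females) + len(males) - 2))
cohens_d = (females.mean() - males.mean()) / pooled_sd
print(t_stat, p_value, cohens_d)
```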
Even though there was no statistically significant gender difference found, an effect size should still be computed, for reasons outlined by Zumbo and Hubley (1998). In this case, the Cohen's d effect size was calculated by subtracting the mean residualised change score of one group (females) from that of the other group (males) and dividing that difference by the pooled standard deviation. The resultant effect size equals 0.02, which represents a small effect size (Cohen, 1988). (No effect size was calculated for the non-parametric HLM case, because there is still no consensus on how an effect size should be calculated for hierarchical models, particularly when repeated measures are nested within test-takers. Some researchers are investigating the use of the intra-class correlation for this purpose; e.g., Singer & Willett, 2003.)

Chapter Summary

In summary, this chapter illustrated the particulars of the Conover solution to the problem of analysing change and growth with time-variable measures, which is particularly useful when one's measures cannot be linked. The two case studies (the multi-wave case and the two-wave case, respectively) offered detailed descriptions of the data, the specific variables of interest, the proposed methodology, and the statistical models and equations, and explained the resultant statistical output. In the next chapter, the Conover solution's strengths and limitations are discussed, and readers are offered suggestions for future studies rooted in analysing change and growth with time-variable measures (particularly those that cannot be linked).

Chapter 7: Discussion and Conclusions

As was illustrated earlier by way of examples, repeated measures analyses are typically used in three scenarios: In Scenario 1, the exact same measure is used and re-used across waves. In Scenario 2, most of the measures' content changes across waves - typically commensurate with the age and experiences of the test-takers - but the measures retain one or more linkable items across waves. In Scenario 3, the measures vary completely across waves (i.e., there are no linkable items), or there are small sample sizes, or there is no norming group. Because Scenarios 2 and 3 are found in educational and social science research settings, it was vital to explore more fully this particular problem: analysing change and growth within the contexts presented in Scenario 2 (time-variable measures that can be linked) and Scenario 3 (time-variable measures that cannot be linked) - both of which are characterised by the measures changing across waves. This dissertation devoted specific attention to the latter of the two scenarios, given that this particular scenario has gone relatively unaddressed in the test linking and change/growth literatures.

Summary of the Preceding Chapters

This dissertation had two objectives (and novel contributions):

Objective 1: To weave together test linking and change/growth literatures. The first objective was to weave together, or to bridge the gap between, the test linking and change/growth literatures in a comprehensive manner. Until now, the two literatures have either been largely disconnected or, at most, woven together in such a manner so as to situate the motivating problem primarily around vertical scaling techniques. As was described in Chapter 1, the gap between the test linking and change/growth literatures causes several problems.
Therefore, it is only by bridging the gap between the two literatures that one can analyse change/growth with time-variable measures in a rigorous fashion. In Chapter 3, readers were provided an overview of the five broad types of test linking: equating, calibration, statistical moderation, projection, and social moderation, respectively (Kolen & Brennan, 2004; Linn, 1993) in order to highlight the methods that currently exist for linking or connecting the scores of one measure to those of another. A s was elucidated in this chapter, none of these five test linking methods proved to be a suitable solution to the problem of analysing change/growth with time-variable measures that cannot be linked. Consequently, these five methods were presented for the purpose of providing readers with a brief overview of test linking, and with specific reasons why each is insufficient in the study of change and growth with time-variable measures. With the aim of ascertaining how researchers are handling the problem of analysing change and growth with time-variable measures in real-life research settings, Chapter 4 offered a discussion of seven test linking strategies currently being used in the change and growth literature as means of handling the problem of analysing change and growth with time-variable measures (particularly those that can be linked). These seven strategies included: (1) vertical scaling, (2) growth scales, (3) Rasch modelling, (4) latent variable or structural equation modelling, (5) multidimensional scaling, (6) standardising the test scores or regression results, and (7) converting raw scores to age- or grade-equivalents pre-analysis, respectively. Each strategy was presented, where possible, with examples from real-life research settings. Whereas Chapter 3 spoke about test linking in a general sense, this chapter focussed on various strategies that educational researchers are using to handle the motivating problem specifically. 113 For reasons outlined in that chapter, none of these seven strategies proved to be an adequate solution to the problem of analysing change and growth with time-variable measures (particularly those that cannot be linked). B y highlighting each test linking method's shortcomings in terms of the motivating problem, the rationale for this dissertation's Conover solution for the problem of analysing change/growth with time-variable measures (that cannot be linked) was established. Objective 2: To introduce a novel solution to the problem. Given the limitations of the test linking strategies presented in Chapters 3 and 4 (in terms of being viable solutions to the motivating problem), the second objective of the dissertation was to introduce a solution to the problem of handling time-variable measures that cannot be linked. A s Chapters 1 and 4 described, many of the strategies currently being used in the change/growth literature as means of handling the motivating problem cannot be used by researchers in everyday research settings - often because they do not have the means to use large sample sizes or item pools. Moreover, many of the strategies presented in Chapter 4 require the presence of common items across measures which, as discussed previously, is not always feasible (or warranted). Therefore, this dissertation introduced a workable solution that can be implemented easily in everyday research settings, and one that is particularly useful when one's measures cannot be linked. This solution was coined the Conover solution. 
Two case studies demonstrated the application of the Conover solution: The multi-wave case, involving Dr. Siegel's literacy data, illustrated the non-parametric H L M solution. The two-wave case, involving Foundation Skills Assessment data, described the non-114 parametric difference score solution. In the next sections of this chapter, some of the Conover solution's strengths and limitations, respectively, are discussed. Strengths of the Conover Solution The Conover solution offered as a means o f handling the problem of analysing change and growth with time-variable measures (particularly those that cannot be linked) has several strengths - each of which is described below. Ease of use. The first strength relates to the ease of implementation of the Conover solution. A s Conover and Iman (1981) note, it is often more convenient to use ranks in a parametric statistical program than it is to write a program for a non-parametric analysis. Furthermore, all o f the steps required for the implementation of the Conover solution (i.e., rank transforming data within waves, conducting independent samples Mests and mixed-effect analyses, restructuring the data matrix, etc.) can be easily performed using commonly-used statistical software packages, such as SPSS. Given the widespread use of such software packages in educational and social science research settings, the Conover solution is particularly appealing from a practical standpoint. Bridges the parametric/non-parametric gap. Second, by rank transforming the data pre-analysis, one is able to bridge the gap between parametric and non-parametric statistical methods, thereby providing "a vehicle for presenting both the parametric and nonparametric methods in a unified manner" (Conover & Iman, 1981, p. 128). Unfortunately, introductory statistics courses and textbooks very often treat the two methods as i f they are completely distinct from one another when, in fact, there can be great strength in marrying or bridging the two. This particular strength was also discussed in Chapters 1 and 5. 115 Makes use of the ordinal nature of data. Third, the Conover solution makes use of the ordinal nature o f continuous-scored data: A test-taker with a low raw score relative to other test-takers in his wave w i l l also yield a low relative rank score . Similarly, a test-taker with a high test-score w i l l also yield a high rank score. A s a result, within-wave order among the students is preserved. B y ranking test-takers' scores within-wave pre-analysis, it is possible to put the longitudinal test scores on a "common metric", thereby providing a standard against which test-takers' scores can be measured and compared. Makes fewer assumptions, may improve power, and may mitigate effect of outliers. Using ranks in the place of raw or standardised scores in subsequent analyses has a fourth strength of not requiring multivariate normality, nor does it require that each variable follow its own normal distribution. It is for this reason that rank-based methods are often referred to as "distribution-free". In his widely-cited article, Miccer i (1989) investigated the distributional characteristics of 440 large-sample achievement and psychometric measures' scores. He found that all 440 measures' scores were significantly non-normally distributed (p < 0.01). A s a result, "the underlying tenets of normality-assuming statistics appear fallacious for these commonly used types of data" (Micceri, 1989, p. 156). 
B y utilising ranks in the place of raw or standardised scores in subsequent analyses, there may be an improvement in statistical power and mitigation in the effects of outliers. Zimmerman and Zumbo (1993a) state: For quite a few nonnormal distributions, [the] Wilcoxon-Mann-Whitney test holds a power advantage over the Student t test, both in the asymptotic limit 3 6 Statistical packages often assign a rank of one (1) to each wave's smallest score. It is recommended that analysts retain this default setting, as it is logical to think that the lowest test score should also have the lowest rank. 116 and for small and moderate sample sizes. [This] power advantage is accounted for by reduction in the influence of outliers in conversion of measures to ranks (p. 495). Furthermore, Miccer i (1989) writes that: Robustness of efficiency (power or beta) studies suggest that competitive tests such as the Wilcoxon rank sum exhibit considerable power advantages while retaining equivalent robustness of alpha in a variety of situations (p. 157). In summary, the work of Miccer i (1989) and Zimmerman and Zumbo (1993a) indicates that normality-assuming statistics may be relatively non-robust in non-A normal contexts. A s a result of the possible imprecision of statistical procedures dependent on the normality assumption, a fourth strength of the Conover solution is that it makes fewer assumptions about the distribution about the data, may improve power, and may mitigate the effect of outliers 3 7 in certain contexts. Requires no common/linkable items. Unlike many o f the test linking methods and strategies described in earlier chapters, the Conover solution can be implemented not only in situations in which one's study involves time-variable measures that can be linked, but also situations in which the time-variable measures share no linkable items whatsoever. Hence, unlike vertical scaling, equating, and their linking counterparts, the Conover solution provides a means by which researchers can study change and growth - whether or not the measures contain linkable items. It is anticipated that this particular feature of the Conover solution w i l l l ikely prove appealing to researchers studying constructs thought to change developmentally (e.g., achievement). For detail about what can be considered an outlier or "wild observation", please refer to Kruskal (1960). 117 Requires no norming group. Recall from an earlier section that a norming group (normative sample) is a large sample, ideally representative of a well-defined population, whose test scores provide a set of standards against which the scores of one's sample can be compared (Gall et al., 1996). Due to time and financial constraints, it is not always possible to compare the scores of one's sample to those of an external norming sample. A s such, a sixth strength of the Conover solution is that it can be conducted using simply the scores of the immediate sample of test-takers. Limitations of the Conover Solution A s with any methodological tool, the Conover solution has various limitations - each of which is described in this section. Within-wave ranks are bounded. Recall from an earlier section that rank transforming refers to the process of converting a test-taker's original score to rank relative to other test-takers - suggesting a one-to-one function f from the set { X ] , X2, . . . , X N } (the sample values) to the set {1, 2,.. .,N}(the first N positive integers) (Zimmerman & Zumbo, 1993a). 
The values assigned by the function to each sample value in its domain are the number of sample values having lesser or equal magnitude. Consequently, the rank scores are bounded from above by N . As a result, "any outliers among the original sample values are not represented by deviant values in the rank" (Zimmerman & Zumbo, 1993a, p. 487). Imagine that, on a standardised test of intelligence, Student W earns a score 100, Student X earns a score o f 101, Student Y earns a score of 102, and Student Z earns a score of 167. Student Z ' s score, relative to the other test-takers, is exceptional. Despite her exceptional performance on the measure, her test score is masked by the application of ranks: Student W = 1, Student X = 2, Student Y = 3, and Student Z = 4. 118 A s a result, one limitation of the Conover solution is that there may be problems associated with the inherent restriction of range it places on data. Differences between any two ranks range between 1 and N - 1 , whereas the differences between original sample values range between 0 and infinity (Zimmerman & Zumbo, 1993a). Difficulties associated with handling missing data. Recall from the two case studies (Chapter 6) that only those test-takers for whom data were available at each and every wave were retained in the analyses. A s most educational and social science researchers w i l l note, no discussion about change and growth is complete without a complementary discussion about one unavoidable problem: missing data. In longitudinal designs, particularly those that span months or years, it is extremely common to face problems associated with participant dropout, attrition, and as well as participants who join, or return to the study, in later waves. The complexity (even messiness!) of many longitudinally-collected data sets can have serious implications for growth analyses. Singer and Willett (2003) chide readers that, when fitting a growth model: Y o u implicitly assume that each person's observed records are a random sample of data from his or her true growth trajectory. I f your design is sound, and everyone is assessed on every planned occasion, your observed data w i l l meet this assumption. If one or more individuals are not assessed on one or more occasions, your observed data may not meet this assumption. In this case, your parameter estimates may be biased and your generalizations incorrect (p. 157). 119 The degree to which one's generalisations are incorrect is dependent on the degree and type of data missing data. Although there is no general consensus on the categories of data 'missingness', Schumaker and Lomax (2004) outline three ways in which missing data can arise: 1. Miss ing completely at random ( M C A R ) , which implies that data are missing unrelated, statistically, to the values that would have been observed. Put another way, the observed values are a random sample of all the values that could have been observed (ideally), had there been no missing data (Singer & Willett, 2003); 2. Missing at random ( M A R ) , in which data values are missing conditional on other variables or a stratifying variable; and 3. Non-ignorable data, which implies probabilistic information about the values that would have been observed. One possible strategy for circumventing, or at least mitigating the effect of, missing data is to impute the missing raw or standardised scores prior to rank-transforming the data within-wave pre-analysis. 
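As a concrete illustration of this strategy, the sketch below applies the simplest of the imputation methods described next (mean substitution) before the within-wave rank transformation; the other methods would simply replace the imputation step. The sketch is illustrative only, and the file and variable names follow the hypothetical Case 1 sketch presented earlier.

```python
# Minimal sketch: mean substitution within wave, applied before ranking.
import pandas as pd

wide = pd.read_csv("sdrt_wide.csv")            # hypothetical wide-format file
raw_cols = ["grade2raw", "grade3raw", "grade4raw", "grade5raw", "grade6raw"]

for col in raw_cols:
    # 1. Impute each wave's missing raw scores with that wave's observed mean.
    wide[col] = wide[col].fillna(wide[col].mean())
    # 2. Rank within wave after imputation (tied imputed scores share a mid-rank).
    wide[col.replace("raw", "rank")] = wide[col].rank(method="average")
```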
Because detailed discussion of the various imputation methods are beyond the scope of this dissertation, and because missing data discussion is largely case-dependent, this dissertation merely provides a general description of four methods of imputing missing data (Schumaker & Lomax, 2004): 1. Mean substitution, which involves substituting the mean for missing values in a given variable; 2. Regression imputation, in which the missing value is substituted with some predicted value; 120 3. Maximum likelihood imputation, which involves finding the expected value for a missing datum based on maximum likelihood parameter estimation; and 4. Matching response pattern imputation, in which variables with missing data are matched to variables with complete data to determine a missing value. Makes use of the ordinal nature of data. Recall that the fact that the Conover solution makes use of the ordinal nature of continuous-scored data was previously identified as one of the solution's strengths. Unfortunately, precisely what the Conover solution wins by, it also loses by: Because of the rank transformation of the raw or standardised scores, "differences between raw scores are not necessarily preserved by the corresponding ranks. For example, a difference between the raw scores corresponding to the 15th and the 16th ranks is not necessarily the same as the difference between the raw scores corresponding to the 61st and 62nd ranks in a collection of 500 test scores" (Zimmerman & Zumbo, 2005, p. 618). Suggestions for Future Research In the current section, various suggestions for future research are presented: First, two of the test linking methods referred to in Chapter 3 - concordance through scaling (i.e., social moderation) and projection - have been used by Cascallar and Dorans (2005) to link the scores of separate tests. They use these two methods specifically to link scores from different tests of similar content but of different languages, and not to link scores from time-variable achievement tests; however, it is conceivable that these two solutions could be extended to the problem of analysing change and growth with time-variable scales -that can or cannot be linked. 121 The first linking method, concordance through scaling, is used to establish an equating relationship between tests: 1. that are non-exchangeable (in the context of the current problem, non-exchangeability would mean that a fourth-grade student would likely not be able to complete the seventh-grade version of a mathematics test; conversely, a seventh-grader would find writing the fourth-grade version of a mathematics test too easy), but 2. whose content is highly related (e.g., fourth-grade and seventh-grade mathematics tests that are both heavy on statistics and probability concepts). The second linking method, projection, is merely concerned with minimising imprecision in the predictions of one score from one or more scores, thus "doing the best job possible of predicting one set of scores from another" (Cascallar & Dorans, 2005, p. 343). It involves observing the scores on Test X , then predicting what would be likely scores on Test Y . This type o f linking involves administering different measures to the same set of test-takers and then estimating the joint distribution among the scores. For example, seventh-graders' scores on a mathematics test could be estimated from their respective scores on earlier versions of the time-variable tests administered in each prior school year. 
Recall that projection is often regarded as the weakest form of statistical l inking (Linn, 1993). It is suggested that future research examines the utility of concordance through scaling and projection, in the context of change and growth, particularly when the measures are time-variable and cannot be linked. Second, recall from an earlier section that a limitation of the Conover solution is that it is not designed to handle missing data. For this reason, various imputation methods are necessary to f i l l in the 'gaps' left by the missing data. It is suggested that researchers try 122 imputing missing data in the four different ways presented in this dissertation (mean substitution, regression imputation, maximum likelihood imputation, and matching response pattern imputation) to see the extent to which the results and/or the inferences made about the results vary across the methods of imputation. Third, as was described in Chapter 6, it may be possible to conclude that one's time-variable measures are indeed commensurable i f the pattern of wave-specific convergent and discriminant correlations (Campbell & Fiske, 1959) are similar across all waves of one's study (i.e., the M T M M at Wave 1 is similar to the M T M M at Wave 2, and so forth). Given that this idea is new and has not been explored in either the test linking or change/growth literatures, this idea most certainly requires investigation in future research. Finally, very little research has explored the issue of sample size in non-parametric-based repeated measures designs, in part because the methodology is so new (e.g., Zimmerman & Zumbo, 1993b and the current dissertation). Although there are general recommendations about what is considered a "minimally sufficient" sample size for general test-linking strategies, these recommendations are typically situated around item response theory requirements (400 test-takers per measure; see Chapter 1). A s such, a final suggestion for future research is to explore the results, and inferences made from the results, o f non-parametric H L M and non-parametric difference score analyses conducted on samples of various sizes. Conclusions There are two primary reasons why investigating the problem of analysing change and growth with time-variable measures was undertaken in this dissertation. First, as Willett et al. (1998) and von Davier et al. (2004) describe, the rules about which tests are permissible 123 for repeated measures designs are precise and strict. Given these conditions, it was necessary to investigate i f and how repeated measures designs are possible - speaking both psychometrically and practically - when the measures themselves must change across waves. Second, given the substantial growth in longitudinal large-scale achievement testing in the past decade, it was (and is) necessary to find a viable and coherent solution to the problem so that researchers, educational organisations, policy makers, and testing companies can make the most accurate inferences possible about their test scores. In this dissertation, readers were introduced to a novel solution for handling the problem of analysing the problem of change and growth with time-variable measures (particularly those that cannot be linked). It should, however, be stressed again that the Conover solution is by no means a universal panacea. 
It is imperative that educational and social science researchers continue to carry the torch in terms of exploring the various ways (and contexts) in which the motivating problem can be solved. L inn (1993) notes that considering any one individual method as the ultimate solution to the problem of linking test scores is fundamentally unsound because: The sense in which the scores for individual students can be said to be comparable to each other or to a fixed standard depends fundamentally on the similarity of the assessment tasks, the conditions of administration, and their cognitive demands. The strongest inferences that assume the interchangeability of scores demand high degrees of similarity. Scores can be made comparable in a particular sense for assessments that are less similar. Procedures that make scores comparable in one sense (e.g., the most likely score for a student on a second assessment) w i l l not simultaneously make the scores comparable in another sense (e.g., the proportion of students that exceed a fixed standard. Weaker forms of linkage are likely to be context, group, and time dependent, which suggests the need for continued monitoring of the comparability of scores (p. 100). Because of the case-specific nature of the problem of analysing change and growth with time-variable measures (that can or cannot be linked), researchers are beseeched to prioritise the making of careful and trained judgements about their proposed measures - right at the outset of the study. The later one waits to make such judgements, the less accurate the inferences one makes from the tests' scores. To reiterate a sentiment offered by Kolen and Brennan (2004), "the more accurate the information, the better the decision" (p. 2). 125 References Abbeduto, L . , & Hagerman, R. (1997). Language and communication in fragile X syndrome. Mental Retardation and Developmental Disabilities Research Reviews, 3, 313-322. Afrassa, T., & Keeves, J. P. (1999). Changes in students' mathematics achievement in Australian lower secondary school schools over time. International Education Journal, 7(1), 1-21. A i , X . (2002). Gender differences in growth in mathematics achievement: Three-level longitudinal and multilevel analyses of individual, home, and school influences. Mathematical Thinking and Learning, 4, 1-22. Beasley, T. M . , & Zumbo, B . D . (2003). Comparison of aligned Friedman rank and parametric methods for testing interactions in split-plot designs. Computational Statistics and Data Analysis, 42, 569-593. Bejar, 1.1. (1983). Introduction to item response models and their assumptions. In R. K . Hambleton (Ed.), Applications of item response theory. Vancouver, B C : Educational Research Institute of British Columbia. Bolt , D . M . (1999). Evaluating the effects of multidimensionality on IRT true-score equating. Applied Measurement in Education, 12(4), 383-407. Braun, H . I., & Holland, P. W . (1982). Observed-score test equating: A mathematical analysis of some E T S equating procedures. In P. W . Holland and D . B . Rubin (Eds.), Test equating. New York: Academic Press. British Columbia Ministry of Education. (2003). Interpreting and communicating British Columbia Foundation Skills Assessment results 2002. Retrieved December 8, 2003, from http://www.bced.gov.bc.ca/assessment/fsa/02interpret.pdf 126 British Columbia Ministry of Education, (n.d.). FSA numeracy specifications. Retrieved August 30, 2006, from http://www.bced.gov.bc.ca/assessment/fsa/numeracy_specs.pdf Bryk, A . S., & Raudenbush, S. W . 
(1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, C A : Sage Publications. Campbell, D . T., & Fiske, D . W . (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105. Cascallar, A . S., & Dorans, N . J. (2005). Linking scores from tests of similar content given in different languages: A n illustration involving methodological alternatives. International Journal of Testing, 5(4), 337-356. Chiappe, P., Siegel, L . S., & Gottardo, A . (2002). Reading-related skills of kindergartners from diverse linguistic backgrounds. Applied Psycholinguistics, 23, 95-116. Chiappe, P., Siegel, L . S., & Wade-Woolley, L . (2002). Linguistic diversity and the development of reading skills: A longitudinal study. Scientific Studies of Reading, 6, 369-400. ' Clemans, W . V . (1993). Item response theory, vertical scaling, and something's awry in the state of the test mark. Educational Assessment, 1(A), 329-347. Cliff, N . (1993). What is and isn't measurement. In G . Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences, Volume 1: Methodological issues (pp. 59-93). Hillsdale, N J : Lawrence Erlbaum. Cohen, B . F. (1996). Explaining Psychological Statistics. Pacific Grove, C A : Brooks/Cole Publishing Company. 127 Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, N J : Lawrence Erlbaum, Associates. Conger, J. J., & Galambos, N . L . (1997). Adolescence and youth: Psychological development in a changing world (5th ed.). New York: Addison Wesley Longman. Conover, W . J. (1999). Practical Nonparametric Statistics (3rd ed.). New York: John Wiley & Sons. Conover, W . J., & Iman, R. L . (1981). Rank transformations as a bridge between parametric and nonparametric statistics. The American Statistician, 35, 124-129. Cronbach, L . J., & Furby, L . (1970). How should we measure "change" - Or should we? Psychological Bulletin, 74, 68-80. Ding, C. S., Davison, M . L . , Petersen, A . C. (2005). Multidimensional scaling analysis of growth and change. Journal of Educational Measurement, 42(2), 171-191. Dixon, R. A . , & Lerner, R. M . (1999). History and systems in developmental psychology. In M . H . Bornstein & M . E . Lamb (Eds.), Developmental psychology: An advanced textbook (4th ed.; pp. 3-45). Mahwah, N J : Lawrence Erlbaum Associates. Domhof, S., Brunner, E . , & Osgood, D . W . (2002). Rank procedures with repeated measures with missing values. Sociological Methods and Research, 30(3), 367-393. Embretson, S. E . , & Reise, S. P. (2000). Item response theory for psychologists. New Jersey: Lawrence Erlbaum Associates. Ercikan, K . (1997). Linking statewide tests to the N A E P : Accuracy of combining test results across states. Applied Measurement in Education, 7(7(2),145-159. Gal l , M . D . , Borg, W . R., & Gal l , J. P. (1996). Educational research: An introduction, (6th ed.). White Plains, N Y : Longman. 128 Goldstein, H . (1995). Multilevel statistical models (2nd ed.). London: Arnold. Golembiewski, R. T., Billingsley, K . , & Yeager, S. (1976). Measuring change and persistence in human affairs: Types of change generated by O D Designs. Journal of Applied Behavioral Sciences, 12, 133-157. Gravetter, F. J., & Wallnau, L . B . (2005). Essentials of Statistics for the Behavioral Sciences (5th ed.). Belmont, C A : Thomson Wadsworth. Guay, F. , Larose, S., & Boiv in , M . (2004). 
Academic self-concept and educational attainment level: A ten-year longitudinal study. Self and Identity, 3(1), 53-68. Haertel, E . H . (2004). The behavior of linking items in test equating. Retrieved February 12, 2006 from http://www.cse.ucla.edu/reports/R630.pdf Hambleton, R. K . , Swaminathan, FL, & Rogers, H . J. (1991). Fundamentals of item response theory. Newbury Park, C A : Sage Publications. Holland, P. W. , & Rubin, D . B . , Eds. (1982). Test equating. New York: Academic Press. Howel l , D . C. (1995). Fundamental Statistics for the Behavioral Sciences (3rd ed.). Belmont, C A : Duxbery Press. Johnson, C , & Raudenbush, S. W . (2002). A repeated measures, multilevel Rasch model with application to self-reported criminal behavior. Retrieved February 12, 2006 from http://www.ssicentral.com/hlm/techdocs/NotreDamePaper2c.pdf Jordan, N . C , & Hanich, L . B . (2003). Characteristics o f children with moderate mathematics deficiencies: A longitudinal perspective. Learning Disabilities Research and Practice, 18(A), 213-221. Karlsen, B . , & Gardner, E . F. (1978-1996). Stanford Diagnostic Reading Test, Fourth Edition. San Antonio, T X : Harcourt Brace Educational Measurement. 129 Kolen, M . J. (2001). Linking assessments effectively: Purpose and design. Educational Measurement: Issues and Practice, 20(1), 5-19. Kolen, M . J., & Brennan, R. L . (2004). Test Equating, Scaling, and Linking: Methods and Practices (2nd ed.). New York: Springer-Verlag. Kruskal, W . H . (1960). Some remarks on wi ld observations. Technometrics, 2, 1-3. Lesaux, N . K . , & Siegel, L . S. (2003). The development of reading in children who speak English as a Second Language (ESL) . Developmental Psychology, 39(6), 1005-1019. Leung, S. O. (2003). A practical use of vertical equating by combining IRT equating and linear equating. Practical Assessment, Research & Evaluation, 8(3). Retrieved July 24, 2006, from http://pareonline.net/getvn.asp?v=8&n=23 Linn , R. L . (1993). Linking results of distinct assessments. Applied Measurement in Education, 6(1), 83-102. Linn , R. L . , & Kiplinger, V . L . (1995). Linking statewide tests to the National Assessment of Educational Progress: Stability of results. Applied Measurement in Education, 8, 135-155. Linn , R. L . , & Slinde, J. A . (1977). The determination of the significance of change between pre-and posttesting periods. Review of Educational Research, 47(1), 121-150. Lissitz, R. W. , & Huynh, H . (2003). Vertical equating for state assessments: issues and solutions in determination of adequate yearly progress and school accountability. Practical Assessment, Research & Evaluation, 8(10). Retrieved November 3, 2005 from http://PAREonline.net/getvn.asp?v=8&n=10 Lloyd , J. E . V . , Walsh, J., & Shehni Yailagh, M . (2005). Sex differences in mathematics: If I 'm so smart, why don't I know it? Canadian Journal of Education, 28(3), 384-408. 130 Lord, F. M . (1982). Item response theory and equating - A technical summary. In P. W . Holland & Donald B . Rubin (Eds.), Test equating. Princeton, N J : Academic Press. M a , X . (2005). Growth in mathematics achievement during middle and high school: Analysis with classification and regression trees. Journal of Educational Research, 99, 78-86. M a , X . , & M a , L . (2004). Modeling stability of growth between mathematics and science achievement during middle and high school. Evaluation Review, 28, 104-122. M a , X . , & X u , J. (2004). 
Determining the causal ordering between attitude toward mathematics and achievement in mathematics. American Journal of Education, 110, 256-280.
Marascuilo, L. A., & McSweeney, M. (1977). Nonparametric and distribution-free methods for the social sciences. Monterey, CA: Brooks-Cole.
Martineau, J. A. (2006). Distorting value added: The use of longitudinal, vertically scaled student achievement data for growth-based, value-added accountability. Journal of Educational and Behavioral Statistics, 37(1), 25-62.
Meade, A. M., Lautenschlager, G. J., & Hecht, J. E. (2005). Establishing measurement equivalence and invariance in longitudinal data with item response theory. International Journal of Testing, 5(3), 279-300.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105(1), 156-166.
Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. Princeton, NJ: Educational Testing Service.
Muijs, D., & Reynolds, D. (2003). Student background and teacher effects on achievement and attainment in mathematics. Educational Research and Evaluation, 9(1), 21-35.
Muthen, B., & Khoo, S. T. (1998). Longitudinal studies of achievement growth using latent variable modeling. Learning and Individual Differences, 10, 73-101.
Newsom, J. T. (n.d.). Distinguishing between random and fixed: Variables, effects, and coefficients. Retrieved February 17, 2005, from http://www.upa.pdx.edu/IOA/newsom/mlrclass/ho_randfixd.doc
Notenboom, A., & Reitsma, P. (2003). Investigating the dimensions of spelling ability. Educational & Psychological Measurement, 63, 1039-1059.
Petrides, K. V., Chamorro-Premuzic, T., Frederickson, N., & Furnham, A. (2005). Explaining individual differences in scholastic behavior and achievement. British Journal of Educational Psychology, 17, 239-255.
Plewis, I. F. (2000). Evaluating educational interventions using multilevel growth curves: The case of reading recovery. Educational Research and Evaluation, 6, 83-101.
Pommerich, M., & Dorans, N. J. (2004). Linking scores via concordance: Introduction to the special issue. Applied Psychological Measurement, 28(4), 216-218.
Pommerich, M., Hanson, B. A., Harris, D. J., & Sconing, J. A. (2004). Issues in conducting linkages between distinct tests. Applied Psychological Measurement, 28(4), 247-273.
Pomplun, M., Omar, M. D. H., & Custer, M. (2004). A comparison of WINSTEPS and BILOG-MG for vertical scaling with the Rasch model. Educational and Psychological Measurement, 64(4), 600-616.
Rowe, K. J., & Hill, P. W. (1998). Modeling educational effectiveness in classrooms: The use of multi-level structural equations to model students' progress. Educational Research and Evaluation, 4, 307-347.
Schafer, W. D. (2006). Growth scales as an alternative to vertical scales. Practical Assessment, Research & Evaluation, 11(4), 1-6. Retrieved July 14, 2006, from http://pareonline.net/pdf/v11n4.pdf
Schafer, W. D., & Twing, J. S. (2006). Growth scales and pathways. In R. W. Lissitz (Ed.), Longitudinal and value added modeling of student performance. Maple Grove, MN: JAM Press.
Schumacker, R. E. (2005). Test equating. Retrieved November 4, 2006, from www.appliedmeasurementassociates.com/White%20Papers/TEST%20EQUATING.pdf
Schumacker, R. E., & Lomax, R. G. (2004). A Beginner's Guide to Structural Equation Modeling (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Serlin, R. C., Wampold, B.
E . , & Levin, J. R. (2003). Should providers of treatment be regarded as a random factor?: If it ain't broke, don't ' f ix ' it. Psychological Methods, 8, 524-534. Singer, J. D . (1998). Using S A S P R O C M I X E D to fit multilevel models, hierarchical models, and individual growth models. Journal of Educational and Behavioral Statistics, 24, 323-355. Singer, J. D . , & Willett, J. B . (2003). Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. New York: Oxford Press. Sireci, S. G . (1998). The construct of content validity. In B . D . Zumbo (Ed.), Validity Theory and the Methods Used in Validation: Perspectives from the Social and Behavioral Sciences (pp. 83-117). Netherlands: Kluwer Academic Press. 133 SPSS. (2002). Linear mixed-effect modelling in SPSS: An introduction to the Mixed procedure (Publication No . L M E M W P - 1 0 0 2 ) . Chicago, IL: Author. Stroud, T. W . F. (1982). Discussion of "a test of the adequacy of linear score equating models". In P. W . Holland & Donald B . Rubin (Eds.), Test equating. Princeton, N J : Academic Press. Tabachnick, B . G . , & Fidel l , L . S. (1996). Using Multivariate Statistics (3rd ed.). New York: Harper Collins College Publishers. Thomas, D . R., Hughes, E . , & Zumbo, B . D . (1998). On variable importance in linear regression. Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, 45, 253-275. von Davier, A . A . , Holland, P. W. , & Thayer, D . T. (2004). The Kernel Method of Equating. New York: Springer. Willett, J. B . , Singer, J. D . , & Martin, N . C. (1998). The design and analysis of longitudinal studies of development and psychopathology in context: Statistical models and methodological recommendations. Development and Psychopathology, 10, 395-426. Zimmerman, D . W. , & Zumbo, B . D . (1993a). Relative power of parametric and nonparametric statistical methods. In G . Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences, Volume 1: Methodological issues (pp. 481 -517). Hillsdale, N J : Lawrence Erlbaum. Zimmerman, D . W. , & Zumbo, B . D . (1993b). Relative power of the Wilcoxon test, the Friedman test, and repeated-measures A N O V A on ranks. Journal of Experimental Education, 62, 75-86. 134 Zimmerman, D . W. , & Zumbo, B . D . (2005). Can percentiles replace raw scores in statistical analysis of test data? Educational and Psychological Measurement, 65, 616-638. Zumbo, B . D . (1999). The simple difference score as an inherently poor measure of change: Some reality, much mythology. In Bruce Thompson (Ed.). Advances in Social Science Methodology, Volume 5, (pp. 269-304). Greenwich, C T : J A I Press. Zumbo, B . D . (2005). Notes on a new statistical method for modeling change and growth when the measure changes over time: A nonparametric mixed method for two-wave or multi-wave data. Unpublished manuscript, University of British Columbia. Zumbo, B . D . (n.d.). Thinking about robustness in general inferential strategies. Unpublished manuscript, University of British Columbia. Zumbo, B . D . , & Forer, B . (in press). Friedman test. In Ne i l J. Salkind (Ed.), Encyclopedia of Measurement and Statistics. Thousand Oaks, C A : Sage Press. Zumbo, B . D . , & Hubley, A . M . (1998). A note on misconceptions concerning prospective and retrospective power. Journal of the Royal Statistical Society, Series D (The Statistician), 47, 385-388. 
Appendix A: More about the Non-Parametric HLM (Case 1)

This appendix relates to material presented in Chapter 6, particularly the sections related to the multi-wave case involving Dr. Siegel's literacy data. This appendix offers readers a brief description of mixed-effect modelling (or HLM), and provides step-by-step information about performing the said analyses via SPSS' graphical user interface (GUI) and syntax.

A Brief Description of Mixed-Effect Modelling

Mixed-effect modelling is known by a plethora of other names: individual growth modelling, random coefficient modelling, multilevel modelling, and hierarchical linear modelling (HLM). Unlike other repeated measures analyses of change (e.g., the paired samples t-test, repeated measures ANOVA, profile analysis), mixed-effect models can handle complex and 'messy' data, unbalanced designs, time-related covariates, continuous predictors of rates of change, and unequal variances. Furthermore, mixed-effect models allow researchers to explore the effect of dependencies in data observations due to the hierarchical or nested nature of educational and behavioural data.

Mixed-effect modelling has two broad, albeit interrelated, classes: multilevel modelling and individual growth modelling. The primary difference between these classes pertains to the way in which the data are nested or grouped. Within the context of educational research, a common three-level example of multilevel modelling is students (at Level 1 of the hierarchy) clustered within units such as classrooms (Level 2) which, in turn, may be grouped within schools (Level 3). A common two-level individual growth modelling example involves repeated measures (at Level 1) nested within students (Level 2).

It has been argued that, prior to the development of mixed-effect modelling techniques in the early 1980s, there were no appropriate quantitative tools in the proverbial research toolbox to allow for the rigorous investigation of change (Singer & Willett, 2003). The primary criticism of pre-mixed-effect modelling techniques concerns the assumptions about the error term in regular ordinary least squares (OLS) analyses: linearity, normality, homoscedasticity, and independence. According to Bryk and Raudenbush (1992), the latter two assumptions must be modified when the data show dependencies. Otherwise, standard errors are too small, leading to higher-than-appropriate rates of rejecting the null hypothesis and, hence, the attribution of statistical effects where none should exist. Mixed-effect modelling allows for these data dependencies to be taken into account.
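To make the structure of the two-level individual growth model concrete, it can be sketched in conventional growth-modelling notation (the symbols below are standard ones and are not taken from the original text):

Level 1 (within student):    Y_ti = π0i + π1i(WAVE_ti) + e_ti
Level 2 (between students):  π0i = β00 + r0i    and    π1i = β10 + r1i

Here Y_ti is student i's score at wave t; π0i and π1i are that student's intercept (status at wave 0) and rate of change; β00 and β10 are the average intercept and average rate of change across students; and e_ti, r0i, and r1i are the Level 1 and Level 2 residuals. In the non-parametric application described in this appendix, Y_ti is the student's within-wave rank (rankSDRT) rather than the raw score.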
Fixed versus Random Effects

Mixed-effect modelling refers to special classes of regression techniques in which at least one independent variable (or factor) in a statistical model is considered fixed and at least one other independent variable in the same model is considered random. Unfortunately, ascertaining the exact distinction between a fixed and a random factor is, at best, thorny (Newsom, n.d.). There are three possible reasons for this thorniness.

First, there exists an acute dearth of scholarly pieces that actually provide precise definitions of fixed and random factors - a particularly surprising finding given that these concepts characterise the very framework of mixed-effect techniques. It is possible that this dearth lies with the fact that many of the seminal writers in the field - Anthony Bryk, Harvey Goldstein, Stephen Raudenbush, and Doug Willms, to name a handful - tailor their articles to researchers already confident in the language of mixed modelling. This dearth may also relate to Singer's (1998) observation that, were it not for current-day, user-friendly statistical software packages, "few statisticians and even fewer empirical researchers would fit the kinds of sophisticated statistical models being promulgated today" (p. 350). Perhaps, lamentably, the ease with which such software packages may be used has enabled researchers to deem discussion of the fundamentals of mixed modelling as trivial.

Second, in the scarce occurrence that a scholarly piece does, in fact, provide a distinction between fixed and random factors, the definitions tend to be worded vaguely. For example, Serlin, Wampold, and Levin (2003) state that a fixed factor is one that is related to population means, whereas SPSS (2002) defines a fixed factor as any variable that "affects the population mean" (p. 3). Clearly, such vague definitions are of little use to novice mixed modellers.

Third, a given factor can be considered fixed or random, depending on the context of the study (Newsom, n.d.). For the purpose of this primer, definitions of each type of factor have been provided and, at a minimum, provide a useful starting point for thinking about this distinction.

Fixed factors. Recall that an important assumption underlying traditional statistical analyses is that the independent factors in the model are fixed. Newsom (n.d.) defines a fixed factor as one that is:

1. assumed to be measured without error;
2. purported to contain all or most of the values found in the same variable had the findings been generalised to a population; and
3. not necessarily invariant (equal) across subgroups.

Often the fixed component of the model is referred to as Model I (Serlin et al., 2003). Recall that, when dealing with fixed factors, the following assumptions are made about the errors (Zumbo, n.d.):

1. the errors have an average of zero;
2. the errors have the same variance across all individuals;
3. the errors are distributed identically with density function f;
4. the error density, f, is assumed to be distributed symmetrically; and
5. the error density, f, is assumed to be distributed normally.

Random factors. Conversely, Newsom (n.d.) defines a random factor as one that:

1. is assumed to be measured with measurement error (scores are a function of a true score and random error);
2. contains values that come from, and are intended to generalise to, a much larger population of possible values with a defined probability distribution; and
3. contains values that are smaller or narrower in scope than would be found in the same variable pertaining to a population.

A model's random component is often referred to as Model II (Serlin et al., 2003).

In summary, it is useful to conceptualise the distinction between regular OLS-based regression and mixed-effect designs as follows: Traditional statistical approaches involve models that contain fixed independent factors only. Mixed-effect techniques are so-named because they involve models with mixtures of fixed and random components.

Performing the Analysis using the Graphical User Interface (GUI)

Step 1: Entering the data.
When conducting most repeated measures analyses, data are entered into the data matrix (spreadsheet) in person-level format, in which one row represents one individual, with time-related variables represented along the horizontal of the spreadsheet. As Figure 7 shows, variable names grade2raw, grade3raw, grade4raw, grade5raw, and grade6raw identify students' raw scores on each grade-specific version of the SDRT, respectively. As the data matrix shows, the student with casenum 5 is female; she earned a raw score of 35 in Grade 2, 36 in Grade 3, 44 in Grade 4, and so forth.

Figure 7. Entering the data in SPSS (Step 1).

Step 2: Rank transforming the data, within wave. Once the data are entered, select the "rank cases" option under the "transform" menu. Please refer to Figure 8.

Figure 8. Rank transforming the data within wave in SPSS (Step 2a).

Once the "rank cases" dialogue box appears, move the variables one wishes to rank transform into the "variable(s)" portion of the box. Note that, by default, SPSS assigns a rank of one (1) to each variable's smallest score. It is recommended that analysts retain this default setting, as it is logical to think that the lowest test score should also have the lowest rank. Please refer to Figure 9.

Figure 9. Rank transforming the data within wave in SPSS (Step 2b).

Once the five original test score variables have been moved, click "ties" and retain the default setting ("mean"). Doing so will ensure that, in the event that two students share the same raw score within wave (i.e., if their scores are tied), they will both receive a mean rank in that wave. Please refer to Figure 10.

Figure 10.
Rank transforming the data within wave in SPSS (Step 2c).

As Figure 11 illustrates, the original variables' scores have now been rank transformed. The five newly-created rank-based variables (grade2rank-grade6rank) now appear alongside the original raw scores (grade2raw-grade6raw) in the data matrix. Simple frequency analyses of each of the new rank variables confirm that the minimum possible rank is 1 (assigned to the lowest within-wave raw score), and the maximum rank is 653 (assigned to the highest within-wave raw score) because there are 653 participants in the sample. As Figure 11 shows, the student with casenum 5 earned a Grade 2 rank score of 148.5 (a rank she shared with others who earned the same Grade 2 raw score). In other words, her Grade 2 raw score was the 148.5th lowest in that particular wave. Her raw score in Grade 3 earned a rank score of 257.5, meaning that her standing within this sample of test-takers increased very slightly from Grade 2 to Grade 3. In Grade 4, this student's raw score earned a rank of 363, suggesting her SDRT standing improved yet again from Grade 3 to Grade 4 (and so forth).

Figure 11. The new, rank-transformed data matrix (Step 2d).

Step 3: Restructuring the data. Recall that the data were originally entered into the data matrix in person-level format (one row per participant).
When conducting any sort of individual growth modelling analysis, however, it is necessary to have an explicit wave/time variable (that represents the specific wave in which each test score is collected), which is missing if the data are left in person-level format. Therefore, analysts must restructure (transpose) the data into person-period format, in which each participant has multiple records (rows), one for each wave or time point. As such, select the "restructure" option from the "data" menu (footnote 38). Please refer to Figure 12.

Figure 12. Restructuring the data in SPSS (Step 3a).

Footnote 38: One may wonder why, if HLM analyses require person-period data formatting, the data are ever entered in person-level format to begin with. It is necessary to first enter the data in person-level, not person-period, format so that the five original SDRT variables can be properly rank transformed within wave.

Once the "restructure data wizard" dialogue box opens, SPSS asks analysts how they would like the data formatted. In order to change a person-level formatted data file into a person-period format, select the first/default option: "restructure selected variables into cases". In SPSS language, "case" is a synonym for "row". Then click "next". Please refer to Figure 13.

Figure 13. Restructuring the data in SPSS (Step 3b).

Once the "restructure data wizard - step 2 of 7" dialogue box appears, SPSS asks analysts how many variable sets they would like to transpose. In this case, there is only one set of variables (the "set" being the five SDRT variables). As such, analysts may select the first/default option and then click "next". Please refer to Figure 14.
Figure 14. Restructuring the data in SPSS (Step 3c).

Once the "restructure data wizard - step 3 of 7" dialogue box appears, select "use selected variable" from the "case group identification" drop-down menu. Move the participant identification number (variable name = casenum) into the "variable" box. Type over "trans1" a nickname for the five SDRT-related variables (e.g., rankSDRT). Finally, move the five rank SDRT variables into the "target variable" area. Click "next". Please refer to Figure 15.

Figure 15. Restructuring the data in SPSS (Step 3d).

The "restructure data wizard - step 4 of 7" dialogue box requires that analysts identify the number of chosen index variables - variables that SPSS uses to create the new columns. In this case, it is necessary to format the data according to one particular variable: participants' casenum. As such, select the first/default option, "one", and click "next". Please refer to Figure 16.

Figure 16. Restructuring the data in SPSS (Step 3e).

In the "restructure data wizard - step 5 of 7" dialogue box, analysts may choose the name of the soon-to-be-created time/wave variable in the newly-formatted data file. Type wave into the "name" box, and click "next". Please refer to Figure 17.

Figure 17. Restructuring the data in SPSS (Step 3f).

Finally, in the "restructure data wizard - step 6 of 7" dialogue box, select the default settings and click "finish". The result is a person-period formatted data file, as illustrated in Figure 18.
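Schematically, the restructured (person-period) file contains one row per student per wave. For the student with casenum 5, for example, the layout takes the following form (the entries in parentheses are placeholders rather than values taken from the file; only the variable names casenum, wave, and rankSDRT come from the data set itself):

casenum   wave   rankSDRT
   5        0    (Grade 2 rank)
   5        1    (Grade 3 rank)
   5        2    (Grade 4 rank)
   5        3    (Grade 5 rank)
   5        4    (Grade 6 rank)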
Note that each participant is allotted five rows in the spreadsheet - one for each wave of the study (footnote 39). Also, notice that there is now a new variable, rankSDRT (named by the analyst in a previous command), that represents each participant's wave-specific rank score. There is also now an explicit time variable, wave, which has been coded as follows: Wave 0 = Grade 2, Wave 1 = Grade 3, Wave 2 = Grade 4, Wave 3 = Grade 5, and Wave 4 = Grade 6.

Footnote 39: By default, SPSS codes the wave variable 1-5. Intercepts are calculated as if wave = 0, so the intercept would actually be calculated for a time that existed prior to the actual first wave of data collection. As such, for easier interpretation of the intercept term, the wave variable's values were recoded from 1-5 to 0-4, respectively, prior to running the non-parametric HLM.

Figure 18. The new, restructured data matrix (Step 3g).

Step 4: Performing the HLM analysis. For brevity, the specific graphical user interface (GUI) steps required in order to perform the HLM analysis are not included in this appendix. In the next section, however, the HLM syntax is provided.

Performing the Analysis using Syntax

In this section of the appendix, the SPSS syntax required to perform the steps detailed in the previous section is presented. For the ease and convenience of the reader, the syntax has been typed in a distinct font.

Step 2: Rank transforming the data, within wave. In order to rank transform the data, within wave, use the following syntax:

RANK VARIABLES=Grade2raw Grade3raw Grade4raw Grade5raw Grade6raw (A)
  /RANK
  /PRINT=YES
  /TIES=MEAN .
RENAME VARIABLES
  (Rgrade2r = grade2rank)
  (Rgrade3r = grade3rank)
  (Rgrade4r = grade4rank)
  (Rgrade5r = grade5rank)
  (Rgrade6r = grade6rank).
EXECUTE.

Step 3: Restructuring the data. In order to restructure the data from person-level to person-period format, use the following syntax:

VARSTOCASES
  /MAKE rankSDRT FROM Grade2rank Grade3rank Grade4rank Grade5rank Grade6rank
  /INDEX = wave(5)
  /KEEP = casenum gender Grade2raw Grade3raw Grade4raw Grade5raw Grade6raw
  /NULL = KEEP.

Step 4: Performing the HLM analysis. In order to perform the non-parametric HLM, use the following syntax.

* Unconditional Model: No Level 2 (student) predictor variable. Just repeated
  measures at Level 1 nested within students.
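* (The following annotation has been added for clarity and is not part of the
  original syntax) .
* In the command below, rankSDRT is the outcome and wave is declared after WITH
  as a continuous covariate predicting the within-wave ranks .
* The /FIXED subcommand estimates the average intercept and the average wave slope .
* The /RANDOM subcommand gives each student - SUBJECT (casenum) - a random intercept
  and a random wave slope, with an unstructured covariance matrix, COVTYPE(UN) .
* /METHOD = REML requests restricted maximum likelihood estimation .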
MIXED rankSDRT WITH wave
  /METHOD = REML
  /PRINT = SOLUTION TESTCOV R
  /FIXED = wave
  /RANDOM = INTERCEPT wave | SUBJECT (casenum) COVTYPE(UN).

* Conditional Model: Gender added as a Level 2 predictor variable.

MIXED rankSDRT WITH wave gender
  /METHOD = REML
  /PRINT = SOLUTION TESTCOV R
  /FIXED = wave gender wave*gender
  /RANDOM = INTERCEPT wave | SUBJECT (casenum) COVTYPE(UN).

Appendix B: More about the Non-Parametric Difference Score (Case 2)

This appendix relates to material presented in Chapter 6, particularly the sections related to the two-wave case involving Foundation Skills Assessment data. This appendix provides readers with specific instructions about performing the said analyses via SPSS' graphical user interface (GUI) and syntax. In addition, this appendix provides a description of how the decision between using the simple difference score or the residualised change score in analyses is made. Please refer to Chapter 6 for a definition of these terms.

Performing the Analysis using the Graphical User Interface (GUI)

Step 1: Entering the data. As Figure 19 illustrates, variable names grade4scale and grade7scale identify students' standardised (scaled) scores on each grade-specific version of the FSA, respectively. The student with casenum 9 is female; she earned a standardised score of -1.18 in Grade 4 and -1.20 in Grade 7.

Figure 19. Entering the data in SPSS (Step 1).

Step 2: Rank transforming the data, within wave. Once the data are entered, select the "rank cases" option under the "transform" menu. Please refer to Figure 20.

Figure 20. Rank transforming the data within wave in SPSS (Step 2a).

Once the "rank cases" dialogue box appears, move the variables one wishes to rank transform into the "variable(s)" portion of the box. Note that, by default, SPSS assigns a rank of one (1) to each variable's smallest score. It is recommended that analysts retain this default setting, as it is logical to think that the lowest test score should also have the lowest rank. Please refer to Figure 21.

Figure 21. Rank transforming the data within wave in SPSS (Step 2b).

As with Case 1, once the two original test score variables have been moved, click "ties" and retain the default setting ("mean"). Doing so will ensure that, in the event that two students share the same score within wave (i.e., if their scores are tied), they will both receive a mean rank in that wave.
Please refer to Figure 22.

Figure 22. Rank transforming the data within wave in SPSS (Step 2c).

As Figure 23 illustrates, the two original variables' scores have now been rank transformed within wave. The two newly-created rank-based variables (grade4rank and grade7rank) now appear alongside the original scores (grade4scale and grade7scale) in the data matrix. Simple frequency analyses of each of the new rank variables confirm that the minimum possible rank is 1 (assigned to the lowest within-wave score), and the maximum rank is 4097 (assigned to the highest within-wave score) because there are 4097 participants in the convenience sample. As the data matrix shows, the student with casenum 9 earned a rank score of 576 for Grade 4. In other words, her Grade 4 standardised score was the 576th lowest in that particular wave. Her score in Grade 7 earned a rank of 268, meaning that her standing relative to other test takers in the sample decreased from Grade 4 to Grade 7.

Figure 23. The new, rank-transformed data matrix (Step 2d).

Step 3: Restructuring the data. Recall that the data were originally entered into the data matrix in person-level format (one row per participant). Also recall that, when conducting any sort of individual growth modelling, one requires an explicit wave/time variable. Unlike the previous (multi-wave) case, the interest in the current study lies with each student's individual index of change score, not with the two observed scores themselves. As such, there is no need to restructure the data from person-level to person-period format.

Step 4: Determining the appropriate index of change.
In order to determine which specific index of change serves as the dependent variable in this particular case, it is necessary to follow the guidelines of Zumbo (1999), who writes that "one should utilize the simple difference score instead of the residualized difference if and only if ρ(X1,X2) > σX1/σX2" (p. 293). To this end, it is necessary to calculate the Pearson correlation between the Grade 4 and 7 rank scores [ρ(X1,X2)] and the ratio of the standard deviations of the respective rank scores (σX1/σX2). It is important to stress that, when implementing the Conover solution for two-wave data, one's decision between the simple difference score and the residualised change score must be based on students' rank scores, not their observed scores.

In order to calculate the Pearson correlation, select the "correlate" option under the "analyze" menu. Select "bivariate" as the correlation type. Please refer to Figure 24.

Figure 24. Computing a correlation matrix (Step 4a).

Next, move the two rank score variables (grade4rank and grade7rank) into the "variables" box. Click "ok". Please refer to Figure 25.

Figure 25. Computing a correlation matrix (Step 4b).

As Figure 26 illustrates, the resultant SPSS output shows that the correlation between students' Grade 4 and Grade 7 rank scores [ρ(X1,X2)] is equal to 0.669.

Correlations
                                    grade4rank    grade7rank
grade4rank   Pearson Correlation        1            .669**
             Sig. (2-tailed)                          .000
             N                         4097          4097
grade7rank   Pearson Correlation      .669**            1
             Sig. (2-tailed)           .000
             N                         4097          4097
** Correlation is significant at the 0.01 level (2-tailed).

Figure 26. The resultant correlation output (Step 4c).

The next step is to compute the ratio of the standard deviations of the respective rank scores (σX1/σX2). To this end, select "descriptive statistics" from the "analyze" menu. Then choose "descriptives". Please refer to Figure 27.

Figure 27. Computing the ratio of standard deviations (Step 4d).

Then move the two rank variables into the "variables" box. Then click on "options". Please refer to Figure 28.

Figure 28.
Computing the ratio of standard deviations (Step 4e).

In the "options" dialogue box, check the "std. deviation" box. Then click "continue". When the "options" box closes, click "ok". Please refer to Figure 29.

Figure 29. Computing the ratio of standard deviations (Step 4f).

The resultant SPSS output shows that the standard deviations of the Grade 4 and Grade 7 rank scores are 1182.843 and 1182.846, respectively. Please refer to Figure 30.

Descriptive Statistics
                       N      Std. Deviation
grade4rank           4097        1182.843
grade7rank           4097        1182.846
Valid N (listwise)   4097

Figure 30. The resultant descriptive statistics output (Step 4g).

Because the correlation between the Grade 4 and 7 rank scores [ρ(X1,X2) = 0.669] is less than the ratio of the two standard deviations (1182.843/1182.846 = 0.999), it is necessary to use the residualised change score, rather than the simple difference score, in this case study (Zumbo, 1999).

Step 5: Computing the residualised change score. To compute the residualised change score, the dependent variable of the case analysis, begin by selecting "regression" under the "analyze" menu. Then choose "linear". Please refer to Figure 31.

Figure 31. Computing the residualised change score (Step 5a).

Next, regress grade7rank (the Wave 2 rank score) onto grade4rank (the Wave 1 rank score) as follows. Then click "save". Please refer to Figure 32.

Figure 32. Computing the residualised change score (Step 5b).

Next, check the "unstandardized" box under "predicted values". Click "continue". Once the "save" dialogue box closes, click "ok". Please refer to Figure 33.

Figure 33. Computing the residualised change score (Step 5c).

As a result of the regression, the data matrix now contains a new variable: pre_1. This variable represents the Wave 2 (Grade 7) rank predicted from the Wave 1 (Grade 4) rank score. To finalise the computation of the residualised change score, select "compute" from the "transform" menu. Please refer to Figure 34.

Figure 34. Computing the residualised change score (Step 5d).

In the "target variable" area, type a name for the new residualised change variable (e.g., residualised_chg_score). In the "numeric expression" area, subtract pre_1 from grade7rank. Click "ok". Please refer to Figure 35.

Figure 35. Computing the residualised change score (Step 5e).

As Figure 36 illustrates, the data matrix now contains a new variable, residualised_chg_score, which represents the Wave 2 rank score (grade7rank) less the Wave 2 rank score predicted from the Wave 1 rank score (pre_1).

Figure 36. The newly-created residualised change score (Step 5f).

Step 6: Performing the Independent Samples t-Test. Now that the residualised change score has been computed, one can perform the requisite statistical analyses - in this case, an independent samples t-test. Under "analyze", select "compare means". Then select "independent-samples T test" (sic). Please refer to Figure 37.

Figure 37. Conducting the independent samples t-test (Step 6a).

Move the newly-created residualised_chg_score variable into the "test variable" area of the dialogue box. Move the gender variable into the "grouping variable" box. Then, click on "define groups", and type "M" (for male) and "F" (for female) into the "define groups" dialogue box. Then click "continue". When the "define groups" dialogue box closes, click "ok". Please refer to Figure 38.

Figure 38. Conducting the independent samples t-test (Step 6b).

Performing the Analysis using Syntax

In this section of the appendix, the SPSS syntax required to perform the steps detailed in the previous section is presented. For the ease and convenience of the reader, the syntax has been typed in a distinct font.

Step 2: Rank transforming the data, within wave. In order to rank transform the data, within wave, use the following syntax:

RANK VARIABLES=grade4scale grade7scale (A)
  /RANK
  /PRINT=YES
  /TIES=MEAN .
RENAME VARIABLES
  (Rgrade4s = grade4rank)
  (Rgrade7s = grade7rank).
EXECUTE.

Step 4: Determining the appropriate index of change. In order to compute the necessary correlation and standard deviation values, use the following syntax:

CORRELATIONS
  /VARIABLES=grade4rank grade7rank
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE .

DESCRIPTIVES VARIABLES=grade4rank grade7rank
  /STATISTICS=MEAN STDDEV MIN MAX .
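Before the Step 5 syntax below, it may help to write out the two candidate indices of change explicitly (a sketch in conventional notation; the symbols D, b0, and b1 are not taken from the original text). With X1 and X2 denoting a student's within-wave rank scores in Grade 4 (grade4rank) and Grade 7 (grade7rank):

Simple difference score:       D = X2 - X1
Residualised change score:     X2 - (b0 + b1*X1)

where b0 and b1 are the intercept and slope from regressing the Grade 7 ranks on the Grade 4 ranks. Zumbo's (1999) rule quoted earlier - use the simple difference only when ρ(X1,X2) exceeds σX1/σX2 - favoured the residualised change score here, because 0.669 < 0.999; the Step 5 syntax below therefore saves the regression's predicted values and subtracts them from grade7rank.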
Step 5: Computing the residualised change score. In order to compute the residualised change score, use the following syntax:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT grade7rank
  /METHOD=ENTER grade4rank
  /SAVE PRED .
COMPUTE residualised_chg_score = grade7rank - PRE_1.
EXECUTE .

Step 6: Performing the Independent Samples t-Test. In order to compute the independent samples t-test, use the following syntax:

T-TEST GROUPS = Gender ('M' 'F')
  /MISSING = ANALYSIS
  /VARIABLES = residualised_chg_score
  /CRITERIA = CI(.95) .
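As a closing annotation (added here for clarity; it is not part of the original syntax), the Step 5 and Step 6 commands above can be glossed with comments such as the following:

* The REGRESSION command regresses the Grade 7 ranks (grade7rank) on the Grade 4
  ranks (grade4rank), and /SAVE PRED stores the predicted Grade 7 ranks in a new
  variable that SPSS names PRE_1 by default .
* The COMPUTE command then forms residualised_chg_score as grade7rank minus PRE_1 .
* The T-TEST command compares the mean residualised change score of males and
  females, with the 95 percent confidence interval requested via /CRITERIA = CI(.95) .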

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0054569/manifest

Comment

Related Items