GROUPING VARIABLES IN BIAS AND IMPACT DETECTION: AN ATTRIBUTIONAL STANCE FOR OBSERVATIONAL STUDIES

by

Danjie Zou

M.A., The University of British Columbia, 2017

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Measurement, Evaluation, and Research Methodology)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

September 2017

© Danjie Zou, 2017

Abstract

Group disparity in a desired outcome (e.g., academic achievement) has long been a research emphasis. A group comparison is fair and meaningful only if the measured scores are comparable and free from bias. Measurement bias is closely connected to test fairness and has been investigated for decades. In the psychometric literature, the concepts of DIF, bias, and impact are interconnected but distinct from one another. Wu et al.'s (2017) attributional framework, based on group comparison, explicates the differences among them and provides analytical procedures to detect bias and impact. Their attributional view makes the definition and interpretation of bias and impact sensible and meaningful even when groups are non-experimentally formed.

Building on Wu et al.'s (2017) framework, the purpose of this thesis is to address the issues arising from group comparisons intended to test a causal hypothesis in observational studies. Specifically, this thesis focuses on the concerns with interpreting and communicating bias and impact detection based on comparisons between "pre-existing" groups (i.e., groups that differ in socio-demographic characteristics existing before testing or research). To this end, Chapter 1 provides background knowledge on DIF, bias, and impact, including their relationships with test fairness, measurement invariance, and validity. The terminology of DIF, bias, and impact, and the manners in which the three are connected with and distinct from one another, are discussed. The definitions of bias and impact refined by Wu et al. are then reviewed.

Chapter 2 distinguishes various types of grouping variables and proposes a typology for grouping variables resulting from different research designs in social and behavioral research. Based on the typology, difficulties in interpreting the results of applying Rubin's causal model (Rosenbaum & Rubin, 1983) are discussed. Chapter 3 presents the terminology and techniques relevant to Wu et al.'s analytical procedures for studying bias and impact, including the propensity score, statistical matching, and conditional logistic regression. In Chapter 4, two data examples illustrate how the adapted conceptual, methodological, and technical rhetoric helps to clarify long-standing confusions in understanding and communicating bias and impact. Chapter 5 summarizes the contributions and limitations of this thesis and outlines future work.

Lay Summary

Humans are inclined to make causal inquiries based on group comparisons; for example, researchers may want to find out whether gender is the reason for a difference in test performance. Pre-existing differences between the groups (e.g., parental education), however, can confound the group comparison and render the attribution difficult. In observational studies, researchers are usually unable to manipulate, change, or select participants' group membership. This limits the validity of attributing an observed difference to the groups.
To arrive at a reasonable attribution based on group comparison, this thesis distinguishes four types of group composition formed in research. It then adapts the experimental terminology (e.g., for propensity score matching) to be more appropriate for an attributional claim in observational studies. The adapted terminology was piloted on two real data examples, in an investigation of measurement bias and true group disparity as in Wu et al. (2017).

Preface

This thesis is a continuation of the work by Wu, Liu, Stone, Zou, and Zumbo (2017), of which I was a co-author; the article was published in the journal Frontiers in Education: Assessment, Testing and Applied Measurement. Building on Wu et al.'s (2017) framework and analytical procedures for bias and impact, my thesis supervisor, Dr. Wu, and I formulated the original research purpose and the structure of this thesis. I was solely responsible for the search, review, and synthesis of the literature. I also took full charge of the data analysis, with clear guidance from Dr. Wu. I reported and interpreted the results and wrote them into draft documents. Dr. Wu critically reviewed and revised my report of the results and findings before I presented them in this thesis.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgments
Dedication
Chapter 1: Background for Bias and Impact
  1.1 Background Knowledge for This Thesis
    1.1.1 Test Fairness, Validity, Measurement Invariance and Test Bias
    1.1.2 The Terminology of DIF, Bias, and Impact
  1.2 Refined Definition of Bias and Impact
Chapter 2: Types of Groups in Causal Inquiries – Implications for Bias and Impact Research
  2.1 Post-Positivism Approach in Social Science
  2.2 The Mismatch between Inquiry and Methodology in the Old Ways of Studying Bias and Impact
  2.3 Four Types of Groups
  2.4 Rubin's Causal Model for Observational Studies
  2.5 The Attributional Stance for Bias and Impact Detection
Chapter 3: Analytical Procedures and Relevant Statistical Techniques for Bias and Impact
  3.1 Techniques for Propensity Score and Matching
  3.2 Methods for Bias Detection
  3.3 Method for Impact Investigation
Chapter 4: Demonstration of Two Types of Groups in Observational Design for Studying Bias and Impact
  4.1 Data Sources
  4.2 Data Analyses and Computing Tools
  4.3 Descriptive Statistics
  4.4 Propensity Score Estimation and Matching
  4.5 Bias Detection
  4.6 Impact Detection
Chapter 5: Contributions and Discussions
  5.1 Contributions
  5.2 Limitations and Future Work
  5.3 Closing Remarks
Bibliography
Appendices
  Appendix A. Item Information for Booklet One of Eighth Grade Science Achievement Test (TIMSS, 2011)
  Appendix B. Examples of Released Items (Eighth Grade Science, TIMSS, 2011)
    B.1 Item 25 (Water Wheel: Energy of Tank Water)
    B.2 Item 29 (Evidence Continents Were Joined)
  Appendix C. Selected Covariates for Propensity Score Estimation (TIMSS, 2011)
  Appendix D. Results of Optimal Pair Matching (1to1) on Test Language
    D.1 Jitter Plot
    D.2 Histogram
    D.3 Percentage of Bias Reduction
  Appendix E. Results of Optimal Full Matching (1toM) on Test Language
    E.1 Jitter Plot
    E.2 Histogram
    E.3 Percentage of Bias Reduction
  Appendix F. Results of Optimal Pair Matching (1to1) on Gender
    F.1 Jitter Plot
    F.2 Histogram
    F.3 Percentage of Bias Reduction
  Appendix G. Results of Optimal Full Matching (1toM) on Gender
    G.1 Jitter Plot
    G.2 Histogram
    G.3 Percentage of Bias Reduction
  Appendix H. Results of Optimal Full Matching (1MM1) on Gender
    H.1 Jitter Plot
    H.2 Histogram
    H.3 Percentage of Bias Reduction
  Appendix I. PBR Comparison of Three Matching Methods
  Appendix J. R Codes for Analyses

List of Tables

Table 1. Group Difference in Item Correct Proportion by Test Language
Table 2. Group Difference in Item Correct Proportion by Gender
Table 3. Percentage of Bias Reduction Before and After Optimal Full Matching (1MM1) on Test Language Groups
Table 4. Results of Bias Detection in Comparing the Test Language Groups
Table 5. Results of Bias Detection in Comparing the Gender Groups
Table 6. Results of Impact Detection between Test Language Groups
Table 7. Results of Impact Detection between Gender Groups
Table 8. Summary of Bias and Impact Results for Test Language Comparison
Table 9. Summary of Bias and Impact Results for Gender Comparison

List of Figures

Figure 1. Analytical Procedures of Detecting Bias and Impact
Figure 2. Jitter Plot for Optimal Full Matching (1MM1) on Test Language
Figure 3. Comparison of Propensity Score Distributions before and after Optimal Full Matching (1MM1) between Test Language Groups

List of Abbreviations

1MM1    One-to-Multiple and Multiple-to-One
1to1    One-to-One
1toM    One-to-Multiple
A1      Manipulable and Randomly Assigned Grouping Variable
A2      Manipulable but not Randomly Assigned Grouping Variable
ANOVA   Analysis of Variance
ANCOVA  Analysis of Covariance
B1      Non-Manipulable but Changeable/Selectable Grouping Variable
B2      Neither Manipulable nor Changeable/Selectable Grouping Variable
DIF     Differential Item Functioning
Group F Focal Group
Group R Reference Group
IRT     Item Response Theory
IV      Independent Variable
LRT     Likelihood Ratio Test
MI      Measurement Invariance
PBR     Percentage of Bias Reduction

Acknowledgments

I have many people to thank for their help during my studies as an M.A. student and the completion of this thesis.

I owe my particular thanks to my supervisor, Dr. Amery Wu. I received countless help and encouragement from her during my two years of study and throughout the whole process of this thesis. Without her comprehensive help, I could not have completed my Master's studies in two years while also taking care of my family. I also thank her for her innovative thoughts and for the effort and time she spent reviewing my thesis drafts and other projects.

In addition, I offer my enduring gratitude to Professor Bruno Zumbo. Without his help, I would not have had the chance to study at UBC as a MERM student. He gave me invaluable comments and suggestions on this thesis as a member of my thesis defense committee. Moreover, I learned a lot and was inspired by his Multi-Level Linear Model and Advanced courses.

I thank Dr. Yan Liu for her helpful suggestions and direct help with this thesis. I learned a great deal of knowledge and skills from her course and research project, which helped me tackle many technical difficulties I encountered in data analysis. I also thank Professors Anita Hubley, Kadriye Ercikan, and Sandra Mathison for their illuminating instruction in the courses I took with them.

I thank Dr. Jennifer Baumbusch and Dr. Jennifer Lloyd for the opportunity to be involved in exciting research projects, and for their support and encouragement of my studies.

In fact, I have a long list of people to thank: Graduate Program Assistant Karen Yan, the English teachers in the ESL program, other professors in the Department, my classmates, and so on. They helped me and my family in our first two years living in Vancouver. Without their kind help, I would have experienced extra difficulties and might not have even made it through. I appreciate all the help I received from all these kind people.

Finally, I would like to thank my wife Sue and my daughter Rina for their support, understanding, and companionship.

Dedication

This thesis is dedicated to my parents, Fuji Zou and Meilan Zou.
Chapter 1: Background for Bias and Impact

1.1 Background Knowledge for This Thesis

1.1.1 Test Fairness, Validity, Measurement Invariance and Test Bias

Testing is prevalent and important in modern industrialized societies. Almost everyone is influenced by tests, directly or indirectly, for educational, psychological, vocational, and/or medical reasons (Furr & Bacharach, 2014). What is more, tests are often used in making high-stakes decisions, such as college admission, employee recruitment, disease diagnosis, or legal litigation. When doing so, it is assumed that the test results are fair and trustworthy for all sub-populations because the test scores are comparable among sub-groups (Osterlind & Everson, 2009).

Test fairness is essential and has far-reaching implications for social and political equity. The Civil Rights Act of 1964 in the U.S. marked the advent of the era of civil rights, in which equality and equal opportunity were advocated (Osterlind & Everson, 2009). This social trend influenced testing practices and inspired test developers and researchers to address the issue of measurement equivalence across groups, which, in part, led to the appearance and prosperity of test bias studies since the 1960s. This history shows that bias study is closely related to the requirement of test fairness.

The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014; referred to as the Standards hereafter) describes test fairness as "a fundamental issue in protecting test takers and test users in all aspects in testing" (p. 49). However, the meaning of test fairness is not agreed upon; as the Standards points out, "fairness has no single technical meaning and is used in many different ways in public discourse" (p. 49). Generally, a fair test "reflects the same construct(s) for all test takers, and scores from it have the same meaning for all individuals in the intended population" (p. 50). On the other hand, "a fair test does not advantage or disadvantage some individuals because of characteristics irrelevant to the intended construct" (p. 50). The Standards enumerates and discusses the meanings of fairness in testing in four aspects: (1) fairness in treatment during the testing process, (2) fairness as lack of measurement bias, (3) fairness in access to the construct(s) as measured, and (4) fairness as validity of individual test score interpretations for the intended use (pp. 50-54). From these descriptions, it is clear that inspecting measurement bias is one aspect of, or approach to, ensuring fairness in testing, and it is one of the major foci of this thesis. To ensure fairness in comparing groups based on test scores, measurement bias should be considered and minimized as much as possible.

Fairness is a generic concept that is extensively discussed in disciplines such as sociology, philosophy, and law. In testing, Furr and Bacharach (2014) contended that fairness relates to the appropriate application of tests, rather than being a characteristic of a test per se. In addition, test fairness may be evaluated for both individuals and groups (Camilli, 2013). In contrast, measurement bias (i.e., test bias) is a psychometric term: it is detectable via specific statistical methods, is often evaluated between groups, and its meaning is narrower. Even so, the term bias has been used to denote different ideas in the psychometric literature, too. These communication confusions will be discussed in Section 1.1.2.
Specifically, this thesis addresses only the issues of measurement bias associated with group comparison on test item scores.

Measurement bias is also closely connected to validity. Validity is viewed as the most fundamental consideration in developing tests and evaluating measurement. As the Standards (2014, p. 11) defines, "Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests." From the perspective of validity, the detection of test bias is a process of discovering evidence against the validity of test use and score interpretation. For example, the content of a test item may favor native English speakers and disadvantage English language learners. If test scores are free from bias, this can be regarded as a piece of evidence for validity. The Standards (2014) points out that the validity of individual test score interpretation for the intended use should be a concern when it is generalized to subgroups, such as groups with disabilities, English language learners, or different ethnic groups, when the test is employed outside the scope of its intended use. The Standards (2014) also points out that the characteristics of the test takers (i.e., group membership) and their interaction with the testing context should be taken into account when drawing inferences from test scores. These points highlight that the validity of test score use and interpretation is tied to group selection and group characteristics when bias is evaluated as measurement inequality between groups. The issues of group selection and group characteristics are the main foci of this thesis and will be discussed in Sections 2.2 and 2.3.

Another related term, measurement invariance (MI, or measurement equivalence), should be briefly described (see Byrne, Shavelson, & Muthén, 1989; Van de Schoot, Lugtig, & Hox, 2012). The definition of MI varies somewhat depending on the kinds of groups being compared and the statistical method being used. For example, Vandenberg and Lance (2000) summarized four contexts in which MI is applied, encompassing cross-cultural comparisons, raters' evaluations, common demographic groups (e.g., gender, ethnicity), and time-series comparisons. Measurement equivalence is often regarded as the antonym of measurement bias. This understanding is not necessarily correct, because lack of measurement invariance is only one instantiation of measurement bias. One is not necessarily the opposite of the other.

1.1.2 The Terminology of DIF, Bias, and Impact

DIF and impact are two terms that often appear in discussions of measurement bias. The following revisits the concepts and definitions of DIF, bias, and impact.

DIF

DIF is the best defined and least confusing of the three terms, and there is an overall consensus on its definition among researchers. The Standards defines it as follows: "DIF is said to occur when equally able test takers differ in their probabilities of answering a test item correctly as a function of group membership" (p. 51). Three points should be stressed in this definition. First, a DIF study undertakes an item-score comparison between groups, such as females vs. males or native English speakers vs. English language learners. Second, the compared test takers should be equal in the ability that the test intends to measure. This means the groups should be matched on the tested ability before conducting the group comparison.
For example, to study gender DIF on items in a Grade-4 mathematics test, it is necessary to match the boys and girls on their mathematical ability; otherwise, what is found is not DIF according to the definition. Finally, DIF is considered present if the matched groups have different average probabilities of answering a question correctly. In contrast, if the matched groups answer a question correctly with equal probability, then DIF is not flagged.

Historically, the term DIF was introduced by Holland and Thayer (1988) as a neutral terminology to distinguish it from the term "item bias," which could automatically be regarded as having a negative connotation. Although the term DIF is no longer automatically considered item bias, a positive DIF result is often directly regarded as bias even to this day (e.g., Marotta, Tramonte, & Willms, 2015). Moreover, the word "bias" is used in different ways in the psychometric literature; its meaning, and how it relates to DIF, can be ambiguous and confusing.

Three analytical points are worth noting for DIF. First, it is essential to distinguish DIF from the simple average score difference between the groups (Osterlind & Everson, 2009). DIF should be investigated between groups who are equal in the ability that the test intends to measure, whereas the calculation of the difference in average scores does not have this requirement. Therefore, a simple mean difference between groups does not represent the existence of DIF. Second, DIF is supposed to be systematic, which means all or most members of the same subgroup should respond to the test item in a similar and predictable manner (Osterlind & Everson, 2009). In this sense, DIF is a phenomenon studied at the level of groups, not individuals; DIF is a group comparison, not a differentiation among individuals. In a typical DIF study, the subgroup of research interest is called the focal group (i.e., group F), and the subgroup to be compared with it is the reference group (i.e., group R) (Holland & Thayer, 1988). Note that Wu et al. (2017) adopted this naming of groups for bias and impact investigation because, as they pointed out, these are all between-group comparisons and always involve a pair of focal and reference groups. Hence, the selection of groups and an understanding of the group composition are fundamental and substantive aspects of understanding DIF, bias, and impact. Third, as the Standards (2014) states, a flagged DIF does not necessarily indicate bias in a test question. If DIF is identified, it only shows that there is a statistically significant difference in the probabilities of choosing the correct answer between the focal group and the reference group. More substantive evidence or additional analyses are needed to support a conclusion of bias in comparing the groups based on a positive DIF result. For this reason, in practice, experts are often involved to review DIF items with the goal of detecting problematic item content that might have caused DIF. However, it is not uncommon that experts cannot identify any biased content that is responsible for DIF. In fact, DIF might be due to factors other than item content, such as group differences in thinking processes, socio-cultural attitudes toward the tested subject, differences in test dimensionality, and so on.

As for methods for detecting DIF, various statistical tools are available.
Zumbo (2007) summarized that a variety of procedures have been developed to model differential item responses, including methods via contingency tables and/or regression models, item response theory (IRT) models, and multidimensional models.

Bias

The term bias appeared in the psychometric literature earlier than DIF; studies of test bias or item bias were first undertaken in the 1960s (Holland & Wainer, 1993). As has been mentioned, test bias has multiple meanings and can be investigated via different approaches. For example, Zumbo (1999) argued that one can take a judgmental or a statistical approach to investigating measurement bias. The judgmental approach relies mainly on experts' opinion-based review of item content, including themes/topics, wording, response format, and so on (Standards, 2014), while the statistical approach relies largely on statistical techniques. Moreover, one can study test bias by examining the internal relationships of test scores (items with one another, or an item with the test) or the external relationships (test scores' relationships with other criterion variables) (e.g., Zumbo, 1999; Embretson & Reise, 2000; Furr & Bacharach, 2014). Measurement bias can also be studied at the item level, the domain/subscale level, or the test level. The word "bias" can refer to different scenarios when studied and communicated from different perspectives, for different purposes and uses. It is no wonder that misunderstanding and miscommunication often occur if the meaning of bias is not clarified in advance.

Bias is not a technical term, although it can be investigated in a technical manner. By nature, the examination of test bias or item bias can be considered a process of validation. The sources of bias may reside internal or external to the test. As stressed by the Standards (2014), validity is not a characteristic of the measurement itself. Similarly, bias is not an inherent attribute of a test or item per se. Rather, bias is a property of test/item use or test/item score interpretation. In this sense, the habitual use of the terms "test" bias and "item" bias can be misleading. For this reason, Wu et al. (2017) contended that the word "item" should be removed from the traditional terminology of "differential item functioning," "item bias," and "item impact." Following their suggestion, the words "item" and "test" are avoided hereafter, where possible, when referring to DIF, bias, and impact. Note, however, that DIF will not be referred to as DF (dropping the letter I), because the acronym DIF is already accustomed vocabulary in the psychometric literature. This thesis uses the term simply to mean the technical occurrence of different probabilities of answering an item correctly between equally capable groups.

As to the relationship between DIF and bias, some authors have contended that DIF is a necessary, but not a sufficient, condition for bias (e.g., Clauser & Mazor, 1998; Zumbo, 1999). Namely, the presence of DIF does not necessarily mean the presence of bias; on the other hand, if bias is identified, then there must be DIF. Zieky (2003) likewise argued that "No statistic can determine whether or not a test question is biased. DIF helps us spot questions that may be unfair, but DIF is NOT synonymous with bias." (p. 3). That is to say, DIF is only a statistical result; groups' differences in responding to the test items, as flagged by DIF, could further be judged as either biased or unbiased.
Zieky further articulated that a perfectly fair question may show DIF if it happens to measure a skill that is not well represented in the test as a whole, since tests usually measure more than one aspect of knowledge or skill. The judgment of bias should also be reviewed in light of test purpose and utility, which is closely connected to the issue of validity. It is possible that the group comparison of a test question is flagged as biased for one purpose but unbiased for others.

Although nowadays most researchers agree that DIF is not a sufficient condition for bias, Wu et al. (2017) refuted the previous belief that DIF is a necessary condition for bias. They explained and demonstrated that bias may exist even when DIF is not identified. Wu et al. referred to this phenomenon as hidden bias. In sum, DIF is neither a necessary nor a sufficient condition for bias. Besides the relationship between DIF and bias, it is also important to distinguish both bias and DIF from potentially offensive item content (Clauser & Mazor, 1998). Zumbo (1999) pointed out that inappropriate content does not necessarily produce DIF or bias, because all groups may be equally affected by the offensive language.

With regard to methods for detecting bias, Wu et al. (2017) gave a brief history. In the 1960s, regression was employed to detect bias by considering whether the predictive validity of a test (with an external criterion) differed among groups of interest. In the late 1970s and early 1980s, bias studies developed to compare the internal characteristics of a test between groups as evidence for construct (internal) validity. Procedures for detecting bias internal to the test included comparing the internal consistency reliability, the rank order of item difficulty, item correlations with the total score, the loadings of items on the general factor, and the relative frequencies in the choice of error distractors. Bias investigation focused on accumulating evidence of equivalence between groups to ascertain construct validity. During the 1980s and early 1990s, studies of bias focused on group responses at the item level, and various statistical methods were proposed during this time. These methods focused on situations where "examinees of one group are less likely to answer an item correctly (or endorse an item) than examinees of another group because of some characteristic of the test item or testing situation that is not relevant to the test purpose" (Zumbo, 1999, p. 12). In this context, bias could be due to features of the item itself or to the testing situation. Later, one of these methods stood out from this approach and came to be referred to separately as "DIF." DIF was then considered merely a feature of the item itself and was often not discussed within the broader issues of external and internal validity. Currently, it is known that the source of DIF, bias, and impact can relate not only to item content, but also to the groups being compared or to other contextual factors (Wu et al., 2017; Zumbo et al., 2015).

Impact

Compared with bias, the term impact is more straightforward and less confusing. It is simply defined as the true difference across subgroups in item performance (e.g., Ackerman, 1992; Dorans & Holland, 1993; Zumbo, 1999; Wu et al., 2017). Impact is regarded as common and unavoidable, because the compared groups may differ with respect to the distributions of the ability of test interest among their members (Ackerman, 1992; Dorans & Holland, 1993). Some have suggested that, similar to DIF, impact should be considered a technical term.
It should not automatically be regarded as pejorative (Ackerman, 1992).

Dorans and Holland (1993) made clear that impact is not DIF. Impact is the difference in item performance that reflects differences in overall ability distributions. DIF, however, is an unexpected difference between matched groups of test takers who have equal underlying ability as measured by the item. When detecting DIF, groups should be matched on the examinees' ability. In contrast, it is not necessary to control for the groups' ability of test interest when investigating impact.

Ackerman (1992) discussed the relationship between bias (specifically, internal bias) and impact. He argued that if all the items measure only the valid skills or constructs, any group differences would reflect impact, not bias. Unfortunately, it is possible that group differences in test performance reflect other skills that may be measured by the items. However, he failed to consider the possibility of external confounding factors that may invalidate the group comparison. In order to detect the true difference, all sources of bias should be ruled out or controlled for. This is to say, impact cannot be detected if bias exists. As Wu et al. (2017) summarized, impact cannot be investigated if DIF or bias exists, because the observed group difference may be confounded with characteristics of group members irrelevant to the latent ability that an item intends to measure. Several methods based on the multidimensional latent variable approach have been proposed in attempts to distinguish impact from bias (e.g., Thissen, Steinberg, & Gerrard, 1986; Kok, 1988; Mellenbergh, 1989; Ackerman, 1992; Shealy & Stout, 1993; Beller, 2014). However, these methods did not facilitate empirical studies of impact, probably because they failed to address the challenge of confounding effects inherent in observational comparative research for studying group differences (Wu et al., 2017).

In summary, DIF has a fairly precise meaning, and it is a statistical technique arising from the methodology literature for studying measurement bias. In contrast, the term bias carries a variety of meanings and can be defined mathematically and conceptually in different ways, which sometimes causes confusion and miscommunication. As for impact, though straightforward and comprehensible, it has been investigated less, owing to insufficient or ineffective methods for controlling unwanted confounders. The next section revisits the definitions of bias and impact in Wu et al., which were revised with the aim of explicating the nature of each term and distinguishing them from each other.

1.2 Refined Definition of Bias and Impact

This section borrows heavily from Wu et al. (2017), with the intent of staying as faithful as possible to their original definitions. Because there is a general consensus on the definition of DIF, Wu et al. (2017) retained the existing definition (already provided earlier in this thesis). As explained, DIF is only a statistical result showing that there might be differential responses to the item between subgroups that are matched on the ability of test interest. With a DIF analysis, one can only confirm the existence of this phenomenon; one cannot make a causal or attributional statement that the group composition or the item is the cause of DIF (hence that bias exists). Therefore, further evidence is needed for claiming measurement bias based on positive DIF results.
As for bias, it was defined as follows: "it is biased to compare response outcomes among groups if the observed response difference is attributable to the groups that are equal in the measured construct" (Wu et al., 2017, p. 4). Several features of this definition are worth noting. First of all, it is the group comparison that is investigated for bias. Although it is the difference in test scores (i.e., the outcome of test takers' responses) that is being compared, the subject being examined for bias is the group difference in responding to the test item. Bias examines whether the difference in item scores between subgroups is a result of the groups' differential responses to the item. Second, similar to DIF, bias detection should rest on matched subgroups who have equal ability. Finally, this definition implies an attributional claim: that group membership is a reason for the differential responses.

As for impact, it was defined as "the difference in the item response that is attributable to the group's difference in the measured construct based on an unbiased group comparison" (Wu et al., 2017, p. 4). Compared with bias, the definition of impact stipulates that the difference in item scores is truly due to the groups' difference in the ability of test interest, rather than to differential responses. In addition, because impact aims to compare the difference in test takers' ability, it is unnecessary to match the subgroups on this ability; doing so would directly conflict with the definition and purpose of impact. Furthermore, as with bias, this definition of impact intends to make an attributional claim: that group membership is a reason for the difference in test scores between the subgroups. Finally, the detection of impact should be conducted only for unbiased group comparisons. Namely, if a group comparison is found to be biased, impact investigation should not be conducted, because one would never know whether the item score difference is due to bias or to the groups' difference in ability.

From the perspective of test validation (i.e., validity of test use), it can be argued that the refined definition of bias is tied to a validation method by way of warranting scores for the intended use and interpretation. This view of validation extends beyond the traditional approaches to measurement bias, which focused on evidence based on test content, internal structure, or external criterion variables. As for impact, it does not address the issue of validity; an impact study aims to investigate groups' true difference in the ability indicated by the test score.

As implied in the definitions of bias and impact, it is clear that findings of bias and impact are group-composition dependent. Furthermore, the legitimacy of these definitions depends on whether the attributional stance works for the different compositions of groups contrasted for studying bias and impact. These issues are discussed in the next chapter.

Chapter 2: Types of Groups in Causal Inquiries – Implications for Bias and Impact Research

Chapter 1 discussed the meaning of measurement bias, its importance, and related terminology. The following sections describe different methodological approaches to the investigation of measurement bias in the social sciences, and the issues surrounding each methodology are discussed.
2.1 Post-Positivism Approach in Social Science

An inquiry in the social-behavioral sciences may take different perspectives according to the ontology of the phenomenon under investigation: post-positivism, constructionism, or interpretivism, to name a few. Many researchers believe that the goal of inquiry is to "discover facts," as a truly controlled experiment in natural science would (a positivist philosophy). Sir Karl Popper, however, attested that progress in science is not to discover new facts or prove a theory to be true (the view of positivism); rather, it is to propose a theory and to find, tentatively, that the theory cannot be falsified despite endeavors to do so (Crotty, 1998). In this way, human knowledge is accumulated. This alternative, post-positivist view holds that reality exists, but in an imperfect and probabilistic form, not a deterministic form (Robson, 2002). This school of thought believes that human knowledge is not based on a solid foundation; instead, it is composed of conjectures about phenomena, which are accepted temporarily until they are disproved (Bhattacherjee, 2012). Once a conjecture is falsified, the understanding of the examined phenomenon is deepened.

Acknowledging that other theoretical perspectives are available, this thesis takes the post-positivist view of scientific inquiry for investigating bias and impact. Taking this perspective, a researcher would ideally adopt an experiment as the methodology for studying the causality implied in the definitions of bias and impact in Wu et al. (2017). A randomized experiment is believed to create identical groups via random assignment of participants to different treatment groups. Random assignment is able to balance any preexisting differences, both observed and unobserved, among groups. It follows that differences in the outcome variable are merely due to the manipulation of the experiment (i.e., the treatment); hence the treatment can be concluded to be the only reason for the difference in the outcome. In this sense, the randomized experiment is regarded as the gold standard for building a strong and convincing causal inference (e.g., Meldrum, 2000; Shadish, Cook, & Campbell, 2002). Further, a researcher would apply the corresponding statistical techniques developed for experimental designs, such as analysis of variance (ANOVA) and analysis of covariance (ANCOVA), developed by Sir R. A. Fisher (1934).

The experimental design certainly has its advantages in terms of manipulating the relationship between cause and effect for making an intended causal claim. However, feasibility, practicality, and ethics often hamper the deployment of an experiment on human subjects. For a study to be an "experiment," the bottom line is the manipulability of the levels of the factors (the different conditions of the cause). Namely, participants can be assigned to different conditions (levels) of the factor. The experimenter can then deliver different treatments (interventions) to the participants and observe their effects on the outcome variable. If manipulation is not feasible, then the study is not an experiment. Simply put: no manipulation, no experiment, not to mention random assignment of treatments. Hence, without randomized manipulation, it is hard to draw a causal claim. When manipulation is not feasible, researchers have to choose alternative research designs to provide evidence for their causative research hypotheses.
2.2 The Mismatch between Inquiry and Methodology in the Old Ways of Studying Bias and Impact

When possible, researchers usually take advantage of a randomized experiment to address a causative research question. For example, experimental designs frequently appear in cognitive psychology and behavioral psychology. However, on many occasions in social science research, the researcher cannot, for various reasons, manipulate the cause of his or her interest, not to mention randomly assign it. Most often, the cause of research interest is a sociodemographic characteristic that exists prior to the study. For instance, suppose a researcher plans to explore the impact of marital status on one's subjective well-being. The researcher cannot manipulate a person's marital status, nor can the researcher "randomly assign" a participant to a marital status. Marital status exists before the study and cannot be manipulated for the purpose of the study. What the researcher can do is recruit a sample of participants with different marital statuses and measure their subjective well-being. In this case, the experiment is not a feasible research design for answering the causative research question; the researcher can only conduct an observational study.

Even today, it is widely believed that the randomized experiment is the only valid means of establishing causality, and a study making a causal claim without a randomized experiment is easily a target for criticism and rejection. Many inquiries into causality in the social sciences face this difficulty: the conditions (levels) of the independent variable of research interest are preexisting sociodemographic characteristics and usually cannot be manipulated; hence, an experiment is impossible. In this case, one cannot make a direct and definitive causal claim. Unfortunately, many important research questions in the social sciences entail a causal inquiry but cannot be addressed with a randomized experiment. Clinging to the "gold standard" of a randomized experiment for providing evidence often takes researchers nowhere with their causative inquiries. Ideally, a research design should accommodate the research question and the reality it is situated in, rather than the reverse. Fortunately, the experiment is not the only way to investigate causality. Statisticians have developed methods to help examine causality in non-randomized studies, which will be discussed in Section 2.4.

2.3 Four Types of Groups

Group comparisons are highly prevalent and are frequently the purpose of research, with an implied intention to infer causality. In this section, four forms of groups are distinguished by research design according to two dimensions: (1) whether participants are randomly assigned to the treatments (randomized manipulation), and (2) whether the group membership is changeable/selectable. Although the meanings of manipulable, changeable, and selectable may seem equivalent, there are nuances among these three expressions. That is, researchers have different levels of control over the forming of participants' group memberships: manipulable is stronger than changeable, which, in turn, is stronger than selectable, as explained in the following.

Manipulable and randomly assigned (denoted as A1). In a randomized experiment, the researcher can randomly assign and manipulate the treatments, as well as their timing, as planned. For example, the experimenter plans to evoke a specific positive emotion in one group and a specific negative emotion in the other.
Here, emotion is the independent variable (IV), and positive and negative emotions are two levels of the IV. Both emotion groups are formed in line with the causal temporal sequence.

Manipulable but not randomly assigned (denoted as A2). In a quasi-experiment, the researcher can manipulate the treatment but cannot, or does not, randomly assign the participants to the treatments. In other words, the experimental groups are not formed randomly. Using the same example of emotion manipulation, the researcher decides to select two already existing classes and assign a different treatment to each class, instead of forming two new groups randomly.

The third and fourth forms of grouping are based on observational research designs. The group membership is merely an individual's characteristic, rather than an experimental treatment. There are numerous natural and preexisting characteristics by which the researcher can arbitrarily form groups according to the research interest. The most common characteristics for grouping are sociodemographic variables such as gender, education, language, income, nationality, ethnicity, marital status, and so forth. Membership grouped by a socio-demographic characteristic is not manipulable by the researcher; however, it is sometimes changeable or selectable by the participants or related others.

Non-manipulable but changeable/selectable (denoted as B1). Some group memberships based on participants' existing characteristics are changeable or selectable by the participants themselves or by related others, such as language, income, and nationality. An example of selectable group membership is the type of schooling: public, private, or home schooling. In reality, the researcher has no right to select a school for a student for research purposes, but the students or their parents/guardians can select the schools.

Neither manipulable nor changeable/selectable (denoted as B2). This type of group membership is determined neither by the researcher nor by the participant. A typical example is sex, which is determined a few weeks after the fertilization of the ovum, which generally far precedes any research. It is unusual, if not entirely impossible, for a researcher or a participant to change or select one's sex for a study.

For A1 and A2 groups, the language used in experimental causal studies, such as manipulation and assignment, makes good sense because this language was developed for groups formed by experiments. For B1 groups, the experimental language of "assignment" and "manipulation" is occasionally still applicable, although such assignment is hardly ever undertaken. Take education as an example: in theory, the researcher could assign a participant to an educational program for the purpose of the study, but this kind of assignment is rare owing to feasibility and research ethics. For B2 groups, a semantic difficulty occurs when the researcher uses experimental terminology to describe the forming of the groups. For example, saying that "a person is assigned to the female group" does not make sense, because sex is inherent and unchangeable during a regular research project. Unfortunately, the B2 type of grouping is very common, if not the most common, and is frequently of interest to researchers in the social and behavioral sciences. In the next section, Rubin's causal model is revisited in the context of how it relates to the B1 and B2 types of group composition that are often studied for bias and impact.
2.4 Rubin's Causal Model for Observational Studies

A researcher can face difficulties in establishing causality when the nature of the inquiry is causal but only an observational study is feasible. The randomized experiment is powerful, but it is not the only approach to providing evidence for a causative hypothesis. Rubin's causal model is one alternative for this kind of inquiry.

In a perfectly controlled randomized experiment, the treatment is the only reason for the change in the outcome variable (the treatment effect). Random assignment of participants to treatments equalizes the groups before the manipulation. On the contrary, groups in observational studies are unequal, because the groups pre-exist before the study or are formed by categorizing individuals' characteristics that are often not manipulable. A researcher cannot make a direct and definitive causal claim based on an observed group difference in the outcome if no apparatus is employed to equalize the groups on factors that may be associated with the outcome variable (i.e., covariates). To tackle these difficulties, Rubin and his colleagues proposed the theory of counterfactual probability and the concept of the balancing score to achieve equivalence between groups statistically (Rosenbaum & Rubin, 1983; Rubin, 1997, 2006). In contrast to equal groups achieved by randomization prior to the manipulation, the balancing score is a statistical method applied after data collection (although good application of the balancing score method still entails a proper research plan before data collection).

As Rosenbaum and Rubin (1983, p. 42) defined, "A balancing score b(X) is a function of the covariates X such that the conditional distribution of X given b(X) is the same for treated (Z = 1) and control (Z = 0) units." In Dawid's (1979) notation, this can be expressed as

X ⊥ Z | b(X),   (1)

where X is a vector of covariates, b(X) is a function of X, and Z denotes the group membership. Equation (1) indicates that the covariates X and the group membership Z are independent conditional on b(X). Namely, given b(X), the distributions of X are equal between the treated and control groups. If done well empirically, pre-existing differences that may influence the outcome (the effect of the covariates) can be minimized, if not entirely eliminated. From the above definition, it can be seen that the balancing score is a function of a vector of covariates that are irrelevant to the hypothesized causality. It can be used to create approximately equal groups, on the basis of which the outcome can be compared between groups for establishing causality.

Although the balancing score and the matching methods strive to create equal groups analogous to those in a randomized experiment, it is worth noting that the groups are balanced and matched only on observed covariates. It is possible that influential confounding variables are not included as covariates in a study. As a result, the groups could still be nonequivalent after matching. In contrast, the groups in a randomized experiment are theoretically balanced on all observed and unobserved confounding variables. Nonetheless, the balancing score, if done well, is especially useful and practical when randomization is not feasible.

A variety of balancing scores are available. As will be seen later, the propensity score is the method chosen for this thesis.
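As a concrete illustration of Equation (1), the following minimal R simulation (entirely hypothetical data, not part of this thesis's analyses) shows that a covariate that is markedly imbalanced between two groups becomes approximately balanced once one conditions on a balancing score, here the propensity score e(X):

set.seed(1)
n <- 10000
x <- rnorm(n)                      # a single covariate X
e <- plogis(0.8 * x)               # true propensity score e(X) = P(Z = 1 | X)
z <- rbinom(n, 1, e)               # group membership Z generated from e(X)

mean(x[z == 1]) - mean(x[z == 0])  # sizable imbalance in X before conditioning

# Condition on b(X) = e(X) by stratifying on quintiles of the propensity
# score; within each stratum, the group difference in X is near zero.
s <- cut(e, quantile(e, 0:5 / 5), include.lowest = TRUE)
tapply(seq_len(n), s, function(i) mean(x[i][z[i] == 1]) - mean(x[i][z[i] == 0]))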
Rosenbaum and Rubin (1983) showed that the propensity score is the coarsest kind of balancing score. With the propensity scores in place, one can compare the difference in the outcomes between groups that are approximately equal on the covariates, similar to what can be achieved by random assignment. However, the meaning of the propensity score can be difficult to interpret in observational studies. The propensity score is defined by Rosenbaum and Rubin (1983), using the language of the experiment, as "the conditional probability of assignment to a particular treatment given a vector of observed covariates" (p. 41). This language is plainly awkward for the B2 type of groups. For example, it is awkward to say that "the conditional probability of being 'assigned' to the female group is high given the observed covariates." The definition of the propensity score needs to be revised to accommodate this situation for B2 grouping. Otherwise, its meaning is not applicable to a grouping variable that is not changeable or selectable, such as gender or ethnicity.

2.5 The Attributional Stance for Bias and Impact Detection

In order to address the issues of causal inquiry in observational studies, and to tackle the communication difficulties discussed in the previous sections, Wu et al. (2017) proposed an alternative stance on looking into causality. To them, attribution is an act of assigning a reason or reasons to a phenomenon that has occurred. They argued that when comparing groups in an observational study for a causal inquiry, in particular with the B1 and B2 types of grouping, the language and concepts of experiment-based causality, to which most researchers are accustomed, are awkward and create communication difficulties. In an observational study, the purpose is to show that the grouping is a cause among many other possible reasons. This is very different from an experimental study, where the goal is to demonstrate that the treatment is the only cause of the outcome differences. The attributional stance loosens the strict requirements of an experimental design and the rigid view of causality, which facilitates causal discussion and interpretation for observational studies. For example, when taking an attributional view, a researcher can avoid unreasonable statements such as "one is assigned to a female group" or "gender does not cause the difference in school achievements." Instead, it can be concluded that "the group composition of gender is not a reason for the difference in school achievements." This language makes good sense in most lived experiences for this kind of non-experimentally established causality. Although a causal-type inference based on an observational study might be less direct and clear-cut than that in a randomized experiment, and the researcher needs more evidence from other sources or methods to support the inference, taking an attributional stance at least helps increase the feasibility of, and the flexibility and room for, discussing causal inquiries through an observational study.

Chapter 3: Analytical Procedures and Relevant Statistical Techniques for Bias and Impact

Although the relevant terminology has been introduced and discussed in earlier chapters, it is necessary to describe the analytical methods employed in this thesis. This chapter is mainly based on Wu et al.'s (2017) conceptual framework and the corresponding analytical procedures for bias and impact detection in their article.
Figure 1. Analytical Procedures of Detecting Bias and Impact (Wu et al., 2017)

3.1 Techniques for Propensity Score and Matching

According to the analytical procedures in Figure 1, the first step includes three sub-steps: selecting covariates, estimating balancing scores, and matching on balancing scores.

In the first sub-step, covariates are supposed to be associated with the outcome variable (e.g., an item score on a Science test) and the grouping variable (e.g., test language and gender), but irrelevant to the research interest (i.e., irrelevant to the construct that the test intends to measure and irrelevant to the group comparison). For technical discussions on the selection of covariates, one can refer to Guo and Fraser (2010), Wei and Pan (2015), and Wu et al. (2017).

For the second sub-step of estimating the balancing score, the propensity score was applied. It is defined by Rosenbaum and Rubin (1983) as

$e(X_i) = P(Z_i = 1 \mid X_i)$,    (2)

where $X_i$ denotes a vector of covariates and $Z_i$ denotes the group membership; $Z_i = 1$ indicates that individual i is in the focal group, and $Z_i = 0$ indicates that individual i is in the reference group. The propensity score $e(X_i)$ is the probability of individual i being in the focal group given the vector of observed covariates. The propensity score can be estimated through a logistic regression model, expressed as (e.g., Guo & Fraser, 2010; Liu et al., 2016)

$P(Z = 1 \mid X) = \dfrac{e^{\beta_0 + \beta(X)}}{1 + e^{\beta_0 + \beta(X)}}$,    (3)

where $\beta_0 + \beta(X)$ is a linear combination of the covariates X, $\beta_0$ is the intercept, and $\beta$ is a vector of parameter coefficients for the j covariates. The expansion of $\beta(X)$ is equivalent to $\beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_j X_j$. If necessary, the estimation of the propensity score can include interaction terms among the covariates. Equation (3) can be expressed on a logit scale as

$\mathrm{logit}(P(Z = 1 \mid X)) = \beta_0 + \beta(X)$.    (4)

With the above equations, the probability of being in the focal group can be estimated from the covariates.

For the third sub-step, three matching methods were employed: optimal pair matching (one-to-one, denoted as 1to1 hereafter), optimal full matching (one-to-multiple, denoted as 1toM hereafter), and combined optimal full matching (a combination of one-to-multiple and multiple-to-one, denoted as 1MM1 hereafter) (see Liu et al., 2016). Pair matching means that an individual in the focal group is matched with only one individual in the reference group. In so doing, a large number of individuals in the reference group may be discarded. With full matching (1toM), an individual in the focal group has the chance to be matched with multiple individuals in the reference group. In so doing, all individuals in the reference group will be matched and none will be discarded. With the combined optimal full matching, not only will all individuals in the reference group be matched and hence retained, but members of both the focal and reference groups will also have the chance to be matched several times. For more information about matching methods, one can refer to Guo and Fraser (2010), Wei and Pan (2015), and Liu et al. (2016). For demonstrative purposes, only the one of the three matching methods that best balanced the covariate distributions was selected for the further bias and impact analyses. In order to choose the best matching method, one can examine the plot of the propensity score distributions and compare the percentage of bias reduction (PBR) (Cochran & Rubin, 1973; Liu et al., 2016).
PBR is given as

$PBR = \dfrac{|Bias_{pre}| - |Bias_{post}|}{|Bias_{pre}|} \times 100\%$,    (5)

where $Bias = M_{X_1} - M_{X_0}$; $M_{X_0}$ denotes the mean score of the reference group on a covariate, and $M_{X_1}$ denotes the mean score of the focal group on the same covariate. $Bias_{pre}$ refers to the bias computed before matching, and $Bias_{post}$ refers to the bias computed after matching. If the value of PBR is positive, it indicates that the mean difference between the two groups has been reduced after matching. Therefore, the size of the PBR can be used to examine the extent to which the pre-existing differences in the covariates have been eliminated after matching. Note that the word "bias" is used in the propensity score literature to denote the difference (imbalance between groups) in the means of a covariate. It is important not to confuse the meaning of this term with that used to denote bias in group comparisons on the item scores, which is the main focus of this thesis.

3.2 Methods for Bias Detection

As mentioned, bias in group comparison in this thesis is detected by a DIF analysis. Because the DIF method based on logistic regression (Swaminathan & Rogers, 1990) was employed in this thesis, it is briefly described below. Logistic regression DIF can be inspected by three nested models given as

$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 T$,    (6)

$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 T + \beta_2 G$,    (7)

$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 T + \beta_2 G + \beta_3 (T \times G)$,    (8)

where p is the proportion of examinees who answer an item correctly, i.e., the probability; T is the rest total score of a test taker on an item, which is a proxy of the examinee's ability that the item intends to measure; G denotes the grouping variable (0 = reference group, 1 = focal group); and T×G is the interaction term between T and G. With the logit link function, an examinee's probability of answering an item correctly is modeled as a linear combination of these three terms. The value of $\beta_0$ is the regression intercept coefficient; $\beta_1$, $\beta_2$ and $\beta_3$ are regression slope coefficients. Equation (6) is the baseline model because G is not included and thus there is no group comparison; only the ability measure (T) is included in the model. Equation (7) is a model for detecting uniform DIF, and Equation (8) is for detecting non-uniform DIF (Mellenbergh, 1982). Then, two likelihood ratio tests are conducted to detect whether, controlling for ability, the group responses differ uniformly or non-uniformly (Hosmer, Lemeshow & Sturdivant, 2013):

$LRT_{uniform} = -2 \times (LL_{null} - LL_{uniform})$,    (9)

$LRT_{non\text{-}uniform} = -2 \times (LL_{uniform} - LL_{non\text{-}uniform})$,    (10)

where LRT denotes the likelihood ratio test and LL denotes the natural logarithm of the likelihood. If $LRT_{non\text{-}uniform}$ is statistically significant when evaluated against a Chi-square distribution with degrees of freedom equal to the difference in the number of parameters between the two models, it suggests that the group response difference is non-uniform. In addition, if $LRT_{non\text{-}uniform}$ is not significant but $LRT_{uniform}$ is significant, it suggests that the group response difference is uniform. In Chapter 4, the results of the LRTs are shown and assist in judging whether the group responses are differential, uniformly or non-uniformly.

For bias detection, similar to the conventional logistic regression method described above, a conditional logistic regression was used to detect bias and impact; the results are shown in Chapter 4.
The difference lies in that the latter model is applied to data in pairs or clusters that are matched on the propensity scores. Besides, the parameter estimation is based on the data of the discordant pairs/clusters, because the concordant pairs/clusters do not provide information for the likelihood (see Liu et al., 2016; Hosmer, Lemeshow & Sturdivant, 2013). The three conditional logistic regression models for detecting uniform bias and non-uniform bias are given as follows:

$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 (T_1 - T_0)$,    (11)

$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 (T_1 - T_0) + \beta_2 (G_1 - G_0)$,    (12)

$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 (T_1 - T_0) + \beta_2 (G_1 - G_0) + \beta_3 (T_1 G_1 - T_0 G_0)$,    (13)

where the subscripts 1 and 0 represent the focal group and the reference group respectively; hence $T_1 - T_0$ denotes the difference in rest total scores within a matched pair/cluster, and $G_1 - G_0$ denotes the difference in group membership. The other notations are the same as those in Equation (8). These three models were fitted to the matched/clustered data. Likewise, the value of the LRT was computed to help assess whether the responses of the two groups, after controlling for ability and the confounding variables, differed uniformly or non-uniformly. If so, the group comparison would be considered biased.

3.3 Method for Impact Investigation

As for the detection of impact, Wu et al. (2017) proposed to test the true group difference based on the matched data (without being confounded by the covariates). The conditional logistic regression should include the G variable, where the two groups are matched on the propensity scores. Moreover, there is no need to control for ability when investigating impact; hence T and T×G should not be included in the model. The conditional logistic regression for detecting impact is given as

$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 (G_1 - G_0)$.    (14)

A significant $\beta_1$ would suggest that there is a true difference in ability between the focal group and the reference group. Note that because the framework for bias and impact in Wu et al. (2017) is conceptual and generic rather than technical, one can apply other statistical methods to investigate bias and impact instead of logistic regression.

Chapter 4: Demonstration of Two Types of Groups in Observational Design for Studying Bias and Impact

This chapter, with two real data examples, investigates bias and impact using Wu et al.'s (2017) conceptual and analytical procedures. The purpose is to test whether the attributional stance makes substantive and pragmatic sense for conceptualizing and communicating bias and impact between groups that are formed by observational design, i.e., the B1 and B2 types of groups described in Section 2.3.

4.1 Data Sources

The measurement instrument used in these two examples was the TIMSS 2011 eighth grade Science achievement assessment. Data were retrieved from the TIMSS 2011 international database (IEA's Trends in International Mathematics and Science Study, TIMSS). Booklet 1 from Canada was selected for this demonstration. Four content domains (Biology, Chemistry, Physics, and Earth Science) were included in the booklet, and each content domain contained more specific topic areas. There were 30 items in the booklet, of which 28 items were dichotomous and are the focus of the present study; the other two items were polytomous and were excluded from the current study. Information on the test items and sample test questions can be found in Appendices A and B, respectively.
The grouping variables of interest were test language (B1 type) and biological gender (B2 type, referred to as gender hereafter). In the data set, cases with missing values on the test items and covariates were deleted. After cleaning the raw data, the final data set for the gender comparison contained a total of 848 students: 429 male students (the reference group) and 419 female students (the focal group). The sample sizes of the two groups were close to balanced in this dataset. The final dataset for the test language comparison included a total of 782 students, among whom 562 were students taking the English version of the test (the reference group) and 220 were students taking the French version (the focal group). In this dataset, the sample size of the focal group is much smaller, only about 40% of the size of the reference group. It is worth noting that both English and French are official languages in Canada. Therefore, taking the French version of the test is not an uncommon practice in some provinces of Canada.

For both datasets, every student's rest total scores were computed. An item's rest total score is the total score on the 30 items (the whole booklet) minus the score on that item. Therefore, one student would have 28 rest total scores corresponding to the 28 items. The rest total score is a proxy of a student's ability in Science and was used in the bias detection.

As for the selection of covariates, a confounder should be irrelevant to, or unwanted in, the group comparison and irrelevant to the construct to be measured by the items (Wu et al., 2017). Statistically, a confounder should correlate with both the grouping variable and the outcome variable. With this criterion, 19 variables were chosen from 92 related questions in the TIMSS student background questionnaire (Mullis et al., 2009). These questions are listed in Appendix C. The selected covariates included questions on the student's domestic educational environment (one question), the student's educational expectation (one question), feelings about school security (one question), personal interest in learning Science (six questions), self-confidence in learning Science (seven questions), and valuing Science (three questions). Statistically, these covariates correlated with the outcome with a minimum absolute size of 0.150. The selected covariates turned out to have small correlations with the two grouping variables (with gender, rM = 0.003, rmax = 0.155, rmin = -0.136; with test language, rM = 0.009, rmax = 0.168, rmin = -0.199). Although the correlations between the grouping variables and the covariates were quite low, the covariates were included for propensity score estimation because they could still help to balance the differences in the groups' unwanted characteristics.

4.2 Data Analyses and Computing Tools

All analyses were conducted in RStudio (Version 1.0.143), following the analytical procedures in Figure 1. The MatchIt package (Version 2.4-22, Ho et al., 2017) in R was applied to estimate the propensity scores and to match the test language groups and the gender groups. In addition, the Epi package (Version 2.10, Carstensen et al., 2017) was employed to conduct the conditional logistic regressions and likelihood ratio tests.

4.3 Descriptive Statistics

Differences in Correct Proportion between Test Language and Gender Groups. Based on the raw data, the proportion of test takers answering an item correctly was computed by test language group and by gender group. The results are shown in Tables 1 and 2, respectively.
In these tables, the difference represents the correct proportion of the focal group minus that of the reference group (French - English in Table 1, and Female - Male in Table 2). It is obvious that several differences were large (absolute value larger than 0.100) and many were small (absolute value close to 0). For the test language groups, the mean of the differences was -0.057; the maximum positive difference was 0.087 (Item 20) and the maximum negative difference was -0.284 (Item 8). For the gender groups, the mean of the differences was -0.004; the maximum positive difference was 0.127 (Item 2) and the maximum negative difference was -0.112 (Item 7).

Table 1. Group Difference in Item Correct Proportion by Test Language

Item   French   English   Difference
1      0.300    0.359     -0.059
2      0.718    0.696     0.022
3      NA       NA        NA
4      0.468    0.571     -0.103
5      0.559    0.514     0.045
6      0.582    0.609     -0.027
7      0.327    0.351     -0.023
8      0.241    0.525     -0.284
9      NA       NA        NA
10     0.673    0.715     -0.043
11     0.514    0.587     -0.074
12     0.586    0.726     -0.140
13     0.791    0.783     0.008
14     0.664    0.838     -0.174
15     0.873    0.907     -0.035
16     0.823    0.819     0.004
17     0.382    0.520     -0.138
18     0.759    0.763     -0.004
19     0.464    0.489     -0.026
20     0.564    0.477     0.087
21     0.586    0.596     -0.010
22     0.955    0.931     0.024
23     0.395    0.596     -0.201
24     0.427    0.482     -0.055
25     0.005    0.128     -0.124
26     0.041    0.176     -0.135
27     0.373    0.301     0.072
28     0.836    0.870     -0.034
29     0.245    0.352     -0.107
30     0.236    0.299     -0.063

Table 2. Group Difference in Item Correct Proportion by Gender

Item   Female   Male     Difference
1      0.348    0.331    0.017
2      0.747    0.620    0.127
3      NA       NA       NA
4      0.566    0.536    0.030
6      0.578    0.622    -0.045
7      0.289    0.401    -0.112
8      0.442    0.436    0.006
9      NA       NA       NA
10     0.668    0.741    -0.073
11     0.561    0.587    -0.027
12     0.697    0.660    0.037
13     0.780    0.795    -0.014
14     0.809    0.765    0.045
15     0.924    0.862    0.061
16     0.823    0.809    0.015
17     0.487    0.466    0.021
18     0.785    0.718    0.067
19     0.496    0.469    0.028
20     0.465    0.534    -0.068
21     0.604    0.592    0.012
22     0.931    0.925    0.005
23     0.539    0.529    0.010
24     0.453    0.473    -0.020
25     0.076    0.117    -0.040
26     0.103    0.184    -0.082
27     0.327    0.331    -0.004
28     0.874    0.858    0.016
29     0.289    0.368    -0.080
30     0.274    0.284    -0.010

Note that this initial difference is neither bias nor impact. With these descriptive statistics, one cannot judge whether the groups responded differently to an item or truly performed differently, because, in both cases, the two groups were different in the first place. Even when the difference between groups was statistically significant, it was unclear whether it was the content of an item, the group membership, or the test-takers' other characteristics that caused the difference in the correct proportions. A further device is needed in order to detect bias and impact.

4.4 Propensity Score Estimation and Matching

To focus on the findings of bias and impact for the different types of group composition (rather than on different ways of matching), the propensity score estimation and matching were conducted in advance. Also, only the results of the combined optimal full matching by the one-to-multiple and multiple-to-one method (1MM1) on test language are shown below. The other matching results are presented in Appendices D, E, F, G and H.

In Figure 2, the values at the bottom represent the range of the propensity score, i.e., the probability from 0 to 1. The four groups from top to bottom are the Unmatched Treatment Units, the Matched Treatment Units, the Matched Control Units, and the Unmatched Control Units. Treatment denotes the focal group (French in this case), and Control denotes the reference group (English in this case).
Circles represent group members, and the size of a circle denotes the weight of that group member. There is no circle in the Unmatched Treatment Units or the Unmatched Control Units, which means that all group members have been matched. The circles in the Matched Treatment Units are distributed evenly across the axis, while the majority of the circles in the Matched Control Units cluster around the lower end. Nonetheless, the distributions of the focal group and the reference group overlap substantially, except at the higher end.

Figure 2. Jitter Plot for Optimal Full Matching (1MM1) on Test Language

In Figure 3, "Raw" denotes the raw data before matching and "Matched" denotes the data after matching. Compared with the Raw Treated, the distribution of the Raw Control is more positively skewed. Hence, there is a noticeable difference between the two distributions. On the right panel of the graph, the distribution of the Matched Control becomes less skewed and similar to that of the Matched Treated after matching. The matching procedure has approximately achieved the expected result: to equalize the distributions of the focal group and the reference group on the covariates, similar to what can be achieved by random assignment in an experiment. The graphical diagnostics indicate that the result of the optimal full matching (1MM1) was satisfactory.

Figure 3. Comparison of Propensity Score Distributions before and after Optimal Full Matching (1MM1) between Test Language Groups

The plots were generated automatically from the matched data with the plot command in R. It is worth noting that the default terms treatment and control, which originated from experimental design, were used by the developers of the MatchIt package in R. In fact, there is no treatment or control in the two exemplary studies. For the grouping variables of test language (B1) and gender (B2), the group memberships pre-existed before this study and were not assigned by the test administrator. In theory, for the B1 type of grouping by test language, it is possible to randomly assign an examinee to the English test group or the French test group if the examinee is bilingual, but this rarely happens in reality (unless it is an experiment). As for the gender groups, it is not even possible to assign someone to the female group or the male group, let alone randomly. For these two types of groups, the terms treatment and control are obviously inappropriate. The DIF terminology of focal vs. reference is more suitable. These points highlight the drawbacks of using an experimental conceptualization for a causal-type inquiry of bias and impact where the group composition is simply observed.

Besides comparing the covariate distribution plots before and after matching, the PBR index can also help to indicate the effect of matching. Table 3 shows the mean differences before and after matching and the percentage of mean difference reduction. In the rightmost column, various percentages of reduction were achieved. Negative reductions (e.g., BSBS17C, BSBS19C, BSBS19D) indicate that the mean difference increased after matching. This is due to a small mean difference before matching accompanied by an increase in the mean difference after matching. Because the mean difference was already very small in the first place, an increase in the mean difference easily appears fairly large in percentage terms after matching. This negative PBR is not too much of a concern, because the increased mean difference was in fact very small.
Similarly, small positive reductions (e.g., BSBS17D) indicate that the mean difference did not decrease much after matching, which also resulted from a small mean difference before matching. In contrast, for many of the positive reductions, the mean differences were vastly reduced. Note that "distance" in the first row is an overall index measuring the difference in the propensity scores. It was markedly reduced from 0.132 to 0.010 after matching, a 92.55% reduction.

Table 3. Percentage of Bias Reduction Before and After Optimal Full Matching (1MM1) on Test Language Groups

Covariate   Mean Treated   Mean Control (Before)   Mean Difference (Before)   Mean Control (After)   Mean Difference (After)   Reduction (%)
distance    0.376          0.244                   0.132                      0.367                  0.010                     92.552
BSBG04      2.873          3.393                   -0.521                     2.897                  -0.024                    95.357
BSBG07      5.055          5.237                   -0.182                     5.126                  -0.072                    60.730
BSBG12B     1.914          1.628                   0.286                      1.876                  0.038                     86.787
BSBS17A     2.145          1.900                   0.245                      2.172                  -0.026                    89.305
BSBS17B     2.659          2.879                   -0.220                     2.746                  -0.087                    60.556
BSBS17C     3.150          3.153                   -0.003                     3.219                  -0.069                    -2166.533
BSBS17D     2.877          2.838                   0.039                      2.916                  -0.038                    2.004
BSBS17F     2.073          1.970                   0.103                      2.150                  -0.077                    25.402
BSBS17G     1.700          1.552                   0.148                      1.721                  -0.021                    85.961
BSBS19A     1.818          1.790                   0.028                      1.837                  -0.019                    32.711
BSBS19B     2.950          3.066                   -0.116                     2.893                  0.057                     50.557
BSBS19C     2.714          2.735                   -0.021                     2.651                  0.063                     -194.269
BSBS19D     2.055          2.066                   -0.011                     2.124                  -0.070                    -517.287
BSBS19E     3.182          3.093                   0.089                      3.147                  0.035                     60.548
BSBS19F     2.173          2.244                   -0.071                     2.280                  -0.107                    -51.206
BSBS19I     3.236          3.158                   0.078                      3.129                  0.107                     -37.625
BSBS19L     1.895          1.817                   0.079                      1.873                  0.022                     71.902
BSBS19M     2.023          1.947                   0.076                      2.063                  -0.040                    47.145
BSBS19N     2.391          2.331                   0.060                      2.483                  -0.093                    -54.426

Note that three matching methods were implemented, and the results are summarized in Appendix I. Compared with the other two methods, the combined optimal full matching (1MM1) better balanced the covariate distributions. Therefore, the data matched with this method were selected for the further investigations.

4.5 Bias Detection

The matched data were used for bias and impact detection. Table 4 reports the results of bias detection on test language. Among the total of 28 items, comparisons on 14 items were found to be biased. Group comparisons on three items were identified as non-uniformly biased, and group comparisons on the other 11 items were flagged as uniformly biased. Because the confounding covariates had been balanced between the two groups, the difference in ability had been controlled for as well, and considering that the test items had been well developed, the following conclusion could be made: it is biased to compare the item scores on these 14 items between the two test language groups. That is, the bias can be attributed to the group composition. Taking Wu et al.'s (2017) attributional stance for bias, test language is a reason responsible for the differential group responses to these 14 items. The group comparisons on the remaining 14 items were found to be unbiased, and those items qualified to enter the next step, the impact analysis.
Table 4. Results of Bias Detection in Comparing the Test Language Groups

Item   Non-uniform Δχ2   p        Uniform Δχ2   p
1      12.963            <0.001   1.497         0.221
2      1.500             0.221    3.365         0.067
3      NA                NA       NA            NA
4      0.006             0.936    1.700         0.192
5      0.032             0.857    7.012         0.008
6      4.245             0.039    0.086         0.769
7      0.742             0.389    0.570         0.450
8      0.176             0.675    32.253        <0.001
9      NA                NA       NA            NA
10     0.196             0.658    0.119         0.730
11     0.070             0.791    1.281         0.258
12     2.105             0.147    7.098         0.008
13     0.290             0.590    0.298         0.585
14     0.936             0.333    17.483        <0.001
15     0.832             0.362    0.683         0.409
16     0.307             0.580    4.208         0.040
17     0.346             0.557    5.362         0.021
18     0.001             0.979    0.454         0.500
19     0.002             0.966    2.136         0.144
20     0.015             0.903    14.740        <0.001
21     0.946             0.331    2.277         0.131
22     3.197             0.074    8.233         0.004
23     0.325             0.568    12.024        0.001
24     0.929             0.335    0.086         0.769
25     0.000             0.991    29.484        <0.001
26     0.983             0.321    12.639        <0.001
27     4.586             0.032    11.113        0.001
28     0.142             0.707    0.219         0.640
29     2.485             0.115    3.619         0.057
30     0.038             0.845    1.346         0.246
Total # detected: 3 non-uniform, 11 uniform

Note. Significant p values (<0.05) are presented in bold face. Once a group comparison on an item is flagged as non-uniformly biased, it is not flagged as uniformly biased even if the corresponding p value is statistically significant.

Table 5 reports the results of bias detection for the gender group comparison. A total of eight comparisons were found to be biased. On only one item was the group comparison identified as non-uniformly biased; the group comparisons on the other seven items were identified as uniformly biased. For these comparisons, the bias can be attributed to the gender groups. In other words, gender is a reason for the differential responses to these items. The group comparisons on the remaining 20 items were flagged as unbiased; hence these items could enter the next step, the impact analysis.

Table 5. Results of Bias Detection in Comparing the Gender Groups

Item   Non-uniform Δχ2   p       Uniform Δχ2   p
1      0.092             0.761   0.150         0.699
2      0.139             0.709   14.186        <0.001
3      NA                NA      NA            NA
4      0.036             0.849   0.415         0.519
5      0.030             0.863   0.090         0.764
6      <0.001            0.995   3.797         0.051
7      0.101             0.750   16.396        <0.001
8      0.247             0.619   0.003         0.960
9      NA                NA      NA            NA
10     1.270             0.260   7.088         0.008
11     2.295             0.130   0.543         0.461
12     2.745             0.098   0.901         0.342
13     0.042             0.837   0.001         0.974
14     0.271             0.603   0.618         0.432
15     2.075             0.150   4.288         0.038
16     1.330             0.249   0.005         0.944
17     0.065             0.799   0.563         0.453
18     0.128             0.721   5.969         0.015
19     4.312             0.038   0.344         0.558
20     0.348             0.555   2.107         0.147
21     0.119             0.731   0.415         0.520
22     0.165             0.685   0.008         0.929
23     0.461             0.497   1.058         0.304
24     0.020             0.888   0.144         0.704
25     2.170             0.141   2.540         0.111
26     2.336             0.126   7.246         0.007
27     0.165             0.685   0.124         0.724
28     0.784             0.376   0.129         0.719
29     0.282             0.595   5.711         0.017
30     1.101             0.294   0.023         0.879
Total # detected: 1 non-uniform, 7 uniform

Note. Significant p values (<0.05) are presented in bold face. Once a group comparison on an item is flagged as non-uniformly biased, it is not flagged as uniformly biased even if the corresponding p value is statistically significant.

In summary, test language group comparisons on 14 of the 28 items were identified as biased, whereas gender group comparisons on eight items in the same booklet were flagged as biased. The gender group comparisons were less biased in this case. The different items and dissimilar numbers of biased comparisons indicate that bias is not a characteristic of an item itself; rather, it is directly tied to the choice of grouping variable. Once the grouping variable changes, the result of bias detection on the same item can be entirely different.
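To illustrate how the nested models in Equations (11) to (13) and the likelihood ratio tests in Equations (9) and (10) might be run for a single item, the following minimal R sketch uses clogit from the survival package as one possible implementation (the analyses in this thesis were conducted with the Epi package); the data frame d and the columns y, rts, G, and pair are hypothetical. Conditioning on the matched pair/cluster identifier within strata is equivalent to the difference formulation of Equations (11) to (13).

# A minimal sketch of conditional logistic regression for bias detection
# on one item. Assumed (hypothetical) columns in data frame d:
#   y    = 0/1 item response
#   rts  = rest total score (T in Equations 6-8)
#   G    = group membership (0 = reference, 1 = focal)
#   pair = matched pair/cluster identifier from the propensity score matching
library(survival)

m.null    <- clogit(y ~ rts             + strata(pair), data = d)  # cf. Eq. (11)
m.uniform <- clogit(y ~ rts + G         + strata(pair), data = d)  # cf. Eq. (12)
m.nonunif <- clogit(y ~ rts + G + rts:G + strata(pair), data = d)  # cf. Eq. (13)

# Likelihood ratio tests of Equations (9) and (10); loglik[2] is the
# maximized log-likelihood of each fitted model
lrt.uniform    <- -2 * (m.null$loglik[2]    - m.uniform$loglik[2])
lrt.nonuniform <- -2 * (m.uniform$loglik[2] - m.nonunif$loglik[2])

# Each comparison adds one parameter, hence df = 1
pchisq(lrt.uniform,    df = 1, lower.tail = FALSE)  # p value for uniform bias
pchisq(lrt.nonuniform, df = 1, lower.tail = FALSE)  # p value for non-uniform bias

In the analyses above, this procedure was repeated item by item, and an item's comparison was flagged as biased when the relevant LRT was significant.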
4.6 Impact Detection

Table 6 reports the results of impact detection between the test language groups on the 14 items whose group comparisons were found to be unbiased. The language groups were found to have impact only on Item 29. Specifically, the test result indicated that the odds of answering Item 29 correctly for the English group were about 1.558 times those for the French group. This suggests that the English group had a statistically higher ability on this item than the French group. With the attributional view of impact, the differential performance on this item in this scenario is due to the difference between the test language groups in the Science ability needed to solve the item, rather than to other unwanted or irrelevant characteristics of the groups. As for the other items, the results suggest that the two groups did not differ significantly in their ability to solve them.

Table 6. Results of Impact Detection between Test Language Groups

Item   coefficient   exp(coef)   se(coef)   z-value   p-value
2      0.272         1.313       0.187      1.456     0.145
4      -0.292        0.747       0.178      -1.643    0.100
7      0.033         1.033       0.185      0.177     0.860
10     -0.150        0.861       0.187      -0.801    0.423
11     -0.332        0.718       0.179      -1.856    0.063
13     -0.045        0.956       0.217      -0.207    0.836
15     -0.288        0.750       0.270      -1.066    0.287
18     0.058         1.059       0.205      0.281     0.778
19     0.101         1.106       0.183      0.550     0.582
21     0.158         1.171       0.180      0.880     0.379
24     -0.101        0.904       0.176      -0.572    0.567
28     -0.273        0.762       0.247      -1.104    0.270
29     -0.443        0.642       0.197      -2.247    0.025
30     -0.278        0.757       0.201      -1.383    0.167

Note. Significant p values (<0.05) are presented in bold face.

It is worth noting that the observed mean difference on Item 29 was not the maximum among the items (the maximum difference was on Item 8; see Table 1). Nevertheless, the groups were found to have impact on Item 29. That is to say, the observed mean difference does not necessarily reflect the true difference in ability between groups, because it can be confounded by other factors. With the above procedures, the true difference could eventually be detected.

Table 7 reports the results of impact detection between the gender groups on the 20 items whose group comparisons were found to be unbiased. Similar to test language, the gender groups showed impact on only one item (Item 25). The male group had about 1.805 times the odds of answering it correctly compared with the female group. This difference in performance can be attributed to the difference between the boys and girls in the Science ability needed to solve this question. For the other 19 items, the boys and girls did not show a statistically significant difference in the ability to solve them.

Table 7. Results of Impact Detection between Gender Groups

Item   coefficient   exp(coef)   se(coef)   z-value   p-value
1      -0.053        0.949       0.158      -0.332    0.740
4      0.091         1.095       0.152      0.596     0.551
5      -0.117        0.890       0.155      -0.752    0.452
6      -0.302        0.739       0.157      -1.919    0.055
8      -0.052        0.950       0.156      -0.330    0.741
11     -0.145        0.865       0.149      -0.971    0.332
12     0.111         1.117       0.166      0.668     0.504
13     -0.064        0.938       0.180      -0.355    0.723
14     0.125         1.133       0.189      0.661     0.508
16     <0.001        1.000       0.202      <0.001    1.000
17     0.081         1.084       0.153      0.526     0.599
20     -0.239        0.787       0.153      -1.567    0.117
21     -0.117        0.890       0.154      -0.760    0.447
22     -0.154        0.857       0.296      -0.522    0.602
23     0.098         1.103       0.150      0.651     0.515
24     -0.121        0.886       0.154      -0.786    0.432
25     -0.591        0.554       0.263      -2.247    0.025
27     -0.079        0.924       0.162      -0.491    0.624
28     0.016         1.016       0.231      0.069     0.945
30     -0.107        0.899       0.176      -0.608    0.543

Note. Significant p values (<0.05) are presented in bold face.
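The impact model in Equation (14) can be fitted in the same way as the bias models, with only the group term and no ability term. The sketch below is again a hypothetical illustration using survival::clogit rather than the Epi package used in the thesis; it also shows how the reported odds ratios follow from the coefficients in Tables 6 and 7.

# A minimal sketch of impact detection (Equation 14) on the matched data;
# d, y, G, and pair are the same hypothetical names as in the bias sketch
library(survival)

m.impact <- clogit(y ~ G + strata(pair), data = d)
summary(m.impact)  # coefficient, exp(coef), se(coef), z, p, as in Tables 6-7

# Reading a coefficient: for Item 29, coef = -0.443 for the focal (French)
# group, so exp(-0.443) is about 0.642; equivalently, the reference
# (English) group's odds are about 1/0.642 = 1.558 times as large
exp(-0.443)      # ~0.642
1 / exp(-0.443)  # ~1.558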
Table 8 and Table 9 further summarize the results of the bias and impact detection on test language and gender separately. In general, the gender group comparisons were less biased than the language group comparisons. Among the unbiased group comparisons, both group compositions, test language and gender, had impact on only one test item.

Table 8. Summary of Bias and Impact Results for the Test Language Comparison

Conclusion      Number of Items   %         Direction of Bias/Impact
Biased          14                50.00%    Eleven group comparisons were found uniformly biased; four of these were biased against the English group (#5, #16, #20, #22), and the other seven were biased against the French group (#8, #12, #14, #17, #23, #25, #26). In addition, three group comparisons (#1, #6, #27) were non-uniformly biased.
Having Impact   1                 3.57%     The language groups had impact on only one item (#29); the English group performed better than the French group.
No Impact       13                46.43%    No impact was found between the groups on thirteen items. The two groups performed equally well on these items.
Total           28                100.00%

Table 9. Summary of Bias and Impact Results for the Gender Comparison

Conclusion      Number of Items   %         Direction of Bias/Impact
Biased          8                 28.57%    Seven group comparisons were found uniformly biased; three of these were biased in favor of the female group (#2, #15, #18), and the other four were biased in favor of the male group (#7, #10, #26, #29). In addition, the group comparison on one item (#19) was non-uniformly biased.
Having Impact   1                 3.57%     The gender groups had impact on only one item (#25); the male group performed better than the female group.
No Impact       19                67.86%    No impact was found between the groups on nineteen items. The two groups performed equally well on these items.
Total           28                100.00%

Chapter 5: Contributions and Discussions

5.1 Contributions

Building on Wu et al.'s (2017) framework for bias and impact, this thesis continues this line of work by explicating a typology of grouping variables in various research designs. This taxonomy helps to clarify the conceptualization and communication difficulties in using the causative terminology of randomized experiments to express ideas of bias and impact in psychometrics. The taxonomy also explains the need for a non-experimentally based causal view for understanding bias and impact, such as the attributional stance taken by Wu et al.

Two studies of bias and impact were conducted to serve as examples of the difficulties with the traditional, experiment-based methodological view of causality. These two studies also worked as trials for testing the legitimacy of taking an attributional lens to conceptualize and communicate bias and impact. The first example was for B1 grouping: non-randomized but changeable/selectable group memberships (test languages). The second was for B2 grouping: group memberships that are neither manipulable nor changeable/selectable (sexes). These two types of grouping variables were investigated to detect bias and impact on the TIMSS Science questions following Wu et al.'s procedures. For both, the group memberships were referred to as the focal vs. reference group, as used in the DIF literature. A communication difficulty would have arisen if the experimental conceptualization and its semantics (e.g., treatment vs. control) had been applied to these two types of group composition.
This is because that language was developed for the experimental methodology, which is not applicable to the typical group comparisons of bias and impact studies in psychometrics, even though such studies are causative inquiries in nature.

Following the same thread of thought, the terms defined with an experimental design in mind can be adapted to be more flexible for discussing bias and impact. In the current study, two terms are relevant: one is the balancing score, and the other is the propensity score. As the primary advocates, Rosenbaum and Rubin (1983) provided far-reaching definitions of the balancing score and the propensity score. They defined the balancing score b(X) as "a function of the observed covariates X such that the conditional distribution of X given b(X) is the same for treated (Z = 1) and control (Z = 0) units" (p. 42, mentioned in Chapter 2). Generalizing this definition to observational studies, it can be revised as a function of the observed covariates X such that the conditional distribution of X given b(X) is the same for the focal (Z = 1) and reference (Z = 0) groups. Likewise, their definition of the propensity score, "the propensity towards exposure to treatment 1 given the observed covariates X" (p. 43, mentioned in Chapter 3), can be revised as the propensity of being in the focal group (Z = 1) given the observed covariates X.

Avoiding the wording of experimental design allows more flexible language for communicating causal inquiries that are not investigated by a randomized experiment. For the gender grouping variable (B2 groups), the propensity score of a female student is the conditional probability of "being in the female group" given the observed covariates, rather than the conditional probability of "being assigned to the female group" given those covariates, or the propensity towards "exposure to the female treatment" given the observed covariates. The latter two descriptions read as conspicuously awkward and irrational. For the grouping variable of test language (B1 groups), the propensity score of an examinee who took the French version of the Science test is defined as the conditional probability of "being in the French group" given the observed covariates, rather than the conditional probability of "being assigned to the French group" given those covariates. Although it is possible in theory to randomly assign someone to a French group, it is a rare if not unseen case in practice (unless it is an experiment). Moreover, the revision of the language defining propensity scores also has implications for the target population of inferential interest. Those who could be randomly assigned to a French or English group would have to be bilingual (a much smaller population), rather than mainly monolingual. Therefore, the revised language for expressing balancing scores and propensity scores fits the reality of observational studies better, even when the group membership is theoretically changeable.

These two example studies illustrate that the attributional view works well for conceptualizing and communicating bias and impact. That is, it is reasonable to make an attributional claim that the group membership is a reason for differential group responses to an item. This claim differs from the traditional intended claim, borne of the experimental methodology, that the independent variable under experimentation is the sole and unique cause of the outcome.
For instance, according to the results of the bias detection, the group membership of test language was "a reason" for the differential responses to 14 items, which means that taking the Science test in English or French influences examinees' performance on these items. This is an undesired consequence in measurement, especially in a large-scale international assessment whose purpose is to make cross-country comparisons. Similarly, based on the results of the bias detection on gender, gender was found to be "a reason" for the differential responses to eight items. These findings suggest that gender has an influence on one's responses to these test items. The impact investigation was undertaken on those items whose group comparisons were found to be unbiased. With the attributional stance and analytical procedures, the true difference between groups was detected and can be attributed to the group composition.

The attributional stance of causal claim can be regarded as an adaptation and extension of causality to observational studies. Meanwhile, the typology of grouping variables and the revision of the relevant terminologies can be viewed as a complement to the attributional stance for understanding a causative relationship. These improvements address the difficulties of using the traditional experimental framework for observational psychometrics and facilitate the application and interpretation of causality in observational psychometric studies. In summary, this thesis takes on the recent work by Wu et al. (2017) and further tackles the topic of causality by highlighting issues related to the "grouping" variable in observational psychometric studies where no actual "grouping" is conducted, as in the case of studying DIF, bias and impact. It is time to alter the mindset that a causative inquiry should and can only be addressed and investigated by an experimental method of manipulation and randomization.

5.2 Limitations and Future Work

Sensitivity analysis was not conducted in the present research to detect the possibility of unobserved confounding covariates in the propensity score estimation. This was because no statistical algorithm has yet been developed to deal with the issue of varying matching ratios when applying optimal full matching on the propensity scores (Liu et al., 2016). Currently, only a method of sensitivity analysis for one-to-one pair matching is available. Therefore, it was impossible to know whether there were other influential unobserved covariates that were not included in the propensity score estimation. Other evidence is needed for assessing the robustness of the propensity score matching.

Only dichotomous items were investigated in the current study. In the real world, a test can be composed of various types of items. Available statistical methods can be transplanted to study bias and impact for responses recorded or scored differently (e.g., ordinal or nominal responses).

Wu et al.'s (2017) framework addresses the item-level issues of bias and impact. Theoretically, bias and impact can also be examined at the domain/subscale level. It is important to detect bias and impact at the domain/subscale level because a test taker's ability is usually evaluated by composite scores at the domain or subscale level, not by a single item score. Future work can extend the new developments in this thesis and Wu et al. (2017) to the domain/subscale level.
5.3 Closing Remarks

Inquiries into measurement bias and group disparity have a long history in psychometrics and will continue to be one of the major themes in research. The typology explicated in this thesis distinguishes two types of group composition within observational studies, as well as distinguishing them from groups formed by assignment in experiments. It further highlights the subtleties of, and implications for, conceptualizing causative inquiry for different types of groups. Building on these clarifications, the traditional rhetoric based on experimental methodology and statistical matching is revised to accommodate a more flexible tone of language for communicating a causal inquiry.

This thesis illustrates that the attributional rhetoric is a reasonable and pragmatic adaptation for conceptualizing and communicating causal inquiries based on observational designs, such as the investigation of bias and impact. It is expected that these new developments will help to clear the confusions and barriers in researching and communicating DIF, bias, and impact. It is also anticipated that, as this line of research continues, new developments and discussions of bias and impact will burgeon to fill the remaining research gaps.

Bibliography

Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29(1), 67-91.

American Psychological Association, American Educational Research Association, Joint Committee on Standards for Educational and Psychological Testing (U.S.), & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Beller, M. (2014). Test bias detection. In Wiley StatsRef: Statistics Reference Online. doi:10.1002/9781118445112.stat06399

Bhattacherjee, A. (2012). Social science research: Principles, methods, and practices. Retrieved July 18, 2017, from http://scholarcommons.usf.edu/oa_textbooks/3

Camilli, G. (2013). Ongoing issues in test fairness. Educational Research and Evaluation, 19(2-3), 104-120. doi:10.1080/13803611.2013.767602

Carstensen, B., Plummer, M., Laara, E., & Hills, M. (2017). A package for statistical analysis in epidemiology. Retrieved March 15, 2017, from https://cran.r-project.org/web/packages/Epi/Epi.pdf

Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), 31-44.

Cochran, W. G., & Rubin, D. B. (1973). Controlling bias in observational studies: A review. Sankhyā: The Indian Journal of Statistics, Series A, 417-446.

Crotty, M. (1998). The foundations of social research: Meaning and perspective in the research process. Sage.

Dawid, A. P. (1979). Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B (Methodological), 1-31.

Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning. Hillsdale: Lawrence Erlbaum Associates.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: L. Erlbaum Associates.

Fisher, R. A. (1934). Statistical methods for research workers (5th ed.). Edinburgh: Oliver and Boyd.

Furr, R. M., & Bacharach, V. R. (2014). Psychometrics: An introduction (2nd ed.). Sage.

Guo, S., & Fraser, M. W. (2010). Propensity score analysis: Statistical methods and applications. Thousand Oaks, CA: Sage Publications.
Ho, D., Imai, K., King, G., Stuart, E., & Whitworth, A. (2017). Nonparametric preprocessing for parametric causal inference. Retrieved March 15, 2017, from https://cran.r-project.org/web/packages/MatchIt/MatchIt.pdf

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In Test validity (pp. 129-145).

Holland, P. W., Wainer, H., & Educational Testing Service. (1993). Differential item functioning. Hillsdale: Lawrence Erlbaum Associates.

Hosmer, D. W., Jr., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (Vol. 398). John Wiley & Sons.

Kok, F. (1988). Item bias and test multidimensionality. In Latent trait and latent class models (pp. 263-275). Springer US.

Liu, Y., Zumbo, B. D., Gustafson, P., Huang, Y., Kroc, E., & Wu, A. D. (2016). Investigating causal DIF via propensity score methods. Practical Assessment, Research & Evaluation, 21(13).

Marotta, L., Tramonte, L., & Willms, J. D. (2015). Equivalence of testing instruments in Canada: Studying item bias in a cross-cultural assessment for preschoolers. Canadian Journal of Education, 38(3), 1.

Meldrum, M. L. (2000). A brief history of the randomized controlled trial: From oranges and lemons to the gold standard. Hematology/Oncology Clinics of North America, 14(4), 745-760.

Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics, 7(2), 105-118.

Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13(2), 127-143.

Mullis, I. V., Martin, M. O., Ruddock, G. J., O'Sullivan, C. Y., & Preuschoff, C. (2009). TIMSS 2011 assessment frameworks. Amsterdam: International Association for the Evaluation of Educational Achievement.

Osterlind, S. J., & Everson, H. T. (2009). Differential item functioning (Vol. 161). Sage Publications.

Robson, C. (2002). Real world research: A resource for social scientists and practitioner-researchers (2nd ed.). Oxford, UK; Malden, MA: Blackwell Publishers.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.

Rubin, D. B. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127(8, Part 2), 757-763.

Rubin, D. B. (2006). Matched sampling for causal effects. Cambridge, UK; New York: Cambridge University Press.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Wadsworth Cengage Learning.

Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159-194.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361-370.

Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group-mean differences: The concept of item bias. Psychological Bulletin, 99(1), 118.

Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4-70.
Van de Schoot, R., Lugtig, P., & Hox, J. (2012). A checklist for testing measurement invariance. European Journal of Developmental Psychology, 9(4), 486-492.

Wu, A. D., Liu, Y., Stone, J. E., Zou, D., & Zumbo, B. D. (2017). Is difference in measurement outcome between groups differential responding, bias or disparity? A methodology for detecting bias and impact from an attributional stance. Frontiers in Education, 2:39. doi:10.3389/feduc.2017.00039

Zieky, M. (2003). A DIF primer. Princeton, NJ: Educational Testing Service.

Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF). Ottawa: National Defense Headquarters.

Zumbo, B. D. (2007). Three generations of DIF analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4(2), 223-233.

Zumbo, B. D., Liu, Y., Wu, A. D., Shear, B. R., Olvera Astivia, O. L., & Ark, T. K. (2015). A methodology for Zumbo's third generation DIF analyses and the ecology of item responding. Language Assessment Quarterly, 12(1), 136-151.

Appendices

Appendix A. Item Information for Booklet One of Eighth Grade Science Achievement Test (TIMSS, 2011)

Item ID    Content Domain   Cognitive Domain   Item Type   Options   Key   Maximum Points
S032611    Biology          Knowing            MC          4         A     1
S032614    Biology          Applying           CR          -         -     1
S032451    Biology          Applying           CR          -         -     2
S032156    Chemistry        Reasoning          MC          4         C     1
S032056    Chemistry        Applying           CR          -         -     1
S032087    Biology          Knowing            MC          4         C     1
S032279    Physics          Applying           MC          4         D     1
S032238    Physics          Applying           MC          4         A     1
S032369    Physics          Applying           CR          -         -     2
S032160    Earth Science    Knowing            MC          4         C     1
S032654    Earth Science    Reasoning          MC          4         A     1
S032126    Earth Science    Knowing            CR          -         -     1
S032510    Earth Science    Knowing            MC          4         D     1
S032158    Physics          Knowing            MC          4         B     1
S052093    Biology          Applying           MC          4         C     1
S052088    Biology          Applying           MC          4         A     1
S052030    Biology          Reasoning          MC          4         B     1
S052080    Biology          Knowing            MC          4         D     1
S052091    Biology          Reasoning          CR          -         -     1
S052152    Chemistry        Applying           MC          4         C     1
S052136    Chemistry        Reasoning          CR          -         -     1
S052046    Chemistry        Knowing            MC          4         D     1
S052254    Chemistry        Reasoning          MC          4         A     1
S052207    Physics          Knowing            CR          -         -     1
S052165A   Physics          Knowing            CR          -         -     1
S052165B   Physics          Knowing            CR          -         -     1
S052165C   Physics          Reasoning          CR          -         -     1
S052297    Earth Science    Knowing            MC          4         B     1
S052032    Earth Science    Reasoning          CR          -         -     1
S052106    Earth Science    Applying           CR          -         -     1

Note: MC denotes multiple choice, and CR denotes constructed response.

Appendix B. Examples of Released Items (Eighth Grade Science, TIMSS, 2011)

B.1 Item 25 (Water Wheel: Energy of Tank Water)

The diagram shows water flowing from a tank and rotating a wheel.
A. What kind of energy does the water have when it is in the tank?
B. What kind of energy does the water have just before it hits the wheel?

B.2 Item 29 (Evidence Continents Were Joined)

Two continents are separated by water. Geologists are looking for evidence that the two continents were once joined. What fossil evidence would support this idea?
Scoring guide: Explains that fossils from identical (land) organisms (that cannot fly or swim) can be found on both continents.

Appendix C. Selected Covariates for Propensity Score Estimation (TIMSS, 2011)

Variable Name   Question
BSBG04          About how many books are there in your home?
BSBG07          How far in your education do you expect to go?
BSBG12B         How much do you agree that you feel safe when you are at school?
BSBS17A         How much do you agree that you enjoy learning science?
BSBS17B         How much do you agree that you wish you did not have to study science?
BSBS17C         How much do you agree that you read about science in your spare time?
BSBS17D         How much do you agree that science is boring?
BSBS17F         How much do you agree that you like science?
BSBS17G         How much do you agree that it is important to do well in science?
BSBS19A         How much do you agree that you usually do well in science?
BSBS19B         How much do you agree that science is more difficult for you than for many of your classmates?
BSBS19C         How much do you agree that science is not one of your strengths?
BSBS19D         How much do you agree that you learn things quickly in science?
BSBS19E         How much do you agree that science makes you confused and nervous?
BSBS19F         How much do you agree that you are good at working out difficult science problems?
BSBS19I         How much do you agree that science is harder for you than any other subject?
BSBS19L         How much do you agree that you need to do well in science to get into the <university> of your choice?
BSBS19M         How much do you agree that you need to do well in science to get the job you want?
BSBS19N         How much do you agree that you would like a job that involves using science?

Appendix D. Results of Optimal Pair Matching (1to1) on Test Language

D.1 Jitter Plot
[Figure: jitter plot of propensity score distributions]

D.2 Histogram
[Figure: histograms of propensity scores before and after matching]

D.3 Percentage of Bias Reduction

Covariate   Means Treated   Means Control (Before)   Mean Difference (Before)   Means Control (After)   Mean Difference (After)   Reduction (%)
distance    0.376           0.244                    0.139                      0.347                   0.151                     77.950
BSBG04      2.873           3.393                    1.125                      2.918                   1.152                     91.267
BSBG07      5.055           5.237                    1.343                      5.127                   1.469                     60.064
BSBG12B     1.914           1.628                    0.761                      1.836                   0.839                     72.936
BSBS17A     2.145           1.900                    0.896                      2.095                   0.977                     79.600
BSBS17B     2.659           2.879                    1.006                      2.759                   1.060                     54.527
BSBS17C     3.150           3.153                    0.967                      3.077                   1.015                     -2304.278
BSBS17D     2.877           2.838                    1.017                      2.923                   1.042                     -15.972
BSBS17F     2.073           1.970                    0.963                      2.014                   0.977                     42.617
BSBS17G     1.700           1.552                    0.730                      1.682                   0.832                     87.748
BSBS19A     1.818           1.790                    0.761                      1.868                   0.809                     -77.644
BSBS19B     2.950           3.066                    0.894                      2.955                   0.945                     96.076
BSBS19C     2.714           2.735                    1.025                      2.709                   1.023                     78.599
BSBS19D     2.055           2.066                    0.866                      2.091                   0.892                     -222.063
BSBS19E     3.182           3.093                    0.877                      3.177                   0.850                     94.909
BSBS19F     2.173           2.244                    0.883                      2.232                   0.884                     16.826
BSBS19I     3.236           3.158                    0.951                      3.168                   0.938                     12.588
BSBS19L     1.895           1.817                    0.921                      1.818                   0.981                     1.849
BSBS19M     2.023           1.947                    1.019                      1.959                   1.035                     16.387
BSBS19N     2.391           2.331                    1.105                      2.377                   1.142                     77.253
Appendix E. Results of Optimal Full Matching (1toM) on Test Language

E.1 Jitter Plot
[Figure: jitter plot of propensity score distributions]

E.2 Histogram
[Figure: histograms of propensity scores before and after matching]

E.3 Percentage of Bias Reduction

Covariate   Means Treated   Means Control (Before)   Mean Difference (Before)   Means Control (After)   Mean Difference (After)   Reduction (%)
distance    0.376           0.244                    0.132                      0.338                   0.038                     70.973
BSBG04      2.873           3.393                    -0.521                     2.989                   -0.116                    77.659
BSBG07      5.055           5.237                    -0.182                     5.148                   -0.093                    48.790
BSBG12B     1.914           1.628                    0.286                      1.837                   0.077                     73.069
BSBS17A     2.145           1.900                    0.245                      2.072                   0.074                     69.833
BSBS17B     2.659           2.879                    -0.220                     2.745                   -0.086                    61.038
BSBS17C     3.150           3.153                    -0.003                     3.149                   0.001                     62.433
BSBS17D     2.877           2.838                    0.039                      2.884                   -0.006                    83.764
BSBS17F     2.073           1.970                    0.103                      2.058                   0.014                     85.949
BSBS17G     1.700           1.552                    0.148                      1.673                   0.027                     81.775
BSBS19A     1.818           1.790                    0.028                      1.842                   -0.023                    17.100
BSBS19B     2.950           3.066                    -0.116                     2.958                   -0.008                    92.741
BSBS19C     2.714           2.735                    -0.021                     2.726                   -0.012                    42.573
BSBS19D     2.055           2.066                    -0.011                     2.075                   -0.021                    -84.515
BSBS19E     3.182           3.093                    0.089                      3.150                   0.032                     64.281
BSBS19F     2.173           2.244                    -0.071                     2.218                   -0.045                    36.233
BSBS19I     3.236           3.158                    0.078                      3.153                   0.084                     -7.517
BSBS19L     1.895           1.817                    0.079                      1.860                   0.036                     54.677
BSBS19M     2.023           1.947                    0.076                      2.014                   0.008                     88.951
BSBS19N     2.391           2.331                    0.060                      2.403                   -0.012                    80.665

Appendix F. Results of Optimal Pair Matching (1to1) on Gender

F.1 Jitter Plot
[Figure: jitter plot of propensity score distributions]

F.2 Histogram
[Figure: histograms of propensity scores before and after matching]

F.3 Percentage of Bias Reduction

Covariate   Means Treated   Means Control (Before)   Mean Difference (Before)   Means Control (After)   Mean Difference (After)   Reduction (%)
distance    0.546           0.443                    0.152                      0.451                   0.145                     7.466
BSBG04      3.303           3.072                    1.231                      3.095                   1.226                     10.052
BSBG07      5.334           5.028                    1.461                      5.079                   1.423                     16.589
BSBG12B     1.601           1.814                    0.858                      1.795                   0.850                     8.850
BSBS17A     2.031           1.939                    0.918                      1.945                   0.920                     6.235
BSBS17B     2.757           2.879                    1.023                      2.878                   1.022                     0.414
BSBS17C     3.282           3.007                    0.999                      3.029                   0.992                     7.882
BSBS17D     2.785           2.907                    1.010                      2.895                   1.014                     9.684
BSBS17F     2.098           1.937                    0.924                      1.945                   0.928                     5.003
BSBS17G     1.566           1.625                    0.756                      1.621                   0.756                     7.082
BSBS19A     1.842           1.781                    0.726                      1.792                   0.727                     18.633
BSBS19B     3.005           3.051                    0.855                      3.048                   0.860                     7.632
BSBS19C     2.630           2.809                    0.957                      2.790                   0.958                     10.561
BSBS19D     2.153           1.986                    0.817                      2.005                   0.816                     11.251
BSBS19E     3.033           3.170                    0.830                      3.160                   0.833                     7.502
BSBS19F     2.360           2.093                    0.846                      2.115                   0.842                     7.980
BSBS19I     3.076           3.275                    0.885                      3.258                   0.889                     8.708
BSBS19L     1.809           1.904                    0.929                      1.895                   0.925                     9.900
BSBS19M     1.931           2.028                    1.004                      2.019                   1.002                     9.136
BSBS19N     2.348           2.387                    1.074                      2.372                   1.074                     38.006
Appendix G. Results of Optimal Full Matching (1toM) on Gender

G.1 Jitter Plot
[Figure not reproduced: jitter plot of estimated propensity scores for matched and unmatched units.]

G.2 Histogram
[Figure not reproduced: histograms of propensity scores before and after matching.]

G.3 Percentage of Bias Reduction

               -------- Before Matching --------          ---- After Matching ----
Covariate   Means Treated  Means Control  Mean Difference  Means Control  Mean Difference  Reduction (%)
distance        0.546          0.443          0.103            0.451          0.095            7.288
BSBG04          3.303          3.072          0.231            3.091          0.212            8.329
BSBG07          5.334          5.028          0.306            5.071          0.263           14.146
BSBG12B         1.601          1.814         -0.212            1.793         -0.191            9.751
BSBS17A         2.031          1.939          0.092            1.947          0.084            8.145
BSBS17B         2.757          2.879         -0.122            2.874         -0.118            3.669
BSBS17C         3.282          3.007          0.275            3.026          0.255            7.013
BSBS17D         2.785          2.907         -0.122            2.892         -0.107           12.302
BSBS17F         2.098          1.937          0.161            1.950          0.148            8.170
BSBS17G         1.566          1.625         -0.059            1.620         -0.054            8.159
BSBS19A         1.842          1.781          0.062            1.789          0.054           12.691
BSBS19B         3.005          3.051         -0.047            3.046         -0.041           11.737
BSBS19C         2.630          2.809         -0.179            2.792         -0.162            9.493
BSBS19D         2.153          1.986          0.167            2.001          0.152            8.961
BSBS19E         3.033          3.170         -0.137            3.158         -0.125            8.898
BSBS19F         2.360          2.093          0.267            2.114          0.246            7.742
BSBS19I         3.076          3.275         -0.199            3.258         -0.181            8.708
BSBS19L         1.809          1.904         -0.095            1.895         -0.086            9.733
BSBS19M         1.931          2.028         -0.097            2.019         -0.089            8.809
BSBS19N         2.348          2.387         -0.038            2.375         -0.026           32.220

Appendix H. Results of Optimal Full Matching (1MM1) on Gender

H.1 Jitter Plot
[Figure not reproduced: jitter plot of estimated propensity scores for matched and unmatched units.]

H.2 Histogram
[Figure not reproduced: histograms of propensity scores before and after matching.]

H.3 Percentage of Bias Reduction

               -------- Before Matching --------          ---- After Matching ----
Covariate   Means Treated  Means Control  Mean Difference  Means Control  Mean Difference  Reduction (%)
distance        0.546          0.443          0.103            0.542          0.004           95.903
BSBG04          3.303          3.072          0.231            3.333         -0.030           87.025
BSBG07          5.334          5.028          0.306            5.320          0.015           95.258
BSBG12B         1.601          1.814         -0.212            1.595          0.006           96.980
BSBS17A         2.031          1.939          0.092            2.009          0.022           75.474
BSBS17B         2.757          2.879         -0.122            2.830         -0.073           40.086
BSBS17C         3.282          3.007          0.275            3.223          0.059           78.622
BSBS17D         2.785          2.907         -0.122            2.933         -0.147          -21.239
BSBS17F         2.098          1.937          0.161            2.032          0.066           58.835
BSBS17G         1.566          1.625         -0.059            1.566          0.000           99.529
BSBS19A         1.842          1.781          0.062            1.828          0.015           75.719
BSBS19B         3.005          3.051         -0.047            3.004          0.001           98.803
BSBS19C         2.630          2.809         -0.179            2.661         -0.030           82.980
BSBS19D         2.153          1.986          0.167            2.171         -0.018           88.906
BSBS19E         3.033          3.170         -0.137            3.066         -0.032           76.352
BSBS19F         2.360          2.093          0.267            2.368         -0.008           97.141
BSBS19I         3.076          3.275         -0.199            3.095         -0.018           90.811
BSBS19L         1.809          1.904         -0.095            1.797          0.012           86.902
BSBS19M         1.931          2.028         -0.097            1.946         -0.015           84.570
BSBS19N         2.348          2.387         -0.038            2.327          0.022           43.585

Appendix I. PBR Comparison of Three Matching Methods

                        Percentage of Bias Reduction (%)
Covariate   Optimal Pair Matching   Optimal Full Matching   Optimal Full Matching
            (one-to-one)            (one-to-multiple)       (one-to-multiple and multiple-to-one)
distance         77.950                  70.973                   92.552
BSBG04           91.267                  77.659                   95.357
BSBG07           60.064                  48.790                   60.730
BSBG12B          72.936                  73.069                   86.787
BSBS17A          79.600                  69.833                   89.305
BSBS17B          54.527                  61.038                   60.556
BSBS17C       -2304.278                  62.433                -2166.533
BSBS17D         -15.972                  83.764                    2.004
BSBS17F          42.617                  85.949                   25.402
BSBS17G          87.748                  81.775                   85.961
BSBS19A         -77.644                  17.100                   32.711
BSBS19B          96.076                  92.741                   50.557
BSBS19C          78.599                  42.573                 -194.269
BSBS19D        -222.063                 -84.515                 -517.287
BSBS19E          94.909                  64.281                   60.548
BSBS19F          16.826                  36.233                  -51.206
BSBS19I          12.588                  -7.517                  -37.625
BSBS19L           1.849                  54.677                   71.902
BSBS19M          16.387                  88.951                   47.145
BSBS19N          77.253                  80.665                  -54.426
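The comparison above can also be assembled programmatically from the three PBR files exported by the code in Appendix J. The following sketch assumes those files sit in the working directory with the structure produced by write.csv() (a row-name column followed by seven value columns, the seventh of which is the reduction):

# to read the exported PBR files back in; file names follow the code in Appendix J
pbr.pair <- read.csv("pbr.pair.m.csv", row.names = 1)
pbr.1toM <- read.csv("pbr.full.1toM.csv", row.names = 1)
pbr.1MM1 <- read.csv("pbr.full.1MM1.csv", row.names = 1)
# column 7 holds the percentage of bias reduction in each file
pbr.all <- data.frame(pair.1to1 = pbr.pair[, 7],
                      full.1toM = pbr.1toM[, 7],
                      full.1MM1 = pbr.1MM1[, 7],
                      row.names = rownames(pbr.pair))
# a rough overall summary of balance improvement per method
apply(pbr.all, 2, median, na.rm = TRUE)

A column-wise summary of this kind shows at a glance which method removed the most covariate imbalance overall; the combined one-to-multiple and multiple-to-one solution (full.1MM1 in Appendix J) is the one carried forward to the conditional logistic regression analyses.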
Appendix J. R Code for Analyses

# to import the data file
data.lang <- read.csv("final.data.csv", header = T)
# to check variable names
names(data.lang)

# to import covariate names
covariate.name <- read.csv("covariate.name.csv", header = F)
# to name the single column of the dataset
names(covariate.name) <- "covariate.name"

# propensity score matching
# to install the "optmatch" and "MatchIt" packages (one-time installation)
install.packages("optmatch")
install.packages("MatchIt")
# to load the "optmatch" and "MatchIt" packages
library("optmatch", lib.loc = "~/R/win-library/3.4")
library("MatchIt", lib.loc = "~/R/win-library/3.4")

# optimal pair matching (one-to-one; redundant units in the reference group are discarded)
pair.m <- matchit(ITLANG ~ BSBG04 + BSBG07 + BSBG12B + BSBS17A + BSBS17B + BSBS17C +
                  BSBS17D + BSBS17F + BSBS17G + BSBS19A + BSBS19B + BSBS19C + BSBS19D +
                  BSBS19E + BSBS19F + BSBS19I + BSBS19L + BSBS19M + BSBS19N,
                  data = data.lang, method = "optimal", distance = "logit", ratio = 1)
# to inspect the result of optimal pair matching
sum.pair.m <- summary(pair.m)

# to create a data frame to store the percentage of bias reduction (PBR);
# the PBR information is used to decide which matching method to apply
pbr.pair.m <- as.data.frame(matrix(NA, nrow = 20, ncol = 7))
# to add row names (i.e., covariate names)
rownames(pbr.pair.m) <- rownames(sum.pair.m$sum.all)

# to copy the PBR information
# balance statistics (means and mean difference) before matching
pbr.pair.m[, 1:3] <- sum.pair.m$sum.all[, 1:3]
# balance statistics after matching
pbr.pair.m[, 4:6] <- sum.pair.m$sum.matched[, 1:3]
# percentage of bias reduction in the mean difference
pbr.pair.m[, 7] <- sum.pair.m$reduction[, 1]
# to export the results
write.csv(pbr.pair.m, "pbr.pair.m.csv", row.names = T)

# to plot the matching results
# jitter plot
plot(pair.m, type = "jitter")
# histogram
plot(pair.m, type = "hist")

# optimal full matching (one-to-multiple; all units in the reference group are matched)
full.1toM <- matchit(ITLANG ~ BSBG04 + BSBG07 + BSBG12B + BSBS17A + BSBS17B + BSBS17C +
                     BSBS17D + BSBS17F + BSBS17G + BSBS19A + BSBS19B + BSBS19C + BSBS19D +
                     BSBS19E + BSBS19F + BSBS19I + BSBS19L + BSBS19M + BSBS19N,
                     data = data.lang, method = "full", distance = "logit",
                     min.controls = 1, max.controls = 5)
# to inspect the result of 1toM optimal full matching
sum.full.1toM <- summary(full.1toM)
# to create a data frame to store the PBR
pbr.full.1toM <- as.data.frame(matrix(NA, nrow = 20, ncol = 7))
# to add row names
rownames(pbr.full.1toM) <- rownames(sum.full.1toM$sum.all)
# to copy the PBR information
# balance statistics before matching
pbr.full.1toM[, 1:3] <- sum.full.1toM$sum.all[, 1:3]
# balance statistics after matching
pbr.full.1toM[, 4:6] <- sum.full.1toM$sum.matched[, 1:3]
# percentage of bias reduction in the mean difference
pbr.full.1toM[, 7] <- sum.full.1toM$reduction[, 1]
# to export the results
write.csv(pbr.full.1toM, "pbr.full.1toM.csv", row.names = T)

# jitter plot
plot(full.1toM, type = "jitter")
# histogram
plot(full.1toM, type = "hist")

# optimal full matching (a combination of one-to-multiple and multiple-to-one;
# units from both the reference and the focal group may be matched multiple times)
full.1MM1 <- matchit(ITLANG ~ BSBG04 + BSBG07 + BSBG12B + BSBS17A + BSBS17B + BSBS17C +
                     BSBS17D + BSBS17F + BSBS17G + BSBS19A + BSBS19B + BSBS19C + BSBS19D +
                     BSBS19E + BSBS19F + BSBS19I + BSBS19L + BSBS19M + BSBS19N,
                     data = data.lang, method = "full", distance = "logit",
                     min.controls = 1/5, max.controls = 5)
# to inspect the result of the 1toM & Mto1 optimal full matching
sum.full.1MM1 <- summary(full.1MM1)
# to create a data frame to store the percentage of bias reduction (PBR)
pbr.full.1MM1 <- as.data.frame(matrix(NA, nrow = 20, ncol = 7))
# to add row names
rownames(pbr.full.1MM1) <- rownames(sum.full.1MM1$sum.all)
# to copy the PBR information
# balance statistics before matching
pbr.full.1MM1[, 1:3] <- sum.full.1MM1$sum.all[, 1:3]
# balance statistics after matching
pbr.full.1MM1[, 4:6] <- sum.full.1MM1$sum.matched[, 1:3]
# percentage of bias reduction in the mean difference
pbr.full.1MM1[, 7] <- sum.full.1MM1$reduction[, 1]
# to export the results
write.csv(pbr.full.1MM1, "pbr.full.1MM1.csv", row.names = T)

# jitter plot
plot(full.1MM1, type = "jitter")
# histogram
plot(full.1MM1, type = "hist")

# to use the results of optimal full matching (the combination of one-to-multiple and
# multiple-to-one) as the matched data for the further analyses
# to save the matched data for the conditional logistic regression analysis
match.data <- match.data(full.1MM1)
# to rename variables for convenience in writing model formulas;
# for example, "z.tot01" is replaced by "z.tot1" (the leading "0" is removed)
names(match.data)[87:116] <- paste("z.tot", 1:30, sep = "")

# conditional logistic regression ("clr" for short)
# to create three lists for storing the conditional logistic regression results
clr.null <- list()
clr.uniform <- list()
clr.non.uni <- list()

# to install the "Epi" package for the conditional logistic regression analyses
install.packages("Epi")
# to load the "Epi" package
library("Epi", lib.loc = "~/R/win-library/3.4")

# conditional logistic regression, null model
# the models for items 3 and 9 are invalid (denoted NA) because these two items are
# polytomous; tryCatch() ignores the errors and proceeds with the remaining items
clr.null <- lapply(1:30, function(i) {
  return(tryCatch(clogistic(paste("y", i, " ~ z.tot", i, sep = ""),
                            strata = subclass, data = match.data),
                  error = function(e) NA))
})

# uniform model: the grouping variable is added as a predictor
clr.uniform <- lapply(1:30, function(i) {
  return(tryCatch(clogistic(paste("y", i, " ~ ITLANG + z.tot", i, sep = ""),
                            strata = subclass, data = match.data),
                  error = function(e) NA))
})

# non-uniform model: note that "y1 ~ ITLANG * z.tot1" is equivalent to
# "y1 ~ ITLANG + z.tot1 + ITLANG:z.tot1"; the other items are specified in the same manner
clr.non.uni <- lapply(1:30, function(i) {
  return(tryCatch(clogistic(paste("y", i, " ~ ITLANG * z.tot", i, sep = ""),
                            strata = subclass, data = match.data),
                  error = function(e) NA))
})

# to create a data frame to store the results of the conditional "DIF" analyses
clr.result <- as.data.frame(matrix(NA, nrow = 30, ncol = 4))
colnames(clr.result) <- c("Chi.uniform", "p-value.1", "Chi.non-uniform", "p-value.2")
# to store the likelihood ratio test (LRT) statistics and p-values
clr.result[, 1] <- as.data.frame(unlist(lapply(1:30, function(i) {
  return(tryCatch((clr.uniform[[i]]$loglik[2] - clr.null[[i]]$loglik[2]) * 2,
                  error = function(e) NA))
})))
clr.result[, 3] <- as.data.frame(unlist(lapply(1:30, function(i) {
  return(tryCatch((clr.non.uni[[i]]$loglik[2] - clr.uniform[[i]]$loglik[2]) * 2,
                  error = function(e) NA))
})))
clr.result[, 2] <- as.data.frame(unlist(lapply(1:30, function(i) {
  return(tryCatch(1 - pchisq(clr.result[i, 1], 1), error = function(e) NA))
})))
clr.result[, 4] <- as.data.frame(unlist(lapply(1:30, function(i) {
  return(tryCatch(1 - pchisq(clr.result[i, 3], 1), error = function(e) NA))
})))

# to export the results
write.csv(clr.result, "clr.result.csv", row.names = F)

# to select the items (on which group responses were not biased) for the impact investigation
item.for.impact <- c(2, 4, 7, 10, 11, 13, 15, 18, 19, 21, 24, 28, 29, 30)

# impact detection
# to create a vector of model formulas for impact detection
formula.for.impact <- paste("y", item.for.impact, " ~ ITLANG", sep = "")
formula.for.impact
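# As a worked illustration of the likelihood ratio tests assembled into clr.result
# above (a sketch using the objects created by the preceding code, and assuming item 1
# is one of the dichotomous items for which clogistic() succeeded): each test compares
# two nested models and refers twice the difference in log-likelihoods to a chi-square
# distribution with one degree of freedom.

# uniform test for item 1: does adding the grouping variable improve on the null model?
chi.uniform <- 2 * (clr.uniform[[1]]$loglik[2] - clr.null[[1]]$loglik[2])
p.uniform   <- 1 - pchisq(chi.uniform, df = 1)   # one added parameter (ITLANG)

# non-uniform test for item 1: does the group-by-rest-score interaction improve fit further?
chi.non.uni <- 2 * (clr.non.uni[[1]]$loglik[2] - clr.uniform[[1]]$loglik[2])
p.non.uni   <- 1 - pchisq(chi.non.uni, df = 1)   # one added parameter (the interaction)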
# to conduct the conditional logistic regression for impact detection; this time only the
# grouping variable (i.e., test language) is a predictor in the model, and the rest total
# score (the index of ability) is removed
library("Epi", lib.loc = "~/R/win-library/3.4")
impact.test.result <- lapply(formula.for.impact, clogistic,
                             strata = match.data$subclass, data = match.data)
# to check a result summary, taking the first element of the result list as an example
print(impact.test.result[[1]], digits = 4)
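# Finally, a sketch of how an effect size could be read off one of these impact models
# (assuming the first fit succeeded, and that coef() retrieves the coefficient vector
# from the clogistic object, as the default method does when a coefficients element is
# stored): the coefficient of the grouping variable is a within-stratum log-odds ratio,
# so exponentiating it yields an odds ratio.
fit1 <- impact.test.result[[1]]   # model for the first item retained for impact analysis
coef(fit1)        # conditional log-odds ratio for the language-group effect
exp(coef(fit1))   # odds ratio: the group effect on the odds of endorsing the item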
