ASSESSING THE MULTILEVEL VALIDITY OF PROGRAM-LEVEL INFERENCES BASED ON AGGREGATE STUDENT PERCEPTIONS ABOUT THEIR GENERAL LEARNING

by

Stephanie Barclay McKeown

B.A., The University of British Columbia, 1996
M.A., The University of British Columbia, 2005

A DISSERTATION SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies
(Measurement, Evaluation and Research Methodology)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

October 2014

© Stephanie Barclay McKeown, 2014

Abstract

Aggregate survey results are commonly used by universities in Canada to compare effective educational practices across program majors within a university and between equivalent majors across campuses.  Despite this recurrent practice, many researchers neglect to examine the multilevel validity of inferences made from program-level responses.  This study illustrates the importance of determining the multilevel validity of program-level inferences prior to drawing conclusions based on survey data.

Survey responses regarding student perceptions about their general learning outcomes and ratings of the learning environment were collected from the National Survey of Student Engagement (NSSE) and the Undergraduate Experience Survey (UES).  The analytic procedures used in this study included two-level exploratory multilevel factor analyses (MFA) and three statistical approaches to determine the appropriateness of aggregation: analysis of variance (ANOVA), the within and between analysis (WABA), and the unconditional multilevel model.  Multilevel regression models were then applied to the survey data to examine the relationships of program-level characteristics with perceived student learning outcomes.

The results led to four conclusions regarding the use of student survey results aggregated to the program level.  First, results from the MFA revealed that the multilevel structure of items regarding perceived learning was consistent across the student and program levels for most samples, but the multilevel structure of items regarding the learning environment was not supported at the program level.  Second, results from the ANOVA and unconditional multilevel models indicated that aggregation to the program level for perceived learning was statistically appropriate for three of the four study samples; however, WABA results indicated that aggregation to a level lower than the program major was more suitable.  Aggregation to the program level was not supported for any of the learning environment scales across all three procedures.  Third, aggregation was variable dependent, as demonstrated by lower within-program agreement on ratings of the learning environment but higher agreement on perceived learning outcomes.  Finally, student-level perceptions about learning were partially influenced by student- and program-level characteristics; however, program means were not highly reliable and the results did not support making program comparisons.  Implications for educational research and recommendations for further research are discussed.

Preface

This dissertation is original, unpublished work of the author, Stephanie Barclay McKeown.  The author identified and designed the research program, performed all parts of the research, and analyzed the research data.
The author did not administer the surveys, but a secondary analyses of these survey results was approved by UBC Ethics Certificate number H13-00445.    v  Table of Contents Abstract.................................................................................................................. ii Preface....................................................................................................................iv Table of Contents ................................................................................................... v List of Tables .........................................................................................................xi List of Figures..................................................................................................... xiii List of Acronyms .................................................................................................xiv Acknowledgements .............................................................................................. xv 1  Chapter One:  Statement of the Purpose and Problem .................................. 1 1.1  Introduction ................................................................................................. 1 1.2  Traditional and Emerging Ideals of Quality in Higher Education .............. 9 1.3  Indicators of Institutional Quality: Conceptual and Empirical  Considerations ................................................................................................. 15 1.3.1  Levels of Analysis and Interpretation .............................................. 15 1.3.2  Multilevel Survey Measures ............................................................ 17 1.3.3  Multilevel Validity ........................................................................... 17 1.3.4  Single Level Regression Models ..................................................... 19 1.3.5  Multilevel Regression Models ......................................................... 20 1.4  Statement of the Problem .......................................................................... 22 1.5  Purpose of the Study ................................................................................. 24 1.5.1  Research Questions .......................................................................... 24 1.5.2  Definition of Key Terms .................................................................. 25 1.6  Summary of the Research Problem .......................................................... 29  vi  2  Chapter Two:  Review of the Literature ....................................................... 31 2.1  Introduction ............................................................................................... 31 2.2  The Foundations of Institutional Effectiveness Research ......................... 31 2.3  A Modern Approach to Validation Theory............................................... 37 2.4  Student Perceived Learning as a Measure of Quality/Effectiveness  in Higher Education ......................................................................................... 40 2.5  Measuring Student Perceptions About Their Learning  ........................... 45 2.5.1  The National Survey of Student Engagement (NSSE) .................... 46 2.5.1.1  Criticisms of the NSSE as a Measure of Learning ................. 47 2.5.2  The Undergraduate Experience Survey (UES) ................................ 50 2.5.2.1  Criticisms of the UES as a Measure of Student Learning ...... 
51 2.6  Summary of the Literature Review ........................................................... 53 3  Chapter Three:  Research Design and Methodology ................................... 56 3.1  Introduction ............................................................................................... 56 3.2  Research Questions ................................................................................... 58 3.3  Study Samples........................................................................................... 59 3.3.1  The NSSE Sample ........................................................................... 61 3.3.2  The UES Sample .............................................................................. 67 3.4  Study Measures ......................................................................................... 71 3.4.1  The NSSE Measure .......................................................................... 72 3.4.2  The UES Measure ............................................................................ 73 3.5  Analytic Procedures .................................................................................. 75 3.5.1  Creating Composite Ratings ............................................................ 75    vii   3.5.1.1  Principal Components Analysis:  Single Level and Multilevel  Models ....................................................................................................... 76 3.5.2  Methods to Examine the Appropriateness of Aggregation .............. 80 3.5.2.1. Analysis of Variance Approach .............................................. 81  3.5.2.1.1  ANOVA Step One:  Assessing Non-Independence ....... 81  3.5.2.1.2  ANOVA Step Two:  Reliability of Program Means ...... 82  3.5.2.1.3  ANOVA Step Three:  Within-Group Agreement .......... 83 3.5.2.2  Within and Between Analysis................................................. 84  3.5.2.2.1  WABA Step One:  Level of Inference ........................... 86  3.5.2.2.2  WABA Step Two:  Bivariate Analysis .......................... 87  3.5.2.2.3  WABA Step Three:  Decomposition of Correlations .... 88 3.5.2.3  The Unconditional Two-Level Multilevel Approach ............. 89 3.5.3.  Examining Program-Level Characteristics and Student  Perceptions ................................................................................................. 92 3.5.3.1  The Dependent Variable ......................................................... 93 3.5.3.2  The Independent Variables:  Student-Level ........................... 94 3.5.3.3  The Independent Variables:  Program-Level .......................... 97 3.5.3.4  Multilevel Regression Model Design ................................... 106 3.5.3.4.1  Model Assumptions for Multilevel Modeling ............. 110 3.5.3.4.2  Empirical Bayes Estimators ......................................... 111 3.6  Summary of Research Methodology ...................................................... 113 4  Chapter Four:  Research Results ................................................................. 115 4.1  Introduction ............................................................................................. 115  viii        4.2  Results for the NSSE Example ............................................................... 116 4.2.1  NSSE Composite Ratings .............................................................. 116  4.2.1.1  NSSE Exploratory Multilevel Factor Analysis Results ........ 
120        4.2.2  Aggregation Statistics - NSSE ANOVA Results ........................... 122  4.2.2.1  NSSE ANOVA Step One:  Assessing Non-Independence ... 122  4.2.2.2  NSSE ANOVA Step Two:  Reliability of Program Means .. 124  4.2.2.3  NSSE ANOVA Step Three:  Within-Group Agreement ...... 125  4.2.2.4  Summary of NSSE ANOVA Results ................................... 125 4.2.3  Aggregation Statistics - NSSE WABA Results ............................. 127  4.2.3.1  NSSE WABA Step One:  Level of Inference ....................... 127  4.2.3.2  NSSE WABA Step Two:  Bivariate Analysis ...................... 130  4.2.3.3  NSSE WABA Step Three:  Decomposition of   Correlations ........................................................................................ 131  4.2.3.4  Summary of NSSE WABA Results ...................................... 132 4.2.4  Aggregation Statistics - NSSE Unconditional Multilevel Model  Results ...................................................................................................... 133  4.2.4.1  NSSE Empirical Bayes Point Estimates   ............................. 135  4.2.4.2  Summary of NSSE Unconditional Multilevel Model  Results ...................................................................................................... 139 4.2.5  Overall Summary of Aggregation Statistics for NSSE .................. 139 4.2.6  NSSE Multilevel Regression Model Results ................................. 140  4.2.6.1  NSSE Random-Coefficient Model Results........................... 140  4.2.6.2  NSSE Intercepts- and Slopes-as-Outcomes Model Results .. 143  ix  4.2.7  Overall Summary of NSSE Findings ............................................. 148 4.3  Results for the UES Example ................................................................. 150 4.3.1  UES Composite Ratings ................................................................ 150  4.2.1.1  UES Multilevel Exploratory Factor Analysis Results .......... 153 4.3.2  Aggregation Statistics - UES ANOVA Results ............................. 157  4.3.2.1  UES ANOVA Step One:  Assessing Non-Independence ..... 157  4.3.2.2  UES ANOVA Step Two:  Reliability of Program Means .... 158  4.3.2.3  UES ANOVA Step Three:  Within-Group Agreement ........ 159  4.3.2.4  Summary of UES ANOVA Results ...................................... 159 4.3.3  Aggregation Statistics - UES WABA Results ............................... 160  4.3.3.1  UES WABA Step One:  Level of Inference ......................... 161  4.3.3.2  UES WABA Step Two:  Bivariate Analysis ........................ 162  4.3.3.3  UES WABA Step Three:  Decomposition of Correlations... 163  4.3.3.4  Summary of UES WABA Results ........................................ 165 4.3.4  Aggregation Statistics - UES Unconditional Multilevel Model  Results ...................................................................................................... 166  4.3.4.1  UES Empirical Bayes Point Estimates   ............................... 167  4.3.4.2  Summary of UES Unconditional Multilevel Model  Results ...................................................................................................... 170 4.3.5  Overall Summary of Aggregation Statistics for UES .................... 170 4.3.6  UES Multilevel Model Regression Results ................................... 171  4.3.6.1  UES Random-Coefficient Regression Model Results .......... 171  4.3.6.2  UES Intercepts- and Slopes-as-Outcomes Model Results .... 
174  x  4.3.7  Overall Summary of UES Findings ............................................... 182 4.4  Summary of Research Findings .............................................................. 185 5  Chapter Five:  Summary, Conclusions and Recommendations ................ 189 5.1  Introduction ............................................................................................. 189 5.2  Discussion of Research Findings ............................................................ 190  5.2.1  Sample Representativeness and Item/Scale Reliability ................. 193  5.2.2  Program-level Aggregation:  Do Student Survey Outcomes Add Up? ......................................................................................................... 199  5.2.3  Multilevel Regression Analyses of Perceived Student Learning ... 203 5.3  Limitations of the Study ......................................................................... 206 5.4  Implications for Institutional Effectiveness Research  ........................... 209 5.5  Recommendations for Further Research ................................................. 215 Bibliography ....................................................................................................... 219 Appendices .......................................................................................................... 247      Appendix A:  NSSE Intercorrelations and P-P Plots ...................................... 247      Appendix B:  UES Intercorrelations and P-P Plots ........................................ 257    xi  List of Tables Table 3.1     NSSE student respondents by program major ................................... 64 Table 3.2     NSSE final study sample demographics ............................................ 66 Table 3.3     UES student respondents by program major ..................................... 69 Table 3.4      UES final study sample demographics ............................................. 71 Table 3.5     WABA criteria for determining level of inference for a composite variable................................................................................................................... 87 Table 3.6     WABA criteria for determining aggregation and the relationship  between two variables ............................................................................................ 89 Table 4.1     NSSE descriptive statistics for composite scales by year level ....... 118 Table 4.2     NSSE MFA results for perceived general learning for Fourth-Year  Students ................................................................................................................ 121     Table 4.3     NSSE ANOVA approach to test aggregation by program major .... 124 Table 4.4     NSSE WABA approach to test levels of inference by program  major .................................................................................................................... 128 Table 4.5     NSSE WABA comparisons of bivariate correlations ...................... 131 Table 4.6     NSSE WABA decomposition of correlations .................................. 132 Table 4.7     NSSE unconditional multilevel results to test aggregation by  program major...................................................................................................... 135 Table 4.8     NSSE results of the random-coefficient regression model for  fourth-year students ............................................................................................. 
142 Table 4.9     NSSE results of the intercepts- and slopes-as-outcomes model for  fourth-year students ............................................................................................. 144  xii  Table 4.10   UES descriptive statistics for composite scores by year level ......... 152 Table 4.11   UES MFA results for perceived general learning by year level ...... 155  Table 4.12   UES MFA results for campus climate for diversity for first-year  students ................................................................................................................ 156  Table 4.13   UES ANOVA approach to test aggregation grouped by program  major .................................................................................................................... 158 Table 4.14   UES WABA approach to test levels of inference ............................ 162 Table 4.15   UES WABA comparisons of bivariate correlations ........................ 163 Table 4.16   UES WABA results for the decomposition of correlations ............. 165 Table 4.17   UES unconditional multilevel results to test aggregation by  program major...................................................................................................... 167 Table 4.18   UES results of the random-coefficient regression model for first-year  students ................................................................................................................ 173 Table 4.19   UES results of the random-coefficient regression model for  fourth-year students ............................................................................................. 174 Table 4.20   UES results of the intercepts- and slopes-as-outcomes model for  first-years ............................................................................................................. 175 Table 4.21   UES results from the intercepts- and slopes-as-outcomes model for fourth-year students ............................................................................................. 179    xiii  List of Figures Figure 3.1     Flow chart of study design ............................................................... 57 Figure 4.1     Empirical Bayes point estimates for NSSE first-year students  (unconditional two-level multilevel model) ........................................................ 137 Figure 4.2     Empirical Bayes point estimates for NSSE fourth-year students  (unconditional two-level multilevel model) ........................................................ 138 Figure 4.3     Empirical Bayes point estimates for NSSE fourth-year students  (intercepts- and slopes-as-outcomes model) ........................................................ 147 Figure 4.4     Empirical Bayes point estimates for UES first-year students (unconditional two-level multilevel model) ........................................................ 168 Figure 4.5     Empirical Bayes point estimates for UES fourth-year students (unconditional two-level multilevel model) ........................................................ 169 Figure 4.6     Empirical Bayes point estimates for UES first-year students (intercepts- and slopes as outcomes model) ........................................................ 177 Figure 4.7     Empirical Bayes point estimates for UES fourth-year students  (intercepts- and slopes-outcomes model) ............................................................ 
181         xiv  List of Acronyms ANOVA Analysis of Variance CLA  Collegiate Learning Assessment EB  Empirical Bayes estimates EFA  Exploratory factor analysis HLM  Hierarchical linear modelling ICC  Intraclass Correlation Coefficient MFA  Multilevel Factor Analysis MLF  Full maximum likelihood estimation NSSE  National Survey of Student Engagement OLS  Ordinary least squares estimation PCA  Principal Component analysis REML  Restricted Maximum likelihood estimation TIMSS Third International Mathematics and Science Study UES  Undergraduate Experience Survey UK  United Kingdom USA  United States of America WABA Within and Between Analysis    xv  Acknowledgements A special thank you to Dr. Kadriye Ercikan, my research supervisor, for her ongoing support, wonderful encouragement, and patience throughout the time she has supervised my dissertation.  I also wish to extend my appreciation to Dr. Bruno Zumbo, Dr. Charles Ungerleider and Dr. Lesley Andres for offering their time, support, and positive feedback.  I am thankful to Dr. Kurt Geisinger, the external examiner, for his wisdom and his advice regarding my dissertation.  In addition, I wish to thank Dr. Amy Scott Metcalf and Dr. Sterrett Mercer for their insightful questions and constructive feedback.  My sincere gratitude goes to my friend Dr. Trudy Kavanagh who edited my dissertation, and offered her ongoing support, and reassurance to me over the past several years.  I am incredibly grateful for the loving support that my husband, Jamie, has provided me over the years while I’ve been working on my dissertation, as well as taking such wonderful care of our children -- Dylan, Robyn and Owen.  I also wish to thank my parents and sisters who never lost faith in me, as well as my friends and colleagues, Dr. “Dallie” Sandilands, Cindy Bourne, Kathy Fahey, Shannon Dunn, and Mary DeMarinis, for their encouragement and emotional support.   1   1  Chapter One: Statement of the Purpose and Problem 1.1  Introduction The current public and political discourse on student learning has renewed the focus on accountability and institutional effectiveness research in higher education.  The accountability programs of the 1980s and 1990s in Canada and the United States of America (USA) neglected to evaluate student learning because at that time excellence in learning was considered implicit in higher education (Inman, 2009; Wieman, 2014).  More recently, student learning has become an explicitly-stated component of the mission of most institutions of higher education in Canada and the USA (Ewell, 2008; Kirby, 2007; Maher, 2004; Shushok, Henry, Blalock & Sriram, 2009), thus student learning has also become an essential component of the evaluation of institutional effectiveness and accountability in higher education (Chatman, 2007; Douglass, Thomson & Zhao, 2012; Erisman, 2009).   The onus is now on institutions to articulate the effective educational practices they use to promote student success that can be attributed to the quality of the learning environment (Astin, 2005; Keeling, Wall, Underhile, & Dungy, 2008; Kuh, 2009; Porter, 2006; Tinto, 2010).  Consequently, the demand for information about student learning in higher education, and the additional value that institutions provide to the learning process, has increased significantly (Banta, 2006; Canadian Council on Learning, 2009; Gibbs, 2010; Liu, 2011; Woodhouse, 2012).  
In Canada, information regarding student-learning outcomes at the program level has been  2   particularly pertinent in fulfilling program accreditation requirements and informing academic program reviews.  As part of the standards in the accreditation and academic review processes, there has been a recent emphasis on reporting student learning outcomes that can be attributed to the quality of the program (Ewell, 2008).   Aggregate composite ratings based on student survey results are commonly used in Canada and the USA to compare effective educational practices across major fields of study such as psychology, mathematics, and sociology within a university, and between the same majors across universities (Chatman, 2009; Nelson Laird, Shoup & Kuh, 2005).  Results from these surveys are used to support high-stakes decisions such as making instructional changes or informing resource allocations.  Accurately interpreting results from these surveys is important to support effective educational practices.  Data from these surveys have a nested structure because although the data are collected from students, the results are intended to be interpreted at the program level.  The multilevel nature of these types of surveys should be examined from a multilevel perspective that takes into consideration the nested design.   When the intent is to interpret aggregated survey results rather than the individual student responses, multilevel validity evidence is required to support interpretations made from the higher level of analysis.  If there is too much heterogeneity across the levels of analysis, perhaps due to diversity in student responses, diversity in size of the programs, or instructional approaches, the within-level structure may be distorted and interpretations at the higher level of aggregation may be more problematic to interpret, which may not allow for meaningful conclusions to be derived from the  3   data.  In such instances, interpretations of program-level decisions across university programs may not be appropriate or meaningful (Bobko, 2001; Hubley & Zumbo, 2011; Zumbo & Forer, 2011; Zumbo, Liu, Wu, Forer & Shear, 2010).  The issue of heterogeneity and aggregation is important for understanding how to summarize and apply student survey results that are suitable for informing program-level decision-making, accountability, and improvement efforts.  This study addresses the appropriateness of aggregating student perceptions about their general learning and the learning environment to reflect attributes about program quality and/or effectiveness, and how suitable the aggregate responses are for comparing across program majors for this study. Learning outcomes in higher education are indicators of what a student is expected to know, understand, and demonstrate at the end of a period of learning (Adam, 2008).  Until recently, institutions relied on traditional ideals of quality, characterized by resource- and reputation-outcome indicators such as student-to-faculty ratios and labour market outcomes.  These types of indicators have resulted in limited information about institutional accountability for student learning (Brint & Cantwell, 2011; Ewell, 2008; Skolnik, 2010).  This gap in our understanding has resulted from the use of traditional approaches to measuring quality in higher education, which has disregarded the influence of the educational context on student performance, and instead focused on the impact of student performance on the institution’s reputation.  
Although we want students to learn discipline-specific skills, a recent trend has been to measure general learning outcomes at the program or institutional level as a  4   measure of institutional effectiveness (Penn, 2011).  General learning outcomes in higher education typically refer to the knowledge and skills graduates need to prepare them for the workplace and society such as critical thinking skills, writing and problem-solving skills that are not discipline-specific but are skills that can be applied across the disciplines (Spelling Commission on the Future of Higher Education, 2006).  Measuring learning in higher education, and determining whether students are actually meeting the stated learning outcomes, is a complex process because there is currently little consensus on how to measure student general learning across programs and institutions (Penn, 2011; Porter, 2012). In contrast to traditional ideals of quality, proponents of the emerging ideals of quality in higher education focus on determining the effects of institutional practices on student performance (Porter, 2006).  This focus is accomplished by taking a holistic approach in our understanding of student learning that considers the nature and structure of the learning environment as well as the background characteristics and prior achievement of the students (Alexander & Stark, 2003; Angelo, 2002; Bak, 2012; Borden & Young, 2008; Entwistle & Peterson, 2004; Price, 2011; Timmermans, Doolaard, & Inge de Wolf, 2011).  From this holistic perspective, researchers have argued that the learning environment can be purposively changed in particular ways to enhance the student learning experience and increase student success in the acquisition of knowledge (Astin, 2005; Keeling et al., 2008; Kuh, 2009; Tinto, 2010).  This broad approach to understanding student general learning in  5   higher education provides additional complexity in how to measure their learning within the context of the university setting.   Institutional effectiveness research, based on emerging ideals of quality, is focused primarily on the educational practices that support the learning process rather than on evaluating individual student learning (Alexander & Stark, 2003; Angelo, 2002).  More specifically, effectiveness research is focused on determining the types of activities and behaviours that can be empirically linked to teaching and learning in higher education as a measure of institutional effectiveness (Douglass et al., 2012; McCormick & McClenney, 2012; Porter, 2006).  Although direct measures of student learning such as examinations or student work are the preferred approach used to assess student learning, indirect measures such as student surveys are gaining popularity.  Surveys are a common approach used by institutional effectiveness researchers in Canada to solicit feedback from students because they can include questions regarding student perceptions of their learning outcomes, together with questions regarding perceptions about how well university practices contribute to learning (Chatman, 2007; Pike, 2004;).  The National Survey of Student Engagement (NSSE; Kuh, 2009) and the Undergraduate Experience Survey (UES; Hafner & Shim, 2010) are two examples of survey measures that are used to examine student perceptions about their general learning outcomes and the learning environment in higher education.  
These surveys are considered indirect measures of learning, or proxy measures, because they ask about student perceptions of learning rather than directly measuring student performance.  These types of surveys are referred to as  6   multilevel measures because student-level survey responses are intended to be aggregated to a higher unit for interpretation, such as to the university or program level, to provide information about program quality/effectiveness.   Conclusions drawn from aggregate student survey responses were originally expected to provide information about university quality on a national basis and to compare results across institutions; however, the most common institutional use for these surveys has been to provide program-level information regarding student learning outcomes for accreditation and academic program reviews (NSSE, 2011).  For example, students responding to the NSSE are asked to identify their program major or intended program major in an open-text field.  The programs identified in this open-field by respondents were re-coded by NSSE staff into 85 program major categories.  The NSSE staff members encourage institutions to use this program major field through the use of the Major Field Report on their website.  The authors of the NSSE suggest using aggregate survey results by the program major field to examine educational patterns across program majors within the university and among equivalent program majors across comparable universities (NSSE, 2011).   Some researchers have argued that program-level analysis is more appropriate than institutional-level analysis for these types of surveys (Chatman, 2009; Nelson Laird, Shoup & Kuh, 2005).  Using the UES results from the University of California, Chatman (2007) found that student survey results varied considerably more among program majors within the university than among equivalent majors across university campuses (Chatman, 2007, 2009), and that academic experience and student  7   engagement varied by program of study in predictable ways.  Thus, Chatman insisted that differences in aggregate survey responses across program disciplines must be examined to inform program and university improvement efforts.  Using the NSSE results, Nelson Laird et al. (2005) found that students who indicated more frequent engagement in deep learning behaviours also reported greater educational gains, higher grades, and greater satisfaction with college; however, they found that these patterns varied by program area.  Nelson Laird et al. (2005) also suggested that using the results grouped by program major field could identify valuable information to guide department-level improvements in teaching and learning.   Group-level analysis is common in educational research, and is of importance to institutional effectiveness research that focuses on the interactions among groups as well as the interaction between the individual and group levels (D’Haenens, Van Damme, & Onghena, 2008; Porter, 2011).  Educational measurement theorists contend that the appropriate level of aggregation based on student survey results requires empirical and substantive support (Borden & Young, 2008; Forer & Zumbo, 2011; Griffith, 2002; Zumbo & Forer, 2011; Zumbo, Liu, Wu, Forer & Shear, 2010), but many group-level analyses have neglected to examine the appropriateness of aggregation prior to drawing their conclusions at the aggregate level.  
The process is referred to as multilevel validation when validating interpretations based on aggregate results (Hubley & Zumbo, 2011).  Although researchers have criticised the use of an aggregate institutional-level score based on survey results to guide internal improvement efforts (Banta, 2007; Chatman, 2007), few have applied the same  8   criticism to the aggregation of survey ratings at the program major level (Chatman, 2007; Umbach & Porter, 2002).  Furthermore, there are no published research studies that have examined the multilevel validity of interpretations made from NSSE and UES student survey responses when aggregated to the program level (Chatman, 2007; Dowd, Sawatsky & Korn, 2011; Liu, 2011; Olivas, 2011; Porter, 2011).   This study contributes to the literature examining the multilevel validity of survey data, and demonstrates how heterogeneity across the student and program levels can impede meaningful interpretations made, based on aggregate scores.  The NSSE and the UES were selected for this study because each uses a different measurement perspective to collect information about student perceptions of learning and experiences related to learning within higher education.  Furthermore, each survey is administered to individual students to solicit feedback regarding their perceptions about their general learning and the learning environment while at university, and the results are intended to be interpreted at the program or university level.  As such, the NSSE and the UES can be considered multilevel surveys, which require empirical evidence in support of aggregation prior to interpretation at the program level.   The remaining five sections of this chapter establish the background and rationale for this study.  The first section locates this study in the context of traditional approaches and emerging developments in evaluating quality in higher education.  The second section examines some conceptual and empirical considerations of using student perceptions of learning outcomes as indicators of quality in higher education.  The third and fourth sections identify the problems addressed in this study, define the  9   purpose of the study, propose research questions and define the key terms used throughout the study.  The chapter concludes with a brief summary of the research problem. 1.2  Traditional and Emerging Ideals of Quality in Higher Education  Towards the end of the twentieth century, government officials concluded that there was a positive relationship between individual learning and improved opportunities that would enhance national economic competitiveness (Ikenberry, 1999; Skolnik, 2010).  As a consequence, the introduction of performance reporting, based on selected key performance indicators (KPIs), was driven primarily by the economic recession of the early 1990s coupled with increased student enrolment in Canada and the USA (Snowdon, 2005).  As a result, institutions were required to submit accountability reports to provincial and state governments on their institutional infrastructures, processes and outcomes (Baker, 2002; Bernhard, 2009; Federkeil, 2008; Houston, 2008).    The increased reporting requirements caused some tensions between those who saw the role of universities as dedicated to the development of long-term knowledge for societal good, and those who regarded the academy as a source of knowledge for the sake of the economy (Bok, 2003; Deiaco, Hughes & McKelvey, 2012; Inman, 2009; Kirby, 2007).  
These competing views about the role of the university have resulted in different perspectives about what constitutes quality in higher education and how that quality should be estimated (Douglass, Thomson & Zhao, 2012; Gibbs,  10   2010; Skolnik, 2010).  Some researchers believe that these competing views have led to the development of two conceptually different assessment paradigms that focus either on accountability or on improvement (Carpenter-Hubin & Hornsby, 2005; Ewell, 2008; Hussey & Smith, 2003).  These researchers suggest that the purpose of assessment for accountability is to demonstrate to external public and political stakeholders that higher education institutions are achieving government-defined goals and standards such as KPIs.  In apparent contrast to assessment for accountability, assessment for improvement is for internal use to enhance teaching and learning within the institution.  Assessment for improvement typically involves internal evaluations and assessment procedures for determining quality assurance that are based on the goals and strategic priorities of the institution.  Despite the apparent polarity of the accountability-improvement contrast, there is a long-standing desire across the education sectors to use indicators that can inform internal improvement efforts as well as communicate accountability to external audiences (Ewell, 2009; Ungerleider, 2003; Zhao, 2011).  Another point of contention with increased government reporting was that the guidelines for the institutional information and the choice of KPIs were developed using management-driven theories of efficiency and economy that required audit-style reporting (Bok, 2003; Conlon, 2004; Snowdon, 2005).  These management-driven theories were considered counter to the academic culture of higher education, and were therefore criticized for not taking into consideration the complexity of higher education institutions and student learning (Skolnik, 2010).  The few pioneering institutions that  11   were already collecting student academic information were doing so to inform internal improvement efforts that took an institution-centred approach to assessing and evaluating student learning (Ewell, 2009).  These institution-centred approaches to assessment were uniquely designed for each institution and not easily standardized to others, and thus were less amenable to government accountability requirements.    Under the management-driven approaches, KPIs primarily reflect institutional resources such as student-to-faculty ratios, average class size, and expenditure per student (Skolnik, 2010).  KPIs also emphasize student outcomes such as graduation rates and employment results.  Furthermore, KPIs are used to standardize a variety of institutional and academic practices designed to facilitate institutional comparisons and system-level reporting (Conlon, 2004; Skolnik, 2010; Snowdon, 2005).  This management-driven approach led to a propensity for reports to be narrowly focused on the task of ranking and comparing institutions rather than on linking the quality of education to student learning (Carmichael et al., 2001; Change & Morgan, 2000; Dwyer et al., 2006; Harvey, 2002; Jobbins et al., 2008; Merisotis & Sadlak, 2005).  Critics challenged the assumptions and consequences of using narrowly-focused evaluations of academic quality at the institutional level that gave little attention to teaching or to student learning (Hussey & Smith, 2003; Murphy, 2011).    
The use of rankings in rating and comparing universities has become a global phenomenon (Altbach, 2012; Artushina & Troyan, 2007; Cheng & Liu, 2007; Clarke, 2005; Usher, 2009; Wise, 2010).  Most university rankings in Canada, the United Kingdom (UK), and the USA are produced by the media (e.g., MacLean’s, Globe &  12   Mail, Times Good University Guide, US News & World Report, Forbes College Rankings).  These organizations market the rankings as tools or guides that will aid students in choosing the right school (Dowsett Johnston, 2001; Morse, 2008; Scott, 2009) or for potential employers to hire the right employees (Cheng & Liu, 2007).  Some argue that these rankings improve institutional status and give institutions the ability to compete more effectively for the best students and scholars (Van Dyke, 2005).  The methodologies used to rank institutions of higher education differ substantially in the criteria selected to judge quality (e.g., focused on research, teaching, or student satisfaction) and the levels at which they report quality (e.g., discipline-based or institutional level).  There are some rankings that emphasize teaching, such as the Times Good Universities Guide published in the UK, but most rankings in Canada seem to focus on reporting research activity (Cheng & Liu, 2007; Clarke, 2005; Usher & Savino, 2007), which makes the use of institutional rankings inadequate as a proxy for quality learning in higher education.     Accreditation by professional agencies is another approach used to determine quality in higher education (Ikenberry, 1999).  Accrediting agencies function differently in Canada than in the USA and in Europe in that they only accredit professional programs, not the institutions (Skolnik, 2010).  There is no national accreditation of higher education institutions in Canada (Canadian Council on Learning, 2009); accreditation is a decentralized process that is governed by provincial and territorial governments.  The process of accreditation is only mandatory for specific programs in Canada that require certification or professional designation, such as  13   engineering, nursing, teacher education, and dentistry.  Rather than focusing on the institution, accreditation is focused on determining if these programs have met established standards.  Where programs are involved with accreditation or certification processes, these standards are intended to be integrated into the quality review processes of the institution. As part of the standards in the accreditation process, accrediting agencies have recently placed emphasis on student learning outcomes that can be attributed to the quality of the program (Ewell, 2008).  This emphasis has resulted in increased institutional participation in collecting information about student learning for these programs in particular.  Most accrediting agencies require that institutions use assessments to examine student achievement as part of their self-study and review processes (Ewell, 2001).  The choice of assessments is typically left to those responsible for the program, who may select to use direct measures (e.g., standardized exam, portfolios, classroom assessments) or indirect measures (e.g., surveys) of student learning.   Although there is no national accreditation agency in Canada, the Association for Universities and Colleges of Canada (AUCC) acts as the national voice for Canadian universities and degree-granting colleges.  AUCC was first established in 1911, and as of 2014 represented 97 post-secondary institutions.  
AUCC member institutions are mandated to formalize their quality assurance principles and processes for their institution, and these principles and processes are required to be clearly stated in their strategic plans (http://www.aucc.ca/canadian-universities/quality-assurance/aucc- 14   quality-assurance-principles/).  Quality assurance through AUCC requires that each institution develop and maintain periodic internal quality assurance policies and procedures.  Quality assurance policies and procedures for member-institutions are required to be within the scope of the guidelines and policies outlined by their respective provincial or territorial governments, as well as AUCC principles.  Internal quality assurance for AUCC is typically demonstrated through academic program reviews that include self-evaluation and peer review.  The self-evaluation typically involves reporting a combination of administrative data and student survey results to inform how well students believe they are doing academically in the program and their perceptions about how their learning can be attributed to the types of educational experiences offered by the program.  AUCC requires that these reports be made public by reporting the results on university websites. The interpretations made from aggregate program-level survey results currently lack sufficient evidence of multilevel validity (Borden & Young, 2008; Zumbo & Forer, 2011).  Most institutional effectiveness research involves large multi-institutional datasets to compare results across institutions or across programs (Astin & Denson, 2009; Porter, 2006).  There have been few attempts to validate interpretations from these survey results at a single university (Esquirel, 2011; Gordon, Ludlum & Hoey, 2008; Korzekwa, 2010; LaNasa, Cabrera & Trangsrud, 2009; Olivas, 2011), and none that have examined the multilevel validity of interpretations made from aggregating student-level responses to the program level;  15   yet the most common institutional use of the survey results is for program-level analysis (Chatman, 2007, 2009; NSSE, 2011).   1.3  Indicators of Institutional Quality: Conceptual and Empirical Considerations   Current conceptual advances made in validity and measurement theory have the potential to contribute significantly to enhancing the understanding of program- and institutional-level analyses in higher education.  There are three considerations of particular importance to this study: 1) the holistic approach to understanding learning that has led to the recent development, and increased use, of multilevel survey measures; 2) the extension of validity theory to include multilevel validity; and 3) the development of sophisticated statistical techniques used to analyse these multilevel survey results.  These recent conceptual developments are starting to make their way into the understanding of how we measure student performance in higher education.  The uses of multilevel models provide much needed insight into the complex nature and structure of learning within the higher education environment. 1.3.1  Levels of Analysis and Interpretation      Educational research is often concerned with studying variables or constructs that are not observed directly, but can be measured by a range of indicator variables assumed to be associated with the construct (Andres, 2012; Charles, 1995).  
Surveys are one way that institutions measure complex social phenomena such as student satisfaction with their university experience, perceptions about their learning outcomes, and attitudes regarding their learning environment.  To measure these complex constructs, individual survey items that are related to each other can be combined to create composite scores or ratings.  A primary concern for assessing the quality of higher education from aggregate student-level outcomes is determining whether ratings and perceptions collected from individuals reflect attributes at the higher level of aggregation, so that interpretations made at this level are meaningful (Griffith, 2002).  Constructs such as perceived learning can take on different meanings, in terms of their use and interpretation, when considered at the disaggregate level (e.g., the student level) or at subsequent aggregate levels (e.g., the program or institution level) (Forer & Zumbo, 2011; Zumbo & Forer, 2011).

There are two potential threats to validity when a construct loses its meaning upon aggregation: the atomistic fallacy and the ecological fallacy (Dansereau, Cho, & Yammarino, 2006).  An atomistic fallacy refers to inappropriate conclusions made about groups based on individual-level results, while an ecological fallacy refers to inappropriate inferences made about individuals based on group-level results (Kim, 2005).  When valid inferences can be made about a construct, or variable, at multiple levels of aggregation, it is known as a multilevel construct (Zumbo & Forer, 2011).  Researchers have found that relationships between variables may change depending on whether the individual or the group is selected as the unit of analysis, which can lead to either an atomistic or an ecological fallacy (Bobko, 2001).  In this study, if the meaning at the program level is not consistent with the meaning at the student level, then drawing conclusions based on program-level results could lead to an atomistic fallacy.  Thus, aggregating student survey responses to the program level should not be assumed appropriate without first determining whether the meaning at the student level has been upheld at the program level (Zumbo, Liu, Wu, Forer & Shear, 2010; Zumbo & Forer, 2011).

1.3.2  Multilevel Survey Measures

As previously mentioned, although the surveys used in this study were administered to individual students, the survey results were intended to be interpreted at the program or institutional level.  In light of this, it can be reasoned that the NSSE and UES are multilevel measures of student perceptions about their learning because they are used as measures of program- or institutional-level quality in higher education rather than of individual student-level learning (Zumbo & Forer, 2011).  In addition, these surveys were designed using the emerging ideals of quality because they include questions regarding the learning environment and ask students to indicate to what extent the university has supported student learning and personal development.  Thus, these surveys are intended to be examined from a multilevel perspective that considers the student-level information about perceived general learning within the context of the learning environment, which includes the characteristics of the student and the structure of the student's program.
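As a concrete, purely hypothetical illustration of the level-of-analysis issue raised in Section 1.3.1, the short sketch below (Python; program labels, sample sizes, and parameter values are invented) constructs student-level data in which the relationship between two composites is positive within every program yet negative when the aggregated program means are correlated — the pattern that makes atomistic and ecological fallacies possible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical composites for students nested in eight program majors: within
# each program, "engagement" and "learning" are positively related, but the
# program means are arranged so that the aggregate relationship is negative.
records = []
for j in range(8):
    n_j = 60
    mean_eng = 2.0 + 0.30 * j          # higher-engagement programs ...
    mean_learn = 4.0 - 0.25 * j        # ... happen to have lower mean ratings
    eng = rng.normal(mean_eng, 0.4, n_j)
    learn = mean_learn + 0.5 * (eng - mean_eng) + rng.normal(0, 0.3, n_j)
    records += [{"program": j, "engagement": e, "learning": l}
                for e, l in zip(eng, learn)]
df = pd.DataFrame(records)

# Pooled student-level correlation versus the correlation of program means.
r_pooled = df["engagement"].corr(df["learning"])
program_means = df.groupby("program")[["engagement", "learning"]].mean()
r_aggregate = program_means["engagement"].corr(program_means["learning"])
r_within = df.groupby("program").apply(
    lambda g: g["engagement"].corr(g["learning"])).mean()

print(f"pooled student-level r  : {r_pooled:+.2f}")    # mixes both levels
print(f"program-level (means) r : {r_aggregate:+.2f}")  # negative
print(f"average within-program r: {r_within:+.2f}")     # positive
```

In data like these, a conclusion about programs drawn from the within-program pattern would be an atomistic fallacy, and a conclusion about individual students drawn from the aggregate pattern would be an ecological fallacy.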
1.3.3  Multilevel Validity

In its contemporary use, validity theory is focused on the interpretations or conclusions drawn from assessment results rather than on the assessment scores themselves (Borden & Young, 2008; Zumbo, 1998).  The validation process involves first specifying the proposed inferences and assumptions that will be made from the assessment findings, and then providing empirical evidence in support of them (Kane, 2006, 2013; Messick, 1995b; Zumbo, 2007, 2009).  Kane (2006, 2013) suggested that the evidence required to justify a proposed interpretation or use of assessment findings depends primarily on the importance of the decisions made on the basis of those findings.  If an interpretation makes modest or low-stakes claims, less evidence is required than for an interpretation that makes more ambitious claims.  Thus, Kane (2013) offered an argument-based approach to the validation of interpretations.  It is from this modern perspective of the validation process that contemporary theorists such as Zumbo and Forer (2011), Forer and Zumbo (2011), and Hubley and Zumbo (2011) proposed an extension of validity theory to include multilevel validity.  As such, Zumbo and Forer (2011) reasoned that validation efforts for multilevel variables must include supporting evidence that the interpretations of individual results remain meaningful when aggregated to some higher-level unit or grouping.  The extension of validity theory by Zumbo and Forer (2011) to include multilevel validity is relevant for this study when interpreting attributes of program quality measured by aggregating student perceptions of their learning (Borden & Young, 2008).  These authors also argued for the greater use of appropriate measurement models that can empirically validate multilevel inferences based on aggregate results.

1.3.4  Single Level Regression Models

Until recently, most investigations of institutional effectiveness in higher education have used multivariate statistics to analyse individual student performance or group means (Astin & Denson, 2009).  These methods examine relationships among student-level variables, but neglect to investigate how students are clustered in programs or universities (Goldstein, 1997; Porter, 2006; Raudenbush & Bryk, 2002).  Research strategies for addressing these complex multilevel structures have been limited, partly due to the lack of estimation procedures used to analyse such datasets (Astin & Denson, 2009; Raudenbush & Bryk, 2002).  In addition, there has been a lack of consideration of the implications of measuring variables at their natural level while drawing inferences at another level based upon data aggregation or disaggregation (Borden & Young, 2008; Heck & Thomas, 2000).

When the object of analysis is to determine the importance of program-level variables based on student-level outcomes, the researcher must decide either to aggregate the outcomes to the program level for analysis or to analyse the student-level outcomes directly.  As an example, if a researcher chooses to aggregate student-level data collected from 1,000 observations across seven programs, the standard errors estimated for the program-level analysis would depend upon only seven aggregated program-level observations.
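A minimal sketch of the example just described — 1,000 hypothetical student records spread across seven programs — shows how aggregation leaves only seven program-level observations (all values and column names are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# 1,000 hypothetical student-level composite scores across seven programs.
df = pd.DataFrame({
    "program": rng.integers(0, 7, size=1000),
    "learning": rng.normal(3.2, 0.7, size=1000),
})

# Aggregating to the program level collapses all within-program variability
# into a single mean per program, leaving seven rows for analysis.
program_level = df.groupby("program")["learning"].agg(n="size", mean="mean", sd="std")
print(program_level)
print(len(df), "student rows ->", len(program_level), "program rows")
```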
The information available in the student-level data would be lost or diminished in an aggregate-level analysis because the variability in student performance would be reduced to a single program-level variable (Heck & Thomas, 2000; O'Connell & Reed, 2012).  If the students within a program respond similarly to the survey items, the aggregate program-level variable will likely represent their perceptions.  Conversely, when there is too much heterogeneity among student survey responses within a program, the aggregate program-level variable may no longer appropriately represent the students' perceptions for that program.

1.3.5  Multilevel Regression Models

Universities provide an obvious example of multilevel data, where students are grouped into classes that are clustered within programs.  Approaches built on traditional ideals of quality handle poorly the many ways in which environmental variables could affect learning and performance at the student, program, or institutional level.  Failing to acknowledge the within-group variability in the data can distort the relationships examined among units by either overestimating or underestimating the influence of group membership (Aitkin & Longford, 1986).

Statistical single-level models built on traditional ideals of quality offer a simplistic view of a complex situation because they neglect to incorporate the influence of program characteristics on the student and the learning environment.  These traditional approaches have been criticized as statistically weak in their interpretation because so much information is lost or diminished depending upon the unit of analysis selected: student or program.  In contrast, multilevel approaches model student and program data simultaneously.  Thus, these models are useful when examining how student results aggregated to the program level are influenced by student-level characteristics, and how belonging to a particular program influences student perceptions about their learning (Goldstein, 1997; Rasbash et al., 2002).

A limitation of multilevel modelling is the large sample-size requirement for the number of higher-level units (Maas & Hox, 2005).  The efficiency and power of the multilevel approach rest on pooling data across the units comprising the highest level, which requires a large data set.  With less than adequate power, there is an unacceptable risk of not detecting cross-level interactions between the program and student performance.  It has generally been accepted that there should be adequate statistical power with 30 groups of 30 observations each (Kreft & De Leeuw, 1998), 150 groups with 5 observations each (Mok, 1995), or even 20 groups at the higher level in a two-level design (Centre for Multilevel Modelling, 2011); yet at most institutions of higher education it might be difficult to find even 20 or 30 units that contain enough student-level data to be fitted to a multilevel model.  As a consequence of the large sample-size requirement, multilevel models have more commonly been applied to data pooled across multiple institutions, and less commonly to data from a single institution.  When sample sizes are small, particularly in terms of the number of higher-level units, a post-hoc Bayesian approach to multilevel modelling called empirical Bayes (EB) estimation can be used (Rupp, Dey, & Zumbo, 2004).
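To make the shrinkage idea concrete, the sketch below simulates hypothetical program-level survey data, fits the unconditional two-level model with statsmodels, and then pulls each observed program mean toward the grand mean in the way EB estimation does.  Program names, sample sizes, and variance values are invented; this is an illustration of the general logic described in Raudenbush and Bryk (2002), not the analysis reported later in this study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Hypothetical "perceived general learning" composites for students nested in
# a handful of program majors of unequal size (all values are simulated).
programs = ["Psychology", "Biology", "History", "Engineering",
            "Sociology", "Economics", "English"]
rows = []
for name in programs:
    n_j = int(rng.integers(15, 120))                  # unbalanced program sizes
    u_j = rng.normal(0, 0.25)                         # program-level deviation
    scores = 3.0 + u_j + rng.normal(0, 0.60, n_j)     # student-level noise
    rows += [{"program": name, "learning": s} for s in scores]
df = pd.DataFrame(rows)

# Unconditional (intercept-only) two-level model: learning_ij = g00 + u_j + r_ij
fit = smf.mixedlm("learning ~ 1", df, groups=df["program"]).fit(reml=True)
tau00 = float(fit.cov_re.iloc[0, 0])      # between-program variance
sigma2 = float(fit.scale)                 # within-program (residual) variance
g00 = float(fit.fe_params["Intercept"])   # grand mean

# Empirical-Bayes-style shrinkage: each observed program mean is pulled toward
# the grand mean in proportion to how unreliable that mean is (small or noisy
# programs are shrunken the most).
by_prog = df.groupby("program")["learning"].agg(n="size", observed_mean="mean")
reliability = tau00 / (tau00 + sigma2 / by_prog["n"])
by_prog["eb_mean"] = reliability * by_prog["observed_mean"] + (1 - reliability) * g00
by_prog["reliability"] = reliability

print(by_prog.round(3))
print(f"ICC = {tau00 / (tau00 + sigma2):.3f}")
```

With only seven hypothetical programs the variance estimates are themselves noisy, which is precisely the small-sample situation described here.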
The benefit of using EB estimates is that information for the estimation procedure is determined from the actual data.  EB posterior estimates provide a more reduced, or shrunken, estimate when compared to ordinary least squares (OLS) estimation results.  The 95% credible  22   values around the EB estimates can provide additional information regarding the variability of these shrunken estimates within programs (Raudenbush & Bryk, 2002).  In addition, EB estimates are not derived from theory based on large sample sizes, nor do they require a large number of higher-level groups in the study; thus, EB estimates were useful in the current study to examine results within programs, across programs, and among equivalent programs across two campuses of one university (Gelman, 2006; Greenland, 2000).     1.4  Statement of the Problem Borden and Young (2008) described a series of assumptions that must be met to relate student-learning outcomes to the effectiveness of university programs.  First, that the data sources, typically student measures, are reliable (Bowman, 2011; Campbell & Cabrera, 2011; Gonyea, 2005; Lüdtke, Robitzsch, Trautwein & Kunter, 2009; Pike, 2011; Porter, 2011), and results are representative of the characteristics of the larger population (Timmermans, Doolaard, & Inge de Wolf, 2011) while still capturing the potential differences among student subgroup populations (Chatman, 2007; Dowd et al., 2011).  Second, that the methods used to collect the data are valid measures of student learning (Banta, 2007; Chatman, 2009; Kuh, 2009; Pace, 1986).  Third, that the reporting of individual student-learning outcomes aggregated to the program level is appropriate and retains the same meaning at the higher level as it did at the student level (Dansereau et al., 2006; Forer & Zumbo, 2011; Griffith, 2002; Van de Vijver & Poortinga, 2002).  Finally, that the program-level characteristics of  23   the learning environment are suitably represented in the model and can be associated with student learning in a meaningful way (Astin & Denson, 2009; Porter, 2006).   These assumptions must be addressed when attempting to establish an empirical link between what students know, and how that learning has been affected by their experiences at the university (Borden & Young, 2008).  Most importantly for this study is the multilevel validity of aggregating student survey outcomes to the program level to report out on program-level claims, and to compare across program majors.  Unfortunately, there is a paucity of research investigating the multilevel validity of aggregate student survey responses used for comparative purposes across program majors (Griffith, 2002; Porter, 2011; Olivas, 2011).   This study aims to contribute to our understanding of using student surveys to measure and compare program quality based on aggregate student perceptions about their general learning and the learning environment.  Prior to using aggregate survey results to inform program improvement or accountability efforts, multilevel validity regarding aggregation at the program level needs to be determined.  When there is too much variability across the levels of analysis, then interpretations at the higher level of aggregation might not be appropriate or meaningful (Bobko, 2001; Hubley & Zumbo, 2011; Zumbo & Forer, 2011; Zumbo, Liu, Wu, Forer & Shear, 2010).  Drawing conclusions at the program level could lead to making an atomistic fallacy and erroneous decisions based on conclusions that are not valid.  
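Anticipating the aggregation checks introduced in the research questions below, a one-way variance decomposition offers a quick diagnostic of whether between-program variability is large enough to support aggregation.  The sketch below computes an ANOVA-based ICC(1) and a WABA-style E ratio from textbook formulas, not reproduced from this study, and reuses the simulated data frame from the previous sketch.

```python
import numpy as np

def aggregation_diagnostics(df, group_col, value_col):
    """One-way variance decomposition diagnostics for the appropriateness of aggregation."""
    grand_mean = df[value_col].mean()
    groups = df.groupby(group_col)[value_col]
    k = groups.ngroups
    n_total = len(df)

    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    ss_within = groups.apply(lambda g: ((g - g.mean()) ** 2).sum()).sum()
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    n_bar = n_total / k          # simple average group size (an approximation when unbalanced)

    icc1 = (ms_between - ms_within) / (ms_between + (n_bar - 1) * ms_within)
    eta_between = np.sqrt(ss_between / (ss_between + ss_within))
    eta_within = np.sqrt(ss_within / (ss_between + ss_within))
    return {"ICC(1)": icc1, "eta_between": eta_between,
            "eta_within": eta_within, "E_ratio": eta_between / eta_within}

# Uses the simulated `data` frame from the previous sketch.
print(aggregation_diagnostics(data, "program", "perceived_learning"))
```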
It could be argued that using survey results to inform program accreditation and academic program reviews are high-stakes claims, which would require substantial validity evidence to  24   support these conclusions (Kane, 2006, 2013; Zumbo, 2007, 2009).  This study establishes the importance of determining multilevel validity for program-level analysis and program major comparisons, as well as more broadly for institutional effectiveness research overall.   1.5  Purpose of the Study The purpose of this study was to determine to what extent student perceptions about their general learning could be aggregated to the program level to make program-level claims regarding student learning, for making comparisons across programs within the University of British Columbia (UBC), and among equivalent programs across UBC’s Vancouver and Okanagan campuses.  In addition, this study examined how aggregate student survey results about their perceived general learning were related to student- and program-level characteristics at UBC.  The analytic approaches used in this study are intended to be sufficiently comprehensive that they can be applied to validating aggregate results from other multilevel constructs in higher education such as student engagement, student satisfaction, or student academic achievement.   1.5.1  Research Questions  The overarching research question addressed in this study was:  To what extent do aggregate student perceptions about their general learning outcomes and ratings of the learning environment reflect attributes of program quality/effectiveness within a  25   university?  To answer this broad question, two specific research questions were addressed: 1. Are multilevel inferences made from student perceptions about their general student learning outcomes and ratings of the learning environment valid when aggregated to the program major level?   The multilevel structure of the survey data was examined using a two-level multilevel exploratory factor analysis.  The appropriateness of aggregation to the program level was determined by using and comparing three different analytic methods; the one-way analysis of variance (ANOVA), the within and between analysis of variance (WABA), and the unconditional multilevel model. 2. What student- and program-level characteristics, and cross-level relationships, are associated with student perceptions about their general learning outcomes? Conclusions about how the student- and program-level characteristics at UBC were related to student perceptions about their general learning outcomes were made using results from the multilevel regression analyses. 1.5.2  Definition of Key Terms  The following are definitions of terms that will be used throughout this dissertation and are appropriate for the focus of this study.  26   Indirect measures of learning are used as proxies for student learning, such as surveys asking student perceptions on their learning outcomes or gains, rather than a direct measure of their learning or growth.  The outcomes of indirect measures can be obtained from the instructor, the student, or others involved in the learning process through the assessment of values, feelings, perceptions and attitudes.  Indirect measures of learning are compared with direct measures of learning, which are assessments that are based on student work, including class assignments, reports, demonstrations and performances.  
The benefit of using direct measures of learning is that they capture a sample of what students know, understand and demonstrate; however, it is more difficult to obtain information about perceptions, feelings and attitudes through direct measures. A limitation of indirect measures is that perceptions of learning may not be accurately measured, or be related to academic grades.   The unit or level of analysis dilemma involves deciding whether the analysis should focus on the individual or the institution.  When institutional-level variables are the focus of analysis, and data are collected at the student level, the dilemma is whether these should be aggregated to the institutional level for analysis or kept at the individual level. If the choice is the individual level, then the institutional structure will not be represented in the model. If the choice is the institutional level, then information about individuals, such as students, will be lost. Differences in unit selection may lead to different conclusions that are sometimes contradictory when understanding and explaining the differences in outcomes.  27   Multilevel regression models are used on observational data with a hierarchical or clustered structure.  Many of the populations of interest to educational researchers are organized into this type of structure.  These structures are multilevel phenomena where individuals are nested within clusters such as classes, departments, and faculties, which are also nested within larger units such as universities. Each nested unit may also interact with contextual factors and characteristics of the lower or upper level of the structure; for example, student learning that takes place in the classroom could be affected by the quality of instruction, curriculum content, class size and processes, school climate or culture, and other contextual variables (Raudenbush, 2004).  Multilevel models allow researchers to investigate different levels of analysis simultaneously.   Single-level regression models are traditional linear methods used to measure relationships among student-level variables. These methods focus on the student level and ignore the ways in which students are nested in classes or programs, and the influence of the program-level characteristics upon the students. These types of analyses result in two problems: first, the resulting conclusions are often biased and overly optimistic; and second, these models generally assume the same effects across groups, which fail to incorporate higher levels such as programs into the models explicitly.  Thus, the influence of the learning environment on student-level variables in these models is overlooked (Goldstein, 1997).   Referent-shift and referent-direct consensus models are used in the development of item-wording for multilevel survey measures and are conceptually distinct from each  28   other.  Using a referent-direct model, the researcher asks individuals to rate their own perspectives or attitudes regarding learning when responding to survey questions.  Then the researcher determines the amount of agreement on these items within the group to inform how the construct would be conceptualized at the group level.  In contrast, a researcher using a referent-shift model asks individuals to rate items that refer to the group perspectives rather than about their own individual perspectives.  Again, the researcher determines the amount of agreement within the group (Kline, Conn, Smith, & Sora, 2001).  
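The within-group agreement referred to in this definition is often summarized with an index such as rWG, which compares the observed within-group variance of an item with the variance expected under uniformly random responding.  The sketch below is a generic illustration with hypothetical ratings and is not necessarily the agreement index applied later in this study.

```python
import numpy as np

def rwg(ratings, n_response_options):
    """Within-group agreement for a single item, in the style of James, Demaree, and Wolf.

    rWG = 1 - (observed within-group variance / variance of a uniform null
    distribution over the response scale).  Values near 1 indicate strong
    agreement; values near 0 (or below) indicate agreement no better than
    random responding.
    """
    observed_var = np.var(ratings, ddof=1)
    uniform_var = (n_response_options ** 2 - 1) / 12.0   # variance of a discrete uniform
    return 1 - observed_var / uniform_var

# Hypothetical 4-point ratings from two program groups.
high_agreement = [3, 3, 4, 3, 3, 4, 3]      # students rate the item similarly
low_agreement = [1, 4, 2, 4, 1, 3, 2]       # students disagree widely

print(round(rwg(high_agreement, 4), 2))     # high agreement (about 0.8)
print(round(rwg(low_agreement, 4), 2))      # low agreement (near zero or negative)
```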
In this study, some of the learning environment items used from the NSSE and the UES provide an example of a referent-shift consensus model because the items ask students to indicate, in their opinion, how well the university supported student learning.  The perceived general learning questions on the surveys provide an example of a referent-direct consensus model because they ask students to rate their own learning as a result of being at UBC.

Response-shift bias occurs when students respond to a pre-test survey and then, after the intervention, respond to the post-test survey using a different measurement perspective.  When the student’s measurement perspective changes from pre- to post-test, the ratings reflect this shift in understanding as well as any actual change in the student’s perceptions, which limits the comparisons that can be made between the two measures because the conclusions drawn from these results would not be valid.  The UES embeds a pre-post design within the survey by asking students to rate their perceived learning both when they started at UBC and at the time of the survey.  The UES authors argued that by embedding a pre-post design within a single survey, student responses about their learning will be more reliable because response-shift bias is reduced (Howard, 1980).

1.6  Summary of the Research Problem

This study aims to contribute to the national conversation about the complexity of measuring student general learning outcomes in higher education, specifically with respect to using aggregate survey results as program-level outcomes.  This is the first study in Canada that uses results collected from the NSSE and the UES to examine the multilevel validity of student perceptions about their learning, and the learning environment, when aggregated to the program major level.  The proponents of these and similar surveys argue that the results are appropriate for diagnostic purposes to inform educational improvement efforts within universities by comparing survey results across program majors within and among institutions.  In addition, these surveys are proposed to provide information for accountability purposes, such as program accreditation and academic program reviews.  Prior to making comparisons across programs, multilevel validity must be established to support the use of aggregate ratings in this way.  This is an issue of validity because student-level information aggregated to the program major level may not reflect program-level attributes.  As such, any interpretations drawn from these aggregate program-level results may not be valid and could lead to inaccurate judgments about programs.

To this aim, the current study takes a multistep approach to multilevel validity.  First, it examines survey responses from a multilevel perspective that considers how items contribute to both the student- and program-level models using a two-level MFA approach.  Second, it demonstrates and compares three statistical approaches for determining the appropriateness of aggregating student survey outcomes to the program major level for interpretation.  Finally, it promotes the use of multilevel modelling for determining how student perceptions about their learning relate to student- and program-level characteristics.
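As a schematic of that final step, and not the model specification estimated in this study, a conditional two-level model adds student-level and program-level predictors, and a cross-level interaction, to the unconditional model sketched earlier; the variable names entry_gpa and program_size are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)

# Hypothetical student-level records with one student-level covariate (entry_gpa)
# and one program-level covariate (program_size) attached to every student.
frames = []
for p in range(25):
    size = int(rng.integers(50, 500))
    gpa = rng.normal(3.0, 0.4, 40)
    outcome = (2.5 + 0.3 * (gpa - 3.0) - 0.0005 * (size - 250)
               + rng.normal(0, 0.1) + rng.normal(0, 0.6, 40))
    frames.append(pd.DataFrame({"program": f"P{p}", "entry_gpa": gpa,
                                "program_size": size, "perceived_learning": outcome}))
survey = pd.concat(frames, ignore_index=True)

# Conditional two-level model: a student-level predictor, a program-level predictor,
# and their cross-level interaction, with a random intercept for program.
model = smf.mixedlm("perceived_learning ~ entry_gpa * program_size",
                    survey, groups=survey["program"])
print(model.fit().summary())
```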
The results from this study will contribute to our understanding of how to summarize and apply student survey results that are suitable for informing program-level decision-making, accountability, and improvement efforts.  More specifically, the findings from this study can be used to inform the use of aggregate student survey results to measure and compare program-level outcomes, as well as for use in higher education to satisfy program accreditation and academic program review requirements regarding student learning outcomes.          31   2  Chapter Two:  Review of the Literature 2.1  Introduction  The following literature review provides a context in which institutional effectiveness research that examines student learning in higher education can be understood theoretically and practically.  This review is divided into four sections.  The first section describes the focus of the institutional effectiveness research paradigm and identifies the origins of this field of research.  The second section provides a summary of the current approach to validation theory and its relevance to this study.  The third section discusses the recent focus on student learning in higher education as an indicator of institutional quality or effectiveness.  The fourth section introduces the two different surveys examined in this study that measure student perceptions about their general learning outcomes, as well as a discussion of the criticisms of each.  The chapter concludes with a brief summary of the literature as it relates to this study.   2.2  The Foundations of Institutional Effectiveness Research  The institutional effectiveness research paradigm describes educational research concerned with exploring differences among learning outcomes within and among institutions of higher education.  Researchers in this area focus on obtaining knowledge about relationships between explanatory and outcome factors, using a variety of statistical models.  Researchers investigating institutional effectiveness and student learning can attribute the origins of their work to the paradigm known as school effectiveness and school improvement research, which has been the source of debate on  32   the relevance and appropriateness of measures used to determine effective schools in the K-12 education system for the past five decades.    Two seminal studies (Coleman et al., 1966; Jencks & Brown, 1975) conducted in the 1960s have been credited in influencing the foundation for school effectiveness research in education (Goldstein & Woodhouse, 2000; Scott & Walberg, 1979).  The initial debate about how to measure school effectiveness began in 1966 when results were released from a research project commonly referred to as “The Coleman Study,” (Coleman et al., 1966) and was further supported by research conducted by Jencks and Brown (1975).   The findings of the Coleman study (1966) led researchers to conclude that schools had little effect on academic achievement in students.  Coleman et al. (1966) theorized that two general areas, achievement and motivation, could define school effectiveness.  In their study, they defined achievement as showing the accomplishments of the school to date, and motivation as showing the interest it created for further achievements.  They focused mainly on standardized student achievement tests as the main outcome for measuring school effectiveness.  
The researchers argued that they could take average scores from students at each school and compare these average scores across schools; however, they did not take into consideration any differences that may have already existed among schools that could interfere with the conclusions drawn from their study.  Similarly, Jencks and Brown (1975) also concluded that school characteristics  33   contributed little to understanding the variation in student performance.  Using data from Project Talent, a longitudinal study of students in grades nine through twelve that was published in 1960, they suggested that differences in high school characteristics such as teacher experience, class size, and social composition were unlikely to change high-school effectiveness.  They argued that the data supported a focus on differences within high schools rather than differences among high schools.  In contrast to the conclusions drawn from Coleman et al. (1966) and Jencks and Brown (1975) studies, Edmonds (1979) believed that schools were responsible for student performance and improvement.  He asserted that by only emphasizing individual student differences within schools, educators would have limited responsibility to be instructively effective.  He also suggested that putting too much emphasis on differences among individual students would place a heavy and erroneous burden on parents for their role in student learning.    Scott and Walberg (1979) argued that student academic achievement was much more complex than determining differences among schools or within schools.  Instead, they theorized that the student, the school, and the home were like a three-legged stool that was as strong as its weakest leg.  They suggested that multiple indicators describing the student, school, and home characteristics would have more discriminative power in determining what factors contributed to the effectiveness of a school in improving learning outcomes.  These pioneers of school effectiveness research did not propose that schools had no  34   effect on student performance.  They indicated that when investigating differences in the effect of average academic achievement among schools, it was difficult to identify school-related variables that accounted for the observed differences (Austin, 1979).  Yet troubling to early school effectiveness researchers who believed that there was an effect of school on student performance was that the conclusions drawn by Coleman et al. (1966) and Jencks and Brown (1975) did not seem to be particularly concerned with the impact of schools on the broad social systems in which they were embedded.  Carver (1975) suggested that the study by Coleman et al. (1966) motivated researchers to demonstrate that measuring the effect of schools upon student performance had been neglected.  In other words, these seminal research studies had concentrated on the effect of student performance on school reputation, rather than how the schooling environment affects student performance.   More recently, Schmidt, McKnight, Houang, et al. (2001) also argued strongly that the structure and programming of schools were important in understanding differences among student academic performance within and across schools.  They based their conclusions on results collected from a cross-national study called, the Third International Mathematics and Science Study (TIMSS).  
The TIMSS assessment was launched in 1995 to examine student achievement in mathematics and science in over 40 countries across the year levels in K-12 education (TIMSS, 1995).  Schmidt et al. (2001) found that it was not the quality of the students that contributed to low achievement in the USA, but rather they found that poor student achievement could be related to the type of curriculum being taught in the schools.  They recommended that  35   changes be made to the curriculum to foster student learning. Similarly, Porter (2006) suggested that in higher education, the discussion in the past regarding student academic achievement focused on how student characteristics influence student behavior, rather than on how and why institutional structures should affect student outcomes.  Porter (2006) shifted the focus from student attributes to institutional structures, and he found that institutional structures were related in predictable ways to student engagement as measured by the Beginning Post-secondary Student Survey.  He also found that pre-university student characteristics such as demographics and experiences were minimally related to student engagement, student outcomes differed by institution type, and that smaller institutions had better student engagement outcomes than larger institutions.  He concluded that these findings were related not only to research outcomes, but also to the density of faculty members per acre, and differentiation of the curriculum.  Porter (2006) called for further research to take into consideration institutional structures and faculty behavior when examining student-level survey responses.  Other contemporary researchers investigating effectiveness in higher education, and how it connects to student learning, have argued for a new comprehensive understanding of the measurement of student learning (Banta, 2006) that considers the social, cultural, political and economic contexts in which learning occurs (Boreham & Morgan, 2004; Gipps, 1994; Moss, 2003).  Similar to Scott and Walberg’s (1979) research, this broad view of student learning extends beyond the cognitive aspects to include all aspects of learning, some of which are beyond the control of the higher  36   education institution (Illeris, 2003).  Researchers who have taken social and personal factors into consideration in their analyses have argued that differences remained among institutions in their survey outcomes that could be attributed to the quality of the educational process (Keeling et al., 2008; Porter, 2006; Tinto, 2010).    The holistic approach to measuring learning by considering contextual variables and the environment can be considered an issue of validation (Moss, 2003).  In her study on classroom assessment and validity in the higher education context, Moss (2003) suggested broadening validity theory to include a sociocultural approach to understanding student learning and the assessment of learning.  She argued that from a sociocultural perspective, learning is a complex process that is viewed as a social activity and is best understood by considering learning in the context in which it occurs (Bak, 2012; Kreber, 2006; Taylor, 2007).  Theoretically, the sociocultural perspective on learning implies the simultaneous transformation of social practices and the individuals who participate in them.  
Thus, researchers examining learning outcomes from this perspective believe that student achievement cannot be interpreted appropriately without taking into consideration both the social and individual dimensions of learning (Moss, 2003; Pryor & Crossouard, 2008; Trigwell & Prosser, 1991).  As an extension, researchers should also take into consideration the social and individual dimensions of learning when examining students’ perceptions about their learning outcomes.      37   2.3  A Modern Approach to Validation Theory  When most measurement specialists discuss the process of validation they understand that it is not the survey or test that is being validated, but rather the inferences or conclusions made from the assessment results (Hubley & Zumbo, 1996; Hubley & Zumbo, 2011; Kane, 2006, 2013; Messick, 1995b; Zumbo, 2007, 2009).  Data on their own are meaningless until they are interpreted in some way (Earl & Fullan, 2003).  Measurement validation involves a process of first interpreting and creating meaning from the data, followed by translating the data into relevant information, and finally, translating that information into knowledge in a timely manner that can be used to inform planning, decisions or actions (Ercikan & Barclay McKeown, 2007).  In very basic terms, when researchers are interested in a particular phenomenon, they may wish to observe and measure that phenomenon.  In the context of survey measures, researchers are concerned with whether they have confidence in the inferences made and conclusions drawn from the outcomes.  Of particular importance to this study is to determine if interpretations made at the program level are valid based on aggregate student-level survey responses.    Over the last two decades there has been a renewed discussion about the overall theory of validity and the process of validation as an exercise in establishing evidence to support inferences based on data rather than considering a proliferation of types of validity (Zumbo, 1998, 2007).  The Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychology Association [APA], National Council on Measurement in Education [NCME], 1999)  38   state that validity has to do with an investigation of threats and biases that can weaken the meaningfulness of the research.  Thus, the validation process involves asking questions that may threaten the conclusions drawn and inferences made from the findings, to consider alternate answers, and to support the claims made through the integration of validity evidence, including values and consequences.      Using this approach, a researcher would present validity evidence in the form of an argument that specifies the conclusions drawn, and the supporting assumptions needed to support score-based interpretations and uses.  Recently, Kane (2013) further described the argument-based approach to validation that is focused on the plausibility of the conclusions drawn from assessment responses.  From this perspective, Kane (2013) provided eight points regarding the validation process that should be considered in providing validity evidence that goes beyond the interpretation of assessment use, but also includes an examination of how well the evidence supports the claims being made from the assessment used, the types of decisions that will be made based on the interpretations, and a consideration of the negative consequences of the interpretations.  
Thus, the process for determining validity does not begin once the survey has been administered and data have been collected, but rather the process of validation continues throughout all stages of measurement and evaluation.  All inferences about observations collected using measures must be related back to, and derive their validity from, the conditions and contexts in which the observations occurred, including survey design, administration and analysis.    39    Whether or not validity should include intended and unintended consequences has received much attention and is a point of departure for many proponents of validity theory (AERA, APA, NCME, 1999).  Most researchers understand the consequences of test use and interpretations of results to be important, but not all consider it to be pertinent to validity theory.  Messick (1995a) argued that decisions made from inferences based on the results obtained from measures must include an evaluation of consequences for their intended use.  Hubley and Zumbo (2011) provided further clarification of what Messick meant by consequential validity.  They argued that if test developers and users want to use assessment results for examining differences among individuals, or comparing across social groups, these consequences of use should be considered in the validation process.    This study supports the consideration of consequential validity by demonstrating the importance of examining the multilevel validity of program-level inferences drawn from aggregate student survey responses.  With respect to this study, if decisions regarding student programming or curriculum changes will be determined based on aggregate survey data used to compare programs, then one could argue that these consequences of use should be considered during the validation process.  Furthermore, consideration of the purpose or use of the survey results in this way should guide the researcher in determining what validity evidence would be required to support aggregation and interpretation at the program level.    The modern argument-based approach to validity emphasizes a notion of validity as a unified theory rather than different types of validity evidence, and signifies  40   validation as an ongoing process because new knowledge and understanding can influence interpretations of the findings, and may identify the need for more evidence to support such interpretations (Kane, 2006, 2013).  Furthermore, new understandings of technological developments affecting measurement theory can help test assumptions in ways that could not previously be accommodated.  Based on this contemporary understanding of validity theory, multilevel validity evidence is required prior to making any claims at the program level based on aggregate student-level survey data.  2.4  Student Learning as a Measure of Quality/Effectiveness in Higher Education  Many officials of higher education view student learning as a fundamental mission of their operation (Keeling et al., 2008; Kirby, 2007; Shushok et al., 2009), and believe that student learning is strongly influenced by the educational context in which it occurs (Astin, 2005; Kuh et al., 2005; Tinto, 2010).  This perspective could be evidence of the major paradigm shift in higher education that was described by Barr and Tagg (1995), which puts the student rather than the instructor at the centre of the learning process (Douglass et al., 2012; Klor de Alva, 2000; Hooker, 1997; Maher, 2004; McCormick & McClenney, 2012).  
Instead of the traditional transfer of knowledge from instructor to student, the focus shifts to understanding the student learning process and the effective educational practices used to promote learning.    This paradigm shift from teaching to learning has occurred in many different countries.  For example, Ministers of Education across many European countries  41   convened in June 1999 to sign the “Bologna Declaration” that would establish and promote the European system of higher education (Van der Wende, 2000).  The “Bologna Declaration” proposed the adoption of a system with comparable degrees, the establishment of a system of credits, and the promotion of student mobility within the European system of higher education.  Although the original 1999 Bologna Declaration neglected to include information about learning outcomes, by 2003 the Ministers required that institutions adopt common learning outcomes to provide additional information about higher education (Adam, 2008).  Through a review of government documentation, Adam (2008) demonstrated how learning outcomes have become central to the paradigm change taking place in all sectors of European education.  In addition, the process of accreditation in Europe has become the dominant method to evaluate quality in higher education (Stensaker, 2011).  Reasons for the shift to a better understanding of student learning in Europe have been variously attributed to the need for greater transparency for qualifications and curriculum reform, the need to provide more tailored educational programming that provides clear information to learners, and the need to improve links between the university, the labour market and employment.  In 2008, at the Bologna seminar on learning outcomes held in Edinburgh, it was suggested that more clarity and common understanding on the nature and depth of applying learning outcomes to understanding and assessing student learning in higher education were required; specifically on how they should be reported, and the level of detail they should encompass (Gallavara et al., 2008).  42    In the USA, the public and political focus on student learning in higher education gained momentum after the presentation of the Spellings Commission Report (2006), which reviewed the state of higher education nationally.  The results of the Spellings Report suggested there was a gap in the knowledge about how much students learned while earning a degree.  As a result, the recommendations in the report called for more enhanced accountability of higher education institutions in the USA.  The report specifically identified two areas of accountability to develop: (1) a widely used standardized assessment, such as the Collegiate Learning Assessment, to measure the value added to student learning by higher education institutions; and (2) new federal guidelines for the nation’s accrediting agencies to develop national standards and comparative reviews of institutional performance based on surveys like the NSSE (Thomson & Douglass, 2009).  At the same time, the voluntary system of accountability (VSA) was introduced by the American Association of State Colleges and Universities (AASCU) and the Association of Public and Land-grant Universities (APLU), which invited institutions to adopt a new accountability system in the USA that focused on student learning outcomes (Liu, 2011).  
Similar to the Spellings Commission Report, the VSA also advocated for the use of standardized assessments as direct measures of student learning in higher education, as well as indirect measures, such as large-scale surveys.    The NSSE and UES were designed in response to the increased use of university rankings by educational administrators and the general public, and to provide information on student learning and the learning environment in higher education.   43   These surveys were intended to gather feedback from students on activities and behaviours that were empirically linked to teaching and learning in higher education for institutional comparisons rather than rankings (McCormick & McClenney, 2012).  The results from these types of surveys were offered to administrators and faculty members as information for examining and comparing the prevalence of effective educational practices on their campuses inside and outside the classroom (Chatman, 2007; Keeling et al., 2008; Kuh, 2009).   As mentioned in Chapter One, there is a growing awareness among researchers who examine institutional effectiveness that the goal is to determine how much value the institution adds to the learning process, but this has proven difficult to measure.  As Porter (2006) points out, when students differ in their incoming achievement levels and background characteristics, they will also tend to differ in their outputs.  Thus, to decompose how much of the variation in learning outcomes across programs is due to differences among students and how much to the characteristics of the program, researchers need appropriate measures of pre-university student characteristics against which to compare the value to educational outcomes that has been added by the program structure.    Raudenbush and Willms (1995) identified two types of effects to distinguish how much value the institution adds to the learning process:  Type A and B.  Type A models control only for student-level factors, including prior attainment and limited background variables, while Type B models control for institution-level and contextual factors outside the institution’s control (Schagen & Hutchinson, 2003).  Type A  44   comparisons generally make little or no attempt to understand the effect of the educational environment on student learning because these studies only account for students’ entering achievements and abilities.  In contrast, Type B comparisons are more useful to researchers interested in the effects of institutional policy on learning outcomes because, along with a student’s background characteristics, they also consider the relationships among students and institutional structures to understand the learning climate when examining student achievement.  Type B approaches are relevant for this study, because they consider the contextual variables and program-level characteristics as well as student characteristics in their models.  Institutions in the USA believe they can use value-added modelling (VAMs) to attempt to measure a teacher’s impact on student achievement, or the value added by the teacher, by taking into consideration the student’s family background, prior achievement and environmental variables, such as peer influences.     
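In regression terms, the Type A/Type B distinction described above can be sketched as two nested specifications: one adjusting only for student-level intake variables and one adding program-level context.  The variables and data below are hypothetical, and the code is illustrative rather than the models proposed by these authors.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Hypothetical intake and context data: students nested in programs.
vam = pd.DataFrame({
    "program": np.repeat([f"P{i}" for i in range(15)], 30),
    "prior_achievement": rng.normal(0, 1, 450),
    "program_size": np.repeat(rng.integers(40, 400, 15), 30),
})
vam["outcome"] = (0.4 * vam["prior_achievement"]
                  - 0.001 * vam["program_size"] + rng.normal(0, 1, 450))

# "Type A"-style adjustment: student-level intake variables only.
type_a = smf.mixedlm("outcome ~ prior_achievement",
                     vam, groups=vam["program"]).fit()

# "Type B"-style adjustment: the same student-level variable plus program-level
# context that lies outside the program's control.
type_b = smf.mixedlm("outcome ~ prior_achievement + program_size",
                     vam, groups=vam["program"]).fit()

print(type_a.params, type_b.params, sep="\n")
```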
A common feature of this type of analysis is called differential effectiveness, which means that programs may achieve quite different results for students who are initially low achievers as compared with students who are initially high achievers (Carini, Kuh, & Klein, 2006; Hopkins et al., 1999; Opdenakker & Van Damme, 2000).  Goldstein and Speigelhalter (1996) drew the distinction between a value-added model and the adjusted comparison model, and argued that justifying the amount of value a program has added to student learning is impractical because, depending on the nature of the study, what determines value is relative.  To incorporate the value-added or adjusted model into institutional effectiveness research requires that data are available at the  45   individual level.  As a result there has been increased demand for student-level data, both on intake and outcome variables, in determining the impact of program performance on student learning.    Since the late 1970s, there have been improvements and significant contributions made to the institutional effectiveness paradigm.  In the 1980s, the multilevel statistical model was developed in the UK, and, almost simultaneously, hierarchical linear analysis was developed in the USA.  These statistical techniques greatly improved the way in which effective institutional analyses could be investigated because they were able to handle the multilevel, or hierarchical, nature of students nested within classrooms, programs and universities (Aitken & Longford, 1986; Ballou et al., 2004; Kezar & Eckel, 2002).  By allowing for the nested structure of these data, the researcher can examine both the variability taking place within groups as well as the variability between groups simultaneously, which should reduce the possible distortion of cross-level relationships and the effects of group membership.  The development of these statistical and computational techniques is important for this study, in terms of examining the multilevel structure of students within their program major. 2.5  Measuring Student Perceptions About Their Learning   This study examined the results from two surveys measuring student perceptions about their learning and the learning environment: the NSSE and the UES.  These two surveys have been used to collect information at the student level with the purpose of aggregating the scores at the program level or higher, to provide feedback on learning  46   effectiveness, and to determine how much value the program practices and structures add to the learning process.  These surveys are described in the next sections, including a discussion on the benefits and criticisms of each as found in the literature. 2.5.1   The National Survey of Student Engagement (NSSE)  The NSSE is an indirect measure of learning that uses a survey instrument to gauge the level of student engagement and how it relates to learning at universities and colleges in Canada and the USA.  Only first- and fourth-year students are invited to participate in the NSSE.  The results of the survey are intended to help administrators and instructors assess student engagement with purposively educational activities, which is expected to correlate with high academic achievement.   The NSSE was developed by the Center for Postsecondary Research at the Indiana University School of Education and was first piloted in 1999.  Since the first official administration in 2000, 1,452 colleges and universities from Canada and the USA have participated in the survey.  
The 2011 version of the survey included 85 items that measured undergraduate student behaviours, attitudes and perceptions of institutional practices intended to correlate with student learning (Kuh, 2009).  It is administered annually by many institutions and in alternating years by others.  The results from the NSSE are used to enhance student learning experiences through influencing institutional policy decisions and program development.    47    The primary purpose of the NSSE is to assess an institution on the following five benchmarks: level of academic challenge; active and collaborative learning; student-faculty interaction; enriching educational experiences; and supportive campus environment.  Students at participating institutions complete an online survey administered by the Center for Postsecondary Research at the Indiana University School of Education that consists of a variety of questions asking students to describe their learning outcomes retrospectively.   Although Canadian institutions did not participate in the NSSE until 2004, it is now one of the most widely used institutional measures of student engagement and learning in Canada.  The NSSE was adopted for use in Canada with minor changes only to the wording of the demographic items.  The Canadian students in higher education were considered similar to the students in the USA, and by participating in the NSSE, Canadian schools can be compared against comparable institutions in the USA.     2.5.1.1  Criticisms of the NSSE as a Measure of Learning  In Fall 2011, a special issue of “The Review of Higher Education” was dedicated to the validity concerns about NSSE and other large-scale surveys of student learning (Olivas, 2011).  Porter’s 2009 article, reprinted for this issue, found that most self-reported student responses on large surveys have minimal validity due to students’ poor self-reporting of their behaviour (Porter, 2011).  He used results from the NSSE to demonstrate his conclusions and he advocated that researchers provide stronger  48   evidence of validity than had been done in the past to support higher-level claims based on student survey results (Porter, 2011).  Bowman (2011) agreed with Porter’s conclusions and suggested that any findings associated with self-reported gains in learning and development should be interpreted with caution.  Bowman found that the correspondence between self-reported gains and longitudinal growth outcomes were too low for self-reported outcomes to be used as a proxy of institutional-level gains in learning and development over time.  Other research studies published in this special issue also called for further validity studies regarding multilevel surveys (Campbell & Cabrera, 2011; Dowd, Sawatzky & Korn, 2011; Olivas, 2011).    Most of the research studies on the NSSE results have been used to compare multiple institutions.  Carini, Kuh and Klein (2006) investigated the linkages between student engagement and student learning by examining outcomes from 14 different institutions.  They found that self-reported measures of student learning were only weakly related to outcomes on critical thinking and grades, although a stronger relationship appeared to be present for those with lower SAT scores. The authors concluded that certain types of institutions tended to be more effective at converting engagement into stronger student achievement for students.  
Carini, Kuh and Klein (2006) interpreted this to mean that student engagement is comprised of a range of institutional processes that add value to student learning.    Researchers have found that reporting NSSE results at the institutional level masks important program and student subgroup information.  Dowd, Sawatsky and Korn (2011) called for additional validity evidence to be provided on intercultural  49   effort, which they referred to as the effort needed to counter negative pressures experienced by members of minority racial-ethnic groups at predominantly white universities.  They argued that when determining the differences between international and domestic university students in response to the NSSE, researchers need to consider intercultural effort.  The authors stressed that the exclusion of intercultural effort items on the NSSE constitutes construct underrepresentation with regard to the amount of effort a student puts forth when comparing the experiences of international learners with domestic learners.  Conway & Zhao (2012) also found substantial subgroup differences in their research study on the five NSSE benchmark results: level of academic challenge; active and collaborative learning; student-faculty interaction; enriching educational experiences; and supportive campus environment. They concluded that there were undercurrents of student engagement across specific academic programs and student subgroups that may be better understood at a lower level unit rather than at the university-wide level, and based on individual survey items rather than the NSSE benchmarks.  One of the intended purposes of the NSSE is for institutional use for accountability and improvement efforts.  As such, researchers have called for further examination on how well the NSSE provides valid evidence of engagement within a single institution (Gordon, Ludlum & Hoey, 2008; Olivas, 2011).  Few studies based on the analysis at a single university have been published; for example, Campbell and Cabrera (2011) used the NSSE scores at one institution and found a lack of construct and predictive validity for non-transfer graduating seniors when they examined the  50   relationship between the NSSE responses and college grade-point average.  In response to these criticisms, McCormick and McClenney (2012) argued that validation work is ongoing, and rather than critiquing the NSSE, or similar surveys, researchers should work towards understanding the measurement of learning and engagement more appropriately through ongoing research studies.     2.5.2  The Undergraduate Experience Survey (UES)  The UES was developed as a collaborative project of faculty and institutional researchers within the University of California (UC) system, and is administered by the Office of Student Research at the Center for Studies in Higher Education at UC Berkeley.  It is the product of a larger research project focused on analyzing and improving the undergraduate experience within major research universities.   The UES is similar to NSSE in that it measures student learning indirectly, and that it is intended for use at the institutional level.  It differs from the NSSE in that it is a census survey of all undergraduates at the institution using both pre- and post-response options in the same survey.  Students are asked to identify their learning outcomes when they first arrived at the university, and then how they would judge their current level of learning at the time of survey participation.  
A version of the UES survey was adapted for use at UBC in 2010 and 2012.     51   2.5.2.1  Criticisms of the UES as a Measure of Student Learning  Although Porter’s (2011) critique of the NSSE with respect to student self-reports could also be applied to the UES, Douglass, Thomson and Zhao (2012) suggested that the unique design of the UES promised many advantages over other surveys and standardized assessments of learning for large complex institutional settings.  They identified two key aspects of the survey design they believed were beneficial: 1) the use of a census design; and 2) a retrospective pretest and a current posttest design to measure student learning outcomes.  The authors argued that the adoption of a census survey design, where all students are invited to participate in the survey, provided more detailed data at the individual level of the academic program and student subpopulations of interest.    The approach used for the pretest/posttest design was based on research conducted on response-shift bias (Douglass, Thomson & Zhao, 2012).  Response-shift bias refers to the phenomenon where a student who gains more experience by being involved in a program makes more informed choices, which make posttest evaluations of their proficiency levels more accurate than their pretest evaluations (Douglass, Thomson & Zhao, 2012).  With this approach, the UES asks students to rate their level of proficiency on general learning skills and personal development when they started at the university, and their current level, using a six-point Likert scale (very poor to excellent) on a series of questions related to educational outcomes.            52    Researchers have found that the use of pretest and posttest survey designs can produce upwardly biased ratings due to social desirability bias (Bowman & Hill, 2011; Gonyea, 2005), and that the bias is non-uniform across student subpopulations (Douglass, Thomson & Zhao, 2012; Thomson & Douglass, 2009).  Thomson and Douglass (2009) examined the difference scores between the UES pretest/posttest student responses of their reported learning outcomes, and found that Asian respondents rated most of their learning lower than non-Asian students.  Even the proponents of the UES suggest the use of caution when interpreting institutional level analysis for internal improvement efforts.   A potential measurement concern with the use of the pretest/posttest design of the UES is the low reliability of difference, or gain scores, on academic outcomes.  Using an example of a repeated-measures analysis of variance (ANOVA) design, Thomas and Zumbo (2012) recently demonstrated why the use of difference scores can lead to zero reliability, and how this finding has misled researchers to believe that difference scores are inappropriate to use in educational research because of the low reliability estimates.    Thomas and Zumbo (2012) suggested that the formula for determining the reliability estimate should differ depending on the focus of the study, either at the individual level or the group level.  As such, they provided a mathematical description of how difference scores in pretest and posttest designs can result in close to zero reliability coefficients because each subject acts as its own control, which results in a constant “true score” in the measurement model with zero variability.   53   They argued that researchers do not need to avoid the use of difference scores when the focus of the analysis is at the aggregate level.  
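A small numerical illustration of this collapse can be built from the textbook formula for the classical reliability of a difference score; this is a standard expression rather than Thomas and Zumbo’s repeated-measures derivation, and the reliabilities and correlations chosen below are arbitrary.

```python
def difference_score_reliability(rel_pre, rel_post, r_pre_post, sd_pre=1.0, sd_post=1.0):
    """Classical reliability of the difference score (post minus pre)."""
    true_part = (rel_pre * sd_pre ** 2 + rel_post * sd_post ** 2
                 - 2 * r_pre_post * sd_pre * sd_post)
    total_part = sd_pre ** 2 + sd_post ** 2 - 2 * r_pre_post * sd_pre * sd_post
    return true_part / total_part

# With equal variances and reliabilities of .80, the reliability of the difference
# score collapses toward zero as the pre/post correlation approaches the reliability.
for r in (0.2, 0.5, 0.7, 0.79):
    print(f"pre/post correlation = {r:.2f} -> "
          f"difference-score reliability = {difference_score_reliability(0.8, 0.8, r):.2f}")
```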
The authors upheld the idea that classical reliability is relevant whenever the focus of the measure is the individual, but, when the measure is used in an aggregate model, classical reliability is no longer relevant.  Instead they proposed the use of a modified definition of reliability related to noncentrality.     2.6  Summary of the Literature Review  The literature discussed in this chapter acknowledges that, despite their disagreements, the underlying goal of all researchers examining educational effectiveness was to isolate the characteristics that distinguish effective universities or programs from the rest.  In this effort, there have been significant improvements made over the years to the research paradigm that have recently made their way into studies of higher education.    Most early researchers examining school effectiveness conceptualized poorly the many ways in which contextual variables could affect learning and performance at the student, program, and university levels.  These models generally assume the same relationships across program and university levels, which fail to incorporate the sociocultural perspective of learning and the hierarchical nature of students clustered in universities.  The multilevel method has increased in popularity with the development of computer programs sophisticated enough to handle these complex structures.  As a result, the multilevel model has become the preferred approach for  54   investigating school effectiveness and improvement research in the K-12 system, and has recently been applied to examine effectiveness in higher education.    At the turn of the 21st Century, with the increased demand for a more complex set of skills in industry, universities were asked to demonstrate how much their students were learning and the extent to which universities were involved in the learning process.  In response, new surveys measuring student perceptions of learning and the student experience were widely adopted across universities and colleges within Canada with the intent to provide additional information about the contextual understanding of learning and how much value the institution provided to the learning process.  These surveys were originally intended for institutional-level analysis and comparison; however, with the increased demand for information about student learning, the most common use for the results from these surveys is for program-level analysis and comparison.  These surveys function as proxy measures of student learning, but they are still in their infancy and much has to be done to determine their valid use in understanding the effectiveness of the program in the learning process.    As these surveys are primarily focused on the program rather than the student, evidence should be provided that supports the inferences on student perceptions about their learning at multiple levels of aggregation.  Researchers have found that macro-level analyses, where the emphasis is on aggregated individual scores, mask the complex role of the social context in shaping those scores and the interpretations (Hubley & Zumbo, 2011; Moss, 2003; Zumbo & Forer, 2011).    55    Campbell and Cabrera (2011) recently stated that further validity evidence is required to support the claim that the NSSE items are valid for establishing institutional effectiveness, and they argued that these analyses should involve data gathered from single- as well as multi-institutional samples.  
Douglass, Thomson and Zhao (2012) suggested that the UES student survey design has several advantages over other surveys in providing information regarding learning outcomes, particularly in complex educational environments.  Therefore, it is imperative to consider carefully the validity evidence required to support the aggregation of these types of survey results, as well as how these measures relate to program processes and characteristics that influence individual student learning.  The purpose of this study is to demonstrate the importance of evidence in support of multilevel validity, prior to drawing interpretations from aggregate student survey ratings.    56   3  Chapter Three:  Research Design and Methodology 3.1  Introduction The first two chapters described how current understandings of student learning, coupled with advances to validity and measurement theory, can improve the understanding of university- and program-level analyses of learning outcomes in higher education.  The literature reviewed for this study reveals a dearth of validation efforts for examining the appropriateness of aggregating results from the student to the program level for surveys such as the NSSE and the UES, which is one of the most common institutional uses of the survey results.  Furthermore, few studies have investigated student self-reported perceptions about general learning outcomes as measured by university surveys and related them to program characteristics at a single university.  Of particular importance for this study was the evidence required to support the multilevel validity of the program-level inferences drawn from aggregate student survey results.  This chapter describes the research design and methodology used in this study to examine validity evidence that supports interpretations of aggregated student-level outcomes, and how these outcomes relate to higher educational practices.  Figure 3.1 provides a broad overview of the study and is intended to assist the reader in following the descriptions of the study design and methods.   A principal components analysis was conducted followed by a two-level MFA on each of the composite scales created for all samples across both survey measures.    57     Figure 3.1  Flow Chart of Study Design.  58   Three approaches to evaluating the appropriateness of aggregation were investigated using data collected from the NSSE and the UES: an ANOVA, a WABA, and an unconditional two-level multilevel model.  The results from the unconditional multilevel model were used to portray graphically the average student ratings about their learning with 95% credible values for each program, which were then examined to determine the suitability of program comparisons made based on aggregate student perceptions.  Each study sample and survey measure is described in detail over the next several sections of this chapter, followed by an explanation of the analytic procedures used in this study to examine the appropriateness of aggregation.  Finally, a multilevel statistical approach is demonstrated to examine how program characteristics of the learning environment are associated with student perceptions of their general learning outcomes.     3.2  Research Questions   This study addresses two specific research questions:  First, are multilevel inferences made from student perceptions about their general student learning outcomes and ratings of the learning environment valid when aggregated to the program major level?  
Second, what student- and program-level characteristics, and cross-level relationships, are associated with student perceptions about their general learning outcomes?  These two research questions will provide information regarding the extent to which aggregate student perceptions about their general learning outcomes and ratings of the learning environment reflect attributes of quality/effectiveness at UBC.      59   The research questions listed above and the analyses used to answer them were applied separately to data collected from each of the surveys: the NSSE and the UES.  The next sections describe the two datasets used in this study, followed by descriptions of each of the survey measures. 3.3  Study Samples The data used in this study were collected from students enrolled at one of UBC’s two campuses in the Okanagan or in Vancouver.  Both campuses of UBC are located in Western Canada with a current overall total undergraduate population of almost 48,000 students.  The Okanagan campus is located in Kelowna in the Okanagan Valley in the southern interior region of British Columbia (BC); the valley comprises a combination of rural and small urban centres.  The city of Kelowna is about 400 kilometers northeast of Vancouver, and is considered a small city with a census metropolitan population of 179,839 reported in 2011.  UBC’s Vancouver campus is located on the coast of BC in a large urban city with a census metropolitan population of 2,313,328 reported in 2011.   The Okanagan campus was established as one of UBC’s campuses in 2005.  Prior to July 1, 2005, the Okanagan campus was a separate institution called Okanagan University College (OUC), which delivered a combination of programming in vocational/ trades training as well as granting university baccalaureate degrees.  Most of the OUC baccalaureate degree programs transitioned to UBC when the Okanagan campus opened in 2005, and the vocational/ trades training transitioned to Okanagan  60   College.  Many program changes have occurred over the past decade to align and develop programming at the Okanagan campus with UBC’s standards and policies.   Although these two campuses belong to one university and share a Board of Governors, they are overseen separately by two distinct academic Senates.  The Vancouver and Okanagan Senates are the academic governing body of each campus and their responsibilities include admissions, course and program curriculum, granting degrees, academic policy, examinations and appeals of standing, and student discipline.  There is also a Council of Senates at UBC that can act like a Senate on matters pertaining to system-wide UBC matters.  Therefore, the development of academic programming at each campus is campus-specific and differs across the two UBC campuses. The total undergraduate student population at UBC’s Okanagan campus was around 7,700 in 2014, which represents about 16% of UBC’s overall undergraduate student enrolment.  The overall undergraduate population of the Vancouver campus was around 40,000 students in 2014.  There were four samples included in this study.  First- and fourth-year UBC students who responded to the 2011 NSSE made up the first two samples, and first- and fourth-year UBC students who responded to the 2010 UES comprised the third and fourth study samples.  Although students across all year levels were asked to participate in the UES, only the first- and fourth-year samples were included in this study for comparison against the NSSE samples.   
61   The proponents of these surveys suggest that the results at the program level could be used for diagnostic purposes to inform educational improvement efforts within universities by comparing survey results across equivalent program majors.  Researchers have found that there was more variability among different programs within a university than within comparable program majors across universities (Chatman, 2007, 2009; Nelson Laird, Shoup & Kuh, 2005).  For these reasons, students enrolled at both UBC’s campuses were included in this study.  Students were grouped into distinct program majors across the two campuses, and program-level results were examined within each campus and among equivalent program majors across the two campuses.    Only full-time undergraduate students who responded to the surveys were included in the analyses.  Full-time students, who were enrolled in 24 or more credits per year, comprised about 68% of the Vancouver undergraduate enrolments, and about 78% of the Okanagan undergraduate enrolments in 2010 and 2011 during the time these surveys were administered to students.  Part-time students may differ in their learning experiences and perceptions of their general learning outcomes due to their part-time status, so only full-time students were included in this study.  Each of these respective samples is described in detail in the upcoming sections.  3.3.1  The NSSE Sample  All registered first- and fourth-year students were invited to participate in the 2011 version of the NSSE at both of UBC’s main campuses.  The NSSE respondent  62   sample included 4,635 students, representing an overall 38% response rate.  Of the 4,635 participants, 1,126 were enrolled at the Okanagan campus (24% of respondents) and 3,509 enrolled at Vancouver (76% of respondents).  These students were enrolled in either the first (n=2,275) or fourth year (n=2,360) of their program.   Students enrolled in their first- and fourth-year of their academic program were invited to participate in the online version of the NSSE through an email invitation sent to them by staff at the Center for Postsecondary Research at the Indiana University School of Education on behalf of UBC.  This online survey was available to students for three months, from February to April 2011.  As an incentive to participate, the names of students who completed surveys were entered into a draw for the chance to win one of three travel prizes worth $500 and two at $250.   The program major variable was a NSSE categorization based on two open-text survey questions that asked students to “Please print your major(s) or your expected major(s),” and options for them to print their primary major in a text box, and a follow-up question, “If applicable, your second major (not minor, concentration, etc.)” in a second text box.  Based on the open responses to these text questions, the researchers at the Centre for Postsecondary Research created 85 program major categories.  Equivalent programs were separated across the two main UBC campuses as shown in Table 3.1, where UBCV represents programs at the Vancouver campus and UBCO represents programs at the Okanagan campus.  63   Only programs with at least five full-time students responding to the NSSE were included in this study.  This resulted in an overall final count of 1,852 first-year students and 1,281 fourth-year students.  
There were 497 first-year students included who were enrolled in programs at the Okanagan campus, and 1,355 first-year students enrolled in programs at the Vancouver campus.  As for fourth-year students who were included, 256 students were enrolled in programs at the Okanagan campus and 1,025 students were enrolled at the Vancouver campus.  The sample characteristics compared with the overall population are reported for each of these samples in Table 3.2.      64   Table 3.1 NSSE Student Respondents by Program Major   First-Year Students Fourth-Year Students Program Major n % n %Accounting UBCV 25 1.3 27 2.1Allied health/other medical UBCV 34 1.8 25 2.0Anthropology UBCO - - 9 0.7Anthropology UBCV - - 15 1.2Art, fine and applied UBCO 19 1.0 6 0.5Art, fine and applied UBCV 33 1.8 22 1.7Biochemistry or biophysics UBCO 30 1.6 12 0.9Biochemistry or biophysics UBCV 38 2.1 35 2.7Biology (general) UBCO 38 2.1 15 1.2Biology (general) UBCV 92 5.0 53 4.1Chemical engineering UBCV - - 23 1.8Chemistry UBCO 14 0.8 7 0.5Chemistry UBCV 37 2.0 17 1.3Civil engineering UBCO 9 0.5 17 1.3Civil engineering UBCV 21 1.1 31 2.4Computer science UBCO 5 0.3 - -Computer science UBCV 34 1.8 31 2.4Economics UBCO 14 0.8 - -Economics UBCV 31 1.7 30 2.3Electrical or electronic engineering UBCO - - 8 0.6Electrical or electronic engineering UBCV - - 39 3.0English (language and literature) UBCO 21 1.1 12 0.9English (language and literature) UBCV 51 2.8 52 4.1Environmental science UBCO 7 0.4 - -Environmental science UBCV 15 0.8 - -Finance UBCV 18 1.0 30 2.3General/other engineering UBCO 23 1.2 - -General/other engineering UBCV 163 8.8 55 4.3Geography UBCO - - 12 0.9Geography UBCV - - 20 1.6History UBCO 9 0.5 11 0.9History UBCV 20 1.1 35 2.7Kinesiology UBCO 53 2.9 22 1.7Kinesiology UBCV 55 3.0 40 3.1             (table continues)  65   Table 3.1 (continued)   First-Year Students Fourth-Year Students Program Major n % n %Language and literature (except English) UBCO 7 0.4 - -Language and literature (except English) UBCV 30 1.6 35 2.7Management UBCO 30 1.6 18 1.4Marketing UBCV 27 1.5 25 2.0Mathematics UBCO 9 0.5 - -Mathematics UBCV 15 0.8 - -Mechanical engineering UBCO 12 0.6 11 0.9Mechanical engineering UBCV 22 1.2 30 2.3Microbiology or bacteriology UBCO 5 0.3 7 0.5Microbiology or bacteriology UBCV 27 1.5 18 1.4Natural resources and conservation UBCV 34 1.8 - -Nursing UBCO 49 2.6 19 1.5Nursing UBCV 11 0.6 33 2.6Other biological science UBCO 10 0.5 - -Other biological science UBCV 54 2.9 28 2.2Other business UBCV 49 2.6 - -Other physical science UBCO 14 0.8 - -Other physical science UBCV 73 3.9 34 2.7(pre)Pharmacy UBCO 11 0.6 - -Pharmacy UBCV 64 3.5 25 2.0Physics UBCO 6 0.3 - -Physics UBCV 17 0.9 - -Political science (govt., intl. relations) UBCO 24 1.3 9 0.7Political science (govt., intl. relations) UBCV 94 5.1 80 6.2Psychology UBCO 49 2.6 40 3.1Psychology UBCV 91 4.9 107 8.4Social work UBCO - - 14 1.1Social work UBCV - - 8 0.6Sociology UBCO 5 0.3 7 0.5Sociology UBCV 27 1.5 22 1.7Undecided UBCO 11 0.6 - -Undecided UBCV 43 2.3 - -Zoology UBCO 13 0.7 - -Zoology UBCV 10 0.5 - -Total 1852 100.0 1281 100.0       66        The final sample consisted of 1,852 first-year students in 59 program majors, and 1,281 fourth-year students in 49 program major groupings.  
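The same program-size rule was applied to every sample in this study, so the selection step can be expressed compactly in code. The sketch below is a minimal illustration using pandas, assuming a hypothetical respondent file with `program` and `full_time` columns; it is not the routine actually used to prepare the NSSE or UES files.

```python
import pandas as pd

# Hypothetical respondent-level file: one row per survey respondent with the
# program major label assigned by the survey vendor or the registrar.
nsse = pd.DataFrame({
    "student_id": range(12),
    "program":   ["Psychology UBCV"] * 6 + ["Zoology UBCO"] * 3 + ["History UBCO"] * 3,
    "full_time": [True] * 11 + [False],
})

# Keep full-time respondents only, then drop program majors with fewer than
# five full-time respondents, mirroring the inclusion rule described above.
full_time = nsse[nsse["full_time"]]
analytic = full_time.groupby("program").filter(lambda g: len(g) >= 5)

print(analytic["program"].value_counts())
```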
The variables, gender, international status and whether a student attended UBC directly from high school (e.g., direct entry status), or transferred to UBC from another higher education institution, are reported for the first- and fourth-year students in the NSSE sample in Table 3.2, and these variables are compared to the overall population at each campus.   Table 3.2 NSSE Final Study Sample Demographics First-Year Students Fourth-Year Students   sample population sample population  Vancouver Campus  n = 1355 N = 5874 n = 1025 N = 4346 % Female 63.6 53.9 60.3 52.5 % International 19.6 14.9 7.6 9.4 % Direct Entry 94.1 93.2 65.8 60.7  Okanagan Campus n = 497 N = 1934 n = 256 N = 1028 % Female 65.8 53.1 65.2 58.4 % International 10.3 9.8 3.9 3.9 % Direct Entry 92.8 90.9 54.3 51.6      As shown in Table 3.2, the response rates for the final study samples were similar across campuses; 23% for the first- and fourth-year Vancouver samples, and 26% for the first-year Okanagan sample and 25% for the fourth-year Okanagan sample.  In addition, Table 3.2 provides information regarding how the samples of respondents included in this study differed among the demographic characteristics from the overall population of full-time students at UBC’s campuses.  Female students responding to the NSSE were over-represented for both year levels at both campuses  67   ranging from 7% difference to 13% difference, while international students were over-represented in the first-year Vancouver sample by about 5% and slightly under-represented by about 2% in the fourth-year Vancouver sample.  The proportions of international students in the Okanagan samples were similar to the overall proportion of international students in the population for both year levels.  The proportion of direct entry students in first-year were fairly close to the overall population of first-year students at both campuses, but were slightly over-represented by 5% in the fourth-year Vancouver sample. In addition, the proportion of students from the invited population at the Okanagan campus, compared with the Vancouver campus, were similar to the proportion of students in the final NSSE study sample across both year levels. Due to the low response rates of just over 20% across these samples, and the variances in demographic characteristics of the NSSE respondent samples compared with the overall population, the generalizability of these findings to the overall UBC population is limited and conclusions from these survey results must be interpreted cautiously as they do not appropriately represent the characteristics of the overall population.   3.3.2  The UES Sample  All undergraduate students were invited to participate in the online UES survey from February to April 2010.  Students were sent email invitations from the Student Services Office on the Okanagan campus, and from the Planning and Institutional  68   Research Office on the Vancouver campus.  As an incentive to participate, the names of students who completed surveys were entered into a draw for the chance to win one of three travel prizes worth $500 and two at $250.  The overall response rate for the UES was 28%. Student respondents were grouped by program major, which was collected from the university’s administrative records and added to the survey data file.  In contrast to the NSSE, which selects only first- and fourth-year students, the UES used a census administration of all year levels, which resulted in an overall sample size of 9,727 students.  
For the purposes of comparison to the NSSE, however, only the first-year and fourth-year full-time students who completed the UES were included in this study.  To maintain an adequate sample size for subsequent analyses, only programs with a minimum of five students each were included, which resulted in a total of 2,260 first-year students across 22 program majors and 2,374 fourth-year students across 28 program majors (see Table 3.3).  The demographic variables of interest to this study are reported in Table 3.4.     69   Table 3.3 UES Student Respondents by Program Major  First-Year Students Fourth-Year Students Program Major n % n %Bachelor of Arts UBCV 663 29.3 829 34.9Bachelor of Arts UBCO 151 6.7 123 5.2Bachelor of Applied Science UBCV 238 10.5 220 9.3Bachelor of Applied Science UBCO 23 1.0 17 0.7Bachelor of Commerce UBCV 76 3.4 126 5.3Bachelor of Computer Science UBCV - - 10 0.4Bachelor of Dentistry UBCV 10 0.4 20 0.8Bachelor of Environmental Design UBCV - - 8 0.3Bachelor of Fine Arts UBCV - - 28 1.2Bachelor of Fine Arts UBCO 15 0.7 7 0.3Kinesiology UBCV 49 2.2 98 4.1Kinesiology UBCO 42 1.9 - -Management UBCO 34 1.5 44 1.9Medical Laboratory Sciences UBCV - - 6 0.3Music UBCV 15 0.7 18 0.8Midwifery UBCV 5 0.2 5 0.2Science (Applied Biology) UBCV - - 5 0.2Agroecology UBCV 10 0.4 5 0.2Sciences UBCV 595 26.3 467 19.7Sciences UBCO  104 4.6 76 3.2Natural Resources and Conservation UBCV 34 1.5 19 0.8Pharmacy UBCV 59 2.6 38 1.6Science Forestry UBCV 14 0.6 10 0.4Science (Food, Nutrition, Health) UBCV 69 3.1 68 2.9Forest Science UBCV 9 0.4 - -Global Research UBCV - - 19 0.8Nursing UBCV - - 50 2.1Nursing UBCO 34 1.5 34 1.4Social Work UBCV - - 12 0.5Social Work UBCO - - 12 0.5Pharmacy UBCO 11 0.5 - -Total 2,260 100 2,374 100  70        As shown in Table 3.4, the response rates for the final study samples were about 31% for the first-year Vancouver sample and about 26% for the first-year Okanagan sample.  The response rate was 30% for both the Vancouver and Okanagan fourth-year samples.  In addition, Table 3.4 provides information regarding how the samples of respondents included in this study differed among the demographic characteristics from the overall population of full-time students at UBC’s campuses.  Similar to the NSSE sample, female students were over-represented for both year levels at both campuses, which ranged from a difference of 7% to a difference of 13%.  The proportion of international students across both year levels in the UES sample were fairly similar to the population proportions, as were the proportion of direct entry students in first-year.  There were more direct entry fourth-year students in the survey sample than in the overall population with a difference of about 11%.  In addition, the proportion of students from the invited population at the Okanagan campus, compared with the Vancouver campus, were similar to the proportion of students in the final UES study sample across both year levels.     
71   Table 3.4 UES Final Study Sample Demographics First-Year Students Fourth-Year Students   sample population sample population  Vancouver Campus  n = 1912 N = 6222 n = 2141 N = 7113 % Female 63.4 52.2 62.3 54.1 % International 12.9 14.8 7.6 8.9 % Direct Entry 89.7 90.4 45.1 34.8  Okanagan Campus n = 429 N = 1668 n = 327 N = 1089 % Female 68.2 54.9 70.9 63.9 % International 7.2 8.0 4.6 3.7 % Direct Entry 84.8 88.5 39.1 27.8        Due to the low response rates of just higher than 25% across these samples, and the proportion of demographic variables of the UES respondent samples compared with the overall population, the generalizability of these findings is limited and conclusions drawn from these survey results must be interpreted cautiously as they do not appropriately represent the characteristics of the overall population. 3.4  Study Measures As stated previously, two surveys measuring student perceptions about their learning outcomes were examined in this study: the NSSE (Kuh, 2009), and the UES (Hafner & Shim, 2010).  These measures were selected because each is administered to individual students and is intended to be interpreted at the institutional level, yet each uses a different approach to measuring learning outcomes.  Each of these two measures is described in detail throughout the next two sections.  72   3.4.1  The NSSE Measure Since the first official administration in the year 2000, more than 1,500 colleges and universities from all over Canada and the USA have participated in the NSSE survey.  The NSSE is administered annually by many institutions, while some administer the survey in alternating years.   The NSSE is a voluntary 87-item online survey that measures undergraduate student behaviours, attitudes and perceptions of institutional practices intended to correlate with learning (Kuh, 2009).  It was first piloted in 1999 by the Center for Postsecondary Research at the Indiana University School of Education, and it is normally administered to students enrolled in first- and fourth-year.   As mentioned in Chapter Two, the NSSE is recognized for producing five benchmarks of effective educational practices, but some users of this information may find institutional-level results too general to act upon within a university.  As an alternative to the five benchmarks, Pike (2006a) developed 12 scalelets comprised of 49 NSSE items and examined the dependability of group means for assessing student engagement at the university, college, and program levels.  He also investigated the convergent and discriminant validity of these NSSE scalelets.  Based on his findings, he argued that the NSSE scalelets had greater explanatory power and provided richer detail than the NSSE benchmarks, which he believed would lead to information for improvement at the program level.    73   Pike’s (2006a) scalelets on gains in general education, overall satisfaction, emphasis on diversity, and support for student success were used in this study rather than the five NSSE benchmarks.  These scalelets were selected because Pike (2006b) concluded from his generalizability study that all 12 scalelets produced dependable group means in making judgments about the average ratings of items comprising the scalelet.  He examined the generalizability of scalelets across four different fourth-year student sample sizes within institutions:  25, 50, 100, and 200.  Although the results were analysed by institution, Pike (2006b) suggested that the scalelets would work well for making judgments about programs.  
The four scalelets used in this study were comprised of 15 items from the NSSE, which are described further in section 3.5.2.3, and in Appendix A.   3.4.2  The UES Measure The UES was developed as a collaborative project by faculty and institutional researchers within the University of California system, and is the product of a larger research project focused on analyzing and improving the undergraduate experience within major research universities:  the Student Experience in the Research University based at the Center for Studies in Higher Education at UC Berkeley.   An adapted version of the UES was sent to current undergraduate students at UBC in the spring of 2010.  Unlike the NSSE, the UES was a census survey that asked all students enrolled in undergraduate programs to participate.  It was administered by staff at the university via an online survey tool.    74   The UES is similar to the NSSE in that it is an indirect measure of student learning, it is a voluntary online survey, and the results are intended to be used at the institutional level; however, it differs from the NSSE in three key ways.  First, it was administered to all undergraduate students, not just first- and fourth-year students.  Second, it was administered and analysed by staff members at UBC rather than by a third party, such as the NSSE.  Third, it used both pre- and post- response options in the same survey, whereas NSSE used a retrospective measurement design.      Several items on the UES were designed using a pre-post format so that students could retrospectively rate their learning ability when they started at UBC, and then rate their current ability at the time they wrote the UES survey.  This study included student responses based on their current abilities, which was similar to the way in which the NSSE questions were asked.  Similar to the NSSE scalelets used in this study (Pike, 2006a), students were asked to respond to items regarding their perceptions about their general learning outcomes, overall satisfaction, campus climate for diversity, and support for student success.  These 19 UES items were selected because of their similarity to the NSSE items used to create Pike’s (2006a) scalelets and to draw general comparisons across the results of these two surveys.  Further information regarding the UES scales are described in section 3.5.2.3 and in Appendix B.    75   3.5  Analytic Procedures   There were several sets of secondary analyses conducted for this study to answer the research questions in this study.  The first set of analyses addressed the first research question by examining the appropriateness of aggregating student perceptions about their general learning and ratings of the learning environment to the program level for four survey samples: first- and fourth-year students who responded to the NSSE, and first- and fourth-year students who responded to the UES.  Due to the limited number of studies examining the multilevel validity of program-level inferences in higher education effectiveness research (D’Haenens, Van Damme and Onghena, 2008), this study used an exploratory approach for considering the nested structure of student perceptions and program-level analyses (Van de Vijver & Poortinga, 2002).  
The second research question was addressed by examining the results from a two-level intercepts- and slopes-as-outcomes model to determine what student- and program-level characteristics, and cross-level relationships, are associated with student perceptions about their general learning outcomes.  Additional details on the analytic procedures used to determine the appropriateness of aggregation and the effects of the institutional structure on student learning perceptions are explained in the following sections.   3.5.1  Creating Composite Ratings The use of composite ratings in educational survey research is quite common.  Typically, several survey items are used to describe a particular construct, such as using  76   four items to describe student perceptions about their learning outcomes (Andres, 2012).  When creating composite ratings the researcher first needs to determine that the scale is essentially unidimensional, which means that there is one underlying construct that is being measured by the combination of survey items based on individual data.  There are many different techniques used by researchers to test dimensionality prior to creating composite ratings based on a linear combination of items (Tabachnick & Fidell, 2001).  The process of creating composite ratings for the study variables are described in the next section. 3.5.1.1  Principal Components Analysis:  Single Level and Multilevel Models  One approach commonly used to test dimensionality is principal components analysis (PCA) followed by aggregating the survey results to the group level for analysis.  A PCA analysis determines which linear components exist within the data and how a particular variable might contribute to that component, which was the focus of this study (Hair, Anderson, Tatham, & Black, 1998).  D’Haenens, Van Damme and Onghena (2008) identified two limitations to this common approach of using only student-level data when the interpretation of survey results is at the group level:  (1) the issue of dependency of responses; and (2) the assumption of absolute correspondence between the levels of analysis.  When the intent is to create composite ratings based on student-level data and interpret these results at the program level, there are two sources of variance:  within programs, and between programs.  D’Haenens, Van Damme and Onghena (2008) argued that when the researcher does not consider that individual perceptions are likely to be dependent on group membership, results can lead to  77   overestimated inter-item correlations, or covariances, and biased factor scores.  They also acknowledged that some researchers have found distinct latent factors that differ across the two levels, which are not detected by the traditional single-level approach.  As such, researchers have recently proposed that exploratory multilevel factor analysis (MFA) should be used to analyze the latent structure of multilevel data over that of the traditional single-level method (Muthén and Muthén, 2011).   The first step in the exploratory approach used in this study was to conduct a PCA to extract the factors on the total data without taking into consideration the multilevel structure (Van de Vijver & Poortinga, 2002).  Next, the between-group- and pooled-within-group-correlation and covariation matrices were calculated, where the pooled-within-group matrix is based on the individual student deviations from the corresponding group mean, and the between-group matrix is based on the group deviations from the grand mean.  
Then a two-level MFA was conducted on the student and program level items simultaneously using Mplus Version 7.2 for categorical data (Muthén and Muthén, 2011).  The estimation procedure used was weighted least squares estimation with missing data (WLSMV) because this procedure is appropriate for categorical variables, which were used in this study to determine composite ratings (Brown, 2006).  MFA is useful for providing information to the researcher on the combination of multilevel survey items by decomposing the variance into the within-program and between-program components.   The results of the PCA were examined for each scale regarding the:  Kaiser-Meyer-Olkin (K-M-O) measure of sampling adequacy; Bartlett’s Test of Sphericity; anti- 78   image matrices; interpretation of the factors; number of items included in each factor; size of the pattern and structure coefficients; and the percentage of variance explained.  The acceptable value of the Kaiser-Meyer-Olkin (K-M-O) measure of sampling adequacy was 0.60 and a significant p-value for the Bartlett’s Test of Sphericity.  The anti-image matrices were reviewed to examine the correlations, where the values on the diagonal were greater than 0.50 indicating that the item contributes well to the scale.  A common rule applied in PCA and factor analysis for determining the number of acceptable factors to retain is Kaiser’s rule of an eigenvalue greater than 1.00, which was applied in this study.  In addition, the criteria for the acceptable model fit indices described by Hu and Bentler (1998) were used in this study to interpret the results of the MFA analyses:  the acceptable values for the root mean square error of approximation (RMSEA) value and the standardized root mean square residual (SRMR) within and between values were less than or equal to 0.08; and the acceptable values for the comparative fit index (CFI) and the Tucker Lewis index (TLI) were larger than or equal to 0.95. There are a variety of methods used to create composite ratings based on survey items, which are referred to as either non-refined or refined approaches (DiStefano, Zhu & Mîndrilǎ, 2009).  Non-refined methods essentially refer to different approaches used to sum ratings across the related items after conducting a factor analysis (Comrey & Lee, 1992; Pike, 2006a).  Summed ratings preserve the variation in the original data.  The original variation is maintained because the created composite ratings are not tied to the sample used in the analysis, but are dependent on the number of items in the  79   scale and the number of response options provided for each item.  Refined approaches are considered more exact than non-refined methods because they use information from the survey data to create standardized ratings; however, refined approaches do not preserve the variability of the original data.  The dependence on the scale and the response options, rather than the data, makes non-refined approaches more stable in creating composite ratings and more appropriate for comparing across study samples (DiStefano, Zhu & Mîndrilǎ, 2009).  In this study, composite ratings were created using the non-refined simple unweighted summed approach.  The unweighted summed composite ratings approach was used to preserve the variation in the original data, and to be able to compare results across survey samples.  In addition, this approach is recommended by Pike (2006a) because it is an approach that many institutional users of NSSE would easily be able to calculate.  
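The within/between decomposition that the MFA operates on can be made concrete with a small sketch. The two-level models themselves were estimated in Mplus, so the following is only a conceptual illustration, using numpy and pandas with hypothetical item names, of how the pooled-within-group and between-group covariance matrices described above are formed.

```python
import numpy as np
import pandas as pd

# Hypothetical item-level data: four perceived-learning items (1-4 response
# options) for students nested in 20 program majors.
rng = np.random.default_rng(1)
cols = ["item1", "item2", "item3", "item4"]
df = pd.DataFrame(rng.integers(1, 5, size=(600, 4)), columns=cols)
df["program"] = np.repeat([f"prog{k}" for k in range(20)], 30)

n_total, n_groups = len(df), df["program"].nunique()

# Pooled-within-group covariance: student deviations from their program means.
dev_within = df[cols] - df.groupby("program")[cols].transform("mean")
s_pooled_within = dev_within.T @ dev_within / (n_total - n_groups)

# Between-group covariance: program means around the grand mean, weighted by
# program size (one common scaling; software such as Mplus applies its own).
means = df.groupby("program")[cols].mean()
sizes = df.groupby("program").size()
dev_between = means - df[cols].mean()
s_between = (dev_between.T * sizes) @ dev_between / (n_groups - 1)

print(s_pooled_within.round(2), s_between.round(2), sep="\n\n")
```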
Composite ratings were created by first transforming all related items to a common scale ranging from 0 to 100, and then summing the transformed ratings across the selected items using an unweighted approach (Comrey & Lee, 1992).  Details regarding the development of these composite ratings are described next.   NSSE items were used as indicators to describe five composite variables: student perceptions about their general learning outcomes; overall satisfaction with education, perceived emphasis on diversity; perceived support for student success; and perceived interpersonal environment.  Once the composite ratings were created, bivariate correlations were calculated to determine evidence of correlational relationships among the original items and the newly created composite ratings.  In addition, the  80   distributions of composite ratings were examined to determine their normality using p-p plots (see Appendix A for details).   There were two steps involved in creating the student-level composite ratings in this study.  First, because some items differed in the number of response options they were placed on a common measurement scale by taking the original categorical student responses and transferring them to a scale ranging from 0 to 100.  For example, survey items with a four-point response scale, such as “very much, quite a bit, some, and very little,” which were originally coded from 4 to 1 were converted to scores of 100, 66.67, 33.33, and 0.  Likewise, survey items with a seven-point response scale were also converted to a scale of 0 to 100: 100, 83.33, 66.67, 50.00, 33.33, 16.67, and 0.   The second step of this process was to create student-level composite ratings for each group of items by taking the average of each student’s converted ratings, if they had answered at least half or three-fifths of the items related to any particular scale (Pike, 2006a).  These proportions were selected to create composite ratings to ensure that students had answered at least half of the items included in the composite rating for the student responses to be included in subsequent analyses.  The same procedures were used in creating composite ratings for each of the four samples used in this study.   3.5.2  Methods to Examine the Appropriateness of Aggregation        This study takes an exploratory perspective to examine the multilevel nature of student survey data, which are to be interpreted at the program level.  MFAs and  81   three approaches regarding the appropriateness of aggregation were used to examine the first research question in this study: (1) analysis of variance (Bliese, 2000; Griffith, 2002; Lüdtke et al., 2009); (2) within and between analysis (Dansereau et al., 2006; Forer & Zumbo, 2011; Kim, 2005); and (3) an unconditional two-level multilevel model (Raudenbush & Bryk, 2002).  The variables included in these analyses were the composite scores created from student perceptions about their general learning outcomes and their perceptions of the learning environment, as measured by the NSSE and the UES.  3.5.2.1  Analysis of Variance Approach  The ANOVA approach involved a three-step process when validating the aggregation of data from individuals (Bliese, 2000): assess non-independence of individual responses, determine the reliability of program means, and consider the level of within-program agreement.  Each of these steps is described in the following sections.      
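Before turning to the aggregation checks, the scoring rule just described can be illustrated with a short sketch. It assumes hypothetical item names and the simple at-least-half-answered rule; the exact item-to-scale assignments used in the study are given in Appendices A and B.

```python
import numpy as np
import pandas as pd

def rescale_0_100(item, n_options):
    """Map a 1..n_options categorical response onto a 0-100 metric."""
    return (item - 1) / (n_options - 1) * 100

# Hypothetical responses: two 4-point items and one 7-point item belonging to
# the same scale, with some missing data.
resp = pd.DataFrame({
    "q1": [4, 3, np.nan, 1],     # 4-point: very little ... very much
    "q2": [4, np.nan, np.nan, 2],
    "q3": [7, 5, 2, np.nan],     # 7-point relationship rating
})
converted = pd.DataFrame({
    "q1": rescale_0_100(resp["q1"], 4),
    "q2": rescale_0_100(resp["q2"], 4),
    "q3": rescale_0_100(resp["q3"], 7),
})

# Unweighted composite: mean of the converted items, kept only when the
# student answered at least half of the items in the scale.
answered = converted.notna().sum(axis=1)
composite = converted.mean(axis=1).where(answered >= converted.shape[1] / 2)
print(composite)
```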
3.5.2.1.1  ANOVA Step One:  Assessing Non-Independence

A common statistical procedure used to assess non-independence is an ANOVA in which groups are the independent variable and the ratings or scales of interest are the dependent variable (Griffith, 2002).  If group membership is related to the variable of interest, then group means should differ, which leads to non-independence, and which is determined by reviewing the Intraclass Correlation Coefficient, or ICC(1), values.  Using Equation 3.1, the ICC(1) can be calculated to indicate the proportion of variance in the variable of interest explained by group membership (Bliese, 2000).  Griffith (2002) suggested the use of an ICC(1) value greater than 0.12 to determine if variance in the dependent variable is related to group membership.  As the program major groupings used in this study were quite unbalanced in size for all samples, the harmonic mean of the program sizes was used rather than the arithmetic mean.  The harmonic mean gives less weight to extremely large programs and therefore provides a more appropriate estimate of the average group size for this study.

\mathrm{ICC}(1) = \frac{MS_B - MS_W}{MS_B + (k - 1)MS_W} .       (3.1)

where MS_B and MS_W are the between- and within-program mean squares from the one-way ANOVA and k is the (harmonic mean) program size.

3.5.2.1.2  ANOVA Step Two:  Reliability of Program Means

The next step in the ANOVA procedure is to determine the reliability of program means.  The group means of the variable of interest may differ, yet members in each group may give very different responses.  Bliese (2000) suggested that the reliability of group means could be determined by the ICC(2) values: if they are high, then a single rating from an individual would be a reliable estimate of the group mean; but if they are low, then multiple ratings would be necessary to provide reliable estimates of the program mean.  Thus, the extent to which the responses from some group members correspond to those of other group members needs to be determined.  When correspondence is high, then responses from a few group members may be used to estimate the group's mean response.  The ICC(2) is a common statistical assessment of the reliability of group means, and Griffith (2002) suggested the use of an ICC(2) value greater than 0.83 to determine acceptable reliability estimates for group means (see Equation 3.2) (Griffith, 2002; Lüdtke et al., 2009).

\mathrm{ICC}(2) = \frac{k \times \mathrm{ICC}(1)}{1 + (k - 1) \times \mathrm{ICC}(1)} .      (3.2)

3.5.2.1.3  ANOVA Step Three:  Within-Group Agreement

The final step to consider is the within-group agreement (Bliese, 2000).  High reliability does not necessarily mean high within-group agreement.  Within-group agreement is determined by calculating the rwg statistic because, although student ratings within a program major might correspond to each other, thereby resulting in high reliability estimates, the individuals may not be in agreement (Griffith, 2002).  Griffith (2002) provided an example of an individual responding using ratings of 1s, 2s, and 3s on a five-point scale, while another individual in the same group responded using 3s, 4s and 5s.  This scenario would result in high reliability because the student ratings would be proportionately consistent, but the within-program agreement would be low because a response of 3 from the first individual corresponds to a 5 from the second individual.  Griffith (2002) suggested that researchers use the rwg statistic, calculated using Equations 3.3a and 3.3b, to determine the extent to which group members give the same ratings or responses.

r_{WG} = 1 - \frac{s_x^2}{s_{EU}^2} .      (3.3a)

s_{EU}^2 = \frac{A^2 - 1}{12} .      (3.3b)

where s_x^2 is the observed within-group variance of the ratings and s_{EU}^2 is the variance expected under a uniform (random responding) distribution across the A response options.
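Equations 3.1 through 3.3b can be computed directly from a student-level file. The sketch below, written with numpy and pandas on simulated data with hypothetical variable names, is illustrative only; it follows the formulas above, using the harmonic mean program size and a uniform null distribution for rwg on a single item.

```python
import numpy as np
import pandas as pd

def icc_from_anova(df, group, y):
    """ICC(1) and ICC(2) from one-way ANOVA mean squares (Equations 3.1 and 3.2),
    using the harmonic mean program size for unbalanced program majors."""
    grouped = [vals.dropna().to_numpy() for _, vals in df.groupby(group)[y]]
    sizes = np.array([len(g) for g in grouped])
    grand = np.concatenate(grouped).mean()

    ms_b = sum(n * (g.mean() - grand) ** 2 for n, g in zip(sizes, grouped)) / (len(grouped) - 1)
    ms_w = sum(((g - g.mean()) ** 2).sum() for g in grouped) / (sizes.sum() - len(grouped))
    k = len(sizes) / (1.0 / sizes).sum()                 # harmonic mean program size

    icc1 = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)
    icc2 = (k * icc1) / (1 + (k - 1) * icc1)
    return icc1, icc2

def rwg(values, n_options):
    """Within-group agreement for one item in one program (Equations 3.3a, 3.3b)."""
    s2_x = np.var(values, ddof=1)                        # observed within-group variance
    s2_eu = (n_options ** 2 - 1) / 12.0                  # variance of a uniform null
    return 1 - s2_x / s2_eu

# Simulated example: a 0-100 perceived-learning composite plus one 4-point item.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "program": np.repeat([f"prog{i}" for i in range(12)], 25),
    "gen_learn": rng.normal(70, 12, 300) + np.repeat(rng.normal(0, 5, 12), 25),
    "item1": rng.integers(1, 5, 300),
})

icc1, icc2 = icc_from_anova(df, "program", "gen_learn")
mean_rwg = df.groupby("program")["item1"].apply(lambda v: rwg(v, n_options=4)).mean()
print(f"ICC(1)={icc1:.3f}  ICC(2)={icc2:.3f}  mean rwg={mean_rwg:.3f}")
```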
LeBreton and Senter (2008) suggested the use of an rwg value greater than 0.70 as an acceptable level of agreement, but noted that the final choice of an acceptable value must be grounded in theory.  They suggested that if decisions made as a result of these surveys were considered low stakes, then a lower estimate might be acceptable; however, if decisions based on these responses were considered high stakes, then a much higher value, likely greater than 0.90, would be required.

3.5.2.2  Within and Between Analysis

The within and between analysis (WABA), as described by Dansereau, Cho and Yammarino (2006), was used to determine which levels of analysis provide evidence of valid inference.  The conceptual basis of WABA is to determine whether interpretations from data aggregated to the group level are valid, or whether the data should remain at the individual level.  It entails a three-step process that examines variability patterns to determine whether aggregation to the group level is appropriate for these measures of learning.  These analyses include two group-level inferences, "wholes" or "parts," and two individual-level inferences, "equivocal" or "null."  When the inference drawn is "wholes," it means that the results can be aggregated to the group level because the group is homogeneous in its responses.  If the inference indicates a "parts" level, then some level of aggregation may be appropriate, but not quite at the level of the group.  A "parts" inference could mean that there is some level of interdependency among respondents based on group membership, but that there are still too many differences among individuals within the group to support aggregation at the highest, "wholes," level.  In contrast, if the inference drawn is "equivocal," then it implies that aggregation beyond the individual is not valid and the researcher should use only disaggregated results.  Finally, if the inference results in "null," then the within-group variation is in error and, again, no level of aggregation is supported.

Dansereau, Cho and Yammarino (2006) identified two views of groups, "wholes" or "parts," and two views of individuals, "equivocal" or "null."  With the wholes view, groups are homogeneous and the use of group averages to represent group differences is valid.  Therefore, there is only variability across groups on the construct, not among individuals within each group.  When groups are heterogeneous, the use of group averages is not valid because there is within-group variability, and so it is appropriate to use deviation scores from group averages to represent differences within groups.  In the equivocal view of individuals, both the averages and the deviation scores are valid, because there is both between-group and within-group variability.  Finally, the null view states that there is no variability within or between groups.

The purpose of WABA differs from that of the ANOVA approach, though the techniques used are similar (O'Connor, 2004).  WABA examines correlations between study variables to assess the level or levels of analysis that may underlie an observed bivariate or multiple correlation.  Groups are identified, and the question of interest is whether one or some combination of groups at higher levels may or may not underlie an observed relationship.
Unlike the ANOVA approach, WABA does not assume that variation within groups is always in error, nor  86   does it assume that within-group variation always represents a non-group effect, because it might represent sub-group effects.   3.5.2.2.1  WABA Step One: Level of Inference Table 3.5 describes the criteria established by Dansereau and Yammarino (2000), and adapted by Griffith (2002, p. 121), that were used for this study when determining the appropriate levels of inference using the WABA approach.  In the first column are the four possible inferences that can be made, “wholes,” “parts,” “equivocal” and “null.”  The next three columns presented in Table 3.5 provide the information on how these levels of inferences can be drawn from the data.   The variance between program majors, or Varbn, is the mean sum of squares between programs in an ANOVA, in which program major is the independent variable and composite program-level variables served as the dependent variables.  The variance within programs, or Varwn, is the mean sum of squares within groups in an ANOVA, in which group membership is the independent variable and survey scalelets served as the dependent variables. The F-value provides information about the statistical significance of the variance comparisons; if it is larger than 1.00, then a “wholes” inference can be drawn, but if it is smaller than 1.00 then a “parts” inference should be drawn.  If the F-value is equal to 1.00 or equal to zero, then no level of aggregation is supported, and either an “equivocal” or “null” inference should be drawn.    87   The results for the eta-between (ηbn) and the eta-within (ηwn) provide information about the practical significance of aggregation.  When the eta-between is larger than 66% and the eta-within is smaller than 33%, then a “wholes” inference should be made.  If eta-between is smaller than 33% and eta-within is larger than 66%, then a “parts” inference should be drawn; however, when eta-between is larger than 33% but smaller than 66%, and the eta-within is larger than 66%, no level of aggregation is practically supported, and an “equivocal” or “null” inference should be made.  Table 3.5 WABA Criteria for Determining Level of Inference for a Composite Variable Four Possible Inferences Variance ComparisonBetween-group DifferencesEffect Size Wholes Varbn > Varwn F > 1.00 ηbn > 66% and ηwn < 33% Parts Varbn < Varwn F < 1.00 ηbn < 33% and ηwn > 66% Equivocal Varbn  = Varwn > 0 F = 1.00 33% < ηbn > 66%  and ηwn > 66%  Null Varbn  = Varwn  = 0 F = 0.00 3.5.2.2.2  WABA Step Two: Bivariate Analysis      The criteria detailed in Table 3.6, which were adapted from Dansereau and Yammarino (2000) by Griffith (2002, p.122), were used to interpret the results for these next steps in the WABA aggregation analysis.  Table 3.6 includes two sections, where the first column shows the criteria used to understand the correlations between the dependent variable and the independent variables at the program level and at the student level.  The second section of Table 3.6 provides the criteria for interpreting the decomposition of the correlations between the dependent variables (student  88   perceived general learning outcomes), and the independent variables (learning environment composite ratings) into their between- and within-components.   
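The quantities that the three WABA steps operate on (the eta-between and eta-within correlations, the between- and within-program correlations, and their products) can be computed from student-level data as sketched below; the decomposition itself is described in step three, which follows. This is a simplified illustration with hypothetical variable names, not the WABA software routine used in the study.

```python
import numpy as np
import pandas as pd

def waba_components(df, group, x, y):
    """Etas and the WABA decomposition of the raw correlation r_xy into its
    between- and within-program components (cf. Tables 3.5 and 3.6)."""
    d = df[[group, x, y]].dropna()
    gm = d.groupby(group)[[x, y]].transform("mean")   # each student's program mean
    dev = d[[x, y]] - gm                              # deviation from program mean

    eta_b = {v: np.corrcoef(d[v], gm[v])[0, 1] for v in (x, y)}   # eta-between
    eta_w = {v: np.corrcoef(d[v], dev[v])[0, 1] for v in (x, y)}  # eta-within

    r_between = np.corrcoef(gm[x], gm[y])[0, 1]
    r_within = np.corrcoef(dev[x], dev[y])[0, 1]
    r_total = np.corrcoef(d[x], d[y])[0, 1]

    between_comp = eta_b[x] * eta_b[y] * r_between
    within_comp = eta_w[x] * eta_w[y] * r_within      # between + within = r_total
    return eta_b, eta_w, between_comp, within_comp, r_total

# Simulated example: a learning-environment rating and the perceived-learning composite.
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "program": np.repeat([f"prog{i}" for i in range(15)], 20),
    "support": rng.normal(60, 15, 300),
})
demo["gen_learn"] = 0.3 * demo["support"] + rng.normal(40, 10, 300)

eta_b, eta_w, b_comp, w_comp, r_xy = waba_components(demo, "program", "support", "gen_learn")
print(f"between={b_comp:.3f}  within={w_comp:.3f}  total r={r_xy:.3f}")
```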
3.5.2.2.3  WABA Step Three: Decomposition of Correlations

This third step in the WABA approach was referred to as multiple relationship analysis by O'Connor (2004) because it extends the inferences made in the second step of the WABA process by investigating the boundary conditions of those step-two inferences.  Table 3.6 presents the criteria for the third step in the WABA analysis, which first compares the correlations between perceived student learning and the learning environment variables, and then decomposes that correlation into its between-component and within-component parts.  The between-component value is calculated by multiplying the correlation of student ratings and program average ratings on the learning environment variables (eta-between; ηbn x) with the correlation of student ratings and program average ratings on perceived general learning (eta-between; ηbn y), and with the correlation between the learning environment variables and perceived general learning at the program level (rbn xy).

The within-component value is calculated by multiplying the correlation of student ratings and student deviation scores from the program mean on the independent variables (eta-within; ηwn x) with the correlation of student ratings and student deviation scores from the program mean on student perceptions about their general learning outcomes (eta-within; ηwn y), and finally by the correlation between the dependent variable and the independent variable at the student level (rwn xy).

Table 3.6
WABA Criteria for Determining Aggregation and the Relationship Between Two Variables
Decomposition identity: (ηbn x)(ηbn y)(rbn xy) + (ηwn x)(ηwn y)(rwn xy) = rxy

Level of Inference    Comparison of Correlations between X and Y    Decomposition of Correlations between X and Y
"Wholes"              rbn xy > rwn xy                                Between component > Within component
"Parts"               rbn xy < rwn xy                                Between component < Within component
"Equivocal"           rbn xy = rwn xy > 0                            Between component = Within component > 0
"Null"                rbn xy = rwn xy = 0                            Between component = Within component = 0

3.5.2.3  The Unconditional Two-Level Multilevel Approach

The third approach to examining the appropriate level of data aggregation and analysis used an unconditional two-level multilevel model, which is usually the first step in building a full multilevel model and provides information on whether student-level responses show sufficient between-program variability to warrant a multilevel model (Raudenbush & Bryk, 2002).  This model is called the unconditional two-level model because it includes the student and program levels but is not conditioned on any independent variables; it includes only the dependent variable, as shown by Equations 3.4 and 3.5.  In this way, the variable's variance is partitioned into its within- and between-program components.

Level-One:   Y_{ij} = \beta_{0j} + r_{ij} .       (3.4)

where i indexes students.

Level-Two:   \beta_{0j} = \gamma_{00} + u_{0j} .       (3.5)

where j indexes programs.

The unconditional multilevel model provides three results that help determine the appropriateness of examining the dependent variable as a group characteristic: the proportion of variance in the dependent variable explained by group membership compared to the total within-group variance; the extent to which member responses on the dependent variable indicate group responses; and the extent to which variation in the dependent variable can distinguish group membership.
Using the notation described by Raudenbush and Bryk (2002), the proportion of variance explained by the unconditional multilevel model was calculated using Equation 3.6.

\text{Proportion of Variance Explained} = \hat{\tau}_{00} / (\hat{\tau}_{00} + \hat{\sigma}^{2}) .   (3.6)

where \hat{\tau}_{00} is the estimated between-program variance and \hat{\sigma}^{2} is the estimated within-program (residual) variance.

The unconditional two-level multilevel analysis is also referred to as a one-way ANOVA with random effects, and differs from the first ANOVA approach in that it simultaneously models the student and program levels by taking the multilevel structure of the data into account.  Rather than treating variability across program majors as error, as the ANOVA approach does, this approach models variability in student perceptions about learning as a function of program membership.  In addition, the unconditional two-level multilevel model provides information about the presence of program-level variance (ICC(1)) and determines whether there are significant differences among program majors in perceived general learning outcomes.

All multilevel models applied in this study used restricted maximum likelihood estimation (REML), followed by an examination of the empirical Bayes (EB) point estimates by program across UBC's two campuses.  The REML procedure was selected over the full maximum likelihood (MLF) approach because, according to Bryk and Raudenbush (1992), the fixed effects under the MLF procedure are assumed to be known, so that the posterior variance of β_qj does not reflect uncertainty about them; however, they suggested that this assumption is more tenable for samples with a large number of higher-level units.  They recommended that when there are fewer higher-level units, the REML procedure provides more realistic posterior variances than MLF.  REML procedures were used because the number of higher-level units included for the NSSE samples in this study is at the lower bound of the number of units recommended for multilevel modelling.

When sample sizes are fairly small, particularly when the number of higher-level units is less than 30, and the group sizes are unbalanced, the benefits of multilevel modelling techniques are challenged because these techniques rely on large-sample theory (Raudenbush & Bryk, 2002).  In such situations, the use of EB estimates is recommended because they take full account of the uncertainty around σ² and T (the level-two variance-covariance matrix) in estimating the program-level γs and their standard errors (Raudenbush & Bryk, 2002).  Due to the smaller number of higher-level units for the samples included in this study, EB point residual estimates with 95% credible values were graphically produced, and a visual examination was conducted to compare average student perceptions about their general learning outcomes across program major groupings.

3.5.3  Examining Program-Level Characteristics and Student Perceptions

Once the reliability of aggregated perceived learning outcomes had been determined, the next step was to investigate the effects of the learning environment on the outcome variables (Lüdtke et al., 2009).  The multilevel regression approach (also known as hierarchical linear modeling, HLM; Raudenbush & Bryk, 2002) is the most appropriate method for examining research questions about university effectiveness in this way (Astin & Denson, 2009).  The following section describes how a multilevel regression technique is used to examine the linkages between student perceptions about their general learning outcomes and program characteristics.
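The unconditional model in Equations 3.4 to 3.6 can also be reproduced with general-purpose software; the study itself used HLM-style REML estimation, so the following statsmodels sketch with simulated data and hypothetical variable names is only an illustration of the same calculation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated analysis file: a perceived-learning composite for students
# nested in 25 program majors.
rng = np.random.default_rng(11)
n_prog, n_per = 25, 20
df = pd.DataFrame({
    "program": np.repeat([f"prog{i}" for i in range(n_prog)], n_per),
    "gen_learn": rng.normal(70, 12, n_prog * n_per)
                 + np.repeat(rng.normal(0, 4, n_prog), n_per),
})

# Unconditional (null) two-level model: random intercept for program, REML.
null_fit = smf.mixedlm("gen_learn ~ 1", data=df, groups=df["program"]).fit(reml=True)

tau00 = null_fit.cov_re.iloc[0, 0]       # between-program variance (tau-hat_00)
sigma2 = null_fit.scale                  # within-program residual variance (sigma-hat^2)
icc = tau00 / (tau00 + sigma2)           # Equation 3.6
print(f"tau00={tau00:.2f}  sigma2={sigma2:.2f}  ICC={icc:.3f}")

# Shrunken program-level intercepts, analogous to the EB point estimates
# examined graphically in this study.
eb_means = {prog: null_fit.fe_params["Intercept"] + ranef.iloc[0]
            for prog, ranef in null_fit.random_effects.items()}
```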
The first three subsections describe the dependent variables and the student- and program-level independent variables used in this study.  Next, each design step in building a multilevel regression model is explained.   93   3.5.3.1  The Dependent Variable  Student perceptions of general learning was the dependent variable included in this study for all analyses; however, this variable was developed separately using items from the NSSE and the UES for each corresponding study sample.  Details of how this dependent variable was determined for each study sample are described below. The NSSE question asked students, “How much has your experience at this institution contributed to your knowledge, skills and personal development in the following areas: writing clearly and effectively, speaking clearly and effectively, acquiring broad general education, and thinking critically and analytically.”  Students were asked to rate their responses on a four-point scale: “very much, quite a bit, some, and very little.”  These four items were combined using the summed score procedure to create a composite score on student perceptions about their general learning outcomes as described in Section 3.5.1.    The UES question asked students, “Please rate your ability in the following areas when you started at UBC and now: ability to be clear and effective when writing, ability to speak clearly and effectively in English, ability to read and comprehend academic material, and analytical and critical thinking skills.”  Students were asked to rate their responses to these four items using a six-point scale; “very poor, poor, fair, good, very good, and excellent.”  A composite score on student perceptions about  94   their general learning outcomes was created for each student in the UES sample using the summed score procedure.   3.5.3.2  The Independent Variables:  Student-Level Five variables were included as student-level independent variables in both the single level and multilevel regression models: (1) entering grade point average (GPA); (2) current grade point average (GPA); (3) gender; (4) international status; and (5) direct entry status.  These five variables were included in this study to separate out any conditional effects of program characteristics on student perceptions about their general learning by examining the effects for students of differing pre-university academic ability levels and demographic characteristics.   Entering GPA is a percentage earned by the student in previous educational institutions (e.g., high school or other higher education) and was recorded in the student administrative files as the students’ entering grade used for admission to the university.  It was included in this study as a background variable at the student-level on academic ability.   Current GPA is the accumulated percentage earned by the student from the start of their higher education experience at the institution to the point in time of their invitation to the UES or the NSSE.  This variable was included in this study as another measure of academic ability to determine the extent to which it is related to perceived general student learning.    95   Gender is a dichotomous variable dummy-coded: 1=male and 0=female.  There are differences in success rates for female students compared with males for particular programs at the University where the proportion of male students is significantly larger than females, such as in Engineering.  
Therefore, it was important to include gender as a variable when investigating learning outcomes across programs in this study.

International status is included in this study as another background characteristic of the student.  This variable is collected on each of the assessments and is also available from the student administrative file.  It is a dummy-coded variable where 0=international and 1=domestic.  It is important to determine whether international status has a differentiating effect on learning outcomes in this study because the literature on international student learning outcomes has been contradictory.  Some researchers have found evidence that international students seem to be more engaged than domestic students, yet have reported lower levels of learning outcomes (Carini et al., 2005).

Direct entry status is a dummy-coded variable where 1=direct entry and 0=transfer students.  Students are considered direct entry if they registered at UBC as first-year students and remained enrolled there until their fourth year.  It is important to determine whether the students who spent more time on campus have different learning outcomes from those who transferred to the campus after experiencing higher education elsewhere.

The first level of the model (the student model) examines the relationships between student perceptions about their general learning and six parameters: an intercept and five regression coefficients (entering GPA, current GPA, gender, international status, and direct entry status).  Using the notation described by Raudenbush and Bryk (2002), Equation 3.7 shows the relationships used for level one.

Level-One:

Y_{ij} = \beta_{0j} + \beta_{1j}(X_{1ij} - \bar{X}_{.1j}) + \beta_{2j}(X_{2ij} - \bar{X}_{.2j}) + \beta_{3j}(X_{3ij} - \bar{X}_{.3j}) + \beta_{4j}(X_{4ij} - \bar{X}_{.4j}) + \beta_{5j}(X_{5ij} - \bar{X}_{.5j}) + r_{ij} .     (3.7)

Or, more succinctly,

Y_{ij} = \beta_{0j} + \sum_{q=1}^{5} \beta_{qj}(X_{qij} - \bar{X}_{.qj}) + r_{ij} , \quad q = 1, 2, 3, 4, 5.     (3.8)

Where

Y_ij is student i's score regarding their perceptions about their general learning outcomes in program j, which is assumed to be normally distributed, y ~ N(XB, Ω);

β_0j is the average rating concerning student perceptions about their learning outcomes for program j;

β_1j is the average influence of entering GPA on average student perceptions about their learning outcomes in program j;
Within the level-one model, each program can have a different average student rating about learning outcomes (i.e., intercept), and a different relationship with five student-level variables on these average program ratings (i.e., slope).   3.5.2.3  The Independent Variables: Program-Level To include attributes of program quality that may explain differences among student perceptions about their general learning outcomes, five variables about the  98   higher education learning environment were included in this study for all samples: (1) proportion of international students; (2) proportion of male students; (3) student-to-faculty ratio; (4) campus location; and (5) general purpose operating (GPO) budget dollars per student full-time equivalent (FTE).  In addition, three composite variables created from the survey items were used to describe student perceptions about the learning environment if aggregation to the program level was supported: (1) overall satisfaction; (2) campus climate for diversity; and (3) supportive campus environment.  None of these learning environment scales were supported by any of the three aggregation procedures when aggregated to the program level, and as such were excluded from the multilevel regression analyses. These variables were included to examine the relationship between student perceptions of their learning environment and their ratings regarding their perceived general learning outcomes.  The proportion of males and proportion of international students were included at the higher-level unit to determine if there were any differentiating effects for these student subpopulations across programs, and campus was included to compare across equivalent programs across campuses.   Studies have shown positive relationships between these program attributes and student academic achievement (Keeling et al., 2008; Lüdtke et al., 2009; Porter, 2006; Price, 2011).  A more detailed explanation of how these variables were derived for each study sample is provided below.  99   Overall satisfaction is included as a measure of institutional and program quality and is a self-reported measure by the student collected from items on the NSSE and the UES.   The two items from the NSSE used to calculate this variable are:   how would you evaluate your entire educational experience at this institution (response scale: excellent, good, fair, poor); and  if you would start over again, would you go to the same institution you are now attending (response scale: definitely yes, probably yes, probably no, definitely no).   Three items from the UES were used to measure overall satisfaction:   please rate your level of satisfaction with the overall academic experience (response scale: strongly disagree, disagree, somewhat disagree, somewhat agree, strongly agree); and  how likely would you encourage others to enrol at UBC (response scale: strongly disagree, disagree, somewhat disagree, somewhat agree, strongly agree).  I feel that I belong at UBC (response scale: strongly disagree, disagree, somewhat disagree, somewhat agree, strongly agree).  Campus climate for diversity is another measure of institutional quality used in this study, which was also self-reported by the student on items on the NSSE and  100   UES.  There were three items from the NSSE measuring diversity (response scale: very much, quite a bit, some, very little):   In your experience at your institution during the current school year, about how often have you done each of the following? 
 how often have you had serious conversations with students of a different race or ethnicity than your own;   how often have you had serious conversations with students who differ from you in terms of their religious beliefs, political opinions, or personal values; and  to what extent does your institution emphasize encouraging contact among students from different economic, social and racial or ethnic backgrounds.   There were six items from the UES that were used to calculate the campus climate for diversity variable for the UES sample in this study (Chatman, 2011; Thomson & Alexander, 2011) (response scale for all items: strongly disagree, disagree, somewhat disagree, somewhat agree, agree, strongly agree):   students are respected here regardless of their sexual orientation;   students are respected here regardless of their economic or social class;   students are respected here regardless of their race or ethnicity;   students are respected here regardless of their gender;   students are respected here regardless of their political beliefs; and  students are respected here regardless of their physical ability/disability.  101   Supportive campus environment is the third measure of program quality included in this study and was calculated from items on the NSSE and UES.  This construct was split into two scales for the NSSE:  Support for Student Success, and Interpersonal Environment. Support for Student Success:  To what extent does your institution emphasize each of the following?  providing the support you need to help you succeed academically (response scale for the first three items: very much, quite a bit, some, very little);   helping you cope with your non-academic responsibilities such as work, family, etc.;   providing the support you need to thrive socially; quality of relationships with other students;  Interpersonal Environment: Mark the box that best represents the quality of your relationships with people at your institution.  relationships with other students (response scale: (1) unfriendly, unsupportive, sense of alienation, to (7) friendly, supportive, sense of belonging);   relationships with faculty members (response scale: (1) unavailable, unhelpful, unsympathetic, to (7) available, helpful, sympathetic); and   102    relationships with administrative personnel (response scale: (1) unavailable, unhelpful, unsympathetic, to (7) available, helpful, considerate, flexible. There were also six items on the UES that were used to create the supportive campus environment variable (response scale for all items: strongly disagree, disagree, disagree somewhat, agree somewhat, agree, strongly agree):   I feel respected as an individual on this campus;   there is a clear sense of appropriate and inappropriate behaviour on this campus;   I am proud to attend UBC;   most students are proud to attend UBC;   UBC communicates with students and values their opinions; and   overall satisfaction with student life and campus experience. Proportion of International Students was included as a characteristic of the program for this study.  Students in this study were considered International if they were not Canadian citizens or permanent residents of Canada and they paid international fees while enrolled at the University.  This variable was collected from the University Registrar administrative records and was dummy-coded at the student level where 0=international and 1=domestic.  
The student file was aggregated to the program level, which involved summing the number of international undergraduate students for each program and taking this total as a proportion of international students per program included in the study.

Proportion of Male Students was included in this study as another characteristic of the program.  There are programs at the University where the population is predominantly male or female, and two programs, Science and Engineering, have implemented retention efforts designed specifically to increase the achievement of female students.  It is important to see if the effects by program are amplified for female students.  This variable was dummy-coded at the student-level where 1=male and 0=female.  Again, the student file was aggregated to the program level, which involved summing the number of males for each program and taking this total as a proportion of the males per program.

Student-to-Faculty Ratio was calculated to determine how many students per faculty member are taught at the program level.  It was calculated by dividing the number of taught student full-time equivalents by the number of full-time equivalent tenured or tenure-track faculty positions, excluding sessional positions.  The Planning and Institutional Research Office reports this information by program on their website.  It was used in this study as an example of a traditional ideal in determining quality of higher education.  This variable was included in this study to determine if the student-per-faculty ratio was related to student perceptions about their general learning outcomes.

Campus location was a dichotomous variable where 1 indicated the Vancouver campus and 0 the Okanagan campus.  The campus location was included to determine if differences in student perceptions about their general learning existed between equivalent programs across the two campuses.

General Purpose Operating (GPO) Budget Dollars per Student FTE is a calculation used to determine the operating budget allocated per individual student for each program of study.  This variable was included in this study to determine if lower or higher budget dollars per student FTE was related to perceived general student learning outcomes.

The second level of the model (program model) examines the effects of the program-level variables on the within-student relationships as shown in Equation 3.9.  Of the level-one coefficients, gender, international status, and direct entry status were considered fixed, while entering GPA and cumulative GPA were specified as random in the level-two model.

$$\beta_{qj} = \gamma_{q0} + \sum_{k=1}^{5} \gamma_{qk}\,(W_{kj} - \bar{W}_{k}) + u_{qj}, \qquad q = 0, 1, 2, 3, 4, 5; \; k = 1, 2, 3, 4, 5, \quad (3.9)$$

where

$W_{1j}$ = student-to-faculty ratio;

$W_{2j}$ = GPO per FTE;

$W_{3j}$ = proportion of male students;

$W_{4j}$ = campus location;

$W_{5j}$ = proportion of international students;

$\gamma_{00}, \ldots, \gamma_{05}$ = level-two intercept/slopes to model $\beta_{0j}$;

$\gamma_{10}, \ldots, \gamma_{15}$ = level-two intercept/slopes to model $\beta_{1j}$;

$\gamma_{20}, \ldots, \gamma_{25}$ = level-two intercept/slopes to model $\beta_{2j}$;

$\gamma_{30}, \ldots, \gamma_{35}$ = level-two intercept/slopes to model $\beta_{3j}$;

$\gamma_{40}, \ldots, \gamma_{45}$ = level-two intercept/slopes to model $\beta_{4j}$;

$\gamma_{50}, \ldots, \gamma_{55}$ = level-two intercept/slopes to model $\beta_{5j}$; and

$u_{0j}, \ldots, u_{5j}$ = level-two random effects (gender, international status, and direct entry status are level-two fixed effects).

Because there were two level-two random effects, entering GPA and current GPA, the variances and covariance among them form a 2 x 2 matrix.
T = 111000  Finally, the combined model is stated in Equation 3.10:    5151 0000)()(j j kkjqjqqijqij WWXXY   1 1251)())((jijojjqqijjj kkjjqqijqk ruXXuWWXX .         (3.10) The part of the equation,  251))((j kkjjqqijqk WWXX , represents the cross-level interaction between the student-level model ܺ௤௜௝ (group-mean centred) variables and  106   the program-level model ௞ܹ௝	(grand-mean centred) variables.  These can be understood as the moderating effects, or interaction, of program-level variables on the relationships between student-level variables ܺ௤௜௝ and the outcome ௜ܻ௝.   In the current study, only entering GPA and current GPA were considered random effects at the program-level, while gender, international status and direct entry were considered fixed.  This was because there were some programs included in the samples in which these three variables were homogeneous and did not vary across programs, or varied minimally across programs.  For example, there were few female respondents from the Engineering programs, and few male respondents from the Nursing programs.   3.5.3.4  Multilevel Regression Model Design       The multilevel model for this study began with the unconditional two-level multilevel model without independent variables at any level, which was described in section 3.5.1.3, under aggregation techniques using Equation 3.4.        The next step involved building a random-coefficient model with independent variables at the student level only, and then an intercepts- and slopes-as-outcomes model with independent variables at the both the student and program level. Using the notation described by Raudenbush and Bryk (2002), the proportion of variance explained by the random-coefficient multilevel model over the unconditional model was calculated using Equation 3.11.  107   Proportion of Variance Explained =   ఙෝమሺ௡௨௟௟	௠௢ௗ௘௟ሻି	ఙෝమ	ሺ௥௘௚௥௘௦௦௜௢௡	௠௢ௗ௘௟ሻ	ఙෝమሺ௡௨௟௟	௠௢ௗ௘௟ሻ        (3.11) The final model, known as the intercepts- and slopes-as-outcomes model, was used in this study to illustrate how differences among programs in their institutional characteristics might influence the distribution of perceived learning outcomes within and among programs (Raudenbush & Bryk, 2002).  This multilevel approach was conducted using the software program, HLM 7.0 Hierarchical Linear and Nonlinear Modeling (Raudenbush, Bryk & Congdon, 2000).   The proportion of variance explained by the full intercepts- and slopes-as-outcomes model over the random-coefficient multilevel model was calculated using Equation 3.12 (Raudenbush & Bryk, 2002). Proportion of Variance Explained =   ఙෝమሺ௥௘௚௥௘௦௦௜௢௡	௠௢ௗ௘௟ሻି	ఙෝమ	ሺ௙௨௟௟	௠௢ௗ௘௟ሻ	ఙෝమሺ௥௘௚௥௘௦௦௜௢௡	௠௢ௗ௘௟ሻ       (3.12) By using a two-level multilevel model approach, this study could explain the differentiating effect of level-two program characteristics on level-one student characteristic relationships.  Two-level intercepts- and slopes-as-outcomes multilevel models were designed for each of the NSSE and UES samples, where aggregation of student’s perceived general learning outcomes to the program major level was appropriate.  The within-group student relationships are represented by the regression coefficients in the level-one models.  The effects of program-level variables on each of these relationships are represented in the level-two models.  The results from these models indicated whether residual slope variability remained after adding program  108   contextual variables.   
Another benefit of multilevel modeling is that programs do not need to be balanced in numbers; for example, Program A does not have to have the same number of students in the sample as Program Z.  This is an important consideration for the current study because student enrollments vary considerably across programs in the samples.  Multilevel modeling provides a powerful framework for exploring how average relationships vary across hierarchical structures.

Although there is no statistically correct choice for centering level-one variables (Davison et al., 2002; Hofmann & Gavin, 1998; Kreft, de Leeuw, & Aiken, 1995), the choice of location should be grounded in theory.  Lüdtke et al. (2009) found that conclusions about the effects of learning environments can be substantially affected by the choice of centering when using aggregated student ratings.  As such, they argued that researchers conducting multilevel analyses must be very clear on their choice for centering the independent variables.  Only when group-mean centering is applied to level-one variables can the cross-level interaction and between-group interaction be differentiated (Hofmann & Gavin, 1998).  This study focused on the cross-level interaction and the between-group interaction by examining the relationships among program majors on students' perceptions of their general learning, and how belonging to a particular program influenced the program-level relationships and the individual student-level ratings (Kreft, de Leeuw, & Aiken, 1995; Lüdtke et al., 2009).  As such, the level-one independent variables in this study were group-mean centred.

The first level of the model (student model), described using Equations 3.7 and 3.8, examined the relationship between perceived general learning outcomes and six parameters: an intercept and five regression coefficients (entering grade, cumulative grade, gender, international status, and direct entry status).

The second level of the model (program model), described using Equation 3.9, examined the effects of the program-level variables (student-to-faculty ratio, GPO per FTE, proportion male, proportion international, and campus location) on the within-student relationships.  Of the level-one coefficients, gender, international status, and direct entry status were considered fixed, while entering grade and cumulative grade were specified as random in the level-two model.  Gender, international status, and direct entry were considered fixed because there were some programs included in the sample in which these three variables were homogeneous and did not vary, or varied minimally.

Part of this model represents the cross-level interaction between the group-mean centred student-level variables and the grand-mean centred level-two variables.  These cross-level interactions can be understood as the moderating effects, or interactions, of level-two variables on the relationships between level-one independent variables and the outcome variable.

Prior to analysing data using a multilevel approach, several model assumptions need to be considered.  These assumptions are discussed in the following section.

3.5.3.4.1  Model Assumptions for Multilevel Modeling

An assumption of multilevel modeling is that at level-one, errors are normally distributed and homogeneous.  Raudenbush and Bryk (2002) suggested that the estimation of the fixed effects and their standard errors will be robust to violations of this assumption.
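One way to screen this assumption is to inspect the level-one residuals directly.  The sketch below is a minimal, hedged illustration using synthetic stand-in data; in practice the residuals and program labels would come from a fitted model such as the one sketched earlier.

```python
# Rough checks of the level-one assumptions: approximate normality of the
# residuals and homogeneity of residual variance across programs.
# Synthetic stand-in data are used so the sketch runs on its own.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
programs = np.repeat(np.arange(20), 30)               # 20 hypothetical programs, 30 students each
resid = rng.normal(0.0, 20.0, size=programs.size)     # stand-in level-one residuals

# Normal probability (Q-Q) check: r close to 1 suggests approximate normality
(_, _), (slope, intercept, r) = stats.probplot(resid, dist="norm")
print(f"probability plot r = {r:.3f}")

# Levene's test of equal residual variance across programs
groups = [resid[programs == j] for j in np.unique(programs)]
stat, p = stats.levene(*groups)
print(f"Levene W = {stat:.2f}, p = {p:.3f}")
```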
Also for level-one, under the null hypothesis, it is expected that the average achievement for program j will equal the average program mean for all j programs, and the slopes of program j will equal the average of the slopes for all j programs.  At level-two, the tau ( ) values are the variances of the intercepts and slopes, and the covariance between them, and it is assumed that the program-level residuals follow a multivariate normal distribution with variances and covariances.  This dependency violates the assumption in single level regression of independent errors across observations, but can be handled using a multilevel approach (Heck & Thomas, 2000).  A further consideration of multilevel modeling is that missing data are acceptable at level-one, but there cannot be missing data at level-two (Raudenbush & Bryk, 2002).   A second assumption is that the efficiency and power of the multilevel approach rests on data pooled across the units comprising the student- and program-levels, which implies the requirement of a large dataset.  It is generally accepted that there is adequate statistical power when the sample includes 30 groups of 30 observations each, or even 150 groups each with five observations.  Mok (1995) provided a detailed simulation analysis of sample size requirements for two-level balanced designs in educational research.  The author found that the between-group variances were  111   noticeably biased in models with only five level-two units.  With only a few level-two units, observations are limited, which makes it difficult to get an accurate estimate of between-group variability.  In addition, with less than adequate statistical power there is an unacceptable risk of not detecting cross-level interactions; however, in practice it has been deemed acceptable to have at least 20 higher-level units (Centre for Multilevel Modelling, 2011). The NSSE and UES samples used in this study met the recommended sample size requirements of at least 20 level-two units for multilevel modeling.  In addition, the application of EB estimators have been applied post-hoc to multilevel analyses with small number of higher-level units and sample sizes (Raudenbush & Bryk, 2002).  This study used EB estimates to compare results among programs; this approach is further described in the next section. 3.5.3.4.2  Empirical Bayes Estimators      EB estimates are calculated on the assumption that students belong to a definable population where they can be assigned a quantity that will locate them within the underlying probability distribution of this statistical population.  These EB estimates provide the best prediction of the unknown student-level parameter or residual for a particular program major, which uses not only data from that specific program, but also combines the data from all other similar programs to estimate each element of the student-level.  EB estimators minimize the expected distance between the  112   unknown parameter, or the residual, and the estimator, which is why they are considered to be shrunken estimates (Raudenbush & Bryk, 2002).        EB estimation procedures are more beneficial than ordinary least squares (OLS) estimation procedures or analysis of covariance models because, unlike the OLS estimation, EB estimation takes into account group membership when the number of programs are large, and produces relatively stable estimates even when sample sizes per program are small or moderate (Raudenbush & Bryk, 2002).      
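The shrinkage logic behind these EB estimates can be written compactly for the unconditional model: each observed program mean is pulled toward the grand mean in proportion to how unreliable it is.  The sketch below is a minimal illustration of that weighting; the variance components, program means, and program sizes shown are hypothetical placeholders (in practice they would come from the fitted unconditional model).

```python
# Empirical Bayes shrinkage of program means under the unconditional model:
#   lambda_j = tau00 / (tau00 + sigma2 / n_j)      (reliability of program j's mean)
#   EB_j     = lambda_j * ybar_j + (1 - lambda_j) * gamma00
# All inputs below are hypothetical placeholders.
import numpy as np

tau00, sigma2, gamma00 = 40.0, 450.0, 65.0        # between-variance, within-variance, grand mean
ybar = np.array([72.5, 60.3, 69.8])               # observed program means
n = np.array([12, 45, 8])                         # students sampled per program

lam = tau00 / (tau00 + sigma2 / n)                # small programs receive small weights
eb = lam * ybar + (1 - lam) * gamma00             # shrunken (EB) program means
for j, (weight, estimate) in enumerate(zip(lam, eb)):
    print(f"program {j}: lambda = {weight:.2f}, EB mean = {estimate:.1f}")
```

The program with only eight respondents is shrunk most strongly toward the grand mean, which is exactly the behaviour described above: less reliable program-level information receives less weight.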
In this study, this Bayesian model hypothesizes that all students within a particular program major will have an unknown effect added to their expected score as a result of their being a member of that program.  With increasing reliability of the level-one estimates, more weight is given to the individual student characteristics, which is based on information from the entire sample of programs.  In contrast, if the level-one estimates are unreliable, then more weight will be given to the program characteristics.        When EB estimators are used in a multilevel model they can provide stable indicators for judging individual program performance (Raudenbush & Bryk, 2002).  Researchers also calculate the EB residual estimates for comparing between groups of programs.  This type of model hypothesizes that all students within program j have an effect that is added to their expected score as a result of program membership.   The EB point estimates and 95% credible values were calculated in the current study using the software program, Hierarchical Linear and Nonlinear Modeling 7.0  113   (HLM; Raudenbush, Bryk & Congdon, 2000).  The program estimates from the NSSE and the UES are reported in Chapter Four. 3.6  Summary of Research Methodology      This study used an exploratory approach to examine the multilevel validity of program-level inferences drawn from multilevel surveys.  By using the methods outlined in this chapter, the current investigation aimed to clarify and extend previous research concerning the measurement of learning outcomes using survey data within higher education in Canada as a measure of program effectiveness/quality.  Many researchers aggregate these outcomes to report on measures of program or institutional quality without first testing the validity of levels of aggregation and interpretation.  It is important to determine the appropriateness of aggregation to ensure that when we use these program-level measures they can be used to make valid judgments about programs.        The first objective of this study was to establish the appropriateness of aggregating the composite ratings:  general learning outcomes, overall satisfaction; campus climate for diversity; and supportive campus environment.  The first step used the common traditional approach using a single-level EFA model to examine items contributing to each of the study scales, and then two-level MFAs were conducted to determine if the survey items used to create composite ratings in this study were consistent across the student and program levels.  Next, the appropriateness of aggregating these composite scores to the program level was determined using three different statistical techniques:   114   ANOVA, WABA and an unconditional multilevel model.  These three approaches were considered in this analysis because each offers a different measurement perspective in determining the appropriateness of aggregation.  The ANOVA approach considers the interdependency of student responses to program membership as error, and provides information on whether it is appropriate or inappropriate to aggregate to the program level.  In contrast, the WABA and the unconditional multilevel model techniques allow students to respond based on their program membership, and the WABA provides additional information on whether there is a more suitable level of aggregation that is not at the program level.  
These methods were selected to demonstrate to the reader the importance of providing sufficient evidence in support of multilevel validity, prior to drawing interpretations from aggregate student survey ratings.  The results of the MFA and the three different aggregation procedures were compared to determine if the choice in approach could lead to different results regarding aggregation and levels of analysis.        Another objective of this study was to determine the program characteristics that were related to improved student perceptions about their learning outcomes across the NSSE and UES samples.  Using two-level multilevel regression models, differences among program majors on perceived general learning outcomes and ratings about the learning environment were examined to determine how they varied by student- and program-level characteristics.  The research findings from these analytic approaches are discussed in detail in Chapter Four.   115   4  Chapter Four:  Research Results 4.1  Introduction The purpose of this study was to determine the appropriateness of aggregating individual student level survey responses regarding their perceived general learning outcomes for evaluating quality and/or effectiveness across university programs.  The findings of this study are reported in two main sections in this chapter.  The first half of this chapter is dedicated to reporting the study results for the NSSE example and the second half of this chapter reports the results for the UES example.  Each of these two mains sections are comprised of subsections that report on the findings from each of the analyses described in Chapter Three:  the two-level MFA analyses, descriptive statistics for the composite ratings, results for each of the three aggregation procedures, and the multilevel regression models.  Each main section concludes with an overall summary of findings for each example: the NSSE, and the UES.  The final section of this chapter includes an overall summary of the research findings.        For each of the NSSE and UES examples, the two-level MFA was conducted to determine if the combined items to create composite ratings for analysis at the student level were invariant when aggregated to the program level.  Next, an ANOVA with random effects, the WABA, and an unconditional two-level multilevel modelling procedure was conducted to examine the appropriateness of aggregation in this study.  These three aggregation approaches were applied separately to each of the NSSE samples and the UES samples to answer the first research question in this study:  Are  116   multilevel inferences made from student perceptions about their general learning outcomes and ratings of the learning environment valid when aggregated to the program major level?  Multilevel regression modelling procedures were applied to the survey data to address the second research question in this study:  What student- and program-level characteristics, and cross-level relationships, are associated with student perceptions about their general learning outcomes? 4.2  Results for the NSSE Example      The upcoming sections report the findings from the three analytic procedures used on NSSE data from 2011 for first- and fourth-year students at UBC’s Vancouver and Okanagan campuses.  The first section provides information on the descriptive statistics for the NSSE composite ratings for each year level and the two-level MFA results for the items contributing to composite ratings used in this study.  
The following sections report the results of each step in the three aggregation procedures and the results of the two-level multilevel regression models.  Each section ends with a brief summary of the research findings for NSSE.

4.2.1  NSSE Composite Ratings

The means, standard deviations, and intercorrelations of the NSSE items were examined to determine if there was correlation among the items used to describe each of the five composite, or scale, variables.  Model assumptions require that the variances of different samples be homogeneous, a criterion that was met for all the variables included in this analysis.  In addition, variables should be normally distributed; this was confirmed upon examination of the P-P plots that were created for each variable using IBM SPSS Version 21 (see Appendix A).

Of the five composite variables created from the NSSE items, overall satisfaction was rated the highest for both year levels.  Overall satisfaction was rated slightly higher by first-year students, with a mean of 71.93, compared with 69.30 for fourth-year students.  The composite variable that was rated the lowest for both year levels was support for student success, with fourth-year students rating it much lower than first-year students (37.42 and 45.50, respectively).

The NSSE descriptive statistics for the composite scores were similar for first- and fourth-year students (Table 4.1).  This table includes the number of students for whom the composite scores were calculated, as well as the mean scores, the standard error and standard deviation, and the scale-level reliability statistic calculated using Cronbach's alpha.  The criteria described in Chapter Three were used to examine the K-M-O values, results from Bartlett's test of sphericity, and the anti-image correlation matrices, which were all found to be acceptable.  The reliability of each scale was determined using Cronbach's alpha, which ranged from 0.65 to 0.76 for first-year students and from 0.63 to 0.79 for fourth-year students in the NSSE example.  In survey research, reliabilities of 0.60 have been found to be acceptable when the number of items used to construct a scale is less than 10 (Lowenthal, 1996).  For this study, a reliability estimate of 0.60 was considered acceptable, along with the assessment of item intercorrelations, because all scales were created with fewer than 10 items each.  The highest reliability value was reported for the composite variable student perceptions about their general learning outcomes for both year levels, while the lowest reliability value was reported for perceptions regarding emphasis on diversity, again for both year levels.

Table 4.1
NSSE Descriptive Statistics for Composite Scales by Year Level

Composite Scale               N      Mean    Standard Error   Standard Deviation   Cronbach Alpha
First-Year Students
Perceived general learning    1844   61.57   0.50             21.37                0.76
Overall satisfaction          1841   71.93   0.52             22.41                0.75
Emphasis on diversity         1831   53.65   0.59             25.44                0.65
Support student success       1850   45.50   0.54             23.15                0.73
Interpersonal environment     1850   65.23   0.41             17.43                0.68
Fourth-Year Students
Perceived general learning    1176   67.87   0.64             21.95                0.79
Overall satisfaction          1175   69.30   0.69             23.75                0.78
Emphasis on diversity         1233   54.14   0.70             24.66                0.63
Support student success       1194   37.42   0.65             22.40                0.76
Interpersonal environment     1209   64.05   0.54             18.84                0.70

Bivariate correlations were calculated for each of the individual and composite variables included in this study (for additional details see Appendix A).
The level of interpretation for these correlation coefficients was determined as follows:  values less than 0.30 were considered weakly correlated; values greater than 0.30 and less than 0.50 were considered moderately correlated; and finally, values larger than 0.50 were considered strongly correlated (Cohen, 1988).    119        The first-year NSSE sample scale scores for perceptions about general learning outcomes were positively correlated with current GPA (r = 0.133, p = 0.000), overall satisfaction (r = 0.408, p = 0.000), interpersonal environment (r = 0.346, p = 0.000), emphasis on diversity (r = 0.319, p = 0.000) and support for student success (r = 0.416, p = 0.000).  Emphasis on diversity was related to support for student success (r = 0.348, p = 0.000), interpersonal environment (r = 0.220, p = 0.000), and overall satisfaction (r = 0.328, p = 0.000).  Support for student success was positively related to interpersonal environment (r = 0.395, p = 0.000) and overall satisfaction (r = 0.403, p = 0.000).  Interpersonal environment was positively related to overall satisfaction (r = 0.484, p = 0.000).        Fourth-year student perceptions about their general learning outcomes were positively but weakly correlated with current GPA (r = 0.100, p = 0.01), and moderately correlated with overall satisfaction (r = 0.478, p = 0.000), interpersonal environment (r = 0.383, p = 0.000), emphasis on diversity (r = 0.292, p = 0.000) and support for student success (r = 0.385, p = 0.000).  Emphasis on diversity was moderately related to support for student success (r = 0.309, p = 0.000), interpersonal environment (r = 0.204, p = 0.000), and weakly correlated with overall satisfaction (r = 0.254, p = 0.000).  Support for student success was moderately positively related to interpersonal environment (r = 0.423, p = 0.000), and overall satisfaction (r = 0.454, p = 0.000).  Interpersonal environment was moderately positively related to overall satisfaction (r = 0.558, p = 0.000).  Current GPA had a positive weak relationship with emphasis on diversity (r = 0.132, p = 0.000), support for student success (r =  120   0.100, p = 0.000), interpersonal environment (r = 0.130, p = 0.000), overall satisfaction (r = 0.150, p = 0.000), and GPO per FTE (r = 0.085, p = 0.003).  Current GPA had a weak negative relationship with student-to-faculty ratio (r = -0.064, p = 0.028) and proportion of male students (r = -0.082, p = 0.004).  The fourth-year NSSE sample scale scores for self-reported grades were strongly correlated with their current GPA (r = 0.743, p = 0.01).   4.2.1.1 NSSE Exploratory Multilevel Factor Analysis Results      A two-level MFA for categorical data was performed using the software program Mplus Version 7.2 (Muthén & Muthén, 2011), and the estimation procedure WLSMV, to examine the multilevel nature of each study survey scale across the levels of analysis: the student level and program level.  The results of the two-level MFA across all of the NSSE scales indicated that only the perceived general learning scale was supported for the fourth-year NSSE sample, while all other scales resulted in misspecified models and their multilevel structure was not supported according to the results from the MFA analyses (for additional details see Appendix A).        
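Because the two-level MFA separates within-program from between-program item variation, a useful preliminary step is to estimate how much of each item's variance lies between programs.  The sketch below is a minimal ANOVA-based ICC calculation in Python with hypothetical file and column names; it is not the Mplus WLSMV estimation itself, only the variance-decomposition idea behind the item ICCs.

```python
# ANOVA-based ICC for each survey item: the share of item variance lying
# between programs.  File and column names are hypothetical; this is not Mplus.
import pandas as pd

def item_icc(scores: pd.Series, groups: pd.Series) -> float:
    d = pd.DataFrame({"y": scores, "g": groups}).dropna()
    n_j = d.groupby("g")["y"].count()
    means = d.groupby("g")["y"].mean()
    grand = d["y"].mean()
    k = len(n_j)
    ss_between = (n_j * (means - grand) ** 2).sum()
    ss_within = ((d["y"] - d["g"].map(means)) ** 2).sum()
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (len(d) - k)
    n_bar = (len(d) - (n_j ** 2).sum() / len(d)) / (k - 1)   # ANOVA-adjusted group size
    return (ms_between - ms_within) / (ms_between + (n_bar - 1) * ms_within)

df = pd.read_csv("nsse_items.csv")                           # hypothetical item-level file
for item in ["broad_education", "writing", "speaking", "critical_thinking"]:
    print(item, round(item_icc(df[item], df["program"]), 3))
```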
There were four survey items used to create the perceived general learning scale for the NSSE example: acquiring a broad general education; writing clearly and effectively; speaking clearly and effectively; and thinking critically and analytically.  The within-level eigenvalues for the fourth-year NSSE sample were 2.7 followed by an eigenvalue of 0.64, while the between-level eigenvalues were estimated at 3.18 and the next value at 0.47; these results suggest that the perceived general learning scale was unidimensional across both levels of analysis for the NSSE fourth-year sample.  For the fourth-year sample, the model fit statistics indicated an acceptable fit, although the RMSEA value of 0.09 was slightly higher than the acceptable criterion of 0.08 (Hu & Bentler, 1998).  The SRMR within-value was 0.03 and the SRMR between-value was 0.05, both acceptable at values less than 0.08.  The CFI value was also acceptable at 0.99, as was the TLI value of 0.97.

The ICC values and the factor loadings at the within and between levels for the perceived general learning scale for the fourth-year NSSE sample are reported in Table 4.2.

Table 4.2
NSSE MFA Results for Perceived General Learning for Fourth-Year Students

                                                      Factor Loadings
Survey Items                            ICC values    Within levels   Between levels
Acquiring a broad general education     0.073         0.565*          0.854*
Writing clearly and effectively         0.130         0.898*          0.961*
Speaking clearly and effectively        0.064         0.797*          0.727*
Thinking critically and analytically    0.037         0.754*          0.870*
Note. *p < 0.05.

The ICC values reported in Table 4.2 were fairly small across most of the items used to describe the perceived general learning scale, which indicates that item responses about general learning outcomes did not vary strongly based on program membership for the NSSE samples.  The ICC value for the item "writing clearly and effectively" for fourth-year students yielded the highest ICC value across all items, 0.13, which suggests that about 13% of the variance in student responses on this item could be attributed to program membership for the NSSE fourth-year sample.  Also, all items contributed well to the overall scale based on the within-level and between-level factor loadings (p < 0.05).  When examining the fourth-year results across the levels of analysis, there was some variation between the student and program levels in the ordering of the factor loadings based on their magnitude.  The item "acquiring a broad general education" contributed more at the program level than it did at the student level.

4.2.2  Aggregation Statistics – NSSE ANOVA Results

The next three subsections report the findings for the three-step ANOVA approach used to examine the appropriateness of aggregating student-level ratings from the NSSE survey to the program level (Bliese, 2000; Griffith, 2002; Lüdtke et al., 2009).  A one-way ANOVA with random effects was conducted for 59 program majors for first-year students, and for 49 program majors for the fourth-year students who participated in the 2011 NSSE.

4.2.2.1  NSSE ANOVA Step One: Assessing Non-Independence

Table 4.3 displays the results of the ANOVA approach.  The first column lists the F-values associated with each composite variable, and the next column represents the level of statistical significance of the F-values.
As shown, all F-values were statistically significant at the p < 0.05 level for the composite variables created using student ratings in the NSSE first-year sample.  For the composite variables created using student ratings in the NSSE fourth-year sample, only perceived general learning outcomes, emphasis on diversity, and interpersonal environment had F-values that were statistically significant at the p < 0.001 level.

Using Equation 3.1 on page 82, the ICC(1) values were calculated for each of the five NSSE study variables.  The ICC(1) values in Table 4.3 show that program-level variances were relatively small, ranging from 0.00 to 0.10 across year levels, which were much lower than the expected value of greater than or equal to 0.12 (Griffith, 2002).  Perceived general learning showed the largest variation for fourth-year students, with an ICC(1) value of 0.10.  This value indicates that about 10% of the variation among fourth-year student perceptions about their general learning outcomes could be attributed to belonging to the student's particular program major.  Emphasis on diversity and interpersonal environment also showed relatively large variation among fourth-year students, with ICC(1) values of 0.06 and 0.07, respectively.  The ICC(1) values for the remaining variables were quite low.

Table 4.3
NSSE ANOVA Approach to Test Aggregation by Program Major

                                                      Aggregation Statistics
                                                      ICC(1)    ICC(2)    rwg
Study Variable                 F-value    p-value     > 0.12    > 0.83    > 0.70
First-Year Students
Perceived general learning     1.80       0.000       0.04      0.44      0.45
Overall satisfaction           1.63       0.002       0.04      0.39      0.44
Emphasis on diversity          1.52       0.007       0.03      0.34      0.26
Support for student success    1.42       0.022       0.02      0.29      0.39
Interpersonal environment      1.43       0.020       0.02      0.30      0.65
Fourth-Year Students
Perceived general learning     2.99       0.000       0.10      0.67      0.48
Overall satisfaction           1.44       0.027       0.03      0.31      0.38
Emphasis on diversity          2.02       0.000       0.06      0.50      0.31
Support for student success    1.02       0.430       0.00      0.02      0.41
Interpersonal environment      2.12       0.000       0.07      0.55      0.61

4.2.2.2  NSSE ANOVA Step Two: Reliability of Program Means

The next step tested the reliability of program means and determined if student responses within a program major corresponded to those of other students in their program major; this was calculated using Equation 3.2 on page 83 to determine the ICC(2) values, which are reported in Table 4.3.

None of the variables had reliability estimates, or ICC(2) values, larger than the expected value of 0.83 (Griffith, 2002).  The ICC(2) values for first-year students ranged from 0.29 to 0.44, and the ICC(2) values for fourth-year students ranged from 0.02 to 0.67.  These low values indicate that, across the year levels, the average program ratings on all five of these variables did not provide reliable estimates of program major means.  The highest ICC(2) value reported was for perceived general learning for both first- and fourth-year students (0.44 and 0.67, respectively).  The lowest ICC(2) value reported for first-year and fourth-year students was for support for student success, 0.29 and 0.02, respectively.

4.2.2.3  NSSE ANOVA Step Three: Within-Group Agreement

The final step in examining the appropriateness of aggregating student NSSE scores involved determining the within-program agreement using Equation 3.3a and Equation 3.3b on page 84, which is represented by the rwg statistic for each of the five variables included in this analysis (Table 4.3).
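For reference, the within-group agreement index has a simple closed form in its single-item version: the observed within-program variance of an item is compared with the variance expected if students responded at random across the A response options.  The sketch below illustrates that form under a uniform null distribution; it is a hedged, simplified stand-in for Equations 3.3a and 3.3b, whose exact scale-level form is given in Chapter Three, and the ratings shown are hypothetical.

```python
# Single-item within-group agreement (r_wg) under a uniform null distribution:
#   r_wg = 1 - s2_x / sigma2_EU,   with  sigma2_EU = (A**2 - 1) / 12
# A simplified stand-in for the scale-level form used in the study.
import numpy as np

def rwg_single_item(ratings: np.ndarray, n_options: int) -> float:
    s2 = ratings.var(ddof=1)                      # observed within-group variance
    sigma2_eu = (n_options ** 2 - 1) / 12         # variance expected under uniform responding
    return float(np.clip(1 - s2 / sigma2_eu, 0.0, 1.0))

# Hypothetical example: one program's ratings on a four-option NSSE item
ratings = np.array([3, 4, 3, 3, 2, 4, 3, 3])
print(round(rwg_single_item(ratings, n_options=4), 2))   # about 0.67
```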
The within-group agreement statistic was calculated for each of the program majors separately and then averaged for each variable (Castro, 2002).  These averages ranged from 0.26 to 0.65 for first-year students, and from 0.31 to 0.61 for fourth-year students.  The lowest within-group values reported for both year levels were for emphasis on diversity.  These values indicated low levels of within-group agreement based on program major groupings for both of the NSSE samples.   4.2.2.4  Summary of NSSE ANOVA Results The low ICC(1) values indicated that student responses on the NSSE study variables were not substantially associated with belonging to a particular program  126   major.  If the ICC(1) values were greater than 0.12, there would have been evidence to suggest that students within program majors tended to correspond with each other; however, the findings reported in Table 4.6 imply that a student’s ratings within a particular program major did not correspond with other students in the same program major.   Results from the ICC(2) analysis indicated low levels of reliability for group means based on the program major groupings included in this study.  These ICC(2) values indicate a lack of correspondence among the first- and fourth-year students on their perceptions, which implies that multiple ratings would be necessary to provide reliable estimates of group means on all five study variables (Bliese, 2000).  The ICC(2) value reported for perceived general learning outcomes for fourth-year students was 0.67, which might suggest that the reliability of these group means could be used for low-stakes decisions (LeBreton & Senter, 2008); however, within-group agreement was quite low at 0.48.  Finally, low rwg estimates also indicated low levels of within-program agreement, which suggests that students within a program major were not responding similarly to each other.  Therefore, based on the results of the one-way ANOVA approach with random effects, aggregation of the five NSSE variables included in this study were not supported when using the program major groupings.     127   4.2.3 Aggregation Statistics – NSSE WABA Results      The second approach that was tested on the five NSSE study variables is called the within and between analysis of variance, or WABA, using the procedure for examining composite individual variables items (Dansereau & Yammarino, 2000).  Similary to Zumbo and Forer (2011), the WABA analysis was conducted using a multistep procedure.  The results of each of the WABA steps are described in the next sections. 4.2.3.1  NSSE WABA Step One: Level of Inference      As was described in Chapter Three there are four levels of inference that can be made based on responses to the composite NSSE variables when using the WABA approach.  Table 4.4 displays the results for the first step of the WABA analysis for both first- and fourth-year students in the NSSE sample.        The criteria set out in Table 3.4 on page 71 can be used to interpret the results reported in Table 4.4.  The second column in Table 4.4 compares between-program variance values with the within-program variance values for each study variable.  The first value in the second column is the between-program variance, and the second value is the within-program variance.  The third column in the table provides the F-values and the fourth column provides the p-values.  
If the between-program variance value is larger than the within-program variance value, and the F-value is greater than 1.00, then aggregation is statistically appropriate (i.e., wholes).  If the between-program variance value is smaller than the within-program variance value, and the F-value is less than 1.00, aggregation at the program level is not statistically appropriate and another level of aggregation would be more suitable (i.e., parts).

Table 4.4
NSSE WABA Approach to Test Levels of Inference by Program Major

Study Variable                 Variance Comparison    F-value   p-value   Effect Sizes (eta-between; eta-within)
First-Year Students
Perceived general learning     800.55 > 445.51        1.80      0.000     0.23 < 33%; 0.77 > 66%
Overall satisfaction           803.87 > 492.24        1.63      0.002     0.23 < 33%; 0.77 > 66%
Emphasis on diversity          969.92 > 636.83        1.52      0.007     0.22 < 33%; 0.78 > 66%
Supporting student success     749.37 > 528.97        1.42      0.022     0.21 < 33%; 0.79 > 66%
Interpersonal environment      427.51 > 299.62        1.43      0.020     0.21 < 33%; 0.79 > 66%
Fourth-Year Students
Perceived general learning     1331.97 > 445.52       2.99      0.000     0.33 = 33%; 0.67 > 66%
Overall satisfaction           799.62 > 554.17        1.44      0.027     0.24 < 33%; 0.76 > 66%
Emphasis on diversity          1179.20 > 585.13       2.02      0.000     0.28 < 33%; 0.72 > 66%
Supporting student success     513.24 > 501.47        1.02      0.430     0.21 < 33%; 0.79 > 66%
Interpersonal environment      749.75 > 338.53        2.21      0.000     0.29 < 33%; 0.71 > 66%

There are two scenarios where aggregation would not be statistically appropriate: equivocal or null.  When the between-group variance value and the within-group variance value are equal and greater than 0, with a corresponding F-value equal to 1.00, this would result in an equivocal conclusion.  When the between-group variance value and the within-group variance value are both equal to 0, with a corresponding F-value also equal to 0, this would result in a null conclusion regarding aggregation.

As shown in Table 4.4, the between-group variances were larger than the within-group variances for all variables across both year levels, which indicates that aggregation to the program major level was statistically appropriate for the NSSE samples.  The F-values had statistically significant p-values for all variables in the first-year sample, but were statistically significant for only two variables in the fourth-year sample: perceived general learning and emphasis on diversity.

The final column in Table 4.4 provides information regarding the effect size (eta-squared) statistics, which represent the practical significance of aggregating to a particular level of inference.  Aggregation to the program level is appropriate when the eta-between value is greater than 66% and the eta-within value is less than 33%.  If the eta-between value is less than 33% and the eta-within value is greater than 66%, then aggregation at a lower-level group would be more practical.  The within-group and between-group etas are calculated and tested relative to each other using F tests for statistical significance and E tests for practical significance.  Tests of statistical significance incorporate sample sizes, while the tests of practical significance are geometrically based and are not influenced by sample sizes (Castro, 2002).

The results reported in Table 4.4 indicate that aggregation to the program level was not practically supported across both year levels in this study, which implies a "parts" relationship for all study variables.
This could suggest that sub-group populations may be responding similarly, but there was too much variation within the program major grouping to support aggregation to that level.

4.2.3.2  NSSE WABA Step Two: Bivariate Analysis

The next step for the NSSE WABA procedure used bivariate analyses to examine the correlations between the dependent variable and the independent variables at the student and program levels.  Correlations between the dependent variable, perceived general learning outcomes, and each of the independent variables were calculated at the program level and the student level.  Each of these program- and student-level correlations was compared against the other for each of the study variables to determine the level of inference that could be made from these NSSE data.

Based on the criteria displayed in Table 3.5 (page 87) that were described by Griffith (2002), Table 4.5 reports the results of the bivariate correlations for both first- and fourth-year students using NSSE data.  For first-year students, the relationships among the dependent variable, perceived general learning outcomes, and the four program-level independent variables could be examined as program-level occurrences, or "wholes," because the bivariate correlations based on program-level data were greater than the correlations based on student-level data.  The results for fourth-year students indicated that only emphasis on diversity could be interpreted at the program level.  The student-level correlations for the fourth-year students were larger than the program-level correlations; thus, the remaining variables should be interpreted from a lower-level grouping within the program major.

Table 4.5
NSSE WABA Comparisons of Bivariate Correlations

Correlation with perceived general learning    Comparison of Correlations    Z test       Level of Inference
First-Year Students
Overall satisfaction                           0.66** > 0.41**               11.52***     wholes
Emphasis on diversity                          0.42** > 0.32**               3.74***      wholes
Support for student success                    0.49** > 0.42**               2.85**       wholes
Interpersonal environment                      0.51** > 0.34**               6.76***      wholes
Fourth-Year Students
Overall satisfaction                           0.23*** < 0.47***             -6.57***     parts
Emphasis on diversity                          0.57** > 0.28**               8.96***      wholes
Support for student success                    -0.01 < 0.38**                -10.18***    parts
Interpersonal environment                      -0.11 < 0.38**                -12.42***    parts
Note. **p < .01. ***p < .001.

4.2.3.3  NSSE WABA Step Three: Decomposition of Correlations

Next, the correlations of student perceptions about their general learning outcomes and the program-level independent variables were decomposed into between-components and within-components to determine whether the relationships between these variables should be studied as "parts," "wholes," "equivocal," or "null" (Dansereau & Yammarino, 2006).

When the correlations of student perceived general learning outcomes and program-level variables were decomposed into their between- and within-components (see Table 4.6), the results indicated that these relationships should more appropriately be studied as sub-groups within programs, or the "parts" effect, rather than as "wholes," across both year levels (Dansereau & Yammarino, 2006).  As shown in Table 4.6, the between-component correlations were quite small, while the within-component correlations were much larger, and statistically significant, for all study variables across both year levels.
Table 4.6
NSSE WABA Decomposition of Correlations

                                       Decomposition of Correlations between X and Y
Perceived general learning (Y)         Between-Component    Within-Component
First-Year Students
Overall satisfaction (X1)              0.02                 0.39***
Emphasis on diversity (X2)             0.01                 0.31***
Support for student success (X3)       0.01                 0.40***
Interpersonal environment (X4)         0.01                 0.33***
Fourth-Year Students
Overall satisfaction (X1)              0.01                 0.44***
Emphasis on diversity (X2)             0.04                 0.26***
Support for student success (X3)       0.00                 0.36***
Interpersonal environment (X4)         -0.01                0.35***
Note. ***p < .001.

4.2.3.4  Summary of NSSE WABA Results

These results tend to support the inference that the between-group variance across variables is not homogeneous, and the pattern of results suggested that the only meaningful levels of analysis were at the within-group level.  Across both NSSE year levels an inference of "parts" was found for all study variables, which implied that this inference is multiplexed (Forer & Zumbo, 2011).  The results suggested that some differences in responses were based on program membership for both year levels, but the effect size variations were likely associated with a lower-level effect within programs, or what is referred to as a "parts" effect (Dansereau & Yammarino, 2000).

These WABA results suggested that group disagreement within program areas on the NSSE study variables requires further examination because there appeared to be an appropriate level of aggregation within the program major grouping, just not quite at the level of program major.

4.2.4  Aggregation Statistics – NSSE Unconditional Multilevel Model Results

The third and final analysis was used to determine whether the unconditional multilevel model was adequate in determining the appropriate level of aggregation for these data at a single university.  Each of the five variables was fitted in turn as the dependent variable in a one-way ANOVA with random effects model, where the student- and program-level models were analysed simultaneously.  As described in Chapter Three, REML was used for this analysis because it provides more realistic and larger posterior variances than full maximum likelihood, particularly when the number of highest-level grouping units is small (Bryk & Raudenbush, 1992).

Table 4.7 displays the results of the unconditional multilevel analyses for first- and fourth-year students.  At the program level, the reliability of group means is influenced by the number of students sampled per program and the level of student agreement within programs (Raudenbush & Bryk, 2002).  The proportion of variance explained was calculated using Equation 3.4 on page 90.  The proportions of variance explained were quite low for the first-year students, which implied that for the groupings in this sample there was likely more variability within program majors than among program majors on the outcome variables.  These results were supported by the reliability estimates for each of these variables, ranging from 0.26 to 0.37, which indicate that program means were not reliable.

The fourth-year student results for the unconditional multilevel model also did not support aggregation at the program major level.  Only about 5.9% of the variability among student perceptions for perceived general learning outcomes was explained by program membership, which suggests that the remaining 94.1% could be attributed to differences between students within program majors, and a term that includes error variance.
These results imply that the program major means were not reliable because there were too many within-program differences among student responses to the NSSE study items.  Only student perceptions of general learning outcomes for fourth-year students seemed to have a reliability estimate large enough, 0.64, to be considered appropriate for aggregation.

Table 4.7
NSSE Unconditional Multilevel Results to Test Aggregation by Program Major

                               Between-variance   Within-variance   Reliability   Chi-square   p-value
First-Year Students
Perceived general learning     10.38              444.71            0.37          104.52       0.001
Overall satisfaction           11.30              487.22            0.36          96.87        0.001
Emphasis on diversity          10.99              638.35            0.30          86.93        0.008
Support student success        6.99               529.13            0.26          67.17        0.001
Interpersonal environment      5.03               297.11            0.30          81.49        0.023
Fourth-Year Students
Perceived general learning     44.12              437.23            0.64          154.19       0.001
Overall satisfaction           12.07              551.20            0.31          69.95        0.021
Emphasis on diversity          26.14              568.56            0.47          96.97        0.001
Support student success        6.08               489.81            0.21          53.43        0.273
Interpersonal environment      16.40              334.65            0.49          105.90       0.001

4.2.4.1  NSSE Empirical Bayes Point Estimates

Due to the small number of groups included in this study, EB residual estimates were used to compare results on average perceived general learning outcomes across program majors.  Figures 4.1 and 4.2 display the EB residual estimates with associated 95% credible values for the first- and fourth-year NSSE samples by campus location.  It is clear from Figures 4.1 and 4.2 that the uncertainty attached to program estimates made it difficult to distinguish among many of the program majors (located on the X axis), as demonstrated by how the "tails" of the 95% credible values overlapped the overall average perceived general learning outcomes (62.78) (located on the Y axis).

These results suggest that only the program majors that did not have credible value "tails" touching the average line could be compared against each other and could be considered to have different program means.  As shown in Figure 4.1, most program majors intersected with the mean, 62.78, for the first-year sample, which indicates less variability among program means on perceived general learning outcomes in this sample.  Figure 4.2 displays the results for the fourth-year students, which tended to show more variability among program means for perceived general learning outcomes, with the overall mean at 68.89.

Figure 4.1.  The Empirical Bayes point estimates with 95% credible values, plotted using HLM software, for average ratings of perceived general learning (Y axis) for the 59 program majors in the NSSE first-year sample (X axis), with the overall mean (62.78); Vancouver campus in blue, Okanagan campus in red (unconditional two-level multilevel model).

Figure 4.2.  The Empirical Bayes point estimates with 95% credible values, plotted using HLM software, for average perceived general learning (Y axis) for the 49 program majors in the NSSE fourth-year sample (X axis), with the overall mean (68.89); Vancouver campus in blue, Okanagan campus in red (unconditional two-level multilevel model).

4.2.4.2  Summary of NSSE Unconditional Multilevel Model Results

In summary, the unconditional multilevel model provided evidence that there was more within-group variance than between-group variance based on the program means for the five study variables.
These results indicated that the program means were not reliable due to excessive variance on the outcome variables within programs rather than between program majors.  The only variable that could provide somewhat reliable estimates would be the perceived general learning outcomes for fourth-year students, with a reliability estimate of 0.64.
4.2.5  Overall Summary of Aggregation Statistics for NSSE
     In summary, conclusions about the aggregation of student responses by program major for each of the five NSSE study variables differed slightly depending on the approach used to test aggregation.  The results from the one-way ANOVA with random effects demonstrated that low levels of variance among student ratings on the study variables could be attributed to program membership.  Furthermore, most of the group means had low reliability estimates.  The largest reliability estimate was for student perceptions about their general learning outcomes for fourth-year students, with a value of 0.67, but the within-group agreement was quite low at 0.48.  Similarly, the unconditional multilevel model results demonstrated that when simultaneously modelling the student- and program-level data, the aggregation of student perceptions about general learning outcomes by program major was supported for NSSE fourth-year students.
     The WABA results suggested that group disagreement within program areas on the NSSE study variables requires further examination, because there appeared to be an appropriate level of aggregation within the program major group, but not at the level of the program major itself.  The next step was to investigate these variables using regression approaches to model student- and program-level characteristics in order to better understand the variability within program major.
4.2.6  NSSE Multilevel Regression Model Results
     The final multilevel model used for the NSSE sample was built using a step-wise modelling procedure.  The baseline for this model was reported in section 4.2.4 of this chapter, where the results of the unconditional multilevel model without any independent variables were reported.  The next step involved modelling independent variables at the student level only using a random-coefficient regression model, and then, finally, using the intercepts- and slopes-as-outcomes model with independent variables at both the student and program levels.  The results of the random-coefficient regression model and the intercepts- and slopes-as-outcomes model are reported in the next sections.
4.2.6.1  NSSE Random-Coefficient Regression Model Results
     A random-coefficient regression model was used to represent the distribution of perceived general learning outcomes in each of the 49 program majors for fourth-year students; this was the only variable in the unconditional multilevel model that had an acceptable reliability coefficient of 0.64 to support aggregation.  The aggregation of the first-year variables and the other fourth-year variables was not supported by the results of the unconditional multilevel model.
     Perceived general learning outcomes for a student in a particular program major were regressed on gender, direct entry status, international status, entering GPA, and current GPA.  This model was estimated using REML.  The distribution of perceived general learning outcomes for each program was explained in terms of six parameters: an intercept and five regression coefficients.
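A sketch of the random-coefficient specification implied here, in standard two-level notation (the exact equations and numbering in Chapter Three are assumed rather than reproduced):

\[ \text{Level 1: } Y_{ij} = \beta_{0j} + \beta_{1j}(\text{EnteringGPA})_{ij} + \beta_{2j}(\text{CurrentGPA})_{ij} + \beta_{3j}(\text{Gender})_{ij} + \beta_{4j}(\text{DirectEntry})_{ij} + \beta_{5j}(\text{International})_{ij} + r_{ij} \]

\[ \text{Level 2: } \beta_{0j} = \gamma_{00} + u_{0j}, \qquad \beta_{qj} = \gamma_{q0} \; (+\, u_{qj} \text{ for any slope allowed to vary}), \quad q = 1, \dots, 5 \]

with the student-level predictors group-mean centred, r_ij ~ N(0, σ²), and u_0j ~ N(0, τ00).  The fixed part of this model contains the six parameters, one intercept and five regression coefficients, referred to above.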
As was described in Chapter Three, all student-level variables were group-mean centred for this model.  The results from the random-coefficient regression analysis are reported in Table 4.8.  Using Equation 3.6 on page 90, the proportion of student-level variance explained by the random-coefficient regression model was 7%; this represented 2% more explained variance than the unconditional model, calculated using Equation 3.11 on page 107.  In this model, the chi-square test statistic for the intercept was 159.08 and for the current GPA slope was 39.93, both with 30 degrees of freedom.  The results indicate that the null hypothesis of no variation among program means is highly implausible, and that there are differences in the average program ratings of perceived general learning outcomes (p < 0.001).  The reliability for examining how program-level characteristics influence the differentiation of current GPA (slope) was acceptable at 0.23.  Raudenbush and Bryk (2002) indicated that a reliability estimate larger than 0.10 for the slope is sufficient to allow it to vary at the higher level; thus, in this model current GPA was allowed to vary across program major groupings.  These results indicate that significant variation does exist among program majors on the outcome variable, but not with the differentiating effect of current GPA on the outcome.  There was a small correlation among the program-level relationships (p = 0.17).  The model was analysed with current GPA considered fixed.

Table 4.8
NSSE Results from the Random-Coefficient Regression Model for Fourth-Year Students

Fixed Effect                              Coefficient   Standard error   t-ratio   p-value
Intercept                                    68.91           1.18          58.27    <0.001
Entering GPA                                 -0.11           0.08          -1.38     0.169
Current GPA                                   0.44           0.10           4.46    <0.001
Gender (1=Male 0=Female)                     -1.02           1.65          -0.62     0.539
Direct entry (1=Direct 0=Transfer)            2.42           1.71           1.41     0.158
International status (1=Dom 0=Intl)          -2.75           2.63          -1.04     0.297

Random Effect                 Variance Component   d.f.      χ2       p-value
Program mean                        44.04           48      159.08     <0.001
level-1                            427.28

Reliability of OLS Regression-Coefficient Estimates
Average perceived general learning    0.64

     The reliability estimate of 0.64 was acceptable for examining the hypotheses about the effects of program characteristics on average ratings of student perceptions about their general learning outcomes.  The random-coefficient regression model also indicated that current GPA (p < 0.001) had a statistically significant positive relationship with average program ratings of perceived general learning outcomes; as current GPA increased, average program ratings on perceived learning increased by 0.44 points.  Once the variability of the regression estimates across program majors was estimated, the next step was to build an explanatory model to account for the variability.
4.2.6.2  NSSE Intercepts- and Slopes-as-Outcomes Model Results
     The results for the model are reported in Table 4.9.  As shown, perceived general learning outcomes for fourth-year students participating in the NSSE were, on average, slightly higher for programs with a larger proportion of international students (p = 0.001).  Campus location (p = 0.026) and proportion male (p = 0.002) were negatively related to perceived learning outcomes.  As the proportion of international students increased in the program, the gender status slope was marginally decreased (p = 0.061).  Students with higher current GPAs tended to rate their perceived general learning outcomes higher, on average, by about 0.30 points more than students with lower current GPAs (p = 0.011).
Although not statistically significant, domestic students rated their perceived general learning outcomes lower, on average, than international students by about 5.13 points (p = 0.115), but as the student-to-faculty ratio increased, the gap between international and domestic students narrowed (p = 0.009).

Table 4.9
NSSE Results from the Intercepts- and Slopes-As-Outcomes Model for Fourth-Year Students

Fixed Effect                              Coefficient    Standard error   t-ratio   p-value
Perceived Learning Outcomes
  Intercept                                69.249303       0.750432       92.279    <0.001
  Campus (1=Van 0=Okn)                     -6.205955       2.697473       -2.301     0.026
  Student-to-Faculty Ratio                  0.334900       0.198762        1.685     0.099
  GPO per FTE                              -0.000665       0.000391       -1.701     0.096
  Proportion Intl                           0.008868       0.002589        3.426     0.001
  Proportion Male                          -0.003421       0.001047       -3.268     0.002
Gender Status slope (1=Male 0=Female)
  Intercept                                -1.448533       1.762276       -0.822     0.411
  Campus (1=Van 0=Okn)                     -1.336272       7.589092       -0.176     0.860
  Student-to-Faculty Ratio                  0.264828       0.655867        0.404     0.686
  GPO per FTE                              -0.000129       0.001002       -0.129     0.897
  Proportion Intl                          -0.010245       0.005458       -1.877     0.061
  Proportion Male                           0.003774       0.002618        1.442     0.150
Direct Entry (1=Direct 0=Transfer)
  Intercept                                 2.666829       1.570538        1.698     0.090
  Campus (1=Van 0=Okn)                     -7.045680       5.756819       -1.224     0.221
  Student-to-Faculty Ratio                 -0.298074       0.588455       -0.507     0.613
  GPO per FTE                               0.000731       0.000786        0.930     0.353
  Proportion Intl                          -0.000006       0.005497       -0.001     0.999
  Proportion Male                           0.002903       0.002398        1.211     0.226
Entering GPA
  Intercept                                -0.095579       0.100785       -0.948     0.348
  Campus (1=Van 0=Okn)                      0.037384       0.385433        0.097     0.923
  Student-to-Faculty Ratio                  0.019990       0.039748        0.503     0.618
  GPO per FTE                               0.000002       0.000035        0.057     0.955
  Proportion Intl                          -0.000067       0.000336       -0.200     0.842
  Proportion Male                          -0.000031       0.000171       -0.184     0.855
Current GPA
  Intercept                                 0.300935       0.113325        2.656     0.011
  Campus (1=Van 0=Okn)                     -0.122711       0.395500       -0.310     0.758
  Student-to-Faculty Ratio                  0.000422       0.046514        0.009     0.993
  GPO per FTE                              -0.000029       0.000049       -0.585     0.562
  Proportion Intl                           0.000024       0.000328        0.075     0.941
  Proportion Male                           0.000109       0.000150        0.724     0.473
International Status (1=Dom 0=Intl)
  Intercept                                -5.135533       3.251          -1.580     0.115
  Campus (1=Van 0=Okn)                    -17.87371       14.190          -1.260     0.208
  Student-Faculty Ratio                     1.501478       0.575           2.612     0.009
  GPO per FTE                               0.001158       0.002           0.720     0.472
  Proportion Intl                          -0.000431       0.005          -0.079     0.937
  Proportion Male                           0.004661       0.004           1.039     0.299

Random Effects                     Variance Component   d.f.      χ2       p-value
Average Perceived Learning              11.06714          43      69.607     0.006
Entering GPA                             0.02853          43      43.042     0.470
Current GPA                              0.05824          43      46.512     0.330
level-1                                427.82414

Reliability of OLS Regression-Coefficient Estimates
Average Perceived Learning Outcomes    0.34
Entering GPA slope                     0.07
Current GPA slope                      0.10

     Using Equation 3.6 on page 90, the proportion of variance explained by the full model was 2%.  Using Equation 3.12, no additional variance was accounted for in this model over the random-coefficient regression model.  In this model, the chi-square test statistic for the intercept was 69.61, for the current GPA slope it was 46.51, and for the entering GPA slope it was 43.04, all with 43 degrees of freedom.  The null hypothesis for differences among program means was rejected (p < 0.01), which indicates that significant variation did exist, but not with the differentiating effects of entering GPA (p = 0.470) or current GPA (p = 0.330) on the outcome.
In addition, the reliability of these program means was very low at a value of 0.34.
     The full intercepts- and slopes-as-outcomes model reduced the reliability of the program means in the random-coefficient regression model from 0.64 to 0.34.  Also, the reliability for the slope of current GPA was substantially reduced, from 0.23 in the random-coefficient regression model to 0.10 in the full model.
     As shown in Figure 4.3, when programs were plotted using the EB point estimate residuals with 95% credible values, there was more differentiation among programs than with the unconditional multilevel model.

Figure 4.3.  The Empirical Bayes Point Estimates with 95% Credible Values Plotted using HLM software for Average Perceived General Learning Outcomes (Y axis) for Programs in the NSSE Fourth-year Sample (X axis) with Overall Mean (69.04) (intercepts- and slopes-as-outcomes model).  X axis: 49 NSSE program major groups; Y axis: perceived general learning; Okanagan campus in blue, Vancouver campus in red.

4.2.7  Overall Summary of NSSE Findings
     The results from the NSSE example suggested that aggregation to the program level was supported only for the fourth-year NSSE sample regarding their perceptions about their general learning, but that aggregation was not supported for ratings about the learning environment across both year levels.  The results of the two-level MFAs corresponded with the results from the aggregation procedures.  Only the program-level analysis of perceived general learning for the fourth-year NSSE sample was supported by the two-level MFA and the aggregation procedures.  Results suggested that all other scales should not be analysed at the program level, which includes perceived general learning for the first-year sample and all of the learning environment scales.
     Of the three aggregation procedures, the WABA approach tended to be the most informative for this study because it allowed a “parts” inference rather than concluding that no group-level effects existed.  Although the WABA approach indicated that aggregation was statistically appropriate across first- and fourth-year NSSE samples for perceived general learning, the effect size analyses indicated that aggregation was not practically supported.  Further, only the fourth-year NSSE sample met the acceptable criteria for aggregation of perceived general learning established by the results of the ANOVA and the unconditional multilevel model.
     Thus, these NSSE results answered the first research question by indicating that only one study variable could appropriately be aggregated to the program level and included in subsequent analyses: perceived general learning outcomes for the fourth-year NSSE sample.  All other study variables did not meet the acceptable criteria levels that were used to determine the appropriateness of aggregating to the program level, and these were excluded from further analyses.  These results clearly illustrate the consequence of ignoring the multilevel structure of these survey results, or making assumptions about the multilevel structure without first empirically examining these assumptions prior to drawing conclusions at the aggregate level.
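For reference, the EB point estimates plotted in Figures 4.1 through 4.3 are shrinkage estimates of the program means.  A minimal Python sketch of how such estimates and their approximate 95% intervals could be computed from the unconditional-model quantities is given below; the data layout and column names are hypothetical, the raw grand mean is used as a stand-in for the fixed-effect intercept, and the actual estimates in this study were produced with HLM software.

import numpy as np
import pandas as pd

def eb_program_estimates(df, outcome, group, tau00, sigma2):
    """Empirical Bayes (shrinkage) estimates of program means from an
    unconditional two-level model, with approximate 95% intervals."""
    grand_mean = df[outcome].mean()                       # stand-in for the fixed intercept
    stats = df.groupby(group)[outcome].agg(["mean", "count"])

    # Reliability of each observed program mean
    lam = tau00 / (tau00 + sigma2 / stats["count"])

    # EB estimate shrinks each observed mean toward the overall mean
    eb_mean = lam * stats["mean"] + (1 - lam) * grand_mean

    # Posterior variance of the program effect; used for the 95% interval half-width
    half_width = 1.96 * np.sqrt(tau00 * (1 - lam))
    return pd.DataFrame({"eb_mean": eb_mean,
                         "lower_95": eb_mean - half_width,
                         "upper_95": eb_mean + half_width,
                         "reliability": lam})

# e.g. eb_program_estimates(nsse, "perceived_learning", "program_major", tau00=44.12, sigma2=437.23)

Programs whose intervals overlap the overall mean would be treated, as in Figures 4.1 to 4.3, as indistinguishable from the average program.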
A two-level multilevel regression model was developed for the fourth-year NSSE sample to address the second research question regarding what student- and program-level characteristics, and cross-level relationships, are associated with student perceptions about their general learning.  The first step was to build a random-coefficient multilevel regression model, where only the student variables were included in the model.  The results suggested that only current GPA was statistically related to perceived general learning when program-level variables were not included in the model.  When both student- and program-level variables were included in the full intercepts- and slopes-as-outcomes model, results indicated that campus location and proportion male were negatively related to perceived general learning (p = 0.026 and p = 0.002, respectively) and that the proportion of international students was positively related (p = 0.001).  These results imply that students in programs at the Vancouver campus rated their perceived general learning lower, on average, than students in programs at the Okanagan campus.  These results also imply that as the proportion of male students in a program increased, perceptions regarding general learning decreased; however, as the proportion of international students in a program increased, perceptions regarding general learning increased.  Current GPA was the only student-level variable to be positively related to the intercept (p = 0.011).  As for the cross-level relationships, or interactions, as the student-to-faculty ratio increased there was a positive influence on perceived general learning for domestic students (p = 0.009).  These results should be interpreted cautiously because the reliability estimate for the program means in the full multilevel model was quite small at a value of 0.34.
4.3  Results for the UES Example
     The next sections report the findings from the three analytic procedures used on UES data from 2010 for first- and fourth-year students at UBC’s Vancouver and Okanagan campuses.  The first section provides information on the descriptive statistics for the UES composite ratings for each year level and the two-level MFA results for the items contributing to the composite ratings used in this study.  The following sections report the results of each step in the three aggregation procedures and the results of the two-level multilevel regression models.  Each section ends with a brief summary of the research findings for the UES.
4.3.1  UES Composite Ratings
     Based on a combination of individual UES survey items, composite ratings were created using the same summed approach used for the NSSE sample in this study (Pike, 2006a), where the original student ratings were converted to a scale ranging from 0 to 100, and then average ratings were calculated provided the student had responded to at least half, or three-fifths, of the items in the scale.  The means, standard deviations, and intercorrelations of the UES items were examined to determine if there was correlation among the items used to describe each of the composite, or scale, variables.  Model assumptions require that the variances of different samples be homogeneous, which was met for all the variables included in this analysis.  In addition, variables must be normally distributed, which was demonstrated by examining P-P plots for each variable using SPSS (see Appendix B).
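A minimal Python sketch of the composite-scoring rule described above follows; the function and column names are hypothetical, and the 0–100 rescaling and response-coverage threshold are stated as assumptions about the general approach rather than as the exact scoring code used in the study.

import pandas as pd

def composite_score(items: pd.DataFrame, scale_min: float, scale_max: float,
                    min_coverage: float = 0.5) -> pd.Series:
    """Summed composite on a 0-100 metric, computed only for respondents who
    answered at least `min_coverage` of the items (e.g., one half or three fifths)."""
    rescaled = (items - scale_min) / (scale_max - scale_min) * 100
    answered = rescaled.notna().mean(axis=1)        # proportion of items answered
    score = rescaled.mean(axis=1)                   # mean of the answered, rescaled items
    return score.where(answered >= min_coverage)    # left missing when coverage is too low

# Hypothetical usage for a four-item scale originally rated 1-4:
# ues["perceived_learning"] = composite_score(ues[["writing", "speaking", "reading", "thinking"]], 1, 4)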
Of the four composite variables created from the UES items, campus climate for diversity was rated the highest for both year levels.  In addition, campus climate for diversity was rated slightly higher by first-year students, with a mean of 82.28, compared with 79.68 for fourth-year students.  The composite variable that was rated the lowest for first-year students was perceived general learning outcomes, with a mean of 69.50, and for fourth-year students it was the supportive campus environment rating, with a mean of 69.29.
     The reliability of each scale was determined using Cronbach Alpha, which ranged from moderate to strong coefficient values: 0.78 to 0.89 for first-year students, and from 0.77 to 0.90 for fourth-year students (Lowenthal, 1996).  The highest reliability value was reported for the composite variable campus climate for diversity for both year levels, while the lowest reliability value was reported for perceived general learning outcomes, again for both year levels.
     The UES descriptive statistics for the composite scores are reported in Table 4.10.  This table includes the number of students for whom the composite scores were calculated, as well as the mean scores, the standard error and standard deviation, and the scale-level reliability statistic using Cronbach Alpha.

Table 4.10
UES Descriptive Statistics for Composite Scores by Year Level

Scale                             Sample Size    Mean    S.E.    S.D.    Cronbach Alpha
First-Year Students
Perceived general learning            2320       69.50   0.30    14.21        0.78
Overall satisfaction                  2319       70.70   0.38    18.12        0.80
Campus climate for diversity          2318       82.28   0.31    14.70        0.89
Supportive campus environment         2331       73.28   0.29    14.09        0.82
Fourth-Year Students
Perceived general learning            2446       76.51   0.26    13.06        0.77
Overall satisfaction                  2444       69.59   0.42    20.59        0.84
Campus climate for diversity          2437       79.68   0.35    17.04        0.90
Supportive campus environment         2463       69.29   0.32    16.13        0.84

     The first-year UES sample scale scores for perceived general learning outcomes were positively correlated with current GPA (r = 0.102, p = 0.000), overall satisfaction (r = 0.298, p = 0.000), supportive campus environment (r = 0.248, p = 0.000) and campus climate for diversity (r = 0.197, p = 0.000).  Campus climate for diversity was moderately correlated with supportive campus environment (r = 0.470, p = 0.000) and overall satisfaction (r = 0.372, p = 0.000).  Supportive campus environment was strongly positively correlated with overall satisfaction (r = 0.722, p = 0.000).  Current GPA had a weak negative correlation with student-to-faculty ratio (r = -0.090, p = 0.000) and with GPO per FTE (r = -0.091, p = 0.000).
     The fourth-year UES sample scale scores for perceived general learning outcomes were positively correlated with current GPA (r = 0.164, p = 0.000), overall satisfaction (r = 0.286, p = 0.000), supportive campus environment (r = 0.198, p = 0.000) and campus climate for diversity (r = 0.130, p = 0.000).  Campus climate for diversity was moderately correlated with supportive campus environment (r = 0.567, p = 0.000) and overall satisfaction (r = 0.433, p = 0.000).  Supportive campus environment was again strongly positively correlated with overall satisfaction (r = 0.758, p = 0.000).
Current GPA was positively correlated with overall satisfaction (r = 0.125, p = 0.000) and with GPO per FTE (r = 0.071, p = 0.000), and negatively correlated with the proportion of males (r = -0.160, p = 0.000), the proportion of international students (r = -0.159, p = 0.000) and the student-to-faculty ratio (r = -0.170, p = 0.000).
     Based on these results, it appears that student perceptions of their learning environment were related to academic achievement for fourth-year students, but not for first-year students in this study.
4.3.1.1  UES Exploratory Multilevel Factor Analysis Results
     A two-level MFA for categorical data was performed using the software program Mplus Version 7.2 (Muthén & Muthén, 2011), with the WLSMV estimation procedure, to examine the multilevel nature of the items at the student and program levels of each scale used in this study.  The results of the two-level MFA analyses across all of the UES scales indicated that the multilevel structure was supported for the perceived general learning scale for the first- and fourth-year UES samples, and for campus climate for diversity for the first-year sample.  The results of the two-level MFA indicated that the multilevel structure was not supported for any of the other scales.
     As with the NSSE example, there were four items used to describe the perceived general learning scale for the UES example: ability to be clear and effective when writing; ability to speak clearly and effectively in English; ability to read and comprehend academic material; and analytical and critical thinking skills.  The eigenvalues for each scale were decomposed into their within and between components.  For the first-year UES sample, the first within-level eigenvalue was estimated to be 2.6, followed by an eigenvalue of 0.59, while the first between-level eigenvalue was estimated at 3.4, followed by an eigenvalue of 0.53.  Using Kaiser’s rule of an eigenvalue greater than 1.0, these results suggest that the items used to create the perceived general learning scale were essentially unidimensional across both levels of analysis used in this study for first-year UES students.  For the fourth-year UES sample, the first within-level eigenvalue was 2.6, followed by a value of 0.6, while the first between-level eigenvalue was estimated at 2.9, followed by an eigenvalue of 0.7, which also supports unidimensionality across levels of analysis.
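A rough Python sketch of this within/between eigenvalue decomposition is shown below.  The actual analyses were estimated in Mplus with WLSMV; this illustration uses the pooled within-group and between-group correlation matrices with hypothetical column names, so it approximates rather than reproduces the Mplus estimates.

import numpy as np
import pandas as pd

def within_between_eigenvalues(df: pd.DataFrame, items: list, group: str):
    """Eigenvalues of the pooled within-group and between-group correlation
    matrices for a set of scale items (a rough stand-in for the MFA decomposition)."""
    deviations = df.groupby(group)[items].transform(lambda x: x - x.mean())  # within-group part
    group_means = df.groupby(group)[items].mean()                            # between-group part

    eig_within = np.sort(np.linalg.eigvalsh(deviations.corr()))[::-1]
    eig_between = np.sort(np.linalg.eigvalsh(group_means.corr()))[::-1]
    return eig_within, eig_between

# e.g. within_between_eigenvalues(ues, ["writing", "speaking", "reading", "thinking"], "program_major")

Applying Kaiser’s rule to each set of eigenvalues gives the within- and between-level dimensionality checks described above.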
The acceptable fit of the model was determined using criteria described by Hu and Bentler (1998): RMSEA, SRMR within, and SRMR between equal to or less than 0.08; CFI and TLI equal to or greater than 0.95.  The results for the UES first-year sample indicated that the model was a reasonable fit.  The CFI value was acceptable at 0.99 and the SRMR within-level value was also acceptable at 0.024, but the RMSEA was too high at 0.10, as was the SRMR between-levels value at 0.094.  These high values indicate that the fit was better for the within-level model than the between-level model.  For the UES fourth-year sample, the model was an acceptable fit, with an RMSEA value of 0.00, an SRMR within value of 0.02, an SRMR between value of 0.08, and CFI and TLI values of 1.00.  The ICC values and the factor loadings within and between levels are reported in Table 4.11.

Table 4.11
UES MFA Results for Perceived General Learning by Year Level

                                                          Factor Loadings
Survey Items                                ICC values   Within levels   Between levels
First-Year Students
Analytical and critical thinking skills       .048          .751*            .874*
Clear and effective when writing              .013          .795*            .879*
Read and comprehend academic material         .013          .736*           1.055*
Speak clearly and effectively in English      .027          .654*            .738*
Fourth-Year Students
Analytical and critical thinking skills       .033          .766*            .810*
Clear and effective when writing              .040          .765*            .972*
Read and comprehend academic material         .020          .738*            .652*
Speak clearly and effectively in English      .069          .664*            .799*
Note. *p < .05.

     The ICC values reported in Table 4.11 were also fairly small across the items, which indicates that item results did not vary strongly based on program membership for the UES samples.  Based on the factor loadings within and between levels, all items contributed well to the overall scales, and all were statistically significant at the p < 0.05 level.  There did appear to be some slight variation between the levels on the factor loadings, especially for first-year students.  These results suggest that there are some slight differences across the student and program levels concerning how items contribute to the perceived general learning scale.
     There were six survey items that contributed to the campus climate for diversity scale in the UES example: students are respected here regardless of their sexual orientation; students are respected here regardless of their economic or social class; students are respected here regardless of their race or ethnicity; students are respected here regardless of their gender; students are respected here regardless of their political beliefs; and students are respected here regardless of their physical ability/disability.
     The eigenvalues for the first-year UES sample for the campus climate for diversity scale were 4.3 within levels, followed by the next value at 0.5, while the between-level decomposition resulted in eigenvalues of 5.7 followed by a value of 0.2.  These eigenvalue results indicate that the items contributed to a unidimensional scale across both levels of analysis.  The ICC values and the factor loadings within and between levels are reported in Table 4.12.

Table 4.12
UES MFA Results for Campus Climate for Diversity by Year Level

                                              Factor Loadings
Survey Items                    ICC values   Within level   Between level
First-Year Students
Economic or social class           0.019        0.786*          0.913*
Gender                             0.002        0.848*          0.988*
Race                               0.021        0.832*          1.024*
Political beliefs                  0.014        0.822*          0.971*
Sexual orientation                 0.037        0.814*          0.933*
Physical ability/disability        0.000        0.786*          0.966*
Note. *p < .05.

     The ICC values were again quite low, which indicates that student responses could not be attributed to belonging to their program major.  The factor loadings for the within-level model contributed well to the scale, p < 0.05, as did all of the items between levels.
4.3.2  Aggregation Statistics - UES ANOVA Results
     As with the NSSE example, the first approach tested for determining the appropriateness of aggregating the composite scores on the UES was the ANOVA with random effects.  This approach determines whether it is appropriate to aggregate student ratings to the program major level on the dependent variable (perceived general learning outcomes) and the three program-level independent variables: overall satisfaction, campus climate for diversity, and supportive campus environment.
A one-way ANOVA with random effects was conducted for 22 program majors with 2,260 first-year students and for 28 program majors with 2,374 fourth-year students.
4.3.2.1  UES ANOVA Step One:  Assessing Non-Independence
     The ICC(1) values in Table 4.13 show that differences in values for the UES example were larger for perceived general learning outcomes compared with the other three variables across both year levels, which implies that about 15% and 19% of the variance in the variable could be attributed to belonging to a particular program major for first-year and fourth-year students, respectively.

Table 4.13
UES ANOVA Approach to Test Aggregation Grouped by Program Major

Study Variables                   F-value    F sig    ICC(1) (> 0.12)   ICC(2) (> 0.83)   rwg (> 0.70)
First-Year Students
Perceived general learning         3.767      .000         0.15              0.78             0.79
Overall satisfaction               2.786      .000         0.09              0.67             0.64
Campus climate for diversity       2.426      .000         0.08              0.64             0.76
Supportive campus environment      1.806      .011         0.03              0.39             0.77
Fourth-Year Students
Perceived general learning         3.380      .000         0.19              0.78             0.81
Overall satisfaction               1.296      .130         0.02              0.27             0.50
Campus climate for diversity       1.361      .092         0.03              0.30             0.65
Supportive campus environment      1.431      .061         0.04              0.37             0.69

4.3.2.2  UES ANOVA Step Two:  Reliability of Program Means
     The reliability of these program means on perceived general learning outcomes for both year levels was relatively high at 0.78, but not at the expected value of 0.83 (Griffith, 2002).  The rwg statistic indicates the level of agreement among students within a particular program major.  The level of agreement on perceived general learning was high across both year levels, as indicated by the rwg statistics of 0.79 and 0.81, which were much higher than the acceptable value of 0.70.
4.3.2.3  UES ANOVA Step Three:  Within-Group Agreement
     In comparison to the other study variables in the UES example, there appears to be within-group agreement based on program majors for first-year students on campus climate for diversity and supportive campus environment, but their means are not highly reliable based on the ICC(2) values of 0.64 and 0.39, respectively.  Fourth-year students tended to have a fairly high level of agreement within program majors on their ratings of support for students (rwg = 0.69), but the reliability of program means was low based on the ICC(2) value of 0.37.
4.3.2.4  Summary of UES ANOVA Results
     Aggregation at the program major level was supported for first- and fourth-year students on their UES ratings of perceived general learning, if we accept the ICC(2) values of 0.78 rather than the criterion of greater than 0.83.  These results suggest that first-year students differed by about 15% on ratings of their perceived general learning as a result of being in a particular program major, and fourth-year students differed by about 19%, as measured by the ICC(1) values.  In addition, the means are fairly reliable at 0.78 for both year levels, even if they did not exceed the 0.83 criterion value described by Griffith (2002).  Finally, the within-group agreement (rwg) for both year levels was quite high, at 0.79 for first-year students and 0.81 for fourth-year students.
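For reference, a minimal Python sketch of how the aggregation statistics in Table 4.13 can be computed from one-way ANOVA quantities is given below.  The column names are hypothetical, and the within-group agreement index is shown in its single-variable, uniform-null form on the original response metric; the exact variants used in this study are assumed rather than reproduced.

import pandas as pd

def aggregation_statistics(df, outcome, group, n_options):
    """ICC(1), ICC(2), and mean rwg for aggregating `outcome` by `group`.

    `n_options` is the number of response options on the original item metric,
    used for the uniform-null expected variance in rwg (an assumption here)."""
    groups = df.groupby(group)[outcome]
    sizes = groups.size()
    J, N = len(sizes), len(df)
    grand_mean = df[outcome].mean()

    ss_between = (sizes * (groups.mean() - grand_mean) ** 2).sum()
    ss_within = groups.apply(lambda x: ((x - x.mean()) ** 2).sum()).sum()
    ms_between = ss_between / (J - 1)
    ms_within = ss_within / (N - J)
    k_bar = sizes.mean()

    icc1 = (ms_between - ms_within) / (ms_between + (k_bar - 1) * ms_within)
    icc2 = (ms_between - ms_within) / ms_between

    expected_var = (n_options ** 2 - 1) / 12.0            # variance of a uniform "random response"
    rwg = (1 - groups.var(ddof=1) / expected_var).clip(0, 1)
    return icc1, icc2, rwg.mean()

The ICC(1) and ICC(2) values follow the usual mean-square formulas, and rwg compares each program's observed variance with the variance expected if students responded at random.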
As for the learning environment scales, the ANOVA results suggest that for first-year students, supportive campus environment did not have a high ICC(2) value, at 0.39, which indicates that the reliability of the program-level means was not very strong; however, the within-group agreement statistic (rwg) was high at 0.77, which might indicate that at a subgroup level there was some strong agreement within the program groupings.  Also, the ICC(2) value for climate for diversity, at 0.64, was lower than the acceptable criterion, but might indicate that there was some level of correspondence among students within programs.  The within-group agreement (rwg) was acceptable at 0.76.
     The three program-level variables for the UES fourth-year sample had low ICC(2) values: overall satisfaction, 0.27; climate for diversity, 0.30; and supportive environment, 0.37.  The low ICC(2) values, combined with the relatively low within-group agreement statistics for fourth-year students, indicated that there was too much within-group variance to support aggregation to the program major level on all variables but perceived general learning outcomes.
4.3.3  Aggregation Statistics - UES WABA Results
     The WABA (Dansereau & Yammarino, 2000) was the second approach tested on the UES samples to determine if there were underlying relationships that influenced the levels of aggregation for the UES study variables.  Levels of inference were determined using the WABA approach based on the criteria described in Table 3.4 on page 71.  The results for the UES example are included in Table 4.14.
4.3.3.1  UES WABA Step One:  Level of Inference
     The WABA results for the UES example demonstrate that the between-program variances were larger than the within-group variances for both year levels, and all F-values were greater than 1.00, which indicated aggregation was appropriate to the program-major level.  All of the study variables had statistically significant F-values for first-year students, while only perceived general learning outcomes and campus climate for diversity were statistically significant for fourth-year students.
     The two program-level independent variables that were not statistically significant for fourth-year students were overall satisfaction and supportive campus environment.  The effect size (eta-squared) statistic represents the practical significance of aggregating to a particular level of inference.  Results across both year levels in this study indicated that for “practical purposes” aggregation to the program major was not supported; however, the results detected a “parts” effect, which suggests that sub-group populations may have responded similarly to each other, and some level of aggregation could be supported.

Table 4.14
UES WABA Approach to Test Levels of Inference

Variable/Inference        Variance Comparison    F-value    Effect Size
First-Year Students
Perceived learning          763.32 > 192.94       3.96***    0.19 < 33%;  0.81 > 66%
Overall satisfaction        875.82 > 318.59       2.75***    0.16 < 33%;  0.84 > 66%
Campus diversity            525.28 > 206.58       2.54***    0.15 < 33%;  0.85 > 66%
Supportive campus           305.25 > 192.73       1.58*      0.12 < 33%;  0.88 > 66%
Fourth-Year Students
Perceived learning          607.35 > 162.81       3.73***    0.20 < 33%;  0.80 > 66%
Overall satisfaction        390.50 > 279.63       1.40       0.13 < 33%;  0.87 > 66%
Campus diversity            387.04 > 253.29       1.53*      0.13 < 33%;  0.87 > 66%
Supportive campus           556.29 > 418.10       1.33       0.12 < 33%;  0.88 > 66%
Note. *** p < 0.001. ** p < 0.01. * p < 0.05.
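A minimal Python sketch of the WABA step-one quantities in Table 4.14 is given below; the column names are hypothetical, and it is assumed (not confirmed by the source) that the tabled variance comparison reports the between- and within-group mean squares.

import pandas as pd

def waba_step_one(df, outcome, group):
    """Between/within variance comparison, F-test, and eta-squared components (WABA I)."""
    groups = df.groupby(group)[outcome]
    sizes = groups.size()
    J, N = len(sizes), len(df)
    grand_mean = df[outcome].mean()

    ss_between = (sizes * (groups.mean() - grand_mean) ** 2).sum()
    ss_within = groups.apply(lambda x: ((x - x.mean()) ** 2).sum()).sum()

    ms_between = ss_between / (J - 1)
    ms_within = ss_within / (N - J)
    f_value = ms_between / ms_within

    eta2_between = ss_between / (ss_between + ss_within)   # compared with the 33% criterion
    eta2_within = ss_within / (ss_between + ss_within)     # compared with the 66% criterion
    return ms_between, ms_within, f_value, eta2_between, eta2_within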
4.3.3.2  UES WABA Step Two:  Bivariate Analysis
     The next step in the WABA procedure used a bivariate analysis to examine the correlations between perceived general learning and the three UES independent variables at the student and program levels.  Table 3.5 on page 87 displays the criteria for interpreting the WABA results, as described by Griffith (2002, p. 122), who adapted them from Dansereau and Yammarino (2000).  Based on these criteria, Table 4.15 reports the results of the bivariate correlations for both UES first- and fourth-year students.
     These results show that for first-year students the relationships between perceived general learning outcomes and two of the three program-level variables could be examined as program-level occurrences, or “wholes,” but that campus climate for diversity could not be aggregated to the program major level in this sample.  Results for fourth-year students indicated that the study variables should be interpreted from a lower-level grouping within program major.

Table 4.15
UES WABA Comparison of Bivariate Correlations

Correlation with perceived general learning outcomes    Comparison of Correlations     Z test      Level of Inference
First-Year Students
Overall satisfaction                                       0.60** > 0.30**              13.07***       Wholes
Campus climate for diversity                               0.00  < 0.20**               -6.90***       Parts
Supportive campus environment                              0.33** > 0.25**               2.98**        Wholes
Fourth-Year Students
Overall satisfaction                                      -0.52** < 0.13**             -24.69***       Parts
Campus climate for diversity                              -0.04*  < 0.20**              -8.50***       Parts
Supportive campus environment                              0.09** < 0.29**              -7.28***       Parts
Note. *** p < .001. ** p < .01.

4.3.3.3  UES WABA Step Three:  Decomposition of Correlations
     Next, the correlations between student perceptions of their perceived general learning outcomes and the program-level independent variables were decomposed into between-components and within-components to determine whether the relationships between these UES variables should be studied as “parts,” “wholes,” “equivocal,” or “null” (Dansereau & Yammarino, 2006).  The results, as shown in Table 4.16, indicated that these relationships should more appropriately be studied as sub-groups within programs, or the “parts” effect, rather than at the level of program major (Dansereau & Yammarino, 2006).  The variance explained between groups was quite small, but within-group variation was much larger and statistically significant for all variables across both year levels.  Similar to the WABA findings for the NSSE example, these results tended to support the inference that the between-group variance across variables was not homogeneous, and the patterns of results suggested that the only meaningful levels of analysis were at the within-group level.
     A practical level of inference, measured by the effect sizes, was at the “parts” level for all study variables across both year levels.  These results suggest that some differences in responses are based on program membership for both year levels, but the tests of effect size indicated that variations in student responses were likely associated with a lower-level effect within programs (Dansereau & Yammarino, 2000).
Table 4.16
UES WABA Results for the Decomposition of Correlations

Correlation with perceived general learning    Between-Component    Within-Component
First-Year Students
Overall satisfaction                                 0.02                 0.29**
Campus climate for diversity                         0.00                 0.19**
Supportive campus environment                        0.00                 0.24**
Fourth-Year Students
Overall satisfaction                                 0.00                 0.28**
Campus climate for diversity                         0.01                 0.13**
Supportive campus environment                        0.00                 0.19**
Note. ** p < 0.01.

4.3.3.4  Summary of UES WABA Results
     Overall, the results from the WABA analysis were inconsistent with the first aggregation procedure tested on the UES, the one-way ANOVA with random effects.  Results from the ANOVA analysis indicated that at least three study variables could be aggregated for the UES samples to the program major, while the WABA approach did not support aggregation, in the practical sense, to this level.
     Although the WABA approach did not support aggregation to the program major level, the results indicated that some level of aggregation would be appropriate, which suggests multiplexed inferences (Forer & Zumbo, 2011).  The WABA approach allowed a third multilevel pattern to emerge that went beyond the choice between the disaggregate student level and the aggregate program level, to include the interdependency of individuals within programs, or the “parts” inference.  These results suggest that group disagreement within program areas requires further examination, because there appeared to be an appropriate level of aggregation at a sub-group population level.
4.3.4  Aggregation Statistics - UES Unconditional Multilevel Model Results
     The unconditional multilevel model was the third aggregation procedure tested on the UES data to determine how much programs varied in their average ratings.  Each of the four UES study variables was fitted as the outcome variable in a one-way ANOVA with random effects model.  This model was estimated using REML.
     Table 4.17 displays the results of the unconditional multilevel analyses for first- and fourth-year students.  The proportion of variance explained was calculated using Equation 3.6 on page 90.  The amount of variance explained by program groupings for the first- and fourth-year samples ranged from 0.00 to 0.04, which was quite low.  The low proportion of variance explained implies that there was more variability within program majors than among program majors on the outcome variables.
     In addition, the reliability estimates for each of the outcome variables ranged from extremely low values of 0.13 to a high of 0.49.  The largest reliability estimates were for perceived general learning outcomes across both year levels.  The p-values associated with perceived general learning outcomes for first- and fourth-year students were statistically significant, p < 0.001.
     These results for the unconditional multilevel model supported aggregation at the program major level only for perceived general learning, for both year levels.
Table 4.17
UES Unconditional Multilevel Results to Test Aggregation by Program Major

Study Variables                  Between-variance   Within-variance   Reliability   Chi-square   p-value
First-Year Students
Perceived general learning            5.99              196.81           0.49          84.22     < 0.001
Overall satisfaction                  6.23              323.24           0.41          66.89     < 0.001
Campus climate for diversity          6.23              212.91           0.49          58.19     < 0.001
Supportive campus environment         0.86              197.93           0.20          41.16       0.011
Fourth-Year Students
Perceived general learning            6.79              165.27           0.48         104.36     < 0.001
Overall satisfaction                  1.35              422.94           0.13          40.83     < 0.001
Campus climate for diversity          1.62              288.75           0.19          41.66     < 0.001
Supportive campus environment         1.74              259.04           0.21          43.36       0.054

4.3.4.1  UES Empirical Bayes Point Estimates
     Due to the small number of groups used in this analysis, EB residual estimates, shown in Figures 4.4 and 4.5, were used to compare results on average ratings regarding perceived learning across program majors for the UES samples.

Figure 4.4.  The Empirical Bayes Point Estimates with 95% Credible Values Plotted using HLM software for Average Perceived General Learning (Y axis) for Programs in the UES First-year Sample (X axis) with Overall Mean (69.70) (unconditional two-level multilevel model).  X axis: UES program majors; Y axis: perceived general learning; Vancouver campus in red, Okanagan campus in blue.

Figure 4.5.  The Empirical Bayes Point Estimates with 95% Credible Values Plotted using HLM software for Perceived General Learning (Y axis) for Programs in the UES Fourth-year Sample (X axis) with Overall Mean (76.56) (unconditional two-level multilevel model).  X axis: 29 UES program major groups; Y axis: perceived general learning; Vancouver campus in red, Okanagan campus in blue.

4.3.4.2  Summary of UES Unconditional Two-Level Multilevel Model Results
     In summary, the unconditional two-level multilevel model provided evidence that there was more within-group variance than between-group variance based on the program means for three of the four study variables for both year levels.  Aggregating student perceptions about their learning to the program major level was not supported for first- and fourth-year students using this unconditional multilevel model approach, due to low reliability values for the program means.  For the other three variables, results indicated that the program means were not reliable because of too much variance on the outcome variables within programs rather than between program majors.
4.3.5  Overall Summary of Aggregation Statistics for UES
     The results from the one-way ANOVA with random effects supported aggregation of perceived general learning for both year levels.  The ANOVA findings indicated that about 15% of the variance in first-year student ratings on perceived general learning could be attributed to belonging to a program, and about 19% for fourth-year students.  The ANOVA results also indicated that there were high levels of within-group agreement for first-year students in the UES sample on campus climate for diversity and supportive campus environment, suggesting that there might be within-group agreement at a lower-level unit than the program level.
     The WABA procedures provided additional information regarding aggregation, suggesting that there was interdependence among some individuals based on program membership, but that aggregation to the level of program major for all of the study variables for both UES years was not practically supported.
Rather, the WABA results suggested that these data could be interpreted at the “parts” level.
     The next step was to examine these variables using regression approaches to model student- and program-level characteristics to determine if the variability within program major could be better understood.
4.3.6  UES Multilevel Regression Model Results
     As with the NSSE samples, the multilevel model was built using UES data in a three-step process.  The baseline model used the results from the unconditional multilevel model without independent variables at any level; then a random-coefficient model with independent variables at the student level only was fitted; and finally the intercepts- and slopes-as-outcomes model, in which independent variables were included at both the student and program levels, was analysed.
     The unconditional model for the UES samples was discussed in this chapter under the aggregation section, 4.3.4.  Multilevel models were built to determine if additional information could inform ratings on learning gains across both year levels for the UES samples.
4.3.6.1  UES Random-Coefficient Regression Model Results
     Only the student-level variables were included in the random-coefficient regression model fitted to the UES data.  This model was designed to represent the distribution of ratings about students’ perceived general learning in each of the 24 program majors for first-year students and 31 programs for fourth-year students.  Specifically, at the student level, the perceived general learning composite rating for a student in a particular program major was regressed on gender, direct entry status, international status, entering GPA, and current GPA.  All student-level variables were group-mean centred.  The results of the random-coefficient model are reported in Tables 4.18 and 4.19, for first- and fourth-year students, respectively.
     The average program rating of students’ perceived general learning was 68.83 for first-year students and 76.70 for fourth-year students.  The reliability estimates used for examining the hypothesis about the effects of program characteristics on average ratings of student perceptions about their general learning were quite low, at 0.54 for the first-year sample and 0.53 for the fourth-year UES sample.  In terms of the reliability for examining how program-level characteristics influenced the slopes, Raudenbush and Bryk (2002) suggested that levels of 0.10 or higher were acceptable.  In the first-year sample, current GPA met the acceptable level at 0.15, as did entering GPA at 0.18; however, the p-values were not statistically significant.  For the fourth-year UES sample, the reliability of entering GPA was acceptable at 0.11, but the reliability of current GPA was not acceptable, with a value of 0.06.  In addition, the current GPA and entering GPA slopes were not statistically significant.  For both UES samples, these two variables were considered random in the full intercepts- and slopes-as-outcomes models.
     The random-coefficient regression model also indicated that perceived learning ratings were statistically significantly different across first-year students in the UES example, and were positively influenced by current GPA (p = 0.010), entering GPA (p < 0.001) and international status (p < 0.001).  Perceptions about student general learning were negatively related to direct entry status, with a reduction of 3.83 points (p < 0.001) for first-year students.
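The proportion-of-variance-explained figures reported here and below are assumed to follow the usual level-1 variance-reduction formula; Equations 3.6 and 3.11 from Chapter Three are not reproduced, so this is a sketch of the standard form:

\[ R^{2}_{\text{level-1}} = \frac{\hat{\sigma}^{2}_{\text{baseline}} - \hat{\sigma}^{2}_{\text{fitted}}}{\hat{\sigma}^{2}_{\text{baseline}}} \]

where the baseline is the unconditional model (or, for comparisons between successive models, the previously fitted model) and the fitted model is the one whose additional predictors are being evaluated.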
The proportion of variance explained in this model over the unconditional multilevel model was 4%, which was calculated using Equation 3.11.
     Perceived general learning outcomes for fourth-year students in the UES sample were positively related to current GPA (p < 0.001), international status (p < 0.001), and gender (p = 0.044).  The proportion of variance explained in this model over the unconditional multilevel model was 4%, which was calculated using Equation 3.11.

Table 4.18
UES Results from the Random-Coefficient Regression for First-Year Students

Fixed Effect                              Coefficient   Standard error   t-ratio   p-value
Perceived general learning (intercept)       68.83           0.71          98.50    <0.001
Current GPA                                   0.07           0.03           2.81     0.010
Entering GPA                                  0.24           0.06           3.93    <0.001
International status (1=Dom 0=Intl)           2.74           0.59           4.66    <0.001
Gender status (1=Male 0=Female)               1.18           0.68           1.75     0.081
Direct entry (1=Direct 0=Transfer)           -3.83           1.15          -3.32    <0.001

Random Effects                 Variance Component   d.f.      χ2      p-value
Perceived general learning           6.356           21       86.71    <0.001
Current grade                        0.005           21       20.36    >0.500
Entering grade                       0.015           21       25.33     0.233
level-1, r                         188.327

Reliability of OLS Regression-Coefficient Estimates
Average Gains in General Learning    0.54
Current GPA                          0.15
Entering GPA                         0.18

Table 4.19
UES Results from the Random-Coefficient Model for Fourth-Year Students

Fixed Effect                    Coefficient   Standard error   t-ratio    p-value
Intercept                          76.700          0.658        116.503    <0.001
Entering GPA slope                  0.011          0.011          1.053     0.301
Current GPA slope                   0.267          0.033          8.190    <0.001
Direct entry slope                 -0.817          0.660         -1.238     0.216
International status slope          3.632          1.063          3.418    <0.001
Gender slope                        0.938          0.466          2.012     0.044

Random Effects                 Variance Component   d.f.      χ2      p-value
Perceived general learning           6.947           28      106.18    <0.001
Entering GPA slope                   0.001           28       39.11     0.079
Current GPA slope                    0.005           28       29.85     0.370
level-1, r                         155.364

Reliability of OLS Regression-Coefficient Estimates
Perceived general learning    0.53
Entering GPA                  0.11
Current GPA                   0.06

4.3.6.2  UES Intercepts- and Slopes-as-Outcomes Model Results
     The results for the intercepts- and slopes-as-outcomes model are reported in Table 4.20 for first-year students and in Table 4.21 for fourth-year students.
     None of the program-level variables were statistically significantly related to average perceived general learning outcomes for first-year students when student-level and program-level variables were included in the model.  The entering GPA slope adjusted the intercept slightly upwards (p = 0.020), and the international status slope was adjusted slightly upwards for domestic students as the proportion of male students in a program increased, but the coefficient values were extremely small.  GPO per FTE slightly reduced the direct entry slope (p = 0.066).
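The full models reported in Tables 4.20 and 4.21 take the general intercepts- and slopes-as-outcomes form sketched below; the program-level predictors listed are the ones used in this study, but the exact parameterization in Chapter Three is assumed rather than reproduced:

\[ \beta_{0j} = \gamma_{00} + \gamma_{01}(\text{Campus})_{j} + \gamma_{02}(\text{PropMale})_{j} + \gamma_{03}(\text{StudentFacultyRatio})_{j} + \gamma_{04}(\text{GPOperFTE})_{j} + \gamma_{05}(\text{PropIntl})_{j} + u_{0j} \]

\[ \beta_{qj} = \gamma_{q0} + \gamma_{q1}(\text{Campus})_{j} + \cdots + \gamma_{q5}(\text{PropIntl})_{j} \; (+\, u_{qj} \text{ for the slopes allowed to vary}) \]

so each cross-level interaction in the tables corresponds to a γ coefficient for a program-level predictor acting on a student-level slope.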
Table 4.20
UES Results from the Intercepts- and Slopes-as-Outcomes Model for First-Year Students

Fixed Effect                              Coefficient   Standard error   t-ratio   p-value
Perceived General Learning
  Intercept                                69.607639       0.7196         96.73    <0.001
  Campus (1=Van 0=Okn)                     -3.844209       1.8800         -2.05     0.058
  Proportion Male                          -0.000296       0.0009         -0.31     0.759
  Student-to-Faculty Ratio                 -0.202949       0.1745         -1.16     0.262
  GPO per FTE                              -0.000253       0.0002         -1.12     0.278
  Proportion International                  0.003274       0.0029          1.11     0.283
Current GPA slope
  Intercept                                -0.005173       0.0785         -0.07     0.948
  Campus (1=Van 0=Okn)                      0.154327       0.1915          0.81     0.432
  Proportion Male                          -0.000031       0.0001         -0.45     0.657
  Student-to-Faculty Ratio                  0.015222       0.0160          0.95     0.355
  GPO per FTE                              -0.000022       0.0000         -0.74     0.469
  Proportion International                 -0.000095       0.0002         -0.48     0.637
Entering GPA slope
  Intercept                                 0.257950       0.0999          2.58     0.020
  Campus (1=Van 0=Okn)                     -0.061654       0.2629         -0.24     0.818
  Proportion Male                           0.000154       0.0001          1.39     0.185
  Student-to-Faculty Ratio                  0.002368       0.0248          0.10     0.925
  GPO per FTE                               0.000020       0.0000          0.55     0.594
  Proportion International                 -0.000302       0.0003         -0.92     0.370
International Status slope (1=Dom 0=Intl)
  Intercept                                -0.620509       2.0992         -0.30     0.768
  Campus (1=Van 0=Okn)                     -2.697784       4.7248         -0.57     0.568
  Proportion Male                           0.003063       0.0015          2.06     0.039
  Student-to-Faculty Ratio                  0.219907       0.3843          0.57     0.567
  GPO per FTE                               0.000156       0.0006          0.26     0.795
  Proportion International                 -0.004349       0.0036         -1.22     0.223
Gender Status slope (1=Male 0=Female)
  Intercept                                -0.190324       1.3666         -0.14     0.889
  Campus (1=Van 0=Okn)                     -2.605000       3.2400         -0.80     0.421
  Proportion Male                           0.001513       0.0010          1.51     0.131
  Student-to-Faculty Ratio                  0.083681       0.2709          0.31     0.757
  GPO per FTE                               0.000016       0.0005          0.03     0.973
  Proportion International                 -0.002428       0.0025         -0.96     0.340
Direct Entry slope (1=Direct 0=Transfer)
  Intercept                                -1.170           1.781          -0.66     0.511
  Campus (1=Van 0=Okn)                     -3.933           4.659          -0.84     0.399
  Proportion Male                          -0.001           0.002          -0.66     0.507
  Student-to-Faculty Ratio                  0.681           0.498           1.37     0.172
  GPO per FTE                               0.002           0.001           2.31     0.021
  Proportion International                  0.003           0.005           0.57     0.571

Random Effect                              Variance Component   d.f.      χ2      p-value
Average Perceptions of General Learning          4.399           16      45.94    <0.001
Current GPA slope                                0.013           16      20.47     0.200
Entering GPA slope                               0.036           16      23.52     0.100
level-1, r                                     188.488

Reliability of OLS Regression-Coefficient Estimates
Gains in General Learning    0.51
Current GPA                  0.14
Entering GPA                 0.20

     When the student- and program-level variables were included in the first-year UES sample model, the reliability estimate of 0.51 was close to the value estimated for the random-coefficient regression model of 0.54.  The reliability estimates for the current GPA and entering GPA slopes in the full model were similar to the random-coefficient model, but the p-values associated with both were again not statistically significant.  The proportion of variance explained in this model over the unconditional multilevel model was minimal, which was calculated using Equation 3.12.

Figure 4.6.  The Empirical Bayes Point Estimates with 95% Credible Values Plotted using HLM software for Perceived General Learning (Y axis) for Programs in the UES First-year Sample (X axis) with Overall Mean (69.61) (intercepts- and slopes-as-outcomes model).
As shown in Figure 4.6, when programs were plotted using the EB point estimates of the program residuals with 95% credible values, there was more differentiation among programs than with the unconditional multilevel model.

     Again, none of the program-level variables were statistically significantly related to average perceived general learning outcomes for the fourth-year UES sample.  Results for the fourth-year students differed from those for the first-year students.  In this model, current GPA slightly increased perceived general learning outcomes, by 0.25 points (p = 0.002).  As the proportion of males increased in the program major, there was a minimal reduction in the negative influence of the direct entry slope (p = 0.017).  Fourth-year domestic students tended to rate their average perceived general learning outcomes about 5.75 points higher than international students (p = 0.053).

     Overall, for both the first- and fourth-year students there were still statistically significant differences between program averages on perceived general learning outcomes that were not explained by the variables included in these two models, as demonstrated by a chi-square value of 57.11 (p < 0.001) for fourth-year students and 45.94 (p < 0.001) for first-year students.

Table 4.21

UES Results from the Intercepts- and Slopes-as-Outcomes Model for Fourth-Year Students

Fixed Effect                                 Coefficient   Standard Error   t-ratio   p-value
Perceived General Learning
  Intercept                                   76.821796       0.655701       117.16    <0.001
  Campus (1=Van, 0=Okn)                       -3.540471       1.893126        -1.87     0.074
  Proportion Male                             -0.000502       0.000920        -0.545    0.591
  Student-to-Faculty Ratio                    -0.159139       0.139336        -1.142    0.265
  GPO per FTE                                  0.000015       0.000013         1.162    0.257
  Proportion International                     0.004506       0.002714         1.66     0.110
Entering GPA
  Intercept                                   -0.000358       0.030184        -0.012    0.991
  Campus (1=Van, 0=Okn)                       -0.071702       0.070564        -1.016    0.320
  Proportion Male                             -0.000007       0.000023        -0.297    0.769
  Student-to-Faculty Ratio                     0.002494       0.005877         0.424    0.675
  GPO per FTE                                  0.000000       0.000002         0.033    0.974
  Proportion International                     0.000031       0.000055         0.573    0.572
Current GPA
  Intercept                                    0.254281       0.073466         3.461    0.002
  Campus (1=Van, 0=Okn)                        0.184210       0.178464         1.032    0.313
  Proportion Male                             -0.000040       0.000068        -0.59     0.561
  Student-to-Faculty Ratio                     0.015943       0.016594         0.961    0.347
  GPO per FTE                                  0.000001       0.000002         0.296    0.770
  Proportion International                    -0.000069       0.000191        -0.36     0.722
Direct Entry (1=Direct, 0=Transfer)
  Intercept                                   -2.046524       6.162219        -0.332    0.740
  Campus (1=Van, 0=Okn)                       -3.102502       3.009598        -1.031    0.303
  Proportion Male                              0.002259       0.000944         2.392    0.017
  Student-to-Faculty Ratio                    -0.338588       0.223753        -1.513    0.130
  GPO per FTE                                  0.000030       0.000432         0.069    0.945
  Proportion International                    -0.000190       0.002329        -0.082    0.935
International Status (1=Dom, 0=Intl)
  Intercept                                    5.753378       2.974085         1.935    0.053
  Campus (1=Van, 0=Okn)                       -1.077956       5.640943        -0.191    0.848
  Proportion Male                             -0.000474       0.001993        -0.238    0.812
  Student-to-Faculty Ratio                     0.159200       0.355045         0.448    0.654
  GPO per FTE                                 -0.000021       0.000033        -0.646    0.518
  Proportion International                    -0.002908       0.003704        -0.785    0.432
Gender Status (1=Male, 0=Female)
  Intercept                                    1.362096       1.112902         1.224    0.221
  Campus (1=Van, 0=Okn)                       -4.887971       2.671717        -1.83     0.067
  Proportion Male                              0.000640       0.000836         0.766    0.444
  Student-to-Faculty Ratio                    -0.296705       0.197984        -1.499    0.134
  GPO per FTE                                  0.000008       0.000026         0.331    0.741
  Proportion International                     0.001835       0.001944         0.944    0.345

Random Effect                                Variance Component   d.f.    χ2      p-value
Average Perceived Learning                          5.2800          23    57.11    <0.001
Entering GPA slope                                  0.0005          23    31.36     0.114
Current GPA slope                                   0.0145          23    25.81     0.310
Level-1, r                                        155.1094

Reliability of OLS Regression-Coefficient Estimates
Gains in General Learning                           0.47
Entering GPA                                        0.07
Current GPA                                         0.18

     As shown in Figure 4.7, when programs were plotted using the EB point estimates of the program residuals with 95% credible values, there was more differentiation among programs than with the unconditional two-level multilevel model.  Again, this full model did not explain any additional variance over that of the random-coefficient regression model.

Figure 4.7.  The Empirical Bayes Point Estimates with 95% Credible Values Plotted using HLM for Perceived General Learning (Y axis) for the 24 Program Major Groups in the UES Fourth-year Sample (X axis) with Overall Mean (76.82) (intercepts- and slopes-as-outcomes model).  [Figure: Vancouver campus programs shown in red, Okanagan campus programs in blue; axis tick values omitted.]

4.3.7  Overall Summary of UES Findings

     The results from the UES example suggested that aggregation to the program level was supported for both first- and fourth-year samples regarding their perceptions about their general learning but, similar to the NSSE, aggregation was not supported for ratings about the learning environment across both year levels.  Again, as with the NSSE, the results of the two-level MFA corresponded with the results from the aggregation procedures.  Only the program-level analysis of perceived general learning was supported by the two-level MFA and the aggregation procedures; however, the program means reported in the unconditional multilevel model were not acceptable.  Results suggested that none of the learning environment scales should be analysed at the program level.

     Thus, these UES results answered the first research question by indicating that only one study variable could be appropriately aggregated to the program level and included in subsequent analyses:  perceived general learning outcomes for the first- and fourth-year UES samples.  Although the WABA approach indicated that aggregation was statistically appropriate across first- and fourth-year UES samples for perceived general learning, the effect size analyses indicated that aggregation to the program level was not practically supported.  None of the other study variables met the acceptable criteria used to determine the appropriateness of aggregating to the program level, and they were thus excluded from further analyses.  Again, these results demonstrate the consequence of making assumptions about the multilevel structure without empirically examining those assumptions prior to drawing conclusions at the aggregate level.

     As a result of the MFA and the aggregation procedures, the only scale used in subsequent analyses was perceived general learning for the first- and fourth-year samples.  The results of the two-level multilevel regression models developed for the UES samples were interpreted to address the second research question in this study:  What student- and program-level characteristics, and cross-level relationships, are associated with student perceptions about their general learning?
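     As context for the regression results discussed below, the general form of the two-level models referred to in this section can be sketched as follows.  This is a generic sketch in the notation of Raudenbush and Bryk (2002), not the exact estimating equations used in this study; it shows one illustrative student-level predictor (current GPA) and one illustrative program-level predictor (campus), whereas the fitted models included all of the student- and program-level variables listed in Tables 4.20 and 4.21.

    % Level 1 (student i in program major j)
    Y_{ij} = \beta_{0j} + \beta_{1j}(\mathrm{CurrentGPA})_{ij} + r_{ij}, \qquad r_{ij} \sim N(0, \sigma^{2})

    % Level 2 (program major j): intercepts and slopes as outcomes
    \beta_{0j} = \gamma_{00} + \gamma_{01}(\mathrm{Campus})_{j} + u_{0j}
    \beta_{1j} = \gamma_{10} + \gamma_{11}(\mathrm{Campus})_{j} + u_{1j}

In this sketch, \gamma_{01} is a program-level effect on the average outcome, \gamma_{11} is a cross-level interaction in which a program-level variable moderates a student-level slope, and the variances of u_{0j} and u_{1j} correspond to the random-effect variance components reported in Tables 4.20 and 4.21.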
     The first-year student results of the random-coefficient multilevel regression models, where only the student variables were included in the model, indicated that three of the five student variables were positively related to perceived general learning outcomes:  current GPA, entering GPA, and domestic student status.  In addition, first-year students who entered UBC directly from high school indicated lower perceived general learning outcomes compared with transfer students.  These results could imply that first-year students entering directly from high school may have a more difficult transition to UBC in their first year when compared with transfer students, and rate their perceived learning lower as a result.  When both student- and program-level variables were included in the full intercepts- and slopes-as-outcomes model, results for the first-year UES sample indicated that students at the Vancouver campus rated their learning gains lower than students at the Okanagan campus (p = 0.058), while all other program variables were not statistically related to perceived learning.  As for the cross-level relationships for the first-year sample, as the proportion of male students increased across programs, the negative rating of general learning by domestic students was slightly reduced (p = 0.039).  Also, as the GPO per FTE increased across programs, the negative rating of perceived learning outcomes made by direct entry students was slightly reduced (p = 0.021).  These results should be interpreted with extreme caution, though, because the reliability of the program means was quite low: 0.54 for the random-coefficient regression model and 0.51 for the full intercepts- and slopes-as-outcomes model.  These low reliability values indicate that the program means included in this study cannot be used reliably to make program comparisons across average ratings of perceived general learning.

     For the fourth-year UES sample, the results of the random-coefficient regression model indicated that only three student variables were positively related to perceived general learning outcomes:  current GPA, domestic student status, and male student status.  Once the data were fitted to a full intercepts- and slopes-as-outcomes model, only current GPA and domestic student status were positively related to students' perceived general learning outcomes, and none of the program-level variables were statistically significantly related to perceived general learning.  As for the cross-level relationships, or interactions, as the proportion of male students increased across programs there was a positive influence on perceived general learning for direct entry students (p = 0.017).  Again, these results should be interpreted cautiously because the reliability estimate for the random-coefficient regression model was 0.53 and for the full intercepts- and slopes-as-outcomes model it was 0.47, values too low to support comparing outcomes across programs in this study.

4.4  Summary of Research Findings

     These analyses were conducted to address the two research questions in this study:  (1) Are program-level inferences made from aggregate student perceptions about their general learning outcomes and ratings of the learning environment valid when aggregated to the program major level; and (2) What student- and program-level characteristics, and cross-level relationships, are associated with student perceptions about their general learning outcomes?
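     Because the summary that follows relies on the ANOVA-based aggregation indices, a minimal sketch of how ICC(1) and ICC(2) can be computed from a one-way analysis of variance is included here for reference.  The data, program labels, and function name below are hypothetical and purely illustrative; the formulas are the standard mean-square-based ones, using the harmonic mean group size as was done in this study.

    import numpy as np
    import pandas as pd

    def icc_from_anova(scores, groups):
        """ICC(1) and ICC(2) from a one-way ANOVA on scores nested in groups."""
        df = pd.DataFrame({"y": scores, "g": groups})
        grand_mean = df["y"].mean()
        sizes = df.groupby("g")["y"].count()
        means = df.groupby("g")["y"].mean()
        k = len(sizes)                                # number of program majors
        n_h = k / (1.0 / sizes).sum()                 # harmonic mean group size
        ss_between = (sizes * (means - grand_mean) ** 2).sum()
        ss_within = ((df["y"] - df["g"].map(means)) ** 2).sum()
        ms_between = ss_between / (k - 1)
        ms_within = ss_within / (len(df) - k)
        icc1 = (ms_between - ms_within) / (ms_between + (n_h - 1) * ms_within)
        icc2 = (ms_between - ms_within) / ms_between  # reliability of the group means
        return icc1, icc2

    # Hypothetical example: perceived-learning scores for students in three programs
    rng = np.random.default_rng(1)
    programs = np.repeat(["ARTS", "SCIE", "ENGR"], [40, 25, 15])
    scores = rng.normal(70, 14, size=len(programs)) + np.repeat([0.0, 2.0, -1.5], [40, 25, 15])
    print(icc_from_anova(pd.Series(scores), pd.Series(programs)))

In this sketch, ICC(1) indexes the proportion of variance attributable to program membership and ICC(2) indexes the reliability of the program means, which is why both appear alongside within-program agreement in the discussion that follows.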
     Overall, the results indicated that only perceived general learning outcomes could be inferred at the program level; none of the learning environment scales, such as climate for diversity, should be interpreted at the program level.  Furthermore, the results indicated low reliability estimates associated with the program means in this study.  These findings have significant consequences for the use of program-level analyses that are commonly conducted in institutional effectiveness research.

     To answer the first research question, the results of the two-level MFAs and three aggregation procedures were examined, and the results tended to correspond to each other; however, the WABA approach indicated that some other level of aggregation would be more practically appropriate than aggregation to the program level.  The WABA approach allowed for a third multilevel pattern that went beyond selecting either the disaggregate student level or the aggregate program level, to include the interdependency of students within programs, or the "parts" inference.

     The ANOVA aggregation approach for the NSSE sample indicated that program membership had little influence on students' responses to the study variables, as was demonstrated by low ICC(1) values, low reliability of group means (ICC(2)), and low levels of within-program agreement.  These results suggested that aggregation of the study variables across both year levels for the NSSE was not supported by the ANOVA approach, although the ICC values were fairly high for fourth-year students even though they did not quite meet the cut-off criteria described by Griffith (2002).  The differences in results across the NSSE first- and fourth-year student samples could be due to selection factors: the students in the fourth-year sample have decided to stay at UBC to complete their degrees, whereas the first-year students may be considering transferring to another institution.  In addition, program majors may not be as meaningful for first-year students as they might be for fourth-year students.  Differences in the ANOVA aggregation approach were found for the UES sample when compared to the NSSE.  The results from the ANOVA on UES data indicated that program membership did tend to influence student responses for first- and fourth-year students on perceived general learning outcomes, which suggests these variables could be aggregated to the program major level.

     Overall, the results from the first step of the WABA aggregation analysis in determining levels of inference suggested that all study variables for the NSSE and UES could be aggregated to the program major level, based on tests of statistical significance across both year levels; however, the effect size results indicated that the practical level of inference was more appropriate at a lower-level grouping within program major.

     When the WABA bivariate correlations between perceived general learning outcomes and the study independent variables were examined, results differed across the NSSE and UES measures.  Results for the first-year students in the NSSE sample supported aggregation at the program-major level, while for fourth-year students in the NSSE sample only emphasis on diversity could be aggregated to that level.
The results for the other three study variables in the fourth-year NSSE sample supported a lower-level grouping, or “parts.”   For the first-year UES sample, the aggregation of the variables overall satisfaction and supportive campus environment were supported at the program major level, but aggregation of campus climate for diversity to the program level was not supported.  All fourth-year learning environment variables aggregated to program level for the UES sample were not supported.  In both the UES and the NSSE, when the correlations were decomposed into their within- and between-components, the results showed that for all variables the within-component values were larger than the between-component values.        The second research question was addressed by examining the results of the multilevel regression models.  These results indicated that there were some student- and program-level characteristics that were related to perceived general learning outcomes, although the program means were not highly reliable.  Further, examination of the EB point estimates revealed that many of the programs could not be distinguished from each other because the 95% credible values for each program overlapped significantly.  Thus, these findings suggest that program comparisons on perceived general learning were not appropriate for the programs included in this  188   study.  Chapter Five describes the conclusions drawn from these findings, addresses the limitations of this study, identifies the contributions to institutional effectiveness research focused on student learning outcomes, and offers recommendations for further research.      189   5  Chapter Five: Discussion of Research Findings 5.1  Introduction Aggregate composite ratings based on student survey results regarding student engagement, learning and experiences have been commonly used in Canada and the USA to determine effective educational program practices (Chatman, 2009; Nelson Laird, Shoup & Kuh, 2005), despite the paucity of research examining the multilevel validity of inferences drawn from such analyses.  Results from these surveys are used to support high-stakes decisions such as making instructional changes or informing institutional policy (Porter, 2011).  As has been described throughout this dissertation, the appropriateness and accuracy of interpretation of results from student surveys as indicators of effectiveness at program levels or higher is a multilevel validity issue.  When the intent is to interpret aggregate program-level results, not the individual student responses, multilevel validity evidence is required to support interpretations made from these aggregate results (Zumbo & Forer, 2011).  A potential threat to the multilevel validity of interpretations made from these aggregate results occurs when there is heterogeneity within aggregation levels (Dansereau, Cho, & Yammarino, 2006).  Thus, interpretations made based on aggregate data may be problematic to interpret. This study contributes to the understanding of multilevel validation procedures by demonstrating the importance of multilevel validity evidence in support of aggregate-level interpretations in higher education effectiveness research.  The findings of this  190   study offer new insights into the use of aggregate survey results to examine general learning outcomes by program major, as well as attitudes about the university learning environment.   
This chapter provides a discussion of the findings of the research regarding the appropriateness of aggregating student-level survey outcomes for making program-level claims and drawing comparisons across programs.  The first half of this chapter connects the results of this study to other research studies conducted on the use of student surveys for examining program effectiveness and quality.  The next section addresses the limitations of this study and identifies how results might differ if alternate methodological decisions were made throughout the process of analysis.  Despite the study limitations, there were considerable findings from this study that have implications for survey research and institutional effectiveness research.  Finally, recommendations for further study are presented. 5.2  Discussion of Research Findings Group-level analyses are common in educational research (D’Haenens, Van Damme, & Onghena, 2008; Porter, 2011), and are of importance to institutional effectiveness research that focuses on the interactions among groups as well as the interaction between the individual and group levels; yet the appropriate level of aggregation based on individual results requires empirical and substantive support.  Many group-level analyses regarding program of study have neglected to examine the appropriateness of aggregation prior to drawing their conclusions.     191   As was described in Chapter One, a recent trend in higher education has been to measure general learning outcomes at the program level to determine educational quality/effectiveness, particularly for program accreditation and academic program reviews (Ewell, 2008; NSSE, 2011).  Measuring general learning outcomes is a complex process because currently there is little consensus on how to measure student general learning across programs and institutions (Penn, 2011; Porter, 2012).  The use of surveys (e.g the UES and the NSSE) has become a common approach in Canada to solicit feedback from students regarding their perceived learning outcomes and their impressions regarding the learning environment.  Although the primary purpose for the NSSE and the UES is to determine student engagement and reflections regarding their learning experiences, they have also been used to report on general learning outcomes (Pike, 2006a) and to examine aggregate survey results across program majors (NSSE, 2013).  It is this secondary purpose that was the focus of this study, and the results illustrate the importance of determining the multilevel validity of aggregate group-level inferences.   Chapter Two in this dissertation described how the contemporary application of validity theory is focused on the interpretations or conclusions drawn from the assessment results rather than the actual test scores or survey responses (Borden & Young, 2008), and that evidence is provided in an argumentative way to support interpretations (Kane, 2006, 2013).  It was from this modern perspective of validation, particularly with consequential validity, that Hubley and Zumbo (2011) proposed an extension to validity theory to include multilevel validity.  Furthermore, Zumbo and  192   Forer (2011) argued that a greater number of educational research studies should include validation efforts of multilevel validity, and should provide supporting evidence that the interpretations of individual assessment results are still meaningful when aggregated to some higher-level unit.  
These recent conceptual and measurement developments provided the motivation for this study and whether program-level inferences could be drawn from aggregate student outcomes from the NSSE and the UES. Data from surveys like the NSSE and the UES have a nested structure; although the data are collected from students, the results are intended to be interpreted at the program level. Due to the multilevel nature of these types of surveys they should be examined from a multilevel perspective that takes into consideration the nested design, and evidence must be provided that supports interpretations drawn from the aggregate level.  The analytic approaches used in this study are intended to be sufficiently comprehensive that they can be applied to validating aggregate results from other multilevel constructs in higher education.  This study further established the importance of determining multilevel validity for program-level analysis and program major comparisons, as well as more broadly for institutional effectiveness research overall.   A review of the literature suggests that past reports about quality in higher education have used measurement models that do not adequately capture the complex nature of student learning within the context of the institutional learning environment (Liu, 2011; Olivas, 2011).  Most important, the interpretations made from aggregate student results in higher education currently lack sufficient evidence of multilevel  193   validity (Borden & Young, 2008; Porter, 2006).  The research examining institutional effectiveness using the NSSE data has usually involved multiple institutions analysed together, and findings are reported across universities and colleges (Olivas, 2011).  Despite the fact that the purpose of these measures is for internal institutional use, there have been few attempts to validate these models within a single university setting (Borden & Young, 2008).  The findings in this study identify the importance of determining multilevel validity prior to interpreting aggregate survey ratings based on individual student results.     The results in this study suggest that the use of student surveys as indirect measures of student general learning should, at best, be only used for formative evaluation and not summative evaluation.  The individual student survey results can be useful for determining student perceptions about their learning environment, and as such may provide guidance on how to follow-up with additional measures of quality and student learning to inform improvement efforts within the program major.  Researchers should not rely on student survey results alone in understanding student learning, but use direct measures of learning as well. 5.2.1  Sample Representativeness and Item/Scale Reliability       A representative survey sample is one that has strong external validity in relation to the target population the sample is meant to represent.  As such, the findings from the survey can be generalized with confidence to the population of interest.  In this study, the percentage of female students, percentage of international students, and the  194   percentage of direct entry students for each sample were compared against the population to determine their representativeness.  The survey samples did not appear to be representative across these variables, because they were over-represented by female students by about 10%, and by about the same amount for the percentage direct entry students for the UES fourth-year sample.  
The proportions of respondents across year levels and surveys from UBC's Okanagan campus were similar to the proportion of Okanagan students in the overall population.  There might be evidence to suggest that these study samples were similar to the population, but not representative.  In addition, the perceptions of students who did not participate in the surveys could be substantially different from those of the students who chose to respond.  The over-representation of female students in the NSSE respondent group is consistent with the findings of Gordon, Ludlum and Hoey (2008), Korzekwa (2010), Mancuso, Desmarais, Parkinson, and Pettigrew (2010), and Pike (2006b), who also reported higher levels of participation by female students in their surveys.

     When examining survey response rates and nonresponse bias, researchers have found no relationship between the two (Groves, 2006).  Further, Mancuso et al. (2010) suggested that demographic variables have been shown to be less important in affecting survey results than psychological and cognitive factors.  Porter (2012) argued that low response rates alone were not sufficient evidence to indicate bias, and suggested that assessing nonresponse bias requires an examination of how respondent characteristics influenced both their choice to participate in the study and the way in which they responded to the survey items.  While the over-representation of female respondents in the survey samples might not indicate a substantial nonresponse bias problem on its own, when coupled with low response rates of just over 20%, the tentative conclusion for this study is that these results should be interpreted with caution.

     There were subtle, but important, measurement differences between the NSSE and the UES, and among the items within each of the surveys.  First, the program grouping variable was determined differently between the surveys:  the NSSE used an open-text field re-coded into 85 categories, while the UES was grouped based on administrative records.  To collect information regarding perceived general learning outcomes, students used a four-point scale to respond to a series of NSSE questions that asked:  "How much has your experience at this institution contributed to your knowledge, skills and personal development?"  On the UES, students used a six-point scale to answer the question:  "Please rate your ability in the following areas when you started at UBC and now."  Despite these measurement differences, there were substantial similarities across findings for these two surveys.  Where differences were found, the modifications in measurement approaches might help to explain these inconsistencies.

     Scales created from student ratings on each survey in this study showed low to moderate internal consistency, with Cronbach alpha coefficients ranging from 0.63 to 0.79 for the two NSSE samples, and moderate to high internal consistency, with alpha coefficients ranging from 0.77 to 0.90 for the two UES samples.  Gordon, Ludlum, and Hoey (2008) examined the NSSE scalelets (Pike, 2006a) and found alpha coefficients well below 0.70 in their study at Georgia Tech.  The alpha coefficients reported in their study for diversity, support of student success, and interpersonal environment were very similar to the results found in this study, as were the alpha coefficients reported by LaNasa, Cabrera and Trangrud (2009) for supportive campus environment at a public, research-intensive university in the Midwest.
Esquival (2011) also reported coefficient alphas that were less than 0.70 when examining the utility of Pike's scalelets at the University of Tennessee.  Similarly, Korzekwa (2010) examined responses to the NSSE scalelets for first-year students at the University of New Mexico and also found that the majority of the alpha coefficients were lower than 0.75.  The alpha coefficient values reported in this study for the UES were consistent with those reported by Chatman (2011), which ranged from 0.82 to 0.92.  Although 0.70 is typically the acceptable level for reliability estimates in educational research (Tavakol & Dennick, 2011), reliabilities of 0.60 have been found to be acceptable in survey research when the number of items used to construct a scale is fewer than 10 (Loewenthal, 1996); however, lower reliability estimates could be due to increased measurement error, which could hamper the correct interpretation of students' perceptions of their learning and the learning environment.  The results reported in this study suggest that the utility of the NSSE scalelets was somewhat limited.  In this study, reliability values of 0.60 were considered acceptable because the scales comprised fewer than 10 items; however, the UES scales had reliability estimates that were more widely acceptable in educational research than the NSSE scales (Tavakol & Dennick, 2011).  The higher reliability estimates for the UES are likely related to the slightly larger number of items used to describe each composite rating and the greater number of response options for each item (Gordon, Ludlum & Hoey, 2008).

     The low alpha coefficients reported for the NSSE samples in this study could also be connected to the wording of the NSSE items.  The three scales with the highest values were:  perceived general learning, overall satisfaction, and support for student success.  Each of these scales was composed of items using a referent-direct consensus approach, where students were asked to refer to their own experiences when responding, while the other two scales used a referent-shift approach, where students were asked to refer to students in general, or to the institution, rather than to their own personal experiences.  This might lead to the conclusion that items intended to be aggregated to a higher-level grouping should be worded using a referent-direct approach.  Klein, Conn, Smith and Sorra (2001) also highlighted the importance of item wording in their study of employee perceptions of the work environment; however, their findings differ from those of this study: they reported that items using the referent-shift approach increased both between-group variability and within-group agreement.  They concluded that item wording using a referent-shift approach might yield greater support for group-level aggregation.  Further research regarding the wording of survey items would need to be conducted to determine which approach would be best for program-level grouping within a university setting.

     The inter-item correlations were fairly low across all samples included in this study.  These results are consistent with other research studies.  Esquival (2011) found that very few inter-item correlation values regarding the NSSE scalelets were larger than 0.40.
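     As a point of reference for the alpha coefficients and inter-item correlations discussed in this section, the sketch below shows how Cronbach's alpha can be computed from an item-response matrix and how the standardized (Spearman–Brown-based) form depends only on the number of items and the average inter-item correlation.  The item matrix, function names, and values are hypothetical and purely illustrative.

    import numpy as np

    def cronbach_alpha(items):
        """Cronbach's alpha for a respondents-by-items score matrix."""
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        k = items.shape[1]
        return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

    def standardized_alpha(mean_r, k):
        """Spearman-Brown form: alpha from the average inter-item correlation and item count."""
        return (k * mean_r) / (1.0 + (k - 1) * mean_r)

    # Hypothetical 4-item scale answered by 200 respondents on a 1-6 response format
    rng = np.random.default_rng(7)
    latent = rng.normal(0.0, 1.0, 200)
    items = np.clip(np.round(3.5 + latent[:, None] + rng.normal(0.0, 1.2, (200, 4))), 1, 6)
    print(cronbach_alpha(items))
    print(standardized_alpha(0.30, 4))   # about 0.63 with 4 items and mean r = .30
    print(standardized_alpha(0.30, 8))   # about 0.77 when the item count doubles

The second and third calls illustrate why, holding the average inter-item correlation constant, scales built from more items (as with the UES composites) tend to show higher internal consistency.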
The correlation results of the UES provide slightly more information than the NSSE regarding how to improve student perceptions of their perceived general learning outcomes and their perceptions of their learning environment.  Overall satisfaction had the strongest relationship with perceived general learning outcomes, and overall satisfaction was strongly positively related to supportive campus environment in the UES samples; therefore, efforts to improve the supportive campus environment might influence student perceptions of satisfaction, which could then potentially increase student perceptions about their general learning outcomes.  Given this premise, the results from both surveys support the research that suggests the learning environment can be purposively changed in particular ways to enhance the learning experience and increase student academic success (Astin, 2005; Keeling et al., 2008; Kuh, 2009; Tinto, 2010).    Gordon, Cabrera and Hoey (2008) found a negative relationship between freshman GPA and student-reported institutional emphasis on helping students cope with non-academic responsibilities, which is consistent with the findings of this study.  They concluded that students who are dealing with academic or personal difficulties may be more likely to engage with academic advisors and student affairs professionals.  Results in this study suggest that the program-level variables were more related to academic achievement of fourth-year students than first-year students.  This is not surprising given that many first-year students included in this study have not experienced the university structures and learning environment as much as fourth-year students.     199   5.2.2  Program-Level Aggregation:  Do Student Survey Outcomes Add Up?   The overall results of this study showed that different interpretations could be drawn regarding the reliability and accuracy of a scale when using a single-level PCA compared with a two-level MFA approach.  The items used in this study contributed significantly to the student-level scales, but when these scales were examined using an MFA the findings were not consistent at the program level.  These findings were consistent with the research of D’Haenens, Van Damme, and Onghena (2008) who also found different outcomes in their study when constructing school process variables based on teacher perceptions by using an MFA model compared with a standard EFA model.  They found that some of the factors at the group level had a distinct composition compared with the teacher-level factors.  They concluded that an MFA approach should be used when dependency is present in the dataset.   The learning environment scales for both the NSSE and the UES did not perform as well as the perceived general learning scales did in terms of contributing similarly to the student-level and the program-level scales.  The failure of the learning environment models to meet acceptable MFA model-fit criteria across most samples in this study is troubling.  Of particular concern is that many researchers fail to examine the multilevel structure of their data using MFA levels of analysis prior to creating composite ratings and interpreting program-level results.  The MFA results for the NSSE and the UES indicated that items related to supportive campus environment and interpersonal environment did not contribute significantly to the program scale as they did for the student-level scale.  
Items regarding emphasis on diversity for the fourth-year NSSE sample did contribute significantly at both the student and program levels, as did items regarding diversity for the first-year UES sample.  The conclusion drawn from this study is that, for the NSSE and UES surveys, most of the learning environment variables identified in the two-level MFA models had no multilevel validity.

     The aggregation results for the first-year NSSE sample indicated that there was low correspondence among student responses and low levels of within-program agreement.  This limited degree of agreement could be related to student responses to the two items from the MFA that did not contribute significantly to the program-level scale:  speaking clearly and effectively, and thinking critically and analytically.  The MFA results provided item-level information regarding how items contributed at the student and program levels, while the aggregation procedures provided information regarding the variability in composite scores within and between program majors.  The WABA aggregation approach showed that, for fourth-year NSSE students, emphasis on diversity was the only scale for which the correlation with perceived student learning supported a "wholes" inference.  The ANOVA approach and the unconditional multilevel models for the NSSE and the UES did not support aggregation to the program level for any of the learning environment scales.  Although the aggregation results did not support drawing program-level inferences from any of the composite scores about the learning environment, the emphasis for diversity scale for the first-year UES sample showed high levels of correspondence and agreement within programs.

     Of the three aggregation procedures used in this study, the ANOVA approach was the most stringent for rejecting aggregation at the program level across the NSSE samples, but not for the UES samples.  Overall, the amount of variance in the variables explained by program membership (ICC(1) values) was low, except for the survey results for perceived general learning.  The ICC(1) values of 0.15 and 0.19 reported for UES first- and fourth-year students in this study were high.  Porter (2005) suggested that ICC values in higher education research tend to be 0.10 or lower.  The ICC(1) values for the NSSE were larger for perceived learning variables that assessed individual characteristics, and smaller for items referring to group characteristics, for both survey examples.  These results differ from Griffith (2002), who found the reverse in his study examining student survey results across schools.

     Results from the unconditional multilevel model were similar to the ANOVA approach, but supported aggregation at the program major level for the dependent variable for the fourth-year NSSE sample.  The WABA approach seemed to be more useful for determining alternative levels of aggregation, rather than examining a single level of aggregation at the program major.  In contrast to the ANOVA and unconditional multilevel models, the WABA approach provided additional information about aggregation, which suggests that there may be interdependence among some students based on program membership in these samples.
For this reason, the WABA approach was more useful for determining levels of aggregation in this study because it provided statistical support for aggregation to the program level, while also indicating that the practical significance would be best explained with lower-level groupings.  These conclusions regarding the flexibility of the WABA approach are consistent with the results reported by Castro (2002) and Griffith (2002).

     The differences in aggregation results between these two approaches could be due to the unequal sample sizes of the program-major groupings in this study.  The ANOVA analysis is based on estimating the grand mean across samples; however, with unequal sample sizes, the grand mean is difficult to calculate.  When sample sizes are extremely different across program groupings, the homogeneity of variance assumption could be violated, and ANOVA results could introduce error.  In their paper, LeBreton and Senter (2008) discussed the consequences of unequal sample sizes for the calculation of the rwg statistic, and for the interpretability of the results, because all of the formulas used in the ANOVA approach depend on the average sample size of the groups, or grand mean.  To avoid the concerns with unbalanced sample sizes across programs in this study, the harmonic mean was used rather than the arithmetic mean.  In comparison, the unconditional multilevel model can accommodate unbalanced groupings because it considers the nested structure of the data in the model and simultaneously models the student- and program-level variances.

     Schriesheim et al. (1995) found that the number of items in a measure seemed to affect the size and significance of the within-group rwg coefficient:  for measures with more items, the rwg coefficient was greater and was more likely to be significant.  Furthermore, Lindell and colleagues demonstrated that since the rwg index is based upon the Spearman–Brown formula, it increases as the number of items in the scale increases (Lindell & Brandt, 1999; Lindell, Brandt, & Whitney, 1999).  Thus, part of the reason why the scales for the UES might have higher within-group agreement values than the scales in the NSSE is that the UES scales were created with more items (Castro, 2002; Gordon, Ludlum & Hoey, 2008).

     The results of the aggregation procedures on the study samples highlighted the importance of determining the appropriate level of interpretation as part of the validation process prior to interpreting findings at an aggregate level.  In addition, the type of aggregation procedure used could lead to different results; thus researchers should use the procedure that will appropriately answer their research question.

5.2.3  Multilevel Regression Analyses of Perceived Student Learning

     The primary aim of the multilevel regression analyses conducted in this study was to examine how differences among program majors on perceived general learning outcomes varied by student- and program-level characteristics.  Results from the multilevel models demonstrated cross-level interactions in which relationships could be detected among program-level variables, student-level variables, and the dependent variable.

     The proportion of variance explained by the random-coefficient regression models in this study was fairly small, ranging from about 3% to 4%.  These small values, however, were consistent with the literature.
Pike (2006b) also found small values for academic units, ranging from 1% to 6% for the NSSE scalelets.  Porter (2006) suggested that, in the context of higher education, student behaviour and outcomes are difficult to change, and he argued that even small coefficient values can be substantively important given how difficult it is for independent variables to influence the dependent variable.  Porter (2006) also argued that measures of variance cannot provide information on whether these study variables actually affect students in general, and to what extent.  Instead, he suggested focusing on the results of the hypothesis tests and regression coefficients, which was the focus of the results in this study.

     The survey results from the random-coefficient regression model in this study indicated that current GPA was positively related to perceived learning outcomes for NSSE fourth-year students and both UES samples.  Similarly, using UES student survey results collected at the University of California, Douglass, Thomson and Zhao (2012) found that, for fourth-year students, higher GPAs were associated with higher reported levels of learning outcomes.  Umbach and Porter (2002) also found that students' satisfaction with their program majors was related to higher cumulative GPAs.

     The findings of the full intercepts- and slopes-as-outcomes models for the NSSE and UES suggested that there were some cross-level relationships in which program-level characteristics influenced student-level characteristics.  For the NSSE fourth-year sample, the proportion of international students had a positive relationship with average perceived general learning outcomes, and as the student-to-faculty ratio increased, the negative relationship with the learning outcomes of domestic students decreased.  In addition, as GPO per FTE increased, there was a negative relationship with current GPA.  In the first-year UES sample, there were program-level variables that were statistically significantly related to student-level variables; for example, across both year levels in the UES sample, the negative learning outcomes of domestic students were reduced as the proportion of male students in the program increased.

     When campus location was included in the full intercepts- and slopes-as-outcomes models in this study, it was not statistically related to perceived general learning, with the exception of the fourth-year NSSE sample, where students at the Okanagan campus rated their learning higher than Vancouver students.  The differences in NSSE results are likely related to the smaller size of the Okanagan campus.  These results correspond to those of Douglass, Thomson and Zhao (2012), who concluded that student self-ratings tended to be associated with the social composition of the campus.  They reported that there were initially statistically significant differences among student ratings of their learning outcomes across the University of California campuses on the UES; however, when they adjusted for ethnicity and discipline type, the significance of campus location was eliminated.

     The reliability of the regression coefficients in the full multilevel models was quite low in this study:  0.34 for the fourth-year NSSE sample, 0.51 for the first-year UES sample, and 0.47 for the fourth-year UES sample.
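     For context, the reliability reported for a program mean in a two-level model is commonly expressed (following Raudenbush & Bryk, 2002) as a function of the between-program intercept variance, the student-level residual variance, and the program sample size.  The sketch below uses the notation of a simple random-intercept case rather than the full models estimated here:

    \lambda_{j} = \frac{\tau_{00}}{\tau_{00} + \sigma^{2}/n_{j}}

where \tau_{00} is the variance of the program intercepts, \sigma^{2} is the student-level residual variance, n_{j} is the number of respondents in program j, and the summary reliability is the average of \lambda_{j} across programs.  With level-1 variances near 155 to 188 and intercept variances near 4 to 5 (Tables 4.20 and 4.21), a program needs a fairly large number of respondents before its mean is estimated precisely, which is roughly consistent with the low reliability values reported above.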
The reliability coefficients depend on two aspects:  (1) the degree to which the true underlying parameters vary among groups, and (2) the accuracy of the estimate of each group's regression equation (Raudenbush & Bryk, 2002).  A possible reason for the low reliability values of the random coefficients in this study is the small number of students within some programs.  In each of the study samples, programs with fewer than 20 respondents comprised about 36% to 46% of the program groups.  In addition, the graphical displays of the EB point estimates illustrate the variability within and among programs in perceived general learning outcomes.

5.3  Limitations of the Study

     Given the number of measurement decisions made at each phase of analysis in this study, there were many alternate approaches that, if used, may have changed the results of this study.  Thus, the methodological decisions made in this study could be considered limitations with respect to sample sizes, the procedure used for calculating composite scores, and the creation of the program major grouping variable.  Each of these limitations is described below with a brief discussion of its potential impact on the study.

     A limitation of this study was related to the sample size requirements for conducting the multilevel models:  the WABA, the unconditional multilevel model, the random-coefficient multilevel regression model, and the intercepts- and slopes-as-outcomes multilevel model.  As described in Chapter Three, multilevel procedures are based on the assumption that there are many higher-level units in the analysis.  Sample sizes of 30 units with 30 individuals per unit work with acceptable loss of information in a multilevel model (Kreft & De Leeuw, 1998), and practically, 20 units are considered acceptable (Centre for Multilevel Modelling, 2011).  The sample sizes used in this study met these requirements, with 59 first-year and 49 fourth-year programs for the NSSE example and 22 first-year and 28 fourth-year programs for the UES example, each with a minimum of 5 students per program.  The results of the multilevel analyses indicated that the program means for these samples were not highly reliable, which could be due to the small within-group sample sizes.  The program means might have been more reliable, and results regarding the influence of program majors could have differed, if more program-level units, and more students within programs, had been included in the analyses.  To address the concerns about small sample sizes, this study considered the shrunken estimates for the program means, using EB residual estimates to examine the differences among program means in the multilevel models; however, future research could examine these relationships using a full Bayesian approach that can appropriately model smaller numbers of units for multilevel modelling (Gelman, 2006) and compare those results to the findings from this study.

     Another limitation of this study was the choice to use an unweighted summed procedure based on PCA to create composite scores across the four samples.  This approach assumed that all items contributed similarly to the latent construct, but based on the results from the PCA and the MFA, items did not contribute equally to the scale.
The summed score approach was used in this study because it is the most commonly used approach by researchers examining NSSE results, which they suggest is because it can be easily replicated by the institutional users of the NSSE (Pike, 2006a).  In addition, the summed approach was used because it maintains the  208   variability of the original scores and is able to be compared across samples.  The weighted summed approach might provide more stable results because it considers the variability in how items contribute to the overall scale, thus providing more weight to items that contribute more and less weight to items that contribute less. When the value of the factor loadings is similar across items the overall results won’t change too much, but when the results differ considerably across items the results could differ substantially.          Identifying appropriate higher-level groupings was also a limitation in this study.  When examining the differences among groups, it is best to use a natural grouping variable rather than forcing an unnatural grouping of individuals.  As previously mentioned, the NSSE survey asked students to identify their program major, or intended program major, using an open text field.  Although this process of collecting program major could be considered a natural grouping, based on how the student views their program membership, the results of this study for the NSSE example indicated that there was still too much variability in responses from all but one scale for inferences to be drawn at the program level.  In contrast, the UES does not include a question regarding program majors, and instead, the program major is identified for each student using administrative records.  The procedure used for the UES could be problematic for first-year students because they have not yet identified a specialization; thus the grouping variable for many of them was at the highest program level, such as Bachelor of Arts, rather than Bachelor of Arts majoring in English.  There was too much variability among the program means for most of the  209   survey scales across the NSSE and the UES samples regarding the learning environment, which might have been due to poor program groupings.  Although the NSSE staff members encourage institutional NSSE users to compare across program majors (NSSE, 2011), the results of this study did not support the use of the NSSE program major field as a grouping variable for many of the study variables.   5.4  Implications for Institutional Effectiveness Research      Despite these limitations, the results of this study led to five general areas where this study contributes to the research.  First, even though the most common institutional use of these surveys has been for program-level analyses, this is the first study to examine the multilevel validity of program level inferences made from the NSSE and UES results.  Second, by comparing three different approaches to aggregation, the results identified WABA as the most appropriate method to examine levels of aggregation when aggregation at different levels might be of interest.  Third, this study demonstrated how inferences made at the program level based on aggregate student survey outcomes are best examined using a multilevel perspective.  Fourth, the results of this study reveal that the appropriateness of aggregation could be variable dependent and related to item phrasing and item design.  
Finally, the results of this study underscore the importance of statistical procedures used to examine the multilevel validity of program-level inferences, and how multilevel models provide additional insight that would be overlooked with single-level models.  Each of these contributions is discussed in more detail below.  210       First, the use of the program major field item on the NSSE to create groupings within the university is encouraged by NSSE staff, which they claim can be used to examine educational patterns across program majors within the university and among equivalent program majors across comparable universities (NSSE, 2011).  Some researchers have argued that program-level analysis is more appropriate than institutional-level analysis for these types of surveys (Chatman, 2009; Nelson Laird et al., 2005).  Studies examining program-level survey outcomes lack conceptual and empirical rationales for aggregating data obtained from students to higher program levels.  In considering the multilevel validity of inferences drawn from aggregate survey ratings, the results of this study suggest that program-level analyses were not appropriate regarding ratings of the learning environment across both surveys and all four samples.  In addition, the results from the MFA analyses indicated that for the learning environment scales the items contributed differently to the student level compared with the program level, which indicates that the meaning differs across levels of analysis and impedes meaningful interpretation of these data at the program level.  Further, although average program ratings regarding perceived general learning differed on survey variables considered in this study, the between-program differences were quite small, which indicates that program membership did not strongly influence student responses.  Therefore, this study contributes to measurement theory regarding the summarization and application of aggregate student survey results for making program-level claims and comparisons, and the importance of multilevel validity in institutional effectiveness research.    211        Second, results from this study imply that using student perceptions about their general learning outcomes as a measure of educational quality might be represented as program-level occurrences for some year levels, as evidenced by the aggregation analyses; however, the WABA procedure identified that subgroup aggregation would likely be more practical for all samples in this study.  In other ways, however, the unconditional multilevel model differed from the ANOVA and WABA findings.  While the program-level aggregation ANOVA results based on perceived general learning outcomes were comparable to the results from the unconditional multilevel model, the WABA approach had the added benefit of indicating whether differences in the variables of interest would be more appropriately examined, not as individual scores relative to their respective program means, but as individual scores relative to smaller groups within program majors.  Specifically, the WABA results showed that the covariance between student perceptions about their general learning outcomes and variables measuring their perceptions regarding the learning environment occurred at levels lower than the program-level grouping, but could be aggregated at the “parts” level.  
The multilevel nature of WABA provided more information regarding variation within program groupings and indicated that aggregation at a different, lower level would be appropriate for these samples.

     Third, the results indicate that the appropriate level of analysis was variable dependent (Griffith, 2002).  That is, the results of statistical aggregation were associated with the variables assessed across the two different surveys.  Assessments of the learning environment scales showed lower levels of within-program agreement than did ratings of perceived general learning outcomes.  In addition, questions on the NSSE and the UES differed somewhat, but essentially asked students to rate the quality of the institution regarding the learning environment.  The approach taken in designing these questions is based on a referent-shift consensus model because students were asked to make references to UBC when providing their responses.  In contrast, items designed using a referent-direct approach, where students were asked to make references to their own individual-level learning and experiences within the context of UBC, performed better in the MFA analyses and the aggregation analyses than referent-shift questions.  These results might suggest that if student survey results are intended to be interpreted at the program level, questions should be designed using a referent-direct approach rather than a referent-shift approach.  An alternative would be to consider questions relating to the program or course level rather than questions referencing the institutional level, and to determine whether these types of questions result in larger within-group agreement levels.

     Fourth, results from the multilevel analyses have implications for higher educational policy, practice, and research.  Knowing whether higher educational quality and effectiveness, and associated phenomena, are program-based, and at what level they occur, can inform decisions about educational policies and practices.  There are few studies of the quality of higher education that appropriately link quality and effectiveness indicators to the educational processes underlying differences in educational outcomes between students, programs, and the learning environment within a single institution.  The results in this study suggest that program-level differences in student perceptions about their general learning outcomes were, in part, determined by how students rated their learning environment as well as by program-level variables.  The approach used in this study provided an example of Type B comparisons in institutional effectiveness research (Raudenbush & Willms, 1995), which examined the effects of institutional characteristics along with student background characteristics.  Compared to the NSSE, the UES provided more information for understanding learning outcomes in higher education and their relationship to the educational environment, which is consistent with other research (Douglass et al., 2012).  The findings from this study suggest that the UES was a better measure for reporting on perceived general learning outcomes at the program level than the NSSE.  Multilevel analyses can inform researchers about institutional aspects of educational quality and effectiveness, in particular about aspects of educational quality regarding sub-group differentiation.
A final implication of this study for institutional effectiveness research in higher education is determining the relevance of the multilevel techniques described and used in this study.  Hubley and Zumbo (2006) have asserted that issues of data aggregation and analyses relate to the consequences of assessment use; particularly when the intent is to make program-level claims and compare across programs.  Therefore, the process of multilevel validation is intended to provide evidence that supports interpretations drawn from student-level data aggregated to the program level.  Zumbo and Forer (2011) argued that more educational studies should consider the empirical appropriateness of aggregation before drawing conclusions based on  214   group comparisons.  Borden and Young (2008) also indicated that examining levels of analysis was lacking in higher education research that makes claims regarding the contributions institutions add to student learning outcomes.  They argued that evidence in support of aggregation should demonstrate that the response of any one student group member should reflect those of other group members.  This study conducted a two-level MFA on all of the scales to determine the dimensionality of the scales at the student and program levels.  Results from these analyses provided insight as to how the items contributed differently across the levels of analysis, particularly with the learning environment scales, and corresponded with the results from the aggregation procedures.  Interpretations of aggregate program-level scores were supported only for the perceived general learning scales for the fourth-year NSSE sample and the first- and fourth-year UES samples.  It was equally important to demonstrate that the average responses of program-major members were reliable, and in this study the reliabilities of the program means were not strong.  Therefore, this study provides an example of the type of empirical evidence that is required to support the multilevel validity of interpretations made at the program level, and demonstrates how important this evidence is in supporting claims and comparisons made at this level.  Program-level analyses are commonly applied to aggregate survey data without considering the multilevel nature of these results and determining the multilevel validity of the program-level inferences.  Despite this recurrent practice, the results of this study clearly illustrate the consequence of ignoring or assuming a multilevel structure of these survey results without first empirically examining these assumptions prior to drawing conclusions at the aggregate level.    215   5.5  Recommendations for Further Research      As a result of the limitations of this study and the implications for institutional effectiveness research and educational policy, there are five recommendations for further research that are addressed in this section: (1) further examine the WABA results; (2) consider different grouping techniques; (3) use different procedures for creating composite scores; (4) consider rewording the items on the surveys and determine if different item-wording improves the success of aggregating results to the program level; and (5) apply a full Bayesian multilevel model to examine the multilevel relationships.  These are discussed below.      First, a future direction for this research would be to examine more closely the results of the WABA analysis to determine relationships at various levels of aggregation in these samples.  
The results of this study suggested that aggregation was statistically supported, but that, practically, the results would be better aggregated to a level lower than the program level.  One approach to identifying this alternative grouping would be to use a dispersion model to analyse the within-group variability by program major.  This study determined within-group agreement levels, whereas the dispersion approach examines the amount of within-group variability as an attribute of the group itself (Klein, Conn, Smith & Sorra, 2001).  Student age, gender, race, culture, educational background, and other characteristics could then be used to examine the correlates of variability in program members' perceptions of their general learning outcomes and the learning environment.

     Second, an alternative approach to determining the program-level groupings might yield different results than the grouping variable based on the NSSE open-text field, or the administrative records used for the UES example.  Chatman (2007) used a multistep approach to determine the academic major clusters for the UES results at the University of California.  He first used students' academic records to assign them to one of 19 program majors, and then computed factor mean scores for each program that had 100 or more respondents.  He then conducted a cluster analysis on the mean scores using an agglomerative hierarchical clustering approach based on centroid distance.  This process resulted in seven academic major clusters, which were then used in subsequent analyses.  The resulting empirically derived academic structure presented an interesting and unexpected mix of programs: the area, ethnic, cultural, and gender studies cluster was distinguished from the other program majors.  A recommendation for further research would be to examine the multilevel validity of program-level inferences from the NSSE and the UES at UBC using an empirically derived approach to program grouping such as this (a brief sketch of the clustering step appears below).

     Third, this study used the summed, unweighted approach to create composite scores, which is similar to the procedure recommended by Pike (2006a) for institutional NSSE users creating scalelets.  Alternative procedures for creating composite scores, using a weighted summed approach or more refined methods such as regression scores, were briefly described in Chapter Three.  A future study could create composite scores using these different procedures to determine whether the results differ depending on the procedure used.

     Fourth, another recommendation for further research would be to revise the items on the UES and the NSSE to include referent-direct questions regarding the learning environment.  A simple experiment could randomly assign these referent-direct questions to one group of students and the original referent-shift questions to another group, to determine whether differences exist in responses and aggregation.  The results of this study suggest that referent-direct questions can be aggregated more appropriately than referent-shift questions.  In addition, for items where the intended interpretation is at the program level, the questions should reference the program rather than the institution, to determine whether this referent is more appropriate for program-level aggregation.  Further research examining the wording of these questions could provide more insight into designing survey questions that are asked of individuals but are intended to be aggregated and interpreted at a higher-level grouping.
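As a rough illustration of the empirically derived grouping approach described above, the following Python sketch computes program-level mean scale scores for programs with at least 100 respondents and then applies agglomerative hierarchical clustering with centroid linkage, following the general logic of Chatman's (2007) procedure.  The synthetic data, column names, and the choice of seven clusters are placeholders only; this is not the code or data used in this study.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for a long-format student file (one row per respondent);
# in practice this would be the survey extract with real program codes and scores.
rng = np.random.default_rng(42)
n = 3000
students = pd.DataFrame({
    "program_major": rng.choice([f"major_{i:02d}" for i in range(19)], size=n),
    "general_learning": rng.normal(4.5, 0.8, size=n),
    "supportive_environment": rng.normal(4.3, 0.9, size=n),
})
scales = ["general_learning", "supportive_environment"]

# Step 1: keep program majors with at least 100 respondents, as in Chatman (2007)
counts = students.groupby("program_major").size()
eligible = counts[counts >= 100].index
subset = students[students["program_major"].isin(eligible)]

# Step 2: program-level mean scale scores (Chatman used factor mean scores)
program_means = subset.groupby("program_major")[scales].mean()

# Step 3: agglomerative hierarchical clustering with centroid linkage (Euclidean)
Z = linkage(program_means.values, method="centroid")

# Step 4: cut the tree into a chosen number of clusters; Chatman obtained seven,
# but in practice the number would be guided by the dendrogram.
labels = fcluster(Z, t=7, criterion="maxclust")
clusters = pd.Series(labels, index=program_means.index, name="cluster")
print(clusters.sort_values())
```

In practice the number of clusters would be chosen by inspecting the dendrogram or cluster fit criteria rather than being fixed in advance, and the resulting cluster memberships would then replace program major as the grouping variable in the aggregation and multilevel analyses.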
     Finally, because of the smaller sample sizes included in this study, future research could examine the relationships among student- and program-level characteristics using a full Bayesian approach, which can appropriately model smaller numbers of units in multilevel modelling (Gelman, 2006).  A full Bayesian approach does not require a large data set, but it does require information about the model before the data are used, which is referred to as a priori (prior) information (Raudenbush & Bryk, 2002).  The full Bayesian approach provides a framework for combining this prior information with the information contained in the data to arrive at a more refined statistical distribution, referred to as the posterior distribution.  Because the estimates draw on both the prior information and the data, the resulting posterior distribution yields more tightly constrained model estimates than procedures such as ordinary least squares estimation.  Rather than testing a hypothesis, the full Bayesian approach quantifies the amount of uncertainty in the model.  The full Bayesian approach might therefore provide additional insight into the relationships among the variables considered in this study.

Bibliography

Adam, S.  (2008).  Learning Outcomes Current Developments In Europe: Update On The Issues And Applications Of Learning Outcomes Associated With The Bologna Process. Paper presented at the Bologna Seminar: Learning outcomes based higher education: the Scottish experience 21 - 22 February 2008, at Heriot-Watt University, Edinburgh, Scotland. Aitkin, M. & Longford, N. (1986). Statistical Modelling Issues in School Effectiveness Studies.  Journal of the Royal Statistical Society Series A (Statistics in Society), 149, 1, 1-43. Altbach, P. G. (2012).  The Globalization of College and University Rankings. Change: The magazine of higher learning. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (AERA, APA, & NCME: 1999). Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association. Amsler, S. & Bolsmann, C. (2012). University Ranking as Social Exclusion.  British Journal of Sociology of Education, 33, 2. Andres, L.  (2012).  Designing and doing survey research.  London:  Sage. Angelo, T.  (2002).  Fostering critical thinking in our courses: Practical, research-based strategies to improve learning.  The Pennsylvania State University Teaching and Learning Colloquy VIII.  May 8, 2002.  Artushina, I., & Troyan, V. (2005). Methods of the Quality of Higher Education Social Assessment.  Higher Education in Europe, 32, 1, 83-89. Arum, R., Roska, J., & Cho, E. (2011). Improving Undergraduate Learning: Findings and Policy Recommendations from the SSRC-CLA Longitudinal Project. Retrieved August 12, 2012, from http://www.ssrc.org/publications/view/D06178BE-3823-E011-ADEF-001CC477EC84/. Association of Universities and Colleges of Canada [AUCC] (2011). Principles of institutional quality assurance in Canadian higher education.  Retrieved from http://www.aucc.ca/canadian-universities/quality-assurance/principles. Astin, A. W. (1999). Student Involvement: A Developmental Theory for Higher Education. Journal of College Student Development, 40, 5, 518-29. Astin, A. W. (2005). Making sense out of degree completion rates. Journal of College Student Retention, 7, 5-17. Astin, A. W., and Denson, N.  (2009).
Multi-Campus Studies of College Impact:  Which Statistical Method is Appropriate?  Research in Higher Education, 50, 354-367. Austin, G. R. (1979).  Exemplary Schools and the Search for Effectiveness.  Educational Leadership, 37, 10-14. Bak, O. (2012). Universities: can they be considered as learning organizations?: A preliminary micro-level perspective. The Learning Organization, 19, 2, 163-172. Baker, R.L., (2002).  Evaluating quality and effectiveness: regional accreditation principles and practices.  The Journal of Academic Librarianship, 28, 1, 3–7. Banta, T. (2006). Reliving the History of Large-Scale Assessment in Higher Education. Assessment Update, 18, 4, 1-13.  221   Banta, T. (2007, January 26). A Warning on Measuring Learning Outcomes. Inside Higher Education, Retrieved March 29, 2011, from http://www.insidehighered.com/views/2007/01/26/banta. Barr, R. B. & Tagg, J. (1995). From Teaching to Learning: A new paradigm for undergraduate education.  Change: The magazine of higher learning, 27, 5, 12-25. Bernhard, A. (2009). A Knowledge-Based Society Needs Quality in Higher Education.  Problems of Education in the 21st Century, 12, 15-21. Bliese, P. D.  (2000).  Within-group agreement, non-independence, and reliability:  Implications for data aggregation and analysis.  In K. J. Klein & J. S. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations (pp. 349-381).  San Francisco:  Jossey-Bass. Bok, D. (2003). Universities in the Marketplace: The Commercialization of Higher Education.  Princeton, NJ: Princeton University Press. Borden, V. M. H. (2008). Assessing and Accounting for Quality: Living Between a Rock and a Hard Place.  Paper presented at the Educational Testing Services [ETS] and Association for Institutional Research [AIR] Symposium, Princeton, NJ. Borden, V. M. H. and Young, J. W. (2008).  Measurement Validity and Accountability for Student Learning.  New Directions for Institutional Research: Assessment, Supplement, 2007, 19-37. Boreham, N. & Morgan, C. (2004).  A Sociocultural Analysis of Organizational Learning. Oxford Review of Education, 30, 3, 307-325.  222   Bowman, N. A. (2011).  Validity of College Self-Reported Gains at Diverse Institutions.  Educational Researcher, 40, 1, 22-24. Bowman, N. A. & Hill, P. L.  (2011).  Measuring How College Affects Students:  Social Desirability and Other Potential Biases in College Student Self-Reported Gains.  New Directions for Institutional Research, 150, 73-85. Bratianu, C. (2007). The Learning Paradox and the University.  Journal of Applied Quantitative Methods, 2, 4, 375- 386. Bridges, D. (2009).  Research Quality Assessment in Education: Impossible Science, Possible Art?  British Educational Research Journal, 35, 4, 497-517. Brint, S. (2005). Creating the Future: ‘New Directions’ in American Research Universities.  Minerva, 43, 23–50. Brint, S. & Cantwell, A. M. (2011).  Academic Disciplines and the Undergraduate Experience: Rethinking Bok’s “Underachieving Colleges” Thesis.  University of California, Berkeley: Research & Occasional Paper Series: CSHE.6.11 (retrieved from http://cshe.berkeley.edu) Brown, G. H. (2011). The academy and the market place.  Perspectives: Policy and Practice in Higher Education, 15, 3, 77-78. Brown, T. A.  (2006).  Confirmatory factor analysis for applied research.  New York: Guilford. Burke, J., &Minassians, H., (2004). Implications of State Performance Indicators for Community College Assessment.  New Directions for Community Colleges, 126, 53-64.  
223   Burke, J., Minassians, H., & Yang, P. (2002). State Performance Reporting Indicators: What Do They Indicate? Planning for Higher Education, 15-29. Campbell, C. M. and Cabrera, A. F. (2011). How Sound is NSSE? Investigating the Psychometric Properties of NSSE at a Public, Research-Extensive Institution. The Review of Higher Education, 35, 1, 77-103. Canadian Council on Learning (2009). Up to Par: The Challenge of Demonstrating Quality in Canadian Post Secondary Education. Challenges in Canadian Post-Secondary Education, Ottawa: 35 pages. Carey, K. (2010). Student Learning: Measure or Perish. Chronicle of Higher Education, 57, 17. Carini, R. M., Kuh, G. D., & Klein, S. P.  (2006).  Student engagement and student learning:  Testing the linkages.  Research in Higher Education, 47, 1-32. Carmichael, R., Palermo, J., Reeve, L., & Vallence, K. (2001). Student Learning: 'the heart of quality' in education and training. Assessment & Evaluation in Higher Education, 26, 5, 449-463. Carpenter-Hubin, J., & Hornsby, E. (2005). Making Measurement Meaningful. AIR Professional File, 97, 1-4. Carver, R. P.  (1975).  The Coleman Report:  Using Inappropriately Designed Achievement Tests. American Educational Research Journal, 12, 77-86. Castro, S.L.  (2002).  Data analytic methods for the analysis of multilevel questions: A comparison of intraclass correlation coefficients rwg, hierarchical linear modeling, within-and between-analysis, and random group resampling.  Leadership Quarterly, 13, 69-93.  224   Centre for Multilevel Modelling (2011).  Data Structures.  Retrieved March 4, 2011, from www.bristol.ac.uk/cmm/learning/multilevel-models/data-structures.html. Change, R.Y. & Morgan, M.W. (2000). Performance Scorecards: Measuring the Right Things in the Real World.  Jossey-Bass. Charles, C. M.  Introduction to Educational Research: Second Edition.  New York, N. Y.: Longman Publishers.  Chatman, S. (2007). Institutional versus academic discipline measures of student experience: A Matter of Relative Validity.  Berkeley: Centre for Studies in Higher Education (CSHE.8.07), University of California. Chatman, S.  (2009).  Factor Structure and Reliability of the 2008 and 2009 SERU/UCUES Questionnaire Core.  Center for Studies in Higher Education.  Berkeley, CA. Chatman, S.  (2011).  Factor Structure and Reliability of the 2011 SERU/UCUES Questionnaire Core.  Center for Studies in Higher Education.  Berkeley, CA. Cheng, Y., & Liu, N. (2007). Academic Ranking of World Universities by Broad Subject Fields. Higher Education in Europe, 32, 1, 17-29. Clarke, M. (2005). Quality Assessment Lessons from Australia and New Zealand. Higher Education in Europe, 30, 2, 183-197. Clarke, M. (2007). The Impact of Higher Education Rankings on Student Access, Choice, and Opportunity.  Higher Education in Europe, 32, 1, 59-70. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.  225   Coleman, J. S., Campbell, E. Q., et al. (1966).  Equality of Educational Opportunity.  Washington, DC:  US Government Printing Office. Comrey, A. L., & Lee, H. B.  (1992).  A first course in factor analysis (2nd ed.).  Hillside, NJ: Erlbaum. Conlon, M. (2004). Peformance Indicators: Accountable to Whom? Higher Education Management and Policy, 16, 1, 41-48. Considine, M. (2006). Theorizing the University as a Cultural System: Distinctions, Identities, Emergencies. Educational Theory, 56, 3, 255-270. Conway, C., & Zhao, H.  (2012).  The NSSE National Data Project:  Phase Two Report.  
Higher Education Quality Council of Ontario. Toronto, Ontario. Cook, S. D. N. & Brown, J. S. (1999). Bridging Epistemologies: The Generative Dance between Organizational Knowledge and Organizational Knowing. Organization Science, 10, 4, 381-400. Council for Aid to Education [CAE]. (2006).  Collegiate Learning Assessment Project. Retrieved May 9, 2006, from http://www.cae.org/content/pro_collegiate.htm. D’Haenens, E., Van Damme, J., and Onghena, P.  (March 2008).  Multilevel Exploratory Factor Analysis:  Evaluating its Surplus Value in Educational Effectiveness Research.  Paper presented at the Annual Meeting of the American Educational Research Association, New York City, NY.  Dansereau, F., Cho, J., & Yammarino, F. J.  (2006).  Avoiding the “Fallacy of the Wrong Level.”  Group and Organization Management, 31, 536-577.  226   Davison, M., Kwak, N., Seok Seo, Y., & Choi, J.  (July 2002).  Using Hierarchical Linear Models to Examine Moderator Effects:  Person-by-Organization Interactions.  Organizational Research Methods, 5, 3, 231-254. Deiaco, E., Hughes, A. & McKelvey, M.  (2012).  Universities as Strategic Actors in the Knowledge Economy.  Cambridge Journal of Economics, 36, 525-541. Deiaco, E., Hughes, A. & McKelvey, M.  (2012).  Universities as Strategic Actors in the Knowledge Economy.  Cambridge Journal of Economics, 36, 525-541. DiStefano, C., Zhu, M., & Mîndrilǎ, D.  (2009).  Understanding and Using Factor Scores:  Considerations for the Applied Researcher.  Practical Assessment, Research & Evaluation, 14, 20, 1-11. Douglass, J. A., Thomson, & G., Zhao, C-M. (2012). The learning outcomes race: the value of self-reported gains in large research universities.  Higher Education Research, 63, 2, 1-19. DOI 10.1007/s10734-011-9496-x Douglass, J. A., Thomson, & G., Zhao, C-M. (2012). The learning outcomes race: the value of self-reported gains in large research universities.  Higher Education Research, 63, 2, 1-19. DOI 10.1007/s10734-011-9496-x Dowd, A. C., Sawatzky, M. & Korn, R.(2011). Theoretical foundations and a research agenda to validate measures of intercultural effort.  The Review of Higher Education, 35, 1, 17-44.  Dowsett Johnston, A. (2001). Choosing the Right School: An Insider's Guide. Maclean's Exclusive University Rankings: Special 2001 Edition, 22-28. Dwyer, C. A., Millett, C. M., & Payne, D. G. (2006). A Culture of Evidence: Postsecondary Assessment and Learning Outcomes. Educational Testing Services  227   [ETS].  Retrieved June 22, 2006, from http://www.ets.org/Media/Resources_For/Policy_Makers/pdf/cultureofevidence.pdf. Dyera, N. G., Hanges, P. J., & Hall R. J.  (2005).  Applying multilevel confirmatory factor analysis techniques to the study of leadership.  The Leadership Quarterly, 16, 149-167. Earl, L. and Fullan, M. (2003). Using Data in Leadership for Learning. Cambridge Journal of Education, 33, 3, 384-393. Easterby-Smith, M. & Araujo, L. (1999). Organizational learning: current debates and opportunities, in Easterby-Smith, M., Burgoyne J. & Araujo L. (Eds), Organizational Learning and the Learning Organization: Developments in Theory and Practice, Sage, London, pp. 1-21. Edmonds, R. (1979).  Programs of School Improvement:  An Overview.  Educational Leadership, 37, 20-24. Entwistle, N. J. & Peterson, E. R. (2004).  Conceptions of learning and knowledge in higher education:  Relationships with study behaviour and influences of learning environments.  The International Journal of Educational Research, 41, 407-428. Ercikan, K., & Barclay McKeown, S.  (2007).  
Design and Development Issues in Large-Scale Assessments:  Designing Assessments to Provide Useful Information to Guide Policy and Practice.  Canadian Journal of Program Evaluation, 22, 53-71.  228   Erisman, W.  (2009).  Measuring Student Learning as an Indicator of Institutional Effectiveness:  Practices, Challenges, and Possibilities.  Texas Higher Education Policy Institute.  Austin, TX. Esquivel, S. L.  (2011).  The Factorial Validity of the National Survey of Student Engagement (Doctoral thesis).  The University of Tennesee, Knoxville. Ewell, P. T.  (2001).  Accreditation and Student Learning Outcomes:  A Proposed Point of Departure.  Council for Higher Education Accreditation Occasional Paper.  Washington, DC. Ewell, P. T. (2008). Assessment and accountability in America today: Background and context. In V. M. H. Borden & G. Pike (Eds), Assessing and accounting for student learning: Beyond the Spellings Commission.  New Directions for Institutional Research, Assessment Supplement 2007, pp. 7–18. San Francisco, CA: Jossey-Bass. Ewell, P. T. (2009).  Assessment, Accountability, and Improvement:  Revisiting the Tension.  National Institute for Learning Outcomes Assessment. Federkeil, G. (2008). Rankings and Quality Assurance in Higher Education.  Higher Education in Europe, 33, 2, 219-231. Feldman, J. & Tung, R. (2001).  Whole School Reform: How Schools Use the Data-Based Inquiry and Decision Making Process.  Paper Presented at the 82nd Annual Meeting of the American Educational Research Association, Seattle, WA. Forer, B. & Zumbo, B. D. (2011). Validation of Multilevel Constructs: Validation Methods and Empirical Findings for the EDI.  Social Indicators Research, 103, 231-265.  229   Friedman, T. (2005, April 3). It's a Flat World, After All. New York Times Magazine, Retrieved May 9, 2010, from http://www.nytimes.com/2005/04/03/magazine/03DOMINANCE.html?pagewanted=1&ei=5090&en=cc2a003cd936d374&ex=1270267200. Gallavara, G., Hreinsson, E., Kajaste, M., Lindesjöö, E., Sølvhjelm, C., Sørskår, A. K., & Sedigh Zadeh, M.  (2008).  Learning Outcomes:  Common framework – different approaches to evaluation learning outcomes in the Nordic countries.  European Association for Quality Assurance in Higher Education.  Helsinki, Finland. Gelman, A.  (2006).  Prior Distributions for Variance Parameters in Hierarchical Models.  Bayesian Analysis, 1, 3, 515-533. Gipps, C. (1994). Beyond Testing: Towards A Theory of Educational Assessment. London, UK: Falmer Press. Gipps, C. (1999). Socio-Cultural Aspects of Assessment.  Review of Research in Education, 24, 355-392. Gibbs, G. (2010).  Dimensions of Quality.  York,  UK: The Higher Education Academy. Goff, M., & Ackerman, P L. (1992).  Personality-intelligence relations: Assessment of typical intellectual engagement. Journal of Educational Psychology, 84, 537-553. Goldstein, H. (1997). Methods in School Effectiveness Research.  School Effectiveness and School Improvement, 8, 369-395.   230   Goldstein, H. & Speigelhalter, D. J. (1996).  League Tables and their Limitations:  Statistical Issues in Comparisons of Institutional Performance.  Journal of the Royal Statistical Society, Series A (Statistics in Society), 159, 1, 149-163. Goldstein, H. & Woodhouse, G. (2000).  School Effectiveness Research and Educational Policy.  Oxford Review of Education, 26, 3&4, 353-363. Gonyea, R. M.  (2005).  Self-reported Data in Institutional Research:  Review and Recommendations.  New Directions for Institutional Research, 127, 73-89. Gordon, J., Ludlum, J., & Hoey, J. J. 
(2008). Validating NSSE Against Student Outcomes: Are They Related?  Research in Higher Education, 49, 1, 19–39. Greenland, S.  (2000).  Principles of Multilevel Modelling.  International Epidemiological Association, 29, 158-167. Griffith, J.  (2002).  Is Quality/Effectiveness An Empirically Demonstrable School Attribute?  Statistical Aids for Determining Appropriate Levels of Analysis.  School Effectiveness and School Improvement, 13, 1, 91-122. Groves, R. M. (2006).  Nonresponse rates and nonresponse bias in household surveys.  Public Opinion Quarterly, 70, 646–675. Guarino, C., Ridgeway, G., Chun, M., & Buddin, R. (2005). Latent Variable Analysis: A New Approach to University Ranking.  Higher Education in Europe, 30, 2, 147-165. Hafner, A. L., & Shim, N. (2010).  Evaluating Value Added in Learning Outcomes in Higher Education Using the Collegiate Learning Assessment. Los Angeles, CA: California State University.  231   Hair, J.F. Jr., Anderson, R.E., Tatham, R.L., & Black, W.C.  (1998).  Multivariate Data Analysis, (5th Edition).  Upper Saddle River, NJ: Prentice Hall. Hardison, C. M., & Vilamovska, A. (2009). The Collegiate Learning Assessment: Setting standards for performance at a college or university. Santa Monica, CA: RAND Education. Harvey, L. (2002). Evaluation for What? Teaching in Higher Education, 7, 3, 245-263. Harvey, L. (2008). Rankings of Higher Education Institutions: A Critical Review. Quality in Higher Education, 14, 3, 187-207. Hattie, J. & Timperley, H. (2007). The Power of Feedback. Review of Educational Research, 77, 1, 81-112. Heck, R. & Thomas, S. (2000). An Introduction to Multilevel Modeling Techniques. Mahwah, NJ: Lawrence Erlbaum Associates, Inc. Hofmann, D. & Gavin, M.  (1998).  Centering Decisions in Hierarchical Linear Models:  Implications for Research in Organizations.  Journal of Management, 24, 5, 623-641. Hooker, M. (1997). The Transformation of Higher Education. In Oblinger, D. and Rush, S.C. (Eds.) The Learning Revolution. Bolton, MA: Anker Publishing Company, Inc.  Hopkins, D., Reynolds, D., & Gray, J.  (1999).  Moving and Moving up:  Confronting the Complexities of School Improvement in the Improving Schools Project.  Educational Research and Evaluation, 5, 1, 22-40.  232   Houston, D. (2008). Rethinking Quality and Improvement in Education. Quality Assurance in Education, 16, 1, 61-79. Howard, G. S. (1980).  Response-Shift Bias A Problem in Evaluating Interventions with Pre/Post Self-Reports. Evaluation Review, 4, 1, 93-106. Hox, J. J. (1998).  Multilevel modeling:  When and Why.  In I. Balderjahn, R. Mathar & M. Schader (Eds.).  Classification, data analysis, and data highways, 147-154.  New York: Springer Verlag. Hu, L., & Bentler, P. M.  (1998).  Fit Indices in Covariance Structure Modeling:  Sensitivity to Underparameterized Model Misspecification.  Psychological Methods, 3, 4, 424-453. Hubley, A. M. & Zumbo, B. D. (1996). A Dialectic on Validity: Where We Have Been and Where We Are Going. The Journal of General Psychology, 123, 3, 207-215. Hubley, A. M. & Zumbo, B. D. (2011). Validity and the Consequences of Test Interpretation and Use.  Social Indicators Research, 103, 219-230. Hussey, T. & Smith, P.  (2003).  The Uses of Learning Outcomes.  Teaching in Higher Education, 8, 3, 357-368. Ikenberry, S. (1999). Higher Education and A New Era. Keynote Address at the International Conference on Higher Education. Prague. Illeris, K. (2003). Towards a contemporary and comprehensive theory of learning. 
International Journal of Lifelong Education, 22, 4, 396-406. Inman, D. (2009). What Are Universities For? Academic Matters: The Journal for Higher Education. 23-29.  233   Jencks, C. S. & Brown, M.  (1975).  The Effects of High Schools on Their Students.  Harvard Educational Review, 45, 3, 273-324. Jobbins, D. (2002). The Times/The Times Higher Education Supplement - League Tables in Britain: An Insider's View. Higher Education in Europe, 27, 4, 383-388. Jobbins, D. (2005). Moving to a Global Stage: A Media View. Higher Education in Europe, 30, 2, 137-145. Jobbins, D., Kingston, B., Nune, M., & Polding, R. (2008). The Complete University Guide - A New Concept for League Table Practices in the United Kingdom. Higher Education in Europe, 33, 2/3, 357-359. Jöreskog, K. G.  (1999, June 22).  How Large Can a Standardized Coefficient be?  Retrieved July 8 2010, from http://www.ssicentral.com/lisrel/techdocs/HowLargeCanaStandardizedCoefficientbe.pdf.  Kane, M. (2006). Validation. In R. Brennan (ed. ), Educational Measurement. (4th ed.). Westport, Conn.: American Council on Education/Praeger. Kane, M. T. (2013).  Validating the Interpretations and Uses of Test Scores. Journal of Educational Measurement, 50, 1–73. DOI: 10.1111/jedm.12000 Keasey, K., Moon, P., & Duxbury, D. (2000). Performance Measurement and the Use of League Tables: Some experimental evidence of dysfunctional consequences. Accounting and Business Research, 30, 4, 275-286. Keeling, R. P., Wall, A. F., Underhile, R., & Dungy, G. J. (2008). Assessment Reconsidered: Institutional Effectiveness for Student Success. Published by International Center for Student Success and Institutional Accountability  234   (distributed by the National Association of Student Personnel Administrators, 2008), 9. Klien, S., Benjamin, R. Shavelson, R., & Bolus, R.  (2007).  The Collegiate Learning Assessment: Facts and Fantasies.  Evaluation Review, 31, 5, 415-439. Klien, S., Conn, A. B., Smith, D. B., & Sora, J. S.  (2001).  Is Everyone in Agreement?  An Exploration of Within-Group Agreement in Employee Perceptions of the Work Environment.  Journal of Applied Psychology, 86, 1, 3-16. Kim, K.  (2005).  An Additional View of Conducting Multi-level Construct Validation.  In F. J. Yammarino and F. Dansereau, (eds.), Multi-level Issues in Organizational Behavior and Processes. (Research in Multi Level Issues, Volume 3, pp. 317-333).  Emerald Group Publishing Limited. Kirby, D. (2007). Reviewing Canadian Post-Secondary Education: Post-Secondary Education Policy in Post-Industrial Canada.  Canadian Journal of Educational Administration and Policy, 65, Retrieved October 8 2010, from http://www.umanitoba.ca/publications/cjeap/articles/kirby.html.  Klor de Alva, J. (2000 March/April).  Remaking the Academy: 21st-Century Challenges to Higher Education in the Age of Information. EDUCAUSE Review. Korzekwa, A. M.  (2010).  An Examination of the Predictive Validity of National Survey of Student Engagement and Scalelets (Master’s thesis).  University of New Mexico, New Mexico.  235   Kreber, C. (2006). Introduction: The Scope of Possibility in Interpreting and Promoting Research-Based Teaching. New Directions for Teaching and Learning, 107, 7-12. Kreft, I. G. G., de Leeuw, J., & Aiken, L.  (1995).  The Effect of Different Forms of Centering in Hierarchical Linear Models.  Multilevel Behavioral Research, 30, 1, 1-21. Kuh, G. D. (2009). The National Survey of Student Engagement: Conceptual and empirical foundations. New Directions for Institutional Research, 141, 5-20. 
Kuh, G. D., Kinzie, J., Schuh, J. H., Whitt, E. J., & Associates. (2005). Student success in college: Creating conditions that matter. San Francisco: Jossey-Bass. Kuh, G., & Pascarella, E. (2004). What Does Institutional Selectivity Tell Us About Educational Quality? Change, 53-58. LaNasa, S. M., Cabrera, A. F. & Trangsrud, H.  (2009).  The Construct Validity of Student Engagement:  A Confirmatory Factor Analysis Approach.  Research in Higher Education, 50, 315-332. Lang, D. W. (2005). “World Class” or The Curse of Comparison? The Canadian Journal of Higher Education, 35, 3, 27 – 55. LeBreton, J. M. & Senter, J. L.  (2008).  Answers to 20 Questions About Interrater Reliability and Interrater Agreement.  Organizational Research Methods, 11, 4, 815-852. Liberal Education & America’s Promise.  (2007).  College Learning for the New Global Century.  A Report from the National Leadership Council, Association of American Colleges and Universities, Washington, DC.  236   Lindell, M. K. & Brandt, C. J.  (1999). Assessing Interrater Agreement on the Job Relevance of a Test: A Comparison of the CVI, T, rWG(j), and r*WG Indexes.  Journal of Applied Psychology, 84, 640–647. Lindell, M. K., Brandt, C. J., & Whitney, D. J. (1999). A Revised Index of Inter-rater Agreement for Multi Item Ratings of a Single Target.  Applied Psychological Measurement, 23,127–135. Liu, O. L. (2011). Outcomes Assessment in Higher Education: Challenges and Future Research in the Context of Voluntary System of Accountability.  Educational Measurement: Issues and Practice, 30, 3, 2-9. Lüdtke, O., Robitzsch, A., Trautwein, U., & Kunter, M. (2009).  Assessing the impact of Learning Environments:  How to Use Student Ratings of Classroom or School Characteristics in Multilevel Modeling. Contemporary Educational Psychology, 34, 2, 120–131, DOI: 10.1016/j.cedpsych.2008.12.001. Maher, A. (2004). Learning Outcomes in Higher Education: Implications for Curriculum Design and Student Learning. Journal of Hospitality, Leisure, Sport and Tourism Education, 3, 2, 47-54. Mancuso, M., Desmarais, S., Parkinson, K., & Pettigrew, B.  (2010).  Disappointment,  Misunderstanding and Expectations: A Gap Analysis of NSSE, BCSSE and FSSE. Toronto: Higher Education Quality Council of Ontario. Martin, J. S. & Marion, R. (2005). Higher Education Leadership Roles in Knowledge Processing. The Learning Organization, 12, 2, 140-151. Maas, C. J. M.  & Hox, J. J.  (2005).  Sufficient Sample Sizes for Multilevel Modeling.  Methodology, 1, 3, 86-92.  237   McCormick, A. C. and McClenney, K. (2012). Will These Trees Ever Bear Fruit? A Response to the Special Issue on Student Engagement. The Review of Higher Education, 35, 2, 307-333. Merisotis, J., & Sadlak, J. (2005). Higher Education Rankings: Evolution, Acceptance, and Dialogue. Higher Education in Europe, 30, 2, 97-101. Messick, S. (1995a). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23, 2, 13–23. Messick, S. (1995b). Validity of Psychological Assessment: Validation of Inferences From Persons’ Responses and Performances as Scientific Inquiry Into Score Meaning. American Psychologist, 50, 9, 741-749. Mok, M. (1995).  Sample Size Requirements for 2-Level Designs in Educational Research.  Multilevel Modelling Newsletter, 7, 2, 11-15. Morrison, H. G. & Cowan, P. C. (1996). The State Schools Book: A Critique of A League Table. British Educational Research Journal, 22, 241-250. Morse, R. (2008). 
The Real and Perceived Influence of the US News Ranking. Higher Education in Europe, 33, 2, 349-356. Moss, P. (2003). Reconceptualizing Validity for Classroom Assessment. Educational Measurement Issues and Practices, 22, 4, 13-25. DOI: 10.1111/j.1745-3992.2003.tb00140.x. Murphy, M. (2011). Troubled by the past: history, identity and the university. Journal of Higher Education Policy and Management, 33, 5, 509-517. Muthén, L. K. & Muthén, B. O. (1998-2012).  Mplus User’s Guide. Seventh Edition.  Los Angeles, CA: Muthén & Muthén   238   National Survey of Student Engagement.  (2011).  Accreditation Toolkits.  Retrieved May 14, 2014, from http://nsse.iub.edu/html/accred_toolkits.cfm. National Survey of Student Engagement.  (2011).  Major Field Report.  Retrieved May 14, 2014, from: http://nsse.iub.edu/html/major_field_report.cfm. Nelson Laird, T. F., Shoup, R., & Kuh, G. D.  (2005).  Deep Learning and College Outcomes: Do Fields of Study Differ? Paper presented at the Annual Conference of the California Association for Institutional Research. San Diego, CA. Nelson Laird, T. F., Shoup, R., & Kuh, G. D.  (May, 2008).  Measuring deep  approaches to learning using the National Survey of Student Engagement. Paper presented at the Annual Meeting of the Association for Institutional Research. Chicago, IL. Retrieved November 16, 2011, from http://nsse.iub.edu/pdf/conference_presentations/2006/AIR2006DeepLearningFINAL.pdf. O’Connell, A. A. & Reed, S. J.  (2012).  Hierarchical Data Structures, Institutional Research, and Multilevel Modeling.  New Directions for Institutional Research, 154, 5-22. O’Connor, B. P. (2004). SPSS and SAS programs for addressing interdependence and basic levels-of-analysis issues in psychological data. Behavior Research Methods, Instrumentation, and Computers, 36, 17-28. Olivas, M. A. (2011). If you Build It, They Will Assess It (or, An Open Letter to George Kuh, with Love and Respect). The Review of Higher Education, 35, 1, 1-15.  239   Opdenakker, M-C., & Van Damme, J.  (2001).  Relationship between School Composition and Characteristics of School Process and their Effect on Mathematics Achievement. British Educational Research Journal, 27, 4, 407-432. Pace, C. R. (1986). Quality, content, and context in the assessment of student learning and development in college. Los Angeles, CA: The Center for the Study of Evaluation, Graduate School of Education, University of California, Los Angeles. Penn, J. D.  (2011).  The Case for Assessing Complex General Education Student Learning Outcomes.  New Directions for Institutional Research, 149, 5-14, DOI: 10.1002/ir.376. Pike, G.  (2006a).  The convergent and discriminant validity of NSSE Scalelet scores. Journal of College Student Development, 47, 5, 550-563. Pike, G.  (2006b).  The Dependability of NSSE Scalelets for College- and Department-Level Assessment. Research in Higher Education, 47, 2, 177-195. Pike, G.  (2011). Using College Students’ Self-Reported Learning Outcomes in Scholarly Research. New Directions in Institutional Research, 150, 41-58. Porter, S. R.  (2005). What can multilevel models add to institutional research?  In: Coughlin, M. A. (Ed.), Applications of Advanced Statistics in Institutional Research, Association of Institutional Research, Tallahassee, FL. Porter, S. R.  (2006).  Institutional Structures and Student Engagement.  Research in Higher Education, 47, 5, 521-558.  Porter, S. R.  (2011). Do College Student Surveys Have Any Validity? The Review of Higher Education, 35, 1, 45–76.  240   Porter, S. R.  
(2012).  Using Student Learning as a Measure of Quality in Higher Education.  Retrieved October 9, 2012, from http://www.hcmstrategists.com/contextforsuccess/papers/PORTER_PAPER.pdf. Price, L.  (2011).  Modelling factors for predicting student Learning Outcomes in Higher Education.  In ‘Learning in Transition:  dimensionality, validity and development’ scientific research network conference, University of Antwerp, Belgium. Proulx, R. (2007). Higher Education Ranking and League Tables: Lessons Learned from Benchmarking. Higher Education in Europe, 32 (1), 71-82. Pryor, J. & Crossouard, B.  (2008).  A Socio-cultural Theorisation of Formative Assessment.  Oxford Review of Education, 34, 1, 1-20. Raudenbush, S. (2004). What Are Value-Added Models Estimating and What Does This Imply for Statistical Practice? Journal of Educational and Behavioral Statistics, 29 (1), 121-129. Raudenbush, S. & Bryk, A. (2002).  Hierarchical Linear Models:  Applications and Data Analysis Methods.  Second Edition.  Thousand Oaks, CA:  Sage Publications, Inc. Raudenbush, S.W., Bryk, A.S, & Congdon, R.  (2004).  HLM 7 for Windows [Computer software]. Skokie, IL: Scientific Software International, Inc.  Raudenbush, S. & Wilms, J.  (1995).  The Estimation of School Effects.  Journal of Education and Behavioral Statistics, 20, 307-335. Rasbash, J., Browne, W., Goldstein, H., Yang, M., Plewis, I., Healy, M., Woodhouse, G., Draper, D., Langofrd, I., & Lewis, T. (2002). A User’s Guide to MLwiN.  241   Version 2.1d, Centre for Multilevel Modelling Institute of Education, University of London. Redmond, R., Curtis, E., Noone, T., & Keenan, P. (2008). Quality in higher education: The contribution of Edward Deming's principles. Quality in Higher Education, 22, 5, 432-441. Rocki, M. (2005). Statistical and Mathematical Aspects of Ranking: Lessons from Poland. Higher Education in Europe, 30, 2, 173-181. Roska, J. & Arum, R. (2011). The State of Undergraduate Learning. Change, 43, 2, 35-38; DOI: 10.1080/00091383.2011.556992. Rupp, A. A., Dey, D. K., & Zumbo, B. D.  (2004).  To Bayes or Not to Bayes, From Whether to When:  Applications of Bayesian Methodology to Modeling.  Structural Equation Modeling, 11, 3, 424-451. Schagen, I. P. & Hutchison, D. (2003). Adding Value in Educational Research – The Marriage of Data and Analytical Power. British Educational Research Journal, 29, 5, 749-765. Schmidt, W., McKnight, C., Houang, R., Wang, H., Wiley, D., Cogan, L., & Wolfe, R.  (2001).  Why Schools Matter:  A Cross-National Comparison of Curriculum and Learning.  San Francisco, CA:  John Wiley & Sons, Inc.  Schriesheim, C. A., Cogliser, C. C., and Neider, L. L. (1995).  Is it ‘‘trustworthy’’? A multiple levels-of-analysis reexamination of an Ohio State leadership study, with implications for future research.  Leadership Quarterly, 6, 111 – 145. Scott, D. (2009).  University Ranking As A Marketing Tool: Readers Beware.  Academic Matters: The Journal of Higher Education, February/March.  242   Scott, R. & Walberg, H.  (1979).  Schools Alone are Insufficient:  A Response to Edmonds.  Educational Leadership, 38, 24-27. Shore, C., & Wright, S. (2004). Whose Accountability? Governmentality and the Auditing of Universities. Parallax, 10, 2, 100-116. Shushok, Jr. F., Henry, D. V., Blalock, G. & Sriram, R. R. (2009). Learning At Any Time: Supporting Student Learning Wherever It Happens. About Campus, 10-15. Skolnik, M. L.  (2010).  Quality assurance in higher education as a Political Process.  Higher Education Management and Policy, 22, 1, 67-86. 
Smyth, J. & Dow, A.  (1998). What’s Wrong with Outcomes? Spotter plans, action plans, and steerage of the educational workplace.  British Journal of Sociology of Education, 19, 3, 291-303. Snowdon, K. (2005).  Without a Roadmap: Government Funding and Regulation of Canada’s Universities and Colleges. Ottawa: Canadian Policy Research Networks. Spelling Commission on the Future of Higher Education. (2006). A Test of Leadership: Charting the future of US Higher Education. US Department of Education, September 26, 2006. Stark, J. S. & Lowther, M.A.  (1980).  Measuring Higher Education Quality.  Research in Higher Education, 13, 3, 283-287. Steedle, J. (2009). Advancing institutional value-added score estimation. New York: Council for Aid to Education Stensaker, B.  (2011).  Accreditation of Higher Education in Europe – Moving Towards the US Model?  Journal of Education Policy, 26, 6, 757-769.  243   Stewart, A.C. & Carpenter-Hubin, J. (2002). The Balanced Scorecard: Beyond Reports and Ratings.  Planning for Higher Education, 49, 2, 37-42. Tabachnick, B. G. & Fidell, L. S.  (2001).  Using Multivariate Statistics.  (4th ed.).  London, UK:  Allyn & Bacon.  Tagg, T.  (2008).  Changing Minds in Higher Education: Students Change, So Why Can’t Colleges?  Planning for Higher Education, 37, 1, 15–22.  Tagliaventi, M. & Mattarelli, E. (2006). The Role of Networks of Practice, Value Sharing, and Operational Proximity in Knowledge Flows Between Professional Groups. Human Relations, 59, 3, 291-319. Tavakol, M. & Dennick, R.  (2011).  Making Sense of Cronbach’s Alpha.  International Journal of Medical Education, 2, 53-55. Taylor, E. W. (2007). Transformative Learning Theory.  New Directions for Adult and Continuing Education, 119, 5-15. Third International Mathematics and Science Survey, TIMSS.  (1995).  User Guide for the TIMSS International Database. Gonzalez, E. J., and Smith, T. A. (Eds.).  International Association for the Evaluation of Educational Achievement, Amsterdam: The Netherlands. Thomas, S. L., & Heck, R. H. (2001).  Analysis of large-scale secondary data in higher education research: Potential perils associated with complex sampling designs.  Research in Higher Education, 42, 517–540. Thomas, D. R. & Zumbo, B. D.  (2012).  Difference Scores From the Point of View of Reliability and Repeated-Measures ANOVA:  In Defense of Difference Scores for Data Analysis.  Educational and Psychological Measurement, 72, 1, 37-43.   244   DOI: 10.1177/0013164411409929. Thompson, B. & Daniel, L. (1996).  Factor Analytic Evidence for the Construct Validity of Scores: A Historical Overview and Some Guidelines.  Educational and Psychological Measurement, 56, 2, 127-137. Thomson, G. & Alexander, S. (2011). SERU and Campus Climate Research: An Introduction. Fifth Annual UCUES/SERU Research Symposium. University of North Carolina, Raleigh, NC. Thomson, G. & Douglass, J. (2009). Decoding learning gains: Measuring outcomes and the pivotal role of the major and student backgrounds. Research & Occasional Paper Series: Centre for Studies in Higher Education (CSHE.5.09). Timmermans, A. C., Doolaard, S., & de Wolf, I.  (2011).  Conceptual and Empirical Differences among Various Value-Added Models for Accountability.  School Effectiveness and Improvement, 22, 4, 393-413. Tinto, V. (2010). Enhancing Student Retention: Lessons Learned in the United States.  Presented at the National Conference on Student Retention, Dublin, Ireland. Trigwell, K. & Prosser, M.  (1991).  
Improving the Quality of Student Learning: The Influence of Learning Context and Student Approaches to Learning on Learning Outcomes.  Higher Education, 22, 251-266. Umbach, P. D. & Porter, S. R. (2002).  How Do Academic Departments Impact Student Satisfaction?  Understanding the Contextual Effects of Departments.  Research in Higher Education, 43, 2, 209-233. Ungerleider, C.  (2003).  Large-scale student assessment:  Guidelines for policy-makers.  International Journal of Testing, 3, 2, 119-128.  245   Ungerleider, C.  (2006).  Reflections on the Use of Large-Scale Student Assessment.  Canadian Journal of Education, 29, 873-883. Usher, A. (2009). University Rankings 2.0: New Frontiers in Institutional Comparisons. Australian Universities Review, 51, 2, 87-90. Usher, A., & Savino, M. (2007). A Global Survey of University Ranking and League Tables. Higher Education in Europe, 32, 1, 5-15. Van de Vijver, F. J. R. & Poortinga, Y. H.  (2002).  Structural Equivalence in Multilevel Research.  Journal of Cross-Cultural Psychology, 33, 2, 141-156. Van der Wende, M. C.  (2000).  The Bologna Declaration: Enhancing the Transparency and Competitiveness of European Higher Education.  Higher Education in Europe, 25, 3, 305-310. Van Dyke, N. (2005). Twenty Years of University Report Cards. Higher Education in Europe, 30, 2, 103-125.  Van Kemenade, E., Pupius, M., & Hardjono, T. (2008). More Value to Defining Quality. Quality in Higher Education, 14, 2, 175-185. Webb, N. L. (2007). Issues Related to Judging the Alignment of Curriculum Standards and Assessments.  Applied Measurement in Education, 20, 1, 7-25. Wieman, C.  (2014).  Doctoral Hooding Ceremony Address to the Graduating Class of the Teachers College Columbia University.  Retrieved June 18, 2014, from https://www.youtube.com/watch?v=SQ6vbVBotpM. Woodhouse, D.  (2012).  A Short History of Quality.  The Commission for Academic Accreditation Quality Series, No. 2.  Abu Dhabi: United Arab Emirates.  246   Zhao, C.-M., Kuh, G., & Carini, R. (2005). A Comparison of International Student and American Student Engagement in Effective Educational Practices. The Journal of Higher Education, 76, 2, 209-231. Zumbo, B. D. (Ed.) (1998).  Validity theory and the Methods Used in Validation:  Perspectives from the Social and Behavioral Sciences.  Special issue of the journal.  Social Indicators Research:  An International and Interdisciplinary Journal for Quality-of-life Measurement, 45, 1-3, 1-359. Zumbo, B. D., & Forer, B. (2011).  Testing and Measurement from a Multilevel View:  Psychometrics and Validation.  In James A. Bovaird, Kurt F. Geisinger, & Chad W.  Buckendahl (Editors).  High Stakes Testing in Education – Science and Practice in K-12 Settings, pp. 177-190.  American Psychological Association Press, Washington, D. C. Zumbo, B. D., Liu, Y., Wu, A. D., Forer, B. & Shear, B.  (2010).  National and International Educational Achievement Testing:  A Case of Multi-Level Validation.  In A Multilevel View of Test Validity.  Symposium conducted at the meeting of the Amercian Educational Research Association, Denver, CO.   Zumbo, B.D. (2007).  Validity: Foundational Issues and Statistical Methodology.  In C.R. Rao and S. Sinharay (Eds.) Handbook of Statistics, 26, Psychometrics, (pp. 45-79). Elsevier Science B.V.: The Netherlands.    Zumbo, B. D. (2009).  Validity as Contextualized and Pragmatic Explanation, and Its Implications for Validation Practice.  In Robert W. Lissitz (Ed.) The Concept of Validity: Revisions, New Directions and Applications, (pp. 65-82).  
Information Age Publishing, Inc.: Charlotte, NC.   247   Appendix A  Table A.1 Means, Standard Deviations and Intercorrelations of Measures for the NSSE First-Year Students Measures Mean SD n 1 2 3 4 5 6 7 UBC Grades           1. NSSE self-reported GPA 5.28 1.84 1955               Perceived general learning outcomes 2. Acquiring a broad general education 3.00 0.80 1966 .107** 3. Writing clearly and effectively 2.78 0.86 1970 .121** .367** 4. Speaking clearly and effectively 2.45 0.94 1962 .064** .304** .603** 5. Thinking critically and analytically 3.17 0.78 1968 .117** .402** .521** .453** 6. Analyzing quantitative problems 2.98 0.89 1961 .055* .257** .294** .287** .553** Overall Satisfaction 7. Overall experience 3.07 0.75 1957 .252** .351** .292** .268** .357** .214** 8. Attend same institution 3.25 0.75 1961 .163** .284** .219** .214** .270** .173** .603** Climate for Diversity 9. Had serious conversations with students of a different race or ethnicity than your own 2.73 1.03 2079 .080** .159** .122** .093** .163** .117** .237** 10. Had serious conversations with students who are very different from you in terms of their religious beliefs, political opinions, or personal values 2.60 0.99 2074 .039 .193** .160** .123** .199** .122** .216** 11. Encouraging contact among students from different economic, social, and racial or ethnic backgrounds 2.48 0.97 1995 .046* .259** .244** .310** .253** .169** .316** Note. *p<0.05. **p<0.01. (table continues)      248   Table A.1 (continued) Measures Mean SD n 1 2 3 4 5 6 7 Support for Student Success 12. Providing the support you need to help you succeed academically 2.87 0.81 1995 .166** .323** .267** .224** .318** .225** .391** 13. Helping you cope with your non-academic responsibilities (work, family, etc.) 1.94 0.88 1996 -.006 .202** .223** .302** .145** .151** .258** 14. Providing the support you need to thrive socially 2.28 0.88 1987 .054* .276** .239** .306** .204** .156** .348** Interpersonal Environment 15. Relationships with other students 5.53 1.25 2027 .116** .258** .172** .202** .248** .210** .407** 16. Relationships with Faculty members 4.80 1.30 2027 .218** .235** .224** .219** .216** .082** .421** 17. Relationships with staff 4.44 1.45 2022 .093** .204** .185** .177** .161** .108** .316** Note. *p<0.05. **p<0.01.    (table continues)           249   Table A.1 (continued)    Measures Mean SD n 8 9 10 11 12 13 Climate for Diversity 9. Had serious conversations with students of a different race or ethnicity than your own 2.73 1.03 2079 .206**      10. Had serious conversations with students who are very different from you in terms of their religious beliefs, political opinions, or personal values 2.60 0.99 2074 .174** .751**     11. Encouraging contact among students from different economic, social, and racial or ethnic backgrounds 2.48 0.97 1995 .220** .169** .203**    Support for Student Success 12. Providing the support you need to help you succeed academically 2.87 0.81 1995 .287** .095** .101** .404**   13. Helping you cope with your non-academic responsibilities (work, family, etc.) 1.94 0.88 1996 .197** .075** .083** .484** .368**  14. Providing the support you need to thrive socially 2.28 0.88 1987 .254** .120** .137** .483** .396** .620** Interpersonal Environment 15. Relationships with other students 5.53 1.25 2027 .366** .221** .177** .236** .240** .171** 16. Relationships with Faculty members 4.80 1.30 2027 .311** .079** .061** .228** .352** .215** 17. 
Relationships with staff 4.44 1.45 2022 .241** .019 .019 .183** .287** .221** Note. *p<0.05. **p<0.01.    (table continues)  Table A.1 (continued)    Measures Mean SD n 14 15 16 Interpersonal Environment 15. Relationships with other students 5.53 1.25 2027 .281** 16. Relationships with Faculty members 4.80 1.30 2027 .265** .378** 17. Relationships with staff 4.44 1.45 2022 .213** .298** .584** Note. *p<0.05. **p<0.01.   250           Figure A.1  P-P Plots in SPSS to Determine if the Distribution of NSSE First-Year Student Composite Scores on General Learning Gains, Overall Satisfaction (quality), Emphasis on Diversity, Support for Student Success and Interpersonal Environment are Normal. 251   Table A.2  Means, Standard Deviations and Intercorrelations of Measures for the NSSE Fourth-Year Students Measures Mean SD n 1 2 3 4 5 6 7 8 UBC Grades                      1. Entering GPA 82.65 8.69 1266  2. Current GPA 79.46 7.0 1281 .295**      3. NSSE Self-reported GPA 6.03 1.43 1174 .277** .743**       Gains in General Learning  4.. Acquiring a broad general education 3.08 0.82 1191 -.058* .017 .051 5. Writing clearly and effectively 2.98 0.87 1194 -.043 .100** .144** .443** 6. Speaking clearly and effectively 2.73 0.92 1193 -.031 .112** .121** .362** .631** 7. Thinking critically and analytically 3.33 0.75 1188 .000 .088** .128** .423** .560** .486** 8. Analyzing quantitative problems 2.94 0.92 1185 .033 .064* .040 .224** .232** .290** .476** Overall Satisfaction  9. Quality of educational experience 3.03 0.76 1195 .015 .092** .169** .380** .376** .371** .431** .294** 10. Would you go to UBC again if you could start over 3.12 0.82 1195 .006 .047 .105** .263** .304** .289** .330** .228** Climate for Diversity  11. Had serious conversations with students of a different race or ethnicity than your own 2.88 1.01 1252 .091** .064* .134** .123** .143** .136** .174** .065* 12. Had serious conversations with students who are very different from you in terms of their religious beliefs, political opinions, or personal values 2.71 0.95 1255 .050 .044 .107** .130** .161** .171** .177** .067* 13. Encouraging contact among students from different economic, social, and racial or ethnic backgrounds 2.25 0.94 1205 -.004 .032 .062* .208** .177** .286** .195** .127** Note. *p<0.05. **p<0.01.  (table continues)     252    Table A.2 (continued)  Measures Mean SD n 1 2 3 4 5 6 7 8 Support for Student Success                14. Providing the support you need to help you succeed academically 2.61 0.80 1211 .011 .101** .129** .314** .326** .311** .360** .260**     15. Helping you cope with your non-academic responsibilities (work, family, etc.) 1.77 0.82 1207 -.025 .016 .024 .167** .169** .248** .165** .165**     16. Providing the support you need to thrive socially 1.98 0.83 1204 .039 .066* .087** .187** .213** .278** .216** .230** Interpersonal Environment  17. Relationships with other students 5.31 1.39 1230 .076** .085** .105** .143** .157** .233** .223** .224** 18. Relationships with faculty members 5.03 1.31 1229 -.048 .153** .187** .261** .327** .333** .356** .208** 19. Relationships with staff 4.18 1.57 1227 -.054 .027 .031 .189** .214** .227** .228** .183** Note. *p<0.05. **p<0.01.  (table continues)      253   Table A.2 (continued) Measures Mean SD n 9 10 11 12 13 14 15 16 17 18 Overall Satisfaction 9. Quality of educational experience 3.03 0.76 1195 10. Would you go to UBC again if you could start over 3.12 0.82 1195 .641**          Climate for Diversity 11. 
Had serious conversations with students of a different race or ethnicity than your own 2.88 1.01 1252 .175** .097**         12. Had serious conversations with students who are very different from you in terms of their religious beliefs, political opinions, or personal values 2.71 0.95 1255 .159** .099** .745**        13. Encouraging contact among students from different economic, social, and racial or ethnic backgrounds 2.25 0.94 1205 .290** .239** .163** .173**       Support for Student Success 14. Providing the support you need to help you succeed academically 2.61 0.80 1211 .467** .398** .075** .086** .414**      15. Helping you cope with your non-academic responsibilities (work, family, etc.) 1.77 0.82 1207 .280** .268** .058* .040 .455** .445**     16. Providing the support you need to thrive socially 1.98 0.83 1204 .330** .290** .121** .087** .451** .442** .659** Interpersonal Environment 17. Relationships with other students 5.31 1.39 1230 .379** .348** .151** .110** .255** .275** .185** .249** 18. Relationships with faculty members 5.03 1.31 1229 .506** .429** .107** .089** .204** .435** .246** .238** .443** 19. Relationships with staff 4.18 1.57 1227 .377** .379** .026 .009 .216** .364** .252** .255** .359** .541** Note. *p<0.05. **p<0.01.    254         Figure A.3.  P-P Plots in SPSS to Determine if the Distribution of NSSE Fourth-Year Student Composite Scores on General Learning Gains, Overall Satisfaction (quality), Emphasis on Diversity, Support for Student Success and Interpersonal Environment are Normal.   255   Table A.3 NSSE MFA results for Perceived General Learning by Year Level Factor Loadings Survey Items ICC valuesWithinlevelsBetween levels First-Year Students Acquiring a broad general education 0.025 0.553* 0.883* Writing clearly and effectively 0.035 0.843* 1.011* Speaking clearly and effectively 0.043 0.770* 0.340 Thinking critically and analytically 0.002 0.743* 0.989 Note. *p < 0.05. First-Year, within-level eigenvalues, 2.6, 0.7, 0.4, 0.3; between-level eigenvalues, 3.0, 0.9, 0.1, -0.1; Model-fit, RMSEA value 0.082, SRMR within-value 0.035, SRMR between-value 0.07, and CFI value 0.985 and TLI 0.956; misspecified model.  Table A.4 NSSE MFA results for Campus Climate for Diversity by Year Level Factor Loadings Survey Items ICC valuesWithin levels Between levels First-Year Students Different race or ethnicity 0.014 0.844* 1.226* Differ from you in religious, political, values 0.021 0.979* 0.771* Different economic, social, racial or ethnic backgrounds 0.008 0.229* 0.652 Fourth-Year Students Different race or ethnicity 0.057 0.894* 0.704* Differ from you in religious, political, values 0.024 0.932* 1.043* Different economic, social, racial or ethnic backgrounds 0.019 0.191* 0.958* Note. *p < .05. First-Years: eigenvalues within-levels, 1.9, 0.9, 0.2; between-levels, 2.5, 0.5, 0.0; model fit, just-identified model; misspecified model. Fourth-Years: eigenvalues within-levels,1.9, 0.9, 0.2; between-levels, 2.6, 0.4, 0.0; model fit, just-identified model; misspecified model.     
256   Table A.5 NSSE MFA results for Supportive Campus Environment by Year Level Factor Loadings Survey Items ICC valuesWithin levelsBetween levels First-Year Students Cope with non-academic responsibilities 0.018 0.822* 1.363 Support to succeed academically 0.024 0.555* 0.228 Support to thrive socially 0.003 0.862* 0.729 Fourth-Year Students Cope with non-academic responsibilities 0.014 0.619* -0.248 Support to succeed academically 0.011 0.884* 0.704 Support to thrive socially 0.014 0.855* 1.405 Note. *p < 0.05. First-years, eigenvalues within-level 2.1, 0.6, 0.3; between-level eigenvalues, 2.1, 0.9, 0.0; model fit, just-identified; misspecified model. Fourth-years, eigenvalues within-level, 2.2, 0.5, 0.2; eigenvalues between-levels 2.1, 0.9, 0.0; model fit, just-identified; misspecified model.  Table A.6 NSSE MFA results for Interpersonal Environment by Year Level Factor Loadings Survey Items ICC valuesWithin levelsBetween levels First-Year Students Relationships with other students 0.008 0.476* 0.023 Relationships with faculty 0.035 0.882* 5.443 Relationships with administrative staff 0.017 0.687* 0.166 Fourth-Year Students Relationships with other students 0.046 0.590* 0.160 Relationships with faculty 0.059 0.853* 0.277 Relationships with administrative staff 0.055 0.662* 2.451 Note. *p < 0.05. First-Years, eigenvalues within-levels, 1.9, 0.7, 0.4; between-levels, 1.9, 1.0, 0.1; model fit, just-identified; misspecified model. Fourth-year eigenvalues within 2.0, 0.6, 0.4; between values were 1.8, 1.0, 0.2; model fit, just-identified; misspecified model.  257   Appendix B  Table B.1 Means, Standard Deviations and Intercorrelations of Measures UES First-Year Students Measures Mean SD n 1 2 3 4 5 6 7 8 UBC grades 1. Current GPA 73.88 10.85 2114 2. Entering GPA 86.78 7.56 2342 .279** Gains in General Learning 3. Analytical and critical thinking skills  4.35 0.83 2317 .114** .040 4. Ability to be clear and effective when writing 4.19 0.92 2309 .062** .041* .528** 5. Ability to read and comprehend academic material 4.41 0.86 2308 .103** .042* .535** .528** 6. Ability to speak clearly and effectively in English  4.94 1.03 2306 .048* -.008 .394** .492** .416** 7. Quantitative (mathematical and statistical) skills 3.91 1.13 2299 .134** .161** .338** .143** .263** .070** Overall Satisfaction 8. Overall academic experience 4.42 1.00 2238 .125** .030 .297** .269** .270** .223** .123** 9. I would encourage others to enroll at UBC Okanagan 4.80 1.05 2311 -.012 .016 .138** .157** .154** .206** .045* .516** Note. * p < 0.05. ** p < 0.01. (table continues)     258   Table B.1 (continued) Measures Mean SD n 1 2 3 4 5 6 7 8 Emphasis for Diversity 10. I feel free to express my political beliefs on campus 5.11 1.11 2339 -.018 -.001 .143** .140** .098** .152** .029 .166** 11. I feel free to express my religious beliefs on campus 5.21 1.16 2331 -.021 .010 .123** .109** .099** .159** .031 .151** 12. Students are respected here regardless of their economic or social class 4.99 0.98 2298 -.020 .029 .099** .120** .123** .125** .086** .254** 13. Students are respected here regardless of their gender 5.31 0.84 2303 -.024 .028 .112** .118** .162** .165** .103** .234** 14. Students are respected here regardless of their race or ethnicity 5.11 0.94 2307 .009 .027 .102** .132** .168** .160** .066** .224** 15. Students are respected here regardless of their political beliefs 5.13 0.86 2205 -.024 .015 .105** .103** .133** .166** .061** .220** 16. 
16. Students are respected here regardless of their sexual orientation  5.05  0.94  2172  -.053*  -.005  .079**  .114**  .126**  .166**  .019  .230**
17. Students are respected here regardless of their physical ability/disability  5.11  0.91  2213  -.048*  -.005  .075**  .073**  .110**  .121**  .051*  .192**
Supportive Campus Environment
18. I feel respected as an individual on this campus  4.74  0.86  2327  .007  .011  .161**  .176**  .207**  .200**  .131**  .373**
19. There is a clear sense of appropriate and inappropriate behavior on this campus  4.43  0.97  2325  -.062**  -.045*  .089**  .110**  .134**  .110**  .072**  .242**
20. I am proud to attend UBCO  5.04  0.95  2326  -.036  .023  .096**  .167**  .148**  .194**  .072**  .428**
21. Most students are proud to attend UBCO  4.91  0.87  2311  -.048*  .047*  .073**  .157**  .119**  .181**  .023  .312**
22. This institution communicates with students and values their opinions  4.37  1.05  2318  -.023  .010  .045*  .101**  .127**  .083**  .069**  .362**
23. Student life and campus experience  4.49  1.09  2310  -.007  .010  .143**  .199**  .163**  .196**  .082**  .531**
24. I feel that I belong at this campus  4.38  1.14  2309  -.002  .043*  .154**  .182**  .168**  .196**  .081**  .497**
Note. * p < 0.05. ** p < 0.01. (table continues)

Table B.1 (continued)

Measures  Mean  SD  n  9  10  11  12  13  14  15  16
Emphasis for Diversity
10. I feel free to express my political beliefs on campus  5.11  1.11  2339  .188**
11. I feel free to express my religious beliefs on campus  5.21  1.16  2331  .147**  .607**
12. Students are respected here regardless of their economic or social class  4.99  0.98  2298  .299**  .319**  .324**
13. Students are respected here regardless of their gender  5.31  0.84  2303  .284**  .318**  .319**  .589**
14. Students are respected here regardless of their race or ethnicity  5.11  0.94  2307  .250**  .292**  .297**  .613**  .626**
15. Students are respected here regardless of their political beliefs  5.13  0.86  2205  .262**  .460**  .396**  .545**  .580**  .606**
16. Students are respected here regardless of their sexual orientation  5.05  0.94  2172  .250**  .301**  .313**  .524**  .523**  .560**  .587**
17. Students are respected here regardless of their physical ability/disability  5.11  0.91  2213  .247**  .293**  .299**  .504**  .557**  .539**  .573**  .654**
Supportive Campus Environment
18. I feel respected as an individual on this campus  4.74  0.86  2327  .446**  .265**  .213**  .367**  .342**  .347**  .331**  .303**
19. There is a clear sense of appropriate and inappropriate behavior on this campus  4.43  0.97  2325  .306**  .156**  .175**  .272**  .261**  .280**  .271**  .263**
20. I am proud to attend UBCO  5.04  0.95  2326  .677**  .156**  .144**  .291**  .313**  .285**  .267**  .240**
21. Most students are proud to attend UBCO  4.91  0.87  2311  .492**  .175**  .134**  .274**  .297**  .266**  .304**  .254**
22. This institution communicates with students and values their opinions  4.37  1.05  2318  .458**  .132**  .137**  .296**  .259**  .266**  .263**  .230**
23. Student life and campus experience  4.49  1.09  2310  .600**  .178**  .167**  .325**  .282**  .261**  .231**  .234**
24. I feel that I belong at this campus  4.38  1.14  2309  .713**  .208**  .173**  .312**  .282**  .292**  .250**  .243**
Note. * p < 0.05. ** p < 0.01. (table continues)

Table B.1 (continued)

Measures  Mean  SD  n  17  18  19  20  21  22  23
Supportive Campus Environment
18. I feel respected as an individual on this campus  4.74  0.86  2327  .318**
19. There is a clear sense of appropriate and inappropriate behavior on this campus  4.43  0.97  2325  .244**  .474**
20. I am proud to attend UBCO  5.04  0.95  2326  .244**  .497**  .392**
21. Most students are proud to attend UBCO  4.91  0.87  2311  .241**  .444**  .349**  .686**
22. This institution communicates with students and values their opinions  4.37  1.05  2318  .247**  .439**  .406**  .507**  .519**
23. Student life and campus experience  4.49  1.09  2310  .207**  .418**  .279**  .492**  .361**  .356**
24. I feel that I belong at this campus  4.38  1.14  2309  .232**  .461**  .317**  .550**  .371**  .397**  .670**
Note. * p < 0.05. ** p < 0.01.

Figure B.1. P-P Plots in SPSS to Determine if the Distributions of First-Year Student Composite Ratings on Gains in General Learning, Overall Satisfaction (Quality), Campus Climate for Diversity, and Supportive Campus Environment Are Normal.

Table B.2
Means, Standard Deviations, and Intercorrelations of Measures: UES Fourth-Year Students

Measures  Mean  SD  n  1  2  3  4  5  6
UBC Grades
1. Entering GPA  74.65  23.53  2468
2. Current GPA  78.39  8.73  2468  .205**
General Gains in Learning
3. Analytical and critical thinking skills  4.80  0.78  2438  .033  .164**
4. Ability to be clear and effective when writing  4.61  0.87  2437  .039  .146**  .519**
5. Ability to read and comprehend academic material  4.76  0.80  2424  .025  .159**  .497**  .499**
6. Quantitative (mathematical and statistical) skills  3.88  1.17  2428  .073**  .107**  .201**  .091**  .165**
7. Ability to speak clearly and effectively in English  5.14  0.91  2425  .012  .049*  .382**  .456**  .427**  .060**
Overall Satisfaction
8. Overall academic experience  4.56  1.01  2348  .049*  .208**  .270**  .258**  .289**  .098**
9. I would encourage others to enroll at UBC Okanagan  4.58  1.23  2432  .052*  .071**  .168**  .174**  .182**  .117**
Note. * p < 0.05. ** p < 0.01. (table continues)

Table B.2 (continued)

Measures  Mean  SD  n  1  2  3  4  5  6
Emphasis for Diversity
10. Students are respected here regardless of their sexual orientation  5.04  0.99  2298  .005  .007  .089**  .086**  .106**  .094**
11. Students are respected here regardless of their economic or social class  4.80  1.14  2418  .023  .025  .082**  .079**  .078**  .101**
12. Students are respected here regardless of their race or ethnicity  4.96  1.06  2413  .027  .051*  .087**  .086**  .123**  .119**
13. Students are respected here regardless of their gender  5.18  0.96  2417  .038  .046*  .065**  .059**  .082**  .111**
14. I feel free to express my political beliefs on campus  5.03  1.21  2466  -.021  .036  .074**  .067**  .082**  .101**
15. I feel free to express my religious beliefs on campus  5.17  1.32  2452  -.003  .030  .066**  .100**  .106**  .066**
16. Students are respected here regardless of their political beliefs  4.90  1.07  2341  .016  .028  .068**  .037  .089**  .147**
17. Students are respected here regardless of their physical ability/disability  5.02  1.02  2313  .016  -.011  .048*  .067**  .085**  .107**
Supportive Campus Environment
18. I feel respected as an individual on this campus  4.70  0.98  2459  .066**  .088**  .189**  .142**  .178**  .111**
19. There is a clear sense of appropriate and inappropriate behavior on this campus  4.39  1.04  2454  .002  -.029  .076**  .064**  .117**  .095**
20. I am proud to attend UBCO  4.80  1.07  2459  .004  .025  .136**  .141**  .169**  .089**
21. Most students are proud to attend UBCO  4.61  0.98  2445  -.009  .003  .085**  .105**  .144**  .072**
22. This institution communicates with students and values their opinions  3.94  1.23  2453  -.008  .016  .039  .052*  .069**  .114**
23. I feel that I belong at this campus  4.30  1.26  2417  .083**  .075**  .176**  .167**  .181**  .172**
24. Student life and campus experience  4.34  1.16  2430  .057**  .049*  .133**  .141**  .120**  .156**
Note. * p < 0.05. ** p < 0.01. (table continues)

Table B.2 (continued)

Measures  Mean  SD  n  7  8  9  10  11  12  13
Overall Satisfaction
8. Overall academic experience  4.56  1.01  2348  .202**
9. I would encourage others to enroll at UBC Okanagan  4.58  1.23  2432  .158**  .662**
Emphasis for Diversity
10. Students are respected here regardless of their sexual orientation  5.04  0.99  2298  .154**  .271**  .311**
11. Students are respected here regardless of their economic or social class  4.80  1.14  2418  .089**  .346**  .368**  .549**
12. Students are respected here regardless of their race or ethnicity  4.96  1.06  2413  .123**  .288**  .348**  .608**  .632**
13. Students are respected here regardless of their gender  5.18  0.96  2417  .077**  .306**  .331**  .637**  .638**  .671**
14. I feel free to express my political beliefs on campus  5.03  1.21  2466  .102**  .202**  .235**  .393**  .406**  .375**  .395**
15. I feel free to express my religious beliefs on campus  5.17  1.32  2452  .120**  .175**  .172**  .274**  .312**  .296**  .289**
16. Students are respected here regardless of their political beliefs  4.90  1.07  2341  .082**  .270**  .287**  .588**  .588**  .570**  .571**
17. Students are respected here regardless of their physical ability/disability  5.02  1.02  2313  .089**  .265**  .303**  .688**  .564**  .582**  .616**
Supportive Campus Environment
18. I feel respected as an individual on this campus  4.70  0.98  2459  .155**  .463**  .478**  .404**  .469**  .426**  .446**
19. There is a clear sense of appropriate and inappropriate behavior on this campus  4.39  1.04  2454  .100**  .285**  .308**  .353**  .371**  .354**  .368**
20. I am proud to attend UBCO  4.80  1.07  2459  .152**  .614**  .755**  .363**  .407**  .373**  .387**
21. Most students are proud to attend UBCO  4.61  0.98  2445  .116**  .418**  .559**  .317**  .317**  .335**  .326**
22. This institution communicates with students and values their opinions  3.94  1.23  2453  .030  .399**  .477**  .297**  .404**  .352**  .329**
23. I feel that I belong at this campus  4.30  1.26  2417  .191**  .566**  .703**  .307**  .372**  .348**  .330**
24. Student life and campus experience  4.34  1.16  2430  .158**  .535**  .620**  .288**  .314**  .306**  .297**
Note. * p < 0.05. ** p < 0.01. (table continues)

Table B.2 (continued)

Measures  Mean  SD  n  14  15  16  17  18
Emphasis for Diversity
15. I feel free to express my religious beliefs on campus  5.17  1.32  2452  .584**
16. Students are respected here regardless of their political beliefs  4.90  1.07  2341  .561**  .372**
17. Students are respected here regardless of their physical ability/disability  5.02  1.02  2313  .363**  .286**  .535**
Supportive Campus Environment
18. I feel respected as an individual on this campus  4.70  0.98  2459  .331**  .248**  .412**  .385**
19. There is a clear sense of appropriate and inappropriate behavior on this campus  4.39  1.04  2454  .266**  .197**  .366**  .359**  .503**
20. I am proud to attend UBCO  4.80  1.07  2459  .264**  .196**  .345**  .362**  .561**
21. Most students are proud to attend UBCO  4.61  0.98  2445  .250**  .171**  .316**  .326**  .426**
22. This institution communicates with students and values their opinions  3.94  1.23  2453  .250**  .171**  .344**  .330**  .460**
23. I feel that I belong at this campus  4.30  1.26  2417  .245**  .204**  .276**  .282**  .512**
24. Student life and campus experience  4.34  1.16  2430  .196**  .148**  .238**  .260**  .427**
Note. * p < 0.05. ** p < 0.01. (table continues)

Table B.2 (continued)

Measures  Mean  SD  n  19  20  21  22  23
20. I am proud to attend UBCO  4.80  1.07  2459  .435**
21. Most students are proud to attend UBCO  4.61  0.98  2445  .414**  .702**
22. This institution communicates with students and values their opinions  3.94  1.23  2453  .443**  .537**  .535**
23. I feel that I belong at this campus  4.30  1.26  2417  .317**  .629**  .438**  .390**
24. Student life and campus experience  4.34  1.16  2430  .293**  .549**  .414**  .381**  .699**
Note. * p < 0.05. ** p < 0.01.

Figure B.2. P-P Plots in SPSS to Determine if the Distributions of Fourth-Year Student Composite Ratings on Gains in General Learning, Overall Satisfaction (Quality), Campus Climate for Diversity, and Supportive Campus Environment Are Normal.

Table B.3
UES MFA Results for Overall Satisfaction by Year Level

Survey Items  ICC values  Within-level loading  Between-level loading
First-Year Students
Overall academic experience  0.031  0.639*  0.249
I feel I belong at this campus  0.006  0.866*  0.671
Encourage others to enroll at UBC  0.024  0.905*  1.481
Fourth-Year Students
Overall academic experience  0.027  0.754*  0.328
I feel I belong at this campus  0.008  0.947*  1.437
Encourage others to enroll at UBC  0.004  0.815*  0.274
Note. *p < .05. First-Years: within-level eigenvalues 2.3, 0.5, 0.2; between-level eigenvalues 2.1, 0.9, 0.0; just-identified model; misspecified model. Fourth-Years: within-level eigenvalues 2.4, 0.4, 0.2; between-level eigenvalues 1.7, 0.9, 0.4; just-identified model; misspecified model.

Table B.4
UES MFA Results for Campus Climate for Diversity for Fourth-Year Students

Survey Items  ICC values  Within-level loading  Between-level loading
Fourth-Year Students
Sexual orientation  0.000  0.791*  0.152
Economic or social class  0.002  0.823*  0.686
Race  0.004  0.850*  0.770*
Gender  0.007  0.877*  0.607
Political beliefs  0.016  0.780*  1.387
Physical ability/disability  0.004  0.781*  0.608
Note. *p < .05. First-Years: within-level eigenvalues 4.3, 0.5, 0.4, 0.3, 0.3, 0.2; between-level eigenvalues 5.7, 0.2, 0.1, 0.1, 0.0, -0.1; misspecified model. Fourth-Years: within-level eigenvalues 4.4, 0.5, 0.4, 0.3, 0.2, 0.2; between-level eigenvalues 4.2, 1.5, 1.1, 0.2, -0.2, -0.7; misspecified model.

Table B.5
UES MFA Results for Supportive Campus Environment by Year Level

Survey Items  ICC values  Within-level loading  Between-level loading
First-Year Students
Student life and campus experience  0.004  0.575*  0.540
Respected as an individual  0.003  0.775*  -0.841
Clear appropriate and inappropriate behaviour  0.002  0.674*  -0.750
I am proud to attend UBC  0.013  0.831*  1.047*
Most students are proud to attend UBC  0.030  0.846*  1.076*
UBC communicates well  0.000  0.685*  -0.561
Fourth-Year Students
Respected as an individual  0.014  0.752*  -0.572
Clear appropriate and inappropriate behaviour  0.001  0.625*  -0.235
I am proud to attend UBC  0.034  0.878*  0.792*
Most students are proud to attend UBC  0.081  0.812*  1.071*
UBC communicates well  0.011  0.722*  -0.239
Student life and campus experience  0.003  0.618*  1.368
Note. *p < .05. First-Years: within-level eigenvalues 3.6, 0.7, 0.7, 0.5, 0.4, 0.2; between-level eigenvalues 4.2, 2.0, 1.0, 0.0, 0.0; model fit, RMSEA 0.025, SRMR within 0.060, SRMR between 0.393; misspecified model. Fourth-Years: within-level eigenvalues 3.7, 0.7, 0.6, 0.4, 0.4, 0.2; between-level eigenvalues 3.6, 2.7, 0.7, 0.7, -0.4, -0.7; misspecified model.
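The normality checks summarized in Figures A.3, B.1, and B.2 were produced as P-P plots in SPSS. As a rough, illustrative equivalent only (not the procedure used in this study), the Python sketch below plots the observed cumulative proportions of a single composite score against the cumulative probabilities of a normal distribution fitted with the sample mean and standard deviation; the file and column names (ues_first_year.csv, general_learning) are hypothetical placeholders rather than the study's actual variable names.

    import numpy as np
    import pandas as pd
    from scipy import stats
    import matplotlib.pyplot as plt

    # Hypothetical input: one composite score per respondent.
    scores = (pd.read_csv("ues_first_year.csv")["general_learning"]
              .dropna()
              .to_numpy())
    scores.sort()
    n = scores.size

    # Observed cumulative proportions vs. cumulative probabilities under a
    # normal distribution with the sample mean and standard deviation.
    observed = (np.arange(1, n + 1) - 0.5) / n
    expected = stats.norm.cdf(scores, loc=scores.mean(), scale=scores.std(ddof=1))

    plt.scatter(expected, observed, s=8)
    plt.plot([0, 1], [0, 1], linewidth=1)  # 45-degree reference line
    plt.xlabel("Expected cumulative probability (normal)")
    plt.ylabel("Observed cumulative probability")
    plt.title("Normal P-P plot (illustrative sketch)")
    plt.show()

Points that track the 45-degree reference line suggest an approximately normal distribution; systematic curvature or S-shaped departures indicate skewness or heavy tails.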
