Statistical and Machine Learning Classification Methodsfor Credit RatingsbyXixi HuB.Math., University of Waterloo, 2016A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Statistics)The University of British Columbia(Vancouver)September 2018c© Xixi Hu, 2018The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, a thesis/dissertation en-titled:Statistical and Machine Learning Classification Methods for Credit Ratingssubmitted by Xixi Hu in partial fulfillment of the requirements forthe degree of Master of Sciencein StatisticsExamining Committee:Harry JoeCo-supervisorNatalia NoldeCo-supervisorSupervisory Committee MemberAdditional ExaminerAdditional Supervisory Committee Members:Supervisory Committee MemberSupervisory Committee MemberiiAbstractCredit rating is an ordinal categorical label that serves as an important measure ofa financial institution’s credit worthiness. It is frequently used to decide whether ornot to grant loans as well as how much interest to charge. Companies with highercredit ratings often enjoy lower interest rate and more flexibility in obtaining loans.Due to the increased competition in the lending market, there is renewed interestin the business community in applying statistical and machine learning methodsto assign credit ratings. The challenge of adapting and generalizing these methodsoften lies in understanding and interpreting them in addition to matching ratingsaccurately.Our goal is to compare the classification performance and interpretability offour statistical learning methods on a credit rating dataset from the industry, wherethe rating variable comes from human expert opinions. We fit the ordinal regres-sion, ordinal gradient boosting, multinomial gradient boosting and random forestmethods with the goal of finding an interpretable method that can replicate the hu-man expert ratings as closely as possible. We find that while the linear ordinalregression is the most interpretable, it fails to achieve high classification accuracyduring cross-validation. Furthermore, the ordinal models (ordinal regression andordinal gradient boosting) produce significant amount of negative fitted probabil-ities in practice due to the lack of numerical constraints. While ordinal gradientboosting and random forest perform the best in our three measures of classificationaccuracy: perfect match rate, within one-class match rate and 80% prediction in-tervals, ordinal gradient boosting produces high proportions of negative values andnon-unimodality in the fitted probability mass function. Thus we choose randomforest as the most preferred method and focus on its interpretation using variableiiiimportance ranking, partial derivative of the probability mass function and cumula-tive probability function, as well as local interpretable model-agnostic explanationplots.ivLay SummaryCredit rating is used to decide whether or not to grant a loan to a company aswell as the interest rate to charge. Human experts have reviewed a set of financialstatements and produced the ratings. We use data from the same financial state-ments and try to find an interpretable statistical model that can accurately matchthe human ratings.We compare four methods: ordinal regression, ordinal gradient boosting, multi-nomial gradient boosting and random forest. The two methods that matched the hu-man ratings most accurately are ordinal gradient boosting and random forest. Theproblem with ordinal gradient boosting is that it produces negative and the non-unimodal fitted probability vectors, which we cannot interpret. We choose randomforest as the most preferred method. We focus on its interpretation by looking atthe change in probabilities associated with changing a variable and interpreting therating decisions locally at each company level.vPrefaceThis dissertation is the unpublished work by the author, Xixi Hu under the co-supervision of Dr. Natalia Nolde and Dr. Harry Joe.viTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.1 Dataset Source . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . 62.3 Imputation, Variable Selection & Transformation . . . . . . . . . 73 Classification Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 113.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.2 Classification Trees . . . . . . . . . . . . . . . . . . . . . . . . . 113.3 Ordinal Regression with Binary Splits . . . . . . . . . . . . . . . 14vii3.4 Ordinal Gradient Boosting and Multinomial Gradient Boosting . . 163.5 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.6 Conditional Class Probabilities . . . . . . . . . . . . . . . . . . . 194 Comparison of Methods . . . . . . . . . . . . . . . . . . . . . . . . . 204.1 Classification Accuracy . . . . . . . . . . . . . . . . . . . . . . . 204.1.1 Cross-validation . . . . . . . . . . . . . . . . . . . . . . 214.1.2 Parameter Tuning . . . . . . . . . . . . . . . . . . . . . . 214.1.3 Accuracy of Point Estimate . . . . . . . . . . . . . . . . 214.1.4 Point and Interval Estimates of Ratings . . . . . . . . . . 234.2 Interpretability . . . . . . . . . . . . . . . . . . . . . . . . . . . 254.2.1 Ordinal Methods . . . . . . . . . . . . . . . . . . . . . . 254.2.2 Negative Fitted Probabilities in Ordinal Methods . . . . . 264.2.3 Unimodality of the Probability Mass Function . . . . . . . 264.2.4 Examples of Predicted Rating Probabilities as a Functionof Predictors . . . . . . . . . . . . . . . . . . . . . . . . 284.3 Interpreting Random Forest . . . . . . . . . . . . . . . . . . . . . 294.3.1 Variable Importance . . . . . . . . . . . . . . . . . . . . 304.3.2 Local Interpretable Model-agnostic Explanations . . . . . 304.3.3 Partial Derivatives Plot . . . . . . . . . . . . . . . . . . . 325 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49viiiList of TablesTable 2.1 Count by credit rating . . . . . . . . . . . . . . . . . . . . . . 7Table 2.2 Variable selection by clusters . . . . . . . . . . . . . . . . . . 9Table 4.1 Numerical summary of training set sizes in 100 cross validations 21Table 4.2 Mean classification accuracy over 100 cross validations . . . . 22Table 4.3 80% prediction intervals by length. ord: ordinal regression;ogb: ordinal gradient boosting; mgb: multinomial gradient boost-ing; rf: random forest. . . . . . . . . . . . . . . . . . . . . . . 23Table 4.4 Proportion of probability vectors with modal probabilities largerthan 50%. ord: ordinal regression; ogb: ordinal gradient boost-ing; mgb: multinomial gradient boosting; rf: random forest. . . 24Table 4.5 Numerical summary of negative pˆ in 100 cross-validation runs 26Table 4.6 Proportion of unimodal predicted probability mass functions.ord: ordinal regression; ogb: ordinal gradient boosting; mgb:multinomial gradient boosting; rf: random forest. . . . . . . . 28Table 4.7 Numerical summary of v5 values in the leftmost box region . . 34Table 4.8 Numerical summary of v5 values in the rightmost box region . 34Table 4.9 Spearman correlation of top 5 important variables and v23 ver-sus rating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36Table 4.10 Partial derivative range for top 5 important variables and v23.The indicated values are the minimum and maximum. . . . . . 37ixList of FiguresFigure 3.1 A simple classification tree . . . . . . . . . . . . . . . . . . . 12Figure 4.1 Box plot of classification accuracy over 100 cross validations.ord: ordinal regression; ogb: ordinal gradient boosting; mgb:multinomial gradient boosting; rf: random forest. . . . . . . . 22Figure 4.2 Variable importance plot of random forest. A variable that ismore important has higher mean decrease in Gini index on thex-axis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31Figure 4.3 Lime plot of case 5306 . . . . . . . . . . . . . . . . . . . . . 32Figure 4.4 Partial derivative ∂ pr∂xl with respect to v5 over the training datasetunder the random forest model. The x-axis labels are mediancovariate values within each box. . . . . . . . . . . . . . . . . 34Figure 4.5 Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v5 over the trainingdataset under the random forest model. The x-axis labels aremedian covariate values within each box. . . . . . . . . . . . 35Figure 4.6 Partial derivative ∂ pr∂xl with respect to v22 over the training datasetunder the random forest model. The x-axis labels are mediancovariate values within each box. . . . . . . . . . . . . . . . . 36Figure 4.7 Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v22 over the trainingdataset under the random forest model. The x-axis labels aremedian covariate values within each box. . . . . . . . . . . . 37Figure 4.8 Partial derivative ∂ pr∂xl with respect to v6 over the training datasetunder the random forest model. The x-axis labels are mediancovariate values within each box. . . . . . . . . . . . . . . . . 38xFigure 4.9 Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v6 over the trainingdataset under the random forest model. The x-axis labels aremedian covariate values within each box. . . . . . . . . . . . 39Figure 4.10 Partial derivative ∂ pr∂xl with respect to v10 over the training datasetunder the random forest model. The x-axis labels are mediancovariate values within each box. . . . . . . . . . . . . . . . . 40Figure 4.11 Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v10 over the trainingdataset under the random forest model. The x-axis labels aremedian covariate values within each box. . . . . . . . . . . . 41Figure 4.12 Partial derivative ∂ pr∂xl with respect to v17 over the training datasetunder the random forest model. The x-axis labels are mediancovariate values within each box. . . . . . . . . . . . . . . . . 42Figure 4.13 Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v17 over the trainingdataset under the random forest model. The x-axis labels aremedian covariate values within each box. . . . . . . . . . . . 43Figure 4.14 Partial derivative ∂ pr∂xl with respect to v7 over the training datasetunder the random forest model. The x-axis labels are mediancovariate values within each box. . . . . . . . . . . . . . . . . 44Figure 4.15 Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v7 over the trainingdataset under the random forest model. The x-axis labels aremedian covariate values within each box. . . . . . . . . . . . 45Figure 4.16 Partial derivative ∂ pr∂xl with respect to v23 over the training datasetunder the random forest model. The x-axis labels are mediancovariate values within each box. . . . . . . . . . . . . . . . . 46Figure 4.17 Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v23 over the trainingdataset under the random forest model. The x-axis labels aremedian covariate values within each box. . . . . . . . . . . . 46xiAcknowledgmentsFirst and foremost, I would like to express my sincere gratitude to Dr. Harry Joeand Dr. Natalia Nolde for their guidance, insights and encouragement during mymaster study and related research. I also want to thank Scotiabank for providingthe funding for my project under the risk analytics initiative.In addition, I would like to thank the administration staff, the faculty mem-bers, and the fellow graduate students in the department of Statistics for creating astimulating and friendly research environment.Last but not the least, I would like to thank my parents for their continuoussupport and encouragement throughout my studies.xiiChapter 1IntroductionCredit rating presents a rating agency or a bank’s assessment of the risk profile ofa company. It is used extensively by business practitioners to judge a company’scredit worthiness [Hwang et al. (2010)]. It determines whether or not credit shouldbe granted to the company, and how much interest should be charged upon ex-tending the credit. For example, on July the 4th 2018, the effective yield for UScorporate bonds rated AAA, a credit rating that represents the highest creditworthi-ness, is 3.49%. For bonds rated BBB, a rating grade with lower creditworthiness,the effective yield is 4.40%. For business agencies that issues loans, identifyingthe correct credit rating is vital in maintaining profitability and solvency.In this report, we view the credit scoring problem as a statistical classificationproblem with ordinal labels. Our approaches differ from the traditional approachof conducting fundamental review on financial statements, but the goal is the same— to accurately classify credit rating constraining on time and human resources,and to provide explanations for the credit ratings. It is difficult to build trust andgeneralize the models if there is a lack of understanding about the model. Thus,being able to explain and interpret the credit rating models is particular importantif a black-box machine learning method is used.We apply four statistical classification methods:1. ordinal regression on binary splits2. ordinal gradient boosting13. multinomial gradient boosting4. random foreston data captured from a set of financial statements. The rating is generated fromfundamental reviews conducted by human reviewers. Our goal is to find inter-pretable machine learning and/or statistical methods that can replicate the humanexpert as closely as possible based on relevant variables available in the financialstatements. For confidentially purposes, we transform most of the explanatoryvariables and omit their names.The four classification methods are interconnected. Method 1 and method 2are ordinal methods which use the ordering of ratings. Ordinal regression can alsobe viewed as the linear counterpart of ordinal gradient boosting, as both meth-ods model the log-odds of the binary rating splits just with different instruments.Method 3 and 4 ignore the order information. The ratings categories are regardedas distinct unordered labels. The linear counterpart of multinomial boosting, al-though not discussed in this report, is multinomial regression. Random forest doesnot have a linear counterpart, but it is an ensemble tree-based methods as are thetwo boosting methods. The tree-based methods (multinomial boosting, ordinalboosting and random forest) allow more flexible structures, but are more difficultto interpret compared to ordinal regression.The struggle in statistical credit classification is often not on achieving highclassification accuracy, but on finding a method with accuracy and interpretabil-ity at the same time. For example, the ordinal regression method, while easy tointerpret in theory, often suffers from poor classification performance and can bedifficult to implement with near multicollinearity in the large covariate space. Tree-based learning methods such as random forest and gradient boosting can oftenachieve high classification accuracy with minimal assumptions on the data, but aredifficult to interpret. The lack of intepretability prevents these methods from beinggeneralized in the business community.The report is organized as follows: Chapter 2 provides details on the datasetand the relevant variables. Chapter 3 introduces the four methods from a theoreticalperspective. Chapter 4 compares the classification accuracy and interpretability ofthe four methods on the dataset. Chapter 5 provides directions for future work and2as well as a conclusion.1.1 Literature ReviewThe adaptation of statistical classification models in the business community iscertainly not novel. Hand and Henley (1997) reviews a variety of statistical classi-fication tools used in credit scoring to ease the increased demand and competitionin the lending industry. These models include logistic regression, probit analy-sis, nonparametric smoothing methods, mathematical programming, Markov chainmodels, recursive partitioning, expert systems, genetic algorithms, neural networksand conditional independence models. While the model used may be different, thecore struggle remains the same: to combine good classification performance withinterpretability. This struggle often translates to seeking a modification on existingmethods to achieve parsimony.For example, Hue et al. (2017) proposes to improve logistic regression usingadditional variables identified from the random forest model. The paper theorizesthat the lack of classification accuracy in logistic regression credit scoring is due tonon-linear relationships which often arise from threshold effects. The paper pro-poses to improve logistic regression by expanding the feature space to include a setof possible threshold variables, which are identified using the nodes of a randomforest with only depth-1 and depth-2 trees. Adaptive Lasso method is then used forfeature selection. This paper also presents an interesting idea, that to interpret therandom forest model, one can try to diminish the complexity of the random forest’saggregation rule, selecting (via an objective criterion) only some trees in the forest;or to keep the simplicity of logistic regression and improve its predictive perfor-mance using univariate and bivariate threshold effects. However, extending the ap-proach beyond the papers’ simple binary dataset to a multi-class S&P credit ratingdating yields limited improvement in classification accuracy. The addition of thethreshold variables creates piecewise linear functions to model the non-linearity.On a complex dataset with many features, the selection of the threshold variablesis computationally expensive even only considering depth-1 and depth-2 trees. Themodel seems best suited for binary classification problems with relatively small ini-tial feature space.3Another paper that attempts to combine the random forest model with logisticmodel is Prinzie and Van den Poel (2008). The authors propose a random multi-nomial logit model (RMNL), which is a combination of random forest and multi-nomial logit model (MNL). Compared to random forest, RMNL preserves the ran-dom feature selection and majority vote structure, but uses a MNL model insteadof a classification tree as element of the ‘forest’. Aside from this modification,RMNL follows the same steps as random forest, including the calculation of vari-able importance. The paper shows that in applications RMNL achieves comparableclassification accuracy with random forest. However it is questionable whether ornot interpretability is truly improved between RMNL and random forest. Whilea MNL model may be more interpretable than a classification tree, a black boxmodel built by averaging a collection of MNL loses the interpretibility feature ofMNL.In contrast to the previous two papers where the authors try to combine ran-dom forest, a machine learning method with logistic regression, a classical statisticmodel, in the following two papers by Hwang et al. in 2009 and 2010 focus onimproving the classification accuracy of the standard ordered linear probit model.In Hwang et al. (2009), a modified ordered linear probit model is fitted on adataset based on S&P long-insurer ratings. The modification is on the cutoff pointof a standard probit model. Instead of using a cutoff value of 0.5 on binary classi-fication, the paper proposes a data-driven cutoff point for multi-class classificatio.The standard probit model is equivalent to using 0.5 as a cutoff point on the CDFof the latent distribution. The papers proposes that an optimal cutoff point can beestimated instead, by minimizing error rate on the estimation sample. Aside fromthis modification, the variable selection is conducted using stepwise regression andthe size of the coefficients and the p-values are looked at to determine the importantvariables, similar to the standard ordered probit modelIn Hwang et al. (2010), an additional extension is made on the ordered lin-ear probit model. Instead of using linear regression for the latent variable, theauthors uses a more flexible semi-parametric method, where parameters can be es-timated by maximizing a local (weighted) log-likelihood function. The method iscomputationally comparable to linear regression. The paper demonstrates that thenew method has improved prediction accuracy, and argues that the functional form4between the S&P’s long-insurer ratings and the selected continuous predictors isnonlinear, which why a semi-parametric method provides better fit.However, in both papers, Hwang et al. merge the 10 S&P rating categoriesinto 3 ordinal categories and considered only 24 explanatory variables. With morerating categories and more explanatory variables, it is hard to make these methodwork well.5Chapter 2Dataset2.1 Dataset SourceThe data source consists of the quarterly balance sheets, income statements andcash flow statements of around 5000 business during a 3-year period. Due to prac-tical limitations such as businesses entering or dropping out of the pool in themiddle of the 3-year period, the average number of times that a company appearsin the data is 2 and the maximum number of appearances is 8. A company’s ratingis not necessarily constant over the time period. In fact around 40% of companiesexperience a rating change. The rating variable, which has 11 categories, is an as-sessment of the the company’s long-term ability to meet its financial obligations ontime, similar to the long-term issuer’s rating by S&P (Standard and Poor (2016)).2.2 Data DescriptionThere are 10013 data vectors in the dataset. Each vector corresponds to a com-pany’s financial documents in a particular quarter. The data vectors have dimen-sion 150 consisting of the 50 explanatory covariates, the one-year lag (lag-1) andtwo-year lag (lag-2) values of the covariates, and the rating variable. For confiden-tially purposes, we denote the non-lagged covariates as v1, · · · ,v50 and the laggedcovariates as v1 1, · · · ,v50 1 and v1 2, · · · , v50 2 respectively. The explanatoryvariables are continuous and sometimes bounded below by 0 due to the intrinsic6nature of financial data. For example, account receivable is always positive for asolvent company.The rating variable is an ordered categorical variable between the range 1 whichrepresents low creditworthiness in the company, to 11 which represents high cred-itworthiness. In Table 2.1, the credit rating variables are listed from low creditworthiness to high credit worthiness from left to right. The rating variable is pro-duced by human experts who reviewed the financial statements. There are initially16 rating categories. We merge the lowest four rating categories which lead torejection of loans and also the highest two rating categories due to data sparsity.Rating 1 2 3 4 5 6 7 8 9 10 11Count 262 218 464 1047 2302 2736 2807 2061 1361 481 136Proportion 0.02 0.02 0.03 0.08 0.17 0.20 0.20 0.15 0.10 0.03 0.01Table 2.1: Count by credit rating2.3 Imputation, Variable Selection & TransformationIn order to calculate the Spearman correlation values, we first impute the miss-ing numerical values. We choose the random forest imputation method availablefrom ‘rfImpute’ function in ‘randomForest’ package by Liaw and Wiener (2002).‘rfImpute’ is easier to use as a first step compared with looking for surrogate vari-ables for dependence relations. This imputation method updates the missing valuesusing a weighted average of the non-missing observations, where the weights arebased on the proximity matrix of the random forest method. We use the companyidentifier as the supervising variable when building the forest.After the initial imputation, we calculate the Spearman correlation matrix ofthe 50 non-lagged variables and observe that the variables are highly collinear. Toreduce the collinearity, we use a heat map to identify strongly-correlated clustersand choose a variable to represent each cluster.There are 8 clusters in the non-lagged variables given below.71. v10, v9, v20 1.000 0.939 0.9160.939 1.000 0.8640.916 0.864 1.0002. v12,v14,v161.000 0.803 0.7920.803 1.000 0.9820.792 0.982 1.0003. v41, v48, v43 1.000 0.854 0.8350.854 1.000 0.8940.835 0.894 1.0004. v1, v15, v42 1.000 0.975 0.8350.975 1.000 0.8060.835 0.806 1.0005. v21, v22,v23 1.000 0.798 0.9720.798 1.000 0.8200.972 0.820 1.0006. v35, v31 [1.000 0.8870.887 1.000]7. v10, v9,v17 1.000 0.939 0.9160.939 1.000 0.8640.916 0.864 1.00088. v4, v44 [1.000 1.0001.000 1.000]We reduce each cluster to a single variable based on the Spearman correlationmatrix and the proportion of missing data before imputation. For example, beforeimputation, the missing proportion of v12, v14 and v16 in the second cluster are16%, 13% and 13% respectively. We select v14 from this cluster and remove v12and v16 from the dataset, as v14 has the least amount of the data missing andcomparatively stronger correlation with the other two vectors.The selection of variables is summarized in Table 2.2.Clusters 1 2 3 4 5 6 7 8Vectors v10,v9,v20 v12,v14,v16 v41,v48,v43 v1,v15,v42 v21,v22,v23 v35, v31 v10,v9,v17 v4,v44Selected v10 v14 v43 v1 v23 v31 v10 v4Table 2.2: Variable selection by clustersAfter variable selection, 36 non-lagged variables remain from the 50 variables.Together with the corresponding lag-1 and lag-2 variables, there are in total 108explanatory variables.To further remove collinearity within the selected variables and also to get uni-modal and centered variables, we transform the variables in the following ways:• Some variables are scaled.• Transform variables that are non-negative and skewed, as well as variablesthat’s heavy-tailed with a large minimum and maximum, using the log func-tion and the square-root function.• Transform variables that have big positive or negative values and those thatare heavily skewed in both directions using the sinh-arcsinh function (Jonesand Pewsey (2009)).For variables that the highly skewed in both the positive and negative directions,there is nothing in the regression literature to suggest good transforms. But for lin-ear regression (Gaussian or ordinal), we cannot expect a linear relation to hold withrespect to an explanatory variable when the variable varies by orders of magnitudes9for positive and negative numbers. The sinh-arcsinh could be a useful transformfor such variables to avoid variations by the orders of magnitudes.The above imputation, variable selection and transformation are not necessaryfor the tree-based classification methods, random forest and the two gradient boost-ing methods. For these methods, missing values can be handled by treating allmissing values as a label ‘missing’ without requiring linearity. However, to geta fair comparison, we compare all four methods on the same imputed and trans-formed dataset with the 36 non-lagged variables. This does not affect the classifica-tion accuracy of the tree-based methods, in fact, we notice that the performance ofrandom forest improve after reducing the collinear variables during initial model-fitting.10Chapter 3Classification MethodsIn this Chapter, we provide an overview of the classification methods for ordinalresponse. In section 3.2, we discuss classification trees, which are building blocksfor the two gradient boosting methods and random forest. In Section 3.1, we fixthe notations. In Section 3.3 and 3.4, we discuss the two ordinal methods, ordinalregression and ordinal gradient boosting. These two methods use the orderinginformation of credit rating. The presentation of methods is based on Hastie et al.(2009).3.1 NotationsData are denoted as (xi,yi), i ∈ {1, ...n}, where xi is a p-dimensional numericalcovariates vector xi = (xi1, · · · ,xip) and yi is a rating response variable with valuesin {1, · · · ,K}. (xi,yi) are considered as independent realizations of a random vector(X,Y ).3.2 Classification TreesA tree is an undirected graph in which any two nodes are connected by exactlyone path. The intuition behind a classification tree is to divide the p-dimensionalcovariate space into regions with homogeneous credit rating categories and to rep-resent such division via a tree. A node of a tree is characterized by a splittingvariable and a corresponding numerical value which is used as a splitting point.11v1< 501TRUEv2< 1002TRUE3FALSEFALSEFigure 3.1: A simple classification treeThe nodes from the root of a tree to a terminal node defines the boundary of aterminal region. A rating label is assigned to each region. Figure 3.1 is a simpleclassification tree with 2 nodes that divide the 2-dimensional covariates space into3 rating regions 1, 2 and 3. The first node uses the splitting variable v1 at splittingvalue 50. The second node uses the splitting variable v2 at splitting value 100. Bydefault, when the expression at the node returns TRUE, the vector proceeds downthe left subtree of the node; otherwise, the vector proceeds down the right subtreeof the node. The region corresponding to rating 1 is defined by the left child nodeof the first node, and the regions corresponding to rating 2 and 3 are defined by thefirst node’s right child node and the second node.Consider a tree with M terminal regions. Let Jm be the index of variables forthe mth region Rm, m= 1, · · ·M. Let {X ( j),x( j)} j∈Jm denote the nodes or the pairs ofsplitting variable and splitting point. Since the information in a tree can be encodedonly using the ‘<’ sign or ‘>’ signs, we restrict the tree to only using the ‘<’ sign.Let γm ∈ {1, · · · ,K} denote the class label assigned to region Rm.A tree can be formally expressed asT (x,θ ) =M∑m=1γmI(x ∈ Rm), (3.1)where θ = {Rm,γm}Mm=1 and I is the indicator function. I(x ∈ Rm) = 1 if x ∈ Rmand 0 otherwise.The parameter θ is found by minimizing a loss function measuring the homo-12geneity or node impurity between the regions. In practice, γm is assigned to themodal label of region m, and the optimization problem of finding the optimal re-gion partition is reduced to a top-down recursive partitioning algorithm that addsnew nodes to the tree in a way that minimizes the sum of the loss function over thesubtrees.There are multiple measures of the homogeneity between regions. We use theGini index as it is more amenable to numerical optimization and is more sensitiveto node impurity, according to Chapter 9 of Hastie et al. (2009).Denote the index set of data vectors that reach node j as N j. Then the propor-tion of class k observations in node m ispˆ jk =1|N j| ∑i∈N jI(yi = k), k ∈ {1, ..,K}. (3.2)The Gini index at node j is defined byK∑k=1pˆ jk(1− pˆ jk).The next node is added by seeking the partition that minimizes the sum of Giniindexes of the left and right child of the node. We seek the splitting variable X ( j+1)and the splitting point x( j+1) that solvesminX ( j+1), x( j+1)[ K∑k=1pˆ( j+1)Lk(1− pˆ( j+1)Lk)+K∑k=1pˆ( j+1)Rk(1− pˆ( j+1)Rk)], (3.3)where ( j+ 1)L denotes the left child of node ( j+ 1), and ( j+ 1)R denotes theright child of node ( j+ 1). Nodes are added recursively to existing nodes untila stopping criteria, such as the minimum number of items left in the node or theminimum gain in Gini index by splitting reaching a pre-defined value.Intuitively speaking, since classification trees allow for hyperplane splits formore flexibility, the structure of trees is flexible and can capture non-linearity well.However, one major drawback of using a single classification tree is that this pro-cedure has high variance, meaning that a small change in the data may result ina completely different tree. To alleviate this problem, the two gradient boostingmethods and the random forest method both use a linear combination of trees to13create an average result. Another drawback is that a reasonable prediction intervalis not guaranteed. This issue is addressed further in Section 4.1.4.3.3 Ordinal Regression with Binary SplitsOrdinal regression method is a classical linear classifier with a straight-forwardinterpretation as discussed in Section 4.2.1. It is designed for an ordinal response,thus suitable for our ordinal credit rating classification problem. Ordinal regressionbased on binary splits at each category 2, · · · ,K is used because ordinal regressiondoes not fit well with many explanatory variables when the vector of non-interceptregression coefficients must be constant over the different cut points for the ordinalcategories.This method solves (K−1) binary classification problems and estimates P(Y <k|x), k∈ {2, ..,K}, the probability of having a rating less than k given the covariatesx.Each binary classification problem models the log-odds of having a credit rat-ing better than k using a linear function of the covariatesfk(x) := log(P(Y < k|X = x)P(Y ≥ k|X = x) = βTk x+β0k, (3.4)where the vector β k = ( βk1, · · · ,βklk) are the coefficients for the kth binary classi-fication problem.Let Z be a logistic random variable with distribution functionFZ(z) =11+ e−z=ez1+ ez,−∞< z< ∞.Then{w : Y < k|X = x} ≡ {w : Z < β Tk x+β0k},since from equation (3.4),P(Yi < k|X = x) = 11+ e−βTk x−β0k, (3.5)For ordinal regression, β Tk x+β0k should be increasing in k for all x in the pre-14dictor space. If the predictor space is an entire Euclidean space, then it is necessarythat β k = β and the intercept β0k is increasing in k.If the predictor space is a subset of an Euclidean space, we can attempt anordinal regression fit by performing (K−1) different logistic regressions based onthe binary splits (Y < k) versus (Y ≥ k) for k = 2, · · · ,K.If the estimated βˆ k are nearly constant, then β01, · · · ,β0k and the common β canbe estimated by maximum likelihood. If βˆ k are different, for example, if differentpredictor variables are more important depending on k, then it is simpler to use theresults of the separate logistics regression fits.We do not restrict the predictor variables to be the same for different binarysplits, as a common slope would not capture the variability in the dataset. One dis-advantage of using ordinal regression with K > 2 categories when there are manyexplanatory variables is that non-convergence can arise due to multicollinearity.We try to alleviate this problem through preprocessing the data, and when the prob-lem is encountered, we select a random subset of the predictors of size 30 and usestepwise regression to select the model.Using the order information, no borrower can have a credit rating less than 1or more than K, thusP(Y < 1|X = x) = 0 and P(Yi < K+1|X = x) = 1. (3.6)The probability that a borrower with covariate vector x has credit rating k canbe calculated byP(Y = k|X = x) = P(Y < k+1|X = x)−P(Y < k|X = x), k ∈ {1, · · · ,K}. (3.7)A model is fitted to a training set to get estimates (βˆ k, βˆ0k) for k = 2, · · · ,k.For a member of the holdout set with X = xh, the predicted probability vector is(pˆ1(xh), · · · , pˆK(xh)) wherepˆk(xh) =11+ e−βˆTk+1xh−βˆ0,k+1− 11+ e−βˆTk xh−βˆ0,k.The predicted yˆh is argmaxk pˆk(xh) and an interval estimate of categories can be15obtained if (pˆ1, · · · , pˆK) is a non-negative and unimodal vector.3.4 Ordinal Gradient Boosting and MultinomialGradient BoostingThe setup of the ordinal gradient boosting method can be viewed as an extensionof the ordinal regression methods using additive gradient boosting trees. The dif-ference between ordinal gradient boosting and ordinal regression is that instead ofusing a linear function to solve the K−1 binary classification problems as shownin equation (3.4), ordinal gradient boosting uses additive gradient boosting trees toestimate P(Y < k|x). The class probabilities are calculated using equations (3.6)to (3.7) and the rating class with the highest class probability is chosen, identicalto ordinal regression. Multinomial gradient boosting uses the gradient boostingtree algorithm in similar way as ordinal gradient boosting. One difference betweenthe multinomial boosting method and the ordinal boosting method is that multino-mial gradient boosting treats the K rating classes as unordered classes, and assignsclass probabilities using class proportion instead of taking differences using equa-tions (3.6) to (3.7). For out-of-sample classification, the classification category isargmaxpˆ(xh)The idea behind boosting is to combine ‘weak’ classifiers whose performanceare only slightly better than random guess, for example simple classification trees,to produce a powerful ensemble classifier with arbitrarily high accuracy (Freundand Schapire (1997)). The algorithm sequentially fits the ‘weak’ classifier to re-peatedly modified versions of the data. The data modification at each boosting stepassigns increased weights to the observations that are misclassified at the previousstep, and decreased weights for the observations that are classified correctly. Thusas iterations proceed, observations that are difficult to classify correctly receiveever-increasing weight. Each successive classifier is forced to concentrate on thosetraining observations that are missed by previous ones in the sequence. The outputsfrom successive classifiers are then combined using a weighted majority vote.16Formally speaking, the boosting model is a sum of classifiers:fQ(x) =Q∑q=1β qT (x,θ q), (3.8)where trees defined in (3.1) can be used as the classifier T. For a boosting algorithm,the goal is to findmin{θ q,β q}Qq=1N∑i=1L(yi,Q∑q=1β qT (xi,θ q))with respect to a loss function L over the training set.Gradient boosting uses the a boosting model to target the gradient of the lossfunction.For the K-class classification problem, we use the multinomial deviance resid-ual function as the loss function:L(yi, p1(xi), · · · , pK(xi)) =−K∑k=1I(yi = k) log pk(xi),where the logit probability functionsfk(x) :=log(P(yi = k|X = x)log(P(yi = K|X = x))andpk(x) =e fk(x)∑Kl=1 e fl(x)for k = 1, · · · ,K.The function fk is an extension of the binomial case in equation (3.4). Here weimpose the constraint that ∑Kk=1 fk(x) = 0 to reduce the redundancy in functions fk,for otherwise adding the same g(x) to each fk leaves the pk’s unchanged.The negative gradient is at point xi is−∇L=− ∂L∂ f1(xi)· · ·− ∂L∂ fK(xi)= I(yi = 1)− p1(xi)· · ·I(yi = K)− pK(xi) . (3.9)In ordinal gradient boosting, K = 2 for each of the (K− 1) binary classification17problem; in multinomial gradient boosting, K equals the total number of ratingclasses.Successive classifiers, T (x,θ q)) in equation (3.8), are obtained using forwardstage-wise additive modeling. At the step q, the weight β q and the classifierT (x;θ q) are obtained by minimizing the total loss after the (q− 1) step, and fis updated iteratively.(β q,θ q) = argminβ ,θn∑i=1L(yi,q−1 (xi)+β T (xi;θ )) (3.10)fq(x) = fq−1(x)+β qT (x;θ q)From a numerical optimization perspective, gradient boosting is similar to thesteepest descent method, as the terminal region of negative gradient is targeted ateach iteration of (3.9) using the boosting methods with regression tree as T (x;γq).3.5 Random ForestWhile both random forest and boosting methods are ensemble methods built on acollection of trees, logics behind them are fundamentally different. Each tree inrandom forest is trained using a random selection of m < p covariates out of thetotal p covariates on a bootstrap sample of the training set. As mentioned in section3.2, a single classification tree often has low bias but high variance, which meansthat they benefit from averaging. A drawback of averaging the trees built using all pcovariates on bootstrapped samples is that the size of the correlation between treeslimits the reduction in variance (Exercise 15.1 in Hastie et al. (2009)). By usinga random selection of m covariates, trees in random forest have less correlationcompared to trees built using all p covariates. Intuitively, reducing m further willreduce the correlation of any two trees in random forest further. Thus, randomforest reduces the variance of a single classification tree by averaging a collectionof noisy and approximately unbiased trees with reduced correlations.For classification, random forest classifies to the majority class, or the classwith the highest proportion of votes by trees in the forest. Class probabilities canbe produced using the proportion of votes in the forest for each class. However,there is no inherent way to incorporate the order information of the classes, thus we18treat each credit rating as an unordered class and forgo the ordering information.3.6 Conditional Class ProbabilitiesLet p(k|x) =P(Y = k|X= x) represent the conditional probability ofY taking labelk given covariates X = x. Let pˆ(k|x) be the model-based estimates of p(k|x).For a fixed k and varying x, ordinal regression calculates pˆ(k|x) using a func-tion that is the difference of two transformations of linear functions (equation (3.6)and (3.7)). Random forests and gradient boosting calculate pˆ(k|x) using a non-linear function created using linear combination of piecewise constant functions.When p(k|x) is a conditional probability mass function over k = 1, · · · ,K withx fixed, ordinal regression method and ordinal gradient boosting both calculatepˆ(k|x) by taking difference of the cumulative distribution function using equation(3.7). The cumulative function for ordinal regression is a transformation of a linearfunction (equation (3.6)), whereas the cumulative function for the ordinal gradientboosting is a non-linear function created using a linear combination of trees whereeach tree is a sum of piece-wise linear functions. Because there’s no numericalconstraints when taking the difference using equation (3.7), the two ordinal meth-ods sometimes lead to negative pˆ(k|x). The method can be considered acceptableif out-of-sample pˆ(k|x) are seldom negative, and if negative, they are close to 0 andin the extremes of the predictor space.19Chapter 4Comparison of MethodsAt its core, the practice of assigning credits is concerned with identifying importantvariables, or a combination of characteristics that can accurately distinguish goodrisk from bad risk. Therefore, the ability to interpret a model is just as importantas the classification accuracy of the model. In this Chapter, we compare the fourmethods:1. ordinal regression on binary splits2. ordinal gradient boosting3. multinomial gradient boosting4. random forestin terms of classification accuracy (Section 4.1) and interpretability (Section 4.2),then we focus on the interpretation of the random forest model (Section 4.3).4.1 Classification AccuracyWe compare the classification accuracy of the four methods in the proportion of rat-ings matched perfectly (the perfect-match rate), the proportion of ratings classifiedto within one class of the correct rating (the within-one-class rate), and predic-tion intervals based on pˆ(k|x). In Section 4.1.1 and 4.1.2, we outline the setupof cross-validations and parameter tuning. In Section 4.1.3 and Section 4.1.4, we20compare the accuracy of the point and interval estimates for classification. Thebest-performing methods are random forest and ordinal gradient boosting.4.1.1 Cross-validationWe partition the dataset into a training set where the models are trained, and a testset where we measure the classification accuracy. We randomly withhold observa-tions on one-fifth of the companies as the test set, and use rest of the data vectorsas the training set. This roughly corresponds to a five-fold cross validation with thenumerical summaries of the training set sizes provided in Table 4.1.Min 1st Quantile Median Mean 3rd Quantile Max7930 7986 8006 8005 8024 8062Table 4.1: Numerical summary of training set sizes in 100 cross validationsTo minimize the fluctuation in classification accuracy due to the random datapartitions, we cross-validate 100 times. During each cross-validation, the methodsare applied to the same partition of training and test sets.4.1.2 Parameter TuningFor random forest, we use the default√p number of random predictors used ineach tree as suggested by Liaw and Wiener (2002). We tune the number of treesused in random forest, and grow 500 trees in the forest as there is minimal changein mean classification accuracy by increasing the total number of trees past 500.For the gradient boosting methods, we tuned the shrinkage parameter, whichcontrols the speed of learning, and used the value 0.25.For ordinal regression we selected variables using stepwise selection by AIC.4.1.3 Accuracy of Point EstimateBecause classification performance of the methods may vary significantly in differ-ent partitioning of training and test sets, we report the mean classification accuracyrates of the 100 test sets in Table 4.2. The spread of the accuracy rates is reportedin Figure 4.1, where the perfect-match rate is in light gray and the within-one-classrate is in dark gray.21Methods Perfect Match Within One ClassOrdinal Regression (ord) 32% 77%Ordinal Gradient Boosting (ogb) 38% 80%Multinomial Gradient Boosting (mgb) 33% 75%Random Forest (rf) 42% 83%Table 4.2: Mean classification accuracy over 100 cross validationsFigure 4.1: Box plot of classification accuracy over 100 cross validations.ord: ordinal regression; ogb: ordinal gradient boosting; mgb: multino-mial gradient boosting; rf: random forest.From Table 4.2 and Figure 4.1, we can see that random forest achieves the high-est mean classification accuracy, followed by ordinal gradient boosting, multino-mial gradient boosting and then ordinal regression. The spread of the perfect-matchrates and the within-one-class rates are small, indicating that the performances ofthe four methods are stable across different partitions of the dataset.22Methods length=1 length≤2 length ≤ 3 length ≤4ord 1% 1% 21% 82%ogb 0 18% 75% 80%mgb 0 0 15% 74%rf 1% 9 % 74% 92%Table 4.3: 80% prediction intervals by length. ord: ordinal regression; ogb:ordinal gradient boosting; mgb: multinomial gradient boosting; rf: ran-dom forest.4.1.4 Point and Interval Estimates of RatingsIn addition to the comparisons in the preceding subsections, we also look at thefitted conditional probability mass function pˆ(k|x) over k = 1, ...,K with x heldfixed.When the pˆ(k|x) is concentrated at one rating k, then a point estimate using themodal rating k is reasonable. However, when the probability mass function is flat,the probability difference between the modal rating and the rating with the secondor third highest probability may be small. In situations like this, a prediction in-terval provides more information. Assuming that pˆ(k|x) is reasonably unimodal(Section 4.2.3), we define the prediction interval as the shortest rating interval thatcontains the modal rating and ratings near the mode such that the total probabilityof the interval exceeds a threshold, for example 80%. A longer prediction intervalmeans that there is more uncertainty at this x and the probability mass function isflat over the rating categories. A short prediction interval means that the point esti-mate using the mode rating can adequately capture the risk profile of the company.We create 80% prediction intervals for the rating categories using the four meth-ods, and compare the length of the prediction intervals in Table 4.3. The heading ofTable 4.3 indicate the length criteria and the content of the table are the proportionof prediction intervals following the length criteria by method. For example, 1%of the ordinal regression prediction interval is of length 1, and 21% of the ordi-nal regression prediction intervals are of length less than or equal to 3. Randomforest and ordinal gradient boosting performs the best in terms of achieving shortprediction intervals.Table 4.4 provides the proportion of fitted probability vectors with modal value23Methods ord ogb mgb rfProportion 2% 24% 2% 15%Table 4.4: Proportion of probability vectors with modal probabilities largerthan 50%. ord: ordinal regression; ogb: ordinal gradient boosting; mgb:multinomial gradient boosting; rf: random forest.exceeding 50%. Because negative fitted probabilities occur while fitting the two or-dinal methods (Section 4.2.2), we exclude the fitted probability vectors that containnegative fitted values smaller than −0.5. As higher modal values may be attainedto compensate for the negative fitted probability values. The proportion of fittedprobabilities vectors with mode exceeding 50% is small, which corresponds to thesmall proportion of short prediction intervals (with length less than or equal to 2)in Table 4.3.From Tables 4.4 and 4.3, we can see that ordinal gradient boosting method andrandom forest produces the highest proportion of fitted probability vectors withmodal probability exceeding 0.5, and the highest proportion of short predictionintervals. Ordinal regression and multinomial gradient boosting more often pro-duce flatter probability mass functions, accompanied by smaller modal probabilityvalues and longer prediction intervals.Based on the comparisons of accuracy rates and prediction intervals, the bestperforming-method are random forest and ordinal gradient boosting. For accuracyrates, random forest performs the best in both perfect-match and within-one-classcriteria, followed by ordinal gradient boosting. For prediction intervals, randomforest produces higher proportions of prediction intervals with length less than (in-cluding) 1 and length less than (including) 4, but ordinal gradient boosting pro-duces higher proportions of prediction intervals with length less than (including)2 and 3. In our comparison, the modal probabilities produced by ordinal gradi-ent boosting are higher compared to random forest. However there are additionalnegative values, those larger than -0.5, in the ordinal gradient boosting probabilityvector, thus the difference in modal probabilities between ordinal gradient boostingand random forest should be interpreted with caution.244.2 InterpretabilitySince the rating variable is of an ordinal nature, it is more reasonable to utilize theorder information during classification. However, after model fitting, we discoverthat the ordinal regression and ordinal gradient boosting methods both suffer fromhigh proportion of negative fitted probabilities which we discuss in Section 4.2.1and 4.2.2. Because of the difficulty in interpreting negative fitted probabilities,we decide that out of the two best performing methods, random forest is moreinterpretable.Aside from ordinal regression, all other three classification methods use linearcombination of trees. While a single decision tree can be easily visualized and in-terpreted as a sequence of binary decisions, an ensemble of trees becomes a black-box which requires new tools to interpret. Thus for the random forest method, welook at the important variables and focus on the unimodality of the fitted probabil-ity mass functions and the partial derivative of modal probabilities and cumulativeprobabilities. While we have looked the standard visualization tools such as thepartial dependence plots and individual conditional expectation plots, we feel thatthese plots are not informative on our dataset and omit them. We include the LIMEplots in Section 4.3.1 because it can be a useful tool to explain the rating decisionsfor individual customers.4.2.1 Ordinal MethodsOf the two ordinal methods, ordinal regression is more interpretable with a linearstructure as shown in equation (3.4). The model can be interpreted as a unit ofincrease in xil changes the average log odds of being less than rating category kincrease by βkl . However, the classification performance of ordinal regression ispoor thus we do not focus on its interpretation. Ordinal gradient boosting methodmodels the log odds of being less than a rating category using the additive treestructure. Thus ordinal gradient boosting faces the same interpretation hurdle asrandom forest: the difficulty to assess the change in class probabilities associatedwith a change in the predictors.Across cross-validation runs, both ordinal methods suffer from high propor-tion of negative class probabilities (Section 4.2.2), since there is no numerical con-25straints when using equation (3.7). The negative fitted probabilities cannot be inter-preted. Thus even though it is more reasonable to incorporates the ordinal natureof the rating data, random forest is still preferable as it maintains positive fittedprobability values and achieves high classification accuracy.4.2.2 Negative Fitted Probabilities in Ordinal MethodsOver 100 cross-validations, negative values occurs in 43% of the training set fit-ted probability vectors when using the ordinal regression method, and 28% whenusing the ordinal gradient boosting method with threshold for negative at -0.03.Table 4.5 provides the numerical summaries of the negative pˆ(k|x) that occurredduring the cross-validation runs. The magnitude of the negative values far exceedswhat one can reasonably ignore. Although ordinal gradient boosting achieves highclassification accuracy in Section 4.1, the negative probabilities create a seriousinterpretation hurdle.Ordinal RegressionMin 1st Quantile Median Mean 3rd Quantile-1.0 -1.0 -1.0 -0.81 -0.03Ordinal Gradient BoostingMin 1st Quantile Median Mean 3rd Quantile-0.96 -0.31 -0.16 -0.23 -0.07Table 4.5: Numerical summary of negative pˆ in 100 cross-validation runs4.2.3 Unimodality of the Probability Mass FunctionIn Section 4.1.4, we introduced the concept of 80% prediction intervals. To pro-duce a reasonable interval estimate, one crucial requirement is that pˆ(k|x) needsto be unimodal. This is so that ratings close to the modal rating can be includedin the interval estimate or taken as alternatives. If there is a rating gap in the mostlikely and second most likely ratings due to non-unimodality, then it is impossibleto explain the jump in a business context. In Table 4.6, we compare the proportionof unimodal pˆ(k|x) by methods at different tolerance levels. A tolerance level isthe small jump size violating exact unimodality in the probability mass function26that can be ignored. A probability vector does not need to be strictly unimodal topass the tolerance threshold. For example, if the fitted probability vector is(0.02,0.04,0.02,0.05,0.08,0.10,0.40,0.13,0.10,0.05,0.01).Strictly speaking there are two local maximum at at 0.04 and 0.40, but becausethe jump size for the first local maximum is below the tolerance level of 0.03, weconsider the probability vector to be unimodal. The pseudo code for checking theunimodality is provided in Algorithm 1.Algorithm 1 Check unimodality of probability vector. tolerance = 0 for exactunimodality, tolerance> 0 for approximate unimodality1: procedure CHECKUNIMODALITY(p, tolerance). input probability vector p and a tolerance value between 0 to 12: nn← length(p)3: ii← index of mode of p; choose the first index in case of tie.4: iok← TRUE5: if ii< nn then6: for j← (ii+1) : nn do7: if p[ j]− p[ j−1]≥ tolerance then iok = FALSE8: end if9: end for10: end if11: if ii> 1 then12: for j← 1 : (ii−1) do13: if p[ j]− p[ j+1]≥ tolerance then iok = FALSE14: end if15: end for16: end if17: return iok18: end procedureEven though the ordering information is ignored, random forest still achieveshigh proportion of unimodality. In particular it exceeds the ordinal gradient boost-ing by more than 14% under both tolerance levels. Thus we eliminate ordinalgradient boosting method due to its lower proportion of unimodal probability vec-tors and the high proportion of negative fitted probabilities mentioned in Section274.2.2. We mainly focus on the interpretability of the random forest model in theremaining sections.Tolerance ord ogb mgb rf0.02 79% 64% 82% 81%0.03 84% 70% 86% 87%Table 4.6: Proportion of unimodal predicted probability mass functions. ord:ordinal regression; ogb: ordinal gradient boosting; mgb: multinomialgradient boosting; rf: random forest.4.2.4 Examples of Predicted Rating Probabilities as a Function ofPredictorsThe matrix below provide the pˆ(k|x) by methods for four observations in thedataset. The modal probabilities and the true ratings are in bold. In the first twomatrix, the four methods produce similar pˆ(k|x) which are all unimodal and havethe same mode. In the last two matrix, the four methods produce different results.In the matrix for row 3061, the pˆ(k|x) contains a negative fitted value of -0.29 andis bi-modal. In the matrix for row 3762, there is no significant negative values, butthe pˆ(k|x) produced by the two ordinal methods are both bimodal and the modes ofthe four methods are different. Over the covariate space, the four methods turn todiverge more near the extreme ends of the covariate space, and agree more closerto the center of the covariate space.Row 18731 2 3 4 5 6 7 8 9 10 110.00 0.00 0.01 0.03 0.22 0.39 0.25 0.06 0.02 0.00 0.000.02 0.03 0.00 0.06 0.23 0.32 0.20 0.08 0.05 0.00 0.010.01 0.01 0.01 0.05 0.23 0.32 0.23 0.08 0.06 0.01 0.000.01 0.00 0.02 0.07 0.22 0.36 0.21 0.07 0.02 0.00 0.00ogbordmgbr f28Row 14641 2 3 4 5 6 7 8 9 10 110.00 0 0.01 0.00 0.11 0.31 0.39 0.13 0.04 0.00 0.000.01 0 0.00 0.03 0.14 0.29 0.33 0.12 0.07 0.01 0.010.00 0 0.01 0.03 0.15 0.30 0.37 0.12 0.01 0.01 0.000.00 0 0.01 0.00 0.07 0.30 0.36 0.16 0.07 0.03 0.00ogbordmgbr fRow 30611 2 3 4 5 6 7 8 9 10 110.51 −0.04 0.28 0.08 −0.29 0.29 0.09 0.06 0.00 0.00 0.010.02 0.10 0.14 0.20 0.17 0.13 0.19 0.03 0.02 0.00 0.000.00 0.00 0.01 0.03 0.07 0.12 0.17 0.19 0.34 0.07 0.000.08 0.05 0.14 0.30 0.21 0.08 0.07 0.04 0.02 0.01 0.00ogbordmgbr fRow 37621 2 3 4 5 6 7 8 9 10 110.08 0.45 0.06 0.31 0.08 −0.01 0.03 0.00 0.00 0.00 0.000.02 0.04 0.15 0.07 0.40 0.20 0.09 0.02 0.00 0.00 0.000.00 0.00 0.00 0.01 0.04 0.08 0.33 0.30 0.18 0.04 0.010.13 0.02 0.15 0.09 0.08 0.06 0.20 0.07 0.10 0.08 0.01ogbordmgbr f4.3 Interpreting Random ForestIn this section, we focus on the interpretation of the random forest model fromthe important variables, local interpretable model-agnostic explanation plots andthe partial derivatives of the modal probability and the cumulative probability withrespect to the important variables.294.3.1 Variable ImportanceFor random forest, relative importance proposed by Breiman et al. is used to assessvariable importance. For a single classification tree T , the relative importance forpredictor Xl is a weighted average of the decrease in node impurity (Gini index) atnodes that uses Xl . Preserving the notation of equation (3.2), we define the decreasein Gini index at node m using predictor X j as the splitting variable as∆(m, j) =K∑k=1pˆmk(1− pˆmk)−[ K∑k=1pˆmLk(1− pˆmLk)+K∑k=1pˆmRk(1− pˆmRk)].pˆmk =∑i∈NmL I(yi = k)∑i∈Nm I(yi = k)pˆmLk+∑i∈NmR I(yi = k)∑i∈Nm I(yi = k)pˆmRk for a split to left (mL) and rightbranches (mR) at a node m. ∆(m, j) is larger if pˆmLk and pˆmRk are more different.The relative importance of variable Xl where the internal nodes that uses Xl as asplitting variable is indexed by set Tl is defined asImp(Xl) = ∑m∈TlNtN∆(m, l),whereNtNrepresents the proportion of samples reaching node t. For an ensemble ofclassification tree, the relative importance of variable Xl is a simple average of therelative importance value in each tree. For the tree-based method, while variableimportance measure provide a sense of which variable is more important, it doesn’tprovide a direct sense of how the classification results change with the variables.Nevertheless, variable importance measure is a good tool to select the importantvariables. Figure 4.3.1 visualizes the most importance variables for the randomforest method.4.3.2 Local Interpretable Model-agnostic ExplanationsLocal interpretable model-agnostic explanations (lime) is a tool that use a linearrelationship to explain the local region of close to an observation. Lime is model-agnostic meaning that it can be used for any black-box model.The general steps of lime is as follows (Ribeiro et al., 2016):30Figure 4.2: Variable importance plot of random forest. A variable that ismore important has higher mean decrease in Gini index on the x-axis.1. Given an observation, permute it to create replicated feature data with slightvalue modifications.2. Then similarity distance measure between original observation and permutedobservations are computed3. Apply selected black-box model to predict outcomes of permuted data.4. Select m number of features to best describe predicted outcomes.5. Fit a simple model to the permuted data, explaining the complex model out-come with m features from the permuted data weighted by its similarity tothe original observation.On Figure 4.3, we provide the lime plot for the rating classification under therandom forest model for a customer in a quarter. The explanation fit is the R2 ofthe linear fit. The red bar indicates that in the local region, the variable conditioncauses decrease in the predicted probabilities, and the green bar indicates that thecondition cause an increase in the predicted probabilities. In particular, we observethat when the transformed v5 is smaller than a value close to 0, class probabilities31Figure 4.3: Lime plot of case 5306decrease for class 6, 7, and 8, and increase for class 1 to 5. This can be used toexplain to individual customers the top features that contributes to the credit ratingthat they’ve received.4.3.3 Partial Derivatives PlotSuppose that the modal probability occurs in class r = r(x), the partial derivativeof the modal probability at class r with respect to covariate xl on the original (un-transformed) data at x = (x1,x2, · · · ,xl, · · · ,xp) is∂ pr(x)∂xl= limh→0pr(x1,x2, · · · ,xl+h, · · · ,xp)− pr(x1,x2, · · · ,xl, · · · ,xp)h.Similarly, the partial derivative of the cumulative probability up to and includ-ing class r is∂ ∑rk=1 pk∂xl= limh→0∑rk=1 pk(x1,x2, · · · ,xl+h, · · · ,xp)−∑rk=1 pk(x1,x2, · · · ,xl, · · · ,xp)h.32The partial derivative ∂ pr∂xl represents the magnitude and direction of change inthe modal probability associated with a change of size h in covariate xl . If∂ pr∂xlis close to 0, then a small increase at x in the direction of xl does not affect themodal rating probability pr, meaning that the modal probability function is at alocal minimum or a local maximum on xl . If∂ pr∂xlis positive, then pr will increase ifxl increases, resulting in higher modal probability and possibly a shorter predictioninterval. If ∂ pr∂xl is negative, then pr will decrease if xl increase by a small amount,resulting in lower modal probability with some shift to the right or the left of ratingr.The partial derivative ∂ ∑rk=1 pk∂xlcan be used to further understand if the proba-bilities are shifting toward smaller rating classes (lower credit quality) or higherrating classes (higher credit quality). ∂ ∑rk=1 pk∂xl< 0 implies that as xl increases, thereis a shift of probabilities to ratings above r. This might be expected if rating ismarginally positively associated with xl .∂ ∑rk=1 pk∂xl> 0, then the cumulative proba-bility of belonging to rating region lower than r increases when xl increases by h.The magnitude | ∂ ∑rk=1 pk∂xl | represents the size of associated probability change.In the p-dimensional covariate space, the partial derivative may be 0 in mostdirections. As such, the gradient ∇ = ( ∂ pr∂x1 , · · · ,∂ pr∂xp ) may not be informative. Wefocus on a few covariate directions. Using the data points in the training dataset,we calculate the partial ∂ pr∂xl and∂ ∑rk=1 pk∂xlvalues under the random forest modelfocusing on the top 5 important variables identified in Section 4.3.1.In Figure 4.4 to Figure 4.17, the covariates are divided into 10 regions withequal number of observed data points. The labels on the x-axis indicate the medianof the covariate values in the region. The corresponding box plot indicate therange of values that ∂ pr∂xl and∂ ∑rk=1 pk∂xltake when the covariate takes a step size ofh as indicated in the title of the plot. As the random forest model is built on thetransformed data scale, a small change in the transformed scale often correspondto a larger change on the original data scale.In Figure 4.4 and 4.5, ∂ pr∂xl and∂ ∑rk=1 pk∂xlwith respect to v5 are mostly negativeand increasing with respect to v5. Thus when v5 increases, the modal probabilitydecreases and then flattens out. However since the median covariate values indi-cated in the x-axis titles are close to 0, the derivatives at step size h = 5 are mostinformative when the v5 value are in the smallest region. Table 4.8 provide numer-33Figure 4.4: Partial derivative ∂ pr∂xl with respect to v5 over the training datasetunder the random forest model. The x-axis labels are median covariatevalues within each box.ical summaries of v5 values in the leftmost box and the rightmost box. Thus whenv5 value negative and much smaller than 0, an increase in v5 will significantly de-crease the modal probability and the cumulative probability of belonging to a lowcredit rating region.Min 1st Quantile Median Mean 3rd Quantile Max-516.00 -0.20 -0.12 -0.84 -0.01 0Table 4.7: Numerical summary of v5 values in the leftmost box regionMin 1st Quantile Median Mean 3rd Quantile Max0.27 0.33 0.43 0.69 0.65 26.65Table 4.8: Numerical summary of v5 values in the rightmost box regionIn Figure 4.6 and 4.7, ∂ pr∂xl and∂ ∑rk=1 pk∂xlwith respect to v22 are also negativeand increasing with respect to v22. Here the median covariate values indicated in34Figure 4.5: Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v5 over the trainingdataset under the random forest model. The x-axis labels are mediancovariate values within each box.the x-axis titles are further away from 0 compared to that of v5. Thus over theentire v22 region, an increase in v22 will decrease the modal probability and thecumulative probability of belonging to a low credit rating region. The magnitudeof the decrease also decreases when v22 increases, so for larger v22 values the twoprobabilities are flat.For v6, v7 and v17, we observe similar mostly negative partial derivatives withdecreasing magnitude in Figure 4.8 to Figure 4.13. Increasing these covariate val-ues decreases the modal probability and the decrease flattens out; in addition, theprobability mass shifts towards higher rating regions. Thus the rating variables ispositively correlated with these 5 variables. On a marginal scale, this is confirmedby the positive Spearman correlation values of the top 5 important variables versusthe rating variable (taken as a numerical variable) is in Table 4.9. Table 4.10 pro-vides the range of the partial derivatives for the 5 variables. For some explanatory35Figure 4.6: Partial derivative ∂ pr∂xl with respect to v22 over the training datasetunder the random forest model. The x-axis labels are median covariatevalues within each box.variables, the local dependence is not always positively related with rating as it candepend on values of other predictors.Variable v5 v22 v6 v10 v17 v23Spearman Correlation 0.41 0.41 0.36 0.3 0.22 -0.15Table 4.9: Spearman correlation of top 5 important variables and v23 versusratingIn Figure 4.8 and 4.9, ∂ pr∂xl and∂ ∑rk=1 pk∂xlare negative with decreasing magnitudewhen v6 values increase. The decrease in the magnitude of the median partialderivative values are smooth.In Figure 4.10 and 4.11, the repeated median value 3.73 is due to imputation.The missing values in v10 are imputed using the median covariate value in Section2.3. The partial derivative values with respect to v10 are stable until v10 is bigger36Figure 4.7: Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v22 over the trainingdataset under the random forest model. The x-axis labels are mediancovariate values within each box.Variable v5 v22 v6 v10 v17 v23∂ pr∂xl[-0.082, 0.002] [-0.060, 0.002] [-0.054, 0.002] [-0.037, 0.002] [-0.063, 0.003] [-0.003,0]∂ ∑rk=1 pk∂xl[-0.081, 0.020] [-0.058, 0.023] [-0.053, 0.016] [-0.038, 0.12] [-0.060, 0.018] [-0.002,0.002]Table 4.10: Partial derivative range for top 5 important variables and v23.The indicated values are the minimum and maximum.than 3.73.In Figure 4.12 and 4.13, the partial derivatives are mostly negative. The mag-nitude of the partial derivative increase and the decrease smoothly when v17 valueincreases.In Figure 4.14 and 4.15, the partial derivative values have more outliers in eachboxplot. The magnitude of the negative partial derivative increase slightly and thendecrease to 0 when v22 value increases.In Figure 4.16 and 4.17, we provide the partial derivatives with respect a less37Figure 4.8: Partial derivative ∂ pr∂xl with respect to v6 over the training datasetunder the random forest model. The x-axis labels are median covariatevalues within each box.important covariate v23. The h size needs to be much larger in order for us toobserved non-zero ∂ pr∂xl . The magnitude of∂ pr∂xlare also much closer to 0 comparedto the 5 important variables. The median ∂ ∑rk=1 pk∂xlvalues are all very close to 0.Although v23 and the rating variable is negatively correlated on a marginal scale(Table 4.9), the partial derivative of the cumulative probability jumps between neg-ative and positive values very close to 0. This is an example that for a less importantvariable the changes in probabilities are small when the variable values change.38Figure 4.9: Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v6 over the trainingdataset under the random forest model. The x-axis labels are mediancovariate values within each box.39Figure 4.10: Partial derivative ∂ pr∂xl with respect to v10 over the trainingdataset under the random forest model. The x-axis labels are mediancovariate values within each box.40Figure 4.11: Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v10 over the trainingdataset under the random forest model. The x-axis labels are mediancovariate values within each box.41Figure 4.12: Partial derivative ∂ pr∂xl with respect to v17 over the trainingdataset under the random forest model. The x-axis labels are mediancovariate values within each box.42Figure 4.13: Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v17 over the trainingdataset under the random forest model. The x-axis labels are mediancovariate values within each box.43Figure 4.14: Partial derivative ∂ pr∂xl with respect to v7 over the training datasetunder the random forest model. The x-axis labels are median covariatevalues within each box.44Figure 4.15: Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v7 over the trainingdataset under the random forest model. The x-axis labels are mediancovariate values within each box.45Figure 4.16: Partial derivative ∂ pr∂xl with respect to v23 over the trainingdataset under the random forest model. The x-axis labels are mediancovariate values within each box.Figure 4.17: Partial derivative ∂ ∑rk=1 pk∂xlwith respect to v23 over the trainingdataset under the random forest model. The x-axis labels are mediancovariate values within each box.46Chapter 5SummaryThe main interest of this thesis is to compare four statistical classification methods:1. ordinal regression on binary splits2. ordinal gradient boosting3. multinomial gradient boosting4. random foreston a credit rating dataset.The comparison of classification accuracy is measured using the accuracy ofpoint and interval estimate in perfect-match rate, within-one-class rate and 80%prediction intervals. We feel that it is important to look at the fitted probability massfunctions and prediction intervals because a point estimate may not convey enoughinformation about the risk profile of the underlying companies. Random forest andordinal gradient boosting achieve the best classification performance using thesecriteria. Because ordinal gradient boosting method produces significant proportionof negative fitted class probabilities and high proportion of non-unimodal proba-bility vectors, we choose random forest as the preferred method and focus on itsinterpretation.We then interpret the random forest model using variable importance measures,local model-agnostic explanation tools, the fitted probability mass functions and47partial derivatives of the modal probability and cumulative probability with respectto the important variables.One direction of extension is to investigate ordinal random forest methods sothat the rating information is also incorporated Janitza et al. (2016).Another direction of extension on methodology is to fit a vine copula on thedataset. This method models the the marginal distribution, and the dependencestructure separately. Once fitted, we can fully specify the multivariate distribution.If we can achieve comparable classification using the vine approach, then the vinemodel would surpass random forest as the more preferable model due its highinterpretability Chang and Joe (2018).48BibliographyB. Chang and H. Joe. Prediction based on conditional distributions of vinecopulas. 2018. → pages 48Y. Freund and R. E. Schapire. A decision-theoretic generalization of onlinelearning and an application to boosting. Journal of Computer and SystemSciences, 55(1):119–139, Aug 1997. → pages 16D. J. Hand and W. E. Henley. Statistical classification methods in consumer creditscoring: a review. Journal of the Royal Statistical Society: Series A (Statisticsin Society), 160(3):523–541, 1997. → pages 3T. Hastie, R. Tibshirani, and J. Friedman. The elements of stasticial learning,2009. → pages 11, 13, 18S. Hue, C. Hurlin, and S. Tokpavi. Machine learning for credit scoring: Improvinglogistic regression with non linear decision tree effects. 2017. → pages 3R.-C. Hwang, K. Cheng, and C.-F. Lee. On multiple-class prediction of issuercredit ratings. Applied Stochastic Models in Business and Industry, 25(5):535–550, 2009. → pages 4R.-C. Hwang, H. Chung, and C. Chu. Predicting issuer credit ratings using asemiparametric method. Journal of Empirical Finance, 17(1):120–137, 2010.→ pages 1, 4S. Janitza, G. Tutz, and A.-L. Boulesteix. Random forest for ordinal responses:Prediction and variable selection. Computational Statistics & Data Analysis,96:57 – 73, 2016. → pages 48M. C. Jones and A. Pewsey. Sinh-arcsinh distributions. Biometrika, 96(4):761–780, 2009. → pages 9A. Liaw and M. Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002. → pages 7, 2149A. Prinzie and D. Van den Poel. Random forests for multiclass classification:Random multinomial logit. Expert systems with Applications, 34(3):1721–1732, 2008. → pages 4M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust you?: Explaining thepredictions of any classifier. CoRR, abs/1602.04938, 2016. → pages 30Standard and Poor. S&P global ratings definitions, 2016. → pages 650
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Statistical and machine learning classification methods...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Statistical and machine learning classification methods for credit ratings Hu, Xixi 2018
pdf
Page Metadata
Item Metadata
Title | Statistical and machine learning classification methods for credit ratings |
Creator |
Hu, Xixi |
Publisher | University of British Columbia |
Date Issued | 2018 |
Description | Credit rating is an ordinal categorical label that serves as an important measure of a financial institution’s credit worthiness. It is frequently used to decide whether or not to grant loans as well as how much interest to charge. Companies with higher credit ratings often enjoy lower interest rate and more flexibility in obtaining loans. Due to the increased competition in the lending market, there is renewed interest in the business community in applying statistical and machine learning methods to assign credit ratings. The challenge of adapting and generalizing these methods often lies in understanding and interpreting them in addition to matching ratings accurately. Our goal is to compare the classification performance and interpretability of four statistical learning methods on a credit rating dataset from the industry, where the rating variable comes from human expert opinions. We fit the ordinal regression, ordinal gradient boosting, multinomial gradient boosting and random forest methods with the goal of finding an interpretable method that can replicate the human expert ratings as closely as possible. We find that while the linear ordinal regression is the most interpretable, it fails to achieve high classification accuracy during cross-validation. Furthermore, the ordinal models (ordinal regression and ordinal gradient boosting) produce significant amount of negative fitted probabilities in practice due to the lack of numerical constraints. While ordinal gradient boosting and random forest perform the best in our three measures of classification accuracy: perfect match rate, within one-class match rate and 80% prediction intervals, ordinal gradient boosting produces high proportions of negative values and non-unimodality in the fitted probability mass function. Thus we choose random forest as the most preferred method and focus on its interpretation using variable importance ranking, partial derivative of the probability mass function and cumulative probability function, as well as local interpretable model-agnostic explanation plots. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2018-08-21 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0371167 |
URI | http://hdl.handle.net/2429/66863 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2018-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2018_september_hu_xixi.pdf [ 1.97MB ]
- Metadata
- JSON: 24-1.0371167.json
- JSON-LD: 24-1.0371167-ld.json
- RDF/XML (Pretty): 24-1.0371167-rdf.xml
- RDF/JSON: 24-1.0371167-rdf.json
- Turtle: 24-1.0371167-turtle.txt
- N-Triples: 24-1.0371167-rdf-ntriples.txt
- Original Record: 24-1.0371167-source.json
- Full Text
- 24-1.0371167-fulltext.txt
- Citation
- 24-1.0371167.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0371167/manifest