Modeling Ordinal Data For Recommendation SystembyAnupam SrivastavaB.Tech., Indian Institute of Information Technology, 2012A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University Of British Columbia(Vancouver)September 2014c© Anupam Srivastava, 2014AbstractIn this work we investigate the problem of making personalized recommendationsby creating models for predicting user-item rating, such as in movie recommenda-tions. The study is based on the Movielens data set [16] which has ratings on anordinal scale. In the past, partly due to motivation gained by the Netflix challenge,researchers have constructed models that make point predictions to minimize theroot mean square error (RMSE) on test sets, typically by learning latent user andmovie feature structure. In such models, the difference between ratings of 2 and 3stars is the same as the difference between ratings of 4 and 5 stars, etc., which is astrong prior assumption. We construct probabilistic models which also learn latentuser and movie feature structure but do not make this assumption. These modelsinterpret the ratings as categories (nominal and ordinal) and return a probabilitydistribution over the ratings for each user-movie pair instead of making a pointprediction. We evaluate and compare our models with other models for makingpersonalized recommendations for the top-n task and comparing the precision vsrecall, receiver operating characteristic and cost curves. Our results show that ourordinal data model performs better than a nominal data model, a state-of-the-artpoint prediction model, and other baselines.iiPrefaceThis dissertation is original and independent work by the author, A. Srivastava un-der the supervision of Professor David Poole. The work has been submitted asa research paper for publication in AAAI-2015 under the title “Modeling Ordi-nal Data for Recommendation System” with A. Srivastava as the main author andDavid Poole as the co-author.Ethics approval was not required for this work as the data sets used were pub-licly available[16].iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Collaborative filtering . . . . . . . . . . . . . . . . . . . . . . . . 42.2 Latent factor model . . . . . . . . . . . . . . . . . . . . . . . . . 52.2.1 BellKor model . . . . . . . . . . . . . . . . . . . . . . . 62.2.2 Ordec model . . . . . . . . . . . . . . . . . . . . . . . . 72.3 Evaluation of recommendation systems . . . . . . . . . . . . . . 82.3.1 Precision at n vs recall at n . . . . . . . . . . . . . . . . . 92.3.2 Receiver operating characteristic curve . . . . . . . . . . 102.3.3 Cost curves . . . . . . . . . . . . . . . . . . . . . . . . . 122.3.4 Fraction of concordant pairs . . . . . . . . . . . . . . . . 13iv2.4 Constrained optimization . . . . . . . . . . . . . . . . . . . . . . 153 Proposed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.1 Logistic regression model . . . . . . . . . . . . . . . . . . . . . . 173.2 Cumulative model . . . . . . . . . . . . . . . . . . . . . . . . . . 204 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 264.1 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274.2 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304.3 Experiment 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334.4 Experiment 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . 375.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41Appendix A Additional Plots . . . . . . . . . . . . . . . . . . . . . . . 45A.1 Additional plots for experiment 1 . . . . . . . . . . . . . . . . . . 45A.2 Additional plots for experiment 2 . . . . . . . . . . . . . . . . . . 45A.3 Additional plots for experiment 4 . . . . . . . . . . . . . . . . . . 47vList of TablesTable 1.1 Netflix data set format. Tuple 1 means that user with ID 192gave a rating of 2 to the movie with ID 1021 at t = 14211329. . 2Table 2.1 User consumption history.“Yes” represents that an item has beenconsumed by the user, “No” represents that an item was rejectedby the user whereas “?” represents the item has not been seenby the user . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Table 2.2 Confusion matrix. This helps in understanding the definitionof true positive, false positive, false negative and true negative.For example, a prediction is called false negative if it was pre-dicted to be false by the model but was actually true. . . . . . . 10Table 3.1 Percentage distribution of the ratings in the 1 million Movie-Lens data set. . . . . . . . . . . . . . . . . . . . . . . . . . . 18Table 4.1 Experiment 3: FCP values for the four models on the 1 mil-lion MovieLens data set. A higher value reflects better rankingaccuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33Table 4.2 Experiment 4: FCP values for the four models on the 10 millionMovieLens data set . . . . . . . . . . . . . . . . . . . . . . . 34viList of FiguresFigure 2.1 Latent factor model . . . . . . . . . . . . . . . . . . . . . . . 6Figure 2.2 Precision and recall as we change the size of the recommenda-tion list. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11Figure 2.3 Precision at n vs recall at n . . . . . . . . . . . . . . . . . . . 11Figure 2.4 ROC curve and corresponding set of lines in cost curve and thelower envelope. . . . . . . . . . . . . . . . . . . . . . . . . . 14Figure 3.1 Plot of Zu,i ∼N (hu,i,1) with cutoffs to map the latent trait tocategories/ratings. If Zu,i falls between γu,r−1 and γu,r it getsmapped to category r. . . . . . . . . . . . . . . . . . . . . . 22Figure 4.1 Experiment 1: Precision at n vs recall at n . . . . . . . . . . . 28Figure 4.2 Experiment 1: Precision vs recommendation list size . . . . . 28Figure 4.3 Experiment 1: Recall vs recommendation list size . . . . . . 29Figure 4.4 Experiment 1: Average number of true positives per user vsrecommendation list size . . . . . . . . . . . . . . . . . . . . 29Figure 4.5 Experiment 2: Precision at n vs recall at n . . . . . . . . . . . 31Figure 4.6 Experiment 2: Precision vs recommendation list size . . . . . 31Figure 4.7 Experiment 2: Recall vs recommendation list size . . . . . . . 32Figure 4.8 Experiment 2: Receiver operating characteristic curve . . . . 32Figure 4.9 Experiment 2: Cost curve . . . . . . . . . . . . . . . . . . . . 33Figure 4.10 Experiment 4: Precision at n vs recall at n . . . . . . . . . . . 35Figure 4.11 Experiment 4: Precision vs recommendation list size . . . . . 35Figure 4.12 Experiment 4: Recall vs recommendation list size . . . . . . . 36viiFigure 4.13 Experiment 4: Receiver operating characteristic curve . . . . 36Figure 4.14 Experiment 4: Cost curve . . . . . . . . . . . . . . . . . . . . 36Figure A.1 Experiment 1: Average number of false negative per user vsrecommendation list size . . . . . . . . . . . . . . . . . . . . 46Figure A.2 Experiment 1: Average number of false positive per user vsrecommendation list size . . . . . . . . . . . . . . . . . . . . 46Figure A.3 Experiment 2: Average number of true positive per user vsrecommendation list size . . . . . . . . . . . . . . . . . . . . 47Figure A.4 Experiment 2: Average number of false negative per user vsrecommendation list size . . . . . . . . . . . . . . . . . . . . 48Figure A.5 Experiment 2: Average number of true negative per user vsrecommendation list size . . . . . . . . . . . . . . . . . . . . 48Figure A.6 Experiment 2: Average number of false positive per user vsrecommendation list size . . . . . . . . . . . . . . . . . . . . 49Figure A.7 Experiment 4: Average number of true positive per user vsrecommendation list size . . . . . . . . . . . . . . . . . . . . 50Figure A.8 Experiment 4: Average number of false negative per user vsrecommendation list size . . . . . . . . . . . . . . . . . . . . 50Figure A.9 Experiment 4: Average number of true negative per user vsrecommendation list size . . . . . . . . . . . . . . . . . . . . 51Figure A.10 Experiment 4: Average number of false positive per user vsrecommendation list size . . . . . . . . . . . . . . . . . . . . 51viiiAcknowledgmentsFirst of all, I thank my research supervisor Professor David Poole for his contin-uous support and guidance. This work would not have been possible without hisinvaluable supervision and encouragement.I thank Professor Mark Schmidt for being the second examiner for my thesis.I am grateful to all my fellow researchers at STAR-AI Lab.Finally, I am extremely grateful to my parents and my brother for their contin-uous support and encouragement.ixDedicationI dedicate my work to my parents for their endless love and support.xChapter 1IntroductionIn recent years, the amount of information on the internet has exploded and so hasthe choices available to a consumer. Modern e-commerce websites offer a widearray of products to its users, for example amazon.com offers over 250 millionproducts1. This information overload is overwhelming for the user and makes itdifficult for them to make decisions. The same problem is faced in case of musicand movie streaming websites. In such an environment, it is beneficial for boththe user and website for the website to come up with a personalized list of itemrecommendations which would be consumed by the user with a high probability.This makes the need for a recommendation engine in these websites imperative.In the literature, three types of recommendation systems have been studied ex-tensively. Content based recommendation systems which recommend items basedon user specific features like age, occupation, sex, etc., and item specific featureslike price, age appropriateness, etc., [4, 8, 25]. Collaborative filtering based recom-mendation systems recommend item by studying and identifying patterns in userconsumption history and finding similar user and items by using some similaritynotion which differs from algorithm to algorithm [18, 24, 29]. Often researchersbring these two types of recommendation systems together into the third type calledhybrid recommendation systems where both usage history and user and item fea-tures are used to make recommendations [3, 7].1 The number was obtained by asking amazon to return product list which do not match to anonsense query1User ID Movie ID Rating Time stamp192 1021 2 14211329192 1109 5 14312461184 2013 4 14319122184 1021 4 14321232203 2207 5 14322923Table 1.1: Netflix data set format. Tuple 1 means that user with ID 192 gavea rating of 2 to the movie with ID 1021 at t = 14211329.In October 2006, Netflix made public a 100 million rating data set [6]. The dataset contained tuples like those shown in Table 1.1. Each entry represented the ratinggiven by a specific user to a movie, both recognized by unique IDs. The challengeposted by Netflix was to make predictions on the held out user and movie ID pairsand improve the root mean squared error (RMSE) measure by 10 percent over theirexisting recommendation system. The challenge brought a surge in research in thisfield and many submissions and subsequent publications were made. The winner ofthis challenge was an ensemble method bringing together many different solutions[20]. The latent factor based model proposed in [5] was part of this ensemblemethod and performed very well on the RMSE measure [20]. We will use thismodel as a baseline and it will be referred to as the BellKor solution.Since the challenge was to improve on the root mean square error measure, theproposed recommendation models treated the ratings as numbers [5, 34]. But wedon’t always have the luxury of numerical ratings. If you look at most e-commerce,music and movie streaming websites, user preferences towards items is perceivedthrough rather implicit signals like searching for an item, bookmarking an item,using an item, etc. Its not easy to assign a number to these signals. However, theycan be assigned categories. Moreover we can observe that we can often order thesecategories, for example, bookmarking an item is a stronger signal of user affectiontowards a product compared to just searching for an item.There is no intuitive way of using recommendation systems which interpretratings as numbers for such data sets. In addition, even though treating the movieratings (given that RMSE is to be optimized) as numbers is intuitive, it is misin-2terpretation of the data. Users don’t assign numbers to movies, rather they assigncategories represented by numbers. Treating them as numbers means we are treat-ing the difference between 1 and 2 stars, the same as the difference between 4 and5 stars. This is a strong prior assumption and hinders recommendation qualities asshown by our experimental results.In this work, we propose two probabilistic recommendation models which arebased on collaborative filtering. The models don’t treat the movie ratings as num-bers but as categories. In one model, we treat the categories to be nominal whereasin the second model we treat the categories to be ordinal. Both models are basedon a latent factor model similar to the BellKor solution [5]. However, they differfrom the BellKor model in more ways than just interpretation of input. In the pro-posed ordinal model we learn an additional set of parameters per user which allowsus to learn user rating pattern. The proposed model output a probability distribu-tion over the set of possible ratings instead of just one number as done in BellKormodel. We evaluate these based on the top-n task [19] where the aim is to producen recommendations and evaluate which ones of these the user likes.We discuss collaborative filtering, latent factor models, and other related workin the next chapter. In Chapter 3, we describe our two proposed probabilistic rec-ommendation models. We test our model against the BellKor solution on data setsof varying sizes and probability distribution. We use the MovieLens data set [putreference] for our evaluation, since the Netflix data set is no longer available. Wegive the details about our experiments and results in Chapter 4. Finally we con-clude in chapter 5 with some discussion and some pointers to future work.3Chapter 2Background2.1 Collaborative filteringThe biggest problem faced by recommendation systems is data sparsity. An av-erage user rates only a small percentage of the entire set of items available. Thismakes the prediction task difficult because there are very few positive samples anda very large number of negative samples. In collaborative filtering algorithms, thisproblem is solved by trying to collect and understand preferences from multipleusers and items to make recommendations for one user or item. One method isto form clusters of similar users and items. Then predictions are made for a userby looking at the usage history of a similar user or by looking at the set of itemssimilar to the item already used by the user.As an example, look at Table 2.1 which displays the consumption history offive users. Looking at the table, we can tell User 3 and User 5 have similar taste initems as they both have consumed Item 1 and Item 2 and rejected item 3. Now, ifwe were to make prediction for User 3, we can look at the consumption history ofUser 5 and recommend Item 4 to User 3. Similarly, Item 2 can be recommendedto User 1. However, our data is not always a simple “consumed - rejected - notseen” data set. We are going to train and test our models on movie1 rating datasets. The movie ratings are ordered over 5 categories which makes the problem ofmaking recommendation more complex. Moreover, in this example we have used1 From now on, we are going to use movie and items interchangeably4Item 1 Item 2 Item 3 Item 4User 1 Yes ? No YesUser 2 Yes ? ? ?User 3 Yes Yes No ?User 4 ? ? Yes ?User 5 Yes Yes No YesTable 2.1: User consumption history.“Yes” represents that an item has beenconsumed by the user, “No” represents that an item was rejected by theuser whereas “?” represents the item has not been seen by the usera very simple and stripped down notion of similarity. As we see in the literature,researchers have proposed more complex and more effective ways of interpretingsimilarity between users and items.2.2 Latent factor modelIn the latent factor model, we assume that there are a set of unobserved charac-teristic traits which explain why a user gave a particular rating to a movie. Theseunobserved characteristic traits may belong to user or item. For example: a usertrait may be their affinity towards action movies and a movie trait might be thedegree to which it is an action movie. Learning these latent factors is similar toclustering users and items. Each latent factor can represent a cluster and the valueof the latent factor for a particular user/item can represent the degree to which auser/item is associated with a cluster. This way a user/item can be associated withmultiple cluster with different degrees of association and then the ratings given outby the user can be explained in terms of these associations.In Figure 2.1, we can see a toy example of a post hoc interpretation of a 2-dimensional latent factor model for items. The X-axis captures whether the movieis comic or tragic whereas the Y-axis captures if the movie is romantic or thrilling.For example, “Armageddon” is a thriller with a comic touch and hence is placed inthe second quadrant. Now a user depending on his own set of latent factors wouldbe attracted to certain portions of this state-space. For example, a user may lie inthe top right quadrant based on its own latent factors as shown in Figure 2.1. Now5Figure 2.1: Latent factor modelthe user would prefer romantic comedies like “The Ugly Truth”. The idea behindthe latent factor based recommendation system is that the user would give highratings to movie lying in corresponding portions of the state space and low ratingsto movie lying far away from there.2.2.1 BellKor modelThe Bellkor model is based on a latent factor model [5]. The model assumes thateach user and each movie has a set of unobserved latent factors. The interactionbetween these latent factors linearly combined with appropriate offsets explainsthe rating given by the user to the movie.rˆu,i = µavg +bu +bi +∑fpu, f q f ,i (2.1)where rˆu,i is the rating which user u would give to item i according to the model,µavg is the average rating over the data set, bu is a user bias, bi is a item bias, f isthe latent factor, pu, f is the value of latent factor for user u and q f ,i is the value ofitem latent factor for item i.6Using the definition in Equation 2.1, the cost function can be defined as:C(θ) = ∑〈u,i,r〉∈D(rˆu,i− ru,i)2+λ‖θ‖22 (2.2)where D is the data set consisting of tuples <u, i, r>, θ =(bu,bi, pu, f ,qi, f), ru,iis the rating given by user u to item i and rˆu,i is the model prediction. λ is theregularization constant and the second term is introduced to avoid over-fitting. Thel2 regularizer makes our model more numerically stable [12, 27].The model learns the user and item bias along with the latent factors by mini-mizing Equation 2.2:θˆ = argmin ∑〈u,i,r〉∈D(rˆu,i− ru,i)2+λ‖θ‖22 (2.3)The minimization can be done using stochastic gradient. The algorithm canbeen seen in Algorithm 1. The model parameters are initialized to random valuessampled from uniform random distribution. The termination criteria depends onthe improvement gained in the cost function. In our experiments, if the improve-ment drop below .01%, then the algorithm is stopped. The derivatives of the costfunction with respect to model parameters are:∂C(θ)∂bu= 2(rˆ− r)+λ2bu∂C(θ)∂bi= 2(rˆ− r)+λ2bi∂C(θ)∂ pu, f= 2(rˆ− r)qi, f +λ2pu, f∂C(θ)∂qi, f= 2(rˆ− r)pu, f +λ2qi, f2.2.2 Ordec modelIn [21], the authors extended the BellKor model to create a probabilistic recom-mendation model called Ordec which returns a probability distribution over ratingsfor each user-item pair. In addition to the parameters defined in (2.1), the model in-7Algorithm 1 BellKor latent factor model1: INPUT: data set of 〈UserID,MovieID,Rating〉2: OUTPUT: bu,bi, pu, f ,qi, f3: Initialize model parameters4: while Termination criteria not met do5: for each data point: user u, item i, rating r do6: rˆ = µavg +bu +bi +∑ f pu, f q f ,i . Make prediction7: θ new = θ − step size ∂C(θ)∂θ . Update model parameters8: θ = θ new9: end for10: end whiletroduced S - 1 ordered thresholds, associated with each of the rating values besidesthe last onet1 ≤ t2 · · · ≤ tS−1 (2.4)where t1 is an actual model parameter but other thresholds are defined using anotherset of parameters β1,β2, · · · ,βS−2:tr+1 = tr + exp(βr) (2.5)where r ∈ [1,S− 2]. This ensures that the difference between consecutive thresh-olds is non-negative. Given these model parameters, the authors defined:P(ru,i ≤ r|θ) =11+ exp(rˆu,i− tu,r)(2.6)where θ represents the biases and latent factors defined in (2.1) along with thethresholds explained in (2.5). The learning was done by using stochastic gradientascent on the log likelihood function.2.3 Evaluation of recommendation systemsMotivated by the Netflix challenge, recommendation systems are being evaluatedon the basis of how well they do on the root mean square measure (RMSE).RMSE =√∑ni=1(rˆ−r)2n8where rˆ is the prediction and r is the actual rating. However, over the years thefocus of recommendation systems have shifted from prediction accuracy to rankingof items. The recommendation system is expected to return a list of items rankedaccording to the probability of them being consumed by the user. Ranking of itemsgive a more accurate representation of how the recommendation system is actuallyused in the real world [22].In [9], authors showed that methods which do well on the RMSE measuremight actually not do that well on predicting highly ranked items. This is be-cause while optimizing RMSE, the model gets equal incentive for doing well onpredicting low and high ratings. However, a system designed to return a list ofitems which would be ranked highly by the user can achieve its goal just by doingwell on the higher ranked items even though it might be very bad in predicting ordistinguishing between lower ranked items.In our study, we evaluate the system on the basis of how well they do in pre-dicting items which would be consumed by the user with a high probability. Wedidn’t carry out an actual user study, so we don’t have a good way of measuringor evaluating this phenomenon. To overcome this, we use a surrogate; we ask therecommendation models to present users with K items and say an item is liked if itwas given a rating of 5 by the user.2.3.1 Precision at n vs recall at nRecommendation systems are often compared against each other using the preci-sion at n vs recall at n metric [32]. Precision at n is the precision of the model whenit was used to generate a recommendation list of size n. Recall at n is the recall ofthe model when it was used to generate a recommendation list of size n. They arecalculated as follows:Precision at n =#T P(n)#T P(n)+#FP(n)(2.7)Recall at n =#T P(n)#T P(n)+#FN(n)(2.8)where n is the recommendation list size, #T P(n) is the number of true positives(the number of movies the user rated 5 out of the n movies presented), #FP(n) is9Ground truthPredictiontrue falsetrue true positive false positivefalse false negative true negativeTable 2.2: Confusion matrix. This helps in understanding the definition oftrue positive, false positive, false negative and true negative. For exam-ple, a prediction is called false negative if it was predicted to be false bythe model but was actually true.the number of false positives (#T P(n)+ #FP(n) = n) , #FN(n) is the number offalse negatives (#T P(n)+#FN(n) = the number of items the user rated as 5) whenthe recommendation list size is n. True positives, false positives and false negativescan be understood from Table 2.2We can observe that as we increase the value of n, recall at n would increaseas the denominator in (2.8) is a constant and the #T P(n) would monotonicallyincrease. On the other hand, precision at n would show a tendency to decrease.The recommendation list of increasing lengths are made by adding the successiveelements from the ranked list. So the probability of the new item being liked bythe user is less than any of the previously recommended items. This implies thatwith an increase in n, the new element would have a greater tendency to become afalse positive rather than true positive. Therefore, #FP(n) would increase relativelyquickly as compared to #T P(n) as we increase n. Hence, the precision at n wouldshow a tendency to decrease.Figure 2.2a shows a toy example of how the precision of a model varies aswe change the recommendation list size. Figure 2.2b shows a toy example of thevariation in recall of the model as we change the recommendation list size. Theinformation is aggregated together by plotting the precision at n vs recall at n curveFigure 2.3.2.3.2 Receiver operating characteristic curveWe also compare the models based on the Receiver Operating Characteristic orROC curves [15]. ROC curve is a plot between the true positive and false positive10(a) Precision vs recommendation list size (b) Recall vs recommendation list sizeFigure 2.2: Precision and recall as we change the size of the recommendationlist.Figure 2.3: Precision at n vs recall at n11rate:True positive rate =#T P(n)#T P(n)+#FN(n)(2.9)False positive rate =#FP(n)#FP(n)+#T N(n)(2.10)Different values for true positive and false positive rate are calculated from rec-ommendation list of varying size and then plotted. Both precision at n vs recall atn and ROC curve gives us a representation of the proportion of items that wouldbe consumed by the user from the recommendation list. However, precision at n vsrecall at n curve emphasize more on the recommended items which are consumedby the user (precision) whereas ROC curve gives more emphasis to recommendeditems which are not consumed by the user (false positive) [32]. Due to this differ-ence, we should chose between precision at n vs recall at n and ROC curve to doour analysis based on the context. For example, if we want our models to have ahigh precision, we should compare them on the basis of precision at n vs recall at ncurve. On the other hand, if we want our models to have low false positive rate, weshould compare them on the basis of ROC curves. In this work, we have plottedboth curves and analyzed them independently.2.3.3 Cost curvesPrecision at n vs recall at n curve and ROC curves allow us to compare betweentwo classifiers while we vary some model parameter to see how they work underdifferent conditions. In this work, we have varied the recommendation list size toget different performance measures. However, these curves don’t account for theprobability distribution of the positive (or negative) samples in our test population.This is problematic especially when our test and train population probability dis-tribution don’t represent the real world accurately. In our experiments, there is nogood way of knowing that the probability distribution of the real world is the sameas the test set. Hence, we would like to have a tool which allows us to comparebetween classifiers at varying probability of positive samples.Cost curve helps us in comparing between classifiers at varying probabilitiesof positive samples [10]. Cost curve plots the error rate at varying probability of12positive samplesE[Cost] = FN ∗ p(+)∗C(−|+)+FP∗ p(−)∗C(+|−) (2.11)where FN is the false negative rate, p(+) represents the probability of positive sam-ple, C(- |+) is cost of missclassifying positive sample as negative, FP is the falsepositive rate, p(-) represents the probability of negative sample and C( +|-) is costof miss-classifying a negative sample as positive. Note that p(-) = 1 - p(+). Wefurther assume that C(- |+) = C(+|-). Making these simplifications, we get:Norm(E[Cost]) =FN ∗ p(+)∗C(−|+)+FP∗ p(−)∗C(+|−)p(+)∗C(−|+)+ p(−)∗C(+|−)=C(−|+)(FN ∗ p(+)+FP∗ p(−))C(−|+)(p(−)+ p(+))= FN ∗ p(+)+FP∗(1− p(+))= (FN−FP)∗ p(+)+FP(2.12)Note here that FN = 1-TP where TP is true positive rate. ROC curve is acurve between TP and FP. So, for every point in ROC curve, we get a line in costcurve by varying p(+) as seen in equation (2.12). As seen in Figure 2.4, for eachpoint on the ROC curve, we plot a line in cost curve using the equation (2.12).We can see that these sets of intersecting lines form a lower envelope. This lowerenvelope represents the set of optimal operating points for a given classifier as theprobability of positive samples is varied from 0 to 1. In our experiments, we plotcost curves for our three competing models and make inferences based on theirlower envelopes.2.3.4 Fraction of concordant pairsThe fraction of concordant pairs (FCPs) is another evaluation measure used to com-pare different classifiers. Precision at n vs recall at n, ROC and cost curve are basi-cally meant to compare binary classifiers. We can extend them to recommendationsystem by asking a binary question like Does the user like the movie recommended13(a) ROC curve (b) Cost curveFigure 2.4: ROC curve and corresponding set of lines in cost curve and thelower envelope.by the recommendation model? Liking a movie is synonymous to giving a highrating to a movie. So a classifier which can predict movies which would receivehigh ratings, say 5, can do very well on these curves. However websites would of-ten want to order movies over the entire rating scale. This is important for pricingthe movies. So we would like to evaluate how our recommendation system doeson the entire rating structure i.e., can it distinguish between a rating of 2 and 3stars as well as it is able to distinguish between 2 and 5 stars. FCP allows us to dothis. To define FCP, let’s first define concordant pairs. Given a test set D, we definethe number of concordant pairs for user u by counting those pair of items that areranked correctly by the rating predictor:nuc = |{(i, j) : rˆu,i > rˆu, j and ru,i > ru, j}| (2.13)Similarly we can define discordant pairs for user u as the number of pair ofitems which are ranked incorrectly by the rating predictor:nud = |{(i, j)rˆu,i > rˆu, j and ru,i < ru, j}| (2.14)Given the definition of nuc and nud , we can define FCP as:FCP =ncnc +nd(2.15)where nc =∑u nuc and nd =∑u nud . From the defintion, we can observe that in order14to do well on the FCP measure, the ranking predictor should be equally good atdistinguishing low quality movies and high quality movies.2.4 Constrained optimizationConstrained optimization is a large and well studied field of research. We are goingto look at a type of constrained optimization problem which we encountered whilemodelling our recommendation system,Minimize f(x),subject to fi(x)≤ 0The problem is to minimize a function f(x) without violating the constraintsfi(x) ≤ 0 ∀i ∈ [1..m]. To solve the above minimization problem, a new functioncalled a barrier function is introduced [17]:φ(x) =∑mi=1−log(− fi(x)), fi ≤ 0, i = 1, ...,m∞, otherwise(2.16)You can imagine the barrier function φ(x) representing a wall separating thevalid space of parameters from the invalid. The function attains a high positivevalue as fi → 0 (the wall) and a small value as we keep going away from zero.Therefore, this function motivates the parameters to not violate the constraints.The barrier function is linearly combined with the original function which wasto be minimized to create a new optimization problem but without any constraintsthis time,h(x; t) = f (x)+1tφ(x) (2.17)The new optimization problem has a hyper-parameter t which is to be opti-mized. We can note here that higher the value of t, lower is the contribution of theφ(x) to h( x; t ) and as t→ ∞, h( x; t ) ≈ f(x).In the literature, h(x;t) is minimized by minimizing a series of different h(x;tˆ) problems where tˆ is pre-fixed [17, 33]. We start with an initial value for tˆ andcalculate x∗ which minimizes h(x; tˆ). Then the value for tˆ is increased and h(x; tˆ)15is again minimized. However, this time the minimization starts from x∗ calculatedin the previous minimization step and so on. As t→ ∞, x∗ which minimizes h(x;tˆ)also minimizes f(x).16Chapter 3Proposed ModelsWe argued that most recommendation systems were treating movie ratings as num-bers which was a misinterpretation of the data set. Movie ratings represents cate-gories which have an ordered structure. Moreover, movie ratings are very subjec-tive, in the sense that a movie which gets assigned a rating of 3 by a user might geta rating of 4 by another user even though they equally enjoyed the movie. There-fore, its important to not only learn what the user may or may not like, but an equalemphasis must be given to the way he assigns ratings to the movie. The secondpart is often ignored by recommendation systems. This is explicitly incorporatedin our second recommendation system.We are now going to talk about the two models. The two models vary in theway they interpret the data. The first model treats the data as nominal and thesecond one treats it to be ordinal. Due to their different way of interpreting thedata, they have different ways of assigning probability, and hence cost functions tobe optimized.3.1 Logistic regression modelIn this model, we interpret the data to be nominal. We learn independent logisticregression model (LRM) for each category1. While learning the LRM for a partic-1We could also learn a multinomial representation, but this did not give different performance inour tests. We present the logisitic regression model as our testing was for rating = 5 and we used thelogisitic regression model for this category17Rating Percentage distribution1 0.05616%2 0.10753%3 0.34889%4 0.26114%5 0.22626%Table 3.1: Percentage distribution of the ratings in the 1 million MovieLensdata set.ular category, say j, we treat all data points with rating = j as positive samples andrest as negative. Due to this simplification step, the training data set for all LRM isheavily biased with lot of negative samples and relatively few positive samples.Table 3.1 shows the distribution of ratings in the 1 million rating Movielensdata set. For example, the most skewed data set is observed for rating = 1, whereto train the LRM we would have approximately 5% positive samples and 95%negative samples. On the other hand, the least skewed data set is observed forrating = 4, where to train the LRM we would have approximately 35% positivesamples and 65% negative samples.The LRM is based on the BellKor model. Like the BellKor solution we assumethat the users and items have latent traits which interact with each other, and theirinteraction decide the rating which a movie gets from a particular user. Since weare learning 5 independent LRMs, one for each category, we assume that user andmovies have 5 sets of latent factors, one for each category. Then we define:response j = bu, j +bi, j +∑fpu, j, f q f , j,i (3.1)where j∈ [1,5] is an index over the 5 independent LRM models,bu, j, bi, j, pu, j andqi, j are the model parameters for jth LRM. bu, j represents the user bias for user u,bi, j represents the item bias for item i, f are the latent factors, pu, j, f are values of theunobserved latent factors for user u and q f ,i, j are values of the unobserved latentfactors for item i.This response is converted to a probability value by passing it through a sig-18moid function. So, we get:P(ru,i = j)=11+ e−response j(3.2)where P(ru,i = j)is the probability of user u assigning a rating j to movie i andresponse j is as defined in (3.1) Given the above definition of probability of a moviegetting a rating of j, we can define the likelihood of a data set D with respect toparameters θ :LD(θ) =n∏〈u,i,r〉∈D5∏j=1(P(ru,i = j))1(ru,i= j)(1−P(ru,i = j))1(ru,i 6= j) (3.3)where D is the dataset, u is the user, i is the movie and r is the rating given to themovie i by user u. We try to learn model parameters θ =(bu, j,bi, j, pu, j, f ,qi, j, f)which maximize the likelihood of the data under our model. This is equivalent tominimizing the negative of the log likelihood. We have added a l2 regularizer aswell:NLL(θ)=−n∑u,i,r∈D5∑j=1log((P(ru,i = j))1(ru,i= j)(1−P(ru,i = j))1(ru,i 6= j))+λ‖θ‖22(3.4)where λ is the regularization constant. We minimize NLL(θ ) by using stochasticgradient as described in Algorithm 2. The model parameters are initialized to ran-dom values using uniform random distribution. We use a constant step size [31].We terminate when the improvement in NLL(θ ) between consecutive iterationsbecomes less than 0.01%. The derivatives with respect to model parameters are:19∂NLL(θ)∂bu, j=− ∑〈u,i,r〉∈D((1(ru,i = j)−P(ru,i = j))+2λbu, j)∂NLL(θ)∂bi, j=− ∑〈u,i,r〉∈D((1(ru,i = j)−P(ru,i = j))+2λbi, j)∂NLL(θ)∂ pu, j, f=− ∑〈u,i,r〉∈D((1(ru,i = j)−P(ru,i = j))∗q f , j,i +2λ pu, j, f)∂NLL(θ)∂q f , j,i=− ∑〈u,i,r〉∈D((1(ru,i = j)−P(ru,i = j))∗ pu, j, f +2λq f , j,i)Algorithm 2 Logistic regression model1: INPUT: data set of 〈UserID,MovieID,Rating〉2: OUTPUT: bu, j,bi, j, pu, j, f ,qi, j, f3: Initialize model parameters4: while Termination criteria not met do5: for each data point: user u, movie i, rating r do6: for each category j do7: response j = bu, j + bi, j + ∑ f pu, j, f ∗q f , j,i8: P(ru,i = j)= 11+e−response j. Make prediction9: θ new = θ − step size∗ ∂NLL(θ)∂θ . Update model parameters10: θ = θ new11: end for12: end for13: end whileOnce the model parameters and learned using the algorithm 2, we can makepredictions using (3.2). Using (3.2) the model can answer questions like; what isprobability of a movie i getting a rating j from user u. We are particularly interestedin knowing what is the probability of a movie i getting a rating of 5 from user uwhich we use later in our experiments as a surrogate for liking a movie.3.2 Cumulative modelThe cumulative model is a recommendation model where we treat the ratings asordinal. In models where the ratings are treated as numbers, the difference between20the ratings is assumed to be linear e.g., the difference between a rating of 1 and 2is the same as difference between rating of 4 and 5. This is a very restrictiveassumption.The cumulative model is inspired by the regression models for ordinal dataproposed in [26]. In this model, we model the ordinal ratings by postulating theexistence of an unobserved latent variable Zu,i and a set of user specific thresholds.Zu,i captures the affection of user u towards an item i whereas the thresholds areused to define mapping between regions on Zu,i and ratings. By learning thresholdsand Zu,i we arrive at a very flexible model in comparison to the models which treatthe ratings as numbers. These additional sets of thresholds allow us to learn userrating pattern which could be non-linear in structure i.e., now with this model wecan learn rating patterns where the difference between 1 and 2 stars could be verysmall as compared to difference between 4 and 5 stars.Similar to the previous models, we define a response that takes into accountuser and item biases and the latent traits:hu,i = bu +bi +∑fpu, f q f ,i (3.5)where bu is the user bias, bi is the item bias, f is the set of latent factors, pu, f is thevalue of latent factor for user u, q f ,i is the value of latent factor for item i.We then assume a noisy observation model, Zu,i:Zu,i = hu,i + εu,i, (3.6)where εu,i is a random variable to model the noise. The noise could be because ofnumerous independent random processes. We assume that the εu,i are independentfor each u, i pair, with each one normally distributed: εu,i ∼N (0,1). Then we cansay thatZu,i ∼N (hu,i,1) (3.7)We map this unobserved latent variable Zu,i to categories by introducing thresh-olds or cutoffs. For data set with 5 categories, we introduce 6 cutoffs: γu,r wherer ∈ {0, 5}. We assume γu,0 = - ∞ and γu,5 = ∞. We mentioned earlier that movierating is a subjective process. A movie can receive different ratings from different21Figure 3.1: Plot of Zu,i ∼ N (hu,i,1) with cutoffs to map the latent trait tocategories/ratings. If Zu,i falls between γu,r−1 and γu,r it gets mapped tocategory r.user even though they equally enjoyed (which is captured by Zu,i) the movie. Thismotivates us to have user specific cutoffs.Figure 3.1 shows Zu,i for a particular movie i and user u along with the userspecific thresholds γu,r. Zu,i is distributed according to (3.7) and (3.5). The mappingusing the thresholds is straightforward. If Zu,i lies between γu,r−1 and γu,r, we saythat user u gave a rating of r to movie i.Once we have Zu,i and user specific γ , its easy to come up with an expressionfor assigning the probability to the event of user u giving a rating r to item i:P(ru,i = r)= P(γu,r−1 < Zu,i ≤ γu,r)= P(Zu,i ≤ γu,r)−P(Zu,i < γu,r−1)= F(γu,r−hu,i)−F(γu,r−1−hu,i)(3.8)where P(ru,i = r)is the probability with which the user u gives a rating of r to moviei and F is the cumulative Gaussian probability distribution.Given the definition of probability of the event of user u giving a rating r to22item i, we can define the likelihood of a data set D with respect to parameters θ :LD(θ) = ∏〈u,i,r〉∈DF(γu,r−hu,i)−F(γu,r−1−hu,i) (3.9)We try to learn the model parameters θ =(bu, bi, pu, f , q j, f , γu)which maximizethe likelihood of the data under our model. This is equivalent to minimizing thenegative of the log likelihood. We add a l2 regularizer:NLL(θ) =− ∑〈u,i,r〉∈Dlog(F(γu,r−hu,i)−F(γu,r−1−hu,i))+λ‖θ‖22 (3.10)with constraintsγu,r−1 ≤ γu,r∀r ∈ [0, 5]where λ is the regularization constant and the second term is introduced to avoidover fitting. We minimize NLL(θ ) by using stochastic gradient. The model pa-rameters were initialized to random values using uniform random distribution. Itis interesting to note here that NLL(θ ) is not convex. In our empirical studies werealized that minimizing NLL(θ ) in its current form was very difficult because ofnumerical instability. The constraints on γ were often violated, in the sense, weoften encountered scenarios where γu, j < γu, j−1 for some user u and rating j. Dueto this, the probability of a movie receiving a rating j from the user u becomesundefined. To tackle this problem, we made following adjustments in our solution:1. We changed our cost function to explicitly include a term for the constraints.This was done using the barrier method discussed in Section 2.4. Then ournew cost function is:C(θ ; t) =− ∑〈u,i,r〉∈Dlog(F(γu,r−hu,i)−F(γu,r−1−hu,i))+λ‖θ‖22 +1t4∑m=1−log(γu,m− γu,m−1)(3.11)2. Secondly, we made changes to the way γu were getting initialized. Ratherthan randomly initializing them, we trained our system assuming that all the23data points were coming from a single user. The final value for γ obtainedin this pre-processing step was used as a seed. γu were initialized to γ andsamples from zero mean Gaussian noise. This step can be interpreted asusing some sort of global average to initialize the values of γ and then duringthe training step, we just learn deviations from the global average for eachuser. It can also be interpreted as imparting your prior knowledge about thedistribution of γ .Algorithm 3 Cumulative model1: INPUT: data set of 〈UserID,MovieID,Rating〉2: OUTPUT: bu,bi, pu, f ,qi, f ,γu3: Initialize model parameters4: while Termination criteria not met do5: for each data point: u, i, r do6: P(ru,i = r) = F(γu,r−hu,i)−F(γu,r−1−hu,i). Make prediction7: θ new = θ − step size ∂C(θ)∂θ . Update model parameters8: θ = θ new9: end for10: end whileAfter making these two adjustments, the model parameters are obtained byminimizing C(θ ;t) with respect to θ by using stochastic gradient. The algorithmfor the same can be seen in Algorithm 3. The step size is set to a constant value[31]. We terminate when the improvement in C(θ ) between consecutive iterationsbecomes less than 0.01%. The derivatives with respect to model parameters are:24∂C(θ)∂bu= ∑〈u,i,r〉∈D((f (γu,r−hu,r)− f (γu,r−1−hu,r−1))F(γu,r−hu,r)−F(γu,r−1−hu,r−1) +2λbu, j)∂C(θ)∂bi= ∑〈u,i,r〉∈D((f (γu,r−hu,r)− f (γu,r−1−hu,r−1))F(γu,r−hu,r)−F(γu,r−1−hu,r−1) +2λbi)∂C(θ)∂ pu, f= ∑〈u,i,r〉∈D((f (γu,r−hu,r)− f (γu,r−1−hu,r−1))qu, fF(γu,r−hu,r)−F(γu,r−1−hu,r−1) +2λ pu, f)∂C(θ)∂qu, f= ∑〈u,i,r〉∈D((f (γu,r−hu,r)− f (γu,r−1−hu,r−1))pu, fF(γu,r−hu,r)−F(γu,r−1−hu,r−1) +2λqu, f)∂C(θ)∂γu,r= ∑〈u,i,r〉∈D((f (γu,r−hu,r)− f (γu,r−1−hu,r−1))F(γu,r−hu,r)−F(γu,r−1−hu,r−1) −1t(γu,r− γu,r−1) +2λγu,r)where F is the Gaussian cumulative distribution function and f is the Gaussianprobability density function.Once we have trained the system to learn all the model parameters, we canpredict the probability with which user u will give a rating of r to a movie i using(3.8). We are in particular interested in knowing what is the probability of a moviei getting a rating of 5 from a user u which we use in our experiments as a surrogatefor liking a movie.25Chapter 4Experiments and ResultsWe evaluated the six models;1. the logistic regression model presented in Section 3.12. the cumulative model presented in Section 3.23. BellKor solution as explained in Section 2.2.14. Ordec model as explained in Section 2.2.25. Most popular movies: In this case, the recommendation list was generatedby sorting the movies on the basis of the number of 5 ratings they get in thetraining data set.6. Random moviesThe evaluation is done on two data sets; 1 million rating MovieLens data set and10 million rating MovieLens data set. The 1 million rating data set has 5 categorieswhereas 10 million rating data set has 10 categories. For all experiments, the dataset in question is divided into two parts chronologically: 80% training set and 20%test set.264.1 Experiment 1The first experiment was carried out using the 1 Million rating MovieLens dataset collected as part of the Grouplens project1 [16]. The data set has ratings from4000 movies and over 6000 users. We want to evaluate if the recommendationmodels are able to generate personalized recommendations which the user wouldlike. Since we didn’t perform an actual user study, we used a surrogate for liking amovie. We say, a movie is liked by a user if he or she gives the movie a rating of5. These movies form the positive samples. The data set was filtered to have onlythose users for which we had ≥ 20 positive samples. After the filtration step, wewere left with 582 users.The six models were evaluated by generating a list of recommendations for theusers. In case of BellKor, the list was generated by sorting the movies on the basisof the rating they received by the user. For LRM, the list was generated by sortingthe movies on the basis of the probability of them receiving a rating of 5. The samemethod was used for the cumulative and ordec model. We varied the size of the listof recommendation from 1 to 20 and calculated the precision and recall at each listsize. The precision at n vs recall at n where n is the size of the list can be seen inFigure 4.1As can be seen in Figure 4.1, the cumulative model is completely dominat-ing the BellKor solution and LRM. The Ordec model and the most popular moviebaseline has higher precision as compared to the cumulative model for small rec-ommendation list sizes (the most popular couple of movies is easy to predict) butgets dominated in other region of the graph. The curve between precision andrecommendation list size, and recall and recommendation list size, can be seen inFigure 4.2 and Figure 4.3 respectively. If we look at figure Figure 4.4 which is aplot between the average number of true positives per user vs the size of the rec-ommendation list, we observe that all models achieve very low average with thehighest around 1.4 attained by the cumulative model. This results in low precisionand recall values as observed in Figure 4.2 and Figure 4.3.The low precision and recall values can be explained. In our experiment de-sign, we were making predictions over the set of all possible movies (4000). How-1http://grouplens.org/datasets/movielens/27Figure 4.1: Experiment 1: Precision at n vs recall at nFigure 4.2: Experiment 1: Precision vs recommendation list size28Figure 4.3: Experiment 1: Recall vs recommendation list sizeFigure 4.4: Experiment 1: Average number of true positives per user vs rec-ommendation list size29ever, on an average only 100 movies are rated by the users. So a whole bunchof recommendations were being deemed negative, not because they were bad rec-ommendations (not rated as 5) but because they were just never rated by the user.This is not a drawback of the recommendation system(s) being evaluated but of ourexperimental setup. This drawback is countered in the next experiment.We have also plotted other metrics like average number of false positive vsrecommendation list size, average number of false negative vs recommendation listsize to help compare between the models. These plots can be seen in Section A.14.2 Experiment 2To overcome the drawback of our experiment setup we observed in the previousexperiment, we made the following adjustment. Instead of asking the recommenda-tion models to come up with a list of recommendations from the list of all possiblemovies, we asked them to rank the movies in the test set of the user and return therecommendation list specific to the user test set. Now we are guaranteed that therecommendation being made by the model is already rated by the user and hence,we can fairly assign a positive or negative label to it depending on the rating itreceived by the user in question.Figure 4.5 shows the precision at n vs recall at n curve for the six models.Figure 4.8 shows the ROC curve for the six models. We can observe that in bothcurves, the cumulative model is dominating the logistic regression model (LRM)and other baselines. LRM is doing even worse than the most popular movie base-line. As done in the previous experiment, the plot between precision at n vs recall atn has been broken down into two separate plots; precision vs recommendation listsize and recall vs recommendation list size as shown in Figure 4.6 and Figure 4.7respectively.To further validate our results we have also plotted other metrics like averagenumber of true positive vs recommendation list size, average number of false pos-itive vs recommendation list size. These plots can be seen in Section A.2We also plotted the cost curve for this experiment in Figure 4.9. As explainedin section Section 2.3.3, cost curves allow us to compare classifiers at differentvalues of probability of positive samples. Again we see that cumulative model is30Figure 4.5: Experiment 2: Precision at n vs recall at nFigure 4.6: Experiment 2: Precision vs recommendation list size31Figure 4.7: Experiment 2: Recall vs recommendation list sizeFigure 4.8: Experiment 2: Receiver operating characteristic curve32Figure 4.9: Experiment 2: Cost curveRecommendation Model FCPBellKor model 0.6252LRM 0.6259Cumulative model 0.6766Ordec model 0.6187Table 4.1: Experiment 3: FCP values for the four models on the 1 millionMovieLens data set. A higher value reflects better ranking accuracy.doing better than LRM and Bellkor solution Figure 4.9.4.3 Experiment 3In the first two experiments, we have just evaluated the system on the basis ofhow well it is able to predict movies which are likely to be rated as 5 by the user.However, we also want to evaluate the system on the basis of how well they areable to distinguish other ratings from each other i.e., it is able to distinguish be-tween a movie rated 2 and a movie rated 4. To evaluate the systems over the entirerating structure, we use the fraction of concordant pair measure as explained inSection 2.3.4. Table 4.1 shows the result. Again we see that the cumulative modelis doing better than LRM and the BellKor solution.33Recommendation Model FCPBellKor 0.6330LRM 0.6157Cumulative Model 0.6621Ordec model 0.6289Table 4.2: Experiment 4: FCP values for the four models on the 10 millionMovieLens data set4.4 Experiment 4The previous three experiments were carried out on the same data set. To ensurethat our model was not over-fitting to the data set, we carried out the same set ofexperiments on another data set of a much larger size. We used the 10 millionrating MovieLens data set [16] which has ratings for 17000 movies from 70000users. Moreover, the ratings now have 10 categories between 0.5 to 5.We carried out the same experiments. We filtered the data set to have only thoseusers for which there are at least 20 training items in the test set. The models wereasked to present personalized recommendation list for users of varying sizes (from1 to 20). For the different recommendation list sizes, we calculated and plottedprecision, recall and false positive rate. Figure 4.10 shows the precision at n vsrecall at n curve, Figure 4.13 shows the ROC curve whereas Figure 4.14 showsthe cost curve. We can observe that as the number of categories is increasingand the distance between them is decreasing, the cumulative model, Ordec modeland BellKor model have similar performance on the different measures. Again tomake it more clear, the plot between precision at n vs recall at n has been brokendown into two separate plots; precision vs recommendation list size and recall vsrecommendation list size as shown in Figure 4.11 and Figure 4.12 respectively.We also calculated the fraction of concordant pairs for the four recommenda-tion models on this data set. Table 4.2 contains the results. Again we observe thatthe cumulative model is beating the other models. To make it more evident, wehave also plotted other metrics like average number of true positive vs recommen-dation list size, average number of false positive vs recommedation list size etc.These plots can be seen in Section A.334Figure 4.10: Experiment 4: Precision at n vs recall at nFigure 4.11: Experiment 4: Precision vs recommendation list size35Figure 4.12: Experiment 4: Recall vs recommendation list sizeFigure 4.13: Experiment 4: Receiver operating characteristic curveFigure 4.14: Experiment 4: Cost curve36Chapter 5Conclusion and Future Work5.1 ConclusionThe experiment and results section allows us to make the following conclusions:• The cumulative model does better than the LRM and other baselines includ-ing BellKor and Ordec model in predicting movies which would be rated 5by the users. However, the precision and recall values are very small.• The cumulative model does better than LRM and other baselines includingBellKor and Ordec model in predicting movies which would be rated 5 bythe user given the set of movies which are rated by the user.• The cumulative model is able to do better than the LRM, BellKor and Ordecmodel on the FCP measure. This means the cumulative model is able to dobetter at learning the rating pattern of the user.These results have been verified for data sets of varying sizes, categories and pos-itive sample probability distributions. In all cases, we can say that the cumulativemodel does a good job at coming up with personalized recommendations.The LRM is a very intuitive and straight-forward model. The idea behind itwas very simple; if we want to predict movies with high ratings, we should simplylearn a binary classifier for the same. However, in doing so we lost out on crucialinformation. In particular, when we train the LRM for rating 5, we treat data point37with other ratings as a negative sample. In doing so, we don’t distinguish betweena rating of 1 or 2 or 3 or 4 which is a major design flaw. 4 is clearly more closeto 5 than 1 and this information should be utilized rather than making a binarydecision of handing out positive and negative labels. This information is capturedin our cumulative model. It is able to understand the ordering of the ratings andunderstand that 4 is closer to 5 as compared 3 and so on. Because of this basicdifference in the way our two models treat the data, the cumulative model is ableto do better than the LRM.BellKor is a very simple and well principled method which tries to capturethe latent characteristics of users and movies and makes predictions based on theinteraction between these latent characteristics. The major drawback with BellKoris that it treats the ratings as numbers. In doing so, it implicitly assumes that therating have a linear structure, i.e., the difference between a 2 stars and a 3 stars isthe same as a 4 stars and a 5 stars. There is no valid reason to accept this assumptionand it rather seems to be a very strong prior assumption on the user rating pattern.The cumulative model doesn’t make this strong assumption. The cumulative modelalso tries to capture the latent characteristic of users and movies and assumes thatthe affection of a user towards a movie can be explained in terms of interactionbetween these latent factors. However, we learn an additional set of parameters foreach user in our model. This additional set of parameters is used to map the groundtruth reflecting the affection between a user and movie to the rating user would giveto the movie. Since, we are learning an additional set of parameters, it gives ourmodel a lot of flexibility to learn non-linear rating structure like 2 and 3 stars can bevery far apart but 4 and 5 stars can be very close to each other. The only constraintis that the rating structure should be monotonic. Due to this additional flexibilityprovided in our model, the cumulative model is able to do better than the BellKorsolution which is evident from our experiments and results.5.2 Future workOur experiments and results have shown that the cumulative model for recommen-dation is a very promising venture. There are various ways in which we would liketo take this work forward. Some of them are:381. We have used a very basic form of BellKor solution. Over the years, to gainmarginal benefits over the existing model, new features have been added tothe original BellKor model [23]. But we can extend our model very easilyto wrap around these new features by changing the definition of Zu,i in (3.6).It would be interesting to note how our model would compare against theseextensions of BellKor solution.2. We made a modeling assumption in the cumulative model. We assumed that:Zu,i ∼N (hu,i, 1). We would like to experiment with different probabilitydistributions and study the effect of choice of probability distribution on therecommendation system accuracy and scalability. One possible alternativeis to use the extreme value distribution if we are mainly interested in highratings.3. We observed in our empirical studies that initializing the user specific cutoffrandomly led to a very numerically unstable recommendation system. Toovercome this problem, we made two adjustments:• We introduced the constraints on the cutoff explicitly into our costfunction using the barrier function. In literature, there are various otherways [11, 14] of dealing with a constrained optimization problem. Wewould like to explore these different avenues and see how our systemresponds.• The other adjustment we made was to train the system assuming thereis just one user and then use the final cutoffs as some sort of globalaverage value for the cutoff which were used to initialize the cutoffsfor other users by adding some random Gaussian noise. This step canbe interpreted as using a hierarchical prior for the distribution of thecutoff where we define the prior for each user’s cutoffs by using infor-mation available about other users. After this initialization we learn thedeviation from the prior for each user. This motivates us to do come upwith a Bayesian cumulative model. This goes hand in hand with choos-ing an alternative probability distribution because cumulative Gaussiandistribution is known to not have a conjugate prior. On the other hand,39in [13] authors showed that all members of the exponential family dis-tribution have a conjugate prior. Therefore, for Bayesian analysis ofthe problem, we need to change our choice of probability distributionas well.4. A very common problem faced by recommendation systems is the cold startproblem [28, 30]. In the cold start problem, our model have access to verylittle or no training data for a user. This is a recurring use case which oc-curs when a new user joins the website. In the literature, many solutionshave been proposed to tackle this problem [1, 2, 28]. One common solu-tion is to use user specific features like age, sex, etc., and movie specificfeatures like genre, cast etc. in the recommendation system to make the pre-dictions. We would also like to study how we can extend our existing modelto incorporate these observed user and movie specific features, and how wecompare against other recommendation systems specially designed to handlethis problem.40Bibliography[1] D. Agarwal and B.-C. Chen. flda: matrix factorization through latentdirichlet allocation. In Proceedings of the third ACM internationalconference on Web search and data mining, pages 91–100. ACM, 2010. →pages[2] H. J. Ahn. A new similarity measure for collaborative filtering to alleviatethe new user cold-starting problem. Information Sciences, 178(1):37–51,2008. → pages[3] A. Albadvi and M. Shahbazi. A hybrid recommendation technique based onproduct category attributes. Expert Systems with Applications, 36(9):11480–11488, 2009. → pages[4] M. Balabanovic´ and Y. Shoham. Fab: content-based, collaborativerecommendation. Communications of the ACM, 40(3):66–72, 1997. →pages[5] R. M. Bell, Y. Koren, and C. Volinsky. The bellkor solution to the netflixprize, 2007. → pages[6] J. Bennett and S. Lanning. The netflix prize. In Proceedings of KDD cupand workshop, volume 2007, page 35, 2007. → pages[7] R. Burke. Hybrid recommender systems: Survey and experiments. Usermodeling and user-adapted interaction, 12(4):331–370, 2002. → pages[8] I. Cantador, A. Bellogı´n, and D. Vallet. Content-based recommendation insocial tagging systems. In Proceedings of the fourth ACM conference onRecommender systems, pages 237–240. ACM, 2010. → pages[9] P. Cremonesi, F. Garzotto, and R. Turrin. Investigating the persuasionpotential of recommender systems from a quality perspective: An empirical41study. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(2):11,2012. → pages[10] C. Drummond and R. C. Holte. Cost curves: An improved method forvisualizing classifier performance. 2006. → pages[11] M. A. Figueiredo, R. D. Nowak, and S. J. Wright. Gradient projection forsparse reconstruction: Application to compressed sensing and other inverseproblems. Selected Topics in Signal Processing, IEEE Journal of, 1(4):586–597, 2007. → pages[12] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statisticallearning, volume 1. Springer Series in Statistics New York, 2001. → pages[13] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B.Rubin. Bayesian data analysis. CRC press, 2013. → pages[14] P. E. Gill, W. Murray, and M. H. Wright. Practical optimization. 1981. →pages[15] J. A. Hanley and B. J. McNeil. The meaning and use of the area under areceiver operating characteristic (roc) curve. Radiology, 143(1):29–36,1982. → pages[16] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmicframework for performing collaborative filtering. In Proceedings of the 22ndannual international ACM SIGIR conference on Research and developmentin information retrieval, pages 230–237. ACM, 1999. → pages[17] H. Hindi. A tutorial on convex optimization ii: duality and interior pointmethods. In American Control Conference, 2006, pages 11–pp. IEEE, 2006.→ pages[18] T. Hofmann. Collaborative filtering via gaussian probabilistic latentsemantic analysis. In Proceedings of the 26th annual international ACMSIGIR conference on Research and development in informaion retrieval,pages 259–266. ACM, 2003. → pages[19] G. Karypis. Evaluation of item-based top-n recommendation algorithms. InProceedings of the tenth international conference on Information andknowledge management, pages 247–254. ACM, 2001. → pages[20] Y. Koren. The bellkor solution to the netflix grand prize. Netflix prizedocumentation, 81, 2009. → pages42[21] Y. Koren and J. Sill. Collaborative filtering on ordinal user feedback. InProceedings of the Twenty-Third international joint conference on ArtificialIntelligence, pages 3022–3026. AAAI Press, 2013. → pages[22] O. Koyejo, S. Acharyya, and J. Ghosh. Retargeted matrix factorization forcollaborative filtering. In Proceedings of the 7th ACM conference onRecommender systems, pages 49–56. ACM, 2013. → pages[23] R. Kumar, B. K Verma, and S. Sunder Rastogi. Social popularity basedsvd++ recommender system. International Journal of ComputerApplications, 87(14):33–37, 2014. → pages[24] G. Linden, B. Smith, and J. York. Amazon. com recommendations:Item-to-item collaborative filtering. Internet Computing, IEEE, 7(1):76–80,2003. → pages[25] L. Martı´nez, L. G. Pe´rez, and M. Barranco. A multigranular linguisticcontent-based recommendation model. International Journal of IntelligentSystems, 22(5):419–434, 2007. → pages[26] P. McCullagh. Regression models for ordinal data. Journal of the royalstatistical society. Series B (Methodological), pages 109–142, 1980. →pages[27] A. Y. Ng. Feature selection, l 1 vs. l 2 regularization, and rotationalinvariance. In Proceedings of the twenty-first international conference onMachine learning, page 78. ACM, 2004. → pages[28] S.-T. Park, D. Pennock, O. Madani, N. Good, and D. DeCoste. Naı¨vefilterbots for robust cold-start recommendations. In Proceedings of the 12thACM SIGKDD international conference on Knowledge discovery and datamining, pages 699–705. ACM, 2006. → pages[29] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborativefiltering recommendation algorithms. In Proceedings of the 10thinternational conference on World Wide Web, pages 285–295. ACM, 2001.→ pages[30] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods andmetrics for cold-start recommendations. In Proceedings of the 25th annualinternational ACM SIGIR conference on Research and development ininformation retrieval, pages 253–260. ACM, 2002. → pages43[31] M. Schmidt. Convergence Rate of Stochastic Gradient with Constant StepSize. Technical report, University of British Columbia, Department ofComputer Science, 09 2014. → pages[32] G. Shani and A. Gunawardana. Evaluating recommendation systems. InRecommender systems handbook, pages 257–297. Springer, 2011. → pages[33] S. J. Wright. On the convergence of the newton/log-barrier method.Mathematical programming, 90(1):71–100, 2001. → pages[34] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallelcollaborative filtering for the netflix prize. In Algorithmic Aspects inInformation and Management, pages 337–348. Springer, 2008. → pages44Appendix AAdditional PlotsIn this appendix, we have presented some additional plots which should help thereader in understanding the results of our experiments and the conclusions drawnfrom them.A.1 Additional plots for experiment 1In this section, we are presenting some additional plots for experiment 1.1. Figure A.1 shows the plot between the average number of false negative peruser vs the size of recommendation list.2. Figure A.2 shows the plot between the average number of false positive peruser vs the size of the recommendation list.We can see that cumulative model is dominating LRM and the other baselines inall the above mentioned plots.A.2 Additional plots for experiment 2In this section, we are presenting some additional plots for experiment 2.1. Figure A.3 shows the plot between the average number of true positive peruser vs the size of recommendation list.45Figure A.1: Experiment 1: Average number of false negative per user vs rec-ommendation list sizeFigure A.2: Experiment 1: Average number of false positive per user vs rec-ommendation list size46Figure A.3: Experiment 2: Average number of true positive per user vs rec-ommendation list size2. Figure A.4 shows the plot between the average number of false negative peruser vs the size of recommendation list.3. Figure A.5 shows the plot between the average number of true negative peruser vs the size of the recommendation list.4. Figure A.6 shows the plot between the average number of false positive peruser vs the size of the recommendation list.We can see that cumulative model is dominating LRM and other baselines in allthe above mentioned plots.A.3 Additional plots for experiment 4In this section, we are presenting some additional plots for experiment 4.1. Figure A.7 shows the plot between the average number of true positive peruser vs the size of recommendation list.47Figure A.4: Experiment 2: Average number of false negative per user vs rec-ommendation list sizeFigure A.5: Experiment 2: Average number of true negative per user vs rec-ommendation list size48Figure A.6: Experiment 2: Average number of false positive per user vs rec-ommendation list size2. Figure A.8 shows the plot between the average number of false negative peruser vs the size of recommendation list.3. Figure A.9 shows the plot between the average number of true negative peruser vs the size of the recommendation list.4. Figure A.10 shows the plot between the average number of false positive peruser vs the size of the recommendation list.We can see that cumulative model is very similar in performance as BellKor andOrdec model in all the above mentioned plots.49Figure A.7: Experiment 4: Average number of true positive per user vs rec-ommendation list sizeFigure A.8: Experiment 4: Average number of false negative per user vs rec-ommendation list size50Figure A.9: Experiment 4: Average number of true negative per user vs rec-ommendation list sizeFigure A.10: Experiment 4: Average number of false positive per user vsrecommendation list size51
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Modeling ordinal data for recommendation system
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Modeling ordinal data for recommendation system Srivastava, Anupam 2014
pdf
Page Metadata
Item Metadata
Title | Modeling ordinal data for recommendation system |
Creator |
Srivastava, Anupam |
Publisher | University of British Columbia |
Date Issued | 2014 |
Description | In this work we investigate the problem of making personalized recommendations by creating models for predicting user-item rating, such as in movie recommendations. The study is based on the Movielens data set which has ratings on an ordinal scale. In the past, partly due to motivation gained by the Netflix challenge, researchers have constructed models that make point predictions to minimize the root mean square error (RMSE) on test sets, typically by learning latent user and movie feature structure. In such models, the difference between ratings of 2 and 3 stars is the same as the difference between ratings of 4 and 5 stars, etc., which is a strong prior assumption. We construct probabilistic models which also learn latent user and movie feature structure but do not make this assumption. These models interpret the ratings as categories (nominal and ordinal) and return a probability distribution over the ratings for each user-movie pair instead of making a point prediction. We evaluate and compare our models with other models for making personalized recommendations for the top-n task and comparing the precision vs recall, receiver operating characteristic and cost curves. Our results show that our ordinal data model performs better than a nominal data model, a state-of-the-art point prediction model, and other baselines. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2014-09-25 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0166985 |
URI | http://hdl.handle.net/2429/50433 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2014-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2014_november_srivastava_anupam.pdf [ 1.88MB ]
- Metadata
- JSON: 24-1.0166985.json
- JSON-LD: 24-1.0166985-ld.json
- RDF/XML (Pretty): 24-1.0166985-rdf.xml
- RDF/JSON: 24-1.0166985-rdf.json
- Turtle: 24-1.0166985-turtle.txt
- N-Triples: 24-1.0166985-rdf-ntriples.txt
- Original Record: 24-1.0166985-source.json
- Full Text
- 24-1.0166985-fulltext.txt
- Citation
- 24-1.0166985.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0166985/manifest