Symmetric Collaborative Filtering using the Noisy Sensor Model by R i t a Sharma M.Tech. , University of Roorkee, India A T H E S I S S U B M I T T E D IN P A R T I A L F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F Master of Science in T H E F A C U L T Y O F G R A D U A T E S T U D I E S (Department of Computer Science) We accept this thesis as conforming to the required standard The University of British Columbia February 2001 © Ri ta Sharma, 2001 In presenting this thesis/essay in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying for this thesis for scholarly purposes may be granted by the Head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. fcb £ . 2001 Date Department of Computer Science The University of British Columbia 2366 Main mall Vancouver, BC Canada V6T1Z4 Abstract Collaborative filtering is the process of making recommendations regarding the potential preference of a user, for example shopping on the Internet, based on the preference ratings of the user and a number of other users for various items. This thesis considers the problem of symmetric collaborative filtering based on explicit ratings. To evaluate the algorithms, we consider only pure collaborative filtering, using given ratings and excluding other information about the people or items. Our approach is to predict an active user's preferences regarding a par-ticular item by using other people's ratings of that item and other items rated by the active user as noisy sensors. The noisy sensor model uses Bayes' the-orem to compute the probability distribution for the active user's rating of a new item. We give two variants for learning the noisy sensor model: one, for explicit binary rating data; and the second, for explicit multi-valued rating data. The model for binary rating data is based on Bayesian learning. Its performance motivate us to further explore the use of noisy sensor model for multi-valued rating data. We give two variant models for multi-valued rating data: in one, we learn a linear model of how users rate items; in another, we assume different users rate items identically, but that the accuracy of the sensors must be learned. We compare the two models of multi-valued rating data with state-of-the-art techniques and show how they are significantly better whether a user has rated only two items or many. We report empirical results using the EachMovie database of movie ratings. We also show that by considering the items similarity along with the users similarity the accuracy of the prediction increases. 11 Contents A b s t r a c t i i Contents i i i Li s t of Tables v Li s t of Figures v i Acknowledgements v i i i 1 I n t r o d u c t i o n 1 1.1 Information Fi l ter ing 1 1.1.1 Content-based Fi l ter ing 2 1.1.2 Collaborative Fi l ter ing 3 1.2 Contributions 6 1.3 Outline of the Document 7 2 R e l a t e d W o r k 9 3 S y m m e t r i c C o l l a b o r a t i v e F i l t e r i n g U s i n g T h e N o i s y Sensor M o d e l 14 in 3.1 Filtering Problem and Mathematical Notation 14 3.2 Noisy Sensor Model 15 3.3 Algorithm AlgoBayes 17 3.4 Results for AlgoBayes 20 4 Noisy Sensor Mode l for Mult i -Valued Rating Data 26 4.1 AlgoLinear 27 4.1.1 K dummy observations 30 4.1.2 Selecting Noisy Sensors 31 4.1.3 Variant AlgoEqual of AlgoLinear 32 4.2 Evaluation Framework 33 4.2.1 Metrics • • 34 4.2.2 Data and Protocols 35 4.2.3 Selecting K for AlgoLinear 36 4.2.4 Selecting K for AlgoEqual 39 4.2.5 Selecting Noisy Sensors 40 4.2.6 Comparison with Other Methods 44 5 Conclusions and Future Work 49 5.1 Conclusions 49 5.2 Future Work 50 Bibliography 53 iv List of Tables 3.1 Average absolute deviation scores on the EachMovie data for AlgoBayes, Decision Tree and Correlation (lower scores are bet-ter) 25 4.1 Average absolute deviation scores on the subset of EachMovie database as Pennock et al . [PHLGOO] for AlgoLinear, AlgoE-qual, PD and Correlation, lower scores are better 45 4.2 Average absolute deviation scores on the EachMovie database for AlgoLinear, AlgoEqual, PD and Correlation, for extreme rat-ings 0.5 above or 0.5 below the overall average rating 45 4.3 Significance levels of the differences in average absolute devia-tion between AlgoLinear and PD, and between AlgoEqual and PD, on EachMovie data computed using the randomization test. Low significance levels indicate that the difference in results are unlikely to be coincidental 48 v List of Figures 3.1 Naive Bayesian Network Semantics for The Noisy Sensor Model 16 3.2 Effect of K on algorithm AlgoBayes. This experiment was per-formed using 7=10 22 3.3 Effect of K on algorithm AlgoBayes. This experiment was per-formed using I = 10 23 3.4 The average absolute deviation as a function of the number of best user noisy sensors for different numbers of best item noisy sensors 24 4.1 Effect of A'for AllButl and Givenl 0 protocols on AlgoLinear 37 4.2 Effect of A'for Given5 and Given2 protocols on AlgoLinear . 38 4.3 Effect of A'on AlgoEqual for AllButl and GivenlO protocols . 40 4.4 Effect of K on AlgoEqual for Given5 and Given2 protocols . . 41 4.5 The average absolute deviation as a function of the number of best user noisy sensors for different numbers of best item noisy sensors. This experiment was performed using AlgoLinear and K = 0.5 for AllButl protocol 42 vi 4.6 The average absolute deviation as a function of the number of best user noisy sensors for different numbers of best item noisy sensors. This experiment was performed using AlgoEqual and K =1 for AllButl protocol 43 4.7 Average absolute deviation error of AlgoEqual and PD in 60 samples for AllButl protocol with K — .5, U — 30, and / = 20. 46 4.8 Average absolute deviation error of AlgoLinear and PD in 60 samples for AllButl protocol with K = 1, U = 50, and / = 20. 47 vn Acknowledgements I would like to thank my supervisor, David Poole, for his ideas, comments, guidance and financial support. I would like to thank Holger Hoos, as my second reader, for his valuable advice on my thesis draft, discussions, and providing Perl scripts to make EachMovie subsets. I would like to thank David M . Pennock for providing the EachMovie dataset subsets. I would like to thank Kirsty Barclay for taking the time to proof read my thesis draft. Finally, I would like to thank my husband, Sudhir, and my son, Ashutosh, for their unconditional encouragement and support. Without them I might not be here! I would also like to thank my parents for their encouragement and support. R I T A S H A R M A The University of British Columbia February 2001 Vll l Chapter 1 Introduction 1.1 Information Filtering In today's world, filtering vast amount of information has become an impor-tant part of daily life for an increasing number of people. The amount of information available through books, movies, news, advertisements, and in particular, on-line sources such as email, Usenet News, and Web sites, has become enormous and is considerably more than any one person can absorb. The World Wide Web has emerged as a relevant electronic marketplace. It is no longer feasible for a person to filter through this sea of information unaided, and to quickly and conveniently find the most interesting and relevant information to them. The situation is worsening. In response to the challenge of information overload, information filtering techniques are sought to help people quickly identify what will likely be of most interest to them. There are two major approaches to information filtering: content- based filtering and collaborative filtering. 1 1.1.1 Content-based Filtering Content-based filtering techniques filter the information based on their content. For example, in text documents it compares the contents of text documents with a user's profile (preferences) and selects documents whose contents match the user's profile. Examples of these systems include kill files (used to filter out advertising), e-mail filtering software (which sorts e-mail into categories based on sender information or based on keyword matches in the mail fields). The techniques used in content-based filtering can vary greatly in complexity. Keyword-based search is one of the simplest techniques that involves matching different combinations of keywords (sometimes in boolean form). Content-based filtering does not depend on having other users in the system. However, there are some limitations to content-based filtering [SM95] . Its primary limitation is its lack of inherent methods for finding related documents not specified by the user. Since in general all information about a user's prefernces is obtained from material that the user has seen, the user cannot specifically call for items that he is not aware of already. In practice, additional hacks must often be added to content-based filtering. Another problem is that the items must be in some machine parsable form, or attributes must have been assigned to the items by hand. With current technology, media such as sound, photographs, art, video or physical items cannot be analyzed automatically for relevant attribute information. Often it is not practical or possible to assign attributes by hand due to the limitation of resources. Finally, current content-based filtering makes recommendations based on analytical criteria, such as the frequency of a keyword used in the docu-ments, but it knows nothing about the quality of the documents. For example, 2 it cannot distinguish between a well written and a badly written document if both documents happen to use the same keywords. 1.1.2 Collaborative Filtering Collaborative filtering techniques filter items for a user based on human judge-ments, not on analytical criteria, as in content-based filtering. The user for whom we are predicting a rating is called the active user. It uses the knowl-edge of many users to make predictions about individual user preferences. It automates the process of word-of-mouth by which people recommend products or services to one another [SM95]. Before the existence of automated infor-mation filtering systems, we may have sought the recommendations of friends for filtering the information. Collaborative filtering assists and augments this process by relying on potentially thousands of other people, and considering potentially thousands of different items. Collaborative filtering systems build a database of user ratings of avail-able items. Users of collaborative filtering systems usually rate items they have experience with in order to establish a profile of their interests. The col-laborative filtering system then matches an active user to people with similar interests to make a recommendation for the active user. People's ratings can be explicit or implicit. Explicit ratings require that the users rate the items with a score, rank, or other classification of the item. Implicit ratings are collected by electronic agents, which monitor the user's behavior. This can be accomplished for example, by timing visits to sites using cookie files and history files, or by watching the user's browsing activities, such as save and print operations. Collaborative filtering enables the user to find new items that she may be interested in by drawing from the preferences of like-minded 3 people. In collaborative filtering, the main premise is that the active user wi l l prefer items which like-minded people prefer, or even that dissimilar people don't prefer. It relies on the fact that people's preferences are not randomly distributed; there are patterns within the preferences of a person and among similar groups of people, creating correlation. This is best illustrated by example. Imagine three users: Mike , David and Br ian have been asked to give their three favorite musicians. Mike likes: • Elvis • Buddy Hol ly • Li t t le Richard David likes: • J i m m y Hendrix • James Brown • Aretha Frankl in Br ian likes: • Li t t le Richard • E lv i s • Jerry Lee Lewis 4 Collaborative filtering compares their preferences and finds that Mike and Br ian are similar, it then swaps each man's prefernces: Mike, you may like Jerry Lee Lewis; Brian, you may like Buddy Holly. If people's preferences are not correlated, then such predictions can not be made. Collaborative filtering can be formalized as: given a set of ratings for various user-item pairs, predict a rating for a new user-item pair. It is inter-esting that the abstract problem is symmetric between users and items. Collaborative filtering has been a very lively research area in recent years. Several collaborative filtering algorithms have been suggested, ranging from binary to non-binary rating, implici t and explicit rating. Most com-parisons to date are empirical in nature [BP98, H K B R 9 9 , RIS+94, B H K 9 8 , SM95, SKR99] . Init ial collaborative filtering algorithms were based on sta-tistical methods using correlation between user preferences. Collaborative fil-tering problem is not a standard machine learning problem because the data available is very sparse and also there is no other information about the users and items available except their ratings. Recently some researchers have used machine learning methods [BP98, UF98] for collaborative filtering algorithms. These methods essentially discover the hidden attributes for users and items, which explain the similarity between users and items. Collaborative filtering is a key technology used to bui ld Recommender Systems on the Internet. It has been used by recommender systems such as Amazon.com — a book store on the web, CDNow.com — a C D store on the web, and MovieFinder .com — a movie site on the internet [SKR99]. Amazon.com suggests books to customers based on other books the customers have told Amazon they like. CDNow.com helps customers choose CDs to purchase as gifts, based on other CDs the recipient has liked in the past. 5 1.2 Contributions Most of the approaches suggested for collaborative filtering are asymmetric in nature, meaning they either attempt to find similar users based on their preferences of items (e.g., the GroupLens system [RIS+94]), or they attempt to find similar items based on user preferences (e.g. the Bayesian network model described in [BHK98]). However, the abstract problem is symmetric between users and items. There is no convincing reason to use only user similarity or item similarity. In this thesis we show that a symmetric collaborative filtering technique using both user and item similarity offers significant advantages over asymmetric techniques. In this thesis we focus on symmetric collaborative filtering. We propose and evaluate a probabilistic approach based on a noisy sensor model. We describe how a collaborative filtering problem could be solved using the noisy sensor model. Our approach is based on the idea that to predict a user rating for a particular item, we can use all those people who rated that item and other items rated by that user as the noisy sensors. We view the noisy sensor model as a belief network. The conditional probability table associated with each sensor node reflects the noise in the sensor. After learning the noisy sensor model, or the conditional probability table, associated with each sensor node, we use Bayes' theorem to compute the probability distribution for the active user's rating of a new item. There are many ways of learning the conditional probability tables asso-ciated with each sensor node. Here we give two variants of the general idea for learning the noisy sensor model: one, for explicit binary rating data; and the second, for explicit multi-valued rating data. Algorithm AlgoBayes, based on Bayesian statistics (Bayesian Learning), is for binary rating data. AlgoBayes 6 is based on the assumption that two users are similar if they rate an item the same. We use the maximum a posteriori hypothesis for computing the similarity between two users, given their preferences over co-rated items. We show that the performance of AlgoBays improves when we use both items and users as noisy sensors, which motivates us to further explore the noisy sensor model for multi-valued rating data. The algorithms (AlgoEqual, AlgoLinear) for multi-valued rating data are based on assuming a linear relationship with Gaussian error between users and items. In AlgoLinear we learn a classical linear regression model of how users rate items, and in AlgoEqual we assume that the different users rate items the same, but the accuracy of the sensor needs to be learn. We compare the two models of multi-valued rating data with state-of-the-art techniques and show how they are significantly better in all cases ranging from cases where a user has only rated two items, to cases where the user has rated many items. We report empirical results on the EachMovie database of movie ratings. We show that symmetric collaborative filtering, which employs both users and item similarity, offers better accuracy than asymmetric collaborative filtering. 1.3 Outline of the Document The rest of the thesis is organized as follows. Chapter Two presents related work. Previous research in this area is discussed. Chapter Three presents a noisy sensor model and its application to collaborative filtering. The algorithm AlgoBays for binary rating data and its results on the EachMovie database are discussed. Chapter Four presents the use of a noisy sensor model for multi-valued rating data. Algorithms AlgoLinear and AlgoEqual are discussed. The 7 empirical analysis of results and comparison between algorithms are discussed. Chapter Five presents conclusions and possibilities for future work. 8 Chapter 2 Related Work Collaborative filtering involves making predictions about a person's prefer-ences for a particular item, using pre-existing information about this user in conjunction with other users' information. Collaborative filtering incorporates the assumption that by finding similar users and examining their behavior or preferences, we can make useful recommendations for an active user. The concept of collaborative filtering originates from the work of Gold-berg et al. [GNOT92] in the area of information filtering. Tapestry is one of the first collaborative filtering systems used for recommending electronic doc-uments, such as e-mail and Netnews. Tapestry was developed at Xerox Palo Alto Research Center [GNOT92]. In this system users could attach annota-tions as they read documents. Annotations became accessible as virtual fields of the documents, and the filters that searched the annotations for interesting articles were constructed by the end user, using a special query language de-signed for this task. The query could involve many different criteria, including keywords, subject, author and annotations given to the document by other people. The collaborative filtering provided by Tapestry was not automated but the recommendations of other people were taken into account. In this sys-9 tem the user had to specify other users with whom she would share interests inorder to obtain the recommendations [GNOT92]. GroupLens [RIS+94], a neighborhood-based algorithm, is one of the best known early automated collaborative filtering systems for recommending the articles of Usenet news. GroupLens used the Pearson correlation coefficient to weight user similarity, used all those users who rated the item j , and computed a weighted average of deviations from the user's mean: Sui represents the prediction for the user u for item z, n is the number of users, Su represents the average rating for user u and rau is the similar-ity between the active user and user u as defined by the Pearson correlation coefficient. where the summations over m are over the co-rated items, items rated by both users a and u. The Ringo music recommender [SM95] expanded upon the GroupLens algorithm. Shardanand and Maes [SM95] compared four different recommen-dation algorithms based on the mean errors in prediction using each algo-rithm. They claimed better performance by computing correlations using a constrained Pearson correlation coefficient: where 4 was chosen because it was the midpoint of their seven-point rat-ing scale. Ringo considers only those neighbors whose correlation was greater r, T.T=i (Sai-A)*(Sui-A) v t e i (Sai - 4)2 * £r=i (Sm - 4)2 10 than a fixed threshold. To generate predictions, Ringo computed a weighted average of ratings from all users in the neighborhood. Breese et al. [BHK98] proposed the use of vector similarity, based on vector cosine to weight user similarity. This method is often employed in the field of information retrieval for measuring the similarity between two documents. A weighted average of ratings from all users in the neighborhood is used to generate the prediction. The authors also performed an empirical analysis of neighborhood-based collaborative filtering algorithms. The authors found that vector similarity does not perform as well as Pearson correlation [BHK98]. Neighborhood or Correlation based algorithms predict the active user rating as a similarity-weighted sum of the other users ratings. These algorithms are simple, work reasonably well in practice, and new data can be added easily and incrementally. They are also referred to as memory based algorithms [BHK98]. These algorithms become computationally expensive as the size of the database grows. All the above mentioned neighborhood-based algorithms [RIS+94, SM95, BHK98] are asymmetric in nature. They implement only users' similarity in making predictions. These algorithms do not account for what amount of trust can be placed in a similarity measure with a neighbor. It is not uncommon in a collaborative filtering system for the active user to have highly similar neighbors assessed on a very small number of co-rated items (often three to five co-rated items). The more co-rated items available for comparing the opinions of two users, the more we can trust that the computed similarity is representative of the true similarity between them. While trying to fit linear relationship, one can often get a good linear fit with small amount of data, even though the sensor is not 11 a reliable sensor; one may have a better sensor with more data, but not such a good linear relationship. Our prosposed noisy sensor models for collaborative filtering address this problem. Breese et al. [BHK98] proposed several model based collaborative filter-ing algorithms. In model based collaborative filtering, first the user database is used to learn a model of users, items, and/or ratings; predictions are then made using the model. The authors described and evaluated two probabilis-tic models for collaborative filtering: cluster models and Bayesian networks. In the cluster model, users with similar preferences are clustered together. This model structure is a naive Bayesian network, wherein a user's preferences regarding various items are independent given his class membership. The model's parameters, the number of clusters, and the conditional probability of ratings given class are estimated from the training data. In the Bayesian network, nodes correspond to items in the database. The state of each node corresponds to the possible ratings for that item. The structure of the network and the conditional probabilities are learned from the training data. Breese et al. showed that a Bayesian network approach outperformed correlation-based and cluster-based methods on an implicit rating database. Ungar and Foster [UF98] suggest a clustering methods for collaborative filtering. Both users and items are classifieds into clusters. For each cluster of users, the probability that they like each cluster of items is estimated. They use K-means clustering and Gibbs sampling algorithms for clustering and model estimation. To compare the algorithms, they use both synthetic and real data. Poole et al. [PHBOO] proposed symmetric collaborative filtering for binary ratings data. They proposed two alternative algorithms for clustering the users and items based on their similarity: Clustering with Decision Trees and Clustering with Tables. 12 Memory requirements for model based algorithms are generally less than for storing a full database. Predictions can be calculated quickly once the model is generated, though the time needed to learn the model may be pro-hibitive, and addition of new data may require a full recompilation. Pennock et al. [PHLGOO] propose a collaborative filtering algorithm called Personality Diagnosis (PD). This algorithm is based on a probabilistic model of how people rate items. In this approach it is assumed that each user reports ratings with Gaussian error. Given the active user's known ratings for the items, the probability that the user has the same personality type as every other user is computed. After computing the personality type for an active user, the probability distribution of the active user's rating for a new item is computed, and the most probable rating returned as the predicted value. Au-thors present the empirical evaluation of PD with other neighborhood-based and model based algorithms on the Eachmovie and CiteSeer databases. They show that PD makes better predictions than four other algorithms (Correla-tion, Vector Similarity, Bayesian Network, and Bayesian Clustering). Personality Diagnosis makes a strict assumption that the variance a2 of the normal distribution of a user's rating for an item is the same for all users: this parameter was tuned to maximize accuracy. 13 Chapter 3 Symmetric Collaborative Filtering Using The Noisy Sensor Model 3.1 Filtering Problem and Mathematical Notation Let ./V be the number of users and M be the total number of items in the database. S is a N x M matrix of all user's ratings for all items, Sui is the rating given by the user u to item i. In collaborative filtering, 5, the user-item matrix, is generally very sparse since each user will only have rated a small percentage of the total number of items. With this formulation, the collaborative filtering problem becomes predicting those Su{ which are not defined in S, the user-item matrix. The active user is denoted by a such that a e { l , 2 , . . . , i V } . Let the ratings be on a cardinal scale with ra values that we denote 14 ui, V2, • • •, vm. Then each rating Sui has a domain of possible values (vi, v2,..., v. In our equations, it is always assumed that we perform operations on those values of Sui that are exist. 3.2 Noisy Sensor Model We propose a simple probabilistic approach for symmetric collaborative fil-tering using the noisy sensor model [RN95] for calculating the rating by the active user a of an unseen item j, based on the ratings of those users who rated the item j, and the observed ratings of the active user on other items. We designate all those users who rated the item j and all other items rated by active user a to be the noisy sensors. The noisy sensor model is depicted as a naive Bayesian network in Figure 3.1. The direction of the arrow shows that the prediction of active user a for item j "causes" the sensor u to take on a particular prediction for item j and similarly, the prediction of active user a for item j "causes" sensor a to take on a particular prediction for item k. The sensor model is the conditional probability table associated with each sensor node. The noise in the sensor is reflected by the probability of incorrect prediction; that is, by the conditional probability table associated with it. To make the model simple we use the independence assumption that the prediction of any sensor for item j is independent of others given the prediction of active user a for item j. We need the following probabilities for Figure 3.1: Pr (Suj\Saj) : the probability of user u's prediction for item j , given the prediction of active user a for item j. Pf (Sak\Saj) '• the probability of active user a's prediction for item k, given the prediction of active user a for item j. 15 Figure 3.1: Naive Bayesian Network Semantics for The Noisy Sensor Model Pr (Saj) • the prior probability of active user a's prediction for item j. We compute the prior probability distribution Pr (Saj = Uj) of the ac-tive user's rating for item j by the fraction of rating in the training data set, where Vi G (t>i, v2, • • •, vm). Given the conditional probabilities table for all noisy sensors, we can compute the probability distribution for the active user's rating of an unseen item j using the noisy sensor model, described in Figure 3.1. By applying Bayes' rule we can have the following: Pr (Saj\ (5*1 j , • • • , SNJ) A (Sal, • • • , SUM)) oc Pr (Saj) • itU Pr (SUJ\Saj). TltLi Pr (Sak\Saj) We have showed how a collaborative filtering problem could be solved using the noisy sensor model, now we need the probability table for probabil-ities Pr (Suj\Saj) and Pr (Sak\Saj) to make the resultant recommendation. Consider first the problem of estimating Pr (Suj\Saj), which is the prob-lem of estimating user it's rating for item j given active user a's rating of it. Generally, in collaborative filtering the data available is very sparse. There-16 fore, to compute the probability table from sparse data we need to make some prior assumptions about the relationship. In this thesis we propose three dif-ferent algorithms for learning the noisy sensor model. The first algorithm, AlgoBayes, the simplest algorithm for explicit binary rating data, is based on Bayesian learning. The second algorithm, AlgoLinear, for multi-valued rating data, is based on a linear relationship with Gaussian error between the pref-erences of users and similarly between the preferences received by items. The third algorithm AlgoEqual, a variant of AlgoLinear, is based on the idea that different users rate items the same, but the accuracy of the sensors need to be learned. 3.3 Algorithm AlgoBayes The algorithm AlgoBayes is based on Bayesian learning. Suppose that active user a and user u co-rated n items, and that their ratings over n co-rated items are denoted by n pairs of observations (xi, yi), (x2, y2), • • •, (xn, yn). Then xi denotes the rating given by user a and y\ denotes the rating given by user u for the Ith observation. We assume the binary ratings as 1 and 0. To compute the probability table Pr (Suj\Saj) we need to learn two parameters. Generally, in collaborative filtering the data available is very sparse. Therefore, to reduce the number of parameters we assume the following: Pr (Suk = l\Sak = 1) = Pr (Suk = 0\Sak = 0) = 8au, for random k where M is the total number of items in the database. Now, we have the following: Pr (Suk = Sak) = 0au, for random k Pr (Suk 7^ Sak) = 1 - 9au, for random k We want to find the maximum a posteriori value of the parameter 6au 17 given the ratings of user a and u over all co-rated items. Pr ((Sak — Suk) \D) is the posterior probability of users a and u for rating the items identical. It reflects our confidence that users a and u will rate the items same after we have seen the data D (ratings given by users a and u over n co-rated items). Using Bayes' rule we can write the posterior probability of similarity thus: P nq q \\m Pr(D\(Sak = Suk))*Pr((Suj = Saj)) Pr{{Sak = Suk)\D) = p~r~{D) Pr (D) is a normalizing constant and Pr (Suj — Saj) is the prior prob-ability of users a and u for rating the items same. We need a method to find the prior distribution of probability Pr (Sak — Suk). We assume the beta dis-tribution for the prior probability as it is convenient and most commonly used distribution [Hec95]. Pr(Sak = Suk) = 6?u*(l-9an)K where K > 0 is the parameter of beta distribution. Now, Pr (D\ (Sak = Suk)) = n?=i Pr (D,\(Sak = Suk)) The datacase D\ = (x^yi) denotes the rating given by users a and u to the Ith co-rated item. Pr (Di\ (Sak = Suk)) = Pr (xi,yi, \ (Sak = Suk)) The ratings of users a and u are not independent given that they rate the items same. Hence, the above equation can be written thus: Pr (Di\ (Sak = Suk)) = Pr (xi\yi A (Sak = Suk)) * Pr (yi\ {Sak = Suk)) The probability Pr (yt\ (Sak = Suk)): probability of user u's rating for item / given that users a and u rate the items same, is a constant because user a's rating is not known. It can be computed as the frequency of the rating yi in the training set. The probability Pr (x/|t// A (Sak = Suk)) is 0au or 1 — 6au 18 depending upon whether the ratings x\ and y\ are the same or not. #au*constant if xi = yi Pr(Dl\(Sak = Suk)) = \ . (1 — 9au) *constant if xi ^ yi If users a and u give the same ratings on / items and different ratings for m items, the posterior probability denoted by M Pr((Sak = Suk)\D) = M = Pr(D/(Sak = Sk))*Pr(SUJ=Sa3) = c m s t a n t + C K ^ _ K Pr (D) To find the maximum a posteriori value of the parameter 9au we differ-entiate M w.r.t 0au and set equal to zero. The maximum a posteriori value of 6au is given as: 6 l + K U a u ~ l + m + 2K The value of 0au shows how similar the users a and u are. The Bayesian learning approach for computing 0au, the maximum a posteriori hypothesis, takes into account the prior probability which helps us in two ways. First, if users a and u give the same ratings over all co-rated items then the probability table of the sensor u will be deterministic for K = 0 ( but K > 0 avoids this problem). We do not want a deterministic sensor in our model as it discounts the effect of other sensors and the deterministic prediction of the active user a for unseen item j will be based on only one sensor. Second, maximum a posteriori hypothesis approach does not enable the best sensor based on fewer co-rated items. For example, if users a and u rated 4 items together and I = 3,m = 1. If users a and v rated 12 items together and / = 9,m = 3 then 6av = .71 and 6au = .66 for K = 1 and 9au = 9av for K = 0. For K = 1, 8av > 6au, which means user a is more similar to user v than to user 19 u. Therefore, clearly the best or most reliable sensors will be based on more co-rated items. On the other hand, if we use a maximum likelihood technique for computing the value of 9au then the above mentioned problem will arise, as it does not consider the prior probability. We compute 9au between the active user and all other users who have rated the item j and also between the item j and all other items rated by the active user a. To find the reliability of the noisy sensor we use the value of 9au; the greater the value of 9au the more reliable the noisy sensor. Note, for reliability we consider only similarity, not dissimilarity. We use the best U user noisy sensors and best / item noisy sensors for computing the active user a's prediction for unseen item j. The parameters setting for U and / is explained in the next section. To predict a rating (for example, to compare it with other algorithms that predict ratings), we predict the expected value of the rating. The expected value of the rating is defined as follows: E (Saj) = Pr (Saj = 1| (Sij, . . • , SNj) A (S^ , . . . , SaM)) * 1 + Pr (SaJ = 0| (Sij, • • • , SNj) A (Sal, • • • , SaM)) * 0 3.4 Results for AlgoBayes There are no other explicit binary rating collaborative filtering algorithms to compare with, nor any explicit binary rating data to measure the performance of AlgoBayes. To test AlgoBayes we use the EachMovie database, available from the Digital Equipment Research Center. The EachMovie database con-tains ratings from 72,916 users for 1,628 movies, elicited on a integer scale from zero to five. We binarised the original rating data using a split point of 2.5. The ratings > 2.5 are mapped to 1 and ratings < 2.5 are mapped to 20 0. We extract the subset of data for a restricted number nu of users and nm of movies. The resulting data sets are partitioned into the training and test sets by randomly selecting a fraction of ratings and moving them into the test set, keeping all remaning ratings as the training set. We created five subsets from the data by selecting one thousand users and one thousand movies. We select one thousand movies and five sets of thousand users. We keep the same one thousand movies' in each subset but we select a different one thousand users. We also dreated two small subsets, one containing one hundred users and one hundred movies, and the other containing two hundred users and two hundred movies. We compare the performance of AlgoBayes with Correla-tion [RIS+94] and Decision Tree [PHB00] using the average absolute deviation metrics explained in Section 4.3 of the next chapter. To see the effect of K on the performance of AlgoBayes we did exper-iments with different values of K. Figure 3.2 shows the variation of average absolute deviation with user noisy sensors for different values of K. This ex-periment was performed using best ten item noisy sensors: with K = 0, the performance of the algorithm is very poor. When we do not take the prior probability into consideration, AlgoBayes does not find good sensors. From Figure 3.2 we can not see clearly the performance of the algorithm for K > 0. Therefore, we show the performance of AlgoBayes for K = 1 to 3 in Figure 3.3. The average absolute deviation error with zero user noisy sensor, as shown in Figures 3.2 and 3.3, is based on the best ten item noisy sensors. The Figure 3.3 shows that for the small number of user noisy sensors (for example 15) the performance of the algorithm is better with K = 3 and 4, but the aver-age absolute deviation error fluctuates a lot with small number of users and when it stabilizes then the performance of the algorithm is best, with K = 1. However, the improvement in accuracy is on the third place of decimal. We 21 set the value of K as 1 for all further experiments reported in this section. 0.81 K=0 0 K=1 o K=2 * K=3 • K=4 60 80 100 user noisy sensors 120 140 Figure 3.2: Effect of K on algorithm AlgoBayes. This experiment was per-formed using 7=10. Figure 3.4 shows the variation of average absolute deviation with best user noisy sensors for different numbers of best item noisy sensors for AlgoB-ayes. It shows that the accuracy of AlgoBayes increases as we include items as noisy sensors along with the user sensors. It also shows that the average ab-solute deviation error first decreases with the increase in number of user noisy sensors and then increases as more user noisy sensors are user for prediction. This is because the large number of user sensors results in too much noise for those user sensors that have good reliability. It shows that AlgoBayes gives 22 0.16 0.148 60 80 100 user noisy sensors Figure 3.3: Effect of K on algorithm AlgoBayes. This experiment was per-formed using I = 10. best accuracy with 10 items noisy sensors. With did experiment with all sub-sets and found that AlgoBayes gives better accuracy with ten-to-twenty items and forty-to-seventy users as noisy sensors. We set the parameter U as sixty (sixty users noisy sensors) and I as ten (ten items noisy sensors) for AlgoBayes for the experiments following in this section. The parameters U and I depend on the database in case of EachMovie database the number of users are more than the movies and each user has rated only few movies because of this the number of best user noisy sensors are more than the number of best item noisy sensors. 23 0.18 0.175 0.17 CO '> CD | 0.165 0.16 0.155 0.15 0 item noisy sensors 0 5 item noisy sensors O 10 item noisy sensors 15 item noisy sensors * 20 item noisy sensors . *.-5 - M l o O o o o o 20 40 60 80 user noisy sensors 100 120 140 Figure 3.4: The average absolute deviation as a function of the number of best user noisy sensors for different numbers of best item noisy sensors. Table 3.1 shows the results for AlgoBayes, Correlation and Decision Tree on EachMovie database. Table 3.1 shows that AlgoBayes works better than Correlation over all instances and better than Decision Tree over all 1000 x 1000 instances. The results of AlgoBayes for symmetric collaborative filtering using a noisy sensor model are very good and motivate us to further explore the use of a noisy sensor model for multi-valued rating data. The application of a noisy sensor model for multi-valued rating data is presented in the next chapter. / 24 Instances AlgoBayes Algorithm Decision Tree Correlation emlOO x 100 .242 .216 .321 em200 x 200 .169 .152 .308 emlOOO x 1000 — a .133 .145 .232 emlOOO x 1000 - b .154 .161 .262 emlOOO x 1000 — c .198 .228 .324 emlOOO x 1000 -d .236 .245 .337 emlOOO x 1000 — e .225 .240 .336 Table 3.1: Average absolute deviation scores on the EachMovie data for Algo-Bayes, Decision Tree and Correlation (lower scores are better). 25 Chapter 4 Noisy Sensor Model for Multi-Valued Rating Data We have described the Noisy Sensor model and its application in a collabora-tive filtering algorithm for explicit binary rating data in the previous chapter. However, very often one wants to make recommendations based on multi-valued ratings data. To compute Pr (Suj\Saj), ra x ra probability table, where m is the possible number of ratings in multi-valued rating data, we need to learn ra2 parameters. However, the data available is very sparse and we need to make some prior assumptions about the relationship. To make the model simple or to reduce the number of learning parameters. Here, we assume that there is a linear relationship with Gaussian error between the preferences of users and similarly between the ratings received by the items. Because this is the simplest model as supported by Occam's razor [Mit97]. In this model we need to learn only three parameters, it reduces our problem of learning ra2 parameters to three parmeters. Here we give two variants of the general idea for learning the noisy sensor model for explicit multi-valued rating data: one, where we learn a 26 linear relationship with Gaussian error between the preferences of users and between the ratings received by items (AlgoLinear); and another, where we assume that the different users rate items the same (the preferences of active user a and another user u are the same) and learn the variance of uesr u's prediction (AlgoEqual). 4.1 AlgoLinear In AlgoLinear we learn the linear relationship with Gaussian error between the preferences of users and between the ratings received by items. Suppose the rating of the active user a (the independent variable) is denoted by x and the rating of the user u (the dependent variable) is denoted by y. Suppose that the active user a and user u co-rated n items and their ratings over n co-rated items are denoted by n pairs of observations (x\,yi) , (x2, y2), • • •, (xn, yn). We want to find a straight line which best fits these points. We use classical normal linear regression model to find the relationship between the preferences of the active user and other user. We assume that the mean of y can be expressed as a linear function of independent variable x. Since a model based on the independent variable cannot predict exactly the observed values of y, it is necessary to introduce an error term et-. For the ith observation, we have the following: yi = a + (3xi + a We assume that unobserved errors (ej) are independent and normally distributed with a mean of zero and the variance a2. Since y,- is a linear function of e8-, which is normally distributed, yi itself will also be normally distributed. We assume that the variance a2 is the same for all observations. The mean and variance of yx are given thus: 27 E (yi) = n = a + fixi var (yi) = a2 For the ith observation, the probability distribution function may be written thus: P(y = Vi\x - xi) 1 exp -1 2^2 V2Tra2 where — a + fixi The joint probability distribution function or the likelihood of all the observations is the product of the individual P (yi/xi) over the entire observa-tions. We denote the likelihood function by LF (a,{3, a2). LF(a,(3,a2) = UP^x;) i=l 27T<72) wexp We apply the maximum likelihood method [Guj95] to estimate the un-known parameters (a, f3,cr2). Maximizing the likelihood of LF is equivalent to maximizing the log likelihood of LF. The log likelihood of LF can be expressed thus: log LF = logf[P(yi\xi) i-l n t=i = - J log (2TT) - n- log (a2) - - L £ ( y , . - ( a + 0 xt)\ 28 Differentiating the above equation partially with respect to a, (5 and a2 and setting equal to zero, we get the following values of the parameters: ^ _ n E Vi ~ E ViXj n = -I>? 2 n 2 After calculating the parameters a, (3 and <7 we can write the proba-bility distribution of the user u's preference for item j given the active user a's preference of item j as follows: Pv (Suj u^j'I'S'aj Xaj) — P (y — ^ ujl^ — ^ aj) 1 -exp -1 2 — (xuj - (a + pxaj)) V2TTO-2 After computing this probability distribution for all users who rated the item j, and for all the items rated by the active user a, we can compute the probability distribution of the active user's rating for item j using the noisy sensor model as described in the previous chapter. When the linear relationship exceeds the maximum value of the rating scale, we use the maximum value and when it is lower than the minimum value of the rating scale, we use the minimum value. To predict a rating (for example, to compare it with other algorithms that predict ratings), we predict the expected value of the rating. The expected value of the rating is defined as follows: m E(Saj) = J2Pr (^W = ' y ' l ( ^ l i ' • • • >£wj ) A (Sai, • • -,SaM)) * Vi 29 4.1.1 K dummy observations While trying to fit lines with sparse data, we often find a perfect linear rela-tionship, even though the sensor isn't perfect. If there is a perfect fit between users a and u then the variance a2 will be zero according to the above cal-culations. Therefore, the sensor u's prediction for item j will be perfect or deterministic; that is, the conditional probability table associated with sensor u will be purely deterministic. We do not want this for our noisy sensor model because the deterministic sensor will discount the effect of other sensors. We hypothesize that the above mentioned problem can be overcome if we add K dummy observations along with n observations (co-rated item). We assume that active user a and user u give ratings over K dummy items (K > 0) in such a way that their ratings for K dummy items are distributed over all possible rating pairs. For m scale rating data there are m 2 possible combination of the rating pairs. We compute the prior distribution of each rating pair by the frequency of that rating pair in the training data. That is, in the training data for all user pairs we count the occurrence of their rating pairs over their co-rated items and then we normalize this to compute the prior distribution each rating pair. We use the prior distribution of rating pairs for distributing the effect of K dummy points over all rating pairs like a hierarchical prior, which reduces our ability to guarantee that the ratings for K items are distributed overall possible rating pairs. For example, if users a and u co-rated 2 movies in EachMovie database. Suppose for the first movie the rating of user a is 1 and the rating of user u is 2, and for the second movie the rating of user a is 3 and the rating of user u is 5. So, the observed rating pairs are: (1,2) and (3,5). In EachMovie database the ratings are on a scale of 0 to 5. The possible number of rating pairs are 36 30 ( (0,0) , (0,1), . . . , (5, 5) ). If the prior of each rating pair is same (i.e. prior = 1/36) then effect of K = 1 (1 dummy point) on each pair is 1/36. Now, there is already one observed point at (1,2) and (3,5), therefore, the effect of (1,2) and (3,5) rating pair is (1 + 1/36) and the effect of the rest of the rating pairs is 1/36. For calculating the coefficients a, f3 and cr2 we need the values of Yuu Y x i , YxiVii Y xi a n d Y^f- Suppose h{ denotes the prior distribution of the ith rating pair and X{,Y{ denotes the x and y values of the ith rating pair. The expressions for Y Hi-, Y xn Y xiVi, Yx] a n d Ye] c a n then be written as follows: Y yt = YU y« + Yti (K * hi) * Yi = + (A' *hi)*Xi Y xiyi = £ ? = i xiVi + E;?i (K * hi) * XiYi Y x2 = YU x> + YT=i (K * hi) * Xi2 Yef = Yti (Vi - « - M 2 + YT=i K *hi*(Yi-a- BXt)2 We did experiments with different values of K and its effect on the performance, explained in the next section. 4.1.2 Selecting Noisy Sensors To determine the prediction for an active user we could use all items and users, but we do not need to. Instead, we investigate using only the most reliable item and user sensors. For determining the reliability of the noisy sensors, we consider the goodness of fit of the fitted regression line to a set of observations, we use the coefficient of determination r 2 [Guj95], is a measure that tells how well the regression line is fitted to the observations. It measures the proportion or percentage of the total variation in the dependent 31 variable explained by the regression model, r 2 is calculated as follows [Guj95]: 2 i —^ r = 1 —2". E (yi - y) where y is the mean of the ratings of user u. The value of r 2 lies between 0 and 1 (0 < r 2 < 1). If r2 = 1, there exists a perfect linear relationship between the preferences of active user a and user u; that is, e; = 0 for each observation (co-rated item). On the other hand, if r 2 = 0, it means there is no linear relationship between users a and u and the best fit line is a horizontal line going through the mean y. We order the user and item noisy sensors according to r 2 . We use the best U user noisy sensors and best I item noisy sensors for making the prediction. The parameters setting for U and I is explained in the next section. 4.1.3 Variant AlgoEqual of AlgoLinear The problem with AlgoLinear is that we are often fitting linear relationship with very little data (co-rated items). It may be better to assume a priori the linear model and then just learn the noise. The algorithm AlgoEqual is based on the idea that different users rate the items same. We assume that the preferences of active user a and another user u are the same; that is, the expected value of user it's preference of any item is equal to active user a's preference for that item. E (yi\x = Xi) = n = Xi We learn the variance of user it's prediction. The algorithm AlgoEqual can be derived from algorithm AlgoLinear by making the regression coefficients 32 a = 0 and 8 = 1. We also add the K dummy observations because the same problem can arise in AlgoLinear. We are not fitting the relationship between active user a and user u, but we assume an equal relationship. In this case it doesn't make sense to use the coefficient of determination r2 for finding the reliability of the noisy sensors. To find the reliability of the noisy sensor, therefore, we use the variance; the less variance the more reliable the noisy sensor. We use the best U user noisy sensors and best I item noisy sensors for making the prediction. The parameters setting for U and I is explained in the next section. The addition of K dummy point also helps in finding the good reliable sensors based on more co-rated items. For example, if two users have few co-rated items, the effect of K dummy points becomes significant and the variance will be more. For large n or more co-rated items, the effect of K dummy points is insignificant. We evaluate empirically both AlgoLinear and AlgoEqual with state-of-the-art techniques on EachMovie database. The evaluation criteria, the various protocols, the dataset used in the analysis, and the experiments are all presented in the next section. 4.2 Evaluation Framework The most common approach taken to evaluating the accuracy of collaborative filtering algorithms is the training and test set approach. In this approach, the dataset of users (and their ratings) are separated into two: a training set, which is used as the collaborative filtering dataset; and a separate test set, which is used to evaluate the accuracy of the collaborative filtering algorithm. We treat each user from the test set as the active user. To carry out testing, we divide the ratings for each test user into a set of ratings that we treat as 33 observed, Ia, and a set that we will attempt to predict, Pa. We attempt to predict the ratings in Pa using a CF algorithm, observed ratings, and training data. 4.2.1 Metrics To evaluate the accuracy of collaborative filtering algorithm we use the follow-ing metrics: Coverage metrics evaluate the number of items for which the col-laborative filtering algorithm could provide the prediction. The coverage is computed as the percentage of items for all users of which a prediction is re-quested and for which the system is able to produce a prediction. Maximal coverage may be less than 100% coverage because there may be no ratings in the data for certain items, or because very few people rated an item, or because all user sensors had zero co-rated items with the active user. All experimental results shown in this thesis have maximal coverage by design. Statistical accuracy metrics evaluate the accuracy of the collabo-rative filtering algorithm by comparing the predicted rating against the user observed rating for the items that have both predicted and observed ratings. Breese et al. [BHK98] and Pennock et al. [PHLG00] both have used average absolute deviation to measure the performance of the collaborative filtering algorithm. Other metrics that have been used are root mean squared error [RIS+94]. Both the above metrics generally provide the same conclusions. In this thesis we only report average absolute deviation as it is the more com-monly used metric. We calculate the average absolute deviation of the predicted rating against the actual rating of items by users in the test set. That is, if the 34 number of predicted ratings in the test set for the active user is n a , then the average absolute deviation for a user is given as follows: Sa= ^ J2jePa \Pa,J ~ r<M'l> where Pa is the test set for active user a, paj is active user a's observed rating for item j and raj is active user a's predicted rating for item j. These absolute deviations are then averaged over all users in the test set of users. This metric measures how accurate the collaborative filtering algo-rithm is and gives an estimate of the average error a user of the collaborative filtering algorithm can expect. The lower the average absolute deviation, the more accurate the collaborative filtering algorithm is. Statistical significance is assessed for the difference in average abso-lute deviation between two algorithms using the randomization paired sample test of differences of mean [Coh95]. The sampling distribution of the mean difference between two algorithms is achieved by repeatedly shuffling and re-calculating the mean difference in 10,000 different permutations. The shuffling involves reversing the sign of the difference score for each sample with proba-bility of .5. The probability of achieving a difference less than or equal to the mean difference is reported as a result. That is, the probability of incorrectly rejecting the null hypothesis that the deviation scores of both algorithms arises from the same distribution [Coh95]. 4.2.2 Data and Protocols We compared the explicit multi-valued versions of our noisy sensor model to PD (Personality Diagnosis) which its authors recently compared to a number of other memory and model based algorithms [PHLG00]. To compare the performance we used the same subset of the EachMovie database as Breese 35 et al. [BHK98] and Pennock et al. [PHLGOO] had used, consisting of 1,623 movies, 381,862 ratings, 5,000 users in the training set, and 4,119 users in the test set. On average, each user rated about 46.3 movies. We also tested the algorithms on other subsets to verify that our findings are not overly biased by the peculiarities of the subset. To understand the effect of the amount of data on the prediction accu-racy of the collaborative filtering algorithm, we ran experiments to examine how well the algorithm predicts given the different amounts of information about a user's preferences. As in [BHK98] for the AllButl protocol, we with-held a single randomly selected rating for each user in the test set, and tried to predict its value given all other ratings by the user. As in [BHK98], for each user in the test set, for each GivenX protocol, we selected X items at random from these the user has rated. Using these selected ratings as the observed ratings, we then attempted to predict ratings for the remaining items. We did the experiments for X — 2, 5 and 10. With fewer items rated we expect a decrease in accuracy. As in [BHK98], users who have not rated enough items to satisfy a protocol are eliminated from that protocol, so the exact number of predictions in each GivenX protocol varies. For reporting the results we use the same observed ratings and test ratings for each test user of the test set as Pennock et al. [PHLGOO]. We also tested the algorithms by randomly selecting the observed and test ratings for each test user of the test set. 4.2.3 Selecting K for AlgoLinear The performance of AlgoLinear depends on the value of K. To select its value we did experiments with different values of K for all protocols. The value of K can not be zero because it can make a sensor deterministic (for details 36 see section 4.1.1). Figure 4.1 shows the effect of K for AllButl and GivenlO protocols. Figure 4.2 shows the effect of K for Given5 and Given2 protocols. AllButl Protocol 1.15 1.1 0.95 0.9 K=0.25 0 K=.5 o K=1 * K=2 • K=3 * * * * * * * * * * * * * * * * * Q. o. 20 40 60 80 user noisy sensors GivenlO Protocol 100 120 140 1.2 I 1.15 CD •a CD i I 1 1 o in io 1-05( 1 0.95 K=0.25 0 K=.5 O K=1 * K=2 • K=3 i i i i i i I j J * * * * * * * * * * * * * * * * * * * * * ^ H ^ M U U ? m 8 8 8 8 8 8 ? 8 § § 8 ? <! i i i i i i i 20 40 60 80 user noisy sensors 100 120 140 Figure 4.1: Effect of A'for AllButl and GivenlO protocols on AlgoLinear The average absolute deviation error is more when K > 1 for all pro-tocols. This is because the effect of the error introduced by K dummy points for large K is significant and hence, AlgoLinear is not finding good user sen-sors for the active user a. AlgoLinear works better for AllButl protocol with K = .5 but for GivenlO and Given5 protocols the performance is better with K = .25 but the difference is not significant. This is the expected behaviour because the effect of K dummy points for large K is more than the effect of the amount of information available in GivenX protocol, which decreases the 37 1 K=0.25 0 K=.5 o K=1 * K=2 • K=3 Given5 Protocol 1 * * • * ' • * * * * * • * 1.1 9. Q OOg O G O O O O O O. O O . O . O O O O Q O O O O O O Q__0_-&--£—£-.* * O Q o P. 0 0 0 0 20 40 60 80 user noisy sensors Given2 Protocol 100 120 I I K=0.25 0 K=.5 o K=1 * K=2 • K=3 * * * * * * * * * * * * * * *_ 20 40 60 80 user noisy sensors 100 120 140 140 Figure 4.2: Effect of A'for GivenS and Given2 protocols on AlgoLinear accuracy of the algorithm. The effect of K on Given2 protocol is different than on other protocols because each user has rated only two items and co-rated items between users are less than or equal to 2. With so much less information AlgoLinear does not fit the good relationships between users and hence it is not finding good user sensors. We can also see from Figure 4.2 that for Given2 protocol the AlgoLinear performs better when we consider only items as a noisy sensor, while the average absolute deviation error increases with the increase in num-ber of user sensors. This is because the items are rated by many users and AlgoLinear fits them better. 38 Figure 4.2 shows that the algorithm gives better accuracy with K = .5 and K = 1 for Given2 protocol when using user sensors 30 - 70. Based on all these experiments we found that overall AlgoLinear gives best performance with K = .5 for all protocols. For subsequent experiments we, therefore, chose K = .5 for AlgoLinear. 4.2.4 Selecting K for AlgoEqual The performance of AlgoEqual also depends on K. The effect of K on the performance of AlgoEqual is different from the effect on AlgoLinear because in AlgoEqual we are not fitting the relationship between users. Instead, we are using K for learning the variance. To select the value of K we did experiments with different values for all protocols. Figure 4.3 shows the effect of K for AllButl and GivenlO protocols. Figure 4.4 shows the effect of K for Given5 and Given2 protocols. AlgoEqual performs best for the AllButl, GivenlO and Given5 protocols with K = 1 and K = 2. The error is more when K < 1 or K > 2. Figures 4.3 and 4.4 show that with a decrease in the amount of available information (for GivenX protocol) the performance of AlgoEqual improves for lower values of K. Figures 4.3 and 4.4 also show that there is no significant difference in the accuracy with K = 1 and K = 2 for AllButl, GivenlO and Givend protocols. Figure 4.4 shows that AlgoEqual performs best for Given2 protocol (for user sensor range thirty-to-seventy) with K = .25 to K = 1. Based on all these experiments we found that overall AlgoEqual gives best accuracy with K = 1 for all protocols. For subsequent experiments we, therefore, chose K = 1 for AlgoEqual. 39 AllButl Protocol K=0.25 0 K=.5 o K=1 * K=2 • K=3 60 80 user noisy sensors Givenl 0 Protocol 100 120 140 § 0.95 0.9 K=0.25 0 K=.5 o K=1 * K=2 • K=3 20 40 60 80 user noisy sensors 100 120 140 Figure 4.3: Effect of K on AlgoEqual for AllButl and GivenlO protocols 4.2.5 Selecting Noisy Sensors After learning the noisy sensor model we determine which noisy sensors should be used in making the prediction for the active user. Figure 4.5 shows the variation of average absolute deviation with user noisy sensors for different numbers of items noisy sensors for AlgoLinear. Fig-ure 4.6 shows the same for AlgoEqual. This experiment was performed on AllButl protocol with K = 0.5. We used the same training and test set as described above but the test rating and the observed ratings for each test user of the test set were selected randomly. 40 1.2 Given5 Protocol K=0.25 0 K=.5 o K=1 . * K=2 • K=3 "4 $ $ 60 80 user noisy sensors Given2 Protocol 100 120 K=0.25 0 K=.5 o K=1 * K=2 • K=3 * * * 40 60 80 user noisy sensors 100 120 140 140 * * * * * * * * * * * * * * * * - 0 - e - e ~ Q g o y H ^ H ^ 9 y* 9 Y* Y Figure 4.4: Effect of K on AlgoEqual for Given5 and Given2 protocols Figures 4.5 and 4.6 show that the average absolute deviation of the prediction decreases with the increase in number of item sensors. There is no significant improvement in accuracy when the number of item sensors is more than twenty. It also shows that the average absolute deviation first decreases with the increase in number of user sensor and then increases as more user sensors are used for prediction. This is because the large number of user sensors results in too much noise for those user sensors that have good reliability. From the experiments, we found that both algorithms give better per-formances with ten-to-twenty five item noisy sensors. We also found that algorithm AlgoLinear gives less average absolute deviation error with forty-41 1.25 r 0 item noisy sensors 5 item noisy sensors O 10 item noisy sensors 15 item noisy sensors * 20 item noisy sensors + 25 item noisy sensors 1.15 K > <D •o 0) 1.05* 0.95 0.9 L 10 20 30 40 50 60 70 80 90 user noisy sensors 100 110 120 130 140 150 Figure 4.5: The average absolute deviation as a function of the number of best user noisy sensors for different numbers of best item noisy sensors. This experiment was performed using AlgoLinear and K = 0.5 for AllButl protocol. to-seventy user noisy sensors and AlgoEqual gives better performance with twenty-to-fifty user noisy sensors. For the experiments reported in the fol-lowing section, we use the best fifty user noisy sensors and best twenty item noisy sensors (i.e. U = 50 and / = 20) for AlgoLinear. We use thirty user noisy sensorsand the best twenty item noisy sensors (i.e. U = 30 and I = 20) for AlgoEqual. The parameters U and I depend on the database in case of EachMovie database the number of users are more than the movies and each user has rated only few movies because of this more best user noisy sensors are selected than the best item noisy sensors. 42 0 item noisy sensors 5 item noisy sensors O 10 item noisy sensors 15 item noisy sensors * 20 item noisy sensors + 25 item noisy sensors 0.9 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 user noisy sensors Figure 4.6: The average absolute deviation as a function of the number of best user noisy sensors for different numbers of best item noisy sensors. This experiment was performed using AlgoEqual and K =1 for AllButl protocol. From Figure 4.5 we also see that the minimum average absolute devi-ation is .936 when we use both user and item noisy sensors (with sixty user and twenty item noisy sensors). It is .964 when we use only the user noisy sensors, shown by the zero item noisy sensors case, and 1.027 when we use only the item noisy sensors, shown on the y-axis for ten item noisy sensors. This shows that by including the item noisy sensors along with the user noisy sensors the quality of the prediction improves considerably. It also shows that if we use only the item noisy sensors for prediction then the average absolute deviation becomes greater than when we use only user noisy sensors. There-43 fore, symmetric collaborative filtering offers better accuracy than asymmetric collaborative filtering. 4.2.6 Comparison with Other Methods We compared the models AlgoLinear, AlgoEqual, Correlation and PD using the same training and test set as Pennock et al. [PHLGOO]. For each test user we tested the same item as Pennock et al. The results of comparing AlgoLinear, AlgoEqual, Correlation and PD are shown in Table 4.1. We re-implemented PD and Correlation. Our results for PD match exactly with those reported in Pennock et al. [PHLGOO], though not the case for Correlation. The coverages of AlgoLinear, AlgoEqual and PD are the same, but the coverage of Correlation is less. We believe that Pennock et al. [PHLGOO] might have made some changes in Correlation to provide the same coverage as PD. Since we can not compare two collaborative filtering algorithms if their coverages are not same. We report the results for Correlation from Pennock et al. [PHLGOO]. AlgoLinear performed better than PD and Correlation for AllButl and GivenlO protocols. For GivenS and Given2 protocols AlgoLinear performance is better than Correlation but not better than PD. We believe that AlgoLin-ear's poor performance can be explained by the fact the lines that are fitted to very small data sets are often a poor fit to the actual relationship. The algorithm AlgoEqual, based on an equal relationship between users, doesn't suffer from the same problem, outperformed better than each of the other three algorithms under all protocols. Shardanand and Maes [SM95] and Pennock et al. [PHLGOO] proposed that the accuracy of a collaborative filtering algorithm is most crucial when 44 Algorithm Protocol AllButl GivenlO Given5 Given2 AlgoEqual .893 .943 .974 1.012 AlgoLinear .943 .983 1.021 1.196 PD .964 .986 1.016 1.039 Correlation .999 1.069 1.145 1.296 Table 4.1: Average absolute deviation scores on the subset of EachMovie database as Pennock et al. [PHLGOO] for AlgoLinear, AlgoEqual, PD and Correlation, lower scores are better. predicting extreme ratings (very high or very low ratings) for items. The idea is that most of the time people are interested in suggestions about items they might like or dislike, but not about items they are unsure of. Pennock et al. [PHLGOO] define extreme ratings, as ratings which are 0.5 above the average rating or 0.5 below the average rating of the subset. Algorithm Protocol AllButl GivenlO Given5 Given2 AlgoEqual 1.001 1.057 1.087 1.124 AlgoLinear .997 1.063 1.125 1.249 PD 1.0296 1.0874 1.128 1.163 Correlation 1.108 1.127 1.167 1.189 Table 4.2: Average absolute deviation scores on the EachMovie database for AlgoLinear, AlgoEqual, PD and Correlation, for extreme ratings 0.5 above or 0.5 below the overall average rating. To compare the performance of algorithms with extreme ratings we computed the predicted ratings for those test cases (from the test sets of all protocols) whose observed rating is less than R — 0.5 or greater than R + 0.5 where R is the overall average rating in the subset. The results for the extreme ratings are shown in Table 4.2. The results for extreme ratings show that AlgoLinear performs better than AlgoEqual for AllButl protocol. It also performs better than PD and 45 Correlation over GivenlO and Given5 protocols. AlgoEqual performs better than the other three algorithms over GivenlO, Given5 and GivenS protocols. AllButl Protocol 1.4 * AlgoEqual O PD 30 Samples (1:60) 60 Figure 4.7: Average absolute deviation error of AlgoEqual and PD in 60 sam-ples for AllButl protocol with K = .5, U = 30, and I = 20. To determine the statistical significance of these results, we computed the significance levels for the differences in average absolute deviation between AlgoLinear and PD, PD and AlgoLinear, AlgoEqual and PD, and PD and AlgoEqual, for all protocols. For doing this, we divided the test set for all protocols into 60 samples of equal size and use randomization paired sample test of differences of mean [Coh95] (for details please see Section 4.3.1). Figure 4.7 shows the average absolute deviation error of AlgoEqual and PD in 60 46 AllButl Protocol 1.4 * AlgoLinear O PD 30 Samples (1:60) 60 Figure 4.8: Average absolute deviation error of AlgoLinear and PD in 60 samples for AllButl protocol with K — 1, U — 50, and I = 20. samples for AllButl protocol. Figure 4.8 shows the average absolute deviation error of AlgoEqual and PD in 60 samples for AllButl protocol. We found similar results for the other protocols. The statistical significance of the EachMovie data results is given in Table 4.3; it shows the probability of achieving a difference less than or equal to the mean difference. That is, it shows the probability of incorrectly rejecting the null hypothesis that both algorithms' deviation scores arise from the same distribution. Recently, Pennock et al. [PHLGOO] showed that PD is the most accurate 47 Protocol AlgoLinear PD vs. AlgoEqual PDvs. vs. PD AlgoLinear vs. PD AlgoEqual AllButl .0006 .9994 .0001 .9999 AUButl(extveme) .0003 .9997 .0127 .9873 GivenlO .1377 .8623 .0001 .9999 Givenl 0( extreme) .0211 .9789 .0009 .9991 Given5 .9064 .0936 .0043 .9957 Given5(extreme) .2897 .7103 .0001 .9999 Given2 .9999 .0001 .0019 .9981 Given2(extreme) .9999 .0001 .0001 .9999 Table 4.3: Significance levels of the differences in average absolute deviation between AlgoLinear and PD, and between AlgoEqual and PD, on EachMovie data computed using the randomization test. Low significance levels indicate that the difference in results are unlikely to be coincidental. explicit multi-valued rating collaborative filtering algorithm. The statistical significance results from Table 4.3 show that the performance of AlgoEqual is better than PD over all protocols and the performance of AlgoLinear is better than PD for the AllButl and GivenlO protocols. 48 Chapter 5 Conclusions and Future Work 5.1 Conclusions In this work, we have concerned ourselves with symmetric collaborative fil-tering based on explicit ratings, a technique that can be used for making recommendations for a user based on ratings of various items by a number of people. We proposed a probabilistic approach using a noisy sensor model for symmetric collaborative filtering, which, to make predictions allows to consider both the user-and the item similarity. We presented an explicit binary rating version of the noisy sensor model AlgoBayes, based on Bayesian Learning. We showed the performance results of AlgoBayes applied to several subsets of the EachMovie database, whch indicate that the AlgoBayes performs better than the Correlation and Decision Tree algorithms. To predict recommendations for a user based on multi-valued rating data, we devised a mechanism for learning the noisy sensor model, assuming a linear relationship between the users' preferences. We presented two variants of the multi-valued rating versions of the noisy sensor model, AlgoLinear and 49 AlgoEqual. We compared the two algorithms with other collaborative filtering state-of-the-art techniques. We reported empirical results for the EachMovie database of movie ratings. According to the absolute deviation metrics, Algo-Linear makes better predictions than PD over Allbutl and GivenlO protocols, and better than Correlation over all protocols. AlgoEqual makes better predic-tion than PD and Correlation over all protocols. We analysed and confirmed the statistical significance of these results. Our results from symmetric collab-orative filtering using noisy sensor model show that by including the items' similarity along with users' similarity the accuracy of prediction typically in-creases. 5.2 Future Work This thesis has created many new opportunities for future work. This section suggests some specific avenues for future research may take. The noisy sensor model has been shown to produce an average error of about 1, but the performance of this model degrades with the number of users and items. The noisy sensor model requires computation time that expands with both the number of users and the number of items. For large data sets it may become impractical to find all the reliable noisy sensors for an active user, except in an infrequent off-line manner. New technologies are needed to quickly produce high quality recommendations, especially for large data sets. Techniques of dimensionality reduction are currently being investigated in an attempt to reduce the amount of online computation. Example techniques are sampling of users, and singular value decomposition [BP98]. One could explore the use of dimensionality reduction techniques to improve the performance of noisy sensor model over a large database. 50 Collaborative filtering provides inaccurate predictions regarding users that are not similar to many other users. Also, using collaborative filtering system alone, the first users to rate an item cannot get a recommendation regarding it. Recently some ini t ia l work has been done to integrate content-based techniques with collaborative filtering and thereby alleviate the above problem [BS97]. One could explore the hybrid system of noisy sensor model with other existing content-based techniques. The proposed noisy sensor model for collaborative filtering works very well with explicit rating data. We did not explore the use of noisy sensor model for implicit rating. A n example of implicit rating would be a web page recommendation system in which we can log what pages a user sees but cannot collect explicit ratings [BHK98]. In this case, a simple way to get explicit rating would be to use a binary scale with the ratings viewed or did not view. While we believe that the noisy sensor model could accommodate this situation through an appropriate transformation of ratings, this is another interesting area for future research. One limitation of the present model is the use of a linear relationship between the preferences of users, for learning the noisy sensor model. If a pair of users have highly similar preferences but their relation is nonlinear, this may not be recognized as a reliable or good sensor and may hence be rejected by the noisy sensor model when it is considering only the best noisy sensors for making the recommendations. One could explore the use of polynomial regression for finding the relationship between users' preferences. The polynomial regression may not be feasible if the data available is very sparse. However, polynomial regression may not be feasible if the data available is very sparse. One could also extend the noisy sensor model by incorporating user and item properties beyond ratings — for example, user age group or movie genres can be explored with the use of multiple linear regression for learning the noisy sensor model. Finally, the proposed algorithms should be applied to a diverse set of prediction domains, to determine what results hold true generally across pre-diction domains. Unfortunately, sufficiently large test sets containing multi-valued ratings are not as yet, readily available. One could also test the pro-posed algorithms on the synthetic data. The synthetic data can be generated from the EachMovie database using the certain properties of the data, such as the same name of users, same number of items, and same average rating. 52 Bibliography [BHK98] John S. Breese, David Iieckerman, and Carl Kadie. Empirical ananlysis of predictive algorithms for collaborative filtering. In proceedings of the IJ^th conference on Uncertainity in artificial In-telligence, pages 43-52, July 1998. [BP98] Daniel Billsus and Michael J . Pazzani. Learning collaborative in-formation filters. In Proceedings of the 15th International Confer-ence on Machine Learning, pages 46-54, July 1998. [BS97] Marko Balabanovic and Yoav Shoham. Content-based, collabo-rative recommendations. In Communications of the ACM, 40(3), pages 66-72, March 1997. . [Coh95] Paul R. Cohen. Empirical Methods for Artificial Methods. MIT Press, 1995. [GNOT92] David Goldberg, David Nichols, Brian M . Oki, and Douglas Terry. Using collaborative filtering to weave an information tapestry. In Communications of the ACM, pages 61-70, December 1992. [Guj95] Damodar N. Gujarati. Basic Econometrics. McGraw Hill, Inc., third edition, 1995. [Hec95] David Heckerman. A Tutorial on Learning with Bayesian Net-works, March 1995. Technical Report, MSR-TR-95-06. [HKBR99] J. Herlocker, J . Konstan, A . Brochers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 1999 Conference on Research and Development in Information Retrieval, Berkley, C A , August 1999. [Mit97] Tom M . Mi tche l l . Machine Learning. McGraw Hill, Inc., 1997. [PHBOO] David Poole, Holger H . Hoos, and Craig Boutilier. Model-based Symmetric Collaborative Filtering, February 2000. Technical Re-port. [PHLGOO] David M . Pennock, Er ic Horvi tz , Steve Lawrence, and C. Lee Giles. Collaborative filtering by personality diagnosis: A hybrid memory-and model-based approach. In Proceedings of the 16th Conference on Uncertainty in Artificial intelligence, pages 473-480, June 2000. [RIS+94] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl . Grouplens: An open architecture for collaborative filtering of netnews. In Proceedings of ACM CSCW'94 Conference on computer supported cooperative work, pages 175-186, 1994. [RN95] Stuart Russell and Peter N orvig. Artificial Intelligence A Modern Approach. Prentice Ha l l , 1995. [SKR99] J. Ben Schafer, Joseph Konstan, and John Riedl. Recommender system in e-commerce. In Proceedings of the ACM Conference on Electronic Commerce (EC-99)., pages 158-166, November 1999. [SM95] Upendra Shardanand and Patti Maes. Social information filtering: Algorithms for automating "word of mouth". In Proceedings of ACM CHI'95 Conference on Human Factors in Computing Sys-tems, pages 210-217, 1995. [UF98] Lyle H. Ungar and Dean P. Foster. Clustering methods for collab-orative filtering. In Workshop on Recommendation Systems at the 15th National Conference on Artificial Intelligence, July 1998. 55
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Symmetric collaborative filtering using the noisy sensor...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Symmetric collaborative filtering using the noisy sensor model Sharma, Rita 2001
pdf
Page Metadata
Item Metadata
Title | Symmetric collaborative filtering using the noisy sensor model |
Creator |
Sharma, Rita |
Date Issued | 2001 |
Description | Collaborative filtering is the process of making recommendations regarding the potential preference of a user, for example shopping on the Internet, based on the preference ratings of the user and a number of other users for various items. This thesis considers the problem of symmetric collaborative filtering based on explicit ratings. To evaluate the algorithms, we consider only pure collaborative filtering, using given ratings and excluding other information about the people or items. Our approach is to predict an active user's preferences regarding a particular item by using other people's ratings of that item and other items rated by the active user as noisy sensors. The noisy sensor model uses Bayes' theorem to compute the probability distribution for the active user's rating of a new item. We give two variants for learning the noisy sensor model: one, for explicit binary rating data; and the second, for explicit multi-valued rating data. The model for binary rating data is based on Bayesian learning. Its performance motivate us to further explore the use of noisy sensor model for multi-valued rating data. We give two variant models for multi-valued rating data: in one, we learn a linear model of how users rate items; in another, we assume different users rate items identically, but that the accuracy of the sensors must be learned. We compare the two models of multi-valued rating data with stateof- the-art techniques and show how they are significantly better whether a user has rated only two items or many. We report empirical results using the EachMovie database of movie ratings. We also show that by considering the items similarity along with the users similarity the accuracy of the prediction increases. |
Extent | 2463482 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-07-29 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051313 |
URI | http://hdl.handle.net/2429/11450 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2001-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2001-0114.pdf [ 2.35MB ]
- Metadata
- JSON: 831-1.0051313.json
- JSON-LD: 831-1.0051313-ld.json
- RDF/XML (Pretty): 831-1.0051313-rdf.xml
- RDF/JSON: 831-1.0051313-rdf.json
- Turtle: 831-1.0051313-turtle.txt
- N-Triples: 831-1.0051313-rdf-ntriples.txt
- Original Record: 831-1.0051313-source.json
- Full Text
- 831-1.0051313-fulltext.txt
- Citation
- 831-1.0051313.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0051313/manifest