Combating the Cold-start UserProblem in Collaborative FilteringRecommender SystemsbySampoorna BiswasB.Tech. Computer Science, Indraprastha Institute of Information Technology, NewDelhi, 2014A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Computer Science)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)December 2016c© Sampoorna Biswas 2016AbstractFor tackling the well known cold-start user problem in collaborative filter-ing recommender systems, one approach is to recommend a few items to acold-start user and use the feedback to learn her preferences. This can thenbe used to make good recommendations to the cold user. In the absenceof a good initial estimate of the preferences, the recommendations are likerandom probes. If the items are not chosen judiciously, both bad recom-mendations and too many recommendations may turn off a user. We studythe cold-start user problem for the two main types of collaborative filter-ing methods – neighbourhood-based and matrix factorization. We formalizethe problem by asking what are the b (typically a small number) items weshould recommend to a cold-start user, in order to learn her preferencesbest, and define what it means to do that under each framework. We castthe problem as a discrete optimization problem, called the optimal interviewdesign (OID) problem, and study two variants – OID-NB and OID-MF –for the two frameworks. We present multiple non-trivial results, includingNP-hardness as well as hardness of approximation for both. We furtherstudy supermodularity/submodularity and monotonicity properties for theobjective functions of the two variants. Finally, we propose efficient al-gorithms and comprehensively evaluate them. For OID-NB, we propose agreedy algorithm, and experimentally evaluate it on 2 real datasets, whereit outperforms all the baselines. For OID-MF, we discuss several scalableheuristic approaches for identifying the b best items to recommend to theuser and experimentally evaluate their performance on 4 real datasets. Ourexperiments show that our proposed accelerated algorithms significantly out-perform the prior art, while obtaining similar or lower error in the learneduser profile as well as in the rating predictions made, on all the datasetstested.iiPrefaceThis thesis is submitted in partial fulfilment of the requirements for a Masterof Science degree in Computer Science. All the work presented in this dis-sertation are original and independent work of the author, performed underthe supervision of Prof. Laks V. S. LakshmananChapter 4 is part of work conducted in collaboration with Prof. Laks V.S. Lakshmanan, and Prof. Senjuti Basu Roy. I was responsible for derivingall the technical results and the experimental evaluation.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . viii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background and Related Work . . . . . . . . . . . . . . . . . 52.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1.1 Recommender Systems . . . . . . . . . . . . . . . . . 52.1.2 Neighbourhood-based Collaborative Filtering . . . . . 52.1.3 Matrix Factorization . . . . . . . . . . . . . . . . . . 62.1.4 Monotonicity and Submodularity/Supermodularity . 72.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2.1 Cold Start Problem in Collaborative Filtering . . . . 72.2.2 Interactive Recommendation . . . . . . . . . . . . . . 82.2.3 Complexity of Recommendation . . . . . . . . . . . . 93 Neighbourhood-based Methods . . . . . . . . . . . . . . . . . 103.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 103.1.1 Dominating Set . . . . . . . . . . . . . . . . . . . . . 103.1.2 Problem Statement . . . . . . . . . . . . . . . . . . . 103.2 Technical Results . . . . . . . . . . . . . . . . . . . . . . . . 113.2.1 Hardness . . . . . . . . . . . . . . . . . . . . . . . . . 113.2.2 Monotonicity . . . . . . . . . . . . . . . . . . . . . . . 133.2.3 Submodularity . . . . . . . . . . . . . . . . . . . . . . 13ivTable of Contents3.2.4 Approximation . . . . . . . . . . . . . . . . . . . . . . 133.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . 143.4.1 Dataset and Model . . . . . . . . . . . . . . . . . . . 153.4.2 Model Parameters & Experimental Setup . . . . . . . 153.4.3 Algorithms Compared . . . . . . . . . . . . . . . . . . 163.4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 173.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 184 Matrix Factorization Methods . . . . . . . . . . . . . . . . . . 204.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 204.1.1 Probabilistic Matrix Factorization . . . . . . . . . . . 204.1.2 Problem Statement . . . . . . . . . . . . . . . . . . . 214.2 Solution Framework . . . . . . . . . . . . . . . . . . . . . . . 214.3 Technical Results . . . . . . . . . . . . . . . . . . . . . . . . 234.3.1 Hardness . . . . . . . . . . . . . . . . . . . . . . . . . 244.3.2 Hardness of Approximation . . . . . . . . . . . . . . . 294.3.3 Monotonicity . . . . . . . . . . . . . . . . . . . . . . . 324.3.4 Supermodularity and Submodularity . . . . . . . . . 324.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334.4.1 Accelerated Backward Greedy . . . . . . . . . . . . . 334.4.2 Accelerated Forward Greedy . . . . . . . . . . . . . . 354.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . 374.5.1 Dataset and Model . . . . . . . . . . . . . . . . . . . 374.5.2 Model Parameters & Experimental Setup . . . . . . . 394.5.3 Algorithms Compared . . . . . . . . . . . . . . . . . . 414.5.4 Quality Experiments . . . . . . . . . . . . . . . . . . 414.5.5 Scalability Experiments . . . . . . . . . . . . . . . . . 445 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48vList of Tables3.1 Dataset Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.1 Dataset Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . 394.2 Experiment Sizes . . . . . . . . . . . . . . . . . . . . . . . . . 39viList of Figures3.1 ML 100K dataset with default mean rating . . . . . . . . . . . . 183.2 ML 100K dataset with default rating = 1 . . . . . . . . . . . . . 193.3 ML 1M dataset with default rating = 1 . . . . . . . . . . . . . . 194.1 Movielens 100K and 1M Datasets . . . . . . . . . . . . . . . . . 424.2 Netflix Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 434.3 Movielens 20M Dataset . . . . . . . . . . . . . . . . . . . . . . 434.4 We measure the running time by varying b from 4 to 100 for ML100K and ML 1M, averaged across 200 and 1000 cold users respec-tively. The plots show the performance of all 10 greedy algorithmson a candidate pool of 1682 and 3952 items respectively. . . . . . 444.5 We measure the running time by varying b from 4 to 100 for Netflixand ML 20M, averaged across 5000 cold users. . . . . . . . . . . 45viiAcknowledgementsFirst of all, I would like to thank my supervisor Dr. Laks V.S. Lakshmananfor his invaluable guidance, advice and patience. This thesis would not bewhat it is without his help at every step of the way, from being a soundingboard for my ideas to helping me write the thesis. He taught me how tocritically analyze research, and write technical content, and I really valueall the time he dedicated to our long research discussions.I would also like to thank all the other professors and collaborators whohelped me along the way, in particular, Dr. David Poole, for being mysecond reader and providing feedback on my work, and Dr. Senjuti BasuRay and Dr. Nick Harvey, for insightful discussions and helping me polishmy work.I am also thankful for my labmates at the Data Mining and Managementlab; I learned a lot from them. In particular I would like to thank GlennBevilacqua and Sharan Vaswani for all the tips, tricks and machine learninginsider information without which my experiments may not have proceededas smoothly. And finally I would like to thank my family, and friends allaround the world, thanks for your love and support, and for giving me abreak from research from time to time!viiiChapter 1IntroductionRecommender systems have emerged as a popular solution to the informa-tion overload problem and have been successfully deployed in many appli-cation domains such as recommendation of products (books, music, movies,etc.), services (e.g., restaurants), and content (e.g., search queries, hashtags,and other online content). Owing to their wide appeal, various approacheshave been developed for generating good recommendations that are likelyto appeal to their users.Broadly, there are two main methods for producing recommendations• Collaborative filtering (CF) systems [13], which exploit the intuitionthat users may like items preferred by users with tastes or intereststhat are similar to theirs, or that users may enjoy items similar tothose that they have enjoyed in the past, or a combination of theseideas• Content-based systems, which exploits the idea that users would likeitems that have similar properties (eg. movies of the same genre,books by the same author) to items that they have enjoyed previouslyor that users with similar properties (eg. same geographic location,similar demographics) would have similar tastesMethods based on collaborative filtering can be further divided into [2]• Memory-based methods [10], which recommend items based on heuris-tics aggregated over similar users or items• Model-based methods, which build statistical models for the ratingdata and use those to make predictionsA common memory-based approach is to construct neighbourhoods ofsimilar items or users, and recommend items that have been enjoyed byone user but not yet consumed by another user with similar tastes. Amongmodel-based approaches, an approach that has been particularly successfulis the so-called matrix factorization (MF) approach, which assumes a latent1Chapter 1. Introductionfactor model of low dimensionality for users and items, which are learned byfactoring the matrix of observed ratings [26]. Compared to content-basedfiltering, collaborative filtering has been found to result in higher qualityrecommendations and is more scalable. Hence, CF is the focus of this thesis.An important challenge faced by any recommender system is the so-called cold-start user and cold-start item problem [3, 42]. The former occurswhen a new user joins the system and the latter when a new item becomesavailable or is added to the system’s inventory. Without past user-item in-teractions, collaborative filtering cannot determine similarity of users/items,while content-based filtering requires additional metadata for every new userand item. In this thesis, we study the cold-start problem for users underboth neighbourhood-based and matrix factorization settings.A simple approach to mitigate this problem for CF techniques, is topresent a certain minimum number of items to the cold-start user and ob-tain the user’s feedback on them. A key question is how to select itemsto recommend to a cold-start user. Approaches that have been explored inthe literature for tackling this problem have mainly tended to be ad hocand heuristic in nature. E.g., [16, 23, 50] explore an offline strategy forselecting the items while [12, 30, 49] develop an online strategy for recom-mending items to cold-start users. In the online approach, the item recom-mended to a cold-start user takes into account her feedback on the previousrecommendation, whereby the user profile is updated and used for mak-ing further recommendations. While these works report empirical resultsbased experiments conducted on some datasets, unfortunately, these worksdo not formulate the item selection problem in a rigorous manner and donot analyze its computational properties. Furthermore, no comprehensivescalability experiments have been reported on their proposed strategies foritem selection.In this thesis, we focus on the cold-start user problem and formulate theitem selection problem for cold-start users as a discrete optimization prob-lem. We study this both for a latent factor model based on probabilisticmatrix factorization [39], and for a neighbourhood-based system for our un-derlying recommender system. Since user attention and patience is limited,we assume that there is a budget b on the number of items which we canrequest feedback from a cold-start user. The main question we then studyis, how to select the b best items to recommend to such a user that will allowthe system to learn the user’s preferences as well as possible. We formalizewhat it means to learn the user’s preferences best under the two settings.For the neighbourhood-based system, we define that as being able to predictfuture ratings with low error, for the most number of items. The rationale2Chapter 1. Introductionis that it is not very useful to know what the user likes or dislikes (given byrating prediction accuracy) if it is only on a handful of items; we want tomaximize the number of items for which we can make accurate predictions,also known as prediction coverage [20]. For the matrix factorization system,we define that as learning the user’s latent factor vector (the user’s profile)with low error. The motivation is that if the user profile is learned well, itwill pay off in allowing the system to make high quality recommendationsto the user in the future. We formulate the item selection problem as a dis-crete optimization problem, called optimal interview design (OID), wherethe items selected can be regarded as questions selected for interviewingthe cold-start user for her feedback on those items. For the two settings,we study two variants of the problem, which we call OID-NB and OID-MFfor the neighbourhood-based system and the matrix factorization systemrespectively.A challenge with OID-MF is defining the true user profile against whichto measure the error of a learned profile. This is necessary for defining theobjective function we need to optimize with our choice of b items. Thedifficulty is that there is no prior information on a cold-start user. Weaddress this by showing that under reasonable assumptions, which will bemade precise in Section 2.1, we can directly express the difference betweenthe learned user profile and the true user profile in terms of the latentfactors of the b items chosen. This allows us to reason about the quality ofdifferent choices of b items and paves the way for our optimization framework(Section 4.2).We establish that both OID-NB and OID-MF problems are NP-hard.We show that the OID-NB problem is monotone and submodular, so theproposed greedy algorithm provides a (1− 1/e)-approximation to the opti-mum. We subsequently also show that the OID-MF problem is NP-hard toapproximate to within a factor αθ , where α and θ depend on the probleminstance (Section 4.3.2). Further, we show that its objective function, i.e.,least squared error between the true and learned user profile, is neither sub-modular nor supermodular, suggesting efficient approximation algorithmsmay be unlikely to exist (Section 4.3.4).Finally, we propose efficient algorithms and comprehensively evaluatethem. For OID-NB, we propose a greedy algorithm (Section 3.3), and exper-imentally evaluate it on 2 real datasets (Section 3.4), where it outperformsall the baselines. For OID-MF, we discuss several scalable heuristic ap-proaches for identifying the b best items to recommend to the user (Section4.4) and experimentally evaluate their performance on 4 real datasets. Ourexperiments show that our proposed accelerated algorithms significantly out-3Chapter 1. Introductionperform the prior art, while obtaining similar or lower error in the learneduser profile as well as in the rating predictions made, on all the datasetstested (Section 4.5).4Chapter 2Background and RelatedWork2.1 BackgroundIn this section, we summarize the relevant notions on collaborative filter-ing (CF) that are used in our problem formulation and further technicaldevelopment.2.1.1 Recommender SystemsMost recommender systems use a rating/interaction matrix, R, of size n×m,of users, U = {u1, u2, ..., um}, and items, I = {v1, v2, ..., vn}. Typically thematrix is sparse, and has only very few known ratings. Let each entry Rijin the matrix be drawn from a rating scale and represent whether a userui likes a particular item vj or not. This preference information can begained through explicit or implicit feedback. For implicit feedback, it istypically binary, but for explicit feedback, the rating scale can be anything,for example, ranging from 1 to 5. We focus on learning the profile of a singlecold-start user. This would be represented as an entire row of unknownratings in R.2.1.2 Neighbourhood-based Collaborative FilteringNeighbourhood-based methods construct similarity neighbourhoods of usersor items and aggregate statistics across them to predict future ratings. LetRˆij indicate the predicted rating for user ui and item vj . User based methodsuse ui’s similarity to other users who have rated vj , to predict Rˆij , whileitem based methods use ui’s ratings on items similar to vj to predict Rˆij .For the former, we first construct ui’s neighbourhood by computing itssimilarity score, w(ui, uk), with every other user uk ∈ U \ ui. Let Iui be theset of items rated by ui. For user-user similarity, Pearson correlation is used52.1. Background[40] as follows,w(ui, uk) =∑vj∈Iui,uk (Rij − R¯i)(Rkj − R¯k)√∑vj∈Iui,uk (Rij − R¯i)2 ·√∑vj∈Iui,uk (Rkj − R¯k)2(2.1)Here, Iui,uk denotes the set of items rated by both ui and uk, i.e.,Iui,uk = Iui ∩ Iuk , and R¯i denotes the average rating given by ui. Notethat w(ui, uk) = w(uk, ui). Let Nj(ui) be the set of neighbours of ui thathave rated vj . Neighbours can be selected in two ways: either we select userssuch that their similarity score is above some absolute threshold [43] whichwe call θneighbourhood in this thesis, or we select the top n most similar users[40]. With either technique, being able to select good quality neighbours iscritical to achieving high prediction accuracy. Then we use weighted averageto compute Rˆij as follows,Rˆij =∑uk∈Nj(ui)w(uk, ui) ·Rkj∑uk∈Nj(ui) abs(w(uk, ui))(2.2)For item-item similarity on the other hand, adjusted cosine similarityscore has been shown to perform better than other similarity measures suchas cosine similarity and correlation-based scores [40]. Let Uvj be the set ofusers who have rated vj , and Uvj ,vk = Uvj∩Uvk . Then the similarity betweentwo items vj and vk are given by,w(vj , vk) =∑ui∈Uvj,vk (Rij − R¯i)(Rik − R¯i)√∑uj∈Uvj,vk (Rij − R¯i)2 ·√∑uj∈Uvj,vk (Rik − R¯i)2(2.3)Note that w(vj , vk) = w(vk, vj). Let Ni(vj) be the set of neighbours ofvj that have been rated by ui. As with user-based collaborative filtering,neighbours can be selected using the following two methods: either selectthose with similarity scores above some absolute threshold θneighbourhood,or by selecting the top n similar items. We can estimate Rˆij by using aweighted average across this set.Rˆij =∑vk∈Ni(vj)w(vj , vk) ·Rik∑vk∈Ni(vj) abs(w(vj , vk))(2.4)2.1.3 Matrix FactorizationTo make predictions using neighbourhood-based collaborative filtering, theentire rating matrix R needs to be stored in memory, along with the simi-larity computations. In contrast, matrix factorization represents users and62.2. Related Workitems in terms of lower dimensional vectors of features inferred from theratings, called latent features or latent factors [26]. Instead of storing theentire rating matrix in memory, it stores two lower dimensional factor ma-trices – the user latent factor matrix U ∈ Rm×d and the item latent factormatrix V ∈ Rn×d, where d is the dimension of the latent feature-space. Acertain user ui’s preference for an item vj is represented by an inner productbetween their corresponding latent vectors, V Tj Ui, which approximates Rij .The goal in matrix factorization is to determine U, V that produce the low-est error between the observed ratings in R, and the estimated ratings Rˆ.Typically gradient descent or alternating least squares is used to minimizeregularized root mean square error (RMSE) over the set of known ratingsin R [26].2.1.4 Monotonicity and Submodularity/SupermodularityMonotonicity and submodularity or supermodularity are important proper-ties of constrained optimization objective functions. We study these proper-ties for our problem in Sections 3.2.2, 3.2.3, 4.3.3, and 4.3.4. Here we reviewthe definitions.Definition 2.1.1. For subsets A ⊂ B ⊂ U of some ground set U , a func-tion f : 2U → R is monotone decreasing if f(A) ≥ f(B) and monotoneincreasing if f(A) ≤ f(B).Definition 2.1.2. For subsets A ⊂ B ⊂ U of some ground set U , andx ∈ U \B, a set function f : 2U → R≥0 is submodular if f(B∪{x})−f(B) ≤f(A∪{x})−f(A). The function f(.) is supermodular iff −f(.) is submodular,or equivalently iff f(B ∪ {x})− f(B) ≥ f(A ∪ {x})− f(A).2.2 Related WorkWe classify research related to the problem studied in this thesis under thefollowing categories.2.2.1 Cold Start Problem in Collaborative FilteringThe cold start problem in collaborative filtering recommender systems hasbeen addressed using many different methods. A common approach com-bines CF with user content (metadata) and/or item content informationto start off the recommendation process for cold users [25, 27, 28, 42, 47].72.2. Related WorkOther approaches leverage information from an underlying social network torecommend items to cold users [29, 31, 32]. Some researchers have proposedsimilarity measures tuned to cold-start users [3, 8], and use of feature-basedregression [35]. In addition, online CF techniques, that incrementally updatethe latent vectors as new items or users arrive, have been proposed as a wayto incorporate new data without retraining the entire model [1, 22, 41].None of these works rigorously study the problem of selecting a limitednumber of items for a cold-start user as an optimization problem.One exception is [4], which studies the cold-start item problem and for-malizes it as an optimization problem of selecting users, to rate a givencold-start item. We borrow motivation from this paper and study the cold-start user problem by formalizing an optimization function in a probabilisticmanner. Unlike them, our recommender model is based on probabilistic MF.Furthermore, they do not study the complexity or approximability of theuser selection problem in their framework. They also do not run any scala-bility tests, and their experiments are quite limited. As part of our technicalresults, we show that our objective function is not supermodular. By dual-ity between the technical problems of cold-start users and cold-start items,it follows that the objective used in their framework is not supermodulareither, thus correcting a misclaim in their paper. A practical observationabout the cold-start user problem is that it is easy and natural to motivate acold-start user by asking her to rate several items in return for better qualityrecommendations using the learned profile. However, it is less natural andtherefore harder to motivate users to help the system learn the profile of anitem, so that it can be recommended to other users in the future.2.2.2 Interactive RecommendationItems may be recommended to a cold-start user in batch mode or interac-tive mode. In batch mode, the items are selected in one shot and then usedfor obtaining feedback from the cold-start user. E.g., this is the approachadopted in [4] (for user selection). The drawback of recommending a staticset of items to every cold user, is that it assumes that the actual ratings givenby the user do not affect the informativeness of the questions. That is, wecannot make use of the rating on one item, to choose the next few items.In interactive mode, feedback obtained on an item can be incorporated inselecting the next item. This can be further handled in two ways – online oroffline. In the online setting, multi-armed bandit frameworks that interleaveexploration with exploitation have been studied [11, 12, 30, 44, 49]. How-ever, these approaches require re-training of the model after each item is82.2. Related Workrecommended. In contrast, offline approach considers all possible outcomesfor feedback and prepares an “interview plan” in the form of a decisiontree [16, 23, 50]. It is well known that constructing the optimal decision treeis NP-complete. Furthermore, the search space of possible decision treesfor forming interview plans is exponential. While heuristic solutions areproposed in [16, 23, 50], large scale scalability experiments are not reported.2.2.3 Complexity of RecommendationThe complexity of recommendation systems has been studied before undervarious assumptions. Non-negative matrix factorization (used in some CFapproaches) has been shown to be NP-hard [45]. In [18], the authors findproducts that, between them, have been bought by as many people as pos-sible, and recommend them as well as products similar to them to otherusers. They show that finding products that cover as many users as possibleis NP-hard. Both papers above are not concerned with item selection forlearning the profile of a cold-start user.In [17], the authors study the problem of finding a limited number ofusers such that upon recommending a given item to them, if they rate itfavorably, it will then be recommended to the maximum number of otherusers by the system. They show this problem is NP-hard and inapprox-imable. Finally, [5] shows that the problem of forming groups of users suchthat the group recommendations they receive maximize user satisfaction isNP-hard. These problems are orthogonal to the problem studied in thispaper and do not directly address the item selection problem for cold-startusers.9Chapter 3Neighbourhood-basedMethods3.1 Preliminaries3.1.1 Dominating SetIn graph theory, a dominating set for a graph G = (V,E) is D ⊂ V suchthat ∀v 6∈ D, there is an edge e = (u, v) and u ∈ D. The domination numberγ(G) is the number of vertices in the smallest dominating set for G.Given a graph G, and a number k, the Dominating Set problem asks tofind D such that it dominates the maximum number of vertices and |D| = k.3.1.2 Problem StatementWe focus on the item-item similarity model for the underlying recommendersystem. Let u` be a cold-start user whose preferences need to be learned byrecommending a small number of items to u`. Each item recommended tou` can be viewed as a probe or “interview question” to gauge u`’s interest.Since there is a natural limit on how many probe items we can push to a userbefore saturation or apathy sets in, we assume a budget b on the number ofprobe items. Our objective is to select b items that maximize our learningof the cold user’s preferences, determined by achieving a high predictionaccuracy on the maximum number of items.We assume that ratings are described perfectly by the cold user’s pref-erence for similar items, modulo some noise. Then for some item vj , R`j =Rˆ`j + ε`j , where ε`j is the noise term associated with the user-item pair(u`, vj), and the predicted rating would be given by,Rˆ`j =∑vk∈N`(vj)w(vk, vj) ·R`k∑vk∈N`(vj) abs(w(vk, vj))(3.1)Note that initially the number of vj ’s neighbours rated by u`, |N`(vj)| =0, but after the interview, |N`(vj)| ≤ b, as we obtain u`’s feedback on b103.2. Technical Resultsitems, and at most all b items can be neighbours of vj . In this thesis we useabsolute thresholding to select neighbours as a sufficiently large thresholdensures higher quality neighbours, and thus better prediction accuracy [20].We denote the threshold used in absolute thresholding as θneighbourhood. Twoitems with a similarity score w(., .) ≥ θneighbourhood are considered to bein each other’s neighbourhoods. Note that due to the sparsity in ratingsin recommender systems, using a lower value of θneighbourhood increases thenumber of items for which rating predictions can be made, as it increases thenumber of items with at least one neighbour, but it introduces more noiseand can result in poorer prediction accuracy. Thus the value of θneighbourhoodindirectly controls prediction accuracy. Given a judiciously chosen thresholdθneighbourhood, we would like to maximize the number of items for which weare able to predict the user’s preference rating. This has clear motivation,as it provides a larger universe from which to recommend items to that user.This is known as prediction coverage [20]. For a set of items A ⊆ I, theprediction coverage of A, PC(A) is defined as follows,PC(A) =∣∣∣∣ ⋃vk∈AN`(vk)∣∣∣∣.Given a fixed threshold θneighbourhood, our objective is to select b items thatachieve maximum prediction coverage. We now formally state the problemwe study.Problem 1 (Optimal Interview Design - Neighbourhood Based (OID-NB)).Given an item-item similarity matrix, θneighbourhood absolute threshold forneighbourhood selection, cold start user u`, and a budget b, find a set of bitems B, to recommend to u` that maximizes prediction coverage PC(B).3.2 Technical Results3.2.1 HardnessTheorem 3.2.1. The optimal interview design problem for an item-basedcollaborative filtering recommender system (OID-NB, Problem 1) is NP-hard.Proof. We prove the hardness of OID-NB through a reduction from thedominating set problem (Section 3.1.1).Reduction: From an instance X of dominating set, create an instanceY of OID-NB. For every node w ∈ V , create a corresponding item vw in113.2. Technical ResultsI, and for every ei = (u,w) ∈ E, create a corresponding user ui. Createa dummy item vd and the cold-start user u`. Therefore, |I| = |V | + 1 and|U | = |E|+ 1.Now, construct the rating matrix R. We consider only binary ratingvalues here, but the proof can be easily extended to any interval ratingscale. For every user ui corresponding to ei = (u,w) ∈ E in X , assign arating Riu = Riw ∈ 0, 1. Also assign rating Rid 6= Riu. Therefore, everyuser rates exactly 3 items, out of which two receive the same rating, andthe dummy item receives the opposite. Alternate this for each user. Forexample, if the first user rates the 2 items 1, and rates the dummy item0, then the next user will rate the 2 items 0, and dummy item 1. Forall other user-item pairs, the rating is considered null, or unknown. Letθneighbourhood = 1. By construction, w(vd, vk) < θneighbourhood = 1, for anyvk ∈ I. To see that this is true, w.l.o.g. consider a user ui who has ratedboth vd and some item vk. R¯i is either13 or23 (by construction). Hence(Rik − R¯i)(Rdi − R¯i) = (r − 13)(1 − r − 23) < 0, where r = {0, 1}. Addingthis up over all users who have rated both, we obtain w(vd, vk) < 0 < 1.For θneighbourhood = 1, we claim that we are able to produce predictionson an item if and only if the node in X corresponding to its neighbour,is in the dominating set D. Therefore, ratings received on the b itemsrecommended to u`, allow us to make predictions on their neighbours. Thenthe b nodes that dominate the most number of nodes in X , produces thehighest prediction coverage in Y.To see that nodes are adjacent in X , if and only if the correspondingitems in Y are neighbours, we compute the adjusted cosine similarity of twoitems vj , vk in X such that there is an edge ei = (j, k) in Y. By construction,Uvj ,vk = Uvj ∩ Uvk = ui.w(vj , vk) =(Rij − R¯i)(Rik − R¯i)√(Rij − R¯i)2.√(Rik − R¯i)2=(r − R¯i)(r − R¯i)√(r − R¯i)2.√(r − R¯i)2=(r − R¯i)(r − R¯i)(r − R¯i).(r − R¯i) = 1 (3.2)If the edge did not exist in X but the nodes did, there would be noui in Y, and w(vj , vk) would be 0. On the other hand, w(vd, vk) < 1 asthe node corresponding to vd does not exist in X . It also follows that, to123.2. Technical Resultsproduce recommendations on an item vj in Y, the node corresponding to itmust be dominated by at least one other node in X . Otherwise, N`(vj) =φ. Therefore, the b items that dominate the largest number of nodes inX produce the largest prediction coverage in J . This is what we had toshow.3.2.2 MonotonicityWe define monotonicity as given in Definition 2.1.1.Consider the cold-start user u`. Let f(A) = PC(A) = |⋃vk∈AN`(vk)|for some subset A of item set I. First we show that f(A) is normalized,that is, f(φ) = 0. This is trivially true, as |N`(φ)| = |φ| = 0. To showmonotonicity, we consider B = A ∪ vj where vj ∈ I \A.f(B)− f(A) =∣∣∣∣ ⋃vk∈BN`(vk)∣∣∣∣− ∣∣∣∣ ⋃vk∈AN`(vk)∣∣∣∣=∣∣∣∣ ⋃vk∈A∪vjN`(vk)∣∣∣∣− ∣∣∣∣ ⋃vk∈AN`(vk)∣∣∣∣=∣∣∣∣ ⋃vk∈AN`(vk) ∪N`(vj)∣∣∣∣− ∣∣∣∣ ⋃vk∈AN`(vk)∣∣∣∣ ≥ 0The last equation is true as |N`(vj)| ≥ 0. Hence f(.) is a monotoneincreasing function.3.2.3 SubmodularityWe define submodularity as given in Definition 2.1.2.Consider A,B, f(.) as defined in Section 3.2.2. Under this definition, theobjective function simplifies into the budgeted maximum covering problem,which has been shown to be submodular in previous work [21, 24].3.2.4 ApproximationTheorem 3.2.2. Consider an item-item similarity collaborative filteringsystem, with item set I, cold user u`, and θneighbourhood = 1. Let theset of b items producing optimal prediction coverage be OPTOID−NB =argmaxA⊂I,|A|=b|⋃vk∈AN`(vk)|, and let A be the solution produced by se-lecting the items one by one, each time selecting the one that covers the most133.3. Algorithmsnumber of as yet uncovered elements, and |A| = b. Then∣∣∣∣ ⋃vk∈AN`(vk)∣∣∣∣ ≥ (1− 1/e)∣∣∣∣ ⋃vk∈OPTOID−NBN`(vk)∣∣∣∣.Proof. In Sections 3.2.2 and 3.2.3, we show that the objective function is nor-malized, monotone increasing and submodular. Now we apply the generalresult on submodular function maximization with a cardinality constraintby Nemhauser et. al. which states that an algorithm that greedily selectssets (here items) can do no better than (1− 1/e) times the optimal solution[34]. In fact, it states that a better approximation factor does not exist.3.3 AlgorithmsIn this section, we present our algorithm for selecting items with which tointerview a cold user so as to learn her preferences as well as possible. Inview of the hardness and hardness of approximation results (Theorems 3.2.1and 3.2.2), we present the following greedy selection algorithm: select itemsone at a time, each time selecting the item that increases the predictioncoverage most. Let the solution produced by this algorithm be A, and Aibe the set of items selected at iteration i. We can formally express this as,Ai+1 = Ai ∪ {arg maxvj δ(vj |Ai)}, where δ(vj |Ai) denotes the increase inprediction coverage when vj is added to Ai.The pseudocode is given in Algorithm 1. In Line 6, we compute the in-crease in prediction coverage, which utilizes the item-item similarity matrixand θneighbourhood. In case no item increases the prediction coverage, thealgorithm adds the last item by default (Line 4).3.4 Experimental EvaluationIn this section, we describe the experimental evaluation for our algorithm,and compare them with prior art.The development and experimentation environment uses a Linux Serverwith 2.93 GHz Intel Xeon X5570 machine with 98 GB of memory withOpenSUSE Leap OS.143.4. Experimental EvaluationAlgorithm 1 Greedy Selection (GS)Input: item set I; budget b; rating matrix R; item-item similarity matrix;θneighbourhoodOutput: items subset B, |B| = b.1: B ← φ2: do3: max ← 04: maxitem ← I[−1]5: for vi ← 1 to |I| do6: δ(vi|B) ← PC(B ∪ vi)− PC(B)7: if δ(vi|B) ≥ max then8: max ← δ(vi|B)9: maxitem ← vi10: I ← I \ vi11: end if12: B ← B ∪maxitem13: end for14: while |B| ≤ b3.4.1 Dataset and ModelFor our experiments, we use two datasets – Movielens (ML) 100K and Movie-lens 1M 1, which are used for movie recommendations. We describe theircharacteristics in Table 3.1.Table 3.1: Dataset SizesDataset # Ratings # Users # Items SparsityML 100K 100,000 943 1682 6.3%ML 1M 1,000,209 6,040 3,900 4.25%3.4.2 Model Parameters & Experimental SetupFor each dataset, we compute item-item similarity using adjusted cosinesimilarity for all items using only the ratings given by 70% of the users.We refer to them as the warm users, U . We use an absolute threshold,1Available at http://grouplens.org/datasets/movielens/. Source: [19]153.4. Experimental Evaluationθneighbourhood, to compute neighbours, and study the effect of varying itsvalue.Experimental Setup: We simulate the cold user interview process asfollows:1. Set up the system(a) Randomly select 70% of the users in a given dataset to train themodel (U)(b) R := Matrix of ratings given by U only(c) For every pair of items (vj , vk), compute adjusted cosine similarityw(vj , vk) =∑ui∈Uvj,vk(Rij−R¯i)(Rik−R¯i)√∑uj∈Uvj,vk(Rij−R¯i)2·√∑uj∈Uvj,vk(Rik−R¯i)2.(d) If w(vj , vk) > θneighbourhood, add each item to the other’s neigh-bourhood list2. For each cold user u` 6∈ U ,(a) Randomly split items they have rated, into candidate pool CPand test set Test equally(b) Run item selection algorithm on CP with corresponding neigh-bours list and budget = b(c) B := items returned by algorithm to interview u`(d) Reveal R` := u`’s ratings on B(e) Evaluation: RMSE on Test :=√1|Test|∑vj∈Test(R`j − Rˆ`j)2, pre-diction coverage PC(B) := |⋃vk∈B N`(vk)|3. Average prediction error and prediction coverage over all cold usersIn Step 2e, we compute RMSE and prediction coverage on Test. Notethat some items in Test may not have any neighbours among B. In thatcase, we predict a default rating, such as the mean rating of the dataset.3.4.3 Algorithms ComparedWe study three versions of the Greedy Selection algorithm (GS) as describedin Section 3.3, corresponding to three different θneighbourhood = {0.4, 0.5, 0.6}.We call them (GS4), (GS5) and (GS6). We compare them against baselinesRandom Selection (RS), where the items are randomly sampled from thecandidate pool, Popular (Pop), where the b most rated items are selected,163.4. Experimental Evaluationwith the rationale being that obscure items may not be informative aboutusers’ likes and dislikes, while also being difficult for them to rate [16, 36],Entropy0 (Ent0) [36] (Equation 3.3) and EntPop (EP). The last two are mod-ifications of entropy, which is a measure of informativeness. Items which aremore controversial, and have a wider spread in their rating distribution,tend to be more informative about a user’s preferences, while knowing thatthe user likes an item that everyone likes, does not tell us much. However,items with high entropy are often less popular. To balance informativenesswith popularity, we use the measures Entropy0 (Ent0) as described in Equa-tion 3.3, and EntPop (EP), where the b items with highest values of entropy× log(popularity) are selected. Both have been shown to perform betterthan pure entropy [36].Entropy0(vj) = − 1∑iwi5∑i=0piwilog(pi) (3.3)Here, pi is the fraction of users who have rated vj equal to i, starting fromi = 0, referring to the class of users who have not rated vj , to i = 5, themaximum possible rating in our datasets. wi refers to the weights given tothe 6 classes. We use w = [0.5, 1, 1, 1, 1, 1], which was shown to work thebest in [36].This gives us a total of 7 algorithms to compare.3.4.4 ResultsThe first thing we observe is that GS4 performs better than all other al-gorithms after b = 15 (Fig. 3.1(a)). Up to that point, Ent0, EP and Popperform the best. However, they consistently perform poorly on predictioncoverage (Fig. 3.1(b)). We hypothesize that their “good performance” forlower values of b is not because they are able to accurately predict what theuser would like or dislike, but due to the default predicted rating. The lowprediction coverage implies that many items in Test would not have anyneighbours in the B selected by these algorithms, and so would be assignedthe mean rating by default. Hence the “good performance” would only bedue to the other algorithms making predictions that are worse than pre-dicting the mean, as they have likely not yet learned the user’s preferences.This hypothesis can also explain why GS performs worse as we increaseθneighbourhood from 0.4 to 0.6. As the threshold is raised, fewer items arecovered, and so more items are assigned a mean rating.To test this hypothesis, we observe what happens when we change thedefault rating from the mean rating to an arbitrarily bad rating, such as 1.173.4. Experimental Evaluation0 20 40 60 80 1001.11.151.21.251.31.35Number of movies ratedRMSE RSEnt0EPPopGS4GS5GS6(a) Prediction Error0 20 40 60 80 10060080010001200140016001800Number of movies ratedPrediction Coverage RSEnt0EPPopGS4GS5GS6(b) Prediction CoverageFigure 3.1: ML 100K dataset with default mean ratingAfter doing that, our proposed algorithm GS4 performs the best in terms ofpredicton accuracy as well as prediction coverage (Fig. 3.2). Interestingly,prediction error does not decrease monotonically as we increase the similaritythreshold, as after θneighbourhood ≥ 0.5, the reduction in coverage reduces thenumber of neighbours, and increases the number of items for which we areunable to make any prediction. Thus we observe that raising θneighbourhood infact leads to worse prediction error performance. We observe a similar trendwith the ML 1M dataset as well (Fig 3.3) when we set the default ratingto 1. The prediction accuracy for GS5 and GS6 worsens as the predictioncoverage worsens.3.4.5 DiscussionCompared to real world recommender systems, the datasets we use in theexperiments are quite small. Even then, and even with our efficient greedyselection algorithm and scalable baselines, the experiments took an ex-tremely long time. The major roadblock was computing item-item sim-ilarity scores for all item pairs, which is a necessary pre-processing step.For large datasets, it becomes impractical to compute this information, andalso to update it as new users and items enter the system. Some scalablemethods have been explored using clustering [46], deriving similarity scoresfor multiple users/items at a time [6], and through parallel computing [15].However, these approaches do not address the cold-start user problem, andfurther research needs to be done to adapt them for this purpose.183.4. Experimental Evaluation0 20 40 60 80 10011.21.41.61.822.22.4Number of movies ratedRMSE RSEnt0EPPopGS4GS5GS6(a) Prediction Error0 20 40 60 80 10070080090010001100120013001400150016001700Number of movies ratedPrediction Coverage RSEnt0EPPopGS4GS5GS6(b) Prediction CoverageFigure 3.2: ML 100K dataset with default rating = 10 20 40 60 80 10011.522.5Number of movies ratedRMSE RSEnt0EPPopGS4GS5GS6(a) Prediction Error0 20 40 60 80 1005001000150020002500300035004000Number of movies ratedPrediction Coverage RSEnt0EPPopGS4GS5GS6(b) Prediction CoverageFigure 3.3: ML 1M dataset with default rating = 119Chapter 4Matrix FactorizationMethods4.1 Preliminaries4.1.1 Probabilistic Matrix FactorizationMF treats the (latent) features of users and items as deterministic quantities.Probabilistic MF (PMF), on the other hand, assumes that these featuresare drawn from distributions. More precisely, PMF [39] expresses the ratingmatrix R as a product of two random low dimension latent factor matriceswith the following zero-mean Gaussian priors:Pr[U |ΣU ] =m∏i=1N (Ui|0, σ2uiI),Pr[V |ΣV ] =n∏j=1N (Vj |0, σ2vjI), (4.1)where N (x|µ, σ2) is the probability density function of a Gaussian distribu-tion with mean µ and variance σ2. It then estimates the observed ratingsas R = Rˆ + ε = UTV + ε, where ε is a matrix of noise terms in the model.More precisely, εij = σ2ij represents zero-mean noise in the model.The conditional distribution over the observed ratings is given byPr[R|U, V,Σ] =m∏i=1n∏j=1[N (Rij |UTi Vj , σ2ij)]δij (4.2)where Σ is a d × d covariance matrix, and δij is an indicator function withvalue 1 if user ui rated item vj , and 0 otherwise.204.2. Solution FrameworkTaking the log over the posterior gives us (see [39]):ln Pr[U, V |R,Σ,ΣV ,ΣU ] = −12m∑i=1n∑j=1δijσ2ij(Rij − UTi Vj)2−12m∑i=1UTi Uiσ2ui− 12n∑j=1V Tj Vjσ2vj−12(m∑i=1n∑j=1δij lnσ2ij + dm∑i=1lnσ2ui + dn∑j=1lnσ2vj)+ C (4.3)Algorithms like gradient descent or alternating least squares can be usedto optimize the resulting non-convex optimization problem.4.1.2 Problem StatementConsider a PMF model (U, V ) trained on an observed ratings matrix R, byminimizing a loss function such as squared error between R and the predictedratings Rˆ = UTV (with some regularization). Let u` be a cold-start userwhose profile needs to be learned by recommending a small number of itemsto u`. As in Section 3.1.2, to avoid saturation or apathy, we limit the numberof probe items to a small number b. We denote the true profile of u` by U`and the learned profile (using her feedback on the b items) as Uˆ`. Ourobjective is to select b items that minimizes the error in the learned profileUˆ` compared to the true profile U`. We next formally state the problemstudied in this paper.Problem 2 (Optimal Interview Design - Matrix Factorization). Given userlatent vectors U , item latent vectors V , cold start user u`, and a budgetb, find the b best items to recommend to u` such that E[||Uˆ` − U`||2F ] isminimized.4.2 Solution FrameworkA first significant challenge in solving Problem 2 is that in order to measurehow good our current estimate the user profile is, we need to know the actualprofile of the cold user, on which we have no information! In this section,we devise an approach for measuring the error in the estimated user profile,which intelligently circumvents this problem (see Lemma 4.2.1).214.2. Solution FrameworkNote that using the PMF framework in Section 4.1.1, we obtain low di-mensional latent factor matrices U, V . In the absence of any further informa-tion, we assume that the latent vector of the cold user “truly” describes herprofile.2 Notice that the budget b on the number of allowed interview/probeitems is typically a small number. Following prior work [4, 37, 41], we as-sume that the responses of the cold user u` to this small number of itemsdoes not significantly change the latent factor matrix V associated withitems. Under this assumption, we can perform local updates to U` as theratings from u` on the b probe items are available. A second challengeis that we consider a batch setting for our problem. This means that weshould select the b items without obtaining explicit feedback from the colduser. We overcome this challenge by estimating the feedback rating the useru` would provide according to the current model. Specifically, we estimatecold user u`’s rating on an item vj as R`j = Rˆ`j + ε`j = VTj U` + ε`j , whereε`j is a noise term associated with the user-item pair (u`, vj).Let R` denote the vector containing the ratings of the cold user u` onthe b items presented to her, and let VB be the d × b latent factor matrixcorresponding to these b items. We assume that the noise in estimating theratings Rˆ depends on the item under consideration, i.e., E[ε2ij ] = σ2vj , for allusers ui. This gives us the following posterior distribution,Pr[U`|R`, VB, CB] ∝ N (R`|V TB U`, CB)N (U`|0, σ2u`I)where CB is a b × b diagonal matrix with σ21, σ22, ..., σ2b at positions corre-sponding to the items in B. Using Bayes rule for Gaussians, we obtainPr[U`|R`, VB, CB] ∝ N (U`|Uˆ`,ΣB), where Uˆ` = ΣBVBC−1B R` and ΣB =(σ−2u` I + VBC−1B VTB )−1. Setting γ = σ−2u` , the estimate Uˆ` of the cold user’strue latent factor vector U` can be obtained using a ridge estimate. Moreprecisely,Uˆ` = (γI + VBC−1B VTB )−1VBC−1B R` (4.4)Here, γ is mainly used to ensure that the expression is invertible.Under this assumption, we next show that solving Problem 2 reduces tominimizing tr((VBC−1B VTB )−1). More precisely, we have:Lemma 4.2.1. Given user latent vectors U , item latent vectors V , coldstart user u`, and budget b, a set of b items B minimizes E[||Uˆ`−U`||2F ] iff itminimizes tr((VBC−1B VTB )−1), where VB is the submatrix of V correspondingto the b selected items.2There may be a high variance associated with U`.224.3. Technical ResultsProof. Our goal is to select b items such that using her feedback on thoseitems, we can find the estimate of the latent vector Uˆ` of the cold user u`,that is as close as possible to the true latent vector vector U`.Equation 4.4 gives us an estimate for Uˆ`. For simplicity, we will assumethat γ = 0, and that VBC−1B VTB is invertible.R` can be expressed as VTB U` + εB, where εB is a vector of the b zero-mean noise terms corresponding to the b items. Replacing this in Equation4.4, we getUˆ` = U` + (VBC−1B VTB )−1VBC−1B εB⇒ Uˆ` − U` = (VBC−1B V TB )−1VBC−1B εB (4.5)From Equation 4.5, it is clear that the choice of the b interview itemsdetermines how well we are able to estimate Uˆ`. The expected error in theestimated user profile isE[||Uˆ` − U`||2F ] = E[tr((Uˆ` − U`)(Uˆ` − U`)T )] (4.6)Replacing Equation 4.5 in Equation 4.6 and simplifying, we getE[tr((Uˆ` − U`)(Uˆ` − U`)T )] =E[tr((VBC−1B VTB )−1VBC−1B εBεTB(C−1B )TV TB (VBC−1B VTB )−1)]= tr((VBC−1B VTB )−1) (4.7)The second equality above follows from from replacing E[εBεTB] = CBand simplifying the algebra. The lemma follows.In view of the lemma above, we can instantiate Problem 2 and restateit as follows.Problem 3 (Optimal Interview Design - Matrix Factorization (OID-MF)).Given user latent vectors U , item latent vectors V , cold start user u`, and abudget b, find the b best items to recommend to u` such that E[||Uˆ`−U`||2F ] =tr((VBC−1B VTB )−1) is minimized.Since the lemma shows that Problem 2 is essentially equivalent to Prob-lem 3, we focus on the latter problem in the rest of the thesis.4.3 Technical ResultsIn this section, we study the hardness and approximation of the OID-MFproblem we proposed.234.3. Technical Results4.3.1 HardnessOur main result in this section is:Theorem 4.3.1. The optimal interview design problem under the matrixfactorization setting (OID-MF, Problem 3) is NP-hard.The proof of this theorem is non-trivial. We establish this result byproving a number of results along the way. For our proof, we consider thespecial case where σ21 = σ22 = ... = σ2b = σ2 and λ = σ2σ2u`. Then CB = σ2I,and plugging it in to Equation 4.7 yieldsE[||Uˆ`−U`||2F ] = σ2 ·tr((VBV TB )−1). We prove hardness for this restrictedcase. The hardness of the general case follows.The proof of hardness is performed by reduction from the well-knownNP-complete problem Exact Cover by 3-Sets (X3C) [14].Reduction: Given a collection S of 3-element subsets of a set X, where|X| = 3q, X3C asks to find a subset S∗ of S such that each element ofX is in exactly one set of S∗. Let (X,S) be an instance of X3C, withX = {x1, ..., x3q} and S = {S1, ..., Sn}. Create an instance of OID-MF asfollows. Let the set of items be I = {a1, ..., an, d1, ..., dk}, where k = 3q, itemaj corresponds to set Sj , j ∈ [n], and dj are dummy items, j ∈ [k]. Converteach set Sj in S into a binary vector uj of length k, such that aj[i] = 1whenever xi ∈ Sj and aj[i] = 0 otherwise. Since the size of each subsetis exactly 3, we will have exactly three 1’s in each vector. These vectorscorrespond to the item latent vectors of the n items a1, a2, ..., an. We callthem set vectors to distinguish them from the vectors corresponding to thedummy items, defined next: for a dummy item dj , the corresponding vectordj is such that dj[j] = η and dj[i] = 0, i 6= j. LetW be the set of all vectorsconstructed. We will set the value of η later. Thus, W is the transformedinstance obtained from (X,S). Assuming an arbitrary but fixed ordering onthe items in I, we can treatW as a k×(n+k) matrix, without ambiguity. LetA = {a1, ...,an} and D = {d1, ...,dk} resp., denote the sets of set vectorsand dummy vectors constructed above. We set the budget to b := q + k.For a set of items B ⊂ I, with |B| = b, we let B denote the k × (q + k)submatrix of W associated with the items in B. Formally, our problem isto find b items B ⊂ I that minimize tr((BBT )−1).For a matrix M, define f(M) = tr((MMT )−1). Defineθ :=q3 + η2+k − qη2. (4.8)We will show the following claim.244.3. Technical ResultsClaim 1. Let B ⊂ I, such that |B| = k + q. Then f(B) = θ if (B \ D)encodes an exact 3-cover of X and f(B) > θ, otherwise.We first establish a number of results which will help us prove this claim.Notice that Theorem 4.3.1 follows from Claim 1: if there is a polynomialtime algorithm for solving OID-MF, then we can run it on the reducedinstance of OID-MF above and find the b items B that minimize f(B).Then by checking if f(B) = θ, we can verify if the given instance of X3C isa YES or a NO instance.In what follows, for simplicity, we will abuse notation and use A,B,Wboth to denote sets of vectors and the matrices formed by them, relativeto the fixed ordering of items in I assumed above. We will freely switchbetween set and matrix notations.Recall the transformed instance W of OID-MF obtained from the givenX3C instance. The next claim characterizes the trace of BBT for matricesB ⊂ W that include all k dummy vectors of W.Claim 2. Consider any B ⊂ W such that |B| = k+ q and B includes all thek dummy vectors. Then tr(BBT ) = k + k · η2.Proof. Let B′ = B−D. We have tr(BBT ) = tr(B′B′T +DDT ) = tr(B′B′T )+tr(η2I) =∑ki=1∑qj=1 b2ij + kη2. As B′ is a binary matrix, ∑ki=1∑qj=1 b2ij =∑ki=1 ||b∗i||0, where b∗i is the ith row, and || · ||0 is the l0−norm. This isnothing but the total number of 1’s in B′, which is 3q = k. Thus, tr(BBT ) =k + kη2.The next claim shows that among such subsets B ⊂ W, the ones thatinclude all dummy vectors have the least f(.)-value, i.e., have the minimumvalue of tr((BBT )−1). Recall that D = {d1, ...,dk} is the set of dummyvectors constructed from the given instance of X3C.Claim 3. For any subset A ⊂ W, with |A| = k+ q, such that D 6⊂ A, thereexists A′, with |A′| = k + q and D ⊂ A′, such that f(A′) < f(A).Proof. By Claim 2, tr(A′A′T ) = k + kη2. By assumption, A has at least1 fewer dummy vectors than A′ and correspondingly more set vectors thanA′. Since each set vector has exactly 3 ones, we have tr(AAT ) ≤ k +kη2 + 3 − η2 for η2 > 3. Let us consider the way the trace is distributedamong the eigenvalues. The distribution giving the least f(.) is the uniformdistribution. For AAT , this is λ1 = λ2 = ... = λk = tr(AAT )/k. Thedistribution yielding the maximum f(.) is the one that is most skewed. ForA′A′T , this happens when there are two distinct eigenvalues, namely η2254.3. Technical Resultswith multiplicity (k − 1) and k + η2 with multiplicity 1. This is because,the smallest possible eigenvalue is η2 and the trace must be accounted for.3We next show that the largest possible value of f(A′) is strictly smaller thanthe smallest possible value of f(A), from which the claim will follow.Under the skewed distribution of eigenvalues of A′A′T assumed above,f(A′) ≤ k−1η2+ 1k+η2. Similarly, for the uniform distribution for the eigen-values of AAT assumed above, f(A) ≥ k × k/tr(AAT ) ≥ k2k+kη2+3−η2 . Setη to be any value ≥√(k + 3). Then we havef(A′) ≤ (k − 1)(k + 3)+1(2k + 3)=2k(k + 1)(k + 3)(2k + 3)f(A) ≥ k2(k + k(k + 3) + 3− k − 3)=k(k + 3).(4.9)Now, 2(k+1) < (2k+3). Multiplying both sides by k(k+3) and rearranging,we get the desired inequality f(A′) ≤ 2k(k+1)(k+3)(2k+3) < k(k+3) ≤ f(A), showingthe claim. We can obtain a tighter bound on η by solving k−1η2+ 1k+η2≤k2k+kη2+3−η2 , which gives us η2 ≥ 12 [√5k2 + 4− k + 4].In view of this, in order to find B ⊂ W with |B| = k + q that minimizesf(B), we can restrict attention to those sets of vectors B which include allthe k dummy vectors.Consider B ⊂ W, with |B| = k + q that includes all k dummy vectors.We will show in the next two claims that the trace tr(BBT ) = k+kη2 will beevenly split among its eigenvalues iff B − D encodes an exact 3-cover of X.We will finally show that it is the even split that leads to minimum f(B).Claim 4. Consider a set B, with |B| = k+ q, such that B includes all the kdummy vectors. Suppose the rank q matrix B′ = B −D does not correspondto an exact 3-cover of X. Then B′B′T has q non-zero eigenvalues, at leasttwo of which are distinct.3Such extreme skew will not arise in reality since this corresponds to all q set vectorsof A being identical (!), but this serves to prove our result.264.3. Technical ResultsProof. The q column vectors in B′ are linearly independent, so rank(B′) =rank(B′B′T ) = q. Since B′B′T is square, it has q non-zero eigenvalues. It issufficient to show that at least two of those eigenvalues, say λ1 and λ2, areunequal. As B′ does not correspond to an exact 3-cover, at least one rowhas more than one 1, and so at least one row is all 0’s. The correspondingrow and column in B′B′T will also be all 0’s.Define the weighted graph induced by B′B′T as G = (V,E,w) such that|V | = k, w(i, j) = (B′B′T )ij , ∀i, j ∈ [k]. The all-zero rows correspond toisolated nodes. We know that the eigenvalues of the the matrix B′B′T areidentical to those of the induced graph G, which in turn are the same asthose of the connected components of G. Consider a non-isolated node i.Since each row of B′ is non-orthogonal to at least two other rows, it followsthat (B′B′T )ij ≥ 1 for at least 2 values of j 6= i. Thus, each non-isolated nodeis part of a connected component of size ≥ 3 and since there are isolatednodes, the number of (non-isolated) components is < q. Thus, the q non-zero eigenvalues of G are divided among the < q components of G. By thepigeonhole principle, there is at least one connected component with ≥ 2eigenvalues, call them λ1, λ2, say λ1 ≥ λ2. We know that a component’slargest eigenvalue has multiplicity 1, from which it follows that λ1 6= λ2, aswas to be shown.We next establish two helper lemmas, where M denotes a k × k sym-metric matrix.Lemma 4.3.2. Let M be a positive semidefinite matrix of rank q. Sup-pose that it can be expressed as a sum of rank one matrices, i.e., M =∑qi=1 ai · aiT , where ai is a column vector, and ∀i, j ∈ [k], i 6= j,ai · ajT =0, and ai · aiT = s . Then the q eigenvalues of M are identical and equalto s.Proof. The spectral decomposition of a rank q matrix M is given asM =q∑i=1λiuiuiT (4.10)where λi are eigenvalues and ui are orthonormal vectors. From thehypothesis of the lemma, we have 1s · ai · aiT = 1.M =q∑i=1aiaiT =q∑i=1s× ai√sai√sT(4.11)where ai√sare orthonormal. Comparing this with Eq. 4.10, the eigenvaluesof M are λ1 = λ2 = ... = λq = s.274.3. Technical ResultsLemma 4.3.3. Let M be a symmetric rank k matrix and suppose that itcan be decomposed into∑qi=1 aiaiT +κ · I, for some constant κ. Then it has(k − q) eigenvalues equal to κ.Proof. Let the eigenvalues of M be λ1, λ2, ..., λk. Let λ any eigenvalue ofM, and v the corresponding eigenvector. Then we have (M + κI)v =(∑qi=1 aiaTi +κI)v = (λ+κ)v. Since∑qi=1 aiaiT results in a rank q symmetricmatrix, it has q non-zero eigenvalues. Adding κ to all of them, we get,λq+1 = ... = λk = κ.Proof of Claim 1: Consider any set of vectors B ⊂ W: |B| = k + q.By Claim 3, we may assume w.l.o.g. that B includes all k dummy vectors.Suppose B′ := B − D encodes an exact 3-cover of X. Then BBT can bedecomposed into the sum of q rank one matrices and a diagonal matrix:BBT = ∑qj=1 bj · bTj + η2I. Here bi refers to the ith column of B, which isa set vector. Since B′ is an exact 3-cover, we further have that bi · bTi = 3,i ∈ [q], and bi · bTj = 0, i 6= j. By Lemma 4.3.2, since B′B′T is also a positivesemidefinite matrix of rank q, we have λB′1 = · · ·λB′q = 3, where λB′i are theeigevalues of B′B′T . The corresponding q eigenvalues of BBT are all η2 + 3.Furthermore, by Lemma 4.3.3, the remaining k−q eigenvalues of BBT are allequal to η2. That is, the eigenvalues of BBT are λB1 = · · · = λBq = η2 + 3 andλBq+1 = · · · = λBk = η2. For this B, f(B) = tr((BBT )−1) = qη2+3 + k−qη2 = θ(see Eq. 4.8).Now, consider a set of vectors A ⊂ W, with |A| = k + q, such thatthat A includes all k dummy vectors. Suppose A′ := A − D does not cor-respond to an exact 3-cover of X. Notice that A is a symmetric rank kmatrix which can be decomposed into A = ∑qj=1 ai · aTi + η2I, so λAq+1 =· · · = λAk = η2, where λAi , i ∈ [q + 1, k], are k − q of the eigenvalues ofAAT . Since both B and A include all k dummy vectors and q of the setvectors, by Claim 2, tr(BBT ) = tr(AAT ) = k+ kη2. We have ∑qj=q+1 λBj =(k − q)η2 = ∑kj=q+1 λAj and so ∑qj=1 λAj = ∑qj=1 λBj = q(η2 + 3). Now,f(B) = ∑kj=1 1λBj = qη2+3 + k−qη2 , whereas f(A) = ∑kj=1 1λBj = ∑qj=1 1λAj + k−qη2 .Thus, to show that f(B) < f(A), it suffices to show that qη2+3<∑qj=11λAj.LHS = q× 1AM(λB1 ,...,λBq )= q× 1AM(λA1 ,...,λAq ), where AM(.) denotes the arith-metic mean. RHS = q × 1HM(λA1 ,...,λAq ), where HM(.) denotes the harmonicmean. It is well known that AM(.) ≥ HM(.) for a given collection of posi-tive real numbers and the equality holds iff all numbers in the collection areidentical. On the other hand, we know that since A′ does not correspond to284.3. Technical Resultsan exact 3-cover of X, by Claim 4, not all eigenvalues of A′ are equal, fromwhich it follows that LHS < RHS, completing the proof of Claim 1 as alsoTheorem 4.3.1.After this, we show our inapproximability result for OID-MF.4.3.2 Hardness of ApproximationTheorem 4.3.4. It is NP-hard to approximate the optimal interview designproblem in polynomial time within a factor less than αθ , where α = θ +2(2+η2)(4+η2)(3+η2), η2 ≥ 12 [√5k2 + 4 − k + 4], θ = q3+η2+ k−qη2, k = 3q is thedimension of the latent vector-space.First we define a variant of the X3C problem which we refer to as Maxq-Cover by 3-Sets (M3C), which will be convenient in our proof.Definition 4.3.1. Given a number q and a collection of sets S = S1, S2, ..., Sn,each of size 3, is there a subset S∗ of S such that the cover C = |⋃s∈S∗ s| =3q and |S∗| ≤ q?Since each set has 3 elements, with |S∗| ≤ q, we get C = 3q if and onlyif |S∗| is an exact cover. Thus X3C can be reduced to M3C, making M3CNP-hard.We convert an instance x of M3C to an instance of OID-MF, h(x), inthe same way as described in the NP-Hardness proof: let the set of itemsbe I = {a1, ..., an, d1, ..., dk}, where k = 3q, item aj corresponds to set Sj ,j ∈ [n], and dj are dummy items, j ∈ [k]. Let the dummy vectors be definedas above, and b := q + k. As shown previously in Claim 3, we need to onlyconsider those sets of vectors B that have all k dummy vectors. Similarly,we can transform a solution y of OID-MF, back to a solution of M3C, g(y),in the following manner: discard the chosen dummy vectors, and take thesets corresponding to the q set vectors.As an YES instances of M3C correspond to YES instances of X3C, aninstance x with C = 3q corresponds to f(B) = θ.For the NO instances of M3C, C ≤ 3q − 1 (by the definition). Unfor-tunately, a similar one-to-one mapping does not exist in such cases: withthe same C, there could be multiple instances of M3C that corresponds todifferent instances of OID-MF and correspondingly f(B). From Theorem4.3.1, we know that it is NP-hard to determine whether f(B) ≤ θ for agiven instance of OID-MF h(x).294.3. Technical ResultsTo find the lowest f(B) of a NO instance of OID-MF, we first prove anintermediate result that shows that among the set of different f(B) giving thesame cover value C, the lowest possible f(.) value increases as C decreases.Claim 5. As the cover value increases, the best (i.e., lowest) f(.) valueamong all the solutions with the same cover value decreases.Proof. Let B′ = B \ D.By interpreting B′B′T as a (k × k) adjacency matrix, the dimensionscorrespond to the k nodes in the graph. Dimensions that are uncoveredare isolated nodes, and dimensions that are covered are part of a connectedcomponent. Sum of degrees of the entire graph = 3k (sum of all entries inthe adjacency matrix B′B′T ) which is a constant given k.From this, given that the sum of the degrees over the graph is 3k (which isa constant), we argue that with more uncovered dimensions/nodes, averagedegree (davg) (ignoring the isolated nodes) and maximum degree (dmax)increase. From this, it follows that each non-isolated node has degree atleast 3, hence the average degree for such nodes is greater than 3 for anyB′B′T . If there are multiple components in a given graph, considering theone with the highest average degree, λ1 ≥ max(dmaxavg ,√dmax), where dmaxavgis the highest average degree among all components.For a NO instance, the highest average degree among all connected com-ponents is greater than 3, since the vectors must overlap at least over 1 di-mension. For a given cover C, the lowest value of λ1 is thus lower boundedby davg > 3, which increases as the overlap increases. In turn, a higher valueof λ1 makes the distribution of eigenvalues more skewed, leading to a higherf(.). To have a lower f(.), we must have λ1 as close to 3 as possible, bydecreasing davg and dmax, thereby, increasing coverage.Following this claim, among the NO instances of OID-MF, it is sufficientto show that the lowest f(.) corresponds to the highest C, where C = 3q−1.Next we calculate its corresponding f(.).Given a NO instance with C = 3q−1, we next show that such an instancegives rise to a unique OID-MF solution. For this scenario, it could be shownthat there are exactly q−2 disjoint sets, and 2 sets cover exactly one elementtwice. This can only be obtained from a solution y of OID-MF, if in thegiven solution, q− 2 set vectors are disjoint, and 2 have exactly one 1 in thesame position. The following example illustrates this.Example 1. For an instance with q = 3, a solution with exactly two vectorsoverlapping on one dimension could look like304.3. Technical ResultsB′T =1 1 1 0 0 0 0 0 00 0 1 1 1 0 0 0 00 0 0 0 0 1 0 1 1ThenB′B′T =1 1 1 0 0 0 0 0 01 1 1 0 0 0 0 0 01 1 2 1 1 0 0 0 00 0 1 1 1 0 0 0 00 0 1 1 1 0 0 0 00 0 0 0 0 1 0 1 10 0 0 0 0 1 0 1 10 0 0 0 0 0 0 0 00 0 0 0 0 1 0 1 1Next, we show what f(B) of such a solution would be. As before, letB′ := B \ D. Interpreting B′B′T as the adjacency matrix of a graph G, weknow that the eigenvalues of B′B′T are the same as those of G, given bythe multi-set union of its components, which are: q − 2 corresponding tothe disjoint set vectors, and 1 corresponding to the two over-lapping vectors.The first q−2 components each form a 3-regular graph which contributes aneigenvalue of 3 each. It could be shown that the last one, which correspondsto the overlap, contributes to (0, 0, 0, 0, 2, 4). Therefore, f(B) = α = 2qη2+q−23+η2+ 12+η2+ 14+η2= θ + 2(2+η2)(4+η2)(3+η2).It follows from our arguments, that f(B) ≥ α if and only if C ≤ 3q − 1.Let A be an approximation algorithm that approximates OID-MF towithin c < αθ , returns a value v such that OPTOID−MF (h(x)) ≤ v ≤ c ×OPTOID−MF (h(x)).Claim 6. x is a YES instance of M3C if and only if θ ≤ v < α.Proof. If h(x) is a YES instance of OID-MF, OPTOID−MF (h(x)) = θ, soθ ≤ v ≤ c× θ. Since c < αθ , θ ≤ v < α. If h(x) is a NO instance of OID-MF,α ≤ OPTOID−MF (h(x)), so α ≤ v. Since the intervals are disjoint, theclaim follows.Thus if such an approximation algorithm A existed, we would be able todistinguish between the YES and NO instances of M3C in polynomial time.However as that is NP-hard, unless P = NP, A cannot exist.314.3. Technical Results4.3.3 MonotonicityWe define monotonicity as in Definition 2.1.1.In [4], for a similar objective function for the user selection problem fora cold-start item, the authors prove it is monotone decreasing.4.3.4 Supermodularity and SubmodularityWe define submodularity/supermodularity as in Definition 2.1.2In [4], for a similar objective function for the user selection problemfor a cold-start item, the authors claimed that their objective function issupermodular. The following lemma shows that the objective function f(.)for our OID-MF problem is not supermodular.Lemma 4.3.5. The objective function f(B) = tr((VBVTB )−1) of the OID-MF problem is not supermodular.Proof. We prove the result by showing that the function tr(MMT )−1) is ingeneral not supermodular. Consider the following matrices:M1 =1 1 0 0 00 0 1 0 01 0 0 1 00 0 1 1 10 0 1 0 1M2 =0 0 1 0 1 11 0 0 0 0 10 1 0 0 1 11 1 0 1 0 01 0 0 1 0 1 ,and vectorxT =[0 1 0 0 0]Notice that M1, viewed as a set of column vectors, is a subset of M2,viewed as a subset of column vectors. Now, f(M1) = tr(M1MT1 ) =12, f(M1 ∪ {x}) = 10.333, f(M2) = 6.6250, f(M2 ∪ {x}) = 4.4783.Clearly, f(M1 ∪ {x}) − f(M1) = 10.333 − 12 = −1.6667 and f(M2 ∪{x})− f(M2) = 4.4783− 6.6250 = −2.1467, which violates f(M1 ∪ {x})−f(M1) ≤ f(M2 ∪ {x})− f(M2), showing f(.) is not supermodular.324.4. AlgorithmsWe remark that the lack of supermodularity of f(.) is not exclusive tobinary matrices; supermodularity does not hold for real-valued matricesMas well. As a consequence, by the duality between the technical problems ofcold-start users and cold-start items, the lemma above disproves the claimin [4] about the supermodularity of their objective function.Lemma 4.3.6. The objective function of f(B) = tr((VBVTB )−1) the OID-MF problem is not sub-modular.Proof. Consider the same matrix M2 and x as above and the followingmatrix M1.M1 =0 0 0 1 11 0 0 0 10 1 0 1 11 1 1 0 01 0 1 0 1As before, regarded as sets of column vectors,M1 ⊂M2. However, f(M1) =20, f(M1 ∪ {x}) = 16.333, f(M2) = 6.6250, f(M2 ∪ {x}) = 4.4783. Hencef(M1∪{x})−f(M1) = 16.333−20 = −3.6667 and f(M2∪{x})−f(M2) =4.4783 − 6.6250 = −2.1467. Clearly, submodularity fails to hold, sincef(M1 ∪ {x})− f(M1) ≥ f(M2 ∪ {x})− f(M2) is violated.4.4 AlgorithmsIn this section, we present algorithms for selecting items with which tointerview a cold user so as to learn her preferences as well as possible. Inview of the hardness and hardness of approximation results (Theorems 4.3.1and 4.3.4), and the fact that the objective function is neither submodularnor supermodular (see Section 4.3.4), efficient approximation algorithms areunlikely to exist. We present scalable heuristic algorithms for item selection.4.4.1 Accelerated Backward GreedyAs discussed earlier, in [4], the authors study the problem of user selection forthe cold-start item problem. They propose two backward greedy algorithms,called Backward Greeedy (BG) and Backward Greedy2 (BG2).Recall that for a set of items S ⊆ {v1, ..., vn}, we denote by VS thesubmatrix of the item latent factor matrix corresponding to the items inS and f(VS) := tr((VSVTS )−1) is the profile learning error that we seek to334.4. Algorithmsminimize by selecting the best items. It can be shown, following a similarresult in [4] (Proposition 3) that f(.) is monotone decreasing, i.e., for itemsets S ⊆ T , f(VT ) ≤ f(VS).Overview. Backward greedy algorithms essentially remove the worst itemsfrom the set of all items and use the remaining ones as interview items.The core idea is to start with the set of all items and successively removean item with the smallest increase in the error, until no more than b itemsare left, where b is the budget. It was claimed in [4] that the backwardgreedy algorithms are approximation algorithms. This is incorrect sincetheir claim relies on the error function being supermodular. Unfortunately,their proof of supermodularity is incorrect as shown by our counterexamplein Section 4.3.4. Thus, backward greedy is a heuristic for their problem aswell as our OID problem.Acceleration. In this section, we propose accelerated versions of BG andBG2, by incorporating two optimizations: (i) we speed up the matrix in-version step, necessary for evaluating the error function f(.), by using theSherman-Morrison formula with a rank-one update; (ii) we borrow ideasfrom the classic lazy evaluation approach, originally proposed in [33] to saveon function evaluations. We briefly describe these optimizations next.Sherman-Morrison. Suppose the current set of items is S. Let v ∈ S bean item that is removed from S to give S′ = S \ {v}. Suppose (VSV TS )−1 isalready computed. Then (VS′VTS′)−1 can be computed as(VS′VTS′)−1 = (VSV TS − vvT )−1= (VSVTS )−1 +(VSVTS )−1vvT (VSV TS )−11− vT (VSV TS )−1v.Lazy evaluation. Lazy evaluation, originally proposed as a way to speed upthe greedy algorithm for submodular function maximization, can be easilyadapted to a supermodular function minimization as follows. Suppose g(S)is a supermodular set function that is monotone decreasing and we wantto find a set S of size ≤ b that minimizes g(S), using the backward greedyframework. Let Si be the set of items in iteration i of a backward greedy al-gorithm. Then for j > i, clearly Sj ⊂ Si. Define g˜(u|S) := g(S \{u})−g(S),i.e., increase in error from dropping item u from set S. If g is supermodular,it follows that g˜(u|Si) ≤ g˜(u|Sj), whenever u ∈ Sj , i < j. This followsupon noting that the function g˜(u|S), denoting increase in error on drop-ping u from S is actually the negative of the marginal gain of adding u toS \ {u}, i.e., g˜(u|S) = −g(u|S). Suppose there are items u ∈ Si and v ∈ Sj ,with i < j, such that g˜(u|Si) ≥ g˜(v|Sj). Then there is no need to evaluate344.4. Algorithmsg˜(u|Sj), since g˜(u|Sj) ≥ g˜(u|Si) ≥ g˜(v|Sj). This saving on evaluations canbe implemented by keeping track of when each error increment (or marginalgain) was last updated and making use of a priority queue for picking theelement with the least error increment.Recall that our error function f(.) is actually not supermodular. Ourmain goal in applying lazy evaluation to it is not only to accelerate itemselection, but also explore the impact of lazy evaluation on the error perfor-mance.We give the pseudocode for the accelerated version of BG2 in Algo-rithm 2. The algorithm corresponding to accelerated BG can be easilyobtained from this, as we explain below.We start with B initialized to all items. We use a priority queue forefficient implementation of lazy evaluation. The “freshness” check in Line10 in Algorithm 2 checks that the error increment in αj was computedin the latest iteration to make sure vj is indeed the item with the leasterror increment. Notice that we use the notation f˜(vj |VB, CB) where VBis the latent factor matrix corresponding to the items in B and CB is thecovariance matrix. CB is initialized to the diagonal matrix of noise termsC, where Cjj = σ2j , vj ∈ B.The use of Sherman-Morrison optimization allows us to save on repeatedinvocation of matrix inverse, and instead allows it to be computed incremen-tally and hence efficiently using rank one update. The use of lazy evaluationsaves on evaluations of error increments that are deemed redundant, assum-ing (pretending, to be more precise) that f(.) is supermodular. We willevaluate both the prediction and profile error performance as well as therunning time performance of these optimizations in Section 4.5.The algorithm for the accelerated version of basic Background Greedy(BG) differs from Algorithm 2 by simply assuming that the noise terms areidentical, i.e., σ21 = · · · = σ2n = σ2, i.e., we set C to σ2 · I, where I is theidentity matrix. This saves some work compared to Algorithm 2. We referto this modified algorithm as ABG1 and omit its pseudocode for brevity.4.4.2 Accelerated Forward GreedyIn a real recommender system, the number n of items may be in the millionsand b may be << n. One key shortcoming of the backward greedy family ofalgorithms (BG and BG2 and their accelerated versions) is that they needto sift through a large number of items and eliminate them one by one tillthe budget b is reached. One approach for remedying this is to consider aforward greedy approach.354.4. AlgorithmsAlgorithm 2 Accelerated Backward Greedy 2 (ABGD2)Input: item set I and corresponding matrix V ; budget b; diagonal matrixof noise terms C.Output: items subset B, |B| = b.1: VB ← V2: B ← I3: CB ← C4: for j ← 1 to |I| do5: insert (vj , f˜(vj |VB, CB)) into priority queue Q6: end for7: do8: pop (vj , αj) from Q9: if αj not “fresh” then10: recompute αj = f˜(vj |VB, CB)11: end if12: if αj < MIN(Q) then13: VB ← VB \ vj14: B ← I \ {vj}15: CB ← CB \ σ2j16: else17: insert (vj , αj) into Q18: end if19: while |VB| > b364.5. Experimental EvaluationOverview. Recall that the function f(.) is monotone decreasing, so −f(.) ismonotone increasing. We start with B initialized to the empty set of items,which has the highest error and hence −f(∅) has the smallest value. At eachiteration, we add to B an item that has the maximum marginal gain w.r.t.−f(.). That is, we successively addv∗ = arg maxv∈I\B[−f(VB∪{v}|CB∪{v})− (−f(VB)|CB)]= arg maxv∈I\B[f(VB|CB)− f(VB∪{v}|CB∪{v})]to B until the budget b is reached. In the algorithm, we use −f˜(vj |VB, CB)to denote [f(VB|CB)− f(VB∪{v}|CB ∪ {v})].As with accelerated backward greedy, Sherman-Morrison formula andlazy evaluation are used to optimize forward greedy. The resulting algo-rithm, referred to as Accelerated Forward Greedy 2, is depicted in Algo-rithm 3. It can be also be adapted to get Accelerated Forward Greedy 1(AFG1), by setting C = σ2 · I, as we did for ABG1.In the next section, we conduct an empirical evaluation of the forwardand backward greedy algorithms as well as their accelerated versions pro-posed here and compare them against baselines.4.5 Experimental EvaluationIn this section, we describe the experimental evaluation for our algorithms,and compare them with prior art. We evaluate our solutions both qualita-tively and scalability-wise: quality evaluation is done by measuring predic-tion error and user profile estimation error (see Section 4.5.2), whereas, thescalability study is conducted by measuring the running time.The development and experimentation environment uses a Linux Serverwith 2.93 GHz Intel Xeon X5570 machine with 98 GB of memory withOpenSUSE Leap OS.4.5.1 Dataset and ModelFor our experiments, we use datasets from the movie recommendation do-main – Netflix 4 and Movielens (ML) 5.Moreover, we use three differentavailable ML datasets that gives us a total of four different datasets. Wedescribe their characteristics in Table 4.1.4The original dataset was released as part of The Netflix Prize [7] but has since beenremoved from the public domain due to privacy concerns.5Available at http://grouplens.org/datasets/movielens/. Source: [19]374.5. Experimental EvaluationAlgorithm 3 Accelerated Forward Greedy 2 (AFG2)Input: item set I and corresponding matrix V ; budget b; diagonal matrixof noise terms C.Output: items subset B, |B| = b1: VB ← φ2: B ← φ3: for j ← 1 to |I| do4: insert (vj ,−f˜(vj |VB, CB)) into priority queue Q5: end for6: do7: pop (j, αj) from Q8: if αj not ”fresh” then9: recompute αj = −f˜(vj |VB, CB)10: end if11: if αj > MAX(Q) then12: VB ← VB ∪ {vj}13: B ← I ∪ {vj}14: else15: insert (vj , αj) into Q16: end if17: while |VB| < b384.5. Experimental EvaluationTable 4.1: Dataset SizesDataset # Ratings # Users # Items SparsityML 100K 100,000 943 1682 6.3%ML 1M 1,000,209 6,040 3,900 4.25%ML 20M 20,000,263 138,493 27,278 0.53%Netflix 100,480,507 480,189 17,770 1.18%4.5.2 Model Parameters & Experimental SetupFor each dataset, we train a probabilistic matrix factorization model [39]on only the ratings given by 70% of the users. We refer to them as the warmusers, U . We use gradient descent algorithm [48] to train the model, withlatent dimension = 20, momentum = 0, regularization = 0.1 and linearlydecreasing step size for faster convergence. This allows us to use large stepswhile initially moving towards the minima, decreasing as we approach it toavoid overshooting. We report the number of warm users, number of itemsrated by them, and RMSE obtained, for the different datasets in Table 4.2.Note that # Items in Table 4.1 is different from |I| in Table 4.2. This isbecause some items were not rated by the warm users. We had to discardsuch items from consideration, as we could not build item profiles for them.Table 4.2: Experiment SizesDataset # Warm Users |U| |I| RMSEML 100K 702 1647 0.9721ML 1M 4,473 3,666 0.8718ML 20M 102,628 25,529 0.7888Netflix 355,757 17,770 0.8531Experimental Setup: We simulate the cold user interview process asfollows:1. Set up the system(a) Randomly select 70% of the users in a given dataset to train themodel (U)(b) R := Matrix of ratings given by U only394.5. Experimental Evaluation(c) Train a PMF model on R, to obtain U, V2. Construct item covariance matrix C given by,σ2j :=1|R∗j |∑i∈R∗j (Rij − Rˆij)2, where R∗j refers to the column(set) ofratings received by item j3. For each cold user u` 6∈ U ,(a) Construct U` using gradient descent method [37], and using theitem latent factor matrix V(b) Randomly split items they have rated, into candidate pool CPand test set Test(c) Run item selection algorithm on CP with corresponding V andbudget = b(d) B := items returned by algorithm to interview u`(e) Reveal R` := u`’s ratings on B(f) Construct Uˆ` := (γI + VBC−1B VTB )−1VBC−1B R`(g) Evaluation: RMSE on Test :=√1|Test|∑vj∈Test(R`j − V Tj Uˆ`)2 ,profile error := ||Uˆ` − U`||2F4. Average prediction and profile error over all cold usersIn Step 2, we estimate the noise terms using the method outlined in [4].For the case where the covariance matrix C = σ2I, we estimate σ2 that bestfit a validation set (a randomly chosen subset of the cold user ratings).In Step 3a we compute true latent vectors of the cold users. We cannotcompute that using the PMF model in Step 1c, as these ratings are hiddenat that stage. Moreover, since each cold user is independent, we cannottrain a model on their combined pool of ratings. Instead, we adopt thegradient descent based method given in [37], to generate the latent vectorfor a new user keeping everything else constant. This method producesresults comparable to retraining the entire model globally (with 1% error inthe worst case), in a few milliseconds [37]. The user profiles thus generatedare used as true profiles.In Step 3g, the estimated rating is computed by taking an inner productbetween the item and the cold user vector Uˆ`. For the real rating R`j , westudy two different settings: in one we use ratings that were provided by u`on Test, but were kept hidden from the system until the evalution stage,and in the other R`j = VTj U`. We call the first one the real setting, as404.5. Experimental Evaluationwe use the real ratings provided by the cold user, and we call the secondone the ideal setting, as it corresponds to an ideal, zero error MF model.Studying our algorithms under the ideal setting has two advantages: first,it allows us to decouple our problem from the problem of tuning a matrixfactorization model. This way, the model error from using a possibly lessthan perfect PMF model does not percolate to our problem. Second, we donot have to be limited to only selecting the items for which we have ratingsin our dataset, as we can generate ratings for all items. This is especiallycrucial for sparse datasets, like ours.4.5.3 Algorithms ComparedWe compare the following algorithms against their accelerated versions,as described in Section 4.4: Backward Greedy Selection 1 (BG), BackwardGreedy Selection 2 (BG2), Forward Greedy Selection (FG) and Forward GreedySelection 2 (FG2). Further, we use the following two heuristics as baselines:Random Selection (RS), where the items are randomly sampled from thecandidate pool, and High Variance (HV), where b items with the highestvariance in rating prediction among warm users are selected. This gives usa total of 10 algorithms to compare.4.5.4 Quality ExperimentsResults: We run quality experiments to measure prediction error and pro-file error for all five datasets. For the datasets ML 100K and ML 1M, wecompare all 10 algorithms under the ideal setting, where we note that BGperforms almost exactly the same as FG (see Fig. 4.1), while BG2 and FG2perform better for both profile and prediction error. For the larger datasetsNetflix and ML 20M, we compare algorithms under the real setting. Forboth, we observe that FG2 outperforms BG2 for both prediction and profileerror, and FG outperforms BG for smaller values of b.We also observe that algorithms that produce low profile error also pro-duce low prediction error. The only exception is RS and HV in Figures 4.2,4.3.Despite the lack of supermodularity or submodularity, the acceleratedvariants of all the algorithms perform akin to their non-accelerated variantson both prediction and profile error, for all four datasets (Fig. 4.1, 4.2,4.3). This suggests that the objective function may be close to satisfyingsupermodularity. This requires further investigation.414.5. Experimental Evaluation0 20 40 60 80 1002.62.72.82.933.13.23.33.43.5Number of movies ratedRMSE ABG2ABGAFGAFG2HVBG2BGFGRSFG2(a) Prediction Error – ML 100K0 20 40 60 80 10024681012141618Number of movies ratedUser profile error ABG2ABGAFGAFG2HVBG2BGFGRSFG2(b) Profile Error – ML 100K0 20 40 60 80 1000.40.50.60.70.80.911.11.21.3Number of movies ratedRMSE ABG2ABGAFGAFG2HVBG2BGFGRSFG2(c) Prediction Error – ML 1M0 20 40 60 80 10000.511.522.533.544.5Number of movies ratedUser profile error ABG2ABGAFGAFG2HVBG2BGFGRSFG2(d) Profile Error – ML 1MFigure 4.1: Movielens 100K and 1M Datasets424.5. Experimental Evaluation0 20 40 60 80 1000.820.840.860.880.90.920.940.960.9811.02Number of movies ratedRMSEAverage number of candidate items per user:282.3378 ABG2ABGAFGAFG2HVBG2BGFGRSFG2(a) Prediction Error0 20 40 60 80 1003.753.83.853.93.9544.05Number of movies ratedUser profile error ABG2ABGAFGAFG2HVBG2BGFGRSFG2(b) Profile ErrorFigure 4.2: Netflix Dataset0 20 40 60 80 1000.750.80.850.90.951Number of movies ratedRMSEAverage number of candidate items per user:271.287 ABG2ABGAFGAFG2HVBG2BGFGRSFG2(a) Prediction Error0 20 40 60 80 10011.11.21.31.41.51.61.71.8Number of movies ratedUser profile error ABG2ABGAFGAFG2HVBG2BGFGRSFG2(b) Profile ErrorFigure 4.3: Movielens 20M Dataset434.5. Experimental Evaluation4.5.5 Scalability ExperimentsTo test scalability of our proposed solutions we run all 10 algorithms on 2of the datasets, ML 100K and ML 1M under ideal setting and on Netflixand ML 20M under real setting, and measure running times with varyingbudget. Note that due to the datasets’ sparsity, the average number of itemsper cold user that the algorithms sift through in the real setting ranges from271 to 282, while in the ideal setting, it is significantly more (1647 and 3666for ML 100K and ML 1M respectively).Results: In all cases, the accelerated algorithms produce error similarto their un-accelerated counterparts (Fig. 4.1, 4.2, 4.3), but running timeperformance is far superior (Fig. 4.4, 4.5). Among all algorithms, FG2 (bothaccelerated and unaccelerated) has the best qualitative performance, withprediction and profile error comparable to BG2 (Fig. 4.1) or better (Fig.4.2, 4.3), and is significantly faster than BG2 in terms of running time. Infact, even for ML 100K, our smallest dataset, under the ideal setting, thetime taken by unaccelerated FG2 for b = 100 is approximately a sixth ofthe time taken by ABG2 for b = 4. Moreover, running times of all backwardgreedy algorithms increase significantly as we decrease b (see Fig. 4.4, 4.5),which makes them unsuitable for use in a real world system, where b wouldtypically be very small.0 20 40 60 80 100020406080100120140Number of movies ratedTime in seconds BGABGBG2ABG2FGAFGFG2AFG2(a) Running Time – ML 100K0 20 40 60 80 100050100150200250Number of movies ratedTime in seconds BGABGBG2ABG2FGAFGFG2AFG2(b) Running Time – ML 1MFigure 4.4: We measure the running time by varying b from 4 to 100 for ML 100Kand ML 1M, averaged across 200 and 1000 cold users respectively. The plots showthe performance of all 10 greedy algorithms on a candidate pool of 1682 and 3952items respectively.444.5. Experimental Evaluation0 20 40 60 80 1000123456Number of movies ratedTime in seconds BGABGBG2ABG2FGAFGFG2AFG2(a) Running Time – Netflix0 20 40 60 80 100012345678Number of movies ratedTime in seconds BGABGBG2ABG2FGAFGFG2AFG2(b) Running Time – ML 20MFigure 4.5: We measure the running time by varying b from 4 to 100 for Netflixand ML 20M, averaged across 5000 cold users.45Chapter 5ConclusionIn this thesis, we studied theoretical properties of the cold-start user prob-lem in collaborative filtering recommender systems. We studied this prob-lem under the two most prevalent frameworks for collaborative filtering –neighbourhood-based methods and matrix factorization. More specifically,we studied which items to interview a cold-start user with, so as to best learntheir preferences. We call this the optimal interview design (OID) problem,and we formalized what it means to learn the user’s preferences best underthe two different frameworks – OID-NB for neighbourhood-based methodsand OID-MF for matrix factorization.We proved that both OID-NB and OID-MF are NP-hard, and provedbounds on their approximability. We also studied their monotonicity andsubmodularity/supermodularity properties, proposed various efficient algo-rithms and evaluated their performance comprehensively on two and fourreal world datasets, for the neighbourhood based framework and the matrixfactorization framework respectively. For the first, we considered only item-item similarity collaborative filtering and absolute thresholding for neigh-bour selection, but it would be interesting to study the same problem withuser-user similarity and top-n neighbour selection methods. In our exper-iments, we found that the item-item similarity matrix computation was aserious bottleneck. Although some scalable alternatives have been proposed[6, 15, 46], further research is needed to adapt them for the cold-start prob-lem.For the second part, we demonstrated the importance of learning thecold-start user’s profile, and how that translates to improved rating predic-tion performance. Our proposed algorithm not only outperformed the stateof the art in terms of learning the user’s preferences, but was also manytimes faster. We also observed that our proposed accelerated algorithmsperformed akin to the unaccelerated variants, suggesting that the objectivefunction is close to supermodular. One future research direction would beto see whether it satisfies weak supermodularity [9].This work can be also extended to the cold-start item problem, whichis analogous to what we studied and the cold-start system problem as well.46Chapter 5. ConclusionAnother direction to explore would be to study multiple cold-start usersat the same time. In this work, we assumed that each cold-start user wasindependent. Different noise, ratings and covariance models can also beexplored.Lastly, the OID problem can be studied in an interactive setting, wherethe response from the user on one item is used to determine the next. Itcould be combined with an implicit rating model, where the user does notneed to explicitly rate the items served to her, but indicates her preferencethrough actions such as clicking on the item, making it a favorite, or buyingit.47Bibliography[1] Jacob Abernethy, Kevin Canini, John Langford, and Alex Simma. On-line collaborative filtering. University of California at Berkeley, Tech.Rep, 2007.[2] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next gen-eration of recommender systems: A survey of the state-of-the-art andpossible extensions. Knowledge and Data Engineering, IEEE Transac-tions on, 17(6):734–749, 2005.[3] Hyung Jun Ahn. A new similarity measure for collaborative filteringto alleviate the new user cold-starting problem. Information Sciences,178(1):37–51, 2008.[4] Oren Anava, Shahar Golan, Nadav Golbandi, Zohar Karnin, RonnyLempel, Oleg Rokhlenko, and Oren Somekh. Budget-constrained itemcold-start handling in collaborative filtering recommenders via optimaldesign. In Proceedings of the 24th International Conference on WorldWide Web, pages 45–54. International World Wide Web ConferencesSteering Committee, 2015.[5] Senjuti Basu Roy, Laks VS Lakshmanan, and Rui Liu. From grouprecommendations to group formation. In Proceedings of the 2015 ACMSIGMOD international conference on management of data, pages 1603–1616. ACM, 2015.[6] Robert M Bell and Yehuda Koren. Scalable collaborative filtering withjointly derived neighborhood interpolation weights. In Seventh IEEEInternational Conference on Data Mining (ICDM 2007), pages 43–52.IEEE, 2007.[7] James Bennett and Stan Lanning. The netflix prize. In Proceedings ofKDD cup and workshop, volume 2007, page 35, 2007.48Bibliography[8] Jesu´S Bobadilla, Fernando Ortega, Antonio Hernando, and Jesu´SBernal. A collaborative filtering approach to mitigate the new usercold start problem. Knowledge-Based Systems, 26:225–238, 2012.[9] Christos Boutsidis, Edo Liberty, and Maxim Sviridenko. Greedyminimization of weakly supermodular set functions. arXiv preprintarXiv:1502.06528, 2015.[10] John S Breese, David Heckerman, and Carl Kadie. Empirical analysisof predictive algorithms for collaborative filtering. In Proceedings of theFourteenth conference on Uncertainty in artificial intelligence, pages43–52. Morgan Kaufmann Publishers Inc., 1998.[11] Guy Bresler, George H Chen, and Devavrat Shah. A latent source modelfor online collaborative filtering. In Advances in Neural InformationProcessing Systems, pages 3347–3355, 2014.[12] Ste´phane Caron and Smriti Bhagat. Mixing bandits: A recipe for im-proved cold-start recommendations in a social network. In Proceedingsof the 7th Workshop on Social Network Mining and Analysis, page 11.ACM, 2013.[13] Michael D Ekstrand, John T Riedl, and Joseph A Konstan. Collabora-tive Filtering Recommender Systems. Now Publishers Inc, 2011.[14] Michael R. Garey and David S. Johnson. Computers and Intractability:A Guide to the Theory of NP-Completeness. W. H. Freeman & Co.,New York, NY, USA, 1979.[15] Thomas George and Srujana Merugu. A scalable collaborative filteringframework based on co-clustering. In Fifth IEEE International Confer-ence on Data Mining (ICDM’05), pages 4–pp. IEEE, 2005.[16] Nadav Golbandi, Yehuda Koren, and Ronny Lempel. Adaptive boot-strapping of recommender systems using decision trees. In Proceedingsof the Fourth ACM International Conference on Web Search and DataMining, WSDM ’11, pages 595–604, New York, NY, USA, 2011. ACM.[17] Amit Goyal and Laks VS Lakshmanan. Recmax: Exploiting recom-mender systems for fun and profit. In Proceedings of the 18th ACMSIGKDD international conference on Knowledge discovery and datamining, pages 1294–1302. ACM, 2012.49Bibliography[18] Mikael Hammar, Robin Karlsson, and Bengt J Nilsson. Using max-imum coverage to optimize recommendation systems in e-commerce.In Proceedings of the 7th ACM conference on Recommender systems,pages 265–272. ACM, 2013.[19] F Maxwell Harper and Joseph A Konstan. The movielens datasets: His-tory and context. ACM Transactions on Interactive Intelligent Systems(TiiS), 5(4):19, 2016.[20] Jonathan L Herlocker, Joseph A Konstan, Loren G Terveen, and John TRiedl. Evaluating collaborative filtering recommender systems. ACMTransactions on Information Systems (TOIS), 22(1):5–53, 2004.[21] Dorit S Hochba. Approximation algorithms for np-hard problems. ACMSIGACT News, 28(2):40–52, 1997.[22] Yanxiang Huang, Bin Cui, Jie Jiang, Kunqian Hong, Wenyu Zhang,and Yiran Xie. Real-time video recommendation exploration. In Pro-ceedings of the 2016 International Conference on Management of Data,SIGMOD ’16, pages 35–46. ACM, 2016.[23] Rasoul Karimi, Alexandros Nanopoulos, and Lars Schmidt-Thieme.A supervised active learning framework for recommender systemsbased on decision trees. User Modeling and User-Adapted Interaction,25(1):39–64, 2015.[24] Samir Khuller, Anna Moss, and Joseph Seffi Naor. The budgeted max-imum coverage problem. Information Processing Letters, 70(1):39–45,1999.[25] Heung-Nam Kim, Ae-Ttie Ji, Inay Ha, and Geun-Sik Jo. Collabora-tive filtering based on collaborative tagging for enhancing the qualityof recommendation. Electronic Commerce Research and Applications,9(1):73–83, 2010.[26] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorizationtechniques for recommender systems. Computer, 42(8):30–37, 2009.[27] Xuan Nhat Lam, Thuc Vu, Trong Duc Le, and Anh Duc Duong. Ad-dressing cold-start problem in recommendation systems. In Proceedingsof the 2nd international conference on Ubiquitous information manage-ment and communication, pages 208–211. ACM, 2008.50Bibliography[28] Blerina Lika, Kostas Kolomvatsos, and Stathes Hadjiefthymiades. Fac-ing the cold start problem in recommender systems. Expert Systemswith Applications, 41(4):2065–2073, 2014.[29] Jovian Lin, Kazunari Sugiyama, Min-Yen Kan, and Tat-Seng Chua.Addressing cold-start in app recommendation: latent user models con-structed from twitter followers. In Proceedings of the 36th internationalACM SIGIR conference on Research and development in informationretrieval, pages 283–292. ACM, 2013.[30] Je´re´mie Mary, Romaric Gaudel, and Philippe Preux. Bandits warm-upcold recommender systems. CoRR, abs/1407.2806, 2014.[31] Paolo Massa and Paolo Avesani. Trust-aware recommender systems.In Proceedings of the 2007 ACM conference on Recommender systems,pages 17–24. ACM, 2007.[32] Paolo Massa and Bobby Bhattacharjee. Using trust in recommendersystems: an experimental analysis. In Trust Management, pages 221–235. Springer, 2004.[33] Michel Minoux. Accelerated greedy algorithms for maximizing sub-modular set functions. In Optimization Techniques, pages 234–243.Springer, 1978.[34] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. Ananalysis of approximations for maximizing submodular set functionsi.Mathematical Programming, 14(1):265–294, 1978.[35] Seung-Taek Park and Wei Chu. Pairwise preference regression for cold-start recommendation. In Proceedings of the third ACM conference onRecommender systems, pages 21–28. ACM, 2009.[36] Al Mamunur Rashid, George Karypis, and John Riedl. Learning pref-erences of new users in recommender systems: an information theoreticapproach. ACM SIGKDD Explorations Newsletter, 10(2):90–100, 2008.[37] Steffen Rendle and Lars Schmidt-Thieme. Online-updating regularizedkernel matrix factorization models for large-scale recommender systems.In Proceedings of the 2008 ACM conference on Recommender systems,pages 251–258. ACM, 2008.51Bibliography[38] Matthew Richardson, Rakesh Agrawal, and Pedro Domingos. Trustmanagement for the semantic web. In International semantic Web con-ference, pages 351–368. Springer, 2003.[39] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factor-ization. Citeseer, 2011.[40] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceed-ings of the 10th international conference on World Wide Web, pages285–295. ACM, 2001.[41] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. In-cremental singular value decomposition algorithms for highly scalablerecommender systems. In Fifth International Conference on Computerand Information Science, pages 27–28. Citeseer, 2002.[42] Andrew I Schein, Alexandrin Popescul, Lyle H Ungar, and David MPennock. Methods and metrics for cold-start recommendations. InProceedings of the 25th annual international ACM SIGIR conferenceon Research and development in information retrieval, pages 253–260.ACM, 2002.[43] Upendra Shardanand and Pattie Maes. Social information filtering: al-gorithms for automating word of mouth. In Proceedings of the SIGCHIconference on Human factors in computing systems, pages 210–217.ACM Press/Addison-Wesley Publishing Co., 1995.[44] Linqi Song, Cem Tekin, and Mihaela van der Schaar. Clustering basedonline learning in recommender systems: a bandit approach. In Acous-tics, Speech and Signal Processing (ICASSP), 2014 IEEE InternationalConference on, pages 4528–4532. IEEE, 2014.[45] Stephen A Vavasis. On the complexity of nonnegative matrix factor-ization. SIAM Journal on Optimization, 20(3):1364–1377, 2009.[46] Gui-Rong Xue, Chenxi Lin, Qiang Yang, WenSi Xi, Hua-Jun Zeng,Yong Yu, and Zheng Chen. Scalable collaborative filtering using cluster-based smoothing. In Proceedings of the 28th annual international ACMSIGIR conference on Research and development in information re-trieval, pages 114–121. ACM, 2005.52Bibliography[47] Mi Zhang, Jie Tang, Xuchen Zhang, and Xiangyang Xue. Address-ing cold start in recommender systems: A semi-supervised co-trainingalgorithm. In Proceedings of the 37th international ACM SIGIR confer-ence on Research & development in information retrieval, pages 73–82.ACM, 2014.[48] Tong Zhang. Solving large scale linear prediction problems usingstochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, page 116. ACM,2004.[49] Xiaoxue Zhao, Weinan Zhang, and Jun Wang. Interactive collabora-tive filtering. In Proceedings of the 22nd ACM international conferenceon Conference on information & knowledge management, pages 1411–1420. ACM, 2013.[50] Ke Zhou, Shuang-Hong Yang, and Hongyuan Zha. Functional matrixfactorizations for cold-start recommendation. In Proceedings of the 34thInternational ACM SIGIR Conference on Research and Development inInformation Retrieval, SIGIR ’11, pages 315–324, New York, NY, USA,2011. ACM.53
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Combating the cold-start user problem in collaborative...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Combating the cold-start user problem in collaborative filtering recommender systems Biswas, Sampoorna 2016
pdf
Page Metadata
Item Metadata
Title | Combating the cold-start user problem in collaborative filtering recommender systems |
Creator |
Biswas, Sampoorna |
Publisher | University of British Columbia |
Date Issued | 2016 |
Description | For tackling the well known cold-start user problem in collaborative filtering recommender systems, one approach is to recommend a few items to a cold-start user and use the feedback to learn her preferences. This can then be used to make good recommendations to the cold user. In the absence of a good initial estimate of the preferences, the recommendations are like random probes. If the items are not chosen judiciously, both bad recommendations and too many recommendations may turn off a user. We study the cold-start user problem for the two main types of collaborative filtering methods -- neighbourhood-based and matrix factorization. We formalize the problem by asking what are the b (typically a small number) items we should recommend to a cold-start user, in order to learn her preferences best, and define what it means to do that under each framework. We cast the problem as a discrete optimization problem, called the optimal interview design (OID) problem, and study two variants -- OID-NB and OID-MF -- for the two frameworks. We present multiple non-trivial results, including NP-hardness as well as hardness of approximation for both. We further study supermodularity/submodularity and monotonicity properties for the objective functions of the two variants. Finally, we propose efficient algorithms and comprehensively evaluate them. For OID-NB, we propose a greedy algorithm, and experimentally evaluate it on 2 real datasets, where it outperforms all the baselines. For OID-MF, we discuss several scalable heuristic approaches for identifying the b best items to recommend to the user and experimentally evaluate their performance on 4 real datasets. Our experiments show that our proposed accelerated algorithms significantly outperform the prior art, while obtaining similar or lower error in the learned user profile as well as in the rating predictions made, on all the datasets tested. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2018-01-31 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0340630 |
URI | http://hdl.handle.net/2429/60254 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2017-02 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2017_february_biswas_sampoorna.pdf [ 692.01kB ]
- Metadata
- JSON: 24-1.0340630.json
- JSON-LD: 24-1.0340630-ld.json
- RDF/XML (Pretty): 24-1.0340630-rdf.xml
- RDF/JSON: 24-1.0340630-rdf.json
- Turtle: 24-1.0340630-turtle.txt
- N-Triples: 24-1.0340630-rdf-ntriples.txt
- Original Record: 24-1.0340630-source.json
- Full Text
- 24-1.0340630-fulltext.txt
- Citation
- 24-1.0340630.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0340630/manifest