Flexible and Efficient Exploration of Rated Datasets by Naresh Kumar Kolloju B.Sc.,Indian Institute of Technology, Bombay, 2009 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Computer Science) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) March 2013 c Naresh Kumar Kolloju 2013 Abstract As users increasingly rely on collaborative rating sites to achieve mundane tasks such as purchasing a product or renting a movie, they are facing the data deluge of ratings and reviews. Traditionally, the exploration of rated data sets has been enabled by rating averages that allow user-centric, itemcentric and top-k exploration. More specifically, canned queries on user demographics aggregate opinion for an item or a collection of items such as 18-29 year old males in CA rated the movie The Social Network at 8.2 on average. Combining ratings, demographics, and item attributes is a powerful exploration mechanism that allows operations such as comparing the opinion of the same users for two items, comparing two groups of users on their opinion for a given class of items, and finding a group whose rating distribution is nearly unanimous for an item. To enable those operations, it is necessary to (i) adopt the right measure to compare ratings, and to (ii) develop efficient algorithms to find relevant <user,item,rating> groups. We argue that rating average is a weak measure for capturing such comparisons. We propose contrasting and querying rating distributions, instead, using the Earth Mover’s Distance (EMD), a measure that intuitively reflects the amount of work needed to transform one distribution into another. We show that the problem of finding groups matching given rating distributions is NP-hard under different settings and develop appropriate heuristics. Our experiments on real and synthetic datasets validate the utility of our approach and demonstrate the scalability of our algorithms. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . vi 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Goals and Challenges . . . . . . . . . . . . . . . . . . . . . . 1 1 3 2 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Groups and Rating Distributions . . . . . . . . . . . . . . . . 6 6 3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 Exploration Problem . . . . . . . . . . . . . . . . . . . . . . 8 8 4 Rating Comparison Measures . . . . . . . . . . . . . . . . . . 4.1 EMD Vs. Other Measures . . . . . . . . . . . . . . . . . . . 10 10 5 Algorithms . . . . . . . . . . . . . . . . . . . . . 5.1 Exploration Algorithms . . . . . . . . . . . . 5.1.1 Complexity of Problems . . . . . . . 5.1.2 Group Discovery Algorithms . . . . . 5.2 EMD Algorithms . . . . . . . . . . . . . . . 5.2.1 Computing EMD of Two Distributions 5.2.2 EMD for Multiple Distributions . . . 13 13 13 16 23 23 27 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Table of Contents 6 Experiments . . . . . . . . . . . . . . . . . . . . . . 6.1 Experimental Setup . . . . . . . . . . . . . . . . 6.2 Offline Quality Evaluation . . . . . . . . . . . . 6.2.1 Purity, Rand-Index and F-Measure . . . 6.2.2 Coverage, description, diversity and EMD 6.3 Scalability Evaluation . . . . . . . . . . . . . . . . . . . . . 29 29 30 30 32 34 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 40 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Appendix A Algorithm Optimality . A.1 Notations/Definitions A.2 Lemmas & Facts . . . A.3 Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 43 44 46 iv List of Tables 1.1 An Example of Simpson’s Paradox . . . . . . . . . . . . . . . 3 4.1 Distance Measures . . . . . . . . . . . . . . . . . . . . . . . . 11 5.1 Instance of Maximum Coverage obtained by reduction from 3SET PACKING. Blanks represent distinct constants appearing nowhere else. . . . . . . . . . . . . . . . . . . . . . . . . . Running Example of EMD Computation. . . . . . . . . . . . 14 26 5.2 6.1 6.2 6.3 6.4 Summary of datasets . . . . . . . . . . . . . . . . . . . . . . . Rating distributions of various groups in synthetic data . . . Performance of various algorithms measured using Purity, RandIndex, F-Measure . . . . . . . . . . . . . . . . . . . . . . Table describing how description length varies with input size on ML dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 30 30 31 34 v List of Figures 1.1 Figure showing (a) Demographics Breakdown of Ratings on IMDb (b) Movie Rating Distributions . . . . . . . . . . . . . 2 4.1 Examples of Rating Distributions . . . . . . . . . . . . . . . . 11 5.1 Figure shows how to push a node conditioned on attribute A0 towards the leaf. . . . . . . . . . . . . . . . . . . . . . . . . . DTAlg (a) Hierarchy of Attribute Age (b) Binary Split (c) Categorical Split . . . . . . . . . . . . . . . . . . . . . . . . . Figure showing flows for example in Table 5.2 . . . . . . . . . 5.2 5.3 6.1 6.2 6.3 6.4 Figure showing coverage for two types of split Categorical and Binary on two datasets . . . . . . . . . . . . . . . . . . . . . . Figure showing plots of coverage and diversity in description for movie Lens and Book Crossing. (a)Coverage Vs. Input Size (b)Coverage Vs. Variance (c)Diversity Vs. Input Size(Plot legend is same as in (a)) . . . . . . . . . . . . . . . Figure showing plots of coverage for movie Lens and Book Crossing. (a)Coverage Vs. EMD Threshold (b)Coverage Vs. No. of Trees (Plot legend is same as in (a)) . . . . . . . . . . Figure showing plots of various experiments on both the datasets: (a) Time Vs. Input size (b)Avg. EMD Vs. No. of Trees (Plot legend is same as in (a)) (c) EMD-Naive vs EMD-Prune . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1 Figure shows how to modify optimal flow F opt to obtain new flow F opt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 19 27 32 35 36 37 44 vi Acknowledgments I would like to thank my supervisor, Professor Laks V S Lakshmanan, for his great help and guidance through out this thesis. I am indebted to him for his patient guidance, insightful comments on my research and his support during the past few years. I am also grateful to Ruben H Zamar (Professor at UBC statistics Department) and Sihem Amer Yahia (Director of research at CNRS), who worked with me throught out this thesis. Most ideas of this thesis came from the discussion among Laks, Sihem, Ruben and me. I feel lucky to have worked under guidance of these people. I owe thanks to all members of the Database Management and Mining lab, and also Holly Kwan, Joyce Poon for making my study at UBC fun and memorable. vii Chapter 1 Introduction 1.1 Motivation In collaborative rating systems such as Flixster, Epinions, and Yelp, users, in their role as producers, rate items they have experienced. These ratings form the basis for users, in their role as consumers, to explore items in order to find what they may be interested in, or to receive recommendations. In this paper, we focus on the kind of support a system needs to provide in order to make the exploration of rated datasets effective and useful. We argue that the ability to compare rating distributions of groups of user,item,rating tuples is central to such exploration, as it enables choosing between different rater populations, contrasting items with respect to the opinion of their raters, finding groups of user,item,rating tuples that nearly agree on their ratings for items in the group, etc. To achieve that, we develop a framework to compare and find such groups given conditions on users, items and ratings. In the physical world, asking one’s friends or acquaintances to compare sets of items, e.g., French and American comedies, Canon and Nikon cameras, is common practice. The same could be said of comparing the opinion of two users, Mary and John, or user sets, e.g., males and females, on a specific item or a class of items such as moderately priced Chinese restaurants in the Bay Area. Online, the most common way to do that is by contrasting rating averages. Averages are computed on user,item,rating groups that include one or multiple items, e.g., French Comedies, and where the user dimension represents all or some users, such as movie critics or 18-29 year old’s. Figure 1.1(a) illustrates an example of a demographics breakdown on IMDb for the movie The Social Network. The example shows a global average as well as averages for canned user,item,rating groups where item refers to a single movie. Several studies have established that users in different demographics, i.e., different rater populations, may have different preferences for the same item [6]. Table 1.1 shows average ratings for two different movies by two rater populations – those living in NYC and those living in Nashville. One can easily see that average rating for a given population does not necessarily 1 1.1. Motivation (a) 1 1 0.5 Average: 1.4 Population: Hawaii Average: 3.00 Population: All 0.5 0 0 1 2 3 4 5 1 2 3 4 5 b) The Blair Witch Project (1999) a) The Blair Witch Project (1999) 1 1 Population: All Average: 4.38 Population: Portland 0.5 Average: 4.3 0.5 0 0 1 2 3 4 5 1 2 3 4 5 d) American Beauty (1999) c) The Blair Witch Project (1999) 1 1 Population: "Boston & Middle−Age" Population: Portland Average: 3.17 0.5 Average: 4.8 0.5 0 0 1 2 3 4 5 e) American Beauty (1999) 1 2 3 4 5 f) American Beauty (1999) (b) Figure 1.1: Figure showing (a) Demographics Breakdown of Ratings on IMDb (b) Movie Rating Distributions reflect those of populations it contains. Indeed, while the first movie is preferred in both NYC and Nashville, its average rating overall is lower than the second movie’s. This is also known as the Simpson’s Paradox 1 . Such user groups are not always known in advance and need to be discovered. In particular, given the large number of groups that could be derived by sub-setting items and users in an input rated dataset user,item,rating , a strategy is needed to determine the most relevant of such groups according to a query semantics. In this paper, we explore using rating distributions as input and finding groups that compare favorably to input distributions (referred to as query distributions). 1 http://en.wikipedia.org/wiki/Simpson’s paradox 2 1.2. Goals and Challenges Group New York Nashville Total Movie1 10 = 1000 10 3.5 = 3150 900 4150 4.15 = 1000 Movie2 8 = 7200 900 1 = 100 100 7300 7.3 = 1000 Winner Movie1 Movie1 Movie2 Table 1.1: An Example of Simpson’s Paradox To illustrate the benefit of comparing rating distributions directly instead of relying on their averages, we use real examples from MovieLens [14] and focus on specific groups of ratings from users in Hawa¨ı, Boston and Portland for the movies American Beauty (AB) and The Blair Witch Project (BWP) (see Figure 1.1(b)). While AB was globally preferred to BWP, the average rating of people in Portland for BWP is higher than that of middle-aged users in Boston for AB. Moreover, while the overall population is neutral about BWP, giving it an average of 3, people in Hawaii hated that movie while people in Portland clearly liked it. Finally, rating averages of middle-aged users in Boston for AB and of all users for BWP are very similar (resp., 3.17 and 3), but their rating distributions are quite different. In the case of AB, users are polarized and in the other case, ratings are relatively uniform, i.e., random. Clearly, average ratings are too coarse to reveal interesting patterns and rating distributions enable finer comparisons of user,item,rating groups. 1.2 Goals and Challenges Our thesis is that the exploration of rated datasets is made effective by the ability to identify user,item,rating groups that satisfy conditions on users, items and ratings. While conditioning users and items is straightforward and can be done with regular relational operators, conditions on ratings are enabled with the ability to compare rating distributions. For example, a user exploring a rated dataset may be interested in finding movies among new releases for which the ratings are (almost) unanimously very high. Or she may be interested in movies for which the ratings among teens, such as herself, are polarized. In these examples, an input rating distribution is used as a query in order to find user,item,rating groups that exhibit similar distributions. In each case, users could represent the entire rater population or a subset (teens) and the group could contain one or a class of items. Enabling those operations raises two important challenges. The first challenge is choosing the right measure for gauging the proximity of rating distributions. Many distance measures for distributions have been used in previous work in various contexts. These include KL-divergence 3 1.2. Goals and Challenges and cosine distance. We show that they suffer from a fundamental problem: they do not discriminate pairs of distributions which are intuitively close to each other from those that aren’t. We propose to use Earth Mover’s Distance (EMD) [16], a measure that intuitively captures the minimum amount of work required to transform one distribution into another. The closer the distributions the lower their EMD penalty. Our second challenge is computational: Given a set of query distributions, our problem is to find a (potentially partial) partition of the rated dataset into reasonably sized user,item,rating groups such that each group in the partition enjoys a rating distribution that has a low EMD with respect to some query distribution. Such groups need to be dynamically discovered from a potentially exponential search space and have to be describable: “Middle aged Minnesotans in the construction business”, “Horror movies from the 90’s” or “Comedies rated by teenagers” are examples of describable groups. We show that the problem of finding such groups is NP-hard under two different settings and propose to adapt decision trees to work with EMD and split an input rated dataset along user and item attributes. It has been found in the area of classification that the approach of random forests, pioneered by Breiman [2], essentially an ensemble classifier, has a much better performance than any one decision tree. We exploit this intuition using EMD-based decision trees and explore different strategies for combining multiple EMD-based decision trees. We make the following contributions. • We formalize the exploration of a rated dataset as the problem of finding a (potentially partial) partition of the dataset into reasonably sized user,item,rating groups such that each group enjoys a rating distribution that has a low EMD with respect to some query distribution (Section 2). • We show that well-known rating aggregation and comparison measures are not well-adapted for comparing rating distributions and propose using Earth Mover’s Distance (EMD) for this purpose (Section 4.1). • We show that the problem of finding user,item,rating groups that “match” query distributions is NP-hard under two different settings and develop appropriate heuristics. The first heuristic is based on adapting decision trees. The second one is based on the random forest approach. Our decision tree construction algorithm makes use of an elegant linear time algorithm for computing EMD along with clever 4 1.2. Goals and Challenges pruning techniques inspired by the well-known NRA algorithm in topk query processing. Our random forest approach explores different strategies for combining decision trees (Section 5.1). • We run comprehensive experiments on two real data sets and a synthetic data set and demonstrate the effectiveness of querying by rating distributions using two kinds of offline quality evaluation. We also report the scalability of several competing algorithms (Section 6). Our results show that in general, random forest approaches are superior and examine in detail when decision tree algorithms perform better. Related work is discussed in Section 7. We conclude the paper with future directions in Section 8.1. 5 Chapter 2 Data Model A rated dataset consists of a set of users with schema SU , a set of items with schema SI and a set of rating actions with schema SR . E.g., we could have SU = uid, age, gender,state, city and a user might be represented as u1 , 18 , student, new york , nyc . Similarly, restaurants on Yelp2 , can be described with SI = item id, cuisine, attire, ambiance, price , and an example restaurant represented as i2 , Thai , Formal , Romantic, Moderate . Finally, a rating action can be represented using the schema SR = uid, item id, rating , where the domain of attribute rating depends on the application and the dataset. E.g., in MovieLens, it is {1, ..., 5} and in BookCrossing, {1, ..., 10}. An instance in our data model consists of relations U, I, and R of users, items, and rating actions over their respective schema. 2.1 Groups and Rating Distributions Groups. A rated dataset effectively contains tuples of the form user,item,rating . Collaborative rating sites typically contain millions of such tuples. We adopt the approach proposed and experimentally validated in [6], where sets of rating tuples of the form user,item,rating , are viewed according to groups that are structurally describable using a conjunction of predicates on user and item attributes. E.g., a group may represent a set of rating tuples on Thai restaurants by young males in NY. We let g denote a group, g.idesc the set of item predicates, in this case {cuisine= Thai}, and g.udesc the set of user predicates, in this case {gender= male & age ≤ 35 & state=new york}, that are associated with g. In what follows, we use the term describable groups to mean groups which are describable using a conjunction of predicates of the form Attr op val where Attr is a user or item attribute, val is a domain value, and op is one of =, <, ≤, >, ≥.3 For categorical attributes, we only allow the predicate 2 http://www.yelp.com We found descriptions involving = are not very intuitive unless they are combined with one of the other predicates above, so we disallow them w.l.o.g. 3 6 2.1. Groups and Rating Distributions Attr = val. We abuse the notation and write u ∈ g, (resp., i ∈ g), to mean that user u, (resp., item i), satisfies all the user, (resp., item), predicates in g.udesc, (resp., g.idesc). Rating Distributions. The set of all (describable) groups that contributed ratings in a rated set R is denoted GR . Given a group g ∈ GR , we define ratings(g, R) = { u, i, r ∈ R | u ∈ g & i ∈ g} as the set of rating actions of all users in g on items in g, in the rated set R. The rating distribution of g in R is defined as dist(g, R) = [1 : w1 . . . m : wm ] where the ∈ratings(g,R)|r=j}| rating scale is {1, ..., m} and wj = |{ u,i,r |ratings(g,R)| is the fraction of ratings with value j in ratings(g, R). When it causes no confusion, we blur the distinction between g and ratings(g, R) and speak of the tuples in g or the size |g| of g. Figure 1.1(b) contains many example rating distributions, including those of Portland users for two movies: [0 0 0.1 0.4 0.5] for The Blair Witch Project (BWP) and [0 0 0 0.2 0.8] for American Beauty (AB). Query Distributions. Users might be interested in finding groups whose distributions are similar to query distributions of interest to them. Some example query distributions include unanimous distributions U1 , ..., Um where Ui denotes the distribution where the mass is concentrated at rating value i: Ui (i) = 1 and Ui (j) = 0, j = i; moderately unanimous distributions Ui,i+1 , where the mass is concentrated on successive rating values i, i + 1: e.g., Ui,i+1 (i) = Ui,i+1 (i + 1) = 0.5 and Ui,i+1 (j) = 0, j = i, i + 1; and polarized distribution P1,m where mass is concentrated on the extreme ratings 1 and m: e.g., P1,m (1) = P1,m (m) = 0.5 and P1,m (j) = 0, j = 1, m. Indeed, the user may use any distribution(s) as query distributions. Example 1 (Querying by Distribution) Consider the set S of ratings of movie BWP by all users and let the query Q = {ρ1 , ρ2 } contain the distributions ρ1 = [0, 0, 0, 0.4, 0.6] (high ratings) and ρ2 = [0.5, 0, 0, 0, 0.5] (polarized). In this case, we are interested in finding user,item,rating groups in S whose rating distributions on BWP are close to a distribution in Q, be it high or polarized. As a second example, Q may be the rating distribution of Portland users on BWP. In this case, we are asking for groups whose rating distribution on BWP is close to that of Portland raters. In a last example, the user, a New Yorker, may have first explored ratings of New Yorkers on Avatar and on AB. Treating these two distributions as query distributions, she might want to find groups whose rating distribution on BWP closely resemble any of the query distributions. 7 Chapter 3 Problem Definition 3.1 Exploration Problem In this paper, we are interested in developing a framework for helping users explore a rated dataset. At the core of this is the ability to compare rating distributions. Our framework aims to cover different scenario where two rating distributions may represent the opinion of the same users on different items or the opinions of different users on the same or different items. Also, the end user may wish to explore ratings at the granularity of individual items (e.g., a specific digital camera) or sets of items (e.g., the class of Nikon SLR digital cameras). Similarly, she may wish to compare the opinions of two specific users (her friends John and Mary) or two sets of users (teens in NY vs. teens in CA). In its general form, a user query contains conditions on items, conditions on users and a set of rating distributions, referred to as query distributions, that represent the desired rating distributions of returned user,item,rating groups. Given the input rated dataset R, it is easy to see that the application of conditions on items and users is straightforward as it amounts to finding the subset S ⊆ R where each tuple user,item,rating satisfies input item and user conditions. Hence we focus on conditions on rating distributions. For now, let us assume a generic function ratComp that compares two rating distributions ρ1 , ρ2 and returns a score to reflect how far apart ρ1 and ρ2 are. We will explore the choices for ratComp in Section 4.1. The idea is that given S, we want to discover user,item,rating groups whose rating distribution on the item(s) in the group is very close to one of the given query distributions. The resulting groups must be describable, reasonably sized and together cover as many of the input tuples user,item,rating ∈ S as possible. Indeed, splitting S into describable groups whose ratings are close to query distributions does not guarantee that all tuples in S will be in the resulting groups. Also, shorter group descriptions are preferred as they are meant for enduser consumption. Hence, there are two ways of formulating our problem. Since we can only optimize one objective, we can either maximize coverage 8 3.1. Exploration Problem of tuples in S or minimize the description length of resulting groups. We call π = [g1 , ..., gm ] a partial partition of S if gi are pairwise disjoint and i gi ⊆ S. Problem Statement: Maximum coverage. Given a rated dataset S ⊆ R, a group size threshold b, a set of query distributions Q = {ρ1 , ..., ρk }, and a rating proximity threshold θ, find a (possibly partial) partition π = [g1 , ..., gm ] of S, such that (i) each gi is a describable group, (ii) |gi | ≥ b, ∀i ∈ [1, m], (iii) ∀i ∈ [1, m], ratComp(dist(gi , Si ), ρj ) ≤ θ, for some j ∈ [1, k], | i gi | and (iv) |S| is maximized. The first condition states that each block in the partition is describable with at least one user attribute in gi .udesc or one item attribute in gi .idesc. The second condition enforces that each group should not be too small, i.e., must contain at least b tuples. The third condition states that the rating distribution of each group gi should be at a distance no more than θ to one of the query distributions ρj . The last condition says the fraction of input tuples S that are covered by π should be maximized. Problem Statement: Minimum description length. Given a rated dataset S ⊆ R, a rating proximity threshold θ, and a set of query distributions Q = {ρ1 , ..., ρk }, find a (possibly partial) partition π = [g1 , ..., gm ] of S, such that (i) each gi is a describable group, (ii) ∀i ∈ [1, m], ratComp(dist(gi , Si ), ρj ) ≤ θ, for some j ∈ [1, k], and (iii) i∈[1,m] |gi .udesc∪ gi .idesc| is minimized. The main difference with the first problem is that instead of maximizing coverage, we want to minimize the total description lengths of the groups returned. Since intuitively, shorter descriptions contain more tuples, we don’t constrain the group size. In the rest of the paper, by partition, we mean a possibly partial partition. 9 Chapter 4 Rating Comparison Measures 4.1 EMD Vs. Other Measures A key ingredient in the problem we study is the choice of the function ratComp that quantifies the proximity between two distributions. We first briefly review some obvious candidate choices for this function. Recall, the rating distributions we consider are normalized so they are probability distributions. KL-divergence [11] is a well-known measure used for gauging the proximity between probability distributions ρ1 and ρ2 . It is j j j defined as DKL (ρ1 ||ρ2 ) = j ρ1 log(ρ1 /ρ2 ). Another candidate is a symmetric version of KL-divergence, called JS-divergence [11], and defined as DJS (ρ1 , ρ2 ) = 1/2(DKL (ρ1 ||ρ3 ) + DKL (ρ2 ||ρ3 )), where ρ3 = 1/2(ρ1 + ρ2 ). Or we could interpret rating distributions as vectors (over rating values as dimensions) and use cosine distance or Euclidean distance. Unfortunately, all these measures have limitations which make them a poor choice, as we next illustrate with an example. Example 2 (Rating Comparison Example) Consider three rating distributions ρ1 = [0.9, 0.025, 0.025, 0.025, 0.025], ρ2 = [0.025, 0.9, 0.025, 0.025, 0.025], and ρ3 = [0.025, 0.025, 0.025, 0.025, 0.9], illustrated in Figure 4.1. Say they correspond to ratings by the same set of users on three different digital cameras c1 , c2 , c3 . Intuitively, distributions ρ1 and ρ2 are more in agreement with each other than ρ1 and ρ3 : the users have very similar opinions about cameras c1 and c2 and very different opinions about c1 and c3 . Table 4.1 shows the considered distance scores for the two pairs of distributions. Notably, only EMD (described below) distinguishes between the two pairs. Earth Mover’s Distance. The Earth Mover’s Distance (EMD) is a widely used distance measure for distributions [16]. Intuitively, the EMD between two distributions is the minimum amount of work done per unit 10 4.1. EMD Vs. Other Measures Prob. 1 g1.dist 0.5 0 1 2 3 Prob. 1 4 5 g2.dist 0.5 0 1 2 3 1 4 5 Prob. g3.dist 0.5 0 1 2 3 4 5 Figure 4.1: Examples of Rating Distributions Measure Cosine KL − Divergence JS − Divergence Euclidean EMD (ρ1 , ρ2 ) 0.058 3.13 0.53 1.24 1 (ρ1 , ρ3 ) 0.058 3.13 0.53 1.24 4 Table 4.1: Distance Measures mass in converting one distribution to another, where the distributions are viewed as piles of earth at various positions. If D is a discrete domain, EMD is computed based on a solution to the well-known transportation problem [16]. Let ρ1 = [1 : p1 , .., i : pi , .., n : pn ] , ρ2 = [1 : q1 , .., i : qi , .., n : qn ] represent two probability distributions over a discrete domain D = {1, 2, 3, ..., n}. The amount of work required to convert ρ1 to ρ2 is defined as: n n min Work(ρ1 , ρ2 , F ) = F dij fij i=1 j=1 subject to the constraints: fij ≥ 0 1 ≤ i, j ≤ n; nj=1 fij = pi 1 ≤ i ≤ n; n and i=1 fij = qj 1 ≤ j ≤ n, where fij is the amount of mass moved from position i to j in the process of converting ρ1 to ρ2 . F = [fij ] is the matrix representing the flows and dij is the ground distance from position i to j, which, for our setting, is defined as the absolute difference in positions, |i − j|. A flow F is optimal if the work done is minimum in that flow. Therefore, the EMD is defined as: EMD(ρ1 , ρ2 ) = minF Work(ρ1 , ρ2 , F ) n n i=1 j=1 fij 11 4.1. EMD Vs. Other Measures EMD is work done per unit mass in an optimal flow. In our setting, the region D over which EMD is calculated is always the whole domain of the distribution, so the value of the denominator in the above equation is 1. So we can ignore the denominator and speak of EMD as the work done itself. The EMD definition clearly captures fine-grained differences between two rating distributions in the form of work done. This makes EMD particularly suitable for comparing rating distributions. 12 Chapter 5 Algorithms 5.1 Exploration Algorithms Our problems, defined in Section 3.1, require to develop efficient algorithms for dynamically finding and comparing relevant groups. In the next subsection, we outline our approach and discuss the inherent complexity of the problems. In the following subsections, we discuss our algorithms. 5.1.1 Complexity of Problems Given a rated dataset S ⊆ R, our first problem, Maximum Coverage, involves finding a partition of S whose groups are describable, are large enough, between them cover the maximum number of tuples of S and are close enough to one of the given query distributions. Let g be a group obtained from S and Q a set of query distributions. We abuse the notation and write EMD(g, Q) to mean the EMD between g and its closest query distribution in Q. We next show that Maximum Coverage is NP-hard. Theorem 1 The Maximum Coverage problem is NP-hard. Proof: We show the decision version of the problem to be NP-hard by reduction from 3-SET PACKING, defined as follows. Given an instance I consisting of a family {S1 , ..., Sn } of sets containing exactly 3 elements and a number k, the question is whether I contains k pairwise disjoint sets. This is an NP-complete problem [8]. Given an instance I of this problem, we create an instance J of Maximum Coverage as follows. J consists of a single rating relation, corresponding to ratings on one item. Let i Si = {x1 , ..., xm }. Then J contains n categorical user attributes Ai corresponding to Si and 2m tuples corresponding to 2m users; for each xj , it contains two rated tuples rj+ and rj− . The values of the tuples are set as follows: whenever xj ∈ Si , we set rj+ [Ai ] = ai and rj− [Ai ] = bi . The rating value of all tuples of the form rj+ is 5 (max) while that of tuples of the form rj− is 1 (min). The values of all other attributes are distinct constants appearing 13 5.1. Exploration Algorithms uid r1+ r2+ r3+ r4+ r5+ r6+ A1 a1 a1 a1 A2 a1 a1 a1 A3 a1 a1 a1 rating 5 5 5 5 5 5 uid r1− r2− r3− r4− r5− r6− A1 b1 b1 b1 A2 b1 b1 b1 A3 b1 b1 b1 rating 1 1 1 1 1 1 Table 5.1: Instance of Maximum Coverage obtained by reduction from 3SET PACKING. Blanks represent distinct constants appearing nowhere else. nowhere else. The query distributions are Q = {U1 , U5 } and set the group size threshold to b = 3 and the rating proximity threshold to θ = 0. It can be shown that I has k pairwise disjoint sets iff J admits a partition of describable groups of size ≥ b that covers exactly 6k elements. For example, if I = {{x1 , x2 , x3 }, {x3 , x4 , x5 }, {x4 , x5 , x6 }}, then J = {r1+ , r1− , ..., r6+ , r6− }, illustrated in Table 5.1. • Given a solution for I, solution to J is obtained by grouping tuple corresponding to each set. In the above example solution to I is {{x1 , x2 , x3 }, {x4 , x5 , x6 }}, then corresponding solution to J is {{r1+ , r2+ , r3+ }, {r4+ , r5+ , r6+ }}, {r1− , r2− , r3− }, {r4− , r5− , r6− }}. • Given a solution for J one can obtain solution for problem I, by choosing corresponding sets in the input. The construction is designed in such a way that a group is describable iff it corresponds to a set in I. For instance in the above example set {r1+ , r2+ , r3+ } is describable using description A1 = a1 , but the set {r2+ , r3+ , r4+ } is not describable as negations are not allowed in the attributes. In the above example solution to J is {{r1+ , r2+ , r3+ }, {r4+ , r5+ , r6+ }}, {r1− , r2− , r3− }, {r4− , r5− , r6− }}, then solution to I is {{x1 , x2 , x3 }, {x4 , x5 , x6 }}. Next, consider the problem of Minimum Description Length. We formalize this objective in terms of decision trees. More precisely, we seek to find partitions by constructing decision trees, with predicates over attributes used as split conditions, where the leaves correspond to the groups of the partition. The descriptions can be obtained by reading the predicates off the paths. We call such trees partition decision trees. Figure 5.2 shows an example. Ignore paths labeled by negative predicates such as age = teen for now. We will discuss this later. Given a rated dataset S ⊆ R, a rating proximity threshold θ, we want to find a partition of S containing groups whose descriptions are short and whose EMD w.r.t. some query distribution is ≤ θ. When formulated in terms 14 5.1. Exploration Algorithms of decision trees, this problem corresponds to minimizing the height of the decision tree. More precisely, let T be such a decision tree whose leaves form a partition π = [g1 , ..., gm ] of S. Then minimizing i (|gi .udesc| + |gi .idesc|) is equivalent to minimizing the total length of all root-to-leaf paths. This total length is in turn minimized exactly when the height of T is minimized. We next show: Theorem 2 Given a rated dataset S ⊆ R, a set of query distributions Q and an EMD threshold θ, finding a minimum height partition decision tree for S where each group’s EMD to some distribution in Q is ≤ θ is NP-hard. Proof Sketch. We show the decision version to be NP-hard by reduction from the classic Minimum Height Decision Tree problem, defined as follows. Given a set I of n m-bit vectors and a number k, the question is whether there is a binary decision tree with height ≤ k such that, each of its leaves is a unique bit vector in I and internal nodes are labeled by binary tests on some bit. From this instance I, we construct an instance J as follows. J has m binary attributes A1 , ..., Am and a categorical attribute A0 . For + − each bit vector si , J has two tuples t+ i and ti , i ∈ [1, n], with ti [j] and + − t− i [j] set to the j-th bit of vector si , j ∈ [1, m]. Assign ti [A0 ] and ti [A0 ] to two distinct constants appearing nowhere else. Finally, the rating value − for t+ i (resp., ti ) is 5 (resp., 1). Set the EMD threshold θ = 0 and let the query distributions be Q = {U1 , U5 }. E.g., if I = {011, 010, 100} then J contains the tuples (a1 , 0, 1, 1, 5), (b1 , 0, 1, 1, 1), (a2 , 0, 1, 0, 5), (b2 , 0, 1, 0, 1), (a3 , 1, 0, 0, 5), (b3 , 1, 0, 0, 1), where the last value is the rating. We have: Claim. I admits a decision tree of height ≤ k iff J admits a partition decision tree of height ≤ k + 1 where each group at its leaf is describable and exactly matches U1 or U5 . Only If: Given a decision tree T for I, by definition, it contains a unique bit vector si ∈ I at each leaf. If we apply this tree to J, we will get a tree each of whose leaves contains a group containing exactly the tuples − {t+ i , ti }, i ∈ [1, m]. These groups do not match either of U1 , U5 . Applying a split based on A0 = ai versus A0 = bi divides this group into two singleton − groups {t+ i } and {ti } which match U5 and U1 . The groups are describable. This tree has height one more than that of T . If: Let T be a partition decision tree of height ≤ k + 1 for J. By definition, each leaf of T contains a unique tuple of J. Notice that none of the groups at the leaves can contain more than one tuple with the same rating value, as they are not describable (without distinction’s or negations). 15 5.1. Exploration Algorithms Y N Y N Figure 5.1: Figure shows how to push a node conditioned on attribute A0 towards the leaf. − T must apply the predicates on attribute A0 to separate tuples t+ i and ti . Suppose T applies these tests after all other tests. Then the node at which − A0 = ai vs. A0 = bi is applied must contain exactly the group {t+ i , ti }. By replacing that group by the corresponding bit vector si , we get a decision tree of height ≤ k for I. Suppose T applies one or more tests on A0 before tests on other attributes Ai , where i > 0. Now we show that, we can ”push down” these tests (or node) on A0 so they are applied at the parent of leaf level, without increasing the height of the tree. Figure 5.1 illustrates how on can push the nodes conditioned on A0 . As shown in the figure 5.1, red color node represent the node conditioned on attribute A0 and green color nodes − t+ i and ti represents two rating tuples which are separated on attribute A0 . Note that this transformation on tree, does not effect height of green nodes. Though it decreases height of the leaf nodes under red node, it may decrease the height of the tree but it wont increase its height. Modified tree is still a decision tree with required height 5.1.2 Group Discovery Algorithms Given that finding partitions that either minimize description length or maximize coverage is NP-hard, a natural question is whether we can develop efficient heuristics, which we next address. Rather than tailor the heuristics specifically for minimizing description length or maximizing coverage, we argue that the algorithms we develop must in general favor short descriptions; in addition to improving understand-ability, they in turn tend to favor larger groups and thus make it more likely to satisfy the size threshold. In other words, they help improve the coverage too. We take a top-down approach for finding partitions, based on decision trees (recall the notion of partition 16 5.1. Exploration Algorithms decision trees defined in Section 5.1.1). Decision trees have the benefit that group descriptions can be directly read off the root-to-leaf paths. Whereas classic decision trees are driven by gain functions like entropy 4 and giniindex 5 , a novelty in our case is that our decision trees are designed to discover groups whose distributions are close to query distributions. That has the benefit of exploiting EMD properties as will be shown in Section 5.2. In the next two subsections, we develop two algorithms – DTAlg and RFAlg. Decision Tree Algorithm (DTAlg) Our first algorithm, DTAlg, is based on partition decision trees. Algorithm 1 gives the pseudo-code. It takes as input a set of query distributions Q, a set S of rating tuples on any items, a group size threshold b and a rating proximity threshold θ. For simplicity, we assume all our attributes are categorical. Numerical attributes like age and year are binned appropriately.6 Since attributes are binned, the only type of descriptions we need to focus on are conjunctions of Attr = val. Attribute values can be organized in a hierarchy. Figure 5.2(a), shows a partial hierarchy of attribute age from MovieLens dataset. Algorithm 1 repeatedly divides rating set S, in a breadth first manner, to find interesting groups with short descriptions and large enough sizes. At each node the algorithm checks if its corresponding group is good, i.e., it has good description, has size ≥ b, and has EMD ≤ θ to some distribution in Q. It adds good groups to the output list. Here, a description is good if consists of positive atoms of the form Attr = val. We disallow descriptions consisting of only negative atoms Attr = val since they are hard to understand intuitively. If a group is not good, the algorithm makes the following decisions: the group is dropped if it is too small (line 8); if the group’s EMD to the closest query distribution in Q is > θ, the algorithm looks for a splitting attribute (using a procedure described in Section 5.1.2) and the group is split accordingly; if the EMD is ≤ θ but the group has only negations in its description, the group is split again recursively using the same attribute until a description with no negation is found. This process is described in Section 5.1.2. A key operation that is repeatedly performed by the algorithm is finding the EMD of a group w.r.t. a set of query distributions of which the smallest 4 http://en.wikipedia.org/wiki/Entropy http://en.wikipedia.org/wiki/Gini index 6 In the real datasets MovieLens and BookCrossings, these attributes are available only as binned. 5 17 5.1. Exploration Algorithms Algorithm 1 DTAlg(S, Q, b, θ) 1: Initialize queue K = φ and list Output = φ. 2: Add S to K. 3: while not empty(K) do 4: parent = K.pop() 5: if goodGroup(parent) then 6: Add parent to Output. 7: else if |parent| < b then 8: continue 9: else if not hasGoodEM D(parent, Q, θ) then 10: Attribute attr = f indBestAttribute(parent) 11: Array child = split(parent, attr) 12: for i = 1 → size(child) do 13: if goodGroup(child[i]) then 14: Add child[i] to Output. 15: else 16: Add child[i] to K. 17: end if 18: end for 19: else if not hasGoodDescription(parent) then 20: Array child = removeBadDecription(parent) 21: for i = 1 → size(child) do 22: if goodGroup(child[i]) then 23: Add child[i] to Output. 24: else 25: Add child[i] to K. 26: end if 27: end for 28: end if 29: end while 30: return Output 18 5.1. Exploration Algorithms Age Teen Old Young Middle-Age 18-24 25-34 (a) Titanic EMD:0.57 Age = Teen Age != Teen EMD:0.44 EMD:0.58 Age = Old Age != Old EMD:0.53 EMD:0.58 (b) Titanic EMD:0.57 Age = Teen EMD:0.44 Age = Young EMD:0.57 Age = Old EMD:0.53 OCC = Art EMD:0.38 Age = Middle-Age EMD:0.6 OCC = University EMD:0.57 (c) Figure 5.2: DTAlg (a) Hierarchy of Attribute Age (b) Binary Split (c) Categorical Split EMD is picked. We anticipate users querying rated datasets with many possible query distributions. Thus it is important to minimize the work done for this operation. In Section 5.2.1, we give an elegant algorithm inspired by the top-k algorithm NRA [7] to accomplish this. Group Splitting We explored two ways of splitting groups into subgroups that are describable: Given an attribute Ai and a value vj , binary splitting on a group produces two child groups based on two complimentary predicates Ai = vj and Ai = vj . Figure 5.2(b) shows an example of binary splitting on ratings of movie Titanic using attribute age and value teen. 19 5.1. Exploration Algorithms This kind of splitting adds negation to group descriptions. For example, the leftmost node is described with age = teen & age = old. Categorical splitting on the other hand, results in n child groups, one for each value vj ∈ V where V = {v1 . . . vn } is the active domain of Ai . Figure 5.2(c) shows an example of categorical splitting on ratings of movie Titanic using attribute age, resulting in four child groups. Unlike binary splitting, categorical splitting does not introduce negation in the description of child groups. Thus, the check for goodness of descriptions performed by Algorithm 1 is needed only for binary splits. Gain Functions In order to proceed with group splitting, Algorithm 1 calculates a gain for each attribute and picks the one with maximum gain. After experimenting with various choices, we propose using minimum average EMD. Suppose splitting a group g using an attribute Ai yields m child groups ci1 . . . cim . The gain of Ai is defined as the inverse of average EMD of its child groups. If child groups have zero EMD then the gain is infinity. More formally: m Gain(Ai ) = m i j=1 EMD(cj , T ) Finally, an attribute may not be useful for splitting a group, if all the rating tuples in it have the same value for that attribute. For example, if a group has ratings of a single movie Titanic then none of the movie attributes are useful for splitting the ratings. Such attributes, if discovered at any group are discarded and are not considered for further splitting of any of the children groups. Algorithm 2 describes an algorithm for categorical splitting. Algorithm 2 : f indBestAttribute(g) Require: Array of attributes : attribs 1: maxGain = 0 2: maxGainAttribute = −1 3: for i = 1 → size(attribs) do 4: curGain = f indGain(g, attribs[i]) 5: if curGain > maxGain then 6: maxGain = curGain 7: maxGainAttribute = i 8: end if 9: end for 10: return attribs[maxGainAttribute] 20 5.1. Exploration Algorithms Removing Bad Descriptions As discussed earlier, the description of a group is said to be bad if it only contains negative predicates. Suppose group g contains negation predicate Ai = vi for an attribute Ai . Then we can modify the description of the group depending on two situations. In the first situation, if all the rating tuples in the group have a unique value vj for the attribute j = i, then the condition on the attribute Ai is replaced with Ai = vj . In the second situation, if there are multiple values for Ai in the group, we split the group recursively until all child groups have a unique value for the attribute Ai . Algorithm 3 describes an pseudo code for handling bad descriptions. Algorithm 3 : removeBadDecription(g) Require: Array of attributes : attribs 1: for i = 1 → size(attribs) do 2: if hasN egation(description(g, attribs[i])) then 3: if hasU niqueV alue(g, attribs[i]) then 4: description(g, attribs[i]) = uniqueV alue 5: return R 6: else 7: Array childs = split(g, attribs[i]) 8: return childs 9: end if 10: end if 11: end for The Random Forest Approach (RFAlg) One of the shortcomings of the DTAlg algorithm is that the condition used to split the root will be shared by all groups. E.g., in the Figure 5.2(b) and (c), every group obtained will have a condition on attribute age. Although not a goal in itself, the diversity in the descriptions of groups that belong to the same partition may matter. In particular, groups sharing the same root attribute will have lower diversity, as measured using, e.g., the Jaccard distance between their descriptions. To mitigate this, we draw inspiration from the work of Breiman [2] who proposed the approach of Random Forests (RF) to dramatically improve the performance of decision tree based classifiers. The idea is given n predictor attributes, a classical decision tree examines all n of them to pick the best split attribute and split point at each stage. The RF approach consists of two main steps. 21 5.1. Exploration Algorithms In step 1, the classical decision tree algorithm is run m times for some parameter m, where each run examines a random subset of k predictor √ attributes at each splitting node, a default value for k being n. This generates m decision trees. In step 2, the decisions of these m trees are combined, e.g., by voting, to yield an ensemble classifier. We adapt random forests to our setting. Step 1 of the RF approach √ remains the same: we run the DTAlg algorithm on a random subset of n user/item attributes available m times. For us, step 2 should produce a partition, not a classifier. In the next section, we examine different strategies for combining the m partitions from step 1 into a single partition. Combining Partitions In this section, we describe five heuristics for combining the m partitions π1 , ..., πm produced in step 1 of the RF approach into a single partition. 1. RF-Cluster: Each partition πi intuitively captures a (possibly partial) clustering. For each pair of rated tuples ri , rj ∈ S, we can determine the fraction of partitions in which they are clustered together. Let kij be the number of partitions in which ri and rj are clustered together. Then we can define the distance between ri and rj as dij = kij /m. Now, we can use any standard clustering algorithm to obtain a clustering of S with this distance measure. After examining several alternate clustering algorithms, we chose hierarchical clustering. As the desired number of clusters, we chose the average number of groups in the partitions π1 , ..., πm . Once groups are obtained, we discard groups g for which |g| < b or EMD(g, Q) > θ holds. The resulting groups do not necessarily have a natural exact description. We take a pattern mining approach to solve this issue. Viewing each rated tuple as a transaction and each (user or item) attribute as an “item”, we obtain maximal frequent patterns, by setting the support threshold to 90%. Any maximal frequent pattern serves as an approximate description of the group, with an accuracy of at least 90%. 2. RF-Description: A second strategy favors a partition with diverse groups w.r.t. their descriptions. Define the Jaccard distance between groups gi , gj as Jaccard(gi , gj ) = |desc(gi ) ∩ desc(gj )|/|desc(gi ) ∪ desc(gj )|. Start with an empty output partition and successively add a group whose total Jaccard distance to existing groups in the output is maximum until we cannot add any more groups. The first group is picked at random. This strategy favors diversity among the group descriptions. We will verify that in our experiments in Section 6. A common theme among the remaining strategies is that we order the 22 5.2. EMD Algorithms groups from the various partitions π1 , ...πm using some criterion. Then we add these groups to the output partition one by one: if a group overlaps with existing groups in the output, we discard it. Thus, the only difference is in the way groups are ordered. 3. RF-Random: Order the groups in a random manner. 4. RF-Size: Order the groups in decreasing order of their size. The rationale is to favor larger groups, which may help with the coverage. 5. RF-EMD: Order the groups in increasing order of their EMD to their closest query distribution in Q. This favors a solution where groups closer to some query distribution are preferred. We expect any of the RF algorithms to take more time than DTAlg since m decision trees are constructed and further time is incurred for combining their groups. Among them, RF-Cluster is by far the most expensive since it requires finding the distance between all pairs of tuples in S. 5.2 EMD Algorithms A key operation that is repeatedly performed by both DTAlg and RFAlg is finding the EMD between a group and its closest query distribution and it is important to do this efficiently. In general, the calculation of EMD between two given distributions can be done using the Hungarian algorithm and takes time O(n3 log n) where n is the domain size of the distribution [10]. However, in our setting, the distributions are probability distributions and are over the same domain (rating scale), thus it is possible to compute their EMD in linear time [16]. A similar observation was also exploited by Li et al. [12] in a different context. The key insight is to use a stack to manipulate the flow that corresponds to the amount of mass moved to transform one distribution to another. The stack helps minimize the work needed to find the closest distribution to a given distribution, from among a set of query distributions. 5.2.1 Computing EMD of Two Distributions Suppose we want to measure the EMD between two rating distributions ρ1 and ρ2 . We make one pass over distributions ρ1 , ρ2 starting from the left-most positions (1). A position i is excess (resp., deficit, equal ) position if ρ1 [i] > ρ2 [i] (resp., ρ1 [i] < ρ2 [i], ρ1 [i] = ρ2 [i]). We keep track of each position as we scan it. It moves mass such that ρ1 converts to ρ2 , and keeps track of the mass flow F [i, j] from position i to j. 4,5 together describe pseudo code of the algorithm. 23 5.2. EMD Algorithms Algorithm 4 : EMD(ρ1 [1, n], ρ2 [1, n]) 1: work = 0 // work done 2: Stack S = φ 3: state = equal 4: for i = 1 → n do 5: if ρ1 [i] > ρ2 [i] then 6: if state = excess or equal then 7: S.push({i, (ρ1 [i] − ρ2 [i])}) 8: else if state = def icit then 9: work + = updateStack(i) 10: end if 11: else if ρ1 [i] < ρ2 [i] then 12: if state = def icit or equal then 13: S.push({i, (ρ2 [i] − ρ1 [i])}) 14: else if state = excess then 15: work + = updateStack(i) 16: end if 17: end if 18: end for 19: return work 24 5.2. EMD Algorithms Algorithm 5 : updateStack(i) 1: work = 0; 2: mass = |ρ1 [i] − ρ2 [i]| 3: while mass >0 and S.isEmpty = false do 4: {k, kmass} = S.pop() 5: if mass ≥ kmass then 6: work = work + (kmass)(|i − k|) 7: updateFlow(kmass,i,k,state) 8: mass -= kmass 9: else 10: work = work + (mass)(|i − k|) 11: updateFlow(mass,i,k,state) 12: mass = 0; 13: S.push({k, (kmass−mass)}) 14: end if 15: end while 16: if mass >0 then 17: state = !state //invert 18: S.push({i, mass}) 19: end if 20: if S.isEmpty = true then 21: state = equal 22: end if 23: return work 25 5.2. EMD Algorithms step 0 1 2 3 4 5 stack {φ} { (1, 0.2) } { (1, 0.2) } { (3, 0.1) } { (3, 0.1), (4, 0.2) } {φ} state equal excess excess deficit deficit equal work 0 0 0 0.4 0.4 0.8 Table 5.2: Running Example of EMD Computation. We may first encounter excess or deficit or equal positions. Equal positions are just ignored. Say we first see an excess position. We store it on a stack. The stack is now said to be in excess state. Future excess positions are pushed to the stack while deficit positions are processed by flowing mass out of the top excess position on the stack. The stack may transition to deficit state as deficit positions are seen. Thus the stack is always in a welldefined state – excess or deficit. It reaches an equal state when it is empty. For each position type, we perform the following actions. Excess Position: This means ρ1 [i] > ρ2 [i]. Set the flow F [i, i] = ρ2 [i]. If the stack is in equal or excess state, push the entry (i, ρ1 [i] − ρ2 [i]) onto stack. This is the excess mass available at i. If the stack is in deficit state, pop the top element, say (j, δ), i.e., there is a deficit of δ at j. If µ =def ρ1 [i] − ρ2 [i] > δ, set F [i, j] = δ and decrement δ from µ. Repeat this for remaining deficit positions on stack until excess mass is left in µ. If µ becomes < δ, set F [i, j] = µ, and set δ = δ − µ and µ = 0. In the end, the stack may remain in deficit state, move to equal state (i.e., become empty). If position i remains an excess position push its entry on stack and the stack moves to excess state. Deficit Position: This means ρ1 [i] < ρ2 [i]. It is the mirror analog of the above case and we omit the obvious detail. Equal Position: In this case, since ρ1 [i] = ρ2 [i], we set the flow F [i, i] = ρ1 [i] and simply move to the next position. Table 5.2 illustrates the algorithm on a particular example with ρ1 = [0.2, 0, 0.2, 0.3, 0.3] and ρ2 = [0, 0, 0.5, 0.5, 0]. As shown in the table, in step 1 since there is excess mass of 0.2 at ρ1 [1], and it is pushed to stack and the stack state changes from equal to excess. In step 2, since ρ1 [2] is an equal position, the algorithm simply moves to the next position. ρ1 [3], is a deficit position, so we use excess positions on the stack to flow mass to ρ1 [3]. But since there is not enough mass, ρ1 [3] is added to stack with a deficit of 0.1 26 5.2. EMD Algorithms 0.1 0.2 0.2 Figure 5.3: Figure showing flows for example in Table 5.2 and the state of the stack is changed to deficit. Since there is a mass flow of 0.2 from position 1 to 3, 0.4 work is added to current work. In the next step, ρ1 [4] is a deficit position so it is pushed to stack. Since ρ1 [5] is an excess position mass is moved from it to previous deficit positions. Figure 5.3 shows the corresponding flows. It is easy to see that EM D(ρ1 , ρ2 ) is 0.8. We can show: Theorem 3 Work computed by algorithm to convert ρ1 to ρ2 is optimal and is equal to EMD(ρ1 , ρ2 ). The proof, omitted here for lack of space, appears in a technical report, not cited for reasons of anonymity. It is easy to see that the algorithm takes time O(n) where n is the length of the distribution, which in our application is the size of the rating scale. 5.2.2 EMD for Multiple Distributions Our algorithms for finding partitions, DTAlg and RFAlg, involve multiple evaluations of the following query: given a distribution ρ (corresponding to a group) and a set of query distributions Q = {ρ1 , ..., ρk }, what is the EMD of ρ to its closest distribution in Q? We anticipate the application of our algorithm for any rated dataset where the rating scale may be large (e.g., Yahoo!Music has a rating scale of 1–100) and where the number of query distributions k may be large (e.g., the user may have seen several distributions while exploring items and may want to know which groups exhibit similar distributions on a given class of items). To make this possible, it is essential to provide efficient support for the above query. A na¨ıve approach is to evaluate EMD(ρ, ρi ), i ∈ [1, k] and pick the smallest EMD. But this is inefficient. In this section, we develop a pruning strategy that relies on maintaining a lower bound i and an upper bound ui on the possible values of 27 5.2. EMD Algorithms EMD(ρ, ρi ), ∀i. Then whenever i > min{uj | j ∈ [1, k]}, we can safely discard ρi as a candidate for being the closest query distribution to ρ. To develop the bounds, consider an algorithm that maintains a list L of promising candidates to be the closest EMD-neighbors of the given distribution ρ (equiv., group). Similar to our earlier algorithm for calculating the EMD for a pair of distributions, this algorithm makes one concurrent pass over ρ and the query distributions ρi . It maintains excess/deficit positions on a stack as before. We assume there are n rating values in the rating scale. We can show: Lemma 1 After the algorithm has examined the first i positions, let δ be the total mass on the stack, j, k be positions on the top and bottom of the stack. Let γ = it=1 ρp [t] = γ be the cumulative mass of distribution ρp ∈ Q. Let ∆ be the work done so far for converting ρ into ρp . Then p = ∆ + δ × (i + 1 − j) ≤ EM D(ρ, ρp ) ≤ up = ∆ + δ × (n − k) + (1 − δ − γ) × (n − i − 1). Proof: The key intuition is that after examining i positions, in the best scenario, all the mass (say excess) on the stack just needs to move to the next ((i + 1)-th) position. Thus, the EMD cannot be smaller than the current work done plus this amount. The upper bound corresponds to the worst case where all the mass on stack (say excess) has to move the furthest, i.e., from the bottom position to the very last (n-th) position. Additionally, it is possible that there is excess mass left over at the last position n, which needs to move to position i + 1. The maximum possible value of this excess mass is 1 − δ − γ. We illustrate the pruning algorithm with a simple example. Consider a given distribution ρ = [0.8, 0.2, 0, 0, 0] and the query distributions Q = {ρ1 = [1, 0, 0, 0, 0], ρ2 = [0, 1, 0, 0, 0], ρ3 = [0, 0, 1, 0, 0], ρ4 = [0, 0, 0, 1, 0], ρ5 = [0, 0, 0, 0, 1]}. After examining the first two positions, the current work done for the various query distributions is 0.2, 0.8, 0, 0, 0. The bounds are: 1 = u1 = 0.2; 2 = u2 = 0.8; 3 = 4 = 5 = 1; u3 = u4 = u5 = 4. At this point, The lower bounds for ρ2 , ..., ρ5 exceed the upper bound for ρ1 and we can drop ρ2 , ..., ρ5 from further consideration. 28 Chapter 6 Experiments The goal of our experiments is two-fold: validate the quality of partitions obtained by our algorithms and study the scalability of our algorithms. In order to assess the quality of obtained partitions, we perform offline experiments that report measures used to validate clustering performance since our partitioning can be viewed as a clustering of the input rated dataset. We also examine coverage, description length and diversity of descriptions for groups generated using our algorithms. Finally, we evaluate the scalability of our algorithms as a function of the number of ratings and of trees. 6.1 Experimental Setup We use two rated datasets, 1 million MovieLens (ML) and Book crossing (BC) summarized in Table 6.1. ML contains { gender, age, occupation, location} as user attributes. We combine it with IMDb to obtain attributes {title, actor, director, writer } for each movie. Similarly BC provides {location, age} for users and { title,author,year,publisher } for books. Some attributes have a hierarchy. For example, the hierarchy of location is Country → State → City. This information is readily available for every user in BC, but for ML, we queried Yahoo! Maps to get this information. We manually created hierarchies for attributes age, year and occupation. Other attributes like director, gender, author have trivial hierarchies (i.e height =0). All experiments were done on a Xeon 2.5GHz Quad CoreWindows Server 2003 machine with 16GB RAM and a 128GB SCSI hard disk. All code is written in Java using JDK/JRE 1.6. Unless mentioned otherwise, EMD threshold, θ = 0.2,(2 for BC), number of trees is 4 , and minimum group size is b = 5. All our results are averages of 3 runs. For ML, we used the query distribution Q = {U1 , U2 , ..., U5 , U1,2 , U2,3 , U3,4 , U4,5 , P1,5 } and for BC, Q = {U1 , U2 , ..., U10 }. 29 6.2. Offline Quality Evaluation about users items ratings Rating Scale MovieLens (+IMDB) movie ratings 6040 3900 ˜ 1000209 (million) 1 to 5 Book Crossing book ratings 38511 260 196842 1 to 10 Table 6.1: Summary of datasets Location/Age Teen Young Middle Senior Count East Central West Count U˜1 U U 2000 U U˜5 U 1000 U U U 1000 U U U˜2 1000 2000 1000 1000 5000 Table 6.2: Rating distributions of various groups in synthetic data 6.2 6.2.1 Offline Quality Evaluation Purity, Rand-Index and F-Measure Since our algorithms partition a rated dataset into groups, we propose to borrow standard clustering evaluation measures like “purity”, “Rand-Index” and “F-Measure”, to evaluate the quality of our partitions. All those measures require a gold standard. We therefore construct a gold standard by generating synthetic data using known distributions. We generated 5000 rating tuples for an item, in which users have three attributes age, occupation, location. Our gold standard consists of the three groups shown in Table 6.2. For example ratings by user in group, [age = Teen,Location= East] are generated from slightly perturbed U1 distribution represented as U˜1 . We do not use exact Ui , to make the data look more realistic and also to make it more difficult for our algorithms. For this experiment, the query distribution is Q = {U1 , U2 , ..., U5 } . Purity is defined as the fraction of rating tuples which are “correctly classified” according to our gold standard. Consider a rating tuple r, let gi be the group to which r belongs in the gold standard and gj the group to which it belongs in the partition obtained from our algorithm. Let the nearest query distribution (using EMD) for gi be ρi and for gj be ρj . In this setting r is correctly classified iff ρi = ρj . Purity is defined as: Purity = No. of correctly classified tuples Total no. of tuples Purity measures whether an individual rating tuple r belongs to a group with the same rating distribution. It does not measure pairwise cooccurrence of tuples ri and rj . This is measured by RandIndex and Fβ 30 6.2. Offline Quality Evaluation Algorithm DTAlg Rf-Size Rf-Description Rf-EMD Rf-Rand Purity 0.9955 0.9923 0.9891 0.9891 0.9891 RandIndex 0.9058 0.9687 0.9420 0.8646 0.8712 F1 score 0.8542 0.9570 0.9177 0.7707 0.7904 F2 score 0.6540 0.9129 0.8691 0.6347 0.7002 F0.5 score 0.8842 0.9729 0.9624 0.8620 0.8620 Table 6.3: Performance of various algorithms measured using Purity, RandIndex, F-Measure score. These measures look at all pairs of rating tuples and compare their groups in the gold standard and in the algorithm output. Depending on various combinations, each pair is recognized as True Positive (T P ) or True Negative (T N ) or False Positive (F P ) or False Negative (F N ). T P, T N, F P, F N have the same meaning as in classic clustering theory. RandIndex and Fβ score are defined as: TP + TN TP + TN + FP + FN (1 + β 2 )T P Fβ score = (1 + β 2 )T P + β 2 F N + F P RandIndex = Values of all three measures are bounded in the range [0, 1]. Table 6.3 shows the performance of various algorithms. We do not measure these values for algorithm RF-Cluster as it takes a long time to terminate (as shown later). Results show that all algorithms perform equally good in terms of purity. In fact all algorithms nearly achieve maximum value of 1 for purity. Togetherness of rating tuples is measured by RandIndex, Fβ measure. Overall, the DT algorithm may find a smaller child group instead of a big parent group. For example, it may not identify the group [age = Teen, Location= East] , but may return all smaller groups in it like [age = Teen, Location= East, Occupation=lawyer]. This may happen because in generating trees in DTAlg, Occupation might have been chosen before Location. For any choice of gain function, this can always happen for some input. It is because of this that DTAlg does not perform well in terms of these measures. RF algorithms overcome this by creating many trees. Clearly, RF-Size and RF-Description are better heuristics than RF-EMD and RF-Rand. In fact RF-EMD and RF-Rand perform worse than DTAlg. This shows that the heuristic used in combining output from various DT runs of RF plays an important role. Our experiments show that RF-Size and RF-Description are better heuristics with RF-Size being the best. 31 6.2. Offline Quality Evaluation MovieLens Coverage 100 RF−DESCRIPTION RF−SIZE 50 RF−EMD RF−RAND 0 100 Coverage RF−CLUSTER BINARY CATEGORICAL Book Crossing RF−CLUSTER RF−DESCRIPTION RF−SIZE 50 RF−EMD RF−RAND 0 BINARY CATEGORICAL Figure 6.1: Figure showing coverage for two types of split Categorical and Binary on two datasets 6.2.2 Coverage, description, diversity and EMD We compare the quality of found groups using their coverage, description diversity, description length and EMD. Description diversity and description length are good indicators of the effectiveness of an algorithm, even though the former is not a goal in itself. Diversity of a partition is defined as the average pairwise Jaccard between groups in the partitions. That is m Jaccard(gi ,gj ) diversity(π) = i=j,i=1 m . In our experiments we observed that RF has better coverage when we use categorical splitting. As shown in the figure 6.1, coverage has low value for both the dataset when split type is binary. On close observation of partitions, we found that, partitions obtained from binary splitting overlaps a lot. One possible explanation of this is, in binary split, an attribute can get picked many times. As a result an attribute which has good gain for many attribute values, has more chance of getting picked again and again. Because number of attributes are very few in number (4-5) in both the dataset, this can happened a lot. As a result, partitions from different decision trees overlap a lot. The coverage remains almost same or improves slightly when compared to a partition from a single tree. In rest of the experiment we use categorical splitting. We found that the gain function based on minimum average EMD is the best. Figure 6.2(a) shows coverage on inputs of various sizes. It can be noticed that, all RF approaches have higher coverage than DTAlg on both datasets. One can notice that the margin between DTAlg and RF approaches increases as input size grows. For large input sizes, coverage by RF algorithm is almost double that of DTAlg. This shows that, single trees may not be sufficient to find non intersecting groups. Among different variants of RF, 32 6.2. Offline Quality Evaluation RF-Cluster has slightly more coverage than others in BC. Finally, RF-Size and RF-Cluster have better coverage than other variants. Figure 6.2(b) demonstrates the dependence of coverage on query distributions. Items with low variance in their ratings have higher coverage, and coverage decreases as variance increases. Recall that the query mostly consists of either distributions of type Ui or of type Ui,i+1 . For items with lower rating variance, most of the population gives the same rating value, so the rating distribution for such items will be close to one of the query distributions. For such items, coverage is higher. As variance increases, rating distributions get very far from query distributions. For these items, coverage is less. Also obtained group size is larger for items with low variance. Figure 6.3(a) shows how coverage varies as emd threshold increases. Coverage increases in all cases as threshold increases. It is expected because, now many in the algorithm qualify. Also note that gap between decision tree algorithm and RF-Size initially increases as emd increases, but later both of them converge. It is because, initially as emd increases,coverage of RF-grow rapidly because it discover more number of new blocks than a single decision tree algorithm. But as threshold gains large value, almost all subsets make it to final set. Thus coverage of both the algorithms converge to 100. Figure 6.3(a) shows how coverage changes as no. random trees increase. initially coverage increase as number of random trees increase, but it converge very quickly to a constant value. We found it experimentally that no. of trees at which coverage converge actually depends on size of the input. Figure 6.2(c) shows the diversity of discovered groups as input size varies. Notice that RF-Cluster has high diversity in description. It is because the description for each set is chosen independently using maximal frequent patterns. DTAlg in general has very low diversity for small input sizes. It is because the height of DTAlg is small and leaves share many nodes resulting in low diversity. Surprisingly, RF-Desc does not have diversity much higher than other RF approaches. Diversity of RF approaches is almost same. We suspect that is because the number of attributes in the datasets is small. Figure 6.4(b) shows average EMD obtained for various algorithms increases with the number of random trees generated for RF approaches. Since DTAlg is independent of the number random trees, its curve is constant. RF-EMD as expected has the smallest average EMD as the number of trees increases. When the number of random trees is one, average EMD is slightly smaller for DTAlg in ML, but almost the same for BC. The gap between average EMD for RF approaches and DTAlg is higher in case of ML. For BC, the values are almost identical. That is probably because of the difference in rating values in the two datasets. 33 6.3. Scalability Evaluation Algorithm/No. of Ratings DTAlg RF-CLUSTER RF-DESCRIPTION RF-SIZE RF-EMD RF-RAND 500 1.2710 1.1750 2.2750 2.2950 2.4040 2.3860 1000 1.6670 1.3810 2.5230 2.5540 2.5540 2.6020 2000 1.7360 1.6340 2.6630 2.8240 2.8610 2.8290 Table 6.4: Table describing how description length varies with input size on ML dataset Table 6.4 shows the variation in average description length of groups obtained as the input size grows (ML dataset). DTAlg has the least description length compared to other RF approaches except RF-Cluster. Recall that the description of RF-Cluster is approximate. In general description length increases as input size increases, although not dramatically. 6.3 Scalability Evaluation Figure 6.4(a) shows the running times of various algorithms as input size grows. For DTAlg the increase is unnoticeable. The time taken to create random trees is same for all RF approaches. The difference in running times is due to different strategies in combining partition from different trees. For RF-Cluster, though it has good quality in terms of coverage and diversity, its running time is very long. For large input sizes, it takes minutes. RF-Size, RF-Rand, RF-EMD have low running times. RF-Description time increases as input size grows as expected as finding groups which are far apart in terms of description is a costly step. Compared to previous work in finding groups [6] our algorithms are scalable. Since rating distribution length (rating scale) of groups is small, to observe the effect of pruning in EMD calculation algorithms, algorithm has to perform many such operations. This is only possible on very large datasets. In order to understand the effect of pruning, we compare time taken by RF approaches as the number of random trees grows since the number of EMD calculations increases. Effect of pruning can be seen from the plot. As the number of random trees grows, the gap between two curves increase which justifies that bounds can prove helpful when distributions are very long. Improvement is not very large. It is probably because, the constant factor in linear time algorithm also increases due maintenance of bounds. 34 6.3. Scalability Evaluation MovieLens 100 DTAlg Coverage 80 RF−CLUSTER RF−DESCRIPTION 60 RF−SIZE RF−EMD 40 RF−RAND 20 0 500 1000 1500 2000 No. Of Ratings Book Crossing 100 DTAlg Coverage 80 RF−CLUSTER RF−DESCRIPTION 60 RF−SIZE RF−EMD 40 RF−RAND 20 0 50−100 150−200 250−400 500−700 No. Of Ratings (a) MovieLens 100 DTAlg Coverage 80 RF−CLUSTER RF−DESCRIPTION 60 RF−SIZE RF−EMD 40 RF−RAND 20 0 0−0.6 0.6−0.8 0.8−1.0 1.0−1.2 1.2−1.4 1.4−1.6 1.6−4.0 Variance Book Crossing 100 DTAlg Coverage 80 RF−CLUSTER RF−DESCRIPTION 60 RF−SIZE RF−EMD RF−RAND 40 20 0 3−6.5 7−8 8−9 9−10 10−11 11−12 12−13 13−14 14−16 Variance (b) MovieLens Diversity 1 0.8 0.6 0.4 0.2 0 500 1000 No. Of Ratings 1500 2000 Book Crossing Diversity 1 0.5 0 50−100 150−200 250−400 No. Of Ratings 500−700 35 (c) Figure 6.2: Figure showing plots of coverage and diversity in description for movie Lens and Book Crossing. (a)Coverage Vs. Input Size (b)Coverage Vs. Variance (c)Diversity Vs. Input Size(Plot legend is same as in (a)) 6.3. Scalability Evaluation MovieLens Coverage 50 40 30 20 10 0 0 5 10 No. Of Trees 15 20 25 20 25 Book Crossing Coverage 60 40 20 0 0 5 10 15 No. Of Trees (a) MovieLens Coverage 100 DECISION−TREE 80 RF−CLUSTER RF−DESCRIPTION 60 RF−SIZE RF−EMD 40 RF−RAND 20 0 0 0.1 0.2 0.3 EMD Threshold 0.4 0.5 0.6 0.7 0.8 0.9 1 Book Crossing 100 Coverage DECISION−TREE 80 RF−CLUSTER RF−DESCRIPTION 60 RF−SIZE 40 RF−EMD RF−RAND 20 0 0 0.5 1 1.5 2 2.5 3 3.5 4 EMD Threshold (b) Figure 6.3: Figure showing plots of coverage for movie Lens and Book Crossing. (a)Coverage Vs. EMD Threshold (b)Coverage Vs. No. of Trees (Plot legend is same as in (a)) 36 6.3. Scalability Evaluation 20 DTAlg RF−CLUSTER RF−DESCRIPTION RF−SIZE RF−EMD RF−RAND 18 16 Time(sec) 14 12 10 8 6 4 2 0 0 2000 4000 6000 8000 10000 12000 No. Of Ratings (a) MovieLens Avg. EMD 0.15 0.14 0.13 0.12 0.11 0 5 10 No. Of Trees 15 20 25 20 25 Book Crossing Avg. EMD 1.6 1.4 1.2 1 0.8 0 5 10 15 No. Of Trees (b) 2.5 EMD−Prune EMD−Naive Time(sec) 2 1.5 1 0.5 0 0 20 40 60 80 100 120 140 No. Of Trees (c) Figure 6.4: Figure showing plots of various experiments on both the datasets: (a) Time Vs. Input size (b)Avg. EMD Vs. No. of Trees (Plot legend is same as in (a)) (c) EMD-Naive vs EMD-Prune 37 Chapter 7 Related Work Recently, the work of Das et al. [6] introduced complex mining of rated datasets with the goal of extracting meaningful demographic patterns that describe groups with biased opinions. They study the problem of finding groups that are uniform or polarized in their ratings. Our work can be seen as a generalization of their problems as we propose to query the rating distributions of an input dataset and discover describable groups that are close to input query distributions. Our query distributions could have any shape including uniform and polarized ones. Moreover, unlike ours, the algorithms proposed in that paper rely on data cubes. Exploring their applicability to our setting is an interesting future direction. Some data-driven studies have indicated that the observed diversity of opinion can be the result of demographic groups reacting differently to external events. Choudhury et al. [5] examined opinion biases in blogosphere communities, relying on entropy measure as an indicator of diversity in opinions. Alternatively, Varlamis et al. [17] propose clustering accuracy as an indicator of the blogosphere opinion convergence. Using this measure, they detected a divergence in topics near few major events. Several dimensionality reduction techniques, such as Subspace Clustering and Principle Component Analysis (PCA), were developed in order to describe a large structured dataset as labeled clusters. In particular, subspace clustering has been used extensively for data exploration in a variety of domains, see [9, 15] for reviews. While Subspace Clustering may be extended to handle the detection of groups in our setting, it needs to be modified to account for our rating distribution comparison measure (EMD) and for describability and scalability. CLIQUE [1] is the first bottom-up subspace clustering algorithm that relies on a global notion of density - the percentage of the overall dataset that falls within a particular subspace. ENCLUS [4] uses information entropy as the clustering objective, and shows a relationship between entropy and density, correlation, and coverage. Several extensions of the original algorithm were developed: CLTree [13] uses a decision-tree approach to identify high-density regions, while Cell-Based Clustering [3] improves scalability by partitioning the data so as to produce 38 Chapter 7. Related Work fewer clusters. Finally, PCA relies on pre-determining the set of attributes to use to describe clusters instead of discovering them on the fly, as in our work. 39 Chapter 8 Conclusion 8.1 Conclusion In this paper, we present the first work that enables scalable exploration of rated datasets in a rating distribution-centric manner. Our solution returns a set of <user,item,rating> groups whose rating distribution is as close as possible to input query distributions. We show that the problem is NP-hard under two different settings and develop two heuristics: the first one is based on adapting decision trees and the second is based on the random forest approach. At the center of our work is the use of Earth Mover’s Distance, a measure that intuitively captures the minimum amount of work required to transform one distribution into another. In future work, we would like to explore the relationship between different definitions of diversity and coverage and describability. We would also like to investigate the applicability of other algorithms such as subspace clustering for finding groups. 40 Bibliography [1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data. Data Min. Knowl. Discov., 11(1):5–33, 2005. [2] L. Breiman. Random forests. 10.1023/A:1010933404324. Machine Learning, 45:5–32, 2001. [3] J.-W. Chang and D.-S. Jin. A new cell-based clustering method for large, high-dimensional data in data mining applications. In SAC, pages 503–507. ACM, 2002. [4] C. H. Cheng, A. W.-C. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical data. In KDD, pages 84–93, 1999. [5] M. D. Choudhury, H. Sundaram, A. John, and D. D. Seligmann. Multiscale characterization of social network dynamics in the blogosphere. In CIKM, pages 1515–1516, 2008. [6] M. Das, S. Amer-Yahia, G. Das, and C. Yu. Mri: Meaningful interpretations of collaborative ratings. PVLDB, 4(11):1063–1074, 2011. [7] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66(4):614–656, 2003. [8] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979. [9] H.-P. Kriegel, P. Kr¨ oger, and A. Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. TKDD, 3(1), 2009. [10] H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955. 41 [11] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:49–86, 1951. [12] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In ICDE, pages 106–115, 2007. [13] B. Liu, Y. Xia, and P. S. Yu. Clustering through decision tree construction. In Proceedings of the ninth international conference on Information and knowledge management, CIKM ’00, pages 20–29, New York, NY, USA, 2000. ACM. [14] MovieLens dataset, http://www.grouplens.org/data/, as of 2003. [15] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. SIGKDD Explorations, 6(1):90–105, 2004. [16] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40:99–121, 2000. 10.1023/A:1026543900054. [17] I. Varlamis, V. Vassalos, and A. Palaios. Monitoring the evolution of interests in the blogosphere. In ICDE Workshops, pages 513–518, 2008. 42 Appendix A Algorithm Optimality In this section we prove that EMD calculation algorithm 4 is indeed optimal and correct. That is we prove the following statement: Theorem 4 Distance computed by algorithm 4 is optimal and is equal to EMD Suppose you we are calculating EMD between distributions ρs (source) and ρt (target). Algorithm 4 makes a pass over distribution ρs and keeps track of deficit and excess positions using a stack. As algorithm while keeping track of workdone, uses positions in the stack to move mass such that ρs is converted to ρt . Its a greedy algorithm because, it moves masses between nearest position (top element of stack) A.1 Notations/Definitions Before going into details of proof, we explain various notation used in the proof: 1. ρs [i, j] - represents an array of probability values from i to j in ρs 2. ρs [i : j] = j k=i ρs [k], i.e sum of the probabilities for values i to j 3. In process of converting ρs to ρt , mass is flown between the positions. Such flows are represented using a flow matrix. F represent any flow matrix which convert ρs to ρt . 4. F opt represent flow matrix of the optimal solution 5. G represent flow matrix obtained from greedy algorithm 4 6. If F is the flow matrix, Fij = F [i, j] =flow from position i to position j. 7. W (F ) represents work done due to flow F 8. Given ρs , ρt , the region ρs [1, i] is 43 A.2. Lemmas & Facts • self-sufficient(SS) if ρs [1 : i] = ρt [1 : i] • deficit if ρs [1 : i] < ρt [1 : i] • excess if ρs [1 : i] > ρt [1 : i] 9. F [1, k][1, k] represents submatrix of F of size k × k and it has flows between position 1 to k A.2 Lemmas & Facts In this section we state lemmas and proofs which are useful in proving the theorem 4 Fact 5 If region ρs [1, l] is deficit then there must be inflow to this region opt > 0, k > l, 1 ≤ j ≤ l from other positions. That is ∃k, j s.t Fkj Fact 6 If region ρs [1, l] is excess then there must be outflow from this reopt > 0, k > l, 1 ≤ j ≤ l gion to other positions. That is ∃k, j s.t Fjk Lemma 2 If region ρs [1, l] is deficit or SS then in an optimal flow F opt , there cannot be any out flow from this region. That is if ρs [1, l] is deficit or SS then opt = 0 ∀k > l & 1 ≤ i ≤ l (A.1) ∀F opt , Fik Figure A.1: Figure shows how to modify optimal flow F opt to obtain new flow F opt Proof: We use contradiction to prove that, such a flow does not exist. Let opt us suppose in optimal solution, there exists a flow such that Fik = δ1 > 0 opt s.t k > l. Since region P [1, l] is deficit or SS , by fact 5 ∃ p, q s.t Fpq = opt δ2 > 0, p > l and 1 ≤ q ≤ l. Now we show that, F is not optimal flow W.l.og let q > i and k > p. Similar contradiction can be shown for other 44 A.2. Lemmas & Facts three cases. Let δ = min(δ1 , δ2 ). Intuition here is to disturb the optimal flow such that total amount of work done decreases. Let the new Flow matrix obtained be F opt . Figure A.1 describes how optimal flow is modified to obtain new flow. The values in the F opt are as follows: opt Fik = δ1 − δ opt Fpq = δ2 − δ opt opt Fpk = Fpk +δ Fiqopt = Fiqopt + δ At other position, flow of F opt are identical to flows of F opt . Lets calculate the change in the work WF opt − WF opt = δ2 (k − i) + δ1 (p − q) − [(δ2 − δ)(k − i) + (δ1 − δ)(p − q) + δ(q − i) + δ(k − p)] W = δ(k − i + p − q − q + i − k + p) W = 2δ(p − q) W >0 Thus change in the work is positive. It means, WF opt > WF opt which is a contradiction because F opt is an optimal solution. Hence lemma is proved Lemma 3 If region ρs [1, l] is excess or self-sufficient then in an optimal flow F opt , there cannot be any in flow to this region. That is if ρs [1, l] is excess or SS then ∀F opt , F opt [k, i] = 0 ∀k > l & 1 ≤ i ≤ l (A.2) Proof: This lemma can be proved following similar steps as in proof of Lemma 2. So proof is not written in detail. Corollary 7 Given, ρs , ρt , there can either be an inflow or outflow from position 1, in optimal solution. More precisely: 1. ρs [1] > ρt [1], then F opt [j, 1] = 0 ∀j = 1 2. ρs [1] < ρt [1], then F opt [1, j] = 0 ∀j = 1 45 A.3. Proof A.3 Proof In this section we give proof of the theorem using lemmas and facts proved above. Recall that G represents the flows obtained from our greedy algorithm. Theorem 8 Let k be the maximum value such that, G[1, k] ≥ 0 or G[k, 1] ≤ 0. If Fopt is an optimal flow matrix of EMD(ρs , ρt )such that F opt F opt [1, k][1, k] = A = F opt [k + 1, n][1, k] = C F opt [1, k − 1][k + 1, n] = B opt F [k, n][k + 1, n] = D then 1. F opt [1, k − 1][k + 1, n] = B = 0 2. F opt [k + 1, n][1, k − 1] = C = 0 3. Workdone due to flow W (F opt [1, k][1, k]) = work done due to flow W (G[1, k][1, k]) 4. F opt [k, n][k, n] is the optimal flow of another smaller sub-problem EMD(ρs [k, n], ρt [k, n]) k−1 (F opt [j, k]+F opt [k, j]), ρs [k+1, n] = ρs [k+1, n] where ρs [k] = ρs [k]− j=1 opt [j, k] + F opt [k, j]), ρ [k + 1, n] = ρ [k + 1, n] and ρt [k] = ρt [k] − k−1 t t j=1 (F Proof: Theorem states that, if k is the farthest position from 1 such that there is either inflow or outflow between positions 1 and k in greedy solution, then 1,2,3,4 are true. (1) & (2) : ρs [1, k] is a deficit or SS region. Thus using lemma 2 F opt [1, k − 1][k + 1, n] = 0. Similarly, then using lemma 3 F opt [k + 1, n][1, k] = 0 We use induction technique to prove statement (3) is true ∀k = m, 1 ≤ m ≥ n. • k = 1: It implies ρs [1] = ρt [1], thus there is no out-flow or in-flow at position 1. Thus W (F opt [1, 1]) = W (G[1, 1]), proving the statement is true. 46 A.3. Proof • k = m − 1: Let workdone due to W (F opt [1, m − 1][1, m − 1]) is equal to workdone due to W (G[1, m − 1][1, m − 1]) • k = m: W.l.og let G[1, m] > 0. That is first position is excess mass position. Now idea is to move excess mass at position 1 to position 2 and use induction hypothesis to prove the statement. To be more precise, form a new sub problem, such that all flows from ρs [1] appear as flows ρs [2]. That is move all the extra mass at from ρs [1] to ρs [2]. Let resulting distribution be ρs . Length of ρs is m−1. Using induction hypothesis one can say that statement (3) is true for corresponding optimal (F opt ) and greedy (G ) flow matrices of EMD(ρs , ρt [2, n]). One can extend and modify flow matrice F opt , G of EMD(ρs , ρt [2, n]) to obtain flow matrices F opt , G of EMD(ρs , ρt ) by just adjusting flows at position one. It is easy to argue that such increase in workdone for both optimal and greedy flows is same amount and it is equal to workdone in moving at extramass at position 1 to 2. This proves that statement (3) is for F opt , G. (4) states that this problem exhibits optimal substructure property, which is essential for a greedy algorithm. That is Solution to EMD(ρs , ρt ) is equal to W (F opt [k, n][k, n]). Otherwise one can derive a contradiction. Proof of (4) completely shows that solution obtained by greedy algorithm is optimal. 47
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Flexible and efficient exploration of rated datasets
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Flexible and efficient exploration of rated datasets Kolloju, Naresh Kumar 2013
pdf
Page Metadata
Item Metadata
Title | Flexible and efficient exploration of rated datasets |
Creator |
Kolloju, Naresh Kumar |
Publisher | University of British Columbia |
Date Issued | 2013 |
Description | As users increasingly rely on collaborative rating sites to achieve mundane
tasks such as purchasing a product or renting a movie, they are facing the
data deluge of ratings and reviews. Traditionally, the exploration of rated
data sets has been enabled by rating averages that allow user-centric, itemcentric
and top-k exploration. More speci cally, canned queries on user
demographics aggregate opinion for an item or a collection of items such as
18-29 year old males in CA rated the movie The Social Network at 8:2 on
average. Combining ratings, demographics, and item attributes is a powerful
exploration mechanism that allows operations such as comparing the
opinion of the same users for two items, comparing two groups of users on
their opinion for a given class of items, and nding a group whose rating
distribution is nearly unanimous for an item. To enable those operations,
it is necessary to (i) adopt the right measure to compare ratings, and to
(ii) develop e cient algorithms to nd relevant |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2013-08-31 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0052206 |
URI | http://hdl.handle.net/2429/44028 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2013-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24- ubc_2013_spring_kolloju_naresh.pdf [ 810.79kB ]
- Metadata
- JSON: 24-1.0052206.json
- JSON-LD: 24-1.0052206-ld.json
- RDF/XML (Pretty): 24-1.0052206-rdf.xml
- RDF/JSON: 24-1.0052206-rdf.json
- Turtle: 24-1.0052206-turtle.txt
- N-Triples: 24-1.0052206-rdf-ntriples.txt
- Original Record: 24-1.0052206-source.json
- Full Text
- 24-1.0052206-fulltext.txt
- Citation
- 24-1.0052206.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0052206/manifest