Composite Recommendation:Semantics and EfficiencybyMin XieB.Eng., Renmin University of China, 2005M.Eng., Renmin University of China, 2008A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinThe Faculty of Graduate and Postdoctoral Studies(Computer Science)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)March 2015c© Min Xie 2015AbstractClassical recommender systems provide users with a list of recommendationswhere each recommendation consists of a single item, e.g., a book or DVD.However, many applications can benefit from a system which is capable ofrecommending packages of items. Sample applications include travel plan-ning, e-commerce, and course recommendation. In these contexts, there isa need for a system that can recommend the most relevant packages for theuser to choose from.In this thesis we highlight our research achievements for the compositerecommendation problem. We first consider the problem of composite rec-ommendation under hard constraint, e.g., budget. It is clear that this isa very common paradigm for the composite recommendation problem. InChapter 3, we first discuss how given a fixed package schema, we can ef-ficiently find the top-k most relevant packages with hard constraints. Theproposed algorithm is shown to be instance optimal, which means that noalgorithm in a reasonable class can perform more than a constant times bet-ter, for some fixed constant. And we also propose relaxed solutions basedon probabilistic reasoning. In Chapter 4, we lift the constraint on the pack-age schema, and discuss how efficient algorithms can be derived to solvethe more general problem with a flexible package schema. For this prob-lem, again we propose both instance optimal algorithm and heuristics-basedsolution which have been verified to be effective and efficient through ourextensive empirical study. Then in Chapter 5, motivated by the fact thathard constraints sometimes might lead to unfavorable results, and followingthe recent paradigm on “softening” the constraints, we study the problemof how to handle top-k query processing with soft constraints. Finally, inChapter 6, we discuss a general performance tuning solution based on cachedviews which can be leveraged to further optimize the various algorithms pro-posed in this thesis.iiPrefaceThis dissertation is the result of collaboration with several researchers. Themost influential among them is my supervisor Laks V.S. Lakshmanan fromthe University of British Columbia, Canada. Most of the studies includedin this dissertation are published in peer-reviewed venues.Algorithms for composite recommendation with hard constraints andfixed package schema are presented in Chapter 3 and are based on our pub-lication in VLDB 2011 [122]. In Chapter 4, we present algorithms which con-sider composite recommendation with hard constraints and flexible packageschema. This chapter is based on our papers in RecSys 2010 [120] and ICDE2011 [121]. In Chapter 5, we consider how soft constraints can be employedto handle limitations of hard constraints. And this chapter is based on ourpaper in VLDB 2013 [125] and VLDB 14 [126]. Finally, we consider howvarious algorithms for composite recommendation presented in this thesiscan be optimized using cached views in Chapter 6, this work being basedon our paper in EDBT 2013 [124].All of the published works [120–122, 124–126] are in collaboration withPeter Wood from University of London, Birkbeck, U.K., and my supervisorLaks V.S. Lakshmanan from the University of British Columbia, Canada.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . xi1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.1.1 Semantics . . . . . . . . . . . . . . . . . . . . . . . . 21.1.2 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . 31.2 Key Contributions . . . . . . . . . . . . . . . . . . . . . . . . 41.2.1 Composite Recommendation under Hard Constraints 41.2.2 Composite Recommendation under Soft Constraints . 51.2.3 Efficiency Optimization . . . . . . . . . . . . . . . . . 51.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1 Composite Recommendation . . . . . . . . . . . . . . . . . . 72.1.1 Application in Trip Planning . . . . . . . . . . . . . . 72.1.2 Application in Course Planning . . . . . . . . . . . . 82.1.3 Application in E-commerce . . . . . . . . . . . . . . . 92.1.4 Other Applications . . . . . . . . . . . . . . . . . . . 92.2 Preference Handling and Elicitation . . . . . . . . . . . . . . 102.3 Top-k Query Processing . . . . . . . . . . . . . . . . . . . . . 122.3.1 Domination-based Pruning and Variable Elimination 132.4 Constraint Optimization . . . . . . . . . . . . . . . . . . . . 15ivTable of Contents3 Composite Recommendation with Hard Constraints: FixedPackage Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . 183.2.1 Language for Aggregation Constraints . . . . . . . . . 183.2.2 Problem Studied . . . . . . . . . . . . . . . . . . . . . 193.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 203.3.1 Rank Join and Post-Filtering . . . . . . . . . . . . . . 203.3.2 Other Related Work . . . . . . . . . . . . . . . . . . . 223.4 Deterministic Algorithm . . . . . . . . . . . . . . . . . . . . 223.4.1 Properties of Aggregation Constraints . . . . . . . . . 233.4.2 Subsumption-based Pruning . . . . . . . . . . . . . . 243.4.3 Efficient Algorithm for Top-k RJAC . . . . . . . . . . 273.5 Probabilistic Algorithm . . . . . . . . . . . . . . . . . . . . . 313.5.1 Estimating Constraint Selectivity . . . . . . . . . . . 323.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 333.6.1 Efficiency Study . . . . . . . . . . . . . . . . . . . . . 343.6.2 Probabilistic Algorithm . . . . . . . . . . . . . . . . . 354 Composite Recommendation with Hard Constraints: Flex-ible Package Schema . . . . . . . . . . . . . . . . . . . . . . . . 384.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 384.2 Architecture and Problem . . . . . . . . . . . . . . . . . . . . 404.2.1 System Architecture . . . . . . . . . . . . . . . . . . . 404.2.2 Problem Statement . . . . . . . . . . . . . . . . . . . 404.3 Composite Recommendations . . . . . . . . . . . . . . . . . . 424.3.1 Instance Optimal Algorithms . . . . . . . . . . . . . . 424.3.2 Greedy Algorithms . . . . . . . . . . . . . . . . . . . 484.3.3 Histogram-Based Optimization . . . . . . . . . . . . . 514.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 524.4.1 Experimental Setup and Datasets . . . . . . . . . . . 524.4.2 Quality of Recommended Packages . . . . . . . . . . 544.4.3 Efficiency Study . . . . . . . . . . . . . . . . . . . . . 554.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.6 Application in Travel Planning . . . . . . . . . . . . . . . . . 574.6.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 595 Composite Recommendation with Soft Constraints . . . . 615.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 615.2 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . 65vTable of Contents5.2.1 Package Profiles . . . . . . . . . . . . . . . . . . . . . 655.2.2 Package Utility and Preference Elicitation . . . . . . 665.2.3 Presenting Packages . . . . . . . . . . . . . . . . . . . 685.3 A Sampling-based Framework . . . . . . . . . . . . . . . . . 725.3.1 Rejection Sampling . . . . . . . . . . . . . . . . . . . 735.3.2 Feedback-aware Sampling . . . . . . . . . . . . . . . . 735.3.3 Optimizing Constraint Checking Process . . . . . . . 765.3.4 Sample Maintenance . . . . . . . . . . . . . . . . . . 775.4 Search for Best Packages . . . . . . . . . . . . . . . . . . . . 785.4.1 Upper Bound Estimation for Best Package . . . . . . 805.4.2 Package Expansion . . . . . . . . . . . . . . . . . . . 815.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . 825.5.1 Comparing Sampling Methods . . . . . . . . . . . . . 845.5.2 Constraint Checking . . . . . . . . . . . . . . . . . . . 855.5.3 Overall Time Performance . . . . . . . . . . . . . . . 855.5.4 Sample Quality . . . . . . . . . . . . . . . . . . . . . 865.5.5 Sample Maintenance . . . . . . . . . . . . . . . . . . 875.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 886 Further Optimization using Materialized Views . . . . . . 906.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 906.2 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . 936.2.1 System Overview . . . . . . . . . . . . . . . . . . . . 956.3 LPTA-based kQAV Processing . . . . . . . . . . . . . . . . . 966.3.1 Algorithm LPTA . . . . . . . . . . . . . . . . . . . . 966.3.2 Algorithm LPTA+ . . . . . . . . . . . . . . . . . . . . 996.3.3 Handling the General kQAV Problem . . . . . . . . . 1036.4 IV-Index based Top-k QAV . . . . . . . . . . . . . . . . . . . 1046.4.1 Inverted View Index . . . . . . . . . . . . . . . . . . . 1046.4.2 IV-Search Algorithm . . . . . . . . . . . . . . . . . . 1066.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 1106.5 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . 1116.5.1 LPTA-based Algorithms . . . . . . . . . . . . . . . . 1126.5.2 IV-Index-based Algorithms . . . . . . . . . . . . . . . 1146.5.3 Effectiveness of Pruning . . . . . . . . . . . . . . . . . 1156.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 1156.7 Application in Composite Recommendation . . . . . . . . . . 117viTable of Contents7 Summary and Future Research . . . . . . . . . . . . . . . . . 1197.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1197.2 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . 121Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122viiList of Tables3.1 Properties of primitive aggregation constraints. . . . . . . . . 244.1 Quality Comparison for Different Composite Recommenda-tion Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 525.1 Top-5 package id’s for different sampling methods and differ-ent ranking methods on UNI. . . . . . . . . . . . . . . . . . . 87viiiList of Figures3.1 Post-filtering rank join with aggregation constraints. . . . . . 233.2 Tuple pruning using aggregation constraints. . . . . . . . . . 253.3 Adaptive subsumption-based pruning. . . . . . . . . . . . . . 303.4 Selectivity of aggregation constraints. . . . . . . . . . . . . . 333.5 Uniform dataset: (a), (b) SUM(A, true) ≥ λ, selectivity 10−5; (c), (d) MIN(A, true) ≤ λ, selectivity 10−5. . . . . . . . . . 353.6 Uniform dataset, SUM(A, true) ≤ λ, selectivity 10−5. . . . . 363.7 Uniform dataset: (a), (b) SUM(A, true) ≥ λ, SUM(B, true) ≤λ, overall selectivity 10−5 ; (c), (d) SUM(A, true) ≥ λ,SUM(B, true) ≥ λ, overall selectivity 10−5. . . . . . . . . . . 373.8 Quality of the probabilistic algorithm. . . . . . . . . . . . . . 374.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . 414.2 NDCG Score for Top-k Packages . . . . . . . . . . . . . . . . 534.3 (a)–(d) Running Time for Different Datasets; (e)–(h) AccessCost for Different Datasets . . . . . . . . . . . . . . . . . . . 544.4 (a)–(d) Running Time Comparison for Different Datasets;(e)–(h) Access Cost Comparison for Different Datasets . . . . 575.1 Examples of packages of items. . . . . . . . . . . . . . . . . . 625.2 Examples of different ranking semantics. . . . . . . . . . . . . 695.3 Approximate center of a convex polytope. . . . . . . . . . . . 755.4 Example of various sampling algorithms. . . . . . . . . . . . . 845.5 Efficiency of the pruning strategy. . . . . . . . . . . . . . . . . 855.6 Overall time performance under various sampling algorithms. 865.7 Experiments on sample maintenance. . . . . . . . . . . . . . . 886.1 (a) A relation R with three attributes A, B, C; (b), twocached views V1, V2 which contain top-3 tuples according tothe the two score functions f1(t) = 0.1t[A] + 0.9t[B], f2(t) =0.1[A] + 0.5[B] + 0.4[C] respectively. . . . . . . . . . . . . . . 916.2 System overview. . . . . . . . . . . . . . . . . . . . . . . . . . 96ixList of Figures6.3 Example of LPTA. . . . . . . . . . . . . . . . . . . . . . . . . 976.4 Query processing cost of LPTA as the dimensionality increases.1006.5 Example of LPTA+. . . . . . . . . . . . . . . . . . . . . . . . 1016.6 Example of (a) a kd-tree, and (b) the corresponding partitionof 2-dimensional space. . . . . . . . . . . . . . . . . . . . . . . 1056.7 LPTA vs. LPTA+: (a–e) results on 5 datasets with each viewcontaining 1000 tuples; (f–j) results on 5 datasets with eachview containing 100 tuples. . . . . . . . . . . . . . . . . . . . 1126.8 When varying the number of views on the RND dataset, theperformance comparison between: (a) LPTA and LPTA+; (b)IVS-Eager and IVS-Lazy. . . . . . . . . . . . . . . . . . . . . 1136.9 When varying the value k of a query on the RND dataset, theperformance comparison between: (a) LPTA and LPTA+; (b)IVS-Eager and IVS-Lazy. . . . . . . . . . . . . . . . . . . . . 1136.10 IVS-Eager vs. IVS-Lazy: (a–e) results on 5 datasets with eachview containing 1000 tuples; (f–j) results on 5 datasets witheach view containing 100 tuples. . . . . . . . . . . . . . . . . 1146.11 Pruning effectiveness test of IV-Search algorithms based onthe five datasets. . . . . . . . . . . . . . . . . . . . . . . . . . 115xAcknowledgementsI would like to express sincere gratitude to my supervisor, Dr. Laks V.S.Lakshmanan, for all the inspiring discussions, helpful suggestions, and con-stant encouragement throughout the course of my Ph.D. program. Withouthis support, I would not be able to make the achievement today, and it ismy honor to have the chance to work with him.I highly appreciate the help from Dr. Peter Wood from University ofLondon, Birkbeck on most of my research work here in UBC, without thefruitful discussions with him and his excellent advices, those research wouldnot be possible.I’m also extremely thankful for Dr. Francesco Bonchi from Yahoo! Re-search Barcelona. His provoking thoughts and insights have always inspiredme to keep exploring interesting and challenging research problems.I would also like to thank my Ph.D. supervisory committee members,Dr. Raymond Ng, Dr. Alan Mackworth, and Dr. Giuseppe Carenini. Whosechallenging questions and comments have helped a lot on improving thisthesis.Also I want to thank other faculty members in the DMM lab, RachelPottinger and Ed Knorr, from whom I have learned a lot through conversa-tions, meetings, and classes.I would like to thank my colleagues in the lab, Michael Lawrence, Hon-grae Lee, Shaofeng Bu, Jian Xu, Solmaz Kolahi, Amit Goyal, Moham-mad Khabbaz, Pooya Esfandiar, Ali Moosavi, Dibesh Shakya, Zhaohong(Charles) Chen, Xueyao (Sophia) Liang, Tianyu Li, Naresh Kumar Kolloju,Wei Lu, Pei Li, Rui Chen, Smriti Bhagat, Lan Wei, Shanshan Chen, ArniThrastarson, Vignesh V.S., Zainab Zolaktaf, Yidan Liu, Glenn Bevilacqua,Keqian Li, Weicong Liao, you guys have made my five years in UBC sojoyful.Finally, I want to thank my family especially my beloved wife Mengya,who sacrificed so much to support me pursuing this career. Without her bymy side, I wouldn’t have come this far.xiChapter 1IntroductionRecommender systems (RecSys) have become very popular of late and havebecome an essential driver of many applications. However, classical RecSysprovide recommendations consisting of single items, e.g., books or DVDs.Several applications can benefit from a system capable of doing compositerecommendation, or recommending packages of items, in the form of sets.For example, in trip planning, a user is interested in suggestions for placesto visit, or points of interest (POI). There may be a cost associated witheach visiting place (time, price, etc.). Optionally, there may be a notion ofcompatibility among items in a set, modeled in the form of constraints: e.g.,“no more than 3 museums in a package”, “not more than two parks”, “thetotal distance covered in visiting all POIs in a package should be ≤ 10 km”.The user may have a limited budget and may be interested in suggestionsof compatible sets of POIs such that each set has a cost that is under abudget and has a value (as judged from ratings) that is as high as possible.In these applications, there is a natural need for the top-k recommendationpackages for the user to choose from. Some so-called “third generation”travel planning web sites, such as NileGuide1 and YourTour2, are startingto provide certain of these features, although in a limited form.Another application arises in social networks, like twitter, where one ofthe important challenges is helping users with recommendations for tweetersto follow, based on their topics of interest3. Tweeters are ranked based onhow influential they are [119] and currently any new user is presented witha list of influential tweeters on each topic from which they manually choosetweeters they would like to follow3. To automate tweeter recommendation,a tweeter’s influence score can be treated as their value and the frequencywith which they tweet as their cost. Compatibility may correspond to theconstraint that a given set of topics should be covered. Given a user’s topicsof interest as well as a budget representing the number of tweets the user candeal with in a day, it would be useful to select compatible sets of tweeters to1http://www.nileguide.com (visited on 03/16/2015)2http://www.yourtour.com (visited on 03/16/2015)3https://blog.twitter.com/2010/power-suggestions (visited on 03/16/2015)11.1. Challengesfollow such that their total influence score is maximized and the total cost iswithin budget. Once again, it would be beneficial to give the user choice bypresenting them with the top-k sets of recommended tweeters to follow. Wenote that some newly founded startups like Followformation4 are beginningto provide services on recommending to users the top-k influential tweetersin a specific domain.As a third application, consider that a university student who wishes tochoose courses that are both highly rated by past students and satisfy certaindegree requirements. Assume each course is assigned a level, a category, anda number of credits. In order to obtain an MSc degree, students must take 8modules, subject to the following further constraints: (i) at least 75 creditsmust come from courses in the “database” category, (ii) the minimum levelof any course taken is 6, and (iii) the maximum number of credits takenat level 6 is 30. The requirements above can be expressed as a conjunctionof aggregation constraints. And based on popularity and other informationavailable, different sets of courses which satisfy all aggregation constraintscan be ranked to find the top most interseting ones to the student.From these examples, we can easily infer that a standard recommenda-tion engine which generates lists of items can be quite overwhelming for theuser since the user needs to manually figure out the package of items, andthe potential number of underlying items is huge. Thus we need a novelsystem which is capable of capturing users’ preferences and recommendinghigh quality packages of items.1.1 Challenges1.1.1 SemanticsOne of the core issues in the composite recommendation problem is how wecan determine the utility of a specific package for a user. To answer thisquestion, we note that there are usually two most important criteria fordetermining the value of a package for a user: 1. the quality of the package,e.g., the sum of the ratings of items within the package; 2. the constraintsspecified by the user, e.g., no more than $500 for the cost.For user-specified constraints, there are again two popular paradigmsof handling them. First, we can treat these users’ preferences as hard con-straints, e.g., if a user specifies a cost budget of $500, only packages of whichthe cost is within the budget will be considered. This paradigm is very useful4http://followformation.com (visited on 03/16/2015)21.1. Challengesfor the cases where users know their preferences exactly.On the other hand, we can also treat these users’ preferences as softconstraints. For example, suppose we have a budget constraint on cost.Then we can assign a score to each package based on its cost budget: thehigher the cost budget, the lower the score. Then this means we do notnecessarily rule out those packages which do not satisfy the constraint. Andthis will be particularly useful for cases where users are not 100% sure oftheir preferences. For example, though a user set a cost budget of $500 ona trip to New York, he might be well interested in a package which costslightly higher than $500, but includes many high quality places of interestand good restaurants.Given a specific paradigm of handling user-specified constraints, thereare still multiple ways of determining the utility of a specific package to auser. E.g., considering hard constraints based approach, given individualitem’s utility to the user, how to determine the utility of a set of items tothe user? Or for the soft constraints based approach, given the score basedon individual item’s utility, and also a score based on the constraints, howto estimate the overall utility of a package for the user? As shown by manyprevious works, we can leverage an additive utility function f which can beused to rank packages [27, 34]. However, the challenge of this approach lies inthe fact that we need to determine the parameters associated with functionf . For example, in the database community [63], usually it is assumed thatf is given by the user. This assumption might be too strong considering thefact that users many often not know their preference exactly. Thus a morepromising way for determining the utility function is through interactionwith the users using preference elicitation, as demonstrated in [27, 34].1.1.2 EfficiencyGiven different semantics of the composite recommendation problem as dis-cussed in the previous section, another core issue is the efficiency of thealgorithm for finding the most relevant packages.Consider the user-specified constraints, usually hard constraints will leadto NP-hard optimization problems such as Knapsack [120], and Orienteering[123], which render the underlying problem computationally difficult to scaleto a large dataset. And on the other hand, for soft constraints, even if we cansort and find the top-k packages by an efficient algorithm, determining theutility function which will be used for ranking might itself be a challengingproblem. E.g., as demonstrated in [27, 34], under a reasonable uncertaintysetting, the candidate utility function can range over an unlimited set of31.2. Key Contributionspossibilities.1.2 Key ContributionsAs we shall see from later chapters of this thesis, different composite recom-mendations problems might have significantly different types of properties,thus instead of proposing a universal one-size-fits-all solution, we believe themore optimal way is to exploit underlying different properties of differentproblem settings. So in this thesis, depending on structure of the underly-ing composite recommendation problem, we propose a portfolio of solutions(Chapter 3,4,and 5) which can be selected from and tailored to satisfy dif-ferent needs of the underlying application, and we also provide toolkits suchas cached views (Chapter 6) which can be leveraged to optimize variousproposed composite recommendation algorithms.1.2.1 Composite Recommendation under Hard ConstraintsIn [123], we consider the problem of performing composite recommenda-tion under hard constraints and having a fixed package schema (E.g., eachpackage has exactly one hotel, and one restaurant). We consider a simpleadditive utility function, and connect this problem to existing solutions onrank join [86] by extending these algorithms with aggregation constraints.By analyzing their properties, we developed deterministic and probabilisticalgorithms for their efficient processing. In addition to showing that thedeterministic algorithm retains the minimum number of accessed tuples inmemory at each iteration, we empirically showed both our deterministic andprobabilistic algorithms significantly outperform the obvious alternative ofrank join followed by post-filtering in many cases and that the probabilisticalgorithm produces results of high quality.In [120] and [121], we consider the more general case of composite rec-ommendation by lifting the constraint on package schema. We proposedthe problem of generating top-k package recommendations that are compat-ible and are under a cost budget, where a cost is incurred by visiting eachrecommended item and the budget and compatibility constraints are userspecified. We identify the problem of finding the top package as being in-tractable since it is a variant of the Knapsack problem, with the restrictionthat items need to be accessed in value-sorted order. So we developed two2-approximation algorithms that are designed to minimize the number ofitems accessed based on simple statistics (e.g., minimum value) about itemcosts. The first of these, InsOpt-CR-Topk, is instance optimal in a strong41.2. Key Contributionssense: every 2-approximation algorithm for the problem must access at leastas many items as this algorithm. The second of these, Greedy-CR-Topk, isnot guaranteed to be instance optimal, but is much faster. We experimen-tally evaluated the performance of the algorithms and showed that in termsof the quality of the top-k packages returned both algorithms are close toeach other and deliver high quality packages; in terms of the number ofitems accessed Greedy-CR-Topk is very close to InsOpt-CR-Topk, but interms of running time, Greedy-CR-Topk is much faster. We also showedthat using histogram-based information about item costs, rather than sim-ply knowledge of the minimum item cost, further reduces the number ofitems accessed by the algorithms and improves their running time.1.2.2 Composite Recommendation under Soft ConstraintsWe also study how composite recommendation is possible using soft con-straints [125, 126]. Following [27, 34], we assume the system does not havethe complete information about user’s utility function, and we leverage theexisting preference elicitation frameworks for eliciting preferences from users.However the challenge here is how can we perform the elicitation efficiently,especially considering the fact that we are reasoning about utilities of combi-nations of items. We propose several sampling-based methods which, givenuser feedback, can capture the updated knowledge of the underlying util-ity function. Finally, we also study various package ranking semantics forfinding top-k packages, using the learned utility function.1.2.3 Efficiency OptimizationA key component in the composite recommendation problem is the searchingof the top-k packages under a given semantics. In classical database queryoptimization, use of materialized views is a popular technique for speedingup query processing. Recently, it was extended to top-k queries [45]. In[124] we consider a general optimization procedure based on cached viewswhich can be leveraged to further reduce the computational cost of pro-cessing top-k queries. We show that the performance of the state-of-the-arttop-k query answering using view algorithm LPTA [45] suffers because of it-erative calls to a linear programming sub-procedure, which can be especiallyproblematic when the number of views is large or if the dimensionality ofthe dataset is high. By observing an interesting characteristic of the LPTAframework, we proposed LPTA+, an improved algorithm for using cachedtop-k views for efficient query processing, which has greatly outperformed51.3. Thesis OutlineLPTA. Furthermore, LPTA is not directly applicable in our setting of top-k query processing with cached views, where views are not complete tuplerankings and base views are not available. Thereto, we adapted both algo-rithms so that they can overcome such limiting assumptions. Finally, weproposed an index structure, called IV-Index, which stores the contents ofall cached views in a central data structure in memory, and we can leverageIV-Index to answer a new top-k query much more efficiently compared withLPTA and LPTA+. Using comprehensive experiments, we showed LPTA+substantially improves the performance of LPTA while the algorithms basedon IV-Index outperform both these algorithms by a significant margin. Wediscuss in this thesis how the proposed optimization framework can be inte-grated into various composite recommendation algorithms proposed in thisthesis.1.3 Thesis OutlineThe rest of this dissertation is organized as follows. In Chapter 2, we providea brief background and review related work. In Chapter 3, we consider thefirst problem in composite recommendation which deals with fixed packageschema and hard constraints. In Chapter 4, we lift the constraint on theschema of the package, and consider the problem of composite recommen-dation with flexible package schema and hard constraint. In Chapter 5, weconsider how soft constraints can be considered in the composite recom-mendation framework. We also discuss how to elicit user’s preference usingimplicit feedback. Finally, in Chapter 6, we discuss a general tuning frame-work based on cached views which can be leveraged to improve performanceof various proposed top-k package searching algorithms. In Chapter 7, wesummarize this dissertation and list directions for future research in com-posite recommendation.6Chapter 2Related Work2.1 Composite Recommendation2.1.1 Application in Trip PlanningOur composite recommendation problem is most related to recent studies ontravel package recommendation. In [19], the authors are interested in find-ing the top-k tuples of travel entities. Examples of entities include cities,hotels and airlines, while packages are tuples of entities. Instead of query-ing recommender systems, they query documents using keywords in orderto determine entity scores. Similar to our work [123], a package in theirframework is of fixed schema, e.g., one city, one hotel, and one airline, withfixed associations among the entities essentially indicating all possible validpackages.Obviously in many real world scenarios, we would like to have flexiblepackages schema, thus frameworks which allow flexible package schema con-figuration were proposed by several researchers. A representative work inthis category is [47], in which the authors propose a novel framework to auto-matically generate travel itineraries from online user-generated data like pic-ture uploads and formulate the problem of recommending travel itinerariesof high quality where the travel time is under a given time budget. However,the value of each POI in [47] is determined by the number of times it is men-tioned by users, whereas in our work [120, 121], item value is a personalizedscore which comes from an underlying recommender system, and we con-sider the very practical setting where accessing these items is constrainedto be in value-sorted order. Similar settings of [47] are also explored in [92].In [39], the authors extend the previous works by considering how multi-day trips can be planned. As discussed in [47] and [121], when travel timebetween POIs is taken into consideration, the underlying problem is closelyrelated to the classical Orienteering problem [38], which seeks a maximumvalue walk on a graph subject to a budget constraint on the cost. However,unlike [120, 121], the orienteering problem does not take POI access costinto consideration which can be less desirable since the number of POIs in72.1. Composite Recommendationthe database might be huge.In the database community, researchers have considered the travel pack-age recommendation problem from the query processing perspective. E.g.,in [111], the authors considered the optimal sequenced route (OSR) querywhich returns a sequence of POIs which satisfy the following two properties:1. the sequence of POIs follow exactly a given “POI type template” whichspecifies order and type of each POI; 2. the total travel distance of the re-turned POIs is minimized. In [40], the authors extend OSR by consideringpartial sequence-based POI type template. In [88], the authors propose tripplanning queries (TPQ), with which the user specifies a subset (not a se-quence) of location types and asks for the optimal route from a given startinglocation to a specified destination which passes through at least one POI ofeach type specified. In [69, 83], the authors consider a similar query typeas OSR with which POI type templates are specified using keyword-basedquery. Clearly, the query oriented travel package search problem requiresusers to know exactly or at least roughly what they want in a travel pack-age, which may not be practical in real world travel applications. In thisthesis, we are more interested in generating travel packages by leveragingusers’ previous travel behavior and minimizing the amount of informationthat needs to be provided by the user, which is similar as the state-or-the-artrecommender systems [16].2.1.2 Application in Course PlanningIn [74, 99, 102], the authors study various composite recommendation prob-lem related to course planning. The resulting product of these works, Cours-eRank [102], is a project motivated by a Stanford course planning applicationfor students, where constraints are of the form “take ki from Si,” where ki isa non-negative integer and Si is a set of courses. Similar to our work [120],each course in this system is associated with a score which is calculated us-ing an underlying recommendation engine. Given a number of constraints ofthe form above (and others), the system finds a minimal set of courses thatsatisfies the requirements and has the highest score. In [99, 101], the authorsextend CourseRank with prerequisite constraints, and proposes various al-gorithms that return high-quality course recommendations which satisfy allthe prerequisites. Similar to [120], such recommendations need not be offixed size. However, [99, 101, 102] do not consider the cost of items (cf.courses) which can be important for many composite recommendation ap-plications.82.1. Composite Recommendation2.1.3 Application in E-commerceFinally, E-commerce represents another promising application in which com-posite recommendation engines can be very useful. In [106], the authorsstudy the problem of recommending “satellite items” related to a given“central item” subject to a cost budget. Every package found by the pro-posed algorithm is composed of one specific central item and several satelliteitems. Clearly, the target of this work is a dedicated engine tailored for aspecific type of recommendation, which is different from our works such as[120, 121, 123] which targets the more general composite recommendationproblem without a specific shape in mind for the recommendation package.However, we note that we could easily extend our algorithm to make suchrecommendations by post-filtering as discussed in [120]. In [37], the authorspropose a framework which can help users search items of fixed pair-wise re-lationships. The problem studied in [37] is similar to [19], thus the proposedframework is not intended to search packages with flexible schema.In [56], the authors design a shopping tool which can help users findexisting deals about bundles of items from various E-commerce sources.Compared with our work [120, 121, 123, 124], the shopbot proposed in [56]cannot create new packages which are tailored to users’ interest, but insteadfocus on finding pricing and other information of existing packages which areprovided from different sources. The general framework of pricing strategiesof item bundles in marketing science has been discussed in [23].The composite recommendation is also related to Combinatorial Auctionin E-commerce [48], since the underlying objects in combinatorial auction arealso packages of items. However, combinatorial auction focuses on determin-ing item or package allocation under a multi-participant scenario, whereasin the composite recommendation problem studied in this thesis, we focuson making personalized recommendation where there exists no competitionfor packages among users, which is in alignment with the state-or-the-artrecommender systems [16].2.1.4 Other ApplicationsIn addition to travel planning, course planning, and E-commerce, there alsoexist other applications where composite recommendation can be extremelybeneficial. E.g., in [32], the authors proposed a framework CARD which pro-vides the top-k recommendations of composite products or services. Fine-grained control over specifying user requirements, as well as how atomiccosts are combined, is provided by an SQL-like language extended with92.2. Preference Handling and Elicitationfeatures for decision support. Each composite recommendation is of fixedschema, making the problem simpler; and CARD returns only exact, notapproximate, solutions.As another example, the problem of team formation studied in [78] canalso be considered as a composite recommendation problem. Here each per-son has a set of skills and pairs of people have a collaboration cost associatedwith them (lower cost indicates better collaboration). Given a task requir-ing a set of skills, the problem is to find a set of people whose skills coverthose required and who have a low aggregated collaboration cost. The no-tion of compatibility in [120] can model their collaboration cost. Similar toCourseRank, in [78], the people (items) themselves are not rated. A fur-ther difference with [120] is that in [120] we wish to maximize the aggregateitem (people) ratings subject to item and compatibility costs, rather thanminimize compatibility cost.We note that although we do not include in our system complex con-straints such as those in [78, 99, 101, 102, 116], for applications where com-plex constraints exist, we can leverage existing work to post-process eachcomposite recommendation generated by our algorithms to ensure that theconstraints are satisfied.2.2 Preference Handling and ElicitationThere has been much investigation into the handling of preferences overitems, e.g., general preference frameworks [72] [41], skyline queries [25, 95,96], and top-k queries [63]. While different approaches for reasoning aboutpreferences might be suitable for different applications, as discussed in [70], amore popular and practical approach is to leverage a general utility functionwhich succinctly characterizes users’ trade-offs among different propertiesof the items under consideration. Among various forms of utility functionwhich have been studied, the most popular one is the simple linear utilityfunction or additive utility function [63, 70].While item preference reasoning has been popular for a long time, onlyrecently have researchers started considering preference handling for pack-ages, or sets of items. This calls for exploring a much larger candidatespace, and usually has an aggregation-based feature space (e.g., total costbudget or average rating), which further complicates the underlying prob-lem. Existing set preference works by researchers in artificial intelligenceusually focus on the formal aspects of this problem, e.g., expressiveness ofthe preference language. These works include [31] which considers generaliz-102.2. Preference Handling and Elicitationing item preferences to set preferences through order lifting and incrementalimprovements, and [29, 30] which considers extensions of CP-nets proposedin [28] to sets of items. But the proposed models in these works are oftennot practical for applications with large amount of items. As commented in[29], the reason is because a language for specifying preferences between setsof items of arbitrary size, to be understood ceteris paribus, there is a inher-ent high complexity. E.g., comparing two sets of items under the preferencelanguage proposed in [29] is PSPACE-complete, and algorithms proposedin [30] has a worst-case exponential time complexity. In [51] and [104], theauthors study preferences over sets with an emphasize on diversity of theunderlying sets.In [129] and [87], the authors study skyline packages of fixed cardinality,which finding packages of fixed cardinality and are better than or equalto other packages on all attributes under consideration. However, a severedrawback of this approach is that the number of skyline packages is usuallyprohibitive. For example, in [129], the number of skyline packages can be inthe hundreds or even thousands for a reasonably-sized dataset.In this dissertation, following the popular paradigm of reasoning pref-erence about individual items [63, 70], we take a quantitative and utilityfunction-based approach for reasoning about preference under the compos-ite recommendation framework. Specifically, we consider the utility of apackage to be a linear weighted combination of various properties of thepackage (e.g., quality and cost), and the package property can be furtherdescribed by aggregations over attribute values of items within the package.However, though linear utility function can be leveraged to model trade-offs among different package properties, still weights of the utility functionneed to be determined. A simple approach for reasoning about preferenceis to assume weights of the utility function are given by the user, e.g., as wehave discussed in [120, 121, 123]. However, this may not be practical sinceusers usually cannot know the weights for sure. For example, users wouldnot be able to tell the system that they are 0.8 interested in the overall cost,and 0.2 interested in the overall quality of a package. Thus in Chapter 5,we propose to extend our works in Chapter 3 and Chapter 4, and studyhow weights of the utility function can be learned through implicit feedbackusing preference elicitation techniques [27, 34].For eliciting preferences, most existing works have been focusing on itemsinstead of packages. In [130], the authors propose an interactive way to elicitpreferred items. The paper does not consider preferences for packages, andmore importantly, they assume that the weight parameters of the underly-ing utility function follow a uniform distribution, and do not discuss how112.3. Top-k Query Processinguser feedback can be leveraged to update the utility function. In [95] and[67], the authors have a similar setting of inferring preferences given somepartial comparisons of items. However, these two papers focus on inferringa most desirable order directly using these given partial comparisons; in ourwork, we took a different Bayesian-based approach by modeling the param-eters of the utility function using a distribution. The most desirable orderof packages depends directly on the uncertain utility function following dif-ferent ranking semantics. Feedback received only affects the posterior of theparameter distribution.In [107], the authors consider an interactive way of ranking travel pack-ages. However, the user feedback model in [107] is defined in such a waythat for each iteration, the user is asked to rank a set of items instead of aset of packages. Thus every decision the user makes is “local” in the sensethat the user is not able to personalize her/his preference over packages asa whole; it is possible that items favored in one iteration might become lessdesirable after seeing some additional items. Also unlike our framework inwhich the elicitation of preferences is implicit, [107] requires several itera-tions of explicit preference elicitation before the system would show the userany recommended package.Compared with existing work on interactive preference elicitation foritems [95, 96], our search space of candidate packages is much larger, and weconsider features which are based on aggregations of item attribute values,thus the problem becomes more challenging.2.3 Top-k Query ProcessingGiven a linear utility function, for both hard constraints based compositerecommendation (as in Chapter 3 and Chapter 4), and soft constraints basedcomposite recommendation (as in Chapter 5), the core of the compositerecommendation algorithms is in finding the most promising packages giventhe linear utility function [120, 121, 123, 125]. As we will demonstrate in thisdissertation, this problem can be cast as a variation of the classical problemof top-k query processing [53].For general top-k query processing, the most popular approach is thefamily of algorithms embodied by Threshold Algorithm (TA) / No RandomAccess Algorithm (NRA) as proposed by Fagin et al. in [53]. While TA andNRA differ in whether random access to the database is allowed, this familyof algorithms usually share a similar query processing framework which isoutlined in Algorithm 1. In this algorithm framework, given a database R122.3. Top-k Query Processingof items, a query vector w for the linear utility function, and the number ofitems required k, we first sort items into a set of lists w.r.t. each attributeunder consideration. Then these lists are accessed following a certain order(usually round robin). For each item t accessed, we can put t into the resultqueue if it is better than the kth item in the queue. Because items aresorted in each list w.r.t. w or the desirable order, we could determine anupperbound value τ on the possible value which can be achieved by anyunseen items. If the current kth item in R has value larger than or equal toτ , we know we have already found the top-k items w.r.t. w.Algorithm 1: TA(R, w, k)1 L ← Lists of items in R which are sorted w.r.t. every attribute;2 O ← Priority queue for the top-k results;3 τ ← ∞;4 while |O| < k ∨ value(O.kthResult) ≤ τ do5 Li ← Next attribute following round robin;6 t ← Li.next();7 if |O| < k ∨ value(O.kthResult) < value(t) then8 O.insert(t);9 τ ← Upperbound value given current access positions in each list;10 return ORecently, various improvements to the original algorithms such as theBest Position Algorithm [18] have been proposed, while variations of top-kqueries such as Rank Join [62] and Continuous Top-k Queries [128] have beenstudied. Finally, Li et al. study how top-k algorithms might be implementedin a relational database [86]. An excellent survey on top-k query processingcan be found in [63].2.3.1 Domination-based Pruning and Variable EliminationAs discussed in the main TA algorithm framework, the key in top-k queryprocessing is to maintain in the memory a set of promising top-k resultcandidates, and iteratively check whether there exist a set of k candidateswhich can “beat” any of the remaining items w.r.t. the query, thus becomingthe optimal top-k results. As we shall see in later chapters of this thesis, theperformances of many variations of the top-k algorithms depends on howmany candidate items we need to maintain in the memory, thus a common132.3. Top-k Query Processingway to facilitate top-k query processing is to prune away items which arenot able to make into the top-k results as early as possible.A popular approach for this purpose is the domination-based pruning, ofwhich the idea can be described as follows. Consider a linear utility functionwhose parameters are described by w. For an item t, we say it dominatesanother item t′, t t′, if t is better than or equal to t′ on every criterionw.r.t. w, and is better than t′ on at least one criterion. Then obviously w.r.t.w, there is no hope for t′ to make into the top-1 result given the existenceof t, thus t′ can be eliminated from consideration. This domination-basedpruning can be easily extended to the top-k case.We discuss in [123] the properties of aggregation constraints in the com-posite recommendation framework and develop an efficient algorithm forprocessing rank joins with aggregation constraints, based on two strategiesfor domination-based pruning. We note that similar approaches were alsoexplored in [129] and [87] in the context of skyline sets.The domination-based pruning is also related to variable elimination(VE) in solving constraint satisfaction problem (CSP) [76]. The key ideahere is to transform the original CSP into a new reduced CSP which isequivalent to the original CSP. To enable such transformation, we considervariables one by one. For each variable X under consideration, we first takeall constraints which involve X and generate intermediate partial solutionswhich satisfy each constraint under investigation. A join operation is per-formed to merge all intermediate partial solutions, and then variable X canbe removed from consideration by generating new constraints which do notinvolve X. E.g., as discussed in [76], consider a CSP that contains the vari-ables A, B, and C. Suppose the only constraints under consideration areA < B and B < C, then we could remove B by joining partial solutions ofthe CSP which satisfy each of the above two constraints, and consider a newconstraint A < C for the join result. Clearly, for each join result satisfyingA < C, there must exist an assignment to variable B which extends it to asolution of the original problem.Clearly, both domination-based pruning and variable elimination corre-spond to approaches which aim at removing variables/items which do notcontribute to the optimal solution of the underlying problem. However, un-like VE which aims at removing variables by “marginalizing” their effects onthe underlying problem, domination-based pruning aims at removing vari-ables/items which are not promising under a given query. Also most existingworks on VE does not consider aggregation effects over variables, which isa key in the composite recommendation problem, as constraints are usuallyspecified as aggregations over item attribute values.142.4. Constraint Optimization2.4 Constraint OptimizationIn constraint optimization the task is to find a solution that optimizes somecost function and satisfies the specified constraints [49, 76]. The constraintoptimization problem finds applications in various domains such as planning,scheduling, and auction.As discussed in [49], the general constraint optimization problems areusually solved by branch-and-bound search algorithms or dynamic program-ming. For the search-based algorithm, the efficiency of the algorithm de-pends on its ability to cut the branches which do not lead to the optimal so-lution during the search process. This dead branch detection is done usuallywith a heuristic function which computes a lower bound of the current sub-problem at the branch under consideration. E.g., we can use either weightedCSP local consistency [79] or mini-bucket elimination [50]. Dynamic pro-gramming was proposed as an alternative to the branch-and-bound search[49], and the algorithm was introduced in the context of sequential decisionmaking.When the underlying constraints and cost functions in a constraint op-timization problem are all linear, the constraint optimization problem canalso be solved through general integer programing or linear programmingsolvers such as CPLEX 5.In the composite recommendation problem studied in this thesis, eachunderlying constraint optimization problem usually takes a specific restrictedform. Thus instead of leveraging a general purpose solver, we can exploitthe property of the underlying problem and consider solvers of simpler prob-lems such as Knapsack [71] for the composite recommendation problem withhard constraint and flexible schema (Chapter 4), and RankJoin [62] for thecomposite recommendation problem with hard constraint and fixed schema(Chapter 3).5http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/(visited on 03/16/2015)15Chapter 3Composite Recommendationwith Hard Constraints:Fixed Package Schema3.1 IntroductionIn this chapter, we first discuss how composite recommendation problemwith hard constraint and fixed package schema can be efficiently solved.In the last several years, there has been tremendous interest in rank joinqueries and their efficient processing [54, 62, 110]. In a rank join query, youare given a number of relations, each containing one or more value attributes,a monotone score aggregation function that combines the individual values,and a number k. The objective is to find the top-k join results, i.e., the joinresults with the k highest overall scores. Clearly, composite recommendationwith fixed package schema can be directly casted to a rank join problem.E.g., additive utility function can be considered as a specific instance of themontone score aggregation function, and the only thing missing in rank joinis how to handle the user specified constraints, which will be addressed inthis chapter.Rank join can be seen as a generalization of classic top-k queries whereone searches for the top-k objects w.r.t. a number of criteria or features [53].For classic top-k queries, assuming that objects are stored in score-sortedinverted lists for each feature, the top-k objects w.r.t. a monotone scoreaggregation function can be computed efficiently using algorithms such asTA, NRA and their variants [53]. These algorithms satisfy a property calledinstance optimality, which intuitively says that no algorithm in a reasonableclass can perform more than a constant times better, for some fixed constant.Ilyas et al. [62] were the first to develop an instance-optimal algorithmfor rank join queries involving the join of two relations. Their algorithmemploys the so-called corner-bounding scheme. Polyzotis et al. [110] showedthat whenever more than two relations are joined or relations are allowed to163.1. Introductioncontain multiple value-attributes, the corner bounding scheme is no longerinstance optimal. They proposed a tight bounding scheme based on main-taining a “cover set” for each relation, and using this bounding schemeresults in instance optimal algorithms [54, 110].For the composite recommendation problem, as we have discussed before,constraints can add considerable value [85, 97, 103, 105] in two ways. First,they allow the relevant application semantics to be abstracted and allowusers to impose their application-specific preferences on the query (or miningtask) at hand. Second, constraints can often be leveraged in optimizing thequery or mining task at hand. In this chapter, we argue that aggregationconstraints can enrich the framework of rank join queries by including suchapplication semantics.We should highlight the fact that, in our constraints, aggregation isapplied to values appearing in each tuple resulting from a join, rather thanin the traditional sense where aggregation is over sets of tuples. In thissense, aggregation constraints exhibit some similarity to selections appliedto a join.A natural question is how to process rank joins with aggregation con-straints efficiently. A naive approach is to perform the rank join, then applypost-filtering, dropping all results that violate the constraints, and finallyreport the top-k among the remaining results. We show that rank joins withaggregate constraints can be processed much faster than this post-filteringapproach. First, we develop techniques for pushing constraint processingwithin the rank join framework, allowing irrelevant and “unpromising” tu-ples to be pruned as early as possible. As a result, we show that tuples thatwill not contribute to the top-k answers can be detected and avoided. Sec-ond, based on the observation that such an optimized algorithm still needsto access many tuples, we propose a probabilistic algorithm which accessesfar fewer tuples while guaranteeing the quality of the results returned.Specifically, we make the following contributions in this work:• we introduce the problem of efficient processing of rank join querieswith aggregation constraints (Sec. 6.2), showing the limitations of thepost-filtering approach (Sec. 3.3);• we analyze the properties of aggregation constraints and develop anefficient algorithm for processing rank joins with aggregation con-straints, based on two strategies for pruning tuples (Sec. 3.4);• we also develop a probabilistic algorithm that terminates processingin a more aggressive manner than the deterministic approach while173.2. Problem Definitionguaranteeing high quality answers (Sec. 5.2);• we report on a detailed set of experiments which show that the execu-tion times of our algorithms can be orders of magnitude better thanthose of the post-filtering approach (Sec. 3.6).3.2 Problem DefinitionConsider a set R of n relations {R1, R2, . . . , Rn}, with Ri having the schemaschema(Ri), 1 ≤ i ≤ n. For each tuple t ∈ Ri, the set of attributes overwhich ti is defined is schema(t) = schema(Ri). We assume each relationhas a single value attribute V , and (for simplicity) a single join attribute J .6Given a tuple t ∈ Ri and an attribute A ∈ schema(t), t.A denotes t’s valueon A. We typically consider join conditions jc corresponding to equi-joins,i.e., J = J .Let R′ = {Rj1 , Rj2 , . . . , Rjm} ⊆ R. Given a join condition jc, we defines = {t1, . . . , tm} to be a joinable set (JS) if ti ∈ Rji , i = 1, . . . ,m, and ./jcmi=1{ti} 6= ∅. If m = n, we call s a full joinable set (FJS), while if m < n wecall s a partial joinable set (PJS). We denote by JS the set of all possible(partial) joinable sets. Furthermore, for a JS s which comes from R′, wedefine Rel(s) = R′.3.2.1 Language for Aggregation ConstraintsAggregation Constraints can be defined over joinable sets. Let AGG ∈{MIN,MAX,SUM,COUNT,AV G} be an aggregation function, and letthe binary operator θ be ≤, ≥ or =.7 Let p ::= A θ λ be an attribute valuepredicate, where A is an attribute of some relation, θ is as above, and λ is aconstant. We say tuple t satisfies p, t |= p, if A ∈ schema(t) and t.A θ λ istrue. An attribute value predicate p can be the constant true in which caseevery tuple satisfies it. A set of tuples s satisfies p, s |= p, if ∀t ∈ s, t |= p.We now consider aggregation constraints which are applied to tuplesresulting from a join. A primitive aggregation constraint (PAC) is of theform pc ::= AGG(A, p) θ λ, where AGG is an aggregation function, A is anattribute (called the aggregated attribute), p is an attribute value predicate(called the selection predicate) as defined above, and θ and λ are defined as6Our results and algorithms easily extend to more general cases of multiple value-attributes and/or multiple join attributes, following previous work such as [54].7Operators < and > can be treated similarly to ≤ and ≥.183.2. Problem Definitionabove. Given a joinable set s, we defineEvalpc(s) = AGG([t.A | t ∈ s ∧ t |= p])where we use [. . .] to denote a multiset. Then we say s satisfies the primitiveaggregation constraint pc, s |= pc, if Evalpc(s) θ λ holds.The language for (full) aggregation constraints can now be defined asfollows:Predicates: p ::= true |A θ λ | p ∧ pAggregation Constraints: ac ::= pc | pc ∧ acpc ::= AGG(A, p) θ λThe meaning of a full aggregation constraint ac is defined in the obviousway, as are the notions of joinable sets satisfying ac and the satisfying subsetRac of a relation R resulting from a join.Let R be a relation resulting from a (multi-way) join R1 onjc · · · onjc Rm.Each tuple t ∈ R can also be viewed as a joinable set st of tuples fromthe relations Ri. Given an aggregation constraint ac, we define Rac as{t | t ∈ R ∧ st |= ac}.Note that by adding a special attribute C to each relation and settingthe value of each tuple on C to be 1, COUNT can be simulated by SUM .Similarly, when the number of relations under consideration is fixed, AV Gcan also simulated by SUM . So to simplify the presentation, we will notdiscuss COUNT and AV G further.3.2.2 Problem StudiedWe assume the domain of each attribute is normalized to [0, 1]. Let Rdenote the set of reals and S : Rn → R be the score function, defined overthe value attributes of the joined relations. Following common practice, weassume S is monotone, which means S(x1, ..., xn) ≤ S(y1, ..., yn) whenever∀i, xi ≤ yi. To simplify the presentation, we will mostly focus on S beingSUM, so given a joinable set s, the overall value of s, denoted as v(s), canbe calculated as v(s) =∑t∈s t.V . Furthermore, in this chapter we assumethat the join condition jc is equi-join, which means that given two tuplest1 and t2 from two relations, {t1} ./jc {t2} 6= ∅ iff t1.J = t2.J . For brevitywe will omit the join condition jc from the join operator when there is noambiguity.Let ac be a user-specified aggregation constraint (which may be a con-junction of PACs) and jc be the join condition. We study the problem ofRank Join with Aggregation Constraints (RJAC):193.3. Related WorkDefinition 1 Rank Join with Aggregation Constraints: Given a setof relations R = {R1, . . . , Rn} and a join condition jc, let RS denote ./ni=1Ri. Now given a score function S and an aggregation constraint ac,find the top-k join results RSack ⊆ RSac, that is, ∀s ∈ RSack and ∀s′ ∈RSac −RSack , we have v(s) ≥ v(s′).We denote an instance of the RJAC problem by a 5-tuple I = (R, S, jc, ac, k).Because we are usually only interested in exactly k join results, we will dis-card potential join results which have the same value as the kth join resultin RSack ; however, the proposed technique can be easily modified to returnthese as well if needed. Our goal is to devise algorithms for finding thetop-k answers to RJAC as efficiently as possible.3.3 Related Work3.3.1 Rank Join and Post-FilteringThe standard rank join algorithm with no aggregation constraints works asfollows [62, 110]. Given a set of relations R = {R1, . . . , Rn}, assume thetuples of each relation are sorted in the non-increasing order of their value.The algorithm iteratively picks some relation Ri ∈ R and retrieves the nexttuple t from Ri. Each seen tuple t ∈ Ri is stored in a corresponding bufferHRi, and t is joined with tuples seen from HRj , j 6= i. The join result isplaced in an output buffer O which is organized as a priority queue. Toallow the algorithm to stop early, the value of t is used to update a stoppingthreshold τ , which is an upperbound on the value that can be achieved usingany unseen tuple. It can be shown that if there are at least k join resultsin the output buffer O which have value no less than τ , the algorithm canstop, and the first k join results in O are guaranteed to be the top-k results.To characterize the efficiency of a rank join algorithm, previous work hasused the notion of instance optimalilty, proposed by Fagin et al. [53]. Thebasic idea is that, given a cost function cost (which is a monotone function ofthe total number of tuples retrieved), with respect to a class A of algorithmsand a class D of data instances, a top-k algorithm A is instance optimal if,for some constants c0 and c1, for all algorithms B ∈ A and data instancesD ∈ D, we have cost(A,D) ≤ c0 × cost(B,D) + c1.Instance optimality of a rank join algorithm is closely related to thebounding scheme of the algorithm, which derives the stopping thresholdat each iteration. It has been shown in [110] that an algorithm using thecorner-bounding scheme [62] is instance optimal if and only if the underlying203.3. Related Workjoin is a binary join and each relation contains one value attribute. To en-sure instance optimality in the case of multiple value attributes per relationand multi-way rank join, Schnaitter et al. [110] proposed the feasible region(FR) bounding scheme. This FR bound was later improved by Finger andPolyzotis [54] using the fast feasible region (FR*) bounding scheme.Suppose each relation has m value attributes, then the basic idea ofFR/FR* bounding scheme is to maintain a cover set CRi for each relationRi. CRi stores a set of points that represents the m-dimensional boundaryof the values of all unseen tuples in Ri. Given an n-way rank join overR = {R1, . . . , Rn}, to derive the stopping threshold τ , we first enumerateall possible subsets of R. Then for each subset R′, we derive the maximumpossible join result value by joining the HRs of relations in R′ with the CRsof relations in R−R′. The threshold τ is the maximum of all such values.We note that although FR/FR* bounding scheme is tight, its complexitygrows exponentially with the number of relations involved [110]. Indeed,following Finger and Polyzotis [54], we mainly consider rank joins with asmall number of relations.In addition to the bounding scheme, the accessing strategy (which deter-mines which relation to explore next) may also affect the performance of therank join algorithm. For example, a simple accessing strategy such as round-robin often results in accessing more tuples than necessary. More efficientaccessing strategies include the corner-bound-adaptive strategy [62] for bi-nary, single value-attribute rank join and the potential adaptive strategy [54]for multi-way, multiple value-attribute rank join.As shown in the introduction, there are many situations where it isvery natural to have aggregation constraints along with rank join. Whileprevious work on rank join algorithms has devoted much effort to optimizingthe bounding scheme and accessing strategy, little work has been done onopportunities for improving runtime efficiency by using constraints that maybe present in a query.One way to handle aggregation constraints in the standard rank join al-gorithm is by post-filtering each join result using the aggregation constraints.It can be shown that an algorithm based on post-filtering remains instanceoptimal. However, as we will demonstrate in the next section, this na¨ıvealgorithm misses many optimization opportunities by not taking full advan-tage of the properties of the aggregation constraints, and, as we will showin Sec. 3.6, can have poor empirical performance as a result. This observa-tion coincides with recent findings that instance optimal algorithms are notalways computationally the most efficient [54].213.4. Deterministic Algorithm3.3.2 Other Related WorkAs described in the introduction, rank join can be seen as a generalization ofclassic top-k querying where one searches for the top-k objects w.r.t. a num-ber of criteria or features [53]. Ilyas et al. [64] discussed how to incorporatebinary rank join operator into relational query engines. The query optimiza-tion framework used in [64] follows System R’s dynamic programming-basedapproach, and in order to estimate the cost of the rank join operator, a novelprobabilistic model is proposed. In [86], Li et al. extended [64] by provid-ing a systematic algebraic support for relational ranking queries. Tsaparaset al. proposed in [117] a novel indexing structure for answering rank joinqueries. In this work, various tuple pruning techniques are studied to reducethe size of the index structure. In [94], Martinenghi et al. proposed a novelproximity rank join operator in which the join condition can be based on anontrivial proximity score between different tuples. A more detailed surveyof top-k query processing and rank join can be found in [63], We note thatno previous work on rank join has considered aggregation constraints.Our work is also closely related to recent efforts on package recommen-dation [19, 47, 100, 106, 120]. Though some of these works [47, 120] discussfinding high-quality packages under certain aggregation constraints such asbudgetary constraints, none of them provide a systematic study of aggrega-tion constraints. In [19], the authors propose a rank join-based algorithm forfinding travel packages of a fixed schema, however, in this work, the authorsdo not consider aggregation constraints which can be very useful in practice.3.4 Deterministic AlgorithmWe begin by illustrating rank joins with aggregation constraints.Example 1 [Rank Join with Aggregation Constraints] Consider two rela-tions, Museum and Restaurant, each with three attributes, Location, Costand Rating, where Rating is the value attribute and Location is the joinattribute (see Fig. 3.1). Assume we are looking for the top-2 results sub-ject to the aggregation constraint SUM(Cost, true) ≤ 20. Under the cornerbounding scheme and round-robin accessing strategy, the algorithm will stopafter accessing 5 tuples in Museum and 4 tuples in Restaurant. Note thateven though the joinable set {t3, t7} has a high value, it is not a top-2 resultbecause it does not satisfy the constraint.Our motivation in this section is to develop efficient pruning techniquesfor computing rank joins with aggregation constraints fast. Thereto, we223.4. Deterministic AlgorithmMuseumLocation Cost Ratingaa 13.51555RestaurantLocation Cost Ratingcbb5020104.54.54.5t1:t2:t6:t7:t8:ba 10154.54.5 a 5 3t9:b 5 3.5t5: a 10 3t10:t3:t4: { t3, t8 }Top-2 results{ t1, t9 }ConstraintSUM(Cost,true) ≤ 20Figure 3.1: Post-filtering rank join with aggregation constraints.first present a number of properties of aggregation constraints and showhow these properties can be leveraged to prune seen tuples from the in-memory buffers. We then propose an efficient rank join algorithm supportingaggregation constraints that minimizes the number of tuples that are keptin the in-memory buffers, which in turn helps cut down on useless joins.3.4.1 Properties of Aggregation ConstraintsLet pc ::= AGG(A, p) θ λ be a primitive aggregation constraint (PAC). Inorder to use pc to prune seen tuples, we first study properties of the variousforms of pc, i.e., for AGG ∈ {MIN,MAX,SUM} and θ ∈ {≤,≥,=}.First consider the cases when AGG is MIN and θ is ≥, or AGG isMAX and θ is ≤. These cases are the simplest because pc need only beevaluated on each seen tuple individually rather than on a full joinable set.When accessing a new tuple t, if A ∈ schema(t) and t satisfies p, we cansimply check whether t.A θ λ holds. If not, we can prune t from futureconsideration as pc will not be satisfied by any join result including t. Afterthis filtering process, all join results obtained by the algorithm must satisfythe constraint pc. We name this property the direct-pruning property.When AGG ∈ {MAX,SUM} and θ is ≥, or AGG is MIN and θ is ≤,the corresponding aggregation constraint pc is monotone.Definition 2 (Monotone Aggregation Constraint) A PAC pc is mono-tone if ∀t ∈ R, ∀s ∈ JS, where R /∈ Rel(s): if {t} |= pc and {t} ./ s 6= ∅,then {t} ./ s |= pc.For the case when AGG is SUM and θ is ≤, the PAC is anti-monotone.This means that if a tuple t does not satisfy pc, no join result of t with anypartial joinable set will satisfy PAC either.88The cases where AGG is MAX and θ is ≤ and AGG is MIN and θ is ≥ are alsoanti-monotone, but they can be handled using direct pruning, discussed above.233.4. Deterministic AlgorithmDefinition 3 (Anti-Monotone Aggregation Constraint) A PAC pc isanti-monotone if ∀t ∈ R, ∀s ∈ JS, where R 6∈ Rel(s): if {t} 6|= pc, theneither {t} ./ s = ∅ or {t} ./ s 6|= pc.As a special case, when AGG ∈ {MIN,MAX} and θ is =, we canefficiently check whether all the joinable sets considered satisfy AGG(A, p) ≥λ and AGG(A, p) ≤ λ, using a combination of direct pruning and anti-monotonicity pruning.Finally, for the case when AGG is SUM and θ is =, it is easy to see thatpc is neither monotone nor anti-monotone. However as discussed in [97], pccan be treated as a special constraint in which the evaluation value of atuple t on pc, Evalpc({t}), determines whether or not the anti-monotonicproperty holds. For example, let pc ::= SUM(A, p) = λ and t be a tuple.If t |= p and {t} 6|= pc, then either Evalpc({t}) > λ or Evalpc({t}) < λ.In the first case, the anti-monotonic property still holds. We call this con-ditional anti-monotonic property c-anti-monotone. Table 3.1 summarizesthese properties.AGG\θ ≤ ≥ =MIN monotone direct-pruning monotone after pruningMAX direct-pruning monotone monotone after pruningSUM anti-monotone monotone c-anti-monotoneTable 3.1: Properties of primitive aggregation constraints.Properties like direct-pruning, anti-monotonicity and c-anti-monotonicitycan be used to filter out tuples that do not need to be maintained in buffers.However, this pruning considers each tuple individually. In the next subsec-tion, we develop techniques for determining when tuples are “dominated”by other tuples. This helps in pruning even more tuples.3.4.2 Subsumption-based PruningConsider Example 1 again. After accessing four tuples from Museum andthree tuples from Restaurant (see Figure 3.2), the algorithm cannot stopas it has found only one join result. Furthermore we cannot prune anyseen Museum tuple since each satisfies the constraint. However, it turns outthat we can safely prune t4 (from Museum) because, for any unseen tuple t′from Restaurant, if t′ could join with t4 to become a top-2 result, t′ couldalso join with t1 and t2 without violating the constraint and giving a largerscore.243.4. Deterministic AlgorithmMuseumLocation Cost Ratingaa 13.51555RestaurantLocation Cost Ratingcbb5020104.54.54.5t1:t2:t6:t7:t8:ba10154.54.5 a 5 3t9:b 5 3.5t5: a 10 3t10:t3:t4: {t4}Tuple PrunedConstraintSUM(Cost,true) ≤ 20Figure 3.2: Tuple pruning using aggregation constraints.The above example shows that, in addition to the pruning that is directlyinduced by the properties of the aggregation constraints, we can also prunea tuple by comparing it to other seen tuples from the same relation. As wediscuss in the next section, this pruning can help to reduce the number of in-memory join operations. The key intuition behind pruning a tuple t ∈ R inthis way is the following. Call a join result feasible if it satisfies all applicableaggregation constraints. To prune a seen tuple t ∈ R, we should establishthat whenever t joins with tuples (joinable set) s from other relations toproduce a feasible join result ρ, then there is another seen tuple t′ ∈ R thatjoins with s and produces a feasible result whose overall value is more thanthat of ρ. Whenever this condition holds for a seen tuple t ∈ R, we say t′beats t. If there are k distinct seen tuples t′1, ..., t′k ∈ R such that each of thembeats t, then we call t beaten. Clearly, a seen tuple that is beaten is uselessand can be safely pruned. In the rest of this section, we establish necessaryand sufficient conditions for detecting (and pruning) beaten tuples amongthose seen. Thereto, we need the following notion of tuple domination.Definition 4 (pc-Dominance Relationship) Given two tuples t1, t2 ∈ R, t1 pc-dominates t2, denoted t1 pc t2, if for all s ∈ JS, s.t. R 6∈ Rel(s),{t2} ./ s 6= ∅ and {t2} ./ s |= pc, we have {t1} ./ s 6= ∅ and {t1} ./ s |= pc.Intuitively, a tuple t1 pc-dominates another tuple t2 from the same rela-tion (for some given PAC pc) if for any possible partial joinable set s whichcan join with t2 and satisfy pc, s can also join with t1 without violating pc.Note that the pc-dominance relationship defines a quasi-order over tu-ples from the same relation since it is reflexive and transitive but not anti-symmetric: there may exist two tuples t1 and t2, such that t1 pc t2,t2 pc t1, but t1 6= t2.For the various PACs studied in this chapter, we can characterize pre-cisely when the pc-dominance relationship holds between tuples. The con-ditions depend on the type of the PAC.First of all, consider a monotone PAC pc. Because pc is monotone, givena tuple t, if t |= pc, then the join result of t with any other joinable set253.4. Deterministic Algorithmwill also satisfy pc, as long as t is joinable with s. So we have the followinglemma in the case where pc ::= SUM(A, p) ≥ λ.Lemma 1 Let pc ::= SUM (A, p) ≥ λ be a primitive aggregation constraintand t1, t2 be tuples in R. Then t1 pc t2 iff t1.J = t2.J and either t1 |= pcor t1.A ≥ t2.A.We can prove a similar lemma for the other monotonic aggregation con-straints, where AGG is MIN and θ ∈ {≤,=}, or AGG is MAX andθ ∈ {≥,=}.Lemma 2 Let pc be a primitive aggregation constraint in which AGG isMIN and θ ∈ {≤,=}, or AGG is MAX and θ ∈ {≥,=}. Given two tuplest1 and t2, t1 pc t2 iff t1.J = t2.J and either t1 |= pc or t2 6|= pc.For the anti-monotone constraint pc ::= SUM(A, p) ≤ λ, we can directlyprune any tuple t such that t 6|= pc; however, for tuples that do satisfy pc,we have the following lemma.Lemma 3 Let pc ::= SUM (A, p) ≤ λ be a primitive aggregation constraintand t1, t2 be two tuples such that t1 |= pc and t2 |= pc. Then t1 pc t2 ifft1.J = t2.J and t1.A ≤ t2.A.Similarly, for the c-anti-monotone constraint SUM(A, p) = λ, we havethe following lemma.Lemma 4 Let pc ::= SUM (A, p) = λ be a primitive aggregation con-straint and t1, t2 be two tuples such that t1 |= SUM(A, p) ≤ λ andt2 |= SUM(A, p) ≤ λ. Then t1 pc t2 iff t1.J = t2.J and t1.A = t2.A.Given the pc-dominance relationship for each individual aggregation con-straint, we can now define an overall subsumption relationship between twotuples.Definition 5 (Tuple Subsumption)Let t1, t2 be seen tuples in R andac ::= pc1 ∧ · · · ∧ pcm be an aggregation constraint. We say that t1 sub-sumes t2, denoted t1 t2, if t1.J = t2.J , t1.V ≥ t2.V and, for all pc ∈{pc1, . . . , pcm}, t1 pc t29.9Let rk be the kth join result in RSack . To handle the case where all join results whichhave the same score as rk need to be returned, we can change the condition t1.V ≥ t2.Vin Definition 5 to t1.V > t2.V and report all such results.263.4. Deterministic AlgorithmRecall, the main goal of this section is to recognize and prune beatentuples. The next theorem says how this can be done.Theorem 1 Given an RJAC problem instance I = {R, S, jc, ac, k}, let Tbe the set of seen tuples from relation Ri. Tuple t ∈ T is beaten iff t issubsumed by at least k other tuples in T .3.4.3 Efficient Algorithm for Top-k RJACGiven an instance of RJAC, I = (R, S, jc, ac, k), our algorithm kRJAC (seeAlgorithm 2) follows the standard rank join template [62, 110] as describedin Section 3.3.1. However, it utilizes the pruning techniques developed inSec. 3.4.1 and 3.4.2 to leverage the power of aggregation constraints.Algorithm 2: kRJAC(R, S, jc, ac, k)1 τ ← ∞;2 O ← Join result buffer;3 while |O| < k ∨ v(O.kthResult) < τ do4 i ← ChooseInput();5 ti ← Ri.next();6 if Promising(ti, ac) /* (c-)Anti-monotone pruning */7 if ¬(Prune(ti,HRi,ac,k)) /* Subsumption pruning */8 ConstrainedJoin(ti, HR, ac, O);9 τ ← UpdateBound(ti, HR, ac);Below, we first explore the pruning opportunities in the kRJAC algo-rithm using aggregation constraints (lines 6–8), and then discuss how thepresence of aggregation constraints can affect the accessing strategy (line 4)and the stopping criterion (line 9).Optimizing In-Memory Join ProcessingFirst of all, to leverage the (c-)anti-monotonicity property of the aggregationconstraints, in line 6 of Algorithm 2, whenever a new tuple ti is retrievedfrom relation Ri, we invoke the procedure Promising (see Algorithm 3) whichprunes tuples that do not satisfy the corresponding aggregation constraint.Let HR = {HR1, . . . ,HRn} be the in-memory buffers for all seen tuplesfrom each relation. Similar to previous work [62], in line 8 of Algorithm 2,when a new tuple ti is seen from Ri, we perform an in-memory hash join of273.4. Deterministic Algorithmti with seen tuples from all HRj , j 6= i. The idea of this hash join processis that we break each HRi into hash buckets based on the join attributevalue. Note that for an RJAC problem instance in which no join conditionis present or jc = true, all seen tuples from the same relation will be putinto the same hash bucket.Algorithm 4 shows the pseudo-code for the aggregate-constrained hashjoin process. We first locate all relevant hash buckets from each relation(lines 1–2), then join these buckets together and finally check, for each joinresult found, whether it satisfies the aggregation constraints or not (lines 3–5).Algorithm 3: Promising(ti, ac)1 foreach pc in ac do2 if pc ::= MIN(A, p) ≥ (=)λ return Evalpc({ti}) ≥ λ ;3 else if pc ::= MAX(A, p) ≤ (=)λ return Evalpc({ti}) ≤ λ ;4 else if (pc ::= SUM(A, p) ≤ λ) ∨ (pc ::= SUM(A, p) = λ)5 return Evalpc({ti}) > λ;6 return trueAlgorithm 4: ConstrainedJoin(ti, HR, ac, O)1 for j = 1, . . . , i− 1, i+ 1, . . . , n do2 Bj = LocateHashBuckets(ti.J , HR);3 foreach s ∈ B1 ./ · · · ./ Bi−1 ./ {ti} ./ Bi+1 ./ · · ·Bn do4 if s |= ac and v(s) > v(O.kthResult)5 Replace O.kthResult with s.One important observation about this hash join process is that the worstcase complexity for each iteration is O(|HR1| × · · · × |HRi−1| × |HRi+1| ×· · · × |HRn|), which can result in a huge performance penalty if we leaveall seen tuples in the corresponding buffers. As a result, it is crucial tominimize the number of tuples retained in the HR’s. Next we will show howour subsumption-based pruning, as discussed in Section 3.4.2, can be usedto remove tuples safely from HR.Consider a hash bucket B in HRi and a newly seen tuple ti. We assumeevery tuple in a bucket B has the same join attribute value, so according toTheorem 1, if we find that there are at least k tuples in B which subsume283.4. Deterministic Algorithmti, we no longer need to place ti in HRi. This is because we already haveat least k tuples in B that are at least as good as ti. We call this pruningsubsumption-based pruning (SP). Furthermore, ti does not need to be joinedwith HRj , j 6= i, as shown in line 7 of Algorithm 2. We will show inSec.3.6 that this subsumption-based pruning can significantly improve theperformance of the kRJAC algorithm.Algorithms 5 and 6 give the pseudo-code for the subsumption-basedpruning process. We maintain for each tuple t a count t.scount of thenumber of seen tuples that subsume t. Note that, although in the pseudo-code we invoke the Subsume procedure twice for each tuple t in the currenthash bucket B, the two invocations can in fact be merged into one in theimplementation.Algorithm 5: Prune(ti, HRi, ac, k)1 B ← LocateBucket(ti, HRi);2 foreach t ∈ B do3 if Subsume(t, ti, ac) ti.scount← ti.scount+ 1;4 if Subsume(ti, t, ac)5 t.scount← t.scount+ 1;6 if t.scount ≥ k Remove t from B;7 return ti.scount ≥ k;So given the basic subsumption-based pruning algorithm as presented inAlgorithm 5, a natural question to ask is whether can we prune more tuplesfrom the buffer? The answer is “yes”. Assume we are looking for the top-kjoin results. As we consume more tuples from the underlying relations, thevalue of the stopping threshold τ may continue to decrease, which meanssome join results in the output buffer O may have a value larger than τ .These join results are guaranteed to be among the top-k and can be output.Now suppose that the top k′ results, for some k′ < k, have been foundso far. Then it is clear that we need only look for the next top k − k′results among the remaining tuples. So when applying our subsumption-based pruning, we could revise k to k − k′, i.e., in a hash bucket B of abuffer HRi, if a new tuple ti is subsumed by k − k′ other tuples in HRi,we can safely prune ti from that buffer. We call this optimization adaptivesubsumption-based pruning (ASP).Consider the example of Figure 3.3. After retrieving four tuples in theMuseum relation and three tuples in the Restaurant relation, we find one293.4. Deterministic AlgorithmAlgorithm 6: Subsume(t1, t2, ac)1 if t1.V < t2.V return false;2Dominate ← true;3 foreach pc in ac do4 switch pc do5 case MIN(A, p) ≤ (=)λ and MAX(A, p) ≥ (=)λ6 Dominate = Dominate ∧ (t1 |= pc or t2 6|= pc);7 case SUM(A, p) ≥ λ8 Dominate = Dominate ∧ (t1 |= pc or t1.A ≥ t2.A);9 case SUM(A, p) ≤ λ10 Dominate = Dominate ∧ (t1.A ≥ t2.A);11 case SUM(A, p) = λ /* {t1}, {t2} |= SUM(A, p) ≤ λ */12 Dominate = Dominate ∧ (t1.A = t2.A);13 return Dominate;joinable set {t3, t8} which is guaranteed to be the top-1 result, and we havepruned t4. Using adaptive subsumption-based pruning, we can now alsoprune t2 as it is subsumed by t1.MuseumLocation Cost Ratingaa13.51555RestaurantLocation Cost Ratingcbb5020104.54.54.5t1:t2:t6:t7:t8:ba10154.54.5 a 5 3t9:b 5 3.5t5: a 10 3t10:t3:t4: {t2,t4}Tuple PrunedConstraintSUM(Cost,true) ≤ 20{t3 ,t8}Top-1 resultFigure 3.3: Adaptive subsumption-based pruning.If adaptive subsumption-based pruning is utilized, from the correctnessand completeness proof of Theorem 1, we can derive the following corollary.Corollary 1 At the end of each iteration of the kRJAC algorithm, thenumber of accessed tuples retained in memory for each relation is minimal.In the worst case, the overhead of the rank join algorithm using subsumption-based pruning compared to one which does not perform any pruning (bothalgorithms will stop at the same depth d) will be O(d2 · cdom), where cdomis the time for one subsumption test. This worst case situation will happenwhen no tuples seen from a relation subsume any other tuples. However, as303.5. Probabilistic Algorithmwe show in Section 3.6, this seldom happens, and often d is very small afterour pruning process.Bounding Scheme and Accessing StrategyWhen rank join involves more than two relations, the corner-bounding strat-egy should be replaced by a bounding strategy based on cover sets [54, 110].As described in Section 3.3, for the optimal bounding scheme, to derive thestopping threshold τ , we need to consider each subset R′ of R, and jointhe HRs of relations in R′ with the CRs of relations in R −R′. Becausethe cover set CRi of each relation Ri considers only the value of an unseenitem, data points in CRi can be joined with any other tuple from a tuplebuffer HRj , where i 6= j. So the presence of aggregation constraints doesnot affect the operations in the bounding scheme that are related to thecover set, which means when joining CRs of R−R′, and when joining thejoin results of CRs of R−R′ and join results of HRs of R′, we don’t needto consider aggregation constraints. However, when joining HRs of R′, inorder for the derived bound to be tight, we need to make sure that eachpartial join result satisfies the aggregation constraints.Similarly, for the accessing strategy that decides which relation to accessa tuple from next, because the potential value of each relation is determinedby the bounding scheme as discussed in [54, 62, 110], the existing accessingstrategy can be directly used by taking the modified bounding scheme asdescribed above into account.3.5 Probabilistic AlgorithmOur kRJAC algorithm in Section 3.4 returns the exact top-k results. How-ever, similar to the standard NRA [53] algorithm, this deterministic ap-proach may be conservative in terms of its stopping criterion, which meansthat it still needs to access many tuples even though many of them willbe eventually pruned. Theobald et al. [113] first investigated this prob-lem and proposed a probabilistic NRA algorithm; however, their algorithmand analysis cannot be directly used to handle rank join (with aggregationconstraints). In the rest of this section, we will describe a probabilistic al-gorithm, based on the framework of [113], which accesses far fewer tupleswhile guaranteeing the quality of the results returned.Let I = (R, S, jc, ac, k) be a RJAC problem instance, where R = {R1, . . . , Rn}.The main problem we need to solve is, at any stage of the algorithm, to es-timate the probability that an unseen tuple can achieve a value better than313.5. Probabilistic Algorithmthe value of the kth tuple in the top-k buffer. This probability will clearlydepend on the selectivity of the join condition jc and on the aggregationconstraint ac. We assume the join selectivity of jc over R can be estimatedusing some existing techniques such as adaptive sampling [90]. We denotethe resulting join selectivity by δjc(R), which is defined as the estimatednumber of join results divided by the size of the cartesian product of allrelations in R. Given a set s = {t1, . . . , tn} of n tuples, where ti ∈ Ri, bymaking the uniform distribution assumption, we set the probability Pjc ofs satisfying jc as δjc(R). Similarly, considering each primitive aggregationconstraint pc in ac, we can also estimate the probability Ppc of s satisfy-ing pc as the selectivity of pc over R, denoted as δpc(R). We discuss inSec. 3.5.1 how δpc(R) can be estimated under common data distributionassumptions. The probability Pac of s satisfying ac can then be estimatedas Pac =∏pc∈ac Ppc.Given a set of tuples s = {t1, . . . , tn}, ti ∈ Ri, assuming the join condi-tion and the aggregation constraints are independent, we can estimate theprobability of s satisfying the join condition jc and the aggregation con-straints ac as Pjc∧ac = Pjc × Pac.After some fixed number of iterations of the kRJAC algorithm, let thevalue of the kth best join result in the output buffer O be mink. We canestimate the probability P>mink(Ri) that an unseen tuple ti from Ri canachieve a better result than mink. Suppose the current maximum value foran unseen item in Ri is vi. To estimate P>mink(Ri), similarly to [113], weassume a histogram HVj for the value attribute V of each relation Rj ∈ R isavailable. Then using the histograms we can estimate the number Ni of tuplesets {t1, . . . , ti−1, ti+1, . . . , tn}, tj ∈ Rj s.t. vi +∑j∈{1...n}−{i} v(tj) > mink.We omit the obvious detail. Then the probability that ti can join with anyof these Ni tuple sets to become one of the top-k results can be estimatedas P>mink(Ri) = 1− (1− Pjc∧ac)Ni .Given a user specified threshold , we can stop our kRJAC algorithmwhen ∀i ∈ {1, . . . , n}, P>mink(Ri) ≤ .3.5.1 Estimating Constraint SelectivityGiven a PAC pc ::= AGG(A, p) θ λ, and n relations R1, . . . , Rn, to simplifythe analysis, we assume p = true and that attribute values of differentrelations are independent.Consider an example of the binary RJAC problem: given a set s ={t1, t2}, with t1 ∈ R1, t2 ∈ R2. For the aggregation constraint pc ::=SUM(A, true) ≤ λ, it is clear from Figure 3.4(a) that s can satisfy pc only323.6. Experimentswhen t1.A and t2.A fall into the gray region. We call this gray region thevalid region for pc, denoted V Rpc. Similarly Figure 3.4(b) illustrates thevalid region for the constraint pc ::= MIN(A, true) ≤ λ.t1.At2.A0 11λλt1.At2.A0 11λλ(a) SUM(A,true)≤ λ (b) MIN(A,true)≤ λ Figure 3.4: Selectivity of aggregation constraints.Based on the valid region V Rpc for pc, we can estimate the selectivity ofpc by calculating the probability of a tuple set s falling inside V Rpc.Given a set s = {t1, . . . , tn} of n tuples, ti ∈ Ri, and given a PACpc ::= AGG(A, true) θ λ, if we assume t1.A, . . . , tn.A are n independent ran-dom variables following a uniform distribution, we can calculate the closedformula for the probability P (V Rpc) of t1.A, . . . , tn.A falling inside V Rpc asfollows:• If pc ::= SUM(A, true) ≤ λ: P (V Rpc) = λnn! .• If pc ::= MIN(A, true) ≤ λ: P (V Rpc) = 1− (1− λ)n.These facts are easily verified. Because of symmetry, for pc ::= SUM(A, true) ≥λ and pc ::= MAX(A, true) ≥ λ, the corresponding probabilities are verysimilar to pc ::= SUM(A, true) ≤ λ and pc ::= MIN(A, true) ≤ λ respec-tively: we only need to replace λ by 1 − λ in the corresponding formulas.And for a PAC pc where θ is =, note that the probability is 0 under contin-uous distributions, so in practice, we will set these probabilities to a smallconstant which is estimated by sampling the database.Similarly, if we assume that each ti.A follows other distributions such asexponential distribution, similar formulas can be derived.3.6 ExperimentsIn this section, we study the performance of our proposed algorithms basedon two synthetic datasets. The goals of our experiments are to study: (i)333.6. Experimentsthe performance of various pruning techniques, (ii) the performance of theprobabilistic method, and (iii) the result quality of probabilistic method.All experiments were done on a Intel Core 2 Duo machine with 4GB RAMand 250GB SCSI hard disk. All code is in C++ and compiled using GCC4.2.We call the synthetic datasets we generated the uniform dataset andthe exponential dataset. For both datasets, the join selectivity betweentwo relations is fixed at 0.01 by randomly selecting the join attribute valuefrom a set of 100 predefined values. The value and other attributes are setas follows. For the uniform dataset, the value of each attribute follows auniform distribution within the range [0,1]; for the exponential dataset, thevalue of each attribute follows an exponential distribution with mean 0.5.Note that in order to ensure values from the exponential distribution fallinside the range [0,1], we first uniformly pick 1000000 values from [0,1], andthen resample these values following the exponential distribution. Values ofeach attribute are independently selected.We implemented four algorithms: (a) the post-filtering based rank joinalgorithm (Post Filtering); (b) the deterministic algorithm with subsump-tion based pruning (SubS-Pruning); (c) the deterministic algorithm withadaptive subsumption based pruning (Adaptive SubS-Pruning); (d) the prob-abilistic algorithm with subsumption based pruning.3.6.1 Efficiency StudyWe first compare the algorithms in a binary RJAC setting. As can be seenfrom Figure 3.5, subsumption based pruning works very well for monotonicconstraints. One interesting observation from Figure 3.5(d) is that, adaptivesubsumption based pruning does not prune significantly more tuples thannon-adaptive subsumption based pruning. By inspecting the dataset, wefound out this is because there are k tuples which subsume every othertuple, so the adaptive pruning strategy has no effect in this case.Figure 3.6 shows another example of one aggregation constraint, SUM(A, true) ≤λ, under the selectivity of 10−5. As discussed in previous sections, sucha constraint can result in both anti-monotonicity based pruning and sub-sumption based pruning. However, as can be seen from Figure 3.6, theanti-monotonicity based pruning can be very powerful which, in turn, ren-ders the subsumption based pruning less effective.We also tested our algorithms in settings where we have binary RJACand multiple aggregation constraints (see Figure 3.7). For the case of SUM(A, true) ≥λ and SUM(B, true) ≤ λ and overall selectivity is 10−5 ((a) and (b)), be-343.6. Experiments10 20 30 40 500100020003000kMilliseconds(a) Execution Time Post FilteringSubS−PruningAdaptive SubS−PruningProbabilitic10 20 30 40 50012345 x 104k# of tuples(b) # of Tuples Pruned SubS−PruningAdaptive SubS−PruningProbabilitic10 20 30 40 5000.511.522.5 x 104kMilliseconds(c) Execution Time Post FilteringSubS−PruningAdaptive SubS−PruningProbabilitic10 20 30 40 500510x 104k# of tuples(d) # of Tuples Pruned SubS−PruningAdaptive SubS−PruningProbabiliticFigure 3.5: Uniform dataset: (a), (b) SUM(A, true) ≥ λ, selectivity 10−5 ;(c), (d) MIN(A, true) ≤ λ, selectivity 10−5.cause of the presence of an anti-monotone constraint, many tuples can bepruned so the subsumption based algorithm outperforms the post-filteringalgorithm. However, as can be seen from Figure 3.7(c) and (d), when theselectivity of aggregation constraints is very high and no anti-monotonic ordirect-pruning aggregation constraint is present, the overhead of subsump-tion testing causes the execution time of the subsumption based algorithmsalmost to match that of the post-filtering based algorithm. As future work,we would like to study cost-based optimization techniques which can be usedto help decide which strategy should be used.3.6.2 Probabilistic AlgorithmSimilar to previous work on a probabilistic NRA algorithm, Figures 3.5,3.6 and 3.7 show that our probabilistic algorithm will stop earlier thanthe deterministic and post-filtering based algorithms. In most experiments,the probabilistic algorithm accesses far fewer tuples from the underlyingdatabase than the other algorithms. We note that this property can be veryimportant for scenarios where tuples are retrieved using web services [53],353.6. Experiments10 20 30 40 50050010001500200025003000kMilliseconds(a) Execution Time Post FilteringSubS−PruningAdaptive SubS−PruningProbabilitic10 20 30 40 500246810k# of tuples(b) # of Tuples Pruned SubS−PruningAdaptive SubS−PruningProbabiliticFigure 3.6: Uniform dataset, SUM(A, true) ≤ λ, selectivity 10−5.for example, as a monetary cost might be associated with each access andthe latency of retrieving the next tuple might be very high.In terms of the quality of results returned, as Figure 3.8 shows for binaryRJAC with several different aggregation constraints, the value of the joinresults returned by the probabilistic algorithm at each position k is very closeto the exact solution. The percentage of value difference at each position kis calculated asv(sk)−v(s′k)v(sk), where sk is the exact kth result and s′k is the kthresult returned by the probabilistic algorithm.363.6. Experiments10 20 30 40 5002000400060008000kMilliseconds(a)Execution Time Post FilteringSubS−PruningAdaptive SubS−PruningProbabilitic10 20 30 40 5002004006008001000k# of tuples(b)# of Tuples Pruned SubS−PruningAdaptive SubS−PruningProbabilitic10 20 30 40 50010002000300040005000kMilliseconds(a)Execution Time Post FilteringSubS−PruningAdaptive SubS−PruningProbabilitic10 20 30 40 5000.511.522.53 x 104k# of tuples(b)# of Tuples Pruned SubS−PruningAdaptive SubS−PruningProbabiliticFigure 3.7: Uniform dataset: (a), (b) SUM(A, true) ≥ λ, SUM(B, true) ≤λ, overall selectivity 10−5 ; (c), (d) SUM(A, true) ≥ λ, SUM(B, true) ≥ λ,overall selectivity 10−5.0 2 4 6 8 10−0.01−0.00500.0050.01kPercentage of value differenceResult quality of probabilistic algorithm SUM(A,true) ≥ λSUM(A,true) ≤ λMIN(A,true) ≤ λMAX(A,true)≥ λFigure 3.8: Quality of the probabilistic algorithm.37Chapter 4Composite Recommendationwith Hard Constraints:Flexible Package Schema4.1 IntroductionThe limitation of the solution proposed in the previous chapter is that itcan only handle fixed package schema. In this chapter we consider themore general problem of composite recommendations with flexible packageschema, where each recommendation comprises a set of items which can havedifferent size/schema. And though we are dealing with budget constraintsin the main part of this work, we note that other constraints can be alsohandled efficiently using the prunings proposed in [122], and post-filteringtechniques as discussed in Section 4.5.Consider each item is associated with both a value (rating or score)and a cost, and the user specifies a maximum total cost (budget) for anyrecommended set of items. Our composite recommender system consistsof one or more recommender systems focusing on different domains. Thesecomponent RecSys serve (i.e., recommend) top items in non-increasing orderof their value (explicit or predicted ratings). In addition, our compositesystem has access to information sources (which could be databases or webservices) which provide the cost associated with each item.In our setting, the problem of deciding whether there is a recommen-dation (package) whose value exceeds a given threshold is NP-complete asit models the Knapsack problem [71]. Because of this, and the fact thatwe expect the component recommender systems to provide ratings for largenumbers of items and access to these ratings can be relatively expensive10,we devise approximation algorithms for generating the top-k packages asrecommendations.Other researchers have considered complex or composite recommenda-10Especially when the ratings need to be predicted.384.1. Introductiontions. CARD [32] and FlexRecs [74] are comprehensive frameworks in whichusers can specify their recommendation preferences using relational querylanguages extended with additional features or operators. In contrast, weare concerned with developing efficient algorithms for combining recommen-dations from RecSys that provide only ratings for items. Closer to our workis [19] which is concerned with finding packages of entities, such as holi-day packages, where the entities are associated in some way. However, theirpackages are of fixed size, whereas we allow packages of variable size/schema.CourseRank [99, 101, 102] is a system for providing course recommendationsto students, based on the ratings given to courses by past students and sub-ject to the constraints of degree requirements. While we do not captureall CourseRank constraints, in our framework we have item costs and userbudgets—essential features of the application areas we consider for deploy-ment of our system—which are not captured by CourseRank. Similarly, itemcosts and user budgets are not considered for the problem of team formationin [78].In this chapter, we restrict attention to the problem of recommendingpackages when there is just one component RecSys and no compatibilityconstraint is imposed. The problem remains intractable and still warrantsapproximation algorithms. We discuss in Section 4.5 how to extend ouralgorithms when multiple component RecSys and compatibility constraintsare present.Following are the contributions of this chapter:• We propose a novel architecture for doing composite recommendation(Section 4.2).• We propose a 2-approximation algorithm that is instance optimal [53]with an optimality ratio of one. This means that any other 2-approximationalgorithm, that can only access items in non-increasing order of theirvalue, must access at least as many items as our algorithm (Sec-tion 4.3.4.3.1).• We further develop a greedy algorithm for returning top-k compositerecommendations, which is much more efficient, guaranteed to returna 2-approximation, but is no longer guaranteed to be instance optimal(Section 4.3.4.3.2).• We study how histograms can be leveraged to improve the performanceof the algorithms (Section 4.3.4.3.3).394.2. Architecture and Problem• We subject our algorithms to thorough empirical analysis using tworeal datasets. Our findings confirm that our algorithms always pro-duce recommendations that are 2-approximations, with many of thembeing close to optimal. And the greedy algorithm is always signif-icantly faster than the other two algorithms, while the greedy andinstance optimal algorithms usually access substantially fewer itemsthan the optimal algorithm. Finally, we show how histograms canfurther improve the empirical performance of the proposed algorithms(Section 4.4).4.2 Architecture and Problem4.2.1 System ArchitectureIn a traditional RecSys, users rate items based on their personal experience,and these ratings are used by the system to predict ratings for items notrated by an active user. The predicted ratings can be used to give the usera ranked recommendation (item) list.As shown in Figure 4.1, our composite recommendation system is com-posed of one or more component RecSys and has access to external sourcesthat provide the cost of a given item. An external source can be a localdatabase or a web service. For example, Amazon.com can be consultedfor book prices. In terms of computation, we abstract each RecSys as asystem which serves items in non-increasing order of their value (rating orscore) upon request. In addition, the system includes a compatibility checkermodule, which checks whether a package satisfies compatibility constraints,if any. We assume the compatibility checker consults necessary informationsources in order to verify compatibility.The user interacts with the system by specifying a cost budget, an integerk, and optionally compatibility constraints on packages. The system findsthe top-k packages of items with the highest total value such that eachpackage has a total cost under budget and is compatible.4.2.2 Problem StatementGiven a set N of items and U of users, an active user u ∈ U , and item t ∈ N ,we denote by vu(t) the value of item t for user u. We denote the value asv(t) when the active user is understood. A RecSys predicts v(t) when it isnot available, by using the active user’s past behavior and possibly that ofother similar users. For t ∈ N , we denote by c(t) the cost of item t. Given404.2. Architecture and ProblemFigure 4.1: System Architecturea set of items R ⊂ N , we define c(R) = Σt∈R c(t) and v(R) = Σt∈R v(t).Given a cost budget B, a set of items P ⊂ N is called feasible if c(P ) ≤ B.Definition 6 (Top-k Composite Recommendations) Given an instanceI of a composite recommendation system consisting of one component Rec-Sys and an external information source, a cost budget B and an integer k,find the top-k packages P1, ..., Pk such that each Pi is feasible and among allfeasible packages P1, ..., Pk have the k highest total values, i.e., v(P ) ≤ v(Pi)for all feasible packages P 6∈ {P1, ..., Pk}.When k = 1, the top-k composite recommendation problem (CompRec)can be viewed as a variation of the classical 0/1 knapsack problem [71].Thus, even for the special case of top-1 composite recommendation, thedecision problem “Is there a feasible package R which has value larger thana threshold β?” is NP-complete; and the complexity of the function problemof finding the maximum feasible package R is FPNP-complete [98]. However,unlike the classical knapsack setting, our top-k composite recommendationproblem has the restriction that items can be accessed only in non-increasingorder of their value. Without loss of generality, we assume all items havecost smaller than the cost budget B.Note that ratings of items from the component RecSys are retrievedusing sorted access, while the cost of a given item is obtained via randomaccess. Let cs and cr be the costs associated with these accesses. Then thetotal access cost of processing n items is n × (cs + cr). Notice that cs andcr can be large compared to the cost of in-memory operations: for bothaccesses information needs to be transmitted over the Internet, and for thesorted access, v(t) may need to be computed. So, well-known algorithmsfor knapsack which need to access all items [71] may not be realistic. Thus,414.3. Composite Recommendationsan efficient algorithm for top-k CompRec should minimize the total accesscost, i.e., it should minimize the number of items accessed and yet ensurethe top-k packages are obtained.It can be shown that if we have no background knowledge about thecost distribution of items, in the worst case, we must access all items to findtop-k packages. In order to facilitate the pruning of item accesses, we thusassume that some background information about item costs is precomputedand maintained at the composite RecSys. The background cost information,which we denote generically by BG, can be something as simple as a min-imum item cost cmin (Section 4.3.4.3.1, 4.3.4.3.2) or a histogram collectedfrom the external cost source (Section 4.3.4.3.3). This information can bematerialized in our system and be refreshed regularly by re-querying thecost source.Our composite recommendation problem can be considered as a specialcase of a resource-limited knapsack problem where, in addition to qualityguarantee, the number of items to be accessed should also be minimized.So standard algorithms for knapsack, e.g., exact algorithms [71] and ap-proximation algorithms [61, 118] may not be efficient as they always need toaccess the entire dataset. The only known variation of knapsack which dealswith resource limitation is the Online Knapsack Problem [93]. However, forthis problem, no access constraints are considered, only competitiveness interms of quality is studied. And furthermore, no information about itemscan be inferred, which makes the problem significantly harder and difficultto approximate.4.3 Composite RecommendationsIn this section, we develop several approximation algorithms for top-1 Com-pRec, after which we extend them to handle top-k CompRec.4.3.1 Instance Optimal AlgorithmsAs identified in Section 5.2, top-1 CompRec is a variation of the 0/1 knapsackproblem where the underlying items can be accessed only in non-increasingorder of their value (rating). Because of the huge potential size of the setsof items and the high cost of retrieving item information from the source,it is crucial for an algorithm to find high-quality solutions while minimizingthe number of items accessed. Furthermore, as the 0/1 knapsack problem isNP-complete [71], we need to develop efficient approximation algorithms.424.3. Composite RecommendationsTop-1 Composite RecommendationGiven an instance I of top-1 CompRec, let BG denote the known backgroundcost information and S = {t1, ..., tn} be the set of items which have beenaccessed or seen so far.Let v¯ be the value of the first accessed item. Because items are accessedin non-increasing order of their value, n · v¯ is a trivial upperbound on thevalue that can be achieved by any knapsack solution for S.For each i ∈ {1, ..., n} and v ∈ {1, ..., n · v¯}, let SSi,v denote a subset of{t1, ..., ti} whose total value is exactly v and whose total cost is minimized.Let C(i, v) be the cost of SSi,v (where C(i, v) =∞ if the corresponding SSi,vdoes not exist). Then it is well known from previous work [71, 118] that apseudo-polynomial algorithm can be utilized to find the optimal knapsacksolution for S by first calculating all C(i, v) using the following recursivefunction, and then choosing the maximum value achievable by any subsetSSn,v of which the total cost is bounded by budget B, i.e., max{v | C(n, v) ≤B}.C(i+ 1, v) = (4.1){min{C(i, v), c(ti+1) + C(i, v − v(ti+1))} if v(ti+1) ≤ vC(i, v) otherwiseLet the background cost information BG be given by cmin, the minimumcost of all items, let vmin = mint∈S v(t) be the minimum value of all accesseditems, and let OPT be the true optimal solution to the underlying top-1CompRec instance I. We can find an upperbound V ∗ on the value v(OPT )of the optimal solution using Algorithm 7: MaxValBound.Algorithm 7: MaxValBound(S, C, B, BG)1 V ∗ = b Bcmin c × vmin2 for v ∈ {1, ..., n · v¯} do3 if C(n, v) < B4 V ∗=max{V ∗, v + bB−C(n,v)cmin c ∗ vmin}5 return V ∗Lemma 5 Given S, C, B, BG, we have: (1) the value V ∗ returned byMaxValBound is an upperbound on v(OPT ); (2) Value V ∗ is tight, in that434.3. Composite Recommendationsthere exists a possible unseen item configuration for which V ∗ is achievableby using a subset of accessed items and feasible unseen items11.Proof Consider all valid possible top-1 CompRec instances, and let OPT =Sopt denote the optimal solution with maximum value. We will prove bothclaims in the lemma at once. Consider an unseen item t∗ with cost cminand value vmin. Clearly, t∗ is feasible. We can assume an unlimited supplyof unseen items whose cost and value match those of t∗. We claim Sopt hasno unseen item with cost or value not matching that of t∗. If it does, thatitem can be replaced with t∗ while incurring no more cost and having noless total value. Let S′ = Sopt ∩ S, i.e., the subset of seen items in Sopt.Let v = v(S′). Clearly, 1 ≤ v ≤ n × v¯ and c(S′) ≥ C(n,v). Sopt − Scan contain at most bB−c(S′)cmin c unseen items with cost and value matchingthose of t∗. Algorithm 1 does examine such a solution among the variouscandidates it considers. Hence the bound V ∗ computed by Algorithm 1 hasthe property that V ∗ ≥ v(S′)+bB−c(S′)cmin c×vmin, proving (1). To see (2), sincev(OPT ) = v(Sopt) corresponds to the highest possible value of the optimalpackage over all possible configurations, we have V ∗ ≤ OPT , so V ∗ = OPT ,and V ∗ is achievable by an instance which is composed of S′ and bB−c(S′)cmin cunseen items with cost and value matching those of t∗.Given the upper bound V ∗ on the optimal solution, we next proposea 2-approximation algorithm for top-1 CompRec which is guaranteed tobe instance optimal (see below). The algorithm, InsOpt-CR, is shown asAlgorithm 8. One item is retrieved from the source at each iteration ofthe algorithm (lines 3–4). After accessing this new item, we can use thepseudo-polynomial algorithm to find an optimal solution Ro over the ac-cessed itemset S (line 5). We calculate the upper bound value V ∗ of theoptimal solution using MaxValBound. If v(Ro) ≥ 12 × V ∗, the algorithmterminates; if not, it continues to access the next item (lines 7–8). Thefollowing example shows how InsOpt-CR works.Example 2 Let I = {t1, t2, . . . , tn}, n ≥ 102, be a top-1 CompRec instance,where v(t1) = v(t2) = 101, c(t1) = c(t2) = 100, for i = 3, . . . , 101, v(ti) =c(ti) = 1, and for i = 102, . . . , n, v(ti) = 1 and c(ti) = 0.5. Let B =199. Clearly, BG = cmin = 0.5. After accessing the first 101 items, S ={t1, . . . , t101}, Ro = {t1} ∪ {t3, . . . , t101}, v(Ro) = 200. Because cmin = 0.5and vmin = 1, we can calculate V ∗ = 398 and InsOpt-CR will stop sincev(Ro) ≥ 12 × V ∗.11An unseen item t is feasible iff v(t) ≤ vmin and c(t) ≥ cmin444.3. Composite RecommendationsAlgorithm 8: InsOpt-CR(N , B, BG)1 S ← An empty buffer2 while TRUE do3 t ← N .getNext()4 S.Insert(t)5 (Ro,C) ← OptimalKP(S, B)6 V ∗ = MaxValBound(S,C,B,BG)7 if v(Ro) ≥ 12 × V ∗8 return RoGiven a top-1 CompRec instance I with optimal solution OPT , becauseV ∗ ≥ v(OPT ), if v(Ro) ≥ 12 × V ∗, then v(Ro) ≥ 12 × OPT , so InsOpt-CRreturns a correct 2-approximation of OPT.To analyze the optimality of our proposed algorithm, we utilize the no-tion of instance optimality proposed in [53].Definition 7 Instance Optimality: Let A be a class of algorithms, andlet I be a class of problem instances. Given a non-negative cost measurecost(A, I) of running algorithm A over I, an algorithm A ∈ A is in-stance optimal over A and I if for every A′ ∈ A and every I ∈ I wehave cost(A, I) ≤ c · cost(A′, I) + c′, for constants c and c′. Constant c iscalled the optimality ratio.To prove the instance optimality of InsOpt-CR, we first show the follow-ing.Lemma 6 Given any top-1 CompRec instance I and any 2-approximationalgorithm A with background cost information BG and the same access con-straints as InsOpt-CR, A must read at least as many items as InsOpt-CR.Proof Assume to the contrary that there exists an instance I and a 2-approximation algorithm A that stops earlier than InsOpt-CR on I. LetS = {t1, . . . , tn} be the set of items accessed by A at the time it stops.Assume that, after accessing the items in S, Algorithm 8 has (Ro, C) =OptimalKP(S,B) and upper bound V ∗ as calculated in line 6.Because A stops earlier than InsOpt-CR, the condition that v(Ro) ≥12 × V ∗ must not hold as otherwise InsOpt-CR will also stop, contradictingthe assumption that A stops earlier than InsOpt-CR.454.3. Composite RecommendationsThen it is easy to show that, at the time A stops, v(Ro) must be less than12 × V ∗. For this case, given only S and BG, it is possible to form a top-1CompRec instance I ′ which shares the same prefix S as I, but the optimalsolution for I ′ is V ∗. (For example, if Sopt corresponds to the optimal possi-ble solution calculated by MaxValBound, V ∗ = v(Sopt), and S′ = Sopt∩S, wecan set the next bB−c(S′)cmin c unseen items to have value vmin and cost cmin.)Then even the optimal solution Ro for S is not a 2-approximation for in-stance I ′, which contradicts the fact that A is a 2-approximation algorithm.Theorem 2 Let I be the class of all top-1 CompRec instances, and A be theclass of all possible 2-approximation algorithms that are constrained to accessitems in non-increasing order of their value. Given the same backgroundcost information BG, InsOpt-CR is instance optimal over A and I with anoptimality ratio of one.Proof From Lemma 6, for any top-1 CompRec instance I and any 2-approximation algorithm A with background cost information BG and thesame access constraints as InsOpt-CR, A must read at least as many itemsas InsOpt-CR. According to the definition of instance optimality and theproposed cost model, InsOpt-CR is instance optimal over A and I with anoptimality ratio of one.Top-k Composite RecommendationsIn addition to the best composite recommendation, it is often useful toprovide the user with the top-k composite recommendations, where k isa small constant. In this section, we extend the algorithm proposed inSection 4.3.4.3.1.4.3.1 to one that returns the top-k composite recommen-dations. Similar to the top-1 case, due to the hardness of the underlyingproblem, we seek an efficient approximation algorithm which can give ushigh quality recommendations.Given an instance I of top-k CompRec, assumeRI is the set of all feasiblecomposite recommendations, i.e., RI = {R | R ⊆ N ∧c(R) ≤ B}). FollowingFagin et al. [53] and Kimelfeld et al. [73], we define an α-approximationof the top-k composite recommendations to be any set Rk of min(k, |RI |)composite recommendations, such that, for all R ∈ Rk and R′ ∈ RI\Rk,v(R) ≥ 1α × v(R′).To produce top-k composite recommendations, we will apply Lawler’sprocedure [80] to InsOpt-CR. Lawler’s procedure is a general technique forenumerating the optimal top-k answers to an optimization problem, which464.3. Composite Recommendationsrelies on an efficient algorithm to find the optimal (top-1) solution to theproblem.Algorithm 9: InsOpt-CR-Topk(N , B, BG, k)1 S ← An empty buffer2 Rk ← An empty result buffer3 while TRUE do4 t ← N .getNext()5 S.Insert(t)6 (Ro,C) ← OptimalKP(S, B)7 V ∗ = MaxValBound(S,C,B,BG)8 if v(Ro) ≥ 12 × V ∗9 EnumerateTopk(S, Ro, V ∗, B, Rk, k)10 if |Rk| == k11 return |Rk|InsOpt-CR-Topk in Algorithm 9 is the InsOpt-CR algorithm modifiedusing Lawler’s procedure. As can been seen from the procedure, all weneed to change is that instead of returning the 2-approximation solutionfound in Algorithm 8 (line 8), we enumerate at this point all possible 2-approximation solutions using Lawler’s procedure (line 9). If the numberof 2-approximation solutions is at least k, then we can report the top-kpackages found; otherwise, we continue accessing the next item (line 10–11).Algorithm 10 is an overview of the enumeration algorithm EnumerateTopk.Whenever a 2-approximation is found, we call this routine to enumeratepossible 2-approximation results which can be obtained from the accesseditems. In line 3 of Algorithm 10, EnumerateTopk will directly call theenumeration procedure as proposed in [80].Algorithm 10: EnumerateTopk(S,Ro,V ∗,B,Rk,k)1 Rk.Add(Ro)2 while |Rk| ≤ k do3 Rn = LawlerOptimalNext(S, B)4 if v(Rn) < 12 × V ∗5 return Rk6 Rk.Add(Rn)474.3. Composite RecommendationsIn InsOpt-CR-Topk, the enumeration of all possible 2-approximationsolutions is straightforward. Since we know the upper bound V ∗, we cansimply utilize Lawler’s procedure to enumerate candidate packages whichare under cost budget and have aggregated value of at least half of V ∗.Lemma 7 Given any instance I of top-k CompRec and any 2-approximationalgorithm A with the same background cost information BG and access con-straints as InsOpt-CR-Topk, A must read as many items as InsOpt-CR-Topk.Proof If algorithm A stops earlier than InsOpt-CR-Topk, then at the timeit stops, EnumerateTopk must return a result set whose size is less thank, because otherwise InsOpt-CR-Topk would also stop. Then, similar toLemma 6, it is not possible to find a result set in S which is guaranteed to bea 2-approximation to the optimal top-k composite recommendation, whichmeans by properly selecting unseen items, the result set returned by A canbe made not to be a 2-approximation, contradicting the assumption.Theorem 3 Let I be the class of all top-k CompRec instances, and A bethe class of all possible 2-approximation algorithms that are constrained toaccess items in the non-increasing order of their value. Given the samebackground cost information BG, InsOpt-CR-Topk is instance optimal overA and I with an optimality ratio of one.Proof From Lemma 7, for any instance I of top-k CompRec and any 2-approximation algorithm A with the same background cost information BGand access constraints as InsOpt-CR-Topk, A must read as many itemsas InsOpt-CR-Topk. So according to the definition of instance optimality,InsOpt-CR-Topk is instance optimal over A and I with an optimality ratioof one.4.3.2 Greedy AlgorithmsAlthough the instance optimal algorithms presented above guarantee to re-turn top-k packages that are 2-approximations of the optimal packages, theyrely on an exact algorithm for finding an optimal package using the itemsseen so far. Because this may lead to high computational cost, we proposemore efficient algorithms next. Instead of using an exact algorithm to getthe best package for the currently accessed set of items S, we use a simplegreedy heuristic to form a high quality package RG from S and then testwhether RG is globally a high quality package.484.3. Composite RecommendationsCompared with InsOpt-CR, our greedy solution Greedy-CR for top-1 CompRec needs to replace OptimalKP in InsOpt-CR with GreedyKP,which uses greedy heuristics [71] to find a high quality itemset in polyno-mial time 12, and to change Ro to the greedy solution RG. Furthermore,instead of using the tight upperbound calculated by MaxValBound, we needto use a heuristic upperbound which is calculated by Algorithm 11: Max-HeuristicValBound.Algorithm 11: MaxHeuristicValBound(S, B, BG)1 τ ← vmincmin2 Sort S = {t1, . . . , tn} by value/cost ratio3 m = max{m | v(tm)c(tm) ≥ τ ∧ c(Rm) ≤ B}4 Rm = {t1, . . . , tm}5 if m == n6 V ∗ = v(Rm) + τ ∗ (B − c(Rm))7 else8 V ∗ = v(Rm)+max{τ, vm+1cm+1 } ∗ (B − c(Rm))9 return V ∗It follows from known results about knapsack that, similar to InsOpt-CR,Greedy-CR will always generate a correct 2-approximation to the optimalsolution.However, unlike InsOpt-CR, Greedy-CR is not instance optimal amongall 2-approximation algorithms with the same constraints, as the followingexample shows.Example 3 Let I = {t1, t2, . . . , tn}, n ≥ 102, be a top-1 CompRec instance(as considered in Example 2), where v(t1) = v(t2) = 101, c(t1) = c(t2) =100, for i = 3, . . . , 101, v(ti) = c(ti) = 1, and for i = 102, . . . , n, v(ti) =1 and c(ti) = 0.5. Let B = 199, BG = cmin = 0.5 and approximationratio α = 2. From Example 2, we know that after accessing the first 101items, S = {t1, . . . , t101}, v(Ro) = 200, V ∗ = 398 and InsOpt-CR will stop.However, at this moment RG = {t1}, and v(RG) = 101 < 12 × V ∗. SoGreedy-CR will continue accessing new items and it can be easily verifiedthat Greedy-CR needs to access another 98 items before it stops.12Note that any approximation algorithm for knapsack [71] can be plugged in herewithout affecting the correctness and instance optimality of the resulting algorithm.494.3. Composite RecommendationsWe note that, in practice, cases such as the above may occur rarely. Infact, in our experimental results (Section 4.4) we observed that, on a rangeof datasets, Greedy-CR exhibited a very low running time while achievingsimilar access costs and overall result quality when compared to InsOpt-CR.Similar to Section 4.3.4.3.1.4.3.1, we can easily extend Greedy-CR toGreedy-CR-Topk by using Lawler’s procedure [80] to enumerate all possiblehigh quality packages after one such package is identified. However, unlikeInsOpt-CR-Topk which guarantees instance optimality, here we simply useLawler’s procedure to enumerate all candidate packages using the greedyalgorithm instead of the exact algorithm. Similar to [73], we show in thefollowing theorem that for top-k CompRec, if an α-approximation algorithmis utilized in Lawler’s procedure instead of the exact algorithm which findsthe optimal solution, we get an α-approximation to the top-k compositerecommendations.Theorem 4 Given an instance I of top-k CompRec, any α-approximationalgorithm A for top-1 CompRec can be utilized with Lawler’s procedure togenerate a set Rk of composite recommendations which is an α-approximationto the optimal set of top-k composite recommendations.Proof We prove the above theorem using induction on k. Given α-approximationalgorithm A and instance I, it is clear that the top-1 composite recommen-dation set R1 generated by A is an α-approximation to the optimal top-1composite recommendation. Assume that the top-k composite recommenda-tion set Rk generated by Lawler’s procedure and A is an α-approximationto the optimal top-k composite recommendation, which means ∀R ∈ Rk,∀R′ ∈ RI\Rk, α ·V alue(R) ≥ V alue(R′). Then according to the property ofLawler’s procedure, the k+1st composite recommendation Rk+1 generated isan α-approximation to the optimal composite recommendation from RI\Rk,so ∀R′ ∈ RI\Rk, α · V alue(Rk+1) ≥ V alue(R′). It is easy to verify that∀R ∈ Rk ∪ {Rk+1}, ∀R′ ∈ RI\(Rk ∪ {Rk+1}), α · V alue(R) ≥ V alue(R′),so the top-(k+ 1) composite recommendation set Rk+1 generated by A is anα-approximation to the optimal top-(k + 1) composite recommendation.So the quality of the packages generated by the resulting enumerationprocess can be guaranteed. In this enumeration process, given a candidatepackage, we use the greedy algorithm to get the next candidate packagefor each sub-search space in Lawler’s procedure, and if all of them are notguaranteed to be 2-approximations, the enumeration will stop.However, similar to Greedy-CR, it is obvious that Greedy-CR-Topk isnot instance optimal. We note that, in practice, the difference between the504.3. Composite Recommendationsresults generated by InsOpt-CR-Topk and Greedy-CR-Topk (in terms of theaggregate values of packages generated) may be very small.4.3.3 Histogram-Based OptimizationFor the algorithms proposed in previous sections, we have a very pessimisticassumption regarding background cost information, namely BG = cmin, theglobal minimum cost. However in practice, we often have access to muchbetter statistics about the costs of items, e.g., histograms. In this section,we will demonstrate how histograms can be incorporated to improve theproposed algorithms.As can be observed from Algorithm 8, background cost information isused to determine a tight upperbound V ∗ on the value v(OPT ) of the op-timal solution. A pessimistic background cost information assumption willresult in a loose estimation of V ∗ and thus require more item accesses.Let H be a histogram of the costs of items. H will divide the range ofcosts into |H| non-overlapping buckets b1, . . . , b|H|. For each bucket bi, Hstores the number of items whose cost falls inside the corresponding costrange. In classical histograms used in relational databases, it is often as-sumed that items which fall in the same bucket are uniformly distributedwithin the cost range. However, it can easily be shown that this assumptionmay result in an under-estimation of the upperbound value V ∗, thus causingthe proposed approximation algorithms to be incorrect. In order to solvethis problem and guarantee that the estimated V ∗ is a correct upperboundof v(OPT ), we can simply assume that the cost of each item within a bucketbi is the minimum item cost within bi, denoted bi.cmin.In Algorithm 12, we list the modified procedure for determining theupperbound value V ∗ given the histogram H on the cost of unseen items.The algorithm will utilize a sub-procedure MaxItems(H, c) which, giventhe histogram H and a cost threshold c, returns the maximum number ofitems whose total cost is less than to equal to c.Algorithm 12: MaxValBound-H(S, C, B, H)1 V ∗ = MaxItems(H, B) ∗ vmin2 for v ∈ {1, . . . , n · v¯} do3 if C(n, v) < B4 V ∗=max{V ∗, v +MaxItems(H, B − C(n, v)) ∗ vmin}5 return V ∗514.4. ExperimentsLemma 8 Algorithm MaxValBound-H returns a correct upperbound of v(OPT ).Proof Assume to the contrary that the V ∗ returned by MaxValBound-H isnot a correct upperbound. Hence there must exist a configuration of unseenitems such that there exists a package R for which v(R) > V ∗. Let the setof seen items in R be Rseen, and the set of unseen items in R be Runseen.Then during the execution of MaxValBound-H, according to the algorithmwhich calculates C, we must have considered a C(n, v), such that C(n, v) ≤c(Rseen) and v ≥ v(Rseen). And it is clear that for each unseen item t,its value is less than or equal to vmin and its cost is larger than or equalto the cost as indicated by the corresponding bucket in H. So for C(n, v),the set of unseen items R′unseen considered by MaxValBound-H must havethe property that c(R′unseen) ≤ c(Runseen) and v(R′unseen) ≥ v(Runseen). Sov(R) ≤ V ∗ = v + v(R′unseen), which leads to a contradiction.As we will show in Section 4.4, histogram-based optimization can signif-icantly improve the efficiency of the proposed algorithms.4.4 ExperimentsIn this section, we study the performance of our proposed algorithms basedon both real and synthetic datasets.4.4.1 Experimental Setup and Datasets1st Package 2nd Package 3rd Package 4th Package 5th PackageSum Avg Sum Avg Sum Avg Sum Avg Sum AvgOptimal 427 46.7 426 46.6 425 46.7 424 46.7 423 46.6MovieLens InsOpt-CR-Topk 386 47.5 385 47.4 385 47.3 384 47.2 383 47.2Greedy-CR-Topk 384 47 381 47 380 46.8 379 46.7 379 46.7Optimal 300 50 300 50 300 50 300 50 300 50TripAdvisor InsOpt-CR-Topk 185 50 175 50 165 50 160 50 155 50Greedy-CR-Topk 220 50 210 50 210 50 205 50 205 50Optimal 1092 36.4 1091 36.4 1090 36.3 1090 36.3 1089 36.5Uncorrelated Data InsOpt-CR-Topk 929 43.6 926 43.6 925 43.6 925 43.6 924 43.5Greedy-CR-Topk 945 42.9 939 42.8 938 42.8 936 42.7 931 42.8Optimal 122 5.3 122 5.2 122 5.2 122 5.1 122 5.2Correlated Data InsOpt-CR-Topk 110 6.7 110 6.7 110 6.7 110 6.6 110 6.5Greedy-CR-Topk 110 6.6 110 6.6 109 7.6 109 6.5 109 7.15Table 4.1: Quality Comparison for Different Composite RecommendationAlgorithmsThe goals of our experiments were: (i) evaluate the relative quality ofInst-Opt-CR and Greedy-CR compared to the optimal algorithm, in termsof both the total and average values of the top-k packages returned, and524.4. Experiments0 5 100246KNDCG Score(a) Movie Lens0 5 100246K(b) TripAdvisor0 5 100246K(c) Uncorrelated Data0 5 100246K(d) Correlated DataWorst NDCG ScoreGreedy−CR−TopkInsOpt−CR−TopkWorst NDCG ScoreGreedy−CR−TopkInsOpt−CR−TopkWorst NDCG ScoreGreedy−CR−TopkInsOpt−CR−TopkWorst NDCG ScoreGreedy−CR−TopkInsOpt−CR−TopkFigure 4.2: NDCG Score for Top-k Packages(ii) evaluate the relative efficiency of the algorithms with respect to the num-ber of items accessed and the actual run time. All experiments were done ona Xeon 2.5GHz Quad Core Windows Server 2003 machine with 16GB RAMand a 128GB SCSI hard disk. All code is in Java using JDK/JRE 1.6.We use four datasets in our experiments. The first dataset is from Movie-Lens13. We use the 10 million rating MovieLens dataset which contains 1million ratings for 3900 movies by 6040 users. In our experiments, we usedthe running time of movies, obtained from IMDB14, as cost and we assumeusers are interested in packages of movies where the total running time isunder a given budget.TripAdvisor15 is a well-known website where users can share and exploretravel information. For our experiments, we crawled user rating informationfrom places of interest (POIs) in the 10 most popular cities in the US,excluding POIs which had one or no reviews. The dataset contains 23658ratings for 1393 POIs by 14562 users, so it is very sparse.16 We associatewith each POI in the dataset a cost which is based on log(number of reviews)and scaled to the range of 1 to 50. The intuition behind choosing such a costfunction is that the more popular a POI is (in terms of number of reviews),the more likely it is to be crowded or for the tickets to be expensive. Inpractice, we may also use some existing work like [47] to mine from onlineuser-generated itineraries other cost measures, e.g., average time users spentat each POI, average cost of visiting each POI, etc.For the MovieLens and TripAdvisor datasets, we use a simple memory-based collaborative filtering algorithm [16]17 to generate predicted ratings13http://www.movielens.org (visited on 03/16/2015)14http://www.imdb.com (visited on 03/16/2015)15http://www.tripadvisor.com (visited on 03/16/2015)16Pruning more aggressively rendered it too small.17Our algorithms do not depend on a specific recommendation algorithm; in practice,534.4. Experimentsfor each user. The ratings are scaled and rounded to integers ranging from1 to 50.For the MovieLens dataset, we randomly selected 20 users from the 23594user pool, and the budget for each user was fixed at 500 minutes18. For theTripAdvisor dataset, because of the sparsity of the underlying user ratingmatrix, we selected the 10 most active users as our sample for testing thealgorithms, and set the user cost budget to 50.We also tested our algorithms on synthetic correlated and uncorrelateddatasets. For both datasets, item ratings are randomly chosen from 1 to 50.For the uncorrelated dataset, item costs are also randomly chosen from 1to 50, but for the correlated dataset, the cost of item t is randomly chosenfrom min{1, v(t)−5} to v(t)+5. In both datasets, the total number of itemsis 1000, and the cost budget is set to 50. For all datasets, we assume thebackground cost information BG is simply the global minimum item cost.4.4.2 Quality of Recommended Packages0 5 101001010102104106108KRunning Time (ms)(a) MovieLens0 5 101001010102104106108K(b) TripAdvisor0 5 101001010102104106108K(c) Uncorrelated Data0 5 101001010102104106108K(d) Correlated Data0 5 100500100015002000KAccess Cost (# of Items) (e) MovieLens0 5 100500100015002000K(f) TripAdvisor0 5 100500100015002000K(g) Uncorrelated Data0 5 100500100015002000K(h) Correlated DataOptimalInsOpt−CR−TopkGreedy−CR−TopkOptimalInsOpt−CR−TopkGreedy−CR−TopkOptimalInsOpt−CR−TopkGreedy−CR−TopkOptimalInsOpt−CR−TopkGreedy−CR−TopkOptimalGreedy−CR−TopkInsOpt−CR−TopkOptimalGreedy−CR−TopkInsOpt−CR−TopkOptimalGreedy−CR−TopkInsOpt−CR−TopkOptimalGreedy−CR−TopkInsOpt−CR−TopkFigure 4.3: (a)–(d) Running Time for Different Datasets; (e)–(h) AccessCost for Different DatasetsFor each dataset, Table 4.1 shows the quality of the top-5 compositerecommendations returned by the optimal and approximation algorithms.our framework assumes ratings come from existing recommender systems.18For all datasets, we tested our algorithms under various cost budgets with very similarresults, so other budgets are omitted.544.4. ExperimentsWe use as measures of quality the aggregated value of each package (SUMcolumn) and the average item value of each package (AVG column).It can be verified from Table 4.1 that our approximation algorithms doindeed return top-k composite packages whose value is guaranteed to be a2-approximation of the optimal. Furthermore, from the average itemvalue column, it is clear that our proposed approximation algorithms oftenrecommend packages with high average value, whereas the optimal algorithmoften tries to fill the package with small cost and small value items. So bysacrificing some of these lower quality items, the proposed approximationalgorithms may manage to find high quality packages much more efficiently.To better study the overall quality of returned packages, we also adopt amodified Normalized Discounted Cumulative Gain (NDCG) [66] to measurethe quality of the top-k composite packages returned by the approximationalgorithms against the optimal algorithm. Let Ro = {P o1 , . . . , P ok } be thetop-k packages returned by the optimal algorithm, and Ra = {P a1 , . . . , P ak }be the top-k packages returned by the approximation algorithm. The mod-ified NDCG score is a weighted sum of aggregated package value differenceat each position of the returned top-k list, and is defined as:NDCG(Ro, Ra) =k∑i=1log(1 + v(Poi )−v(Pai )v(P oi ))log(1 + i)The ideal value for the modified NDCG score is 0, where the top-k packagesreturned have exactly the same value as the optimal top-k packages. Theworst possible value for the modified NDCG score is∑ki=1log 2log(1+i) , whereeach package returned has an aggregated value of 0. In Figure 4.2, we showfor the 4 datasets the NDCG score of the top-k packages (k ranging over 1to 10) returned by the instance optimal algorithm and the greedy algorithm.It is clear that, while having a substantial run time advantage, the greedyalgorithm can achieve a very similar overall top-k package quality comparedto the instance optimal algorithm. We also note that both approximationalgorithms have a very small NDCG score.4.4.3 Efficiency StudyThe running times of our algorithms on the 4 datasets are shown in Fig-ure 4.3 (a)–(d), while access costs are shown in Figure 4.3 (e)–(h). ForMovieLens, TripAdvisor and the uncorrelated dataset, it can be seen that onaverage the greedy algorithm Greedy-CR-Topk has excellent performance interms of both running time and access cost. The instance optimal algorithm554.5. DiscussionInsOpt-CR-Topk also has low access cost, but its running time grows veryquickly with k since it needs to solve exactly many instances of knapsack,restricted to the accessed items.As can be seen in Figure 4.3 (h), the only dataset where both the greedyand instance optimal algorithms have a high access cost is the correlateddataset (but notice that the greedy algorithm still has good running time).The reason for this is that, for the correlated dataset, the global minimumcost corresponds only to items which also have the least value. Thus theinformation it provides on the unseen items is very coarse. In practice, onesolution to this might be to use more precise background cost information,such as provided by histograms, for example, as described in Section 4.3.3and evaluated below.In Figure 4.4, we compare the performance of the instance-optimal algo-rithm where the background cost information is simply the minimum costwith that where it is histogram-based. For the histogram-based approach,each histogram has 50 buckets and each bucket is of equal width. As canbe seen from Figure 4.4 (e)–(h), the histogram-based approach consistentlyaccesses fewer or an equal number of tuples compared to the minimum cost-based approach. This further results in savings in the overall running time ofthe algorithm, as can be seen from Figure 4.4 (a)–(d). The only exceptionis on the TripAdvisor dataset, where the histogram-based approach maysometimes be slightly slower than the minimum cost-based approach. Wenote that this is because, for this dataset, both approaches access only a fewtuples. Thus, as indicated by Algorithm 12, our histogram-based approachmay incur a small overhead in counting tuples which may not be necessaryfor this case. Similar savings to those reported in Figure 4.4 can be observedfor the histogram-based greedy algorithm.4.5 DiscussionAs mentioned in Section 4.2, our framework includes the notion of a packagesatisfying compatibility constraints. For example, in trip planning, a usermay require the returned package to contain no more than 3 museums.To capture these constraints in our algorithms, we can define a Booleancompatibility function C over the packages under consideration. Given apackage P , C(P ) is true if and only if all constraints on items in P are satis-fied. We can add a call to C in InsOpt-CR-Topk and Greedy-CR-Topk aftereach candidate package has been found. If the package fails the compatibilitycheck, we just discard it and search for the next candidate package. In terms564.6. Application in Travel Planning0 5 10102103104105 (a) MovieLensKRunning Time (ms) 0 5 10050100150 (b) TripAdvisorK 0 5 100123 x 104(c) Uncorrelated DataK 0 5 1001234 x 104(d) Correlated DataK 0 5 100200400600 (e) MovieLensKAccess Cost (# of Items) 0 5 100246810 (f) TripAdvisorK 0 5 100200400600 (g) Uncorrelated DataK 0 5 108008509009501000 (h) Correlated DataK InsOpt−CR−TopkInsOpt−CR−Topk−Hist InsOpt−CR−TopkInsOpt−CR−Topk−Hist InsOpt−CR−TopkInsOpt−CR−Topk−Hist InsOpt−CR−TopkInsOpt−CR−Topk−HistInsOpt−CR−TopkInsOpt−CR−Topk−Hist InsOpt−CR−TopkInsOpt−CR−Topk−Hist InsOpt−CR−TopkInsOpt−CR−Topk−Hist InsOpt−CR−TopkInsOpt−CR−Topk−HistFigure 4.4: (a)–(d) Running Time Comparison for Different Datasets; (e)–(h) Access Cost Comparison for Different Datasetsof access cost, it can easily be verified that the modified InsOpt-CR-Topkalgorithm is still instance optimal.It is worth noting that the Boolean compatibility function defined hereallows for greater generality than the constraints studied in previous worksuch as [19, 102]. However, for application scenarios where only one specifictype of constraint is considered, e.g., having one item from each of 3 pre-defined categories, more efficient algorithms like Rank Join [54, 123] can beleveraged.Furthermore, although in the previous algorithms we assume there isonly one component recommender system, it is straightforward to combinerecommendation lists from several component recommender systems by cre-ating on-the-fly a “virtual recommendation list”, e.g., select at each iterationthe item which has the maximum value/rating across all recommender sys-tems.4.6 Application in Travel PlanningIn travel planning, users may also have a budget on time that can be spendon a trip, and if this is the case, the underlying problem become a Orien-teering problem [38].Given a set N of POIs, a set U of users, an active user u ∈ U , and aPOI t ∈ N , we denote by vu(t) the value of POI t for user u. We denote thevalue as v(t) when the active user is understood. A RS predicts v(t) when it574.6. Application in Travel Planningis not available, by using the active user’s past behaviour and possibly thatof other similar users. For t ∈ N , we denote by tc(t) the time cost and bymc(t) the monetary cost, of POI t. Given a set of POIs R ⊆ N , we definev(R) = Σt∈R v(t), tc(R) = Σt∈R tc(t) and mc(R) = Σt∈Rmc(t).In addition to the cost associated with each POI, we may also need toconsider the cost spent on traveling to corresponding POIs. For each POIpair (t1, t2), t1, t2 ∈ N , we denote by d(t1, t2) the shortest distance betweent1 and t2. And given a set of POIs R ⊆ N , we define w(R) as the minimumdistance walk which covers all POIs in R, and let tcw(R) and mcw(R) bethe corresponding time and monetary cost for taking w(R) by assuming anaverage speed/cost per unit of distance.Definition 8 Top-k Composite Sequence Recommendations: Givenan instance I of a composite recommendation system consisting of one com-ponent RecSys and an external information source, a cost budget Bt ontime, a cost budget Bm on money and an integer k, find the top-k packagesP1, ..., Pk such that each Pi has mc(Pi)+mcw(Pi) ≤ Bm, tc(Pi)+ tcw(Pi) ≤Bt, and among all feasible packages, P1, ..., Pk have the k highest total values,i.e., v(P ) ≤ v(Pi), 1 ≤ i ≤ k for all feasible packages P 6∈ {P1, ..., Pk}.Note that ratings of POIs from the component RecSys are retrieved usingsorted access, while the cost of a given POI is obtained via random access.Let cs and cr be the costs associated with these accesses. Then the totalaccess cost of processing n POIs is n × (cs + cr). Notice that the numberof POIs in the system is often huge, and cs and cr can be large comparedto the cost of in-memory operations, as often for both accesses, informationmay need to be transmitted through the Internet.In our system, we also assume some background cost information abouteach POI is available. The background cost information can be a histogramH collected from the cost database or just the minimum POI monetary(time) cost mcmin (tcmin). This information can be materialized in oursystem and be refreshed regularly by rescanning the cost database. Wedenote the background cost information as BG.When k = 1, the top-k composite sequence recommendation problem(CompRec-Seq) can be viewed as a variation of the orienteering problem [38]with the restriction that POIs can be accessed only in non-increasing or-der of their value. Similar to the composite recommendation problem forsets as discussed in the previous sections, we assume we have some simplebackground cost information such as the minimum shortest distance dminbetween POIs.584.6. Application in Travel Planning4.6.1 AlgorithmComposite sequence recommendation is closely related to the orienteeringproblem [38], which seeks a maximum value walk on a graph subject to abudget constraint on the cost. To simplify the presentation, we will ignorediscussing cost on each POI (i.e., the so-called node cost) as this can behandled by a reduction to the original orienteering problem [47].Similar to the composite recommendation problem for sets of items, tominimize the number of POIs retrieved while having the result quality guar-anteed, we need to adapt the algorithm template as described in previoussections and iteratively calculate the optimal solution for the accessed POI-set S along with a tight upper bound on the possible true optimal solutionto the underlying composite sequence recommendation problem instance.Given an (exponential-time) exact orienteering solver, we can calculatethe optimal solution for the subgraph G of the original POI graph, inducedby the accessed POI-set S. However, it is more challenging to get a tightupper bound on the value of the true optimal solution for the compositesequence recommendation problem.Let dmin be the background cost information about the minimum dis-tance between POIs, vmin = mint∈S v(t) and t∗ be an “imaginary” unseenPOI which has value vmin. A tight upper bound on the possible true opti-mal solution can be computed by the procedure MaxOriValBound shown inAlgorithm 13.It can be proven that, using procedure MaxOriValBound, we can givea correct instance-optimal α-approximation algorithm CR-Seq-Top1 for thecomposite sequence recommendation problem. Similar to composite set rec-ommendation, to get better practical performance, we can utilize approxi-mation algorithms for the orienteering problem such as [38] instead of exactalgorithms. However, the resulting algorithm will not be instance optimal.Algorithm 13: MaxOriValBound(G,Bt,Bm,BG)1 foreach Edge (v1, v2) ∈ G do2 n∗ = b d(v1,v2)dmin c3 Add to G a path P between v1 and v2 which is composed of n∗ “imaginary”POIs, each a copy of t∗, and d(P ) = d(v1, v2)4 S = OptimalOrienteer(G,Bt,Bm)5 Augment S with “imaginary” POIs t∗ if possible6 return v(S)In addition to the best composite recommendation, it is often useful594.6. Application in Travel Planningto provide the user with the top-k composite recommendations, where kis a small constant. Similar to the top-1 case, due to the hardness of theunderlying problem, we seek an efficient approximation algorithm which cangive us high quality recommendations.Given an instance I of the top-k composite recommendation problem,assume RI is the set of all feasible composite recommendations, i.e., RI ={R |R ⊆ N ∧mc(R) ≤ Bm, tc(R) ≤ Bt} for composite set recommendationand RI = {R |R ⊆ N ∧mc(R) +mcw(R) ≤ Bm, tc(R) + tcw(R) ≤ Bt} forcomposite sequence recommendation. Following Fagin et al. [53], we definean α-approximation of the top-k composite recommendations to be any setRk of min(k, |RI |) composite recommendations, such that, for all R ∈ Rkand R′ ∈ RI\Rk, v(R) ≥ 1α × v(R′).Lawler’s procedure [80] is a general technique for enumerating optimaltop-k answers to an optimization problem, which relies on an efficient al-gorithm to find the optimal solution to the problem. We have proven inSection 4.3.2 that for the top-k composite recommendation problem, if thealgorithm for the optimal solution in Lawler’s procedure is replaced by anα-approximation algorithm, instead of optimal top-k answers, we get an α-approximation to the optimal set of top-k composite recommendations. Soin our system we leverage Lawler’s procedure and CR-Seq-Top1 to producetop-k approximate composite recommendations.60Chapter 5Composite Recommendationwith Soft Constraints5.1 IntroductionStandard item recommender systems (IRS) recommend to users personalizedlists of single items based on previous transactions of the users in the system.This model has become extremely popular because of its wide applicationand success on websites such as Amazon, Last.fm, and Netflix. For instance,even in 2006, 35% of purchases on Amazon originated from recommendeditems [77]. The promise of IRS has also been recognized by the databasecommunity, resulting in a series of recent research projects such as RecStore[82], FlexRecs [74], aimed at pushing IRS inside the database engine.However, as pointed out in an influential survey [16], IRS suffers fromlimited flexibility. For instance, users may want recommendations of morethan one item, e.g., a shopping cart of cell phones, accessories, and dataplans on Amazon, lists of songs on Last.fm, or lists of movies on Netflix. Forrealizing this kind of package recommendation system (PRS) using currentitem recommendation interfaces, a user has to either manually assemble thepackages by exploring recommended items (e.g., Amazon), or browse andsearch packages created by other people (e.g., Last.fm, Netflix). With anexponential number of possible combinations of items, as well as a hugenumber of user-created packages, both approaches for finding packages ofinterest quickly become tedious. For a business, the drawback of not beingable to find packages for users means lost opportunities to sell packages,or lowering user satisfaction by drowning users in oceans of user-createdcontent. Thus it is desirable to have a PRS which can learn users’ interestsand recommend to each user packages that are of potential interest.Given a set of n items, if there is no bound on the size of a package, thereexist 2n− 1 possible packages of items (see Figure 5.1). Consider the utilityU(p) of a package p to a user u. Obviously U(p) depends on the items in the615.1. Introductiont1t2t3item(a)f10.60.40.2(b)p1p2p3package items{t1}{t2}{t3}p4p5p6p7{t1,t2}package items{t2,t3}{t1,t3}{t1,t2,t3}f20.20.40.4Figure 5.1: Examples of packages of items.package p. As discussed in [16, 87, 129], the feature values of each package pare usually aggregations of item feature values. E.g., let f1 in Figure 5.1(a)be the cost of an item. For package p, sum1(p) defines the total cost of theitems in p, which is the cost of the package. Let f2 in Figure 5.1(a) be therating of an item. Then avg2(p) defines the average rating of the items in p,i.e., the average quality of the package. In Figure 5.1(b), sum1(p4) = 1 andavg2(p4) = 0.3. For instance, when purchasing a package of books or CDsfrom Amazon, users may want the average rating of items in the package tobe as high as possible, and the total cost of items to be as small as possible.So U(p) = g(sum1(p), avg2(p)), where function g is increasing in avg2(p), anddecreasing in sum1(p). Similar examples can also be found when reasoningabout the utility of packages on Last.fm and Netflix, where the cost of anitem can be the price of a song/movie, and the rating can take the form ofany combinations of the average ratings, the number of listens, the numberof likes, and the number of purchases of a song/movie.Given this general framework of characterizing features and utilities ofpackages, one intuitive way of recommending packages is to present to theuser all skyline packages [87, 129], i.e., packages which cannot be dominatedby another package on every feature. In the above example, a package p1 isa skyline package if there does not exist another package whose total costis lower and average rating is higher. However, as shown empirically in[87, 129], the number of skyline packages can be in the hundreds or eventhousands for a reasonably-sized dataset. So presenting all of these packagesto a user is impractical.Another way to recommend packages is to define hard constraints onsome features, and optimize the remaining features in the form of a utilityfunction (e.g., see [120, 121]). For the above example, we could require thetotal cost of a package to be at most $500, and then find packages whichsatisfy this constraint and maximize the average ratings of items in thepackage. However, this approach also has the following practical limitations.Firstly, users often only have a rough idea of what they want in a desirable625.1. Introductionpackage. E.g., when summing costs of all items in a package, they mayonly say “smaller is better”. Thus hard constraints on a feature may resulteither in sub-optimal packages when the budget is set too low, or a hugenumber of candidate packages when the budget is set too high. Secondly, theimportance of each feature specified by the user is usually unknown. E.g.,for some users, the monetary budget may not be that important and theycan afford to pursue a high quality package while sacrificing a “reasonable”amount of money; whereas other users may be very sensitive to the totalcost of the package, and may have limited flexibility in terms of monetarybudget. We note that a PRS may be informed about the relative importanceof the different criteria by the user; however, it is not realistic to expect auser to know, e.g., that they are 0.8 interested in the overall cost, and 0.2interested in the overall quality of a package!Instead, following recent work on multi-dimensional ranking of items[27, 34, 63], we take a quantitative approach to ranking packages based onmultiple criteria. Specifically, we consider that for each user u, there is anintuitive “implicit” linear utility function U , which captures u’s preferenceor trade-off among different features for choosing a desirable package. Byleveraging this utility function, we can easily rank all packages, and presentthe best ones to the user. E.g., in the above example, for a user u who hasequal preference on the cost and average item rating of a package, we canuse U(p) = −0.5sum1(p) + 0.5avg2(p), where a negative weight is used toindicate that smaller cost is better.This approach captures all the criteria of the package ranking problemin a single score function, thus it avoids the first issue regarding hard con-straints. However, as mentioned in the second issue, it is unrealistic toassume that the user will define the utility function exactly for the system.Thus, similar to recent work such as [27, 34], we will assume the weights ofthe utility function are hidden, and follow a preference elicitation frameworkwhich explores and exploits a user’s preferences based on feedback received,and learns the hidden weights of the utility function over time.Preference elicitation (PE) has been studied extensively in the ArtificialIntelligence community [27, 34]. The general idea of PE is to capture users’preferences using a utility function, and then learn the parameters of thisutility function through feedback w.r.t. certain elicitation queries. Most ofthis work relies on a specific type of query, called a gambling query. Thoughusing gambling queries is well founded in terms of decision theory, to dateit has only been applied to applications with extremely small domains. Alsothe form of the gambling query requires that the user be explicitly asked thisquery by the system through protocols such as a user survey. This limitation635.1. Introductionmakes it undesirable for deployment in a recommender system where userfeedback either needs to be very simple such as “like”s or ratings, or to betaken implicitly, e.g., from item click-throughs on web sites.In this chapter, we propose to use simple package comparison as theelicitation query. Users are presented with a list of recommended packageswhenever they login to the system. These packages are composed of (i) toppackages, w.r.t. the current knowledge of the user’s utility function, selectedaccording to a chosen ranking semantics as discussed in Section 5.2.3, and(ii) a set of random packages, which are used to explore uncertainty in users’preferences. Users’ interaction with the recommended packages are loggedas implicit signals to the system, showing that they are more interestedin the clicked package than the other packages (noise in user feedback isdiscussed in Section 5.6). A user can choose to adopt any of the presentedpackages, and once they do so, their feedback is extracted implicitly, withoutcausing any disruption in their normal interaction with the RS. This meansthe proposed framework can be cleanly integrated into existing services tocapture and update users’ preferences, without any abrupt interruption forthe user using the service. This is in contrast with standard PE methodswhich require interactive elicitation and a dramatic change to the servicework flow.Specifically, our proposed PRS assumes each user has an associated util-ity function U which is parameterized by a weight vector w, and the un-certainty of U is captured by a distribution Pw over the space of possibleweight vectors w. We assume that the prior of Pw is a mixture of Gaus-sians following [34], as mixture of Gaussians can approximate any arbitraryprobability density function. Given Pw, our PRS can directly leverage itto present the user with a small number of recommended packages, andrecord the user’s interaction with the packages. However, a major challengeis that the posterior of Pw given user feedback has no closed form solution,as we shall see. To circumvent this, we propose a sampling-based frameworkwhich obviates the need for a posterior. Instead, package preferences result-ing from user feedback can be translated into constraints on the samplesfrom Pw. However, this raises the question of how we can obtain samplessatisfying the constraints as efficiently as possible. A related question iswhether previously obtained samples can be maintained against new userfeedback. We discuss our solution to these challenges in Section 5.3.Finally, we note that following the Bayesian uncertainty-based frame-work, the posterior distribution of Pw at any time captures the current op-timal representation of a user’s preferences over packages w.r.t. all observedfeedback.645.2. Problem SettingContributions of this chapter is listed as follows:• We propose a PRS which captures user preferences using a linear util-ity function, but avoids the limitations of skyline packages and hardconstraint-based systems;• To solve the problem of determining parameters for the utility func-tion, we leverage a non-intrusive Bayesian-based PE framework byassuming a prior distribution Pw over parameter w;• The posterior of Pw given user feedback has no closed form solutionin general, so we propose different constrained sampling strategies tosolve this problem. We show that an approach based on simple re-jection sampling may waste many samples, resulting in poor overallperformance, whereas more sophisticated sampling strategies such asimportance sampling and MCMC-based sampling make better use ofthe feedback, and are more efficient.• Given the utility function U and the uncertainty which is capturedby Pw, we discuss how top-k packages can be generated w.r.t. ourcurrent knowledge of the user’s preferences, following different rankingsemantics inspired by recent work from different communities.• Finally, we address the problem of how to maintain the set of samplesgenerated when new feedback is received.The rest of the chapter is organized as follows. We define our packagerecommendation problem in Section 5.2, with the sampling-based solutiondiscussed in Section 5.3. Our algorithm for finding the best packages basedon a set of sample weight vectors is presented in Section 5.4. Experimentalresults demonstrating the efficiency and effectiveness of various samplingmethods and ranking algorithms are reported in Section 6.5.5.2 Problem Setting5.2.1 Package ProfilesWe assume that we are given a set T of n items, each item being describedby a set of m features {f1, . . . , fm}. Each item t ∈ T can be represented asan m-dimensional vector ~t = (t1, . . . , tm), where ti denotes the value of itemt on feature fi. For simplicity, when no ambiguity arises, we use t to denoteboth an item and its corresponding feature vector ~t. Also, without loss655.2. Problem Settingof generality, we assume all feature values are non-negative real numbers.In practice, different items can be associated with different feature sets, sosome feature values for an item t might be null.As mentioned in the introduction, users’ preferences over packages areusually based on aggregations over feature values of items in a package. E.g.,the sum of the costs of items defines the overall cost of a package, while theaverage of the ratings of items defines the overall quality of a package. Thuswe define an aggregate feature profile (or simply profile) of a package asfollows.Definition 9 An aggregate feature profile (or profile) is defined as V =(A1, . . . ,Am), where each Ai corresponds to feature fi, 1 ≤ i ≤ m, and isone of the aggregation functions min, max, sum, avg or null, where nullmeans that the corresponding feature fi should be ignored.Note that we simplify the presentation by assuming one aggregation perfeature, but our proposed algorithms can be easily extended to handle morethan one aggregation per feature. Given a package p and a profile V , wedefine the feature value vector ~p of p w.r.t. V as ~p = (A1(p), . . . ,Am(p)),where each Ai(p) is the aggregate value of items in p w.r.t. feature fi. Fol-lowing the usual semantics for evaluating aggregate functions, for min, maxand sum we have Ai(p) = Ai({ti | t ∈ p ∧ ti 6= null}), and for avg we haveavgi(p) = sumi({ti | t ∈ p ∧ ti 6= null})/|p|. Similar to the feature valuevector of an item, when there is no ambiguity, we simply use p to denoteboth the package p and its corresponding feature value vector ~p. Further-more, we denote each feature value Ai(p) of package p by pi when profile Vis clear from the context.Note that given a fixed item set T , it is trivial to calculate the maximumaggregate value for a feature that can be achieved by any package. E.g.,for avg1(p), the maximum average value on f1 that can be achieved by anypackage is simply the maximum f1 value of all the items. So we assumein the following that each individual aggregate feature value is normalizedinto [0, 1] using the maximum possible aggregate value of the correspondingfeature.5.2.2 Package Utility and Preference ElicitationIntuitively, the utility of a package p for a user depends on its feature vectorand we wish to learn this utility. The space of all mappings between possibleaggregate feature values and utility values is uncountable, making this taskchallenging. Fortunately, most preferences exhibit an internal structure that665.2. Problem Settingcan be used to express the utility concisely, e.g., an additive utility function iscommonly assumed in practice [70]. In this work, for a package p and a givenprofile V , we assume the utility of p can be specified using an additive utilityfunction U , which uses a weighted linear combination of the corresponding(aggregate) feature values in p.U(p) = w1p1 + · · ·+ wmpm (5.1)For simplicity, we use w to denote the weight vector (w1, . . . , wm). With-out loss of generality, we assume each parameter wi falls in the range [−1, 1],where a positive (negative) wi means a larger (resp., smaller) value is pre-ferred for the corresponding feature. Note that we can transform all negativewi’s into their absolute value by setting p′i = 1 − pi on the correspondingfeature for all packages.A framework based on a utility function essentially defines a total orderover all packages, where similar to previous works such as [112], we assumeties in utility score are resolved using a deterministic tie-breaker such as theID of a package. This differentiates the approach from that of [87, 129] whichaim to return all skyline packages, the number of which can be prohibitivelylarge, as previously noted.Despite its intuitive appeal, there are two major challenges in adoptingthe utility-based framework for PRS in practice. First, users are usuallynot able to specify (or even know beforehand) the exact weights wi of theutility function U . Thus, we must model the uncertainty in U , and elicituser’s preferences by means of interactions. Second, unlike [87] and [129]which consider packages of fixed size, we allow package size to be flexiblein our framework. We believe this is natural. E.g., given a system-definedmaximum package size φ of say 20, we consider all possible package sizesranging from 1 to 20. Efficient determination of packages of flexible size thatmaximize a user’s utility under partial knowledge about the utility functionfrom elicited preferences is far more challenging than finding packages of agiven fixed size.One popular way of characterizing uncertainty in U is through Bayesianuncertainty [27, 34]19, in which for each user, we assume the exact value ofthe weight vector w is not known, but w can be described by a probabilitydistribution Pw. We assume w follows a mixture of Gaussians: indeed, ithas been shown that a mixture of Gaussians can approximate any arbitraryprobability density function [22].19The other possibility is strict uncertainty, which requires a set of possible utilityfunctions (down to the weights) to be known, which is more restrictive.675.2. Problem SettingWhile Pw can be initialized with a system-defined default distribution,in the long run Pw can be learned by leveraging the feedback provided bythe user. In this work, we assume user feedback is in the form of selectingone preferred package from a set of packages presented to them. This formof feedback is popularly known as example critiquing in conjoint analysis[115] and preference elicitation [89].For a given user u, let the feedback from u preferring package p1 topackage p2 be denoted by p1 p2. This feedback can be leveraged toupdate the posterior of Pw through Bayes’ rule in Equation 5.2, whereP(p1 p2 | w) defines the likelihood of p1 p2 given w. Note that since eachspecific w defines a total order over all packages, the value of P(p1 p2 | w)is either one or zero. We tentatively assume that every user feedback isconsistent, in that the provided preferences correspond to a partial order,and discuss in Section 5.6 how this assumption can be relaxed.Pw(w | p1 p2) =P(p1 p2 | w)Pw(w)∫w P(p1 p2 | w)Pw(w)dw(5.2)However, Gaussian mixtures are not closed under this kind of update[27], meaning we cannot obtain a closed-form solution for the posterior aspresented in the above equation. One popular way to deal with such asituation is to force the posterior to again be a mixture of Gaussians, andthus the posterior can be learned by refitting a Gaussian mixture throughalgorithms such as expectation maximization (EM) [22]. However, the costof refitting through EM is extremely high, so we take a different approachof representing the posterior by maintaining both the prior distribution andthe set of feedback preferences received. The details of our proposal aredescribed in Section 5.3.1.5.2.3 Presenting PackagesWhile the preference elicitation frameworks discussed in the previous sectioncan be exploited to update the knowledge of Pw for any specific user, thereis still a remaining question of how to select and present packages to a userin order to get feedback.In general, given the uncertainty in the utility function, packages pre-sented to the user serve as a way to explore and exploit users’ preferencessimultaneously. I.e., on the one hand, we want to exploit our current knowl-edge about a user’s preferences and try to present to them the best packagespossible according to the current Pw. On the other hand, we want to ex-plore the uncertainty in the user’s preferences, and present packages to them685.2. Problem Settingutility p10.35w(a)prob.0.30.40.3values(0.5,0.1)(0.1,0.5)(0.1,0.1)(c)(b)profile(sum1,avg2)p2 p3 p4 p5 p60.3 0.2 0.575 0.4 0.4750.31 0.54 0.52 0.475 0.56 0.4550.11 0.14 0.12 0.175 0.16 0.155top-2 packagesp4,p6(d)w1w2w3p5,p2p4,p5w1w2w3w1w2w3Figure 5.2: Examples of different ranking semantics.which might not be considered by our current knowledge about the user’spreferences. These packages serve the purpose of correcting bias introducedfrom the initial distribution of Pw and combating mistakes and noise fromuser feedback. In this work, we follow a simple but promising way of pre-senting packages to the user by mixing current best packages along withrandom packages, so that the current best packages can be used to exploitour current knowledge about the user, and the random packages can be usedto explore the user’s preferences.While it is straightforward to pick a random package, it is challengingto pick the best packages, as there is no universally accepted semantics onhow packages should be ranked given the uncertainty in the utility function.Instead of committing to a specific package ranking semantics, we considervarious alternative semantics, and discuss how the different semantics canbe neatly integrated into our PRS framework.The first ranking semantics we consider is based on expectation (EXP),which has been adopted as the most popular semantics for ranking items inpreference elicitation papers in the artificial intelligence community [27, 34].In the following, by a package space P , we mean the set of all possiblepackages formed using items from T and having size no larger than φ (themaximum package size).Definition 10 (EXP) Given a package space P and probability distribu-tion Pw over weight vectors w, find the set of top-k packages Pk w.r.t. ex-pected utility value, i.e. ∀p ∈ Pk, ∀p′ ∈ P\Pk, Ew(w · p) ≥ Ew(w · p′).Example 4 Consider the example in Figure 5.1. Assuming the maximumpackage size is 2, the package space P is given by {p1, . . . , p6}. If theprofile under consideration is (sum1, avg2), then the maximum value fora size-2 package on feature 1 is 1, and the maximum value of a size-2package on feature 2 is 0.4. We can normalize packages’ feature valuesusing these two maximum values. E.g., for package p1 in Figure 5.1(b),695.2. Problem Settingsum1(p1) = 0.6, avg2(p1) = 0.2, so the normalized feature value vector forp1 is (0.6/1, 0.2/0.4) = (0.6, 0.5). To simplify the presentation, we assumein Figure 5.2(a) that there are only three weight vectors, w1, w2 and w3,under consideration, the probability for which is given in the third column.Given the weight vector information, we can easily calculate the utility ofeach package under each weight vector, as shown in Figure 5.2(c). E.g.,the utility of package p1 under w1 is 0.6 × 0.5 + 0.5 × 0.1 = 0.35. Theexpected utility value for each package can be calculated accordingly, usingthe probability of each weight vector. E.g., the expected utility for p1 is0.35 × 0.3 + 0.31 × 0.4 + 0.11 × 0.3 = 0.262. For this example, it is notdifficult to verify that p4 has the largest expected utility, followed by p5. The second ranking semantics we consider is based on the probabilityof a package being in the top-σ position under different parameter settings(TKP). This is inspired by recent work on learning to rank in the machinelearning community [33]. Let P(p | w) = {p′ | p′ ∈ P,w · p′ > w · p}denote the set of packages in P which have utility larger than package p,given a fixed w. Let W denote the set of weight vectors w under which apackage p is dominated by σ or fewer other packages, i.e., |P(p | w)| ≤ σ.Since the utility function is convex, we can readily show that ∀ w1, w2,w1 6= w2, if w1 · p′ > w1 · p, and w2 · p′ > w2 · p, then for any α ∈ [0, 1],(αw1 + (1 − α)w2) · p′ > (αw1 + (1 − α)w2) · p. Thus we can prove thatW¬ := {w | σ < |P(p | w)|} forms a continuous and convex region, andW is also continuous. So we define the probability of p ∈ P being rankedamong the top-σ packages as P(p | Pw, σ) =∫w∈WPw(w)dw.Definition 11 (TKP) Given a package space P and a probability distribu-tion Pw over weight vectors w, find the top-k packages Pk w.r.t. the prob-ability of being ranked in the top-σ positions, i.e., ∀p ∈ Pk, ∀p′ ∈ P\Pk,P(p | Pw, σ) ≥ P(p′ | Pw, σ).Example 5 In Figure 5.2(d), we show the top-2 package list for each weightvector. We can calculate that the probability of p5 being in a top-2 packagelist is 0.4 + 0.3 = 0.7. Package p5 has the largest probability of all candidatepackages, followed by p4 for which the probability is 0.6. The third and fourth ranking semantics we consider are the most prob-able ordering (MPO) and the optimal rank aggregation (ORA), which havebeen discussed in recent work on sensitivity analysis of querying top-k itemsunder uncertainty [112]. We note that unlike EXP and TKP which repre-sent the desirability of each individual package independently, adapted to705.2. Problem Settingour setting, MPO and ORA represent the desirability of the top-k packagelist Pk as a whole.For MPO, given a fixed w, let I(Pk | w) be an indicator function whichdenotes whether Pk is the top-k package under w, i.e., I(Pk | w) = 1 if@p′ ∈ P\Pk, w · p′ > w · p, for all p ∈ Pk; and I(Pk | w) = 0 otherwise. LetWPk denote the set of weight vectors w under which I(Pk | w) = 1. Similarto TKP, we can show WPk forms a continuous region, so the probability of Pkbeing the top-k package can be defined as Po(Pk | Pw) =∫w∈WPkPw(w)dw.Definition 12 (MPO) Given a package space P and a probability distri-bution Pw over weight vectors w, find the top-k packages Pk w.r.t. the mostprobable ordering, i.e., ∀P ′k ⊆ P , |P ′k| ≤ k, P ′k 6= Pk, Po(Pk | Pw) ≥ Po(P ′k |Pw).Example 6 In Figure 5.2(d), we can directly see the probability of eachtop-2 package list by referring to the probability of the corresponding weightvector. Clearly, the best top-2 package list under MPO is p5, p2. Lastly, we consider ORA. The original definition of ORA for item rankingis based on possible rankings of all items in the database [112], which isobviously undesirable in our case since the package space is exponential.Thus we adapt ORA in line with recent work on comparing top-k lists [52].Let D(Pk, P ′k) be the Kendall tau distance with penalty parameter θ be-tween two top-k lists Pk and P ′k (similar development can be done usingSpearman’s footrule) [52], which counts the number of pairwise disagree-ments in the relative order of packages in the two top-k package lists. Fortwo distinct packages p1, p2 ∈ Pk ∪ P ′k, p1 6= p2, consider the following fourcases: (1) p1, p2 ∈ Pk and p1, p2 ∈ P ′k. If the utility value order between p1and p2 is different in the two lists, we set Dp1,p2(Pk, P′k) = 1, otherwise weset Dp1,p2(Pk, P′k) = 0; (2) Both packages appear in one list, and only onepackage appears in the other list. W.l.o.g., assume p1, p2 ∈ Pk, p1 ∈ P ′k. Weset Dp1,p2(Pk, P′k) = 1 if p2 is ranked higher than p1 in Pk, otherwise we setDp1,p2(Pk, P′k) = 0; (3) If one package appears in Pk, while the other packageappears in P ′k, then we set Dp1,p2(Pk, P′k) = 1; (4) If both packages appearin one list and neither appears in the other list, we set Dp1,p2(Pk, P′k) = θ.D(Pk, P ′k) can be defined as follows.D(Pk, P′k) =∑p1 6=p2,p1,p2∈Pk∪P ′kDp1,p2(Pk, P′k) (5.3)Given D(Pk, P ′k), ORA tries to find the “centroid” top-k package listwhich minimizes its distance to all possible top-k package lists givenD(Pk, P ′k)715.3. A Sampling-based Frameworkand the probability Po(P ′k | Pw) of P ′k being ranked as the actual top-k pack-age, which is the same as in MPO.Definition 13 (ORA) Given a package space P and a probability distribu-tion Pw over weight vectors w, find the top-k packages Pk w.r.t. the Kendalltau distance with penalty parameter θ, i.e., Pk = arg minPk∑P ′kD(Pk, P ′k) ·Po(P ′k | Pw).Example 7 In the example in Figure 5.2, we enumerate all possible 2-package lists, finding the distance between each of these and each top-2 pack-age list. E.g., for (p4, p6) and (p4, p5), the distance is 1, since the two listsdo not agree on the order between p5 and p6. Then we calculate the overallexpected distance between each 2-package list and each top-2 package list ac-cording to the definition of ORA. In our example, the best top-2 package listunder ORA is (p4, p5) Its distance to the three top-2 lists (p4, p6), (p5, p2),and (p4, p5) is 1, 2, and 0 respectively, and its expected distance to thesethree top-2 lists is 1× 0.3 + 2× 0.4 + 0× 0.3 = 1.1. In summary, while different ranking semantics might lead to the sametop-k packages (e.g., in our example, p4, p5 are the top-2 packages for bothEXP and ORA), or they might lead to very different top-2 packages (e.g., inour example, the top-2 packages for MPO are p5, p2, while the top-2 packagesfor TKP are p5, p4). We note that these different ranking semantics havebeen successfully adopted in different communities, and as we shall see inSection 5.3, our proposed PE framework can be easily adapted to leverageany of these different ranking semantics.5.3 A Sampling-based FrameworkTo accommodate the preference elicitation framework and various rankingsemantics for selecting packages for recommendation, we propose to usea sampling-based framework for PRS. Unlike the geometric approach pro-posed in previous papers such as [112], a sampling-based solution can beeasily adapted to handle cases with higher dimensionality, as we will showempirically in Section 6.5. We first discuss simple rejection sampling inSection 5.3.1. We then consider more sophisticated sampling techniques inSection 5.3.2. In Section 6.4.2, we discuss how to optimize the constraintviolation checking process for sample generation. Finally in Section 5.3.4,we discuss how previously generated samples can be reused given newlyreceived user feedback, i.e., we discuss sample maintenance.725.3. A Sampling-based Framework5.3.1 Rejection SamplingGiven the distribution Pw over w, an intuitive solution for finding the bestpackages under Pw is first to sample the weight vectors w according to Pw,and then for every w sampled, try to find the best package under w. Thebest package results obtained from each sampled w can be aggregated forestimating the final list of best packages. The required aggregation logicdepends on the ranking semantics and the details will be discussed in Sec-tion 5.4. This approach is intuitive: w’s are sampled from Pw, and packageswhich are ranked higher under these w’s that have a higher probability arelikely to be given a greater weight.As discussed in Section 5.2.2, given current recommendations to theuser and the feedback received, we need to constantly refit the distributionPw so that it reflects the updated user preferences. However refitting theGaussian mixture Pw, say using the EM algorithm [27], after every receivedfeedback using Equation (5.2) can be extremely time consuming. So a na¨ıveway of performing refit-and-sample may be inefficient. Thus we consider analternative approach of maintaining both the prior distribution Pw and allfeedback without refitting the Gaussian mixture.The key idea is that every feedback p1 p2 rules out weight vectors wunder which p1 p2 is not true. For those w’s which do satisfy p1 p2, thefeedback alone does not change their relative order with respect to Pw, i.e.,for w1, w2 which both satisfy p1 p2, if Pw(w1) > Pw(w2), without anyfurther information, we have Pw(w1 | p1 p2) > Pw(w2 | p1 p2).So this means that we can use rejection sampling to sample directly fromthe posterior. I.e., we can sample a random w from the current Pw, and ifw violates any user feedback, we reject this sample. Otherwise, the sampleis accepted. Clearly the rejection sampling method will only keep sampleswhich conform to the feedback received from the user, and as shown above,the relative order of probabilities of two weight vectors being sampled stillconforms to their original relative order following the distribution Pw.5.3.2 Feedback-aware SamplingThe simple rejection sampling scheme proposed above may work well whenthe amount of feedback is small. However, as the amount of feedback grows,the cost of this approach increases, as samples become more likely to berejected. Thus a better sampling scheme should be “aware” of the feedbackreceived, and try to avoid generating invalid samples, i.e., those that violateany provided feedback constraint.735.3. A Sampling-based FrameworkRecall that user feedback produces a set of pairwise preferences of theform p1 p2, where p1 and p2 are packages. Given a set Sρ of thesepreferences, it can be shown that the set of valid weight vectors w, i.e.,those that satisfy all feedback preferences, has the following useful property.Lemma 9 The set of valid weight vectors which satisfy a set of preferencesSρ forms a convex set.Proof By definition, for any w1, w2 which satisfy Sρ, ∀ρ := p1 p2 ∈ Sρ,w1 · p1 ≥ w1 · p2 and w2 · p1 ≥ w2 · p2. Then ∀α ∈ [0, 1], αw1 · p1 ≥ αw1 · p2and (1− α)w2 · p1 ≥ (1− α)w2 · p2. Combining these two inequalities showsthat any convex combination of w1 and w2 also forms a valid w.So valid weight vectors form a continuous and convex region. By exploit-ing this property, we can leverage different sampling methods which can biassamples more towards those which are inside the valid region.Importance SamplingThe general idea of importance sampling is that, instead of sampling from acomplex probability distribution (original distribution), which in our case isPw(w | Sρ), we sample from a different proposal distribution Qw, which ismore likely to satisfy the constraints given by the feedback set Sρ. However,this process will introduce a set of samples which do not follow the originaldistribution, so we need to correct this bias by associating each sample wfrom Qw with an importance weight (or simply weight, when there is noambiguity) q(w). Next we will discuss how this framework can be employedin solving our problem.Since valid weight vectors w.r.t. Sρ form a continuous and convex region,samples which lie close to the “center” of this region are more likely tosatisfy Sρ. However, finding the center of an arbitrary convex polytope isextremely complex and time consuming [26], which negates the motivationfor using importance sampling, namely efficiency. Instead, we use a simplegeometric decomposition-based approach, which partitions the space into amulti-dimensional grid, and approximates the center of the convex polytopeusing the centers of the grid cells which overlap with it.In Figure 5.3, we show a simple two dimensional example of the aboveapproach. Initially, the entire valid region is divided into a 3 × 3 grid asdepicted in Figure 5.3(a). Given feedback ρ := p1 p2, we know anyinvalid w satisfies the property that w · p1 < w · p2, or w · (p2 − p1) > 0, i.e.,w is invalid if w is above the line p2 − p1 = 0. As shown in Figure 5.3(b),745.3. A Sampling-based Frameworkw1w2(0,0)(1,1)(a)w1w2(0,0)(1,1)(b)ρFigure 5.3: Approximate center of a convex polytope.all w’s which are in the top-right grid cell are above the line correspondingto ρ, so we can eliminate this cell from consideration, and the center of theregion of valid w’s can be approximated by the center of the remaining eightcells. The latter can be calculated by simply taking the average of the eightcell centers.We note that whether there exists a w in a grid cell which satisfiesa constraint ρ can be checked in time linear in the dimensionality of thefeature space. Also, finding those cells which violate new feedback can befacilitated by organizing the grid cells into a hierarchical structure such asa Quad-tree [55].Given the center w∗ of the valid region, an intuitive choice for the pro-posal distribution would be a Gaussian Qw ∼ N (w∗,Σ) with mean w∗,covariance Σ. To correct the bias introduced in samples from Qw, in impor-tance sampling, we could associate each valid sample w with an importanceweight q(w) = Pw(w)/Qw(w). Intuitively, this importance weight compen-sates for the difference between Pw and Qw.The importance weight of each sample q(w) can be easily adapted todifferent ranking semantics. For EXP, we multiply q(w) by the utility valuecalculated for each package under consideration for this w. For TKP, MPOand ORA, instead of adding one to the corresponding counter of each pack-age w.r.t. the given w, we add q(w).MCMC-based SamplingAnother popular approach for sampling from complex distributions is theMarkov Chain Monte Carlo or MCMC method. This approach generatessamples from the distribution by simulating a Markov chain. We constructthe Markov chain in a way such that it gives more importance to the regionswhich are valid, i.e., the stationary distribution of the Markov chain is thesame as the posterior Pw(w | Sρ).Since valid weight vectors form a single continuous and convex region, wecould simply find a first valid weight vector w, and then perform a random755.3. A Sampling-based Frameworkwalk from w to a new weight vector w′. Note that in order to explore thevalid region w.r.t. the current set of preferences Sρ, it is clearly desirablethat each step of the random walk explores only a close region around thecurrent w, as otherwise, w′ generated from w may be more likely to beoutside the valid region. Thus we define a length threshold lmax, and setthe transition probability Q(w′ | w) from w to w′ as follows:Q(w′ | w) ={1/lmax if ‖w′ − w‖ ≤ lmax0 otherwise.(5.4)One of the most popular MCMC-based sampling algorithms is Metropolis-Hastings (MH). Using MH we can generate samples following the Markovchain defined by Q(w′ | w). Given a current weight parameter w, we ran-domly pick a weight parameter w′ such that the distance ‖w′ − w‖ is lessthan or equal to lmax. If w′ satisfies the preferences in the feedback setSρ, w′ is accepted as the next sample with a probability α, as defined inEquation 5.5. If w′ is rejected (i.e., with probability (1−α)), we use a copyof the w as the next sample.α = min{1, Pw(w′)Q(w | w′)Pw(w)Q(w′ | w)} (5.5)Note that Q(w′ | w) is obviously symmetric in our case, so α can besimply calculated as α = min{1, Pw(w′)Pw(w) }.Following recommendations in the MH sampling literature [22], we pickone sample from every δ samples generated, where δ is called the step length,rather than including all generated samples in the final sample pool. Thisavoids generating highly correlated samples.5.3.3 Optimizing Constraint Checking ProcessSuppose σ packages p1, . . . , pσ are presented to the user. Note that accordingto the preference elicitation protocol, in every interaction only one package ispicked by the user as the most preferred package. Without loss of generality,let p1 be that package. This results in σ − 1 pairwise package preferences:ρ1 := p1 p2, . . . , ρσ−1 := p1 pσ−1. Thus, as more feedback is receivedfrom the user, the number of package preferences we need to deal withincreases quickly.This raises the following two issues: (1) cycles in preferences, and (2) costof checking whether a sample w should be rejected. We note that cycles in765.3. A Sampling-based Frameworkpreferences can be resolved by presenting packages in a cycle to the user, andasking the user to choose the best out of them (which reverses the directionof one edge in the cycle and breaks the cycle). Next we will discuss how aset Sρ of pairwise package preferences can be organized in order to facilitateefficient checking of whether a sampled weight vector w is valid.One intuitive solution is to reduce redundant package preferences byexploiting the transitivity of the preference relation. It is easy to see that thepreference relation over packages is transitive for additive utility functions,i.e., for any packages p1, p2, p3 and any weight vector w, w · p1 > w · p2 andw · p2 > w · p3 imply w · p1 > w · p3. Thus, there is no need to verifysatisfaction of p1 p3 for a sample w whenever w satisfies p1 p2 andp2 p3. It follows that the number of preferences we need to check is atmost linear in the number of distinct packages (implicitly) appearing in thefeedback.To eliminate redundant preferences received from the user, we can main-tain preferences in a directed acyclic graph (DAG) Gρ: an edge (pi, pj) rep-resents the preference pi pj . Then any transitive reduction algorithm [17]can be applied to eliminate redundant preferences.5.3.4 Sample MaintenanceIt is clear from the previous section that depending on the number of feed-back preferences received from a given user, the sampling process may ac-tually be quite time consuming. Thus, it is desirable to avoid generatingsamples from scratch whenever new feedback is received. In other words, itis desirable to maintain previously generated samples against new incomingfeedback.Given a probability distribution Pw of w, a set Sρ of preferences, and asample pool S, instead of regenerating all samples, we can simply replacesamples which violate the new feedback, and retain samples which do notviolate any new feedback. This approach works since the probability of eachvalid w always follows Pw, regardless of the newly received feedback.A simple idea for replacing invalid samples in the pool is to scan throughall samples in the pool one by one, and check whether each satisfies the newfeedback. This simple approach will be effective if many samples might vio-late the new feedback received. However, the performance will be very poorif hardly any samples from S actually violate the newly received feedback.Note that, as discussed in the previous section, for feedback ρ := p1 p2,w violates ρ if w · (p2 − p1) > 0. Thus finding those w ∈ S which violateρ is the same as finding all weight vectors which have a projected value on775.4. Search for Best Packagesp2 − p1 larger than 0. This problem can be solved by leveraging the classicthreshold algorithm (TA) framework [63], by iteratively enumerating thelargest w from S w.r.t. the query vector p2−p1 until the maximum possiblescore of any unseen w is less than or equal to 0. Obviously this TA-basedalgorithm is very efficient when not many samples violate the new preference.However, for cases where most samples violate the new feedback, the costof the TA-based algorithm may be much higher than the na¨ıve algorithm ofchecking every sample in the pool for possible preference violation.Algorithm 14: RejectedSampleCheck(S, ρ = p1 p2)1 Q ← An empty set for rejected sample w’s;2 Lw ← Lists of samples sorted based on feature values;3 while true do4 lw ← Access lists in Lw in round-robin fashion;5 w ← getNext(lw);6 τ ← boundary value vector from Lw;7 if w · (p2 − p1) > 0 then Add w to Q;8 if τ · (p2 − p1) ≤ 0 then Break;9 if Cprocessed + Cremain ≥ (1 + γ)|S| then10 Scan and check each remaining w in lw, Break;11 return Q;Motivated by this, we propose a hybrid approach shown in Algorithm 14.We organize the samples into m lists Lw = lw1 , . . . , lwm, where each list lwi isa total ordering of items based on the values of the corresponding featurefi. Given new feedback ρ, we start with the TA-based algorithm, and ifthe current number Cprocessed of items processed plus the number Cremain ofremaining items in the current list is larger than or equal to (1 + γ) of thetotal number of items, we stop the TA process, and instead scan throughthe remainder of the current list, checking the validity of each sample withinthis list. Here, γ is a parameter which can be tuned based on the actualperformance, with smaller γ leading to performance closer to the simplescanning algorithm, and larger γ leading to performance closer to the pureTA-based algorithm.5.4 Search for Best PackagesIf we have a top-k package solver which can produce the top-k packages fora given weight vector w, then given a set S of sample weight vectors, we can785.4. Search for Best Packageseasily find the overall top-k packages under different ranking semantics asfollows:For EXP, we need to estimate the expected utility of packages and returnthe top-k packages w.r.t. the estimates. Given all the top-k package resultsobtained from the samples w ∈ S, we maintain the sum of the utility valuesfor each package appearing in the results. Then the sample utility meanof each package is simply the utility sum divided by the number of timesthe package appears in a result. Note that we only need to consider thosepackages which appear in the top-k package list w.r.t. at least one samplew.For TKP, we just need to maintain a counter for each package whichappears in the result set, and the k packages which appear most frequentlyin this set will be the result under TKP.For MPO, instead of maintaining statistics for each package that appearsin the result set, we maintain a counter for each top-k package list. The finaltop-k package list under MPO is the one with the largest counter value.For ORA, the final top-k package list should be the one which minimizesthe sum of its (Kendall tau) distance to any top-k package list in the resultset. So we can simply store all the top-k package lists and then find the finaltop-k packages under ORA using any available algorithm [112].Thus, given a sampling-based framework, a key step in finding the top-kpackages given a set S of sample weight vectors is to find the top-k packagesfor any specific w, which we address next.Given a set T of items and a fixed w for the utility function, the problemof getting the k best items w.r.t. w can be done using any standard top-k query processing technique [63]. However, because we are dealing withsubsets of items, the problem becomes challenging since a na¨ıve solutionwhich first enumerates all possible packages, and then uses a top-k queryprocessing algorithm for finding the best k packages would be prohibitivelyexpensive. Below, we discuss how classical top-k query processing algorithmscan be adapted to finding the top-k packages, given a fixed w.Following recent research on top-k item query processing [63], one intu-ition is that for a top-k package p, the likelihood of having a high utilityitem in p is often higher than the likelihood of having a low utility item in p.Thus by accessing items in their descending utility order, we could poten-tially locate the top-k packages by accessing only a small number of items.To facilitate efficient processing over different weight vectors, we order itemsbased on their utility w.r.t. each individual feature. We denote the resultingset of sorted lists by L.Given a fixed w and utility function U , Algorithm 15 gives the pseudo-795.4. Search for Best Packagescode of the overall algorithm framework TopPkg. As shown in the algorithm,TopPkg first sorts the underlying items into different lists, where each listis an ordering of the items in T w.r.t. the desirable order on one specificfeature according to the utility function (line 2). E.g., consider U(p) =0.5avg1(p)+0.5(1−sum2(p)), where avg1(p) and sum2(p) are the normalizedaggregation values of the package w.r.t. the corresponding feature. Thealgorithm sorts items into two lists, l1 and l2, where in l1, items are sorted innon-increasing order of feature 1, and in l2, items are sorted non-decreasingorder of feature 2 20. As discussed before, the intuition is that by accessingitems with better utility values w.r.t. each individual feature in the utilityfunction, we can potentially quickly find the top-k packages of the items.After constructing the set of lists L, TopPkg accesses items from lists inL in a round-robin fashion (line 4–5). We assume items reside in memory, sotheir feature values can be retrieved quickly. After accessing each new itemt, we can obtain the new boundary value vector τ in which each feature valueequals the corresponding feature value of the last accessed item in each list(line 6). So essentially, the feature vector τ corresponds to the maximumpossible utility value for an unaccessed item.Next, we can expand the existing packages in the queue Q by incorpo-rating the new item t (line 7), a process described in Section 5.4.2. Duringpackage expansion, a current lower bound utility threshold ηlo can be ob-tained by looking at the kth best package so far in the queue Q, and anupper bound utility threshold ηup of any possible package can be obtainedby referring to the maximum utility value an unaccessed item can have, acalculation described in Section 5.4.1. Obviously, if ηup ≤ ηlo, we can safelyreturn the current top-k packages, as no future packages can have higherutility than the current top-k packages (line 8–9).5.4.1 Upper Bound Estimation for Best PackageGiven the accessed item information, one important problem in TopPkg isthe estimation of the upper bound value a package can have. In this section,we will discuss algorithms for estimating this upper bound value.Given a fixed weight vector w, the utility value U(p) of a specific packagep can be calculated as p ·w, so it depends only on items within the packagep. Given the fact that items in each list of L are ordered in non-increasingutility of the corresponding feature, the maximal marginal utility value of20Since each sorted list can be accessed both forwards and backwards, there is no needto maintain two separate lists when different desirable orders are required on the samefeature.805.4. Search for Best PackagesAlgorithm 15: TopPkg(U , T , w, k)1 Q ← A priority queue of packages having one item ∅;2 L ← Lists of items in T sorted according to util. func. U ;3 while true do4 l ← Access lists in L in round-robin fashion;5 t ← getNext(l);6 τ ← boundary value vector from L;7 (ηlo, ηup) ← expandPackages(U , Q, t, τ));8 if ηup ≤ ηlo then break ;9 return top-k packages in Q;an unseen item is obviously bounded by the imaginary item with featurevector τ .Given a utility function U , we say that U is set-monotone if for anypackages p, p′ of items, we have U(p∪p′) ≥ U(p). E.g., U(p) = 0.5sum1(s)+0.5(1−min2(s)) is set-monotone. Clearly, if U is set-monotone, the maximumutility of a package p can be achieved by packing as many items with featurevector τ (were they to exist) as possible into p. On the other hand, if U isnot set-monotone, e.g., when some aggregate feature values in U are basedon avg, we can show that the upper bound value of p in this case is givenby packing as many items with feature vector τ into p as possible, as longas the marginal utility gain of this addition is positive.Lemma 10 Given a package p, a utility function U with fixed w, and asequence of items t1, . . . , tm such that every feature value of ti is no worsethan that of ti+1, then U(p ∪ {t1, . . . , ti}) − U(p ∪ {t1, . . . , ti−1}) ≥ U(p ∪{t1, . . . , ti+1})− U(p ∪ {t1, . . . , ti}), 1 < i < m.Proof The result follows from that fact that every feature value of ti is noworse than that of ti+1 w.r.t. U .An algorithm for estimating the upper bound value, upper-exp, is shownas Algorithm 16, where φ is the maximum allowed package size.5.4.2 Package ExpansionConsider the problem of expanding the set of packages in queue Q on ac-cessing a new item t. A na¨ıve way of expanding packages would be to try toadd t to every possible package in Q as long as the resulting package satis-fies the package size budget, inserting the new packages into Q. Obviously815.5. Experimental EvaluationAlgorithm 16: upper-exp(p, U , τ , φ)1 p′ ← p;2 if U is set-monotone then3 for i ∈ [1, φ− |p|] do p′ ← p′ ∪ {τ} ;4 return U(p′)5 else6 for i ∈ [1, φ− |p|] do7 if U(p′ ∪ {τ})− U(p′) > 0 then p′ ← p′ ∪ {τ} ;8 else return U(p′);9 return U(p′)utilizing this strategy is equivalent to enumerating all possible combinationsof the accessed items. Thus, while it is guaranteed to be correct, it is highlyinefficient.Given a package p, one intuitive optimization is that if incorporating anyunaccessed item cannot improve the value of p, we do not need to considerp in the expansion phase. E.g., let U(p) = 0.5avg1(p) + 0.5min2(p), withp = (0.5, 0.5) and τ = (0.4, 0.4). Clearly, any unaccessed item in L will havea utility worse than or equal to that of τ , so there is no need to consider pfor expansion in the future.To incorporate this optimization, we split the priority queueQ in TopPkginto two sub-queues Q+ and Q−. Queue Q+ stores packages which can befurther expanded (while improving utility), while Q− stores packages whichcannot be further expanded (while improving utility) and so can be prunedfrom the expansion phase. In Algorithm 17 for the expansion phase, we onlyneed to iterate through packages in Q+ (lines 2–12), and for each packagep ∈ Q+, we test whether p can be improved by incorporating the new itemt (line 3). If true, we generate a new package p′, and insert it into theappropriate sub-queue based on whether it can be further improved by anunaccessed item or not (lines 5–8). If false, we can check whether the currentp can be further improved by referring to the updated τ , and p will be movedto Q− if it cannot be improved (lines 9–11).5.5 Experimental EvaluationIn this section, we study the performance of various algorithms proposedin this work based on one real dataset of NBA statistics and four syntheticdatasets. The goals of our experiments are to study: (i) the performance825.5. Experimental EvaluationAlgorithm 17: expandPackages(U , Q, t, τ)1 (ηlo, ηup) ← lower/upper bound value;2 foreach p ∈ Q+ do3 if U(p ∪ {t}) > U(p) then4 p′ ← p ∪ {t};5 if U(p′ ∪ {τ}) > U(p′)) then6 Q+ ← Q+ ∪ {p′};7 ηup ← max(ηup, upper-exp(p′, U , τ , φ));8 else Q− ← Q− ∪ {p′} ;9 if U(p ∪ {τ}) > U(p) then10 ηup ← max(ηup, upper-exp(p, U , τ , φ));11 else Q+ ← Q+ − {p}, Q− ← Q− ∪ {p} ;12 ηlo ← U(Q−[k]) or 0 if fewer than k packages in Q−;13 return (ηlo, ηup)of various sampling techniques w.r.t. our package recommendation problem;(ii) the effectiveness of the proposed pruning process; (iii) the performanceof various maintenance algorithms as the system receives new feedback. Weimplemented all the algorithms in Python, and all experiments were runon a Linux machine with a 4 Core Intel Xeon CPU, OpenSuSE 12.1, andPython 2.7.2.The NBA dataset is collected from the Basketball Statistics website [6],which contains the career statistics of NBA players until 2009. The com-posite recommendation would be to recommend a set of NBA players whichhave good aggregated statistics over different measures, e.g., high average3 points per game, high rebounds per game. The dataset has 3705 NBAplayers and we randomly selected 10 (out of 17) features (each feature cor-responds to statistics of NBA players such as points per game) to be usedin our experiments. The synthetic datasets are generated by adapting thebenchmark generator proposed in [25]. The uniform (UNI) dataset and thepowerlaw (PWR) dataset are generated by considering each feature indepen-dently. For UNI, feature values are sampled from a uniform distribution,and for PWR, feature values are sampled from a power law distribution withα = 2.5 and normalized into the range [0, 1]. In the correlated (COR) syn-thetic dataset, values from different features are correlated with each other,while in the anti-correlated (ANT) synthetic dataset, values from differentfeatures are anti-correlated with each other. Each synthetic dataset has 10features and has 100,000 tuples.835.5. Experimental Evaluation5.5.1 Comparing Sampling MethodsIn Figure 5.4, we show an example of how different sampling methods gen-erate 100 valid 2-dimensional sample w parameters given 5000 packages and2 randomly generated preferences. As discussed previously, each prefer-ence ρ := p1 p2 defines a linear hyperplane. A sample w satisfies ρ iffw · (p1 − p2) ≥ 0, or w is above the corresponding hyperplane. In Fig-ure 5.4 (a), given the set of valid sample w’s (black dots) and the set ofinvalid sample w’s (red crosses), we can infer the two linear hyperplaneswhich correspond to the two given preferences and bound valid sample w’sto the bottom. It is clear from the figure that unless these two preferencesare way “above” the center of Pw, many sample w’s from Pw will be in-valid. Thus using rejection sampling, many samples will be wasted and weneed to spend considerable time checking whether each sample w satisfiesall preference constraints received.-1-0.5 0 0.5 1-1 -0.5 0 0.5 1Dimension 2Dimension 1(a) Rejection SamplingAccepted samplesRejected samples-1-0.5 0 0.5 1-1 -0.5 0 0.5 1Dimension 2Dimension 1(b) Importance SamplingAccepted samplesRejected samples-0.8-0.6-0.4-0.2 0 0.2 0.4 0.6 0.8-0.8-0.6-0.4-0.2 0 0.2 0.4 0.6 0.8Dimension 2Dimension 1(c) MCMC-based SamplingAccepted samplesRejected samplesFigure 5.4: Example of various sampling algorithms.On the other hand, the two feedback-aware sampling strategies will gen-erate far fewer invalid samples. E.g., in Figure 5.4 (b), the importancesampling technique samples from a proposal distribution which is more tothe center of the valid region, so samples generated are more likely to satisfyall constraints. Notice, each sample w is also associated with a weight, whichis captured by the size of the dot/circle in the figure. The higher the prob-ability w has under the original distribution, and the lower the probabilityw has under the proposal distribution, the larger the weight of w.As can be seen from Figure 5.4 (c), MCMC-based sampling first needs tofind one random valid sample w. Note that during this process we leveragethe simple rejection sampling, thus these rejected samples (denoted as iso-lated red crosses in the figure) will not be part of the random walk processin MCMC. Then from this valid sample w, we initiate a random walk fromthe neighborhood of w, which follows the original distribution of Pw usinga Metropolis-Hastings sampler as discussed Section 5.3.2.845.5. Experimental Evaluation5.5.2 Constraint CheckingAs discussed in Section 5.3.1, no matter which sampling method we use,an important task is to efficiently check whether a sample satisfies all thefeedback constraints received from a user. In Figure 5.5, we show howpruning strategies discussed in Section 6.4.2 benefit the overall checkingperformance by varying the number of features, the number of samples, andthe number of Gaussians in the mixture distribution while keeping the othervariables fixed at a default value. We set the default value for the number ofrandomly generated preferences to 10000, the number of packages to 5000,the number of Gaussians to 1, the number of features to 5, and the numberof samples to 1000. As can be seen from this figure, when we vary oneparameter while fixing other parameters at their default values, the pruningstrategy can robustly generate at least a 10% improvement. Results underother different default values are similar. 0 5 10 15 20 25 30 35 403 4 5 6 7Overall checking time(s)Number of features(a) Varying number of featuresBefore pruningAfter pruning 0 20 40 60 80 100 120 140 160 1801000 2000 3000 4000 5000Overall checking time(s)Number of samples(b) Varying number of samplesBefore pruningAfter pruning 0 5 10 15 20 25 30 35 401 2 3 4 5Overall checking time(s)Number of Gaussians(c) Varying number of GaussiansBefore pruningAfter pruningFigure 5.5: Efficiency of the pruning strategy.5.5.3 Overall Time PerformanceIn this section, we report the overall time performance for package recom-mendation over different datasets. All time results reported are based on anaverage of 5 runs.In Figure 5.6, we compare overall time performance for generating top-k package recommendations under Rejection Sampling (RS), ImportanceSampling (IS), and MCMC-based Sampling (MS). In these experiments, werandomly select one ranking semantic and vary one of the following twoparameters while fixing the remaining parameters at their default values:(1) Number of valid samples required; (2) Number of features. We alsotested varying the number of feedback preferences received, and the numberof Gaussians in the mixture distribution; the results are very similar tovarying the number of valid samples, and thus are omitted.855.5. Experimental Evaluation10-210-11001011021031041000 2000 3000 4000 5000Overall process time(s)Number of samples(a) UNISample Gen.(RS)Sample Gen.(IS)Sample Gen.(MS)Top-k Pkg(RS)Top-k Pkg(IS)Top-k Pkg(MS)10-210-11001011021031041000 2000 3000 4000 5000Overall process time(s)Number of samples(b) PWRSample Gen.(RS)Sample Gen.(IS)Sample Gen.(MS)Top-k Pkg(RS)Top-k Pkg(IS)Top-k Pkg(MS)10-210-11001011021031041000 2000 3000 4000 5000Overall processing time(s)Number of samples(c) CORSample Gen.(RS)Sample Gen.(IS)Sample Gen.(MS)Top-k Pkg(RS)Top-k Pkg(IS)Top-k Pkg(MS)10-210-11001011021031041000 2000 3000 4000 5000Overall processing time(s)Number of samples(d) ANTSample Gen.(RS)Sample Gen.(IS)Sample Gen.(MS)Top-k Pkg(RS)Top-k Pkg(IS)Top-k Pkg(MS)10-210-11001011021031041000 2000 3000 4000 5000Overall processing time(s)Number of samples(e) NBASample Gen.(RS)Sample Gen.(IS)Sample Gen.(MS)Top-k Pkg(RS)Top-k Pkg(IS)Top-k Pkg(MS)10-210-11001011021031042 4 6 8 10Overall process time(s)Number of features(f) UNISample Gen.(RS)Sample Gen.(MS) Top-k Pkg(RS)Top-k Pkg.(MS)10-210-11001011021031042 4 6 8 10Overall process time(s)Number of features(g) PWRSample Gen.(RS)Sample Gen.(MS) Top-k Pkg(RS)Top-k Pkg.(MS)10-210-11001011021031042 4 6 8 10Overall processing time(s)Number of features(h) CORSample Gen.(RS)Sample Gen.(MS) Top-k Pkg(RS)Top-k Pkg.(MS)10-210-11001011021031042 4 6 8 10Overall processing time(s)Number of features(i) ANTSample Gen.(RS)Sample Gen.(MS) Top-k Pkg(RS)Top-k Pkg.(MS)10-210-11001011021031042 4 6 8 10Overall processing time(s)Number of features(j) NBASample Gen.(RS)Sample Gen.(MS) Top-k Pkg(RS)Top-k Pkg.(MS)Figure 5.6: Overall time performance under various sampling algorithms.From Figure 5.6 (a)–(e), with a log-scale for processing time, we observethat the sampling cost for generating valid sample w’s mostly outweighs orat least is comparable to the cost for generating top-k packages, as usuallythe top-k packages can be found by just checking the first few high utilityitems. Also the rejection sampling cost is usually much higher than that ofthe other two feedback-aware sampling approaches.As can be seen from Figure 5.6 (f)–(j), importance sampling is excludedfrom high dimensional experiments because finding the center of a high-dimensional polytope is computationally intractable [46]. Even using thesimple grid-based approximation algorithm as discussed in Section 5.3.2,the cost is exponential w.r.t. the dimensionality. Specifically, when the di-mensionality is over 5, the time to find the center will quickly exceed the timefor rejection sampling. Indeed, when dimensionality is 6, the algorithm can-not finish within 30 minutes whereas simple rejection sampling only needsa couple of seconds. As can be seen from Figure 5.6 (f)–(j), MCMC-basedsampling scales well w.r.t. dimensionality.5.5.4 Sample QualityTo measure the quality of different sampling methods, we compare the top-5package list generated w.r.t. different ranking semantics and different sam-pling methods. In our experiments, we set the number of samples to 5000(we verified that increasing the number of samples beyond 5000 does notchange the top package rankings for different datasets), the number of feed-back preferences received to 1000, the number of features to 4, and thenumber of Gaussians in the mixture distribution to 2. Results under dif-865.5. Experimental Evaluationferent value settings are similar thus excluded. Each package is associatedwith a unique and random package id.In Table 5.1, we show the top-5 packages under different sampling meth-ods and different ranking semantics on UNI; experimental results on otherdatasets and other settings are similar and are thus omitted. As can beseen, given enough samples, the top package results from different samplingmethods typically tend to become very similar. The reason is that althoughdifferent ranking semantics may potentially result in different top packagelists, they can be correlated with each other. E.g., as in our example, ifone list of top packages dominates the results given a set of samples, TKP,MPO and ORA may tend to give very similar results. This is because pick-ing the same list of top packages guarantees that packages in this list mayalso appear most frequently among all top packages. ORA may also pickthis same list as it tends to minimize the distance between it and all othertop package lists. EXP may not be affected by this as it is determined bythe expected utility of the package, so a package appearing frequently maynot necessarily have high expected utility value.Rej. Sampling Imp. Sampling MCMC SamplingEXP 1,2,3,6,7 1,2,3,6,7 1,2,3,6,7TKP 6,7,8,9,10 6,7,8,9,10 6,7,8,9,10MPO 6,7,8,9,10 6,7,8,9,10 6,7,8,9,10ORA 6,7,8,9,10 6,7,8,9,10 6,7,8,9,10Table 5.1: Top-5 package id’s for different sampling methods and differentranking methods on UNI.5.5.5 Sample MaintenanceAs discussed in Section 5.3.4, upon receiving new feedback, a na¨ıve methodof scanning through previous samples to determine which samples need tobe replaced might be costly if the number of rejected samples is low, whereasa top-k algorithm might help by quickly scanning through the pre-processedsample lists, and determining whether all samples satisfy the constraints.However, this algorithm suffers from a substantial overhead when the num-ber of rejected samples is large. Thus we propose a hybrid method whichstarts following the top-k based approach, then falls back to the defaultna¨ıve method if the top-k process cannot stop early.To assess the actual performance of these three algorithms, we considerin the following experiment a setting where the number of previously gener-875.6. Discussionated samples is set to 10000 (results using other values are similar and thusomitted), every other parameter is fixed at a default value similar to previ-ous experiments. We randomly generate sets of 1000 feedback preferences,and then according to the number of samples rejected w.r.t. the feedback,we group the maintenance costs into 7 buckets, where each bucket is associ-ated with a label indicating the maximum number of samples rejected (seeFigure 5.7 (a)). Results are placed in the bucket with the smallest qualifyinglabel. Maintenance cost results are averaged for all cases within the samebucket.According to Figure 5.7 (a), the top-k based algorithm is a clear winnerwhen the number of rejected samples is small. As the number of rejectedsamples grows, the performance of top-k based algorithms will deteriorate,especially the non-hybrid method. But the hybrid method introduces only asmall overhead over the na¨ıve algorithm because of the fall-back mechanism,and this overhead can be tuned through the parameter γ. In Figure 5.7 (b),we show how the ratio of each of top-k cost and hybrid cost over the na¨ıveapproach varies with γ. It shows that when γ is very small, the averageperformance of the hybrid method is very similar to the na¨ıve algorithmas the algorithm is forced to check fewer samples. By slightly increasing γ(e.g., to 0.025 as in Figure 5.7 (b)), the hybrid method can show over 15%improvement compared to the na¨ıve method. When γ increases further, theperformance deteriorates as it becomes similar to the non-hybrid method.We note that this property means we could adaptively decrease γ in practiceuntil a reasonable performance gain can be observed. 0 20 40 60 80 100 1200 1 5 20 50 200 1000Processing time(s)Maximum number of violating samples(a) 10000 SamplesNaive checkingTop-K checkingHybrid checking 0 0.5 1 1.5 2 2.50 0.025 0.05 0.075 0.1Cost ratio w.r.t. naive checkingγ(b) Varying γTop-k cost / Naive costHybrid cost / Naive cost1.0 0.844 0.874 0.897 0.914Figure 5.7: Experiments on sample maintenance.5.6 DiscussionA user’s online interaction can be noisy. E.g., a user may accidentally clickon a package or may change their mind after clicking. A popular method for885.6. Discussionmodeling this kind of noisy user feedback is to assume that every feedbackreceived has a probability ψ of being “correct” [27]. We can incorporatethis noise model into our framework by assuming that every feedback isindependent. Then instead of rejecting a sample whenever it violates somefeedback preference, we condition its rejection using the probability that atleast one violated feedback preference is correct, i.e., 1− (1− ψ)x, where xis the number of feedbacks w violates. This can be easily incorporated intoimportance sampling and MCMC-based sampling.As discussed in [129], users may sometimes specify predicates on theschema of a desired package, e.g., when buying a set of books, at leasttwo should be novels. We can handle such predicates in the top packagegeneration process discussed in Section 5.4. The idea is that when generatinga new candidate package, we evaluate the predicates and retain this packageonly if it satisfy the specified predicates.As future work, we plan to investigate how TopPkg presented in Sec-tion 5.4 can be further optimized using domination-based pruning [87, 129].The intuition is that by pruning away candidate packages which are notpromising, we can further reduce Q, which will be iteratively searched inthe expansion phase. However, we note that usually these pruning strategiesalso come with a non-trivial computational cost, so a systematic cost-basedstudy of different pruning strategies under our proposed PRS model wouldbe interesting.89Chapter 6Further Optimization usingMaterialized Views6.1 IntroductionAs discussed in the previous chapters, no matter under hard constraint orsoft constraint, the key algorithm in our composite recommendation sys-tem can be regarded as a special case of the general top-k query processingproblem. While there exists various methods which can optimize the per-formance of top-k query processing, here in this chapter we discuss how aparticular simple and easy to be implemented technique, query answeringusing cached views, can be used to facilitate our top-k query processing prob-lems. while most of this chapter focuses on discussing how this techniquecan be leveraged to facilitate classical top-k query processing w.r.t. items.At the end of this chapter, we show how this technique can be applied indifferent top-k query processing algorithms in composite recommendation.For the applications under consideration w.r.t. items, typically a simplelinear score function is used to aggregate the attribute values of a tuple intoa score, due to its intuitiveness [36, 44, 45, 60, 117, 128]. Figure 6.1 (a)shows an example relation R which contains 6 tuples over attributes A, Band C. Consider a query Q1 which asks for the top-3 tuples with the highestvalues under the score function f1(t) = 0.1t[A]+0.9t[B]. The result is shownas (a cached view) V1 in Figure 6.1 (b).While various efficient algorithms have been proposed for processing top-k queries [53, 63], one significant limitation is that they cannot take advan-tage of the cached results of previous queries. E.g., consider the previousexample query Q1 whose result V1 is shown in Figure 6.1 (b). Suppose a(possibly different) user subsequently asks the top-1 query Q′ with the scorefunction f ′(t) = 0.2t[A] + 0.8t[B]. Then, as we will see in later sections, wecan use the previous cached result V1 to determine, without accessing theoriginal database R, that t5 is the top-1 answer for Q′.Leveraging cached query results to scale up query answering has recently906.1. IntroductionV1tid rank scoret510.74t320.66t130.57V2tid rank scoret510.74t620.59t230.53Rtid A Bt10.3 0.6t20.4 0.5t30.3 0.7C0.40.60.3f1=0.1A+0.9B k1=3 f2=0.1A+0.5B+0.4C k2=3 (a) (b)t40.5 0.3t50.2 0.8t60.6 0.50.50.80.7Figure 6.1: (a) A relation R with three attributes A, B, C; (b), two cachedviews V1, V2 which contain top-3 tuples according to the the two score func-tions f1(t) = 0.1t[A] + 0.9t[B], f2(t) = 0.1[A] + 0.5[B] + 0.4[C] respectively.become increasingly popular for most large scale websites. For example,the popular Memcached [3] caching system has reportedly been adopted bymany large scale websites such as Wikipedia [10], Flickr [1], Twitter [8] andYoutube [12]. The application of cached query results or materialized viewsfor speeding up query answering in relational databases, the so-called queryanswering using views (QAV) problem, has been extensively studied (see [58]for an excellent survey). This problem has been shown to have applicationsin data integration [75], query optimization [84], and data warehouse de-sign [114].For top-k query processing, recently there have been some initial effortsat using materialized query results for speeding up query answering. Inthe PREFER system, Hristidis et al. [60] consider the problem of how toselect one best materialized view for answering a query. Their setting isquite restrictive, as it cannot exploit multiple materialized views, and italso makes a strong assumption that all attributes of the underlying basetable are always utilized for all top-k queries. Overcoming these limitations,Das et al. [45] propose a novel algorithm, called LPTA, which is able toutilize multiple materialized views for answering a top-k query. Ryeng etal. [109] extend these techniques to answer top-k queries in a distributedsetting.Though LPTA overcomes many of the limitations of PREFER, unfor-tunately it still suffers from several significant limitations. Firstly, the coretechniques proposed in [45] rely on the assumption that either (1) each top-kview is a complete ranking of all tuples in the database, or (2) that the baseviews, which are complete rankings of all tuples in the database accordingto the values of each attribute, are available. These assumptions may oftenbe unrealistic in practice.Consider the example of finding top-k movies. There are several popular916.1. Introductionwebsites which provide top-k lists of movies based on different criteria. Forexample, Metacritics [4] provides a ranked list of (up to 5639) movies basedon Metascore [5], which is aggregated from critics and publications like theNew York Times (NYT) and the San Francisco Chronicle; IMDB [2] providesa top-250 list of movies based on votes from their users; and RottenTomatoes(RT) [7] provides a top-100 list of movies based on the Tomatometer score,which is calculated based on critics. Here, the top-k lists on Metacritics,IMDB, and RT can be regarded as materialized views. Because of the hugenumber of movies available, it is impractical to obtain the complete rankingof all movies from each of the sources, and for the same reason, we cannotassume base views corresponding to the complete ranking of all movies oneach of the individual scores, e.g, NYT score, are available. Consider thequery that asks for top-k movies according to an aggregation of NYT score,IMDB score, and Tomatometer score. Since the only information we haveaccess to is the top-k movies from the Metacritics, IMDB, and RT, thetechnique proposed in [45] cannot be used to answer this query. Similarexamples can also be found in other domains: finding the top-k universitiesbased on university ranking lists from U.S. News [9] and The Times [11]; orfinding the top-k cars based on automobile ranking lists from U.S. News [14]and Auto123 [13].The second issue with the LPTA algorithm proposed in [45] is that ituses linear programming (LP) as a sub-procedure to calculate the upperbound on the maximum value achievable by a candidate result tuple, andthe LPTA algorithm needs to call this sub-procedure iteratively. It hasbeen demonstrated in [45] that for low dimensionality scenarios (e.g., 2 or3), the cost of this LP overhead is reasonable. However, we will show in ourexperiments that for scenarios with higher dimensionality, which we note isvery common, this iterative invocation of the LP sub-procedure may incura high computational overhead.Finally, for both PREFER [60] and LPTA [45], a potentially costly viewselection operation is necessary. For example, the view selection algorithm in[45] requires the simulation of the top-k query process over the histograms ofeach attribute, and the processing cost is linear with respect to the number ofviews. This cost can be prohibitive given a large pool of cached query (view)results. Furthermore, (histograms over) base views are often not availablein practice, restricting the applicability of this view selection procedure.In this chapter, we propose two novel algorithms for the problem of top-k query answering using cached views. Our first algorithm LPTA+ is anextension of LPTA as proposed in [45]. In LPTA+, we make a novel ob-servation on the characteristics of LPTA, and by taking advantage of the926.2. Problem Settingfact that our views are cached in memory, we can usually avoid the itera-tive calling of the LP sub-procedure, thus greatly improving the efficiencyover the LPTA algorithm. LPTA+ can be useful for scenarios with a smallnumber of views and low dimensionality. For the case where the number ofcached views is large and the dimensionality is high, we further propose anindex structure called the inverted view index (IV-Index ), which stores thecontents of all cached views in a central data structure in memory, and canbe leveraged to answer a new top-k query efficiently without any need forview selection.Specifically, we make the following contributions:• We consider the general problem of top-k query answering using views,where base views are not available, and the cached views include onlythe top-k tuples which need not cover the whole view (Section 6.2).• For scenarios where we are not allowed to maintain additional datastructures, we extend LPTA and propose a new algorithm, LPTA+,which can significantly improve performance over LPTA (Section 6.3).• We further propose a novel index-based algorithm, IV-Search, whichleverages standard space-partitioning index structures, and can bemuch faster than LPTA/LPTA+ in most situations. We consider twodifferent strategies for the IV-Search algorithm, and discuss additionaloptimization techniques (Section 6.4).• We present a detailed set of experiments showing that the performanceof our proposed algorithms can be orders of magnitude better than thestate-of-the-art algorithms (Section 6.5).Related work is discussed in Section 6.6.6.2 Problem SettingGiven a schema R with m numeric attributes A1, . . . , Am, we denote arelation instance of R by R. In practice, R may have other non-numericattributes as well, but we are concerned only with the numeric attributes.Every tuple t ∈ R is an m-dimensional vector t = (t[1], . . . , t[m]), where t[i]denotes the value of t on attribute Ai, i = 1, . . . ,m. Similar to previous workon top-k query processing, we assume that attribute values are normalizedin the range of [0, 1], and that each tuple t also has a unique tuple id.936.2. Problem SettingSimilar to [45], we define a top-k query Q over R as a pair (f, k), where kis the number of tuples required, and f : [0, 1]m → [0, 1] is a linear functionwhich maps the m attribute values of a tuple t to a preference score, i.e.,f(t) = w1t[1] + · · · + wmt[m], where wi ∈ [0, 1], and∑iwi = 1. Note thatsince every wi is non-negative, the function f is clearly monotone, i.e., fortwo tuples t1 and t2, if t1[i] ≥ t2[i], i = 1, . . . ,m, then f(t1) ≥ f(t2).Given a relation R and a query Q = (f, k), without loss of generality,assume that k ≤ |R| and that larger f values are preferred. Then thesemantics of top-k query processing is to find k tuples in R which have thelargest values according to the query score function f . We can formallydefine the answer to a top-k query as follows.Definition 14 Top-k Query Answer: Let Q = (f, k) be a top-k queryover relational schema R. Given a relation R over R, the answer of Q onR, Q(R), is a list of tuples from R such that |Q(R)| = k, and ∀t ∈ Q(R)and ∀t′ ∈ R\Q(R), f(t) ≥ f(t′). Finally, tuples of higher rank in Q(R) havea higher score according to the score function f .A top-k cached view, or a top-k view for brevity, is defined similarly toa top-k query, except the results of a top-k view are cached in memory.For each tuple t in a cached view, we assume all attribute values t[i], i =1, . . . ,m, will also be cached in memory, and thus can be efficiently accessedat query time. We allow random access by id on the cached tuples. Givena view Vi = (fi, ki), without any ambiguity, we reuse Vi also to denote thelist of (ki) tuples materialized along with their ranks and scores w.r.t. fi.We use Vi[j] to denote the tuple t ∈ Vi, which has the jth highest scorew.r.t. fi, with ties broken using tuple id, i.e., a tuple with a smaller tupleid will be ranked higher. Similarly for a given relation R, we denote the jthtuple in R following the order defined by a score function f as Rf [j].Let V = {V1, . . . , Vp} be a set of views, where Vi = (fi, ki) is a top-ki view, and let Q = (f, k) be a top-k query. Inspired by the notion ofcertain answers when answering a non-ranking query using views [15], wesay a relation R is score consistent with the set V of views, if for any viewVi = (fi, ki) ∈ V, the jth tuple Rfi [j] in R w.r.t. fi has the same score asthe jth tuple Vi[j] in Vi, i.e., fi(Rfi [j]) = fi(Vi[j]), for j = 1, . . . , ki. Notethat we do not require Rfi [j] to have the same tuple id as Vi[j], since thescore of a tuple is determined solely by its attribute values and not by itstuple id (Definition 14).Given a set of views V = {V1, . . . , Vp}, a score consistent relation R is thecounterpart of a possible world under the closed world assumption (see [15]).946.2. Problem SettingAccordingly, we define a tuple t ∈ Vi, i = 1, . . . , p, to be a certain answer toQ if, for any relation R which is score consistent with V, f(t) ≥ f(Rf [k]),i.e., the score of tuple t is no worse than the score of the kth tuple in Runder the query score function f . Motivated by the previously mentionedapplications where we need to efficiently answer a query using merely top-kviews, we consider the following problem.Definition 15 Top-k QAV (kQAV): Given a set V = {V1, . . . , Vp} oftop-ki views, i = 1, . . . , p and a top-k query Q = (f, k), find all certainanswers of Q, denoted Q(V), up to a maximum of k answers.Notice that we have no access to the complete ranking of tuples in theviews nor access to the base views. Similar to query answering using viewsin a non-ranking setting [15], given only the view set V, we need to find thecertain answers. The number of certain answers may be less than, equal to,or more than k. Since Q = (f, k), we restrict the output to a maximum ofk certain answers, where any ties at rank k are broken arbitrarily.As an example, consider the set of views V = {V1, V2} as shown inFigure 6.1 (b) and assume the base relation R is no longer available. Assumewe are given the query Q = (f, 1), where f = 0.1A+0.8B+0.1C. Using thetechniques proposed in Section 6.3, we can determine that {t5} is the set ofcertain answers to Q. Intuitively, this is because after accessing the secondtuple in V1 and V2, we can derive that for all unseen tuples, the maximumpossible value w.r.t. Q is 0.6425 which is smaller than the current best tuplet5 for which f(t5) = 0.74. And if Q = (f, 4), we can only find 3 certainanswers to Q, which are t5, t3 and t1. This is because after accessing allthree tuples in V1 and V2, the maximum possible value w.r.t. Q is 0.56 forall unseen tuples, and only t5, t3, t1 have projected values larger than orequal to 0.56.6.2.1 System OverviewMotivated by the applications discussed in the introduction, we consider thefollowing top-k query evaluation framework as illustrated in Figure 6.2. Fora top-k query Q = (f, k) submitted to the query processing system, thequery executor will consult the cached top-k views to find the maximum setof certain answers to Q. In this work, we will focus on the kQAV problemwhere the goal is to efficiently find certain answers using only the given top-kviews.In the following sections, we will first adapt and improve the LPTA al-gorithm as originally proposed in [45] for addressing the kQAV problem as956.3. LPTA-based kQAV ProcessingSystemCached Top-k ViewV1V2V3 ...UserQQ( )Query ExecutorVFigure 6.2: System overview.defined above, where neither the complete ranking of tuples nor the baseviews are available. We will then discuss how a standard space partition-based index can be used to further optimize the performance of the algo-rithm.6.3 LPTA-based kQAV ProcessingIn this section, we first discuss LPTA, the state-of-the-art algorithm pro-posed in [45] for answering a top-k query using a set of views.21 We shallsee in Section 6.3.1 that LPTA has several limitations. We first review LPTAand discuss how it can be adapted to produce certain answers when cachedviews are not complete rankings of tuples and no base views are available.In Section 6.3.2, we propose a new algorithm LPTA+ which overcomes thelimitations of LPTA.6.3.1 Algorithm LPTAIn [45], Das et al. first studied the problem of answering a top-k queryusing multiple views. Similar to the TA algorithm [53], the authors of [45]assume that the underlying database can be randomly accessed to retrievetuple attribute information using tuple ids, and that each view stores a list oftuple ids along with the scores. They focus on the scenario where either eachview is a complete ranking of all tuples in R, or the base views, which arecomplete rankings of all tuples in R according to the values of each attribute,are available. Thus a top-k query can always be answered exactly andcompletely. We next briefly review the LPTA algorithm presented in [45].Consider the score function f of each query/view also as a vector ~ffrom the origin O, representing the direction of increasing value. Giventhe assumption on the score function, the vector defined by any possible21Recall that they assume that views are complete rankings of tuples or that base viewsare available.966.3. LPTA-based kQAV Processingscore function considered will reside in the first quadrant. For now, we willassume that for every cached view Vi = (fi, ki), it is the case that ki = |R|and the tuples in Vi are sorted based on fi, or their projected values on ~fi. InFigure 6.3 (a), we show an example of the relation R from Example 6.1 whenprojected on the first two dimensions A and B. Given a query Q = (f, k),we can rank the tuples by projecting them onto ~f , as shown in the figure.Q: (f,k)V1: (f1,k1)V2: (f2,k2)ABT(1,1)O(0,0)(b)Q: (f,k)ABT(1,1)O(0,0)(a)t1t2t3t4t5t6Figure 6.3: Example of LPTA.Recall that we have a set V of p views, and assume that a set U ⊆ Vof r views has been selected in order to answer the query (we discuss theview selection problem below). In order to answer a top-k query Q = (f, k),the LPTA algorithm accesses tuples sequentially from the r views. For eachtuple t accessed, the algorithm performs a random access to the database inorder to retrieve the attribute value information of t. The current candidatetop-k results can be easily maintained from the accessed tuples. However,it is more challenging to find the maximum value τ that can be achieved byany unseen tuple, which is critical for the stopping criterion of the LPTAalgorithm. Let the last tuple accessed in each view Vi = (fi, ki) ∈ U bedenoted by t¯i, i = 1, . . . , r. As observed in [45], τ can be calculated bysolving the following linear programming (LP) problem:maxtτ = f(t)subject to: fi(t) ≤ fi(t¯i), i = 1, . . . , r0 ≤ t[i] ≤ 1, i = 1, . . . ,m (6.1)The “LP solver” is clearly more complex and time consuming than othercomponents in the LPTA algorithm, so instead of invoking this solver everytime a new tuple is accessed from a view Vi, LPTA accesses tuples from ther views in a lock-step fashion, i.e., the LP solver will be called once for everyr tuples accessed.976.3. LPTA-based kQAV ProcessingThe pseudocode for LPTA is given in Algorithm 18. We initialize apriority queueX based on the score function f ofQ (line 1) and the thresholdvalue τ (line 2). The algorithm then iteratively accesses tuples from the rviews in a lock-step fashion (lines 4–5). For each set of r tuples accessed,the algorithm finds the value of τ by solving the LP problem (Formula 6.1)(line 8). If the kth tuple X[k] in the priority queue has value no less thanτ , the algorithm can stop.Algorithm 18: LPTA(U = {V1, . . . , Vr}, Q = (f, k))1 X ← an empty priority queue;2 τ ← ∞;3 repeat4 {t¯1, . . . , t¯r} ← getNextTuple(U);5 retrieveTupleInfo({t¯1, . . . , t¯r});6 X.insert({t¯1, . . . , t¯r});7 X.keepTop(k);8 Find τ by solving the LP problem in Formula (6.1);9 until noNewTuple(U) or (|X| = k and f(X[k]) ≥ τ);As we will demonstrate in Section 6.3.2, while the cost of iterativelycalling the LP solver is reasonable when the dimensionality for the giveninput relation R is low, the cost increases significantly as the dimensionalitygrows. We will discuss in Section 6.3.2 how this increased cost can be avoidedby leveraging innate characteristics of the kQAV problem.Another problem that remains to be addressed in using LPTA is howto choose the r views from a potentially large pool of cached views, so thatquery processing cost can be minimized. As shown in [45], we need nomore than m views for processing a query on an m-dimensional relation R(so r ≤ m), and this view selection process is critical for the performanceof the LPTA algorithm. In [45], the authors first observe that for the 2-dimensional case, we can prune views by considering the angle between viewscore function vectors and the query score function vector. Given a queryscore function f and two view score functions f1, f2, if ~f1 and ~f2 are to thesame side of ~f , then we only need to select the view which has the smallerangle to ~f for answering the query, while the other view can be pruned. Forexample, consider the two cached views V1 and V2 along with query Q inFigure 6.3 (b). Because ~f1 has the smaller angle to ~f , Q can be answeredusing V1, while V2 can be pruned.However, this pruning technique may not be very useful for high dimen-986.3. LPTA-based kQAV Processingsional scenarios. As has been shown in recent work [109], the pruning ofviews in the general case may involve solving an LP problem whose numberof constraints is proportional to the total number of views. This is clearlynot practical when the number of views is large, but this is precisely the sit-uation that arises when we want to answer a query using previously cachedresults. Thus, in [45], the authors adopt a greedy strategy for selectingviews.The view selection algorithm ViewSelect in [45] can be described as fol-lows. Let U be the current set of views selected. ViewSelect will select thenext view to be added to U by using function EstimateCost to simulate theactual top-k query Q on the histograms [65] of the views in U and those ofthe remaining views. If there is no view which can improve the cost of thecurrent set of views, the algorithm stops and returns the current set of viewsselected.Since each call of the EstimateCost sub-procedure again involves solv-ing LP problems against the histograms of the corresponding cached views,the computational cost for view selection turns out to be very high. InSection 6.3.2 we will first improve LPTA by removing many of the calls tothe LP solver. Then in Section 6.3.3, we will show how we could use anLPTA-based algorithm for handling the general kQAV problem with top-kviews.6.3.2 Algorithm LPTA+The original LPTA algorithm relies heavily on repeatedly invoking the LPsolver for both view selection and query processing, since the number oftimes the LP solver will be invoked is proportional to the number of callsto the LPTA algorithm (on both views and histograms) multiplied by thenumber of tuples accessed from the views/histograms. This is especiallyproblematic when the dimensionality is high, since the cost of LP solverincreases significantly as dimensionality grows.To test this intuition, we conducted a preliminary experiment to measurethe relative contribution of the LP solver and other operations to the overallcost. For a randomly generated dataset, where each attribute value of atuple is chosen randomly from a uniform distribution, Figure 6.4 showshow query processing cost increases as dimensionality increases. The resultswere obtained by selecting from a pool of 100 randomly generated views,and by averaging the time of processing 100 randomly generated top-10queries, with all views cached in memory. As can be seen from the figure,the processing cost of the LP solver dominates the cost of other operations996.3. LPTA-based kQAV Processingin the LPTA algorithm. As the dimensionality grows, the cost of the LPsolver increases quickly while the cost of other operations remains essentiallyconstant. 0 0.05 0.1 0.15 0.2 0.25 0.3 0.352 3 4 5 6 7 8 9 10time(seconds)dimensionalityLP Solver CostOther Operation CostFigure 6.4: Query processing cost of LPTA as the dimensionality increases.An important question is whether all these invocations of the LP solverare actually necessary. We will soon see that, by taking advantage of the factthat the views are cached in memory and so can be accessed sequentiallywith very small overhead, it will be sufficient to solve the LP problem justa few times for most executions of the LPTA algorithm.To see this, we need to first to discuss how an LP solver works. Weassume in this chapter that the LP solver is based on the SIMPLEX algo-rithm [43], which is the most widely used LP algorithm. The general SIM-PLEX algorithm usually works in two phases. The goal of the first phaseis to find one feasible solution22 to the original problem, while the goal ofthe second phase is to find the optimal solution to the original problem.Because the formulation of our problem as represented by Formula (6.1) isin standard maximization form [43] (i.e., there are no constraints of the formw1t[1] + · · ·+wmt[m] ≥ θ except the non-negative variable constraints), thefirst phase of finding a feasible solution is essentially trivial. Thus we needto concentrate on the second phase of the SIMPLEX algorithm.We call each non-zero variable in a feasible solution a basic solution vari-able or BSV. In order to obtain the optimal solution in the second phase, weuse the pivoting technique, which essentially replaces one BSV by a variablewhich is not currently a BSV, in the hope that the target value τ can beincreased.Now recall from the LPTA algorithm in Section 6.3.1 that for every r tu-ples read, we need to solve a new LP problem. An interesting characteristicof this process is that, for every LP problem formulated, the only change is22A feasible solution to an LP problem is a solution which satisfies all the constraints.1006.3. LPTA-based kQAV Processingin the Right Hand Side (RHS) of Formula 6.1, specifically fi(ti); other partsof the constraints remain the same.This characteristic motivates us to consider the following improvementto the LPTA algorithm. As before, we start by solving the LP problemonce for the first set of r tuples accessed, deriving the BSVs for the optimalsolution in the process. Then, when new tuples are accessed, we can reusethe previously derived BSVs, and check whether they lead to the optimalsolution. If they do, then we have obtained the optimal solution for thenew LP problem without exploring different possible BSVs using pivoting,which can be very costly [43]. The check above can be done more efficientlythan pivoting. We note that this technique is different from previous workon Incremental Linear Programming [20], where the focus is on the moregeneral problem of adding/removing/updating constraints.The intuition behind the above optimization can also be illustrated usinggeometric properties. Consider the 2-dimensional example in Figure 6.5. Lett1 and t2 be the last two tuples accessed from V1 and V2 respectively. Theoptimal solution for the LP problem in Formula (6.1) can be obtained atvertex c of the convex polytope Oacb in Figure 6.5 (a). Since the values ofc on dimensions A and B are both positive, we know that A and B are theBSVs of the optimal solution. After we have accessed two new tuples t3, t4from V1, V2, we need to shift the two edges ac and bc of the convex polytopedown and left to a′c′ and b′c′, as shown in Figure 6.5 (b). Given the fact thatthe score functions of the cached views are all monotone, it is very likelythat, for the new convex polytope Oa′c′b′, the optimal solution will be at thevertex c′, which again has positive A and B values, and thus correspondsto the same BSVs. This shows that the optimal solution corresponding tothe new tuples can be obtained by choosing the same set of BSVs in theLP problem, i.e., we do not need to repeat the pivoting steps to find theoptimal BSVs.Q: (f,k)V1: (f1,k1)V2: (f2,k2)ABT(1,1)O(0,0)(a)t1t2Q: (f,k)V1: (f1,k1)V2: (f2,k2)ABT(1,1)O(0,0)(b)t1t2t3t4cc'aba'b'abcFigure 6.5: Example of LPTA+.1016.3. LPTA-based kQAV ProcessingThe pseudocode of the new LPTA+ algorithm is shown in Algorithm 19.Compared with LPTA, the difference lies in how the τ value is calculated(lines 8–15). For the first set of tuples accessed, we run the LP solver andderive the corresponding optimal BSVs B and τ (lines 8–9). After that, ineach iteration we check whether re-pivoting is needed by using the functionisValidOptimal to verify whether the existing BSVs lead to a new optimalsolution (lines 11–12); if they do, we derive τ directly, otherwise we solve theLP problem again and derive the new B and τ . Function isValidOptimalbasically pushes variables in B directly into the BSVs of the SIMPLEXalgorithm, and checks whether it forms a valid solution considering the newRHS vector. The overhead of this operation is small and can clearly avoidmany unnecessary pivoting steps in the SIMPLEX algorithm.Algorithm 19: LPTA+(U = {V1, . . . , Vr}, Q = (f, k))1 X ← an empty priority queue;2 τ ← ∞, B ← nil;3 repeat4 {t¯1, ..., t¯r} ← getNextTuple(U);5 retrieveTupleInfo({t¯1, . . . , t¯r});6 X.insert({t¯1, . . . , t¯r});7 X.keepTop(k);8 if B is nil then9 Compute the optimal BSVs B and τ using an LP solver;10 else11 derive new RHS vector b using {t¯1, . . . , t¯r};12 if isValidOptimal(U , B, b) then13 derive the new τ directly;14 else15 Compute the optimal BSVs B and τ using an LP solver;16 until noNewTuple(U) or (|X| = k and f(X[k]) ≥ τ);Since LPTA+ improves only the efficiency of calculating τ , we know thatboth LPTA and LPTA+ will examine the same number of tuples from U . Aswe will demonstrate in the experiments, the reuse of BSVs in LPTA+ usuallyhas a very small cost, and thus by avoiding many unnecessary pivoting steps,LPTA+ can be much more efficient than LPTA in practice.1026.3. LPTA-based kQAV Processing6.3.3 Handling the General kQAV ProblemAlthough LPTA+ can improve the efficiency of LPTA, we still need to extendit to handle the general kQAV problem, where we have only top-k viewsrather than complete rankings of tuples, and no base views.Our first observation is that, given a fixed set of views U = {V1, . . . , Vr},we can find all the certain tuples from U by using the LPTA+ algorithmwith the following simple modifications: (1) if the algorithm stops beforeall tuples in U are exhausted, we have already found a set of top-k certainanswers for the query, since every possible unseen tuple will have a value nobetter than the current top-k results; (2) if we have exhausted all tuples inU , let τ be the threshold value derived from the last tuple of each view; if weremove from the candidate top-k queue all tuples which have value smallerthan τ , then the remaining tuples in the queue are guaranteed to be certainanswers. Similar to the first case, the pruning of the tuples in the candidatetop-k queue here is sound because τ indicates the maximum value that canbe achieved by an unseen tuple, say t. Every tuple t′ which is pruned hasa value less than τ , so there exists a possible relation instance R which isscore consistent with U , and at the same time contains an unlimited supplyof tuples that have the same attribute values as t. Thus t′ cannot become atop-k result for this R since it will be dominated by t.Now one question is whether, given a set of cached views, we can finda minimal subset of views which can give us the maximum set of certainanswers to the query Q = (f, k) (up to a total of k). Unfortunately asdiscussed in [45], an obvious algorithm to determine the best subset of viewshas a high complexity since we need to enumerate all possible combinationsof r views. Instead, following the heuristics proposed in [45], we proposethe modification to the LPTA/LPTA+ algorithm described below. Thismodification guarantees that we will find the maximum set of certain answersto Q and that its complexity is linear in the number of views, but it doesnot guarantee that the number of views used is minimal.Consider the second case above (we do not need any changes to thefirst case since that already finds a set of top-k certain answers). Insteadof pruning tuples which have a value less than τ , we keep these candidatetuples and iteratively consider each of the remaining views. For each viewV ′ ∈ V − U , we investigate all tuples in V ′ one by one, replacing existingcandidate tuples with them whenever they have higher value with respect toQ; meanwhile, we try to refine the threshold value τ by considering the lasttuple accessed in V ′. During this process, if we have k candidate answerswhich have value larger than or equal to τ , we know we have found the top-k1036.4. IV-Index based Top-k QAVcertain answers; otherwise, if all views have been exhausted, we can get themaximum set of certain answers by pruning from the candidate queue thosewhich have value less than τ .It is straightforward to see that the above heuristic, when used in con-junction with LPTA instead, gives us a procedure for finding all certainanswers to Q (up to a maximum of k). Thus, LPTA can be used to findcertain answers even when base views or complete tuple rankings are notavailable.6.4 IV-Index based Top-k QAVThough the LPTA+ algorithm proposed in Section 6.3 improves the effi-ciency of the original LPTA algorithm by avoiding unnecessary pivotingoperations, the algorithm still needs to invoke the LP solver multiple times,during both view selection and query processing. When the underlying rela-tion has high dimensionality, the cost of LP solver calls can be considerable.This motivates the quest for an even more efficient algorithm for finding thecertain answers.To this end, we propose a simple index structure, called the InvertedView Index (IV-Index). Using this index greatly reduces the number ofinvocations of the LP solver, allowing all certain answers in Q(V) to bereturned quickly.6.4.1 Inverted View IndexGiven the set V of cached views, we first collect all tuples in these views intoan Inverted View Index (IV-Index) I = (T ,HV ,Ht). The components of theindex are as follows: Ht is a lookup table which returns the attribute valueinformation for a tuple given its id; HV is a lookup table which returns thedefinition of a view, and T is a high-dimensional data structure. In thiswork, we utilize a kd-tree as the underlying high-dimensional data structureas it has been shown to have the most balanced performance compared withother high-dimensional indexing structures [24]. However, we note that thetechniques we propose can be easily adapted to utilize quad-trees or otherindexing structures.Each node g in the kd-tree T represents an m-dimensional region, withthe root node groot of T representing the entire region from (0, . . . , 0) to(1, . . . , 1). The kd-tree is built as follows. Starting from the root node, werecursively partition the region associated with the current node g into twoparts based on a selected dimension and a splitting hyperplane. These two1046.4. IV-Index based Top-k QAVsub-regions are represented by two nodes which will become the childrenof g in T . Once this recursive process has completed, the disjoint regionsrepresented by the leaf nodes of T form a partitioning of the whole m-dimensional space. An example of a kd-tree along with the partitioning isshown in Figure 6.6.A20A111g1V1V1[k1]g3g2g4g9g10g12g11g14g13g7g8g15g16g6g5g1g2g3g4g13g14g15g16……(a) kd-tree (b) space partitionABFigure 6.6: Example of (a) a kd-tree, and (b) the corresponding partitionof 2-dimensional space.For a node g, without ambiguity, we also use g to denote the regionassociated with the node. To facilitate query processing, we associate eachleaf node g of T with a set Tg of tuple ids (tids), corresponding to tuples inthe cached views that belong to g. Given a node (region) g, let the valuerange of g on each of the m dimensions be [g1l , g1u], . . ., [gml , gmu ], and lett`g = (g1l , . . . , gml ) and tag = (g1u, . . . , gmu ). Then for any monotone functionf , it is clear that the maximum (minimum) value that can be achieved byany tuple in g is f(tag ), (resp., f(t`g )).Since the set of top-k views cached in the memory may not cover thecomplete set of tuples in the database, it is clear that we may only have“partial” knowledge about regions associated with some leaf nodes in T .Let R be any relation that is score consistent with V. Given a region g,let Rg denote the set of tuples in R whose values fall inside g. Then wesay that a region g is complete, or κ(g) = true, if Tg = Rg for every score-consistent relation R; otherwise we say that g is partial, or κ(g) = false.This is a semantic property and it is expensive to check it directly. Asufficient condition for checking the completeness of a region g is given inthe following lemma.Lemma 11 A region g is complete if there exists a top-k cached view Vi =(fi, ki) in V for which fi(Vi[ki]) < fi(t`g ).Proof If fi(Vi[ki]) < fi(t`g ), then clearly for any score-consistent relation1056.4. IV-Index based Top-k QAVR, ∀t ∈ Rg, fi(Vi[ki]) < fi(t). So according to the definition of top-k cachedview, all tuples in Rg must belong to V ; hence Rg = Tg.A 2-dimensional example of Lemma 11 is shown in Figure 6.6 (b). Thisexample shows a vector ~f1 corresponding to a view V1 = (f1, k1) along withthe k1’th tuple V1[k1] from the view. If we draw a line AB through V1[k1]which is perpendicular to ~f1, we can observe that t`g1 and t`g3 are above AB;thus f1(V1[k1]) < f1(t`g1), f1(V1[k1]) < f1(t`g3), and g1, g3 are complete. Onthe other hand, f1(V1[k1]) > f1(t`g2), so if the only top-k cached view wehave is V1, we are not able to determine whether g2 is complete or not. Thisis because we do not have enough information about the part of g2 whichis below AB. If R contains no tuple which falls inside this region, g2 iscomplete; however, if R does contain tuples which fall inside this region, g2is partial.We note that it is not possible to derive a necessary and sufficient con-dition for checking the completeness of a region given only the top-k cachedviews. This is because we will have to consult the original database R tocheck whether the regions which cannot be decided using Lemma 11, e.g., g2in above example, are complete or not. Obviously this process can be expen-sive, and more importantly, it is against the purpose of our kQAV frameworkwhich is to answer queries using only top-k cached views. So we will simplylabel regions whose completeness cannot be decided by Lemma 11 as partial.Alternative weaker sufficient conditions for completeness checking are left asfuture work.Consider a partial leaf node g in T for a top-k cached view V1 = (f1, k1).If f1(t`g ) ≤ f1(V1[k1]) ≤ f1(tag ) (i.e., the hyperplane which crosses V1[k1]and is perpendicular to ~f1 intersects with g), we will store a pair p =(V1, V1[k1].id) in a cross view set Pg associated with g. In p, the first entryis a pointer to the definition of V1, while the second entry is the tuple id ofV1[k1]. If no such views exist, i.e., the view is complete, Pg = ∅. Considerthe example of Figure 6.6 (b), and suppose that V1 is the only top-k cachedview. Then (V1, V1[k].id) is in Pg2 as well as in Pg4 ,Pg5 ,Pg6 ,Pg7 and Pg9 .6.4.2 IV-Search AlgorithmGiven an IV-Index I, a top-k query Q = (f, k) can be answered by traversingthe corresponding kd-tree of I using a strategy such as best-first search [108].The pseudocode of our first algorithm, called IVS-Eager, is given inAlgorithm 20. The algorithm traverses the kd-tree T by visiting first thosenodes which have larger maximum value with respect to Q (lines 3–17), as1066.4. IV-Index based Top-k QAVindicated by f(tag ), since these nodes may have good potential to containtuples which have high value with respect to Q. If the current node g is aleaf node, then we extract all tuples within g and check whether they canbecome new candidate top-k results (lines 9–11). In addition, if a leaf nodeg is partial, we need to collect information from Pg, which defines the regionof the unseen tuples which cannot be covered by the top-k cached views, andsolve a linear programming problem to find the maximum value that canbe achieved by any unseen tuples in g (lines 12–13). Finally, if the currentnode g has its maximum value f(tag ) less than or equal to f(Xr[k]), whichis the value of the kth candidate tuple in Xr, the algorithm can stop, sinceaccording to the best-first search strategy, any unseen nodes cannot containa tuple which is better than Xr[k] (line 14).Algorithm 20: IVS-Eager(I=(T ,HV ,Ht),Q = (f, k))1 Xn ← an empty priority queue for kd-tree nodes;2 Xr ← an empty priority queue for candidate results;3 Xn.enqueue(groot, f(tagroot));4 τ ←∞;5 while ¬Xn.isEmpty() do6 g ← Xn.dequeue();7 τ ← min(τ , f(tag ));8 if isLeaf(g) then9 foreach t ∈ g do10 Xr.enqueue(t, f(t));11 Xr.keepTop(k);12 if ¬κ(g) then13 τ ← min(τ , LPSolve(Pg, Q));14 if |Xr| = k ∧ f(Xr[k]) ≥ f(tag ) then break ;15 else16 foreach gc ∈ children(g) do17 Xn.enqueue(gc, f(tagc));18 return {t | t ∈ Xr ∧ f(t) ≥ τ};The correctness of IVS-Eager follows from the best-first search strategy,since every unseen tuple will have value smaller than Xr[k] with respect toQ. In addition, the updating of the threshold value τ ensures that everytuple returned is a certain answer.1076.4. IV-Index based Top-k QAVOne inefficiency in IVS-Eager is that, for every partial leaf node encoun-tered, we need to invoke an LP solver to update the threshold value τ . Thiscan be expensive for the following two reasons: first, as shown in the exampleof Figure 6.6 (b), each top-k cached view might be stored in the cross viewset of many nodes, so there might be duplicated computation if we solve theLP problem for every node individually; second, when the dimensionality ishigh, the number of such partial nodes will be large. In the following, wepropose another algorithm, called IVS-Lazy, which needs to solve only one(potentially larger) LP problem.Algorithm 21 lists the pseudocode of IVS-Lazy. The difference with IVS-Eager is that whenever a partial leaf node g is encountered in IVS-Lazy, westore the cross view set of g in a cache Cn (line 13) rather than immediatelysolve the LP problem and update the threshold τ as is done in IVS-Eager.After we have exhausted all nodes in the kd-tree, we collect all the viewinformation in Cn and solve a single LP problem (lines 18–19).Further Optimization via View PruningAs can be observed from Algorithms IVS-Eager and IVS-Lazy, a criticaloperation in both algorithms is to collect constraints from the cross viewset(s), and solve the LP problem given the query and constraints. Since thecomplexity of an LP problem may increase considerably with respect to thenumber of constraints, pruning constraints which are not useful can be veryimportant for the overall query performance.Let Q = (f, k) be the query to be processed, and assume that we haveaccessed more than k tuples, i.e., |Xr| ≥ k, using each of the two IV-Indexbased search algorithms. Now consider the point at which we solve the LPproblem, i.e., line 13 in IVS-Eager and line 19 in IVS-Lazy. Let tmin = Xr[k]be the current kth tuple in Xr, and let V = (f ′, k′) be a view from thecorresponding cross view set. According to the definition, for any tuplet /∈ V , we have f ′(t) ≤ f ′(V [k′]), so the maximum value that can be achievedby any such tuple can be calculated using the following LP problem:maxtφ = f(t)subject to: f ′(t) ≤ f ′(V [k′])0 ≤ t[i] ≤ 1, i = 1, . . . ,m (6.2)Let f(t) = w1t[1] + · · ·+ wmt[m], and f ′(t) = w′1t[1] + · · ·+ w′mt[m]. Acareful inspection of the above LP formulation will reveal that it is exactlythe Fractional Knapsack Problem (or Continuous Knapsack Problem) [71].1086.4. IV-Index based Top-k QAVAlgorithm 21: IVS-Lazy(I = (T ,HV ,Ht), Q = (f, k))1 Xn ← an empty priority queue for kd-tree nodes;2 Xr ← an empty priority queue for candidate results;3 Cn ← an empty cache for partial leaf nodes;4 Xn.enqueue(groot, f(tagroot));5 τ ←∞;6 while ¬Xn.isEmpty() do7 g ← Xn.dequeue();8 τ ← min(τ , f(tag ));9 if isLeaf(g) then10 foreach t ∈ g do11 Xr.enqueue(t, f(t));12 Xr.keepTop(k);13 if ¬κ(g) then Cn.add(Pg) ;14 if |Xr| = k ∧ f(Xr[k]) ≥ f(tag ) then break ;15 else16 foreach gc ∈ children(g) do17 Xn.enqueue(gc, f(tagc));18 P ← consolidateCrossViewSets(Cn);19 τ ← min(τ , LPSolve(P, Q));20 return {t | t ∈ Xr ∧ f(t) ≥ τ};1096.4. IV-Index based Top-k QAVIn this problem, we are given a set of items o1, . . . , om, where each item oi,1 ≤ i ≤ m, has weight w′i and value wi, and we are asked to pack theminto a knapsack with maximum weight f ′(V [k′]) such that the total value ismaximized, while allowing fractions of an item to be put into the knapsack.It is well known that a greedy algorithm which accesses items ordered byutility (value divided by weight) finds the optimal solution for the fractionalknapsack problem, in linear time. Thus we utilize the following AlgorithmFKP to find the maximum value φ which can be achieved by an unseentuple with respect to a view V and a query Q. If this value φ is less thanf(tmin) (the value of the current kth tuple in Xr), we can safely prune Vfrom consideration in both IVS-Eager and IVS-Lazy when checking crossview sets.Algorithm 22: FKP(Q = (f, k), V = (f ′, k′))1 l ← {(i, ui ← wiw′i ) | 1 ≤ i ≤ m};2 Sort tuples in l based on utility;3 φ ← 0, B ← f ′(V [k′]);4 for (i, ui) ∈ l do5 if w′i ≥ B then6 φ ← φ+ uiB;7 break;8 else φ ← φ+ wi, B ← B − w′i ;9 return φ;6.4.3 DiscussionSince we usually prefer the cached views to reflect the most recent andpopular queries, and the memory consumption of the index structure needsto be bounded, a mechanism for cache replacement is necessary. There ismuch previous work on good strategies for cache/buffer replacement [42, 68],so in this work we will assume that a cache replacement strategy has beenspecified. Instead, we will only discuss how the basic operations of insertingand deleting a view might be implemented using the IV-Index.To handle view insertion and deletion, we could associate with each tuplet cached in the memory a count c(t), indicating how many views contain t.In addition, we could associate with each node g a count c(g), specifying howmany views cover g, or make g a complete node, according to Lemma 11.1106.5. Empirical ResultsFirst consider inserting a new top-k cached view V = (f, k). For each tuplet ∈ V , we set c(t) = c(t) + 1 and insert t into the kd-tree if necessary. Tochange the completeness status of nodes affected by V in the kd-tree, wecould use a best-first strategy to find each node g for which f(tag ) > f(V [k]),and set c(g) = c(g) + 1. Similarly, when deleting a top-k cached viewV = (f, k), we could use a best-first strategy to find each node g for whichf(tag ) > f(V [k]), and set c(g) = c(g) − 1. In addition, we find each cachedtuple t ∈ V for which f(t) > f(V [k]), set c(t) = c(t)− 1 and remove it fromcache when c(t) = 0.6.5 Empirical ResultsIn this section, we study the performance of various algorithms for the kQAVproblem based on one real dataset of NBA statistics and four syntheticdatasets. The goals of our experiments are to study: (i) the performance ofthe LPTA-based algorithms, and by how much LPTA+ improves the state-of-the-art LPTA algorithm; (ii) the relative performance of the lazy andeager versions of the IV-Index-based algorithm, and to what extent theyoutperform LPTA+; (iii) the effectiveness of the pruning process proposedin Section 6.4.2. We implemented all the algorithms in Python, and thelinear programming solver is based on a variation of LinPro 23, and allexperiments were run on a Linux machine with a 4 Core Intel Xeon CPU,OpenSuSE 12.1, and Python 2.7.2.The NBA dataset is collected from the Basketball Statistics website [6],which contains the career statistics information of NBA players until 2009;each attribute of the dataset corresponds to one major statistics for NBApaperly, e.g., points per game. The NBA dataset has 3705 tuples and weselected 10 attributes to be used in our experiments. The synthetic datasetsare generated by adapting the benchmark generator proposed in [25]. Theuniform (UNI) dataset and the powerlaw (PWR) dataset are generated byconsidering each attribute independently. For UNI, attribute values aresampled from a uniform distribution, and for PWR, attribute values aresampled from a power law distribution with α = 2.5 and normalized into therange [0, 1]. In the correlated (COR) synthetic dataset, values from differentattributes are correlated with each other, while in the anti-correlated (ANT)synthetic dataset, values from different attributes are anti-correlated witheach other. Each synthetic dataset is over 10 attributes and has 100000tuples.23http://www.cdrom.com/pub/MacSciTech/programming/ (visited on 03/18/2013)1116.5. Empirical Results 0 1 2 3 4 5 62 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(a) RND (1000)LPTALPTA+ 0 1 2 3 4 5 62 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(b) PWR (1000)LPTALPTA+ 0 0.5 1 1.5 2 2.5 3 3.5 42 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(c) COR (1000)LPTALPTA+ 0 1 2 3 4 5 62 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(d) ANT (1000)LPTALPTA+ 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.82 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(e) NBA (1000)LPTALPTA+ 0 1 2 3 4 5 62 3 4 5 6 7 8 9 10Query proc. time(s)Num. of dimensions(f) RND (100)LPTALPTA+ 0 1 2 3 4 5 6 72 3 4 5 6 7 8 9 10Query proc. time(s)Num. of dimensions(g) PWR (100)LPTALPTA+ 0 1 2 3 4 5 6 7 82 3 4 5 6 7 8 9 10Query proc. time(s)Num. of dimensions(h) COR (100)LPTALPTA+ 0 1 2 3 4 5 62 3 4 5 6 7 8 9 10Query proc. time(s)Num. of dimensions(i) ANT (100)LPTALPTA+ 0 0.5 1 1.5 2 2.5 3 3.5 4 4.52 3 4 5 6 7 8 9 10Query proc. time(s)Num. of dimensions(j) NBA (100)LPTALPTA+Figure 6.7: LPTA vs. LPTA+: (a–e) results on 5 datasets with each viewcontaining 1000 tuples; (f–j) results on 5 datasets with each view containing100 tuples.Weights for the score functions in all views are generated randomly,and all views are cached in memory. Similar to previous work on LPTA-based algorithms [45], the size of the histograms used for estimation is setto be roughly 1% of the size of the corresponding dataset. For the IV-Index-based approach, we set the number of tuples in the leaf nodes of thekd-tree to be less than or equal to 50. Alternative configurations for thekd-tree were also tested with similar results and so are omitted here for lackof space. Finally, the query score functions are also generated randomly,and all results reported here are based on an average of the results fromprocessing 100 queries.6.5.1 LPTA-based AlgorithmsIn Figure 6.7, we compare the performance of LPTA and LPTA+ for querieswhich ask for the top-100 tuples using a set of 100 views. Figure 6.7 (a–e)considers the setting in which each view contains 1000 tuples. We can seethat, for all five datasets, LPTA+ is much faster than LPTA in most cases.Similar results are obtained for the setting in which each view contains 100tuples (Figure 6.7 (f–j)). However, we note that for this setting, queryprocessing time is longer because now the views contain fewer tuples, so weneed to check more additional views in order to guarantee that we will findall the certain answers in Q(V) w.r.t. the query Q.In Figure 6.8 (a), we compare the performance of LPTA and LPTA+when varying the number of views in the cache pool. Here we fix the numberof dimensions at 5, and consider queries where k is randomly selected from 101126.5. Empirical Resultsto 100. As can be seen from this figure, the performance of both algorithmsdegenerates as the number of views increases. However, LPTA+ is still twiceas fast as LPTA in most settings. This result was obtained using the RNDdataset. Very similar results were obtained for the other datasets and fordifferent dimensionality settings, and are thus omitted. 0 10 20 30 40 50 60 70 80 90100 200 300 400 500Query Proc. time(s)Number of views(a) LPTA vs. LPTA+LPTALPTA+ 0 0.5 1 1.5 2 2.5 3 3.5100 200 300 400 500Query Proc. time(s)Number of views(b) IVS-Eager vs. IVS-LazyIVS-EagerIVS-LazyFigure 6.8: When varying the number of views on the RND dataset, theperformance comparison between: (a) LPTA and LPTA+; (b) IVS-Eagerand IVS-Lazy.In Figure 6.9 (a), we compare the performance of LPTA and LPTA+when varying the value k in each query from 10 to 100, given 100 viewsand by fixing the the dimensionality at 5. Similar to the previous results,the performance of both algorithms degenerates as k increases, but LPTA+is still faster than LPTA for all settings. The results obtained for datasetsother than RND are very similar. We discuss Figures 6.8(b) and 6.9(b)below. 0 0.5 1 1.5 2 2.5 310 20 30 40 50 60 70 80 90 100Query Proc. time(s)k value in the query(a) LPTA vs. LPTA+LPTALPTA+ 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.1810 20 30 40 50 60 70 80 90 100Query Proc. time(s)k value in the query(b) IVS-Eager vs. IVS-LazyIVS-EagerIVS-LazyFigure 6.9: When varying the value k of a query on the RND dataset, theperformance comparison between: (a) LPTA and LPTA+; (b) IVS-Eagerand IVS-Lazy.Although LPTA+ can greatly outperform LPTA, it can be observed thatthe query processing cost for LPTA+ is still high.1136.5. Empirical Results 0 0.05 0.1 0.15 0.2 0.25 0.32 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(a) RND (1000)IVS-EagerIVS-Lazy 0 0.05 0.1 0.15 0.2 0.25 0.32 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(b) PWR (1000)IVS-EagerIVS-Lazy 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.42 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(c) COR (1000)IVS-EagerIVS-Lazy 0 0.05 0.1 0.15 0.2 0.25 0.32 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(d) ANT (1000)IVS-EagerIVS-Lazy 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.162 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(e) NBA (1000)IVS-EagerIVS-Lazy 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 0.22 0.24 0.262 3 4 5 6 7 8 9 10Query proc. time(s)Num. of dimensions(f) RND (100)IVS-EagerIVS-Lazy 0 0.05 0.1 0.15 0.2 0.25 0.32 3 4 5 6 7 8 9 10Query proc. time(s)Num. of dimensions(g) PWR (100)IVS-EagerIVS-Lazy 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.452 3 4 5 6 7 8 9 10Query proc. time(s)Num. of dimensions(h) COR (100)IVS-EagerIVS-Lazy 0 0.05 0.1 0.15 0.2 0.25 0.32 3 4 5 6 7 8 9 10Query proc. time(s)Num. of dimensions(i) ANT (100)IVS-EagerIVS-Lazy 0 0.05 0.1 0.15 0.2 0.25 0.3 0.352 3 4 5 6 7 8 9 10Query proc. time(s)Num. of dimensions(j) NBA (100)IVS-EagerIVS-LazyFigure 6.10: IVS-Eager vs. IVS-Lazy: (a–e) results on 5 datasets with eachview containing 1000 tuples; (f–j) results on 5 datasets with each view con-taining 100 tuples.6.5.2 IV-Index-based AlgorithmsFigure 6.10 shows the experimental results of the IVS-Eager and IVS-Lazyalgorithms under the same settings as in Figure 6.7. Compared with theresults of LPTA-based algorithms in Figure 6.7, we can readily see that theIV-Index-based approaches are orders of magnitude faster than the LPTA-based approaches under all circumstances. From Figure 6.10 (a–j), we canalso observe that, in most cases, IVS-Lazy is much faster than IVS-Eager,since it saves many calls to the LP solver. The only exception is for low-dimensional cases where both algorithms have a very small query processingcost. The advantage of IVS-Lazy especially applies for the high-dimensionalcases where more nodes in the kd-tree are partial. Similar to the results ofthe LPTA-based algorithms, both IVS-Eager and IVS-Lazy are much fasteron views which contain more tuples, simply because they need to check fewerpartial nodes in the kd-tree.When we vary the number of views and when we vary the number k ineach query, as can be observed from Figure 6.8 (b) and Figure 6.9 (b), theperformance of IV-Index-based algorithms are orders of magnitude fasterthan the LPTA-based algorithms. The running time of both IVS-Eager andIVS-Lazy increases as the number of views increases, or as k increases, aswith all algorithms. However, IVS-Lazy has consistently better performancethan IVS-Eager.1146.6. Related Work 0 0.05 0.1 0.15 0.22 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(a) RND (100)IVS-EagerIVS-Eager and PruningIVS-LazyIVS-Lazy and Pruning 0 0.05 0.1 0.15 0.22 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(b) PWR (100)IVS-EagerIVS-Eager and PruningIVS-LazyIVS-Lazy and Pruning 0 0.05 0.1 0.15 0.2 0.252 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(c) COR (100)IVS-EagerIVS-Eager and PruningIVS-LazyIVS-Lazy and Pruning 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.182 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(d) ANT (100)IVS-EagerIVS-Eager and PruningIVS-LazyIVS-Lazy and Pruning 0 0.05 0.1 0.15 0.2 0.252 3 4 5 6 7 8 9 10Query Proc. time(s)Num. of dimensions(e) NBA (100)IVS-EagerIVS-Eager and PruningIVS-LazyIVS-Lazy and PruningFigure 6.11: Pruning effectiveness test of IV-Search algorithms based on thefive datasets.6.5.3 Effectiveness of PruningFinally, in Figure 6.11, we show the effectiveness of the pruning techniquesproposed in Section 6.4.2. In this experiment, we fix the number of tuples ineach view to be 100, and for each query Q = (f, k), k is a random numberwithin [10, 100]; for other settings of these parameters, the results obtainedare very similar. As can be seen from the figure, the pruning technique canimprove the performance of both IV-Search algorithms. Notice that for var-ious dimensionality settings, the overall performance of IVS-Lazy/Pruningis consistently the best on all five datasets.6.6 Related WorkFor general top-k query processing, the most popular approach is the Thresh-old Algorithm (TA) / No Random Access Algorithm (NRA) as proposed byFagin et al. in [53]. While TA and NRA differ in whether random accessto the database is allowed, this family of algorithms usually share a simi-lar query processing framework which accesses tuples from the database ina certain order, while maintaining an upperbound on the maximum valuethat can be achieved by the tuples that have not yet been accessed. If thecurrent top-k result has a value no less than the best value achievable byany unseen tuple, the algorithm can stop. Recently, various improvementsto the original algorithms such as the Best Position Algorithm [18] havebeen proposed, while variations of top-k queries such as Rank Join [62] andContinuous Top-k Queries [128] have been studied. Finally, Li et al. studyhow top-k algorithms might be implemented in a relational database [86].An excellent survey on top-k query processing can be found in [63].Hristidis et al. [60] first considered the problem of using views to speedup top-k query processing. They focused on finding one best view whichcan be used for answering a query. As mentioned in [45], their setting is1156.6. Related Workquite restrictive as it cannot exploit multiple views, and it also assumesthat all attributes of the underlying base table are always utilized for alltop-k queries. Das et al. [45] propose a novel algorithm, called LPTA, whichovercomes the limitations of [60] by utilizing multiple views for answering atop-k query.It can be verified that the kQAV problem defined here is a generalizationof the kQAV problem as considered in [45]. This is because the core tech-niques proposed in [45] rely on the assumption that either each top-k viewVi = (fi, ki) ∈ V is a complete ranking of all tuples in R, i.e., ki = |R|; or thebase views, which are complete rankings of all tuples in R according to thevalues of each attribute, are available. We make no such assumptions in oursetting. That said, we can easily adapt our algorithms to work in settingswhere we do have base views available or all views are complete rankings ofall tuples.In [21], the authors consider the problem of whether a top-k query canbe answered exactly using a set of top-k views, which resembles the classicalquery containment problem in databases [35]. However, this work does notaddress the general kQAV problem, i.e., return a maximum set of certainanswers, in case V cannot answer a query Q exactly. Ryeng et al. [109]extend the techniques proposed in [45] and [21] to answer top-k queries ina distributed setting. They assume that access to the original database isavailable through the network interface, thus exact top-k answers can alwaysbeen found by forming a “remainder” query which can be utilized to fetchtuples not available in the views. We note that the focus of our work ison efficient algorithms for finding answers to the kQAV problem, where theoriginal database is not accessible. Should it be accessible, we can adapt thetechniques proposed in [109] to find the additional answers in the case whereour algorithms cannot find enough certain tuples from the cached views.In addition to leveraging views, an alternative way of optimizing top-k query processing is through a Layered Index [36, 59, 81, 127]. Theseapproaches try to organize tuples in the database into an layered indexstructure. We can quickly obtain the answers to a top-k query by accessingjust the first few layers of the index. First, we note that our proposed IV-Index is significantly different from the layered index, since it is based ona standard space partitioning index such as kd-tree. Furthermore, theselayered indexes all assume access to the original database is available, so aredifficult to adapt to scenarios where we have no access to the database.1166.7. Application in Composite Recommendation6.7 Application in Composite RecommendationFor composite recommendation with soft constraint as discussed in Chap-ter 5, we query top-k packages using algorithm TopPkg (Algorithm 15) forevery weight sample w. Since there is no hard constraint, TopPkg onlyconsiders the value of every package w.r.t. w using its aggregate feature val-ues. Thus given a static item dataset, we could treat each cached packageas an atomic item, and directly apply the kQAV algorithms developed inthis chapter to leverage cached top package results. Considering the factthat in the preference elicitation framework, we need to iteratively samplenew weight vectors in order to accommodate new implicit feedbacks, thekQAV-based optimization can be extremely useful.For composite recommendation with hard constraint as discussed inChapter 3 and Chapter 4, because of the additional constraint, even thoughsome previously cached top-k package results of query Q are still the top-kpackages under the utility function of a new query Q′, these top-k packagesmay not satisfy the constraint associated with Q′. In this section, we proposeextensions to our proposed kQAV algorithms which can handle associatedconstraints with each query.Consider composite recommendation with hard constraint and flexibleschema discussed in Chapter 4, where we focus on the budget constraint(time budget and monetary budget). For IVS-based algorithms in Sec-tion 6.4, let every item here be a package which was cached due to a previousquery, then we could easily associate each package p in the grid with its ac-tual cost, thus when visiting p during query processing, we could simplycheck whether p satisfies the new query constraint. If it does, then we cancontinue as is with items. If it does not, as illustrated in Algorithm 23 line13–14, we mark this current cell as partial, and add p to the set Sg whichcontains all packges in g which do not satisfy the constraint in Q. After allitems in the current cell g have been processed, in line 17 of Algorithm 23, ifthe current cell is determined to be partial, we ignore those packages whichdo not satisfy the constraint of Q, and solve a LP problem to find an up-perbound value on the possible utility value which can be achieved by anypackages satisfying constraint of Q in this cell.For composite recommendation with hard constraint and fixed schemaas discussed in Chapter 3, we could create a separate IV-Index for everypossible schema which will be used in the corresponding application, thensince every query and package in an IV-Index has the same join condition,we could simply handle all remaining aggregation constraints in a similarway as in IVS-Eager-C.1176.7. Application in Composite RecommendationAlgorithm 23: IVS-Eager-C(I=(T ,HV ,Ht),Q = (f, k,B))1 Xn ← an empty priority queue for kd-tree nodes;2 Xr ← an empty priority queue for candidate results;3 Xn.enqueue(groot, f(tagroot));4 τ ←∞;5 while ¬Xn.isEmpty() do6 g ← Xn.dequeue();7 τ ← min(τ , f(tag ));8 if isLeaf(g) then9 Sg ← ∅;10 foreach t ∈ g do11 Xr.enqueue(t, f(t));12 if t does not satisfy constraint B then13 κ(g) ← false;14 Sg.add(t);15 Xr.keepTop(k);16 if ¬κ(g) then17 τ ← min(τ , LPSolve(Pg\Sg, Q));18 if |Xr| = k ∧ f(Xr[k]) ≥ f(tag ) then break ;19 else20 foreach gc ∈ children(g) do21 Xn.enqueue(gc, f(tagc));22 return {t | t ∈ Xr ∧ f(t) ≥ τ};118Chapter 7Summary and FutureResearch7.1 SummaryMotivated by several applications such as trip planning, E-commerce, andcourse planning, we study the problem of composite recommendation inthis thesis, in which instead of focusing on finding items which might sat-isfy users’ needs, we consider how packages of items can be automaticallyrecommended to the user. Based on different requirements from differentapplications, we explore in this thesis several possible variations of the com-posite recommendation problem.In Chapter 3, we consider applications where schemas of the underly-ing packages are pre-defined, and formulate the composite recommendationproblem as an extension to rank join with aggregation constraints. By ana-lyzing their properties, we develop deterministic and probabilistic algorithmsfor their efficient processing. In addition to showing that the deterministicalgorithm retains the minimum number of accessed tuples in memory ateach iteration, we empirically showed both our deterministic and probabilis-tic algorithms significantly outperform the obvious alternative of rank joinfollowed by post-filtering in many cases and that the probabilistic algorithmproduces results of high quality while being at least twice as fast as thedeterministic algorithm.In Chapter 4, we consider applications where schemas of the underlyingpackages can be flexible, and focus on an important class of aggregationconstraints based on budgets. We establish that the problem of finding thetop package is intractable since it is a variant of the Knapsack problem,with the restriction that items need to be accessed in value-sorted order.We developed two approximation algorithms InsOpt-CR-Topk and Greedy-CR-Topk that are designed to minimize the number of items accessed. ForInsOpt-CR-Topk, we show that every 2-approximation algorithm for theproblem must access at least as many items as this algorithm. And for1197.1. SummaryGreedy-CR-Topk, though it is not guaranteed to be instance optimal, but ismuch faster. We experimentally evaluated the performance of the algorithmsand showed that in terms of the quality of the top-k packages returned bothalgorithms are close to each other and deliver high quality packages; interms of the number of items accessed Greedy-CR-Topk is very close toInsOpt-CR-Topk, but in terms of running time, Greedy-CR-Topk is muchfaster. We also showed that using histogram-based information about itemcosts can further reduces the number of items accessed by the algorithmsand improves their running time. The proposed algorithm can be extendedto handle cases where the budget may depend on the order of items beingconsumed by considering the underlying problem as an orienteering problem.In Chapter 5, we study how composite recommendation is possible usingsoft constraints. Following [27, 34], we assume the system does not have thecomplete information about user’s utility function. We cab leverage the ex-isting preference elicitation frameworks for eliciting preferences from users,however, the challenge here is how we can perform the elicitation efficiently,especially considering the fact that we are reasoning about utilities of combi-nations of items. We propose several sampling-based methods which, givenuser feedback, can capture the updated knowledge of the underlying util-ity function. Finally, we also study various package ranking semantics forfinding top-k packages, using the learned utility function.Finally, we discuss in Chapter 6 a general optimization procedure basedon cached views which can benefit various proposed composite recommenda-tion algorithms. We show that the state-of-the-art algorithm for answeringtop-k queries using cached views, LPTA [45], suffers because of iterative callsto a linear programming sub-procedure. This can be especially problematicwhen the number of views is large or if the dimensionality of the datasetis high. By observing an interesting characteristic of the LPTA framework,we proposed LPTA+ which has greatly improved efficiency over LPTA. Weadapted both algorithms so they work in our kQAV setting, where views arenot complete tuple rankings and base views are not available. Furthermore,we proposed an index structure, called IV-Index, which stores the contentsof all cached views in a central data structure in memory, and can be lever-aged to answer a new top-k query much more efficiently compared withLPTA and LPTA+. Using comprehensive experiments, we showed LPTA+substantially improves the performance of LPTA while the algorithms basedon IV-Index outperform both these algorithms by a significant margin.1207.2. Future Research7.2 Future ResearchThis dissertation has made substantial progress in the study of compositerecommendation, but several challenges still remain.First of all, the focus of this dissertation is on locating interesting pack-ages given user specified preferences/constraints, either explicit or implicit.However, most of the proposed algorithms do not leverage existing transac-tions in the system which can be used to further understand users’ preferenceover packages of items. E.g., for trip planning, a user might have alreadyplanned some trips before, thus given his/her existing trips, we might beable to infer what types of POIs he/she might be interested in. Some ini-tial works have attempted to solve this problem through supervised learning[57, 91]. However, the proposed models were only verified on an extremelysmall and non-public dataset.Second, another problem of finding top-k recommended packages is thatmany items found in the top-k packages might be duplicates, rendering thesetop-k packages being very similar to each other. Thus another desirableproperty of the generated top-k package set is diversity [106]. However,there is still a lack of principled ways of trading-off between diversity andquality of the packages. The preference elicitation framework studied inChapter 5 might be a potential fit for solving this problem. For example,we could elicit user’s preference on diversity by presenting both diversifiedpackages and non-diversified packages.Third, the usual recommendation interface is good for presenting a list ofitems, but given a set of k packages, there is a need for a good and intuitiveinterface for presenting these packages. In [106], the authors propose toconsider the visualization problem of packages, however, the focus in ondiversification. We note that in order to maximize the utility of the presentedtop-k packages, a good interface should not only present information relatedto these packages, but also consider highlighting properties which can beused to differentiate different packages along with explanation.Finally, evaluation of the presented top-k packages is also extremely chal-lenging. Existing evaluation measures such as precision and recall might notbe suitable for composite recommendation, given the nature of the objectsunder consideration. For example, although user might find package p themost interesting whereas p′ is presented as the top-1 package, there mightbe significant overlap between p and p′. Thus proper measures which takethe nature of the composite recommendation problem into considerationshould be designed and which can then be leveraged to understand users’preferences over packages of items.121Bibliography[1] Flickr. http://www.flickr.com, visited on 03/16/2015.[2] IMDB. http://imdb.com, visited on 03/16/2015.[3] Memcached. http://memcached.org, visited on 03/16/2015.[4] Metacritics. http://www.metacritic.com, visited on 03/16/2015.[5] Metascore. http://www.metacritic.com/about-metascores, vis-ited on 03/16/2015.[6] Nba basketball statistics. http://www.databasebasketball.com,visited on 03/16/2015.[7] Rottentomatoes. http://www.rottentomatoes.com, visited on03/16/2015.[8] Twitter. http://twitter.com, visited on 03/16/2015.[9] U.s. news best collage rankings. http://www.usnews.com/rankings,visited on 03/16/2015.[10] Wikipedia. http://www.wikipedia.org, visited on 03/16/2015.[11] World university rankings. http://www.timeshighereducation.co.uk/world-university-rankings/, visited on 03/16/2015.[12] Youtube. http://www.youtube.com, visited on 03/16/2015.[13] Auto123 consumer car ratings. http://www.auto123.com/en/car-reviews/consumer-ratings/, visited on 03/18/2013.[14] U.s. news best cars. http://usnews.rankingsandreviews.com/cars-trucks/rankings/cars/, visited on 03/18/2013.[15] Serge Abiteboul and Oliver M. Duschka. Complexity of answeringqueries using materialized views. In PODS, pages 254–263, 1998.122Bibliography[16] Gediminas Adomavicius and Alexander Tuzhilin. Toward the nextgeneration of recommender systems: A survey of the state-of-the-artand possible extensions. IEEE TKDE, 17(6):734–749, 2005.[17] Alfred V. Aho et al. The transitive reduction of a directed graph.J.Comp., 1(2):131–137, 1972.[18] Reza Akbarinia, Esther Pacitti, and Patrick Valduriez. Best positionalgorithms for top-k queries. In VLDB, pages 495–506, 2007.[19] Albert Angel, Surajit Chaudhuri, Gautam Das, and Nick Koudas.Ranking objects based on relationships and fixed associations. InEDBT, pages 910–921, 2009.[20] Greg J. Badros, Alan Borning, and Peter J. Stuckey. The Cassowarylinear arithmetic constraint solving algorithm. ACM Trans. Comput.-Hum. Interact., 8(4):267–306, 2001.[21] Eftychia Baikousi and Panos Vassiliadis. View usability and safetyfor the answering of top-k queries via materialized views. In DOLAP,pages 97–104, 2009.[22] Christopher M. Bishop. Neural Networks for Pattern Recognition.Oxford University Press, 1995.[23] G. R. Bitran and J.-C. Ferrer. On pricing and composition of bundles.Production and Operations Management, 16:93–108, 2007.[24] Christian Bo¨hm, Stefan Berchtold, and Daniel A. Keim. Searching inhigh-dimensional spaces: Index structures for improving the perfor-mance of multimedia databases. ACM Comput. Surv., 33(3):322–373,2001.[25] Stephan Bo¨rzso¨nyi, Donald Kossmann, and Konrad Stocker. The sky-line operator. In ICDE, pages 421–430, 2001.[26] N.D. Botkin and V.L. Turova-Botkina. An algorithm for finding thechebyshev center of a convex polyhedron. Applied Mathematics andOptimization, 29(2):211–222, 1994.[27] Craig Boutilier. A pomdp formulation of preference elicitation prob-lems. In AAAI/IAAI, pages 239–246, 2002.123Bibliography[28] Craig Boutilier, Ronen I. Brafman, Carmel Domshlak, Holger H. Hoos,and David Poole. Cp-nets: A tool for representing and reasoning withconditional ceteris paribus preference statements. J. Artif. Intell. Res.(JAIR), 21:135–191, 2004.[29] Sylvain Bouveret, Ulle Endriss, and Je´roˆme Lang. Conditional impor-tance networks: A graphical language for representing ordinal, mono-tonic preferences over sets of goods. In IJCAI, pages 67–72, 2009.[30] Ronen I. Brafman, Carmel Domshlak, Solomon Eyal Shimony, andY. Silver. Preferences over sets. In AAAI, 2006.[31] Gerhard Brewka, Miroslaw Truszczynski, and Stefan Woltran. Repre-senting preferences among sets. In AAAI, 2010.[32] Alexander Brodsky, Sylvia Morgan Henshaw, and Jon Whittle.CARD: a decision-guidance framework and application for recom-mending composite alternatives. In RecSys, pages 171–178, 2008.[33] Zhe Cao et al. Learning to rank: from pairwise approach to listwiseapproach. In ICML, pages 129–136, 2007.[34] Urszula Chajewska, Daphne Koller, and Ronald Parr. Making rationaldecisions using adaptive utility elicitation. In AAAI/IAAI, pages 363–369, 2000.[35] Ashok K. Chandra and Philip M. Merlin. Optimal implementation ofconjunctive queries in relational data bases. In STOC, pages 77–90,1977.[36] Yuan-Chi Chang, Lawrence D. Bergman, Vittorio Castelli, Chung-Sheng Li, Ming-Ling Lo, and John R. Smith. The onion technique:Indexing for linear optimization queries. In SIGMOD, pages 391–402,2000.[37] Yuan-Chi Chang, Chung-Sheng Li, and John R. Smith. Searching dy-namically bundled goods with pairwise relations. In ACM Conferenceon Electronic Commerce, pages 135–143, 2003.[38] Chandra Chekuri and Martin Pa´l. A recursive greedy algorithm forwalks in directed graphs. In FOCS, pages 245–253, 2005.[39] Gang Chen, Sai Wu, Jingbo Zhou, and Anthony K.H. Tung. Auto-matic itinerary planning for traveling services. TKDE, 26(3):514–527,2014.124Bibliography[40] Haiquan Chen, Wei-Shinn Ku, Min-Te Sun, and Roger Zimmermann.The multi-rule partial sequenced route query. In GIS, page 10, 2008.[41] Jan Chomicki. Preference formulas in relational queries. ACM Trans.Database Syst., 28(4):427–466, 2003.[42] Hong-Tai Chou and David J. DeWitt. An evaluation of buffer man-agement strategies for relational database systems. In VLDB, pages127–141, 1985.[43] George Dantzig. Linear Programming and Extensions. Princeton Uni-versity, 1998.[44] Gautam Das, Dimitrios Gunopulos, Nick Koudas, and Nikos Sarkas.Ad-hoc top-k query answering for data streams. In VLDB, pages 183–194, 2007.[45] Gautam Das, Dimitrios Gunopulos, Nick Koudas, and DimitrisTsirogiannis. Answering top-k queries using views. In VLDB, pages451–462, 2006.[46] Mark de Berg, M. van Krefeld, M. Overmars, and O. Schwarzkopf.Computational Geometry: Algorithms and Applications. Springer,2000.[47] Munmun De Choudhury, Moran Feldman, Sihem Amer-Yahia, NadavGolbandi, Ronny Lempel, and Cong Yu. Automatic construction oftravel itineraries using social breadcrumbs. In ACM Hypertext, pages35–44, 2010.[48] Sven de Vries and Rakesh V. Vohra. Combinatorial auctions: A survey.INFORMS Journal on Computing, 15(3):284–309, 2003.[49] Rina Dechter. Constraint Processing. Morgan Kaufmann, 2003.[50] Rina Dechter and Irina Rish. Mini-buckets: A general scheme forbounded inference. J. ACM, 50(2):107–153, 2003.[51] Marie desJardins and Kiri Wagstaff. Dd-pref: A language for express-ing preferences over sets. In AAAI, pages 620–626, 2005.[52] Ronald Fagin et al. Comparing top k lists. In SODA, pages 28–36,2003.125Bibliography[53] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregationalgorithms for middleware. J. Comput. Syst. Sci., 66(4):614–656, 2003.[54] Jonathan Finger and Neoklis Polyzotis. Robust and efficient algo-rithms for rank join evaluation. In SIGMOD, pages 415–428, 2009.[55] Raphael A. Finkel and Jon Louis Bentley. Quad trees: A data struc-ture for retrieval on composite keys. Acta Inf., 4:1–9, 1974.[56] Robert S. Garfinkel, Ram D. Gopal, Arvind K. Tripathi, and Fang Yin.Design of a shopbot and recommender system for bundle purchases.Decision Support Systems, 42(3):1974–1986, 2006.[57] Yong Ge, Qi Liu, Hui Xiong, Alexander Tuzhilin, and Jian Chen. Cost-aware travel tour recommendation. In KDD, pages 983–991, 2011.[58] Alon Y. Halevy. Answering queries using views: A survey. VLDB J.,10(4):270–294, 2001.[59] Jun-Seok Heo, Junghoo Cho, and Kyu-Young Whang. The hybrid-layer index: A synergic approach to answering top-k queries in arbi-trary subspaces. In ICDE, pages 445–448, 2010.[60] Vagelis Hristidis, Nick Koudas, and Yannis Papakonstantinou. PRE-FER: A system for the efficient execution of multi-parametric rankedqueries. In SIGMOD, pages 259–270, 2001.[61] Oscar H. Ibarra and Chul E. Kim. Fast approximation algorithms forthe knapsack and sum of subset problems. J. ACM, 22(4):463–468,1975.[62] Ihab F. Ilyas, Walid G. Aref, and Ahmed K. Elmagarmid. Supportingtop-k join queries in relational databases. In VLDB, pages 754–765,2003.[63] Ihab F. Ilyas, George Beskales, and Mohamed A. Soliman. A surveyof top-k query processing techniques in relational database systems.ACM Comput. Surv., 40(4), 2008.[64] Ihab F. Ilyas, Rahul Shah, Walid G. Aref, Jeffrey Scott Vitter, andAhmed K. Elmagarmid. Rank-aware query optimization. In SIGMOD,pages 203–214, 2004.[65] Yannis E. Ioannidis. The history of histograms. In VLDB, pages19–30, 2003.126Bibliography[66] Kalervo Ja¨rvelin and Jaana Keka¨la¨inen. Cumulated gain-based eval-uation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446, 2002.[67] Bin Jiang et al. Mining preferences from superior and inferior exam-ples. In KDD, pages 390–398, 2008.[68] Theodore Johnson and Dennis Shasha. 2Q: A low overhead high per-formance buffer management replacement algorithm. In VLDB, pages439–450, 1994.[69] Yaron Kanza, Roy Levin, Eliyahu Safra, and Yehoshua Sagiv. Aninteractive approach to route search. In GIS, pages 408–411, 2009.[70] Ralph L. Keeney and Howard Raiffa. Decisions with Multiple Ob-jectives: Decisions with Preferences and Value Tradeoffs. CambridgeUniversity Press, 2003.[71] Hans Kellerer, Ulrich Pferschy, and David Pisinger. Knapsack Prob-lems. Springer, 2004.[72] Werner Kießling. Foundations of preferences in database systems. InVLDB, pages 311–322, 2002.[73] Benny Kimelfeld and Yehoshua Sagiv. Finding and approximatingtop-k answers in keyword proximity search. In PODS, pages 173–182,2006.[74] Georgia Koutrika, Benjamin Bercovitz, and Hector Garcia-Molina.FlexRecs: expressing and combining flexible recommendations. InSIGMOD, pages 745–758, 2009.[75] Chung T. Kwok and Daniel S. Weld. Planning to gather information.In AAAI/IAAI, Vol. 1, pages 32–39, 1996.[76] Poole David L. and Mackworth Alan K. Artificial Intelligence: Foun-dations of Computational Agents. Cambridge University Press, NewYork, NY, USA, 2010.[77] P. Lamere and S. Green. Project aura: recommendation for the restof us. In Sun JavaOne Conference, 2008.[78] Theodoros Lappas, Kun Liu, and Evimaria Terzi. Finding a team ofexperts in social networks. In KDD, pages 467–476, 2009.127Bibliography[79] Javier Larrosa and Thomas Schiex. In the quest of the best form oflocal consistency for weighted csp. In IJCAI, pages 239–244, 2003.[80] Eugene L. Lawler. A procedure for computing the k best solutionsto discrete optimization problems and its application to the shortestpath problem. Man. Sci., 18(7):401–405, 1972.[81] Jongwuk Lee, Hyunsouk Cho, and Seung won Hwang. Efficient dual-resolution layer indexing for top-k queries. In ICDE, pages 1084–1095,2012.[82] Justin J. Levandoski et al. Recstore: an extensible and adaptive frame-work for online recommender queries inside the database engine. InEDBT, pages 86–96, 2012.[83] Roy Levin, Yaron Kanza, Eliyahu Safra, and Yehoshua Sagiv. In-teractive route search in the presence of order constraints. PVLDB,3(1):117–128, 2010.[84] Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, and DiveshSrivastava. Answering queries using views. In PODS, pages 95–104,1995.[85] Alon Y. Levy, Inderpal Singh Mumick, and Yehoshua Sagiv. Queryoptimization by predicate move-around. In VLDB, pages 96–107, 1994.[86] Chengkai Li, Kevin Chen-Chuan Chang, Ihab F. Ilyas, and SuminSong. Ranksql: Query algebra and optimization for relational top-kqueries. In SIGMOD, pages 131–142, 2005.[87] Chengkai Li, Nan Zhang, Naeemul Hassan, Sundaresan Rajasekaran,and Gautam Das. On skyline groups. In CIKM, 2012.[88] Feifei Li, Dihan Cheng, Marios Hadjieleftheriou, George Kollios, andShang-Hua Teng. On trip planning queries in spatial databases. InSSTD, pages 273–290, 2005.[89] Greg Linden et al. Interactive assessment of user preference models:The automated travel assistant. In UM, pages 67–78, 1997.[90] Richard J. Lipton, Jeffrey F. Naughton, and Donovan A. Schneider.Practical selectivity estimation through adaptive sampling. In SIG-MOD, pages 1–11, 1990.128Bibliography[91] Qi Liu, Yong Ge, Zhongmou Li, Enhong Chen, and Hui Xiong. Per-sonalized travel package recommendation. In ICDM, 2011.[92] Eric Hsueh-Chan Lu, Chih-Yuan Lin, and Vincent S. Tseng. Trip-mine: An efficient trip planning approach with travel time constraints.In Mobile Data Management (1), pages 152–161, 2011.[93] Alberto Marchetti-Spaccamela and Carlo Vercellis. Stochastic on-lineknapsack problems. Math. Program., 68:73–104, 1995.[94] Davide Martinenghi and Marco Tagliasacchi. Proximity rank join.PVLDB, 3(1):352–363, 2010.[95] Denis Mindolin and Jan Chomicki. Discovering relative importance ofskyline attributes. PVLDB, 2(1):610–621, 2009.[96] Danupon Nanongkai, Ashwin Lall, Atish Das Sarma, and KazuhisaMakino. Interactive regret minimization. In SIGMOD, pages 109–120, 2012.[97] Raymond T. Ng, Laks V. S. Lakshmanan, Jiawei Han, and Alex Pang.Exploratory mining and pruning optimizations of constrained associ-ation rules. In SIGMOD, pages 13–24, 1998.[98] Christos H. Papadimitriou. Computational Complexity. Addison Wes-ley, 1994.[99] Aditya Parameswaran and Hector Garcia-Molina. Recommendationswith prerequisites. In ACM RecSys, pages 353–356, 2009.[100] Aditya Parameswaran, Petros Venetis, and Hector Garcia-Molina.Recommendation systems with complex constraints: A CourseRankperspective. Technical report, 2009.[101] Aditya G. Parameswaran, Hector Garcia-Molina, and Jeffrey D. Ull-man. Evaluating, combining and generalizing recommendations withprerequisites. In CIKM, pages 919–928, 2010.[102] Aditya G. Parameswaran, Petros Venetis, and Hector Garcia-Molina.Recommendation systems with complex constraints: A course recom-mendation perspective. ACM Trans. Inf. Syst., 29(4):20, 2011.[103] Jian Pei and Jiawei Han. Can we push more constraints into frequentpattern mining? In KDD, pages 350–354, 2000.129Bibliography[104] Robert Price and Paul R. Messinger. Optimal recommendation sets:Covering uncertainty over user preferences. In AAAI, pages 541–548,2005.[105] Kenneth A. Ross, Divesh Srivastava, Peter J. Stuckey, and S. Sudar-shan. Foundations of aggregation constraints. Theor. Comput. Sci.,193(1-2):149–179, 1998.[106] Senjuti Basu Roy, Sihem Amer-Yahia, Ashish Chawla, Gautam Das,and Cong Yu. Constructing and exploring composite items. In SIG-MOD, pages 843–854, 2010.[107] Senjuti Basu Roy, Gautam Das, Sihem Amer-Yahia, and Cong Yu.Interactive itinerary planning. In ICDE, pages 15–26, 2011.[108] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A ModernApproach. Pearson Education, third edition, 2010.[109] Norvald H. Ryeng, Akrivi Vlachou, Christos Doulkeridis, and KjetilNørv˚ag. Efficient distributed top-k query processing with caching. InDASFAA, pages 280–295, 2011.[110] Karl Schnaitter and Neoklis Polyzotis. Evaluating rank joins withoptimal cost. In PODS, pages 43–52, 2008.[111] Mehdi Sharifzadeh, Mohammad R. Kolahdouzan, and Cyrus Shahabi.The optimal sequenced route query. VLDB J., 17(4):765–787, 2008.[112] Mohamed A. Soliman et al. Ranking with uncertain scoring functions:Semantics and sensitivity measures. In SIGMOD, pages 805–816, 2011.[113] Martin Theobald, Gerhard Weikum, and Ralf Schenkel. Top-k queryevaluation with probabilistic guarantees. In VLDB, pages 648–659,2004.[114] Dimitri Theodoratos and Timos K. Sellis. Data warehouse configura-tion. In VLDB, pages 126–135, 1997.[115] Olivier Toubia et al. Polyhedral methods for adaptive choice-basedconjoint analysis. Journal of Marketing Research, 41(1):pp. 116–131,2004.[116] Quoc Trung Tran, Chee-Yong Chan, and Guoping Wang. Evaluationof set-based queries with aggregation constraints. In CIKM, pages1495–1504, 2011.130Bibliography[117] Panayiotis Tsaparas, Themistoklis Palpanas, Yannis Kotidis, NickKoudas, and Divesh Srivastava. Ranked join indices. In ICDE, pages277–288, 2003.[118] Vijay V. Vazirani. Approximation Algorithms. Springer, 2001.[119] Jianshu Weng, Ee-Peng Lim, Jing Jiang, and Qi He. TwitterRank:finding topic-sensitive influential twitterers. In WSDM, pages 261–270, 2010.[120] Min Xie, Laks V. S. Lakshmanan, and Peter T. Wood. Breaking outof the box of recommendations: From items to packages. In RecSys,pages 151–158. ACM, 2010.[121] Min Xie, Laks V. S. Lakshmanan, and Peter T. Wood. Comprec-trip:A composite recommendation system for travel planning. In ICDE,pages 1352–1355, 2011.[122] Min Xie, Laks V. S. Lakshmanan, and Peter T. Wood. Efficient rankjoin with aggregation constraints. PVLDB, 4(11):1201–1212, 2011.[123] Min Xie, Laks V. S. Lakshmanan, and Peter T. Wood. Efficient rankjoin with aggregation constraints. PVLDB, 4(11):1201–1212, 2011.[124] Min Xie, Laks V. S. Lakshmanan, and Peter T. Wood. Efficient top-kquery answering using cached views. In EDBT, pages 489–500, 2013.[125] Min Xie, Laks V. S. Lakshmanan, and Peter T. Wood. Ips: An in-teractive package configuration system for trip planning. PVLDB,6(12):1362–1365, 2013.[126] Min Xie, Laks V. S. Lakshmanan, and Peter T. Wood. Generatingtop-k packages via preference elicitation. PVLDB, 7(14):1941–1952,2014.[127] Dong Xin, Chen Chen, and Jiawei Han. Towards robust indexing forranked queries. In VLDB, pages 235–246, 2006.[128] Albert Yu, Pankaj K. Agarwal, and Jun Yang. Processing a largenumber of continuous preference top-k queries. In SIGMOD, pages397–408, 2012.[129] Xi Zhang and Jan Chomicki. Preference queries over sets. In ICDE,pages 1019–1030, 2011.131Bibliography[130] Feng Zhao, Gautam Das, Kian-Lee Tan, and Anthony K. H. Tung.Call to order: A hierarchical browsing approach to eliciting users pref-erence. In SIGMOD, pages 27–38, 2010.132
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Composite recommendation : semantics and efficiency
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Composite recommendation : semantics and efficiency Xie, Min 2015
pdf
Page Metadata
Item Metadata
Title | Composite recommendation : semantics and efficiency |
Creator |
Xie, Min |
Publisher | University of British Columbia |
Date Issued | 2015 |
Description | Classical recommender systems provide users with a list of recommendations where each recommendation consists of a single item, e.g., a book or DVD. However, many applications can benefit from a system which is capable of recommending packages of items. Sample applications include travel planning, e-commerce, and course recommendation. In these contexts, there is a need for a system that can recommend the most relevant packages for the user to choose from. In this thesis we highlight our research achievements for the composite recommendation problem. We first consider the problem of composite recommendation under hard constraint, e.g., budget. It is clear that this is a very common paradigm for the composite recommendation problem. In Chapter 3, we first discuss how given a fixed package schema, we can efficiently find the top-k most relevant packages with hard constraints. The proposed algorithm is shown to be instance optimal, which means that no algorithm in a reasonable class can perform more than a constant times better, for some fixed constant. And we also propose relaxed solutions based on probabilistic reasoning. In Chapter 4, we lift the constraint on the package schema, and discuss how efficient algorithms can be derived to solve the more general problem with a flexible package schema. For this problem, again we propose both instance optimal algorithm and heuristics-based solution which have been verified to be effective and efficient through our extensive empirical study. Then in Chapter 5, motivated by the fact that hard constraints sometimes might lead to unfavorable results, and following the recent paradigm on “softening” the constraints, we study the problem of how to handle top-k query processing with soft constraints. Finally, in Chapter 6, we discuss a general performance tuning solution based on cached views which can be leveraged to further optimize the various algorithms proposed in this thesis. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2015-03-17 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0135685 |
URI | http://hdl.handle.net/2429/52406 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2015-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2015_may_xie_min.pdf [ 1.25MB ]
- Metadata
- JSON: 24-1.0135685.json
- JSON-LD: 24-1.0135685-ld.json
- RDF/XML (Pretty): 24-1.0135685-rdf.xml
- RDF/JSON: 24-1.0135685-rdf.json
- Turtle: 24-1.0135685-turtle.txt
- N-Triples: 24-1.0135685-rdf-ntriples.txt
- Original Record: 24-1.0135685-source.json
- Full Text
- 24-1.0135685-fulltext.txt
- Citation
- 24-1.0135685.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0135685/manifest