Revisiting Recommendations: From Customers to Manufacturers

by Shailendra Agarwal

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

September 2013

© Shailendra Agarwal, 2013

Abstract

Recommender systems exploit user feedback over items they have experienced for making recommendations of other items that are most likely to appeal to them. However, users and items are but two of the three types of entities participating in this ecosystem of recommender systems. The third type of entity is the manufacturers of the products, and users are really their customers. Traditional recommender systems research ignores the role of this third entity type and exclusively focuses on the other two. What might item producers bring to recommender systems research? Their objectives are related to their business and are captured by questions such as "what kind of (new) products should I manufacture that will maximize their popularity?" These questions are not asked in a vacuum: manufacturers have constraints, e.g., a budget. The idea is that the user feedback data (e.g., ratings) capture users' preferences. The question is whether we can learn enough intelligence from it, so as to recommend new products to manufacturers that will help meet their business objectives.

We propose the novel problem of new product recommendation for manufacturers. We collect real data by crawling popular e-commerce websites, and model cost and popularity as functions of product attributes and their values. We incorporate cost constraints into our problem formulation: the cost of the new products should fall within the desired range while maximizing the popularity. We show that the above problem is NP-hard and develop a pseudo-polynomial time algorithm for generating the recommendations. Finally, we conduct a comprehensive experimental analysis where we compare our algorithm with several natural heuristics on three real data sets and perform scalability experiments on a synthetic data set.

Preface

This thesis is the outcome of my collaborative research with my colleague Dr. Amit Goyal and my supervisor Dr. Laks V.S. Lakshmanan. Listed below are the contributions of each of the collaborators, including me.

Dr. Laks V.S. Lakshmanan was involved in the initial motivation of the problem solved here. He was present in many of the brainstorming sessions and gave various suggestions to formulate the problem, to solve the problem and to conduct experiments to test the proposed algorithm. He was also involved in refining the content and presentation of the work.

Dr. Amit Goyal was present in all the discussions involving formally defining the problem, proposing solutions to the problem and designing experiments. He proposed various improvements to the algorithms. He looked at some of the previous literature in the field. He was also involved in improving the content and presentation of the work.

Along with Dr. Laks V.S. Lakshmanan, I was involved in the initial brainstorming that led to the formulation of this problem. I looked at most of the previous literature in the field. I, along with both collaborators, defined the problem formally. I was responsible for crawling the real data from multiple e-commerce websites. I implemented the regression-based cost and popularity modeling. I proposed the initial algorithm to solve the problem and conducted experiments to test the algorithm.
I was responsible for keeping track of the work done during the project and summarizing it.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
2 Related Works
3 Problem Definition
   3.1 Data Sets
   3.2 Issue of User Sparsity
   3.3 Our Problem Formulation
4 Modeling Cost and Popularity
   4.1 Data Set Pre-processing
   4.2 Regression Analysis
       4.2.1 Comparison
       4.2.2 Cost and Popularity Oracles
   4.3 Strong Predictors
5 Algorithm
   5.1 Complexity of RECMAN
   5.2 An Optimal Algorithm
       5.2.1 Sketch of Original Algorithm
       5.2.2 Our Algorithm
6 Experiments
   6.1 Evaluating our Algorithm
       6.1.1 Comparison of Algorithms w.r.t. Popularity
       6.1.2 Variation in Popularity of Top-k Products
       6.1.3 Scalability
   6.2 Recommended Product Designs
7 Conclusions and Future Work
Bibliography

List of Tables

Table 3.1   Raw Data Set Statistics
Table 3.2   Frequency Distribution of User Ratings
Table 4.1   Pruned Data Set Statistics
Table 4.2   RMSE in Predicting the Cost of Product
Table 4.3   RMSE in Predicting the Popularity of Product
Table 4.4   Top 10 Attribute-level Pairs in Cost and Popularity Modeling, in decreasing order (a) Television (b) Camera. * denotes Numerical Attributes.
Table 6.1   Top 10 Attributes (in decreasing order of Popularity) of the Best Product (a) Television (b) Camera. Right columns show all possible Levels for the particular Attribute; the Level present in the Best Product is underlined. * denotes Numerical Attributes.

List of Figures

Figure 1.1   Example.
Figure 3.1   Frequency Distribution of Ratings from Product Perspective. Left Y-axis: Television and Laptop; Right Y-axis: Camera.
Figure 4.1   Frequency Distribution of Percentage Prediction Error (a) Cost (b) Popularity
Figure 5.1   Television with two Attributes - Screen Size and Screen Type. k = 3, Clb = 50 and Cub = 80 (a) Cost and Popularity of various Levels of both Attributes (b) Dynamic Programming Table (c) Top-3 Products
Figure 6.1   Comparison of Algorithms w.r.t. Popularity in Television Category
Figure 6.2   Comparison of Algorithms w.r.t. Popularity in Camera Category
Figure 6.3   Comparison of Algorithms w.r.t. Popularity in Laptop Category
Figure 6.4   Popularity of Top-100 Products for four different Costs in Television Category
Figure 6.5   Running Time and Memory Usage with respect to Cost Upper Bound and k on SYNTHETIC Data Set (a) Time (b) Memory
Figure 6.6   Running Time and Memory Usage with respect to Number of Feasible Products on SYNTHETIC Data Set. X-axis is Logarithmic.

Glossary

LOOCV    Leave-one-out Cross-validation
OPDP     Optimal Product Design Problem
MCKP     Multiple-choice Knapsack Problem
RMSE     Root Mean Square Error
RS       Recommender Systems
RECMAN   RECommendation for MANufacturers

Acknowledgments

I would like to sincerely thank my supervisor Dr. Laks V.S. Lakshmanan for his continuous support and encouragement throughout my study period. He was always excited to talk about new ideas and was a great help in completing this thesis project. I would also like to thank Dr. Amit Goyal for the fun and insightful discussions we had during the course of this project. My thanks go to Dr. Rachel Pottinger for patiently reading this thesis and providing suggestions for improvement. I would like to thank Dr. Ruben H. Zamar, Dr. Matias Salibian-Barrera and Fred for the various stimulating discussions during the project. Lastly, thanks to all my friends, labmates and family for always being there.

Chapter 1  Introduction

Recommender Systems (RS) exploit user feedback on items to make recommendations of other items that are most likely to appeal to them. In the ecosystem of a RS, three types of entities participate: customers (users), products (items), and manufacturers (item producers). Research in RS has mostly focused on the customer perspective of the recommendation problem: how can we recommend high quality products to customers in a personalized way? This is usually done by predicting a user's likely rating on a product she has not experienced before, by leveraging users' feedback on products they have experienced [17]. There is an important, "flip" side to this problem: we still recommend products, but the target population is no longer the customers who buy the product; rather, the target population is now the manufacturers who sell the product. The perspective of this third entity type, item manufacturers, has heretofore largely been ignored.

What might item producers bring to RS research? Their objectives are related to their business and are captured by questions such as "what kind of (new) products should I manufacture that will maximize the popularity?" Here, a new product refers to a non-existent product. E.g., a smartphone company may wish to know how to "tune"
the design of new products it would like to launch in the market so that they achieve a high (online) popularity. A similar remark applies to the manufacturers of other durable goods like video gaming consoles. These business questions are not asked in a vacuum: manufacturers have constraints, e.g., a budget. The idea is that the user feedback data (e.g., ratings) capture users' preferences. The question we ask is whether we can learn enough intelligence from it, e.g., customers' preference for products' attribute values, so as to recommend new products to manufacturers that will help meet their business objectives.

Figure 1.1: Example.

New product recommendations can be valuable to manufacturers: careful design of a new product before it is launched is crucial for the product's success. Marketing and launching a new product is expensive and critical to brand management [11]. Once launched, it is difficult to roll back. Kotler et al. [11] identify bad product design as one of the most frequent reasons for the failure of a new product. Motivated by this, we bring a "dual" perspective, i.e., the business perspective, to the long-studied customer-centric RS problem.

A product (or product category) possesses a number of attributes, where each attribute can have one of several possible values (also called levels). For instance, consider the (simplified) example shown in Figure 1.1. In this example, the television category has two attributes: Screen Size and Screen Type. Screen Size has three values (Small, Medium and Large), while Screen Type has two values (LCD and Plasma). A product is designed by picking exactly one value for each attribute, e.g., (Screen Size = Medium, Screen Type = LCD) is one out of six feasible products. In real life, there may be multiple attributes, each having a large number of possible values. The objective is to design one or more new products maximizing their predicted popularity.

In this work, we focus on durable products. We collect data on three product categories (camera, television and laptop) by crawling two popular e-commerce platforms: www.shopping.com and www.bestbuy.com. Our first task is to build predictive models for the cost and popularity of products, where cost is the price per unit of the product listed on the e-commerce website and popularity is the number of customer ratings received by the product. A major challenge in building predictive models is that these data sets suffer from acute user sparsity, that is, most of the users rate very few products. Indeed, the user sparsity issue is intrinsic to these types of data sets: one user is unlikely to buy multiple (expensive) products of the same kind (e.g., many cameras) and hence may not review multiple similar products. Thus, individual user preferences for product attributes cannot be learned reliably. Consequently, instead of learning per-attribute user preferences and then estimating a new product's popularity, we predict the popularity by directly estimating the impact of various attribute values on popularity. Similarly, we build a model for predicting the cost of a new product as a function of the attribute values that make it up.

Since the decision of which product to launch may be driven by other external considerations such as availability of expertise, engineering feasibility, etc., there is a need for flexibility in the product designs recommended.
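To make the notion of a feasible product design concrete, here is a minimal sketch in Python that enumerates the design space of the toy television category from Figure 1.1. The attribute and level names come from that example; the code itself is purely illustrative and is not part of the thesis.

```python
from itertools import product

# Toy design space from Figure 1.1 (illustrative values only).
attributes = {
    "Screen Size": ["Small", "Medium", "Large"],
    "Screen Type": ["LCD", "Plasma"],
}

# A feasible design picks exactly one level for each attribute.
names = list(attributes)
designs = [dict(zip(names, combo))
           for combo in product(*(attributes[a] for a in names))]

print(len(designs))   # 6 feasible products for this toy category
for d in designs:
    print(d)
```

With m attributes, each having several levels, the same enumeration would produce a number of designs exponential in m, which is why Chapter 5 develops a dynamic programming algorithm rather than relying on brute-force enumeration.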
To accommodate this, we propose the problem of recommending top-k new items to the manufacturers. Thus, similar to the k most appealing product recommendations to customers in RS, we recommend top-k new products (k is a manufacturer-specified parameter) to a manufacturer, out of which it can choose one or more products based on technical feasibility, product diversity and other engineering concerns.

Cost is a key factor in designing new products [18, 21], as it directly influences the quality of (attribute values in) the product. As Hauser et al. [5] note, "Price plays a special role in consumer preferences. It is not a feature of a product per se, but rather that which the consumer pays in return for features." Moreover, marketing decisions, such as which market segment to target, depend on the cost of the product. Thus, in view of the importance of the cost of a product, we allow the manufacturer to specify a cost range such that the cost of each new product satisfies the specified range, while maximizing the popularity.

In the marketing research literature, the problem of optimal product design has been extensively studied. Conjoint analysis is the popular approach that is followed. The data is collected by conducting user surveys, and then users' preferences are modeled based on the responses. A detailed comparison with that problem appears in Chapter 2.

In summary, we formulate the new product RECommendation for MANufacturers (RECMAN) problem as follows: Given the attributes and the possible values that a product (category) can have, a lower bound Clb and an upper bound Cub on cost, and a parameter k, recommend k new products with the top-k highest (predicted) popularity, such that the (predicted) cost of each of these new products lies between the cost bounds. We make the following contributions.

- We propose a novel problem of new product recommendation for manufacturers, capturing an interesting business perspective of recommender systems. Our formulation incorporates cost constraints and asks for top-k new product designs that achieve the highest popularity with cost falling in a specified range (Chapter 3).

- We explore various regression models to predict the cost and popularity of a product and find that linear regression performs the best among them (Chapter 4).

- We show that RECMAN is NP-hard. We develop a pseudo-polynomial time dynamic programming algorithm whose running time is polynomial in the value of the cost upper bound. We show that our algorithm is optimal (Chapter 5).

- We conduct a comprehensive set of experiments on three real data sets and compare our algorithm with natural heuristics. The results show that our algorithm outperforms the alternatives by considerable margins. Furthermore, our algorithm scales very well w.r.t. both running time and memory usage (Chapter 6).

Related work is presented in Chapter 2, while we conclude the thesis in Chapter 7.

Chapter 2  Related Works

A popular approach in RS is collaborative filtering, which seeks to predict the expected rating a given user may provide to an item which she hasn't rated before. The top items w.r.t. predicted ratings are then recommended to the user. Collaborative filtering approaches can be classified into memory-based (e.g., user-based and item-based) and model-based (e.g., matrix factorization). We recommend Chapters 4 and 5 of [17] as excellent surveys. As we have described above, while RS research mostly focuses on recommending relevant items to customers, we propose a novel problem:
recommending designs of new items to the manufacturers, such that by building and launching them, their popularity could be maximized.

Our work is partly inspired by the Optimal Product Design Problem (OPDP), studied in the fields of Marketing and Operations Research. The objective here is to design a new product that maximizes the number of users who buy it. In conjoint analysis, which is the state of the art for this problem [3, 20], the utility of a user u for a product is assumed to be additive, that is, the sum of her utilities for the values of the different attributes of the product. If u's utility for a new product is higher than the utility of the status quo product that u owns, then u is assumed to buy the new product. OPDP is known to be NP-hard [9] and several heuristics have been proposed [20]. While the overall objective of OPDP is, at a high level, the same as that of RECMAN, there are several major differences between the two problems in terms of input, constraints and the consequent solution approach. (i) In OPDP, the data used is collected via user surveys (see [3]), which are expensive and time-consuming, while the input to RECMAN is the easily available RS data, i.e., ratings data collected from e-commerce websites. (ii) In OPDP, individual user preferences are modeled. By contrast, as we show in Chapter 3, the durable products data sets have inherent acute user sparsity (each user buys and, hence, rates very few items), so it is not possible to model individual user preferences reliably. Accordingly, we look at each product as a whole and estimate the cost and popularity of a new product as a function of the product attributes of a given category, by aggregating the available social signals in the form of user ratings. (iii) Unlike in OPDP, our problem admits cost constraints and aims to output top-k new products instead of just one.

Das et al. [2] study the problem of web item design. They use collaborative tagging data from users to identify new products that are expected to attract the maximum number of distinct desirable tags. Even though our problem, at a very high level, bears some similarities to theirs, there are several key differences. Their objective function is different: they take a set of desirable tags as input and try to maximize the number of distinct desirable tags. On the other hand, our objective is to maximize product popularity. The different objective functions call for different solution approaches for the two problems. In particular, their approach cannot be applied for recommending products that maximize the number of ratings, since their framework ignores repeated tagging on the same item and only considers distinct tags. Besides, unlike them, we incorporate cost constraints as input, thus providing additional flexibility to manufacturers.

Considerable work has been done on extracting customer opinions by applying text mining techniques [1, 6, 16]. We recommend Chapter 5 of [13] for an excellent survey. Archak et al. [1], in addition to mining customer opinions, apply regression analysis to model sales (of products) as a function of opinions on product attributes. For instance, one of their results is: a better camera has more positive impact on sales (in the category Camera and Photo) than a better battery. Our problem is fundamentally different from that of opinion mining: our goal is to build new products that maximize popularity. We do it by identifying the actual attribute values of products (instead of opinions).
E.g., in our analysis, we find that a camera with a memory stick is more popular than a camera with built-in memory. Note that "memory stick" and "built-in" are two values of the attribute "memory type".

Miah et al. [14] study the problem of finding the top-k attributes of a product that a seller should highlight in a marketplace such that it attracts maximum visibility. The problems are significantly different, as their goal is not to design a new product but to maximize the visibility of an existing product in a marketplace. Secondly, they do not consider the identification of attribute values. Rather, the attribute values are already known.

Real world product design is a more complex process involving product engineering, advertising and other managerial decisions. Nevertheless, RECMAN provides a starting point to the decision makers regarding the set of attribute values that are most important for maximizing popularity, given a cost constraint.

Chapter 3  Problem Definition

3.1 Data Sets

Before we define the problem statement formally, we explore the characteristics of the data sets we collected. We perform our study on 3 types of durable products: television, camera and laptop. To collect the data, we crawled two popular platforms: www.shopping.com for cameras and www.bestbuy.com for televisions and laptops. From each data set, we built 3 tables as follows. The first table corresponds to the ratings log, where each tuple <u, p, r> represents that user u rates product p with rating r. The second table corresponds to products' attributes data, where each tuple <p, a, l> indicates that in product p, attribute a has value (also called level) l. Finally, the third table records the cost (sale price) of each product as <p, c>. Table 3.1 summarizes statistics of the raw crawled data. For instance, in the category television, the raw data contains 334 products, 9682 users and 9884 ratings. Each product (television) has 27 attributes, including both numerical and categorical.

Table 3.1: Raw Data Set Statistics

Statistic              Television   Camera   Laptop
Number of products     334          969      304
Number of users        9682         5317     2317
Number of ratings      9884         6013     2353
Number of attributes   27           90       16
  Numerical            6            27       7
  Categorical          21           63       9

3.2 Issue of User Sparsity

In the original RS problem, a user is recommended k items by predicting her likely ratings for items she has not yet rated. To do this, user preferences are captured using her previous interactions with the system. For example, Matrix Factorization [10] represents each user by a latent factor vector which is learned based on her prior ratings. So, a natural approach to solve RECMAN would be to build a similar model for each user to predict whether the particular user would rate the product, following which we can return the top-k products that are rated by the maximum number of users. The argument below shows the problem encountered with such an approach and paves the way for the possible approach.

In Table 3.2, we present the frequency distribution of ratings from a user perspective. The table shows that, for every product category, the number of users who rated more than 1 product is a tiny fraction of the number of users who rated exactly 1 product, and this number decays extremely fast as the number of ratings increases. As we argued before, this behavior is intrinsic to durable products: it is highly unlikely that one single user will buy (and rate) more than one or two products from a single product category.
The average numbers of ratings per user are 1.02, 1.13 and 1.01, respectively, for the television, camera and laptop categories. We refer to this as the user sparsity issue. A similar user sparsity issue has also been witnessed in other data sets, like www.amazon.com and www.resellerratings.com, across a variety of product categories: Books, Music, DVD, electronics, computers, etc. [7, 22]. Indeed, individual user preferences learnt from such data sets would severely suffer from overspecialization and overfitting and, hence, would be highly unreliable. Therefore, we formulate RECMAN in an alternative way that can mitigate the user sparsity issue.

Table 3.2: Frequency Distribution of User Ratings

Number of ratings   Television   Camera   Laptop
1                   9499         4997     2289
2                   168          218      24
3                   13           50       2
4                   1            19       0
5                   0            12       2
>= 6                1            21       0

Instead of looking at individual user preferences, we look at each product as a whole. Figure 3.1 shows the frequency distribution of ratings from the product perspective. The plots show that the sparsity in this case is much less compared to user sparsity. For instance, in television, 151 products received 5 or more ratings. Similarly, in the categories camera and laptop, the numbers of products which received 5 or more ratings are 288 and 88, respectively. The average numbers of ratings received per product are 30.4, 6.2 and 7.9, respectively. We further prune the products that have received very few ratings (< 5). This gives us average numbers of ratings per product of 66.7, 19 and 26.8, corresponding to the three data sets. These numbers suggest that if we look at the product as a whole, and try to model cost and popularity as functions of the attribute values of products, we would be able to avoid the issue of overfitting.

Figure 3.1: Frequency Distribution of Ratings from Product Perspective. Left Y-axis: Television and Laptop; Right Y-axis: Camera.

3.3 Our Problem Formulation

The objective of RECMAN is to recommend top-k products such that their popularity, expressed as the number of ratings they receive, is maximized. Let Γ denote a product (or product category). With m, we denote the number of attributes it can have. An attribute can assume several values (or levels). A product is built by selecting exactly one value for each attribute. We handle a realistic setting where manufacturers may have cost constraints, by permitting parameters for a lower bound (denoted Clb) and an upper bound (denoted Cub) on cost. The new products should achieve the maximum popularity while keeping the cost of each of the products within the specified cost bounds.

In the case of RS, to recommend relevant products to users, the only products that need to be explored are the products that already exist in the system. In other words, the search space is linear in the case of RS. By contrast, in RECMAN, we must recommend new product designs that don't exist in the system. A product is built by selecting exactly one value for each attribute, thus making the search space exponential. We refer to this space of product designs as Ω. It is important to bear this in mind to understand the notion of scale for new product designs. Any algorithm, to be useful, must scale w.r.t. the size of this huge search space. Notice that the number of products in the data set has no direct bearing on the scale issue.

To define the problem, for now, we assume that we are given an oracle COST(Γ) that predicts the cost of the product Γ
, and another oracle POPULARITY(Γ) that predicts the popularity of Γ. Further, we assume that the cost (and likewise the popularity) of a product can be determined by its attributes and hence can be modeled as a function of the attributes. In Chapter 4, we explore various regression models to estimate cost and popularity, and report our findings. We next formally state RECMAN.

Problem 1. Given a product design space Ω of m attributes and their possible levels (values), a lower bound Clb and an upper bound Cub on cost, and a parameter k, design k new products P ⊆ Ω, each by selecting exactly one level for each attribute, such that ∀Γ ∈ P : [Clb ≤ COST(Γ) ≤ Cub and there is no Γ' ∈ (Ω \ P) with Clb ≤ COST(Γ') ≤ Cub and POPULARITY(Γ') > POPULARITY(Γ)].

Chapter 4  Modeling Cost and Popularity

We first report the steps taken to clean the data. Then, we describe the models explored for implementing the cost (and popularity) oracles and compare their performance. Further, we define COST(Γ) and POPULARITY(Γ) based on linear regression, the model which performs the best among those we explored. Finally, we give a quantitative description of some of the attribute levels that are strong predictors of cost (and popularity).

4.1 Data Set Pre-processing

The raw data set (Table 3.1) contains both numerical and categorical attributes. We replace missing values of numerical attributes by the mean of the known numerical values of that attribute, and ignore missing values of categorical attributes [19]. We employ K-means clustering with the number of clusters set to 3, to discretize the numerical attributes into 3 clusters (Footnote 1). Next, we binarize all attributes to derive features for regression analysis. For instance, for the attribute Screen Type in the example shown in Figure 1.1, there are two levels, LCD and Plasma; we create two binary features for it, ScreenType_LCD and ScreenType_Plasma.

Footnote 1: We also tried K-means clustering with 5 clusters on numerical attributes. We obtained similar results in both cases, although numerical values varied slightly. For brevity, we present results only for 3 clusters.

To avoid overfitting, we take the following measures on all three data sets. First, we prune all the products that received few (< 5) ratings or a very large number of ratings. Table 4.1 summarizes the statistics of the pruned data.

Table 4.1: Pruned Data Set Statistics

Statistic              Television   Camera   Laptop
Number of products     136          250      75
Number of ratings      7702         3612     1525
Number of attributes   13           34       15
Total # of levels      34           85       27

4.2 Regression Analysis

We explore the following regression models (see [4]) for modeling cost and popularity.

Linear Regression. This assumes a linear relationship between the features and the expected output values of COST(Γ) and POPULARITY(Γ). We constrain the coefficients of the regression to be non-negative, since both cost and popularity assume only non-negative values (Footnote 2).

Poisson Regression. This is a generalized linear model that assumes a linear relationship between the input and the logarithm of the output, and returns only non-negative output, as required by our problem formulation.

Regression Trees. The idea here is to recursively partition the feature space based on the values of features.
At each step, the remaining set of training data is divided into two classes based on the value of the feature under consideration. This continues until a certain depth, calculated using various heuristics, following which a model such as a regression is fit within each leaf of the tree [4].

We use MATLAB implementations of linear regression (lsqnonneg), Poisson regression (glmfit) and regression trees (classregtree). We perform Leave-one-out Cross-validation (LOOCV) (see chapter 4 of [19]) to compare the various models. In LOOCV, the test set contains only one product and the training set contains the remaining products. Thus, the total number of models built equals the total number of products. A key benefit of using LOOCV is that it uses the maximum data to train the model, which is especially helpful in cases when the data does not contain many samples, as in our case.

Footnote 2: For the sake of completeness, we also performed regression analysis with unconstrained coefficients. As expected, the results were worse than for the non-negative case.

4.2.1 Comparison

Table 4.2 and Table 4.3 show the Root Mean Square Error (RMSE) associated with cost and popularity prediction using the three models. To provide context, we also include the range of cost and popularity under the various categories in the tables. For example, in the television category, the cost varies from $140 to $4300 (after pre-processing) and the popularity varies from 5 to 172 (note that we removed products which received less than 5 ratings, as part of pre-processing).

Table 4.2: RMSE in Predicting the Cost of Product

Model used           Television   Camera      Laptop
Linear Regression    343.5        320.6       293.5
Poisson Regression   360.4        526.6       235.3
Regression Tree      458.8        423.1       255.3
Range of cost ($)    140 - 4300   50 - 2549   350 - 1900

Table 4.3: RMSE in Predicting the Popularity of Product

Model used            Television   Camera   Laptop
Linear Regression     41.9         8.6      16.7
Poisson Regression    54.3         11.0     80.9
Regression Tree       51.1         10.3     19.1
Range of popularity   5 - 172      5 - 44   5 - 66

Consider the prediction of cost (Table 4.2): linear regression outperforms the others in the television and camera categories, with RMSE values of 343.5 and 320.6, respectively. Even though linear regression is dominated by other models in the laptop category, the gap is marginal. Besides, it has significantly less error on the other two data sets. In the case of prediction of popularity (Table 4.3), linear regression, with RMSE values of 41.9, 8.6 and 16.7, is clearly better than the other two methods on all three data sets. Hence, we take linear regression as the winner.

We further look into the error characteristics of linear regression and see if they follow intuition. The results are presented in Figure 4.1, in terms of the percentage prediction error, which is computed as

    percentage prediction error = (|predicted value - actual value| / actual value) × 100

Figure 4.1a shows the distribution of the percentage error in cost prediction. It shows that the frequency of percentage error decreases as the percentage error increases. Thus, we can expect reasonable prediction accuracy, with high error occurring very few times. Similar results for popularity are shown in Figure 4.1b. One immediate observation differentiating Figure 4.1a and Figure 4.1b is that the prediction of popularity is not as accurate as that of cost. One possible reason is that, unlike cost, which depends mostly on the attributes of a product, popularity may depend on various other factors like the quality of implementation of various attributes, marketing, overall consumer experience, emotional attachment of consumers to the brand, etc.
Hence, the task of predicting popularity is more challenging than predicting cost.

4.2.2 Cost and Popularity Oracles

Having shown that linear regression is best able to model the cost and popularity, we here define the predicted cost as

    COST(Γ) = Σ_{(a:l) ∈ Γ} COST(a, l)                (4.1)

where (a : l) ∈ Γ means that the attribute a has level l in the product Γ. Similarly, we define the predicted popularity as

    POPULARITY(Γ) = Σ_{(a:l) ∈ Γ} POPULARITY(a, l)    (4.2)

Figure 4.1: Frequency Distribution of Percentage Prediction Error (a) Cost (b) Popularity

Here, COST(a, l) (respectively, POPULARITY(a, l)) is the coefficient of the binarized feature that corresponds to level l in the linear regression model of COST(.) (respectively, POPULARITY(.)), and effectively denotes the impact of level l on the overall cost (popularity) of the product. For simplicity, we refer to COST(a, l) (POPULARITY(a, l)) as the cost (popularity) of l. Note that we have constrained both COST(a, l) and POPULARITY(a, l) to be non-negative, so that COST(Γ) and POPULARITY(Γ) are also non-negative.

4.3 Strong Predictors

Table 4.4 shows the top 10 attribute-level pairs that are strong predictors of the cost and popularity (top 10 highest regression coefficients) of the television and camera categories, respectively. Recall that we binarize the attribute-value pairs; hence, one attribute may appear multiple times in both tables. Most of the results match our intuitions. For example, in the television category, high dollar savings on a product implies that the cost of the product would be high and is a good indicator of cost. Similarly, a large screen size significantly increases the cost of a television. An Ethernet port is present only in high cost televisions, and the brand LG mostly has high priced televisions compared to Dynex and Insignia.

Looking at popularity modeling, in the television category, we find that brand is the most important predictor of popularity. Also note that many attribute-level pairs that have high cost don't have high popularity. This is because the majority of the population chooses products that have neither very high nor very low cost. Such products have medium dollar savings and 2 HDMI inputs, instead of high dollar savings and 4 HDMI inputs, which have higher cost. We further investigate the televisions of brand Dynex, which has the highest popularity. We find that these are only available at two stores: www.bestbuy.com and www.futureshop.ca. Since our television data set was crawled from www.bestbuy.com, it is naturally biased towards the popularity of this brand.

In the camera category, cameras belonging to the Nikon D and Canon EOS family lines are expensive. Further, cameras with an HDMI interface are costlier than cameras with a USB interface, and cameras having an interchangeable lens cost more. For popularity modeling, the popularity of cameras with a memory stick is higher than that of cameras with a compact flash card.
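As a concrete illustration of the oracles defined in Equations 4.1 and 4.2, here is a minimal sketch in Python. The thesis uses MATLAB's lsqnonneg for the non-negative linear regression; the sketch substitutes SciPy's nnls as a stand-in, and the feature names, design matrix and response values are invented for illustration only.

```python
import numpy as np
from scipy.optimize import nnls

# Toy binarized design matrix: rows = products, columns = attribute-level features.
features = ["ScreenSize_Small", "ScreenSize_Medium", "ScreenSize_Large",
            "ScreenType_LCD", "ScreenType_Plasma"]
X = np.array([[1, 0, 0, 1, 0],
              [0, 1, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 0, 1]], dtype=float)
cost_y = np.array([300.0, 500.0, 650.0, 900.0])     # observed prices (made up)
popularity_y = np.array([40.0, 55.0, 30.0, 20.0])   # observed number of ratings (made up)

# Non-negative least squares keeps all per-level coefficients >= 0.
cost_coef, _ = nnls(X, cost_y)
pop_coef, _ = nnls(X, popularity_y)
COST = dict(zip(features, cost_coef))
POPULARITY = dict(zip(features, pop_coef))

def cost_oracle(design):
    """Equation 4.1: sum of the learned per-level cost coefficients."""
    return sum(COST[f] for f in design)

def popularity_oracle(design):
    """Equation 4.2: sum of the learned per-level popularity coefficients."""
    return sum(POPULARITY[f] for f in design)

new_design = ["ScreenSize_Medium", "ScreenType_LCD"]
print(cost_oracle(new_design), popularity_oracle(new_design))
```

A new (possibly non-existent) design is thus scored by summing the coefficients of the levels it selects, which is exactly what Problem 1 assumes of the COST and POPULARITY oracles.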
The memory-type observation again supports what we saw in the television category: the majority of the population chooses products that have neither very high cost (compact flash card) nor very low cost (built-in memory), but goes for something in between (memory stick).

Table 4.4: Top 10 Attribute-level Pairs in Cost and Popularity Modeling, in decreasing order (a) Television (b) Camera. * denotes Numerical Attributes.

(a) Television
Cost                               Popularity
Attribute         Level            Attribute          Level
Dollar savings*   High             Brand              Dynex
Screen size*      Large            Brand              Insignia
Dollar savings*   Medium           Refresh rate       120 Hz
Refresh rate      240 Hz           Dollar savings*    Medium
Brand             LG               Refresh rate       240 Hz
Refresh rate      600 Hz           Dollar savings*    High
#HDMI inputs      4                #HDMI inputs       2
Ethernet port     Present          Brand              Samsung
Screen size*      Medium           Vert. resolution   768 pixel
USB port          Present          Screen size*       Medium

(b) Camera
Cost                                                Popularity
Attribute              Level                        Attribute               Level
Family line            Nikon D                      Memory type             Memory stick
Family line            Canon EOS                    Compression mode        Basic
Interface type         HDMI                         Resolution*             Low
Interchangeable lens   Yes                          Height*                 Medium
Width*                 Medium                       Battery type            Lithium
32 MB Memory card      Present                      Max. movie length*      Medium
Memory type            Compact flash card type II   Viewfinder              Optical
Compression type       TIFF                         Built-in memory size*   Large
Software               Absent                       32 MB Memory Card       Present
Memory type            Memory stick                 LCD screen size*        Small

Chapter 5  Algorithm

5.1 Complexity of RECMAN

Next, we analyze the complexity of RECMAN. We show that it is NP-hard and is closely related to the Multiple-choice Knapsack Problem (MCKP) [8]. MCKP is a generalization of the classical knapsack problem. In the classical knapsack problem, given a set of n items, where each item j ∈ [1, n] is associated with a profit p_j and a weight w_j, the goal is to pick a set of items such that their total weight is under a given bound W and the total profit is maximum. In MCKP, the items are partitioned into m lists (or classes), and exactly one item must be selected from each list such that the total weight is under W and the profit is maximum.

Theorem 1. RECMAN as defined in Problem 1, with COST(Γ) and POPULARITY(Γ) defined as in Equation 4.1 and Equation 4.2, respectively, is NP-hard.

Proof. We reduce MCKP to RECMAN. Given an instance I of MCKP, create an instance G of RECMAN as follows. Set k = 1 and Clb = 0 in instance G. For each list Li in instance I, create an attribute ai in instance G. Next, for each item x in a list Li in I, create a level lx for the corresponding attribute ai in G. Set COST(ai, lx) = wix, where wix is the weight of item x in list Li. Similarly, set POPULARITY(ai, lx) = pix, where pix is the profit of item x in list Li. Finally, set Cub = W.

Considering only the top-1 product design (k = 1), a new product Γ in instance G is a tuple of attribute-level pairs, which in turn corresponds to a set of items S in instance I, exactly one from each list. Clearly, the total weight of S is the same as COST(Γ) and, likewise, the profit of S is the same as POPULARITY(Γ). Hence Γ is a solution to RECMAN iff the corresponding set of items S is a solution to MCKP, showing RECMAN is NP-hard.

5.2 An Optimal Algorithm

One may naturally believe that since the problem is NP-hard, we may need to settle for an approximation algorithm or resort to heuristics. Interestingly, this is not the case with RECMAN. Even though RECMAN is NP-hard, we show that the algorithm that we develop in this section is both optimal and scalable for the real world data sets in our context. Its running time is linear in the upper bound of the cost range of the product, i.e., Cub.
In experiments, we show that our algorithm scales well on large data sets even when the cost upper bound Cub is as high as $100,000. The algorithm is an extension of a dynamic programming algorithm proposed for MCKP [8]. For ease of exposition, we first describe the original algorithm adapted to our context.

5.2.1 Sketch of Original Algorithm

It maintains a table T of size (m+1) × (Cub+1), where each entry in the table contains the popularity of the optimal partial product. If costs are real numbers, they are rounded off to the nearest integer to facilitate table construction [8]. The row index of the table corresponds to an attribute of the product and the column index corresponds to the cost. All indices start with 0. We initialize row 0 with appropriate base cases and then span the entire table T, one row at a time, starting from index [1,0]. At any step, corresponding to index [i, j], we build the optimal partial product such that its cost is exactly j and it contains the attributes {1, ..., i}. The idea is that the optimal partial products for row i can be built recursively from the optimal partial products for row i-1. When the table is built completely, pick the product at index T[m, j] such that j <= Cub and T[m, j] is maximum. Finally, the levels of the attributes of the optimal product are obtained from a backtracking process where, starting from the last row at index [m, j], the optimal level for each attribute is used to arrive at the previous row. This is done until the levels of all attributes are extracted.

5.2.2 Our Algorithm

Our problem differs from MCKP in two respects. (i) First, instead of only one optimal product, we are interested in the top-k optimal products. (ii) Second, in addition to an upper bound on cost, we also have a lower bound on it. Hence, RECMAN is a generalization of MCKP, and we adapt the above framework according to our needs. However, extending the original algorithm to handle the top-k requirement (instead of top-1) is non-trivial, as we now need to ensure that (i) no redundant products are formed and (ii) no product among the top-k optimal products is missed. In particular, instead of maintaining one optimal partial product at index [i, j] in table T, we maintain a list of at most k partial products.

To the best of our knowledge, this is the first work that proposes top-k solutions to MCKP, while the existing works only consider the top-1 solution [8]. The well-known Lawler's procedure [12] could be used to obtain a top-k algorithm. It is a general technique to enumerate the top-k solutions to any optimization problem, given an algorithm for the optimal solution. It is worth noting that if we use Lawler's procedure combined with the top-1 algorithm for MCKP, the running time complexity would be much higher than that of our top-k algorithm. We will revisit this after describing our algorithm. Thus, this is a significant algorithmic contribution in itself.

The overall process is presented in Algorithm 1. We represent a (partial) product using a vector p = <attr, cost, depth, popularity, level, flag>, where p.attr is the row index, p.cost is the column index and p.depth is the depth or rank of the (partial) product p among the top-k partial products at the corresponding index [p.attr, p.cost]. p.popularity stores the popularity of the partial product p, while p.level keeps the optimal level of the attribute p.attr. The variable flag is employed to save temporary information. We first initialize index [0,0] of table T by inserting an empty product there with popularity 0 (line 1).
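As an aside, before the detailed walkthrough of Algorithms 2 and 3, here is a minimal sketch of the top-1 dynamic program outlined in Section 5.2.1, written in Python under simplifying assumptions: integer level costs, the per-level (cost, popularity) values supplied directly, and the chosen levels carried along instead of being recovered by backtracking. It is not the thesis implementation, and the toy numbers are made up (they are not those of Figure 5.1).

```python
def best_design(levels, c_lb, c_ub):
    """levels[i] is a list of (cost, popularity) pairs for attribute i.
    Returns (max popularity, chosen level index per attribute) for the best
    single design whose total cost lies in [c_lb, c_ub], or None if none exists."""
    # best[c] = (popularity, picks) of the best partial product with cost exactly c.
    best = {0: (0.0, [])}
    for attr in levels:                       # one "row" of table T per attribute
        nxt = {}
        for cost, (pop, picks) in best.items():
            for idx, (lc, lp) in enumerate(attr):
                c = cost + lc
                if c > c_ub:                  # exceeds the cost upper bound
                    continue
                cand = (pop + lp, picks + [idx])
                if c not in nxt or cand[0] > nxt[c][0]:
                    nxt[c] = cand
        best = nxt
    feasible = {c: v for c, v in best.items() if c_lb <= c <= c_ub}
    if not feasible:
        return None
    c_star = max(feasible, key=lambda c: feasible[c][0])
    return feasible[c_star]

# Toy instance: two attributes with made-up (cost, popularity) per level.
levels = [[(20, 30), (40, 60), (70, 80)],     # e.g., Small, Medium, Large
          [(20, 40), (35, 25)]]               # e.g., LCD, Plasma
print(best_design(levels, c_lb=50, c_ub=80))
```

The thesis's Algorithm 2 extends each table cell from a single best partial product to a list of up to k of them, and Algorithm 3 recovers the chosen levels by backtracking rather than storing them alongside each entry.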
Then, using Algorithm 2, we span the table, one row at a time, starting from index [1,0], and build the list of top-k partial products (lines 2-4). Finally, we backtrack to get the top-k optimal product profiles, using Algorithm 3 (line 5). In this section, the term product profile refers to the set of attributes and the corresponding values of the product.

Algorithm 1 Overall process
 1: Initialize [0,0] of table T by inserting an empty product with popularity 0.
 2: for i ← 1 : m do
 3:   for j ← 0 : Cub do
 4:     Build top-k partial products at [i, j] (Algorithm 2).
 5: Backtrack to get back the top-k product profiles (Algorithm 3).

We describe Algorithm 2 with the help of an example, shown in Figure 5.1, where we revisit the toy example from the introduction. Cost and popularity information for both attributes Screen Size and Screen Type for the category television are presented in Figure 5.1a. Figure 5.1b shows the dynamic programming table T. The cost bounds Clb and Cub are set to 50 and 80, respectively, while k is set to 3. Table T has m+1 = 3 rows and potentially Cub+1 = 81 columns. The table is sparse and we show only the columns that have at least one filled entry. Each cell can have up to k = 3 partial products.

Figure 5.1: Television with two Attributes - Screen Size and Screen Type. k = 3, Clb = 50 and Cub = 80 (a) Cost and Popularity of various Levels of both Attributes (b) Dynamic Programming Table (c) Top-3 Products

Algorithm 2 Build top-k products for index [i, j]
 1: for depth ← 1 : k do
 2:   Initialize a new product p with p.attr ← i, p.cost ← j and p.depth ← depth.
 3:   S ← ∅.
 4:   for each level corresponding to attribute p.attr do
 5:     Let p' be the product such that p'.attr = p.attr - 1, p'.cost = p.cost - COST(p.attr, level), and p'.depth = 1 + |{p'' | p''.attr = p.attr, p''.cost = p.cost, p''.level = level, p''.depth < p.depth}|.
 6:     p'.flag ← level; S ← S ∪ {p'}.
 7:   If S = ∅, break.
 8:   p'_best ← argmax_{p' ∈ S} (p'.popularity + POPULARITY(p.attr, p'.flag)).
 9:   p.level ← p'_best.flag.
10:   p.popularity ← p'_best.popularity + POPULARITY(p.attr, p.level).

For each index [i, j] and each value of depth, Algorithm 2 builds a new product p. It starts by initializing p (line 2). In S (initialized in line 3), we store the set of partial products of row i-1 that help us in building p. We look at each level that the product p may possibly assume for the current attribute p.attr (line 4). To calculate the optimal popularity, we look at every relevant product p' from the previous row (line 5), store it in S (line 6), and later pick the one which provides the maximum popularity (line 8). For a product p' to be relevant, it should clearly belong to row p.attr - 1. Moreover, its cost should be exactly p.cost - COST(p.attr, level), so that adding level level for attribute p.attr to p' would make it equivalent to p. Recall that COST(p.attr, level) is the cost of the level level for attribute p.attr, which can be determined with a cost oracle, e.g., using linear regression. To ensure that we do not form redundant products, we take the partial product p' that has not yet been combined with level level to build a top-k partial product for the current index [i, j]. At the same time, we want p' to be optimal among all partial products satisfying these conditions. Hence, p'.depth = 1 plus the number of partial products that have previously been formed at [p.attr, p.cost] using level level (line 5). If there exists such a product, we add it to S (line 6).
In flag, we store the corresponding level (line 6). Next, the chosen level is saved in p.level (line 9), which is used in the subsequent backtracking process, discussed below. Finally, the popularity of p is calculated (line 10).

Once the table T is built, we can identify the top-k products w.r.t. maximum popularity by following Algorithm 3. We look at each product p such that p.attr = m and p.cost ∈ [Clb, Cub] and take the top-k among them (line 1). Note that, until now, we only have the top-k popularity values (in Figure 5.1b, the top-3 popularity values are 120, 100 and 90), and to get the complete product profile, we use the backtracking process. We store the product profiles in a set Q, such that each product profile q ∈ Q contains m entries of the form <a, l>, denoting the level l for attribute a in product q (shown in Figure 5.1c). Q is initially empty (line 2). We analyze one product p among the top-k products at a time (line 4) and initialize an empty product profile for it (line 5). The product profile is built in lines 6-9, where the goal is to determine the levels of the attributes that were used to form the product in context. To this end, we backtrack one row at a time and re-discover the partial products. Let p' denote such a partial product. Clearly, p'.attr = p.attr - 1 as we move up, one row at a time. Recall that we store the optimal level in p.level, and hence p'.cost = p.cost - COST(p.level). Finally, there may be multiple partial products p' that satisfy these conditions. Similar to Algorithm 2, we want p'.depth such that no redundant products are formed and, at the same time, p' is optimal at its depth (line 8).

Algorithm 3 Backtrack
 1: P ← k-argmax_{p | p.attr = m, p.cost ∈ [Clb, Cub]} p.popularity.
 2: Q ← ∅.
 3: while P ≠ ∅ do
 4:   p ← P.pop().
 5:   Initialize an empty product profile q.
 6:   while p ≠ ∅ do
 7:     q.push(<p.attr, p.level>).
 8:     Let p' be the product such that p'.attr = p.attr - 1, p'.cost = p.cost - COST(p.level), and p'.depth = 1 + |{p'' | p''.attr = p.attr, p''.cost = p.cost, p''.level = p.level, p''.depth < p.depth}|.
 9:     p ← p'.
10:   Q ← Q ∪ {q}.

Note that we can identify the position of a partial product in table T by a smaller tuple <attr, cost, depth>, which contains only the first three elements of the entire product tuple. In the example, we can see that while building the partial product at position <2,60,2> (filled rectangle in Figure 5.1b), S consists of the partial products at positions <1,20,1> and <1,40,2> (ovals in Figure 5.1b). Note that S does not contain the partial product at <1,40,1>, since it has already been used to construct the product at <2,60,1>. The values of flag for the partial products at <1,20,1> and <1,40,2> would correspond to levels LCD and Plasma, respectively. The level chosen by the algorithm corresponding to position <2,60,2> is LCD, because p'_best would be at position <1,20,1> (filled oval in Figure 5.1b).

We next show that our algorithm is optimal.

Theorem 2 (Optimality). The algorithm described above correctly finds the optimal solution to RECMAN.

Proof. For brevity, we only show the proof that the popularity found by the algorithm is the correct optimal popularity. The correctness of Algorithm 3 in finding the optimal product profile is straightforward. The proof is by induction.

Base Case: The base case corresponds to row index i being 0. Index [0,0] contains the empty product (with no attributes) with cost and popularity 0. Any other index [0, j] for j ∈ {1, ..., Cub} doesn't contain any product, as the cost of an empty product cannot be more than 0.
This is trivially correct.

Induction Step: Assume optimality for row i-1, i.e., the algorithm correctly discovers all the top-k optimal products corresponding to row i-1. Consider any index [i, j], for any j. Let S^a_i be the set of partial products at index [i, j] found by our algorithm, and S*_i be the top-k optimal products at index [i, j]. Assume there exists a product x ∈ S*_i with x ∉ S^a_i, that is, x is among the top-k optimal products at index [i, j], but our algorithm didn't find it. To be precise, x.attr = i, x.cost = j. Let x' be the partial product we obtain after removing attribute i from x. Then, x'.attr = i-1, x'.cost = x.cost - COST(x.attr, x.level) and x'.popularity = x.popularity - POPULARITY(x.attr, x.level). Let S*_{i-1} be the top-k optimal products at index [i-1, x'.cost], and S^a_{i-1} be the products identified by our algorithm.

Three possible cases arise: (i) x' does not exist, in which case x does not exist (which is a contradiction). (ii) x' exists but x' ∉ S*_{i-1}. In this case, we can obtain k better products (in terms of popularity) than x at index [i, j], by adding level x.level for attribute i (to the products in S*_{i-1}). Hence, x ∉ S*_i, which is a contradiction. (iii) x' exists and x' ∈ S*_{i-1}. In this case, because of the induction hypothesis, which assumes optimality for row i-1, all products in S*_{i-1}, including x', must have been identified by our algorithm. As x' is already discovered by our algorithm, it must have been considered by the algorithm to build the top-k products at index [i, j]. Since x is not found by the algorithm, for every product p ∈ S^a_i we must have p.popularity ≥ x.popularity and |S^a_i| = k. Hence, x cannot be in S*_i. Again, this is a contradiction. This completes the proof.

Complexity of the Algorithm. It can be shown that the space complexity of our algorithm is O(Cub · k · m), while the time complexity is O(Cub · k · Σ_{i∈[1,m]} Li), where Li is the number of levels for attribute i. In comparison, if we use Lawler's procedure [12] along with the optimal (top-1) algorithm for MCKP, then the space complexity would be O(Cub · m) and the time complexity would be O(Cub · k · (Σ_{i∈[1,m]} Li)^2). Our algorithm runs faster than Lawler's procedure, but uses more memory. Hence, we provide an alternative method to find top-k solutions of MCKP that trades space for time.

Chapter 6  Experiments

The goals of our experiments are two-fold: (1) evaluating our algorithm by comparing it to other intuitive and natural heuristics, studying the effect of the parameter k on the popularity achieved and, finally, showing the scalability of our algorithm by running it on a larger (synthetic) data set (Section 6.1); and (2) reporting the configuration of the best product generated by our algorithm (Section 6.2).

6.1 Evaluating our Algorithm

6.1.1 Comparison of Algorithms w.r.t. Popularity

In Theorem 2, we proved that our algorithm outputs the top-k optimal product recommendations. Here, we compare the algorithm with other natural heuristics empirically. Before we study the effect of k, we compare the following algorithms and heuristics for the case when k = 1. Recall that our algorithm is exact and, hence, we expect the popularity achieved by it to be an upper bound. The aim of this comparison is to look at the difference in popularity achieved by the exact algorithm and the various heuristics. We use attr(p) to denote the level assumed by the product p for the attribute attr. We let E denote the set of existing products in a given data set.

OPTIMAL: This is our algorithm as described in Section 5.2.
Note that the costs (linear regression coefficients) of individual levels are, in general, real-valued. We round them off to turn them into integers, with no significant loss of accuracy in modeling the cost of a product.

RANDOM: We generate 100,000 random products by randomly selecting a level corresponding to each attribute. Based on its cost, each product is binned into one of the cost ranges, and the average popularity is calculated for each bin. Note that the popularity returned might not correspond to an actual product; rather, it is the popularity that one can expect when a product is chosen at random. We take this as a baseline.

EXISTING: Given the cost bounds, this method selects an existing product in the data set that has cost within the bounds and has maximum popularity. That is, a "new" product Γ is chosen such that

    Γ = argmax_{p ∈ E : COST(p) ∈ [Clb, Cub]} POPULARITY(p)

The idea is to see how the optimal product found by our algorithm compares in terms of popularity with the best existing product whose cost falls in the range [Clb, Cub].

GREEDY: This is a relatively sophisticated heuristic. For each attribute attr, a level is selected such that

    attr(Γ) = argmin_{level} COST(attr, level)

For all levels l (except attr(Γ)) corresponding to attribute attr, the efficiency is defined as

    Efficiency(attr, l) = (POPULARITY(attr, l) - POPULARITY(attr, attr(Γ))) / (COST(attr, l) - COST(attr, attr(Γ)))

At any step, attr(Γ) is replaced by another level l corresponding to attr that has the current maximum efficiency. Further, the efficiencies of the remaining levels of attr are updated. This is repeated as long as COST(Γ) does not exceed Cub. The popularities of all products which have cost within the cost range are noted, and the maximum popularity becomes the greedy solution. It may happen that during this process, COST(Γ) jumps directly from below Clb to above Cub. In this case, instead of the maximum efficiency, the levels corresponding to lower efficiencies are considered, until a product is formed within the cost range or no unseen level is left. In the latter case, the greedy heuristic returns the null product.

In all experiments, we set the size of the cost range to be Cub - Clb = 100, and Cub is the cost shown on the plots. The work done would be almost the same even if the size of the cost range were varied for a given value of the cost upper bound.

In Figure 6.1, Figure 6.2 and Figure 6.3, we compare the various methods described above w.r.t. the popularity achieved, where the popularity is predicted using linear regression, for television, camera and laptop, respectively. The results show that our OPTIMAL algorithm beats the other algorithms, including GREEDY, by a comfortable margin on all 3 data sets. For instance, in the television category (Figure 6.1), for the cost bound of 1900-2000, OPTIMAL achieves a popularity of 174, an improvement of 30% over GREEDY, which achieves 134. For the same cost bound, EXISTING and RANDOM achieve popularities of 67 and 96, respectively. Hence, for the same cost bound, there is significant room for improvement in the products currently existing in the market. Note that, at certain cost bounds, the popularity achieved by GREEDY is the same as the popularity achieved by OPTIMAL. Similarly, in the camera category (Figure 6.2), for the cost bound of 1900-2000, while the popularity achieved using the OPTIMAL algorithm is 50, it is 37 using the GREEDY algorithm. Interestingly, for laptop (Figure 6.3), we observe that GREEDY performs very well, almost as good as OPTIMAL, for most cost bounds.
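To make the GREEDY heuristic above concrete, here is a minimal sketch in Python under simplifying assumptions: integer per-level costs and popularities supplied directly, efficiencies recomputed relative to the current pick at each step, and the fallback to lower-efficiency swaps (used when the cost jumps over the range) omitted for brevity. It is an illustration, not the thesis implementation, and the toy numbers are made up.

```python
def greedy_design(levels, c_lb, c_ub):
    """levels[a] = list of (cost, popularity) pairs for attribute a's levels.
    Returns the best popularity among intermediate designs with cost in [c_lb, c_ub],
    or None if no such design is encountered."""
    # Start from the cheapest level of every attribute.
    pick = [min(range(len(ls)), key=lambda i: ls[i][0]) for ls in levels]
    cost = sum(levels[a][pick[a]][0] for a in range(len(levels)))
    pop = sum(levels[a][pick[a]][1] for a in range(len(levels)))
    best = pop if c_lb <= cost <= c_ub else None

    while True:
        # Efficiency of upgrading attribute a to a costlier level i.
        swaps = []
        for a, ls in enumerate(levels):
            c0, p0 = ls[pick[a]]
            for i, (c, p) in enumerate(ls):
                if i != pick[a] and c > c0:
                    swaps.append(((p - p0) / (c - c0), a, i))
        if not swaps:
            break
        _, a, i = max(swaps)                       # highest-efficiency swap
        new_cost = cost - levels[a][pick[a]][0] + levels[a][i][0]
        if new_cost > c_ub:                        # would exceed the upper bound; stop
            break
        pop = pop - levels[a][pick[a]][1] + levels[a][i][1]
        cost, pick[a] = new_cost, i
        if c_lb <= cost <= c_ub and (best is None or pop > best):
            best = pop
    return best

# Toy instance with made-up (cost, popularity) values per level.
print(greedy_design([[(20, 30), (40, 60), (70, 80)], [(20, 40), (35, 25)]], 50, 80))
```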
Overall, OPTIMAL outperforms the other methods, and GREEDY is the next best choice.

6.1.2 Variation in Popularity of Top-k Products

Figure 6.4 shows the variation in the popularity of the top-k products, for different values of cost (i.e., C_ub), in the television category. As expected, the popularity decreases as the value of k increases. Interestingly, the decrease in popularity is slow. For instance, when the cost is 1500, the popularity of the top-1 product is 185 while the popularity of the 100th product is 162, i.e., the average decrease in popularity is only about 0.23 per unit increase in k. This observation suggests that there is not much of a difference in popularity among the top products (in the real data sets we tested), strengthening our hypothesis that it is important to recommend multiple products to the marketing decision maker: she can look at a variety of products, all with high popularity, and choose one from among them.

6.1.3 Scalability

The three key factors affecting the running time and space complexity of our algorithm are: the cost upper bound C_ub, the number of products returned k, and the number of levels in each attribute. Note that, in Figure 6.2, the maximum possible cost of any camera is 2700, which is not representative of all real-world products. Thus, to cover most actual products in real life, we create a SYNTHETIC data set by scaling the costs in the camera category (camera is the biggest of the three real data sets). The scaling is done so that the sum of the cost coefficients of the maximum-cost levels of each attribute becomes 100,000, i.e., the maximum cost of a possible product is 100,000, which, we believe, covers the cost of most actual products in real life. We also add dummy attribute-level pairs to double the number of attribute-level pairs in the original camera data set. This is done to show the scalability of the algorithm w.r.t. the number of feasible products. We show the results in Figure 6.5 and Figure 6.6.

Figure 6.5 shows the results for five different costs (C_ub) between 0 and 100,000, at equal intervals, and three different values of k (note the two y-axes in the figure). We vary k over 1, 10 and 100. It is clear that the algorithm is scalable both in running time (Figure 6.5a) and memory (Figure 6.5b). For example, for C_ub = 50,000, the time taken for k = 1, 10 and 100 is just 5, 6 and 13 seconds, respectively. Similarly, the memory used is 0.5, 0.8 and 3.0 GB, respectively. Thus, the increase in time and memory is not huge even when the number of products recommended is increased by orders of magnitude.

In Figure 6.6, we plot the time and memory performance against the number of feasible products. The plot has two y-axes: on the left we have running time and on the right, memory usage. We fixed C_ub = 100,000 and k = 100. To vary the number of feasible products, we remove ten attributes at a time from the SYNTHETIC data set. As we can see, the algorithm scales well with respect to the number of feasible products and finishes within 25 seconds using only 5 GB of memory even when the number of feasible products is 10^25.

To summarize, the OPTIMAL algorithm, as expected, provides the upper bound on the popularity that can be achieved by any algorithm, given the cost bound. More interestingly, there is a considerable difference between the popularity achieved by OPTIMAL and that of all the other heuristics. Not only is it optimal, it is also scalable, w.r.t. both running time and memory usage. Finally, our experiments show that the top-k products for k up to 100 have close values of popularity.
Thus, recommending multiple top products, instead of just one, is important as it provides the manufacturer with the flexibility to select the product they would like to launch based on external considerations.

6.2 Recommended Product Designs

In Table 6.1, we report the top-10 attributes (w.r.t. popularity) and the levels of the best product returned by OPTIMAL in the television and camera categories, respectively. The cost bound has been fixed to 900-1000. These results convey various interesting insights to the manufacturer, e.g., the refresh rate of the most popular television is 120 Hz, as opposed to other, higher refresh rates. Similarly, televisions with LCD flat-panels are more popular than those with LED flat-panels. In the camera category, cameras with a memory stick and a CMOS sensor are more popular than cameras with built-in memory and a CCD sensor. This kind of exploration is enabled by the fact that our algorithm is highly scalable, allowing the manufacturer to vary the cost bounds at will and see what kinds of recommendations come out.

Figure 6.1: Comparison of Algorithms w.r.t. Popularity in Television Category (Popularity vs. Cost for Optimal, Greedy, Existing and Random).

Figure 6.2: Comparison of Algorithms w.r.t. Popularity in Camera Category.

Figure 6.3: Comparison of Algorithms w.r.t. Popularity in Laptop Category.

Figure 6.4: Popularity of Top-100 Products for four different Costs (500, 1500, 2500, 3500) in Television Category.

Figure 6.5: Running Time and Memory Usage with respect to Cost Upper Bound and k on the SYNTHETIC Data Set. (a) Time (b) Memory.

Figure 6.6: Running Time and Memory Usage with respect to Number of Feasible Products on the SYNTHETIC Data Set. X-axis is logarithmic.

(a) Television
Attribute: Levels
Brand: Dynex, Sony, Insignia, LG, Samsung
Refresh rate: 120 Hz, 60 Hz, 240 Hz, 600 Hz
Dollar savings*: Medium, Low, High
#HDMI inputs: 2, 3, 4
Television type: LCD flat-panel, LED flat-panel
Weight*: Small, Large
Height*: Small, Large
Mount bracket pattern: 200mm x 200mm, 400mm x 400mm
Max. horizontal resolution: 1366 pixels, 1920 pixels
Width*: Large, Small

(b) Camera
Attribute: Levels
Memory type: Memory stick, Built-in, Compact flash card type I, Compact flash card type II, IBM microdrive, SD card, SDHC card
Compression modes: Basic, Fine, Normal, Uncompressed
Resolution*: Low, Medium, High
Battery type: Lithium, 2XAA, Li-ion
Max. movie length*: Short, Long
LCD screen size*: Small, Medium, Large
Max. image resolution*: Medium, Low, High
Height*: Small, Medium, Large
16 MB Memory Card: Present, Absent
Image sensor type: CMOS, CCD

Table 6.1: Top 10 Attributes (in decreasing order of Popularity) of the Best Product. (a) Television (b) Camera. The right column shows all possible Levels for the particular Attribute; the Level present in the Best Product is underlined. * denotes Numerical Attributes.

Chapter 7

Conclusions and Future Work

Previous RS research almost exclusively focuses on generating recommendations to customers.
In this thesis, we direct the attention of the community to an important third entity type of this ecosystem, the product manufacturers, and propose a novel problem from their business perspective: new product recommendations for manufacturers. Our problem is to recommend the top-k new products that maximize popularity while satisfying the cost constraints given by the manufacturer. We explored various regression models to predict the cost and popularity of a product. We show that the problem is NP-hard and develop a scalable pseudo-polynomial time algorithm by exploiting the connection to MCKP. We perform a comprehensive empirical analysis on three real data sets, and show that our exact algorithm outperforms various natural heuristics by a significant margin in terms of the popularity achieved, while remaining highly scalable.

This work opens up a rich area for future research. Below, we briefly discuss some interesting directions.

Alternative Objectives. We studied RECMAN with the objective of maximizing popularity, expressed in terms of the number of ratings in the data sets we could gather. If data sets with page views or store sales information could be made available, they would be valuable in further validating our research. Indeed, alternative objectives can be formulated and optimized in place of popularity: e.g., maximizing the number of satisfied users (say, as measured by the ratings/reviews provided) or maximizing the company's profit [20], given the manufacturer's profit margins. Our framework is general enough to accommodate the optimization of different objective functions, with minimal changes to the approach.

Constraints. Not every attribute that is a strong predictor of the popularity or cost of a product is necessarily actionable for a manufacturer. For example, we found that brand is one of the most important attributes in predicting popularity. But a manufacturer cannot simply assume a top-selling brand; rather, this implies that building a good reputation for one's brand is important. Similarly, real-world product design often has technical feasibility constraints: e.g., a very light-weight laptop with a DVD-ROM and very large storage (hard disk) may not be feasible. It is important to extend our product recommendation algorithm for manufacturers to incorporate such constraints. The aim of this thesis is not to replace the manufacturers, or their engineering design experts; rather, the idea is to recommend promising candidates for new product designs to the manufacturers, whose expertise is still required for post-processing the set of designs recommended. As an example of this post-processing, keeping abreast of the technological advances in the field is essential for ultimately deciding on the new product(s) that a manufacturer may want to launch.

Cost and Popularity Oracles. We presented some approaches for building cost and popularity oracles in Chapter 4. There is considerable room for improvement in their prediction accuracy. Part of the challenge in improving the accuracy is the availability of relevant data. It would be desirable to include additional information, such as the quality of implementation of various attributes, marketing, and overall consumer experience, as input to the cost/popularity prediction oracles.
Further, it might be a good idea to normalize the popularity of a product by the length of time it has been present in the market, to compensate for the fact that products that have been in the market longer have an unfair advantage over recent products in terms of the popularity achieved.

Bibliography

[1] N. Archak, A. Ghose, and P. G. Ipeirotis. Show me the money!: Deriving the pricing power of product features by mining consumer reviews. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), pages 56-65. ACM, 2007. doi:10.1145/1281192.1281202.

[2] M. Das, G. Das, and V. Hristidis. Leveraging collaborative tagging for web item design. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), pages 538-546. ACM, 2011. doi:10.1145/2020408.2020493.

[3] P. Green, A. Krieger, and Y. Wind. Thirty years of conjoint analysis: Reflections and prospects. In Y. Wind and P. Green, editors, Marketing Research and Modeling: Progress and Prospects, volume 14 of International Series in Quantitative Marketing, pages 117-139. Springer US, 2004. doi:10.1007/978-0-387-28692-1_6.

[4] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, 2001.

[5] J. Hauser and V. Rao. Conjoint analysis, related modeling, and applications. In Y. Wind and P. Green, editors, Marketing Research and Modeling: Progress and Prospects, volume 14 of International Series in Quantitative Marketing, pages 141-168. Springer US, 2004. doi:10.1007/978-0-387-28692-1_7.

[6] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), pages 168-177. ACM, 2004. doi:10.1145/1014052.1014073.

[7] N. Jindal and B. Liu. Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining (WSDM '08), pages 219-230. ACM, 2008. doi:10.1145/1341531.1341560.

[8] H. Kellerer, U. Pferschy, and D. Pisinger. Knapsack Problems. Springer, 2004.

[9] R. Kohli and R. Krishnamurti. Optimal product design using conjoint analysis: Computational complexity and algorithms. European Journal of Operational Research, 40(2):186-195, 1989. doi:10.1016/0377-2217(89)90329-9.

[10] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30-37, 2009. doi:10.1109/MC.2009.263.

[11] P. Kotler, G. Armstrong, V. Wong, and J. Saunders. Principles of Marketing. Financial Times Prentice Hall, 2008.

[12] E. L. Lawler. A procedure for computing the k best solutions to discrete optimization problems and its application to the shortest path problem. Management Science, 18(7):401-405, 1972. URL http://www.jstor.org/stable/2629357.

[13] B. Liu. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2012.

[14] M. Miah, G. Das, V. Hristidis, and H. Mannila. Determining attributes to maximize visibility of objects. IEEE Transactions on Knowledge and Data Engineering, 21(7):959-973, 2009. doi:10.1109/TKDE.2009.72.

[15] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005. doi:10.1109/TPAMI.2005.159.

[16] A.-M. Popescu and O. Etzioni. Extracting product features and opinions from reviews. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT '05), pages 339-346. Association for Computational Linguistics, 2005. doi:10.3115/1220575.1220618.

[17] F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors. Recommender Systems Handbook. Springer, 2011.

[18] C. Schön. On the optimal product line selection problem with price discrimination. Management Science, 56(5):896-902, 2010. doi:10.1287/mnsc.1100.1160.

[19] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining (First Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.

[20] S. Tsafarakis and N. Matsatsinis. Designing optimal products: Algorithms and systems. In J. Casillas and F. Martínez-López, editors, Marketing Intelligent Systems Using Soft Computing, volume 258 of Studies in Fuzziness and Soft Computing, pages 295-336. Springer Berlin Heidelberg, 2010. doi:10.1007/978-3-642-15606-9_19.

[21] X. J. Wang, J. D. Camm, and D. J. Curry. A branch-and-price approach to the share-of-choice product line design problem. Management Science, 55(10):1718-1728, 2009. doi:10.1287/mnsc.1090.1058.

[22] S. Xie, G. Wang, S. Lin, and P. S. Yu. Review spam detection via temporal pattern discovery. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '12), pages 823-831. ACM, 2012. doi:10.1145/2339530.2339662.