Hierarchical structure and ordinal features in class-based linear models Wang, Wan Shing Martin 2021

Full Text

Hierarchical Structure and Ordinal Features in Class-based Linear Models

by

Wan Shing Martin Wang

B.S. Cornell University, 2018

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

February 2021

© Wan Shing Martin Wang, 2021

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Hierarchical Structure and Ordinal Features in Class-based Linear Models

submitted by Wan Shing Martin Wang in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

David Poole, Computer Science
Supervisor

Giuseppe Carenini, Computer Science
Supervisory Committee Member

Abstract

In many real world datasets, we seek to make predictions about entities, where the entities are in classes that are interrelated. A commonly studied problem, known as the reference class problem, is how to combine information from relevant classes to make predictions about entities.

The intersection of all classes that an entity is a member of constitutes the most specific class for that entity. When seeking to make predictions about such intersection classes for which we have not observed much (or any) data, we would like to combine information from more general classes to create a prior. If there is no data for the intersection, we would have to rely entirely on the prior. However, if data exists but is scarce, we seek to balance the prior with the available data.

We first investigate a model where we assign weights to classes, and additively combine weights to make predictions. The use of regularisation forces generalisation; the signal gets pushed up to more general classes. To make a prediction for an unobserved intersection of classes, we would use the weights from the individual classes that comprise the intersection. We introduce several variants that average the predictions, as well as a probabilistic mix of these variants. We then propose a bounded ancestor method, which balances the creation of an informed prior with observed data for classes with varying amounts of observations.

When dealing with ordinal properties, such as shoe size, we can dynamically create new classes and subclasses in ways that are conducive to creating more informative priors. We do this by splitting the ordinal properties. Throughout, we test on the MovieLens and UCSD Fashion datasets.

We found that a combination of the three bounded ancestor method variants resulted in the best performance, and that the best combination varied between datasets. We found that a simple model that assigns weights to classes and additively makes predictions slightly outperformed the bounded ancestor method for supervised classification. For the bounded ancestor method, we found that splitting ordinal properties in different ways had minimal impact on the error metrics we used.

Lay Summary

When making predictions involving entities and relationships between them, we often want to incorporate prior knowledge to help us, whether from existing domain knowledge, or from entities that are similar to the entities of interest. We investigate how to combine such data to help us make predictions in cases where the data we are interested in is scarce or unavailable.
We propose and evaluate several simple methods for combining data, and we test our hypotheses on a movie prediction dataset and two fashion datasets. We also investigate data that has an ordering, and look at ways to structure this data to help us make better predictions.

Preface

The parameter sharing model in Chapter 2 was developed in joint work with Ali Mohammad Mehr. The parameter sharing model was designed by Dr. David Poole. In his thesis, Ali Mohammad Mehr applies the parameter sharing model to water pollution prediction, and develops and proves several theorems about the parameter sharing model.

This thesis focuses on classification on the MovieLens and Fashion datasets. Implementation and testing for the experiments in this thesis were completed by the author.

The bounded ancestor model and its variants were proposed by Dr. David Poole, and implemented and tested by the author. For the work on constructing hierarchies based on splitting ordinal properties, Dr. David Poole suggested recursively splitting on information gain. The hierarchies and experiments for the ordinal experiments in Chapter 5 were designed and implemented by the author.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Hypotheses
  1.2 Chapter Roadmap
2 Parameter Sharing Model
  2.1 Basic definitions
    2.1.1 Properties, Attributes, & Classes
  2.2 Predictions and Parameters
    2.2.1 Regression
    2.2.2 Error function: Sum of squares
  2.3 Discrete outcomes (Classification)
    2.3.1 Binary outcomes
    2.3.2 Three or more outcomes
    2.3.3 Error function: Negative log likelihood
  2.4 Regularisation
3 Relational Parameter Sharing Model
  3.1 Tuple Classes, Functions & Hierarchies
    3.1.1 Tuple Classes and functions on tuples
    3.1.2 Hierarchies
  3.2 Datasets
    3.2.1 MovieLens
    3.2.2 Fashion datasets
4 Prior and Posterior Prediction for Classes
  4.1 Background
  4.2 Bounded ancestor method
    4.2.1 Notation
  4.3 Bounded ancestor method for trees
    4.3.1 Mixing signals from above and below
    4.3.2 Parameters
  4.4 Bounded ancestor method for graphs
    4.4.1 Multiple parents
    4.4.2 Variant 1: Sibling data
    4.4.3 Variant 2: Class mixing
    4.4.4 Variant 3: Parameter adding
    4.4.5 Numerical example
  4.5 Combination tests
    4.5.1 Experimental setup
    4.5.2 Error Metrics
    4.5.3 Results
    4.5.4 Interpreting the weights
5 Ordinals
  5.1 Classes from ordinals
    5.1.1 Simple intervals
    5.1.2 Recursive binary splitting
    5.1.3 Subclass Hierarchies
  5.2 Experiments
    5.2.1 Hierarchy construction and experiment setup
    5.2.2 Results
    5.2.3 Negative log likelihood loss on classes
    5.2.4 Testing values of k
6 Conclusion and Future Work
  6.1 Results
    6.1.1 Hypothesis 1
    6.1.2 Hypothesis 2
    6.1.3 Hypothesis 3
    6.1.4 Future work
Bibliography
List of Tables

Table 3.1  Example MovieLens 100K [6] data.
Table 3.2  Example User from MovieLens 100K [6] user information.
Table 3.3  Example movie from MovieLens 100K [6] movie information.
Table 3.4  Example datapoint from "Renttherunway" [14].
Table 3.5  Example datapoint from "ModCloth" [14].
Table 4.1  Signal from below for the classes in our numerical example. Class A is (Administrator, Musical), class AP1 is Administrator, and class AP2 is Musical.
Table 4.2  A comparison of the variants on an example, (Administrator, Musical).
Table 5.1  Distribution of points: Simple intervals.
Table 5.2  The L2 regularised parameter sharing model and the bounded ancestor method on the test hierarchies, for "ModCloth". For both metrics, lower is better.
Table 5.3  The L2 regularised parameter sharing model and the bounded ancestor method on the test hierarchies, for "Renttherunway". For both metrics, lower is better.

List of Figures

Figure 3.1  Example Hierarchy involving the pair class (Male ∩ Administrator, Animation), used in MovieLens [6]. We can see that the more general classes are parent classes.
Figure 4.1  We are interested in class A, which has a single parent AP, which has parent AQ.
Figure 4.2  We are interested in class A, which has two parents AP1 and AP2.
Figure 4.3  Example subset lattice from MovieLens 100K [6].
Figure 5.1  The part of the class hierarchy for the Hip size property, if we use simple intervals.
Figure 5.2  The part of the class hierarchy for the Hip size property, if we use binary splitting. The structure is the same for both midpoint splitting and information gain splitting.
Figure 5.3  Part of the class hierarchy for the experiments for binary splitting. The ordinal properties are split into classes that form binary trees whose root is a child of the global class.

Acknowledgments

I would like to give a huge thank you to my supervisor, Dr. David Poole, for the guidance, help and encouragement throughout the research process and when I was writing this thesis. I would also like to thank Ali Mohammad Mehr for his helpfulness when we worked together. I would like to thank my family for their support.

Thanks to NSERC for providing funding to me through Professor David Poole's discovery grant, and thanks to Compute Canada for providing me with computing resources.

Chapter 1
Introduction

We are interested in making predictions about relations involving tuples of entities, where these entities can be categorised into a hierarchical class structure, or classes that are interrelated. Classes of entities induce classes of tuples of entities. We consider the problem of determining how to use observed data in prediction making for relations. This is related to the reference class problem [2], which is concerned with selecting one or more relevant classes to use for a prediction. In the reference class problem, the classes relevant to a prediction are known as reference classes. We are interested in the problem of not only identifying and selecting reference classes, but also how to combine information from them to make a prediction.
In particular, we look at how to make predictions for classes with few or no observations.

One of the datasets we test on is the MovieLens 100K [6] movie prediction dataset. MovieLens 100K data takes the form of (User, Movie, Rating, Timestamp) quadruples, where specific users give ratings to specific movies, together with information about users and movies. We chose MovieLens because we can construct an interesting hierarchical class structure from the user and movie information. MovieLens is also interesting because of the cold start problem. The cold start problem involves predicting the rating scores for users and movies that have no rating data. Cold start is an interesting motivation because it is related to our problem of trying to predict for entities without rating data.

Much of the existing work on MovieLens [6] focuses on collaborative filtering: making predictions about unseen user-movie ratings based on previous ratings made by the user, and the ratings made by other users. An example of an existing collaborative filtering work is Marlin [13]. A classic paper that does collaborative filtering for Netflix movie predictions is Koren and Bell [9]. Collaborative filtering methods don't work well on the cold start problem [19]. This thesis does not deal with collaborative filtering. Instead, we make use of MovieLens as a test case in our work on priors for classes that are arranged hierarchically. As we discuss in the future work section, our work could potentially be used in conjunction with collaborative filtering.

When dealing with the cold start problem for MovieLens, we can use properties of the user and movie entities involved. These properties are modelled in terms of classes. Deciding which properties to use and how to combine information from those properties relates to the reference class problem, which involves identifying which classes to use to make a prediction. Those classes that are used in the prediction are the reference classes.

For example, suppose that when making a prediction about the rating a user gives a movie in MovieLens 100K [6], we know that the user is an administrator and the movie is an action film. Some reference classes relevant to the prediction would be the set of all (user, movie) pairs, the set of such pairs where the user is an administrator, the set of pairs where the movie is an action movie, and the set of pairs with an administrator and an action movie.

An issue with real world datasets such as MovieLens 100K [6] is that data for certain classes can be nonexistent or scarce. More specific classes have fewer observations. For instance, in the MovieLens 100K dataset, the class of (user, movie) pairs where the user's occupation is administrator has 7479 members, while the class of (user, movie) pairs where the user is male and an administrator and the movie is an animation film has 128 members. (MovieLens 100K has 100,000 datapoints in total.)

When making predictions about classes with highly constrained descriptions, for example male administrators and animated drama documentaries, the usefulness of the data from those classes is limited if very few male administrators have rated animated drama documentaries. To predict for the constrained class, we seek to combine information from more general classes, which are the reference classes for making predictions about the class with few ratings.
To do so, we seek a principled prior from the more general classes to help us.

The scarcity problem involving classes with highly constrained descriptions can be subdivided into these two cases.

• When there is no data for a specific class (for example, no male administrators have rated animation drama documentaries), we want to directly utilise information from more general classes for prediction making. Examples of general classes that could be of use in obtaining a prior are the class of (user, movie) pairs where the user is male, or the class of (user, movie) pairs where the movie is a documentary. Since more general classes are more likely to have rating data, the goal would be to combine information from these more general classes to create an informed prior for the more specific class.

• When rating data for a class is scarce but not nonexistent, if we only use the limited amount of data for the class that we have, we would overfit on that data. To avoid overfitting, we could use a hybrid approach that combines an informed prior with the data that we do observe for the class. We need to balance the prior with the observed data, by using a mixture of both.

We first investigate a model where weights are assigned to classes, and prediction making is done by additively combining weights. We call this the parameter sharing model. L2 regularisation can be used when learning weights. In the parameter sharing model, we make predictions for an entity by adding the weights of more general classes. L2 regularisation forces classes with no observations to have weights of zero. For example, if no data for administrators and action movies is observed, the prediction for administrators and action movies will include the sum of the weights for only administrators and the weights for only action movies. The weight for the class whose members are both administrators and action movies will have a value of zero.

1.1 Hypotheses

This thesis investigates how to combine information from general classes when data for a more tightly constrained class is scarce or nonexistent.

The simplest and most obvious approach, using L2 regularisation with the parameter sharing model for learning weights, doesn't work well for learning about classes with very few or no datapoints.

We introduce the bounded ancestor method for combining information from priors with information from observed data. We also introduce several variants for learning about a class from more general classes. We are interested in the effectiveness of these variants for combining data from more general classes to get priors for a more constrained class.

1. We hypothesize that the bounded ancestor method will be more effective than L2 regularisation for learning about classes with few or no datapoints.

2. We hypothesize that a weighted combination of these variants is more effective in creating priors than any individual variant.

3. We hypothesize that for ordinal data, dynamically creating new classes by splitting existing classes can help us make better predictions.

We test our hypotheses on three real world datasets: the MovieLens [6] movie prediction dataset, the "ModCloth" [14] fashion dataset, and the "Renttherunway" [14] fashion dataset. All three datasets involve making predictions about relations involving pairs of entities. We introduce these datasets in Chapter 3.

1.2 Chapter Roadmap

In Chapter 2, we introduce the basic parameter sharing model.
This model was developed in joint work with Ali Mohammad Mehr [15], who proves theoretical properties about the model in his thesis, and applies the model to water pollution prediction. We also introduce important terms and pieces of notation regarding entities, classes, and hierarchies.

In Chapter 3, we show how the parameter sharing model can work with entities and relations involving tuples of entities, and how, when working with tuples of entities, we can work under the same framework as the basic parameter sharing model. This is important because our three test datasets involve relations on pairs of entities.

In Chapter 4, we investigate the first and second hypotheses. We propose the bounded ancestor method, where we combine information from more general classes to create informed priors for more specific unobserved classes. We then combine the prior with observed data. Using the example above, we would combine limited observed data for male administrators and animated drama documentaries with a prior obtained individually from male ratings, administrator ratings, animated films, and so on. A tradeoff exists between emphasizing the prior or the observed data: too much emphasis on the observed data causes overfitting.

We take into account the importance of observed data by bounding the effect of the prior. If no data is observed, we use the prior exclusively. As more data is observed, the data starts to take over. Building on this method, we then introduce several variants that outline different ways to combine information from individual classes to create a prior. We then investigate and compare their performance on several test datasets.

In Chapter 5, we investigate the third hypothesis, by looking into different ways of creating new classes by splitting existing classes based on ordinal properties, and we compare these splitting methods along with the parameter sharing model and the bounded ancestor model.

Chapter 2
Parameter Sharing Model

In this chapter, we introduce the parameter sharing model. The original version, which we call the basic version of the parameter sharing model, was developed in a project with Ali Mohammad Mehr [15]. The basic model performs regression and classification on individual entities. Regression takes in an entity as input and produces a numerical output. Classification produces a discrete outcome as output, and the parameter sharing model outputs the probabilities of these outcomes.

An example regression application is making predictions for water pollution, which Mohammad Mehr investigates in his thesis [15]. In Chapter 3, we show how the parameter sharing model can be used for relations on tuples of entities to predict discrete outcomes.

We begin this chapter by introducing key definitions about properties, attributes and classes. We then discuss class hierarchies, and we transition to a general discussion about learning and regularisation under the basic parameter sharing model. Since MovieLens and Fashion involve tuples of entities, we leave discussion of those datasets to the subsequent chapters, when we formally introduce tuple classes.

2.1 Basic definitions

2.1.1 Properties, Attributes, & Classes

We define a property as a function that takes in an entity and outputs a value. Example properties are "height" and "shoe size". All the properties we work with are functions. We convert non-functional properties to Boolean functions. For example, "Movie Genre" can have multiple outputs such as "Action" and "Drama".
We convert these to "Genre: Action" and "Genre: Drama", both of which are Boolean functions with True and False as possible values. Note that relations can't be represented like this: they are discussed in Chapter 3.

We define an attribute as a Boolean function that compares a property to a value for an entity. An example attribute is "height(x) > 180cm". This attribute compares the "height" property on entity x with the value 180 centimeters, and outputs True or False. An attribute involving equality is "shoe size(x) = 10", where the "shoe size" property on entity x is compared with the shoe size of 10.

A class is defined as a set of entities. Classes are described by Boolean functions of attributes. Examples are:

• The class "height > 180cm" consists of all entities x for which the attribute "height(x) > 180cm" returns True.

• The class "(height > 180cm) ∩ (shoe size = 10)" consists of all entities x for which the conjunction of attributes "(height(x) > 180cm) and (shoe size(x) = 10)" returns True.

In the basic parameter sharing model, we start with a fixed set of classes, which we call the represented classes. An interesting problem is deciding which classes to use as the represented classes, which is explored in Chapter 5. Central to the basic parameter sharing model is the class hierarchy. The classes are interrelated, and the relationships between these classes form a class hierarchy.

C1 is a subclass of C2 if C1 ⊆ C2. C1 is a proper subclass of C2 if C1 ⊂ C2. The inverse relationship of a subclass is a superclass. Similarly, C1 is a proper superclass of C2 if C2 ⊂ C1.

The global class is the class containing all entities. We denote the global class using ⊤.

The relationships between classes can be depicted as a graphical structure, which we refer to as a class hierarchy. This is important for visualising the relationships between the classes.

In the context of a hierarchy, we can refer to subclasses and superclasses as descendant and ancestor classes respectively. Additionally, we introduce parent and child classes. For two classes A and B:

• Definition: Parent class: We say that A is a parent of B if B ⊂ A and there is no represented class C such that C ⊂ A and B ⊂ C.

• Definition: Child class: If class A is a parent class of class B, then B is a child class of A.

Since the parent and child class definitions build on the definitions of proper superclasses and subclasses, the class hierarchy is a directed acyclic graph (DAG).

Cities example

Consider an example about cities. Suppose we have three represented classes of cities: Western Canadian City, Canadian City, and City. Here, Western Canadian City ⊂ Canadian City ⊂ City. Thus we say that City is a parent of Canadian City, which is a parent of Western Canadian City.

For classes A, B and C, A and B are sibling classes if C is a parent class of both A and B. This is symmetric: A is a sibling class of B implies B is a sibling class of A.
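The parent-class definition can be checked directly from set containment among the represented classes. The Python sketch below (illustrative only: the function name and the city data are made up and are not from the thesis) applies the definition to the cities example.

    # A minimal sketch (not the thesis implementation) of the parent-class
    # definition: A is a parent of B iff B is a proper subset of A and no
    # represented class sits strictly between them.  Classes are modelled as
    # frozensets of entities.

    def parents(target, represented):
        candidates = [a for a in represented if target < a]   # proper superclasses of target
        return [a for a in candidates
                if not any(target < c < a for c in represented)]

    # Cities example: Western Canadian City ⊂ Canadian City ⊂ City.
    city = frozenset({"Vancouver", "Calgary", "Toronto", "Seattle"})
    canadian_city = frozenset({"Vancouver", "Calgary", "Toronto"})
    western_canadian_city = frozenset({"Vancouver", "Calgary"})
    represented = [city, canadian_city, western_canadian_city]

    print(parents(western_canadian_city, represented) == [canadian_city])  # True
    print(parents(canadian_city, represented) == [city])                   # True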
2.2 Predictions and Parameters

The basic parameter sharing model can be used for both regression and classification. Regression produces a real number as the prediction, while classification tries to predict a discrete outcome, which is a member of a fixed set of outcomes. Mohammad Mehr [15] applies the parameter sharing model to regression for water pollution.

A model consists of a set of classes and a parameter for each class. We write σ_j as the real numbered parameter for class C_j. To make a prediction for an entity, we add up the parameters of the classes that the entity is a member of.

For example, all predictions involving entities in the class "height > 180cm" use the parameter for "height > 180cm". An important property of the basic model is that parameter sharing, along with the use of regularisation, results in generalisation, in the sense that signal is being pushed up to more general classes in the hierarchy. We discuss regularisation in Section 2.4.

2.2.1 Regression

A dataset is a set of (x, y) pairs, where x is an entity and y is a value. We refer to the training set as D_TR and the test set as D_T. Let C_x be the set of classes that the entity x is a member of. We define ŷ(x) as the prediction for entity x by:

    ŷ(x) = ∑_{j ∈ C_x} σ_j    (2.1)

Since every entity is a member of the global class, the global class parameter is used in the summation of every prediction. Intuitively, one can think of the global class as the most general reference class.

2.2.2 Error function: Sum of squares

To evaluate the parameters for the represented classes, we use the sum of squares error. The sum of squares error for dataset D is:

    SSE(D) = ∑_{(x_i, y_i) ∈ D} ( y_i − ∑_{j ∈ C_{x_i}} σ_j )²    (2.2)

The sum of squares error function is a widely used error function in machine learning. The function penalises the square of the difference of a prediction from the actual label: hence the name "sum of squares". We want to minimise Equation 2.2 (specifically, to find the parameter values σ_j that result in this minimum) for the test set.
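To make Equations 2.1 and 2.2 concrete, here is a small sketch (illustrative names and toy numbers only; this is not the implementation used in this thesis): parameters are stored in a dictionary keyed by class, a prediction sums the parameters of the classes containing the entity, and the error sums the squared residuals.

    # A minimal sketch of Equations 2.1 and 2.2 (illustrative, not the thesis code).
    # "T" stands for the global class; the other class names are made up.

    def predict(entity_classes, sigma):
        # Equation 2.1: y_hat(x) = sum of sigma_j over the classes j that contain x.
        return sum(sigma[j] for j in entity_classes)

    def sum_of_squares(dataset, sigma):
        # Equation 2.2: dataset is a list of (classes_of_x, y) pairs.
        return sum((y - predict(cx, sigma)) ** 2 for cx, y in dataset)

    sigma = {"T": 3.0, "height>180cm": 0.5, "shoe size=10": -0.2}
    dataset = [({"T", "height>180cm"}, 3.4),
               ({"T", "shoe size=10"}, 2.9)]

    print(predict({"T", "height>180cm"}, sigma))   # 3.5
    print(sum_of_squares(dataset, sigma))          # about 0.02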
2.3 Discrete outcomes (Classification)

Sometimes we want to categorise the entity into one of several discrete outcomes. For example, when predicting whether users like movies, the outcome could be one of "dislikes" and "likes".

2.3.1 Binary outcomes

For the binary case, we have two outcomes, for example "likes" and "dislikes". For simplicity, we refer to them as "0" and "1". We want to predict the probability distribution of those outcomes. Since there are only two outcomes, it is sufficient to maintain the probability of just one of those outcomes, such as p_1, the probability of "likes". Since p_0 + p_1 = 1, we can obtain p_0 (the probability of "dislikes") by subtracting p_1 from 1.

Prediction with two discrete outcomes is also known as binary classification. For this, we take the standard approach (Jurafsky and Martin [8]) where we utilise a sigmoid function: for any real number k, Sigmoid(k) = 1 / (1 + e^(−k)), which "squeezes" the numerical value into a probability in the range [0, 1]. As before, we have an entity x, and C_x is the set of classes that the entity x is a member of.

Define a predictor ŷ(x), which is the probability that the model assigns the output "1", by:

    ŷ(x) = Sigmoid( ∑_{j ∈ C_x} σ_j )    (2.3)

2.3.2 Three or more outcomes

For more than two outcomes (specifically, M outcomes), a prediction consists of a vector of length M: [p_0, p_1, ..., p_{M−1}]. As before, ∑_{i=0}^{M−1} p_i = 1. A classification task with multiple outcomes is known as multi-class classification.

For two outcomes, it was possible to obtain p_0 through 1 − p_1, since there were only two outcomes involved. For M outcomes, we could obtain any one outcome's probability from the other M−1 outcome probabilities. We could also learn an output vector of length M, and normalise to get the probabilities.

A simple single variable linear regression with n weights produces a single numerical output using a length n weight vector. A multiple linear regression problem [4] outputs multiple numerical values, using a length n weight vector for each output value.

Our approach is similar to multiple linear regression [4]. For each class j, instead of a single parameter σ_j, we maintain a parameter vector which we call S_j, where S_j = [σ_{j0}, σ_{j1}, ..., σ_{j(M−1)}]. σ_{ji} is the numerical parameter associated with the ith outcome for class j. For example, if there are three prediction outcomes, the parameter for class A would be S_A = [σ_{A0}, σ_{A1}, σ_{A2}].

We define the Softmax function as follows [5]. Let X be a vector with M elements: X = [x_0, ..., x_{M−1}]. The softmax function takes in this vector, and produces another vector that is normalised.

    Softmax(X) = ( e^{x_0} / ∑_j e^{x_j},  e^{x_1} / ∑_j e^{x_j},  ...,  e^{x_{M−1}} / ∑_j e^{x_j} )    (2.4)

The Softmax is applied to the input vector so that the vector becomes normalised: the components sum up to 1. It is useful for converting regression outputs into valid probability distributions.

Utilising the Softmax, the prediction function for M outcomes is as follows. We effectively have several simple linear regressions, one for each outcome.

    [p_0, p_1, ..., p_{M−1}] = Softmax( [ ∑_{j: x ∈ C_j} σ_{j0},  ∑_{j: x ∈ C_j} σ_{j1},  ...,  ∑_{j: x ∈ C_j} σ_{j(M−1)} ] )    (2.5)

2.3.3 Error function: Negative log likelihood

For discrete prediction outcomes, we use the negative log likelihood loss. We show the equation for the binary case here: the three-outcome version is a simple extension of the two-outcome version. The equation we use is from Jurafsky & Martin [8].

In Equation 2.6, the outer summation is over all (x_i, y_i) datapoints in the dataset D, where y_i is the classification output for the ith datapoint. Since this is for two-outcome discrete prediction (binary classification), y_i is either 0 or 1.

For any dataset D:

    Loss(D) = − (1 / |D|) ∑_{(x_i, y_i) ∈ D} [ y_i log ŷ(x_i) + (1 − y_i) log(1 − ŷ(x_i)) ]    (2.6)

Note that we divide by the number of datapoints to obtain a normalised value. We want to minimise Equation 2.6 on the test dataset D_T.
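The sketch below (illustrative names and toy numbers, not the thesis code) mirrors Equations 2.3–2.6 with numpy: the sigmoid and softmax are applied to summed class parameters, and the binary loss is the normalised negative log likelihood.

    # A minimal numpy sketch of Equations 2.3-2.6 (illustrative, not the thesis code).
    import numpy as np

    def sigmoid(k):
        return 1.0 / (1.0 + np.exp(-k))

    def softmax(x):
        e = np.exp(x - np.max(x))   # subtract the max for numerical stability
        return e / e.sum()

    def predict_binary(entity_classes, sigma):
        # Equation 2.3: probability of outcome "1".
        return sigmoid(sum(sigma[j] for j in entity_classes))

    def predict_multiclass(entity_classes, S):
        # Equation 2.5: S[j] is the length-M parameter vector for class j.
        return softmax(sum(S[j] for j in entity_classes))

    def negative_log_likelihood(dataset, sigma):
        # Equation 2.6: dataset is a list of (classes_of_x, y) pairs with y in {0, 1}.
        per_point = [y * np.log(predict_binary(cx, sigma)) +
                     (1 - y) * np.log(1 - predict_binary(cx, sigma))
                     for cx, y in dataset]
        return -np.mean(per_point)

    S = {"T": np.zeros(3), "Administrator": np.array([0.2, -0.1, 0.3])}
    print(predict_multiclass({"T", "Administrator"}, S))   # three probabilities summing to 1

    sigma = {"T": 0.0, "Administrator": 0.4}
    print(negative_log_likelihood([({"T", "Administrator"}, 1), ({"T"}, 0)], sigma))  # about 0.60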
2.4 Regularisation

Regularisation is a technique commonly employed in machine learning to combat overfitting. Overfitting occurs when a model over-learns on the training data, resulting in a poor ability to generalise to the test set. A model that is "memorizing" training data is an extreme example of an overfitting model. For example, a linear model that considers each datapoint to be its own singleton class could then have a weight for each such class.

Implementation-wise, recall that we sought to obtain parameters that minimised the error function on the test set. We did so by minimising the same error function but on the training set, to obtain parameters that could be generalised to the test set. Regularisation adds an extra term to the training error function. A popular regulariser for simple linear models is the L2 regulariser, also known as ridge regression when combined with a linear model. We use L2 regularisation in the parameter sharing model. For more details on L2 regularisation, see James et al. [7].

In his thesis, Mohammad Mehr [15] states and proves several theorems regarding the effects of L2 regularisation on the simple linear model. Here, we show how to add an L2 regulariser to the sum of squares and negative log likelihood functions.

Adding an L2 regularisation term to Equation 2.2, we get (let D_TR represent the training set):

    Regularised SSE(D_TR) = ∑_{(x_i, y_i) ∈ D_TR} ( y_i − ∑_{j ∈ C_{x_i}} σ_j )²  +  λ ∑_{j ∈ C\{⊤}} σ_j²    (2.7)

This function is trained (minimised) on the training set, to enable generalisation to the test set. The test set error function is still Equation 2.2. Here λ is a regularisation parameter, which controls how large the regularisation effect is. The regularisation summation is over all classes except the global class. We denote this by the set of classes C\{⊤}. Since we are seeking to minimise Equation 2.7, the extra term is a penalty on the size of the class parameters.

To minimise the regularised sum of squares, we take the derivative with respect to the parameters σ_j. Notice that the error function is quadratic. Since the derivative of a quadratic is linear, the optimisation problem that arises from minimising Equation 2.7 results in finding the zeros of a linear system, enabling one to use fast solvers like QR decompositions or other numerical linear algebra tools. For an overview of linear system solvers, see Wright and Nocedal [17].

For discrete outputs, we add the same term to the loss function. We show the binary output case here. All variables are the same as in Equation 2.6.

    Regularised Loss(D_TR) = [ − (1 / |D_TR|) ∑_{(x_i, y_i) ∈ D_TR} ( y_i log ŷ(x_i) + (1 − y_i) log(1 − ŷ(x_i)) ) ]  +  λ ∑_{j ∈ C\{⊤}} σ_j²    (2.8)

Just like in regression, the regularisation acts as a penalty term that prevents overfitting, and reduces the size of the parameters.

In the context of the basic parameter sharing model, we see that by virtue of the class hierarchy and parameter sharing, signal gets pushed up the hierarchy to more general classes if we use L2 regularisation. To see how generalisation in the parameter sharing model works, consider the following example.

Suppose there is a parent class A with three disjoint child classes: B1, B2, B3. Since the parent class is a superclass of each child class, an entity that is a member of B1, B2, or B3 will also be a member of A. Since prediction involves adding the parameters associated with each class, a prediction sum that includes σ_B1, σ_B2 or σ_B3 will also include σ_A.

If we decrease a child class parameter by an arbitrary constant δ, and increase the parent class parameter by δ, the prediction for an entity that is a member of both classes remains unchanged. For example, if we decrease σ_B1 by δ and increase σ_A by δ, the sum of the two will still be σ_B1 + σ_A. Intuitively, L2 regularisation forces information to be pushed towards the superclasses.

Mohammad Mehr's thesis [15] introduces several important theorems regarding the effects of L2 regularisation. In particular, Theorem 2.1 of Mohammad Mehr's thesis [15] shows that in a fully trained L2 regularised parameter sharing model, for any parent class with children, the parent's parameter is equal to the sum of the child parameters. This is true for all parent classes with children apart from the global class ⊤. We don't regularise the global parameter ⊤.
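Since minimising Equation 2.7 reduces to a linear system, a short sketch can show the whole fit. The example below is illustrative only: the class layout and numbers are made up, and this is not the thesis implementation. It encodes each entity as a 0/1 indicator row over the represented classes and solves the regularised normal equations, leaving the global class unpenalised.

    # A minimal sketch of fitting Equation 2.7 (illustrative, not the thesis code).
    import numpy as np

    def fit_regularised_sse(X, y, lam):
        # Minimise ||y - X sigma||^2 + lam * ||sigma||^2, with no penalty on the
        # global class, which is assumed to be column 0 of the indicator matrix X.
        penalty = lam * np.eye(X.shape[1])
        penalty[0, 0] = 0.0
        return np.linalg.solve(X.T @ X + penalty, X.T @ y)

    # Columns: [global, Administrator, Action, Administrator-and-Action]
    X = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [1, 1, 1, 1]], dtype=float)
    y = np.array([4.0, 3.0, 5.0])
    print(fit_regularised_sse(X, y, lam=1.0))

With λ > 0, the penalty keeps the system solvable even when some intersection classes have few or no observations, which is the situation the following chapters focus on.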
In this chapter, we introduced the basic parameter sharing model, which performs regression and classification on simple entities by adding parameters that are assigned to classes. In the next chapter, we show how the model works with pairs of entities, because pairs are also entities. We also introduce our test datasets.

Chapter 3
Relational Parameter Sharing Model

The previous chapter introduced the basic parameter sharing model for entities. The parameter sharing model can also be applied to tuples of entities, since a tuple of entities is also an entity. The first part of this chapter shows how the parameter sharing model works with tuples of entities. We then introduce the test datasets, MovieLens [6] and Fashion [14].

3.1 Tuple Classes, Functions & Hierarchies

3.1.1 Tuple Classes and functions on tuples

Definition: Tuple class: A tuple class is a tuple of classes. A tuple of entities (E1, E2, ...) is a member of a tuple class (C1, C2, ...) if entity Ei is a member of class Ci for all i.

We start with a fixed set of tuple classes.

A function on tuples is defined as a function that takes in a tuple of entities, and outputs a value. This extends the definition of relations to have a more general range, where a relation is a function from tuples to Booleans, or equivalently, a set of tuples [3]. Both of our test datasets, MovieLens [6] and Fashion [14], involve functions from pairs to discrete values.

• For MovieLens [6], for a user entity U and a movie entity M, we are interested in the function likes(U, M), where likes(U, M) = True means that the rating user U gives movie M is greater than 3, and likes(U, M) = False means that user U gives movie M a rating of 3 or less.

• For Fashion [14], for a user entity U and a clothing item entity I, we are interested in the function Fit(U, I), where Fit(U, I) gives one of three possible outcomes depending on how well clothing item I fits user U: "Small", "Fit", or "Large".

An example tuple class is (Male, Action): consisting of pairs of user entities and movie entities, where the user is Male and the movie is an Action film. We can write this in set notation as:

    (Male, Action) = {(U, M) : U ∈ Male ∧ M ∈ Action}    (3.1)

For instance, let Bob, Brian, and John be three users that are Male, and "Superman" and "Avengers" be two examples of action movies. Then, (Bob, Avengers), (Brian, Superman), (Bob, Superman), (Brian, Avengers), (John, Avengers), (John, Superman) are all members of the pair class (Male, Action).

Note that a tuple of classes (sets) is more restrictive than a set of tuples. For example, a pair of classes cannot represent the set of ratings where the user is male or the movie is an action film.
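As a small illustration of pair-class membership (Equation 3.1), the sketch below uses made-up users and movies (illustrative only; none of this is thesis code, and "T" stands for the global class ⊤) and lists the represented pair classes that contain a given (user, movie) pair.

    # A minimal sketch of tuple-class membership: a (user, movie) pair belongs
    # to a pair class (C1, C2) exactly when the user is in C1 and the movie is in C2.
    from itertools import product

    user_classes = {"T": {"Bob", "Brian", "John", "Alice"},
                    "Male": {"Bob", "Brian", "John"}}
    movie_classes = {"T": {"Superman", "Avengers", "Bambi"},
                     "Action": {"Superman", "Avengers"}}

    def member(pair, pair_class):
        # pair = (user, movie); pair_class = (user class name, movie class name).
        u, m = pair
        cu, cm = pair_class
        return u in user_classes[cu] and m in movie_classes[cm]

    def pair_classes_of(pair):
        # All represented pair classes that contain the pair.
        return [(cu, cm) for cu, cm in product(user_classes, movie_classes)
                if member(pair, (cu, cm))]

    print(pair_classes_of(("Bob", "Superman")))
    # [('T', 'T'), ('T', 'Action'), ('Male', 'T'), ('Male', 'Action')]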
3.1.2 Hierarchies

The subclass and superclass properties defined for classes of entities also hold for classes of entity tuples. Similarly, the definitions for parent, child and sibling classes are the same. The tuple (A1, A2, ...) is a subclass of (B1, B2, ...) iff Ai is a subclass of Bi for all i.

All entities are members of the global class ⊤. For tuple classes, the global class is the tuple (⊤, ⊤, ..., ⊤). For example, all the (User, Movie) pairs (2-tuples) are members of the global class (⊤, ⊤), since by definition, all Users are in ⊤ and all Movies are in ⊤. In subsequent sections, we abbreviate tuples involving ⊤. For example, (Administrator, ⊤) is abbreviated to Administrator.

We show the lattice that can be constructed using the hierarchy formed from the relationships between the tuple classes in an example.

Consider an example of male administrators rating animation movies. In the notation of tuple classes, (User, Movie) tuples with these properties would be in the following pair:

    (Male ∩ Administrator, Animation)    (3.2)

A (user, movie) pair in this class would have the user be a member of Male ∩ Administrator, and the movie be a member of Animation.

The pair class described in Equation 3.2 is a child class of:

• (Male ∩ Administrator, ⊤)
• (Male, Animation)
• (Administrator, Animation)

(Male ∩ Administrator, ⊤) is a child class of (Male, ⊤) and (Administrator, ⊤). (Male, Animation) is a child class of (Male, ⊤) and (⊤, Animation). (Administrator, Animation) is a child class of (Administrator, ⊤) and (⊤, Animation).

The classes (Male, ⊤), (Administrator, ⊤), and (⊤, Animation) are all child classes of (⊤, ⊤).

We illustrate the hierarchical relationships between these classes in Figure 3.1.

Figure 3.1: Example Hierarchy involving the pair class (Male ∩ Administrator, Animation), used in MovieLens [6]. We can see that the more general classes are parent classes.

3.2 Datasets

We use the MovieLens 100K dataset as the binary outputs test set, and the Fashion datasets as the multi-output test set.

3.2.1 MovieLens

MovieLens [6] consists of several movie prediction datasets where users assign rating scores to movies. The datasets range in magnitude from 100,000 datapoints to 25 million datapoints. We use the MovieLens 100K and 1M datasets, which consist of 100,000 and 1 million datapoints respectively. We chose MovieLens 100K and 1M because these datasets contain information about the properties of users, whereas the 25 million sized dataset does not.

In MovieLens 100K and 1M, the data is provided in raw form as (User, Movie, Rating, Timestamp) quadruples. Properties of Users and Movies are also provided: Age, Gender, Occupation and Zip code for users, and movie title, release date, whether the film was released for video, IMDB URL, and Genre information for Movies (IMDB is a website containing information about movies). Table 3.1 gives example quadruples, and Tables 3.2 and 3.3 give examples of user and movie properties.

The rating score is a number in the set {1, 2, 3, 4, 5}, with 1 being the lowest score and 5 being the highest score. MovieLens is commonly used for testing recommendation systems, a class of models that try to optimally recommend items to users. As mentioned in the introduction, this thesis does not deal with collaborative filtering: instead, it introduces methods that can be used in conjunction with, and as a prior for, collaborative filtering.

Since there are 5 rating scores, we convert them into two outcomes: we denote scores of > 3 as "likes" and scores of ≤ 3 as "dislikes". We do this because we want to use MovieLens as a test set for two outcome (binary classification) prediction.

The timestamp denotes the time at which the rating was made. Since in real life, customers buy products and watch movies in real time, models that want to take this into account could utilise the timestamp. In our experiments, we don't utilise the timestamp.

In Table 3.1, we show some example data, randomly selected from MovieLens 100K [6].

    User ID    Item ID    Rating    Timestamp
    275        473        3         880313679
    896        73         3         887159368
    629        276        5         880116887
    394        658        3         880889159
    856        310        3         891489217
    892        238        4         886608296
    279        166        4         879572893

Table 3.1: Example MovieLens 100K [6] data.

In Table 3.2, we show a randomly selected user from the user information for MovieLens 100K.

    Property      Value
    User ID       6
    Age           42
    Gender        M
    Occupation    Executive
    Zip code      98101

Table 3.2: Example User from MovieLens 100K [6] user information.
In Table 3.3, we show a randomly selected movie from the movie information for MovieLens 100K. The rows "Unknown" through to "Western" indicate genre: 1 indicates that the movie is of the genre, and 0 indicates that it is not. The URL property is not shown here, since it provides an IMDB link and is not used in this thesis.

    Property         Value
    Movie ID         899
    Movie Title      Winter Guest, The (1997)
    Release date     01-Jan-1997
    Video release    NaN
    URL              -
    Unknown          0
    Action           0
    Adventure        0
    Animation        0
    Children's       0
    Comedy           0
    Crime            0
    Documentary      0
    Drama            1
    Fantasy          0
    Film-Noir        0
    Horror           0
    Musical          0
    Mystery          0
    Romance          0
    Sci-Fi           0
    Thriller         0
    War              0
    Western          0

Table 3.3: Example movie from MovieLens 100K [6] movie information.

3.2.2 Fashion datasets

The "fashion datasets" consist of two datasets: the "ModCloth" and "Renttherunway" datasets. Both were compiled by a team from UCSD (University of California, San Diego), and involve predicting the fit of articles of clothing, based on information about the user (for example, height and shoe size) and the item of clothing. Three outcomes are possible: {Small, Fit, Large}. The fashion datasets are discussed in more detail in the creators' original paper [14]. "ModCloth" contains around 82,000 datapoints, while "Renttherunway" contains around 192,000 datapoints.

In raw form, each fashion dataset is given as a JSON file with a dictionary for each datapoint. Each dictionary contains property value pairs. The first entry in the dictionary is "fit", the output label. In both fashion datasets, properties are sometimes not reported for certain datapoints. For example, in the "ModCloth" dataset, of the 82790 datapoints, only 27914 of them have Shoe size specified.

In both "ModCloth" and "Renttherunway", "fit" is the most common target outcome. In "ModCloth", 68.6% of datapoints have "fit" as the label, and 73.7% of datapoints in "Renttherunway" have "fit" as the label.

In Table 3.4 we show an example datapoint from the "Renttherunway" dataset. The review text property is not shown due to its length: it consists of a review paragraph that isn't used in this thesis.

    Property          Value
    Fit               Fit
    User ID           420272
    Bust size         34d
    Item ID           2260466
    Weight            137lbs
    Rating            10
    Rented for        Vacation
    Review text       -
    Body type         hourglass
    Review summary    "So many compliments!"
    Category          romper
    Height            5'8
    Size              14
    Age               28
    Review date       April 20 2016

Table 3.4: Example datapoint from "Renttherunway" [14].

The "user id" indicates the user, and "item id" indicates the item entity. As part of our pre-processing, we identify which properties belong to the item and which properties belong to the user. There are a few properties that don't belong to either, and we put them under "other properties". The properties are as follows:

• User Properties: User ID, Age, Body type, Bust size, Height, Weight.
• Item Properties: Item ID, Category, Size.
• Other Properties: Rating, Rented for, Review date, Review summary, Review text.

Similarly, we show in Table 3.5 an example datapoint from "ModCloth". The username property is omitted since we don't use it in this thesis.

    Property     Value
    Item ID      123373
    Waist        27
    Size         11
    Cup size     c
    Hips         41
    Bra size     36
    Category     New
    Length       Just right
    Height       5 ft 4 in
    User name    -
    Fit          small
    User ID      162012

Table 3.5: Example datapoint from "ModCloth" [14].

Just as in "Renttherunway", we pre-process by extracting the user, item, and "other" properties. The properties are as follows:

• User Properties: User ID, User name, Bust, Bra size, Cup size, Height, Shoe size, Shoe width, Waist.
• Item Properties: Category, Length, Quality.
• Other Properties: Review summary, Review text.

Notice that the example datapoint in Table 3.5 has several properties missing, for example "shoe size" and "shoe width".
In this chapter, we described how, since tuples of entities are also entities, the basic parameter sharing model applies. Learning and regularisation are the same as in the basic parameter sharing model. In the next chapter, we introduce the bounded ancestor method.

Chapter 4
Prior and Posterior Prediction for Classes

This chapter investigates the first and second hypotheses from the introduction.

• For the first hypothesis, we introduce and explore the bounded ancestor method for combining priors with observed data. We then introduce three variants for combining multiple parents, and we compare these methods on the fashion datasets. The results of the comparison are shown in Chapter 5.

• For the second hypothesis, we conduct a grid-search experiment on examples of multiple parents from all three datasets, to determine whether a weighted combination of the three variants produces a better prior than an individual variant.

4.1 Background

When we observe few or no datapoints for a class, we want to obtain a prior for the class. If the known information comes from related or more general classes, we want to use information from these general classes to create an informed prior. We then want to combine the prior with information from observed data to get a posterior for a class. Having a prior is important to avoid overfitting when data is scarce.

In Bayesian statistics, the prior probability for a random variable represents the prior knowledge about it. The prior probability is combined with information from the observed data through multiplication with the likelihood function. If we haven't observed any data about the random variables we are interested in, we use the prior probability for inference. As the amount of observed data increases, the data starts to dominate the prior. For an introduction to Bayesian statistics in the context of machine learning, see Murphy [16].

When dealing with classes of entities or classes of entity pairs, sometimes all we know about an entity is that it is in a particular class. It would be very helpful to have a prior for that class. If there are no observations for that class, the prior is all we have. As the amount of data for the class increases, the effect of the data should take precedence.

Our two test datasets, MovieLens 100K [6] and Fashion [14], both involve discrete target outcomes. We are interested in a probability distribution over these outcomes. A sensible choice of distribution to use as the prior for the probability of the target outcomes is the Dirichlet distribution [1].

    p(µ | α) ∝ ∏_{i=1}^{W} µ_i^(α_i − 1)    (4.1)

Here, the µ_i are random variables that have real values between 0 and 1, and ∑_{i=1}^{W} µ_i = 1. The µ_i random variables are mutually exclusive, representing a set of probabilities of W outcomes of a single random variable Y. For each outcome i of the random variable Y, µ_i is a random variable representing the probability of that outcome.

The α_i are the parameters of the Dirichlet distribution. There are two ways of parameterising a Dirichlet distribution.

• A positive real number α_i assigned to each outcome i. For W target outcomes we write [α_1, ..., α_W].

• A probability p_i associated with each outcome i, together with a positive number N. For W target outcomes, we write [p_1, ..., p_W] and N.

To map between the two parameterisations, we can do the following [10]:

• To get to the second parameterisation from the first, [p_1, ..., p_W] = [α_1 / ∑_{i=1}^{W} α_i, ..., α_W / ∑_{i=1}^{W} α_i], and the positive number is N = ∑_{i=1}^{W} α_i.

• To get to the first parameterisation from the second, α_i = p_i · N for all i.

In this thesis, we use the second parameterisation for Dirichlet distributions. We refer to the real numbers [α_1, ..., α_W] and N as pseudocounts when dealing with Dirichlet distributions.
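The mapping between the two parameterisations is simple arithmetic; the sketch below (illustrative helper names only, not thesis code) converts in both directions.

    # A minimal sketch of the two Dirichlet parameterisations described above.

    def to_probability_form(alphas):
        # [alpha_1 ... alpha_W]  ->  ([p_1 ... p_W], N)
        n = sum(alphas)
        return [a / n for a in alphas], n

    def to_count_form(probs, n):
        # ([p_1 ... p_W], N)  ->  [alpha_1 ... alpha_W]
        return [p * n for p in probs]

    alphas = [3.0, 7.0]                 # pseudocounts for two outcomes
    probs, n = to_probability_form(alphas)
    print(probs, n)                     # [0.3, 0.7] 10.0
    print(to_count_form(probs, n))      # [3.0, 7.0]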
4.2 Bounded ancestor method

We use the bounded ancestor method for binary and multi-target classification. The bounded ancestor method outputs a probability distribution over the target outcomes. For instance, in Chapter 3, we said that each MovieLens 100K [6] entity pair is also an entity, and we defined a function that mapped each such entity to a target outcome: "dislikes" or "likes". The bounded ancestor method gives a probability distribution over ["dislikes", "likes"].

The bounded ancestor method combines an informed prior with observed data about entities in particular classes to get posterior distributions over the target outcomes for those classes. Both the prior and posterior distributions are in the form of Dirichlet distributions.

4.2.1 Notation

For a class A, the signal from below is information from observed data about entities in class A.

• For a target outcome v, p↑A(v) is the probability that an entity in class A has target outcome v, for the signal from below. That is, p↑A is a function that takes in a target outcome and outputs the probability of that outcome.

• N↑A denotes the total number of entities in class A.

The signal from above is information from the prior for class A.

• For a target outcome v, p↓A(v) is the probability that an entity in class A has target outcome v, for the signal from above. That is, p↓A is a function that takes in a target outcome and outputs the probability of that outcome.

• N↓A denotes the total number of entities assumed in the signal from above for class A.

To get the posterior distribution for class A, we mix the signals from above (p↓A, N↓A) and below (p↑A, N↑A). The mixing is a standard mixing of two Dirichlet distributions.

• Let pA be the posterior probability distribution for class A. For target outcome v,

    pA(v) = ( p↓A(v) · N↓A + p↑A(v) · N↑A ) / ( N↓A + N↑A )    (4.2)

• Let NA be the combination of the number of entities in class A and the number of entities assumed from the signal from above. For class A, NA = N↓A + N↑A.
4.3 Bounded ancestor method for trees

We first illustrate how the bounded ancestor method works for trees. In a tree hierarchy, each class (except the global class ⊤) has exactly one parent.

Suppose we are interested in the distribution over target outcomes for a class A. Since we are in a tree, class A has a single parent, which we call AP. Class A can also have a number of sibling nodes, which share the same parent AP. This structure is illustrated in Figure 4.1. We first need to obtain the signals from above and below for class A.

Figure 4.1: We are interested in class A, which has a single parent AP, which has parent AQ.

The signal from below is a pair of two items, p↑A and N↑A, both of which can be obtained directly from the members of class A. We obtain them by counting all the members of class A. The total number of members of A is N↑A. p↑A maps from the set of target outcomes to the proportion of members of A with those outcomes.

The signal from above is the prior. This is also a pair of two items: p↓A and N↓A. We get this prior by looking at the posterior for the class of entities that are in A's parent but not in A. These are the entities in the set AP−A, consisting of all the entities not in A but in A's parent. For a tree structured class hierarchy, we write:

    p↓A = p_{AP−A}    (4.3)

    N↓A = k    (4.4)

k is a numerical constant called the bounding constant. In Equations 4.3 and 4.4, p_{AP−A} and N_{AP−A} represent the posterior for the class AP−A. Class AP−A consists of the entities that are in A's parent but not in class A. As the amount of observed data for a class increases, we want to use the observed data and not rely too much on the prior. We try to achieve this by limiting the effect of the prior, by bounding N↓A by k. This bounding is where the name "bounded ancestor model" comes from.

The prior for the global class ⊤ always consists of a uniform probability distribution along with k pseudocounts. For W target outcomes, the prior would be p↓⊤ = [1/W, ..., 1/W] and N↓⊤ = k. For all classes A, we use a fixed pseudocount of k for the prior of A.

4.3.1 Mixing signals from above and below

Once we have the signal from above and the signal from below for a class, we "mix" these two signals to get a posterior distribution for that class.

At this point, we've computed the prior p↓A, N↓A, and the signal from below p↑A and N↑A. The posterior distribution consists of a direct mix of the signals from above and below. We use Equation 4.2, which is a standard mixing of Dirichlet distributions. For a target outcome v:

    pA(v) = ( p↓A(v) · N↓A + p↑A(v) · N↑A ) / ( N↓A + N↑A )

The posterior count is obtained by adding the pseudocounts from above and the number of observed counts.

    NA = N↓A + N↑A    (4.5)

Before proceeding, there are two things worth mentioning about the method.

• Since the global class has no siblings or parent, the "downward" count for the global class is a user-defined prior. We choose a uniform prior consisting of a uniform probability distribution along with a pseudocount of k.

• We conjecture that the bounding constant k should be chosen to be relatively small, on the order of 0-100. This is because in our test datasets, such as MovieLens 100K [6], there are examples of classes with very few observations (less than 100). Since we want the data to be emphasized over the prior, k should be limited. If k is too large, the prior instead pulls the prediction towards information from other classes. We test this conjecture in Chapter 5.

4.3.2 Parameters

By mixing the signals from above and below, we obtained the posterior for a class A, which we denoted using pA and NA.

From the posterior for a class A, we can get the parameter for the class. For a class A, let σA be the parameter for the class. σA is a function that takes in a target outcome, and outputs a real number.

To obtain the parameter σA for a class A, we apply an inverse softmax function to A's posterior probability pA, and then subtract the parameters of all the superclasses of A.
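Putting Sections 4.2.1–4.3.1 together for the tree case, the sketch below (illustrative names and toy numbers; not the thesis implementation) builds the prior for A from the posterior of AP−A, bounds its count at k, and mixes it with A's signal from below using Equation 4.2.

    # A minimal sketch of the tree case (illustrative, not the thesis code).

    def mix(p_down, n_down, p_up, n_up):
        # Equation 4.2: Dirichlet mixing of the signals from above and below.
        total = n_down + n_up
        p = {v: (p_down[v] * n_down + p_up[v] * n_up) / total for v in p_up}
        return p, total

    def prior_from_parent(posterior_parent_minus_a, k):
        # Equations 4.3-4.4: keep the AP-A distribution, bound the count at k.
        p_parent, _ = posterior_parent_minus_a
        return p_parent, k

    # Toy numbers: the posterior of AP-A, and only 12 observations in A itself.
    posterior_ap_minus_a = ({"dislikes": 0.45, "likes": 0.55}, 5000)
    p_down, n_down = prior_from_parent(posterior_ap_minus_a, k=20)
    p_up, n_up = {"dislikes": 0.25, "likes": 0.75}, 12

    print(mix(p_down, n_down, p_up, n_up))
    # ({'dislikes': 0.375, 'likes': 0.625}, 32) -- the prior still carries most of
    # the weight at 12 observations; the data takes over as n_up grows past k.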
For two parents, wewant to use information from the sets AP1−A and AP2−A in some way.• AP1−A is the set of entities that are in A’s first parent but not in A.• AP2−A is the set of entities that are in A’s second parent but not in A.Note that the sets AP1 − A and AP2 − A don’t have any common entities. This is because A is theintersection of AP1 and AP2. For an entity to be in both AP1−A and AP2−A, it would have to be in theintersection of AP1 and AP2, and not in A, which is impossible.Figure 4.2: We are interested in class A, which has two parents AP1 and AP2We now present three variants that combine information from multiple parents to get the signal fromabove. We show how they work with M parents. For two parents, we can set M = 2.244.4.2 Variant 1: Sibling dataIn this variant, we use data from entities that are in parents of A but not in A. We call these siblingsbecause they typically come from siblings of A. If any entities are in A but not in A’s children, we canconstruct a dummy class of those entities.If A has M parents, AP1...APM, then this variant combines information from the posteriors of AP1−A,AP2−A, up to APM −A. Just like in the tree case, when working out the posteriors of APi−A for allparents i, the signal from below excludes the entities in A, while the prior for APi−A is the prior for APi,since excluding entities in APi also excludes entities in A.p↓A would be, for a target outcome v:p↓A(v) =∑Mi=1 (pAPi−A(v)NAPi−A)∑Mi=1NAPi−A(4.6)Like in the tree structured case, because we use a uniform prior with k pseudocounts for the global class,the total number of entities in A’s prior is just the bounding constant k.N↓A = k (4.7)If the number of observations for entities in one parent class greatly exceeds the number of observationsfor entities in the other parents, this variant prioritises the impact of information from the parent withmore data, because the total pseudocount involves adding the number of counts from each parent class.4.4.3 Variant 2: Class mixingIn this variant, we average the signal from each parent. The posteriors of AP1−A...APM−A are averaged.For a class A with M parents AP1...APM, for p↓A, for a target outcome v, we have:p↓A(v) =1MM∑i=1pAPi−A(v) (4.8)The number of entities in A’s prior is k:N↓A = k (4.9)This variant could be potentially beneficial when we do not observe many observations for entities froma particular parent, but we think that (through prior domain knowledge or otherwise) the parent shouldstill be strongly weighted.254.4.4 Variant 3: Parameter addingIn parameter adding, we add the parameters of all the superclasses of the class of interest to get thesignal from above.Suppose class A has M parents AP1 to APM. Let the set of superclasses of class A be represented by EA.Then, to obtain p↓A for a target outcome v:p↓A(v) = Softmax( ∑j∈EAσ j(v)) (4.10)When obtaining the parameter for class i, we apply the inverse of the softmax function to i’s posteriorprobability pi, and then we subtract the parameters of all superclasses of i. Then, to get a prior proba-bility distribution for class A by adding the parameters of A’s superclasses, we add the parameters andapply a Softmax function to the sum.To obtain the parameters of A’s superclasses to work out A’s prior, we would have to work out theposteriors of A’s superclasses. To avoid double counting, we subtract the observations of entities in Afrom the signals from below of all of A’s superclasses. The signals from above of A’s superclasses arethe priors for those classes. 
Since we always exclude data from a class when computing the class’ prior,the priors for A’s superclasses already exclude information from A because entities in A are also in A’ssuperclasses, so the priors do not require recomputing.For all of A’s superclasses, we recompute their posteriors by mixing the signals from below that excludeA’s data, with the priors for those classes. After this mixing, for each of A’s superclasses we recom-pute their parameter by recursively subtracting the parameters of their superclasses. The summation inEquation 4.10 then adds these newly computed parameters and applies the softmax function.Shown below is brief pseudocode for applying variant 3. When iterating through all the classes in ahierarchy, we have to do so in a topological ordering, where ancestors come before descendants. That26is, to compute the prior for class A, we have to have already computed the priors for A’s superclasses.Algorithm 1: Using variant 3 to compute the prior for all classesC = set of all classes in the hierarchy;// priors stores the priors p, N for each class in C ;priors = Empty hashset;for c in C (topological ordering) doSc = set of superclasses of c;// temp parameters stores the parameters of c’s superclasses to use in obtaining c’s prior;temp parameters = Empty hashset ;for s in Sc (topological ordering) doSs = set of superclasses of s ;//Subtract entities in c from s;p↑s−c =p↑sN↑s−p↑cN↑cN↑s−N↑c;N↑s−c = N↑s −N↑c ;// Access the prior for s ;p↑s = priors[s], N↑s = k ;Mix prior and posterior to get Posterior(s-c) ;// Recompute parameter for s by subtracting s’s superclass parameters, stored intemp parameters, from posterior for s ;σs−c = InvSoftmax(Posterior(s-c)) - ∑i∈Ss temp parameters[i]Store parameter for s in temp parameterspriors[c] = Softmax(∑i∈Sc(temp parameters[i])) ;We use a uniform probability distribution and a pseudocount of k for the prior of the global class, so thenumber of entities for A’s prior is k.N↓A = k (4.11)Intuitively, we can consider variants 1 and 2 to be interpolating between parents, while variant 3 involvesadding the information from the parents, since variant 3 adds the parameters of all superclasses of theclass of interest A.4.4.5 Numerical exampleWe show how the three variants work through a numerical example from MovieLens 100K [6]. Our ex-ample follows the hierarchical structure from Figure 4.2 (page 24), with two target outcomes, “dislikes”and “likes”. We are interested in predicting the target outcome for entities in the class (Administrator,Musical), which has two parents: Administrator and Musical. Let A denote the class (Administrator,Musical), and let AP1 denote Administrator and AP2 denote Musical.27This example will show, given observation data for classes A, AP1−A, AP2−A, and >, how to obtainthe signal from above for class A.From MovieLens 100K [6], suppose we observe the following data about classes A, AP1, AP2 and >.These are observed counts, provided as a Dirichlet parameterisation with the probability and total count.Class i N↑i [p↑i (0), p↑i (1)]> 100000 [0.446,0.554]>−A 99646 [0.450,0.550]AP1 7479 [0.412,0.588]AP2 4954 [0.460,0.540]AP1−A 7125 [0.411,0.589]AP2−A 4600 [0.462,0.538]A (Ground truth) 354 [0.432,0.568]Table 4.1: Signal from below for the classes in our numerical example. Class A is (Administrator,Musical), Class AP1 is Administrator, and Class AP2 is Musical.From the data in Table 4.1, we apply the three variants as follows. 
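As a companion to the walkthrough that follows, the sketch below computes the variant 1 and variant 2 priors directly from the counts in Table 4.1 (variant 3 is omitted since it requires the softmax parameterisation). The helper names are ours, and the computation assumes the bounding constant k = 5 used in the walkthrough.

k = 5

def mix(p_down, n_down, p_up, n_up):
    # Standard Dirichlet mixing of the signals from above and below (Equation 4.2).
    n = n_down + n_up
    return [(pd * n_down + pu * n_up) / n for pd, pu in zip(p_down, p_up)], n

# Signals from below as (count, [P(dislikes), P(likes)]), copied from Table 4.1.
top         = (100000, [0.446, 0.554])   # the global class
ap1         = (7479,   [0.412, 0.588])   # Administrator
ap2         = (4954,   [0.460, 0.540])   # Musical
ap1_minus_a = (7125,   [0.411, 0.589])
ap2_minus_a = (4600,   [0.462, 0.538])

def minus(big, small):
    # Counts for a set difference, e.g. the signal from below for (global class) - AP1.
    n = big[0] - small[0]
    return n, [(pb * big[0] - ps * small[0]) / n for pb, ps in zip(big[1], small[1])]

def prior_from_parent(parent):
    # Prior for a parent APi: the posterior of (global class) - APi,
    # using a uniform prior with k pseudocounts for the global class.
    n, p = minus(top, parent)
    post, _ = mix([0.5, 0.5], k, p, n)
    return post, k

def sibling_posterior(parent, parent_minus_a):
    # Posterior for APi - A: mix its observed counts with APi's prior.
    p_down, n_down = prior_from_parent(parent)
    return mix(p_down, n_down, parent_minus_a[1], parent_minus_a[0])

(p1, n1), (p2, n2) = sibling_posterior(ap1, ap1_minus_a), sibling_posterior(ap2, ap2_minus_a)

# Variant 1 (sibling data, Equation 4.6): count-weighted mix of the sibling posteriors.
variant1 = [(a * n1 + b * n2) / (n1 + n2) for a, b in zip(p1, p2)]
# Variant 2 (class mixing, Equation 4.8): unweighted average of the sibling posteriors.
variant2 = [(a + b) / 2 for a, b in zip(p1, p2)]

print(variant1, variant2)  # approximately [0.431, 0.569] and [0.436, 0.564], matching the walkthrough below up to rounding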
We use k = 5.For variant 1, we need the posteriors of AP1−A and AP2−A.The posterior for AP1−A combines the signal from below for AP1−A with the prior for AP1. The priorfor AP1 is the posterior for>−AP1. Similarly, the posterior for AP2−A combines the signal from belowfor AP1−A with the prior for AP2. The prior for AP2 is the posterior for >−AP2. To get the posteriorsfor >−AP1 and >−AP2, we mix the observations for >−AP1 and >−AP2 with the prior for >, whichis a uniform probability distribution with a pseudocount of k.Once we have the posteriors for AP1−A and AP2−A, for variant 1 we mix the posteriors according toEquation 4.6 (page 25), and for variant 2 we average the posteriors according to equation 4.8 (page 25)and we get:Variant 1: p↓A = [0.431,0.569]Variant 2: p↓A = [0.436,0.564]For Variant 3, we subtract the global parameter σ> from the posteriors of AP1−A and AP2−A to get σP1and σP2, and we add the parameters σ>, σP1 and σP2 together to use as A’s prior. σ> is just the posteriorof >−A: mixing the signal from below of >−A with the prior for >. Note that when computingparameters for any class i, we always apply an inverse softmax to i’s posterior, before subtracting theparameters of i’s superclasses. Then, after adding the parameters of>, AP1 and AP2, we apply a softmaxfunction to the sum.For Variant 3, we get: p↓A = [0.426,0.574]28Table 4.2 shows the prior produced by each example, alongside the ground truth, which is the actualdistribution of the target outcomes of observed entities in the class (Administrator, Musical). We showP(likes), which is p↓A(1), for simplicity.Variant Probability (likes) for entities in (Administrator, Musical)Ground truth 0.5681: Sibling data 0.5692: Class mixing 0.5643: Parameter adding 0.574Table 4.2: A comparison of the variants on an example, (Administrator, Musical)While this example worked well in combining the two parents to form a prior, there are lots of exampleswhere the individual variants do very poorly. To address this, we try to look at a large number of pairsof two parents to see if a weighted combination of the three variants produces a better prior.4.5 Combination testsIn the sections above, we introduced three variants of the bounded ancestor method for obtaining pri-ors for entities in classes with multiple parents. In the introduction, we hypothesized that a weightedcombination of the three variants might be better for creating priors than any individual variant.In this section we conduct an experiment that investigates whether a weighted combination creates moreeffective priors than an individual variant. We define three weights: w1, w2, w3, where w1+w2+w3 = 1.Let the prior produced by Variants 1, 2, and 3 be denoted by Out put1, Out put2, and Out put3 respectively.The weighted prior is composed of a combination of the three variants:Weighted Prior = w1(Out put1)+w2(Out put2)+w3(Out put3)Our goal to find the weights w1, w2, w3 that will produce an accurate of a prior as possible for entitiesin classes with multiple parents.We treat the experiment as a supervised learning problem, where the parameters that we are trying tolearn are the weights w1,w2,w3. The training data consists of triples (A,AP1,AP2), where AP1 and AP2are parents of A. The classes are selected from properties of the MovieLens 100K [6] and Fashion [14]datasets. A separate training set is constructed for each of the three datasets. 
The inputs to the learningproblem are the observations of entities from the classes A, AP1 and AP2, and the output is the priorprobability distribution for class A.We measure the accuracy of the prior for class A using the KL Divergence and the Euclidean distance.We call the actual distribution of the observations of entities in class A the ground truth distribution. For29each class, we compare the prior obtained through weighting the three variants to the ground truth usingthe KL Divergence and Euclidean distance.4.5.1 Experimental setupIn our learning problem, we seek weights w1,w2,w3 that sum to 1. Our experiment uses a grid search overpossible weights, to determine the weighting that minimises the KL divergence and Euclidean distance.We chose a grid search for simplicity. We only have three weights, allowing for a simple search overall weight combinations, and a grid search allows us to control the increment that we use to vary theweights.We create classes out of individual user, movie, item properties. For example, consider the subset latticeshown in Figure 4.3 from MovieLens 100K [6].Figure 4.3: Example subset lattice from MovieLens 100K [6]An example training triple of the form (A,AP1, AP2) from Figure 4.3 is ((Male ∩ Administrator, >),(Male,>), (Administrator,>)). We abbreviate this as ((Male, Administrator), Male, Administrator). Forproperties with real values, such as shoe size, in this experiment, classes are created by splitting thevalue range into discrete equidistant intervals.30Of the (A,AP1,AP2) triples, some are not useful for testing the effectiveness of the priors we obtain, sowe remove them from the training set, based on the following criteria:• In some cases, we don’t have access to any observations of class A. In the bounded ancestormethod, a prior for class A would have been used. However, for the purposes of training for thisexperiment, we need to know the ground truth observations for A to compare with the prior wecreate.• If all entities in class AP1 are in A, the class A−AP1 would be empty. Similarly, if all entitiesin AP2 are in A, class A−AP2 would be empty. While we could still get a prior from the otherparent, the purpose of this combination experiment is to determine what weights work best whenworking with multiple parents, so we remove these.The number of triples that make up the training set for the weighted combination experiments that wedo for each dataset are as follows:• MovieLens: In MovieLens, we have 50 classes that we use as the parents. Since this experimentinvolves pairs of parents with a single joint child, by pairing classes up, we get 1225 possibletriples. We remove some triples according to the criteria above, and end up with 826 triples.• Fashion, “ModCloth”: In “ModCloth”, we start with 118 classes, and pairing them up gives us5995 possible triples. After removing some triples according to the criteria above, we end up with4599 triples.• Fashion, “Renttherunway”: In “Renttherunway”, we start with 147 classes, and pairing themup gives us 10731 possible triples. After removing some triples according to the criteria above,we end up with 4954 triples.In our experiments for all three of MovieLens, “ModCloth” and “Renttherunway”, we perform 10-foldcross validation. To do so, we divide the training sets into 10 sections, and repeat the experiment 10times. 
In each repeat, one section, which consists of 10% of the triples is the test set, with the remaining9 sections consisting of 90 % of the triples as the training set.In training, we conduct grid searches across different weights w1,w2,w3 for the three variants, in 0.01increments. (For example, [0.31, 0.09, 0.6] is a potential combination of weights) and all combinationsare exhausted: 5050 different weights are considered in total. For each combination, we measure theeffectiveness of the weighted prior using our error metrics, shown in the next section.4.5.2 Error MetricsWe use two metrics to determine the effectiveness of the prior for a class. We use the KL divergenceand the Euclidean distance. This section describes each in more detail.KL divergence31The KL divergence was originally introduced by S. Kullback and R. A. Leibler [12]. An in depthintroduction can be found in S. Kullback’s textbook [11]. The KL divergence is a distance measurebetween two probability distributions that share the same domain. Mathematically, the divergence fromP2 to P1 is given by:DKL(P1(x),P2(X)) = ∑x∈Dom(X)P1(X) log(P1(X)P2(X))(4.12)In our application, the two probability distributions we want to compare are:• The prior obtained through weighting the three variants.• The ground truth probability distribution.The domain of both of these probability distributions consists of two outcomes (simplified to 0,1) forthe MovieLens pairs of classes, and three outcomes for fashion (simplified to 0,1,2).Intuitively, one can think of the KL divergence in terms of bits. If we consider P1 to be the groundtruth, then the KL divergence measures the extra bits needed to encode P1 using P2: in other words, howmany more bits does one need if one chooses to use the approximation P2 in place of P1? Since we areapproximating the ground truth with the weighted prior, we are measuring how many extra bits we needto use for the prior to approximate the ground truth. This means that in terms of direction, we computethe KL divergence from the weighted prior to the ground truth.We sum the KL divergence from each (A,AP1,AP2) triple, and divide by the number of training triples(averaging). For the logarithm, we use base 2. When working with probabilities, the KL divergence ismore sensible choice compared to the Euclidean distance. The Euclidean distance is included just forcomparison.Euclidean distanceThe Euclidean distance is a classic measure of the distance between two vectors: here we include thisas a baseline check, in addition to the KL divergence.Let P1 and P2 be two probability distributions, where P1 = [p10, p11, p12], P2 = [p20, p21, p22]. Then, theEuclidean distance is as follows:D(P1(x),P2(x)) =√(p10− p20)2+(p11− p21)2+(p12− p22)2 (4.13)For both metrics, we compute the average across the pairs in the relevant training dataset.324.5.3 ResultsHere we show the optimal weight combinations from all 10 folds in the 10-fold cross validation exper-iments that minimise the KL divergence and Euclidean distance. 
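For reference, a minimal sketch of the search these optima come from is shown below: the base-2 divergence of Equation 4.12 scores a weighted prior against the ground truth, and the weights are enumerated on the 0.01 grid described above. The single training triple in the example is made up; in the experiments, each fold's set of (A,AP1,AP2) triples would be scored this way and the per-fold optima reported next.

import math

def kl_divergence(p1, p2):
    # KL divergence from p2 to p1: extra bits needed to encode p1 using p2 (Equation 4.12, base 2).
    return sum(a * math.log2(a / b) for a, b in zip(p1, p2) if a > 0)

def weighted_prior(w, outputs):
    # Combine the three variant priors with weights w = (w1, w2, w3).
    return [sum(wi * out[v] for wi, out in zip(w, outputs)) for v in range(len(outputs[0]))]

def grid_search(triples, step=0.01):
    # triples: list of ((variant1, variant2, variant3), ground_truth) pairs.
    best, best_score = None, float("inf")
    n = round(1 / step)
    for i in range(n + 1):            # w1
        for j in range(n + 1 - i):    # w2; w3 is whatever remains so the weights sum to 1
            w = (i * step, j * step, (n - i - j) * step)
            score = sum(kl_divergence(gt, weighted_prior(w, outs))
                        for outs, gt in triples) / len(triples)
            if score < best_score:
                best, best_score = w, score
    return best, best_score

# One hypothetical triple: the three variant outputs for A, and A's ground-truth distribution.
toy_triples = [(([0.43, 0.57], [0.44, 0.56], [0.42, 0.58]), [0.432, 0.568])]
print(grid_search(toy_triples))

The Euclidean distance of Equation 4.13 can be swapped in for kl_divergence to optimise the second metric.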
For each of w1, w2, w3, we report therange, mean and median for all three datasets.MovieLens 100KOptimising for KL divergence:• Ranges: w1 = [0.38,0.49], w2 = [0,0], w3 = [0.51,0.62].• Median: w1,w2,w3 = 0.475,0,0.525• Mean: w1,w2,w3 = 0.465,0,0.535Optimising for Euclidean distance:• Ranges: w1 = [0.27,0.37], w2 = [0,0], w3 = [0.63,0.73].• Median: w1,w2,w3 = 0.32,0,0.68• Mean: w1,w2,w3 = 0.321,0,0.679“ModCloth””Optimising for KL divergence:• Ranges: w1 = [0.33,0.46], w2 = [0,0], w3 = [0.54,0.67].• Median: w1,w2,w3 = 0.37,0,0.63• Mean: w1,w2,w3 = 0.371,0,0.629Optimising for Euclidean distance:• Ranges: w1 = [0.15,0.33], w2 = [0,0], w3 = [0.67,0.85].• Median: w1,w2,w3 = 0.2,0,0.8• Mean: w1,w2,w3 = 0.208,0,0.792“Renttherunway”Optimising for KL divergence:• Ranges: w1 = [0,0.08], w2 = [0,0.07], w3 = [0.92,0.97].• Median: w1,w2,w3 = 0.035,0.01,0.95• Mean: w1,w2,w3 = 0.032,0.02,0.94833Optimising for Euclidean distance:• Ranges: w1 = [0,0.05], w2 = [0,0], w3 = [0.95,1].• Median: w1,w2,w3 = 0.01,0,0.99• Mean: w1,w2,w3 = 0.011,0,0.989Notice that the weighting for “Renttherunway” differs significantly from the weightings for MovieLensand “ModCloth”.4.5.4 Interpreting the weightsBy considering how the variants work, we can draw some insights from them.Variant 1 directly mixes the counts from both parents. For two parents, if one parent has significantlymore observations than the other parent, the distribution of the counts of the computed prior will heavilyskew towards the distribution of counts of the dominating parent class.Variant 2 avoids this by forcing an even split in the weighting of the parents, effectively neglecting thesignificance of a large number of observations for a particular parent class. Thus, one can considerthe weighting optimisation as determining the extent to which the skewing of the datapoints in theparents affects the distribution for a shared child. By skewing, we refer to the case where one parenthas significantly more observed data than another parent. Variant 3 relies on the addition of parentand superclass parameters to compute the prior. These parameters are dependent on the distribution ofcounts of the classes that they represent.We hypothesized that a weighted combination would do better than any individual variant. This was thecase, but surprisingly, there are weightings of 0 for all three datasets. For MovieLens and “ModCloth”for both metrics the optimal weighting of the second variant was 0.The results for “Renttherunway” are also surprising. For both metrics, almost all the weighting is onvariant 3, with very little in the first two variants. This was the case across most of the folds in 10-fold validation. The variation between folds is very small: for example, for Euclidean distance, in“Renttherunway”, the second weight is always zero while the first weight is close to zero.We now try to consider reasons for why “Renttherunway” differs significantly from MovieLens and“ModCloth”. First, we looked at the sets of (A,AP1,AP2) triples used for the grid search experiment forMovieLens, “ModCloth” and “Renttherunway”. In particular, at the average difference in the numberof observations of each parent AP1 and AP2.We looked at the differences in the number of observations between parents for the three datasets, sincethe first and second variants take into account the relative numbers of observations of the parents indifferent ways. 
We looked at the (A,AP1,AP2) triples of classes that were constructed as the trainingsets for the combination test and looked at the average difference in number of datapoints between34parents AP1 and AP2 for the three datasets. For MovieLens the average difference in the number ofentities between the two parents selected is 15208 datapoints, for “ModCloth” it was 14521, and for“Renttherunway” it was 32407. The average difference in number of datapoints between parents for“Renttherunway” is over two times that for MovieLens and “ModCloth”.The weights for MovieLens and “ModCloth” both weighted the first and third variants positively with al-most no weight on the second variant, while “Renttherunway” has similarly low weights for the first twovariants. It could be that the difference in the number of points between the parents in “Renttherunway”could have affected the results.We looked at the distribution of target outcomes “ModCloth” and “Renttherunway”. In both datasets(over the entire datasets), “fit” is the most common target outcome. In “ModCloth”, 68.6 % percent are“fit”, and in “Renttherunway”, 73.7 % are “fit”. However, since these two values are similar, it does notseem to explain the zero weight for “Renttherunway”.Finally, the training set of (A,AP1,AP2) triples of classes for “Renttherunway” might not represent thewhole dataset, for two reasons. Firstly, as described in Chapter 3, “Renttherunway” contains missingdata: certain attributes are missing. Secondly, we removed a number of training triples from the gridsearch where for the triple of classes (A, AP1, AP2), all the entities in AP1 are in A or all entities in AP2are in A. We removed these because they wouldn’t be helpful for learning the weights, but informationmight have been lost because of it. For the triples removed, most of them were due to the child class Ahaving no data. For MovieLens, 399 of the 1225 (A,AP1,AP2) triples either had no data in the joint childA, or had no data for one of the sibling sets AP1−A or AP2−A. Likewise, for “ModCloth”, 1396 out of4995 had no such data, and for “Renttherunway” 5777 out of 10731 had no such data.Since the parents AP1, AP2 in the triples were all selected from classes constructed from properties ofthe three datasets, we could reason that had we had observations for the removed triples, we wouldhave a more complete picture of the effects of multiple parents on their joint child in those datasets. Inparticular, in “Renttherunway”, close to half the triples were removed, which potentially means that theset of triples we use is not fully representative of the dataset.35Chapter 5OrdinalsThis chapter investigates the third hypothesis. We try to construct new classes from ordinal properties,and we test to see the impact of different class hierarchies on prediction making for our test datasets.The chapter consists of two parts:• Ordinals: We explore different ways of splitting ordinal properties to construct new classes. Wefirst look at simple ways to split the range of ordinal properties into intervals. Then, we considerrecursive binary splitting, by splitting on the midpoint of a range and by splitting on informationgain.• Experimental results: We evaluate the parameter sharing model and the bounded ancestor modelon the Fashion datasets. We evaluate on the negative log likelihood loss and sum of squares error.5.1 Classes from ordinalsSince classes are sets of entities, there are many possible classes. 
Selecting which classes to assignparameters to (known as the modelled classes) is an important problem. This chapter investigates howto select classes for ordinal properties. The Fashion datasets contain properties that are ordinal, suchas shoe size, where it makes sense to consider not just the numerical values, but the “ordering” of avalue relative to other values. In this section we investigate different ways in which we can split theseproperties and construct new classes.Consider the property Hip size from the “ModCloth” fashion dataset. Of the 82790 datapoints in “Mod-Cloth”, 56064 of them have Hip size specified. Hip size ranges from 30 to 60. In ModCloth, the inputto the relation “Fit(U,I)” is a pair of entities: A (User, Item) pair.We investigate the following ways to split this property.1. Intervals: This does not take ordering into account. We can define classes based on discrete non-overlapping intervals: “Hip size 30-35”, “Hip size 35-40”, and so on. Each (User, Item) pair is a36member of only one of these classes, since the intervals are by definition mutually exclusive.2. Ordinal: Recursive binary splitting: We can split the ordinal property at a single point, creatingtwo classes. We can recursively split each new class into two more classes, creating a binarytree class hierarchy. For example, for Hip size, if we choose the splitting index to be 35, we get:“Hip size ≥ 35”, and “Hip size < 35”. Each of these new classes can then be further dividedrecursively. We choose the splitting point using one of two ways:• The simplest way is to choose the midpoint between the maximum and minimum value ofrange of the ordinal property. For example, if Hip size ranges from 10 to 50, we select 30 asthe splitting point.• We can also choose the splitting point resulting in two intervals that maximises the infor-mation gain. Information gain is a concept used often in decision trees in machine learning.We properly introduce this in Section 5.1.2.5.1.1 Simple intervalsThis is the simplest approach that does not take ordering into account. Given the minimum and maxi-mum numerical values of Hip Size, we divide the range into 10 equidistant brackets. We would create10 new classes, each described by a Hip Size range.The following distribution of points can be seen in Table 5.1.Class Number of pointsHip Size: 30-32 2142Hip Size: 33-35 9777Hip Size: 36-38 12952Hip Size: 39-41 11861Hip Size: 42-44 8047Hip Size: 45-47 4282Hip Size: 48-50 3279Hip Size: 51-53 1745Hip Size: 54-56 1086Hip Size: 57-59 893Table 5.1: Distribution of points: Simple intervalsIn this method, the datapoints are placed into classes defined on equal sized value intervals. However,splitting in this way has several issues. Such a split does not take into account the distribution ofpoints. Splitting to create equal intervals also does not take ordering into account, which is importantfor ordinals.37An alternative way to create simple intervals is to use the cumulative distribution of points. We callthis simple thresholding. Like before, we take the maximum and minimum numerical values of Hipsize, and now we divide the range into 10 equidistant threshold values. Now, a (User, Item) pair is amember of one of these classes if the User’s hip size is greater than the threshold (and thus greater thanall preceding thresholds). We don’t test on simple thresholding. 
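(For concreteness, a small sketch of the equal-width interval construction behind Table 5.1 is given below; the hip sizes are a made-up sample rather than the "ModCloth" data, and the helper names are ours.)

from collections import Counter

def interval_classes(values, n_intervals=10):
    # Assign each value to one of n_intervals equal-width brackets over the observed range.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    def bracket(v):
        i = min(int((v - lo) / width), n_intervals - 1)  # clamp the maximum value into the last bracket
        return (lo + i * width, lo + (i + 1) * width)
    return Counter(bracket(v) for v in values)

hip_sizes = [31, 33, 34, 36, 38, 39, 41, 44, 47, 50, 53, 56, 59]  # illustrative sample
for (lo, hi), count in sorted(interval_classes(hip_sizes).items()):
    print(f"Hip size {lo:.0f}-{hi:.0f}: {count} entities")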
Instead, we focus on testing intervalsplitting and recursive binary splitting.5.1.2 Recursive binary splittingInstead of splitting an ordinal property into multiple intervals in a single split, and creating classes fromthe intervals, we could select a single splitting point that divides the range of the property into twoparts. Let the range of the property (e.g. Hip size) have minimum and maximum values smin and smaxrespectively. Then, if we split at point p, the resulting classes represent intervals: [smin, p] and [p,smax].We can then choose another midpoint recursively from the intervals of each of the newly created classesto further split.Splitting based on the midpoint is simple: The midpoint is:p= smin+smax− smin2Splitting on the midpoint might not be the most useful for learning and generalisation, because we aresplitting at the midpoint regardless of where the observations are distributed. A more principled wayof splitting would be to use the information gain. To introduce information gain, we first discuss theconcept of entropy, which is commonly used in information theory and machine learning.We use the definition and notation from J. R. Quinlan’s classic decision tree paper [18].Let A be a set of points where the points are assigned outcomes of “0” or “1”. Let n0 and n1 be thenumber of “0”s and “1”s respectively. Then, the entropy is defined as: (Page 89, [18]. Quinlan uses pand n. n0 and n1 are used here to allow for extending to more than 2 outcomes).Entropy(A) =− n0n0+n1log2n0n0+n1− n1n0+n1log2n1n0+n1(5.1)The entropy intuitively represents the amount of useful information we get from a particular collectionof points. Next, we define information gain. The information gain is defined with respect to a set and asplit. Let A be the original set, and C1,C2 be the disjoint result of partitioning A into two sets. We callthis split s. Let N1 and N2 be the number of points in C1 and C2. Then, the information gain from thispartition is: (Page 90, [18])Information Gain(A,s) = Entropy(A)− N1N1+N2Entropy(C1)− N2N1+N2 Entropy(C2) (5.2)38We weight the entropy by the number of points in each set. This is for a particular split s. We considerall the potential splits, and we choose the split that maximises the information gain.5.1.3 Subclass HierarchiesThe classes created through the splitting variants we discussed above results in distinct class hierarchies.If we split a property into intervals and create classes out of these intervals, without considering ordering,all of these newly created classes will be disjoint, since any datapoint (entity tuple) can only be a memberof a single one of these classes.In the Fashion datasets [14], there is missing data: not all properties are reported. For instance, theproperty Hip size is not reported for all datapoints. To take this into account, we create a class whosemembers don’t have Hip size reported.The hierarchy resulting from splitting intervals for the Hip size property can be found in Figure 5.1.Figure 5.1: The part of the class hierarchy for the Hip size property, if we use simple intervals.If we choose to do recursive binary splitting, regardless of where we pick the splitting point, the hierar-chy of the relationships between the created classes is a binary tree. We show this in Figure 5.2. Again,we have a separate class for the datapoints that don’t report the property Hip size.39Figure 5.2: The part of the class hierarchy for the Hip size property, if we use binary splitting. 
Thestructure is the same for both midpoint splitting and information gain splitting.Notice that the depth of the subclass hierarchy corresponds to the number of splits.5.2 ExperimentsHere, we investigate the third hypothesis: whether different ways of splitting ordinals improve testperformance.We want to see if splitting ordinals by recursively splitting on the midpoint or by recursively maximisinginformation gain will improve prediction accuracy and negative log likelihood loss. In this experiment,we compare the performance of the regularised parameter sharing model and the bounded ancestormethod on three hierarchies that differ on how the classes for ordinal properties are constructed.• Simple intervals: like in Figure 5.1. This is the baseline that we compare to.• Binary splits on midpoints.• Binary splits to maximise information gain.We test on the Fashion datasets “ModCloth” and “Renttherunway”, since both datasets have a numberof ordinal properties in addition to categorical properties. We construct hierarchies for both datasets inthe same way.405.2.1 Hierarchy construction and experiment setupThe categorical properties have one class for each possible value. For example, for Shoe width, twopossible values are “Average” and “Wide”. (The classes “Shoe width: Average” and “Shoe width:Wide” correspond to the attributes “Shoe width = Average” and “Shoe width = Wide”). We create newclasses for the ordinals based on binary splitting. Figure 5.3, shows a subset of this hierarchy. In it, weshow Hip size as an example ordinal, and the two values of “Shoe Width” are shown.Figure 5.3: Part of the class hierarchy for the experiments for binary splitting. The ordinal proper-ties are split into classes that form binary trees whose root is a child of the global class.The hierarchy looks the same for both midpoint binary splitting and information gain splitting.We use a train test split where 80% of the dataset is used for training and 20% for testing. The trainingtesting split is done randomly, and the resulting training and test sets are set aside for the experiments.Python is used, with the Pytorch library used for the regularised parameter sharing model, and thenegative log likelihood loss. (The Cross entropy loss function in Pytorch is used). For the L2 regularisedparameter sharing model, the training set is further divided into a 90/10 training/validation set split, inorder to determine when to stop training.We use the negative log likelihood loss and the sum of squares error for the comparisons in the exper-iments in this section. The negative log likelihood loss is as defined in Chapter 2. The sum of squareserror is slightly modified from Chapter 2. We are comparing an output probability vector with a target41label. In this case, we compute the average sum of squares error between the probability of an outcome,and the number 0 or 1 depending on whether the outcome was the actual label.5.2.2 Results“ModCloth”Table 5.2 shows the results for the sum of squares error and negative log likelihood for “ModCloth”for the L2 regularised parameter sharing model and bounded ancestor method. For the L2 regularisedparameter sharing model, the validation set was used to decide when to stop training. 
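Before turning to the numbers, the sketch below illustrates the information-gain split of Section 5.1.2 in Python: Equation 5.1 for entropy, Equation 5.2 for the gain of a candidate split, and a scan over candidate split points. The (value, outcome) pairs are invented; in the experiments they would come from an ordinal property such as Hip size together with the fit outcomes.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a collection of outcome labels (Equation 5.1, generalised to any number of outcomes).
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    # Information gain of partitioning the labels into left and right (Equation 5.2).
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

def best_split(points):
    # points: list of (ordinal_value, outcome). Returns the split value with the highest gain.
    labels = [y for _, y in points]
    best_p, best_gain = None, -1.0
    for p in sorted({v for v, _ in points}):
        left = [y for v, y in points if v < p]
        right = [y for v, y in points if v >= p]
        if not left or not right:
            continue
        gain = information_gain(labels, left, right)
        if gain > best_gain:
            best_p, best_gain = p, gain
    return best_p, best_gain

toy = [(31, "fit"), (33, "fit"), (36, "small"), (40, "small"), (44, "large"), (52, "large")]
print(best_split(toy))

Applying best_split recursively to the two resulting intervals yields a binary tree hierarchy like the one in Figure 5.2; midpoint splitting simply replaces the scan with the midpoint of the range. The test-set results for the two models on each hierarchy follow.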
The results shownhere are the test set error values.Model and Hierarchy Sum of Squares Negative log likelihoodL2 parameter sharing: Baseline 0.14551 0.76307L2 parameter sharing: Midpoint splitting 0.14552 0.76323L2 parameter sharing: Information gain splitting 0.14485 0.75972Bounded ancestor method: Midpoint splitting 0.15108 0.78849Bounded ancestor method: Information gain splitting 0.15116 0.78852Table 5.2: The L2 regularised parameter sharing model and the bounded ancestor method on thetest hierarchies, for “ModCloth”. For both metrics, lower is better.“Renttherunway”Table 5.3 shows the results for “Renttherunway” for the L2 regularised parameter sharing model and thebounded ancestor method. The results shown here are the test set error values.Model and Hierarchy Sum of Squares Negative log likelihoodL2 parameter sharing: Baseline 0.13273 0.70977L2 parameter sharing: Midpoint splitting 0.13280 0.70962L2 parameter sharing: Information gain splitting 0.13205 0.70396Bounded ancestor method: Midpoint splitting 0.13368 0.71675Bounded ancestor method: Information gain splitting 0.13361 0.71584Table 5.3: The L2 regularised parameter sharing model and the bounded ancestor method on thetest hierarchies, for “Renttherunway”. For both metrics, lower is better.For both “ModCloth” and “Renttherunway”, the bounded ancestor method seems to do slightly worsethan L2 regularised parameter sharing for both metrics. Within hierarchies, the difference in perfor-mance across the different hierarchies is very small, suggesting that splitting ordinals using informationgain or at the midpoint of intervals does not have a large effect on performance.42For “ModCloth”, midpoint splitting slightly outperforms information gain splitting for the bounded an-cestor model, and for “Renttherunway” the opposite is true: information gain splitting gives a lowerloss on both metrics. However, the difference is extremely small, suggesting that the construction ofhierarchies for ordinal properties does not have a large impact, at least with respect to supervised classi-fication.Despite the choice of split for the ordinal properties having a lack of impact on the negative log likeli-hood and sum of squared errors, using different ways to split ordinals is still informative to us becauseit gives us insight regarding how to group entities into classes based on ordinal properties.In the next section we look at the contributions of individual classes to the log likelihood loss. Inparticular, we compare this to the number of entities in those classes.5.2.3 Negative log likelihood loss on classesTo further analyse the results of running the bounded ancestor method, we looked at the value of thenegative log likelihood loss function on the represented classes of our hierarchies. In particular, we lookat the classes for which the loss values are largest and smallest, and the number of entities containedwithin those classes.We looked at the represented classes in the hierarchies for the information gain splitting experiment fromabove, for both “ModCloth” and “Renttherunway” so we can compare between them. We looked at thenumber of entities seen in the training set for each of the represented classes, alongside the contributionof those represented classes to the test negative log likelihood loss. k was selected to be 5.For “ModCloth”, the 5 classes with the largest negative log likelihood losses are as follows. We showthe number of entities in those classes alongside the log likelihood loss. 
We show the average negativelog likelihood loss for entities in the sets.• Quality not found, Loss = 1.3933, 59 entities.• Length not found, Loss = 1.3261, 27 entities.• Waist 39-40, Loss = 1.3179, 37 entities.• Bust 58-59, Loss = 1.3087, 2 entities.• Waist: 41-50, Loss = 1.3064, 140 entities.The 5 classes with the smallest negative log likelihood losses are:• Waist: 20-22, Loss = 0.2718, 15 entities.• Waist: 20-21, Loss = 0.2718, 14 entities.• Bust: 20-21, Loss = 0.4146, 8 entities.43• Height: 86-91, Loss = 0.4848, 3 entities.• Height: 55-58, Loss = 0.4999, 46 entities.For “Renttherunway”, the 5 classes with the largest negative log likelihood losses are:• Category: overalls, Loss = 3.6717, 5 entities.• Category: skirts, Loss = 1.7221, 5 entities.• Category: caftan, Loss = 1.7025, 2 entities.• Category: sweatershirt, Loss = 1.5520, 3 entities.• Height: 76-78, Loss = 1.4881, 17 entities.The 5 classes with the smallest negative log likelihood losses are:• Category: legging, Loss = 0.1082, 72 entities.• Category: cami, Loss = 0.1135, 11 entities.• Category: duster, Loss = 0.1382, 9 entities.• Category: tee, Loss = 0.2844, 17 entities.• Category: henley, Loss = 0.3002, 6 entities.For both “ModCloth” and “Renttherunway”, the classes with the largest and smallest negative log like-lihood losses tend to be smaller classes with fewer datapoints. The difference in loss values is also moreextreme for “Renttherunway”, ranging from 0.30 to 3.67. One explanation is that classes with fewerentities tend to have more variance in the loss value, which we see reflected in the variation in loss.Another possible interpretation of the variation in loss value for small classes is that the small k valuemight have a proportionally much larger effect on the small classes, because the prior count would becomparable in size to the number of entities in those classes. In the next section, we test the effect ofvarying the value of k on log likelihood loss and the sum of squares error.5.2.4 Testing values of kIn Chapter 4 we conjectured that smaller values of k might be more effective as a pseudocount for theprior of a class. Here, we test the effects of the value of k on the hierarchies that we use for our experi-ments. Here, we look at the hierarchy constructed by splitting on information gain, for “ModCloth”. Wetry out the following values of k: [5,10,25,50,75,100,200,300,400,500,600,700,800,900,1000]. Here,NLL stands for negative log likelihood loss, and SSE stands for the sum of squares error.• k = 5: NLL = 0.7885, SSE = 0.15116• k = 10: NLL = 0.7884, SSE = 0.1511344• k = 25: NLL = 0.7881, SSE = 0.15107• k = 50: NLL = 0.7876, SSE = 0.15096• k = 75: NLL = 0.7872, SSE = 0.15087• k = 100: NLL = 0.7868, SSE = 0.15078• k = 200: NLL = 0.7856, SSE = 0.15048• k = 300: NLL = 0.7847, SSE = 0.15024• k = 400: NLL = 0.7840, SSE = 0.15007• k = 500: NLL = 0.7836, SSE = 0.14994• k = 600: NLL = 0.7833, SSE = 0.14986• k = 700: NLL = 0.7832, SSE = 0.14982• k = 800: NLL = 0.7833, SSE = 0.14980• k = 900: NLL = 0.7834, SSE = 0.14982• k = 1000: NLL = 0.7837, SSE = 0.14987The value of k seems to have a small effect on overall performance accuracy. It seems that a larger valueof k, 700 for negative log likelihood loss, and 800 for sum of squares error, is the most effective out ofthe values we tested.For the prior to have a significant impact on the posterior of a class, k would have to be of comparablesize to the size of the class. However, as k increases, the prior would out scale and dominate the smallerclasses. 
For example, a prior count of k = 500 would result in a very large signal from above for a classwith 50 entities. One possible explanation for why a prior of around 700 pseudocounts performs well isthat smaller class sizes with few entities (less than 100) might have higher variance, and would benefitfrom a large prior.In the context of the reference class problem, it is worth emphasizing that part of the purpose of the priorin the bounded ancestor method is to provide insight into a entities in a class using information frommore general classes. The overall performance of the method on supervised classification is only part ofthe picture.45Chapter 6Conclusion and Future Work6.1 ResultsWe tried to address several outstanding issues with the parameter sharing model with respect to ob-taining priors and combining information from multiple parents. We also explored the effectiveness ofconstructing new classes from ordinal properties.We investigated three hypotheses, shown in the Introduction. We investigated them through testing theMovieLens, “ModCloth” and “Renttherunway” datasets.6.1.1 Hypothesis 1We hypothesized that the bounded ancestor model would be more effective than regularisation when itcomes to making predictions on classes with few or no datapoints.It was found that in supervised classification tasks on entire datasets, for example on “ModCloth” and“Renttherunway” in Chapter 5, the bounded ancestor method did not perform as well as L2 regularisedparameter sharing.With respect to classes with few datapoints, we looked at how closely the priors produced by the variantsmatched the ground truths. While in some cases the priors are reasonable when compared to the groundtruth, there are many instances of priors being created that do not seem to match the ground truth.However, for predictions involving classes that are not seen in the training set, having a prior obtainedthrough the bounded ancestor method is significantly better than having weights of zero for those classesfrom regularisation.The bounded ancestor model and the three multiple parent variants also have the benefit of being verysimple to work with, being based on Dirichlet distributions. The variants also allow us to exploredifferent ways to combine multiple parents.466.1.2 Hypothesis 2We hypothesized that a weighted combination of the three variants would be more effective than anyindividual variant in producing priors. We measured effectiveness of a prior by comparing the priordistribution with the “ground truth” actual distribution of counts for a class. We did so by consideringpairs of classes and how their counts can be combined to get a prior for their joint child class.We found that a weighted combination of the three variants did work best, but it depended on the dataset.MovieLens and “ModCloth” seemed to favor a combination of variants 1 and 3, while “Renttherunway”seemed to favor combining variants 2 and 3. However, we found that due to outliers and the closenessof the KL divergence of the various combinations, it was difficult to have a combination that works forall pairs of parents. 
However, since we can quantify the closeness of priors through the KL divergence,we reported on the combinations that minimises this based on 10-fold cross validation.We looked at the distribution of observations for the parents in the grid search experiment conducted,and found that for MovieLens and “ModCloth”, for two parents, the number of observations in eachparent were closer together than in “Renttherunway”.6.1.3 Hypothesis 3We hypothesized that splitting ordinal properties to create new classes would help us make better predic-tions. One can also consider this the problem of how to choose the modelled classes (classes representedwith parameters) to help with prediction making.We looked at the binary splitting of ordinal properties for “ModCloth” and “Renttherunway”. We con-sidered splitting on the midpoint of the range of properties, and splitting to maximise information gain,to construct a new class hierarchy. We looked at how this new hierarchy affects the sum of squares andnegative log likelihood loss of the regularised parameter sharing model on the entire “ModCloth” and“Renttherunway” datasets.We found that the results for both error metrics did not vary significantly across hierarchies. Whilethe overall prediction accuracy did not meaningfully improve, the newly constructed classes based oninformation gain splitting are still very informative. One key benefit of having a class hierarchy isexplainability, as described by Mohammad Mehr [15]. By constructing new classes based on splits thatmaximise information gain, we are able to identify which intervals of property values are most importantto a prediction.6.1.4 Future workWe tried to obtain priors for classes with few or no datapoints by combining information through ob-servations from the parents and siblings of those classes. Future work could potentially look into in-corporating domain knowledge alongside this. For instance, in movie predictions, known habits aboutparticular age demographics might be incorporated alongside priors obtained from data.47Future work could also investigate using the methods for obtaining priors and combining parents along-side collaborative filtering methods for MovieLens and other movie prediction datasets. Collaborativefiltering methods often utilise latent embeddings that are initialised to random values or zero. The vari-ants for combining classes could potentially be adapted to combine embeddings to obtain priors forother embeddings.Future work could also utilise our work alongside approaches for the reference class problem. Otheropen problems include how to better combine information from priors with observations from data: inthis thesis we computed a weighted mix, but other methods could be investigated. Furthermore, futurework could also consider more sophisticated ways of combining multiple parents.48Bibliography[1] C. M. Bishop. Pattern recognition and machine learning. springer, 2006. → page 20[2] M. Chiang and D. Poole. Reference classes and relational learning. International journal ofapproximate reasoning, 53(3):326–346, 2012. → page 1[3] E. F. Codd. A relational model of data for large shared data banks. In Software pioneers, pages263–294. Springer, 2002. → page 12[4] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. Springerseries in statistics New York, 2001. → page 8[5] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep learning, volume 1. MIT pressCambridge, 2016. → page 8[6] F. M. Harper and J. A. Konstan. 
The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015. → pages ix, x, 1, 2, 3, 12, 14, 15, 16, 20, 23, 24, 27, 28, 29, 30
[7] G. James, D. Witten, T. Hastie, and R. Tibshirani. An introduction to statistical learning, volume 112. Springer, 2013. → page 10
[8] V. Keselj. Speech and language processing, Daniel Jurafsky and James H. Martin (Stanford University and University of Colorado at Boulder), Pearson Prentice Hall, 2009, xxxi+988 pp; hardbound, ISBN 978-0-13-187321-6, 2009. → pages 8, 9
[9] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009. → page 1
[10] S. Kotz, N. Balakrishnan, and N. L. Johnson. Continuous multivariate distributions, volume 1: Models and applications. John Wiley & Sons, 2004. → page 20
[11] S. Kullback. Information theory and statistics. Courier Corporation, 1997. → page 32
[12] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951. → page 32
[13] B. M. Marlin. Modeling user rating profiles for collaborative filtering. Advances in Neural Information Processing Systems, 16:627–634, 2003. → page 1
[14] R. Misra, M. Wan, and J. McAuley. Decomposing fit semantics for product size recommendation in metric spaces. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 422–426, 2018. → pages ix, 3, 12, 16, 17, 18, 20, 29, 39
[15] A. Mohammad Mehr. Prediction and anomaly detection in water quality with explainable hierarchical learning through parameter sharing. UBC Thesis, 2020. → pages 3, 5, 7, 10, 11, 47
[16] K. P. Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012. → page 19
[17] J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006. → page 10
[18] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986. → page 38
[19] M. Vartak, A. Thiagarajan, C. Miranda, J. Bratman, and H. Larochelle. A meta-learning perspective on cold-start recommendations for items. In Advances in Neural Information Processing Systems, pages 6904–6914, 2017. → page 1
