Learning Latent Theories of Relations and Individuals

by

Michael Chi-Hao Chiang

B. Eng., University of Melbourne, 2000
M. Eng. Sci., University of Melbourne, 2002

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE STUDIES (Computer Science)

The University of British Columbia (Vancouver)

April 2011

© Michael Chi-Hao Chiang, 2011

Abstract

Inductive learning of statistical models from relational data is a key problem in artificial intelligence. Two main approaches exist for learning with relational data, and this thesis shows how they can be combined in a uniform framework. The first approach aims to learn dependencies amongst features (relations and properties), e.g. how users' purchases of products depend on users' preferences for the products and associated properties of users and products. Such models abstract over individuals, and are compact and easy to interpret. The second approach learns latent properties of individuals that explain the observed features, without modelling interdependencies amongst features. Latent-property models have demonstrated good predictive accuracy in practice, and are especially useful when few properties and relations are observed. Interesting latent groupings of individuals can be discovered. Our approach aims to learn a unified representation for dependency structures over both observed features and latent properties. We develop a simple approximate expectation maximisation (EM) algorithm for learning the unified representation, and experiments demonstrate cases where our algorithm generates models that predict better than dependency-based models of observed features, as well as a state-of-the-art latent-property model. We extend our approximate EM algorithm to handle uncertainty about the number of values for latent properties.
We search over the number of values and return error bounds, as an alternative to existing proposals based on sampling from the posterior distribution over the number of values. We also solve a specific case where dependencies involve functional relations, which induces a verbose model with many parameters. In comparison, the standard solution of aggregating over all values of the function yields a simple model that predicts poorly. We show how to learn an optimal intermediate-size representation efficiently by clustering the values of the function. The proposed method generates models that capture interesting clusters of function values, dominates the simple model in prediction, and can surpass the verbose model using far fewer parameters.

Table of Contents

Abstract ii
Table of Contents iv
List of Tables vii
List of Figures ix
Glossary xvi
Acknowledgements xviii

1 Introduction 1
1.1 Learning in Relational Domains 3
1.2 Relational Models and Probability Prediction 6
1.3 Thesis Contributions and Outline 10

2 A Unified Framework for Relational Modelling 16
2.1 Background 16
2.1.1 Notation 16
2.1.2 Relational Data 19
2.1.3 Loss Functions 20
2.2 Relational Models 21
2.2.1 Languages 22
2.2.2 Dimensions of Relational Models 27
2.3 A Unifying Representation 30
2.4 Learning Problems 33
2.4.1 Single Observed Relation 34
2.4.2 Multiple Observed Relations 37

3 An Argument for Latent Relational Models 46
3.1 Modelling with Observed Properties and Relations 46
3.1.1 Reference Classes 47
3.1.2 Relational Probabilistic Models 49
3.2 Modelling Latent Properties of Individuals 52
3.3 Understanding Relational Inference 55
3.3.1 Generative Model 56
3.3.2 Sampling from G 57
3.3.3 Inference with Reference Classes 58
3.3.4 Inference with Latent Relational Models 63
3.4 Experiments 66
3.4.1 Models 67
3.4.2 Protocol 68
3.4.3 Results 69
3.5 Remarks 72

4 Learning Latent Relational Models 73
4.1 Estimating latent relational models (LRMs) 73
4.1.1 Expectation Maximisation 74
4.1.2 EM for LRMs 75
4.2 An Approximate EM Method for LRMs 77
4.2.1 Expectation Step 77
4.2.2 Maximisation Step 80
4.2.3 Likelihood 81
4.3 Properties 82
4.3.1 Convergence 82
4.3.2 Complexity 84
4.4 Experiments 84
4.4.1 Likelihood Optimisation 84
4.4.2 Systems of Relations 87
4.5 Remarks 96

5 Learning with Uncertainty about Size of Latent Properties 98
5.1 Nonparametric Relational Models 99
5.2 Infinite-size LRMs 101
5.3 Learning 102
5.3.1 Likelihood Bounds 102
5.3.2 Non-optimal Parameters and Asymptotic Errors 105
5.3.3 Proposed Algorithm 106
5.4 Prediction 109
5.5 Experiments 111
5.5.1 Error Bounds 112
5.5.2 Bayesian Prediction 114
5.6 Remarks 117

6 Learning with Functional Relations 121
6.1 Problem Statement 123
6.2 Optimal Partitioning 125
6.3 Optimal Partitioning by Dynamic Programming 128
6.4 Experiments 131
6.4.1 Journey-to-work data 131
6.4.2 E. coli Gene Data 133
6.5 Summary 137

7 Conclusions 138
7.1 Contributions 138
7.2 Extensions 142

Bibliography 145

List of Tables

Table 3.1 Average log-loss over 2000 sampled datasets from the WebKB domain for REF, infinite relational model (IRM) and LRM on both the training and test sets. 71

Table 3.2 Average log-loss over 500 sampled datasets from the EachMovie domain for REF, IRM and LRM on both the training and test sets. 71

Table 4.1 Mutual information scores for each co-relation (ngo1, ngo2, sever) taken with the target relation intergov. 90

Table 4.2 Five-fold cross-validated log-loss (with standard errors) from prediction on training and test of the target relation intergov. Models tested are: the IRM, and LRMs Lc, Lc+, and Lcd. For each loss reported, the corresponding value for |Vα| (e.g. the number of clusters) is also shown. All losses for the LRMs are the best results achieved over different numbers of clusters, obtained from data underlying Figs. 4.7, 4.8 and 4.9. Lower values indicate better performance. 96

Table 6.1 Tables showing 50 surveyed U.S. cities with sampled percentages of people who drive to work in each city. Each table shows a partitioning generated by Alg. 3, for k = 2 (left), k = 5 (centre), and k = 10 (right). Log-loss on training data for each partition model is shown at the bottom. 132

Table 6.2 Log-losses and standard error for the simple (Eq. 6.22), verbose (Eq. 6.4.2), and partition models (Eq. 6.23) in predicting gene linkage based on respective gene functions.
The numbers adjacent to the partition model indicate the number of partitions in each dimension (i.e. the best partition model found 19 partitions in each dimension). 136

List of Figures

Figure 1.1 (left) a yeast protein network; (center) a collaboration network amongst researchers in various disciplines; (right) a disease contagion network. Network nodes denote individuals, e.g. proteins or researchers, and edges denote relational ties, e.g. researcher x collaborates with researcher y. (Images sourced from http://www-personal.umich.edu/∼mejn/networks) 4

Figure 1.2 Schemas for relational models: (top row) settings where the domain contains only a single type of individuals; (bottom row) domains with multiple types of individuals; (leftmost column) settings where only one relation is observed; (other columns) settings where multiple relations are observed. T, T′ denote different types, and h denotes a latent property whilst r(·) denotes an observed relation. The main contribution of this thesis corresponds to schemas 5 and 6 (shaded), which subsume all other schemas shown. 11

Figure 2.1 A simple Bayesian network for the 'sprinkler' domain. (From http://bnt.googlecode.com/svn/trunk/docs/usage.html) 23

Figure 2.2 A Bayesian network for a simple relational domain modelling relations likes(X, Y) and funny(X), where X and Y denote persons. 25

Figure 2.3 A Bayesian network (in plate notation) representing a ground LRM in the smoking-friends domain. Shared parameters for smokes and friends are given by θsmokes and θfriends respectively. 32

Figure 2.4 A plate representation (left) of a Bayesian network modelling one observed relation r(X, Y) and two latent unary relations α1(X) and α2(Y), where r(X, Y) depends on both α1(X) and α2(Y). An unrolled (ground) network is shown on the right, with parameters omitted. Shaded nodes are observations. 35

Figure 2.5 An illustration of the partitioning of relational data induced by clustering domain individuals (indices along the dimensions of the relation). Data for relation r(X, Y) is partitioned by clusterings represented by unary relations α1(X) and α2(Y). 36

Figure 2.6 A plate representation (left) of a Bayesian network modelling two relations r(X, Y) and s(X, Y), where r(X, Y) depends probabilistically on s(X, Y). An unrolled network, with parameters omitted, is shown on the right. Shaded nodes are observations. 38

Figure 2.7 A plate representation of a Bayesian network modelling two relations r(X) and s(X, Y), where r(X) depends probabilistically on s(X, Y). 40

Figure 2.8 A plate representation (left) of a Bayesian network modelling two observed relations r(X, Y) and s(X, Y), and two latent unary relations α1(X) and α2(Y). Both r(X, Y) and s(X, Y) depend on both α1(X) and α2(Y). An unrolled network, with parameters omitted, is shown on the right. Shaded nodes are observations. 41

Figure 2.9 Multiple relations, each represented as a two-dimensional slice, are concatenated to form a hypercube. Clustering of individuals and relations can be done by clustering elements of the hypercube (Kemp et al., 2006). 43

Figure 2.10 A plate representation (left) of a Bayesian network modelling two observed relations r(X, Y) and s(X, Y), and two latent unary relations α1(X) and α2(Y). Both r(X, Y) and s(X, Y) depend on both α1(X) and α2(Y). Additionally, r(X, Y) also depends on s(X, Y). An unrolled network, with parameters omitted, is shown on the right. Shaded nodes are observations. 44

Figure 3.1 A simplified user-movie rating domain, where likes(U, M) is a Boolean relation denoting that user U likes movie M. drama(M) is an observed Boolean property of movies. The illustration splits observed cases for likes(U, M) by values of drama(M). 51

Figure 3.2 A simplified user-movie rating domain, where likes(U, M) is a Boolean relation denoting that user U likes movie M. α(U) is a Boolean unary relation representing some (hidden) property of user U, and β(M) is similarly defined for movies. The illustration splits observed cases for likes(U, M) as if α and β are observed. Probabilistic rules for each partition are also shown, with probability parameters indicating the proportion where likes(U, M) = T. 54

Figure 3.3 A generative model for a hypothetical relational domain shown as a parametrised Bayesian network. The structure is shown on the left, and parameters of the model are shown on the right. Logical variables X, Y have different types. 57

Figure 3.4 Four example configurations of θr in G (numbers inside the box). For each example, marginal probabilities are displayed outside of the box, computed assuming γa = PG(a(X)) = 0.5 and γb = PG(b(Y)) = 0.5. The corresponding reference classes are shown adjacently. 64

Figure 3.5 Log-loss for the IRM, LRM, and reference class predictions REF and POOL. Losses measured on training data (left) and test data (right) are shown for 2000 sets of simulated data. Each point in the figure corresponds to an average loss for bins of ≈ 200 datasets (with standard error shown). The bins are sorted in increasing order of percentage of observed data. 70

Figure 4.1 (a) an LRM in plate notation, and (b) an illustration of the corresponding ground LRM (Bayesian network) exhibiting a bipartite structure. Observed variables are shaded. 76

Figure 4.2 A Bayesian network localised to α(cs101), showing nodes α(cs101) ∪ M(α(cs101)), where M(α(cs101)) is the Markov blanket of α(cs101). Dark-shaded nodes are observed, whilst light-shaded nodes are fixed to some marginal posterior distribution. The new marginal posterior estimate of α(cs101) is obtained by probabilistic inference in this network. 78

Figure 4.3 Results on simulated datasets from 1000 synthetic LRMs. In all figures shown, L(·) denotes exact log-likelihood, and L̃(·) is the pseudo-log-likelihood (Eq. 4.6). Θ∗ denotes the true parameters used to generate the data, whilst Θ̂ denotes those learned from the data. The exact likelihoods for the true and learned parameters for each of the 1000 models are plotted in ascending order of L(y1; Θ∗1), …, L(y1000; Θ∗1000). 86

Figure 4.4 Exact log-likelihood of true generating parameters as a function of link density (left) and average overlap count (right). 88

Figure 4.5 Results on 1000 simulated models, as a function of link density (top) and average overlap count (bottom). 89

Figure 4.6 Three LRMs in plate notation: a single-relation latent-property model (top-left), a multiple-relation latent-property model (top-right), and a multiple-relation latent-property model with a dependency link between the co-relation and the target relation (bottom). 91

Figure 4.7 5-fold cross-validated log-loss in predicting the intergov relation using LRMs shown in Fig. 4.6. Relation sever is used as the co-relation in this experiment. Results on training data (top) and test data (bottom) are shown. The horizontal axis corresponds to different numbers of values of the latent relation α, and error bars indicate standard error. Lower log-loss means greater accuracy. 93

Figure 4.8 5-fold cross-validated log-loss in predicting the intergov relation using LRMs shown in Fig. 4.6. Relation ngo2 is used as the co-relation in this experiment. Results on training data (top) and test data (bottom) are shown. The horizontal axis corresponds to different numbers of values of the latent relation α. Error bars indicate standard error. 94

Figure 4.9 5-fold cross-validated log-loss in predicting the intergov relation using LRMs shown in Fig. 4.6. Relation ngo1 is used as the co-relation in this experiment. Results on training data (top) and test data (bottom) are shown. The horizontal axis corresponds to different numbers of values of the latent relation α. Error bars indicate standard error. 95

Figure 5.1 Single size 107

Figure 5.2 Two sizes 108

Figure 5.3 Search over the size of α, from 1 to 10. The left column corresponds to using a geometric distribution as the prior over the size of α, with parameter settings 0.1, 0.5 and 0.9. The right column uses the Poisson distribution with parameters 1, 5, and 9. Bounds for the likelihood and posterior of the size of α are shown by the shaded regions. 113

Figure 5.4 Five-fold cross-validated log-loss for infinite-size latent relational model (ILRM) predictions on the intergov relation; no co-relations are present. ILRMs with different prior distributions over the size of latent property α are evaluated: the geometric distribution (left) and the Poisson distribution (right) with different parameter values are considered. Horizontal lines in the plots are the score and standard error for the single best LRM found in Ch. 4 for the same problem. 115

Figure 5.5 Five-fold cross-validated log-loss for ILRM predictions on the intergov relation, with co-relation sever. ILRMs Lc+ (left column, no dependency link between target and co-relation) and Lcd (right column, with a dependency link from co-relation to target relation) are evaluated. Different prior distributions over the size of latent property α are evaluated: the geometric distribution (top row) and the Poisson distribution (bottom row) with different parameter values are considered. Horizontal lines in the plots are the score and standard error for the single best LRM found in Ch. 4 for the same problem. 116

Figure 5.6 Five-fold cross-validated log-loss for ILRM predictions on the intergov relation, with co-relation ngo2. ILRMs Lc+ (left column, no dependency link between target and co-relation) and Lcd (right column, with a dependency link from co-relation to target relation) are evaluated. Different prior distributions over the size of latent property α are evaluated: the geometric distribution (top row) and the Poisson distribution (bottom row) with different parameter values are considered. Horizontal lines in the plots are the score and standard error for the single best LRM found in Ch. 4 for the same problem. 118

Figure 5.7 Five-fold cross-validated log-loss for ILRM predictions on the intergov relation, with co-relation ngo1. ILRMs Lc+ (left column, no dependency link between target and co-relation) and Lcd (right column, with a dependency link from co-relation to target relation) are evaluated. Different prior distributions over the size of latent property α are evaluated: the geometric distribution (top row) and the Poisson distribution (bottom row) with different parameter values are considered. Horizontal lines in the plots are the score and standard error for the single best LRM found in Ch. 4 for the same problem. 119

Figure 6.1 An illustration of Lem. 3. The first figure (left) shows when L(p̂2, p1) ≤ L(p̂1, p1), i.e. when the premise of Eq. 6.10 in Lem. 3 does not hold. When the premise holds, we see that the distance between p̂2 and p1 is less than that between p̂1 and p1. 127

Figure 6.2 An illustration of partitioning on an initial set y1, …, y5 (left), and partitioning on a sorted set (right). The dashed line in the middle partition is a probability estimate for the partition, averaged over probabilities assigned to the partition. The illustration shows that swapping y3 and y5 to yield the sorted set reduces the error incurred by the estimate. 128

Figure 6.3 An illustration of optimal partitioning by dynamic programming. 130

Figure 6.4 An illustration of the independent application of Alg. 3 on the functions of both genes G and G′ in the relation link(G, G′). An example rule resulting from the partitioning is shown. 135

Figure 6.5 Relative log-loss of predictions over all functional classes for the gene classification dataset. 136

Glossary

AI artificial intelligence
FOL first-order logic
FOPL first-order probabilistic logic
SRL statistical relational learning
ILP inductive logic programming
CPT conditional probability table
MCMC Markov chain Monte Carlo
ML maximum likelihood
BP belief propagation
EM expectation maximisation
IRM infinite relational model
BCTF Bayesian clustered tensor factorisation
LRM latent relational model
ILRM infinite-size latent relational model
IHRM infinite hidden relational model
CRP Chinese restaurant process
DP Dirichlet process
PRM probabilistic relational model
ICL independent choice logic
PHA probabilistic Horn abduction
BLP Bayesian logic programs
RBN relational Bayesian network
MLN Markov logic network

Acknowledgements

During my time as a doctoral student I have gained more knowledge and understanding than I could have imagined; there are many individuals to thank.
First and foremost I wish to thank my supervisor David Poole, whose depth of knowledge and willingness to push the boundaries of conventional ideas has been challenging and greatly inspiring. Ideas presented in this thesis have most often been the result of David's insight and intuition.

Secondly, I wish to also thank Professor Joel Friedman, whose graduate course on Markov chains proved to be a key component in my development. He has also been a most reliable member of my supervisory committee and has provided some valuable guidance. On a similar note, I thank Dr. Kevin Murphy and Dr. Nando de Freitas for not only being helpful in providing feedback for my work, but also for their interesting research and teachings that I often refer to.

Fellow students have also helped me along tremendously. They have shown great generosity with their time and knowledge. In particular, I have gained a lot from interacting with Jacek Kisynski, Mark Crowley, Peter Carbonetto, Frank Hutter, Emtiyaz Khan, and Mark Schmidt.

I could not have survived without the support of great friends. I thank Ed, Ross, Kerrie, Shannon, Sandra, Dieter, Christian, Nolwenn, Daylen, Bjoern & Britta, Frank, Roman, Matt, Alana, Nora, Minako, Corinne, Clarice, Ali, Andrew, Aline, Julie, Grace, Phillip, and too many others to mention. These individuals have been my source of perspective and encouragement throughout this time.

Finally, deepest gratitude to my family, who gave me the impetus and desire to pursue higher goals. My parents Robert and Sharon have made too many sacrifices on my account, and their continued show of pride and support is of great comfort. Thanks also to my younger sister Michelle, who has transformed from a thorn in my side ;-) to being a great friend. Often, it is her that I look up to. Last but not least, to my late grandparents, who played a central role in my upbringing. Their influence on me will no doubt be lasting.
Chapter 1

Introduction

Learning is an ability to improve one's performance through experience (Russell and Norvig, 2009), and is a central problem studied in artificial intelligence (AI). In order for an intelligent agent to operate successfully in the world, a model of its surrounding world is generally sought. The agent can evaluate a learned model in terms of the accuracy of inferences made using the model.

The general problem addressed in this thesis is learning in domains that are relational. A relational domain is a collection of objects or individuals where relational ties exist between individuals. A relational tie indicates the value of a relation amongst some collection of individuals. For instance, if Andy and John are friends, then a friends tie exists between Andy and John, where the value of the tie is "true". Or, if Ken obtained the grade A- in CS101 in 2009, then a grade tie exists for Ken, CS101, and 2009, where the value of the tie is A-. Relational data consist of examples of such ties, along with properties of individuals, e.g. Joe's height, if observed.

Statistical machine learning increasingly emphasises the importance of learning in relational domains, driven by important relational problems that include collaborative filtering (Hofmann and Puzicha, 1999; Koren, 2008; Marlin and Zemel, 2004; Ungar and Foster, 1998), analysis of academic citation networks (McCallum et al., 2000), document networks (Chang and Blei, 2010), webpage linkage (Newman, 2003), social networks (Airoldi et al., 2008; Handcock et al., 2007; Hofman and Wiggins, 2008; Newman, 2003), and biological networks (Airoldi et al., 2008).
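Relational ties and properties of the kind described above can be represented concretely as ground facts mapping a relation and a tuple of individuals to a value. A minimal sketch of such a representation, using the running examples (the data structure itself is illustrative, not the thesis's notation):

```python
# Relational data as a map from (relation, tuple-of-individuals) to a value.
# The ties below mirror the examples in the text: a Boolean friends tie,
# a ternary grade tie with a non-Boolean value, and a unary property.
relational_data = {
    ("friends", ("Andy", "John")): True,
    ("grade", ("Ken", "CS101", "2009")): "A-",
    ("height", ("Joe",)): 180,  # a property is a unary relation
}

def value_of(relation, individuals):
    """Return the observed value of a tie, or None if the tie is unobserved."""
    return relational_data.get((relation, tuple(individuals)))

assert value_of("grade", ["Ken", "CS101", "2009"]) == "A-"
assert value_of("friends", ["John", "Andy"]) is None  # unobserved direction
```

Note that unobserved ties are simply absent: relational data is typically a partial observation of the relations over a domain.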
Given relational data representing different relations amongst individuals, a traditional approach to relational learning – with roots in the inductive logic programming (ILP) and statistical relational learning (SRL) communities (De Raedt, 2008; Getoor and Taskar, 2007; Muggleton and De Raedt, 1994; Nienhuys-Cheng and de Wolf, 1997) – aims to discover how each relation depends (probabilistically) on other relations. The result is a dependency structure over relations that best fits the data, where each dependency may be logical or probabilistic. For instance, observing co-worker ties and collaboration ties amongst individuals, one may learn that for every pair (x, y), collaborates(x, y) implies co-worker(x, y) with some probability p. Models learned in this setting quantify over individuals, i.e. they abstract over individuals, and afford the desirable properties of small representation size and ease of interpretation.

Rather than searching for the best dependency structure over observed relations, an alternative approach is to postulate and infer latent properties or relations (a property is a unary relation) that explain the observed relations. This approach has been useful in the analysis of networks and in collaborative filtering (Airoldi et al., 2008; Handcock et al., 2007; Hofman and Wiggins, 2008; Hofmann and Puzicha, 1999; Kemp et al., 2006; Kok and Domingos, 2007; Koren, 2008; Marlin and Zemel, 2004; Neville and Jensen, 2005; Newman, 2003; Taskar et al., 2001; Ungar and Foster, 1998; Xu et al., 2006). Supposing customers' ratings of products are given, a latent-property model describes how the rating relation depends on the latent properties of users and products. The learning goal is to infer the values of the latent properties of each individual that best fit the data (observed ratings).
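To make the rating example concrete: if the latent-property assignments were somehow fixed, fitting the model's probability parameters would reduce to counting observed ratings within each pair of latent values. A minimal sketch of that counting step (all names, assignments, and ratings below are made up for illustration; inferring the assignments themselves is the hard part):

```python
from collections import defaultdict

# Observed ratings: likes(user, product) -> True/False.
ratings = {("u1", "p1"): True, ("u1", "p2"): False,
           ("u2", "p1"): True, ("u2", "p2"): True,
           ("u3", "p1"): False, ("u3", "p2"): True}

# Hypothetical latent property values for users and products.
alpha = {"u1": 0, "u2": 0, "u3": 1}   # latent property of users
beta = {"p1": 0, "p2": 1}             # latent property of products

# For each pair of latent values (i, j), estimate
# P(likes = true | alpha(U) = i, beta(M) = j) as an observed fraction.
counts, trues = defaultdict(int), defaultdict(int)
for (u, m), liked in ratings.items():
    key = (alpha[u], beta[m])
    counts[key] += 1
    trues[key] += liked

theta = {key: trues[key] / counts[key] for key in counts}
assert theta[(0, 0)] == 1.0  # every rating by a 0-user of a 0-product was positive
```

This is only the parameter-fitting half of the problem; the latent assignments alpha and beta are themselves unobserved and must be inferred, which is what makes learning such models difficult.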
This approach has led to improved predictive accuracy over models which do not use latent properties (Jensen et al., 2004; Neville and Jensen, 2005; Taskar et al., 2001; Xu et al., 2006).

The aim of this thesis is a unifying framework for learning dependency-based models that include both observed and latent relations. Existing work has either focused on dependency structures over observed relations only, or on latent properties of individuals as explanations of all observed relations. The combination of the two settings can directly exploit the advantages of both representations.

The remainder of this introduction covers some prominent relational domains and existing proposals for learning in these domains (Sec. 1.1). In addition, Sec. 1.2 highlights philosophical difficulties associated with models that represent only observed relations, which serves to motivate our proposal to combine observed and latent relations in a single representation.

1.1 Learning in Relational Domains

There are several fields, from statistics to the social sciences, which address relational data. In these fields, relations are often viewed as networks whose nodes are individuals and whose edges represent relational ties amongst individuals. The simplest networks are defined over a single type of individuals, for example friendship networks over people, protein or gene networks, networks of citations amongst research papers, or the hyperlink network of webpages. Figure 1.1 shows some of these examples.

A common strategy for analysing such networks is to model latent properties (unary relations) of individuals in the network (Airoldi et al., 2008; Handcock et al., 2007; Hofman and Wiggins, 2008; Hofmann and Puzicha, 1999; Koren, 2008; Marlin and Zemel, 2004; Newman, 2003; Ungar and Foster, 1998). That is, each individual has a latent property, and relational ties are probabilistically dependent on the latent properties of the individuals implicated by the relational tie.
The general idea is to postulate latent properties to explain relational ties. Learning latent-property models involves computing the latent property values (from the observed relational ties) for each individual in the network, and how the values of relational ties depend on the latent properties. Inferring values of latent properties is commonly known as clustering, where each latent property value corresponds to a cluster, and learning the value of a latent property for an individual amounts to assigning the individual to the corresponding cluster. For instance, we can represent friendship as being probabilistically dependent on the latent properties of the participating individuals, with the conditional statement

∀X, Y  P(friends(X, Y) = true | α(X) = i ∧ α(Y) = j) = p

where α is the name of a latent property, and i, j are values of α. The meaning of the rule is that for every pair of individuals (x, y), (x, y) are friends with probability p if α(x) = i and α(y) = j holds.

Figure 1.1: (left) a yeast protein network; (center) a collaboration network amongst researchers in various disciplines; (right) a disease contagion network. Network nodes denote individuals, e.g. proteins or researchers, and edges denote relational ties, e.g. researcher x collaborates with researcher y. (Images sourced from http://www-personal.umich.edu/∼mejn/networks)

A network of relational ties can be large and complex, where much information is available to be exploited. Learning latent properties of individuals to explain the observed ties is one way to exploit the available information. Even in the simple case of having only one network, the problem of inferring latent property values is already computationally intractable (Ungar and Foster, 1998). In general, different relations may be observed amongst individuals, e.g. friends, likes, or respects.
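The conditional statement for friends above can be read operationally: once the latent values α(x) and the parameters p (one per pair of latent values) are fixed, predicting a tie is a table lookup. A minimal sketch with made-up individuals, assignments, and numbers:

```python
# A latent-property model of friendship: each individual x has a latent
# value alpha[x], and P(friends(x, y) = true) depends only on the pair of
# latent values. All assignments and parameters below are illustrative.
alpha = {"ann": 0, "bob": 0, "cam": 1}  # inferred latent property values

# theta[i][j] = P(friends(X, Y) = true | alpha(X) = i, alpha(Y) = j)
theta = [[0.9, 0.2],
         [0.2, 0.7]]

def p_friends(x, y):
    """Predicted probability that x and y are friends under the model."""
    return theta[alpha[x]][alpha[y]]

assert p_friends("ann", "bob") == 0.9  # both in cluster 0
assert p_friends("ann", "cam") == 0.2  # clusters 0 and 1
```

The compactness of the representation is apparent: a domain of n individuals needs only n latent assignments plus one parameter per pair of clusters, rather than a parameter per pair of individuals.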
Learning latent-property models with multiple observed relations has been proposed by Kemp et al. (2006), Sutskever et al. (2010), and Xu et al. (2006). To illustrate such a model, suppose we observe friendship and respect ties amongst individuals; we may then have a latent-property model as follows:

∀X, Y : P(friends(X, Y) = true | α(X) = i ∧ α(Y) = j) = p
∀X, Y : P(respect(X, Y) = true | α(X) = i ∧ α(Y) = j) = p

The main learning goal here is to infer values of α for all domain individuals that explain all observed friend and respect ties. The common assumption amongst current latent-property models of relational data is that the latent properties are sufficient to explain all observed relations, and dependencies amongst relations are thus not needed (Xu et al., 2006). Whilst these models achieve good predictive accuracy and reveal interesting latent groupings of individuals, Kemp et al. (2006) note that a desired extension entails interpretable models about relations, and point to more traditional frameworks of learning models based on first-order logic (FOL).

The traditional relational learning approach naturally handles multiple relations, where the aim is to learn a set of dependencies amongst the observed relations. The underlying motivation is that each relation can be explained by other relations in the observed set. The ILP community (De Raedt, 2008; Muggleton and De Raedt, 1994) has contributed vastly to learning dependency-based models represented in FOL. Algorithms such as Progol (Muggleton, 1995a) and Golem (Muggleton, 1992) have been successful in discovering interesting FOL programs in various domains. Examples include modelling the mutagenicity of chemical compounds based on knowledge about their molecular structure (Srinivasan et al., 1996), modelling London’s tube (train) network (Bratko and Muggleton, 1995), or learning to recognise illegal chess moves (Muggleton et al., 1989), amongst others.
The SRL community has a similar goal, but with particular emphasis on probabilistic theories and accounting for domain uncertainty. Many languages have been proposed for representing first-order knowledge with uncertainty.[1] Such languages are generally known as first-order probabilistic logics (FOPL) (Friedman et al., 1999; Getoor et al., 2002; Kersting and De Raedt, 2002; Kok and Domingos, 2005; Muggleton, 2003; Richardson and Domingos, 2006), and learning entails finding the best dependency structure over the known relations, as well as parameters that represent the uncertainty of each dependency. Consider the UW-CSE domain[2], where data about properties of students and faculty members are given, along with various relations. A possible FOPL rule for this domain is

∀X, Y : P(advised_by(X, Y) | taught_by(X, Y), stu(X), prof(Y)) = p    (1.1)

where all properties and relations in the above rule are Boolean-valued and observed.[3] The meaning of the statement is that for any pair (x, y), x is advised by y with probability p if x is a student, y is a professor, and x was taught by y. Most current FOPL models comprise a set of such rules, and are coherent joint probability models of the domain (Friedman et al., 1999; Kersting and De Raedt, 2000; Muggleton, 1995b; Poole, 1993b, 1997; Richardson and Domingos, 2006). The following section describes particular motivations for introducing latent relations, by examining the technical and philosophical issues confronted by models comprised only of observed relations when predictions are concerned.

[1] SRL is not a full extension of ILP towards probabilistic representations, as ILP deals with a greater range of dependency structures than is possible with SRL algorithms.

1.2 Relational Models and Probability Prediction

The conditional probability associated with rules such as Eq. 1.1 represents a statistic obtained about student-professor pairs.
Using such a rule for prediction means that every student-professor pair that satisfies the condition is assigned the same probability. This mode of reasoning leads to some philosophical as well as technical difficulties, outlined in this section. We start with an example.

Example 1. To treat tuberculosis (TB), Dr. Smith prescribes a long and costly course of antibiotic medication. Skin tests are performed to determine whether a patient has TB. According to his records, 53 out of 100 (53%) previous positive skin tests have correctly indicated the presence of TB (true-positive). On this basis Dr. Smith’s initial hypothesis is

I : ∀X, P(tb(X) | pos_skin_test(X)) = 0.53

which means that any patient returning a positive skin test has 0.53 probability of carrying TB. Based on this hypothesis, in prescribing medication for all positive skin test patients he expects to err in 47% (false-positive) of positive skin test cases. Dr. Smith also knows that prior vaccinations for TB trigger positive skin tests, and has observed that 40 of the positive skin tests were attributable to prior vaccinations. Accounting for this fact, he calculates that for vaccinated cases a positive skin test means a false-positive, whilst for non-vaccinated cases there are now 53 of 60 (88%) skin tests that correctly indicate the presence of TB. He thus forms a second (more specific) hypothesis

IIa : ∀X, P(tb(X) | pos_skin_test(X), vacc(X)) = 0
IIb : ∀X, P(tb(X) | pos_skin_test(X), ¬vacc(X)) = 0.88

Thus, by checking for prior vaccinations, Dr. Smith expects to at least reduce his error from 47% to 12% on treated patients. The example exposes two important issues.

[2] http://alchemy.cs.washington.edu/data/uw-cse
[3] Relations where some cases are unobserved are still referred to as observed relations; relations where all cases are unobserved are latent relations.
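The statistics in Example 1 can be checked with a few lines of arithmetic; the counts below are exactly those stated in the example.

```python
# Checking the arithmetic of Example 1.
positives = 100       # positive skin tests in Dr. Smith's records
true_positives = 53   # of these, the patient actually carried TB
vaccinated = 40       # positives attributable to prior vaccination

# Hypothesis I: condition only on the positive test.
p_I = true_positives / positives
print(p_I)  # 0.53

# Hypothesis IIb: also condition on not being vaccinated. All 53 true
# positives fall among the 60 non-vaccinated cases, so the narrower
# reference class sharpens the estimate.
non_vaccinated = positives - vaccinated
p_IIb = true_positives / non_vaccinated
print(round(p_IIb, 2))  # 0.88
```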
First, using hypothesis I and expecting 53% success (88% for hypothesis II) for all impending cases, the doctor has effectively extrapolated that all positive skin tests have a 0.53 probability (0.88 for hypothesis II) of correctly revealing the presence of TB. Using sample statistics as the probability for some proposition about an individual is called direct inference. The validity of direct inference has been the subject of a long-standing philosophical debate (see McGrew (2001) for a survey), closely related to the problem of induction (see Vickers (2009)). The basic issue, first articulated by Hume (1748), is that inductions, e.g. hypotheses I and II, lack logical justification. Namely, any deductive justification of the hypothesis implies the existence of an entailing theory that is itself subject to the same question of justification. Arguments for a deductive justification thus confront the problem of infinite regress. Arguing that inductions can be justified inductively, on the other hand, renders the argument circular. Thus, concluding the true probability of a new case being a true-positive from observed rates on past samples is an induction, and is therefore not logically justifiable.

The second issue questions whether a more specific hypothesis warrants discarding more general hypotheses. For example, observing that some new patient is not vaccinated may not be sufficient grounds for discarding hypothesis I. Hypothesis II is more specific to the query, and is expected to be more precise than I. However, narrower samples carry degraded statistical confidence. Hypothesis I, on the other hand, holds greater statistical confidence but is less precise. Selecting the most specific sample is a principle supported by philosophers such as Reichenbach (1949) and Kyburg (1974, 1983).
The key argument is that the true probability for a proposition is a matter of finding the “right” statistical sample – the right reference class – and that the right reference class is the most specific one. This argument answers both issues raised here: the right reference class provides the correct sample statistic corresponding to the true probability for the proposition in question, and so removes the need for hypotheses that are more general (less specific). In response to the problem of diminished statistical confidence associated with narrow reference classes, one can choose the most specific reference class which still yields acceptable statistical confidence (Kyburg, 1983; Reichenbach, 1949). This approach is ad hoc, however. In general, the problem of identifying the right reference class from a set of competing reference classes is known as the reference class problem (Reichenbach, 1949).

There are, however, practical concerns about whether it is possible to find the right reference class at all, given that complete data is never available. In other words, we may never observe an adequate set of attributes required to define a reference class specific enough to encapsulate the correct probability. Nonetheless, the specificity criterion highlights the general idea that learning as much as possible about individuals can help yield the correct probability. Whereas reference class methods are based on sample statistics, and thus rely solely on observed data, we argue that attempting to model latent properties of individuals can further improve accuracy. The methods for learning latent properties mentioned in Sec. 1.1 are examples of this approach. We show in Ch. 3 that reference classes alone may lead to systematic prediction errors, and that the inclusion of latent properties can rectify this bias.
The following example demonstrates the important role that the properties of individuals play, by showing that unintuitive predictions can result from using the reference class method.

Example 2. Principal Jones wishes to assess the likelihood of Timothy passing grade 9 maths. Mr. Jones proposes the following theory:

I : ∀S, C, P(pass(S, C)) = 0.77
II : ∀S, C, P(pass(S, C) | C = grade_9_math) = 0.84
III : ∀S, C, P(pass(S, C) | S = timothy) = 0.85

The specificity criterion yields II and III as valid candidates, so Mr. Jones needs to decide which is the most appropriate statistic. To decide between II and III, Mr. Jones chooses II as it is the statistically stronger reference class (since the grade 9 maths students outnumber the courses Timothy has taken), and ascribes 0.84 as the probability that Timothy will pass grade 9 maths.

In this example, choosing II is unintuitive as it effectively assumes that Timothy is a typical student, and prediction only takes into account the effect of the course. Similarly, if we choose III, then the effect of the course is marginalised. A more general question is how all of the known reference classes can be combined to form a weighted prediction. A straightforward approach would be to interpolate between I ∼ III. However, suppose Timothy is an above-average student and grade 9 mathematics is a particularly easy course; then the probability of Timothy passing grade 9 mathematics should be higher than any interpolated probability. It is unclear whether there is a principled approach to such extrapolative combination. In this work we propose to use latent properties as a way to alleviate this problem.

The preceding discussion highlights the theme that, in addition to exploring dependency structures amongst observed relations, hidden knowledge about individuals (i.e. via latent properties) can help to further improve predictions about relations.
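To illustrate the interpolation idea and its limitation, here is a hypothetical sample-size-weighted average of the three statistics in Example 2; the sample sizes are invented for illustration. Any such interpolation stays within the range of the input statistics, so it can never produce the higher probability expected when Timothy is above average and the course is easy.

```python
# Hypothetical sample-size-weighted interpolation of reference-class
# statistics I-III from Example 2 (sample sizes are invented).
stats = [
    (0.77, 5000),  # I:   all (student, course) pairs
    (0.84, 300),   # II:  grade 9 maths enrolments
    (0.85, 40),    # III: courses Timothy has taken
]

total = sum(n for _, n in stats)
p_interp = sum(p * n for p, n in stats) / total

# The interpolated value is bounded by the inputs: no convex weighting
# can extrapolate beyond max(0.77, 0.84, 0.85) = 0.85.
assert 0.77 <= p_interp <= 0.85
print(round(p_interp, 3))
```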
That is, whilst reference class methods rely on conditioning to obtain predictions and demonstrably lead to systematic errors, we support the idea that latent knowledge about individuals plays a central role in restoring accuracy. In this thesis we examine representations that incorporate latent knowledge about individuals in a model where complex dependency structures over relations are allowed, and provide algorithmic procedures for learning such representations.

1.3 Thesis Contributions and Outline

The contributions of the thesis, in relation to existing work, can be explained using Fig. 1.2.

Figure 1.2: Schemas for relational models: (top row) settings where the domain contains only a single type of individuals; (bottom row) domains with multiple types of individuals; (leftmost column) settings where only one relation is observed; (other columns) settings where multiple relations are observed. T, T′ denote different types, and h denotes a latent property whilst r(·) denotes an observed relation. The main contribution of this thesis corresponds to schemas 5 and 6 (shaded), which subsume all other schemas shown.

The left-most column of Fig. 1.2 corresponds to problem settings where only one relation is observed, represented by r(·) in the figure. Analysis of networks (Airoldi et al., 2008; Breiger et al., 1975; Chang and Blei, 2010; Handcock et al., 2007; Hofman and Wiggins, 2008; Kok and Domingos, 2007; Xu et al., 2009) as well as collaborative filtering (Hofmann and Puzicha, 1999; Koren, 2008; Marlin and Zemel, 2004; Salakhutdinov and Mnih, 2008; Ungar and Foster, 1998) typically operate in such a setting. A common approach is to model latent properties of individuals to explain the observed relation – an approach often called clustering – where the latent property is represented by h(·) in Fig. 1.2.
Aside from network analysis and collaborative filtering, similar models have been studied in SRL (Neville and Jensen, 2005; Taskar et al., 2001). A distinction between models for network analysis and models for collaborative filtering is that network models pertain to a single type of individuals (illustrated by schema 1), e.g. people, whilst collaborative filtering models pertain to two types (illustrated by schema 2), e.g. users and movies. For each type a latent property can be introduced.

Schemas 3 ∼ 8 apply in the setting where there are multiple observed relations. Traditional approaches for this setting derive from ILP and SRL (De Raedt, 2008; Getoor and Taskar, 2007; Muggleton and De Raedt, 1994; Nienhuys-Cheng and de Wolf, 1997), where the aim is to obtain the best dependency structure that explains the observed relations. This approach pertains to schemas 7 and 8, and algorithms proposed under this approach can generally handle both single- and multiple-type domains. An alternative approach is to model latent properties that simultaneously explain all observed relations, without directed dependencies amongst observed relations. Such an approach corresponds to schemas 3 and 4, and the main proposal implementing these schemas is the infinite relational model (IRM) (Kemp et al., 2006) (a similar model, the infinite hidden relational model (IHRM), was independently proposed by Xu et al. (2006)). Overall, schemas 1 ∼ 4 denote hierarchical models, whilst schemas 5 ∼ 8 do not. Schemas 5 and 6 represent a union of all other schemas in Fig. 1.2, and are to date not implemented in the literature. It is argued in Kemp et al. (2006) and Sutskever et al. (2010) that modelling dependencies amongst observed relations in addition to latent properties is a promising extension. In this thesis we contribute towards this extension by presenting a unifying representation and learning framework that captures schemas 5 and 6.
The proposed framework underpins the specific contributions of this thesis, which I outline below.

1. A unified framework for relational modelling. The main contribution of this thesis is a unifying representation and learning framework that can address schemas 5 and 6 shown in Fig. 1.2, and therefore all schemas shown. Relational models can be naturally expressed using languages based on FOL, e.g. the independent choice logic (ICL) (Poole, 1997), including latent-property models such as those of Kemp et al. (2006) and Xu et al. (2006). In this thesis we derive a simple relational probabilistic language – called the latent relational model (LRM), described in Ch. 2 – and use it to represent all of these models. The LRM captures relations and latent properties, and allows complex dependency structures over the relations and properties. Given a unifying representational language, algorithms are developed in Ch. 4 to 6 that can learn models in the language. Since the language is rich enough to express models conforming to all schemas shown, and since we can learn any model in the language, our proposed framework enables learning of models that conform to all schemas shown in Fig. 1.2. The combination of latent properties with a dependency-based representation over observed relations enables us to solve modelling problems tackled in the clustering literature (e.g. Airoldi et al. (2008); Hofmann and Puzicha (1999); Kemp et al. (2006); Xu et al. (2006)) as well as the relational learning literature (see De Raedt (2008); Getoor and Taskar (2007)). Experiments show that our learning algorithm can find models which outperform state-of-the-art latent-property models such as the IRM (Kemp et al., 2006) in probability prediction, as well as dependency-based models without latent properties. A manuscript relating to the general modelling framework proposed is yet to be submitted for publication.

2.
A formal argument for latent relational models. In Sec. 1.2 we argued for the inclusion of latent properties for better probability predictions, largely from a philosophical standpoint. Contributions from the literature have also argued for latent properties as explanations of relational data, based on intuitive and empirical arguments (Airoldi et al., 2008; Hofman and Wiggins, 2008; Hofmann and Puzicha, 1999; Kemp et al., 2006; Taskar et al., 2001; Ungar and Foster, 1998; Xu et al., 2006). In Ch. 3 we make a contribution by providing a formal argument for why modelling latent properties can help improve predictive accuracy. Empirical evidence obtained from experiments with synthetic and real-world data supports the theoretical result. The result is essentially an argument for LRMs, and a journal paper on this result has been accepted, subject to minor revisions, in the International Journal of Approximate Reasoning.

3. An approximate expectation maximisation (EM) algorithm for learning latent relational models. In Ch. 4 we present a learning algorithm for LRMs. Since learning involves estimating latent properties as well as the probability parameters of the model, the standard EM (Dempster et al., 1977) framework is appropriate. However, LRMs typically represent large, complex probabilistic models with many densely-correlated latent random variables, and learning such models is computationally intractable. Our algorithm is an approximation of EM that recovers computational tractability. Our algorithm also directly ties the computation of latent properties and dependency parameters, which is required for learning models that simultaneously represent the two. The proposed algorithm was presented at the International Workshop on Statistical Relational Learning, July 2009.

4. Learning latent relational models with unknown size of latent properties. In Ch. 5, a search-based method is proposed that extends the algorithm of Ch. 4.
The extension accounts for uncertainty over the number of values for latent properties. The proposed procedure searches over the unbounded number of values of latent properties, and bounds the error on the corresponding likelihood at each stage of the search. In turn, bounds on the posterior over the number of values are derived, allowing for averaged predictions that are also accompanied by error bounds. We show how the averaging leads to accurate predictions. Existing relational models that account for uncertainty over the number of latent property values are often known as nonparametric relational models (Kemp et al., 2006; Xu et al., 2006), and learning such representations generally relies on Monte Carlo sampling. Our search-based approach is an alternative to these existing approaches. A manuscript for the results of this chapter has yet to be submitted.

5. A dynamic programming algorithm for learning with functional relations. In Ch. 6, we address the situation where relations can probabilistically depend on functions.[4] Standard solutions provide either simple models which can predict poorly, or complex models whose number of parameters scales with the number of function values. We propose an approach that optimally partitions the range of the function to trade off these two extremes and obtain a fixed-size representation. The proposed approach yields intermediate-size representations with predictive accuracy that is competitive with, if not better than, the complex model. The results of this chapter were published in the work-in-progress track of the International Conference on Inductive Logic Programming, June 2007.

In summary, the main contribution of this thesis is a framework that unifies the representation and learning of latent properties of individuals and complex dependency structures over relations. The two paradigms have, to date, been largely parallel developments. Our framework includes the language (Ch. 2) and learning algorithms (Ch.
4 to 6) sufficient to subsume the two relational learning paradigms. Other contributions of this thesis include (i) a first theoretical argument supporting the inclusion of latent properties in relational models for better predictive accuracy (Ch. 3); (ii) an empirical demonstration that latent properties alone are not sufficient to achieve the best accuracy, and that dependencies amongst observed relations can yield further improvements (Ch. 4); (iii) an extension of our learning algorithm to account for uncertainty about the number of values latent properties can represent (Ch. 5); and finally (iv) a dynamic programming algorithm addressing the special case where functional relations appear in the language, which yields predictive accuracy surpassing standard solutions for this problem.

[4] A mapping between sets of individuals.

Chapter 2
A Unified Framework for Relational Modelling

This chapter covers the background and notation for this thesis, and introduces the main representational formalism that combines cluster-based and dependency-based representations.

2.1 Background

2.1.1 Notation

In this work we use both predicate logic and probability notations, and the two sets of notations overlap. This section develops notation that aims to resolve some confusion due to the overlaps.

Predicate Language. To begin with, constants are expressed in lower-case, e.g. joe or venus, and are used to represent individuals. A type is associated with each individual, e.g. joe is a person. We use D(τ) to represent the domain of type τ, which is the set of individuals of type τ. Types are assumed disjoint – for any pair τi ≠ τj, D(τi) ∩ D(τj) = ∅. A logical variable is written in upper-case (e.g. X or Person) and denotes some individual. A logical variable is also typed, e.g. Person denotes some member of D(τ). A relation is given by r : Ωr → Vr, where r is the name of the relation, Ωr = D(τ1) × . . .
× D(τa) is the domain of the relation, and Tr = (τ1, . . . , τa) is the type of the relation. Vr = {v1, . . . , vk} is the range of the relation – an enumerated set of values not appearing in any domain. The numbers a and k are positive integers denoting the arity and size of r; relation r is thus referred to as a k-valued a-ary relation. When a = 1, r is a unary relation; in this thesis, a unary relation is also referred to as a property. When Vr = {F, T}, where F, T are Boolean values, r is a Boolean relation. (Note that this description of a relation is more general than in standard predicate logic, as we are interested in representing multi-valued relations in addition to Boolean relations.)

A functional relation is given by f : Ωf → Rf, where f is the name of the function, Ωf = D(τ1) × . . . × D(τnf) is the domain of the function, Tf = (τ1, . . . , τnf) is the type of the function, and Rf = D(τ1) × . . . × D(τmf) is the range of the function (cf. the range of a standard relation given above). The integer mf is the arity of f. A function is a mapping between types.

An atom is an expression of the form r(σ1, . . . , σa) where each σi is either a constant or a logical variable. The types of σ1, . . . , σa must match the type of r. If all of σ1, . . . , σa are constants, r(σ1, . . . , σa) is a ground atom. A literal specifies the value of an atom, e.g. r(X1, . . . , Xa) = v where v ∈ Vr. A literal that contains no logical variables, a ground literal, is a proposition. For a Boolean relation r, the literal r(X1, . . . , Xa) = T is written simply as r(X1, . . . , Xa), and r(X1, . . . , Xa) = F is written as ¬r(X1, . . . , Xa). A literal is also a formula. Formulae with multiple literals are formed using connectives, e.g. ∧ and/or ∨. Connecting literals using only ∧ forms a conjunctive formula or conjunction, e.g. ¬pass(Student) ∧ difficulty(Course) = high. A disjunctive formula or disjunction is formed using only ∨, e.g.
¬pass(Student) ∨ difficulty(Course) = high.

A substitution is a set θ = {X1\x1, . . . , Xk\xk} where the Xi are distinct logical variables and the xi are constants. When applied to a formula f, each occurrence of Xi in f is replaced with xi. We denote the application of substitution θ to f as fθ. For example, suppose f is a(X) = u ∧ ¬b(X, Y) and θ = {X\x, Y\y}; fθ is then a(x) = u ∧ ¬b(x, y). If there are no logical variables in fθ, θ is called a grounding substitution. We also allow substitutions for (sets of) atoms, e.g. b(X, Y)θ is b(x, y), and for the set g given by {a(X), b(X, Y)}, gθ is {a(x), b(x, y)}.

Given some formula f containing logical variables X1, . . . , Xn, where each Xi has type τi, let the domain of f be Ωf = D(τ1) × . . . × D(τn). The substitution space of f, Γf, is the set of all possible grounding substitutions for f, given by

Γf = {{X1\x1, . . . , Xn\xn} : (x1, . . . , xn) ∈ Ωf}

For example, if formula f is a(X) ∧ b(X, Y), where Ωf = D(τX) × D(τY), then Γf is {{X\x, Y\y} : (x, y) ∈ Ωf}.

Probability. We are generally interested in models that express uncertainty about the knowledge they represent. Uncertainty is represented by probability, and some basic notation is given as follows. A random variable is denoted by an upper-case letter, e.g. X, which shares notation with logical variables; the distinction will be made explicit when necessary. The probability distribution for a random variable X is written as p(X). The probability of a random event X = x is normally given by P(X = x), but will be abbreviated to P(x) in this work. Sets of random variables are expressed in bold, e.g. X, and similarly for instantiations, e.g. x. Joint probabilities of sets of events are written as P(X = x), or P(x) for short. All random variables are assumed discrete in this work unless stated otherwise. A univariate distribution is expressed in lower-case, e.g. p(X), and likewise p(X) for a multivariate distribution over a set of variables.
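Returning to the predicate-language notation above, the substitution space Γf can be enumerated mechanically. Below is a minimal Python sketch under assumed toy domains (the variable names and domains are illustrative, not from the thesis):

```python
from itertools import product

# Toy typed domains: D(tau_X) and D(tau_Y). Names are illustrative.
domains = {"X": ["ann", "bob"], "Y": ["c1", "c2", "c3"]}

def substitution_space(variables):
    """Enumerate Gamma_f: every grounding substitution {X1\\x1, ..., Xn\\xn}
    for the given logical variables, one constant per variable."""
    names = list(variables)
    return [dict(zip(names, combo))
            for combo in product(*(domains[v] for v in names))]

gamma = substitution_space(["X", "Y"])
print(len(gamma))  # |D(tau_X)| * |D(tau_Y)| = 6
print(gamma[0])    # {'X': 'ann', 'Y': 'c1'}
```

Counts such as Eq. 2.2 in the next subsection amount to iterating over exactly this set and testing entailment of each grounding against the database.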
Models used in this work are relational probabilistic models whose random variables correspond to ground atoms of the language. Random variables and ground atoms are used interchangeably. For example, the notation P(a(x) = v) may be used to denote the probability of the random event a(x) = v, or P(a(X) = v) to denote the probability of the random event a(X) = v where X denotes some individual.

2.1.2 Relational Data

A dataset for relation r is a non-empty set Dr = {d1, . . . , dm}. Each di is a tuple of the form ⟨x1, . . . , xa, v⟩ where (x1, . . . , xa) ∈ Ωr and v ∈ Vr. If ⟨x1, . . . , xa, v⟩ ∈ Dr and ⟨x1, . . . , xa, v′⟩ ∈ Dr, then v = v′. A database is a set of datasets, with no more than one dataset per relation. The following defines what a database entails:

(i) D |= T
(ii) D |= (r(x1, . . . , xa) = v) iff ⟨x1, . . . , xa, v⟩ ∈ Dr
(iii) D |= α ∧ β iff D |= α and D |= β
(iv) D |= α ∨ β iff D |= α or D |= β    (2.1)

where α, β are formulae. We say that r is an observed relation if Dr ≠ ∅, and a latent relation otherwise.

Counts can be obtained from a database D via logical formulae. The count of cases satisfying formula f with respect to D is given by

#D(f) = Σ_{θ ∈ Γf} I(D |= fθ)    (2.2)

where I(s) is a characteristic function, returning 1 if s holds and 0 otherwise.

We also define soft counts for a given formula f = l1 ∧ . . . ∧ ln, where each li has the form ri(X1, . . . , Xni) = vi. Let Q be some marginal probability distribution defined for all liθ, θ ∈ Γf, for which ri is not observed or D ⊭ liθ (namely, Q defines a distribution over the missing cases). A soft characteristic function for the case liθ = (ri(x) = v) is given by

Î(ri(x) = v, Q) =
  1                if ri is observed and D |= (ri(x) = v)
  0                if ri is observed and D |= (ri(x) = v′) with v′ ≠ v
  Q(ri(x) = v)     otherwise    (2.3)

Given a logical formula f = f1 ∧ . . .
∧ fn, a substitution θ ∈ Γf, and some probability function Q defined for all fθ where θ ∈ Γf, a soft characteristic function for f is given by

Ĩ(fθ, Q) = Î(f1θ, Q) · Î(f2θ, Q) · · · Î(fnθ, Q)

A soft count is thus

#̃D(f, Q) = Σ_{θ ∈ Γf} Ĩ(fθ, Q)

If all conjuncts of f are observed, then the soft count is equivalent to the normal count (Eq. 2.2).

2.1.3 Loss Functions

As alluded to in the introduction (Ch. 1), we are interested in models that will yield the correct probability of random events. In statistical terminology, this generally refers to probability estimators whose error is zero in the limit of infinite data. A key measure of accuracy in this work is empirical loss, also called empirical risk, which is commonly used for the evaluation of statistical estimators. In general, the true probability of random quantities cannot be observed, but empirical loss can be used to quantify the discrepancy between the estimated probability and the true probability, where true probability is defined as the expected value of data labels in the limit of infinite data. We assume binary-valued random quantities, as definitions of loss for multi-valued variables remain an open problem. Rather than specifying the exact loss functions we will use, we describe the general form of empirical loss for binary probability estimation, as required for our results in Ch. 6. Our definitions are sourced from Shen (2005).

Given some observed label y ∈ {0, 1} (a Bernoulli observation), we are interested in estimating the probability γ = P(y = 1). Let ŷ ∈ [0, 1] be our estimate for γ. Let the discrepancy of estimator ŷ for the cases y = 1 and y = 0 be non-negative (e.g. the 1-norm |y − ŷ|), and define the partial losses l1,ŷ and l0,ŷ, respectively, to be increasing functions of the discrepancy. The point-wise empirical loss incurred by ŷ is

L(y | ŷ) ≡ y·l1,ŷ + (1 − y)·l0,ŷ    (2.4)

For a set of labels y = {y1, . . .
, yn}, the point-wise expected loss or empirical loss of ŷ is then

L(ŷ, γ) = Ey[L(y | ŷ)] = γ·l1,ŷ + (1 − γ)·l0,ŷ    (2.5)

Choosing l1,ŷ and l0,ŷ to be convex functions renders Eq. 2.5 convex. It further obeys Fisher consistency, namely that ŷ = γ yields the unique minimum of Eq. 2.5. A common choice is l1,ŷ = −log(ŷ) and l0,ŷ = −log(1 − ŷ), which leads to the commonly-used log-loss function. Log-loss corresponds to negative log-likelihood, which we use for evaluating the predictive accuracy of our models. Our definitions here are general, as required by the theoretical results in Ch. 6.

The loss functions we consider are computed assuming that the test cases in y are independent samples. Relational learning has emphasised that such independence assumptions are not appropriate for relational data due to intrinsic correlations amongst data points (Getoor and Taskar, 2007). However, we note that with a relational predictor which accounts for correlations amongst the data, we may proceed to make predictions for each test case independently of other test data; thus the use of loss functions of the form of Eq. 2.5 is valid.

2.2 Relational Models

The following covers some well-known languages for expressing relational models, as well as some choices one faces in relational modelling.

2.2.1 Languages

First-order Logic. First-order logic (FOL) has been the standard basis for representing relational models. FOL syntax contains symbols that naturally represent the primitive elements of relational domains: constants denote individuals, variables can denote sets of individuals, and functional and relational predicates can represent mappings and relational ties between individuals. In addition, complex theories can be formed using formulae constructed from the available symbols. For example, rules in kinship domains may be captured, e.g.

mother(X, Y) ← female(X) ∧ parent(X, Y).
grandmother(X, Z) ← mother(X, Y) ∧ parent(Y, Z).
Moreover, FOL can even represent executable programs, e.g. a member function that tests whether object X is a member of List by recursing through the list:

member(X, List) ← equal(X, headOf(List)).
member(X, List) ← remove_head(List, newList) ∧ member(X, newList).

The flexibility of FOL for expressing a wide range of knowledge is important for modelling relational domains, where complex knowledge structures often arise, e.g. in the biology domain (Srinivasan et al., 1996). A major shortcoming of FOL, however, is that it does not contain the syntax and semantics necessary to capture quantitative uncertainty, such as that arising from domain noise.

Bayesian Networks

An important proposal, due to Pearl (1988), is the language of belief networks or Bayesian networks. Bayesian networks integrate logical abduction with probability theory, and specify a compact representation of the joint distribution over a set of propositions (random variables). For a set of random variables X = {X1, . . . , Xn}, a Bayesian network is a tuple ⟨X, G, Θ⟩, where G is a directed acyclic graph over X and Θ are the parameters of the network. The network specifies a joint distribution

p(X) = Π_i p(Xi | par(Xi))     (2.6)

where par(X) denotes the set of parents (direct ancestors) of X in the graph G, and the parameters of each product term (a conditional distribution) are included in Θ. Bayesian networks are thus compact representations of joint probability distributions, factorising them into products of simpler distributions. Figure 2.1 provides a simple illustration, with the following conditional probability tables:

P(cloudy=T) = 0.5, P(cloudy=F) = 0.5

cloudy | P(sprinkler=T) P(sprinkler=F)
T      | 0.1            0.9
F      | 0.5            0.5

cloudy | P(rain=T) P(rain=F)
T      | 0.8       0.2
F      | 0.2       0.8

sprinkler rain | P(wetgrass=T) P(wetgrass=F)
F         F    | 0.0           1.0
T         F    | 0.9           0.1
F         T    | 0.9           0.1
T         T    | 0.99          0.01

Figure 2.1: A simple Bayesian network for the 'sprinkler' domain.
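The factorisation in Eq. 2.6 can be exercised directly on the sprinkler network. A short sketch that computes the joint from the CPTs of Fig. 2.1 and marginalises by brute-force enumeration (variable order: cloudy, sprinkler, rain, wetgrass):

```python
from itertools import product

# CPTs for the sprinkler network of Fig. 2.1.
p_cloudy = 0.5
p_sprinkler = {True: 0.1, False: 0.5}   # P(sprinkler=T | cloudy)
p_rain = {True: 0.8, False: 0.2}        # P(rain=T | cloudy)
p_wet = {(False, False): 0.0, (True, False): 0.9,
         (False, True): 0.9, (True, True): 0.99}  # P(wetgrass=T | sprinkler, rain)

def joint(c, s, r, w):
    # Eq. 2.6: the joint is the product of each node's CPT entry given its parents.
    pc = p_cloudy if c else 1 - p_cloudy
    ps = p_sprinkler[c] if s else 1 - p_sprinkler[c]
    pr = p_rain[c] if r else 1 - p_rain[c]
    pw = p_wet[(s, r)] if w else 1 - p_wet[(s, r)]
    return pc * ps * pr * pw

# Marginalise over the other variables to obtain P(wetgrass=T).
p_wet_true = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))
```

Because each factor is a locally normalised conditional distribution, the joint sums to one without any global normalisation step.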
(From http://bnt.googlecode.com/svn/trunk/docs/usage.html)

Each conditional probability table (CPT) defines a probabilistic dependency for a node in the network, containing the probabilities of the node conditioned on the states of its parent variables. Each table entry can in turn be expressed as a propositional rule with a probability annotation. For example, the 'wetgrass' table has the equivalent rule-based representation

0.0 : wetgrass ← ¬sprinkler ∧ ¬rain
0.9 : wetgrass ← sprinkler ∧ ¬rain
0.9 : wetgrass ← ¬sprinkler ∧ rain
0.99 : wetgrass ← sprinkler ∧ rain     (2.7)

and similarly for ¬wetgrass.

Despite its successes in modelling many real-world problems, the propositional nature of Bayesian networks has drawbacks from the standpoint of relational modelling. Primarily, representing each property and relational tie as a proposition leads to a representation size that is polynomial in the number of individuals in the domain, and the network quickly becomes unmanageable for subsequent learning and reasoning tasks. Consider a relational domain with the relations likes(X, Y) and funny(X), where X and Y denote persons; a corresponding Bayesian network may look like the one in Fig. 2.2. Although Bayesian networks are not practical for relational domains in this manner, they are crucial as the underlying formalism for the relational probabilistic languages developed later. The following section outlines the main contributions.

Probabilistic Relational Languages

Many languages have been proposed which integrate FOL and probability, allowing one to state first-order rules and dependencies as well as probabilities (Jaeger, 1997; Kersting and De Raedt, 2000; Milch et al., 2005; Muggleton, 1995b; Poole, 1993b, 1997, 2000; Richardson and Domingos, 2006; Sato and Kameya, 1997). These languages are collectively referred to as FOPLs. A notable early contribution was due to Halpern (1990), which allows statistical statements such as

∀X, Y   0.76 : friends(X, Y) ← smokes(X) ∧ smokes(Y).
(2.8)

where 0.76 denotes a conditional probability.

Figure 2.2: A Bayesian network for a simple relational domain modelling the relations likes(X, Y) and funny(X), where X and Y denote persons.

Subsequent proposals exploit the conditional independence assumptions of Bayesian networks, leading to FOPLs that are compact representations of Bayesian networks and succinctly represent joint distributions over all propositions entailed by the model. The first to demonstrate this combination was probabilistic Horn abduction (PHA) (Poole, 1993b), which evolved into ICL (Poole, 1997, 2000). PHA has a logic program component containing Horn clauses – where each clause specifies a set of logical choices – and a probabilistic component that specifies the probabilistic consequences of these choices. The probabilistic component can be understood as a collection of dependencies that cover possible combinations of choices. For instance, a PHA theory can be used to express the propositional model corresponding to Eq. 2.7, containing dependencies for wetgrass conditioned on choices for sprinkler and rain.¹ PRISM (Sato and Kameya, 1997) is a closely-related language with similar semantics. Other languages that also directly model dependencies amongst (first-order) atoms are Bayesian logic programs (BLPs) (Kersting and De Raedt, 2000), relational Bayesian networks (RBNs) (Jaeger, 1997) and probabilistic relational models (PRMs) (Friedman et al., 1999). In contrast to PHA/ICL and PRISM, these languages do not represent an explicit logic program component. Whilst the aforementioned languages have the probabilistic semantics of Bayesian networks, a well-known exception to this class is the Markov logic network (MLN), which compactly represents Markov networks (undirected probabilistic models) (Richardson and Domingos, 2006).

¹ Note that whilst Eq. 2.7 is defined for propositions, choices can be made in the same way for first-order atoms.
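To make the undirected semantics concrete, the following sketch grounds a single MLN-style formula over one pair of individuals and normalises globally; the weight and the single-grounding simplification are illustrative assumptions, not the general MLN definition:

```python
import math
from itertools import product

# One formula with weight 1.12: friends(a,b) <=> smokes(a) ∧ smokes(b).
# For illustration we ground it over a single pair (a, b), giving one grounding.
w = 1.12

def n_true_groundings(sa, sb, f):
    # 1 if the (single) ground formula holds in this world, else 0.
    return 1 if f == (sa and sb) else 0

def score(world):
    # Unnormalised weight of a world: exp(w * number of true groundings).
    return math.exp(w * n_true_groundings(*world))

worlds = list(product([True, False], repeat=3))  # (smokes(a), smokes(b), friends(a,b))
Z = sum(score(wd) for wd in worlds)              # global normalisation constant
prob = {wd: score(wd) / Z for wd in worlds}
```

The normalisation constant Z is computed over every world, which is why each ground proposition in a Markov network implicates all of the others.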
MLNs can be seen as a network of relations in FOL, where dependency links have no directionality. Associated with each link is a weight parameter that indicates the strength of the dependency. For instance, an MLN for the smoking example is given by

∀X ∀Y   1.12 : friends(X, Y) ⇔ smokes(X) ∧ smokes(Y).     (2.9)

where 1.12 is a real number, not restricted to being a probability as is the case for parameters in Bayesian networks.

A general point of difference between directed models and undirected models is local interpretability. For example, from Eq. 2.8, the proposition friends(x, y) for any pair (x, y) only depends on the propositions smokes(x) and smokes(y). In the undirected case (Eq. 2.9), the proposition friends(x, y) necessarily implicates every other proposition in the ground network. This is because for directed models, normalisation of conditional probabilities is available locally, whilst normalisation is global for undirected networks (see Koller and Friedman (2009)). Although local interpretability is a convenient property, there are examples (e.g. in computer vision) where the direction of statistical influence amongst propositions is unclear, and which are reasonable to model as undirected networks.

Another interesting exception is BLOG (Milch et al., 2005), a language combining FOL and probability that covers uncertainty about the number of individuals in the domain. In this work we focus on the case where the number of domain individuals is fixed and known. Interested readers are referred to (Milch et al., 2005) for details. For a good recent survey of FOPLs see Milch and Russell (2007).

2.2.2 Dimensions of Relational Models

Having described languages that represent relational models, we now discuss some relevant modelling choices faced in this work.

Types

Already built into our notation is the notion of types. We have assumed a simple type system, where domain individuals are partitioned into disjoint sets, each set having a designated type, e.g.
animal or book. A type can be regarded as a property of individuals, and can in principle be expressed as a unary relation in the predicate language; e.g. the fact that x is a bird can be expressed by is_bird(x). However, in domains with clear typing, a built-in type system helps avoid the overhead of type-checking when dealing with domain individuals, and thus expedites learning and reasoning. A notable example of a typed language is PRMs (Friedman et al., 1999), whose primitive components include the notion of a class, which is essentially a data table recording properties of individuals of a single type.

The simple type system may be inappropriate for many natural domains, e.g. when types are organised in a hierarchy. For example, bird is a sub-type of animal. This is particularly true in systems which organise entities into classes and sub-classes. Probabilistic reasoning that accounts for hierarchical type structures has been explored in (desJardins et al., 2007; Sharma and Poole, 2003). In addition to hierarchical domains, it can happen that domain individuals belong to multiple types. For example, a graphic novel is generally regarded as both a comic and a novel. Membership in multiple type domains violates the disjointness assumption of our simple type structure. To circumvent this problem, a natural solution in an evolving knowledge system is to update the ontology by introducing non-empty type intersections as new types, thereby recovering disjointness. For the current example, graphic novel can be introduced as a new domain type. In this work we typically consider domains where a simple disjoint type system suffices. However, our models and learning setting do not preclude the introduction of more complex types in the language.

Uncertainty

There are several types of uncertainty that one encounters in real-world domains.
The following discusses key types of uncertainty, and how they may be represented in the languages listed above. In the smoking example (Eq. 2.8), the rule itself, without the probability annotation, is bound to fail on subsets of the population. This is because there is noise in the domain, and/or because our knowledge about how friendships are formed is incomplete. Thus, to obtain a more accurate representation of the domain, we also aim to represent the noise. The FOPLs described in Sec. 2.2.1 provide a coherent way to do so.

There are more exotic forms of uncertainty, such as identity uncertainty (Pasula et al., 2002) and existence uncertainty (Milch et al., 2005), that arise frequently. Identity uncertainty arises when different individuals have the same identifier, e.g. two people in the phone book having the same name. In this work, however, we make the common assumption that domain individuals are uniquely identified – i.e. the unique names assumption (Russell and Norvig, 2009) – in order to avoid identity uncertainty. Existence uncertainty (Getoor et al., 2002; Milch et al., 2005) occurs when the number of domain individuals is not known in advance, which leads to questions about whether observations are attributes of known or previously unknown individuals. In this work, we make the standard assumption that the domain is known and fixed. These kinds of uncertainty provide interesting challenges that are beyond the scope of this thesis, and are more appropriate for future investigations.

Dependencies and Rules

Covered in our discussion of Bayesian networks in Sec. 2.2.1 is the distinction between rules (specifications of logical choices) and dependencies, which encompass all possible choices. The main reason for distinguishing dependencies from rules is that the choice impacts how learning may be done. For example, rules such as Eq. 2.8 – that friendships are due to smoking – can be learned using data corresponding to pairs of smokers.
ILP algorithms such as FOIL (Quinlan, 1990) directly induce rules that cover as many positive data cases and as few negative cases as possible. ILP algorithms can be used to learn dependencies if all rules entailed by the dependency are supported by examples. If there is no data corresponding to non-smokers, for example, it may be argued that there is insufficient reason to learn rules for non-smokers. By extension, this brings into question the role of dependencies, where some of the rules they encompass have no empirical support. The lack of data for non-smokers, however, does not imply that there are no non-smokers in the world, since our observations are generally incomplete. On these grounds, we consider models that represent dependencies, like BLPs (Kersting and De Raedt, 2000) and PRMs (Friedman et al., 1999), rather than ones based on rules.

Directed and Undirected Models

FOPLs can represent directed or undirected models, and the difference between the two was illustrated in Sec. 2.2.1. In this work we choose directed models, as there are often interesting causal influences that are best captured by them. With the help of intervention data (Pearl, 2009), one can test directed hypotheses and discover directions of influence. Undirected models, in contrast, abstract over the direction of influence and are best suited to modelling correlations. That said, whether directed or undirected models are preferable depends on the problem at hand; e.g. it may not be obvious whether smoking is causal to friendship or vice versa (Singla and Domingos, 2008). Directed models yield simple probabilistic explanations for propositions. Of particular interest to this work is the use of latent unary relations to explain observed relations in a directed way.
Latent Properties and Clusters

Many relational models are known as clustering models, in which a hidden cluster structure over individuals is represented instead of a dependency structure over the observed relations. As mentioned in Ch. 1, proposals from network analysis and collaborative filtering, as well as recent works in SRL, e.g. (Airoldi et al., 2008; Hofmann and Puzicha, 1999; Kemp et al., 2006; Ungar and Foster, 1998; Xu et al., 2006), have investigated the clustering approach. A clustering model for a single relation, such as rating(User, Movie), consists of clusters of users and clusters of movies, together with a specification of how the value of rating(x, y) probabilistically depends on the clusters to which x and y are assigned. The number of clusters and their membership are probabilistically inferred from the observed rating ties. A clustering model for multiple relations applies the same principle.

In this work we represent clusters of individuals via latent properties. Namely, each domain individual x has some latent property α(x), where α is a latent unary relation (see Sec. 2.1). The value of α(x) is equivalent to a cluster assignment, e.g. α(x) = 1 means that x is a member of cluster 1. By representing clusters in the predicate language, we can define a clustering model in the predicate language, and directly augment the model with probabilistic dependencies amongst relations to achieve our unifying representation for relational data (see Ch. 1, Sec. 1.3). Such a model is formally given in Sec. 2.3.

2.3 A Unifying Representation

Here we define latent relational models (LRMs), our language for representing relational models. The language of LRMs is a simplification of the relational probabilistic language independent choice logic (ICL) proposed by Poole (1997)², in which the logic program component of ICL is omitted. As such, the LRM language is strictly a modelling language and not a programming language like ICL.

Definition 1.
Given an input database D and a vocabulary Σ = ⟨Ro ∪ Rl, F, V⟩, where Ro and Rl are sets of observed and latent relations with respect to D respectively, F is a set of functions, and V is a set of logical variables, an LRM is a triple L = ⟨A, G, Θ⟩, where

– A is a set of atoms formed using Σ, e.g. r(X, Y) where r ∈ Ro and X, Y ∈ V. We assume that the variables appearing in each atom are local to that atom.

– G is a directed acyclic graph over A, specified as a set of arcs of the form c → d, where c, d ∈ A.

– Θ contains the conditional probability parameters for each atom in A. Each θ ∈ Θ is a set of parameters.

The parents of an atom a ∈ A according to G are defined by

ParG(a) = {π : π ∈ A, (π → a) ∈ G}

Each parameter θ ∈ Θ characterises a conditional probability distribution. For example, the parameter θr(X,Y) ∈ Θ describes the conditional distribution p(r(X, Y) | ParG(r(X, Y))).

An LRM for a friendship domain is illustrated below in Fig. 2.3. The LRM contains a relation friends(X, Y), indicating whether X and Y are friends, and smoking(X), indicating whether individual X smokes; there is a parameter for each relation. The LRM is illustrated in plate notation (Buntine, 1994), in which each rectangle – or plate – represents a collection of plate instances. Each plate contains an index variable, and there is a plate instance for each value of the index variable. Indexed symbols in each plate represent a set of random variables, one for each plate instance. For example, in Fig. 2.3, there is a random variable smoking(x) for each individual x, and similarly for smoking(y). Index symbols appearing in intersections of plates yield one random variable per element of the cross-product space of the intersecting plates. Such symbols can be regarded as relations, e.g. there is a friends(x, y) variable for every (x, y) pair in the domain.

² Languages such as BLP (Kersting and De Raedt, 2000) and PRISM (Sato and Kameya, 1997) bear strong similarities to ICL.
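A minimal in-memory rendering of Definition 1 for the smoking–friends example; the 0.76 entry echoes Eq. 2.8, but all other parameter values below are invented placeholders:

```python
# The smoking-friends LRM as plain data structures: L = (A, G, Θ).
atoms = ["smokes(X)", "smokes(Y)", "friends(X,Y)"]
arcs = [("smokes(X)", "friends(X,Y)"),
        ("smokes(Y)", "friends(X,Y)")]  # G: each pair is an arc parent -> child
theta = {
    "smokes(X)": 0.3,  # P(smokes = true); illustrative value
    "friends(X,Y)": {  # P(friends = true | smokes(X), smokes(Y))
        (True, True): 0.76, (True, False): 0.1,
        (False, True): 0.1, (False, False): 0.05,
    },
}

def parents(atom):
    # Par_G(a) = {pi : (pi -> a) in G}
    return [p for (p, c) in arcs if c == atom]
```

Parameter sharing then means that every ground instance friends(x, y) reuses the single table stored under "friends(X,Y)".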
LRMs can be seen as a generalisation of plate models. In the grounding, an LRM defines a joint distribution over all ground atoms of each relation in R. First we define the parents of a ground atom.

Definition 2. Let vars({a, ParG(a)}) = {X1, . . . , Xn} be the logical variables appearing in a and its parents, and let σ = {X1\x1, . . . , Xn\xn} be a corresponding substitution; then we define the parents of a ground atom ga of a as par(ga) = σParG(a).

Figure 2.3: A Bayesian network (in plate notation) representing a ground LRM in the smoking-friends domain. Shared parameters for smokes and friends are given by θsmokes and θfriends respectively.

To define the joint distribution represented by an LRM, where A are relations, the set of all ground atoms obtained by grounding the relations in A is given by

SA = ∪_{A∈A} ∪_{σ∈ΓA} σA

where ΓA is the substitution space of relation A. The joint distribution over all ground atoms of A is

p(SA) = Π_{a∈SA} p(a | par(a))     (2.10)

A key property of FOPLs is that parameters are shared amongst ground atoms of the same relation/function, and this property is inherited by the LRM. Suppose the atom r(X, Y) appears in an LRM, and that p(r(X, Y) | ParG(r(X, Y))) is parametrised by θr; then all ground instances r(x, y) of r(X, Y) will have conditional probability distributions p(r(x, y) | par(r(x, y))) parametrised by θr. The LRM addresses our representational needs, as it directly allows one to model observed and latent relations, and complex structure over the relations.

2.4 Learning Problems

This section describes different relational learning settings as a function of the input to the learning process. Given different settings, existing proposals define various learning problems, which we classify using three main characteristics: (i) the number of observed relations given, (ii) the presence of hidden relations, and (iii) a graph structure over relations.
These dimensions allow us to summarise the key types of relational learning problems addressed in the literature. A generic description of learning inputs follows.

Definition 3. Inputs

* A database D over datasets Dr1, . . . , Drn, where each relation ri has a domain composed of elements of D = {D(τ1), . . . , D(τp)}.

* A vocabulary containing:

– A set of logical variables X1, . . . , Xl, where each variable has type τ ∈ {τ1, . . . , τp}.

– A set of relational symbols Robsv = {r1, . . . , rn} corresponding to the observed relations according to D. Each r ∈ Robsv has domain Ωr = D(τ1) × . . . × D(τnr), where D(τi) ∈ D.

– A set of relational symbols Rlatn = {α1, . . . , αm} representing latent relations. Each α ∈ Rlatn has domain Ωα = D(τ1) × . . . × D(τnα), where D(τi) ∈ D.

– A set of function symbols F = {f1, . . . , fk}. Each f ∈ F has domain Ωf = D(τ1) × . . . × D(τnf), where D(τi) ∈ D.

The generic setting defined here is intended to be general enough to cover those tackled in the literature, with the following caveats. Firstly, for reasons covered in Sec. 2.2.2, we aim to learn models containing only directed dependencies. Secondly, the ranges of the latent relations are assumed to be known and fixed, thereby fixing the number of parameters of the model. It is generally the case, however, that the best number of parameters for the model is not known a priori. Allowing the number of parameters to be uncertain over an unbounded range leads to a non-parametric model, e.g. (Kemp et al., 2006; Xu et al., 2006). For the purpose of explaining learning problems, we focus on parametric models.

2.4.1 Single Observed Relation

Proposed models for analysing networks and user preferences in collaborative filtering often consider domains where only a single relation is observed.
This setting elicits trivial dependency structures amongst relations; learning clusters of individuals to explain the observed relation has therefore been the more common approach (Airoldi et al., 2008; Hofmann and Puzicha, 1999; Kok and Domingos, 2007; Neville and Jensen, 2005; Taskar et al., 2001; Ungar and Foster, 1998; Xu et al., 2009). Clustering models for this setting adhere to schemas 1 and 2 in Fig. 1.2 in Ch. 1. As explained in Sec. 2.2.2, clustering can be achieved by introducing latent properties (latent unary relations) to represent assignments of individuals to clusters. With input data for the sole observed relation r(X, Y) denoted by Dr – assuming for now that X and Y are of the same type τ – and letting α : D(τ) → Vα be a latent unary relation, where D(τ) and Vα are known and fixed, we can represent a clustering model as the LRM (Sec. 2.3) L = ⟨A, G, Θ⟩, where

A = {r(X, Y), α(X), α(Y)}
G = {α(X) → r(X, Y), α(Y) → r(X, Y)}
Θ = {θr, θα}

For the case where X is of type τ1 and Y is of type τ2 (e.g. following schema 2 in Fig. 1.2), we introduce a separate latent property for each type, i.e. α1 : D(τ1) → Vα1 and α2 : D(τ2) → Vα2. An LRM for this case is L = ⟨A, G, Θ⟩, where

A = {r(X, Y), α1(X), α2(Y)}
G = {α1(X) → r(X, Y), α2(Y) → r(X, Y)}
Θ = {θr, θα1, θα2}

Figure 2.4 provides an illustration of this LRM. (Note that for the setting where X and Y are of the same type, a similar illustration can be obtained by appropriately replacing the domains and latent properties.)

Figure 2.4: A plate representation (left) of a Bayesian network modelling one observed relation r(X, Y) and two latent unary relations α1(X) and α2(Y), where r(X, Y) depends on both α1(X) and α2(Y). An unrolled (ground) network is shown on the right, with parameters omitted. Shaded nodes are observations.

An instructive way to view the effects of clustering is that it induces partitions in the space of relational ties.
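The sorting of ties into blocks is easy to sketch: given cluster assignments for the row and column individuals, each observed tie falls into a (row-cluster, column-cluster) block. The individuals, ties, and assignments below are invented for illustration:

```python
# Invented toy data: a binary relation r(X, Y) and cluster assignments.
alpha1 = {"x1": 0, "x2": 1, "x3": 0}   # row-individual clusters
alpha2 = {"y1": 1, "y2": 0}            # column-individual clusters
ties = {("x1", "y1"): 1, ("x1", "y2"): 0,
        ("x2", "y1"): 0, ("x2", "y2"): 1,
        ("x3", "y1"): 1, ("x3", "y2"): 0}

# Group each observed tie into its (row-cluster, column-cluster) block.
blocks = {}
for (x, y), v in ties.items():
    blocks.setdefault((alpha1[x], alpha2[y]), []).append(v)

# A per-block statistic, e.g. the fraction of positive ties in each block.
block_means = {b: sum(vs) / len(vs) for b, vs in blocks.items()}
```

A per-block statistic such as the mean then corresponds to an entry of θr conditioned on the latent cluster values.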
Figure 2.5 below illustrates how the observed ties for r(X, Y) can be sorted into "blocks" by clustering domain individuals. The presence of latent properties introduces sets of latent random variables in the ground model, at least one per individual (see Fig. 2.4), and learning necessitates probabilistic inference to calculate the marginal distribution over the latent variables. The inference problem has complexity exponential in the number of latent random variables, so exact inference becomes intractable for all but the smallest domains. Approximate inference, e.g. variational Bayes or Markov chain Monte Carlo (MCMC) inference (see Koller and Friedman (2009)), is then necessary. In Ch. 4 we provide one such method for inference that caters to our general desideratum of learning rich theories. This scenario holds for existing clustering models: the classification/clustering framework of Taskar et al. (2001) based on the PRM language, collaborative filtering models (Hofmann and Puzicha, 1999; Ungar and Foster, 1998), as well as network analysis models (Airoldi et al., 2008).

Figure 2.5: An illustration of the partitioning of relational data induced by clustering domain individuals (indices along the dimensions of the relation). Data for relation r(X, Y) is partitioned by clusterings represented by unary relations α1(X) and α2(Y).

An alternative approach for modelling properties of individuals is based on matrix factorization (Salakhutdinov and Mnih, 2008). The general idea is that the observed relation r(X, Y), with domain D(τ1) × D(τ2), is represented as an N × M matrix and expressed as a product of two feature matrices Aᵀ and B. Aᵀ is an N × D matrix in which each row is a coefficient vector for some individual in D(τ1), and B is a D × M matrix in which each column is a coefficient vector for some individual in D(τ2). The learning goal is to find the best D-rank approximation of r(X, Y).
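The AᵀB representation can be written out directly; a rank-1 (D = 1) sketch for a 3 × 2 matrix with invented coefficients:

```python
# A D = 1 factorisation R ≈ Aᵀ B of a toy 3 x 2 matrix; coefficients invented.
A = [[1.0, 2.0, 3.0]]   # D x N: one coefficient column per row individual
B = [[2.0, 0.5]]        # D x M: one coefficient column per column individual

def reconstruct(A, B):
    # Entry (i, j) of Aᵀ B is the inner product of column i of A and column j of B.
    D, N, M = len(A), len(A[0]), len(B[0])
    return [[sum(A[d][i] * B[d][j] for d in range(D)) for j in range(M)]
            for i in range(N)]

R_hat = reconstruct(A, B)  # the N x M rank-D approximation
```

With D ≪ min(N, M), the factorisation stores N·D + D·M coefficients instead of N·M entries, which is the source of its compactness.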
The matrix factorization formulation has led to state-of-the-art predictive performance, as demonstrated in the Netflix collaborative filtering competition³. Other proposals based on matrix factorization have also demonstrated good predictive accuracy (Marlin and Zemel, 2004; Rennie and Srebro, 2005).

So far we have assumed that the range of values of a latent property is known and fixed a priori, i.e. a fixed number of clusters of individuals. Proposals such as the IRM (Kemp et al., 2006), the IHRM (Xu et al., 2006), and Bayesian clustered tensor factorisation (BCTF) (Sutskever et al., 2010) (based on the IRM) instead place a distribution over the number of clusters. Specifically, these models embed the Chinese restaurant process (CRP) (Aldous, 1983) – a nonparametric prior distribution – as the prior over an unbounded number of clusters. Using MCMC inference, the number of clusters and their membership are generated stochastically. With the exception of the IRM, IHRM, and BCTF, the models discussed in this section do not naturally handle multiple observed relations. We discuss this scenario next.
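The CRP prior mentioned above is straightforward to sample from: customer i joins an existing cluster with probability proportional to its current size, or opens a new cluster with probability proportional to a concentration parameter α. A self-contained sketch:

```python
import random

def crp_partition(n, alpha, seed=0):
    """Sample a partition of n individuals from a Chinese restaurant process."""
    rng = random.Random(seed)
    sizes = []       # current cluster sizes
    assignment = []  # cluster index for each individual
    for _ in range(n):
        weights = sizes + [alpha]  # existing clusters, plus a new one
        u = rng.random() * sum(weights)
        k, acc = 0, 0.0
        for k, wk in enumerate(weights):
            acc += wk
            if u < acc:
                break
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        assignment.append(k)
    return assignment, sizes

assignment, sizes = crp_partition(20, alpha=1.0)
```

The number of clusters produced is unbounded a priori, which is what makes CRP-based models non-parametric.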
Modelling Dependencies Amongst Observed Relations

With more than one observed relation, the traditional approach studied in ILP and SRL can be described as searching for the best dependency structure over the observed relations (see schemas 7 and 8 in Fig. 1.2 of Ch. 1). Latent properties or relations are generally not considered, except in the context of logical predicate invention, where new predicates are induced by inverting logical entailment or logical implication (Kramer, 1995; Muggleton, 1994). Statistical counterparts of predicate invention amount to clustering (Craven and Slattery, 2001; Kok and Domingos, 2007), which is described in the next section. The structure of dependencies can be learned by standard ILP/SRL methods (see De Raedt (2008); Friedman et al. (1999); Getoor et al. (2002); Muggleton (2003); Muggleton and De Raedt (1994)). Figure 2.6 shows an example model containing two relations, s(X, Y) and r(X, Y), and a directed dependency from s to r.

Figure 2.6: A plate representation (left) of a Bayesian network modelling two relations r(X, Y) and s(X, Y), where r(X, Y) depends probabilistically on s(X, Y). An unrolled network, with parameters omitted, is shown on the right. Shaded nodes are observations.

Note that the plate model in Fig. 2.6 illustrates an LRM L = ⟨A, G, Θ⟩, where

A = {r(X, Y), s(X, Y)}
G = {s(X, Y) → r(X, Y)}
Θ = {θr, θs}

The parameters of the model can be learned through counting. For instance, θr[u, v] is an entry in θr that models the conditional probability P(r(X, Y) = u | s(X, Y) = v), where u ∈ Vr, v ∈ Vs.

³ www.netflixprize.com
Assuming we have some marginal probability distribution Q defined over all unobserved ground atoms of r(X, Y) and s(X, Y), a maximum likelihood (ML) value of θr[u, v] is given by

θr[u, v] = #̃D(s(X, Y) = v ∧ r(X, Y) = u, Q) / Σ_{u′∈Vr} #̃D(s(X, Y) = v ∧ r(X, Y) = u′, Q)     (2.11)

where #̃D(f, Q) denotes the soft count of instances where formula f holds, relative to a database D (see Sec. 2.1.2). The marginal distribution Q is typically learned through EM (Dempster et al., 1977), with modifications for relational models (Getoor and Taskar, 2007; Kameya and Sato, 2000).

A special problem arises for directed relational models when a direct ancestor (parent) relation involves logical variables not found in the child. The presence of the additional variable(s) in the parents induces a complex dependency in the ground model, leading to an unmanageable representation of conditional probability distributions. Figure 2.7 illustrates this scenario, where relation r(X) depends on relation s(X, Y).

Figure 2.7: A plate representation of a Bayesian network modelling two relations r(X) and s(X, Y), where r(X) depends probabilistically on s(X, Y).

Figure 2.7 shows that each ground instance r(x) of r(X) depends on the set S = {s(x, y) : y ∈ D(τ2)}. The conditional distribution (a table) p(r(X) | s(X, Y)) thus has a representation size exponential in |S|. A compact representation of the conditional probability table can be recovered via aggregation. Illustrating with an example, consider the rule

θ : likes(U, M) ← won(M, Award)

which amounts to

θ(1) : likes(U, M) ← won(M, award1)
θ(2) : likes(U, M) ← won(M, award2)
θ(3) : likes(U, M) ← won(M, award3)
...
θ(n) : likes(U, M) ← won(M, awardn)

To compute the parameter θ, we combine θ(1), . . . , θ(n) using a combination function (discussed in (Jaeger, 1997; Kersting and De Raedt, 2000)), i.e.

θ = comb({θ(i) : i = 1, . . . , n})

where comb may be the mean function, max, min, noisy-or, noisy-max, and so on.
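The combination step can be sketched for the likes/won example; the per-award parameter values below are invented for illustration:

```python
# Per-award parameters θ(1..n), i.e. P(likes | won(M, award_i)); values invented.
thetas = [0.2, 0.5, 0.4]

def comb_mean(ps):
    # Average the per-rule probabilities.
    return sum(ps) / len(ps)

def comb_noisy_or(ps):
    # Probability that at least one independent "cause" succeeds.
    q = 1.0
    for p in ps:
        q *= 1 - p
    return 1 - q

theta_mean = comb_mean(thetas)
theta_nor = comb_noisy_or(thetas)
```

Noisy-or treats the awards as independent causes, so the combined parameter is at least as large as any individual θ(i), whereas the mean dilutes strong causes; the choice of comb therefore changes the meaning of θ.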
The meaning of the combined parameter depends on the choice of combination function. Suppose the function mean(·) is used; then θ reflects the average of the probabilities of users liking movies according to different awards. Decomposable aggregation functions have also been studied for probabilistic and relational models (Natarajan et al., 2005; Zhang and Poole, 1996).

Modelling Latent Properties

To yield clusters from multiple observed relations, without defining dependency links between the relations, we place latent unary relations as parents of each observed relation. Such models correspond to schemas 3 and 4 in Fig. 1.2 (Ch. 1). For a simple case with two observed relations s(X, Y) and r(X, Y), where X, Y are of types τ1 and τ2 respectively, Fig. 2.8 illustrates such a clustering model.

Figure 2.8: A plate representation (left) of a Bayesian network modelling two observed relations r(X, Y) and s(X, Y), and two latent unary relations α1(X) and α2(Y). Both r(X, Y) and s(X, Y) depend on both α1(X) and α2(Y). An unrolled network, with parameters omitted, is shown on the right. Shaded nodes are observations.

Despite being referred to as a cluster-based representation, dependency links exist from the latent relations to the observed relations. The main idea is that whilst no direct links are defined between the observed relations s(X, Y) and r(X, Y), statistical signals between them are propagated via the latent properties α1(X) and α2(Y). It is argued in (Xu et al., 2006) that the latent properties are sufficient to explain the observed relations as well as the interdependence between them, without the need for explicit dependency structure. The IRM (Kemp et al., 2006) and IHRM (Xu et al., 2006) are well-known cluster-based representations for multiple observed relations. For the model shown in Fig.
2.8, a corresponding LRM can be written as L = ⟨A, G, Θ⟩ where

A = {r(X, Y), s(X, Y), α1(X), α2(Y)}
G = {α1(X) → r(X, Y), α1(X) → s(X, Y), α2(Y) → r(X, Y), α2(Y) → s(X, Y)}
Θ = {θ_r, θ_s, θ_α1, θ_α2}

One interesting property demonstrated by the IRM (Kemp et al., 2006) is that observed relations can also be clustered. The general idea is that observed relations (each represented as a matrix) with the same domains are viewed as slices of a hypercube. Entries in the hypercube represent data, and the IRM generates a linear set of clusters of these entries via the CRP. Projecting the clusters onto each dimension of the hypercube yields clusters of individuals as well as clusters of relations. Figure 2.9 illustrates this idea.

Figure 2.9: Multiple relations, each represented as a two-dimensional slice, are concatenated to form a hypercube. Clustering of individuals and relations can be done by clustering elements of the hypercube (Kemp et al., 2006).

This property of the IRM is exploited in combination with a tensor factorisation approach to yield the Bayesian clustered tensor factorisation (BCTF) (Sutskever et al., 2010). The clustering of relations is intended to achieve an interpretable model about relations, whilst accurate predictions are gained through the tensor factorisation representation – an extension of the matrix factorisation approach. Whilst representations like the IRM facilitate discovery of interesting clusters of relations – allowing one to learn similarity between relations – explicit dependencies amongst relations can yield information about how relations interoperate to explain the data, e.g. causal links between relations. Representation of dependencies between relations is listed as a desired extension for IRMs (Kemp et al., 2006).
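As an illustrative sketch (not the thesis's implementation), the LRM triple L = ⟨A, G, Θ⟩ for the clustering model of Fig. 2.8 can be written down as a plain data structure, with the dependency structure G stored as (parent, child) pairs; all names here are assumptions of this sketch:

```python
# Illustrative encoding of the LRM for Fig. 2.8
lrm = {
    "atoms": ["r(X,Y)", "s(X,Y)", "alpha1(X)", "alpha2(Y)"],
    "arcs": [  # dependency structure G as (parent, child) pairs
        ("alpha1(X)", "r(X,Y)"),
        ("alpha1(X)", "s(X,Y)"),
        ("alpha2(Y)", "r(X,Y)"),
        ("alpha2(Y)", "s(X,Y)"),
    ],
    "params": ["theta_r", "theta_s", "theta_alpha1", "theta_alpha2"],
}

def parents(model, atom):
    """Return the parents of an atom under the dependency structure."""
    return [p for p, c in model["arcs"] if c == atom]
```

For instance, `parents(lrm, "r(X,Y)")` returns the two latent parents of r(X, Y), while the latent atoms themselves have no parents.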
Modelling Dependencies Amongst Observed Relations and Latent Properties

This thesis combines dependency structures over relations with clustering, implementing schemas 5 and 6 depicted in Fig. 1.2. Learning clusters of relations is not considered in this work. A schematic of our representation is shown in Fig. 2.10, which differs from Fig. 2.8 by including a dependency link between s(X, Y) and r(X, Y).

Figure 2.10: A plate representation (left) of a Bayesian network modelling two observed relations r(X, Y) and s(X, Y), and two latent unary relations α1(X) and α2(Y). Both r(X, Y) and s(X, Y) depend on both α1(X) and α2(Y). Additionally, r(X, Y) also depends on s(X, Y). An unrolled network, with parameters omitted, is shown on the right. Shaded nodes are observations.

Accurate predictions using such models can be achieved through optimising latent properties of individuals, as demonstrated in Ch. 4. The corresponding LRM is L = ⟨A, G, Θ⟩ where

A = {r(X, Y), s(X, Y), α1(X), α2(Y)}
G = {α1(X) → r(X, Y), α1(X) → s(X, Y), α2(Y) → r(X, Y), α2(Y) → s(X, Y), s(X, Y) → r(X, Y)}
Θ = {θ_r, θ_s, θ_α1, θ_α2}

This section has discussed models corresponding to the schemas shown in Fig. 1.2, relative to models proposed in the literature. We showed that LRMs are expressive enough to implement all of the schemas covered, and thus extend existing proposals. The next challenge is to develop learning algorithms that can fit LRMs to data. Our proposals for learning are described in Ch. 4, 5 and 6.

Chapter 3

An Argument for Latent Relational Models

This chapter compares relational models composed of observed relations with those that also contain latent properties of individuals. Specifically, the aim is to assess whether the models considered can produce the correct probability underlying given queries. In our analysis we show that both classes of relational models can be interpreted in terms of reference classes, and explain their accuracy based on reference classes.
Under explicit assumptions about how relations are formed, we show analytically that models with only observed relations incur residual errors in the limit of infinite data, whilst those with latent properties of individuals can represent the correct probability. Experiments in synthetic domains show that when data is generated according to our assumption, latent-property models dominate. In real-world domains, we show that this is also the case.

3.1 Modelling with Observed Properties and Relations

In this section, we show that relational learning models built from observed features directly construct reference classes. We first describe reference classes.

3.1.1 Reference Classes

A reference class is a set of individuals or tuples of individuals. It is common to adopt the constraint that a reference class is defined via logical formulae (Bacchus et al., 1996; Kyburg, 1983, 1974; Levi, 1981), e.g. the set of individuals who are tall and athletic. We also define reference classes in this way, as follows. Starting with domains D = {D(τ1), ..., D(τn)}, let A = {a1, ..., am} be atoms defined over domains in D, such that each atom represents a relation whose domain is composed of elements in D. Let 𝒜 = V1 × ... × Vm denote the space of value assignments for the m atoms, where each Vi is the range of the relation for atom ai. Then, the basic formula space is the set of formulae defined over all possible value assignments of the m relations, given by F = { ∧_{i=1}^m (ai = ui) : (u1, ..., um) ∈ 𝒜 }. For any f ∈ F, the set of tuples that satisfy f is given by

S(f) = {subs(θ) ∈ Ω : θ ∈ Γ_f, D ⊨ fθ}    (3.1)

which represents all tuples satisfying f that are entailed by the database D. subs(θ) denotes the constants specified in substitution θ, and Γ_f is the substitution space of f. Now we define reference classes for the case where only a subset of the atoms in A have given values. Let indices I ⊆ {1, . . .
, k} where k ≤ m, and for each aj where j ∈ I, a value uj is given. We can equivalently write this as a formula f = ∧_{j∈I} (aj = uj), where aj ∈ A. Thus, a restricted formula space is defined by F(f) = { (∧_{i=1}^m ai = u′i) ∈ F : ∀j ∈ I, u′j = uj }. If I = ∅, i.e. no atom has a set value, then F(f) = F. A reference class is henceforth given by

C(f) = ∪_{f′ ∈ F(f)} S(f′)    (3.2)

Given a reference class C(f), the proportion of the reference class such that the formula h = v holds is given by

P(h = v, f) = |C(f ∧ h = v)| / Σ_u |C(f ∧ h = u)| = #_D(f ∧ h = v) / Σ_u #_D(f ∧ h = u)    (3.3)

where h is a relation whose domain is composed of elements of D, and v is a value of the relation. The second equality follows by noting that |C(f ∧ h = v)| ≡ #_D(f ∧ h = v), by comparing Eq. 3.1 and Eq. 2.2.

We illustrate reference classes with examples. Suppose our domain of discourse is D = D(person), and we have two Boolean-valued atoms tall(X) and athletic(X); then the basic formula space is given by

F = { tall(X) ∧ athletic(X),
tall(X) ∧ ¬athletic(X),
¬tall(X) ∧ athletic(X),
¬tall(X) ∧ ¬athletic(X) }

Now suppose we have formula f = tall(X); then the restricted formula space is

F(f) = { tall(X) ∧ athletic(X),
tall(X) ∧ ¬athletic(X) }

Similarly we can define other restricted formula spaces, e.g. using f = athletic(X) or f = tall(X) ∧ athletic(X). The reference class C(tall(X)) is then

C(tall(X)) = S(tall(X) ∧ athletic(X)) ∪ S(tall(X) ∧ ¬athletic(X))

To consider a second example, let our domains of discourse be D(student) and D(course), and let atoms technical(S) and require_math(C) represent whether student S is technically-minded, and whether course C involves mathematics. The basic formula space is

F = { technical(S) ∧ require_math(C),
technical(S) ∧ ¬require_math(C),
¬technical(S) ∧ require_math(C),
¬technical(S) ∧ ¬require_math(C) }

Some reference classes may then be defined as follows: (i) C(T) – the set of all student-course tuples.
(ii) C(technical(S)) – the set of all student-course tuples with only technically-minded students. (iii) C(require_math(C)) – the set of all student-course tuples with only courses that involve mathematics. (iv) C(technical(S) ∧ require_math(C)) – the set of all student-course tuples with technically-minded students and courses that involve mathematics.

3.1.2 Relational Probabilistic Models

There have been significant contributions to the representation of probabilistic knowledge in relational domains. The contributions are in the form of languages that combine first-order logic structures and probabilities, so that one can express first-order knowledge as well as uncertainty in relational domains (Halpern, 1990; Kersting and De Raedt, 2000; Milch et al., 2005; Muggleton, 1995b; Poole, 1993b, 1997; Richardson and Domingos, 2006; Sato and Kameya, 1997). Whilst all of these are highly expressive languages, the accompanying learning methods have been more restricted, and generally cannot cover the entire space of models entailed by the language. In particular, we focus on the fact that most learning methods use only observed relations, derived from the input data.

In well-known languages such as the independent choice logic (Poole, 1997), PRISM (Sato and Kameya, 1997), Bayesian logic programs (Kersting and De Raedt, 2000), or Markov Logic Networks (Richardson and Domingos, 2006), models represent probabilistic dependencies or probabilistic rules over relations. Since probabilistic dependencies can be expressed by a set of probabilistic rules, our discussion will be based on such rules. A probabilistic rule is a rule in logic that is annotated by a probability parameter, i.e.

θ : h ← b1 ∧ ... ∧ bn    (3.4)

where h, b1, ..., bn are literals corresponding to observed relations. h is known as the head of the rule, whilst b1 ∧ ... ∧ bn is the body. Parameter θ characterises a potential φ(h, b1, ..., bn) that denotes the uncertainty associated with the rule.
For languages that combine first-order logic and the probabilistic semantics of Bayesian networks, e.g. (Kersting and De Raedt, 2000; Muggleton, 1995b; Poole, 1993b, 1997; Sato and Kameya, 1997), φ(h, b1, ..., bn) represents the conditional probability P(h | b1, ..., bn). Since the languages use discrete symbols, P(h | b1, ..., bn) = θ. We focus on relational models that model such conditional probabilities, and explain how they relate to reference classes.

Depending on the placement of logical variables in a rule, different types of rules can be derived. Whilst there are many sophisticated forms of rules, we focus on the two most common forms: the constrained rule and the range-restricted rule (Nienhuys-Cheng and de Wolf, 1997). A constrained rule is one where all logical variables appearing in the body must appear in the head. For example, the rule likes(U, M) ← drama(M) is constrained (U denotes a user, and M denotes a movie). The domain of discourse for a constrained rule is defined by the domain of the head literal, e.g. Ω = D(user) × D(movie). The body imposes an additional constraint on Ω such that the rule only holds for those (u, m) ∈ Ω for which drama(m) is true. The set of all such tuples is denoted here by Ω′ ⊆ Ω. Keeping only tuples in Ω′ entailed by the database D leads to Ω′_D, which is called the coverset in the ILP literature. We can immediately see that C(drama(M)) ≡ Ω′_D, i.e. the coverset of a constrained probabilistic rule is a reference class. Further, the maximum likelihood value for θ is the proportion of Ω′_D such that likes(U, M) is true, namely

θ = #_D(likes(U, M) ∧ drama(M)) / (#_D(likes(U, M) ∧ drama(M)) + #_D(¬likes(U, M) ∧ drama(M)))
≡ P(likes(U, M), drama(M))

Henceforth, for any query proposition likes(u′, m′), if drama(m′) is true, the probability of likes(u′, m′) being true is ascribed the value θ.
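The coverset computation above can be sketched as counting over tuples; the toy database below is hypothetical:

```python
def ml_theta(db, head, body):
    """ML parameter of a constrained rule: the proportion of the
    coverset (tuples satisfying the body) for which the head holds."""
    cover = [t for t in db if body(t)]
    return sum(1 for t in cover if head(t)) / len(cover)

# Hypothetical (user, movie) tuples with observed likes/drama values
db = [
    {"likes": True, "drama": True},
    {"likes": True, "drama": True},
    {"likes": False, "drama": True},
    {"likes": True, "drama": False},
]
theta = ml_theta(db, head=lambda t: t["likes"], body=lambda t: t["drama"])
# Two of the three drama tuples are liked, so theta = 2/3
```

The fourth tuple is outside the coverset (its body is false) and does not affect θ, mirroring the way the reference class C(drama(M)) excludes non-drama movies.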
This demonstration shows that constrained probabilistic rules directly implement reference classes, and that inference also yields the same answer. An illustration of the example is given in Fig. 3.1 below.

Figure 3.1: A simplified user-movie rating domain, where likes(U, M) is a Boolean relation denoting that user U likes movie M. drama(M) is an observed Boolean property of movies. The illustration splits observed cases for likes(U, M) by values of drama(M).

Range-restricted probabilistic rules lead to a more complicated connection with reference classes. A range-restricted rule is one where all variables appearing in the head also appear in the body. In the deterministic setting, logical variables appearing in the body but not in the head are existentially quantified, and the value of the head of the rule is determined by the OR aggregator. For example, the rule likes(U, M) ← won(M, Award) states that any given user likes movies that have won some award. In the probabilistic setting, however, the goal is to quantify the degree to which the head is true as a probabilistic function of the awards the movie has won. This means that the accompanying parameter is an aggregated value. Specifically, the range-restricted rule

θ : likes(U, M) ← won(M, Award)

amounts to

θ(1) : likes(U, M) ← won(M, award_1)
θ(2) : likes(U, M) ← won(M, award_2)
θ(3) : likes(U, M) ← won(M, award_3)
...
θ(n) : likes(U, M) ← won(M, award_n)

Then, parameter θ can be computed from θ(1), ..., θ(n) using a combination function (discussed in (Jaeger, 1997; Kersting and De Raedt, 2000)), i.e. θ = comb({θ(i) : i = 1, ..., n}). The meaning of the combined parameter depends on the choice of combination function.
Suppose the function mean(·) is used; then θ reflects the average of the probabilities of users liking movies according to the different awards.

3.2 Modelling Latent Properties of Individuals

Numerous proposals have appeared in different fields that represent and learn unobserved properties to explain observed relations (Airoldi et al., 2008; Handcock et al., 2007; Hofman and Wiggins, 2008; Hofmann and Puzicha, 1999; Koren, 2008). The typical problem setting involves one observed relation, and one unary latent relation to represent unobserved properties for individuals of each domain type (see Ch. 2 Sec. 2.4.1). In the movie-rating problem, for example, the lone observed relation is likes(U, M), and latent properties α(U) and β(M) model unobserved properties of users and movies respectively. In general, we can express probabilistic rules that involve multiple observed relations and latent properties (denoted by latent unary relations), i.e.

θ : h ← b1 ∧ ... ∧ bn ∧ l1 ∧ ... ∧ lm    (3.5)

where b1, ..., bn are observed, and the li are literals corresponding to latent relations, which we will assume to be unary relations for now. For the movie-rating domain, we have n = 1 (the rating relation) and m = 2 types (users and movies). Recall that rules such as Eq. 3.5 are encapsulated by the probabilistic dependencies of LRMs.

Whilst not immediately clear, LRMs can be interpreted using reference classes. Consider the following illustration of a simple movie-rating domain and the partitioning of observed data for the relation likes(U, M) induced by latent properties α(U) and β(M). In Fig. 3.2, the partitioning of likes(U, M) is induced by treating α(U), β(M) as if they were observed features. As α(U) and β(M) are not actually observed, their respective columns in the data table must be inferred. As such, it is clear that every full specification of latent feature values induces a different partitioning of data at the leaves. Each leaf partition is a reference class.
Given a partitioning, inference subsequently follows direct inference, e.g. the probability of likes(j.smith, titanic) given that α(j.smith) = T and β(titanic) = F is 1. In drawing the connection to reference classes, we can see that latent-feature models effectively allow one to optimise over possible reference classes by estimating the best latent feature values. It is worthwhile noting that latent features can be inferred as probabilities, in which case inference yields a marginal probability.

Figure 3.2: A simplified user-movie rating domain, where likes(U, M) is a Boolean relation denoting that user U likes movie M. α(U) is a Boolean unary relation representing some (hidden) property of user U, and β(M) is similarly defined for movies. The illustration splits observed cases for likes(U, M) as if α and β were observed. Probabilistic rules for each partition are also shown, with probability parameters indicating the proportion where likes(U, M) = T.

For instance, suppose that p̂(α(j.smith)) = (0.3, 0.7) is the inferred distribution for α(j.smith) and p̂(β(titanic)) = (0.15, 0.85) for β(titanic). Then

P̂(likes(j.smith, titanic)) = P̂(likes(j.smith, titanic) | α(j.smith), β(titanic)) × 0.7 × 0.85
+ P̂(likes(j.smith, titanic) | α(j.smith), ¬β(titanic)) × 0.7 × 0.15
+ P̂(likes(j.smith, titanic) | ¬α(j.smith), β(titanic)) × 0.3 × 0.85
+ P̂(likes(j.smith, titanic) | ¬α(j.smith), ¬β(titanic)) × 0.3 × 0.15

where P̂(likes(j.smith, titanic) | α(j.smith), β(titanic)) corresponds to the parameter value θ for the rule

θ : likes(U, M) ← α(U) ∧ β(M)

and similarly for the other values of α and β. In this setting, inference is simply a weighted sum of reference class estimates.
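The weighted sum above can be checked numerically; the conditional parameter table `theta` below is hypothetical:

```python
def predict(theta, p_alpha, p_beta):
    """Marginal prediction: a weighted sum of reference-class
    estimates over joint assignments of the latent properties."""
    return sum(theta[(a_val, b_val)] * pa * pb
               for a_val, pa in p_alpha.items()
               for b_val, pb in p_beta.items())

p_alpha = {True: 0.7, False: 0.3}   # inferred p^(alpha(j.smith))
p_beta = {True: 0.85, False: 0.15}  # inferred p^(beta(titanic))
# Hypothetical rule parameters theta for each (alpha, beta) assignment
theta = {(True, True): 1.0, (True, False): 0.0,
         (False, True): 0.0, (False, False): 1.0}
p = predict(theta, p_alpha, p_beta)
# p = 1.0*0.7*0.85 + 1.0*0.3*0.15 = 0.64
```

Each of the four terms is one reference-class estimate weighted by the probability of the latent assignment that defines that class.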
One contribution of latent features is that they are equivalent to introducing disjunctions of equality constraints into the language for defining reference classes. Specifically, they allow statements such as

1.0 : likes(U, M) ← ((U = k.stevens) ∨ (U = x.ahn)) ∧ ((M = star_wars) ∨ (M = top_gun))

which corresponds to a leaf in Fig. 3.2. The constituents of each disjunction are determined by the values of α and β for each individual, as shown in the partition tree in Fig. 3.2. An advantage arises as the values of latent features α and β can be optimised, thus allowing optimisation of the disjunctions to best explain the data.

So far, we have assumed that inferred values of latent properties are deterministic (i.e. a hard-clustering model), where it is also possible that they take probability values (a soft-clustering model). Namely, for some x and y, instead of inferring the most likely value of α(x) and β(y), one can infer a distribution over possible values of α(x) and β(y). Computing soft-clustering models can be seen as computing a distribution over possible hard-clustering models. To consider the simplest case where we cluster a single individual x, ascribing a probability of 0.7 to α(x) = T is equivalent to ascribing a probability of 0.7 to the hard-clustering model where α(x) = T, and 0.3 to the model where α(x) = F.

In the next section, we compare inferences made by observed-feature and latent-feature relational models, in the context of recovering the true underlying probabilities of propositions.

3.3 Understanding Relational Inference

The connection between relational models and reference classes means that relational models are confronted with the reference class problem (Reichenbach, 1949).
Although much focus has been given to the justifiability of reference classes and direct inference, as well as the rationality behind the identification of reference classes, we seek a more direct evaluation of whether a given relational model can reproduce the right probability for given propositions. We take a pragmatic approach by assuming that the true probabilities of propositions are known. That is, we explicitly assume a particular generative process of relational data which harbours the true probabilities.

3.3.1 Generative Model

Our generative process reflects the common intuition that relations amongst individuals are attributed to (hidden) properties of the participating individuals. We focus on the simplest possible example of such a generative process, and generate data that is used for learning. We assess observed-feature and latent-feature relational models learned from this data for recovering the correct probabilities, i.e. those defined in the generative process.

Definition 4. We assume a Boolean relation r : D(τ1) × D(τ2) → {F, T}, where the domains D(τ1), D(τ2) are known, and two unary Boolean relations: a : D(τ1) → {F, T} and b : D(τ2) → {F, T} (the ranges of a and b will be called Va and Vb respectively in the rest of this chapter). Each ground atom a(x) of a(X) is a Bernoulli variable with parameter γa corresponding to the probability of a(x) = T. Similarly, each ground atom b(y) of b(Y) has Bernoulli parameter γb. Each ground atom r(x, y) of r(X, Y) is also a Bernoulli variable, with conditional probability parameter γ_{r|a=u,b=v} for the probability of r(x, y) = T given a(x) = u and b(y) = v. We call this model G in the remainder of this chapter.

G represents a Bayesian network which defines the joint distribution

J = ∏_{x ∈ D(τ1)} p_G(a(x)) ∏_{y ∈ D(τ2)} p_G(b(y)) ∏_{x ∈ D(τ1), y ∈ D(τ2)} p_G(r(x, y) | a(x), b(y))    (3.6)

The network and its parameters are illustrated in Fig. 3.3.
Figure 3.3: A generative model for a hypothetical relational domain shown as a parametrised Bayesian network. The structure is shown on the left, and the parameters of the model, listed below, are shown on the right. Logical variables X, Y have different types.

P_G(a(X)) = γa
P_G(b(Y)) = γb
P_G(r(X, Y) | a(X) ∧ b(Y)) = γ_{r|a∧b}
P_G(r(X, Y) | ¬a(X) ∧ b(Y)) = γ_{r|¬a∧b}
P_G(r(X, Y) | a(X) ∧ ¬b(Y)) = γ_{r|a∧¬b}
P_G(r(X, Y) | ¬a(X) ∧ ¬b(Y)) = γ_{r|¬a∧¬b}

3.3.2 Sampling from G

We assume that r(X, Y) is an observed relation whilst a(X), b(Y) are latent relations. Thus, our dataset will consist of only D_r, which is generated using our generative model (Def. 4) as follows. First assume we have two domains D(τ1) and D(τ2). Then for every x ∈ D(τ1), randomly generate the value for a(x) by sampling the Bernoulli distribution with parameter γa. Similarly, for every y ∈ D(τ2), randomly generate the value for b(y) by sampling the Bernoulli distribution with parameter γb. Then, for every pair (x, y) ∈ D(τ1) × D(τ2), where a(x) = u and b(y) = v are previously sampled values, randomly generate the value for r(x, y) according to a Bernoulli distribution with parameter γ_{r|a=u,b=v}. We can simulate missing-at-random data by randomly removing generated cases of r(X, Y) post hoc.

3.3.3 Inference with Reference Classes

Under the simple problem setting where data is generated according to G (Def. 4), this section examines inference with observed-feature models. Since we have only one observed relation r(X, Y), the relational probabilistic models described in Sec. 3.1.2 for this problem are described by probabilistic rules with empty bodies, i.e. θ : r(X, Y) ← . The equivalent reference class is C(T), which consists of all tuples (x, y) ∈ Ω_r entailed by the database. The reference class statistic P(r, T) is simply the proportion of tuples in C(T) for which r is true. For given queries, specific reference classes can be obtained by specialising on the identity of the individuals in the queries.
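As a small simulation sketch, D_r can be sampled following Sec. 3.3.2, and the reference-class statistic P(r, T) computed from it; all parameter values below are illustrative:

```python
import random

def sample_d_r(n1, n2, gamma_a, gamma_b, gamma_r, seed=0):
    """Sample latent a(x), b(y), then observed r(x, y), from G (Def. 4)."""
    rng = random.Random(seed)
    a = [rng.random() < gamma_a for _ in range(n1)]
    b = [rng.random() < gamma_b for _ in range(n2)]
    d_r = {(x, y): rng.random() < gamma_r[(a[x], b[y])]
           for x in range(n1) for y in range(n2)}
    return a, b, d_r

def p_r_top(d_r):
    """Reference-class statistic P(r, T): overall proportion with r = T."""
    return sum(d_r.values()) / len(d_r)

# Illustrative conditional parameters gamma_{r|a=u, b=v}
gamma_r = {(True, True): 0.9, (True, False): 0.7,
           (False, True): 0.3, (False, False): 0.1}
a, b, d_r = sample_d_r(100, 100, 0.5, 0.5, gamma_r)
# With gamma_a = gamma_b = 0.5 the limiting value of P(r, T) is 0.5,
# so p_r_top(d_r) should fall near 0.5 for these domain sizes
```

Missing-at-random data could be simulated by deleting entries of `d_r` with some fixed probability after sampling, as described above.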
For instance, for the query r(x′, y′), reference classes C(X = x′) and C(Y = y′) may be used. For our analysis of observed-feature models, we consider the reference classes C(T), C(X = x′), and C(Y = y′). The main result we show is that the corresponding reference class statistics P(r, T), P(r, X = x′), and P(r, Y = y′) are approximations of marginal probabilities of G. We first list the marginal distributions of G below. Marginalising out both a(X) and b(Y) from G leads to the first marginal

P_G(r(X, Y)) = Σ_{u,v} P_G(r(X, Y) | a(X) = u, b(Y) = v) P_G(a(X) = u) P_G(b(Y) = v)
= γ_{r|a∧b} γa γb + γ_{r|¬a∧b} (1 − γa) γb + γ_{r|a∧¬b} γa (1 − γb) + γ_{r|¬a∧¬b} (1 − γa)(1 − γb)    (3.7)

where u, v ∈ {F, T} are Boolean values. Marginalising over b(Y) only yields

P_G(r(X, Y) | a(X)) = γ_{r|a∧b} γb + γ_{r|a∧¬b} (1 − γb)    (3.8)
P_G(r(X, Y) | ¬a(X)) = γ_{r|¬a∧b} γb + γ_{r|¬a∧¬b} (1 − γb)    (3.9)

Finally, marginalising over a(X) gives

P_G(r(X, Y) | b(Y)) = γ_{r|a∧b} γa + γ_{r|¬a∧b} (1 − γa)    (3.10)
P_G(r(X, Y) | ¬b(Y)) = γ_{r|a∧¬b} γa + γ_{r|¬a∧¬b} (1 − γa)    (3.11)

To show that P(r, T), P(r, X = x′), and P(r, Y = y′) approximate Eq. 3.7 to Eq. 3.11, observe that each datum in the dataset D_r is sampled from G under one of four conditional distributions: p_G(r(X, Y) | ¬a(X), ¬b(Y)), p_G(r(X, Y) | ¬a(X), b(Y)), p_G(r(X, Y) | a(X), ¬b(Y)) and p_G(r(X, Y) | a(X), b(Y)). Grouping the data according to their generative distributions, we may rewrite D_r as a union of the partitions, namely D_r = D_{¬a,¬b} ∪ D_{¬a,b} ∪ D_{a,¬b} ∪ D_{a,b}. For partition D_{a,b}, there are n⁺_{a,b} cases where r(X, Y) = T, and n⁻_{a,b} cases where r(X, Y) = F. The total number of cases is n_{a,b} = |D_{a,b}| = n⁺_{a,b} + n⁻_{a,b}. Similar counts are defined for the other partitions of D_r.
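The grouping into partitions can be sketched as follows, using a tiny hand-built example in which the generative values of a and b are treated as known:

```python
def partition_counts(d_r, a, b):
    """Counts n[(u, v)] and n_pos[(u, v)] for each partition D_{a=u, b=v}."""
    n, n_pos = {}, {}
    for (x, y), r_val in d_r.items():
        key = (a[x], b[y])
        n[key] = n.get(key, 0) + 1
        n_pos[key] = n_pos.get(key, 0) + int(r_val)
    return n, n_pos

# Tiny hand-built example: two individuals per type
a = {"x1": True, "x2": False}
b = {"y1": True, "y2": False}
d_r = {("x1", "y1"): True, ("x1", "y2"): False,
       ("x2", "y1"): True, ("x2", "y2"): False}
n, n_pos = partition_counts(d_r, a, b)
# Each of the four partitions contains exactly one case here
```

In the actual setting of this section, a(X) and b(Y) are latent, so this partitioning is not directly available to the learner; the counts are shown only to make the decomposition of the reference-class statistics concrete.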
Given these counts, we can then express the reference class estimate P(r, T), which implicates all of D_r, as follows:

P(r, T) = (n⁺_{¬a,¬b} + n⁺_{a,¬b} + n⁺_{¬a,b} + n⁺_{a,b}) / N
= (n⁺_{¬a,¬b} / n_{¬a,¬b})(n_{¬a,¬b} / N) + (n⁺_{a,¬b} / n_{a,¬b})(n_{a,¬b} / N) + (n⁺_{¬a,b} / n_{¬a,b})(n_{¬a,b} / N) + (n⁺_{a,b} / n_{a,b})(n_{a,b} / N)    (3.12)

By matching terms in Eq. 3.12 with those in Eq. 3.7, one can see that the reference class proportion P(r, T) is simply an empirical approximation of Eq. 3.7.

The reference class C(X = x′) consists of cases of r(X, Y) such that X = x′. Each case r(x′, y) has value T with probability γ_{r|a=u,b=v}, where a(x′) = u and b(y) = v happen to be true as a result of the generative process. Let n(x′) = |C(X = x′)| be the number of cases in C(X = x′), n⁺(x′) the number of cases with value T, and n⁻(x′) the number of cases with value F. We further split C(X = x′) into two sets, corresponding to b(Y) = T and b(Y) = F. We then have counts n_b(x′), n⁺_b(x′), n⁻_b(x′), and n_{¬b}(x′), n⁺_{¬b}(x′), n⁻_{¬b}(x′). The reference class statistic P(r, X = x′) can be expressed as

P(r, X = x′) = (n⁺_b(x′) + n⁺_{¬b}(x′)) / n(x′)
= (n⁺_b(x′) / n_b(x′))(n_b(x′) / n(x′)) + (n⁺_{¬b}(x′) / n_{¬b}(x′))(n_{¬b}(x′) / n(x′))    (3.13)

Following the same derivation we obtain P(r, Y = y′):

P(r, Y = y′) = (n⁺_a(y′) / n_a(y′))(n_a(y′) / n(y′)) + (n⁺_{¬a}(y′) / n_{¬a}(y′))(n_{¬a}(y′) / n(y′))    (3.14)

Using Eq. 3.13 and 3.14, the following theorem gives conditions under which reference class statistics can model the correct probability.

Theorem 1. Let r(X, Y) be a Boolean relation where X, Y are of types τ1 and τ2 respectively. Suppose r(X, Y) is the sole observed relation with respect to a database D, with dataset D_r generated using the generative model G (see Def. 4).
Assume that γa ∈ (0, 1) and γb ∈ (0, 1) (i.e., they are non-extreme probability values). For any (x′, y′) ∈ D(τ1) × D(τ2), the true probability of r(x′, y′) = T is µ = γ_{r|a=u∧b=v}, where the generative process gives a(x′) = u and b(y′) = v. Consider the limit as N = |D_r| approaches ∞. It follows that

(i) lim_{N→∞} P(r, T) = µ iff γ_{r|a∧b} = γ_{r|a∧¬b} = γ_{r|¬a∧b} = γ_{r|¬a∧¬b}
(ii) lim_{N→∞} P(r, X = x′) = µ iff γ_{r|a∧b} = γ_{r|a∧¬b} ∧ γ_{r|¬a∧b} = γ_{r|¬a∧¬b}
(iii) lim_{N→∞} P(r, Y = y′) = µ iff γ_{r|a∧b} = γ_{r|¬a∧b} ∧ γ_{r|a∧¬b} = γ_{r|¬a∧¬b}
(3.15)

where the γ_{·|·} are parameters of the generative model G.

Proof. To begin with, µ = γ_{r|a=u∧b=v} is the true probability of the query r(x′, y′) = T, where a(x′) = u and b(y′) = v is given by the generative process G. The observed cases of r(X, Y) are given by D_r, and let N = |D_r|. To evaluate whether the reference class statistics P(r, T), P(r, X = x′) and P(r, Y = y′) can yield µ, we consider the limit as N approaches ∞. The reference class statistic P(r, T) has the limiting value

lim_{N→∞} P(r, T) = γ_{r|a∧b} γa γb + γ_{r|a∧¬b} γa γ_{¬b} + γ_{r|¬a∧b} γ_{¬a} γb + γ_{r|¬a∧¬b} γ_{¬a} γ_{¬b}

and to determine when lim_{N→∞} P(r, T) = µ, directly solve

γ_{r|a∧b} γa γb + γ_{r|a∧¬b} γa γ_{¬b} + γ_{r|¬a∧b} γ_{¬a} γb + γ_{r|¬a∧¬b} γ_{¬a} γ_{¬b} = µ    (3.16)

Since µ equals one of γ_{r|a∧b}, γ_{r|a∧¬b}, γ_{r|¬a∧b}, or γ_{r|¬a∧¬b}, it must be the case that either γa and γb are extreme, or that γ_{r|a∧b} = γ_{r|a∧¬b} = γ_{r|¬a∧b} = γ_{r|¬a∧¬b} = µ. Since we assume γa, γb ∈ (0, 1), it must be the case that γ_{r|a∧b} = γ_{r|a∧¬b} = γ_{r|¬a∧b} = γ_{r|¬a∧¬b} = µ. Applying this condition to Eq. 3.16 yields

µ(γa γb + γa γ_{¬b} + γ_{¬a} γb + γ_{¬a} γ_{¬b}) = µ · 1 = µ

thus proving statement (i) of Eq. 3.15. (Note the parenthesised sum is 1 because γa + γ_{¬a} = 1 and γb + γ_{¬b} = 1.) The value of P(r, X = x′) (Eq.
3.13) in the limit as N → ∞ is

lim_{N→∞} P(r, X = x′) = γ_{r|a∧b} γb + γ_{r|a∧¬b} γ_{¬b} , if a(x′) = T
lim_{N→∞} P(r, X = x′) = γ_{r|¬a∧b} γb + γ_{r|¬a∧¬b} γ_{¬b} , if a(x′) = F
(3.17)

To establish conditions under which P(r, X = x′) can model the correct probability in the limit, we solve lim_{N→∞} P(r, X = x′) = µ. Suppose µ = γ_{r|a∧b}; then

γ_{r|a∧b} γb + γ_{r|a∧¬b} γ_{¬b} = γ_{r|a∧b}
(γb − 1) γ_{r|a∧b} = (γb − 1) γ_{r|a∧¬b}
∴ γ_{r|a∧b} = γ_{r|a∧¬b}

Since γb is a non-extreme probability, it follows that lim_{N→∞} P(r, X = x′) = µ if and only if γ_{r|a∧b} = γ_{r|a∧¬b}. Similarly, for the case when µ = γ_{r|¬a∧b}, solve lim_{N→∞} P(r, X = x′) = γ_{r|¬a∧b}:

γ_{r|¬a∧b} γb + γ_{r|¬a∧¬b} γ_{¬b} = γ_{r|¬a∧b}
(γb − 1) γ_{r|¬a∧b} = (γb − 1) γ_{r|¬a∧¬b}
∴ γ_{r|¬a∧b} = γ_{r|¬a∧¬b}

which implies that lim_{N→∞} P(r, X = x′) = µ if and only if γ_{r|¬a∧b} = γ_{r|¬a∧¬b}. Since it is not observed whether a(x′) = T or a(x′) = F, lim_{N→∞} P(r, X = x′) = µ if and only if γ_{r|a∧b} = γ_{r|a∧¬b} ∧ γ_{r|¬a∧b} = γ_{r|¬a∧¬b}, thus proving condition (ii) in Eq. 3.15. For P(r, Y = y′), the limiting value is

lim_{N→∞} P(r, Y = y′) = γ_{r|a∧b} γa + γ_{r|¬a∧b} γ_{¬a} , if b(y′) = T
lim_{N→∞} P(r, Y = y′) = γ_{r|a∧¬b} γa + γ_{r|¬a∧¬b} γ_{¬a} , if b(y′) = F
(3.18)

and since P(r, Y = y′) is a symmetric case of P(r, X = x′), the same arguments used to derive condition (ii) of Eq. 3.15 show that lim_{N→∞} P(r, Y = y′) = µ if and only if γ_{r|a∧b} = γ_{r|¬a∧b} ∧ γ_{r|a∧¬b} = γ_{r|¬a∧¬b}, thus proving condition (iii) of Eq. 3.15.

Theorem 1 shows that, assuming G is the true process underlying our data, and under the stated conditions on G, the reference class statistics P(r, T), P(r, X = x′) and P(r, Y = y′) represent the true probability of r(x′, y′) = T in the limit. Figure 3.4 provides an instructive view of Thm. 1, showing several configurations of the parameters of G (γa = γb = 0.5 is assumed for illustration purposes). Firstly, Fig. 3.4(c) satisfies the condition for case (ii) in Thm.
1, in which case the true probability (inside the grid) matches the marginal probabilities (shown outside the grid), and the reference class statistic P(r, X = x′) corresponds to these marginals in the limit. Thus, P(r, X = x′) will yield the correct probability for r(x′, y′). Figure 3.4(c) is also illustrative of (iii) in Thm. 1, due to the symmetry between (ii) and (iii). Figures 3.4(a), 3.4(b), and 3.4(d) correspond to the case where the stated conditions in Thm. 1 do not hold. Hence, the marginal probabilities do not correspond to the true probability of r(x′, y′), and thus P(r, T), P(r, X = x′) and P(r, Y = y′) cannot represent the true probability of r(x′, y′).

It can be argued that G is an over-simplification of the generative processes that actually underlie relational domains; that realistic generative processes are much more complex. However, our analysis shows that even in the case of such a simple process, reference classes can fail to entail the true probability in inference. The analysis is sufficient, in that it establishes the existence of a scenario where reference classes are suboptimal. In addition, the design of G is based on common intuitions about how data is generated in widely-studied relational domains, and is not contrived to directly expose reference classes.

3.3.4 Inference with Latent Relational Models

Section 3.2 introduced relational models designed for clustering individuals, e.g. (Airoldi et al., 2008; Hofmann and Puzicha, 1999; Kemp et al., 2006; Xu et al., 2006), where clusters can be defined via latent properties. In this section we discuss inference for such models and the potential for obtaining the correct answer for probabilistic queries.

Figure 3.4: Four example configurations of θr in G (numbers inside the box).
For each example, marginal probabilities are displayed outside of the box, computed assuming γa = PG(a(X)) = 0.5 and γb = PG(b(Y)) = 0.5. The corresponding reference classes are shown adjacently.

We use our language of LRMs to represent a clustering model. The LRM we consider for the problem at hand (see beginning of Sec. 3.3) is given by L = ⟨A, G, Θ⟩, where

A = {r(X, Y), α1(X), α2(Y)}
G = {α1 → r, α2 → r}
Θ = {θr, θα1, θα2}

Relation r(X, Y) is observed, whilst α1(X) and α2(Y) are latent unary relations. Whilst L has the same dependency structure as G, L may not produce the correct probability to answer queries. To see this we inspect the prediction of L given data Dr, i.e.

PL(r(x, y) = T | Dr) = Σ_{u∈Vα1} Σ_{v∈Vα2} P(r(x, y) | α1(x) = u, α2(y) = v, Dr) P(α1(x) = u | Dr) P(α2(y) = v | Dr)    (3.19)

Assume for now that the latent properties of L have the same range of values as those in G, i.e. Vα1 = Va and Vα2 = Vb. The key point is that if L represents P(r(x, y) | α1(x) = u, α2(y) = v, Dr), P(α1(x) = u | Dr) and P(α2(y) = v | Dr) such that they match PG(r(x, y) | a(x) = u, b(y) = v), PG(a(x) = u) and PG(b(y) = v) respectively, then PL(r(x, y) = T | Dr) = PG(r(x, y)); that is, L represents the correct probability. The ability of L to achieve the correct probability in inference pertains directly to the learning of L from data. The basic requirement is to correctly estimate the values of α1(x) and α2(y), for all x and y, and to partition Dr in the same way as G. Given the correct partitioning and that r(X, Y) is observed, counting proportions in each partition will yield P(r(x, y) | α1(x) = u, α2(y) = v, Dr) = PG(r(x, y) | a(x) = u, b(y) = v) (in the limit of infinite data).
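The prediction in Eq. 3.19 is a mixture over latent property values, weighted by the per-individual posteriors. A minimal sketch of this computation; the CPT entries and posteriors below are hypothetical, not values from our experiments:

```python
# Sketch of Eq. 3.19: P_L(r(x,y) = T | D_r) as a posterior-weighted mixture.
def predict(theta_r, post_x, post_y):
    """Sum over latent values u, v of theta_r[u][v] * P(alpha1(x)=u) * P(alpha2(y)=v)."""
    return sum(theta_r[u][v] * pu * pv
               for u, pu in enumerate(post_x)
               for v, pv in enumerate(post_y))

theta_r = [[0.9, 0.2],   # hypothetical P(r = T | alpha1 = u, alpha2 = v)
           [0.3, 0.6]]
post_x = [0.8, 0.2]      # hypothetical posterior P(alpha1(x) = u | D_r)
post_y = [0.5, 0.5]      # hypothetical posterior P(alpha2(y) = v | D_r)
print(predict(theta_r, post_x, post_y))   # ~0.53
```

If the CPT and posteriors match those of the generative process G, this mixture recovers PG(r(x, y)), as argued above.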
Computing P(α1(x) = u | Dr) and P(α2(y) = v | Dr) exactly will render PL(r(x, y) = T | Dr) a Bayes optimal predictor, and P(α1(x) = u | Dr) approaches PG(a(x) = u) in the limit of infinite data (similarly for P(α2(y) = v | Dr)). For all but the simplest domains, however, computing the posterior quantities P(α1(x) = u | Dr) and P(α2(y) = v | Dr) exactly is confronted by a computational bottleneck. For example, computing P(α1(x) = u | Dr) requires marginalising over all other latent random variables in the model, i.e. the set of α1(x′) for all individuals x′ other than x, and all α2(y′) (including α2(y)). The number of sum terms is exponential in the number of variables in the marginalisation. The marginalisation problem cannot be easily simplified, due to the dense correlations that exist amongst the latent variables of the model. One can alleviate the computational complexity by seeking an approximation of P(α1(x) = u | Dr) that is efficient to compute, as done in many variational Bayes approaches, or by sampling via MCMC methods (see (Friedman and Koller, 2009) for a general coverage). It can be concluded that with a finite amount of data the best result available using the LRM is a Bayes optimal predictor, where optimality relies on an ability to infer the exact posterior latent property values for each individual. Learning can yield LRMs that represent the correct probabilities with an increasing amount of data, and thus one concludes that, in theory, the correct probability is represented by LRMs. In practise, however, the intractability of inferring the exact latent property values necessitates approximate solutions. Predictions using L thus cannot claim Bayes optimality. Depending on the accuracy of the approximate inference algorithm for determining latent property values, various degrees of error may result in the limit. Whether LRMs are better predictors than reference classes – i.e.
that they produce better approximations of the correct probabilities – is an empirical question, which we address in Sec. 3.4. The answer depends on the inference algorithm for latent properties as well as the nature of the underlying domain. At this point we note that since a and b are not observed, we do not know the sizes of Va and Vb. Thus, our LRM contains latent properties α and β such that Vα and Vβ do not have a priori defined sizes. The arguments presented in this section have assumed that Vα and Vβ match Va and Vb respectively.

3.4 Experiments

In our experiments we compare reference classes with two forms of latent-property models: LRMs and the IRM (Kemp et al., 2006). We evaluate predictive accuracy using simple domains containing only one observed relation, and report log-loss (or negative log-likelihood) (see Ch. 2, Sec. 2.1.3). Lower values of loss indicate higher accuracy. Our first experiment uses data simulated by sampling G (Def. 4). Although Sec. 3.3 shows that LRMs can achieve the correct probabilities under less restrictive conditions than reference classes, our experiments aim to quantify the advantage of LRMs over reference classes. We validate by generating many examples of G with randomly generated latent properties and parameters. In the second experiment, we use real-world relational data from the WebKB project1, which describes hyperlinks between web pages from five academic world wide web domains. Thirdly, we use a movie-rating dataset from the EachMovie project2, which provides ratings by approximately 60,000 users for 1,600 movies.

1 http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/index.html

3.4.1 Models

Reference Class Predictors

In the kind of single-relation domains considered here, the three reference class predictors P(r, T), P(r, X = x) and P(r, Y = y) are again used, where r(X, Y) represents the sole relation in the domain of interest, e.g. hyperlink in the WebKB domain.
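Each of these statistics is an empirical proportion over the observed tuples: all tuples for P(r, T), tuples involving x for P(r, X = x), and tuples involving y for P(r, Y = y). A minimal sketch under that reading, with a hypothetical 0/1-encoded dataset:

```python
# Sketch of the three single-relation reference-class statistics.
def ref_stats(data, x, y):
    """data maps (x, y) pairs to 0/1 values of r(x, y)."""
    all_vals = list(data.values())
    row = [v for (a, b), v in data.items() if a == x]   # tuples involving x
    col = [v for (a, b), v in data.items() if b == y]   # tuples involving y
    return (sum(all_vals) / len(all_vals),   # P(r, T)
            sum(row) / len(row),             # P(r, X = x)
            sum(col) / len(col))             # P(r, Y = y)

data = {("x1", "y1"): 1, ("x1", "y2"): 0,
        ("x2", "y1"): 1, ("x2", "y2"): 1}
print(ref_stats(data, "x1", "y1"))   # (0.75, 0.5, 1.0)
```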
To answer the query r(x, y) we construct two predictors using the given reference classes. The first, called REF, simply chooses the best prediction from the three reference classes. REF is therefore an inadmissible predictor, as we use test data to determine the best predictive model. However, it represents a gold standard for single reference class predictions that we use to compare with other predictors; any method that can outperform REF can outperform every reference class predictor in the set. A second reference class predictor, called POOL, combines all three reference classes by linear interpolation (i.e. a weighted combination with weights summing to 1). However, we again construct a gold standard for this type of reference class predictor by adopting the following scheme. Suppose for each test case r(x, y) that the true probability P(r(x, y)) = η is known (this is true for synthetic data). Let δmin be the minimum of P(r, T), P(r, X = x) and P(r, Y = y), and δmax the maximum. Then, if η ∈ [δmin, δmax], we assume a perfect interpolator POOL outputs η. If η ∉ [δmin, δmax], then REF is used. In the case that the true probability is unknown and only labels P(r(x, y)) ∈ {0, 1} are available (e.g. in real-world experiments), POOL again defaults to REF.

Latent-property Models

We use two latent-property models: the IRM (Kemp et al., 2006) and a LRM. The IRM models latent properties to explain relational data (see Ch. 2, Sec. 2.4.1), and is a non-parametric model that stochastically generates the number of values for latent properties, as well as the value of latent properties for each individual. Given a query r(x, y) = w, prediction is based on the latent property values of x and y respectively.

2 http://www.grouplens.org/node/76
Suppose that α1(x) = u and α2(y) = v in the IRM (or, x belongs to cluster u and y belongs to cluster v in the IRM nomenclature); the probability ascribed to r(x, y) = w is the empirical proportion

|{r(x′, y′) = w : α1(x′) = u, α2(y′) = v, D ⊨ r(x′, y′)}| / Σ_{w′} |{r(x′, y′) = w′ : α1(x′) = u, α2(y′) = v, D ⊨ r(x′, y′)}|

where D is the database function mapping observed ground atoms in Dr to their values. The LRM used here is described in Sec. 3.3.4, where the latent properties are assumed Boolean. Learning is done via an approximate EM algorithm described in Ch. 4.

3.4.2 Protocol

Synthetic Relations

We simulate 2000 datasets, each generated by sampling a generative model in the form of G (see Sec. 3.3). Parameters of the generative model are generated randomly. There are two types, with domains D(τ1) and D(τ2), where both |D(τ1)| and |D(τ2)| are restricted to be between 50 and 150. The two latent properties a(X) and b(Y) can have between 2 and 10 values. Each generated dataset contains observed cases for the observed relation r(X, Y) only. A supervised learning framework is assumed, where 90% of the observed cases are used for training, whilst 10% are reserved for testing. Training of the IRM uses MCMC with default parameters specified in the accompanying software3, whilst the LRM training is done over 30 restarts. Each restart is run for a maximum of 100 iterations. The LRM which attains the best training set likelihood over restarts is returned.

3 http://www.psy.cmu.edu/~ckemp/code/irm.html

WebKB

The WebKB dataset contains web-page hyperlinks in the domains of five universities. The data consists of instances of the relation link(URL1, URL2) as well as features based on the words appearing in each web page. In these experiments, word features are omitted and we use only the hyperlink data, so as to focus on the value of using latent properties to explain the relation.
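Returning to the IRM's predictive rule from the start of this subsection, the per-cluster empirical proportion can be sketched as follows; the data and cluster assignments below are hypothetical:

```python
# Sketch of the IRM prediction: among observed tuples whose individuals fall
# in clusters (u, v), the fraction whose value equals w.
def irm_predict(data, clusters1, clusters2, u, v, w):
    block = [val for (x, y), val in data.items()
             if clusters1[x] == u and clusters2[y] == v]
    return block.count(w) / len(block)

data = {("x1", "y1"): True, ("x1", "y2"): True,
        ("x2", "y1"): False, ("x2", "y2"): True}
clusters1 = {"x1": 0, "x2": 0}   # hypothetical hard cluster assignments
clusters2 = {"y1": 0, "y2": 0}
print(irm_predict(data, clusters1, clusters2, 0, 0, True))   # 0.75
```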
Models REF, IRM and LRM are used for this experiment; POOL is omitted as it defaults to REF when the ground truth probabilities are unknown (as is the case for real-world data). The IRM and LRM were trained using the same settings as described in the previous experiment. 2000 randomly sampled groups of 200 web pages are generated from the original data, each forming a standalone dataset. In each sample, 90% of data are used for training, and 10% for prediction.

Movie Ratings

We repeat our previous WebKB experiment with the EachMovie dataset4. The rating values are from 1 to 5, but we threshold these labels at the global mean of all ratings to make them Boolean-valued. We take 500 independent sub-samples of the rating data and run independent experiments on each subsample. 90% of ratings in each sample are used for training, whilst 10% are used for testing.

4 http://www.grouplens.org/node/76

3.4.3 Results

Synthetic Relations

The results from our 2000 experiments are binned according to the percentage of observed tuples relative to the maximum number of possible tuples for r(X, Y). That is, the results are binned according to the amount of data available for learning. Figure 3.5 shows a plot of the log-losses of REF, POOL, IRM and LRM, where each point in the graph represents the average log-loss in one bin of experiments using one predictor. Standard error is also shown. Each bin contains 250 to 300 datasets. The outcomes indicate a significant advantage for the LRM and IRM (i.e. lower log-loss). Desirably, the log-loss of both the LRM and IRM improves

[Figure 3.5 appears here: two panels plotting negative log-likelihood against percentage labelled for REF, POOL, IRM and LRM.]

Figure 3.5: Log-loss for the IRM, LRM, and reference class predictions REF and POOL. Losses measured on training data (left) and test data (right) are shown for 2000 sets of simulated data.
Each point in the figure corresponds to an average loss for bins of ≈ 200 datasets (with standard error shown). The bins are sorted in increasing order of the percentage of observed data.

with the number of data points, indicating an ability to exploit information as it becomes available. The reference class approaches REF and POOL, on the other hand, appear to be insensitive to the amount of observed data. The ability of LRMs to minimise loss when more information becomes available suggests that the sufficient statistics encoded in the LRMs are better approximations of those in the underlying model than reference classes. The advantage of the LRM over the IRM is likely due to the soft-clustering nature of LRMs compared to the hard-clustering nature of the IRM.

WebKB

We take the average log-loss over the 2000 experiments with REF, IRM and LRM, shown in Table 3.1.

ALG.   L_train              L_test
REF    0.0216 ± 0.000402    0.0210 ± 0.000549
IRM    0.0179 ± 0.000268    0.0190 ± 0.000475
LRM    0.0181 ± 0.000260    0.0185 ± 0.000431

Table 3.1: Average log-loss over 2000 sampled datasets from the WebKB domain for REF, IRM and LRM on both the training and test sets.

The first observation is that each method performs well overall, scoring low log-losses. (Note that random guessing of probabilities will yield a log-loss of 1, using log of base 2, whilst the best possible score is 0.) This indicates that the sampled datasets present easy prediction problems, due to the sparse linkage patterns in these sets. The fact that the IRM and LRM again achieve significantly (in the statistical sense) better log-loss than REF emphasises the value of modelling latent properties, particularly in these simple data samples where relatively limited information is available. In turn, this suggests that the IRM and LRM are effective in exploiting the information that is available. The separation between LRM and IRM is not significant in this case.
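For reference, the log-loss reported in these tables can be sketched as the average base-2 negative log-likelihood over test cases; random guessing (p = 0.5) scores 1 and a perfect predictor scores 0. The inputs below are hypothetical:

```python
import math

def log_loss(preds, labels):
    """Average negative log2-likelihood of Boolean labels under predicted
    probabilities; lower is better."""
    return -sum(math.log2(p if y else 1.0 - p)
                for p, y in zip(preds, labels)) / len(preds)

print(log_loss([0.5, 0.5], [True, False]))   # 1.0: random guessing
print(log_loss([0.9, 0.1], [True, False]))   # ~0.152: confident and correct
```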
Movie Ratings

Table 3.2 lists the training and test performance of each method, measured in log-loss.

ALG.   L_train             L_test
REF    0.7373 ± 0.00844    0.7340 ± 0.00852
IRM    0.0173 ± 0.00819    0.4359 ± 0.01261
LRM    0.4728 ± 0.00818    0.5372 ± 0.00888

Table 3.2: Average log-loss over 500 sampled datasets from the EachMovie domain for REF, IRM and LRM on both the training and test sets.

The EachMovie dataset contains a more complex relational structure than the WebKB dataset, and the linkage density (e.g. ratings per user) is greater than that of WebKB (links per web page). As such, it represents a more difficult relational prediction problem, which is reflected in the overall increase in losses. The IRM's ability to exploit latent clusters of individuals likely contributes to its superior score, as on average it returned between 3 and 8 clusters of users (and movies), more than any other method tested. The LRM, whose learning algorithm is restricted to modelling two clusters for users and movies, yields higher loss, but still maintains its advantage relative to REF. It is possible that the generative assumptions used in our analyses (which mirror those behind LRMs) may hold in the world, thus contributing to the LRMs' higher accuracy relative to reference classes. However, even though the generative assumptions embodied in LRMs are unlikely to fully reflect the complexities of the true generative model in the world, latent-property models still predict well, highlighting the potential of modelling with latent properties.

3.5 Remarks

This chapter analysed two main classes of relational models in the context of prediction. We showed that relational models representing only observed relations can only entail the correct probability of queries (in the limit of infinite data) under rather restrictive conditions on the underlying generative process. On the other hand, relational models that also model latent properties of individuals, i.e. latent relational models (LRMs), can represent the correct probability in the limit.
Whilst this is true in theory, obtaining such latent-property models requires exact inference for the latent properties, which is infeasible in general domains. The question of whether LRMs yield better predictions in practice is therefore tested empirically. In a synthetic domain where the generative process of the data is that used in our theoretical analysis, we showed that LRMs dominate models with only observed relations. In real-world domains – a website domain and a movie-rating domain – we also observed that LRMs predict significantly better, supporting the notion that latent properties of individuals in fact play a meaningful role in the formation of relations. These observations are corroborated by using the IRM (Kemp et al., 2006), which is a well-known latent-property model for relational data. An additional result relates to the acquisition of reference classes. Whilst we have shown that relational models with only observed relations are implementations of reference classes, we also show that relational models with latent properties of individuals are alternative representations of reference classes that correspond to using disjunctions of equalities in the representation. Further, demonstrating that latent-property models are significantly more accurate in prediction presents an attractive alternative to standard methods for obtaining reference classes.

Chapter 4

Learning Latent Relational Models

The goal of this chapter is to develop a method for learning latent relational models (LRMs). The goal of learning is to estimate the parameters governing probabilistic dependencies in LRMs, and at the same time learn latent properties of each individual. With the ability to estimate LRMs that model dependencies over observed relations as well as latent properties, we can evaluate the question of whether latent properties of individuals are sufficient explanations for the observed relations – as supported by Xu et al.
(2006) – or whether additionally modelling dependencies amongst the observed relations can yield a better model. The general framework of EM (Dempster et al., 1977) can, in theory, address our learning goal. In practice, however, EM incurs prohibitive computational costs for all but the simplest relational domains, as LRMs for these domains represent large probabilistic models with many densely correlated latent random variables. We make a basic approximation that yields a computationally tractable approximate EM framework.

4.1 Estimating LRMs

Grounding a LRM with the individuals of the domain yields a Bayesian network. Every ground atom of each latent property present in the LRM corresponds to a latent random variable in the network. Learning parameters of a LRM that best fit the data requires estimating the marginal posterior distribution over all latent random variables. EM (Dempster et al., 1977) may be applied to this problem; we discuss its limitations below. Algorithms in this chapter will be described at the ground level, i.e. Bayesian networks, where the set of observed random variables is denoted by Y, with corresponding observed values y. Latent random variables are denoted by Z, with instantiations given by z.

4.1.1 Expectation Maximisation

EM (Dempster et al., 1977) is a framework for learning maximum likelihood parameters of probabilistic models with latent random variables. Starting with some initial guess Θ(0), EM iteratively constructs the sequence of parameters Θ(0), Θ(1), . . ., until a local maximum of the expected log-likelihood is reached. The expected log-likelihood is the expectation of the log-likelihood log P(y | Θ) with respect to a marginal function Q over the latent variables. Namely,

L(Θ; y) = log P(y | Θ) = log Σz P(y | z, Θ) P(z | Θ)

Maximising L(Θ; y) is an under-determined problem, as both Θ and the latent marginal distribution P(z | Θ) are unknown. (We denote P(z | Θ) by Q(z | Θ) in the rest of this chapter.)
To find Θ and Q that jointly maximise L(Θ; y), EM begins with an initial guess Θ(0), and at each iteration t a new Q(t)(Z | Θ(t)) is computed (the E step), and using Q(t) a new Θ(t+1) is computed (the M step) which maximises the log-likelihood L(Θ(t+1); y). Neal and Hinton (1998) showed that L(Θ; y) has a lower bound called the variational free energy, F(Q, Θ), with which they explain the behaviour of EM. This lower bound can be derived using Jensen's inequality:

L(Θ; y) = log Σz P(y, z | Θ)
        = log Σz Q(z | Θ) [P(y, z | Θ) / Q(z | Θ)]
        ≥ Σz [Q(z | Θ) log P(y, z | Θ) − Q(z | Θ) log Q(z | Θ)]    (4.1)
        = F(Q, Θ)

Under the variational free energy interpretation, at each iteration t the E step chooses a Q that makes F(Q, Θ) a tight likelihood lower bound. In fact, it is shown that setting Q(t) to the marginal posterior distribution of Z, i.e. Q(t)(Z | Θ(t)) = P(Z | y, Θ(t)), leads to the equality L(Θ(t); y) ≡ F(Q, Θ(t)) (Neal and Hinton, 1998). Next, the M step computes a new Θ(t+1) given Q = Q(t). F(Q, Θ) is convex in Θ given a fixed Q, thus there is a unique maximiser Θ(t+1) for any fixed Q. Repeating the E and M steps is guaranteed to successively increase the expected log-likelihood until a local maximum is reached. In the context of learning LRMs, EM can be applied such that maximum likelihood parameters for the dependencies of the LRM are computed in the M step, whilst posterior estimates of the latent random variables for latent properties of individuals are computed in the E step. We discuss the suitability of EM for learning LRMs next.

4.1.2 EM for LRMs

EM suffers from practical limitations when there are many latent variables in the model. Specifically, storing the posterior distribution Q(Z | Θ) requires space that is exponential in |Z|, and inferring Q in the E step therefore requires computing the same order of probabilities. The same also applies to evaluating L(Θ(t); y).
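The E/M cycle described in Sec. 4.1.1, and its monotonicity guarantee, can be sketched on a toy model small enough for an exact E step: a two-component Bernoulli mixture with mixing weights fixed at 0.5. This is an illustrative stand-in only, not an LRM:

```python
import math

# Toy EM: the E step computes exact responsibilities Q(z | y, theta); the
# M step maximises the expected log-likelihood given Q. The observed-data
# log-likelihood never decreases across iterations.
def em(data, theta, iters=20):
    lls = []
    for _ in range(iters):
        # E step: Q(z = k | y) for each Boolean observation y.
        post = []
        for y in data:
            w = [0.5 * (t if y else 1 - t) for t in theta]
            post.append([wi / sum(w) for wi in w])
        # M step: theta_k <- expected fraction of successes under Q.
        theta = [sum(q[k] * y for q, y in zip(post, data)) /
                 sum(q[k] for q in post) for k in range(2)]
        # Observed-data log-likelihood after the update.
        lls.append(sum(math.log(sum(0.5 * (t if y else 1 - t) for t in theta))
                       for y in data))
    return theta, lls

data = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1]          # hypothetical observations
theta, lls = em(data, [0.6, 0.3])
print(all(b >= a - 1e-9 for a, b in zip(lls, lls[1:])))   # True: monotone
```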
One may exploit conditional independence in graphical models and represent Q as a product of less complex distributions. Such a representation can help reduce storage and computational requirements. However, when the latent random variables are densely correlated, inferring their marginal distributions remains costly. For relational domains, modelling latent properties of individuals can lead to many latent random variables in the ground network, and they are densely correlated. Consider the simple LRM shown in Fig. 4.1(a) (in plate notation), whose ground network has a bipartite structure (Fig. 4.1(b)). Each observed node in the ground network correlates its unobserved parents (i.e. the explaining-away phenomenon in Bayesian networks). The fact that different observed nodes can have common (latent) parents induces a complex network of correlations across all latent parents in the network. For example, latent variables α1(i1), α1(i2), and α2(j3) are intercorrelated by the observed nodes r(i1, j3) and r(i2, j3).

[Figure 4.1 appears here.]

Figure 4.1: (a) a LRM in plate notation, and (b) an illustration of the corresponding ground LRM (Bayesian network) exhibiting a bipartite structure. Observed variables are shaded.

Assuming that all latent variables have k values, computing Q in the E step amounts to computing in the order of O(k^|Z|) probabilities; one per state of Z. The exponential complexity in storage and computation means that the EM algorithm quickly becomes infeasible, which is often the case in relational domains. An approximation is presented in the next section that alleviates this cost.

4.2 An Approximate EM Method for LRMs

Our first step in approximating the EM procedure is to represent the marginal distribution Q(Z | Θ) in the factored form Π_i Q(Zi | Θ). The number of marginal probabilities to compute is now in the order of O(k · |Z|), compared to O(k^|Z|) (where k is the number of values of each variable in Z).
Computing each Q(Zi | Θ) in the E step still requires expensive marginalisation, however. The following describes approximations that yield a tractable E step.

4.2.1 Expectation Step

We compute the distribution Π_i Q(Zi | Θ) in the E step by cycling through each Zi ∈ Z and computing Q(Zi | Θ) independently. Our main assumption for computing Q(Zi | Θ) tractably is given in the following.

Localised Inference

Given parameters Θ and data y, the problem of computing Q(Zi | Θ), where i ∈ {1, . . . , |Z|}, is cast as computing the posterior distribution

P(zi | y, Θ) = Σ_{z−i} P(zi, z−i | y, Θ) = Σ_{z−i} P(zi | z−i, y, Θ) P(z−i | y, Θ)

where z−i = z − zi. Since we assume a factorised representation, then

P(zi | y, Θ) = Σ_{z−i} P(zi | z−i, y, Θ) Π_{j≠i} P(zj | y, Θ)    (4.2)

Note that |z−i| can still be large, and Eq. 4.2 remains a large nested summation. We now apply our main approximation, which restricts the number of nested summations. We assume that P(zi | y, Θ) is computed using only the portion of the ground network representing the Markov blanket of the latent variable Zi. The portion of the network outside of the Markov blanket is artificially pruned, reducing the number of latent variables in the marginalisation, and hence the number of nested summations. The Markov blanket of any variable X in a graphical model is defined as the set of neighbours of X. In Bayesian networks (directed graphical models), the Markov blanket of X is the union of the graph parents of X, the graph children of X, and the parents of each child of X (excluding X). A ground LRM is a Bayesian network, and for any node in the network its Markov blanket can be defined as follows:

Definition 5. Let ground atom ar be some ground instance of relation r, A = par(ar) the parents of ar, S = ch(ar) the children of ar, and C = ∪_{c∈S} par(c) \ {ar} the parents of each child. The Markov blanket of ar is then M(ar) = A ∪ S ∪ C.

The par(·) relation is defined in Ch. 2, Def.
2, and ch(x) = {y : x ∈ par(y)}. Figure 4.2 illustrates the use of a Markov blanket for computing the marginal of α(cs101) in a simple student-course domain. The rest of the network is effectively pruned (greyed out).

[Figure 4.2 appears here.]

Figure 4.2: A Bayesian network localised to α(cs101), showing the nodes {α(cs101)} ∪ M(α(cs101)), where M(α(cs101)) is the Markov blanket of α(cs101). Dark-shaded nodes are observed, whilst light-shaded nodes are fixed to some marginal posterior distribution. The new marginal posterior estimate of α(cs101) is obtained by probabilistic inference in this network.

Aside from computational tractability, the Markov blanket assumption has some intuitive properties for relational modelling. In Fig. 4.2, the Markov blanket of α(cs101) contains the ground atoms most relevant to the target individual cs101, i.e. grades involving cs101, as well as the latent properties of students who took cs101. Also, the Markov blanket directly encapsulates dependencies amongst relations, e.g. the directed arc between likes and pass in Fig. 4.2. Thus, we can compute the latent property of cs101 in a way that accounts for explicit dependencies amongst observed relations, whilst the models proposed by Kemp et al. (2006) and Xu et al. (2006) do not allow for such dependencies. By iteratively computing marginal posterior distributions for each latent variable, statistical correlations are propagated throughout the network. For example, in Fig. 4.2, latent properties of courses outside of the Markov blanket of α(cs101) indirectly correlate with the value of α(cs101). The correlation is made through other latent variables in the Markov blanket, such as those representing latent properties of students who took cs101.

E-step Update

To compute the marginal posterior value of some latent variable Zi, let Y[i] and Z[i] denote the observed variables and latent variables that appear in the Markov blanket M(Zi). For instance, for Zi = α(cs101), we have from Fig.
4.2

Y[i] = (likes(joe, cs101), pass(joe, cs101), pass(sarah, cs101))
Z[i] = (β(joe), β(sarah), likes(sarah, cs101))

Given parameters Θ(t), we update the marginal posterior Q(zi | Θ(t)) by incorporating the Markov blanket assumption into Eq. 4.2. The final update is

Q(zi | Θ(t)) = Σ_{z[i]} P(zi | y[i], z[i], Θ(t)) Π_{zj∈z[i]} Q(zj | Θ(t−1))    (4.3)

where t is the iteration number. Exact methods for probabilistic inference, such as variable elimination (Zhang and Poole, 1996), can be used to compute Eq. 4.3. An interesting note about Eq. 4.3 is that it applies directly to ground instances of observed relations whose value is not observed. Thus, it conveniently provides a way to handle missing data (e.g. likes(sarah, cs101) in Fig. 4.2) on top of clustering when applied to latent properties.

4.2.2 Maximisation Step

In the M step, the LRM parameters Θ are re-calculated using the marginal posterior distributions computed in the E step. At iteration t, each θr ∈ Θ is updated, with θr(t+1)[u, v] given by Eq. 4.4 as follows1:

θr(t+1)[u, v] = #̃D(ParG(r) = u ∧ r = v, Q(t)) / Σ_{v′} #̃D(ParG(r) = u ∧ r = v′, Q(t))    (4.4)

where #̃D(·) is the expected count of the condition (a logical formula) in the parentheses, and is given by

#̃D(ParG(r) = u ∧ r = v, Q(t)) = Σ_{φ∈Γg} Î(rφ = v, Q(t)) Î(π1φ = u1, Q(t)) · · · Î(πkφ = uk, Q(t))    (4.5)

Here, ParG(r) = {π1, . . . , πk} and u = (u1, . . . , uk). Each πi is a relation, and ui is a value of that relation. Letting g denote the formula ParG(r) = u ∧ r = v, Γg is the space of substitutions for the formula, and Î(·) is the soft characteristic function (Eq. 2.3). (General definitions of #̃D(·), Γg and Î can be found in Sec. 2.1.2 of Ch. 2.) Using the example LRM relating to Fig.
4.2, we see that the parameter value θpass(t+1)[(T,T,F), T] is given by

θpass(t+1)[(T,T,F), T] = #̃D(β(S) = T ∧ α(C) = T ∧ likes(S, C) = F ∧ pass(S, C) = T, Q(t)) / Σ_{v∈{T,F}} #̃D(β(S) = T ∧ α(C) = T ∧ likes(S, C) = F ∧ pass(S, C) = v, Q(t))

where T, F are the Boolean values true and false, and the expected counts are

#̃D(β(S) = T ∧ α(C) = T ∧ likes(S, C) = F ∧ pass(S, C) = v, Q(t)) = Σ_{(s,c)} Î(pass(s, c) = v, Q(t)) Î(β(s) = T, Q(t)) Î(α(c) = T, Q(t)) Î(likes(s, c) = F, Q(t))

1 Equation 4.4 gives a maximum likelihood estimate, but maximum a posteriori is possible if pseudo-counts are available.

4.2.3 Likelihood

For the same reason that the E step of EM is computationally intractable, the exact L(Θ; y) is also intractable. We thus require a likelihood approximation. A simple one is the pseudo-likelihood approximation (Besag, 1975). The basic idea underlying pseudo-likelihood is that the marginal likelihood of each observed node in a graph can be computed from its graph neighbours, and that the overall log-likelihood of the data is a summation of the individual marginal log-likelihoods. We can easily construct a pseudo-likelihood by reusing the Markov blankets we have already defined in the E step of our approximate EM method. Namely, we compute the likelihood of groups of random variables, where each group corresponds to the observed variables in each Markov blanket defined. In computing the marginal posterior distribution of a latent random variable using its Markov blanket, the likelihood of the observed nodes in the blanket is also computed in the process. Our pseudo-log-likelihood is obtained by storing and summing the log-likelihoods from all Markov blankets. To illustrate, the marginal posterior probability for any Zi ∈ Z, given by Eq.
4.3, can also be written as

Q(z_i | y, Θ^(t)) = [ Σ_{z_[i]} P(y_[i] | z_[i], z_i, Θ^(t)) Q(z_[i] | Θ^(t)) Q(z_i | Θ^(t)) ] / [ Σ_{z_i} Σ_{z_[i]} P(y_[i] | z_[i], z_i, Θ^(t)) Q(z_[i] | Θ^(t)) Q(z_i | Θ^(t)) ]

where the denominator is the expected likelihood of the observed nodes in the Markov blanket M(Z_i), which we denote P̃(y_[i] | Θ^(t)). Finally, our pseudo-log-likelihood is simply the sum of log-likelihoods from all Markov blankets:

L̃(Θ^(t) | y) = Σ_{i=1}^{|Z|} log P̃(y_[i] | Θ^(t))     (4.6)

An immediate problem with Eq. 4.6, however, is that it may be a poor estimate of the true log-likelihood. A key reason is that random variables appearing in the intersection of overlapping Markov blankets are over-counted. To alleviate this, one can adopt approximations that explicitly account for overlaps, e.g. the region-based free energy approximations of Yedidia et al. (2004), which generalise many well-known approximations such as the mean-field, Bethe and Kikuchi approximations. We empirically test the merits of using Eq. 4.6 in Sec. 4.4, and discuss alternative representations in Sec. 4.5. Finally, having defined the main updates (Equations 4.3 and 4.4) of our approximate EM method for learning LRMs, along with an expression for approximating the likelihood, our approximate EM algorithm is listed in Alg. 1 below.

Algorithm 1: An approximate EM algorithm for learning LRMs
1: Input: Database D
2: Initialise: Θ^(0) ← Θ_0 and ∀r ∈ A, ∀a_r ∈ r where D |= a_r: Q(a_r | Θ^(0)) ← Q_0(a_r)
3: for t = 1, 2, . . . do
4:   (E step)
5:   for all r ∈ A do
6:     for all a_r ∈ r where D |= a_r do
7:       Compute marginal posterior distribution Q^(t)(a_r | Θ^(t))  . . . (Eq. 4.3)
8:     end for
9:   end for
10:  (M step) Update Θ^(t+1) given Q^(t)  . . . (Eq. 4.4)
11:  if |L̃(Θ^(t+1); y) − L̃(Θ^(t); y)| = 0 then
12:    return Θ^(t+1) and Q^(t)
13:  end if
14: end for

4.3 Properties

4.3.1 Convergence

We can show that Alg. 1 produces parameters that monotonically improve the pseudo-log-likelihood. The analysis is similar to that for standard EM (see Sec. 4.1.1). The main step is to obtain the variational lower-bound for the pseudo-log-likelihood (Eq. 4.6), shown as follows via Jensen's inequality:

L̃(Θ; y) = Σ_{i=1}^{|Z|} log Σ_{z_[i], z_i} P(y_[i], z_[i], z_i | Θ)
         = Σ_{i=1}^{|Z|} log Σ_{z_[i], z_i} Q(z_[i], z_i | Θ) [ P(y_[i], z_[i], z_i | Θ) / Q(z_[i], z_i | Θ) ]
         ≥ Σ_{i=1}^{|Z|} Σ_{z_[i], z_i} [ Q(z_[i], z_i | Θ) log P(y_[i], z_[i], z_i | Θ) − Q(z_[i], z_i | Θ) log Q(z_[i], z_i | Θ) ]     (4.7)
         = Σ_{i=1}^{|Z|} F_i(Q, Θ)

The above expression shows that there is a separate variational free energy for each Markov blanket; that is, the marginal likelihood from each Markov blanket has its own variational lower-bound. Updating each marginal posterior estimate Q(Z_i | Θ) in the E step then tightens each bound F_i(Q, Θ) individually, and each bound has its own maximiser. Maximising L̃(Θ; y) in the M step requires computing maximisers Θ_i for each F_i(Q, Θ). However, since Θ defines shared LRM parameters, maximisation via Eq. 4.4 in fact chooses new parameters that represent the average of the individual maximising parameters. The averaged parameter choice will still improve each F_i and leads to monotonic improvement of L̃(Θ; y). Alg. 1 will thus monotonically improve L̃(Θ; y) until a local maximum is reached.

4.3.2 Complexity

Suppose there are n latent variables in the model, all k-valued. The E step of Alg. 1 then computes n k-ary marginal posterior distributions. The E step complexity is thus O(n), compared to O(k^n) in standard EM, in terms of the number of probabilities computed. Note, however, that each univariate marginal posterior computed in our approximate E step may have exponential complexity in its marginalisation, depending on the size of the Markov blankets and the inference algorithm used.
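The control flow of Alg. 1 can be sketched as follows. This is a schematic only: `e_step`, `m_step` and `pseudo_ll` are placeholders for the marginal-posterior computation (Eq. 4.3), the expected-count parameter update (Eq. 4.4), and the Markov-blanket pseudo-log-likelihood (Eq. 4.6); their names are ours, not the thesis's.

```python
def approximate_em(init_theta, e_step, m_step, pseudo_ll, max_iters=100, tol=0.0):
    """Schematic of Alg. 1: alternate marginal-posterior E steps with
    averaged M-step updates until the pseudo-log-likelihood stops improving."""
    theta = init_theta
    prev = float("-inf")
    Q = None
    for _ in range(max_iters):
        Q = e_step(theta)          # one k-ary marginal per latent atom (Eq. 4.3)
        theta = m_step(Q)          # shared-parameter update (Eq. 4.4)
        cur = pseudo_ll(theta, Q)  # pseudo-log-likelihood (Eq. 4.6)
        if abs(cur - prev) <= tol:
            break
        prev = cur
    return theta, Q
```

With a toy concave objective the loop converges to the maximiser, mirroring the monotonic-improvement argument of Sec. 4.3.1.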
Assuming an exact inference algorithm such as variable elimination (Zhang and Poole, 1996), let m be the maximum number of latent variables in any Markov blanket used, and d the maximum dimensionality of conditional probability tables in the LRM. Each marginal posterior inference then has a worst-case complexity of O(m · k^d). The E step therefore has a worst-case complexity of O(n · m · k^d) in terms of summation operations.

4.4 Experiments

4.4.1 Likelihood Optimisation

The goal of this experiment is to assess the effectiveness of Alg. 1 in identifying LRM parameters that yield good likelihood scores. We synthesise small LRMs for generating data, for two reasons: (i) so that the learned parameters can be compared to the (known) true generating parameters, and (ii) so that the pseudo-likelihood score (Eq. 4.6) can be compared to the exact expected log-likelihood (computable due to the small model size).

Data

We create 1000 small LRMs, where each LRM L = ⟨A, G, Θ⟩ has relations A = {r(X, Y), α1(X), α2(Y)}, where r(X, Y) is a Boolean relation with domain Ω_r = D(τ1) × D(τ2) whilst α1(X) and α2(Y) are unary relations with domains D(τ1) and D(τ2) respectively. The graph structure over relations is given by G = {α1 → r, α2 → r}. For each LRM, the numbers of values of α1 and α2 (i.e. |V_α1| and |V_α2|) are randomly chosen between 2 and 10. Domain sizes |D(τ1)| and |D(τ2)| are randomly chosen between 5 and 20. Conditional probability parameters Θ = {θ_α1, θ_α2, θ_r} are also generated randomly, and constitute the ground-truth parameters underlying the data. Data is sampled from each synthetic LRM as follows. We first sample ground atoms for the unary relations: for each x ∈ D(τ1), where the range of α1 is V_α1 = {a_1, . . . , a_N1}, we flip an N1-sided coin with bias θ_α1. A similar procedure is carried out for all y ∈ D(τ2) using α2.
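The unary sampling step above can be sketched in Python; the function name and use of the standard `random` module are our own.

```python
import random

def sample_unary(domain, theta, rng=random):
    """Flip an N-sided coin with bias theta = (p_1, ..., p_N) for each
    individual in `domain`, yielding sampled values of the unary relation."""
    values = range(len(theta))
    return {x: rng.choices(values, weights=theta)[0] for x in domain}
```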
Then, for every pair (x, y) ∈ D(τ1) × D(τ2), where α1(x) = u and α2(y) = v were previously sampled, we first flip a fair coin to determine whether r(x, y) will be observed (i.e. we simulate missingness). If observed, another coin flip generates the observed value of r(x, y) according to the bias θ_r[u, v], where θ_r[u, v] is the conditional probability parameter over values of r given α1 = u ∧ α2 = v. The samples for r(X, Y) represent the data used for learning, which we denote as the dataset D_r; the samples for α1 and α2 are kept hidden. For each of the 1000 datasets, Alg. 1 is run over 100 restarts to estimate Θ for the respective LRM.

Results

We compare the exact and approximate (pseudo) log-likelihoods for both the true and the learned parameters. Exact log-likelihoods are denoted by L(·) whilst approximate log-likelihoods are denoted by L̃(·). True parameters for model i are denoted by Θ*_i and learned parameters by Θ̂_i. Recorded log-likelihoods for each model are shown as a function of increasing value of L(y_i; Θ*_i), where y_i is the data generated from model i, for i = 1, . . . , 1000.

Figure 4.3: Results on simulated datasets from 1000 synthetic LRMs. In all figures shown, L(·) denotes exact log-likelihood, and L̃(·) is the pseudo-log-likelihood (Eq. 4.6). Θ* denotes the true parameters used to generate the data, whilst Θ̂ denotes those learned from the data. The exact likelihoods for the true and learned parameters for each of the 1000 models are plotted in ascending order of L(y_1; Θ*_1), . . . , L(y_1000; Θ*_1000).

The first observation from Fig. 4.3 is that the exact likelihood of the learned parameters corresponds well with that of the true parameters, in spite of the fact that Alg. 1 is formulated to optimise the pseudo-log-likelihood L̃(·).
For the small domains considered here, this result can be expected, as they represent simple optimisation problems; in larger, more complex domains, some degradation in performance can be expected. We can also see that the pseudo-likelihood L̃(y; Θ̂) converges towards the exact likelihood of the true parameters. One possible explanation is that some of our generative models have simpler structures than others. Simple patterns may induce lower over-counting errors in the pseudo-log-likelihood (Eq. 4.6); in such cases, the pseudo-log-likelihood can achieve a good approximation of the exact log-likelihood. We attempt to quantify these effects by characterising the complexity of relational patterns using two features: link density and the number of overlaps. The link density of a relation r(X, Y) is the proportion of ground instances of r(X, Y) that are observed, defined as |D_r|/|Ω_r|, where D_r is the dataset for r(X, Y) and Ω_r is the domain of r(X, Y). The number of overlaps, for some ground atom of a latent property, is defined as the number of Markov blankets in which that atom appears. For example, suppose α1(x) is correlated with α2(y_1), . . . , α2(y_n) in the ground network via the observed nodes r(x, y_1), . . . , r(x, y_n). Then α1(x) will appear in the Markov blankets of α2(y_1), . . . , α2(y_n), resulting in an overlap count of n; this is the number of times that α1(x) will be over-counted. We define the overlaps measure of a model as the average number of overlaps over all ground atoms of latent properties in the model. In Fig. 4.4 we show that the overlap feature has a correlation coefficient of almost −1 with the exact log-likelihood: models with lower overlap counts lead to higher log-likelihoods, and vice versa. Link density, on the other hand, appears to be a weaker indicator.
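Both complexity features can be computed directly from the observed pairs: link density is the fraction of ground instances observed, and each observed pair r(x, y) contributes one Markov-blanket membership to α1(x) and one to α2(y). A minimal sketch (function names are our own):

```python
from collections import Counter

def link_density(observed_pairs, dom1_size, dom2_size):
    """|D_r| / |Omega_r|: proportion of ground instances of r that are observed."""
    return len(observed_pairs) / (dom1_size * dom2_size)

def average_overlap(observed_pairs, dom1, dom2):
    """Average, over all latent atoms a1(x) and a2(y), of the number of
    Markov blankets each atom appears in (one per observed pair it joins)."""
    cx = Counter(x for x, _ in observed_pairs)
    cy = Counter(y for _, y in observed_pairs)
    total = sum(cx.get(x, 0) for x in dom1) + sum(cy.get(y, 0) for y in dom2)
    return total / (len(dom1) + len(dom2))
```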
Given the strong correlation between the exact log-likelihood and the overlap count, we expect over-counting errors in the pseudo-log-likelihood to diminish as the number of overlaps diminishes. The results of Fig. 4.3 are redrawn in Fig. 4.5 as a function of decreasing values of link density and overlap count separately; these results confirm our expectations.

Figure 4.4: Exact log-likelihood of true generating parameters as a function of link density (left) and average overlap count (right).

4.4.2 Systems of Relations

In this experiment we compare relational models that model (i) latent properties only, (ii) dependencies only, and (iii) both latent properties and dependencies. The main question is whether latent properties alone are sufficient to explain relational data (as argued for by Xu et al. (2006)), or whether the inclusion of dependencies amongst relations can further improve accuracy. We use a simple experimental set-up consisting of two observed relations: the target relation and the co-relation. We evaluate prediction on the target relation, whilst assessing the value of conditioning on the co-relation in addition to latent properties. We examine co-relations that are weakly, moderately, and strongly informative of the target relation (as measured by mutual information). This allows us to quantify the value of modelling dependencies in addition to latent properties. We describe our data, models and experimental protocol below.

Data

We use a reduced version of the dimensionality of nations dataset (Rummel, 1999) used by Kemp et al. (2006), where the domain consists of 14 countries, with 56 different Boolean two-place relations observed amongst them. Each relation has the form r(X, Y), where the logical variables X, Y denote nations.
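The mutual information between two Boolean relations, used below to rank co-relations, can be computed from their joint counts. A minimal sketch (the demo counts in the test are illustrative; we use base-2 logarithms, though the thesis does not state the base):

```python
import math

def mutual_information(joint_counts):
    """I(rt; rc) from a dict {(u, v): count} over u, v in {False, True}."""
    total = sum(joint_counts.values())
    p = {uv: c / total for uv, c in joint_counts.items()}
    pu = {u: sum(c for (a, _), c in p.items() if a == u) for u in (False, True)}
    pv = {v: sum(c for (_, b), c in p.items() if b == v) for v in (False, True)}
    mi = 0.0
    for (u, v), puv in p.items():
        if puv > 0:
            mi += puv * math.log(puv / (pu[u] * pv[v]), 2)
    return mi
```

Independent relations score 0; two perfectly correlated balanced Boolean relations score 1 bit.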
We choose one target relation from the 56 available relations, and three co-relations that are weakly, moderately, and strongly informative of the target relation. The target relation is chosen to be the relation with the highest entropy value. (A maximum entropy value of 1 indicates an even distribution of positive labels (trues) and negative labels (falses); an entropy value of 0 indicates that all observed labels have the same value.) In this case, the best target relation is "intergovernmental organizations" (intergov), which represents the establishment of organizations between governments.

Figure 4.5: Results on 1000 simulated models, as a function of link density (top) and average overlap count (bottom).

Co-relations are chosen using the mutual information measure with the target relation,

I(r_t; r_c) = Σ_{u,v ∈ {F,T}} P̂(r_t = u ∧ r_c = v) log [ P̂(r_t = u ∧ r_c = v) / (P̂(r_t = u) P̂(r_c = v)) ]

where r_t and r_c denote the target relation and co-relation respectively, and

P̂(r_t = u ∧ r_c = v) = #̃_D(r_t = u ∧ r_c = v) / Σ_{u',v' ∈ {F,T}} #̃_D(r_t = u' ∧ r_c = v')

Three co-relations are chosen: "sever diplomatic relations" (sever), "non-government organisations" (ngo1), and a second non-government organisations relation (ngo2). Their mutual information measures with intergov are shown in Table 4.1.

r_1     r_2        I(r_1; r_2)
sever   intergov   5 × 10^−5
ngo1    intergov   0.63
ngo2    intergov   0.17

Table 4.1: Mutual information scores for each co-relation (sever, ngo1, ngo2) taken with the target relation intergov.

These values show that sever is the least informative of intergov, whilst ngo1 is the most informative.

Models

We learn different LRMs using the given data, as depicted in Fig.
4.6, where target(X, Y ) denotes the target relation, and co-reln(X, Y ) denotes the co-relation. The first LRM (Fig. 4.6 top-left) uses only the target relation and a latent property α. We denote this LRM by Lc . The second LRM (Fig. 4.6 top-right) uses both the target and co-relation with α, and is denoted by Lc+ . Finally, a dependency between the target and the co-relation is modelled, which we call Lcd (Fig. 4.6 bottom). (Here, subscript c indicates the presence of a latent property 90 Figure 4.6: Three LRMs in plates notations: a single-relation latent-property model (top-left), a multiple-relation latent-property model (top-right), and multiple-relation latent-property model with a dependency link between the co-relation and the target relation (bottom). for clustering, + indicate more than one observed relation, and d indicates the dependency between the observed relations). The latent property α is assumed to have a fixed number of values, i.e. |Vα | is fixed. A special case occurs when |Vα | = 1, where α can effectively be removed 91 from the model, resulting in LRMs that contain only the two observed relations. In our experiments, we obtain LRMs for different number of values of |Vα |. We also use the infinite relational model (IRM)3 due to Kemp et al. (2006), which is a nonparametric Bayesian hierarchical model for automatically generating an a priori unknown number of object clusters from multiple observed relations. It is one of the few existing proposals that allow clustering over multiple observed relations. Another well-known example is the IHRM (Xu et al., 2006), which defines a similar model to the IRM. We use the IRM as a representative model of this class. (Note that of the three LRMs described, Lc has the closest description the IRM.) Protocol Since we have three co-relations to consider, the experimental procedure described below applies for each co-relation. Alg. 
1 over 50 random restarts is used to train our LRMs, whilst 50 IRMs are trained using MCMC with the default parameters of the software accompanying Kemp et al. (2006). Five-fold cross-validation is run, where 20% of the observed cases of the target relation intergov are held out for prediction whilst each model is trained on the remaining data; all data for the co-relation are used during training. Whilst the IRM automatically determines the number of clusters, for LRMs we search over the number of values |V_α| from 1 to 10, repeating the cross-validation procedure for each number. Accuracy is measured in log-loss, and we record the average log-loss over the folds of the cross-validation for both the training and test sets.

Results

The first experiment uses sever as the co-relation; the experimental procedure described in the previous section produced the results shown in Fig. 4.7 for the LRMs tested.

Figure 4.7: 5-fold cross-validated log-loss in predicting the intergov relation using the LRMs shown in Fig. 4.6. Relation sever is used as the co-relation in this experiment. Results on training data (top) and test data (bottom) are shown. The horizontal axis corresponds to different numbers of values of the latent property α, and error bars indicate standard error. Lower log-loss means greater accuracy.

The low mutual information value between sever and intergov (5 × 10^−5) means that conditioning on sever is unlikely to benefit predictions about intergov. This is seen at |V_α| = 1, where the latent property α is essentially absent: all models score poorly (log-loss ≈ 1). (The log-loss being close to 1 is due to the high self-entropy of the target relation intergov; a log-loss of 1 corresponds to random guessing.) The results indicate that this is the case on both training and test data.
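The log-loss used throughout these experiments can be computed as follows; we use the base-2 logarithm, consistent with the observation above that a log-loss of 1 corresponds to random guessing on a balanced Boolean relation.

```python
import math

def log_loss(predictions, labels):
    """Average negative log2-probability assigned to the true labels.
    predictions[i] is the predicted probability that labels[i] is True."""
    assert len(predictions) == len(labels)
    total = 0.0
    for p_true, label in zip(predictions, labels):
        p = p_true if label else 1.0 - p_true
        total += -math.log2(p)
    return total / len(labels)
```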
When the latent property α is utilised (|V_α| > 1), significant reductions in log-loss are achieved, indicating that the latent property explains the intergov data better than sever does. The differences between L_c, L_c+, and L_cd for |V_α| > 1 appear insignificant, however, suggesting that informative latent properties were fitted in all three models. A noticeable trend is that L_c degrades as |V_α| increases, likely due to the corresponding increase in the difficulty of optimising latent properties.

Now using ngo2 as the co-relation, where the mutual information value between ngo2 and intergov is higher at 0.17, we repeat the experiment and obtain the results shown in Fig. 4.8, which shows similar trends to Fig. 4.7, except that ngo2 directly yields improvements in accuracy, seen at |V_α| = 1, where L_cd achieves a lower log-loss.

Figure 4.8: 5-fold cross-validated log-loss in predicting the intergov relation using the LRMs shown in Fig. 4.6. Relation ngo2 is used as the co-relation in this experiment. Results on training data (top) and test data (bottom) are shown. The horizontal axis corresponds to different numbers of values of the latent property α. Error bars indicate standard error.

Finally, using co-relation ngo1 (with mutual information of 0.63), we observe significant changes in behaviour. Consider the results shown in Fig. 4.9.

Figure 4.9: 5-fold cross-validated log-loss in predicting the intergov relation using the LRMs shown in Fig. 4.6.
Relation ngo1 is used as the co-relation in this experiment. Results on training data (top) and test data (bottom) are shown. The horizontal axis corresponds to different numbers of values of the latent property α. Error bars indicate standard error.

Figure 4.9 shows that explicitly modelling the dependency between the target relation and the co-relation yields noticeable improvements, due to the informativeness of ngo1 (seen at |V_α| = 1). Furthermore, increasing |V_α| continues to improve log-loss on both training and test data, evidence of the increasing value of latent properties. Models with no explicit dependency between intergov and ngo1 (i.e. L_c and L_c+) achieve their best log-loss of ≈ 0.6 on training data and ≈ 0.65–0.7 on test data (this is also true when using the other co-relations). This suggests that latent properties alone may not be sufficient to model the information provided by the co-relation.

Finally, we compare the log-loss of the best IRM on the datasets corresponding to the different co-relations with the log-loss of the best LRMs in these experiments. The results are shown in Table 4.2.

TRAIN
co-relation   IRM                 L_c                 L_c+                  L_cd
none          0.796 ± 0.056 (4)   0.553 ± 0.027 (3)   –                     –
sever         0.819 ± 0.010 (3)   –                   0.546 ± 0.029 (10)    0.550 ± 0.029 (3)
ngo2          0.962 ± 0.024 (4)   –                   0.544 ± 0.029 (7)     0.549 ± 0.029 (10)
ngo1          0.836 ± 0.016 (5)   –                   0.556 ± 0.030 (5)     0.477 ± 0.025 (10)

TEST
co-relation   IRM                 L_c                 L_c+                  L_cd
none          0.858 ± 0.116 (4)   0.636 ± 0.059 (3)   –                     –
sever         0.854 ± 0.017 (3)   –                   0.630 ± 0.062 (10)    0.635 ± 0.062 (3)
ngo2          1.001 ± 0.041 (3)   –                   0.648 ± 0.0615 (3)    0.664 ± 0.061 (10)
ngo1          0.834 ± 0.031 (5)   –                   0.646 ± 0.065 (6)     0.568 ± 0.055 (7)

Table 4.2: Five-fold cross-validated log-loss (with standard errors) from prediction on training and test sets of the target relation intergov. Models tested are: the IRM, and the LRMs L_c, L_c+, and L_cd. For each loss reported, the corresponding value of |V_α| (e.g. the number of clusters) is also shown.
All losses for the LRMs are the best results achieved over the different numbers of clusters, obtained from the data underlying Figs. 4.7, 4.8 and 4.9. Lower values indicate better performance.

4.5 Remarks

In this chapter we have presented a method for learning relational models that represent dependencies amongst observed and latent properties. The algorithm is an approximation of EM, and avoids the main computational issues that EM incurs. Convergence and complexity of the proposed algorithm were also discussed, and simulations showed that the algorithm is effective in learning models that achieve good likelihood scores.

The algorithm was then used to answer one of the main questions of this thesis: whether modelling dependencies as well as latent properties can yield more accurate models than modelling either alone. From experiments on real-world data, we showed that (i) latent properties can help improve model accuracy, particularly when observed relations are not mutually informative; (ii) when there exist relations that are mutually informative, it is advantageous to explicitly model dependencies amongst them; and (iii) the added presence of latent properties can help further improve accuracy, and does not lead to over-fitting. We have confirmed that modelling with latent properties yields better predictive accuracy (on training as well as test data) than models based on observed properties alone. More importantly, we demonstrated that the combination of latent properties and dependencies further improves model accuracy.

Chapter 5

Learning with Uncertainty about Size of Latent Properties

In Ch. 4 we assumed that latent properties have an a priori known number of values. In this chapter, we relax this assumption and allow the number of values for latent properties to be a priori unknown and unbounded.
To express our uncertainty over the number of values (which we refer to as the size) of a latent property, we consider a distribution over the unbounded set of possible sizes. For instance, for a latent property whose size may be any positive integer, we may use a Poisson distribution as the prior distribution over sizes. There are existing proposals that address the problem of unbounded sizes of latent properties in relational models, exemplified by the IHRM (Xu et al., 2006) and the IRM (Kemp et al., 2006). These models draw on results in Bayesian nonparametrics (Ghosh and Ramamoorthi, 2003) and use stochastic processes as prior distributions over sizes. Stochastic processes such as the Dirichlet process (DP) (Ferguson, 1973) or the CRP (Aldous, 1983) stochastically generate Dirichlet distributions whose number of classes is a priori undetermined; running the process infinitely many times produces all Dirichlet distributions. The bias of DPs and CRPs is towards smaller numbers of classes. Inference in joint probability models involving stochastic-process priors typically proceeds by Monte Carlo sampling.

In this chapter we propose an extension of LRMs, called the infinite-size latent relational model (ILRM), which explicitly represents the distribution over the size of latent properties. We develop a learning procedure for ILRMs that searches over the sizes of latent properties, and at each search step learns parameters for a standard LRM. A main feature of the search approach is that error bounds on the likelihood of data at each step can be calculated (derived in Sec. 5.3.1). In other words, where the exact likelihood of data is obtained by averaging over the (infinite) space of models, our search-based approach computes error bounds on the likelihood associated with having searched only a finite number of models. The error diminishes as we evaluate more of the model space.
The proposed procedure is an alternative to stochastic-process based representations estimated via Monte Carlo sampling. The likelihood bounds also directly lead to bounds on the posterior over latent property sizes, which in turn yield an approximate Bayesian averaging scheme for prediction. Bayesian averaging involves averaging over the posterior-weighted predictions of all models, leading to robust predictions that are sometimes more accurate than single-model predictors (Hoeting et al., 1999). The ability to perform averaging and obtain bounds indicating the quality of the averaged prediction extends existing work. In addition, our formulation permits a flexible choice of prior distributions over latent property sizes, where previous proposals are fixed to the CRP (Kemp et al., 2006; Xu et al., 2006). Experiments in Sec. 5.5 demonstrate the error bounds obtained, and apply Bayesian averaging for prediction. On a real-world dataset used in Ch. 4, we show that our approximate Bayesian predictions with ILRMs perform competitively with the best LRMs found in Ch. 4. A key observation is that a prior distribution not restricted to favouring smaller-sized latent properties led to the best averaged predictions.

5.1 Nonparametric Relational Models

Existing relational models that represent latent properties and uncertainty about the size of latent properties generally rely on stochastic processes such as the CRP. Well-known examples include the infinite relational model (IRM) (Kemp et al., 2006) and the infinite hidden relational model (IHRM) (Xu et al., 2006). The underlying semantics of these models is that latent properties of individuals generate relations amongst individuals. Latent properties often correspond to clusters, and a stochastic process is used to stochastically generate clusters of domain individuals together with the prior probabilities of the clusters. Since relations are not used to generate other relations in the IRM/IHRM, these models are hierarchical Bayesian models.
For a simple domain consisting of only one relation r(A, B), where A and B are of the same type, an IRM corresponds to the probabilistic model

∀X:  α(X), S | γ ∼ CRP(γ)
∀A, ∀B:  r(A, B) | α(A), α(B) ∼ p(r(A, B) | α(A), α(B))     (5.1)

where α(X) is a latent property, and S represents the size (number of values) of α, which can be unbounded. This means that for any pair of individuals (a, b), the value of r(a, b) is dependent on the values of α(a) and α(b). The size and the prior distribution (a Dirichlet distribution) over the values of α are drawn from the CRP with concentration parameter γ. (In their original paper, Kemp et al. (2006) specify the exact form of the conditional distribution as well as a prior distribution for r(X, Y); we describe the model in general form and omit the prior to focus on the role of the CRP.) The CRP generates Dirichlet prior distributions biased towards smaller numbers of classes.

The alternative model we consider in this thesis is one that generates the value of S, followed by the values of α; then, for all pairs (a, b), the value of relation r(a, b) is generated from α(a) and α(b). This generative model is described as follows:

S | γ ∼ φ(γ)
∀X:  α(X) | S ∼ Q(α(X) | S)
∀A, ∀B:  r(A, B) | α(A), α(B) ∼ p(r(A, B) | α(A), α(B))     (5.2)

Here, φ represents our prior distribution over S, where prior distributions such as the geometric or Poisson distribution may now be considered instead of the CRP. For some S = s, α has s values, and the distribution Q is defined over those values. Note that a CRP generates Dirichlet classes sequentially and weighs earlier-generated classes more heavily; our approach instead allows one to prefer each class equally, e.g. a symmetric Dirichlet distribution over the values of α can be used. For example, if a Dirichlet distribution p(α) with classes 1 . . . k is generated by a CRP, then for any pair i ≤ j it follows that P(α = i) ≥ P(α = j).
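The decoupled generative model of Eq. 5.2 can be sketched as follows. The concrete choices here are illustrative, as the text notes that φ and Q are free: we take φ to be 1 + Poisson(γ) (so the size is a positive integer) and Q to be uniform over the s values of α; all function names are our own.

```python
import math
import random

def sample_poisson(rng, lam):
    """Knuth's multiplicative method for sampling a Poisson(lam) variate."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_ilrm(individuals, gamma=2.0, seed=0):
    """Sample from Eq. 5.2: first S ~ phi, then alpha(x) | S for each
    individual, then r(a, b) | alpha(a), alpha(b) for every pair."""
    rng = random.Random(seed)
    s = 1 + sample_poisson(rng, gamma)                  # S | gamma ~ phi(gamma)
    alpha = {x: rng.randrange(s) for x in individuals}  # uniform Q over s values
    # conditional table: P(r(a, b) = True | alpha(a) = u, alpha(b) = v)
    theta = [[rng.random() for _ in range(s)] for _ in range(s)]
    r = {(a, b): rng.random() < theta[alpha[a]][alpha[b]]
         for a in individuals for b in individuals}
    return s, alpha, r
```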
When S is fixed, the model can be represented as a LRM in which all relations have known sizes, and can be learned by our approximate EM algorithm presented in Ch. 4. With respect to the generative model in Eq. 5.2, the IRM couples the process of generating the size of α with that of generating the distribution Q over the values of α for each individual. By decoupling the two steps – generating S, then generating α given S – Eq. 5.2 allows different choices for φ and Q. In the next section we define an extension of LRMs that represents uncertainty over configurations.

5.2 Infinite-size LRMs

The general learning approach presented in this chapter involves searching over the sizes of the latent properties of a LRM, and at each step of the search learning the parameters of the corresponding fixed-size LRM. For a LRM whose latent properties have sizes represented in a vector C, an assignment of values to C is a configuration. Definitions relating to configurations are as follows.

Definition 6. Configuration ordering. Given two configurations a = {a_1, . . . , a_m} and b = {b_1, . . . , b_m}, a ⪯ b if ∀i ∈ {1 . . . m}, a_i ≤ b_i.

Definition 7. Configuration space. Given a configuration u, the configuration space induced by u is S_u = {c : c ⪯ u}.

We augment the LRM with a random vector representing the sizes of latent properties, where instantiations of the vector yield configurations. The augmented representation is called an ILRM, defined as follows.

Definition 8. Infinite-size latent relational models. An infinite-size latent relational model (ILRM) is a tuple ⟨V, G, Θ⟩, where V = (R, R_I, S). R contains relations whose sizes are known a priori, R_I those whose sizes are not known a priori, and the random variables S represent the sizes of relations in R_I. G is a DAG over V. The parents of any S ∈ S can only consist of other variables in S. The parents of any member of R_I must include exactly one size variable from S.
If R ∈ R_I is a parent of some relation T, then the size parent of R is also a parent of T. Θ is a set of conditional probability parameters for each member of V. (Note that the size of Θ is a priori unbounded; thus this forms a nonparametric model.) We make Θ explicit by adding it to the right-hand side of conditional probabilities: for each V ∈ V there is a CPD, p(V | Par_G(V), Θ), of V given its parents Par_G(V), parametrised in Θ.

An ILRM represents the joint distribution over all groundings of V. Let A = R ∪ R_I, and let

W_A = ∪_{A ∈ A} {Aσ : σ ∈ Γ_A}

be the set of all ground atoms of relations in A, where Γ_A is the substitution space of relation A. Then, letting W_V = W_A ∪ S represent all random variables in a ground ILRM, the joint distribution represented by the ground ILRM is

p(W_V | Θ) = Π_{S ∈ S} p(S | Par_G(S), Θ) · Π_{A ∈ W_A} p(A | Par_G(A), Θ)     (5.3)

5.3 Learning

We describe how search is done in the (unbounded) space of configurations when learning ILRMs, and how to derive bounds on the likelihood at each search step; we discuss the bounds first.

5.3.1 Likelihood Bounds

For purposes of exposition, we use the following notation for the rest of this chapter. Let Z be the set of all latent random variables in a ground LRM (i.e. the ground atoms corresponding to latent properties and the unobserved cases of observed relations), and Y the set of all observed random variables (the observed ground instances of relations). Observations are given by y. The likelihood of the data, P(y | Θ), is obtained by marginalising over the size variables. The marginal is an infinite sum in which each summand corresponds to a size configuration. Our approach is to enumerate only a finite number of summands in the marginal, and to quantify the approximation error with bounds on the likelihood.
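The configuration-space notions of Definitions 6 and 7, used throughout the bounds below, translate directly into code. A minimal sketch (sizes are assumed to start at 1, as in the |V_α| search of Ch. 4):

```python
from itertools import product

def preceded_by(a, b):
    """Configuration ordering (Def. 6): a <= b componentwise."""
    return len(a) == len(b) and all(ai <= bi for ai, bi in zip(a, b))

def configuration_space(u):
    """Configuration space induced by u (Def. 7): all c with c <= u,
    each size ranging over 1..u_i."""
    return list(product(*(range(1, ui + 1) for ui in u)))
```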
Our first step towards a likelihood bound is to show that the likelihood given a Bayesian network with configuration s′, written P(y | θ*_s′), is lower-bounded by that of one with a lesser configuration s ⪯ s′. We prove this lower-bound for the case that optimal parameters θ*_c ∈ Θ are available given any configuration c:

θ*_c = argmax_{θ ∈ Θ} P(y | θ, c)     (5.4)

Proposition 1. Given LRMs whose latent properties have sizes given by configurations s and s′ respectively, if s ⪯ s′ and y represents the observations, then

P(y | θ*_s, s) ≤ P(y | θ*_s′, s′) ≤ 1     (5.5)

where θ*_s and θ*_s′ are given by Equation 5.4.

Proof. Firstly, for any configuration s and parameters θ_s,

P(y | θ_s, s) ≤ P(y | θ*_s, s)     (5.6)

where θ*_s is the likelihood-maximising parameter (Equation 5.4). For any two configurations s and s′ such that s ⪯ s′, the space of LRMs with configuration s is a subset of that with configuration s′. Thus, it is always possible to choose parameters for an LRM L′ with configuration s′ that yield a likelihood at least that of an LRM L with configuration s, namely

P(y | θ*_s, s) ≤ P(y | θ_s′, s′)

Applying Equation 5.6 to P(y | θ_s′, s′) then yields the lower-bound of Equation 5.5,

P(y | θ*_s, s) ≤ P(y | θ*_s′, s′)

The upper-bound of Equation 5.5 is true by definition of probability.

A clustering analogy is illustrative here. Proposition 1 states that the best n-cluster model is always in the space of (n+1)-cluster models; thus, the best (n+1)-cluster model must score at least as well as the best n-cluster model. Using Proposition 1, we next derive upper and lower bounds for the likelihood P(y | Θ), which is given exactly by the infinite sum

P(y | Θ*) = Σ_{s ∈ dom(S)} P(y | θ*_s, s) P(s | θ*_s)     (5.7)

with θ*_s ∈ Θ* given by Equation 5.4. The bounds of interest are given by Lemma 1 below.

Lemma 1. Given a fixed configuration s̄ where ∀c ∈ s̄, c < ∞,
P(y) satisfies

    f + P(y | θ*_s̄, s̄)·g ≤ P(y | Θ*) ≤ f + g    (5.8)

where

    f = Σ_{s ∈ C_s̄} P(y | θ*_s, s) P(s | θ*_s)
    g = Σ_{s′ ∈ dom(S)\C_s̄} P(s′ | θ*_{s′})

Proof. We first rewrite Equation 5.7 as

    P(y | Θ*) = Σ_{s ∈ C_s̄} P(y | θ*_s, s) P(s | θ*_s) + Σ_{s′ ∈ dom(S)\C_s̄} P(y | θ*_{s′}, s′) P(s′ | θ*_{s′})    (5.9)

Here the first summation involves configurations in C_s̄, the configuration space induced by s̄. The second summation, denoted W′, involves the (unbounded) set of configurations not in C_s̄. Since s̄ ∈ C_s̄, it follows that ∀s′ ∈ dom(S)\C_s̄, s̄ ⪯ s′. Thus, using Proposition 1, for all s′ ∈ dom(S)\C_s̄,

    P(y | θ*_s̄, s̄) ≤ P(y | θ*_{s′}, s′) ≤ 1

Substituting these bounds into the infinite sum W′ yields

    P(y | θ*_s̄, s̄) · Σ_{s′ ∈ dom(S)\C_s̄} P(s′ | θ*_{s′}) ≤ W′ ≤ Σ_{s′ ∈ dom(S)\C_s̄} P(s′ | θ*_{s′})    (5.10)

Applying Equation 5.10 directly to Equation 5.9 finally yields the bounds expressed by Equation 5.8. ∎

Equation 5.8 has the desirable property that the upper and lower bounds converge on P(y | Θ*) as s̄ expands: as C_s̄ grows with increasing s̄, f in Equation 5.8 approaches P(y) and g approaches 0.³ These likelihood bounds can then be used to bound the posterior of the size variables:

Lemma 2. Let s̄ be some fixed configuration where for all k ∈ s̄, k < ∞. Then for any configuration s,

    P(y | θ*_s, s) P(s | θ*_s) / (f + g) ≤ P(s | θ*_s, y) ≤ P(y | θ*_s, s) P(s | θ*_s) / (f + P(y | θ*_s̄, s̄)·g)    (5.11)

where the terms f and g are stated in Lemma 1.

Proof. Using Bayes' rule,

    P(s | θ*_s, y) = P(y | θ*_s, s) P(s | θ*_s) / P(y | Θ*)    (5.12)

and applying Lemma 1 to the denominator P(y | Θ*) directly yields Equation 5.11. ∎

5.3.2 Non-optimal Parameters and Asymptotic Errors

So far, Lemmas 1 and 2 assume that optimal parameters θ*_s are available for any size configuration s. Here we discuss the case where non-optimal parameters are used instead. Often θ*_s is expensive to calculate.

³g approaches 0 because of the diminishing tail mass of the prior distribution as s̄ expands.
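As a concrete numeric sketch of Lemmas 1 and 2 (our own illustration, not from the thesis: the per-size likelihood values are invented, and a single size variable with a geometric prior is assumed):

```python
# Illustrative sketch of the likelihood and posterior bounds of
# Lemmas 1 and 2 for a single size variable S with a geometric prior.
# The per-size likelihoods P(y | theta*_s, s) are made up; in the
# thesis they come from learned LRM parameters.

def geometric_prior(s, lam=0.5):
    """P(S = s) = lam * (1 - lam)^(s - 1), for s = 1, 2, ..."""
    return lam * (1 - lam) ** (s - 1)

def likelihood_bounds(liks, lam=0.5):
    """liks[s-1] = P(y | theta*_s, s) for enumerated sizes s = 1..k.
    Returns (lower, upper) bounds on P(y | Theta*) per Lemma 1."""
    k = len(liks)
    f = sum(liks[s - 1] * geometric_prior(s, lam) for s in range(1, k + 1))
    # g is the tail mass of the prior beyond the enumerated configurations:
    g = 1.0 - sum(geometric_prior(s, lam) for s in range(1, k + 1))
    lower = f + liks[k - 1] * g   # tail likelihoods >= P(y | theta*_sbar, sbar)
    upper = f + g                 # tail likelihoods <= 1
    return lower, upper

def posterior_bounds(liks, s, lam=0.5):
    """Lemma 2: bounds on P(S = s | y) for an enumerated size s."""
    lower, upper = likelihood_bounds(liks, lam)
    num = liks[s - 1] * geometric_prior(s, lam)
    return num / upper, num / lower

liks = [0.02, 0.05, 0.08, 0.081]   # hypothetical P(y | theta*_s, s), s = 1..4
lo, up = likelihood_bounds(liks)
pl, pu = posterior_bounds(liks, 3)
```

As the number of enumerated sizes grows, the prior tail mass `g` shrinks and the two bounds close in on each other, mirroring the convergence property noted above.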
The main bottleneck lies in computing the posterior distribution p(Z | Y, S), where Z represents many densely correlated latent variables. Approximate inference procedures such as loopy belief propagation (Kschischang et al., 2001; Yedidia et al., 2003) can be applied in these situations to efficiently acquire an approximation of θ*_s. Approximating the latent posterior distribution leads to discrepancies in the likelihood bounds of Lemma 1. Specifically, any choice of θ_s satisfies P(y | θ_s, s) ≤ P(y | θ*_s, s). By rewriting P(y | θ_s, s) as P(y | θ*_s, s) − Δ_s, where Δ_s ≥ 0 is some unknown error, we obtain an approximation f̃ of the finite sum f in Lemma 1:

    f̃ = Σ_{s ∈ C_s̄} P(y | θ_s, s) P(s | θ*_s)
      = Σ_{s ∈ C_s̄} (P(y | θ*_s, s) − Δ_s) P(s | θ*_s)
      = f − Σ_{s ∈ C_s̄} Δ_s P(s | θ*_s)    (5.13)

As C_s̄ expands, f converges to P(y | Θ*) whilst f̃ converges to P(y | Θ*) − Σ_{s ∈ C_s̄} Δ_s P(s | θ*_s). Thus, using f̃, the upper and lower bounds of P(y | Θ*) converge to P(y | Θ*) − Σ_{s ∈ C_s̄} Δ_s P(s | θ*_s) in the limit. The error term Σ_{s ∈ C_s̄} Δ_s P(s | θ*_s) can be seen as an average of likelihood errors over the configurations enumerated. Devoting more time to parameter learning at each step of the configuration search, e.g. with EM (expectation maximisation (Dempster et al., 1977)) or approximate EM, will reduce Σ_{s ∈ C_s̄} Δ_s P(s | θ*_s) and hence the error of our bounds in the limit. Quantifying the error in the limit is as challenging a problem as quantifying the errors of approximate inference algorithms, and thus remains an open question in this work.

5.3.3 Proposed Algorithm

Given data, we learn ILRM parameters by search, and assume that the DAG is given as input. Algorithm 2 lists our search-based procedure. The algorithm can be terminated at any time and return error bounds, e.g. if some run-time limit is exceeded or some prescribed error-bound width is reached.
Algorithm 2: ILRM-Search
 1: Input:
 2:   V = (R, R_I, S) – relations
 3:   G – a DAG over V
 4:   y – observations
 5: Initialise:
 6:   Q = {(1, 1, …, 1)}  // configuration queue
 7:   Θ = ∅               // parameters
 8:   H = ∅               // likelihoods
 9: repeat
10:   Select s ∈ Q
11:   Learn parameters θ_s of the ILRM with configuration s
12:   Evaluate likelihood P(y | θ_s, s), add it to H
13:   Compute error bounds b = (l, u) using H (Lemma 1)
14:   Add θ_s to Θ
15:   Add successors(s) to Q
16: until termination
17: return Θ, b

Continuing the search will further reduce the error. For the simple case where there is only one size variable S in the IBN, Algorithm 2 can be described as a line search, depicted by Figure 5.1, where search proceeds from S = 1 to S = k. Each shaded node abstracts steps 11 to 15 of Algorithm 2.

Figure 5.1: Search over a single size variable.

Figure 5.2: Search over two size variables.

Figure 5.1 shows a search tree over configurations containing only one unknown size variable S. The error from not expanding the infinite sub-tree for S > k is quantified by error bounds (Section 5.3.1). Figure 5.2 depicts a search over a bivariate configuration (S1, S2), where S2 is assumed dependent on S1. Shaded nodes indicate the specific configurations enumerated, i.e. those for which a Bayesian network is learned. The set of Bayesian networks learned are part of an IBN. The general idea is that a Bayesian network is learned for each size configuration enumerated. In Figure 5.1, k networks are learned, and the likelihood contributions from networks for the non-enumerated configurations are bounded. When an IBN has more than one size variable, search proceeds in a higher-dimensional space. Figure 5.2 depicts the case of two size variables, where search proceeds in the two-dimensional space of values of S = (S1, S2). Supposing the IBN graph structure is such that pa(S2) = {S1}, we search by expanding S1 before S2 (note that if pa(S1) = pa(S2) = ∅, any order of enumeration is allowed).
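The queue-based enumeration of configurations can be sketched as follows (a sketch under our own assumptions: the thesis does not fix the selection rule, and a breadth-first queue is one choice that expands all lesser configurations first):

```python
# Minimal sketch of the configuration-queue expansion in ILRM-Search.
# An immediate successor increments exactly one component of a size
# configuration by 1. Function names are ours, not from the thesis.

from collections import deque

def successors(config):
    """All immediate successors of a size configuration."""
    return [config[:i] + (config[i] + 1,) + config[i + 1:]
            for i in range(len(config))]

def enumerate_configs(n_sizes, max_steps):
    """Enumerate configurations breadth-first from the identity
    configuration (1, ..., 1), so every visited configuration has all
    strictly-lesser configurations enumerated before it."""
    start = (1,) * n_sizes
    queue, seen, visited = deque([start]), {start}, []
    while queue and len(visited) < max_steps:
        s = queue.popleft()
        visited.append(s)   # here: learn theta_s, evaluate P(y | theta_s, s)
        for succ in successors(s):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return visited

configs = enumerate_configs(2, 6)
```

Breadth-first order processes configurations in order of increasing component sum, which is what guarantees the "all lesser configurations first" property used to tighten the lower bound.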
Since we generally want to fully evaluate the bounds given by Lemma 1, we start the search at the identity configuration s = (1, 1, …, 1) and expand the configuration in each dimension such that, at any step, all lesser configurations have already been enumerated. This results in a tighter lower bound on the likelihood than not enumerating all lesser configurations. To do so, we maintain a configuration queue. At each step, a configuration is removed from the queue, and all immediate successors of that configuration not already in the queue are inserted. An immediate successor is obtained by incrementing exactly one component of the configuration by 1; e.g. the immediate successors of (S1 = i1, S2 = i2) are (S1 = i1 + 1, S2 = i2) and (S1 = i1, S2 = i2 + 1).

5.4 Prediction

Here we show how ILRMs support Bayesian predictions. Given data y and some unobserved query proposition q (e.g. likes(joe, sue)), the desired prediction P(q | y) is computed by Bayesian model averaging (BMA) (Hoeting et al., 1999):

    P_bma(q | y) = Σ_M P(q | M, y) P(M | y)    (5.14)

Whilst the true BMA solution also requires marginalising over parameters, we describe our averaged prediction in terms of maximum likelihood (ML) parameters.⁴ In particular, we describe the averaged prediction for the case where optimal ML parameters given any configuration are available:

    P(q | y) = Σ_{s ∈ dom(S)} P(q | θ*_s, s, y) P(s | θ*_s, y)
             = Σ_{s ∈ dom(S)} P(q | θ*_s, s) P(s | θ*_s, y)    (5.15)

where we assume P(q | θ*_s, s, y) = P(q | θ*_s, s), meaning that predictions are drawn from the model only. (Note that a full Bayesian treatment would also require marginalising out the ILRM parameters. We assume point-estimate parameters – thus our approach is "semi"-Bayesian – but by assuming Dirichlet priors over parameters the full Bayesian treatment is within reach.)
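As an illustrative sketch of the averaged prediction in Eq. 5.15, restricted to the enumerated configurations (all numbers below are invented for illustration; the infinite tail is handled by the bounds derived next):

```python
# Sketch of model averaging over enumerated size configurations,
# following Eq. 5.15: P(q | y) = sum_s P(q | theta*_s, s) P(s | theta*_s, y).
# Per-model predictions and posterior weights are hypothetical.

def averaged_prediction(per_model_preds, posterior_weights):
    """Posterior-weighted average of per-configuration predictions."""
    assert abs(sum(posterior_weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(p * w for p, w in zip(per_model_preds, posterior_weights))

# P(q | theta*_s, s) for sizes s = 1..4, and normalised posterior
# weights P(s | theta*_s, y) over the same sizes:
preds = [0.30, 0.55, 0.70, 0.72]
weights = [0.10, 0.20, 0.40, 0.30]
p_q = averaged_prediction(preds, weights)
```

In the actual procedure the weights themselves are only known up to the posterior bounds of Lemma 2, which is why the final prediction is reported as an interval rather than a point.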
Algorithm 2 learns ILRMs that contain LRMs with all configurations lesser than the upper bound s̄; that is, the configuration space C_s̄ is enumerated. Equation 5.15 can then be written in terms of a finite sum W and an infinite sum W′:

    P(q | y) = Σ_{s ∈ C_s̄} P(q | θ*_s, s) P(s | θ*_s, y)        [= W]
             + Σ_{s′ ∈ dom(S)\C_s̄} P(q | θ*_{s′}, s′) P(s′ | θ*_{s′}, y)   [= W′]

Using Lemma 2, we bound each P(s | θ*_s, y) in the finite sum W to obtain

    Σ_{s ∈ C_s̄} P(q | θ*_s, s) P(y | θ*_s, s) P(s | θ*_s) / (f + g) ≤ W ≤ Σ_{s ∈ C_s̄} P(q | θ*_s, s) P(y | θ*_s, s) P(s | θ*_s) / (f + P(y | θ*_s̄, s̄)·g)    (5.16)

where f and g are given in Lemma 1. Similarly, we can bound the infinite sum W′ using Lemma 2. However, each summand in W′ involves P(q | θ*_{s′}, s′), whose value is unknown since s′ is not in the enumerated set C_s̄. We replace it with a prior P(q), giving

    (P(q) / (f + g)) · Σ_{s′ ∈ dom(S)\C_s̄} P(y | θ*_{s′}, s′) P(s′ | θ*_{s′}) ≤ W′ ≤ (P(q) / (f + P(y | θ*_s̄, s̄)·g)) · Σ_{s′ ∈ dom(S)\C_s̄} P(y | θ*_{s′}, s′) P(s′ | θ*_{s′})

The likelihood P(y | θ*_{s′}, s′) is also unknown, but can be bounded via Proposition 1 using s̄, resulting in

    (P(y | θ*_s̄, s̄) P(q) / (f + g)) · Σ_{s′ ∈ dom(S)\C_s̄} P(s′ | θ*_{s′}) ≤ W′ ≤ (P(q) / (f + P(y | θ*_s̄, s̄)·g)) · Σ_{s′ ∈ dom(S)\C_s̄} P(s′ | θ*_{s′})    (5.17)

Then P(q | y) is bounded as follows:

    L_W + L_{W′} ≤ P(q | y) ≤ U_W + U_{W′}    (5.18)

where L_W, L_{W′} are the lower bounds in Eqs. 5.16 and 5.17 respectively, and U_W, U_{W′} are the corresponding upper bounds.

5.5 Experiments

In our experiments we again use the dimensionality-of-nations dataset (Rummel, 1999) analysed in Kemp et al. (2006), which was also used in Ch. 4. The domain consists of 14 countries, and 56 different two-place Boolean relations are observed amongst them. Each relation has the form r(X, Y), where X, Y are logical variables of type nation. One latent property α is included. Of the 56 relations provided, one is chosen as the target relation for prediction, and two are chosen as co-relations on which the target relation can condition in our LRMs.

⁴A Bayesian treatment of parameters is not precluded in the following, however.
The target relation is intergov, whilst the chosen co-relations are ngo and sever. The target relation was chosen for its high entropy, whilst the co-relations were chosen based on their mutual information with the target relation, i.e. the degree to which a co-relation is informative of the target relation. The co-relation sever has the lowest mutual information score at 5×10⁻⁵, whilst ngo has a score of 0.63 (see Table 4.1); high-scoring co-relations are more informative. In the first experiment we demonstrate characteristics of the bounds derived in Sec. 5.3.1 as we search over the size of α, and also show the effects of different choices of prior. In the second experiment we learn ILRMs and evaluate their predictions using Eq. 5.18. As in the experiments of Ch. 4 Sec. 4.4, we assess models with different co-relations. We learn our ILRMs by searching over different sizes of α, and for each co-relation we compare the prediction of the ILRM with that of its best corresponding LRM found in Ch. 4.

5.5.1 Error Bounds

For this experiment we consider an ILRM that models two observed relations, ngo1 and intergov, with a latent property α (of countries). The graph structure over relations is given by G = {α → intergov, α → ngo1, ngo1 → intergov}. ILRM parameters Θ are learned using Alg. 1. We search over configurations of the ILRM, from S = 1 to S = 10, recording the prior distribution, the likelihood bounds, and the posterior bounds computed at each step of the search. We use two different prior distributions over configurations: (i) the geometric distribution geom(λ) for λ = 0.1, 0.5, 0.9, and (ii) the Poisson distribution poiss(λ) for λ = 1, 5, 9. The results are shown in Fig. 5.3. The figure illustrates the behaviour of the bounds derived in Sec. 5.3.1, where our uncertainty about the likelihood P(y) leads to uncertainty about the posterior over the size of latent property α.
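The two prior families over the size of α can be sketched as follows (a sketch of the standard pmfs, truncated to S = 1..10 as in the experiment; variable names are ours):

```python
# The two priors over the size S of latent property alpha used in the
# experiment: geom(lambda) and poiss(lambda), evaluated on S = 1..10.

import math

def geom_pmf(s, lam):
    """P(S = s) = lam * (1 - lam)^(s - 1); mass decreases monotonically in s."""
    return lam * (1 - lam) ** (s - 1)

def poisson_pmf(s, lam):
    """P(S = s) = lam^s * e^(-lam) / s!; mode near lam."""
    return lam ** s * math.exp(-lam) / math.factorial(s)

sizes = range(1, 11)
geometric = [geom_pmf(s, 0.5) for s in sizes]
poisson = [poisson_pmf(s, 5) for s in sizes]
# The geometric prior always puts its largest mass on S = 1, whereas
# the Poisson prior with lam = 5 concentrates mass around S = 4..5.
```

This contrast is exactly the behaviour visible in Fig. 5.3: the geometric prior monotonically favours small sizes, while the Poisson prior lets the mode be placed at any desired size.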
One observation is that with the geometric prior over the size of latent property α, smaller sizes of α are favoured in the posterior, and the bias becomes stronger for higher values of the geometric parameter. The Poisson prior (right-hand column of Fig. 5.3), on the other hand, can incorporate prior knowledge that particular sizes of α may be favourable: the mode of the Poisson can be centred according to that prior knowledge.

Figure 5.3: Search over the configuration of α, from 1 to 10. Rows show the prior, the likelihood, and the posterior over the number of values of α. The left column uses a geometric prior over the size of α, with parameter settings 0.1, 0.5 and 0.9; the right column uses the Poisson distribution with parameters 1, 5 and 9. Bounds for the likelihood and the posterior over the size of α are shown by the shaded regions.

5.5.2 Bayesian Prediction

Next we examine the accuracy of ILRM predictions (Eq. 5.18). For different co-relations, we compare the log-loss of the averaged prediction with that of the corresponding best single predictor found in Ch. 4 (Table 4.2).

No co-relations. The first test involves only the target relation intergov, with no co-relations. The ILRM for this test – denoted by L_c – contains only intergov and the latent property α. The structure is given by G = {α → intergov}. Θ is learned using Alg. 1 introduced in Ch. 4.
The sizes of α considered are S = 1, 2, …, 10, so L_c corresponds to ten fixed-size LRMs which are averaged during prediction. The geometric and Poisson distributions are used as prior distributions over the space of configurations. We search over the parameter of the geometric distribution, p(S) ∼ geom(λ), for λ = 0.1, 0.2, …, 1. For the Poisson distribution, p(S) ∼ poiss(λ), we search over λ = 1, …, 10. For both choices of prior and each parameter value we perform 5-fold cross-validation, where 20% of the observed cases for the target relation are used for prediction and the rest for training. For each fold, 50 ILRMs are learned, each with sizes of α up to (and including) S = 10. The ILRMs with the best test-set log-loss in each fold are used for computing the cross-validated log-loss. Figure 5.4 shows the log-loss for different choices of prior distribution for S = 1 … 10. In order to interpret these results, recall from Ch. 4 that models with S > 3 generally yielded lower losses (see Figs. 4.7 to 4.9). As such, in averaging, it is reasonable to expect that weighting those models more heavily will also yield the best results. From Fig. 5.4 we see that the geometric prior generally leads to higher losses (worse predictive performance) than the Poisson prior. For geom(λ), increasing values of λ correspondingly favour lower values of S; thus, as expected, high losses are incurred as λ increases. It is noteworthy that the CRP prior also monotonically weights simpler models (lower values of S) more heavily. Using poiss(λ), on the other hand, the mode can be centred flexibly.

Figure 5.4: Five-fold cross-validated log-loss for ILRM predictions on the intergov relation; no co-relations are present.
ILRMs with different prior distributions over the size of latent property α are evaluated: the geometric distribution (left) and the Poisson distribution (right) with different parameter values are considered. Horizontal lines in the plots are the score and standard error for the single best LRM found in Ch. 4 for the same problem.

Figure 5.4 shows that for λ ≈ 5, models with configurations around S = 5 are weighted more heavily, whilst those farther away receive less weight. With λ ≈ 5 we observe the lowest log-losses, almost matching that of the best single model found in Ch. 4 (Fig. 4.7). This is expected: the best single models found in Ch. 4 correspond to α with sizes between 3 and 10, so weighting these models more heavily in the averaged prediction better exploits their accuracy.

With co-relation. The same experiment is repeated with the addition of a co-relation r, and we test two ILRMs. The first, denoted by L_c+, contains the target relation, one co-relation, and one latent property α. The structure of L_c+ is given by G = {α → intergov, α → r}, where r denotes a co-relation. The second ILRM, denoted by L_cd, differs from L_c+ in its graph structure, namely that r → intergov now appears in G. Parameter learning given configurations is again done using Alg. 1 from Ch. 4. Results for the different choices of co-relation r (i.e. sever, ngo2, and ngo1) are shown in Figs. 5.5, 5.6, and 5.7 respectively.

Figure 5.5: Five-fold cross-validated log-loss for ILRM predictions on the intergov relation, with co-relation sever. ILRMs L_c+ (left column, no dependency link between target and co-relation) and L_cd (right column, with a dependency link from co-relation to target relation) are evaluated.
Different prior distributions over the size of latent property α are evaluated: the geometric distribution (top row) and the Poisson distribution (bottom row) with different parameter values are considered. Horizontal lines in the plots are the score and standard error for the single best LRM found in Ch. 4 for the same problem.

Using co-relation sever, we see that the averaged predictions of L_c+ and L_cd have similar performance. This is expected, since the mutual information between sever and intergov is low (≈ 5×10⁻⁵), and modelling the dependency between sever and intergov yields little advantage. The important observation is that the choice of the Poisson prior again leads to log-loss that is closer to that of the best single model. Results for co-relations ngo2 and ngo1 are shown in Figs. 5.6 and 5.7, where we observe characteristics similar to the case where sever is used as the co-relation. The higher mutual information between ngo2 and intergov brings a slight advantage for the L_cd model, whilst ngo1 provides the greatest advantage. The use of poiss(λ) with λ ≈ 5 again yields the best log-loss scores, corresponding to models where α has approximately S ≈ 5 values, which is the region where the best single models are found. The general finding of this experiment is that the choice of prior distribution over configurations has an important effect on the accuracy of averaged predictions. In particular, for this dataset, the best single models were found to be those where latent property α has at least 3 values, and weighting these models more heavily leads to better averaged predictions. The Poisson distribution appears more appropriate for this task, as it is not restricted to favouring small numbers of values. With the Poisson prior, the averaged predictions are shown to be close to the best single models found in Ch. 4.
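The cross-validated log-loss protocol used throughout these experiments can be sketched as follows (our own sketch with invented Boolean data and a trivial stand-in predictor; the actual experiments predict relation instances with learned ILRMs):

```python
# Sketch of k-fold cross-validated log-loss evaluation of probabilistic
# predictions on Boolean observations. Data and the frequency-based
# stand-in predictor are illustrative only.

import math
import random

def log_loss(preds, labels):
    """Mean negative log-likelihood of Boolean observations."""
    return -sum(math.log(p) if y else math.log(1 - p)
                for p, y in zip(preds, labels)) / len(labels)

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # hypothetical observed cases
losses = []
for test in k_fold_indices(len(labels), 5):
    train = [i for i in range(len(labels)) if i not in test]
    # Stand-in predictor: training-set frequency of the positive class.
    p = sum(labels[i] for i in train) / len(train)
    losses.append(log_loss([p] * len(test), [labels[i] for i in test]))
cv_loss = sum(losses) / len(losses)
```

Lower `cv_loss` is better; the horizontal reference lines in Figs. 5.4-5.7 are exactly this quantity (plus standard error) for the best single LRM of Ch. 4.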
5.6 Remarks

In this chapter we have demonstrated an alternative to the standard sampling-based method for handling uncertainty over the size of latent properties in relational models. Our proposed method searches over possible sizes of latent properties and bounds the error of the likelihood computed from the models found during search. The likelihood bounds lead to posterior bounds, which in turn lead to an approximate prediction. The proposed method makes fewer assumptions than existing proposals (e.g. Kemp et al. (2006); Xu et al. (2006)) about the prior distribution over the sizes of latent properties, and allows us to examine different forms of this prior. Our experiments show the behaviour of the bounds, and show that averaged predictions (using only the models found in the search) achieve accuracy close to the single best model found in Ch. 4 for the same dataset, provided that we use an appropriate prior distribution over the size of latent properties.

Figure 5.6: Five-fold cross-validated log-loss for ILRM predictions on the intergov relation, with co-relation ngo2. ILRMs L_c+ (left column, no dependency link between target and co-relation) and L_cd (right column, with a dependency link from co-relation to target relation) are evaluated. Different prior distributions over the size of latent property α are evaluated: the geometric distribution (top row) and the Poisson distribution (bottom row) with different parameter values are considered. Horizontal lines in the plots are the score and standard error for the single best LRM found in Ch. 4 for the same problem.

On a bibliographic note, the search-based framework for bounding the likelihood is based on the search-based method for evaluating probabilities proposed by Poole (1993a).
Poole's approach performs probabilistic inference by evaluating only part of the (finite) state space and bounding the remaining mass. We use this approach to evaluate the likelihood bounds where the state space is infinite. Also, the general scheme of successively increasing the number of parameters of a statistical model, where the parameter space is infinite, can be related to the method of sieves for non-parametric ML estimation (Geman and Hwang, 1982). More in-depth connections of our approach to the method of sieves would be interesting future work.

Figure 5.7: Five-fold cross-validated log-loss for ILRM predictions on the intergov relation, with co-relation ngo1. ILRMs L_c+ (left column, no dependency link between target and co-relation) and L_cd (right column, with a dependency link from co-relation to target relation) are evaluated. Different prior distributions over the size of latent property α are evaluated: the geometric distribution (top row) and the Poisson distribution (bottom row) with different parameter values are considered. Horizontal lines in the plots are the score and standard error for the single best LRM found in Ch. 4 for the same problem.

Chapter 6

Learning with Functional Relations

In previous chapters we have only considered rules and dependencies that contain relations. In this chapter the focus is on rules where a functional relation appears in the body, e.g.

    ∀X1, …, ∀Xn  θ1 : r(X1, …, Xn) = v ← f(X1, …, Xn) = y1.
    ∀X1, …, ∀Xn  θ2 : r(X1, …, Xn) = v ← f(X1, …, Xn) = y2.
      ⋮
    ∀X1, …, ∀Xn  θm : r(X1, …, Xn) = v ← f(X1, …, Xn) = ym.    (6.1)

where r is a relation, and f is a function with range V_f = {y1, …, ym}.
Each θi is a probability estimate learned from data, described by Eq. 2.11. To illustrate, consider a simple example drawn from the 1990 U.S. Census (U.S. Census Bureau, 1990), which records preferred modes of transportation taken to work across major U.S. cities. Using the Boolean predicate drive_alone to represent driving alone to work, and the function lives_in to represent the city of residence, the example can be represented by the following set of rules, one rule per city:

    ∀Person, θ1 : drive_alone(Person) ← lives_in(Person) = new_york_city
    ∀Person, θ2 : drive_alone(Person) ← lives_in(Person) = los_angeles
      ⋮
    ∀Person, θm : drive_alone(Person) ← lives_in(Person) = wilmington    (6.2)

where each θi is the proportion of people who drive alone to work in the corresponding city. Thus, if x lives in New York City, the probability that x drives to work is θ1; if x lives in Los Angeles, θ2; and so on. The number of rules required to model drive-alone probabilities is proportional to the number of known cities. Ideally, we would like the representation size of our model to be independent of the number of individuals (e.g. the number of cities in the example). In addition, we wish to avoid over-fitting, which can occur when the statistics accompanying a rule are drawn from an unreliable sample, e.g. the drive-alone statistics of small towns. A standard solution for obtaining a compact representation is aggregation, where the probabilities over all cities are combined, resulting in the general statement

    ∀Person, θ̄ : drive_alone(Person).    (6.3)

where θ̄ represents some aggregate probability, i.e. the drive-alone probability averaged over all cities. From the standpoint of compactness, Eq. 6.3 is the obvious choice as it is a single rule. However, in prediction, Eq. 6.2 is more precise. Note that we are choosing between two reference classes here, one general and one specific. To highlight the implications for prediction, we make the example more precise.
In the 1990 U.S. census (U.S. Census Bureau, 1990), approximately 18 million people were surveyed across 50 major U.S. cities about their drive-to-work habits. It was determined that approximately 72% of the sample drive to work. The data also show that about 3 million samples were taken from New York City, which has an average drive-to-work rate of 24%. Thus, applying Eq. 6.3 and predicting that future subjects drive to work with probability 0.72 (without considering the city of residence) is likely to incur high estimation error for subjects from New York City. Since New York City has a large population (approximately 8.4 million), a high accumulated error can be expected from using Eq. 6.3. Our goal in this chapter is an algorithm for finding a fixed-size representation that balances the tension between compactness of representation and predictive accuracy. The task is to find optimal clusters of values of the given function, where optimality is achieved by minimising empirical loss. For the travel-to-work example, we seek a fixed number of rules that summarise Eq. 6.2, e.g.

    ∀Person, θ̄1 : drive_alone(Person) ← member(lives_in(Person), L1).
    ∀Person, θ̄2 : drive_alone(Person) ← member(lives_in(Person), L2).
      ⋮
    ∀Person, θ̄k : drive_alone(Person) ← member(lives_in(Person), Lk).

where k ≤ m, Li is a set of cities, and member(Y, Li) is an invented membership relation that is true if city Y belongs to group Li, and false otherwise.

6.1 Problem Statement

We assume a database D that entails a relation r(X1, …, Xn) and a function f(X1, …, Xn). From D the following table of statistics can be formed:

    f(X1, …, Xn)    count    total
    y1              c+_1     c_1
    y2              c+_2     c_2
    ⋮               ⋮        ⋮
    ym              c+_m     c_m

where c+_i and c_i correspond to counts of observed cases where relation r(X1, …, Xn) has value v and X1, …, Xn satisfy f(X1, …, Xn) = yi, given by

    c+_i = #_D(r(X1, …, Xn) = v ∧ f(X1, …, Xn) = yi)
    c_i  = Σ_{v′ ∈ V_r} #_D(r(X1, …, Xn) = v′ ∧ f(X1, …, Xn) = yi)    (6.4)
We also take an integer k ≤ m as input. From the table, the following rule set can be formed:

    ∀X1, …, ∀Xn, θ1 : r(X1, …, Xn) ← f(X1, …, Xn) = y1
    ∀X1, …, ∀Xn, θ2 : r(X1, …, Xn) ← f(X1, …, Xn) = y2
      ⋮
    ∀X1, …, ∀Xn, θm : r(X1, …, Xn) ← f(X1, …, Xn) = ym    (6.5)

where each θi = c+_i / c_i. The learning problem of interest here is to generate the best set of k rules that summarises the verbose rule set in Eq. 6.5, and our strategy is to partition V_f. Given a partitioning S = {S1, …, Sk} such that ⋃_{i=1}^k Si = V_f and Si ∩ Sj = ∅ for i ≠ j, the summary rule set we seek has the form

    ∀X1, …, ∀Xn, θ̄1 : r(X1, …, Xn) ← member(f(X1, …, Xn), S1)
    ∀X1, …, ∀Xn, θ̄2 : r(X1, …, Xn) ← member(f(X1, …, Xn), S2)
      ⋮
    ∀X1, …, ∀Xn, θ̄k : r(X1, …, Xn) ← member(f(X1, …, Xn), Sk)    (6.6)

where member(A, B) is true if A ∈ B, and false otherwise. We consider the best clustering S* to be the one that yields the lowest empirical loss. First, let T = {T1, …, Tk} be the corresponding clustering of probabilities induced by S, where each Ti contains the set of probabilities from the verbose rule set (Eq. 6.5) for the members of Si. For instance, if Si = {y1, y2}, then Ti = {θ1, θ2}. Each Ti is said to contain the partition parameters for partition Si. The optimisation problem can be formally stated as follows:

    S* = argmin_S Σ_{i=1}^k E[L(θ̄i, Ti)]    (6.7)

where E[L(θ̄i, Ti)] is short for the expected empirical loss of applying θ̄i over all θ ∈ Ti. Given a partitioning S, the probability value θ̄i can also be optimised. Namely, since E[L(θ̄i, Ti)] is the empirical loss incurred by θ̄i for partition parameters Ti, the θ̄i that minimises E[L(θ̄i, Ti)] is given by

    θ̄*_i = argmin_{θ̄i} E[L(θ̄i, Ti)] = argmin_{θ̄i} Σ_{θ ∈ Ti} E[L(θ̄i, θ)]    (6.8)

We assume log-loss in this work, i.e.
l_{1,α} = −log α and l_{0,α} = −log(1 − α), and since L(·) is a convex function, taking the derivative and equating to zero yields θ̄*_i as the average of Ti:

    θ̄*_i = (Σ_{θ ∈ Ti} θ) / |Ti|    (6.9)

The remaining problem is to solve Eq. 6.7 to find the best clustering S*, which we discuss next.

6.2 Optimal Partitioning

The space of possible partitionings consists of all possible assignments of elements of V_f to k partitions; assigning m elements to k < m bins yields k^m possible assignments. We show in the following that, to assign the m probabilities in Θ to k < m partitions, we only need to consider partitionings whose partitions are ordered. Specifically, we show that the optimal partitioning is S* = {S*_1, …, S*_k} with corresponding partition parameters T* = {T*_1, …, T*_k} such that for all θ ∈ T*_i and θ′ ∈ T*_{i+1}, θ ≥ θ′. The following lemma provides the theoretical justification for this result. Notation for empirical loss is provided in Ch. 2 Sec. 2.1.3.

Lemma 3. Let p1 and p2 be probabilities such that p1 > p2. Also, let p̂1 and p̂2 be probabilistic estimates of p1 and p2 respectively, such that p̂1 > p̂2. Applying each estimator to p1 and p2,

    L(p̂2, p1) ≤ L(p̂1, p1) → L(p̂2, p2) < L(p̂1, p2)    (6.10)

Proof. The premise L(p̂2, p1) ≤ L(p̂1, p1) in Eq. 6.10 has the expanded form

    p1·l_{1,p̂2} + (1 − p1)·l_{0,p̂2} ≤ p1·l_{1,p̂1} + (1 − p1)·l_{0,p̂1}    (6.11)

Rewriting p1 as p1 = p2 + δ where δ > 0, Eq.
6.11 becomes

    (p2 + δ)·l_{1,p̂2} + (1 − p2 − δ)·l_{0,p̂2} ≤ (p2 + δ)·l_{1,p̂1} + (1 − p2 − δ)·l_{0,p̂1}

Some rearrangement yields

    [p2·l_{1,p̂2} + (1 − p2)·l_{0,p̂2}] + δ(l_{1,p̂2} − l_{0,p̂2}) ≤ [p2·l_{1,p̂1} + (1 − p2)·l_{0,p̂1}] + δ(l_{1,p̂1} − l_{0,p̂1})

where the bracketed terms are L(p̂2, p2) and L(p̂1, p2) respectively. It follows that

    L(p̂2, p2) + δ(l_{1,p̂2} − l_{0,p̂2}) ≤ L(p̂1, p2) + δ(l_{1,p̂1} − l_{0,p̂1})    (6.12)

Since l_{1,α} = −log α is decreasing in α and l_{0,α} = −log(1 − α) is increasing in α, and p̂1 > p̂2, it follows that l_{1,p̂2} > l_{1,p̂1} and l_{0,p̂2} < l_{0,p̂1}, and therefore

    l_{1,p̂2} − l_{0,p̂2} > l_{1,p̂1} − l_{0,p̂1}    (6.13)

Given inequality 6.13, for Eq. 6.12 to be true it must then be the case that

    L(p̂2, p2) < L(p̂1, p2)    (6.14)

which proves Eq. 6.10. ∎

Note that an equivalent statement of Eq. 6.10 is

    L(p̂1, p2) ≤ L(p̂2, p2) → L(p̂1, p1) < L(p̂2, p1)    (6.15)

Figure 6.1 illustrates Lem. 3 for the cases where the premise L(p̂2, p1) ≤ L(p̂1, p1) holds and where it does not. When the premise holds, the optimal assignment of estimators is that both p̂1 and p̂2 be assigned to p1. Equivalently, when L(p̂1, p2) ≤ L(p̂2, p2) holds, assigning both p̂1 and p̂2 to p2 is optimal.

Figure 6.1: An illustration of Lem. 3, showing the cases where the premise L(p̂2, p1) ≤ L(p̂1, p1) of Eq. 6.10 does not hold (left) and where it holds. When the premise holds, the distance between p̂2 and p1 is less than that between p̂1 and p1.

Importantly, Lem. 3 is logically equivalent to the statement that either L(p̂2, p1) ≤ L(p̂1, p1) is true, or L(p̂1, p2) ≤ L(p̂2, p2) is true, but not both. In other words, assigning p̂1 to estimate p2 and p̂2 to estimate p1 cannot simultaneously achieve the lowest loss for both p1 and p2. Thus, to estimate the values of two probabilities p1 > p2, the optimal choice is to assign estimates in sorted order. For our partitioning problem, suppose we sort the set of probabilities Θ = {θ1, …
, θm} in descending order, then split it into $k < m$ partitions and generate a $\bar\theta_i$ for each partition $i$ according to Eq. 6.8. It will then also be the case that $\bar\theta_1, \ldots, \bar\theta_k$ are in descending order. By Lem. 3, swapping any pair $\theta_i, \theta_j$ between different partitions will necessarily increase the loss incurred. As such, the optimal partitioning lies in the space of partitionings on a sorted set $\Theta$. The optimal partitioning $S^*$ thus has accompanying partition parameters $T^*$ such that

$$\forall \theta_i \in T_i^*\ \forall \theta_j \in T_j^*,\quad \theta_i > \theta_j \ \text{ where } i < j \qquad (6.16)$$

Figure 6.2 illustrates the partitioning process on a sorted set of probabilities.

Figure 6.2: An illustration of partitioning on an initial set $y_1, \ldots, y_5$ (left), and partitioning on a sorted set (right). The dashed line in the middle partition is a probability estimate for the partition, averaged over the probabilities assigned to the partition. The illustration shows that swapping $y_3$ and $y_5$ to yield the sorted set reduces the error incurred by the estimate.

Having shown that the optimal partitioning lies in the space of partitionings with ordered sets of probabilities, the remaining task is to find the optimal cut-points in this set. The following section describes a dynamic programming algorithm for this task.

6.3 Optimal Partitioning by Dynamic Programming

The previous section showed that the optimal partitioning lies in the space of partitionings where the probabilities are sorted. As such, the objective function (Eq. 6.7) can be solved recursively, i.e.

$$\min_{S} \sum_{i=1}^{k} E\!\left[L(\bar\theta_i^*, T_i)\right] = \min_{S_k} \left\{ E\!\left[L(\bar\theta_k^*, T_k)\right] + \min_{S - S_k} \sum_{i=1}^{k-1} E\!\left[L(\bar\theta_i^*, S_i)\right] \right\} \qquad (6.17)$$

With the objective function in recursive form, dynamic programming can be applied. Let $\Theta = \{\theta_1, \ldots, \theta_m\}$ be the probabilities accompanying the rule set in Eq. 6.5, and let $\pi$ be an ordering that sorts $\Theta$, resulting in $\Theta_\pi$. To find the best k − 1 cut-points c∗ = {c∗1 , . . .
, c∗k−1} (for $k$ partitions of $\Theta_\pi$), the dynamic programming procedure is a recursion such that, for each candidate cut-point $c_i$ in the search for $c_i^*$, a recursive call is made to find $c_{i-1}^*$. We implement the procedure by constructing a table with $m$ rows and $k$ columns. To define the table entries, let $\Theta_\pi^i$ be the first $i$ elements of $\Theta_\pi$, and let $S^i$ and $T^i$ be the corresponding partitioning and partition parameters on $\Theta_\pi^i$. Then $d_{i,j}$ is the empirical loss of the best $(j+1)$-partition model on $\Theta_\pi^i$:

$$d_{i,j} = \min_{\{S_1^i, \ldots, S_{j+1}^i\}} \sum_{n=1}^{j+1} E\!\left[L(\bar\theta_n^*, T_n^i)\right] \qquad (6.18)$$

where each $T_{n \leq j}$ consists of elements of $\Theta_\pi^i$. $d_{i,j}$ also has a recursive definition:

$$d_{i,j} = \begin{cases} d_{i-1,j-1} + \min_{c \leq i} d_{c,j} & \text{if } i > 0 \text{ and } j > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (6.19)$$

The main procedure of our partitioning algorithm is to populate this table column by column, starting from $d_{1,1}$. Namely, we start from a 1-partition model and build incrementally up to the $k$-partition model. The entry $d_{m,k}$ represents the empirical loss incurred. After completing the table, the cut-points that yield the optimal partitioning $S^*$ are read from the table, i.e.

$$c_j^* = \arg\min_{c}\, d_{c,j}, \qquad j = 1, \ldots, k-1 \qquad (6.20)$$

The algorithm for completing the table is given as follows. Note that the term $d_{i-1,j-1}$ is found directly in the table, and $d_{c,j}$ can be computed using Eq. 6.18.

Algorithm 3: Completing the table of partitioning losses for dynamic programming
    for j = 1 . . . k do
        for i = 1 . . . m do
            H ← ∅
            for c = 1 . . . i do
                add (d_{i−1,j−1} + d_{c,j}) to H
            end for
            d_{i,j} ← min H
        end for
    end for

The process of partitioning by completing the table is illustrated in Fig. 6.3.

Figure 6.3: An illustration of optimal partitioning by dynamic programming.

6.4 Experiments

6.4.1 Journey-to-work Data

In this experiment we use statistics about people's travel-to-work habits, collected in the 1990 United States Census (U.S. Census Bureau, 1990).
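Algorithm 3 can be sketched in Python as follows. This is a minimal illustrative re-implementation, not the thesis code: all names are mine, `segment_loss` plays the role of the empirical loss $E[L(\bar\theta, T)]$ of a partition under its mean estimate $\bar\theta$ (Eq. 6.9), and a straightforward $O(k \cdot m^2)$ table formulation with explicit back-pointers is used rather than the exact layout of Eq. 6.19.

```python
import math

def segment_loss(theta, a, b):
    """Empirical log-loss of estimating the probabilities theta[a:b]
    by the segment mean (the optimal single estimate, Eq. 6.9)."""
    seg = theta[a:b]
    mean = sum(seg) / len(seg)
    eps = 1e-12                      # guard log(0) for all-0/all-1 segments
    mean = min(max(mean, eps), 1 - eps)
    return sum(-(p * math.log(mean) + (1 - p) * math.log(1 - mean))
               for p in seg)

def optimal_partition(theta, k):
    """Split probabilities into k contiguous partitions of the sorted list,
    minimising total empirical log-loss, by dynamic programming."""
    theta = sorted(theta, reverse=True)   # Lem. 3: optimal cuts are on a sorted set
    m = len(theta)
    INF = float("inf")
    # d[j][i]: best loss of a j-partition model on the first i elements.
    d = [[INF] * (m + 1) for _ in range(k + 1)]
    cut = [[0] * (m + 1) for _ in range(k + 1)]
    d[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, m + 1):
            for c in range(j - 1, i):      # last partition is theta[c:i]
                cand = d[j - 1][c] + segment_loss(theta, c, i)
                if cand < d[j][i]:
                    d[j][i], cut[j][i] = cand, c
    # Recover the partitions by following the stored cut-points backwards.
    parts, i = [], m
    for j in range(k, 0, -1):
        c = cut[j][i]
        parts.append(theta[c:i])
        i = c
    return list(reversed(parts)), d[k][m]
```

For example, `optimal_partition([0.24, 0.81, 0.74, 0.78], 2)` groups the three high probabilities together and isolates 0.24, since splitting anywhere else mixes dissimilar probabilities into one partition and increases the loss.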
The aim is to demonstrate partitionings generated by Alg. 3. The input is a table of percentages corresponding to the different ways people travel to work in 50 major U.S. cities. A sample is shown below.

city             pop.       drove alone  carpool  public transit  other
New York, NY     3,183,088  24.0 %       8.5 %    53.4 %          14.0 %
Los Angeles, CA  1,629,096  65.2 %       15.4 %   10.5 %          8.9 %
...              ...        ...          ...      ...             ...
Omaha, NE        166,449    78.0 %       12.2 %   3.2 %           6.8 %
Toledo, OH       137,772    81.5 %       10.5 %   3.0 %           5.2 %
Buffalo, NY      127,790    61.6 %       13.7 %   13.4 %          11.3 %

The table can be represented as a set of rules of the form of Eq. 6.5. For example, the 'drove alone' class can be described by the following rules:

∀X  0.24 : drove alone(X) ← lives in(X) = "New York, NY"
∀X  0.65 : drove alone(X) ← lives in(X) = "Los Angeles, CA"
. . .
∀X  0.78 : drove alone(X) ← lives in(X) = "Omaha, NE"
∀X  0.81 : drove alone(X) ← lives in(X) = "Toledo, OH"
∀X  0.62 : drove alone(X) ← lives in(X) = "Buffalo, NY"          (6.21)

One can use this rule set as a predictive model of whether people drive alone to work, based on their city of residence. The aim of Alg. 3 is to yield simpler models by partitioning cities. Table 6.1 shows the results for k = 2, 5 and 10.

[Table 6.1: Tables showing the 50 surveyed U.S. cities with the sampled percentage of people who drive alone to work in each city, ranked from Toledo, OH (0.813) down to New York, NY (0.240). Each table shows a partitioning generated by Alg. 3, for k = 2 (left), k = 5 (centre), and k = 10 (right). Log-loss on training data for each partition model is shown at the bottom: 0.8811 (k = 2), 0.8716 (k = 5), and 0.8702 (k = 10).]

The partitions correspond to simpler rule-sets than Eq. 6.21. For instance, the
k = 2 partitioning can be represented by

∀X  θ̄1 : drove alone(X) ← member(lives in(X), L1)
∀X  θ̄2 : drove alone(X) ← member(lives in(X), L2)

where

L1 = {"Toledo, OH", "Oklahoma City, OK", . . .}
L2 = {"Buffalo, NY", "Atlanta, GA", . . .}

This demonstrates the benefit of Alg. 3 for achieving a more compact model than that obtained by directly splitting on all values of the function: within each partition, the probabilities of driving alone to work appear similar.

6.4.2 E. coli Gene Data

In this experiment we evaluate the predictive accuracy of partition models generated by Alg. 3. We use the 2001 KDD Cup E. coli dataset¹, which contains data for 4289 genes of the E. coli bacterium genome. For each gene a function classification is provided (there are 144 possible gene functions²), along with numerous biological properties. In addition, the interactions of each gene with other genes in the E. coli genome are given. We consider the task of predicting interactions, or links, between genes, i.e. link(Gene, Gene′), from their respective function classifications, i.e. f(Gene) and f(Gene′). We do not compare with the other models tested in KDD Cup 2001, due to the contrasting nature of our model.

¹ KDD Cup 2001 (http://www.cs.wisc.edu/~dpage/kddcup2001).
² The functions have a hierarchical structure, but for the purposes of this work we flatten the hierarchy and assume independent functions.

We test three models, the first of which is a simple model

∀G ∀G′   θ : link(G, G′)          (6.22)

where θ represents the global average probability of linkage between any pair of genes. The second model considers all possible function classifications in determining linkage:

∀G ∀G′   θ1,1 : link(G, G′) ← f(G) = f1 ∧ f(G′) = f1.
∀G ∀G′   θ1,2 : link(G, G′) ← f(G) = f1 ∧ f(G′) = f2.
. . .
∀G ∀G′   θm,m : link(G, G′) ← f(G) = fm ∧ f(G′) = fm.

where m is the number of gene functions. Model parameters θi,j , i, j ∈ {1, . . .
, m} represent the proportion of linkage amongst pairs of genes that have function classifications fi and fj respectively. For this experiment there are m = 144 gene functions. The third model is a partition model generated using Alg. 3:

∀G ∀G′   θ̄1,1 : link(G, G′) ← member(f(G), L1) ∧ member(f(G′), L′1).
∀G ∀G′   θ̄1,2 : link(G, G′) ← member(f(G), L1) ∧ member(f(G′), L′2).
. . .
∀G ∀G′   θ̄k1,k2 : link(G, G′) ← member(f(G), Lk1) ∧ member(f(G′), L′k2).          (6.23)

where k1 and k2 are the numbers of partitions of gene functions for G and G′ respectively. We use Alg. 3 to separately learn k1 partitions of the function classes for G, and k2 partitions for G′. The partitions are denoted by L1, . . . , Lk1 and L′1, . . . , L′k2 respectively. Each parameter θ̄i,j represents the proportion of linkage amongst gene pairs (G, G′) such that member(f(G), Li) and member(f(G′), L′j) hold. We generate the partitions independently for the values of f(G) and f(G′), thus assuming that the clusterings for f(G) and f(G′) are independent. To generate a partitioning of the values of f(G) we aggregate over the values of f(G′), and to generate a partitioning for f(G′) we aggregate over f(G). Figure 6.4 illustrates this process. (Note that by partitioning the function values for both G and G′ we are partitioning in two dimensions. Algorithm 3 is optimal for the one-dimensional case; using it in this way, partitioning each dimension independently, cannot guarantee an optimal partitioning.)

Figure 6.4: An illustration of the independent application of Alg. 3 on the functions of both genes G and G′ in the relation link(G, G′). An example rule resulting from the partitioning is shown.

Learning our partition model is done with Alg. 3, with k1 and k2 ranging from 1 to 20. We also directly compute the simple model (Eq. 6.22) and the verbose model (Eq. 6.4.2) by counting links.
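The independent two-dimensional scheme of Fig. 6.4 can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the thesis implementation: `links` is assumed to be a dense matrix of observed link proportions between function classes, and equal-size contiguous cuts on the sorted marginals stand in for the optimal cut-points that Alg. 3 would choose in each dimension.

```python
import numpy as np

def independent_partition(links, k1, k2):
    """Partition the functions of G and G' independently.
    links[i, j] is the observed link proportion between genes of
    function f_i and genes of function f_j.  Each dimension is
    aggregated over the other, sorted, and cut into contiguous
    groups; block means give the partition parameters."""
    def group(marginals, k):
        order = np.argsort(marginals)[::-1]     # sort functions, descending
        return np.array_split(order, k)         # contiguous (crude) cuts
    rows = group(links.mean(axis=1), k1)        # aggregate over f(G')
    cols = group(links.mean(axis=0), k2)        # aggregate over f(G)
    # Block parameters: average link proportion within each (L_i, L'_j).
    theta = np.array([[links[np.ix_(r, c)].mean() for c in cols]
                      for r in rows])
    return rows, cols, theta
```

The payoff is in parameter count: the verbose model over m = 144 functions needs 144² = 20736 parameters, while a (19, 19) partition model needs only 19 · 19 = 361.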
We perform 10-fold cross-validation: 90% of the cases for link(G, G′) are used to compute our models, and 10% are used for prediction. We measure the log-loss in prediction for the simple (Eq. 6.22), verbose (Eq. 6.4.2), and partition (Eq. 6.23) models, averaged over all folds. The results are shown in Table 6.2 below. They firstly show that the prediction problem is generally easy, in the sense that links between genes are sparse. However, the figures show that partitioning still allows one to achieve better predictions, as evidenced by the verbose model achieving statistically significant improvements. It is also shown that with k1 = k2 = 19, the partition model achieves scores competitive with the verbose model.

model              Ltest
simple             0.0437 ± 0.000374
verbose            0.0338 ± 0.000314
partition (19,19)  0.0337 ± 0.000316

Table 6.2: Log-losses and standard errors for the simple (Eq. 6.22), verbose (Eq. 6.4.2), and partition (Eq. 6.23) models in predicting gene linkage based on the respective gene functions. The numbers adjacent to the partition model indicate the number of partitions in each dimension (i.e. the best partition model found 19 partitions in each dimension).

We also display the relative log-loss between models³. The relative log-loss between our partition models (Eq. 6.23) and the verbose model (Eq. 6.4.2) is shown in Fig. 6.5. The relative log-loss between the simple model (Eq. 6.22) and the verbose model is equivalent to the relative log-loss between the partition model with k1 = k2 = 1 and the verbose model.

³ Relative log-loss is simply the difference in log-loss between two models.

[Figure 6.5: Relative log-loss of predictions over all functional classes for the gene classification dataset, as a function of k1 and k2 (each ranging from 2 to 20); values range from 0.0049 down to −0.00015.]

Figure 6.5 shows that the log-loss approaches the performance of the verbose model as more partitions are introduced.
The region where k1, k2 ≈ 20 shows negative relative log-loss, i.e. there the partition model achieves lower log-loss than the verbose model.

6.5 Summary

The presence of functional relations induces verbose rule sets whose size is governed by the number of values of the function. The standard approach of aggregation (Jaeger, 1997; Kersting and De Raedt, 2000; Natarajan et al., 2005) fully compacts this rule set into a single rule, but is less accurate than the verbose model. We proposed an algorithm that optimally creates a partitioning of the values of the function to yield a rule set whose size can be fixed a priori. The learned models are shown to achieve good predictive accuracy compared to the aggregated model, and can outperform the verbose model with far fewer parameters.

The theorem and algorithm provided in this chapter find an optimal partitioning of the range of a function. A limitation of the proposed method is that optimality is no longer guaranteed when functions in the model map to a multidimensional object domain. For example, the rule

drove alone(Person) ← lives in(Person) = new york city ∧ occupation(Person) = carpenter

maps people to the joint domain of cities and jobs. There may be meaningful structure between cities and jobs that prevents one from achieving the monotonically decreasing map of probabilities required to ensure optimal clustering. Finding good partitionings in such scenarios is considered in on-going work.

Chapter 7

Conclusions

7.1 Contributions

A Unified Framework for Relational Learning

This thesis presents a general framework that allows a unified approach to representing and learning dependency-based models that include observed relations as well as latent properties of individuals. Past works have either modelled latent properties of individuals to explain the observed relations, or dependency structures amongst observed relations, but not both (see Ch. 1 and 2 for a discussion of these works).
In order to achieve our goal, we require a language that is sufficiently expressive: one that can express rich dependency structure over latent and observed relations. We describe the LRM in Ch. 3, which is a lifted relational probabilistic model that represents a directed acyclic dependency network over an arbitrary set of relations¹, which can directly include latent properties. The algorithms provided in Ch. 4 to 6 handle learning of the LRM's dependency parameters and of the values of latent properties.

By unifying the representation and learning of dependencies and latent properties, we can consider models that go beyond state-of-the-art models such as the IRM (Kemp et al., 2006). The IRM, and similar models such as the IHRM (Xu et al., 2006), generally assume a hierarchical structure, where relations are assumed to be explained by latent properties alone. Dependencies amongst observed relations are not considered, let alone the modelling of rich dependency structures.

¹ The formalism can essentially be regarded as a simplified version of FOPLs such as ICL (Poole, 1997, 2000).

An Argument for Latent Properties

A key motivation for this work grew out of the reference class problem (Reichenbach, 1949), which poses the question of which sample statistic one should bring to bear in answering queries about an individual. In Ch. 3 we show that, in the simplest setting, dependency-based learning with only observed relations yields reference classes, and therefore suffers from the philosophical and technical difficulties that accompany reference classes. We further examine the predictive accuracy of reference-class-based methods, and present a theoretical argument that they lead to residual errors in prediction.
Specifically, we assumed that the values of relations are directly attributed to latent properties of individuals in the true generative process of the data, and showed that reference class estimates capture only marginals of the process and result in residual bias in the statistical sense. By introducing latent properties of individuals, unbiased predictions can be represented. Empirical analysis using real-world data shows that models containing latent properties generally dominate reference classes in accuracy, measured in log-loss, an empirical measure of the bias of an estimator.

Learning LRMs

In Ch. 4 we provide our main algorithm for learning LRMs, which comprises learning the parameters governing probabilistic dependencies as well as the values of latent properties. A standard solution to the learning problem is EM, which is particularly suitable for this task because it tightly integrates the problem of learning dependency parameters with that of learning latent properties. However, posterior inference in the E step of EM is computationally prohibitive due to the large number of densely correlated latent random variables in LRMs. We therefore require an approximate solution, and proposed a simple approximation of EM that avoids the computational bottleneck.

Empirical analyses show that the proposed approximate EM algorithm is effective in learning LRMs with good predictive performance. The results reaffirm that modelling latent properties leads to better predictive performance than modelling only observed features. Furthermore, the ability of the algorithm to learn accurate LRMs is tested against the IRM (Kemp et al., 2006) with positive outcomes. A key result is that modelling dependencies as well as latent properties can lead to significant gains in accuracy.
For instance, when two observed relations are mutually informative (quantified by measuring their mutual information), representing and learning a dependency between them can lead to significant gains on both training and test data. Additionally, learning latent properties can achieve further improvements, and does not degrade accuracy. When the relations are not mutually informative, there is little difference between learning or omitting the dependency. Dependency-based theories are thus not only interpretable; the addition of latent properties also yields gains in predictive accuracy. Our results further indicate that models that use latent properties alone as explanations of observed relations can benefit from modelling dependencies amongst the observed relations. The findings in general suggest that learning dependency-based theories with latent relations is favourable from the standpoint of predictive accuracy, in addition to ease of interpretation.

Learning with Uncertainty about the Size of Latent Properties

A hallmark of the IRM (Kemp et al., 2006) and the IHRM (Xu et al., 2006) is that they are non-parametric representations, where the number of values of latent properties has no a priori upper bound. Our approximate EM algorithm assumes an a priori fixed number of values for latent properties. In Ch. 5 we relax this assumption and place a distribution over the number of values latent properties can represent. We proposed a search-based method, where the global loop searches over the number of values of latent properties in an LRM, and the inner loop calls the approximate EM procedure of Ch. 4.

Existing proposals use non-parametric distributions such as the Dirichlet process (DP) or the Chinese restaurant process (CRP) as stochastic generators of latent properties. Learning relies on posterior sampling using MCMC to simulate the generative process (Kemp et al., 2006; Sutskever et al., 2010).
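The outer search loop just described can be sketched as follows. This is an illustrative sketch only: `fit_em` is a hypothetical stand-in for the approximate EM procedure of Ch. 4, the error bounds discussed in the text are omitted, and weights are computed naively in probability space (real likelihoods would require log-space arithmetic). Only the shape of the method is from the text: a flexible prior (here Poisson) over the number of values, posterior weights over the models actually visited, and a model-averaged prediction.

```python
import math

def poisson_pmf(n, lam):
    """Prior over the number of latent-property values."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

def search_latent_sizes(fit_em, n_max, lam):
    """Search over the number of values n of a latent property.
    fit_em(n) is assumed to return (log_likelihood, predict_fn) for a
    model whose latent property has n values."""
    visited = []
    for n in range(1, n_max + 1):
        loglik, predict = fit_em(n)
        # Unnormalised posterior weight: prior(n) * likelihood(n).
        visited.append((n, poisson_pmf(n, lam) * math.exp(loglik), predict))
    z = sum(w for _, w, _ in visited)      # normalise over visited models
    posterior = {n: w / z for n, w, _ in visited}
    def averaged_predict(x):
        # Bayesian model averaging over the models visited in the search.
        return sum(posterior[n] * p(x) for n, _, p in visited)
    return posterior, averaged_predict
```

With a toy `fit_em` whose likelihood peaks at n = 3, the posterior concentrates on n = 3 and the averaged prediction is dominated by that model, mirroring the behaviour described for an appropriate choice of prior.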
Our approach is an alternative to the sampling-based approach, with some distinguishing features. Firstly, arbitrary priors can be placed over the possible numbers of values for latent properties, whereas DPs and CRPs represent a specific class of prior. Secondly, our search process is accompanied by error bounds on the likelihood. Whereas computing the exact likelihood would require an infinite number of models (one per configuration of the number of values for latent properties), our bounds reflect the error incurred from marginalising over only the models visited during the search process. As error bounds accompany each step of the search process, the proposed algorithm also has an anytime property. Thirdly, bounding the likelihood leads to bounds on the posterior over the number of values of latent properties, and subsequently to bounds for Bayesian-averaging-based prediction using the models visited in the search process.

Empirical results show that allowing a flexible choice of priors (the Poisson distribution in our experiments) can lead to better weighting of the models corresponding to the best number of latent property values. Given an appropriate choice of prior, averaged predictions can achieve predictive performance close to that of the single best model for the given domain.

Optimal Partitioning for Functions

An interesting case arises when functional relations are used. As explained in Ch. 6, the presence of a functional relation in the body of any dependency creates a verbose model whose number of parameters is proportional to the number of values of the function. The standard solution is to create a single summary dependency by aggregation (Jaeger, 1997; Kersting and De Raedt, 2000; Natarajan et al., 2005). However, the summary model generally leads to poor predictions, despite its compact size. The desired goal is then to find a model of some fixed intermediate size which also predicts accurately. In Ch.
6 we achieve this goal by solving the problem of finding good partitions of the verbose parameter set. We seek a fixed number of partitions, where the membership of each partition is optimised to minimise empirical loss on the training data. We show that this optimisation problem can be solved optimally and efficiently via dynamic programming. The representation size of the partitioned model no longer depends on the domain size, and the model exhibits predictions that dominate the fully aggregated model and can surpass those of the original verbose model.

7.2 Extensions

There are several directions for further work that extend this thesis. Some of these extensions are listed below.

Structure Learning

Chapters 4 and 5 of this thesis focused on learning the parameters of LRMs, where the dependency structures of LRMs are assumed to be part of the input. One can extend the learning framework to also incorporate search over the space of structures, in much the same way as done for BLPs (Kersting and De Raedt, 2002) or PRMs (Friedman et al., 1999; Getoor et al., 2002). The simplest way is to embed our approximate EM algorithm from Ch. 4 as a subroutine of a global search procedure over the space of dependency structures. More sophisticated forms of structure learning (for Bayesian networks) may be considered, e.g. extending the method of posterior sampling of Bayesian network structures proposed by Friedman and Koller (2003) to learn the structure of LRMs. Since we are also interested in handling uncertainty about the number of values for latent properties, combining our search method from Ch. 5 with a structure search scheme is an interesting direction for future research.

A Convergent Message Passing Algorithm

The E step of the approximate EM algorithm proposed in Ch. 4 computes the marginal posterior distribution of each latent random variable in a ground LRM.
Repeating the E step alone (without the M step) is observed to be a convergent process resulting in steady-state marginal posterior distributions. A suitable direction for on-going research is to formally characterise the convergence behaviour of this iteration. Establishing convergence would provide grounds for using the E step iteration as a viable approximate inference scheme for large graphical models in its own right. It could serve as an alternative to belief propagation (BP) (Pearl, 1988), which is not guaranteed to converge, and contribute to the library of convergent approximate inference methods.

Likelihood Approximations

Another facet of our approximate EM algorithm that warrants further investigation is its approximation of the likelihood function. Currently, a pseudo-likelihood is used to evaluate the quality of the parameters learned. The pseudo-likelihood is a sum of likelihoods computed from fragments of the ground LRM, and latent variables appearing in the overlaps of network fragments contribute to an under-estimation of the likelihood. To enable better estimates of the model parameters, it is desirable to improve the approximation of the log-likelihood. One way forward is to employ the region-based energy approximations devised by Yedidia et al. (2004). The result is a free-energy lower bound on the likelihood that is a sum of contributions from regions of the ground network (more precisely, of the factor graph of the ground network), with over-counting explicitly accounted for. A closer approximation to the log-likelihood is therefore achieved. The region-based approximation is appropriate for our work since we have implicitly defined regions in the E step of our approximate EM algorithm, and it remains to perform the appropriate subtractions. However, eliminating the over-counting terms leads to a non-convex energy function.
The loss of convexity can adversely affect the monotonic convergence property of the EM scheme, as the M step is no longer guaranteed to optimise the likelihood bound computed in the E step. For this reason, we did not include region-based energy functions in our evaluations.

A Tool for Analysing Relational Domains

This thesis has pursued the paradigm of learning rich dependency-based relational probabilistic models that also model latent properties of individuals. The proposed learning framework is, however, not limited to latent properties. In fact, arbitrary latent relations can be introduced in the model, without modification of the learning methods proposed. An example application of latent relations is the non-random missing data problem. Rather than assume that data values are missing at random, one can explicitly model the cause of missingness. For instance, for every case of the relation likes(User, Movie), one may model the cause of each case being observed or missing with a latent random variable. Doing so for all user-movie pairs amounts to introducing, for example, the latent relation will rate(X, Y). The ability to additionally postulate and learn about latent relations goes beyond discovering clusters of individuals with latent properties. Applying such models to analysing relational domains is an interesting direction for future work.

Bibliography

E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.

D. Aldous. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII. Springer, 1983.

F. Bacchus, A. J. Grove, J. Y. Halpern, and D. Koller. From statistical knowledge bases to degrees of belief. Artificial Intelligence, 87(1-2):75–143, 1996.

J. E. Besag. Statistical analysis of non-lattice data. Statistician, 24:179–195, 1975.

I. Bratko and S. Muggleton. Applications of inductive logic programming.
Communications of the ACM, 38(11):65–70, 1995.

R. Breiger, S. Boorman, and P. Arabie. An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling. Journal of Mathematical Psychology, 12:328–383, 1975.

W. L. Buntine. Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2:159–225, 1994.

J. Chang and D. M. Blei. Hierarchical relational models for document networks. Annals of Applied Statistics, 4(1):124–150, 2010.

M. Craven and S. Slattery. Relational learning with statistical predicate invention: Better models for hypertext. Machine Learning, 43(1/2):97–119, 2001.

L. De Raedt. Logical and Relational Learning. Springer, 2008.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 34:1–38, 1977.

M. desJardins, P. Rathod, and L. Getoor. Learning structured Bayesian networks: Combining abstraction hierarchies and tree-structured conditional probability tables. Computational Intelligence, 24(1):1–22, 2007.

T. Ferguson. Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230, 1973.

N. Friedman and D. Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50:95–125, 2003.

N. Friedman and D. Koller. Probabilistic Graphical Models. MIT Press, 2009.

N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pages 1300–1309, 1999.

S. Geman and C.-R. Hwang. Nonparametric maximum likelihood estimation by the method of sieves. Annals of Statistics, 10(2):401–414, 1982.

L. Getoor and B. Taskar, editors. Introduction to Statistical Relational Learning. MIT Press, 2007.

L. Getoor, N. Friedman, D. Koller, and B. Taskar.
Learning probabilistic models of link structure. Journal of Machine Learning Research, 2002.

J. Ghosh and R. Ramamoorthi. Bayesian Nonparametrics. Springer-Verlag, 2003.

J. Y. Halpern. An analysis of first-order logics of probability. Artificial Intelligence, 46:311–350, 1990.

M. S. Handcock, A. E. Raftery, and J. M. Tantrum. Model-based clustering for social networks. Journal of the Royal Statistical Society, 170(2):301–354, March 2007.

J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky. Bayesian model averaging: a tutorial (with discussion). Statistical Science, 14(4):382–417, 1999.

J. M. Hofman and C. H. Wiggins. Bayesian approach to network modularity. Physical Review Letters, 100(25), 2008.

T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pages 688–693, 1999.

D. Hume. An Enquiry Concerning Human Understanding. P.F. Collier & Son, 1748.

M. Jaeger. Relational Bayesian networks. In Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence, pages 266–273. Morgan Kaufmann, 1997.

D. Jensen, J. Neville, and B. Gallagher. Why collective inference improves relational classification. In Proceedings of the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 593–598, 2004.

Y. Kameya and T. Sato. Efficient EM learning with tabulation for parameterized logic programs. In Computational Logic, 2000.

C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence, 2006.

K. Kersting and L. De Raedt. Bayesian logic programs. In J. Cussens and A. Frisch, editors, Proceedings of the 10th International Conference on Inductive Logic Programming (Work-in-progress track), pages 138–155, 2000.

K. Kersting and L. De Raedt. Basic principles of learning Bayesian logic programs.
Technical report, University of Freiburg, 2002. S. Kok and P. Domingos. Learning the structure of Markov logic networks. In Proceedings of the 22th International Conference on Machine Learning, pages 441–448, 2005. S. Kok and P. Domingos. Statistical predicate invention. In Proceedings of the 24th International Conference on Machine Learning, 2007. D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009. Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 426–434, New York, NY, USA, 2008. ACM. 147 S. Kramer. Predicate invention: A comprehensive view. Technical Report OFAI-TR-95-32, 1995. F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2): 498–519, 2001. H. Kyburg. The reference class. Philosophy of Science, 50:374–397, 1983. H. E. Kyburg. Logical Foundations of Statistical Inference. D. Reidel Publishing Co., 1974. I. Levi. Direct inference and confirmational conditionalization. Philosophy of Science, 48(4):532–552, 1981. B. Marlin and R. S. Zemel. The multiple multiplicative factor model for collaborative filtering. In Proceedings of the 21st international conference on Machine learning, ICML ’04, pages 73–, New York, NY, USA, 2004. ACM. A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval Journal, 3:127–163, 2000. www.research.whizbang.com/data. T. McGrew. Direct inference and the problem of induction. The Monist, 84, 2001. B. Milch and S. Russell. First-order probabilistic languages: Into the unknown. In Proceedings of the 17th International Conference on Inductive Logic Programming, 2007. B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. 
BLOG: Probabilistic models with unknown objects. In 19th International Joint Conference on Artificial Intelligence, pages 1352–1359, 2005. S. Muggleton. Inverting implication. In Proceedings of the Second Inductive Logic Programming Workshop, pages 19–39, Tokyo, 1992. ICOT (Technical report TM-1182). S. Muggleton. Predicate invention and utilisation. Journal of Experimental and Theoretical Artificial Intelligence, 6(1):127–130, 1994. S. Muggleton. Inverse entailment and Progol. New Generation Computing, Special issue on Inductive Logic Programming, 13(3-4):245–286, 1995a. 148 S. Muggleton. Stochastic logic programs. In L. De Raedt, editor, Proceedings of the 5th International Workshop on Inductive Logic Programming, page 29. Department of Computer Science, Katholieke Universiteit Leuven, 1995b. S. Muggleton. Learning structure and parameters of stochastic logic programs. In S. Matwin and C. Sammut, editors, Proceedings of the 12th International Conference on Inductive Logic Programming, volume 2583 of LNAI, pages 198–206. SV, 2003. S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods. Journal of Logic Programming, 19/20:629–679, 1994. S. Muggleton, M. Bain, J. Hayes-Michie, and D. Michie. An experimental comparison of human and machine learning formalisms. In Proceedings of the Sixth International Workshop on Machine Learning, 1989. S. Natarajan, P. Tadepalli, E. Altendorf, T. G. Dietterich, A. Fern, and A. Restificar. Learning first-order probabilistic models with combining rules. In Proceedings of the 22nd International Conference on Machine learning, pages 609–616, 2005. R. Neal and G. Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer, 1998. J. Neville and D. Jensen. Leveraging relational autocorrelation with latent group models. In IEEE International Conference on Data Mining, pages 322–329, 2005. M. E. J. Newman. 
The structure and function of complex networks. SIAM Review, 45:167–256, 2003. S. Nienhuys-Cheng and R. de Wolf. Foundations of Inductive Logic Programming. Lecture notes in Artificial Intelligence. Springer-Verlag, Germany, 1997. H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpister. Identity uncertainty and citation matching. In Proceedings of the 15th Conference on Advances in Neural Information Processing Systems, 2002. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Reasoning. Morgan Kaufmann, San Mateo, 1988. J. Pearl. Causality. Cambridge University Press, 2009. 149 D. Poole. Average-case analysis of a search algorithm for estimating prior and posterior probabilities in bayesian networks with extreme probabilities. In Proceedings of the 13th international joint conference on Artifical intelligence Volume 1, pages 606–612, San Francisco, CA, USA, 1993a. Morgan Kaufmann Publishers Inc. D. Poole. Probabilistic Horn abduction and Bayesian networks. Artificial Intelligence, 64(1):81–129, 1993b. D. Poole. The independent choice logic for modelling multiple agents under uncertainty. Artificial Intelligence, 95(1-2):7–56, 1997. D. Poole. Abducing through negation as failure: stable models within the independent choice logic. Journal of Logic Programming, 44(1-3):5–35, 2000. J. R. Quinlan. Learning logical definitions from relations. Machine Learning, 5: 239–266, 1990. H. Reichenbach. The Theory of Probability. University of California Press, 1949. J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, 2005. M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62 (1-2):107–136, 2006. R. J. Rummel. Dimensionality of nations project: attributes of nations and behavior of nation dyads, 1950–1965. ICPSR data file., 1999. S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 
Prentice Hall, 3rd edition, 2009. R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20, 2008. T. Sato and Y. Kameya. PRISM: a symbolic-statistical modeling language. In Proceedings of the 15th International Joint Conference on Artificial Intelligence, pages 1330–1335, 1997. R. Sharma and D. Poole. Efficient inference in large discrete domains. In Proceeding of 19th Conference on Uncertainity in Artificial Intelligence, 2003. Y. Shen. Loss functions for binary classification and class probability estimation. PhD thesis, University of Pennsylvania, 2005. 150 P. Singla and P. Domingos. Lifted first-order belief propagation. In Proceedings of Association for the Advancement of Artificial Intelligence (AAAI), 2008. A. Srinivasan, S. Muggleton, R. King, and M. Sternberg. Theories for mutagenicity: a study of first-order and feature based induction. Artificial Intelligence, 85(1,2):277–299, 1996. I. Sutskever, R. Salakhutdinov, and J. Tenenbaum. Modelling relational data using Bayesian clustered tensor factorization. In Advances in Neural Information Processing Systems, 2010. B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering in relational data. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, pages 870–878, 2001. L. H. Ungar and D. P. Foster. Clustering methods for collaborative filtering. In Americal Association for Artificial Intelligence Workshop on Recommendation Systems. AAAI Press, 1998. U.S. Census Bureau. Travel to work characteristics for the 50 largest cities by population in the united states: 1990 census, 1990. J. Vickers. The problem of induction. In E. N. Zalta, editor, The Stanford Encyclopedia of Philosophy. 2009. URL http://plato.stanford.edu/archives/spr2009/entries/induction-problem/. Z. Xu, V. Tresp, K. Yu, and H. Kriegel. Infinite hidden relational models. 
In Proceedings of the 22nd Conference on Uncertainty in Aritificial Intelligence, 2006. Z. Xu, V. Tresp, A. Rettinger, and K. Kersting. Social network mining with nonparametric relational models. In H. Zhang, M. Smith, L. Giles, and J. Yen, editors, Advances in Social Network Mining and Analysis, LNCS. Springer, 2009. J. Yedidia, W. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. Technical report, Mitsubishi Electric Research Laboratories, 2003. J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282–2312, 2004. N. L. Zhang and D. Poole. Exploiting causal independence in Bayesian network inference. Journal of Artificial Intelligence Research, 5:301–328, 1996. 151
Title | Learning latent theories of relations and individuals
Creator | Chiang, Michael Chi-Hao
Publisher | University of British Columbia
Date Issued | 2011
Genre | Thesis/Dissertation
Type | Text
Language | eng
Date Available | 2011-04-18
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0051251
URI | http://hdl.handle.net/2429/33750
Degree | Doctor of Philosophy - PhD
Program | Computer Science
Affiliation | Science, Faculty of; Computer Science, Department of
Degree Grantor | University of British Columbia
Graduation Date | 2011-05
Campus | UBCV
Scholarly Level | Graduate
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/