Relational Logistic Regression

by

Seyed Mehran Kazemi

B.Sc., Amirkabir University of Technology, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

August 2014

© Seyed Mehran Kazemi 2014

Abstract

Aggregation is a technique for representing conditional probability distributions as an analytic function of parents. Logistic regression is a commonly used representation for aggregators in Bayesian belief networks when a child has multiple parents. In this thesis, we consider extending logistic regression to directed relational models, where there are objects and relations among them, and we want to model varying populations and interactions among parents. We first examine the representational problems caused by population variation. We show how these problems arise even in simple cases with a single parametrized parent, and propose a linear relational logistic regression which we show can represent arbitrary linear (in population size) decision thresholds, whereas traditional logistic regression cannot. Then we examine representing interactions among the parents of a child node, and representing non-linear dependency on population size. We propose a multi-parent relational logistic regression which can represent interactions among parents and arbitrary polynomial decision thresholds. We compare our relational logistic regression to Markov logic networks and describe their similarities and differences. Finally, we show how other well-known aggregators can be represented using relational logistic regression.

Preface

This thesis is largely based on two published works: one in the Knowledge Representation and Reasoning (KR-2014) conference [22] and one in the AAAI-2014 Statistical Relational AI (StarAI) workshop [23]. The former introduces the research problem described in this thesis and presents the solution; the latter summarizes the results and compares the model proposed in [22] with other similar methods in the literature. In both publications, I was the first and corresponding author, and I collaborated with David Buchman, Kristian Kersting, Sriraam Natarajan, and David Poole. In preparing both papers, we had a great deal of active discussion among all the co-authors and numerous refinements. Below is a description of my contributions to the work.

The preliminary problems with applying logistic regression to relational models, and the need for a model which addresses those problems, were realized by David Poole, my academic supervisor. He introduced this research topic to me and we started working on it and discussing it in our weekly meetings. Through our discussions, we found that there are other issues (besides the preliminary problems) that we should take into account when using logistic regression for relational models. We developed a relational version of logistic regression (RLR) and investigated how it could address the issues with standard logistic regression.

Having the RLR model, I proposed the idea of defining canonical forms for it. I realized that both positive conjunctive and positive disjunctive formulae can be used as canonical forms for RLR, and proved both results.
We also suggested a third canonical form in terms of XOR (without proof), the idea for which came from David Buchman.

Considering the open problem mentioned in [48] about representing k-degree polynomial decision thresholds, we set out to determine whether our RLR can represent this class of decision thresholds. I proved that every polynomial decision threshold can be represented by our RLR, and that every decision threshold that can be represented by our RLR is a polynomial decision threshold.

We also considered representing other well-known aggregators using our RLR. I worked out how OR, AND, Noisy-OR, Noisy-AND, Mean > t, More-than-t Trues, More-than-t% Trues, Max > t and Mode = t can be represented by RLR. David Buchman then suggested how the aggregators Max and Mode can be modeled using Max > t and Mode = t respectively.

After preparing the aforementioned content for our papers, I designed the structure and wrote the initial draft of [22], and David Poole did this for [23]. All co-authors revised, proofread, and commented on the drafts, especially David Poole, who modified several portions of [22]. Throughout the period of working on both papers, David Poole and I had many active discussions in our weekly meetings, and David Buchman, Kristian Kersting and Sriraam Natarajan's knowledge of the domain was very helpful in the progression of the ideas.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction
  1.1 Literature Review
  1.2 Perspective

2 Background
  2.1 Bayesian Networks
  2.2 Logistic Regression
    2.2.1 The Factorization Perspective
    2.2.2 Multi-valued Child Variables
  2.3 Relational Models
    2.3.1 Motivation
    2.3.2 Relational Probabilistic Models
    2.3.3 Markov Logic Networks

3 Relational Logistic Regression
  3.1 Aggregation with Logistic Regression in Relational Models
  3.2 Single-parent, Linear Relational Logistic Regression
  3.3 Multi-parent, Linear Relational Logistic Regression
  3.4 Interactions Among Parents
  3.5 Non-linear Decision Thresholds
  3.6 Using Weighted Formulae for Relational Logistic Regression
  3.7 General Relational Logistic Regression
  3.8 RLR vs MLNs
  3.9 Canonical Forms of RLR
  3.10 Non-linear Decision Thresholds
  3.11 Beyond Polynomial Decision Thresholds
  3.12 RLR with Multi-valued Child Variables

4 Approximating Other Aggregators Using RLR
  4.1 OR
  4.2 AND
  4.3 Noisy-OR and Noisy-AND
  4.4 Mean
  4.5 More-than-t Trues
  4.6 More-than-t% Trues
  4.7 Max
  4.8 Mode
  4.9 Aggregators Not Represented by RLR

5 Conclusion

Bibliography

List of Tables

1.1 Conditional probability table for random variable Fever based on ten diseases (represented by D1, D2, ..., D10) affecting the probability of someone having fever.
3.1 Predictions of a logistic regression model in a relational domain with a population size of 20 as a function of numerical representations for True and False, where the parameters are learned for a population of 10.
4.1 The probability of random variable AlarmSounds as a function of the number of people num who set off the alarm for the weighted formulae in Example 28, with an accuracy of six digits after the decimal point.
4.2 Conditional probability table for PRV MaxRate(m) representing the maximum rate of different movies. For simplicity, we just represent the desired value of the child node (the one having a probability of 1) instead of the probabilities of each value it can take.

List of Figures

1.1 A Bayesian network for random variable Fever based on ten diseases (represented by D1, D2, ..., D10) affecting the probability of someone having fever.
2.1 A Bayesian network for random variable Cancer representing a few of its causes and a few of the complications caused by it (taken from [27]).
2.2 The sigmoid function.
2.3 Two-digit binary to gray (B2G) conversion task and its corresponding Bayesian network for an intelligent tutoring system.
2.4 On the left is a relational belief network for predicting the rates of users for different movies given the users' ages and movies' genres, and on the right is a grounding of the model.
2.5 Intelligent tutoring system for teaching multi-digit binary to gray (B2G) conversion modeled in a relational setting using plate notation.
2.6 A relational model having a PRV (Subs()) with an unbounded number of parents in the grounding, whose conditional probability should be represented using an aggregation operator (taken from [25]).
2.7 An undirected relational model in plate notation on the left, and its grounding for the population {A1, A2, ..., An} on the right.
3.1 Logistic regression (with i.i.d. priors for the R(x)). The left side is the relational model in plate notation and the right side is the grounding for the population {A1, A2, ..., An}.
3.2 A relational model for representing people's happiness based on the number of friends they have and whether their friends are kind or not.
4.1 A relational model representing the evacuation scenario of a building when a member of the building sets off a smoke alarm. The conditional probability of the PRV AlarmSounds in this model should be represented using the aggregation operator OR, and the conditional probability of the PRV Evacuated should be represented using the aggregation operator AND.
4.2 On the left is a relational model for a child node with a single parent having an extra logical variable. On the right is a relational model representing the changes to be made to the model on the left for defining a noisy-OR or noisy-AND conditional probability for the child node using RLR.
4.3 On the left is a relational model for a child node with a single parent having an extra logical variable. On the right is a relational model representing the changes to be made to the model on the left for defining a conditional probability representing the aggregator max for the child node using RLR.
4.4 On the left is a relational model for a child node with a single parent having an extra logical variable. On the right is a relational model representing the changes to be made to the model on the left for defining a conditional probability representing the aggregator mode = t for the child node using RLR.

Acknowledgements

Many people have helped me make this thesis possible, and I owe all of them a debt I can never repay.

I am foremost indebted to my research supervisor, Prof. David Poole. He taught me how to do research, how to think about research problems, how to evaluate ideas, how to express ideas, and many other aspects of academia. Without his uninterrupted and endless guidance and support, this thesis would not have been possible.

I thank all my colleagues and friends in the department, the second examiner of my thesis, Dr. Nicholas Harvey, all members of the StarAI reading group at UBC, and all the people who helped me during these years in my academic and non-academic life.

Dedication

I would like to dedicate my thesis to my beloved family, especially...

• to my Mom for her inexhaustible support and motivation
• to my Dad for instilling the importance of hard work and higher education
• to my brothers and sisters for always being there for me

I dedicate the following Persian poem to my beloved parents...

Garche dar alam pedar darad maghami arjmand
Likan afzoon az pedar ghadro maghame MADAR ast
MADARe daanaa konad farzande daanaa tarbiat
Har ke bar har jaa resad az ehtemame MADAR ast

Chapter 1: Introduction

Probabilistic graphical models, including Bayesian (belief) networks and Markov networks (also known as Markov random fields) [41], are probabilistic models for representing the dependencies among random variables.
These networks use a graphical representation to model the interdependence among random variables. Bayesian networks are directed models representing the joint probability distribution (JPD) of a set of random variables in terms of conditional probability distributions (CPDs): one CPD for each random variable given its parents in the directed model. This allows for a compact and natural representation. Markov networks, on the other hand, are undirected models representing the joint probability of a set of random variables in terms of a set of potential functions, where each potential function is a non-negative real-valued function of a subset of the random variables. In this work, we focus on directed models with conditional probability distributions.

CPDs in Bayesian networks are often represented as tables. The advantage of a tabular representation is that it is as general as possible in terms of representing conditional probabilities; however, it also has several disadvantages. One disadvantage is that the number of parameters required to describe a CPD for a random variable grows exponentially with the number of parents it has in the Bayesian network.

Example 1. Consider a medical domain where whether someone has a fever depends on ten different diseases. Representing this domain as a Bayesian network with tabular CPDs, the CPD for the random variable Fever will have 2^10 = 1024 parameters (assuming all random variables are binary). The Bayesian network and the tabular CPD for this example are shown in Fig. 1.1 and Table 1.1 (Di represents the i-th disease and fever ≡ "Fever = True"). Not only is it computationally expensive to perform operations on this CPD, but it is also quite tiresome to acquire the probabilities from expert knowledge; experts will lose patience if we ask them 1024 questions. Learning such a table from data is also problematic because, given the standard ways of learning a tabular CPD from data, we cannot generalize from similar conditions. (Example taken from [26].)

Figure 1.1: A Bayesian network for random variable Fever based on ten diseases (represented by D1, D2, ..., D10) affecting the probability of someone having fever.

Example 1 indicates how using a tabular representation of a CPD for a random variable can cause trouble, and suggests considering other representations of CPDs in such situations. Tree-CPDs and rule-CPDs are two alternatives to the tabular representation of CPDs, offering a more compact representation by using the contextual independence of random variables [15, 50, 60]. Aggregation is another compact representation for CPDs, defining a CPD in terms of a function. Logistic regression [5, 32] is a form of aggregation which is used for representing the conditional probability of random variables having many parents (e.g., see [35, 53]). We explain how it works in later chapters.

One of the shortcomings of probabilistic graphical models is that they are often not adequate for representing large and complex domains where there are entities in a variety of configurations. Furthermore, while they enable us to handle uncertainty efficiently, their power to represent a wide variety of knowledge is usually limited.
Since first-order logic [54] gives a powerful and compact representation for knowledge, and knowledge representation and uncertainty management are two key elements in most applications, combining probability with first-order logic has received a great deal of attention. Early work in this field includes combining probability with Horn clauses [33, 44, 58], frame-based systems [11, 39], and database query languages [56]. This work led to the introduction of what are known as relational probabilistic models, which combine first-order logic with probabilistic models [1, 14].

Table 1.1: Conditional probability table for random variable Fever based on ten diseases (represented by D1, D2, ..., D10) affecting the probability of someone having fever.

  D1    | D2    | ... | D10   | Pr(fever | D1, ..., D10)
  True  | True  | ... | True  | α_1
  True  | True  | ... | False | α_2
  ...   | ...   | ... | ...   | ...
  False | False | ... | False | α_1024

Relational probabilistic models are models where there are probabilities about relations among individuals that can be specified independently of the actual individuals, and where the individuals are exchangeable: before we know anything about the individuals, they are treated identically. These models extend Bayesian networks and Markov networks by adding the concepts of objects, object properties, and relations.

As in Bayesian networks, in directed relational probabilistic models the joint probability is defined in terms of conditionals. One of the features of relational probabilistic models is that the conditional probability of a relation may depend on the number of individuals¹ (in relational models, we refer to the number of individuals as the population size) [45, 48]. In such cases, we cannot represent the conditional probability of the relation in terms of a table.

¹ Sometimes the dependence of a relation on the number of individuals in a relational model is desirable; in other cases, model weights may need to change. See [20, 21] for more information.

Example 2. Suppose a group of people are invited to a party. Whether the party is fun or not depends on the number of invited people who attend the party (the population size in this example refers to the number of invited people). Therefore, the number of parents that the random variable FunParty has is equal to the population size of people who have been invited to that party. This number, however, is not always fixed, because the number of people invited to different parties is not the same. Since we cannot bound the number of people invited to different parties, FunParty might have an unbounded number of parents, so a tabular representation of the conditional probability for this random variable is no longer a possible option.

Varying population sizes are quite common. They can arise in a number of ways, including:

• The actual population may be arbitrary. For example, in considering the probability of someone committing a crime (which depends on how many other people could have committed the crime) [45], we could consider the population to be the population of the neighbourhood, the population of the city, the population of the country, or the population of the whole world. It would be good to have a model that does not depend on this arbitrary decision. We would like to be able to compare models which involve different choices.

• The population can change. For example, the number of people in a neighbourhood or in a school class may change.
We would like a model to make reasonable predictions as the population changes. We would also like to be able to apply a model learned at one or a number of population sizes to different population sizes. For example, models from drug studies are acquired from very limited populations but are applied much more generally.

• The relevant populations can be different for each individual. For instance, in Example 2, whether a party is fun for a person may depend on the number of people at that party who are friends with him or her; however, this number is different for different people at the party. We would like a model that makes reasonable predictions for diverse numbers of friends.

• The train and test populations may differ. For example, the efficiency of a new hospitality management may be first examined in a small hospital with few patients, and later used in large hospitals with many patients.

Variation in population sizes, and the way the predictions of a model change with it, is an important factor which should be taken into account when dealing with probabilistic relational models. Not only can it affect the correctness of a model [48], but it also has a great influence on the performance of different methods used for inference in relational models [24]. It also necessitates the use of representations other than tables in certain cases, as described in Example 2. Aggregation is often used in relational models to handle the problem of representing conditional probabilities for relations with an unbounded number of parents.

In this work, we consider extending standard logistic regression as an aggregator for relational models and investigate how varying populations can cause problems for logistic regression. We propose a relational logistic regression model which addresses these problems and works appropriately for relational models.

1.1 Literature Review

Since their introduction, probabilistic relational models have drawn many researchers' attention. They have been used in fields such as making recommendations [13, 17], clustering [57], security risk analysis [55], and sorting rocks [9].

A great deal of attention in probabilistic relational models has been directed towards extending standard machine learning models for propositional data to work with relational models. Neville et al. [37] and Blockeel and De Raedt [2] developed relational probability tree and relational regression tree models by extending standard decision trees and regression trees respectively. Because the data are heterogeneous and interdependent, Neville et al. [37] and Blockeel and De Raedt [2] use aggregate values of random variables in the data (such as mean, mode, max, and count) along with the values of random variables in each tuple of the data to split the data at tree nodes. Jaeger [19] extends Bayesian networks to relational domains by defining a language for specifying Bayesian networks whose nodes are extensions of first-order predicates.

Other extensions of propositional machine learning models to relational domains include relational dependency networks [36], relational Bayesian classifiers [37], relational Markov networks (or Markov logic networks) [10, 52], etc.

Some models have been proposed in the literature which can lead to the learning of logistic regression models for relational data. For instance, Popescul et al.
[51] use inductive logic programming (ILP) [28] to generate first-order rules for a target relation, create features by propositionalizing the rules, and then use logistic regression to learn a classifier based on these features. There are also methods for discriminative learning of Markov logic networks which can be considered as logistic regression models with relational features. For instance, Huynh and Mooney [18] use an ILP technique to generate discriminative clauses for a target relation and then use logistic regression with an L1 regularizer to learn the weights with automatic feature selection (automatic structure learning). These methods are all designed for learning purposes and are not used as aggregators in relational models in the way logistic regression is used for aggregation in propositional models. In this work, we propose a relational version of logistic regression which can be used both for learning and for aggregation, and we discuss what can and cannot be done by our proposed model.

Aggregation in relational models is necessary when a variable has a possibly unbounded number of parents in the grounding. The use of aggregators for defining conditional probabilities in such situations has been investigated for many years, and many aggregation methods have been proposed and used in the literature. Horsch and Poole [16] proposed using probabilistic existential and universal quantifiers to define a conditional probability (e.g., Pr(b | ∃X, a(X)) = 0.7 and Pr(b | ¬∃X, a(X)) = 0.05). An existential quantifier model is equivalent to a logical OR operation, whose value is True if a property holds for at least one individual, and a universal quantifier is equivalent to a logical AND operation, whose value is True if a property holds for all individuals. Noisy-OR [40] and noisy-AND [8] are two common aggregators which are extensions of the standard OR and AND operations. They define a set of noise parameters for each parent, and the probability of the child being True increases according to those parameters as the property holds for more parents. Generalized linear models [29] are a class of popular aggregators that satisfy the independence of causal influence. They consist of a function whose input is the values of the parents of a random variable and whose output is a real number, together with a threshold on the output of the function determining whether the child node is True or False. Logistic regression is a generalized linear model with a soft threshold.

One important issue which should be taken into account when designing a relational model is variation in population sizes. This variation can have desirable or undesirable effects on the predictions of a model. Poole et al. [48] consider the effects of population size on three simple models (naive Bayes, logistic regression with one parent, and a simple Markov network) and show that in a relational setting, even these simple models make strong assumptions about how the size of a population affects the predictions. This study was later extended by considering population size extrapolation [47].
In this work, we also consider the effects of population size and the problems they introduce for using logistic regression in relational domains, and we address those problems in our proposed relational logistic regression.

1.2 Perspective

In this work, we demonstrate the problems that arise when using standard logistic regression as an aggregator in relational models, where a variable may have an unbounded number of parents in the grounding. Considering these problems, we initially propose a linear, single-parent relational logistic regression which solves the problems with standard logistic regression when there is only one (parametrized) parent and the decision threshold is linear. Then we demonstrate what happens when we have more parents that can interact with each other, and when we want to model non-linear decision thresholds. We develop a general relational logistic regression which works with an arbitrary number of parents and models any polynomial decision threshold. We define canonical forms and prove what can and cannot be modeled by our relational logistic regression. We also compare our relational logistic regression model with Markov logic networks, which are a class of undirected relational models, and point out the similarities and differences. We conclude the thesis by introducing many other popular aggregators and showing how we can approximate them using our relational logistic regression.

In this work, we focus on binary child variables and categorical parent variables, but explain a possible approach for extending the model to multi-valued child nodes. Extension of the model to continuous parent variables (as done in [31] for non-relational models) is left as future work.

The rest of the thesis is organized as follows. Chapter 2 provides the background needed to read the rest of the thesis and defines the terminology used in the thesis. Chapter 3 demonstrates the problems with standard logistic regression when used as an aggregator in relational models and defines relational logistic regression to overcome these problems. This chapter also compares relational logistic regression with Markov logic networks, and proves theorems about canonical forms and about what can and cannot be represented by relational logistic regression. Chapter 4 shows how other well-known aggregation models can be approximated in terms of relational logistic regression. Finally, Chapter 5 summarizes the thesis and points out some future directions.

Chapter 2: Background

In this chapter, we provide the background needed for the rest of the thesis. We also define the terminology used throughout the thesis.

2.1 Bayesian Networks

Suppose we have a set of random variables {X1, ..., Xn}. A Bayesian network or belief network [41] is a directed acyclic graph (DAG) where the random variables are the nodes and the arcs represent interdependence among the random variables. Each variable is independent of its non-descendants given values for its parents. Thus, if Xi is not an ancestor of Xj, then Pr(Xi | parents(Xi), Xj) = Pr(Xi | parents(Xi)), where parents(Xi) returns the parents of the random variable Xi in the DAG. The joint probability of the random variables in a Bayesian network can be factorized as:

  Pr(X1, X2, ..., Xn) = ∏_{i=1}^{n} Pr(Xi | parents(Xi))    (2.1)

Example 3. Fig. 2.1 represents a Bayesian network for cancer (network taken from [27]) with five random variables.
In this network, Cancer (C) has two parents, Pollution (P) and Smoking (S), and two children, Xray (X) and Dyspnea (D). We can infer from the network that if we observe whether someone has cancer or not, the probability of them having dyspnea is independent of whether they smoke. In order to model the joint probability of the random variables in the network, five conditional probabilities are constructed, one for each random variable given its parents. The joint probability is then as follows:

  Pr(C, P, S, X, D) = Pr(P) * Pr(S) * Pr(C | S, P) * Pr(X | C) * Pr(D | C)

One way to represent a conditional probability distribution Pr(Xi | parents(Xi)) in a Bayesian network is in terms of a table. Such a tabular representation for a random variable increases exponentially in size with the number of parents. For instance, a Boolean child having 10 Boolean parents requires 2^10 = 1024 numbers to specify the conditional probability (as in Table 1.1).

Figure 2.1: A Bayesian network for random variable Cancer representing a few of its causes and a few of the complications caused by it (taken from [27]).

A compact alternative to a table is an aggregation operator, or aggregator, that specifies a function describing how the distribution of a variable depends on the values of its parents. Common aggregators include OR and AND, as well as "noisy-OR" and "noisy-AND". These can be specified much more compactly than a table.

Example 4. Suppose in Fig. 1.1 we know that all diseases have the same effect on fever, and that whether someone has fever depends only on the number of diseases they have. We can model this conditional dependency of fever on its parents as follows:

  Pr(fever | D1, D2, ..., D10) = sign(w_0 + w_1 ∑_{i=1}^{10} Di)

where fever ≡ "Fever = True" and sign(x) is equal to 1 if x ≥ 0 and 0 otherwise. This model assumes the probability of fever given the diseases is either 0 or 1. With the above model, if we know that people have fever as soon as they have one of the ten diseases, we can represent this by setting w_0 = −1 and w_1 = 2 (assuming that True is represented by 1 and False by 0). This is a compact form requiring only 2 instead of 1024 parameters.

2.2 Logistic Regression

Logistic regression [5, 32] is an aggregator for Bayesian networks. We describe how it works and how it can be used as an aggregator.

Figure 2.2: The sigmoid function.

Suppose a Boolean random variable Q is a child of the numerical random variables {X1, X2, ..., Xn}. Logistic regression is an aggregation operator defined as:

  Pr(q | X1, ..., Xn) = sigmoid(w_0 + ∑_i w_i Xi)    (2.2)

where q ≡ "Q = True", sigmoid(x) = 1/(1 + e^{−x}) (Fig. 2.2 shows the sigmoid function), and w_0, w_1, ..., w_n are real-valued weights. It follows from the definition that Pr(q | X1, ..., Xn) > 0.5 iff w_0 + ∑_i w_i Xi > 0. The logistic regression definition in Eq. 2.2 assumes numerical inputs, so Boolean inputs need to be mapped to numerical values. There are many ways to do this; for now we assume True is represented by 1 and False by 0.

The space of assignments to X1, ..., Xn for which w_0 + ∑_i w_i Xi = 0 is called the decision threshold, as it is the boundary where Pr(q | X1, ..., Xn) changes between being closer to 0 and being closer to 1. Logistic regression provides a soft threshold, in that it changes from close to 0 to close to 1 in a continuous manner. How fast it changes can be adjusted by multiplying all weights by a positive constant.

Example 5. Suppose in Fig.
1.1 we know that if people have none of the ten diseases, the probability of them having fever is low; however, if they have at least one of the ten diseases, they have fever with high probability, and this probability increases with the number of diseases they have. Below is an example of how we can model this conditional probability for fever using logistic regression:

  Pr(fever | D1, ..., D10) = sigmoid(−3 + 5 D1 + 5 D2 + ... + 5 D10)

Given the above conditional probability, the probability of having fever for a person with none of the ten diseases is sigmoid(−3) ≈ 0.0474. Once they have one of the ten diseases, this probability becomes sigmoid(−3 + 5) ≈ 0.8808. This probability increases as the number of diseases increases. For instance, the probability of having fever for a person having two of the ten diseases is sigmoid(−3 + 2*5) ≈ 0.9991. The decision threshold for this example is D1 + D2 + ... + D10 = 3/5, meaning that the probability of fever is more than a half if the sum of the indicators of diseases being true is more than 3/5, less than a half if the sum of the indicators is less than 3/5, and exactly a half if the sum equals 3/5.

2.2.1 The Factorization Perspective

A simple and general formulation of logistic regression can be defined using a multiplicative factorization of the conditional probability. Eq. 2.2 then becomes a special case, which is equivalent to the general case when Q is binary and the probabilities are positive (non-zero).

We define a general logistic regression for Q with parents X1, ..., Xn (all variables here may be discrete or continuous) to be when Pr(Q | X1, ..., Xn) can be factored into a product of non-negative pairwise factors and a non-negative factor for Q:

  Pr(Q | X1, ..., Xn) ∝ f_0(Q) ∏_{i=1}^{n} f_i(Q, Xi)

where ∝ (proportional to) means it is normalized separately for each assignment to the parents. This differs from the normalization for joint distributions (as used in undirected models), where there is a single normalizing constant. Here the constraint that causes the normalization is ∀X1, ..., Xn : ∑_Q Pr(Q | X1, ..., Xn) = 1, whereas for joint distributions, the normalization is to satisfy the constraint ∑_{Q, X1, ..., Xn} Pr(Q, X1, ..., Xn) = 1.

If Q is binary, then:

  Pr(q | X1, ..., Xn) = f_0(q) ∏_{i=1}^{n} f_i(q, Xi) / [ f_0(q) ∏_{i=1}^{n} f_i(q, Xi) + f_0(¬q) ∏_{i=1}^{n} f_i(¬q, Xi) ]

If all factors are positive, we can divide and then use the identity y = e^{ln y}:

  Pr(q | X1, ..., Xn) = 1 / [ 1 + (f_0(¬q)/f_0(q)) ∏_{i=1}^{n} f_i(¬q, Xi)/f_i(q, Xi) ]
                      = 1 / [ 1 + exp( ln(f_0(¬q)/f_0(q)) + ∑_{i=1}^{n} ln(f_i(¬q, Xi)/f_i(q, Xi)) ) ]
                      = sigmoid( ln(f_0(q)/f_0(¬q)) + ∑_{i=1}^{n} ln(f_i(q, Xi)/f_i(¬q, Xi)) ).

When the ln(f_i(q, Xi)/f_i(¬q, Xi)) are linear functions of Xi, it is possible to find values for all the w's such that this can be represented by Eq. (2.2). This is always possible when the parents are binary.

2.2.2 Multi-valued Child Variables

Suppose a multi-valued categorical random variable Q, which can take k ≥ 2 different values denoted by {V1, ..., Vk}, is a child of the numerical random variables {X1, X2, ..., Xn}. Logistic regression learns (k−1)(n+1) weights denoted by:

  w_10      w_11      ...  w_1n
  w_20      w_21      ...  w_2n
  ...       ...       ...  ...
  w_(k−1)0  w_(k−1)1  ...  w_(k−1)n

and defines the conditional probability of Q given its parents as:

  Pr(Q = Vl | X1, ..., Xn) = exp(w_l0 + ∑_{i=1}^{n} Xi w_li) / [ 1 + ∑_{l'=1}^{k−1} exp(w_l'0 + ∑_{i=1}^{n} Xi w_l'i) ]   if l < k

  Pr(Q = Vl | X1, ..., Xn) = 1 / [ 1 + ∑_{l'=1}^{k−1} exp(w_l'0 + ∑_{i=1}^{n} Xi w_l'i) ]   if l = k    (2.3)

Note that the definition of logistic regression in Eq. 2.3 reduces to Eq.
2.2 when k = 2.

2.3 Relational Models

Relational models deal with objects and the relations among them. Non-relational (or propositional) models have a set of features and make predictions about a target feature based on those features. Relational models have a set of objects (also called individuals or entities), properties of the objects, and relations among these objects, and make predictions about target relations. The following subsections describe what relational models are and how they are used, and motivate why we should use relational models instead of non-relational models in certain domains.

2.3.1 Motivation

Bayesian networks, e.g., as shown in Figs. 1.1 and 2.1, are defined in terms of features (represented by nodes) and the probabilistic dependencies among them (represented by arcs). In some domains, we have a number of individuals, properties of individuals, and relationships among them, and we want to make probabilistic predictions about a random variable having a certain value, an individual having a property, or a group of individuals having a relationship.

Example 6. In social networks, we have a set of individuals (users of the social network), their properties (e.g., gender, age, etc.), and the relationships among them (e.g., being friends or following each other), and we might want to predict whether a user is fake or real given their properties and relations (e.g., as in [7, 59]).

These domains are best modeled in terms of individuals and relationships rather than in terms of features. The following example (inspired by the two-digit addition example in [49]) motivates the use of individuals and relations in the domain of intelligent tutoring systems.

Example 7. Consider an automated tutoring system for diagnosing the arithmetic errors students make in converting a number in binary format to gray code². The system should be able to predict whether the students know XOR and the conversion process well enough, or have problems with specific tasks.

Fig. 2.3 represents a simple case of two-bit binary to gray code conversion (B2G) and a corresponding Bayesian network for diagnosing whether the student knows the conversion process and how to XOR two bits. As we can see in this model, there is one node in the Bayesian network for each digit (Xi) of the binary number (plus one extra node which we observe to be zero), one for each digit (Yi) of the resulting gray code, one representing whether the student knows XOR (KnowsXOR), and one representing whether the student knows the conversion process (KnowsB2G). If, instead of a two-bit conversion problem, our intelligent tutoring system is teaching 10-bit conversion, we have to use ten nodes for the number in binary format and ten for the resulting gray code. Furthermore, if our system is teaching the conversion to several students, we need different copies of KnowsXOR, KnowsB2G and the Yi's for each student. We also need another copy of the Xi's and Yi's for each conversion problem that a student solves. With all these copies, our network will be very large and it will be very time-consuming to perform operations on it.

The problem of facing a huge Bayesian network with many copies of its nodes arises in this domain because we have many individuals (students, conversion problems and digits), individual properties (whether the students know how to XOR and the B2G process), and relations among individuals (the digits of the gray code calculated by students for conversion problems). Relational probabilistic
Relational probabilisticmodels are an appropriate alternative to be used in such situations.2Gray code is a binary numerical system where two successive values differ in only one bit. Inorder to convert a number in binary format to gray code, the d-th digit of the gray code is calculatedby XOR-ing the d-th and (d+1)-th digits in the binary format, assuming there is an extra 0 to the leftof the binary number.132.3. Relational Models X 2 =0  X 1 X 0  Y 1 Y 0 X 2 X 1 X 0 Y 1 Y 0 K n o w s B2G  K n o w s X OR  Figure 2.3: Two-digit binary to gray (B2G) conversion task and its correspondingBayesian network for an intelligent tutoring system.2.3.2 Relational Probabilistic ModelsRelational probabilistic models [14] or template based models [26] extend Bayesianor Markov networks by adding the concepts of individuals (objects, entities, things),relations among individuals (including properties, which are relations of a singleindividual), and by allowing for probabilistic dependencies among these relations.In these models, individuals about which we have the same information are ex-changeable, meaning that, given no evidence to distinguish them, they should betreated identically. We provide some basic definitions and terminologies in thesemodels which are used in the rest of the thesis.Some DefinitionsA population is a set of individuals. A population corresponds to a domain inlogic. The population size is the cardinality of the population which can be anynon-negative integer. For instance, a population can be the set of movies in amovie rating system where Titanic and Her are two individuals and the size of thepopulation is equal to the number of movies in the database.A logical variable is written in lower case. Each logical variable is typed witha population; we use |x| for the size of the population associated with a logicalvariable x. For instance, u and m may be two logical variables typed with thepopulation of users and movies in a movie rating system respectively. Constants,denoting individuals, start with an upper case letter. We refer to a set of logical142.3. Relational Models A ge(u)  Genre(m)  Rate(u,m)  A ge(Sa m )  A ge(Joe )  . . . Genre(Tita n ic )  Genre(Her )  … Rate ( Sa m, T itan ic )  Rate ( Sa m, Her )  Rate ( Joe , T itan ic )  Rate ( Joe , Her )  . . . . . . … … Figure 2.4: On the left is a relational belief network for predicting the rates of usersfor different movies given the users’ ages and movies’ genres, and on the right is agrounding of the model.variables by a lower case letter in bold (e.g., x).A parametrized random variable (PRV) is of the form F(t1, . . . , tk) where Fis a k-ary functor (a function symbol or a predicate) and each ti is a logical variableor a constant. Each functor has a range, which is {True,False} for predicate sym-bols. A PRV represents a set of random variables, one for each assignment of indi-viduals to its logical variables. The range of the functor becomes the range of eachrandom variable. For instance, Rate(u,m), Rate(Sam,m) and Rate(Sam,Titanic)are three different PRVs.A ground random variable is a PRV where all tis areconstants (e.g. Rate(Sam,Titanic)).A relational belief network is an acyclic directed graph where the nodes arePRVs and arcs represent the conditional independence among them. A groundingof a relational belief network with respect to a population for each logical variableis a belief network created by replacing each PRV with the set of random variablesit represents, while preserving the structure. Fig. 
2.4 represents a relational belief network on the left, where the rate given by a user u to a movie m depends on the age of the user and the genre of the movie, and its grounding on the right.

An atom is an assignment of a value to a PRV. For instance, R(x) is a PRV and R(x) = True is an atom. For a Boolean PRV R(x), we represent R(x) = True by R(x) and R(x) = False by ¬R(x). We refer to an atom ¬R(x) as a negated atom. A formula is made up of atoms with logical connectives. A Boolean formula is a formula in which conjunction, disjunction and negation are the only logical operators used. A conjunctive formula is a formula which is the conjunction of literals, where a literal is an atom or the negation of an atom. A disjunctive formula is a formula which is the disjunction of literals. A positive formula is a formula with no negations. For instance, R(x) ∧ S(y) is a positive conjunctive Boolean formula. In this work, we only consider Boolean formulae and, for simplicity, use the term formula to refer to a Boolean formula.

A substitution is a finite set θ = {x1/t1, x2/t2, ..., xk/tk}, where the xi's are distinct logical variables and each ti is a constant or a logical variable different from xi. Let F be a formula. Fθ is called an instance of F and is obtained by simultaneously replacing all occurrences of every xi, 1 ≤ i ≤ k, in F with the corresponding ti. θ is called a unifier for a set {F1, F2, ..., Fm} iff F1θ = F2θ = ... = Fmθ. θ is called the most general unifier for a set {F1, F2, ..., Fm} iff any other unifier ψ of this set can be expressed as ψ = θθ′, where θ′ is a substitution. The set {F1, F2, ..., Fm} is unifiable iff there exists a unifier for it. (Definitions taken from [6].)

A Boolean interaction among a series of PRVs is an interaction that can be represented by a Boolean formula of the PRVs.

When using a single population, we write the population as A1 ... An, where n is the population size, and use R1 ... Rn as short for R(A1) ... R(An). We also use n_val for the number of individuals x for which R(x) = val. When R(x) is binary, we use the shorthand n_T = n_True and n_F = n_False.

Intelligent Tutoring Example with Relational Models

Example 7 showed how modeling a domain with multiple individuals and relations among them can be troublesome for a standard Bayesian network. In order to model this problem in a relational setting, we define logical variables s, d and p typed with the populations of students, digits and problems respectively. Having these logical variables, we define the digits of X by two PRVs, X(d,p) and X(d+1,p), where the ground random variable X(D,P) represents the D-th digit of the number in binary format from the P-th problem. We also define KnowsXOR(s) and KnowsB2G(s), where each ground random variable KnowsXOR(S) represents whether student S knows how to XOR, and KnowsB2G(S) represents whether student S knows the binary to gray conversion process. Finally, we define the result variables as Y(d,p,s), where the ground random variable Y(D,P,S) represents the D-th digit of the result calculated by student S for the P-th problem. The relational belief network can then be represented in plate notation [4] as in Fig. 2.5. Having this relational model, there are several efficient algorithms for weight learning [11] and inference [30, 46] in these models, which can be used to predict the value of each PRV given any of the other PRVs.
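To make the underlying task concrete, below is a minimal sketch in Python (our own illustration, with function names of our choosing, not code from the thesis) of the deterministic relationship the model diagnoses: when a student knows the process, Y(D,P,S) should equal the XOR of X(D,P) and X(D+1,P), following the convention of footnote 2 that an extra 0 is assumed to the left of the binary number.

    def binary_to_gray(bits):
        """Convert a binary number to gray code.

        `bits` is a list of 0/1 digits, most significant first. Following
        footnote 2, an extra 0 is padded to the left, and each gray digit
        is the XOR of a binary digit and the digit to its left.
        """
        padded = [0] + bits
        return [padded[d] ^ padded[d + 1] for d in range(len(bits))]

    # The two-bit case of Fig. 2.3 (X2 = 0 is the observed extra node):
    print(binary_to_gray([1, 1]))        # [1, 0]
    print(binary_to_gray([1, 0, 1, 1]))  # [1, 1, 1, 0]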
Figure 2.5: Intelligent tutoring system for teaching multi-digit binary to gray (B2G) conversion, modeled in a relational setting using plate notation.

A Relational Model Requiring Aggregation

Fig. 2.6 (taken from [25]) represents a relational model and a grounding of the model. According to the model, the probability of a team having a substitution depends on the probability of substitution for any of the players, where the probability of substitution for each player depends on whether the player is injured in the game, and the probability of an injury depends on whether the player is in shape. In this model, the conditional probabilities of the PRVs Shape(player), Inj(player) and Sub(player) can be represented in terms of tables, because their parents do not have extra logical variables, so the number of parents they have in the grounding is fixed. However, the PRV Subs() has a parent, Sub(player), which has an extra logical variable. Since the number of players on different teams and in different sports can differ, Subs() may have an unbounded number of parents in the grounding. Therefore, the conditional probability for this PRV cannot be represented in terms of a table, and we have to use aggregation instead. Logical OR (or, equivalently, an existential quantifier) is an appropriate aggregation operator for the PRV Subs(), because Subs() is True if there is at least one individual P for which Sub(P) is True.

Figure 2.6: A relational model having a PRV (Subs()) with an unbounded number of parents in the grounding, whose conditional probability should be represented using an aggregation operator (taken from [25]).

2.3.3 Markov Logic Networks

Markov logic networks (MLNs) [10, 52] are undirected models for representing the joint probability distribution of a set of PRVs. MLNs extend standard Markov networks to relational domains. They represent the joint probability of PRVs in terms of a first-order knowledge base with soft constraints. Before we explain how MLNs work, we need to give two definitions.

A world is an assignment of a value to each ground random variable. The number of worlds is exponential in the number of ground random variables.

A first-order knowledge base [12] is a set of sentences or formulae in first-order logic.

A first-order knowledge base can be viewed as a set of hard constraints on the possible worlds. A hard constraint means that if a world violates even one of the formulae in the first-order knowledge base, the probability of that world is zero. MLNs soften these hard constraints by making a world less probable (not impossible) when it violates a formula. The constraint is softened by associating a weight with each formula, whose value is proportional to the decrease in probability of a world violating that formula.

More formally, an MLN defines the probability distribution over worlds using a set of weighted formulae of the form ⟨F, w⟩, where F is a formula and w is the weight of the formula. The set of all weighted formulae of a model can be viewed as a first-order knowledge base with softened rules. The probability of a world is proportional to the exponential of the sum of the weights of the instances of the
formulae that are True in the world. The probability of any formula is obtained by summing over the worlds in which the formula is True.

Figure 2.7: An undirected relational model in plate notation on the left, and its grounding for the population {A1, A2, ..., An} on the right.

Example 8. Consider the undirected relational model in Fig. 2.7. An MLN for this model may define the probability distribution using the following weighted formulae:

  ⟨Q, α0⟩
  ⟨Q ∧ ¬R(x), α1⟩
  ⟨Q ∧ R(x), α2⟩
  ⟨R(x), α3⟩

The probability of any world can be calculated based on the number of instances of the formulae that are True in the world.

MLNs can also be adapted to define conditional distributions. Below is an example of representing a conditional distribution using MLNs.

Example 9. Consider the undirected relational model in Fig. 2.7 and suppose the joint probability of the PRVs is defined using the weighted formulae in Example 8. Also suppose that the truth value of R(x) for every individual x is observed. The MLN of Fig. 2.7 defines the conditional probability of Q given the observed values of the R(x) as follows:

  Pr(q | obs) = sigmoid(α0 + n_F α1 + n_T α2)    (2.4)

where obs has R(x) true for n_T individuals and false for n_F individuals out of a population of n individuals (so n = n_T + n_F), and sigmoid(x) = 1/(1 + e^{−x}).

Chapter 3: Relational Logistic Regression

Logistic regression is an aggregator for standard Bayesian networks; however, it cannot be used as an aggregator for relational models without appropriate changes. In this chapter, we point out the problems of using logistic regression in relational domains and develop a relational version of logistic regression, which we call relational logistic regression.

3.1 Aggregation with Logistic Regression in Relational Models

We saw earlier in this thesis that using an aggregator to define a conditional probability in standard Bayesian networks can save a noticeable amount of memory and reduce computation. While aggregation is optional in non-relational models, it is necessary in directed relational models whenever the parent of a PRV contains extra logical variables. For example, suppose the Boolean PRV Q is a child of the Boolean PRV R(x), which contains an extra logical variable, x, as in Fig. 3.1. In the grounding, Q is connected to n instances of R(x), where n is the population size of x. For the model to be defined before n is known, it needs to be applicable for all values of n.

Common ways to aggregate the parents in relational domains, e.g. [11, 16, 25, 34, 38, 42], include logical operators such as OR, AND, noisy-OR and noisy-AND, as well as ways to combine probabilities.

Logistic regression, as described in the previous chapter, may also be used for relational models. For instance, for the model in Fig. 3.1, logistic regression defines the conditional probability of the child node as:

  Pr(q | R1, ..., Rn) = sigmoid(w_0 + ∑_i w_i Ri).    (3.1)

Since the individuals in a relational model are exchangeable, w_i must be identical for all parents Ri (this is known as parameter sharing or weight tying), so Eq. 3.1 becomes:

  Pr(q | R1, ..., Rn) = sigmoid(w_0 + w_1 ∑_i Ri).    (3.2)

Figure 3.1: Logistic regression (with i.i.d. priors for the R(x)). The left side is the relational model in plate notation and the right side is the grounding for the population {A1, A2, ..., An}.
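To spell out Eq. (3.2), here is a minimal sketch (our own Python illustration, not code from the thesis): because the single weight w_1 is tied across all individuals, the prediction depends on the parent values only through their sum, which with the 1/0 encoding is just n_T.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def lr_aggregate(w0, w1, parent_values):
        """Eq. (3.2): Pr(q | R_1, ..., R_n) with a single tied weight w1.

        `parent_values` holds the numeric encoding of R(A_i) for each
        individual; with True = 1 and False = 0 the sum equals n_T.
        """
        return sigmoid(w0 + w1 * sum(parent_values))

    # A grounding with n = 10 where R is True for exactly 5 individuals:
    r = [1] * 5 + [0] * 5
    print(lr_aggregate(-4.5, 1.0, r))  # sigmoid(0.5) ≈ 0.62, just above 0.5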
Consider what happens with a relational model when n is not fixed.

Example 10. Suppose we want to represent "Q is True if and only if R is True for 5 or more individuals", i.e., q ≡ (|{i : Ri = True}| ≥ 5) or q ≡ (n_T ≥ 5), using a logistic regression model (Pr(q) ≥ 0.5) ≡ (w_0 + w_1 ∑_i Ri ≥ 0), which we fit for a population of 10. Consider what this model represents when the population size is 20.

If R = False is represented by 0 and R = True by 1, this model will have Q = True when R is True for 5 or more individuals out of the 20. It is easy to see this, as ∑_i Ri only depends on the number of individuals for which R is True.

However, if R = False is represented by −1 and R = True by 1, this model will have Q = True when R is True for 10 or more individuals out of the 20. The sum ∑_i Ri depends on how many more individuals have R True than have R False.

If R = True is represented by 0 and R = False by any other value, this model will have Q = True when R is True for 15 or more individuals out of the 20. The sum ∑_i Ri depends on how many individuals have R False.

While the choice of representation for True and False was arbitrary in the non-relational case, in the relational case different parametrizations can result in different decision thresholds as a function of the population. Table 3.1 gives some numerical representations for False and True, with corresponding parameter settings (w_0 and w_1), such that all regressions represent the same conditional distribution for n = 10. However, for n = 20, the predictions are different.

Table 3.1: Predictions of a logistic regression model in a relational domain with a population size of 20 as a function of numerical representations for True and False, where the parameters are learned for a population of 10.

  False | True | w_0      | w_1   | Prediction for n = 20
  0     | 1    | −4.5     | 1     | Pr(Q = True) > 0.5 iff n_T ≥ 5
  −1    | 1    | 0.5      | 0.5   | Pr(Q = True) > 0.5 iff n_T ≥ 10
  −1    | 2    | −7/6     | 1/3   | Pr(Q = True) > 0.5 iff n_T ≥ 8
  −1    | 0    | 5.5      | 1     | Pr(Q = True) > 0.5 iff n_T ≥ 15
  −1    | 100  | −889/202 | 1/101 | Pr(Q = True) > 0.5 iff n_T ≥ 5
  1     | 2    | −14.5    | 1     | Pr(Q = True) > 0.5 iff n_T ≥ 0
  −100  | 1    | 1091/202 | 1/101 | Pr(Q = True) > 0.5 iff n_T ≥ 15

The decision thresholds in all situations in Table 3.1 are linear functions of population size. It is straightforward to prove the following proposition:

Proposition 1. Let R = False be represented by the number α and R = True by β ≠ α. Then, for fixed w_0 and w_1 (e.g., learned for one specific population size), the decision threshold for a population of size n is

  w_0 / (w_1 (α − β)) + (α / (α − β)) n.

Proof. Let n_T represent the number of individuals for which the parent is True and n represent the population size. Given the values of w_0, w_1, α and β, the decision threshold can be determined by solving the following equation for n_T (we assume w_1 ≠ 0, otherwise the child does not depend on the parent):

  w_0 + w_1 (β n_T + α (n − n_T)) = 0
  ⇒ w_0 / w_1 + n_T (β − α) + α n = 0
  ⇒ n_T = w_0 / (w_1 (α − β)) + (α / (α − β)) n

What is important about this proposition is that the way the decision threshold changes with the population size n, i.e., the coefficient α / (α − β), does not depend on the data (which affects the weights w_0 and w_1), but only on the arbitrary choice of the numerical representation of R.

Thus, Eq. (3.2) with a specific numeric representation of True and False is only able to model one of the dependencies of how predictions depend on population size, and so cannot properly fit data that does not adhere to that dependence.
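A small script (our own illustration, reusing three of the parameter settings from Table 3.1) makes Proposition 1 concrete: the encodings below all give a threshold of n_T = 4.5 at n = 10, but their thresholds diverge at n = 20.

    def threshold(w0, w1, alpha, beta, n):
        """Proposition 1: the value of n_T at which
        w0 + w1 * (beta * n_T + alpha * (n - n_T)) = 0."""
        return w0 / (w1 * (alpha - beta)) + alpha * n / (alpha - beta)

    # (False, True, w0, w1) from three rows of Table 3.1:
    rows = [(0, 1, -4.5, 1.0), (-1, 1, 0.5, 0.5), (-1, 0, 5.5, 1.0)]
    for alpha, beta, w0, w1 in rows:
        print(threshold(w0, w1, alpha, beta, 10),
              threshold(w0, w1, alpha, beta, 20))
    # n = 10: 4.5 for every row; n = 20: 4.5, 9.5 and 14.5 respectively,
    # matching the n_T >= 5, n_T >= 10 and n_T >= 15 predictions.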
We need an additional degree of freedom to get a relational model that can model any linear dependency on n, regardless of the numerical representation.

3.2 Single-parent, Linear Relational Logistic Regression

We define a single-parent, linear relational logistic regression by adding a degree of freedom to the logistic regression formalism of Eq. (3.2).

Definition 1. Let Q be a Boolean PRV with a single parent R(x), where x is the set of logical variables in R that are not in Q (so we need to aggregate over x). A single-parent, linear relational logistic regression (SPL-RLR) for Q with parent R(x) is of the form:

Pr(q | R(A1), ..., R(An)) = sigmoid(w0 + w1·Σi Ri + w2·Σi (1 − Ri))    (3.3)

where Ri is short for R(Ai) (Ai is the i-th assignment of individuals to x), and is treated as 1 when it is True and 0 when it is False. Note that Σi Ri is the number of (tuples of) individuals for which R is True (= nT) and Σi (1 − Ri) is the number of (tuples of) individuals for which R is False (= nF).

An alternative but equivalent parametrization is:

Pr(q | R(A1), ..., R(An)) = sigmoid(w0 + w2·Σi 1 + w3·Σi Ri)    (3.4)

where 1 is a function that has value 1 for every individual, so Σi 1 = n. The mapping between these parametrizations is w3 = w1 − w2; w0 and w2 are the same.

Proposition 2. Let R = False be represented by α and R = True by β ≠ α. Then, for fixed w0, w2 and w3 in Eq. (3.4), the decision threshold for a population of size n is

nT = w0/(w3(α − β)) + ((α + w2/w3)/(α − β))·n.

Proof. Let nT represent the number of individuals for which the parent is True and n represent the population size. Given the values of w0, w2, w3, α and β, the decision threshold can be determined by solving the following equation for nT (we assume w3 ≠ 0, otherwise the child does not depend on the values of the individuals in the parent):

w0 + w2·n + w3(β·nT + α(n − nT)) = 0
⇒ w0/w3 + (w2/w3)·n + nT(β − α) + α·n = 0
⇒ nT = w0/(w3(α − β)) + ((α + w2/w3)/(α − β))·n

Proposition 2 implies that the way the decision threshold of an SPL-RLR grows with the population size n, i.e., the coefficient (α + w2/w3)/(α − β), depends on the weights. Moreover, for fixed α and β with α ≠ β, any linear function of the population can be modeled by varying the weights. This was not true for traditional logistic regression.

For the rest of this thesis, when we embed logical formulae in arithmetic expressions, we take True formulae to represent 1, and False formulae to represent 0. Thus Σ_L F is the number of assignments to the variables L for which formula F is True.
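A minimal sketch of the SPL-RLR of Eq. (3.4), assuming the 0/1 representation of Definition 1 (the weight settings below are invented for illustration): the extra weight w2 on the population size is what makes thresholds such as nT ≥ n/2 representable with fixed weights.

```python
# SPL-RLR per Eq. (3.4): Pr(q | R(A_1..A_n)) = sigmoid(w0 + w2*n + w3*nT).
import math

def spl_rlr(n_true, n, w0, w2, w3):
    return 1.0 / (1.0 + math.exp(-(w0 + w2 * n + w3 * n_true)))

# "Q iff nT >= 5", independent of n (w2 = 0 recovers ordinary LR):
assert spl_rlr(5, 20, w0=-4.5, w2=0.0, w3=1.0) > 0.5
assert spl_rlr(4, 20, w0=-4.5, w2=0.0, w3=1.0) < 0.5
# "Q iff half or more of the population is True", threshold n/2 for every n:
for n in (10, 20, 50):
    assert spl_rlr(n // 2, n, w0=0.25, w2=-0.5, w3=1.0) > 0.5
    assert spl_rlr(n // 2 - 1, n, w0=0.25, w2=-0.5, w3=1.0) < 0.5
```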
3.3 Multi-parent, Linear Relational Logistic Regression

The SPL-RLR proposed in Definition 1 can be extended to multiple (parametrized) parents by having a different pair of weights ((w1, w2) or (w2, w3)) for each parent PRV. This is similar to non-relational logistic regression, where each parent has a (single) different weight. We define a multi-parent, linear relational logistic regression (MPL-RLR) as follows:

Definition 2. Let Q be a Boolean PRV with parents R1(x1), R2(x2), ..., Rk(xk), where x1, x2, ..., xk are the sets of logical variables in R1, R2, ..., Rk respectively that are not in Q (so we need to aggregate over them). A multi-parent, linear relational logistic regression (MPL-RLR) for Q with parents R1(x1), R2(x2), ..., Rk(xk) is of the form:

Pr(q | R1(x1), ..., Rk(xk)) = sigmoid(w0 + w11·Σ_{x1} R1(x1) + w21·Σ_{x1} (1 − R1(x1)) + w12·Σ_{x2} R2(x2) + w22·Σ_{x2} (1 − R2(x2)) + ... + w1k·Σ_{xk} Rk(xk) + w2k·Σ_{xk} (1 − Rk(xk)))    (3.5)

This formulation of MPL-RLR needs 2k + 1 parameters, where k is the number of parents. Similar to SPL-RLR, we can use an alternative representation for MPL-RLR as follows:

Pr(q | R1(x1), ..., Rk(xk)) = sigmoid(w0 + w21·Σ_{x1} 1 + w31·Σ_{x1} R1(x1) + w22·Σ_{x2} 1 + w32·Σ_{x2} R2(x2) + ... + w2k·Σ_{xk} 1 + w3k·Σ_{xk} Rk(xk))    (3.6)

where 1 is a function that has value 1 for every individual. The mapping between these parametrizations is ∀i: w3i = w1i − w2i; w0 and the w2i are the same. Like the parametrization of Eq. (3.5), this parametrization also needs 2k + 1 parameters. In cases where parents share logical variables, however, it can be more compact.

Suppose x1, x2, ..., xr are the distinct sets of logical variables of the parents of Q, where r ≤ k. Eq. (3.6) can then be re-written with only k + r + 1 parameters as:

Pr(q | R1(x1), ..., Rk(xk)) = sigmoid(w0 + w21·Σ_{x1} 1 + ... + w2r·Σ_{xr} 1 + w31·Σ_{xR1} R1(x1) + ... + w3k·Σ_{xRk} Rk(xk))    (3.7)

where xRj represents the set of logical variables of the parent Rj.

3.4 Interactions Among Parents

Definition 2 shows how SPL-RLR can be extended to multiple parents with no interactions. However, there are cases where we want to model the interactions among the parents.

Example 11. Suppose we want to model whether someone being happy depends on the number of their friends that are kind. We assume the PRV Happy(x) has as parents Friend(y,x) and Kind(y), as shown in Fig. 3.2. Note that the number of friends can be different for each person.

Figure 3.2: A relational model for representing people's happiness based on the number of friends they have and whether their friends are kind or not.

Consider the following hypotheses:
1. A person is happy as long as they have 5 or more friends who are kind.
   happy(x) ≡ |{y : Friend(y,x)∧Kind(y)}| ≥ 5
2. A person is happy if half or more of their friends are kind.
   happy(x) ≡ |{y : Friend(y,x)∧Kind(y)}| ≥ |{y : Friend(y,x)∧¬Kind(y)}|
3. A person is happy as long as fewer than 5 of their friends are not kind.
   happy(x) ≡ |{y : Friend(y,x)∧¬Kind(y)}| < 5

These three hypotheses coincide for people with 10 friends, but make different predictions for people with 20 friends.

All three hypotheses are based on the interaction between the two parents: each requires the number of individuals for which some formula of the parents (instead of just a single parent) holds. For instance, modeling the first hypothesis requires the number of people that are simultaneously friends with x and kind. MPL-RLR does not include formulae of the parents and considers each parent separately. We cannot count the number of individuals that are simultaneously friends with x and kind from two separate numbers, one indicating the number of people that are friends with x and the other indicating the number of people that are kind. Therefore, MPL-RLR cannot model these interactions without including weighted formulae of the parents in its parametrization. In order to model such aggregators, we need to extend MPL-RLR by adding such weighted formulae. The following extended MPL-RLR models these cases:

Pr(happy(x) | Π) = sigmoid(w0 + w1·Σy Friend(y,x)∧Kind(y) + w2·Σy Friend(y,x)∧¬Kind(y))    (3.8)

where Π is a complete assignment of Friend and Kind to the individuals, and the right hand side sums over the propositions in Π for each individual. To model each of the above three cases, we can set w0, w1, and w2 in Eq. (3.8) as follows:
1. Let w0 = −4.5, w1 = 1, w2 = 0
2. Let w0 = 0.5, w1 = 1, w2 = −1
3. Let w0 = 5.5, w1 = 0, w2 = −1
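As a quick illustration of Eq. (3.8), the following hedged sketch (the friend counts are invented for illustration) evaluates the three weight settings above for a person with 20 friends, 6 of whom are kind, reproducing the disagreement described in Example 11:

```python
# Eq. (3.8): sigmoid(w0 + w1*|friend & kind| + w2*|friend & ~kind|).
import math

def pr_happy(n_kind_friends, n_unkind_friends, w0, w1, w2):
    s = w0 + w1 * n_kind_friends + w2 * n_unkind_friends
    return 1.0 / (1.0 + math.exp(-s))

hypotheses = {                       # weight settings 1-3 above
    ">=5 kind friends":   (-4.5, 1.0, 0.0),
    ">=half kind":        (0.5, 1.0, -1.0),
    "<5 unkind friends":  (5.5, 0.0, -1.0),
}
# A person with 20 friends, 6 of them kind: the hypotheses now disagree.
for name, (w0, w1, w2) in hypotheses.items():
    print(name, pr_happy(6, 14, w0, w1, w2) > 0.5)
# -> True (6 >= 5), False (6 < 10), False (14 >= 5)
```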
Going from Eq. (3.3) to Eq. (3.4) allowed us to use only the positive cases in SPL-RLR. We can also model this example using only positive formulae, by replacing

w2·Σy Friend(y,x)∧¬Kind(y)

with

w2·Σy Friend(y,x) − w2·Σy Friend(y,x)∧Kind(y)

in Eq. (3.8), resulting in:

Pr(happy(x) | Π) = sigmoid(w0 + (w1 − w2)·Σy Friend(y,x)∧Kind(y) + w2·Σy Friend(y,x))    (3.9)

Example 11 showed that Boolean formulae of the parents may need to be considered when parents interact with each other. It also showed a particular case where we can model the domain using only positive formulae. We can use a similar construction in more general cases:

Example 12. Suppose a PRV Q is a child of PRVs R(x) and S(x), and its conditional probability depends on a conjunctive formula of the parents such as R(x)∧S(x), R(x)∧¬S(x), ¬R(x)∧S(x) or ¬R(x)∧¬S(x). As in Example 11, we need to count the number of instances of a formula that are True in an assignment to the parents. It turns out that in this case R(x)∧S(x) is the only non-atomic formula required to model the interactions between the two parents, because the other conjunctive interactions can be represented using this count as follows:

Σx R(x)∧¬S(x) = Σx R(x) − Σx R(x)∧S(x)
Σx ¬R(x)∧S(x) = Σx S(x) − Σx R(x)∧S(x)
Σx ¬R(x)∧¬S(x) = |x| − Σx R(x) − Σx S(x) + Σx R(x)∧S(x)

with |x| = Σx True.

Example 12 shows that the positive conjunction of the two interacting parents is the only formula required to compute arbitrary conjunctive formulae consisting of one atom of each parent. In more complicated cases, however, subtle changes to the representation may be required.

Example 13. Suppose a PRV Q is a child of PRVs R(x,y) and S(x,z). Suppose we want to represent "Q is True if and only if R(x,y)∧¬S(x,z) is True for more than t triples 〈x,y,z〉". If we follow what we did in Example 12 to represent R(x,y)∧¬S(x,z) in terms of positive formulae, we will get Σ_{x,y,z} R(x,y)∧¬S(x,z) = Σ_{x,y} R(x,y) − Σ_{x,y,z} R(x,y)∧S(x,z). However, we need the number of triples 〈x,y,z〉, instead of the number of pairs 〈x,y〉, for which R(x,y) is True. We thus need to use Σ_{x,y,z} R(x,y), the number of assignments to x, y and z for which R(x,y) is True, as follows:

Σ_{x,y,z} R(x,y)∧¬S(x,z) = Σ_{x,y,z} R(x,y) − Σ_{x,y,z} R(x,y)∧S(x,z)

So as part of the representation, we need to include the set of logical variables and not just a weighted formula.

3.5 Non-linear Decision Thresholds

Examples 12 and 13 suggest how to model interactions among the parents. Now consider the case where the decision threshold for the child PRV is a non-linear function of its parents' population sizes. For instance, if the individuals are the nodes in a dense graph, some properties of arcs grow with the square of the population of nodes. We describe how MLNs represent a class of non-linear decision thresholds for undirected models, and use an analogous idea in our relational logistic regression.

Markov logic networks (MLNs), as described earlier, are undirected models for representing the joint probability of a set of PRVs, which can also be adapted to define a conditional distribution. One of the characteristics of MLNs is that they can also model a class of non-linear dependencies on the population sizes. The following example shows a case where a non-linear conditional distribution is modeled by MLNs.

Example 14. Consider the MLN for PRVs Q and R(x), consisting of a single formula Q∧R(x)∧R(y) with weight w, where y represents the same population as x. The probability of q given observations of R(Ai) for all Ai has a quadratic decision threshold:

Pr(q | R(A1), ..., R(An)) = sigmoid(w·nT²).
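The following small sketch (ours, with a made-up world of five individuals) verifies the mechanics behind Examples 14 and 15: summing a formula over a duplicated logical variable squares the count, so the expression of Example 15 below evaluates to nT² − n + nT, which is positive exactly when nT² > nF.

```python
# Counting over pairs (x, x') of R(x) and R(x') yields nT^2.
from itertools import product

R = {"A1": True, "A2": False, "A3": True, "A4": False, "A5": False}  # toy world
n = len(R)
nT = sum(R.values())
nF = n - nT

pair_count = sum(R[x] and R[xp] for x, xp in product(R, R))
assert pair_count == nT ** 2                      # 2*2 = 4
score = pair_count - n + nT                       # Example 15's expression
assert (score > 0) == (nT ** 2 > nF)              # 4 > 3 -> True
```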
More formally, MLNs use Boolean formulae with more than one instance of a PRV (each instance having a different logical variable typed with the same population) to model non-linear decision thresholds on the population size of the PRV. This allows modeling a class of non-linear dependencies using only Boolean formulae. This idea can also be used by relational logistic regression. Consider the following example:

Example 15. Suppose a PRV Q is a child of the PRV R(x), and we want to represent "Q is True if and only if nT² > nF". This dependency can be represented by introducing a new logical variable x′ with the same population as x and treating R(x′) as if it were a separate parent of Q. Then we can use the interaction between R(x) and R(x′) to represent the model in this example as:

Σ_{x,x′} R(x)∧R(x′) − Σx True + Σx R(x).

3.6 Using Weighted Formulae for Relational Logistic Regression

The previous section showed how the idea of representing non-linear dependencies in MLNs can also be used by relational logistic regression. MLNs define the joint probability of PRVs in terms of a set of weighted formulae, which makes the model more intuitive. We can also define our relational logistic regression in terms of weighted formulae, but for representing a conditional probability distribution instead of the joint distribution. Each of the sums in our definitions of SPL-RLR and MPL-RLR can be represented by a weighted formula, and then we take the sigmoid of the sum of these sums.

Example 16. The three sums used in Example 15 for representing a non-linear decision threshold with relational logistic regression can be represented by the following weighted formulae:
• Σ_{x,x′} R(x)∧R(x′) ⇒ 〈R(x)∧R(x′), 1〉
• Σx True ⇒ 〈True, −1〉
• Σx R(x) ⇒ 〈R(x), 1〉

Using such weighted formulae to represent the conditional probability, our relational logistic regression will be the directed analog of MLNs. We discuss this in more detail after we define our general relational logistic regression.
3.7 General Relational Logistic Regression

The previous examples show the potential for using a relational version of logistic regression as an aggregator for relational models. We need a language for representing aggregation in relational models in which we can address the problems mentioned. We propose a generalized form of relational logistic regression as a directed analog of MLNs, which works for multi-parent cases and can model the same class of non-linear decision thresholds. We first give a formal definition of the weighted formulae used by relational logistic regression and then define relational logistic regression based on these weighted formulae.

Definition 3. A weighted formula (WF) for a Boolean PRV Q(x), where x is a tuple of logical variables, is a triple 〈L, Q(x′)∧F′, w〉, where L is a set of logical variables such that L ∩ x′ = {}, Q(x′) is an instance of Q(x), F′ is a formula of parent PRVs of Q such that each logical variable in F′ either appears in x′ or is in L, and w is a weight.

A child PRV can have a set of WFs. We represent the set of WFs for Q(x) by WFsQ.

Example 17. Suppose Q(x,y) is a child of PRVs R(x,z) and S(y). The following are all valid WFs:
• 〈{}, Q(x,y), 1〉
• 〈{y′}, Q(x,y)∧(S(y)∨S(y′)), 5〉
• 〈{z}, Q(x,x)∧R(x,z), −2〉
• 〈{x,y,z}, Q(Xi,Yj)∧S(y)∧R(Xi,z), 3/5〉
where Xi and Yj are two individuals from the populations assigned to x and y respectively. The following WFs, however, are not valid: the first one has the logical variable z in its formula, which appears neither in the child's instance nor in the set of logical variables; the second one does not have an instance of the child PRV Q; and the third one has two instances of Q.
• 〈{}, Q(x,y)∧R(x,z), 1〉
• 〈{x,y}, S(y), 5〉
• 〈{x,y}, Q(x,y)∧Q(Xi,Yj)∧S(y), 5〉

Definition 4. A weighted formula 〈L, Q′∧F′, w〉 for PRV Q(x), where Q′ represents an instance of Q and F′ represents a formula of the parents of Q, is compatible with a ground random variable Q(X), where X is a tuple of individuals, if Q′ and Q(X) are unifiable. We represent the set of WFs for Q that are compatible with Q(X) by comp(WFsQ, X).

Example 18. Consider a PRV Q(x,y) and a ground random variable Q(Xi,Yj). WFs with the following formulae are compatible with this ground random variable (F′ denotes any Boolean formula of the parents):
• Q(x,y)∧F′
• Q(x,Yj)∧F′
• Q(Xi,Yj)∧F′
and the following are not compatible:
• Q(Xi′,y)∧F′
• Q(Xi,Yj′)

Definition 5. Let Q(x) be a Boolean PRV with parents Ri(xi), where xi is the tuple of logical variables in Ri. A (general) relational logistic regression (RLR) for Q with parents Ri(xi) is defined using a set WFsQ of WFs for Q as:

Pr(q(X) | Π) = sigmoid( Σ_{〈L,Q′∧F′,w〉 ∈ comp(WFsQ,X)} w · Σ_L F′Πθ(Q′,Q(X)) )

where Π represents the assigned values to the parents of Q(x), X represents a tuple of individuals, 〈L, Q′∧F′, w〉 represents a WF for Q in which Q′ is an instance of Q and F′ is a formula of the parents of Q, comp(WFsQ, X) represents all weighted formulae for Q compatible with Q(X), θ(Q′, Q(X)) represents the most general unifier of Q′ and Q(X), and F′Πθ(Q′,Q(X)) is the formula F′ with the substitution θ(Q′,Q(X)) applied to it, evaluated in Π. (The first summation is over the set of WFs; the second summation is over the tuples of L. Note that Σ_{} sums over a single instance.)
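The following is a minimal Python sketch of Definition 5, with the unification step already resolved: each WF is represented here (a representation choice of our own) by its set of logical variables, a function evaluating the parents' part F′ under one assignment, and a weight. The populations and world below are invented for illustration and follow the first WF set of Example 19 further down.

```python
# Pr(q(X)=True | world) = sigmoid(sum over WFs of w * #satisfying assignments).
from itertools import product
import math

def rlr_prob(wfs, populations, world):
    total = 0.0
    for logvars, formula, weight in wfs:
        doms = [populations[v] for v in logvars]
        total += weight * sum(formula(dict(zip(logvars, tup)), world)
                              for tup in product(*doms))
    return 1.0 / (1.0 + math.exp(-total))

# <{}, Q, w0> and <{x,y,z}, Q & R(x,y) & ~S(x,z), w1> with w0=-1.5, w1=1.
populations = {"x": ["X1", "X2"], "y": ["Y1"], "z": ["Z1", "Z2"]}
world = {("R", "X1", "Y1"): True, ("R", "X2", "Y1"): False,
         ("S", "X1", "Z1"): False, ("S", "X1", "Z2"): True,
         ("S", "X2", "Z1"): True, ("S", "X2", "Z2"): True}
wfs = [((), lambda a, wd: True, -1.5),
       (("x", "y", "z"),
        lambda a, wd: wd[("R", a["x"], a["y"])] and not wd[("S", a["x"], a["z"])],
        1.0)]
print(rlr_prob(wfs, populations, world))  # one satisfying triple: sigmoid(-0.5)
```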
The SPL-RLR (Definition 1) is a subset of Definition 5, because the terms of Eq. (3.4) can be modeled as follows:
• w0 can be represented by 〈{}, Q, w0〉
• w2·Σi 1 can be represented by 〈{x}, Q, w2〉
• w3·Σi Ri can be represented by 〈{x}, Q∧R(x), w3〉
RLR then sums these WFs, resulting in:

Pr(q | Π) = sigmoid(w0·Σ_{} True + w2·Σ_{x} True + w3·Σ_{x} R(x)) = sigmoid(w0 + w2·n + w3·Σi Ri).

It is straightforward to see that MPL-RLR (Definition 2) is also a subset of Definition 5.

Example 19. Consider the problem introduced in Example 13. Using general RLR (Definition 5), we can model the conditional probability of Q using the following WFs:
〈{}, Q, w0〉
〈{x,y,z}, Q∧R(x,y)∧¬S(x,z), w1〉
Or alternatively:
〈{}, Q, w0〉
〈{x,y,z}, Q∧R(x,y), w1〉
〈{x,y,z}, Q∧R(x,y)∧S(x,z), −w1〉

3.8 RLR vs MLNs

The definition of WFs in Definition 3 is similar to the WFs used by MLNs. The major difference is that we allow exactly one instance of the child node in the formula, and it must be conjoined with the formula of the parents, whereas MLNs allow arbitrary numbers of instances of each PRV in their formulae. The other, minor, difference is that we represent the set L of logical variables to be summed over explicitly, whereas MLNs leave it implicit: instead of summing over the logical variables in a set L, MLNs sum over the logical variables appearing in the formula. We allow extra logical variables that do not appear in the formula to be included in L, and Definition 5 then also sums over these extra logical variables when calculating the value of a WF. To sum over an extra logical variable z in an MLN, one could conjoin a True(z) to the formula, where True is a property that holds for all individuals. In this section, we use an explicit set of logical variables in the weighted formulae of MLNs.

We demonstrate that RLR is the directed analog of MLNs using the following example:

Example 20. Consider the model in Example 8, where we had the following weighted formulae:
〈{}, Q, α0〉
〈{x}, Q∧¬R(x), α1〉
〈{x}, Q∧R(x), α2〉
〈{x}, R(x), α3〉
Treating this as an MLN (as in Fig. 2.7), if the truth value of R(x) for every individual x is observed:

Pr(q | obs) = sigmoid(α0 + nF·α1 + nT·α2)    (3.10)

where obs has R(x) True for nT individuals and False for nF individuals out of a population of n individuals (so n = nT + nF).

Note that in the MLN, α3 is not required for representing the conditional probability (because it cancels out), but can be used to affect Pr(R(Ai)), where Ai is an individual of x.

In RLR, the sigmoid, as in Equation (3.10), is used as the definition of RLR. RLR only defines the conditional probability of Q being True given each combination of assignments to the R(x) (using Equation (3.10)); when not all R(x) are observed, separate models of the probability of R(x) are needed.

MLNs and RLR agree in the supervised learning case, when all variables except a query leaf variable are observed (as in Example 20). However, they are quite different in representing distributions. Note that in MLNs there is a single normalizing constant, guaranteeing that the probabilities of the worlds sum to 1. In RLR, normalization is done separately for each possible assignment to the parents.

In summary: RLR uses the weighted formulae to define the conditional probabilities, and MLNs use the weighted formulae to define the joint probability distribution.

3.9 Canonical Forms of RLR

While in Definition 3 we allow any Boolean formula of the parents, we can prove that a positive conjunctive form is sufficient to model all the Boolean interactions among the parents. (A Boolean interaction is defined in the background section.) Since all formulae in the WFs for a child PRV Q are conjoined with an instance of Q, we only consider the formulae of the parents of Q in our proofs.

Proposition 3. Let Q be a Boolean PRV with parents Ri(xi), where xi is the set of logical variables in Ri that are not in Q. Using only positive conjunctive formulae of the parents in the WFs for Q, all Boolean interactions among the parents can be modeled by RLR.

Proof. Every Boolean formula F can be represented as a disjunction of mutually exclusive conjunctive formulae F1 ∨ F2 ∨ ... ∨ Fm for some m, where all the Fi are conjunctive and mutually exclusive (i.e., ∀i, j ≠ i: Fi∧Fj is False) [43]. Therefore, a WF 〈L, F, w〉 can be replaced by 〈L, F1∨F2∨...∨Fm, w〉. Since the conjunctive formulae are mutually exclusive, this WF can in turn be replaced by the m WFs 〈L, F1, w〉, 〈L, F2, w〉, ..., 〈L, Fm, w〉. So it suffices to prove that any WF 〈L, Fj, w〉, where Fj is a conjunctive formula, can be represented using only positive conjunctive formulae.

We prove this by induction on the number of negations, denoted nneg. For nneg = 0, the formula Fj is a positive conjunction and the claim holds. Assume the claim holds for nneg. For nneg + 1, let ¬Ri(xi) be one of the negated atoms of the formula Fj, and write Fj as F′j ∧ ¬Ri(xi); note that F′j has nneg negations. The WF 〈L, F′j∧¬Ri(xi), w〉 can be replaced by 〈L∪xi, F′j, w〉 and 〈L, F′j∧Ri(xi), −w〉. Each of the formulae in these WFs has only nneg negations, for which the claim holds by the induction hypothesis.
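The rewriting step used in this proof is mechanical enough to sketch in code. The representation below (formulae as sets of signed atoms) is our own choice, and the example input is the WF for Friend(y,x)∧¬Kind(y) from Eq. (3.8):

```python
# Trade one negated literal for two WFs, recursing until all WFs are positive:
# <L, F' & ~R, w>  =>  <L u vars(R), F', w>  and  <L, F' & R, -w>.
def eliminate_negations(logvars, literals, w, vars_of):
    neg = next((lit for lit in literals if lit[1] is False), None)
    if neg is None:
        return [(logvars, literals, w)]       # already positive conjunctive
    atom, _ = neg
    rest = literals - {neg}
    return (eliminate_negations(logvars | vars_of[atom], rest, w, vars_of)
            + eliminate_negations(logvars, rest | {(atom, True)}, -w, vars_of))

wfs = eliminate_negations(frozenset({"y"}),
                          frozenset({("Friend(y,x)", True), ("Kind(y)", False)}),
                          1.0, {"Kind(y)": frozenset({"y"})})
for wf in wfs:
    print(wf)
# -> <{y}, Friend(y,x), 1.0> and <{y}, Friend(y,x) & Kind(y), -1.0>
```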
Example 21. Suppose we want to represent a WF 〈{x}, F, w〉, where F = A(x)∧(B∨C(x)) is conjoined with the child PRV, in terms of WFs with positive conjunctive formulae. First we write F in sum-of-products form as (A(x)∧B∧¬C(x)) ∨ (A(x)∧C(x)). Note that no pair of the product-form formulae can be simultaneously True. The second product-form formula is in positive conjunctive form and can be represented by a single WF:
〈{x}, A(x)∧C(x), w〉
and the first one can be represented using the following WFs:
〈{x}, A(x)∧B, w〉
〈{x}, A(x)∧B∧C(x), −w〉

Proposition 3 suggests using only positive conjunctive formulae of the parents in WFs. We refer to an RLR conditional probability using WFs with only positive conjunctive formulae of the parents as a positive conjunctive RLR, and to an RLR conditional probability using WFs with only positive disjunctive formulae of the parents as a positive disjunctive RLR. Proposition 4 proves that positive disjunctive RLR has the same representational power as positive conjunctive RLR. Therefore, all propositions proved for positive conjunctive RLR in the rest of the thesis also hold for positive disjunctive RLR.

Proposition 4. A conditional distribution Pr(Q | Ri(xi)) can be expressed by a positive disjunctive RLR if and only if it can be expressed by a positive conjunctive RLR.

Proof. First, suppose Pr(Q | Ri(xi)) can be expressed by a positive disjunctive RLR. Proposition 3 already gives that Pr(Q | Ri(xi)) can then be expressed by a positive conjunctive RLR; we can also prove this directly, as follows. We can write a positive disjunctive formula F as ¬F′, where F′ is a conjunctive formula (of negated atoms). So we can change every disjunctive formula Fj in the WFs for Pr(Q | Ri(xi)) to ¬F′j, where F′j is a conjunctive formula. A WF 〈L, ¬F′j, w〉 can be modeled by the two WFs 〈L, Q, w〉 and 〈L, F′j, −w〉 having conjunctive formulae, because the former counts all assignments to L and the latter counts all assignments to L for which F′j is True; the difference of these WFs gives the number of assignments to L for which F′j is False. The latter WF may consist of negated atoms, but we know from Proposition 3 that we can model it by a set of positive conjunctive WFs. Consequently, Pr(Q | Ri(xi)) can also be expressed by a positive conjunctive RLR.

Now, suppose the conditional distribution can be expressed by a positive conjunctive RLR definition of Pr(Q | Ri(xi)). While Proposition 3 is written for positive conjunctive RLR, it is straightforward to see that it also holds for conjunctive formulae of negated atoms, by running the induction on the number of positive atoms and removing them one by one. This means that we can express Q by WFs 〈L, Fk, w〉 where each Fk is a conjunction of negated atoms. We can represent each such formula Fk as ¬F′k, where F′k is a positive disjunctive formula. As noted above, a WF 〈L, ¬F′k, w〉 can be expressed by the two WFs 〈L, Q, w〉 and 〈L, F′k, −w〉, both of whose formulae are in positive disjunctive form. Consequently, Pr(Q | Ri(xi)) can also be expressed by a positive disjunctive RLR.

Buchman et al. [3] looked at canonical representations for probability distributions with binary variables in the non-relational case. Our positive conjunctive canonical form corresponds to their "canonical parametrization" with a "reference state" True (i.e., in which all variables are assigned True), and our positive disjunctive canonical form has a connection to using a "reference state" False. Their "spectral representation" would correspond to a third positive canonical form for RLR, in terms of XORs (i.e., parity functions).
3.10 Non-linear Decision Thresholds

We can also model a class of non-linear decision thresholds using RLR. The following example is a case where the child PRV depends on the square of the population size of its parent.

Example 22. Suppose Q is a Boolean PRV with a parent R(x). By having a WF 〈{x,x′}, Q∧R(x)∧R(x′), w〉 for Q, where x′ is typed with the same population as x, the conditional probability of Q depends on the square of the number of assignments to x for which R(x) is True. This is similar to the WF used for an MLN in Example 14.

Example 22 is a case where the conditional probability of a child PRV depends on the square of its parent's population size. We can prove that, by using only positive conjunctive formulae in the WFs of a child PRV, we can model any polynomial decision threshold. First we prove this for the single-parent case and then for the general multi-parent case. We assume in the following propositions that Q is a Boolean PRV and R1(x1), R2(x2), ... are its parents, where xi is the set of logical variables in Ri that are not in Q. We use x′i to refer to a new set of logical variables typed with the same populations as those in xi.

Proposition 5. A positive conjunctive RLR definition of Pr(Q | R(x)) (single-parent case) can represent any decision threshold that is a polynomial function of the sizes of the logical variables in x and the numbers of (tuples of) individuals for which R(x) is True or False.

Proof. Based on Proposition 3, we know that a WF having any Boolean formula can be written as a set of WFs, each having a positive conjunctive formula. Therefore, in this proof we do not commit to using only positive conjunctive formulae; the final set of WFs can be converted to a set of positive conjunctive WFs using Proposition 3.

Each term of the polynomial in the single-parent case is of the form w·(∏i |xi|^di)·nT^α·nF^β, where nT and nF denote the numbers of individuals for which R(x) is True or False respectively, xi ∈ x represents the i-th logical variable in x, α, β and the di are non-negative integers, and w is the weight of the term. (It is important to consider the population sizes of the logical variables in the polynomial terms, since some properties may depend on these population sizes. For instance, Poole et al. [47] give a real-world example of predicting the age of people using the number of movies they have rated.)

First we prove by induction that for any j, there is a WF that can build the term nT^α·nF^β for any α and β with α + β = j, α ≥ 0 and β ≥ 0. For j = 0, nT^α·nF^β = 1. We can trivially build this with the WF 〈{}, True, w〉. Assuming the claim is correct for j, we prove it for j + 1. For j + 1, either α > 0, or α = 0 and β > 0. If α > 0, using our assumption for j, we can have a WF 〈L, F, w〉 which builds the term nT^(α−1)·nF^β. The WF 〈L∪{x′}, F∧R(x′), w〉 then builds the term nT^α·nF^β, because the first WF was True nT^(α−1)·nF^β times and is now counted nT more times, as R(x′) is True for nT individuals. If α = 0 and β > 0, we can have a WF 〈L, F, w〉 which builds the term nT^α·nF^(β−1). By the same reasoning as in the previous case, the WF 〈L∪{x′}, F∧¬R(x′), w〉 produces the term nT^α·nF^β.

In order to include the population sizes of the logical variables xi ∈ x, and generate the term (∏i |xi|^di)·nT^α·nF^β, we only add di extra logical variables x′i to the set of logical variables of the WF that generates nT^α·nF^β.
Then we set the weight of this WF to w to generate the desired term.

We have thus shown that we can generate every term of the polynomial. Since RLR sums over all these terms, we can generate every decision threshold which is a polynomial function of the sizes of the logical variables in x and the numbers of (tuples of) individuals for which R(x) is True or False.

Conclusion. One of the conclusions of this proposition is that a term w·(∏i |xi|^di)·nT^α·nF^β can be generated by a WF whose formula consists of α instances of R(x′) and β instances of ¬R(x′) (each instance with a fresh logical variable typed with the same population as x), with di copies of each logical variable xi added to the set of logical variables, and with the weight of the WF set to w. We will use this conclusion for proving the proposition in the multi-parent case.

Example 23. Suppose we want to model the case where Q is True if the square of the number of True individuals in R(x) is at least 5 more than twice the number of False individuals in it (i.e., nT² ≥ 2nF + 5). In this case, we can model the sigmoid of the polynomial nT² − 2nF − 4.5. The reason for using 4.5 instead of 5 is to make the polynomial positive when nT² = 2nF + 5. The following WFs are used by RLR to model this polynomial; the first WF generates the term nT², the second generates −2nF, and the third generates −4.5:
〈{x,x′}, Q∧R(x)∧R(x′), 1〉
〈{x}, Q∧¬R(x), −2〉
〈{}, Q, −4.5〉
Note that the second WF above can be written in positive form using the following two WFs:
〈{x}, Q, −2〉
〈{x}, Q∧R(x), 2〉

Using the conclusion following Proposition 5, we now extend Proposition 5 to the multi-parent case:

Proposition 6. A positive conjunctive RLR definition of Pr(Q | R1(x1), ..., Rk(xk)) (multi-parent case) can represent any decision threshold that is a polynomial function of the sizes of the logical variables in the parents and the numbers of (tuples of) individuals for which Boolean functions of the parents hold.

Proof. Let G1, G2, ..., Gt represent the Boolean interactions of the parents in our model, and let n(i) denote the number of individuals for which Gi is True. Each term of the polynomial in the multi-parent case is then of the form w · (a product of population sizes) · n(1)^α1 · n(2)^α2 ... n(t)^αt. We show how to generate any term n(1)^α1 · n(2)^α2 ... n(t)^αt; the inclusion of the population sizes of the logical variables and of the weight w for each term is the same as in Proposition 5.

The conclusion of Proposition 5 can be generalized to work for any Boolean formula Gi instead of a single parent R: we only need to include a conjunction of αi instances of Gi, with different logical variables typed with the same populations in each instance. Let Fi represent this conjunction of αi instances of Gi. For each n(i)^αi, we can use this generalization of the conclusion of Proposition 5 to obtain a WF 〈Li, Fi, 1〉 which generates this term. Similar to the reasoning in the single-parent case, the WF 〈∪_{i=1..t} Li, F1∧F2∧...∧Ft, w〉 generates the term n(1)^α1 · n(2)^α2 ... n(t)^αt. We can then use Proposition 3 to write this WF using only positive conjunctive WFs.

We have thus shown that we can generate every term of the polynomial. Since RLR sums over all these terms, we can generate any decision threshold that is a polynomial function of the sizes of the logical variables in the parents and the numbers of (tuples of) individuals for which Boolean functions of the parents hold.
Example 24. Suppose we want to model the case where Q is True if the square of the number of individuals for which R1(x1) = True, multiplied by the number of individuals for which R2(x2) = True, is less than five times the number of False individuals in R1(x1). In this case, we define G1 = ¬R1(x1) and G2 = R1(x1)∧R1(x′1)∧R2(x2). We can model the sigmoid of the polynomial 5n(1) − n(2) − 0.5. The reason why we use −0.5 in the polynomial is that we want the polynomial to be negative when 5n(1) = n(2). The following WFs are used by RLR to model this polynomial, where the first formula generates the term 5n(1), the second generates −n(2), and the third generates −0.5:
〈{x1}, Q∧¬R1(x1), 5〉
〈{x1,x′1,x2}, Q∧R1(x1)∧R1(x′1)∧R2(x2), −1〉
〈{}, Q, −0.5〉
Note that the first WF above can be written in positive form in the same way as in Example 23.

Proposition 6 proved that RLR can model any polynomial decision threshold. Proposition 7 proves the converse of Proposition 6:

Proposition 7. Any decision threshold that can be represented by a positive conjunctive RLR definition of Pr(Q | R1(x1), ..., Rk(xk)) is a polynomial function of the numbers of (tuples of) individuals for which Boolean functions of the parents hold.

Proof. We prove that every WF for Q can only generate a term of such a polynomial. Since RLR sums over these terms, it will always represent a polynomial decision threshold. As in Proposition 6, let G1, G2, ..., Gt represent the Boolean functions of the parents and let n(i) denote the number of individuals for which Gi is True. A positive conjunctive formula in a WF can consist of α1 instances of G1, α2 instances of G2, ..., αt instances of Gt. Based on Proposition 6, we know that this formula is True n(1)^α1 · n(2)^α2 ... n(t)^αt times. The WF may contain more logical variables in its set of logical variables than those appearing in its formula; this, however, only causes the above term to be multiplied by the population sizes of the extra logical variables, generating a term of the polynomial described in Proposition 6. Therefore, each WF can only generate a term of the polynomial, which means that positive conjunctive RLR can only represent the sigmoid of such a polynomial.

3.11 Beyond Polynomial Decision Thresholds

Proposition 7 showed that any conditional probability that can be expressed using a positive conjunctive RLR definition of Pr(Q | R1(x1), ..., Rk(xk)) is the sigmoid of a polynomial of the numbers of True and False individuals in each parent Ri(xi). However, given that decision thresholds are only defined for integral counts, some apparently non-polynomial decision thresholds are equivalent to polynomial ones and so can be modeled using RLR.

Example 25. Suppose we want to model Q ≡ (⌈√nT⌉ < nF). This is a non-polynomial decision threshold, but since nT and nF are integers (and assuming nF ≥ 1), it is equivalent to the polynomial decision threshold nT − (nF − 1)² ≤ 0, which can be formulated using RLR by the following WFs (x′ is typed with the same population as x):
〈{x}, Q∧R(x), −1〉
〈{x,x′}, Q∧¬R(x)∧¬R(x′), 1〉
〈{x}, Q∧¬R(x), −2〉
〈{}, Q, 1.5〉

Example 26. Suppose we want to model Q ≡ (2^nT > 3^nF). This is, however, equivalent to the polynomial form Q ≡ (nT·log 2 − nF·log 3 > 0) and can be formulated in positive conjunctive RLR using the WFs:
〈{x}, Q, −log 3〉
〈{x}, Q∧R(x), log 3 + log 2〉
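A quick numeric sanity check (our own sketch) of Example 25: for nF ≥ 1, the non-polynomial test ⌈√nT⌉ < nF agrees with the sign of the polynomial score produced by the four weighted formulae above.

```python
# The WFs of Example 25 contribute -nT + nF^2 - 2*nF + 1.5 inside the sigmoid,
# which is positive exactly when ceil(sqrt(nT)) < nF (for nF >= 1).
import math

for nT in range(0, 40):
    for nF in range(1, 12):
        score = -nT + nF**2 - 2*nF + 1.5
        assert (score > 0) == (math.ceil(math.sqrt(nT)) < nF)
```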
There are, however, non-polynomial decision thresholds that cannot be converted into polynomial ones, and RLR is not able to formulate them.

Example 27. Suppose we want to model Q ≡ (2^nT > nF). This cannot be converted to a polynomial form, and RLR cannot formulate it.

Finding a parametrization that allows modeling any non-polynomial decision threshold remains an open problem.

3.12 RLR with Multi-valued Child Variables

Definitions 3 and 5 consider Boolean child variables. We can extend these definitions to multi-valued child variables, similar to the way logistic regression is extended. Suppose a multi-valued categorical PRV Q(x), where Q(x) can take k ≥ 2 different values {V1, V2, ..., Vk}, is a child of PRVs {R1(x1), R2(x2), ..., Rm(xm)}. We define k − 1 sets of WFs {wf1, wf2, ..., wf(k−1)}, each containing a (possibly) different number of WFs, where each formula of the WFs in wfl is conjoined with the atom Q = Vl.

Having the above k − 1 sets of WFs, the probability of Q(X) taking each value in its domain, given an assignment Π to the parents, is defined as follows. Let Sl = Σ_{〈L,Q′∧F′,w〉 ∈ comp(wfl,X)} w · Σ_L F′Πθ(Q′,Q(X)). Then:

Pr(Q(X) = Vl | Π) = exp(Sl) / (1 + Σ_{l′=1..k−1} exp(Sl′))    if l < k
Pr(Q(X) = Vk | Π) = 1 / (1 + Σ_{l′=1..k−1} exp(Sl′))    if l = k    (3.11)

Note that Equation (3.11) reduces to the equation in Definition 5 when k = 2. The extension of RLR to continuous child PRVs and continuous parents is left as future work.

Chapter 4

Approximating Other Aggregators Using RLR

We can model other well-known aggregators using positive conjunctive RLR. In most cases, however, this is only an approximation, because many aggregators are deterministic, taking only the values 0 and 1, while the sigmoid function reaches 0 or 1 only in the limit. In order for a sigmoid to produce a 0 or 1 output, we would need an infinitely large number, but we cannot choose infinitely large numbers. We can, however, get arbitrarily close to 0 or 1 by choosing arbitrarily large weights. In the rest of this chapter, we use M to refer to a number which can be set sufficiently large to achieve the desired level of approximation. When R is not Boolean, nval denotes the number of individuals x for which R(x) = val.

4.1 OR

OR is one of the popular aggregators; it is equivalent to the logical existential quantifier (∃). Using OR, the child node is True if there exists at least one assignment of individuals to the logical variables in the parents for which a desired formula holds. In order to model OR in RLR for a PRV Q with parent PRV R(x), we use the WFs:
〈{}, Q, −M〉
〈{x}, Q∧R(x), 2M〉
for which Pr(q | R(x)) = sigmoid(−M + 2M·nT). We can see that if none of the individuals are True (i.e., nT = 0), the value inside the sigmoid is −M, which is negative, and the probability is close to 0. If even one individual is True (i.e., nT ≥ 1), the value inside the sigmoid becomes positive and the probability becomes close to 1. In both cases, the value inside the sigmoid is a linear function of M: increasing M pushes the probability closer to 0 or to 1, and the approximation becomes more accurate.
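A small sketch of this approximation; with M = 10 it reproduces, to six decimal digits, the probabilities reported for Example 28 in Table 4.1 below.

```python
# RLR approximation of OR: Pr(q) = sigmoid(-M + 2*M*nT).
import math

def rlr_or(n_true, M=10.0):
    return 1.0 / (1.0 + math.exp(-(-M + 2.0 * M * n_true)))

for nT in range(4):
    print(nT, f"{rlr_or(nT):.6f}")
# -> 0 0.000045, 1 0.999955, 2 1.000000, 3 1.000000
```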
Example 28. Suppose a group of people live in an apartment building and each of them can set off a fire alarm if they smell smoke. We have a random variable AlarmSounds with parent SetsOff(x) (as in Fig. 4.1), whose conditional probability can be approximated by the following WFs:
〈{}, AlarmSounds, −10〉
〈{x}, AlarmSounds∧SetsOff(x), 20〉
where we chose M = 10. We can see in Table 4.1 how close the approximation is for this value of M.

Figure 4.1: A relational model representing the evacuation scenario of a building when a member of the building sets off a smoke alarm. The conditional probability of the PRV AlarmSounds in this model should be represented using the aggregation operator OR, and the conditional probability of the PRV Evacuated should be represented using the aggregation operator AND.

Table 4.1: The probability of the random variable AlarmSounds as a function of the number num of people who set off the alarm, for the weighted formulae in Example 28, with an accuracy of six digits after the decimal point.

num          0         1         2   3   4   > 4
Probability  0.000045  0.999955  1   1   1   1

4.2 AND

AND is equivalent to the logical universal quantifier (∀) and can be modeled similarly to OR. In order to model AND in RLR for a PRV Q with parent PRV R(x), we use the WFs:
〈{}, Q, M〉
〈{x}, Q∧¬R(x), −2M〉
or equivalently with the following WFs having only positive conjunctive formulae:
〈{}, Q, M〉
〈{x}, Q, −2M〉
〈{x}, Q∧R(x), 2M〉
for which Pr(q | R(x)) = sigmoid(M − 2M·nF). When nF = 0, the value inside the sigmoid is M > 0, so the probability is close to 1. When nF ≥ 1, the value inside the sigmoid becomes negative and the probability becomes close to 0. As with OR, accuracy increases with M.

Example 29. In Example 28, after hearing the alarm sound, people start to leave the building, and the building is evacuated if all people have left (see Fig. 4.1). The conditional probability of the PRV Evacuated can then be approximated using the following WFs:
〈{}, Evacuated, M〉
〈{x}, Evacuated, −2M〉
〈{x}, Evacuated∧Leave(x), 2M〉

Having the AND and OR aggregators (i.e., universal and existential quantifiers), we can model any other Boolean formula of the individuals.

4.3 Noisy-OR and Noisy-AND

The previous two sections showed how we can use RLR to approximate the deterministic OR and AND aggregators with a probabilistic model. Now we consider modeling noisy-OR and noisy-AND using RLR.

Figure 4.2 shows how noisy-OR and noisy-AND can be modeled for the network in Figure 3.1. In this figure, R(x) represents the values of the individuals being combined and N(x) represents the noise probability. For noisy-OR, S(x) ≡ R(x)∧N(x), and Q is the OR aggregator of S(x). For noisy-AND, S(x) ≡ R(x)∨N(x), and Q is the AND aggregator of S(x). Note that the noise probability can be different for each of the individuals; we cannot model noisy-OR and noisy-AND by just using the models for deterministic OR and AND, where each individual has the same effect, without adding extra PRVs to the model.

Figure 4.2: On the left is a relational model for a child node with a single parent having an extra logical variable. On the right is a relational model representing the changes to be made to the model on the left for defining a noisy-OR or noisy-AND conditional probability for the child node using RLR.
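Since S(x) ≡ R(x)∧N(x) and Q is the OR of the S(x), marginalizing out independent noise variables N(x) gives the familiar closed form for noisy-OR. The sketch below computes it; the per-individual noise probabilities are invented for illustration.

```python
# Noisy-OR via the construction of Fig. 4.2, with Pr(N(A_i)) = p_i independent:
# Pr(Q=True | R) = 1 - prod over True parents of (1 - p_i).
def noisy_or(r_values, noise_probs):
    pr_q_false = 1.0
    for r, p in zip(r_values, noise_probs):
        if r:                       # only True parents can activate S(x)
            pr_q_false *= (1.0 - p)
    return 1.0 - pr_q_false

print(noisy_or([True, True, False], [0.8, 0.5, 0.9]))  # 1 - 0.2*0.5 = 0.9
```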
4.4 Mean

We show how we can model mean > t, i.e., "Q is True if mean(R(x)) > t", using the following WFs (val and t are numeric constants; the second WF is repeated for each val ∈ range(R), so we have a total of 1 + r WFs, where r is the size of range(R)):
〈{}, Q, −M〉
〈{x}, Q∧R(x) = val, M²(val − t)〉
for which

Pr(q | R(x)) = sigmoid(−M + M²·Σ_{val ∈ range(R)} nval·(val − t))
             = sigmoid(−M + M²·(Σ_{val ∈ range(R)} nval·val − t·Σ_{val ∈ range(R)} nval))
             = sigmoid(−M + M²·(sum − n·t))

where n = |x| and sum represents the sum of the values of the individuals. When mean = sum/n > t, the value inside the sigmoid is positive and the probability is close to 1. Otherwise, the value inside the sigmoid is negative and the probability is close to 0. For this case, M should be chosen large enough that M·|sum − n·t| > 1 for the smallest attainable non-zero value of |sum − n·t|; otherwise it may occur that sum − n·t > 0 but M²(sum − n·t) ≤ M, which makes the sigmoid produce a number close to 0. Note that this lower bound for M depends on the accuracy of the computations, not on the population size. Another option is to use the following WFs (the second WF is again repeated for each val ∈ range(R)):
〈{}, Q, −ε〉
〈{x}, Q∧R(x) = val, M(val − t)〉
where ε is a tiny number ensuring the value inside the sigmoid is negative when mean = t. ε can be set according to the accuracy of the system, or can be set to zero if a probability of 0.5 is acceptable when mean = t. Note that in both cases the number of required WFs grows with the number of values that the parent can take.

Example 30. Suppose we have a set of movies and the ratings people gave to these movies in a star-rating system. We define a movie to be popular if the average of its ratings is more than 3.5. In this case, we have a parametrized random variable Popular(m) which is a child of a parametrized random variable Rate(p,m). The following weighted formulae can approximate the conditional dependence of Popular(m) on its parent (the second WF is repeated 5 times, for the values i ∈ {1,2,3,4,5}):
〈{}, Popular(m), −ε〉
〈{p}, Popular(m)∧Rate(p,m) = i, i − 3.5〉
RLR sums over the above weighted formulae and takes the sigmoid, resulting in:

Pr(Popular(m) = True | Π) = sigmoid(−ε + sum − 3.5·n)

where sum denotes the sum of the ratings and n represents the number of ratings for this movie. The value inside the sigmoid is positive if mean = sum/n > 3.5 and negative otherwise (ε is used to ensure that the value inside the sigmoid is negative when mean = 3.5).
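A sketch of Example 30's aggregator with the ε-parametrization (the ratings below are invented for illustration). Note that without the M factor the probability approaches 0 or 1 only as the mean moves away from the threshold:

```python
# Example 30: sigmoid(-eps + sum_i (rating_i - 3.5)) = sigmoid(-eps + sum - 3.5*n).
import math

def pr_popular(ratings, t=3.5, eps=1e-3):
    s = -eps + sum(r - t for r in ratings)
    return 1.0 / (1.0 + math.exp(-s))

print(pr_popular([5, 4, 4, 3]))   # mean 4.0 > 3.5 -> about 0.88 (> 0.5)
print(pr_popular([2, 3, 3, 4]))   # mean 3.0 < 3.5 -> about 0.12 (< 0.5)
```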
4.5 More-than-t Trues

More-than-t Trues can be considered an extended version of OR; when t = 0, these two aggregators are equivalent. More-than-t Trues, corresponding to "Q is True if R is True for more than t individuals", can be modeled using the WFs:
〈{}, Q, −2Mt − M〉
〈{x}, Q∧R(x), 2M〉
giving Pr(q | R(x)) = sigmoid(−2Mt − M + 2M·nT); the value inside the sigmoid is positive iff nT > t. The number of WFs required is fixed.

Example 31. In Example 30, suppose that instead of rating movies, users can only like the movies they enjoyed watching, and a popular movie is defined as one having at least 50 likes. The following WFs can be used to approximate this dependency:
〈{}, Popular(m), −99〉
〈{p}, Popular(m)∧Liked(p,m), 2〉

4.6 More-than-t% Trues

More-than-t% Trues, corresponding to "Q is True if R is True for more than t percent of the individuals", is a special case of the aggregator "mean > t/100" when we treat False values as 0 and True values as 1. This directly provides the WFs:
〈{}, Q, −M〉
〈{x}, Q∧¬R(x), M²(0 − t/100)〉
〈{x}, Q∧R(x), M²(1 − t/100)〉
while requiring M > |1/(nT − n·t/100)| for the smallest attainable non-zero value of nT − n·t/100, where n is the population size of x. Note that we can use Proposition 3 to replace the second WF with two WFs having positive conjunctive formulae. Unlike the aggregator "mean > ...", the number of WFs here is fixed, because the parent can take only two different values.

Example 32. Consider Example 31, and suppose people like or dislike a movie after they watch it. Also suppose a movie is defined to be popular if more than 70% of the people liked it. This can be approximated by the More-than-t% Trues aggregator using WFs similar to those used in Example 30 with t = 70/100.

4.7 Max

First, we show how we can model the case where Q is True if the max of R(x) is greater than t. Then we propose a way of having a random variable whose value is the maximum value of the individuals in R.

The following WFs are used for modeling max > t, corresponding to "Q is True if max(R(x)) > t", in RLR (the second WF is repeated for each val ∈ range(R(x)) with val > t):
〈{}, Q, −M〉
〈{x}, Q∧R(x) = val, 2M〉
thus Pr(q | R(x)) = sigmoid(−M + 2M·Σ_{val > t ∈ range(R)} nval). The value inside the sigmoid is positive iff there is an individual having a value greater than t (i.e., ∃ val > t ∈ range(R) : nval > 0). Note that the number of WFs required grows with the number of values greater than t that the parent can take.

Now suppose we want Q to be the maximum value of the individuals in R(x). For binary parents, the "max" aggregator is identical to "OR". Otherwise, range(Q) = range(R(x)). Let r represent the size of range(R(x)) and t1, t2, ..., tr represent the values in range(R(x)). This "max" aggregator can be modeled using a 2-level structure as in Figure 4.3. First, for every ti ∈ range(R(x)), we create a separate "max ≥ ti" aggregator using RLR (as described above), with R(x) as its parents. This can be viewed in the middle level of Figure 4.3, where there are r intermediate random variables, each representing whether the max is greater than or equal to a ti ∈ range(R(x)). Then we define the child Q, with all the "max ≥ ti" aggregators as its parents; Q can compute max(R(x)) given its parents. Note that while r may be arbitrarily large, it has a fixed size and does not change with the population size, hence it is possible to use non-relational constructs (e.g., a table) for its implementation.

Figure 4.3: On the left is a relational model for a child node with a single parent having an extra logical variable. On the right is a relational model representing the changes to be made to the model on the left for defining a conditional probability representing the aggregator max for the child node using RLR.

Example 33. As in Example 30, suppose we have a set of users denoted by the logical variable p, a set of movies denoted by the logical variable m, and the ratings of the users for the movies. Suppose we want to have a PRV MaxRate(m) representing the maximum rating of each movie. First, we define 5 PRVs MGE1(m), MGE2(m), ..., MGE5(m), where MGEi(m) represents whether the maximum rating of the movie is greater than or equal to i. The conditional probability of each of these PRVs can be represented by RLR. As an example, we define the WFs for MGE3(m) as follows (the WFs for the other PRVs are similar):
〈{}, MGE3(m), −10〉
〈{p}, MGE3(m)∧Rate(p,m) = 3, 20〉
〈{p}, MGE3(m)∧Rate(p,m) = 4, 20〉
〈{p}, MGE3(m)∧Rate(p,m) = 5, 20〉
Then the conditional probability of MaxRate(m) given its parents can be defined as in Table 4.2. We assume the maximum rating of a movie for which we don't have any ratings is zero.

Table 4.2: Conditional probability table for the PRV MaxRate(m) representing the maximum rating of different movies. For simplicity, we just show the desired value of the child node (the one having a probability of 1) instead of the probabilities of each value it can take.

MGE1(m)  MGE2(m)  MGE3(m)  MGE4(m)  MGE5(m)  Value
True     True     True     True     True     5
True     True     True     True     False    4
...      ...      ...      ...      ...      ...
True     False    False    False    False    1
...      ...      ...      ...      ...      ...
False    False    False    False    False    0
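A compact sketch of the two-level construction of Fig. 4.3 and Table 4.2: RLR-style "max ≥ ti" indicators in the middle layer, and a deterministic top layer reading off the largest value whose indicator fires. The ratings are toy data, and the empty-movie case follows the zero convention above.

```python
# Middle layer: sigmoid(-M + 2*M * #{i : rating_i >= t}) per value t.
import math

def pr_max_ge(ratings, t, M=10.0):
    n_ge = sum(1 for r in ratings if r >= t)
    return 1.0 / (1.0 + math.exp(-(-M + 2.0 * M * n_ge)))

def max_rate(ratings, values=(1, 2, 3, 4, 5)):
    """Top layer (Table 4.2): largest t whose MGE_t indicator is (about) True."""
    fired = [t for t in values if pr_max_ge(ratings, t) > 0.5]
    return max(fired) if fired else 0       # 0 when there are no ratings

print(max_rate([3, 1, 4]))   # 4
print(max_rate([]))          # 0
```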
4.8 Mode

First, we show how we can model the case where Q is True if the mode of R(x) is equal to t. Then we propose a way of having a random variable whose value is the mode of the values of the individuals in R.

To model mode = t, corresponding to "Q is True if mode(R(x)) = t", we first add another PRV S(y) to the network, as in Fig. 4.4, where the population of y is the range of R(x). Then, for each individual Y of y, we use the following WFs, for which Pr(s(Y) | R(x)) = sigmoid(M − 2M(nY − nt)); the value inside the sigmoid is positive iff the value t occurs at least as often as the value Y in the population of R (i.e., nt ≥ nY). Note that the number of WFs required grows with the number of values that the parent can take.
〈{}, S(Y), M〉
〈{x}, S(Y)∧R(x) = Y, −2M〉
〈{x}, S(Y)∧R(x) = t, 2M〉
Then Q must be True if S(y) is True for all individuals in y. This is because a False value for an individual Y of S means that the value Y occurs more often than t, so t is not the mode. Therefore, we can use WFs similar to the ones we used for AND:
〈{}, Q, M〉
〈{y}, Q, −2M〉
〈{y}, Q∧S(y), 2M〉

Figure 4.4: On the left is a relational model for a child node with a single parent having an extra logical variable. On the right is a relational model representing the changes to be made to the model on the left for defining a conditional probability representing the aggregator mode = t for the child node using RLR.

Now suppose we want the value of Q to be the mode of the values of the individuals in R(x). For binary parents, the "mode" aggregator is also called "majority", and can be modeled with the "more-than-t% Trues" aggregator, with t = 50. Otherwise, range(Q) = range(R(x)), and we can use the same approach as for "max", by having a middle layer with r separate "mode = t" aggregators, where r represents the size of range(R(x)), and with Q as their child.

Example 34. Suppose that in Example 30 a movie is popular if the mode of its ratings is 5. In order to model this in RLR, we create a network where Rate(p,m) is the parent of S(r,m), where r represents the ratings and belongs to {1,2,3,4,5}, and S(r,m) is the parent of Popular(m). The following WFs should be used to approximate the conditional probability of S(R,m) (where R is an instance of r):
〈{}, S(R,m), 10〉
〈{p}, S(R,m)∧Rate(p,m) = R, −20〉
〈{p}, S(R,m)∧Rate(p,m) = 5, 20〉
and the following WFs should be used to approximate the conditional probability of Popular(m):
〈{}, Popular(m), 10〉
〈{r}, Popular(m), −20〉
〈{r}, Popular(m)∧S(r,m), 20〉

4.9 Aggregators Not Represented by RLR

In the previous sections of this chapter, we discussed how a series of well-known aggregators can be represented using RLR. There are, however, other well-known aggregators, such as median > t, that we could not model using our RLR. There are also other aggregators whose output is a continuous value; mean and median are two such aggregators. Since we only considered child nodes having Boolean or multi-valued ranges, our RLR cannot model such aggregators. Furthermore, we did not consider parents with continuous values, in which case the max aggregator also has a continuous output. Future work includes extending RLR to continuous child and parent nodes, and discussing whether such aggregators can be represented using the extended RLR or not.

Chapter 5

Conclusion

Today's data and models are complex, composed of objects and relations, and noisy. Hence it is not surprising that relational probabilistic knowledge representation currently receives a lot of attention.
However, relational probabilistic modeling is not an easy task and raises several novel issues when it comes to knowledge representation:
• What assumptions are we making? Why should we choose one representation over another?
• We may learn a model for some population size(s), and want to apply it to other population sizes. We want to make assumptions explicit and know the consequences of these assumptions.

In this work, we provided answers to these questions for the case of the logistic regression model. The introduction of the relational logistic regression (RLR) family from first principles is already a major contribution. Based on it, we have investigated the dependence on population size for different variants and have demonstrated that already for simple and well-understood (at the non-relational level) models, there are complex interactions of the parameters with the population size.

The major contributions of this work can be summarized as follows:
• Introducing a relational version of logistic regression (RLR).
• A brief comparison of the proposed RLR with MLNs.
• Defining canonical forms of representation for RLR.
• Proving which classes of decision thresholds can and cannot be modeled using RLR.
• Approximating other well-known aggregators using RLR.

Future work includes:
• extending the current version of RLR to continuous child and parent PRVs, and discussing the decision thresholds and aggregators that can and cannot be represented using the extended RLR,
• developing a lifted inference algorithm, or extending existing algorithms, for relational Bayesian networks to allow for RLR conditional probabilities,
• learning the structure and the parameters of an RLR conditional probability for a PRV from data,
• and understanding the relationship to other models, such as undirected models like MLNs. Exploring this direction is important since determining which models to use is more than fitting the models to data; we need to understand what we are representing.

Bibliography

[1] Hendrik Blockeel. Statistical relational learning. Handbook on Neural Information Processing, pages 241–281, 2013.
[2] Hendrik Blockeel and Luc De Raedt. Top-down induction of first-order logical decision trees. Artificial Intelligence, 101(1):285–297, 1998.
[3] David Buchman, Mark Schmidt, Shakir Mohamed, David Poole, and Nando de Freitas. On sparse, spectral and other parameterizations of binary probabilistic models. In AISTATS 2012, 15th International Conference on Artificial Intelligence and Statistics, 2012.
[4] Wray L. Buntine. Operations for learning with graphical models. arXiv preprint cs/9412102, 1994.
[5] Saskia Le Cessie and J. C. van Houwelingen. Ridge estimators in logistic regression. Applied Statistics, 41(1):191–201, 1992.
[6] Chin-Liang Chang and Richard Char-Tung Lee. Symbolic Logic and Mechanical Theorem Proving. Academic Press, New York, 1973.
[7] George Danezis and Prateek Mittal. SybilInfer: Detecting sybil nodes using social networks. In NDSS, 2009.
[8] Francisco J. Diez. Parameter adjustment in Bayes networks. The generalized noisy OR-gate. In Proc. Ninth UAI, pages 99–105, 1993.
[9] Matthew Dirks, Andrew Csinger, Andrew Bamber, and David Poole. Reasoning and inference for a relational open-world influence diagram applied to a real-time geological domain. In Proc. AAAI-2014 Statistical Relational AI Workshop, 2014.
[10] Pedro Domingos, Stanley Kok, Daniel Lowd, Hoifung Poon, Matthew Richardson, and Parag Singla. Markov logic. In L. De Raedt, P. Frasconi, K. Kersting, and S. Muggleton, editors, Probabilistic Inductive Logic Programming, pages 92–117. Springer, New York, 2008.
[11] Nir Friedman, Lisa Getoor, Daphne Koller, and Avi Pfeffer. Learning probabilistic relational models. In Proc. of the Sixteenth International Joint Conference on Artificial Intelligence, pages 1300–1307. Morgan Kaufmann, 1999.
[12] Michael R. Genesereth and Nils J. Nilsson. Logical Foundations of Artificial Intelligence. Morgan Kaufmann, Los Altos, CA, 1987.
[13] Lisa Getoor and Mehran Sahami. Using probabilistic relational models for collaborative filtering. In Workshop on Web Usage Analysis and User Profiling (WEBKDD'99), 1999.
[14] Lisa Getoor and Ben Taskar. Introduction to Statistical Relational Learning. MIT Press, Cambridge, MA, 2007.
[15] Carlos Guestrin, Shobha Venkataraman, and Daphne Koller. Context-specific multiagent coordination and planning with factored MDPs. In AAAI/IAAI, 2002.
[16] Michael Horsch and David Poole. A dynamic approach to probability inference using Bayesian networks. In Proc. Sixth Conference on Uncertainty in AI, pages 155–161, 1990.
[17] Zan Huang, D. Zeng, and Hsinchun Chen. A unified recommendation framework based on probabilistic relational models. In Fourteenth Annual Workshop on Information Technologies and Systems (WITS), 2004.
[18] Tuyen N. Huynh and Raymond J. Mooney. Discriminative structure and parameter learning for Markov logic networks. In Proc. of the International Conference on Machine Learning, 2008.
[19] Manfred Jaeger. Relational Bayesian networks. In Proc. of the Thirteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1997.
[20] Dominik Jain, Andreas Barthels, and Michael Beetz. Adaptive Markov logic networks: Learning statistical relational models with dynamic parameters. In 9th European Conference on Artificial Intelligence (ECAI), pages 937–942, 2009.
[21] Dominik Jain, Bernhard Kirchlechner, and Michael Beetz. Extending Markov logic to model probability distributions in relational domains. In KI, pages 129–143, 2007.
[22] Seyed Mehran Kazemi, David Buchman, Kristian Kersting, Sriraam Natarajan, and David Poole. Relational logistic regression. In Proc. 14th International Conference on Principles of Knowledge Representation and Reasoning (KR), 2014.
[23] Seyed Mehran Kazemi, David Buchman, Kristian Kersting, Sriraam Natarajan, and David Poole. Relational logistic regression: the directed analog of Markov logic networks. In Proc. AAAI-2014 Statistical Relational AI Workshop, 2014.
[24] Seyed Mehran Kazemi and David Poole. Elimination ordering in first-order probabilistic inference. In Proc. of the Association for the Advancement of Artificial Intelligence (AAAI), 2014.
[25] Jacek Kisynski and David Poole. Lifted aggregation in directed first-order probabilistic models. In Twenty-first International Joint Conference on Artificial Intelligence, pages 1922–1929, 2009.
[26] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009.
[27] Kevin B. Korb and Ann E. Nicholson. Bayesian Artificial Intelligence. CRC Press, 2003.
[28] Nada Lavrac and Saso Dzeroski. Inductive logic programming. WLP, pages 146–160, 1994.
[29] Peter McCullagh. Generalized linear models. European Journal of Operational Research, 16(3):285–292, 1984.
[30] Brian Milch, Luke S. Zettlemoyer, Kristian Kersting, Michael Haimes, and Leslie Pack Kaelbling. Lifted probabilistic inference with counting formulae. In Proceedings of the Twenty-Third Conference on Advances in Artificial Intelligence (AAAI), pages 1062–1068, 2008.
In Proceedings of the Twenty Third Conference on Advances in ArtificialIntelligence (AAAI), pages 1062–1068, 2008.[31] Tim Mitchell. Generative and discriminative classifiers: naive Bayes and lo-gistic regression. http://www.cs.cmu.edu/ tom/mlbook/NBayesLogReg.pdf,2010.[32] Tom Mitchell. Machine Learning. McGraw Hill, 1997.[33] Stephen Muggleton. Stochastic logic programs. Advances in inductive logicprogramming, 32:254–264, 1999.55Bibliography[34] Sriraam Natarajan, Tushar Khot, Daniel Lowd, Prasad Tadepalli, and KristianKersting. Exploiting causal independence in Markov logic networks: Com-bining undirected and directed models. In European Conference on MachineLearning (ECML), 2010.[35] Radford M. Neal. Connectionist learning of belief networks. Artificial intel-ligence, 56(1):71–113, 1992.[36] Jennifer Neville and David Jensen. Relational dependency networks. TheJournal of Machine Learning Research, 8:653–692, 2007.[37] Jennifer Neville, David Jensen, Lisa Friedland, and Michael Hay. Learningrelational probability trees. In Proc. of the ninth ACM SIGKDD internationalconference on Knowledge discovery and data mining. ACM, 2003.[38] Jennifer Neville, Ozgur Simsek, David Jensen, John Komoroske, KellyPalmer, and Henry Goldberg. Using relational knowledge discovery to pre-vent securities fraud. In Proceedings of the 11th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining. MIT Press, 2005.[39] Hanna Pasula and Stuart Russell. Approximate inference for first-order prob-abilistic languages. In Proc. of the Seventeenth International Joint Confer-ence on Artificial Intelligence, pages 741–748. Seattle: Morgan Kaufmann,2001.[40] Judea Pearl. Fusion, propagation and structuring in belief networks. ArtificialIntelligence, 29(3):241–288, 1986.[41] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks ofPlausible Inference. Morgan Kaumann, San Mateo, CA, 1988.[42] Claudia Perlish and Foster Provost. Distribution-based aggregation for re-lational learning with identifier attributes. Machine Learning, 62:65–105,2006.[43] David Poole. The independent choice logic and beyond. Probabilistic induc-tive logic programming.[44] David Poole. Probabilistic Horn abduction and Bayesian networks. ArtificialIntelligence, 64:81–129, 1993.[45] David Poole. First-order probabilistic inference. In Proceedings of the 18thInternational Joint Conference on Artificial Intelligence (IJCAI-03), pages985–991, Acapulco, 2003.56Bibliography[46] David Poole, Fahiem Bacchus, and Jacek Kisynski. Towards completelylifted search-based probabilistic inference. arXiv:1107.4035 [cs.AI], 2011.[47] David Poole, David Buchman, Seyed Mehran Kazemi, Kristian Kersting, andSriraam Natarajan. Population size extrapolation in relational probabilisticmodelling. In Proc. of the Eighth International Conference on Scalable Un-certainty Management, 2014.[48] David Poole, David Buchman, Sriraam Natarajan, and Kristian Kersting.Aggregation and population growth: The relational logistic regression andMarkov logic cases. In Proc. UAI-2012 Workshop on Statistical RelationalAI, 2012.[49] David Poole and Alan K. Mackworth. Artificial Intelligence: foundations ofcomputational agents. Cambridge University Press, 2010.[50] David Poole and Nevin L. Zhang. Exploiting contextual independence inprobabilistic inference. Journal of Artificial Intelligence Research, 18:263–313, 2003.[51] Alexandrin Popescul, Lyle H. Ungar, Steve Lawrence, and David M. Pen-nock. 
Towards structural logistic regression: Combining relational and sta-tistical learning. In KDD Workshop on Multi-Relational Data Mining, 2002.[52] Matthew Richardson and Pedro Domingos. Markov logic networks. MachineLearning, 62:107–136, 2006.[53] Lawrence K. Saul, Tommi Jaakkola, and Michael I. Jordan. Mean field theoryfor sigmoid belief networks. arXiv preprint cs/9603102, 1996.[54] Raymond M. Smullyan. First-order logic. Vol. 6. Berlin: Springer-Verlag,1968.[55] Teodor Sommestad, Mathias Ekstedt, and Pontus Johnson. A probabilistic re-lational model for security risk analysis. Computers and Security, 29(6):659–679, 2010.[56] Ben Taskar, Pieter Abbeel, and Daphne Koller. Discriminative probabilisticmodels for relational data. In Proc. of the Eighteenth conference on Uncer-tainty in artificial intelligence, 2002.[57] Benjamin Taskar, Eran Segal, and Daphne Koller. Probabilistic classifica-tion and clustering in relational data. In International Joint Conference onArtificial Intelligence, volume 17, 2001.57Bibliography[58] Michael P. Wellman, John S. Breese, and Robert P. Goldman. Fromknowledge bases to decision models. The Knowledge Engineering Review,7(01):35–53, 1992.[59] Haifeng Yu, Michael Kaminsky, Phillip B. Gibbson, and Abraham Flaxman.Sybilguard: defending against sybil attacks via social networks. ACM SIG-COMM Computer Communication Review, 36(4):267–278, 2006.[60] Nevin L. Zhang and David Poole. On the role of context-specific indepen-dence in probabilistic reasoning. pages 1288–1293, 1999.58