Representing and Learning Relations and Properties Under Uncertainty

by

Seyed Mehran Kazemi

B.Sc., Amirkabir University of Technology, 2012
M.Sc., The University of British Columbia, 2014

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

December 2018

© Seyed Mehran Kazemi 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Representing and Learning Relations and Properties Under Uncertainty

submitted by Seyed Mehran Kazemi in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science.

Examining Committee:

David Poole, The University of British Columbia
Supervisor

Mark Schmidt, The University of British Columbia
Supervisory Committee Member

Guy Van den Broeck, University of California Los Angeles
Supervisory Committee Member

Frank Wood, University of British Columbia
University Examiner

Matias Salibian-Barrera, University of British Columbia
University Examiner

Daniel Lowd, University of Oregon
External Examiner

Abstract

The world around us is composed of entities, each having various properties and participating in relationships with other entities. Consequently, data is often inherently relational. This dissertation studies probabilistic relational representations, reasoning, and learning, with a focus on three common prediction problems for relational data: link prediction, property prediction, and joint prediction. For link prediction, we develop a tensor factorization model called SimplE which is simple, interpretable, fully expressive, and able to incorporate certain types of domain expert knowledge. On two standard benchmarks for knowledge graph completion, we show that SimplE outperforms the state-of-the-art models.
For property prediction, we first study the limitations of existing StaRAI models when used for property prediction. Based on this study, we develop relational neural networks, which combine ideas from lifted relational models with deep learning and perform well empirically. We base joint prediction on lifted relational models, for which parameter learning typically requires inference over a highly-connected graphical model; the inference step is usually the bottleneck for learning. We study a class of inference algorithms known as lifted inference, which makes inference tractable by exploiting both conditional independence and symmetries. We study two ways of speeding up lifted inference algorithms: 1) proposing heuristics for elimination ordering, and 2) compiling the lifted operations to low-level languages. We also expand the largest known class of models for which efficient lifted inference is known to be possible. Thus, structure learning algorithms for lifted relational models that restrict the search space to models for which efficient inference algorithms exist can perform their search over a larger space.

Lay Summary

Oftentimes, one needs to predict some property of a person, company, movie, or some other object in the world, or predict some relationship between two objects. For instance, a bookseller may want to predict whether Alice will like Predictably Irrational, and an investor may want to predict the future revenue of a company. This research develops and facilitates computational techniques and representations that use available data about objects to make predictions about unknown object properties and relationships.
In particular, it contributes to three prediction problems: link prediction (e.g., predicting if Chris is a good fit for a job at Google), property prediction (e.g., predicting the revenue of a company), and joint prediction (e.g., predicting if all members of a family will like the movie Zootopia).

Preface

All of the work presented henceforth was conducted in the Statistical Relational Artificial Intelligence (StaRAI) group in the Laboratory for Computational Intelligence (LCI) at the Department of Computer Science, University of British Columbia (UBC), Point Grey campus. This research was conducted under the supervision of Professor David Poole (DP). Other members of the Ph.D. supervisory committee include Professor Mark Schmidt (MS) and Professor Guy Van den Broeck (GVDB).

In what follows, the contributions of all authors to each publication resulting from this research are explained in detail.

1. Kazemi, S. M. and Poole, D. (2018). SimplE Embedding for Link Prediction in Knowledge Graphs, In Advances in Neural Information Processing Systems (NeurIPS).

This work started with a presentation I gave in the StaRAI reading group in our department, where I explained different tensor factorization models for link prediction in knowledge graphs. One of these models was canonical polyadic decomposition. This model learns two independent embedding vectors per entity and shows weak performance due to this independence. After the reading group, DP and I discussed potential ways to relax this independence. A few days later, I hypothesized that it could be done by using the inverses of relations. I implemented SimplE, ran experiments and showed that it performs well empirically, proved that it is fully expressive, showed that domain expert knowledge can be incorporated into it, wrote an initial draft of the paper, and edited it based on DP's comments. DP proposed a second proof for the full expressiveness of SimplE which resulted in a different bound on the size of the embeddings than my proof.
We included both proofs in the paper.

2. Kazemi, S. M. and Poole, D. (2018). RelNN: A Deep Neural Model for Relational Learning, In 32nd AAAI Conference on Artificial Intelligence.

The initial idea of stacking multiple relational logistic regression layers to build RelNNs came from me. We then matured this idea in our weekly meetings with DP. I then implemented it, designed and ran experiments to see how it performs and how it compares to other relational learners for property prediction, and wrote an initial draft of the paper. DP gave me valuable feedback on the draft and I edited it accordingly.

3. Kazemi, S. M. and Poole, D. (2018). Bridging Weighted Rules and Graph Random Walks for Statistical Relational Models, In Frontiers in Robotics and AI, Vol 5, No. 8.

As I was reading and learning about different categories of models for relational learning, I realized that while graph random walks and lifted relational models are considered separate categories and studied by separate researchers, there is a close relationship between them. I followed my curiosity and dove deeper into these two categories to see the similarities and differences between them. I realized, and mathematically proved, that with a slight preprocessing of relation values, the path-ranking algorithm (a well-known model in the graph random walk category) is a strict subset of relational logistic regression (a recent development in lifted relational models). I wrote down my proof and prepared a draft of the paper. I sent it to DP for feedback, revised it accordingly, and submitted it for publication.

4. Kazemi, S. M., Fatemi, B., Kim, A., Peng, Z., Tora, M. R., Zeng, X., Dirks, M. and Poole, D. (2017). Comparing Aggregators for Relational Probabilistic Models, In UAI Workshop on Statistical Relational AI.

In 2017, DP offered a topics course on Statistical Relational AI.
As a course project, he asked all the students taking the course to work on a single project, namely comparing relational models for property prediction. Even though I had not taken the course, I acted as a teaching assistant, helping the students conduct the experiments. I also tested some existing models and designed some new variants myself. The initial draft of the paper was, for the most part, written by DP and me. Then everyone edited the paper and/or sent us comments for editing it. I was selected as the lead author of the paper due to my efforts leading the other students, contributing to the experiments, and helping write the paper.

5. Kazemi, S. M., Kimmig, A., Van den Broeck, G. and Poole, D. (2016). New Liftable Classes for First-Order Probabilistic Inference, In Advances in Neural Information Processing Systems (NeurIPS).

6. Kazemi, S. M., Kimmig, A., Van den Broeck, G. and Poole, D. (2017). Domain Recursion for Lifted Inference with Existential Quantifiers, In UAI Workshop on Statistical Relational AI.

In January 2016, I found out that domain recursion, a lifted inference rule proposed by GVDB in 2012, is more powerful than expected and can solve a theory left as an open problem in [14]. In February 2016, I discussed my finding with GVDB and found out that he and his colleague, Angelika Kimmig (AK), had made the same discovery. They had not published it as they thought it was not enough for a paper. GVDB suggested that we (GVDB, AK, DP, and me) join forces and work on this problem as a team.

Getting some clues from GVDB and DP on how this work could be advanced, I found several other theories and characterized two large classes of theories for which domain recursion offers polynomial-time inference. I wrote an initial draft for NeurIPS-16.
It was then proofread and edited by my co-authors.

After our NeurIPS submission was accepted, I continued working on this project and discovered that domain recursion can also be applied to some theories with existential quantifiers at a lower cost than previous approaches. I discussed my findings with DP, GVDB, and AK and we published these findings in StaRAI-17.

7. Kazemi, S. M. and Poole, D. (2016). Knowledge Compilation for Lifted Probabilistic Inference: Compiling to a Low-level Language, In International Conference on Principles of Knowledge Representation and Reasoning (KR).

8. Kazemi, S. M. and Poole, D. (2016). Why is Compiling Lifted Inference into a Low-Level Language so Effective?, In IJCAI Workshop on Statistical Relational AI.

The initial idea of this work (compiling to low-level languages rather than compiling to data structures) came from DP. He believed compiling to low-level languages might speed up lifted inference, but did not have a practical or theoretical justification for it. In our weekly meetings, he told me about this idea and I designed and implemented a compiler (the L2C compiler) which received a lifted relational model as input and generated a C++ program as output which, once run, produced the desired results. I faced several problems and issues when designing and implementing the compiler; we addressed all of them in our weekly meetings. I ran experiments and compared L2C with other lifted inference engines, especially the WFOMC engine, which compiled to data structures instead of C++ programs. Then I wrote the initial draft of the KR paper and edited it based on DP's feedback.

After the KR paper was published, I found a theory on why compiling to low-level languages may be more effective than compiling to data structures. I designed experiments and validated my theory empirically. Then I wrote the draft for the StaRAI paper and edited it based on DP's feedback and comments.

9. Kazemi, S. M. and Poole, D. (2014).
Elimination Ordering in Lifted First-Order Probabilistic Inference, In 28th AAAI Conference on Artificial Intelligence.

When I took DP's advanced artificial intelligence course, I decided, for my course project, to design an elimination ordering heuristic for lifted probabilistic inference. I designed a heuristic and compared it to min-fill, one of the heuristics widely used for traditional inference algorithms. My results indicated that my heuristic can beat min-fill on a few benchmarks for lifted inference. After the course ended, DP encouraged me to design more heuristics, compare my heuristics and the existing heuristics over more benchmarks, and design new experiments to compare other aspects of all these heuristics. Under DP's supervision, I designed two more heuristics and compared them over more benchmarks and on two different tasks. Then I wrote the initial draft of the paper. DP gave me valuable feedback for editing the paper.

Papers corresponding to each chapter: Chapter 4 is entirely based on publication number 1. Chapter 5 is based on publications number 2, 3, and 4. Finally, Chapter 6 is based on publications number 6, 7, 8, and 9.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Software
Acknowledgements
Dedication

1 Introduction
  1.1 What is relational data?
    1.1.1 Relational data as a relational database
    1.1.2 Relational data as a set of triples
    1.1.3 Relational data as a labeled multidigraph
  1.2 Why study relational data?
  1.3 Why relational representation/learning?
    1.3.1 An example showing the limitations of the independence assumption
    1.3.2 An example showing the limitations of not learning about specific entities
    1.3.3 The aggregation problem
  1.4 Relational prediction problems
    1.4.1 Link prediction
    1.4.2 Property prediction
    1.4.3 Joint prediction
    1.4.4 Existence prediction
    1.4.5 Record linkage
    1.4.6 Community detection
    1.4.7 Problems studied in this thesis
  1.5 Statistical relational artificial intelligence
  1.6 An overview of the contributions of this dissertation
    1.6.1 Contributions to property prediction
    1.6.2 Contributions to link prediction
    1.6.3 Contributions to joint prediction

2 Background and Notation
  2.1 Logistic regression and neural networks
  2.2 Convolutional neural networks (ConvNets)
  2.3 First-order logic
  2.4 Weighted model count
  2.5 Logic programs (LP)
  2.6 Representing relational data
  2.7 Properties of relations

3 An Overview of Relational Models
  3.1 Tensor factorization models
    3.1.1 Translational approaches
    3.1.2 Deep learning approaches
    3.1.3 Multiplicative approaches
  3.2 Lifted relational models
    3.2.1 Markov logic networks
    3.2.2 Relational logistic regression
    3.2.3 Parfactor graphs
    3.2.4 Problog
    3.2.5 Relational probability trees
    3.2.6 RDN-Boost

4 Link Prediction
  4.1 Properties of tensor factorization models
    4.1.1 Full expressiveness
    4.1.2 Complexity
    4.1.3 Interpretability and redundancy
  4.2 SimplE
    4.2.1 An alternative perspective on SimplE
    4.2.2 Family of Bilinear Models
    4.2.3 Training SimplE
    4.2.4 SimplE is fully expressive
    4.2.5 SimplE & Background Knowledge
  4.3 Experiments and results
    4.3.1 Datasets
    4.3.2 Evaluation metrics
    4.3.3 Implementation
    4.3.4 Baselines
    4.3.5 Entity prediction results
    4.3.6 Incorporating background knowledge
  4.4 Conclusion

5 Property Prediction
  5.1 Why not use existing relational models?
  5.2 Relational neural networks
    5.2.1 Motivations for hidden layers
    5.2.2 Learning latent properties directly
  5.3 From ConvNet primitives to RelNNs
  5.4 Empirical results
    5.4.1 Datasets
    5.4.2 Baselines
    5.4.3 Implementation and learning methodology
    5.4.4 Comparative results
    5.4.5 Effect of numeric and rule-based latent properties
    5.4.6 Cold start and extrapolation
  5.5 Conclusion

6 Joint Prediction
  6.1 Lifted Inference
    6.1.1 Inference for probabilistic graphical models
    6.1.2 Inference for lifted relational models
    6.1.3 Why lifted inference?
    6.1.4 Lifted inference through weighted model counting
    6.1.5 Domain-liftability
    6.1.6 The domain recursion rule
  6.2 Elimination ordering for lifted inference
    6.2.1 Inefficiency of existing heuristics
    6.2.2 Proposed heuristics
    6.2.3 Evaluation of heuristics
  6.3 L2C: Compiling lifted inference to low-level languages
    6.3.1 From lifted inference rules to C++ programs
    6.3.2 Compilation example
    6.3.3 Pruning
    6.3.4 MinNestedLoops heuristic
    6.3.5 Empirical results for L2C
  6.4 Liftability Theory
    6.4.1 Three theories that are only domain-liftable using bounded domain recursion
    6.4.2 New domain-liftable classes: S2FO2 and S2RU
    6.4.3 BDR+RU Class
    6.4.4 Experiments and results
    6.4.5 Discussion
  6.5 Domain Recursion and Existential Quantifiers

7 Conclusion and Future Work
  7.1 Future work for link prediction
  7.2 Future work for property prediction
  7.3 Future work for joint prediction
  7.4 Future work for relational learning

Bibliography

Appendix A
  A.1 Proofs for chapter 4
    A.1.1 Proof of proposition 1
    A.1.2 Proof of proposition 2
    A.1.3 Proof of proposition 3
    A.1.4 Proof of proposition 4
    A.1.5 Proof of proposition 5
    A.1.6 Proof of proposition 6
  A.2 Proofs for chapter 6
    A.2.1 Proof of proposition 7
    A.2.2 Proof of proposition 8
    A.2.3 Proof of proposition 9
    A.2.4 Proof of proposition 10
    A.2.5 Proof of theorem 1
    A.2.6 Proof of proposition 11
    A.2.7 Proof of proposition 12
    A.2.8 Proof of proposition 13

List of Tables

1.1 An example dataset taken from [50] that is not amenable to traditional classification.
2.1 Examples of truth assignments to the ground atoms in Example 19 that are and that are not models. T stands for True and F stands for False.
2.2 Examples of truth assignments to the ground atoms in Example 20 that are and that are not models, with their weights. T stands for True and F stands for False.
3.1 An example of a parfactor over two atoms: Friend and Adult.
4.1 Redundancy in ComplEx.
4.2 Statistics on the WN18 and FB15k datasets. WN18 is a subset of Wordnet [152] and FB15k is a subset of Freebase [22].
4.3 Results on WN18 and FB15k. Reported metrics include raw and filtered mean reciprocal rank (MRR), filtered Hit@1, Hit@3, and Hit@10. Best results are in bold.
4.4 Rules used in Section 4.3.6.
5.1 Performance of different learning algorithms based on accuracy, log loss, and MSE ± standard deviation on three different tasks. NA means the method is not directly applicable to the prediction task/dataset. The best performing method is shown in bold. For models where the standard deviation was zero, we did not report it in the table.
6.1 Runtimes of lifted inference using different heuristics for elimination ordering.

List of Figures

1.1 Flattening relational data into a single flat table.
1.2 An example of relational data in the form of (a) a multidigraph and (b) a set of triples.
1.3 An example of relational data and a Bayesian network created based only on the medical records (example taken in part from [218]).
1.4 An overview of the tractable classes of lifted relational models.
2.1 A generic layer and layer-wise view of logistic regression.
3.1 An RLR model.
3.2 Examples of parfactor graphs.
3.3 An example of a relational probability tree for predicting whether a person is young or not.
4.1 The constraints over MR matrices for the bilinear models (a) DistMult, (b) ComplEx, (c) CP, and (d) SimplE. The lines represent where the parameters are; other elements of the matrices are constrained to be zero. In ComplEx, the parameters represented by the dashed line are tied to the parameters represented by the solid line, and the parameters represented by the dotted line are tied to the negative of the dotted-and-dashed line.
4.2 Training loss for SimplE-avg on WN18 as a function of the number of epochs.
5.1 An RLR model for predicting the gender of users in a movie rating system with a layer-wise architecture.
5.2 A RelNN model for predicting the gender of users in a movie rating system with a layer-wise architecture.
5.3 RelNN structure for predicting the gender in the PAKDD15 dataset.
5.4 An example of a layer of a RelNN demonstrated with ConvNet operations.
5.5 RelNN structure used for the Yelp dataset. Reviewed(u, b), FastFood(b) and SeaFood(b) are observed atoms. N1(u) and N2(u) are numeric latent properties. H1(b), H2(b), and H3(b) are rule-based hidden properties. S1(b), S2(b), S3(b), and S4(b) are pre-activation values for the rule-based hidden properties (and for the predictions on Mexican(b)).
5.6 Predicting (a) gender and (b) age on MovieLens using RelNNs with a different number of numeric latent properties and hidden layers. Each experiment has been conducted 10 times and the mean and standard deviation of the 10 runs have been reported.
5.7 Results on predicting the gender when we only use the first k ratings of the test users and use all ratings in the train set for learning. The x-axis represents the number of ratings used for each test user (i.e., k) and the y-axis shows the log loss. Each experiment has been conducted 10 times and the average log loss is reported. The bars show the standard deviation.
5.8 Results on predicting the gender when we only use the first k ratings of the test users and use 0 ≤ r ≤ 20 ratings of each user (r is generated randomly for each user) in the train set for learning. The x-axis represents the number of ratings used for each test user (i.e., k) and the y-axis shows the log loss. Each experiment has been conducted 10 times and the average log loss is reported. The bars show the standard deviation.
6.1 A simple example taken from [218] indicating the effects of exploiting symmetry for probabilistic inference.
6.2 Runtime of lifted search-based inference as a function of population size for different heuristics. From Table 6.1, these are (a) the 6th parfactor graph, (b) the 10th parfactor graph, (c) the 12th parfactor graph, and (d) the 13th parfactor graph. In all cases, the x-axis is |∆y| and the y-axis is runtime in milliseconds. In (a) min-fill and relational min-fill are overlapping.
In (c) MinTableSize and the population order heuristics are overlapping. In (d) MinTableSize and min-fill are overlapping.
6.3 (a) The C++ program for the theory in Example 43; the WMC is stored in v1. (b) The C++ program in part (a) after pruning.
6.4 Runtimes for L2C using the MinTableSize (MTS) and MinNestedLoops (MNL) heuristics, WFOMC, and PTP on several benchmarks (in (a), the L2C diagrams for the two heuristics overlap).
6.5 Time spent on each step of L2C.
6.6 The effect of pruning/optimizing the C++ programs.
6.7 The amount of time spent on reasoning with the secondary structure in L2C and WFOMC on three benchmarks and for different population sizes.
6.8 The amount of time spent on reasoning with the programs generated by L2C when the program is interpreted, compiled, or compiled and optimized, as well as the amount of time spent on reasoning with WFOMC's data structures, on three benchmarks and for different population sizes (WFOMC failed to produce an answer in less than 1000s for the third network when the population size was 500 or higher).
6.9 A summary of the domain-liftability results. FO2 and RU have previously been proved to be domain-liftable. For FO3, it has been proved that at least one of its theories is not domain-liftable under standard complexity assumptions. We identified the new domain-liftable classes S2FO2 and S2RU and proved that FO2 ⊂ S2FO2, RU ⊂ S2RU, FO2 ⊂ RU, and S2FO2 ⊂ S2RU. We also proved that symmetric transitivity and the S4 clause are domain-liftable. The BDR+RU class contains all theories that are currently known to be domain-liftable.
6.10 Runtimes for calculating the WMC of (a) the theory in Example 44, (b) the S4 clause, and (c) symmetric transitivity, using the WFOMC-v3.0 software (which only uses Rules) and comparing it to the case where we use the domain recursion rule, referred to as Domain Recursion in the diagrams.
A.1 h(e)s and v(r)s in Proposition 2.

List of Software

• SimplE
  – Description: An embedding model for link prediction in knowledge graphs.
  – Used in: Chapter 4 for link prediction.
  – Paper: Kazemi, S. M. and Poole, D. (2018). SimplE Embedding for Link Prediction in Knowledge Graphs, In NeurIPS.
  – URL: https://github.com/Mehran-k/SimplE/
  – Languages & Tools: Python and TensorFlow

• RelNN
  – Description: A deep neural model for relational learning.
  – Used in: Chapter 5 for property prediction.
  – Paper: Kazemi, S. M. and Poole, D. (2018). RelNN: A Deep Neural Model for Relational Learning, In AAAI.
  – URL: https://github.com/Mehran-k/RelNN/
  – Languages & Tools: Java

• L2C
  – Description: A fast lifted inference engine which compiles lifted inference into C++ programs.
  – Used in: Chapter 6 for speeding up lifted inference.
  – Paper: Kazemi, S. M. and Poole, D. (2016). Knowledge Compilation for Lifted Probabilistic Inference: Compiling to a Low-level Language, In KR.
  – URL: https://github.com/Mehran-k/L2C/
  – Languages & Tools: Ruby and C++

Acknowledgements

My Ph.D. was an amazing experience and my supervisor, David Poole, deserves most of the credit. I would like to express my special appreciation and thanks to David for not just being a supervisor, but being a mentor and a friend. I learned numerous valuable work and life lessons from him for which I owe him a debt I can never repay.

Besides my supervisor, I would like to thank the rest of my thesis committee, Prof. Guy Van den Broeck and Prof. Mark Schmidt, for their insightful comments and encouragement. I received invaluable feedback from my examiners: Prof. Frank Wood, Prof.
Matias Salibian-Barrera, and Prof. Daniel Lowd, for which I am greatly thankful.

During my Ph.D., I had the pleasure of working with two amazing and professional teams at Telus and TalentSnap and learning several industry skills. I would like to thank all members of the two teams, especially Mike Tyndall and William Wong from Telus and Cody Sklar and Ali Taghikhani from TalentSnap.

Words cannot express how grateful I am to my parents, siblings, and friends for always supporting me in difficult situations.

Finally, the deepest support I have received comes from my wife, Bahare. Ph.D. life has many ups and downs and without the support and encouragement of an extraordinary wife like her, getting to the finish line would have been impossible for me. Bahare, my sweetheart, you are simply awesome.

Dedication

To my wife, Bahare, and my family, for all their love and support.

Chapter 1

Introduction

Our daily lives are replete with decisions we make based on predictions about the current or future state of the world. Every morning before leaving the house, we check the weather forecast and decide on our clothes. When we shop or watch movies online, or use a social network, the web application predicts which items we may buy, which movies we may like, or which users we may befriend, respectively, and recommends them to us. People working in finance make predictions about the stock values of different companies and decide on buying or selling those stocks. Health-care and medical diagnosis applications predict diseases, nutrition deficiencies, adverse drug effects, congestive heart failure, and many other phenomena. Artificial intelligence (AI) [176, 187] studies computational techniques and approaches for understanding how the world works and for making fast and accurate predictions about the current or future states of the world to facilitate the decision-making process in such applications.

In order to make a prediction about any world, an AI system typically builds a model of the world.
Machine learning [20] is a subfield of AI whose goal is to construct models of the world that improve their performance based on past experience. In many cases, past experience is available in the form of data. Machine learning algorithms have traditionally been designed for flat data. Flat data is a single table with the following properties:

• each row of the table corresponds to an example,
• each column of the table corresponds to a feature,
• the cell at row i and column j of the table contains the value of the j-th feature for the i-th example,
• examples are assumed to be independently and identically distributed (iid).

In many applications (e.g., social networks, online shops, movie rating systems, protein interactions, image processing, natural language processing, analyzing point clouds, etc.), data is not flat. For instance, the data for an online shop may look like Fig 1.1(a). In such cases, one option is to flatten the data through feature engineering (manually finding interesting features in the data). For example, to predict which purchased items will be returned for the online shop presented in Fig 1.1(a), Fig 1.1(b) shows a flattening of this data where Feat1, Feat2, . . . are features constructed by experts that potentially help predict the returns.
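As a toy sketch of this flattening step (the tables and the feature-construction function are hypothetical illustrations, not the thesis's actual data or pipeline), one expert-designed feature, the fraction of the customer's other purchases that were returned, could be computed per purchase as follows:

```python
# Hypothetical sketch of flattening relational data into one flat table.
orders = [  # (BasketID, ItemID, Returned?)
    (1023, 4002, True), (1023, 288, False), (1024, 810, True),
]
owns = {1023: 121, 1024: 121}  # BasketID -> CustomerID

def return_rate(customer, exclude):
    # Expert-designed feature: fraction of this customer's *other*
    # purchases that were returned.
    past = [ret for (b, i, ret) in orders
            if owns[b] == customer and (b, i) != exclude]
    return sum(past) / len(past) if past else 0.0

# One flat row per purchase: (feature, label). From here on, a classifier
# would treat the rows as iid, even though they share a customer.
flat = [(return_rate(owns[b], (b, i)), ret) for (b, i, ret) in orders]
```

Note that all three flat rows here derive from the same customer's history, which is exactly why the iid assumption imposed on them is suspect.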
Needing to flatten the data may limit the applicability of machine learning to applications with non-flat data because:

• flattening may require human resources for feature engineering, often incurring high costs and human time,
• some information may be lost during flattening, reducing the potential for machine learning algorithms to perform satisfactorily,
• the independence assumption imposed on the rows of the flattened data is violated in many applications,
• flattening may introduce redundancies (because the flat data may not be in a normal form; see [196] for definitions of normal forms), which may lead to inconsistencies and make machine learning models more prone to double counting,
• some examples may have more features than others due to the heterogeneity of the data, so flattening may introduce many null values.

To overcome these issues, machine learning models are being developed that work directly with the original data without flattening it. During the past decades, such models have been proposed for spatial data [136], temporal/sequence data [31, 91], event-based data [149, 158, 237], set data [97, 233], graph data [62, 84, 190], relational data [49, 58, 188], etc. In this work, we study representations, reasoning algorithms, and learning algorithms for relational data.

1.1 What is relational data?

Relational data consists of information about several entities (also called objects, individuals, things, etc.), potentially of different types, including information on the existence of entities, the properties of the entities, and the relationships that exist among entities.

Figure 1.1: Flattening relational data into a single flat table.

(a) The original relational tables:

    BasketID  ItemID  Returned?
    1023      4002    Yes
    1023      288     No
    1023      23      No
    1024      5       No
    1024      810     Yes
    1025      27      Yes
    ...       ...     ...

    CustomerID  BasketID
    121         1023
    121         1065
    122         504
    123         555
    123         570
    124         1025
    ...         ...

    ItemID  Code  Size  Color
    4002    T125  M     Red
    4003    S711  S     Blue
    4004    W333  S     Purple
    4005    P202  S     White
    4006    P202  M     Red
    4007    P202  L     Red
    ...     ...   ...   ...

(b) The flattened table:

    Feat1  Feat2  ...  Returned?
    21     1      ...  Yes
    8      0      ...  No
    -4     0      ...  No
    2      0      ...  No
    3      0      ...  Yes
    -1     1      ...  Yes
    ...    ...    ...  ...

Example 1. The data in Fig 1.1(a) contains information about several entities. There are three types of entities: Item, Customer, and Basket. There is information about the properties of each entity, e.g., item 4002 is Red. There is also information about the relationships among the entities, e.g., basket 1023 belongs to customer 121.

Relational data comes in several formats. Here we introduce three well-known formats: a relational database, a set of triples, and a labeled multidigraph.

1.1.1 Relational data as a relational database

Relational data may come in the format of a relational database consisting of multiple tables of rows and columns and logical connections between these tables called relationships [37]. Generally, each table represents a type of entity (e.g., Item or Customer) such that:

• each row of the table corresponds to an entity,
• each column of the table corresponds to a property,
• the cell at row i and column j of the table contains the j-th property of the i-th entity,
• properties of the entities may be interdependent depending on the relationships the entities participate in.

A relationship between two (or more) entities can also be represented as a table of rows and columns such that:

• each row of the table represents a fixed-length tuple of entities participating in the relationship and the properties of the relationship,
• each column of the table corresponds to either a specific type of entity participating in the relationship or a property of the relationship.

Fig 1.1(a) represents relational data in the form of a relational database. The right table corresponds to items, and the left and middle tables are relationships between baskets and items and between customers and baskets, respectively.
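The relational-database view can be mimicked in memory with one container per table. The following sketch (an illustration only, with a few values taken from Fig 1.1(a); container and variable names are invented) also shows a simple relational query that joins two tables:

```python
# In-memory sketch of the relational database of Fig 1.1(a).
items = {  # entity table: ItemID -> properties
    4002: {"Code": "T125", "Size": "M", "Color": "Red"},
    4003: {"Code": "S711", "Size": "S", "Color": "Blue"},
}
owns = [(121, 1023), (121, 1065), (124, 1025)]    # (CustomerID, BasketID)
orders = [(1023, 4002, True), (1023, 288, False), # (BasketID, ItemID,
          (1025, 27, True)]                       #  Returned?)

# A relational query joining the two relationship tables:
# which items did customer 121 return?
baskets_121 = {b for (c, b) in owns if c == 121}
returned_121 = [i for (b, i, ret) in orders if ret and b in baskets_121]
```

Answering this query requires following the Owns relationship from the customer to its baskets and then joining with the order table, information that a single flat table with an iid assumption cannot represent without loss.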
Flat data can be considered a specific type of relational data with only one table.

1.1.2 Relational data as a set of triples

Relational data may come as a set of triples of the form (head, relation, tail), also written (subject, relation, object), where head is an entity, tail is either an entity or a scalar value, and relation is a relationship between head and tail. Considering the relational database in Fig 1.1(a), the first row of the rightmost table translates into three triples:

(Item4002, Code, T125)
(Item4002, Size, M)
(Item4002, Color, Red)

The first row of the middle table in Fig 1.1(a) translates to a single triple (assuming the name of the relationship for the middle table is Owns):

(Customer121, Owns, Basket1023)

Fig 1.2(b) represents relational data as a set of triples. To represent relational data as a set of triples, it may be necessary to reify the rows. Reify means "make into an entity". For instance, consider the relationship represented in the leftmost table of Fig 1.1(a). Rows of the table for this relationship cannot be translated to triples in the same way as we did for the middle table. However, one can reify a new entity for each row and represent each row as several triples between the reified entity and the columns of the table. For instance, for the first row of the leftmost table, let us reify a new entity called Order1. Then we can represent the first row as three triples:

(Order1, BasketId, Basket1023)
(Order1, ItemID, Item4002)
(Order1, Returned?, Yes)

1.1.3 Relational data as a labeled multidigraph

A graph consists of nodes and edges between the nodes. A directed graph (digraph) is a graph whose edges have a direction associated with them. A multidigraph is a digraph which is permitted to have multiple directed edges between each pair of nodes. A labeled multidigraph is a multidigraph where each edge has a label [12].

Relational data may come in the form of a labeled multidigraph where nodes correspond to entities and properties.
Each directed edge between two nodes either represents a relation between two entities, specified by the label of the edge, or represents a property of an entity, where the type of property is specified by the label of the edge.

[Figure 1.2: An example of relational data in the form of (a) a multidigraph and (b) a set of triples. Edge and relation labels include Likes, FatherOf, WifeOf, Age, and FollowsOnTwitter.]

Fig 1.2(a) represents relational data in the form of a multidigraph. As with a set of triples, to represent relational data as a labeled multidigraph, it may be necessary to reify new entities.

1.2 Why study relational data?

Since the world we live in is composed of entities, each having various properties and participating in multiple relationships with other entities, data is often relational:

• Many companies store their data in relational databases with multiple entity-property tables and multiple relation tables to capture the properties of the entities and the interdependence among them,
• The rapid growth of social networks is generating great amounts of labeled multidigraph data,
• Numerous probabilistic knowledge bases (such as NELL [29], Reverb [64], Probase [230], Yago [92], DBpedia [8], Google Knowledge Graph [60] and DeepDive [195, 236]) are being constructed in academia and industry and are rapidly growing, each having millions to billions of triples containing information about properties and relationships of the entities in the world.

With the abundance and ubiquity of relational data, studying the properties of such data and developing representations, reasoning algorithms, and learning algorithms for it is necessary and can be helpful in many applications.

1.3 Why relational representation/learning?

Consider an online shopping website.
Customers log into their accounts, add items of interest to their baskets, and place orders. After receiving the ordered items, a customer may decide to return some of them. Suppose the website owner decides to build a system to predict whether each ordered item will be returned. The original data may look like Fig 1.1(a).

The steps typically taken to design a machine learning model for such a problem are:

1. Feature engineering: A set of features that may be useful in predicting whether the purchased item will be returned is selected (e.g., the percentage of items returned by the current customer).

2. Flattening the data: The original data is flattened into a single table as in Fig 1.1(b) according to the features. Each row of this flat table corresponds to a purchase: a row may include the features of the purchased item, the features of the customer who has purchased the item, and whether the customer has returned the item or not.

3. Training a model: A classifier is trained on the flat table to learn the conditional probability of a return given the features and is used to make predictions about new purchases.

The feature engineering step needs to be done by experts who know the data well. These experts think about the potential reasons why an item may be returned and construct a set of features from the data that captures these reasons. (Note that tools such as the widely-used deep neural networks for automatic feature generation create new features based on the initial set of features constructed by experts; they do not extract the initial set of features from relational data.) These features may then be tested and revised. Hiring such experts may involve huge costs. Furthermore, a large amount of time is usually required for constructing an informative set of features, which may not be acceptable, e.g., for fast prototyping.

Useful information may be lost during flattening.
For instance, there may be reasons for returning items that are inherent in the data but that the experts are not aware of. Ideally, one would like to automate the feature construction step to save time and money and avoid losing useful information. The following analogy from machine vision sheds light on the importance of automatic feature generation.

Analogy 1. Following [143], for many years and in many machine vision applications, the focus of research was on designing new features (in the shape of image filters) and testing their effectiveness. The progress in such machine vision applications had almost reached a plateau. Once enough data and computing power were available, convolutional neural networks [128, 136] revolutionized machine vision by, among other innovations, automating the process of feature/filter construction.

Once the data is flattened in step two, an iid assumption is imposed on the rows of the table. In many cases, the independence assumption on the flat data is not correct, but it is used because it simplifies the learning process in step three. Suppose, for instance, that a basket includes two items: 1- a shirt of size small, and 2- the same shirt in size medium. Flattening creates two rows for these items, and the model predicts each one independently of its prediction for the other. However, the predictions for these two rows depend on each other, as it is possible that the customer likes the shirt but is not sure about the size. So if the customer decides to keep one shirt, they will most probably return the other one. Ideally, one would like to relax the independence assumption and make a joint prediction for all items in a basket, rather than making separate predictions for each item in the basket.

Furthermore, once the data is flattened, user and item IDs are removed from the resulting flattened table, as these are unique identifiers and a model may overfit to them during training.
In other words, unique identifiers are not considered features and are not included in flat data. Learning specific (latent) properties of each user, item, or any other type of entity in our dataset may provide valuable information, but by removing the IDs, we lose the chance of learning such properties. For instance, suppose Mary has two sons: one aged 14 and one aged 18. Then Mary may buy a small and a medium shirt of the same type and color and keep both. While we may not have observed that Mary has two sons, we may learn from the data the latent property that she buys small and medium shirts and keeps both.

In the next two subsections, we provide two more examples, one showing the limitations of the independence assumption, and the other showing the limitations of not learning about specific entities. These examples show that flattening relational data and applying traditional machine learning algorithms and models to the flat data may not be a satisfactory option. New representations, reasoning algorithms, and learning algorithms should be developed that work directly with relational data.

Figure 1.3: An example of relational data and a Bayesian network created based only on the medical records (example taken in part from [218]).

    Medical Records:
    Name     Cough  Asthma  Smokes
    Alice    1      1       0
    Bob      0      0       0
    Charlie  0      1       0
    Dave     1      0       1
    Eve      1      0       1
    Frank    1      ?       ?
    ...      ...    ...     ...

    Friendship:
    Name1  Name2
    Dave   Frank
    Eve    Dave
    ...    ...

    Brothers:
    Name1  Name2
    Bob    Frank
    ...    ...

    (The Bayesian network has the nodes Asthma, Smokes, and Cough.)

1.3.1 An example showing the limitations of the independence assumption

The following example from [218] illustrates the limitations of the independence assumption.

Example 2. Consider the data in Fig 1.3. Suppose we consider the rows of the medical record table to be independently distributed and build the Bayesian network on the right of Fig 1.3. Then for Frank, the network may predict P(asthma) = 0.3 and P(smokes) = 0.2.
However, because Frank and Dave are friends and friends usually have similar smoking habits, we expect the smoking probability for Frank to increase. Note also that Eve and Dave are friends, and friends of friends are usually also friends. Therefore, we expect that Frank is probably friends with Eve. Eve being a smoker may further increase the probability of Frank smoking. On the other hand, note that asthma can be hereditary and Bob (Frank's brother) does not have asthma. This may decrease the probability of Frank having asthma. Taking this extra information into account, one may expect P(asthma) = 0.2 and P(smokes) = 0.5 for Frank, i.e., the reason behind Frank's cough is more likely smoking than asthma.

Table 1.1: An example dataset taken from [50] that is not amenable to traditional classification.

    Student  Course  Grade
    S1       C1      A
    S2       C1      D
    S1       C2      B
    S2       C3      B
    S3       C2      B
    S4       C3      B
    S3       C4      ?
    S4       C4      ?

1.3.2 An example showing the limitations of not learning about specific entities

The following example, taken from [50], shows the limitations of not learning about specific entities.

Example 3. Consider the dataset in Table 1.1 and suppose the task is to predict the grades of students S3 and S4 in course C4. Standard textbook supervised learning algorithms cannot handle IDs and will remove them, thus making predictions given no information.
If we have a relational model which learns about particular students and courses and the relationships among them, however, the model may realize that:

• S1 is more intelligent than S2, as S1 has earned a higher grade in C1,
• C2 is more difficult than C3, as S1 (the more intelligent student) has earned a B in C2 and S2 (the less intelligent student) has earned a B in C3,
• S3 is more intelligent than S4, as S3 has earned a B in C2, S4 has earned a B in C3, and C2 is more difficult than C3,

and conclude that S3 will get a higher grade than S4 in C4.

1.3.3 The aggregation problem

Suppose in the online shopping example, we hypothesize that whether a customer returns an item or not depends on the items they have previously purchased/returned and the dates of those purchases/returns. For instance, if Sam returned a shirt of size small two weeks ago, returned the large size of the same shirt a week ago, and has now ordered a medium size of the same shirt, chances are this time he keeps the shirt. With this hypothesis, the return variable for an item depends on a varying and unbounded number of variables, as each customer may have previously purchased/returned a different number of items. Furthermore, as time passes, the number of such variables may increase.

If we decide to flatten the data into a single table, we have two alternatives: 1- considering a maximum number k of previously purchased/returned items per customer and including k * t columns in the table, t columns for the properties of each purchased/returned item (e.g., the date of purchase/return), 2- engineering a set of features based on all previously purchased/returned items and including them in the table. The former alternative may introduce many null values in the table, as many customers may have purchased/returned far fewer items than the maximum k.
The latter requires feature engineering and may suffer from the issues mentioned earlier, such as information loss and labor costs.

1.4 Relational prediction problems

Given relational data, several prediction problems may be of interest. The following subsections briefly describe some of these problems.

1.4.1 Link prediction

Link prediction refers to the problem of predicting new links (or new relationships) between entities given the existing ones.

Example 4. Recommendation systems [21, 127] are well-known instances of the link prediction problem. A common application of recommendation systems is recommending movies to people. Consider a movie rating system containing information about properties of people and their relationships with each other, properties of movies and their relationships with each other, and, most importantly, a set of movies each person likes/dislikes. A movie recommendation system predicts new like/dislike relationships between people and movies based on the existing information and makes recommendations accordingly.

Besides recommending movies to people, recommendation systems have applications in many other domains, such as recommending friends in social networks, recommending jobs to job seekers or talents to recruiters, and deciding which ad to show to a particular user.

Example 5. Knowledge graphs [29, 60, 64, 92, 195, 230, 236] are knowledge bases that contain information about real-world entities of various types (people, animals, companies, countries, songs, albums, etc.) and the relationships among these entities. Fig 1.2 may be a subset of a knowledge graph. Automatic question answering is one of the main applications of knowledge graphs. The information in knowledge graphs is typically gathered through information extraction techniques that read the web and extract facts. Since the amount of information on the web is limited, current knowledge graphs are far from complete.
The aim of link prediction for knowledge graphs is to infer new relationships between the entities given the existing ones.

Example 6. Visual relation detection [144, 231, 235] studies the problem of detecting entities in images and then inferring links/relationships between these entities to provide a better understanding of the image. This problem can be viewed as a variant of the link prediction problem where there is uncertainty about the types of the entities and the links are to be predicted based on the information in the image instead of previously known links.

1.4.2 Property prediction

Property prediction refers to predicting a specific property of entities of a specific type based on the available information.

Example 7. Electronic health records contain heterogeneous (relational) information about patients, diseases, symptoms, tests, lab results, etc. Medical diagnosis applications aim at predicting whether a patient has a specific type of disease (e.g., cancer) based on their medical and family history.

Example 8. Online shopping websites may be interested in predicting specific properties of their customers (e.g., age, ethnicity, etc.) based on the items they view and their sequence, the links they click on, and their purchase patterns. Predicting such properties allows these websites to better personalize their item descriptions and make more intelligent recommendations.

1.4.3 Joint prediction

Joint prediction (aka collective prediction) refers to predicting the collective probability of an assignment of values to a set of entity properties or relationships given some observations.

Example 9. One variant of automated question answering is to answer questions regarding real-world entities. IBM Watson [89] is a well-known tool for this problem that beat world champions on the Jeopardy! game show. Questions asked in automated question answering may require joint predictions of facts.
For instance, the question "has anyone from Canada authored a paper with both Einstein and Bergmann?" requires a joint probability over three random variables.

Example 10. In Fig 1.2, we may be interested in finding the probability of a specific person being born in Germany and living in Canada.

In some property prediction problems, making collective predictions of the entity properties may result in better predictions than making independent predictions for each entity. Consider the following example.

Example 11. An example of the property prediction problem for relational data is predicting the political affiliations of the users of a social network given their other properties (e.g., age, region, occupation, posts, etc.) and the relationships they have with other users or entities (e.g., friendships, likes, comments, etc.). In this setting, knowing that two users are married to each other, one may expect them to have similar political affiliations. Thus, a collective prediction of their political affiliations may be more desirable than predicting each independently of the other.

1.4.4 Existence prediction

Existence prediction aims at predicting whether an entity exists or not.

Example 12. Consider the citation matching problem [134, 172] where, given the text of several citations, the problem is to determine what papers exist, what authors exist, and which citation text refers to which paper. Predicting what papers and authors exist is an example of an existence prediction problem.

Example 13. Another similar example of an existence prediction problem is in information extraction, where the aim is to extract facts about real-world entities from raw text. A challenge in information extraction is to determine from the raw text what objects exist.

1.4.5 Record linkage

Record linkage [34, 67] refers to the problem of recognizing records in two separate files which represent identical persons or objects. It has been
previously studied independently by researchers in several areas under various names, including object identification [206], entity resolution [16, 18], identity uncertainty [172], approximate matching [80], duplicate detection [154], merge/purge [87], and hardening soft information [38]. Applications of record linkage include citation matching [78], person identification in censuses [229], and identifying different offers of the same product from multiple retailers for comparison shopping [19].

Example 14. Consider a social network in which some users have multiple accounts. Determining which two (or more) accounts belong to one person is an example of an entity resolution problem.

Example 15. Consider a website offering information about restaurants (or other businesses) which allows its users to find their desired restaurant through search. This is a challenging problem because users do not consistently use the official name, but use abbreviations, synonyms, different orders of terms, different spellings of terms, and short forms of terms, and the name can contain typos or spacing issues [66]. Finding the desired restaurant given the search query of the user is an instance of the entity resolution problem.

1.4.6 Community detection

A community in relational data refers to a subset of the entities that are densely connected to each other but loosely connected to the rest of the entities. Community detection [69, 118, 129] is a clustering problem aiming at detecting such communities.

Example 16. Consider the problem of identifying fake accounts in a social network. Fake accounts usually tend to befriend, follow, and interact with each other, but have only limited connections to real accounts. Community detection over the accounts of a social network can help identify clusters of fake accounts as a whole.
Community detection can also be used for identifying malicious websites, as these websites often link to each other but few legitimate websites link to them.

Example 17. Consider an advertising company that wants to target the users of a social network. By detecting the communities/clusters of entities that behave similarly, the advertising company can provide better-personalized ads for each group.

1.4.7 Problems studied in this thesis

In this dissertation, we study the link prediction, property prediction, and joint prediction problems. In some applications, entity resolution can be considered an instance of the link prediction problem where the aim is to predict the equivalence relationship between the entities. In such cases, the models we study and develop for link prediction can also be applied to entity resolution.

1.5 Statistical relational artificial intelligence

Representation, reasoning, and learning, as well as the prediction problems mentioned earlier for relational data, are generally studied under the umbrella of statistical relational artificial intelligence (StaRAI) [50]. The problem of learning statistical models from relational data (a problem studied in StaRAI) is sometimes referred to as statistical relational learning [76] or relational machine learning [167].

Two of the main paradigms of StaRAI approaches are lifted relational models and tensor factorization. Lifted relational models [10, 49, 59, 111, 155, 160] typically lift existing machine learning techniques that have been proposed for non-relational data to be applicable to relational data.
A great number of these models (e.g., [10, 49, 59, 111]) do the lifting by combining the existing representations with first-order instead of propositional logic. They typically learn first-order logic rules that define the structure and the regularities in the data, and a weight for each rule representing the confidence of the model in that rule. Tensor factorization approaches [24, 161, 165, 208] learn embeddings for entities and relationships and define a function from the embeddings to whether a relation exists between a head and a tail entity or not.

Each of the two paradigms mentioned above (and the other existing paradigms, such as graph random walk approaches [48, 130, 131, 224]) has its own advantages and disadvantages. Several recent surveys and comparative studies describe and compare the approaches in one or more of these categories on several tasks [114, 120, 167]. It has been observed in several works that hybrid models (models based on a combination of the approaches in these categories) may result in higher predictive performance than models based on approaches in a single category [55, 138, 157, 171].

This dissertation studies models from both paradigms. In particular, it studies the lifted relational models paradigm for property prediction and joint prediction, and tensor factorization for link prediction.

1.6 An overview of the contributions of this dissertation

We have contributed to the link prediction, property prediction, and joint prediction problems for relational data. For link prediction and property prediction, we developed two relational representations (one for each) that achieved state-of-the-art results on several benchmarks. For joint prediction, we sped up inference for several lifted relational models, where inference is usually the bottleneck for learning. We also expanded the class of models for which efficient inference is possible.
Thus, structure learning algorithms for lifted relational models that restrict the search space to models for which efficient inference algorithms exist can perform their search over a larger space.

This section is intended for readers looking for a high-level overview of the contributions in this dissertation. Readers looking for a detailed description of the contributions can safely skip it.

1.6.1 Contributions to property prediction

Prior to the start of my Ph.D., in joint work with David Buchman, Dr. Kersting, Dr. Natarajan, and Dr. Poole (hereafter referred to as team1) which led to my Master's thesis, we lifted the standard logistic regression model [3, 20], one of the main building blocks of neural networks, to work with relational data. We named this model relational logistic regression (RLR) and provided a theoretical characterization of what RLR is capable of doing [111]. RLR can be viewed as the relational counterpart of logistic regression, and as the directed counterpart of Markov logic networks. In two other projects on RLR with Bahare Fatemi and Dr. Poole (led by Bahare Fatemi), we developed structure and parameter learning algorithms for RLR [65], showing how they outperform well-known lifted relational models, and we developed a probabilistic model based on RLR for finding a new name in a database of records indexed by name, for a telecommunications company [66].

Continuing our project on RLR with team1, in work led by Dr. Poole, in [178] we investigated some drawbacks of using RLR and other similar models such as Markov logic networks (MLNs) [59]. For instance, we observed that in RLR (and MLNs), as the population size of the entities increases, the probability of many variables goes to zero or one, making the model overconfident. In Kazemi et al. [114], we further studied the pros and cons of existing models (including relational logistic regression) for property prediction. The conclusion of the study was that none of the existing relational models works entirely satisfactorily.

The drawbacks of the existing relational models for property prediction motivated Dr. Poole and me to devise a new representation. We developed an initial framework for deep relational learning that we call relational neural networks (RelNNs) [109], which can be viewed as a generalization of RLR with hidden layers and latent entity properties, or as a generalization of standard neural networks that takes as input the descriptions of some entities and their relationships, applies tied parameters to entities about which we have the same information, and makes predictions about properties of the entities. In [109], we discuss how RelNNs address some drawbacks we identified for RLR models and compare them to several well-known relational models. Initial experiments on eight tasks over three real-world datasets show that RelNNs are promising models for relational learning. Our software is available online on GitHub at https://github.com/Mehran-k/RelNN.

1.6.2 Contributions to link prediction

Tensor factorization models are currently among the best performing models for link prediction over the existing test cases. These models typically consider an embedding (one or more vectors or matrices of numbers) for each entity and each relationship and define a function which takes the embeddings for a head and a tail entity plus the embedding for a relation as input and outputs the probability of the head entity having that relation with the tail entity.

Developed in 1927, canonical polyadic (CP) decomposition [90] is among the first tensor factorization approaches.
This approach learns one embedding vector for each relation and two embedding vectors for each entity: one to be used when the entity is the head and one to be used when the entity is the tail. CP performs poorly for link prediction [209] as the two embedding vectors for entities are learned independently: observing that some entity participates in a relation as the head does not affect the vector corresponding to the case where the entity is the tail, and vice versa.

In [110], we addressed the issue of the independence of the two embedding vectors for entities by introducing auxiliary relations corresponding to the inverses of the existing relations and augmenting the data to include the inverse relations. Due to the simplicity of our model, we called it SimplE (Simple Embedding). SimplE is simple but effective. It is a fully expressive tensor factorization model which performs very well empirically and produces state-of-the-art results for link prediction over two knowledge graphs, one based on WordNet [152] and one based on Freebase [22]. Our software is available online on GitHub (https://github.com/Mehran-k/SimplE).

1.6.3 Contributions to joint prediction

Lifted relational models [10, 49, 59] are currently among the best performing models for joint prediction. While these models provide compact encodings of probabilistic dependencies in relational domains, they typically result in highly intractable, densely connected graphical models, often with millions of random variables. Learning the parameters of these models from data typically requires inference over large graphical models, and the inference step is usually the bottleneck for learning.

One key technique for making inference in these models tractable is to exploit the symmetry among the entities: treating entities about which we have the same information identically.
Reasoning while exploiting both symmetries and independencies of random variables is known as lifted inference [117, 179]. Over the past 15 years, a large number of lifted inference algorithms have been proposed [32, 51, 79, 102, 151, 177, 212, 214], often providing exponential speedups for specific lifted relational models.

Our first contribution to lifted inference is on speeding it up by compiling lifted inference to low-level programs. Consider a specific lifted relational model and suppose we have two options for reasoning: 1- using a generic lifted inference engine (the generic engine for short), or 2- hiring a lifted inference expert to write code specifically for reasoning in our model (the specific code). We expect the specific code to run faster than the generic engine, but hiring an expert to write the specific code requires time and money. Furthermore, the expert has to write new code every time we have a new model.

In Kazemi and Poole [105], we automated the process of writing specific code for lifted relational models. We developed a compiler which, given a model, generates C++ code specific to that model. We tested our compiler on several benchmarks and observed that it gains an average 175x speed-up compared to the state-of-the-art generic lifted inference engines. Our software, named L2C, is available online on GitHub (https://github.com/Mehran-k/L2C).

In Kazemi and Poole [107], we further investigated the reason behind the efficiency of our compiler by comparing it to WFOMC [212], which, given a lifted relational model, generates data structures (instead of C++ code) specific to that model.
We observed that generating data structures can be viewed as generating code and then running it using an interpreter. When we directly generate C++ code, however, we can run it using a compiler, and compilers are typically faster than interpreters.

Our second contribution to lifted inference is on speeding it up by developing heuristics to select an order of applying applicable rules to an input model. In [104], we designed several heuristics for selecting the order of the rules and evaluated these heuristics on several benchmarks. We observed that our MinTableSize heuristic, which is a generalization of the min-size heuristic, outperformed the other heuristics as well as the well-known min-fill [54] heuristic. In [105], we further enhanced the MinTableSize heuristic by performing a stochastic local search on the initial order, aiming at minimizing the maximum number of nested loops in the final C++ code.

Depending on the structure of a model, during inference we may reach a point where none of the lifted inference rules proposed so far is applicable to the model. If we reach such a state, learning and reasoning become exponential in the number of entities which, except in really small domains, is impractical. To avoid exponential-time reasoning and learning, one may restrict the structure of the models to only the classes of models that are known to be tractable. In this case, extending the classes of tractable models means richer models can be learned. Currently, two main tractable classes have been identified: FO2 [214] (encoding dependencies between pairs of entities) and recursively unary [105] (capturing hierarchical simplification rules).

One of the rules introduced for lifted inference in [215] is the domain recursion rule. This rule applies the principle of induction on the domain of the entities. The rule makes one entity A in the domain explicit. Afterward, the other inference rules simplify the model up to the point where A is completely removed from the theory.
Although domain recursion was central to the proof that FO2 is domain-liftable, later work showed that simpler rules suffice to capture FO2 [203], and the domain recursion rule was believed to be redundant.

Our third contribution to lifted inference is on reviving the domain recursion rule of [215] by showing that it is more powerful than expected, and can lift models that are otherwise not domain-liftable. This included an open problem in [14], asking for an inference rule for a logical sentence called S4. It also included the symmetric transitivity model and an encoding of the birthday paradox [13] in first-order logic. There previously did not exist any efficient reasoning algorithm with general-purpose rules for these theories, and we obtained exponential speedups. We also proved that domain recursion supports its own large classes of domain-liftable models, S2FO2 and S2RU, that subsume the existing FO2 and recursively unary classes respectively. In a follow-up work [115], we demonstrated how domain recursion, when applicable, can be a better alternative than Skolemization [214] for theories with existential quantifiers. We also introduced a new domain-liftable class called BDR+RU which we conjecture is the largest possible domain-liftable class. To date, this conjecture has not been proved or disproved.

Figure 1.4: An overview of the tractable classes of lifted relational models. (The figure relates RU, FO2, FO3, FO4, S2FO2, S2RU, and BDR+RU, and shows the symmetric transitivity sentences ∀x, y, z : F(x, y) ∧ F(y, z) ⇒ F(x, z) and ∀x, y : F(x, y) ⇒ F(y, x), and the S4 clause ∀x1, x2, y1, y2 : S(x1, y1) ∨ ¬S(x1, y2) ∨ S(x2, y2) ∨ ¬S(x2, y1).)
Fig 1.4 shows an overview of our domain-liftability results.

Chapter 2

Background and Notation

In this chapter, we define our notation and provide necessary background for readers to follow the rest of the dissertation.

2.1 Logistic regression and neural networks

Here we define multilayer neural networks at a level of abstraction that encompasses both traditional neural networks and relational neural networks. A layer is a generic box (see Fig 2.1) that, given Î as input, produces an output Ô based on Î, and then, given the derivatives D_Ô of the error with respect to the variables in Ô, updates its state and outputs the derivatives D_Î of the error with respect to the variables in Î using the chain rule.

A linear layer (LL) takes Î = {î_0 = 1, î_1, î_2, ..., î_n} as input and outputs Ô = {ô_1, ..., ô_t}. Internally, an LL consists of t linear units, where the j-th unit consists of a set w_j = {w_j0, w_j1, ..., w_jn} of weights and produces the j-th output ô_j as:

ô_j = Σ_{k=0}^{n} w_jk î_k    (2.1)

Given the derivatives of the error with respect to its outputs, D_Ô = {D_Ô(ô_1), ..., D_Ô(ô_t)}, the LL calculates the derivatives of the error with respect to its inputs, D_Î = {D_Î(î_1), ..., D_Î(î_n)}, as:

D_Î(î_k) = Σ_{j=1}^{t} w_jk D_Ô(ô_j)    (2.2)

The LL calculates the derivative of the error with respect to each weight w_jk as:

D_w_jk = D_Ô(ô_j) ∗ î_k    (2.3)

An activation layer (AL) applies an activation function A (e.g., sigmoid, tanh, or ReLU) to each input in Î = {î_1, ..., î_t} and outputs Ô = {ô_1, ..., ô_t}, where ô_j = A(î_j). Given D_Ô = {D_Ô(ô_1), ..., D_Ô(ô_t)}, the AL calculates D_Î = {D_Î(î_1), ..., D_Î(î_t)} as:

D_Î(î_j) = D_Ô(ô_j) ∗ ∂A(î_j)/∂î_j    (2.4)

An error layer (EL) consists of an error function E such as the sum of squared errors, log loss, etc. Given Î = {î_1, ..., î_t} representing t predictions, the EL calculates Ô = {ô_1, ..., ô_t} as ô_j = E(î_j, y_j), where y_j is the actual j-th target value. These outputs may be used to measure the overall training error of the model. D_Ô for an EL is empty. The EL calculates D_Î = {D_Î(î_1), ..., D_Î(î_t)} as:

D_Î(î_j) = ∂E(î_j, y_j)/∂î_j    (2.5)

Figure 2.1: A generic layer (with input Î, output Ô, and derivatives D_Ô and D_Î) and a layer-wise view of logistic regression (a linear layer followed by a sigmoid layer and an error layer).

A logistic regression [3] model consists of an LL with only one linear unit and an AL with sigmoid activation. The input to the LL is a value for each of the input features. During training, an EL is also added to the structure, forming the structure on the right of Fig 2.1.

A neural network can be constructed by cascading several LLs and ALs. Generalizing logistic regression, the LLs of a neural network may contain multiple linear units. A layer-wise architecture offers high modeling flexibility: different types of layers can be placed into a network in a modular way. The parameters of these networks can be learned from training data by feeding the examples forward through the network to produce the predictions, backpropagating the derivatives with respect to the error, and updating the parameters using stochastic gradient descent. This training method is known as the backpropagation algorithm.

2.2 Convolutional neural networks (ConvNets)

Convolutional neural networks (ConvNets) [128, 136] have revolutionized many applications in computer vision, speech recognition, natural language processing, etc. Their input is usually one or more (multidimensional) matrices (e.g., image channels). One of the layers used in ConvNets is the convolution layer, which consists of small matrix-shaped filters that are convolved with each input matrix, generating/outputting new matrices. These filters are known as convolution filters. The parameters of convolution filters are tied across all regions of the input matrices. These models are learned by stochastic gradient methods. Each filter may learn to extract a specific type of feature.
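As an aside, the layer abstraction of Section 2.1 (Equations 2.1-2.5) can be sketched in code. The sketch below is a minimal illustration, not the implementation used in this dissertation; it assumes a sigmoid AL, a log-loss EL, and a tiny hand-made dataset, and trains a logistic regression model with backpropagation.

```python
import math

# A minimal sketch of the layer abstraction of Section 2.1: each layer maps
# inputs to outputs, and maps output derivatives back to input derivatives.

class LinearLayer:
    def __init__(self, weights):           # weights[j][k], with k = 0 the bias
        self.w = weights

    def forward(self, i_hat):              # Equation 2.1: o_j = sum_k w_jk * i_k
        self.i = [1.0] + list(i_hat)       # i_0 = 1 carries the bias
        return [sum(wj[k] * self.i[k] for k in range(len(self.i)))
                for wj in self.w]

    def backward(self, d_o, lr=0.1):
        # Equation 2.2: derivatives w.r.t. the inputs (bias input excluded)
        d_i = [sum(self.w[j][k] * d_o[j] for j in range(len(self.w)))
               for k in range(1, len(self.i))]
        # Equation 2.3: derivatives w.r.t. the weights, used in a gradient step
        for j, wj in enumerate(self.w):
            for k in range(len(wj)):
                wj[k] -= lr * d_o[j] * self.i[k]
        return d_i

class SigmoidLayer:
    def forward(self, i_hat):
        self.o = [1.0 / (1.0 + math.exp(-i)) for i in i_hat]
        return self.o

    def backward(self, d_o):               # Equation 2.4 with A = sigmoid
        return [d * o * (1.0 - o) for d, o in zip(d_o, self.o)]

def log_loss_grad(preds, targets):         # Equation 2.5 with E = log loss
    return [(p - y) / (p * (1.0 - p)) for p, y in zip(preds, targets)]

# Logistic regression: one linear unit plus a sigmoid AL (EL added in training).
ll, al = LinearLayer([[0.0, 0.0, 0.0]]), SigmoidLayer()
for _ in range(2000):                      # backpropagation on an AND-like dataset
    for x, y in [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]:
        p = al.forward(ll.forward(x))
        ll.backward(al.backward(log_loss_grad(p, [y])))

pred = al.forward(ll.forward([1, 1]))[0]   # close to 1 after training
```

Note how the derivative of the log loss and the sigmoid backward step combine, by the chain rule, into the familiar p − y gradient of logistic regression.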
While filters at lower layers mostly extract low-level features, the filters at higher layers learn more high-level features that help make better predictions. For images, for instance, filters at lower layers may detect edges, while filters at higher layers may detect more high-level concepts.

ConvNets also contain pooling layers. The input of a pooling layer is one or more (multidimensional) matrices, and the output is the same number of matrices as the input, where each output matrix is generated based on one of the input matrices. To generate an output matrix based on an input matrix, an n ∗ n filter is shifted over the input matrix, and the mean (in the case of mean-pooling) of the cells or the maximum (in the case of max-pooling) cell among the cells under the filter is sent to the output, thus generating a new (possibly smaller) matrix.

2.3 First-order logic

This section describes finite-domain, function-free first-order logic (FOL). Below is a list of definitions/assumptions that we use for finite-domain, function-free FOL.

• A population is a finite set of entities (or individuals).

• Logical variables (logvars) are typed with a population. We represent logvars with lower-case letters. The population associated with a logvar x is ∆x. The cardinality of ∆x is |∆x|.

• Constants are names denoting entities.

• The unique names assumption states that different constants (names) refer to different entities.

• E represents the set of all constants/entities.

• A lower-case letter in bold represents a tuple of logvars.

• An upper-case letter in bold represents a tuple of constants.

• x ∈ ∆x is used as a shorthand for instantiating x with one of the elements of ∆x.

• Atoms are of the form S(t1, . . . , tk) where S is a predicate symbol and each ti is a logvar or a constant.
When k = 0, S is a propositional random variable and we can omit the parentheses.

• Unary atoms contain exactly one logvar.

• Binary atoms contain exactly two logvars.

• Substitutions are written as θ = {〈x1, . . . , xk〉/〈t1, . . . , tk〉} where each xi is a different logvar and each ti is a logvar or a constant in ∆xi.

• A grounding of an atom S(x1, . . . , xk) is a substitution θ = {〈x1, . . . , xk〉/〈X1, . . . , Xk〉} mapping each of its logvars xi to an entity in ∆xi.

• G(A, E): given a set A of atoms and a set E of constants, we denote by G(A, E) the set of all possible groundings of the atoms in A according to the constants in E.

• A world for a set A of atoms and a set E of constants is an assignment of truth values to all groundings in G(A, E).

• A literal is an atom or its negation.

• A formula ϕ is a literal, a disjunction ϕ1 ∨ ϕ2 of formulas, a conjunction ϕ1 ∧ ϕ2 of formulas, or a quantified formula ∀x ∈ ∆x : ϕ(x) or ∃x ∈ ∆x : ϕ(x) where x appears in ϕ(x). ϕ1 =⇒ ϕ2 is equivalent to ¬ϕ1 ∨ ϕ2, and ϕ1 ⇔ ϕ2 is equivalent to (ϕ1 =⇒ ϕ2) ∧ (ϕ2 =⇒ ϕ1).

• A free logvar in a formula is a logvar that is not quantified.

• An instance of a formula ϕ is obtained by replacing every free logvar x in ϕ by one of the entities in ∆x.

• Applying a substitution θ = {〈x1, . . . , xk〉/〈t1, . . . , tk〉} to a formula ϕ (written as ϕθ) replaces each xi in ϕ with ti.

• A sentence is a formula with all logvars quantified.

• A clausal formula (aka disjunctive formula) is a formula that has no conjunctions.

• A conjunctive formula is a formula that has no disjunctions.

• A theory is a set of sentences.

• A clausal theory only contains clausal sentences.

Example 18. Consider three types of entities: people, cities, and countries. Let ∆p = {Alex, Sam, Mehran} be the population of people, ∆c = {Vancouver, Tehran, Venice} be the population of cities, and ∆n = {Canada, Germany} be the population of countries. LocatedIn(c, n), BornInCity(p, c), BornInCountry(p, n), and LikesHockey(p) are four atoms.
The first three atoms are binary atoms as they each have two logvars, while the last one is a unary atom as it has only one logvar. Below are two examples of sentences:

∀p ∈ ∆p, c ∈ ∆c, n ∈ ∆n : BornInCity(p, c) ∧ LocatedIn(c, n) =⇒ BornInCountry(p, n)

∀p ∈ ∆p : BornInCountry(p, Canada) =⇒ LikesHockey(p)

The first sentence states that if a person p is born in a city c and the city c is located in country n, then p is born in country n. The second sentence states that every person who has been born in Canada likes hockey. Since ϕ1 =⇒ ϕ2 is equivalent to ¬ϕ1 ∨ ϕ2, both of the above sentences are clausal. The above two sentences can be considered as a theory, in which case the theory is clausal.

2.4 Weighted model count

Given a theory T over entities E having atoms A, and a world Ω of truth value assignments to G(A, E), Ω is a model of T, written Ω |= T, if, given its truth value assignments, all sentences in T evaluate to True. Model counting refers to counting the number of truth value assignments to G(A, E) that are models.

Example 19. Consider two types of entities: people and movies. Let ∆p = {Alice, Bob, Chris} be the population of people and ∆m = {Zootopia, Her} be the population of movies. Let Animation(m), Likes(p, m), and Teenager(p) be three atoms. Consider a theory with only one sentence:

∀p ∈ ∆p, m ∈ ∆m : Teenager(p) ∧ Animation(m) =⇒ Likes(p, m)

stating that every teenager likes every animation movie. Likes(Alice, Zootopia) is an example of a ground atom.

Table 2.1: Examples of truth assignments to the ground atoms in Example 19 that are and that are not models. T stands for True and F stands for False. Columns: Teenager(Alice), Teenager(Bob), Teenager(Chris) | Animation(Zootopia), Animation(Her) | Likes(Alice, Zootopia), Likes(Bob, Zootopia), Likes(Chris, Zootopia) | Likes(Alice, Her), Likes(Bob, Her), Likes(Chris, Her) | Model?

Ω1:  T T F | T F | T T T | F T T | Yes
Ω2:  T F T | T F | F T T | F T F | No
Ω3:  F F F | T F | T T T | T T T | Yes
Ω4:  T T T | T F | T T T | T T T | Yes
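The set of groundings G(A, E) for this example can be enumerated mechanically; a minimal sketch (predicate and population names follow Example 19):

```python
from itertools import product

# Populations and typed atoms from Example 19.
people = ["Alice", "Bob", "Chris"]
movies = ["Zootopia", "Her"]
atoms = {"Teenager": [people], "Animation": [movies], "Likes": [people, movies]}

# G(A, E): every way of instantiating each atom's logvars with entities.
groundings = [(pred, args)
              for pred, types in atoms.items()
              for args in product(*types)]

# 3 Teenager + 2 Animation + 3*2 Likes = 11 ground atoms; a world assigns
# True/False to each of them, so there are 2**11 possible worlds.
n_worlds = 2 ** len(groundings)
```

Enumerating the worlds that satisfy the theory (as Table 2.1 does for four of them) is then a matter of checking the sentence against each assignment.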
A possible world for the three atoms and the entities assigns a truth value to all ground atoms, including Likes(Alice, Zootopia), Likes(Alice, Her), Likes(Bob, Zootopia), . . . , Animation(Her), . . . , Teenager(Chris). If the value assignments in the possible world are such that all teenagers like all animation movies, then the possible world is a model of the theory.

Table 2.1 represents four possible worlds and whether or not they are models of this theory. Ω1 in Table 2.1 is a model because both Alice and Bob (i.e., the teenagers) like all animation movies (only Zootopia in this case). Ω2 is not a model as Alice does not like Zootopia. Ω3 is a model as there are no teenagers, so the sentence "all teenagers like all animations" holds. Ω4 is also a model since all teenagers like all animation movies. The model count of the theory is the number of Ωi that are models of the theory.

Next, we describe the weighted model count of a theory. Let:

• T be a theory,
• F(T) be the set of predicate symbols in theory T,
• Φ : F(T) → R and Φ̄ : F(T) → R be two functions that map each predicate S to weights,
• A represent the atoms in T,
• E represent the set of entities,
• Ω represent a truth value assignment to G(A, E),
• κTrue be the set of groundings assigned True in Ω,
• and κFalse the set of groundings assigned False.

The weight of Ω is given by:

w(Ω) = ∏_{S(C1,...,Ck) ∈ κTrue} Φ(S) ∗ ∏_{S(C1,...,Ck) ∈ κFalse} Φ̄(S)

The weighted model count (WMC) of the theory given Φ and Φ̄ is:

WMC(T | Φ, Φ̄) = Σ_{Ω |= T} w(Ω)    (2.6)

where Ω |= T ranges over all truth value assignments Ω to G(A, E) that are models of T.

Example 20. Consider a theory T having only one clausal sentence:

∀x ∈ ∆x : ¬Smokes(x) ∨ Cancer(x)

and assume ∆x = {A, B}.
For this theory:

F(T) = {Smokes, Cancer}

Let Φ and Φ̄ be two functions from F(T) to weights such that:

Φ(Smokes) = 0.2, Φ(Cancer) = 0.8,
Φ̄(Smokes) = 0.5, Φ̄(Cancer) = 1.2

The value assignment:

Smokes(A) = True, Smokes(B) = False, Cancer(A) = True, Cancer(B) = True

is a model. The weight of this model is 0.2 ∗ 0.5 ∗ 0.8 ∗ 0.8. Besides the above model, this theory has eight other models (i.e., a total of nine models). The WMC can be calculated by summing the weights of all nine models. Table 2.2 shows all sixteen truth assignments, the nine that are models, and their weights.

Table 2.2: Truth assignments to the ground atoms in Example 20, whether they are models, and the weights of the models. T stands for True and F stands for False.

       Smokes(A)  Smokes(B)  Cancer(A)  Cancer(B)  Model?  Weight
Ω1     T          T          T          T          Yes     0.2 ∗ 0.2 ∗ 0.8 ∗ 0.8
Ω2     T          T          T          F          No      -
Ω3     T          T          F          T          No      -
Ω4     T          T          F          F          No      -
Ω5     T          F          T          T          Yes     0.2 ∗ 0.5 ∗ 0.8 ∗ 0.8
Ω6     T          F          T          F          Yes     0.2 ∗ 0.5 ∗ 0.8 ∗ 1.2
Ω7     T          F          F          T          No      -
Ω8     T          F          F          F          No      -
Ω9     F          T          T          T          Yes     0.5 ∗ 0.2 ∗ 0.8 ∗ 0.8
Ω10    F          T          T          F          No      -
Ω11    F          T          F          T          Yes     0.5 ∗ 0.2 ∗ 1.2 ∗ 0.8
Ω12    F          T          F          F          No      -
Ω13    F          F          T          T          Yes     0.5 ∗ 0.5 ∗ 0.8 ∗ 0.8
Ω14    F          F          T          F          Yes     0.5 ∗ 0.5 ∗ 0.8 ∗ 1.2
Ω15    F          F          F          T          Yes     0.5 ∗ 0.5 ∗ 1.2 ∗ 0.8
Ω16    F          F          F          F          Yes     0.5 ∗ 0.5 ∗ 1.2 ∗ 1.2

2.5 Logic programs (LP)

A rule is of the form:

H(x) :- ϕ

where H(x) is an atom representing the head and ϕ is a conjunctive formula representing the body of the rule. ":-" is read "if". A fact is a rule whose body ϕ is True. A logic program (LP) is a set of rules. Every ground atom in an LP that is not implied to be True is assumed to be False. This assumption is known as the closed-world assumption (CWA). Every logvar appearing in ϕ but not in H(x) is assumed to be existentially quantified in the scope of the body. For instance, a rule S :- R(x) is interpreted as S :- (∃x : R(x)), meaning S is True if R is True for at least one entity. Under certain conditions, an LP can be converted into a FOL theory using Clark's completion [35] (cf. [176] for more detail and examples).
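The consequences of an LP under the CWA can be computed by forward chaining to a fixpoint. The sketch below is a minimal illustration (not the reasoner used later in this dissertation) that grounds and runs the parent/grandparent program of Example 21 below:

```python
# A minimal forward-chaining sketch for ground logic programs under the CWA.
# Rules are (head, body) pairs of ground atoms; facts have empty bodies.

facts = [("Father", ("Alex", "Shannon")),
         ("Father", ("Ben", "Alex")),
         ("Mother", ("Shannon", "Emma"))]

people = {p for _, args in facts for p in args}

def ground_rules():
    rules = [(f, []) for f in facts]
    for x in people:
        for y in people:
            rules.append((("Parent", (x, y)), [("Father", (x, y))]))
            rules.append((("Parent", (x, y)), [("Mother", (x, y))]))
            for z in people:
                rules.append((("Grandparent", (x, y)),
                              [("Parent", (x, z)), ("Parent", (z, y))]))
    return rules

def closure(rules):
    # Repeatedly fire rules whose bodies hold until nothing new is derived;
    # everything not derived is False by the closed-world assumption.
    true = set()
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in true and all(b in true for b in body):
                true.add(head)
                changed = True
    return true

derived = closure(ground_rules())
```

Under this sketch, Grandparent(Alex, Emma) is derived while Grandparent(Ben, Emma) is not, matching the reading of the example that follows.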
Example 21. Consider the following LP:

Father(Alex, Shannon)
Father(Ben, Alex)
Mother(Shannon, Emma)
Parent(x, y) :- Father(x, y)
Parent(x, y) :- Mother(x, y)
Grandparent(x, y) :- Parent(x, z) ∧ Parent(z, y)

The first three rules are facts as their bodies are True. From the above LP, it can be inferred that Parent(Alex, Shannon), Parent(Ben, Alex), and Parent(Shannon, Emma). Furthermore, since Parent(Alex, Shannon) and Parent(Shannon, Emma), it can be inferred that Grandparent(Alex, Emma). Also, since Parent(Ben, Alex) and Parent(Alex, Shannon), it can be inferred that Grandparent(Ben, Shannon). According to this LP, Grandparent(Ben, Emma) is False as it cannot be inferred from the program.

2.6 Representing relational data

Let E represent the set of entities and R represent the set of relations. We represent relational data as a set of triples. A triple is represented as (H, R, T), where H ∈ E is the head, R ∈ R is the relation, and T ∈ E is the tail of the triple. A triple (H, R, T) corresponds to a ground atom R(H, T) in FOL.

For a particular domain, let ζ represent the set of all triples that are true and ζ′ represent the ones that are false. Below are example triples in ζ:

(Paris, IsCapitalOf, France)
(JustinTrudeau, 23rdPrimeMinister, Canada)
(MichaelJackson, HasAlbum, Thriller)

In FOL notation, the above triples are written as:

IsCapitalOf(Paris, France)
23rdPrimeMinister(JustinTrudeau, Canada)
HasAlbum(MichaelJackson, Thriller)

Below are some example triples in ζ′:

(Paris, IsCapitalOf, Germany)
(BarackObama, 23rdPrimeMinister, Canada)
(Metallica, HasAlbum, Thriller)

In FOL notation, the above triples are written as:

IsCapitalOf(Paris, Germany)
23rdPrimeMinister(BarackObama, Canada)
HasAlbum(Metallica, Thriller)

2.7 Properties of relations

Definition 1. A relation R is reflexive on a set E′ ⊂ E of entities if:

(E, R, E) ∈ ζ

for every entity E ∈ E′. In FOL notation, this property can be stated as ∀e ∈ E′ : R(e, e).

Definition 2.
A relation R is symmetric on a set E′ ⊂ E of entities if:

(E1, R, E2) ∈ ζ ⇔ (E2, R, E1) ∈ ζ

for any two entities E1, E2 ∈ E′. A relation R is anti-symmetric if:

(E1, R, E2) ∈ ζ ⇔ (E2, R, E1) ∈ ζ′

In FOL notation, symmetry of a relation can be stated as ∀e1, e2 ∈ E′ : R(e1, e2) ⇔ R(e2, e1), and anti-symmetry can be stated as ∀e1, e2 ∈ E′ : R(e1, e2) ⇔ ¬R(e2, e1).

Definition 3. A relation R is transitive on a set E′ ⊂ E of entities if:

(E1, R, E2) ∈ ζ ∧ (E2, R, E3) ∈ ζ ⇒ (E1, R, E3) ∈ ζ

for any three entities E1, E2, E3 ∈ E′. In FOL notation, this property can be stated as ∀e1, e2, e3 ∈ E′ : R(e1, e2) ∧ R(e2, e3) ⇒ R(e1, e3).

Definition 4. For any relation R, we denote its inverse as R−1 and define the inverse as:

(H, R, T) ∈ ζ ⇔ (T, R−1, H) ∈ ζ

for every H and T in E. In FOL notation, the inverse of a relation R is defined as ∀x ∈ ∆x, y ∈ ∆y : R(x, y) ⇔ R−1(y, x).

Example 22. Consider three relations: Knows, MarriedTo, and TallerThan. Knows is reflexive as every person knows themselves. MarriedTo is symmetric because if X is married to Y, then Y is married to X. TallerThan is anti-symmetric because if X is taller than Y, then Y is not taller than X. TallerThan is also transitive because if X is taller than Y and Y is taller than Z, then X is taller than Z.

Example 23. Consider a relation IsCapitalOf and some examples of triples in ζ having this relation:

(Paris, IsCapitalOf, France)
(Tehran, IsCapitalOf, Iran)

By definition, for the inverse relation IsCapitalOf−1, the following triples are in ζ:

(France, IsCapitalOf−1, Paris)
(Iran, IsCapitalOf−1, Tehran)

Chapter 3

An Overview of Relational Models

In this chapter, we describe two paradigms for relational models: tensor factorization models and lifted relational models. For each paradigm, we describe selected models.

3.1 Tensor factorization models

Tensor factorization models have been successfully applied to link prediction in relational data [167].
In this section, we give an overview of tensor factorization models and introduce some models in this paradigm.

Definition 5. An embedding is a function from an entity or a relation to one or more vectors or matrices of numbers. We represent the embedding associated with an entity E by Emb(E) and the embedding associated with a relation R by Emb(R).

Definition 6. A tensor factorization model defines two things: 1- the embedding functions for entities and relations, and 2- a function f taking H, R and T as input and generating a similarity (or sometimes a dissimilarity) score for whether (H, R, T) holds (i.e., is in ζ) or not, based on the embeddings for the input entities and the input relation. The values of the embeddings are learned using the triples in the data.

Let v, w and x be vectors of length k. In the rest of the dissertation, we define 〈v, w, x〉 to be the sum of the element-wise product of the three vectors. That is:

〈v, w, x〉 := Σ_{j=1}^{k} v_j ∗ w_j ∗ x_j

where v_j, w_j, and x_j represent the j-th elements of v, w and x respectively. We also define [v; w] to be a vector of size 2 ∗ k corresponding to the concatenation of the two vectors v and w.

Over the past few years, many tensor factorization models have been developed [24, 56, 82, 161, 163, 168, 199, 208, 232]. We describe three general categories of tensor factorization approaches and some selected models in each category.

3.1.1 Translational approaches

Translational approaches define additive functions over embeddings. TransE [24] is one of the first such approaches. In TransE, the embedding for both entities and relations is a vector of size k. Let v(E) and v(R) represent the embedding vectors of an entity E and a relation R respectively. The dissimilarity function used in TransE for a triple (H, R, T) is:

dissim(H, R, T) = ||v(H) + v(R) − v(T)||_i    (3.1)

where ||v||_i represents norm i of vector v (i is typically 1 or 2).
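Equation 3.1 can be sketched directly in code. The sketch below is illustrative only: the two-dimensional embedding values are made up by hand (not learned), chosen so that one triple fits the translational assumption and another does not.

```python
# A minimal sketch of TransE's dissimilarity (Equation 3.1).
# The toy embeddings below are illustrative, not learned.

def transe_dissim(v_h, v_r, v_t, norm=1):
    residual = [h + r - t for h, r, t in zip(v_h, v_r, v_t)]
    if norm == 1:
        return sum(abs(x) for x in residual)
    return sum(x * x for x in residual) ** 0.5

# TransE aims at v(H) + v(R) ≈ v(T) for true triples, so a well-fit triple
# scores near zero and a poorly-fit one scores higher.
v_paris = [0.9, 0.1]
v_is_capital_of = [0.1, 0.8]
v_france = [1.0, 0.9]           # ≈ v_paris + v_is_capital_of
v_germany = [0.2, 0.3]          # far from v_paris + v_is_capital_of

good = transe_dissim(v_paris, v_is_capital_of, v_france)
bad = transe_dissim(v_paris, v_is_capital_of, v_germany)
```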
The smaller the value of the dissim function, the more probable it is that the triple (H, R, T) is correct. An alternative way of looking at TransE is that it aims at learning its embedding vectors in a way that v(H) + v(R) ≈ v(T) for any triple (H, R, T) in the data.

Realizing that TransE is quite limited in what it can model, several works have extended the idea behind TransE. Most of these works consider extra vectors/matrices in the embeddings of the relations, project the entity embeddings into relation-specific spaces using the extra vectors/matrices, and then apply TransE. As an example, STransE [161] considers Emb(R) to be a vector v(R) plus two matrices M_h(R) and M_t(R). It defines its dissimilarity function as:

dissim(H, R, T) = ||M_h(R)v(H) + v(R) − M_t(R)v(T)||_i    (3.2)

STransE uses the matrices to project entity vectors into a new space (projecting head and tail using different matrices), and then applies TransE. Attempts have also been made to relax the dissimilarity function. Instead of enforcing v(H) + v(R) ≈ v(T) as in TransE, FTransE [68] enforces:

v(H) + v(R) ≈ α v(T)    (3.3)

for some real number α. According to FTransE, only the direction of v(H) + v(R) matters, not its size. Note that FTransE and STransE (or other related works) can be combined to create the function:

M_h(R)v(H) + v(R) ≈ α M_t(R)v(T)    (3.4)

In the rest of the dissertation, we call this model FSTransE.

3.1.2 Deep learning approaches

Deep learning approaches generally use neural networks to learn how the embeddings interact. E-MLP [199] considers the embeddings for entities to be vectors v of size k, and for relations to be a matrix M of size 2k ∗ m plus a vector v of size m. To make a prediction about a triple (H, R, T), E-MLP concatenates the vectors of the entities to create the vector [v(H); v(T)] and feeds it into a two-layer neural network whose weights for the first layer are the matrix M(R) and for the second layer are v(R).
E-MLP requires (2k + 1) ∗ m parameters for each relation.

ER-MLP [60], the approach used in Google's Knowledge Vault, considers the embeddings for both entities and relations to be single vectors. To make a prediction about a triple (H, R, T), it concatenates the vectors of the entities and the relation, creating the vector [v(H); v(R); v(T)], and then feeds it through a two-layer neural network.

In Santoro et al. [188], once the entity vectors are provided by the convolutional neural network and the relation vector is provided by the long short-term memory network, for each triple the vectors are concatenated similarly to ER-MLP and are fed into a four-layer neural network.

3.1.3 Multiplicative approaches

Multiplicative approaches define product-based functions over embeddings. DistMult [232], one of the simplest multiplicative approaches, considers the embeddings for each entity and each relation to be a single vector of size k (v(E) and v(R)) and defines its similarity function as:

sim(H, R, T) = 〈v(H), v(R), v(T)〉    (3.5)

Since DistMult does not distinguish between head and tail entities, it can only model symmetric relations. That is, by using DistMult, one implicitly makes the assumption that (H, R, T) ∈ ζ ⇔ (T, R, H) ∈ ζ.

One of the first tensor factorization approaches is the canonical polyadic (CP) decomposition of Hitchcock [90]. CP learns one embedding vector for each relation R (denoted v(R)) and two embedding vectors for each entity E: one to be used when the entity is the head (denoted h(E)) and one to be used when the entity is the tail (denoted t(E)) of the relation. The similarity function of CP is defined as:

sim(E1, R, E2) = 〈h(E1), v(R), t(E2)〉    (3.6)

ComplEx [208] considers the embedding of each entity and each relation to be a vector of size k of complex (instead of real) numbers.
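Before detailing ComplEx, Equations 3.5 and 3.6 can be sketched in code. The toy embedding values below are hand-picked for illustration (not learned); they show that DistMult necessarily scores (H, R, T) and (T, R, H) identically, while CP's separate head and tail vectors generally do not:

```python
# A toy sketch of the DistMult (Equation 3.5) and CP (Equation 3.6) scores.
# Embedding values are illustrative, not learned.

def tri_dot(v, w, x):
    # <v, w, x>: the sum of the element-wise product of three vectors.
    return sum(a * b * c for a, b, c in zip(v, w, x))

# DistMult: one vector per entity and per relation.
v = {"H": [1.0, 2.0], "T": [3.0, 1.0], "R": [1.0, 1.0]}
distmult = lambda h, r, t: tri_dot(v[h], v[r], v[t])

# CP: one relation vector, plus separate head and tail vectors per entity.
h_emb = {"H": [1.0, 2.0], "T": [3.0, 1.0]}
t_emb = {"H": [2.0, 0.0], "T": [0.0, 1.0]}
r_emb = {"R": [1.0, 1.0]}
cp = lambda h, r, t: tri_dot(h_emb[h], r_emb[r], t_emb[t])

# DistMult cannot distinguish (H, R, T) from (T, R, H): the scores coincide.
d_fwd, d_bwd = distmult("H", "R", "T"), distmult("T", "R", "H")

# CP's independent head/tail vectors generally give different scores.
c_fwd, c_bwd = cp("H", "R", "T"), cp("T", "R", "H")
```

The DistMult symmetry holds for any choice of vectors, since 〈v, w, x〉 is invariant to swapping v and x; the CP asymmetry is what allows it (and SimplE, later) to model non-symmetric relations.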
For each entity E (and similarly for each relation R), Emb(E) can be represented as re(E) + im(E)i, where re and im are two vectors of size k and i is the square root of −1. According to ComplEx:

sim(H, R, T) = Real( Σ_{j=1}^{k} (re_j(H) + im_j(H)i) ∗ (re_j(R) + im_j(R)i) ∗ (re_j(T) − im_j(T)i) )    (3.7)

where Real(α + βi) = α. By changing the sign of the im vector for the tail entity (corresponding to taking the conjugate of a complex number), ComplEx distinguishes between head and tail entities and allows for modeling asymmetric relations. One can easily verify that the function used by ComplEx can be expanded and written as:

sim(H, R, T) = 〈re(H), re(R), re(T)〉 + 〈re(H), im(R), im(T)〉 + 〈im(H), re(R), im(T)〉 − 〈im(H), im(R), re(T)〉    (3.8)

HolE [168] is another well-known multiplicative model, but we do not describe it here as Hayashi and Shimbo [86] show that HolE is isomorphic to ComplEx. RESCAL [164, 165] can also be considered a multiplicative approach. RESCAL considers entity embeddings to be vectors of size k and relation embeddings to be vectors of size k ∗ k. The similarity function used by RESCAL is the dot product of the relation vector with the outer product of the entity vectors.

3.2 Lifted relational models

In this section, we briefly describe several lifted relational models that are used throughout this dissertation.

3.2.1 Markov logic networks

Markov logic networks (MLNs) [58, 183] are among the most well-known and widely used lifted relational models. An MLN consists of a set of hard constraints and weighted formulae (WFs). A hard constraint is a formula ϕ, and a WF is a pair 〈w, ϕ〉 where w is a weight and ϕ is a formula. These hard constraints and weighted formulae define a joint distribution over the groundings of the atoms in the MLN.

Let M be an MLN consisting of hard constraints Ψ and WFs ψ over atoms A. Let E represent the entities. A world Ω for M is a truth value assignment to G(A, E).
The probability of any world violating a hard constraint is zero. The probability of a world Ω satisfying all hard constraints is proportional to the exponential of the sum of the weights of the instances of the formulae in ψ that are True in the world:

Prob(Ω) = (1/Z) ∏_{〈w,ϕ〉∈ψ} exp(η(ϕ, Ω) ∗ w) (3.9)

where η(ϕ, Ω) represents the number of instances of ϕ that hold according to Ω, and

Z = ∑_{Ω′} ∏_{〈w,ϕ〉∈ψ} exp(η(ϕ, Ω′) ∗ w)

where ∑_{Ω′} sums over all possible worlds satisfying the hard constraints, is the partition (normalization) function. Since the populations are assumed to be finite, ∑_{Ω′} sums over a finite number of possible worlds.

Example 24. Consider the problem of predicting returns in an online shopping website. An MLN for this problem may contain the following WF:

〈w, Basket(c, b) ∧ InBasket(i1, b) ∧ InBasket(i2, b) ∧ SameType(i1, i2) ∧ DiffSize(i1, i2) ∧ ¬Returns(c, i1) =⇒ Returns(c, i2)〉

where w > 0. The formula increases the probability of all worlds except those in which two items of the same type and different sizes are both not returned by a customer.

Example 25. An MLN for Example 2 may use the information in the friendship and brothers tables by using the following WFs:

〈w1, Asthma(x) =⇒ Cough(x)〉
〈w2, Smokes(x) =⇒ Cough(x)〉
〈w3, Smokes(x) ∧ Friends(x, y) =⇒ Smokes(y)〉
〈w4, Asthma(x) ∧ Family(x, y) =⇒ Asthma(y)〉
〈w5, Friends(x, y) ∧ Friends(y, z) =⇒ Friends(x, z)〉

Figure 3.1: An RLR model with atoms Friend(y, x), Kind(y), and Happy(x), and weighted formulae WF1 = 〈−4.5, True〉 and WF2 = 〈1, Friend(y, x) ∧ Kind(y)〉.

3.2.2 Relational logistic regression

Let Q(x) be an atom whose probability depends on a set A (not containing Q) of atoms, ψ be a set of WFs containing only atoms from A, E be a set of entities, Î be an assignment of truth values to all groundings in G(A, E), X be an assignment of entities to x, and θ = {x/X}.
Relational logistic regression (RLR) defines the probability of Q(X) given Î as follows:

sum = ∑_{〈w,ϕ〉∈ψ} w ∗ η(ϕθ, Î) (3.10)

Prob(Q(X) = True | Î) = sigmoid(sum) (3.11)

where η(ϕθ, Î) is the number of instances of ϕθ that are True with respect to Î. A WF whose formula is True can be considered a bias; note that η(True, Î) = 1. Following [111], we assume the formulae of all WFs for RLR models are conjunctive, as restricting the formulae of RLR to conjunctive formulae does not limit its representational power. As shown by [178], when all atoms in A are observed, RLR is identical to MLNs. In the rest of the dissertation, we write RLR/MLN when the two models are identical.

Example 26. Consider the lifted relational model in Fig 3.1. For this model, A = {Friend(x, y), Kind(y)} represents the parents of the Happy predicate. Let Î be an assignment of truth values to G(A). RLR/MLN sums over ψ = {WF1, WF2}, resulting in:

Prob(Happy(X) = True | Î) = sigmoid(−4.5 + 1 ∗ η(Friend(y, X) ∧ Kind(y), Î))

where η(Friend(y, X) ∧ Kind(y), Î) represents the number of entities in ∆y for which Friend(y, X) ∧ Kind(y) holds according to Î, corresponding to the number of friends of X that are kind, and WF1 is a bias that evaluates to −4.5. When this count is greater than or equal to 5, the probability of X being happy is closer to one than to zero; otherwise, the probability is closer to zero than to one. Therefore, the two WFs model "someone is happy if they have at least 5 friends that are kind".

Continuous atoms: RLR was initially designed for Boolean or multi-valued parents. If True is associated with 1 and False with 0, we can substitute ∧ in WFs with ∗ and then allow continuous atoms in WFs. For example, if for some X ∈ ∆x we have R(X) = True, S(X) = False and T(X) = 0.2, then 〈w, R(X) ∗ S(X)〉 evaluates to 0, 〈w, R(X) ∗ ¬S(X)〉 evaluates to w, and 〈w, R(X) ∗ ¬S(X) ∗ T(X)〉 evaluates to 0.2 ∗ w.

Example 27.
For Example 3, an RLR model defining the conditional probability of whether a student will get an A in a course (Grade(s, c, A)) may have the following WFs:

〈w0, True〉
〈w1, Intelligent(s)〉
〈w2, Easy(c)〉
〈w3, Intelligent(s) ∗ Easy(c)〉

where Intelligent and Easy are continuous latent properties of students and courses respectively, whose values are learned from data during training. If the intelligence of a student S is learned to be 0.7 and the easiness of a course C is learned to be 0.1, then:

Prob(Grade(S, C, A)) = sigmoid(w0 + w1 ∗ 0.7 + w2 ∗ 0.1 + w3 ∗ 0.07)

3.2.3 Parfactor graphs

A parametric factor, or parfactor [179], is of the form 〈A, ρ〉 where A is a set of atoms and ρ is a factor: a function from assignments of values to the atoms in A to non-negative real numbers. As an example,

〈{Friend(x, y), Adult(x), Adult(y)}, ρ〉

is a parfactor having three atoms, two logvars typed with the same population, and a function ρ mapping every assignment of values to the three atoms to a non-negative real number; ρ may be as in Table 3.1. A probability model can be defined by a set of parfactors.

Table 3.1: An example of a parfactor over two atoms: Friend and Adult.

Adult(x)  Friend(x, y)  Adult(y)  Value
True      True          True      1.1
True      True          False     0.1
True      False         True      0.3
True      False         False     0.3
False     True          True      0.1
False     True          False     0.9
False     False         True      0.3
False     False         False     0.3

A parfactor graph consists of a set of parfactors as the nodes of the graph, with an arc between two parfactors par1 and par2 if an atom in par1 unifies with an atom in par2. Fig 3.2 shows several examples of parfactor graphs.

By copying each parfactor in a parfactor graph for each grounding of its atoms, one obtains a Markov network. The semantics of a parfactor graph is the semantics of the Markov network that results from grounding the atoms.

A parfactor graph can be converted into an MLN. Consider a parfactor graph with m parfactors par1, . . .
, parm, and let ρ1, . . . , ρm be the functions of these parfactors. To convert the parfactor graph into an MLN, a WF is added to the MLN for each assignment of values to the atoms in each pari, using the corresponding value from ρi. The formula of this WF comes from the assignment of values to the atoms (connected with conjunctions), and the weight of the WF is the log of the corresponding ρi value. As an example, for the first row of Table 3.1, the following WF is added to the MLN:

〈log(1.1), Adult(x) ∧ Friend(x, y) ∧ Adult(y)〉

3.2.4 Problog

Problog [49] is a probabilistic logic programming language in which a probability is assigned to each fact. The ground instances of the facts are assumed to be probabilistically independent. These probabilities define a probability distribution over (non-probabilistic) logic programs. The probability of a ground atom being True in a Problog program is the sum of the probabilities of the logic programs that entail the ground atom.

Figure 3.2: Examples of parfactor graphs.

Example 28. Consider the following Problog program:

0.8 :: Father(Alex, Shannon)
0.4 :: Father(Ben, Alex)
1.0 :: Mother(Shannon, Emma)
Parent(x, y) :- Father(x, y)
Parent(x, y) :- Mother(x, y)
Grandparent(x, y) :- Parent(x, z) ∧ Parent(z, y)

This program is similar to the LP in Example 21 except that each fact has a probability associated with it. The above Problog program defines a distribution over four LPs.
The following LP has a probability of 0.8 ∗ 0.4:

Father(Alex, Shannon)
Father(Ben, Alex)
Mother(Shannon, Emma)
Parent(x, y) :- Father(x, y)
Parent(x, y) :- Mother(x, y)
Grandparent(x, y) :- Parent(x, z) ∧ Parent(z, y)

The following LP has a probability of (1 − 0.8) ∗ 0.4:

Father(Ben, Alex)
Mother(Shannon, Emma)
Parent(x, y) :- Father(x, y)
Parent(x, y) :- Mother(x, y)
Grandparent(x, y) :- Parent(x, z) ∧ Parent(z, y)

The following LP has a probability of 0.8 ∗ (1 − 0.4):

Father(Alex, Shannon)
Mother(Shannon, Emma)
Parent(x, y) :- Father(x, y)
Parent(x, y) :- Mother(x, y)
Grandparent(x, y) :- Parent(x, z) ∧ Parent(z, y)

And the following LP has a probability of (1 − 0.8) ∗ (1 − 0.4):

Mother(Shannon, Emma)
Parent(x, y) :- Father(x, y)
Parent(x, y) :- Mother(x, y)
Grandparent(x, y) :- Parent(x, z) ∧ Parent(z, y)

Grandparent(Alex, Emma) can be inferred from the first and third LPs, but not from the second and fourth. Therefore, the probability of Grandparent(Alex, Emma) according to the above Problog program is 0.8 ∗ 0.4 + 0.8 ∗ (1 − 0.4) = 0.8.

Figure 3.3: An example of a relational probability tree for predicting whether a person is young or not.

3.2.5 Relational probability trees

Relational probability trees (RPTs) [160] are decision trees whose decision nodes may include thresholds on explicit aggregation functions such as count, average, and proportion. Note that an existential quantifier can be modeled using a threshold on a count. In the online shopping website, for instance, a node may split the customers based on whether or not they have returned more than 10 items so far.

Example 29. The tree in Fig 3.3 is an example of an RPT model for predicting whether a person is young or not. The root of the model checks whether the target person is a student.
If the target person is a student, the model then checks whether there exists someone to whom the target person is married. If so, the model outputs 0.2 as the probability of the target person being young, and 0.9 otherwise. If the target person is not a student, the model first checks whether they have rated more than 20 action movies. If the target person has rated at most 20 action movies, the model outputs 0.02. If they have rated more than 20 action movies, the model checks whether the mean of those ratings is more than 8, outputting 0.85 if it is and 0.45 otherwise.

3.2.6 RDN-Boost

An RDN-Boost model [155] consists of several relational probability trees. During learning, each new tree is learned to boost the performance of the previous trees. When making predictions, all learned trees are used and the average of their outputs is reported as the final prediction.

Chapter 4

Link Prediction

Entities typically participate in relationships with other entities. Consider a movie rating system where the entities are movies, users, actors, and directors. In such a system, there may be relationships such as Liked or Watched between users and movies, relationships such as Acted or Starred between actors and movies, and relationships such as Colleague or Friend between actors and directors. The aim of link prediction is to predict new links (or relationships) between these entities given the existing ones.

Several link prediction approaches have been developed for relational data, among which tensor factorization approaches are among the top performing [167]. Trouillon et al.
[209] observed that canonical polyadic (CP) decomposition [90] (described in Chapter 3) performs poorly for link prediction on the current test sets because the two vectors for each entity are learned independently: observing that an entity participates in a relation as the head does not affect the vector corresponding to the case where the entity is the tail, and vice versa. In this chapter, we develop a tensor factorization approach based on CP decomposition that addresses the independence between the two embedding vectors of each entity. Due to the simplicity of our model, we call it SimplE.

We show that SimplE: 1- learns interpretable embeddings, 2- is fully expressive, 3- is capable of encoding expert knowledge into its embeddings through parameter sharing (aka weight tying), and 4- performs very well empirically despite its simplicity. We also discuss several disadvantages of other existing approaches. For instance, we prove that the existing translational approaches are not fully expressive, and we identify severe restrictions on what they can represent. We also show that the function used in ComplEx [208, 209], one of the state-of-the-art approaches, involves redundant computations. The results presented in this chapter have been published in Kazemi and Poole [110].

4.1 Properties of tensor factorization models

We discuss full expressiveness, time complexity and parameter growth, interpretability, and computational redundancy of several approaches.

4.1.1 Full expressiveness

Full expressiveness is an important property of any model and has recently been the subject of study for several relational models (see, e.g., [27, 209, 227]).

Definition 7.
A tensor factorization model is fully expressive if, given any ground truth (full assignment of truth values to all triples), there exists an assignment of values to the embeddings of the entities and relations that accurately separates the correct triples from the incorrect ones.

We mentioned earlier that DistMult is not fully expressive as it forces relations to be symmetric. Trouillon et al. [209] prove that ComplEx is fully expressive. In particular, they prove that any ground truth can be modeled in ComplEx with embeddings of length at most |E| ∗ |R|. According to the universal approximation theorem [41, 93], under certain conditions, neural networks are universal approximators of continuous functions over compact sets. Therefore, we would expect there to be a representation based on neural networks that can approximate any ground truth, but the number of hidden units might have to grow with the number of triples. Wang et al. [227] prove that TransE is not fully expressive. The following proposition shows that not only TransE but also several other translational approaches are not fully expressive, and it identifies some restrictions of these models. The proof of the proposition can be found in Appendix A.

Proposition 1. FSTransE is not fully expressive and has the following restrictions:

1. If a relation R is reflexive on a subset ∆ of entities, R must also be symmetric on ∆.
2. If a relation R is reflexive on a subset ∆ of entities, R must also be transitive on ∆.
3. If an entity E1 has relation R with every entity in a subset of entities ∆ and another entity E2 has relation R with one of the entities in ∆, then E2 must have relation R with every entity in ∆.

From Proposition 1, the following corollary follows:

Corollary 1. Other variants of translational approaches, such as TransE, FTransE, STransE, TransH [222] and TransR [139], are also not fully expressive and have the restrictions mentioned in Proposition 1.
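As a minimal numeric illustration (our own sketch) of restriction 1 for the special case of plain TransE, reading a relation as holding exactly when ‖v(H) + v(R) − v(T)‖ = 0: reflexivity on an entity forces v(R) to be the zero vector, and with v(R) = 0 the distance is symmetric in head and tail, so R must be symmetric.

```python
import numpy as np

def transe_distance(vH, vR, vT):
    # TransE scores a triple by the distance || v(H) + v(R) - v(T) ||.
    return np.linalg.norm(vH + vR - vT)

vE = np.array([0.3, -1.2, 0.7])
# Reflexivity (E, R, E) with distance 0 forces v(R) = 0 ...
vR = np.zeros(3)
assert transe_distance(vE, vR, vE) == 0.0

# ... and with v(R) = 0 the distance cannot distinguish (H, R, T) from
# (T, R, H), i.e., R is forced to be symmetric (restriction 1).
vH, vT = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5, 2.0])
assert transe_distance(vH, vR, vT) == transe_distance(vT, vR, vH)
```

FSTransE adds projections and scaling, so this sketch only illustrates the flavor of the argument; the full proof is in Appendix A.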
Proposition 1 identifies severe restrictions on what relations translational approaches can represent: they cannot represent any relation lacking the properties identified. The following example describes relations that cannot be modeled using FSTransE.

Example 30. Consider the relation LessThanOrEqualTo representing whether a quantity is less than or equal to another quantity. This relation is reflexive, as any quantity is less than or equal to itself. However, it is not symmetric: for instance, 5 is less than or equal to 10 whereas 10 is not less than or equal to 5. Therefore, according to Proposition 1, FSTransE cannot model this relation.

Consider the relation Knows representing whether a person knows another person. This relation is reflexive, as any person knows themselves. However, Knows is not transitive: for instance, I know Rihanna and Rihanna knows many people, but I do not know every person Rihanna knows. Therefore, according to Proposition 1, FSTransE cannot model this relation.

Consider the relation SharesBorder representing whether two countries share a border. Iran shares a border with Turkey, Armenia, Azerbaijan, and several other countries. Greece shares a border with Turkey, but it does not share a border with Armenia, Azerbaijan, or any other country that shares a border with Iran. Therefore, according to Proposition 1, FSTransE cannot model this relation.

4.1.2 Complexity

As described in Bordes et al. [23], to scale to the size of the relational data we currently have and keep up with its growth (e.g., for knowledge graphs), a relational model must have linear time and memory complexity. Furthermore, one of the important challenges in designing tensor factorization models is the trade-off between expressiveness and model complexity. Models with many parameters usually overfit and give poor performance.
Having larger datasets often does not help such models avoid overfitting: as relational datasets grow in size, they contain more entities but often less information about each entity. For instance, in a movie rating dataset, as the dataset grows, there may be more people and more movies, but the average number of rated movies per person may not increase much (it may even decrease).

The time complexity of making a prediction using TransE is O(k), where k is the size of each embedding vector. Adding the projections as in STransE increases the time complexity to O(k²). Besides time complexity, the number of parameters to be learned from data grows quadratically with k. Quadratic time complexity and parameter growth may raise two issues: 1- scalability problems, and 2- overfitting. The same issues exist for models such as RESCAL and neural tensor networks (NTN) [199] (a model that generalizes RESCAL and E-MLP [167]), for which the time complexity of making a prediction (and the number of parameters) grows quadratically (or even faster) with k. The time complexity (and parameter growth) of DistMult and ComplEx is linear in k; therefore, they scale better and are less prone to overfitting.

4.1.3 Interpretability and redundancy

In DistMult, each parameter of the embedding vector of an entity can be considered a feature of the entity, and the corresponding parameter of a relation embedding vector can be considered a measure of how important that feature is to the relation. Such interpretability allows for incorporating expert knowledge into the embeddings to some extent. For instance, if a feature of some type of entities is given to us in the data, we can fix one of the parameters of the embedding vectors of those entities to the given feature.

For ComplEx, a portion of the computations performed to make predictions is redundant. Consider a ComplEx model with embedding vectors of size 1 (for ease of exposition).
Suppose the embedding vectors for H, R and T are [α1 + β1i], [α2 + β2i], and [α3 + β3i] respectively. Then, the score of (H, R, T) according to ComplEx is the sum of the following four terms:

1) α1α2α3   2) α1β2β3   3) β1α2β3   4) −β1β2α3

It can be seen in Table 4.1 that for any assignment of (non-zero) values to the αs and βs, at least one of the above four terms is negative. This means that for a correct triple, ComplEx uses some of the four terms to overestimate its score and then uses the other terms to cancel the overestimation.

The following example shows how this affects ComplEx's interpretability:

Example 31. Consider a ComplEx model with embeddings of size 1. Consider entities E1, E2 and E3 with embedding vectors [1 + 4i], [1 + 6i], and [3 + 2i] respectively, and a relation R with embedding vector [1 + i].

According to ComplEx, the score for (E1, R, E3) is positive, suggesting E1 probably has relation R with E3, while the score for (E2, R, E3) is negative, suggesting E2 probably does not have relation R with E3. Since the only difference between E1 and E2 is that the imaginary part changes from 4 to 6, it is difficult to associate a meaning to these numbers.
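The sign claim behind Table 4.1 is easy to check exhaustively (a small script of ours enumerating all 64 sign patterns), and the same four-term decomposition reproduces the scores in Example 31:

```python
from itertools import product

def complex_terms(a1, a2, a3, b1, b2, b3):
    # The four terms of the size-1 ComplEx score (Eq. 3.8 with k = 1);
    # subscripts 1, 2, 3 refer to the head, relation, and tail respectively.
    return (a1 * a2 * a3, a1 * b2 * b3, b1 * a2 * b3, -b1 * b2 * a3)

# For every assignment of signs to the six parameters, at least one term is
# negative (the product of the four terms is always negative for non-zero values).
assert all(
    any(t < 0 for t in complex_terms(*signs))
    for signs in product([1, -1], repeat=6)
)

# Example 31: E1 = 1+4i and E2 = 1+6i get opposite-sign scores with E3 = 3+2i.
assert sum(complex_terms(1, 1, 3, 4, 1, 2)) == 1    # (E1, R, E3): positive
assert sum(complex_terms(1, 1, 3, 6, 1, 2)) == -1   # (E2, R, E3): negative
```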
Table 4.1: Redundancy in ComplEx. The table enumerates all 64 sign combinations of α1, α2, α3, β1, β2, and β3 together with the resulting signs of the four terms α1α2α3, α1β2β3, β1α2β3, and −β1β2α3; in every row, at least one of the four terms is negative.

Such uninterpretability in ComplEx led us towards designing a simpler model. In the next section, we develop simple interpretable models that remove redundant computations and, with the same number of parameters, have half or a quarter the number of summations and multiplications of ComplEx. We show in our experiments that our models generally outperform ComplEx on standard benchmarks.

4.2 SimplE

In canonical polyadic (CP) decomposition [90], the embedding for each entity E consists of two vectors h(E) and t(E), and the embedding for each relation R is a single vector v(R).
The similarity function for a triple (E1, R, E2) is 〈h(E1), v(R), t(E2)〉. One major problem with CP is that the two embedding vectors learned for each entity are independent of each other: observing a correct triple (E1, R, E2) only updates h(E1) and t(E2), not t(E1) and h(E2).

Example 32. Consider a world with two relations: Likes(p, m) representing whether a person likes a movie, and Acted(m, a) representing who acted in which movie. Observations on Likes only update the tail vectors of movies, and observations on Acted only update their head vectors. Thus, what is learned about movies through observations on Acted does not affect the predictions about Likes, and vice versa.

To account for the independence of the two vectors for each entity in CP, we introduce SimplE. SimplE considers the embedding of entities to be similar to CP (i.e., two vectors for each entity), but considers the embedding of each relation R to be two (instead of one) vectors: one corresponding to R and one corresponding to R−1.

We consider two different similarity functions for a triple (E1, R, E2). The first function ignores the vectors corresponding to the R−1s and is defined as:

sim(E1, R, E2) = 〈h(E1), v(R), t(E2)〉 (4.1)

Since this function ignores the R−1s, we call the resulting model SimplE-ignr. Alternatively, one can consider the function:

sim(E1, R, E2) = 〈h(E2), v(R−1), t(E1)〉 (4.2)

The second similarity function takes the average of the above two scores and is defined as:

sim(E1, R, E2) = (〈h(E1), v(R), t(E2)〉 + 〈h(E2), v(R−1), t(E1)〉) / 2 (4.3)

Since this function takes the average of the two scores, we call the resulting model SimplE-avg. In both cases, the probability of two entities having a relationship can be computed by taking the sigmoid of the similarity function.

Note that in SimplE-ignr, to find the score of a triple, only one multiplication of three vectors is performed; this number is 2 for SimplE-avg and 4 for ComplEx.
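The two scoring functions above (Eqs. 4.1 and 4.3) can be sketched in a few lines (an illustrative numpy sketch; the function and argument names are ours):

```python
import numpy as np

def simple_ignr(hE1, vR, tE2):
    # Eq. 4.1: CP-style score that ignores the inverse-relation vector.
    return np.sum(hE1 * vR * tE2)

def simple_avg(hE1, tE1, hE2, tE2, vR, vR_inv):
    # Eq. 4.3: average of the forward score and the inverse-relation score,
    # so observing (E1, R, E2) now involves all four entity vectors.
    forward = np.sum(hE1 * vR * tE2)
    backward = np.sum(hE2 * vR_inv * tE1)
    return (forward + backward) / 2.0
```

The probability of the triple is then obtained by applying the sigmoid to either score.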
Thus, with the same number of parameters, SimplE-ignr and SimplE-avg reduce the computations by factors of 4 and 2 respectively compared to ComplEx.

4.2.1 An alternative perspective on SimplE

In applications where tensor factorization models are to be combined with other representations (e.g., [188, 235]), models that learn a single vector for each entity and each relation may provide better compatibility [207]. Here, we provide a single-vector perspective on SimplE-avg which shows how SimplE-avg can be viewed as a slightly modified version of DistMult.

Let H and T be two entities and R be a relation. Let v(H), v(T) and v(R) represent vectors of size 2k (i.e., vectors of even size) corresponding to the embeddings of H, T, and R respectively. Let ←→v(T) be equivalent to v(T) except that the first and second halves of the vector are swapped. In DistMult, the similarity function is defined as:

sim(H, R, T) = 〈v(H), v(R), v(T)〉 (4.4)

whereas the similarity function for SimplE-avg is defined as:

sim(H, R, T) = 〈v(H), v(R), ←→v(T)〉 (4.5)

That is, SimplE-avg can be considered a variant of DistMult in which the first and second halves of the embedding vector of the tail entity are swapped.

4.2.2 Family of Bilinear Models

Bilinear models correspond to the family of models in which the embedding for each entity E is a vector v(E) of size k, the embedding for each relation R is a matrix MR of size k × k (with certain restrictions), and the similarity function for a triple (Ei, R, Ej) is defined as v(Ei)ᵀ MR v(Ej). These models have shown remarkable performance for link prediction in knowledge graphs [167]. DistMult, ComplEx, and RESCAL are known to belong to the family of bilinear models. We show that SimplE (and CP) also belong to this family.

DistMult can be considered a bilinear model which restricts the MR matrices to be diagonal, as in Fig. 4.1(a).
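Both views can be checked numerically (our own sketch): DistMult's trilinear product equals the bilinear form with a diagonal MR, and the swapped-tail product of Eq. 4.5 recovers the sum of SimplE-avg's two terms (i.e., twice the score of Eq. 4.3, an equal model up to constant scaling):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4

# DistMult as a bilinear model: <v(H), v(R), v(T)> = v(H)^T diag(v(R)) v(T).
vH, vR, vT = (rng.standard_normal(k) for _ in range(3))
assert np.isclose(np.sum(vH * vR * vT), vH @ np.diag(vR) @ vT)

# SimplE-avg as DistMult with the tail halves swapped (Eq. 4.5).
hH, tH, hT, tT, vRf, vRi = (rng.standard_normal(k) for _ in range(6))
v_H = np.concatenate([hH, tH])      # single 2k-vector per entity
v_T = np.concatenate([hT, tT])
v_R = np.concatenate([vRf, vRi])    # forward and inverse relation vectors
swapped_T = np.concatenate([v_T[k:], v_T[:k]])

lhs = np.sum(v_H * v_R * swapped_T)
rhs = np.sum(hH * vRf * tT) + np.sum(hT * vRi * tH)  # the two terms of Eq. 4.3
assert np.isclose(lhs, rhs)
```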
For ComplEx, if we consider the embedding of each entity E to be the single vector [re(E); im(E)], then ComplEx can be considered a bilinear model with its MR matrices constrained according to Fig. 4.1(b). RESCAL can be considered a bilinear model which imposes no constraints on the MR matrices. Considering the embedding of each entity E to be the single vector [h(E); t(E)], CP can be viewed as a bilinear model with its MR matrices constrained as in Fig 4.1(c). For a triple (E1, R, E2), multiplying [h(E1); t(E1)] by MR results in a vector v(E1R) whose first half is zero and whose second half is the element-wise product of h(E1) with the parameters in MR. Multiplying v(E1R) by [h(E2); t(E2)] corresponds to ignoring h(E2) (since the first half of v(E1R) is zeros) and taking the dot product of the second half of v(E1R) with t(E2). SimplE-avg can be viewed as a bilinear model similar to CP except that the MR matrices are constrained as in Fig 4.1(d). The extra parameters added to the matrix compared to CP correspond to the parameters of the inverses of the relations. Note that several other restrictions on the MR matrices are equivalent to SimplE-avg, e.g., restricting the MR matrices to be zero everywhere except on the counterdiagonal. Viewing SimplE-avg as a single-vector-per-entity model makes it easily integrable (or compatible) with other embedding models (in knowledge graph completion, computer vision, and natural language processing) such as [188, 191, 235].

4.2.3 Training SimplE

To train a SimplE model (both SimplE-ignr and SimplE-avg), we consider the Rs and R−1s to be different relations, each having its own embedding vector, augment the training data by adding a triple (E2, R−1, E1) for each triple (E1, R, E2) in the data, and then learn the embedding vectors as in CP.

For learning, we use stochastic gradient descent with mini-batches.
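The data augmentation step just described can be sketched as follows (our own minimal encoding: relations are strings and R−1 is represented by appending a suffix to the relation name):

```python
def augment_with_inverses(triples):
    # For each (E1, R, E2), add (E2, R_inv, E1); R and R_inv are treated as
    # distinct relations, each with its own embedding vector.
    augmented = list(triples)
    for e1, r, e2 in triples:
        augmented.append((e2, r + "_inv", e1))
    return augmented
```

After augmentation, training proceeds exactly as for CP on the doubled set of triples.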
In each learning iteration, we take a batch of positive triples from the dataset, and for each positive triple in the batch we generate n negative triples by corrupting the positive triple. We use Bordes et al. [24]'s procedure to corrupt positive triples, which is as follows. For a positive triple (H, R, T), we randomly decide to corrupt the head or the tail. If the head is selected, we replace H in the triple with an entity H′ randomly selected from E − {H} and generate the corrupted triple (H′, R, T). If the tail is selected, we replace T in the triple with an entity T′ randomly selected from E − {T} and generate the corrupted triple (H, R, T′). We generate a labeled batch LB by labeling positive triples as +1 and negatives as −1.

Figure 4.1: The constraints on the MR matrices for the bilinear models (a) DistMult, (b) ComplEx, (c) CP, and (d) SimplE. The lines represent where the parameters are; the other elements of the matrices are constrained to be zero. In ComplEx, the parameters represented by the dashed line are tied to the parameters represented by the solid line, and the parameters represented by the dotted line are tied to the negatives of those represented by the dotted-and-dashed line.

Once we have a labeled batch, following Trouillon et al. [208] we optimize the L2-regularized negative log-likelihood of the batch, which has been shown to be less prone to overfitting than other widely used loss functions (e.g., margin-based loss). The negative log-likelihood is defined as follows:

min_θ ∑_{((H,R,T), l) ∈ LB} softplus(−l ∗ φ(H, R, T)) + λ‖θ‖²₂ (4.6)

where θ represents the parameters of the model (the parameters in the embeddings), l represents the label of a triple, λ is the regularization hyperparameter, and softplus(x) = log(1 + exp(x)).

4.2.4 SimplE is fully expressive

The following two propositions prove the full expressiveness of SimplE, each providing a different bound on the size of the embedding vectors.
Both propositions assume a threshold on the sigmoid function. The proofs of the propositions can be found in Appendix A.

Proposition 2. For any ground truth over entities E and relations R, there exists a SimplE model with embedding vectors of size |E| ∗ |R| that represents that ground truth.

Proposition 3. For any ground truth over entities E and relations R containing |ζ| true facts, there exists a SimplE model with embedding vectors of size |ζ| + 1 that represents that ground truth.

Corollary 2. For any ground truth over entities E and relations R containing |ζ| true facts, there exists a SimplE model with embedding vectors of size min(|E| ∗ |R|, |ζ| + 1) that represents that ground truth.

The bound in Corollary 2 is tighter than Trouillon et al. [209]'s bound for ComplEx (|E| ∗ |R|), as for some datasets |ζ| + 1 may be much smaller than |E| ∗ |R|. Note that in practice there are many regularities in the data and a much smaller vector size suffices. Learning a SimplE model with vectors of size min(|E| ∗ |R|, |ζ| + 1) corresponds to memorizing the data; generalization/learning occurs when the size is smaller. These theoretical bounds provide an upper bound on the embedding sizes that need to be considered for, e.g., hyperparameter optimization.

4.2.5 SimplE & Background Knowledge

Similar to DistMult, the embeddings of SimplE are interpretable: the entity and relation embeddings can be interpreted in the same way as DistMult's. Also similar to DistMult, one can incorporate certain types of background knowledge (e.g., an observed feature of some entities) into the embeddings. Minervini et al. [153] show how background knowledge in terms of equivalence and inversion between two relations can be incorporated into several tensor factorization models. In this section, we show how background knowledge in terms of logical rules can be incorporated into the embeddings of SimplE by tying the parameters.
The following propositions cover three such logical rules: symmetry, anti-symmetry, and inversion.⁷ We ignore equivalence between two relations as it is trivial. The proofs of the propositions can be found in Appendix A.

Proposition 4. Let R be a relation such that:

(Ei, R, Ej) ∈ ζ ⇔ (Ej, R, Ei) ∈ ζ

for any two entities Ei and Ej (i.e., R is symmetric). This property of R can be encoded into SimplE by tying the parameters v(R−1) to v(R).

Proposition 5. Let R be a relation such that:

(Ei, R, Ej) ∈ ζ ⇔ (Ej, R, Ei) ∈ ζ′

for any two entities Ei and Ej (i.e., R is anti-symmetric). This property of R can be encoded into SimplE by tying the parameters v(R−1) to the negative of v(R).

Proposition 6. Let R1 and R2 be two relations such that:

(Ei, R1, Ej) ∈ ζ ⇔ (Ej, R2, Ei) ∈ ζ

for any two entities Ei and Ej (i.e., R2 is the inverse of R1). This property of R1 and R2 can be encoded into SimplE by tying the parameters v(R1−1) to v(R2) and v(R2−1) to v(R1).

⁷Note that such background knowledge can be exerted on some relations selectively and not on others. This differs from, e.g., DistMult, which enforces symmetry on all relations.

4.3 Experiments and results

We compare SimplE empirically with several other tensor factorization models. We show that despite its simplicity, SimplE performs very well. We also design an experiment that empirically validates our proofs in Section 4.2.5.

4.3.1 Datasets

We conducted experiments on two standard benchmarks:

• WN18: a subset of Wordnet [152]
• FB15k: a subset of Freebase [22]

We used the same train/valid/test sets as Bordes et al. [24]. Statistics on these datasets are available in Table 4.2.

Table 4.2: Statistics on the WN18 and FB15k datasets. WN18 is a subset of Wordnet [152] and FB15k is a subset of Freebase [22].

Dataset  |E|     |R|    #train   #valid  #test
WN18     40,943  18     141,442  5,000   5,000
FB15k    14,951  1,345  483,142  50,000  59,071

Table 4.3: Results on WN18 and FB15k.
Reported metrics include raw and filtered mean reciprocal rank (MRR), and filtered Hit@1, Hit@3, and Hit@10. Best results are in bold.

                      WN18                                    FB15k
                  MRR              Hit@                   MRR              Hit@
Model         Filter   Raw    1      3      10        Filter   Raw    1      3      10
CP            0.075    0.058  0.049  0.080  0.125     0.326    0.152  0.219  0.376  0.532
TransE        0.454    0.335  0.089  0.823  0.934     0.380    0.221  0.231  0.472  0.641
TransR        0.605    0.427  0.335  0.876  0.940     0.346    0.198  0.218  0.404  0.582
DistMult      0.822    0.532  0.728  0.914  0.936     0.654    0.242  0.546  0.733  0.824
NTN           0.53     −      −      −      0.661     0.25     −      −      −      0.414
STransE       0.657    0.469  −      −      0.934     0.543    0.252  −      −      0.797
ER-MLP        0.712    0.528  0.626  0.775  0.863     0.288    0.155  0.173  0.317  0.501
ComplEx       0.941    0.587  0.936  0.945  0.947     0.692    0.242  0.599  0.759  0.840
SimplE-ignr   0.939    0.576  0.938  0.940  0.941     0.700    0.237  0.625  0.754  0.821
SimplE-avg    0.942    0.588  0.939  0.944  0.947     0.727    0.239  0.660  0.773  0.838

4.3.2 Evaluation metrics

To measure and compare the performances of different models, for each test triple (H, R, T) we once compute the scores of the triples (H′, R, T) for all H′ ∈ E and calculate the rank rank_H of the triple having H, and once compute the scores of the triples (H, R, T′) for all T′ ∈ E and calculate the rank rank_T of the triple having T. Then we compute the mean reciprocal rank (MRR), a standard metric widely used for ranking in information retrieval [181] which, in contrast to mean rank, is less sensitive to outliers, as the mean of the inverses of these ranks:

MRR = (1 / (2 ∗ |tt|)) ∗ Σ_{(H,R,T) ∈ tt} (1/rank_H + 1/rank_T)    (4.7)

where tt represents the test triples.

Bordes et al. [24] identified an issue with the above procedure for calculating the MRR (hereafter referred to as raw MRR). For a test triple (H, R, T), since there can be several entities H′ ∈ E for which (H′, R, T) holds, measuring the quality of a model based on its ranking for (H, R, T) may be flawed.
That is because two models may both rank the test triple (H, R, T) second, where the first model ranks a correct triple (e.g., from the train or validation set) (H′, R, T) first and the second model ranks an incorrect triple (H″, R, T) first. Both models get the same score for this test triple, when the first model should get a higher score.

To address this issue, Bordes et al. [24] proposed a modification to raw MRR. For each test triple (H, R, T), instead of finding the rank of this triple among the triples (H′, R, T) for all H′ ∈ E (or (H, R, T′) for all T′ ∈ E), they proposed to calculate the rank among the triples (H′, R, T) only for H′ ∈ E such that (H′, R, T) ∉ train ∪ valid ∪ test. Following Bordes et al. [24], we call this measure filtered MRR.

We also report hit@k measures. The hit@k for a model is computed as the percentage of test triples whose rank (computed as described earlier) is less than or equal to k.

4.3.3 Implementation

We implemented SimplE in TensorFlow [1] (code: https://github.com/Mehran-k/SimplE). We tuned our hyperparameters over the validation set. We used the same search grid on embedding size and λ as Trouillon et al. [208] to make our results directly comparable to theirs. We fixed the maximum number of iterations to 1000 and the batch size to 100. We set the learning rate to 0.1 for WN18 and to 0.05 for FB15k and used adagrad to update the learning rate after each batch. Following Trouillon et al. [208], we generated one negative example per positive example for WN18 and 10 negative examples per positive example for FB15k. We computed the filtered MRR of our model over the validation set every 50 iterations for WN18 and every 100 iterations for FB15k and selected the iteration that resulted in the best validation filtered MRR. The best embedding size and λ values on WN18 for SimplE-ignr were 200 and 0.001 respectively, and for SimplE-avg were 200 and 0.03.
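As a concrete sketch of this evaluation protocol (a minimal Python illustration, not the thesis code; the toy score table and entity names are invented), the raw rank, the filter of Bordes et al. [24], MRR (Eq. 4.7), and hit@k can be computed as:

```python
def rank(score, triple, entities, known=frozenset(), corrupt_head=True):
    """Rank of a test triple among its head- (or tail-) corruptions.
    With `known` (train/valid/test triples) supplied, corruptions that
    are themselves true facts are skipped, giving the *filtered* rank."""
    h, r, t = triple
    target = score(h, r, t)
    rk = 1
    for e in entities:
        cand = (e, r, t) if corrupt_head else (h, r, e)
        if cand == triple or cand in known:
            continue
        if score(*cand) > target:
            rk += 1
    return rk

def mrr(head_ranks, tail_ranks):
    """Eq. (4.7): mean of the inverses of the head and tail ranks."""
    n = len(head_ranks)
    return sum(1.0 / rh + 1.0 / rt
               for rh, rt in zip(head_ranks, tail_ranks)) / (2 * n)

def hit_at_k(ranks, k):
    """Fraction of test triples ranked at position <= k."""
    return sum(r <= k for r in ranks) / len(ranks)

# A toy score table over three entities (values invented for illustration):
scores = {("A", "r", "C"): 0.9, ("B", "r", "C"): 0.95, ("C", "r", "C"): 0.5}
sc = lambda h, r, t: scores[(h, r, t)]

raw = rank(sc, ("A", "r", "C"), ["A", "B", "C"])                      # 2
filt = rank(sc, ("A", "r", "C"), ["A", "B", "C"], {("B", "r", "C")})  # 1: (B,r,C) is a known fact
print(raw, filt)  # 2 1
```

The filtered rank is 1 rather than 2 because the higher-scoring corruption (B, r, C) is itself a true triple and is therefore skipped.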
The best embedding size and λ values on FB15k for SimplE-ignr were 200 and 0.03 respectively, and for SimplE-avg were 200 and 0.1.

4.3.4 Baselines

We compare SimplE with several existing tensor factorization approaches. Our baselines include canonical polyadic (CP) decomposition, TransE, TransR, DistMult, neural tensor networks (NTN), STransE, ER-MLP, and ComplEx. Given that we use the same data splits and objective function as ComplEx, we report the results of CP, TransE, DistMult, and ComplEx from Trouillon et al. [208]. We report the results of TransR and NTN from [162], and ER-MLP from [168] for further comparison.

4.3.5 Entity prediction results

Table 4.3 shows the results of our experiments. Both SimplE-ignr and SimplE-avg perform well compared to the existing baselines on both datasets. On WN18, SimplE-ignr and SimplE-avg perform as well as ComplEx, a state-of-the-art tensor factorization model. On FB15k, SimplE-avg outperforms the existing baselines and gives state-of-the-art results among tensor factorization approaches. SimplE-avg (and SimplE-ignr) work especially well on this dataset in terms of filtered MRR and hit@1. Fig 4.2 shows the training loss of SimplE-avg for WN18 as a function of the number of epochs.

The table shows that models with many parameters (e.g., NTN and STransE) do not perform well on these datasets, probably because they overfit. The table also shows that multiplicative approaches tend to perform better than translational and deep learning approaches. Even DistMult, the simplest multiplicative approach, outperforms many translational and deep learning approaches, despite not being fully expressive. We believe the simplicity of the embeddings and the scoring function is a key property for tensor factorization approaches, and SimplE performs well due to its simplicity and full expressiveness.

Figure 4.2: Training loss for SimplE-avg on WN18 as a function of the number of epochs.

4.3.6 Incorporating background knowledge

When background knowledge is available, a knowledge graph may not include information that is implied by the background knowledge (as it is redundant), and so methods that do not incorporate the background knowledge can never learn it.

In Section 4.2.5, we showed how background knowledge that can be formulated in terms of three types of rules can be incorporated into SimplE embeddings. To test this empirically, we conducted an experiment on WN18 in which we incorporated several such rules into the embeddings as outlined in Propositions 4, 5, and 6. The rules can be found in Table 4.4. Most of the rules in Table 4.4 are of the form:

∀Ei, Ej ∈ E : (Ei, R1, Ej) ∈ ζ ⇔ (Ej, R2, Ei) ∈ ζ    (4.8)

Table 4.4: Rules used in Section 4.3.6.

1  (E1, Hyponym, E2) ∈ ζ ⇔ (E2, Hypernym, E1) ∈ ζ
2  (E1, MemberMeronym, E2) ∈ ζ ⇔ (E2, MemberHolonym, E1) ∈ ζ
3  (E1, InstanceHyponym, E2) ∈ ζ ⇔ (E2, InstanceHypernym, E1) ∈ ζ
4  (E1, HasPart, E2) ∈ ζ ⇔ (E2, PartOf, E1) ∈ ζ
5  (E1, MemberOfDomainTopic, E2) ∈ ζ ⇔ (E2, SynsetDomainTopicOf, E1) ∈ ζ
6  (E1, MemberOfDomainUsage, E2) ∈ ζ ⇔ (E2, SynsetDomainUsageOf, E1) ∈ ζ
7  (E1, MemberOfDomainRegion, E2) ∈ ζ ⇔ (E2, SynsetDomainRegionOf, E1) ∈ ζ
8  (E1, SimilarTo, E2) ∈ ζ ⇔ (E2, SimilarTo, E1) ∈ ζ

For (possibly identical) relations R1 and R2 participating in a rule as above, if both of the triples (E1, R1, E2) and (E2, R2, E1) are in the training set, one of them is redundant because it can be inferred from the other.

In this subsection, we design an experiment to validate empirically that background knowledge can be incorporated into SimplE models to improve their performance. For this goal, we removed redundant triples, i.e., triples that can be inferred from the background rules in Table 4.4, from the training set of WN18: for any two triples as above where one can be inferred from the other, we randomly removed one of the two. Removing redundant triples reduced the number of triples in the training set from (approximately) 141K to (approximately) 90K, almost a 36% reduction in size. Note that this experiment provides an upper bound on how much background knowledge can improve the performance of a SimplE model.

We trained SimplE-ignr and SimplE-avg (with parameters tied according to the rules) on this new training set with the best hyperparameters found in the previous experiment. We refer to these two models as SimplE-ignr-bk and SimplE-avg-bk. We also trained SimplE-ignr and SimplE-avg models on this dataset without incorporating the rules into the embeddings. For a sanity check, we also trained a ComplEx model over this new dataset. We found that the filtered MRRs for SimplE-ignr, SimplE-avg, and ComplEx were respectively 0.221, 0.384, and 0.275. For SimplE-ignr-bk and SimplE-avg-bk, the filtered MRRs were 0.772 and 0.776 respectively, substantially higher than the case without background knowledge. In terms of hit@k measures, SimplE-ignr gave 0.219, 0.220, and 0.224 for hit@1, hit@3 and hit@10 respectively. These numbers were 0.334, 0.404, and 0.482 for SimplE-avg, and 0.254, 0.280 and 0.313 for ComplEx. For SimplE-ignr-bk, these numbers were 0.715, 0.809 and 0.877, and for SimplE-avg-bk they were 0.715, 0.818 and 0.883, also substantially higher than the models without background knowledge.
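The redundancy-removal step described above can be sketched as follows (a minimal illustration under our own conventions, not the thesis code; the toy triples are invented, and the rule names come from Table 4.4):

```python
import random

def drop_redundant(triples, rule_pairs, seed=0):
    """For each pair of triples made mutually redundant by an (R1, R2)
    rule of the form (E1,R1,E2) <=> (E2,R2,E1), keep only one of the two,
    chosen at random via the initial shuffle. Symmetric relations are
    handled by a rule pair (R, R)."""
    inv = {}
    for r1, r2 in rule_pairs:
        inv[r1], inv[r2] = r2, r1
    order = list(triples)
    random.Random(seed).shuffle(order)  # random choice of which copy survives
    kept = set()
    for h, r, t in order:
        if (t, inv.get(r), h) in kept:
            continue  # inferable from its already-kept counterpart
        kept.add((h, r, t))
    return kept

triples = {("a", "Hyponym", "b"), ("b", "Hypernym", "a"), ("a", "Hyponym", "c")}
kept = drop_redundant(triples, [("Hyponym", "Hypernym"),
                                ("SimilarTo", "SimilarTo")])
print(len(kept))  # 2: one of the two mutually redundant triples is removed
```

Of the pair (a, Hyponym, b) / (b, Hypernym, a), exactly one survives; the non-redundant triple (a, Hyponym, c) is always kept.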
The obtained results validate that background knowledge can be incorporated into SimplE embeddings efficiently to improve performance.

4.4 Conclusion

We proposed a simple, interpretable, fully expressive tensor factorization model for knowledge graph completion. We showed that our model, called SimplE, performs very well empirically and has several interesting properties. For instance, we showed that background knowledge in terms of three types of logical rules can be incorporated into SimplE by tying the embeddings. Incorporating other types of rules (e.g., implication) into SimplE through parameter tying is an interesting future direction.

SimplE can be improved, or may help improve relational learning, in several ways. As an example, Wang et al. [227] showed that each specific relation can be predicted better by some tensor factorization model and proposed a relation-level ensemble of tensor factorization approaches. Incorporating SimplE into their ensemble can further improve their results. Kadlec et al. [103] aimed at improving the performance of DistMult by substantially increasing the search space for its hyperparameters. Improving the performance of tensor factorization models by increasing the search space of hyperparameters is orthogonal to improving them by proposing better models. The results for SimplE can also potentially improve by tuning the hyperparameters over a very large space.

Chapter 5
Property Prediction

This chapter studies the problem of predicting the value of a property of an entity given the observed properties of the entity and the relations the entity participates in. When the entity participates in relations with multiple other entities, the value can be an aggregate of information from the related entities. As an example of a property prediction problem, consider an online shop where customers view items, add them to their baskets, purchase them, or return them.
We may be interested in predicting the gender of our customers for better personalization.

In this chapter, we first discuss the limitations of other well-known relational models for property prediction. Then we develop relational neural networks (RelNNs), which address some issues with existing models. RelNNs are developed by connecting multiple relational logistic regression (RLR) layers as a graph and enabling them to learn latent entity properties both directly and through general rules. Similar to [169, 175], we identify the relationship between RelNNs and convolutional neural networks (ConvNets). Identifying such a relationship allows RelNNs to be understood and implemented using well-known ConvNet primitives. The time complexity of each training iteration of RelNNs is proportional to the amount of data times the size of the RelNN, making RelNNs highly scalable.

We evaluate RelNNs on three real-world datasets and compare them to well-known relational learning algorithms. We show how RelNNs address a relational learning issue raised in [178], which showed that as the population size increases, the probabilities of many variables go to 0 or 1, making the model overconfident. The obtained results indicate that RelNNs are promising models for relational learning.

5.1 Why not use existing relational models?

In this section, we identify the limitations of existing relational models for property prediction. Parts of the results presented in this section have been published in Kazemi et al. [114]. Consider the problem of gender prediction in the online shop example again. One needs to aggregate over a varying number
of items to make such predictions, as each customer may have purchased a different number of items.

Existing explicit aggregators in the literature (see e.g., [71, 94, 112, 122, 159, 160]) include OR, AND, noisy-OR, noisy-AND, logistic regression, average, count, median, mode, max/min and proportion.

We expect OR and AND (corresponding respectively to whether a customer has purchased at least one item and whether a customer has purchased all items) to perform badly. Noisy-OR and noisy-AND are both monotonic functions: with noisy-OR, more observations can only increase the probability of a certain class, and with noisy-AND, more observations can only decrease it. Therefore, these two aggregators are also expected to perform badly on aggregation problems such as the online shopping example, as each observation (each item purchased by a customer) may only increase (or only decrease) the probability of a class.

Logistic regression solves the monotonicity problem of noisy-OR and noisy-AND, as each observation may increase or decrease the probability of a certain class. For instance, if Item1 has been mostly purchased by males and Item2 mostly by females, then observing that a new customer has purchased Item1 may increase the probability that this customer is male, and observing that they have purchased Item2 may decrease it. However, as we pointed out in [178], as the number of views increases, a logistic regression model tends to become overconfident about its predictions (predicting with probabilities near zero or one), which results in poor performance in terms of log loss.

Functions such as average, count, median, mode, max/min and proportion may provide small hints.
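The monotonicity of noisy-OR can be checked in a few lines (a minimal sketch; the leak probability and the per-item probability are invented for illustration):

```python
def noisy_or(leak, per_item_probs):
    """P(class | observations) under noisy-OR: each extra observation
    can only raise the probability, never lower it."""
    p_none = 1.0 - leak
    for p in per_item_probs:
        p_none *= (1.0 - p)
    return 1.0 - p_none

# Probability of the class after 0, 1, 2, ... purchases, each contributing 0.3:
probs = [noisy_or(0.05, [0.3] * k) for k in range(6)]
assert all(a <= b for a, b in zip(probs, probs[1:]))  # monotone non-decreasing
```

No matter which items are observed, every observation pushes the probability in the same direction, which is exactly why noisy-OR (and, dually, noisy-AND) is a poor aggregator for tasks like gender prediction.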
However, all of these functions lose great amounts of information as they do not consider which specific items females or males purchase more.

Existing relational models are expected to perform poorly as they (explicitly or implicitly) rely mostly on one or more of the above aggregators. For MLNs and RLR, the aggregation is either a function of count or a logistic regression model. In Problog, the aggregation is either a function of count or a noisy-OR model. In RDN-Boost and relational probability trees, the aggregation is a combination of several thresholding functions.

For tensor factorization approaches, one may turn each property into a triple of the form (entity, property, value) and use the existing approaches for property prediction. Such an approach may not work well for property prediction, as described by the following example. In the online shop example, after training, one of the latent features for each user may become identical to gender (i.e., just memorizing and not generalizing) and the other latent features may be used for other predictions. Since the user embeddings only memorize the gender, the model offers no generalization for unseen user genders and performs poorly.

Figure 5.1: An RLR model for predicting the gender of users in a movie rating system with a layer-wise architecture. [Figure: the WFs ⟨True, w0⟩, ⟨Old(u), w1⟩, ⟨Doctor(u), w2⟩, ⟨Likes(u,m) ∗ Action(m), w3⟩ and ⟨Likes(u,m) ∗ Drama(m), w4⟩ feed S(u), which determines Male(u).]

Graph random walks [130, 131] are another class of approaches for relational learning that can be used for property prediction. These approaches typically define a strategy for starting from a node in the multidigraph, (probabilistically) walking on the multidigraph, and reaching the desired node. In Kazemi and Poole [108], we identify several shortcomings of these approaches for property prediction and prove for Lao and Cohen [130] and Lao et al.
[131] that, after a preprocessing of the input data, they are a subset of RLR and may therefore suffer from the same issues that RLR does.

With numerous advances in learning deep networks [137] for many different applications, lifting these deep networks to directly work with relational data has been the focus of several recent studies [7, 39, 70, 72, 140, 193, 200, 210, 225]. However, all of these works are limited in one or more of the following ways: 1- the model is limited to only one input relation, 2- the structure of the model is highly dependent on the input data, 3- the model allows for only one hidden layer, 4- the framework is not flexible enough for adding new types of layers in a modular way, 5- the model is essentially propositional and does not have the ability to learn about specific entities, 6- the model cannot learn hidden entity properties through general rules, or 7- the model does not scale to large domains.

5.2 Relational neural networks

In this section, we develop a model for deep relational learning with a layer-wise architecture which addresses the issues identified in the previous section for other relational models. The results presented in this section and the subsequent sections of this chapter are published in Kazemi and Poole [109].

Figure 5.2: A RelNN model for predicting the gender of users in a movie rating system with a layer-wise architecture. [Figure: a first RLL with WFs ⟨True, w0⟩, ⟨Likes(u,m) ∗ Action(m), w1⟩, ⟨True, w2⟩ and ⟨Likes(u,m) ∗ Drama(m), w3⟩ producing S1(u) and S2(u), activated into H1(u) and H2(u); a second RLL with WFs ⟨True, w4⟩, ⟨H1(u), w5⟩, ⟨H2(u), w6⟩, ⟨Old(u), w7⟩ and ⟨Teacher(u), w8⟩ producing S3(u), which determines Male(u).]

As described in Chapter 2, logistic regression can be defined using a linear layer (LL) and an activation layer (AL). During training, an error layer (EL) is appended to the architecture. Neural networks can be defined by cascading multiple linear and activation layers. In this section, we first re-define RLR models in terms of layers.
We design relational counterparts of LL, AL, and EL and show that an RLR model can be defined using a relational LL and a relational AL, with a relational EL appended to the architecture during training. Then, similar to how logistic regression is generalized to neural networks, we build relational neural networks (RelNNs) by connecting several relational LLs and relational ALs as a graph. The backpropagation algorithm can be utilized in training RelNNs by adding a relational EL at the end of the graph, feeding the data forward, calculating the prediction errors, backpropagating the derivatives with respect to the errors, and updating the parameters using stochastic gradient descent.

Let:
• Ain and Aout be two mutually exclusive sets of atoms,
• E represent an error function,
• E represent a set of entities,
• G(Ain, E) represent the set of all groundings of Ain for entities E,
• G(Aout, E) represent the set of all groundings of Aout for entities E,
• Î be a function from a grounding P(X) in G(Ain, E) to real values,
• Ô be a function from Q(X) ∈ G(Aout, E) to values,
• DÔ be a function from Q(X) ∈ G(Aout, E) to ∂E/∂Q(X), i.e., a function that for each grounding Q(X) ∈ G(Aout, E) returns the derivative of the error function with respect to Q(X),
• and DÎ be a function from P(X) ∈ G(Ain, E) to ∂E/∂P(X), i.e., a function that given a grounding P(X) ∈ G(Ain, E) returns the derivative of the error with respect to P(X).

Definition 8. A layer is a generic construct that, given Î as input, outputs Ô based on Î and its internal structure, and, given DÔ, updates its internal weights and outputs DÎ using the chain rule.

Definition 9. A relational linear unit for an atom Q(x) is identical to the linear part of RLR in Eq. (3.10).

Definition 10. A relational linear layer (RLL) consists of t > 0 relational linear units, with the j-th unit having atom Qj(xj) and a set

ψj = {⟨wj1, ϕj1⟩, ⟨wj2, ϕj2⟩, . . . , ⟨wjmj, ϕjmj⟩}

of mj WFs.
Ain for an RLL contains all atoms in the ψj's and Aout contains all Qj(xj)'s. Î comes from input data or from the layers directly connected to the RLL. For any Q(X) ∈ G(Aout, E), Ô(Q(X)) is calculated using the linear part of Eq. (3.10) with respect to Î and ψj. An RLL can be seen as general rules with tied parameters applied to every entity.

Example 33. Consider an RLL with Ain = {Friend(x, y), Kind(y)}, Aout = {Happy(x)}, and the WFs in Example 26. Suppose there are 10 entities: ∆x = {X1, X2, . . . , X10}. Let Î be a function from G(Ain, ∆x) to values according to which X1, X2, . . . , X10 have 3, 0, . . . , and 7 friends that are kind, respectively. Ô for this RLL is a function from groundings in G(Aout, ∆x) to values:

Ô(Happy(X1)) ↦ −1.5
Ô(Happy(X2)) ↦ −4.5
. . .
Ô(Happy(X10)) ↦ 2.5

Consider the j-th relational linear unit. For each assignment of entities Xju to xj, let θju = {xj/Xju}. The derivative with respect to each weight wjk can be calculated as:

∂E/∂wjk = Σ_{Xju} η(ϕjk θju, Î) ∗ DÔ(Qj(Xju))    (5.1)

To show how DÎ is calculated, we only show the derivative formulae for the case where no predicate appears twice in a formula (i.e., no self-joins). This is, however, not a requirement for our model.

Let ϕ\P represent formula ϕ with any atom with predicate P (or its negation) removed from it. For each assignment of entities Xiv to the logvars xi of P, let θiv = {xi/Xiv}. Then:

DÎ(P(Xiv)) = Σ_j Σ_{Xju} Σ_{k=1, P∈ϕjk}^{mj} wjk ∗ η(((ϕjk\P) θju) θiv, Î) ∗ DÔ(Qj(Xju))    (5.2)

The relational activation layer (RAL) and relational error layer (REL) are similar to their non-relational counterparts. An RAL applies an activation function A (e.g., sigmoid, tanh, or ReLU) to its inputs and outputs the activations.
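Returning to Example 33, its outputs are consistent with a bias of −4.5 on ⟨True, ·⟩ and a weight of 1.0 on ⟨Friend(x,y) ∗ Kind(y), ·⟩; these two values are our own assumption (the WFs of Example 26 are not reproduced here), chosen to match the numbers above:

```python
def rll_unit(bias, weight, kind_friend_counts):
    """Relational linear unit as in Example 33: for each entity x,
    output bias + weight * (number of kind friends of x)."""
    return {x: bias + weight * n for x, n in kind_friend_counts.items()}

# Entities X1, X2 and X10 have 3, 0 and 7 kind friends respectively:
out = rll_unit(-4.5, 1.0, {"X1": 3, "X2": 0, "X10": 7})
print(out)  # {'X1': -1.5, 'X2': -4.5, 'X10': 2.5}
```

The same tied parameters are applied to every entity; only the count η of true groundings differs per entity, which is exactly what Eq. (5.1) sums over.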
An REL compares the inputs to target labels, finds the prediction errors based on an error function E, and outputs the prediction errors made by the model for each target entity.

Relational neural networks: A relational neural network (RelNN) is a structure containing a sequence of RLLs and RALs. In our experiments, we consider the activation function to be the sigmoid function. During training, an REL is also added at the end of the sequence. Fig 5.1 and Fig 5.2 respectively represent an example of an RLR model and a simple RelNN model for predicting the gender of users based on their age, occupation, and the movies they like. We use boxes with solid lines for RLLs and boxes with dashed lines for RALs. For the first RLL:

Ain = {Likes(u,m), Action(m), Drama(m)}
Aout = {S1(u), S2(u)}

All the atoms in Ain for the first RLL are observed atoms (they are given to the model as input). For the second RLL:

Ain = {Old(u), Teacher(u), H1(u), H2(u)}
Aout = {S3(u)}

Î for the first two atoms of Ain comes from observations (i.e., the input to the model), and for the last two it comes from the layers directly connected to the RLL. Note that the WFs whose formula is True are biases.

Using target predicates as inputs: Fig 5.3 shows a more complicated RelNN structure for gender prediction in the PAKDD-15 competition. Note that in Fig 5.3, the target atom Male(u) is used twice: once at the last layer as the target atom and once in the first RLL as an atom to be used for learning a hidden feature about items. Since Male(u) is not observed for all entities, the unit containing Male(u) requires special treatment. In our experiments, we ignored the unobserved values of Male(u) in the first RLL and only counted over the observed values.
That is, we treated the following two rules in Fig 5.3:

Viewed(u′, i) ∗ Male(u′)
Viewed(u′, i) ∗ Female(u′)

as:

Viewed(u′, i) ∗ ObservedGender(u′) ∗ Male(u′)
Viewed(u′, i) ∗ ObservedGender(u′) ∗ Female(u′)

An alternative would be to randomly initialize the unobserved values in the first learning iteration, count over the observed and randomly initialized values, move forward to compute the target values for Male(u), then update the random values for the unobserved variables with the predicted values for the target Male(u) in the second iteration, and continue this process for the subsequent iterations. Such a procedure is called iterative classification and can be used for collective classification [141, 192].

5.2.1 Motivations for hidden layers

Motivation 1. Consider the model in Fig 5.1 and let nU represent the number of action movies that user U likes. As Poole et al. (2014) point out, if the probability of maleness depends on nU, then as nU increases the probability of maleness goes to either 0 or 1. This causes the model to become overconfident in its predictions when there are many ratings, and it performs poorly, especially in terms of log loss. This is, however, not the case for RelNNs. In the RelNN model in Fig 5.2, the probability of maleness depends, among other things, on:

w5 ∗ sigmoid(w0 + w1 ∗ nU)

w1 controls the steepness of the sigmoid's slope, w0 shifts the slope, and w5 controls the importance of this term in the final prediction. With this model, as nU increases, the probability of maleness may converge to any number in the [0, 1] interval.
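This aggregation behaviour can be checked numerically (a minimal sketch; all weight values below are invented for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def flat_lr(w0, w1, n):
    """Plain logistic regression on the count n: saturates at 0 or 1."""
    return sigmoid(w0 + w1 * n)

def relnn_agg(v0, w5, w0, w1, n):
    """The count is passed through a hidden sigmoid unit first
    (Motivation 1): the output can level off anywhere in [0, 1]."""
    return sigmoid(v0 + w5 * sigmoid(w0 + w1 * n))

# As n grows, the plain model becomes (over)confident ...
assert flat_lr(-2.0, 0.1, 1000) > 0.999
# ... while the hidden-unit model levels off at sigmoid(v0 + w5):
limit = sigmoid(-1.0 + 1.5)
assert abs(relnn_agg(-1.0, 1.5, -2.0, 0.1, 1000) - limit) < 1e-6
```

With the invented weights, the hidden-unit model converges to sigmoid(0.5) ≈ 0.62 as n grows, rather than being forced toward 0 or 1.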
Furthermore, this model enables the learning of critical zones.

Figure 5.3: RelNN structure for predicting the gender in the PAKDD15 dataset. [Figure: a multi-layer RelNN whose first layers learn hidden item features H1(i), H2(i), H3(i) from WFs such as ⟨A(i,a) ∗ N1(a), w1⟩, ⟨Viewed(u′,i) ∗ Male(u′), w5⟩ and ⟨Viewed(u′,i) ∗ Female(u′), w7⟩, and whose later layers combine hidden user features H4(u)–H10(u), built from WFs such as ⟨Viewed(u,i) ∗ H1(i), w15⟩ and ⟨SessionTime(u), w22⟩, to predict Male(u).]

Motivation 2. Suppose the underlying truth for the models in Fig 5.1 and Fig 5.2 is that males correspond to users who like at least p action movies such that each movie has fewer than q likes. A typical RLR/MLN model cannot learn such a model because it first needs to learn something about movies (i.e., having fewer than q likes), combine it with being action, and count. However, this can be done using a RelNN (the hidden layer should contain a relational linear unit with atom H(m) and WFs ⟨w1, Likes(u,m)⟩ and ⟨w2, True⟩). Thus, hidden layers increase the modeling power by enabling the model to learn generic rules and categorize the entities accordingly, then treat entities in different categories differently.

Motivation 3. Kazemi et al. (2014a) show how different types of existing explicit aggregators can be represented using RLR.
However, some of those cases (e.g., noisy-OR and mode = t) require two RLLs and two RALs, i.e., RelNNs.

5.2.2 Learning latent properties directly

Entities may have latent properties that cannot be specified using general rules but can be learned directly from data during training. Such properties have proved effective in many tasks (e.g., recommendation systems [127]). These properties can also be viewed as softly clustering the entities into different categories. To distinguish the two, we call the latent properties learned specifically for each entity numeric latent properties, and the general latent properties learned through WFs in RLLs rule-based latent properties. In the RelNN in Fig 5.3, for instance, H1(i), H4(u) and H7(u) (and the other Hi's) are the outputs of RALs and are therefore rule-based latent properties. S1(i), S4(u) and S7(u) (and the other Si's) are the pre-activation values of the rule-based latent properties (i.e., Hi(.) = A(Si(.)), where A is an activation function). N2(b) and N4(a) (and the other Ni's) are numeric latent properties.

Consider the models in Fig 5.1 and Fig 5.2 and let Latent(m) be a numeric latent property of the movies whose value is to be learned during training. One may initialize Latent(m) with random values for each movie and add a WF ⟨w, Likes(u,m) ∗ Latent(m)⟩ to the first RLL. As mentioned before, during the backpropagation phase, an RLL provides the derivatives with respect to each of its inputs. This means that for each grounding Latent(M), we have the derivative of the error with respect to Latent(M) (i.e., DÎ(Latent(M))). Therefore, these numeric latent values can also be updated during learning using gradient descent.

5.3 From ConvNet primitives to RelNNs

Niepert et al. [169] and Pham et al. [175] explain why their relational learning models for graphs can be viewed as instances of ConvNets. We explain why RelNNs can also be viewed as an instance of ConvNets.
Compared to prior work, we go into more detail and provide more intuition. Such a connection offers the benefit of understanding and implementing RelNNs using ConvNet primitives.

The cells in the input matrices of ConvNets (e.g., image pixels) have spatial correlation and spatial redundancy: cells closer to each other are more dependent than cells farther away. For instance, if M represents an input channel of an image, the dependence between M[i, j] and M[i+1, j+1] may be much stronger than the dependence between M[i, j] and M[i, j+20]. To capture this type of dependency, convolution filters are usually small square matrices. Convolution filters contain tied parameters so that different regions of the input matrices are treated identically.

For relational data, the dependencies in the input matrices (the relationships) are different: cells in the same row or column (i.e., relationships of the same entity) have higher dependence than cells in different rows and columns (i.e., relationships of different entities). For instance, for a matrix L representing which users like which movies, the dependence between L[i, j] and L[i+1, j+1] (different people and movies) may be far less than the dependence between L[i, j] and L[i, j+20] (same person, different movies). A priori, all rows and columns of the input matrices are symmetric. Therefore, to adapt ConvNets for relational data, we need vector-shaped filters that are invariant to row and column swapping and that better capture the relational dependence and the symmetry assumption.

One way to modify ConvNets to be applicable to relational data is as follows. We consider relationships as inputs to the network. We apply vector-shaped convolution filters to the rows and columns of the relationship matrices. For instance, for gender prediction from movie ratings, a convolution filter may be a vector with size equal to the number of movies.
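These row- and column-shaped filters, together with the matrix-join operation described next, can be sketched as follows (a minimal illustration; the toy Likes and ActedIn matrices and all weight values are invented):

```python
import math

def apply_row_filter(M, f):
    """Convolve a row-shaped filter with each row of a relationship matrix,
    e.g. count how many action movies each user likes."""
    return [sum(v * w for v, w in zip(row, f)) for row in M]

def join(R, T):
    """Join two relationship matrices: S[x][a] = sum_m R[x][m] * T[m][a],
    an ordinary matrix product over the shared entity class m."""
    return [[sum(R[x][m] * T[m][a] for m in range(len(T)))
             for a in range(len(T[0]))]
            for x in range(len(R))]

Likes = [[1, 1, 0, 0],         # users x movies (toy data)
         [0, 1, 1, 1],
         [1, 0, 0, 1]]
action = [1, 0, 0, 1]          # fixed filter: which movies are action

counts = apply_row_filter(Likes, action)
print(counts)                  # [1, 1, 2]

# Pass the convolved vector through a sigmoid activation, Sig(w0 + w1*v):
w0, w1 = -1.0, 1.0
acts = [1.0 / (1.0 + math.exp(-(w0 + w1 * v))) for v in counts]

# Joining Likes (users x movies) with a movies x actors matrix yields a
# users x actors matrix usable in the next layer:
ActedIn = [[1, 0],
           [0, 1],
           [1, 1],
           [0, 1]]
print(join(Likes, ActedIn))    # [[1, 1], [1, 3], [1, 1]]
```

A learnable filter would be handled identically, except that its values (rather than being fixed like `action`) receive gradient updates during training.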
The values of these filters can be learned during training (like ConvNet filters), or can be fixed, in which case they correspond to observed properties (e.g., a vector representing which movies are action movies). Convolving each filter with each matrix produces a vector. These vectors go through an activation layer and are used as filters in the next layers. Besides convolving a filter with a matrix, one can join two matrices and produce a new matrix for the next layer. Joining two matrices R(x,m) and T(m,a) produces a new matrix S(x,a) where, for X ∈ x and A ∈ a, S(X,A) = Σ_{M∈m} R(X,M) ∗ T(M,A).

Figure 5.4: An example of a layer of a RelNN demonstrated with ConvNet operations. [Figure: a Likes matrix (users × movies) is convolved with three filters: a fixed row filter for action movies, a fixed column filter for old users, and a row filter of values to be learned; the resulting count vectors pass through the activation Sig(w0 + w1·v) (with w0 = −1, w1 = 1 in the figure) and the results are used as filters for the next layer.]

The filters in the next layers can be applied either to input matrices or to ones generated in previous layers. By constructing an architecture and training the network, the learned filters identify low- and high-level features in the input matrices.

While these operations are modifications of ConvNet operations, they all correspond to operations in our RLR/MLN perspective. Vector-shaped filters correspond to including numeric latent properties in WFs. Fixed filters correspond to including observed atoms with a single logvar in WFs. Joining matrices corresponds to using two binary atoms in a WF.

Example 34.
Suppose we want to predict the gender of users in a movie rating system based on the movies they like, whether each movie is action or not, and whether each user is old or not. Fig 5.4 shows one layer of a RelNN for this problem in terms of ConvNet operations. The binary relation Likes(u, m) is considered as the input. Three filters have been applied to this matrix: one filter with fixed values corresponding to Action(m), one filter with fixed values corresponding to Old(u), and one filter to be learned corresponding to a numeric latent property N(m). Note that Action(m) and N(m) are row-shaped filters and Old(u) is column-shaped. Convolving these filters with Likes(u, m) and sending the resulting vectors through an activation layer is equivalent to having an RLL with three relational linear units; the first unit contains:

〈True, w0〉
〈Likes(u, m) ∗ Action(m), w1〉

and the other ones have Old(u) and N(m) instead of Action(m) respectively. The obtained vectors after activation can be used as filters for the next layers, or for making the final predictions.

The operations of RelNNs are also closely related to Ravanbakhsh et al. [182]'s work on equivariance through parameter-sharing, where the weights of a neural network are tied in a way that ensures permuting two rows (or columns) of an input matrix only permutes the outputs corresponding to the permuted rows, and has no effect on the other outputs.

5.4 Empirical results

In this section, we design and conduct experiments to answer the following questions:
• Q1: How does RelNN's performance compare to existing state-of-the-art relational learning algorithms?
• Q2: How do numeric and rule-based latent properties affect the performance of RelNNs?
• Q3: How well do RelNNs extrapolate to unseen cases and address the population size issue pointed out in [178] and discussed in Motivation 1?

5.4.1 Datasets

We use three real-world datasets in our experiments:

• Movielens 1M: Our first dataset is the Movielens 1M dataset [85], ignoring the actual ratings and only considering whether a movie has been rated or not, and considering only the action and drama genres. Considering only this portion of the Movielens dataset makes for better experiments as more baselines become applicable to it.
• PAKDD15: Our second dataset is from the PAKDD15 gender prediction competition9, but we only considered the A, B, and C prefixes of the items and ignored the D prefix because each D-prefix item is on average seen by ≈ 1.5 people. We also ignored the information within the sequence of the items viewed by each user.
• Yelp: Our third dataset contains all Chinese and Mexican restaurants in the Yelp dataset challenge10 (ignoring the ones that have both Chinese and Mexican foods), and the task is to predict whether a restaurant is Chinese or Mexican given the people who reviewed it, and whether it has fast food and/or seafood or not.

9 https://knowledgepit.fedcsis.org/contest/view.php?id=107
10 https://www.yelp.com/dataset_challenge

5.4.2 Baselines

Traditional machine learning approaches are not directly applicable to our problems. Also, due to the large size of the datasets, some relational models do not scale to our problems. For instance, we tried Problog [49], but the software crashed after running for a few days even for simple models.

Our baselines include:

• Mean: corresponding to always predicting the mean of the training data.
• Matrix factorization: corresponding to the best performing matrix factorization model according to Kazemi et al.
(2017a)'s experiments: the relation matrix is factorized into entity latent properties using the factorization technique in [127]; then, using Weka [83], a linear/logistic regression model is learned over the latent and observed properties of the target entities. We also tried several other variants (including RESCAL's algorithm [165]), but Kazemi et al. (2017a)'s model was the best performing.
• RDN-Boost [155]: can be viewed as an extension of random forests with ways to aggregate over multiple observations, thus enabling random forests to be applicable to relational data.
• K-nearest neighbors collaborative filtering: finds k entities with observed labels that are similar to the target entity in terms of the relationships they participate in, creates a feature as the weighted mean of the labels of the similar entities, and feeds this feature along with the other features of the target entity to a classification/regression model. For the MovieLens and Yelp datasets, collaborative filtering produces one feature, and for the PAKDD dataset, it produces three features, one for each item prefix. We used the cosine similarity function. The value of k was set using cross validation. As with matrix factorization, the linear and logistic regression of Weka were used for classification and regression.

5.4.3 Implementation and learning methodology

We implemented RelNNs in Java11. For all RelNN and RLR/MLN models in our experiments, we used fixed structures (i.e., a fixed set of WFs and connections among them) and learned the parameters using stochastic gradient descent with multiple restarts, selecting the parameters offering the lowest training error. The structure of the RelNN model for the Movielens dataset is that of Fig 5.2 extended to include all ages and occupations, with 2 numeric latent properties, 3 RLLs and 3 RALs. The structures of the RelNN models used for the PAKDD15 and Yelp datasets are as in Fig 5.3 and Fig 5.5 respectively. We leave the problem of learning these structures automatically from data as future work.
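The collaborative-filtering feature described above can be sketched as follows (a minimal illustration with hypothetical toy data and function names; the thesis used Weka for the downstream classifier): for a target entity, the k most similar labeled entities are found by cosine similarity over their relationship rows, and the similarity-weighted mean of their labels becomes a single feature.

```python
import numpy as np

def knn_cf_feature(target_row, labeled_rows, labels, k=2):
    # Cosine similarity between the target's relationship row and each
    # labeled entity's row (small epsilon guards against zero norms).
    sims = labeled_rows @ target_row / (
        np.linalg.norm(labeled_rows, axis=1) * np.linalg.norm(target_row) + 1e-12)
    top = np.argsort(-sims)[:k]  # indices of the k most similar entities
    # Similarity-weighted mean of the neighbors' labels.
    return sims[top] @ labels[top] / (sims[top].sum() + 1e-12)

# Toy relationship rows and observed 0/1 labels (illustrative values).
rows = np.array([[1., 1, 0, 0], [1, 0, 1, 1], [0, 1, 1, 0]])
labels = np.array([1., 0., 1.])
feat = knn_cf_feature(np.array([1., 1, 1, 0]), rows, labels, k=2)
```

The resulting `feat` would then be fed, together with the entity's other observed features, into a standard classifier or regressor.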
The structures of the RLR/MLN models are similar to those of the RelNN models but without hidden layers and numeric latent properties. Backpropagation for the RLR/MLN case corresponds to the discriminative parameter learning of [96].

11 Code: https://github.com/Mehran-k/RelNN

Figure 5.5: RelNN structure used for the Yelp dataset. Reviewed(u, b), FastFood(b) and SeaFood(b) are observed atoms. N1(u) and N2(u) are numeric latent properties. H1(b), H2(b), and H3(b) are rule-based hidden properties. S1(b), S2(b), S3(b), and S4(b) are pre-activation values for the rule-based hidden properties (and for the predictions on Mexican(b)).

To avoid numerical inconsistencies when applying our models to the PAKDD15 dataset, we assumed that for each item, there is a male and a female who have viewed it. For all experiments, we split the data into 80/20 percent train/test.

We imposed a Laplacian prior (corresponding to L1-regularization) on all our parameters (weights and numeric latent properties) for automatic feature selection. For classification, we further regularized our model predictions towards the mean of the training set using a hyperparameter λ as Prob = λ ∗ mean + (1 − λ) ∗ ModelSignal, with λ selected on a portion of the training set held out as a validation set. This regularization alleviates the overconfidence of the model and avoids numerical inconsistencies arising when taking the logs of the predictions. Note that this regularization corresponds to adding an extra layer to the network.

We report the accuracy, indicating the percentage of correctly classified instances, the mean squared error (MSE), and the log loss.
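The shrinkage towards the training mean can be sketched in a few lines (illustrative values only): it bounds predictions away from 0 and 1, which keeps the log loss finite even when the raw model signal is fully confident.

```python
import math

def shrink(model_signal, train_mean, lam):
    # Pull the model's probability towards the training-set mean.
    return lam * train_mean + (1 - lam) * model_signal

# An overconfident raw prediction of 1.0, shrunk with lam = 0.1 towards
# a (hypothetical) training mean of 0.71:
p = shrink(model_signal=1.0, train_mean=0.71, lam=0.1)  # ≈ 0.971
loss_if_wrong = -math.log(1 - p)  # finite, unlike -log(1 - 1.0)
```

Without shrinkage, a single confidently wrong prediction would make the log loss infinite; with it, the loss stays bounded.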
We conducted each experiment 10 times and report the mean and standard deviation.

5.4.4 Comparative results

Our experiments include gender prediction for the Movielens and PAKDD15 datasets, age prediction for the Movielens dataset, and predicting the type of food for the Yelp restaurants dataset. Even though age is divided into 7 categories in the MovieLens dataset, we treat it as a continuous variable to see how RelNNs perform in predicting continuous variables. To make the categories more realistic, for people in age category/interval [i, j], we assume the age is (i + j)/2, i.e., the mean of the interval.

Table 5.1: Performance of different learning algorithms based on accuracy, log loss, and MSE ± standard deviation on three different tasks. NA means the method is not directly applicable to the prediction task/dataset. The best performing method is shown in bold. For models where the standard deviation was zero, we did not report it in the table.

Task             | Measure  | Mean     | Matrix Factorization | Collaborative Filtering | RDN-Boost | RLR/MLN          | RelNN
MovieLens Gender | Accuracy | 0.7088   | 0.7550 ± 0.0073      | 0.7510                  | 0.7234    | 0.7073 ± 0.0067  | 0.7902 ± 0.0051
MovieLens Gender | Log Loss | 0.8706   | 0.7318 ± 0.0073      | 0.7360                  | 0.8143    | 0.8441 ± 0.0255  | 0.6548 ± 0.0026
MovieLens Gender | MSE      | 0.2065   | 0.1649 ± 0.0026      | 0.1675                  | 0.1904    | 0.1987 ± 0.0067  | 0.1459 ± 0.0009
PAKDD Gender     | Accuracy | 0.7778   | NA                   | 0.8806                  | 0.7778    | 0.8145 ± 0.0388  | 0.8853 ± 0.0002
PAKDD Gender     | Log Loss | 0.7641   | NA                   | 0.5135                  | 0.8039    | 0.7224 ± 0.0548  | 0.5093 ± 0.0037
PAKDD Gender     | MSE      | 0.1728   | NA                   | 0.1026                  | 0.1842    | 0.1522 ± 0.0185  | 0.1009 ± 0.0008
Yelp Business    | Accuracy | 0.6168   | 0.6154 ± 0.0041      | 0.6712                  | 0.6156    | 0.6168 ± 0.0000  | 0.6927 ± 0.0077
Yelp Business    | Log Loss | 0.9604   | 0.9394 ± 0.0033      | 0.8757                  | 0.9458    | 0.9435 ± 0.0034  | 0.8531 ± 0.0090
Yelp Business    | MSE      | 0.2364   | 0.2300 ± 0.0012      | 0.2084                  | 0.2316    | 0.2309 ± 0.0010  | 0.2023 ± 0.0024
MovieLens Age    | MSE      | 156.0507 | 104.5967 ± 0.9978    | 90.8742                 | NA        | 156.0507 ± 0.0000 | 62.7000 ± 0.7812
For the first and last categories, we used 16 and 60 as the ages.

Table 5.1 compares RelNNs with other well-known relational learning algorithms as well as a baseline. It can be seen from the table that RelNNs outperform well-known relational learning models in terms of all three performance metrics. Note that all baselines use the same observed features. Manual feature engineering from relations is very difficult, so the challenge for the models is to extract useful features from relations. As explained in [114], the problem with the baselines is their lower modeling power and inappropriate implicit assumptions, not their engineered features. The results in Table 5.1 answer Q1.

5.4.5 Effect of numeric and rule-based latent properties

For Q2, we varied the number of hidden layers and numeric latent properties in the RelNN to see how they affect the performance. The results obtained for predicting the gender and age in the Movielens dataset can be seen in Fig 5.6. The results show that both hidden layers and numeric latent properties (especially the first one of each) have a great impact on the performance of the model. When hidden layers are added, as we conjectured in Motivation 1, the log loss and MSE improve substantially as the overconfidence of the model decreases. Note that adding layers only adds a constant number of parameters, but adding k numeric latent properties adds k ∗ |∆m| parameters. In Fig 5.6, a RelNN with 2 hidden layers and 1 numeric latent property has many fewer parameters than an MLN/RLR with no hidden layers and 2 numeric latent properties, but outperforms it.

5.4.6 Cold start and extrapolation

For Q3, we conduct two experiments:

1. We train a RelNN on a large population and test it on a small population. This experiment can be seen as testing how severely each model suffers from the cold start problem.
2. We train a RelNN on a small population and test it on a large population.
This experiment can be seen as testing how well these models extrapolate to larger populations.

Figure 5.6: Predicting (a) gender and (b) age on MovieLens using RelNNs with a different number of numeric latent properties and hidden layers. Each experiment has been conducted 10 times and the mean and standard deviation of the 10 runs are reported.

For the first experiment, we trained two models for predicting the gender of the users in the MovieLens dataset, both containing one numeric latent property, but one containing no hidden layers and the other containing one hidden layer. We only gave the first k ratings of the test users (obtaining the first k ratings is possible because the dataset is timestamped) to the model and recorded the log loss of the two models. We did this for different values of k and plotted the results; the obtained results can be seen in Fig 5.7. Note that since in the MovieLens dataset all users have rated at least 20 movies, the model has not seen a case where a user has rated fewer than 20 movies. That is, the model has been trained on large population sizes (≥ 20).
In this experiment, the cases in Fig 5.7 where k < 20 correspond to small populations (and cold start).

Figure 5.7: Results on predicting the gender when we only use the first k ratings of the test users and use all ratings in the train set for learning. The x-axis represents the number of ratings used for each test user (i.e., k) and the y-axis shows the log loss. Each experiment has been conducted 10 times and the average log loss is reported. The bars show the standard deviation.

When k = 0 (i.e., ignoring all ratings of the test users), both models had a similar performance (logloss ≈ −0.9176)12. As soon as we add one rating for the test users, the performance of the RelNN substantially improves, but the performance of the RLR/MLN model is not affected much. The gap between the two models is larger when the test users have fewer ratings (i.e., unseen cases), but it shrinks and becomes steady as the number of ratings increases and becomes closer to the number of ratings in the train set.

For the second experiment, for each user in the train set, we kept only the first r ratings, where 0 ≤ r ≤ 20 was generated randomly for each user. For the new dataset, we repeated the previous experiment and obtained the diagram in Fig 5.8. It can be seen in this diagram that the RLR/MLN model works best when we keep the first 15 ratings of the test users (i.e., k = 15), but for higher values of k, its performance starts to deteriorate. The performance of the RelNN model, on the other hand, improves even until k = 30. After this point, as k increases, the performance of both models deteriorates as they are being tested on population sizes much larger than the ones in the train set. However, the gap between the two models becomes much
However, the gap between the two models becomes much12This is not shown in the diagram in Fig 5.7 as the x-axis of the diagram is in log scale.805.5. Conclusion 0.740.760.780.80.820.840.860.881 10 100 1000Log Loss #ratings per test user RLR/MLNRelNNFigure 5.8: Results on predicting the gender when we only use the firstk ratings of the test users and use 0 ≤ r ≤ 20 ratings of each user (r isgenerated randomly for each user) in the train set for learning. The x-axisrepresents the number of ratings used for each test user (i.e. k) and they-axis shows the log loss. Each experiment has been conducted 10 times andthe average log loss is reported. The bars show the standard deviation.more for larger values of k, as the RLR/MLN model becomes overconfident.These results validate the hypothesis in [178] that as the population sizegrows, RLR/MLN becomes overconfident and predicts with probabilitiesclose to 0 and 1. The results also show how RelNNs address this issue andvalidate our Motivation 1.5.5 ConclusionIn this chapter, we developed a deep neural model for learning from relationaldata. we showed that this model outperforms several existing models onthree relational learning benchmarks. RelNNs can be advanced in severalways. The current implementation of RelNNs is a fresh non-parallelizedcode. It could be sped up by parallelization (e.g., by using TensorFlow[1] as the backbone), by compilation relational operations to lower-levellanguages (similar to [105]), or by using advanced database query operations.Learning the structure of the RelNN and the WFs automatically from data815.5. 
(e.g., by extending the structure learning algorithm in [65]), developing and adding relational versions of regularization techniques such as (relational) dropout [114, 201] and batch normalization [98], extending the numeric hidden properties in RelNNs to entity embeddings, and extending RelNNs for link prediction are other directions for future research.

Chapter 6

Joint Prediction

Joint prediction (aka collective prediction) refers to making a collective prediction or finding the probability distribution over several properties or relationships of entities given some observations. Consider a social network and assume we are interested in answering questions such as: What is the probability distribution over Alan and his wife, Mary, liking the movie Zootopia? To answer such a query, one needs to compute a joint probability.

Lifted relational models (such as Markov logic networks [59] and Problog [49]) have proved successful for representing and answering joint queries over relational data [50]. A major problem with these models, however, is that learning for these models requires inference. Since these models typically represent densely-connected graphical models with many random variables, traditional inference algorithms tend to be slow on these models. Therefore, if traditional inference algorithms are used for these models, inference will be the bottleneck for learning, and learning these models is only practical for very small problems.

To speed up inference, lifted inference algorithms have been developed that exploit both independence and symmetries in these models. In this chapter, we make four contributions to lifted inference:

• We design heuristics for selecting the order of applying lifted inference rules to theories. Our heuristics improve over min-fill, a heuristic widely used in traditional inference algorithms.
• We demonstrate empirically that compiling lifted inference to low-level languages is more efficient than compiling it to data structures.
We develop a compiler which takes as input a theory and compiles it into a C++ program. We also design experiments to provide intuition on why compiling to low-level languages is more effective than compiling to data structures.
• We expand the class of theories for which lifted inference runs in time polynomial in the number of entities. This result is established by reviving a forgotten rule for lifted inference called domain recursion. Expanding this class enables searching over a larger class of models when learning the structure of lifted relational models.
• We show that for some theories with existential quantifiers, one may use domain recursion instead of Skolemization and get a lower time complexity.

We begin by introducing lifted inference and then dive deep into each of the above four contributions.

6.1 Lifted Inference

In this section, we provide a high-level description of lifted inference. We start with an overview of traditional and lifted inference algorithms. Then we motivate why one may prefer lifted inference over traditional inference algorithms. Finally, we describe how lifted inference works and provide an overview of its theoretical aspects.

6.1.1 Inference for probabilistic graphical models

Probabilistic graphical models, including Bayesian networks and Markov networks [46, 173], are representations of dependencies among random variables. Even though exact inference in these models is known to be NP-hard [40], variable elimination [53, 234] and recursive conditioning [44, 52] are two simple algorithms for exact inference which are efficient for graphs with a low treewidth. The former is a dynamic programming approach and the latter is a search-based method. Both of these algorithms select an order for the variables and eliminate (branch on) the variables in that order to find the result of the given query.
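The summing-out step at the heart of variable elimination can be sketched as follows (a toy illustration with hypothetical factors, not tied to any particular system): eliminating a variable multiplies all factors that mention it and sums the variable out of the product.

```python
import itertools

def eliminate(factors, var):
    # Factors are (scope, table) pairs: scope is a tuple of variable
    # names, table maps assignments (tuples of 0/1) to values.
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    vars_ = sorted({v for scope, _ in touching for v in scope})
    new_scope = tuple(v for v in vars_ if v != var)
    table = {}
    for assign in itertools.product([0, 1], repeat=len(vars_)):
        a = dict(zip(vars_, assign))
        val = 1.0
        for scope, tab in touching:       # multiply touching factors
            val *= tab[tuple(a[v] for v in scope)]
        key = tuple(a[v] for v in new_scope)
        table[key] = table.get(key, 0.0) + val  # sum out `var`
    return rest + [(new_scope, table)]

# Two toy factors over binary variables A and B: f1(A) and f2(A, B).
f1 = (('A',), {(0,): 0.6, (1,): 0.4})
f2 = (('A', 'B'), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
result = eliminate([f1, f2], 'A')  # a single factor over B
```

Running `eliminate` repeatedly over all non-query variables, in a chosen order, yields the answer to the query; the order determines the size of the intermediate factors.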
The elimination ordering for variable elimination is the reverse of the branching ordering for recursive conditioning with the same additions and multiplications [44]; here we refer to the elimination order even if we are doing search.

Both variable elimination and recursive conditioning do inference on the original data structure of the model. Knowledge compilation approaches [43] aim at compiling the original data structure of the model into a secondary data structure on which inference queries can be answered more efficiently, and then use the secondary data structure for answering queries. Examples of secondary data structures include variants of clique trees (aka junction trees) [100, 133, 147, 194], variants of d-DNNFs [45, 106, 180] and (probabilistic) SDDs [47, 121]. Huang and Darwiche [95] show how exhaustive search algorithms (such as recursive conditioning) can be turned into knowledge compilers by recording the trace of the search (evaluating the search algorithm symbolically instead of numerically).

Knowledge compilation approaches have two phases: 1- an offline phase for compiling the original data structure of the model into a secondary data structure, and 2- an online phase for receiving observations and queries and finding the probability of the query variables given the observations on the secondary data structure. The offline phase needs to be done only once. Knowledge compilation approaches aim at pushing the bulk of the operations to the offline phase, thus speeding up query answering in the online phase.

The performance of all these methods is highly dependent on the elimination order (or its variants, e.g., v-trees [47]). Since finding an ordering which results in the minimum total state space or in a width equal to the treewidth of the graph is NP-hard [6, 125], in practice heuristics are used to choose the order.
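One such heuristic, min-fill (the baseline our lifted-ordering heuristics are later compared against), greedily eliminates the variable whose elimination would add the fewest fill edges between its neighbors. A minimal sketch, not the thesis's implementation:

```python
def min_fill_ordering(graph):
    # `graph` maps each variable to the set of its neighbors.
    g = {v: set(nbrs) for v, nbrs in graph.items()}
    order = []
    while g:
        def fill_cost(v):
            # Edges needed to make v's neighborhood a clique.
            nbrs = list(g[v])
            return sum(1 for i in range(len(nbrs))
                       for j in range(i + 1, len(nbrs))
                       if nbrs[j] not in g[nbrs[i]])
        v = min(g, key=fill_cost)      # variable with fewest fill edges
        for a in g[v]:                  # connect v's neighbors pairwise
            for b in g[v]:
                if a != b:
                    g[a].add(b)
        for a in g[v]:                  # remove v from the graph
            g[a].discard(v)
        del g[v]
        order.append(v)
    return order

# Chain a-b-c-d with chord a-c: a, b and d cost 0 fill edges; c costs 2.
order = min_fill_ordering({'a': {'b', 'c'}, 'b': {'a', 'c'},
                           'c': {'a', 'b', 'd'}, 'd': {'c'}})
```

On this toy graph the heuristic defers eliminating `c`, the only variable whose elimination would add fill edges.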
Among these heuristics, minimum-fill, minimum-size and minimum-weight are known to produce good orderings [54, 123, 189]. Other methods in the literature for choosing an elimination ordering include maximum cardinality search [205], LEX M [186] and MCS-M [17]. There are also heuristics based on local search methods such as simulated annealing [123], genetic algorithms [132] and tabu search [36].

6.1.2 Inference for lifted relational models

The problem of lifted probabilistic inference was first explicitly proposed by Poole [179], who formulated the problem in terms of parametrized random variables, introduced the use of branching to complement unification, the parametric factor (parfactor) representation of intermediate results, and an algorithm for summing out parametrized random variables and multiplying parfactors in a lifted manner. This work was advanced by the introduction of counting formulae and the development of counting elimination algorithms [51, 151, 204]. The main problem is that these proposals are based on variable elimination. Variable elimination requires a representation of the intermediate results, and the current representations for such results are not closed under all of the operations used for inference.

An alternative to variable elimination is to use search-based methods based on conditioning, such as recursive conditioning [44], AND-OR search [52] and other related works (e.g., Bacchus et al. [9]). The advantage of these methods is that conditioning simplifies the representations rather than complicating them, and these methods exploit context-specific independence [25] and determinism. The use of lifted search-based inference was proposed in [102], [79] and [177]. These methods take as input a probabilistic relational model, a query, and some observations, and output the probability of the query given the observations.
To answer a new query, although we can reuse the cache, many of the operations must be repeated.

Van den Broeck [216] follows a knowledge compilation approach to lifted inference by recording the trace of a search-based lifted inference algorithm and extracting a data structure called FO-NNF, on which many inference queries can be efficiently answered. Lifted inference by knowledge compilation offers huge speedups because the compilation is done only once and then many queries can be answered efficiently by reusing the compiled model.

In this dissertation, we study exact lifted inference algorithms, as they can be considered the gold standard and many approximate lifted inference algorithms rely on exact algorithms. Exact lifted inference has applications in (tractable) lifted learning [220], where the main task is to efficiently compute partition functions, and provides the foundation for a rich literature on approximate lifted inference and learning [2, 28, 99, 101, 116, 126, 170, 197, 213, 221]. Besides speeding up inference for lifted relational models, the ideas developed for (exact) lifted inference have applications in other domains, e.g., in question answering over open-world probabilistic databases [30]. An overview of lifted inference algorithms can be found in several surveys [117, 120, 145, 211].

6.1.3 Why lifted inference?

The following example, taken from [217], shows how exploiting symmetries may help speed up inference and consequently why one may prefer lifted inference algorithms over traditional inference algorithms.

Example 35. Consider a deck of cards where all 52 cards are faced down on the ground as in Fig 6.1(a), and assume we need to answer queries such as the probability of the first card being the queen of hearts (QofH), or the probability of the second card being a heart given that the first card is the QofH.
The answers to these two queries are obviously 1/52 and 12/51.

Suppose we decide to build a model for answering such queries automatically, in which we consider one variable for each card whose domain is the set of all suit-number pairs, and impose a probability distribution on these variables. Traditionally, (marginal and conditional) independence among variables has been the source of tractability of probability distributions. The Markov network corresponding to this model is a fully connected graph as in Fig 6.1(b). That is, no pair of variables in this model is marginally or conditionally independent of each other. Using traditional reasoning algorithms, reasoning in this system requires O(52^52) computations.

Figure 6.1: A simple example taken from [218] indicating the effects of exploiting symmetry for probabilistic inference.

The reason why we humans can answer such queries more easily than the traditional reasoning algorithms lies in the notion of symmetry. Two random variables are symmetric if swapping the two random variables does not change the probability distribution. In Fig 6.1(a), the probability of the first variable (card) being the QofH is the same as the probability of the second, third, or any other variable (card) being the QofH. Swapping/renaming two variables in Fig 6.1(b) will result in the exact same probability distribution. That is, the variables are pair-wise symmetric. When a group of random variables are pair-wise symmetric, one can exploit this symmetry to speed up reasoning by treating the whole group identically rather than treating each member of the group separately.

Symmetry appears in many relational models; entities about which we have the same information can be assumed symmetric.
Traditional reasoning algorithms must be extended with the capability to exploit symmetry to answer such queries in polynomial time.

6.1.4 Lifted inference through weighted model counting

We describe lifted inference using the weighted model count (WMC) formulation as it abstracts away the lifted relational model being used. In this dissertation, for ease of exposition, we assume that all input theories for WMC are clausal. Furthermore, unless stated otherwise, we assume the input theories do not contain existential quantifiers, as the rules of lifted inference have been designed for theories with no existential quantifiers. The latter can be achieved using the Skolemization procedure of Van den Broeck et al. [214], which efficiently transforms a theory T with existential quantifiers into a theory T′ without existential quantifiers that has the same WMC. That is, our theories are sets of finite-domain, function-free first-order clauses whose logvars are all universally quantified (and typed with a population).

Furthermore, when a clause mentions two logvars x1 and x2 typed with the same population ∆x, or a logvar x typed with population ∆x and a constant C ∈ ∆x, we assume they refer to different entities13.

Converting inference for lifted relational models into WMC

For many lifted relational models, (lifted) inference can be converted into a WMC problem. As an example, consider a Markov logic network (MLN) [183] with the following weighted formulae:

〈w1, ϕ1〉
〈w2, ϕ2〉
. . .
〈wk, ϕk〉

For every WF 〈wi, ϕi〉 of this MLN, let theory T have a sentence ∀xi ∈ ∆xi : Auxi(xi) ⇔ ϕi, where xi consists of all logvars appearing in ϕi. That is, T has the following sentences:

∀x1 ∈ ∆x1 : Aux1(x1) ⇔ ϕ1
∀x2 ∈ ∆x2 : Aux2(x2) ⇔ ϕ2
. . .
∀xk ∈ ∆xk : Auxk(xk) ⇔ ϕk

Assuming Φ(Auxi) = exp(wi), Φ̄(Auxi) = 1, and Φ and Φ̄ are 1 for the other predicates, the partition function of the MLN is equal to WMC(T). Such conversions can be done for Problog, parfactor graphs, relational Bayesian networks, and many other lifted relational models.

Example 36. Consider an MLN with the following weighted formulae:

〈1.2, Teenager(x) ∧ Friend(x, y) =⇒ Teenager(y)〉
〈2.2, Friend(x, y) =⇒ Friend(y, x)〉

Let theory T contain the following two sentences:

∀x, y ∈ ∆x : Aux1(x, y) ⇔ (Teenager(x) ∧ Friend(x, y) =⇒ Teenager(y))
∀x, y ∈ ∆x : Aux2(x, y) ⇔ (Friend(x, y) =⇒ Friend(y, x))

Let Φ(Aux1) = exp(1.2), Φ(Aux2) = exp(2.2), and Φ of the other predicates be 1. Let Φ̄ of all predicates be 1. Then the weighted model count of T given Φ and Φ̄ is equivalent to the partition function of the MLN.

13 Equivalently, we can disjoin x1 = x2 or x = C to the clause. Note that any theory without this restriction can be transformed into a theory conforming to this restriction.

Calculating the WMC of a first-order theory

We now describe a set of rules, Rules, that can be applied to a theory to find its WMC efficiently; for more details, readers are directed to [212], [177] or [79]. We use the following theory T with two clauses and four atoms (S(x, m), R(x, m), T(x) and Q(x)) as our running example, and assume Φ(Q) = 2, Φ̄(Q) = 1.2, and Φ and Φ̄ are 1 for the other atoms:

∀x ∈ ∆x, m ∈ ∆m : Q(x) ∨ R(x, m) ∨ S(x, m)
∀x ∈ ∆x, m ∈ ∆m : S(x, m) ∨ T(x)

Lifted Decomposition: Assume we ground x in T. Then the clauses mentioning an arbitrary Xi ∈ ∆x are:

∀m ∈ ∆m : Q(Xi) ∨ R(Xi, m) ∨ S(Xi, m)
∀m ∈ ∆m : S(Xi, m) ∨ T(Xi)

These clauses are totally disconnected from the clauses mentioning Xj ∈ ∆x (j ≠ i), and are the same up to renaming Xi to Xj. Given the symmetry of the entities, we can calculate the WMC of only the clauses mentioning Xi and raise the result to the power of the number of connected components (|∆x|).
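As a sanity check (not part of the thesis's implementation), the lifted-decomposition identity can be verified on the running example by brute-force enumeration over small domains, using the weights above: Φ(Q) = 2, Φ̄(Q) = 1.2, and weight 1 for all other atoms in either truth value.

```python
from itertools import product

def wmc(nx, nm):
    # Brute-force weighted model count of the running example:
    # clauses Q(x) v R(x,m) v S(x,m) and S(x,m) v T(x) for all x, m.
    atoms = ([('Q', x) for x in range(nx)] + [('T', x) for x in range(nx)] +
             [('R', x, m) for x in range(nx) for m in range(nm)] +
             [('S', x, m) for x in range(nx) for m in range(nm)])
    total = 0.0
    for bits in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, bits))
        sat = all(v[('Q', x)] or v[('R', x, m)] or v[('S', x, m)]
                  for x in range(nx) for m in range(nm)) and \
              all(v[('S', x, m)] or v[('T', x)]
                  for x in range(nx) for m in range(nm))
        if sat:
            w = 1.0
            for x in range(nx):          # only Q has non-unit weights
                w *= 2.0 if v[('Q', x)] else 1.2
            total += w
    return total

# Lifted decomposition: WMC(T) = WMC(T1)^{|Dx|}, here with |Dx| = 2.
check = abs(wmc(2, 2) - wmc(1, 2) ** 2) < 1e-6  # True
```

For |∆m| = 2, WMC(T1) = 55.6, and the full count over |∆x| = 2 entities is exactly its square, as the rule predicts.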
Assuming T1 is the theory that results from substituting x with Xi, WMC(T) = WMC(T1)^|∆x|.

Case-Analysis: The WMC of T1 can be computed by a case analysis over different assignments of values to a ground atom, e.g., Q(Xi). Let T2 and T3 represent T1 ∧ Q(Xi) and T1 ∧ ¬Q(Xi) respectively. Then, WMC(T1) = WMC(T2) + WMC(T3). We follow the process for T3 (the process for T2 is similar), having clauses:

¬Q(Xi)
∀m ∈ ∆m : Q(Xi) ∨ R(Xi, m) ∨ S(Xi, m)
∀m ∈ ∆m : S(Xi, m) ∨ T(Xi)

Unit Propagation: When a clause in the theory has only one literal, we can propagate the effect of this clause through the theory and remove it14. In T3, ¬Q(Xi) is a unit clause. Having this unit clause, we can simplify the second clause and get the theory T4 having clauses:

∀m ∈ ∆m : R(Xi, m) ∨ S(Xi, m)
∀m ∈ ∆m : S(Xi, m) ∨ T(Xi)

Then WMC(T3) = 1.2 ∗ WMC(T4), where 1.2 comes from Φ̄(Q).

Lifted Case-Analysis: Case analysis can be done in a lifted way for atoms having one logvar. Consider S(Xi, m) in T4. Due to the symmetry of the entities, we do not have to consider all possible truth value assignments to all groundings of S(Xi, m), but only the ones where the number of entities M ∈ ∆m for which S(Xi, M) is True (or equivalently False) is different. This means considering |∆m| + 1 cases suffices, corresponding to S(Xi, M) being True for exactly j = 0, . . . , |∆m| entities. Note that we must multiply by the binomial coefficient (|∆m| choose j) to account for the number of ways one can select j out of |∆m| entities. Let T4j represent T4 with two more clauses as follows:

∀m ∈ ∆mT : S(Xi, m)
∀m ∈ ∆mF : ¬S(Xi, m)
∀m ∈ ∆m : R(Xi, m) ∨ S(Xi, m)
∀m ∈ ∆m : S(Xi, m) ∨ T(Xi)

where ∆mT represents the j entities in ∆m for which S(Xi, M) is True, and ∆mF represents the other |∆m| − j entities. Then the WMC of T4 can be calculated as: WMC(T4) = Σ_{j=0}^{|∆m|} (|∆m| choose j) ∗ WMC(T4j).

Shattering: In T4j, the entities in ∆m are no longer symmetric: we know different things about those in ∆mT and those in ∆mF. In particular,
In particular, for entities in ∆mT, we know that S(Xi,m) is True, and for entities in ∆mF, we know that S(Xi,m) is False. We need to shatter every clause having entities coming from ∆m to make the theory symmetric. To do so, the clause ∀m ∈ ∆m : R(Xi,m) ∨ S(Xi,m) in T4j must be shattered into ∀m ∈ ∆mT : R(Xi,m) ∨ S(Xi,m) and ∀m ∈ ∆mF : R(Xi,m) ∨ S(Xi,m) (and similarly for the other formulae). The shattered theory T5j, after unit propagation, will have the clauses:

∀m ∈ ∆mF : R(Xi,m)
∀m ∈ ∆mF : T(Xi)

Footnote 14: Note that unit propagation may remove clauses and random variables from the theory. To account for them, for each removed atom Q(x̄), smoothing multiplies the WMC by (Φ(Q) + Φ̄(Q))^(∏_{x∈x̄} |∆x|).

Decomposition, Caching, and Grounding: In T5j, the two clauses have different atoms, i.e., they are disconnected. In such cases, we apply decomposition, i.e., we find the WMC of each connected component separately and return the product. The WMC of the theory can be found by continuing to apply the above rules. In all the above steps, after finding the WMC of each (sub-)theory, we store the result in a cache so we can reuse it if the same WMC is required again. By following these steps, one can find the WMC of many theories in polynomial time. However, if we reach a point where none of the above rules are applicable, we ground one of the populations, which makes the process exponential in the number of entities.

6.1.5 Domain-liftability

Domain-liftability studies the classes of theories for which lifted inference is efficient. The following notion provides a formal definition of efficiency.

Definition 11. A theory is domain-liftable [215] if its WMC can be calculated in time polynomial in the domain sizes. A class C of theories is domain-liftable if ∀T ∈ C, T is domain-liftable.

Two main classes of domain-liftable theories are FO2 [214, 215] and recursively unary theories [177].

Definition 12. A theory is in FO2 if all its clauses have two or fewer logvars.

Definition 13.
A theory T is recursively unary (RU) if for every theory T′ resulting from applying rules in Rules, except for lifted case-analysis, to T until no more rules apply, either T′ has no sentences, or there exists some atom S(…, x, …) in T′ containing only one logvar x, such that for any ∆xT ⊂ ∆x and ∆xF ⊂ ∆x where ∆xT ∪ ∆xF = ∆x and ∆xT ∩ ∆xF = ∅, the theory

∀xT ∈ ∆xT : S(…, xT, …) ∧ ∀xF ∈ ∆xF : ¬S(…, xF, …) ∧ T′

is RU.

Note that the time needed to check whether a theory is in FO2 or RU is independent of the domain sizes in the theory. For FO2, the membership check can be done in time linear in the size of the theory, whereas for RU, only a worst-case exponential (in the number of atoms) procedure is known. Thus, FO2 currently offers a faster membership check than RU, but as we show later, RU subsumes FO2. Thus, when selecting between FO2 and RU for, e.g., lifted learning purposes, there is a trade-off between fast membership checking and modeling power.

6.1.6 The domain recursion rule

Van den Broeck [215] considered another rule, called domain recursion, in the set of rules for calculating the WMC of a theory. His formulation of domain recursion modifies a domain ∆x by making one entity explicit: ∆x = ∆x′ ∪ {A} with A ∉ ∆x′. Next, clauses are rewritten in terms of ∆x′ and A while removing ∆x from the theory entirely. Then, by applying standard rules in Rules to this modified theory, the problem is reduced to a WMC problem on a theory identical to the original one, except that ∆x is replaced with the smaller domain ∆x′. For theories where it is possible to follow such a procedure, Van den Broeck [215] showed that the WMC of the theory can be computed in time polynomial in the domain sizes using dynamic programming.

Example 37. Suppose we have a theory whose only clause is the hard constraint:

∀x, y ∈ ∆p : ¬Friend(x, y) ∨ Friend(y, x)

stating that if x is friends with y, then y is also friends with x.
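Before walking through the lifted derivation, it is worth pinning down the target value. Since the clause forces Friend(x, y) and Friend(y, x) to agree while the diagonal atoms Friend(x, x) are unconstrained, the model count with all weights 1 is 2^(n(n−1)/2 + n) = 2^(n(n+1)/2), where n = |∆p|. The following sketch (ours, not code from the thesis) checks this by brute force and also mirrors the recurrence WMC(n) = 2^n · WMC(n−1) that the domain-recursion derivation below arrives at:

```python
from itertools import product

def brute_force_mc(n):
    """Count assignments to Friend over n entities that satisfy
    forall x, y: not Friend(x, y) or Friend(y, x), with all weights 1."""
    count = 0
    for bits in product([False, True], repeat=n * n):
        friend = [bits[i * n:(i + 1) * n] for i in range(n)]
        if all(not friend[x][y] or friend[y][x]
               for x in range(n) for y in range(n)):
            count += 1
    return count

def domain_recursion_mc(n):
    """WMC via the domain-recursion recurrence: making one entity A explicit
    leaves n-1 pairs forced to agree (2 choices each), one unconstrained atom
    Friend(A, A) (2 choices), and the same theory over n-1 entities."""
    return 1 if n == 0 else 2 ** n * domain_recursion_mc(n - 1)
```

Both functions agree with the closed form 2^(n(n+1)/2) for small n; the brute-force version is exponential in n², while the recurrence is the polynomial-time computation that the domain recursion rule enables.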
One way to calculate the WMC of this theory is by grounding only one entity in ∆p and then using Rules. Let A be an entity in ∆p and let ∆p′ = ∆p − {A}. We can (using domain recursion) rewrite the theory as:

∀x ∈ ∆p′ : ¬Friend(x,A) ∨ Friend(A, x)   (6.1)
∀y ∈ ∆p′ : ¬Friend(A, y) ∨ Friend(y,A)   (6.2)
∀x, y ∈ ∆p′ : ¬Friend(x, y) ∨ Friend(y, x)   (6.3)

Lifted case-analysis on Friend(p′, A) and Friend(A, p′), shattering, and unit propagation give:

∀x, y ∈ ∆p′ : ¬Friend(x, y) ∨ Friend(y, x)

This theory is equivalent to our initial theory, with the only difference being that the population of people has decreased by one. By keeping a cache of the values of each sub-theory, one can verify that this process finds the WMC of the above theory in time quadratic in the domain size.

6.2 Elimination ordering for lifted inference

This section presents our results on designing heuristics for elimination ordering in lifted inference. The results presented in this section have been published in Kazemi and Poole [104].

Lifted inference algorithms are sensitive to the elimination (branching) order. Some of them place constraints on the elimination ordering for correctness, but still leave many choices of possible orderings.

We propose and evaluate several heuristics for elimination ordering in lifted inference. First, we present examples showing that the elimination ordering heuristics used for probabilistic graphical models are not suitable for lifted inference. Then we examine the problems with these heuristics and propose other heuristics that solve them. Since, to the best of our knowledge, there have been no previous attempts to optimize the elimination ordering for relational models, we compare our results with the best heuristics for traditional inference algorithms.

The examples and the elimination ordering heuristics are described for parfactor graphs [179]. It is straightforward to modify these heuristics for other lifted relational models.
The order in which atoms are eliminated is the reverse of the order in which we do lifted case-analysis in search-based lifted inference.

6.2.1 Inefficiency of existing heuristics

Most heuristics for elimination ordering in probabilistic graphical models try to minimize the width (the number of random variables in a factor, or the number of atoms in a parfactor) of the largest generated factor. While this works for those models, it is not a good measure of complexity for lifted inference in relational models. Among the existing heuristics for probabilistic graphical models, the minimum-fill greedy heuristic is the one most widely used in practice [54]. This heuristic successively eliminates a variable that produces the fewest fill-edges, and it often breaks ties arbitrarily. We use the short term "min-fill" to refer to the minimum-fill greedy heuristic. The following examples provide intuition on why min-fill is not a good heuristic for lifted inference. We use the intuition from the examples to motivate new heuristics that address these problems. Rather than breaking ties randomly, in the coming examples we break them alphabetically.

Example 38. Consider a parfactor graph with parfactors {A,B}, {B,C(x)} and {A,C(x)} (Fig 3.2 (b)) and suppose |∆x| = n. If we use min-fill to choose the elimination ordering, all atoms produce zero fill-edges. Breaking the tie alphabetically, A is the first atom in the order. However, eliminating A produces the parfactor {B,C(x)}, which needs 2 ∗ (n+1) branches to be evaluated (2 branches for B = True and B = False, and in each of these branches, (n+1) branches for C, where the i-th branch represents the case where i of the entities are True and the rest are False).
But if C were the first atom in our order, eliminating C would result in the parfactor {A,B}, which needs only 2 ∗ 2 branches to be evaluated.

This example shows that breaking ties arbitrarily, as min-fill does, is not a good choice for relational models; a better alternative is to break ties according to the population size.

Example 39. Consider a parfactor graph with parfactors {A,B(x)}, {A,C(x)}, {B(x),C(x)}, {B(x),D(x)}, {C(x),E(x)} and {D(x),E(x)} (Fig 3.2 (c)) and suppose |∆x| = n. If we use min-fill, A is the first atom eliminated because it is the only atom that introduces no fill-edges (in search-based lifted inference, A is the last atom we do lifted case-analysis on). Then, lifted case-analysis on B, C, D and E generates many branches because all these atoms share the logvar x.

Consider what happens when A is at the end of the elimination order. In this case, we first branch on A. Then, all atoms have the same logvar and lifted decomposition can be applied, so we effectively have a non-relational problem.

This example represents the case of atoms having large population sizes. By placing these atoms at the beginning of the elimination order (equivalently, at the end of the lifted case-analysis order), we may be able to apply lifted decomposition, which offers a significant computational benefit.

Example 40. Consider a parfactor graph with parfactors {A(x),B}, {A(x),C}, {A(x),D}, {B,C} and {B,D} (Fig 3.2 (d)) and suppose |∆x| = n. Using any method that tries to minimize the width of the largest generated parfactor, A is not the first atom in the elimination ordering. In min-fill, for example, C or D is the first atom in the order, depending on how we break the tie. However, eliminating C or D results in the parfactor {A(x),B}, which needs (n+1) ∗ 2 branches to be evaluated. Now suppose A is the first atom in the order.
In this case, eliminating A results in a parfactor {B,C,D} which needs only 2 ∗ 2 ∗ 2 branches.

This example shows that the width of the largest generated parfactor is not the only important factor to take into account: the population sizes of the logvars in its atoms also matter. A parfactor with two atoms whose logvars are typed with large populations needs many more branches to be evaluated than a parfactor with three atoms having no logvars, and avoiding the generation of such a parfactor should be encouraged more strongly.

Example 41. Consider a parfactor graph with three parfactors {A,B(x, y)}, {B(x, y),C(x, z)} and {C(x, z),D} (Fig 3.2 (e)) and suppose |∆x| = |∆y| = |∆z| = n. Using min-fill, the elimination order is 〈A,B,C,D〉, meaning that after branching on D, it branches on C. Since C has more than one logvar, we need to ground one of them. However, the elimination ordering 〈B,C,A,D〉 only branches on atoms with fewer than two logvars and does not require grounding any population.

This example demonstrates that it is a good idea to place atoms with more than one logvar at the beginning of the elimination ordering (equivalently, at the end of the branching ordering) to avoid grounding as much as possible.

6.2.2 Proposed heuristics

The examples in the previous section motivate the following intuitions: we should eliminate atoms having logvars with large population sizes sooner than others; we should pay attention to the size of the newly generated parfactor rather than just its width; and atoms with more than one logvar should be at the end of the branching ordering. Based on these observations, we propose new heuristics.

We define the population size of an atom A to be C(∏_{i=1}^{m} |∆xi| + |range(A)| − 1, |range(A)| − 1), where C stands for combinations, m is the number of logvars of the atom, xi is its i-th logvar, and |range(A)| is the number of values in the range of the atom.
Note that this formula corresponds to the number of branches when branching on the atom. The reason the number of branches for the atom A is equal to the given combination is that each branch represents a particular assignment of values in range(A) to the entities; thus the number of branches equals the number of ways we can assign values to |range(A)| numbers K1, K2, …, K|range(A)| so that they sum to ∏_{i=1}^{m} |∆xi|, and this count is given by the above combination.

We also define the population size of a parfactor to be ∏_{i=1}^{n} (1 + |pop(atomi)|), where n is the number of atoms in the parfactor and atomi is its i-th atom. These two definitions are used in our proposed heuristics.

MinTableSize heuristic

The MinTableSize heuristic successively eliminates the atom having no more than one logvar that results in a parfactor with minimum population size.

Algorithm 1 MinTableSize algorithm
  nextAtom ← ∅
  maxPopSize ← 0
  maxNumLogvars ← 2
  for i = 1 to size(atoms) do
    if numLogvars(atoms[i]) > maxNumLogvars or (numLogvars(atoms[i]) = maxNumLogvars and |pop(atoms[i])| > maxPopSize) then
      maxNumLogvars ← numLogvars(atoms[i])
      maxPopSize ← |pop(atoms[i])|
      nextAtom ← atoms[i]
  if nextAtom ≠ ∅ then
    return nextAtom
  minTableSize ← +∞
  for i = 1 to size(atoms) do
    f ← parfactor resulting from eliminating atoms[i] in parfactors
    if |pop(f)| < minTableSize or (|pop(f)| = minTableSize and |pop(atoms[i])| > |pop(nextAtom)|) then
      minTableSize ← |pop(f)|
      nextAtom ← atoms[i]
  return nextAtom

This resembles the intuition behind the operation selection of [151], in which a cost is assigned to each possible operation by computing the total size of the factors generated by the operation, and the operation with the minimum cost is performed next. Using the MinTableSize heuristic, ties may occur.
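The two population-size definitions can be written down directly; the following is a sketch (the helper names are ours, not from the thesis implementation):

```python
from math import comb, prod

def atom_pop_size(domain_sizes, range_size=2):
    """Population size of an atom: C(prod|dx_i| + |range(A)| - 1, |range(A)| - 1),
    i.e., the number of lifted branches when branching on the atom."""
    groundings = prod(domain_sizes)  # prod([]) == 1 for an atom with no logvars
    return comb(groundings + range_size - 1, range_size - 1)

def parfactor_pop_size(atoms):
    """Population size of a parfactor: the product over its atoms of
    (1 + pop(atom)). Each atom is given as the list of its logvars' domain sizes."""
    return prod(1 + atom_pop_size(a) for a in atoms)
```

For a Boolean atom with a single logvar of population n this gives n + 1 branches, matching the branch counts in Examples 38-40; a propositional Boolean atom gets 2.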
We break ties according to the population size of the atoms, where atoms with a larger population size are eliminated sooner.

Atoms with more than one logvar are placed at the beginning of the order, sorted by the number of logvars they contain; among those with the same number of logvars, the one with the largest population size is eliminated first. This heuristic is consistent with our observations in the examples. The algorithm for selecting the next atom to be eliminated using MinTableSize is shown in Algorithm 1. The first loop tries to find atoms with more than one logvar and selects one with the maximum number of logvars, breaking ties by maximum population size. If there is no such atom, the second loop tries to find the atom that produces the smallest parfactor, breaking ties according to the population size.

Population order heuristic

Another relational heuristic can be motivated by our first observation: eliminating atoms having logvars with large population sizes sooner than others. We call this the population order heuristic; it successively branches on the atom with the smallest population size. This heuristic can be implemented in two ways. The first is to sort the atoms according to their population sizes at the beginning and use that as the branching order. The second, which allows for a dynamic elimination ordering, is to determine the atom with the smallest population size among the remaining atoms and branch on it.

Relational minimum-fill heuristic

One way to design a relational elimination ordering heuristic is to extend min-fill so that it also considers population sizes. The intuition behind min-fill is that minimizing the number of fill-edges reduces the chance of generating a factor with a large width in the future.
Therefore, min-fill weighs all fill-edges equally and successively eliminates an atom that minimizes the sum of these weights.

In relational models, however, fill-edges can have different sizes. Larger fill-edges can degrade efficiency more severely than small fill-edges, so they should receive a higher weight. We take the weight of each fill-edge to be its table size: the product of the population sizes of the two atoms it connects. Like min-fill, the relational minimum-fill heuristic successively eliminates an atom that minimizes the sum of these weights. Since each weight represents the size of a particular table, separate from the other tables, minimizing the sum of the weights is reasonable.

6.2.3 Evaluation of heuristics

In this section, we evaluate and compare our proposed relational heuristics (MinTableSize, population order, and relational min-fill) and the non-relational min-fill heuristic, one of the best-known heuristics for non-relational models, based on a series of empirical and theoretical criteria.

Empirical efficiency

Table 6.1: Runtimes (in seconds) of lifted inference using different heuristics for elimination ordering

#   parfactor graph and population sizes of logvars                                  min-fill   rel. min-fill   pop. order   MinTableSize
1   {A,B},{B,C(x)},{C(x),A}, |∆x|=30                                                 2.854      2.854           0.046        0.046
2   {A,B(x)},{A,C(x)},{B(x),C(x)},{B(x),D(x)},{C(x),E(x)},{D(x),E(x)}, |∆x|=5        79.245     79.245          0.134        0.134
3   {A(x),B},{A(x),C},{A(x),D},{B,C},{B,D}, |∆x|=25                                  2.768      2.768           0.192        0.192
4   {A,B(x,y)},{B(x,y),C(x,z)},{C(x,z),D}, |∆x|=3, |∆y|=5, |∆z|=5                    > 300      > 300           0.178        0.178
5   {A,B(x)},{B(x),C(x)}, |∆x|=10                                                    1.530      1.530           0.016        0.016
6   {A(x)},{A(x),B(x),C(y)},{A(x),D(y)},{D(y),E,F,G},{D(y),H(z)}, |∆x|=5, |∆y|=10, |∆z|=3    9.617      9.617       > 300     6.618
7   {A},{A,B},{B,C},{C,D},{C,G},{D,E,G},{A,E,F},{E,H}                                0.460      0.460           0.734        0.567
8   {A(x)},{A(x),B(x),C(y)},{A(x),D(y)},{D(y),E,F,G},{D(y),H(z)}, |∆x|=5, |∆y|=50, |∆z|=3    172.163    172.163     > 300     104.751
9   {A,B(x)},{B(x),C(x)},{D(x),E(x)},{E(x),F(x)},{G(x)}, |∆x|=10                     1.555      1.555           0.062        0.062
10  {A(x),B(y)},{A(x),C(z)},{B(y),D(x)},{C(z),D(x)}, |∆x|=5, |∆y|=10, |∆z|=15        69.410     29.400          22.157       21.919
11  {A(x),B(y)},{A(x),E},{D(x,y),E},{C,D(x,y)},{C,B(y)}, |∆x|=7, |∆y|=18             > 300      0.746           0.746        0.746
12  {A,B(x),E(x)},{C(x),D(y),B(x)},{E(x),C(x)},{C(x),B(x)},{A,D(y)}, |∆x|=2, |∆y|=5  17.785     2.420           2.445        2.409
13  {A(x),B(y)},{A(x),C(y)},{A(x),D(y)},{E(y),B(y)},{E(y),C(y)},{E(y),D(y)}, |∆x|=20, |∆y|=2  14.678   6.245    10.026       6.245
14  {A(x),B(y),C(y)},{A(x),D(y)},{E(x),B(y),C(y)},{E(y),D(y)}, |∆x|=15, |∆y|=3       > 300      192.925         34.411       34.710
15  {A(x),B(y),C(x)},{B(y),C(x),F(z)},{A(x),F(z),D(w)},{B(y),E(w)},{C(x),D(w)}, |∆x|=2, |∆y|=5, |∆z|=10, |∆w|=30  256.192  256.192  56.880  56.274
16  {A(x),B(x,y)},{B(x,y),C(y)},{A(x),D,E},{C(y),F}, |∆x|=6, |∆y|=16                 > 300      > 300           1.273        0.926
17  {A(x),B(y)},{B(y),E,F},{A(x),C,D}, |∆x|=5, |∆y|=10                               0.882      0.882           1.886        0.882
18  {A(x),B(x,y)},{B(x,y),C(x,y),D(y)},{A(x),D(y)}, |∆x|=4, |∆y|=7                   183.353    183.353         0.220        0.540

To evaluate how efficiently each heuristic works, we generated parfactor graphs as follows. First, we generated a random number of atoms having logvars
with random populations, and then generated a random number of parfactors. Then, we randomly assigned some of the atoms to each parfactor. Since all the elements that distinguish one parfactor graph from another are generated randomly, this procedure can cover the whole space of parfactor graphs, and the generated graphs are reasonable representatives on which to base the experiments. The reason for using synthesized graphs for evaluation is that the benchmark graphs used in other works are not big enough to serve as good representatives for evaluating elimination ordering heuristics.

We performed search-based inference on these parfactor graphs using each of the elimination ordering heuristics mentioned earlier and recorded the runtimes. For each parfactor graph and each heuristic, we ran the program 10 times and computed the average, which is reasonable as the algorithm is deterministic. We set the maximum allowable time for inference to five minutes (300 seconds). If the program could not find the result in this time, we stopped it and reported > 300.

Figure 6.2: Runtime of lifted search-based inference as a function of population size for different heuristics. From Table 6.1, these are (a) the 6th parfactor graph, (b) the 10th parfactor graph, (c) the 12th parfactor graph, and (d) the 13th parfactor graph. In all cases, the x-axis is |∆y| and the y-axis is runtime in milliseconds. In (a), min-fill and relational min-fill overlap. In (c), MinTableSize and population order overlap. In (d), MinTableSize and min-fill overlap.
Results are presented in Table 6.1. Runtimes of lifted inference using each of the heuristics are reported in seconds. The first four rows of the table are the parfactor graphs of Examples 38, 39, 40 and 41 respectively; the others are parfactor graphs generated as described earlier.

From the table, we can see that MinTableSize outperforms the other heuristics in nearly all cases. There are only a few cases where one of the other heuristics performs better than MinTableSize.

Furthermore, we can see that the population order heuristic results in a better runtime than min-fill and relational min-fill in most cases. However, it performs very poorly on some parfactor graphs. The reason is that the population order heuristic does not take into account how big the newly generated parfactors are.

We can also see that min-fill and relational min-fill produce the same results in most cases. This is mostly because in some parfactor graphs, we can successively eliminate an atom that produces no fill-edges, or the generated fill-edges have the same population sizes. In these cases, both min-fill and relational min-fill choose the same atom to be eliminated and have the same performance. However, in the cases where they produced different results, relational min-fill performs better than min-fill.

Adaptation to population change

To see how each heuristic performs as the population sizes of the logvars change, we used four parfactor graphs from Table 6.1 (the 6th, 10th, 12th and 13th) and varied the population size of one of the logvars to see how the runtime of lifted search-based inference with different heuristics is affected. These four parfactor graphs were chosen because different heuristics suggest different orders for them in most cases, and they each have a logvar on whose population size the inference performance is highly dependent.
Results are shown in Fig 6.2, where the x-axis shows the population size of the logvar being changed and the y-axis shows the runtime. We can see that MinTableSize does a better job of finding a new efficient elimination order as we change the population sizes of the logvars.

Avoiding grounding populations

The following proposition proves that the MinTableSize heuristic avoids unnecessary grounding of a population. The proof of the proposition can be found in Appendix A.

Proposition 7. If any order results in not grounding any population, then MinTableSize produces an order that results in not grounding any population.

Even though the intuition behind the population order heuristic helps avoid grounding populations, the above proposition does not hold for this heuristic:

Example 42. Consider a parfactor graph with only one parfactor {A(x),B(y, z)} and suppose |∆x| = 3000, |∆y| = 50 and |∆z| = 50. The population order heuristic selects B as the first atom to branch on because it has a smaller population size than A. Since B has more than one logvar, the search-based inference algorithm grounds one of the populations. This is a case where grounding a population is avoidable, but the population order heuristic ends up grounding one.

6.3 L2C: Compiling lifted inference to low-level languages

This section presents our work on designing a compiler that compiles lifted inference operations into low-level programs. The results presented in this section have been published in Kazemi and Poole [105].

Motivated by the advantages offered by knowledge compilation, Van den Broeck (2013) follows a knowledge compilation approach to lifted inference by evaluating a search-based lifted inference algorithm symbolically (instead of numerically) and extracting a data structure on which many inference queries can be efficiently answered.
In this section, we also take a knowledge compilation approach, but instead of extracting a data structure, we extract a low-level program. C++ is chosen as the target low-level language because of its efficiency, availability, efficient libraries, and available compiling and optimizing packages. We explain how each rule in Rules can be compiled into C++. The input theory is assumed to be clausal and to contain no existential quantifiers. The positive and negative weight functions are called posw and negw respectively in the C++ programs.

6.3.1 From lifted inference rules to C++ programs

In this section, we provide details on how each lifted inference rule can be compiled to a C++ program.

Compiling case-analysis to C++: Let T contain an atom S with no logvars, T1 represent T when S is replaced with True, and T2 represent T when S is replaced with False. Then case-analysis on S generates the following C++ code:

  {Code for storing WMC(T1) in v1}
  {Code for storing WMC(T2) in v2}
  v3 = posw[S] * v1 + negw[S] * v2;

Compiling lifted case-analysis to C++: Let T be a theory and S(x) be a unary atom in T. Let Ti be the theory resulting from branching on S(x) being True for i out of |∆x| entities and False for the rest. Then lifted case-analysis on S(x) generates the following C++ code:

  v1 = 0;
  for(i = 0; i <= popSize[x]; i++){
    {Code for storing WMC(Ti) in v2}
    v1 += choose(popSize[x], i) * pow(posw[S], i)
          * pow(negw[S], popSize[x] - i) * v2;
  }

Compiling decomposition to C++: Let T be a theory which consists of two connected components T1 and T2 (this can easily be extended to more than two connected components). Then decomposition on theory T generates the following C++ code:

  {Code for storing WMC(T1) in v1}
  {Code for storing WMC(T2) in v2}
  v3 = v1 * v2;

Compiling lifted decomposition to C++: Let T be a theory, x be a logvar in T on which lifted decomposition can be applied, and T′ be the theory resulting from replacing x in T by some entity X ∈ ∆x.
Lifted decomposition then generates the following C++ code:

  {Code for storing WMC(T′) in v2}
  v1 = pow(v2, popSize[x]);

Compiling unit propagation to C++: Let T be a theory containing a unit clause ∀x : S(x) (the cases where S is negated or has more or fewer logvars are similar). Let T′ be the theory resulting from propagating S(x). Then unit propagation on S(x) for theory T generates the following C++ code:

  {Code for storing WMC(T′) in v2}
  v1 = pow(posw[S], popSize[x]) * v2;

Compiling shattering and caching to C++: Shattering does not produce any C++ code, but prepares the theory for applying other rules to it. Before applying any rules, we check whether the cache contains the theory T. If it does, then we generate the following C++ code:

  v1 = cache[T];

In the compilation stage, we keep a record of the cache entries that are used later and remove all other cache inserts from the C++ program.

6.3.2 Compilation example

Example 43. Consider compiling a theory T with the following sentences to C++:

∀x ∈ ∆x, m ∈ ∆m : Y(x,m) ⇔ R(x,m) ∧ S(x,m)
∀x ∈ ∆x, m ∈ ∆m : Z(x,m) ⇔ S(x,m) ∧ V(x)

where ∆x = {X1, X2, X3, X4, X5} and ∆m = {M1, M2}. Let Φ(Y) = 1.2, Φ(Z) = 0.2, and the rest of the weights be 1. Lifted decomposition can be applied to x. Let T′ represent T when lifted decomposition has been applied on x. L2C generates the following C++ code:

  {Code for storing WMC(T′) in v2}
  v1 = pow(v2, 5);

where 5 represents |∆x|. T′ has the following sentences:

∀m ∈ ∆m : Y(X1,m) ⇔ R(X1,m) ∧ S(X1,m)
∀m ∈ ∆m : Z(X1,m) ⇔ S(X1,m) ∧ V(X1)

Suppose we choose to do a lifted case-analysis on S(X1,m). Assuming T′i represents the theory when S(X1,m) is True for i out of |∆m| entities and False for the rest, L2C generates the following code:

  v2 = 0;
  for(int i = 0; i <= popSize[m]; i++){
    {Code for storing WMC(T′i) in v3}
    v2 += choose(popSize[m], i) * v3;
  }
  v1 = pow(v2, 5);

Note that we ignored the weights of assigning S to True and False in the code, as both weights are 1.
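As a sanity check on the lifted case-analysis template above: branching on a unary atom that is unconstrained by the remaining theory must simply recover (posw + negw)^popSize, by the binomial theorem. The generated loop can be mirrored in code (a sketch; the function and argument names are ours, with choose realized as math.comb):

```python
from math import comb

def lifted_case_analysis(pop_size, posw, negw, sub_wmc):
    """Mirror of the generated loop: sum over the number i of entities for
    which the unary atom is True, weighting each case by the number of ways
    to choose those entities and by the atom's positive/negative weights."""
    return sum(comb(pop_size, i) * posw ** i * negw ** (pop_size - i) * sub_wmc(i)
               for i in range(pop_size + 1))
```

With sub_wmc(i) = 1 (an unconstrained atom), the sum collapses to (posw + negw)^pop_size, as expected.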
Continuing this process and applying the other rules may generate code such as that in Fig 6.3(a).

(a)
 1. v2 = 0;
 2. for (int i = 0; i <= 2; i++){
 3.   v5 = 0;
 4.   for (int j = 0; j <= i; j++){
 5.     v8 = 1;
 6.     v7 = exp(1.2 * j) * v8;
 7.     v5 += choose(i, j) * v7;
 8.   }
 9.   v10 = 1;
10.   v9 = exp(0.2 * i) * v10;
11.   v12 = 1;
12.   v11 = 1 * v12;
13.   v6 = v9 + v11;
14.   v4 = v5 * v6;
15.   v3 = pow(2, 2-i) * v4;
16.   v2 += choose(2, i) * v3;
17. }
18. v1 = pow(v2, 4);

(b)
 1. v2 = 0;
 2. for (int i = 0; i <= 2; i++){
 3.   v5 = 0;
 4.   for (int j = 0; j <= i; j++)
 5.     v5 += choose(i, j) * exp(1.2 * j);
 6.   v2 += choose(2, i) * pow(2, 2-i) * v5 * (exp(0.2 * i) + 1);
 7. }
 8. v1 = pow(v2, 4);

Figure 6.3: (a) The C++ program for the theory in Example 43; the WMC is stored in v1. (b) The C++ program in part (a) after pruning.

6.3.3 Pruning

Since the program obtained from L2C is generated automatically (not by a developer), a post-pruning step may be needed to reduce the size of the program. For instance, one can remove lines 11 and 12 of the program in Fig 6.3(a) and replace line 13 with "v6 = v9 + 1;". The same can be done for lines 5 and 9. One may also notice that some variables are set to some values and are then used only once; in the program of Fig 6.3(a), v7 and v9 are two such variables. The program can be pruned by removing these lines and replacing the variables with their values wherever they are used. One can obtain the program in Fig 6.3(b) by pruning the program in Fig 6.3(a). Pruning can potentially save time and memory at runtime, but the pruning itself may be time-consuming.

Since our target language is C++, we can take advantage of the available packages developed for pruning/optimizing C++ programs. In particular, we use the −O3 flag when compiling our programs, which optimizes the code at compile time.
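As a quick consistency check that pruning preserves the computed value, the two programs in Fig 6.3 can be transliterated and compared (a sketch; choose and pow become math.comb and Python's ** operator, and the exponent 4 is as printed in the figure):

```python
from math import comb, exp

def wmc_unpruned():
    # Transliteration of the program in Fig 6.3(a)
    v2 = 0.0
    for i in range(3):
        v5 = 0.0
        for j in range(i + 1):
            v8 = 1
            v7 = exp(1.2 * j) * v8
            v5 += comb(i, j) * v7
        v9 = exp(0.2 * i) * 1
        v11 = 1 * 1
        v6 = v9 + v11
        v4 = v5 * v6
        v3 = 2 ** (2 - i) * v4
        v2 += comb(2, i) * v3
    return v2 ** 4

def wmc_pruned():
    # Transliteration of the pruned program in Fig 6.3(b)
    v2 = 0.0
    for i in range(3):
        v5 = sum(comb(i, j) * exp(1.2 * j) for j in range(i + 1))
        v2 += comb(2, i) * 2 ** (2 - i) * v5 * (exp(0.2 * i) + 1)
    return v2 ** 4
```

Both versions return the same value, which is what licenses the pruning shown in Fig 6.3(b).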
Using −O3 slightly increases the compile time, but substantially reduces the runtime when the program and the population sizes are large. We show the effect of pruning our programs in our experiments.

Figure 6.4: Runtimes for L2C using the MinTableSize (MTS) and MinNestedLoops (MNL) heuristics, WFOMC, and PTP on several benchmarks (in (a), the L2C diagrams for the two heuristics overlap).

6.3.4 MinNestedLoops heuristic

Compiling lifted inference into C++ programs introduces a new suitable criterion, namely the maximum number of nested loops (MNNL), that an elimination ordering heuristic can aim to minimize. MNNL is a suitable criterion to minimize because it is a good indicator of the time complexity of the inference. To design an elimination ordering heuristic that aims to minimize the MNNL, we start with the order obtained from the MinTableSize (MTS) heuristic, count the MNNL in the C++ program generated when using this order, and then perform k steps of stochastic local search on the order to minimize the MNNL. We call this heuristic MinNestedLoops (MNL). Note that the search is performed only once, in the compilation phase.
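The local search itself can be sketched generically; the cost function below is a hypothetical stand-in for counting the nested loops in the generated program, and the helper name is ours:

```python
import random

def local_search_order(order, cost, k, seed=0):
    """Start from an initial elimination order (e.g., the MinTableSize order)
    and try k random adjacent swaps, keeping a swap whenever it does not
    increase the cost of the order."""
    rng = random.Random(seed)
    best, best_cost = list(order), cost(order)
    for _ in range(k):
        cand = list(best)
        i = rng.randrange(len(cand) - 1)
        cand[i], cand[i + 1] = cand[i + 1], cand[i]
        cand_cost = cost(cand)
        if cand_cost <= best_cost:
            best, best_cost = cand, cand_cost
    return best
```

Because a swap is only kept when it does not increase the cost, the returned order is never worse than the initial one under the given cost function.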
The value of k can be set according to how large the network is, how large the population sizes are, how much time we want to spend on compilation, etc.

6.3.5 Empirical results for L2C

We implemented our L2C compiler in Ruby (footnote 15) and ran our experiments using Ruby 2.1.5 on a 2.8GHz core with 4GB RAM under Mac OS X. We used the g++ compiler with the −O3 flag to compile and optimize the C++ programs. We compared our results with weighted first-order model counting (WFOMC, footnote 16) and probabilistic theorem proving (PTP, footnote 17), the state-of-the-art lifted inference engines. We allowed at most 1000 seconds for each algorithm. We ran each query multiple times and report the average. When using MNL as our heuristic, we performed 25 iterations of local search.

Fig 6.4 shows the overall runtimes of L2C with MTS and with MNL, as well as the runtimes of WFOMC and PTP, for: a) the competing workshops network [151] with 50 workshops and varying #people; b) the link prediction network taken from the second experiment in [79], using conjunctive formulae and querying FutureProf(S), with 50 professors and varying #students; c) similar to (b) but with 50 students and varying #professors; d) similar to (b) and (c) but varying #students and #professors at the same time; e) a manually constructed MLN similar to the first experiment in [79] with the following WFs:

{ 〈w1, A(x) ∨ B(y)〉, 〈w2, A(x) ∨ C(x)〉, 〈w3, B(y) ∨ D(y)〉, 〈w4, C(x) ∨ D(y)〉, 〈w5, E ∨ D(y)〉 }

querying E with |∆x| = 100 and varying |∆y|; and f) similar to (e) but changing both |∆x| and |∆y| at the same time. The obtained results show that L2C (using either of the two heuristics) outperforms WFOMC and PTP by an order of magnitude on these networks.

Answering queries in our setting consists of three steps: 1) generating the program, 2) compiling it, and 3) running it. Fig 6.5 shows the amount of time spent on each step of the algorithm for the network in Fig 6.4(f) (using MNL).
Fig 6.6 compares the runtime when −O3 is not used (the compile time, in this case, decreased to 0.226 seconds) vs. when it is.

15 Code: https://github.com/Mehran-k/L2C
16 Available at: https://dtai.cs.kuleuven.be/software/wfomc
17 Available in Alchemy-2 [124]

Figure 6.5: Time spent on each step of L2C (generating, compiling, and running the program) for the network in Fig 6.4(f), as the population sizes of x and y vary.

Both the L2C and WFOMC systems compile the initial theory into a secondary structure and then reason with that secondary structure. Fig 6.4 compares end-to-end results: compiling to the secondary structure and reasoning with it. A natural question is where exactly the speedup in L2C comes from: faster compilation, or faster reasoning with the secondary structure. In order to answer this question, we measured the time spent by L2C (using MinNestedLoops) and WFOMC on each of the reasoning steps for three networks: 1- the MLN in Fig 6.4(f); 2- an MLN with the single WF

〈w, A(x) ∨ B(x) ∨ C(x,m) ∨ D(m) ∨ E(m) ∨ F〉;

and 3- another MLN with the single WF

〈w, A(x) ∨ B(x) ∨ C(x) ∨ D(x,m) ∨ E(m) ∨ F(m) ∨ G(m) ∨ H〉.

For the three networks, it takes L2C 0.173, 0.029, and 0.138 seconds, and WFOMC 0.768, 0.373, and 0.512 seconds, respectively, to generate their secondary structures. Fig 6.7 shows the time spent by L2C and WFOMC reasoning with their secondary structures18 for the three networks when the populations of the logvars vary at the same time (WFOMC could not solve the third network in less than 1000s for population sizes ≥ 500). The obtained results show that the compilation part takes almost the same amount of time in both L2C and WFOMC.

18 Reasoning with L2C programs is counted as the time spent compiling the C++ code plus the runtime.

Figure 6.6: The effect of pruning/optimizing the C++ programs (with −O3 vs. without −O3, as the population sizes of x and y vary).

Figure 6.7: The amount of time spent reasoning with the secondary structure in L2C and WFOMC on three benchmarks and for different population sizes.

For small population sizes, reasoning with WFOMC's data structures is more efficient because L2C's programs need a program compilation step. However, as the population size grows, the program compilation time becomes negligible and reasoning with the programs generated by L2C becomes much faster than reasoning with the data structures generated by WFOMC. As an example, it can be seen from the diagrams that reasoning with L2C's programs offers about a 163x speedup compared to WFOMC's data structures for the first network when the population sizes are 5000. It is also interesting to note that the slopes of the diagrams are the same, meaning that reasoning with both secondary structures has the same time complexity.

The previous experiment indicates that the speedup in L2C is mostly due to the reasoning step. A natural question is why reasoning with L2C's programs is more efficient than reasoning with WFOMC's data structures. We hypothesize that the speedup is due to the fact that L2C's programs can be 1- compiled and 2- optimized, while reasoning with WFOMC's data structures requires an interpreter: a virtual machine that executes the data structure node-by-node. Validating this hypothesis by comparing the runtimes of the L2C and WFOMC systems is not sensible, as there might be implementation or several other differences (e.g., case analysis order) between the two systems.

In order to test this hypothesis in an implementation-independent way, we used L2C to generate programs for the three networks in our previous experiment. For the reasoning step, we ran the programs in three different ways: 1- compiling and optimizing the programs using the −O3 flag, 2- compiling without optimizing the programs, and 3- running the programs using Ch 7.5.3, an interpreter for C++ programs19. The obtained results can be viewed in Fig 6.8. We also included the runtime of WFOMC in these diagrams. It can be seen that interpreting the C++ programs produces runtimes similar to working with WFOMC's data structures.

Figure 6.8: The amount of time spent reasoning with the programs generated by L2C when the program is interpreted, compiled, or compiled and optimized, as well as the amount of time spent reasoning with WFOMC's data structures, on three benchmarks and for different population sizes (WFOMC failed to produce an answer in less than 1000s for the third network when the population size was 500 or higher).
The diagram for WFOMC is slightly below the diagram for interpreting the C++ programs; one reason can be the non-optimality of the interpreter used. It is interesting to note that by interpreting the C++ programs for small population sizes, and compiling and optimizing them for larger population sizes, L2C's programs are always more efficient than WFOMC's data structures on our benchmarks.

19 Note that C++ interpreters are mostly used for teaching purposes and may not be highly optimized.

In order to compare the speedup caused by compilation (instead of interpretation) with the speedup caused by optimization, we measured the percentage of the speedup caused by each of them for our three benchmarks (we only considered the cases where both of them contributed to the speedup). We found that, on average, 99.7% of the speedup is caused by compilation and only 0.3% by optimization. For the largest population where interpreting produced an answer in less than 1000s, we found that compilation offers an average 175x speedup compared to interpretation. Furthermore, for the largest population where compilation produced an answer, we found that optimization offers an average 2.3x speedup compared to running the code without optimizing it.

6.4 Liftability Theory

This section presents our results on the domain-liftability of several theories and classes of theories. The results presented in this section have been published in Kazemi et al. [113] and [115].

When the domain recursion rule was introduced by Van den Broeck [215], it had a central role in the domain-liftability proof of FO2. Later on, Taghipour et al. [203] showed that simpler rules suffice to prove the domain-liftability of FO2, leading to the belief that the domain recursion rule is redundant. In this section, we identify several previously-unknown theories and classes of theories that are domain-liftable when using the domain recursion rule. We let Rules represent the previously identified lifted inference rules except domain recursion.

Definition 14. Let T be a theory and T′ be a re-writing of T in which a constant N ∈ ∆x is separated from its population. Domain recursion is bounded on T′ for ∆x if, after applying the standard rules in Rules to the modified theory until N is entirely removed from the theory T′, the problem is reduced to a WMC problem on a theory identical to T, except that ∆x is replaced by the smaller domain ∆x′.

When domain recursion is bounded for a theory, we can compute its WMC using dynamic programming. We refer to Rules extended with the bounded domain recursion (BDR) rule as RulesD.

6.4.1 Three theories that are only domain-liftable using bounded domain recursion

S4 Clause: Beame et al. [14] identified a clause (S4) with four binary atoms having the same predicate and proved that, even though the rules Rules in Section 6.1.4 cannot calculate the WMC of that clause, there is a polynomial-time algorithm for finding its WMC. They concluded that this set of rules Rules does not suffice for finding the WMC of theories, and asked for new rules that can handle their theory. We prove that adding domain recursion to the set achieves this goal.

Proposition 8. The theory consisting of the S4 clause ∀x1, x2 ∈ ∆x, y1, y2 ∈ ∆y : S(x1, y1) ∨ ¬S(x2, y1) ∨ S(x2, y2) ∨ ¬S(x1, y2) is domain-liftable using RulesD.

Symmetric Transitivity: Domain-liftable calculation of the WMC of the transitivity formula is a long-standing open problem. Symmetric transitivity is easier, as its model count corresponds to the Bell number, but solving it using general-purpose rules has been an open problem.
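The Bell-number correspondence can be checked by brute force on tiny domains: symmetry and transitivity together force F(x, x) whenever x is related to anything, so the models of the clauses given below are exactly the partial equivalence relations over ∆p, of which there are B(n + 1) for n entities. The following is a small verification sketch (an aid for intuition, not part of the lifted rules):

```python
from itertools import product

def bell(n):
    """Bell numbers B(0), ..., B(n) via the Bell triangle recurrence."""
    row, bells = [1], [1]
    for _ in range(n):
        new = [row[-1]]
        for v in row:
            new.append(new[-1] + v)
        row = new
        bells.append(row[0])
    return bells

def count_sym_trans(n):
    """Brute-force model count of the symmetric-transitivity clauses
    over a domain of n entities: all 2^(n*n) relations are checked."""
    dom = range(n)
    count = 0
    for bits in product([False, True], repeat=n * n):
        F = lambda x, y: bits[x * n + y]
        sym = all(not F(x, y) or F(y, x) for x in dom for y in dom)
        trans = all(not (F(x, y) and F(y, z)) or F(x, z)
                    for x in dom for y in dom for z in dom)
        count += sym and trans
    return count
```

For n = 3 this enumerates all 512 binary relations and finds 15 = B(4) models, matching the partial-equivalence-relation count.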
Consider the clauses ∀x, y, z ∈ ∆p : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z) and ∀x, y ∈ ∆p : ¬F(x, y) ∨ F(y, x), defining a symmetric transitivity relation. For example, ∆p may indicate the population of people and F may indicate friendship.

Proposition 9. The symmetric-transitivity theory is domain-liftable using RulesD.

Birthday Paradox: The birthday paradox problem [13] is to compute the probability that, in a set of n randomly chosen people, two of them have the same birthday. A first-order encoding of this problem requires computing the WMC of a theory with clauses ∀p ∈ ∆p, ∃d ∈ ∆d : Born(p, d), ∀p ∈ ∆p, d1, d2 ∈ ∆d : ¬Born(p, d1) ∨ ¬Born(p, d2), and ∀p1, p2 ∈ ∆p, d ∈ ∆d : ¬Born(p1, d) ∨ ¬Born(p2, d), where ∆p and ∆d represent the populations of people and days. The first two clauses impose the condition that every person is born on exactly one day, and the third clause states the "no two people are born on the same day" query.

Proposition 10. The birthday-paradox theory is domain-liftable using RulesD.

6.4.2 New domain-liftable classes: S2FO2 and S2RU

In this section, we identify new domain-liftable classes enabled by the bounded domain recursion rule.

Definition 15. Let α(S) be a clausal theory that uses a single binary predicate S, such that each clause has exactly two different literals of S. Let α = α(S1) ∧ α(S2) ∧ · · · ∧ α(Sn), where the Si are different binary predicates. Let β be a theory in which all clauses contain at most one Si literal, and the clauses that contain an Si literal contain no other literals with more than one logvar. Then S2FO2 and S2RU are the classes of theories of the form α ∧ β where β ∈ FO2 and β ∈ RU, respectively.

Theorem 1. S2FO2 and S2RU are domain-liftable using RulesD.

It can be easily verified that membership checking for S2FO2 and S2RU is not harder than for FO2 and RU, respectively.

Example 44. Suppose we have a set ∆j of jobs and a set ∆v of volunteers. Every volunteer must be assigned to at most one job, and every job requires no more than one person. If a job involves working with gas, the assigned volunteer must be a non-smoker. And we know that smokers are most probably friends with each other. Then we have the following first-order theory:

∀v1, v2 ∈ ∆v, j ∈ ∆j : ¬Assigned(v1, j) ∨ ¬Assigned(v2, j)
∀v ∈ ∆v, j1, j2 ∈ ∆j : ¬Assigned(v, j1) ∨ ¬Assigned(v, j2)
∀v ∈ ∆v, j ∈ ∆j : InvolvesGas(j) ∧ Assigned(v, j) ⇒ ¬Smokes(v)
∀v1, v2 ∈ ∆v : Aux(v1, v2) ⇔ (Smokes(v1) ∧ Friends(v1, v2) ⇒ Smokes(v2))

The predicate Aux is added to capture the probability assigned to the last rule (as in MLNs). This theory is not in FO2, is not in RU, and is not domain-liftable using Rules. However, the first two clauses are of the form described in Lemma 1, the third and fourth are in FO2 (and also in RU), and the third clause, which contains Assigned(v, j), has no other atoms with more than one logvar. Therefore, this theory is in S2FO2 (and also in S2RU) and is domain-liftable based on Theorem 1.

Example 45. Consider the birthday paradox introduced in Subsection 6.4.1. After Skolemization [214] to remove the existential quantifier, the theory contains ∀p ∈ ∆p, ∀d ∈ ∆d : S(p) ∨ ¬Born(p, d), ∀p ∈ ∆p, d1, d2 ∈ ∆d : ¬Born(p, d1) ∨ ¬Born(p, d2), and ∀p1, p2 ∈ ∆p, d ∈ ∆d : ¬Born(p1, d) ∨ ¬Born(p2, d), where S is the Skolem predicate. This theory is not in FO2, is not in RU, and is not domain-liftable using Rules. However, the last two clauses belong to the clauses in Lemma 1, and the first one is in FO2 (and also in RU) and has no atoms with more than one logvar other than Born. Therefore, this theory is in S2FO2 (and also in S2RU) and is domain-liftable based on Theorem 1.

Proposition 11. FO2 ⊂ RU, FO2 ⊂ S2FO2, FO2 ⊂ S2RU, RU ⊂ S2RU, S2FO2 ⊂ S2RU.

6.4.3 BDR+RU Class

Let BDR+RU be the class of theories defined identically to RU, except that Rules is replaced with RulesD in the definition.
It is straightforward to see that S2FO2 ⊂ BDR+RU and S2RU ⊂ BDR+RU. Note that, similar to RU, membership checking for BDR+RU is also independent of the population sizes. Our new domain-liftability results, the previously known results, and the newly introduced domain-liftable classes and theories are summarized in Fig 6.9. Currently, BDR+RU contains all theories for which we know a polynomial-time algorithm for computing the WMC exists. This leads us to the following conjecture.

Conjecture 1. BDR+RU is the largest possible class of domain-liftable theories.

This conjecture can also be stated as follows.

Conjecture 2. RulesD is a complete set of rules for domain-lifted inference.

6.4.4 Experiments and results

In order to see the effect of using domain recursion in practice, we find the WMC of three theories with and without using the domain recursion rule: (a) the theory in Example 44, (b) the S4 clause, and (c) the symmetric-transitivity theory. We implemented the domain recursion rule in C++ and compiled the code using the g++ compiler. We compare our results with the WFOMC-v3.0 software20. Since this software requires domain-liftable input theories, for the first theory we grounded the jobs, for the second we grounded ∆x, and for the third we grounded ∆p. For each of these three theories, assuming |∆x| = n for all logvars x in the theory, we varied n and plotted the runtime as a function of n. All experiments were done on a 2.8GHz core with 4GB RAM under MacOSX. The runtimes are reported in seconds. We allowed a maximum of 1000 seconds for each run.

20 Available at: https://dtai.cs.kuleuven.be/software/wfomc

Figure 6.9: A summary of the domain-liftability results, depicting the inclusions among FO2, RU, S2FO2, S2RU, and BDR+RU, along with FO3, FO4, the symmetric-transitivity theory (∀x, y, z : F(x, y) ∧ F(y, z) ⇒ F(x, z) and ∀x, y : F(x, y) ⇒ F(y, x)), and the S4 clause (∀x1, x2, y1, y2 : S(x1, y1) ∨ ¬S(x1, y2) ∨ S(x2, y2) ∨ ¬S(x2, y1)). FO2 and RU have been previously proved to be domain-liftable.
For FO3, it has been proved that at least one of its theories is not domain-liftable under standard complexity assumptions. We identified the new domain-liftable classes S2FO2 and S2RU and proved that FO2 ⊂ S2FO2, RU ⊂ S2RU, FO2 ⊂ RU, and S2FO2 ⊂ S2RU. We also proved that symmetric transitivity and the S4 clause are domain-liftable. The BDR+RU class contains all theories that are currently known to be domain-liftable.

The obtained results can be viewed in Fig 6.10. These results are consistent with our theory and indicate the clear advantage of using the domain recursion rule in practice. In Fig 6.10(a), the slope of the diagram for domain recursion is approximately 4, which indicates the degree of the polynomial for the time complexity. A similar analysis can be done for the results on the S4 clause and the symmetric-transitivity clauses shown in Fig 6.10(b) and (c). The slopes of these two diagrams are around 5 and 2 respectively, indicating that the time complexities for finding their WMCs are n^5 and n^2 respectively, where n is the size of the population.

Figure 6.10: Runtimes for calculating the WMC of (a) the theory in Example 44, (b) the S4 clause, and (c) symmetric transitivity, using the WFOMC-v3.0 software (which only uses Rules) compared to the case where we use the domain recursion rule, referred to as Domain Recursion in the diagrams.

6.4.5 Discussion

We can categorize theories with respect to the domain recursion rule as: (1) theories proved to be domain-liftable using domain recursion (e.g., S4, symmetric transitivity, and the theories in S2FO2), (2) theories that are domain-liftable using domain recursion but that we have not yet identified as such, and (3) theories that are not domain-liftable even when using bounded domain recursion. We leave discovering and characterizing the theories in categories 2 and 3 as future work. Here, however, we show that even though the theories in category 3 are not domain-liftable using bounded domain recursion, domain recursion may still result in exponential speedups for these theories.

Consider the (non-symmetric) transitivity rule:

∀x, y, z ∈ ∆p : ¬Friend(x, y) ∨ ¬Friend(y, z) ∨ Friend(x, z)

Since none of the rules in Rules apply to the above theory, the existing lifted inference engines ground ∆p and calculate the weighted model count (WMC) of the ground theory. By grounding ∆p, these engines lose great amounts of symmetry. Suppose ∆p = {A, B, C} and assume we select Friend(A, B) and Friend(A, C) as the first two random variables for case-analysis. Due to the symmetry of the entities, the case where Friend(A, B) and Friend(A, C) are assigned True and False respectively has the same WMC as the case where they are assigned False and True. However, the current engines fail to exploit this symmetry, as they consider grounded entities non-symmetric.

By applying domain recursion to the above theory instead of fully grounding it, one can exploit the symmetries of the theory. Suppose ∆p′ = ∆p − {P}.
Then we can rewrite the theory as follows:

∀y, z ∈ ∆p′ : ¬Friend(P, y) ∨ ¬Friend(y, z) ∨ Friend(P, z)
∀x, z ∈ ∆p′ : ¬Friend(x, P) ∨ ¬Friend(P, z) ∨ Friend(x, z)
∀x, y ∈ ∆p′ : ¬Friend(x, y) ∨ ¬Friend(y, P) ∨ Friend(x, P)
∀x, y, z ∈ ∆p′ : ¬Friend(x, y) ∨ ¬Friend(y, z) ∨ Friend(x, z)

Now if we apply lifted case analysis on Friend(P, y) (or, equivalently, on Friend(P, z)), we do not get back the same theory with a reduced population, and calculating the WMC is still exponential. However, the branch for the case where Friend(P, y) is True is generated only once. This branch covers both of the symmetric cases mentioned above. Exploiting these symmetries reduces the time complexity exponentially. Therefore, when no other rule is applicable to a theory, domain recursion (even when not bounded) should be the preferred way of grounding (compared to grounding a population) for lifted inference.

6.5 Domain Recursion and Existential Quantifiers

In this section, we introduce two theories with existential quantifiers for which the WMC can be calculated using domain recursion without Skolemizing. The results presented in this section have been published in Kazemi et al. [115].

Deck of cards problem: Consider the deck of cards problem in Example 35. Let ∆c be a population whose entities are the cards in a deck of cards, and let ∆p represent the possible positions for these cards. Suppose |∆c| = |∆p| = n (= 52). The following theory corresponds to Van den Broeck [217]'s formulation of the deck of cards problem. It states that for each card there exists a position, for each position there exists a card, and no two cards can be in the same position.

∀c ∈ ∆c, ∃p ∈ ∆p : S(c, p)
∀p ∈ ∆p, ∃c ∈ ∆c : S(c, p)
∀p ∈ ∆p, c1, c2 ∈ ∆c : ¬S(c1, p) ∨ ¬S(c2, p)

Proposition 12. The WMC of the theory for Van den Broeck [217]'s deck of cards problem can be calculated in O(n^2) using the rules in RulesD without Skolemization.
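For intuition about this theory's models: the second and third clauses force each position to hold exactly one card (reading the two-cards-one-position clause with c1 ≠ c2, as intended for an "at most one" constraint), and the first clause forces every card to appear somewhere; with |∆c| = |∆p| = n, the models are exactly the bijections between cards and positions, so the unweighted model count is n!. A brute-force check on tiny domains (a verification sketch under the c1 ≠ c2 reading, not the lifted algorithm itself):

```python
from itertools import product

def count_card_models(n):
    """Brute-force model count of the deck-of-cards theory over
    n cards and n positions: enumerate all 2^(n*n) relations S and
    keep those satisfying the three clauses."""
    cards = positions = range(n)
    count = 0
    for bits in product([False, True], repeat=n * n):
        S = lambda c, p: bits[c * n + p]
        # Every card appears in some position.
        every_card = all(any(S(c, p) for p in positions) for c in cards)
        # Every position holds some card.
        every_pos = all(any(S(c, p) for c in cards) for p in positions)
        # No two distinct cards share a position.
        at_most_one = all(not (S(c1, p) and S(c2, p))
                          for p in positions
                          for c1 in cards for c2 in cards if c1 != c2)
        count += every_card and every_pos and at_most_one
    return count
```

For n = 3 this enumerates 512 relations and finds 6 = 3! models; the bijections are exactly what survives all three clauses.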
For the theory in Proposition 12, we could alternatively apply Skolemization to the theory and then solve the Skolemized theory, which is as follows:

∀c ∈ ∆c, p ∈ ∆p : ¬S(c, p) ∨ A(c)
∀c ∈ ∆c, p ∈ ∆p : ¬S(c, p) ∨ B(p)
∀p ∈ ∆p, c1, c2 ∈ ∆c : ¬S(c1, p) ∨ ¬S(c2, p)

where A(c) and B(p) are the literals added to the theory as a result of Skolemization. One can verify that the time complexity of this alternative is O(n^2) as well21. However, by applying domain recursion directly to the non-Skolemized theory, we avoid dealing with negative weights in the implementation. Note that dealing with negative weights can be difficult, as the computations for WMC are typically done in log space.

21 In order to find the WMC of the theory, lifted case analysis can be applied to A(c) and B(p), then unit propagation can be applied to the unit clauses resulting from the first two clauses, then lifted decomposition can be applied to p in the third clause, and the WMC of the resulting theory is obvious. The time complexity is O(n^2) due to the two lifted case-analyses.

A reformulation of the deck of cards problem: The following theory is a reformulation of the deck of cards problem. It contains all sentences from Van den Broeck [217], plus a sentence stating that a position cannot belong to more than one card. Semantically, this theory is equivalent to Van den Broeck [217]'s theory.

∀c ∈ ∆c, ∃p ∈ ∆p : S(c, p)
∀p ∈ ∆p, ∃c ∈ ∆c : S(c, p)
∀p ∈ ∆p, c1, c2 ∈ ∆c : ¬S(c1, p) ∨ ¬S(c2, p)
∀p1, p2 ∈ ∆p, c ∈ ∆c : ¬S(c, p1) ∨ ¬S(c, p2)

Proposition 13. The WMC of the reformulated theory for the deck of cards problem can be calculated in O(n) using the rules in RulesD without Skolemization.

For the theory in Proposition 13, we could alternatively apply Skolemization to the theory and then solve the Skolemized theory, which is as follows:

∀c ∈ ∆c, p ∈ ∆p : ¬S(c, p) ∨ A(c)
∀c ∈ ∆c, p ∈ ∆p : ¬S(c, p) ∨ B(p)
∀p ∈ ∆p, c1, c2 ∈ ∆c : ¬S(c1, p) ∨ ¬S(c2, p)
∀p1, p2 ∈ ∆p, c ∈ ∆c : ¬S(c, p1) ∨ ¬S(c, p2)

where A(c) and B(p) are the literals added to the theory as a result of Skolemization. One can verify that the time complexity of this alternative is O(n^2) instead of linear22. This shows that, besides introducing negative weights that are difficult to deal with in the implementation, Skolemization may also increase the time complexity of WMC for some theories. Therefore, domain recursion (when applicable) may be a better option than Skolemization for theories with existential quantifiers.

Currently, the above two theories are the only theories for which we know bounded domain recursion can replace Skolemization (for other theories, we still need to apply Skolemization). Characterizing a class of theories with existential quantifiers for which domain recursion is bounded is an open problem.

22 In order to find the WMC of the theory, lifted case analysis can be applied to A(c) and B(p), then unit propagation can be applied to the unit clauses resulting from the first two clauses, then domain recursion can solve the two remaining clauses. The time complexity is O(n^2) because there are two lifted case-analyses followed by an O(n) domain recursion step, but O(n) of the operations are hit by the cache.

Chapter 7

Conclusion and Future Work

Statistical relational learning aims at developing representations, reasoning algorithms, and learning algorithms for domains described in terms of entities and relationships. Three fundamental problems in statistical relational learning are link prediction, property prediction, and joint prediction. Link prediction refers to predicting new links/relationships between entities given the existing ones.
Property prediction refers to predicting specific properties of entities based on their other properties and the relationships they participate in. Joint prediction refers to predicting the probability of multiple variables (entity properties or links) holding at the same time. This dissertation studied these three problems and contributed to each one.

For link prediction, this dissertation develops a model based on tensor factorization called SimplE. While tensor factorization approaches have been successfully applied to link prediction for several knowledge graphs, they still have to be improved and extended in many ways to be applicable to other real-world link prediction problems. For instance, they have to be extended to become applicable to relations beyond triples, to incorporate more background knowledge, to handle time, etc. SimplE has several nice properties that make it an ideal foundation for the next generation of tensor factorization models to be built on. These properties include simplicity, interpretability, efficiency, full expressiveness, the ability to incorporate some forms of background knowledge, and promising empirical performance.

For property prediction, this dissertation first shows the limitations of current relational learning models and then develops a relational neural network called RelNN. Combining the expressiveness of first-order logic with the generalization power of deep neural networks is a promising direction for building the next generation of relational models. RelNNs are quite flexible, making them an ideal framework for building deep relational models.

For joint prediction, this dissertation studies ways to speed up lifted inference and to expand the classes of models to which lifted inference is applicable. Speeding up lifted inference makes lifted relational models scale to larger domains.
Expanding the classes of models to which lifted inference is applicable enlarges the search space for structure learning in lifted relational models. The ideas introduced for lifted inference in this dissertation open up manifold avenues for further improving these algorithms, thus improving the speed (of learning and inference) and accuracy of lifted relational models.

This work opens up several directions for future research. The following sections provide an overview of these directions.

7.1 Future work for link prediction

1. Determining if/how background knowledge formulated in terms of rules other than the ones discussed in this dissertation can be incorporated into tensor factorization models through parameter-tying (e.g., determining if/how knowledge such as (E1, R1, E2) =⇒ (E1, R2, E2) can be incorporated into SimplE). Instead of parameter-tying, existing approaches (e.g., [55, 57, 81, 184, 185, 223, 228]) for incorporating background knowledge into embeddings rely mostly on changing the loss function or on post-processing steps that do not provide guarantees.

2. Finding tighter bounds on the lengths of the embeddings required for full expressiveness in models such as SimplE and ComplEx, and establishing full-expressiveness results for other existing tensor factorization models.

3. Verifying how existing models perform when using Dettmers et al. [56]'s 1-N scoring for generating negative triples.

4. Studying what characteristics of a dataset make each or every tensor factorization approach require smaller or larger embedding sizes (e.g., see [166]).

5. Creating ensembles of tensor factorization models (similar to [226]) with SimplE included as one of the base models.

6. Designing new representations based on tensor factorization for link prediction.

7. Extending RelNNs to be applicable to link prediction problems and testing their performance.

8. Designing representations based on lifted relational models for link prediction.

9. Designing new tensor factorization representations that are capable of being integrated with many types of background knowledge.

10. Applying tensor factorization techniques developed for link prediction over relational data to other settings (e.g., to visual relation detection, as done by [11, 235]).

11. Providing a better understanding of why one tensor factorization model may work better than another, or characterizing when one may expect a tensor factorization model to perform better than another.

12. Triples (corresponding to relations of arity 2) are universal, i.e., any relation can be represented in terms of triples. Representing relations of higher arities with relations of arity 2 requires reifying new entities. For instance, a relation Teaches(Emma, Data Structures, University of British Columbia) can be represented with three relations Teacher(Rei, Emma), Course(Rei, Data Structures), and School(Rei, University of British Columbia), where Rei is a reified entity. Existing tensor factorization approaches may not work out of the box in the presence of entity reification, as the entities reified during testing (or query answering) do not yet have embeddings. Extending tensor factorization approaches to be applicable to relations of arities higher than 2 is an important future direction.

13. Extending tensor factorization approaches to be applicable to nested triples such as (Ross, Believes, (Rachel, Likes, Zootopia)).

14. Extending SimplE and other tensor factorization approaches to be applicable to a sequence (instead of a set) of triples, or to be applicable when triples are only true during some time intervals (e.g., see [73, 135, 146]).

15. The performance of link prediction approaches is usually measured using mean reciprocal rank when asking questions of the type (?, R, T) and (H, R, ?). While current approaches assume the answer is always one of the entities in the data, in many real-world problems the answer can be "no entity".
As an example, if the question is(?,SonOf ,WhitneyHouston), the answer is “no one” as Whitney Hous-ton did not have any sons. Extending link prediction approaches topredict “no entity” is an interesting future direction of research. To-wards this goal, a new measure is required as mean reciprocal rankdoes not consider out of population cases.1217.2. Future work for property prediction16. Current approaches for learning word embeddings mostly rely on a rawtext [15, 150, 174]. Using SimplE (or other factorization approaches) tolearn word embeddings from raw text as well as structured relationshipsamong words (e.g., coming from WordNet) is an interesting futuredirection.From the above list, items 12, 13, 14, and 15 can have a large impact on theapplicability of tensor factorization approaches to a broader class of problemsand they are important directions for future research.7.2 Future work for property prediction1. Providing a better understanding of why RelNNs perform well, whatmay be their shortcomings, and how they can be improved.2. Implementing a parallel version of RelNNs using convolutional neuralnetwork primitives.3. Depending on the structure of a RelNN, it may have a high timecomplexity which depends on the cross-product of the entities of two (ormore) types. Determining what RelNNs structures result in tractablerepresentations is another future direction.4. Applying Chou et al. [33]’s quantization technique to RelNNs to tieparameters in numerical hidden properties.5. Designing a structure learning algorithm for RelNNs.6. Testing RelNNs on more relational learning domains.7. Designing relational versions of regularization techniques in deep learn-ing community such as dropout and batch normalization and addingthem to RelNNs.8. Extending RLR and RelNNs to be applicable when logical variablescorrespond to continuous, uncountable phenomena (such as time andspace).9. 
Using RelNNs for collective classification tasks using iterative classifi-cation [141] or other methods.1227.3. Future work for joint prediction10. Understanding the relationship between RLR (and RelNNs) with prob-abilistic soft logic [10, 119] and studying if/how structure learningalgorithms for probabilistic soft logic (e.g., Embar et al. [63]) can beemployed for RLR (and/or RelNNs).11. Designing tensor factorization models that are applicable to propertyprediction.From the above list, items 5 and 9 are the most promising directions forimproving the applicability and accuracy of RelNNs.7.3 Future work for joint prediction1. Designing heuristics that find (lifted) case-analysis orderings directlyfor search-based approaches.2. Bounding the cost of the lifted inference for input theories without(symbolically) running the algorithm (this has been done in part in[198], but with the revival of domain recursion, a new work on thistopic is required.).3. Testing the idea of compiling to low-level programs instead of datastructures for other knowledge compilation problems and secondarydata structures (e.g., [26, 45, 47, 74, 106, 121, 156, 180]).4. Determining if/how domain recursion can be compiled to a low-levelprogram and wired into L2C.5. Proving or disproving the domain-liftability of the non-symmetrictransitivity theory.6. Proving or disproving Conjecture 1 (or Conjecture 2) on BDR+RUbeing a complete class of domain-liftable theories.7. In case Conjecture 1 is disproved, an important research question is tocharacterize a larger class and ideally find a dichotomy between classesof models that are domain-liftable and those that are not. Such adichotomy was found for specific types of probabilistic databases [202]and for specific types of queries in [42], for which the authors receivedthe VLDB 10-year best paper award23 in 2014.23http://vldb.org/archives/10year.html1237.4. Future work for relational learning8. 
A rich study of membership-checking time complexities for domain-liftable classes.

9. Identifying whether domain recursion is bounded for a theory or not without applying/simulating it on the theory.

10. Characterizing the classes of theories with existential quantifiers for which domain recursion is bounded.

11. Extending lifted inference algorithms (and lifted relational models) with the ability to deal with uncountable phenomena (e.g., time and space).

12. Enabling tensor factorization techniques to be used for joint prediction.

13. Testing the performance of relational dependency networks for joint prediction when the conditional probabilities are learned using RelNNs, and facilitating inference in relational dependency networks by extending Lowd [142]'s approach for converting dependency networks to Markov networks, to relational dependency networks.

14. Incorporating the recently identified sources of symmetries, such as contextual symmetries [4] and variable-value symmetries [5] that have proved effective for lifted sampling approaches, into exact search-based lifted inference algorithms.

15. Designing new representations for joint prediction over relational data.

From the above list, items 6 and 7 can have a large theoretical impact and are among the most important directions for future research. Items 12 and 13 are also promising directions for developing the next generation of relational models for joint prediction.

7.4 Future work for relational learning

1. With several relational learning paradigms existing in the literature, a deeper study of the similarities and differences between these paradigms is necessary. In [108], we took a step towards this goal by studying the relationship between graph random walks and lifted relational models. Dumancic et al. [61] also take a step towards this goal by comparing lifted relational models and tensor factorization approaches.
However, a much deeper study of these models is still required to facilitate the design of the next generations of relational models.

2. Currently, there does not exist a single model which performs well on all three prediction tasks studied in this dissertation. Creating a single model which performs well for all three prediction tasks is a great future direction to follow. Such a model can be developed by combining models from different relational learning paradigms or can be completely novel.

3. In this dissertation, we studied three prediction problems and the models that work best for each problem. Studying other prediction problems for relational data and the models that work best for each would be a great direction for future research.

4. One challenge of working on statistical relational learning is the scarcity of interesting publicly available data. We believe creating or gathering such datasets and making them publicly available helps relational learning progress faster.

5. To make private relational datasets available to the community without violating data privacy, an interesting future direction would be to study the possibility of utilizing (Fully) Homomorphic Encryption (FHE) [75, 219]. FHE encrypts data in a way that mathematical operations can be performed on encrypted data, producing an encrypted result which, when decrypted, matches the result of the operations as if they had been performed on non-encrypted data. Utilizing FHE requires devising relational models based on the operations that are allowed in FHE. Such models have been recently designed for other domains; see, e.g., [77, 88].

6.
Building models that combine the power of deep neural networks for pattern recognition and relational learning for high-level reasoning (as in, e.g., [109, 148]) to take computational intelligence to the next level.

All the above 6 items are among important directions for the rapid advancement of statistical relational learning.

Bibliography

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] Babak Ahmadi, Kristian Kersting, and Sriraam Natarajan. Lifted online training of relational models with stochastic gradient methods. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 585–600, 2012.

[3] Paul Allison. Logistic regression using SAS: theory and application. SAS Publishing, 1999.

[4] Ankit Anand, Aditya Grover, Parag Singla, et al. Contextual symmetries in probabilistic graphical models. In International Joint Conference on Artificial Intelligence, 2016.

[5] Ankit Anand, Ritesh Noothigattu, Parag Singla, et al. Non-count symmetries in Boolean & multi-valued prob. graphical models. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[6] Stefan Arnborg, Derek G. Corneil, and Andrzej Proskurowski. Complexity of finding embeddings in a k-tree. SIAM Journal on Algebraic Discrete Methods, 8(2):277–284, 1987.

[7] Vojtech Aschenbrenner. Deep relational learning with predicate invention. Master's thesis, Czech Technical University in Prague, 2013.

[8] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer, 2007.

[9] Fahiem Bacchus, Shannon Dalmao, and Toniann Pitassi. Solving #SAT and Bayesian inference with backtracking search.
Journal of Artificial Intelligence Research, pages 391–442, 2009.

[10] Stephen H Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. Hinge-loss Markov random fields and probabilistic soft logic. Journal of Machine Learning Research, 2015.

[11] Stephan Baier, Yunpu Ma, and Volker Tresp. Improving visual relationship detection using semantic modeling of scene descriptions. In International Semantic Web Conference, pages 53–68. Springer, 2017.

[12] VK Balakrishnan. Graph theory, volume 1. McGraw-Hill, New York, 1997.

[13] W. W. Rouse Ball. Other questions on probability. Mathematical Recreations and Essays, page 45, 1960.

[14] Paul Beame, Guy Van den Broeck, Eric Gribkoff, and Dan Suciu. Symmetric weighted first-order model counting. In PODS, pages 313–328, 2015.

[15] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

[16] Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. Swoosh: a generic approach to entity resolution. The VLDB Journal: The International Journal on Very Large Data Bases, 18(1):255–276, 2009.

[17] Anne Berry, Jean R. Blair, Pinar Heggernes, and Barry W. Peyton. Maximum cardinality search for computing minimal triangulations of graphs. Algorithmica, 39(4):287–298, 2004.

[18] Indrajit Bhattacharya and Lise Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007.

[19] Mikhail Bilenko, Sugato Basu, and Mehran Sahami. Adaptive product normalization: Using online learning for record linkage in comparison shopping. In Data Mining, Fifth IEEE International Conference on, 8 pp. IEEE, 2005.

[20] Yvonne M. Bishop, Stephen E. Fienberg, and Paul W. Holland. Discrete multivariate analysis: theory and practice.
Springer Science & Business Media, 2007.

[21] Jesús Bobadilla, Fernando Ortega, Antonio Hernando, and Abraham Gutiérrez. Recommender systems survey. Knowledge-Based Systems, 46:109–132, 2013.

[22] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In ACM SIGMOD, pages 1247–1250. ACM, 2008.

[23] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Irreflexive and hierarchical relations as translations. arXiv preprint arXiv:1304.7158, 2013.

[24] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NeurIPS, pages 2787–2795, 2013.

[25] Craig Boutilier, Nir Friedman, Moises Goldszmidt, and Daphne Koller. Context-specific independence in Bayesian networks. In Proceedings of Uncertainty in Artificial Intelligence, pages 115–123, 1996.

[26] Tanya Braun and Ralf Möller. Lifted junction tree algorithm. In Joint German/Austrian Conference on Artificial Intelligence (Künstliche Intelligenz), pages 30–42. Springer, 2016.

[27] David Buchman and David Poole. Why rules are complex: Real-valued probabilistic logic programs are not fully expressive. In Conference on Uncertainty in Artificial Intelligence, 2017.

[28] Hung Hai Bui, Tuyen N Huynh, and Sebastian Riedel. Automorphism groups of graphical models and lifted variational inference. In Conference on Uncertainty in Artificial Intelligence, page 132, 2013.

[29] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka Jr, and Tom M Mitchell. Toward an architecture for never-ending language learning. In AAAI Conference on Artificial Intelligence, volume 5, page 3, 2010.

[30] Ismail Ilkan Ceylan, Adnan Darwiche, and Guy Van den Broeck. Open-world probabilistic databases.
In Description Logics, 2016.

[31] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[32] Jaesik Choi, Rodrigo de Salvo Braz, and Hung H. Bui. Efficient methods for lifted inference with aggregate factors. In AAAI Conference on Artificial Intelligence, 2011.

[33] Li Chou, Somdeb Sarkhel, Nicholas Ruozzi, and Vibhav Gogate. On parameter tying by quantization. In AAAI Conference on Artificial Intelligence, pages 3241–3247, 2016.

[34] Peter Christen. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media, 2012.

[35] Keith L Clark. Negation as failure. In Logic and Data Bases, pages 293–322. Springer, 1978.

[36] François Clautiaux, Aziz Moukrim, Stéphane Nègre, and Jacques Carlier. Heuristic and metaheuristic methods for computing graph treewidth. RAIRO-Operations Research, 38(1):13–26, 2004.

[37] Edgar F Codd. A relational model of data for large shared data banks. Communications of the ACM, 13(6):377–387, 1970.

[38] William W Cohen, Henry Kautz, and David McAllester. Hardening soft information sources. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 255–259. ACM, 2000.

[39] William W Cohen. TensorLog: A differentiable deductive database. arXiv preprint arXiv:1605.06523, 2016.

[40] Gregory F. Cooper. The computational complexity of probabilistic inference using Bayesian networks. Artificial Intelligence, 42(2-3):393–405, 1990.

[41] George Cybenko. Approximations by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2:183–192, 1989.

[42] Nilesh Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. The VLDB Journal, 16(4):523–544, 2007.

[43] Adnan Darwiche and Pierre Marquis.
A knowledge compilation map. Journal of Artificial Intelligence Research, 17(1):229–264, 2002.

[44] Adnan Darwiche. Recursive conditioning. Artificial Intelligence, 126(1-2):5–41, 2001.

[45] Adnan Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280–305, 2003.

[46] Adnan Darwiche. Modeling and reasoning with Bayesian networks. Cambridge University Press, 2009.

[47] Adnan Darwiche. SDD: A new canonical representation of propositional knowledge bases. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 819, 2011.

[48] Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alex Smola, and Andrew McCallum. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. NeurIPS Workshop on AKBC, 2017.

[49] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. ProbLog: A probabilistic Prolog and its application in link discovery. In International Joint Conference on Artificial Intelligence, volume 7, 2007.

[50] Luc De Raedt, Kristian Kersting, Sriraam Natarajan, and David Poole. Statistical relational artificial intelligence: Logic, probability, and computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 10(2):1–189, 2016.

[51] Rodrigo de Salvo Braz, Eyal Amir, and Dan Roth. Lifted first-order probabilistic inference. In International Joint Conference on Artificial Intelligence, pages 1319–1325, 2005.

[52] Rina Dechter and Robert Mateescu. And/or search spaces for graphical models. Artificial Intelligence, 171(2):73–106, 2007.

[53] Rina Dechter. Bucket elimination: a unifying framework for probabilistic inference. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 211–219. Morgan Kaufmann, 1996.

[54] Rina Dechter. Constraint Processing. Morgan Kaufmann, San Francisco, CA, 2003.

[55] Thomas Demeester, Tim Rocktäschel, and Sebastian Riedel.
Lifted rule injection for relation embeddings. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

[56] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2D knowledge graph embeddings. In AAAI Conference on Artificial Intelligence, 2018.

[57] Boyang Ding, Quan Wang, Bin Wang, and Li Guo. Improving knowledge graph embedding using simple constraints. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.

[58] Pedro Domingos and Daniel Lowd. Markov logic: An interface layer for artificial intelligence. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1–155, 2009.

[59] Pedro Domingos, Stanley Kok, Daniel Lowd, Hoifung Poon, Matthew Richardson, and Parag Singla. Markov logic. In L. De Raedt, P. Frasconi, K. Kersting, and S. Muggleton, editors, Probabilistic Inductive Logic Programming, pages 92–117. Springer, New York, 2008.

[60] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In ACM SIGKDD, pages 601–610. ACM, 2014.

[61] Sebastijan Dumančić, Alberto Garcia-Duran, and Mathias Niepert. On embeddings as an alternative paradigm for relational learning. ICML Workshop on Statistical Relational AI, 2018.

[62] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

[63] Varun Embar, Dhanya Sridhar, Golnoosh Farnadi, and Lise Getoor. Scalable structure learning for probabilistic soft logic. In Statistical Relational Artificial Intelligence (StaRAI) Workshop, 2018.

[64] Anthony Fader, Stephen Soderland, and Oren Etzioni.
Identifying relations for open information extraction. In Empirical Methods in Natural Language Processing (EMNLP), pages 1535–1545, 2011.

[65] Bahare Fatemi, Seyed Mehran Kazemi, and David Poole. A learning algorithm for relational logistic regression: Preliminary results. arXiv preprint arXiv:1606.08531, 2016.

[66] Bahare Fatemi, Seyed Mehran Kazemi, and David Poole. Record linkage to match customer names: A probabilistic approach. ICML Workshop on Statistical Relational AI, 2018.

[67] Ivan P Fellegi and Alan B Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.

[68] Jun Feng, Minlie Huang, Mingdong Wang, Mantong Zhou, Yu Hao, and Xiaoyan Zhu. Knowledge graph embedding by flexible translation. In Principles of Knowledge Representation and Reasoning (KR), pages 557–560, 2016.

[69] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.

[70] Manoel VM França, Gerson Zaverucha, and Artur S d'Avila Garcez. Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning, 94(1):81–104, 2014.

[71] Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeffer. Learning probabilistic relational models. In Proc. of the Sixteenth International Joint Conference on Artificial Intelligence, pages 1300–1307. Sweden: Morgan Kaufmann, 1999.

[72] Artur S d'Avila Garcez, Krysia Broda, and Dov M Gabbay. Neural-symbolic learning systems: foundations and applications. Springer Science & Business Media, 2012.

[73] Alberto García-Durán, Sebastijan Dumančić, and Mathias Niepert. Learning sequence encoders for temporal knowledge graph completion. In Empirical Methods in Natural Language Processing (EMNLP), 2018.

[74] Marcel Gehrke, Tanya Braun, and Ralf Möller. Lifted dynamic junction tree algorithm. In International Conference on Conceptual Structures, pages 55–69. Springer, 2018.

[75] Craig Gentry and Dan Boneh. A fully homomorphic encryption scheme, volume 20.
Stanford University, Stanford, 2009.

[76] Lise Getoor and Ben Taskar. Introduction to statistical relational learning. MIT Press, 2007.

[77] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning, pages 201–210, 2016.

[78] C Lee Giles, Kurt D Bollacker, and Steve Lawrence. CiteSeer: An automatic citation indexing system. In Proceedings of the Third ACM Conference on Digital Libraries, pages 89–98. ACM, 1998.

[79] Vibhav Gogate and Pedro Domingos. Probabilistic theorem proving. In UAI, pages 256–265, 2011.

[80] Luis Gravano, Panagiotis G Ipeirotis, Hosagrahar Visvesvaraya Jagadish, Nick Koudas, Shanmugauelayut Muthukrishnan, Divesh Srivastava, et al. Approximate string joins in a database (almost) for free. In Very Large Databases (VLDB), volume 1, pages 491–500, 2001.

[81] Shu Guo, Quan Wang, Lihong Wang, Bin Wang, and Li Guo. Jointly embedding knowledge graphs and logical rules. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 192–202, 2016.

[82] Víctor Gutiérrez-Basulto and Steven Schockaert. From knowledge graph embedding to ontology embedding: Region based representations of relational structures. In Principles of Knowledge Representation and Reasoning (KR), 2018.

[83] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.

[84] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

[85] Maxwell Harper and Joseph Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2015.

[86] Katsuhiko Hayashi and Masashi Shimbo.
On the equivalence of holographic and complex embeddings for link prediction. In Statistical Relational AI (StaRAI) Workshop, 2017.

[87] Mauricio A Hernández and Salvatore J Stolfo. The merge/purge problem for large databases. In ACM SIGMOD Record, volume 24, pages 127–138. ACM, 1995.

[88] Ehsan Hesamifard, Hassan Takabi, and Mehdi Ghasemi. CryptoDL: Deep neural networks over encrypted data. arXiv preprint arXiv:1711.05189, 2017.

[89] Rob High. The era of cognitive systems: An inside look at IBM Watson and how it works. IBM Corporation, Redbooks, 2012.

[90] Frank L Hitchcock. The expression of a tensor or a polyadic as a sum of products. Studies in Applied Mathematics, 6(1-4):164–189, 1927.

[91] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[92] Johannes Hoffart, Fabian M Suchanek, Klaus Berberich, and Gerhard Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61, 2013.

[93] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

[94] Michael Horsch and David Poole. A dynamic approach to probability inference using Bayesian networks. In Proc. Sixth Conference on Uncertainty in AI, pages 155–161, 1990.

[95] Jinbo Huang and Adnan Darwiche. The language of search. Journal of Artificial Intelligence Research (JAIR), 29:191–219, 2007.

[96] Tuyen N. Huynh and Raymond J. Mooney. Discriminative structure and parameter learning for Markov logic networks. In Proc. of the International Conference on Machine Learning, 2008.

[97] Maximilian Ilse, Jakub M Tomczak, and Max Welling. Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712, 2018.

[98] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[99] Ariel Jaimovich, Ofer Meshi, and Nir Friedman.
Template based inference in symmetric relational Markov random fields. In Conference on Uncertainty in Artificial Intelligence, 2007.

[100] Finn V. Jensen, Steffen L. Lauritzen, and Kristian G. Olesen. Bayesian updating in causal probabilistic networks by local computations. Computational Statistics Quarterly, 4, pages 269–282, 1990.

[101] Yacine Jernite, Alexander M Rush, and David Sontag. A fast variational approach for learning Markov random field language models. In ICML, 2015.

[102] Abhay Jha, Vibhav Gogate, Alexandra Meliou, and Dan Suciu. Lifted inference seen from the other side: The tractable features. In Advances in Neural Information Processing Systems, pages 973–981, 2010.

[103] Rudolf Kadlec, Ondrej Bajgar, and Jan Kleindienst. Knowledge base completion: Baselines strike back. arXiv preprint arXiv:1705.10744, 2017.

[104] Seyed Mehran Kazemi and David Poole. Elimination ordering in first-order probabilistic inference. In AAAI Conference on Artificial Intelligence, 2014.

[105] Seyed Mehran Kazemi and David Poole. Knowledge compilation for lifted probabilistic inference: Compiling to a low-level language. In Principles of Knowledge Representation and Reasoning (KR), 2016.

[106] Seyed Mehran Kazemi and David Poole. Lazy arithmetic circuits. In Beyond-NP Workshop at AAAI, 2016.

[107] Seyed Mehran Kazemi and David Poole. Why is compiling lifted inference into a low-level language so effective? arXiv preprint arXiv:1606.04512, 2016.

[108] Seyed Mehran Kazemi and David Poole. Bridging weighted rules and graph random walks for statistical relational models. Frontiers in Robotics and AI, 5:8, 2018.

[109] Seyed Mehran Kazemi and David Poole. RelNN: A deep neural model for relational learning. In AAAI Conference on Artificial Intelligence, 2018.

[110] Seyed Mehran Kazemi and David Poole. SimplE embedding for link prediction in knowledge graphs.
In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[111] Seyed Mehran Kazemi, David Buchman, Kristian Kersting, Sriraam Natarajan, and David Poole. Relational logistic regression. In KR, 2014.

[112] Seyed Mehran Kazemi, David Buchman, Kristian Kersting, Sriraam Natarajan, and David Poole. Relational logistic regression: the directed analog of Markov logic networks. In Proc. AAAI-2014 Statistical Relational AI Workshop, 2014.

[113] Seyed Mehran Kazemi, Angelika Kimmig, Guy Van den Broeck, and David Poole. New liftable classes for first-order probabilistic inference. In Advances in Neural Information Processing Systems, pages 3117–3125, 2016.

[114] Seyed Mehran Kazemi, Bahare Fatemi, Alexandra Kim, Zilun Peng, Moumita Roy Tora, Xing Zeng, Matthew Dirks, and David Poole. Comparing aggregators for relational probabilistic models. arXiv preprint arXiv:1707.07785, 2017.

[115] Seyed Mehran Kazemi, Angelika Kimmig, Guy Van den Broeck, and David Poole. Domain recursion for lifted inference with existential quantifiers. arXiv preprint arXiv:1707.07763, 2017.

[116] Kristian Kersting, Babak Ahmadi, and Sriraam Natarajan. Counting belief propagation. In Conference on Uncertainty in Artificial Intelligence, pages 277–284, 2009.

[117] Kristian Kersting. Lifted probabilistic inference. In ECAI, pages 33–38, 2012.

[118] Bisma S Khan and Muaz A Niazi. Network community detection: a review and visual survey. arXiv preprint arXiv:1708.00977, 2017.

[119] Angelika Kimmig, Stephen Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. A short introduction to probabilistic soft logic. In Proceedings of the NeurIPS Workshop on Probabilistic Programming: Foundations and Applications, pages 1–4, 2012.

[120] Angelika Kimmig, Lilyana Mihalkova, and Lise Getoor. Lifted graphical models: a survey. Machine Learning, 99(1):1–45, 2015.

[121] Doga Kisa, Guy Van den Broeck, Arthur Choi, and Adnan Darwiche. Probabilistic sentential decision diagrams.
In Proceedings of the 14th International Conference on Principles of Knowledge Representation and Reasoning (KR), pages 1–10, 2014.

[122] Jacek Kisynski and David Poole. Lifted aggregation in directed first-order probabilistic models. In Twenty-first International Joint Conference on Artificial Intelligence, pages 1922–1929, 2009.

[123] Uffe Kjaerulff. Triangulation of graphs: algorithms giving small total state space. Technical report, Department of Mathematics and Computer Science, Aalborg University, Denmark, 1985.

[124] Stanley Kok, Parag Singla, Matthew Richardson, and Pedro Domingos. The Alchemy system for statistical relational AI. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2005.

[125] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA, 2009.

[126] Timothy Kopp, Parag Singla, and Henry Kautz. Lifted symmetry detection and breaking for MAP inference. In Advances in Neural Information Processing Systems, pages 1315–1323, 2015.

[127] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.

[128] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, pages 1097–1105, 2012.

[129] Andrea Lancichinetti and Santo Fortunato. Community detection algorithms: a comparative analysis. Physical Review E, 80(5):056117, 2009.

[130] Ni Lao and William W Cohen. Fast query execution for retrieval models based on path-constrained random walks. In ACM SIGKDD, pages 881–888. ACM, 2010.

[131] Ni Lao, Tom Mitchell, and William W Cohen. Random walk inference and learning in a large scale knowledge base. In Empirical Methods in Natural Language Processing (EMNLP), pages 529–539, 2011.

[132] Pedro Larrañaga, Cindy M. Kuijpers, Mikel Poza, and Roberto H. Murga.
Decomposing Bayesian networks: triangulation of the moral graph with genetic algorithms. Statistics and Computing, 7(1):19–34, 1997.

[133] Steffen L. Lauritzen and David J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), pages 157–224, 1988.

[134] Steve Lawrence, C Lee Giles, and Kurt D Bollacker. Autonomous citation matching. In Proceedings of the Third Annual Conference on Autonomous Agents, pages 392–393. ACM, 1999.

[135] Julien Leblay and Melisachew Wudage Chekol. Deriving validity time in knowledge graph. In International World Wide Web (WWW) Conferences Steering Committee, 2018.

[136] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[137] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[138] Yankai Lin, Zhiyuan Liu, Huanbo Luan, Maosong Sun, Siwei Rao, and Song Liu. Modeling relation paths for representation learning of knowledge bases. In Empirical Methods in Natural Language Processing, 2015.

[139] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, pages 2181–2187, 2015.

[140] Huma Lodhi. Deep relational machines. In International Conference on Neural Information Processing, pages 212–219. Springer, 2013.

[141] Ben London and Lise Getoor. Collective classification of network data. Data Classification: Algorithms and Applications, 399, 2014.

[142] Daniel Lowd. Closed-form learning of Markov networks from dependency networks. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI), 2012.

[143] David G Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999.
The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157. IEEE, 1999.

[144] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, pages 852–869. Springer, 2016.

[145] Stefan Lüdtke, Max Schröder, Frank Krüger, Sebastian Bader, and Thomas Kirste. State-space abstractions for probabilistic inference: A systematic review. arXiv preprint arXiv:1804.06748, 2018.

[146] Yunpu Ma, Volker Tresp, and Erik Daxberger. Embedding models for episodic memory. arXiv preprint arXiv:1807.00228, 2018.

[147] Anders L. Madsen and Finn V. Jensen. Lazy propagation: a junction tree inference algorithm based on lazy evaluation. Artificial Intelligence, 113(1):203–245, 1999.

[148] Robin Manhaeve, Sebastijan Dumančić, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. DeepProbLog: Neural probabilistic logic programming. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[149] Hongyuan Mei and Jason M Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, pages 6754–6764, 2017.

[150] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations (ICLR), Workshop Papers, 2013.

[151] Brian Milch, Luke S. Zettlemoyer, Kristian Kersting, Michael Haimes, and Leslie Pack Kaelbling. Lifted probabilistic inference with counting formulae. In AAAI Conference on Artificial Intelligence, pages 1062–1068, 2008.

[152] George A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[153] Pasquale Minervini, Luca Costabello, Emir Muñoz, Vít Nováček, and Pierre-Yves Vandenbussche. Regularizing knowledge graph embeddings via equivalence and inversion axioms.
In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 668–683. Springer, 2017.

[154] Alvaro Monge and Charles Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. 1997.

[155] Sriraam Natarajan, Tushar Khot, Kristian Kersting, Bernd Gutmann, and Jude Shavlik. Gradient-based boosting for statistical relational learning: The relational dependency network case. Machine Learning, 86(1):25–56, 2012.

[156] Aniruddh Nath and Pedro M Domingos. Learning relational sum-product networks. In AAAI Conference on Artificial Intelligence, pages 2878–2886, 2015.

[157] Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. Compositional vector space models for knowledge base inference. In AAAI Spring Symposium Series, 2015.

[158] Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems, pages 3882–3890, 2016.

[159] Jennifer Neville and David Jensen. Relational dependency networks. Journal of Machine Learning Research (JMLR), 8:653–692, 2007.

[160] Jennifer Neville, David Jensen, Lisa Friedland, and Michael Hay. Learning relational probability trees. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 625–630. ACM, 2003.

[161] Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, and Mark Johnson. STransE: a novel embedding model of entities and relationships in knowledge bases. In NAACL-HLT, 2016.

[162] Dat Quoc Nguyen. An overview of embedding models of entities and relationships for knowledge base completion. arXiv preprint arXiv:1703.08098, 2017.

[163] Maximilian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347, 2017.

[164] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel.
A three-way model for collective learning on multi-relational data. In International Conference on Machine Learning (ICML), 2011.
[165] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. Factorizing YAGO: scalable machine learning for linked data. In World Wide Web, pages 271–280. ACM, 2012.
[166] Maximilian Nickel, Xueyan Jiang, and Volker Tresp. Reducing the rank in relational factorization models by including observable patterns. In Advances in Neural Information Processing Systems (NeurIPS), pages 1179–1187, 2014.
[167] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.
[168] Maximilian Nickel, Lorenzo Rosasco, Tomaso A. Poggio, et al. Holographic embeddings of knowledge graphs. In AAAI Conference on Artificial Intelligence, pages 1955–1961, 2016.
[169] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning (ICML), 2016.
[170] Mathias Niepert. Markov chains on orbits of permutation groups. In Conference on Uncertainty in Artificial Intelligence (UAI), 2012.
[171] Mathias Niepert. Discriminative Gaifman models. In Advances in Neural Information Processing Systems (NeurIPS), pages 3405–3413, 2016.
[172] Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart J. Russell, and Ilya Shpitser. Identity uncertainty and citation matching. In Advances in Neural Information Processing Systems, pages 1425–1432, 2003.
[173] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.
[174] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[175] Trang Pham, Truyen Tran, Dinh Q. Phung, and Svetha Venkatesh. Column networks for collective classification. In AAAI Conference on Artificial Intelligence, pages 2485–2491, 2017.
[176] David Poole and Alan K. Mackworth. Artificial Intelligence: foundations of computational agents. Cambridge University Press, 2010.
[177] David Poole, Fahiem Bacchus, and Jacek Kisynski. Towards completely lifted search-based probabilistic inference. arXiv:1107.4035 [cs.AI], 2011.
[178] David Poole, David Buchman, Seyed Mehran Kazemi, Kristian Kersting, and Sriraam Natarajan. Population size extrapolation in relational probabilistic modelling. In Proc. of the Eighth International Conference on Scalable Uncertainty Management, 2014.
[179] David Poole. First-order probabilistic inference. In International Joint Conference on Artificial Intelligence, pages 985–991, 2003.
[180] Hoifung Poon and Pedro Domingos. Sum-product networks: A new deep architecture. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 689–690. IEEE, 2011.
[181] Dragomir R. Radev, Hong Qi, Harris Wu, and Weiguo Fan. Evaluating web-based question answering systems. Ann Arbor, 1001:48109, 2002.
[182] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Equivariance through parameter-sharing. In International Conference on Machine Learning (ICML), 2017.
[183] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62:107–136, 2006.
[184] Tim Rocktäschel, Matko Bošnjak, Sameer Singh, and Sebastian Riedel. Low-dimensional embeddings of logic. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, pages 45–49, 2014.
[185] Tim Rocktäschel, Sameer Singh, and Sebastian Riedel. Injecting logical background knowledge into embeddings for relation extraction.
In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1119–1129, 2015.
[186] Donald J. Rose, Robert E. Tarjan, and George S. Lueker. Algorithmic aspects of vertex elimination on graphs. SIAM Journal on Computing, 5(2):266–283, 1976.
[187] Stuart J. Russell and Peter Norvig. Artificial intelligence: a modern approach. Pearson Education Limited, 2016.
[188] Adam Santoro, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[189] Nobuo Sato and W. F. Tinney. Techniques for exploiting the sparsity of the network admittance matrix. Power Apparatus and Systems, IEEE Transactions on, 82(69):944–950, 1963.
[190] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
[191] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018.
[192] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.
[193] Luciano Serafini and Artur d'Avila Garcez. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. In Proceedings of the 11th International Workshop on Neural-Symbolic Learning and Reasoning, 2016.
[194] Glenn R. Shafer and Prakash P. Shenoy. Probability propagation. Annals of Mathematics and Artificial Intelligence, 2(1-4):327–351, 1990.
[195] Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Ré.
Incremental knowledge base construction using DeepDive. Proceedings of the VLDB Endowment, 8(11):1310–1321, 2015.
[196] Abraham Silberschatz, Henry F. Korth, Shashank Sudarshan, et al. Database system concepts, volume 4. McGraw-Hill, New York, 1997.
[197] Parag Singla and Pedro M. Domingos. Lifted first-order belief propagation. In AAAI Conference on Artificial Intelligence, volume 8, pages 1094–1099, 2008.
[198] David B. Smith and Vibhav G. Gogate. Bounding the cost of search-based lifted inference. In Advances in Neural Information Processing Systems, pages 946–954, 2015.
[199] Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems (NeurIPS), 2013.
[200] Gustav Šourek, Vojtech Aschenbrenner, Filip Železný, and Ondřej Kuželka. Lifted relational neural networks. In Proceedings of the 2015 International Conference on Cognitive Computation: Integrating Neural and Symbolic Approaches - Volume 1583, pages 52–60. CEUR-WS.org, 2015.
[201] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1):1929–1958, 2014.
[202] Dan Suciu, Dan Olteanu, Christopher Ré, and Christoph Koch. Probabilistic databases. Synthesis Lectures on Data Management, 3(2):1–180, 2011.
[203] Nima Taghipour, Daan Fierens, Guy Van den Broeck, Jesse Davis, and Hendrik Blockeel. Completeness results for lifted variable elimination. In AISTATS, pages 572–580, 2013.
[204] Nima Taghipour, Jesse Davis, and Hendrik Blockeel. Generalized counting for lifted variable elimination. Inductive Logic Programming. Springer Berlin Heidelberg, pages 107–122, 2014.
[205] Robert E. Tarjan and Mihalis Yannakakis. Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs.
SIAM Journal on Computing, 13(3):566–579, 1984.
[206] Sheila Tejada, Craig A. Knoblock, and Steven Minton. Learning domain-independent string transformation weights for high accuracy object identification. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 350–359. ACM, 2002.
[207] Théo Trouillon and Maximilian Nickel. Complex and holographic embeddings of knowledge graphs: a comparison. In Statistical Relational Artificial Intelligence (StaRAI) Workshop, 2017.
[208] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International Conference on Machine Learning (ICML), pages 2071–2080, 2016.
[209] Théo Trouillon, Christopher R. Dance, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Knowledge graph completion via complex tensor factorization. Journal of Machine Learning Research (JMLR), 2017.
[210] Werner Uwents and Hendrik Blockeel. Classifying relational data with neural networks. In International Conference on Inductive Logic Programming, pages 384–396. Springer, 2005.
[211] Guy Van den Broeck and Dan Suciu. Query Processing on Probabilistic Data: A Survey. Foundations and Trends in Databases. Now Publishers, August 2017.
[212] Guy Van den Broeck, Nima Taghipour, Wannes Meert, Jesse Davis, and Luc De Raedt. Lifted probabilistic inference by first-order knowledge compilation. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2178–2185, 2011.
[213] Guy Van den Broeck, Arthur Choi, and Adnan Darwiche. Lifted relax, compensate and then recover: From approximate to exact lifted probabilistic inference. In Conference on Uncertainty in Artificial Intelligence (UAI), 2012.
[214] Guy Van den Broeck, Wannes Meert, and Adnan Darwiche. Skolemization for weighted first-order model counting. In Principles of Knowledge Representation and Reasoning (KR), 2014.
[215] Guy Van den Broeck.
On the completeness of first-order knowledge compilation for lifted probabilistic inference. In Advances in Neural Information Processing Systems, pages 1386–1394, 2011.
[216] Guy Van den Broeck. Lifted Inference and Learning in Statistical Relational Models. PhD thesis, KU Leuven, 2013.
[217] Guy Van den Broeck. Towards high-level probabilistic reasoning with lifted inference. In AAAI Spring Symposium on KRR, 2015.
[218] Guy Van den Broeck. First-order probabilistic reasoning: Successes and challenges. 2016.
[219] Marten Van Dijk, Craig Gentry, Shai Halevi, and Vinod Vaikuntanathan. Fully homomorphic encryption over the integers. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 24–43. Springer, 2010.
[220] Jan Van Haaren, Guy Van den Broeck, Wannes Meert, and Jesse Davis. Lifted generative learning of Markov logic networks. Machine Learning, pages 1–29, 2015.
[221] Deepak Venugopal and Vibhav Gogate. Evidence-based clustering for scalable inference in Markov logic. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 258–273, 2014.
[222] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI Conference on Artificial Intelligence, pages 1112–1119, 2014.
[223] Quan Wang, Bin Wang, Li Guo, et al. Knowledge base completion using embeddings and rules. In International Joint Conference on Artificial Intelligence, pages 1859–1866, 2015.
[224] Quan Wang, Jing Liu, Yuanfei Luo, Bin Wang, and Chin-Yew Lin. Knowledge base completion via coupled path ranking. In ACL (1), 2016.
[225] Hao Wang, Xingjian Shi, and Dit-Yan Yeung. Relational deep learning: A deep latent variable model for link prediction. In AAAI Conference on Artificial Intelligence, 2017.
[226] Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey of approaches and applications.
IEEE Transactions on Knowledge and Data Engineering, 29(12):2724–2743, 2017.
[227] Yanjie Wang, Rainer Gemulla, and Hui Li. On multi-relational link prediction with bilinear models. AAAI Conference on Artificial Intelligence, 2018.
[228] Zhuoyu Wei, Jun Zhao, Kang Liu, Zhenyu Qi, Zhengya Sun, and Guanhua Tian. Large-scale knowledge base completion: Inferring via grounding network sampling over selected instances. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 1331–1340. ACM, 2015.
[229] William E. Winkler. Overview of record linkage and current research directions. In Bureau of the Census. Citeseer, 2006.
[230] Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu. Probase: a probabilistic taxonomy for text understanding. In ACM SIGMOD, pages 481–492. ACM, 2012.
[231] Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2017.
[232] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. International Conference on Learning Representations (ICLR), 2015.
[233] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R. Salakhutdinov, and Alexander J. Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401, 2017.
[234] Nevin Lianwen Zhang and David Poole. A simple approach to Bayesian network computations. In Proceedings of the 10th Canadian Conference on AI, pages 171–178, 1994.
[235] Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. Visual translation embedding network for visual relation detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, page 4, 2017.
[236] Ce Zhang. DeepDive: a data management system for automatic knowledge base construction.
Madison, Wisconsin: University of Wisconsin-Madison, 2015.
[237] Yu Zhu, Hao Li, Yikang Liao, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. What to do next: Modeling user behaviors by Time-LSTM. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3602–3608, 2017.

Appendix A

This appendix contains proofs for theorems, propositions, and lemmas in the dissertation, as well as the supplementary material for the dissertation.

A.1 Proofs for chapter 4

A.1.1 Proof of proposition 1

Proposition: FSTransE is not fully-expressive and has the following restrictions:
1. If a relation R is reflexive on a subset ∆ of entities, R must also be symmetric on ∆.
2. If a relation R is reflexive on a subset ∆ of entities, R must also be transitive on ∆.
3. If an entity E1 has relation R with every entity in a subset of entities ∆ and another entity E2 has relation R with one of the entities in ∆, then E2 must have the relation R with every entity in ∆.

Proof. For any entity E, let h(E) = Mh(R)v(E) and t(E) = Mt(R)v(E). In the ideal situation, for a triple (E1, R, E2) to hold, we must have h(E1) + v(R) = αt(E2) for some α.

To prove (1), let E1 and E2 be two entities in ∆. A relation R being reflexive on ∆ suggests:
h(E1) + v(R) = α1 t(E1)
and
h(E2) + v(R) = α2 t(E2)
Now we must prove (E1, R, E2) ⇔ (E2, R, E1). Suppose (E1, R, E2) holds. Then we know:
h(E1) + v(R) = α3 t(E2)
We can conclude:
h(E2) + v(R) = α2 t(E2) = (α2/α3) α3 t(E2) = (α2/α3)(h(E1) + v(R)) = (α2/α3) α1 t(E1) = α4 t(E1), where α4 = α1α2/α3.
The above equality proves that (E2, R, E1) holds. Going from (E2, R, E1) to (E1, R, E2) is similar.

To prove (2), let E1, E2, and E3 be three entities in ∆. A relation R being reflexive suggests:
h(E1) + v(R) = α1 t(E1)
h(E2) + v(R) = α2 t(E2)
h(E3) + v(R) = α3 t(E3)
Now we must prove (E1, R, E2) ∧ (E2, R, E3) ⇒ (E1, R, E3). Suppose (E1, R, E2) and (E2, R, E3) hold.
Then we know:
h(E1) + v(R) = α4 t(E2)
h(E2) + v(R) = α5 t(E3)
We can conclude:
h(E1) + v(R) = α4 t(E2) = (α4/α2) α2 t(E2) = (α4/α2)(h(E2) + v(R)) = (α4/α2) α5 t(E3) = α6 t(E3), where α6 = α4α5/α2.
The above equality proves (E1, R, E3) holds.

To prove (3), let E3 and E4 be two entities in ∆, and let E2 have relation R with E3. We prove E2 must have relation R with E4 as well. We know:
h(E1) + v(R) = α1 t(E3)
h(E1) + v(R) = α2 t(E4)
h(E2) + v(R) = α3 t(E3)
We can conclude:
h(E2) + v(R) = α3 t(E3) = (α3/α1) α1 t(E3) = (α3/α1)(h(E1) + v(R)) = (α3/α1) α2 t(E4) = α4 t(E4), where α4 = α2α3/α1.
The above equality proves (E2, R, E4) holds.

Figure A.1: The h(e) and v(r) vectors in Proposition 2. Each vector has |E| ∗ |R| elements, shown here as |R| blocks of |E| elements each.

h(e0)       1 0 0 … 0   1 0 0 … 0   …   1 0 0 … 0
h(e1)       0 1 0 … 0   0 1 0 … 0   …   0 1 0 … 0
h(e2)       0 0 1 … 0   0 0 1 … 0   …   0 0 1 … 0
…
h(e|E|−1)   0 0 0 … 1   0 0 0 … 1   …   0 0 0 … 1
v(r0)       1 1 1 … 1   0 0 0 … 0   …   0 0 0 … 0
v(r1)       0 0 0 … 0   1 1 1 … 1   …   0 0 0 … 0
…
v(r|R|−1)   0 0 0 … 0   0 0 0 … 0   …   1 1 1 … 1

A.1.2 Proof of proposition 2

Proposition: For any ground truth over entities E and relations R, there exists a SimplE model with embedding vectors of size |E| ∗ |R| that represents that ground truth.

Proof. With embedding vectors of size |E| ∗ |R|, for each entity Ei we define hn(Ei) = 1 if n mod |E| = i and 0 otherwise, and for each relation Rj we define vn(Rj) = 1 if n div |E| = j and 0 otherwise (see Fig A.1). Then for each Ei and Rj, the elementwise product of h(Ei) and v(Rj) is 0 everywhere except for the (j ∗ |E| + i)-th element.
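The selection property just stated is easy to check numerically. The sketch below is illustrative code, not part of the dissertation; the vector definitions follow the text, and small example sizes for |E| and |R| are assumptions of the sketch. It confirms that the elementwise product of h(Ei) and v(Rj) is one-hot at position j ∗ |E| + i, so the SimplE score ⟨h(Ei), v(Rj), t(Ek)⟩ simply reads off one coordinate of t(Ek):

```python
import numpy as np

n_entities, n_relations = 4, 3          # small example sizes (assumed)
dim = n_entities * n_relations          # embedding size |E| * |R|

# h_n(E_i) = 1 iff n mod |E| = i ; v_n(R_j) = 1 iff n div |E| = j
def h(i):
    return np.array([1.0 if n % n_entities == i else 0.0 for n in range(dim)])

def v(j):
    return np.array([1.0 if n // n_entities == j else 0.0 for n in range(dim)])

for i in range(n_entities):
    for j in range(n_relations):
        prod = h(i) * v(j)
        # one-hot at position j*|E| + i, so <h(E_i), v(R_j), t(E_k)>
        # equals the (j*|E| + i)-th element of t(E_k)
        assert prod[j * n_entities + i] == 1.0
        assert prod.sum() == 1.0
```

With t defined as in the rest of the proof (+1 for true facts, −1 otherwise), each score is therefore +1 or −1, matching the ground truth's sign.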
So for each entity Ek, we set t_{j∗|E|+i}(Ek) to be 1 if (Ei, Rj, Ek) holds and −1 otherwise.

A.1.3 Proof of proposition 3

Proposition: For any ground truth over entities E and relations R containing γ true facts, there exists a SimplE model with embedding vectors of size γ + 1 that represents that ground truth.

Proof. We use induction to prove this proposition. Let γ be zero (base of the induction). We can have embedding vectors of size 1 for each entity and relation, setting the value for entities to 1 and for relations to −1. Then ⟨h(Ei), v(Rj), t(Ek)⟩ is negative for all entities Ei and Ek and relations Rj. So there exist embedding vectors of size γ + 1 that represent this ground truth.

Assume that for any ground truth where γ = n − 1 (1 ≤ n ≤ |R||E|²), there exists an assignment of values to embedding vectors of size n that represents that ground truth (induction hypothesis). We must prove that for any ground truth where γ = n, there exists an assignment of values to embedding vectors of size n + 1 that represents this ground truth.

Let (Ei, Rj, Ek) be one of the n true facts. Consider a modified ground truth which is identical to the ground truth with n true facts, except that (Ei, Rj, Ek) is assigned false. The modified ground truth has n − 1 true facts and, by the induction hypothesis, we can represent it using some embedding vectors of size n. Let q = ⟨h(Ei), v(Rj), t(Ek)⟩, where h(Ei), v(Rj) and t(Ek) are the embedding vectors that represent the modified ground truth. We add an element to the end of all embedding vectors and set it to 0. This increases the vector sizes to n + 1 but does not change any scores. Then we set the last element of h(Ei) to 1, of v(Rj) to 1, and of t(Ek) to −q + 1. This ensures that ⟨h(Ei), v(Rj), t(Ek)⟩ = q + (−q + 1) = 1 > 0 for the new vectors, and no other score is affected.

A.1.4 Proof of proposition 4

Proposition: Let R be a relation such that:
(E1, R, E2) ∈ ζ ⇔ (E2, R, E1) ∈ ζ
for any two entities E1 and E2 (i.e. R is symmetric).
This property of R can be encoded into SimplE by tying the parameters v(R⁻¹) to v(R).

Proof. If (E1, R, E2) ∈ ζ, then a SimplE model makes ⟨h(E1), v(R), t(E2)⟩ and ⟨h(E2), v(R⁻¹), t(E1)⟩ positive. By tying the parameters v(R⁻¹) to v(R), we can conclude that ⟨h(E2), v(R), t(E1)⟩ and ⟨h(E1), v(R⁻¹), t(E2)⟩ also become positive. Therefore, the SimplE model predicts (E2, R, E1) ∈ ζ.

A.1.5 Proof of proposition 5

Proposition: Let R be a relation such that:
(E1, R, E2) ∈ ζ ⇔ (E2, R, E1) ∈ ζ′
for any two entities E1 and E2 (i.e. R is anti-symmetric). This property of R can be encoded into SimplE by tying the parameters v(R⁻¹) to the negative of v(R).

Proof. If (E1, R, E2) ∈ ζ, then a SimplE model makes ⟨h(E1), v(R), t(E2)⟩ and ⟨h(E2), v(R⁻¹), t(E1)⟩ positive. By tying the parameters v(R⁻¹) to −v(R), we can conclude that ⟨h(E2), v(R), t(E1)⟩ and ⟨h(E1), v(R⁻¹), t(E2)⟩ become negative. Therefore, the SimplE model predicts (E2, R, E1) ∈ ζ′.

A.1.6 Proof of proposition 6

Proposition: Let R1 and R2 be two relations such that:
(E1, R1, E2) ∈ ζ ⇔ (E2, R2, E1) ∈ ζ
for any two entities E1 and E2 (i.e. R2 is the inverse of R1). This property of R1 and R2 can be encoded into SimplE by tying the parameters v(R1⁻¹) to v(R2) and v(R2⁻¹) to v(R1).

Proof. If (E1, R1, E2) ∈ ζ, then a SimplE model makes ⟨h(E1), v(R1), t(E2)⟩ and ⟨h(E2), v(R1⁻¹), t(E1)⟩ positive. By tying the parameters v(R2⁻¹) to v(R1) and v(R2) to v(R1⁻¹), we can conclude that ⟨h(E1), v(R2⁻¹), t(E2)⟩ and ⟨h(E2), v(R2), t(E1)⟩ also become positive. Therefore, the SimplE model predicts (E2, R2, E1) ∈ ζ.

A.2 Proofs for chapter 6

A.2.1 Proof of proposition 7

Proposition: If any order results in not grounding any population, then MinTableSize produces an order that results in not grounding any population.

Proof. MinTableSize branches first on atoms with fewer than two logvars, and orders the rest of the branching by the number of logvars of the atoms, ascending.
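The branching order just described can be sketched as follows. This is illustrative code only, not the dissertation's implementation; representing atoms as (name, set-of-logvars) pairs is an assumption of the sketch:

```python
# Sketch of the MinTableSize branching order: atoms with fewer than two
# logvars come first, and the rest follow in ascending order of their
# number of logvars. Atoms are modeled as (name, logvars) pairs.
atoms = [
    ("R", {"x", "y"}),       # hypothetical atoms for illustration
    ("Q", {"x"}),
    ("P", set()),
    ("T", {"x", "y", "z"}),
]

def min_table_size_order(atoms):
    # sorting by logvar count ascending realizes both parts of the rule;
    # sorted() is stable, so ties keep their original relative order
    return sorted(atoms, key=lambda a: len(a[1]))

order = [name for name, _ in min_table_size_order(atoms)]
assert order == ["P", "Q", "R", "T"]
```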
Since there is an order which results in not grounding any population, after branching on all atoms with fewer than two logvars, there should be at least one atom with fewer than two logvars at each step. We know that a logvar is removed only when all remaining atoms contain that logvar. This means that atoms which have fewer logvars than others at the beginning will always have fewer than them during the process. Therefore, at each step, the atom which has fewer than two logvars is the one which had the minimum number of logvars at the beginning (among the remaining atoms). This atom is exactly the one that MinTableSize chooses.

A.2.2 Proof of proposition 8

Proposition: The theory consisting of the S4 clause ∀x1, x2 ∈ ∆x, y1, y2 ∈ ∆y : S(x1, y1) ∨ ¬S(x2, y1) ∨ S(x2, y2) ∨ ¬S(x1, y2) is domain-liftable using RulesD.

Proof. Let ∆x′ = ∆x − {N}. Applying domain recursion on ∆x (choosing ∆y is analogous) gives the following shattered theory on the reduced domain ∆x′:

∀x1, x2 ∈ ∆x′, y1, y2 ∈ ∆y : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.1)
∀x2 ∈ ∆x′, y1, y2 ∈ ∆y : S(N, y1) ∨ ¬S(N, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.2)
∀x1 ∈ ∆x′, y1, y2 ∈ ∆y : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(N, y1) ∨ S(N, y2)   (A.3)
∀y1, y2 ∈ ∆y : S(N, y1) ∨ ¬S(N, y2) ∨ ¬S(N, y1) ∨ S(N, y2)   (A.4)

We now apply the standard rules Rules to simplify the output of domain recursion. The last clause is a tautology and can be dropped. The theory contains a unary PRV, namely S(N, y), which is a candidate for lifted case-analysis. Let ∆yT ⊆ ∆y be the individuals of ∆y for which S(N, y) is satisfied, and let ∆yF = ∆y \ ∆yT be its complement in ∆y.
This gives:

∀x1, x2 ∈ ∆x′, y1, y2 ∈ ∆y : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.5)
∀x2 ∈ ∆x′, y1, y2 ∈ ∆y : S(N, y1) ∨ ¬S(N, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.6)
∀x1 ∈ ∆x′, y1, y2 ∈ ∆y : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(N, y1) ∨ S(N, y2)   (A.7)
∀y ∈ ∆yT : S(N, y)   (A.8)
∀y ∈ ∆yF : ¬S(N, y)   (A.9)

Unit propagation creates two independent theories: one containing the S(N, y) atoms, which is trivially liftable, and one containing the other atoms, namely:

∀x1, x2 ∈ ∆x′, y1, y2 ∈ ∆y : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.10)
∀x2 ∈ ∆x′, y1 ∈ ∆yT, y2 ∈ ∆yF : ¬S(x2, y1) ∨ S(x2, y2)   (A.11)
∀x1 ∈ ∆x′, y1 ∈ ∆yT, y2 ∈ ∆yF : S(x1, y1) ∨ ¬S(x1, y2)   (A.12)

The last two clauses are equivalent; hence, we have:

∀x1, x2 ∈ ∆x′, y1, y2 ∈ ∆y : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.13)
∀x ∈ ∆x′, y1 ∈ ∆yF, y2 ∈ ∆yT : ¬S(x, y1) ∨ S(x, y2)   (A.14)

After shattering, we get four copies of the first clause:

∀x1, x2 ∈ ∆x′, y1 ∈ ∆yT, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.15)
∀x1, x2 ∈ ∆x′, y1 ∈ ∆yT, y2 ∈ ∆yF : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.16)
∀x1, x2 ∈ ∆x′, y1 ∈ ∆yF, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.17)
∀x1, x2 ∈ ∆x′, y1 ∈ ∆yF, y2 ∈ ∆yF : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.18)
∀x ∈ ∆x′, y1 ∈ ∆yF, y2 ∈ ∆yT : ¬S(x, y1) ∨ S(x, y2)   (A.19)

The second and third clauses are subsumed by the last clause, and can be removed:

∀x1, x2 ∈ ∆x′, y1 ∈ ∆yT, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.20)
∀x1, x2 ∈ ∆x′, y1 ∈ ∆yF, y2 ∈ ∆yF : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.21)
∀x ∈ ∆x′, y1 ∈ ∆yF, y2 ∈ ∆yT : ¬S(x, y1) ∨ S(x, y2)   (A.22)

Let us now consider the last clause, and ignore the first two for the time being. The last clause is actually in FO2, and the Skolemization-rewriting of reused FO2 variables [214] can be applied, for example to y2 in its second PRV.
The last clause is thus replaced by:

∀x ∈ ∆x′, y ∈ ∆yF : ¬S(x, y) ∨ ¬A(x)   (A.23)
∀x ∈ ∆x′, y ∈ ∆yT : S(x, y) ∨ A(x)   (A.24)
∀x ∈ ∆x′ : A(x) ∨ B(x)   (A.25)
∀x ∈ ∆x′, y ∈ ∆yT : S(x, y) ∨ B(x)   (A.26)

Next, we perform lifted case-analysis on A(x′). Let ∆α ⊆ ∆x′ be the individuals in ∆x′ for which A(x′) is satisfied, and let ∆α¯ = ∆x′ \ ∆α be its complement in ∆x′:

∀x ∈ ∆x′, y ∈ ∆yF : ¬S(x, y) ∨ ¬A(x)   (A.27)
∀x ∈ ∆x′, y ∈ ∆yT : S(x, y) ∨ A(x)   (A.28)
∀x ∈ ∆x′ : A(x) ∨ B(x)   (A.29)
∀x ∈ ∆x′, y ∈ ∆yT : S(x, y) ∨ B(x)   (A.30)
∀x ∈ ∆α : A(x)   (A.31)
∀x ∈ ∆α¯ : ¬A(x)   (A.32)

Unit propagation gives two independent theories: a theory containing the predicate A, which is trivially liftable, and the theory

∀x ∈ ∆α, y ∈ ∆yF : ¬S(x, y)   (A.33)
∀x ∈ ∆α¯, y ∈ ∆yT : S(x, y)   (A.34)
∀x ∈ ∆α¯ : B(x)   (A.35)
∀x ∈ ∆α, y ∈ ∆yT : S(x, y) ∨ B(x)   (A.36)

Next, we perform atom counting on B(x′). Let ∆β ⊆ ∆α be the individuals of ∆α for which B(x′) is satisfied, and let ∆β¯ = ∆α \ ∆β be its complement in ∆α. In other words, the original domain ∆x is now split up into four parts: ∆x¯ = {N}, ∆α¯, ∆β, and ∆β¯. This gives the theory

∀x ∈ ∆α, y ∈ ∆yF : ¬S(x, y)   (A.37)
∀x ∈ ∆α¯, y ∈ ∆yT : S(x, y)   (A.38)
∀x ∈ ∆α¯ : B(x)   (A.39)
∀x ∈ ∆α, y ∈ ∆yT : S(x, y) ∨ B(x)   (A.40)
∀x ∈ ∆β : B(x)   (A.41)
∀x ∈ ∆β¯ : ¬B(x)   (A.42)

Unit propagation gives two independent theories: a theory containing the predicate B, which is trivially liftable, and the theory

∀x ∈ ∆α, y ∈ ∆yF : ¬S(x, y)   (A.43)
∀x ∈ ∆α¯, y ∈ ∆yT : S(x, y)   (A.44)
∀x ∈ ∆β¯, y ∈ ∆yT : S(x, y)   (A.45)

We now reintroduce the first removed clause. Clause A.20 has nine copies after shattering:

∀x1 ∈ ∆α¯, x2 ∈ ∆α¯, y1, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.46)
∀x1 ∈ ∆α¯, x2 ∈ ∆β, y1, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.47)
∀x1 ∈ ∆α¯, x2 ∈ ∆β¯, y1, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.48)
∀x1 ∈ ∆β, x2 ∈ ∆α¯, y1, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.49)
∀x1 ∈ ∆β, x2 ∈ ∆β, y1, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.50)
∀x1 ∈ ∆β, x2 ∈ ∆β¯, y1, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.51)
∀x1 ∈ ∆β¯, x2 ∈ ∆α¯, y1, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.52)
∀x1 ∈ ∆β¯, x2 ∈ ∆β, y1, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.53)
∀x1 ∈ ∆β¯, x2 ∈ ∆β¯, y1, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.54)

Unit propagation of clauses A.44 and A.45 satisfies any clause that has a positive literal whose x domain is ∆α¯ or ∆β¯. This removes all clauses except for

∀x1 ∈ ∆β, x2 ∈ ∆β, y1, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.55)

We now reintroduce the second removed clause. Clause A.21 has four copies after shattering:

∀x1 ∈ ∆α, x2 ∈ ∆α, y1, y2 ∈ ∆yF : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.56)
∀x1 ∈ ∆α, x2 ∈ ∆α¯, y1, y2 ∈ ∆yF : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.57)
∀x1 ∈ ∆α¯, x2 ∈ ∆α, y1, y2 ∈ ∆yF : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.58)
∀x1 ∈ ∆α¯, x2 ∈ ∆α¯, y1, y2 ∈ ∆yF : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.59)

Unit propagation of clause A.43 satisfies any clause that has a negative literal whose x domain is ∆α.
This removes all clauses except for

∀x1 ∈ ∆α¯, x2 ∈ ∆α¯, y1, y2 ∈ ∆yF : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.60)

Putting it all together, we have the theory

∀x1, x2 ∈ ∆β, y1, y2 ∈ ∆yT : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.61)
∀x1, x2 ∈ ∆α¯, y1, y2 ∈ ∆yF : S(x1, y1) ∨ ¬S(x1, y2) ∨ ¬S(x2, y1) ∨ S(x2, y2)   (A.62)
∀x ∈ ∆α, y ∈ ∆yF : ¬S(x, y)   (A.63)
∀x ∈ ∆α¯, y ∈ ∆yT : S(x, y)   (A.64)
∀x ∈ ∆β¯, y ∈ ∆yT : S(x, y)   (A.65)

These five clauses are all independent. The last three are trivially liftable. The first two are simply copies of S4 with modified domains ∆β, ∆yT, ∆α¯ and ∆yF instead of ∆x and ∆y. However, we have that |∆β| < |∆x|, |∆yT| ≤ |∆y|, |∆α¯| < |∆x|, and |∆yF| ≤ |∆y|. The recursion is thus guaranteed to terminate with ∆β = ∆α¯ = ∅. By keeping a cache of WMCs for all sizes of ∆x and ∆y, we can compute the WMC of S4 in PTIME.

A.2.3 Proof of proposition 9

Proposition: The symmetric-transitivity theory is domain-liftable using RulesD.

Proof. Symmetric-transitivity has the following two sentences:

∀x, y, z ∈ ∆p : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.66)
∀x, y ∈ ∆p : ¬F(x, y) ∨ F(y, x)   (A.67)

Assuming ∆q = ∆p − {N}:

∀y, z ∈ ∆q : ¬F(N, y) ∨ ¬F(y, z) ∨ F(N, z)   (A.68)
∀x, z ∈ ∆q : ¬F(x, N) ∨ ¬F(N, z) ∨ F(x, z)   (A.69)
∀x, y ∈ ∆q : ¬F(x, y) ∨ ¬F(y, N) ∨ F(x, N)   (A.70)
∀x, y, z ∈ ∆q : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.71)
∀y ∈ ∆q : ¬F(N, y) ∨ F(y, N)   (A.72)
∀x ∈ ∆q : ¬F(x, N) ∨ F(N, x)   (A.73)
∀x, y ∈ ∆q : ¬F(x, y) ∨ F(y, x)   (A.74)
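As a sanity check on the theory being counted, the models of the symmetric-transitivity sentences over a small domain can be enumerated by brute force: relations that are symmetric and transitive are exactly the partial equivalence relations. The sketch below is for illustration only (the lifted algorithm never enumerates groundings); the expected counts 2, 5, 15 for domain sizes 1, 2, 3 are the numbers of partial equivalence relations, which any correct (lifted or ground) model counter must reproduce:

```python
from itertools import product

def count_models(n):
    """Count relations F on a domain of size n satisfying symmetry
    (F(x,y) -> F(y,x)) and transitivity (F(x,y) & F(y,z) -> F(x,z))."""
    dom = range(n)
    count = 0
    for bits in product([False, True], repeat=n * n):
        F = {(x, y): bits[x * n + y] for x in dom for y in dom}
        symmetric = all(not F[(x, y)] or F[(y, x)]
                        for x in dom for y in dom)
        transitive = all(not (F[(x, y)] and F[(y, z)]) or F[(x, z)]
                         for x in dom for y in dom for z in dom)
        count += symmetric and transitive
    return count

assert [count_models(n) for n in (1, 2, 3)] == [2, 5, 15]
```

The brute force is exponential in n², which is precisely why the polynomial-time lifted computation derived in this proof matters.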
Lifted case-analysis on F(N, q), assuming ∆qT contains the individuals in ∆q for which F(N, q) is true and ∆qF contains the other individuals in ∆q:

∀y, z ∈ ∆q : ¬F(N, y) ∨ ¬F(y, z) ∨ F(N, z)   (A.75)
∀x, z ∈ ∆q : ¬F(x, N) ∨ ¬F(N, z) ∨ F(x, z)   (A.76)
∀x, y ∈ ∆q : ¬F(x, y) ∨ ¬F(y, N) ∨ F(x, N)   (A.77)
∀x, y, z ∈ ∆q : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.78)
∀y ∈ ∆q : ¬F(N, y) ∨ F(y, N)   (A.79)
∀x ∈ ∆q : ¬F(x, N) ∨ F(N, x)   (A.80)
∀x, y ∈ ∆q : ¬F(x, y) ∨ F(y, x)   (A.81)
∀x ∈ ∆qT : F(N, x)   (A.82)
∀x ∈ ∆qF : ¬F(N, x)   (A.83)

Unit propagation:

∀y ∈ ∆qT, z ∈ ∆qF : ¬F(y, z)   (A.84)
∀x ∈ ∆q, z ∈ ∆qT : ¬F(x, N) ∨ F(x, z)   (A.85)
∀x, y ∈ ∆q : ¬F(x, y) ∨ ¬F(y, N) ∨ F(x, N)   (A.86)
∀x, y, z ∈ ∆q : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.87)
∀y ∈ ∆qT : F(y, N)   (A.88)
∀x ∈ ∆qF : ¬F(x, N)   (A.89)
∀x, y ∈ ∆q : ¬F(x, y) ∨ F(y, x)   (A.90)

Shattering:

∀y ∈ ∆qT, z ∈ ∆qF : ¬F(y, z)   (A.91)
∀x ∈ ∆qT, z ∈ ∆qT : ¬F(x, N) ∨ F(x, z)   (A.92)
∀x ∈ ∆qF, z ∈ ∆qT : ¬F(x, N) ∨ F(x, z)   (A.93)
∀x ∈ ∆qT, y ∈ ∆qT : ¬F(x, y) ∨ ¬F(y, N) ∨ F(x, N)   (A.94)
∀x ∈ ∆qT, y ∈ ∆qF : ¬F(x, y) ∨ ¬F(y, N) ∨ F(x, N)   (A.95)
∀x ∈ ∆qF, y ∈ ∆qT : ¬F(x, y) ∨ ¬F(y, N) ∨ F(x, N)   (A.96)
∀x ∈ ∆qF, y ∈ ∆qF : ¬F(x, y) ∨ ¬F(y, N) ∨ F(x, N)   (A.97)
∀x ∈ ∆qT, y ∈ ∆qT, z ∈ ∆qT : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.98)
∀x ∈ ∆qT, y ∈ ∆qT, z ∈ ∆qF : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.99)
∀x ∈ ∆qT, y ∈ ∆qF, z ∈ ∆qT : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.100)
∀x ∈ ∆qT, y ∈ ∆qF, z ∈ ∆qF : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.101)
∀x ∈ ∆qF, y ∈ ∆qT, z ∈ ∆qT : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.102)
∀x ∈ ∆qF, y ∈ ∆qT, z ∈ ∆qF : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.103)
∀x ∈ ∆qF, y ∈ ∆qF, z ∈ ∆qT : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.104)
∀x ∈ ∆qF, y ∈ ∆qF, z ∈ ∆qF : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.105)
∀y ∈ ∆qT : F(y, N)   (A.106)
∀x ∈ ∆qF : ¬F(x, N)   (A.107)
∀x ∈ ∆qT, y ∈ ∆qT : ¬F(x, y) ∨ F(y, x)   (A.108)
∀x ∈ ∆qT, y ∈ ∆qF : ¬F(x, y) ∨ F(y, x)   (A.109)
∀x ∈ ∆qF, y ∈ ∆qT : ¬F(x, y) ∨ F(y, x)   (A.110)
∀x ∈ ∆qF, y ∈ ∆qF : ¬F(x, y) ∨ F(y, x)   (A.111)

Unit propagation:

∀y ∈ ∆qT, z ∈ ∆qF : ¬F(y, z)   (A.112)
∀x ∈ ∆qT, z ∈ ∆qT : F(x, z)   (A.113)
∀x ∈ ∆qF, y ∈ ∆qT : ¬F(x, y)   (A.114)
∀x ∈ ∆qF, y ∈ ∆qF, z ∈ ∆qF : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.115)
∀y ∈ ∆qT : F(y, N)   (A.116)
∀x ∈ ∆qF : ¬F(x, N)   (A.117)
∀x ∈ ∆qF, y ∈ ∆qF : ¬F(x, y) ∨ F(y, x)   (A.118)

The first, second, third, fifth, and sixth clauses are independent of the other clauses and can be reasoned about separately. They can be trivially lifted. The remaining clauses are:

∀x ∈ ∆qF, y ∈ ∆qF, z ∈ ∆qF : ¬F(x, y) ∨ ¬F(y, z) ∨ F(x, z)   (A.119)
∀x ∈ ∆qF, y ∈ ∆qF : ¬F(x, y) ∨ F(y, x)   (A.120)

The above clauses are an instance of our initial clauses but with smaller domain sizes. We can reason about them by following a similar process, and if we keep sub-results in a cache, the process is polynomial.

A.2.4 Proof of proposition 10

Proposition: The birthday-paradox theory is domain-liftable using RulesD.

Proof. After Skolemization [214] for removing the existential quantifier, the birthday-paradox theory contains ∀p ∈ ∆p, ∀d ∈ ∆d : S(p) ∨ ¬Born(p, d); ∀p ∈ ∆p, d1, d2 ∈ ∆d : ¬Born(p, d1) ∨ ¬Born(p, d2); and ∀p1, p2 ∈ ∆p, d ∈ ∆d : ¬Born(p1, d) ∨ ¬Born(p2, d), where S is the Skolem predicate. This theory is not in FO2 and not in RU, and is not domain-liftable using Rules. However, this theory is in both S2FO2 and S2RU, as the last two clauses are instances of α(Born), and the first one is in FO2 and also in RU and has no PRVs with more than one logvar other than Born. Therefore, this theory is domain-liftable based on Theorem 1.

A.2.5 Proof of theorem 1

Theorem: S2FO2 and S2RU are domain-liftable using RulesD.

Proof. The case where α = ∅ is trivial. Let α = α(S1) ∧ α(S2) ∧ · · · ∧ α(Sn). Once we remove all atoms having zero or one logvar by (lifted) case-analysis, the remaining clauses can be divided into n + 1 components: the i-th component in the first n components only contains Si literals, and the (n + 1)-th component contains no Si literals.
These components are disconnected from each other, so we can consider each of them separately. The (n+1)-th component comes from clauses in β and is domain-liftable by definition. The following two lemmas prove that the clauses in the other components are also domain-liftable. The proofs of both lemmas rely on domain recursion.

Lemma 1. A clausal theory α(S) with only one predicate S, where all clauses have exactly two different literals of S, is domain-liftable.

Lemma 2. Suppose {∆p1 , ∆p2 , . . . , ∆pn} are mutually exclusive subsets of ∆x and {∆q1 , ∆q2 , . . . , ∆qm} are mutually exclusive subsets of ∆y. We can add any unit clause of the form ∀pi ∈ ∆pi , qj ∈ ∆qj : S(pi, qj) or ∀pi ∈ ∆pi , qj ∈ ∆qj : ¬S(pi, qj) to the theory α(S) in Lemma 1 and the theory is still domain-liftable.

Therefore, theories in S 2FO2 and S 2RU are domain-liftable.

In what follows, we prove Lemma 1.

Proof. A theory in this form has a subset of the following clauses:

∀x ∈ ∆x, y1, y2 ∈ ∆y : S(x, y1) ∨ S(x, y2) (A.121)
∀x ∈ ∆x, y1, y2 ∈ ∆y : S(x, y1) ∨ ¬S(x, y2) (A.122)
∀x ∈ ∆x, y1, y2 ∈ ∆y : ¬S(x, y1) ∨ ¬S(x, y2) (A.123)
∀x1, x2 ∈ ∆x, y ∈ ∆y : S(x1, y) ∨ S(x2, y) (A.124)
∀x1, x2 ∈ ∆x, y ∈ ∆y : S(x1, y) ∨ ¬S(x2, y) (A.125)
∀x1, x2 ∈ ∆x, y ∈ ∆y : ¬S(x1, y) ∨ ¬S(x2, y) (A.126)
∀x1, x2 ∈ ∆x, y1, y2 ∈ ∆y : S(x1, y1) ∨ S(x2, y2) (A.127)
∀x1, x2 ∈ ∆x, y1, y2 ∈ ∆y : S(x1, y1) ∨ ¬S(x2, y2) (A.128)
∀x1, x2 ∈ ∆x, y1, y2 ∈ ∆y : ¬S(x1, y1) ∨ ¬S(x2, y2) (A.129)

Let N be an individual in ∆x. Applying domain recursion on ∆x′ = ∆x − {N} for all clauses gives:

for (1):
∀y1, y2 ∈ ∆y : S(N, y1) ∨ S(N, y2) (A.130)
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(x, y1) ∨ S(x, y2) (A.131)

for (2):
∀y1, y2 ∈ ∆y : S(N, y1) ∨ ¬S(N, y2) (A.132)
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(x, y1) ∨ ¬S(x, y2) (A.133)

for (3):
∀y1, y2 ∈ ∆y : ¬S(N, y1) ∨ ¬S(N, y2) (A.134)
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : ¬S(x, y1) ∨ ¬S(x, y2) (A.135)

for (4):
∀x ∈ ∆x′ , y ∈ ∆y : S(N, y) ∨ S(x, y) (A.136)
∀x1, x2 ∈ ∆x′ , y ∈ ∆y : S(x1, y) ∨ S(x2, y) (A.137)
for (5):
∀x ∈ ∆x′ , y ∈ ∆y : S(N, y) ∨ ¬S(x, y) (A.138)
∀x ∈ ∆x′ , y ∈ ∆y : ¬S(N, y) ∨ S(x, y) (A.139)
∀x1, x2 ∈ ∆x′ , y ∈ ∆y : S(x1, y) ∨ ¬S(x2, y) (A.140)

for (6):
∀x ∈ ∆x′ , y ∈ ∆y : ¬S(N, y) ∨ ¬S(x, y) (A.141)
∀x1, x2 ∈ ∆x′ , y ∈ ∆y : ¬S(x1, y) ∨ ¬S(x2, y) (A.142)

for (7):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(N, y1) ∨ S(x, y2) (A.143)
∀x1, x2 ∈ ∆x′ , y1, y2 ∈ ∆y : S(x1, y1) ∨ S(x2, y2) (A.144)

for (8):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(N, y1) ∨ ¬S(x, y2) (A.145)
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : ¬S(N, y1) ∨ S(x, y2) (A.146)
∀x1, x2 ∈ ∆x′ , y1, y2 ∈ ∆y : S(x1, y1) ∨ ¬S(x2, y2) (A.147)

for (9):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : ¬S(N, y1) ∨ ¬S(x, y2) (A.148)
∀x1, x2 ∈ ∆x′ , y1, y2 ∈ ∆y : ¬S(x1, y1) ∨ ¬S(x2, y2) (A.149)

Then we can perform lifted case-analysis on S(N, y). For the case where S(N, y) is true for exactly k of the individuals in ∆y, we update all clauses assuming ∆yT and ∆yF represent the individuals for which S(N, y) is True and False respectively, i.e., ∀y ∈ ∆yT : S(N, y) and ∀y ∈ ∆yF : ¬S(N, y):

for (1):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(x, y1) ∨ S(x, y2) (A.150)

for (2):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(x, y1) ∨ ¬S(x, y2) (A.151)

for (3):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : ¬S(x, y1) ∨ ¬S(x, y2) (A.152)
for (4):
∀x ∈ ∆x′ , y ∈ ∆yF : S(x, y) (A.153)
∀x1, x2 ∈ ∆x′ , y ∈ ∆y : S(x1, y) ∨ S(x2, y) (A.154)

for (5):
∀x ∈ ∆x′ , y ∈ ∆yF : ¬S(x, y) (A.155)
∀x ∈ ∆x′ , y ∈ ∆yT : S(x, y) (A.156)
∀x1, x2 ∈ ∆x′ , y ∈ ∆y : S(x1, y) ∨ ¬S(x2, y) (A.157)

for (6):
∀x ∈ ∆x′ , y ∈ ∆yT : ¬S(x, y) (A.158)
∀x1, x2 ∈ ∆x′ , y ∈ ∆y : ¬S(x1, y) ∨ ¬S(x2, y) (A.159)

for (7):
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆y : S(x, y2) (A.160)
∀x1, x2 ∈ ∆x′ , y1, y2 ∈ ∆y : S(x1, y1) ∨ S(x2, y2) (A.161)

for (8):
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆y : ¬S(x, y2) (A.162)
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆y : S(x, y2) (A.163)
∀x1, x2 ∈ ∆x′ , y1, y2 ∈ ∆y : S(x1, y1) ∨ ¬S(x2, y2) (A.164)

for (9):
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆y : ¬S(x, y2) (A.166)
∀x1, x2 ∈ ∆x′ , y1, y2 ∈ ∆y : ¬S(x1, y1) ∨ ¬S(x2, y2) (A.167)

After subsumptions and shattering:

for (1):
∀x ∈ ∆x′ , y1, y2 ∈ ∆yT : S(x, y1) ∨ S(x, y2) (A.168)
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yF : S(x, y1) ∨ S(x, y2) (A.169)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yT : S(x, y1) ∨ S(x, y2) (A.170)
∀x ∈ ∆x′ , y1, y2 ∈ ∆yF : S(x, y1) ∨ S(x, y2) (A.171)
for (2):
∀x ∈ ∆x′ , y1, y2 ∈ ∆yT : S(x, y1) ∨ ¬S(x, y2) (A.173)
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yF : S(x, y1) ∨ ¬S(x, y2) (A.174)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yT : S(x, y1) ∨ ¬S(x, y2) (A.175)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yF : S(x, y1) ∨ ¬S(x, y2) (A.176)

for (3):
∀x ∈ ∆x′ , y1, y2 ∈ ∆yT : ¬S(x, y1) ∨ ¬S(x, y2) (A.177)
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yF : ¬S(x, y1) ∨ ¬S(x, y2) (A.178)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yT : ¬S(x, y1) ∨ ¬S(x, y2) (A.179)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yF : ¬S(x, y1) ∨ ¬S(x, y2) (A.180)

for (4) (for the second clause, the case where y ∈ ∆yF becomes subsumed by the first clause):
∀x ∈ ∆x′ , y ∈ ∆yF : S(x, y) (A.181)
∀x1, x2 ∈ ∆x′ , y ∈ ∆yT : S(x1, y) ∨ S(x2, y) (A.182)

for (5) (the third clause was subsumed by the first two):
∀x ∈ ∆x′ , y ∈ ∆yF : ¬S(x, y) (A.183)
∀x ∈ ∆x′ , y ∈ ∆yT : S(x, y) (A.184)

for (6) (similar to (4)):
∀x ∈ ∆x′ , y ∈ ∆yT : ¬S(x, y) (A.185)
∀x1, x2 ∈ ∆x′ , y ∈ ∆yF : ¬S(x1, y) ∨ ¬S(x2, y) (A.186)

for (7):
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yT : S(x, y2) (A.187)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yF : S(x, y2) (A.188)
∀x1, x2 ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yT : S(x1, y1) ∨ S(x2, y2) (A.189)
∀x1, x2 ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yF : S(x1, y1) ∨ S(x2, y2) (A.190)

for (8):
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yT : ¬S(x, y2) (A.191)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yF : ¬S(x, y2) (A.192)
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yT : S(x, y2) (A.193)
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yF : S(x, y2) (A.194)

for (9):
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yT : ¬S(x, y2) (A.195)
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yF : ¬S(x, y2) (A.196)
∀x1, x2 ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yT : ¬S(x1, y1) ∨ ¬S(x2, y2) (A.197)
∀x1, x2 ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yF : ¬S(x1, y1) ∨ ¬S(x2, y2) (A.198)

Looking at the first clause for (7) (and some other clauses), we see that there exists a ∀y1 ∈ ∆yT but y1 does not appear in the formula. If ∆yT = ∅, we can ignore this clause; otherwise, we can ignore ∀y1 ∈ ∆yT . So we consider three cases. When k = 0 (i.e.
∆yF = ∆y, ∆yT = ∅):

for (1):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(x, y1) ∨ S(x, y2) (A.199)

for (2):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(x, y1) ∨ ¬S(x, y2) (A.200)

for (3):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : ¬S(x, y1) ∨ ¬S(x, y2) (A.201)

for (4):
∀x ∈ ∆x′ , y ∈ ∆y : S(x, y) (A.202)

for (5):
∀x ∈ ∆x′ , y ∈ ∆y : ¬S(x, y) (A.203)

for (6):
∀x1, x2 ∈ ∆x′ , y ∈ ∆y : ¬S(x1, y) ∨ ¬S(x2, y) (A.204)

for (7):
∀x ∈ ∆x′ , y2 ∈ ∆y : S(x, y2) (A.205)

for (8):
∀x ∈ ∆x′ , y2 ∈ ∆y : ¬S(x, y2) (A.206)

for (9):
∀x1, x2 ∈ ∆x′ , y1, y2 ∈ ∆y : ¬S(x1, y1) ∨ ¬S(x2, y2) (A.207)

If clause #4 is one of the clauses in the theory, then unit propagation either gives False or satisfies all the clauses. The same is true for clauses #5, #7, and #8. In a theory not having any of these four clauses, we will be left with a set of clauses that are again a subset of the initial 9 clauses that we started with, but with a smaller domain size. By applying the same procedure, we can count the number of models. When k = |∆y| (i.e. ∆yF = ∅, ∆yT = ∆y), everything is similar to the k = 0 case.

When 0 < k < |∆y| (i.e. neither ∆yT nor ∆yF is empty):

for (1):
∀x ∈ ∆x′ , y1, y2 ∈ ∆yT : S(x, y1) ∨ S(x, y2) (A.208)
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yF : S(x, y1) ∨ S(x, y2) (A.209)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yT : S(x, y1) ∨ S(x, y2) (A.210)
∀x ∈ ∆x′ , y1, y2 ∈ ∆yF : S(x, y1) ∨ S(x, y2) (A.211)

for (2):
∀x ∈ ∆x′ , y1, y2 ∈ ∆yT : S(x, y1) ∨ ¬S(x, y2) (A.212)
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yF : S(x, y1) ∨ ¬S(x, y2) (A.213)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yT : S(x, y1) ∨ ¬S(x, y2) (A.214)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yF : S(x, y1) ∨ ¬S(x, y2) (A.215)

for (3):
∀x ∈ ∆x′ , y1, y2 ∈ ∆yT : ¬S(x, y1) ∨ ¬S(x, y2) (A.216)
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yF : ¬S(x, y1) ∨ ¬S(x, y2) (A.217)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yT : ¬S(x, y1) ∨ ¬S(x, y2) (A.218)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yF : ¬S(x, y1) ∨ ¬S(x, y2) (A.219)

for (4):
∀x ∈ ∆x′ , y ∈ ∆yF : S(x, y) (A.220)
∀x1, x2 ∈ ∆x′ , y ∈ ∆yT : S(x1, y) ∨ S(x2, y) (A.221)
for (5):
∀x ∈ ∆x′ , y ∈ ∆yF : ¬S(x, y) (A.222)
∀x ∈ ∆x′ , y ∈ ∆yT : S(x, y) (A.223)

for (6):
∀x ∈ ∆x′ , y ∈ ∆yT : ¬S(x, y) (A.224)
∀x1, x2 ∈ ∆x′ , y ∈ ∆yF : ¬S(x1, y) ∨ ¬S(x2, y) (A.225)

for (7):
∀x ∈ ∆x′ , y ∈ ∆yT : S(x, y) (A.226)
∀x ∈ ∆x′ , y ∈ ∆yF : S(x, y) (A.227)

for (8):
False (A.228)

for (9):
∀x ∈ ∆x′ , y ∈ ∆yT : ¬S(x, y) (A.229)
∀x ∈ ∆x′ , y ∈ ∆yF : ¬S(x, y) (A.230)

If any one of clauses #5, #7, #8, or #9 is in the theory, then unit propagation either gives False or satisfies all clauses. Assume none of these four clauses is in the theory. If both clauses #4 and #6 are in the theory, again unit propagation gives either False or satisfies all clauses. If neither of them is in the theory, then the other clauses are a subset of the initial 9 clauses that we started with. So let us consider the case where we have clause #4 and a subset of the first three clauses (the case with #6 instead of #4 is similar). In this case, if clause #2 or #3 is in the theory, unit propagation either gives False or satisfies all the clauses. If neither of them is in the theory and only #1 is, we have the following clauses after unit propagation:

∀x ∈ ∆x′ , y ∈ ∆yF : S(x, y) (A.231)
∀x1, x2 ∈ ∆x′ , y ∈ ∆yT : S(x1, y) ∨ S(x2, y) (A.232)
∀x ∈ ∆x′ , y1, y2 ∈ ∆yT : S(x, y1) ∨ S(x, y2) (A.233)

The first clause is independent of the other two clauses. The second and third clauses are similar to clauses #4 and #1 in the initial list of clauses, and we can handle them using the same procedure.

If we use a cache to store the computations for all subproblems, WMC is domain-liftable, i.e. polynomial in the population sizes.

In what follows, we prove Lemma 2.

Proof.
Let ψ be the set of pairs (i, j) such that the singleton clause ∀pi ∈ ∆pi , qj ∈ ∆qj : S(pi, qj) is in the theory, and ψ̄ be the set of pairs (i, j) such that the singleton clause ∀pi ∈ ∆pi , qj ∈ ∆qj : ¬S(pi, qj) is in the theory. Then the singleton clauses can be written as follows:

∀(i, j) ∈ ψ : ∀pi ∈ ∆pi , qj ∈ ∆qj : S(pi, qj) (A.234)
∀(i, j) ∈ ψ̄ : ∀pi ∈ ∆pi , qj ∈ ∆qj : ¬S(pi, qj) (A.235)

And the α(S) clauses are as in Lemma 1. Without loss of generality, assume we select an individual N ∈ ∆p1 for domain recursion and re-write all clauses to separate N from ∆p1 . Assuming ∆p′1 = ∆p1 − {N} and ∆x′ = ∆x − {N}, the theory is:

For singletons:
∀(1, j) ∈ ψ : ∀qj ∈ ∆qj : S(N, qj) (A.236)
∀(1, j) ∈ ψ̄ : ∀qj ∈ ∆qj : ¬S(N, qj) (A.237)
∀(1, j) ∈ ψ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qj : S(p′1, qj) (A.238)
∀(1, j) ∈ ψ̄ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qj : ¬S(p′1, qj) (A.239)
∀(i ≠ 1, j) ∈ ψ : ∀pi ∈ ∆pi , qj ∈ ∆qj : S(pi, qj) (A.240)
∀(i ≠ 1, j) ∈ ψ̄ : ∀pi ∈ ∆pi , qj ∈ ∆qj : ¬S(pi, qj) (A.241)

for (1):
∀y1, y2 ∈ ∆y : S(N, y1) ∨ S(N, y2) (A.242)
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(x, y1) ∨ S(x, y2) (A.243)

for (2):
∀y1, y2 ∈ ∆y : S(N, y1) ∨ ¬S(N, y2) (A.244)
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(x, y1) ∨ ¬S(x, y2) (A.245)

for (3):
∀y1, y2 ∈ ∆y : ¬S(N, y1) ∨ ¬S(N, y2) (A.246)
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : ¬S(x, y1) ∨ ¬S(x, y2) (A.247)
for (4):
∀x ∈ ∆x′ , y ∈ ∆y : S(N, y) ∨ S(x, y) (A.248)
∀x1, x2 ∈ ∆x′ , y ∈ ∆y : S(x1, y) ∨ S(x2, y) (A.249)

for (5):
∀x ∈ ∆x′ , y ∈ ∆y : S(N, y) ∨ ¬S(x, y) (A.250)
∀x ∈ ∆x′ , y ∈ ∆y : ¬S(N, y) ∨ S(x, y) (A.251)
∀x1, x2 ∈ ∆x′ , y ∈ ∆y : S(x1, y) ∨ ¬S(x2, y) (A.252)

for (6):
∀x ∈ ∆x′ , y ∈ ∆y : ¬S(N, y) ∨ ¬S(x, y) (A.253)
∀x1, x2 ∈ ∆x′ , y ∈ ∆y : ¬S(x1, y) ∨ ¬S(x2, y) (A.254)

for (7):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(N, y1) ∨ S(x, y2) (A.255)
∀x1, x2 ∈ ∆x′ , y1, y2 ∈ ∆y : S(x1, y1) ∨ S(x2, y2) (A.256)

for (8):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(N, y1) ∨ ¬S(x, y2) (A.257)
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : ¬S(N, y1) ∨ S(x, y2) (A.258)
∀x1, x2 ∈ ∆x′ , y1, y2 ∈ ∆y : S(x1, y1) ∨ ¬S(x2, y2) (A.259)

for (9):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : ¬S(N, y1) ∨ ¬S(x, y2) (A.260)
∀x1, x2 ∈ ∆x′ , y1, y2 ∈ ∆y : ¬S(x1, y1) ∨ ¬S(x2, y2) (A.261)

We apply lifted case-analysis on each S(N, ∆qj ). For each j, let ∆qTj represent the individuals in ∆qj for which S(N, qj) is true and ∆qFj be the other individuals. For each j, lifted case-analysis adds two clauses to the theory as follows:

∀qj ∈ ∆qTj : S(N, qj) (A.262)
∀qj ∈ ∆qFj : ¬S(N, qj) (A.263)

We shatter all other singleton clauses based on these newly added singletons. If the singletons are inconsistent, there is no model. Otherwise, let ∆yT represent ∪j ∆qTj and ∆yF represent ∪j ∆qFj . We add the following two singleton clauses to the theory:

∀y ∈ ∆yT : S(N, y) (A.264)
∀y ∈ ∆yF : ¬S(N, y) (A.265)

We shatter all clauses in α(S) form based on these two singletons (not considering the shattering caused by the other singletons) and apply unit propagation. Then the theory is as follows (the details can be checked in Lemma 1.
Here we only consider the case where ∆yT ≠ ∅ and ∆yF ≠ ∅; the case where one of them is empty can be considered similarly, as in Lemma 1):

For singletons:
∀j : ∀qj ∈ ∆qTj : S(N, qj) (A.266)
∀j : ∀qj ∈ ∆qFj : ¬S(N, qj) (A.267)
∀(1, j) ∈ ψ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qTj : S(p′1, qj) (A.268)
∀(1, j) ∈ ψ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qFj : S(p′1, qj) (A.269)
∀(1, j) ∈ ψ̄ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qTj : ¬S(p′1, qj) (A.270)
∀(1, j) ∈ ψ̄ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qFj : ¬S(p′1, qj) (A.271)
∀(i ≠ 1, j) ∈ ψ : ∀pi ∈ ∆pi , qj ∈ ∆qTj : S(pi, qj) (A.272)
∀(i ≠ 1, j) ∈ ψ : ∀pi ∈ ∆pi , qj ∈ ∆qFj : S(pi, qj) (A.273)
∀(i ≠ 1, j) ∈ ψ̄ : ∀pi ∈ ∆pi , qj ∈ ∆qTj : ¬S(pi, qj) (A.274)
∀(i ≠ 1, j) ∈ ψ̄ : ∀pi ∈ ∆pi , qj ∈ ∆qFj : ¬S(pi, qj) (A.275)

The two singletons on ∆yF and ∆yT :
∀y ∈ ∆yT : S(N, y) (A.276)
∀y ∈ ∆yF : ¬S(N, y) (A.277)

for (1):
∀x ∈ ∆x′ , y1, y2 ∈ ∆yT : S(x, y1) ∨ S(x, y2) (A.278)
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yF : S(x, y1) ∨ S(x, y2) (A.279)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yT : S(x, y1) ∨ S(x, y2) (A.280)
∀x ∈ ∆x′ , y1, y2 ∈ ∆yF : S(x, y1) ∨ S(x, y2) (A.281)
for (2):
∀x ∈ ∆x′ , y1, y2 ∈ ∆yT : S(x, y1) ∨ ¬S(x, y2) (A.282)
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yF : S(x, y1) ∨ ¬S(x, y2) (A.283)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yT : S(x, y1) ∨ ¬S(x, y2) (A.284)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yF : S(x, y1) ∨ ¬S(x, y2) (A.285)

for (3):
∀x ∈ ∆x′ , y1, y2 ∈ ∆yT : ¬S(x, y1) ∨ ¬S(x, y2) (A.286)
∀x ∈ ∆x′ , y1 ∈ ∆yT , y2 ∈ ∆yF : ¬S(x, y1) ∨ ¬S(x, y2) (A.287)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yT : ¬S(x, y1) ∨ ¬S(x, y2) (A.288)
∀x ∈ ∆x′ , y1 ∈ ∆yF , y2 ∈ ∆yF : ¬S(x, y1) ∨ ¬S(x, y2) (A.289)

for (4):
∀x ∈ ∆x′ , y ∈ ∆yF : S(x, y) (A.290)
∀x1, x2 ∈ ∆x′ , y ∈ ∆yT : S(x1, y) ∨ S(x2, y) (A.291)

for (5):
∀x ∈ ∆x′ , y ∈ ∆yF : ¬S(x, y) (A.292)
∀x ∈ ∆x′ , y ∈ ∆yT : S(x, y) (A.293)

for (6):
∀x ∈ ∆x′ , y ∈ ∆yT : ¬S(x, y) (A.294)
∀x1, x2 ∈ ∆x′ , y ∈ ∆yF : ¬S(x1, y) ∨ ¬S(x2, y) (A.295)

for (7):
∀x ∈ ∆x′ , y ∈ ∆yT : S(x, y) (A.296)
∀x ∈ ∆x′ , y ∈ ∆yF : S(x, y) (A.297)

for (8):
False (A.298)

for (9):
∀x ∈ ∆x′ , y ∈ ∆yT : ¬S(x, y) (A.299)
∀x ∈ ∆x′ , y ∈ ∆yF : ¬S(x, y) (A.300)

Clauses A.266, A.267, A.276, and A.277 are disconnected from the rest of the theory and can be reasoned about separately; it is trivial to lift them. Now let us consider the other clauses.

If any one of clauses #5, #7, #8, or #9 is in the theory, then unit propagation either gives False or satisfies all clauses. The same is true when both #4 and #6 are in the theory.
If neither #4 nor #6 is in the theory, then we can conjoin the individuals in ∆yT and ∆yF , as well as those in ∆qTj and ∆qFj , and write the theory as follows:

For singletons:
∀(1, j) ∈ ψ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qj : S(p′1, qj) (A.301)
∀(1, j) ∈ ψ̄ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qj : ¬S(p′1, qj) (A.302)
∀(i ≠ 1, j) ∈ ψ : ∀pi ∈ ∆pi , qj ∈ ∆qj : S(pi, qj) (A.303)
∀(i ≠ 1, j) ∈ ψ̄ : ∀pi ∈ ∆pi , qj ∈ ∆qj : ¬S(pi, qj) (A.304)

for (1):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(x, y1) ∨ S(x, y2) (A.305)

for (2):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : S(x, y1) ∨ ¬S(x, y2) (A.306)

for (3):
∀x ∈ ∆x′ , y1, y2 ∈ ∆y : ¬S(x, y1) ∨ ¬S(x, y2) (A.307)

This theory is an instance of our initial theory, but with p1 having a smaller domain size, so we can continue this process recursively on the remaining clauses.

Now consider the case where #4 is in the theory but #6 is not (the case where #6 is in the theory and #4 is not is similar). In this case, if #2 or #3 is in the theory, then unit propagation either gives False or satisfies all the clauses. If #4 and #1 are in the theory, then the theory is as follows:

For singletons:
∀(1, j) ∈ ψ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qTj : S(p′1, qj) (A.308)
∀(1, j) ∈ ψ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qFj : S(p′1, qj) (A.309)
∀(1, j) ∈ ψ̄ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qTj : ¬S(p′1, qj) (A.310)
∀(1, j) ∈ ψ̄ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qFj : ¬S(p′1, qj) (A.311)
∀(i ≠ 1, j) ∈ ψ : ∀pi ∈ ∆pi , qj ∈ ∆qTj : S(pi, qj) (A.312)
∀(i ≠ 1, j) ∈ ψ : ∀pi ∈ ∆pi , qj ∈ ∆qFj : S(pi, qj) (A.313)
∀(i ≠ 1, j) ∈ ψ̄ : ∀pi ∈ ∆pi , qj ∈ ∆qTj : ¬S(pi, qj) (A.314)
∀(i ≠ 1, j) ∈ ψ̄ : ∀pi ∈ ∆pi , qj ∈ ∆qFj : ¬S(pi, qj) (A.315)

for (1):
∀x ∈ ∆x′ , y1, y2 ∈ ∆yT : S(x, y1) ∨ S(x, y2) (A.317)

for (4):
∀x ∈ ∆x′ , y ∈ ∆yF : S(x, y) (A.318)
∀x1, x2 ∈ ∆x′ , y ∈ ∆yT : S(x1, y) ∨ S(x2, y) (A.319)

Clause A.318 and the singleton clauses having qj ∈ ∆qFj are disconnected from the rest of the theory and can be reasoned about separately; they can be trivially lifted.
Once we remove these clauses, the theory is as follows:

∀(1, j) ∈ ψ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qTj : S(p′1, qj) (A.320)
∀(1, j) ∈ ψ̄ : ∀p′1 ∈ ∆p′1 , qj ∈ ∆qTj : ¬S(p′1, qj) (A.321)
∀(i ≠ 1, j) ∈ ψ : ∀pi ∈ ∆pi , qj ∈ ∆qTj : S(pi, qj) (A.322)
∀(i ≠ 1, j) ∈ ψ̄ : ∀pi ∈ ∆pi , qj ∈ ∆qTj : ¬S(pi, qj) (A.323)
∀x ∈ ∆x′ , y1, y2 ∈ ∆yT : S(x, y1) ∨ S(x, y2) (A.324)
∀x1, x2 ∈ ∆x′ , y ∈ ∆yT : S(x1, y) ∨ S(x2, y) (A.325)

which is an instance of our initial theory, but where the ∆qj s have smaller domain sizes, so we can continue this process recursively on the remaining clauses.

We showed that in all cases, after domain recursion we again have an instance of our initial theory, but with smaller domain sizes. By keeping the WMC of sub-problems in a cache, the whole process is polynomial in the population sizes, i.e. the theory is domain-liftable.

A.2.6 Proof of Proposition 11

Proposition: FO2 ⊂ RU, FO2 ⊂ S 2FO2, FO2 ⊂ S 2RU, RU ⊂ S 2RU, S 2FO2 ⊂ S 2RU.

Proof. Let T be a theory in FO2 (T ∈ FO2) and T′ be any of the theories resulting from exhaustively applying the rules in Rules except lifted case-analysis on T. If T initially contains a unary PRV with predicate S, either it is still unary in T′ or lifted decomposition has replaced the LV with a constant. In the first case, we can follow a generic branch of lifted case-analysis on S; in the second case, either T′ is empty, or all binary atoms in T have become unary in T′ due to applying the lifted decomposition and we can follow a generic branch of lifted case-analysis for any of these atoms. The generic branch in both cases is in FO2, and the same procedure can be followed until all theories become empty. If T initially contains only binary atoms, lifted decomposition applies, as the grounding of T is disconnected for each pair of individuals, and after lifted decomposition all atoms have no logvars. Applying case-analysis on all atoms gives empty theories.
Therefore, T is also in RU (T ∈ RU).

The following theory with only one sentence:

∀x, y, z ∈ ∆p : F(x, y) ∨ G(y, z) ∨ H(x, y, z)

is an example of a theory in RU that is not in FO2, showing RU ⊄ FO2. FO2 and RU are special cases of S 2FO2 and S 2RU respectively, where α = ∅, showing FO2 ⊂ S 2FO2 and RU ⊂ S 2RU. However, Example 44 is in both S 2FO2 and S 2RU but is neither in FO2 nor in RU, showing S 2FO2 ⊄ FO2 and S 2RU ⊄ RU. Since FO2 ⊂ RU and the class of added α(S) clauses is the same, S 2FO2 ⊂ S 2RU.

A.2.7 Proof of Proposition 12

Proposition: The WMC of the theory for Van den Broeck [217]'s deck of cards problem can be calculated in O(n²) using the rules in RulesD without Skolemization.

Proof. Let N be an entity in ∆p and ∆p′ = ∆p − {N}. We can re-write the theory in terms of N and ∆p′ as follows:

∀c ∈ ∆c : S(c, N) ∨ (∃p′ ∈ ∆p′ : S(c, p′)) (A.326)
∃c ∈ ∆c : S(c, N) (A.327)
∀p′ ∈ ∆p′ , ∃c ∈ ∆c : S(c, p′) (A.328)
∀c1, c2 ∈ ∆c : ¬S(c1, N) ∨ ¬S(c2, N) (A.329)
∀p′ ∈ ∆p′ , c1, c2 ∈ ∆c : ¬S(c1, p′) ∨ ¬S(c2, p′) (A.330)

Since ∃c ∈ ∆c : S(c, N) (the second clause above), let M be an entity in ∆c such that S(M, N) = True and let ∆c′ = ∆c − {M}. By re-writing the theory in terms of ∆c′ and M, the second clause can be removed and we get the following theory24:

S(M, N) ∨ (∃p′ ∈ ∆p′ : S(M, p′)) (A.332)
∀c′ ∈ ∆c′ : S(c′, N) ∨ (∃p′ ∈ ∆p′ : S(c′, p′)) (A.333)
∀p′ ∈ ∆p′ : S(M, p′) ∨ (∃c′ ∈ ∆c′ : S(c′, p′)) (A.334)
∀c′1 ∈ ∆c′ : ¬S(c′1, N) ∨ ¬S(M, N) (A.335)
∀c′1, c′2 ∈ ∆c′ : ¬S(c′1, N) ∨ ¬S(c′2, N) (A.336)
∀p′ ∈ ∆p′ , c′1 ∈ ∆c′ : ¬S(c′1, p′) ∨ ¬S(M, p′) (A.337)
∀p′ ∈ ∆p′ , c′1, c′2 ∈ ∆c′ : ¬S(c′1, p′) ∨ ¬S(c′2, p′) (A.338)

The first clause can be removed as we know S(M, N) is True. The fourth clause can be simplified to ∀c′1 ∈ ∆c′ : ¬S(c′1, N). Since this clause now contains only one literal, we can apply unit propagation on it.
Unit propagation on ¬S(c′1, N) gives the following theory:

∀c′ ∈ ∆c′ , ∃p′ ∈ ∆p′ : S(c′, p′) (A.339)
∀p′ ∈ ∆p′ : S(M, p′) ∨ (∃c′ ∈ ∆c′ : S(c′, p′)) (A.340)
∀p′ ∈ ∆p′ , c′1 ∈ ∆c′ : ¬S(c′1, p′) ∨ ¬S(M, p′) (A.341)
∀p′ ∈ ∆p′ , c′1, c′2 ∈ ∆c′ : ¬S(c′1, p′) ∨ ¬S(c′2, p′) (A.342)

Lifted case-analysis on S(M, p′), assuming ∆p′T and ∆p′F represent mutually exclusive and covering subsets of ∆p′ such that ∀p′T ∈ ∆p′T : S(M, p′T) and ∀p′F ∈ ∆p′F : ¬S(M, p′F), gives the following theory:

∀c′ ∈ ∆c′ : (∃p′T ∈ ∆p′T : S(c′, p′T)) ∨ (∃p′F ∈ ∆p′F : S(c′, p′F)) (A.343)
∀p′F ∈ ∆p′F , ∃c′ ∈ ∆c′ : S(c′, p′F) (A.344)
∀p′T ∈ ∆p′T , c′1 ∈ ∆c′ : ¬S(c′1, p′T) (A.345)
∀p′T ∈ ∆p′T , c′1, c′2 ∈ ∆c′ : ¬S(c′1, p′T) ∨ ¬S(c′2, p′T) (A.346)
∀p′F ∈ ∆p′F , c′1, c′2 ∈ ∆c′ : ¬S(c′1, p′F) ∨ ¬S(c′2, p′F) (A.347)

Unit propagation on the third clause gives the following theory:

∀c′ ∈ ∆c′ , ∃p′F ∈ ∆p′F : S(c′, p′F) (A.348)
∀p′F ∈ ∆p′F , ∃c′ ∈ ∆c′ : S(c′, p′F) (A.349)
∀p′F ∈ ∆p′F , c′1, c′2 ∈ ∆c′ : ¬S(c′1, p′F) ∨ ¬S(c′2, p′F) (A.350)

This theory is identical to the original theory that we started with, except that it is over the smaller populations ∆c′ and ∆p′F . By continuing this process and keeping the intermediate results in a cache, the WMC of the theory can be found in O(n²): the number of times we do the above procedure is at most n = |∆c|, and each time we do the procedure, the lifted case-analysis takes O(n) time, where n = |∆p|.

24 Note that there are |∆c| ways for selecting M, so the WMC of the resulting theory must be multiplied by |∆c|. Also note that if lifted case-analysis is subsequently performed on S(c, N), the WMC of the i-th branch must be divided by i + 1.

A.2.8 Proof of Proposition 13

Proposition: The WMC of the reformulated theory for the deck of cards problem can be calculated in O(n) using the rules in RulesD without Skolemization.

Proof. Let N be an entity in ∆p and ∆p′ = ∆p − {N}.
We can re-write the theory in terms of N and ∆p′ as follows:

∀c ∈ ∆c : S(c, N) ∨ (∃p′ ∈ ∆p′ : S(c, p′)) (A.351)
∃c ∈ ∆c : S(c, N) (A.352)
∀p′ ∈ ∆p′ , ∃c ∈ ∆c : S(c, p′) (A.353)
∀c1, c2 ∈ ∆c : ¬S(c1, N) ∨ ¬S(c2, N) (A.354)
∀p′ ∈ ∆p′ , c1, c2 ∈ ∆c : ¬S(c1, p′) ∨ ¬S(c2, p′) (A.355)
∀p′1 ∈ ∆p′ , c ∈ ∆c : ¬S(c, p′1) ∨ ¬S(c, N) (A.356)
∀p′1, p′2 ∈ ∆p′ , c ∈ ∆c : ¬S(c, p′1) ∨ ¬S(c, p′2) (A.357)

Since ∃c ∈ ∆c : S(c, N) (the second clause above), let M be an entity in ∆c such that S(M, N) and let ∆c′ = ∆c − {M}. By re-writing the theory in terms of ∆c′ and M, the second clause can be removed and we get the following theory25:

S(M, N) ∨ (∃p′ ∈ ∆p′ : S(M, p′)) (A.358)
∀c′ ∈ ∆c′ : S(c′, N) ∨ (∃p′ ∈ ∆p′ : S(c′, p′)) (A.359)
∀p′ ∈ ∆p′ : S(M, p′) ∨ (∃c′ ∈ ∆c′ : S(c′, p′)) (A.360)
∀c′1 ∈ ∆c′ : ¬S(c′1, N) ∨ ¬S(M, N) (A.361)
∀c′1, c′2 ∈ ∆c′ : ¬S(c′1, N) ∨ ¬S(c′2, N) (A.362)
∀p′ ∈ ∆p′ , c′1 ∈ ∆c′ : ¬S(c′1, p′) ∨ ¬S(M, p′) (A.363)
∀p′ ∈ ∆p′ , c′1, c′2 ∈ ∆c′ : ¬S(c′1, p′) ∨ ¬S(c′2, p′) (A.364)
∀p′1 ∈ ∆p′ : ¬S(M, p′1) ∨ ¬S(M, N) (A.365)
∀p′1 ∈ ∆p′ , c′ ∈ ∆c′ : ¬S(c′, p′1) ∨ ¬S(c′, N) (A.366)
∀p′1, p′2 ∈ ∆p′ : ¬S(M, p′1) ∨ ¬S(M, p′2) (A.367)
∀p′1, p′2 ∈ ∆p′ , c′ ∈ ∆c′ : ¬S(c′, p′1) ∨ ¬S(c′, p′2) (A.368)

The first clause can be removed as we know S(M, N) = True. The fourth and the eighth clauses can be simplified to ∀c′1 ∈ ∆c′ : ¬S(c′1, N) and ∀p′1 ∈ ∆p′ : ¬S(M, p′1) respectively. Since these clauses now contain only one literal each, we can apply unit propagation on them. Unit propagation on ¬S(c′1, N) and ¬S(M, p′1) gives the following theory:

∀c′ ∈ ∆c′ , ∃p′ ∈ ∆p′ : S(c′, p′) (A.369)
∀p′ ∈ ∆p′ , ∃c′ ∈ ∆c′ : S(c′, p′) (A.370)
∀p′ ∈ ∆p′ , c′1, c′2 ∈ ∆c′ : ¬S(c′1, p′) ∨ ¬S(c′2, p′) (A.371)
∀p′1, p′2 ∈ ∆p′ , c′ ∈ ∆c′ : ¬S(c′, p′1) ∨ ¬S(c′, p′2) (A.372)

This theory is identical to the original theory that we started with, except that it is over the smaller populations ∆c′ and ∆p′ .
By continuing this process and keeping the intermediate results in a cache, the WMC of the theory can be found in O(n): the number of times we do the above procedure is at most n = |∆c|, and each time we do the procedure it takes a constant amount of time.

25 cf. footnote 24
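The domain recursion in this last proof can be sanity-checked computationally. The sketch below is illustrative code, not code from the thesis; the names wmc and brute_force are ours, and we take all weights to be 1, so the WMC reduces to a model count. The reformulated theory over n cards and n people forces S to relate every card to exactly one person and every person to exactly one card, i.e. S is a bijection, so the count should be n!. The O(n) recurrence mirrors the proof: fixing a person N, there are |∆c| choices for the card M with S(M, N), after which the residual theory is the same theory over n − 1 cards and n − 1 people.

```python
from functools import lru_cache
from itertools import product

@lru_cache(maxsize=None)
def wmc(n: int) -> int:
    """Model count of the reformulated deck-of-cards theory over n cards
    and n people, via the domain recursion from the proof of Proposition 13."""
    if n == 0:
        return 1              # empty theory: exactly one (empty) model
    return n * wmc(n - 1)     # |∆c| choices for M, then recurse on n-1

def brute_force(n: int) -> int:
    """Count models by enumerating every interpretation of S (exponential);
    only feasible for very small n, used here as a correctness check."""
    cards, people = range(n), range(n)
    count = 0
    for bits in product([False, True], repeat=n * n):
        S = {(c, p) for c in cards for p in people if bits[c * n + p]}
        # Each card held by exactly one person; each person holds exactly one card.
        ok = all(sum((c, p) in S for p in people) == 1 for c in cards) and \
             all(sum((c, p) in S for c in cards) == 1 for p in people)
        count += ok
    return count

if __name__ == "__main__":
    for n in range(4):
        assert wmc(n) == brute_force(n)
    print([wmc(n) for n in range(6)])  # factorials: 1, 1, 2, 6, 24, 120
```

Here lru_cache plays the role of the cache of intermediate results mentioned in the proofs; in this particular recurrence the cache is not strictly needed, but in the general procedures above it is what keeps the number of distinct subproblems, and hence the total work, polynomial in the population sizes.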