Aggregation and Constraint Processing in Lifted Probabilistic Inference

by

Jacek Jerzy Kisyński

M.Sc., Maria Curie-Skłodowska University in Lublin, 2001

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Computer Science)

The University of British Columbia (Vancouver)

March 2010

© Jacek Jerzy Kisyński, 2010

Abstract

Representations that mix graphical models and first-order logic—called either first-order or relational probabilistic models—were proposed nearly twenty years ago and many more have since emerged. In these models, random variables are parameterized by logical variables. One way to perform inference in first-order models is to propositionalize the model, that is, to explicitly consider every element from the domains of the logical variables. This approach might be intractable even for simple first-order models. The idea behind lifted inference is to carry out as much inference as possible without propositionalizing.

An exact lifted inference procedure for first-order probabilistic models was developed by Poole [2003] and later extended to a broader range of problems by de Salvo Braz et al. [2007]. The C-FOVE algorithm by Milch et al. [2008] expanded the scope of lifted inference and is currently the state of the art in exact lifted inference.

In this thesis we address two problems related to lifted inference: aggregation in directed first-order probabilistic models and constraint processing during lifted inference.

Recent work on exact lifted inference focused on undirected models. Directed first-order probabilistic models require an aggregation operator when a parent random variable is parameterized by logical variables that are not present in a child random variable. We introduce a new data structure, aggregation parfactors, to describe aggregation in directed first-order models.
We show how to extend the C-FOVE algorithm to perform lifted inference in the presence of aggregation parfactors. There are cases where the polynomial time complexity (in the domain size of logical variables) of the C-FOVE algorithm can be reduced to logarithmic time complexity using aggregation parfactors.

First-order models typically contain constraints on logical variables. Constraints are important for capturing knowledge regarding particular individuals. However, the impact of constraint processing on the computational efficiency of lifted inference has been largely overlooked. In this thesis we develop an efficient algorithm for counting the number of solutions to the constraint satisfaction problems encountered during lifted inference. We also compare, both theoretically and empirically, different ways of handling constraints during lifted inference.

Table of Contents

Abstract
Table of Contents
List of Figures
Acknowledgments

1 Introduction
  1.1 Probabilistic reasoning in complex domains
  1.2 Thesis overview
  1.3 Summary of thesis contributions
  1.4 Thesis organization

2 Background
  2.1 Introduction
  2.2 Belief networks
  2.3 Inference in belief networks
    2.3.1 Factors
    2.3.2 Variable elimination for belief networks
      2.3.2.1 Complexity of variable elimination
  2.4 First-order probabilistic models
    2.4.1 Parameterized random variables
      2.4.1.1 Counting formulas
    2.4.2 Independent Choice Logic
  2.5 Lifted probabilistic inference
    2.5.1 Parametric factors
      2.5.1.1 Normal-form constraints
    2.5.2 C-FOVE
      2.5.2.1 Lifted elimination
      2.5.2.2 Parfactor multiplication
      2.5.2.3 Splitting, expanding and propositionalizing
      2.5.2.4 Counting
      2.5.2.5 Unification
      2.5.2.6 The C-FOVE algorithm
      2.5.2.7 Example computation
  2.6 Summary

3 Aggregation in Lifted Inference
  3.1 Introduction
  3.2 Need for aggregation
  3.3 Modeling aggregation
    3.3.1 Causal independence
    3.3.2 Causal independence-based aggregation
  3.4 Aggregation parfactors
    3.4.1 Conversion to parfactors
      3.4.1.1 Conversion using counting formulas
      3.4.1.2 Conversion for MAX and MIN operators
    3.4.2 Operations on aggregation parfactors
      3.4.2.1 Splitting
      3.4.2.2 Multiplication
      3.4.2.3 Summing out
    3.4.3 Generalized aggregation parfactors
  3.5 Experiments
    3.5.1 Memory usage
    3.5.2 Social network experiment
  3.6 Conclusions

4 Solver for #CSP with Inequality Constraints
  4.1 Introduction
  4.2 Background
    4.2.1 Constraint satisfaction problems
    4.2.2 Variable elimination for #CSP
    4.2.3 Set partitions
  4.3 Counting solutions to CSP instances with inequality constraints
    4.3.1 Analysis of the problem
    4.3.2 The #VE= algorithm
      4.3.2.1 S-constants
      4.3.2.2 #VE= factors
      4.3.2.3 Multiplication
      4.3.2.4 Summing out
      4.3.2.5 The algorithm
    4.3.3 Example computation
    4.3.4 Complexity of the algorithm
      4.3.4.1 Preprocessing
      4.3.4.2 Inference
    4.3.5 Empirical evaluation
  4.4 Conclusions

5 Constraint Processing in Lifted Inference
  5.1 Introduction
  5.2 Overview of constraint processing in lifted inference
    5.2.1 Splitting and expanding
    5.2.2 Multiplication
    5.2.3 Summing out
  5.3 Splitting as needed vs. shattering
  5.4 Normal form parfactors vs. #CSP solver
    5.4.1 Multiplication
    5.4.2 Summing out
    5.4.3 Experiment
  5.5 Conclusions

6 Conclusions
  6.1 Summary
  6.2 Future work

Bibliography

A 1-dimensional Representation of VE Factors
B Hierarchical Representation of #VE= Factors
C From Parfactors to #VE= Factors
D From #VE= Factors to Parfactors
E Splitting as Needed

List of Figures

1.1 Lifted inference vs. propositional inference
2.1 A simple belief network
2.2 VE algorithm for inference in belief networks
2.3 A graph and its induced graph
2.4 ICL theory from Example 2.6
2.5 ICL theory for multiple lots from Example 2.6
2.6 More elegant ICL theory for multiple lots from Example 2.6
2.7 Graphical representation of the ICL theory for multiple lots
2.8 AC-3 algorithm for replacing logical variables with constants
2.9 MGU algorithm for parameterized random variables
2.10 Algorithm for checking if an MGU is consistent with constraints
2.11 Algorithm for splitting a parfactor on an MGU
2.12 Algorithm for splitting a parfactor on a set of constraints
3.1 A first-order model from Example 3.1
3.2 A first-order model from Example 3.2
3.3 A first-order model with OR-based aggregation
3.4 A first-order model with MAX-based aggregation
3.5 A first-order model from Example 3.5
3.6 Decomposed aggregation
3.7 A first-order model from Example 3.14
3.8 Results of the experiment
3.9 Performance on model (a) (with OR-based aggregation)
3.10 Performance on model (b) (with MAX-based aggregation)
3.11 Performance on model (c) (with SUM|3-based aggregation)
3.12 Performance on model (d) (generalized aggregation parfactors)
3.13 ICL theory for the smoking-friendship model
3.14 Illustration of how ind(X) works
3.15 Performance on the smoking-friendship model
4.1 #VE algorithm for #CSP from Dechter [2003]
4.2 Comparison of ϖn and exponential functions
4.3 Constraint graph with tree structure
4.4 Constraint graph with a cycle
4.5 Constraint graph discussed in Example 4.5
4.6 Constraint graph discussed in Example 4.5
4.7 Relationship between #VE= and #VE
4.8 A CSP used in Examples 4.11–4.15
4.9 #VE= algorithm for #CSP with inequality constraints
4.10 Initialization procedure for the #VE= algorithm
4.11 A CSP used in Section 4.3.3
4.12 Domains of three variables represented with disjoint subsets
4.13 Summary of experiments
4.14 Summary of experiments
4.15 Results of experiments on CSP instances with P(u) = 0.1
4.16 Results of experiments on CSP instances with P(u) = 0.01
5.1 A first-order model from Example 5.2
5.2 Speedup of splitting as needed over shattering for Example 5.8
5.3 Splitting as needed vs. shattering speedup for Example 5.9
5.4 Algorithm for converting a parfactor to normal form
5.5 A simple constraint graph from Example 5.10
5.6 Summing out with and without a #CSP solver
A.1 Structure of a VE factor
A.2 Procedure indexToTuple(f, t)
B.1 Domains of three variables represented with disjoint subsets
B.2 Structure of a #VE= factor

Acknowledgments

I am grateful to Dr. David Poole, my research supervisor, for his guidance. I could always rely on his expertise in artificial intelligence, computer science, and science in general. I thank Dr. Nando de Freitas, Dr. William Evans, and Dr. Kevin Murphy, members of my supervisory committee, for their advice and feedback throughout the course of my work. I would like to thank Dr.
Dan Roth (External Examiner), Dr. Martin Puterman (University Examiner), Dr. Kevin Leyton-Brown (University Examiner), and Dr. Vijay Bhargava (Examination Chair) for the time they spent reading and reviewing my thesis.

The Laboratory for Computational Intelligence and the Department of Computer Science at The University of British Columbia provided me with an inspiring work environment. I thank Peter Carbonetto and Mike Chiang for their work on joint projects. In addition to my thesis research I had the opportunity to participate in the AIspace project. My collaboration with Dr. David Poole, Dr. Alan Mackworth, Dr. Giuseppe Carenini, Dr. Cristina Conati, Byron Knoll, and Kyle Porter was a valuable experience.

The UBC Department of Computer Science is also a great place to make friends. My lab-mates Asher, Andrea, Dustin, Eric, Firas, Kasia, Mark, Mike, Rita, Robert, and Scott, and, in particular, my officemates Matt, Hendrik, and Peter were very tolerant of my passion for chitchat.

Vancouver is a paradise for a skiing and swimming enthusiast. The vast slopes of Whistler and Blackcomb mountains offer a seven-month-long skiing season. Together with my riding buddies Ken, Clint, Reid, Jonathan, and Lowell I took full advantage of it. The Empire Pool at The UBC Aquatic Centre is open seven months per year for outdoor swimming. I took full advantage of it as well. My friendship with Gośka, Kris, and Sam started there. The duration of my Ph.D. studies can easily be explained by the fact that on occasion research had to compete with skiing and swimming.

Ever since I arrived in Vancouver I have been lucky to have great landlords and roommates. Andy and Dave care more about the well-being of their tenants than about collecting the rent. Tom, Mike, Wini, Manos, Adam, and Maher were cool roommates and we had a good time living together.

I would not have been able to enter a Ph.D. program without the education I received in Poland.
My high school math teacher, Waldemar "Byko" Łobodziński, not only taught me mathematics, but also respect for knowledge and contempt for ignorance. Dr. Tadeusz Kuczumow's courses in calculus and measure theory were of world-class quality. Dr. Jerzy Mycka, my M.Sc. thesis supervisor, introduced me to computability theory as well as to the art of logic and functional programming.

I am also indebted to my family. My aunt Anne and uncle Andrzej encouraged me to pursue studies abroad and provided support once I arrived in Canada. My wife Sylwia has made a lot of sacrifices to help me with my studies. The people to whom I owe the most are my parents, Magdalena and Jan Kisyńscy. I would like to thank them for everything they have done for me.

Chapter 1

Introduction

I could never bear to be buried with people to whom I had not been introduced. — Norman Parkinson

1.1 Probabilistic reasoning in complex domains

Artificial intelligence studies the design and synthesis of agents that act intelligently and are rational [Poole and Mackworth, 2010; Russell and Norvig, 2009]. Any agent acting in the real world faces uncertainty. Uncertainty arises because of the limited information available to an agent: an agent's observations might be incomplete, an agent's sensors might be noisy, and the effects of an agent's own actions might be unknown to the agent.

The design of representation and reasoning systems for agents is a core area of artificial intelligence. These systems are used to model and make predictions about the world and to support decision making. Unavoidably, they must handle uncertainty. Probability theory provides a foundation for representation and reasoning systems that can reason under uncertainty. Given a model of the world, a typical probabilistic inference task is to make predictions about the value of a random variable given evidence about the values of other random variables.
While reasoning about the real world, an agent has to deal with domains that involve a large number of individuals. It might be interested in only a few of them, but it cannot ignore the influence of the other individuals. Poole [2003] gives the following example of a domain involving a large number of individuals. We are given a description of a person who committed a crime in a town. We also happen to know that a guy named Marian roughly matches the description. What is the probability that Marian is guilty? The probability depends on how well Marian matches the description. It also depends on the rest of the population of the town. If the town is a small village, Marian is likely to be guilty. If the town is a large city, there are potentially many people living in the city who match the description, and Marian is likely to be innocent. To represent this problem and compute the probability of Marian being guilty, we need to reason about Marian, but we also have to represent and reason about a potentially large number of individuals about whom we do not have any specific information.

Probabilistic graphical models, such as belief networks or Markov networks [Koller and Friedman, 2009], are a popular tool for representing dependencies between random variables. However, such standard representations are propositional (zeroth-order), and therefore are not well suited for describing relations between individuals or quantifying over sets of individuals. In order to reason about multiple individuals, we typically make each property of each individual into a separate random variable. Even the relatively simple example domain described above could not be easily represented using propositional probabilistic graphical models. First-order logic has the capacity for representing relations and quantification over logical variables, but it does not treat uncertainty; it has only a very primitive mechanism for handling it, namely disjunction and existential quantification.
Representations that mix graphical models and first-order logic—called either first-order or relational probabilistic models—were proposed nearly twenty years ago [Breese, 1992; Horsch and Poole, 1990] and many more have since emerged [De Raedt et al., 2008; Getoor and Taskar, 2007]. In these models, random variables are parameterized by logical variables that are typed with populations of individuals. This allows a model to be represented before the modeled individuals are known, or even before their numbers are known. It also allows an agent to compactly represent the same information about multiple individuals, and to exploit this compactness to facilitate efficient inference.

[Figure 1.1: Lifted inference vs. propositional inference for first-order probabilistic models]

A popular exact inference technique for first-order probabilistic models is based on dynamic propositionalization of the portion of the model that is relevant to the query, followed by probabilistic inference performed at the propositional level. Unfortunately, even for simple relational probabilistic models, inference at the propositional level—that is, inference that explicitly considers every individual—is very often intractable. The idea of lifted probabilistic inference is to carry out as much inference as possible without propositionalizing. The correctness of this approach is judged by whether it gives the same result as if we had first propositionalized the model and then carried out standard probabilistic inference (see Figure 1.1).

An exact lifted probabilistic inference procedure for directed first-order probabilistic models was proposed in Poole [2003]. It was later extended to a broader range of problems by de Salvo Braz et al. [2005, 2006, 2007]. Further work by Milch et al.
[2008] expanded the scope of lifted probabilistic inference and resulted in the C-FOVE algorithm, which is currently the state of the art in exact lifted probabilistic inference.

In this thesis we address two problems related to exact lifted probabilistic inference. The first is efficient aggregation during lifted probabilistic inference in directed relational probabilistic models. The second is constraint processing during lifted probabilistic inference in both directed and undirected relational probabilistic models.

1.2 Thesis overview

While early work on lifted probabilistic inference by Poole [2003] considered directed models, later work by de Salvo Braz et al. [2007] and Milch et al. [2008] focused on undirected models. Although their results can also be used for directed models, one aspect that arises exclusively in directed models is the need for aggregation, which occurs when a parent random variable is parameterized by logical variables that are not present in a child random variable. Currently available lifted inference algorithms do not represent aggregation in relational probabilistic models using data structures that are independent of the sizes of populations. In this thesis we introduce a new data structure, the aggregation parfactor, and describe how to use it to represent aggregation in first-order probabilistic models. We also show how to perform lifted probabilistic inference in the presence of aggregation parfactors, by integrating them into the C-FOVE algorithm. The results of our theoretical and empirical evaluations show that inference with aggregation parfactors can lead to gains in efficiency.

First-order models typically contain constraints on logical variables. Constraints are important for capturing knowledge regarding particular individuals.
Constraint processing during lifted probabilistic inference includes counting the number of solutions to constraint satisfaction problems induced during the inference (this counting problem is denoted #CSP). In this thesis we present an algorithm for solving the #CSPs encountered during lifted probabilistic inference and show through empirical evaluation that it significantly improves the efficiency of the inference.

Previous work on lifted probabilistic inference adopted various approaches to constraint processing. The impact of these different strategies on the computational efficiency of lifted inference has been largely overlooked. In this thesis we analyze constraint processing during lifted probabilistic inference, both theoretically and empirically. Our results stress the importance of informed constraint processing in lifted inference and motivate our work on a specialized #CSP solver.

Although this thesis focuses on exact lifted probabilistic inference, our results in the area of constraint processing for lifted probabilistic inference also apply to research on approximate lifted probabilistic inference, for example to the work by Singla and Domingos [2008].

1.3 Summary of thesis contributions

The contributions of this thesis are as follows:

• efficient aggregation algorithms for lifted probabilistic inference;
• a specialized algorithm for #CSPs that greatly improves the efficiency of lifted probabilistic inference;
• an analysis of constraint processing in lifted probabilistic inference.

1.4 Thesis organization

The rest of this thesis is organized as follows. In Chapter 2 we provide the necessary background for the topics covered in this thesis; in particular, we describe existing exact lifted probabilistic inference techniques. The contributions of this thesis begin in Chapter 3, where we present algorithms for lifted aggregation in directed first-order probabilistic models.
In Chapter 4 we develop an efficient algorithm for solving #CSPs encountered during lifted probabilistic inference. Next, in Chapter 5, we analyze various approaches to constraint processing in lifted probabilistic inference. Finally, in Chapter 6, we summarize the contributions of the thesis and discuss possible future work. Parts of this work have been published as technical papers in AI conferences [Kisyński and Poole, 2009a,b].

Chapter 2

Background

No matter how hard a man may labor, some woman is always in the background of his mind. — Gertrude Franklin Atherton

2.1 Introduction

In this chapter we introduce notation and concepts used throughout the thesis. First, we describe belief networks (Section 2.2) and the variable elimination algorithm (Section 2.3), which is used to perform inference in belief networks. Next, we give an overview of first-order probabilistic models (Section 2.4). We describe in more detail an example of first-order probabilistic modeling, Independent Choice Logic (Section 2.4.2). Finally, we discuss currently available exact lifted probabilistic inference methods (Section 2.5).

2.2 Belief networks

One of the areas of artificial intelligence is the design of representation and reasoning formalisms. Belief networks (also known as Bayesian networks) were proposed by Pearl [1988] and have since become a popular representation of independence among random variables. They belong to the class of models known as probabilistic graphical models.

[Figure 2.1: A simple belief network]

In the definition below and in the rest of this chapter we use lower case letters for random variables and upper case letters for logical variables. This is not standard notation, but it helps to distinguish logical variables from random variables.

Definition 2.1. A belief network over a set of random variables {x1, x2, ..., xn} consists of an acyclic directed graph over the random variables x1, x2, ...
, xn, with one node for each random variable, and a set of conditional probability distributions {P(xi | parent(xi)) | i = 1, 2, ..., n}, where parent(xi) denotes the nodes in the associated graph that have directed edges going into xi. A belief network represents the joint probability distribution over the random variables x1, x2, ..., xn:

P(x1, x2, ..., xn) = ∏_{i=1}^{n} P(xi | parent(xi)).

Example 2.1. Figure 2.1 presents a simple belief network over three random variables rain, sprinkler, and wet_grass. Assume they all have domain {false, true}. Associated with the belief network are three probability distributions: P(rain), P(sprinkler) and P(wet_grass | rain, sprinkler). The distributions could, for instance, encode that rain increases the probability of the grass being wet, and so does a running sprinkler (see Example 2.2).

2.3 Inference in belief networks

Given a belief network, a common inference problem is to compute the posterior distribution over a set of random variables given some evidence. We will describe how to solve this problem with the variable elimination (VE) algorithm [Zhang and Poole, 1994]. The aim of VE is to obtain the global solution in an efficient manner through local computations. The variable elimination algorithm uses factors to represent the input problem instance, the results of intermediate local computations, and the final solution.

2.3.1 Factors

Let dom(x) denote the domain of random variable x. A factor on random variables x1, x2, ..., xn is a representation of a function from dom(x1) × dom(x2) × ··· × dom(xn) into the real numbers. Let v denote an assignment of values to some set of random variables; v is a function that takes a random variable and returns its value. Let F be a factor on a set of random variables S = {x1, x2, ..., xn}, and let v be an assignment of values to the random variables in S. We extend v to factors and denote by v(F) the value of the factor F given v, that is, v(F) = F(v(x1), v(x2), ..., v(xn)).
If v assigns values to only some of the random variables in S, then v(F) denotes a factor on the remaining random variables.

Example 2.2. Consider the belief network from Example 2.1 and Figure 2.1. The three probability distributions could be represented by the following three factors:

  rain  | P(rain)        sprinkler | P(sprinkler)
  false | 0.8            false     | 0.6
  true  | 0.2            true      | 0.4

  rain   sprinkler | P(wet_grass = false)  P(wet_grass = true)
  false  false     | 1.0                   0.0
  false  true      | 0.2                   0.8
  true   false     | 0.1                   0.9
  true   true      | 0.01                  0.99

Operations on factors include computing a product of factors and summing out random variables from a factor.

Suppose F1 is a factor on random variables x1, ..., xi, y1, ..., yj, and F2 is a factor on random variables y1, ..., yj, z1, ..., zl, where the sets {x1, ..., xi}, {y1, ..., yj} and {z1, ..., zl} are pairwise disjoint. The product of F1 and F2 is a factor F1 F2 on the union of their random variables, namely x1, ..., xi, y1, ..., yj, z1, ..., zl, defined by:

(F1 F2)(x1, ..., xi, y1, ..., yj, z1, ..., zl) = F1(x1, ..., xi, y1, ..., yj) · F2(y1, ..., yj, z1, ..., zl).   (2.1)

Suppose F is a factor on random variables x1, ..., xi, ..., xj. The summing out of random variable xi from F, denoted ∑_{xi} F, is the factor on random variables x1, ..., x_{i−1}, x_{i+1}, ..., xj such that

(∑_{xi} F)(x1, ..., x_{i−1}, x_{i+1}, ..., xj) = ∑_{x ∈ dom(xi)} F(x1, ..., x_{i−1}, xi = x, x_{i+1}, ..., xj).

Given a total ordering of the random variables and a total ordering of their domains, factors can be implemented as 1-dimensional arrays (see Appendix A).

2.3.2 Variable elimination for belief networks

Variable elimination (VE) is a general technique that can solve many problems, such as CSPs [Dechter, 1999] or belief network inference [Zhang and Poole, 1994].
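Before presenting the algorithm, the factor product and summing-out operations of Section 2.3.1 can be sketched concretely. The following is an illustrative Python toy (not the thesis's 1-dimensional array representation of Appendix A); factors are pairs of a variable list and a dict keyed by assignment tuples, with the numbers taken from Example 2.2 and boolean domains assumed throughout.

```python
from itertools import product as cartesian

# A factor is a pair (variables, table), where table maps assignment tuples
# (ordered as in `variables`) to real values. Domains are {False, True}.

def multiply(f1, f2):
    """Product of two factors on the union of their random variables (Eq. 2.1)."""
    (vars1, t1), (vars2, t2) = f1, f2
    out = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for asst in cartesian([False, True], repeat=len(out)):
        val = dict(zip(out, asst))
        table[asst] = (t1[tuple(val[x] for x in vars1)]
                       * t2[tuple(val[x] for x in vars2)])
    return (out, table)

def sum_out(f, var):
    """Sum a random variable out of a factor."""
    vars_, t = f
    i = vars_.index(var)
    out = vars_[:i] + vars_[i + 1:]
    table = {}
    for key, v in t.items():
        k = key[:i] + key[i + 1:]
        table[k] = table.get(k, 0.0) + v
    return (out, table)

# Factors from Example 2.2.
f_sprinkler = (["sprinkler"], {(False,): 0.6, (True,): 0.4})
f_wet = (["rain", "sprinkler", "wet_grass"], {
    (False, False, False): 1.0,  (False, False, True): 0.0,
    (False, True,  False): 0.2,  (False, True,  True): 0.8,
    (True,  False, False): 0.1,  (True,  False, True): 0.9,
    (True,  True,  False): 0.01, (True,  True,  True): 0.99,
})

# P(wet_grass | rain) = sum over sprinkler of
#   P(sprinkler) * P(wet_grass | rain, sprinkler)
g = sum_out(multiply(f_sprinkler, f_wet), "sprinkler")
```

For example, the entry for rain = false, wet_grass = true is 0.6 · 0.0 + 0.4 · 0.8 = 0.32.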
The first VE algorithm was non-serial dynamic programming [Bertelè and Brioschi, 1972]. In this section we present a variant of VE for inference in belief networks.

Consider a belief network over a set of random variables V = {x1, x2, ..., xn}. We represent the conditional probability distributions from the network as factors. For each random variable xi ∈ V, we represent the conditional probability distribution P(xi | parent(xi)) as a factor F_xi on the set of random variables {xi} ∪ parent(xi). Assume that we are given:

• a set of observed values on a set of random variables O ⊂ V, which we can think of as an assignment vO of values to the random variables in O;
• a query random variable q ∈ V \ O.

Figure 2.2: VE algorithm for inference in belief networks.

procedure VE_BN(V, F, vO, q, H)
  input: set of random variables V; set of factors F;
         assignment vO of values to observed random variables O ⊂ V;
         query random variable q ∈ V \ O;
         elimination ordering heuristic H
  output: factor F representing the posterior distribution on q
  E := V \ (O ∪ {q})
  while there is a factor in F involving a random variable from E do
    select a random variable y ∈ E according to H
    F := eliminate(y, F)
    E := E \ {y}
  end
  F := ∏_{Fi ∈ F} Fi
  compute the normalizing constant c := ∑_q F
  return F / c

procedure eliminate(y, F)
  input: random variable y to be eliminated; set of factors F
  output: set of factors F with y summed out
  partition F into the set Fy of factors on y and the set F_−y := F \ Fy
  return F_−y ∪ { ∑_y ∏_{Fi ∈ Fy} Fi }

Let V \ (O ∪ {q}) = {x_{s1}, x_{s2}, ..., x_{sm}}. We want to compute the posterior distribution on q:

∑_{x_{s1}} ∑_{x_{s2}} ... ∑_{x_{sm}} vO(F_{x1}) vO(F_{x2}) ··· vO(F_{xn}).

Computing the product vO(F_{x1}) vO(F_{x2}) ··· vO(F_{xn}) directly is usually not tractable. Instead, we take advantage of the (possible) sparseness of the associated graph.
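The VE_BN procedure of Figure 2.2 can be sketched as a small executable example. This is an illustrative dict-based toy (again not the thesis's array representation), run on the sprinkler network of Examples 2.1–2.2 with query rain, observation wet_grass = true, and a fixed elimination ordering in place of a heuristic H; boolean domains are assumed.

```python
from itertools import product as cartesian
from functools import reduce

# Factors are (variables, table) pairs; tables map assignment tuples to values.

def multiply(f1, f2):
    (v1, t1), (v2, t2) = f1, f2
    out = v1 + [x for x in v2 if x not in v1]
    tbl = {}
    for asst in cartesian([False, True], repeat=len(out)):
        a = dict(zip(out, asst))
        tbl[asst] = t1[tuple(a[x] for x in v1)] * t2[tuple(a[x] for x in v2)]
    return (out, tbl)

def sum_out(f, y):
    vars_, t = f
    i = vars_.index(y)
    tbl = {}
    for k, v in t.items():
        tbl[k[:i] + k[i + 1:]] = tbl.get(k[:i] + k[i + 1:], 0.0) + v
    return (vars_[:i] + vars_[i + 1:], tbl)

def observe(f, obs):
    """Restrict a factor to the assignment vO of observed random variables."""
    vars_, t = f
    keep = [i for i, x in enumerate(vars_) if x not in obs]
    tbl = {}
    for k, v in t.items():
        if all(k[vars_.index(x)] == val for x, val in obs.items() if x in vars_):
            tbl[tuple(k[i] for i in keep)] = v
    return ([vars_[i] for i in keep], tbl)

def ve_bn(factors, query, obs, order):
    """Essence of VE_BN: condition on obs, eliminate in order, normalize."""
    factors = [observe(f, obs) for f in factors]
    for y in order:
        on_y = [f for f in factors if y in f[0]]
        rest = [f for f in factors if y not in f[0]]
        factors = rest + [sum_out(reduce(multiply, on_y), y)]
    result = reduce(multiply, factors)
    c = sum(result[1].values())          # normalizing constant
    return (result[0], {k: v / c for k, v in result[1].items()})

# Sprinkler network from Examples 2.1-2.2.
f_rain = (["rain"], {(False,): 0.8, (True,): 0.2})
f_sprinkler = (["sprinkler"], {(False,): 0.6, (True,): 0.4})
f_wet = (["rain", "sprinkler", "wet_grass"], {
    (False, False, False): 1.0,  (False, False, True): 0.0,
    (False, True,  False): 0.2,  (False, True,  True): 0.8,
    (True,  False, False): 0.1,  (True,  False, True): 0.9,
    (True,  True,  False): 0.01, (True,  True,  True): 0.99,
})

posterior = ve_bn([f_rain, f_sprinkler, f_wet],
                  "rain", {"wet_grass": True}, ["sprinkler"])
```

Observing wet grass raises the probability of rain from the prior 0.2 to roughly 0.42, since wet grass can only be explained by rain or the sprinkler.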
First, we order the sums according to an elimination ordering ρ:

    ∑_{xρ(m)} ... ∑_{xρ(2)} ∑_{xρ(1)} vO(F_{x1}) vO(F_{x2}) ··· vO(F_{xn}).

Next, using the distributive law, we move the factors that are not functions of xρ(1) outside of the sum ∑_{xρ(1)}, multiply the remaining factors, and sum out xρ(1) from their product. We repeat the process for the sums ∑_{xρ(2)} through ∑_{xρ(m)}. The above procedure is the essence of the VE algorithm. Figure 2.2 presents the pseudo-code. The next section discusses the impact of the elimination ordering on the computational cost of the VE algorithm.

2.3.2.1 Complexity of variable elimination

First, we need to introduce some concepts from graph theory. Given a graph G and an ordering ρ of its nodes, the pair (G, ρ) is an ordered graph. The width of a node is the number of nodes that share an edge with the node and precede it in the ordering ρ. The width of the ordering, w(ρ), is the maximum width over all nodes, and the width of the graph is the minimum width over all orderings of its nodes. For an ordered graph (G, ρ), the induced graph (G∗, ρ) is constructed in the following way: nodes are traversed in the order opposite to ρ; when a node is visited, we connect all of the nodes that share an edge with it and precede it in the ordering ρ. The induced width of the ordered graph (G, ρ), w∗(ρ), is the width of the induced ordered graph (G∗, ρ) (Figure 2.3).

Figure 2.3: A graph (left) and its induced graph generated by the ordering ρ = a, b, c, d, e (right). Edges added during construction of the induced graph are represented by dotted lines. We have w(ρ) = w∗(ρ) = 3.

Assume that the domains of all random variables have the same size |dom(x1)| = ··· = |dom(xn)| = d and that the random variables were processed according to an ordering ρ.
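The induced-graph construction described above can be sketched directly. A minimal illustration (not from the thesis; the adjacency-dict representation is an assumption): process nodes in reverse order, connect each visited node's earlier neighbours, and track the largest number of earlier neighbours seen.

```python
# Sketch of induced-width computation for an ordered graph, assuming the graph
# is an adjacency dict mapping each node to the set of its neighbours.

def induced_width(adj, order):
    pos = {v: i for i, v in enumerate(order)}
    g = {v: set(ns) for v, ns in adj.items()}  # copy we can add edges to
    width = 0
    for v in reversed(order):
        earlier = [u for u in g[v] if pos[u] < pos[v]]
        width = max(width, len(earlier))
        # connect all earlier neighbours of v with each other
        for i in range(len(earlier)):
            for j in range(i + 1, len(earlier)):
                g[earlier[i]].add(earlier[j])
                g[earlier[j]].add(earlier[i])
    return width
```

For example, a 4-cycle has induced width 2 under any ordering, while a chain has induced width 1.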
The time and space complexity of the variable elimination algorithm is determined by the size of the biggest factor created during inference, which depends on the induced width w∗(ρ) of the belief network graph: O(n d^{w∗(ρ)}). The problem of finding an optimal elimination ordering, that is, an ordering ρ that minimizes w∗(ρ), is NP-hard [Arnborg et al., 1987], so usually an elimination ordering heuristic is used [Kjærulff, 1990].

2.4 First-order probabilistic models

Belief networks and other probabilistic graphical models are widely used for representing dependencies between random variables. However, they are propositional representations. That is, in order to represent information about multiple individuals, each property of each individual is represented as a separate node. Probabilistic graphical models are not well suited for describing relations between individuals or quantifying over sets of individuals. First-order logic has the capacity for representing relations and quantification over logical variables, but it does not handle uncertainty well.

Representations that mix graphical models and first-order logic (first-order probabilistic models) were proposed nearly twenty years ago [Breese, 1992; Horsch and Poole, 1990]. A common building block of these models is a parameterized random variable, a random variable parameterized by logical variables. Logical variables are typed with populations of individuals.

Among the appeals of first-order probabilistic models is that it is possible to fully specify a model, that is, its structure and the accompanying probability distributions, before knowing the individuals in the modeled domain. This means that, even though we might not know the populations or even their sizes, we should still be able to specify the model. To make this possible, the length of a specification of a first-order probabilistic model must be independent of the sizes of the populations in the model.
This thesis is not tied to any particular first-order probabilistic language. We reason at the level of data structures and assume that various first-order languages (or their subsets) will compile to these data structures. As we said above, first-order probabilistic languages share the concept of a parameterized random variable. We introduce it formally in Section 2.4.1. We also describe the Independent Choice Logic (Section 2.4.2), an example of a first-order probabilistic language.

2.4.1 Parameterized random variables

A population is a set of individuals. A population corresponds to a domain in logic. A logical variable will be written starting with an uppercase letter. A logical variable is typed with a population. Given a logical variable X, we denote its population by D(X). Given a set of constraints C on X, we denote the set of individuals from D(X) that satisfy the constraints in C by D(X) : C.

A term is a logical variable or a constant denoting an individual from a population.

Definition 2.2. A parameterized random variable is of the form f(t1, ..., tk), where f is a symbol, called a functor, and the ti are terms. Each functor has a set of values called the range of the functor. We denote the range of the functor f by range(f). Functors with range {false, true} are predicate symbols; others are called function symbols. We denote the set of logical variables that appear in the parameterized random variable f(t1, ..., tk) by param(f(t1, ..., tk)).

A substitution to a set of distinct logical variables {X1, ..., Xl} is of the form {X1/ti1, ..., Xl/til}, where each tij is a term. A ground substitution is a substitution where each tij is a constant. The application of a substitution θ = {X1/ti1, ..., Xl/til} to a parameterized random variable f(t1, ..., tk), written f(t1, ..., tk)[θ], is a parameterized random variable that is the original parameterized random variable f(t1, . . .
, tk) with every occurrence of Xj in f(t1, ..., tk) replaced by the corresponding tij. The parameterized random variable f(t1, ..., tk)[θ] is called an instance of f(t1, ..., tk). An instance f(t1, ..., tk)[θ] that does not contain logical variables is called a ground instance of f(t1, ..., tk).

Let f(t1, ..., tk) be a parameterized random variable and θ be a ground substitution to all logical variables in param(f(t1, ..., tk)). The application of θ to f(t1, ..., tk), f(t1, ..., tk)[θ], results in a ground instance of f(t1, ..., tk). A ground instance of a parameterized random variable is a random variable. A parameterized random variable f(t1, ..., tk) represents a set of random variables, one random variable for each ground substitution to all logical variables in param(f(t1, ..., tk)). We denote this set by ground(f(t1, ..., tk)). The domain of each random variable represented by f(t1, ..., tk) is the range of f.

Given a parameterized random variable f(t1, ..., tk), a set of constraints C on param(f(t1, ..., tk)), and a ground instance f(t1, ..., tk)[θ], we say that the ground instance f(t1, ..., tk)[θ] satisfies the constraints in C if the substitution θ satisfies the constraints in C. We denote the set of ground instances of f(t1, ..., tk) that satisfy the constraints in C by ground(f(t1, ..., tk)) : C.

We extend our notation for an assignment of values to random variables to parameterized random variables. When we say that v assigns a value to a parameterized random variable f(t1, ..., tk), we mean that it assigns this value to each random variable in the set ground(f(t1, ..., tk)).

Example 2.3. Let Lot be a logical variable typed with the population of all lots in a town, {lot1, ..., lotn}. We have |D(Lot)| = n. Let wet_grass(Lot) be a parameterized random variable, where wet_grass is a functor with range {false, true}.
Thus, range(wet_grass) = {false, true} and param(wet_grass(Lot)) = {Lot}. The parameterized random variable wet_grass(Lot) represents a set of n random variables, each with domain {false, true}, one random variable for each of the substitutions {Lot/lot1}, ..., {Lot/lotn}. Let v be an assignment of values to random variables such that v(wet_grass(Lot)) = true. It means that each of the random variables in the set ground(wet_grass(Lot)) is assigned the value true by v.

A substitution θ is a unifier of two parameterized random variables f(ti1, ..., tik) and f(tj1, ..., tjk) if f(ti1, ..., tik)[θ] = f(tj1, ..., tjk)[θ]. We then say that the two parameterized random variables unify. A substitution θ is a most general unifier (MGU) of two parameterized random variables if

• θ is a unifier of the two parameterized random variables, and
• for any other unifier θ′ of the two parameterized random variables, f(...)[θ′] must be an instance of f(...)[θ] for all parameterized random variables f(...).

If two parameterized random variables have a unifier, they have at least one MGU. The sets of random variables represented by two parameterized random variables that unify have a non-empty intersection.

Example 2.4. Parameterized random variables wet_grass(Lot) and wet_grass(lot1) have a unifier {Lot/lot1}. Let Parcel be a logical variable such that D(Parcel) = D(Lot). Parameterized random variables wet_grass(Lot) and wet_grass(Parcel) have a unifier {Lot/Parcel}. Let adjacent be a binary functor; adjacent(Lot, lot1) and adjacent(lot3, Parcel) have a unifier {Lot/lot3, Parcel/lot1}. Parameterized random variables wet_grass(lot1) and wet_grass(lot2) do not unify. Parameterized random variables adjacent(Lot, Lot) and adjacent(lot1, lot3) also do not unify. Finally, parameterized random variables f(ti1, ..., tik) and h(tj1, ..., tjl) do not unify as they have different functors.
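The unification cases of Example 2.4 can be sketched in code. This is an illustrative sketch under simple assumptions (not the thesis's implementation): a parameterized random variable is a (functor, args) pair, logical variables are capitalized strings, constants are lowercase strings, and nested terms are not handled.

```python
# Illustrative unification sketch for flat parameterized random variables.
# Returns an MGU as a dict of substitutions, or None if the PRVs do not unify.

def is_var(t):
    return isinstance(t, str) and t[0].isupper()

def walk(t, subst):
    """Follow substitutions until reaching a constant or an unbound variable."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(prv1, prv2):
    (f1, args1), (f2, args2) = prv1, prv2
    if f1 != f2 or len(args1) != len(args2):
        return None  # different functors (or arities) never unify
    subst = {}
    for t1, t2 in zip(args1, args2):
        t1, t2 = walk(t1, subst), walk(t2, subst)
        if t1 == t2:
            continue
        elif is_var(t1):
            subst[t1] = t2
        elif is_var(t2):
            subst[t2] = t1
        else:
            return None  # two distinct constants clash
    return subst
```

Running the cases of Example 2.4: wet_grass(Lot) and wet_grass(lot1) yield {Lot/lot1}; adjacent(Lot, Lot) and adjacent(lot1, lot3) fail because Lot cannot be both lot1 and lot3.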
2.4.1.1 Counting formulas

So far, we can specify a value assignment for named individuals, or for all individuals from some population. Sometimes it is useful to be able to say that a certain number of individuals have some assignment, without specifying which individuals. We can achieve this with counting formulas [Milch et al., 2008]. Counting formulas were inspired by work on cardinality potentials [Gupta et al., 2007] and counting elimination [de Salvo Braz et al., 2007].

Definition 2.3. A counting formula is of the form #A:CA [f(..., A, ...)], where A is a logical variable that is bound by the # sign, CA is a set of inequality constraints involving A, and f(..., A, ...) is a parameterized random variable. The value of #A:CA [f(..., A, ...)], given an assignment of values to random variables v, is the histogram function hv : range(f) → N defined by

    v(#A:CA [f(..., A, ...)]) = hv(x) = |{a ∈ (D(A) : CA) : v(f(..., a, ...)) = x}|,

where x is in the range of f. Thus, the set of values of the above counting formula is the set of histograms having a bucket for each element x in the range of f, with entries adding up to |D(A) : CA|. The number of such histograms is

    ( |D(A) : CA| + |range(f)| − 1  choose  |range(f)| − 1 ),

which for small values of |range(f)| is O(|D(A) : CA|^{|range(f)|−1}). Therefore, a tabular representation of a function on a counting formula #A:CA [f(..., A, ...)] requires an amount of space that is linear in |D(A) : CA| for a functor f with a binary range and increases dramatically with the range size.

Example 2.5. Consider a counting formula #Lot [wet_grass(Lot)]. It has the range {(#false = n, #true = 0), (#false = n − 1, #true = 1), ..., (#false = 0, #true = n)}. Assume that the assignment of values to random variables v assigns the value false to random variables wet_grass(lot4), wet_grass(lot5) and wet_grass(lot9), and the value true to all other ground instances of wet_grass(Lot).
We have v(#Lot [wet_grass(Lot)]) = (#false = 3, #true = n − 3).

Counting formulas are a form of parameterized random variables. Consider a counting formula #A:CA [f(..., A, ...)]. If the set param(f(..., A, ...)) \ {A} is not empty, the counting formula #A:CA [f(..., A, ...)] represents a set of random variables, one random variable for each ground substitution to all logical variables in param(f(..., A, ...)) \ {A} such that the substitution satisfies the constraints in CA. The domain of each of these random variables is the range of the counting formula, that is, the set of histograms having a bucket for each element x in the range of f, with entries adding up to |D(A) : CA|.

Very often we need to analyze the set of random variables underlying a counting formula #A:CA [f(..., A, ...)], that is, the random variables ground(f(..., A, ...)). For simplicity, by ground(#A:CA [f(..., A, ...)]) we denote the set ground(f(..., A, ...)), rather than the random variables described in the previous paragraph. Unless otherwise stated, by parameterized random variables we understand both forms: standard parameterized random variables and counting formulas.

First-order probabilistic models describe probabilistic dependencies between parameterized random variables. A grounding of a first-order probabilistic model is a propositional probabilistic model obtained by replacing each parameterized random variable with the random variables it represents and replicating the appropriate probability distributions. In the next section we describe an example first-order probabilistic modeling language.

2.4.2 Independent Choice Logic

The Independent Choice Logic (ICL) [Poole, 1993, 1997, 2000] is a simple and powerful first-order probabilistic language. It subsumes both belief networks and logic programs [Lloyd, 1987].
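The histogram value and the count of possible histograms can be sketched as follows. This is an illustrative sketch (not from the thesis): individuals are strings, and an assignment is a dict from ground instances to values.

```python
from itertools import product
from math import comb

def histogram(assignment, individuals, rng):
    """Value of #_A [f(A)] under `assignment`: bucket counts over range rng."""
    return tuple(sum(1 for a in individuals if assignment[a] == x) for x in rng)

def num_histograms(n, k):
    """Number of possible histograms for |D(A):C| = n and |range(f)| = k."""
    return comb(n + k - 1, k - 1)

# Example 2.5 with n = 10 lots: lots 4, 5 and 9 are false, the rest true.
lots = [f"lot{i}" for i in range(1, 11)]
v = {a: a not in ("lot4", "lot5", "lot9") for a in lots}
h = histogram(v, lots, (False, True))  # (#false, #true)
```

For a binary range the number of histograms is n + 1, linear in the population size, matching the space bound discussed above.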
We have chosen ICL among many other formalisms for its simple syntax and semantics, and because it defines directed first-order probabilistic models, which are the focus of a large part of this thesis. We start by defining concepts related to logic programs and then proceed to define the ICL formally.

A term is either a constant, a logical variable, or of the form f(t1, t2, ..., tk), where f is a function symbol and t1, t2, ..., tk are terms. An atom is of the form p(t1, t2, ..., tk), where p is a predicate symbol and t1, t2, ..., tk are terms. A clause is either an atom or of the form h ← a1 ∧ a2 ∧ ··· ∧ ak, where h is an atom and each ai is an atom or the negation (¬) of an atom, for i = 1, 2, ..., k. We call h the head of the clause and a1 ∧ a2 ∧ ··· ∧ ak the body of the clause. A logic program is a set of clauses. Informally, acyclic logic programs [Apt and Bezem, 1991] are a restricted class of logic programs for which all recursions for variable-free queries eventually halt.

The ICL can be understood as an acyclic logic program with stochastic inputs. The stochastic part is defined by a choice space. An atomic choice is an atom that does not unify with the head of any clause. An alternative is a set of atomic choices that do not unify with each other. A choice space is a set of alternatives such that atomic choices from different alternatives do not unify.

Definition 2.4. An ICL theory consists of:

F: facts, an acyclic logic program;
C: a choice space;
P0: a probability distribution over the alternatives in C such that

    ∀A ∈ C   ∑_{α ∈ A} P0(α) = 1.

The meaning of an ICL theory is defined in terms of possible worlds. A total choice for a choice space C is a selection of exactly one atomic choice from each grounding of each alternative in C. Each total choice corresponds to a possible world. The logic program F, together with the atoms chosen by the total choice, defines what is true in a possible world.
[00] C = {{rain, ¬rain},
[01]      {sprinkler, ¬sprinkler},
[02]      {wet_r_s, ¬wet_r_s},
[03]      {wet_r_ns, ¬wet_r_ns},
[04]      {wet_nr_s, ¬wet_nr_s},
[05]      {wet_nr_ns, ¬wet_nr_ns}}
[06] F = {wet_grass ← rain ∧ sprinkler ∧ wet_r_s,
[07]      wet_grass ← rain ∧ ¬sprinkler ∧ wet_r_ns,
[08]      wet_grass ← ¬rain ∧ sprinkler ∧ wet_nr_s,
[09]      wet_grass ← ¬rain ∧ ¬sprinkler ∧ wet_nr_ns}
[10] P0(rain) = 0.2           P0(¬rain) = 0.8
[11] P0(sprinkler) = 0.4      P0(¬sprinkler) = 0.6
[12] P0(wet_r_s) = 0.99       P0(¬wet_r_s) = 0.01
[13] P0(wet_r_ns) = 0.9       P0(¬wet_r_ns) = 0.1
[14] P0(wet_nr_s) = 0.8       P0(¬wet_nr_s) = 0.2
[15] P0(wet_nr_ns) = 0.0      P0(¬wet_nr_ns) = 1.0

Figure 2.4: ICL theory from Example 2.6.

When the domains of logical variables are finite, the probability of a possible world is the product of the probabilities of the atomic choices in the corresponding total choice. (In the general case, which is beyond the scope of this thesis, we need a measure over sets of possible worlds.) The probability of a proposition is the sum of the probabilities of the possible worlds in which the proposition is true. More details on ICL semantics are provided in [Poole, 2000]; see also an overview in [Poole, 2008]. In the example below we show how the belief network from Examples 2.1 and 2.2 can be represented as an ICL theory.

Example 2.6. Consider the ICL theory in Figure 2.4. It represents the same probability distribution as the belief network described in Examples 2.1 and 2.2. The probability distribution has 6 parameters. The ICL theory has a choice space consisting of 6 alternatives with 2 atomic choices each (lines [00]–[05]). In each possible world it rains or not (line [00]), a sprinkler is on or not (line [01]), grass is wet or dry when it is raining and the sprinkler is on (line [02]), and so on. The probability distribution over the alternatives is specified in lines [10]–[15]. For example, P0(wet_r_s) = 0.99 = P(wet_grass = true | rain = true, sprinkler = true). Finally, the clauses in lines [06]–[09] specify the conditional probability for wet_grass given rain and sprinkler. Atoms rain, wet_grass, and sprinkler from the ICL theory correspond to nodes of the belief network.

The above theory can easily be generalized to multiple lots, as we show in Figure 2.5.

[00] C = {{rain, ¬rain},
[01]      {sprinkler(Lot), ¬sprinkler(Lot)},
[02]      {wet_r_s(Lot), ¬wet_r_s(Lot)},
[03]      {wet_r_ns(Lot), ¬wet_r_ns(Lot)},
[04]      {wet_nr_s(Lot), ¬wet_nr_s(Lot)},
[05]      {wet_nr_ns(Lot), ¬wet_nr_ns(Lot)}}
[06] F = {wet_grass(Lot) ← rain ∧ sprinkler(Lot) ∧ wet_r_s(Lot),
[07]      wet_grass(Lot) ← rain ∧ ¬sprinkler(Lot) ∧ wet_r_ns(Lot),
[08]      wet_grass(Lot) ← ¬rain ∧ sprinkler(Lot) ∧ wet_nr_s(Lot),
[09]      wet_grass(Lot) ← ¬rain ∧ ¬sprinkler(Lot) ∧ wet_nr_ns(Lot)}
[10] P0(rain) = 0.2                P0(¬rain) = 0.8
[11] P0(sprinkler(Lot)) = 0.4      P0(¬sprinkler(Lot)) = 0.6
[12] P0(wet_r_s(Lot)) = 0.99       P0(¬wet_r_s(Lot)) = 0.01
[13] P0(wet_r_ns(Lot)) = 0.9       P0(¬wet_r_ns(Lot)) = 0.1
[14] P0(wet_nr_s(Lot)) = 0.8       P0(¬wet_nr_s(Lot)) = 0.2
[15] P0(wet_nr_ns(Lot)) = 0.0      P0(¬wet_nr_ns(Lot)) = 1.0

Figure 2.5: ICL theory for multiple lots from Example 2.6.

[00] C = {{rain(false), rain(true)},
[01]      {sprinkler(Lot,false), sprinkler(Lot,true)},
[02]      {wet_rain_sprinkler(Lot,false,false,false), wet_rain_sprinkler(Lot,true,false,false)},
[03]      {wet_rain_sprinkler(Lot,false,false,true), wet_rain_sprinkler(Lot,true,false,true)},
[04]      {wet_rain_sprinkler(Lot,false,true,false), wet_rain_sprinkler(Lot,true,true,false)},
[05]      {wet_rain_sprinkler(Lot,false,true,true), wet_rain_sprinkler(Lot,true,true,true)}}
[06] F = {wet_grass(Lot,X) ← rain(Y) ∧ sprinkler(Lot,Z) ∧ wet_rain_sprinkler(Lot,X,Y,Z)}
[07] P0(rain(false)) = 0.8                                 P0(rain(true)) = 0.2
[08] P0(sprinkler(Lot,false)) = 0.6                        P0(sprinkler(Lot,true)) = 0.4
[09] P0(wet_rain_sprinkler(Lot,false,false,false)) = 1.0   P0(wet_rain_sprinkler(Lot,true,false,false)) = 0.0
[10] P0(wet_rain_sprinkler(Lot,false,false,true)) = 0.2    P0(wet_rain_sprinkler(Lot,true,false,true)) = 0.8
[11] P0(wet_rain_sprinkler(Lot,false,true,false)) = 0.1    P0(wet_rain_sprinkler(Lot,true,true,false)) = 0.9
[12] P0(wet_rain_sprinkler(Lot,false,true,true)) = 0.01    P0(wet_rain_sprinkler(Lot,true,true,true)) = 0.99

Figure 2.6: More elegant ICL theory for multiple lots from Example 2.6.

Figure 2.7: Graphical representation of the ICL theory for multiple lots from Example 2.6 (first-order model with plates on the left; its propositional grounding on the right).

Figure 2.7 illustrates the resulting directed first-order probabilistic model using plates. The notion of plates [Buntine, 1994] is similar to the idea of parameterized random variables; we use plate notation in our figures throughout this thesis. Consider the left part of Figure 2.7. A subgraph involving the two parameterized random variables sprinkler(Lot) and wet_grass(Lot) is enclosed within a box. The box is referred to as a plate. It implies that the subgraph is duplicated as many times as there are individuals in the population associated with the plate. The individuals are enumerated in the bottom-right corner of the plate. Arcs coming into the plate and leaving the plate are duplicated as well. Atoms from the ICL theory presented in Figure 2.5 are parameterized random variables. Atoms rain, wet_grass(Lot), and sprinkler(Lot) from the ICL theory correspond to nodes of the first-order probabilistic model shown in Figure 2.7.

Facts can be specified in a more elegant way if we parameterize atoms over the truth values of the corresponding parameterized random variables from the first-order model. Figure 2.6 shows a version of the ICL theory from Figure 2.5 that does this.
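The possible-world semantics can be checked by brute force on the single-lot theory of Figure 2.4. A minimal sketch (the function name `prob` and the dict representation are illustrative assumptions): enumerate all 2^6 total choices, take the product of the chosen atomic-choice probabilities as the world's probability, derive wet_grass from the facts F, and sum over the worlds where a proposition holds.

```python
from itertools import product

# Probabilities of the positive atomic choice in each alternative (Figure 2.4);
# the negative choice has the complementary probability.
P0 = {
    "rain": 0.2, "sprinkler": 0.4,
    "wet_r_s": 0.99, "wet_r_ns": 0.9, "wet_nr_s": 0.8, "wet_nr_ns": 0.0,
}

def prob(proposition):
    """Sum the probabilities of the possible worlds where `proposition` holds."""
    total = 0.0
    names = list(P0)
    for picks in product([True, False], repeat=len(names)):
        world = dict(zip(names, picks))
        p = 1.0
        for name, chosen in world.items():
            p *= P0[name] if chosen else 1.0 - P0[name]
        # facts F: the four clauses defining wet_grass
        world["wet_grass"] = (
            (world["rain"] and world["sprinkler"] and world["wet_r_s"])
            or (world["rain"] and not world["sprinkler"] and world["wet_r_ns"])
            or (not world["rain"] and world["sprinkler"] and world["wet_nr_s"])
            or (not world["rain"] and not world["sprinkler"] and world["wet_nr_ns"])
        )
        if proposition(world):
            total += p
    return total
```

With the numbers of Figure 2.4, this recovers the belief network's prior P(wet_grass) = 0.4432, matching inference in the network of Example 2.2.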
The variable X parameterizes the truth values of the parameterized random variable wet_grass(Lot), the variable Y parameterizes the truth values of the parameterized random variable rain(), and the variable Z parameterizes the truth values of the parameterized random variable sprinkler(Lot). For example, an atom sprinkler(Lot, true) is true when the parameterized random variable sprinkler(Lot) is true, and an atom sprinkler(Lot, false) is true when the parameterized random variable sprinkler(Lot) is false.

2.5 Lifted probabilistic inference

Although many first-order probabilistic languages have emerged, for many years the most common exact inference technique has been based on dynamic propositionalization of the portion of the first-order model that is relevant to the query, followed by probabilistic inference performed at the propositional level [Breese, 1992]. This approach is known as knowledge-based model construction (KBMC). Unfortunately, even a very simple first-order model can result in a very large propositional model, and inference at the propositional level can be formidably expensive. The big computational cost of the KBMC approach motivated work on exploiting redundant computation [Koller and Pfeffer, 1997; Pfeffer and Koller, 2000] during inference in first-order probabilistic models, as well as work on compiling first-order models to secondary structures (for example, arithmetic circuits [Chavira et al., 2006]) to facilitate efficient inference.

The idea of lifted probabilistic inference is to carry out as much inference as possible without propositionalizing a first-order probabilistic model or its parts. An exact lifted probabilistic inference procedure (in some literature called inversion elimination) for first-order probabilistic models was presented in Poole [2003]. Under certain conditions, the inversion elimination procedure can sum out multiple random variables from a model without considering each variable separately. de Salvo Braz et al.
[2005, 2006, 2007] worked on increasing the scope of lifted probabilistic inference and introduced a counting elimination procedure. Further work by Milch et al. [2008] added counting formulas (see Section 2.4.1.1) and further expanded the scope of lifted probabilistic inference. Their work resulted in the C-FOVE algorithm, which is currently the state of the art in exact lifted probabilistic inference.

While Poole considered directed models, the later work by de Salvo Braz et al. and Milch et al. focused on undirected models, although their results can be used for directed models. Directed models have the advantage of allowing pruning of the parts of a model that are irrelevant to the query. Also, conditional probability distributions in directed models can be interpreted and learned locally, which is important for models that are specified by people or need to be understood by people. In Section 2.5.1 we describe the data structures used during lifted inference and in Section 2.5.2 we provide an overview of the C-FOVE algorithm.

2.5.1 Parametric factors

To perform inference in first-order probabilistic models we need a data structure that fulfills a role analogous to the role of factors (Section 2.3.1) during inference in belief networks. This is the motivation behind parfactors [Poole, 2003]. They are used to represent conditional probability distributions in directed first-order models and potentials in undirected first-order models, as well as intermediate computation results during lifted inference in first-order models.

Definition 2.5. A parametric factor or parfactor is a triple ⟨C, V, F⟩ where:

• C is a set of inequality constraints on logical variables (between a logical variable and a constant or between two logical variables);
• V is a set of parameterized random variables such that for any two parameterized random variables f(...), f′(...) from V we have

    ground(f(...)) : C ∩ ground(f′(...)) : C = ∅;   (2.2)

• F is a factor from the Cartesian product of the ranges of the parameterized random variables in V to the reals.

A parfactor ⟨C, V, F⟩ represents a set of factors, one for each ground substitution G to all free logical variables in V that satisfies the constraints in C. Each such factor FG is a factor on the set of random variables obtained by applying the substitution G. Given an assignment v to the random variables represented by V, v(FG) = v(F). By |⟨C, V, F⟩| we denote the number of factors represented by a parfactor ⟨C, V, F⟩.

Condition (2.2) ensures that each ground substitution G to all free logical variables in V that satisfies the constraints in C results in a set of random variables of the same size. In this thesis we additionally assume that two logical variables that participate in the same inequality constraint from C are typed with the same population. Extending lifted inference to handle hierarchies of types is an interesting idea for future work, but it is orthogonal to the focus of this thesis.

Example 2.7. The ICL theory for multiple lots from Example 2.6 (shown in Figure 2.7) can be represented by the following parfactors:

    ⟨∅, {rain()},
        rain()  value
        false   0.8
        true    0.2  ⟩,   [01]

    ⟨∅, {sprinkler(Lot)},
        sprinkler(Lot)  value
        false           0.6
        true            0.4  ⟩,   [02]

    ⟨∅, {rain(), sprinkler(Lot), wet_grass(Lot)},
        rain()  sprinkler(Lot)  wet_grass(Lot) = false  wet_grass(Lot) = true
        false   false           1.0                     0.0
        false   true            0.2                     0.8
        true    false           0.1                     0.9
        true    true            0.01                    0.99  ⟩.   [03]

Parfactor [01] represents a set containing one factor. Parfactors [02] and [03] represent sets of factors of size n. Constraints within parfactors allow us to store knowledge about particular individuals. It might be prior knowledge as well as knowledge coming from observations.

Example 2.8. Assume that we know that the sprinkler on lot loti is very likely to be on.
Then the second parfactor from Example 2.7 ([02]) could be replaced by the following two parfactors:

    ⟨∅, {sprinkler(loti)},
        sprinkler(loti)  value
        false            0.1
        true             0.9  ⟩,

    ⟨{Lot ≠ loti}, {sprinkler(Lot)},
        sprinkler(Lot)  value
        false           0.6
        true            0.4  ⟩.

The next example illustrates the trade-offs between the sizes of parfactors on standard parameterized random variables, parfactors on counting formulas, and parfactors on non-parameterized random variables, as well as their expressive power.

Example 2.9. Let wet_grass(Lot) be the parameterized random variable from Example 2.3 and #Lot [wet_grass(Lot)] be the counting formula from Example 2.5. Recall that Lot is a logical variable typed with a population {lot1, lot2, ..., lotn} and wet_grass is a functor with range {false, true}. Consider the following three parfactors:

    ⟨∅, {wet_grass(Lot)}, F1⟩;   [1]
    ⟨∅, {#Lot [wet_grass(Lot)]}, F2⟩;   [2]
    ⟨∅, {wet_grass(lot1), wet_grass(lot2), ..., wet_grass(lotn)}, F3⟩.   [3]

Factor F1 represents a function from {false, true} to the reals and has size 2; thus its size is independent of |D(Lot)| = n. Parfactor [1] can represent a set of n identical real-valued discrete functions, one on each random variable from ground(wet_grass(Lot)).

The next factor, F2, represents a function from the set {(#false = n, #true = 0), ..., (#false = 0, #true = n)} to the reals and has size n + 1. Parfactor [2] can represent those real-valued discrete functions on the random variables from the set ground(wet_grass(Lot)) that depend only on the number of random variables from ground(wet_grass(Lot)) that take a particular value, not on which of them take it.

Finally, factor F3 represents a function from ×_{i=1}^{n} {false, true} to the reals and has size 2^n. Parfactor [3] can represent arbitrary real-valued discrete functions on the random variables ground(wet_grass(Lot)).

2.5.1.1 Normal-form constraints

Let X be a logical variable in V from a parfactor ⟨C, V, F⟩.
In general, the size of the set D(X) : C depends on the other logical variables in V.

Example 2.10. Consider a parfactor ⟨{X ≠ x1, X ≠ Y}, {g(X), h(X,Y)}, F⟩, where D(X) = D(Y) = {x1, ..., xn}. The size of the set D(X) : {X ≠ x1, X ≠ Y} depends on the logical variable Y. It is equal to n − 1 when Y = x1 and n − 2 when Y ≠ x1.

The above property is undesirable for parfactors involving counting formulas, because the set of possible values a counting formula #A:CA [f(..., A, ...)] can take depends on the size of the set D(A) : CA. Milch et al. [2008] introduced a special class of sets of inequality constraints. Let C be a set of inequality constraints on logical variables and X be a logical variable. We denote by E_X^C the excluded set for X, that is, the set of terms t such that (X ≠ t) ∈ C. A set C is in normal form if for each inequality (X ≠ Y) ∈ C, where X and Y are logical variables, E_X^C \ {Y} = E_Y^C \ {X}. A parfactor ⟨C, V, F⟩ is in normal form if the set C is in normal form. Consider a parfactor ⟨C, V, F⟩, where C is in normal form. For all logical variables X in V, |D(X) : C| = |D(X)| − |E_X^C|.

Example 2.11. Consider the parfactor from Example 2.10. Let C denote the set of constraints from this parfactor. The set C contains only one inequality between logical variables, namely X ≠ Y. We have E_X^C = {x1, Y} and E_Y^C = {X}. As E_X^C \ {Y} ≠ E_Y^C \ {X}, the parfactor is not in normal form.

Consider a parfactor ⟨{X ≠ Y, X ≠ x1, Y ≠ x1}, {g(X), h(X,Y)}, F⟩, where D(X) = D(Y) = {x1, ..., xn}. Let C denote the set of constraints from this parfactor. As E_X^C \ {Y} = E_Y^C \ {X}, the parfactor is in normal form, and |D(X) : C| = n − 2 and |D(Y) : C| = n − 2.

Following Milch et al. [2008], we require that for a parfactor ⟨C, V, F⟩ involving counting formulas, the union of C and the constraints in all the counting formulas in V is in normal form. Other parfactors do not need to be in normal form.
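The normal-form test can be sketched in a few lines. This is an illustrative sketch under simple assumptions (not the thesis's implementation): each inequality constraint is a two-element frozenset of terms, logical variables are capitalized strings, and constants are lowercase strings.

```python
# Sketch of the normal-form test for sets of inequality constraints.

def is_var(t):
    return t[0].isupper()

def excluded(X, constraints):
    """E_X^C: the set of terms t with (X != t) in C."""
    return {next(iter(c - {X})) for c in constraints if X in c}

def normal_form(constraints):
    for c in constraints:
        if len(c) == 2 and all(is_var(t) for t in c):
            X, Y = sorted(c)
            # the excluded sets must agree once X and Y themselves are removed
            if excluded(X, constraints) - {Y} != excluded(Y, constraints) - {X}:
                return False
    return True

# Example 2.10: {X != x1, X != Y} is not in normal form ...
C1 = [frozenset({"X", "x1"}), frozenset({"X", "Y"})]
# ... while Example 2.11's {X != Y, X != x1, Y != x1} is.
C2 = [frozenset({"X", "Y"}), frozenset({"X", "x1"}), frozenset({"Y", "x1"})]
```

On the constraint set of Example 2.10 the test fails because E_X^C \ {Y} = {x1} while E_Y^C \ {X} = ∅; adding the constraint Y ≠ x1, as in Example 2.11, restores normal form.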
The trade-off between requiring normal form and allowing unrestricted sets of inequality constraints is discussed in Section 5.4.

2.5.2 C-FOVE

In this section we provide an overview of the C-FOVE algorithm [Milch et al., 2008]. The algorithm builds on previous work by Poole [2003] and de Salvo Braz et al. [2007]. C-FOVE is not tied to any first-order probabilistic language. It can be used to perform inference in any model for which the joint probability distribution can be represented as a product of factors.

Let Φ be a set of parfactors. Let J(Φ) denote the factor equal to the product of all factors represented by the elements of Φ. Let U be the set of all random variables represented by the parameterized random variables present in the parfactors in Φ. Let Q be a subset of U. The marginal of J(Φ) on Q, denoted J_Q(Φ), is defined as J_Q(Φ) = ∑_{U\Q} J(Φ). Given Φ and Q, the C-FOVE algorithm computes the marginal J_Q(Φ) by summing out the random variables from U \ Q, where possible in a lifted manner. Evidence can be handled by adding to Φ additional parfactors on the observed random variables. The C-FOVE algorithm assumes that all parfactors are in normal form.

As lifted summing out is only possible under certain conditions, the C-FOVE algorithm uses elimination enabling operations, such as applying substitutions to parfactors and multiplication. We start our description of C-FOVE with lifted elimination, then describe the elimination enabling operations, and finally show how they can be combined to perform lifted probabilistic inference.

Milch et al. [2008] define parfactors as quadruples ⟨C, L, V, F⟩, where C, V, and F have the same meaning as in Section 2.5.1 and L is a set of logical variables. Thus, the only difference with our notation is that they explicitly list the logical variables present in a parfactor. This lets L be a superset of the set of logical variables that parameterize elements of V.
2.5.2.1 Lifted elimination

The aim of lifted elimination is to sum out a parameterized random variable from a parfactor with much less computation than would be required if we first converted the parfactor to a set of factors and summed out the ground instances of the parameterized random variable from these factors. We first describe lifted summation of a parameterized random variable that is not a counting formula and then present lifted elimination of a counting formula.

Let F be a factor and r be a real number. By F^r we denote a factor such that, given an assignment v of values to random variables, v(F^r) = v(F)^r. In F^r, the values of factor F are raised to the power r. Let ⟨C, V, F⟩ be a parfactor and r be a real number. By ⟨C, V, F⟩^r we denote the parfactor ⟨C, V, F^r⟩.

The following two propositions follow directly from Propositions 1, 2 and 4 from Milch et al. [2008]. The first proposition considers summing out a set of random variables represented by a parameterized random variable (not a counting formula) from a product of factors represented by a set of parfactors.

Proposition 2.1. Let Φ = {g1, g2, . . . , gm} be a set of parfactors. Let gi = ⟨Ci, Vi, Fi⟩ be a normal-form parfactor from Φ and f(. . . ) be a parameterized random variable (not a counting formula) from Vi. Suppose that:

(S1) For all parfactors gj = ⟨Cj, Vj, Fj⟩ ∈ Φ, j ≠ i, and all h(. . . ) ∈ Vj we have ground(f(. . . )) : Ci ∩ ground(h(. . . )) : Cj = ∅. That is, no other parfactor in Φ includes parameterized random variables that represent random variables represented by f(. . . ).

(S2) param(f(. . . )) ⊇ ∪_{f′(. . . ) ∈ Vi \ {f(. . . )}} param(f′(. . . )). That is, the set of logical variables in f(. . . ) is a superset of the union of the logical variables in the other parameterized random variables from Vi.

Let gi′ = ⟨Ci, Vi \ {f(. . . )}, Σ_{f(. . . )} Fi⟩ and r = |gi|/|gi′|. Then

    ∑_{ground(f(. . . )):Ci} J(Φ) = J(Φ \ {gi} ∪ {(gi′)^r}) .
(2.3)

Condition (S1) makes sure that we are completely eliminating the random variables ground(f(. . . )) : Ci from the set of parfactors Φ. Condition (S2) guarantees that the result of lifted summation is equivalent to performing a series of summations at the propositional level. If after summing out f(. . . ) some logical variables disappear from the parfactor gi, then gi′ represents fewer factors than gi. To compensate for this, the values of the factor in gi′ are raised to the power r = |gi|/|gi′|.

Example 2.12. Consider the following set of parfactors:

    Φ = { ⟨{A ≠ B}, {f(A,B), h(B)}, F1⟩,   [1]
          ⟨∅, {e(C), h(x1)}, F2⟩ },        [2]

where the factor components are

    F1:  f(A,B)  h(B)   value        F2:  e(C)    h(x1)  value
         false   false  α1                green   false  β1
         false   true   α2                green   true   β2
         true    false  α3                orange  false  β3
         true    true   α4                orange  true   β4
                                          red     false  β5
                                          red     true   β6

range(f) = range(h) = {false, true}, range(e) = {green, orange, red}, D(A) = D(B) = {x1, x2, . . . , xn}, D(C) = {y1, y2, . . . , ym}, and α1, . . . , α4, β1, . . . , β6 are real numbers. Assume we want to compute ∑_{ground(f(A,B)):{A≠B}} J(Φ). Conditions (S1) and (S2) of Proposition 2.1 are satisfied. Let us define a new parfactor [1′] by summing out f(A,B):

    [1′] = ⟨∅, {h(B)}, Σ_{f(A,B)} F1⟩ = ⟨∅, {h(B)},  h(B)   value     ⟩ .
                                                     false  α1 + α3
                                                     true   α2 + α4

The logical variable A is present in parfactor [1] and absent from parfactor [1′]. Parfactor [1] represents n(n − 1) factors. Parfactor [1′] represents n factors, that is, n − 1 times fewer factors than [1]. We have

    [1′]^(n−1) = ⟨∅, {h(B)},  h(B)   value            ⟩ .
                              false  (α1 + α3)^(n−1)
                              true   (α2 + α4)^(n−1)

From Proposition 2.1:

    ∑_{ground(f(A,B)):{A≠B}} J(Φ) = J({[1′]^(n−1), [2]}) .

An analogous proposition to Proposition 2.1 holds for counting formulas. It considers summing out a set of random variables represented by a counting formula from a product of factors represented by a set of parfactors.

Proposition 2.2.
Let Φ = {g1, g2, . . . , gm} be a set of parfactors. Let gi = ⟨Ci, Vi, Fi⟩ be a normal-form parfactor from Φ and #A:C_A[f(. . . , A, . . . )] be a counting formula from Vi. Suppose that:

(SC1) For all gj = ⟨Cj, Vj, Fj⟩ ∈ Φ, j ≠ i, and all h(. . . ) ∈ Vj we have ground(#A:C_A[f(. . . , A, . . . )]) : Ci ∩ ground(h(. . . )) : Cj = ∅;

(SC2) param(#A:C_A[f(. . . , A, . . . )]) ⊇ ∪_{f′(. . . ) ∈ Vi \ {#A:C_A[f(. . . , A, . . . )]}} param(f′(. . . )). That is, the set of logical variables of #A:C_A[f(. . . , A, . . . )] is a superset of the union of the logical variables of the other parameterized random variables from set Vi.

Let gi′ = ⟨Ci, Vi′, Fi′⟩, where Vi′ = Vi \ {#A:C_A[f(. . . , A, . . . )]} and Fi′ is a factor from the Cartesian product of the ranges of the parameterized random variables in Vi′ to the reals. Let v be a value assignment to all random variables but ground(f(. . . , A, . . . )) : Ci ∪ C_A. Factor Fi′ is defined as follows:

    v(Fi′) = ∑_{h() ∈ range(#A:C_A[f()])}  ( |D(A) : C_A|! / ∏_{x ∈ range(f(. . . ,A,. . . ))} h(x)! )  v(Fi)(h()) .   (2.4)

Let r = |gi|/|gi′|. Then

    ∑_{ground(#A:C_A[f(. . . ,A,. . . )]):Ci} J(Φ) = J(Φ \ {gi} ∪ {(gi′)^r}) .   (2.5)

The only difference between Proposition 2.2 and Proposition 2.1 is the expression

    |D(A) : C_A|! / ∏_{x ∈ range(f(. . . ,A,. . . ))} h(x)! ,

which is equal to the number of assignments v′ of values to the random variables in the set ground(#A:C_A[f(. . . , A, . . . )]) such that v′(#A:C_A[f(. . . , A, . . . )]) = h().

Example 2.13. Consider the following set of parfactors:

    Φ = { ⟨∅, {#A:{A≠B}[f(A)], h(B)}, F1⟩,   [1]
          ⟨∅, {e(C), h(x1)}, F2⟩ },          [2]

where the factor components are

    F1:  #A:{A≠B}[f(A)]               h(B)   value
         (#false = n−1, #true = 0)    false  α1
         (#false = n−1, #true = 0)    true   α2
         (#false = n−2, #true = 1)    false  α3
         (#false = n−2, #true = 1)    true   α4
         . . .                        . . .  . . .
         (#false = 0, #true = n−1)    false  α_{2n−1}
         (#false = 0, #true = n−1)    true   α_{2n}

    F2:  e(C)    h(x1)  value
         green   false  β1
         green   true   β2
         orange  false  β3
         orange  true   β4
         red     false  β5
         red     true   β6

range(f) = range(h) = {false, true}, range(e) = {green, orange, red}, D(A) = D(B) = {x1, x2, . . . , xn}, D(C) = {y1, y2, . . . , ym}, and α1, . . . , α_{2n}, β1, . . . , β6 are real numbers. Assume we want to compute ∑_{ground(#A:{A≠B}[f(A)])} J(Φ). Conditions (SC1) and (SC2) of Proposition 2.2 are satisfied. Let us define a new parfactor [1′] by summing out the counting formula:

    [1′] = ⟨∅, {h(B)},  h(B)   value                              ⟩ .
                        false  ∑_{i=1..n} C(n−1, n−i) α_{2i−1}
                        true   ∑_{i=1..n} C(n−1, n−i) α_{2i}

In the above calculation the expression |D(A) : C_A|! / ∏_{x ∈ range(f(. . . ,A,. . . ))} h(x)! from Proposition 2.2 for the histogram (#false = n − i, #true = i − 1) is equal to (n−1)! / ((n−i)!(i−1)!), that is, to the binomial coefficient C(n−1, n−i).

Parfactor [1] represents n factors and so does parfactor [1′]. Therefore we do not need to raise the elements of [1′] to any power. From Proposition 2.2:

    ∑_{ground(#A:{A≠B}[f(A)])} J(Φ) = J({[1′], [2]}) .

2.5.2.2 Parfactor multiplication

Parfactor multiplication allows us to combine parfactors which involve parameterized random variables that represent the same set of random variables. This allows us to satisfy condition (S1) of Proposition 2.1 and condition (SC1) of Proposition 2.2. Parfactor multiplication can be performed in a lifted manner. This means that, although the parfactors participating in a multiplication as well as their product represent multiple factors, only one factor multiplication is performed. The following proposition is a consequence of Propositions 5 and 4 from Milch et al. [2008] and earlier work by de Salvo Braz et al. [2007].

Proposition 2.3. Let Φ = {g1, g2, . . . , gm} be a set of parfactors. Let gi = ⟨Ci, Vi, Fi⟩ and gj = ⟨Cj, Vj, Fj⟩ be parfactors from Φ. Suppose that:

(M1) ∀ f(. . . ) ∈ Vi ∀ f′(. . . ) ∈ Vj (ground(f(. . . )) : Ci = ground(f′(. . . )) : Cj) ∨ (ground(f(. . . )) : Ci ∩ ground(f′(. . . )) : Cj = ∅). That is, the sets of random variables represented by parameterized random variables from each parfactor are identical or disjoint.

(M2) For all parameterized random variables f(. . . ) ∈ Vi and f′(. . . ) ∈ Vj such that ground(f(. . . )) : Ci = ground(f′(. . . )) : Cj, f(. . . ) and f′(. . . ) are identically parameterized by logical variables, and the set of other logical variables present in parfactor gi is disjoint from the set of logical variables present in parfactor gj.

Let g = ⟨Ci ∪ Cj, Vi ∪ Vj, Fi Fj⟩, ri = |gi|/|g| and rj = |gj|/|g|. Then

    J(Φ) = J(Φ \ {gi, gj} ∪ {⟨Ci ∪ Cj, Vi ∪ Vj, Fi^{ri} Fj^{rj}⟩}) .   (2.6)

Condition (M1) makes sure that the product of parfactors represents a set of factors of the same dimensionality. Condition (M2) guarantees the correctness of the set unions in the definition of parfactor g. Similarly to lifted elimination, Equation 2.6 accounts for logical variables present in the product. The product parfactor g may represent more factors than a parfactor participating in the product, for example gi. To compensate for this, the values of the factor in gi are raised to the power ri = |gi|/|g| before computing the final product.

Example 2.14. Consider the following set of parfactors:

    Φ = { ⟨{A ≠ B}, {f(A,B), h(B)}, F1⟩,   [1]
          ⟨∅, {e(C), h(B)}, F2⟩ },         [2]

where the factor components are

    F1:  f(A,B)  h(B)   value        F2:  e(C)    h(B)   value
         false   false  α1                green   false  β1
         false   true   α2                green   true   β2
         true    false  α3                orange  false  β3
         true    true   α4                orange  true   β4
                                          red     false  β5
                                          red     true   β6

range(f) = range(h) = {false, true}, range(e) = {green, orange, red}, D(A) = D(B) = {x1, x2, . . . , xn}, D(C) = {y1, y2, . . . , ym}, and α1, . . . , α4, β1, . . . , β6 are real numbers.
Assume we want to multiply parfactors [1] and [2]. Conditions (M1) and (M2) of Proposition 2.3 are satisfied. Parfactor [1] represents n(n − 1) factors, parfactor [2] represents nm factors, and their product will represent n(n − 1)m factors. We need to take this into account while computing the factor component of the product: the values of F1 are raised to the power n(n − 1)/(n(n − 1)m) = 1/m and the values of F2 to the power nm/(n(n − 1)m) = 1/(n − 1). Let us define a new parfactor [3]:

    ⟨{A ≠ B}, {f(A,B), e(C), h(B)}, F3⟩,   [3]

where F3 is the factor

    f(A,B)  e(C)    h(B)   value
    false   green   false  α1^(1/m) β1^(1/(n−1))
    false   green   true   α2^(1/m) β2^(1/(n−1))
    false   orange  false  α1^(1/m) β3^(1/(n−1))
    false   orange  true   α2^(1/m) β4^(1/(n−1))
    false   red     false  α1^(1/m) β5^(1/(n−1))
    false   red     true   α2^(1/m) β6^(1/(n−1))
    true    green   false  α3^(1/m) β1^(1/(n−1))
    true    green   true   α4^(1/m) β2^(1/(n−1))
    true    orange  false  α3^(1/m) β3^(1/(n−1))
    true    orange  true   α4^(1/m) β4^(1/(n−1))
    true    red     false  α3^(1/m) β5^(1/(n−1))
    true    red     true   α4^(1/m) β6^(1/(n−1))

From Proposition 2.3, J(Φ) = J({[3]}).

2.5.2.3 Splitting, expanding and propositionalizing

Condition (M1) from Proposition 2.3 for parfactor multiplication requires the sets of random variables represented by parameterized random variables from parfactors participating in a product to be identical or disjoint. It can be satisfied through splitting parfactors on substitutions and expanding counting formulas. These two operations modify parameterized random variables, which affects the sets of random variables the parameterized random variables represent. The first proposition characterizes the splitting operation. It is Proposition 6 from Milch et al. [2008].

Proposition 2.4. Let Φ = {g1, g2, . . . , gm} be a set of parfactors. Let gi = ⟨Ci, Vi, Fi⟩ be a parfactor from Φ. Let X be a logical variable present in gi.
Let {X/t} be a substitution such that t ∉ E_X^{Ci} and either term t is a constant t ∈ D(X), or term t is a logical variable present in gi such that D(t) = D(X). Let gi[X/t] be the parfactor gi with all occurrences of X replaced by term t, and let gi′ = ⟨Ci ∪ {X ≠ t}, Vi, Fi⟩. Then

    J(Φ) = J(Φ \ {gi} ∪ {gi[X/t], gi′}) .   (2.7)

We call the operation described above a split of parfactor gi on substitution {X/t} and we call gi′ a residual parfactor.

Example 2.15. Consider the following parfactor:

    ⟨{A ≠ B}, {f(A,B), h(B)}, F⟩,   [1]

where F is the factor

    f(A,B)  h(B)   value
    false   false  α1
    false   true   α2
    true    false  α3
    true    true   α4

range(f) = range(h) = {false, true}, D(A) = D(B) = {x1, x2, . . . , xn}, and α1, . . . , α4 are real numbers. Parfactor [1] represents n(n − 1) factors; its factor component has size 4. Let us split parfactor [1] on substitution {B/x1}. We obtain two new parfactors:

    ⟨{A ≠ x1}, {f(A, x1), h(x1)}, F⟩   [1′]

and

    ⟨{A ≠ B, B ≠ x1}, {f(A,B), h(B)}, F⟩ .   [1″]

Parfactor [1′] represents n − 1 factors and parfactor [1″] represents (n − 1) + (n − 1)(n − 2) = (n − 1)² factors; their factor components have size 4. From Proposition 2.4, J({[1]}) = J({[1′], [1″]}).

Given a parfactor g = ⟨C, V, F⟩ and a logical variable X present in one or more parameterized random variables from V, we may need to propositionalize g on X, that is, replace it by a set of parfactors of the form g[X/c], one for each constant c from D(X) : C. Propositionalization can be thought of as splitting g on substitution {X/c} for every c ∈ D(X) : C.

The next proposition characterizes the expanding operation. It is Proposition 7 from Milch et al. [2008].

Proposition 2.5. Let Φ = {g1, g2, . . . , gm} be a set of parfactors. Let gi = ⟨Ci, Vi, Fi⟩ be a normal-form parfactor from Φ and #A:C_A[f(. . . , A, . . . )] be a counting formula from Vi.
Let t be a term such that t ∉ E_A^{C_A} and t ∈ E_Y^{C} for each logical variable Y ∈ E_A^{C_A}. Let gi′ = ⟨Ci, Vi′, Fi′⟩, where Vi′ = Vi \ {#A:C_A[f(. . . , A, . . . )]} ∪ {f(. . . , t, . . . ), #A:C_A∪{A≠t}[f(. . . , A, . . . )]} and factor Fi′ is a factor from the Cartesian product of the ranges of the parameterized random variables in Vi′ to the reals. Let v be a value assignment to all random variables but ground(f(. . . , A, . . . )) : Ci ∪ C_A. Factor Fi′ is defined as follows:

    v(Fi′)(x, h()) = v(Fi)(h′()) ,   (2.8)

where x ∈ range(f), histogram h() ∈ range(#A:C_A∪{A≠t}[f(. . . , A, . . . )]), histogram h′() ∈ range(#A:C_A[f(. . . , A, . . . )]), and h′() is obtained by taking h() and adding 1 to the count for the value x. Then

    J(Φ) = J(Φ \ {gi} ∪ {gi′}) .   (2.9)

Example 2.16. Consider the following parfactor:

    ⟨∅, {#A:{A≠B}[f(A)], h(B)}, F⟩,   [1]

where F is the factor

    #A:{A≠B}[f(A)]               h(B)   value
    (#false = n−1, #true = 0)    false  α1
    (#false = n−1, #true = 0)    true   α2
    (#false = n−2, #true = 1)    false  α3
    (#false = n−2, #true = 1)    true   α4
    . . .                        . . .  . . .
    (#false = 0, #true = n−1)    false  α_{2n−1}
    (#false = 0, #true = n−1)    true   α_{2n}

D(A) = D(B) = {x1, x2, . . . , xn}, range(f) = range(h) = {false, true}, and α1, . . . , α_{2n} are real numbers. Parfactor [1] represents n factors; its factor component has size 2n. Let us expand the counting formula #A:{A≠B}[f(A)] on the constant x1. We obtain a new parfactor:

    ⟨∅, {#A:{A≠B,A≠x1}[f(A)], f(x1), h(B)}, F′⟩,   [1′]

where F′ is the factor

    #A:{A≠B,A≠x1}[f(A)]          f(x1)  h(B)   value
    (#false = n−2, #true = 0)    false  false  α1
    (#false = n−2, #true = 0)    false  true   α2
    (#false = n−2, #true = 0)    true   false  α3
    (#false = n−2, #true = 0)    true   true   α4
    (#false = n−3, #true = 1)    false  false  α3
    (#false = n−3, #true = 1)    false  true   α4
    . . .                        . . .  . . .  . . .
    (#false = 0, #true = n−2)    true   false  α_{2n−1}
    (#false = 0, #true = n−2)    true   true   α_{2n}

Parfactor [1′] represents n factors; its factor component has size 4(n − 1).
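The histogram bookkeeping of Equation 2.8 can be sketched as follows (a minimal illustration with made-up factor values; not the thesis implementation):

```python
# A minimal sketch of the expansion operation's factor lookup
# (Equation 2.8): the expanded factor reads the original factor at the
# histogram obtained by adding the explicit instance's value back in.
def expanded_value(F, x, hist):
    """v(F')(x, hist) = v(F)(hist with the count for x incremented)."""
    h = dict(hist)
    h[x] = h.get(x, 0) + 1
    return F[frozenset(h.items())]

# Illustrative factor over histograms of n - 1 = 3 boolean instances.
F = {frozenset({("false", 3), ("true", 0)}): 1.0,
     frozenset({("false", 2), ("true", 1)}): 2.0,
     frozenset({("false", 1), ("true", 2)}): 3.0,
     frozenset({("false", 0), ("true", 3)}): 4.0}

# After expanding on one constant, histograms cover only 2 instances;
# the entry for (x = "true", #false = 2, #true = 0) is read from the
# old entry for (#false = 2, #true = 1).
v = expanded_value(F, "true", {"false": 2, "true": 0})
print(v)  # 2.0
```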
From Proposition 2.5, J({[1]}) = J({[1′]}). We can fully expand a counting formula #A:C_A[f(. . . , A, . . . )] by expanding it on all constants in D(A) : C_A.

2.5.2.4 Counting

Condition (S2) of Proposition 2.1 and condition (SC2) of Proposition 2.2 require that the set of logical variables of a parameterized random variable being eliminated from a parfactor is a superset of the union of the logical variables of the other parameterized random variables from this parfactor. When this condition does not hold, we can sometimes satisfy it by performing counting. Counting eliminates a free logical variable from a parfactor. The next proposition characterizes the counting operation. It is Proposition 3 from Milch et al. [2008].

Proposition 2.6. Let Φ = {g1, g2, . . . , gm} be a set of parfactors. Let gi = ⟨Ci, Vi, Fi⟩ be a normal-form parfactor from Φ and A be a logical variable such that there is exactly one parameterized random variable f(. . . , A, . . . ) ∈ Vi for which A ∈ param(f(. . . , A, . . . )). Let C_A be the set of those constraints from Ci that involve A, let Vi′ = Vi \ {f(. . . , A, . . . )} ∪ {#A:C_A[f(. . . , A, . . . )]}, and let Fi′ be a factor from the Cartesian product of the ranges of the parameterized random variables in Vi′ to the reals. Let v be a value assignment to all random variables but ground(f(. . . , A, . . . )) : Ci. Factor Fi′ is defined as follows:

    v(Fi′)(h()) = ∏_{x ∈ range(f(. . . ,A,. . . ))} v(Fi)(x)^{h(x)} ,   (2.10)

where histogram h() ∈ range(#A:C_A[f(. . . , A, . . . )]). Then

    J(Φ) = J(Φ \ {gi} ∪ {⟨Ci \ C_A, Vi′, Fi′⟩}) .   (2.11)

As suggested in Milch et al. [2008], a good way to think about Equations 2.10 and 2.11 is to consider J({gi}). We can represent gi as a product of parfactors obtained from parfactor gi by propositionalizing the logical variable A. In this product, given an assignment of values to random variables, it does not matter which ground instance of f(. . . , A, . . . ) is assigned a particular value from range(f), but only how many of them are assigned the value. This property allows us to use a counting formula #A:C_A[f(. . . , A, . . . )] to collapse the product into a single parfactor. While the new parfactor has a bigger factor component than the original parfactor, the logical variable A no longer occurs free in the new parfactor.

Example 2.17. Consider the following parfactor:

    ⟨{A ≠ B}, {f(A), h(B)}, F⟩,   [1]

where F is the factor

    f(A)   h(B)   value
    false  false  α1
    false  true   α2
    true   false  α3
    true   true   α4

range(f) = range(h) = {false, true}, D(A) = D(B) = {x1, x2, . . . , xn}, and α1, . . . , α4 are real numbers. Parfactor [1] represents n(n − 1) factors; its factor component has size 4. Note that we cannot eliminate ground(f(A)) : {A ≠ B} from J({[1]}) in a lifted manner, since condition (S2) of Proposition 2.1 is not satisfied. The same applies to lifted elimination of ground(h(B)) : {A ≠ B}. We can, however, apply Proposition 2.6 and eliminate either of the two logical variables present in the parfactor through counting. Let us eliminate the logical variable A. We create a new parfactor where f(A) is replaced with a counting formula:

    ⟨∅, {#A:{A≠B}[f(A)], h(B)}, F′⟩,   [1′]

where F′ is the factor

    #A:{A≠B}[f(A)]               h(B)   value
    (#false = n−1, #true = 0)    false  (α1)^(n−1) (α3)^0
    (#false = n−1, #true = 0)    true   (α2)^(n−1) (α4)^0
    (#false = n−2, #true = 1)    false  (α1)^(n−2) (α3)^1
    (#false = n−2, #true = 1)    true   (α2)^(n−2) (α4)^1
    . . .                        . . .  . . .
    (#false = 0, #true = n−1)    false  (α1)^0 (α3)^(n−1)
    (#false = 0, #true = n−1)    true   (α2)^0 (α4)^(n−1)

Parfactor [1′] represents n factors; its factor component has size 2n. From Proposition 2.6, J({[1]}) = J({[1′]}).

The counting elimination algorithm of de Salvo Braz et al. [2007] is equivalent to introducing counting formulas as described above and eliminating them immediately.
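The factor construction behind the counting operation can be sketched as follows (a minimal illustration with made-up values for α1, . . . , α4; not the thesis implementation):

```python
# A minimal sketch of Equation 2.10 on Example 2.17: the value for a
# histogram is the product of the original factor values raised to the
# corresponding counts. The alpha values below are illustrative.
alpha = {(False, False): 2.0, (False, True): 3.0,   # F(f(A), h(B))
         (True, False): 4.0, (True, True): 5.0}
n = 5                      # |D(A) : {A != B}| = n - 1 ground instances

def counted_value(num_true, h):
    """Value for histogram (#false = n-1-num_true, #true = num_true)."""
    return (alpha[(False, h)] ** (n - 1 - num_true)
            * alpha[(True, h)] ** num_true)

# Propositional check for one assignment with h(b) = False and two of
# the four ground instances of f true: the product of the four ground
# factor values depends only on the counts, not on which instances.
h = False
prop = (alpha[(True, h)] * alpha[(True, h)]
        * alpha[(False, h)] * alpha[(False, h)])
assert abs(prop - counted_value(2, h)) < 1e-9
print(counted_value(2, h))  # 2^2 * 4^2 = 64.0
```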
The C-FOVE algorithm, by introducing counting formulas explicitly and allowing them to be part of parfactors, increases the flexibility of lifted inference (see Section 2.5.2.7).

2.5.2.5 Unification

Several conditions present in the propositions that theoretically characterize operations on parfactors refer to sets of random variables represented by parameterized random variables. These conditions are concerned with whether sets of random variables are identical or disjoint. Checking these conditions through analysis of the sets themselves would defeat the goal of lifted inference, that is, carrying out inference without propositionalizing a first-order probabilistic model or its part. We need syntactic criteria that can be checked quickly. Poole [2003] showed how these conditions can be ensured efficiently using unification [Sterling and Shapiro, 1994]. Milch et al. [2008] extended his work to counting formulas. In the presence of constraints in parfactors, unification needs to be accompanied by constraint analysis and processing. In this section we give an overview of the unification and constraint processing necessary to verify and ensure the conditions discussed above. We start with two parameterized random variables that are not counting formulas. We illustrate our description with multiple examples.

Consider two parfactors ⟨C1, V1, F1⟩ and ⟨C2, V2, F2⟩.
Assume that we want to find out the relation between the sets of random variables represented by parameterized random variables f(t_1^1, . . . , t_k^1) ∈ V1 and f(t_1^2, . . . , t_k^2) ∈ V2 in the respective parfactors, that is, the relation between the sets of random variables ground(f(t_1^1, . . . , t_k^1)) : C1 and ground(f(t_1^2, . . . , t_k^2)) : C2. If the sets are not disjoint and are not identical, we want to find out the splits that are required to make these sets identical. Note that standard parameterized random variables with different functors represent disjoint sets of random variables.

    procedure replace(⟨C, V, F⟩)
    input: parfactor ⟨C, V, F⟩;
    output: parfactor ⟨C, V, F⟩[θ], where substitution θ replaces every logical
            variable constrained to a single constant with this constant;
            error if ⟨C, V, F⟩ represents 0 factors;
      add logical variables from C to the queue;
      while queue not empty do
        remove logical variable X from the queue;
        case |D(X) : {X ≠ t ∈ C : t is a constant}| = 0
          return error;
        case |D(X) : {X ≠ t ∈ C : t is a constant}| = 1
          set {x} := D(X) : {X ≠ t ∈ C : t is a constant};
          remove unary constraints involving X from C;
          add logical variables from the set {Y : X ≠ Y ∈ C} to the queue;
          replace every occurrence of X in C and in V with x;
        otherwise
          // do nothing
      return ⟨C, V, F⟩;

    Figure 2.8: AC-3 algorithm for replacing logical variables with constants.

Sometimes the constraints in a parfactor are tight enough to uniquely identify an individual from the population of a logical variable. In this case it is useful to be able to replace the logical variable with a constant denoting the only individual that is possible (we will justify the need for this step later on). Note that the binary constraints we are dealing with are inequality constraints like X ≠ Y.
The X ≠ Y constraint is a weak constraint: it only constrains the set of possible values for Y when the set of possible values of X contains a single element. Figure 2.8 shows the AC-3 algorithm [Mackworth, 1977] adapted to perform the task described above. The algorithm does not enumerate the populations of the involved logical variables, but it needs to know their sizes.

Example 2.18. Consider a parfactor ⟨C, V, F⟩, where the constraint set C = {A ≠ x1, A ≠ x2, A ≠ x3, A ≠ B, A ≠ C, B ≠ x1, B ≠ x3, B ≠ C} and D(A) = D(B) = D(C) = {x1, x2, x3, x4}. The algorithm presented in Figure 2.8 detects that the population of the logical variable A is constrained to x4 and replaces A with x4 inside the parfactor. We obtain the following set of constraints: {x4 ≠ B, x4 ≠ C, B ≠ x1, B ≠ x3, B ≠ C}. Now B is constrained to x2 and can be replaced by this constant, and the constraint set reduces to {x4 ≠ C, x2 ≠ C}. In total, the algorithm applied two substitutions, {A/x4} and {B/x2}, to the parfactor ⟨C, V, F⟩. If we change the population of A, B and C to {x1, x2, x3, x4, x5}, then none of the logical variables is constrained to a single individual and the algorithm from Figure 2.8 will not change the parfactor.

The scope of logical variables is restricted to the enclosing parfactor. Two logical variables with the same name present in two different parfactors might be typed with different populations and be completely unrelated. For unification to work properly, we need to remove such name clashes between the considered parfactors. We rename logical variables in both parfactors so that the set of logical variables present in the first parfactor is disjoint from the set of logical variables present in the second parfactor.

Example 2.19.
Consider the set of parfactors Φ0:

    Φ0 = { ⟨{X ≠ x2}, {f(X,Y), q(Y)}, F1⟩,            [1]
           ⟨{X ≠ Z, Z ≠ x1}, {p(X), f(x1, Z)}, F2⟩ },  [2]

where the logical variable X in parfactor [1] is unrelated to the logical variable X from parfactor [2], and we need to separate them. Assume that in parfactor [1] D(X) = {x1, x2, . . . , xn} and D(Y) = {y1, y2, . . . , ym}, and in parfactor [2] D(X) = D(Z) = {y1, y2, . . . , ym}. If n and m are greater than 2, the algorithm from Figure 2.8 will not modify the two parfactors and we can proceed with renaming. We obtain a new set of parfactors Φ1:

    Φ1 = { ⟨{X1 ≠ x2}, {f(X1, X2), q(X2)}, F1⟩,           [3]
           ⟨{X3 ≠ X4, X4 ≠ x1}, {p(X3), f(x1, X4)}, F2⟩ },  [4]

where D(X1) = {x1, x2, . . . , xn} and D(X2) = D(X3) = D(X4) = {y1, y2, . . . , ym}. We have J(Φ0) = J(Φ1).

Assume that we have already performed the above two preprocessing operations on parfactors ⟨C1, V1, F1⟩ and ⟨C2, V2, F2⟩. We can now compute an MGU of parameterized random variables f(t_1^1, . . . , t_k^1) ∈ V1 and f(t_1^2, . . . , t_k^2) ∈ V2. An algorithm for computing the MGU is presented in Figure 2.9. The algorithm is adapted from Sterling and Shapiro [1994]. The time and space complexity of the algorithm are O(k).

    procedure MGU(f(t_1^1, . . . , t_k^1), f(t_1^2, . . . , t_k^2))
    input: parameterized random variables f(t_1^1, . . . , t_k^1), f(t_1^2, . . . , t_k^2);
    output: MGU θ of f(t_1^1, . . . , t_k^1) and f(t_1^2, . . . , t_k^2);
            error if f(t_1^1, . . . , t_k^1) and f(t_1^2, . . . , t_k^2) do not unify;
      set θ := {};
      push equations t_1^1 = t_1^2, . . . , t_k^1 = t_k^2 onto the stack;
      while stack not empty do
        pop equation ti = tj from the stack;
        case ti and tj are identical terms
          // do nothing
        case ti is a parameter
          replace every occurrence of ti in the stack and in θ with tj;
          add {ti/tj} to θ;
        case tj is a parameter
          replace every occurrence of tj in the stack and in θ with ti;
          add {tj/ti} to θ;
        otherwise
          return error;
      return θ;

    Figure 2.9: MGU algorithm for parameterized random variables.

If the parameterized random variables do not unify, they represent disjoint sets of random variables. If the MGU is empty, neither parameterized random variable is parameterized by logical variables and each represents the same set containing a single random variable. The above is true because we have renamed logical variables, so logical variables present in one parameterized random variable are not present in the other one. Finally, if the parameterized random variables unify and the MGU is not empty, we need to check if the MGU is consistent with the constraints in both parfactors. A check whether an MGU is consistent with a set of constraints can be performed using the algorithm shown in Figure 2.10.

    procedure checkMGU(θ, C)
    input: MGU θ, set of inequality constraints C;
    output: true if θ is consistent with C, false otherwise;
      for each constraint X ≠ t from C do
        if {X/t} ∈ θ return false;
        if t is a parameter
          if {{X/t1}, {t/t1}} ⊆ θ or {t/X} ∈ θ return false;
      return true;

    Figure 2.10: Algorithm for checking if an MGU is consistent with a set of inequality constraints.

Example 2.20. Let us continue Example 2.19. Parameterized random variables f(X1, X2) and f(x1, X4) unify with MGU θ = {X1/x1, X2/X4}. θ is consistent with the constraints {X1 ≠ x2} ∪ {X3 ≠ X4, X4 ≠ x1}.

If a non-empty MGU is not consistent with the constraints, parameterized random variables f(t_1^1, . . . , t_k^1) and f(t_1^2, . . . , t_k^2) represent disjoint sets of random variables.
If a non-empty MGU is consistent with the constraints in both parfactors, the sets of random variables represented by the analyzed parameterized random variables are not disjoint and possibly are not identical. To make them identical, we split both parfactors on the MGU, as described by Poole [2003]. The splitting algorithm is shown in Figure 2.11. Given a parfactor g and an MGU θ, the algorithm returns a parfactor obtained by applying the substitutions in θ to g, as well as the residual parfactors resulting from these operations.

    procedure splitOnMGU(θ, ⟨C, V, F⟩)
    input: MGU θ, parfactor ⟨C, V, F⟩;
    output: parfactor g obtained by splitting ⟨C, V, F⟩ on θ,
            set of residual parfactors Ψ;
      set g := ⟨C, V, F⟩;
      set Ψ := {};
      for each substitution {X/t} from θ do
        if X is a logical variable in g
          if t is a constant or t is a logical variable in g
            split g on {X/t};
            set g to the result;
            add the residual parfactor to Ψ;
          else
            replace all occurrences of X in g with t;
      return g and Ψ;

    Figure 2.11: Algorithm for splitting a parfactor on an MGU.

After applying the MGU to ⟨C1, V1, F1⟩ and ⟨C2, V2, F2⟩, we obtain sets of residual parfactors and parfactors ⟨C1′, V1′, F1⟩ and ⟨C2′, V2′, F2⟩. Thanks to the splitting on the MGU, parameterized random variables f(t_1^1, . . . , t_k^1) from V1 and f(t_1^2, . . . , t_k^2) from V2 are replaced in V1′ and V2′ by the same parameterized random variable f(t1, . . . , tk). However, we want the sets ground(f(t1, . . . , tk)) : C1′ and ground(f(t1, . . . , tk)) : C2′ to be identical, and so we still need to process the constraints in C1′ and C2′. Note that, by splitting the parfactors on the MGU, not only did the sets of random variables represented by the targeted parameterized random variables become closer to identical, but the symbolic representation of the parameterized random variables is also identical in both parfactors.
The latter brings us closer to satisfying condition (M2) of Proposition 2.6 and allowing for multiplication of these two parfactors. 43 [00] [01] [02] [03] [04] procedure splitOnConstraints(CS , C , V, F ) input: set of constraints CS , parfactor C , V, F ; output: parfactor g obtained by splitting θ to C , V, F , set of by-product parfactors Ψ; [05] [06] [07] [08] [09] [10] [11] [12] [13] [14] [15] set g := C , V, F set Ψ := {}; for each constraint X = t from CS do if (X = t) ∈ / C and X is a logical variable in g and (t is a constant or t is a logical variable in g) split g on {X/t}; add the result to Ψ; set g to the residual parfactor; end return g and Ψ; end Figure 2.12: Algorithm for splitting a parfactor on a set of constraints. Example 2.21. Let us continue Examples 2.19 and 2.20. We split parfactors from set Φ1 : Φ1 = { {X1 = x2 }, { f (X1 , X2 ), q(X2 )}, F1 , {X3 = X4 , X4 = x1 }, {p(X3 ), f (x1 , X4 )}, F2 }, [3] [4] on MGU θ = {X1 /x1 , X2 /X4 }. We obtain the following set of parfactors: Φ2 = { {X3 = X4 , X4 = x1 }, {p(X3 ), f (x1 , X4 )}, F2 , 0, / { f (x1 , X4 ), q(X4 )}, F1 , {X1 = x1 , X1 = x2 }, { f (X1 , X2 ), q(X2 )}, F1 } [4] [5] [6] Parfactor [4] is not affected by splitting on θ , because it does not contain logical variables X1 and X2 . From parfactor [3] we obtain parfactor [5] and a residual parfactor [6]. We have J (Φ1 ) = J (Φ2 ). Note that ground( f (x1 , X4 )) : 0/ = ground( f (x1 , X4 )) : {X1 = x1 , X1 = x2 } and the set of random variables represented by parameterized random variable f (x1 , X4 ) in parfactor [4] is different from the set of random variables it represents in parfactor [5]. 44 The final step processes constraints. Let S be a set of logical variables that are shared between parfactor C1 , V1 , F1 and parfactor C2 , V2 , F2 . Sets of random variables ground( f (t1 , . . . ,tk )) : C1 and ground( f (t1 , . . . 
. , tk)) : C2′ might be different, because the inequality constraints that affect the set of random variables represented by f(t1, . . . , tk) might be different in parfactors ⟨C1′, V1′, F1⟩ and ⟨C2′, V2′, F2⟩. These are constraints between a logical variable and a constant such that the logical variable is in S, and constraints between two logical variables such that both logical variables are in S. We can ignore constraints that do not involve logical variables from S. Earlier, we replaced logical variables constrained to single individuals with the appropriate constants (see Figure 2.8), so now we can also ignore binary constraints that involve a logical variable from S and a logical variable not from S.

The algorithm presented in Figure 2.12 takes as input a set of constraints and a parfactor and modifies the parfactor so that it contains all relevant constraints from the given set of constraints. The algorithm also returns a set of by-product parfactors. Given ⟨C1′, V1′, F1⟩ and ⟨C2′, V2′, F2⟩, we apply the algorithm to the set of constraints C2′ and the parfactor ⟨C1′, V1′, F1⟩ and obtain a parfactor ⟨C1″, V1″, F1⟩. Then we apply the algorithm to the set of constraints C1′ and the parfactor ⟨C2′, V2′, F2⟩ and obtain a parfactor ⟨C2″, V2″, F2⟩. We have ground(f(t1, . . . , tk)) : C1″ = ground(f(t1, . . . , tk)) : C2″, which was the goal of the unification process.

Example 2.22. Let us continue Example 2.21. We have the set of parfactors Φ2:

    Φ2 = { ⟨{X3 ≠ X4, X4 ≠ x1}, {p(X3), f(x1, X4)}, F2⟩,         [4]
           ⟨∅, {f(x1, X4), q(X4)}, F1⟩,                          [5]
           ⟨{X1 ≠ x1, X1 ≠ x2}, {f(X1, X2), q(X2)}, F1⟩ }.       [6]

We will use the algorithm from Figure 2.12 to modify parfactors [4] and [5] so that the parameterized random variable f(x1, X4) represents the same set of random variables in the modified parfactors. Since parfactor [5] does not contain any constraints, we only need to split parfactor [5] on the constraints from parfactor [4].
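The two splitting procedures in Figures 2.11 and 2.12 can be illustrated with a runnable sketch. The representations here are my own assumptions, not the thesis's: logical variables are capitalized strings, constants are lowercase strings, a parameterized random variable is a (functor, args) pair, and a parfactor is a (constraints, prvs, factor) triple whose constraints are pairs (a, b) standing for the inequality a ≠ b. The demo reproduces the splits of Examples 2.21 and 2.22.

```python
# Sketch of splitOnMGU (Figure 2.11) and splitOnConstraints (Figure 2.12).
# Toy representation (an assumption): parfactor = (constraints, prvs, factor).

def is_var(term):
    return term[0].isupper()  # convention: capitalized = logical variable

def split(g, x, t):
    """Split parfactor g on the substitution {x/t}: return (result, residual)."""
    constraints, prvs, factor = g
    sub = lambda term: t if term == x else term
    result_constraints = frozenset(
        (sub(a), sub(b)) for (a, b) in constraints
        # a constraint between two distinct constants is trivially satisfied
        if is_var(sub(a)) or is_var(sub(b)))
    result_prvs = tuple((f, tuple(sub(a) for a in args)) for (f, args) in prvs)
    result = (result_constraints, result_prvs, factor)
    residual = (constraints | {(x, t)}, prvs, factor)
    return result, residual

def logical_vars(g):
    constraints, prvs, _ = g
    terms = [a for (_, args) in prvs for a in args]
    terms += [t for pair in constraints for t in pair]
    return {t for t in terms if is_var(t)}

def split_on_mgu(theta, g):
    """Figure 2.11: keep the result, set the residual parfactors aside."""
    residuals = []
    for x, t in theta.items():
        if x not in logical_vars(g):
            continue
        if not is_var(t) or t in logical_vars(g):
            g, residual = split(g, x, t)
            residuals.append(residual)
        else:  # t is a logical variable not in g: a pure renaming
            g, _ = split(g, x, t)
    return g, residuals

def split_on_constraints(cs, g):
    """Figure 2.12: keep the residual, set the split results aside.
    Each constraint is read as (X, t) with X the candidate variable; a fuller
    implementation would also try the opposite orientation."""
    byproducts = []
    for x, t in cs:
        if ((x, t) not in g[0] and x in logical_vars(g)
                and (not is_var(t) or t in logical_vars(g))):
            result, g = split(g, x, t)
            byproducts.append(result)
    return g, byproducts

# Example 2.21: split parfactor [3] on the MGU {X1/x1, X2/X4}.
g3 = (frozenset({("X1", "x2")}),
      (("f", ("X1", "X2")), ("q", ("X2",))), "F1")
g5, residuals = split_on_mgu({"X1": "x1", "X2": "X4"}, g3)
print(g5)  # parfactor [5]: no constraints, PRVs f(x1, X4) and q(X4)

# Example 2.22: split parfactor [5] on the constraints of parfactor [4].
g8, byproducts = split_on_constraints([("X3", "X4"), ("X4", "x1")], g5)
print(g8)  # parfactor [8]: the constraint X4 != x1 has been added
```

Note how the two procedures differ only in which half of each split they keep: Figure 2.11 continues with the result and stores the residuals, while Figure 2.12 continues with the residual and stores the results.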
The algorithm will ignore the constraint X3 ≠ X4, as the logical variable X3 is not present in [5], and split on the substitution {X4/x1} induced by the constraint X4 ≠ x1. We obtain a new set of parfactors:

    Φ3 = { ⟨{X3 ≠ X4, X4 ≠ x1}, {p(X3), f(x1, X4)}, F2⟩,         [4]
           ⟨{X1 ≠ x1, X1 ≠ x2}, {f(X1, X2), q(X2)}, F1⟩,         [6]
           ⟨∅, {f(x1, x1), q(x1)}, F1⟩,                          [7]
           ⟨{X4 ≠ x1}, {f(x1, X4), q(X4)}, F1⟩ }.                [8]

The parameterized random variable f(x1, X4) represents the same set of random variables in parfactors [4] and [8]. We also have J(Φ2) = J(Φ3).

Above, we talked only about standard parameterized random variables. Assume we want to determine the relation between the sets of random variables represented by a counting formula #A:CA[f(. . . , A, . . . )] from a parfactor ⟨C, V, F⟩ and another parameterized random variable (possibly, but not necessarily, also a counting formula). The process is quite similar to the one described for standard parameterized random variables, with two differences pointed out by Milch et al. [2008]. The first difference is that we should apply unification to f(. . . , A, . . . ) and combine the constraints CA with C, instead of considering #A:CA[f(. . . , A, . . . )] itself. This is because the set of random variables underlying f(. . . , A, . . . ) with respect to the constraints CA ∪ C is the same as for #A:CA[f(. . . , A, . . . )] with constraints C, even though they have different functors (see Section 2.4.1.1). The second difference is that where the procedure called for splitting parameterized random variables, we expand counting formulas.

In this section we described how to compare and change the sets of random variables represented by parameterized random variables. The described procedure operates at a syntactic level: it does not enumerate the populations of the involved logical variables, it only needs to know their sizes.

2.5.2.6 The C-FOVE algorithm

Milch et al.
[2008] describe the C-FOVE algorithm in terms of the following macro-operations:

• SHATTER(Φ) - given a set of parfactors Φ, the shattering macro-operation performs all the splits and expansions (defined in Section 2.5.2.3) necessary to ensure that for any two parameterized random variables present in parfactors from Φ, the sets of random variables represented by these two parameterized random variables are either identical or disjoint. The necessary splits and expansions are determined using unification, as described in Section 2.5.2.5;

• GLOBAL-SUM-OUT(Φ, f(. . . ), C) - given a set of parfactors Φ, a parameterized random variable f(. . . ) and a set of constraints C, the macro-operation multiplies all parfactors from Φ containing parameterized random variables that represent ground(f(. . . )) : C (see Section 2.5.2.2) and eliminates the random variables ground(f(. . . )) : C from the product (see Section 2.5.2.1). The macro-operation is only applicable if the product satisfies condition (S2) of Proposition 2.1 (or condition (SC2) of Proposition 2.2 if we are eliminating a counting formula); the multiplication operations can always be performed because of the initial shattering. This is the only macro-operation that eliminates random variables from J(Φ);

• COUNTING-CONVERT(Φ, ⟨C, V, F⟩, X) - given a set of parfactors Φ, a parfactor ⟨C, V, F⟩ from Φ and a free logical variable X, such that X occurs in exactly one parameterized random variable from V, the counting macro-operation eliminates X from the parfactor (see Section 2.5.2.4);

• PROPOSITIONALIZE(Φ, ⟨C, V, F⟩, X) - given a set of parfactors Φ, a parfactor ⟨C, V, F⟩ from Φ and a free logical variable X, the macro-operation propositionalizes the logical variable X (see Section 2.5.2.3); afterwards it performs shattering to ensure that the sets of random variables represented by parameterized random variables in Φ are still either identical or disjoint after the propositionalization;

• FULL-E
XPAND(Φ, ⟨C, V, F⟩, #A:CA[f(. . . , A, . . . )]) - given a set of parfactors Φ, a parfactor ⟨C, V, F⟩ from Φ and a counting formula #A:CA[f(. . . , A, . . . )], the macro-operation expands the given counting formula on every constant in D(A) : CA (see Section 2.5.2.3); afterwards it performs shattering to ensure that the sets of random variables represented by parameterized random variables in Φ are still either identical or disjoint after the expansion.

Given a set of parfactors Φ and a set of queried random variables Q, which could be given in the form of a parameterized random variable and a set of constraints, the C-FOVE algorithm computes the marginal JQ(Φ). It starts with a shattering macro-operation. Parfactors in Φ are also shattered against the random variables in Q. Next, the C-FOVE algorithm eliminates non-queried random variables from Φ by iteratively performing one of the other four macro-operations. The macro-operations are chosen by a greedy search, with the cost of each operation defined as the total size of the parfactors the operation creates, which in practice is equal to the size of the factor components of these parfactors. While global elimination and counting operations are not always possible to perform, because their preconditions might not be satisfied, propositionalization and full expansion can always be executed and, in the extreme case, fully propositionalize the set of parfactors. This implies C-FOVE's completeness. It is worth pointing out that, similarly to the VE algorithm for inference in belief networks, the C-FOVE algorithm in directed models allows for pruning of random variables irrelevant to the query. See [Taghipour et al., 2009] for an example of how pruning can be performed in a lifted manner.

2.5.2.7 Example computation

In this section we present a simple lifted computation which illustrates the C-FOVE algorithm.
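Before the walkthrough, the greedy operation selection described above can be illustrated with a small sketch. The Op records, the cost bookkeeping (product size plus result size for a sum-out), and the tie-breaking rule are my own assumptions mirroring the example run below, not the thesis's implementation; the candidate list corresponds to a subset of the operations at the run's first decision point.

```python
# Schematic sketch of C-FOVE's greedy choice of macro-operation.
import random

class Op:
    """A candidate macro-operation with its heuristic cost (total size of
    parfactors it would create) and the number of random variables it
    would eliminate (used here for tie-breaking, as assumed in the text)."""
    def __init__(self, name, cost, eliminates):
        self.name = name
        self.cost = cost
        self.eliminates = eliminates

def choose_operation(ops, rng=random):
    best_cost = min(op.cost for op in ops)
    tied = [op for op in ops if op.cost == best_cost]
    most = max(op.eliminates for op in tied)     # break ties: eliminate more
    tied = [op for op in tied if op.eliminates == most]
    return rng.choice(tied)                      # remaining ties: random

# Some candidates from the first decision point of the run below (n = 20
# is an assumed domain size; 8 + 4 = product size plus summed-out size).
n = 20
ops = [
    Op("GLOBAL-SUM-OUT wet_grass(lot1)", cost=8 + 4, eliminates=1),
    Op("GLOBAL-SUM-OUT sprinkler(lot1)", cost=8 + 4, eliminates=1),
    Op("GLOBAL-SUM-OUT sprinkler(Lot)",  cost=8 + 4, eliminates=n - 1),
    Op("COUNTING-CONVERT [08]",          cost=n,     eliminates=0),
]
print(choose_operation(ops).name)
```

With these numbers the three sum-outs tie on cost, and the lifted sum-out over sprinkler(Lot) wins the tie because it eliminates n − 1 random variables at once.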
Consider the parfactors from Example 2.7 (page 23), which represent the ICL theory from Example 2.6 (page 18) and Figure 2.5 (page 19). Assume D(Lot) = {lot1, lot2, . . . , lotn} and that it is observed that grass is wet on lot1, which can be represented by the following parfactor:

    ⟨∅, {wet_grass(lot1)}, F4⟩,

where the factor F4 maps wet_grass(lot1) = false to 0.0 and wet_grass(lot1) = true to 1.0. Let Φ be the set of the three parfactors from Example 2.7 and the above parfactor (in what follows, we do not show the details of the factor components of the parfactors):

    Φ = { ⟨∅, {rain()}, F1⟩,                                        [01]
          ⟨∅, {sprinkler(Lot)}, F2⟩,                                [02]
          ⟨∅, {rain(), sprinkler(Lot), wet_grass(Lot)}, F3⟩,        [03]
          ⟨∅, {wet_grass(lot1)}, F4⟩ }.                             [04]

Assume we want to compute Jground(wet_grass(Lot)):{Lot≠lot1}(Φ) using the C-FOVE algorithm. Note that this is the joint on wet_grass(Lot) for all lots except lot1. Below we describe a run of the C-FOVE algorithm. After each step we show the updated set of parfactors Φ, indexed with the step number. After the initial shattering, before each step we list the available macro-operations and their cost according to the C-FOVE heuristic.

First, C-FOVE invokes the SHATTER(Φ) macro-operation. The macro-operation splits parfactor [03] on the substitution {Lot/lot1}, which creates parfactor [05] and residual parfactor [06]. Next, it splits parfactor [02] on {Lot/lot1}, which creates parfactor [07] and residual parfactor [08]. After shattering, the set Φ1 is as follows:

    Φ1 = { ⟨∅, {rain()}, F1⟩,                                       [01]
           ⟨∅, {wet_grass(lot1)}, F4⟩,                              [04]
           ⟨∅, {rain(), sprinkler(lot1), wet_grass(lot1)}, F3⟩,     [05]
           ⟨{Lot ≠ lot1}, {rain(), sprinkler(Lot), wet_grass(Lot)}, F3⟩,  [06]
           ⟨∅, {sprinkler(lot1)}, F2⟩,                              [07]
           ⟨{Lot ≠ lot1}, {sprinkler(Lot)}, F2⟩ }.                  [08]

Note that parfactors created by the same splitting operation have identical factor components. A smart implementation would represent these factors using the same, single object. We discuss this issue in more detail in Section 5.3.
The following (parameterized random variable, set of constraints) pairs are present in the set Φ1: (rain(), ∅), (sprinkler(lot1), ∅), (sprinkler(Lot), {Lot ≠ lot1}), (wet_grass(lot1), ∅), and (wet_grass(Lot), {Lot ≠ lot1}). C-FOVE will eliminate all of them except the last one. After the initial shattering, the C-FOVE algorithm can choose between performing the following macro-operations:

• GLOBAL-SUM-OUT(Φ1, wet_grass(lot1), ∅), which during the multiplication step would create a factor of size 8 and after summing out would create a factor of size 4; it would eliminate one random variable;

• GLOBAL-SUM-OUT(Φ1, sprinkler(lot1), ∅), which during the multiplication step would create a factor of size 8 and after summing out would create a factor of size 4; it would eliminate one random variable;

• GLOBAL-SUM-OUT(Φ1, sprinkler(Lot), {Lot ≠ lot1}), which during the multiplication step would create a factor of size 8 and after summing out would create a factor of size 4; it would eliminate n − 1 random variables;

• PROPOSITIONALIZE(Φ1, [06], Lot), which would create n − 1 identical factors of size 8 and, because of the subsequent shattering, n − 1 identical factors of size 2; it would not eliminate any random variables;

• PROPOSITIONALIZE(Φ1, [08], Lot), which would create n − 1 identical factors of size 2 and, because of the subsequent shattering, n − 1 identical factors of size 8; it would not eliminate any random variables;

• COUNTING-CONVERT(Φ1, [08], Lot), which would create a factor of size n; it would not eliminate any random variables.

The first three macro-operations have identical cost. Tie-breaking is not discussed by Milch et al. [2008]. Let us assume that C-FOVE uses the number of random variables that would be eliminated for tie-breaking (the more, the better) and that, if this number does not resolve the tie, the algorithm chooses one of the equally good operations at random. Under this assumption, C-FOVE chooses the third macro-operation.
The macro-operation first applies Proposition 2.3 and multiplies the parfactors that involve ground(sprinkler(Lot)) : {Lot ≠ lot1}, that is, parfactors [06] and [08]. Both parfactors represent the same number of factors, namely n − 1, and no correction is necessary. The product is as follows:

    ⟨{Lot ≠ lot1}, {rain(), sprinkler(Lot), wet_grass(Lot)}, F2 F3⟩.

The macro-operation applies Proposition 2.1 and sums out ground(sprinkler(Lot)) : {Lot ≠ lot1} from the above product. No logical variable disappears from the parfactor and C-FOVE does not need to compensate for it. C-FOVE obtains the following parfactor:

    ⟨{Lot ≠ lot1}, {rain(), wet_grass(Lot)}, F5⟩,                   [09]

where F5 = ∑_{sprinkler(Lot)} F2 F3. The new set of parfactors is as follows:

    Φ2 = { ⟨∅, {rain()}, F1⟩,                                       [01]
           ⟨∅, {wet_grass(lot1)}, F4⟩,                              [04]
           ⟨∅, {rain(), sprinkler(lot1), wet_grass(lot1)}, F3⟩,     [05]
           ⟨∅, {sprinkler(lot1)}, F2⟩,                              [07]
           ⟨{Lot ≠ lot1}, {rain(), wet_grass(Lot)}, F5⟩ }.          [09]

Next, C-FOVE can choose between performing the following macro-operations:

• GLOBAL-SUM-OUT(Φ2, wet_grass(lot1), ∅), which during the multiplication step would create a factor of size 8 and after summing out would create a factor of size 4; it would eliminate one random variable;

• GLOBAL-SUM-OUT(Φ2, sprinkler(lot1), ∅), which during the multiplication step would create a factor of size 8 and after summing out would create a factor of size 4; it would eliminate one random variable;

• PROPOSITIONALIZE(Φ2, [09], Lot), which would create n − 1 identical factors of size 4; it would not eliminate any random variables;

• COUNTING-CONVERT(Φ2, [09], Lot), which would create a factor of size 2n; it would not eliminate any random variables.

The first two macro-operations have identical cost. Let us assume that C-FOVE randomly chooses the second one.
GLOBAL-SUM-OUT(Φ2, sprinkler(lot1), ∅) first multiplies parfactors [05] and [07] and obtains the following product:

    ⟨∅, {rain(), sprinkler(lot1), wet_grass(lot1)}, F2 F3⟩.

The macro-operation sums out ground(sprinkler(lot1)) from the above product and obtains the following parfactor:

    ⟨∅, {rain(), wet_grass(lot1)}, ∑_{sprinkler(lot1)} F2 F3⟩.      [10]

Note that ∑_{sprinkler(lot1)} F2 F3 = F5. Factor F5 has already been computed by the previous macro-operation. This overhead is due to shattering, which takes a brute-force approach to constraint processing. We discuss this issue in Section 5.3. The new set of parfactors is as follows:

    Φ3 = { ⟨∅, {rain()}, F1⟩,                                       [01]
           ⟨∅, {wet_grass(lot1)}, F4⟩,                              [04]
           ⟨{Lot ≠ lot1}, {rain(), wet_grass(Lot)}, F5⟩,            [09]
           ⟨∅, {rain(), wet_grass(lot1)}, F5⟩ }.                    [10]

Next, C-FOVE can choose between performing the following macro-operations:

• GLOBAL-SUM-OUT(Φ3, wet_grass(lot1), ∅), which during the multiplication step would create a factor of size 4 and after summing out would create a factor of size 2; it would eliminate one random variable;

• PROPOSITIONALIZE(Φ3, [09], Lot), which would create n − 1 identical factors of size 4; it would not eliminate any random variables;

• COUNTING-CONVERT(Φ3, [09], Lot), which would create a factor of size 2n; it would not eliminate any random variables.

C-FOVE chooses the first macro-operation. The macro-operation multiplies parfactors [04] and [10]. Their product is as follows:

    ⟨∅, {rain(), wet_grass(lot1)}, F4 F5⟩.

The macro-operation sums out ground(wet_grass(lot1)) from the above product and obtains the following parfactor:

    ⟨∅, {rain()}, F6⟩,                                              [11]

where F6 = ∑_{wet_grass(lot1)} F4 F5. The new set of parfactors is as follows:

    Φ4 = { ⟨∅, {rain()}, F1⟩,                                       [01]
           ⟨{Lot ≠ lot1}, {rain(), wet_grass(Lot)}, F5⟩,            [09]
           ⟨∅, {rain()}, F6⟩ }.                                     [11]
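The factor arithmetic behind the GLOBAL-SUM-OUT steps above, pointwise multiplication followed by summing a variable out, can be sketched in a few lines. The representation is my own assumption: a factor is a (variables, table) pair, all variables are Boolean, and the entries of the stand-in for F5 are made-up numbers, not values from the example.

```python
# Minimal factor multiplication and sum-out, as used by GLOBAL-SUM-OUT.
from itertools import product

def multiply(f1, f2):
    """Pointwise product of two factors over the union of their variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    variables = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for assignment in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, assignment))
        key1 = tuple(env[v] for v in vars1)
        key2 = tuple(env[v] for v in vars2)
        table[assignment] = t1[key1] * t2[key2]
    return variables, table

def sum_out(f, var):
    """Marginalize one variable out of a factor."""
    variables, table = f
    i = variables.index(var)
    new_vars = variables[:i] + variables[i + 1:]
    new_table = {}
    for assignment, value in table.items():
        key = assignment[:i] + assignment[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + value
    return new_vars, new_table

# F4 from observation parfactor [04]: wet_grass(lot1) observed to be true.
f4 = (["wet_grass(lot1)"], {(False,): 0.0, (True,): 1.0})
# A stand-in for F5 over rain() and wet_grass(lot1) (assumed numbers).
f5 = (["rain()", "wet_grass(lot1)"],
      {(False, False): 0.8, (False, True): 0.2,
       (True, False): 0.1, (True, True): 0.9})
# F6 = sum over wet_grass(lot1) of F4 * F5, as in parfactor [11].
f6 = sum_out(multiply(f4, f5), "wet_grass(lot1)")
print(f6)  # a factor on rain() alone
```

Because the observation factor F4 zeroes out the wet_grass(lot1) = false rows, the resulting factor on rain() simply picks out the wet_grass(lot1) = true column of F5.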
Next, C-FOVE can perform only one macro-operation:

• COUNTING-CONVERT(Φ4, [09], Lot), which creates a factor of size 2n; it does not eliminate any random variables.

This macro-operation applies Proposition 2.6 and performs counting on the logical variable Lot. It obtains:

    ⟨∅, {rain(), #Lot:{Lot≠lot1}[wet_grass(Lot)]}, F7⟩,             [12]

where F7 is defined as in Equation 2.10. The new set of parfactors is as follows:

    Φ5 = { ⟨∅, {rain()}, F1⟩,                                       [01]
           ⟨∅, {rain()}, F6⟩,                                       [11]
           ⟨∅, {rain(), #Lot:{Lot≠lot1}[wet_grass(Lot)]}, F7⟩ }.    [12]

A call to the SHATTER(Φ5) macro-operation does not change the set Φ5. Next, C-FOVE can choose between performing the following macro-operations:

• GLOBAL-SUM-OUT(Φ5, rain(), ∅), which during the multiplication step would create a factor of size 2n and after summing out would create a factor of size n; it would eliminate one random variable;

• FULL-EXPAND(Φ5, [12], #Lot:{Lot≠lot1}[wet_grass(Lot)]), which would create a factor of size 2^n; it would not eliminate any random variables.

C-FOVE chooses the first macro-operation. The macro-operation multiplies parfactors [01], [11] and [12]. Their product is as follows:

    ⟨∅, {rain(), #Lot:{Lot≠lot1}[wet_grass(Lot)]}, F1 F6 F7⟩.

Next, C-FOVE sums out ground(rain()) from the above product and obtains the following parfactor:

    ⟨∅, {#Lot:{Lot≠lot1}[wet_grass(Lot)]}, F8⟩,                     [13]

where F8 = ∑_{rain()} F1 F6 F7. The new set of parfactors is as follows:

    Φ6 = { ⟨∅, {#Lot:{Lot≠lot1}[wet_grass(Lot)]}, F8⟩ }.            [13]

We have Jground(wet_grass(Lot)):{Lot≠lot1}(Φ) = J(Φ6). During the computation, the constants lot2, lot3, . . . , lotn were not explicitly enumerated; we only needed to know that |D(Lot)| = n. The biggest factor created during inference had size 8 (factor F3) for n ≤ 4 and size 2n (factor F7) for n > 4.

2.6 Summary

In this chapter we presented a brief overview of belief networks and the variable elimination algorithm.
Belief networks and probabilistic inference in belief networks are discussed in great detail in [Darwiche, 2009] and [Koller and Friedman, 2009], as well as in general AI textbooks [Poole and Mackworth, 2010; Russell and Norvig, 2009]. Next, we discussed first-order probabilistic models. As the focus of this thesis is on inference rather than modeling, we gave only one example of a first-order probabilistic formalism: ICL. Collections by Getoor and Taskar [2007] and De Raedt et al. [2008] contain chapters devoted to many other first-order probabilistic languages. Milch [2006] provides a very good summary of first-order probabilistic languages. Finally, we gave an overview of the current state of the art in exact lifted probabilistic inference. While approximate lifted inference is not the focus of this thesis, it is worth mentioning that there is an ongoing effort in designing approximate lifted inference algorithms [de Salvo Braz et al., 2009; Kersting et al., 2009; Sen et al., 2009; Singla and Domingos, 2008]. In the chapters to follow, we will present our contributions to this area. Our work aims to satisfy the desiderata already mentioned in this chapter:

• the length of a specification of a first-order probabilistic model must be independent of the sizes of the populations in the model;

• the cost of inference should be logarithmic when possible and at most linear in the sizes of the populations in the model.

Chapter 3

Aggregation in Lifted Inference

Here's something to think about: How come you never see a headline like 'Psychic Wins Lottery'? — Jay Leno

3.1 Introduction

One aspect that arises in directed first-order probabilistic models is the need for aggregation, which occurs when a parent parameterized random variable has logical variables that are not present in a child parameterized random variable.
Previously available lifted inference algorithms do not allow a description of aggregation in first-order models that is independent of the sizes of the populations. In this thesis we introduce a new data structure, the aggregation parfactor, describe how to use it to represent aggregation in first-order models, and show how to perform efficient lifted inference in its presence. We analyze the need for aggregation (Section 3.2) and describe how to model aggregation (Section 3.3) in directed first-order probabilistic models. Next, in Section 3.4, we introduce aggregation parfactors and describe how to perform lifted inference in their presence. For clarity, we start with a simple form of aggregation parfactors and later present a generalized version (Section 3.4.3). In Section 3.5, through experiments, we show that aggregation parfactors can lead to gains in efficiency.

Figure 3.1: A first-order model from Example 3.1 and its equivalent belief network. Aggregation is denoted by curved arcs.

3.2 Need for aggregation

In a directed first-order probabilistic model, when a child parameterized random variable has a parent parameterized random variable with extra logical variables, in the grounding the child parameterized random variable has an unbounded number of parents. We need some aggregation operator to describe how the child parameterized random variable depends on the parent parameterized random variable. We illustrate this point with the following two simple examples.

Example 3.1. Consider the directed first-order probabilistic model and its grounding presented in Figure 3.1. The model is meant to represent that a person who fills in a single 6/49 lottery ticket has a chance of guessing all six numbers correctly and that, when this happens, the jackpot is won. A person who does not play the lottery has no chance of winning.
The model has two nodes: a parameterized random variable played(Person) with range {false, true}, and a random variable jackpot_won() with range {false, true} that is true if some person guesses all six numbers correctly. We have D(Person) = {jan, sylwia, . . . , magda} and |D(Person)| = n. The parameterized random variable played(Person) represents the n random variables in the corresponding propositional model. Therefore, in the propositional model, the number of parent nodes influencing the node jackpot_won() is n. Their common effect aggregates in the child node.

Figure 3.2: A first-order model from Example 3.2 and its equivalent belief network. Aggregation is denoted by curved arcs.

The next example illustrates aggregation over non-binary random variables.

Example 3.2. Consider the directed first-order probabilistic model and its grounding presented in Figure 3.2. It is a modified version of the model from Example 3.1. A person who plays a 6/49 lottery has a chance of matching zero, one, . . . , five, or all six numbers correctly. The model has two nodes: a parameterized random variable played(Person) with range {false, true}, and a random variable best_match() with range {0, 1, 2, 3, 4, 5, 6} that is equal to the highest number of matched lottery numbers among the lottery participants.

3.3 Modeling aggregation

In this thesis we base aggregation on causal independence. We introduce causal independence in Section 3.3.1 and show how to use it to describe aggregation in Section 3.3.2.

3.3.1 Causal independence

We use the definition of causal independence from Zhang and Poole [1996].

Definition 3.1. Parent random variables p1, p2, . . . , pn are said to be causally independent with respect to a child random variable c if there exist random variables p1′, p2′, . . . , pn′ such that:

1. ∀i∈[1,n] the range of pi′ is a subset of the range of c, and

2.
∀i∈[1,n] pi′ probabilistically depends on pi, and pi′ is independent of p1, . . . , pi−1, pi+1, . . . , pn, p1′, . . . , pi−1′, pi+1′, . . . , pn′ given pi, and

3. there exists a commutative and associative binary operator ∗ over the range of c such that c = p1′ ∗ p2′ ∗ · · · ∗ pn′.

We call ∗ the base combination operator of c. The above definition covers common causal independence models such as noisy-OR [Pearl, 1986], noisy-MAX [Díez, 1993] and noisy adders as special cases. The next example illustrates how Definition 3.1 can be used to describe a noisy-OR model.

Example 3.3. The noisy-OR model consists of a Boolean child node c and a set of n Boolean parent nodes p = {p1, p2, . . . , pn}. Associated with each parent node pi is its causal strength si, which gives the probability that c is true when pi is true, independently of the values of the other parents. Let true(p) be the set of parent nodes which are true; then the conditional probability distribution specified by the noisy-OR is given by:

    P(c = true | p) = 1 − ∏_{i∈true(p)} (1 − si)                    (3.1)

and

    P(c = false | p) = ∏_{i∈true(p)} (1 − si).                      (3.2)

Given Definition 3.1, the noisy-OR model can be described using n Boolean random variables pi′, n conditional probability distributions

    pi      P(pi′ = false | pi)   P(pi′ = true | pi)
    false   1                     0
    true    1 − si                si
                                                    i ∈ [1, n],

and the logical OR operator as the base combination operator.

Figure 3.3: A first-order model with OR-based aggregation from Example 3.4 and its equivalent belief network.
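The equivalence between the direct noisy-OR distribution in Equations (3.1)-(3.2) and its causal-independence decomposition can be checked numerically. The causal strengths and parent values below are assumed for illustration only.

```python
# Check that the noisy-OR CPD of Equation (3.1) equals the distribution
# obtained by summing over the auxiliary variables p'_i of Definition 3.1,
# combined with logical OR. Parent values and strengths are assumed.
from itertools import product

def noisy_or_true(parents, strengths):
    """Equation (3.1): P(c = true | p) = 1 - prod_{i in true(p)} (1 - s_i)."""
    prob_all_fail = 1.0
    for p_i, s_i in zip(parents, strengths):
        if p_i:
            prob_all_fail *= 1.0 - s_i
    return 1.0 - prob_all_fail

def noisy_or_true_via_aux(parents, strengths):
    """Sum over assignments to the p'_i, where P(p'_i = true | p_i = true) = s_i,
    P(p'_i = true | p_i = false) = 0, and c = p'_1 OR ... OR p'_n."""
    total = 0.0
    for aux in product([False, True], repeat=len(parents)):
        prob = 1.0
        for p_i, a_i, s_i in zip(parents, aux, strengths):
            p_true = s_i if p_i else 0.0
            prob *= p_true if a_i else 1.0 - p_true
        if any(aux):  # the OR-combination of the auxiliary variables
            total += prob
    return total

parents = [True, False, True]     # assumed parent values
strengths = [0.9, 0.8, 0.5]       # assumed causal strengths
print(noisy_or_true(parents, strengths))          # 1 - 0.1 * 0.5 = 0.95
print(noisy_or_true_via_aux(parents, strengths))  # agrees with the above
```

Note how a false parent forces its auxiliary variable to false, so it contributes nothing to the OR, exactly as the product in Equation (3.1) ranges only over true(p).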
3.3.2 Causal independence-based aggregation

In this thesis, to describe aggregation we assume that the range of the parent parameterized random variable is a subset of the range of the child parameterized random variable, and we use a commutative and associative deterministic binary operator ⊗ over the range of the child parameterized random variable as an aggregation operator. Given probabilistic input to the parent parameterized random variable, we can construct any causal independence model covered by the definition of causal independence from Section 3.3.1. In other words, this allows any causal independence model to act as the underlying mechanism for aggregation in directed first-order models. While some first-order probabilistic formalisms allow for richer forms of aggregation, see for example the work of Jaeger [2002] on combination functions, the above mechanism allows us to satisfy the desiderata stated in Section 2.6 and, as we will see in Section 3.4, to integrate aggregation into the calculus of parfactors.

Example 3.4. Let us come back to Example 3.1. We add to the model a parameterized random variable matched_6(Person) with range {false, true}. It is a noisy version of the parameterized random variable played(Person). If played(Person) has value false, then matched_6(Person) also has value false.

Figure 3.4: A first-order model with MAX-based aggregation from Example 3.4 and its equivalent belief network.

If played(Person) has value true, then matched_6(Person) has value true with probability equal to the chance of guessing all six numbers correctly. The common effect of the random variables represented by matched_6(Person) aggregates in jackpot_won().
We use logical OR as an aggregation operator to describe the (deterministic) conditional probability distribution P(jackpot_won() | matched_6(Person)). The modified model is presented in Figure 3.3. Similarly, we can adapt the model from Example 3.2 by adding a parameterized random variable matched(Person) with range {0, 1, 2, 3, 4, 5, 6} and using the MAX operator as an aggregation operator to describe the probability distribution P(best_match() | matched(Person)). The modified model is presented in Figure 3.4.

Aggregation can also be used with more abstract operators.

Example 3.5. Consider the directed first-order probabilistic model and its grounding presented in Figure 3.5. The model has three nodes: parameterized random variables played(Person) and matched_6(Person), both with range {false, true}, and a random variable jackpot_winners() with range {0, 1, 2, many} that represents the number of lottery participants that correctly matched all six numbers.

Figure 3.5: A first-order model from Example 3.5 and its equivalent belief network.

Let us define SUM|3 : {0, 1, 2, many} × {0, 1, 2, many} → {0, 1, 2, many} as follows:

    SUM|3(x, y) = x + y,   if x, y ∈ {0, 1, 2} and x + y ≤ 2;
    SUM|3(x, y) = many,    otherwise.

The operator SUM|3 can be interpreted as a sum capped at 3. We use SUM|3 as an aggregation operator to describe P(jackpot_winners() | matched_6(Person)).

In this thesis we require that directed first-order probabilistic models satisfy the following conditions:

(1) for each parameterized random variable, its parent has at most one extra logical variable;

(2) if a parameterized random variable c(. . . ) has a parent p(. . . , A, . . . ) with an extra logical variable A, then:

(a) p(. . . , A, . . . ) is the only parent of c(. .
. ),

(b) the range of p is a subset of the range of c,

(c) c(. . . ) is a deterministic function of the parent: c(. . . ) = p(. . . , a1, . . . ) ⊗ · · · ⊗ p(. . . , an, . . . ) = ⨂_{a∈D(A)} p(. . . , a, . . . ), where ⊗ is a commutative and associative deterministic binary operator over the range of c.

At first the above conditions seem very restrictive, but in fact they are not. There is no need to define aggregation over more than one logical variable, due to the associativity and commutativity of the ⊗ operator. We can obtain more complicated distributions by introducing auxiliary parameterized random variables and combining multiple aggregations.

Example 3.6. Consider a parent parameterized random variable p(A, B, C) and a child parameterized random variable c(C). We can describe a ⊗-based aggregation over A and B, c(C) = ⨂_{(a,b)∈D(A)×D(B)} p(a, b, C), using an auxiliary parameterized random variable c′(B, C) such that c′ has the same range as c. Let c′(B, C) = ⨂_{a∈D(A)} p(a, B, C); then c(C) = ⨂_{b∈D(B)} c′(b, C).

Similarly, with the use of auxiliary nodes, we can construct a distribution that combines an aggregation with influence from other parent nodes, or even one that combines multiple aggregations generated with different operators. In the rest of the chapter, we assume that the discussed models satisfy conditions (1) and (2), for ease of presentation and with no loss of generality.

3.4 Aggregation parfactors

In this section we introduce a new data structure, aggregation parfactors, describe how to use it to represent aggregation in first-order models, and show how to perform efficient lifted inference in its presence. We need a new data structure because parfactors, including parfactors on counting formulas, are not adequate representations: their size would depend on the population size of the extra logical variable.

Example 3.7. Consider the first-order model presented in Figure 3.3, which is also discussed in Examples 3.1 and 3.4.
We cannot represent the conditional probability distribution P(jackpot_won() | matched_6(Person)) with a parfactor ⟨∅, {matched_6(Person), jackpot_won()}, F⟩, as even a simple noisy-OR cannot be represented as a product of the factors represented by this parfactor. A parfactor ⟨∅, {matched_6(jan), . . . , matched_6(magda), jackpot_won()}, F⟩ is not an adequate input representation of this distribution, because its size would depend on |D(Person)|. The same applies to ⟨∅, {#Person:∅[matched_6(Person)], jackpot_won()}, F⟩, as the size of the range of #Person:∅[matched_6(Person)] depends on |D(Person)|. An analogous problem arises for the first-order models presented in Figures 3.4 and 3.5.

Definition 3.2. An aggregation parfactor is a hextuple ⟨C, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, CA⟩, where

• p(. . . , A, . . . ) and c(. . . ) are parameterized random variables;

• the range of p is a subset of the range of c;

• A is the only logical variable in p(. . . , A, . . . ) that is not in c(. . . );

• C is a set of inequality constraints not involving A;

• F_p is a factor from the range of p to the real numbers;

• ⊗ is a commutative and associative deterministic binary operator over the range of c;

• CA is a set of inequality constraints involving A, such that (D(A) : CA) ≠ ∅.

An aggregation parfactor ⟨C, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, CA⟩ represents a set of factors, one factor F_pc for each ground substitution G to all logical variables in param(c(. . . )) that satisfies the constraints in C. Each factor F_pc is a mapping from the Cartesian product ×_{a∈D(A):CA} range(p) × range(c) to the reals, which, given an assignment of values to random variables v, is defined as follows:

    F_pc(v(p(. . . , a1, . . . )), . . . , v(p(. . . , an, . . . )), v(c(. . . ))) =
      ∏_{a∈{a1,...,an}} F_p(v(p(. . . , a, . . . )))^{r_p/r_c},
                  if ⨂_{a∈{a1,...,an}} v(p(. . . , a, . . . )) = v(c(. . . ));
      0,          otherwise,

where D(A) : CA = {a1, . . . , an}, r_p = |ground(p(. . . , a, . .
)) : C |, a ∈ D(A) : C_A, and r_c = | ground(c(. . . )) : C |.

The space required to represent an aggregation parfactor does not depend on the size of the set D(A) : C_A. It is O(n² log n), where n is the size of range(c), as the operator ⊗ can be represented as a factor from range(c) × range(c) to range(c). The exponent r_p / r_c compensates for the possibility that the parameterized random variable c(. . . ) might be parameterized by logical variables not present in p(. . . , A, . . . ). It is also important to notice that D(A) : C_A might vary between ground substitutions G if the set C ∪ C_A is not in normal form (see Section 2.5.1.1). We comment on this issue in Section 5.2.3.

When an aggregation parfactor ⟨C, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, C_A⟩ is used to describe aggregation in a first-order model, the factor F_p will be a constant function with the value 1. We use 1 to denote such factors. An aggregation parfactor created during inference may have a non-trivial F_p component (see Section 3.4.2).

Example 3.8. Consider the first-order model from Figure 3.3. The conditional probability distribution P(jackpot_won() | matched_6(Person)) can be represented with an aggregation parfactor ⟨∅, matched_6(Person), jackpot_won(), 1, OR, ∅⟩. The size of the representation does not depend on the population size of the logical variable Person. The parameterized random variable jackpot_won() represents one random variable, therefore the aggregation parfactor represents one factor:

matched_6(jan)  . . .  matched_6(magda)  jackpot_won()  value
false           . . .  false             false          1
false           . . .  false             true           0
false           . . .  true              false          0
false           . . .  true              true           1
. . .
true            . . .  true              false          0
true            . . .  true              true           1
Similarly, for the model from Figure 3.4, P(best_match() | matched(Person)) can be represented with ⟨∅, matched(Person), best_match(), 1, MAX, ∅⟩, and for the model from Figure 3.5, P(jackpot_winners() | matched_6(Person)) can be represented with ⟨∅, matched_6(Person), jackpot_winners(), 1, SUM|3, ∅⟩.

In the rest of this thesis, Φ denotes a set of parfactors and aggregation parfactors. The notation introduced in Section 2.5 remains valid under this extended meaning of the symbol Φ.

3.4.1 Conversion to parfactors

In this section we show how aggregation parfactors can be converted to parfactors, which in turn can be used during inference with C-FOVE.

3.4.1.1 Conversion using counting formulas

Consider an aggregation parfactor ⟨C, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, C_A⟩. Since ⊗ is an associative and commutative operator, given an assignment of values to random variables v, it does not matter which of the parameterized random variables p(. . . , a, . . . ), a ∈ D(A) : C_A, are assigned each value from range(p), but only how many of them are assigned each value. This property was a motivation for the counting elimination algorithm [de Salvo Braz et al., 2007] (see Section 2.5.2.4) and counting formulas [Milch et al., 2008] (see Section 2.4.1.1), and it allows us to convert an aggregation parfactor to a product of two parfactors, where one of the parfactors is a parfactor on a counting formula:

Proposition 3.1. Let gA = ⟨C, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, C_A⟩ be an aggregation parfactor from Φ such that the set C ∪ C_A is in normal form. Let F# be a factor from the Cartesian product range(#A:C_A[p(. . . , A, . . . )]) × range(c) to {0, 1}. Given an assignment of values v to all random variables but ground(p(. . . , A, . . . )) : C ∪ C_A, the factor is defined as follows:

F#(h(), v(c(. . . ))) =
  1,  if ⊗_{x∈range(p)} ⊗_{i=1}^{h(x)} x = v(c(. . . ));
  0,  otherwise,

where h() is a histogram from range(#A:C_A[p(. . . , A, . . . )]).
Then J (Φ) = J (Φ \ {gA } ∪ { C ∪C A , {p(. . . , A, . . . )}, F p , C , {#A:CA [p(. . . , A, . . . )], c(. . . )}, F# }). Proof. It suffices to show that J ({gA }) = J ({ C , p(. . . , A, . . . ), c(. . . ), F p , ⊗, C A }) (3.3) equals J ({ C ∪C A , {p(. . . , A, . . . )}, F p , C , {#A:CA [p(. . . , A, . . . )], c(. . . )}, F# }) . (3.4) 66 We are going to transform expression (3.4) into expression (3.3) by propositionalizing logical variable A. Let D(A) : C A = {a1 , a2 , . . . , an }. J ({ C ∪C A , {p(. . . , A, . . . )}, F p , C , {#A:CA [p(. . . , A, . . . )], c(. . . )}, F# }) = (by n applications of Proposition 2.4) = J ({ C , {p(. . . , a1 , . . . )}, F p , C , {p(. . . , a2 , . . . )}, F p , . . . , C , {p(. . . , an , . . . )}, F p , C , {#A:CA [p(. . . , A, . . . )], c(. . . )}, F# }}) = (by n − 1 applications of Proposition 2.3) = J ({ C , {p(. . . , a1 , . . . ), p(. . . , a2 , . . . ), . . . , p(. . . , an , . . . )}, F1 , C , {#A:CA [p(. . . , A, . . . )], c(. . . )}, F# }}), (3.5) where factor F1 is a mapping from the Cartesian product ×a∈D(A):CA range(p) to the reals, which, given an assignment of values to random variables v, is defined as follows: v(F1 ) = ∏ a∈D(A):CA F p (v(p(. . . , a, . . . ))). Next, we apply n times Proposition 2.5 to expression (3.5) and obtain: J ({ C , {p(. . . , a1 , . . . ), p(. . . , a2 , . . . ), . . . , p(. . . , an , . . . )}, F1 , C , {p(. . . , a1 , . . . ), p(. . . , a2 , . . . ), . . . , p(. . . , an , . . . ), c(. . . )}, F2 }). (3.6) Factor F2 is a mapping from the Cartesian product ×a∈D(A):CA range(p) × range(c) to the reals, which, given an assignment of values to random variables v, is defined as follows: v(F2 ) = 1, if a∈D(A):CA v(p(. . . , a, . . . )) = v(c(. . . )); 0, otherwise. Finally, we apply Proposition 2.3 to parfactors from expression (3.6) and obtain: rp J ({ C , {p(. . . , a1 , . . . ), p(. . . , a2 , . . . ), . . . , p(. . . , an , . . . ), c(. . . )}, F1 rc 67 F2 }), where r p =| ground(p(. 
. . , a, . . . )) : C |, a ∈ D(A) : C_A, and r_c = | ground(c(. . . )) : C |.

The parfactor ⟨C, {p(. . . , a1, . . . ), p(. . . , a2, . . . ), . . . , p(. . . , an, . . . ), c(. . . )}, F1^{r_p/r_c} F2⟩ represents a set of factors, one for each ground substitution G to all logical variables in (param(p(. . . , A, . . . )) ∪ param(c(. . . ))) \ {A} = param(c(. . . )) that satisfies the constraints in C. Each factor FG is a mapping from the Cartesian product ×_{a∈D(A):C_A} range(p) × range(c) to the reals which, given an assignment of values to random variables v, is defined as follows:

v(FG) =
  ∏_{a∈D(A):C_A} F_p^{r_p/r_c}(v(p(. . . , a, . . . ))),  if ⊗_{a∈D(A):C_A} v(p(. . . , a, . . . )) = v(c(. . . ));
  0,  otherwise.

From Definition 3.2, such a set of factors can be represented as an aggregation parfactor ⟨C, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, C_A⟩. Therefore J({⟨C, {p(. . . , a1, . . . ), p(. . . , a2, . . . ), . . . , p(. . . , an, . . . ), c(. . . )}, F1^{r_p/r_c} F2⟩}) = J({⟨C, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, C_A⟩}), which completes the proof.

If the set C ∪ C_A is not in normal form (Section 2.5.1.1), we need to use the splitting operation described in Section 3.4.2.1 to convert the aggregation parfactor to a set of aggregation parfactors with constraint sets in normal form (see the algorithm presented in Figure 5.4 on page 162).

Proposition 3.1 shows how an aggregation parfactor can be replaced by parfactors involving a counting formula. The following example illustrates how this is done in practice.

Example 3.9. Consider the aggregation parfactor introduced in Example 3.8: ⟨∅, matched_6(Person), jackpot_won(), 1, OR, ∅⟩. Let n = | D(Person) |. The conversion described in Proposition 3.1 creates two parfactors: ⟨∅, {matched_6(Person)}, 1⟩ and ⟨∅, {#Person:∅[matched_6(Person)], jackpot_won()}, F#⟩. F# is a factor from the Cartesian product range(#Person:∅[matched_6(Person)]) × range(jackpot_won) to {0, 1}.
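The construction of F# is small enough to write out in code. A sketch, assuming a hypothetical population size n = 4 and representing each histogram as a pair (#false, #true); the helper name F_count is made up for illustration:

```python
# Sketch of the conversion in Example 3.9, assuming a hypothetical population
# size n = 4; histograms over range(p) = {false, true} are pairs (n_false, n_true).
n = 4
histograms = [(n - k, k) for k in range(n + 1)]  # n + 1 histograms in total

def F_count(histogram, jackpot_won):
    """F#-style factor: 1 if OR over the histogram equals the child value."""
    n_false, n_true = histogram
    aggregated = n_true > 0  # OR of the multiset of parent values the histogram encodes
    return 1 if aggregated == jackpot_won else 0

# The factor grows linearly with n (one pair of rows per histogram),
# instead of exponentially as a table over 2**n parent assignments would.
table = {(h, c): F_count(h, c) for h in histograms for c in (False, True)}
assert table[(n, 0), False] == 1     # nobody matched 6 numbers: jackpot not won
assert table[(n - 1, 1), True] == 1  # at least one match: jackpot won
```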
The range of #Person:∅[matched_6(Person)] is a set of histograms h() with a bucket for the value false, a bucket for the value true, and entries adding up to n: {(#false = n, #true = 0), (#false = n − 1, #true = 1), . . . , (#false = 0, #true = n)}. The range of jackpot_won is {false, true}. The factor F# is defined as follows:

F#(h(), v(jackpot_won())) =
  1,  if OR_{x∈{false,true}} OR_{i=1}^{h(x)} x = v(jackpot_won());
  0,  otherwise.

Note that the expression OR_{x∈{false,true}} OR_{i=1}^{h(x)} x evaluates to false if h(false) = n and to true if h(false) < n. Below we show a tabular representation of the factor F#:

#Person:∅[matched_6(Person)]   jackpot_won()  value
(#false = n,     #true = 0)    false          1
(#false = n,     #true = 0)    true           0
(#false = n − 1, #true = 1)    false          0
(#false = n − 1, #true = 1)    true           1
. . .
(#false = 0,     #true = n)    false          0
(#false = 0,     #true = n)    true           1

The size of the factor F# is n + 1.

3.4.1.2 Conversion for MAX and MIN operators

If in an aggregation parfactor ⊗ is the MAX operator (which includes the OR operator as a special case), we can use a factorization presented by Díez and Galán [2003] to convert the aggregation parfactor to parfactors without counting formulas. The factorization is an example of the tensor rank-one decomposition of a conditional probability distribution [Savicky and Vomlel, 2007]. The factorization of Díez and Galán can be used to convert an aggregation parfactor to a pair of standard parfactors as follows:

Proposition 3.2. Let gA = ⟨C, p(. . . , A, . . . ), c(. . . ), F_p, MAX, C_A⟩ be an aggregation parfactor from Φ, where the MAX operator is induced by a total ordering ≺ of range(c) and the set C ∪ C_A is in normal form. Let s() be the successor function induced by ≺. Let c′(. . . ) be an auxiliary parameterized random variable with the same parameterization and the same range as c.
Let Fc be a factor from the Cartesian product range(p) × range(c) to real numbers that, given an assignment of values to random variables v, is defined as follows: rc F r p (v(p(. . . , A, . . . ))), p Fc (v(p(. . . , A, . . . )), v(c (. . . ))) = if v(p(. . . , A, . . . )) 0, otherwise, v(c (. . . )); (3.7) where r p =| ground(p(. . . , a, . . . )): C |, a ∈ D(A): CA , and rc =| ground(c(. . . )): C |. Let F∆ be a factor from the Cartesian product range(p) × range(c) to {−1, 0, 1} that, given v, is defined as follows: 1, if v(c(. . . )) = v(c (. . . )); F∆ (v(c(. . . )), v(c (. . . ))) = −1, if v(c(. . . )) = s(v(c (. . . ))); 0, otherwise. Then J (Φ) = ∑ ground(c (... )) J (Φ \ {gA } ∪ { C ∪C A , {p(. . . , A, . . . ), c (. . . )}, Fc , C , {c(. . . ), c (. . . )}, F∆ }). 70 Proof. It is sufficient to prove that J ({gA }) = J ({ C , p(. . . , A, . . . ), c(. . . ), F p , MAX, C A }) (3.8) is equal to ∑ ground(c (... )) J ({ C ∪C A , {p(. . . , A, . . . ), c (. . . )}, Fc , C , {c(. . . ), c (. . . )}, F∆ }). (3.9) We start with expression (3.8) and convert the aggregation parfactor to parfactors. By Proposition 3.1 J ({ C , p(. . . , A, . . . ), c(. . . ), F p , MAX, C A }) = J ({ C ∪C A , {p(. . . , A, . . . )}, F p , C , {#A:CA [p(. . . , A, . . . )], c(. . . )}, F# }) . F# is a factor from the Cartesian product range(#A:CA [p(. . . , A, . . . )]) × range(c) to a set {0, 1} that, given an assignment of values v to all random variables but ground(p(. . . , A, . . . )) : C ∪ CA , is defined as follows: F# (h(), v(c(. . . ))) = 1, if max x∈range(p),h(x)>0 x = v(c(. . . )); 0, otherwise, where h() is a histogram from range(#A:CA [p(. . . , A, . . . )]). Next, we transform expression (3.9) and represent the first parfactor as a product of two parfactors. From (3.7) and Proposition 2.3 J ({ C ∪C A , {p(. . . , A, . . . ), c (. . . )}, Fc ) = rc r p rc J ({ C ∪C A , {p(. . . , A, . . . )}, F p r p , C ∪C A , {p(. . . , A, . . . ), c (. . . 
)}, F1 ) = J ({ C ∪C A , {p(. . . , A, . . . )}, F p , C ∪C A , {p(. . . , A, . . . ), c (. . . )}, F1 ), where F1 is a factor from the Cartesian product range(p) × range(c) to real num- bers that, given an assignment of values to random variables v, is defined as fol- 71 lows: 1, if v(p(. . . , A, . . . )) F1 (v(p(. . . , A, . . . )), v(c (. . . ))) = 0, otherwise. v(c (. . . )); After applying the above transformations to expressions (3.8) and (3.9) we are reduced to proving that J ({ C ∪C A , {p(. . . , A, . . . )}, F p , C , {#A:CA [p(. . . , A, . . . )], c(. . . )}, F# }) = ∑ ground(c (... )) J ({ C ∪C A , {p(. . . , A, . . . )}, F p , C ∪C A , {p(. . . , A, . . . ), c (. . . )}, F1 , C , {c(. . . ), c (. . . )}, F∆ }). The above is equivalent to proving that J ({ C , {#A:CA [p(. . . , A, . . . )], c(. . . )}, F# }) = ∑ ground(c (... )) (3.10) J ({ C ∪C A , {p(. . . , A, . . . ), c (. . . )}, F1 , C , {c(. . . ), c (. . . )}, F∆ }) . We are going to transform the right hand side of the Equation 3.10 into the left hand side. We start with counting over logical variable A. By Proposition 2.6 ∑ J ({ C ∪C A , {p(. . . , A, . . . ), c (. . . )}, F1 , C , {c(. . . ), c (. . . )}, F∆ }) = ∑ J ({ C , {#A:CA [p(. . . , A, . . . )], c (. . . )}, F2 , C , {c(. . . ), c (. . . )}, F∆ }) . ground(c (... )) ground(c (... )) F2 is a factor from the Cartesian product range(#A:CA [p(. . . , A, . . . )]) × range(c ) to a set {0, 1} that, given an assignment of values v to all random variables but ground(p(. . . , A, . . . )) : C ∪ CA , is defined as follows: F2 (h(), v(c (. . . ))) = 1, if max x∈range(p),h(x)>0 0, otherwise. 72 x v(c (. . . )); Next, we perform multiplication. By Proposition 2.3 ∑ J ({ C , {#A:CA [p(. . . , A, . . . )], c (. . . )}, F2 , C , {c(. . . ), c (. . . )}, F∆ }) = ∑ J ({ C , {#A:CA [p(. . . , A, . . . )], c(. . . ), c (. . . )}, F3 }), ground(c (... )) ground(c (... )) F3 = F2 F∆ is a factor from the Cartesian product range(#A:CA [p(. . . , A, . . . 
)]) × range(c) × range(c′) to the set {−1, 0, 1} that, given an assignment of values v to all random variables but ground(p(. . . , A, . . . )) : C ∪ C_A, is defined as follows:

F3(h(), v(c(. . . )), v(c′(. . . ))) =
  1,   if max_{x∈range(p), h(x)>0} x ≼ v(c′(. . . )) ∧ v(c(. . . )) = v(c′(. . . ));
  −1,  if max_{x∈range(p), h(x)>0} x ≼ v(c′(. . . )) ∧ v(c(. . . )) = s(v(c′(. . . )));
  0,   otherwise;

= (in the first case v(c(. . . )) = v(c′(. . . )), so we replace v(c′(. . . )) by v(c(. . . )) in the inequality)

  1,   if max_{x∈range(p), h(x)>0} x ≼ v(c(. . . )) ∧ v(c(. . . )) = v(c′(. . . ));
  −1,  if max_{x∈range(p), h(x)>0} x ≼ v(c′(. . . )) ∧ v(c(. . . )) = s(v(c′(. . . )));
  0,   otherwise;

= (in the second case v(c(. . . )) = s(v(c′(. . . ))), so we replace v(c′(. . . )) by v(c(. . . )) and ≼ by ≺ in the inequality)

  1,   if max_{x∈range(p), h(x)>0} x ≼ v(c(. . . )) ∧ v(c(. . . )) = v(c′(. . . ));
  −1,  if max_{x∈range(p), h(x)>0} x ≺ v(c(. . . )) ∧ v(c(. . . )) = s(v(c′(. . . )));
  0,   otherwise;

= (we split the first case into two cases)

  1,   if max_{x∈range(p), h(x)>0} x = v(c(. . . )) ∧ v(c(. . . )) = v(c′(. . . ));
  1,   if max_{x∈range(p), h(x)>0} x ≺ v(c(. . . )) ∧ v(c(. . . )) = v(c′(. . . ));
  −1,  if max_{x∈range(p), h(x)>0} x ≺ v(c(. . . )) ∧ v(c(. . . )) = s(v(c′(. . . )));
  0,   otherwise,   (3.11)

where h() is a histogram from range(#A:C_A[p(. . . , A, . . . )]).

Finally, we perform summation. By Proposition 2.1

Σ_{ground(c′(. . . ))} J({⟨C, {#A:C_A[p(. . . , A, . . . )], c(. . . ), c′(. . . )}, F3⟩}) = J({⟨C, {#A:C_A[p(. . . , A, . . . )], c(. . . )}, Σ_{c′(. . . )} F3⟩}).

The factor Σ_{c′(. . . )} F3 is a factor from the Cartesian product range(#A:C_A[p(. . . , A, . . . )]) × range(c) to the reals. Let us analyze the factor F3 as it is defined via the four cases in expression (3.11). We fix the value of #A:C_A[p(. . . , A, . . . )] to h() and the value of c(. . . ) to y. Assume that h() and y satisfy the equality in the first case from (3.11).
There is only one value in range(c′) equal to y, and for it F3 has value 1; for all other values of range(c′), F3 has value 0. Next, assume that h() and y satisfy the inequality in the second and third cases from (3.11). Note that, since there exists a value x ∈ range(p) ⊂ range(c′) such that x ≺ y, y is the successor of exactly one element from range(c′). When c′(. . . ) is equal to this element, F3 has value −1. When c′(. . . ) is equal to y, F3 has value 1. For all other values from range(c′), F3 has value 0. Finally, for all other h() and y, regardless of the value of c′(. . . ), F3 returns the value 0.

Based on the above analysis, given an assignment of values v to all random variables but ground(p(. . . , A, . . . )) : C ∪ C_A, we have

Σ_{c′(. . . )} F3(h(), v(c(. . . ))) =
  1,  if max_{x∈range(p), h(x)>0} x = v(c(. . . ));
  0,  otherwise,
= F#(h(), v(c(. . . ))),

where h() is a histogram from range(#A:C_A[p(. . . , A, . . . )]). The above gives us J({⟨C, {#A:C_A[p(. . . , A, . . . )], c(. . . )}, Σ_{c′(. . . )} F3⟩}) = J({⟨C, {#A:C_A[p(. . . , A, . . . )], c(. . . )}, F#⟩}), which finishes the proof of Equation 3.10 and the proof of the proposition.

Example 3.10 illustrates the decomposition introduced in Proposition 3.2.

Example 3.10. Let us consider the aggregation parfactor introduced in Example 3.8: ⟨∅, matched_6(Person), jackpot_won(), 1, OR, ∅⟩. The conversion described in Proposition 3.2 introduces an auxiliary parameterized random variable jackpot_won′(), where jackpot_won′ has range {false, true}. The aggregation parfactor is replaced with two parfactors: ⟨∅, {matched_6(Person), jackpot_won′()}, F_jackpot_won′⟩ and ⟨∅, {jackpot_won(), jackpot_won′()}, F∆⟩. F_jackpot_won′ is a factor from the Cartesian product {false, true} × {false, true} to the real numbers:

matched_6(Person)  jackpot_won′()  value
false              false           1
false              true            1
true               false           0
true               true            1
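Together with the factor F∆ given next, this factor reproduces the OR aggregation exactly once the auxiliary variable is summed out. A numeric check of Proposition 3.2 for this case; the population size and the helper names are illustrative assumptions:

```python
import math
from itertools import product

# Numeric check of the Diez-Galan-style conversion for the OR case of
# Example 3.10; the population size and helper names are illustrative.
RANGE = [False, True]  # range(c), totally ordered with False < True

def successor(x):
    """s(): successor under the ordering; True has no successor."""
    return True if x is False else None

def F_c(p_val, c_aux):
    """First factor: 1 when the parent value precedes-or-equals c', else 0."""
    return 1 if p_val <= c_aux else 0

def F_delta(c_val, c_aux):
    """Second factor: 1 on the diagonal, -1 one step below it, 0 elsewhere."""
    if c_val == c_aux:
        return 1
    if c_val == successor(c_aux):
        return -1
    return 0

n = 3  # hypothetical population size of Person
for parents in product(RANGE, repeat=n):
    for c_val in RANGE:
        # Sum out the auxiliary variable jackpot_won'() as in Proposition 3.2.
        total = sum(F_delta(c_val, c_aux) *
                    math.prod(F_c(p, c_aux) for p in parents)
                    for c_aux in RANGE)
        expected = 1 if any(parents) == c_val else 0
        assert total == expected  # the pair of factors reproduces OR exactly
```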
F∆ is a factor from the Cartesian product {false, true} × {false, true} to the set {−1, 0, 1}:

jackpot_won()  jackpot_won′()  value
false          false           1
false          true            0
true           false           −1
true           true            1

Note that the size of the factors F_jackpot_won′ and F∆ is independent of | D(Person) |.

An analogous proposition holds for the MIN operator. In both cases, as illustrated by Examples 3.9 and 3.10 and the experiments in Section 3.5, the above conversion is advantageous compared to the conversion described in Section 3.4.1.1, which uses counting formulas.

3.4.2 Operations on aggregation parfactors

In the previous section we showed how aggregation parfactors can be used during a modeling phase; then, during inference with the C-FOVE algorithm, once populations are known, aggregation parfactors can be converted to parfactors. Such a solution allows us to take advantage of the modeling properties of aggregation parfactors and of C-FOVE's inference capabilities. It is also possible to exploit aggregation parfactors during inference. In this section we describe operations on aggregation parfactors that can be added to the C-FOVE algorithm. These operations can delay or even avoid conversion of aggregation parfactors to parfactors involving counting formulas. This in turn, as we will see in Section 3.5, can result in more efficient inference.

3.4.2.1 Splitting

The C-FOVE algorithm applies substitutions to parfactors to handle observations and queries and to enable the multiplication of parfactors. As this operation results in the creation of a residual parfactor, it is called splitting. Below we present how aggregation parfactors can be split on substitutions. We start with splitting on a substitution that does not involve the aggregation logical variable:

Proposition 3.3. Let gA = ⟨C, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, C_A⟩ be an aggregation parfactor from Φ. Let {X/t} be a substitution such that (X ≠ t) ∉ C and X ∈ param(c(. . . )).
Let the term t be a constant from D(X) or a logical variable such that t ∈ param(c(. . . )). Let gA[X/t] be the parfactor gA with all occurrences of X replaced by the term t. Then

J(Φ) = J(Φ \ {gA} ∪ {gA[X/t], ⟨C ∪ {X ≠ t}, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, C_A⟩}).

Proof. It suffices to show that the set of factors represented by the aggregation parfactor gA is equal to the union of the sets of factors represented by the aggregation parfactors gA[X/t] and ⟨C ∪ {X ≠ t}, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, C_A⟩. From Definition 3.2 we know that gA represents a set of factors, one for each ground substitution G to all logical variables in param(c(. . . )) that satisfies the constraints in C.

Assume that the term t is a constant. Each ground substitution G either substitutes X with t or substitutes X with some other constant from D(X). The former substitutions result in a set of factors equal to the set of factors represented by gA[X/t], while the latter substitutions result in a set of factors equal to the set of factors represented by ⟨C ∪ {X ≠ t}, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, C_A⟩.

Assume that the term t is a logical variable. Each ground substitution G either substitutes X and t with the same constant or substitutes X and t with different constants. The former substitutions result in a set of factors equal to the set of factors represented by gA[X/t], while the latter substitutions result in a set of factors equal to the set of factors represented by ⟨C ∪ {X ≠ t}, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, C_A⟩.

Proposition 3.3 allows us to split an aggregation parfactor on a substitution that does not involve the aggregation logical variable. Below we show how to split on a substitution that involves the aggregation logical variable A and a constant. After such an operation the individuals from D(A) : C are represented in two data structures: an aggregation parfactor and a standard parfactor. We have to make sure that after splitting c(. . .
) is still equal to a ⊗-based aggregation over the whole D(A) : C . The following proposition describes how it can be done using an auxiliary parameterized random variable: 77 Proposition 3.4. Let gA = C , p(. . . , A, . . . ), c(. . . ), F p , ⊗, CA be an aggregation parfactor from Φ. Let {A/t} be a substitution such that (A = t) ∈ / CA and term t is a constant from D(A), or a logical variable from param(p(. . . , A, . . . )) \{A}. Let c (. . . ) be an auxiliary parameterized random variable with the same param- eterization and range as c(. . . ). Let CA [A/t] be a set of constraints CA with all occurrences of A replaced by term t. Let Fc be a factor from the Cartesian product range(p) × range(c ) × range(c) to real numbers. Given an assignment of values to random variables v, Fc is defined as follows: rc F r p (p(. . . ,t, . . . )), if v(c(. . . )) = p Fc (v(p(. . . , A, . . . )), v(c (. . . )), v(c(. . . )))= v(p(. . . ,t, . . . )) ⊗ v(c (. . . )); 0, otherwise, where r p = | ground(p(. . . , a, . . . )) : C |, a ∈ D(A) : CA , and rc = | ground(c(. . . )) : C |. Then J (Φ) = ∑ ground(c (... )) J (Φ \ {gA } ∪ { C , p(. . . , A, . . . ), c (. . . ), F p , ⊗, CA ∪{A = t} , C ∪C A [A/t], {p(. . . ,t, . . . ), c (. . . ), c(. . . )}, Fc }). Proof. It suffices to show that J ({gA }) = J ( C , p(. . . , A, . . . ), c(. . . ), F p , ⊗, C A ) (3.12) is equal to ∑ ground(c (... )) J ({ C , p(. . . , A, . . . ), c (. . . ), F p , ⊗, CA ∪{A = t} , C ∪ CA [A/t], {p(. . . ,t, . . . ), c (. . . ), c(. . . )}, Fc }). 78 (3.13) We start with expression (3.12) and convert the aggregation parfactor to parfactors. By Proposition 3.1 J ({ C , p(. . . , A, . . . ), c(. . . ), F p , ⊗, C A }) = J ({ C ∪C A , {p(. . . , A, . . . )}, F p , C , {#A:CA [p(. . . , A, . . . )], c(. . . )}, F#1 }) . (3.14) F#1 is a factor from the Cartesian product range(#A:CA [p(. . . , A, . . . )]) × range(c) to a set {0, 1} that, given an assignment of values v to all random variables but ground(p(. . . 
, A, . . . )) : C ∪ CA , is defined as follows: F#1 (h(), v(c(. . . ))) = 1, if h(x) x∈range(p) i=1 x = v(c(. . . )); (3.15) 0, otherwise, where h() is a histogram from range(#A:CA [p(. . . , A, . . . )]). Next, we transform expression (3.13). We convert the aggregation parfactor to parfactors. By Proposition 3.1 ∑ ground(c (... )) J ({ C , p(. . . , A, . . . ), c (. . . ), F p , ⊗, C A ∪ {A = t} , C ∪C A [A/t], {p(. . . ,t, . . . ), c (. . . ), c(. . . )}, Fc }) = ∑ ground(c (... )) J ({ C ∪C A ∪ {A = t}, {p(. . . , A, . . . )}, F p , C , {#A:CA ∪{A=t} [p(. . . , A, . . . )], c (. . . )}, F#2 , C ∪C A [A/t], {p(. . . ,t, . . . ), c (. . . ), c(. . . )}, Fc }), (3.16) where F#2 is a factor from the Cartesian product range(#A:CA ∪{A=t} [p(. . . , A, . . . )])× range(c ) to a set {0, 1} that, given an assignment of values v to all random variables but ground(p(. . . , A, . . . )) : C ∪ CA ∪{A = t}, is defined as follows: F#2 (h(), v(c (. . . ))) = 1, if h(x) x∈range(p) i=1 0, otherwise, 79 x = v(c (. . . )); where h() is a histogram from range(#A:CA ∪{A=t} [p(. . . , A, . . . )]). Further, the third parfactor from (3.16) can be represented as a product of two parfactors. By Proposition 2.3 ∑ ground(c (... )) J ({ C ∪C A ∪ {A = t}, {p(. . . , A, . . . )}, F p , C , {#A:CA ∪{A=t} [p(. . . , A, . . . )], c (. . . )}, F#2 , C ∪C A [A/t], {p(. . . ,t, . . . ), c (. . . ), c(. . . )}, Fc }) = ∑ ground(c (... )) J ({ C ∪C A ∪ {A = t}, {p(. . . , A, . . . )}, F p , C , {#A:CA ∪{A=t} [p(. . . , A, . . . )], c (. . . )}, F#2 , rc r p rc C ∪C A [A/t], {p(. . . ,t, . . . )}, F p r p , C ∪C A [A/t], {p(. . . ,t, . . . ), c (. . . ), c(. . . )}, Fc2 }) = ∑ ground(c (... )) J ({ C ∪C A ∪ {A = t}, {p(. . . , A, . . . )}, F p , C , {#A:CA ∪{A=t} [p(. . . , A, . . . )], c (. . . )}, F#2 , C ∪C A [A/t], {p(. . . ,t, . . . )}, F p , C ∪C A [A/t], {p(. . . ,t, . . . ), c (. . . ), c(. . . 
)}, Fc2 }), (3.17) Fc2 is a factor from the Cartesian product range(p) × range(c ) × range(c) to real numbers that, given an assignment of values to random variables v, is defined as follows: 1, if v(c(. . . )) = Fc2 (v(p(. . . , A, . . . )), v(c (. . . )), v(c(. . . ))) = v(p(. . . ,t, . . . )) ⊗ v(c (. . . )); 0, otherwise. 80 The first and the third parfactor from (3.17) can be combined into one parfactor (as if we were reversing a splitting operation). By Proposition 2.4 ∑ ground(c (... )) J ({ C ∪C A ∪ {A = t}, {p(. . . , A, . . . )}, F p , C , {#A:CA ∪{A=t} [p(. . . , A, . . . )], c (. . . )}, F#2 , C ∪C A [A/t], {p(. . . ,t, . . . )}, F p , C ∪C A [A/t], {p(. . . ,t, . . . ), c (. . . ), c(. . . )}, Fc2 }) = ∑ ground(c (... )) J ({ C ∪C A , {p(. . . , A, . . . )}, F p , C , {#A:CA ∪{A=t} [p(. . . , A, . . . )], c (. . . )}, F#2 , C ∪C A [A/t], {p(. . . ,t, . . . ), c (. . . ), c(. . . )}, Fc2 }). (3.18) After replacing expression (3.12) with (3.14) and expression (3.13) with (3.18) we are reduced to proving that J ({ C ∪C A , {p(. . . , A, . . . )}, F p , C , {#A:CA [p(. . . , A, . . . )], c(. . . )}, F#1 }) = ∑ ground(c (... )) J ({ C ∪C A , {p(. . . , A, . . . )}, F p , C , {#A:CA ∪{A=t} [p(. . . , A, . . . )], c (. . . )}, F#2 , C ∪C A [A/t], {p(. . . ,t, . . . ), c (. . . ), c(. . . )}, Fc2 }). The above is equivalent to proving that J ({ C , {#A:CA [p(. . . , A, . . . )], c(. . . )}, F#1 }) = ∑ ground(c (... )) J ({ C , {#A:CA ∪{A=t} [p(. . . , A, . . . )], c (. . . )}, F#2 , C ∪C A [A/t], {p(. . . ,t, . . . ), c (. . . ), c(. . . )}, Fc2 }). 81 (3.19) We are going to transform the right hand side of the Equation 3.19 into the left hand side. We start with multiplying the two parfactors. By Proposition 2.3 ∑ ground(c (... )) J ({ C , {#A:CA ∪{A=t} [p(. . . , A, . . . )], c (. . . )}, F#2 , C ∪C A [A/t], {p(. . . ,t, . . . ), c (. . . ), c(. . . )}, Fc2 }) = ∑ ground(c (... )) J ({ C ∪C A [A/t], {#A:CA ∪{A=t} [p(. . . , A, . . . )], p(. . . ,t, . . . 
), c (. . . ), c(. . . )}, F#3 }), (3.20) where F#3 is a factor from the Cartesian product range(#A:CA ∪{A=t} [p(. . . , A, . . . )])× range(p) × range(c ) × range(c) to a set {0, 1} that, given an assignment of values v to all random variables but ground(p(. . . , A, . . . )) : C ∪ CA ∪{A = t}, is defined as follows: F#3 (h(), v(p(. . . ,t, . . . )), v(c (. . . )), v(c(. . . ))) = h(x) 1, if x = v(c (. . . )) ∧ v(c(. . . )) = v(p(. . . ,t, . . . )) ⊗ v(c (. . . )); x∈range(p) i=1 0, otherwise; 1, if h(x) x∈range(p) i=1 x = v(c (. . . )) ∧ h(x) x∈range(p) i=1 = x ⊗ v(p(. . . ,t, . . . )) = v(c(. . . )), 0, otherwise, (3.21) where h() is a histogram from range(#A:CA ∪{A=t} [p(. . . , A, . . . )]). We continue transformation by performing summation in (3.20). By Proposi- tion 2.1 ∑ ground(c (... )) J ({ C ∪C A [A/t], {#A:CA ∪{A=t} [p(. . . , A, . . . )], p(. . . ,t, . . . ), c (. . . ), c(. . . )}, F#3 }) = J ({ C ∪C A [A/t], {#A:CA ∪{A=t} [p(. . . , A, . . . )], p(. . . ,t, . . . ), c(. . . )}, ∑c (... ) F#3 }). (3.22) 82 ∑c (... ) F#3 is a factor from the Cartesian product range(#A:CA ∪{A=t} [p(. . . , A, . . . )])× range(p) × range(c) to a set {0, 1}. Let us consider factor F#3 as it is defined in (3.21). If we fix value of #A:CA [p(. . . , A, . . . )] to h(), there is only one value y in range(c ) such that h(x) i=1 x = y. Therefore, given an assignment of values v to all ran- dom variables but ground(p(. . . , A, . . . )) : C ∪ CA ∪{A = t}, ∑c (... ) F#3 is defined as follows: ∑c (... ) F#3 (h(), v(p(. . . ,t, . . . )), v(c(. . . ))) = 1, if h(x) x∈range(p) i=1 x ⊗ v(p(. . . ,t, . . . )) = v(c(. . . )), (3.23) 0, otherwise, where h() is a histogram from range(#A:CA ∪{A=t} [p(. . . , A, . . . )]). We will denote ∑c (... ) F#3 by F#4 . Let us define factor F#5 from the Cartesian product range(#A:CA [p(. . . , A, . . . 
)]) × range(c) to the set {0, 1} as follows:

F#5(h′(), y) = F#4(h(), x, y),   (3.24)

where x ∈ range(p), y ∈ range(c), the histogram h() ∈ range(#A:C_A∪{A≠t}[p(. . . , A, . . . )]), the histogram h′() ∈ range(#A:C_A[p(. . . , A, . . . )]), and h′() is obtained by taking h() and adding 1 to the count for the value x. From (3.15), (3.23) and (3.24) we have F#5 = F#1. Note that Equation 3.24 resembles Equation 2.8 from Proposition 2.5. Indeed, we can further transform (3.22) by applying Proposition 2.5, as if we were reversing an expansion of a counting formula:

J({⟨C ∪ C_A[A/t], {#A:C_A∪{A≠t}[p(. . . , A, . . . )], p(. . . , t, . . . ), c(. . . )}, F#4⟩}) = J({⟨C, {#A:C_A[p(. . . , A, . . . )], c(. . . )}, F#5⟩}) = J({⟨C, {#A:C_A[p(. . . , A, . . . )], c(. . . )}, F#1⟩}).

This finishes the proof of Equation 3.19 and the proof of the proposition.

Example 3.11 illustrates Proposition 3.4.

Example 3.11. Consider Example 3.2. As discussed in Example 3.8, the conditional probability distribution P(best_match() | matched(Person)) can be represented with ⟨∅, matched(Person), best_match(), 1, MAX, ∅⟩. Assume that we have observed that sylwia matched 5 numbers. Before we can project this observation onto our model, we need to split, among others, the above aggregation parfactor on the substitution {Person/sylwia}. We follow the procedure described in Proposition 3.4 and introduce an auxiliary parameterized random variable best_match′(). It is equal to the maximum number of matched lottery numbers among all individuals from the population of the logical variable Person except for sylwia. An aggregation parfactor ⟨∅, matched(Person), best_match′(), 1, MAX, {Person ≠ sylwia}⟩ captures this dependency. Recall that the parameterized random variable best_match() is equal to the maximum number of matched lottery numbers among all individuals from the population of the logical variable Person, including sylwia. Hence, it is the maximum of matched(sylwia) and best_match′().
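The relationship just described is easy to check concretely. A small sketch, assuming made-up match counts for a hypothetical four-person population (Example 3.11 itself fixes only sylwia's count):

```python
import random

# Sketch of the split in Example 3.11: pulling sylwia out of the MAX aggregation.
# The population and the match counts are made up for illustration; only
# sylwia's count of 5 comes from the example.
random.seed(0)
people = ["jan", "magda", "sylwia", "piotr"]
matched = {person: random.randint(0, 6) for person in people}
matched["sylwia"] = 5  # the observation from Example 3.11

# MAX aggregation over the whole population of Person ...
best_match = max(matched[person] for person in people)

# ... equals the observed individual combined with the auxiliary aggregate
# best_match'() over the residual population (Person != sylwia).
best_match_rest = max(matched[person] for person in people if person != "sylwia")
assert best_match == max(matched["sylwia"], best_match_rest)
```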
This relation is represented by a parfactor ⟨∅, {matched(sylwia), best_match′(), best_match()}, F_best_match⟩, where F_best_match is a factor from the Cartesian product {0, 1, . . . , 6} × {0, 1, . . . , 6} × {0, 1, . . . , 6} to the real numbers; an entry has the value 1 if best_match() = MAX(matched(sylwia), best_match′()) and the value 0 otherwise:

matched(sylwia)  best_match′()  best_match()  value
0                0              0             1
0                0              1             0
. . .
0                6              6             1
. . .
6                6              5             0
6                6              6             1

The splitting presented in Proposition 3.4 corresponds to the expansion of a counting formula in C-FOVE. The case where a substitution is of the form {X/A} can be handled in a similar fashion to that described in Proposition 3.4.

3.4.2.2 Multiplication

The C-FOVE algorithm multiplies parfactors to enable elimination of parameterized random variables. An aggregation parfactor can be multiplied by a parfactor on p(. . . , A, . . . ):

Proposition 3.5. Let gA = ⟨C, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, C_A⟩ be an aggregation parfactor from Φ and g1 = ⟨C ∪ C_A, {p(. . . , A, . . . )}, F1⟩ be a parfactor from Φ. Let g2 = ⟨C, p(. . . , A, . . . ), c(. . . ), F_p F1, ⊗, C_A⟩. Then

J(Φ) = J(Φ \ {gA, g1} ∪ {g2}).

We call g2 the product of gA and g1.

Proof. It suffices to show that

J({gA, g1}) = J({⟨C, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, C_A⟩, ⟨C ∪ C_A, {p(. . . , A, . . . )}, F1⟩})   (3.25)

is equal to

J({⟨C, p(. . . , A, . . . ), c(. . . ), F_p F1, ⊗, C_A⟩}).   (3.26)

We start with expression (3.25). First we convert the aggregation parfactor gA to parfactors, and then multiply one of the resulting parfactors by g1:

J({⟨C, p(. . . , A, . . . ), c(. . . ), F_p, ⊗, C_A⟩, ⟨C ∪ C_A, {p(. . . , A, . . . )}, F1⟩}) =
(by Proposition 3.1)
J({⟨C ∪ C_A, {p(. . . , A, . . . )}, F_p⟩, ⟨C, {#A:C_A[p(. . . , A, . . . )], c(. . . )}, F#1⟩, ⟨C ∪ C_A, {p(. . . , A, . . . )}, F1⟩}) =
(by Proposition 2.3)
J({⟨C ∪ C_A, {p(. . . , A, . . . )}, F_p F1⟩, ⟨C, {#A:C_A[p(. . . , A, . . . )], c(. . . )}, F#1⟩}).
F#1 is a factor from the Cartesian product range(#A:CA[p(. . . , A, . . . )]) × range(c) to the set {0, 1} that, given an assignment of values v to all random variables but ground(p(. . . , A, . . . )) : C ∪ CA, is defined as follows:

F#1(h(), v(c(. . . ))) = 1, if ⊗x∈range(p) ⊗i=1,...,h(x) x = v(c(. . . )); 0, otherwise,  (3.27)

where h() is a histogram from range(#A:CA[p(. . . , A, . . . )]). By Proposition 3.1,

J({⟨C ∪ CA, {p(. . . , A, . . . )}, Fp · F1⟩, ⟨C, {#A:CA[p(. . . , A, . . . )], c(. . . )}, F#1⟩})

is equal to expression (3.26), which finishes the proof of the proposition.

Below we provide an example of multiplication between an aggregation parfactor and a parfactor.

Example 3.12. Consider the model from Figure 3.3 and Example 3.4. The probability distributions P(played(Person)) and P(matched_6(Person)|played(Person)) can be represented with a parfactor ⟨∅, {played(Person)}, Fplayed⟩ and a parfactor ⟨∅, {played(Person), matched_6(Person)}, Fmatched_6⟩, respectively. Fplayed is a factor from the set {false, true} to the reals:

played(Person)   value
false            0.95
true             0.05

Fmatched_6 is a factor from the Cartesian product {false, true} × {false, true} to the reals:

played(Person)   matched_6(Person)   value
false            false               1.00000000
false            true                0.00000000
true             false               0.99999993
true             true                0.00000007

As shown in Example 3.8, P(jackpot_won()|matched_6(Person)) can be represented with ⟨∅, matched_6(Person), jackpot_won(), 1, OR, ∅⟩. Let Φ be the set of the three above parfactors. Assume that we want to compute Jground(jackpot_won())(Φ).

Figure 3.6: Decomposed aggregation: a binary tree in which p(. . . , a1, . . . ), . . . , p(. . . , an, . . . ) are combined pairwise with ⊗, through intermediate results ci,j(. . . ), into c(. . . ) = clog2 n,1(. . . ).
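The first elimination step of Example 3.12, multiplying Fplayed by Fmatched_6 and summing out played(Person), can be checked numerically with a short sketch (factor values taken from the tables above; the dictionary encoding is ours):

```python
F_played = {False: 0.95, True: 0.05}
F_matched_6 = {(False, False): 1.00000000, (False, True): 0.00000000,
               (True, False): 0.99999993, (True, True): 0.00000007}

# Pointwise product followed by summing out played(Person).
F_sum = {m: sum(F_played[p] * F_matched_6[(p, m)] for p in (False, True))
         for m in (False, True)}
```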
We multiply the first two parfactors, sum out the random variables from the set ground(played(Person)) and obtain ⟨∅, {matched_6(Person)}, Fsum⟩, where Fsum is a factor from the set {false, true} to the reals:

matched_6(Person)   value
false               0.9999999965
true                0.0000000035

Now we need to multiply ⟨∅, matched_6(Person), jackpot_won(), 1, OR, ∅⟩ by ⟨∅, {matched_6(Person)}, Fsum⟩. We apply the result of Proposition 3.5 and obtain an aggregation parfactor ⟨∅, matched_6(Person), jackpot_won(), Fsum, OR, ∅⟩. This simple example involved a trivial factor multiplication 1 · Fsum = Fsum; in general, we might need to multiply two non-trivial factors.

3.4.2.3 Summing out

The C-FOVE algorithm sums out random variables to compute the marginal. Below we show how in some cases we can sum out p(. . . , A, . . . ) directly from an aggregation parfactor ⟨∅, p(. . . , A, . . . ), c(. . . ), Fp, ⊗, CA⟩.

Consider an aggregation parfactor ⟨∅, p(. . . , A, . . . ), c(. . . ), Fp, ⊗, ∅⟩ and assume that param(c(. . . )) = param(p(. . . , A, . . . )) \ {A}. Operator ⊗ is associative, so we can decompose the aggregation into a binary tree of applications of the operator ⊗. For simplicity of the discussion, let us first assume that n = |D(A) : CA| is a power of two. Figure 3.6 illustrates this case. Let ci,j be a functor with range equal to the range of the functor c and ci,j(. . . ) be a parameterized random variable such that param(ci,j(. . . )) = param(p(. . . , A, . . . )) \ {A}, for i = 0, . . . , log2 n; j = 1, . . . , n/2^i. Let

c0,j(. . . ) = p(. . . , aj, . . . ), for j = 1, . . . , n;
ci,j(. . . ) = ci−1,2j−1(. . . ) ⊗ ci−1,2j(. . . ), for i = 1, . . . , log2 n; j = 1, . . . , n/2^i;
c(. . . ) = clog2 n,1(. . . ).

As desired, we obtain:

c(. . . ) = clog2 n,1(. . . ) = c(log2 n)−1,1(. . . ) ⊗ c(log2 n)−1,2(. . . ) = · · · = ⊗j=1,...,n c0,j(. . . ) = ⊗j=1,...,n p(. . . , aj, . . . ).

When p(. . . , A, . . .
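When the ground instances of p(. . . , A, . . . ) are independent and n is a power of two, every level of the tree is the same factor combined with itself, so log2 n squarings suffice. A Python sketch for OR-based aggregation with n = 4 (the single-instance factor F_p is a made-up example), compared against brute-force enumeration of the grounding:

```python
from itertools import product

def square(F, op):
    # One level of the tree: combine a factor over range(c) with itself.
    out = {x: 0.0 for x in F}
    for y, z in product(F, repeat=2):
        out[op(y, z)] += F[y] * F[z]
    return out

OR = lambda y, z: y or z
F_p = {False: 0.9, True: 0.1}  # hypothetical factor for one instance

# Two squarings aggregate n = 4 independent instances.
F_agg = square(square(F_p, OR), OR)

# Brute force over all 2^4 ground assignments, for comparison.
brute = {False: 0.0, True: 0.0}
for vals in product([False, True], repeat=4):
    prob = 1.0
    for v in vals:
        prob *= F_p[v]
    brute[any(vals)] += prob
```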
) represents a set of random variables that can be treated as independent, the results at each level of the tree shown in Figure 3.6 are identical, therefore we need to compute them only once. When n = |D(A) : CA| is an arbitrary natural number, we can use a square-and-multiply method [Piṅgala, 200 B.C.], whose time complexity is logarithmic in |D(A) : CA|, to eliminate p(. . . , A, . . . ) from an aggregation parfactor. This method is formalized in Proposition 3.6.

Proposition 3.6. Let gA = ⟨C, p(. . . , A, . . . ), c(. . . ), Fp, ⊗, CA⟩ be an aggregation parfactor from Φ. Assume that the set of constraints C ∪ CA is in normal form and that param(c(. . . )) = param(p(. . . , A, . . . )) \ {A}. Assume that no other parfactor or aggregation parfactor in Φ involves parameterized random variables that represent random variables from ground(p(. . . , A, . . . )). Let m = ⌊log2 |D(A) : CA|⌋ and bm . . . b0 be the binary representation of |D(A) : CA|. Let (F0, . . . , Fm) be a sequence of factors from the range of c to the reals, defined recursively as follows:

F0(x) = Fp(x), if x ∈ range(p); 0, otherwise;

Fk(x) = ∑y,z∈range(c), y⊗z=x Fk−1(y) · Fk−1(z), if bm−k = 0;
Fk(x) = ∑w,y,z∈range(c), w⊗y⊗z=x Fp(w) · Fk−1(y) · Fk−1(z), if bm−k = 1.

Then

∑ground(p(. . . , A, . . . )) J(Φ) = J(Φ \ {gA} ∪ {⟨C, {c(. . . )}, Fm⟩}).

Note that F0, . . . , Fm are functions stored as factors. Therefore the expression Fk−1(y) · Fk−1(z) requires only one recursive step that computes the factor Fk−1. Once Fk−1 is computed, it is applied twice, to y and to z. In the worst case (when the binary representation of |D(A) : CA| is 11 . . . 1), elimination of p(. . . , A, . . . ) from an aggregation parfactor ⟨C, p(. . . , A, . . . ), c(. . . ), Fp, ⊗, CA⟩ requires 2⌊log2 |D(A) : CA|⌋ |range(c)|³ applications of the ⊗ operator, the same number of multiplications, and ⌊log2 |D(A) : CA|⌋ (|range(c)|³ − 1) additions. The example below illustrates Proposition 3.6.

Example 3.13.
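The square-and-multiply elimination of Proposition 3.6 can be sketched in Python, assuming factors over range(c) are stored as dictionaries and ⊗ is passed in as a function (all names here are ours). The loop consumes the bits of n = |D(A) : CA| from the most significant bit down, exactly as in the recursive definition above; the values used below come from Example 3.13.

```python
from itertools import product

def sum_out_aggregation(F_p, op, n):
    """Eliminate p(..., A, ...) from an aggregation parfactor with factor
    F_p, operator op and n = |D(A):C_A| (sketch of Proposition 3.6)."""
    bits = bin(n)[2:]        # binary representation b_m ... b_0, b_m = 1
    F = dict(F_p)            # F_0; the leading bit is accounted for here
    for b in bits[1:]:
        G = {x: 0.0 for x in F}
        if b == '0':         # F_k(x) = sum over y ⊗ z = x
            for y, z in product(F, repeat=2):
                G[op(y, z)] += F[y] * F[z]
        else:                # F_k(x) = sum over w ⊗ y ⊗ z = x
            for w, y, z in product(F, repeat=3):
                G[op(w, op(y, z))] += F_p[w] * F[y] * F[z]
        F = G
    return F

OR = lambda y, z: y or z
F_sum = {False: 0.9999999965, True: 0.0000000035}
F2 = sum_out_aggregation(F_sum, OR, 5)   # n = |D(Person)| = 5
```

With five individuals this reproduces P(jackpot_won() = true) ≈ 0.0000000175; for n = 19771128 the same routine needs only a logarithmic number of combination steps.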
We continue Example 3.12 and apply Proposition 3.6 to eliminate matched_6(Person) from ⟨∅, matched_6(Person), jackpot_won(), Fsum, OR, ∅⟩, where Fsum is a factor from the set {false, true} to the reals:

matched_6(Person)   value
false               0.9999999965
true                0.0000000035

Assume that n = |D(Person)| = 5. Thus m = 2 and b2 = 1, b1 = 0, b0 = 1. Let us compute (F0, F1, F2). We have F0 = Fsum. Since b1 = 0, F1 is defined as follows:

F1(x) = ∑y,z∈{false,true}, y OR z=x F0(y) · F0(z), where x ∈ {false, true}.

This gives

F1(false) = F0(false) · F0(false) ≈ 0.999999993
F1(true) = F0(false) · F0(true) + F0(true) · F0(false) + F0(true) · F0(true) ≈ 0.000000007.

Since b0 = 1, F2 is defined as follows:

F2(x) = ∑w,y,z∈{false,true}, w OR y OR z=x F0(w) · F1(y) · F1(z), where x ∈ {false, true}.

This gives

F2(false) = F0(false) · F1(false) · F1(false) ≈ 0.9999999825
F2(true) = F0(false) · F1(false) · F1(true) + F0(false) · F1(true) · F1(false) + F0(false) · F1(true) · F1(true) + F0(true) · F1(false) · F1(false) + F0(true) · F1(false) · F1(true) + F0(true) · F1(true) · F1(false) + F0(true) · F1(true) · F1(true) ≈ 0.0000000175,

and Jground(jackpot_won())(Φ) = J(⟨∅, {jackpot_won()}, F2⟩).

The operations presented above required 20 applications of the operator OR, 20 multiplications and 8 additions. For n = |D(Person)| = 19771128, we would need 264 applications of the operator OR, 264 multiplications and 118 additions to compute the result:

jackpot_won()   value
false           ≈ 0.933141
true            ≈ 0.066859

As expected, P(jackpot_won() = true) increases as |D(Person)| grows.

Proposition 3.6 does not allow the parameterized random variable c(. . . ) in an aggregation parfactor to have extra logical variables that are not present in the parameterized random variable p(. . . , A, . . . ). The C-FOVE algorithm handles extra logical variables by introducing counting formulas on these logical variables. Then it can proceed with standard summation.
We cannot apply the same approach to aggregation parfactors, as the newly created counting formulas could have ranges incompatible with the range of the aggregation operator. We need a special summation procedure, described below in Proposition 3.7.

Proposition 3.7. Let gA = ⟨C, p(. . . , A, . . . ), c(. . . , E, . . . ), Fp, ⊗, CA⟩ be an aggregation parfactor from Φ. Assume that the set of constraints C ∪ CA is in normal form and that param(c(. . . )) \ {E} = param(p(. . . , A, . . . )) \ {A}. Assume that no other parfactor or aggregation parfactor in Φ involves parameterized random variables that represent random variables from ground(p(. . . , A, . . . )). Let m = ⌊log2 |D(A) : CA|⌋ and bm . . . b0 be the binary representation of |D(A) : CA|. Let (F0, . . . , Fm) be a sequence of factors from the range of c to the reals, defined recursively as follows:

F0(x) = Fp(x), if x ∈ range(p); 0, otherwise;

Fk(x) = ∑y,z∈range(c), y⊗z=x Fk−1(y) · Fk−1(z), if bm−k = 0;
Fk(x) = ∑w,y,z∈range(c), w⊗y⊗z=x Fp(w) · Fk−1(y) · Fk−1(z), otherwise.

Let CE be the set of constraints from C that involve E. Let F# be a factor from the range of the counting formula #E:CE[c(. . . , E, . . . )] to the reals, defined as follows:

F#(h()) = Fm(x), if ∃x ∈ range(c) such that h(x) = |D(E) : CE|; 0, otherwise,

where h() is a histogram from range(#E:CE[c(. . . , E, . . . )]). Then

∑ground(p(. . . , A, . . . )) J(Φ) = J(Φ \ {gA} ∪ {⟨C \ CE, {#E:CE[c(. . . , E, . . . )]}, F#⟩}).

Figure 3.7: A first-order model from Example 3.14 and its equivalent belief network. Aggregation is denoted by curved arcs.

If the set C ∪ CA is not in normal form, then |D(A) : CA| might vary for different ground substitutions to all logical variables in p(. . . , A, . . . ) and we will not be
able to apply Propositions 3.6 and 3.7. We can bring the constraints in the aggregation parfactor to a normal form by splitting it on appropriate substitutions (see the algorithm presented in Figure 5.4 on page 162). Once the constraints are in normal form, |D(A) : CA| does not change across ground substitutions. Another approach is to compute |D(A) : CA| conditioned on the logical variables in p(. . . , A, . . . ) with a constraint solver and use this information when summing out p(. . . , A, . . . ) (see Section 5.2.3 and Section 5.4).

3.4.3 Generalized aggregation parfactors

Propositions 3.6 and 3.7 require that the random variables represented by p(. . . , A, . . . ) are independent. They are dependent if they have either a common ancestor in the grounding or a common observed descendant. If during inference we eliminate the common ancestor or condition on the observed descendant before we eliminate p(. . . , A, . . . ) through aggregation, we may introduce a counting formula on p(. . . , A, . . . ). While we can always delay conditioning, with the current form of aggregation parfactors we sometimes cannot delay eliminating the common ancestor.

Example 3.14. Consider the directed first-order probabilistic model and its grounding presented in Figure 3.7. It is a modification of the model from Example 3.1, represented with causal independence-based aggregation, as in Example 3.4. The new model has an additional parameterized random variable big_jackpot() with range {false, true}, which is a parent of the parameterized random variable played(Person). Assume that people are more likely to play the lottery when big_jackpot() is true. Ground instances of the parameterized random variable played(Person) are no longer independent.
The distributions P(big_jackpot()), P(played(Person)|big_jackpot()), P(matched_6(Person)|played(Person)) and P(jackpot_won()|matched_6(Person)) can be represented with parfactors:

Φ0 = { ⟨∅, {big_jackpot()}, Fbig_jackpot⟩,  [1]
       ⟨∅, {big_jackpot(), played(Person)}, Fplayed⟩,  [2]
       ⟨∅, {played(Person), matched_6(Person)}, Fmatched_6⟩,  [3]
       ⟨∅, matched_6(Person), jackpot_won(), 1, OR, ∅⟩ },  [4]

respectively. Fbig_jackpot is a factor from the set {false, true} to the reals:

big_jackpot()   value
false           0.8
true            0.2

Fplayed is a factor from the Cartesian product {false, true} × {false, true} to the reals:

big_jackpot()   played(Person)   value
false           false            0.95
false           true             0.05
true            false            0.85
true            true             0.15

Fmatched_6 is a factor from the Cartesian product {false, true} × {false, true} to the reals:

played(Person)   matched_6(Person)   value
false            false               1.00000000
false            true                0.00000000
true             false               0.99999993
true             true                0.00000007

Let n = |D(Person)| = 5 (as in Example 3.13). Assume that we want to compute Jground(jackpot_won())(Φ0), which requires eliminating three parameterized random variables: big_jackpot(), played(Person) and matched_6(Person). Eliminating big_jackpot() would introduce the counting formula #Person:∅[played(Person)], which would prevent us from performing aggregation in logarithmic time. We cannot eliminate matched_6(Person), because it is present in two parfactors, [3] and [4], which cannot be multiplied as they do not satisfy the conditions of Proposition 3.5. Our only choice is eliminating played(Person), which involves multiplying parfactors [2] and [3] and summing out played(Person) from the product, all in a lifted manner.
We obtain an updated set of parfactors:

Φ1 = { ⟨∅, {big_jackpot()}, Fbig_jackpot⟩,  [1]
       ⟨∅, {big_jackpot(), matched_6(Person)}, Fmatched_6⟩,  [5]
       ⟨∅, matched_6(Person), jackpot_won(), 1, OR, ∅⟩ },  [4]

where Fmatched_6 is a factor from the Cartesian product {false, true} × {false, true} to the reals:

big_jackpot()   matched_6(Person)   value
false           false               0.9999999965
false           true                0.0000000035
true            false               0.999999989
true            true                0.000000011

We have ∑ground(played(Person)) J(Φ0) = J(Φ1). At this stage, we are left with two parameterized random variables to eliminate: big_jackpot() and matched_6(Person). As before, eliminating big_jackpot() first would introduce a counting formula, and we cannot eliminate matched_6(Person), because it is present in two parfactors, [4] and [5], which cannot be multiplied.

Models like the one described in Example 3.14 do not allow us to apply the results of Propositions 3.6 and 3.7 and perform efficient lifted aggregation. We address this problem by introducing a generalized version of the aggregation parfactor data structure. The generalized version not only can represent an aggregation-based dependency between parameterized random variables p(. . . , A, . . . ) and c(. . . ), but also describes how this aggregation depends on a set of context parameterized random variables V. The latter dependency is captured by the factor Fp∪V, a generalized version of the factor Fp from the aggregation parfactor data structure.

Definition 3.3. A generalized aggregation parfactor is a septuple ⟨C, p(. . . , A, . . . ), c(. . . ), V, Fp∪V, ⊗, CA⟩, where

• p(. . . , A, . . . ) and c(. . . ) are parameterized random variables;
• the range of p is a subset of the range of c;
• A is the only logical variable in p(. . . , A, . . . ) that is not in c(. . . );
• V is a set of parameterized random variables such that: for any two parameterized random variables fi(. . . ), fj(. . . ) from V we have (ground(fi(. . . )) : C ∪ CA) ∩ (ground(fj(. . . )) : C ∪ CA) = ∅; and for any fi(. . . ) from V we have (ground(fi(. . . )) : C ∪ CA) ∩ (ground(p(. . . , A, . . . )) : C ∪ CA) = ∅ as well as (ground(fi(. . . )) : C ∪ CA) ∩ (ground(c(. . . )) : C) = ∅;
• C is a set of inequality constraints not involving A;
• Fp∪V is a factor from the Cartesian product of the ranges of the parameterized random variables in {p(. . . , A, . . . )} ∪ V to the reals;
• ⊗ is a commutative and associative deterministic binary operator over the range of c;
• CA is a set of inequality constraints involving A, such that (D(A) : CA) ≠ ∅.

A generalized aggregation parfactor ⟨C, p(. . . , A, . . . ), c(. . . ), V, Fp∪V, ⊗, CA⟩ represents a set of factors, one factor FpVc for each ground substitution G to all logical variables in the set ∪fi(. . . )∈V param(fi(. . . )) ∪ param(c(. . . )) that satisfies the constraints in C ∪ CA. Each factor FpVc is a mapping from the Cartesian product ×a∈D(A):CA range(p) × ×fi(. . . )∈V range(fi) × range(c) to the reals, which, given an assignment of values v to all random variables, is defined as follows:

FpVc(v(p(. . . , a1, . . . )), . . . , v(p(. . . , an, . . . )), v(f1(. . . )), . . . , v(fm(. . . )), v(c(. . . ))) = ∏a∈{a1,...,an} Fp∪V(v(p(. . . , a, . . . )), v(f1(. . . )), . . . , v(fm(. . . )))^(rp/rc), if ⊗a∈{a1,...,an} v(p(. . . , a, . . . )) = v(c(. . . )); 0, otherwise,

where D(A) : CA = {a1, . . . , an}, V = {f1(. . . ), . . . , fm(. . . )}, rc = |ground(c(. . . )) : C| and rp = |ground(p(. . . , a, . . . )) : C|, a ∈ D(A) : CA.

Propositions 3.1–3.7 from Sections 3.4.1 and 3.4.2 can be adapted to generalized aggregation parfactors. The only major changes are that a generalized aggregation parfactor can be multiplied by any parfactor on p(. . . , A, . . . ) and the context parameterized random variables, and that summing out of p(. . . , A, . . .
) from a generalized aggregation parfactor involves repeating the computation described in Section 3.4.2.3 for each value assignment to the context parameterized random variables. We illustrate this with the following example.

Example 3.15. Let us come back to Example 3.14. Assume that we performed all the steps described there, the only difference being that parfactor [4] is a generalized aggregation parfactor:

Φ1 = { ⟨∅, {big_jackpot()}, Fbig_jackpot⟩,  [1]
       ⟨∅, {big_jackpot(), matched_6(Person)}, Fmatched_6⟩,  [5]
       ⟨∅, matched_6(Person), jackpot_won(), ∅, 1, OR, ∅⟩ }.  [4]

We can now multiply parfactors [4] and [5]:

Φ2 = { ⟨∅, {big_jackpot()}, Fbig_jackpot⟩,  [1]
       ⟨∅, matched_6(Person), jackpot_won(), {big_jackpot()}, Fmatched_6, OR, ∅⟩ }.  [6]

We have J(Φ1) = J(Φ2). The above step involved a trivial multiplication 1 · Fmatched_6 = Fmatched_6; in general, we might need to multiply two non-trivial factors. Recall that n = |D(Person)| = 5 and Fmatched_6 is a factor from the Cartesian product {false, true} × {false, true} to the reals:

big_jackpot()   matched_6(Person)   value
false           false               0.9999999965
false           true                0.0000000035
true            false               0.999999989
true            true                0.000000011

Now we eliminate matched_6(Person) from the generalized aggregation parfactor [6]. The computation involves two steps, one for the case where the context parameterized random variable big_jackpot() is false and one for the case where it is true. The first step manipulates the numbers from the first two rows of the factor Fmatched_6 and is identical to the computation presented in Example 3.13. The second step manipulates the numbers from the last two rows of the factor Fmatched_6 and is otherwise identical to the first step.
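The two-step elimination just described amounts to running the square-and-multiply procedure of Proposition 3.6 once per value of the context variable big_jackpot(). A Python sketch (factor values from Fmatched_6 above; the encoding is ours):

```python
from itertools import product

def sum_out_aggregation(F_p, op, n):
    # Square-and-multiply elimination of Proposition 3.6 (Section 3.4.2.3).
    F = dict(F_p)
    for b in bin(n)[3:]:  # bits of n after the leading 1
        G = {x: 0.0 for x in F}
        if b == '0':
            for y, z in product(F, repeat=2):
                G[op(y, z)] += F[y] * F[z]
        else:
            for w, y, z in product(F, repeat=3):
                G[op(w, op(y, z))] += F_p[w] * F[y] * F[z]
        F = G
    return F

OR = lambda y, z: y or z
n = 5
F_matched_6 = {(False, False): 0.9999999965, (False, True): 0.0000000035,
               (True, False): 0.999999989, (True, True): 0.000000011}

# One elimination per assignment to the context variable big_jackpot().
F_jackpot_won = {}
for bj in (False, True):
    F_ctx = {m: F_matched_6[(bj, m)] for m in (False, True)}
    F_elim = sum_out_aggregation(F_ctx, OR, n)
    for jw in (False, True):
        F_jackpot_won[(bj, jw)] = F_elim[jw]
```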
We obtain a parfactor [7]:

Φ3 = { ⟨∅, {big_jackpot()}, Fbig_jackpot⟩,  [1]
       ⟨∅, {big_jackpot(), jackpot_won()}, Fjackpot_won⟩ },  [7]

where Fjackpot_won is a factor from the Cartesian product {false, true} × {false, true} to the reals:

big_jackpot()   jackpot_won()   value
false           false           0.9999999825
false           true            0.0000000175
true            false           0.9999999450
true            true            0.0000000550

We have ∑ground(matched_6(Person)) J(Φ2) = J(Φ3). Finally, we eliminate big_jackpot(), which involves multiplying parfactors [1] and [7] and summing out big_jackpot() from the product. We obtain an updated set of parfactors:

Φ4 = { ⟨∅, {jackpot_won()}, Fjackpot_won⟩ },  [8]

where Fjackpot_won is a factor from the set {false, true} to the reals:

jackpot_won()   value
false           0.999999975
true            0.000000025

We have J(Φ4) = Jground(jackpot_won())(Φ0).

3.5 Experiments

In this section we compare how the performance of different ways of representing aggregation in directed first-order probabilistic models scales as the population sizes of logical variables grow. We compared the performance of variable elimination (VE), variable elimination with the noisy-MAX factorization [Díez and Galán, 2003] (VE-FCT), C-FOVE, C-FOVE with the lifted noisy-MAX factorization described in Section 3.4.1.2 (C-FOVE-FCT), and C-FOVE with aggregation parfactors (AC-FOVE). We used Java implementations of the above algorithms on an Intel Core 2 Duo 2.66 GHz processor with 1 GB of memory made available to the JVM.

3.5.1 Memory usage

In the first experiment we investigated how memory usage changes as the population size of the aggregation logical variable increases. For our tests we used the small first-order models introduced throughout this chapter. While they are definitely toy models by themselves, the aggregation components present in these models could be parts of much bigger, more realistic models.
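The final step, multiplying parfactors [1] and [7] and summing out big_jackpot(), can be verified with a few lines (factor values taken from the tables above; the encoding is ours):

```python
F_big_jackpot = {False: 0.8, True: 0.2}
F_7 = {(False, False): 0.9999999825, (False, True): 0.0000000175,
       (True, False): 0.9999999450, (True, True): 0.0000000550}

# Multiply [1] and [7], then sum out big_jackpot().
F_8 = {jw: sum(F_big_jackpot[bj] * F_7[(bj, jw)] for bj in (False, True))
       for jw in (False, True)}
```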
If some of the tested algorithms cannot perform efficient inference in our small models, they will not be able to perform inference in bigger ones. We tested all five algorithms on the following test instances:

(a) the model introduced in Example 3.1 (depicted in Figure 3.1); compute the marginal of the parameterized random variable jackpot_won();
(b) the model introduced in Example 3.2 (Figure 3.2); compute the marginal of the parameterized random variable best_match();
(c) the model introduced in Example 3.5 (Figure 3.5); compute the marginal of the parameterized random variable jackpot_winners() (the VE-FCT and C-FOVE-FCT algorithms could not be tested on this model as aggregation is based on the SUM|3 operator);
(d) the model introduced in Example 3.14 (depicted in Figure 3.7); compute the marginal of the parameterized random variable jackpot_won().

For all instances we varied the population size n of the logical variable Person from 1 to 20,000,000. We recorded the maximum value of n for which aggregation was possible given 1 GB of memory. The results are presented in Figure 3.8.

Model   VE   VE-FCT     C-FOVE     C-FOVE-FCT   AC-FOVE
(a)     25   ≈ 1.0 E4   > 2.0 E7   > 2.0 E7     > 2.0 E7
(b)     23   ≈ 1.0 E4   ≈ 7.0 E3   > 2.0 E7     > 2.0 E7
(c)     24   N/A        > 2.0 E7   N/A          > 2.0 E7
(d)     24   ≈ 1.0 E4   > 2.0 E7   > 2.0 E7     > 2.0 E7

Figure 3.8: The maximum population size of the logical variable Person for which aggregation was possible.

Figure 3.9: Performance on model (a) (with OR-based aggregation). [Log-log plot of time in ms against n = |D(Person)| for VE, VE-FCT, C-FOVE, AC-FOVE and C-FOVE-FCT.]

The space complexity of VE is exponential in n, and the algorithm could not handle models with large population sizes. The space complexity of VE-FCT is linear in n and it performed much better than standard VE. For models (a), (c) and (d), C-FOVE is also linear in n, but C-FOVE does lifted inference and it achieved better results than VE-FCT, which performs inference at the propositional level.
In model (b), the space complexity of C-FOVE is O(n^6), and C-FOVE could not handle populations as large as VE-FCT could. C-FOVE-FCT and AC-FOVE, whose space complexity is independent of n, performed best. Note that for C-FOVE-FCT and AC-FOVE the time complexity is logarithmic in n; for the other algorithms the time complexity is the same as the space complexity.

Figure 3.10: Performance on model (b) (with MAX-based aggregation).

Figure 3.11: Performance on model (c) (with SUM|3-based aggregation).

Figure 3.12: Performance on model (d) (generalized aggregation parfactors).

For completeness, in Figures 3.9–3.12 we also present the average time over 10 runs of each algorithm on the tested instances. While, due to the small sizes of the models, the parts of the plots below 1 ms are very noisy, we can see that the small memory footprint of C-FOVE-FCT and AC-FOVE does not come at the cost of high computation time.

3.5.2 Social network experiment

For this experiment we used an ICL theory [Poole, 2008] from Carbonetto et al. [2009] that explains how people alter their smoking habits within their social network. The theory (without a probability distribution) is shown in Figure 3.13. The population of the logical variables X and Y represents the set of people. The theory accounts for various interdependencies between smoking and friendship within this group of people. For example, non-smokers might convince their friends to stop smoking, or two smokers might be more likely to become friends. Possible cyclic dependencies between people are resolved by a “switch” parameterized random variable ind(X) with range {false, true}.
C = {{ind(X), ¬ind(X)}, {ns-fr(X,Y), ¬ns-fr(X,Y)}, {diff-sm-fr(X,Y), ¬diff-sm-fr(X,Y)}, {sm-fr(X,Y), ¬sm-fr(X,Y)}, {fr-at-rnd-sm(X,Y), ¬fr-at-rnd-sm(X,Y)}, {fr-at-rnd-nsm(X,Y), ¬fr-at-rnd-nsm(X,Y)}, {fr-at-rnd(X,Y), ¬fr-at-rnd(X,Y)}, {ind-sm(X), ¬ind-sm(X)}, {no-adv-sm(X), ¬no-adv-sm(X)}, {nsm-adv-sm(X), ¬nsm-adv-sm(X)}, {sm-adv-sm(X), ¬sm-adv-sm(X)}, {ctr-adv-sm(X), ¬ctr-adv-sm(X)}, {noise-1(X,Y), ¬noise-1(X,Y)}, {noise-2(X,Y), ¬noise-2(X,Y)}}

F = {
[13] friends(X,Y) ← X ≻ Y ∧ friends(Y,X),
[14] friends(X,Y) ← X ≺ Y ∧ ind(X) ∧ ind(Y) ∧ ¬smokes(X) ∧ ¬smokes(Y) ∧ ns-fr(X,Y),
[15] friends(X,Y) ← X ≺ Y ∧ ind(X) ∧ ind(Y) ∧ ¬smokes(X) ∧ smokes(Y) ∧ diff-sm-fr(X,Y),
[16] friends(X,Y) ← X ≺ Y ∧ ind(X) ∧ ind(Y) ∧ smokes(X) ∧ ¬smokes(Y) ∧ diff-sm-fr(X,Y),
[17] friends(X,Y) ← X ≺ Y ∧ ind(X) ∧ ind(Y) ∧ smokes(X) ∧ smokes(Y) ∧ sm-fr(X,Y),
[18] friends(X,Y) ← X ≺ Y ∧ ind(X) ∧ ¬ind(Y) ∧ ¬smokes(X) ∧ fr-at-rnd-nsm(X,Y),
[19] friends(X,Y) ← X ≺ Y ∧ ind(X) ∧ ¬ind(Y) ∧ smokes(X) ∧ fr-at-rnd-sm(X,Y),
[20] friends(X,Y) ← X ≺ Y ∧ ¬ind(X) ∧ ind(Y) ∧ ¬smokes(Y) ∧ fr-at-rnd-nsm(Y,X),
[21] friends(X,Y) ← X ≺ Y ∧ ¬ind(X) ∧ ind(Y) ∧ smokes(Y) ∧ fr-at-rnd-sm(Y,X),
[22] friends(X,Y) ← X ≺ Y ∧ ¬ind(X) ∧ ¬ind(Y) ∧ fr-at-rnd(X,Y),
[23] smokes(X) ← ind(X) ∧ ind-sm(X),
[24] smokes(X) ← ¬ind(X) ∧ ¬sm-adv-sm-fr(X) ∧ ¬nsm-adv-nsm-fr(X) ∧ no-adv-sm(X),
[25] smokes(X) ← ¬ind(X) ∧ ¬sm-adv-sm-fr(X) ∧ nsm-adv-nsm-fr(X) ∧ nsm-adv-sm(X),
[26] smokes(X) ← ¬ind(X) ∧ sm-adv-sm-fr(X) ∧ ¬nsm-adv-nsm-fr(X) ∧ sm-adv-sm(X),
[27] smokes(X) ← ¬ind(X) ∧ sm-adv-sm-fr(X) ∧ nsm-adv-nsm-fr(X) ∧ ctr-adv-sm(X),
[28] sm-adv-sm-fr(X) ← ∃Y friends(X,Y) ∧ ind(Y) ∧ smokes(Y) ∧ noise-1(X,Y),
[29] nsm-adv-nsm-fr(X) ← ∃Y friends(X,Y) ∧ ind(Y) ∧ ¬smokes(Y) ∧ noise-2(X,Y)}

Figure 3.13: ICL theory (without a probability distribution) for the smoking-friendship model.

If ind(X) is true, X makes an independent decision to smoke or not to smoke.
If ind(X) is false, X's decision may be influenced by X's friends. The role of the parameterized random variable ind(X) is illustrated in Figure 3.14. Aggregation is present in the theory in the clauses defining sm-adv-sm-fr(X) and nsm-adv-nsm-fr(X) (lines [28] and [29]), which model how a person aggregates advice from smoking and non-smoking friends. For the population size n, the equivalent propositional graphical model has 3n² + n nodes and 12n² − 9n arcs.

Figure 3.14: Illustration of how ind(X) works.

The parameters of the probability distribution forming the theory were learned from data on smoking and drug habits among teenagers attending a school in Scotland [Pearson and Michell, 2000], using methods described by Carbonetto et al. [2009]. In our experiment we varied the population size n from 2 to 140 and, for each value, computed the marginal probability of a single individual being a smoker. Figure 3.15 shows the average time over 10 runs of the tested algorithms for each population size. The VE, VE-FCT and C-FOVE algorithms failed to solve instances with a population size greater than 8, 10, and 11, respectively, because they ran out of available memory (1 GB). AC-FOVE was able to handle much larger instances efficiently; it ran out of memory at a population size of 159. The AC-FOVE algorithm performed on par with the C-FOVE-FCT algorithm except for small populations. It is important to remember that the C-FOVE-FCT algorithm, unlike AC-FOVE, can only be applied to MAX- and MIN-based aggregation.

Figure 3.15: Performance on the smoking-friendship model.
3.6 Conclusions

In this chapter we demonstrated the use of aggregation parfactors to represent aggregation in directed first-order probabilistic models, and showed how aggregation parfactors can be incorporated into the C-FOVE algorithm. Theoretical analysis and empirical tests showed that in some cases, lifted inference with aggregation parfactors leads to significant gains in efficiency.

Chapter 4

Solver for #CSP with Inequality Constraints

When angry count to ten before you speak. If very angry, count to one hundred. — Thomas Jefferson

When angry, count to four; when very angry, swear. — Mark Twain

4.1 Introduction

Lifted probabilistic inference requires counting the number of solutions to binary CSPs (i.e., solving #CSP) with inequality constraints (either between a pair of variables or between a variable and a constant). Variables in these CSP instances typically have large domain sizes. Instances of these problems are described in a lifted manner, that is, the only constants named explicitly are those for which unary constraints exist. In order to be efficient, a lifted probabilistic inference engine also requires a lifted answer from a #CSP solver. A lifted answer groups together constants that contribute to the same count. Existing algorithms for counting the number of solutions to constraint satisfaction problems neither accept lifted descriptions as input nor produce lifted descriptions as output. Moreover, the complexity of these algorithms is dominated by the domain size, which also makes them unsuitable for lifted probabilistic inference, where we want to solve the whole problem in time logarithmic in the domain size where possible. In this chapter we design and analyze a counting algorithm that takes as input a lifted description of a CSP, returns an answer described in a lifted manner, and performs well in the presence of large domain sizes. We provide background information in Section 4.2.
In particular, we introduce relevant concepts in Section 4.2.1 and, in Section 4.2.2, describe a #CSP algorithm from [Dechter, 2003, Section 13.3.3], which is the starting point for our algorithm. Our counting algorithm is described in Section 4.3.2 and analyzed theoretically in Section 4.3.4. Empirical tests on random CSP instances are presented in Section 4.3.5. We analyze the impact of the presented algorithm on probabilistic inference in Section 5.4 of Chapter 5. Appendices A to D provide insights into our implementation of the algorithm. The scope of the notation introduced in this chapter is limited to this chapter, Example 5.12 from Chapter 5 and Appendices B to D.

4.2 Background

Constraint satisfaction problems (CSPs), first introduced by Montanari [1974], are used for representing problems in many areas. Besides the most widely studied decision and search variants, one can pose the question “How many solutions exist?” for a particular CSP. This counting variant of CSP is known as #CSP. It belongs to the #P class of problems introduced by Valiant [1979], which is the class of all counting problems associated with polynomially balanced, polynomial-time decidable relations [Papadimitriou, 1994]. Binary #CSP was proven to be complete for the #P class [Roth, 1996]. Bulatov and Dalmau [2003] described some tractable subclasses of #CSP, but these are very restrictive.

Among counting problems, #SAT receives the most attention [Bayardo Jr. and Pehoushek, 2000; Birnbaum and Lozinskii, 1999; Dahllöf et al., 2002, 2005; Dubois, 1991; Zhang, 1996]. However, some approximate and exact algorithms for #CSP have recently been proposed. Meisels et al. [2000] rephrased #CSP in terms of probability updating in Bayesian networks, and applied methods for approximating probabilities in Bayesian networks to approximate the number of solutions. Angelsmark et al.
[2002] represented binary CSP in terms of 2-SAT instances, and used the algorithm described in [Dahllöf et al., 2002] to obtain the number of solutions. In an improved version, Angelsmark and Jonsson [2003] translated binary CSP to weighted 2-SAT, and the counting problem was solved with the algorithm described in [Dahllöf et al., 2005]. The translation was done using the partitioning method, which works by partitioning the domains of the variables in a CSP into a number of disjoint subsets. The time complexity of the improved algorithm for n variables and domain size equal to d approaches O((0.6224d)^n) as d grows. The space complexity is polynomial.

The number of solutions to subproblems of a given CSP can be used as a heuristic for solving the whole problem. Dechter and Pearl [1987] counted solutions with a variant of the variable elimination algorithm. Horsch and Havens [2000] used solution probabilities, which can guide search algorithms for solving CSPs. Kask et al. [2004a,b] approximated the number of solutions with the Iterative Join-Graph Propagation method [Dechter et al., 2002] and used it as a heuristic to solve CSPs. Refalo [2004] presented a generic search heuristic based on the impact of a variable. Pesant [2005] proposed a structural approach to #CSP. He derived polynomial-time evaluations of the number of solutions of individual constraints of several types, which can be used to approximate the total number of solutions or to guide search heuristics.

Dechter [2003, Section 13.3.3] presented a method for solving #CSP with a variable elimination algorithm. The time and space complexity of the algorithm are equal to O(r d^{w*(ρ)}), where r is the number of constraints and w*(ρ) is the induced width of the associated constraint graph as a function of the elimination ordering ρ. In this chapter we show how this variable elimination algorithm can be modified to count the number of solutions to binary CSPs with inequality constraints and large domains.
The class of problems needed for lifted probabilistic inference, that is, the class of CSPs with only inequality constraints, is known as the list-coloring problem [Biggs, 1993]. If values from the domains of the variables are interpreted as colors, the problem can be understood as the problem of coloring graph nodes such that two nodes with an edge between them have different colors. Björklund et al. [2009] provide an algorithm with 2^n n^{O(1)} time and space complexity, where n is the number of variables. The more restricted class of problems, where all variables have the same domain of size k, is known as the k-coloring problem [Biggs, 1993]. It can be interpreted as the problem of coloring the nodes of a graph using k colors such that two nodes with an edge between them have different colors. It only applies to lifted probabilistic inference if the encountered CSP does not involve unary constraints. The partitioning method mentioned earlier has been successfully used to solve this problem [Angelsmark and Thapper, 2006]. The corresponding counting problem is known as the problem of computing the chromatic polynomial [Biggs, 1993]. The fastest known algorithm, due to Björklund et al. [2009], has 2^n n^{O(1)} time and space complexity. Given polynomial space, they can find the smallest k for which the original decision problem has a positive answer (the chromatic number [Biggs, 1993]) in O(2.2461^n) time.

4.2.1 Constraint satisfaction problems

The content of this section is based on [Dechter, 2003]. A CSP instance is a triple P = (X, D, C), where X = {X1, . . . , Xn} is a finite set of n variables, D is a function that maps each variable Xi to the set D(Xi) of possible values it can take (the domain of Xi), and C = {C1, . . . , Cr} is a finite set of constraints. Each constraint Cj is a relation over a set S(Cj) = {Y1, . . . , Ym} of variables from X, Cj ⊆ D(Y1) × · · · × D(Ym). The set S(Cj) is called the scope of Cj and m is called the arity of the constraint Cj.
In a binary CSP instance, all the constraints are unary (m = 1) or binary (m = 2). A tuple ⟨x1, . . . , xi⟩ ∈ D(Xl1) × · · · × D(Xli) satisfies a constraint Cj if S(Cj) ⊆ {Xl1, . . . , Xli} and the projection of ⟨x1, . . . , xi⟩ on S(Cj) is an element of Cj. A tuple ⟨x1, . . . , xn⟩ ∈ D(X1) × · · · × D(Xn) is a solution of P = (X, D, C) if it satisfies all constraints in C. A tuple ⟨x1, . . . , xn⟩ ∈ D(X1) × · · · × D(Xn) is a consistent extension in P of the tuple ⟨xi1, . . . , xim⟩ ∈ D(Xi1) × · · · × D(Xim) to the variables {X1, . . . , Xn} \ {Xi1, . . . , Xim}, if it is a solution of P and ∀ j ∈ {1, . . . , m} ∃ l ∈ {1, . . . , n} such that (xij = xl) ∧ (Xij = Xl).

In this chapter, we restrict our focus to discrete, finite CSP instances, where all variables in P have discrete and finite domains. The CSP instance P = (X, D, C) can be represented by a constraint graph: each variable is represented by a node, and two nodes are connected if they are in the scope of the same constraint from C (in the case of a binary CSP, arcs correspond directly to the constraints).

4.2.2 Variable elimination for #CSP

In this section, we describe the variable elimination algorithm for #CSP (#VE) presented in [Dechter, 2003, Section 13.3.3]. Assume we are given a CSP instance P = (X, D, C). Each constraint Ci with scope S(Ci) = {Xi1, . . . , Xim} is represented by a single factor FCi on variables Xi1, . . . , Xim, which has the value 1 for satisfying tuples and 0 otherwise. The number of solutions to P, |P|, is equal to:

|P| = ∑_{X1} · · · ∑_{Xn} FC1(S(C1)) · · · FCr(S(Cr)).

Computing the product FC1(S(C1)) · · · FCr(S(Cr)) directly is not tractable, but the #VE algorithm takes advantage of the (possible) sparseness of the associated constraint graph and, using the distributive law, moves factors that are not functions of Xi outside of the sum ∑_{Xi}, for i = n, . . . , 1. Figure 4.1 presents the pseudocode involved.
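Alongside the pseudocode of Figure 4.1, a minimal Python sketch may help. This is our own toy implementation, not the thesis code: factors are dictionaries from value tuples to counts, constraints are (scope, predicate) pairs, and it assumes every variable occurs in at least one constraint.

```python
from itertools import product

def make_factor(scope, domains, predicate):
    """0/1 factor F_C for a constraint with the given scope."""
    return {vals: int(predicate(*vals))
            for vals in product(*(domains[v] for v in scope))}

def multiply(scope1, f1, scope2, f2, domains):
    """Pointwise product of two factors on the union of their scopes."""
    scope = list(dict.fromkeys(scope1 + scope2))  # ordered union
    f = {}
    for vals in product(*(domains[v] for v in scope)):
        a = dict(zip(scope, vals))
        f[vals] = (f1[tuple(a[v] for v in scope1)] *
                   f2[tuple(a[v] for v in scope2)])
    return scope, f

def sum_out(var, scope, f):
    """Sum a variable out of a factor, accumulating counts."""
    idx = scope.index(var)
    g = {}
    for vals, count in f.items():
        key = vals[:idx] + vals[idx + 1:]
        g[key] = g.get(key, 0) + count
    return [v for v in scope if v != var], g

def num_solutions(domains, constraints, order):
    """#VE with E = X: eliminate every variable in `order`, return |P|.
    Assumes each variable appears in at least one constraint."""
    factors = [(list(scope), make_factor(scope, domains, pred))
               for scope, pred in constraints]
    for var in order:
        touching = [sf for sf in factors if var in sf[0]]
        factors = [sf for sf in factors if var not in sf[0]]
        scope, f = touching[0]
        for s2, f2 in touching[1:]:
            scope, f = multiply(scope, f, s2, f2, domains)
        factors.append(sum_out(var, scope, f))
    result = 1
    for _, f in factors:
        result *= f[()]
    return result
```

On a cycle A ≠ B, A ≠ C, B ≠ D, C ≠ D with a common domain of size d = 3, this returns the same count for any elimination ordering, as the algorithm guarantees.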
Note that this is exactly the same algorithm as the VE algorithm for inference in belief networks (see Section 2.3.2 and Figure 2.2 on page 10). The only difference is the way the initial factors are created. At each step of the computation, the product of all factors that exist at this step represents the number of consistent extensions associated with the previously eliminated variables. Initially, no variables are eliminated and, as described above, we have 0–1 valued factors corresponding to the original constraints. Their product simply enumerates all possible assignments of values to variables and assigns 1 to tuples that are solutions to the input CSP instance and 0 to other tuples. At the end, if we decide to eliminate all variables (i.e., if E = X), this algorithm returns a factor on the empty set of variables, which is simply a number equal to the number of solutions to the input CSP instance. If we decide to eliminate only some of the variables (E ⊂ X), this algorithm returns a factor representing the number of solutions for each combination of values of variables X \ E.

[00] procedure #CSP_VE(P, E, H)
[01] input: CSP instance P = (X, D, C);
[02]        set of variables to eliminate E ⊆ X;
[03]        elimination ordering heuristic H;
[04] output: factor representing the solution for each value of X \ E;
[05]   set F := initialize(P);
[06]   while there is a factor in F involving a variable from E do
[07]     select variable Y ∈ E according to H;
[08]     set F := eliminate(Y, F);
[09]     set E := E \ {Y};
[10]   end
[11]   return ∏_{Fi ∈ F} Fi;
[12] end

[13] procedure initialize(P)
[14] input: CSP instance P = (X, D, C);
[15] output: representation of P as set of factors F;
[16]   set F := ∅;
[17]   for i := 1 to r do
[18]     create factor Fi on S(Ci) with value 1 for tuples satisfying Ci
[19]       and with value 0 otherwise;
[20]     set F := F ∪ {Fi};
[21]   end
[22]   return F;
[23] end

[24] procedure eliminate(Y, F)
[25] input: variable to be eliminated Y;
[26]        set of factors F;
[27] output: set of factors F with Y summed out;
[28]   partition F = {F1, . . . , Fn} into {F1, . . . , Fm} that do not contain Y
[29]     and {Fm+1, . . . , Fn} that do contain Y;
[30]   return {F1, . . . , Fm, ∑_Y Fm+1 · · · Fn};
[31] end

Figure 4.1: #VE algorithm for #CSP from Dechter [2003].

Assume that we have eliminated all variables (E = X), and that the domains of the variables have the same size d = |D(X1)| = · · · = |D(Xn)|. The time and space complexity of the algorithm are determined by the size of the biggest factor, which depends on the induced width w*(ρ) of the associated constraint graph:

O(r d^{w*(ρ)}).    (4.1)

4.2.3 Set partitions

A partition of a set S is a collection B1, . . . , Bk of nonempty, pairwise disjoint subsets of S such that S = ⋃_{i=1}^{k} Bi. The sets Bi are called blocks of the partition.

Example 4.1. Consider the set S = {1, 2, 3}. The collection of blocks B1 = {1, 3}, B2 = {2} is an example of a partition of S. Using a standard notation for set partitions, it can be written down as {{1, 3}, {2}}. There are four other partitions of set S: {{1, 2, 3}}, {{1}, {2, 3}}, {{1, 2}, {3}}, and {{1}, {2}, {3}}.

Set partitions are intimately connected to equality. For any consistent set of equality assertions on variables, there exist one or more partitions in which the variables that are equal are in the same block, and the variables that are not equal are in different blocks. If we consider a semantic mapping from variables to individuals in the world, the inverse of this mapping, where two variables that map to the same individual are in the same block, forms a partition of the variables. Given a partition π, we denote by C(π) the set of equality and inequality assertions corresponding to π.

Example 4.2. Consider the set of variables {B, C, D}. The assertions B ≠ C, B ≠ D, C = D correspond to the partition {{B}, {C, D}}. We have C({{B}, {C, D}}) = {B ≠ C, B ≠ D, C = D}.
The number of partitions of a set of size n is equal to the n-th Bell number ϖn, named after Eric T. Bell, who wrote several papers on the subject [Bell, 1934a,b, 1938]. Bell numbers satisfy the following recurrence:

ϖ0 = 1,    ϖn+1 = ∑_{k=0}^{n} C(n, k) ϖk,    (4.2)

where C(n, k) denotes the binomial coefficient.

Figure 4.2: Comparison of ϖn and exponential functions.

According to Knuth [2005], Equation 4.2 was discovered by Toshiaki Honda in the early 1800s, and William A. Whitworth [Whitworth, 1878] first pointed out the connection between Bell numbers and set partitions. The first few Bell numbers are:

n  = 0  1  2  3  4   5   6    7    8     9      10      . . .
ϖn = 1  1  2  5  15  52  203  877  4140  21147  115975  . . .

Bell numbers grow faster than any exponential function (see Lovász [2003]), but for small n they stay much smaller than exponential functions with a moderate base (see Figure 4.2).

We end this section with a proof of a property of Bell numbers which we will use later on.

Proposition 4.1. Let i1, i2, . . . , ik ≥ 0. Then ϖi1 ϖi2 · · · ϖik ≤ ϖ_{i1+i2+···+ik}.

Proof. The right side of the inequality represents the number of all partitions of a set of size i1 + i2 + · · · + ik. Assume we have a set of size i1 + i2 + · · · + ik whose elements are painted with k colors and, for 1 ≤ j ≤ k, there are ij elements painted with the color j. The left side of the inequality represents the number of all partitions of this set such that elements with different colors are always in different blocks. It is easy to see that such a number is smaller than or equal to the number of all partitions of the set of size i1 + i2 + · · · + ik.

Figure 4.3: Constraint graph with tree structure.
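The recurrence in Equation 4.2 is straightforward to compute; a small Python sketch of our own, for illustration:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def bell(n):
    """n-th Bell number via the recurrence of Equation 4.2."""
    if n == 0:
        return 1
    # bell(n) = sum_{k=0}^{n-1} C(n-1, k) * bell(k)
    return sum(comb(n - 1, k) * bell(k) for k in range(n))
```

This reproduces the table above (for example bell(10) = 115975), and bell(2) · bell(3) = 10 ≤ bell(5) = 52 illustrates Proposition 4.1 on a small case.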
4.3 Counting solutions to CSP instances with inequality constraints

For the rest of this chapter, we restrict our attention to CSP instances with only inequality constraints (either between a pair of variables or between a variable and constants), as only such constraints arise in lifted probabilistic inference. Moreover, for simplicity of the presentation, but without loss of generality, we only consider instances for which the constraint graph consists of a single connected component. The number of solutions to a CSP with a disconnected constraint graph is simply the product of the numbers of solutions for the connected components of the constraint graph. Note that, unlike the rest of this thesis, in this chapter we do not assume that variables from the same connected component of the constraint graph have the same domain.

Given two variables A and B and an inequality constraint between them, we do not need to consider individual values from their domains in order to calculate the number of solutions to such a CSP. If we know the sizes of the variables' domains and the size of the intersection of the domains, we can calculate the number of solutions: |D(A)| · |D(B)| − |D(A) ∩ D(B)|. We show below how this simple idea can be generalized to an arbitrary CSP with inequality constraints, and develop a modified variable elimination algorithm that does not need to enumerate values from the domains.

4.3.1 Analysis of the problem

Let us start with a very simple constraint graph.

Example 4.3. Consider the constraint graph presented in Figure 4.3, where all variables have the same domain, the domain size is d, and there are no unary constraints.

Figure 4.4: Constraint graph with a cycle (a). The two cases: B = C and B ≠ C (b).
The graph has a tree structure, which allows us to immediately solve the problem: we can assign a value to A in d ways, and are left with d − 1 possible values for B, d − 1 possible values for C, and d − 1 possible values for each of D and E. Hence, there are d · (d − 1) · (d − 1) · (d − 1) · (d − 1) solutions to this CSP instance.

In the next example, we analyze a constraint graph with a cycle.

Example 4.4. Consider the constraint graph presented in Figure 4.4 (a). As in the previous example, all variables have the same domain, the domain size is d, and there are no unary constraints. The graph has a cycle, which makes the calculation more complicated. We can assign a value to A in d ways, and are left with d − 1 possible values for B and d − 1 possible values for C. For D we need to consider two cases: B = C and B ≠ C, as is shown in Figure 4.4 (b). In the B = C case, D can take d − 1 values, while in the B ≠ C case, D can have d − 2 values. Hence, the number of solutions to this CSP instance is d · (d − 1) · ((d − 1) + (d − 2) · (d − 2)). The two cases correspond to the two partitions of the set {B, C}: the partition {{B, C}} and the partition {{B}, {C}}.

Let us compare our simple reasoning to the computation that would be performed by the #VE algorithm. The considered CSP can be represented with four factors: f(A, B), f(A, C), f(B, D) and f(C, D), each of size d². To eliminate A, the #VE algorithm multiplies factors f(A, B) and f(A, C). Next, it sums out A from the factor representing the product and obtains a factor on variables B and C of size d². The elimination process continues, but of interest to us is the fact that #VE computed d² assignments of values to variables B and C, while above we just needed to consider the two partitions: {{B, C}} and {{B}, {C}}.

We further examine the above observation in the next example.

Example 4.5. Consider the graph from Figure 4.5 (a). Again, assume that all variables have the same domain, the domain size is d, and there are no unary constraints. If we count possible assignments to variables in the way described in Example 4.4, following the order ρ = A, B, C, D, E, we need to consider ϖ3 = 5 partitions of the set {B, C, D}: {{B, C, D}}, {{B}, {C, D}}, {{B, C}, {D}}, {{B, D}, {C}}, {{B}, {C}, {D}}. Each partition corresponds to a different case as to whether the variables are equal or not; these cases are shown in Figure 4.5 (b).

Figure 4.5: Constraint graph discussed in Example 4.5 (a). Cases corresponding to the respective partitions of the set {B, C, D} (b).

The #VE algorithm, while following the elimination ordering ρ, would create a factor on variables B, C, D of size d^{w*(ρ)} = d³.

If we add a constraint B ≠ C to the constraint graph (see Figure 4.6 (a)), we need to consider only partitions that are consistent with this constraint, that is, partitions in which B and C are in different blocks. There are three such partitions (see Figure 4.6 (b)). The #VE algorithm still creates a factor on B, C, D of size d³.

Figure 4.6: Constraint graph discussed in Example 4.5 (a). Cases corresponding to the respective partitions of the set {B, C, D} that are consistent with the inequality B ≠ C (b).

Notice that what we have done in Example 4.5 is to consider at most ϖ3 partitions of variables, rather than d³ assignments of values to these variables. Since we do not care about empty partitions, we will never have to consider more partitions than there are assignments of values.
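The closed form derived in Example 4.4 can be checked by brute force over the cycle's four inequality constraints; a quick sanity script of our own:

```python
from itertools import product

def brute_count(d, edges, n_vars):
    """Count assignments over a common domain of size d that satisfy an
    inequality constraint for every edge of the constraint graph."""
    return sum(all(v[i] != v[j] for i, j in edges)
               for v in product(range(d), repeat=n_vars))

# Example 4.4: cycle with edges A-B, A-C, B-D, C-D (variables numbered 0-3)
cycle = [(0, 1), (0, 2), (1, 3), (2, 3)]
for d in range(2, 7):
    assert brute_count(d, cycle, 4) == d * (d - 1) * ((d - 1) + (d - 2) ** 2)
```

The exhaustive count agrees with d · (d − 1) · ((d − 1) + (d − 2)²) for every tested d, which is exactly the two-case (B = C versus B ≠ C) reasoning above.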
As we mentioned in Section 4.2.3, for small n (which in our case is equal to the induced width w*(ρ) of the constraint graph, see Section 4.3.4.2) ϖn stays much smaller than exponential functions with a moderate base (in our case, equal to the domain size d). In the problems induced during the first-order probabilistic inference we consider, we do not expect w*(ρ) to be very large, but we are likely to work with large domains; therefore, considering ϖ_{w*(ρ)} cases instead of d^{w*(ρ)} can be a big gain.

In practice, variables can have different domains, or different values from their domains might be excluded by unary constraints. In such a situation, we can apply the above reasoning to any set of values that are indistinguishable as far as counting is concerned. For example, the intersection of all domains is a set of values for which we only need the size; there is no point in reasoning about each individual value separately. Similarly, the values from the domain of a variable that do not belong to the domain of any other variable can be grouped and treated together. All we need to know is how many there are.

Example 4.6. Consider again the constraint graph from Figure 4.4. Assume that all variables but B have the same domain D(A) = D(C) = D(D) of size d, and that B has a domain of size d + d′, where |D(B) ∩ D(A)| = d. In the case where B has a value from D(B) ∩ D(A) = D(A), as before, the number of solutions to this CSP instance is equal to d · (d − 1) · ((d − 1) + (d − 2) · (d − 2)). In the case where B has a value from D(B) \ D(A), where |D(B) \ D(A)| = d′, the number of solutions is equal to d′ · d · (d − 1) · (d − 1). The overall number of solutions is equal to the sum of these two quantities.

Figure 4.7: Relationship between #VE= and #VE.

Example 4.7.
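Example 4.6's two-case count can likewise be verified exhaustively. In this sketch of our own, the d′ extra values of D(B) are modeled as the integers d, . . . , d + d′ − 1:

```python
from itertools import product

d, d_extra = 4, 3                      # d = |D(A)|, d_extra = d'
DA = DC = DD = range(d)                # shared domain of A, C, D
DB = range(d + d_extra)                # B's larger domain

# constraints of Figure 4.4: A != B, A != C, B != D, C != D
count = sum(a != b and a != c and b != e and c != e
            for a, b, c, e in product(DA, DB, DC, DD))

expected = (d * (d - 1) * ((d - 1) + (d - 2) ** 2)   # B takes a shared value
            + d_extra * d * (d - 1) * (d - 1))       # B takes an extra value
assert count == expected
```

The two summands correspond exactly to the two cases in Example 4.6: B drawn from D(B) ∩ D(A), and B drawn from D(B) \ D(A).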
If we were to extend Example 4.6 so that D has the same domain as B, we would then only need to consider how many more solutions would be in the final answer. In this case, we can consider the other values from the domain of D and how these add to the total count of models. We would like to do this locally when summing out variables, rather than globally, as implied in this example.

4.3.2 The #VE= algorithm

In this section we describe the #VE= algorithm for counting the number of solutions to CSPs with inequality constraints. The core of the #VE= algorithm is the same as the #VE algorithm (see Section 4.2.2), but #VE= is a lifted algorithm, that is, it reasons at a higher level of abstraction than does #VE. In particular, a #VE= factor (Section 4.3.2.2) is more complicated than the corresponding #VE factor, but one tuple in a #VE= factor represents many tuples in a #VE factor.

Assume we are given the CSP instance P = (X, D, C), X = {X1, . . . , Xn}, C = {C1, . . . , Cl, Cl+1, . . . , Cr}, where C1, . . . , Cl are unary constraints and Cl+1, . . . , Cr are binary constraints. We use D′(Xj) to denote the domain D(Xj) without those values that are excluded by the unary constraints C1, . . . , Cl.

4.3.2.1 S-constants

Following the analysis presented in Section 4.3.1, we partition the domains of variables into disjoint sets of values from these domains. We use s-constants to represent such sets. Each s-constant denotes a (non-empty) set of domain values. The sets of domain values associated with different s-constants are assumed to be disjoint. With each s-constant, we store the size of the set it denotes. Instead of reasoning with individual domain values, we can reason with these s-constants. In the rest of this chapter, we will at times treat an s-constant as a set (as is done in normal mathematics); in this case, we mean the set the s-constant denotes.
Note that the algorithm never deals with the sets that the s-constants denote (we do not assume that it has access to these sets).

Example 4.8. In Example 4.7, variables A and C have the same domain of size d; let D(A) = D(C) = {x1, . . . , xd}. Variables B and D have the same domain of size d + d′, such that D(B) ∩ D(A) = D(A); let D(B) = D(D) = {x1, . . . , xd, xd+1, . . . , xd+d′}. We need two s-constants to represent values from the domains of these four variables: c1, which represents the values in the domains of all four variables, {x1, . . . , xd}, and c2, which denotes the d′ extra values in the domains of B and D, {xd+1, . . . , xd+d′}. For the purpose of our algorithm we can represent the domains of the variables as follows: D(A) = D(C) = c1 and D(B) = D(D) = c1 ∪ c2. The only additional information we will require is that c1 represents d values and c2 represents d′ values.

4.3.2.2 #VE= factors

A #VE= factor F on variables Y1, . . . , Ym is a function from a set of #VE= tuples into the natural numbers. A #VE= tuple ⟨c1, . . . , ck, [πc1 . . . πck]⟩ consists of:

• an s-constant ci for each variable;

• a set of partitions that contains, for each set of variables represented by the same s-constant ci, a partition πci of these variables.

In text we use square brackets to form a set of partitions, to avoid confusion with the curly brackets of partitions within the set. For the same reason we do not separate partitions with commas. We skip the brackets when displaying #VE= factors.

Example 4.9. Let us continue Examples 4.7 and 4.8. The constraints A ≠ B and B ≠ D can be represented by the following #VE= factors:

A    B    Partition(s)       #
c1   c1   {{A, B}}           0
c1   c1   {{A}, {B}}         1
c1   c2   {{A}} {{B}}        1

B    D    Partition(s)       #
c1   c1   {{B, D}}           0
c1   c1   {{B}, {D}}         1
c1   c2   {{B}} {{D}}        1
c2   c1   {{B}} {{D}}        1
c2   c2   {{B, D}}           0
c2   c2   {{B}, {D}}         1

The #VE= factor representing the constraint A ≠ B, F_{A≠B}, contains three #VE= tuples.
The first one, ⟨c1, c1, [{{A, B}}]⟩, represents the set of tuples from the Cartesian product of the subsets of the domains of A and B represented by c1, namely c1 × c1 = {x1, . . . , xd} × {x1, . . . , xd}, that satisfy the constraint A = B, which is implicit in the partition {{A, B}}. Such tuples do not satisfy the constraint A ≠ B represented by the factor and are assigned value 0. The second #VE= tuple, ⟨c1, c1, [{{A}, {B}}]⟩, represents the set of tuples from c1 × c1 = {x1, . . . , xd} × {x1, . . . , xd} that satisfy the constraint represented by the partition {{A}, {B}}, that is, A ≠ B. These tuples are assigned value 1. Finally, the third #VE= tuple, ⟨c1, c2, [{{A}} {{B}}]⟩, represents the set of tuples from the Cartesian product of the subsets of the domains of A and B represented by the s-constants c1 and c2, c1 × c2 = {x1, . . . , xd} × {xd+1, . . . , xd+d′}. These subsets are disjoint and each variable is in a partition by itself, meaning that there is no constraint encoded. All tuples from the product satisfy the constraint A ≠ B and are assigned value 1.

Suppose s-constant ci denotes the set Si. The number associated with the #VE= tuple t = ⟨cj1, . . . , cjm, Π⟩ in a #VE= factor F will be the same as the number in the corresponding #VE factor on Y1, . . . , Ym associated with each tuple from Sj1 × · · · × Sjm that satisfies the equality and inequality constraints implicit in the partitions Π (each of these tuples is associated with the same number). We call the corresponding #VE factor a grounding of the factor F, and denote it as G(F). We denote the tuples from G(F) corresponding to t by G(t) and call them ground tuples.

Example 4.10. Consider the #VE= factor representing the constraint A ≠ B from Example 4.9. Below we show its grounding:

A     B        #
x1    x1       0
x1    x2       1
⋮     ⋮        ⋮
x1    xd+d′    1
⋮     ⋮        ⋮
xd    x1       1
⋮     ⋮        ⋮
xd    xd       0
xd    xd+1     1
⋮     ⋮        ⋮
xd    xd+d′    1
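The grounding operation G(·) of Example 4.10 can be sketched directly: expand each #VE= tuple into the ground tuples it stands for, keeping only those that match the (in)equalities implicit in its partitions. The representation below (s-constants as dictionary keys, partitions as lists of blocks, d = 3 and d′ = 2) is our own toy version:

```python
from itertools import product

# s-constants of Example 4.8, assuming d = 3 and d' = 2
S = {'c1': [1, 2, 3], 'c2': [4, 5]}

# F_{A != B} from Example 4.9: (s-constants, partition blocks, value);
# only variables denoted by the same s-constant are related by the blocks
F_A_ne_B = [
    (('c1', 'c1'), [['A', 'B']],   0),   # A = B: violates A != B
    (('c1', 'c1'), [['A'], ['B']], 1),   # A != B within c1
    (('c1', 'c2'), [['A'], ['B']], 1),   # disjoint sets: no constraint
]

def ground(factor, variables):
    """G(F): map each ground tuple to the value of its #VE= tuple."""
    g = {}
    for sconsts, blocks, value in factor:
        block_of = {v: i for i, b in enumerate(blocks) for v in b}
        sc_of = dict(zip(variables, sconsts))
        for vals in product(*(S[c] for c in sconsts)):
            a = dict(zip(variables, vals))
            # the partition constrains only variables on the same s-constant
            if all((a[x] == a[y]) == (block_of[x] == block_of[y])
                   for x in variables for y in variables
                   if x < y and sc_of[x] == sc_of[y]):
                g[vals] = value
    return g

GF = ground(F_A_ne_B, ['A', 'B'])
```

Each of the 3 · 5 = 15 ground tuples is covered by exactly one #VE= tuple, and summing the values gives the number of solutions to A ≠ B, i.e. 3 · 5 − 3 = 12, matching the formula of Section 4.3.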
As with implementations of matrices, we can have either dense representations, for example, using 1-dimensional arrays where we can quickly index any value but need to store zeros, or sparse representations that allow us to avoid storing zeros or repeated structure but are slower when there are few zeros. The examples will show zeros when we think it makes things clearer, but an implementation should do whichever is more efficient. Appendix B shows how to implement the dense representation of factors using a hierarchy of 1-dimensional arrays, so that instead of storing both #VE= tuples and values in memory, we only need to store values.

To describe the #VE= algorithm, we need to modify the multiplication and summing out operators so they can handle the richer representation of #VE= factors. In the example below we show a CSP represented with s-constants and #VE= factors. We will use this CSP to illustrate operations on #VE= factors in Sections 4.3.2.3 and 4.3.2.4.

D(A) = {x1, x2, x3, x4, x5, x6}
D(B) = {x1, x2, x3, x4, x5, x6, x7, x8}
D(C) = {x1, x2, x3, x4, x5}
C = {A ≠ x1, A ≠ x6, A ≠ B, A ≠ C,
     B ≠ x2, B ≠ x5, B ≠ x6, B ≠ x7,
     C ≠ x1}

Figure 4.8: A CSP used in Examples 4.11–4.15 (the constraint graph has inequality edges between A and B and between A and C).

Example 4.11. Consider the CSP presented in Figure 4.8, with D′(A) = {x2, x3, x4, x5}, D′(B) = {x1, x3, x4, x8} and D′(C) = {x2, x3, x4, x5}. Let S0 = {x3, x4}, S1 = {x1, x8}, S2 = {x2, x5}. Then D′(A) = S0 ∪ S2, D′(B) = S0 ∪ S1, and D′(C) = S0 ∪ S2. Let s-constant ci denote Si. When counting the number of solutions, we are not concerned with values; all we need are the counts.
We can thus represent this example using two factors, F_{A≠B} and F_{A≠C}, and three numbers:

A    B    Partition(s)       #
c0   c0   {{A, B}}           0
c0   c0   {{A}, {B}}         1
c0   c1   {{A}} {{B}}        1
c2   c0   {{A}} {{B}}        1
c2   c1   {{A}} {{B}}        1

A    C    Partition(s)       #
c0   c0   {{A, C}}           0
c0   c0   {{A}, {C}}         1
c0   c2   {{A}} {{C}}        1
c2   c0   {{A}} {{C}}        1
c2   c2   {{A, C}}           0
c2   c2   {{A}, {C}}         1

size(c0) = 2,  size(c1) = 2,  size(c2) = 2.

The #VE= tuples in the above #VE= factors that are assigned value 0 could be pruned.

4.3.2.3 Multiplication

Example 4.12. Consider the factors presented in Example 4.11. The third #VE= tuple from the F_{A≠B} factor, ⟨c0, c1, [{{A}} {{B}}]⟩, represents all tuples from c0 × c1. The second #VE= tuple from the F_{A≠C} factor, ⟨c0, c0, [{{A}, {C}}]⟩, represents all tuples from c0 × c0 such that A ≠ C. The product of these two #VE= tuples represents all tuples from c0 × c1 × c0 such that A ≠ C, and can be represented by the #VE= tuple ⟨c0, c1, c0, [{{A}, {C}} {{B}}]⟩.

The second #VE= tuple from the F_{A≠B} factor, ⟨c0, c0, [{{A}, {B}}]⟩, represents all tuples from c0 × c0 such that A ≠ B. The product of this #VE= tuple with the second #VE= tuple from the F_{A≠C} factor represents all tuples from c0 × c0 × c0 such that A ≠ B and A ≠ C. This is not present in the form of a single #VE= tuple in the product factor, as the constraints A ≠ B and A ≠ C do not uniquely identify a partition of the set {A, B, C}. There are two cases: B = C and B ≠ C. The first case corresponds to the partition {{A}, {B, C}}, and the second to the partition {{A}, {B}, {C}}. The resulting #VE= factor has one #VE= tuple for each of these two cases: ⟨c0, c0, c0, [{{A}, {B, C}}]⟩ and ⟨c0, c0, c0, [{{A}, {B}, {C}}]⟩, respectively.

We can multiply #VE= factors as we do standard factors (see Equation 2.1), treating the s-constants as domain values, except for the case where the same s-constant is used for multiple variables in the product.
In this case, we need to create new #VE= tuples for each partition of the variables that is consistent with the partitions of the #VE= tuples being multiplied (consistency is defined treating a partition as a set of equality and inequality statements). As a special case, if the partitions are inconsistent, no #VE= tuples are produced. We can also prune any #VE= tuple that has more blocks in a partition for an s-constant than there are values in the set represented by the s-constant.

We denote the multiplication operator described above by ⊙= and define it formally below. Suppose F1 is a #VE= factor on variables X1, . . . , Xi, Y1, . . . , Yj, and F2 is a #VE= factor on variables Y1, . . . , Yj, Z1, . . . , Zl, where the sets {X1, . . . , Xi}, {Y1, . . . , Yj} and {Z1, . . . , Zl} are pairwise disjoint. The product of F1 and F2 is a #VE= factor F1 ⊙= F2 on the union of the variables, namely X1, . . . , Xi, Y1, . . . , Yj, Z1, . . . , Zl, defined by:

(F1 ⊙= F2)(⟨cX, cY, cZ, Π⟩) = F1(⟨cX, cY, Π1⟩) · F2(⟨cY, cZ, Π2⟩),    (4.3)

where

• cX, cY, and cZ represent the s-constants corresponding to variables X1, . . . , Xi, Y1, . . . , Yj, and Z1, . . . , Zl, respectively;

• Π is a set of partitions, one partition per each subset of the variables X1, . . . , Xi, Y1, . . . , Yj, Z1, . . . , Zl assigned the same s-constant; Π1 is a set of partitions, one partition per each subset of the variables X1, . . . , Xi, Y1, . . . , Yj assigned the same s-constant; and Π2 is a set of partitions, one partition per each subset of the variables Y1, . . . , Yj, Z1, . . . , Zl assigned the same s-constant;

• Π is consistent with Π1 and Π2, that is,

⋃_{π ∈ Π} C(π) ⊇ ⋃_{π1 ∈ Π1} C(π1) ∪ ⋃_{π2 ∈ Π2} C(π2).    (4.4)

If we do not prune any #VE= tuples, then for each #VE= tuple from F1 ⊙= F2, there exists exactly one #VE= tuple from F1 and exactly one #VE= tuple from F2 that satisfy condition (4.4).
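The consistency condition (4.4) can be sketched by treating each partition as the set of (in)equality assertions it induces and enumerating candidate partitions of the merged variables. The code below is our own toy version (partitions as lists of blocks); it reproduces Example 4.12, where multiplying the tuples with partitions {{A}, {B}} and {{A}, {C}} yields exactly the two partitions {{A}, {B, C}} and {{A}, {B}, {C}}:

```python
def partitions(items):
    """Enumerate all set partitions of a list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        for i in range(len(p)):                     # join an existing block
            yield p[:i] + [p[i] + [first]] + p[i + 1:]
        yield p + [[first]]                         # or start a new block

def assertions(partition):
    """C(pi): (in)equality assertions over the partitioned variables."""
    block_of = {v: i for i, b in enumerate(partition) for v in b}
    vs = sorted(block_of)
    return {(x, y, block_of[x] == block_of[y])
            for x in vs for y in vs if x < y}

def consistent_partitions(p1, p2, variables):
    """All partitions of `variables` consistent with p1 and p2 (Eq. 4.4).
    Returns [] when p1 and p2 contradict each other."""
    required = assertions(p1) | assertions(p2)
    return [sorted(map(sorted, p)) for p in partitions(variables)
            if assertions(p) >= required]
```

For inconsistent inputs, such as {{A, B}} against {{A}, {B}}, the required assertion set contains both A = B and A ≠ B, so no candidate partition survives, matching the special case in the text.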
If we allow pruning, these #VE= tuples might not exist, and in such a case the corresponding #VE= tuple in the product F1 ⊙= F2 is not created.

Example 4.13. Let us multiply the two #VE= factors from Example 4.11. Assume that we have pruned the 0-valued #VE= tuples from the input factors. We have numbered the input #VE= tuples and shown where the resulting #VE= tuples came from:

     A    B    Partition(s)     #
[1]  c0   c0   {{A}, {B}}       1
[2]  c0   c1   {{A}} {{B}}      1
[3]  c2   c0   {{A}} {{B}}      1
[4]  c2   c1   {{A}} {{B}}      1

     A    C    Partition(s)     #
[5]  c0   c0   {{A}, {C}}       1
[6]  c0   c2   {{A}} {{C}}      1
[7]  c2   c0   {{A}} {{C}}      1
[8]  c2   c2   {{A}, {C}}       1

         A    B    C    Partition(s)            #
[1 · 5]  c0   c0   c0   {{A}, {B, C}}           1
[1 · 5]  c0   c0   c0   {{A}, {B}, {C}}         1
[1 · 6]  c0   c0   c2   {{A}, {B}} {{C}}        1
[2 · 5]  c0   c1   c0   {{A}, {C}} {{B}}        1
[2 · 6]  c0   c1   c2   {{A}} {{B}} {{C}}       1
[3 · 7]  c2   c0   c0   {{A}} {{B, C}}          1
[3 · 7]  c2   c0   c0   {{A}} {{B}, {C}}        1
[3 · 8]  c2   c0   c2   {{A}, {C}} {{B}}        1
[4 · 7]  c2   c1   c0   {{A}} {{B}} {{C}}       1
[4 · 8]  c2   c1   c2   {{A}, {C}} {{B}}        1

Note that a #VE= tuple ⟨c0, c0, c2, [{{A, B}} {{C}}]⟩ is not present in the product factor. This is because the #VE= tuple ⟨c0, c0, [{{A, B}}]⟩ is not present in F_{A≠B}, as it has been pruned.

Let us discuss the #VE= tuple products from Example 4.12 in the context of Equation 4.3. Consider the #VE= tuple ⟨c0, c1, c0, Π⟩, Π = [{{A}, {C}} {{B}}], from the product factor. The value assigned to this #VE= tuple is the product of the value assigned to the #VE= tuple ⟨c0, c1, Π1⟩ by F_{A≠B} and the value assigned to the #VE= tuple ⟨c0, c0, Π2⟩ by F_{A≠C}, where Π1 = [{{A}} {{B}}] and Π2 = [{{A}, {C}}]. The partitions Π1 and Π2 are the only partitions that are consistent with Π for #VE= tuples on A = c0, B = c1 from F_{A≠B} and #VE= tuples on A = c0, C = c0 from F_{A≠C}, respectively.

Consider the #VE= tuple ⟨c0, c0, c0, Π⟩, Π = [{{A}, {B, C}}], from the product factor.
The value assigned to this #VE= tuple is the product of the value assigned to the #VE= tuple ⟨c0, c0, Π1⟩ by F_{A≠B} and the value assigned to the #VE= tuple ⟨c0, c0, Π2⟩ by F_{A≠C}, where Π1 = [{{A}, {B}}] and Π2 = [{{A}, {C}}]. The partitions Π1 and Π2 are the only partitions that are consistent with Π for #VE= tuples on A = c0, B = c0 from F_{A≠B} and #VE= tuples on A = c0, C = c0 from F_{A≠C}, respectively.

Finally, consider the #VE= tuple ⟨c0, c0, c0, Π⟩, Π = [{{A}, {B}, {C}}], from the product factor. The value assigned to this #VE= tuple is the product of the values assigned to the same #VE= tuples as above, namely ⟨c0, c0, Π1⟩ and ⟨c0, c0, Π2⟩. The partitions Π1 and Π2 are the only partitions that are consistent with Π for #VE= tuples on A = c0, B = c0 from F_{A≠B} and #VE= tuples on A = c0, C = c0 from F_{A≠C}, respectively.

Note that we can prune the second #VE= tuple from the product factor, as size(c0) = 2. There are two values in the set denoted by c0, and if there are more than two blocks in the partition, it is impossible to assign variables to values. If we do not prune the second #VE= tuple, it will eventually get multiplied by zero (in this case when A is summed out; see Example 4.14 and Equation 4.5).

Theorem 4.2. Let f1 and f2 be two #VE= factors. Then G(f1 ⊙= f2) = G(f1) G(f2).

Proof. Suppose F1 is a #VE= factor on variables X1, . . . , Xi, Y1, . . . , Yj, and F2 is a #VE= factor on variables Y1, . . . , Yj, Z1, . . . , Zl, where the sets {X1, . . . , Xi}, {Y1, . . . , Yj} and {Z1, . . . , Zl} are pairwise disjoint. The #VE= tuples in F1 and F2 represent exactly the ground tuples in G(F1) and G(F2), respectively. While computing the product F1 ⊙= F2, the #VE= algorithm attempts to multiply all of the #VE= tuples from F1 by all of the #VE= tuples from F2.

We start by showing that G(F1 ⊙= F2) ⊇ G(F1) G(F2). Suppose that t ∈ G(F1) G(F2), where the value of t is v. Then, by the definition of the factor product, there exist tuples t1 ∈ G(F1) with value v1 and t2 ∈ G(F2) with value v2 such that t = t1 t2 and v = v1 · v2. By the definition of G(), there exist #VE= tuples t1′ ∈ F1 with value v1 and t2′ ∈ F2 with value v2 such that t1 ∈ G(t1′) and t2 ∈ G(t2′). Tuple t1 is equal to t2 at the positions of variables Y1, . . . , Yj (since they were multiplied together by the #VE algorithm), and so t1′ is equal to t2′ at the positions of variables Y1, . . . , Yj (since s-constants in #VE= tuples denote disjoint sets of domain values). Also, the partitions in t1′ are consistent with the partitions in t2′; otherwise, the partitions in one of the #VE= tuples t1′, t2′ would have variables Yn and Ym, 1 ≤ n, m ≤ j, n ≠ m, in one block while the partitions from the other #VE= tuple have them in different blocks. This would imply that in one of the tuples t1, t2 the elements at the positions of variables Yn and Ym are the same, while in the other there are different elements at these positions, which we know is not the case. Thus t1′ and t2′ are multiplied together by the #VE= algorithm, and their product has the value v1 v2 = v in F1 ⊙= F2. It is easy to see that t belongs to the grounding of the product of t1′ and t2′, hence t ∈ G(F1 ⊙= F2).

Now we will show that G(F1 ⊙= F2) ⊆ G(F1) G(F2). Suppose t ∈ G(F1 ⊙= F2), where the value of t is v. Then, by the definition of G(), a #VE= tuple t′ ∈ F1 ⊙= F2 exists such that t ∈ G(t′). By the definition of ⊙=, there exist a #VE= tuple t1′ from F1 with value v1 and a #VE= tuple t2′ from F2 with value v2, such that t′ ∈ t1′ ⊙= t2′ (multiplication of two #VE= tuples may result in more than one #VE= tuple) and v = v1 v2. Let t1 be the element of G(t1′) that is identical to t at the positions of variables X1, . . . , Xi, Y1, . . . , Yj. This element must exist because t1′ is identical to t′ at the positions of variables X1, . . . , Xi, Y1, . . . , Yj, and the partitions of t1′ are a subset of the partitions of t′.
In an analogous way, we can show that there is a tuple t2 ∈ G(T2), where T2 is the #VE= tuple from F2 above, that is identical to t at the positions of variables Y1, ..., Yj, Z1, ..., Zl. The #VE= tuples T1 and T2 are equal at the positions of variables Y1, ..., Yj, and the partitions from T1 are consistent with the partitions from T2; otherwise, they could not have been multiplied. This means that the #VE algorithm would multiply each tuple from G(T1) by each tuple from G(T2). Thus, tuples t1 and t2 would be multiplied by the #VE algorithm. It is easy to see that combining t1 and t2 gives t, hence t ∈ G(F1) ⊗ G(F2). □

4.3.2.4 Summing out

To motivate the summation operation in the #VE= algorithm, we review the analogous operation in the #VE algorithm: when summing out a variable X from a factor, each tuple contributes its count to the resulting tuple in the new factor. That is, each tuple ⟨X = x, Y = y⟩ with value v (where Y stands for the other variables) adds v to the resulting tuple ⟨Y = y⟩. The complication with #VE= is that each #VE= tuple represents many different ground tuples.

To sum out a variable X from a #VE= factor, we go through each #VE= tuple in the factor and determine which part of the resulting #VE= factor it contributes to, and how much it contributes. Suppose the #VE= tuple under consideration is ⟨X = cX, Y = cY, Π⟩ and its value is v. We want to determine the effective count of this #VE= tuple with respect to X. The effective count of a #VE= tuple ⟨X = cX, Y = cY, Π⟩ with respect to variable X is the number of ways X can be assigned a value from the set represented by the s-constant cX, given the assignments of values to the variables Y. Consider the partition πX ∈ Π that contains X. If X is in the same block as another variable, the effective count is 1. If X is in a block by itself, the effective count is size(cX) minus the number of other blocks in πX, unless the number of other blocks is greater than size(cX), in which case the effective count is 0.

Example 4.14.
Consider the #VE= tuple ⟨c0, c0, c0, [{{A}, {B,C}}]⟩ from the product #VE= factor from Example 4.13. Recall that size(c0) = 2. In the only partition in this #VE= tuple, variable A is in a block by itself and there is one more block. Therefore the effective count of this #VE= tuple with respect to A is 2 − 1 = 1. Variable B is in the same block as C. The effective count of this #VE= tuple with respect to B is 1.

Consider another #VE= tuple, ⟨c0, c0, c0, [{{A}, {B}, {C}}]⟩, from the same #VE= factor. In this #VE= tuple, variable A is in a block by itself and there are two more blocks. The effective count of this #VE= tuple with respect to A is 2 − 2 = 0.

The #VE= tuple ⟨X = cX, Y = cY, Π⟩, then, contributes v times its effective count with respect to X to the #VE= tuple ⟨Y = cY, Π′⟩ from the resulting #VE= factor, where Π′ is the same as Π but with X removed from the appropriate partition. We assume that Π′ is simplified so that empty blocks are removed.

We denote the summation operator described above by ∑= and define it formally below. Suppose F is a #VE= factor on variables X1, ..., Xi−1, Xi, Xi+1, ..., Xn. The summing out of variable Xi from F, denoted ∑=_Xi F, is the #VE= factor on variables X1, ..., Xi−1, Xi+1, ..., Xn such that:

  (∑=_Xi F)(⟨c1, ..., ci−1, ci+1, ..., cn, Π′⟩)
      = ∑_{ci ∈ D(Xi)} (F(⟨c1, ..., ci−1, ci, ci+1, ..., cn, Π⟩) · ec(⟨c1, ..., ci−1, ci, ci+1, ..., cn, Π⟩, Xi)),   (4.5)

where

• c1, ..., ci−1 and ci+1, ..., cn represent the s-constants corresponding to variables X1, ..., Xi−1 and Xi+1, ..., Xn, respectively;
• Π′ is the set of partitions obtained by removing variable Xi from the appropriate partition in Π;
• ec(⟨c1, ..., ci−1, ci, ci+1, ..., cn, Π⟩, Xi) is the effective count of the #VE= tuple ⟨c1, ..., ci−1, ci, ci+1, ..., cn, Π⟩ with respect to variable Xi.

Example 4.15. Let us sum out A from the product factor from Example 4.13.
We have numbered the input #VE= tuples and shown where the resulting #VE= tuples came from:

        A    B    C    Partition(s)               #
  [1]   c0   c0   c0   [{{A}, {B,C}}]             1
  [2]   c0   c0   c2   [{{A}, {B}} {{C}}]         1
  [3]   c0   c1   c0   [{{A}, {C}} {{B}}]         1
  [4]   c0   c1   c2   [{{A}} {{B}} {{C}}]        1
  [5]   c2   c0   c0   [{{A}} {{B,C}}]            1
  [6]   c2   c0   c0   [{{A}} {{B}, {C}}]         1
  [7]   c2   c0   c2   [{{A}, {C}} {{B}}]         1
  [8]   c2   c1   c0   [{{A}} {{B}} {{C}}]        1
  [9]   c2   c1   c2   [{{A}, {C}} {{B}}]         1

Summing out A (∑=_A) yields:

          B    C    Partition(s)        #
  [1+5]   c0   c0   [{{B,C}}]           1 + 2 = 3
  [6]     c0   c0   [{{B}, {C}}]        2
  [2+7]   c0   c2   [{{B}} {{C}}]       1 + 1 = 2
  [3+8]   c1   c0   [{{B}} {{C}}]       1 + 2 = 3
  [4+9]   c1   c2   [{{B}} {{C}}]       2 + 1 = 3

The #VE= tuple labeled [1] provides a (size(c0) − 1) · 1 = 1 contribution to the #VE= tuple labeled [1+5]. The #VE= tuple labeled [5] provides a size(c2) · 1 = 2 contribution to the same #VE= tuple. Thus the #VE= tuple labeled [1+5] has a value of 3 in the resulting #VE= factor shown above. Notice that if the size of the sets associated with the s-constants were bigger, there would have been one extra #VE= tuple resulting from Example 4.13, namely ⟨c0, c0, c0, [{{A}, {B}, {C}}]⟩, but otherwise we would just be multiplying and adding bigger numbers.

Theorem 4.3. Let F be a #VE= factor and X be a variable. Then G(∑=_X F) = ∑_X G(F).

Proof. Suppose F is a #VE= factor on variables Y1, ..., X, ..., Yn (if F is not a factor on X, then the theorem is trivially true). As before, we write T for a #VE= tuple and t for a ground tuple.

We start by showing that G(∑=_X F) ⊇ ∑_X G(F). Suppose that t = ⟨y1, ..., yn⟩ ∈ ∑_X G(F) and the value of t is v. Then, by the definition of ∑, there exist tuples t1 = ⟨y1, ..., x1, ..., yn⟩, ..., tm = ⟨y1, ..., xm, ..., yn⟩ ∈ G(F), with values v1, ..., vm, such that v = v1 + ··· + vm. The values x1, ..., xm belong to the sets of domain values denoted by s-constants c1, ..., cj, where j ≤ m. By the definition of G(), for each ti, 1 ≤ i ≤ m, there exists a #VE= tuple Tk ∈ F, 1 ≤ k ≤ j, with the s-constant ck at the position of variable X, such that ti ∈ G(Tk). The #VE= tuples T1, ..., Tj are identical at all positions except for the position of variable X (because the corresponding ground tuples are identical at those positions). Also, all of their partitions not involving variable X are identical; otherwise, the partitions in one of the #VE= tuples T1, ..., Tj would have variables Yp and Yr, 1 ≤ p, r ≤ n, p ≠ r, in one block while the partitions from another #VE= tuple would have them in different blocks. This would imply that in one of the tuples t1, ..., tm the elements at the positions of variables Yp and Yr are the same, while in another there are different elements at these positions, which we know is not the case.

The above conclusion means that the #VE= algorithm would add together the #VE= tuples T1, ..., Tj during the summing out of X. Such a summation yields a #VE= tuple T with the same s-constants at the positions of variables Y1, ..., Yn as the tuples T1, ..., Tj, and with the overall contribution to the value of T from the s-constants c1, ..., cj equal to v. It is easy to see that t ∈ G(T), hence t ∈ G(∑=_X F).

Now we will show that G(∑=_X F) ⊆ ∑_X G(F). Consider a tuple t ∈ G(∑=_X F), where the value of t is v. By the definition of G(), there exists a #VE= tuple T = ⟨cy1, ..., cyn, Π′⟩ ∈ ∑=_X F with value v such that t ∈ G(T). Then, by the definition of ∑=, there exist #VE= tuples T1 = ⟨cy1, ..., cx1, ..., cyn, Π1⟩, ..., Tj = ⟨cy1, ..., cxj, ..., cyn, Πj⟩ ∈ F with values v1, ..., vj (for each Πi, 1 ≤ i ≤ j, removing variable X from the partition containing it yields Π′), with T obtained by adding together T1, ..., Tj. Let us denote the effective count of Ti with respect to X by ec(Ti, X), 1 ≤ i ≤ j. Then we have v = ∑_{i=1}^{j} vi · ec(Ti, X). Tuple t corresponds to the sum of ec(T1, X) tuples from G(T1) with value v1, ec(T2, X) tuples from G(T2) with value v2, ..., and ec(Tj, X) tuples from G(Tj) with value vj.
All of these ground tuples were identical to t at the positions of variables Y1, ..., Yn; no other tuples in G(F) had this property. The #VE algorithm would add all of those tuples and obtain t as a result, hence t ∈ ∑_X G(F). □

4.3.2.5 The algorithm

The #VE= algorithm is presented in Figure 4.9. It resembles the #VE algorithm (see Section 4.2.2 and Figure 4.1 on page 110) as well as the VE algorithm for inference in belief networks (see Section 2.3.2 and Figure 2.2 on page 10). As in the #VE algorithm for #CSP, we maintain the following invariant: at each step of the computation, the product of all factors that exist at that step associates with its tuples the number of consistent extensions to the previously eliminated variables. If, given a CSP instance P = (X, D, C), we use the procedure #CSP_VE=(P, E, H) presented in Figure 4.9 to eliminate all variables (E = X), we obtain a factor on the empty set of variables, which is simply a number equal to the number of solutions to the input CSP instance. As in the case of the #VE algorithm, if we decide to eliminate only some of the variables (E ⊂ X), we end up with a factor giving us the number of solutions for each combination of values of the variables X \ E.

A procedure for constructing a #VE= factor-based representation of a CSP instance is shown in Figure 4.10. The procedure does not create 0-valued tuples. It also assumes that the s-constants and the sizes of the corresponding disjoint sets of domain values are given as input data. In Section 4.3.4.1, we comment on the complexity of constructing these data from the most basic, array-based representation of the input CSP instance. In Appendix C we show an example which illustrates how this is computed in our implementation of the lifted probabilistic inference algorithm. In the next section, we apply our #VE= algorithm to a problem with large domains.

procedure #CSP_VE=(P, E, H)
input:  CSP instance P = (X, D, C) with only inequality constraints,
        where the domains in D consist of s disjoint sets S1, ..., Ss,
        represented through s s-constants c1, ..., cs
        with counts size(c1), ..., size(cs);
        set of variables to eliminate E ⊆ X;
        elimination ordering heuristic H;
output: factor representing the solution;

  set F := initialize_VE=(P);
  while there is a factor in F involving a variable from E do
    select variable Xi ∈ E according to H;
    set F := eliminate_VE=(Xi, F, size(c1), ..., size(cs));
    set E := E \ {Xi};
  end
  return ⊗=_{F ∈ F} F;
end

procedure eliminate_VE=(Xi, F, size(c1), ..., size(cs))
input:  variable to be eliminated Xi; set of factors F;
        counts size(c1), ..., size(cs);
output: set of factors F with Xi summed out;

  partition F = {F1, ..., Fm} into {F1, ..., Fl}, which do not involve Xi,
  and {Fl+1, ..., Fm}, which do involve Xi;
  return {F1, ..., Fl, ∑=_Xi (Fl+1 ⊗= ··· ⊗= Fm)};
end

Figure 4.9: #VE= algorithm for #CSP with inequality constraints.

procedure initialize_VE=(P)
input:  CSP instance P = (X, D, C);
output: representation of P as a set of factors F;

  set F := ∅;
  for every binary inequality constraint Ci in C, where S(Ci) = {Xu, Xv} do
    create factor Fi on Xu, Xv:
      for all disjoint sets Si ⊆ D(Xu) and Sj ⊆ D(Xv):
        if Si ≠ Sj, create tuple ⟨ci, cj, [{{Xu}} {{Xv}}]⟩ with value 1;
        if Si = Sj and size(ci) > 1,
          create tuple ⟨ci, ci, [{{Xu}, {Xv}}]⟩ with value 1;
    set F := F ∪ {Fi};
  end
  return F;
end

Figure 4.10: Initialization procedure for the #VE= algorithm.

The four variables A, B, C and D form a constraint graph with inequality edges A ≠ B, A ≠ C, B ≠ D and C ≠ D, with

  D(A) = {x1, x2, ..., x10000},
  D(B) = D(C) = D(D) = {x1, x2, ..., x5000},
  C = {A ≠ x1, A ≠ x3, A ≠ x4, A ≠ x5, A ≠ B, A ≠ C,
       B ≠ x1, B ≠ x2, B ≠ x3, B ≠ x5, B ≠ D,
       C ≠ x1, C ≠ x3, C ≠ x5, C ≠ D,
       D ≠ x1, D ≠ x2, D ≠ x3, D ≠ x5}.

Figure 4.11: A CSP used in Section 4.3.3.

4.3.3 Example computation

Consider the CSP instance presented in Figure 4.11. We want to compute the number of models for different values of D using our algorithm, with the elimination ordering ρ = ⟨A, B, C⟩. After processing all unary constraints, we obtain D(A) = {x2, x6, ..., x10000}, D(C) = {x2, x4, x6, ..., x5000}, and D(B) = D(D) = {x4, x6, ..., x5000}.

Let S0 = {x2}, S1 = {x4}, S2 = {x6, ..., x5000}, and S3 = {x5001, ..., x10000}. Then D(A) = S0 ∪ S2 ∪ S3, D(B) = S1 ∪ S2, D(C) = S0 ∪ S1 ∪ S2 and D(D) = S1 ∪ S2. Let the s-constant ci denote Si, i = 0, 1, 2, 3. The CSP instance can be described using the four #VE= factors and the four counts shown below. In what follows, #VE= tuples marked with an asterisk (*) are pruned, either because they are assigned value 0, or because the effective count of one of the s-constants in the #VE= tuple is 0.

(i) factor on A and B (constraint A ≠ B):

  A    B    Partition(s)         #
  c0   c1   [{{A}} {{B}}]        1
  c0   c2   [{{A}} {{B}}]        1
  c2   c1   [{{A}} {{B}}]        1
  c2   c2   [{{A, B}}]           0 *
  c2   c2   [{{A}, {B}}]         1
  c3   c1   [{{A}} {{B}}]        1
  c3   c2   [{{A}} {{B}}]        1

(ii) factor on A and C (constraint A ≠ C):

  A    C    Partition(s)         #
  c0   c0   [{{A, C}}]           0 *
  c0   c0   [{{A}, {C}}]         1 *
  c0   c1   [{{A}} {{C}}]        1
  c0   c2   [{{A}} {{C}}]        1
  c2   c0   [{{A}} {{C}}]        1
  c2   c1   [{{A}} {{C}}]        1
  c2   c2   [{{A, C}}]           0 *
  c2   c2   [{{A}, {C}}]         1
  c3   c0   [{{A}} {{C}}]        1
  c3   c1   [{{A}} {{C}}]        1
  c3   c2   [{{A}} {{C}}]        1

(iii) factor on B and D (constraint B ≠ D):

  B    D    Partition(s)         #
  c1   c1   [{{B, D}}]           0 *
  c1   c1   [{{B}, {D}}]         1 *
  c1   c2   [{{B}} {{D}}]        1
  c2   c1   [{{B}} {{D}}]        1
  c2   c2   [{{B, D}}]           0 *
  c2   c2   [{{B}, {D}}]         1

(iv) factor on C and D (constraint C ≠ D):

  C    D    Partition(s)         #
  c0   c1   [{{C}} {{D}}]        1
  c0   c2   [{{C}} {{D}}]        1
  c1   c1   [{{C, D}}]           0 *
  c1   c1   [{{C}, {D}}]         1 *
  c1   c2   [{{C}} {{D}}]        1
  c2   c1   [{{C}} {{D}}]        1
  c2   c2   [{{C, D}}]           0 *
  c2   c2   [{{C}, {D}}]         1

and the counts: size(c0) = 1, size(c1) = 1, size(c2) = 4995, size(c3) = 5000.

To eliminate A, we multiply factors (i) and (ii), and sum out A from their product.
As a result we obtain factor (v). The product of factors (i) and (ii) is:

  A    B    C    Partition(s)              #
  c0   c1   c1   [{{A}} {{B,C}}]           1
  c0   c1   c1   [{{A}} {{B}, {C}}]        1 *
  c0   c1   c2   [{{A}} {{B}} {{C}}]       1
  c0   c2   c1   [{{A}} {{B}} {{C}}]       1
  c0   c2   c2   [{{A}} {{B,C}}]           1
  c0   c2   c2   [{{A}} {{B}, {C}}]        1
  c2   c1   c0   [{{A}} {{B}} {{C}}]       1
  c2   c1   c1   [{{A}} {{B,C}}]           1
  c2   c1   c1   [{{A}} {{B}, {C}}]        1 *
  c2   c1   c2   [{{A}, {C}} {{B}}]        1
  c2   c2   c0   [{{A}, {B}} {{C}}]        1
  c2   c2   c1   [{{A}, {B}} {{C}}]        1
  c2   c2   c2   [{{A}, {B,C}}]            1
  c2   c2   c2   [{{A}, {B}, {C}}]         1
  c3   c1   c0   [{{A}} {{B}} {{C}}]       1
  c3   c1   c1   [{{A}} {{B,C}}]           1
  c3   c1   c1   [{{A}} {{B}, {C}}]        1 *
  c3   c1   c2   [{{A}} {{B}} {{C}}]       1
  c3   c2   c0   [{{A}} {{B}} {{C}}]       1
  c3   c2   c1   [{{A}} {{B}} {{C}}]       1
  c3   c2   c2   [{{A}} {{B,C}}]           1
  c3   c2   c2   [{{A}} {{B}, {C}}]        1

and summing out A (∑=_A) gives factor (v):

  B    C    Partition(s)        #
  c1   c0   [{{B}} {{C}}]       9995
  c1   c1   [{{B,C}}]           9996
  c1   c2   [{{B}} {{C}}]       9995
  c2   c0   [{{B}} {{C}}]       9994
  c2   c1   [{{B}} {{C}}]       9995
  c2   c2   [{{B,C}}]           9995
  c2   c2   [{{B}, {C}}]        9994

To eliminate B, we multiply factors (iii) and (v), and sum out B from their product. As a result we obtain factor (vi). The product of factors (iii) and (v) is:

  B    C    D    Partition(s)              #
  c1   c0   c2   [{{B}} {{C}} {{D}}]       9995
  c1   c1   c2   [{{B,C}} {{D}}]           9996
  c1   c2   c2   [{{B}} {{C, D}}]          9995
  c1   c2   c2   [{{B}} {{C}, {D}}]        9995
  c2   c0   c1   [{{B}} {{C}} {{D}}]       9994
  c2   c0   c2   [{{B}, {D}} {{C}}]        9994
  c2   c1   c1   [{{B}} {{C, D}}]          9995
  c2   c1   c1   [{{B}} {{C}, {D}}]        9995 *
  c2   c1   c2   [{{B}, {D}} {{C}}]        9995
  c2   c2   c1   [{{B,C}} {{D}}]           9995
  c2   c2   c1   [{{B}, {C}} {{D}}]        9994
  c2   c2   c2   [{{B}, {C, D}}]           9994
  c2   c2   c2   [{{B,C}, {D}}]            9995
  c2   c2   c2   [{{B}, {C}, {D}}]         9994

and summing out B (∑=_B) gives factor (vi):

  C    D    Partition(s)        #
  c0   c1   [{{C}} {{D}}]       49 920 030
  c0   c2   [{{C}} {{D}}]       49 920 031
  c1   c1   [{{C, D}}]          49 925 025
  c1   c2   [{{C}} {{D}}]       49 925 026
  c2   c1   [{{C}} {{D}}]       49 920 031
  c2   c2   [{{C, D}}]          49 920 031
  c2   c2   [{{C}, {D}}]        49 920 032

Finally, to eliminate C, we multiply factors (iv) and (vi), and sum out C from their product.
As a result we obtain factor (vii). The product of factors (iv) and (vi) is:

  C    D    Partition(s)        #
  c0   c1   [{{C}} {{D}}]       49 920 030
  c0   c2   [{{C}} {{D}}]       49 920 031
  c1   c2   [{{C}} {{D}}]       49 925 026
  c2   c1   [{{C}} {{D}}]       49 920 031
  c2   c2   [{{C}, {D}}]        49 920 032

and summing out C (∑=_C) gives factor (vii):

  D    Partition(s)    #
  c1   [{{D}}]         249 400 474 875
  c2   [{{D}}]         249 400 484 865

To obtain the final result, we would normally multiply all the remaining factors, but in this example we are left with only factor (vii), so it provides our solution: for D = x4, the given CSP instance has 249 400 474 875 solutions, and for each value xi from D(D) = {x6, x7, ..., x5000}, it has 249 400 484 865 solutions. Eliminating D from factor (vii) would give us the overall number of solutions:

  |P| = 249 400 474 875 + 249 400 484 865 · 4995 = 1 246 004 822 375 550.

4.3.4 Complexity of the algorithm

In Section 4.3.4.1, we discuss the complexity of constructing the s-constants and the sizes of the corresponding disjoint sets of domain values. In Section 4.3.4.2, we analyze the complexity of inference in #VE=.

4.3.4.1 Preprocessing

The complexity of the preprocessing phase depends highly on the representation of the input problem instance. One can expect that if the domains of the variables in a CSP instance are very large, then they will be represented using functions and relations. Such a representation might allow for a very efficient construction of the disjoint subsets without enumerating the values of any domain (see Appendix C). To obtain a fair comparison of the preprocessing costs for #VE and #VE=, we analyze below the most basic, array-based representation, which does not favor the #VE= algorithm.

Assume that a CSP instance has n variables with maximum domain size dmax, r binary constraints and some number of unary constraints. Let U be the union of the domains of all n variables, without the values excluded by unary constraints.

[Figure 4.12 shows a Venn diagram of the domains of three variables A, B and C, divided into disjoint subsets S1, ..., S7, with A = S1 ∪ S2 ∪ S3 ∪ S5, B = S1 ∪ S2 ∪ S4 ∪ S6, and C = S1 ∪ S3 ∪ S4 ∪ S7.]

Figure 4.12: Domains of three variables.
We need at most seven disjoint subsets, S1, S2, ..., S7, to specify the domains of the variables.

We will consider a representation where the binary constraints are stored as a list of r tuples, and the values of the domains and the unary constraints are stored in arrays. We need to process the unary constraints, and we need to divide the set U into disjoint subsets so that the domains of the variables can be represented using s-constants corresponding to these subsets. In the worst case we need to split U into 2^n − 1 such subsets, one subset of values per each non-empty subset of the set of variables (see Figure 4.12).

In order to do the above task efficiently, we sort the n arrays storing the domains and the n arrays storing the unary constraints (time complexity O(n dmax log(dmax))). The choice of the ordering is irrelevant as long as it is a strict total ordering. With each array, we associate a pointer, initially set to the first element of the sorted array. We will refer to the elements pointed at as the current elements. We also create a data structure of size 2^n − 1 for storing the counts associated with each potential subset of U.¹

Next, we proceed in a manner resembling the merging of sorted arrays. We choose the smallest current element x from the n domain arrays. We check which of the other current elements of the domain arrays are equal to x, which gives us the set of variables whose domains contain x. We check which of the current elements of the unary constraint arrays are equal to x, which gives us the set of variables for which x is excluded from the domain. This information allows us to update the count for the respective disjoint set. After the count is updated, we move forward the pointers associated with the elements and constraints equal to x. We repeat this step for each of the at most n dmax values in U. The time complexity of this operation is O(n dmax (n + log(2^n − 1))) = O(n² dmax).

¹ A sparse representation should be considered, as one can expect many of the counts to be 0.
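The same partitioning can be sketched with a hash-based signature map instead of the sorted-array merge described above. This is only an illustrative simplification — the variable and value names are hypothetical, and hashing replaces the sorting that the array-based analysis assumes:

```python
def build_sconstants(domains):
    """Split the union of the domains into disjoint subsets, one per
    distinct set of variables whose domains contain a value, and return
    the size of each subset. Unary constraints are assumed to have been
    applied to `domains` already."""
    signature = {}  # value -> set of variables whose domain contains it
    for var, dom in domains.items():
        for value in dom:
            signature.setdefault(value, set()).add(var)
    sizes = {}      # frozenset of variables -> size of the disjoint subset
    for value, variables in signature.items():
        key = frozenset(variables)
        sizes[key] = sizes.get(key, 0) + 1
    return sizes

# Three overlapping domains in the spirit of Figure 4.12:
domains = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 4, 5}}
sizes = build_sconstants(domains)
# Five non-empty disjoint subsets arise: {1} (A only), {2} (A and B),
# {3} (A, B and C), {4} (B and C), {5} (C only) -- each of size 1.
```

With hashing, the expected running time is linear in the total number of domain values, rather than the O(n² dmax) of the sorted-merge scheme; the worst-case guarantees of the sorted approach are lost, however.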
The space complexity is O(n dmax + 2^n). Given the disjoint subsets and their counts, we can create the #VE= factors that represent all r binary constraints. There are at most 2^n − 1 s-constants corresponding to the disjoint subsets with non-zero counts. These s-constants are used to represent the domains of the variables. At the same time, the domain of each variable is represented by at most dmax s-constants, as we do not use s-constants to represent empty subsets, and by at most 2^(n−1) s-constants, as only the subsets of variables that contain a given variable are relevant to its domain. Therefore the domain of each variable is represented by at most min({2^(n−1), dmax}) s-constants. Each #VE= factor has at most min({2^(n−1), dmax})² + min({2^(n−1), dmax}) #VE= tuples. This is because there are min({2^(n−1), dmax})² possible pairs of s-constants, and for the min({2^(n−1), dmax}) pairs of identical s-constants there are two possible partitions of the two associated variables: one in which the variables are in the same block, and one in which they are in different blocks.² Thus, the time and space complexity of creating the #VE= factors is O(r (min({2^(n−1), dmax})² + min({2^(n−1), dmax}))).

The time complexity of the whole preprocessing phase for the array-based representation is O(n dmax (log(dmax) + n) + r (min({2^(n−1), dmax})² + min({2^(n−1), dmax}))), and the space complexity is O(r + n dmax + 2^n + r (min({2^(n−1), dmax})² + min({2^(n−1), dmax}))).

For the same CSP instance, a naive #VE implementation creates the representation in time O(r dmax²). However, most VE implementations first sort the domains of the variables (as this allows for a 1-dimensional representation of factors as well as speeding up the processing of constraints).

² If we do not represent zero-valued #VE= tuples in the #VE= factors, the maximum size of a #VE= factor becomes min({2^(n−1), dmax})², as for pairs of identical s-constants we then only represent the cases where the variables are in different blocks of the partition.
In such cases, the time and space com2 2 ) and O(r + nd plexities are respectively O(ndmax log(dmax ) + rdmax max + rdmax ). Therefore the cost of the preprocessing phase for #VE= does not have to be greater than the cost for #VE. In many cases, #VE= constructs a network much faster and is less likely to run out of memory (see Section 4.3.5). 4.3.4.2 Inference Assume that #VE= factors given as the input data to the #VE= algorithm involve s s-constants. As in the case of the standard #VE algorithm for #CSP, the time and space complexities of the inference phase of the #VE= algorithm are determined by the size of the biggest #VE= factor, which depends on the induced tree width w∗ (ρ) of the associated constraint graph. For the elimination ordering ρ, the biggest #VE= factor represents a function on w∗ (ρ)+1 variables. In the worst case, the domain of each variable in the biggest factor is represented by all s s-constants, and the factor may need to represent all ∗ (ρ)+1 sw possible combinations of those s-constants. For each combination, there are possibly several #VE= tuples in the factor. For a combination containing s-constants c1 , c2 , . . . , ck , if each s-constant occurs in the combination nc1 , nc2 , . . . , nck times, respectively (where nc1 + nc2 + · · · + nck = w∗ (ρ) + 1), then there may be up to ϖnc1 ϖnc2 . . . ϖnck #VE= tuples in the factor. Proposition 4.1 tells us that this number varies between 0 and ϖw∗ (ρ)+1 . Thus, the size of the biggest #VE= factor can be at most: ∗ (ρ)+1 sw ϖw∗ (ρ)+1 . The total number of #VE= factors processed during the execution of the algorithm is bounded by 2r, where r is the number of constraints. Therefore the space complexity of our algorithm is equal to ∗ (ρ)+1 w O rs ϖw∗ (ρ)+1 . (4.6) The time complexity of the algorithm is equal to the space complexity times some polynomial cost of manipulating partitions P(w∗ (ρ) + 1). 
For both space and time complexities, the only direct domain-size dependency is the actual length of the numbers we store, multiply, sum and output. In comparison to the space complexity of the #VE algorithm (see Equation 4.1), the component d^(w*(ρ)+1) (where d is the size of the largest domain) is possibly greatly reduced to s^(w*(ρ)+1), but we have an additional, possibly large, component ϖ_(w*(ρ)+1). The plot from Figure 4.2 suggests that for moderate values of w*(ρ) and large domains, #VE= should achieve much lower space complexity than #VE. A comparison of the time complexities of both algorithms leads to the same conclusion, although there #VE= has an additional, polynomial overhead related to the cost of manipulating partitions.

4.3.5 Empirical evaluation

We implemented in Java both the #VE algorithm³ by Dechter [2003] presented in Section 4.2.2 and the #VE= algorithm⁴ introduced in Section 4.3.2, and compared their performance. We wrote the code using J2SDK 1.4.2.06 and ran it under JRE 1.5.0.01. We performed the experiments on an Intel Pentium IV 3.20GHz machine; 1GB of RAM was made available to the Java Virtual Machine. For accurate timing, we used the method described in [Gørtz, 2000].

The purpose of our tests was to answer the following questions. How does the more complex preprocessing affect the performance of #VE= compared to #VE? How do the rich data structures used by #VE= affect its performance compared to #VE? In the previous section we computed the asymptotic complexities of the preprocessing and inference phases of both algorithms. Even though the #VE= algorithm performs better asymptotically as the domain size increases, we do not know the answers to these questions, because we do not know the constants associated with these complexities. The constants for the #VE= algorithm could be so large as to make it impractical. To estimate the constants associated with each inference method, we performed tests on random CSP instances.
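As an aside, random instances of this kind can be sketched as follows. This is a simplified generator with hypothetical parameter names, not the thesis's actual test harness; it draws each binary inequality constraint and each unary exclusion independently with a fixed probability:

```python
import random
from itertools import combinations

def random_inequality_csp(n_vars, domain_size, p_binary, p_unary, seed=None):
    """Generate a CSP with only inequality constraints: every binary
    constraint Xu != Xv is included independently with probability
    p_binary, and every unary exclusion X != v with probability p_unary."""
    rng = random.Random(seed)
    variables = [f"X{i}" for i in range(n_vars)]
    domain = list(range(domain_size))
    binary = [(u, v) for u, v in combinations(variables, 2)
              if rng.random() < p_binary]
    unary = [(x, v) for x in variables for v in domain
             if rng.random() < p_unary]
    return variables, domain, binary, unary

variables, domain, binary, unary = random_inequality_csp(
    n_vars=10, domain_size=16, p_binary=0.3, p_unary=0.1, seed=0)
```

Seeding the generator makes an instance reproducible, which is convenient when the same instances must be fed to both algorithms under comparison.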
We sampled instances with 5 to 50 variables. Each of the possible binary inequality constraints was chosen independently (as would be done using the standard generation models A and C; see Achlioptas et al. [1997]) with probability varying from 0.1 to 0.5. We only considered instances that theoretically could be solved by the #VE algorithm given 1GB of memory. As a result, the induced width varied from 2 to 32 and the domain size varied from 2 to 2^16 (all variables in one instance having the same domain). We generated two sets of CSPs, each containing 103 instances. In the first set, each of the possible unary constraints was chosen independently with probability 0.1 (which resulted in big differences between the domains of the variables after the unary constraints were processed). In the second set, the probability was equal to 0.01 (so the domains were more similar).

We considered four elimination-ordering heuristics: max-cardinality, min-degree, min-factor and min-fill (see Kjærulff [1990] for an overview of these heuristics). The last three achieved similar performance, superior to the max-cardinality scheme. We report results achieved with the min-degree heuristic, as it is the simplest and the fastest to compute. To allow for the variation in run time due to garbage collection and other covariates, all computations were performed five times. We report the mean run time and, where necessary, the standard error. In all of our experiments, the values of the domains of the variables, the unary constraints and the binary constraints in the input CSP instances were stored in arrays in an arbitrary order.

We analyzed the amount of time #VE and #VE= spent on constructing their internal representations given array-based descriptions of CSPs, and the amount of time they spent doing inference. In Figure 4.13 we present counts of the outcomes of the experiment.

³ http://people.cs.ubc.ca/~kisynski/code/ve_hcsp/
⁴ http://people.cs.ubc.ca/~kisynski/code/ve_in_hcsp/
The #VE= algorithm never failed during the preprocessing phase. During the inference phase, it failed only 3 times on instances for which the #VE algorithm succeeded. #VE= was faster than #VE for most instances, during both preprocessing and inference. In Figure 4.14 we further summarize the results. For each set of instances, we show the total time the algorithms spent on solving all 103 instances. If an algorithm failed to finish a computation for an instance, we included the time it spent on the computation until failure. For the instances with 0.1 probability of a value being excluded by a unary constraint, #VE was not only slower, but also failed to compute the result for 29 instances, while #VE= failed for only 11 instances. In the second set, the probability of a value being excluded by a unary constraint was equal to 0.01; for those instances, #VE= also achieved much better performance than #VE did. The experiments showed that the more complicated preprocessing in the case of #VE= does not significantly influence performance, as the total run time is dominated by the inference phase.

  P(u) = 0.1:
  #VE fails   #VE= fails   #VE faster than #VE=   Preprocessing   Inference
  Yes         Yes          N/A                      0               9
  Yes         No           N/A                      8              12
  No          Yes          N/A                      0               2
  No          No           Yes                     25              21
  No          No           No                      70              51

  P(u) = 0.01:
  #VE fails   #VE= fails   #VE faster than #VE=   Preprocessing   Inference
  Yes         Yes          N/A                      0               8
  Yes         No           N/A                      8              13
  No          Yes          N/A                      0               1
  No          No           Yes                     21              15
  No          No           No                      74              58

Figure 4.13: Summary of experiments. P(u) — probability of a value being excluded by a unary constraint.

  P(u)   Algor.   Preprocessing            Inference                      Total
                  time [s]  se [s]  #f     time [s]    se [s]    #f       time [s]    se [s]    #f
  0.10   #VE      21.85     0.16    8      99 358.84   3 475.86  21       99 380.69   3 475.79  29
  0.10   #VE=      0.60     0.01    0      49 175.77     781.37  11       49 176.36     781.37  11
  0.01   #VE      21.02     0.11    8      77 749.75   2 321.95  21       77 770.78   2 321.93  29
  0.01   #VE=      0.45     0.01    0      31 260.52   2 133.07   9       31 260.97   2 133.08   9

Figure 4.14: Summary of experiments. P(u) — probability of a value being excluded by a unary constraint; #f — number of times a particular algorithm failed due to lack of memory.
The experiments also showed that #VE= is capable of solving more instances than #VE can.

[Figures 4.15 and 4.16 each show two log-scale plots — CSP network creation time and inference time, in milliseconds — against the induced width and the domain size, with separate markers for #VE success, #VE failure, #VE= success and #VE= failure.]

Figure 4.15: Results of experiments on CSP instances with the probability of a value being excluded by a unary constraint equal to 0.1.

Figure 4.16: Results of experiments on CSP instances with the probability of a value being excluded by a unary constraint equal to 0.01.

Figure 4.15 visualizes the detailed results for the first set of instances, and Figure 4.16, for the second one. In both plots, each vertical line corresponds to one test instance. It is important to remember that for all instances for which #VE= succeeded, we could keep increasing the domain size and the #VE= algorithm would still be able to compute the answer, while the #VE algorithm would sooner or later run out of memory.

4.4 Conclusions

We approached #CSP for a practical reason: we needed an efficient, lifted algorithm for the #CSPs induced during lifted probabilistic inference. Such CSP instances involve only inequality constraints (either between a pair of variables or between a variable and a constant), and the domains of the variables might be very large.
The first property turned out to be sufficient to allow for the development of an efficient, lifted method based on the variable elimination framework. The time and space complexities of our algorithm do not depend directly on the domain size. Moreover, because our algorithm is a lifted one, it can inter-operate with a lifted probabilistic inference engine in a natural way (see Appendix C and Appendix D). There has recently been work on coloring problems by Angelsmark and Thapper [2006] and Björklund et al. [2009]. This work does not solve our problems, as the inputs and outputs of these algorithms are not lifted. However, it remains an intriguing possibility that our algorithm could be applied to their problems.

Chapter 5

Constraint Processing in Lifted Inference

So I said to the Gym instructor "Can you teach me to do the splits?" He said "How flexible are you?". I said "I can't make Tuesdays". — Tim Vine

5.1 Introduction

In this chapter we look at lifted inference from the perspective of constraint processing. Through this viewpoint, we analyze and compare existing approaches to constraint processing and expose their advantages and limitations. Our theoretical results show that the wrong choice of constraint processing method can lead to an exponential increase in computational complexity. Our empirical tests confirm the importance of constraint processing in lifted inference. In Section 5.2 we give an overview of constraint processing during lifted probabilistic inference. In Section 5.3 we compare the splitting as needed approach of Poole [2003] with the shattering approach of de Salvo Braz et al. [2007].
In Section 5.4 we compare two approaches to solving #CSPs encountered during lifted inference: the use of a specialized #CSP solver [de Salvo Braz et al., 2007; Poole, 2003] and the requirement that the parfactors used during inference be in normal form [Milch et al., 2008], which simplifies #CSP solving, but requires additional splitting.

5.2 Overview of constraint processing in lifted inference

The specification of a first-order probabilistic model consists of probabilistic statements that may involve inequalities between a logical variable (which represents an individual in a population) and a constant (an individual from the population) as well as between two logical variables. Due to prior knowledge or observations, we may know that some individual(s) should be treated differently from the rest of the population. In that case, statements about the rest of the population require the constraint that the logical variable is not one of the exceptional individuals. For example, if we know that everyone is Joe's friend, for the parameterized random variable friends(X, Y), we may treat the cases Y = joe and Y ≠ joe as separate cases. Example 2.8 shows constraints induced by prior knowledge. We may also need to treat the case where two logical variables are equal differently from the case where they are not.

Example 5.1. We want to specify a conditional probability describing the likelihood that two people are friends, provided they like each other. It is natural to have two parameterized random variables: likes(X, Y) and friends(X, Y), where X and Y are logical variables with the same population representing a group of people. Obviously, the case of whether someone likes themselves (X = Y) should be treated differently from the case of whether someone likes someone else (X ≠ Y).

Inequality constraints can also be introduced by observing. For example, if we observed that Joe and Peter are friends, we need to treat friends(joe, peter) separately from friends(X, Y) for X ≠ joe, Y ≠ peter.
Section 2.5.2.7 contains another example of constraints introduced by observing. Inequality constraints can also be introduced during inference. For example, if we have a probabilistic statement that contains the parameterized random variables p(X) and p(Y), we need to treat the X = Y and X ≠ Y cases separately: in the first case, there is one random variable in each grounding; in the second case, there are two random variables in each grounding.

As pointed out in Poole [2003], when performing probabilistic reasoning within such models we need to take into account the population sizes of logical variables. For each probabilistic statement (an original one or one created during inference) we need to know how many random variables are represented by the parameterized random variables within that statement. Poole [2003] gave an example where the probability that someone is guilty of a crime depends on how many other people could have committed the crime (i.e., the population size). Similarly, the probability that a system will have a component with a fault depends on how many components there are. Thus, for exact inference we need to count the number of solutions to the CSP associated with the individuals and the inequality constraints, that is, we need to solve the associated #CSP.

In Section 2.5.2 we gave an overview of the C-FOVE algorithm [Milch et al., 2008]. In this section we look at it again, but this time we focus on the constraint processing outlined above that is performed during lifted inference with C-FOVE. Let us recall the notation from Section 2.5.2. Φ denotes a set of parfactors and Q denotes the queried random variables, that is, a subset of the set of all random variables represented by parameterized random variables present in parfactors in Φ. Given Φ and Q, the C-FOVE algorithm computes the marginal J_Q(Φ) by summing out the random variables that are not in Q, where possible in a lifted manner.
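To make the counting problem concrete, the following is a minimal brute-force sketch (illustrative names of our own; this is explicit enumeration, not the lifted counting algorithm developed earlier): it counts the solutions of a small CSP whose constraints are the inequalities X ≠ Y and X ≠ x1 over a population of size n.

```python
from itertools import product

def count_solutions(domains, neq_pairs, excluded):
    """Brute-force #CSP: count assignments to logical variables that
    satisfy binary inequality constraints (neq_pairs) and unary
    constraints excluding constants (excluded)."""
    names = list(domains)
    count = 0
    for values in product(*(domains[v] for v in names)):
        a = dict(zip(names, values))
        if any(a[x] == a[y] for x, y in neq_pairs):
            continue
        if any(a[v] == c for v, cs in excluded.items() for c in cs):
            continue
        count += 1
    return count

n = 5
population = [f"x{i}" for i in range(1, n + 1)]
# CSP for a parfactor with constraints {X != Y, X != x1}:
solutions = count_solutions(
    {"X": population, "Y": population},
    neq_pairs=[("X", "Y")],
    excluded={"X": ["x1"]},
)
print(solutions)  # (n - 1) * (n - 1) = 16
```

Enumeration like this takes time growing with the domain size (here O(n^2)), which is exactly the dependence that a lifted counting method avoids.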
Evidence is handled by adding to Φ additional parfactors on observed random variables. As we discussed in the previous section, parfactors in Φ can contain inequality constraints, and parfactors representing observations are likely to explicitly name particular individuals. This makes parameterized random variables sharing the same functor represent different sets of random variables.

Example 5.2. Consider the first-order probabilistic model and its grounding presented in Figure 5.1. Let A and B be logical variables typed with a population D(A) = D(B) = {x1, ..., xn}. Let g and h be functors with range {false, true}. Assume we do not have any specific knowledge about ground instances of g(A), but we have some specific knowledge about h(A, B) for the case where A = x1 and for the case where A ≠ x1 and A = B. We would represent the model using the following parfactors. Let

Φ0 = { ⟨∅, {g(A)}, F0⟩,                           [0]
       ⟨∅, {g(x1), h(x1, B)}, F1⟩,                [1]
       ⟨{A ≠ x1}, {g(A), h(A, A)}, F2⟩,           [2]
       ⟨{A ≠ x1, A ≠ B}, {g(A), h(A, B)}, F3⟩ },  [3]

where F0 is a factor from range(g) to the reals and F1, F2, and F3 are factors from range(g) × range(h) to the reals, be a set of parfactors such that J(Φ0) represents a joint probability distribution over the random variables present in the model.

[Figure 5.1: A first-order model from Example 5.2: the first-order representation of g(A) and h(A, B) with logical variables A and B, alongside its propositional grounding g(x1), ..., g(xn), h(x1, x1), ..., h(xn, xn).]

The sets of random variables represented by g(A) and g(x1) in parfactor [0] and parfactor [1] are not disjoint and are not identical. The same applies to g(A) in parfactor [0] and parfactors [2] and [3]. We will use the set of parfactors Φ0 in a series of examples to follow. We will not compute a particular marginal, just highlight the constraint processing involved in operations that can be performed within Φ0.

5.2.1 Splitting and expanding

Before a ground instance of a parameterized random variable can be summed out, a number of conditions must be satisfied.
One is that a ground instance of a parameterized random variable can be summed out from a parfactor in Φ only if there are no other parfactors in Φ involving this ground instance (condition (S1) of Proposition 2.1 and condition (SC1) of Proposition 2.2). To satisfy this condition, the inference procedure may need to multiply parfactors prior to summing out. Parfactor multiplication has a condition of its own: two parfactors ⟨C1, V1, F1⟩ and ⟨C2, V2, F2⟩ can be multiplied only if, for each parameterized random variable from V1 and for each parameterized random variable from V2, the sets of random variables represented by these two parameterized random variables in the respective parfactors are identical or disjoint (condition (M1) in Proposition 2.3). This condition is trivially satisfied for parameterized random variables with different functors.

In Section 2.5.2.3 we introduced parfactor splitting and expanding a counting formula, the two operations that are used during lifted inference to enable parfactor multiplication. Poole [2003] proposed a scheme in which splitting and expanding are performed as needed during inference, when two parfactors are about to be multiplied and the precondition for multiplication is not satisfied. We call this approach splitting as needed. An alternative, called shattering, was proposed by de Salvo Braz et al. [2007]. The shattering operation performs all the splits and expansions that are required to ensure that, for any two parameterized random variables present in parfactors, the sets of random variables represented by them are either identical or disjoint. We compare splitting as needed and shattering in Section 5.3. Shattering is used in the C-FOVE algorithm of Milch et al. [2008]. Shattering is performed as the first operation of the inference and might also be necessary in the middle of inference if propositionalization or full expansion of a counting formula is performed.
The C-FOVE algorithm also requires all parfactors to be in normal form, and the shattering operation might perform additional splits and expansions to bring all parfactors to normal form. An example computation presented in Section 2.5.2.7 includes a shattering operation. In Appendix E we show how to compute the same query in a more efficient manner using splitting as needed. The following example illustrates how to use splitting to make the sets of random variables represented by two parameterized random variables in different parfactors identical or disjoint, and how to use splitting to convert a parfactor to normal form.

Example 5.3. Let us continue from Example 5.2. The sets of random variables represented by g(A) and g(x1) in parfactor [0] and parfactor [1] are not disjoint and are not identical: g(A) and g(x1) unify with MGU {{A/x1}} (see Section 2.5.2.5 for an overview of the role of unification in lifted probabilistic inference). We split parfactor [0] on the substitution {A/x1}, creating parfactors [4] and [5]:

Φ1 = { ⟨∅, {g(x1), h(x1, B)}, F1⟩,                [1]
       ⟨{A ≠ x1}, {g(A), h(A, A)}, F2⟩,           [2]
       ⟨{A ≠ x1, A ≠ B}, {g(A), h(A, B)}, F3⟩,    [3]
       ⟨∅, {g(x1)}, F0⟩,                           [4]
       ⟨{A ≠ x1}, {g(A)}, F0⟩ }.                   [5]

We have J(Φ1) = J(Φ0). The sets of random variables represented by parameterized random variables present in Φ1 are disjoint or identical, but parfactor [3] is not in normal form, as E_A^C \ {B} = {x1} ≠ ∅ = E_B^C \ {A}, where C = {A ≠ x1, A ≠ B}. To bring parfactor [3] to normal form, we split it on the substitution {B/x1}, creating parfactors [6] and [7]:

Φ2 = { ⟨∅, {g(x1), h(x1, B)}, F1⟩,                        [1]
       ⟨{A ≠ x1}, {g(A), h(A, A)}, F2⟩,                   [2]
       ⟨∅, {g(x1)}, F0⟩,                                   [4]
       ⟨{A ≠ x1}, {g(A)}, F0⟩,                             [5]
       ⟨{A ≠ x1}, {g(A), h(A, x1)}, F3⟩,                   [6]
       ⟨{A ≠ x1, A ≠ B, B ≠ x1}, {g(A), h(A, B)}, F3⟩ }.  [7]

We have J(Φ2) = J(Φ1) and all parfactors in Φ2 are in normal form.
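The effect of a split can be checked directly on groundings. The following small sketch (helper names are our own, not C-FOVE's) verifies that splitting parfactor [0] from Example 5.2 on {A/x1} partitions its ground instances into those covered by [4] and by the residual [5]:

```python
from itertools import product

def groundings(logvars, domains, constraints):
    """All ground substitutions satisfying inequality constraints.
    constraints: pairs (var, other) meaning 'var != other', where
    other is either another logical variable or a constant."""
    result = set()
    for values in product(*(domains[v] for v in logvars)):
        a = dict(zip(logvars, values))
        # a.get(y, y): look up y if it is a variable, else treat as constant
        if all(a[x] != a.get(y, y) for x, y in constraints):
            result.add(tuple(values))
    return result

domains = {"A": ["x1", "x2", "x3"]}

# Parfactor [0]: <emptyset, {g(A)}, F0> covers all groundings of A.
before = groundings(["A"], domains, [])
# Split on {A/x1}: parfactor [4] covers only A = x1,
# the residual [5] carries the constraint A != x1.
part_4 = {("x1",)}
part_5 = groundings(["A"], domains, [("A", "x1")])

assert part_4 | part_5 == before and not (part_4 & part_5)
print(sorted(before), sorted(part_5))
```

The assertion makes the point of the operation explicit: a split never changes the set of ground random variables covered, it only repartitions it.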
5.2.2 Multiplication

Once the preconditions for parfactor multiplication are satisfied, multiplication can be performed in a lifted manner as described in Section 2.5.2.2. The lifted inference procedure needs to know how many factors each parfactor involved in the multiplication represents and how many factors their product will represent. These numbers can be different because the product parfactor might involve more logical variables than a parfactor participating in the multiplication. Given a parfactor ⟨C, V, F⟩, the number of factors it represents is equal to the number of solutions to the constraint satisfaction problem (CSP, see Section 4.2.1) formed by the logical variables from V and the constraints from C, which means that we need to solve the associated #CSP. We compare existing approaches to this problem in Section 5.4.

Example 5.4. Assume that we want to multiply parfactors [4] and [1] from Φ2 (Example 5.3). Parfactor [4], ⟨∅, {g(x1)}, F0⟩, represents 1 factor, while parfactor [1], ⟨∅, {g(x1), h(x1, B)}, F1⟩, represents n factors. Their product, a parfactor ⟨∅, {g(x1), h(x1, B)}, F8⟩, where F8 is a factor from range(g) × range(h) to the reals, represents n factors. We need to bring the values of the factor F0 to the power 1/n (see Proposition 2.3) when computing F8: F8 = F0^(1/n) · F1. Let

Φ3 = { ⟨{A ≠ x1}, {g(A), h(A, A)}, F2⟩,                   [2]
       ⟨{A ≠ x1}, {g(A)}, F0⟩,                             [5]
       ⟨{A ≠ x1}, {g(A), h(A, x1)}, F3⟩,                   [6]
       ⟨{A ≠ x1, A ≠ B, B ≠ x1}, {g(A), h(A, B)}, F3⟩,    [7]
       ⟨∅, {g(x1), h(x1, B)}, F8⟩ }.                       [8]

We have J(Φ3) = J(Φ2) and all parfactors in Φ3 are in normal form.

5.2.3 Summing out

During lifted summing out, a parameterized random variable is summed out from a parfactor ⟨C, V, F⟩, which means that a random variable is eliminated from each factor represented by the parfactor in one inference step. Lifted inference will perform the summing out only once, on the factor F.
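The 1/n power correction of Example 5.4 can be checked numerically. The sketch below (with made-up factor values, purely for illustration) verifies that the single ground factor of [4] times the n ground factors of [1] equals the ground product of the n factors represented by ⟨∅, {g(x1), h(x1, B)}, F8⟩ with F8 = F0^(1/n) · F1:

```python
import math
from itertools import product

n = 4                          # population size of B
R = [False, True]              # range of g and of h

# Made-up factor values for illustration.
F0 = {g: 1.0 + g for g in R}                     # factor on g(x1)
F1 = {(g, h): 2.0 + g + 0.5 * h                   # factor on g(x1), h(x1, B)
      for g, h in product(R, R)}

# Product parfactor's factor component: F8 = F0^(1/n) * F1.
F8 = {(g, h): F0[g] ** (1.0 / n) * F1[(g, h)] for g, h in product(R, R)}

# Check on an arbitrary joint assignment: g(x1) = True and one h value
# per grounding of B.
g_val = True
h_vals = [False, True, True, False]

ground_product_before = F0[g_val] * math.prod(F1[(g_val, h)] for h in h_vals)
ground_product_after = math.prod(F8[(g_val, h)] for h in h_vals)

assert math.isclose(ground_product_before, ground_product_after)
print(ground_product_before)  # prints 220.5
```

Intuitively, F0 appears once in the ground model but is spread across n ground copies of the product parfactor, so each copy must carry only an n-th root of it.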
If some logical variables appear only in the parameterized random variable that is being eliminated, the resulting parfactor will represent fewer factors than the original one. As in the case of parfactor multiplication, the inference procedure needs to compensate for this difference. The values of the factor component in the resulting parfactor are brought to the power r, where the number r tells us how many times fewer factors the result of summing out represents compared to the original parfactor. This is described formally in Equations 2.3 and 2.5 from Propositions 2.1 and 2.2, respectively. The number r is equal to the size of the set (D(X1) × ··· × D(Xk)) : C, where X1, ..., Xk are the logical variables that will disappear from the parfactor. If the parfactor is in normal form, the exponent r does not depend on the values of the logical variables remaining in the parfactor after summing out. In such a case, the computation reduces to solving a #CSP and is no different from the computation performed during parfactor multiplication.

Example 5.5. Assume that we want to sum out ground(h(A, B)) : {A ≠ x1, A ≠ B, B ≠ x1} from parfactor [7] from Φ3 (Example 5.4), i.e., we want to sum out all ground instances of h(A, B) that satisfy the constraints in [7]. The logical variable B will not appear in the resulting parfactor. We have |D(B) : {A ≠ x1, A ≠ B, B ≠ x1}| = n − 2. Let F9 be a factor from range(g) to the reals, F9 = (∑_{h(A,B)} F3)^(n−2). As a result of the summation we obtain a parfactor ⟨{A ≠ x1}, {g(A)}, F9⟩. Let

Φ4 = { ⟨{A ≠ x1}, {g(A), h(A, A)}, F2⟩,    [2]
       ⟨{A ≠ x1}, {g(A)}, F0⟩,              [5]
       ⟨{A ≠ x1}, {g(A), h(A, x1)}, F3⟩,    [6]
       ⟨∅, {g(x1), h(x1, B)}, F8⟩,          [8]
       ⟨{A ≠ x1}, {g(A)}, F9⟩ }.            [9]

We have J(Φ4) = ∑_{ground(h(A,B)):{A ≠ x1, A ≠ B, B ≠ x1}} J(Φ3). If the parfactor is not in normal form, the exponent r may depend on the values of the remaining logical variables, and it is necessary to compute the sizes of the set (D(X1) × ··· × D(Xk)) : C conditioned on the values of the logical variables remaining in the parfactor.
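The exponent n − 2 in Example 5.5 can be verified by brute force on a small domain. The sketch below (made-up factor values; it grounds parfactor [7] explicitly, which is exactly what lifted inference avoids) checks that eliminating every ground h(A, B) from the ground product of [7] gives the same value as the ground product of ⟨{A ≠ x1}, {g(A)}, F9⟩ with F9 = (∑_h F3)^(n−2):

```python
import math
from itertools import product

n = 5
pop = [f"x{i}" for i in range(1, n + 1)]
R = [False, True]

F3 = {(g, h): 1.0 + 2 * g + 0.5 * h for g, h in product(R, R)}  # made up

# Valid (A, B) pairs for parfactor [7]: A != x1, A != B, B != x1.
pairs = [(a, b) for a in pop for b in pop
         if a != "x1" and b != "x1" and a != b]

# Fix an arbitrary assignment to the g(A) instances (A != x1).
g_assign = {a: (i % 2 == 0) for i, a in enumerate(pop) if a != "x1"}

# Brute force: sum each ground h(a, b) out of its own ground factor,
# then take the product over all ground factors represented by [7].
brute = math.prod(
    sum(F3[(g_assign[a], h)] for h in R) for a, b in pairs)

# Lifted: F9 = (sum_h F3)^(n-2), one factor per remaining value of A.
F9 = {g: sum(F3[(g, h)] for h in R) ** (n - 2) for g in R}
lifted = math.prod(F9[g_assign[a]] for a in pop if a != "x1")

assert math.isclose(brute, lifted)
print(brute)
```

Each value a of A admits exactly n − 2 valid values of B, which is why the single lifted summation raised to the power n − 2 reproduces the brute-force result.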
In such a case, the summing out operation creates multiple parfactors.

Example 5.6. Consider an alternative joint probability distribution over the random variables present in the model from Figure 5.1 to the distribution presented in Example 5.2. Assume we do not have any specific knowledge about ground instances of g(A), but we have some specific knowledge about h(A, B) for the case where B = x1 and for the case where A = B. We would represent the model using the following parfactors:

Φi = { ⟨∅, {g(A)}, Fi⟩,                            [i]
       ⟨∅, {g(A), h(A, x1)}, Fii⟩,                 [ii]
       ⟨{A ≠ x1}, {g(A), h(A, A)}, Fiii⟩,          [iii]
       ⟨{A ≠ B, B ≠ x1}, {g(A), h(A, B)}, Fiv⟩ },  [iv]

where Fi is a factor from range(g) to the reals and Fii, Fiii, and Fiv are factors from range(g) × range(h) to the reals. The sets of random variables represented by g(A) in parfactors [i] and [ii] and parfactors [iii] and [iv] are not disjoint and are not identical. The sets of random variables represented by h(A, x1), h(A, A), and h(A, B) in parfactors from Φi are disjoint. Parfactor [iv] is not in normal form, as E_A^C \ {B} = ∅ ≠ {x1} = E_B^C \ {A}, where C = {A ≠ B, B ≠ x1}.

Assume that we want to sum out ground(h(A, B)) : {A ≠ B, B ≠ x1} from Φi, i.e., we want to sum out all ground instances of h(A, B) that satisfy the constraints in [iv]. Conditions (S1) and (S2) of Proposition 2.1 are satisfied. The requirement that parfactor [iv] be in normal form is not satisfied. Nevertheless, lifted summation can be performed. The logical variable B will disappear from [iv]. We have

|D(B) : {A ≠ B, B ≠ x1}| = n − 1, if A = x1;
                           n − 2, if A ≠ x1.      (5.1)

Let Fv be a factor from range(g) to the reals, Fv = ∑_{h(A,B)} Fiv. We obtain two parfactors as the result of the summation: a parfactor ⟨∅, {g(x1)}, Fv^(n−1)⟩ and a parfactor ⟨{A ≠ x1}, {g(A)}, Fv^(n−2)⟩.
Let

Φii = { ⟨∅, {g(A)}, Fi⟩,                      [i]
        ⟨∅, {g(A), h(A, x1)}, Fii⟩,           [ii]
        ⟨{A ≠ x1}, {g(A), h(A, A)}, Fiii⟩,    [iii]
        ⟨∅, {g(x1)}, Fv^(n−1)⟩,               [v]
        ⟨{A ≠ x1}, {g(A)}, Fv^(n−2)⟩ }.       [vi]

We have J(Φii) = ∑_{ground(h(A,B)):{A ≠ B, B ≠ x1}} J(Φi).

The above example shows that the normal-form requirement is not necessary, but if it is not satisfied, the solution of a #CSP encountered during summing out is conditioned on the values of the logical variables remaining in the parfactor. In Section 5.4 we compare two approaches to solving #CSPs during lifted probabilistic inference: one based on a #CSP solver [de Salvo Braz et al., 2007; Poole, 2003], which can compute the expression (5.1), and the other based on normal form [Milch et al., 2008]. Similar observations could be made for summing out a parameterized random variable from an aggregation parfactor (see Section 3.4.2.3).

The above overview of lifted inference, together with simple examples, shows that constraint processing is an integral, important part of lifted probabilistic inference. It also points out that there are two choices regarding constraint processing during lifted inference: a choice between splitting as needed and shattering, and a choice between using a #CSP solver and enforcing normal form. These choices are independent: shattering can be used with a #CSP solver [de Salvo Braz et al., 2007] or with normal-form parfactors [Milch et al., 2008], and splitting as needed can be used together with a #CSP solver [Poole, 2003]; one could also imagine using splitting as needed while requiring normal form for parfactors. The rest of this chapter is devoted to the comparison of these different strategies.

5.3 Splitting as needed vs. shattering

In this section we compare the splitting as needed approach and the shattering approach. For clarity, we do not analyze here the splits related to converting constraint sets to normal form. Shattering simplifies the design and implementation of lifted inference procedures.
Given a set of parfactors Φ, shattering performs all splits and expansions on substitutions that are obtained from unification and analysis of constraints (see Section 2.5.2.5). After shattering, all sets of random variables represented by parameterized random variables present in Φ are pairwise identical or disjoint. Lifted inference will not perform any additional splits or expansions, unless propositionalization or full expansion is performed. Splitting as needed will not perform splits or expansions on substitutions other than those used during shattering (unless propositionalization or full expansion is performed). In fact, it might perform far fewer splits and expansions, as we demonstrate in the example below.

Example 5.7. Consider the following set of parfactors:

Φ = { ⟨∅, {gQ(), g1(X1, X2, ..., Xk), g2(X2, X3, ..., Xk), ..., gk(Xk)}, F0⟩,  [0]
      ⟨∅, {g1(a, X2, ..., Xk)}, F1⟩,     [1]
      ⟨∅, {g2(a, X3, ..., Xk)}, F2⟩,     [2]
      ...,
      ⟨∅, {gk−1(a, Xk)}, Fk−1⟩,          [k − 1]
      ⟨∅, {gk(a)}, Fk⟩ },                [k]

and let Q = ground(gQ()). Assume we want to compute the marginal J_Q(Φ). Lifted inference with shattering will create exponentially more parfactors (in the number of logical variables in a parfactor) than lifted inference with splitting as needed. For i = 1, ..., k, the set of random variables represented by the parameterized random variable gi(Xi, ..., Xk) in parfactor [0] is a proper superset of the set of random variables represented by the parameterized random variable gi(a, Xi+1, ..., Xk) in parfactor [i]. Therefore lifted inference with shattering needs to perform several splits. Since the order of splits during shattering does not matter here, assume that the first operation is a split of parfactor [0] on the substitution {X1/a}, which creates a parfactor

⟨∅, {gQ(), g1(a, X2, ..., Xk), g2(X2, X3, ..., Xk), ..., gk(Xk)}, F0⟩  [k + 1]

and a residual parfactor

⟨{X1 ≠ a}, {gQ(), g1(X1, X2, ..., Xk), g2(X2, X3, ..., Xk), ..., gk(Xk)}, F0⟩.  [k + 2]

In both newly created parfactors, for i = 2, ..., k, the set of random variables represented by the parameterized random variable gi(Xi, ..., Xk) is a proper superset of the set of random variables represented by the parameterized random variable gi(a, Xi+1, ..., Xk) in parfactor [i], and shattering proceeds with further splits of both parfactors. Assume that in the next step parfactors [k + 1] and [k + 2] are split on the substitution {X2/a}. The splits result in four new parfactors. The result of the split of parfactor [k + 1] on {X2/a} contains a parameterized random variable g1(a, a, ..., Xk), and parfactor [1] needs to be split on the substitution {X2/a}. The shattering process continues following the scheme described above. It terminates after 2^(k+1) − k − 2 splits and results in 2^(k+1) − 1 parfactors (each original parfactor [i], i = 0, ..., k, is shattered into 2^(k−i) parfactors). Assume that after the initial shattering, lifted inference proceeds with the optimal elimination ordering g1, ..., gk (this elimination ordering does not introduce counting formulas, whereas other orderings do). To compute the marginal J_Q(Φ), 2^k lifted multiplications and 2^(k+1) − 2 lifted summations are performed.

Consider lifted inference with splitting as needed. Assume lifted inference follows the elimination ordering g1, ..., gk. The set of random variables represented by the parameterized random variable g1(X1, ..., Xk) in parfactor [0] is a proper superset of the set of random variables represented by the parameterized random variable g1(a, X2, ..., Xk) in parfactor [1], and parfactor [0] is split on the substitution {X1/a}. The split results in parfactors identical to parfactors [k + 1] and [k + 2] from the description of shattering above. Parfactor [k + 1] is multiplied by parfactor [1] and all ground instances of g1(a, X2, ..., Xk) are summed out from their product, while all ground instances of g1(X1, X2, ..., Xk) (subject to the constraint X1 ≠ a) are summed out from parfactor [k + 2]. The summations create two parfactors:

⟨∅, {gQ(), g2(X2, X3, ..., Xk), ..., gk(Xk)}, Fk+3⟩,  [k + 3]
⟨∅, {gQ(), g2(X2, X3, ..., Xk), ..., gk(Xk)}, Fk+4⟩.  [k + 4]

Ground instances of g2 are eliminated next. Parfactors [k + 3] and [k + 4] are split on the substitution {X2/a}, the results of the splits and parfactor [2] are multiplied together, and the residual parfactors are multiplied together. Then, all ground instances of g2(a, X3, ..., Xk) are summed out from the first product, while all ground instances of g2(X2, ..., Xk) (subject to the constraint X2 ≠ a) are summed out from the second product. The elimination of g3, ..., gk looks the same as for g2. In total, 2k − 1 splits, 3k − 2 lifted multiplications, and 2k lifted summations are performed. At any moment, the maximum number of parfactors is k + 3.

The above example shows that the shattering approach is sometimes much worse than splitting as needed. Even though shattering creates a lot of additional parfactors, parfactors resulting from the same split share the same factor component. If factors are implemented as an immutable data structure, then they can be shared between different parfactors inside the implementation. This might help reduce the overhead of shattering. We demonstrate this in the following example.

Example 5.8. Consider the following set of parfactors:

Φ = { ⟨∅, {gQ(), g1(a)}, F0⟩,          [0]
      ⟨{X ≠ a}, {gQ(), g1(X)}, F1⟩,    [1]
      ⟨∅, {g1(X), g2(X)}, F2⟩,         [2]
      ⟨∅, {g2(X), g3(X)}, F3⟩,         [3]
      ...,
      ⟨∅, {gk−1(X), gk(X)}, Fk⟩,       [k]
      ⟨∅, {gk(X)}, Fk+1⟩ }.            [k + 1]

Assume that all functors have range size 10, Q = ground(gQ()), and that we want to compute the marginal J_Q(Φ).
Because of the presence of the parameterized random variable g1(a) in parfactor [0], lifted inference with shattering splits parfactor [2] on the substitution {X/a}, which creates a parfactor ⟨∅, {g1(a), g2(a)}, F2⟩ and a residual parfactor. This in turn causes a split of parfactor [3] on the substitution {X/a}. Splits propagate until the last split of parfactor [k + 1] on the substitution {X/a}, for a total of k splits, which create k additional parfactors. Afterward, probabilistic inference proceeds with 2k + 1 multiplications and k + 1 summations, regardless of the elimination ordering. Lifted inference with splitting as needed with elimination ordering g1, g2, ..., gk would perform exactly the same splits, multiplications, and summations as lifted inference with shattering. Lifted inference with splitting as needed and ordering gk, gk−1, ..., g1 performs just 1 split, k + 2 multiplications, and k + 1 summations.

As we said in the introduction to this example, even though shattering creates a lot of additional parfactors, parfactors resulting from the same split share the same factor component, so the overhead of shattering might be smaller than expected. We used a Java implementation of C-FOVE with factors implemented as an immutable data structure shared between different parfactors to verify this point. Tests were performed on an Intel Core 2 Duo 2.66GHz processor with 1GB of memory made available to the JVM. Figure 5.2 shows the results of the experiment, where we varied k from 1 to 100. We report the speedup of splitting as needed with elimination ordering gk, gk−1, ..., g1 over shattering, averaged over 10 runs. Even though lifted inference with shattering created nearly twice as many parfactors, it used virtually the same amount of memory as lifted inference with splitting.

[Figure 5.2: Speedup of splitting as needed over shattering for Example 5.8, plotted against k ranging from 1 to 100.]
It was slower because it performed more arithmetic operations, but it was not slower by a factor of two. While minimizing the number of splits might result in faster inference, the size of the factor components created during inference has a decisive influence on the speed of inference. Consider the following example.

Example 5.9. Consider the following set of parfactors:

Φ = { ⟨∅, {gQ(), g1(X)}, F0⟩,            [0]
      ⟨∅, {g1(X), g2(X)}, F1⟩,           [1]
      ...,
      ⟨∅, {gk−2(X), gk−1(X)}, Fk−2⟩,     [k − 2]
      ⟨∅, {gk−1(X), gk(X)}, Fk−1⟩,       [k − 1]
      ⟨∅, {gk(a)}, Fk⟩,                  [k]
      ⟨{X ≠ a}, {gk(X)}, Fk+1⟩ }.        [k + 1]

Assume that the functors have range size 10, Q = ground(gQ()), and that we want to compute the marginal J_Q(Φ). Because of the presence of gk(a) in parfactor [k], lifted inference with shattering splits parfactor [k − 1] on the substitution {X/a}, which creates a parfactor ⟨∅, {gk−1(a), gk(a)}, Fk−1⟩ (and a residual parfactor). This in turn causes a split of parfactor [k − 2] on the substitution {X/a}. Splits propagate until the last split of parfactor [0] on the substitution {X/a}, for a total of k splits, which create k additional parfactors. Afterward, probabilistic inference proceeds with 2k multiplications and 2k summations, regardless of the elimination ordering. The maximum size among the factors created during inference is 100. Lifted inference with splitting as needed and elimination ordering gk, gk−1, ..., g1 would achieve identical performance.

Lifted inference with splitting as needed and elimination ordering g1, g2, ..., gk first performs k − 1 multiplications and summations and is left with three parfactors: ⟨∅, {gQ(), gk(X)}, F⟩ and parfactors [k] and [k + 1]. Then it splits the first parfactor on the substitution {X/a}. The result of the split is multiplied by parfactor [k] and the residual is multiplied by parfactor [k + 1]. Two additional summations eliminate the ground instances of gk from the obtained products. The final result is obtained by multiplying together the two resulting parfactors.
Inference performed 1 split, k + 2 multiplications, and k + 1 summations. The maximum size among the factors created during inference is 1000.

[Figure 5.3: Splitting as needed vs. shattering speedup for Example 5.9, plotted against k ranging from 50 to 100.]

We compared the shattering approach to splitting as needed with elimination ordering g1, g2, ..., gk under the same conditions as in Example 5.8. Figure 5.3 shows the results of the experiment, where we varied k from 50 to 100. Even though lifted inference with the splitting as needed approach with elimination ordering g1, g2, ..., gk performs only approximately half of the operations on parfactors compared to the shattering approach, it is slower because it manipulates much larger factors and thus performs many more arithmetic operations. An elimination ordering heuristic should be able to detect that the above elimination ordering is undesirable.

From the above examples we can see that, with a good elimination ordering heuristic, the splitting as needed approach can be faster than the shattering approach. It is worth pointing out, though, that splitting as needed complicates the design of an elimination ordering heuristic. With the shattering approach it is easy to compute the total size of the parfactors that would be created by a potential inference step (see Section 2.5.2.6). With the splitting as needed approach, for each step under consideration we first need to determine the potentially necessary splits.

5.4 Normal form parfactors vs. #CSP solver

In Section 5.2 we pointed out that, during lifted probabilistic inference, we may encounter arbitrarily complicated #CSPs with inequality constraints because of the model's complexity, as well as the fact that the inference itself introduces many new inequality constraints. First-order probabilistic models usually involve logical variables with large populations, which means that we need to be able to solve #CSPs with large domains.
Moreover, one run of a first-order probabilistic inference algorithm might involve solving many #CSPs, so an efficient algorithm for solving #CSPs is required. We introduced such an algorithm in Chapter 4. Milch et al. [2008] avoid the need to use a constraint solver by requiring all parfactors to be in normal form (Section 2.5.1.1). Below we present two propositions which give insights into the structure of constraints in normal-form parfactors.

Proposition 5.1. Let ⟨C, V, F⟩ be a parfactor in normal form. Then each connected component of the constraint graph corresponding to the CSP formed by the logical variables from V and the constraints from C is fully connected.

Proof. The proposition is trivially true for components with one or two logical variables. Let us consider a connected component with more than two logical variables. Suppose, contrary to our claim, that there are two logical variables X and Y with no edge between them. Since the component is connected, there exists a path X, Z1, Z2, ..., Zm, Y. As C is in normal form, E_{Zi}^C \ {Zi+1} = E_{Zi+1}^C \ {Zi}, i = 1, ..., m − 1, and E_{Zm}^C \ {Y} = E_Y^C \ {Zm}. We have X ∈ E_{Z1}^C, and consequently X ∈ E_Y^C. This contradicts our assumption that there is no edge between X and Y.

Proposition 5.2. Let ⟨C, V, F⟩ be a parfactor in normal form. Let P be the CSP formed by the logical variables from V and the constraints from C. Then for logical variables in the same connected component of the constraint graph corresponding to P, the sets of constants excluded by unary constraints are identical.

Proof. The proposition is trivially true for components with one logical variable. Let us consider a connected component with more than one logical variable. Suppose, contrary to our claim, that there are two logical variables X and Y and a constant c such that c ∈ E_X^C and c ∉ E_Y^C. Since the parfactor is in normal form, from Proposition 5.1 we know that the component is fully connected and (X ≠ Y) ∈ C.
Therefore, because the parfactor is in normal form, we have E^C_X \ {Y} = E^C_Y \ {X}. This contradicts our assumption that c ∈ E^C_X and c ∉ E^C_Y.

The following corollary is a direct consequence of Propositions 5.1 and 5.2 and the definition of a normal-form parfactor from Section 2.5.1.1.

Corollary 5.3. A parfactor ⟨C, V, F⟩ is in normal form if and only if each connected component of the constraint graph corresponding to the CSP formed by the logical variables from V and the constraints from C is fully connected, and each logical variable within the same connected component is subject to the same unary constraints.

The above characterization of normal-form parfactors is the basis for a straightforward computation of the number of factors represented by a normal-form parfactor. We describe it in the following proposition.

Proposition 5.4. Let ⟨C, V, F⟩ be a normal-form parfactor. Assume that the constraint graph corresponding to the CSP formed by the logical variables present in V and the constraints from C consists of one connected component. Let m be the number of logical variables present in V, n be the size of the population of these logical variables, and l be the number of different constants present in C, that is, l = |{c : ∃X ∈ V (X ≠ c) ∈ C}|. Then

|⟨C, V, F⟩| = (n − l)! / (n − l − m)!.

Proof. From the assumption that the underlying constraint graph consists of a single connected component we know that the graph has m nodes, one node for each logical variable present in V. Since the parfactor is in normal form, by Proposition 5.2, for each logical variable the set of constants excluded by unary constraints in C is identical and has size l. From Proposition 5.1 we know that the constraint graph is fully connected. Let us traverse the constraint graph in an arbitrary order and construct all possible ground substitutions to all logical variables in V that satisfy the constraints in C.
The first logical variable can take n − l values, the second one n − l − 1 values, and so on, until we reach the last, m-th, logical variable, which can take n − l − m + 1 values. Therefore the number of ground substitutions is as follows:

(n − l)(n − l − 1) · · · (n − l − m + 1) = (n − l)! / (n − l − m)!.

The above proposition can be easily generalized to the case where there are multiple connected components.

While Proposition 5.4 shows that for normal-form parfactors solving a #CSP is straightforward, the normal form also has negative consequences. The requirement that all parfactors be in normal form is enforced by splitting parfactors that are not in normal form on appropriate substitutions. Milch et al. [2008], who introduced the concept of normal-form parfactors, do not provide any details on how to do this conversion.

procedure convert(⟨C, V, F⟩)
input: parfactor ⟨C, V, F⟩;
output: set Ψ of normal-form parfactors created from ⟨C, V, F⟩ through splitting;
    add ⟨C, V, F⟩ to the queue;
    set Ψ := {};
    while queue not empty do
        remove parfactor g = ⟨C_i, V_i, F_i⟩ from the queue;
        for each logical variable X from C_i do
            for each logical variable Y from the same component of C_i as X do
                if (X ≠ Y) ∉ C_i then
                    split g on {X/Y};
                    add the result to the queue;
                    set g to the residual;
                for each unary constraint Y ≠ t from C_i do
                    if (X ≠ t) ∉ C_i then
                        split g on {X/t};
                        add the result to the queue;
                        set g to the residual;
                end
            end
        end
        add g to Ψ;
    end
    return Ψ;
end

Figure 5.4: Algorithm for converting a parfactor to a set of normal-form parfactors.
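Before turning to the conversion itself, the closed form of Proposition 5.4 can be sanity-checked against brute-force enumeration. The sketch below is ours (function names hypothetical); it represents the l excluded constants as the first l population elements, which is without loss of generality:

```python
from itertools import permutations
from math import factorial

def normal_form_count(n, l, m):
    """Closed form of Proposition 5.4: (n - l)! / (n - l - m)!."""
    return factorial(n - l) // factorial(n - l - m)

def brute_force_count(n, l, m):
    """Directly enumerate ground substitutions for m logical variables
    over a population of size n: pairwise-distinct values (the component
    is fully connected by inequality constraints), each avoiding the same
    l excluded constants (here taken to be elements 0..l-1)."""
    return sum(1 for _ in permutations(range(l, n), m))

for n, l, m in [(10, 2, 3), (6, 0, 4), (5, 1, 2)]:
    assert normal_form_count(n, l, m) == brute_force_count(n, l, m)
```

The brute force is exponential in m; the closed form is what makes #CSPs on normal-form parfactors trivial.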
Nevertheless, Corollary 5.3 provides us with a clear recipe for a conversion algorithm: perform all the splits required to make each connected component of the associated constraint graph fully connected, perform all the splits required to make the logical variables in the same connected component subject to the same unary constraints, and repeat the process for all by-product parfactors created by these splits. The algorithm is presented in Figure 5.4.

It is clear that the conversion of a single parfactor to normal form can create several parfactors. We illustrate this point with the two examples below.

Figure 5.5: A simple constraint graph from Example 5.10 (a) and the constraint graphs obtained through a conversion to normal form (b).

Example 5.10. Consider a parfactor ⟨C, V, F⟩, where C = {A ≠ B, A ≠ C, B ≠ D, C ≠ D} and there are no other logical variables present in V. The corresponding constraint graph is presented in Figure 5.5 (a). The parfactor is not in normal form; for example, E^C_A \ {B} ≠ E^C_B \ {A}. To convert the parfactor to a set of parfactors in normal form we need to perform splits on the substitutions {A/D} and {B/C}, in arbitrary order. After we split on the first substitution, we split both resulting parfactors on the second substitution. Thus we perform three splits and obtain four parfactors. The constraint graphs corresponding to these parfactors are presented in Figure 5.5 (b). Note that this is the same constraint graph as the one discussed in Example 4.4, where we needed to consider only two cases to get the number of solutions.

If the underlying graph is sparse, conversion might be very expensive, as we show in the example below.

Example 5.11. Consider a parfactor ⟨{X_0 ≠ a, X_0 ≠ X_1, . . . , X_0 ≠ X_k}, {g_0(X_0), g_1(X_1), . . . , g_k(X_k)}, F⟩, where D(X_0) = D(X_1) = · · · = D(X_k). Let C denote the set of constraints from this parfactor. We have E^C_{X_0} = {a, X_1, . . . , X_k} and E^C_{X_i} = {X_0}, i = 1, . . . , k. The parfactor is not in normal form because E^C_{X_0} \ {X_i} ≠ E^C_{X_i} \ {X_0}, i = 1, . . . , k. As a result, the size of the set D(X_0) : C depends on the other logical variables in the parfactor. For instance, it differs for X_1 = a and X_1 ≠ a, or for X_1 = X_2 and X_1 ≠ X_2. A conversion of the considered parfactor to a set of parfactors in normal form involves 2^k − 1 splits on substitutions of the form {X_i/a}, 1 ≤ i ≤ k, and ∑_{i=2}^{k} \binom{k}{i}(ϖ_i − 1) splits on substitutions of the form {X_i/X_j}, 1 ≤ i, j ≤ k, where ϖ_i is the i-th Bell number (see Section 4.2.3). The conversion creates ∑_{i=0}^{k} \binom{k}{i} ϖ_i parfactors in normal form. In Example 5.13 we analyze how this conversion affects parfactor multiplication compared to the use of a #CSP solver.

From the above example we can see that the cost of converting a parfactor to normal form can be worse than exponential. Moreover, converting parfactors to normal form may be very inefficient when analyzed in the context of parfactor multiplication (see Section 5.4.1) or summing out a parameterized random variable from a parfactor (see Section 5.4.2). Our empirical tests (see Section 5.4.3) confirm this observation. While parfactors that involve counting formulas must be in normal form (see Section 2.5.1.1), this is not necessary for parfactors without counting formulas.

An alternative to converting all parfactors to normal form is to use a #CSP solver. We presented an algorithm for solving #CSPs encountered during lifted probabilistic inference in Chapter 4. Details of the interaction between the solver and a probabilistic inference engine depend on the implementation of the solver, in particular on the data structures used. In Appendix C we explain how to represent #CSPs encountered during lifted probabilistic inference using these data structures. In Appendix D we explain how to translate answers from the #CSP solver to the data structures used by a probabilistic inference engine.
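The growth in Example 5.11 can be made concrete with a short script. This is our own sketch (function names hypothetical), reading the final count as a sum of binomial coefficients times Bell numbers ϖ_i, computed here via the Bell triangle:

```python
from math import comb

def bell_numbers(k):
    """Bell numbers [ϖ_0, ..., ϖ_k] via the Bell triangle."""
    bells, row = [1], [1]
    for _ in range(k):
        new_row = [row[-1]]          # each row starts with the previous row's last entry
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells

def parfactors_after_conversion(k):
    """Normal-form parfactors created in Example 5.11:
    sum over i = 0..k of C(k, i) * ϖ_i."""
    b = bell_numbers(k)
    return sum(comb(k, i) * b[i] for i in range(k + 1))
```

By the standard Bell-number recurrence, this sum equals ϖ_{k+1}, so already k = 10 yields 678,570 normal-form parfactors, which makes the worse-than-exponential cost of conversion explicit.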
In the example below we present a simple interaction with the solver.

Example 5.12. In Example 5.6 we need to know the number |D(B) : {A ≠ B, B ≠ x_1}|, where D(A) = D(B) and |D(A)| = n. Let s-constant a_1 denote the set {x_1} and s-constant a_2 denote the set D(A) \ {x_1} (s-constants were introduced in Section 4.3.2.1). The following #VE= factor has value 1 for substitutions to the logical variables A, B that are solutions to the above CSP and 0 otherwise:

A     B     Partition(s)
a_1   a_2   {{A}}, {{B}}      1
a_2   a_2   {{A, B}}          0
a_2   a_2   {{A}}, {{B}}      1

The #CSP solver eliminates B from the above factor and returns:

A     Partition(s)
a_1   {{A}}             n − 1
a_2   {{A}}             n − 2

The two cases used in Example 5.6 are obtained through analysis of the resulting factor and the knowledge that a_1 denotes the set {x_1} and a_2 denotes the set D(A) \ {x_1}. We can infer that |D(B) : {A ≠ B, B ≠ x_1}| equals n − 1 if A = x_1 and n − 2 if A ≠ x_1.

5.4.1 Multiplication

In the example below we demonstrate how the normal form requirement might lead to many otherwise unnecessary parfactor multiplications.

Example 5.13. Assume we would like to multiply the parfactor from Example 5.11 by a parfactor pf = ⟨∅, {g_1(X_1)}, F_1⟩. First, let us consider how this is done with a #CSP solver. The #CSP solver computes the number of factors the parfactor from Example 5.11 represents, (|D(X_0)| − 1)^{k+1}. Next the solver computes the number of factors represented by the parfactor pf, which is trivially |D(X_1)|. A correction is applied to the values of the factor F_1 to compensate for the difference between these two numbers. Finally, the two parfactors are multiplied. The whole operation involves two calls to the #CSP solver, one correction, and one parfactor multiplication.

Now let us see how this can be done without the use of a #CSP solver. The first parfactor is converted to a set Φ of ∑_{i=0}^{k} \binom{k}{i} ϖ_i parfactors in normal form, as presented in Example 5.11.
Some of the parfactors in Φ contain a parameterized random variable g_1(a); the rest contain a parameterized random variable g_1(X_1) and a constraint X_1 ≠ a, so the parfactor pf needs to be split on the substitution {X_1/a}. The split results in a parfactor ⟨∅, {g_1(a)}, F_1⟩ and a residual ⟨{X_1 ≠ a}, {g_1(X_1)}, F_1⟩. Next, each parfactor from Φ is multiplied by either the result of the split or the residual. Thus ∑_{i=0}^{k} \binom{k}{i} ϖ_i parfactor multiplications need to be performed, and most of these multiplications require a correction prior to the actual parfactor multiplication. There is an opportunity for some optimization, as the factor components of these multiplications, before the different corrections are applied, could be cached and reused instead of being recomputed. Still, even with such a caching mechanism, multiple parfactor multiplications would be performed, compared to just one multiplication when a #CSP solver is used.

5.4.2 Summing out

Examples 5.6 and 5.12 demonstrate how summing out a parameterized random variable from a parfactor that is not in normal form can be done with the help of a #CSP solver. In the example below we show how this operation would look if we converted the parfactor to a set of parfactors in normal form (which does not require a #CSP solver).

Example 5.14. Assume that we want to sum out ground(h(A, B)) : {A ≠ B, B ≠ x_1} from the parfactor ⟨{A ≠ B, B ≠ x_1}, {g(A), h(A, B)}, F_iv⟩ from Example 5.6. First, we convert the parfactor to a set of parfactors in normal form by splitting on the substitution {A/x_1}. We obtain two parfactors in normal form: ⟨{B ≠ x_1}, {g(x_1), h(x_1, B)}, F_iv⟩, which represents n − 1 factors, and ⟨{A ≠ x_1, A ≠ B, B ≠ x_1}, {g(A), h(A, B)}, F_iv⟩, which represents (n − 1)(n − 2) factors. Next we sum out ground(h(x_1, B)) : {B ≠ x_1} from the first parfactor and ground(h(A, B)) : {A ≠ x_1, A ≠ B, B ≠ x_1} from the second parfactor.
In both cases a correction will be necessary, as B will no longer be among the logical variables present in the resulting parfactors, and these parfactors will represent fewer factors than the original parfactors.

In general, as illustrated by Examples 5.6, 5.12, and 5.14, a lifted inference procedure that enforces conversion to normal form and a lifted inference procedure that uses a #CSP solver create the same number of parfactors. The difference is that the approach based on conversion to normal form computes the factor component multiple times, once for each resulting parfactor, but does not use a #CSP solver, while the approach based on a #CSP solver computes the factor component once and then applies a different correction for each resulting parfactor, based on the answer from the solver. As these factor components (before applying a correction) are identical, the redundant computations could be eliminated by caching. We successfully adopted a caching mechanism in our empirical tests (Section 5.4.3), but expect it to be less effective for larger problems.

As in the case of splitting as needed, it might be difficult to design an efficient elimination ordering heuristic that would work with a #CSP solver. This is because we do not know in advance how many parfactors will be obtained as a result of summing out; we need to run the #CSP solver to obtain this information.

5.4.3 Experiment

In this experiment we summed out a parameterized random variable from a parfactor. We compared summing out with the help of the #CSP solver presented in Chapter 4 (#CSP-SUM) to summing out achieved by converting a parfactor to a set of parfactors in normal form and summing out a parameterized random variable from each obtained parfactor without a #CSP solver. (We cached factor components as suggested in Section 5.4.2.)

We randomly generated sets of parfactors. There were up to 5 parameterized random variables in each parfactor, with range sizes varying from 2 to 10.
There were up to 10 logical variables present in each parfactor. Logical variables were typed with the same population. We varied the size of this population from 5 to 1000 to verify how well the #CSP solver scaled for larger populations. The probability of a binary constraint between a pair of logical variables varied from 0.05 to 0.25, and each logical variable on average participated in 3 unary constraints. The above settings resulted in simple parfactors being generated. For each population size we generated 100 parfactors and reported the cumulative time. We used Java implementations of the tested lifted inference methods. Tests were performed on an Intel Core 2 Duo 2.66GHz processor with 1GB of memory made available to the JVM.

The results are presented in Figure 5.6. For small population sizes, summing out with the help of a #CSP solver (#CSP-SUM) performs worse than summing out by converting to normal form (CONV-NFM-SUM). #CSP-SUM prevails for larger populations. For the second approach, we also report the time excluding conversion to normal form (NFM-SUM).

Figure 5.6: Summing out with and without a #CSP solver.

The difference between CONV-NFM-SUM and NFM-SUM shows the significant cost of conversion to normal form.

5.5 Conclusions

In this chapter we analyzed the impact of constraint processing on the efficiency of lifted inference and explained why we cannot ignore its role in lifted inference. We showed that the choice of constraint processing strategy has a big impact on the efficiency of lifted inference. In particular, we discovered that shattering [de Salvo Braz et al., 2007] is never better—and sometimes worse—than splitting as needed [Poole, 2003], and that the conversion of parfactors to normal form [Milch et al., 2008] is an expensive alternative to using a specialized #CSP solver.
Although in this chapter we focused on exact lifted inference, our results are potentially applicable to approximate lifted inference algorithms that use parfactors, as even approximate methods might require the exact number of factors represented by a parfactor.

Chapter 6

Conclusions

Prediction is very difficult, especially about the future.
— Niels Bohr

6.1 Summary

This thesis concerns exact lifted probabilistic inference in first-order probabilistic models. In particular, we focused on two problems: lifted aggregation in directed first-order probabilistic models (Chapter 3) and constraint processing during lifted probabilistic inference (Chapters 4 and 5).

Aggregation occurs naturally in non-trivial directed first-order models. Lifted inference algorithms have focused on undirected first-order models and lacked data structures suitable for representing aggregation. We introduced a new data structure, aggregation parfactors, and defined operations on aggregation parfactors that allow them to be part of lifted inference algorithms. We demonstrated the usefulness and effectiveness of our solution using a model from the social networks domain.

First-order probabilistic models involve inequality constraints on logical variables. Additional inequality constraints are introduced during inference. Constraints allow models to capture properties of particular individuals in a modeled domain. Lifted probabilistic inference has to process these constraints. One task is to count the solutions to the constraint satisfaction problems induced during inference. These CSPs might involve logical variables with large domains, but their constraints are restricted to unary and binary inequality constraints. In Chapter 4 we presented a lifted algorithm for counting solutions to such CSPs. The computational complexity of our algorithm does not directly depend on the domain sizes of the logical variables.
Because our algorithm is lifted, it inter-operates with a lifted probabilistic inference engine in a natural and efficient manner.

To date, constraint processing in lifted probabilistic inference has not received much attention in the literature, and various researchers have made seemingly arbitrary decisions about constraint processing strategies. In Chapter 5 we compared different approaches to constraint processing during lifted probabilistic inference. Our theoretical and empirical results stress the great importance of informed constraint processing.

6.2 Future work

There are many opportunities for future research in the area of lifted probabilistic inference. Below we outline open problems directly related to this thesis.

While it is difficult to design an elimination ordering heuristic that works well with the splitting as needed approach and a #CSP solver, it is worth pursuing.

In this thesis we used parfactors to describe first-order probabilistic models. Parfactors are data structures and hence not best suited for specifying models by humans. Moreover, sets of parfactors cannot directly represent all models that can be defined using first-order probabilistic modeling languages. For example, only simple ICL programs can be directly translated to sets of parfactors. Nevertheless, an inference procedure for ICL could use the C-FOVE algorithm as a subroutine. The development of such a procedure is an interesting open problem.

The framework presented in this thesis only allows us to reason about particular individuals from the population of a logical variable and about the rest of this population. A natural extension is to allow reasoning about sets of individuals belonging to some population. The calculus of parfactors would not be greatly affected by such a generalization. The most important changes would be required in the unification process. Some changes would also be necessary in constraint processing, but not in the #CSP solver presented in Chapter 4.
The representational power of parfactors could be increased by allowing parfactors to contain other types of constraints, not just inequality constraints. For example, a 'less-than' constraint could be useful when used together with logical variables that naturally exhibit a linear order, like space dimensions and time. Increased expressiveness comes at a higher cost of constraint processing. Balancing this tradeoff is a challenging task.

Finally, variable elimination-based lifted inference algorithms require lifted data structures for storing the intermediate results of a computation. If an appropriate lifted data structure is not available, such algorithms use propositionalization and store the results in factors. We believe that a search-based lifted inference algorithm could avoid unnecessary propositionalization.

Bibliography

D. Achlioptas, L. M. Kirousis, E. Kranakis, D. Krizanc, M. S. O. Molloy, and Y. C. Stamatiou [1997]. Random constraint satisfaction: A more accurate picture. In Proceedings of the 3rd International Conference on Principles and Practice of Constraint Programming (CP 1997), 107–120. → pp. 139

O. Angelsmark and P. Jonsson [2003]. Improved algorithms for counting solutions in constraint satisfaction problems. In Proceedings of the 9th International Conference on Principles and Practice of Constraint Programming (CP 2003), 81–95. → pp. 107

O. Angelsmark, P. Jonsson, S. Linusson, and J. Thapper [2002]. Determining the number of solutions to binary CSP instances. In Proceedings of the 8th International Conference on Principles and Practice of Constraint Programming (CP 2002), 327–347. → pp. 107

O. Angelsmark and J. Thapper [2006]. Partitioning based algorithms for some colouring problems.
In Recent Advances in Constraints, Joint ERCIM/CoLogNET International Workshop on Constraint Solving and Constraint Logic Programming, CSCLP 2005, Uppsala, Sweden, June 20-22, 2005, Revised Selected and Invited Papers, volume 3978 of Lecture Notes in Computer Science, 44–58. Springer. → pp. 108, 143

K. R. Apt and M. Bezem [1991]. Acyclic programs. New Generation Computing, 9(3–4):335–363. → pp. 17

S. Arnborg, D. G. Corneil, and A. Proskurowski [1987]. Complexity of finding embeddings in a k-tree. SIAM Journal on Algebraic and Discrete Methods, 8(2):277–284. → pp. 12

R. J. Bayardo Jr. and J. D. Pehoushek [2000]. Counting models using connected components. In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI 2000), 157–162. → pp. 106

E. T. Bell [1934a]. Exponential numbers. The American Mathematical Monthly, 41(7):411–419. → pp. 111

E. T. Bell [1934b]. Exponential polynomials. The Annals of Mathematics, 2nd Ser., 35(2):258–277. → pp. 111

E. T. Bell [1938]. The iterated exponential integers. The Annals of Mathematics, 2nd Ser., 39(3):539–557. → pp. 111

U. Bertelè and F. Brioschi [1972]. Nonserial Dynamic Programming, volume 91 of Mathematics in Science and Engineering. Academic Press. → pp. 9

N. Biggs [1993]. Algebraic Graph Theory. Cambridge University Press, 2nd edition. → pp. 107, 108

E. Birnbaum and E. L. Lozinskii [1999]. The good old Davis-Putnam procedure helps counting models. Journal of Artificial Intelligence Research, 10:457–477. → pp. 106

A. Björklund, T. Husfeldt, and M. Koivisto [2009]. Set partitioning via inclusion-exclusion. SIAM Journal on Computing, 39(2):546–563. → pp. 107, 108, 143

J. S. Breese [1992]. Construction of belief and decision networks. Computational Intelligence, 8(4):624–647. → pp. 2, 12, 21

A. A. Bulatov and V. Dalmau [2003]. Towards a dichotomy theorem for the counting constraint satisfaction problem. In Proceedings of the 44th IEEE Symposium on Foundations of Computer Science (FOCS'03), 562–571. → pp.
106

W. L. Buntine [1994]. Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2:159–225. → pp. 20

P. Carbonetto, J. Kisyński, M. Chiang, and D. Poole [2009]. Learning a contingently acyclic, probabilistic relational model of a social network. Technical Report TR-2009-08, The University of British Columbia, Department of Computer Science. → pp. 101, 103

M. Chavira, A. Darwiche, and M. Jaeger [2006]. Compiling relational Bayesian networks for exact inference. International Journal of Approximate Reasoning, 42(1–2):4–20. → pp. 21

V. Dahllöf, P. Jonsson, and M. Wahlström [2002]. Counting satisfying assignments in 2-SAT and 3-SAT. In Proceedings of the 8th Annual International Computing and Combinatorics Conference (COCOON 2002), 535–543. → pp. 106, 107

V. Dahllöf, P. Jonsson, and M. Wahlström [2005]. Counting models for 2SAT and 3SAT formulae. Theoretical Computer Science, 332(1-3):265–291. → pp. 106, 107

A. Darwiche [2009]. Modeling and Reasoning with Bayesian Networks. Cambridge University Press. → pp. 54

L. De Raedt, P. Frasconi, K. Kersting, and S. H. Muggleton, eds. [2008]. Probabilistic Inductive Logic Programming, volume 4911 of Lecture Notes in Computer Science. Springer. → pp. 2, 55

R. de Salvo Braz, E. Amir, and D. Roth [2005]. Lifted first-order probabilistic inference. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 2005), 1319–1325. → pp. 3, 21

R. de Salvo Braz, E. Amir, and D. Roth [2006]. MPE and partial inversion in lifted probabilistic variable elimination. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI 2006), 1124–1130. → pp. 3, 21

R. de Salvo Braz, E. Amir, and D. Roth [2007]. Lifted first-order probabilistic inference. In L. Getoor and B. Taskar, eds., Introduction to Statistical Relational Learning, chapter 15, 433–450. MIT Press. → pp. ii, 3, 4, 15, 21, 22, 25, 31, 38, 66, 144, 148, 153, 168

R. de Salvo Braz, S. Natarajan, H.
Bui, J. Shavlik, and S. Russell [2009]. Anytime lifted belief propagation. In Proceedings of the International Workshop on Statistical Relational Learning (SRL 2009). → pp. 55

R. Dechter [1999]. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113(1):41–85. → pp. 9

R. Dechter [2003]. Constraint Processing. Morgan Kaufmann Publishers. → pp. ix, 106, 107, 108, 109, 110, 138

R. Dechter, K. Kask, and R. Mateescu [2002]. Iterative join-graph propagation. In Proceedings of the 18th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2002), 128–136. → pp. 107

R. Dechter and J. Pearl [1987]. Network-based heuristics for constraint satisfaction problems. Artificial Intelligence, 34(1):1–38. → pp. 107

F. J. Díez [1993]. Parameter adjustment in Bayes networks. The generalized noisy OR-gate. In Proceedings of the 9th Annual Conference on Uncertainty in Artificial Intelligence (UAI 1993), 99–105. → pp. 59

F. J. Díez and S. F. Galán [2003]. Efficient computation for the noisy MAX. International Journal of Intelligent Systems, 18(2):165–177. → pp. 69, 70, 98

O. Dubois [1991]. Counting the number of solutions for instances of satisfiability. Theoretical Computer Science, 81(1):49–64. → pp. 106

L. Getoor and B. Taskar, eds. [2007]. Introduction to Statistical Relational Learning. Adaptive Computation and Machine Learning. MIT Press. → pp. 2, 55

J. Gørtz [2000]. Java tip 92: Use the JVM profiler interface for accurate timing. http://www.javaworld.com/javaworld/javatips/jw-javatip92.html. → pp. 138

R. Gupta, A. A. Diwan, and S. Sarawagi [2007]. Efficient inference with cardinality-based clique potentials. In Proceedings of the 24th Annual International Conference on Machine Learning (ICML 2007), 329–336. → pp. 15

M. Horsch and D. Poole [1990]. A dynamic approach to probabilistic inference using Bayesian networks. In Proceedings of the 6th Annual Conference on Uncertainty in AI (UAI 1990), 155–161. → pp. 2, 12

M. C. Horsch and W. S.
Havens [2000]. Probabilistic arc consistency: A connection between constraint reasoning and probabilistic reasoning. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI 2000), 282–290. → pp. 107

M. Jaeger [2002]. Relational Bayesian networks: a survey. Electronic Transactions in Artificial Intelligence, 6. → pp. 60

K. Kask, R. Dechter, and V. Gogate [2004a]. Counting-based look-ahead schemes for constraint satisfaction. In Proceedings of the 10th International Conference on Principles and Practice of Constraint Programming (CP 2004), 317–331. → pp. 107

K. Kask, R. Dechter, and V. Gogate [2004b]. New look-ahead schemes for constraint satisfaction. In The 8th International Symposium on Artificial Intelligence and Mathematics (AI&M 2004). → pp. 107

K. Kersting, B. Ahmadi, and S. Natarajan [2009]. Counting belief propagation. In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2009), 277–284. → pp. 55

J. Kisyński and D. Poole [2009a]. Constraint processing in lifted probabilistic inference. In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2009), 293–302. → pp. 5

J. Kisyński and D. Poole [2009b]. Lifted aggregation in directed first-order probabilistic models. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI 2009), 1922–1929. → pp. 5

U. Kjærulff [1990]. Triangulation of graphs - algorithms giving small total state space. Technical Report R90-09, Aalborg University, Department of Mathematics and Computer Science, Aalborg, Denmark. → pp. 12, 139

D. E. Knuth [2005]. The Art of Computer Programming, Volume 4: Combinatorial Algorithms, Fascicle 3: Generating All Combinations and Partitions. Addison-Wesley. → pp. 112, 184

D. Koller and N. Friedman [2009]. Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning. MIT Press. → pp. 2, 54

D. Koller and A. Pfeffer [1997].
Object-oriented Bayesian networks. In Proceedings of the 13th Annual Conference on Uncertainty in AI (UAI 1997), 302–313. → pp. 21

J. W. Lloyd [1987]. Foundations of Logic Programming. Springer, 2nd edition. → pp. 17

L. Lovász [2003]. Combinatorial Problems and Exercises. North-Holland, 2nd edition. → pp. 112

A. K. Mackworth [1977]. Consistency in networks of relations. Artificial Intelligence, 8(1):99–118. → pp. 40

A. Meisels, S. E. Shimony, and G. Solotorevsky [2000]. Bayes networks for estimating the number of solutions to a CSP. Annals of Mathematics and Artificial Intelligence, 28(1-4):169–186. → pp. 106

B. Milch [2006]. Probabilistic Models with Unknown Objects. Ph.D. thesis, University of California, Berkeley, Computer Science Division. → pp. 55

B. Milch, L. S. Zettlemoyer, K. Kersting, M. Haimes, and L. P. Kaelbling [2008]. Lifted probabilistic inference with counting formulas. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI 2008), 1062–1068. → pp. ii, 3, 4, 15, 21, 22, 25, 26, 31, 33, 35, 36, 37, 38, 46, 50, 66, 145, 146, 148, 153, 160, 162, 168

U. Montanari [1974]. Networks of constraints: Fundamental properties and applications to picture processing. Information Sciences, 7(2):95–132. → pp. 106

C. H. Papadimitriou [1994]. Computational complexity. Addison Wesley. → pp. 106

J. Pearl [1986]. Fusion, propagation and structuring in belief networks. Artificial Intelligence, 29(3):241–288. → pp. 59

J. Pearl [1988]. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann. → pp. 6

M. Pearson and L. Michell [2000]. Smoke Rings: social network analysis of friendship groups, smoking and drug-taking. Drugs: education, prevention and policy, 7:21–37. → pp. 103

G. Pesant [2005]. Counting solutions of CSPs: A structural approach. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 2005), 260–265. → pp. 107

A. Pfeffer and D. Koller [2000].
Semantics and inference for recursive probability models. In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI 2000), 538–544. → pp. 21

Piṅgala [200 B.C.]. Chandah-sûtra. → pp. 88

D. Poole [1993]. Probabilistic Horn abduction and Bayesian networks. Artificial Intelligence, 64(1):81–129. → pp. 16

D. Poole [1997]. The Independent Choice Logic for modeling multiple agents under uncertainty. Artificial Intelligence, 94(1–2):7–56. → pp. 16

D. Poole [2000]. Abducting through negation as failure: Stable models with the Independent Choice Logic. Journal of Logic Programming, 44:5–35. → pp. 16, 18

D. Poole [2003]. First-order probabilistic inference. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI 2003), 985–991. → pp. ii, 2, 3, 4, 21, 22, 25, 38, 43, 144, 146, 148, 153, 168

D. Poole [2008]. The Independent Choice Logic and beyond. In L. De Raedt, P. Frasconi, K. Kersting, and S. H. Muggleton, eds., Probabilistic Inductive Logic Programming, volume 4911 of Lecture Notes in Computer Science, 222–243. Springer. → pp. 18, 101

D. Poole and A. Mackworth [2010]. Artificial Intelligence: foundations of computational agents. Cambridge University Press. → pp. 1, 55

P. Refalo [2004]. Impact-based search strategies for constraint programming. In Proceedings of the 10th International Conference on Principles and Practice of Constraint Programming (CP 2004), 557–571. → pp. 107

D. Roth [1996]. On the hardness of approximate reasoning. Artificial Intelligence, 82(1-2):273–302. → pp. 106

S. J. Russell and P. Norvig [2009]. Artificial Intelligence - A modern approach. Prentice Hall, 3rd edition. → pp. 1, 55

P. Savicky and J. Vomlel [2007]. Exploiting tensor rank-one decomposition in probabilistic inference. Kybernetika, 43(5):747–764. → pp. 70

P. Sen, A. Deshpande, and L. Getoor [2009]. Bisimulation-based approximate lifted inference.
In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2009), 496–505. → pp. 55
P. Singla and P. Domingos [2008]. Lifted first-order belief propagation. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI 2008), 1094–1099. → pp. 5, 55
L. Sterling and E. Shapiro [1994]. The Art of Prolog. The MIT Press, 2nd edition. → pp. 38, 42
N. Taghipour, W. Meert, J. Struyf, and H. Blockeel [2009]. First-order Bayes-Ball for CP-Logic. In Proceedings of the International Workshop on Statistical Relational Learning (SRL 2009). → pp. 48
L. G. Valiant [1979]. The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3):410–421. → pp. 106
W. A. Whitworth [1878]. Choice and Chance: An Elementary Treatise on Permutations, Combinations, and Probability, with 300 Exercises. Cambridge: Deighton Bell, 3rd, revised and enlarged edition. → pp. 112
N. L. Zhang and D. Poole [1994]. A simple approach to Bayesian network computations. In Proceedings of the 10th Biennial Canadian Artificial Intelligence Conference (AI 1994), 171–178. → pp. 7, 9
N. L. Zhang and D. Poole [1996]. Exploiting causal independence in Bayesian network inference. Journal of Artificial Intelligence Research, 5:301–328. → pp. 58
W. Zhang [1996]. Number of models and satisfiability of sets of clauses. Theoretical Computer Science, 155(1):277–288. → pp. 106

Appendix A

1-dimensional Representation of VE Factors

Appendix. Hmm. A-Anything else?
— Lt. Father Francis John Patrick Mulcahy (M*A*S*H)

In this appendix, we present a 1-dimensional representation of factors. It is based on David Poole's Java implementation of VE for belief networks. Besides saving memory when we want a dense representation, this representation allows for an efficient implementation of the summing-out operator.

Let us consider a factor f on variables X_1, X_2, ..., X_n, where each variable X_i has a domain {x_i^0, x_i^1, ..., x_i^{d_i−1}} ordered according to some total ordering <_i. Assume we are given a total ordering <_X of the variables; the ordering <_X is arbitrary. Together these orderings induce a lexicographic ordering of the tuples of elements from the domains of the variables, so that factors can be implemented as 1-dimensional arrays. Assume that X_1 <_X X_2 <_X ... <_X X_n and x_i^0 <_i x_i^1 <_i ... <_i x_i^{d_i−1}, for i = 1, 2, ..., n. The ordering <_T of the tuples in f is defined as follows:

(x_1^{i_1}, x_2^{i_2}, ..., x_n^{i_n}) <_T (x_1^{j_1}, x_2^{j_2}, ..., x_n^{j_n}) ⇐⇒ ∃ l ∈ {1, 2, ..., n} such that x_l^{i_l} <_l x_l^{j_l} and x_k^{i_k} = x_k^{j_k} for all k ∈ {1, 2, ..., l−1}.

Figure A.1: Structure of a VE factor (a table listing the tuples of f in the order <_T next to their values).

Figure A.1 shows the structure of factor f. The factor consists of d_1 d_2 ... d_n tuples ordered according to <_T, with values v_0, v_1, ..., v_{d_1 d_2 ... d_n − 1}. Consider a tuple (x_1^{i_1}, x_2^{i_2}, ..., x_n^{i_n}) from the factor f. Given the indexes i_1, i_2, ..., i_n, we can compute the tuple's index (and thus the index of its value) in f:

(x_1^{i_1}, x_2^{i_2}, ..., x_n^{i_n}) → v_{((...(i_1 d_2 + i_2) d_3 + ...) d_{n−1} + i_{n−1}) d_n + i_n}.
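The index computation above can be sketched in Python (the thesis implementation is in Java; the function name is illustrative):

```python
def tuple_to_index(indices, dims):
    """Map a tuple of value indices (i_1, ..., i_n) to its position in the
    1-dimensional value array under the lexicographic ordering <_T:
    ((...(i_1*d_2 + i_2)*d_3 + ...)*d_{n-1} + i_{n-1})*d_n + i_n."""
    t = 0
    for i, d in zip(indices, dims):
        t = t * d + i  # Horner-style accumulation; the first factor multiplies 0
    return t

# A factor on two variables with domain sizes (2, 3) stores 6 values;
# the tuple with indices (1, 2) sits at the last position, index 5.
```

The procedure indexToTuple of Figure A.2 inverts this map with repeated div/mod operations.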
Given the index t of a tuple (the index of its value) in f, we can recover the tuple's elements with the procedure indexToTuple(f, t) presented in Figure A.2. Therefore, we do not need to store tuples in factors, but only their values, which saves memory.

procedure indexToTuple(f, t)
input: factor f on variables X_1, X_2, ..., X_n, index t of the tuple;
output: tuple with index t in factor f;
  for j := n downto 2 do
    set i_j := t mod d_j;
    set t := t div d_j;
  end
  set i_1 := t;
  return (x_1^{i_1}, x_2^{i_2}, ..., x_n^{i_n});
end

Figure A.2: Procedure indexToTuple(f, t).

Appendix B

Hierarchical Representation of #VE= Factors

I'm afraid I have a bad appendix.
— Maj. Margaret 'Hot Lips' O'Houlihan (M*A*S*H)

In this appendix, we show how to implement the dense representation of #VE= factors as a hierarchy of 1-dimensional arrays, so that instead of storing both #VE= tuples and values in memory, we only need to store values. For simplicity, we describe our representation using a factor on three variables as an example.

Consider Figure B.1, which illustrates the domains of the three variables A, B and C. We need at most seven disjoint subsets, S_1, S_2, ..., S_7, to specify the domains of the variables. In our example, we assume that all seven subsets are non-empty; thus, the domain of each variable consists of four subsets. In the #VE= factor we represent subsets using seven s-constants, c_1, c_2, ..., c_7, as described in Section 4.3.2.2.

We represent a #VE= factor on A, B and C as a hierarchy of factors. The hierarchy consists of two levels. At the top level, there is a factor representing all combinations of s-constants describing the domain of each variable, which maps them via pointers to factors from the bottom level of the hierarchy (see Figure B.2).
If we introduce a total ordering among s-constants, then we can represent the factor from the top level as a 1-dimensional array, as described in Appendix A. The disadvantage of such a dense representation is that we have to store all of the pointers in the array, even if they are null.

A = S_1 ∪ S_2 ∪ S_3 ∪ S_5, B = S_1 ∪ S_2 ∪ S_4 ∪ S_6, C = S_1 ∪ S_3 ∪ S_4 ∪ S_7.

Figure B.1: Domains of three variables represented with disjoint subsets.

Figure B.2: Structure of a #VE= factor (a top-level factor over tuples of s-constants pointing to bottom-level factors that map partitions of the variables to values).

At the bottom level, there are factors mapping partitions of the variables A, B and C to the values of the #VE= factor. The combination of s-constants pointing to a factor from the bottom level determines the structure of that factor. For each subset of the variables with the same s-constant, we have some number of partitions. For example, for the tuple of s-constants (c_1, c_1, c_2), we have two partitions of the variables A and B (namely {{A, B}} and {{A}, {B}}) and one partition of the variable C ({{C}}). In the factor from the bottom level, we represent all of their combinations and map them to the appropriate values of the #VE= factor (see Figure B.2). If we use a standard lexicographic ordering of partitions (see [Knuth, 2005]), we can represent the factors from the bottom level as 1-dimensional arrays (as described in Appendix A) together with a list of partitions for each subset of variables. The disadvantage of such a dense representation is that we have to store values for all combinations of partitions in the array, even if they are 0 (which can happen after summing out).
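The lexicographic ordering of set partitions can be made concrete with restricted growth strings. The following is an illustrative Python sketch, not the thesis' Bell package:

```python
def partitions_rgs(n):
    """Yield the set partitions of {0, ..., n-1}, n >= 1, encoded as
    restricted growth strings a[0..n-1] (a[i] is the block of element i,
    with a[0] = 0 and a[i] <= 1 + max(a[:i])), in lexicographic order."""
    def rec(prefix, m):            # m = number of blocks used so far
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for b in range(m + 1):     # join an existing block, or open block m
            yield from rec(prefix + [b], max(m, b + 1))
    yield from rec([0], 1)

# For three variables {A, B, C} there are Bell(3) = 5 partitions, matching
# Figure B.2: (0, 0, 0) encodes {{A,B,C}} and (0, 1, 2) encodes {{A},{B},{C}}.
parts3 = list(partitions_rgs(3))
```

The position of an RGS in this enumeration is exactly the partition index used by the bottom-level arrays.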
The combinatorial properties of set partitions allow us to efficiently recover the partition structure from the position of the partition in the ordering, and to compute the position of the partition in the ordering given its structure. Therefore, we do not need to store lists of partitions associated with factors from the bottom level of the hierarchy, but only lists of their indexes in the ordering. Java methods that implement the necessary operations on set partitions and Bell numbers are part of the Bell package (http://people.cs.ubc.ca/~kisynski/code/bell/). As it is mainly a programming exercise, we do not describe those methods.

It is important to mention that the hierarchical, dense representation is not the only possibility. An empirical evaluation is necessary to decide whether it is better than a flat, sparse representation or a hierarchical, sparse representation. Similarly, one could use partitions themselves instead of indexes of set partitions.

Appendix C

From Parfactors to #VE= Factors

Well, I figured I would go after the appendix while I'm in the area.
— Maj. Frank Marion 'Ferret Face' Burns (M*A*S*H)

In this appendix, we show how to represent a CSP induced by a parfactor in terms of #VE= factors. We use the dense representation of #VE= factors described in Appendix B. For simplicity, we analyze a CSP consisting of a single connected component with three variables. Consider the following parfactor:

⟨C, {f(A, B, C), g(A, C), g(x2, x9)}, F⟩, [0]

where C = {A ≠ x2, B ≠ x3, B ≠ x7, C ≠ x2, A ≠ B, B ≠ C}, D(A) = D(B) = D(C) and |D(A)| = 10. The constraints in the parfactor form a CSP instance P = ({A, B, C}, D, C), where D(A) = D(B) = D(C). We know that D(A) contains ten elements, but we know only those elements of D(A) that are explicitly present in the parfactor: x2, x3, x7, x9. Let us represent the instance P using #VE= factors.
First, we exclude the elements participating in unary constraints from the domains of the variables A, B, and C, and obtain pruned domains D(A), D(B), and D(C). Then, we partition the pruned domains into disjoint sets of elements, and represent these sets as s-constants (see Section 4.3.2.1). There are three variables, so we will need at most 2^3 − 1 = 7 s-constants (see Figure B.1).

In our implementation we create a sparse data structure that maps potential s-constants to the sets of elements they represent. We pick an arbitrary total ordering < of D(A), sort the unary constraints for each variable according to <, and process the elements known to us in order of <. This allows us to generate the sets of indistinguishable individuals represented by s-constants through a single sweep of the unary constraints and the elements that are known to us. Each set of elements represented by an s-constant can be defined by listing all of its elements or by listing all elements from the domain that do not belong to it. For each set we choose the more compact representation.

For simplicity, let us assume that x2 < x3 < x7 < x9. Element x2 is excluded from the domains of A and C, therefore it is represented by the s-constant S6. Element x3 is excluded from the domain of B, therefore it is represented by the s-constant S3. Element x7 is also represented by S3. Element x9 belongs to the domains of all variables and is represented by the s-constant S1. The remaining six elements also belong to the domains of all variables and are likewise represented by S1. Thus S1 represents seven elements, S3 represents two elements, and S6 represents one element. We have D(A) = S1 ∪ S3, D(B) = S1 ∪ S6, and D(C) = S1 ∪ S3. Although the #VE= algorithm only needs to know the sizes of the sets denoted by each s-constant, we need to know the elements of these sets so that we can interpret answers returned by the #VE= algorithm.
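The single sweep described above can be sketched as follows (illustrative Python; the names and data layout are assumptions, not the thesis implementation):

```python
def build_sconstants(domain_size, excluded):
    """Group the named domain elements by their exclusion signature (the set
    of variables whose unary constraints exclude them).  Returns a map
    signature -> (named elements, count of unnamed indistinguishable
    elements); all never-mentioned elements share the empty signature."""
    groups = {}
    for elem in sorted(excluded):          # single sweep in the total order <
        sig = frozenset(excluded[elem])
        groups.setdefault(sig, []).append(elem)
    result = {sig: (elems, 0) for sig, elems in groups.items()}
    named, _ = result.get(frozenset(), ([], 0))
    result[frozenset()] = (named, domain_size - len(excluded))
    return result

# The Appendix C example: |D| = 10; x2 is excluded from the domains of A and
# C, x3 and x7 from the domain of B, and x9 is mentioned but unconstrained.
sc = build_sconstants(10, {"x2": {"A", "C"}, "x3": {"B"}, "x7": {"B"}, "x9": set()})
```

Here the empty signature corresponds to S1 (x9 plus six unnamed elements, seven in total), {B} to S3, and {A, C} to S6.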
The s-constant S1 represents the set of elements {x ∈ D(A) | x ≠ x2 ∧ x ≠ x3 ∧ x ≠ x7}, S3 represents the set {x3, x7} and S6 represents the set {x2}. The length of the description of each of these sets is independent of the size of D(A).

Once we have processed the unary constraints and obtained the s-constants, we can use the procedure from Figure 4.10 to construct the following representation of the CSP instance:

A    B    Partition(s)        #
S1   S1   {{A, B}}            0
S1   S1   {{A}, {B}}          1
S1   S6   {{A}} {{B}}         1
S3   S1   {{A}} {{B}}         1
S3   S6   {{A}} {{B}}         1

B    C    Partition(s)        #
S1   S1   {{B, C}}            0
S1   S1   {{B}, {C}}          1
S1   S3   {{B}} {{C}}         1
S6   S1   {{B}} {{C}}         1
S6   S3   {{B}} {{C}}         1

The first #VE= factor represents the binary constraint A ≠ B and the second #VE= factor represents the binary constraint B ≠ C. If the domains of the logical variables present in the parfactor were larger, we would perform exactly the same computation; only the size of the set represented by the s-constant S1 would be larger.

Appendix D

From #VE= Factors to Parfactors

That's right up my alley, I wrote the book on the appendix. I even wrote the appendix, but they took that out.
— Capt. Benjamin Franklin 'Hawkeye' Pierce (M*A*S*H)

In this appendix, we show how an answer from the #VE= algorithm is processed by a lifted inference procedure. Consider the following parfactor from Appendix C:

⟨C, {f(A, B, C), g(A, C), g(x2, x9)}, F⟩, [0]

where C = {A ≠ x2, B ≠ x3, B ≠ x7, C ≠ x2, A ≠ B, B ≠ C}, D(A) = D(B) = D(C) and |D(A)| = 10. Assume that a lifted inference procedure is about to sum out the parameterized random variable f(A, B, C) from the parfactor [0]. The logical variable B appears only in f(A, B, C) and will be eliminated from the parfactor. Therefore the resulting parfactor will represent fewer factors than the parfactor [0], and the lifted probabilistic inference procedure needs to compensate for this difference.
To do so, it needs to compute the size of the set D(B) : C, which tells how many times fewer factors the resulting parfactor will represent. The parfactor [0] is not in normal form, and it is necessary to compute the sizes of the set D(B) : C conditioned on all combinations of values of the logical variables remaining in the resulting parfactor, namely A and C. In our implementation, the lifted inference procedure uses the #VE= algorithm to perform this task.

First, we represent the CSP instance formed by the constraints in the parfactor [0], as described in Appendix C. Then, the #VE= algorithm multiplies the two #VE= factors that represent the CSP instance and sums out B from the product. The #VE= algorithm returns the following answer:

A    C    Partition(s)        #
S1   S1   {{A, C}}            7
S1   S1   {{A}, {C}}          6
S1   S3   {{A}} {{C}}         7
S3   S1   {{A}} {{C}}         7
S3   S3   {{A, C}}            8
S3   S3   {{A}, {C}}          8

We also know that the s-constant S1 represents the set of elements {x ∈ D(A) | x ≠ x2 ∧ x ≠ x3 ∧ x ≠ x7} and the s-constant S3 represents the set {x3, x7} (see Appendix C). The answer from the #VE= algorithm needs to be translated to sets of substitutions and constraints accompanying each #VE= tuple and its value. We start by computing the generic result of summation, parfactor [00], by summing out f(A, B, C) from the parfactor [0] without compensating for the disappearance of B:

⟨{A ≠ x2, C ≠ x2}, {g(A, C), g(x2, x9)}, F_f⟩, [00]

where F_f = ∑_{f(A,B,C)} F. Next, for each #VE= tuple we generate one or more parfactors by modifying the parfactor [00]:

• (S1, S1, [{{A, C}}]) – A and C are equal, and their domain is {x ∈ D(A) | x ≠ x2 ∧ x ≠ x3 ∧ x ≠ x7}. We add the description of the domain to the parfactor [00], apply the substitution {A/C} to the parfactor [00], and raise the factor component to the power 7:

⟨{C ≠ x2, C ≠ x3, C ≠ x7}, {g(C, C), g(x2, x9)}, F_f^7⟩; [01]

• (S1, S1, [{{A}, {C}}]) – A and C are different, and their domains are respectively {x ∈ D(A) | x ≠ x2 ∧ x ≠ x3 ∧ x ≠ x7} and {x ∈ D(C) | x ≠ x2 ∧ x ≠ x3 ∧ x ≠ x7}.
We add the descriptions of the domains to the parfactor [00], add the constraint A ≠ C, and raise the factor component to the power 6:

⟨{A ≠ x2, A ≠ x3, A ≠ x7, C ≠ x2, C ≠ x3, C ≠ x7, A ≠ C}, {g(A, C), g(x2, x9)}, F_f^6⟩; [02]

• (S1, S3, [{{A}} {{C}}]) – A and C have disjoint domains, respectively {x ∈ D(A) | x ≠ x2 ∧ x ≠ x3 ∧ x ≠ x7} and {x3, x7}. We add the description of the domain of A to the parfactor [00], apply the substitution {C/x3}, and raise the factor component to the power 7; we repeat the process for the substitution {C/x7}:

⟨{A ≠ x2, A ≠ x3, A ≠ x7}, {g(A, x3), g(x2, x9)}, F_f^7⟩, [03]
⟨{A ≠ x2, A ≠ x3, A ≠ x7}, {g(A, x7), g(x2, x9)}, F_f^7⟩; [04]

• (S3, S1, [{{A}} {{C}}]) – A and C have disjoint domains, respectively {x3, x7} and {x ∈ D(C) | x ≠ x2 ∧ x ≠ x3 ∧ x ≠ x7}. We add the description of the domain of C to the parfactor [00], apply the substitution {A/x3}, and raise the factor component to the power 7; we repeat the process for the substitution {A/x7}:

⟨{C ≠ x2, C ≠ x3, C ≠ x7}, {g(x3, C), g(x2, x9)}, F_f^7⟩, [05]
⟨{C ≠ x2, C ≠ x3, C ≠ x7}, {g(x7, C), g(x2, x9)}, F_f^7⟩; [06]

• (S3, S3, [{{A, C}}]) – A and C are equal, and their domain is {x3, x7}. We apply the substitutions {A/x3} and {C/x3} to the parfactor [00] and raise the factor component to the power 8; we repeat the process for the substitutions {A/x7} and {C/x7}:

⟨∅, {g(x3, x3), g(x2, x9)}, F_f^8⟩, [07]
⟨∅, {g(x7, x7), g(x2, x9)}, F_f^8⟩; [08]

• (S3, S3, [{{A}, {C}}]) – A and C are different, and their domain is {x3, x7}. We apply the substitutions {A/x3} and {C/x7} to the parfactor [00] and raise the factor component to the power 8; we repeat the process for the substitutions {A/x7} and {C/x3}:

⟨∅, {g(x3, x7), g(x2, x9)}, F_f^8⟩, [09]
⟨∅, {g(x7, x3), g(x2, x9)}, F_f^8⟩. [10]

The summation created many parfactors, but significantly fewer than would be created by a #CSP solver that does not produce a lifted description as output.
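As a grounded sanity check (illustrative, not part of the thesis implementation), we can recover the counts 7, 6, 7, 7, 8, 8 by enumerating the values of B directly for one representative grounding of each #VE= tuple, assuming the concrete domain D = {x0, ..., x9}:

```python
# Assumed concrete domain; x2, x3, x7, x9 are the elements named in the parfactor.
D = [f"x{i}" for i in range(10)]

def count_b(a, c):
    """Number of values of B satisfying B != x3, B != x7, A != B, B != C
    for a fixed grounding A = a, C = c."""
    return sum(1 for b in D if b not in ("x3", "x7") and b != a and b != c)

counts = [
    count_b("x0", "x0"),  # (S1, S1, {{A, C}}):   A = C
    count_b("x0", "x1"),  # (S1, S1, {{A}, {C}}): A != C
    count_b("x0", "x3"),  # (S1, S3): disjoint domains
    count_b("x3", "x0"),  # (S3, S1): disjoint domains
    count_b("x3", "x3"),  # (S3, S3, {{A, C}}):   A = C
    count_b("x3", "x7"),  # (S3, S3, {{A}, {C}}): A != C
]
# counts == [7, 6, 7, 7, 8, 8], matching the lifted answer above
```

The lifted computation obtains the same six numbers without touching the six unnamed domain elements individually.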
The above example strongly suggests that an extension to the calculus of parfactors that would allow reasoning about sets of individuals belonging to some population is worth pursuing.

Appendix E

Splitting as Needed

It isn't necessary. It isn't a hot appendix. It's chronic.
— Maj. Margaret 'Hot Lips' O'Houlihan (M*A*S*H)

In this appendix we apply the propositions introduced in Section 2.5.2 and present a simple lifted computation which illustrates the splitting as needed approach (Section 5.2.1). We use the same model and compute the same query as in Section 2.5.2.7, where we demonstrated how the C-FOVE algorithm performs inference with the use of the shattering operation (Section 5.2.1).

Consider the parfactors from Example 2.7 (page 23), which represent the ICL theory from Example 2.6 (page 18) and Figure 2.5 (page 19). Assume D(Lot) = {lot1, lot2, ..., lotn} and that it is observed that grass is wet on lot1, which can be represented by the following parfactor:

⟨∅, {wet_grass(lot1)}, F4⟩, where F4(wet_grass(lot1) = false) = 0.0 and F4(wet_grass(lot1) = true) = 1.0.

Let Φ be the set consisting of the three parfactors from Example 2.7 and the above parfactor (in what follows, we do not show the details of the factor components of the parfactors):

Φ = { ⟨∅, {rain()}, F1⟩, [01]
⟨∅, {sprinkler(Lot)}, F2⟩, [02]
⟨∅, {rain(), sprinkler(Lot), wet_grass(Lot)}, F3⟩, [03]
⟨∅, {wet_grass(lot1)}, F4⟩ }. [04]

Assume we want to compute J_{ground(wet_grass(Lot)):{Lot ≠ lot1}}(Φ). Note that this is the joint on wet_grass(Lot) for all lots except lot1.

Let us first eliminate the parameterized random variable sprinkler(Lot). We apply Proposition 2.3 and multiply the two parfactors that involve sprinkler(Lot), namely [02] and [03]. Both parfactors represent the same number of factors, namely n, and no correction is necessary. We obtain the product

⟨∅, {rain(), sprinkler(Lot), wet_grass(Lot)}, F5⟩, [05]

where F5 = F2 · F3.
The new set of parfactors is as follows:

Φ1 = { ⟨∅, {rain()}, F1⟩, [01]
⟨∅, {wet_grass(lot1)}, F4⟩, [04]
⟨∅, {rain(), sprinkler(Lot), wet_grass(Lot)}, F5⟩ }. [05]

Next, we apply Proposition 2.1 and sum out sprinkler(Lot) from parfactor [05]. No logical variable disappears from the parfactor, and we do not need to compensate for anything. We obtain the following parfactor:

⟨∅, {rain(), wet_grass(Lot)}, F6⟩, [06]

where F6 = ∑_{sprinkler(Lot)} F5. The new set of parfactors is as follows:

Φ2 = { ⟨∅, {rain()}, F1⟩, [01]
⟨∅, {wet_grass(lot1)}, F4⟩, [04]
⟨∅, {rain(), wet_grass(Lot)}, F6⟩ }. [06]

Next, we again apply Proposition 2.3 and multiply the parfactors involving the parameterized random variable rain(), [01] and [06], and obtain the following parfactor:

⟨∅, {rain(), wet_grass(Lot)}, F7⟩, [07]

where F7 = F1^{1/n} · F6. Parfactor [01] represents one factor, while parfactor [06] represents n factors, and we needed to compensate for this when computing parfactor [07]. The new set of parfactors is as follows:

Φ3 = { ⟨∅, {wet_grass(lot1)}, F4⟩, [04]
⟨∅, {rain(), wet_grass(Lot)}, F7⟩ }. [07]

We cannot eliminate rain() from parfactor [07], because param(rain()) ⊉ param(wet_grass(Lot)). Instead, we apply Proposition 2.6 to parfactor [07] and perform counting on the logical variable Lot. We obtain

⟨∅, {rain(), #Lot[wet_grass(Lot)]}, F8⟩, [08]

where F8 is defined as in Equation 2.10. The new set of parfactors is as follows:

Φ4 = { ⟨∅, {wet_grass(lot1)}, F4⟩, [04]
⟨∅, {rain(), #Lot[wet_grass(Lot)]}, F8⟩ }. [08]

Now, we can sum out rain() from parfactor [08] (Proposition 2.1). We obtain

⟨∅, {#Lot[wet_grass(Lot)]}, F9⟩, [09]

where F9 = ∑_{rain()} F8. The new set of parfactors is as follows:

Φ5 = { ⟨∅, {wet_grass(lot1)}, F4⟩, [04]
⟨∅, {#Lot[wet_grass(Lot)]}, F9⟩ }. [09]

Before we can multiply the two remaining parfactors, we need to apply Proposition 2.5 to parfactor [09] and expand the counting formula #Lot[wet_grass(Lot)].
We obtain the following parfactor:

⟨∅, {wet_grass(lot1), #Lot:{Lot ≠ lot1}[wet_grass(Lot)]}, F10⟩, [10]

where F10 is defined as in Equation 2.8. The new set of parfactors is as follows:

Φ6 = { ⟨∅, {wet_grass(lot1)}, F4⟩, [04]
⟨∅, {wet_grass(lot1), #Lot:{Lot ≠ lot1}[wet_grass(Lot)]}, F10⟩ }. [10]

Next, we multiply parfactor [04] by parfactor [10] (Proposition 2.3) and obtain:

⟨∅, {wet_grass(lot1), #Lot:{Lot ≠ lot1}[wet_grass(Lot)]}, F11⟩, [11]

where F11 = F10 · F4. The new set of parfactors is as follows:

Φ7 = { ⟨∅, {wet_grass(lot1), #Lot:{Lot ≠ lot1}[wet_grass(Lot)]}, F11⟩ }. [11]

We sum out wet_grass(lot1) from parfactor [11] (Proposition 2.1) and obtain

⟨∅, {#Lot:{Lot ≠ lot1}[wet_grass(Lot)]}, F12⟩, [12]

where F12 = ∑_{wet_grass(lot1)} F11. The new set of parfactors is as follows:

Φ8 = { ⟨∅, {#Lot:{Lot ≠ lot1}[wet_grass(Lot)]}, F12⟩ }. [12]

We have J_{ground(wet_grass(Lot)):{Lot ≠ lot1}}(Φ) = J(Φ8).

During the computation, the constants lot2, lot3, ..., lotn were not explicitly enumerated; we only needed to know that |D(Lot)| = n. The biggest factor created during inference had size 8 (factor F3) for n ≤ 4 and 2(n + 1) (factor F8) for n > 4. We performed 1 expansion, 3 multiplications, 1 counting operation, and 3 summations. During the inference presented in Section 2.5.2.7, the C-FOVE algorithm performed 2 splits, 5 multiplications, 1 counting operation, and 4 summations.
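To see why the counting operation keeps the biggest factor at size 2(n + 1) rather than 2^(n+1), one can check on a toy instance that, conditioned on rain(), the joint over the n wet_grass variables depends only on how many of them are true. The following Python sketch uses made-up probabilities (the thesis does not list F1, ..., F3) and omits sprinkler for brevity:

```python
from itertools import product
from math import comb

n = 4
p_rain = 0.3
p_wet = {True: 0.9, False: 0.2}  # hypothetical P(wet_grass(lot) = true | rain)

def brute(k):
    """P(exactly k lots wet), enumerating all 2**n assignments."""
    total = 0.0
    for rain in (True, False):
        pr = p_rain if rain else 1 - p_rain
        for world in product((True, False), repeat=n):
            if sum(world) == k:
                p = pr
                for wet in world:
                    p *= p_wet[rain] if wet else 1 - p_wet[rain]
                total += p
    return total

def lifted(k):
    """Same probability via the counting-formula histogram of size n + 1."""
    return sum((p_rain if rain else 1 - p_rain)
               * comb(n, k) * p_wet[rain] ** k * (1 - p_wet[rain]) ** (n - k)
               for rain in (True, False))
```

Both computations agree for every k, but the lifted one only ever manipulates a table with 2(n + 1) entries.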