Poisson Process Infinite Relational Model: a Bayesian nonparametric model for transactional data

by

Creagh Briercliffe

B.Sc. (Hons.), University of Manitoba, 2014

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Statistics)

The University of British Columbia
(Vancouver)

August 2016

© Creagh Briercliffe, 2016

Abstract

Transactional data consists of instantaneously occurring observations made on ordered pairs of entities. It can be represented as a network—or more specifically, a directed multigraph—with edges possessing unique timestamps. This thesis explores a Bayesian nonparametric model for discovering latent class-structure in transactional data. Moreover, by pooling information within clusters of entities, it can be used to infer the underlying dynamics of the time-series data. Blundell, Beck, and Heller (2012) originally proposed this model, calling it the Poisson Process Infinite Relational Model; however, this thesis derives and elaborates on the necessary procedures to implement a fully Bayesian approximate inference scheme. Additionally, experimental results are used to validate the computational correctness of the inference algorithm. Further experiments on synthetic data evaluate the model's clustering performance and assess predictive ability. Real data from historical records of militarized disputes between nations test the model's capacity to learn varying degrees of structured relationships.

Preface

This dissertation is solely authored, unpublished work by the author, Creagh Briercliffe. The research topic and experiments were jointly designed with Profs. Alexandre Bouchard-Côté and Paul Gustafson. The Poisson Process Infinite Relational Model was first presented by Blundell, Beck, and Heller [6].

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments

1 Introduction
  1.1 Related Work

2 Background
  2.1 Chinese Restaurant Process
  2.2 Infinite Relational Model

3 Poisson Process Infinite Relational Model
  3.1 Problem Formulation
  3.2 Generative Process
  3.3 Inference
    3.3.1 Sample Space of Partitions
    3.3.2 Posterior Over Partitions
    3.3.3 Posterior Over Rates
    3.3.4 Implementation
    3.3.5 Sampling Hyperparameters
  3.4 Note on the Superposition Property

4 Experiments
  4.1 Synthetic Data
    4.1.1 Validation of Sampler
    4.1.2 Convergence Diagnostics
    4.1.3 Clustering Evaluation
    4.1.4 Prediction
  4.2 Real Data

5 Conclusion
  5.1 Future Work

Bibliography

List of Tables

Table 3.1  The nth Bell number corresponding to various values of n. B_n counts the number of distinct partitions of a set of size n. Thus, B_n is the size of the sample space for π when there are n entities.

Table 4.1  Countries belonging to some of the clusters inferred from the MAP partition of the MID dataset.

Table 4.2  The clusters and countries corresponding to the 6 largest average Poisson process rates recovered from 1,000 posterior samples. Samples were based on MAP estimates of the model parameters. Clusters for the rates of transactions are shown in the order sender → recipient.

List of Figures

Figure 4.1  Multidigraph representation of randomly generated data, from 5 entities, at various stopping times, T. Vertex colours represent the three different clusters. Edges are coloured to signify that each transaction occurs at a unique time point.

Figure 4.2  95% confidence intervals for the posterior probabilities of each possible partition. Estimates are based on 10,000 MCMC samples with Monte Carlo standard errors calculated by a batch means method. Each datapoint shows the true posterior probability, and points in red are those not captured by the confidence interval.

Figure 4.3  ℓ1 norm, measured at various iterations, between: (i) the vector of true posterior probabilities of each possible partition, and (ii) the steady-state distribution of the empirical transition matrix of MCMC samples.

Figure 4.4  Trace plots of MCMC samples of partitions, given different initializations. Data is comprised of 336 randomly generated transactions from 30 entities. The red lines depict the true partition. Numbering along the vertical axis is arbitrary.

Figure 4.5  Trace plots of MCMC samples for each of the hyperparameters, given different initializations. Data is comprised of 336 randomly generated transactions from 30 entities. The red lines depict the true parameter values.

Figure 4.6  Mean number of clusters in 10,000 MCMC samples as the stopping time is increased. The red line depicts the number of clusters in the ground-truth partition used to generate the synthetic data.

Figure 4.7  Mean number of clusters in the MCMC samples as the stopping time is increased. Red points are an average of all 10,000 samples, while blue points drop the first 10% as burn-in. The red line depicts the number of clusters in the ground-truth partition used to generate the synthetic data.

Figure 4.8  Adjusted Rand index between a partition estimate and the true partition as the stopping time is increased. Each datapoint is an average across 10 randomly generated datasets, and error bars show one standard deviation. Each estimator is computed from 10,000 MCMC samples.

Figure 4.9  Estimated log posterior predictive densities on a test set as the amount of training data (stopping time) is increased. Each estimate is an average based on 10,000 MCMC samples. The red line depicts the likelihood of the test set when the true partition and true rates are given.

Figure 4.10  Multidigraph representation of the MID dataset consisting of 138 countries (vertices) and 508 disputes (edges). Vertices are both coloured and numbered according to the 18 clusters inferred from the MAXPEAR partition.

Figure 4.11  Multidigraph representation of the MID dataset consisting of 138 countries (vertices) and 508 disputes (edges). Vertices are both coloured and numbered according to the 58 clusters inferred from the MAP partition.

Glossary

CRP  Chinese Restaurant Process

IRM  Infinite Relational Model

MAP  maximum a posteriori, the mode of the posterior distribution

MAXPEAR  An estimator of partitions that maximizes the posterior expected adjusted Rand index with the true partition

MCMC  Markov chain Monte Carlo

MEDV  An estimator of partitions based on an ad-hoc approach by Medvedovic et al. [21]

MID  Militarized Interstate Dispute

MINBINDER  An estimator of partitions that minimizes Binder's loss function

PPIRM  Poisson Process Infinite Relational Model

SBM  Stochastic Blockmodel

Acknowledgments

I have immensely enjoyed these last two years in Vancouver—and not just for the fact that I escaped the Winnipeg winters. I have been incredibly fortunate to be surrounded by dedicated mentors, talented colleagues, encouraging friends, and loving family. Through their constant support, I have been able to grow professionally, and feel at home in a new city. I am truly grateful to have been afforded the opportunity to pursue my interests, and humbled by the support I received.

First, I want to express my gratitude towards my academic co-supervisors, Profs. Alexandre Bouchard-Côté and Paul Gustafson.¹ Alex and Paul are the Bayesian-dream-team of unrelenting support, unbounded wisdom, and the true driving force behind my research. They possess the uncanny ability to take any of my seemingly insurmountable research-related struggles, and boil them down to triviality. Alex introduced me to the field of Bayesian nonparametrics, and was a fountain of insightful comments, no matter the topic. Paul's ability to simplify complicated problems, and lead engaging meetings cannot be overstated.
Thank you both for your tireless dedication to my continued academic growth.

I have many brilliant mentors and gifted colleagues to thank for helping me to achieve my academic goals. Thanks to Prof. Saman Muthukumarana for your enthusiastic encouragement. Thanks to Andrea Sollberger for your incredible sense of humour, and for tolerating my endless questions. Thanks to my officemates Derek, Jonathan, Sohrab and Yijun for your camaraderie. Thanks to my friends and colleagues, Gal, Jeff and Lena, whose solidarity carried me through our long bout with 551-consulting. I also want to acknowledge the financial support of the Natural Sciences and Engineering Research Council of Canada.

¹ Listed names are in alphabetical order, and not necessarily in order of gratitude.

A special thanks to my friends and former officemates, Sean Jewell and Neil Spencer. Sean and Neil's incredible cleverness and passion for learning drove me to continue my academic pursuits. I am grateful to Sean for convincing me to come to UBC, getting me engaged in the reading group, and helping me to adapt to the Kitsilano life. I am grateful to Neil for teaching me to exploit A&W coupons, for his ever-insightful views, and for secretly acting as assistant coach to my fantasy hockey team. Together, their presence made coming to the office a delightful and rewarding experience.

When I needed a break from my studies, I was extremely fortunate to be able to call on an amazing group of friends. In particular, thanks to Andrew, Ryan, Spencer and Steve for making me feel like a pro at pitch-and-putt, and for communally chastising Steve's awful choice of Chinese food. A special thank you to my close friends Micah Goldberg and James Struthers, who value my friendship so much that they followed me to Vancouver. Micah's unique ability to unabashedly insert himself into situations where he is unwelcome makes him a truly entertaining person to be around.
James' unmatched charisma and propensity for self-destruction have made the last two years of domestic life some of the best. Thank you, James and Micah, for helping to make Vancouver feel like home.

Finally, a heartfelt thank you to my parents, Susan and Randy Briercliffe, for their endless support and unconditional encouragement. I cannot express enough gratitude for the amount of love and patience you have given me on this journey.

Chapter 1

Introduction

Often, the primary objectives of probabilistic unsupervised learning are to unveil implicit or hidden structure in data, and to harness this structure to learn the data generating process. To address these objectives, a common statistical technique is to use a mixture model, which boasts a unifying approach for structure discovery and probabilistic modelling. Most mixture models and related unsupervised learning algorithms are designed for feature-based data, expressed as vectors in Euclidean space [17]. This thesis focusses on data of a different form, where observations are made on ordered pairs of objects or entities, and occur in continuous time—we call this transactional data. Moreover, this thesis elaborates on and explores a Bayesian nonparametric mixture model for transactional data, called the Poisson Process Infinite Relational Model (PPIRM).

Transactional data consists of a set of entities, X = {x_1, ..., x_n}, on which a set of observations are made as ordered triples (x_i, x_j, t), where t ∈ R_{>0} records the event time. Each observation is called a transaction to evoke the notion that it represents a directed event, or exchange, between two entities: one sender and one recipient, at a specific time. Similar data types are typically represented as a graph, where the entities mark the vertices, and the events are depicted as edges. Transactions are akin to dyadic events, where each transaction is an instantaneous directed edge between a pair of vertices.
Thus, a full transactional dataset can be represented as a directed multigraph, with edges occurring instantaneously in time. Our goal is to use the PPIRM to learn the implicit class-structure of the entities, and to infer the dynamics of the time-series data.

The PPIRM was originally proposed by Blundell et al. [6] as a model for discovering the underlying social structure from a set of interactions between people. As it was not the main focus of [6], there is little detail regarding an inference procedure and subsequent assessment of its performance. To address this deficiency in the literature, we derive a fully Bayesian approximate inference scheme for the PPIRM, and evaluate its performance on synthetic and real data. Furthermore, Blundell et al. [6] posit that the superposition property of Poisson processes can be used to maintain efficient posterior inference, but we show that this approach can be misguided.

Section 1.1 discusses other work related to the PPIRM, while the remainder of this thesis is organized as follows. Chapter 2 provides the necessary background material on a Bayesian nonparametric distribution over partitions of a set. Chapter 2 concludes with the details of the Infinite Relational Model (IRM), of which the PPIRM is an extension. The PPIRM and a full derivation of an inference procedure are presented in Chapter 3. Experimental results on synthetic and real data are discussed in Chapter 4. Finally, conclusions and future directions are reported in Chapter 5.

1.1 Related Work

Blundell et al. [6] proposed the PPIRM by relating the work of DuBois and Smyth [9] to the IRM [20, 32]. While the model proposed in [9] also uses Poisson processes to model event times, it requires the number of latent classes to be fixed in advance. Conversely, the IRM and PPIRM exploit a flexible nonparametric method, allowing the number of classes to be data-dependent and inferred along with the class memberships.
Furthermore, the work in [9] models latent classes of events, rather than latent classes of entities.

The IRM is a Bayesian nonparametric model for dyadic data. It can be viewed as a generalization of the parametric Stochastic Blockmodel (SBM) [18, 30], which models the regular block structure of an adjacency matrix. The dyadic data applicable to the IRM, which Kemp et al. [20] call relations, lack the time-varying structure of transactional data. Thus, transactional data are strictly richer than the relations that can be modelled with the IRM.

The model that was the main focus of [6] uses mutually-exciting Hawkes processes, where the rates of events depend upon those of other processes in an excitatory fashion. Similarly, Simma and Jordan [28] use a cascade of Poisson processes, which also form a marked Hawkes process. These models allow for time-varying rates, which are aimed at capturing the reciprocating nature of many social interactions. However, they lack the conjugacy that affords the PPIRM the ability to maintain efficient posterior inference.

Recently, Xin et al. [31] proposed a continuous-time extension of the SBM that uses inhomogeneous Poisson processes to model transactions among basketball players. While their model allows for time-varying rates, it also requires the number of latent classes to be fixed in advance.

Finally, we want to emphasize that the domain of applications for the PPIRM and related work is far-reaching. Examples include social interactions [2, 6, 9, 18, 30], biological networks [2, 16], sports [31], and ontologies of semantic knowledge [20]. Subsequently, we aim to keep our model descriptions as general and domain-independent as possible. Then, in Section 4.2, we display the PPIRM's ability to model a real dataset of historical military conflicts between nations.

Chapter 2

Background

The PPIRM was presented by Blundell et al. [6] as an extension to the IRM.
The IRM [20, 32] is a Bayesian model for relational data, which aims to learn a clustering of the data. In doing so, the IRM relies on a specialized prior distribution over all possible clusterings, called the Chinese Restaurant Process (CRP) [3].

This chapter is divided into two parts. Section 2.1 reviews the details of the CRP, and its features as a prior distribution. Section 2.2 discusses the IRM, starting from its most simple form with dyadic data, and then moving to the more general version. Together, these sections provide the necessary notation and background to discuss the PPIRM in Chapter 3.

2.1 Chinese Restaurant Process

Let S be a finite set of size n = |S|, and let k ∈ N such that k ≤ n. A partition of S into k blocks is an unordered collection of non-empty disjoint sets {A_1, ..., A_k} whose union is S [26]. In cluster analysis, blocks are referred to as clusters. Here, the task is to construct a partition of a dataset, such that the data within the same cluster are more similar to each other than to those in other clusters. Taking a Bayesian approach to clustering, one might wish to place a prior distribution over partitions of a set. Moreover, when the number of clusters, k, is unknown, a sensible prior should assign some probability mass to all possible partitions; that is, for all possible values of k. The CRP does just that: it induces a probability distribution over partitions.

The CRP gets its name from the metaphorical seating plan that describes the stochastic process [3]. In this metaphor, customers arrive sequentially at a Chinese restaurant with an infinite number of tables. Each table can seat an infinite number of customers. The first customer enters the restaurant and sits at the first table.
The second customer enters and either joins the first customer, or sits at a new table. In general, each subsequent customer either joins an already occupied table with probability proportional to the number of customers already sitting there, or sits at a new table with probability proportional to a fixed parameter α. At any point in this process, the configuration of customers at tables defines a random partition.

To be more precise, let π_i denote the table assignment of the ith customer. The CRP with parameter α draws each π_i sequentially, such that

\[
P(\pi_i = k \mid \pi_1, \dots, \pi_{i-1}) =
\begin{cases}
\dfrac{n_k}{i - 1 + \alpha} & \text{if } n_k > 0, \\[6pt]
\dfrac{\alpha}{i - 1 + \alpha} & \text{if } k \text{ is a new table},
\end{cases}
\]

where n_k denotes the number of customers seated at table k. Together, the table assignments, π_i, can be used to describe the full partition, π. Thus, a partition can be represented alternatively as a vector, with components representing the cluster memberships of each datum. The parameter, α ∈ R_{>0}, is called the concentration parameter. Intuitively, larger values of α tend to produce partitions with more occupied tables and fewer customers per table [11].

Though this process is defined sequentially, the distribution induced by the CRP is, in fact, exchangeable [3]. The probability of a particular partition, π, is invariant to the order in which the customers were seated. Hence, the order in which the data are assigned to clusters can be permuted without changing the probability of the partition. Given this property, the probability of any single partition can be expressed succinctly as

\[
p(\pi) = \frac{\alpha^{K} \prod_{k=1}^{K} (n_k - 1)!}{\prod_{i=1}^{n} (i - 1 + \alpha)}, \tag{2.1}
\]

where K denotes the number of tables. Note that Equation 2.1 depends on the number and sizes of the tables, but does not depend on the order in which the customers were seated.

If we represent the customers by the elements of the set [n] := {1, 2, ..., n}, then after n customers have been seated, the tables define a partition of [n].
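To make the seating scheme concrete, the following Python sketch (our own illustration, not an implementation from this thesis) draws a partition by simulating the sequential seating rule, and evaluates Equation 2.1 in log space for numerical stability. Cluster labels are assigned in order of table creation.

```python
import math
import random

def sample_crp_partition(n, alpha, rng=None):
    """Draw table assignments for n customers from a CRP(alpha)."""
    rng = rng or random.Random()
    assignments = []          # assignments[i] = table of customer i (0-indexed)
    counts = []               # counts[k] = number of customers at table k
    for i in range(n):
        # Customer i+1 joins table k with probability n_k / (i + alpha),
        # or opens a new table with probability alpha / (i + alpha).
        u = rng.random() * (i + alpha)
        acc, k = 0.0, len(counts)             # default: a new table
        for j, nk in enumerate(counts):
            acc += nk
            if u < acc:
                k = j
                break
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments

def crp_log_prob(assignments, alpha):
    """Log of Equation 2.1: alpha^K prod_k (n_k - 1)! / prod_{i=1}^n (i - 1 + alpha)."""
    counts = {}
    for k in assignments:
        counts[k] = counts.get(k, 0) + 1
    n, K = len(assignments), len(counts)
    lp = K * math.log(alpha)
    lp += sum(math.lgamma(nk) for nk in counts.values())  # log (n_k - 1)!
    lp -= sum(math.log(i + alpha) for i in range(n))      # log prod_{i=1}^n (i - 1 + alpha)
    return lp
```

As a sanity check on exchangeability and normalization, for n = 2 the two possible partitions {1, 2} and {1}{2} have probabilities 1/(1+α) and α/(1+α), which Equation 2.1 reproduces exactly.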
Furthermore, if we add slightly to the above seating metaphor and specify that each table is circular, then the CRP can also define a distribution over permutations of [n]. In this modified metaphor, when the (j+1)st customer enters the restaurant they choose with equal probability to sit at any of the following j + 1 places: to the left of customer i for some i ∈ [j], or alone at a new table. After the n customers have been seated, the tables represent the cycles of the permutation, whose product is a uniformly distributed random permutation of [n] [26]. Since this feature of the CRP is rarely used in cluster analysis or Bayesian mixture modelling, we refer the interested reader to the works of Aldous [3] and Pitman [26] for further details.

In terms of a prior over partitions of a dataset, the CRP has several potentially desirable features. Namely, a sensible prior should encourage only as many clusters as warranted by the data [20]. Since new customers can always be assigned to new tables, the CRP permits the number of tables to grow with the data. Specifically, the expected number of tables grows logarithmically with the number of customers [11]. Furthermore, the CRP exhibits a rich-gets-richer phenomenon, where a customer is more likely to join a table occupied by a large number of customers. That is, the larger n_k is, the higher the probability that table k will grow [29]. As we shall see in Section 2.2, these features also make the CRP a useful prior for the partitions of relational data.

2.2 Infinite Relational Model

The IRM [20, 32] is a Bayesian model of binary-valued relationships amongst a set of entities. For example, the entities might consist of a set of people, and a relation likes describes whether one person likes another. In this way, the relation can be defined as a function, likes: people × people → {0, 1}, where likes(u, v) indicates whether or not person u likes person v.
The goal here is to partition entities into clusters, where a good partition allows relationships between entities to be predicted by their cluster assignments. The IRM assumes that an entity's tendency to participate in relations is determined entirely by its cluster assignment.

The IRM can be seen as a nonparametric generalization of the parametric SBM introduced by Holland et al. [18] and Wasserman and Anderson [30]. By utilizing the CRP as a prior over partitions, the IRM does not require the number of clusters to be fixed in advance, unlike the SBM. Rather, the IRM infers this number simultaneously with the cluster memberships for each entity. Given that the CRP permits each new entity to be assigned to a new cluster, the IRM effectively has access to a countably infinite collection of clusters. Hence, Kemp et al. [20] coined the name Infinite Relational Model for their approach to modelling relational data.

Like the SBM, the IRM can be presented in terms of modelling the graph structure of relational data. To describe the generative process, we adopt this approach and list the features of the data in terms of their analogues on an unweighted, directed graph (or digraph). We define this graph, G = (V, E), by its vertices and directed edges. Let V denote the set of vertices of the graph, where each vertex represents a single entity. For vertices u, v ∈ V, let e_{uv} ∈ {0, 1} denote the presence (1) or absence (0) of a directed edge from vertex u to v in the graph; edges correspond to the relationships between entities. Let E denote the edge set consisting of all ordered triples (u, v, e_{uv}) for all vertices u, v ∈ V.
Note that for asymmetric relations, it is not necessary that e_{uv} = e_{vu}, thereby creating a directed graph.

The generative process of the IRM is as follows:

\begin{align}
\pi \mid \alpha &\sim \mathrm{CRP}(\alpha) \tag{2.2} \\
\lambda_{kl} \mid \gamma &\sim \mathrm{Beta}(\gamma, \gamma), \quad \forall\, k, l \in \mathrm{range}(\pi) \tag{2.3} \\
e_{uv} \mid \lambda, \pi &\sim \mathrm{Bernoulli}(\lambda_{\pi_u \pi_v}), \quad \forall\, u, v \in V \tag{2.4}
\end{align}

where π is a partition of the vertices V, drawn from the CRP with concentration parameter α, as described in Section 2.1. Adopting the notation from [6], we use range(π) to denote the set of all unique clusters in π, which we index with k and l. Note that while the CRP is a distribution over a countably infinite number of clusters, |range(π)| is restricted to a natural number between 1 and |V|, inclusive, for fixed vertex sets. That is, when |range(π)| = 1, all vertices belong to a single cluster; when |range(π)| = |V|, each vertex belongs to its own cluster. We write that vertex u ∈ V belongs to the cluster given by π_u. The probability of a directed edge from vertex u to v is given by the parameter λ_{π_u π_v}. Finally, we use λ to denote the matrix of Bernoulli parameters between all |range(π)| × |range(π)| pairs of clusters. Kemp et al. [20] choose to place symmetric conjugate priors, with single hyperparameter γ, on each entry of the λ matrix.

Since the generative process uses a conjugate prior on the entries of λ, it is straightforward to derive the marginal likelihood p(E | π), integrating out λ:

\begin{align}
p(E \mid \pi) &= \int p(\lambda \mid \pi)\, p(E \mid \lambda, \pi)\, d\lambda \notag \\
&= \int \Bigg[ \prod_{k,l \in \mathrm{range}(\pi)} \mathrm{Beta}(\lambda_{kl}; \gamma, \gamma) \Bigg] \Bigg[ \prod_{u,v \in V} \mathrm{Bernoulli}(e_{uv}; \lambda_{\pi_u \pi_v}) \Bigg] d\lambda \notag \\
&= \int \prod_{k,l \in \mathrm{range}(\pi)} \frac{\lambda_{kl}^{\gamma - 1} (1 - \lambda_{kl})^{\gamma - 1}}{B(\gamma, \gamma)} \prod_{u : \pi_u = k} \; \prod_{v : \pi_v = l} \lambda_{kl}^{e_{uv}} (1 - \lambda_{kl})^{1 - e_{uv}}\, d\lambda \notag \\
&= \prod_{k,l \in \mathrm{range}(\pi)} \frac{1}{B(\gamma, \gamma)} \int \lambda_{kl}^{\gamma - 1} (1 - \lambda_{kl})^{\gamma - 1} \lambda_{kl}^{m_{kl}} (1 - \lambda_{kl})^{\bar{m}_{kl}}\, d\lambda_{kl} \notag \\
&= \prod_{k,l \in \mathrm{range}(\pi)} \frac{B(m_{kl} + \gamma,\, \bar{m}_{kl} + \gamma)}{B(\gamma, \gamma)}, \tag{2.5}
\end{align}

where m_{kl} denotes the number of directed edges from all entities in cluster k to any entity in cluster l. That is, m_{kl} is the number of pairs (u, v) where π_u = k, π_v = l and e_{uv} = 1. Similarly, \bar{m}_{kl} is the number of pairs (u, v) where π_u = k, π_v = l and e_{uv} = 0. B(a, b) denotes the Beta function, B(a, b) = Γ(a)Γ(b)/Γ(a + b).
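As an illustration of how Equation 2.5 reduces to bookkeeping over the counts m_kl, here is a hedged Python sketch (our own, not code from this thesis or from [20]) that evaluates log p(E | π) directly from a cluster assignment and a dictionary of binary edges; the Beta function is computed via log-gamma to avoid overflow.

```python
import math
from collections import Counter

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def irm_log_marginal(pi, edges, gamma):
    """Log marginal likelihood log p(E | pi) of Equation 2.5.

    pi    : dict mapping each vertex to its cluster label
    edges : dict mapping every ordered pair (u, v) to e_uv in {0, 1}
    gamma : symmetric Beta hyperparameter
    """
    sizes = Counter(pi.values())               # n_k: entities per cluster
    m = Counter()                              # m_kl: edges from cluster k to l
    for (u, v), e in edges.items():
        m[(pi[u], pi[v])] += e
    lp = 0.0
    for k, nk in sizes.items():
        for l, nl in sizes.items():
            m_kl = m[(k, l)]
            m_bar = nk * nl - m_kl             # absent edges: n_k n_l - m_kl
            lp += log_beta(m_kl + gamma, m_bar + gamma) - log_beta(gamma, gamma)
    return lp
```

Only the cluster sizes n_k and the edge counts m_kl enter the computation, which is what makes Gibbs updates over cluster assignments cheap.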
Lastly, note that for simplicity of notation, conditioning on the parameter γ is implicit.

The marginal likelihood in Equation 2.5 is, in fact, quite simple to compute. If we let n_k denote the number of entities belonging to cluster k, then \bar{m}_{kl} can be calculated as \bar{m}_{kl} = n_k n_l − m_{kl}. Thus, we need only record the counts n_k for all clusters, and m_{kl} for all cluster pairs. Using this marginal likelihood, inference can be easily carried out using Markov chain Monte Carlo (MCMC) methods to sample from the posterior on cluster assignments p(π | E) ∝ p(E | π) p(π). Kemp et al. [20] employ a fully Bayesian approach, using an exponential prior p(α) = e^{−α} and improper prior p(γ) ∝ γ^{−5/2} when sampling the hyperparameters. Finally, we note that although λ is integrated out, it is simple to recover these parameters by drawing values independently from the posterior λ_{kl} | E, π ∼ Beta(m_{kl} + γ, \bar{m}_{kl} + γ) for all k, l ∈ range(π). Moreover, the maximum a posteriori (MAP) value of λ_{kl} given π and E is actually

\[
\frac{m_{kl} + \gamma - 1}{m_{kl} + \bar{m}_{kl} + 2\gamma - 2} = \frac{m_{kl} + \gamma - 1}{n_k n_l + 2\gamma - 2} \quad \text{when } m_{kl}, \bar{m}_{kl} > 1 - \gamma,
\]

contrary to what is stated in [20].

The IRM can be extended to model multiple types of entities, and polyadic relations of any arbitrary arity [20]. For example, suppose there are several relations defined over the domain people × people: likes(·, ·), respects(·, ·) and hates(·, ·). These relations can then be grouped together as a single type referred to as predicates. Next, we can define a ternary relation applies(u, v, p), which is 1 if predicate p applies to the pair (u, v) of people u and v. Thus, we have a ternary relation over two entity types: people and predicates. The aim is now to simultaneously cluster the people and the predicates.
When the observed data is viewed as a third-order tensor (people × people × predicate), the IRM finds a partition of people and predicates such that a good clustering ensures each three-dimensional sub-block of the data includes mostly ones or mostly zeroes.

To specify the generative process of this more general version of the IRM, consider an r-dimensional relation involving q different entity types, with r ≥ q. Now, V_q denotes the entity set of the qth entity type, and let d_p ≤ q denote the index of the entity type occupying dimension p ≤ r. Then,

\begin{align*}
\pi^{(i)} \mid \alpha &\sim \mathrm{CRP}(\alpha), && \forall\, i = 1, \dots, q \\
\lambda_{k_1 \dots k_r} \mid \gamma &\sim \mathrm{Beta}(\gamma, \gamma), && \forall\, k_1 \in \mathrm{range}(\pi^{(d_1)}), \dots, \forall\, k_r \in \mathrm{range}(\pi^{(d_r)}) \\
e_{u_1 \dots u_r} \mid \lambda, \pi^{(1)}, \dots, \pi^{(q)} &\sim \mathrm{Bernoulli}\big(\lambda_{\pi^{(d_1)}_{u_1} \dots\, \pi^{(d_r)}_{u_r}}\big), && \forall\, u_1 \in V_{d_1}, \dots, \forall\, u_r \in V_{d_r}
\end{align*}

where λ is now an rth-order tensor (or r-dimensional array) of Bernoulli parameters between all r-tuples of clusters. Kemp et al. [20] further generalize the IRM by allowing multiple relations and introducing a parameter array λ^{(j)} for each relation and corresponding edge set E^{(j)}. In this model, they assume that relations are conditionally independent given the cluster assignments, and that the cluster assignments for each entity type are independent.

Due to the flexibility of the IRM, it has many potential applications. It has been used to discover structure in object-feature data [20], where the features can be viewed as unary predicates and the IRM can be used as a means for biclustering. In this case, the IRM was applied to an animal-feature matrix, and the clustering results were compared to solutions given by human subjects. Kemp et al. [20] further demonstrated its use by learning systems of related concepts as a means to acquire an ontology of semantic knowledge—an important task in AI development, where the ontology can be applied to further problem solving. They showed that the IRM can be used to discover a simple biomedical ontology from a semantic network comprised of concepts and binary predicates.
The concepts included entities like 'disease or syndrome,' 'diagnostic procedure,' and 'mammal.' The predicates included verbs like complicates, affects and causes. Furthermore, by computing the MAP values of λ_{kl}, Kemp et al. [20] were able to identify the pairs of clusters (k, l) that are most strongly linked, and the predicates that linked them.

Chapter 3

Poisson Process Infinite Relational Model

One limitation of the IRM is the inability to model multiple instances of a single relation between the same pair of entities. For example, consider an entity set players and a relation passes, both referring to a soccer match. With the IRM, we are limited to binary-valued relations, where passes(u, v) indicates whether or not player u passes the ball to player v; however, in reality, any given player could pass the ball to another several times within a given match. Thus, using the notation from Section 2.2, we want to capture this information by letting e_{uv} ∈ {0} ∪ N. In terms of the graph structure of the data, we want to model a weighted digraph.

Blundell et al. [6] suggest a straightforward modification to account for this, replacing the Beta-Bernoulli model in Expressions 2.3 and 2.4 with a Gamma-Poisson model. This model permits Poisson-distributed counts on the directed edges, while retaining the convenient conjugacy in the generative process. However, Blundell et al. [6] note that a Gamma-Poisson observation model does not allow for the prediction of events into the future, outside of the observed time window. Continuing with the previous example, we may be interested in predicting passes between players over the course of a season, beyond just a single match. Therefore, Blundell et al. [6] suggest using instead a Poisson process.

This chapter presents the PPIRM, which extends the IRM by modelling a richer structure of data, while retaining the ability to perform efficient, approximate posterior inference.
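To see why a Gamma-Poisson replacement keeps the marginal likelihood in closed form, consider a single cluster pair with edge counts e_1, ..., e_m drawn i.i.d. Poisson(λ) and λ ~ Gamma(a, b) in the shape-rate parameterization (the symbols a and b are ours; the thesis does not fix hyperparameter notation here). Integrating λ out gives p(e_1, ..., e_m) = [b^a / Γ(a)] · Γ(a + S) / (b + m)^{a + S} / ∏_i e_i!, with S = Σ_i e_i. A minimal Python sketch of this conjugate computation, offered as an illustration rather than the model's specification:

```python
import math

def gamma_poisson_log_marginal(counts, a, b):
    """Log marginal likelihood of i.i.d. Poisson(lam) counts with a
    Gamma(a, b) prior (shape a, rate b) on lam, with lam integrated out:

        [b^a / Gamma(a)] * Gamma(a + S) / (b + m)^(a + S) / prod_i e_i!

    where m = len(counts) and S = sum(counts).
    """
    m, S = len(counts), sum(counts)
    lp = a * math.log(b) - math.lgamma(a)
    lp += math.lgamma(a + S) - (a + S) * math.log(b + m)
    lp -= sum(math.lgamma(e + 1) for e in counts)
    return lp
```

The closed form mirrors Equation 2.5: counts per cluster pair are sufficient, so a Gibbs sweep never needs the rates themselves.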
Section 3.1 formalizes the problem statement, while Section 3.2 describes the generative process of the PPIRM. Blundell et al. [6] give little detail regarding their inference scheme for the PPIRM, but mention that it relies on the superposition property of Poisson processes. Instead, we derive our own approximate inference procedure in Section 3.3, and Section 3.4 discusses why their proposal to use the superposition property is misguided.

3.1 Problem Formulation

Consider a dataset consisting of a set of entities and a set of recorded transactions between them, each with a specific timestamp. To distinguish from the binary-valued "relations" of the IRM, we use the term transaction to define events between ordered pairs of entities that occur at a specific time point. Multiple transactions can occur between the same ordered pair of entities, but each event must occur at a unique time point. Thus, we use the term transaction in the sense of an exchange between two entities, one sender and one recipient, at a particular time. In this way, the events can be seen as asymmetric, or directed, interactions. Examples include:

• a team of soccer players and the times at which a pass was made between any two of them during the course of a match;
• employees of a company and the times at which an email was sent from one employee to another;
• the neurons in the human brain and the times at which a signal is transmitted from the axon of one neuron to the dendrite of another;
• a group of Twitter users and the times at which one user retweets the content of another;
• a collection of airports and the times at which flights depart from one city travelling towards another.

Each transaction can be represented as a triple consisting of a sender entity, a recipient entity, and a time of occurrence. Let N denote the full dataset; that is, the set of all observed transaction triples.
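To make the representation concrete, the following sketch (our own illustrative code, with hypothetical data) stores transactions as (sender, recipient, time) triples and summarizes them into per-ordered-pair counts, the quantity written $x_{uv}$ later in the chapter.

```python
from collections import Counter

# Hypothetical transaction triples: (sender, recipient, timestamp).
# Multiple transactions may share an ordered pair, but timestamps are unique.
transactions = [
    ("u", "v", 0.12),
    ("u", "v", 0.57),
    ("v", "w", 0.31),
    ("w", "u", 0.90),
]

# x[(u, v)] = number of transactions from u to v (the count summary x_uv)
x = Counter((s, r) for s, r, _ in transactions)

# Sanity check: every timestamp in the dataset is distinct
times = [t for _, _, t in transactions]
unique_times = len(set(times)) == len(times)
```

Note that the count summary alone discards the timestamps; both pieces of information are used by the model below.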
There are two ways in which this data, N, is strictly richer than the relations, E, from the IRM: each transaction includes a time of occurrence, and multiple transactions can be observed between any ordered pair of entities. Note that both of these features exist in each of the above examples. For example, there can be multiple flights departing from Vancouver heading to Winnipeg, but each has a unique departure time. Thus, the IRM cannot sufficiently capture the structure in this data.

Given such a dataset, the problem is one of learning or inference, where we wish to learn the underlying dynamics of the network structure formed by the data. More explicitly, we wish to infer the parameters of a model that accurately describes the structure of the data, N. Using this inferred model, we can discover groups of similarly behaving entities and predict unobserved transactions. To this end, we construct a Bayesian stochastic model with the objective of partitioning entities into clusters, where a good set of partitions allows transactions between entities to be predicted by their cluster assignments. Like the IRM, the PPIRM assumes that an entity's tendency to participate in transactions is determined entirely by its cluster assignment and the cluster assignment of the partner entity. Thus, our problem is inherently two-fold:

(i) learn a latent clustering of the entities (an unsupervised learning task);
(ii) model the time-series structure of the transaction data.

3.2 Generative Process

To describe the PPIRM, we revert to discussing the graph structure of the transactional data, N, which is both useful for visualizing the data and common practice in related literature [2, 18, 30]. We define this graph, G = (V, N), by its vertex and transaction sets. G represents a directed multigraph (also known as a multidigraph or quiver) where each edge has its own identity given by its timestamp.
In comparison, recall that the relational data modelled by the IRM can be represented by an unweighted digraph.

Let V denote the set of vertices, where each vertex represents a single entity, and let n = |V| denote the order of G. Note that throughout we try to use n to represent counts related to vertices, and m for counts related to edges or transactions. Like before, π is a partition of the vertices in V, and it can be represented as both a set of sets and as a vector. Note that we chose to implement the latter for its relative simplicity to code; however, implementing a set of sets, using for example Java's LinkedHashSet class [1], could improve computational efficiency. In the vector representation, π is a vector of length n containing the cluster assignments of all vertices. That is, $\pi = (\pi_1, \pi_2, \ldots, \pi_n)$. Thus, vertex u ∈ V belongs to the cluster given by the uth component of π, $\pi_u$. Again, we adopt the notation from [6] and use range(π) to denote the set of all distinct clusters in π.

Directed edges from a vertex u to a vertex v are modelled as events occurring randomly from a Poisson process with rate $\lambda_{\pi_u \pi_v}$. Specifically, we use a Poisson process on [0, ∞), such that the number of events in any interval [t, t′) of the real half-line, denoted N[t, t′), is Poisson distributed with rate λ(t′ − t). The generative process of the PPIRM is:

\[
\pi \mid \alpha \sim \mathrm{CRP}(\alpha) \tag{3.1}
\]
\[
\lambda_{kl} \mid \delta, \beta \sim \mathrm{Gamma}(\delta, \beta) \quad \forall k, l \in \mathrm{range}(\pi) \tag{3.2}
\]
\[
N_{uv}(\cdot) \mid \lambda, \pi \sim \mathrm{PoissonProcess}(\lambda_{\pi_u \pi_v}) \quad \forall u, v \in V \tag{3.3}
\]

where $N_{uv}(\cdot)$ is the random counting measure of the Poisson process, and δ and β are respectively the shape and rate (i.e. inverse scale) parameters of the Gamma prior on the rate of the Poisson process, $\lambda_{kl}$. We note that this model permits transactions to occur between an entity and itself, thereby creating self-loops in the multidigraph representation.

As a critique of the PPIRM, Blundell et al.
[6] suggest two potential deficiencies with the model:

(i) the rate of transactions between an ordered vertex pair, (u, v), is independent of the rate between the reverse pair, (v, u); this relates to a phenomenon known as reciprocity [18, 30];
(ii) conditioned on the time interval containing all observed transactions, the times of these events are uniformly distributed; with regards to stochastic processes, this feature is called time homogeneity.

To address these concerns, they propose replacing the Poisson processes with mutually-exciting Hawkes processes. However, in doing so, they sacrifice conjugacy in the model, as there is no conjugate prior for the Hawkes process likelihood. Consequently, they cannot integrate out the stochastic process rate parameters, and are forced to sample them using MCMC methods. Thus, the PPIRM sacrifices some model flexibility in order to maintain efficient posterior inference.

3.3 Inference

The primary objective is to infer a partition given the observed transaction data. Specifically, the aim is to infer both the number of clusters and the cluster memberships of all entities. In the context of social network analysis, this is often referred to as learning the latent community structure of the network [16]. Here, we exploit the conjugate prior for the Poisson likelihood and integrate out the Poisson process rate parameters.¹ Thus, the posterior mass function over partitions is given by

\[
p(\pi \mid N) = \frac{p(\pi)\, p(N \mid \pi)}{\sum_{\pi'} p(\pi')\, p(N \mid \pi')} \tag{3.4}
\]
\[
\propto p(\pi)\, p(N \mid \pi) = p(\pi) \int p(N \mid \lambda, \pi)\, p(\lambda \mid \pi)\, d\lambda, \tag{3.5}
\]

where λ denotes the matrix of Poisson process rates between clusters. If we let c = |range(π)| denote the number of distinct clusters in the partition π, then λ can be represented as a matrix of size c × c.

Subsection 3.3.1 discusses the need for approximate inference methods due to the combinatorially large sample space of distinct partitions.
Subsection 3.3.2 presents the full derivation of the unnormalized posterior mass function over partitions, while Subsection 3.3.3 derives the posterior density over the Poisson process rates, λ. Subsection 3.3.4 mentions the implementation details of our MCMC sampler for partitions. Finally, Subsection 3.3.5 details the rest of our fully Bayesian model and the procedures for sampling the hyperparameters, which were not discussed in [6].

¹Note that for simplicity of exposition, we view the parameters α, δ and β as fixed numbers in $\mathbb{R}_{>0}$. Later, in Subsection 3.3.5, we treat them as random variables.

3.3.1 Sample Space of Partitions

The number of distinct partitions of a set is a combinatorial quantity, and thus the sample space suffers from a combinatorial explosion as the number of entities, n, grows. To be precise, the number of distinct partitions of a set of size n is given by the nth Bell number,

\[
B_n = \frac{1}{e} \sum_{k=1}^{\infty} \frac{k^n}{k!},
\]

by Dobiński's formula [25]. This formula also allows $B_n$ to be interpreted as the nth moment of a Poisson distribution with rate parameter 1. For example, $B_1 = 1$ because the 1-element set {u} can be partitioned in only one way, as {{u}}. Similarly, $B_3 = 5$ because the 3-element set {u, v, w} can be partitioned in 5 distinct ways:

1. {{u, v, w}}
2. {{u}, {v, w}}
3. {{v}, {u, w}}
4. {{w}, {u, v}}
5. {{u}, {v}, {w}}

Table 3.1 lists the first several Bell numbers.

Due to this combinatorial explosion, for most interesting problems it is computationally infeasible to calculate the exact posterior probability of π, which requires computing a sum over all $B_n$ distinct partitions, as seen in Equation 3.4. Therefore, we rely on MCMC to perform approximate inference with Equation 3.5 as the target posterior.

3.3.2 Posterior Over Partitions

Below, we derive each component of the posterior distribution given by Equation 3.5.
By definition of the CRP, the first term on the right-hand side becomes

\[
p(\pi) = \frac{\alpha^{c} \prod_{k=1}^{c} (n_k - 1)!}{\prod_{u=1}^{n} (u - 1 + \alpha)}, \tag{3.6}
\]

where $n_k$ denotes the number of vertices belonging to cluster k, such that $\sum_{k=1}^{c} n_k = n$.

Table 3.1: The nth Bell number for various values of n. $B_n$ counts the number of distinct partitions of a set of size n; thus, $B_n$ is the size of the sample space for π when there are n entities.

    n      B_n
    1      1
    2      2
    3      5
    4      15
    5      52
    6      203
    7      877
    8      4,140
    9      21,147
    10     115,975
    ...    ...
    100    ≈ 4.7585 × 10^115

Next, the conditional probability of the Poisson process rates, p(λ|π), comes from taking the product of Gamma densities over all clusters k, l ∈ range(π). Thus,

\[
p(\lambda \mid \pi) = \prod_{k=1}^{c} \prod_{l=1}^{c} \mathrm{Gamma}(\lambda_{kl}; \delta, \beta)
= \prod_{k=1}^{c} \prod_{l=1}^{c} \frac{\beta^{\delta}}{\Gamma(\delta)}\, \lambda_{kl}^{\delta - 1} e^{-\beta \lambda_{kl}}. \tag{3.7}
\]

The conditional likelihood of the transaction data given all other information can be decomposed as follows. Since transactions between each ordered vertex pair, (u, v), are modelled by an independent Poisson process, we can decompose N into Poisson distributed counts, $X_{uv}$, with events distributed uniformly on the interval [0, T). Thus, each $X_{uv}$ is Poisson distributed with rate parameter $\lambda_{\pi_u \pi_v} T$. Simply put, $X_{uv}$ is a count summary of the data, with $x_{uv}$ denoting the observed number of transactions from vertex u to vertex v by time T. Now, the likelihood can be written as

\[
\begin{aligned}
p(N \mid \pi, \lambda) &= \prod_{u=1}^{n} \prod_{v=1}^{n} \mathrm{Poisson}(x_{uv}; \lambda_{\pi_u \pi_v} T) \cdot \mathrm{Uniform}(0, T)^{x_{uv}} \\
&= \prod_{u=1}^{n} \prod_{v=1}^{n} \frac{(\lambda_{\pi_u \pi_v} T)^{x_{uv}}\, e^{-\lambda_{\pi_u \pi_v} T}}{x_{uv}!} \left(\frac{1}{T}\right)^{x_{uv}} \\
&= \prod_{k=1}^{c} \prod_{l=1}^{c} \prod_{u:\pi_u=k} \prod_{v:\pi_v=l} \frac{(\lambda_{kl} T)^{x_{uv}}\, e^{-\lambda_{kl} T}}{x_{uv}!} \left(\frac{1}{T}\right)^{x_{uv}} \\
&= \prod_{k=1}^{c} \prod_{l=1}^{c} \frac{(\lambda_{kl} T)^{m_{kl}}\, e^{-\lambda_{kl} T n_k n_l}}{\prod_{u:\pi_u=k} \prod_{v:\pi_v=l} x_{uv}!} \left(\frac{1}{T}\right)^{m_{kl}},
\end{aligned} \tag{3.8}
\]

where $m_{kl}$ denotes the total number of transactions from all vertices in cluster k towards any vertex in cluster l by time T.
That is, $m_{kl} = \sum_{u:\pi_u=k} \sum_{v:\pi_v=l} x_{uv}$.

Combining Equations 3.7 and 3.8, and evaluating the integral, gives the following likelihood where the rates have been marginalized out:

\[
\begin{aligned}
p(N \mid \pi) &= \int p(N \mid \lambda, \pi)\, p(\lambda \mid \pi)\, d\lambda \\
&= \int \prod_{k=1}^{c} \prod_{l=1}^{c} \frac{(\lambda_{kl} T)^{m_{kl}}\, e^{-\lambda_{kl} T n_k n_l}}{\prod_{u:\pi_u=k} \prod_{v:\pi_v=l} x_{uv}!} \left(\frac{1}{T}\right)^{m_{kl}} \frac{\beta^{\delta}}{\Gamma(\delta)}\, \lambda_{kl}^{\delta-1} e^{-\beta \lambda_{kl}}\, d\lambda \\
&= \prod_{k=1}^{c} \prod_{l=1}^{c} \left[\frac{T^{m_{kl}} \beta^{\delta}}{\Gamma(\delta) \prod_{u:\pi_u=k} \prod_{v:\pi_v=l} x_{uv}!}\right] \left(\frac{1}{T}\right)^{m_{kl}} \int \lambda_{kl}^{m_{kl}+\delta-1}\, e^{-(T n_k n_l + \beta)\lambda_{kl}}\, d\lambda_{kl} \\
&= \prod_{k=1}^{c} \prod_{l=1}^{c} \left[\frac{T^{m_{kl}} \beta^{\delta}}{\Gamma(\delta) \prod_{u:\pi_u=k} \prod_{v:\pi_v=l} x_{uv}!}\right] \left(\frac{1}{T}\right)^{m_{kl}} \frac{\Gamma(m_{kl}+\delta)}{(T n_k n_l + \beta)^{m_{kl}+\delta}} \\
&= \prod_{k=1}^{c} \prod_{l=1}^{c} \frac{\Gamma(m_{kl}+\delta)}{\Gamma(\delta) \prod_{u:\pi_u=k} \prod_{v:\pi_v=l} x_{uv}!} \left(\frac{T}{T n_k n_l + \beta}\right)^{m_{kl}} \left(\frac{\beta}{T n_k n_l + \beta}\right)^{\delta} \left(\frac{1}{T}\right)^{m_{kl}} \\
&= \prod_{k=1}^{c} \prod_{l=1}^{c} \frac{(m_{kl}+\delta-1)!}{(\delta-1)!\, m_{kl}!} \left(\frac{T n_k n_l}{T n_k n_l + \beta}\right)^{m_{kl}} \left(1 - \frac{T n_k n_l}{T n_k n_l + \beta}\right)^{\delta} \cdot \frac{m_{kl}!}{\prod_{u:\pi_u=k} \prod_{v:\pi_v=l} x_{uv}!} \left(\frac{1}{T n_k n_l}\right)^{m_{kl}} \\
&= \prod_{k=1}^{c} \prod_{l=1}^{c} \mathrm{NegBin}\!\left(m_{kl};\, \delta,\, \frac{T n_k n_l}{T n_k n_l + \beta}\right) \frac{m_{kl}!}{\prod_{u:\pi_u=k} \prod_{v:\pi_v=l} x_{uv}!} \left(\frac{1}{T n_k n_l}\right)^{m_{kl}},
\end{aligned} \tag{3.9}
\]

where NegBin(·; r, p) is shorthand for the negative binomial probability mass function with parameters r and p. Finally, the kernel of our target posterior is

\[
\begin{aligned}
p(\pi \mid N) &\propto \frac{\alpha^{c}}{\prod_{u=1}^{n}(u-1+\alpha)} \prod_{k=1}^{c} (n_k - 1)! \prod_{l=1}^{c} \mathrm{NegBin}\!\left(m_{kl};\, \delta,\, \frac{T n_k n_l}{T n_k n_l + \beta}\right) \frac{m_{kl}!}{\prod_{u:\pi_u=k} \prod_{v:\pi_v=l} x_{uv}!} \left(\frac{1}{T n_k n_l}\right)^{m_{kl}} \\
&\propto \alpha^{c} \prod_{k=1}^{c} (n_k - 1)! \prod_{l=1}^{c} \mathrm{NegBin}\!\left(m_{kl};\, \delta,\, \frac{T n_k n_l}{T n_k n_l + \beta}\right) m_{kl}! \left(\frac{1}{n_k n_l}\right)^{m_{kl}}.
\end{aligned} \tag{3.10}
\]

3.3.3 Posterior Over Rates

With p(π|N) as the target posterior, we can sample the partition directly, without the need to sample the rates. Nevertheless, it is simple to recover these rates to establish the relationships between clusters. Furthermore, as we discuss in Subsection 4.1.4, certain experiments require samples of λ to use in likelihood calculations. Thus, we derive the posterior for λ given the data and partition as

\[
\begin{aligned}
p(\lambda \mid N, \pi) &\propto p(N \mid \lambda, \pi)\, p(\lambda \mid \pi) \\
&= \prod_{k=1}^{c} \prod_{l=1}^{c} \frac{(\lambda_{kl} T)^{m_{kl}}\, e^{-\lambda_{kl} T n_k n_l}}{\prod_{u:\pi_u=k} \prod_{v:\pi_v=l} x_{uv}!} \left(\frac{1}{T}\right)^{m_{kl}} \frac{\beta^{\delta}}{\Gamma(\delta)}\, \lambda_{kl}^{\delta-1} e^{-\beta \lambda_{kl}} \\
&\propto \prod_{k=1}^{c} \prod_{l=1}^{c} \lambda_{kl}^{m_{kl}+\delta-1}\, e^{-(T n_k n_l + \beta)\lambda_{kl}},
\end{aligned}
\]

which we recognize as an unnormalized Gamma distribution. Thus, the posterior distribution for λ is

\[
p(\lambda \mid N, \pi) = \prod_{k=1}^{c} \prod_{l=1}^{c} \mathrm{Gamma}(\lambda_{kl};\, m_{kl} + \delta,\, T n_k n_l + \beta). \tag{3.11}
\]
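This reinstantiation amounts to independent Gamma draws, one per ordered pair of clusters. A minimal sketch (our own illustrative Python, not from the thesis; note that `random.gammavariate` is parameterized by shape and scale, so the rate $T n_k n_l + \beta$ enters as its reciprocal):

```python
import random

def sample_rates(m, n_sizes, T, delta, beta, rng):
    """Draw lambda_kl ~ Gamma(shape = m_kl + delta, rate = T*n_k*n_l + beta)
    for every ordered pair of clusters, as in Equation 3.11.
    m is a c x c matrix of transaction counts; n_sizes is the list of cluster sizes."""
    c = len(n_sizes)
    return [[rng.gammavariate(m[k][l] + delta,
                              1.0 / (T * n_sizes[k] * n_sizes[l] + beta))
             for l in range(c)] for k in range(c)]

rng = random.Random(0)
# Toy example: 2 clusters of sizes 3 and 2, counts m_kl observed by time T = 1.0
lam = sample_rates([[5, 1], [0, 2]], [3, 2], 1.0, 0.3, 0.01, rng)
```

Each draw is independent of the others, so this step parallelizes trivially across cluster pairs.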
Given the transaction data and MCMC samples of π|N, we can reinstantiate samples of λ. That is, we can draw a rate $\lambda_{kl}$, independently from the Gamma distribution in Equation 3.11, for every pair of clusters k, l ∈ range(π).

3.3.4 Implementation

As mentioned in Subsection 3.3.1, due to the large sample space for π (even for modest numbers of vertices), we rely on MCMC to perform approximate inference. Using Expression 3.10 as the unnormalized target posterior, we implement a Metropolis-Hastings algorithm with a simple, symmetric proposal distribution. The procedure for sampling from the proposal distribution is shown in Algorithm 1.

Algorithm 1: Sample a partition, π*, from the proposal distribution
 1: procedure SAMPLEPARTITION
 2:   π* ← π, where π is the partition at the current iteration
 3:   Pick a vertex u ∈ V uniformly at random
 4:   if ∃ v ∈ V such that π_v = π_u and v ≠ u then
 5:     Pick any existing cluster, or a new one, uniformly at random
 6:   else
 7:     Pick any existing cluster uniformly at random
 8:   Assign π*_u to the chosen cluster
 9:   return π*

This pseudo-code presents a simple proposal distribution, whereby at most one vertex moves to a new cluster at each sweep of the Metropolis-Hastings algorithm. The if-else statement checks whether the chosen vertex is alone in its current cluster; if it is, moving it to a new, unoccupied cluster would be equivalent to picking its current cluster. Since this proposal is symmetric, the acceptance ratio for the Metropolis-Hastings algorithm simplifies to the ratio of p(π*|N) to p(π|N). We note that more complicated proposals, such as split-merge algorithms [8], have been shown to improve a sampler's ability to explore multiple modes.

Unless otherwise noted, we initialize our sampler with a partition that places each vertex in its own cluster.
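Putting Algorithm 1 together with the kernel in Expression 3.10 gives a complete Metropolis-Hastings sweep. The sketch below is our own illustrative Python (the thesis's implementation is in Java); all function names are hypothetical, and `log_kernel` evaluates the log of Expression 3.10 via `math.lgamma`.

```python
import math
import random

def log_kernel(pi, x, T, alpha, delta, beta):
    """Log of the unnormalized posterior over partitions (Expression 3.10)."""
    n = len(pi)
    clusters = sorted(set(pi))
    sizes = {k: sum(1 for z in pi if z == k) for k in clusters}
    logp = len(clusters) * math.log(alpha)
    for k in clusters:
        logp += math.lgamma(sizes[k])  # log (n_k - 1)!
        for l in clusters:
            m = sum(x[u][v] for u in range(n) if pi[u] == k
                    for v in range(n) if pi[v] == l)
            e = T * sizes[k] * sizes[l]  # exposure T * n_k * n_l
            # log[ NegBin(m; delta, e/(e+beta)) * m! / (n_k n_l)^m ];
            # the m! factor cancels NegBin's 1/m!, so neither appears below.
            logp += (math.lgamma(m + delta) - math.lgamma(delta)
                     + m * math.log(e / (e + beta))
                     + delta * math.log(beta / (e + beta))
                     - m * math.log(sizes[k] * sizes[l]))
    return logp

def propose(pi, rng):
    """Algorithm 1: move one random vertex to a random cluster (or a new one)."""
    pi_star = list(pi)
    u = rng.randrange(len(pi))
    clusters = sorted(set(pi))
    if sum(1 for z in pi if z == pi[u]) > 1:
        clusters = clusters + [max(clusters) + 1]  # vertex not alone: may open a new cluster
    pi_star[u] = rng.choice(clusters)
    return pi_star

def mh_sweep(pi, x, T, alpha, delta, beta, rng):
    """Symmetric proposal, so accept with probability min(1, p(pi*|N)/p(pi|N))."""
    pi_star = propose(pi, rng)
    log_accept = (log_kernel(pi_star, x, T, alpha, delta, beta)
                  - log_kernel(pi, x, T, alpha, delta, beta))
    if log_accept >= 0 or rng.random() < math.exp(log_accept):
        return pi_star
    return pi

rng = random.Random(42)
x = [[0, 3, 0], [2, 0, 0], [0, 1, 0]]  # toy 3-vertex transaction count matrix
pi = [0, 1, 2]  # start with each vertex in its own cluster
for _ in range(100):
    pi = mh_sweep(pi, x, 1.0, 1.0, 1.0, 1.0, rng)
```

Since the proposal moves at most one vertex per sweep, chains mix slowly between well-separated modes, which is the motivation for the split-merge alternatives mentioned above.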
This initialization is sensible when the prior favours a low number of clusters, thus allowing the chain to generally explore more of the sample space. Note that for a CRP with concentration parameter α, the expected number of occupied clusters grows as O(α ln n), when there are n vertices (customers) [11]. Other sensible initializations include placing each vertex in the same cluster, and drawing a random partition from the prior. However, both may limit the mixing of the chain when the observed data is sparse. Subsection 4.1.2 examines the effect of different initializations in further detail.

3.3.5 Sampling Hyperparameters

Blundell et al. [6] did not discuss the choice of priors or sampling procedures for the hyperparameters in the PPIRM, so we choose to proceed as follows. We place an exponential prior on the concentration parameter, α, such that $p(\alpha) = e^{-\alpha}$. A rate parameter of one enforces a relatively small mean on α; consequently, the CRP with parameter α tends to produce partitions with a smaller number of clusters. This helps to capture the intuition that the prior should favour partitions with small numbers of clusters [20].

Recall that the Poisson process rate parameters are drawn independently from a Gamma distribution with shape δ and rate β. We, in turn, place Gamma priors on δ and β, and set their respective shape and rate parameters all to 0.01. This imposes a relatively small mean and large variance.

We again implement a Metropolis-Hastings algorithm to sample new values of the hyperparameters. Sampling is done separately for each hyperparameter in a deterministic order. We use a Normal distribution, truncated at zero, for the proposal distribution. Specifically, for a hyperparameter, θ, we propose a new value, θ*, from the proposal distribution

\[
q(\theta^* \mid \theta) = \frac{\phi(\theta^* - \theta)}{\Phi(\theta)}, \qquad \theta^* > 0,
\]

where φ(·) denotes the standard Normal density and Φ(·) denotes the standard Normal cumulative distribution function.
Then, the acceptance ratio for the Metropolis-Hastings algorithm becomes

\[
\frac{p(\theta^*)\, q(\theta \mid \theta^*)}{p(\theta)\, q(\theta^* \mid \theta)}
= \frac{p(\theta^*)\, \phi(\theta - \theta^*)\, \Phi(\theta)}{p(\theta)\, \phi(\theta^* - \theta)\, \Phi(\theta^*)}
= \frac{p(\theta^*)\, \Phi(\theta)}{p(\theta)\, \Phi(\theta^*)}.
\]

3.4 Note on the Superposition Property

Blundell et al. [6] give little detail regarding their inference procedure for the PPIRM. Yet they note that conjugacy in the model can be maintained due to the superposition property of Poisson processes. It may be tempting to employ this idea when deriving the likelihood, p(N|λ, π), by combining the transactions from all vertices in one cluster towards any vertex in a second cluster. In this way, the rates of each Poisson process would be additively combined via the superposition property. Furthermore, the posterior over partitions, as shown in Expression 3.10, is independent of the transaction counts $x_{uv}$ between each vertex pair; rather, it depends on the total counts $m_{kl}$ between each cluster pair. While this may seem like a compelling reason to combine the rates via the superposition property, we show that doing so would be misguided. In particular, using the superposition property to derive the likelihood forfeits relevant information about the individual, vertex-specific interaction data.

Recall that $m_{kl}$ denotes the number of transactions from all vertices in cluster k towards any vertex in cluster l by time T. Suppose that cluster k contains only the vertices u and u′, while cluster l contains only vertex v, and k ≠ l. Recall that $x_{uv}$ and $x_{u'v}$ denote the individual transaction counts between ordered pairs of vertices, and that $m_{kl} = x_{uv} + x_{u'v}$. Now, consider the event $\{m_{kl} = 2\}$, which occurs when either $\{x_{uv} = 2, x_{u'v} = 0\}$, $\{x_{uv} = 1, x_{u'v} = 1\}$, or $\{x_{uv} = 0, x_{u'v} = 2\}$. The probability of each event, disregarding the temporal component, is given by the Poisson distribution and written as

\[
P(X_{uv} = 2, X_{u'v} = 0) = P(X_{uv} = 0, X_{u'v} = 2)
= \frac{(\lambda_{kl} T)^2 e^{-\lambda_{kl} T}}{2!} \cdot \frac{(\lambda_{kl} T)^0 e^{-\lambda_{kl} T}}{0!}
= \frac{(\lambda_{kl} T)^2 e^{-2\lambda_{kl} T}}{2}; \tag{3.12}
\]
\[
P(X_{uv} = 1, X_{u'v} = 1)
= \frac{(\lambda_{kl} T)^1 e^{-\lambda_{kl} T}}{1!} \cdot \frac{(\lambda_{kl} T)^1 e^{-\lambda_{kl} T}}{1!}
= (\lambda_{kl} T)^2 e^{-2\lambda_{kl} T}. \tag{3.13}
\]
Compare this to the probability of the event $\{m_{kl} = 2\}$, which we can derive using the superposition property (here $n_k n_l = 2$), to give

\[
P(M_{kl} = 2) = \frac{(\lambda_{kl} n_k n_l T)^2\, e^{-\lambda_{kl} n_k n_l T}}{2!} = 2 (\lambda_{kl} T)^2 e^{-2\lambda_{kl} T}. \tag{3.14}
\]

Comparing Equations 3.12 and 3.13, it is clear that different vertex-specific counts occur with different probabilities. Furthermore, neither scenario occurs with the same probability as the combined Poisson count in Equation 3.14. Thus, conditioning solely on the event $\{m_{kl} = 2\}$ discards information contained in the full transaction dataset with vertex-specific counts. This simple example shows that using only the sum of transaction counts across clusters, $m_{kl}$, will in some cases forfeit information contained in the full data. Therefore, we refrain from using the superposition of Poisson processes in deriving the likelihood and posterior, despite the suggestion of Blundell et al. [6].

Chapter 4

Experiments

This chapter assesses the performance of the PPIRM and showcases its potential with real data. Section 4.1 uses synthetic data to first establish the computational correctness of the inference algorithm presented in Chapter 3. Next, the PPIRM's performance is evaluated on several experiments designed to assess clustering performance and prediction. Section 4.2 uses real data from militarized conflicts between nations to model the correlates of war. Inferred partitions reveal structured relationships between groups of countries, which are compared against historical records.

4.1 Synthetic Data

To establish the validity of our inference algorithm, and to assess the performance of the PPIRM, synthetic data can be generated to run several experiments. Synthetic data is simulated from the generative process described in Section 3.2. For a given seed, number of entities, and parameter values for α, δ and β, data can be generated up to a stopping time T.
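A sketch of this simulation (our own illustrative code, not the thesis's implementation): draw a partition from the CRP, rates from the Gamma prior, then for each ordered pair of vertices draw a Poisson number of transactions in [0, T) with uniformly distributed timestamps. The Poisson sampler uses Knuth's multiplication method, which is adequate only for small rates.

```python
import math
import random

def sample_poisson(rate, rng):
    """Poisson(rate) variate via Knuth's multiplication method (small rates only)."""
    L = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_ppirm(n, alpha, delta, beta, T, seed=0):
    """Simulate transactions from the PPIRM generative process up to time T."""
    rng = random.Random(seed)
    # CRP(alpha) partition, sampled sequentially
    pi, counts = [], []
    for _ in range(n):
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        pi.append(k)
    c = len(counts)
    # Gamma(shape=delta, rate=beta) rates; gammavariate takes shape and scale
    lam = [[rng.gammavariate(delta, 1.0 / beta) for _ in range(c)] for _ in range(c)]
    transactions = []
    for u in range(n):
        for v in range(n):  # self-loops are permitted
            m = sample_poisson(lam[pi[u]][pi[v]] * T, rng)
            # Given the count, event times are i.i.d. Uniform(0, T)
            transactions += [(u, v, rng.uniform(0.0, T)) for _ in range(m)]
    return pi, lam, sorted(transactions, key=lambda t: t[2])

pi, lam, tx = simulate_ppirm(5, 1.0, 0.3, 1.0, 0.5)
```

Re-running with a larger T under the same seed structure generates further transactions from the same specified model, matching the stopping-time snapshots described next.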
By increasing T, further transactions can be generated from the specified model.

Figure 4.1 displays one instance of synthetic data, at various stopping times, represented as the multidigraph described in Section 3.2. Each subfigure is a snapshot of the generated transactions produced by the continuous-time process, up to the given stopping time. For this particular instance, the parameters were set to α = 1, δ = 0.3 and β = 0.01. This setting enforces a relatively large variance on the Poisson process rate parameters, λ. Consequently, this allows for better contrast between the rates of transactions. By T = 0.05, in Figure 4.1b, we already observe a relatively large number of transactions between the orange and green clusters, and a relatively low number of transactions from the green to blue cluster. This pattern is further solidified as the stopping time is increased in Figures 4.1c and 4.1d. Note that we omit the times associated with each edge for the sake of clarity; instead, the edges are given different shades of colours to signify that each transaction occurs at a unique time point.

Subsection 4.1.1 discusses an approach that uses synthetic data to establish the computational correctness of our MCMC sampler for partitions. Subsection 4.1.2 examines two diagnostic techniques for assessing convergence of the Markov chains used for approximate inference, and compares the effects of using various initializations. Finally, Subsections 4.1.3 and 4.1.4 present several experiments for assessing the clustering and predictive performance of the PPIRM using synthetic data.

4.1.1 Validation of Sampler

As we have designed our own inference scheme for the PPIRM, we must implement these procedures from scratch. In particular, we rely on the MCMC algorithms discussed in Section 3.3.4 to perform approximate inference on this Bayesian model. Consequently, a critical precursory step to further experimentation is to verify the computational correctness of the implemented sampler.
We are particularly interested in verifying the correctness of the sampler for partitions.

Two popular approaches for validating Bayesian software are those of Cook et al. [7] and Geweke [12]. For non-scalar parameters, like partitions, the approach by Cook et al. [7] is not directly applicable, because their validation methods are based on the posterior quantiles of scalar parameters.

Geweke [12] presents an alternative strategy for testing Bayesian software, which relies on a two-stage process. First, data is drawn from the so-called marginal-conditional simulator, which is presumed to be error-free. Independent samples from the full joint distribution p(π, λ, N) are created by sampling each component, marginally, from the generative process, as described in Section 3.2. Second, the successive-conditional simulator creates alternative samples from p(π, λ, N). This is achieved by generating samples from p(N|λ, π) for each MCMC sample of (π, λ). Finally, a function f, such that f(π, λ, N) ∈ ℝ, is applied to each element of these two sequences from p(π, λ, N), and a Student's t-test or Kolmogorov-Smirnov test is used to evaluate the results.

[Figure 4.1: Multidigraph representation of randomly generated data from 5 entities, at various stopping times T: (a) T = 0.01, 3 transactions; (b) T = 0.05, 38 transactions; (c) T = 0.1, 69 transactions; (d) T = 0.5, 376 transactions. Vertex colours represent the three different clusters. Edges are coloured to signify that each transaction occurs at a unique time point.]

In order to use either the Cook et al. [7] or Geweke [12] approach, an appropriate function must be chosen that maps samples containing π to a real number. As an example, one could use the entropy of the cluster sizes as a function to summarize the partitions. However, such a function is not injective, and thus discards some information in the samples.
Additionally, the Geweke [12] approach requires sampling λ at each step, in order to sample N, which we aimed to avoid in the first place by marginalizing out λ. For these reasons, we instead employ a simpler approach to validate our sampler for partitions.

Since the sample space of partitions is discrete, we can enumerate all possible partitions when the number of entities is reasonably small. Thus, we can calculate the posterior probabilities, up to a normalizing constant, directly from Expression 3.10 for all possible partitions. Then, we normalize these values to find the true posterior probabilities. Using the true probabilities as targets, we can calculate confidence intervals based on the MCMC samples by considering the proportion of iterations in which each partition was sampled. Standard errors are calculated using a batch means method, where the batch size is taken to be the square root of the number of iterations.

Figure 4.2 displays these 95% confidence intervals on three different datasets of varying numbers of entities. In order to isolate any issues with the sampler for partitions, all other parameters are given and only the partitions were sampled. Note that the numbering of the partitions along the horizontal axis is constructed in an arbitrary order. In Figures 4.2a and 4.2b, we see that all of the true posterior probabilities were captured by the confidence intervals. In Figure 4.2c, we observe that all but one of the 52 true posterior probabilities were captured by the confidence intervals.
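The enumeration step can be sketched as follows (our own illustrative code): generate every partition of a small vertex set as a canonical label vector, so the count matches the Bell numbers of Table 3.1; exact posterior probabilities then follow by evaluating the kernel on each partition and normalizing over the list.

```python
def all_partitions(n):
    """Enumerate every partition of {0, ..., n-1} as a vector of cluster labels
    in canonical form (first occurrences labelled 0, 1, 2, ...)."""
    parts = [[]]
    for _ in range(n):
        new_parts = []
        for p in parts:
            c = max(p) + 1 if p else 0
            for k in range(c + 1):  # join any existing cluster or open a new one
                new_parts.append(p + [k])
        parts = new_parts
    return parts

# The counts agree with the Bell numbers B_1, ..., B_5 from Table 3.1
bell = [len(all_partitions(n)) for n in range(1, 6)]
```

For 5 entities this yields all 52 partitions, which is the exhaustive list underlying the confidence intervals in Figure 4.2c.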
We conclude that our sampler for partitions appears to be working correctly.

[Figure 4.2: 95% confidence intervals for the posterior probabilities of each possible partition: (a) 3 entities, 5 partitions; (b) 4 entities, 15 partitions; (c) 5 entities, 52 partitions. Estimates are based on 10,000 MCMC samples with Monte Carlo standard errors calculated by a batch means method. Each datapoint shows the true posterior probability, and points in red are those not captured by the confidence interval.]

4.1.2 Convergence Diagnostics

After establishing that the MCMC sampler is targeting the correct posterior distribution, a natural next step is to consider the rate at which the Markov chain converges to the targeted posterior. Furthermore, it is common practice to use diagnostic tools to assess MCMC sampler convergence on full-scale problems. Thus, we examine two approaches for assessing the convergence of our samplers: an original approach that scores the stationary distribution of the Markov chain over various iterations, and a second approach that looks at the trace plots of samples generated from a larger synthetic dataset.

Utilizing the fact that we can calculate the true posterior probabilities, for all possible partitions, when the number of entities is reasonably small, we develop the following original experiment. We measure the distance between the vector of true posterior probabilities and the estimated posterior probabilities from the MCMC samples of partitions.
At any given iteration, n, of the MCMC sampler, we can calculate an empirical transition matrix, $M^{(n)}$. The entry $M^{(n)}_{kl}$ is the proportion of iterations, up to n, in which partition l was sampled, given that partition k was the previous state. From this transition matrix, a number of methods can be used to solve for the steady-state distribution. We use the eigendecomposition to solve for the eigenvector corresponding to an eigenvalue of one. We then measure the ℓ1 norm (or Manhattan distance) between the steady-state vector and the vector of true posterior probabilities. Formally, the ℓ1 norm between two vectors x and y, both of length m, is calculated as $\sum_{i=1}^{m} |x_i - y_i|$. We choose the ℓ1 norm as a sensible choice; however, we recognize that other metrics, like the ℓ2 norm, could be equally valid. Finally, we examine the sequence of norm values over various numbers of MCMC iterations. The rate at which this sequence approaches zero is indicative of the rate at which the Markov chain targets the correct posterior distribution.

Figure 4.3 displays three instances of the aforementioned experiment on various numbers of entities. In Figures 4.3a and 4.3b, we observe that for 3 and 4 entities, the sequences tend to flatten out by about 2,500 iterations. Pay special attention to the scale in Figure 4.3c: while for 5 entities the sequence continues to decrease, the norm values are already very close to zero. Thus, we conclude that in each case the estimated posterior probabilities quickly approach the true values; however, for larger numbers of entities, where there are considerably more partitions, longer chains will likely estimate all probabilities more accurately.

As a further diagnostic check, trace plots permit a visual assessment of MCMC sampler convergence. A trace plot displays the sampled parameter values over all iterations.
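The steady-state computation behind the ℓ1-norm diagnostic above can be sketched as follows (our own illustrative code, using numpy): build the empirical transition matrix from the chain of sampled states, take the eigenvector of its transpose corresponding to eigenvalue one, normalize it to a probability vector, and measure the ℓ1 distance to the true probabilities. The sketch assumes every state is visited at least once.

```python
import numpy as np

def steady_state_l1(chain, n_states, true_probs):
    """Empirical transition matrix of a chain of state indices, its steady-state
    distribution (left eigenvector of M, i.e. eigenvector of M^T for eigenvalue 1),
    and the l1 distance to the true posterior probabilities."""
    M = np.zeros((n_states, n_states))
    for a, b in zip(chain[:-1], chain[1:]):
        M[a, b] += 1.0
    M /= M.sum(axis=1, keepdims=True)  # row-normalize transition counts
    vals, vecs = np.linalg.eig(M.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    v = np.abs(v) / np.abs(v).sum()  # fix sign and normalize to a probability vector
    return float(np.abs(v - np.asarray(true_probs)).sum())

# Toy chain over 2 states that spends roughly 3/4 of its time in state 0
chain = [0, 0, 1, 0] * 500
dist = steady_state_l1(chain, 2, [0.75, 0.25])
```

As the chain grows, `dist` shrinking towards zero indicates the sampler is concentrating on the correct stationary distribution, which is exactly what the sequences in Figure 4.3 track.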
One method for assessing whether the sampler has converged to the targeted posterior distribution is to compare the trace plots from different chains, started from different initializations. Ideally, the chains should reach the same (and correct) posterior distribution, despite different initializations. Furthermore, with sufficient data and identifiability of the model, we might expect this distribution to have a mode close to the true parameter value.

[Figure 4.3: ℓ1 norm, measured at various iterations, between (i) the vector of true posterior probabilities of each possible partition, and (ii) the steady-state distribution of the empirical transition matrix of MCMC samples: (a) 3 entities; (b) 4 entities; (c) 5 entities.]

We generate synthetic data from 30 entities and run the MCMC samplers for all parameters in a deterministic order.
Figure 4.4 shows the trace plots for the sampled partitions, having started from three different initializations: (a) all entities placed in different clusters, so that there are 30 clusters; (b) all entities placed in the same cluster, so that there is only one cluster; (c) a random draw from the CRP prior. We number the partitions arbitrarily, based simply on the order in which they were sampled.

[Figure 4.4: Trace plots of MCMC samples of partitions, given different initializations: (a) all different clusters; (b) all same cluster; (c) random draw from prior. Data is comprised of 336 randomly generated transactions from 30 entities. The red lines depict the true partition. Numbering along the vertical axis is arbitrary.]

In each case, the chains appear to converge to a mode around the true partition, depicted by the red line, by about 2,000 iterations. We note that the chain started from the initialization that placed all entities in the same cluster, shown in Figure 4.4b, sampled the fewest distinct partitions. Conversely, the chain started from a random partition from the prior (shown in Figure 4.4c) visited more states before reaching the mode; however, these visits were largely transient.

Using the same synthetic dataset from 30 entities, we compare the MCMC samples for the hyperparameters of the PPIRM. Figure 4.5 shows the trace plots for the sampled hyperparameters, α, δ and β, given various initial values. In each case, the two chains appear to converge towards the same mode around the true parameter value, after relatively few iterations.
Note that for this dataset, there are 7 true clusters and hence a total of 49 different Poisson process rate parameters. This means that there are fewer than 7 observed transactions per each of the 49 different rate parameters that comprise λ. Thus, despite a large number of entities and relatively few observed transactions, the samplers for the hyperparameters quickly converge to a distribution around the true parameter values.

4.1.3 Clustering Evaluation

As mentioned in Section 3.1, the first main problem that we wish to address is to learn a latent clustering of the entities. With synthetic data we can evaluate the PPIRM's performance compared to a ground-truth partition, on what is otherwise an unsupervised learning task. Specifically, we can evaluate the PPIRM's ability to infer the number of clusters, and compare the average number of clusters in the MCMC samples as the amount of data is increased. Figure 4.6 displays two such examples with the same parameter settings and the same number of clusters in the ground-truth partition, but different numbers of entities. In both cases, all model parameters were sampled, except λ, which was marginalized out. Only the generated data (up to the specific stopping time) was used during inference.

In Figure 4.6a, we observe that the PPIRM accurately recovers the true number of clusters when the number of entities is reasonably small, and after only a moderate amount of data is observed. In this simulation, 128 transactions were observed by the stopping time T = 0.05. We note that, here, 128 transactions amounts to fewer than 4 transactions per each of the 36 different rate parameters that comprise λ, given that there were 6 true clusters. Furthermore, there are a total of 10² = 100 possible ordered pairs of entities, given that there are 10 entities and self-loops are permitted.

As discussed in Subsection 3.3.1, increasing the number of entities leads to a combinatorial explosion in the number of possible partitions.
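The size of this combinatorial explosion is given by the Bell numbers, which count the partitions of a set of n elements. A short sketch using the Bell-triangle recurrence (standard material, not code from the thesis):

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_{n_max}] via the Bell triangle: each row starts
    with the last entry of the previous row, and each subsequent entry is
    the sum of its left neighbour and the entry above that neighbour."""
    bells = [1]  # B_0 = 1 (the empty set has one partition)
    row = [1]
    for _ in range(n_max):
        new_row = [row[-1]]
        for entry in row:
            new_row.append(new_row[-1] + entry)
        row = new_row
        bells.append(row[0])
    return bells
```

For example, `bell_numbers(5)` ends in 52, the number of ways to partition 5 entities; for the 30 entities used here, B₃₀ is roughly 8.5 × 10²³, which makes exact enumeration of the partition space hopeless.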
In Figure 4.6b, we triple the number of entities, but leave fixed the number of true clusters and the number of MCMC samples. We observe that even with increasing data, the PPIRM tends to slightly over-estimate the number of clusters. This phenomenon has been

Figure 4.5: Trace plots of MCMC samples for each of the hyperparameters, given different initializations ((a) initial α = 2; (b) initial α = 0.5; (c) initial δ = 1; (d) initial δ = 0.1; (e) initial β = 1; (f) initial β = 0.1). Data is comprised of 336 randomly generated transactions from 30 entities. The red lines depict the true parameter values.

Figure 4.6: Mean number of clusters in 10,000 MCMC samples as the stopping time is increased ((a) 10 entities, 6 true clusters; (b) 30 entities, 6 true clusters).
The red line depicts the number of clusters in the ground-truth partition used to generate the synthetic data.

Figure 4.7: Mean number of clusters in the MCMC samples as the stopping time is increased. Red points are an average of all 10,000 samples, while blue points drop the first 10% as burn-in. The red line depicts the number of clusters in the ground-truth partition used to generate the synthetic data.

studied for Dirichlet process mixture models by Miller and Harrison [22].

Given that our initialization for the partition sampler places each entity in its own cluster, there are as many clusters as entities when the sampler starts. Furthermore, the sampler moves at most one entity per iteration; thus, it takes several iterations before the chain can reach a partition with only 6 clusters. For these reasons, we rerun the previous experiment with 30 entities and drop the first 1,000 MCMC samples as burn-in. Figure 4.7 shows that when we exclude the first 10% of samples, the mean number of clusters is closer to the true value of 6.

To further evaluate the PPIRM's clustering performance, we assess the quality of the learned clusters based on various point estimates from the MCMC samples. We measure quality of the learned partition using the adjusted Rand index [19]. The adjusted Rand index measures the similarity between two partitions and adjusts for the expected number of chance agreements under uniform random sampling.
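The adjusted Rand index can be computed directly from the contingency table of the two labelings; a self-contained sketch, equivalent in spirit to library routines such as scikit-learn's `adjusted_rand_score` (illustrative, not the thesis's implementation):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected similarity of two flat clusterings of the same
    items: 1.0 for identical partitions, expectation ~0 for random ones."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

Note that the index is invariant to relabeling the clusters, which is essential here since the sampler's cluster labels carry no intrinsic meaning.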
Compared to a ground-truth, a randomly generated partition has an expected score of zero, while the correct partition has a score of one.

We consider four different point estimates from the MCMC samples:

(i) the MAP as observed from the MCMC samples. Note that as an estimator, the MAP is equivalent to the sequence of Bayes estimators that minimize the posterior expectation of the 0–1 loss function [10];

(ii) the partition based on an ad-hoc approach by Medvedovic et al. [21] that uses hierarchical clustering. We refer to this estimator as MEDV for short;

(iii) the partition that minimizes Binder's loss function [4]. This is also equivalent to maximizing the posterior expected Rand index with the true partition [10]. We refer to this estimator as MINBINDER;

(iv) the partition that maximizes the posterior expected adjusted Rand index with the true partition. We refer to this estimator as MAXPEAR.

Since Fritsch and Ickstadt [10] explain and examine each estimator in detail, we omit these details here and simply compare their results with the PPIRM. Figure 4.8 illustrates two examples comparing these four estimators as the amount of data is increased. In both subfigures, ten datasets were randomly generated using the same parameter settings, but with different numbers of entities. From 10,000 MCMC samples, each estimator was computed and its adjusted Rand index with the true partition was averaged across the ten replicates.

In Figure 4.8a, we see that the MAP and MEDV estimators effectively recover the true partition when the number of entities is reasonably small, and after only a moderate amount of data is observed. Recall that an adjusted Rand index score of one indicates perfect recovery of the true partition. On the other hand, the MINBINDER and MAXPEAR estimators seem to be highly variable and, on average, show no improvement with increasing amounts of observed data.
In Figure 4.8b, we see that when the number of entities is tripled the MAP and MEDV estimators still accurately recover the true partition; however, their accuracy is subject to some noise variations in the data. Furthermore, the MINBINDER and MAXPEAR estimators actually show improved accuracy with a greater number of entities, although

Figure 4.8: Adjusted Rand index between a partition estimate and the true partition as the stopping time is increased ((a) 10 entities; (b) 30 entities). Each datapoint is an average across 10 randomly generated datasets, and error bars show one standard deviation. Each estimator is computed from 10,000 MCMC samples.

the MAXPEAR estimator is still quite variable. In sum, we see that the PPIRM accurately recovers the true partition, provided that an appropriate estimator is used.

4.1.4 Prediction

The second objective mentioned in Section 3.1 is to model the time-series structure of the transaction data. This requires inferring both a partition of the entities, as well as the rate parameters between each ordered pair of clusters.
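The predictive evaluation that follows relies on Monte Carlo averages of test-set likelihoods, which underflow to zero if computed naively from very negative log-likelihoods. A standard numerically stable log-mean-exp helper can compute such averages directly in log space (an illustrative sketch, not code from the thesis):

```python
from math import exp, log, inf

def log_mean_exp(log_likelihoods):
    """Stably compute log( (1/n) * sum_i exp(log_likelihoods[i]) ) by
    factoring out the maximum before exponentiating, so that very
    negative log-likelihoods do not underflow to zero."""
    m = max(log_likelihoods)
    if m == -inf:  # every sample assigned zero likelihood
        return -inf
    n = len(log_likelihoods)
    return m + log(sum(exp(v - m) for v in log_likelihoods) / n)
```

For example, `log_mean_exp([-1000.0, -1000.0])` returns -1000.0 exactly, whereas exponentiating first would give `log(0)` and fail.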
Together, the inferred parameters can be used to model the stochastic processes that govern the transactions, and make predictions about unobserved transactions.

With synthetic data we evaluate the PPIRM's ability to model a set of held-out data, given a set of training data used for inference. From the training data, the PPIRM produces MCMC samples of the partition using the scheme described in Section 3.3.4. For each sampled partition, we sample the rates, λ, by the procedure described in Section 3.3.3. In order to quantify how well the PPIRM explains held-out data, we estimate the log posterior predictive density of test data, N∗, given training data, N. The log posterior predictive density is

\[
\log p(N^{*} \mid N) = \log \int \sum_{\pi} p(N^{*} \mid N, \lambda, \pi)\, p(\lambda, \pi \mid N)\, d\lambda
= \log \int \sum_{\pi} p(N^{*} \mid \lambda, \pi)\, p(\lambda, \pi \mid N)\, d\lambda
= \log \mathbb{E}_{\lambda, \pi \mid N}\left[ p(N^{*} \mid \lambda, \pi) \right], \tag{4.1}
\]

where the second equality holds because N∗ is conditionally independent of N given (λ, π). The likelihood on the test data, p(N∗ | λ, π), can be evaluated using Equation 3.8. We can approximate the expectation in Equation 4.1 using MCMC samples from the joint density λ, π | N. That is,

\[
\log \mathbb{E}_{\lambda, \pi \mid N}\left[ p(N^{*} \mid \lambda, \pi) \right] \approx \log\left[ \frac{1}{n} \sum_{i=1}^{n} p(N^{*} \mid \lambda^{(i)}, \pi^{(i)}) \right], \tag{4.2}
\]

where (λ(i), π(i)) are approximate samples from λ, π | N.

In our experiments, we compute the right-hand side of Expression 4.2 using the following procedure:

1. Generate synthetic data, with known parameter values, until stopping time T.

2. Transactions occurring within the last x% of time become the test set; the remaining transactions are the training set. Note that this is not necessarily equivalent to using the last x% of all observed transactions as the test set.

3. Generate n approximate samples of π | N by the MCMC methods discussed in Subsection 3.3.4.

4. Generate a sample of λ | N, π(i) for each MCMC sample, π(i), for i = 1, 2, . . . , n. Use the posterior distribution derived in Subsection 3.3.3.

5. For all n sample pairs, (π(i), λ(i)), calculate the likelihood on the test data. That is, calculate p(N∗ | λ(i), π(i)), for i = 1, 2, . . . , n.

6.
Compute $\log\left[ \frac{1}{n} \sum_{i=1}^{n} p(N^{*} \mid \lambda^{(i)}, \pi^{(i)}) \right]$.

Figure 4.9 displays two examples that used this procedure with the same parameter settings, but with different numbers of entities. In both cases, all model parameters were sampled, including λ.

In Figure 4.9a, the test set is comprised of 270 transactions between 10 entities; by stopping time T = 114, there are 310,332 transactions in the training set. We observe that after this point, the PPIRM has effectively learned the model parameters with sufficient accuracy for the model to fit the test data. Specifically, the training set is large enough for the inference algorithm to accurately recover the true parameters of the model, to a degree of accuracy dictated by this test set. Notice that, at this point, the estimated log predictive densities show a pattern of convergence to the likelihood of the test set given the true partition and true rates (as depicted by the red line). To be clear, the red line shows the value of log p(N∗ | λt, πt), where λt and πt denote the true model parameters used to generate the synthetic data. Figure 4.9b shows a similar pattern when the number of entities is tripled.

4.2 Real Data

Through learning a latent clustering of entities, the PPIRM can be used to glean insight from real data that may otherwise be too massive or too convoluted for the
Figure 4.9: Estimated log posterior predictive densities on a test set as the amount of training data (stopping time) is increased ((a) 10 entities, 270 transactions in the test set; (b) 30 entities, 2,342 transactions in the test set). Each estimate is an average based on 10,000 MCMC samples. The red line depicts the likelihood of the test set when the true partition and true rates are given.

human eye to decipher. By inferring both the cluster memberships and the rates of transactions between clusters, the PPIRM can reveal varying degrees of structured relationships between groups of interacting pairs. We display this feature of the PPIRM on a dataset comprised of a group of countries and incidents where one side threatens, displays, or uses force against another. In keeping with the experiments of Blundell et al. [6], we use version 3.02 of the Militarized Interstate Dispute (MID) dataset, compiled by the Correlates of War Project [15]. Specifically, we extract transactions from the dyadic MID dataset [14], which is comprised of disputing pairs of countries.

MIDs are defined as cases of conflict between states in which the threat or use of military force, short of war, is explicitly directed towards the government or territory of another state [15]. A conflict can be classified as a MID if it results in fewer than 1,000 deaths, and some military force is used [27]. Version 3.02 of the MID dataset spans the years 1993–2001. Each dyad in the dataset is ordered to identify which state took the first militarized action against the other. We use the start date of each dispute as the observed time of transaction. For MIDs involving multiple countries, a dyad is recorded for each ordered pair of countries involved in the conflict. Ghosn et al. [15] provide a codebook describing the full dataset, as well as a detailed, historical description of each conflict.

The dyadic MID dataset consists of 138 countries (these are the entities of the PPIRM) and 508 recorded disputes between the years 1993 and 2001.
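Fitting the model to such dyadic data requires pooling all transactions between a given ordered pair of clusters. As an illustration of that bookkeeping (the tuple layout and label mapping are hypothetical, not the thesis's data format), timestamped dyadic events can be aggregated into cluster-pair counts:

```python
from collections import Counter

def block_counts(transactions, labels):
    """Aggregate timestamped dyadic events (sender, receiver, time) into
    counts per ordered pair of clusters. Under a homogeneous Poisson
    process per cluster pair, these block counts (with the observation
    window) are the sufficient statistics for the cluster-pair rates."""
    counts = Counter()
    for sender, receiver, _time in transactions:
        counts[(labels[sender], labels[receiver])] += 1
    return counts
```

Since the Poisson rates attach to ordered cluster pairs, `(a, b)` and `(b, a)` are counted separately, mirroring the directed dyads of the MID data.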
We fit the PPIRM to this data using the MCMC methods described in Chapter 3. We ran our inference algorithm for 10,000 iterations, starting from the initial partition that places each country in its own cluster, and discarded the first 500 samples as burn-in. From the MCMC samples of partitions, we compute the point estimates discussed in Section 4.1.3. We find that different estimates produce partitions of varying coarseness; however, overall, the inferred clusterings are in keeping with those discussed by Ghosn et al. [15].

For illustrative purposes, we show two examples of the learned clusterings, and discuss how these match the historical narrative. Figure 4.10 displays the MAXPEAR partition of the 138 countries, represented as a multidigraph. Each country is a vertex of the graph, both coloured and numbered according to its respective cluster assignment. The 508 disputes are represented as the directed edges of the graph, coloured in grey. We found that the MAXPEAR partition is the coarsest of the four partition estimates, in the sense that it has the fewest clusters, and hence the largest cluster sizes. The MAXPEAR partition has 18 clusters, whereas the MAP has 58, and is the finest partition. For comparison, Figure 4.11 shows the MAP partition on the same data, with a slightly different spatial arrangement for clarity.

Examining the clustering produced by the MAXPEAR, one striking feature is that cluster 8 (grey) is surrounded by and heavily connected to the countries in cluster 1 (purple). As it turns out, cluster 8 is comprised of only Yugoslavia, while cluster 1 contains 27 of the 28 NATO countries, excluding Latvia. Indeed, the dispute narrative [15] confirms that Yugoslavia entered into many conflicts with NATO countries, spurred by its involvement in the Bosnian civil war and general hostility over Kosovo dating back to before 1993.
This hostility erupted into a series of incidents including a blockade imposed on Yugoslavia by Western countries, and NATO airstrikes resulting from the strife in Kosovo. The fact that Latvia is not part of cluster 1 is not at all surprising, as it did not officially join NATO until 2004, well after the coverage of this dataset. Furthermore, during the time period covered, Latvia had several of its own border disputes with Russia.

For the finer partition produced by the MAP, the PPIRM is able to identify several dispute threads, for which we highlight the relevant cluster memberships in Table 4.1. In particular, we note that cluster 19 identifies the countries who joined forces after the Afghan incursions into Tajikistan, circa 1994. The dispute narrative [15] lists repeated conflicts along the Tajik–Afghan border, accounting for the many edges between clusters 19 and 51 (Afghanistan). We note that the countries in cluster 19 took more of a secondary role in these conflicts, often led by Russia. Consequently, the MAP partitions Russia into its own cluster.

Cluster 43 identifies the dispute thread between two geographically contiguous groups of African countries. These disputes led to several incidents, most stemming from the Democratic Republic of Congo conflict, and eventually ending in a peace agreement between countries from both groups in 1999. Similarly, cluster 54 pinpoints the East Asian countries that experienced several disputes involving the seizure of patrol boats and fishing vessels along the Pacific coast.

Using the PPIRM to extract further insights, we recovered the Poisson process

Figure 4.10: Multidigraph representation of the MID dataset consisting of 138 countries (vertices) and 508 disputes (edges).
Vertices are both coloured and numbered according to the 18 clusters inferred from the MAXPEAR partition.

Figure 4.11: Multidigraph representation of the MID dataset consisting of 138 countries (vertices) and 508 disputes (edges). Vertices are both coloured and numbered according to the 58 clusters inferred from the MAP partition.

Table 4.1: Countries belonging to some of the clusters inferred from the MAP partition of the MID dataset.

  MAP Cluster Label   Countries Belonging to Cluster
  3                   Kuwait
  14                  Azerbaijan
  19                  Tajikistan, Kyrgyzstan, Uzbekistan
  22                  U.S.A., Turkey
  28                  Yugoslavia
  30                  Russia
  33                  Armenia
  35                  Netherlands, Belgium, Luxembourg, Germany, Hungary, Italy, Lithuania, Ukraine, Norway, Denmark
  43                  Uganda, Kenya, Rwanda, Ethiopia, Zimbabwe, Namibia, Botswana
  47                  Iraq
  51                  Afghanistan
  54                  Taiwan, South Korea, Japan, Vietnam, Philippines

rates by sampling directly from the posterior. That is, with the MAP estimates of the partition and hyperparameters, δ and β, we sampled estimates of the rates, λ, using Equation 3.11. We produced 1,000 samples of the 58 × 58 matrix λ, and took an entry-wise average over all samples. We note that, alternatively, one could compute the MAP of each entry of λ directly, as opposed to sampling. Table 4.2 identifies the clusters corresponding to the six largest average rates, as a means for discovering which groups of countries were most heavily involved in MIDs during the timeframe.

Again, these results closely adhere to the dispute narratives presented by Ghosn et al. [15], and correspond to the groups of countries most predominantly involved in MIDs. Furthermore, the PPIRM discovers patterns of similarly behaving countries. In particular, U.S.A.
and Turkey are both heavily involved in conflicts with Iraq and Yugoslavia, separately. Consequently, the MAP partition clusters U.S.A. and Turkey together, and the recovered rates reveal their historical propensity for conflict with Iraq and Yugoslavia. Finally, we note that the rates affirm the aforementioned narrative of a series of disputes involving Russia, Tajikistan and its neighbouring countries, towards Afghanistan.

Table 4.2: The clusters and countries corresponding to the 6 largest average Poisson process rates recovered from 1,000 posterior samples. Samples were based on MAP estimates of the model parameters. Clusters for the rates of transactions are shown in the order sender → recipient.

  Rank   Clusters   Countries Belonging to Clusters
  1      22 → 47    U.S.A., Turkey → Iraq
  2      3 → 47     Kuwait → Iraq
  3      33 → 14    Armenia → Azerbaijan
  4      30 → 51    Russia → Afghanistan
  5      22 → 28    U.S.A., Turkey → Yugoslavia
  6      19 → 51    Tajikistan, Kyrgyzstan, Uzbekistan → Afghanistan

Chapter 5

Conclusion

This thesis takes the PPIRM, originally proposed by Blundell et al. [6], and derives the necessary posterior distributions to implement a fully Bayesian inference scheme. The implementation of the MCMC sampler and choices of hyperpriors are discussed in full detail. Furthermore, a simple example is presented to show that the superposition of Poisson processes should not be used in deriving the posterior, despite the suggestion of Blundell et al. [6]. Experimental results are presented to validate the correctness of the implementation, which is a non-standard task when sampling partitions of a set. Synthetic data are used to assess the clustering and predictive performance of the PPIRM. Real data from a history of militarized conflicts between nations are used to showcase the PPIRM's ability to reveal varying degrees of structured relationships.
Finally, Section 5.1 discusses some prospective directions for future work, building upon the PPIRM.

5.1 Future Work

There are several future directions that could be explored to improve the performance and extend the applicability of the PPIRM. As far as extending applicability, one could imagine any of the following additions to the transactional data. There can be:

(i) entity-specific attributes;

(ii) different types of transactions;

(iii) additional side-channel information that can be used to describe patterns, like periodicity, in the event times.

In a Bayesian context, both (i) and (ii) may require an extra model layer that draws attributes or event types from a given distribution. As an example of (iii), information like the time of day could potentially reveal new patterns in the transactional data. This could be modelled via an inhomogeneous Poisson process, whose rate is dependent on both time and an intensity function over the side-channel information.

More direct alterations to the PPIRM include modifying the Gamma distribution in Expression 3.2 to be zero-inflated. Subsequently, this would permit some of the Poisson process rates to be zero, thereby preventing certain clusters from interacting with each other. As mentioned in Section 3.3.4, implementing a split-merge proposal distribution could help to improve the computational efficiency of the inference algorithm towards exploring multiple modes. An example of a split-merge algorithm can be found in [8].

Different choices for the prior over partitions could be used with the PPIRM. In theory, one may wish to cluster together entities that interact closely in time. A distance dependent Chinese Restaurant Process [5] allows the random cluster assignments of the entities to be based on the distances between them. These distances can be based on time, space, or other characteristics, such as entity-specific attributes.
To incorporate time dependency, distances between vectors of transaction times can be calculated between each ordered pair of entities. However, in doing so, the data is effectively used in shaping the prior.

Alternatively, the Indian Buffet Process [13] could be used to model vectors of latent features, as opposed to simply latent classes. This would relate the work of Miller et al. [23] and Palla et al. [24] to the PPIRM, thereby allowing each entity to potentially belong to any number of clusters. These latent features could be combined through a weight matrix, and transformed to values in R>0 that can be used as Poisson process rates. Thus, the rate of transactions between any ordered pair of entities would be determined by the combined effect of all pairwise feature interactions.

Bibliography

[1] Class LinkedHashSet. https://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashSet.html. Accessed: 2016-07-21.

[2] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9(Sep):1981–2014, 2008.

[3] D. J. Aldous. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII, 1983, pages 1–198. Springer, 1985.

[4] D. A. Binder. Bayesian cluster analysis. Biometrika, 65(1):31–38, 1978.

[5] D. M. Blei and P. I. Frazier. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12(Aug):2461–2488, 2011.

[6] C. Blundell, J. Beck, and K. A. Heller. Modelling reciprocating relationships with Hawkes processes. In Advances in Neural Information Processing Systems, pages 2600–2608, 2012.

[7] S. R. Cook, A. Gelman, and D. B. Rubin. Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics, 2012.

[8] D. B. Dahl.
Sequentially-allocated merge-split sampler for conjugate and nonconjugate Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 11, 2005.

[9] C. DuBois and P. Smyth. Modeling relational events via latent classes. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 803–812. ACM, 2010.

[10] A. Fritsch and K. Ickstadt. Improved criteria for clustering based on the posterior similarity matrix. Bayesian Analysis, 4(2):367–391, 2009.

[11] S. J. Gershman and D. M. Blei. A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56(1):1–12, 2012.

[12] J. Geweke. Getting it right: Joint distribution tests of posterior simulators. Journal of the American Statistical Association, 99(467):799–804, 2004.

[13] Z. Ghahramani and T. L. Griffiths. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475–482, 2005.

[14] F. Ghosn and S. Bennett. Codebook for the dyadic militarized interstate incident data, version 3.10. Online: http://correlatesofwar.org, 2003.

[15] F. Ghosn, G. Palmer, and S. A. Bremer. The MID3 data set, 1993–2001: Procedures, coding rules, and description. Conflict Management and Peace Science, 21(2):133–154, 2004.

[16] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.

[17] T. Hofmann, J. Puzicha, and M. I. Jordan. Learning from dyadic data. Advances in Neural Information Processing Systems, pages 466–472, 1999.

[18] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[19] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.

[20] C.
Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In AAAI, volume 3, page 5, 2006.

[21] M. Medvedovic, K. Y. Yeung, and R. E. Bumgarner. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics, 20(8):1222–1232, 2004.

[22] J. W. Miller and M. T. Harrison. A simple example of Dirichlet process mixture inconsistency for the number of components. In Advances in Neural Information Processing Systems, pages 199–206, 2013.

[23] K. Miller, M. I. Jordan, and T. L. Griffiths. Nonparametric latent feature models for link prediction. In Advances in Neural Information Processing Systems, pages 1276–1284, 2009.

[24] K. Palla, D. A. Knowles, and Z. Ghahramani. Relational learning and network modelling using infinite latent attribute models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):462–474, 2015.

[25] J. Pitman. Some probabilistic aspects of set partitions. The American Mathematical Monthly, 104(3):201–209, 1997.

[26] J. Pitman. Combinatorial stochastic processes. Technical Report 621, Lecture notes for St. Flour Summer School, 2002.

[27] M. R. Sarkees. The COW typology of war: Defining and categorizing wars (version 4 of the data). Note with version 4 of the Correlates of War Data, 2010.

[28] A. Simma and M. I. Jordan. Modeling events with cascades of Poisson processes. Uncertainty in Artificial Intelligence (UAI), 2010.

[29] Y. W. Teh. Dirichlet process. In Encyclopedia of Machine Learning, pages 280–287. Springer, 2011.

[30] S. Wasserman and C. Anderson. Stochastic a posteriori blockmodels: Construction and assessment. Social Networks, 9(1):1–36, 1987.

[31] L. Xin, M. Zhu, and H. Chipman. A continuous-time stochastic block model for basketball networks. arXiv preprint arXiv:1507.01816, 2015.

[32] Z.
Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Learning infinite hidden relational models. Uncertainty in Artificial Intelligence (UAI), 2006.