UBC Theses and Dissertations

Bayesian analysis of continuous time Markov chains with applications to phylogenetics Zhao, Tingting 2018



Bayesian analysis of continuous time Markov chains with applications to phylogenetics

by

Tingting Zhao

M.Sc., The University of British Columbia, 2013
B.Sc., Zhejiang University, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate and Postdoctoral Studies (Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

December 2018

© Tingting Zhao 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Bayesian analysis of continuous time Markov chains with applications to phylogenetics

submitted by Tingting Zhao in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Statistics.

Examining Committee:

Alexandre Bouchard-Côté, Department of Statistics. Supervisor
Paul Gustafson, Department of Statistics. Supervisory Committee Member
Sarah P. Otto, Department of Zoology. Supervisory Committee Member
University Examiner
University Examiner

Abstract

Bayesian analysis of continuous time, discrete state space time series is an important and challenging problem, where incomplete observation and large parameter sets call for user-defined priors based on known properties of the process. Generalized Linear Models (GLM) have a largely unexplored potential to construct such prior distributions. We show that an important challenge with Bayesian generalized linear modelling of Continuous Time Markov Chains (CTMCs) is that classical Markov Chain Monte Carlo (MCMC) techniques are too ineffective to be practical in that setup. We propose two computational methods to address this issue. The first algorithm uses an auxiliary variable construction combined with an Adaptive Hamiltonian Monte Carlo (AHMC) algorithm. The second algorithm combines Hamiltonian Monte Carlo (HMC) and the Local Bouncy Particle Sampler (LBPS) to take advantage of the sparsity in certain high-dimensional rate matrices.
We propose a characterization of a class of sparse factor graphs where LBPS can be efficient. We also provide a framework for assessing the computational complexity of the algorithm. An important aspect of practical implementations of the Bouncy Particle Sampler (BPS) is the simulation of event times. Default implementations use conservative thinning bounds; such bounds can slow down the algorithm and limit its computational performance. In the second algorithm, we develop exact analytical solutions for the random event times in the context of CTMCs. Both sampling algorithms, together with our model, make it efficient, both in terms of computation and of the analyst's time, to construct stochastic processes informed by prior knowledge, such as known properties of the states of the process. We demonstrate the flexibility and scalability of our framework using both synthetic and real phylogenetic protein data.

Lay Summary

Phylogenetics is the study of the evolutionary relationships among different species or organisms. It helps us understand both the patterns and the evolutionary processes of the natural world. Our first contribution is a novel evolutionary modelling framework for phylogenetics in which known biological properties of the process can be incorporated as prior information. This helps to reconstruct evolutionary history, especially in challenging situations with incomplete observations and large parameter sets, which is often the case in phylogenetics. Our second contribution is two novel sampling methods for performing inference under our modelling framework. Both algorithms achieve markedly better computational efficiency than classical algorithms. We demonstrate the flexibility and scalability of our framework using both synthetic and real phylogenetic protein data.

Preface

This thesis is original work by the author, Tingting Zhao, under the supervision of Dr. Alexandre Bouchard-Côté.
A version of Chapter 2, Sections 3.1-3.2 of Chapter 3, Chapter 4, and Section 6.1 of Chapter 6 of this thesis is published in Zhao et al. (2016). It is joint work with Ziyu Wang, Alexander Cumberworth, Joerg Gsponer, Nando de Freitas, and Alexandre Bouchard-Côté. I designed the model, implemented the algorithm, and conducted the computational studies and theoretical derivations. Dr. Alexandre Bouchard-Côté and I contributed equally to the writing of the paper. A version of Section 3.3 of Chapter 3, Chapter 5, and Section 6.2 of Chapter 6 of this thesis has been submitted to the Journal of Machine Learning Research and is currently under review. I designed the computational algorithm, implemented it, and conducted the computational studies and theoretical derivations. I wrote the manuscript of the paper and Dr. Alexandre Bouchard-Côté is the sole co-author. Dr. Alexandre Bouchard-Côté took on the supervisory role and offered great help and valuable suggestions in revising both papers. I am the first author of both papers.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction

2 Background
  2.1 CTMCs
    2.1.1 Forward sampling of a full path from CTMCs
    2.1.2 Density of fully observed paths
    2.1.3 Density of partially observed sample paths
  2.2 Bayesian Phylogenetics Overview
    2.2.1 Data Structures
    2.2.2 Substitution models in the CTMC framework
    2.2.3 Bayesian phylogenetic inference
  2.3 Existing MCMC algorithms
  2.4 Limitations of previous approaches

3 Bayesian rate matrix GLMs
  3.1 Bayesian rate matrix GLMs
    3.1.1 Examples
    3.1.2 Prior distributions
  3.2 Bayesian reversible rate matrix GLMs
    3.2.1 Examples
    3.2.2 Phylogenetic GLM-CTMC model
  3.3 Bayesian GLM chain GTR model
    3.3.1 GTR rate matrix representation using Bayesian GLM feature functions
    3.3.2 Proposed Bayesian GLM chain GTR model
    3.3.3 Examples of order functions in Bayesian GLM chain GTR model

4 Inference via HMC
  4.1 Overview of posterior simulation
  4.2 Hamiltonian Monte Carlo
    4.2.1 Probability density functions and canonical distributions
    4.2.2 The HMC algorithm
  4.3 Adaptive Hamiltonian Monte Carlo
    4.3.1 Bayesian optimization for HMC
    4.3.2 Computational cost of AHMC
  4.4 HMC for GLM-CTMC models
    4.4.1 Running time of previous CTMC gradient computation methods
    4.4.2 HMC with auxiliary variable construction for GLM-CTMC models
    4.4.3 Running time for the gradient computation via auxiliary variables
    4.4.4 HMC for GLM-CTMC reversible models
    4.4.5 Correctness of HMC with auxiliary variable sampling scheme for GLM-CTMC models
  4.5 Extension to phylogenetic trees

5 Inference via LBPS-HMC
  5.1 Background on BPS and LBPS
    5.1.1 Basics of BPS
    5.1.2 Basics of LBPS
  5.2 Characterization of factor graphs where LBPS can be efficient
    5.2.1 Strong sparsity of factor graphs
    5.2.2 Running time analysis of LBPS for factor graphs
  5.3 Achieving strong sparsity through LBPS-HMC for GLM-CTMC model
    5.3.1 LBPS-HMC sampling scheme for GLM-CTMC model
    5.3.2 Gradient computation for univariate weights in the HMC step under the combined sampling scheme
    5.3.3 Factor graph representation under LBPS-HMC
  5.4 Analytical bounce time solution to related factors
    5.4.1 Local sojourn time factors
    5.4.2 Local transition count factors
  5.5 Running time analysis of LBPS-HMC

6 Experiments
  6.1 Numerical studies for AHMC
    6.1.1 Synthetic data experiments for AHMC
    6.1.2 Evaluation of computational efficiency
    6.1.3 Kinase domain data for AHMC
    6.1.4 Discussion
  6.2 Numerical studies for LBPS-HMC
    6.2.1 Correctness check of LBPS and HMC using exact invariance test
    6.2.2 Computational efficiency comparison between LBPS-HMC and HMC
    6.2.3 Computational efficiency comparison using different trajectory lengths of LBPS
    6.2.4 Real Data Analysis
    6.2.5 Discussion

7 Conclusions and future work
  7.1 Conclusions
  7.2 Future work

Bibliography

Appendices

Appendix A List of acronyms
Appendix B Likelihood for trees under CTMCs
Appendix C Connection Between Bayesian GLMs and GTR
  C.1 Configuration of GTR using Bayesian GLMs
  C.2 Proof of equivalence under GTR configuration
  C.3 Connection between Bayesian GLM GTR model and Bayesian GLM chain GTR model
Appendix D Euclidean distance table and ordering algorithm
  D.1 Euclidean distance between pairs of Amino Acids
  D.2 Nearest neighbour pairwise amino acid ordering algorithm
Appendix E Cached-Uniformization method
Appendix F Supplementary figures for AHMC
Appendix G Supplementary tables for AHMC
Appendix H Supplementary results for LBPS-HMC
  H.1 Exact invariance test results
  H.2 Posterior samples comparison between LBPS and HMC
  H.3 Traceplot comparison for real data analysis
  H.4 Correctness check via ARD between two samplers

List of Tables

3.1 Size and polarity/charge properties of amino acids.
6.1 Summary of the ESS per 10^6 seconds for the stationary distribution and exchangeable coefficients with the auxiliary variable and without, denoted "Aux" and "No Aux" respectively, with estimates inferred under a PolaritySizeGtr model with fixed ε = 5×10^-7 and L = 30, while fixing the topology and branch lengths to the true ones, on a simulated amino acid dataset of 10 leaves and 5000 sites generated under a PolaritySizeGtr model.
6.2 The minimum ESS per second across different exchangeable parameters for different dimensions of rate matrices.
6.3 The 5% quantile of ESS per second across different exchangeable parameters for different dimensions of rate matrices.
6.4 The median ESS per second across different exchangeable parameters for different dimensions of rate matrices.
6.5 The standard deviation of the ESS per second on a log10 scale across different exchangeable parameters for different dimensions of rate matrices.
6.6 Actual walltime of LBPS-HMC for 40,000 iterations with trajectory lengths of 0.025, 0.1, 0.15, 0.2, 0.25.
6.7 The maximum ESS per second of LBPS-HMC for 40,000 iterations with trajectory lengths of 0.025, 0.1, 0.15, 0.2, 0.25.
6.8 Summary of ESS across exchangeable parameters between HMC and LBPS with trajectory length 0.2. The last row shows the ratio of ESS for LBPS-HMC over HMC.
G.1 Geweke diagnostic test statistic values under the PolaritySizeGtr model for datasets where each sequence has 100 sites. Under the 5% significance level, an absolute value larger than 2 supports a lack of convergence for that parameter.
G.2 Geweke diagnostic test statistic values under the PolaritySizeGtr model for datasets where each sequence has 1500 sites. Under the 5% significance level, an absolute value larger than 2 supports a lack of convergence for that parameter.
G.3 ESS per 10^4 seconds. NMH x stands for a Normal-proposal Metropolis-Hastings algorithm with proposal bandwidth x.
G.4 Summary of the ESS per 10^6 seconds for the stationary distribution and exchangeable coefficients with the auxiliary variable and without, denoted "Aux" and "No Aux" respectively, with estimates inferred under a PolaritySizeGtr model with fixed ε = 5×10^-7 and L = 30, while fixing the topology and branch lengths to the true ones, on a simulated amino acid dataset of 10 leaves and 5000 sites generated under a PolaritySizeGtr model.
H.1 EIT for w_u using LBPS-HMC, which uses HMC for the stationary distribution weights and LBPS for the weights w_b of the exchangeable parameters.
H.2 EIT for exchangeable parameters using LBPS-HMC, which uses HMC for the stationary distribution and LBPS for the weights of the exchangeable parameters.
H.3 EIT for stationary distribution elements using HMC for both the stationary distribution and the weights of the exchangeable parameters.
H.4 EIT for exchangeable parameters using HMC for both the stationary distribution and the weights of the exchangeable parameters.
H.5 Summary of ARD across exchangeable parameters between HMC and LBPS.
H.6 Summary of ARD across exchangeable parameters.
H.7 The quantiles of ESS of exchangeable parameters that have ARD bigger than the 95% quantile of ESS across all exchangeable parameters, obtained using the LBPS-HMC and HMC algorithms respectively.
H.8 The ESS of exchangeable parameters that have ARD bigger than the 95% quantile of all values of ARD across all exchangeable parameters, using the LBPS-HMC and HMC algorithms respectively.

List of Figures

2.1 An example of a pair of protein sequences with gaps.
3.1 A simple explanation of the general idea of the NNPAAO algorithm.
5.1 Contours of the bivariate standard Gaussian target distribution from which we would like to sample. This figure comes from Bouchard-Côté et al. (2018).
5.2 For factor γ2, its neighbour variables are N2 = {w1, w2}. The extended neighbour variables for factor γ2 are N̄2 = {w1, w2, w3}. The extended neighbour factors for factor γ2 are S2 = {γ1, γ2, γ3, γ4}.
5.3 For factor γ2, its neighbour variables are N2 = {w1, w2}. The extended neighbour variables for factor γ2 are N̄2 = {w1, w2, w3}. The extended neighbour factors for factor γ2 are S2 = {γ1, γ2, γ3, γ4}.
5.4 Factor graph under the Bayesian GLM representation of the GTR model for DNA evolution when using only LBPS to update w, where H_{x,x'} and C_{x,x'} represent the sojourn time factor and the transition count factor for states (x, x'), with x ≠ x' ∈ X = {A, C, G, T}. In the graph, for simplicity of presentation we leave out the other half of the transition count factors and sojourn time factors for the symmetric pairs of states (x', x), such as H_CA, C_CA, H_GA, C_GA, and so on.
5.5 Factor graph for the Bayesian GLM representation of our proposed chain GTR model for DNA evolution when using HMC to update the univariate weights w_u and LBPS to update the bivariate weights w_b, where w = (w_u, w_b).
6.1 Actual stationary distributions are compared with estimates from a 415-site, 641-sequence synthetic dataset generated using the PolaritySizeGtr model.
We generate unbalanced stationary distributions to increase the difficulty of reconstruction.
6.2 Histogram of the relative biases (described in Section 6.1.1) of the 380 off-diagonal transition probability matrix entries (t = 1). The two dotted lines label −0.2 and 0.2.
6.3 Mean RMSE of the estimated rate matrices for the four models described in Section 3.2.1. The confidence bands are obtained from the LOESS smoothing method using a quadratic polynomial regression.
6.4 Estimates of the stationary distributions from MrBayes and our model, both set to their respective GTR settings.
6.5 Histogram of the relative differences of the estimates of the 190 exchangeable coefficients between MrBayes and our model, both set to their respective GTR settings. The differences are generally within 0.05; the two outliers greater than 0.1 are between A and R and between Q and P.
6.6 Effective sample size per 10^4 seconds on a log scale using AHMC, an adaptive NMH with a joint proposal guided by an estimated covariance matrix, and NMH with bandwidths of 0.01, 0.02, 0.05, 0.10, where NMH x stands for a Normal-proposal Metropolis-Hastings algorithm with proposal bandwidth x.
6.7 ESS per second (displayed on a natural log scale) across all features, obtained by running HMC with tuning parameters from a grid centered at the values ε = 5 × 10^-7 and L = 30 coming from AHMC.
6.8 Approximation of the posterior distribution of the weights corresponding to coarse features under the PolaritySize model. The naming convention on the abscissa is described in Section 3.2.1, with "Non" standing for "NonPolar".
6.9 Bubble plot of exchangeable coefficients. A bigger bubble indicates a larger exchangeable coefficient value.
6.10 Density plot comparing posterior densities across exchangeable parameters of an 8-by-8 rate matrix between LBPS-HMC and HMC.
6.11 ESS per second on a log10 scale across all exchangeable parameters, with the dimension of the rate matrices ranging from 5-by-5 to 30-by-30 in steps of 5.
6.12 Effective sample size per second of the summary statistics across exchangeable parameters with respect to different trajectory lengths of LBPS-HMC.
6.13 Summary statistics of ESS per second across exchangeable parameters with different trajectory lengths of LBPS at 0.1, 0.15 and 0.2, compared with HMC.
B.1 A simple tree topology with three observed sequences, denoted as shaded nodes X1, X2 and X4, and t_i as the branch length for each edge, where i = 1, 2, 3, 4.
B.2 A treatment of a phylogeny as a directed graphical model with three observed extant sequences and two sequences for extinct species, where each sequence has M sites. The plate represents the independence assumption among the M different sites in a sequence. Compared with the tree topology in Figure B.1, the branch length for each edge is left implicit in the graphical model representation, since it can be incorporated as a parameter in the conditional distribution P(x_v | x_{P_v}).
F.1 Actual stationary distributions are compared with estimates from a 415-site, 641-sequence synthetic dataset generated using the PolaritySizeGtr model. We generate unbalanced stationary distributions to increase the difficulty of reconstruction.
F.2 Histogram of the relative biases (described in Section 7.3) of the 400 transition probability matrix entries (excluding the redundant diagonal elements) under t = 2. The two dotted lines label −0.2 and 0.2.
F.3 Histogram of the relative biases (described in Section 7.3) of the 400 transition probability matrix entries (excluding the redundant diagonal elements) under t = 3. The two dotted lines label −0.2 and 0.2.
F.4 Histogram of the relative biases (described in Section 7.3) of the 400 transition probability matrix entries (excluding the redundant diagonal elements) under t = 5. The two dotted lines label −0.2 and 0.2.
F.5 Estimates of the stationary distributions from MrBayes and our model, both set to their respective GTR settings.
F.6 Histogram of the relative differences of the estimates of the 190 exchangeable coefficients between MrBayes and our model, both set to their respective GTR settings. The differences are generally within 0.05; the two outliers greater than 0.1 are between A and R and between Q and P.
F.7 The relationship between cache sizes and branch lengths using pairs of DNA sequences with 5000 sites under Gtr, with branch lengths from 0.2 to 5 in steps of 0.2. "Average" represents the cache size for each dataset averaged across 100,000 MCMC iterations. "Min" and "Max" represent the minimum and maximum cache sizes.
F.8 Trace plots for NMH with bandwidth 0.01 and our AHMC sampler, both run within the same wall-clock time limit of 2267 minutes. After discarding the burn-in period of the first 50,000 iterations, there are 450,000 iterations for the NMH method and 229,520 iterations for our AHMC sampler. The ordinate axis represents the sampled values for the feature "AD." Other features exhibit the same qualitative behaviour.
F.9 Estimates of the weights for polarity/charge and size features. The legend specifies the underlying model, following the naming convention defined in Section 7 of the main text.
F.10 Stationary distributions on the protein kinase domain family data. "Estimated" refers to the stationary distribution based on the posterior mean of the univariate weights in our Bayesian model. "Empirical" refers to the empirical frequency counts, which are sometimes used in lieu of estimates derived from CTMC models (e.g., Cao et al. (1994)). Note that even though the two estimates are broadly similar, they also differ quantitatively in important respects. Estimating the stationary distribution empirically from the data frequency counts is mainly done to save computational time, or to reduce the number of parameters in complex models such as the General Time Reversible Model (GTR) for codon evolution, which requires a 61-by-61 rate matrix. Empirical frequency counts do not take into account the uncertainty in the estimates of the stationary distribution, whereas the Bayesian sampling method is able to assess this uncertainty.
H.1 Boxplot comparing posterior samples for exchangeable parameters of an 8-by-8 rate matrix between LBPS and HMC.
H.2 Traceplot for an exchangeable parameter selected uniformly at random using LBPS-HMC compared with HMC, with an ARD of 0.43% in terms of the posterior mean.
H.3 ARD across exchangeable parameters between LBPS and HMC, with blue dotted lines representing the 10% and 90% quantiles of ARD.
H.4 Violin plot of ARD across exchangeable parameters between a shorter run and a longer run, the latter with two days more walltime.

Acknowledgements

First and foremost, I would like to express my deepest gratitude to my supervisor, Dr. Alexandre Bouchard-Côté, for his guidance, insights and encouragement over the past six years. I have worked under the supervision of Dr.
Alexandre Bouchard-Côté since I began my summer research during my master's studies in 2012. I feel very lucky to have worked with him and to have been influenced by his enthusiasm, attitude and insight in research. He has great patience and passion for training his students, and he has spent a great amount of time and effort helping his students fulfill their potential. Alex is always trying to think of what is best for his students. His academic insight, great guidance and generous personality helped shed light on my academic career. I would also like to thank him for his dedication in revising my thesis and papers, and I feel very sorry that I often left him a very short schedule before deadlines. I am also grateful to my supervisory committee members, Dr. Paul Gustafson and Dr. Sarah P. Otto, for their valuable advice, insightful comments and support in improving my thesis. Dr. Sarah P. Otto offered many valuable suggestions for the revision and improvement of my thesis on a very short schedule to help me meet the deadline. Moreover, I would like to express my gratitude to Dr. Alexandre Bouchard-Côté, Dr. Lang Wu and Dr. Paul Gustafson for their warm support as my references. They never say no whenever I need help during my graduate studies, no matter how busy they are. Dr. Lang Wu was also the second reader of my master's thesis. I would also like to thank Dr. Natalia Nolde for sharing her insights and experience about making career decisions between academia and industry. I would also like to thank both the current and past members of the statistical machine learning lab led by Dr. Alexandre Bouchard-Côté for the fun interactions, memories and enlightening learning experience during our weekly group meetings. Last, I would like to thank my parents, who raised me with their unconditional support and love, my cute younger brother, and my husband Dr.
Guangyu Zhu, who always believes in me, supports me and encourages me, especially when I am weak and moody. Without them, this work would not have been possible.

Chapter 1

Introduction

Continuous Time Markov Chains (CTMCs) are continuous time stochastic processes with a discrete state space, which constitute an extension of discrete time Markov chains. They have widespread applications in modelling a variety of data types, including time series and survival data (Kay, 1977; Kalbfleisch and Lawless, 1985, inter alia). At the core of modern phylogenetic analysis, they are widely used to model the evolution of nucleotides and amino acids due to their modelling simplicity and flexibility. CTMCs also play a fundamental role in areas such as queueing theory, reliability analysis, risk analysis, computer performance evaluation, web search, and biomedical applications.

In certain scenarios, CTMCs can involve a large number of parameters. An important practical challenge lies in reliably estimating this large number of parameters from partially observed time series. For example, in phylogenetics, we only observe the molecular sequences representing the present species at the leaves of the tree; the common ancestor at the root and the internal nodes denoting extinct ancestors are not observable. Various Markov models for biological sequence evolution have been proposed, such as Jukes and Cantor (1969), Kimura (1980), Hasegawa et al. (1985), and Tavaré (1986), with different levels of model complexity, that is, different numbers of parameters involved in determining the underlying rate matrix of the CTMC. A rate matrix describes the instantaneous rate at which a CTMC moves between different states. A natural question arises: Which model should we use? A model with more parameters has more flexibility in characterizing the process. However, the additional parameters, compared with a simpler model, may increase the estimation error when the amount of data is limited.
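To make the role of the rate matrix concrete, here is a minimal illustrative sketch of forward-simulating a CTMC path from a rate matrix (the procedure treated formally in Section 2.1.1). The function name, the 4-state Jukes-Cantor style matrix, and the seed are illustrative choices, not the thesis's notation or implementation.

```python
import random

def sample_path(Q, states, start, total_time, seed=1):
    """Forward-simulate a CTMC path on [0, total_time].

    Q[i][j] (i != j) is the instantaneous rate of jumping from state i
    to state j; each diagonal entry is minus the sum of the row's
    off-diagonal entries, so every row sums to zero.
    Returns a list of (jump time, state) pairs.
    """
    rng = random.Random(seed)
    path, t, i = [(0.0, states[start])], 0.0, start
    while True:
        rate_out = -Q[i][i]                 # total rate of leaving state i
        t += rng.expovariate(rate_out)      # exponential holding time
        if t >= total_time:
            return path
        # choose the next state with probability proportional to Q[i][j]
        u, acc = rng.random() * rate_out, 0.0
        for j in range(len(states)):
            if j == i:
                continue
            acc += Q[i][j]
            if u <= acc:
                i = j
                break
        path.append((t, states[i]))

# Jukes-Cantor style example: all off-diagonal rates equal to 1.
Q = [[-3.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
print(sample_path(Q, "ACGT", 0, 2.0))
```

The sampled path alternates exponential holding times with jumps, which is exactly the information that is unobserved between the leaves of a phylogenetic tree.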
For example, modelling amino acid evolution is more challenging than modelling nucleotide evolution for the simple reason that the rate matrix is of size 20-by-20 rather than 4-by-4. If we assume that mutations between each pair of distinct states have different rates, that each state in the state space of the CTMC can occur at a different frequency, and that the amount of change from state x to state y is equal to the amount of change from y to x in equilibrium, we need O(|X|^2) parameters to characterize the rate matrix when modelling either nucleotide or amino acid evolution, where |X| denotes the size of the state space. In nucleotide evolution, |X| is four, since the state space consists of only four states representing the four nucleotides. Thus, reliable estimates of the parameters in the rate matrix can be obtained. In contrast, under the same assumptions, |X| is 20 for amino acid evolution, which leads to a much larger number of parameters than in nucleotide evolution. Hence, there is often not enough data in one protein family to naively rely on such a model with O(|X|^2) parameters, although it provides flexibility in characterizing the rates of change between pairs of states.

This motivated the construction of lower-dimensional parameterizations of rate matrices. One example is parameter tying: when two or more parameters are tied, the same parameter value is shared among them. It is a widespread practice in CTMC modelling (Kimura, 1980; Hasegawa et al., 1985, inter alia), where certain parameters can be tied when taking biological properties into consideration.
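The O(|X|^2) counts above can be tallied explicitly. A small sketch (my tally, assuming the reversible parameterization just described: one exchangeable rate per unordered pair of distinct states, plus a stationary distribution constrained to sum to one; any further global time-rescaling constraint is ignored here):

```python
def reversible_rate_matrix_params(num_states):
    """Free parameters under the assumptions above: |X|(|X|-1)/2
    exchangeable rates plus |X|-1 free stationary probabilities,
    which is O(|X|^2) overall."""
    exchangeable = num_states * (num_states - 1) // 2
    stationary = num_states - 1
    return exchangeable + stationary

print(reversible_rate_matrix_params(4))   # nucleotides: 6 + 3 = 9
print(reversible_rate_matrix_params(20))  # amino acids: 190 + 19 = 209
```

The jump from 9 to 209 free parameters is the gap that makes naive full-rank amino acid models data-hungry.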
For example, in DNA evolutionary modelling, biochemical considerations suggest tying transversion parameters together (a transversion is a mutation changing purines (adenine or guanine) into pyrimidines (cytosine or thymine), or vice-versa; these mutations are less frequent than the mutations within each class).

This still leaves open many possible ways of tying parameters. Reducing the number of parameters (going from a fine-grained parameterization to a coarser one) generally achieves better performance on small datasets, but can create a ceiling on the performance on larger datasets. Hierarchical models provide an approach to this issue. However, using a single hierarchy is not always natural. For example, in the protein evolution problem motivating our work, one could tie parameters based on mass, charge, and a wide range of other largely orthogonal aspects of protein properties.

From a Bayesian perspective, we can view the problem of constructing tied rate matrix parameterizations as a special case of the problem of constructing structured priors on rate matrices. We take this route, applying Generalized Linear Model (GLM) ideas to Bayesian rate matrix modelling. Parameter tying emerges as a special case of this Generalized Linear Model for Continuous Time Markov Chains (GLM-CTMC) structured prior construction. In the motivating example of protein evolution, we explore how physicochemical properties of amino acids can be used to design structured priors over rate matrices in the Bayesian GLM rate matrix framework. We relied on a large literature studying the physicochemical similarities between amino acids to design a set of potentially useful features.

The structured priors we describe also allow the combination of several granularities concurrently, and importantly, non-hierarchical combinations.
In fact, we show an example of a structured prior where naive counting of the number of parameters would suggest poor generalization (the number of parameters being larger than the number of upper-triangular entries in the rate matrix of a certain reversible CTMC), but where performance is in fact superior, for all dataset sizes considered, to that of a model with an unstructured prior on a full rank parameterization.

While these GLM-CTMC based structured priors are useful, their construction is not surprising from a statistical point of view. In fact, their cousins in the maximum likelihood framework are widely known and used (Kalbfleisch and Lawless, 1985). The surprise is rather that the Bayesian counterpart is not more broadly used in applications, with the important exception of a simple special case of our framework corresponding to Bayesian Cox models in survival analysis (Clayton, 1991; Ibrahim et al., 2005).

We show that an important challenge with Bayesian GLM modelling of CTMCs (and perhaps the reason for their lack of more widespread use) is that posterior simulation using classical Markov Chain Monte Carlo (MCMC) techniques can be highly ineffective. We then describe how recent advances in two disparate fields can be combined to address this issue. The first ingredient in our proposed solution is the use of an Adaptive Hamiltonian Monte Carlo (AHMC) algorithm, which uses the gradient of the log target density to efficiently sample from the posterior density. The second ingredient comes from a set of techniques called substitution mapping, which consist of estimating, for each position of the biological sequences, the number of mutations that have occurred on each branch of a phylogenetic tree. These techniques have undergone intensive development in recent years, and allow us to efficiently maintain a set of auxiliary variables that make gradient calculations fast and simple. We use numerical experiments including
both synthetic and real datasets to demonstrate the computational efficiency of our sampling scheme compared with classical MCMC algorithms and an adaptive MCMC algorithm with a joint normal proposal guided by an estimated covariance matrix.

One limitation of Hamiltonian Monte Carlo (HMC) (or AHMC) is that it does not exploit the sparsity structure often found in the parameterization of high-dimensional rate matrices. To further exploit sparsity, we make use of recently developed Monte Carlo schemes based on non-reversible Piecewise-Deterministic Markov Processes (PDMPs). In particular, we build our samplers on the non-reversible rejection-free dynamics proposed in Peters et al. (2012) in the physics literature and later developed for statistical applications in Bouchard-Côté et al. (2018). It has been shown by Neal (2004); Sun et al. (2010); Chen and Hwang (2013); Bierkens (2016) that non-reversible MCMC algorithms can outperform reversible MCMC in terms of mixing rate and asymptotic performance. Related sampling schemes have been developed, including continuous-time Monte Carlo algorithms and continuous-time Sequential Monte Carlo algorithms (Fearnhead et al., 2016; Pakman et al., 2017), but we focus on proof-of-concept real data applications of the Bouncy Particle Sampler (BPS) algorithm in this work.

In the BPS algorithm, the posterior samples of a variable of interest are continuously indexed by the position of a particle that moves along piecewise linear trajectories, where the position of the particle represents the value of the parameter or variable being sampled. When encountering a high energy barrier (low posterior density value), the particle is never rejected; instead, the direction of its path is changed after a Newtonian collision. A key algorithmic requirement is to efficiently determine the bouncing time of the particle. Most existing work (Fearnhead et al., 2016; Vanetti et al., 2017;
Pakman et al., 2017) uses conservative bounds from the thinning algorithm of an inhomogeneous Poisson Process (PP) to sample the collision time. Conservative bounds can lead to computational inefficiency of the algorithm. Bouchard-Côté et al. (2018) have obtained analytical solutions to the collision time under certain simple scenarios such as the Gaussian distribution. In this thesis, an exact analytical solution is provided for the important special case of CTMC posterior inference.

BPS can be further sped up using a "local," sparse version of the algorithm referred to as the Local Bouncy Particle Sampler (LBPS). We provide the characterization of a family of factor graphs where LBPS can be efficient. Whereas Bouchard-Côté et al. (2018) only provided real-data benchmarks for the global BPS, proof-of-concept real data benchmarks for the local BPS algorithm are provided in this thesis.

Besides our proposed AHMC sampler for Bayesian GLM rate matrix models, we propose another novel sampler combining HMC and LBPS to efficiently sample from CTMCs while maintaining the auxiliary variable techniques used in Zhao et al. (2016). The computational efficiency of the combined sampling scheme of HMC and LBPS is compared with classical state-of-the-art HMC algorithms on proof-of-concept applications to protein evolution via real data sequences. Analytical solutions to the bouncing time for each factor of the factorized posterior density of CTMCs are obtained to boost the computational efficiency. In comparison, previous work uses thinning algorithms to either obtain analytical conservative bounds (Bouchard-Côté et al., 2018) or construct approximate bounds via predictive models using local regression.

To summarize, the thesis is organized as follows. In Chapter 2, we review the basics of CTMCs and provide an overview of Bayesian phylogenetics, including popular substitution models and existing MCMC algorithms.
We also point out the limitations of previous approaches. In Chapter 3, we propose our Bayesian GLM rate matrix modelling framework and provide examples of phylogenetic GLM-CTMC models. In Chapter 4, we propose efficient inference methods via AHMC for GLM-CTMC models. In Chapter 5, to further exploit the sparsity structure for a family of factor graphs where LBPS can be useful, we propose a sampling scheme based on a combination of HMC and LBPS. In Chapter 6, the computational efficiency of both algorithms is evaluated via both synthetic and real datasets. Chapter 7 summarizes the contributions of the thesis and discusses the advantages and potential extensions of our work.

Chapter 2

Background

Continuous Time Markov Chains (CTMCs) are at the core of modern phylogenetics. We provide an overview of Bayesian phylogenetics and show that our methodology readily extends to reversible processes and phylogenetic trees. In this chapter, we do not make novel contributions but instead provide background information on CTMCs and phylogenetics. This chapter is organized as follows.

We first review the basics of CTMCs. For pedagogical reasons, we discuss our notation and methodology for CTMCs in the context of a general Hidden Markov Model (HMM) in Section 2.1. We illustrate how to compute the density of either complete or partial observations of CTMCs in Section 2.1.2 and Section 2.1.3 respectively. We describe the basic aspects of phylogenetics in molecular evolution and review the data structures used in phylogenetics in Section 2.2.1. In molecular phylogenetics, a sequence substitution model is a stochastic model used to describe the replacement process by which a sequence of characters (nucleotides, amino acids or codons) changes into another set of homologous (i.e., aligned) character states over time.
We describe the popular substitution models in phylogenetics in Section 2.2.2. We outline the existing Markov Chain Monte Carlo (MCMC) algorithms in Section 2.3, review previous priors on the rate matrix in Section 2.2.3, and discuss limitations of previous approaches in Section 2.4. We focus on protein modelling in this thesis since it has a larger state space compared with DNA evolution, leading to higher levels of model complexity; it can therefore better demonstrate the advantage of our framework, which will be shown in Chapter 6.

2.1 CTMCs

We provide a list of acronyms used in this thesis in Appendix A. Recall that a (homogeneous) CTMC is a probability distribution over piecewise constant functions X := (X(t) : t ∈ T), where the abscissa T = [0, |T|) is a segment of the real line representing time, and the ordinate is a discrete set X of states, which we assume to be finite. Without loss of generality, X = {1, 2, ..., |X|}. The distribution over the length of each piecewise constant segment is exponential, with a rate that depends on the current state, and a transition matrix determines the conditional probability of jumping to a new state x′ ∈ X given the position x just before the jump. Throughout this thesis, we consider a homogeneous CTMC.

It is convenient to organize the parameters of these exponential and categorical distributions into a rate matrix Q indexed by X. The rate of the exponential departure distribution from state x is given by −q_{x,x}, and the probability of transitioning to x′ given that a jump is initiated from x is given by −q_{x,x′}/q_{x,x} for x ≠ x′, with the constraints that q_{x,x} = −∑_{x′∈X : x′≠x} q_{x,x′} and q_{x,x′} ≥ 0 for x ≠ x′. We denote the induced probability model by P_Q. We also interpret q_{x,x′} as the instantaneous rate of substitution from state x to another state x′. The transition probability matrix P(t) = {p_{x,x′}(t)} represents the probability that state x is replaced by state x′ after a length of time t, where x, x′ ∈ X.
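The constraints on Q and the holding-time/jump-probability interpretation just described can be checked on a small example; a minimal sketch with a toy 3-state rate matrix:

```python
# A toy 3-state rate matrix: off-diagonal entries are non-negative rates,
# and each diagonal entry is minus the sum of the rest of its row.
Q = [[-3.0, 1.0, 2.0],
     [0.5, -1.5, 1.0],
     [1.0, 2.0, -3.0]]

for x, row in enumerate(Q):
    assert abs(sum(row)) < 1e-12                              # rows sum to zero
    assert all(r >= 0 for i, r in enumerate(row) if i != x)   # off-diagonals >= 0

# Holding time in state x is Exponential with rate -q_{x,x}; given a jump
# out of x, the chain moves to x' with probability -q_{x,x'}/q_{x,x}.
x = 0
holding_rate = -Q[x][x]
jump_probs = [-Q[x][xp] / Q[x][x] for xp in range(3) if xp != x]
print(holding_rate, jump_probs)  # jump probabilities sum to one
```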
The relationship between the rate matrix and the corresponding transition probability matrix is that Q = dP(t)/dt with the initial condition P(0) = I, where I is an identity matrix of dimension |X| × |X|. In turn, we calculate the transition probability matrix P(t) using matrix exponentiation of the rate matrix scaled by t, that is, P(t) = exp(Qt) = ∑_{j=0}^{∞} (tQ)^j / j!. If a chain reaches equilibrium and it is a reversible process, we use π = (π_1, π_2, ..., π_{|X|}) to represent the stationary distribution, where π_i represents the proportion of time that the chain stays in state i at equilibrium.

2.1.1 Forward sampling of a full path from CTMCs

According to the previous description, given a rate matrix Q, we can conduct forward sampling of a full path under a CTMC over a given time interval using the Doob-Gillespie algorithm (Doob, 1945). Denote the jump times by T_i and the waiting time from T_i to T_{i+1} by ∆_i. We generate a fully observed path on a given time interval [t_0, T] via the Doob-Gillespie algorithm presented in Algorithm 1.

Algorithm 1 Doob-Gillespie's algorithm
Require: the rate matrix Q and the initial distribution pini(·) over X.
Ensure: a simulated full path from a CTMC over a given time interval.
1: Sample the initial state x_0 ∼ pini. Set i = 0.
2: for i = 0, 1, ... do
3:   Draw δ_i ∼ Exp(|q_{x_i,x_i}|).
4:   if t_i + δ_i > T then
5:     return (x_0, ..., x_i, t_0, ..., t_i).
6:   end if
7:   Let t_{i+1} = t_i + δ_i.
8:   Sample a new state x_{i+1} = x at time t_{i+1}, for x ≠ x_i, with probability −q_{x_i,x}/q_{x_i,x_i}.
9: end for

2.1.2 Density of fully observed paths

We review the density of fully observed paths given the rate matrix parameters. This density of a sample path x only depends on the following sufficient statistics:

Initial states: for all x ∈ X, let n_x(x) ∈ {0, 1, 2, ...} denote the number of paths initialized at state x. Let n(x) := (n_1(x), n_2(x), ...
, n_{|X|}(x)).

Sojourn times: let h_x(x) ∈ [0, ∞) denote the total time spent in state x; in other words, the sum of all sojourn times in x across all jumps and paths. Let h(x) := (h_1(x), h_2(x), ..., h_{|X|}(x)).

Transition counts: for all distinct (x, x′) ∈ X^distinct := {(x, x′) ∈ X^2 : x ≠ x′}, let c_{x,x′}(x) ∈ {0, 1, 2, ...} denote the number of times that a jump ends at state x′ right after jumping from x. Let c(x) denote the vector of all transition counts, organized according to some arbitrary fixed order on X^distinct.

Given a sample path x, let z(x) := (n(x), h(x), c(x)) denote the corresponding sufficient statistics. In terms of these statistics (n, h, c) := (n(x), h(x), c(x)), the density of a sample path x from a CTMC with rate matrix Q is given by:

f_{x|Q,∆}(x|Q, ∆) := ( ∏_{x∈X} pini(x)^{n_x} ) ( ∏_{(x,x′)∈X^distinct} q_{x,x′}^{c_{x,x′}} ) ( ∏_{x∈X} exp(h_x q_{x,x}) ) 1[x ∈ S(∆)],    (2.1.1)

where pini(·) is the probability mass function of the initial distribution, and S(∆) denotes the set of CTMC paths of total length ∆. When ∆ = |T|, we write f_{x|Q}(x|Q) := f_{x|Q,∆}(x|Q, |T|) for short. We say that a process is reversible if, for all t, t′ ∈ T, P_Q(X_t = x, X_{t′} = x′) = P_Q(X_t = x′, X_{t′} = x). In the reversible case, pini(·) is the stationary distribution.

2.1.3 Density of partially observed sample paths

The sample path is only partially observed in many scenarios of interest. Hence, it is typically not possible to work directly with the simple density given in Equation (2.1.1).

Let V := {1, 2, ..., |V|} denote a set indexing observation times, and E the set of consecutive observation index pairs, E := {(v, v+1) : v ∈ {1, 2, ..., |V|−1}}. For all v ∈ V, we denote the corresponding observation time by t^{(obs)}_v. We denote the sorted times by t^{(obs)} := (t^{(obs)}_1, t^{(obs)}_2, ..., t^{(obs)}_{|V|}), with t^{(obs)}_v < t^{(obs)}_{v′} for all (v, v′) ∈ E.
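The complete-data density (2.1.1) of the previous section is straightforward to evaluate from a fully observed path; a minimal sketch of its logarithm for a single path, where the encoding of the path by its visited states and sojourn times is our own illustrative choice, and the total-length indicator is assumed satisfied:

```python
import math

# Log of the complete-data density (2.1.1) for a single fully observed
# path, encoded (for illustration) by the sequence of visited states and
# the sojourn time spent in each; the indicator 1[x in S(Delta)] is
# assumed to hold by construction.
def complete_data_log_density(states, sojourns, Q, pini):
    logp = math.log(pini[states[0]])        # initial-state term
    for a, b in zip(states, states[1:]):    # transition-count terms
        logp += math.log(Q[a][b])
    for x, h in zip(states, sojourns):      # sojourn-time terms
        logp += h * Q[x][x]
    return logp

Q = [[-1.0, 1.0],
     [2.0, -2.0]]
# Path: start in state 0, stay 0.5 time units, jump to state 1, stay 0.3.
print(complete_data_log_density([0, 1], [0.5, 0.3], Q, [0.5, 0.5]))
```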
From the known observation times t^{(obs)}, we construct a mapping d(·) that takes a sample path x as input and outputs a vector containing the state of the CTMC at each of the observation times, d(x) := (d_1(x), ..., d_{|V|}(x)), where d_v(x) := x(t^{(obs)}_v).

To compute the marginal transition probability ptrans between a state x and a state x′ separated by an interval of time ∆ ≥ 0, we use the matrix exponential of the rate matrix scaled by ∆:

ptrans(x′|x, Q, ∆) := P_Q(X(∆) = x′ | X(0) = x)    (2.1.2)
                    = (exp(∆Q))_{x,x′}.    (2.1.3)

In general, there is no closed form expression in terms of the parameters q_{x,x′} for the matrix exponential in Equation (2.1.3), but pointwise evaluation can be done in time O(|X|^3) using numerical methods (Moler and Van Loan, 1978).[1]

[1] For reversible processes, it is typical to use a method based on spectral decompositions; for general processes, the Padé approximant combined with squaring and scaling is a reasonable default choice (Moler and Van Loan, 1978).

To simplify the notation, we assume that t^{(obs)}_1 = 0.[2] After marginalization, the probability of a discretely observed sample path D := d(X) becomes:

p_{d|Q}(d|Q) := pini(d_1) ∏_{e=(v,v+1)∈E} ptrans(d_{v+1} | d_v, Q, ∆_e),    (2.1.4)

where ∆_{(v,v+1)} := t^{(obs)}_{v+1} − t^{(obs)}_v is the spacing between consecutive observations.

The random variable D is not necessarily observed directly. Instead, we may only have access to a collection of random variables Y = (Y_1, ..., Y_{|V|}), conditionally independent given D, and with an emission probability or density denoted by femi(y_v|d_v):

f_{d,y|Q}(d, y|Q) := p_{d|Q}(d|Q) ∏_{v∈V} femi(y_v|d_v).    (2.1.5)

The density of the data given a rate matrix would naively involve a sum over an exponentially large set of latent states:

f_{y|Q}(y|Q) = ∑_{d_1∈X} · · · ∑_{d_{|V|}∈X} f_{d,y|Q}(d, y|Q),    (2.1.6)

but we can compute the sum in polynomial time using standard HMM inference methods (see for example Kschischang et al. (2006) for a general treatment of

[2] We can assume t^{(obs)}_1 = 0 without loss of generality.
If t^{(obs)}_1 > 0, modify the initial distribution to pini′(d_1) := ∑_{x∈X} pini(x) ptrans(d_1 | x, Q, t^{(obs)}_1), and subtract t^{(obs)}_1 from all observed times.

the sum-product algorithm on factor graphs). This is possible since the density f_{d,y|Q}(d, y|Q) can be decomposed into a chain-shaped graphical model with a structure given by the vertices V and edges E. Excluding the cost of matrix exponentiation,[3] this means that computing Equation (2.1.6) takes time O(|V| · |X|^2). The posterior distribution over the discrete latent states, p_{d|y,Q}, is computed in the same way.

[3] While the cost of one matrix exponential is O(|X|^3), performing it |V| times (for each of the time intervals ∆_e) can be done faster than O(|V| · |X|^3) for reversible CTMCs (Schadt et al., 1998). Under the reversibility condition, reliable numerical methods can be used to obtain a spectral decomposition of Q, which can be reused across all the time intervals, obtaining a running time of O(|X|^3 + |V| · |X|^2) (Moler and Van Loan, 1978).

2.2 Bayesian Phylogenetics Overview

Since we focus on protein evolution to illustrate our methodology, we provide a brief overview of Bayesian phylogenetic inference in this section. We give a basic description of phylogenetics, which includes the structure of the available data, conventional substitution models in the CTMC framework, and Bayesian approaches in this field. An example of the likelihood computation for a simple phylogenetic tree is described in detail in Appendix B.

2.2.1 Data Structures

Assume that a certain biological sequence has related homologs in a set of organisms, and that these sequences have been aligned (i.e., that for each character in the sequence, we know which characters in the other sequences share a common ancestral character). This sequence can be subsampled so that the evolution of one group of aligned nucleotides or amino acids is approximately independent (or assumed to be independent) of the evolution
of any other group included in the dataset. Each such group is called a site. Molecular phylogenetic data consist of homologous DNA sequences or protein sequences of related species which have been aligned. They can also involve more complex structures, such as groups of adjacent or interacting sites evolving in a non-independent fashion.

We describe the available data structure below. Assuming we have n_1 observed species, the observed data, such as amino acid sequences or DNA sequences, can be organized into a matrix with n_1 rows, each row representing one species, and n_2 columns, representing the number of sites in each sequence. We note that the sequences are obtained from multiple sequence alignment tools and algorithms, which are not the focus of this study. For simplicity, we assume there is no sequencing error or alignment error in the data. Let X = {x_{ij}} be the aligned amino acid sequences, where i ∈ {1, 2, ..., n_1} and j ∈ {1, 2, ..., n_2}. Each column of the matrix x_j = {x_{1,j}, ..., x_{n_1,j}} specifies the states of all n_1 sequences at the jth site. Each row of the data matrix x_i = {x_{i,1}, ..., x_{i,n_2}} represents the states of all sites on the ith sequence. We assume the sites are independent within each sequence.

X = [ x_{1,1}    x_{1,2}    · · ·  x_{1,n_2}
      x_{2,1}    x_{2,2}    · · ·  x_{2,n_2}
      ...        ...        . . .  ...
      x_{n_1,1}  x_{n_1,2}  · · ·  x_{n_1,n_2} ]

Let us denote the set of all 20 amino acids as the state space X, where X = {A, G, S, P, C, T, N, D, R, E, Q, H, I, L, K, M, F, W, Y, V}; each x_{ij} ∈ X, or sometimes x_{ij} is denoted as a gap "-" in the data. A gap can be used for sites with missing information, sometimes resulting from experimental difficulties.

Figure 2.1: An example of a pair of protein sequences with gaps

It can also be used to represent an inserted or deleted substring of a genetic string, particularly in the consideration of DNA or protein alignments.
It is represented by a contiguous interval of gap symbols, which indicates the insertion of an interval of nucleotides or amino acids from a nucleic acid or protein in one sequence, or the deletion of the counterparts in the other sequence. Usually, no distinction is made between insertions and deletions, which are sometimes collectively referred to as indels. To better illustrate gaps, we show in Figure 2.1 part of an alignment of two protein strings, which can also be found at http://rosalind.info/glossary/gap/, where the second sequence has a gap of 20 amino acids. This gap corresponds to a deletion if we view the alignment as a transformation of the top sequence into the bottom one, or an insertion if we consider a transformation of the bottom into the top one.

Although indels and missing data provide different information, in this thesis both situations are treated in the same way. In terms of likelihood computation, a gap is treated by summing the likelihood over all possible states in the state space. Note that there is a large literature on sophisticated indel models; see, for example, Suchard and Redelings (2006) for a Bayesian indel model.

2.2.2 Substitution models in the CTMC framework

In general, there are two kinds of models to characterize molecular evolution. Outside a CTMC framework, empirical models (Dayhoff et al., 1978; Jones et al., 1992) are constructed by averaging over many proteins, and the substitution rates between amino acids are fixed no matter what protein is analyzed. These models do not take factors affecting amino acid evolution into account. Examples of such factors include the physicochemical properties of amino acids, mutational bias in the DNA, acceptance or rejection of the amino acid under selective constraints on the protein, and so on. The empirical models pool data from several protein families to learn the substitution pattern based on empirical frequencies of replacement between amino acids.
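The gap treatment described above (summing the likelihood over all possible states) has a simple computational form in pruning-style likelihood calculations: the partial-likelihood vector of a leaf is one-hot for an observed character and all ones for a gap. A minimal sketch, using the nucleotide alphabet for brevity (the 20-state amino acid case is identical):

```python
# Leaf partial-likelihood vectors: an observed character puts all mass on
# its own state, while a gap "-" is marginalized by allowing every state,
# i.e. a vector of ones.
STATES = "ACGT"

def leaf_vector(char):
    if char == "-":
        return [1.0] * len(STATES)  # gap: sum over all possible states
    return [1.0 if s == char else 0.0 for s in STATES]

print(leaf_vector("G"))  # one-hot on G
print(leaf_vector("-"))  # all ones
```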
In this case, only closely related sequence pairs are used to reduce the probability of multiple replacements.

In this work, we consider models within a CTMC framework that account for features of the evolution. In this section, we introduce both the General Non-Reversible (GNR) model and the General Time Reversible (GTR) model in the CTMC framework. Both are stochastic models which describe the substitution process, where one amino acid is replaced by another at the same site along the path between two species. The difference between them is that the GTR model satisfies one additional constraint on the structure of the rate matrix to make it reversible.

GNR model

We organize the off-diagonal entries of Q into a vector of non-negative numbers denoted by ξ := (q_{x,x′} : (x, x′) ∈ X^distinct), where X^distinct := {(x, x′) ∈ X^2 : x ≠ x′}. We call this parameterization the GNR, following Baele et al. (2010). We denote the rate matrix corresponding to ξ by Q(ξ); the diagonal elements are simply formed by negating the sum of the off-diagonal entries of each row, q_{x,x} = −∑_{x′∈X : x′≠x} q_{x,x′}.

GTR model

It is often the case in phylogenetics that non-reversibility is only weakly identifiable (Huelsenbeck et al., 2002), so reversibility is widely assumed in mainstream models. Assuming reversibility and that the rate matrix has not changed along the tree, the initial distribution of each state can be tied to the stationary distribution.
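Tying the initial distribution to the stationary distribution relies on reversibility; a minimal numerical check, using the standard reversible parameterization q_{ij} = π_j θ_{ij} with symmetric θ (introduced formally below), that detailed balance holds and that π is indeed stationary (πQ = 0):

```python
# Build a reversible rate matrix from a stationary distribution pi and
# symmetric exchangeabilities theta via q_ij = pi_j * theta_ij, then
# verify detailed balance (pi_i q_ij = pi_j q_ji) and stationarity
# (pi Q = 0), which justifies tying the initial distribution to pi.
pi = [0.2, 0.3, 0.5]
theta = [[0.0, 1.0, 2.0],
         [1.0, 0.0, 0.5],
         [2.0, 0.5, 0.0]]  # theta_ij = theta_ji

n = len(pi)
Q = [[pi[j] * theta[i][j] for j in range(n)] for i in range(n)]
for i in range(n):
    Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)

for i in range(n):
    for j in range(n):
        assert abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < 1e-12  # detailed balance
stationary_flow = [sum(pi[i] * Q[i][j] for i in range(n)) for j in range(n)]
assert all(abs(f) < 1e-12 for f in stationary_flow)            # pi Q = 0
print("detailed balance and stationarity hold")
```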
When there are several time series available, this allows more information coming from the initial values of the time series to be pooled.

Compared with the GNR model, the GTR model has one additional time-reversibility constraint:

π_i q_{ij} = π_j q_{ji}, for any i, j ∈ X.

We can also reparameterize the rate matrix elements as q_{ij} = π_j θ_{ij}, where θ_{ij} represents the exchangeable coefficient (or exchangeable parameter), denoting the instantaneous rate of transitions between states i and j after factoring out the effect of the stationary distribution. The GTR model is defined as q_{i,j} = π_j θ_{ij} for any i, j ∈ X, where θ_{ij} = θ_{ji}.

In both the GNR and GTR models, in order to address the confounding with time, Q is usually normalized such that the expected number of transitions in one unit of time is one: ∑_{i=1}^{|X|} q_{i,i} π_i = −1.

2.2.3 Bayesian phylogenetic inference

A phylogenetic tree is based on a discrete directed tree (V, E) called the topology, and a collection of real numbers ∆ := (∆_e : e ∈ E) associated to each edge e ∈ E, called the branch lengths. The branch lengths often
Note that we use the same notation as the chain graphical model inSection 2.1, this highlights the fact that the tree graphical model and thechain graphical model play similar roles. A sample path X := (X(t) : t ∈ T )on a phylogenetic tree can be simulated recursively as follows:1. Generate the value X(r) at the root r of the tree (V ,E) according toan initial distribution pini.2. Given the root of any subtree v ∈ V and the state X(v) at that node,independently generate the value at each child v′ of v as follows:(a) simulate a CTMC on one edge, (X(t) : t ∈ (0,∆e]) for a timegiven by ∆e, where e = (v → v′),(b) set X(v′) to the final point of this process, and repeat step 2recursively by setting v′ as the root of the subtree.There is one such processXk for each site k ∈ {1, . . . ,K}, and we assumethat all of these processes share the same tree T . We view the topology and192.2. Bayesian Phylogenetics Overviewbranch lengths as random variables, τ := (∆, V , E).After introducing the parameters for a phylogenetic tree, we introducethe parameters ρ = (pi, θ) ∈ Θ for the rate matrix Q. The parameteriza-tion of the rate matrix qij = pijθij is expressed in terms of the stationarydistribution and the exchangeable coefficient. As a result, the parametersinvolved in a phylogenetic model is (τ , ρ). We denote the substitution modelas M and the observed data as Y . The likelihood of the model p(Y |τ , ρ,M)given all the parameters can be specified and calculated using the pruningalgorithm from Felsenstein (1981). An example of likelihood computationunder a simple tree topology using the pruning algorithm has been shown inAppendix B. 
If we take a frequentist’s approach, we can estimate the param-eters by finding the values of parameters that can maximize the likelihood.While in a Bayesian framework, we need to define prior distributions for theparameters in the model denoted as g(τ , ρ), and then obtain the posteriordistribution of all the parameters based on a combination of the observeddata and the prior information. The prior distribution is usually assumedto have the form g(τ , ρ) = g(τ )g(ρ).There are various choices for the prior distribution of the parameters.For example, we can choose a uniform distribution over the topologiesand independent exponential over each of the branch lengths. We denote(x, x′) ∈ X distinct := {(x, x′) ∈ X 2 : x 6= x′}. A variety of prior densities fξon ξ can be used for the GNR model. For example, in Huelsenbeck et al.(2002), the authors used independent gamma priors with unit mean andvariance β for each qx,x′ , x 6= x′:202.3. Existing MCMC algorithmsfξ(ξ) = fgam(ξ) :=∏(x,x′)∈Xdistinct1[qx,x′ > 0]q1/β−1x,x′ exp(−qx,x′/β). (2.2.1)A popular prior for ρ is to use a Dirichlet distribution on the stationarydistribution pi and independent gamma priors on θ for the GTR model.2.3 Existing MCMC algorithmsIn this section, we review existing MCMC approaches to posterior approx-imation of CTMC parameters. We view the paths X as nuisance variableswhich are marginalized using matrix exponentiation. A state of the MCMCalgorithm therefore consists of only the variable ξ. We denote the state atMCMC iteration i by ξ(i).Given a state ξ(i) from the previous iteration, existing MCMC algo-rithms typically propose the next sample ξ(i+1) using a Metropolis withinGibbs strategy. 
For example, in MrBayes, a prominent software package for Bayesian phylogenetic inference (Huelsenbeck and Ronquist, 2001; Ronquist et al., 2012), sliding window and multiplier proposals (where the new value of a univariate variable is perturbed additively or multiplicatively at random), as well as Dirichlet proposals (as an independence sampler), are available. We denote the density of the proposal by fprop. The proposed parameters ξ′ are then accepted with probability:

min{ 1, [f_ξ(ξ′) / f_ξ(ξ^{(i)})] × [f_{y|Q}(y|Q(ξ′)) / f_{y|Q}(y|Q(ξ^{(i)}))] × [fprop(ξ^{(i)}|ξ′) / fprop(ξ′|ξ^{(i)})] },

in which case ξ^{(i+1)} = ξ′; otherwise, ξ^{(i+1)} = ξ^{(i)}. The computational bottleneck is in computing the factor f_{y|Q}.

2.4 Limitations of previous approaches

The GNR model can lack flexibility in practice: an important issue is the scalability with respect to the size of the state space, |X|. The number of parameters in both the GNR and the GTR model is O(|X|^2), so even for a state space of moderate size, there is often not enough data available to obtain a useful posterior distribution on the parameters. One can reduce the number of parameters by enforcing certain constraints (i.e., tying parameters); for example, setting q_{1,2} = q_{1,3} removes one effective degree of freedom. This can be applied to several groups of pairs of states believed to behave similarly. For example, in order to model DNA evolution, Kimura (1980) used this parameter tying technique to group all the nucleotide transversions (a transversion is a mutation changing purines (adenine or guanine) into pyrimidines (cytosine or thymine), or vice-versa; these mutations are less frequent than the mutations within each class). However, parameter tying can be restrictive, such as when several overlapping ways of grouping states are conceivable.
For example, each amino acid can be described with several properties including composition, polarity, and volume; Sneath lists as many as 134 potentially useful side chain properties (Sneath, 1966). Using all of these properties by taking intersections of equivalence classes would get us back to an overparameterized model close to the GNR model that we were trying to simplify in the first place.

To deal with these challenges and limitations, we propose in the next chapter our modelling framework, which allows parameter tying as a special case of constructing structured priors on rate matrices of CTMCs.

Chapter 3

Bayesian rate matrix GLMs

The General Time Reversible (GTR) model is one of the most popular substitution models (reviewed in Chapter 2). The GTR model for amino acid substitution involves 210 parameters, and individual proteins are sometimes too short (Tiessen et al., 2012) to provide sufficient information to obtain accurate estimates of a large set of parameters. To reduce the number of parameters, parameter tying has been used (Bishop and Friday, 1985), where certain parameters can be tied when taking biological properties into consideration. However, this still leaves open many possible ways of tying parameters. We are faced with a trade-off between the two competing demands of accuracy and model complexity given different sizes of datasets. Reducing the number of parameters can achieve better performance on small datasets, but can create a ceiling on the performance on larger datasets. In the motivating example of protein evolution, if the individual proteins are fairly short, we hope to construct a low dimensional model of protein evolution that can still capture the patterns of evolution.
Physicochemical properties of amino acids can provide insights into the patterns of protein evolution. This motivates us to propose a framework that can incorporate such biological information into substitution models. In this chapter we propose our modelling framework, which we call the Generalized Linear Model for Continuous Time Markov Chains (GLM-CTMC). The parameter tying technique described previously emerges as a special case of the GLM-CTMC framework. From a Bayesian perspective, we can view the problem of constructing tied rate matrix parameterizations as a special case of the problem of constructing structured priors on rate matrices. We take this route and propose to apply Generalized Linear Model (GLM) ideas to Bayesian rate matrix modelling.

This chapter is organized as follows. We first describe our model for the general unnormalized rate matrix in Section 3.1 and then extend it to the normalized reversible case using a Bayesian GLM rate matrix parameterization in Section 3.2. We illustrate how our framework works by explaining how physicochemical properties of amino acids can be used to design structured priors over rate matrices in Sections 3.1.1 and 3.2.1. Finally, we propose a Bayesian GLM chain GTR model for unnormalized reversible rate matrices with sparsity structure in Section 3.3.2, which can be exploited by our sampling algorithm described in Chapter 5. For simplicity, we focus on the simple example of chains, but other sparse graph structures would also be suitable.

3.1 Bayesian rate matrix GLMs

We present our GLM-CTMC model for the off-diagonal entries of an unnormalized rate matrix Q^(nr)(w):

    q^(nr)_{x,x'}(w) := exp{⟨w, ϕ(x, x')⟩} q̄(x, x'),    (3.1.1)

constructed from the following quantities:

Natural parameters (weights): w ∈ R^p, for some fixed integer p.

Sufficient statistic: ϕ : X^distinct → R^p, an arbitrary, user-specified function taking a pair of distinct states as input, ϕ(x, x') := (ϕ_1(x, x'), . . . , ϕ_p(x, x')).

Base measure: q̄, an unnormalized mass function assumed to be known. This could come from a simpler model; for simplicity, we set q̄ ≡ 1.

As before, diagonal entries are defined so that each row sums to zero, q^(nr)_{x,x}(w) := −Σ_{x'∈X : x'≠x} q^(nr)_{x,x'}(w). The power of this model comes from the flexibility brought by ϕ, which allows the user to encode their prior knowledge. The model is related to Kalbfleisch and Lawless (1985); however, this previous work used GLMs for a different purpose, namely the incorporation of covariate dependences into the rate matrix. Our model is novel in that it allows parameter tying, which lets us reduce the model complexity by incorporating biological information into the prior distribution. The corresponding computational challenges for frequentist and Bayesian inference are also different.

3.1.1 Examples

Going back to the amino acid example from the introduction, the p components of the mapping ϕ can be interpreted as different side chain properties, designed from biochemical prior knowledge. Following the terminology from machine learning, we call each of these components a feature. We now describe some examples of features in the context of amino acid evolutionary modelling. First, we give a description of a single feature. Then, we progressively describe more general mechanisms for feature generation.

We construct a mapping for the purpose of defining classes of amino acids sharing certain physicochemical properties. For example, we show in Table 3.1 the definition of a mapping called "size," where size : X →

    Size:
        Micro: A, G, S, P, C, T, N, D
        Big:   R, E, Q, H, I, L, K, M, F, W, Y, V
    Polarity/charge:
        Non-polar:    A, C, G, I, L, M, F, P, W, V
        Basic polar:  R, H, K
        Polar:        N, Q, S, T, Y
        Acidic polar: D, E

Table 3.1: Size and polarity/charge properties of amino acids
As an illustration, for theamino acid Serine (S), size(S) = Micro. Also, if e := (x, x′) ∈ X distinct isa pair of states, we define size((x, x′)) to be the list obtained by applyingthe map to the individual elements in e. For example, for the amino acidsPhenylalanine (F) and Serine (S), size((S,F)) = (Big,Micro).Using these definitions, we define our first feature as follows:ϕ1(x, x′) := 1(Big,Big)(size((x, x′))),Here the indicator function, denoted 1A : X → {0, 1} is defined as:1A(x) :=1 if x ∈ A,0 if x /∈ A.The feature ϕ1(x, x′) is equal to one if and only if the size of the aminoacid is preserved by the substitution between x and x′. To be specific, if andonly if size((x, x′)) = (Big,Big), then ϕ1(x, x′) = 1(Big,Big)(size((x, x′))) = 1.We call this first feature “Big→Big” for short. Note that this ties severalentries of rate matrix, since multiple pairs e1, e2, · · · ∈ X distinct are suchthat ϕ1(ei) 6= 0. Continuing in this fashion, we can build three additionalfeatures similarly, “Big→Micro”, “Micro→Big”, and “Micro→Micro”.Next, we can use another mapping to get additional features. For exam-273.1. Bayesian rate matrix GLMsple, we consider the map polarity : X → {Non-polar,Basic Polar,Polar,AcidicPolar} in Table 3.1, which encodes a combination of polarity and chargeproperties. The names of these features follow the same convention, yieldingfeatures such as “Acidic→Basic”. Following a convention from the machinelearning literature, we use the terminology feature template for a recipe thatyields several features. For example, the function “size” corresponds to afeature template we denote by Size, and polarity, to a feature template wedenote by Polarity. Feature templates can be combined to form biggermodels. 
For example, PolaritySize denotes a sufficient statistic vector that contains the concatenation of the features in Polarity with those in Size.

In this section, the feature templates we have developed are used for rate matrices without a reversibility assumption, and they take directionality into account. For example, the feature template differentiates the "Big→Micro" feature from the "Micro→Big" feature. Thus, in terms of the input to the "size" function, we use (x, x') to indicate that the order of the states x and x' in the pair matters. In comparison, when we develop feature templates for a model under the reversibility assumption in Section 3.2.1, we do not differentiate the "Big→Micro" feature from the "Micro→Big" feature, since we assume that the two have the same effect on the substitution rates. That is, we do not take directionality into consideration, so the order of the states in the pair does not matter, and we write {x, x'} to represent the unordered pair of states x and x'.

3.1.2 Prior distributions

Next, we place a prior on w, with density denoted by f_w(w). A default choice is a normal prior with mean zero and precision (inverse variance) κ, denoted f_nor(w). As an alternative, taking f_w to be the density of a product of independent log-gamma distributions, denoted f_lga(w), has the interesting property that the corresponding GLM-CTMC models then become a superset of the Bayesian GNR model reviewed in Section 2.2.2, in the following sense: if we set p = |X|(|X| − 1), order the elements of X^distinct as e_1, e_2, . . . , e_p, and set (ϕ(x, x'))_m = 1[(x, x') = e_m], then the GLM rate matrix model and the specific GNR model introduced in Section 2.2.2 induce the same marginal distribution over the paths of Section 2.1.3.

Following the terminology of Berg-Kirkpatrick et al. (2010), we call each feature used in the reduction described above a fine feature.
With fine features, |ϕ_m^{-1}(R\{0})| = 1, which means there is exactly one pair of states (x, x') associated with feature m; otherwise, when |ϕ_m^{-1}(R\{0})| > 1, we call feature m a coarse feature, meaning that more than one pair of states (x, x') shares the same feature.

Asymptotically, only fine features are required to fully specify the rate matrix. However, for a small sample size, a Bayesian GLM-CTMC model containing only coarse features may capture the patterns of the rate matrix better than a more complex model including all fine features, since the reduced model can achieve better estimation performance on small datasets. For a finite sample size, a Bayesian GLM-CTMC model containing both coarse and fine features can dominate a model containing only fine features. We show an example in Figure 6.3 in Section 6.1.1, using the reversible equivalent of the Bayesian GNR model. In that example, we investigate how the estimates depend on the number and type of features used in the model, and on the size of the dataset. We find that with a small dataset, it is advantageous to include only a number of coarse features in the model, since this achieves better estimation accuracy. As we obtain larger datasets, using fine features, or both coarse and fine features, outperforms a simpler model using only coarse features in terms of estimation accuracy.

3.2 Bayesian reversible rate matrix GLMs

We discussed in Section 2.2.2 that reversible models are widely used in phylogenetics and have several advantages compared with non-reversible ones.
In the remainder of this section, we show how to accommodate reversibility constraints within our GLM-CTMC framework.

Recall first that a rate matrix is reversible if and only if its off-diagonal entries can be parameterized as q_{x,x'} = θ_{x,x'} π_{x'}, with the initial distribution of all states coinciding with the stationary distribution, and with θ_{x,x'} denoting non-negative parameters indexed by unordered pairs of distinct states (Tavaré, 1986). We denote the set of unordered distinct pairs of states by X^unordered,dist. := {{x, x'} ⊂ X : x ≠ x'}.

To build priors over reversible rate matrices, we consider two collections of exponential families, one for each local decision involved in the generative process (i.e. sampling: initial points, exponentially distributed waiting times, and jumps), defined as follows:

    θ_{x,x'}(w) := exp{⟨w, φ({x, x'})⟩} θ̄({x, x'}),    (3.2.1)
    π_x(w) := exp{⟨w, ψ(x)⟩ − A(w)} π̄(x),    (3.2.2)
    A(w) := log Σ_{x∈X} exp{⟨w, ψ(x)⟩},    (3.2.3)
    q^(rev)_{x,x'}(w) := θ_{x,x'}(w) π_{x'}(w),    (3.2.4)

which in turn depend on the following quantities:

Natural parameters (weights): w ∈ R^p, for some fixed integer p, shared by the θ-exponential family (Equation (3.2.1)) and the π-exponential family (Equation (3.2.2)).

Sufficient statistics, coming in two flavours: bivariate features φ : X^unordered,dist. → R^p for the θ-exponential family, taking unordered pairs of states as input; and univariate features ψ : X → R^p for the π-exponential family, taking a single state as input. Both types map these contexts into R^p.

Log-normalization: A : R^p → (0, ∞) for the π-exponential family. This is just a sum over |X| elements. Note that the θ-exponential family is a family of measures rather than of probability distributions, so there is no need for log-normalization in this case.

Base measures: θ̄ and π̄ are known probability mass functions. As in the non-reversible case, these could come from a simpler model.
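As a concrete illustration of Equations 3.2.1-3.2.4, the following sketch builds a reversible rate matrix on a toy three-state space. The feature maps `phi` and `psi` below are arbitrary illustrative choices, not the amino acid features of Section 3.2.1, and the base measures are fixed to one.

```python
import numpy as np

STATES = [0, 1, 2]
p = 4   # weights shared by the theta- and pi-exponential families

def phi(pair):
    """Bivariate feature of an unordered pair (symmetric by construction)."""
    x, xp = sorted(pair)
    v = np.zeros(p)
    v[x + xp - 1] = 1.0   # pairs (0,1), (0,2), (1,2) -> indices 0, 1, 2
    return v

def psi(x):
    """Univariate feature of a single state."""
    v = np.zeros(p)
    v[x] = 1.0
    return v

def reversible_Q(w):
    # pi_x(w) = exp(<w, psi(x)> - A(w)); A(w) is the log of the sum below.
    pi = np.array([np.exp(w @ psi(x)) for x in STATES])
    pi /= pi.sum()
    n = len(STATES)
    Q = np.zeros((n, n))
    for x in STATES:
        for xp in STATES:
            if x != xp:
                theta = np.exp(w @ phi((x, xp)))   # theta_{x,x'}(w)
                Q[x, xp] = theta * pi[xp]          # q_{x,x'} = theta * pi_{x'}
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q, pi

Q, pi = reversible_Q(np.array([0.3, -0.2, 0.1, 0.4]))
# Detailed balance pi_x q_{x,x'} = pi_{x'} q_{x',x} holds because phi,
# and hence theta, is symmetric in the pair:
for x in STATES:
    for xp in STATES:
        if x != xp:
            assert np.isclose(pi[x] * Q[x, xp], pi[xp] * Q[xp, x])
```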
For simplicity of notation, we assume they are identically equal to one.

3.2.1 Examples

To make the model reversible, we make a few modifications to the features defined in Section 3.1.1. We provide examples of how we construct reversible feature templates such as Polarity, PolaritySize, and PolaritySizeGtr for our proposed models based on the physicochemical properties of amino acids, as well as the Gtr feature template for the widely used GTR model.

First, if {x, x'} ∈ X^unordered,dist. is a set of two distinct states, we define size({x, x'}) to be the set obtained by applying the map to the individual elements in {x, x'}. For example, for the amino acids Phenylalanine (F) and Serine (S), size({S, F}) = {Big, Micro}. Using this definition, an example of a reversible bivariate feature is given by:

    φ_1({x, x'}) := 1_{{Big,Big}}(size({x, x'})).

Note that we do not take directionality into account. We call this first feature "BigBig" for short. Continuing in this fashion, we can build two additional features similarly, "BigMicro" and "MicroMicro". This defines the reversible feature template Size. We construct the reversible feature template Polarity similarly.

Another feature template that we consider is the Gtr template, which takes an unordered pair of distinct states as input, obtained with the same recipe as above applied to the identity map. This is a special case of the fine features described in Section 3.1. The feature template PolaritySizeGtr consists of the concatenation of Gtr with PolaritySize. The features we have defined so far are all bivariate. For the univariate features (those concerning the stationary distribution rather than the dynamics), we always use fine features in this section.
The reason is that the stationary distribution over the 20 amino acids is relatively easy to estimate from typical dataset sizes.

3.2.2 Phylogenetic GLM-CTMC model

Because we want a reversible process, we base our evolutionary parameter model on the construction introduced in Section 3.2. Another source of unidentifiability is that multiplying all the branch lengths Δ_e by a constant while dividing all parameters θ_{x,x'} by the same constant does not change the transition probability matrix. It is therefore common to add a normalization constraint to the rate matrices in the phylogenetic setup, for example enforcing that the expected number of changes per site is one on a branch of unit length. We follow this convention, making the following amendment to the model of Section 3.2: first, we define a normalization coefficient β,

    β(w) = −(Σ_{x∈X} π_x(w) q^(rev)_{x,x}(w))^{-1},    (3.2.5)

from which we construct a normalized reversible rate matrix, q^(norm)_{x,x'}(w) := β(w) q^(rev)_{x,x'}(w).

3.3 Bayesian GLM chain GTR model

3.3.1 GTR rate matrix representation using Bayesian GLM feature functions

We defined the Bayesian GLM reversible rate matrix parameterization in Equations 3.2.1-3.2.4. In Chapter 5, because we propose a combined sampling scheme using different kernels to update the univariate weights w_u and the bivariate weights w_b respectively, we rewrite the Bayesian GLM rate matrix parameterization in Equations 3.3.1-3.3.4 using w_u and w_b separately. This notation is needed in Chapter 5.

    θ_{x,x'}(w_b) := exp{⟨w_b, φ({x, x'})⟩},    (3.3.1)
    π_x(w_u) := exp{⟨w_u, ψ(x)⟩ − A(w_u)},    (3.3.2)
    A(w_u) := log Σ_{x∈X} exp{⟨w_u, ψ(x)⟩},    (3.3.3)
    q^(rev)_{x,x'}(w) := θ_{x,x'}(w_b) π_{x'}(w_u).    (3.3.4)

The parameters of interest are w, where w = (w_u, w_b). In Equation 3.3.1, we introduce bivariate feature functions φ : X^unordered,dist. → R^{p_2}. In Equation 3.3.2, we introduce univariate feature functions ψ : X → R^{p_1}. We have p_1 + p_2 = p and w ∈ R^p.
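The normalization of Equation 3.2.5 can be sketched directly. The rate matrix and stationary distribution below are arbitrary placeholder values for illustration.

```python
import numpy as np

# Rescale a reversible rate matrix so that the expected number of
# substitutions per unit time at stationarity is one (Equation 3.2.5).
def normalize(Q, pi):
    beta = -1.0 / np.sum(pi * np.diag(Q))   # beta(w) = -(sum_x pi_x q_xx)^{-1}
    return beta * Q

Q = np.array([[-1.0, 0.6, 0.4],
              [0.3, -0.5, 0.2],
              [0.4, 0.4, -0.8]])
pi = np.array([0.2, 0.5, 0.3])
Qn = normalize(Q, pi)
# After normalization, the expected substitution rate is exactly one:
assert np.isclose(-np.sum(pi * np.diag(Qn)), 1.0)
```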
The weights corresponding to the univariate features ψ are denoted univariate weights w_u, and the weights related to the bivariate features φ are denoted bivariate weights w_b.

We discuss the connection between the Bayesian GLM framework and the GTR model in detail in Appendix C.1, where we prove that the Bayesian GLM framework can represent any rate matrix under the GTR parameterization. For simplicity, take DNA evolution as an example, where the state space is X = {A, C, G, T}. To help the reader understand the connection between the Bayesian GLM framework and the GTR model, we first show how to represent the GTR model for this specific case under the Bayesian GLM framework. We use w_u = (w_A, w_C, w_G, w_T) to obtain the stationary distribution; for example, π_A(w) = exp(w_A)/(exp(w_A) + exp(w_C) + exp(w_G) + exp(w_T)). The bivariate weights w_b = (w_AC, w_AG, w_AT, w_CG, w_CT, w_GT) are used to obtain the exchangeable parameters; for example, θ_AC = exp(w_AC).

Now we show how to represent the GTR model for an arbitrary state space. We briefly discuss this result since the notation will be needed to introduce our novel sparse model in Section 3.3.2. Recall that in the GTR model, Q is parameterized by the stationary distribution π = (π_1, π_2, . . . , π_|X|) and exchangeable parameters θ_{x,x'}, where q_{x,x'} = θ_{x,x'} π_{x'}. Under the reversibility assumption, we have θ_{x,x'} = θ_{x',x}, so in total there are p_2 = |X|(|X| − 1)/2 exchangeable parameters θ_{x,x'}. We define a map η : X^unordered,dist. → {1, 2, . . . , |X|(|X| − 1)/2} such that the ith element of the bivariate features φ^gtr : X^unordered,dist. → R^{p_2} is defined as:

    φ^gtr_i({x, x'}) := 1(η({x, x'}) = i),    (3.3.5)

where the ith feature φ^gtr_i({x, x'}) is equal to one if and only if the unordered pair of states {x, x'} is mapped to the ith exchangeable parameter via η({x, x'}).
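A minimal sketch of the fine GTR features of Equation 3.3.5. Here `make_eta` is a hypothetical helper that simply enumerates the unordered pairs to obtain one valid order function η; the thesis's order functions are discussed in Section 3.3.3.

```python
from itertools import combinations

# eta enumerates the |X|(|X|-1)/2 unordered pairs (1-based, as in the
# text), and phi^gtr_i is the one-hot indicator of the i-th pair.
def make_eta(states):
    pairs = list(combinations(states, 2))
    return {frozenset(p): i + 1 for i, p in enumerate(pairs)}

def phi_gtr(eta, pair, n_pairs):
    idx = eta[frozenset(pair)]          # eta({x,x'}) = eta({x',x})
    return [1 if idx == i + 1 else 0 for i in range(n_pairs)]

states = ["A", "C", "G", "T"]
eta = make_eta(states)
assert len(eta) == 6                    # |X|(|X|-1)/2 for DNA
assert eta[frozenset(("A", "C"))] == eta[frozenset(("C", "A"))]
assert sum(phi_gtr(eta, ("G", "T"), 6)) == 1
```

Using `frozenset` keys makes η symmetric by construction, which is exactly the reversibility requirement η({x, x'}) = η({x', x}) stated below.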
To ensure the reversibility of the rate matrix, we require that η({x, x'}) = η({x', x}). Under this definition of φ^gtr, any rate matrix can be represented using the Bayesian GLM rate matrix parameterization with feature functions φ^gtr.

3.3.2 Proposed Bayesian GLM chain GTR model

In Section 3.3.1, we introduced a function η({x, x'}) that embeds state pairs into the integers; we call such an η an "order function." Here we use order functions to parameterize the exchangeable rate of a pair of states such that it depends on the biological properties of the current pair of states and of the nearest neighbour pair under the η-provided order. An example of this is shown in Equation 3.3.7. Thus, we would like to find an order on the pairs of distinct states that is biologically meaningful, so that neighbouring pairs can share statistical strength and the structure of the model can bring computational ease. One motivation for the parameterization of the Bayesian GLM chain GTR model is its identifiability. In the context of phylogenetics, a model is non-identifiable if different sets of parameters, including tree topologies, branch lengths and evolutionary parameters, can produce the same likelihood. Zhao (2013) discussed the identifiability of certain DNA substitution models with rate variation among sites under a discrete distribution. Allman et al. (2008) provided the first proof of the identifiability of the GTR model with continuous rate variation among sites under a Gamma distribution. A normalized version of the Bayesian GLM chain GTR model can be proved to be identifiable following a similar proof given by Allman et al.
(2008).

Another motivation for the model proposed in this section stems from its computational simplicity: it ensures that the induced graphical model is strongly sparse (as defined in Section 5.2.1), and as a consequence we can build an efficient inference algorithm based on that sparsity structure (see Section 5.3 for details). Take protein evolution as an example. Most frequent amino acid exchanges take place between residues with similar physicochemical properties. Under the Bayesian GLM chain GTR model proposed below, we assume that mutations between two pairs of amino acids that share one amino acid are expected to have similar exchangeable rates if the physicochemical properties of the other amino acids in the two pairs are similar. We would like to find an ordering η({x, x'}) such that amino acid pairs that are nearest neighbours share the most similar physicochemical properties. The Nearest Neighbour Pairwise Amino Acid Ordering (NNPAAO) Algorithm (described in Appendix D.2), proposed in Section 3.3.3, provides an ordering of amino acid pairs such that neighbouring pairs of amino acids share more similarity than non-neighbouring pairs. We denote it ηdist({x, x'}), since the order is determined based on the Euclidean distance between amino acids defined in Equation 3.3.8. Given this ordering, neighbouring pairs share more biological similarity than non-neighbouring pairs.

Based on the ordering structure (and hence neighbourhood structure) provided by η({x, x'}), we define the following sparsity-inducing model, which we call the Bayesian GLM chain GTR model:

    φ^chain_i({x, x'}) := 1(η({x, x'}) ∈ {i, i + 1}),    (3.3.6)

where i = 1, 2, . . . , |X|(|X| − 1)/2. If we set η({x, x'}) = ηdist({x, x'}), the intuition of the Bayesian GLM chain GTR model is that the value of each exchangeable parameter θ_{x,x'} depends on {x, x'} and its neighbour pair, denoted {x'', x'''}, which satisfies η({x'', x'''}) − η({x, x'}) = 1.
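The effect of Equation 3.3.6 can be sketched numerically: the inner product ⟨w_b, φ^chain({x, x'})⟩ picks up the weight at the pair's own rank η({x, x'}) and the weight at the preceding rank. The helper below is an illustrative sketch of that computation.

```python
import numpy as np

# Chain GTR exchangeable parameter as a function of the pair's rank
# under eta, following the features of Equation 3.3.6.
def theta_chain(wb, rank):
    """rank = eta({x, x'}), taking values 1..len(wb)."""
    if rank == 1:
        return np.exp(wb[0])
    return np.exp(wb[rank - 1] + wb[rank - 2])

wb = np.array([0.2, -0.1, 0.4])
assert np.isclose(theta_chain(wb, 2), np.exp(-0.1 + 0.2))
```

Because consecutive ranks share a weight, neighbouring pairs in the η ordering are coupled, which is the source of the chain-shaped sparsity exploited in Chapter 5.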
Neighbouring pairs can thus share statistical strength. Equivalently, we can define the Bayesian chain GTR model through the exchangeable parameters θ_{x,x'}(w_b) by plugging in the definition of φ^chain_i({x, x'}) from Equation 3.3.6:

    θ_{x,x'}(w_b) = exp{⟨w_b, φ^chain({x, x'})⟩}
                  = exp(w_{b,η({x,x'})} + w_{b,η({x,x'})−1}),  if η({x, x'}) = 2, 3, . . . , |X|(|X| − 1)/2,
                  = exp(w_{b,η({x,x'})}),  if η({x, x'}) = 1.    (3.3.7)

For simplicity, we focus on the simple example of chains, but the graph does not need to be a chain; other sparse graphs would also be suitable. We provide a general characterization of the running time for arbitrary factor graphs with strong sparsity structure in Section 5.2.2. The chain GTR model is general, since it is able to represent all rate matrices; we prove this in Appendix C.

3.3.3 Examples of order functions in the Bayesian GLM chain GTR model

In Section 3.3.2, we provided intuition behind the chain GTR model, and described ηdist({x, x'}) as an order function such that amino acid pairs that are nearest neighbours share the most similar physicochemical properties. In this section, we describe in detail how ηdist({x, x'}) works and the general idea of the NNPAAO algorithm.

Following He et al. (2011), we determine the ordering of pairs of amino acids with respect to their closeness based on their pairwise Euclidean distance, defined by a subset of their physicochemical properties. The distance is non-negative, symmetric, and satisfies the triangle inequality, and is given by (Grantham, 1974):

    D_ij = (α(c_i − c_j)^2 + β(p_i − p_j)^2 + γ(v_i − v_j)^2)^{1/2},    (3.3.8)

where c = composition, p = polarity, v = molecular volume, and α, β, γ are defined in Grantham (1974). A larger pairwise distance indicates less similarity between a pair of amino acids, where mutations between them may be deleterious. On the contrary, a smaller distance indicates more similarity between them.
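Equation 3.3.8 is straightforward to compute. In the sketch below, the property triples and the weights alpha, beta, gamma are placeholders; the actual values are tabulated in Grantham (1974).

```python
import math

# Grantham's (1974) distance between two amino acids, each described by
# (composition c, polarity p, molecular volume v).
def grantham_distance(a, b, alpha=1.0, beta=1.0, gamma=1.0):
    c_i, p_i, v_i = a
    c_j, p_j, v_j = b
    return math.sqrt(alpha * (c_i - c_j) ** 2
                     + beta * (p_i - p_j) ** 2
                     + gamma * (v_i - v_j) ** 2)

# Symmetry holds by construction, and identical inputs give distance zero:
x, y = (1.42, 9.2, 32.5), (0.0, 5.2, 132.3)
assert math.isclose(grantham_distance(x, y), grantham_distance(y, x))
assert grantham_distance(x, x) == 0.0
```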
The pairwise Euclidean distances between amino acids are given in Table 3 of He et al. (2011); we reproduce this table in Appendix D.1 of the thesis.

Given the distances in Appendix D.1, the ordering of pairs of amino acids with respect to their closeness is determined by our proposed NNPAAO Algorithm 9 in Appendix D.2. The output ordering of the amino acid pairs {x, x'} from Algorithm 9 is ηdist({x, x'}).

Some notation is introduced to help understand the algorithm. Denote by AminoAcidDist the pairwise Euclidean distances between amino acids shown in Appendix D.1, by X the state space of 20 amino acids, and by X^unordered,dist. the set of unordered pairs of distinct amino acids, where |X| = 20 and |X^unordered,dist.| = 190. The algorithm outputs a dictionary AminoAcidsPairRank with keys from X^unordered,dist.. The value associated with each key represents the ordering of that amino acid pair, which is ηdist({x, x'}). We illustrate the general idea of the NNPAAO algorithm in Figure 3.1. The ordering ηdist({x, x'}) is obtained each time an amino acid pair is chosen as the current pair in the NNPAAO algorithm.

Figure 3.1: A simple explanation of the general idea of the NNPAAO algorithm.

To summarize, in this chapter we have proposed a novel Bayesian GLM framework to model rate matrices of CTMCs. The new framework allows for parameter tying, so that biological information such as physicochemical properties of amino acids can be exploited to construct rate matrices for protein evolution. At the end of this chapter, we proposed a model which ensures that the strong sparsity condition of Definition 5.2.2 is satisfied; this is useful for our proposed sampling algorithm in Chapter 5. One computational challenge for the Bayesian GLM framework is that there can be strong correlation among certain parameters under the posterior distribution, so that classical MCMC algorithms can be highly ineffective.
In the next chapter, we propose a novel sampling algorithm that combines Adaptive Hamiltonian Monte Carlo (AHMC) with the uniformization technique, which allows us to sample efficiently under the Bayesian GLM framework.

Chapter 4
Inference via HMC

In Chapter 3, we described our Bayesian Generalized Linear Model (GLM) framework for constructing rate matrices of Continuous Time Markov Chains (CTMCs), which allows us to build rate matrices based on biological information about the process, such as the physicochemical properties of amino acids. However, one important challenge with the Bayesian GLM framework for CTMCs is that posterior sampling via classical Markov Chain Monte Carlo (MCMC) techniques can be highly ineffective, since the rich parameterization of the GLM-CTMC model can create strong correlations among different components of the parameters under the posterior distribution. One of our main contributions in this chapter is a computational remedy: a novel, computationally efficient sampling scheme (described in Section 4.4.2) that combines the Adaptive Hamiltonian Monte Carlo (AHMC) algorithm (Wang et al., 2013a) with an auxiliary variable technique called uniformization (Nielsen, 2002; Minin and Suchard, 2008; Rodrigue et al., 2008; Hobolth and Stone, 2009). Our proposed sampling algorithm allows us to sample efficiently under our Bayesian GLM framework for CTMCs. We consider the HMC algorithm rather than other MCMC algorithms since it uses the gradient of the log of the target density to sample efficiently from the posterior. Neal (1993), Liu (2001), and Neal (2010) introduced the HMC sampling algorithm to mainstream statistical computing, and recent advances have demonstrated its extensive success in many scientific applications (Wang et al., 2013b; Hoffman and Gelman, 2014a; Kramer et al., 2014; Betancourt and Girolami, 2015).
However, the computational performance of HMC is very sensitive to the values of its tuning parameters. Thus, we adopt the AHMC method of Wang et al. (2013a), which uses Bayesian optimization to guide the selection of good tuning parameter values, ensuring that the algorithm is computationally efficient in the phylogenetic context. We apply Wang et al. (2013a)'s method as-is and do not make improvements over it. Our main contribution in this work consists in the first application of HMC to phylogenetics, to the best of our knowledge, and in augmenting the performance of HMC by combining it with uniformization and adaptive parameter tuning in the context of CTMC estimation.

This chapter is organized as follows. In Section 4.1, we provide an overview of the target posterior density that we wish to sample from. In Sections 4.2-4.3, we review Wang et al. (2013b)'s work on HMC and AHMC. Our contribution is in Sections 4.4-4.5. To summarize, we make the following novel contributions:

• As we shall see in Section 4.4.2, we develop an efficient sampler by combining AHMC with an auxiliary variable construction through the uniformization technique for our proposed GLM-CTMC model described in Chapter 3.

• Without the auxiliary variable construction, as shown in Section 4.4.1, one HMC iteration takes O(L|X|^5). Our proposed sampling algorithm in Section 4.4.2 achieves a significant computational gain: with the auxiliary variables, one iteration of HMC takes only O(L|X|^2 + |X|^3), where L represents the same HMC tuning parameter in both expressions. We prove this result in Section 4.4.3.

• Our proposed sampling algorithm is applicable not only to general non-reversible GLM-CTMC models, but also to reversible models, where the reversibility assumption for the Markov process is made. We show this result in Section 4.4.4.
• Because HMC requires function and gradient evaluations at each iteration, we provide the analytical form of the function and gradient evaluations given the augmented auxiliary variables in Section 4.4.2. These analytical forms also contribute to boosting the computational efficiency of our sampler by making the gradient computation simpler and faster.

• As we shall see in Section 4.4.5, we prove the correctness of our sampling algorithm, which indicates that it converges to the desired posterior distribution.

• Finally, in Section 4.5, we discuss how to extend our sampling algorithm to phylogenetic trees.

4.1 Overview of posterior simulation

This section addresses the problem of approximating the density of the posterior distribution, given by:

    f_{w|y}(w|y) ∝ f_w(w) f_{y|Q}(y|Q^(nr)(w)).    (4.1.1)

Recall that f_w(w) represents the prior distribution for the weights of Section 3.1, which we use to model the rate matrix according to the Bayesian GLM rate matrix representation. In Equation 4.1.1, f_{y|Q}(y|Q^(nr)(w)) represents the likelihood of the observations y given an unnormalized rate matrix Q^(nr)(w) as the parameter. The observations y can have various structures depending on the specific application. For example, in the phylogenetic context, y can represent a few observed biological sequences as described in Section 2.2.1, while in a chain-shaped graphical model, y can represent the partially observed paths of Section 2.1.3, or a partially observed time series evolving according to a CTMC.

In principle, approximating the posterior density f_{w|y}(w|y) could be approached using the existing Metropolis-Hastings-based methods described in Section 2.3.
However, the rich parameterization of the GLM-CTMC model can create strong correlations among the components of w under the posterior distribution, which is problematic since naive Metropolis-Hastings methods tend to struggle in the face of high-dimensional correlated random vectors (Gilks and Roberts, 1996). We show an example of this poor behaviour in Section 6.1.2.

These correlations are especially severe in overcomplete representations, i.e. when there are (x_0, x'_0) ∈ X^distinct and w_0 ∈ R^p, w_0 ≠ 0, such that ⟨w_0, ϕ(x_0, x'_0)⟩ = 0. When there is a large number of features under consideration (for example, the combination of coarse and fine features described in the previous section) or when there is a rich covariate structure, it becomes increasingly impractical to automatically simplify overcomplete feature sets.

4.2 Hamiltonian Monte Carlo

The HMC algorithm is an MCMC sampling method for drawing random samples from a probability distribution for which direct sampling is difficult. These random samples come from a Markov chain that has the desired distribution as its stationary distribution; in Bayesian statistics, the desired distribution is often the target posterior distribution. In contrast to classical MCMC methods such as the Metropolis-Hastings algorithm, HMC takes into consideration the gradient of the log of the target posterior density, and thus reduces the correlation between successive samples. The sequence of random samples obtained from HMC can be used for approximating the target distribution or computing integrals.

HMC was first proposed in the 1980s by Duane et al. (1987) to tackle computational problems in lattice quantum chromodynamics. HMC is now widely known as a powerful and efficient sampling algorithm, having been demonstrated to outperform many existing MCMC algorithms, especially on problems with high-dimensional, continuous, and correlated distributions (Chen et al., 2001; Neal, 2010).
HMC has become a popular computational tool with the development of Stan, a probabilistic programming language for statistical inference (Stan Development Team, 2014). Stan implements gradient-based MCMC algorithms including HMC and the No U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014b), a variant of HMC. To compute the gradient of the model, Stan takes advantage of reverse-mode automatic differentiation (Carpenter et al., 2015). However, Stan is not applicable to our setting, since we require the computation of the matrix exponential and its gradient, and certain implementations of automatic differentiation (Hogan, 2014) do not handle some transcendental functions well, in particular the matrix exponential function. Moreover, the developers of Stan do not plan to implement this functionality in the near future.

4.2.1 Probability density functions and canonical distributions

Before we introduce the HMC algorithm, we first briefly review the relationship between the target posterior density function and the probability density function of the canonical distribution of a certain physical system. This is also discussed in Neal (2010).

Given some energy function U(w) of the physical system, the probability density function of a canonical distribution over w is:

P(w) = (1/A) exp(−U(w)),    (4.2.1)

where A is the normalizing constant such that the function P(w) integrates to one. We can represent the distribution of interest with density g(w) using the canonical distribution density function by setting the energy function U(w) = −log g(w), where w represents the variables of interest.

4.2.2 The HMC algorithm

The Hamiltonian is an energy function, which defines a joint distribution for "position" w and "momentum" m given as:

P(w, m) = (1/A) exp(−H(w, m)).    (4.2.2)

In HMC, it is assumed that H(w, m) = U(w) + K(m). Under this assumption, the joint density is

P(w, m) = (1/A) exp(−U(w)) exp(−K(m)).    (4.2.3)

The posterior distribution of the model parameters is usually the target distribution in Bayesian inference. Thus, when using HMC, the posterior distribution is usually expressed using a canonical distribution by setting the potential energy function as

U(w) = −log(f(w) f(y|w)),

where f(w) represents the prior density and f(y|w) represents the likelihood function given w.

HMC requires as input the specification of: (i) a potential energy function U, which is the negative logarithm of the target density we wish to sample from (up to an additive constant); in our CTMC case,

U(w) := −log f_{w|y}(w|Q^{(nr)}(w))    (4.2.4)
      = −log f_w(w) − log f_{y|Q}(y|Q^{(nr)}(w)) + constant,    (4.2.5)

and (ii) a kinetic energy function, most typically K(m) = (1/2) m^T M^{−1} m, with momentum vector m and a positive definite mass matrix M. For standard HMC, the mass matrix is set to the identity matrix I. We review the method in Algorithm 2 and defer the technical details of HMC to existing work (Neal, 2010).

Note that the HMC algorithm uses the gradient of the joint log density to guide the exploration of the parameter space. In addition, HMC requires the selection of two free parameters: a step-size ε > 0 and a number of leapfrog steps L ∈ {1, 2, . . . }.

Algorithm 2 HMC Algorithm
1: Given: L, ε, and w^{(1)}.
2: for i = 1, 2, · · · do
3:   Sample m^{(i)} ∼ N(0, I) and L^{(i)} ∼ Unif({1, . . . , L})
4:   Let w^{(i,0)} := w^{(i)} and m^{(i,0)} := m^{(i)} − (ε/2) ∇_w U(w^{(i,0)})
5:   for j = 1, 2, · · · , L^{(i)} do
6:     w^{(i,j)} := w^{(i,j−1)} + ε m^{(i,j−1)}
7:     if j < L^{(i)}: m^{(i,j)} := m^{(i,j−1)} − ε ∇_w U(w^{(i,j)})
8:   end for
9:   m^{(i,L^{(i)})} := m^{(i,L^{(i)}−1)} − (ε/2) ∇_w U(w^{(i,L^{(i)})})
10:  Draw u ∼ Unif(0, 1)
11:  if u < min[1, exp{U(w^{(i)}) + K(m^{(i)}) − U(w^{(i,L^{(i)})}) − K(m^{(i,L^{(i)})})}] then
12:    Let (w^{(i+1)}, m^{(i+1)}) := (w^{(i,L^{(i)})}, m^{(i,L^{(i)})})
13:  else
14:    Let (w^{(i+1)}, m^{(i+1)}) := (w^{(i,0)}, m^{(i,0)})
15:  end if
16: end for
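To make the leapfrog recursion of Algorithm 2 concrete, the following is a minimal, illustrative sketch of one HMC iteration with an identity mass matrix; it is not the thesis implementation, and `U` and `grad_U` stand for user-supplied energy and gradient functions:

```python
import numpy as np

def hmc_step(w, U, grad_U, eps, L_max, rng):
    """One HMC iteration with a randomized number of leapfrog steps."""
    L = rng.integers(1, L_max + 1)           # L^(i) ~ Unif({1, ..., L_max})
    m = rng.standard_normal(w.shape)         # m^(i) ~ N(0, I)
    w_new, m_new = w.copy(), m.copy()
    m_new -= 0.5 * eps * grad_U(w_new)       # initial momentum half-step
    for j in range(L):
        w_new += eps * m_new                 # full position step
        if j < L - 1:
            m_new -= eps * grad_U(w_new)     # full momentum step
    m_new -= 0.5 * eps * grad_U(w_new)       # final momentum half-step
    # Metropolis correction with H(w, m) = U(w) + m.m / 2
    log_accept = (U(w) + 0.5 * m @ m) - (U(w_new) + 0.5 * m_new @ m_new)
    if np.log(rng.uniform()) < log_accept:
        return w_new
    return w
```

For instance, with U(w) = (1/2)‖w‖² and grad_U(w) = w, repeated calls target a standard normal distribution.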
The accepted guidance is to choose the step-size so as to ensure that the sampler's rejection rate is moderate. It is also preferable to have a large L, since this reduces the random walk behaviour of the sampler (Neal, 2010), but too large an L results in unnecessary computation. In HMC, ε and L are tuning parameters. When using HMC to sample from the posterior density, once the values of ε and L are chosen, they are fixed at the selected values during the sampling process.

HMC has not been widely adopted in practice, due principally to the sensitivity and difficulty of choosing its tuning parameters ε and L. In fact, tuning HMC has been reported by many experts to be more difficult than tuning other MCMC methods (Ishwaran, 1999b; Neal, 2010). Fortunately, a few recent methods address the automated tuning of HMC. Two notable approaches are: NUTS, by Hoffman and Gelman (2014b), which aims to find the best parameter settings by tracking the sample path and preventing HMC from retracing its steps along this path; and Riemann Manifold Hamiltonian Monte Carlo (RMHMC), by Girolami and Calderhead (2011), which exploits the Riemannian geometry of the problem. In this chapter, we follow the approach of adapting Markov chains using Bayesian optimization (Mahendran et al., 2012; Hamze et al., 2013; Wang et al., 2013a) to improve the convergence rate of HMC. In Wang et al. (2013a), the authors demonstrate that this approach outperforms NUTS, and that RMHMC is more accurate, but significantly more expensive; RMHMC also makes additional assumptions, namely that higher moments can be efficiently computed. We apply the Bayesian optimization method of Wang et al. (2013a) to tune ε and L.

As in Neal (2010), we consider a slight variation of the HMC algorithm: instead of performing L leapfrog steps at each iteration i, we perform a random number of leapfrog steps, chosen from the discrete uniform distribution over {1, · · · , L}, i.e. L^{(i)} ∼ Unif({1, . . . , L}). This approach amounts to using a mixture of L different HMC transition kernels, thus preserving detailed balance (Andrieu et al., 2003). We sample L^{(i)} ∼ Unif({1, . . . , L}) since Neal has pointed out that ergodicity can fail if the L leapfrog steps in a trajectory produce an exact periodicity for some functions of the parameters, and that this potential non-ergodicity can be avoided by randomly choosing ε or L (or both) from some fairly small interval (Mackenze, 1989; Neal, 2010).

We denote the distribution induced by sampling one HMC iteration (Lines 3 to 15 in Algorithm 2) by:

W^{(i+1)} | W^{(i)} ∼ HMC(· | W^{(i)}; U, L, ε).    (4.2.6)

Since we resample an independent momentum m at each step, the momentum can be omitted from the above notation.

4.3 Adaptive Hamiltonian Monte Carlo

In order to adapt the MCMC parameters L and ε for HMC, we need to (i) define an objective function and (ii) choose a suitable optimization method.

As pointed out by Pasarica and Gelman (2010), a natural objective function for adaptation is the asymptotic efficiency of the MCMC sampler, (1 + 2 Σ_{k=1}^∞ ρ_k)^{−1}, where ρ_k is the auto-correlation of the sampler at lag k. Despite its appeal, this measure is problematic because the higher order auto-correlations are hard to estimate. To circumvent this problem, they introduced an objective measure called the Expected Squared Jumping Distance (ESJD), ESJD(γ) := E_γ ‖W^{(i+1)} − W^{(i)}‖², where γ = (ε, L) denotes the set of parameters for HMC. In practice, this intractable expectation with respect to the Markov chain is approximated by an empirical estimator, as outlined in Pasarica and Gelman (2010).

The ESJD measure is efficient in situations where the higher order auto-correlations decrease monotonically with respect to ρ_1. However, it is not suitable for tuning HMC samplers, since by increasing the number of leapfrog steps one can almost always generate better samples.
What we need is a measure that also takes computing time into consideration. With this goal in mind, Wang et al. (2013a) introduced the following objective function: f(γ) := ESJD(γ)/√L. This function simply normalizes the ESJD by the square root of the number of leapfrog steps L, thus taking both statistical efficiency and computation into consideration.

Now that we are armed with an objective function, we need to address the issue of optimization. Bayesian optimization is an efficient gradient-free optimization tool well suited to this expensive black-box function problem. The normalized ESJD involves an intractable expectation that can be approximated by sample averages, where the samples are produced by running HMC for a few iterations. Each set of HMC samples for a specific set of tuning parameters γ ∈ Γ results in a noisy evaluation of the normalized ESJD: r(γ) = f(γ) + e, where we assume that the measurement noise is Gaussian, e ∼ N(0, σ_η²).

We approach this problem using standard Bayesian optimization methods based on Gaussian processes. To ensure that the adaptation does not alter HMC's invariant distribution, we stop the adaptation process after a fixed number of steps, N_adapt = 3000. After this horizon, we fix the maximizers ε* and L* estimated by the Bayesian optimization algorithm. See Algorithm 3 and Section 4.3.1 for details.

Algorithm 3 AHMC
1: Given: Γ, m, k, α, and γ_1.
2: for i = 1, 2, . . . do
3:   Run HMC for m iterations with γ_i = (ε_i, L_i) (Algorithm 2).
4:   Obtain the objective function value r_i using the drawn samples.
5:   Augment the data D_i = {D_{i−1}, (γ_i, r_i)}.
6:   if r_i > sup_{j∈{1,··· ,i−1}} r_j then
7:     s = α r_i
8:   end if
9:   Draw u ∼ Unif([0, 1])
10:  Let p_i = (max{i − k + 1, 1})^{−0.5}, with k ∈ N_+.
11:  if u < p_i then
12:    γ_{i+1} := arg max_{γ∈Γ} u(γ, s | D_i).
13:  else
14:    γ_{i+1} := γ_i
15:  end if
16: end for
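The normalized ESJD objective f(γ) = ESJD(γ)/√L can be estimated directly from a batch of consecutive HMC samples; the following is a small sketch, with the function name ours:

```python
import numpy as np

def normalized_esjd(samples, L):
    """Empirical estimate of ESJD(gamma) / sqrt(L) from consecutive samples.

    ESJD is approximated by the average squared jump between successive
    samples; dividing by sqrt(L) accounts for the cost of L leapfrog steps.
    """
    samples = np.asarray(samples, dtype=float)
    jumps = np.diff(samples, axis=0)                  # W^(i+1) - W^(i)
    esjd = float(np.mean(np.sum(jumps ** 2, axis=1)))
    return esjd / np.sqrt(L)
```

Each call with a fresh batch of samples yields one noisy evaluation r(γ) fed to the Bayesian optimizer.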
In particular, we discuss the additional computational cost of tuning HMC in Section 4.3.2.

4.3.1 Bayesian optimization for HMC

To provide additional background, we give here a detailed description of the Bayesian optimization method we used to adaptively tune the parameters of the HMC algorithm.

Following the standard Bayesian optimization methodology, we set Γ to be a box constraint such that

Γ = {(ε, L) : ε ∈ [b_l^ε, b_u^ε], L ∈ [b_l^L, b_u^L]}

for some interval boundaries b_l^ε ≤ b_u^ε and b_l^L ≤ b_u^L. The parameter L is discrete. The parameter ε is continuous, but since it is one-dimensional, we can discretize it using a very fine grid.

Since the true objective function is unknown, we specify a zero-mean Gaussian process prior over it:

f(·) ∼ GP(0, k(·, ·)),

where k(·, ·) is the covariance function. Given a vector of noisy evaluations of the objective function, r̄_i = {r_k}_{k=1}^i, evaluated at the points {γ_k}_{k=1}^i, we form the dataset D_i = ({γ_k}_{k=1}^i, {r_k}_{k=1}^i). Using Bayes rule, we arrive at the posterior predictive distribution over the unknown objective function (see Rasmussen and Williams (2006) for more details):

f | D_i, γ ∼ N(μ_i(γ), σ_i²(γ)),
μ_i(γ) = k̄^T (K + σ_η² I)^{−1} r̄_i,
σ_i²(γ) = k(γ, γ) − k̄^T (K + σ_η² I)^{−1} k̄,

where K is the i × i matrix with entries K_{jl} = k(γ_j, γ_l), k̄ = [k(γ, γ_1) . . . k(γ, γ_i)]^T, and r̄_i = [r_1 . . . r_i]^T.

We adopt an automatic relevance determination covariance function with k(γ_i, γ_j) = exp(−(1/2) γ_i^T Σ^{−1} γ_j), where Σ is a positive definite matrix. We set Σ = diag([α(b_u^ε − b_l^ε)]², [α(b_u^L − b_l^L)]²), where α = 0.2.

The Gaussian process simply provides a surrogate model for the true objective. The surrogate can be used to efficiently search for the maximum of the objective function. In particular, it enables us to construct an acquisition function u(·) that tells us which parameters γ to try next.
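The posterior predictive formulas above, combined with an upper-confidence-bound acquisition of the kind described next, can be sketched as follows; the function names and the squared-exponential example kernel in the usage below are ours, and the scale parameter s is omitted for simplicity:

```python
import numpy as np

def gp_posterior(gammas, rewards, grid, kernel, noise_var):
    """Gaussian-process posterior mean and variance at candidate tuning
    parameters, following mu_i and sigma_i^2 above."""
    K = kernel(gammas, gammas) + noise_var * np.eye(len(gammas))
    K_inv = np.linalg.inv(K)
    k_star = kernel(grid, gammas)        # row g: k-bar for grid point g
    mu = k_star @ K_inv @ rewards
    var = kernel(grid, grid).diagonal() - np.einsum(
        "ij,jk,ik->i", k_star, K_inv, k_star)
    return mu, var

def ucb(mu, var, p_i, beta):
    """UCB-type acquisition: u = mu + p_i * beta^(1/2) * sigma."""
    return mu + p_i * np.sqrt(beta) * np.clip(var, 0.0, None) ** 0.5
```

The next γ is then the maximizer of `ucb` over the (discretized) box Γ.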
The acquisition function uses the Gaussian process posterior mean to predict regions of potentially higher objective values (exploitation). It also uses the posterior variance to detect regions of high prediction uncertainty (exploration). Moreover, it effectively trades off exploration and exploitation. Different acquisition functions have been proposed in the literature, including Mockus (1982); Srinivas et al. (2010); Hoffman et al. (2011). We adopt a variant of the Upper Confidence Bound (UCB) (Srinivas et al., 2010), modified to suit our application:

u(γ, s | D_i) = μ_i(γ, s) + p_i β_{i+1}^{1/2} σ_i(γ).

As in standard UCB, we set β_{i+1} = 2 log((i+1)^{d/2+2} π² / (3δ)), where d is the dimension of Γ and δ is set to 0.1. The parameter p_i ensures that the diminishing adaptation condition of Roberts and Rosenthal (2007) is satisfied. Specifically, we set p_i = (max{i − k + 1, 1})^{−0.5} for some k ∈ N_+. As p_i goes to 0, the probability of Bayesian optimization adapting γ vanishes. For efficiency, we also impose a hard limit N_adapt on the number of adaptation steps. A detailed convergence analysis is presented in Wang et al. (2013a).

The acquisition function also includes a scalar scale-invariance parameter s, such that μ_i(γ, s) = k̄^T (K + σ_η² I)^{−1} r̄_i s. This parameter is estimated automatically so as to rescale the rewards to the same range each time we encounter a new maximal reward.

4.3.2 Computational cost of AHMC

Gaussian processes require the inversion of the covariance matrix and, hence, have complexity O(i³), where i is the number of iterations. As Wang et al. (2013a) have pointed out, when tuning the number of leapfrog jumps L and the step size ε, the goal is less about finding the absolute global best combination of the two parameters given finite computational resources. Recall that the parameter p_i acts as an annealing parameter. Given no constraint on time, a slow annealing schedule could be used to explore the parameter space for ε and L fully.
However, under time constraints, a faster annealing schedule is adopted. The Bayesian optimization method then proposes new values of ε and L with vanishing probability, so the number of unique points in our Gaussian process grows sub-linearly with the number of iterations. This slow growth makes it possible to adopt kernel specification techniques, as proposed by Engel (2005), to drastically reduce the computational cost without suffering any loss in accuracy. Following Wang et al. (2013a), in all our experiments we set α = 4, k = 100, and m = B/k, where B is the number of burn-in samples. We use the same set of parameters throughout our experiments.

4.4 HMC for GLM-CTMC models

4.4.1 Running time of previous CTMC gradient computation methods

The remaining ingredient required by HMC is the gradient of the logarithm of the joint density:

∇_w U(w) = −∇_w log f_w(w) − ∇_w log f_{y|Q}(y|Q^{(nr)}(w)).

The first term is generally trivial; for example, when f_w = f_nor, the first term simplifies to κw. The second term, which involves the gradient of matrix exponentials, is more challenging but can still be computed numerically. In previous work, this has been done using a spectral decomposition (Jennrich and Bright, 1976), as automatic differentiation does not apply readily to this problem.

However, computing L leapfrog steps using the method described in Jennrich and Bright (1976) is computationally prohibitive for the models we consider in this thesis. More precisely, the method of Jennrich and Bright (1976) has a term of the form |X|³ p L in its asymptotic running time. Since much of the previous work has been motivated by frequentist estimators, the number of parameters p has generally been assumed not to be dominant. In contrast, we expect a typical use case of our model to include Θ(|X|²) parameters, since our experimental results show that including all fine features is beneficial in practice (Figure 6.3).
This means that for large state spaces, one MCMC iteration has a running time of Θ(L|X|⁵) if we were to use existing methods for computing the gradient. This is true even after taking into account the savings coming from reusing spectral decompositions and from careful ordering of vector-matrix product operations. We first provide the details of how a running time of Θ(L|X|⁵) arises, based on the work of Kenney and Gu (2012).

We review here the analysis of the running time of previous methods for computing the gradient of individual entries of a rate matrix from a partially observed CTMC. A good review of the state-of-the-art method can be found in Kenney and Gu (2012).⁴ Kenney and Gu (2012) first describe in full the algorithm calculating all first and second derivatives for each site of a given tree for the phylogenetic likelihood, based on the pruning algorithm. They then summarize the running time, on their page 25, as O(K(p + |E|)|X|²) (translating their notation to ours).⁵

It is worth noting that although the running time is given by O(K(p + |E|)|X|²), which is quadratic in |X|, close inspection of the analysis on their page 40, step 2, reveals that a term of O(p|X|³) is absorbed into this overall running time. Thus, when the number of parameters is of order O(|X|²), previous methods scale as O(|X|⁵) instead of O(|X|⁴), because of this absorbed O(p|X|³) term.
Moreover, the entire algorithm needs to be restarted from scratch after each of the L leapfrog steps, giving an overall running time of O(L|X|⁵).

The term O(p|X|³) arises from the multiplication of the matrix of derivatives of the rate matrix with respect to each parameter with a fixed dense matrix coming from the eigendecomposition (see the definition of N_β in Theorem 1 of Kenney and Gu (2012)). It takes into account the caching of the eigendecomposition. Absorbing the O(p|X|³) term into O((p + |E|)|X|² K) is reasonable given that the number of sites K is often larger than the number of CTMC states |X| in practice.

Footnote 4: Although Kenney and Gu (2012) focuses on Hessian computation, it also includes a detailed running time analysis of gradient computation, found in their Appendix 2.

Footnote 5: We present the algorithm in the context of a phylogenetic tree with K sites. We recover the case of a time series as a special case by setting K = 1.

4.4.2 HMC with auxiliary variable construction for GLM-CTMC models

In this section, we introduce a methodology allowing us to use HMC while avoiding the computational difficulties described above. The key advantage of our method is that it can take advantage of the sparsity in the feature vectors ϕ(x, x'). In contrast, previous work has focussed on sparsity of the rate matrix.

If we let T_0 denote the running time of the sum-product algorithm (reviewed in Section 2.1), and s := Σ_{(x,x')∈X^{distinct}} Σ_{m=1}^p 1[ϕ_m(x, x') ≠ 0], the total number of non-zero entries across all possible features, then performing L leapfrog steps takes time O(Ls + T_0) using our method. See Section 4.4.3 for the justification of this running-time analysis. In all the examples we consider in Chapter 6, s ∈ O(|X|²).

Assume that w^{(i−1)} denotes the state of the MCMC algorithm at the previous iteration.
Our method computes the next state w^{(i)} as follows.

Precomputation: the most expensive parts of the calculations do not need to be recomputed at each leapfrog step within the current MCMC iteration.

1. Construct the HMM described in Section 2.1.3 and, using standard graphical model inference algorithms (Kschischang et al., 2006), compute the forward recursions on the HMM from the previous step, followed by a backward sampling step to impute the values of the latent time series at the discrete set of observation times:

D^{(i)} | Y, Q^{(nr)}(W^{(i)}) ∼ p_{d|y,Q}(· | Y, Q^{(nr)}(W^{(i)})).    (4.4.1)

2. For each pair of consecutive imputed latent states, d_v^{(i)}, d_{v+1}^{(i)}, simulate a path X_e^{(i)} of length Δ_e given the two end points e = (v, v+1):

X_e^{(i)} | (X_{t_v^{(obs)}} = d_v^{(i)}, X_{t_{v+1}^{(obs)}} = d_{v+1}^{(i)}) ∝∼ f_{x|Q,Δ}(· | Q^{(nr)}(W^{(i)}), Δ_e).    (4.4.2)

Because of the conditioning on the end-point d_{v+1}^{(i)}, the density on the right of Equation (4.4.2) is only specified up to a normalization constant. We address this issue using techniques from the substitution mapping literature (Nielsen, 2002; Minin and Suchard, 2008; Rodrigue et al., 2008; Hobolth and Stone, 2009; Tataru and Hobolth, 2011; Rao and Teh, 2013; Irvahn and Minin, 2014), in which a range of methods for end-point simulation have been developed. The method we use is based on uniformization and caching of intermediate quantities. We provide a detailed description of the cached uniformization method in Appendix E.

3. Compute the sum of the sufficient statistics produced in the previous step, Z^{(i)} := Σ_e z(X_e^{(i)}). To be specific, the sum of the sufficient statistics refers to the aggregated values of the total number of paths initiated at each state, the total time spent in each state, and the transition counts for each pair of distinct states in the state space, across all edges of a tree or all intervals of the observed time series.

HMC sampling: perform an HMC step,

W^{(i+1)} | W^{(i)}, Z^{(i)} ∼ HMC(· | W^{(i)}; U_{Z^{(i)}}, L*, ε*),    (4.4.3)

with parameters ε* and L* tuned using AHMC, and a function evaluation and gradient (assuming a normal prior on the weights) given by:

U_{z^{(i)}}(w) = (κ/2)‖w‖₂² − Σ_{x∈X} h_x^{(i)} q_{x,x}^{(nr)}(w) − Σ_{(x,x')∈X^{distinct}} c_{x,x'}^{(i)} log(q_{x,x'}^{(nr)}(w)) + const,    (4.4.4)

∇_w U_{z^{(i)}}(w) = κw + Σ_{(x,x')∈X^{distinct}} ϕ(x, x')(h_x^{(i)} q_{x,x'}^{(nr)}(w) − c_{x,x'}^{(i)}).    (4.4.5)

Crucially, note that the inner loop, namely the computation of Equation (4.4.5), does not involve re-computation of the forward recursions and matrix exponentials (see Section 4.4.3). This sampling method converges to the desired stationary distribution, namely the distribution with density f_{w|y}(w|y) given in Equation (4.1.1). The proof, which is based on an auxiliary variable construction, can be found in Section 4.4.5.

4.4.3 Running time for the gradient computation via auxiliary variables

The running time of the precomputation step is analyzed in the previous section. In general, the dominant term will be |E|K|X|², namely the cost of a sum-product execution. Importantly, this factor is not multiplied by the number of leapfrog steps, as we are not required to refresh the auxiliary variable at each leapfrog step.

We now compute the running time of the operations that need to be performed at each leapfrog step j, which boils down to computing ∇U_{z^{(i)}}(w^{(i,j)}). Naively, this would take O(|X|² p). However, we can exploit the fact that ϕ(x, x') is often sparse. If s := Σ_{(x,x')∈X^{distinct}} Σ_{m=1}^p 1[ϕ_m(x, x') ≠ 0] denotes the total number of non-zero entries across all possible features, then ∇U_{z^{(i)}}(w^{(i,j)}) can be computed in time O(s); therefore, one HMC iteration has a running time of O(sL).
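As an illustration of Equations (4.4.4)-(4.4.5) and of the O(s) gradient cost, the following sketch evaluates the energy and gradient from the sufficient statistics, storing each feature vector ϕ(x, x') as a sparse index-to-value map. The data layout and names are ours, and we assume the exponential GLM parameterization q^{(nr)}_{x,x'}(w) = exp⟨w, ϕ(x, x')⟩, consistent with the gradient in Equation (4.4.5):

```python
import math

def energy_and_gradient(w, features, h, c, kappa):
    """Energy U_z(w) (up to a constant) and its gradient.

    features maps each ordered pair (x, x') to a sparse feature vector
    {index: value}; h[x] is the total time spent in x and c[(x, x')] the
    transition count (the sufficient statistics Z).
    """
    U = 0.5 * kappa * sum(wi * wi for wi in w)
    grad = [kappa * wi for wi in w]            # prior term of the gradient
    for (x, xp), phi in features.items():
        q = math.exp(sum(w[m] * v for m, v in phi.items()))
        U += h[x] * q                          # -h_x q_{x,x} = +h_x sum of q_{x,x'}
        U -= c.get((x, xp), 0) * math.log(q)
        coeff = h[x] * q - c.get((x, xp), 0)
        for m, v in phi.items():               # cost: non-zero entries only, O(s) overall
            grad[m] += coeff * v
    return U, grad
```

The inner loops touch only the non-zero feature entries, which is where the O(s) per-leapfrog-step cost comes from.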
In all examples presented in the thesis, s ∈ O(|X|²).

Note that this type of sparsity is distinct from the sparsity in the matrix Q exploited in previous work (Rao and Teh, 2013). In our examples, Q is not sparse.

4.4.4 HMC for GLM-CTMC reversible models

We follow essentially the same strategy described in Section 4.4.2 to approximate the posterior distribution in the reversible case, with the following two modifications. First, we augment the sufficient statistic defined in Section 2.1.2 to Z^{(rev)} := (N, C, H), where N := (N_1, . . . , N_{|X|}) represents the number of paths initiated at each state x ∈ X; H := (H_1, . . . , H_{|X|}) represents the total time spent in each state; and, for all distinct (x, x') ∈ X^{distinct} := {(x, x') ∈ X² : x ≠ x'}, C_{x,x'} ∈ {0, 1, 2, . . . } denotes the number of times that a jump ends at state x' right after jumping from x. Second, we need to modify the expression involved in the calculation of the gradient.

We derive the gradient expression for the reversible, normalized model. The derivation is lengthy, but note that the auxiliary variable Z makes it possible to generate gradient calculation code directly from the augmented joint distribution, for example using Stan; this is not practical without conditioning on the auxiliary variable Z. We show the expression of the gradient here for completeness. The partial derivatives have been verified using numerical differentiation. The log-likelihood with the normalized rate matrix is:

log f_{w|z,y}^{(norm)}(w|z,y) := Σ_{x∈X} n_x log π_x(w) + Σ_{x∈X} Σ_{x'∈X: x'≠x} c_{x,x'} log q_{x,x'}^{(norm)}(w) − Σ_{x∈X} h_x Σ_{x'∈X: x'≠x} q_{x,x'}^{(norm)}(w) + constant.    (4.4.6)

Its associated energy is:

U_z(w) = (κ/2)‖w‖₂² − Σ_{x∈X} h_x q_{x,x}(w) − Σ_{(x,x')∈X^{distinct}} c_{x,x'} log(q_{x,x'}(w)) − Σ_{x∈X} n_x log(π_x(w))    (4.4.7)
       = (κ/2)‖w‖₂² + Σ_{(x,x')∈X^{distinct}} h_x q_{x,x'}(w) − Σ_{(x,x')∈X^{distinct}} c_{x,x'} log(q_{x,x'}(w)) − Σ_{x∈X} n_x log(π_x(w)).
Computing the gradient is now reduced to a routine derivation yielding:

∇[log f_{w|z,y}^{(norm)}(w|z,y)] = Σ_{x∈X} n_x (ψ(x) − ∇A(w))    (4.4.8)
  + Σ_{x∈X} Σ_{x'∈X: x'≠x} (c_{x,x'} − h_x q_{x,x'}^{(norm)}(w)) × (β^{−1}∇β(w) + φ({x, x'}) + ψ(x') − ∇A(w)),

where:

∇A(w) = Σ_{x∈X} ψ(x) π_x(w),    (4.4.9)
∇β(w) = β Σ_{x∈X} (π_x(w)² ψ(x) − (Σ_{x'∈X: x'≠x} (ψ(x) + ψ(x') + φ({x, x'})) q_{x,x'}^{(norm)}(w))).

4.4.5 Correctness of HMC with auxiliary variable sampling scheme for GLM-CTMC models

We show that the sampling algorithm described in Section 4.4.2 converges to the desired stationary distribution, namely the distribution with density f_{w|y}(w|y) (Equation (4.1.1)). The key step consists in defining the appropriate auxiliary variable Z. In general, auxiliary variables can be used in an MCMC framework in two ways: either by directly augmenting the state space of the Markov chain, or as a transient variable used to define a kernel but discarded right after its instantiation. As we show here, our situation falls in the second case, a fact that becomes crucial in Section 4.5.⁶

Footnote 6: More precisely, if Z were part of the MCMC state space, we would have to develop new topology resampling operators that condition on both W and Z. This is not the case here: we only need to condition on W, which allows us to reuse existing phylogenetic MCMC operators.

Consider the joint distribution

f_{w,x,y,z}(w, x, y, z) := f_w(w) f_{x|Q}(x|Q(w)) f_{y|d}(y|d(x)) 1[z = z(x)],
f_{y|d}(y|d) := Π_{v∈V} f_emi(y_v|d_v).    (4.4.10)

It is easy to check that we recover the target model of Section 3.1 after marginalization of x and z. The conditional densities f_{z|w,y} and f_{w|z,y} are defined in the obvious way from the above joint density. Moreover, since z is sufficient to compute f_{x|Q}(x|Q(w)), the following is well defined when z ∈ z(S^{(|T|)}): f_{x|Q}(z|Q(w)) := f_{x|Q}(x_z|Q(w)), where x_z is such that z(x_z) = z.

Next, we define a Markov chain W^{(1)}, W^{(2)}, . . .
on the state space R^p. We construct a Markov transition kernel T using a sequence of sub-steps: first, augmentation of the state space to accommodate auxiliary variables; second, transition in the augmented space; and third, projection back to the target space R^p. We show that each of these sub-steps keeps f_{w|y}(w|y) invariant.

We denote the sub-steps as follows: w ↦ (under T¹) (w, z) ↦ (under T²) (w', z) ↦ (under T³) w'. Each kernel in this decomposition is defined as follows. T¹ samples the auxiliary variable Z given the data Y and current parameters W; it has density T¹(w', z'|w) := 1[w = w'] f_{z|w,y}(z|w, y). Sampling from this density is outlined in Section 4.4, with additional details provided in Appendix E. T² performs one (or several) Metropolis-within-Gibbs step(s) on w, keeping z fixed; more precisely, this Metropolis-within-Gibbs step uses the HMC algorithm of Section 4.1. We show below that it is invariant with respect to the conditional distribution f_{w|z,y}. T³ deterministically projects the pair back to the original space, retaining only the weights w.

It remains to show that the formulae for U_{z^{(i)}} and ∇_w U_{z^{(i)}} in Section 4.4 correspond to the negative logarithm of the conditional density f_{w|z,y}:⁷

log f_{w|z,y}(w|z,y) = log ∫_{x: z(x)=z} f_{w,x,y,z}(w, x, y, z(x)) dx + const    (4.4.11)
 = log ∫_{x: z(x)=z} f_w(w) f_{x|Q}(x|Q(w)) f_{y|d}(y|d(x)) dx + const    (4.4.12)
 = log f_w(w) f_{x|Q}(z|Q(w)) + log ∫_{x: z(x)=z} f_{y|d}(y|d(x)) dx + const,    (4.4.13)

where the integral in (4.4.13) is constant with respect to w.

Now, plugging Equation (2.1.1) into the above expression, we obtain Equation (4.4.4). For the gradient:

∇_w U_z(w) = ∇_w [(κ/2)‖w‖₂² − Σ_{x∈X} h_x q_{x,x}^{(nr)}(w) − Σ_{(x,x')∈X^{distinct}} c_{x,x'} log(q_{x,x'}^{(nr)}(w))]    (4.4.14)
 = κw − Σ_{x∈X} h_x ∇_w (−Σ_{x'∈X: x'≠x} q_{x,x'}^{(nr)}(w)) − Σ_{(x,x')∈X^{distinct}} c_{x,x'} ϕ(x, x')    (4.4.15)
 = κw + Σ_{(x,x')∈X^{distinct}} ϕ(x, x')(h_x q_{x,x'}^{(nr)}(w) − c_{x,x'}).    (4.4.16)

Footnote 7: In Equation (4.4.11), the densities and integrals are defined with respect to a reference measure on CTMC sample paths containing finite numbers of jumps.
This coincides with Equation (4.4.5).

4.5 Extension to phylogenetic trees

In this section, we extend our methodology to phylogenetic trees within our GLM-CTMC framework and outline the posterior simulations for phylogenetic models.

We now discuss how to generalize the techniques of Section 3.2.2 to the phylogenetic setup. Since the tree τ is unknown, we need to augment our MCMC sampler, so that a state in the MCMC chain now has the form (W^{(i)}, τ^{(i)}).

In the generalized sampler, the MCMC moves alternate between (1) resampling the weights W while keeping the tree τ fixed, and (2) resampling the tree τ while keeping the weights fixed. Weight resampling, Step (1), is performed using the method described in Section 4.1, with two minor differences. First, the gradient of the normalized reversible rate matrix is modified as described in Section 4.4.4. Second, the graphical model (V, E) is now a tree instead of a chain; the posterior inference methods for chains can be extended to trees in a straightforward fashion (see Felsenstein (1981) for a description tailored to phylogenetic inference).

Tree resampling, Step (2), can be performed by first forming Q = Q^{(norm)}(w) from the weights w we condition on, and then using standard tree resampling methods. These tree resampling methods are based on Metropolis-Hastings local operators on the topology and branch lengths; see Lakner et al. (2008) for a comparison of these methods. Our package implements stochastic neighbour interchange and multiplicative branch length perturbations. We use a spectral decomposition method to calculate the transition probabilities p^{trans}.

In this chapter, we have presented our novel sampling scheme, via a combination of AHMC and an auxiliary variable technique called uniformization, for the Bayesian GLM framework described in the previous chapter.
AHMC (or HMC) uses discretization schemes to approximate the continuous time trajectory of the Hamiltonian dynamics. In the next chapter, we propose a novel type of sampling algorithm constructed from Piecewise-Deterministic Markov Processes (PDMPs), where an exact simulation of the continuous time trajectory is possible, usually through a thinning strategy (Bouchard-Côté et al., 2018). We shall see that our proposed sampling algorithm in the next chapter can further exploit the sparsity structure in the parameterization of the rate matrices.

Chapter 5
Inference via LBPS-HMC

In this chapter, we propose a sampling scheme denoted LBPS-HMC, which combines the Hamiltonian Monte Carlo (HMC) algorithm and the Local Bouncy Particle Sampler (LBPS). In Chapter 4, we introduced our inference method for Continuous Time Markov Chains (CTMCs) via Adaptive Hamiltonian Monte Carlo (AHMC). HMC uses the gradient of the log target density to efficiently sample from the posterior density, and we demonstrated that it outperforms classical Markov Chain Monte Carlo (MCMC) algorithms on both synthetic and real datasets. However, one limitation of HMC (or AHMC) is that it does not exploit the sparsity structure often found in the parameterization of high-dimensional rate matrices. To further exploit sparsity, we make use of recently developed Monte Carlo schemes based on non-reversible Piecewise-Deterministic Markov Processes (PDMPs). In particular, we build our samplers on the non-reversible rejection-free dynamics developed for statistical applications in Bouchard-Côté et al. (2018). It has been shown by Neal (2004); Sun et al. (2010); Chen and Hwang (2013); Bierkens (2016) that non-reversible MCMC algorithms can outperform reversible MCMC in terms of mixing rate and asymptotic performance. Moreover, the Bouncy Particle Sampler (BPS) can be further sped up using a local, sparse version of the algorithm, referred to as LBPS.
Inference via LBPS-HMCAs shown in Section 5.1.1 and 5.1.2, we review the basics of BPS andLBPS based on the work of Bouchard-Coˆte´ et al. (2018). We make ourcontributions in Section 5.2-5.5. To summarize, we make the following con-tributions:• A characterization of a class of sparse factor graphs when LBPS isefficient, which is discussed in Section 5.2.• A novel sampler combining HMC and LBPS to efficiently sample fromCTMCs while maintaining the techniques of auxiliary variables usedin Zhao et al. (2016) under the Bayesian GLM parameterization ofrate matrices for CTMCs, which is shown in Section 5.3.1.• An explanation of our motivation for proposing a combined samplingscheme via HMC and LBPS instead of using only LBPS. The rea-son is that the factor graph for the target posterior density undera Bayesian GLM chain GTR model with auxiliary variables achievesstrong sparsity (described in definition 5.2.2) using LBPS-HMC, whichis discussed in Section 5.3.3. The strong sparsity will not be achievedif only LBPS is used, details are provided also in Section 5.3.3.• Analytical solutions to the bouncing time (defined in Section 5.1.1) foreach factor of the factorized posterior density of CTMCs to boost thecomputational efficiency described in Section 5.4. In comparison, pre-vious work use thinning algorithms to either obtain analytical conser-vative bounds (Bouchard-Coˆte´ et al., 2018) or construct approximatebounds via predictive models using local regression (Pakman et al.,2017).• A running time analysis for one iteration of LBPS for both general695.1. 
factor graphs and factor graphs with sparse structure in Section 5.2.2, and a running time analysis for the special case of one iteration of our proposed algorithm LBPS-HMC under a Bayesian GLM model with sparsity structure in the factor graph, discussed in Section 5.5.

5.1 Background on BPS and LBPS

Before introducing our proposed sampling scheme, we first introduce the basic notation and background on BPS and LBPS.

5.1.1 Basics of BPS

The BPS belongs to the emerging class of continuous-time non-reversible MCMC sampling algorithms constructed from PDMPs (Davis, 1984). It was first proposed by Peters et al. (2012), and formalized, developed, and generalized by Bouchard-Côté et al. (2018) and Vanetti et al. (2017). We use three key components to describe PDMPs:

The deterministic dynamics: a system of differential equations that characterizes the process' deterministic behaviour between jumps.

The event rate: a function that determines the intensity of jumps at each state.

The transition distribution: a probability measure that determines the next state of the process according to a certain transition kernel.

The BPS is a special case of PDMPs with certain choices of the three aforementioned components. We denote the state of the PDMP by $Y = (W, V)$, which encodes the position and velocity of the particle. Let $\zeta(w, v) = \zeta(v)\zeta(w)$, where $\zeta(v)$ represents the standard multivariate Gaussian distribution and $\zeta(w)$ represents the target posterior distribution of interest. The BPS keeps $\zeta(w, v)$ invariant. If a PDMP is to be used to sample from the target distribution, it must at least admit the target distribution as an invariant distribution. Vanetti et al. (2017) have provided sufficient conditions ensuring this requirement, and they have verified that these conditions are satisfied for the BPS, so that the BPS keeps $\zeta(w, v)$ invariant.
Moreover, if the PDMP is ergodic, it allows us to consistently estimate expectations with respect to the invariant distribution (Vanetti et al., 2017). Bouchard-Côté et al. (2018) have provided a proof of the ergodicity of the BPS.

We denote by $U(w)$ the associated energy function, the negative log of the unnormalized target posterior density, assumed continuously differentiable. We discussed the relationship between the target probability density function and the canonical distribution in Section 4.2.1. In the BPS, the specific choices for the three components are:

The deterministic dynamics:
\[
\frac{dw}{dt} = v, \qquad \frac{dv}{dt} = 0.
\]

The event rate: the intensity $\lambda(w, v) = \max\{0, \langle \nabla U(w), v \rangle\}$ of an inhomogeneous Poisson Process (PP) according to which the jumps take place.

The transition distribution: a Dirac delta function centred at $(w, T_w(v))$, where
\[
T_w(v) = v - 2\,\frac{\langle \nabla U(w), v \rangle}{\|\nabla U(w)\|^2}\,\nabla U(w).
\]

To ensure the ergodicity of the Markov chain, refreshment events also happen at random times, at which the velocity is refreshed according to a certain distribution such as a standard normal distribution.

Throughout this chapter, we use the following terminology to describe and analyze the algorithm. We summarize the definitions to make them easier for the reader:

• A "bounce", a "bouncing event", a "collision", and a "collision event" have the same meaning in this chapter. They all refer to the velocity of the particle being updated in the same way as a Newtonian elastic collision on the hyperplane tangential to the gradient of the energy.

• A "bouncing time" or a "collision time" refers to the time at which a bounce (collision) occurs.

• A "refreshment" refers to the velocity of the particle being resampled from a certain distribution.
In this thesis, a standard normal distribution is used, which is also the choice most often used in the literature.

• A "refreshment time" refers to the time at which a refreshment happens.

• An "event" refers to a time at which either a bounce (collision) or a refreshment occurs.

With this terminology, we present the BPS in Algorithm 4.

Algorithm 4: BPS algorithm
1: Initialization: initialize the particle position and velocity $(w^{(0)}, v^{(0)})$ arbitrarily on $\mathbb{R}^d \times \mathbb{R}^d$; set $T$ to a fixed trajectory length.
2: for $i = 1, 2, \dots$ do
3: Sample the first arrival times $\tau_{\mathrm{ref}}$ and $\tau_{\mathrm{bounce}}$ of PPs with intensities $\lambda_{\mathrm{ref}}$ and $\lambda_{\mathrm{bounce}}$ respectively, where $\lambda_{\mathrm{bounce}}(t) = \max(\langle v^{(i-1)}, \nabla U(w^{(i-1)} + v^{(i-1)} t) \rangle, 0)$ and the value of $\lambda_{\mathrm{ref}}$ is fixed in advance.
4: Set $\tau_i \leftarrow \min(\tau_{\mathrm{bounce}}, \tau_{\mathrm{ref}})$.
5: Update the position of the particle via $w^{(i)} \leftarrow w^{(i-1)} + v^{(i-1)} \tau_i$.
6: If $\tau_i = \tau_{\mathrm{ref}}$, sample the next velocity $v^{(i)} \sim N(0_d, I_d)$.
7: If $\tau_i = \tau_{\mathrm{bounce}}$, obtain the next velocity $v^{(i)}$ by applying the transition function $T_{w^{(i)}}(v^{(i-1)})$.
8: If $t_i = \sum_{j=1}^{i} \tau_j > T$, exit the for loop (line 2).
9: end for

To help understand how the BPS works, we borrow Figure 5.1 from Bouchard-Côté et al. (2018), which shows a trajectory generated by the BPS on a standard bivariate Gaussian target distribution, where $U(x) = -\log \gamma(x)$ represents the energy (negative log of the target density) when the parameters take value $x$.

Bouchard-Côté et al. (2018) introduced several strategies for simulating the collision time, including time-scale transformation, adaptive thinning, and a combination of superposition and thinning. In our problem, we are able to derive an analytical expression for the collision time, so we leave out the details of those strategies.

5.1.2 Basics of LBPS

If the target posterior density $\zeta(w)$ can be represented as a product of positive factors as in Equation (5.1.1),
\[
\zeta(w) \propto \prod_{f \in F} \gamma_f(N_f), \tag{5.1.1}
\]
Figure 5.1: Contours of the bivariate standard Gaussian target distribution from which we would like to sample. This figure comes from Bouchard-Côté et al. (2018).

where $F$ is the index set of all factors in the target density and $N_f$ represents the subset of variables connected to factor $f$, then a "local" version of BPS, referred to as LBPS, can be computationally efficient by taking advantage of the structural properties of the target density, especially if the target density has the strong sparsity property described in Definition 5.2.2.

We now provide a brief description of the LBPS algorithm. If the target unnormalized posterior density factors according to Equation (5.1.1), its associated energy function is
\[
U(w) = \sum_{f \in F} U_f(N_f), \tag{5.1.2}
\]
where $U_f(N_f) = -\log(\gamma_f(N_f))$. We define the local intensity $\lambda_f(w, v)$ and the local transition function $T_w^f(v)$ for factor $\gamma_f$ as:
\[
\lambda_f(w, v) = \max\{0, \langle \nabla U_f(w), v \rangle\}, \tag{5.1.3}
\]
\[
T_w^f(v) = v - 2\,\frac{\langle \nabla U_f(w), v \rangle}{\|\nabla U_f(w)\|^2}\,\nabla U_f(w). \tag{5.1.4}
\]
It is worth noting that for variables that are not neighbour variables of factor $\gamma_f$, $T_w^f(v)_k = v_k$. In the LBPS, the next collision time is the first arrival time of a PP with intensity $\lambda(w, v) = \sum_{f \in F} \lambda_f(w, v)$. We can sample this first arrival time via the superposition algorithm for a PP: we sample $\tau_f$ for each factor $\gamma_f$ from a PP with intensity $\lambda_f(w, v) = \max\{0, \langle \nabla U_f(w), v \rangle\}$, and the first arrival time for the PP with intensity $\lambda(w, v)$ is $\tau = \min_{f \in F} \tau_f$. Once a bounce event takes place, LBPS only needs to manipulate a subset of the variables and factors. To be specific, if we denote by $f^*$ the factor index such that $\tau_{f^*} = \min_{f \in F} \tau_f$, we call $\gamma_{f^*}$ the collision factor, i.e., the factor that attains the first arrival time for the PP with intensity $\lambda_f(w, v)$. We use Figure 5.2 to illustrate the process. Assume the target density is
\[
\zeta(w) \propto \gamma_1(w_1)\,\gamma_2(w_1, w_2)\,\gamma_3(w_2)\,\gamma_4(w_2, w_3)\,\gamma_5(w_3, w_4),
\]
and that $\gamma_2$ is the collision factor.
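The superposition step just described can be sketched in Python. This is a hedged, minimal illustration only: in the actual LBPS the per-factor intensity $\lambda_f(w + vt, v)$ is time-inhomogeneous, so the constant per-factor rates and the factor names used below are assumptions made to keep the sketch self-contained.

```python
import math
import random

def first_arrival(rate):
    """First arrival time of a homogeneous Poisson process with the given rate.

    A constant rate stands in for the state-dependent intensity
    lambda_f(w + v t, v) of the real algorithm."""
    return random.expovariate(rate) if rate > 0 else math.inf

def next_collision(factor_rates):
    """Superposition step: sample a candidate time per factor and return the
    earliest one together with the index of the collision factor f*."""
    times = {f: first_arrival(r) for f, r in factor_rates.items()}
    f_star = min(times, key=times.get)
    return f_star, times[f_star]

random.seed(1)
f_star, tau = next_collision({"gamma1": 0.5, "gamma2": 2.0, "gamma3": 0.0})
# a factor with zero local intensity (here gamma3) can never be the collision factor
```

The design mirrors the text: the minimum of per-factor first arrivals is distributed as the first arrival of the superposed process, so only the winning factor's neighbourhood needs to be touched afterwards.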
Following the priority queue implementation proposed by Bouchard-Côté et al. (2018), LBPS first updates the positions of the neighbour variables of the collision factor $\gamma_{f^*}$, which in this case are $w_1$ and $w_2$. LBPS then applies the corresponding local transition function $T_w^{f^*}(v)$ to update the velocity $v$. Next, the algorithm samples the candidate bounce time of the next event for all extended neighbour factors $\gamma_{f'}$ of $\gamma_{f^*}$, i.e., those with $N_{f'} \cap N_{f^*} \neq \emptyset$.

Figure 5.2: For factor $\gamma_2$, the neighbour variables are $N_2 = \{w_1, w_2\}$. The extended neighbour variables of $\gamma_2$ are $\bar{N}_2 = \{w_1, w_2, w_3\}$. The extended neighbour factors of $\gamma_2$ are $\bar{S}_2 = \{\gamma_1, \gamma_2, \gamma_3, \gamma_4\}$.

In this example, the extended neighbour factors of $\gamma_2$ are $\gamma_1$, $\gamma_2$, $\gamma_3$, and $\gamma_4$, since each is connected to either $w_1$ or $w_2$, the neighbour variables of $\gamma_2$. If a refreshment event happens, a computationally cheaper refreshment scheme such as "local refreshment" (Bouchard-Côté et al., 2018) is often used in LBPS by exploiting the structure of the factor graph: one factor $\gamma_f$ is chosen uniformly at random, and only the components of $v$ with indices corresponding to the variables $N_f$ are resampled from a standard normal distribution.

Here we only give the high-level idea of why LBPS can be efficient when strong sparsity holds for the target factor graph. In Section 5.2.1, we define strong sparsity rigorously, and in Section 5.2.2 we provide the running time analysis of LBPS for arbitrary factor graphs and for factor graphs with strong sparsity. If strong sparsity holds for a family of factor graphs, the number of neighbour variables of the collision factor grows much more slowly than the dimension of the parameter space.
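The neighbour structure used above can be made concrete with a small sketch. This is not the thesis implementation; representing the factor graph as a map from factor names to variable sets is an assumption made for illustration, applied to the running example $\zeta(w) \propto \gamma_1(w_1)\gamma_2(w_1,w_2)\gamma_3(w_2)\gamma_4(w_2,w_3)\gamma_5(w_3,w_4)$:

```python
def extended_neighbours(neigh, f):
    """Given a factor graph as a map factor -> set of neighbour variables,
    return (extended neighbour variables, extended neighbour factors) of f:
    the factors sharing at least one variable with f, and the union of
    their neighbour variable sets."""
    ext_factors = {g for g, vars_ in neigh.items() if vars_ & neigh[f]}
    ext_vars = set().union(*(neigh[g] for g in ext_factors))
    return ext_vars, ext_factors

# Factor graph of the running example:
neigh = {"g1": {"w1"}, "g2": {"w1", "w2"}, "g3": {"w2"},
         "g4": {"w2", "w3"}, "g5": {"w3", "w4"}}

ext_vars, ext_factors = extended_neighbours(neigh, "g2")
# ext_vars == {"w1", "w2", "w3"} and ext_factors == {"g1", "g2", "g3", "g4"},
# matching the caption of Figure 5.2
```

After a bounce at the collision factor, only `ext_factors` need fresh candidate times, which is the source of the local algorithm's savings.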
The number of neighbour variables and the number of operations needed to update their positions are then both negligible compared to the dimension of the parameter space. When computing the candidate collision times for the extended neighbour factors, the number of factors involved is negligible compared to the total number of factors in the factor graph, which is desirable in LBPS. We point out that although LBPS manipulates only a subset of variables and factors, each local bounce leads to changes in all variables, not just the neighbour variables of the current collision factor.

To help understand the priority queue implementation of LBPS (Bouchard-Côté et al., 2018), we first introduce some notation and algorithms. A priority queue is an abstract data type in computer science. It is similar to a regular queue or stack data structure, but each element is associated with a "priority": an element with higher priority is served before an element with lower priority. A priority queue implementation of LBPS is efficient because finding the minimum and inserting or updating an element of the queue are both cheap operations.

A list of triplets composed of the position, velocity, and the collision or refreshment event time for each variable is used to store the trajectory information. To ensure an efficient implementation of LBPS, for each variable we only record the trajectory information when either a bounce or a refreshment event takes place. For each variable $k \in \{1, 2, \dots, p\}$, where $p$ is the dimension of the parameter space, we only record a list of triplets $L_k$, where
\[
L_k = \left(w_k^{(i)}, v_k^{(i)}, t_k^{(i)}\right),
\]
$t_k^{(i)}$ represents the $i$th event (bounce or refreshment) time for variable $k$, $w_k^{(i)}$ represents the position of the variable at time $t_k^{(i)}$, and $v_k^{(i)}$ represents the velocity of variable $k$ after the $i$th event happens. We have $t_k^{(0)} = 0$ at the initial position $w_k^{(0)}$ with initial velocity $v_k^{(0)}$. The advantage of encoding the trajectory information using $L_k$ for each variable is that we can compute the position of a variable at any time $t > 0$ using Algorithm 5 (Bouchard-Côté et al., 2018).

Algorithm 5: Compute $w_k(t)$ given $L_k$ for any $t > 0$, where $k \in \{1, 2, \dots, p\}$.
1: Find the index $j = i(t, k)$ in $L_k$ corresponding to the largest time $t_k^{(j)}$ such that $t_k^{(j)} \le t$.
2: Update $w_k(t) \leftarrow w_k^{(j)} + \left(t - t_k^{(j)}\right) v_k^{(j)}$.

We also need to introduce a new queue$(w, v, T)$ operation, which is needed in LBPS when we update the priority queue $O$ at initialization or after each refreshment event. The details are summarized in Algorithm 6. We then present the priority queue implementation of LBPS (Bouchard-Côté et al., 2018) in Algorithm 7.

Algorithm 6: New Queue $(w, v, T)$
1: for $f \in F$ do
2: $(v_f)_k \leftarrow v_k^{(|L_k| - 1)}$ for all $w_k \in N_f$, where $N_f$ is the set of variables for $\gamma_f$.
3: Compute $w_f \leftarrow w_f(T)$ using Algorithm 5.
4: Simulate the first arrival time $\tau_f$ of a PP with intensity $\lambda_f(w_f + t v_f, v_f)$ on $[0, \infty)$.
5: Set in the priority queue $O$ the time associated to $f$ to $T + \tau_f$.
6: end for
7: Return $O$.

5.2 Characterization of factor graphs where LBPS can be efficient

In this section, we characterize the factor graphs for which LBPS can be efficient. We propose the notion of strong sparsity for a family of factor graphs, and we analyze the running time of LBPS for both arbitrary factor graphs and factor graphs with strong sparsity.

5.2.1 Strong sparsity of factor graphs

A factor graph is a bipartite graph used to represent the factorization of a function. We propose our notion of sparsity for factor graphs.
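Algorithm 5 above amounts to a binary search over the recorded event times followed by a linear extrapolation along the constant-velocity segment. A minimal sketch, assuming the triplet layout of $L_k$ described in the text and using the standard library `bisect` for the search:

```python
import bisect

def position_at(L_k, t):
    """Algorithm 5: reconstruct w_k(t) from the event list
    L_k = [(w, v, t_event), ...], sorted by event time."""
    times = [triple[2] for triple in L_k]
    j = bisect.bisect_right(times, t) - 1  # largest event time <= t
    w_j, v_j, t_j = L_k[j]
    return w_j + (t - t_j) * v_j           # linear segment between events

# piecewise linear trajectory: start at 0 with velocity 1, bounce at t=2 to velocity -0.5
L_k = [(0.0, 1.0, 0.0), (2.0, -0.5, 2.0)]
# position_at(L_k, 1.0) == 1.0 and position_at(L_k, 4.0) == 1.0
```

Because only events are stored, the trajectory between them is exactly recoverable, which is what makes the sparse event lists $L_k$ a lossless encoding of the continuous-time sample path.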
The sparsity definition is useful in analyzing the computational performance of LBPS. The standard notion of sparsity for a graph $G = (V, E)$ is $|E| = o(|V|^2)$, where $|E|$ represents the number of edges and $|V|$ the number of vertices. However, the definition used in this thesis is different from, and more stringent than, this conventional definition. The motivation for proposing a sparsity definition specific to factor graphs is that we expect LBPS to be efficient when strong sparsity (Definition 5.2.2) is satisfied.

Algorithm 7: LBPS algorithm via priority queue implementation
1: Initialize $(w^{(0)}, v^{(0)})$ arbitrarily on $\mathbb{R}^d \times \mathbb{R}^d$.
2: Initialize the global clock $T \leftarrow 0$.
3: for $k \in \{1, 2, \dots, p\}$ do
4: Initialize the list $L_k \leftarrow (w_k^{(0)}, v_k^{(0)}, T)$.
5: end for
6: Set $O \leftarrow$ new queue $(w^{(0)}, v^{(0)}, T)$, where the new queue operation is outlined in Algorithm 6.
7: Sample $t_{\mathrm{ref}} \sim \mathrm{Exp}(\lambda_{\mathrm{ref}})$.
8: while more events (bounce or refreshment) happen, $i = 1, 2, \dots$ do
9: $(t, \gamma_f) \leftarrow$ smallest candidate bounce time and associated factor in the priority queue $O$.
10: Remove $(t, \gamma_f)$ from $O$.
11: Update the global clock, $T \leftarrow t$.
12: if $T < t_{\mathrm{ref}}$ then
13: $(v_f)_k \leftarrow v_k^{(|L_k| - 1)}$ for all $k \in N_f$, where $N_f$ is the set of variables connected to factor $\gamma_f$.
14: Compute $w_f \leftarrow w_f(T)$ using Algorithm 5.
15: for $w_k \in N_f$ do
16: $w_k^{(|L_k|)} \leftarrow w_k^{(|L_k| - 1)} + \left(T - t_k^{(|L_k| - 1)}\right) v_k^{(|L_k| - 1)}$, where $t_k^{(|L_k| - 1)}$ and $v_k^{(|L_k| - 1)}$ are retrieved from $L_k$.
17: $v_k^{(|L_k|)} \leftarrow \{T_w^f(v)\}_k$.
18: $L_k \leftarrow \{L_k, (w_k^{(|L_k|)}, v_k^{(|L_k|)}, T)\}$ (add the new sample to the list).
19: end for
20: for $f' \in F : N_{f'} \cap N_f \neq \emptyset$ (note, this includes the update of $\gamma_f$) do
21: for all $w_k \in N_{f'}$:
22: Compute $w_{f'} \leftarrow w_{f'}(T)$ using Algorithm 5.
23: Simulate the first arrival time $\tau_{f'}$ of a PP with intensity $\lambda_{f'}(w_{f'} + t v_{f'}, v_{f'})$ on $[0, \infty)$.
24: Set in the priority queue $O$ the candidate bounce time associated to $f'$ to $T + \tau_{f'}$.
25: end for
26: else
27: Sample $v' \sim N(0_d, I_d)$.
28: $O \leftarrow$ new queue $(w(t_{\mathrm{ref}}), v', t_{\mathrm{ref}})$, where $w(t_{\mathrm{ref}})$ is computed using Algorithm 5.
29: Set $t_{\mathrm{ref}} \leftarrow \tau_{\mathrm{ref}} \sim \mathrm{Exp}(\lambda_{\mathrm{ref}})$.
30: end if
31: end while
32: Return the samples encoded as the lists $L_k$, where $k \in \{1, 2, \dots, p\}$.

Definition 5.2.1. Given a factorization of a function
\[
f(w) = \prod_{f = 1}^{m} \gamma_f(N_f),
\]
the factor graph $G = (w, \Gamma, E)$ consists of variables $w = \{w_1, w_2, \dots, w_p\}$, a set of factors $\Gamma = \{\gamma_1, \gamma_2, \dots, \gamma_m\}$, and edges $E$. An undirected edge exists between factor $\gamma_f$ and variable $w_k$ iff $w_k \in N_f$. The neighbour variables of factor $\gamma_f$ are denoted $N_f \subset \{w_1, w_2, \dots, w_p\}$. The neighbour factors of variable $w_k$ are $S_k := \{\gamma_f : w_k \in N_f\}$. The extended neighbour variables of factor $\gamma_f$ are $\bar{N}_f := \{w_k : w_k \in N_{f'} \text{ such that } N_{f'} \cap N_f \neq \emptyset\}$. The extended neighbour factors of factor $\gamma_f$ are $\bar{S}_f := \{\gamma_{f'} : N_{f'} \cap N_f \neq \emptyset\}$.

In Figure 5.3, we use the following factor graph to illustrate Definition 5.2.1:
\[
f(w) = \gamma_1(w_1)\,\gamma_2(w_1, w_2)\,\gamma_3(w_2)\,\gamma_4(w_2, w_3)\,\gamma_5(w_3, w_4).
\]
Given Definition 5.2.1, we define strong sparsity for factor graphs below.
We need a family of factor graphs in which the number of factors and the dimension of the variables can go to infinity. For simplicity, we omit the superscripts in $N_f^{(m)}$ and $S_f^{(m)}$ and write $N_f$ and $S_f$ instead.

Definition 5.2.2. For a factor graph $G = (w, \Gamma, E)$ consisting of variables $w = \{w_1, w_2, \dots, w_p\}$, factors $\Gamma = \{\gamma_1, \gamma_2, \dots, \gamma_m\}$, and edges $E$, strong sparsity requires that $\max_f |\bar{S}_f| = o(m)$ and $\max_f |\bar{N}_f| = o(p)$.

Strong sparsity reflects that, for any factor in the factor graph, the number of its extended neighbour factors and extended neighbour variables grows much more slowly than, and is ultimately negligible compared with, the total number of factors and the dimension of the parameter space. We will show later that the factor graph under our proposed sampling scheme satisfies the strong sparsity property.

Figure 5.3: For factor $\gamma_2$, the neighbour variables are $N_2 = \{w_1, w_2\}$. The extended neighbour variables of $\gamma_2$ are $\bar{N}_2 = \{w_1, w_2, w_3\}$. The extended neighbour factors of $\gamma_2$ are $\bar{S}_2 = \{\gamma_1, \gamma_2, \gamma_3, \gamma_4\}$.

5.2.2 Running time analysis of LBPS for factor graphs

Recently, researchers (Deligiannidis et al., 2018; Andrieu et al., 2018; Bierkens et al., 2018) have investigated the scaling limits of samplers constructed from PDMPs, including BPS and Zig-zag processes. The Zig-zag process is a continuous time non-reversible stochastic process with continuous, piecewise linear trajectories on $\mathbb{R}^p$. It moves with constant velocity $v \in \{1, -1\}^p$ until one component of $v$ switches sign. The time at which a component of $v$ switches sign, and the choice of which component reverses direction, are
determined by a collection of state-dependent switching rates $(\lambda_i)_{i=1}^p$, which depend on the gradient of the unnormalized negative log posterior density and the current velocity.

Currently, there is no running time analysis for LBPS. Our analysis provides the computational cost of one iteration of LBPS, both for a general factor graph $G = (w, \Gamma, E)$ and for factor graphs satisfying the strong sparsity property of Definition 5.2.2.

In $G = (w, \Gamma, E)$, we have variables $w = \{w_1, w_2, \dots, w_p\}$ and factors $\Gamma = \{\gamma_1, \gamma_2, \dots, \gamma_m\}$. We introduce some notation needed in the analysis. Denote $|\bar{N}^*| = \max_f |\bar{N}_f|$, $|N^*| = \max_f |N_f|$, and $|\bar{S}^*| = \max_f |\bar{S}_f|$. Since $N_f \subset \bar{N}_f$, we have $|N_f| \le |\bar{N}_f|$ and $|N^*| \le |\bar{N}^*|$. Denote the cost of computing the collision time for a factor $\gamma_f$ by $c_f$, with $c^* = \max_f c_f$. Let $c_{U_f}$ denote the running time of computing $\nabla U_f(w)$, with $c_{U^*} = \max_f c_{U_f}$. The computation time for $c^*$ and $c_{U^*}$ is case specific; for example, in Section 5.5 we show that $c^* \le O(1)$ and $c_{U^*} \le O(1)$ in our problem context. We now analyze the running time of LBPS via a priority queue implementation (Bouchard-Côté et al., 2018) for a factor graph $G = (w, \Gamma, E)$.

1. Compute the collision time for all factors and build a priority queue: $O(m c^* + m \log m)$.

2. (a) If a collision takes place:

i. Update the positions of the extended neighbour variables: $O(|\bar{N}^*|)$.

ii. Update the velocities of the neighbour variables according to Equation (5.1.4): $O(|N^*| + c_{U^*}) \le O(|\bar{N}^*| + c_{U^*})$.

iii. Add the new samples to the trajectory lists $L_k$: $O(|N^*|)$. We use $L_k$ to denote a list of triplets $(w_k^{(i)}, v_k^{(i)}, t_k^{(i)})$, with $w_k^{(0)}$ and $v_k^{(0)}$ representing the initial position and velocity, and $t_k^{(0)} = 0$. For $i > 0$, $v_k^{(i)}$ represents the velocity after the $i$th collision or refreshment event and $t_k^{(i)}$ represents the time of the $i$th event.

iv. Compute the collision times for the extended neighbour factors: $O(|\bar{S}^*| c^*)$.

v. Insert the extended neighbour factors and the corresponding collision times into the priority queue: $O(\log m)$. The elements of the priority queue are stored in increasing order of collision time.

(b) If a refreshment event takes place and a local refreshment scheme is used:

i. Pick a factor uniformly at random; only the components of $v$ with indices corresponding to the neighbour variables $N_f$ of the selected factor are refreshed by resampling from a standard normal distribution: $O(|N^*|)$.

ii. Update the positions of the neighbour variables of the selected factor, compute the collision times for the extended neighbour factors, and update the corresponding collision times in the priority queue: $O(|N^*| + |\bar{S}^*| c^* + |\bar{S}^*| \log m)$.

iii. Sample the next refreshment time: $O(1)$.

Thus, assuming there are $L_1$ collision events and $L_2$ refreshment events as the particle travels along a fixed trajectory length, the total running time of LBPS is
\[
O\!\left(m c^* + L_1\left(|\bar{N}^*| + c_{U^*} + |\bar{S}^*| c^* + \log m\right) + L_2\left(|N^*| + |\bar{S}^*| c^* + |\bar{S}^*| \log m\right)\right).
\]
For a factor graph with no assumptions on sparsity, the total running time of LBPS is
\[
O\!\left(L_1\left(p + c_{U^*} + m c^* + \log m\right) + L_2\left(p + m c^* + m \log m\right)\right).
\]
For a factor graph with strong sparsity, the total running time of LBPS is
\[
O\!\left(m c^* + L_1\left(p^{\alpha_1} + c_{U^*} + m^{\alpha_2} c^* + \log m\right) + L_2\left(p^{\alpha_1} + m^{\alpha_2} c^* + m^{\alpha_2} \log m\right)\right), \tag{5.2.1}
\]
where $0 \le \alpha_1 < 1$ and $0 \le \alpha_2 < 1$.

However, we need to mention that our running time analysis does not include the computational cost of obtaining approximately one independent sample. We leave the question of how $L_1$ and $L_2$ scale with the dimension of the parameter space in order to obtain a constant number of effective samples to future work.

5.3 Achieving strong sparsity through LBPS-HMC for GLM-CTMC model

In this section, we first provide an overview of our proposed sampling scheme, LBPS-HMC, summarized in Algorithm 8 in Section 5.3.1.
Our sampling scheme, denoted LBPS-HMC, combines HMC and LBPS instead of using only LBPS. As seen in Algorithm 8, we use HMC to sample the univariate weights and LBPS to sample the bivariate weights. In Section 5.3.2, we provide the function and gradient information with respect to the univariate weights, since HMC requires function and gradient evaluations.

The motivation for using LBPS-HMC is that strong sparsity, under which LBPS can be efficient as shown in Section 5.2, is desirable for the target factor graph. As we will show in Section 5.3.3, the strong sparsity of Definition 5.2.2 is satisfied for the target factor graph representation of the potential energy function augmented with the sufficient statistics $z(x) := (n(x), h(x), c(x))$ under the LBPS-HMC sampling scheme. In the same section, we show that strong sparsity of the corresponding factor graph representation is not satisfied if we use only LBPS to sample both the univariate and bivariate weights. We will see in Section 5.3.3 that, due to the parameterization of the stationary distribution in terms of $w^u$ in Equation (3.3.2), for any factor in the factor graph the extended neighbour variables and extended neighbour factors include all variables and factors in the graph, so LBPS alone will not be efficient. The results of this section do not apply to the model with a normalized rate matrix: in that case we have neither sparsity nor analytical solutions to the collision times of LBPS.

5.3.1 LBPS-HMC sampling scheme for GLM-CTMC model

In this section, we present the combined sampling scheme in Algorithm 8 below.
Algorithm 8: Proposed sampling scheme LBPS-HMC for CTMCs
1: Initialization: initialize the weights $w^0 = (w^u, w^b)$ from $N(0, 1)$, where $w^u$ represents the weights corresponding to the univariate features used to obtain the stationary distribution and $w^b$ represents the weights associated with the bivariate features used to compute the exchangeable parameters.
2: for $i = 1, 2, \dots, N$ do
3: Compute the rate matrix $Q$ given $w^{(i-1)}$ under the Bayesian GLM rate matrix parameterization.
4: Use an end-point sampler (Tataru and Hobolth, 2011; Rao and Teh, 2013) to simulate a path over the time interval $\Delta_e$, with two consecutive observations at time points $e = (t, t+1)$ of the time series, according to Equation (2.1.1), using the cached uniformization technique.
5: Compute the aggregate sum of the sufficient statistics of all time series obtained in Step 4.
6: Update the univariate weights $w^u$ for the stationary distribution via HMC:
\[
(w^u)^{(i+1)} \mid (w^u, w^b)^{(i)}, Z^{(i)} \sim \mathrm{HMC}\left(\cdot \mid (w^u, w^b)^{(i)}, Z^{(i)}, L, \epsilon\right),
\]
where $L$ and $\epsilon$ are tuning parameters representing the number of leapfrog jumps and the step size in HMC. The function evaluation and gradient calculation required by HMC are described in Equations (5.3.2) and (5.3.3) in Section 5.3.2.
7: Update the bivariate weights $w^b$ used to compute the exchangeable parameters:
\[
(w^b)^{(i+1)} \mid (w^u, w^b)^{(i)}, Z^{(i)} \sim \mathrm{LBPS}\left(\cdot \mid (w^u, w^b)^{(i)}, Z^{(i)}, T\right),
\]
where $T$ is the tuning parameter in LBPS representing the fixed trajectory length.
8: end for

5.3.2 Gradient computation for univariate weights in the HMC step under the combined sampling scheme

In our proposed sampling scheme combining HMC and LBPS (Algorithm 8), we use HMC to update only the univariate weights $w^u$ in Step 6, and LBPS to update the bivariate weights $w^b$ in Step 7. Because HMC requires gradient information for $w^u$, we provide the gradient computation for $w^u$ in the HMC step of
LBPS-HMC under a reversible, unnormalized rate matrix.

Recall that the augmented joint density given the sufficient statistics $z(y) := (n(y), h(y), c(y))$ for a sampled CTMC path $y$ under a reversible, unnormalized rate matrix is:
\[
\log f_{w \mid z, y}(w \mid z, y) := -\frac{\kappa}{2} \|w\|_2^2 - \sum_{x \in \mathcal{X}} h_x \sum_{x' \in \mathcal{X} : x \neq x'} q_{x, x'}(w) + \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X} : x \neq x'} c_{x, x'} \log\left(q_{x, x'}(w)\right) + \sum_{x \in \mathcal{X}} n_x \log\left(\pi_x(w)\right). \tag{5.3.1}
\]
Under the combined sampling scheme, $w^u$ and $w^b$ are updated separately. In the $i$th iteration of LBPS-HMC, while using the HMC kernel, only $w^u$ needs to be updated; the exchangeable parameters are fixed at the values obtained in the $(i-1)$th iteration, denoted $\theta_{\{x, x'\}}\left((w^b)^{i-1}\right)$. To simplify notation, we write $\theta_{\{x, x'\}}$ for $\theta_{\{x, x'\}}\left((w^b)^{i-1}\right)$. Thus, Equation (5.3.1) can be rewritten as:
\[
\log f_{w \mid z, y}(w \mid z, y) = -\frac{\kappa}{2} \|w^u\|_2^2 - \sum_{x \in \mathcal{X}} h_x \sum_{x' \in \mathcal{X} : x \neq x'} \pi_{x'}(w^u)\, \theta_{\{x, x'\}} + \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X} : x \neq x'} c_{x, x'} \log\left(\pi_{x'}(w^u)\, \theta_{\{x, x'\}}\right) + \sum_{x \in \mathcal{X}} n_x \log\left(\pi_x(w^u)\right). \tag{5.3.2}
\]
Thus, the gradient of the augmented joint density with respect to $w^u$ is:
\[
\nabla \log f_{w \mid z, y}(w \mid z, y) = -\kappa w^u - \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X} : x \neq x'} h_x\, q_{x, x'}(w^u) \left(\psi(x') - \sum_{x \in \mathcal{X}} \pi_x(w^u)\, \psi(x)\right) + \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X} : x \neq x'} c_{x, x'} \left(\psi(x') - \sum_{x \in \mathcal{X}} \pi_x(w^u)\, \psi(x)\right) + \sum_{x \in \mathcal{X}} n_x \left(\psi(x) - \sum_{x \in \mathcal{X}} \pi_x(w^u)\, \psi(x)\right), \tag{5.3.3}
\]
where
\[
\nabla A(w) = \sum_{x \in \mathcal{X}} \psi(x)\, \pi_x(w).
\]

5.3.3 Factor graph representation under LBPS-HMC

In this section, we explain the motivation for proposing a combined sampling scheme that uses HMC to update the univariate weights $w^u$ and LBPS to update the bivariate weights $w^b$, instead of using only LBPS to update $w = (w^u, w^b)$. We refer to the latter scheme, which uses pure LBPS, as the naive sampling scheme. We illustrate this through the factor graph representation of the potential energy function with augmented sufficient statistics (defined in Section 2.1.2) under the combined sampling scheme and the naive sampling scheme separately.
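The softmax form of the stationary distribution underlies the gradient terms $\psi(x) - \sum_{x} \pi_x(w^u)\psi(x)$ appearing in Equation (5.3.3). A small sketch with one-hot univariate features, which is an illustrative assumption rather than the thesis' feature set, verifies the resulting identity $\nabla \log \pi_x(w) = \psi(x) - \sum_{x'} \pi_{x'}(w)\psi(x')$ by finite differences:

```python
import math

def stationary(wu):
    """Softmax stationary distribution pi_x(w) = exp(w_x) / sum_x' exp(w_x')."""
    z = [math.exp(v) for v in wu]
    s = sum(z)
    return [v / s for v in z]

def grad_log_pi(wu, x):
    """Gradient of log pi_x(w) under one-hot features psi:
    psi(x) - sum_x' pi_x'(w) psi(x')."""
    pi = stationary(wu)
    return [(1.0 if k == x else 0.0) - pi[k] for k in range(len(wu))]

# finite-difference check of the identity behind Equation (5.3.3)
wu = [0.3, -0.1, 0.8, 0.0]
eps = 1e-6
for k in range(4):
    bumped = list(wu)
    bumped[k] += eps
    fd = (math.log(stationary(bumped)[2]) - math.log(stationary(wu)[2])) / eps
    assert abs(fd - grad_log_pi(wu, 2)[k]) < 1e-4
```

Note that the gradient components sum to zero, reflecting the invariance of the softmax to a common shift of the weights.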
Here, without loss of generality, we omit the Normal factor coming from the prior distribution of $w$.

For simplicity, under both sampling schemes we illustrate the problem using DNA evolution, since there are only four states in the state space, $\mathcal{X} = \{A, C, G, T\}$. Under the naive sampling scheme, we consider a Bayesian GLM rate matrix with GTR features instead of chain GTR features, since the parameterization of the bivariate weights $w^b$ is simpler: each exchangeable parameter $\theta_{x, x'} = \exp(w_{x, x'})$ depends on only one element of the bivariate weights. In the chain GTR model, each exchangeable parameter depends on two bivariate weight elements, except that $\theta_{\{x, x'\}}$ with $\eta(\{x, x'\}) = 1$ depends on the single bivariate weight $w_1^b$, according to the definition in Equation (3.3.6). The root cause that makes the naive sampling scheme undesirable is that the parameterization of the stationary distribution makes $w^u$ a common set of neighbour variables for all factors; this holds for models with either GTR features or chain GTR features. Under the naive sampling scheme, we present the factor graph for a Bayesian GLM rate matrix with GTR features for DNA evolution in Figure 5.4.

Figure 5.4: Factor graph under the Bayesian GLM representation of the GTR model for DNA evolution when using only LBPS to update $w$, where $H_{x, x'}$ and $C_{x, x'}$ represent the sojourn time factor and the transition count factor for states $(x, x')$, with $x \neq x' \in \mathcal{X} = \{A, C, G, T\}$. In the graph, we leave out the other half of the transition count factors and sojourn time factors for the symmetric pairs of states $(x', x)$, such as $H_{CA}, C_{CA}, H_{GA}, C_{GA}, \dots$, for simplicity of presentation.

In this figure, $w^u = (w_A, w_C, w_G, w_T)$ is used to obtain the stationary distribution; for example, $\pi_A(w) = \exp(w_A) / \left(\exp(w_A) + \exp(w_C) + \exp(w_G) + \exp(w_T)\right)$.
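The GLM parameterization just described can be sketched directly. The function names and the ordering of the six bivariate weights (alphabetical pairs AC, AG, AT, CG, CT, GT) are illustrative conventions, not the thesis code; the construction uses $q_{x,x'} = \theta_{x,x'}(w^b)\,\pi_{x'}(w^u)$ with $\theta_{x,x'} = \exp(w_{x,x'})$ and $\pi$ the softmax of $w^u$:

```python
import math
from itertools import combinations

def gtr_rate_matrix(wu, wb):
    """Unnormalized reversible GTR rate matrix under the GLM parameterization.

    wu: four univariate weights (softmax gives the stationary distribution pi),
    wb: six bivariate weights, one per unordered pair of states."""
    z = [math.exp(v) for v in wu]
    pi = [v / sum(z) for v in z]
    theta = {pair: math.exp(w) for pair, w in zip(combinations(range(4), 2), wb)}
    q = [[0.0] * 4 for _ in range(4)]
    for (x, y) in combinations(range(4), 2):
        q[x][y] = theta[(x, y)] * pi[y]
        q[y][x] = theta[(x, y)] * pi[x]   # detailed balance: pi_x q_xy = pi_y q_yx
    for x in range(4):
        q[x][x] = -sum(q[x][y] for y in range(4) if y != x)
    return q, pi

q, pi = gtr_rate_matrix([0.0] * 4, [0.0] * 6)
# with all weights zero, pi is uniform and every off-diagonal rate equals 0.25
```

Reversibility holds by construction, since both $\pi_x q_{x,y}$ and $\pi_y q_{y,x}$ reduce to $\theta_{x,y}\pi_x\pi_y$.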
The bivariate weights $w^b = (w_{AC}, w_{AG}, w_{AT}, w_{CG}, w_{CT}, w_{GT})$ are used to obtain the exchangeable parameters; for example, $\theta_{AC} = \exp(w_{AC})$. As shown in Equation (4.4.7), all factors $C_{x, x'}$ and $H_{x, x'}$ depend on the rate matrix element $q_{x, x'}(w) = \theta_{x, x'}(w^b)\, \pi_{x'}(w^u)$, so all factors have $w^u$ among their neighbour variables. This implies that, for any factor, the extended neighbour factors include all other factors, since they all share the common neighbour variables $w^u$. In both models, the number of factors is $O(|\mathcal{X}|^2)$. Once a collision takes place, the candidate collision times for an order of $O(|\mathcal{X}|^2)$ extended neighbour factors must be computed under the naive sampling scheme. This can be computationally expensive for a large state space $\mathcal{X}$ and is not desirable.

To resolve this computational issue, we propose a sampling scheme that uses HMC to sample $w^u$ and LBPS to sample $w^b$. Under our proposed sampling scheme and a Bayesian GLM chain GTR model, the factor graph is shown in Figure 5.5. Here we provide the factor graph for a model with chain GTR features instead of GTR features, since in the later real data analysis of protein evolution we consider a model with chain GTR features. We illustrate why a model with chain GTR features has only a constant number of extended neighbour factors for any bivariate factor in DNA evolution with $|\mathcal{X}| = 4$; this conclusion generalizes naturally to protein evolution with $|\mathcal{X}| = 20$. Thus, the strong sparsity of the factor graph is satisfied under the Bayesian GLM chain GTR model using LBPS-HMC.

Regardless of the size of the state space $\mathcal{X}$, we prove that for any unordered, distinct pair $\{x, x'\} \subset \mathcal{X}$, any sojourn time factor $H_{x, x'}$ or transition count factor $C_{x, x'}$ has at most twelve extended neighbour factors and three extended neighbour variables under the Bayesian GLM chain GTR model using the proposed sampling scheme.
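The constant bound on the number of extended neighbour factors can be checked empirically on a toy encoding of the chain-GTR bivariate factor graph. The encoding below is a simplified assumption that mirrors the counting argument above: the exchangeable parameter with pair index $\eta$ depends on weights $\{\eta - 1, \eta\}$ (just $\{1\}$ for $\eta = 1$), and each unordered state pair contributes four local factors (two sojourn, two transition count, one per ordered direction):

```python
def chain_gtr_factor_graph(n_states):
    """Toy bivariate-weight factor graph under the chain GTR parameterization.

    Returns a map factor -> set of bivariate-weight indices it touches."""
    K = n_states * (n_states - 1) // 2          # number of unordered state pairs
    neigh = {}
    for eta in range(1, K + 1):
        vars_ = {1} if eta == 1 else {eta - 1, eta}
        for kind in ("H", "H'", "C", "C'"):     # 4 local factors per pair
            neigh[(kind, eta)] = vars_
    return neigh

def max_extended_factors(neigh):
    """Largest number of extended neighbour factors over all factors."""
    return max(sum(1 for g in neigh if neigh[g] & neigh[f]) for f in neigh)

# the bound does not grow with the state space: 12 for DNA and for proteins
dna, protein = chain_gtr_factor_graph(4), chain_gtr_factor_graph(20)
# max_extended_factors(dna) == 12 and max_extended_factors(protein) == 12
```

The total number of factors grows as $O(|\mathcal{X}|^2)$ while the extended neighbourhood stays bounded, which is exactly the strong sparsity condition of Definition 5.2.2.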
For pairs of states $\{x, x'\}$ with $\eta(\{x, x'\}) \neq |\mathcal{X}|(|\mathcal{X}| - 1)/2$, the twelve extended neighbour factors of $H_{x, x'}$ or $C_{x, x'}$ are the factors connected only with $w_i^b$ and $w_{i+1}^b$, where $i = \eta(\{x, x'\}) - 2$, $\eta(\{x, x'\}) - 1$, and $\eta(\{x, x'\})$. Similarly, for the pair of states $\{x^*, x^{**}\}$ such that $\eta(\{x^*, x^{**}\}) = |\mathcal{X}|(|\mathcal{X}| - 1)/2$, $H_{x^*, x^{**}}$ or $C_{x^*, x^{**}}$ has eight extended neighbour factors, each connected only with $w_{i+1}^b$, where $i = \eta(\{x, x'\}) - 2$ and $\eta(\{x, x'\}) - 1$. All factors have the three extended neighbour variables $w_{\eta(\{x, x'\}) - 1}^b$, $w_{\eta(\{x, x'\})}^b$, and $w_{\eta(\{x, x'\}) + 1}^b$, except that a factor $H_{x'', x'''}$ or $C_{x'', x'''}$ with $\eta(\{x'', x'''\}) = 1$ has the two extended neighbour variables $w_1^b$ and $w_2^b$.

Thus, when updating the positions of the extended neighbour variables or computing the candidate bounce times for the extended neighbour factors of the collision factor, only a small constant number of extended neighbour variables and factors is involved. We can conclude that the factor graph in Figure 5.5 satisfies the desirable strong sparsity of Definition 5.2.2. In the next section, we show that we can derive analytical solutions to the candidate bounce time for each category of factors.

5.4 Analytical bounce time solution to related factors

In this section, after taking the log, we decompose the potential energy $U(w)$ in Equation (4.4.7) with augmented sufficient statistics into a sum of factors, organized into the following categories:

Normal factor: $\frac{\kappa}{2}\|w\|_2^2$,

Sojourn time factor: $H_{x, x'} := h_x\, q_{x, x'}(w)$, for all $(x, x') \in \mathcal{X}$ distinct,
Analytical bounce time solution to related factors  Figure 5.5: Factor graph when using Bayesian GLM representation of our pro-posed chain GTR model for DNA evolution when using HMC to update univariateweights wu and LBPS to update bivariate weights wb, where w = (wu,wb).Transition count factor: Cx,x′ := −c(i)x,x′ log(qx,x′ (w)), for all (x, x′) ∈X distinct,Initial count factor: pix := −nx log(pix(w)), for all x ∈ X , with pix rep-resenting the initial count factor but pix(w) denotes the stationarydistribution for state x given w.We derive the analytical solution to the collision time for sojourn timefactors and transition count factors respectively. The analytical collisiontime solution for a Normal distribution can be found in Bouchard-Coˆte´ et al.(2018), so we focus on the other factors.935.4. Analytical bounce time solution to related factors5.4.1 Local sojourn time factorsAs discussed in Bouchard-Coˆte´ et al. (2018), if the posterior density is log-concave, the BPS can be viewed as “swapping” the order of the steps takenby the Metropolis-Hastings (MH) algorithm. In the MH algorithm, we firstsample a proposal, then sample E from a uniform distribution Unif(0, 1),and finally perform an accept or reject decision. In the BPS, E is firstsampled and then the maximum trajectory length ∆ allowed by the sameMH ratio is travelled such thatζ(w)ζ(w + v∆)= E. (5.4.1)In this section, we show that all factors in our model satisfy the convexityrequirements and hence the collision time can be efficiently computed usingEquation 5.4.1 above. We also provide analytic formulae for Equation 5.4.1for each type of local factors occurring in our model.Under a reversible model, the corresponding potential energy of the par-ticle is:U(w0) = h(i)x q(rev)x,x′ (w0)= h(i)x pix′ exp(〈w0,φ({x, x′})〉).The potential energy after a time interval ∆:U(w0 + v∆) = h(i)x q(rev)x,x′ (w0 + v∆)= h(i)x pix′ exp(〈w0 + v∆,φ({x, x′})〉).945.4. 
The potential energy is convex with respect to ∆, since

d²U(w₀ + v∆)/d∆² = U(w₀ + v∆) ⟨v, φ({x, x′})⟩² ≥ 0,

with equality only when ⟨v, φ({x, x′})⟩ = 0.

We observe that the particle is travelling to a higher energy region if and only if ⟨v, φ({x, x′})⟩ > 0. Therefore, we set

− log(E) = U(w₀ + v∆) − U(w₀) = h^{(i)}_x q^{(rev)}_{x,x′}(w₀) (exp(⟨v∆, φ({x, x′})⟩) − 1).

Writing c = − log(E) > 0, we obtain

∆ = (1 / ⟨v, φ({x, x′})⟩) log( c / (h^{(i)}_x q^{(rev)}_{x,x′}(w₀)) + 1 ) if ⟨v, φ({x, x′})⟩ > 0, and ∆ = ∞ otherwise.

5.4.2 Local transition count factors

Similarly, the transition count factor for a pair of states (x, x′) is −c^{(i)}_{x,x′} log(q^{(rev)}_{x,x′}(w₀)), and its gradient is −c^{(i)}_{x,x′} φ({x, x′}). Thus, the corresponding potential energy of the particle is

U(w₀) = −c^{(i)}_{x,x′} log(q^{(rev)}_{x,x′}(w₀)) = −c^{(i)}_{x,x′} (log π_{x′} + ⟨w₀, φ({x, x′})⟩).

The potential energy after a time interval ∆ is

U(w₀ + v∆) = −c^{(i)}_{x,x′} (log(π_{x′}) + ⟨w₀ + v∆, φ({x, x′})⟩),

which is linear, and hence convex, in ∆, since d²U(w₀ + v∆)/d∆² = 0. We sample E ∼ Unif(0, 1), set c = − log(E) and U(w₀ + v∆) − U(w₀) = c, and obtain

∆ = −c / (c^{(i)}_{x,x′} ⟨v, φ({x, x′})⟩) if ⟨v, φ({x, x′})⟩ < 0, and ∆ = ∞ otherwise.

5.5 Running time analysis of LBPS-HMC

In Section 5.2.2, we provided the running time analysis of LBPS under a general factor graph G. In this section, we provide the running time analysis of the LBPS-HMC alternative under the Bayesian GLM chain GTR model for CTMCs. Recall that the computational cost for a factor graph G with the sparsity property is given in Equation 5.2.1:

O(m c* + L₁ (p^{α₁} + c_{U*} + m^{α₂} c* + log m) + L₂ (p^{α₁} + m^{α₂} c* + m^{α₂} log m)),

assuming there are L₁ collision events and L₂ refreshment events while the particle travels along a fixed trajectory length.

In the special case of using LBPS-HMC under the Bayesian GLM chain GTR model for CTMCs, we have c* ≤ O(1), c_{U*} ≤ O(1), |N*| ≤ O(1) and |S*| ≤ O(1).
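Before continuing with the running time analysis, the two closed-form candidate bounce times derived in Sections 5.4.1 and 5.4.2 can be sketched as follows (a hypothetical implementation with our own argument names, not the thesis code; h is the sojourn statistic, q0 = q^{(rev)}_{x,x′}(w₀), count = c^{(i)}_{x,x′} and a = ⟨v, φ({x, x′})⟩):

```python
import math
import random

def sojourn_bounce_time(h, q0, a, rng=random):
    """Candidate collision time for a sojourn time factor H = h * q(w),
    where q0 = q(w0) and a = <v, phi({x, x'})> is the directional
    derivative of the log-rate along the current velocity v."""
    c = -math.log(rng.random())          # c = -log E, with E ~ Unif(0, 1)
    if a <= 0:
        return math.inf                  # energy never increases along v
    return math.log(c / (h * q0) + 1.0) / a

def count_bounce_time(count, a, rng=random):
    """Candidate collision time for a transition count factor
    C = -count * log q(w); the potential is linear in the travel time."""
    c = -math.log(rng.random())
    if a >= 0:
        return math.inf
    return -c / (count * a)
```

Plugging the returned ∆ back into the corresponding energy difference recovers exactly the drawn c = −log E, which is the defining property of Equation 5.4.1.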
The computational cost we derive holds not only for the Bayesian GLM chain GTR model but for any Bayesian GLM model satisfying c* ≤ O(1), c_{U*} ≤ O(1), |N*| ≤ O(1) and |S*| ≤ O(1). Under the Bayesian GLM chain GTR models, we have m ≤ O(|X|²) and p ≤ O(|X|²).

1. Sample the auxiliary sufficient statistics using the end-point sampler: O(|X|³).

2. Update the univariate weights w_u using HMC: O(L₀|X|), where L₀ is the number of leapfrog steps in one HMC iteration. When updating w_u, the number of non-zero entries across all possible features is |X|.

3. Update the bivariate weights w_b using LBPS: O(L₁ log|X| + L₂ log|X|). This is obtained by plugging α₁ = α₂ = 0 into Equation 5.2.1, which follows from |N*| ≤ O(1) and |S*| ≤ O(1).

Thus, one iteration of our LBPS-HMC alternative takes a total of

O(L₀|X| + L₁ log|X| + L₂ log|X| + |X|³).   (5.5.1)

For comparison, one HMC iteration for the Bayesian GLM chain GTR model takes O(L|X|² + |X|³), since the total number of non-zero entries across all possible features of the Bayesian GLM chain GTR model is |X|(|X| + 1)/2 when using only HMC, where L is the number of leapfrog steps in one HMC iteration. It is worth noting that L₀ is the number of leapfrog steps when HMC samples only the univariate weights w_u, while L is the number of leapfrog steps when HMC samples both the univariate weights w_u and the bivariate weights w_b. Empirically, L is usually chosen larger than L₀. In our experiments with HMC, we find that L₀|X|, L₁ log|X| and L₂ log|X| are all lower order terms compared with L|X|². This also explains why our LBPS-HMC is more computationally efficient than HMC.

Our running time analysis so far does not provide information on how L₁ and L₂ should scale in order to obtain a constant number of effective samples.
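As a side note on step 1 above, the sufficient statistics consumed by the sojourn and transition count factors — sojourn times h_x and transition counts c_{x,x′} — are simple path functionals of a CTMC. The following illustrative sketch accumulates them along a forward-simulated path (the thesis instead uses an end-point sampler that conditions on the observed states, which this sketch does not implement; function names are ours):

```python
import random

def ctmc_sufficient_stats(Q, x0, t_end, rng):
    """Simulate one CTMC path with rate matrix Q (rows sum to zero) and
    accumulate the sojourn time h[x] in each state and the transition
    counts c[(x, x')]. Gillespie-style forward simulation, for
    illustration only."""
    n = len(Q)
    h = [0.0] * n
    counts = {}
    x, t = x0, 0.0
    while True:
        rate = -Q[x][x]
        wait = rng.expovariate(rate) if rate > 0 else float("inf")
        if t + wait >= t_end:
            h[x] += t_end - t            # truncate the last sojourn
            return h, counts
        h[x] += wait
        t += wait
        # choose the next state proportionally to the off-diagonal rates
        u = rng.random() * rate
        for y in range(n):
            if y == x:
                continue
            u -= Q[x][y]
            if u <= 0:
                counts[(x, y)] = counts.get((x, y), 0) + 1
                x = y
                break
```

By construction the sojourn times sum to the total observation time, and every recorded jump is between distinct states.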
To give the reader an idea of how L scales in HMC, consider the case of normally distributed random vectors of dimension p with an identity covariance matrix. This is not the same as our problem, or any real practical problem, but it may shed light on how the number of leapfrog steps L scales with dimensionality. To reach a nearly independent point, the number of leapfrog updates in the independent and identically distributed normal case grows as O(p^{1/4}) (Neal, 2010), where p is the dimension of the parameters. This implies that the computational cost to obtain one effective sample using HMC is O(p^{5/4}). In our case, one HMC iteration takes O(L|X|² + |X|³), so we expect L|X|² to be the dominant term, since in the case of the GTR model p is Θ(|X|²), and under the i.i.d. normal case L|X|² grows as O(|X|^{5/2}). For LBPS, under the i.i.d. normal distribution of dimension p, or the weakly dependent case, we expect L₁ to grow as O(p). This indicates that the computational cost to obtain one effective sample with LBPS in this situation is O(p log p).

In this chapter, we have presented our novel sampling algorithm LBPS-HMC and provided a framework to analyze its running time. In the next chapter, we demonstrate the flexibility and scalability of the sampling algorithms proposed in Chapter 4 and Chapter 5 using both synthetic and real phylogenetic protein data.

Chapter 6

Experiments

In this chapter, we use numerical studies to evaluate the computational performance of our proposed AHMC algorithm and of the proposed algorithm combining LBPS and HMC, denoted LBPS-HMC, through both synthetic datasets and real data analysis.

6.1 Numerical studies for AHMC

In this section, as an illustration of how a typical modeller would interact with our framework, we develop three novel rate matrix models for amino acid data by incorporating various biochemical properties of amino acids.
We investigate the behaviour of these models on synthetic and real data. We also perform experiments to assess the computational efficiency of the method, as well as the effectiveness of the automated HMC adaptation method. All the datasets used in the experiments are available at https://github.com/zhaottcrystal/CTMC-via-HMC.

6.1.1 Synthetic data experiments for AHMC

In this section, we generate synthetic amino acid sequences evolving on a phylogenetic tree to show that our method can accurately estimate the held-out rate matrix used to generate the data. We also use this controlled dataset to demonstrate the computational efficiency of our HMC algorithm relative to classic MCMC techniques. We implemented AHMC for our framework in Java, with open source code available at https://github.com/zhaottcrystal/conifer. We first tested our implementation using several strategies, including numerical checks of the gradient against the pointwise value of the objective function, as well as the prior/prior-posterior simulator test of Geweke (2004).

Setup for synthetic data experiments

We use a weight configuration of the PolaritySizeGtr model described in Section 3.2.1 to obtain a reversible rate matrix, which is in turn used to generate synthetic amino acid sequences evolving along a phylogenetic tree. Using the package from Paradis et al. (2004), we simulate an unrooted bifurcating tree with 10 tips. We use a uniform distribution on [0, 1] for the branch lengths. The tree topology is generated according to Aldous' Markov branching model (Aldous, 1996). Throughout the synthetic data experiments of this section, the tree topology and the branch lengths are fixed at the true tree we generate; only the weights related to the rate matrix are sampled using our proposed AHMC algorithm. The weights for the Gtr features are generated from a uniform distribution ranging from −1 to 1.
The magnitude and sign of these weights are selected by examining the posterior weight distributions on real data, since the magnitudes of the coarse features affect the maximum possible benefit that can be gained from using coarse features. The values we use are consistent with the posterior weight values obtained from real data in Figure 6.8. For example, we show in Figure 6.8 that both the lower and upper quantiles of the weights of features corresponding to changes between polar amino acids, such as "AcidicPolar", are supported within the positive real line, so we choose positive weights for them in our experiments. Similarly, the lower and upper quantiles are negative for the weights corresponding to the "BigMicro" feature, related to changes between different size categories; this motivates us to select negative values for them. For the univariate feature weights, we use a uniform distribution over [−0.2, 0.2].

We evaluate the accuracy of the rate matrix reconstructions using three error measures. The first compares the stationary distribution of the estimated rate matrix to the stationary distribution of the held-out rate matrix (that is, the one used to generate the dataset); to compare the two stationary distributions, we compute the Kullback-Leibler (KL) divergence. The second measure is based on relative biases in the entries of the marginal transition probability matrices, p_t(x, x′) = exp(Qt)_{x,x′}. The relative bias is defined as r̂(x, x′) := (p̂_t(x, x′) − p_t(x, x′)) / p_t(x, x′), where p_t(x, x′) and p̂_t(x, x′) are the held-out and estimated transition probabilities between states x and x′ at the end points of a branch of length t. We show results for t = 1. Finally, we also measure error using a more direct, but less interpretable, metric: the Root-Mean-Square Error (RMSE) on individual rates.
We take into account all entries of Q except the diagonal entries, which are redundant because of the constraint that each diagonal entry Q(x, x) is equal to the negative sum of the off-diagonal entries of row x.

Accuracy of rate matrix reconstruction on synthetic data

In this section, we assess the accuracy of our inferred parameters on a synthetic dataset generated under the PolaritySizeGtr model. We select the same number of sites (415) and sequences (641) as in our real data experiments (Section 6.1.3). We summarize the posterior distribution into a single reconstructed rate matrix via the posterior mean of the rate matrix samples.

First, we observed that the stationary distribution reconstruction is very accurate (as measured by the KL divergence). Using 100,000 MCMC iterations with 10,000 as the burn-in period, we obtain a KL divergence of 0.0003, which indicates the two distributions are very similar, since a value of zero for the KL divergence indicates the two distributions are identical. See Figure F.1 for a visualization interpreting this result, showing the high quality of the reconstruction in terms of the induced stationary distribution. This confirms our intuition that the stationary distribution is not the challenging part of the problem.

Figure 6.1: Actual stationary distributions are compared with estimates from a 415-site, 641-sequence synthetic dataset generated using the PolaritySizeGtr model. We generate unbalanced stationary distributions to increase the difficulty of reconstruction.

To get a more complete picture of accuracy, we examine the reconstruction error in terms of the relative biases of individual transition probabilities. The 20 × 20 − 20 = 380 off-diagonal transition probabilities each have a different relative bias, so we summarize them using a histogram, shown in Figure 6.2.
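For reference, the three error measures described above can be computed directly from a held-out rate matrix Q and an estimate; a minimal sketch (function names are ours; SciPy's matrix exponential gives p_t = exp(Qt)):

```python
import numpy as np
from scipy.linalg import expm

def kl_divergence(p, q):
    """KL(p || q) between two stationary distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def relative_biases(Q_hat, Q_true, t=1.0):
    """Entrywise relative bias of the off-diagonal transition
    probabilities p_t(x, x') = expm(Q t)[x, x']."""
    P_hat, P = expm(Q_hat * t), expm(Q_true * t)
    off = ~np.eye(len(Q_true), dtype=bool)
    return (P_hat[off] - P[off]) / P[off]

def rate_rmse(Q_hat, Q_true):
    """RMSE over the off-diagonal rates (diagonals are redundant)."""
    off = ~np.eye(len(Q_true), dtype=bool)
    return float(np.sqrt(np.mean((Q_hat[off] - Q_true[off]) ** 2)))
```

All three collapse to zero when the estimate equals the held-out matrix, which makes them easy to sanity-check.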
Values higher than zero correspond to an over-estimation of the transition probability, and values lower than zero correspond to an under-estimation. The 5%, 15%, 25%, 50%, 75%, 85% and 95% quantiles of the relative bias, excluding the diagonal elements of the transition probability matrix at t = 1, are −0.211, −0.112, −0.062, 0.000, 0.082, 0.176 and 0.423, indicating that 90% of the estimated values range from 21.1% smaller to 42.3% larger than the true values. The true transition probabilities for the elements with relative biases greater than one are 0.0029 and 0.0021, and the estimates are 0.0061 and 0.0053, corresponding to amino acids with low stationary probability in the synthetic dataset. This validates that the large relative biases come from the low transition probabilities. We provide histograms of the relative bias for t = 2, 3 and 5 in Appendix F. When t = 5, almost all relative biases are within a 20% difference from the true values.

Figure 6.2: Histogram of the relative biases (described in Section 6.1.1) of the 380 off-diagonal transition probabilities (t = 1). The two dotted lines label −0.2 and 0.2.

Next, we investigate the dependency of the estimates on the number and type of features used in the model, and on the size of the dataset. We create 60 synthetic datasets, each based on 10 synthetic sequences related by a shared phylogenetic tree under a PolaritySizeGtr model. The tree topology is the same as in the remaining synthetic data experiments of this chapter. We generate four independent datasets with 1500 sites, and let the number of sites in each sequence increase from 100 to 1400 in steps of 100 by subsampling from the four independent datasets with 1500 sites. We apply four different models to each of these 60 datasets: Gtr, Polarity, PolaritySize, and PolaritySizeGtr (see Section 3.2.1). The number of MCMC iterations with the HMC move is 100,000, and we choose a burn-in period of 10,000. Each iteration contains several leapfrog steps. We perform a Geweke diagnostic test (Geweke, 1992) to assess MCMC mixing; the results of this test are in Appendix G. The averaged run times (in minutes) of the four independent datasets using the 10 synthetic sequences with 1500 sites under Gtr, Polarity, PolaritySize, and PolaritySizeGtr are 2021.9, 1944.5, 1424.5, and 2071.2, respectively. These experiments are run on a cluster of Intel Xeon E5430 quad-core processors running at 2.66 GHz.

Figure 6.3: Mean RMSE of estimated rate matrices for the four models described in Section 3.2.1. The confidence bands are obtained from the LOESS smoothing method using a quadratic polynomial regression.

To summarize the performance of each model on each dataset by a single number, we use the RMSE metric described in Section 6.1.1. We plot the mean accuracies in Figure 6.3 as a function of the number of sites. We also show 95% confidence bands computed using the Local Polynomial Regression Fitting (LOESS) smoothing method (Chambers and Hastie, 1991), based on four repeated measures for each site count. The smoothed value obtained from the LOESS fit is determined by its neighbouring data points using a quadratic polynomial in the regression.

Under a naive asymptotic analysis, one might expect the Gtr model to be unchallengeable.
The results show that for a wide range of finite, realistic dataset sizes, it is advantageous to add coarse features to the model. Of course, this is true only when the values of the coarse feature weights are non-zero. Fortunately, our experiments in Section 6.1.3 show that our model has high confidence that the magnitudes of the weights of several of our physicochemical coarse features are indeed large in a dataset of proteins in the protein kinase domain family. Reassuringly, we also see from Figure 6.3 that the performance of all models improves as more data becomes available. As expected, for the Polarity and PolaritySize models, this performance increase tapers off, since these two models are misspecified in this setup. However, PolaritySize is very effective for small datasets.

We also compare our implementation to MrBayes, a widely used Bayesian phylogenetics package (Huelsenbeck and Ronquist, 2001; Ronquist et al., 2012). We select a specific set of features, described in Appendix C, that matches the degrees of freedom in MrBayes' GTR configuration. We have shown in Appendix C.2 that even in this specific configuration, our model yields a different rate matrix prior than MrBayes'. Our experiments therefore focus on measuring the effect that the difference in priors has on the posterior distributions. To do so, we simulate a tree with 10 leaves and 5000 sites using the same strategy as described in Section 6.1.1. We fix the tree topology and branch lengths to the correct values for both MrBayes and our method. The results are in Figures F.5 and F.6 in Appendix F, and validate that the posterior does not appear to be sensitive to the form of the distribution f_w on the individual weights.
Figure 6.4: Estimates of the stationary distributions from MrBayes and our model, both set to their respective GTR settings.

Figure 6.5: Histogram of the relative differences of the estimates of the 190 exchangeable coefficients between MrBayes and our model, both set to their respective GTR settings. The differences are generally within 0.05; the two outliers greater than 0.1 are between A and R and between Q and P.

6.1.2 Evaluation of computational efficiency

Next, we seek to assess the computational gains brought by our sampling algorithm. To do so, we simulate a tree with 10 leaves and 2000 sites, using the same strategy as described in Section 6.1.1. We focus on a single model, PolaritySizeGtr, and compare several samplers. As a baseline, we first perform experiments using a Metropolis-Hastings algorithm with a Normally distributed proposal. Since we observed the performance of this baseline to be highly sensitive to the bandwidth of the proposal, we repeat the experiment with a wide range of bandwidths. We also run an adaptive MCMC algorithm from Roberts and Rosenthal (2009), namely a Normal Proposal Metropolis-Hastings (NMH) algorithm based on a joint Normal proposal distribution with an empirical covariance estimated from the MCMC run. We use the proposal distribution of Roberts and Rosenthal (2009), given at iteration n by

Q_n(x, ·) = N(x, (0.1)² I_d / d) for n ≤ 2d,
Q_n(x, ·) = (1 − β) N(x, (2.38)² Σ_n / d) + β N(x, (0.1)² I_d / d) for n > 2d,

where x is the estimate at the previous iteration, d is the dimension of the parameters, I_d is the identity matrix of dimension d, Σ_n is the current empirical estimate of the covariance matrix based on the run so far, and β is a small positive constant (we take β = 0.05, as suggested in Roberts and Rosenthal (2009)). It is known from Roberts et al. (1997) and Roberts et al. (2001) that the proposal N(x, (2.38)² Σ / d) is optimal in a particular large-dimensional context. We compare these baseline samplers to our tuning-free sampling method, which combines auxiliary variables and AHMC (see Section 4.1).

We quantitatively evaluate computational efficiency by monitoring the Effective Sample Size (ESS) per second. This measure provides an estimate of the number of independent samples that the chain generates; such a measure is needed because MCMC produces correlated samples, and a higher number of steps per second does not necessarily mean higher efficiency, as the samples could be more correlated. We use the package from Plummer et al. (2006) to compute the ESS. We calculate the ESS per second for each feature weight and report the extrema and quantiles of the ESS per second across the different feature weights.

We repeat each experiment four times and average the results. We show a box plot of the ESS per unit time for each feature weight in Figure 6.6. We scale all results by 10⁴ seconds for ease of presentation. See Appendix G for results over a wider range of bandwidths (the variance of the Normal proposal distribution). Our method achieves higher efficiency than classical NMH kernels at all the bandwidths tested, as well as than the adaptive NMH algorithm. Even when comparing our sampler to the "oracle" NMH baseline, the minimum ESS per second is superior by an order of magnitude, and the quantiles by a factor of at least two. Our sampler outperforms adaptive NMH by a factor of four in terms of the minimum ESS per second. Moreover, it is worth noting that the plot is on a log scale. AHMC is significantly better than classical MCMC given the summaries discussed above.
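For concreteness, one draw from the Roberts and Rosenthal (2009) adaptive proposal Q_n described above can be sketched as follows (a hypothetical implementation with our own names; we use the number of stored samples as a stand-in for the iteration counter n):

```python
import numpy as np

def adaptive_nmh_proposal(x, samples, beta=0.05, rng=None):
    """One draw from the Roberts-Rosenthal adaptive proposal: an
    isotropic N(x, 0.1^2 I_d / d) for the first 2d iterations, and
    afterwards a mixture of N(x, 2.38^2 Sigma_n / d) with weight
    1 - beta and the isotropic component with weight beta."""
    rng = rng or np.random.default_rng()
    d = len(x)
    iso_cov = (0.1 ** 2) * np.eye(d) / d
    n = len(samples)                      # stand-in for the iteration index
    if n <= 2 * d or rng.random() < beta:
        return rng.multivariate_normal(x, iso_cov)
    sigma_n = np.cov(np.asarray(samples), rowvar=False)  # empirical covariance
    return rng.multivariate_normal(x, (2.38 ** 2) * sigma_n / d)
```

The isotropic component guarantees the chain can always move before the empirical covariance Σ_n is well estimated, and keeps the mixture proposal nondegenerate afterwards.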
We also expect that as the dimension of the parameter space increases, we will observe an even larger computational gain compared with the classical MCMC algorithms.

Figure 6.6: Effective sample size per 10⁴ seconds, on a log scale, using AHMC, an adaptive NMH with a joint proposal guided by an estimated covariance matrix, and NMH with bandwidths of 0.01, 0.02, 0.05, 0.10, where NMH x stands for a Normal proposal Metropolis-Hastings algorithm with proposal bandwidth x.

Next, we investigate the effectiveness of our Bayesian optimization adaptation scheme from Section 4.3. We simulate a tree with 10 leaves and amino acid sequences with 500 sites. We adapt the parameters L and ε for HMC using the techniques of AHMC described in Section 4.3, and obtain L = 31 and ε = 0.02473 after adaptation. To assess the quality of these values, we repeat the same experiment several times without adaptation, on a grid of parameters centered at L = 31 and ε = 0.02473; see Figure 6.7. Each cell in the plot reflects the average ESS per second across all feature weights, displayed on a log scale. The results show that the performance of the combination of tuning parameters identified by our adaptation scheme is close to optimal. Moreover, we found that the cost of the adaptive Bayesian optimization is negligible in all cases we encountered. While the cost depends on the total number of optimization steps, we found that a small number of steps was sufficient in practice (Figure 6.7), with a cost on the order of 2-3 minutes. By comparison, running 100,000 MCMC iterations took 2,262 minutes.

Figure 6.7: ESS per second (displayed on a natural log scale) across all features obtained by running HMC with tuning parameters coming from a grid centered at the values ε = 5 × 10⁻⁷ and L = 30 coming from AHMC.

We also include a comparison of the computational efficiency of the HMC sampling scheme with and without the auxiliary variables described in Sections 4.4 and 4.4.5. In both setups, we set ε = 5 × 10⁻⁷ and L = 30. The results, shown in Table 6.1, show that the HMC sampling scheme with auxiliary variables outperforms an HMC algorithm without auxiliary variables by more than one order of magnitude. As before, performance is measured by ESS per second.

Method    Min    1st Quantile    Median    Mean    3rd Quantile    Max
Aux       20     165             295       551     484             8698
No Aux    3.27   8.62            12.1      56.5    18.2            681

Table 6.1: Summary of the ESS per 10⁶ seconds for the stationary distribution and exchangeable coefficients with and without the auxiliary variable, denoted "Aux" and "No Aux" respectively, with estimates inferred under a PolaritySizeGtr model with fixed ε = 5 × 10⁻⁷ and L = 30, while fixing the topology and branch lengths to the true ones, on a simulated amino acid dataset of 10 leaves and 5000 sites generated under a PolaritySizeGtr model.

6.1.3 Kinase domain data for AHMC

We apply our methods to the protein kinase domain family, which is found in many different proteins. To create an alignment, we use the jackhmmer program, part of the HMMER 3.0 software suite (Eddy, 1998), to search the non-redundant protein sequence database for homologs. The starting sequence is selected to be the kinase domain present in the mouse Titin protein, taken from the Protein Data Bank (PDB) file 1TKI, which also includes some of the flanking sequence. The final alignment contains 641 sequences and 415 sites.
To initialize the phylogenetic tree used in our MCMC algorithm, we ran MrBayes (Ronquist et al., 2012), which has a more complete and efficient set of tree resampling moves, until the average standard deviation of split frequencies was under 0.1.

We first estimate the rate matrix using a model containing the PolaritySize set of features. In this situation, only thirty-three parameters need to be estimated, including thirteen weights corresponding to bivariate features and twenty related to the univariate features. We run the MCMC chain for 100,000 iterations, including a burn-in period of 10,000 iterations. We use the Heidelberger and Welch diagnostic (Heidelberger and Welch, 1981, 1983) to test whether the sampled values come from a stationary distribution. We fail to reject the null hypothesis, which indicates that the number of MCMC iterations is sufficient for the thirty-three parameters. In Figure 6.8, we show the approximation of the posterior distribution of the weights for each bivariate feature obtained from the post-burn-in iterations. Recall that when a weight is positive, its feature has an accelerating effect on the substitution rate, and when the weight is negative, the feature has a decelerating effect. We expect that substitutions among amino acids with similar properties would be favoured.

Figure 6.8: Approximation of the posterior distribution of the weights corresponding to coarse features under the PolaritySize model. The naming convention on the abscissa is described in Section 3.2.1, with "Non" standing for "NonPolar".
From Figure 6.8, we see that the 95% Highest Posterior Density (HPD) intervals are supported within the positive real line for the two weights corresponding to the "AcidicPolar" and "BasicBasic" features, and within the negative real line for the weights corresponding to the "AcidicNonPolar" and "BasicNonPolar" features. We performed the same experiment with the PolaritySizeGtr model and obtained similar results; see Figure F.9 in Appendix F.

We also estimated the exchangeable coefficients θ_{x,x′} between each pair of amino acids. The exchangeable coefficients represent the rate of substitution between two amino acids after factoring out the effect of their stationary distribution. Our results, shown in Figure 6.9, are qualitatively consistent with the substitution patterns found in previous work. For example, comparing with Barnes et al. (2003), the exchangeable coefficients between tryptophan (W) and other amino acids are relatively small compared with other states, which is expected based on tryptophan's unique properties. Moreover, arginine (R) and lysine (K) have a high exchangeable coefficient, which is expected because of their similar charge properties.

Figure 6.9: Bubble plot of exchangeable coefficients. A bigger bubble indicates a larger exchangeable coefficient value.

6.1.4 Discussion

Both the simulation studies and the real data analysis in this section show promise for estimating large rate matrices for which prior knowledge is available. One well-suited application would be the modelling of the co-evolution of groups of interacting amino acid residues. Features could be used to model both the evolution of each individual amino acid (driven by mechanistic codon constraints and physicochemical considerations) and structural constraints acting on several amino acids simultaneously.

An important challenge with such models, which we have not addressed here, is the computation of matrix exponentials over large matrices. Fortunately, recent advances in the field could be combined with our method (Higham, 2009; Hajiaghayi et al., 2014). In particular, our work has the potential to be combined with methods that can exploit sparsity in the rate matrix (Didier et al., 2009; Zhang et al., 2010; Rao and Teh, 2013; Irvahn and Minin, 2014), since the base measures in our GLM-CTMC model can be used to create sparsity.

A second challenge in estimating large models is data size. One strategy to address this is the use of several datasets simultaneously. Again, this can be handled easily in our framework, including the modelling of hierarchical and non-hierarchical structures across datasets. One could use distinct covariates where the underlying rate matrix model is allowed to vary while sharing statistical strength across different families and/or different subtrees of the tree of life.

We expect the performance improvements brought by our AHMC methods over (adaptive) NMH methods to increase in these higher dimensional models. In Neal (2010), an asymptotic analysis supports that the computational requirements for HMC scale as O(d^{5/4}), compared to O(d²) for an optimally tuned NMH method. Moreover, using the Riemann manifold HMC approach of Girolami and Calderhead (2011), it may be possible to combine the advantages of our method with some of the gains over naive NMH brought by the covariance adaptation baseline of Section 6.1.2. More generally, our method can benefit from various classical methods and recent advances from the HMC literature, for example methods to decrease rejection rates via non-constant trajectory lengths (Hoffman and Gelman, 2014b; Sohl-Dickstein et al., 2014), partial refreshment methods (Horowitz, 1991), splitting methods that exploit tractable Hamiltonian subcomponents (Sexton and Weingarten, 1992), and several other techniques reviewed in Neal (2010).

6.2 Numerical studies for LBPS-HMC

In this section, all numerical experiments are run on the Compute Canada resource "cedar", with Intel "Broadwell" E5-2683 v4 CPUs at 2.1 GHz. A detailed description of "cedar" can be found at https://docs.computecanada.ca/wiki/Cedar. The algorithm is implemented in Python 3.6.4. Across all experiments, the ESS is evaluated via the R package mcmcse (Flegal et al., 2012). We discard the first 30% of the posterior samples as the burn-in period.

All our synthetic datasets are simulated under a Bayesian GLM chain GTR model. We assign a standard normal prior over both the univariate weights and the bivariate weights, which are both generated from Unif(0, 1) across all simulation studies. The refreshment rate of LBPS-HMC is set to one, and the number of leapfrog steps and the stepsize of each step in HMC are set to L = 40 and ε = 0.001. This combination of tuning parameters was chosen through a grid search over multiple combinations of parameter values.

All synthetic experiments generate 500 sequences, with the total length of the observation time interval equal to three and the states of all sequences observed every 0.5 time units, under Bayesian GLM chain GTR models with rate matrices of different dimensions. In the remainder of this chapter, "LBPS-HMC" refers to the combined sampling scheme of using HMC to sample the univariate weights and LBPS to sample the bivariate weights.
When the HMC sampling scheme is mentioned later, it refers to the sampling scheme using HMC to sample all the weight parameters w = (w_u, w_b).

In both the experiments using synthetic datasets in this section and the real pair of data sequences in Section 6.2.4, we compare computational efficiency via the ESS per second and also check the correctness of the two samplers. For the synthetic data experiments, we perform an Exact Invariance Test (EIT) in the spirit of Geweke (2004), compare the density of the posterior samples between the two samplers, and compute the Absolute Relative Difference (ARD) defined in Appendix H.4. Since the two algorithms share the same limiting distribution, as we increase the number of iterations of the Markov chains, after an initial burn-in period, the distributions of the posterior samples obtained from the two algorithms will become more and more similar, and both will become closer to the target posterior distribution. For higher dimensional rate matrices, since the total number of parameters is O(|X|^2), it is hard to display the density plots for all parameters, so we use ARD as the metric to describe the similarity between the posterior samples from the two algorithms. Smaller values of ARD indicate greater similarity. In Section 6.2.4, with the real data sequences, we only use ARD to check the similarity between the distributions of the posterior samples from the two algorithms and show the results in Appendix H.4.

6.2.1 Correctness check of LBPS and HMC using exact invariance test

To check the correctness of our software implementation and derivation of the analytic collision time, we use Kolmogorov-Smirnov (KS) tests to formally check that marginal-conditional and successive-conditional distributions coincide, in the spirit of Geweke (2004).
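The flavour of such a KS-based check can be conveyed with a generic two-sample comparison; the sketch below (not the thesis's EIT harness) computes the two-sample KS statistic for draws that, under a correct implementation, come from the same distribution:

```python
import numpy as np

def ks_statistic(a, b):
    # Two-sample Kolmogorov-Smirnov statistic: sup_x |F_a(x) - F_b(x)|,
    # with both empirical CDFs evaluated on the pooled sample.
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    fa = np.searchsorted(a, grid, side="right") / len(a)
    fb = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(fa - fb).max()

rng = np.random.default_rng(1)
# Stand-ins for one parameter's draws from the marginal-conditional and
# successive-conditional simulators; a correct sampler makes these agree.
marginal_cond = rng.normal(size=5_000)
successive_cond = rng.normal(size=5_000)

stat = ks_statistic(marginal_cond, successive_cond)
# Approximate 5% critical value for two samples of sizes n and m:
crit = 1.358 * np.sqrt((5_000 + 5_000) / (5_000 * 5_000))
print(stat, crit)
```

A statistic well above the critical value flags a disagreement between the two simulators, i.e., a likely bug in the implementation or in the collision-time derivation.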
We performed tests on both LBPS-HMC and HMC kernels on various dimensions of the rate matrices (see Appendix H.1).

To further validate the correctness of LBPS-HMC and HMC, we compare the distributions of the posterior samples collected from LBPS-HMC and HMC respectively. The shared synthetic dataset is generated under an 8-by-8 Bayesian GLM chain GTR model. The prior distribution for w is a standard normal distribution. We obtain 40,000 and 10,000 posterior samples from LBPS-HMC and HMC separately, with the first 30% of the samples discarded. As we run longer chains for both algorithms, we expect their density plots to become closer. We use the density plots of different parameters in Figure 6.10 to illustrate this. The boxplot is provided in Figure H.1 in Appendix H.2. We also provide summary statistics of the posterior samples among different parameters.

Figure 6.10: Density plot of posterior density comparison across exchangeable parameters of an 8-by-8 rate matrix between LBPS-HMC and HMC.

6.2.2 Computational efficiency comparison between LBPS-HMC and HMC

In order to explore the scalability of our algorithm as the dimension of the parameters increases, we compared the ESS per second among all the parameters for different dimensions of the rate matrices. The larger the ESS per second, the more efficient the algorithm. It is worth noting that the ESS per second (on the y-axis) is shown on a log10 scale. The dimension of the rate matrices ranges from 5-by-5, 10-by-10, . . . , to 30-by-30. We obtain a total of 60,000 posterior samples from LBPS-HMC and 10,000 samples from HMC. To speed up the actual running time of the experiments, for rate matrices of dimension lower than 15-by-15, the trajectory length, the tuning parameter T of LBPS, is fixed at 0.1. For rate matrices of higher dimensions, the trajectory length is shortened to 0.05. According to the Bayesian GLM chain GTR parameterization, the number of exchangeable parameters is |X|(|X| − 1)/2. The result is displayed in Figure 6.11, where the y-axis represents the ESS per second across all exchangeable rate parameters in a rate matrix on a log10 scale.

In order to compare the efficiency of two MCMC samplers, one obvious choice is the mean (or median) of the ESS per second across all parameters. But we emphasize that the minimum ESS per second is more important. It is often the case that one or a few parameters have dramatically worse mixing than the others, and valid posterior samples will not be available unless all parameters have mixed adequately, which means that the minimum ESS per second must be above a certain selected threshold (de Valpine et al., 2017).

From Figure 6.11, we can see that when the dimension of the parameter space is low, for a 5-by-5 rate matrix, HMC has better performance than LBPS-HMC in terms of all exchangeable parameters in the rate matrix, since the values of the ESS per second across all exchangeable parameters using HMC are larger than those obtained by LBPS-HMC. As the dimension increases, the minimum ESS per second across all exchangeable rate parameters for LBPS-HMC is larger than that of HMC for rate matrices of dimension ranging from 10-by-10 to 30-by-30 with a step size of five. We show the minimum ESS per second for LBPS-HMC and HMC in Table 6.2. The "Difference" row represents the difference of the minimum ESS per second between the two samplers on a log10 scale, and the "Ratio" row presents the ratio of the minimum ESS per second of HMC over LBPS-HMC. Similarly, we also report the 5% quantile of the ESS per second across all parameters for both samplers, together with the difference between them (on a log10 scale) and the ratio of HMC over LBPS-HMC, in Table 6.3.
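The bookkeeping behind these comparisons is simple; a small sketch with hypothetical ESS values (the numbers are illustrative, not from the experiments):

```python
def n_exchangeable(n_states):
    # A reversible (GTR-style) rate matrix on |X| states has one
    # exchangeable parameter per unordered pair of states: |X|(|X| - 1)/2.
    return n_states * (n_states - 1) // 2

def min_ess_per_second(ess_values, wall_seconds):
    # The minimum across parameters is the binding criterion: posterior
    # summaries are only trustworthy once the worst-mixing parameter has mixed.
    return min(ess_values) / wall_seconds

assert n_exchangeable(8) == 28
assert n_exchangeable(20) == 190    # the 20-by-20 protein case
assert n_exchangeable(30) == 435

# Hypothetical ESS values for three parameters from one run:
print(min_ess_per_second([950.0, 1200.0, 310.0], wall_seconds=600.0))
```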
Figure 6.11: ESS per second on a log10 scale across all exchangeable parameters with the dimension of the rate matrices ranging from 5-by-5, 10-by-10, . . . , to 30-by-30 with a step size of 5.

In order to give the reader an idea of the average computational performance of the two samplers, we present the median ESS per second across all parameters in Table 6.4.

We also find that HMC has larger variation in the ESS per second across different exchangeable rate parameters, which is shown in Figure 6.11 and Table 6.5. This indicates that HMC is not efficient in exploring certain directions of the parameter space. The variation in the ESS per second across different exchangeable parameters for LBPS-HMC is much smaller than for HMC. As the dimension of the rate matrix increases to 20-by-20, we find that the median ESS per second of the two algorithms is similar, while the minimum and 5% quantile of the ESS per second of LBPS-HMC are better than those of HMC.

Dimension                       5       10       15       20       25       30
LBPS-HMC (Min)             -2.707   -2.835   -2.967   -3.102   -3.204   -3.296
HMC (Min)                  -2.321   -2.840   -3.014   -3.171   -3.454   -3.497
Difference (log10 scale)    0.386   -0.005   -0.047   -0.069   -0.250   -0.201
Ratio of HMC over LBPS-HMC  2.438    0.989    0.897    0.853    0.564    0.630

Table 6.2: The minimum ESS per second across different exchangeable parameters for different dimensions of rate matrices.

Dimension                       5       10       15       20       25       30
LBPS-HMC (5%)              -2.703   -2.833   -2.950   -3.091   -3.181   -3.274
HMC (5%)                   -2.308   -2.799   -2.951   -3.102   -3.357   -3.390
Difference (log10 scale)    0.395    0.034   -0.001   -0.011   -0.176   -0.116
Ratio of HMC over LBPS-HMC  2.483    1.081    0.998    0.975    0.667    0.766

Table 6.3: The 5% quantile of the ESS per second across different exchangeable parameters for different dimensions of rate matrices.
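The "Difference" and "Ratio" rows of Tables 6.2 and 6.3 encode the same comparison on two scales: since the tabulated values are log10 ESS per second, the ratio is 10 raised to the difference. A quick check on the 5-by-5 column of Table 6.2 (small discrepancies with the tabulated 2.438 reflect rounding of the log values):

```python
hmc_min = -2.321        # log10 min ESS per second, 5-by-5 rate matrix, HMC
lbps_hmc_min = -2.707   # log10 min ESS per second, 5-by-5 rate matrix, LBPS-HMC

difference = hmc_min - lbps_hmc_min   # the "Difference" row: 0.386
ratio = 10 ** difference              # HMC over LBPS-HMC on the raw scale
print(round(difference, 3), round(ratio, 2))
```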
Based on the minimum, 5% quantile and median of the ESS per second across all the parameters displayed in Table 6.2, Table 6.3 and Table 6.4, LBPS-HMC has better computational performance than HMC if the dimensionality is sufficiently large. The improvement in the minimum and 5% quantile ESS per second is modest but real, which suggests that LBPS-HMC should be adopted for high-dimensional CTMC inference problems.

Dimension                       5       10       15       20       25       30
LBPS-HMC (Median)          -2.634   -2.787   -2.922   -3.062   -3.141   -3.231
HMC (Median)               -2.174   -2.641   -2.770   -2.957   -3.159   -3.189
Difference (log10 scale)    0.460    0.146    0.152    0.105   -0.018    0.042
Ratio of HMC over LBPS-HMC  2.879    1.402    1.416    1.274    0.959    1.101

Table 6.4: The median of the ESS per second across different exchangeable parameters for different dimensions of rate matrices.

Dimension           5       10       15       20       25       30
LBPS-HMC (SD)   0.053    0.044    0.026    0.027    0.031    0.033
HMC (SD)        0.100    0.097    0.116    0.099    0.115    0.118

Table 6.5: The standard deviation of the ESS per second on a log10 scale across different exchangeable parameters for different dimensions of rate matrices.

6.2.3 Computational efficiency comparison using different trajectory lengths of LBPS

The computational efficiency of HMC has been found to depend highly on the choices of the tuning parameters, namely the number of leapfrog jumps L and the size of each jump ε. Various strategies (Wang et al., 2013b; Hoffman and Gelman, 2014a) have been developed to tune these parameters effectively. The LBPS step of LBPS-HMC also involves a tuning parameter: the fixed trajectory length. Thus, it is of interest to examine whether the computational efficiency of LBPS is sensitive to different choices of the trajectory length.

We simulate a synthetic dataset from a 20-by-20 rate matrix with 190 exchangeable parameters under a chain Bayesian GLM GTR model described in Section 3.3.2.
A 20-by-20 rate matrix is chosen since it has the same dimension as protein evolution in the real data analysis. The same synthetic dataset is used when we perform sampling using LBPS-HMC with different trajectory lengths.

We use the ESS per second as the metric to evaluate the computational efficiency of LBPS-HMC with different trajectory lengths. Our summary statistics include the minimum, first quartile, mean, median, third quartile and maximum of the ESS per second across the 190 exchangeable parameters. 40,000 posterior samples of the parameters of interest are obtained using our algorithm. The actual walltime (in seconds) of our algorithm with fixed trajectory lengths at 0.025, 0.1, 0.15, 0.2 and 0.25 is shown in Table 6.6. Since the first 30% of the samples are discarded, the actual running time used to calculate the ESS per second is scaled to 70% of the total walltime in Table 6.6.

Trajectory length        0.025      0.1     0.15      0.2     0.25
Walltime (in seconds)   240677   282233   338242   381565   403932

Table 6.6: Actual walltime of LBPS-HMC for 40,000 iterations with trajectory lengths at 0.025, 0.1, 0.15, 0.2, 0.25.

The results are shown in Figure 6.12. We have found that, except for the maximum ESS per second, the computational efficiency of LBPS-HMC is similar across different values of the trajectory length. In particular, the minimum ESS per second is very robust to different choices of trajectory lengths. The longer the trajectory length, the better the maximum ESS per second, as shown in Table 6.7. Similar conclusions are reached in our real data analysis.

Trajectory length       0.025        0.1       0.15        0.2       0.25
ESS per second       0.001151   0.001748   0.001602   0.001788   0.002153

Table 6.7: The maximum ESS per second of LBPS-HMC for 40,000 iterations with trajectory lengths at 0.025, 0.1, 0.15, 0.2, 0.25.

Figure 6.12: Effective sample size per second of the summary statistics across exchangeable parameters with respect to different trajectory lengths of LBPS-HMC.

6.2.4 Real Data Analysis

We use the real dataset from Zhao et al. (2016) with 641 amino acid sequences. Each sequence has 415 sites from the protein kinase domain family. In phylogenetics, the evolutionary process is often inferred using multiple homologous biological sequences under an evolutionary tree with the same rate matrix across the tree. For simplicity, we estimate the rate matrix from a pair of sequences. We randomly pick a pair of amino acid sequences out of all 641 sequences to study the rate matrix from its posterior distribution.

Numerical results: computational efficiency comparison

We provide the correctness check via ARD between the two samplers in Appendix I. We compare the computational efficiency of LBPS-HMC and HMC by comparing the summary statistics of the ESS per second across the 190 exchangeable parameters of a 20-by-20 reversible rate matrix using the protein kinase domain family dataset in Zhao et al. (2016). Figure 6.13 demonstrates the computational advantages of using LBPS-HMC over HMC. The trajectory lengths of LBPS-HMC are chosen as 0.1, 0.15 and 0.2. We find that when the trajectory length is set at 0.2, LBPS-HMC outperforms HMC across all summary statistics of the ESS per second. The minimum ESS per second of HMC is markedly worse than that of LBPS-HMC, which implies the inefficiency of HMC in exploring certain directions of the parameter space. Under a fixed trajectory length of LBPS-HMC, there is not a big difference between the first quartile and third quartile of the ESS per second across all parameters. This indicates that LBPS has similar computational efficiency across different directions of the parameter space.
Figure 6.13 shows that different values of the trajectory length have a larger effect on the computational efficiency than in Figure 6.12. Ideally, different problems should use different scales of trajectory lengths in LBPS, since the number of collisions over the same fixed trajectory length can be quite different. The number of collisions over a fixed trajectory length depends on the scale of the rate matrix and also on the size of the dataset. We use the same scale of trajectory lengths in both the simulation studies and the real data analysis, which is not ideal. In the simulation studies, the size of the synthetic dataset is much larger, since it has 500 sequences with a total observation time interval of length three and the states of all sequences observed every 0.5 time units, while the real dataset only has one pair of sequences with 268 sites after removing the gaps.

In Table 6.8, we provide the ESS from running HMC for 811,380 seconds (9.4 days), compared with the ESS of LBPS-HMC, which only took 81.25% of the walltime of HMC. In this table, the last row shows the ratio of the summary statistics of the ESS for LBPS-HMC over HMC. Considering the ratio of LBPS-HMC over HMC in the same walltime, the median ESS per second of LBPS-HMC is around 3 times that of HMC, and the minimum ESS per second of LBPS-HMC is around 10 times that of HMC, which further demonstrates the better computational performance of LBPS-HMC. A traceplot (scaled by the running time) of an exchangeable parameter selected uniformly at random, showing better mixing using LBPS-HMC compared with HMC, is provided in Figure H.2 in Appendix H.3.

                  Min.   1st Quartile   Median    Mean   3rd Quartile    Max.
LBPS-HMC (0.2)     773           1656     1864    1845           2108    2532
HMC                 95            626      740     736            840    1233
Ratio             8.13           2.64     2.52    2.51           2.51    2.05

Table 6.8: Summary of the ESS across exchangeable parameters between HMC and LBPS-HMC with trajectory length 0.2. The last row shows the ratio of the ESS for LBPS-HMC over HMC.

6.2.5 Discussion

In this chapter, we have developed a computationally efficient sampling scheme combining LBPS and HMC to explore high dimensional CTMCs under a novel Bayesian GLM chain GTR model. In this model, we assume that the mutation rates of the amino acid evolutionary process depend on neighbour pairs with similar physiochemical properties.

We provide a framework for assessing the running time of our algorithm. We define the notion of sparsity through the extended neighbourhood of each factor in the factor graph, where LBPS-HMC can be efficient by taking advantage of the sparse structure.

Figure 6.13: Summary statistics of ESS per second across exchangeable parameters with different trajectory lengths of LBPS at 0.1, 0.15 and 0.2, compared with HMC.

In terms of the computational performance, we find that HMC may have difficulty in exploring certain directions of the parameter space, since there is a discrepancy between the ESS per second among different parameters, and the minimum ESS per second for HMC is worse than for LBPS-HMC. By contrast, the computational performance of LBPS-HMC is similar across different directions of the parameter space. In the real application, the computational performance of LBPS-HMC is better than HMC in terms of all chosen summary statistics of the ESS per second, and especially the minimum ESS per second.

Moreover, via protein evolution, we provide the first proof-of-concept real data application for LBPS-HMC to explore high dimensional CTMCs compared with HMC. The experimental results suggest that LBPS-HMC can be a computationally efficient algorithm to explore high dimensional CTMCs. This shows promise for evolutionary processes such as codon evolution (Anisimova and Kosiol, 2008) with a large state space. Our sampling algorithm also has the potential to be applied to co-evolution of groups of interacting amino acid residues, which is also modelled with large rate matrices.

For phylogenetic applications, the current limitation of our algorithm is that it only works for unnormalized, reversible rate matrices, since only in this situation can the analytical solution to the collision time be obtained. It is conventional in phylogenetics to use normalized rate matrices, where the normalization ensures that the expected number of mutations is one per unit of time. To resolve the issue, for example for pairwise sequences, we can first set the branch length to one unit and obtain the posterior samples for the unnormalized reversible rate matrix. For each posterior sample of the unnormalized reversible rate matrix, a normalized rate matrix can be obtained by the transformation described in Section 3.2.2, yielding the corresponding normalizing coefficient for that sample. The posterior mean of the normalizing coefficient is then the estimate of the branch length, and the uncertainty of the branch length can be evaluated from the normalizing coefficients across posterior samples.

It is also possible to combine our method with Bayesian variable selection techniques (Ishwaran et al., 2005) and a binary version of BPS (Pakman, 2017) to detect the features that have a large effect on the underlying CTMCs. To further improve the computational efficiency of our sampling algorithm, it is worth exploring an adaptive version of the algorithm that chooses the tuning parameters in LBPS wisely.

Chapter 7

Conclusions and future work

7.1 Conclusions

In this thesis, I have focused on improving probabilistic models for Continuous Time Markov Chains (CTMCs) and developing Bayesian models and associated Monte Carlo methods for inference.
CTMCs are at the core of modelling continuous time, discrete state space time series, which is a challenging problem due to incomplete observations and potentially large parameter sets. These challenges call for user-defined priors based on known properties of the process. Traditional methods are limited by two difficulties. The first comes from the question of constructing reasonable structured priors that enable a lower dimensional parameterization of rate matrices while still maintaining good performance on large datasets. To solve this problem, we have developed a Bayesian Generalized Linear Models (GLM) framework to model the rate matrices of CTMCs. We use protein evolution as an example to demonstrate our modelling framework. We have also shown how to incorporate physiochemical properties of amino acids in the prior information under our framework. The other difficulty is that posterior simulation using classical MCMC algorithms can be highly ineffective under Bayesian GLM modelling of CTMCs. I have proposed two computational remedies for this second, computational issue.

In the first method, we address the computational issue by combining two ingredients. The first ingredient is the use of an AHMC algorithm, which uses the gradient of the log target density to efficiently sample from the posterior density. The second ingredient comes from a set of techniques called substitution mapping, which have undergone intensive development in recent years.
These techniques allow us to efficiently maintain a set of auxiliary variables that make gradient calculations fast and simple. The superior computational efficiency of AHMC is evaluated by monitoring the ESS per second against two baselines: a Metropolis-Hastings algorithm with a Normally distributed proposal distribution with different choices of bandwidths, and an adaptive MCMC algorithm from Roberts and Rosenthal (2009), namely an NMH algorithm based on a joint normal proposal distribution with an empirical covariance estimated from the MCMC run.

In the second method, I have developed a novel sampler combining HMC and LBPS, together with substitution mapping, to efficiently sample from CTMCs under the Bayesian GLM parameterization of rate matrices, achieving better scalability with the dimension of the parameter space compared with the first solution, the HMC sampling scheme of Zhao et al. (2016). In our work, we first propose the notion of strong sparsity for factor graphs. We then find that if the target factor graph of the Bayesian posterior distribution satisfies strong sparsity, our proposed sampling scheme LBPS-HMC can be tailored to take advantage of the sparse structure, and the "local" bouncing time can be computed efficiently with analytical solutions using only the variables associated with the factor. We provide a running time analysis for both arbitrary factor graphs and factor graphs with strong sparsity. Analytical solutions to the bouncing time for each factor of the posterior density have been derived in our problem context, which is not possible in most cases.
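For intuition, the elementary move underlying (L)BPS is the bounce operator, which reflects the velocity in the hyperplane orthogonal to the gradient of the energy U = −log π. The sketch below shows this generic operator (as in Bouchard-Côté et al., 2018), not the factor-specific collision-time solutions derived in the thesis:

```python
import numpy as np

def bounce(velocity, grad_u):
    """Generic BPS bounce: reflect the velocity in the hyperplane
    orthogonal to the gradient of the energy U = -log(density)."""
    g = np.asarray(grad_u, dtype=float)
    v = np.asarray(velocity, dtype=float)
    return v - 2.0 * (v @ g) / (g @ g) * g

# Example velocity and gradient (illustrative values only):
v = np.array([1.0, -0.5, 2.0])
g = np.array([0.3, 1.0, -0.2])
v_new = bounce(v, g)
```

The bounce preserves the speed of the particle and flips the sign of the directional derivative of U along the trajectory, which is what makes the resulting piecewise deterministic dynamics leave the target invariant.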
We demonstrate the better computational efficiency of LBPS in its scalability with dimension compared with HMC.

For both methods, we demonstrate the computational efficiency of our methods via both synthetic dataset studies and real data analysis in terms of protein evolution, using data from the protein kinase domain family.

7.2 Future work

In this section, I discuss some future research directions. In one current project, I apply both the AHMC and LBPS sampling schemes to CTMCs with a large state space to model codon evolution, which requires a 64 × 64 rate matrix. Under the Bayesian GLM rate matrix parameterization framework, we try to detect the effects of different levels of features, such as DNA, physiochemical properties of amino acids and synonymous/non-synonymous mutations, on codon evolution. Moreover, I would like to combine the current framework with spike and slab priors (Mitchell and Beauchamp, 1988) for performing Bayesian feature selection in the context of CTMCs. In one related project, I extend the AHMC and LBPS sampling schemes to model Markov Modulated Poisson Processes (MMPPs), with two major applications: studying multi-state disease progression and analyzing user behaviours via trajectory data consisting of times and locations of user "check-ins" collected from a smartphone-based social media tool. A check-in is a record of an individual's publicly registered visit to an interesting place, with a timestamp, a location (latitude and longitude), and additional information such as the name of the place, a category, or a user rating. To further relax the constraint of a fixed number of hidden states in MMPPs, I plan to combine ideas from nonparametric Bayesian approaches from Saeedi and Bouchard-Côté (2011) and a Sequential Monte Carlo (SMC) sampling scheme. Our framework could also be applied to Continuous Time Bayesian Network (CTBN) models (Nodelman et al., 2002a), where a graph specifies a sparsity structure in the rate matrix.
CTBN models can harness the representational power ofBayesian networks to characterize structured CTMCs. CTBN models havewide applications ranging from stochastic kinetic models like the Lotka-Volterra equations (Rao and Teh, 2013), which describe interacting popu-lations of animal species, drug effect modelling (Nodelman et al., 2002b) togene regulatory networks (Wilkinson, 2009). However, posterior inferencefor CTBN models is very challenging (Sturlaugson and Sheppard, 2014).Rao and Teh (2013) has constructed an auxiliary variable Gibbs sampling al-gorithm for such models based on the uniformization method. This methodcould be combined with our GLM representation of the structured rate ma-trices to perform Bayesian inference on the parameters of CTBN models. Itis promising that our GLM representation framework may make it easier tointerpret the effects of certain parameters on the process of interest.132BibliographyAldous, D. (1996). “Probability distributions on cladograms.” In Random discrete struc-tures, 1–18. Springer.Allman, E. S., Ane´, C., and Rhodes, J. A. (2008). “Identifiability of a Markovian model ofmolecular evolution with gamma-distributed rates.” Advances in Applied Probability ,40(1): 229–249.Andrieu, C., de Freitas, N., Doucet, A., and Jordan, M. I. (2003). “An Introduction toMCMC for Machine Learning.” Machine Learning , 50(1): 5–43.Andrieu, C., Durmus, A., Nu¨sken, N., and Roussel, J. (2018). “Hypercoercivity of Piece-wise Deterministic Markov Process-Monte Carlo.” arXiv preprint arXiv:1808.08592 .Anisimova, M. and Kosiol, C. (2008). “Investigating protein-coding sequence evolutionwith probabilistic codon substitution models.” Molecular biology and evolution, 26(2):255–271.Atchison, J. and Shen, S. M. (1980). “Logistic-normal distributions: Some properties anduses.” Biometrika, 67(2): 261–272.Baele, G., Van de Peer, Y., and Vansteelandt, S. (2010). 
“Using non-reversible context-dependent evolutionary models to study substitution patterns in primate non-codingsequences.” Journal of Molecular Evolution, 71(1): 34–50.Barnes, M. R., Gray, I. C., et al. (2003). Bioinformatics for geneticists, volume 2. WileyHoboken, NJ.Berg-Kirkpatrick, T., Bouchard-Coˆte´, A., DeNero, J., and Klein, D. (2010). “PainlessUnsupervised Learning with Features.” In Proceedings of the North American Chapterof the Association for Computational Linguistics (NAACL10), volume 8, 582–590.133BibliographyBetancourt, M. and Girolami, M. (2015). “Hamiltonian Monte Carlo for hierarchicalmodels.” Current trends in Bayesian methodology with applications, 79: 30.Bierkens, J. (2016). “Non-reversible metropolis-hastings.” Statistics and Computing ,26(6): 1213–1228.Bierkens, J., Kamatani, K., and Roberts, G. O. (2018). “High-dimensional scaling limitsof piecewise deterministic sampling algorithms.” arXiv preprint arXiv:1807.11358 .Bishop, M. and Friday, A. E. (1985). “Evolutionary trees from nucleic acid and proteinsequences.” Proc. R. Soc. Lond. B , 226(1244): 271–302.Bouchard-Coˆte´, A., Vollmer, S. J., and Doucet, A. (2018). “The bouncy particle sampler:A nonreversible rejection-free Markov chain Monte Carlo method.” Journal of theAmerican Statistical Association, 1–13.Cao, Y., Adachi, J., Janke, A., Pa¨a¨bo, S., and Hasegawa, M. (1994). “Phylogenetic rela-tionships among eutherian orders estimated from inferred sequences of mitochondrialproteins: instability of a tree based on a single gene.” Journal of Molecular Evolution,39(5): 519–527.Carpenter, B., Hoffman, M. D., Brubaker, M., Lee, D., Li, P., and Betancourt, M.(2015). “The stan math library: Reverse-mode automatic differentiation in C++.”arXiv preprint arXiv:1509.07164 .Carpenter, J., Clifford, P., and Fearnhead, P. (1999). “An Improved Particle Filter forNon-linear Problems.” In IEEE Proceedings: Radar, Sonar and Navigation, volume146, 2–7.Chambers, J. M. and Hastie, T. J. (1991). 
Statistical models in S . CRC Press, Inc.Chen, L., Qin, Z., and Liu, J. S. (2001). “Exploring Hybrid Monte Carlo in BayesianComputation.” Sigma, 2: 2–5.Chen, T.-L. and Hwang, C.-R. (2013). “Accelerating reversible Markov chains.” Statistics& Probability Letters, 83(9): 1956–1962.Clayton, D. G. (1991). “A Monte Carlo method for Bayesian inference in frailty models.”Biometrics, 47: 467–485.134BibliographyDavis, M. H. (1984). “Piecewise-deterministic Markov processes: A general class of non-diffusion stochastic models.” Journal of the Royal Statistical Society. Series B (Method-ological), 353–388.Dayhoff, M., Schwartz, R., and Orcutt, B. (1978). Atlas of Protein Sequences and Struc-ture, volume 5, chapter A model of evolutionary change in proteins, 345–352. SilverSprings: Natl. Biomed. Res. Found.de Valpine, P., Turek, D., Paciorek, C. J., Anderson-Bergman, C., Lang, D. T., and Bodik,R. (2017). “Programming with models: writing statistical algorithms for general modelstructures with NIMBLE.” Journal of Computational and Graphical Statistics, 26(2):403–413.Deligiannidis, G., Paulin, D., and Doucet, A. (2018). “Randomized Hamiltonian MonteCarlo as Scaling Limit of the Bouncy Particle Sampler and Dimension-Free ConvergenceRates.” arXiv preprint arXiv:1808.04299 .Didier, F., Henzinger, T. A., Mateescu, M., and Wolf, V. (2009). “Fast adaptive uni-formization of the chemical master equation.” In High Performance ComputationalSystems Biology, 2009. HIBI’09. International Workshop on, 118–127. IEEE.Doob, J. L. (1945). “Markoff chains—Denumerable case.” Transactions of the AmericanMathematical Society , 58(3): 455–473.Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). “Hybrid MonteCarlo.” Physics Letters B , 195(2): 216–222.Eddy, S. R. (1998). “Profile hidden Markov models.” Bioinformatics, 14(9): 755–763.Engel, Y. (2005). 
“Algorithms and representations for reinforcement learning.” Doktorar-beit, The Hebrew University of Jerusalem.Fearnhead, P., Bierkens, J., Pollock, M., and Roberts, G. O. (2016). “Piecewisedeterministic Markov processes for continuous-time Monte Carlo.” arXiv preprintarXiv:1611.07873 .Felsenstein, J. (1981). “Evolutionary trees from DNA sequences: a maximum likelihoodapproach.” Journal of Molecular Evolution, 17(6): 368–376.135BibliographyFlegal, J. M., Hughes, J., Vats, D., et al. (2012). “mcmcse: Monte Carlo standard errorsfor MCMC.” Riverside, CA and Minneapolis, MN. R package version, 1–0.Geweke, J. (1992). “Evaluating the accuracy of sampling-based approaches to the calcu-lation of posterior moments.” In Bayesian Statistics, volume 4, 169–193. Citeseer.— (2004). “Getting it right: Joint distribution tests of posterior simulators.” Journal ofthe American Statistical Association, 99(467): 799–804.Gilks, W. R. and Roberts, G. O. (1996). Markov chain Monte Carlo in practice, chapterStrategies for Improving MCMC, 89–114. Chapman and Hall/CRC.Girolami, M. and Calderhead, B. (2011). “Riemann manifold Langevin and HamiltonianMonte Carlo methods.” Journal of the Royal Statistical Society: Series B , 73(2): 123–214.Grantham, R. (1974). “Amino acid difference formula to help explain protein evolution.”Science, 185(4154): 862–864.Hajiaghayi, M., Kirkpatrick, B., Wang, L., and Bouchard-Coˆte´, A. (2014). “EfficientContinuous-Time Markov Chain Estimation.” In International Conference on MachineLearning (ICML), volume (In Press).Hamze, F., Wang, Z., and de Freitas, N. (2013). “Self-Avoiding Random Dynamics onInteger Complex Systems.” ACM Transactions on Modeling and Computer Simulation,23(1): 9:1–9:25.Hasegawa, M., Kishino, H., and Yano, T. (1985). “Dating of the human-ape splitting by amolecular clock of mitochondrial DNA.” Journal of Molecular Eevolution, 22: 160–174.He, M. X., Jime´nez-Montan˜o, M. A., and Ricci, P. E. (2011). 
“AMINO ACIDS, EU-CLIDEAN DISTANCE AND SYMMETRIC MATRIX.” In Proceedings of the Inter-national Conference on Bioinformatics & Computational Biology (BIOCOMP), 1. TheSteering Committee of The World Congress in Computer Science, Computer Engineer-ing and Applied Computing (WorldComp).136BibliographyHeidelberger, P. and Welch, P. D. (1981). “A spectral method for confidence intervalgeneration and run length control in simulations.” Communications of the ACM , 24(4):233–245.— (1983). “Simulation run length control in the presence of an initial transient.” Opera-tions Research, 31(6): 1109–1144.Higham, N. J. (2009). “The scaling and squaring method for the matrix exponentialrevisited.” SIAM review , 51(4): 747–764.Hobolth, A. and Stone, E. A. (2009). “Simulation from endpoint-conditioned, continuous-time Markov chains on a finite state space, with applications to molecular evolution.”The Annals of Applied Statistics, 3(3): 1204.Hoffman, M., Brochu, E., and de Freitas, N. (2011). “Portfolio Allocation for BayesianOptimization.” In Uncertainty in Artificial Intelligence, 327–336.Hoffman, M. D. and Gelman, A. (2014a). “The No-U-turn sampler: adaptively settingpath lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research,15(1): 1593–1623.— (2014b). “The No-U-Turn Sampler: Adaptively Setting Path Lengths in HamiltonianMonte Carlo.” Journal of Machine Learning Research, 15: 1351–1381.Hogan, R. J. (2014). “Fast reverse-mode automatic differentiation using expression tem-plates in C++.” ACM Transactions on Mathematical Software (TOMS), 40(4): 26.Horowitz, A. M. (1991). “A generalized guided Monte Carlo algorithm.” Physics LettersB , 268: 247–252.Huelsenbeck, J. P., Bollback, J. P., and Levine, A. M. (2002). “Inferring the root of aphylogenetic tree.” Systematic Biology , 51(1): 32–43.Huelsenbeck, J. P. and Ronquist, F. (2001). “MrBayes: Bayesian inference of phylogenetictrees.” Bioinformatics, 17(8): 754–755.Ibrahim, J. G., Chen, M.-H., and Sinha, D. (2005). 
Bayesian survival analysis. Wiley Online Library.

Irvahn, J. and Minin, V. N. (2014). “Phylogenetic Stochastic Mapping Without Matrix Exponentiation.” Journal of Computational Biology, 21(9): 676–690.

Ishwaran, H. (1999a). “Applications of hybrid Monte Carlo to Bayesian generalized linear models: Quasicomplete separation and neural networks.” Journal of Computational and Graphical Statistics, 8(4): 779–799.

— (1999b). “Applications of Hybrid Monte Carlo to Bayesian Generalized Linear Models: Quasicomplete Separation and Neural Networks.” Journal of Computational and Graphical Statistics, 8(4): 779–799.

Ishwaran, H., Rao, J. S., et al. (2005). “Spike and slab variable selection: frequentist and Bayesian strategies.” The Annals of Statistics, 33(2): 730–773.

Jennrich, R. I. and Bright, P. B. (1976). “Fitting systems of linear differential equations using computer generated exact derivatives.” Technometrics, 18(4): 385–392.

Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). “The rapid generation of mutation data matrices from protein sequences.” Computer Applications in the Biosciences: CABIOS, 8(3): 275–282.

Jukes, T. and Cantor, C. (1969). Evolution of Protein Molecules. New York: Academic Press.

Kalbfleisch, J. and Lawless, J. F. (1985). “The analysis of panel data under a Markov assumption.” Journal of the American Statistical Association, 80(392): 863–871.

Kay, B. R. (1977). “Proportional hazard regression models and the analysis of censored survival data.” Applied Statistics, 26(3): 227–237.

Kenney, T. and Gu, H. (2012). “Hessian Calculation for Phylogenetic Likelihood based on the Pruning Algorithm and its Applications.” Statistical Applications in Genetics and Molecular Biology, 11: 1544–6115.

Kimura, M. (1980). “A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences.” Journal of Molecular Evolution, 16: 111–120.

Kramer, A., Calderhead, B., and Radde, N. (2014).
“Hamiltonian Monte Carlo methods for efficient parameter estimation in steady state dynamical systems.” BMC Bioinformatics, 15(1): 253.

Kschischang, F. R., Frey, B. J., and Loeliger, H. A. (2006). “Factor Graphs and the Sum-product Algorithm.” IEEE Trans. Inf. Theor., 47(2): 498–519.

Lakner, C., Van Der Mark, P., Huelsenbeck, J. P., Larget, B., and Ronquist, F. (2008). “Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics.” Systematic Biology, 57(1): 86–103.

Liu, J. S. (2001). Monte Carlo strategies in scientific computing. Springer Science & Business Media.

Mackenzie, P. B. (1989). “An improved hybrid Monte Carlo method.” Physics Letters B, 226(3-4): 369–371.

Mahendran, N., Wang, Z., Hamze, F., and Freitas, N. D. (2012). “Adaptive MCMC with Bayesian optimization.” In International Conference on Artificial Intelligence and Statistics, 751–760.

Minin, V. N. and Suchard, M. A. (2008). “Counting labeled transitions in continuous-time Markov models of evolution.” Journal of Mathematical Biology, 56(3): 391–412.

Mitchell, T. J. and Beauchamp, J. J. (1988). “Bayesian variable selection in linear regression.” Journal of the American Statistical Association, 83(404): 1023–1032.

Mockus, J. (1982). “The Bayesian Approach to Global Optimization.” In System Modeling and Optimization, volume 38, 473–481. Springer.

Moler, C. and Van Loan, C. (1978). “Nineteen dubious ways to compute the exponential of a matrix.” SIAM Review, 20(4): 801–836.

Neal, R. M. (1993). “Probabilistic inference using Markov chain Monte Carlo methods.”

— (2004). “Improving asymptotic variance of MCMC estimators: Non-reversible chains are better.” arXiv preprint math/0407281.

— (2010). “MCMC using Hamiltonian dynamics.” Handbook of Markov Chain Monte Carlo, 54: 113–162.

Nielsen, R. (2002). “Mapping mutations on phylogenies.” Systematic Biology, 51: 729–739.

Nodelman, U., Shelton, C., and Koller, D. (2002a).
“Continuous time Bayesian networks.” In Uncertainty in AI, volume 18.

Nodelman, U., Shelton, C. R., and Koller, D. (2002b). “Continuous time Bayesian networks.” In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, 378–387. Morgan Kaufmann Publishers Inc.

Pakman, A. (2017). “Binary Bouncy Particle Sampler.” arXiv preprint arXiv:1711.00922.

Pakman, A., Gilboa, D., Carlson, D., and Paninski, L. (2017). “Stochastic Bouncy Particle Sampler.” In International Conference on Machine Learning, 2741–2750.

Paradis, E., Claude, J., and Strimmer, K. (2004). “APE: analyses of phylogenetics and evolution in R language.” Bioinformatics, 20(2): 289–290.

Pasarica, C. and Gelman, A. (2010). “Adaptively scaling the Metropolis algorithm using expected squared jumped distance.” Statistica Sinica, 20(1): 343.

Peters, E. A. et al. (2012). “Rejection-free Monte Carlo sampling for general potentials.” Physical Review E, 85(2): 026703.

Plummer, M., Best, N., Cowles, K., and Vines, K. (2006). “CODA: Convergence Diagnosis and Output Analysis for MCMC.” R News, 6(1): 7–11. URL http://CRAN.R-project.org/doc/Rnews/

Rao, V. and Teh, Y. W. (2013). “Fast MCMC sampling for Markov jump processes and extensions.” The Journal of Machine Learning Research, 14(1): 3295–3320.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge, Massachusetts: MIT Press.

Roberts, G. O., Gelman, A., Gilks, W. R., et al. (1997). “Weak convergence and optimal scaling of random walk Metropolis algorithms.” The Annals of Applied Probability, 7(1): 110–120.

Roberts, G. O. and Rosenthal, J. S. (2007). “Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms.” Journal of Applied Probability, 44(2): 458–475.

— (2009). “Examples of adaptive MCMC.” Journal of Computational and Graphical Statistics, 18(2): 349–367.

Roberts, G. O., Rosenthal, J. S., et al. (2001).
“Optimal scaling for various Metropolis-Hastings algorithms.” Statistical Science, 16(4): 351–367.

Rodrigue, N., Philippe, H., and Lartillot, N. (2008). “Uniformization for sampling realizations of Markov processes: applications to Bayesian implementations of codon substitution models.” Bioinformatics, 24(1): 56–62.

Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D. L., Darling, A., Höhna, S., Larget, B., Liu, L., Suchard, M. A., and Huelsenbeck, J. P. (2012). “MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space.” Systematic Biology, 61(3): 539–542.

Saeedi, A. and Bouchard-Côté, A. (2011). “Priors over recurrent continuous time processes.” In Advances in Neural Information Processing Systems, 2052–2060.

Schadt, E. E., Sinsheimer, J. S., and Lange, K. (1998). “Computational advances in maximum likelihood methods for molecular phylogeny.” Genome Research, 8(3): 222–233.

Sexton, J. C. and Weingarten, D. H. (1992). “Hamiltonian evolution for the hybrid Monte Carlo method.” Nuclear Physics B, 380: 665–677.

Sneath, P. (1966). “Relations between chemical structure and biological activity in peptides.” Journal of Theoretical Biology, 12(2): 157–195.

Sohl-Dickstein, J., Mudigonda, M., and DeWeese, M. (2014). “Hamiltonian Monte Carlo without detailed balance.” In International Conference on Machine Learning (ICML-14), volume 31, 719–726.

Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2010). “Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design.” In International Conference on Machine Learning.

Stan Development Team (2014). “Stan: A C++ Library for Probability and Sampling, Version 2.5.0.” URL http://mc-stan.org/

Sturlaugson, L. and Sheppard, J. W. (2014). “Inference Complexity in Continuous Time Bayesian Networks.” In UAI, 772–779. Citeseer.

Suchard, M. A. and Redelings, B. D. (2006).
“BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny.” Bioinformatics, 22(16): 2047–2048.

Sun, Y., Schmidhuber, J., and Gomez, F. J. (2010). “Improving the asymptotic performance of Markov chain Monte-Carlo by inserting vortices.” In Advances in Neural Information Processing Systems, 2235–2243.

Tarmast, G. (2001). “Multivariate Log-Normal Distribution.” ISI Proceedings: Seoul 53rd Session.

Tataru, P. and Hobolth, A. (2011). “Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains.” BMC Bioinformatics, 12: 465–475.

Tavaré, S. (1986). “Some probabilistic and statistical problems in the analysis of DNA sequences.” Lectures on Mathematics in the Life Sciences, 17: 57–86.

Tiessen, A., Pérez-Rodríguez, P., and Delaye-Arredondo, L. J. (2012). “Mathematical modeling and comparison of protein size distribution in different plant, animal, fungal and microbial species reveals a negative correlation between protein size and protein number, thus providing insight into the evolution of proteomes.” BMC Research Notes, 5(1): 85.

Vanetti, P., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. (2017). “Piecewise Deterministic Markov Chain Monte Carlo.” arXiv preprint arXiv:1707.05296.

Wang, Z., Mohamed, S., and de Freitas, N. (2013a). “Adaptive Hamiltonian and Riemann Manifold Monte Carlo Samplers.” In International Conference on Machine Learning (ICML), 1462–1470. JMLR W&CP 28(3), 2013.

Wang, Z., Mohamed, S., and Freitas, N. (2013b). “Adaptive Hamiltonian and Riemann manifold Monte Carlo.” In International Conference on Machine Learning, 1462–1470.

Wilkinson, D. J. (2009). “Stochastic modelling for quantitative description of heterogeneous biological systems.” Nature Reviews Genetics, 10(2): 122.

Zhang, J., Watson, L. T., and Cao, Y. (2010).
“A modified uniformization method for the solution of the chemical master equation.” Computers and Mathematics with Applications, 59(1): 573–584.

Zhao, T. (2013). “Inference of rates across sites via an expectation maximization algorithm.”

Zhao, T., Wang, Z., Cumberworth, A., Gsponer, J., de Freitas, N., Bouchard-Côté, A., et al. (2016). “Bayesian analysis of continuous time Markov chains with application to phylogenetic modelling.” Bayesian Analysis, 11(4): 1203–1237.

Appendix A
List of acronyms

• ARD: Absolute Relative Difference.
• BPS: Bouncy Particle Sampler.
• CTBN: Continuous Time Bayesian Network.
• CTMC: Continuous Time Markov Chains.
• EIT: Exact Invariance Test.
• ESJD: Expected Squared Jumping Distance.
• ESS: Effective Sample Size.
• GLM: Generalized Linear Models.
• GNR: General Non-Reversible.
• GTR: General Time Reversible.
• HPD: Highest Posterior Density.
• KL: Kullback-Leibler.
• KS: Kolmogorov-Smirnov.
• LBPS: Local Bouncy Particle Sampler.
• LOESS: Local Polynomial Regression Fitting.
• MMPP: Markov Modulated Poisson Process.
• NMH: Normal Proposal Metropolis-Hastings.
• NNPAAO: Nearest Neighbour Pairwise Amino Acid Ordering.
• NUTS: No U-Turn Sampler.
• PDB: Protein Data Bank.
• PDMP: Piecewise-deterministic Markov Processes.
• PP: Poisson Process.
• RMHMC: Riemann Manifold Hamiltonian Monte Carlo.
• RMSE: Root-Mean-Square Error.
• SMC: Sequential Monte Carlo.

Appendix B
Likelihood for trees under CTMCs

In this section, we present the details of the pruning algorithm on a tree, as a special case of the sum-product algorithm, in order to calculate the likelihood of observed homologous sequences given a phylogenetic tree and a rate matrix from a probabilistic graphical model perspective.
Before introducing the details of how to calculate the likelihood given a phylogenetic tree with observed sequences at the leaves, we view the tree as an instance of a directed graphical model, so that efficient exact inference algorithms for directed trees can be accommodated to the phylogenetic context.

We first introduce the basics of graphical models, since we will represent a phylogenetic tree as an instance of one. Graphical models can be classified into directed and undirected models; we focus on the directed case, since it is the one used to represent a phylogenetic tree. In a graphical model, random variables are denoted as nodes V and dependencies between random variables are denoted as edges E in a graph G. Let X represent the set of random variables of interest. In the directed case, the edges represent conditional distributions, and we define the joint probability distribution of an instance x as the product of the conditional probabilities of nodes given their parents,

P(x) = ∏_{v ∈ V} P(x_v | x_{P_v}),    (B.0.1)

where x_v is an instance of the collection of random variables associated with node v ∈ V, x_{P_v} is the set of parents of node v, and P_v denotes the subset of indices of the parents of node v.

To calculate the likelihood of observed homologous biological sequences under a given tree, the problem is to compute the marginal probabilities given the joint distribution in the form of Equation (B.0.1), where each edge represents the conditional probability of a state given its ancestral state, parameterized in terms of a rate matrix and the corresponding branch length.
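As a concrete illustration of Equation (B.0.1), the joint probability of a complete assignment on a small directed tree is just a product of local conditionals. The following is a hypothetical Python sketch: the node names and the probability tables are invented for illustration and are not part of the thesis.

```python
# Joint probability of a directed graphical model, P(x) = prod_v P(x_v | x_{P_v}).
# Nodes, parents, and conditional tables below are invented for illustration.
parents = {"root": None, "left": "root", "right": "root"}

# P(x_v | x_{P_v}): a marginal for the root, conditional tables for the children.
root_prior = {"A": 0.6, "B": 0.4}
cond = {
    ("A", "A"): 0.9, ("A", "B"): 0.1,
    ("B", "A"): 0.2, ("B", "B"): 0.8,
}  # cond[(parent_state, child_state)]

def joint_probability(assignment):
    """Multiply P(x_v | x_{P_v}) over all nodes v, as in Equation (B.0.1)."""
    p = 1.0
    for v, pa in parents.items():
        if pa is None:
            p *= root_prior[assignment[v]]
        else:
            p *= cond[(assignment[pa], assignment[v])]
    return p

p = joint_probability({"root": "A", "left": "A", "right": "B"})
```

Marginalizing such a joint over unobserved (internal) nodes is exactly the computation the pruning algorithm organizes efficiently.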
If we split all the nodes of a phylogenetic tree into two groups U and W, where every v ∈ W is either an internal node or the root of the tree (representing putative ancestral species) and every v ∈ U is a leaf node denoting an extant species (the observed data), then calculating the likelihood of the observed data amounts to marginalizing the joint probability distribution on the tree over all states of the root and the internal nodes, i.e. ∑_{x_W} P(x_U, x_W) under this parameterization.

The challenge is to keep the computation tractable when the number of random variables is large. Since we can treat the phylogenetic tree as a directed tree, efficient exact inference can be achieved by the elimination algorithm — known in the phylogenetic context as the pruning algorithm, introduced by Felsenstein (1981) — which is an instance of the sum-product algorithm for factor graphs (Kschischang et al., 2006).

We will demonstrate how the pruning algorithm works using a simple tree with three observed sequences, shown in Figure B.1. The shaded nodes represent the observed sequences.

Figure B.1: A simple tree topology with three observed sequences denoted as shaded nodes X1, X2 and X4, and t_i as the branch length for each edge, where i = 1, 2, 3, 4.

To show the connection between a phylogenetic tree and its corresponding directed graphical model, we present the graphical model representation in Figure B.2. In the graphical model representation, we make the same two assumptions as in the previous section, namely independence among different sites in the sequences and no evolutionary rate variation among sites. For simplicity, we use DNA sequences instead of amino acid sequences here, so the state space is X = {A, C, G, T}.

We first illustrate how to calculate the likelihood under the tree in Figure B.1 for a single site.
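This per-site computation can be sketched in code. The snippet below is a hypothetical Python illustration of the pruning recursion on the three-leaf tree: to keep it self-contained, it uses an invented transition matrix P in place of exp(t_i Q), and a uniform stationary distribution; all numbers are for illustration only.

```python
import numpy as np

# Felsenstein pruning on the three-leaf tree of Figure B.1 (hypothetical sketch).
S = 4  # state space {A, C, G, T}

def tip_likelihood(state_index):
    """Conditional likelihood at a tip: an indicator on the observed state."""
    L = np.zeros(S)
    L[state_index] = 1.0
    return L

def prune(P_left, L_left, P_right, L_right):
    """Combine two child likelihoods into the parent's conditional likelihood:
    L_parent(a) = (sum_x P_left[a, x] L_left[x]) * (sum_y P_right[a, y] L_right[y])."""
    return (P_left @ L_left) * (P_right @ L_right)

# A toy transition matrix standing in for exp(t*Q); rows sum to 1 (invented numbers).
P = np.full((S, S), 0.1) + 0.6 * np.eye(S)

L1 = tip_likelihood(2)  # X1 observed as G
L2 = tip_likelihood(1)  # X2 observed as C
L4 = tip_likelihood(0)  # X4 observed as A

L3 = prune(P, L1, P, L2)   # internal node X3
L5 = prune(P, L3, P, L4)   # root X5

pi = np.full(S, 0.25)      # stationary distribution (uniform here)
site_likelihood = pi @ L5  # L^i = sum_r pi_r L^i_{x5}(r)
```

The worked example that follows carries out exactly this recursion by hand, with exp(t_i Q) in place of the toy matrix P.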
Assume that for the ith site, the states of the DNA at the three tips X4, X1 and X2 are A, G and C, respectively. From the tips to the root of the tree, the algorithm recursively calculates the conditional likelihood of a subtree given the state of its root node, by combining the conditional likelihood of the node's immediate child in the left lineage with that of its immediate child in the right lineage.

Using the small tree, we illustrate how this algorithm works. The depth of the tree is two.

Figure B.2: A treatment of a phylogeny as a directed graphical model with three observed extant sequences and two sequences for extinct species, where each sequence has M sites. The plate represents the independence assumption among the M sites of a sequence. Compared with the tree topology in Figure B.1, the branch length of each edge is left implicit in the graphical model representation, since it can be incorporated as a parameter in the conditional distribution P(x_v | x_{P_v}).

Let L^i_{x_k}(A) denote the probability of observing all the tips below node x_k, given that x_k has state A at the ith site. As assumed, at the tips, the state of X2 at the ith site is C. Then

(L^i_{x_2}(A), L^i_{x_2}(C), L^i_{x_2}(G), L^i_{x_2}(T)) = (0, 1, 0, 0),
(L^i_{x_1}(A), L^i_{x_1}(C), L^i_{x_1}(G), L^i_{x_1}(T)) = (0, 0, 1, 0),
(L^i_{x_4}(A), L^i_{x_4}(C), L^i_{x_4}(G), L^i_{x_4}(T)) = (1, 0, 0, 0).

For an internal node X3, we illustrate how to calculate L^i_{x_3}(A) as an example.
L^i_{x_3}(A) = ( ∑_x Prob(X1 = x | A, t1) L^i_{x_1}(x) ) × ( ∑_y Prob(X2 = y | A, t2) L^i_{x_2}(y) )
            = ( ∑_x exp(t1 Q)_{A→x} L^i_{x_1}(x) ) × ( ∑_y exp(t2 Q)_{A→y} L^i_{x_2}(y) ).

The values L^i_{x_3}(C), L^i_{x_3}(G) and L^i_{x_3}(T) can be calculated in the same way. At the root,

L^i_{x_5}(r) = ( ∑_s Prob(X3 = s | r, t3) L^i_{x_3}(s) ) × ( ∑_q Prob(X4 = q | r, t4) L^i_{x_4}(q) )
            = ( ∑_s exp(t3 Q)_{r→s} L^i_{x_3}(s) ) × ( ∑_q exp(t4 Q)_{r→q} L^i_{x_4}(q) ).

Under the reversibility assumption, with stationary distribution π = (π_A, π_C, π_G, π_T) over the four states, the likelihood of this tree at the ith site is

L^i = ∑_{r ∈ X} π_r L^i_{x_5}(r).

Denote the log-likelihood for the ith site by l^i = log(L^i). By the independence assumption among sites, the log-likelihood for M sites is l = ∑_{i=1}^M l^i.

Appendix C
Connection Between Bayesian GLMs and GTR

In this section, we establish the connection between our Bayesian GLM framework and the GTR model, one of the most popular models in phylogenetics. The GTR model can be represented using our Bayesian GLM rate matrix parameterization under a particular configuration. Our main conclusion is that a Normal prior f_nor(w) on the weights w under the Bayesian GLM parameterization is equivalent to a log-Normal prior on the exchangeable coefficients θ_{x,x'} and a logistic-Normal prior on the stationary distribution π_{x'} under the GTR parameterization.

C.1 Configuration of GTR using Bayesian GLMs

For any x, x' ∈ X, x ≠ x', Q is defined as

q^(rev)_{x,x'} = θ_{x,x'} π_{x'},

where θ_{x,x'} = θ_{x',x} under the reversibility assumption.

The dimension p of w equals the total number of univariate features p1 plus bivariate features p2, i.e. p = p1 + p2. Recall that the size of the state space is |X|. The univariate features for the π-exponential family are used to define the stationary distribution under the constraint ∑_{x ∈ X} π_x = 1. In this section, we fix the value of one of the univariate weight elements to be zero.
We impose this constraint only in Appendix C.1 and Appendix C.2, since it simplifies the proof in Appendix C.2. Thus, in this section, we have

p1 = |X| − 1,    p2 = |X^{unordered,dist.}| = |X|(|X| − 1)/2,    p = p1 + p2.

However, in the general context described in Section 3.2, we do not need the constraint that one element of w^u is set to zero, so that p1 = |X| in general.

Under the Bayesian reversible matrix GLM configuration,

θ_{x,x'}(w) := exp{⟨w, φ({x, x'})⟩}.    (C.1.1)

We pick an arbitrary state x* ∈ X as a reference, such that

π_{x*}(w) = 1 / (1 + ∑_{x'≠x*: x'∈X} exp{⟨w, ψ(x')⟩}).    (C.1.2)

Hence, for any other state x' ∈ X,

π_{x'}(w) = exp{⟨w, ψ(x')⟩} π_{x*}.    (C.1.3)

To simplify notation, we write

π_{x*}(w^u) = 1 / (1 + ∑_{i=1}^{p1} exp(w^u_i)).    (C.1.4)

For any x' ≠ x ∈ X, define

k_{x'} = arg_{1 ≤ j ≤ p1} [ 1{ψ(x') = e_j} = 1 ],
π_{x'}(w^u) = π_{x*}(w^u) exp(w_{k_{x'}}),    (C.1.5)

where (e_1, e_2, …, e_p) forms a basis of R^p. Given any state x' ∈ X, k_{x'} is the index such that the k_{x'}th element of w is the weight of the univariate feature x'.

We use the (p1 + 1)th to the (p1 + p2)th elements of w, denoted w^b, to compute the exchangeable coefficients, with

k_{x,x'} = arg_{p1+1 ≤ j ≤ p1+p2} [ 1{φ({x, x'}) = e_j} = 1 ],
θ_{x,x'}(w) = exp(w_{k_{x,x'}}).    (C.1.6)

Similarly, k_{x,x'} is the index such that the k_{x,x'}th element of w corresponds to the bivariate feature {x, x'} used to define θ_{x,x'}(w), where x ≠ x'.

C.2 Proof of equivalence under GTR configuration

Our goal is to show that a Normal prior f_nor(w) on w within the Bayesian GLM framework using GTR features is equivalent to a log-Normal prior on the exchangeable coefficients θ_{x,x'} and a logistic-Normal prior on the stationary distribution π_{x'} of a rate matrix Q under the GTR parameterization. To achieve this, we only need to show that after a variable transformation from w under a Normal prior to the (θ_{x,x'}, π_{x'}) parameterization under GTR, the prior on θ_{x,x'} and π_{x'} is log-Normal and logistic-Normal, respectively.
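For intuition, the C.1 maps from a weight vector to exchangeable coefficients and stationary probabilities can be sketched numerically. This is hypothetical Python: the state space, the weight values, and the dictionary-based feature indexing are invented for illustration and merely stand in for the indicator features φ and ψ.

```python
import itertools
import math

# Sketch of the C.1 maps: theta_{x,x'} = exp(<w, phi({x,x'})>) and a
# softmax-style stationary distribution with reference state "A" (weight 0).
states = ["A", "C", "G", "T"]                 # |X| = 4, so p1 = 3, p2 = 6
w_univ = {"C": 0.2, "G": -0.1, "T": 0.4}      # reference state "A" omitted
w_biv = {frozenset(pair): 0.1 * i             # one weight per unordered pair
         for i, pair in enumerate(itertools.combinations(states, 2))}

def stationary(w_univ):
    """pi_{x*} = 1 / (1 + sum exp(w));  pi_{x'} = exp(w_{x'}) * pi_{x*}."""
    z = 1.0 + sum(math.exp(v) for v in w_univ.values())
    pi = {"A": 1.0 / z}
    pi.update({x: math.exp(w) / z for x, w in w_univ.items()})
    return pi

def exchangeable(w_biv, x, y):
    """theta_{x,x'} = exp(w_{k_{x,x'}}); symmetric in x and x' by construction."""
    return math.exp(w_biv[frozenset((x, y))])

pi = stationary(w_univ)
q_AC = exchangeable(w_biv, "A", "C") * pi["C"]   # one reversible rate entry
assert abs(sum(pi.values()) - 1.0) < 1e-12
```

A Normal prior on the weights in this sketch induces, through these two maps, exactly the log-Normal/logistic-Normal priors that the proof below establishes.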
We first present the definitions of the multivariate log-Normal and logistic-Normal distributions, and then state our result in Theorem C.2.1.

Definition C.2.1 (Section 2, Tarmast (2001)). Let X = (X_1, X_2, …, X_p) be a p-component random vector having a multivariate Normal distribution with mean v and covariance matrix A = (a_ij). Using the transformation Y_i = exp(X_i), define Y = (Y_1, Y_2, …, Y_p), and let y = (y_1, y_2, …, y_p) be a realization of Y. The density of Y is the multivariate log-Normal distribution, denoted mlnN(v, A), and has the following form:

f_Y(y) = (2π)^{−p/2} |A|^{−1/2} y^{−1} exp( −(log y − v)^T A^{−1} (log y − v) / 2 ),

where 0 < y_i < ∞, log y = (log y_1, log y_2, …, log y_p) and y_i = exp(x_i).

Definition C.2.2 (Definition, Atchison and Shen (1980)). Let R^d denote d-dimensional real space, P^d the positive orthant of R^d, and G^d the d-dimensional positive simplex defined by

G^d = { u ∈ P^d : u_1 + … + u_d < 1 }.

The logistic transformation from R^d to G^d, or its inverse log-ratio transformation,

u = e^r / (1 + ∑_{j=1}^d e^{r_j})    and    r = log(u / u_{d+1}),

where u_{d+1} = 1 − ∑_{j=1}^d u_j, can be used to define a logistic-Normal distribution over G^d: we say that u is L^d(µ, Σ) if r follows the multivariate Normal distribution N_d(µ, Σ). The density function of L^d(µ, Σ) is

|2πΣ|^{−1/2} ( ∏_{j=1}^{d+1} u_j )^{−1} exp( −(1/2) [log(u/u_{d+1}) − µ]^T Σ^{−1} [log(u/u_{d+1}) − µ] ).

Theorem C.2.1. Following the configuration of Appendix C.1, if we assign f_nor(w) = N(0, Σ) to w = (w_1, …, w_p), then via the transformations defined in Equations (C.1.4), (C.1.5) and (C.1.6), the prior of θ_{x,x'} is a multivariate log-Normal distribution and the prior of π_{x'} is a logistic-Normal distribution, where Σ_{i,i} = σ_i^2, Σ_{i,j} = 0 for j ≠ i, i, j = 1, 2, …, p, and p = p1 + p2. We assume the elements of w are ordered such that the first p1 elements correspond to univariate features and the rest correspond to bivariate features.

Proof.
Denote the likelihood function given the observations y and the weights w as f_{y|Q}(y | Q^(rev)(w)), and the posterior distribution of w given y as f_{w|y}(w | y). The value of the likelihood function does not change under reparameterization; hence,

f_{y|Q}(y | Q^(rev)(w)) = f_{y|Q}(y | Q^(rev)(θ_{x,x'}, π_{x'}))

via the transformations defined in Equations (C.1.4), (C.1.5) and (C.1.6), where we write Q^(rev)(θ_{x,x'}, π_{x'}) to indicate that Q is parameterized by θ_{x,x'} and π_{x'} instead of w.

Based on the independence of w_i and w_j for all i ≠ j, denote by f_{y|Q}(y | Q^(rev)(w_j)) the marginal likelihood, in which all elements of w except the jth have been integrated out.

Consider the bivariate features, i.e. for all j with p1 + 1 ≤ j ≤ p1 + p2,

f_{w_j|y}(w_j | y) = [ (1/√(2π)) exp(−w_j^2 / (2σ_j^2)) f_{y|Q}(y | Q^(rev)(w_j)) ] / [ ∫_{−∞}^{∞} (1/√(2π)) exp(−w_j^2 / (2σ_j^2)) f_{y|Q}(y | Q^(rev)(w_j)) dw_j ].    (C.2.1)

According to Equation (C.1.6), without loss of generality assume k_{x,x'} = j, so that

θ_{x,x'}(w) = exp(w_{k_{x,x'}}) = exp(w_j).    (C.2.2)

Note that θ_{x,x'} = exp(w_j) is a variable transformation of w_j. Hence,

f_{θ_{x,x'}|y}(θ_{x,x'} | y) = [ (1/(√(2π) σ_j θ_{x,x'})) exp(−log(θ_{x,x'})^2 / (2σ_j^2)) f_{y|Q}(y | Q^(rev)(θ_{x,x'})) ] / [ ∫_0^{∞} (1/(√(2π) σ_j θ_{x,x'})) exp(−log(θ_{x,x'})^2 / (2σ_j^2)) f_{y|Q}(y | Q^(rev)(θ_{x,x'})) dθ_{x,x'} ].    (C.2.3)

Equations (C.2.1) and (C.2.3) show that if the prior on w_j is N(0, σ_j^2), then the prior of θ_{x,x'} = exp(w_j) is lnN(0, σ_j^2), by Definition C.2.1 of the log-Normal distribution in the one-dimensional case.

Regarding the bivariate features, we conclude that the θ_{x,x'}, for {x, x'} ∈ X^{unordered,dist.}, are mutually independent and jointly follow a multivariate log-Normal distribution mlnN(0, A) by Definition C.2.1, where A is the diagonal matrix with A_{i,i} = σ_i^2, and the σ_i^2 are the (p1 + 1)th to the (p1 + p2)th diagonal elements of Σ from f_nor(w).

If w^u follows a Normal distribution, where w^u = (w_1, w_2, …, w_{p1}) corresponds to the univariate features and Σ* is the diagonal matrix with Σ*_{ii} = (σ*_i)^2, the (σ*_i)^2 being the first p1 diagonal elements of Σ from f_nor(w), then following Equations (C.1.4) and (C.1.5) and Definition C.2.2, the prior of π_{x'} is L^d(0, Σ*), since the transformation from w_i to π_{x'} is a logistic transformation and the original weights w are normally distributed.

C.3 Connection between the Bayesian GLM GTR model and the Bayesian GLM chain GTR model

We have defined φ^gtr in Section 3.3.1 such that any GTR rate matrix can be represented via the Bayesian GLM rate matrix parameterization. Our objective is to prove that there exists w* such that, for any rate matrix under the Bayesian GLM parameterization and for all w, q^(rev)_{x,x'}(w) can be represented under the Bayesian GLM chain GTR parameterization.

Zhao et al. (2016) have shown that for any rate matrix Q, we can find weights w such that Q(x, x') = q_{x,x'}(w) = exp{⟨w^b, φ^gtr({x, x'})⟩} π_{x'}(w^u) under the Bayesian GLM GTR model. In the Bayesian GLM chain GTR model, we show that there exists w* such that for all w = (w^u, w^b),

q^gtr_{x,x'}(w) = exp{⟨w^b, φ^gtr({x, x'})⟩} π_{x'}(w^u)
              = q^chain_{x,x'}(w*)
              = exp{⟨w^{b*}, φ^chain({x, x'})⟩} π_{x'}(w^{u*}).

Thus, we can set w^{u*} = αw^u for any α ∈ R so that π_{x'}(w^u) = π_{x'}(w^{u*}) for all x' ∈ X. By plugging in the definitions of φ^gtr and φ^chain in Equations (3.3.5) and (3.3.6), respectively, we require that:

(w^{b*})_{η({x,x'})} + (w^{b*})_{η({x,x'})−1} = (w^b)_{η({x,x'})},  for η({x, x'}) = 2, 3, …, |X|(|X| − 1)/2,
(w^{b*})_1 = (w^b)_1,  for η({x, x'}) = 1.

Setting p = |X|(|X| − 1)/2, this is equivalent to solving

(w^{b*})_1 = (w^b)_1
(w^{b*})_1 + (w^{b*})_2 = (w^b)_2
(w^{b*})_2 + (w^{b*})_3 = (w^b)_3
…
(w^{b*})_{p−1} + (w^{b*})_p = (w^b)_p.

We obtain the solution w^{b*} = B w^b, where

B_{ij} = (−1)^{i+j−2} if j ≤ i, for i, j = 1, 2, …, |X|(|X| − 1)/2, and B_{ij} = 0 otherwise.

Thus, the solution exists, where w* = (w^{u*}, w^{b*}) = (αw^u, B w^b).

Appendix D
Euclidean distance table and ordering algorithm

D.1 Euclidean distance between pairs of Amino Acids

  Y H Q R T N K D E G F L A S P I M V C W
Y 0 83 99 77 92 143 85 160 122 147 22 36 112 144 110 33 36 55 194 37
H 83 0 24 29 47 68 32 81 40 98 100 99 86 89 77 94 87 84 174 115
Q 99 24 0 43 42 46 53 61 29 87 116 113 91 68 76 109 101 96 154 130
R 77 29 43 0 71 86 26 96 54 125 97 102 112 110 103 97 91 96 180 102
T 92 47 42 71 0 65 78 85 65 59 103 92 58 58 38 89 81 69 149 128
N 143 68 46 86 65 0 94 23 42 80 158 153 111 46 91 149 142 133 139 174
K 85 32 53 26 78 94 0 101 56 127 102 107 106 121 103 102 95 97 202 110
D 160 81 61 96 85 23 101 0 45 94 177 172 126 65 108 168 160 152 154 181
E 122 40 29 54 65 42 56 45 0 98 140 138 107 80 93 134 126 121 170 152
G 147 98 87 125 59 80 127 94 98 0 153 138 60 56 42 135 127 109 159 184
F 22 100 116 97 103 158 102 177 140 153 0 22 113 155 114 21 28 50 205 40
L 36 99 113 102 92 153 107 172 138 138 22 0 96 145 98 5 15 32 198 61
A 112 86 91 112 58 111 106 126 107 60 113 96 0 99 27 94 84 64 195 148
S 144 89 68 110 58 46 121 65 80 56 155 145 99 0 74 142 135 124 112 177
P 110 77 76 103 38 91 103 108 93 42 114 98 27 74 0 95 87 68 169 147
I 33 94 109 97 89 149 102 168 134 135 21 5 94 142 95 0 10 29 198 61
M 36 87 101 91 81 142 95 160 126 127 28 15 84 135 87 10 0 21 196 67
V 55 84 96 96 69 133 97 152 121 109 50 32 64 124 68 29 21 0 192 88
C 194 174 154 180 149 139 202 154 170 159 205 198 195 112 169 198 196 192 0 215
W 37 115 130 102 128 174 110 181 152 184 40 61 148 177 147 61 67 88 215 0

D.2 Nearest neighbour pairwise amino acid ordering algorithm

Algorithm 9 Nearest neighbour pairwise amino acid ordering
1: Initialization: X^{dynamic,distinct} = X^{distinct}, counter = 0.
Set α = β = i0 = i1 = NULL. AminoAcidPairRank = DynamicSupportForAminoAcid = dict() (an empty dictionary).
2: for i ∈ X do
3:   DynamicSupportForAminoAcid[i] = X \ {i}.
4: end for
5: while X^{dynamic,distinct} is not empty do
6:   if i0 is not NULL and i1 is not NULL then
7:     Find support of i0: α = DynamicSupportForAminoAcid[i0].
8:     Find support of i1: β = DynamicSupportForAminoAcid[i1].
9:   end if
10:  if neither α nor β is empty then
11:    rowDist = AminoAcidDist[i0, α]
12:    colDist = AminoAcidDist[β, i1]
13:    rowMinInd = α[argmin(rowDist)]
14:    colMinInd = β[argmin(colDist)]
15:    if AminoAcidDist[i0, rowMinInd] ≤ AminoAcidDist[colMinInd, i1] then
16:      i1 = rowMinInd
17:    else if AminoAcidDist[i0, rowMinInd] > AminoAcidDist[colMinInd, i1] then
18:      i0 = colMinInd
19:    end if
20:  else if α is NULL and β is not NULL then
21:    colDist = AminoAcidDist[β, i1]
22:    colMinInd = β[argmin(colDist)]
23:    i0 = colMinInd
24:  else if α is not NULL and β is NULL then
25:    rowDist = AminoAcidDist[i0, α]
26:    rowMinInd = α[argmin(rowDist)]
27:    i1 = rowMinInd
28:  else
29:    Find the amino acid pair SmallestPair ∈ X^{dynamic,distinct} with nonzero minimum in AminoAcidDist.
30:    i0 = index of the first amino acid of SmallestPair in X.
31:    i1 = index of the second amino acid of SmallestPair in X.
32:  end if
33:  AminoAcidPairRank[i0, i1] = AminoAcidPairRank[i1, i0] = counter; counter++.
34:  AminoAcidDist[i0, i1] = AminoAcidDist[i1, i0] = ∞.
35:  if i1 ∈ DynamicSupportForAminoAcid[i0] then
36:    Remove i1 from DynamicSupportForAminoAcid[i0]
37:  end if
38:  if i0 ∈ DynamicSupportForAminoAcid[i1] then
39:    Remove i0 from DynamicSupportForAminoAcid[i1]
40:  end if
41:  s0 = X_{i0} + X_{i1} (obtain amino acid pair s0), s1 = X_{i1} + X_{i0} (obtain amino acid pair s1).
42:  if s0 ∈ X^{dynamic,distinct} then
43:    Remove s0 from X^{dynamic,distinct}.
44:  end if
45:  if s1 ∈ X^{dynamic,distinct} then
46:    Remove s1 from X^{dynamic,distinct}
47:  end if
48: end while
Ensure: AminoAcidPairRank.

Appendix E
Cached-Uniformization method

We use uniformization for exact sampling from a CTMC conditional on
the initial and ending states. The key idea of uniformization is to transform a continuous time Markov process into a discrete one subordinated to a Poisson Process (PP). Suppose that X = {X_t, t ≥ 0} is a continuous time Markov chain with state space X, with constant exponential parameter λ(x) = q̄ ∈ (0, ∞) for x ∈ X, and jump transition probability matrix B. A CTMC is said to be subordinated to a PP with rate parameter q̄ if:

(a) The transition times (τ_1, τ_2, …) are the arrival times of the PP with rate q̄.
(b) The inter-transition times (τ_1, τ_2 − τ_1, …) are the inter-arrival times of the PP with rate q̄. The inter-transition times are independent, each following an exponential distribution with rate q̄.
(c) M = {M_t : t ∈ [0, ∞)} is the Poisson counting process, where M_t is the number of transitions in (0, t] for t ∈ [0, ∞).
(d) The PP and the jump chain Y = (Y_0, Y_1, …) are independent, and X_t = Y_{M_t} for t ∈ [0, ∞).

The jump transition probability matrix B satisfies B_{x,x} = 0 for x ∈ X. In the uniformization algorithm introduced below, Equations (E.0.1) and (E.0.2) show how B and q̄ are obtained from the original rate matrix Q of the CTMC.

The general procedure of uniformization is described in Hobolth and Stone (2009). Uniformization is based on a transition probability constructed from the rate matrix by adding self-transitions (virtual jumps):

B = I + (1/q̄) Q,    (E.0.1)
q̄ = max_x |q_{x,x}|.    (E.0.2)

Given the two states at the end-points of a single branch, the posterior distribution over the number of jumps J (which can be either between distinct states, or self-transitions) is given by:

P(J = j | X_0 = x_1, X_t = x_2) = e^{−q̄t} ((q̄t)^j / j!) (B^j)_{x_1,x_2} / exp(tQ)_{x_1,x_2}.    (E.0.3)
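Equation (E.0.3) can be evaluated without a separate matrix-exponential routine, since the normalizer of the Poisson-weighted series is exp(tQ)_{x_1,x_2} itself. The following hypothetical Python sketch (Q, t and the end-points are invented for illustration) computes a truncated version of this posterior and samples a jump count from it:

```python
import numpy as np

# Sketch of Equation (E.0.3): the posterior over the number of uniformized
# jumps on one branch. Q, t and the endpoints are invented for illustration.
Q = np.array([[-2.0, 1.0, 1.0],
              [0.5, -1.0, 0.5],
              [1.0, 1.0, -2.0]])
t = 0.8
q_bar = np.max(np.abs(np.diag(Q)))   # E.0.2
B = np.eye(3) + Q / q_bar            # E.0.1

def jump_count_posterior(x1, x2, max_j=80):
    """Truncated posterior P(J = j | X_0 = x1, X_t = x2). The unnormalized
    weights are e^{-q_bar t} (q_bar t)^j / j! * (B^j)_{x1,x2}; their sum
    converges to exp(tQ)_{x1,x2}, so normalizing handles the denominator."""
    weights = np.empty(max_j)
    Bj = np.eye(3)                    # B^0
    pois = np.exp(-q_bar * t)         # Poisson(q_bar * t) term for j = 0
    for j in range(max_j):
        weights[j] = pois * Bj[x1, x2]
        Bj = Bj @ B                   # advance to B^{j+1}
        pois *= q_bar * t / (j + 1)   # next Poisson term
    return weights / weights.sum()

rng = np.random.default_rng(0)
p = jump_count_posterior(0, 2)
J = rng.choice(len(p), p=p)           # sampled number of jumps
```

Because x_1 ≠ x_2 in this example, the j = 0 term (B^0)_{x_1,x_2} vanishes, so the sampled J is always at least one; the conditioned discrete chain and the jump locations are then filled in as described next.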
Given the number of transitions J = j and the end-points, the list of states visited is obtained by simulating a discrete Markov chain of length j + 2, where the first state is conditioned to be x_1, the last state is x_2, and the transition probabilities are given by B.

Finally, the locations of the jumps (between distinct states, or self-transitions) are obtained by simulating J independent uniform numbers and sorting them. The self-transitions can be discarded in a post-processing step.

The computational bottleneck of this process is the evaluation of (B^j)_{x_1,x_2} for various powers j. If one is concerned with a single branch, it may be attractive to do this computation in the following way:

(B^j)_{x_1,x_2} = ( ⋯ ((e_{x_1} B) B ⋯ B)_{x_2} ),    (E.0.4)

where e_x is the standard basis vector with a one at entry x, and equal to zero otherwise. This yields a running time of J|X|^2.

However, since the matrix B is shared among all sites and all branches, it becomes attractive to store the powers of B in a phylogenetic context. This requires a more expensive compute time of |X|^3 per power, but it pays off in a phylogenetic context, since most branches have moderate lengths, which implies that the number of transitions per branch is also moderate.

We use a cache taking the form of a list, cache, initialized to contain a single element: cache[1] ← B. We retrieve from the cache using the procedure get_and_cache(j), defined as follows:

1. If the length of the list cache is smaller than j, store in cache[j] the product of the matrices B and get_and_cache(j − 1).
2.
Return cache[j].

We denote by M_e the expected number of times Step 1 requires the computation of a matrix multiplication over the full process of resampling the auxiliary variable Z across all branches of a phylogenetic tree.

Using this cache, we sample Z as follows, writing in curly brackets the total running time of each step over the entire process of resampling the auxiliary variable Z once. As in the previous section, we present the algorithm in the context of a phylogenetic tree with K sites; the case of a time series is recovered as a special case by setting K = 1.

1. Form the matrix in Equation (E.0.1). {|X|^2}

2. Run the sum-product algorithm, and use its output to sample reconstructed states for every end-point of every branch and site (see the Background section of the main paper for details). {Reversible case: |X|^3 + 2K|E|·|X|^2 + K|E|·|X|; non-reversible case: |E|·|X|^3 + K|E|·|X|^2 + K|E|·|X|. In both cases, the first two terms account for the computation of the sum-product algorithm, and the last term is the cost of sampling internal reconstructions.}

3. Initialize the transition counts c and the total sojourn times h to zero. {|X|^2}

4. In the reversible case, perform the eigendecomposition of Q. In the non-reversible case, for each branch of length ∆, cache the matrix exponential exp(∆Q). {Reversible case: |X|^3; non-reversible case: |E|·|X|^3}

5. For each branch and each site with end-points x_1 and x_2 (obtained from Step 2 above):

(a) Simulate J = j using the following method:

i. Simulate a uniform random number u ∈ [0, 1]. {K|E|}

ii. Set s ← 0. {K|E|}

iii. Loop, for j = 0, 1, ...:

A. s ← s + e^{−q̄∆} ((q̄∆)^j / j!) · (get_and_cache(j))_{x_1,x_2} / exp(∆Q)_{x_1,x_2}. {Reversible case: O(K M_e |X|·|E| + M_e |X|^3); non-reversible case: O(K M_e ·

Footnote: In the reversible case, the running time assumes that matrix exponentiation is performed via diagonalization, at a cost of |X|^3 (Moler and Van Loan, 1978).
In the non-reversible case, the matrix exponentials can be computed using the Padé method at a cost of |E|·|X|^3 (Moler and Van Loan, 1978).

|E| + M_e |X|^3), using the cache of exponentiated transition matrices and the cached matrix exponential/diagonalization from Step 4.}

B. If s > u, break the inner loop. {K|E| M_e}

(b) Simulate J uniform numbers in the interval [0, ∆], and sort them to obtain a list u_1 < u_2 < ··· < u_J. {This can be done in time K|E| M_e by simulating spacings via exponential random variables and normalizing them (Carpenter et al., 1999).}

(c) To sample the path, initialize the current state x to the first end-point, x_1, and loop over i = 1, 2, ..., J:

i. Form a vector of posterior transition probabilities p = (p_1, ..., p_{|X|}), with entry x′ given by:

    p_{x′} = B_{x,x′} (get_and_cache(J − i))_{x′,x_2}.      (E.0.5)

{K M_e |E|·|X|}

ii. Normalize the entries of p to form a probability distribution, and sample from it to obtain a new state x̃. {K M_e |E|·|X|}

iii. Update the sufficient statistics c and h using the transition (x, x̃) and the corresponding time u_{i+1} − u_i. {K M_e |E|}

iv. Update the current state: x ← x̃. {K M_e |E|}

In the reversible case, the overall running time is O(K|E|·|X|^2 + M_e|X|^3 + K M_e|E|·|X|); in the non-reversible case, it is O(K|E|·|X|^2 + M_e|X|^3 + |E|·|X|^3 + K M_e|E|·|X|).

We also investigate the relationship between the branch lengths and the cache size when using cached uniformization to sample Z. We generate pairs of DNA sequences of 5000 sites under Gtr with branch lengths from 0.2 to 5 in steps of 0.2; see Figure F.7. The figure indicates that the cache size increases linearly as the branch length increases.

Appendix F

Supplementary figures for AHMC
Figure F.1: Actual stationary distributions are compared with estimates from a synthetic dataset of 641 sequences with 415 sites generated using the PolaritySizeGtr model. We generate unbalanced stationary distributions to increase the difficulty of reconstruction.

Figure F.2: Histogram of the relative biases (described in Section 7.3) of the 400 transition probability matrix entries (excluding the redundant diagonal elements) under t = 2. The two dotted lines label -0.2 and 0.2.

Figure F.3: Histogram of the relative biases (described in Section 7.3) of the 400 transition probability matrix entries (excluding the redundant diagonal elements) under t = 3. The two dotted lines label -0.2 and 0.2.

Figure F.4: Histogram of the relative biases (described in Section 7.3) of the 400 transition probability matrix entries (excluding the redundant diagonal elements) under t = 5. The two dotted lines label -0.2 and 0.2.

Figure F.5: Estimates of the stationary distributions from MrBayes and our model, both set to their respective GTR settings.

Figure F.6: Histogram of the relative differences of the estimates of the 190 exchangeable coefficients between MrBayes and our model, both set to their respective GTR settings. The differences are generally within 0.05; the two outliers greater than 0.1 are between A to R and Q to P.
Figure F.7: The relationship between cache sizes and branch lengths using pairs of DNA sequences with 5000 sites under Gtr with branch lengths from 0.2 to 5 in steps of 0.2. "Average" represents the cache size for each dataset averaged across 100,000 MCMC iterations; "Min" and "Max" represent the minimum and maximum of the cache size.

Figure F.8: Trace plots for NMH with bandwidth 0.01 and our AHMC sampler, both run within the same wall-clock time limit of 2267 minutes. After discarding the burn-in period of the first 50,000 iterations, there are 450,000 iterations for the NMH method and 229,520 iterations for our AHMC sampler. The ordinate axis represents the sampled values for the feature "AD". Other features exhibit the same qualitative behaviour.

Figure F.9: Estimates of the weights for the polarity/charge and size features. The legend specifies the underlying model following the naming convention defined in Section 7 of the main text.

Figure F.10: Stationary distributions on the protein kinase domain family data. "Estimated" refers to the stationary distribution based on the posterior mean of the univariate weights in our Bayesian model.
"Empirical" refers to the empirical frequency counts, which are sometimes used in lieu of estimates derived from CTMC models (e.g., Cao et al. (1994)). Note that even though the two estimates are broadly similar, they also differ quantitatively in important aspects. Estimating the stationary distribution empirically from frequency counts is done mainly to save computational time or to reduce the number of parameters in complex models, such as the General Time Reversible (GTR) model for codon evolution, which requires a 61-by-61 rate matrix. Empirical frequency counts do not take the uncertainty in the estimates of the stationary distribution into account, whereas the Bayesian sampling method is able to assess this uncertainty.

Appendix G

Supplementary tables for AHMC

AcidicAcidic  AcidicBasic  AcidicPolar  BasicBasic  BasicPolar  PolarPolar
     0.226       -0.711        0.440       -0.246       0.496       1.332
NonNon        AcidicNon    BasicNon     NonPolar    BigBig      BigMicro
    -2.366        0.065       -0.010       -1.388      -1.215      -0.677
MicroMicro    statio(A)    statio(C)    statio(D)   statio(E)   statio(F)
    -0.902       -0.685       -0.577       -0.863      -0.048       0.012
statio(G)     statio(H)    statio(I)    statio(K)   statio(L)   statio(M)
    -0.455       -1.305       -0.269       -0.199      -0.136      -0.932
statio(N)     statio(P)    statio(Q)    statio(R)   statio(S)   statio(T)
    -1.431        0.194       -0.629       -0.901      -0.526      -0.645
statio(V)     statio(W)    statio(Y)
     0.360        0.049        0.137

Table G.1: Geweke diagnostic test statistic values under the PolaritySizeGtr model for datasets where each sequence has 100 sites. Under the 5% significance level, an absolute value larger than 2 supports a lack of convergence for that parameter.
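The Geweke statistic reported above compares the mean of an early segment of a chain with the mean of a late segment, standardized by their estimated standard errors. A simplified sketch on toy chains (a hedged illustration, not the thesis code: naive iid variance estimates are used, whereas the standard diagnostic uses spectral-density estimates to account for autocorrelation, and the 10%/50% segment fractions are the conventional defaults, not values taken from the thesis):

```python
import math

def geweke_z(chain, first=0.1, last=0.5):
    """z-score comparing the mean of the first 10% of a chain with the mean
    of the last 50%. |z| > 2 suggests a lack of convergence at the 5% level.
    Naive (iid) variance estimates are used in this sketch."""
    a = chain[: int(first * len(chain))]
    b = chain[int((1.0 - last) * len(chain)):]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

mixed = [i % 2 for i in range(5000)]   # toy chain oscillating around 0.5
drift = list(range(5000))              # trending chain: clearly not converged

z_mixed = geweke_z(mixed)   # early and late means agree, so z = 0 here
z_drift = geweke_z(drift)   # early and late means differ, |z| far above 2
```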
AcidicAcidic  AcidicBasic  AcidicPolar  BasicBasic  BasicPolar  PolarPolar
    -0.365       -0.400        1.414        1.565      -0.438       1.049
NonNon        AcidicNon    BasicNon     NonPolar    BigBig      BigMicro
     1.613        0.041        0.783        0.868      -0.559      -0.751
MicroMicro    statio(A)    statio(C)    statio(D)   statio(E)   statio(F)
    -1.270       -0.644       -0.321       -0.090      -0.219      -0.204
statio(G)     statio(H)    statio(I)    statio(K)   statio(L)   statio(M)
    -0.270       -0.300       -0.450       -0.161      -0.207      -0.525
statio(N)     statio(P)    statio(Q)    statio(R)   statio(S)   statio(T)
    -0.473       -0.413       -0.008       -0.406      -0.315      -0.274
statio(V)     statio(W)    statio(Y)
    -0.587       -0.067       -0.377

Table G.2: Geweke diagnostic test statistic values under the PolaritySizeGtr model for datasets where each sequence has 1500 sites. Under the 5% significance level, an absolute value larger than 2 supports a lack of convergence for that parameter.

Method         Min.    1st Qu.   Median   Mean    3rd Qu.   Max.
AHMC           24.56   41.42     50.32    86.01   67.90     432.20
Adaptive NMH    6.32   30.91     40.07    41.84   52.45      93.34
NMH 0.002       0.17    0.53      0.97     1.54    1.53      17.77
NMH 0.005       0.51    1.70      2.47     4.89    3.99      74.39
NMH 0.01        1.44    6.97     10.88    20.25   18.62     244.20
NMH 0.0125      0.71    3.53      5.65     9.86    9.25     117.70
NMH 0.015       0.80    4.01      6.49    11.24   10.78     128.10
NMH 0.018       1.23    6.77     11.05    19.03   18.29     226.30
NMH 0.02        1.60    7.54     11.82    20.98   20.80     245.50
NMH 0.03        1.32    3.97      6.16    11.69   10.28     162.90
NMH 0.04        0.40    1.44      2.18     3.61    3.73      38.49
NMH 0.05        0.20    0.54      0.78     0.97    1.07       6.02
NMH 0.06        0.36    0.81      1.08     1.25    1.44       6.56
NMH 0.07        0.28    0.65      0.83     0.89    1.06       2.47
NMH 0.08        0.31    0.73      0.95     1.04    1.25       3.39
NMH 0.09        0.25    0.53      0.67     0.77    0.93       2.78
NMH 0.10        0.55    0.79      0.90     0.97    1.08       3.03
NMH 0.15        0.20    0.26      0.30     0.36    0.41       0.81
NMH 0.20        0.27    0.31      0.35     0.38    0.43       0.63

Table G.3: ESS per 10^4 seconds. NMH x stands for a Normal proposal Metropolis-Hastings algorithm with proposal bandwidth x.
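ESS figures like those above can be estimated from a chain's autocorrelations via ESS = n / (1 + 2 Σ_k ρ_k). A minimal sketch with a crude truncation rule (stop at the first small autocorrelation) on toy chains; this is an illustration, not the estimator used in the thesis, and packages such as coda use more careful truncation schemes:

```python
def ess(chain, cutoff=0.05):
    """Effective sample size n / (1 + 2 * sum of autocorrelations), summing
    lags until the autocorrelation first drops to `cutoff` or below."""
    n = len(chain)
    m = sum(chain) / n
    var = sum((x - m) ** 2 for x in chain) / n
    if var == 0.0:
        return float(n)
    acc = 0.0
    for k in range(1, n):
        rho = sum((chain[i] - m) * (chain[i + k] - m)
                  for i in range(n - k)) / (n * var)
        if rho <= cutoff:
            break  # crude truncation: stop at the first small/negative lag
        acc += rho
    return n / (1.0 + 2.0 * acc)

fast = [i % 2 for i in range(2000)]           # anti-correlated: no penalty
slow = [(i // 100) % 2 for i in range(2000)]  # sticky chain: heavy penalty
```

The "slow" chain moves in long blocks of repeated values, so its autocorrelations stay high over many lags and its ESS is a small fraction of the nominal sample size.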
Method    Min    1st Quantile   Median   Mean   3rd Quantile   Max
Aux       20     165            295      551    484            8698
No Aux    3.27   8.62           12.1     56.5   18.2           681

Table G.4: Summary of the ESS per 10^6 seconds for the stationary distribution and exchangeable coefficients with the auxiliary variable and without, denoted "Aux" and "No Aux" respectively, with estimates inferred under a PolaritySizeGtr model with fixed ε = 5 × 10^-7 and L = 30, while fixing the topology and branch lengths to the true ones, on a simulated amino acid dataset of 10 leaves and 5000 sites generated under a PolaritySizeGtr model.

Appendix H

Supplementary results for LBPS-HMC

H.1 Exact invariance test results

Table H.1 and Table H.2 show that our proposed new algorithm passed the Exact Invariance Test (EIT), since all the p-values are bigger than a threshold of 0.05/n, where n represents the number of parameters in the test; the threshold accounts for multiple comparisons. For example, n = 10 for a 5-by-5 chain GTR rate matrix when testing the exchangeable parameters. Similarly, we display the test results for HMC in Table H.3 and Table H.4.

Parameter index   1       2       3       4       5
P-value           0.640   0.200   0.329   0.114   0.167
Test statistic    0.060   0.087   0.077   0.097   0.090
Test              KS      KS      KS      KS      KS

Table H.1: EIT for wu using LBPS-HMC, which uses HMC for the stationary distribution weights and LBPS for wb, the weights of the exchangeable parameters.
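The KS test statistics in these tables measure the largest gap between two empirical CDFs; in the EIT, roughly speaking, draws of a parameter from exact forward simulation are compared with draws produced with the transition kernel under test. A self-contained sketch of the two-sample statistic itself (computing p-values additionally requires the KS null distribution, e.g. via scipy.stats.ks_2samp):

```python
import bisect

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical CDFs, evaluated at the data."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    for v in xs + ys:
        fx = bisect.bisect_right(xs, v) / len(xs)
        fy = bisect.bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d

# Identical samples give statistic 0; fully separated samples give 1.
same = ks_statistic([0.1, 0.4, 0.7], [0.1, 0.4, 0.7])
far  = ks_statistic([0.0, 0.1, 0.2], [0.8, 0.9, 1.0])
```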
Parameter index   1       2       3       4       5       6       7       8       9       10
P-value           0.200   0.504   0.441   0.061   0.571   0.838   0.382   0.709   0.238   0.382
Test statistic    0.087   0.067   0.070   0.107   0.063   0.050   0.073   0.057   0.083   0.073
Test              KS      KS      KS      KS      KS      KS      KS      KS      KS      KS

Table H.2: EIT for exchangeable parameters using LBPS-HMC, which uses HMC for the stationary distribution and LBPS for the weights of the exchangeable parameters.

Parameter index   1       2       3       4       5
P-value           0.967   0.200   0.640   0.838   0.640
Test statistic    0.040   0.087   0.060   0.050   0.060
Test              KS      KS      KS      KS      KS

Table H.3: EIT for stationary distribution elements using HMC for both the stationary distribution and the weights of the exchangeable parameters.

Parameter index   1       2       3       4       5       6       7       8       9       10
P-value           0.139   0.076   0.504   0.281   0.838   0.640   0.936   0.504   0.139   0.238
Test statistic    0.093   0.103   0.067   0.080   0.050   0.060   0.043   0.067   0.093   0.083
Test              KS      KS      KS      KS      KS      KS      KS      KS      KS      KS

Table H.4: EIT for exchangeable parameters using HMC for both the stationary distribution and the weights of the exchangeable parameters.

H.2 Posterior samples comparison between LBPS and HMC

As stated in Chapter 6, because the two algorithms HMC and LBPS-HMC share the same limiting distribution, as we increase the number of iterations of the Markov chains, after an initial burn-in period, the distributions of the posterior samples obtained from the two algorithms will become increasingly similar, and both will become closer to the target posterior distribution. For higher-dimensional rate matrices, since the total number of parameters is O(|X|^2), it is impractical to display density plots for all parameters in order to compare the similarity between the two samplers and thereby evaluate their correctness. To address this, we use the Absolute Relative Difference (ARD) as the metric to describe the similarity between the posterior samples from the two algorithms; smaller values of ARD indicate greater similarity. ARD is defined as (Equation H.2.1):

    ARD(x, y) = |x − y| / max(x, y),   for all x, y > 0.
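The ARD metric is straightforward to compute; a small sketch with hypothetical toy values:

```python
def ard(x, y):
    """Absolute Relative Difference of Equation (H.2.1), for x, y > 0."""
    assert x > 0 and y > 0
    return abs(x - y) / max(x, y)

# ARD is symmetric, zero only for identical values, and bounded by 1.
d = ard(2.70, 3.05)  # e.g. comparing two hypothetical posterior means
```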
In this section, we provide the summary statistics, including the minimum, first quantile, median, mean, third quantile, and maximum, of the ARD across all exchangeable parameters between samples obtained from a single run of LBPS-HMC and of HMC, in Table H.5, under the 8-by-8 Bayesian GLM chain GTR model described in Section 3.3.2. We also provide a boxplot comparison of the posterior samples collected from LBPS-HMC and HMC under the same set-up in Figure H.1.

Min.   1st Quantile   Median   Mean    3rd Quantile   Max.
1.4%   1.16%          2.70%    3.05%   3.63%          12.06%

Table H.5: Summary of ARD across exchangeable parameters between HMC and LBPS.
Figure H.1: Boxplot comparison of the posterior samples for the exchangeable parameters of an 8-by-8 rate matrix between LBPS and HMC.

H.3 Traceplot comparison for real data analysis

We provide the traceplot of an exchangeable parameter selected uniformly at random using LBPS-HMC compared with HMC. There is a small difference between the posterior means, with an ARD of 0.43%.

Figure H.2: Traceplot for an exchangeable parameter selected uniformly at random using LBPS-HMC compared with HMC, with an ARD of 0.43% in terms of the posterior mean.

H.4 Correctness check via ARD between two samplers

Since we have the same prior distribution and likelihood for the parameters (weights) of interest, LBPS-HMC and HMC theoretically share the same posterior distribution. Thus, if we run longer chains for both algorithms, the densities of their posterior samples should become closer and closer. We evaluate this via ARD.

Figure H.3 and Table H.6 show the ARD across all exchangeable parameters between HMC and LBPS-HMC with trajectory length 0.2.

Min.   1st Quantile   Median   Mean    3rd Quantile   Max.
0.3%   7.5%           14.8%    15.1%   22.0%          44.4%

Table H.6: Summary of ARD across exchangeable parameters.

Figure H.3: ARD across exchangeable parameters for LBPS compared with HMC, with blue dotted lines representing the 10% and 90% quantiles of the ARD.

Since the maximum value of the ARD is 44.4%, we further check whether this value decreases, as expected, when we run longer chains. We have found that the difference between the two algorithms in terms of ARD becomes smaller and smaller as we run both algorithms for a longer period. This is also the case in our simulation studies.
We validate this with a violin plot in Figure H.4 comparing the ARD from the current run with the results from a longer run. We denote the ARD from the current run as "shorter" and the ARD from the longer run as "longer"; in the longer run, HMC had a walltime of 12 days and LBPS-HMC a walltime of 10 days, each around two days longer than the current run. The maximum ARD dropped from 44% to 37% with the longer walltime.

To identify the situations in which the ARD values are larger for certain parameters, we report in Table H.7 the ESS quantiles, and in Table H.8 the actual ESS, of the exchangeable parameters whose ARD between LBPS-HMC and HMC exceeds the 95% quantile. We have found that a relatively large difference in the posterior mean between the two methods occurs either when both have low ESS for a parameter compared to the other parameters, or when LBPS-HMC attains a high ESS (such as the 97.9% and 94.2% quantiles across exchangeable parameters) but HMC does not obtain an equivalently high ESS for the same parameter within the same, or even a slightly longer, walltime.

Figure H.4: Violin plot of ARD across exchangeable parameters between a shorter run and a longer run, the latter with a walltime two days longer.

LBPS   2.63%    2.105%   8.42%    5.79%   77.4%   94.2%   75.8%   97.9%   0.526%   3.16%
HMC    16.84%   0.526%   15.79%   2.63%   53.7%   84.2%   73.7%   98.4%   1.053%   5.26%

Table H.7: The ESS quantiles of the exchangeable parameters whose ARD is larger than the 95% quantile of ARD across all exchangeable parameters, obtained using the LBPS-HMC and HMC algorithms respectively.
LBPS   1071   1051   1248   1147   2145   2330   2115   2451   773   1118
HMC     544    152    604    488    927   1032   1090    591   190    718

Table H.8: The ESS of the exchangeable parameters whose ARD is larger than the 95% quantile of all ARD values across all exchangeable parameters, using the LBPS-HMC and HMC algorithms respectively.

