LiquidBayes: a bayesian network for monitoring cancer progression using liquid biopsies

by

Kevin Yang

BSc., The University of British Columbia, 2020

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Bioinformatics)

The University of British Columbia
(Vancouver)

December 2022

© Kevin Yang, 2022

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

LiquidBayes: a bayesian network for monitoring cancer progression using liquid biopsies

submitted by Kevin Yang in partial fulfillment of the requirements for the degree of Master of Science in Bioinformatics.

Examining Committee:

Dr. Andrew Roth, Assistant Professor, Departments of Computer Science and Pathology and Laboratory Medicine, UBC
Supervisor

Dr. Yongjin Park, Assistant Professor, Departments of Statistics and Pathology and Laboratory Medicine, UBC
Co-supervisor

Dr. Alexandre Bouchard-Côté, Professor, Department of Statistics, UBC
Supervisory Committee Member

Dr. Ryan Morin, Associate Professor, Department of Molecular Biology and Biochemistry, Simon Fraser University
Supervisory Committee Member

Abstract

Cancer exhibits temporal heterogeneity, described by the existence and continual evolution of multiple cell subpopulations in a tumour. Temporal heterogeneity triggers resistance, disrupting targeted therapies and worsening patient prognosis. The continual monitoring of cancer patients can aid in identifying resistance and lead to informed treatment decisions. Liquid biopsies are non-invasive blood samples containing double stranded, free-floating DNA called cell-free DNA (cfDNA).
A subset of cfDNA, known as circulating tumour DNA (ctDNA), originates from the tumour itself and can uncover tumour-specific mutations to characterize cancer. Although liquid biopsies provide a means to evaluate response over time, ctDNA abundance can be extremely low in post-treatment patients, inhibiting its utility for monitoring therapy efficacy in these contexts.

Statistical methods have been developed for estimating tumour fraction using ctDNA samples. Moreover, some groups have integrated somatic mutations derived from bulk sequencing of a tissue biopsy to address low ctDNA abundance. However, no method we know of incorporates single-cell sequencing of a tissue biopsy in ctDNA analysis. In this thesis, we present LiquidBayes, a Bayesian Network that integrates clone-level copy number profiles from single-cell Whole Genome Sequencing (WGS) of the primary tissue with WGS of ctDNA samples. LiquidBayes leverages Markov Chain Monte Carlo (MCMC) techniques and is implemented using a Probabilistic Programming Language (PPL). LiquidBayes significantly outperforms state-of-the-art methods in tumour fraction estimation and allows for inference of clonal prevalences. LiquidBayes can analyze serial ctDNA samples to dissect temporal heterogeneity, intercept resistance and ultimately improve patient prognosis.

Lay Summary

Cancer is composed of subpopulations of cells, known as clones, each of which may respond differently to treatments. As such, it is important to monitor tumour progression after administering therapy to inform subsequent treatment decisions. Liquid biopsies are non-invasive blood draws that contain circulating tumour DNA (ctDNA), DNA fragments derived from the tumour itself. By studying ctDNA, we can uncover salient properties of cancer; in particular, tumour burden (pervasiveness of cancer) and clonal prevalences (diversity of cancer).

Various methods have been developed to estimate tumour fraction, a surrogate measure for tumour burden, using ctDNA samples.
However, these methods do not analyze cancer at single-cell resolution. Our objective is to incorporate a tissue biopsy, analyzed at single-cell resolution, with liquid biopsies to uncover characteristics of the cancer. We present LiquidBayes, a Bayesian statistical model for inferring tumour fraction and clonal prevalences, leading to informed treatment decisions and improved prognosis.

Preface

This thesis was completed under the supervision of Dr. Andrew Roth and Dr. Yongjin Park at the BC Cancer Research Centre. I was responsible for implementing LiquidBayes and all related experiments. Felix Fu joined as a summer research student and contributed to the development of the benchmarking pipeline. Shaocheng Wu provided the processed Lymphoma Dataset described in Section 2.3.1.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgments
1 Introduction
  1.1 Tumour Heterogeneity
  1.2 Liquid Biopsies
  1.3 Sequencing Technologies
    1.3.1 Single-Cell Sequencing
    1.3.2 Direct Library Preparation+
  1.4 Cancer Genomics
    1.4.1 Copy Number Variation
    1.4.2 Single Nucleotide Variant
  1.5 Probabilistic Models
    1.5.1 Bayesian Networks
    1.5.2 Forward Sampling
    1.5.3 Markov Chain Monte Carlo
    1.5.4 Bayesian Coverage
2 Methods
  2.1 LiquidBayes
    2.1.1 Models
    2.1.2 Preprocessing
    2.1.3 Implementation
    2.1.4 BinomialLogits
  2.2 Synthetic Datasets
    2.2.1 Simulating Copy Number Profiles
    2.2.2 Simulating Read Counts
    2.2.3 Simulating Allelic Counts
  2.3 Semi-realistic Datasets
    2.3.1 Lymphoma Dataset
    2.3.2 Semi-realistic Data Simulation
    2.3.3 Limitations
  2.4 Benchmarking
    2.4.1 LiquidBayes
    2.4.2 ichorCNA
    2.4.3 MRDetectSNV
    2.4.4 Variant Calling
3 Results
  3.1 Synthetic Experiments
    3.1.1 Tumour Fraction
    3.1.2 Read Depth
    3.1.3 Number of Clones
    3.1.4 Missing Clones
  3.2 Semi-realistic Experiments
    3.2.1 Tumour Fraction
    3.2.2 Read Depth
    3.2.3 Number of Clones
    3.2.4 Missing Clone
4 Conclusion
  4.1 Discussion
  4.2 Future Directions
Bibliography
A Supporting Materials
  A.1 Figures & Tables

List of Tables

Table 3.1 Numerical results from synthetic experiments for tumour fraction
Table 3.2 Numerical results from synthetic experiments for clonal prevalences (base)
Table 3.3 Numerical results from synthetic experiments on read depth (base)
Table 3.4 Numerical results from synthetic experiments on the number of clones
Table 3.5 Proportions of removed clones in synthetic experiments
Table 3.6 Numerical results from semi-realistic experiments on tumour fraction
Table 3.7 Numerical results from semi-realistic experiments on read depth
Table 3.8 Proportions of removed clones in semi-realistic experiments
Table A.1 Unnormalized ρ values
Table A.2 Numerical results from synthetic experiments for clonal prevalences (extended)
Table A.3 Numerical results from synthetic experiments on read depth (extended)
Table A.4 Numerical results from semi-realistic experiments for clonal prevalences (base)
Table A.5 Numerical results from semi-realistic experiments for clonal prevalences (extended)
Table A.6 Numerical results from semi-realistic experiments on the number of clones

List of Figures

Figure 1.1 Overview of the DLP+ library preparation methodology.
Figure 2.1 LiquidBayes’ base model.
Figure 2.2 LiquidBayes’ extended model.
Figure 2.3 Raw vs. corrected read counts.
Figure 2.4 Synthetic copy number (CN) profiles for three clones (A,B,C) and normal.
Figure 2.5 Synthetic read counts.
Figure 2.6 Heat map of inferred CN profile for Lymphoma patient data
Figure 3.1 Synthetic experiments on tumour fraction and clonal prevalences
Figure 3.2 Posterior plots from synthetic experiments on tumour fraction
Figure 3.3 Posterior statistics from synthetic experiments on tumour fraction
Figure 3.4 Synthetic experiments on read depth
Figure 3.5 Posterior plots from synthetic experiments on read depth
Figure 3.6 Synthetic experiments on number of clones
Figure 3.7 Posterior plots from synthetic experiments on number of clones
Figure 3.8 Posterior statistics from synthetic experiments on number of clones
Figure 3.9 Synthetic experiments with a missing clone
Figure 3.10 Posterior plots from synthetic experiments with a missing clone
Figure 3.11 Semi-realistic experiments on tumour fraction
Figure 3.12 Posterior plots from semi-realistic experiments on tumour fraction
Figure 3.13 Posterior statistics from semi-realistic experiments on tumour fraction
Figure 3.14 Semi-realistic experiments on tumour fraction for clones
Figure 3.15 Semi-realistic experiments on read depth
Figure 3.16 Semi-realistic experiments on number of clones
Figure 3.17 Semi-realistic experiments on a missing clone
Figure A.1 Semi-realistic experiments on tumour fraction for clones (extended)
Figure A.2 Posterior plots from synthetic experiments on read depth (extended)
Figure A.3 Posterior plots from semi-realistic experiments on read depth
Figure A.4 Posterior plots from synthetic experiments on number of clones
Figure A.5 Posterior plots from synthetic experiments with smallest clone removed (base)
Figure A.6 Posterior plots from synthetic experiments with largest clone removed (base)
Figure A.7 Posterior plots from synthetic experiments with smallest clone removed (extended)
Figure A.8 Posterior plots from synthetic experiments with largest clone removed (extended)
Figure A.9 Posterior plots from semi-realistic experiments on read depth (base)
Figure A.10 Posterior plots from semi-realistic experiments on read depth (extended)
Figure A.11 Posterior statistics from semi-realistic experiments on read depth
Figure A.12 Posterior plots from semi-realistic experiments on number of clones (base)
Figure A.13 Posterior plots from semi-realistic experiments on number of clones (extended)
Figure A.14 Posterior statistics from semi-realistic experiments on number of clones
Figure A.15 Posterior statistics from semi-realistic experiments with a missing clone

List of Abbreviations

BN      Bayesian Network
cfDNA   cell-free DNA
ctDNA   circulating tumour DNA
CPD     Conditional Probability Distribution
CN      copy number
CNV     Copy Number Variation
DNA     Deoxyribonucleic Acid
DLBCL   Diffuse Large B-Cell Lymphoma
DLP+    Direct Library Preparation Plus
DAG     Directed Acyclic Graph
FL      Follicular Lymphoma
HMM     Hidden Markov Model
HDI     Highest Density Interval
HDR     Highest Density Region
MCMC    Markov Chain Monte Carlo
MRD     Minimal Residual Disease
NGS     Next-Generation Sequencing
PCR     Polymerase Chain Reaction
PPL     Probabilistic Programming Language
SNP     Single Nucleotide Polymorphism
SNV     Single Nucleotide Variant
SOTA    state-of-the-art
TFRI    Terry Fox Research Institute
VAF     Variant Allele Frequency
VCF     Variant Call Format
WGS     Whole Genome Sequencing

Acknowledgments

I would like to thank my supervisor Dr. Andrew Roth for his guidance throughout the past two years. He was available to answer questions and give feedback to drive this project forward. Many thanks to my co-supervisor, Dr. Yongjin Park, for supporting me in academic and family matters. I am grateful to Dr. Andrew Roth and Dr.
Yongjin Park for accommodating me as I welcomed my daughter Felicity into this world.

I also want to acknowledge the members of my committee. Thank you Dr. Ryan Morin and Dr. Alexandre Bouchard-Côté for providing valuable input during committee meetings. This thesis has improved as a result. I also want to extend my appreciation to Dr. Ryan Brinkman for agreeing to chair my thesis defense. This thesis would not be complete without his contribution.

A big thanks to the Canadian Institutes of Health Research for partially funding my research through the Canada Graduate Scholarships Masters award.

I am grateful for the opportunity to work in the Roth lab for these past two years, where I have developed both as an individual and as a researcher. Meaningful friendships have made this journey more enjoyable.

Above all, thank you to my beautiful wife Cheyenne whom I cherish with all my heart. This thesis simply would not exist without your hard work and perseverance, assuming familial responsibilities on top of your own research, studies and personal life. Finally, I would be remiss not to mention my wonderful daughter Felicity, who keeps everybody up all night changing diapers, feeding and cleaning bottles. You bring much joy to our family.

Chapter 1

Introduction

Cancer is a complex disease driven by genetic mutations. The tumour originates from a single mutated cell and arises through subsequent rounds of proliferation and additional mutations[48]. This evolutionary process produces genetically distinct populations of cells known as ‘clones’, leading to inconsistent therapeutic response over the tumour[47][52]. Clonal evolution exhibits temporal heterogeneity, marked by the growing or waning of clones over time due to selective pressures. Importantly, clones in cancers develop resistance to treatment by acquiring mutations that alter the cell-intrinsic mechanisms which govern their response to therapies[67][52].
Consequently, continual monitoring at all stages of treatment can improve prognosis by intercepting clonal resistance. Currently, clonal analysis is commonly done using invasive tissue biopsies. However, tissue biopsies are unfit for monitoring treatment response, as it is impractical to extract multiple tissue biopsies across time. In contrast, liquid biopsies are minimally-invasive blood samples containing double stranded, free-floating Deoxyribonucleic Acid (DNA), known as cell-free DNA (cfDNA). In cancer patients, circulating tumour DNA (ctDNA) is a subset of cfDNA originating from the tumour itself. Studies have shown a high concordance between ctDNA and tumour biopsies[7][1][14][25], suggesting that ctDNA can act as a proxy for serial tissue biopsies. Accordingly, the proportion of reads contributed by cancer cells in cfDNA (tumour fraction) can be used as a surrogate measure of the overall quantity of malignant cells (tumour burden) to track cancer progression.

1.1 Tumour Heterogeneity

Cancer is uncontrolled cell growth following evolutionary principles, where genetic variation alters molecular signatures in cells[48]. Therefore, the tumour is composed of genotypically distinct subpopulations of cells called clones; this condition is known as tumour heterogeneity. Tumour heterogeneity has clinical relevance, in that targeted therapies do not have a uniform effect on the tumour. Incidentally, cancer can develop resistance as tumour cells accrue mutations that alter genetic pathways and therapy targets[52]. Hence, tumour heterogeneity introduces a great deal of complexity when developing effective therapies[9]. Sequencing technologies have the potential to uncover tumour heterogeneity, monitor clonal fluctuation and identify the emergence of clinical resistance[3] (Section 1.3). Furthermore, genetic mutations such as Copy Number Variations (CNVs) (Section 1.4.1) and Single Nucleotide Variants (SNVs) (Section 1.4.2) can be used to characterize tumour heterogeneity.
Statistical models have been successfully applied to Next-Generation Sequencing (NGS) data to analyze tumour heterogeneity[42][23][53][58][49].

1.2 Liquid Biopsies

Tissue biopsy is the removal of tissue from a patient to provide a representative specimen for interpretation and analysis[62]. Derived from the tumour itself, tissue biopsy is the gold standard source of data in cancer research[28]. However, a tissue biopsy cannot portray cancer holistically. Tissue biopsies suffer from three major drawbacks: first, sampling bias is present when extracting tissue; second, spatial heterogeneity limits their representative scope; and lastly, temporal heterogeneity is ignored as serial sampling is infeasible[59]. These limitations indicate that tissue biopsies are insufficient to comprehensively portray cancer.

Mandel and Metais first reported the existence of fragmented DNA in the non-cellular component of blood[50]. A liquid biopsy is a non-invasive sample of biological fluid containing fragmented DNA called cfDNA. cfDNA molecules are double stranded, highly fragmented and approximately 150bp in length[66]. In cancer patients, circulating tumour DNA (ctDNA) is a subset of cfDNA that originates from the primary tumour. Importantly, studies have illustrated a high concordance between ctDNA and tumour biopsies[7]. Therefore, ctDNA reveals relevant characteristics of the tumour for therapy trials. Applications of ctDNA are numerous: diagnosis and molecular profiling, tracking of therapeutic response, monitoring resistance, studying tumour heterogeneity, detecting Minimal Residual Disease (MRD) and early cancer detection[7]. However, ctDNA can be present in meager proportions, impeding its utility in clinical contexts.

Presently, there are several statistical methods for analyzing ctDNA. ichorCNA[1] and LiquidCNA[37] use copy number (CN) to track and quantify clonal evolution and prevalence, whereas Kang et al.[29] and Li et al.[39] propose methods that leverage ctDNA methylation patterns.
Instead of focusing exclusively on ctDNA, Zviran et al.[73] introduce an integrated bulk analysis of ctDNA and solid tissue to address low ctDNA abundance. However, bulk approaches are inadequate to resolve minor clonal populations due to sequencing error rates[21] and fail to address clonal CNVs at low tumour cellularity[15][70]. Considering these limitations, we employ Direct Library Preparation Plus (DLP+) (Section 1.3.2), delivering single-cell resolution for precise deconvolution of both major and minor clonal populations and resolving clonal CNVs in low tumour burden settings. In this thesis, ctDNA is used to estimate the tumour fraction (the proportion of reads from cancer cells in cfDNA), which acts as a proxy for the tumour burden (the total number of malignant cells in a patient).

1.3 Sequencing Technologies

1.3.1 Single-Cell Sequencing

NGS allows for the simultaneous sequencing of millions of different DNA molecules. NGS has substantially increased accessibility and speed of sequencing, allowing researchers to uncover genetic alterations that underlie the pathogenesis of cancer at an unprecedented scale[45]. Furthermore, DNA sequencing has progressed in precision and throughput, allowing for sequencing of entire genomes of individual cells[57]; this methodology is referred to as single-cell sequencing. Single-cell sequencing can reveal genomic variability among individual cells, delivering novel insights into tumour heterogeneity. In particular, it enables identification of minor clonal populations and reconstruction of clonal phylogenies. However, single-cell sequencing methods typically rely on genome amplification, which leads to uneven coverage and allelic dropout[71][46]. We tackle this issue by employing a unique library preparation method, DLP+, to mitigate these biases.

1.3.2 Direct Library Preparation+

DLP+ is a scalable single-cell library preparation method without preamplification[38].
DLP+ distinguishes itself by performing shallow sequencing of thousands of cells rather than deep sequencing of few cells. In DLP+, object recognition is used to assess cell state, quality and doublets. A tagmentation step appends unique oligonucleotide barcodes to exposed DNA in each well for mapping reads back to their respective cells. Then, rounds of Polymerase Chain Reaction (PCR) are performed on individual wells. DLP+ identifies clonal populations by clustering cells based on their CN profiles. Briefly, UMAP[41] is applied to normalized raw copy number data from HMMcopy[36]. Next, the reduced data is clustered and cells in each cluster are merged to produce clone-level pseudo-bulk genomes. In this project, however, clones are inferred using a phylogenetic method[54] and tree cutting[18][55][56]. Ultimately, LiquidBayes is agnostic to the method for inferring clones, as it only requires clone-specific CN profiles (Section 2.1). Figure 1.1 gives a high level overview of the DLP+ pipeline.

1.4 Cancer Genomics

1.4.1 Copy Number Variation

CNVs are gains or deletions of genomic segments and account for a substantial proportion of human genetic variation[68]. Customarily, a CNV has been defined to be “a segment of DNA that is 1kb or larger and is present at a variable copy number in comparison with a reference genome”[16]. CNVs are prevalent in cancer and can elucidate causative biological mechanisms and impart prognostic insights for patients[61]. Additionally, CN profiles are an effective way of summarizing genome-wide CNVs. A CN profile is a set of integers representing the CN at each bin, where bins are non-overlapping segments of the genome with fixed length.

Figure 1.1: Overview of the DLP+ library preparation methodology. A: First, 1000s of single cells are isolated into wells on a chip, imaged, then lysed. B: DNA undergoes tagmentation, heat incubation and rounds of spinning.
C: Genetic material from wells is pooled together in preparation for sequencing. (Figure taken from [38])

Statistical methods that leverage CN information for interrogating tumour heterogeneity have displayed fruitful results[23][58][42][17][49].

1.4.2 Single Nucleotide Variant

An SNV is the substitution of a single nucleotide at a specific genomic position[72]. This substitution can alter amino acid synthesis, disrupt protein function, and ultimately cause disease. In cancer, variants may confer selective advantages to a subpopulation of cells, promoting tumour growth[64]. These variants are called ‘driver mutations’. An active area of research examines driver mutations for the purpose of prognosis evaluation and treatment response[63]. Variant calling, the process whereby genomic positions that harbor mutations are discerned, is a common component in bioinformatics pipelines. Numerous variant callers have been developed, each with its own strengths and weaknesses[69].

1.5 Probabilistic Models

1.5.1 Bayesian Networks

Bayesian Networks (BNs) (or graphical models) are Directed Acyclic Graphs (DAGs) representing probability distributions[32]. Nodes are associated with statistical distributions (e.g. Gaussian) and edges express interactions between nodes. BNs visualize the structure of a probabilistic model by defining conditional independence properties[5]. Each node has a Conditional Probability Distribution (CPD), which is a function of its parents. In this way, the nodes in a BN are linked through probabilistic associations. Typically, BNs are used in conjunction with Bayesian inference, facilitating model design and inference method selection.

1.5.2 Forward Sampling

Forward sampling is a method for sampling a BN. Nodes are sampled in an order such that upon sampling a node, values for all its parents exist[31]. In this setting, we sample each node using its CPD. The set of samples for all nodes is called a particle.
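As an illustration of forward sampling, the following minimal Python sketch draws particles from a small three-node network and uses them to estimate a marginal probability. The network, its node names and all probabilities are hypothetical and chosen only for illustration; they are not part of LiquidBayes.

```python
import random

# Hypothetical network: rain -> sprinkler, (rain, sprinkler) -> wet.
# All probabilities below are made up for illustration.
P_RAIN = 0.2
P_SPRINKLER_GIVEN_RAIN = {True: 0.01, False: 0.4}
P_WET_GIVEN_PARENTS = {
    (True, True): 0.99,
    (True, False): 0.8,
    (False, True): 0.9,
    (False, False): 0.0,
}


def sample_particle(rng):
    """Draw one particle, sampling each node only after its parents."""
    rain = rng.random() < P_RAIN
    sprinkler = rng.random() < P_SPRINKLER_GIVEN_RAIN[rain]
    wet = rng.random() < P_WET_GIVEN_PARENTS[(rain, sprinkler)]
    return {"rain": rain, "sprinkler": sprinkler, "wet": wet}


def estimate_probability(particles, predicate):
    """Monte Carlo estimate: fraction of particles satisfying the predicate."""
    return sum(predicate(p) for p in particles) / len(particles)


rng = random.Random(0)
particles = [sample_particle(rng) for _ in range(20000)]
p_wet = estimate_probability(particles, lambda p: p["wet"])
```

For this toy network the exact marginal is 0.2 × 0.8019 + 0.8 × 0.36 = 0.44838, so `p_wet` should land close to that value.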
Particles can then be used to estimate expectations and probabilities.

1.5.3 Markov Chain Monte Carlo

A common difficulty in Bayesian inference is the intractable normalization constant of the posterior distribution. Given a prior $p(\theta)$ and likelihood $p(x \mid \theta)$ over observations $x$, the posterior distribution is defined as

\[
p(\theta \mid x) = \frac{p(\theta)\,p(x \mid \theta)}{\int p(x \mid \theta)\,p(\theta)\,d\theta}. \tag{1.1}
\]

Monte Carlo simulation draws samples from a non-standard target distribution $p(\phi)$ defined on a high-dimensional space. Empirical point estimates computed from these samples can be used to approximate the target distribution[2]. Specifically, if we have $N$ samples $\phi^{(1)}, \ldots, \phi^{(N)}$,

\[
p_N(\phi) = \frac{1}{N}\sum_{i=1}^{N} \phi^{(i)} \tag{1.2}
\]

specifies an estimate of the mean of the posterior distribution. Markov Chain Monte Carlo (MCMC) is a way to sample from complicated probability distributions in high dimensions. MCMC techniques such as Metropolis-Hastings[24][43] and Gibbs Sampling[20][19] are particularly useful in bypassing the intractable integral in Bayesian inference.

1.5.4 Bayesian Coverage

The Bayesian coverage is the proportion of runs whose Highest Density Interval (HDI) contains the true parameter value. Hyndman[27] defines the HDI (or Highest Density Region (HDR)) in the following manner. Let $f(x)$ be the density function of a random variable $X$. Then the $100(1-\alpha)\%$ HDR is the subset $R(f_\alpha)$ of the sample space of $X$ such that

\[
R(f_\alpha) = \{x : f(x) \geq f_\alpha\}, \tag{1.3}
\]

where $f_\alpha$ is the largest constant such that $\Pr(X \in R(f_\alpha)) \geq 1-\alpha$.

Chapter 2

Methods

2.1 LiquidBayes

2.1.1 Models

LiquidBayes was implemented as a BN and offered two models: a base model and an extended model.

2.1.1.1 Base Model

The base model leveraged clone-specific CN profiles to enhance tumour fraction and clonal prevalence estimates. A general schematic and its graphical model are shown in Figure 2.1. $c_{ij}$ represented bin-specific clonal CN values and $\rho$ was a vector characterizing clonal prevalences.
We modeled $y_i$, the log2 binned read counts corrected for GC content and mappability, using a Student’s t-distribution with mean parameter $\mu_i$ and scale parameter $\tau$. $\mu_i$ was the log-transform of the linear combination of bin-specific clonal CN values and clonal prevalences, divided by the sample ploidy estimate. $\tau$ was modeled as an Inverse-gamma(3,1).

Figure 2.1: LiquidBayes’ base model. (a) General schematic for LiquidBayes’ base model. Inputs include clone-specific CN profiles from DLP+ and binned read counts from Whole Genome Sequencing (WGS) of ctDNA. The model outputs clonal prevalences and normal fraction estimates. (b) Graphical model for LiquidBayes’ base model. $k$ is the number of clones and $n$ is the number of bins. Shaded nodes depict observed variables and unshaded nodes depict unobserved random variables.

2.1.1.2 Extended Model

The extended model incorporated SNVs alongside clone-specific CN profiles. A general schematic and its graphical model are shown in Figure 2.2. The portion related to clone-specific CN profiles was the same as the base model (Section 2.1.1.1). Let $l$ be an SNV site from the input Variant Call Format (VCF) file, $f'_{lj}$ be the number of mutant reads at site $l$ for clone $j$ and $f_{lj}$ be the total number of reads at site $l$ for clone $j$. We estimated the number of mutant copies at site $l$ for clone $j$ ($m_{lj}$) by multiplying the Variant Allele Frequency (VAF) at the site ($f'_{lj}/f_{lj}$) by the CN value of the bin containing the site. For a specific site $l$, we modeled the number of mutant reads in the ctDNA sample ($b_l$) using a BinomialLogits($d_l$, $\xi_l$) distribution (Section 2.1.4), where $d_l$ was the total number of reads at site $l$ and $\xi_l = \sum_j \rho_j m_{lj}$. SNVs were treated as biallelic.

2.1.2 Preprocessing

2.1.2.1 Read Counts

LiquidBayes included a preprocessing pipeline to correct and normalize raw read counts. Binned read counts were extracted from the ctDNA bam using readCounter from hmmcopy_utils[35].
GC and mappability bias corrections were applied using correctReadcount from HMMcopy[36]; GC and mappability wig files were generated using gcCounter and mapCounter from hmmcopy_utils[35]. The copy column of the resultant dataframe contained normalized and corrected binned read counts. The hg19 reference genome was used in our experiments. Figure 2.3 displays the effect of correction and normalization at different tumour fractions.

Figure 2.2: LiquidBayes’ extended model. (a) General schematic for LiquidBayes’ extended model. Inputs include clone-specific CN profiles and SNVs from DLP+ and binned read counts from WGS of ctDNA. The model outputs clonal prevalences and normal fraction estimates. (b) Graphical model for LiquidBayes’ extended model. $k$ is the number of clones, $n$ is the number of bins and $L$ is the number of SNV sites in the prior tissue biopsy. BinLogits refers to the BinomialLogits distribution in Section 2.1.4. Shaded nodes depict observed variables and unshaded nodes depict unobserved random variables.

Figure 2.3: Raw and corrected read counts at different tumour fraction levels. Left: Raw read counts from readCounter. Right: Read counts after correctReadcount was applied.

2.1.2.2 Copy-Number Profiles

We performed no additional preprocessing steps for CN profiles.

2.1.2.3 Single Nucleotide Variants

SNV preprocessing only applied to the extended model. First, we obtained reference and alternate allele counts from all clone and ctDNA bams at SNV positions extracted from the input VCF file. Next, we filtered out sites that were not contained in any CN profile bin or had zero total counts. Then, we computed clone-specific VAFs at each site and multiplied them by the corresponding bin-specific clonal CN value to produce clone-specific mutant copy estimates. Finally, we constructed an $L \times (2+K)$ ndarray, where $L$ was the number of SNV sites and $K$ was the number of clones.
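The SNV preprocessing steps can be sketched in NumPy as follows. This is an illustrative sketch rather than the LiquidBayes source: the function name `build_snv_matrix`, its argument layout, the use of NaN to mark a site that falls outside every CN bin, and the rule that a site is dropped when the ctDNA total or any clone total is zero are all assumptions. The column layout (ctDNA reference and alternate counts followed by K clonal mutant copy estimates) follows the description in this section.

```python
import numpy as np


def build_snv_matrix(ctdna_ref, ctdna_alt, clone_ref, clone_alt, clone_cn):
    """Assemble an L x (2 + K) array of ctDNA counts and mutant copy estimates.

    ctdna_ref, ctdna_alt : shape (L,)   ctDNA reference / alternate counts
    clone_ref, clone_alt : shape (L, K) per-clone reference / alternate counts
    clone_cn             : shape (L, K) CN of the bin containing each site,
                           NaN if the site is in no bin (assumed convention)
    """
    ctdna_ref = np.asarray(ctdna_ref, dtype=float)
    ctdna_alt = np.asarray(ctdna_alt, dtype=float)
    clone_ref = np.asarray(clone_ref, dtype=float)
    clone_alt = np.asarray(clone_alt, dtype=float)
    clone_cn = np.asarray(clone_cn, dtype=float)

    clone_total = clone_ref + clone_alt
    # Assumed filter: keep sites inside a CN bin with non-zero totals everywhere.
    keep = (
        ~np.isnan(clone_cn).any(axis=1)
        & ((ctdna_ref + ctdna_alt) > 0)
        & (clone_total > 0).all(axis=1)
    )

    vaf = clone_alt[keep] / clone_total[keep]   # clone-specific VAFs
    mutant_copies = vaf * clone_cn[keep]        # VAF x bin-specific clonal CN
    return np.column_stack([ctdna_ref[keep], ctdna_alt[keep], mutant_copies])


# Toy input with L = 3 sites and K = 2 clones (made-up numbers).
snv = build_snv_matrix(
    ctdna_ref=[90, 50, 0],
    ctdna_alt=[10, 5, 0],
    clone_ref=[[8, 10], [5, 5], [4, 4]],
    clone_alt=[[2, 0], [5, 5], [1, 1]],
    clone_cn=[[2, 3], [np.nan, np.nan], [2, 2]],
)
```

In the toy input only the first site survives filtering: the second lies outside every bin and the third has no ctDNA reads.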
The first two columns were the reference and alternate allele counts from the ctDNA sample and the remaining $K$ columns contained the clonal mutant copy estimates.

2.1.3 Implementation

LiquidBayes was implemented using the NumPyro (v.0.10.1) Probabilistic Programming Language (PPL)[51]. We ran inference with numpyro.infer.MCMC using the numpyro.infer.NUTS kernel[26]. For numpyro.infer.MCMC, we set num_samples=10000 and num_warmup=500. For numpyro.infer.NUTS, we set target_accept_prob=.95. All other parameters for both were set to their default values. Two NumPyro model functions were constructed, one for each LiquidBayes model (base and extended). LiquidBayes was wrapped as a command line interface program using the click Python package. Source code can be found at https://github.com/Roth-Lab/LiquidBayes.

2.1.4 BinomialLogits

The BinomialLogits distribution is a non-standard probability distribution with support $\{0,1,2,\ldots,n\}$ defined in the NumPyro[51] distributions library. If $X \sim \text{BinomialLogits}(n,\xi)$, where $n \in \{0,1,2,\ldots\}$ and $\xi \in \mathbb{R}$, we have the following:

$$
p(X = x) = \frac{e^{x\xi}}{\Gamma(x+1)\,\Gamma(n-x+1)\,\phi}, \qquad
\phi = \frac{e^{n(\xi \perp 0)}\,\left(e^{-|\xi|}+1\right)^{n}}{\Gamma(n+1)}, \qquad
a \perp b := \max(a,b), \tag{2.1}
$$

and moments,

$$
E[X] = n\,g(\xi), \qquad
\mathrm{Var}[X] = n\,g(\xi)\,(1-g(\xi)), \qquad
g(x) = \frac{1}{1+e^{-x}}. \tag{2.2}
$$

2.2 Synthetic Datasets

2.2.1 Simulating Copy Number Profiles

Synthetic CN profiles were simulated using a Hidden Markov Model (HMM). We imposed nine distinct states, which corresponded to CN states 0-8. Each state had a self-loop and positive transition probabilities for two steps in either direction (e.g. state 3 can transition to one of 5 states: 1,2,3,4,5). We initialized the first CN state to 2. Given the current state, we sampled from a Multinomial(1, p), where p = [.005, .01, .97, .01, .005], using numpy.random.multinomial. Next, we indexed the vector [−2,−1,0,1,2] using the sampled value and added the indexed value to the current state to get the next state. The emission distribution was the identity. The upper and lower bounds for the CN state were 8 and 0, respectively.
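The transition process described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the thesis code: the function name `simulate_cn_profiles`, its arguments and the use of a seeded `numpy.random.Generator` are assumptions made for the example.

```python
import numpy as np


def simulate_cn_profiles(num_clones, num_bins=5000, seed=0):
    """Simulate independent CN profiles with the random walk described above.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    p = [0.005, 0.01, 0.97, 0.01, 0.005]      # transition probabilities
    events = np.array([-2, -1, 0, 1, 2])      # change in CN state per step
    profiles = np.empty((num_clones, num_bins), dtype=int)
    for i in range(num_clones):
        # One-hot multinomial draws; argmax recovers the sampled index.
        draws = rng.multinomial(1, p, size=num_bins - 1)
        steps = events[draws.argmax(axis=1)]
        state = 2                              # first bin starts at CN 2
        profiles[i, 0] = state
        for j, step in enumerate(steps, start=1):
            # Clamp the state to the allowed range [0, 8].
            state = int(np.clip(state + step, 0, 8))
            profiles[i, j] = state
    return profiles


profiles = simulate_cn_profiles(num_clones=3, num_bins=200, seed=1)
```

Because the emission distribution is the identity, the sampled states are themselves the CN values, so no separate emission step is needed.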
Each synthetic CN profile contained 5000 bins and we treated CN profiles as independent. Algorithm 1 presents pseudocode for this process. Figure 2.4 depicts simulated CN profiles for 3 clones.

For experiments involving missing clones, we removed either the clone with the largest proportion or the clone with the smallest proportion. We accomplished this by deleting the appropriate CN profile from the original set of CN profiles and writing the updated set of CN profiles to a new file.

Algorithm 1: Simulate Copy-Number Profiles
    p ← [.005, .01, .97, .01, .005]      // transition probabilities
    events ← [−2, −1, 0, 1, 2]           // events
    cn_profiles ← []
    for i in range(k) do
        cn_profile ← [2]
        for j in range(1, n) do          // n = number of bins
            idx ← Multi(1, p)
            state ← events[idx] + cn_profile[j−1]
            // force values to be in [0, 8]
            if state > 8 then
                state ← 8
            else if state < 0 then
                state ← 0
            end
            cn_profile.append(state)
        end
        cn_profiles.append(cn_profile)
    end

Figure 2.4: Synthetic CN profiles for three clones (A,B,C) and normal.

2.2.2 Simulating Read Counts

Synthetic read counts were generated using a modified forward sampling (Section 1.5.2) procedure. We computed $\hat{c}_i$, which was a measure of the expected proportion of total reads in bin $i$. The expected number of reads for a given coverage was determined by multiplying the size of the human genome by the desired coverage and dividing by the average read length. We set the size of the human genome to be $3 \times 10^9$ and the average read length to be 135. Then, we sampled from Multinomial(reads, $\hat{c}$) using numpy.random.multinomial to get our final dataset. Algorithm 2 describes the read count generation procedure. Figure 2.5 shows scatterplots of forward simulated read counts at various tumour fractions.

Algorithm 2: Simulate Read Counts
    Input: $\rho \leftarrow [\rho_1, \rho_2, \ldots, \rho_k]$, $c \leftarrow$ cn_profiles $[n \times (k+1)]$
    $\bar{c} = \left(\sum_{j=1}^{k} \rho_j c_{1j},\ \sum_{j=1}^{k} \rho_j c_{2j},\ \ldots,\ \sum_{j=1}^{k} \rho_j c_{nj}\right)$
    $\hat{c} = \frac{1}{\sum_{l=1}^{n} \bar{c}_l} (\bar{c}_1, \bar{c}_2, \ldots, \bar{c}_n)$
    $\text{reads} = \frac{\text{cov} \times 3 \times 10^9}{135}$
    $y \sim \text{Multi}(\text{reads}, \hat{c})$

Figure 2.5: Synthetic read counts.

2.2.3 Simulating Allelic Counts

Synthetic allelic counts were produced by first sampling from a Beta(2,2) to simulate VAFs for each site/clone pair ($f'_{lj}/f_{lj}$, where $l \in \{1,2,\ldots,250\}$ and $j \in \{1,2,\ldots,k\}$, $k$ being the number of clones). Then, we computed $m_{lj} = \frac{f'_{lj}}{f_{lj}} c_{qj}$, where $q$ was the index of the bin containing site $l$ and $c_{qj}$ was generated by the process described in Section 2.2.1. We sampled $d_l \sim \text{Poisson}(\text{cov})$ and $b_l \sim \text{BinLogits}(d_l, \xi_l)$, where $\xi_l = \sum_j \rho_j m_{lj}$. Algorithm 3 details the simulation methodology and Equation 2.3 offers a summary of the equations and distributions used for this task.

$$
\begin{aligned}
\frac{f'_{lj}}{f_{lj}} &\sim \text{Beta}(2,2) \\
m_{lj} &= \frac{f'_{lj}}{f_{lj}}\, c_{qj} \\
\xi_l &= \sum_j \rho_j m_{lj} \\
d_l &\sim \text{Poisson}(\text{cov}) \\
b_l &\sim \text{BinLogits}(d_l, \xi_l)
\end{aligned}
\tag{2.3}
$$

Algorithm 3: Simulate Allelic Counts
    Input: $\rho \leftarrow [\rho_1, \rho_2, \ldots, \rho_k]$, $c \leftarrow$ cn_profiles $[n \times (k+1)]$
    s ← 250
    m ← [s × k]
    ξ ← [s × 1]
    d ← [s × 1]
    b ← [s × 1]
    for l ← 0 to s−1 do
        for j ← 0 to k−1 do
            f ∼ Beta(2,2)
            q ← bin index containing site l
            m[l][j] ← f × c[q][j]
        end
        ξ[l] ← $\sum_{j=1}^{k} \rho_j m_{lj}$
        d[l] ∼ Poisson(cov)
        b[l] ∼ BinLogits(d[l], ξ[l])
    end

2.3 Semi-realistic Datasets

2.3.1 Lymphoma Dataset

Single-cell Lymphoma patient data from the Terry Fox Research Institute (TFRI) was obtained using DLP+[38][56]. Figure 2.6 shows a heat map of the CN profiles for this dataset. Two time points corresponding to Follicular Lymphoma (FL) and Diffuse Large B-Cell Lymphoma (DLBCL) were identified (samples TFRIPAIR4_FL and TFRIPAIR4_DLBCL, respectively). In our experiments, we did not distinguish between time points (see Section 4.2 for potential improvements). Given the output from DLP+, CN profiles of single cells were estimated using HMMCopy[36] and clone populations were inferred using Sitka[54] and tree cutting[18][55][56]. Single-cell bam files were merged according to their clone membership using samtools merge[11]. To obtain the CN profile of a clone, we took the mean CN value for each bin over all cells assigned to that clone.
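The per-clone averaging step can be illustrated with a short NumPy sketch. The function name `clone_cn_profiles`, the array layout (cells as rows, bins as columns) and the toy values are assumptions for illustration only; they are not taken from the LiquidBayes source or the TFRI data.

```python
import numpy as np


def clone_cn_profiles(cell_cn, clone_labels):
    """Average single-cell CN values per bin within each clone.

    cell_cn: array-like of shape (cells, bins) with per-cell CN values.
    clone_labels: sequence of length `cells` giving each cell's clone.
    Hypothetical helper; the layout is an assumption for this example.
    """
    cell_cn = np.asarray(cell_cn, dtype=float)
    clone_labels = np.asarray(clone_labels)
    return {
        str(clone): cell_cn[clone_labels == clone].mean(axis=0)
        for clone in np.unique(clone_labels)
    }


# Toy input: three cells, three bins, two clones.
toy_profiles = clone_cn_profiles(
    [[2, 2, 3], [2, 4, 3], [1, 1, 1]],
    ["A", "A", "B"],
)
```

Note that a mean over cells can yield non-integer CN values for a clone, which is compatible with the linear combination of CN values and prevalences used in the model.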
Six distinct clone populations (A-F) were inferred and tissue from a healthy patient was sequenced as a matched normal. Data is available upon request.

[Figure 2.6 heat map. Columns: chromosomes 1-22, X, Y. Row annotations: Clone (A (777), B (59), C (141), D (42), E (81), F (62)) and Sample (TFRIPAIR4_DLBCL, TFRIPAIR4_FL). Colour scale: Copy Number, 0 to 11+.]
Figure 2.6: Heat map of inferred CN profiles for Lymphoma patient data. Rows correspond to single cells. Tree on leftmost side illustrates the phylogeny of single cells.

2.3.2 Semi-realistic Data Simulation

We downsampled clone-level and matched normal bam files using DownsampleSam from GATK4[12] to imitate ctDNA. For downsampling clone-level bam files, we altered DownsampleSam’s -P parameter depending on the desired tumour fraction and clonal prevalences. Specifically, we computed -P using the following equation:

$$
P_j = \frac{r\,\rho_j}{n_j}, \tag{2.4}
$$

where $P_j$ was the value of -P in DownsampleSam for clone $j$, $r$ was the target number of reads in the final dataset, $\rho_j$ was the clonal prevalence for clone $j$ and $n_j$ was the total number of reads in the clone-level bam for clone $j$. The value of $\rho$ depended on the number of clones in the dataset and is documented in Table A.1. For downsampling the matched normal bam, we calculated:

$$
P_{norm} = \frac{r\,(1 - tf)}{n_{norm}}, \tag{2.5}
$$

where $P_{norm}$ was the value of -P in DownsampleSam for the matched normal and $n_{norm}$ was the total number of reads in the matched normal bam file. Finally, we merged downsampled clone and normal bams to manufacture an in-silico mixture of reads from distinct clonal genomes. We changed the number of clones by including or excluding clone-level bam files during downsampling and merging. We altered $tf$ to obtain semi-realistic datasets with different tumour fractions. Semi-realistic datasets of varying read depths were created by adjusting $r$ in Equation 2.4.

2.3.3 Limitations

Here we note limitations to our semi-realistic data simulation methodology.
Nucleosome protection and accessibility[13] and fragmentation patterns[8] in ctDNA have proven to be powerful biomarkers. Nucleosome footprints have been used to infer the cell types contributing cfDNA in cancer [60] and a machine learning model has successfully applied ctDNA fragmentation patterns to predict tissue of origin in cancer [8]. Currently, our semi-realistic data simulation (Section 2.3) ignores nucleosome occupancy and fragmentation patterns, potentially introducing unwanted biases to our semi-realistic datasets.

2.4 Benchmarking

We benchmarked LiquidBayes’ performance against two state-of-the-art (SOTA) methods: ichorCNA[1] and MRDetectSNV[73]. We evaluated LiquidBayes’ performance at a variety of tumour fractions, read depths and numbers of clones. For each arrangement, we simulated ten datasets and applied each inference method to all datasets. L1 losses were calculated by differencing tumour fraction estimates from the ground truth and then taking the absolute value. To compare relative performance across tumour fractions, L1 losses were divided by the true tumour fraction and log(x+1) transformed; we refer to the result as the Relative L1 loss. Finally, we generated plots to visualize and compare accuracy across methods. The entire benchmarking pipeline was implemented using Snakemake[44] and is available upon request.

2.4.1 LiquidBayes

We used LiquidBayes v.0.12 for benchmarking. The input consisted of a simulated ctDNA bam, GC and mappability wig files generated using gcCounter and mapCounter, respectively, from hmmcopy_utils[35], a matrix outlining clone-specific CN profiles, the model type (base or extended), 10000 inference samples, 500 warmup samples and a unique integer to control NumPyro’s random seed. For the extended model, we also provided clonal bams and a VCF file (Section 2.4.4). LiquidBayes returned samples from the posterior distribution for the specified model type. To obtain tumour fraction point estimates, we computed the mean over normal fraction estimates and took the complement.
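The point estimate and the two loss functions defined in Section 2.4 can be sketched as follows. The function names are hypothetical and the input is assumed to be a one-dimensional array of posterior normal fraction samples; the natural logarithm is assumed because a zero estimate then gives log(2) ≈ 0.6931, the value reported for MRDetectSNV at 1e-3x in Table 3.7.

```python
import numpy as np


def tumour_fraction_estimate(normal_fraction_samples):
    """Complement of the posterior mean normal fraction (hypothetical helper)."""
    return 1.0 - float(np.mean(normal_fraction_samples))


def l1_loss(estimate, truth):
    """Absolute difference between the estimate and the ground truth."""
    return abs(estimate - truth)


def relative_l1_loss(estimate, truth):
    """log(L1 / tumour fraction + 1), using the natural logarithm (assumed)."""
    return float(np.log(l1_loss(estimate, truth) / truth + 1.0))


# Toy posterior samples of the normal fraction (illustrative values only).
tf_hat = tumour_fraction_estimate([0.9, 0.8, 1.0])
loss_when_zero = relative_l1_loss(0.0, 0.1)
```

Dividing by the true tumour fraction puts errors at very different tumour fractions on a comparable scale, while the log(x+1) transform keeps the loss at zero for a perfect estimate.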
Similarly, we computed the mean over clonal prevalence estimates to get clonal prevalence point estimates.

2.4.2 ichorCNA

We used ichorCNA v.0.3.2 [1] and followed the supplied Snakefile for proper execution. ichorCNA only required a plasma bam file and did not make use of a matched tissue biopsy. First, we quantified read counts in the plasma bam using hmmcopy_utils readCounter[35] with parameters binSize: 500000, qual: 20 and chrs: 1,2,...,22. Next, we executed runIchorCNA.R with parameters chrs: 1,2,...,22, ichorCNA_chrs: c(1:22), ichorCNA_gcWig: gc_hg19_500kb.wig and ichorCNA_mapWig: map_hg19_500kb.wig; wig files were located in inst/extdata/ provided by ichorCNA. Tumour fraction estimates were acquired directly from the output file {id}.params.txt.

2.4.3 MRDetectSNV

No modifications were made to the MRDetect software. MRDetect required two inputs: a plasma bam and a VCF file (Section 2.4.4). As stated in [73], tumour fraction was computed using the equation,

$$
TF = 1 - \left(1 - \frac{M - \mu R}{N}\right)^{1/\text{cov}}, \tag{2.6}
$$

where $TF$ denoted the tumour fraction, $M$ denoted the number of SNVs from the plasma bam, $N$ denoted the number of SNVs in the patient-specific tumour biopsy, $R$ denoted the total number of reads covering the patient-specific tumour biopsy, cov denoted the local coverage in sites with a tumour-specific SNV and $\mu$ denoted the noise rate (#errors / #reads). $R$, $M$ and $N$ were reads_checked, sites_detected and sites_checked, respectively, from MRDetect’s output file. $\mu$ was obtained by running MRDetect on a matched normal sample and extracting detection_rate. cov was determined using pysam.depth (https://github.com/pysam-developers/pysam) with the -a flag on the plasma bam and computing the mean depth.

2.4.4 Variant Calling

For LiquidBayes and MRDetectSNV, the pseudo-bulk bam file constituted the prior tissue biopsy. Relevant clone-level bams for the experiment of interest were merged using samtools merge, producing a pseudo-bulk bam file.
We performed somatic variant calling using Strelka2 (v.2.9.10)[30]. First, we configured the workflow by executing configureStrelkaSomaticWorkflow.py, using a matched normal and the GRCh37-lite reference genome (https://www.bcgsc.ca/downloads/genomes/9606/hg19/1000genomes/bwa_ind/genome/README.GRCh37-lite). In accordance with best practices[30], small indel candidates were discovered using Manta[6] with the same sample, matched normal and reference genome. Finally, we ran the runWorkflow.py script generated during the configuration step. Strelka2 reported all variant predictions in VCF 4.1[10].

Chapter 3
Results

3.1 Synthetic Experiments

We evaluated LiquidBayes’ base and extended models by analyzing inference outputs on synthetic datasets with varied tumour fractions, read depths and numbers of clones (Section 2.2). We used the L1 and Relative L1 losses to evaluate accuracy.

3.1.1 Tumour Fraction

Tumour burden denotes the total number of malignant cells in a cancer patient. Tumour fraction is the proportion of cfDNA reads contributed by cancer cells, and can be used as a proxy measure for tumour burden. MRD is a major cause of relapse in post-treatment cancer patients. Therefore, tracking tumour burden in patients can aid in MRD discovery and improve prognosis. Although liquid biopsies facilitate serial sampling, which is vital for relapse surveillance, the tumour fraction in a blood draw can fall to proportions as low as 1e-5. Accordingly, statistical methods have been developed to estimate the tumour fraction in liquid biopsies, but they have struggled when tumour fractions fell below 1e-5.

We evaluated LiquidBayes at successively smaller tumour fractions (tf = {.5, .3, 1e-1, 1e-3, 1e-5, 1e-7}). In general, we observed that the error increased as the tumour fraction decreased (Figure 3.1a, Table 3.1). Both variants of LiquidBayes performed comparably and gave accurate predictions at tumour fractions greater than 1e-3.
LiquidBayes distinguished itself from ichorCNA and MRDetectSNV by inferring the prevalence of individual clonal populations in addition to the overall tumour fraction. In our experiments, LiquidBayes returned accurate clonal prevalence estimates for tumour fractions greater than 1e-5 (Figure 3.1b, Table 3.2). See Figure A.1 and Table A.2 for results on the extended model.

(a) Boxplots of the Relative L1 of tumour fraction estimates for six tumour fraction levels for the base and extended models. Synthetic datasets had 5 clones and a 1x read depth.
(b) Boxplots of the Relative L1 of tumour and clone fraction estimates for six tumour fraction levels using the base model. Simulated datasets had a read depth of 1x and four clones.
Figure 3.1: Boxplots summarizing results for tumour fraction synthetic experiments. Ten datasets were created for each tumour fraction setting.

| Model | Tumour fraction | Mean | SD | Relative L1 | L1 |
|---|---|---|---|---|---|
| Base | .5 | 0.500640 | 0.000566 | 0.001490 | 0.000746 |
| Base | .3 | 0.300262 | 0.000582 | 0.001725 | 0.000518 |
| Base | 1e-1 | 0.100056 | 0.000600 | 0.003901 | 0.000392 |
| Base | 1e-3 | 0.000326 | 0.000158 | 0.510813 | 0.000674 |
| Base | 1e-5 | 0.000308 | 0.000144 | 3.228974 | 0.000298 |
| Base | 1e-7 | 0.000289 | 0.000178 | 7.470750 | 0.000289 |
| Extended | .5 | 0.500625 | 0.000566 | 0.001461 | 0.000731 |
| Extended | .3 | 0.300415 | 0.000393 | 0.001671 | 0.000502 |
| Extended | 1e-1 | 0.100037 | 0.000612 | 0.004114 | 0.000413 |
| Extended | 1e-3 | 0.000291 | 0.000138 | 0.533003 | 0.000709 |
| Extended | 1e-5 | 0.000288 | 0.000180 | 3.129769 | 0.000278 |
| Extended | 1e-7 | 0.000342 | 0.000125 | 8.092337 | 0.000342 |

Table 3.1: Mean/standard deviation of tumour fraction estimates and average L1/Relative L1 values across ten replicates for synthetic experiments on the base and extended models at six tumour fraction levels. Relative L1 = log(L1 / tumour fraction + 1).

We plotted MCMC samples for a single replicate at each tumour fraction (Figure 3.2). For both models, the HDIs for tumour fractions greater than 1e-3 contained the true value, whereas the HDIs at lower tumour fractions did not.
The posterior plots for tumour fractions .5, .3 and 1e-1 were unimodal, whereas the posterior plots for tumour fractions 1e-3, 1e-5 and 1e-7 were multimodal. We speculated that multimodality arose in low tumour fractions from highly correlated CN profiles among clones, allowing for multiple clone proportion estimates with equal likelihoods.

We visualized the distribution of HDI widths across all replicates at each tumour fraction level (Figure 3.3a). We observed that the width decreased along with tumour fraction. Moreover, there was no perceptible change in HDI widths at tumour fractions under 1e-3. Figure 3.3b illustrates the Bayesian coverage (Section 1.5.4) at each tumour fraction. LiquidBayes had high coverages at large tumour fractions (.5, .3, 1e-1) and zero coverages at small tumour fractions (1e-3, 1e-5, 1e-7).

| Tumour fraction | Clone | Estimates | Truth | L1 |
|---|---|---|---|---|
| .5 | A | 0.380606 | 0.380529 | 0.000569 |
| .5 | B | 0.003269 | 0.003414 | 0.000233 |
| .5 | C | 0.080311 | 0.080324 | 0.000199 |
| .5 | D | 0.035879 | 0.035733 | 0.000267 |
| .5 | normal | 0.499936 | 0.500000 | 0.000761 |
| .3 | A | 0.228200 | 0.228317 | 0.000220 |
| .3 | B | 0.002047 | 0.002048 | 0.000289 |
| .3 | C | 0.048231 | 0.048195 | 0.000174 |
| .3 | D | 0.021522 | 0.021440 | 0.000241 |
| .3 | normal | 0.699999 | 0.700000 | 0.000198 |
| 1e-1 | A | 0.076061 | 0.076106 | 0.000309 |
| 1e-1 | B | 0.000786 | 0.000683 | 0.000317 |
| 1e-1 | C | 0.016114 | 0.016065 | 0.000155 |
| 1e-1 | D | 0.007131 | 0.007147 | 0.000186 |
| 1e-1 | normal | 0.899908 | 0.900000 | 0.000288 |
| 1e-3 | A | 0.000048 | 0.000761 | 0.000713 |
| 1e-3 | B | 0.000059 | 0.000007 | 0.000052 |
| 1e-3 | C | 0.000052 | 0.000161 | 0.000108 |
| 1e-3 | D | 0.000052 | 0.000071 | 0.000028 |
| 1e-3 | normal | 0.999788 | 0.999000 | 0.000788 |
| 1e-5 | A | 3.4e-5 | 7.61e-06 | 0.000028 |
| 1e-5 | B | 4.3e-5 | 6.83e-08 | 0.000043 |
| 1e-5 | C | 3.8e-5 | 1.61e-06 | 0.000036 |
| 1e-5 | D | 3.9e-5 | 7.15e-07 | 0.000039 |
| 1e-5 | normal | 0.999846 | 0.999990 | 0.000144 |
| 1e-7 | A | 3.8e-5 | 7.61e-08 | 0.000038 |
| 1e-7 | B | 4.7e-5 | 6.83e-10 | 0.000047 |
| 1e-7 | C | 4.0e-5 | 1.61e-08 | 0.000040 |
| 1e-7 | D | 4.1e-5 | 7.15e-09 | 0.000041 |
| 1e-7 | normal | 0.999834 | 0.9999999 | 0.000166 |

Table 3.2: Mean/standard deviation of clonal fraction estimates and average L1 values across ten replicates for synthetic experiments on the base model at six tumour fraction levels.

(a) Base model.
(b) Extended model.
Figure 3.2: Posterior plots of tumour fraction estimates from synthetic experiments using both models at six tumour fraction levels. Dark black lines indicate the HDI.

(a) Boxplots and superimposed strip plots to visualize the distribution of 94% HDI widths across ten replicates for the base and extended models at six tumour fractions.
(b) Bar plot depicting the Bayesian coverage across ten replicates at six tumour fractions.
Figure 3.3: Posterior statistics for synthetic experiments on tumour fractions. (a) HDI widths. (b) Bayesian coverage.

3.1.2 Read Depth

Read depth is defined as the number of times individual bases have been sequenced. Oftentimes, it is helpful to examine the average read depth, as it provides a rough measure of the signal-to-noise ratio. Herein, we use read depth to reference average read depth. Naturally, higher read depths are preferable, but also more costly. For a statistical model, it is advantageous to ascertain read depth’s influence on accuracy and to identify a range of read depths in which reliable results can be obtained. In doing so, one can optimize between cost and accuracy depending on resource accessibility.

We analyzed LiquidBayes’ performance on synthetic datasets of diverse read depths. Overall, estimates were better and less variable at higher read depths (Figure 3.4, Table 3.3). We also tested LiquidBayes at sequentially smaller tumour fractions to determine when read depth no longer affected accuracy. At tumour fractions 1e-1 and 1e-2, read depths larger than 1e-3x produced acceptable estimates. For tumour fraction = 5e-3, we observed subpar solutions at read depths smaller than .5x, with estimates being over 40% off of the ground truth. Eventually, at tumour fraction = 1e-3, we observed that read depth had no effect on accuracy. Numeric results for the extended model can be found in Table A.3.

We plotted MCMC samples from experiments on each read depth for tumour fractions 5e-3 and 1e-3 (Figure 3.5).
We chose these tumour fractions because they marked a shift in the impact of read depth on accuracy. At tumour fraction = 5e-3, posterior plots for all read depths were unimodal (Figure 3.5a). Moreover, the HDI for all read depths except 1e-2x and 1x contained the true tumour fraction. At tumour fraction = 1e-3, we saw multimodality at all read depths except 1e-1x and 10x (Figure 3.5b). Furthermore, only the HDIs for read depths 1e-1x and 10x contained the true tumour fraction. MCMC plots from the extended model can be found in Figure A.2.

| Tumour fraction | Read depth | Mean | SD | Relative L1 | L1 |
|---|---|---|---|---|---|
| 1e-1 | 1e-3 | 0.087906 | 0.010132 | 0.112380 | 0.012274 |
| 1e-1 | 1e-2 | 0.095469 | 0.005211 | 0.057149 | 0.005928 |
| 1e-1 | 1e-1 | 0.098613 | 0.003510 | 0.030304 | 0.003093 |
| 1e-1 | .5 | 0.099996 | 0.001004 | 0.008259 | 0.000830 |
| 1e-1 | 1 | 0.099989 | 0.000948 | 0.006689 | 0.000673 |
| 1e-1 | 10 | 0.100380 | 0.000327 | 0.003789 | 0.000380 |
| 1e-1 | 100 | 0.100338 | 0.000194 | 0.003425 | 0.000343 |
| 1e-2 | 1e-3 | 0.020376 | 0.004697 | 0.688400 | 0.010376 |
| 1e-2 | 1e-2 | 0.012362 | 0.003053 | 0.253836 | 0.003062 |
| 1e-2 | 1e-1 | 0.010260 | 0.001544 | 0.120614 | 0.001305 |
| 1e-2 | .5 | 0.010466 | 0.001260 | 0.089021 | 0.000962 |
| 1e-2 | 1 | 0.009140 | 0.002844 | 0.122506 | 0.001493 |
| 1e-2 | 10 | 0.009918 | 0.000226 | 0.020163 | 0.000204 |
| 1e-2 | 100 | 0.009991 | 0.000049 | 0.003498 | 0.000035 |
| 5e-3 | 1e-3 | 0.019065 | 0.006245 | 1.288322 | 0.014065 |
| 5e-3 | 1e-2 | 0.009443 | 0.001918 | 0.615581 | 0.004443 |
| 5e-3 | 1e-1 | 0.006896 | 0.002033 | 0.339780 | 0.002202 |
| 5e-3 | .5 | 0.005113 | 0.000958 | 0.123607 | 0.000688 |
| 5e-3 | 1 | 0.005047 | 0.000875 | 0.126902 | 0.000695 |
| 5e-3 | 10 | 0.004980 | 0.000145 | 0.020579 | 0.000105 |
| 5e-3 | 100 | 0.005004 | 0.000068 | 0.009416 | 0.000047 |
| 1e-3 | 1e-3 | 0.000251 | 0.000113 | 0.557197 | 0.000749 |
| 1e-3 | 1e-2 | 0.000185 | 0.000153 | 0.593020 | 0.000815 |
| 1e-3 | 1e-1 | 0.003063 | 0.001184 | 1.047924 | 0.002063 |
| 1e-3 | .5 | 0.000134 | 0.000145 | 0.621109 | 0.000866 |
| 1e-3 | 1 | 0.000091 | 0.000132 | 0.644174 | 0.000909 |
| 1e-3 | 10 | 0.001166 | 0.000299 | 0.236296 | 0.000279 |
| 1e-3 | 100 | 0.000194 | 0.000123 | 0.589036 | 0.000806 |

Table 3.3: Mean/standard deviation of tumour fraction estimates and average L1/Relative L1 values across ten replicates for synthetic experiments on the base model at seven read depths and four tumour fractions. Relative L1 = log(L1 / tumour fraction + 1).

Figure 3.4: Boxplots of Relative L1 in synthetic experiments for seven read depths at four tumour fraction levels. Read depth was inversely related to error. The variance of Relative L1 losses was more volatile at read depths less than .5x than at read depths greater than .5x. Simulated datasets included 3 clones and 10 replicates.

The width of the HDI decreased as the read depth increased (Figure A.3a). Coverage showed no clear relationship with read depth (Figure A.3b). However, we observed a downward trend in coverage across all read depths as tumour fraction decreased.

(a) Tumour fraction = 5e-3.
(b) Tumour fraction = 1e-3.
Figure 3.5: Posterior plots of tumour fraction estimates from synthetic experiments using the base model at six read depths for tumour fractions 5e-3 and 1e-3. Bold black lines indicate the HDI.

3.1.3 Number of Clones

Tumour heterogeneity is coincident with multiple clonal populations. Moreover, the number of clones in the tumour can fluctuate over time. Therefore, it is valuable to produce accurate estimates notwithstanding the number of clones.

We assessed how the number of clones affected LiquidBayes’ performance using synthetic datasets containing between two and six clones. For the most part, error declined as the number of clones increased (Figure 3.6, Table 3.4). This trend was more pronounced at tumour fraction = 1e-1, but slightly ambiguous for tumour fraction = .5. Furthermore, the variability of estimates decreased as the number of clones increased. We did not observe any significant difference in performance between the base and extended models.

Figure 3.6: Boxplots of Relative L1 values from synthetic experiments for two to six clones at two tumour fractions. Simulated datasets had a read depth of 1x and contained ten replicates.

Posterior plots of MCMC samples from the base model (Figure 3.7) and extended model (Figure A.4) were generated for these experiments.
Posterior means were extremely close to the true tumour fraction for all clone quantities and all posterior plots were unimodal.

There was no relationship between the number of clones and the HDI width (Figure 3.8a). However, HDI widths were smaller and more concentrated at tumour fraction = 1e-1. The coverage was generally higher for larger numbers of clones at both tumour fraction settings (Figure 3.8b).

(a) Tumour fraction = .5.
(b) Tumour fraction = 1e-1.
Figure 3.7: Posterior plots of tumour fraction estimates from synthetic experiments using the base model for three to six clones at two tumour fractions. Bold black lines indicate the HDI.

| Model | Tumour fraction | Num clones | Mean | SD | Relative L1 | L1 |
|---|---|---|---|---|---|---|
| Base | .5 | 2 | 0.50194 | 0.00435 | 0.00730 | 0.00367 |
| Base | .5 | 3 | 0.50595 | 0.00240 | 0.01181 | 0.00595 |
| Base | .5 | 4 | 0.50006 | 0.00095 | 0.00152 | 0.00076 |
| Base | .5 | 5 | 0.50064 | 0.00057 | 0.00149 | 0.00075 |
| Base | .5 | 6 | 0.49963 | 0.00082 | 0.00141 | 0.00071 |
| Base | 1e-1 | 2 | 0.10126 | 0.00271 | 0.02331 | 0.00237 |
| Base | 1e-1 | 3 | 0.09999 | 0.00095 | 0.00669 | 0.00067 |
| Base | 1e-1 | 4 | 0.10009 | 0.00035 | 0.00288 | 0.00029 |
| Base | 1e-1 | 5 | 0.10006 | 0.00060 | 0.00390 | 0.00039 |
| Base | 1e-1 | 6 | 0.10027 | 0.00042 | 0.00403 | 0.00040 |
| Extended | .5 | 2 | 0.50245 | 0.00460 | 0.00736 | 0.00371 |
| Extended | .5 | 3 | 0.50554 | 0.00227 | 0.01102 | 0.00554 |
| Extended | .5 | 4 | 0.49981 | 0.00109 | 0.00177 | 0.00088 |
| Extended | .5 | 5 | 0.50063 | 0.00057 | 0.00146 | 0.00073 |
| Extended | .5 | 6 | 0.49962 | 0.00082 | 0.00143 | 0.00072 |
| Extended | 1e-1 | 2 | 0.10135 | 0.00241 | 0.02151 | 0.00219 |
| Extended | 1e-1 | 3 | 0.10007 | 0.00097 | 0.00645 | 0.00065 |
| Extended | 1e-1 | 4 | 0.10009 | 0.00040 | 0.00347 | 0.00035 |
| Extended | 1e-1 | 5 | 0.10004 | 0.00061 | 0.00411 | 0.00041 |
| Extended | 1e-1 | 6 | 0.10027 | 0.00042 | 0.00404 | 0.00040 |

Table 3.4: Mean/standard deviation of tumour fraction estimates and average L1/Relative L1 values across ten replicates for synthetic experiments at five numbers of clones and two tumour fractions. Relative L1 = log(L1 / tumour fraction + 1).

3.1.4 Missing Clones

Novel clones can emerge over time according to clonal evolution. Consequently, these novel clones may not be covered by the set of CN profiles derived from a prior tissue biopsy. We recognized this to be a potential shortcoming of LiquidBayes, so we tested its robustness to missing clones.

We mimicked instances where the prior tissue biopsy did not fully characterize all clone populations by removing the CN profile of the smallest or largest clone prior to inference (Section 2.2.1). Table 3.5 documents the proportion values of removed clones. The error increased moderately when we removed the smallest clone, but increased considerably when we removed the largest clone (Figure 3.9). This was likely because the simulated CN profiles were independent (Section 2.2.1): with no correlation between CN profiles, a great deal of information was lost by removing one. When there were four or more clones, removing the smallest clone only marginally affected performance. Removing the largest clone had a large effect when the dataset contained only a small number of clones. See Figure A.5 and Figure A.6 for MCMC plots from the base model. See Figure A.7 and Figure A.8 for MCMC plots from the extended model.

(a) Boxplots and superimposed strip plots to visualize the distribution of 94% HDI widths across ten replicates for two to six clones at two tumour fractions.
(b) Bar plots depicting the Bayesian coverage across ten replicates for two to six clones at two tumour fractions.
Figure 3.8: Posterior distribution statistics for synthetic experiments on different numbers of clones. (a) HDI widths. (b) Bayesian coverage.

There was no significant change in the distribution of HDI widths from omitting the smallest clone (Figure 3.10a). In contrast, HDI widths increased considerably when the largest clone was removed, especially at tumour fraction = .5. Figure 3.10b shows that removing a clone led to a decrease in coverage. When the smallest clone was removed, the coverage was still greater than zero at some quantities of clones.
However, when the largest clone was removed, the coverage was zero throughout.

Figure 3.9: Boxplots with superimposed strip plots of Relative L1 values for two to six clones displaying the effect of removing the smallest or largest clone. Table 3.5 documents the proportions of the removed clones. Synthetic datasets had a read depth of 1x.

| Number of Clones | Tumour fraction | Smallest | Largest |
|---|---|---|---|
| 2 | .5 | 0.049050 | 0.450950 |
| 2 | 1e-1 | 0.009810 | 0.090190 |
| 3 | .5 | 0.058240 | 0.369840 |
| 3 | 1e-1 | 0.011650 | 0.073970 |
| 4 | .5 | 0.003410 | 0.380530 |
| 4 | 1e-1 | 0.000680 | 0.076110 |
| 5 | .5 | 0.004210 | 0.421630 |
| 5 | 1e-1 | 0.000840 | 0.084330 |
| 6 | .5 | 0.005260 | 0.339960 |
| 6 | 1e-1 | 0.001050 | 0.067990 |

Table 3.5: Proportions of the smallest and largest clones that were removed in synthetic experiments at different numbers of clones and tumour fractions.

(a) Boxplots and superimposed strip plots to visualize the distribution of 94% HDI widths across ten replicates for two to six clones (includes the missing clone) at tumour fractions .5 and 1e-1.
(b) Bar plot depicting the Bayesian coverage across ten replicates for two to six clones (includes the missing clone) at two tumour fractions.
Figure 3.10: Posterior distribution statistics for synthetic experiments on the effects of a missing clone. (a) HDI widths. (b) Bayesian coverage.

Figure 3.11: Summary of results from LiquidBayes, ichorCNA and MRDetectSNV for tumour fraction experiments on semi-realistic datasets. Datasets had three clones and a read depth of 1x.

3.2 Semi-realistic Experiments

We executed experiments on idealized mixtures of clonal populations (semi-realistic datasets) built from DLP+ sequencing of single cells from a Lymphoma patient (Section 2.3). These experiments compared LiquidBayes to two SOTA methods for tumour fraction estimation using liquid biopsies: ichorCNA and MRDetectSNV. Similar to Section 3.1, we used the L1 loss and Relative L1 loss to measure the quality of estimates.

3.2.1 Tumour Fraction

We applied LiquidBayes, ichorCNA and MRDetectSNV on semi-realistic datasets with varied tumour fractions.
LiquidBayes’ base and extended models significantly outperformed both ichorCNA and MRDetectSNV at all tumour fractions (Figure 3.11, Table 3.6). In addition, the extended model gave better estimates than the base model for all tumour fractions.

| Model | Tumour fraction | Mean | SD | Relative L1 | L1 |
|---|---|---|---|---|---|
| Base | .5 | 0.663208 | 0.007561 | 0.282422 | 0.163208 |
| Base | .3 | 0.383843 | 0.015048 | 0.245732 | 0.083843 |
| Base | 1e-1 | 0.131260 | 0.006562 | 0.270887 | 0.031260 |
| Base | 1e-3 | 0.013071 | 0.002996 | 2.542357 | 0.012071 |
| Base | 1e-5 | 0.012402 | 0.004721 | 7.052994 | 0.012392 |
| Base | 1e-7 | 0.012726 | 0.003582 | 11.715263 | 0.012726 |
| Extended | .5 | 0.532218 | 0.006430 | 0.062379 | 0.032218 |
| Extended | .3 | 0.273644 | 0.082625 | 0.082660 | 0.032199 |
| Extended | 1e-1 | 0.070117 | 0.022109 | 0.250598 | 0.029883 |
| Extended | 1e-3 | 0.001322 | 0.000528 | 0.303214 | 0.000412 |
| Extended | 1e-5 | 0.001311 | 0.000471 | 4.819289 | 0.001301 |
| Extended | 1e-7 | 0.001312 | 0.000649 | 9.343209 | 0.001312 |
| ichorCNA | .5 | 0.893600 | 0.001669 | 0.580649 | 0.393600 |
| ichorCNA | .3 | 0.936984 | 0.001078 | 1.138883 | 0.636984 |
| ichorCNA | 1e-1 | 0.949707 | 0.007314 | 2.250956 | 0.849707 |
| ichorCNA | 1e-3 | 0.952829 | 0.006165 | 6.859417 | 0.951829 |
| ichorCNA | 1e-5 | 0.955883 | 0.006628 | 11.467784 | 0.955873 |
| ichorCNA | 1e-7 | 0.958093 | 0.016018 | 16.075162 | 0.958093 |
| MRDetectSNV | .5 | 0.914455 | 0.000583 | 0.603720 | 0.414455 |
| MRDetectSNV | .3 | 0.934819 | 0.001449 | 1.136569 | 0.634819 |
| MRDetectSNV | 1e-1 | 0.954658 | 0.000985 | 2.256182 | 0.854658 |
| MRDetectSNV | 1e-3 | 0.963231 | 0.001488 | 6.870293 | 0.962231 |
| MRDetectSNV | 1e-5 | 0.963465 | 0.001018 | 11.475706 | 0.963455 |
| MRDetectSNV | 1e-7 | 0.963797 | 0.000461 | 16.081221 | 0.963797 |

Table 3.6: Mean/standard deviation of tumour fraction estimates and average L1/Relative L1 values across ten replicates for LiquidBayes’ base and extended models, ichorCNA and MRDetectSNV in semi-realistic experiments at six tumour fraction levels. Relative L1 = log(L1 / tumour fraction + 1).

To visualize uncertainty, we plotted MCMC samples from LiquidBayes’ base and extended models for a single replicate at each tumour fraction (Figure 3.12). Posterior means from the extended model were much closer to the ground truth than posterior means from the base model.
Moreover, the 94% HDI of the extended model for tumour fractions .3 and 1e-3 contained the true value.

(a) Base model.
(b) Extended model.
Figure 3.12: Posterior plots of tumour fraction estimates from semi-realistic experiments using both models at six tumour fraction levels. Dark black lines indicate the HDI.

There was a steady decline in HDI widths as the tumour fraction decreased (Figure 3.13a). There were also a few outliers for both variants at tumour fractions .3 and 1e-1. For the base model, the coverage was non-zero only at tumour fractions 1e-1 and 1e-3 (Figure 3.13b). For the extended model, we saw high coverage values at tumour fractions .3 and 1e-3 and zero elsewhere. We were unsure why the coverage spiked at these tumour fractions.

(a) Boxplots and superimposed strip plots to visualize the distribution of 94% HDI widths across ten replicates for the base and extended models at six tumour fractions.
(b) Bar plot depicting the Bayesian coverage across ten replicates at six tumour fractions.
Figure 3.13: Posterior distribution statistics for semi-realistic experiments on tumour fraction. (a) HDI widths. (b) Bayesian coverage.

We also explored LiquidBayes’ estimates for individual clone proportions. Overall, we observed high Relative L1 values for lower tumour fractions (Figure 3.14). The extended model gave better estimates for clone A than the base model at all tumour fractions. For clones B and C, the extended model performed better than the base model at low tumour fractions (1e-3, 1e-5, 1e-7); otherwise, both variants were comparable in accuracy. See Table A.4 and Table A.5 for numeric results from the base and extended models, respectively.

3.2.2 Read Depth

Here we evaluated read depth’s impact on model accuracy. At read depths higher than 1e-3x, LiquidBayes gave better results than ichorCNA and MRDetectSNV (Figure 3.15). Moreover, LiquidBayes’ estimates improved with higher read depths. In contrast, ichorCNA and MRDetectSNV did not profit from higher read depths.
In general, the extended model performed better than the base model, especially at a read depth of 1x. Incidentally, LiquidBayes’ base and extended models had L1 values below 5e-2 at all tumour fractions when read depth was 1x, whereas ichorCNA and MRDetectSNV’s L1 values were significantly higher (Table 3.7). Surprisingly, MRDetectSNV had low L1 values when the read depth was 1e-3x. Upon inspection, we discovered that MRDetectSNV returned a tumour fraction of zero, thus invalidating these results. This highlighted a weakness of the Relative L1 loss metric, in that it wrongly favored methods that predicted 0 for low tumour fractions: a prediction of zero gives an L1 equal to the tumour fraction, and hence a Relative L1 of log 2 ≈ 0.6931 regardless of the tumour fraction.

| Model | Tumour fraction | Read depth (x) | Mean | SD | Relative L1 | L1 |
|---|---|---|---|---|---|---|
| Base | 1e-1 | 1 | 0.1313 | 0.0066 | 0.2709 | 0.0313 |
| Base | 1e-1 | 1e-3 | 0.4068 | 0.0564 | 1.3952 | 0.3068 |
| Base | 1e-2 | 1 | 0.0215 | 0.0049 | 0.7386 | 0.0115 |
| Base | 1e-2 | 1e-3 | 0.3977 | 0.0252 | 3.6813 | 0.3877 |
| Base | 5e-3 | 1 | 0.0153 | 0.0045 | 1.0654 | 0.0103 |
| Base | 5e-3 | 1e-3 | 0.4336 | 0.0801 | 4.4485 | 0.4286 |
| Base | 1e-3 | 1 | 0.0131 | 0.0030 | 2.5424 | 0.0121 |
| Base | 1e-3 | 1e-3 | 0.4140 | 0.0487 | 6.0199 | 0.4130 |
| Extended | 1e-1 | 1 | 0.1319 | 0.0083 | 0.2747 | 0.0319 |
| Extended | 1e-1 | 1e-3 | 0.4406 | 0.1162 | 1.4581 | 0.3406 |
| Extended | 1e-2 | 1 | 0.0223 | 0.0052 | 0.7727 | 0.0123 |
| Extended | 1e-2 | 1e-3 | 0.4161 | 0.2260 | 3.5138 | 0.4061 |
| Extended | 5e-3 | 1 | 0.0139 | 0.0053 | 1.0419 | 0.0097 |
| Extended | 5e-3 | 1e-3 | 0.4019 | 0.0343 | 4.3836 | 0.3969 |
| Extended | 1e-3 | 1 | 0.0119 | 0.0036 | 2.4359 | 0.0109 |
| Extended | 1e-3 | 1e-3 | 0.4032 | 0.0303 | 5.9969 | 0.4022 |
| ichorCNA | 1e-1 | 1 | 0.9497 | 0.0073 | 2.2510 | 0.8497 |
| ichorCNA | 1e-1 | 1e-3 | 0.2974 | 0.2974 | 0.7608 | 0.2071 |
| ichorCNA | 1e-2 | 1 | 0.9612 | 0.0154 | 4.5654 | 0.9512 |
| ichorCNA | 1e-2 | 1e-3 | 0.5510 | 0.2753 | 3.8491 | 0.5410 |
| ichorCNA | 5e-3 | 1 | 0.9566 | 0.0038 | 5.2539 | 0.9516 |
| ichorCNA | 5e-3 | 1e-3 | 0.5098 | 0.2775 | 4.4168 | 0.5048 |
| ichorCNA | 1e-3 | 1 | 0.9528 | 0.0062 | 6.8594 | 0.9518 |
| ichorCNA | 1e-3 | 1e-3 | 0.4465 | 0.3026 | 5.8140 | 0.4454 |
| MRDetectSNV | 1e-1 | 1 | 0.9547 | 0.0010 | 2.2562 | 0.8547 |
| MRDetectSNV | 1e-1 | 1e-3 | 0.0000 | 0.0000 | 0.6931 | 0.1000 |
| MRDetectSNV | 1e-2 | 1 | 0.9626 | 0.0020 | 4.5670 | 0.9526 |
| MRDetectSNV | 1e-2 | 1e-3 | 0.0000 | 0.0000 | 0.6931 | 0.0100 |
| MRDetectSNV | 5e-3 | 1 | 0.9633 | 0.0005 | 5.2609 | 0.9583 |
| MRDetectSNV | 5e-3 | 1e-3 | 0.0000 | 0.0000 | 0.6931 | 0.0050 |
| MRDetectSNV | 1e-3 | 1 | 0.9632 | 0.0015 | 6.8703 | 0.9622 |
| MRDetectSNV | 1e-3 | 1e-3 | 0.0000 | 0.0000 | 0.6931 | 0.0010 |

Table 3.7: Mean/standard deviation of tumour fraction estimates and average L1/Relative L1 values across ten replicates for LiquidBayes’ base and extended models, ichorCNA and MRDetectSNV in semi-realistic experiments at two read depths and four tumour fraction levels. Relative L1 = log(L1 / tumour fraction + 1).

Figure 3.14: Boxplots of Relative L1 values of tumour and clone fraction estimates on semi-realistic datasets for six tumour fraction levels using both models. Datasets had three clones A, B and C at proportions .8, .15 and .05, respectively, and a read depth of 1x.

We plotted MCMC samples from a single run for the base model (Figure A.9) and extended model (Figure A.10) at read depths 1x and 1e-1x. We saw that the posterior means for read depth 1x were closer to the ground truth than for read depth 1e-1x. For both variants, very few 94% HDIs captured the true value. See Figure A.11a and Figure A.11b for summaries of the posterior distributions.

Figure 3.15: Boxplots and superimposed strip plots of Relative L1 values for three read depths at four tumour fraction levels in semi-realistic experiments. Datasets included three clones and ten replicates.

3.2.3 Number of Clones

To determine how the number of clones affected accuracy, we executed each model on datasets with different numbers of clones. All models were robust to the number of clones, in that differing clone quantities did not diminish accuracy (Figure 3.16). Both variants of LiquidBayes outperformed ichorCNA and MRDetectSNV, and the extended model gave better estimates than the base model for all clone quantities except 2. LiquidBayes’ estimates marginally improved as the number of clones increased, whereas the number of clones did not seem to affect ichorCNA’s (except at tumour fraction = .5 with five clones) or MRDetectSNV’s outputs. Corresponding L1 and Relative L1 values are documented in Table A.6.

Figure A.12 and Figure A.13 display MCMC plots from the base and extended models, respectively. For the base model, only the 94% HDI at tumour fraction = 1e-1 and 3 clones contained the true parameter value.
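Statements about whether a 94% HDI contains the true value, and the Bayesian coverage derived from them, can be reproduced from raw MCMC draws. The sketch below is illustrative only: it assumes each replicate's posterior samples are available as a plain list of floats and that the posterior is unimodal, and the function names are hypothetical rather than taken from LiquidBayes.

```python
import math


def hdi(samples, mass=0.94):
    """Narrowest interval holding `mass` of the samples (unimodal case)."""
    xs = sorted(samples)
    n = len(xs)
    k = min(n, max(1, math.ceil(mass * n)))  # samples inside the interval
    # Slide a window of k consecutive sorted samples and keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def hdi_width(samples, mass=0.94):
    """Width of the highest density interval."""
    low, high = hdi(samples, mass)
    return high - low


def coverage(replicate_samples, truth, mass=0.94):
    """Fraction of replicates whose HDI contains the true value."""
    hits = 0
    for samples in replicate_samples:
        low, high = hdi(samples, mass)
        hits += low <= truth <= high
    return hits / len(replicate_samples)
```

With ten replicates per setting, `coverage` can only take values in steps of 0.1, so abrupt jumps between neighbouring settings are less surprising than they would be with many replicates.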
For the extended model, no 94% HDI contained the true parameter value at tumour fraction = .5, whereas every 94% HDI contained the ground truth at tumour fraction = 1e-1. All distributions for both variants were unimodal and symmetric.

HDI widths were relatively stable, besides some variability in the extended model, mostly at 2 or 3 clones (Figure A.14a). Coverage was zero throughout, except for a sharp spike for the extended model at tumour fraction = 1e-1 for 4 and 5 clones (Figure A.14b). The reason for this was unclear.

Figure 3.16: Boxplots and superimposed strip plots of Relative L1 values for two to five clones at two tumour fraction levels in semi-realistic experiments. Datasets had a read depth of 1x and ten replicates.

3.2.4 Missing Clone

We sought to test LiquidBayes’ behavior when the tissue biopsy was missing a clone. Our design was identical to Section 3.1.4 for removing clones. For these experiments, the inputs to LiquidBayes and MRDetectSNV were different, in that LiquidBayes was missing prior information about one clone, whereas MRDetectSNV received a bulk tumour sample containing all clones (see Section 4.2). Both variants of LiquidBayes outperformed ichorCNA and MRDetectSNV when the smallest clone was removed (Figure 3.17). With the largest clone removed, ichorCNA and MRDetectSNV were competitive with LiquidBayes at tumour fraction = .5, but underperformed at tumour fraction = 1e-1. Overall, the extended model gave inferior estimates to the base model at small clone quantities (2 or 3 clones). Furthermore, when the largest clone was removed, the extended model performed worse than the base model for all clone quantities, unlike in the other experiments described in this section (Section 3.2).
We believe this was because the extended model lost data from two sources, CN profiles and allelic counts, when a clone was removed (as opposed to only one for the base model). Figure A.15 plots posterior statistics for these experiments.

Figure 3.17: Boxplots and superimposed strip plots of Relative L1 values for two to five clones comparing the effect of removing the smallest or largest clone in semi-realistic experiments. Table 3.8 documents the proportions of the removed clones. Datasets had a read depth of 1x.

| Number of clones | Tumour fraction | Smallest | Largest |
|---|---|---|---|
| 2 | .5 | 0.1 | 0.4 |
| 2 | 1e-1 | 0.02 | 0.08 |
| 3 | .5 | 0.025 | 0.4 |
| 3 | 1e-1 | 0.005 | 0.08 |
| 4 | .5 | 0.025 | 0.35 |
| 4 | 1e-1 | 0.005 | 0.07 |
| 5 | .5 | 0.025 | 0.3 |
| 5 | 1e-1 | 0.005 | 0.06 |

Table 3.8: Proportions of the smallest and largest clones that were removed in semi-realistic experiments at different numbers of clones and tumour fractions.

Chapter 4

Conclusion

4.1 Discussion

Monitoring treatment efficacy can improve patient prognosis by detecting resistance and identifying MRD. Traditional tissue biopsies are invasive, thus prohibiting serial sampling. In contrast, liquid biopsies are minimally invasive blood draws, making monitoring of post-treatment patient recovery feasible. Specifically, given a liquid biopsy, we can estimate the tumour fraction, which acts as a surrogate measure for the tumour burden. In so doing, we can track the development of cancer via liquid biopsies. We present LiquidBayes, a BN which integrates clone-specific CN profiles from single-cell sequencing of the primary tumour in subsequent ctDNA analysis. We also propose a model extension where SNVs are included alongside CN profiles.

Post-treatment cancer patients can exhibit very low tumour burdens, and often experience relapse later on. We showed that LiquidBayes was able to infer the tumour fraction at levels as low as 1e-7 more accurately than the current SOTA (Section 3.2.1).
Furthermore, LiquidBayes deconvolved individual clone proportions, providing insight into the clone-specific mechanisms of resistance in patients. In particular, by assessing fluctuations in clonal prevalences, clinicians can identify resistant clones, thus informing treatment decisions and enhancing patient survival.

Read depth acts as a proxy for the signal-to-noise ratio, but higher read depths correspond to higher costs. In our experiments, LiquidBayes required a read depth of at least 1e-1x to outperform ichorCNA and MRDetectSNV (Section 3.2.2). Additionally, LiquidBayes’ performance improved with higher read depths, whereas ichorCNA and MRDetectSNV’s performance did not.

Tumour heterogeneity brings about multiple clonal populations in the tumour. As a result, the number of clonal populations across patients can differ considerably. LiquidBayes’ estimates were robust to the number of clones, enabling it to generalize to cancers with different degrees of heterogeneity. Cancer is also a dynamic process, giving rise to novel clonal populations over time. Hence, the clonal population structure in the initial tissue biopsy may not reflect the tumour at a future time point. Our experiments showed that LiquidBayes was robust when the missing clone was small, but unstable when the missing clone was large (Section 3.2.4). However, phylogenetic relationships among clones were ignored when considering which clone to remove. In particular, clone A was usually removed, despite it being the progenitor clone. This potentially diminished LiquidBayes’ accuracy, as the CN profile of clone A was likely descriptive of its descendants. We expect to see a performance boost if clones from an earlier time point are conserved instead.
Furthermore, this approach better reflects the properties of a true initial tissue biopsy.

The dataset used in our semi-realistic experiments (Section 2.3.1) carried multiple clonal populations with diverse CN profiles, possibly giving LiquidBayes an advantage over ichorCNA and MRDetectSNV. Moreover, we pooled both time points in our semi-realistic experiments, giving LiquidBayes access to the CN profiles for all clones (rather than only a subset), which is unlikely from an initial tissue biopsy. However, we believe that LiquidBayes will perform well with only the CN profiles of clones at an earlier time point, because mutations in progenitor clones propagate to their descendants.

Our synthetic experiments did not reveal differences in performance between the base and extended models (Section 3.1). We hypothesize that this was due to incorrect modeling assumptions for simulating CN profiles and allelic counts. Specifically, we treated CN profiles as independent when it is more accurate to relate them through a phylogeny. With independent CN profiles, resolving clonal prevalences (and tumour fraction) is relatively straightforward, making knowledge of SNVs superfluous. When simulating allelic counts, we drew VAFs from a Beta(2,2) distribution, when in reality they should concentrate on fractions of the form a/b, where a, b ∈ ℕ. This unrealistic modeling assumption further handicapped the extended model.

On the other hand, in our semi-realistic experiments, the extended model generally produced better estimates than the base model (Section 3.2). Exceptions to this were at low read depths (1e-3x, 1e-1x) and small numbers of clones (2 or 3). The discrepancy between our synthetic and semi-realistic experiments can be attributed to the datasets that were used. As previously mentioned, the synthetic datasets had inaccurate modeling assumptions for CN profiles and allelic counts, whereas the semi-realistic datasets came from real patient data and therefore reflected both correctly.
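The contrast between the two VAF assumptions discussed above can be made concrete with a short sketch. The Beta(2,2) sampler mirrors the synthetic setup described in the text; the discrete sampler is a hypothetical alternative in which a mutation sits on a of b copies of a locus, with a and b drawn uniformly purely for illustration. Neither function is taken from the LiquidBayes code.

```python
import random
from fractions import Fraction


def sample_vaf_beta(rng: random.Random) -> float:
    """Continuous VAF drawn from Beta(2, 2), as in the synthetic experiments."""
    return rng.betavariate(2, 2)


def sample_vaf_discrete(rng: random.Random, max_copies: int = 4) -> Fraction:
    """VAF of the form a/b: a mutated copies out of b total copies.

    The uniform choice of a and b, and the cap of four copies, are
    illustrative assumptions only.
    """
    b = rng.randint(1, max_copies)  # total copy number at the locus
    a = rng.randint(1, b)           # copies carrying the mutation
    return Fraction(a, b)


rng = random.Random(0)
beta_vafs = [sample_vaf_beta(rng) for _ in range(1000)]
discrete_vafs = [sample_vaf_discrete(rng) for _ in range(1000)]

# The discrete sampler only ever produces a handful of distinct values,
# whereas the Beta sampler spreads mass over the whole unit interval.
print(sorted(set(discrete_vafs)))
```

Under the discrete assumption the simulated VAFs cluster on a few rational values, which is the structure the text argues a realistic simulation should have.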
Tumour fraction inference is more difficult when clones are related through a phylogeny, due to correlated mutational landscapes among clones. Therefore, consolidating CNVs and SNVs to form a more diverse feature set for the probabilistic model is beneficial.

Moreover, it is reasonable to assume that clonal populations follow a perfect (no site mutates more than once) and persistent (mutations do not disappear or revert) phylogeny. Accordingly, possessing the mutational landscape of ancestral clones from an initial tissue biopsy is advantageous for tumour fraction estimation in subsequent liquid biopsies. Looking ahead to actual matched DLP+ and ctDNA samples taken at different time points, we anticipate good results from LiquidBayes.

4.2 Future Directions

Single Nucleotide Polymorphisms (SNPs) are genome positions with two distinct alleles that appear in a significant portion of the human population [34]. SNPs comprise a major part of genetic variation among individuals [33] and explain differences in disease susceptibility. In future work, we intend to use SNPs to obtain allele-specific read counts, increasing LiquidBayes’ prediction power and sensitivity. This requires extracting SNPs from DLP+ results and adding a component to the model for handling SNPs.

We have not yet tested LiquidBayes on time-dependent datasets. Such a test can be carried out immediately on the lymphoma dataset used in our semi-realistic experiments (Section 2.3.1). The two samples, TFRIPAIR4 FL and TFRIPAIR4 DLBCL, corresponded to two stages of disease at different time points, FL and DLBCL. Conveniently, both samples came from the same patient and clones were perfectly separated by sample (clone A in TFRIPAIR4 FL and clones B, C, D, E, F in TFRIPAIR4 DLBCL).
For example, we can feed the CN profile (and VCF file for the extended model) associated with clone A as input into LiquidBayes, and generate downsampled in silico mixtures from the remaining clones to simulate ctDNA datasets.

We intend to perform experiments on more cancer types. In particular, we want to explore cancers with varying levels of aneuploidy and numbers of SNVs. Model accuracy was constrained by low read depths. Work is being done to expand benchmarking experiments by incorporating datasets with higher read depths. Moreover, we aim to execute missing clone experiments where LiquidBayes and MRDetectSNV both have a missing clone in their inputs for comparative analysis. Due to time constraints, we were not able to run any experiments on real ctDNA data. This is an important task for future investigations.

Bibliography

[1] V. A. Adalsteinsson, G. Ha, S. S. Freeman, A. D. Choudhury, D. G. Stover, H. A. Parsons, G. Gydush, S. C. Reed, D. Rotem, J. Rhoades, D. Loginov, D. Livitz, D. Rosebrock, I. Leshchiner, J. Kim, C. Stewart, M. Rosenberg, J. M. Francis, C.-Z. Zhang, O. Cohen, C. Oh, H. Ding, P. Polak, M. Lloyd, S. Mahmud, K. Helvie, M. S. Merrill, R. A. Santiago, E. P. O’Connor, S. H. Jeong, R. Leeson, R. M. Barry, J. F. Kramkowski, Z. Zhang, L. Polacek, J. G. Lohr, M. Schleicher, E. Lipscomb, A. Saltzman, N. M. Oliver, L. Marini, A. G. Waks, L. C. Harshman, S. M. Tolaney, E. M. V. Allen, E. P. Winer, N. U. Lin, M. Nakabayashi, M.-E. Taplin, C. M. Johannessen, L. A. Garraway, T. R. Golub, J. S. Boehm, N. Wagle, G. Getz, J. C. Love, and M. Meyerson. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nature Communications, 8(1), nov 2017. doi:10.1038/s41467-017-00965-y. → pages 1, 3, 20, 21

[2] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to mcmc for machine learning. Machine Learning, 50(1/2):5–43, 2003. doi:10.1023/a:1020281327116. → page 6

[3] P. L. Bedard, A. R. Hansen, M. J. Ratain, and L. L. Siu.
Tumour heterogeneity in the clinic. Nature, 501(7467):355–364, sep 2013. doi:10.1038/nature12627. → page 2

[4] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman. Pyro: Deep universal probabilistic programming, 2018.

[5] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2008. ISBN 9780387310732. → page 6

[6] X. Chen, O. Schulz-Trieglaff, R. Shaw, B. Barnes, F. Schlesinger, M. Källberg, A. J. Cox, S. Kruglyak, and C. T. Saunders. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics, 32(8):1220–1222, dec 2015. doi:10.1093/bioinformatics/btv710. → page 22

[7] R. B. Corcoran and B. A. Chabner. Application of cell-free DNA analysis to cancer treatment. New England Journal of Medicine, 379(18):1754–1765, nov 2018. doi:10.1056/nejmra1706174. → pages 1, 3

[8] S. Cristiano, A. Leal, J. Phallen, J. Fiksel, V. Adleff, D. C. Bruhm, S. Ø. Jensen, J. E. Medina, C. Hruban, J. R. White, D. N. Palsgrove, N. Niknafs, V. Anagnostou, P. Forde, J. Naidoo, K. Marrone, J. Brahmer, B. D. Woodward, H. Husain, K. L. van Rooijen, M.-B. W. Ørntoft, A. H. Madsen, C. J. H. van de Velde, M. Verheij, A. Cats, C. J. A. Punt, G. R. Vink, N. C. T. van Grieken, M. Koopman, R. J. A. Fijneman, J. S. Johansen, H. J. Nielsen, G. A. Meijer, C. L. Andersen, R. B. Scharpf, and V. E. Velculescu. Genome-wide cell-free DNA fragmentation in patients with cancer. Nature, 570(7761):385–389, may 2019. doi:10.1038/s41586-019-1272-6. → page 20

[9] I. Dagogo-Jack and A. T. Shaw. Tumour heterogeneity and resistance to cancer therapies. Nature Reviews Clinical Oncology, 15(2):81–94, nov 2017. doi:10.1038/nrclinonc.2017.166. → page 2

[10] P. Danecek, A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, and R. D. and. The variant call format and VCFtools.
Bioinformatics, 27(15):2156–2158, jun 2011. doi:10.1093/bioinformatics/btr330. → page 22

[11] P. Danecek, J. K. Bonfield, J. Liddle, J. Marshall, V. Ohan, M. O. Pollard, A. Whitwham, T. Keane, S. A. McCarthy, R. M. Davies, and H. Li. Twelve years of SAMtools and BCFtools. GigaScience, 10(2), jan 2021. doi:10.1093/gigascience/giab008. → page 18

[12] V. der Auwera GA and O. BD. Genomics in the cloud: Using docker, gatk, and wdl in terra (1st edition). O’Reilly Media, 2020. → page 18

[13] A.-L. Doebley, M. Ko, H. Liao, A. E. Cruikshank, C. Kikawa, K. Santos, J. Hiatt, R. D. Patton, N. D. Sarkar, A. C. Hoge, K. Chen, Z. T. Weber, M. Adil, J. Reichel, P. Polak, V. A. Adalsteinsson, P. S. Nelson, H. A. Parsons, D. G. Stover, D. MacPherson, and G. Ha. Griffin: Framework for clinical cancer subtyping from nucleosome profiling of cell-free DNA. sep 2021. doi:10.1101/2021.08.31.21262867. → page 20

[14] J. Donaldson and B. H. Park. Circulating tumor DNA: Measurement and clinical utility. Annual Review of Medicine, 69(1):223–234, jan 2018. doi:10.1146/annurev-med-041316-085721. → page 1

[15] P. Eirew, A. Steif, J. Khattra, G. Ha, D. Yap, H. Farahani, K. Gelmon, S. Chia, C. Mar, A. Wan, E. Laks, J. Biele, K. Shumansky, J. Rosner, A. McPherson, C. Nielsen, A. J. L. Roth, C. Lefebvre, A. Bashashati, C. de Souza, C. Siu, R. Aniba, J. Brimhall, A. Oloumi, T. Osako, A. Bruna, J. L. Sandoval, T. Algara, W. Greenwood, K. Leung, H. Cheng, H. Xue, Y. Wang, D. Lin, A. J. Mungall, R. Moore, Y. Zhao, J. Lorette, L. Nguyen, D. Huntsman, C. J. Eaves, C. Hansen, M. A. Marra, C. Caldas, S. P. Shah, and S. Aparicio. Dynamics of genomic clones in breast cancer patient xenografts at single-cell resolution. Nature, 518(7539):422–426, nov 2014. doi:10.1038/nature13952. → page 3

[16] L. Feuk, A. R. Carson, and S. W. Scherer. Structural variation in the human genome. Nature Reviews Genetics, 7(2):85–97, feb 2006. doi:10.1038/nrg1767. → page 4

[17] A. Fischer, I. Vázquez-García, C. J. Illingworth, and V.
Mustonen. High-definition reconstruction of clonal composition in cancer. Cell Reports, 7(5):1740–1752, jun 2014. doi:10.1016/j.celrep.2014.04.055. → page 5

[18] T. Funnell, C. H. O’Flanagan, M. J. Williams, A. McPherson, S. McKinney, F. Kabeer, H. Lee, S. Salehi, I. Vázquez-García, H. Shi, E. Leventhal, T. Masud, P. Eirew, D. Yap, A. W. Zhang, J. L. P. Lim, B. Wang, J. Brimhall, J. Biele, J. Ting, V. Au, M. V. Vliet, Y. F. Liu, S. Beatty, D. Lai, J. Pham, D. Grewal, D. Abrams, E. Havasov, S. Leung, V. Bojilova, R. A. Moore, N. Rusk, F. Uhlitz, N. Ceglia, A. C. Weiner, E. Zaikova, J. M. Douglas, D. Zamarin, B. Weigelt, S. H. Kim, A. D. C. Paula, J. S. Reis-Filho, S. D. Martin, Y. Li, H. Xu, T. R. de Algara, S. R. Lee, V. C. Llanos, D. G. Huntsman, J. N. McAlpine, G. J. Hannon, G. Battistoni, D. Bressan, I. G. Cannell, H. Casbolt, C. Jauset, T. Kovačević, C. M. Mulvey, F. Nugent, M. P. Ribes, I. Pearson, F. Qosaj, K. Sawicka, S. A. Wild, E. Williams, E. Laks, A. Smith, D. Lai, A. Roth, S. Balasubramanian, M. Lee, B. Bodenmiller, M. Burger, L. Kuett, S. Tietscher, J. Windhager, E. S. Boyden, S. Alon, Y. Cui, A. Emenari, D. R. Goodwin, E. D. Karagiannis, A. Sinha, A. T. Wassie, C. Caldas, A. Bruna, M. Callari, W. Greenwood, G. Lerda, Y. Eyal-Lubling, O. M. Rueda, A. Shea, O. Harris, R. Becker, F. Grimaldo, S. Harris, S. L. Vogl, J. A. Joyce, S. S. Watson, S. Tavare, K. N. Dinh, E. Fisher, R. Kunes, N. A. Walton, M. A. Sa’d, N. Chornay, A. Dariush, E. A. González-Solares, C. González-Fernández, A. K. Yoldaş, N. Miller, X. Zhuang, J. Fan, H. Lee, L. A. Sepúlveda, C. Xia, P. Zheng, S. P. Shah, and S. A. and. Single-cell genomic variation induced by mutational processes in cancer. Nature, oct 2022. doi:10.1038/s41586-022-05249-0. → pages 4, 18

[19] A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, jun 1990. doi:10.1080/01621459.1990.10476213. → page 7

[20] S. Geman and D. Geman.
Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):721–741, nov 1984. doi:10.1109/tpami.1984.4767596. → page 7

[21] M. Gerstung, C. Beisel, M. Rechsteiner, P. Wild, P. Schraml, H. Moch, and N. Beerenwinkel. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nature Communications, 3(1), jan 2012. doi:10.1038/ncomms1814. → page 3

[22] N. D. Goodman. The principles and practice of probabilistic programming. ACM SIGPLAN Notices, 48(1):399–402, jan 2013. doi:10.1145/2480359.2429117.

[23] G. Ha, A. Roth, J. Khattra, J. Ho, D. Yap, L. M. Prentice, N. Melnyk, A. McPherson, A. Bashashati, E. Laks, J. Biele, J. Ding, A. Le, J. Rosner, K. Shumansky, M. A. Marra, C. B. Gilks, D. G. Huntsman, J. N. McAlpine, S. Aparicio, and S. P. Shah. TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data. Genome Research, 24(11):1881–1893, jul 2014. doi:10.1101/gr.180281.114. → pages 2, 5

[24] W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, apr 1970. doi:10.1093/biomet/57.1.97. → page 7

[25] M. J. Higgins, D. Jelovac, E. Barnathan, B. Blair, S. Slater, P. Powers, J. Zorzi, S. C. Jeter, G. R. Oliver, J. Fetting, L. Emens, C. Riley, V. Stearns, F. Diehl, P. Angenendt, P. Huang, L. Cope, P. Argani, K. M. Murphy, K. E. Bachman, J. Greshock, A. C. Wolff, and B. H. Park. Detection of tumor pik3ca status in metastatic breast cancer using peripheral blood. Clinical Cancer Research, 18(12):3462–3469, jun 2012. doi:10.1158/1078-0432.ccr-11-2696. → page 1

[26] M. D. Hoffman and A. Gelman. The no-u-turn sampler: Adaptively setting path lengths in hamiltonian monte carlo. 2011. doi:10.48550/ARXIV.1111.4246. → page 13

[27] R. J. Hyndman. Computing and graphing highest density regions. The American Statistician, 50(2):120, may 1996. doi:10.2307/2684423. → page 7

[28] M.
Ilié and P. Hofman. Pros: Can tissue biopsy be replaced by liquid biopsy? Translational Lung Cancer Research, 5(4):420–423, aug 2016. doi:10.21037/tlcr.2016.08.06. → page 2

[29] S. Kang, Q. Li, Q. Chen, Y. Zhou, S. Park, G. Lee, B. Grimes, K. Krysan, M. Yu, W. Wang, F. Alber, F. Sun, S. M. Dubinett, W. Li, and X. J. Zhou. CancerLocator: non-invasive cancer diagnosis and tissue-of-origin prediction using methylation profiles of cell-free DNA. Genome Biology, 18(1), mar 2017. doi:10.1186/s13059-017-1191-5. → page 3

[30] S. Kim, K. Scheffler, A. L. Halpern, M. A. Bekritsky, E. Noh, M. Källberg, X. Chen, Y. Kim, D. Beyter, P. Krusche, and C. T. Saunders. Strelka2: fast and accurate calling of germline and somatic variants. Nature Methods, 15(8):591–594, jul 2018. doi:10.1038/s41592-018-0051-x. → page 22

[31] D. Koller. Probabilistic graphical models. MIT Press, 2010. ISBN 9780262013192. → page 6

[32] C. Krapu and M. Borsuk. Probabilistic programming: A review for environmental modellers. Environmental Modelling & Software, 114:40–48, apr 2019. doi:10.1016/j.envsoft.2019.01.014. → page 6

[33] L. Kruglyak and D. A. Nickerson. Variation is the spice of life. Nature Genetics, 27(3):234–236, mar 2001. doi:10.1038/85776. → page 50

[34] T. LaFramboise. Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances. Nucleic Acids Research, 37(13):4181–4193, jul 2009. doi:10.1093/nar/gkp552. → page 50

[35] D. Lai. Hmm copy utils. GitHub repository, 2011. → pages 10, 20, 21

[36] D. Lai, G. Ha, and S. Shah. Hmmcopy: Copy number prediction with correction for gc and mappability bias for hts data. 2021. R package version 1.36.0. → pages 4, 10, 17

[37] E. Lakatos, H. Hockings, M. Mossner, W. Huang, M. Lockley, and T. A. Graham. LiquidCNA: Tracking subclonal evolution from longitudinal liquid biopsies using somatic copy number alterations. iScience, 24(8):102889, aug 2021. doi:10.1016/j.isci.2021.102889. → page 3

[38] E. Laks, A. McPherson, H. Zahn, D. Lai, A.
Steif, J. Brimhall, J. Biele, B. Wang, T. Masud, J. Ting, D. Grewal, C. Nielsen, S. Leung, V. Bojilova, M. Smith, O. Golovko, S. Poon, P. Eirew, F. Kabeer, T. R. de Algara, S. R. Lee, M. J. Taghiyar, C. Huebner, J. Ngo, T. Chan, S. Vatrt-Watts, P. Walters, N. Abrar, S. Chan, M. Wiens, L. Martin, R. W. Scott, T. M. Underhill, E. Chavez, C. Steidl, D. D. Costa, Y. Ma, R. J. Coope, R. Corbett, S. Pleasance, R. Moore, A. J. Mungall, C. Mar, F. Cafferty, K. Gelmon, S. Chia, M. A. Marra, C. Hansen, S. P. Shah, S. Aparicio, G. J. Hannon, G. Battistoni, D. Bressan, I. Cannell, H. Casbolt, C. Jauset, T. Kovačević, C. Mulvey, F. Nugent, M. P. Ribes, I. Pearsall, F. Qosaj, K. Sawicka, S. Wild, E. Williams, S. Aparicio, E. Laks, Y. Li, C. O’Flanagan, A. Smith, T. Ruiz, S. Balasubramanian, M. Lee, B. Bodenmiller, M. Burger, L. Kuett, S. Tietscher, J. Windager, E. Boyden, S. Alon, Y. Cui, A. Emenari, D. Goodwin, E. Karagiannis, A. Sinha, A. T. Wassie, C. Caldas, A. Bruna, M. Callari, W. Greenwood, G. Lerda, Y. Lubling, A. Marti, O. Rueda, A. Shea, O. Harris, R. Becker, F. Grimaldi, S. Harris, S. Vogl, J. A. Joyce, J. Hausser, S. Watson, S. Shah, A. McPherson, I. Vázquez-García, S. Tavaré, K. Dinh, E. Fisher, R. Kunes, N. A. Walton, M. A. Sa’d, N. Chornay, A. Dariush, E. G. Solares, C. Gonzalez-Fernandez, A. K. Yoldas, N. Millar, X. Zhuang, J. Fan, H. Lee, L. S. Duran, C. Xia, and P. Zheng. Clonal decomposition and DNA replication states defined by scaled single-cell genome sequencing. Cell, 179(5):1207–1221.e22, nov 2019. doi:10.1016/j.cell.2019.10.026. → pages 4, 5, 17

[39] J. Li, L. Wei, X. Zhang, W. Zhang, H. Wang, B. Zhong, Z. Xie, H. Lv, and X. Wang. DISMIR: Deep learning-based noninvasive cancer detection by integrating DNA sequence and methylation information of individual cell-free DNA reads. Briefings in Bioinformatics, 22(6), jul 2021. doi:10.1093/bib/bbab250. → page 3

[40] L. McInnes, J. Healy, and S. Astels. hdbscan: Hierarchical density based clustering.
The Journal of Open Source Software, 2(11):205, mar 2017. doi:10.21105/joss.00205.

[41] L. McInnes, J. Healy, and J. Melville. Umap: Uniform manifold approximation and projection for dimension reduction, 2018. → page 4

[42] A. W. McPherson, A. Roth, G. Ha, C. Chauve, A. Steif, C. P. E. de Souza, P. Eirew, A. Bouchard-Côté, S. Aparicio, S. C. Sahinalp, and S. P. Shah. ReMixT: clone-specific genomic structure estimation in cancer. Genome Biology, 18(1), jul 2017. doi:10.1186/s13059-017-1267-2. → pages 2, 5

[43] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, jun 1953. doi:10.1063/1.1699114. → page 7

[44] F. Mölder, K. P. Jablonski, B. Letcher, M. B. Hall, C. H. Tomkins-Tinch, V. Sochat, J. Forster, S. Lee, S. O. Twardziok, A. Kanitz, A. Wilm, M. Holtgrewe, S. Rahmann, S. Nahnsen, and J. Köster. Sustainable data analysis with snakemake. F1000Research, 10:33, apr 2021. doi:10.12688/f1000research.29032.2. → page 20

[45] L. Müllauer. Next generation sequencing: clinical applications in solid tumours. memo - Magazine of European Medical Oncology, 10(4):244–247, nov 2017. doi:10.1007/s12254-017-0361-1. → page 3

[46] T. Nawy. Single-cell sequencing. Nature Methods, 11(1):18–18, dec 2013. doi:10.1038/nmeth.2771. → page 4

[47] S. Nik-Zainal, L. B. Alexandrov, D. C. Wedge, P. V. Loo, C. D. Greenman, K. Raine, D. Jones, J. Hinton, J. Marshall, L. A. Stebbings, A. Menzies, S. Martin, K. Leung, L. Chen, C. Leroy, M. Ramakrishna, R. Rance, K. W. Lau, L. J. Mudie, I. Varela, D. J. McBride, G. R. Bignell, S. L. Cooke, A. Shlien, J. Gamble, I. Whitmore, M. Maddison, P. S. Tarpey, H. R. Davies, E. Papaemmanuil, P. J. Stephens, S. McLaren, A. P. Butler, J. W. Teague, G. Jönsson, J. E. Garber, D. Silver, P. Miron, A. Fatima, S. Boyault, A. Langerød, A. Tutt, J. W. Martens, S. A. Aparicio, Å. Borg, A. V. Salomon, G. Thomas, A.-L. Børresen-Dale, A. L. Richardson, M.
S. Neuberger, P. A. Futreal, P. J. Campbell, and M. R. Stratton. Mutational processes molding the genomes of 21 breast cancers. Cell, 149(5):979–993, may 2012. doi:10.1016/j.cell.2012.04.024. → page 1

[48] P. C. Nowell. The clonal evolution of tumor cell populations. Science, 194(4260):23–28, oct 1976. doi:10.1126/science.959840. → pages 1, 2

[49] L. Oesper, G. Satas, and B. J. Raphael. Quantifying tumor heterogeneity in whole-genome and whole-exome sequencing data. Bioinformatics, 30(24):3532–3540, oct 2014. doi:10.1093/bioinformatics/btu651. → pages 2, 5

[50] M. P and M. P. Les acides nucléiques du plasma sanguin chez l’homme. C R Seances Soc Biol Fil, (142):241–243, 1948. → page 2

[51] D. Phan, N. Pradhan, and M. Jankowiak. Composable effects for flexible and accelerated probabilistic programming in numpyro. 2019. doi:10.48550/ARXIV.1912.11554. → page 13

[52] P. Ramos and M. Bentires-Alj. Mechanism-based cancer therapy: resistance to therapy, therapy for resistance. Oncogene, 34(28):3617–3626, sep 2014. doi:10.1038/onc.2014.314. → pages 1, 2

[53] A. Roth, J. Khattra, D. Yap, A. Wan, E. Laks, J. Biele, G. Ha, S. Aparicio, A. Bouchard-Côté, and S. P. Shah. PyClone: statistical inference of clonal population structure in cancer. Nature Methods, 11(4):396–398, mar 2014. doi:10.1038/nmeth.2883. → page 2

[54] S. Salehi, F. Dorri, K. Chern, F. Kabeer, N. Rusk, T. Funnell, M. J. Williams, D. Lai, M. Andronescu, K. R. Campbell, A. McPherson, S. Aparicio, A. Roth, S. P. Shah, and A. Bouchard-Côté. Cancer phylogenetic tree inference at scale from 1000s of single cell genomes. may 2020. doi:10.1101/2020.05.06.058180. → pages 4, 17

[55] S. Salehi, F. Kabeer, N. Ceglia, M. Andronescu, M. J. Williams, K. R. Campbell, T. Masud, B. Wang, J. Biele, J. Brimhall, D. Gee, H. Lee, J. Ting, A. W. Zhang, H. Tran, C. O’Flanagan, F. Dorri, N. Rusk, T. R. de Algara, S. R. Lee, B. Y. C. Cheng, P. Eirew, T. Kono, J. Pham, D. Grewal, D. Lai, R. Moore, A. J. Mungall, M. A. Marra, G. J. Hannon, G. Battistoni, D.
Bressan, I. G. Cannell, H. Casbolt, A. Fatemi, C. Jauset, T. Kovačević, C. M. Mulvey, F. Nugent, M. P. Ribes, I. Pearsall, F. Qosaj, K. Sawicka, S. A. Wild, E. Williams, E. Laks, Y. Li, C. H. O’Flanagan, A. Smith, T. Ruiz, D. Lai, A. Roth, S. Balasubramanian, M. Lee, B. Bodenmiller, M. Burger, L. Kuett, S. Tietscher, J. Windhager, E. S. Boyden, S. Alon, Y. Cui, A. Emenari, D. Goodwin, E. D. Karagiannis, A. Sinha, A. T. Wassie, C. Caldas, A. Bruna, M. Callari, W. Greenwood, G. Lerda, Y. Eyal-Lubling, O. M. Rueda, A. Shea, O. Harris, R. Becker, F. Grimaldi, S. Harris, S. L. Vogl, J. Weselak, J. A. Joyce, S. S. Watson, I. Vázquez-García, S. Tavaré, K. N. Dinh, E. Fisher, R. Kunes, N. A. Walton, M. A. Sa’d, N. Chornay, A. Dariush, E. A. González-Solares, C. González-Fernández, A. K. Yoldas, N. Millar, T. Whitmarsh, X. Zhuang, J. Fan, H. Lee, L. A. Sepúlveda, C. Xia, P. Zheng, A. McPherson, A. Bouchard-Côté, S. Aparicio, and S. P. S. and. Clonal fitness inferred from time-series modelling of single-cell cancer genomes. Nature, 595(7868):585–590, jun 2021. doi:10.1038/s41586-021-03648-3. → pages 4, 18

[56] C. Sarkozy, S. Wu, K. Takata, T. Aoki, S. B. Neriah, K. Milne, T. Goodyear, C. Strong, T. Rastogi, D. Lai, L. H. Sehn, P. Farinha, B. H. Nelson, A. Weng, D. W. Scott, J. W. Craig, C. Steidl, and A. Roth. Integrated single cell analysis reveals co-evolution of malignant b cells and the tumor microenvironment in transformed follicular lymphoma. nov 2022. doi:10.1101/2022.11.17.516951. → pages 4, 17, 18

[57] E. Shapiro, T. Biezuner, and S. Linnarsson. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nature Reviews Genetics, 14(9):618–630, jul 2013. doi:10.1038/nrg3542. → page 3

[58] R. Shen and V. E. Seshan. FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing. Nucleic Acids Research, 44(16):e131–e131, jun 2016. doi:10.1093/nar/gkw520. → pages 2, 5

[59] G. Siravegna, S. Marsoni, S. Siena, and A.
Bardelli. Integrating liquidbiopsies into the management of cancer. Nature Reviews Clinical Oncology,14(9):531–548, mar 2017. doi:10.1038/nrclinonc.2017.14. → page 2[60] M. W. Snyder, M. Kircher, A. J. Hill, R. M. Daza, and J. Shendure. Cell-freeDNA comprises an in vivo nucleosome footprint that informs itstissues-of-origin. Cell, 164(1-2):57–68, jan 2016.doi:10.1016/j.cell.2015.11.050. → page 20[61] C. D. Steele, A. Abbasi, S. M. A. Islam, A. L. Bowes, A. Khandekar,K. Haase, S. Hames-Fathi, D. Ajayi, A. Verfaillie, P. Dhami, A. McLatchie,M. Lechner, N. Light, A. Shlien, D. Malkin, A. Feber, P. Proszek,T. Lesluyes, F. Mertens, A. M. Flanagan, M. Tarabichi, P. V. Loo, L. B.Alexandrov, and N. Pillay. Signatures of copy number alterations in humancancer. Nature, 606(7916):984–991, jun 2022.doi:10.1038/s41586-022-04738-6. → page 4[62] D. Sylvie-Louise Avon and H. Klieb. Oral soft-tissue biopsy: an overview. JCan Dent Assoc, 78:c75, 2012. → page 2[63] D. Tamborero, C. Rubio-Perez, J. Deu-Pons, M. P. Schroeder, A. Vivancos,A. Rovira, I. Tusquets, J. Albanell, J. Rodon, J. Tabernero, C. de Torres,R. Dienstmann, A. Gonzalez-Perez, and N. Lopez-Bigas. Cancer genomeinterpreter annotates the biological and clinical relevance of tumor60alterations. Genome Medicine, 10(1), mar 2018.doi:10.1186/s13073-018-0531-8. → page 5[64] C. Tomasetti, L. Marchionni, M. A. Nowak, G. Parmigiani, andB. Vogelstein. Only three driver gene mutations are required for thedevelopment of lung and colorectal cancers. Proceedings of the NationalAcademy of Sciences, 112(1):118–123, dec 2014.doi:10.1073/pnas.1421839112. → page 5[65] J.-W. van de Meent, B. Paige, H. Yang, and F. Wood. An introduction toprobabilistic programming, 2018.[66] S. Volik, M. Alcaide, R. D. Morin, and C. Collins. Cell-free DNA (cfDNA):Clinical significance and utility in cancer shaped by emerging technologies.Molecular Cancer Research, 14(10):898–908, oct 2016.doi:10.1158/1541-7786.mcr-16-0044. → page 2[67] J. C. M. Wan, C. 
Massie, J. Garcia-Corbacho, F. Mouliere, J. D. Brenton,C. Caldas, S. Pacey, R. Baird, and N. Rosenfeld. Liquid biopsies come ofage: towards implementation of circulating tumour DNA. Nature ReviewsCancer, 17(4):223–238, feb 2017. doi:10.1038/nrc.2017.7. → page 1[68] R. Xi, A. G. Hadjipanayis, L. J. Luquette, T.-M. Kim, E. Lee, J. Zhang,M. D. Johnson, D. M. Muzny, D. A. Wheeler, R. A. Gibbs, R. Kucherlapati,and P. J. Park. Copy number variation detection in whole-genomesequencing data using the bayesian information criterion. Proceedings of theNational Academy of Sciences, 108(46), nov 2011.doi:10.1073/pnas.1110574108. → page 4[69] C. Xu. A review of somatic single nucleotide variant calling algorithms fornext-generation sequencing data. Computational and StructuralBiotechnology Journal, 16:15–24, 2018. doi:10.1016/j.csbj.2018.01.003. →page 5[70] H. Zahn, A. Steif, E. Laks, P. Eirew, M. VanInsberghe, S. P. Shah,S. Aparicio, and C. L. Hansen. Scalable whole-genome single-cell librarypreparation without preamplification. Nature Methods, 14(2):167–173, jan2017. doi:10.1038/nmeth.4140. → page 3[71] C. Zong, S. Lu, A. R. Chapman, and X. S. Xie. Genome-wide detection ofsingle-nucleotide and copy-number variations of a single human cell.Science, 338(6114):1622–1626, dec 2012. doi:10.1126/science.1229164.→ page 461[72] H. Zou, L.-X. Wu, L. Tan, F.-F. Shang, and H.-H. Zhou. Significance ofsingle-nucleotide variants in long intergenic non-protein coding RNAs.Frontiers in Cell and Developmental Biology, 8, may 2020.doi:10.3389/fcell.2020.00347. → page 5[73] A. Zviran, R. C. Schulman, M. Shah, S. T. K. Hill, S. Deochand, C. C.Khamnei, D. Maloney, K. Patel, W. Liao, A. J. Widman, P. Wong, M. K.Callahan, G. Ha, S. Reed, D. Rotem, D. Frederick, T. Sharova, B. Miao,T. Kim, G. Gydush, J. Rhoades, K. Y. Huang, N. D. Omans, P. O. Bolan,A. H. Lipsky, C. Ang, M. Malbari, C. F. Spinelli, S. Kazancioglu, A. M.Runnels, S. Fennessey, C. Stolte, F. Gaiti, G. G. Inghirami, V. Adalsteinsson,B. 
Appendix A
Supporting Materials

A.1 Figures & Tables

| Number of clones | ρ |
|---|---|
| 2 | (0.1, 0.9) |
| 3 | (0.1, 0.7, 0.2) |
| 4 | (0.05, 0.6, 0.2, 0.15) |
| 5 | (0.05, 0.4, 0.05, 0.25, 0.25) |
| 6 | (0.05, 0.4, 0.05, 0.15, 0.25, 0.15) |

Table A.1: Unnormalized ρ values at different numbers of clones. ρ was determined by the number of clones.

| Tumour fraction | Clone | Estimates | Truth | L1 |
|---|---|---|---|---|
| .5 | A | 0.380398 | 0.380529 | 0.000698 |
| .5 | B | 0.003268 | 0.003414 | 0.000274 |
| .5 | C | 0.080290 | 0.080324 | 0.000231 |
| .5 | D | 0.035855 | 0.035733 | 0.000257 |
| .5 | normal | 0.500189 | 0.500000 | 0.000884 |
| .3 | A | 0.228291 | 0.228317 | 0.000266 |
| .3 | B | 0.002076 | 0.002048 | 0.000343 |
| .3 | C | 0.048053 | 0.048195 | 0.000301 |
| .3 | D | 0.021540 | 0.021440 | 0.000266 |
| .3 | normal | 0.700040 | 0.700000 | 0.000422 |
| 1e-1 | A | 0.076050 | 0.076106 | 0.000316 |
| 1e-1 | B | 0.000794 | 0.000683 | 0.000329 |
| 1e-1 | C | 0.016110 | 0.016065 | 0.000198 |
| 1e-1 | D | 0.007134 | 0.007147 | 0.000200 |
| 1e-1 | normal | 0.899912 | 0.900000 | 0.000348 |
| 1e-3 | A | 0.000032 | 0.000761 | 0.000729 |
| 1e-3 | B | 0.000043 | 0.000007 | 0.000038 |
| 1e-3 | C | 0.000037 | 0.000161 | 0.000124 |
| 1e-3 | D | 0.000039 | 0.000071 | 0.000035 |
| 1e-3 | normal | 0.999849 | 0.999000 | 0.000849 |
| 1e-5 | A | 0.000038 | 7.61e-06 | 0.000033 |
| 1e-5 | B | 0.000049 | 6.83e-08 | 0.000049 |
| 1e-5 | C | 0.000040 | 1.61e-06 | 0.000038 |
| 1e-5 | D | 0.000041 | 7.15e-07 | 0.000040 |
| 1e-5 | normal | 0.999832 | 0.999990 | 0.000158 |
| 1e-7 | A | 0.000039 | 7.61e-08 | 0.000039 |
| 1e-7 | B | 0.000050 | 6.83e-10 | 0.000050 |
| 1e-7 | C | 0.000043 | 1.61e-08 | 0.000043 |
| 1e-7 | D | 0.000042 | 7.15e-09 | 0.000042 |
| 1e-7 | normal | 0.999825 | 0.9999999 | 0.000174 |

Table A.2: Mean/standard deviation of clonal fraction estimates and average L1/Relative L1 values across ten replicates for synthetic experiments on the extended model at six tumour fraction levels.
Relative L1 = log(L1 / tumour fraction + 1).

| Tumour fraction | Read depth | Mean | SD | Relative L1 | L1 |
|---|---|---|---|---|---|
| 1e-1 | 1e-3 | 0.086328 | 0.011741 | 0.146775 | 0.016045 |
| 1e-1 | 1e-2 | 0.095850 | 0.007966 | 0.069998 | 0.007341 |
| 1e-1 | 1e-1 | 0.098558 | 0.003351 | 0.028274 | 0.002886 |
| 1e-1 | .5 | 0.100187 | 0.001204 | 0.010205 | 0.001027 |
| 1e-1 | 1 | 0.100068 | 0.000972 | 0.006447 | 0.000649 |
| 1e-1 | 10 | 0.100411 | 0.000315 | 0.004101 | 0.000411 |
| 1e-1 | 100 | 0.100345 | 0.000304 | 0.003756 | 0.000377 |
| 1e-2 | 1e-3 | 0.022454 | 0.006898 | 0.766878 | 0.012454 |
| 1e-2 | 1e-2 | 0.011396 | 0.003174 | 0.208565 | 0.002499 |
| 1e-2 | 1e-1 | 0.011042 | 0.002278 | 0.168987 | 0.001923 |
| 1e-2 | .5 | 0.010406 | 0.001387 | 0.100823 | 0.001092 |
| 1e-2 | 1 | 0.009938 | 0.000868 | 0.069420 | 0.000726 |
| 1e-2 | 10 | 0.009925 | 0.000250 | 0.022587 | 0.000229 |
| 1e-2 | 100 | 0.009997 | 0.000053 | 0.003782 | 0.000038 |
| 5e-3 | 1e-3 | 0.017627 | 0.005056 | 1.224582 | 0.012627 |
| 5e-3 | 1e-2 | 0.008844 | 0.003066 | 0.528497 | 0.003844 |
| 5e-3 | 1e-1 | 0.006386 | 0.001910 | 0.302335 | 0.001884 |
| 5e-3 | .5 | 0.005194 | 0.001042 | 0.143213 | 0.000800 |
| 5e-3 | 1 | 0.005128 | 0.000709 | 0.101063 | 0.000547 |
| 5e-3 | 10 | 0.004953 | 0.000135 | 0.019796 | 0.000101 |
| 5e-3 | 100 | 0.005004 | 0.000061 | 0.008541 | 0.000043 |
| 1e-3 | 1e-3 | 0.000204 | 0.000101 | 0.584141 | 0.000796 |
| 1e-3 | 1e-2 | 0.000254 | 0.000149 | 0.554169 | 0.000746 |
| 1e-3 | 1e-1 | 0.002738 | 0.001337 | 0.887751 | 0.001738 |
| 1e-3 | .5 | 0.000190 | 0.000110 | 0.591815 | 0.000810 |
| 1e-3 | 1 | 0.000118 | 0.000098 | 0.631397 | 0.000882 |
| 1e-3 | 10 | 0.001191 | 0.000269 | 0.220185 | 0.000259 |
| 1e-3 | 100 | 0.000098 | 0.000080 | 0.641871 | 0.000902 |

Table A.3: Mean/standard deviation of tumour fraction estimates and average L1/Relative L1 values across ten replicates for synthetic experiments on the extended model at seven read depths and four tumour fraction levels. Relative L1 = log(L1 / tumour fraction + 1).

Figure A.1: Boxplots of the Relative L1 of tumour and clone fraction estimates for six tumour fraction levels using the extended model. Simulated datasets had a read depth of 1x and four clones.

(a) Tumour fraction = 5e-3.
(b) Tumour fraction = 1e-3.
Figure A.2: Posterior plots of tumour fraction estimates from synthetic experiments using the extended model at six read depths for tumour fractions 5e-3 and 1e-3.
Bold black lines indicate the HDI.

(a) Boxplots and superimposed strip plots to visualize the distribution of 94% HDI widths across ten replicates for the base and extended models at seven read depths and four tumour fractions.
(b) Bar plot depicting the Bayesian coverage across ten replicates at three read depths and four tumour fractions.
Figure A.3: Posterior distribution statistics for synthetic experiments on read depth. (a) HDI widths. (b) Bayesian coverage.

(a) Tumour fraction = .5.
(b) Tumour fraction = 1e-1.
Figure A.4: Posterior plots of tumour fraction estimates from synthetic experiments using the extended model for three to six clones at tumour fractions .5 and 1e-1. Bold black lines indicate the HDI.

(a) Tumour fraction = .5.
(b) Tumour fraction = 1e-1.
Figure A.5: Posterior plots of tumour fraction estimates from synthetic experiments using the base model for three to six clones at tumour fractions .5 and 1e-1, with the smallest clone removed. Bold black lines indicate the HDI.

(a) Tumour fraction = .5.
(b) Tumour fraction = 1e-1.
Figure A.6: Posterior plots of tumour fraction estimates from synthetic experiments using the base model for three to six clones at tumour fractions .5 and 1e-1, with the largest clone removed. Bold black lines indicate the HDI.

(a) Tumour fraction = .5.
(b) Tumour fraction = 1e-1.
Figure A.7: Posterior plots of tumour fraction estimates from synthetic experiments using the extended model for two to six clones at tumour fractions .5 and 1e-1, with the smallest clone removed. Bold black lines indicate the HDI.

(a) Tumour fraction = .5.
(b) Tumour fraction = 1e-1.
Figure A.8: Posterior plots of tumour fraction estimates from synthetic experiments using the extended model for two to six clones at tumour fractions .5 and 1e-1, with the largest clone removed.
Bold black lines indicate the HDI.

| Tumour fraction | Clone | Estimates | Truth | L1 |
|---|---|---|---|---|
| .5 | A | 0.582716 | 0.400000 | 0.182716 |
| .5 | B | 0.021080 | 0.075000 | 0.053920 |
| .5 | C | 0.059412 | 0.025000 | 0.034412 |
| .5 | normal | 0.336792 | 0.500000 | 0.163208 |
| .3 | A | 0.337564 | 0.240000 | 0.097564 |
| .3 | B | 0.014707 | 0.045000 | 0.030293 |
| .3 | C | 0.031571 | 0.015000 | 0.018288 |
| .3 | normal | 0.616157 | 0.700000 | 0.083843 |
| 1e-1 | A | 0.114300 | 0.080000 | 0.034300 |
| 1e-1 | B | 0.008298 | 0.015000 | 0.008047 |
| 1e-1 | C | 0.008663 | 0.005000 | 0.005884 |
| 1e-1 | normal | 0.868740 | 0.900000 | 0.031260 |
| 1e-3 | A | 0.010537 | 0.000800 | 0.009737 |
| 1e-3 | B | 0.001076 | 0.000150 | 0.000928 |
| 1e-3 | C | 0.001458 | 0.000050 | 0.001408 |
| 1e-3 | normal | 0.986929 | 0.999000 | 0.012071 |
| 1e-5 | A | 0.010578 | 8.00e-6 | 0.010570 |
| 1e-5 | B | 0.000756 | 1.50e-6 | 0.000754 |
| 1e-5 | C | 0.001068 | 5.00e-7 | 0.001067 |
| 1e-5 | normal | 0.987598 | 0.999990 | 0.012392 |
| 1e-7 | A | 0.010458 | 8.00e-8 | 0.010458 |
| 1e-7 | B | 0.001127 | 1.50e-8 | 0.001127 |
| 1e-7 | C | 0.001141 | 5.00e-9 | 0.001141 |
| 1e-7 | normal | 0.987274 | 0.9999999 | 0.012726 |

Table A.4: Mean/standard deviation of clonal fraction estimates and average L1 values across ten replicates for semi-realistic experiments on LiquidBayes' base model at six tumour fraction levels.

| Tumour fraction | Clone | Estimates | Truth | L1 |
|---|---|---|---|---|
| .5 | A | 0.461781 | 0.400000 | 0.061781 |
| .5 | B | 0.011919 | 0.075000 | 0.063081 |
| .5 | C | 0.058518 | 0.025000 | 0.033518 |
| .5 | normal | 0.467782 | 0.500000 | 0.032218 |
| .3 | A | 0.234162 | 0.240000 | 0.042017 |
| .3 | B | 0.009876 | 0.045000 | 0.035124 |
| .3 | C | 0.029606 | 0.015000 | 0.017159 |
| .3 | normal | 0.726356 | 0.700000 | 0.032199 |
| 1e-1 | A | 0.057481 | 0.080000 | 0.022519 |
| 1e-1 | B | 0.005300 | 0.015000 | 0.010001 |
| 1e-1 | C | 0.007336 | 0.005000 | 0.004175 |
| 1e-1 | normal | 0.929883 | 0.900000 | 0.029883 |
| 1e-3 | A | 0.000615 | 0.000800 | 0.000196 |
| 1e-3 | B | 0.000293 | 0.000150 | 0.000151 |
| 1e-3 | C | 0.000414 | 0.000050 | 0.000364 |
| 1e-3 | normal | 0.998678 | 0.999000 | 0.000412 |
| 1e-5 | A | 6.37e-4 | 8.00e-6 | 0.000629 |
| 1e-5 | B | 3.18e-4 | 1.50e-6 | 0.000316 |
| 1e-5 | C | 3.56e-4 | 5.00e-7 | 0.000356 |
| 1e-5 | normal | 0.998689 | 0.999990 | 0.001301 |
| 1e-7 | A | 5.46e-4 | 8.00e-8 | 0.000546 |
| 1e-7 | B | 4.06e-4 | 1.50e-8 | 0.000406 |
| 1e-7 | C | 3.60e-4 | 5.00e-9 | 0.000360 |
| 1e-7 | normal | 0.998688 | 0.9999999 | 0.001312 |

Table A.5: Mean/standard deviation of clonal fraction estimates and average L1/Relative L1 values across ten replicates for semi-realistic experiments on the extended model at six tumour fraction levels.

(a) Read depth = 1x.
(b) Read depth = 1e-1x.
Figure A.9:
Posterior plots of tumour fraction estimates from semi-realistic experiments using the base model at two read depths and four tumour fractions. Dark black lines indicate the HDI.

(a) Read depth = 1x.
(b) Read depth = 1e-1x.
Figure A.10: Posterior plots of tumour fraction estimates from semi-realistic experiments using the extended model at two read depths and four tumour fractions. Dark black lines indicate the HDI.

(a) Boxplots and superimposed strip plots to visualize the distribution of 94% HDI widths across ten replicates for the base and extended models at four tumour fractions and three read depths.
(b) Bar plots depicting the Bayesian coverage across ten replicates at four tumour fractions and three read depths.
Figure A.11: Posterior distribution statistics for semi-realistic experiments on read depth. (a) HDI widths. (b) Bayesian coverage.

| Model | Tumour fraction | Number of clones | Mean | SD | Relative L1 | L1 |
|---|---|---|---|---|---|---|
| Base | .5 | 2 | 0.6567 | 0.0192 | 0.2723 | 0.1567 |
| Base | .5 | 3 | 0.6632 | 0.0076 | 0.2824 | 0.1632 |
| Base | .5 | 4 | 0.6547 | 0.0142 | 0.2693 | 0.1547 |
| Base | .5 | 5 | 0.6468 | 0.0089 | 0.2574 | 0.1468 |
| Base | 1e-1 | 2 | 0.1300 | 0.0058 | 0.2618 | 0.0300 |
| Base | 1e-1 | 3 | 0.1313 | 0.0066 | 0.2709 | 0.0313 |
| Base | 1e-1 | 4 | 0.1282 | 0.0062 | 0.2473 | 0.0282 |
| Base | 1e-1 | 5 | 0.1253 | 0.0094 | 0.2229 | 0.0253 |
| Extended | .5 | 2 | 0.0678 | 0.0026 | 0.6229 | 0.4322 |
| Extended | .5 | 3 | 0.5322 | 0.0064 | 0.0624 | 0.0322 |
| Extended | .5 | 4 | 0.5887 | 0.0074 | 0.1632 | 0.0887 |
| Extended | .5 | 5 | 0.5399 | 0.1310 | 0.1862 | 0.1063 |
| Extended | 1e-1 | 2 | 0.0120 | 0.0025 | 0.6311 | 0.0880 |
| Extended | 1e-1 | 3 | 0.0701 | 0.0221 | 0.2506 | 0.0299 |
| Extended | 1e-1 | 4 | 0.0989 | 0.0047 | 0.0328 | 0.0034 |
| Extended | 1e-1 | 5 | 0.0998 | 0.0083 | 0.0614 | 0.0064 |
| ichorCNA | .5 | 2 | 0.8925 | 0.0013 | 0.5795 | 0.3925 |
| ichorCNA | .5 | 3 | 0.8936 | 0.0017 | 0.5806 | 0.3936 |
| ichorCNA | .5 | 4 | 0.8951 | 0.0054 | 0.5823 | 0.3951 |
| ichorCNA | .5 | 5 | 0.8170 | 0.0325 | 0.4903 | 0.3170 |
| ichorCNA | 1e-1 | 2 | 0.9501 | 0.0075 | 2.2514 | 0.8501 |
| ichorCNA | 1e-1 | 3 | 0.9497 | 0.0073 | 2.2510 | 0.8497 |
| ichorCNA | 1e-1 | 4 | 0.9596 | 0.0033 | 2.2613 | 0.8596 |
| ichorCNA | 1e-1 | 5 | 0.9604 | 0.0021 | 2.2621 | 0.8604 |
| MRDetectSNV | .5 | 2 | 0.9103 | 0.0005 | 0.5992 | 0.4103 |
| MRDetectSNV | .5 | 3 | 0.9145 | 0.0006 | 0.6037 | 0.4145 |
| MRDetectSNV | .5 | 4 | 0.9174 | 0.0005 | 0.6069 | 0.4174 |
| MRDetectSNV | .5 | 5 | 0.9219 | 0.0005 | 0.6118 | 0.4219 |
| MRDetectSNV | 1e-1 | 2 | 0.9522 | 0.0005 | 2.2536 | 0.8522 |
| MRDetectSNV | 1e-1 | 3 | 0.9547 | 0.0010 | 2.2562 | 0.8547 |
| MRDetectSNV | 1e-1 | 4 | 0.9555 | 0.0005 | 2.2571 | 0.8555 |
| MRDetectSNV | 1e-1 | 5 | 0.9571 | 0.0004 | 2.2587 | 0.8571 |

Table A.6: Mean/standard deviation of tumour fraction estimates and average L1/Relative L1 values across ten
replicates for LiquidBayes' base and extended models, ichorCNA and MRDetectSNV in semi-realistic experiments for two to five clones and two tumour fractions. Relative L1 = log(L1 / tumour fraction + 1).

(a) Tumour fraction = .5.
(b) Tumour fraction = 1e-1.
Figure A.12: Posterior plots of tumour fraction estimates from semi-realistic experiments using the base model for two to five clones at two tumour fractions. Dark black lines indicate the HDI.

(a) Tumour fraction = .5.
(b) Tumour fraction = 1e-1.
Figure A.13: Posterior plots of tumour fraction estimates from semi-realistic experiments using the extended model for two to five clones at two tumour fractions. Dark black lines indicate the HDI.

(a) Boxplots and superimposed strip plots to visualize the distribution of 94% HDI widths across ten replicates for the base and extended models for two to five clones at two tumour fractions.
(b) Bar plot depicting the Bayesian coverage across ten replicates for the base and extended models for two to five clones at two tumour fractions.
Figure A.14: Posterior distribution statistics for semi-realistic experiments on number of clones. (a) HDI widths. (b) Bayesian coverage.

(a) Boxplots and superimposed strip plots to visualize the distribution of 94% HDI widths across ten replicates for two to five clones (includes the missing clone) at tumour fractions .5 and 1e-1 for semi-realistic experiments.
(b) Bar plot depicting the Bayesian coverage across ten replicates for two to five clones (includes the missing clone) at two tumour fractions for semi-realistic experiments.
Figure A.15: Posterior distribution statistics for semi-realistic experiments on the effects of a missing clone. (a) HDI widths. (b) Bayesian coverage.
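
The summary statistics used throughout this appendix (Relative L1, 94% HDI width and Bayesian coverage) can be illustrated with a minimal Python sketch. This is not the thesis code: the function names `relative_l1`, `hdi` and `coverage` are hypothetical, the logarithm is assumed to be the natural log (which agrees with the tabulated values, e.g. an estimate of 0.6567 against a true tumour fraction of .5 gives roughly 0.27), the HDI is assumed to be the narrowest interval containing 94% of the posterior samples, and Bayesian coverage is assumed to mean the fraction of replicates whose HDI contains the true value.

```python
import numpy as np


def relative_l1(estimate, truth):
    """Relative L1 as given in the table captions: log(L1 / truth + 1).

    Assumption: natural logarithm, applied to a single replicate.
    """
    l1 = abs(estimate - truth)
    return float(np.log(l1 / truth + 1.0))


def hdi(samples, prob=0.94):
    """Narrowest interval containing `prob` of the posterior samples.

    Assumption: the HDI is estimated directly from MCMC draws by sliding
    a window of fixed sample count over the sorted draws.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(prob * n))            # number of draws inside the interval
    widths = x[k - 1:] - x[: n - k + 1]   # width of every candidate window
    i = int(np.argmin(widths))            # index of the narrowest window
    return float(x[i]), float(x[i + k - 1])


def coverage(intervals, truth):
    """Fraction of replicate intervals (lo, hi) that contain the true value."""
    hits = [lo <= truth <= hi for lo, hi in intervals]
    return sum(hits) / len(hits)


if __name__ == "__main__":
    # Hypothetical data: ten replicates of simulated posterior draws
    # centred on a true tumour fraction of 0.1.
    rng = np.random.default_rng(0)
    truth = 0.1
    replicates = [rng.normal(truth, 0.003, size=2000) for _ in range(10)]

    intervals = [hdi(draws) for draws in replicates]
    widths = [hi - lo for lo, hi in intervals]
    rel = [relative_l1(float(np.mean(draws)), truth) for draws in replicates]

    print("mean HDI width:", np.mean(widths))
    print("Bayesian coverage:", coverage(intervals, truth))
    print("mean Relative L1:", np.mean(rel))
```

Averaging the per-replicate Relative L1 values, as in the last line, is one reading of "average L1/Relative L1 values across ten replicates". Taking the log of the averaged L1 instead gives slightly different numbers, so the choice should be checked against the original analysis scripts.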