IMPROVEMENT TO THE STATISTICAL SENSITIVITY OF TOP QUARK PAIR PRODUCTION IN CONJUNCTION WITH ADDITIONAL HEAVY FLAVOUR JETS THROUGH MULTIVARIATE ANALYSIS

by

MACKENZIE PETER FULFORD VAN ROSSEM

B.Sc., The University of Toronto, 2014

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Physics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

August 2016

© Mackenzie Peter Fulford van Rossem, 2016

Abstract

The discovered Higgs-like boson has a mass of 125 GeV, which makes its primary decay mode a decay to two bottom (b) jets. A precise measurement of top-pair ($t\bar{t}$) production in conjunction with two additional b-jets is essential to reduce the background uncertainty on the $t\bar{t}$ + Higgs production cross-section, a direct probe of the Higgs-to-quark Yukawa coupling. This thesis attempts to improve on the statistical sensitivity of $t\bar{t}$ production in conjunction with two additional heavy-flavour jets, using expected sensitivities from 20.3 fb$^{-1}$ of $pp$ collision data at $\sqrt{s} = 8$ TeV, collected by the ATLAS detector at the Large Hadron Collider in 2012. This thesis compares multiple multivariate analysis techniques, boosted decision trees and artificial neural networks, in both binary and multi-class classification cases. An overall improvement in precision was seen, from 19.7% uncertainty on the baseline $t\bar{t} + b\bar{b}$ measurement based on a fit to the best single variable, to 16.1% uncertainty with the very best multi-class neural network algorithm. This represents a relative improvement of nearly 20% and could thus reduce the luminosity needed for a precision measurement of this process.

Preface

This dissertation is ultimately based on the experimental apparatus and data of the ATLAS experiment at CERN, the subject of a large international collaboration.

The statistical analysis performed in chapters 9 and 10, as well as the work in appendices C and D, is original to this thesis and was performed entirely by myself.
The rest is a necessary explanation of work performed by others, both within the ATLAS collaboration and elsewhere. These explanations are needed in order to understand the original work performed in this thesis.

The work presented in this thesis is unpublished; however, it relies on many results derived for the publication [26], to which I was also a major contributor.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
1 Introduction
2 Theory
2.1 The Electroweak Force
2.2 The Higgs Mechanism and Electroweak Symmetry Breaking
2.3 The Strong Force
2.4 Hadron Collisions
2.5 Motivation for Precision $t\bar{t} + b\bar{b}$ Measurement
3 The ATLAS Detector and the Large Hadron Collider
3.1 Inner Detector
3.2 Calorimetry
3.3 Muon Spectrometer
3.4 Trigger System
4 Object Reconstruction and B-Tagging Algorithms
4.1 Track Reconstruction
4.2 Electrons
4.3 Muons
4.4 Jets
4.5 B-Tagging Algorithms
4.5.1 IP3D Algorithm
4.5.2 SV1 Algorithm
4.5.3 B-Tagging Likelihood Ratio
4.5.4 JetFitter Algorithm
4.5.5 B-Tagging Neural Networks
4.6 B-Tagging Calibration
5 Monte Carlo Simulation of Event Templates
5.1 Data Samples
5.2 Signal Modelling
6 Event Selection
6.1 Pre-Selection Cuts at Reconstruction Level
6.2 Cuts and Fiducial Event Definition at Truth Level
6.3 Template Definitions at Truth Level
7 Machine Learning Algorithms
7.1 Techniques for Statistical Improvements
7.2 Training Phase
7.3 Testing Phase
7.4 Variables for Machine Learning
8 Statistical Methods
9 Baseline Sensitivity using MV1c Discriminator
9.1 Baseline Fit Results
9.1.1 Re-Binning to Reduce Uncertainties Due to Limited MC Statistics
9.2 Beyond Baseline: Higher Dimensional Distributions
10 Machine Learning Classification for Improved Statistical Sensitivity
10.1 Choice of Kinematic Variables and Training Template Combinations
10.1.1 Identification of Possible Variables
10.1.2 Binary Combinations of Templates Used for Training
10.2 Variable Selection Metrics and Process
10.3 Boosted Decision Trees Optimization
10.3.1 Removal of Least-Performant Variables for BDTs
10.3.2 Results from Boosted Decision Trees
10.4 Neural Network Optimization
10.5 Comparison of Best BDT and NN Binary Classification Results
10.6 Multi-Class Neural Network Algorithms
10.6.1 Re-Binning Algorithm
10.6.2 Results
11 Conclusions and Outlook
References
Appendices
A Boosted Decision Trees
A.1 Decision Trees
A.2 Boosting
B Neural Networks
C Variables Rejected Due to Poor Separation
D Variable Selection Metric Based on Fit Sensitivities
D.1 BDT sensitivities with ttbb trained against ttlightnotc
D.2 BDT sensitivities with ttbb trained against ttsingleb
D.3 Neural Network Sensitivities

List of Tables

1 General performance of the ATLAS detector.
2 Parameters of the Inner Detector.
3 Main parameters of the muon spectrometer.
4 Summary table of baseline results.
5 The variable rankings as displayed by TMVA for neural networks.
6 The variable rankings as displayed by TMVA for gradient-boosted decision trees.
7 Demonstrative BDT point card table.
8 Maximum and average sensitivity point card values.
9 BDT sensitivities point card after the first variable test iteration, ttbb vs ttlightnotc.
10 BDT sensitivities point card after the second variable test iteration, ttbb vs ttlightnotc.
11 BDT sensitivities point card after the first variable test iteration, ttbb vs ttsingleb.
12 BDT sensitivities point card after the second variable test iteration, ttbb vs ttsingleb.
13 NN sensitivities point card after the first variable test iteration.
14 NN sensitivities point card after the second variable test iteration.

List of Figures

1 CKM matrix showing electroweak quark flavour mixing.
2 Running coupling of QCD coupling constant.
3 $t\bar{t} + b\bar{b}$ production Feynman diagrams.
4 Cut-away view of the ATLAS detector.
5 Cut-away view of the ATLAS inner detector.
6 Drawing showing ID sensors and structural elements traversed by two charged tracks.
7 Cut-away view of the ATLAS calorimeter system.
8 Cut-away view of the ATLAS muon system.
9 Distribution of the signed significance of the transverse and longitudinal impact parameters.
10 Distributions of the properties of the vertex found by the SV1 tagging algorithm.
11 Background (fake) b-tagging resonances.
12 Schematic displaying the JetFitter topology.
13 Distributions of some of the properties of the vertex found by the JetFitter tagging algorithm.
14 Demonstrations of similarities between MV1c values of merged templates.
15 MV1c weight value assigned to the 3rd and 4th highest MV1c-ranked jets for each event.
16 Combined 15-bin 3rd and 4th MV1c weights distribution.
17 Merged 11-bin 3rd and 4th MV1c weights distribution.
18 MV1c weight value assigned to the 1st and 2nd highest MV1c-ranked jets for each event.
19 Combined 35-bin 2nd, 3rd and 4th MV1c weights distribution.
20 Merged 26-bin 2nd, 3rd and 4th MV1c weights distribution.
21 Combined 65-bin highest 4 MV1c weights distribution.
22 Merged 38-bin highest 4 MV1c weights distribution.
23 Normalized template distributions for the variables investigated with this analysis.
24 Normalized template distributions for the variables investigated with this analysis.
25 Correlations between all 22 initial variables within the ttbb sample.
26 Correlations between all 22 initial variables within the ttlightnotc sample.
27 The best possible BDT distribution.
28 Distribution of the only top performing BDT distribution to use any non-MV1c variable.
29 The best ttbb sensitivity achievable with any binary classification machine learning algorithm.
30 avgdrbb and normavgdrbb variable peculiarities.
31 The best result from any machine learning algorithm.
32 Second best overall/multi-class ML result.
33 Third best overall/multi-class ML result.
34 Gradient boosting algorithm for a generic loss function.
35 Binary class gradient boosting algorithm for a negative binomial log likelihood loss function.
36 Multi-class gradient boosting algorithm for a negative binomial log likelihood loss function.
37 A graphical representation of the function used to map input variables to a response in a NN.

1. Introduction

Physics approaches the difficulties of understanding reality from a reductionist perspective. This has led to two fundamental cornerstones of our theoretical understanding: special relativity and quantum mechanics, which together lead to quantum field theory. Studies of quantum field theories are performed with cosmic ray detectors, neutrino experiments and, most prominently, particle colliders. The largest and most powerful such collider ever created is the Large Hadron Collider (LHC) at CERN, Switzerland, which hosts several different detectors, including the general purpose ATLAS experiment [1], upon which this thesis focuses.

The Standard Model (SM) of particle physics is a quantum field theory that has arisen as the best modern description of nature at its most fundamental level. Though it still fails to explain certain phenomena, such as dark matter or gravity, it accounts for the vast majority of physical interactions underlying reality [2]. As a quantum theory, it is inherently probabilistic in nature, requiring vast statistics to demonstrate the underlying facts.
Managing these statistics is no small feat, and pushes modern technology and computational equipment to their limits [3].

Discoveries in particle physics, such as the recent discovery of the Higgs boson, depend on accurate and precise measurements of other processes, for example those whose detector signature resembles that of the signal of interest. This analysis attempts to improve upon the statistical sensitivity of the measurement of top quark pair ($t\bar{t}$) production with two additional heavy flavour jets (e.g. those originating from bottom or charm hadrons). These events closely resemble the primary detector signature of top quark pair production in conjunction with a Higgs boson, where the Higgs, with a mass of ∼125 GeV, primarily decays to two bottom quarks. In this measurement, events are selected with six or more jets, along with a single lepton, which implies the assumption that just one of the W bosons originating from the top quarks decays leptonically.

The measurement uses a maximum likelihood method, known as a profile likelihood fit [4], to compare the relative proportions of four different likelihood templates, as predicted by simulation, with the proportions that best fit the data. These likelihood templates are generated using multivariate outputs, such as the output of the MV1c neural network bottom-tagging algorithm [5] (in the case of the baseline measurement), or a new multivariate algorithm including additional kinematic variables (for the improved measurement). The purpose is to see what improvements to the statistical power of the measurement, if any, are possible by including additional kinematic variables in a multivariate analysis.

This thesis attempts to build this Standard Model (SM) measurement, as performed with the ATLAS experiment, from the ground up. It gives extra attention to the description of bottom-tagging techniques and multivariate algorithms, as these are the main focus of much of the analysis.
It starts with a rough overview of the background theory needed for the Standard Model and the desired measurement, in chapter 2. The goal of that chapter is to give an intuitive understanding of the chaos created in a typical proton-proton interaction at the LHC, and of the resultant final state particles interacting in the detector, as motivation and as a lead-in to the difficulties associated with attempting an SM measurement.

This is followed by an overview of the primary components of the ATLAS detector, in chapter 3. This chapter includes a brief description of the physical interactions that particles have with the various detector components, and how these interactions are converted to electric signals, which may then be recorded for later offline analysis.

Chapter 4 discusses how various physical objects (e.g. jets) are reconstructed from individual detector hits, while going into some detail on the techniques used to identify jets originating from bottom quarks. Chapter 5 briefly discusses the Monte Carlo (MC) simulations used to simulate the processes of interest and the detector response given various initial state particles. Chapter 6 discusses the various cuts made at reconstruction (detector) and truth (MC) levels, such as selecting only events which occurred with a fully operational detector and were triggered by a well-defined lepton trigger.

Chapter 7 discusses the basics of a multivariate analysis, including an introduction to Boosted Decision Trees (BDTs) and Neural Networks (NNs). This chapter is supplemented by a detailed explanation of these algorithms in appendices A and B.
Chapter 8 describes the primary statistical technique used in this analysis, and details how a cross-section measurement may be determined from a maximum likelihood fit of template histograms.

Chapter 9 gives a baseline measurement of what can be achieved without additional multivariate techniques. This reference measurement simply uses the output from the highest performing bottom-tagging algorithm. Chapter 10 then attempts to improve on this measurement, through the introduction of additional kinematic variables and the use of the machine learning techniques introduced in chapter 7 and appendices A and B. Finally, this thesis concludes in chapter 11 with possible improvements, and with the additional tasks needed to extend this analysis to a complete measurement, such as including sources of systematic uncertainty.

2. Theory

Quantum field theories are generally defined by their underlying gauge symmetry groups. The Standard Model (SM) has underlying gauge symmetry group $SU(3)_C \times SU(2)_L \times U(1)_Y$ [6]. It includes just 19 free parameters (footnote 1), and is remarkably adept at explaining nearly all physical phenomena produced in particle physics experiments. The model consists of 61 fundamental particles, which interact in ways permitted by conservation of energy, of momentum and of their various quantum numbers, such as electric charge and spin. The two major classes of particles are fermions, typically seen as matter with half-integer spin, and bosons, seen as force carriers with integer spin.

The typical perspective is that interactions between fermions are mediated by the gauge bosons. This leads to explanations of three fundamental forces: the strong nuclear force and the electroweak force, the latter of which splits into the electromagnetic and weak nuclear forces through the Higgs mechanism [7, 8].
The Higgs boson, which mediates this mechanism, is a recent discovery and thus still under a large amount of experimental investigation to test whether its interactions all agree with SM predictions.

The gauge symmetry group of the SM defines the theory, by defining the Lagrangian and thereby the motion and interactions of all particles involved. Gauge symmetries enforce the form of the SM interactions through the existence of the gauge bosons, and so without them all fermions would simply follow the free particle Lagrangian [6]:

\[
\mathcal{L}_{\text{free fermion}} = i\bar{\psi}\,\slashed{\partial}\,\psi
\]

Here $\psi$ could be any fermion in the SM. There is no mass term, implying fermions are technically massless, though, as we will see, an effective mass term arises when Yukawa couplings to the Higgs boson are considered [9].

This free theory is rather dull, as nothing ever changes except the location of the fermions. Requiring gauge symmetries forces ordinary derivatives to become covariant derivatives [9]. In the process this leads to interactions between various fermions and the gauge bosons, as we will see shortly. This section will briefly tour the three fundamental forces explained by the SM. This tour will be followed by a brief heuristic description of the chaos created in collisions generated by particle accelerators, followed by the motivation and explanation of the measurement attempted in the rest of this thesis.

Footnote 1: This excludes neutrino flavour oscillations, which explain their non-zero mass. This exclusion, by treating neutrinos as massless, is typical in representations of the SM [6]. Further, since neutrino mass has no impact on any aspect of this analysis, neutrinos will be treated as massless here.

2.1. The Electroweak Force

The electroweak force has $SU(2)_L \times U(1)_Y$ gauge symmetry [7, 8]. This requires three gauge bosons, one for each generator of the $SU(2)$ group, and an additional gauge boson for the generator of the $U(1)$ group. The kinetic energy term in the SM for these gauge fields is given by:

\[
\mathcal{L}_{EW} = -\frac{1}{4} W^{i}_{\mu\nu} W^{\mu\nu}_{i} - \frac{1}{4} B_{\mu\nu} B^{\mu\nu} ,
\]

where

\[
W^{i}_{\mu\nu} = \partial_\nu W^{i}_{\mu} - \partial_\mu W^{i}_{\nu} + g\,\epsilon^{ijk} W^{j}_{\mu} W^{k}_{\nu}
\quad\text{and}\quad
B_{\mu\nu} = \partial_\nu B_\mu - \partial_\mu B_\nu ,
\]

and where $g$ is the $SU(2)$ gauge coupling constant. This gives rise to couplings with fermions from the Lagrangian term:

\[
\mathcal{L}_{\text{EW fermions}} = i\bar{\Psi}_L\,\slashed{D}\,\Psi_L + i\bar{\psi}_R\,\slashed{D}\,\psi_R
\]

with the covariant derivative $\slashed{D}$ acting differently on the left (L) and right (R) chiral projections (footnote 2) of the fermion $\psi$ [9]:

\[
D_\mu \Psi_L = (\partial_\mu + i g W_\mu + i g' Y_L B_\mu)\,\Psi_L , \qquad
D_\mu \psi_R = (\partial_\mu + i g' Y_R B_\mu)\,\psi_R
\]

where $W_\mu$ is shorthand for $T^i W^i_\mu$ and $g'$ is the $U(1)_Y$ coupling constant. The $Y_L$ and $Y_R$ denote the $U(1)$ hypercharge eigenvalues, which are simply $\pm 1$ or $\pm\frac{1}{2}$, depending on convention and on which type of fermion the transformation acts [6].

As can be seen, the $SU(2)$ gauge group couples only to the left chiral projections of fermions. A typical representation is that left-handed fermions are arranged into doublets, such that the left-handed up-type quark is paired with a left-handed down-type quark, and a left-handed charged lepton is paired with a left-handed neutrino [8]. These doublets may then transform within the $SU(2)_L \times U(1)_Y$ gauge group via transformations such as:

\[
\Psi_L \rightarrow \Psi'_L = e^{i Y_L \theta(x)}\, e^{i T^i \beta^i(x)/2}\, \Psi_L
\]

where $T^i = \tau_i/2$, and the $\tau_i$ are the Pauli matrices, denoting the generators of the fundamental representation of $SU(2)$ [7]. Fermions with right-handed chirality are arranged into singlets, which can be seen to transform similarly under this gauge group, but with $T^i = 0$ [8].

All the fermions and bosons introduced thus far are massless, and would travel at the speed of light. Unfortunately, simply introducing a Dirac mass term would couple left- and right-chiral fields. Since these fields have different transformation properties under an $SU(2)$ gauge transformation, such a term would violate gauge invariance, and is thus not permitted.
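
The obstruction can be made explicit with a short worked step. The sketch below uses the standard chiral projectors $P_{L,R} = (1 \mp \gamma^5)/2$, so that $\psi_{L,R} = P_{L,R}\,\psi$; the symbols $P_{L,R}$ and the mass parameter $m$ are introduced here purely for illustration and do not appear elsewhere in the text:

\[
m\,\bar{\psi}\psi
= m\,\bar{\psi}\,(P_L + P_R)\,\psi
= m\,\bar{\psi}\,(P_L^2 + P_R^2)\,\psi
= m\,(\bar{\psi}_R\,\psi_L + \bar{\psi}_L\,\psi_R) ,
\]

where the last step uses $\bar{\psi} P_L = \overline{(P_R \psi)} = \bar{\psi}_R$ and $\bar{\psi} P_R = \bar{\psi}_L$, which follow from $\gamma^5$ anticommuting with $\gamma^0$. Each surviving term pairs a component of an $SU(2)$ doublet with a singlet, so the product cannot be invariant under an $SU(2)_L$ rotation that acts on the doublet alone.
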
Instead, fermions and the electroweak gauge bosons gain their masses through a form of spontaneous symmetry breaking [8].

Footnote 2: The left chiral projection of a fermion $\psi$ is $\psi_L = (1 - \gamma^5)\psi/2$, whereas the right projection is $\psi_R = (1 + \gamma^5)\psi/2$.

2.2. The Higgs Mechanism and Electroweak Symmetry Breaking

Introducing a complex scalar doublet with the most general Lagrangian possible, as permitted by $SU(2) \times U(1)$ invariance and conservation of probability, we can write [8]:

\[
\mathcal{L}_{\text{Higgs}} = (D_\mu \phi)^\dagger (D^\mu \phi) - V(\phi) , \tag{1}
\]

with

\[
D_\mu \phi = \left(\partial_\mu + i g T^i W^i_\mu + \frac{i}{2} g' B_\mu\right)\phi , \tag{2}
\]

and

\[
V(\phi) = -\mu^2 \phi^\dagger \phi + \lambda (\phi^\dagger \phi)^2
\]

being the most general permitted potential. The $\lambda$ term describes quartic self-interactions among the scalar fields, and as a result $\lambda$ must be greater than zero in order for the vacuum to be stable.

If $\mu^2 < 0$, the minimal energy vacuum expectation value (VEV) of this scalar field would remain at zero, and all the above $SU(2) \times U(1)$ symmetries would remain unbroken [7]. However, for $\mu^2 > 0$, the potential remains symmetric around zero but the field has a non-zero VEV. In this case, the potential takes on a sombrero shape (footnote 3) in complex $SU(2)$ phase space. Spontaneously going from $\phi = 0$ to the minimum of the potential and re-writing the Lagrangian around this new minimum is what breaks the $SU(2)$ symmetry. The direction of the minimum in complex $SU(2)_L$ space is not determined, as the potential depends only on $\phi^\dagger \phi$.

Footnote 3: Also known as the "Mexican hat" potential.

Taking $v$ to be the VEV of this complex scalar doublet, we can choose to work in the unitary gauge and write the expectation value of the Higgs doublet as:

\[
\langle \phi \rangle = \frac{1}{\sqrt{2}} \begin{pmatrix} 0 \\ v \end{pmatrix} .
\]

Substituting this back into equations 1 and 2, we generate the gauge boson masses:

\[
(D_\mu \phi)^\dagger (D^\mu \phi) = \frac{v^2}{8}\left[ g^2 \left( (W^1_\mu)^2 + (W^2_\mu)^2 \right) + \left( g W^3_\mu - g' B_\mu \right)^2 \right] . \tag{3}
\]

The charged vector bosons, $W^+$ and $W^-$, are defined as [8]:

\[
W^\pm_\mu \equiv \frac{1}{\sqrt{2}} \left( W^1_\mu \mp i W^2_\mu \right) ,
\]

where, from the $g^2$ term in equation 3, we see the $W$ mass, $m_W$, present itself as:

\[
\frac{1}{2}\left(\frac{g v}{2}\right)^2 W^\dagger_\mu W^\mu = \frac{1}{2} m_W^2 W^2 ,
\]

so that $m_W = g v / 2$.

The neutral vector boson, $Z$, is defined as [8]:

\[
Z_\mu \equiv \frac{1}{\sqrt{g^2 + g'^2}} \left( g W^3_\mu - g' B_\mu \right)
\]

which, from the remaining term in equation 3, gives it a mass:

\[
m_Z = \frac{v}{2} \sqrt{g^2 + g'^2} .
\]

The electromagnetic gauge boson remains massless, and takes the form [8]:

\[
A_\mu \equiv \frac{1}{\sqrt{g^2 + g'^2}} \left( g' W^3_\mu + g B_\mu \right) .
\]

This $U(1)_Q$ gauge symmetry (with $Q$ indicating electric charge) thus remains unbroken, as desired to yield the correct electromagnetic interactions and the massless photon.

The masses for fermions are generated similarly. For the leptons, for example, we have [8]:

\[
e_L = \begin{pmatrix} \nu_e \\ e^- \end{pmatrix}_L \qquad
\mu_L = \begin{pmatrix} \nu_\mu \\ \mu^- \end{pmatrix}_L \qquad
\tau_L = \begin{pmatrix} \nu_\tau \\ \tau^- \end{pmatrix}_L
\]

for the left-handed doublets. These have Yukawa couplings with the Higgs doublet and right-handed lepton singlet, generating an effective mass. This is seen from:

\[
\mathcal{L}_{\text{Yukawa lepton}} = \Gamma_l\, \bar{l}_L\, \phi\, l_R + \text{h.c.}
\]

where $\Gamma_l$ is the Yukawa coupling strength between the lepton $l$ and the Higgs $\phi$. Working in the unitary gauge as before, it is easily seen how this generates an effective mass term for the charged lepton $l^-$, as this gauge picks out of the doublet $l_L$ the same charged-lepton component that appears in the singlet $l_R$ [6].

The Higgs particle itself, $h$, is viewed as energetic fluctuations about the VEV, $v$. Still working in the unitary gauge, this can be written as [8]:

\[
\phi = \frac{1}{\sqrt{2}} \begin{pmatrix} 0 \\ v + h \end{pmatrix}
\]

From this, following similar steps as above, it is a straightforward matter to work out the couplings between the Higgs particle, $h$, and the various fermions and gauge bosons.

2.3. The Strong Force

The strong nuclear force obeys $SU(3)_C$ symmetry (C for colour), with eight gauge bosons, one for each of the eight generators of the symmetry group. Among the fermions, these bosons, termed gluons, couple only to quarks, which have a quantum number referred to as colour, with three colours for quarks, and three corresponding anti-colours for anti-quarks [6].
Like the $SU(2)$ gauge bosons, the gluons have a kinetic energy term such that:

\[
\mathcal{L}_{\text{gluons}} = -\frac{1}{4} G^{a}_{\mu\nu} G^{\mu\nu}_{a}
\]

with

\[
G^{a}_{\mu\nu} = \partial_\nu G^{a}_{\mu} - \partial_\mu G^{a}_{\nu} + g_S f^{abc} G^{b}_{\mu} G^{c}_{\nu}
\]

where the roman indices now run from 1 to 8, rather than just 1 to 3 as in the $SU(2)$ case. This additional gauge field causes the covariant derivative acting on quarks to gain an extra term (footnote 4):

\[
D^{\text{quarks}}_\mu q_L = \left(\partial_\mu + i g T^i W^i_\mu + \frac{i}{2} g' B_\mu - i g_S T^a G^a_\mu\right) q_L
\]

and similarly for the right-handed singlets, which omit the $SU(2)$ term [8].

Mass terms for quarks are again generated through $SU(2)$ symmetry breaking, by a Yukawa coupling to the Higgs doublet, as in the lepton case [8]. However, quark flavour eigenstates are not equal to the $SU(2)$ electroweak eigenstates [10]. This causes mixing between families, with couplings as given by the CKM matrix, shown in figure 1.

\[
\begin{pmatrix} d' \\ s' \\ b' \end{pmatrix} =
\begin{pmatrix} V_{ud} & V_{us} & V_{ub} \\ V_{cd} & V_{cs} & V_{cb} \\ V_{td} & V_{ts} & V_{tb} \end{pmatrix}
\begin{pmatrix} d \\ s \\ b \end{pmatrix}
\]

\[
V_{\text{CKM}} =
\begin{pmatrix}
0.97428 \pm 0.00015 & 0.2253 \pm 0.0007 & 0.00347^{+0.00016}_{-0.00012} \\
0.2252 \pm 0.0007 & 0.97345^{+0.00015}_{-0.00016} & 0.0410^{+0.0011}_{-0.0007} \\
0.00862^{+0.00026}_{-0.00020} & 0.0403^{+0.0011}_{-0.0007} & 0.999152^{+0.000030}_{-0.000045}
\end{pmatrix}
\]

Figure 1: The CKM matrix gives the electroweak mixing between different quark families. $d'$, $s'$ and $b'$ are the down-type quark electroweak doublet partners, and $d$, $s$ and $b$ are the flavour eigenstates. The values of the CKM matrix (given in the second expression) are related to the square root of the probabilities of transitioning between quark families via the exchange of a W boson [10].

The bare couplings of a theory which appear in the Lagrangian are not directly observable in nature, and instead what is seen are effective couplings. Further, unlike in electroweak theory, where the strength of the interaction decreases with the separation between interacting particles, in QCD the strength of the effective coupling can be expressed as:

\[
\alpha_S = \frac{1}{b \cdot \ln\left(\mu^2 / \lambda^2_{\text{QCD}}\right)}
\]

where $\mu$ is the energy scale of the interaction, $b$ is a constant dependent on the number of quarks and $\lambda_{\text{QCD}}$ is a constant defining the scale of QCD interactions [6].
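
As a rough numerical illustration of this expression, the following minimal sketch evaluates the effective coupling at a few energy scales. It rests on assumptions that are not taken from this thesis: the standard one-loop textbook form $b = (33 - 2 n_f)/(12\pi)$, an illustrative $\lambda_{\text{QCD}} = 0.2$ GeV, and a fixed $n_f = 5$ active quark flavours with no flavour-threshold matching. The function names are hypothetical.

```python
import math


def one_loop_b(n_flavours: int) -> float:
    """One-loop coefficient b = (33 - 2 n_f) / (12 pi) (assumed textbook form)."""
    return (33 - 2 * n_flavours) / (12 * math.pi)


def alpha_s(mu_gev: float, lambda_qcd_gev: float = 0.2, n_flavours: int = 5) -> float:
    """Effective strong coupling alpha_S = 1 / (b * ln(mu^2 / lambda_QCD^2)).

    lambda_qcd_gev = 0.2 and n_flavours = 5 are illustrative assumptions only.
    """
    if mu_gev <= lambda_qcd_gev:
        # The one-loop formula diverges at mu = lambda_QCD and is meaningless below it.
        raise ValueError("mu must be larger than lambda_QCD")
    return 1.0 / (one_loop_b(n_flavours) * math.log(mu_gev**2 / lambda_qcd_gev**2))


if __name__ == "__main__":
    # The coupling shrinks as the energy scale grows.
    for mu in (1.0, 10.0, 91.2, 1000.0):
        print(f"mu = {mu:7.1f} GeV   alpha_S = {alpha_s(mu):.3f}")
```

With these assumed inputs the coupling is of order one near 1 GeV and falls to roughly a tenth at the electroweak scale. A realistic determination would use higher-order corrections and measured inputs.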
This shows that the strength of the interaction decreases with energy, which gives rise to the features known as colour confinement and asymptotic freedom. A plot showing the effective coupling at various energies is given in figure 2.

Footnote 4: Technically eight terms, as there is a separate term for each generator.

Figure 2 (left panel: pre-LHC; right panel: ATLAS): A plot showing the running coupling of the QCD coupling "constant." As can be seen, the coupling blows up at low energies, rendering perturbation techniques useless. Conversely, the coupling decreases at high energies, allowing perturbative calculations of hard processes at LHC-like energies. The figure on the left shows pre-LHC determinations of $\alpha_S$ for various energies, with the right figure showing the latest ATLAS results, for various absolute rapidities $|y|$ [11–13].

Colour confinement is the term coined for the empirically established fact that quarks cannot be observed as free particles. Because the force between coloured particles increases as the distance between them increases, at some point it becomes more energetically favourable for the separated quarks to bond with a new quark pair generated from the vacuum [11]. Conversely, asymptotic freedom implies that when quarks are at high energy, despite being in a bound state, they behave essentially as free particles [2]. These effects, which are really two sides of the same coin, make low energy QCD measurements extremely difficult, as perturbation theory no longer applies when the coupling constant is of order one.

2.4. Hadron Collisions

Due to the complex nature of the proton, which is not a fundamental particle, high energy proton collisions at the LHC create hugely chaotic events, with many particles going in many different directions. The ATLAS detector, described in chapter 3, is designed to detect and convert momentum and energy signatures from the various possible particles into electric signals, to be read out by a computing farm.
This allows particle physics measurements to be performed as Poisson counting experiments, assuming nearly-identical detector conditions for each collision. This section will briefly introduce the fundamental phenomenologies of these particles, and the motivation behind the measurement attempted in this thesis.

The particles which have the most straightforward detector response are leptons and photons. With the exception of the tau, these particles tend not to decay on their own in the detector, and instead cause a response due to interactions with the detector material and magnetic fields. Electrons and photons tend to create electromagnetic showers in the calorimeter, via Bremsstrahlung radiation and electron pair production, as they repeatedly elastically scatter off the detector material. Muons, although unstable, have such long lifetimes and low interaction cross-sections that they tend to escape the entirety of the detector before decaying. Further, neutrinos only interact weakly, and so remain unobserved by the detector even in the best of conditions. The tau decays fast enough that the decay products are observed instead.

On the other hand, quarks are never observed directly, and instead we must detect the hadrons they form. When a quark or gluon (i.e. a parton) is produced, it often emits somewhat collinear gluons, which in turn can produce qq̄ pairs [14]. Then, due to colour confinement, these gluons and quarks hadronize. The hadronization process is inherently in the non-perturbative regime of QCD, further complicating the reconstruction process [15]. This process gives rise to the phenomenon known as jets. Jets make it enormously difficult to isolate the original chromodynamic object, and underlie the majority of the difficulty in obtaining clean measurements in QCD-dominated collisions.

A notable exception to this rule is the top quark, which decays so quickly that it is unable to form hadrons. Instead, it is the decay products which are observed in the detector.
As seen from the CKM matrix above (figure 1), the top almost always decays to a bottom (b) quark and a W boson. Thus none of the hadronic jet complications are avoided, due to the presence of the bottom quarks. In fact, they are compounded by the added production of W bosons, which either decay leptonically, leaving a charged lepton along with missing energy and momentum as a result of the neutrino, or decay to quark pairs, generating yet further hadronic jets.

2.5. Motivation for Precision tt̄ + bb̄ Measurement

A Higgs-like boson has been discovered at a mass of 125 GeV, with measured decays to the weak vector bosons, two photons (most commonly through a quark loop), and to two tau leptons [16]. However, at this mass, its predicted primary decay mode is to two bottom quarks, which was only recently directly measured [17]. Further, as the top quark is the most massive particle in the standard model, Higgs production with associated top quark pair production has been calculated to have a large cross section. As a result, the cleanest signature that allows us to directly access the Higgs-to-quark Yukawa coupling is when the Higgs is produced in association with a top quark pair. The largest branching ratio of the Higgs decay is to two bottom quarks, as shown in figure 3, making tt̄ + H with H → bb̄ a very important measurement to pursue.
This measurement is also important for a variety of theoretical reasons, such as determining the stability of the electroweak vacuum [18], or determining whether the measured Higgs is part of a larger model, such as supersymmetry [19].

As can be seen in [17], the single largest obstacle to improving the sensitivity of this search arises from the knowledge of the tt̄ + bb̄ normalization, with additional uncertainty as a result of the tt̄ + cc̄ normalization. Theoretical calculations of these events pose an enormous computational difficulty, for both humans and computers, due to the difficulties encountered when attempting to use perturbation theory on QCD [19]. As these tt̄ events with additional heavy flavour jets have final state detector signatures extremely similar to tt̄ + H production, without good constraints on these backgrounds it is difficult to obtain an accurate measurement of the associated Higgs production.

Figure 3: Representative tree-level Feynman diagrams for the production of the Higgs boson in association with a top-quark pair (tt̄H) and the subsequent decay of the Higgs to a bb̄ pair, (a) and (b), and for the main background of tt̄ + bb̄ [17].

The rate of production of particles is given in terms of a production cross-section, usually symbolized by σ. In many recent measurements, since a large fraction of events fall outside the detector acceptance, these production cross-sections are typically given in terms of a fiducial cross-section. This instead gives the cross-section for what ought to be seen by the detector, i.e. where the decay products end up in a region of momentum-space accessible by the detector in terms of physical angle as well as energy and momentum. This can then be extrapolated to an inclusive cross section through a simple fiducial efficiency extrapolation factor.
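As a minimal numerical sketch of the fiducial cross-section (equation 4), the signal yield is divided by the product of the integrated luminosity and the fiducial efficiency. The function name, the signal yield and the efficiency used below are hypothetical and chosen only for illustration; the luminosity of 20.3 fb⁻¹ is the 2012 dataset size quoted in the abstract.

```python
def fiducial_cross_section(n_signal: float, luminosity_inv_fb: float,
                           fiducial_efficiency: float) -> float:
    """Return sigma_fid = N_sig / (L * eps_fid), in fb.

    n_signal            -- number of signal events (hypothetical input)
    luminosity_inv_fb   -- integrated luminosity in fb^-1
    fiducial_efficiency -- fraction of fiducial events selected, in (0, 1]
    """
    if luminosity_inv_fb <= 0:
        raise ValueError("luminosity must be positive")
    if not 0 < fiducial_efficiency <= 1:
        raise ValueError("efficiency must lie in (0, 1]")
    return n_signal / (luminosity_inv_fb * fiducial_efficiency)


# Hypothetical yield and efficiency, for illustration only.
sigma_fid = fiducial_cross_section(n_signal=1000.0,
                                   luminosity_inv_fb=20.3,
                                   fiducial_efficiency=0.5)
print(f"sigma_fid = {sigma_fid:.1f} fb")
```

Because the luminosity is given in fb⁻¹, the result comes out in fb; a lower efficiency for the same observed yield implies a larger fiducial cross-section.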
The equation for this fiducial cross-section is straightforward:

\[ \sigma_{fid} = \frac{N_{sig}}{L \cdot \epsilon_{fid}} \qquad (4) \]

where N_sig is the number of signal events of the event type you are trying to measure, L is the luminosity (not the Lagrangian, which uses the same symbol, above), and ε_fid is the fiducial efficiency. In this case, N_sig can be a large variety of “signal” events, designated as signal simply because they are sought-after events.

In the case of this thesis, an attempt is made to improve on the measurement of tt̄ production in conjunction with two additional jets, of various heavy flavours. When looking at events that pass our basic requirement of six jets and one lepton, we get seven distinct event types (where, for convenience, the bar notation will be suppressed):

1. tt + bb
2. tt + bc
3. tt + bj
4. tt + cc
5. tt + cj
6. tt + jj
7. All other tt events.

In the above listing, j refers to any jet of lighter flavour than charm. Using Monte Carlo predictions (discussed in chapter 5) for the detector response of each of these event types, event likelihood templates will be generated for each event type. These will then be used in a profile likelihood fit, as discussed in chapter 8. However, with a large amount of similarity between many of these templates, it is impractical to attempt to fit all seven event templates as-is. To combat this issue, similar templates were combined to make four templates for use in the final fit:

1. tt + bb
2. tt + bj (now includes tt + bc)
3. tt + cj (now includes tt + cc)
4. tt + jj (now includes all other tt events)

This thesis attempts to improve the statistical sensitivity of measuring each of these four event types. Machine learning techniques (introduced in chapter 7) are used to generate distinct likelihood templates for each of these four event types. These techniques are compared to the baseline measurement (chapter 9), achievable simply by using the output from the continuous MV1c b-tagger (described in chapter 4).

3. The ATLAS Detector and the Large Hadron Collider

The Large Hadron Collider (LHC) is the largest and most powerful particle accelerator ever created by mankind. Built at CERN under the border of Switzerland and France, it is designed to accelerate protons or lead ions to ultra-relativistic energies. It has four separate collision points, each of which has its own detector built around it: the general purpose detectors CMS and ATLAS, along with the more specialized detectors, LHCb and ALICE. It is 27 km in circumference, and has a maximum rated center-of-mass (CM) energy of √s = 14 TeV [3]. This analysis deals specifically with proton-proton collisions at CM energy √s = 8 TeV, as those are the conditions of the data collected in 2012 by the ATLAS detector.

This section briefly describes the ATLAS detector, and heuristically describes some of the physics behind the various detection methods. Most of the descriptions in this section are taken from the ATLAS technical design reports, found in [1] and [3]. Descriptions of the sub-detectors also rely on smaller technical reports, cited at the beginning of each section.

Figure 4: Cut-away view of the ATLAS detector. The dimensions of the detector are 25 m in height and 44 m in length. The overall weight of the detector is approximately 7000 tonnes [1].

The ATLAS detector, pictured in figure 4, is nominally forward-backward symmetric with respect to the interaction region. The magnet configuration comprises a 2 T superconducting solenoid surrounding the inner-detector cavity, and large superconducting toroidal systems (separately for the barrel and two end-caps) arranged with an eight-fold azimuthal symmetry around the calorimeters.

Pattern recognition, momentum and vertex measurements, and electron identification are achieved with a combination of discrete, high-resolution semiconductor pixel and strip detectors in the inner part of the tracking volume, and straw-tube tracking detectors with the capability to generate and detect transition radiation in its outer part.

High granularity liquid-argon (LAr) electromagnetic sampling calorimeters, with excellent performance in terms of energy and position resolution, cover the pseudorapidity range |η| < 3.2. The hadronic calorimetry in the range |η| < 1.7 is provided by a scintillator-tile calorimeter, which is separated into a large barrel and two smaller extended barrel cylinders, one on either side of the central barrel. In the end-caps (|η| > 1.5), LAr technology is also used for the hadronic calorimeters, matching the outer |η| limits of the end-cap electromagnetic calorimeters. The LAr forward calorimeters provide both electromagnetic and hadronic energy measurements, and extend the pseudorapidity coverage to |η| = 4.9.

The calorimeter is surrounded by the muon spectrometer. The air-core toroid system, with a long barrel and two inserted end-cap magnets, generates strong bending power in a large volume within a light and open structure. Multiple-scattering effects are thereby minimised, and excellent muon momentum resolution is achieved with three layers of high precision tracking chambers. The muon instrumentation includes, as a key component, trigger chambers with timing resolution of the order of 1.5-4 ns.

The general performance of the ATLAS detector is given in table 1. This table lists the minimum performance resolution of each sub-detector, along with the η coverage. A fiducial region in the detector of |η| < 2.5 is used for this analysis, in line with the fiducial region for a typical high-precision ATLAS measurement.

Table 1: General performance of the ATLAS detector. Note that, for high-pT muons, the muon-spectrometer performance is independent of the inner-detector system. The units for E and pT are in GeV [1].

3.1. Inner Detector

The ATLAS Inner Detector makes precise measurements of the location of charged particles.
Through a variety of reconstruction algorithms (described fully in [20]), particle tracks are determined from the various hit-points a charged particle makes when traversing the detectors. The momentum of the particle can be determined by measuring the curvature of the track in the solenoidal magnetic field, and velocity information on the particle can be learned from the amount of transition radiation produced in the Transition Radiation Tracker (TRT). The descriptions in this section closely follow what is detailed in [21], [22] and [23].

Figure 5: Cut-away view of the ATLAS inner detector. The right panel shows the sensors and structural elements traversed by a charged track of 10 GeV pT in the barrel inner detector (η = 0.3). The track traverses successively the beryllium beam-pipe, the three cylindrical silicon-pixel layers with individual sensor elements of 50 × 400 µm², the four cylindrical double layers (one axial and one with a stereo angle of 40 mrad) of barrel silicon-microstrip sensors (SCT) of pitch 80 µm, and approximately 36 axial straws of 4 mm diameter contained in the barrel transition-radiation tracker modules within their support structure [1].

The Inner Detector is designed for precision tracking of charged particles with 40 MHz bunch crossing identification. It combines tracking straw tubes in the outer transition-radiation tracker (TRT), the microstrip detectors of the semiconductor tracker (SCT) at intermediate radii, and the Pixel Detector, the crucial part for vertex reconstruction, as the innermost component.

The Pixel Detector is subdivided into three barrel layers and three disks on either side for the forward direction end-cap. This layout results in a three-hit system for particles with |η| < 2.5. The main components are approximately 1700 identical modules, corresponding to a total of 8 × 10⁷ pixels. The modules consist of a package composed of sensors and readout-chips mounted on a hybrid.
They have to be radiation hard to an ATLAS lifetime dose of 50 MRad or 10¹⁵ neutron-equivalent.

The SCT consists of 4088 modules of silicon-strip detectors arranged in four concentric barrels (2112 modules) and two endcaps of nine disks each (988 modules per endcap), as shown in figure 5. Each barrel or disk provides two strip measurements at a stereo angle which are combined to build space-points. The SCT typically provides eight strip measurements (four space-points) for particles originating in the beam-interaction region. The barrel modules are of a uniform design, with strips approximately parallel to the magnetic field and beam axis. Each module consists of four rectangular silicon-strip sensors with strips with a constant pitch of 80 µm; two sensors on each side are daisy-chained together to give 768 strips of approximately 12 cm in length. A second pair of identical sensors is glued back-to-back with the first pair at a stereo angle of 40 mrad. The modules are mounted on cylindrical supports such that the module planes are at an angle to the tangent to the cylinder of 11° for the inner two barrels and 11.25° for the outer two barrels, and overlap by a few millimetres to provide a hermetic tiling in azimuth.

Figure 6: Drawing showing the sensors and structural elements traversed by two charged tracks of 10 GeV pT in the end-cap inner detector (η = 1.4 and 2.2). The end-cap track at η = 1.4 traverses successively the beryllium beam-pipe, the three cylindrical silicon-pixel layers with individual sensor elements of 50 × 400 µm², four of the disks with double layers (one radial and one with a stereo angle of 40 mrad) of end-cap silicon-microstrip sensors (SCT) of pitch ∼80 µm, and approximately 40 straws of 4 mm diameter contained in the end-cap transition radiation tracker wheels.
In contrast, the end-cap track at η = 2.2 traverses successively the beryllium beam-pipe, only the first of the cylindrical silicon-pixel layers, two end-cap pixel disks and the last four disks of the end-cap SCT. The coverage of the end-cap TRT does not extend beyond |η| = 2 [21].

Each SCT endcap disk consists of up to three rings of modules with trapezoidal sensors. The strip direction is radial with constant azimuth and a mean pitch of 80 µm. As in the barrel, sensors are glued back-to-back at a stereo angle of 40 mrad to provide space-points. Modules in the outer and middle rings consist of two daisy-chained sensors on each side, whereas those in the inner rings have one sensor per side.

The pixel detector and SCT both operate by doping narrow strips of silicon to turn them into diodes, which are then reverse-biased. As ionizing radiation traverses the detector, it produces free electrons and holes, which migrate to the electrodes, producing an external pulse [24]. This pulse is then stored on-board the tracker module until a decision is received from the L1 trigger, before being passed on or discarded.

The TRT consists of 370,000 cylindrical drift tubes (straws). Each straw (4 mm diameter, made of Kapton with a conductive coating) acts as a cathode and is kept at high voltage of negative polarity. In the centre of the straw there is a 30 µm diameter gold-plated tungsten sense wire. The layers of straws are interleaved with the radiators (polypropylene foils or fibres). The straws are filled with a gas mixture based on xenon (70%) - for good X-ray absorption - with the addition of CO2 (27%) and O2 (3%) to increase the electron drift velocity and for photon-quenching.

The TRT has three major modular components: the barrel and two endcaps. The TRT barrel consists of three layers of 32 modules each. The straws are 144 cm long and are parallel to the beam, occupying the region between 56 < r < 107 cm and |z| < 72 cm, corresponding to a pseudorapidity coverage of |η| < 0.7. The wires are electrically split in the centre and are read out at both ends, thus reducing the occupancy but doubling the number of electronic channels.

Table 2: Parameters of the Inner Detector. The resolutions quoted are typical values (the actual resolution in each detector depends on the impact angle) [3].

In the end-caps, the straws are radially arranged in 18 units per side called wheels, for a total of 224 layers of straws on each side (83 < |z| < 340 cm). The straws extend between radii of 63 cm and 103 cm, except in the last four wheels (64 layers), where the straws extend between radii of 47 cm and 100 cm, thus covering a pseudorapidity range of 0.7 < |η| < 2.5.

Transition radiation is produced when a relativistic charged particle crosses the interface of two media of differing dielectric constants (in this case, provided by the polypropylene foils or fibres between straws). The probability of emitting transition radiation photons increases with the γ-factor, and so gives a direct measure of the particle’s velocity (and therefore energy). As with any other source of radiation from relativistic particles, this radiation is emitted in a very forward direction, within an angle of order 1/γ [25]. As ionizing particles traverse the TRT, the transition radiation generated as they cross straw boundaries is picked up by the central wire electrodes. This generates an external pulse that is again saved on-board until a decision is received from the trigger.

3.2. Calorimetry

All calorimetry in ATLAS is performed through the use of sampling calorimeters, consisting of inter-layered absorber and active regions. Passive absorber materials with a high number of nucleons (e.g. lead) are used to create sprays of lower energy particles, whose energies are then measured in an active medium, most commonly liquid argon (LAr).
In the electromagnetic (EM) calorimeter, these sprays result primarily from electron pair-production when a photon recoils off a nucleon, or from Bremsstrahlung radiation produced from the nucleon recoils of charged particles. In the hadronic calorimeter, these sprays are much more complex, as they often include inelastic scattering with the nucleus. This typically results in large showers of protons, neutrons, pions, and electromagnetic particles (alpha particles, photons), or even larger hadrons resulting from, e.g., fission of the original nucleus. The types and lengths of the absorber and active materials are chosen to provide good energy resolution for high-energy jets (see table 1), while reducing punch-through into the muon spectrometer to well below the irreducible level of prompt or decay muons.

This section largely follows the description in [1]. A view of the ATLAS calorimeters is presented in figure 7. The calorimetry consists of an electromagnetic (EM) calorimeter covering the pseudorapidity region |η| < 3.2, a hadronic barrel calorimeter covering |η| < 1.7, hadronic end-cap calorimeters covering 1.5 < |η| < 3.2, and forward calorimeters covering 3.1 < |η| < 4.9.

Figure 7: Cut-away view of the ATLAS calorimeter system [1].

The EM calorimeter is divided into a barrel part (|η| < 1.475) and two end-cap components (1.375 < |η| < 3.2), each housed in their own cryostat. The central solenoid and the LAr calorimeter share a common vacuum vessel, thereby eliminating two vacuum walls. The barrel calorimeter consists of two identical half-barrels, separated by a small gap (4 mm) at z = 0. Each end-cap calorimeter is mechanically divided into two coaxial wheels: an outer wheel covering the region 1.375 < |η| < 2.5, and an inner wheel covering the region 2.5 < |η| < 3.2. The EM calorimeter is a lead-LAr detector with accordion-shaped Kapton electrodes and lead absorber plates over its full coverage.
The accordion geometry provides complete φ symmetry without azimuthal cracks.

In the region of |η| < 1.8, a presampler detector is used to correct for the energy lost by electrons and photons upstream of the calorimeter. The presampler consists of an active LAr layer of thickness 1.1 cm (0.5 cm) in the barrel (end-cap) region.

The hadronic tile calorimeter is placed directly outside the EM calorimeter envelope. Its barrel covers the region |η| < 1.0, and its two extended barrels the range 0.8 < |η| < 1.7. It is a sampling calorimeter using steel as the absorber and scintillating tiles as the active material. The barrel and extended barrels are divided azimuthally into 64 modules. Radially, the tile calorimeter extends from an inner radius of 2.28 m to an outer radius of 4.25 m.

The Hadronic End-cap Calorimeter (HEC) consists of two independent wheels per end-cap, located directly behind the end-cap electromagnetic calorimeter and sharing the same LAr cryostats. To reduce the drop in material density at the transition between the end-cap and the forward calorimeter (around |η| = 3.1), the HEC extends out to |η| = 3.2, thereby overlapping with the forward calorimeter. Similarly, the HEC η range also slightly overlaps that of the tile calorimeter (|η| < 1.7) by extending to |η| = 1.5. Each wheel is built from 32 identical wedge-shaped modules, assembled with fixtures at the periphery and at the central bore. Each wheel is divided into two segments in depth, for a total of four layers per end-cap. The wheels closest to the interaction point are built from 25 mm parallel copper plates, while those further away use 50 mm copper plates (for all wheels the first plate is half-thickness). The outer radius of the copper plates is 2.03 m, while the inner radius is 0.475 m (except in the overlap region with the forward calorimeter where this radius becomes 0.372 m).
The copper plates are interleaved with 8.5 mm LAr gaps, providing the active medium for this sampling calorimeter.

The Forward Calorimeter (FCal) is integrated into the end-cap cryostats, as this provides clear benefits in terms of uniformity of the calorimetric coverage as well as reduced radiation background levels in the muon spectrometer. The FCal consists of three modules in each end-cap: the first, made of copper, is optimised for electromagnetic measurements, while the other two, made of tungsten, measure predominantly the energy of hadronic interactions. Each module consists of a metal matrix, with regularly spaced longitudinal channels filled with the electrode structure consisting of concentric rods and tubes parallel to the beam axis. The LAr in the gap between the rod and the tube is the sensitive medium. This geometry allows for excellent control of the gaps, which are as small as 0.25 mm in the first section.

3.3. Muon Spectrometer

The conceptual layout of the muon spectrometer is shown in figure 8 and the main parameters of the muon chambers are listed in table 3. It is based on the magnetic deflection of muon tracks in the large superconducting air-core toroid magnets, instrumented with separate trigger and high-precision tracking chambers. Over the range |η| < 1.4, magnetic bending is provided by the large barrel toroid. For 1.6 < |η| < 2.7, muon tracks are bent by two smaller end-cap magnets inserted into both ends of the barrel toroid. Over 1.4 < |η| < 1.6, usually referred to as the transition region, magnetic deflection is provided by a combination of barrel and end-cap fields. This magnet configuration provides a field which is mostly orthogonal to the muon trajectories, while minimising the degradation of resolution due to multiple scattering.
This section closely follows the technical report [1].

In the barrel region, tracks are measured in chambers arranged in three cylindrical layers around the beam axis; in the transition and end-cap regions, the chambers are installed in planes perpendicular to the beam, also in three layers.

Figure 8: Cut-away view of the ATLAS muon system [1].

Over most of the η-range, a precision measurement of the track coordinates in the principal bending direction of the magnetic field is provided by Monitored Drift Tubes (MDTs). The mechanical isolation in the drift tubes of each sense wire from its neighbours guarantees a robust and reliable operation. At large pseudorapidities, Cathode Strip Chambers (CSCs, which are multi-wire proportional chambers with cathodes segmented into strips) with higher granularity are used in the innermost plane over 2 < |η| < 2.7, to withstand the demanding rate and background conditions.

The trigger system covers the pseudorapidity range |η| < 2.4. Resistive Plate Chambers (RPCs) are used in the barrel and Thin Gap Chambers (TGCs) in the end-cap regions. The trigger chambers for the muon spectrometer serve a threefold purpose: provide bunch-crossing identification, provide well-defined pT thresholds, and measure the muon coordinate in the direction orthogonal to that determined by the precision-tracking chambers.

Table 3: Main parameters of the muon spectrometer. Numbers in brackets for the MDTs and the RPCs refer to the final configuration of the detector in 2009 [1].

3.4. Trigger System

The Trigger and Data Acquisition (collectively TDAQ) systems, the timing- and trigger-control logic, and the Detector Control System (DCS) are partitioned into sub-systems, typically associated with sub-detectors, which have the same logical components and building blocks. The trigger system has three distinct levels: L1, L2, and the event filter. Each trigger level refines the decisions made at the previous level and, where necessary, applies additional selection criteria.
The data acquisition system receives and buffers the event data from the detector-specific readout electronics, at the L1 trigger accept rate, over 1600 point-to-point readout links. The first level uses a limited amount of the total detector information to make a decision in less than 2.5 µs, reducing the rate to about 75 kHz. The two higher levels access more detector information for a final rate of up to 200 Hz with an event size of approximately 1.3 Mbyte [1].

The L1 trigger searches for high transverse-momentum muons, electrons, photons, jets, and τ-leptons decaying into hadrons, as well as large missing and total transverse energy. Its selection is based on information from a subset of detectors. High transverse-momentum muons are identified using trigger chambers in the barrel and end-cap regions of the spectrometer. Calorimeter selections are based on reduced-granularity information from all the calorimeters. Results from the L1 muon and calorimeter triggers are processed by the central trigger processor, which implements a trigger menu made up of combinations of trigger selections. Pre-scaling of trigger menu items is also available, allowing optimal use of the bandwidth as luminosity and background conditions change. Events passing the L1 trigger selection are transferred to the next stages of the detector-specific electronics and subsequently to the data acquisition via point-to-point links.

In each event, the L1 trigger also defines one or more Regions-of-Interest (RoIs), i.e. the geographical coordinates in η and φ of those regions within the detector where its selection process has identified interesting features. The RoI data include information on the type of feature identified and the criteria passed, e.g. a threshold. This information is subsequently used by the high-level trigger.

The L2 selection is seeded by the RoI information provided by the L1 trigger over a dedicated data path.
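The rate reduction achieved by the three trigger levels can be made concrete with a short Python sketch. The L1, L2 and event filter output rates and the event size are the approximate figures quoted in this section; treating the 40 MHz bunch-crossing rate quoted for the Inner Detector as the input rate to L1 is an assumption made here for illustration, and the variable and function names are hypothetical.

```python
# Approximate output rate of each stage, in Hz, as quoted in the text.
# Assumption: the 40 MHz bunch-crossing rate is the rate presented to L1.
STAGE_OUTPUT_RATES_HZ = {
    "bunch crossings": 40e6,
    "L1": 75e3,
    "L2": 3.5e3,
    "event filter": 200.0,
}
EVENT_SIZE_MBYTE = 1.3  # approximate size of one recorded event


def rejection_factors(rates_hz):
    """Return {stage: input rate / output rate} for every stage after the first."""
    names = list(rates_hz)
    return {
        names[i]: rates_hz[names[i - 1]] / rates_hz[names[i]]
        for i in range(1, len(names))
    }


factors = rejection_factors(STAGE_OUTPUT_RATES_HZ)
total_rejection = (STAGE_OUTPUT_RATES_HZ["bunch crossings"]
                   / STAGE_OUTPUT_RATES_HZ["event filter"])
output_bandwidth_mbyte_per_s = (STAGE_OUTPUT_RATES_HZ["event filter"]
                                * EVENT_SIZE_MBYTE)

for stage, factor in factors.items():
    print(f"{stage}: rate reduced by a factor of about {factor:.1f}")
print(f"overall reduction: {total_rejection:.0f}")
print(f"output bandwidth: {output_bandwidth_mbyte_per_s:.0f} Mbyte/s")
```

Under these assumptions the chain reduces the rate by an overall factor of 2 × 10⁵, and the recorded data stream is about 260 Mbyte/s.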
L2 selections use, at full granularity and precision, all the available detector data within the RoIs (approximately 2% of the total event data). The L2 menus are designed to reduce the trigger rate to approximately 3.5 kHz, with an event processing time of about 40 ms, averaged over all events. The final stage of the event selection is carried out by the event filter, which reduces the event rate to roughly 200 Hz. Its selections are implemented using offline analysis procedures within an average event processing time of the order of four seconds.

4. Object Reconstruction and B-Tagging Algorithms

Various detectors and combinations thereof are used to identify specific particles or sets of particles, along with qualities or quantities related to particles, such as reconstructing chromodynamic jets, measuring missing transverse energy, or tagging particles as a specific type, such as bottom (b)-tagged jets. A variety of techniques are used to accomplish this particle identification, which are explained in this section. Information about object reconstruction is primarily taken from [26], whereas the various papers used for b-tagging algorithms are given at the start of each subsection.

4.1. Track Reconstruction

The Inner Detector charged track reconstruction software employs a suite of track-fitting tools, including global-χ² fitting and Kalman-filter techniques, along with more specialised filters as described in [1] and [20].

4.2. Electrons

Electron candidates are reconstructed from energy clusters in the electromagnetic calorimeter that are matched to reconstructed tracks in the inner detector. The electrons are required to have transverse energy ET > 25 GeV and be in the central region of the detector: |η_cluster| < 2.47. Candidates in the electromagnetic calorimeter barrel/endcap transition region 1.37 < |η_cluster| < 1.52 are excluded.
The longitudinal impact parameter of the track with respect to the primary vertex, |z0|, is required to be less than 2 mm.

Electrons must satisfy tight quality requirements based on the shape of the energy deposit and the match to the track to distinguish them from hadrons. Additionally, isolation requirements are imposed based on nearby tracks or calorimeter energy deposits. These requirements depend on the electron kinematics and are derived to give a constant electron efficiency. The cell-based isolation uses the sum of all calorimeter cell energies within a cone of ∆R = 0.2 around the electron direction, while the track-based isolation sums all track momenta within a cone of ∆R = 0.3 (see footnote 5); in both cases the track momentum itself is excluded from the calculation [26].

4.3. Muons

Muon candidates are reconstructed by matching tracks formed in the muon spectrometer and inner detector. The final candidates are refit using the complete track information from both detector systems, and are required to have pT > 25 GeV, |η| < 2.5, and |z0| < 2 mm. Muons must be isolated from nearby tracks, using a cone-based algorithm with cone size ∆R_iso = 10 GeV/p_T^µ. All tracks with momenta above 1 GeV, excluding the muon’s track, are considered in the sum. The ratio of the summed track transverse momenta to the muon pT is required to be smaller than 5%, corresponding to a 97% selection efficiency for prompt muons from Z → µµ decays. If a muon and an electron share a track, the event is rejected [26].

Footnote 5: ∆R is the measure of angular separation, defined as ∆R = √(∆η² + ∆φ²). For a full list of ATLAS coordinates, please see [1].

4.4. Jets

Jets are reconstructed with the anti-kT algorithm with a radius parameter R = 0.4, using calibrated topological clusters built from energy deposits in the calorimeters. Prior to jet finding, a local cluster calibration scheme is applied to correct the topological cluster energies for the non-compensating response of the calorimeter, dead material, and out-of-cluster leakage.
The corrections are obtained from simulations of charged and neutral particles. After energy calibration, jets are required to have pT > 25 GeV and |η| < 2.5. To avoid selecting jets from secondary pp interactions, a selection on the absolute value of the jet vertex fraction (JVF) variable above 0.5 is applied to jets with pT < 50 GeV and |η| < 2.4. This requirement ensures that at least 50% of the sum of the pT of tracks with pT > 1 GeV associated with a jet comes from tracks compatible with the primary vertex. During jet reconstruction, no distinction is made between identified electrons and other energy deposits. Therefore, if any of the jets lie within ∆R = 0.2 of a selected electron, the single closest jet is discarded in order to avoid double-counting electrons as jets. After this, electrons or muons within ∆R = 0.4 of a remaining jet are removed [5].

4.5. B-Tagging Algorithms

Several algorithms to identify b-jets have been developed. They range from relatively simple likelihood-based algorithms based on impact parameter (IP3D) and secondary vertex related variables (SV1), to the more refined JetFitter algorithm, which exploits the topology of weak b- and c-hadron decays, using a Kalman filter to search for a common line connecting the primary vertex to beauty and charm hadron decay vertices [5]. The outputs of these algorithms are combined in an artificial neural network with output weight probability densities evaluated separately for b-, c-, and light-flavour jets. The MV1 algorithm employs an artificial neural network based on the IP3D, SV1 and JetFitter algorithms. It is trained with b-jets as signal and light-flavour jets as background, and computes a tag weight for each jet [5]. A similar MV1c algorithm is employed by the analysis presented in this thesis. It is trained using the same b-tagging algorithms as input; however, it trains b-jets against both c- and light-flavour jets, resulting in an improved c-jet rejection for similar signal efficiencies.
This section will first provide a brief description of these input algorithms and the nuances involved with combining them into a neural network, and touch on difficulties calibrating the performance in simulation to match that observed in data.

The following sections are inspired by reference [27].

4.5.1. IP3D Algorithm

The IP3D algorithm is a likelihood ratio based algorithm that uses the 3-dimensional impact parameter significance⁶ for each track in the jet as input. The impact parameters of tracks are computed with respect to the primary vertex. On the basis that the decay point of the b-hadron must lie along its flight path, the impact parameter is signed to further discriminate the tracks from b-hadron decay from tracks originating from the primary vertex. The sign of the parameter is positive if the inferred direction of flight of the decayed particle is similar to the jet direction, and negative if it is opposite. The experimental resolution generates a random sign for the tracks originating from the primary vertex, while tracks from the b/c hadron decay tend to have a positive sign.

⁶ The significance of a variable X is X/σ_X, and is used to give higher weights to better measured tracks.

Figure 9: Distribution of the signed significance of the transverse (left) and longitudinal (right) impact parameters, d0/σ_d0 and z0/σ_z0 respectively. The impact parameter is taken with respect to the primary vertex for tracks of b-tagging quality associated to jets, for experimental data (solid black points) and for simulated data (filled histograms for the various flavours). The ratio data/simulation is shown at the bottom of the plots [28]. These plots use 2011 data, with untuned simulation, as a demonstration of the separation power available from these variables. Unfortunately, updated plots were not available. The IP3D algorithm used by this analysis uses simulations corrected to the same 2012 data upon which the rest of this analysis relies [5].
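As a rough illustration of the sign convention described above, the following sketch computes a signed transverse impact parameter significance. It is not the ATLAS implementation: the function name, the restriction to the transverse plane, and the use of the track's point of closest approach as input are all assumptions made here for illustration. The sign is taken as positive when the displacement from the primary vertex to the point of closest approach has a positive projection onto the jet direction.

```python
import math


def signed_d0_significance(track_xy, vertex_xy, jet_phi, sigma_d0):
    """Illustrative signed transverse impact parameter significance.

    track_xy  : (x, y) of the track's point of closest approach [mm]
    vertex_xy : (x, y) of the primary vertex [mm]
    jet_phi   : azimuthal angle of the jet axis [rad]
    sigma_d0  : uncertainty on the impact parameter [mm]
    All names and inputs are hypothetical, chosen for this sketch only.
    """
    dx = track_xy[0] - vertex_xy[0]
    dy = track_xy[1] - vertex_xy[1]
    d0 = math.hypot(dx, dy)  # unsigned transverse impact parameter
    # Projection of the displacement onto the jet axis: positive when the
    # track appears to originate further along the jet direction.
    projection = dx * math.cos(jet_phi) + dy * math.sin(jet_phi)
    sign = 1.0 if projection >= 0.0 else -1.0
    return sign * d0 / sigma_d0
```

Dividing by the uncertainty means that a poorly measured track with a large displacement contributes no more than a well measured track with a small one, which is the purpose of using the significance rather than the raw impact parameter.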
To take advantage of correlations between the transverse and longitudinal impact parameters, the two are combined in a two-dimensional probability density function, which is then used in a likelihood ratio, as described below. These distributions are shown in figure 9.

4.5.2. SV1 Algorithm

The SV1 algorithm is also a likelihood ratio based algorithm. It uses the invariant mass of the reconstructed secondary vertex, the energy fraction of the vertex, and the number of two-track vertices included in the reconstruction (figure 10). The vertex search itself starts with a determination of all track pairs which form good (χ² < 4.5) two-track vertices inside the jet. In addition, each track of the pair must have a 3-dimensional impact parameter significance higher than 2.0, and the sum of these two significances must be higher than 6.0. In order to decrease the fake rate, an additional requirement on the direction of the displaced vertex with respect to the jet direction is applied.

Some of the reconstructed two-track vertices stem from K0s and Λ0 decays, γ → e+e− conversions and hadronic interactions in the detector material. The corresponding distributions are shown in figure 11. Figures 11 a) and b) show the invariant π+π− and pπ− mass spectra for accepted two-particle vertices, with peaks due to K0s and Λ0 decays. Figure 11 c) shows the distance between the primary and secondary vertices in the transverse plane, with peaks due to interactions in the material of the beam pipe and pixel detector layers. Charged particle tracks coming from such vertices are marked as bad and do not participate further in the following b-tagging procedure for the given jet.

[Figure 10 panels: vertex mass; vertex energy fraction; number of two-track vertices; schematic of secondary vertex reconstruction.]

Figure 10: Distributions of the properties of the vertex found by the SV1 tagging algorithm for experimental data (solid black points) and for simulated data (filled histograms for the various flavours). The ratio data/simulation is shown at the bottom of each plot [28]. These plots use 2011 data, with untuned simulation, as a demonstration of the separation power available from these variables. Unfortunately, updated plots were not available. The SV1 algorithm used by this analysis uses simulations corrected to the same 2012 data upon which the rest of this analysis relies [5]. The SV1 tagger takes a two-dimensional pdf of the vertex mass and energy fraction, leaving the number of two-track vertices as an independent variable [27]. The schematic (bottom-right) loosely demonstrates secondary vertex reconstruction [29].

Figure 11: Some distributions for reconstructed two-track vertices: a) the π+π− invariant mass spectrum with a peak of K0s decays; b) the pπ invariant mass spectrum with a peak of Λ0 decays; c) the distance in the transverse plane between the primary and secondary vertices, with the peaks due to interactions in the beam pipe walls (two walls at R ≈ 30 mm) and pixel layers (R = 50.5 mm and R = 88.5 mm) [27]. Reconstructed secondary vertices which fall within one of these background resonances are removed from the b-tagging procedure.

In the next step of the algorithm, all tracks inside the jet from accepted two-track vertices (except for marked K0s or Λ0 decays and material interactions) are combined into a secondary track list, and the vertex fitting procedure from the VKalVrt package⁷ tries to fit a single secondary vertex out of all these tracks. If the resulting vertex has an unacceptably high χ², the track with the highest contribution to the vertex χ² is deleted from the secondary track list and the vertex fit is redone. This procedure iterates until a good χ² of the vertex fit is obtained or all tracks from the secondary track list have been removed.

For the SV1 algorithm, the invariant mass and vertex charged energy fraction are combined in a 2D distribution, with the number of two-track vertices considered as an independent distribution (figure 10).
An improved algorithm, "SV2," that capitalizes on the correlations between all three variables (by using a 3D distribution) is also in development, but it requires considerably more statistics to implement, and so is not available for this analysis.

⁷ Described in detail in [27].

4.5.3. B-Tagging Likelihood Ratio

For both the impact parameter tagging and the secondary vertex tagging, a likelihood ratio method is used: the measured value X_i of a discriminating variable is compared to pre-defined smoothed and normalized distributions for both the b- and light jet hypotheses, b(X_i) and u(X_i). Two-dimensional probability density functions (PDFs) are used by both IP3D and SV1, with the SV1 algorithm implementing an additional independent (separated) 1D variable. The ratio of the probabilities⁸ b(X_i)/u(X_i) defines the track or vertex weight, which can be combined into a jet weight W_Jet as the sum of the logarithms of the N_T/V individual track/vertex weights W_i [27]:

$$W_{Jet} = \sum_{i=1}^{N_{T/V}} \ln W_i = \sum_{i=1}^{N_{T/V}} \ln \frac{b(X_i)}{u(X_i)}$$

The independent variable in SV1 thus simply adds independent additional terms to the vertex weight, as the PDF j(X_i) for independent variables m and n can be written j(X) = j_m(X_m) j_n(X_n), where the vertex index i has been suppressed for clarity. When no secondary vertex is found, the SV1 tagger returns a weight of $\ln \frac{1 - \epsilon^{SV}_b}{1 - \epsilon^{SV}_u}$, where ε^SV is the efficiency of the given quark type [27].

4.5.4. JetFitter Algorithm

The fragmentation of a b-quark often results in a decay chain with two vertices, one stemming from the b-hadron decay and at least one from a subsequent c-hadron decay. The JetFitter algorithm capitalizes on this topology by assuming the b- and any c-hadron decay vertices lie on the same line defined through the b-hadron flight path. All charged particle tracks stemming from either the b- or c-hadron decay thus intersect this b-hadron flight axis.
The lateral displacement of the c-hadron decay vertex with respect to the b-hadron flight path is small enough not to violate significantly this basic assumption within the typical resolutions of the tracking detector.

The JetFitter algorithm describes the decay chain through the determination of the following variables (figure 12):

$$\vec{d} = (\vec{x}_{PV}, \phi, \theta, d_1, d_2, ..., d_N),$$

with x_PV representing the position of the primary vertex, (φ, θ) representing the azimuthal and polar directions of the b-hadron flight axis, and d_i representing the distance of the ith vertex away from the primary vertex, along the b-hadron flight axis. Implicit in this is the number of secondary vertices involved in the fit, N.

Figure 12: A schematic displaying the JetFitter topology [30].

⁸ These are PDFs of obtaining the data value X_i, given that the track originates from a b or light quark, which is exactly the likelihood of the track originating from a b or light quark, given the data point X_i.

Before starting the fit, the variables are initialized with their prior knowledge:

• The primary vertex position (with covariance matrix), as provided by the primary vertex finding algorithm;
• The b-hadron flight direction, approximated by the direction of the jet axis, the error being provided by the convolution of the jet direction resolution with the average displacement of the jet axis relative to the b-hadron flight axis, as determined from Monte Carlo simulations.

The fit is then performed, resulting in the minimization of the χ² containing the weighted residuals of all tracks with respect to their vertices on the b-hadron flight axis.
After the primary vertex and the b-hadron flight axis have been initialised, a first fit is performed under the hypothesis that each track represents a single vertex along the b-hadron flight axis, until χ² convergence is reached, obtaining a first set of fitted variables (φ, θ, d_1, d_2, ..., d_N), with N equal to the number of tracks.

A clustering procedure is then performed, where all combinations of two vertices (picked up among the vertices lying on the b-hadron flight axis plus the primary vertex) are taken into consideration, filling a table of probabilities. After the table of probabilities is filled, the vertices with the highest compatibility are merged, a new complete fit is performed and a new table of probabilities is filled. This procedure is iterated until no pairs of vertices with a probability above a certain threshold remain. The result of this clustering procedure is a decay topology with a well defined association of tracks to vertices along the b-hadron flight axis, with at least one track for each vertex.

The decay topology is described by the following discrete variables:

• Number of vertices with at least two tracks;
• Total number of tracks at these vertices;
• Number of additional single track vertices on the b-hadron flight axis.

The vertex information is condensed into the following variables:

• Mass: the invariant mass of all the charged particle tracks attached to the decay chain;
• Energy fraction: the fraction of energy of these particles divided by the sum of the energies of all charged particles matched to the jet;
• Flight length significance d/σ(d) of the weighted average position d.

Some of these variables can be seen in figure 13.

Initially, these variables were used to define a likelihood function similar to that above, by defining 13 different categories based on different topologies of the decay chain.
Then, a category-specific PDF was determined for each of the three vertex-related variables (which are treated as independent and thus separable), and a likelihood for each jet type was determined by summing over each of the topologies. Details are given in [30]. However, the b-tagging algorithm used in this analysis is the MV1c neural network, which takes output from a separate JetFitter-related neural network as an input variable, as we will see below.

Figure 13: Distribution of some of the properties of the vertex found by the JetFitter tagging algorithm for experimental data (solid black points) and for simulated data (filled histograms for the various flavours). The ratio data/simulation is shown at the bottom of each plot [28]. These plots use 2011 data, with untuned simulation, as a demonstration of the separation power available from these variables. Unfortunately, updated plots were not available. The JetFitter algorithm used by this analysis uses simulations corrected to the same 2012 data upon which the rest of this analysis relies [5].

4.5.5. B-Tagging Neural Networks

The JetFitter variables listed above have been combined with the output from IP3D to form a neural network [28, 31]. This neural net has taken several forms, such as training b-jets against light-jets, in JetFitterCombNN, or training against light- and c-jets in JetFitterCombNNc [32]. The MV1 neural network, which trains b against light, uses JetFitterCombNN as one of its input variables. However, the MV1c network algorithm used in this analysis, which trains b-jets against light- and c-jets, uses the output from a multi-class neural network,⁹ and so the b-, c- and light-jet probabilities from JetFitterCombNN are all used as input variables for the MV1c network.

In summary, the MV1c neural network takes the likelihood ratios SV1 and IP3D as independent input variables, along with the three output probabilities provided by JetFitterCombNN.
Further, note that for more advanced taggers, such as those described in [31] and used by ATLAS in Run 2 (from 2015 onwards), the likelihood ratios used by SV1 and IP3D (described in 4.5.3) are often bypassed entirely, by taking the discriminating variables used in these tests as direct inputs to the neural network, rather than the results of the likelihood tests described above.

4.6. B-Tagging Calibration

The performance of b-tagging algorithms is characterised by the efficiency of tagging a b-jet, ε_b, and the probabilities of mistakenly tagging as a b-jet a jet originating from a c quark, ε_c, or a light-flavour parton, ε_l. Here, "light-flavour" partons are u, d, s quarks or gluons g. For each b-tagging algorithm, working points are defined as a function of the average b-jet efficiency as measured in simulation. The efficiencies ε_b, ε_c and ε_l are measured in data [5].

The calibration of the performance for b-jets is described fully in [33]. Heuristically, events are selected that primarily originate from di-leptonic tt̄ events.¹⁰ A separate Probability Distribution Function (PDF) is determined for the b-tagging discriminant or weight, w (e.g. MV1c), for each different jet flavour, which is then used to form a likelihood of the form e.g.:

$$L(p_T, w) = \sum_{f \,\in\, \text{flavour types}} F_f \, P_f(w \,|\, p_T),$$

where F_f is the jet flavour fraction in each sample and P_f is the probability of obtaining the b-tagging weight w, given that the jet is of flavour f and has momentum pT. This method is performed separately for three different bins in η and 10 different bins in pT, and so the dependence on the given efficiencies is specified through the stated dependence on pT in the above equation.
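A minimal numerical sketch of this flavour-mixture likelihood, for a single (pT, η) bin and a binned b-tagging weight, is given below. The template values, the flavour fractions and the function names are invented for illustration and are not taken from [33]; only the structure of the sum over flavours follows the equation above.

```python
import math

# Hypothetical binned PDFs P_f(w | pT) over five b-tagging weight bins
# (index 0 = untagged, index 4 = tightest bin); each sums to 1.
templates = {
    "b":     [0.20, 0.10, 0.10, 0.10, 0.50],
    "c":     [0.70, 0.10, 0.08, 0.07, 0.05],
    "light": [0.95, 0.02, 0.01, 0.01, 0.01],
}

# Hypothetical flavour fractions F_f in the selected sample; they sum to 1.
fractions = {"b": 0.6, "c": 0.1, "light": 0.3}


def mixture_probability(w_bin, fractions, templates):
    """Sum over flavours f of F_f * P_f(w | pT) for one weight bin."""
    return sum(fractions[f] * templates[f][w_bin] for f in fractions)


def log_likelihood(w_bins, fractions, templates):
    """Log-likelihood of a list of observed jet weight bins."""
    return sum(
        math.log(mixture_probability(w, fractions, templates)) for w in w_bins
    )
```

In a real calibration the b-jet template would be left free and adjusted to maximise this quantity on data, while the other templates stay fixed to their simulated shapes.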
All flavour PDFs P_f(w|pT) are determined from MC simulation, except for the b-jet weight PDF, which contains the efficiency information to be extracted from data, through [33]:

$$\epsilon_b(p_T) = \int_{w_{cut}}^{\infty} dw' \, P_b(w' \,|\, p_T)$$

The likelihood is then maximized using data, yielding efficiencies for the b-tagging weight for each bin in pT and η.

⁹ Explained in detail in section 7, and appendix B.

¹⁰ Either two electrons or muons, or one electron and one muon, all originating from the primary vertex with tight selection requirements (as described in [33]), plus two additional jets. Selecting di-leptonic tt̄ candidate events in this way provides a strong reduction of combinatorial background, allowing a cleaner comparison between candidate sample types.

Charm-efficiency calibration is performed by reconstructing D*+ hadrons through a series of kinematic cuts made on the participating tracks. Using this well-defined signal region, the efficiency of tagging the D*+ hadron is measured from the data, and used to determine the c-tagging efficiency ε_c through:

$$\epsilon_{D^{*+}} = F_b \, \epsilon_b(D^{*+}) + (1 - F_b) \, \epsilon_c(D^{*+}),$$

where F_b is the fraction of D*+ originating from b-quark decays, as determined from a fit of the b and c components to the D*+ candidates in the measured data, and ε_b is determined from the above di-leptonic tt̄ methodology [5].

These D*+-only efficiencies are then extrapolated to full c-tagging efficiencies through the use of extrapolation factors. The extrapolation factor in data can be different from that in simulation, either because the production fractions of the various charm hadron species are different, or because of different charged-particle multiplicity distributions for the decays of a given charm hadron.
The extrapolation factor in data is therefore estimated from simulation, after reweighting both the production fractions and branching ratios in the simulation to the best experimental values [5].

The analysis presented here uses a technique of continuous working points, such that these calibration procedures are performed in five different efficiency intervals. For the MV1c tagger used in this analysis, this corresponds to bin edges marking 80%, 70%, 60% and 50% efficiencies. As will be seen in chapters 8 and 9, the MV1c distributions of the 3rd and 4th highest MV1c-weighted jets in each event will be used in a profile likelihood fit as a measure of statistical sensitivity, where a fit performed on data would measure the production cross sections of various processes. With this intention in mind, continuous calibration has been performed such that scale factor corrections are derived to yield the correct probability that a simulated jet would land in a particular bin.

The calibration results are provided as data/MC efficiency scale factors, e.g., κ_c^{data/sim} ≡ ε_c^{data} / ε_c^{sim}, and similarly for b and light-flavour jets.

5. Monte Carlo Simulation of Event Templates

Monte Carlo (MC) simulation is used in order to generate the template Probability Distribution Functions (PDFs) required to make a cross-section measurement, as described in section 8. While this analysis only looks at the expected sensitivities of a measurement, and thus ignores real data entirely, it is also useful to know the data upon which the simulation is based. This section mostly reiterates the relevant aspects of the paper [26] upon which this thesis builds, though some additional context in the descriptions has been provided.

5.1. Data Samples

The results are based on proton-proton collision data collected with the ATLAS experiment at the LHC at a centre-of-mass energy of √s = 8 TeV in 2012. Only events collected under stable beam conditions with all relevant detector subsystems operational are used.
Events are selected using single-lepton triggers with pT thresholds of 24 or 60 GeV for electrons and 24 or 36 GeV for muons. The triggers with the lower pT threshold include isolation requirements on the candidate lepton in order to reduce the trigger rate to an acceptable level. The total integrated luminosity available for this analysis is 20.3 fb−1 [26].

5.2. Signal Modelling

After selection, the sample composition is dominated by tt̄ events. Contributions from other processes arise from W+jets, Z+jets, single top (t-channel, Wt and s-channel), dibosons (WW, WZ, ZZ) and events with one or more non-prompt or fake leptons from decays of hadrons. In the paper [26], tt̄V (where V corresponds to a W or Z boson) and tt̄H events that pass the fiducial selection are considered as part of the signal. All non-tt̄ backgrounds are included in the fit as a single component.

tt̄: The nominal sample used to model tt̄ events was generated using the PowhegBox (version 1, r2330) NLO generator, with the NLO CT10 pdf, assuming a top quark mass of 172.5 GeV. It was interfaced to Pythia 6.427 with the CTEQ6L1 parton distribution function (referred to as pdf in lower case, to indicate the difference from PDFs¹¹) and the Perugia2011C settings for the tunable parameters (hereafter referred to as tune). The h_damp parameter of PowhegBox, which controls the pT of the first additional emission beyond the Born configuration, was set to m_top = 172.5 GeV. The main effect of this is to regulate the high-pT emission against which the tt̄ system recoils. The tt̄ sample is normalised to the theoretical calculation of 253 +13/−15 pb performed at next-to-next-to-leading order (NNLO) in QCD that includes resummation of next-to-next-to-leading logarithmic (NNLL) soft gluon terms with Top++2.0 [26].
The quoted uncertainty includes the scale uncertainty and the uncertainties from PDF and αS choices [26].

Other Backgrounds: The small backgrounds are also modelled from simulation, except for the non-prompt or fake lepton background, which is obtained from real data. Full details on this are available in [26].

All samples were simulated taking into account the effects of multiple pp interactions based on the pile-up conditions in the 2012 data. The pile-up interactions are modelled by overlaying simulated hits from events with exactly one inelastic (signal) collision per bunch crossing with hits from minimum-bias events that are produced with Pythia 8.160 using the A2M tune and the MSTW2008 LO PDF. Finally, the samples were processed through a simulation of the detector geometry and response using Geant4. All simulated samples were processed through the same reconstruction software as the data. Simulated events are corrected so that the object identification efficiencies, energy scales and energy resolutions match those determined in data control samples [26].

¹¹ Though, technically speaking, pdfs are a specialized type of PDF.

6. Event Selection

6.1. Pre-Selection Cuts at Reconstruction Level

In addition to the reconstructed object quality requirements given in section 4.1, all events are required to have at least one reconstructed vertex with at least five associated tracks. Events are required to have exactly one lepton (electron or muon) coming from this primary vertex. The lepton must also be matched to the trigger object which triggered the event [26].

For this analysis, c-jet rejection is important, so the MV1c b-tagging algorithm is used. At least six jets are required, two of which pass the 80% MV1c b-tagging working point (presumed to come from the top decays). This working point was optimised to give the lowest total expected uncertainty on the measurements presented in [26].

6.2. Cuts and Fiducial Event Definition at Truth Level

The ttbb cross-section measurement is made in a fiducial region determined by a variety of pre-selection cuts made on final state particles, in order to maximize detector acceptance. Events which fail these cuts are discarded. The measured fiducial cross-section can then be extrapolated to the full cross-section using an additional correction, as explained in chapter 2.

In order to construct template histograms from Monte Carlo, a series of cuts are made on the event structure to define the fiducial region under which the measurement will be made. All particle-level objects (jets and leptons) are required to be within the detector acceptance of |η| < 2.5. One electron or muon is required with pT > 25 GeV. Electrons and muons are dressed by adding to the lepton the four-vector momenta of photons within a cone of size ∆R = 0.1 around the lepton. A minimum of six jets are required with pT > 20 GeV. The pT threshold for particle-level jets was optimised (in [26]) to reduce the uncertainty on the measurement; it is chosen to be lower than for reconstructed jets (which is 25 GeV) as jets with a true pT just below the reconstruction threshold may satisfy the event selection requirement due to the jet energy resolution. This effect is enhanced by the steeply falling pT spectra for the additional jets. In addition, good separation (∆R_i,j > 0.4) is required between any of the jets and the lepton [26].

A jet is defined as a b-jet by its association with one or more b-hadrons with pT > 5 GeV. To perform the matching between b-hadrons and jets, the magnitudes of the four-momenta of b-hadrons are first scaled to a negligible value (in order to not alter normal jet reconstruction), and then the modified b-hadron four-momenta are included in the list of stable particle four-momenta upon which the jet clustering algorithm is run, a procedure known as ghost-matching [34].
If a jet contains a b-hadron after this re-clustering, it is identified as a b-jet; similarly, if a jet contains no b-hadron but is ghost-matched to a c-hadron with pT > 5 GeV, it is identified as a c-jet. All other jets are considered light-flavour jets [26].

6.3. Template Definitions at Truth Level

In this analysis, tt̄ events are split according to the flavour of the additional jets in the event. With two additional jets (≥ 6 total), this yields ≥ 6 different templates (depending on how the events lacking additional heavy flavour are divided). With this in mind, template histograms are made from the full tt̄ Monte Carlo sample using these truth-level definitions (in addition to the requirements stated above), for each of these seven heavy flavour templates:

ttbb: ≥ 4 b-jets.
ttbc: Exactly 3 b-jets, and at least one c-jet.
ttbj: Exactly 3 b-jets, and no c-jets.
ttcc: Exactly 2 b-jets, and ≥ 2 c-jets.
ttcj: Exactly 2 b-jets, and exactly one c-jet.
ttjj: Exactly 2 b-jets, and no c-jets.
ttother: Any tt̄ event not satisfying the above selection criteria.¹²

The templates of different combinations are then merged if they have similar shapes and if they are produced through similar processes [26].
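The seven truth-level definitions listed above amount to a simple decision on the number of b- and c-jets in the event. A minimal sketch follows, in which the function name and the use of plain jet counts as inputs are assumptions made for illustration; it presumes the fiducial requirements of section 6.2 have already been applied.

```python
def classify_template(n_b_jets, n_c_jets):
    """Return the truth-level template label for one tt-bar event.

    n_b_jets, n_c_jets: numbers of particle-level b- and c-jets in the
    event (hypothetical inputs for this sketch).
    """
    if n_b_jets >= 4:
        return "ttbb"
    if n_b_jets == 3:
        return "ttbc" if n_c_jets >= 1 else "ttbj"
    if n_b_jets == 2:
        if n_c_jets >= 2:
            return "ttcc"
        if n_c_jets == 1:
            return "ttcj"
        return "ttjj"
    # Fewer than two b-jets: none of the explicit categories apply.
    return "ttother"
```

Because the b-jet count is tested first, an event with four b-jets and any number of c-jets is always labelled ttbb, in line with the ordering of the list.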
In this case, ttbc is combined with ttbj, ttcc is combined with ttcj, and lastly ttjj is combined with ttother to get:

ttbb: ≥ 4 b-jets.
ttsingleb: Exactly 3 b-jets.
ttcjets: Exactly 2 b-jets, and at least one c-jet.
ttlightjetsnotc: Any tt̄ event not satisfying the above selection criteria.

Histograms showing the similarities between merged templates may be seen in figure 14.

Note that contributions from W → cq decays where the c-hadron is matched to one of the fiducial jets are included in the ttcj and ttcjets templates; this contribution is found to dominate over that from tt̄ with additional c-jets [26].

¹² Note that no explicit requirement is made on the presence of top quarks or on the fact that two of the b-jets must originate from the decay of the top quarks, as that would utilise generator dependent parton level information. The labels are simply chosen as corresponding to the largest expected contribution in that category.

[Figure 14 panels: one pair of plots per merged template (ttsingleb with ttbj and ttbc; ttcjets with ttcc and ttcj; ttlightnotc with ttjj and ttother). Horizontal axes: b-tagging efficiency bin (No Tag, 0.8 eff, 0.7 eff, 0.6 eff, 0.5 eff) of the jet with the 3rd and 4th highest MV1c weight; vertical axes are logarithmic.]

Figure 14: These plots demonstrate the similarities between the six templates merged to make three instead, using the 3rd and 4th highest MV1c values as an example. Since these MV1c values are most likely to show differences between the various templates, if they are similar (as seen above), it is strong evidence for the ability to merge them.

7. Machine Learning Algorithms

7.1. Techniques for Statistical Improvements

In any likelihood test, whether it is a simple ratio as exemplified in chapter 4.5.3, or the profile likelihood fit used in this analysis and described in chapter 8, it is necessary to have distinctive features or shapes between likelihood templates in order for the test to be useful. When attempting to capitalize on multiple sources of information (represented by various candidate input variables), it is first necessary to determine whether or not the variable will be useful, and then further determine the method through which it may be used. To understand whether a variable is useful, we must first understand how it will be used, so this will be discussed first.

The most straightforward options are either to include the N-dimensional PDF of the discriminating input variables in the likelihood, or to separate out the PDFs for each variable in the likelihood and lose any separating power gained from correlations between variables in the process.¹³ Unfortunately, for the first method, the number of events that would need to be generated in order to adequately populate the space would be prohibitive. The second method, treating input variables as separate, suffers from difficulties in determining separating variables which are sufficiently de-correlated in order to be useful. Further, there is a great loss in discriminating power when correlations between variables are not considered (as evidenced by, e.g., the improvements seen in SV1 over SV0 described in [28]).

The performance of Machine Learning (ML) techniques is studied in this thesis in the context of improving the performance of the profile likelihood fit described in section 8. ML techniques are helpful when trying to make distinctive one-dimensional likelihood distributions based on many input variables, for use in various likelihood tests.
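To give a sense of scale for why populating an N-dimensional PDF is prohibitive while a one-dimensional ML output is not, the sketch below counts the simulated events needed to fill a histogram. The choice of 10 bins per variable and a target of 100 events per cell are arbitrary assumptions made for illustration, as is the function name.

```python
def events_needed(n_variables, bins_per_variable=10, events_per_cell=100):
    """Events needed to populate every cell of an N-dimensional histogram.

    The defaults are illustrative assumptions, not values from the analysis.
    """
    n_cells = bins_per_variable ** n_variables
    return events_per_cell * n_cells


# One-dimensional distribution versus a six-variable joint PDF.
one_dim = events_needed(1)
six_dim = events_needed(6)
```

Under these assumptions a single output distribution needs of order a thousand events, while a six-variable joint PDF needs of order a hundred million, which is the motivation for compressing many inputs into one discriminant.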
Two common ML techniques, which are seen to have comparable "out of the box" performance [35], are Neural Networks (NNs) and Boosted Decision Trees (BDTs).

These algorithms are used to develop distinctive output distributions out of numerous input variables, and are thus a segment of the mathematical connection between measured physical objects (such as tracks and reconstructed jets, described in chapter 4), and a measured physical result through use of standard statistical tools (such as the profile likelihood ratio test, described in chapter 8). Due to their prevalence in this analysis (e.g. the MV1c algorithm, along with the analysis described in chapter 10), special care has been taken to explain in detail how each algorithm uniquely transforms and combines variables to provide a final likelihood distribution, for each of these two techniques. For BDTs, this description may be found in appendix A. The description of NNs is in appendix B.

¹³ An example of this is given in chapters 4.5.2 and 4.5.3, where the Secondary Vertex (SV) algorithm uses three input variables: the vertex mass, the vertex energy fraction and the number of two-track vertices. The primitive SV0 algorithm treats each of these variables as independent, without correlation, by taking the total jet likelihood as equal to the likelihood resulting independently from each input variable. As described at the end of 4.5.3, the SV1 algorithm keeps the likelihood resultant from the number of two-track vertices as an independent contribution to the overall likelihood, but now combines the vertex mass and vertex energy fraction into a two-dimensional likelihood, which then contributes as a single factor to the overall likelihood of each jet, in order to take advantage of correlations between the mass and energy fraction.
While there were insufficient statistics to use it in studies which use the √s = 8 TeV (2012) data (such as this one), the SV2 algorithm is designed to combine all three variables into a three-dimensional likelihood, which takes advantage of correlations between all of the variables. A similar procedure may be applied to variables of distinctive phase spaces for use in likelihood tests such as the profile likelihood fit, described in chapter 8. However, this method is not pursued in this analysis.

No matter the algorithm employed, there are two fundamental aspects to an ML algorithm: the training phase, where algorithm parameters are tuned to best suit the analysis, and the testing phase, where the trained algorithm is used to simply process data and return an output weight for each event, usually interpreted as a probability density (as touched on below and described fully in the appendices). This section will introduce the common elements to each technique, and place the use of the ML algorithm in the context of the overall analysis. How the algorithms are created in the training phase is first explained, followed by how they are used in the testing phase. Lastly, optimal conditions for input variables are explained at the end of the chapter.

7.2. Training Phase

The creation and optimisation of these algorithms is performed through the training phase. Training occurs using events with known desired values for the outcome (in this case, simulated MC events, where the true flavour of the event is known) to provide the set of algorithm parameters with the best performance. In the typical case of binary classification, two types are trained against one another (a "signal" versus a "background").
The training process aims to produce an algorithm that returns values as close to 1 as possible for signal-like events, and as close to 0 or −1 (depending on the algorithm) as possible for background events.

Common to both ML algorithms, the training process iterates over the training events and updates each of the parameters iteratively. An error function is used to evaluate the performance of the algorithm at each iteration, and parameter updates are made in the direction that minimizes this error function.

Typically, precautions are taken to prevent the parameters from settling into a false minimum through over-use of the training events, in which features of individual events are learned. This typically results in an over-tuned and biased algorithm, known as "over-trained." To avoid this, a portion (usually half) of the events available for training is kept separate, and used to validate that the algorithm is not biased towards the training data. This is done by comparing the likelihood templates generated from the training and validation data sets, and looking for discrepancies between the two.

When it comes to training an algorithm to produce significant shape differences between each of the four heavy-flavour templates used in the profile likelihood fit (chapters 6 and 6.3), a decision must be made as to which template, or combination of templates, qualifies as the "signal," and which as the "background." As will be seen in chapter 10, this decision may not be straightforward. Further, as noted in many resources, such as [35] and [36], the selection of which variables to include in the algorithm is central to the algorithm's performance, by any metric. Section 7.4 gives a few brief notes on what makes a good separating variable for any ML technique.

7.3. Testing Phase

After the ML algorithm is optimised on the training data during the training phase, it is used to generate new likelihood templates.
After the algorithm is trained, the MC simulated events of each heavy-flavour template are run through the algorithm to develop a new, separate likelihood distribution for each heavy-flavour type. For each event, the values of the selected input variables are passed to the algorithm, which returns a response, $F(x)$, where $x$ is the vector of input variables for that event. This response is typically re-interpreted as a probability through the logistic function:

$$ p(x_i) = \frac{1}{1 + e^{-F(x_i)}} . \tag{5} $$

The response of the ML algorithm for each event is then used to build the likelihood templates. Ideally, the ML algorithm is trained to provide distinctive features between each template used in the profile likelihood fit. However, in this analysis four separate templates are used in the final fit, and so it is not straightforward for the ML algorithm to determine an optimal output distribution. "Multi-class" ML algorithms exist to address this issue, where a probability density is returned for each of the $K$ trained classes, through the softmax function:

$$ p_k(x) = \frac{e^{F_k(x)}}{\sum_{l=1}^{K} e^{F_l(x)}} . \tag{6} $$

This may be used to generate likelihood templates; however, including three classes creates a 2-D likelihood (as the algorithm responses sum to one), and four classes creates a 3-D likelihood. Algorithms of this type therefore quickly become subject to issues due to low statistics, and so the four-class algorithms are not investigated here. Multiple examples of the 1-D binary and 2-D 3-class cases are given in chapter 10.

The way in which each ML algorithm generates a response is unique to the method used (BDTs are vastly different from NNs), though in general terms it may be written that $F(x) = f(x|p)$, where $p$ is a set of parameters specifying the inner structure of how the algorithm uses the input variables, $x$. The determination of these parameters $p$ is the main goal of the training phase, described above.

7.4. Variables for Machine Learning

Common to all machine learning algorithms is the necessity of selecting useful discriminating variables. Optimal variables provide good shape differences between the fitted templates. With this in mind, variables which capitalize on differences in the physics between templates are identified and vetted in chapter 10.

One example is the set of MV1c b-tagging probabilities, given in figure 15. Another example, bbmindr, is the minimum ∆R between any pair of b-tagged jets, as this is expected to decrease with additional tagged jets (assuming the tt̄ system results in reasonably back-to-back b-jets). These distributions all show distinctive shape differences between the likelihood templates.

Consideration must also be given to correlations between variables, as the amount of information available to the algorithm may not increase with the addition of highly correlated variables. In the case of bbmindr, the variable depends indirectly on the number of b-tagged jets, and thus on the number of jets exceeding the MV1c 80% working point. So, while it shows excellent separation, it is difficult to judge how useful this variable will be as a supplement to the MV1c values.

A detailed analysis showing the optimal variable (and training template) selection for this analysis is given in chapter 10. First, however, the underlying statistical technique used to evaluate the efficacy of these likelihood templates, known as the profile likelihood fit, is examined in chapter 8. A baseline measurement is then given in chapter 9, which avoids the additional complications brought by these ML techniques by using a simple 2-D likelihood formed from the MV1c values of the 3rd and 4th MV1c-ranked jets.

8. Statistical Methods

A profile likelihood ratio test is performed in order to determine the production cross section of ttbb, ttsingleb (referred to in this section simply as ttbj), ttcjets (here ttcj), and other tt̄ and non-tt̄ events, which are referred to as ttjj. This section briefly describes this ratio, along with the associated test statistic used for determining a two-sided confidence interval. A similar statistic, also based on a likelihood ratio, was used to determine various inputs to the MV1c algorithm, as described in section 4.5.3. However, in the case of the b-tagging statistics, the goal was to determine the likelihood of each individual jet being of a particular flavour. In the case of this cross-section measurement, the goal is instead to find the proportion of events belonging to each type. Note that this analysis looks only to improve the expected statistical uncertainty as much as possible.

Regardless of whether the discriminating variable, $x$, is the calibrated MV1c network output or an output distribution from a home-grown NN or BDT, we can construct a histogram $n = (n_1, \ldots, n_N)$ with $N$ bins. The expectation value of $n_i$ can be written:

$$ E[n_i] = m_i^{bb} + m_i^{bj} + m_i^{cj} + m_i^{jj} , $$

where $m_i^{kk'}$ represents the mean number of entries of the ttkk′ sample in the $i$th bin, given by:

$$ m_i^{kk'} = m_{\mathrm{tot}}^{kk'} \int_{\mathrm{bin}\ i} f_{kk'}(x)\, dx , $$

where $m_{\mathrm{tot}}^{kk'}$ is the total number of expected ttkk′ events, given by the expected production cross section times the integrated luminosity of the data sample. If systematics were considered, one could include nuisance parameters which characterize the shape of the PDF $f_{kk'}$ for the variable $x$, given the ttkk′ hypothesis, and have these nuisance parameters vary depending on the uncertainty [37]. The PDFs for each of the different ttkk′ hypotheses are generated from Monte Carlo simulation.

The problem of determining the production cross-section for each of the ttkk′ samples comes down to determining the value of $m_{\mathrm{tot}}^{kk'}$.
For later clarity, we define a parameter of interest (POI), $\mu$, and nuisance parameters, $\theta_{kk'}$, such that:

$$ E[n_i] = \mu\, m_i^{bb} + \sum_{ll' \neq bb} \theta_{ll'}\, m_i^{ll'} . \tag{7} $$

$m_{\mathrm{tot}}^{bb}$ is then taken to be the constant value predicted from simulation, with the POI controlling the size of the ttbb cross section relative to simulation, and the sizes of the other $m_{\mathrm{tot}}^{ll'}$ samples determined through the nuisance parameters $\theta_{ll'}$. Letting $\theta$ represent this set of nuisance parameters,¹⁴ and writing the likelihood as a product of Poisson probabilities for each bin [37]:

$$ L(\mu | x, \theta) = \prod_{i=1}^{N} \frac{\left( \mu\, m_i^{bb} + \sum_{ll' \neq bb} \theta_{ll'}\, m_i^{ll'} \right)^{n_i}}{n_i!} \exp\left[ -\left( \mu\, m_i^{bb} + \sum_{ll' \neq bb} \theta_{ll'}\, m_i^{ll'} \right) \right] , $$

where the dependence on the data from MC simulation, $x$, on the RHS has been suppressed.

¹⁴ From this perspective, there is a signal sample, marked by the POI, and a background composed of the other three samples. The total number of background events is then taken as one nuisance parameter, and the relative number of events between contributing samples, as determined by the differences between the $m^{ll'}$, can be seen as two additional parameters controlling the shape of the background. Regardless of the perspective taken, $\theta$ refers to the set of parameters marking the size of the three additional samples.

To test a hypothesized value of $\mu$, the profile likelihood ratio is considered:

$$ \lambda(\mu) = \frac{L(\mu, \hat{\hat{\theta}})}{L(\hat{\mu}, \hat{\theta})} . $$

Here $\hat{\hat{\theta}}$ denotes the value of $\theta$ that maximizes $L$ for the specified $\mu$, i.e., it is the conditional maximum-likelihood estimator (MLE) of $\theta$ (and thus is a function of $\mu$). The denominator is the maximized (unconditional) likelihood function, i.e., $\hat{\mu}$ and $\hat{\theta}$ are their MLEs [37].

From this definition of $\lambda(\mu)$ we can see that $0 \leq \lambda \leq 1$, with $\lambda$ near 1 implying good agreement between data and the hypothesized value of $\mu$.
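As a rough illustration of the profile likelihood ratio defined above, the following Python sketch evaluates λ(µ) for a deliberately simplified model: one signal template scaled by µ and a single background template scaled by one nuisance parameter θ, rather than the three background normalisations used in this analysis. The bin contents are invented toy numbers and the function names are hypothetical. The data are set to the nominal expectation (an Asimov-style assumption), so that the unconditional MLE is known to be µ̂ = θ̂ = 1 and only the conditional fit needs to be performed.

```python
import math

def nll(mu, theta, data, signal, background):
    """Poisson negative log-likelihood for nu_i = mu*s_i + theta*b_i.
    The constant ln(n_i!) term is dropped: it cancels in the ratio."""
    total = 0.0
    for n, s, b in zip(data, signal, background):
        nu = mu * s + theta * b          # expected entries in this bin
        total += nu - n * math.log(nu)
    return total

def profile_theta(mu, data, signal, background, lo=1e-3, hi=5.0, steps=200):
    """Conditional MLE of theta for fixed mu, found by ternary search.
    The NLL is convex in theta, so the search converges to the minimum
    provided it lies inside [lo, hi] (assumed for these toy inputs)."""
    for _ in range(steps):
        third = (hi - lo) / 3.0
        left, right = lo + third, hi - third
        if nll(mu, left, data, signal, background) < nll(mu, right, data, signal, background):
            hi = right
        else:
            lo = left
    return 0.5 * (lo + hi)

def profile_likelihood_ratio(mu, signal, background):
    """lambda(mu) evaluated on Asimov-style data n_i = s_i + b_i."""
    data = [s + b for s, b in zip(signal, background)]
    theta_cond = profile_theta(mu, data, signal, background)
    nll_cond = nll(mu, theta_cond, data, signal, background)
    nll_best = nll(1.0, 1.0, data, signal, background)   # mu-hat = theta-hat = 1
    # min() guards against lambda marginally exceeding 1 through rounding.
    return min(1.0, math.exp(-(nll_cond - nll_best)))

# Toy templates (invented numbers, three bins).
toy_signal = [5.0, 20.0, 60.0]
toy_background = [200.0, 100.0, 40.0]

for mu in (0.5, 1.0, 1.5):
    print(mu, profile_likelihood_ratio(mu, toy_signal, toy_background))
```

With these inputs λ reaches its maximum at the nominal value µ = 1 and falls off on either side, which is the behaviour the range 0 ≤ λ ≤ 1 describes.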
When considering hypotheses where $\mu$ can be smaller or larger than the nominal value, a double-sided test statistic is used:

$$ t_\mu = -2 \ln \lambda(\mu) . $$

This can then be used to construct a p-value quantifying the level of disagreement with the nominal hypothesis:

$$ p_\mu = \int_{t_{\mu,\mathrm{obs}}}^{\infty} f(t_\mu | \mu)\, dt_\mu , $$

where $t_{\mu,\mathrm{obs}}$ is the value of the statistic $t_\mu$ observed from the data, and $f(t_\mu | \mu)$ denotes the PDF of $t_\mu$ under the assumption of signal strength $\mu$ [37].

As detailed in [37], if a value of $\mu = \mu'$ is measured, then the appropriate distribution $f(t_\mu | \mu')$ is required in order to carry out the p-value integral. In this case, the MLE of $\mu$ will equal $\mu'$, and so $E[\hat{\mu}] = \mu'$. If $\hat{\mu}$ is Gaussian distributed, $f(t_\mu | \mu')$ follows a non-central chi-square distribution with one degree of freedom, to the extent that terms of $O(1/\sqrt{N})$ can be neglected. This gives a non-centrality parameter $\lambda$ of:

$$ \lambda = \frac{(\mu - \mu')^2}{\sigma^2} . $$

In this large-sample limit, the inverse of the covariance matrix can be written as:

$$ V^{-1}_{ij} = -E\left[ \frac{\partial^2 \ln L}{\partial\theta_i\, \partial\theta_j} \right] . $$

When $i = j = 0$ (taking $\theta_0 \equiv \mu$), this implies $V_{00} = \sigma_\mu^2$.¹⁵ For $i, j > 0$, it gives the variances and covariances of the nuisance parameters (e.g. the other sample parameters, though also various systematics if additional nuisance parameters were to be included).

In order to estimate the statistical sensitivity, expressed through $V_{ij}$, a representative "Asimov" data set is used, whose MLEs yield the previously expected values for each parameter. Thus, in this case, $\mu' = 1$ regardless of which template is taken to be of interest. Trivially, this yields a test statistic $t_{\mu,\mathrm{obs}} = 0$ and a non-centrality parameter $\lambda = 0$, giving a $\chi^2$ distribution $f(t_\mu | \mu = 1)$ with 1 degree of freedom, and thus a p-value of 1, as expected.

¹⁵ Note that here, $\sigma^2$ is the variance, rather than the cross-section. It is used to determine the uncertainty in the measurement.

The RooStats framework is interfaced with the Minuit minimization package, which is used to find the (trivial) MLEs, $\hat{\mu}$ and $\hat{\theta}$, along with the conditional MLE $\hat{\hat{\theta}}$.
The standard Hessian matrix provided by Minuit ("HESSE") allows for numerical determination of the parameter covariance matrix (as above), and thus can be used to determine the predicted variances for the Asimov fit [37], as desired by this study. The HistFactory package is used to build the RooStats model from truth-definition template histograms [4].

HistFactory [4] builds a RooStats workspace based on the model described above, using template histograms for each of the four additional heavy-flavour types. In this implementation, HistFactory creates a normalization factor $\mu_{kk'}$ in the same style as equation 7, but for each of the different heavy-flavour types. The Minuit minimizer, acting on the complete negative log-likelihood of the Asimov data set, settles (by definition) on a normalization factor of one for each of the templates. The desired statistical sensitivities are then the uncertainties on the normalization of each template in this fit, as provided by RooStats.

9. Baseline Sensitivity using MV1c Discriminator

One option for extracting a cross-section measurement is to perform a profile likelihood fit to the two-dimensional distribution of the MV1c values of the 3rd versus the 4th highest jets, ranked by MV1c weight (see figure 15). Using the continuous b-tagging working points, the calibrated distributions have five bins each. As the 4th highest MV1c weight never exceeds the 3rd, this yields a 15-bin histogram, shown in figure 16. This distribution is then fed into HistFactory, and the Minuit minimizer supplies symmetrised versions of the uncertainties, as explained above.
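The two steps in this paragraph, enumerating the ranked bins and estimating Asimov uncertainties, can be sketched in plain Python. For a Poisson likelihood in which each template k is scaled by its own normalization factor, the expected second derivatives of −ln L at the Asimov point reduce to I_kl = Σ_i m_i^k m_i^l / ν_i, with ν_i = Σ_k m_i^k, and the covariance matrix is the inverse of this matrix. The sketch below is an illustration of that calculation only, not the RooStats/HistFactory procedure used in this thesis: it ignores MC statistical nuisance parameters and the non-tt̄ template, the helper names are hypothetical, and the template shapes are invented (only the four yields are taken from the figure legends), so its output is not expected to reproduce the sensitivities quoted in this chapter.

```python
import math
from itertools import combinations_with_replacement

def ranked_bins(n_levels=5):
    """(3rd, 4th) working-point index pairs with 3rd >= 4th.
    Index 0 = no tag, index 4 = tightest working point."""
    return [(hi, lo) for lo, hi in combinations_with_replacement(range(n_levels), 2)]

def fisher_matrix(templates):
    """Expected information matrix at the Asimov point (all factors = 1)."""
    n_bins = len(templates[0])
    total = [sum(t[i] for t in templates) for i in range(n_bins)]
    size = len(templates)
    return [[sum(templates[a][i] * templates[b][i] / total[i]
                 for i in range(n_bins) if total[i] > 0.0)
             for b in range(size)] for a in range(size)]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("templates are not linearly independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def asimov_sigmas(templates):
    """Expected uncertainty on each normalization factor."""
    cov = invert(fisher_matrix(templates))
    return [math.sqrt(cov[k][k]) for k in range(len(templates))]

# Yields from the figure legends; the slopes defining the shapes are invented.
YIELDS = {"ttbb": 711.9, "ttbj": 1704.1, "ttcj": 15309.6, "ttjj": 20082.9}
TOY_SLOPES = {"ttbb": 0.1, "ttbj": 0.5, "ttcj": 1.0, "ttjj": 1.5}

def toy_template(total, slope, bins):
    """Invented shape: falls exponentially with the summed tag tightness."""
    weights = [math.exp(-slope * (a + b)) for a, b in bins]
    norm = sum(weights)
    return [total * w / norm for w in weights]

bins = ranked_bins()
templates = [toy_template(YIELDS[k], TOY_SLOPES[k], bins) for k in YIELDS]
toy_sigmas = asimov_sigmas(templates)
print(len(bins), "ranked bins")
for name, sigma in zip(YIELDS, toy_sigmas):
    print(f"{name}: toy relative uncertainty {sigma:.3f}")
```

A quick sanity check of the formula: a single template with 100 expected events in one bin gives an uncertainty of 1/√100 = 0.1 on its normalization, as expected for a pure counting measurement.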
[Figure 15 image: two panels, "B-Tagging Efficiency of Jet with 3rd Highest MV1c weight" (left) and "B-Tagging Efficiency of Jet with 4th Highest MV1c weight" (right). Horizontal axis bins: No Tag, 0.8 eff, 0.7 eff, 0.6 eff, 0.5 eff. Legend: ttbb, N_events = 711.9; ttbj, N_events = 1704.1; ttcj, N_events = 15309.6; ttjj, N_events = 20082.9.]

Figure 15: Distributions of the (normalized) MV1c weight value assigned to the 3rd (left) and 4th (right) highest MV1c-ranked jets for each event. Event-type identification is made at truth level, and the number of expected events of each type is given in the legend. The error bars indicate the statistical uncertainty on the MC estimate.

This method considers that the MV1c distributions associated with the 3rd and 4th highest jets in each event will provide separation power between each of the four tt̄ templates. The differences in tagging between the 3rd and 4th jets indicate the flavour content of the possible additional heavy-flavour jets, with ttbb events most likely to contain high weights for both.

9.1. Baseline Fit Results

The expected statistical sensitivities resulting from the 15-bin templates are found to be:

$\Delta\mu_{bb} = 0.1972$,
$\Delta\mu_{bj} = 0.2660$,
$\Delta\mu_{cj} = 0.0678$, and
$\Delta\mu_{jj} = 0.0709$.

[Figure 16 image: "Heavy Flavour Likelihood Distributions by 3rd/4th Highest MV1c weights". Horizontal axis: 3rd/4th highest jet by MV1c efficiency, with bins 1.0/1.0, 0.8/1.0, 0.7/1.0, 0.6/1.0, 0.5/1.0, 0.8/0.8, 0.7/0.8, 0.6/0.8, 0.5/0.8, 0.7/0.7, 0.6/0.7, 0.5/0.7, 0.6/0.6, 0.5/0.6, 0.5/0.5. Legend: ttbb, N_events = 711.9; ttbj, N_events = 1704.1; ttcj, N_events = 15309.6; ttjj, N_events = 20082.9.]

Figure 16: Combining the 3rd and 4th highest MV1c weights yields a 15-bin distribution. Ideally, the form of each event type is distinctive from the rest. The error bars indicate the statistical uncertainty on the MC estimate.

Included in the fit is a single additional template including the sum of all expected non-tt̄ background events, as a source of uncertainty in the overall normalization of the fit.
Since this background does not affect the template shapes in any way, it makes a very small contribution to the expected sensitivities.

9.1.1. Re-Binning to Reduce Uncertainties Due to Limited MC Statistics

As explained in [4], in cases where template histograms are sparsely populated, the templates may not be very good descriptions of the underlying distribution, but are estimates of that distribution within some statistical uncertainty due to limited MC statistics. A complete treatment of this issue would give each bin of each sample a nuisance parameter for the true rate, which is then fit using both the (Asimov) data measurement and the Monte Carlo estimate. This approach would lead to several hundred nuisance parameters in the current analysis. Instead, HistFactory employs a lighter-weight version in which there is only one nuisance parameter per bin, associated with the total uncertainty on the Monte Carlo estimate in that bin. In the following, if there is such an uncertainty associated with a bin, I refer to the bin as low stat.

In the case of the 15-bin fit applied above,¹⁶ the top six bins are low stat, with large MC statistical uncertainties in the range of 4.4% to 8.2%. One technique to reduce the impact of this is to merge sparsely populated or unpopulated bins adjacent to one another in the two-dimensional distribution, so as to decrease the granularity of the likelihood template distributions. Intuitively, this can be justified by considering that little of the difference in shape between the likelihood templates will be seen in these hard-cutting b-tagging requirements.
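The merging idea described above can be sketched as follows. This is a one-dimensional simplification with hypothetical function names and toy inputs: bins are scanned in order and adjacent bins are combined until the relative MC statistical uncertainty of the merged bin, sqrt(Σw²)/Σw, falls to or below a chosen threshold. The threshold value and the greedy left-to-right strategy are assumptions made for illustration only, and do not describe how HistFactory flags bins or how the binning of this analysis was chosen.

```python
import math

def relative_mc_uncertainty(sum_w, sum_w2):
    """Relative statistical uncertainty of a weighted MC bin."""
    return math.sqrt(sum_w2) / sum_w if sum_w > 0.0 else float("inf")

def merge_low_stat_bins(sum_w, sum_w2, max_rel_unc):
    """Group adjacent bins so each group satisfies the uncertainty threshold.

    Returns a list of groups, each a list of original bin indices.
    Any trailing bins that never reach the threshold are folded into
    the last completed group, so only neighbouring bins are combined.
    """
    groups, current = [], []
    acc_w, acc_w2 = 0.0, 0.0
    for index, (w, w2) in enumerate(zip(sum_w, sum_w2)):
        current.append(index)
        acc_w += w
        acc_w2 += w2
        if relative_mc_uncertainty(acc_w, acc_w2) <= max_rel_unc:
            groups.append(current)
            current, acc_w, acc_w2 = [], 0.0, 0.0
    if current:
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups

# Toy example with unit event weights, so sum of w^2 equals sum of w.
toy_counts = [400.0, 100.0, 25.0, 16.0, 9.0]
toy_groups = merge_low_stat_bins(toy_counts, toy_counts, max_rel_unc=0.12)
print(toy_groups)
```

In the toy example the first bin already satisfies the threshold on its own, while the sparsely populated tail is absorbed into its neighbour rather than being left as several low stat bins.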
[Figure 17 image: "Heavy Flavour Likelihood Distributions by 3rd/4th Highest MV1c weights". Horizontal axis: 3rd/4th highest jet by MV1c efficiency, with bins 1.0/1.0, 0.8/1.0, 0.7/1.0, 0.6/1.0, 0.5/1.0, 0.8/0.8, 0.7/0.8, 0.6/0.8, 0.5/0.8, <0.7/0.7, <0.6/<0.6. Legend: ttbb, N_events = 711.9; ttbj, N_events = 1704.1; ttcj, N_events = 15309.6; ttjj, N_events = 20082.9.]

Figure 17: Combining the 3rd and 4th highest MV1c weights yields a 15-bin distribution, which can be collapsed into the 11-bin histogram above by combining the top six bins into two triplets. The error bars indicate the statistical uncertainty on the MC estimate.

Numerous re-binning schemes are available to reduce the initial 15-bin histogram, by combining different pairs or triplets of bins. An optimal combination retains or improves upon the sensitivities of the original binning, while reducing or eliminating the uncertainties associated with low bin statistics in the templates. Through trial and error with different re-binning possibilities, an 11-bin histogram that combines the top six bins into two triplets¹⁷ was found to provide the best result, with nearly identical sensitivities to the 15-bin case, but no low stat bins.
The expected statistical sensitivities resulting from the 2-jet, 11-bin likelihood templates are:

$\Delta\mu_{bb} = 0.1926$,
$\Delta\mu_{bj} = 0.2695$,
$\Delta\mu_{cj} = 0.0681$, and
$\Delta\mu_{jj} = 0.0708$.

¹⁶ It should be noted that this fit technically has two dimensions in the separation template phase space, albeit with a reduced volume due to the ranked nature of the MV1c distributions.

¹⁷ This corresponds to all events with the 4th highest MV1c weight above the 60% threshold in the top bin, and all events with the 4th highest MV1c weight above the 70% threshold, but below the 60% threshold, in the second-highest bin, regardless of the 3rd highest weight.

With sufficient statistics to avoid individual MC statistical bin uncertainties, and an insignificant change in sensitivity to the ttbb cross-section relative to the 15-bin fit, these sensitivities serve as the baseline against which possible improvements from including further MV1c information (below), or other kinematic variables through a multivariate analysis (chapter 10), are judged.

9.2. Beyond Baseline: Higher Dimensional Distributions

Including additional MV1c information, such as the MV1c value of the 2nd highest jet, or of other jets, may improve the sensitivity of the fit, since it adds information about the event. While issues with dimensionality are expected to present themselves more strongly, as they were already present in the simple 2-D distribution described above, it is illustrative for the later MVA techniques to investigate the most straightforward way of including additional variables.
The distributions of the leading and 2nd highest MV1c weights are given in figure 18.

[Figure 18 image: two panels, "B-Tagging Efficiency of Jet with Highest MV1c weight" (left) and "B-Tagging Efficiency of Jet with 2nd Highest MV1c weight" (right). Horizontal axis bins: 0.8 eff, 0.7 eff, 0.6 eff, 0.5 eff. Legend: ttbb, N_events = 711.9; ttbj, N_events = 1704.1; ttcj, N_events = 15309.6; ttjj, N_events = 20082.9.]

Figure 18: Distributions of the (normalized) MV1c weight value assigned to the 1st (left) and 2nd (right) highest MV1c-ranked jets for all events. Event-type identification is made at truth level, and the number of expected events of each type is given in the legend. The error bars indicate the statistical uncertainty on the MC estimate.

Further, the statistical issues arising from increasing the dimension of the MV1c distribution are somewhat mitigated by the ranked nature of the jets in question. Since the 3rd highest MV1c weight is always at least as high as the 4th highest, the total number of occupied bins is reduced from 5² = 25 to 15. Similarly, including the top 3 ranked jets gives just 35 bins out of 5³ = 125, due to the ranked nature of these values and the fact that the top two jets always exceed the 80% working point because of the analysis-level b-tagging requirement.
Including further jets, such as the top 4, 5 or 6 MV1c-ranked jets, yields 65-, 100- or 135-bin likelihood distributions, respectively.

Marginal improvements are seen in the ttbb sensitivity as the MV1c information of further jets is included:

$\Delta\mu_{bb}^{\text{2-jet, 11 bin}} = 0.1926$, $N_{\text{low stat bins}} = 0$
$\Delta\mu_{bb}^{\text{2-jet, 15 bin}} = 0.1972$, $N_{\text{low stat bins}} = 6$ (nominal sensitivity)
$\Delta\mu_{bb}^{\text{3-jet, 35 bin}} = 0.1917$, $N_{\text{low stat bins}} = 13$
$\Delta\mu_{bb}^{\text{4-jet, 65 bin}} = 0.1888$, $N_{\text{low stat bins}} = 33$
$\Delta\mu_{bb}^{\text{5-jet, 100 bin}} = 0.1804$, $N_{\text{low stat bins}} = 69$
$\Delta\mu_{bb}^{\text{6-jet, 135 bin}} = 0.1790$, $N_{\text{low stat bins}} = 104$

Similar improvements in the sensitivity to the other heavy-flavour types are seen.

The number of sparsely populated bins with significant MC statistical bin uncertainties, $N_{\text{low stat bins}}$, is also found to increase with dimension, as expected. Further, the uncertainties associated with the overall content of each bin were seen to increase in size, from a limited 4-8% uncertainty in the 15-bin (2-D) case (as above), to some non-physical empty bins (which must certainly be avoided) in the 100- or 135-bin (5- or 6-D) cases. Clearly, the improvements seen from including the information from additional jets will most likely not remain significant in a full analysis.

A simple algorithm to combine adjacent bins of similar MV1c weights was developed in order to impose a minimum (predicted, from MC) overall bin content on the likelihood distribution. By increasing the bin size in some regions of the N-dimensional distribution (while being sure not to connect disconnected regions of phase space in the re-binning process), particularly in the high-MV1c region where lower statistics are expected, it allows some information from additional jets to be included in the likelihood fit.

Through trial and error with the 35-, 65- and 100-bin distributions¹⁸ (including 3, 4 and 5 jets, respectively), the best-case scenario was found when imposing a minimum bin content of 120 expected events on either the 35- or 65-bin distribution.
While the 5-jet, 100-bin distribution showed the best initial results, access to ttbb sensitivities below 18.5% was quickly lost after applying any re-binning procedure. Applying the re-binning procedure to the 3- and 4-jet distributions gave nearly identical results, with no low stat bins, as seen below.

¹⁸ The increase from 5 to 6 jets, or 100 to 135 bins, was seen to be negligible, as it also includes 35 additional low stat or empty bins.

The expected statistical sensitivities resulting from a 26-bin likelihood fit, originating from the 3-jet, 35-bin distribution, are:

$\Delta\mu_{bb}^{\text{3-jet, 26 bin}} = 0.1863$,
$\Delta\mu_{bj}^{\text{3-jet, 26 bin}} = 0.2569$,
$\Delta\mu_{cj}^{\text{3-jet, 26 bin}} = 0.0651$, and
$\Delta\mu_{jj}^{\text{3-jet, 26 bin}} = 0.0412$.

Similarly, the expected statistical sensitivities resulting from a 38-bin likelihood fit, originating from the 4-jet, 65-bin distribution, are:

$\Delta\mu_{bb}^{\text{4-jet, 38 bin}} = 0.1854$,
$\Delta\mu_{bj}^{\text{4-jet, 38 bin}} = 0.2550$,
$\Delta\mu_{cj}^{\text{4-jet, 38 bin}} = 0.0645$, and
$\Delta\mu_{jj}^{\text{4-jet, 38 bin}} = 0.0373$.

The likelihood templates for these distributions are shown in figures 19 and 20.

[Figure 19 image: "Heavy Flavour Likelihood Distributions by 2nd/3rd/4th Highest MV1c weight". Horizontal axis bins labelled by 2nd/3rd/4th MV1c efficiency, from 0.8/1.0/1.0 to 0.5/0.5/0.5; vertical axis from 0 to 0.25. Legend: ttbb, N_events = 711.9; ttbj, N_events = 1704.1; ttcj, N_events = 15309.6; ttjj, N_events = 20082.9.]

Figure 19: One-dimensional representation of the 2nd, 3rd and 4th MV1c weights in tt̄ events. The axis is labelled such that the 2nd, 3rd and 4th MV1c weights are given, respectively. The error bars indicate the statistical uncertainty on the MC estimate.

While a fraction of a percentage point improvement over the 11- or 15-bin (2-jet) nominal cases is hardly significant, this process serves to illustrate the balance that must be struck between two options. The first is to include entirely new information (from additional variables) at the cost of decreased precision in the knowledge of the template, due to limited template statistics (and hence a coarser binning of each variable). The second is to exclude additional information and increase the precision of the template for a particular variable or distribution. The optimal choice in this tradeoff ultimately comes down to whether the additional precision gained in a particular variable is more or less beneficial than less precise measurements with additional variables.

[Figure 20 image: "Heavy Flavour Likelihood Distributions by 2nd, 3rd & 4th Highest MV1c weight". Horizontal axis: 2nd/3rd/4th highest jet by MV1c efficiency, with numbered bins (ticks at 5, 10, 15, 20, 25); vertical axis from 0 to 0.25. Legend: ttbb, N_events = 711.9; ttbj, N_events = 1704.1; ttcj, N_events = 15309.6; ttjj, N_events = 20082.9.]

Figure 20: One-dimensional representation of the 2nd, 3rd and 4th MV1c weights in tt̄ events, with coarser binning for bins with low statistics. This is a re-binned version of figure 19. It combines all events with the 4th MV1c weight higher than the 60% working point, along with a triplet and four pairs of bins. All bin pairings (and the triplet) are merged using neighbours, to avoid combining disjoint regions of the phase space. The error bars indicate the statistical uncertainty on the MC estimate.

A summary table of all results in this section is provided in table 4.

When one starts to consider including additional kinematic variables to help provide additional separation power between templates, the procedure is more complicated.
With no straightforward way to apply a re-binning procedure in high dimensions, the additional complication of having no clear-cut bin edges from tuning procedures as with MV1c, and no benefit of ranking to mitigate dimensionality issues, an alternative method is clearly necessary. The solution is the use of machine learning techniques, as will be seen in the following chapter.

[Figure 21 image: "Heavy Flavour Likelihood Distributions by 1st/2nd/3rd/4th Highest MV1c weight". Horizontal axis bins labelled by 1st/2nd/3rd/4th MV1c efficiency, from 0.8/0.8/1.0/1.0 to 0.5/0.5/0.5/0.5; vertical axis from 0 to 0.25. Legend: ttbb, N_events = 711.9; ttbj, N_events = 1704.1; ttcj, N_events = 15309.6; ttjj, N_events = 20082.9.]

Figure 21: One-dimensional representation of the top four MV1c weights in tt̄ events. The axis is labelled such that the 1st, 2nd, 3rd and 4th MV1c weights are given, respectively. The error bars indicate the statistical uncertainty on the MC estimate.

# Jets             2        2        3        3        4        4
# Bins             11       15       26       35       38       65
# low stat bins    nil      6        nil      13       nil      33
∆µ_bb              0.1926   0.1972   0.1863   0.1917   0.1854   0.1888
∆µ_bj              0.2695   0.2660   0.2569   0.2492   0.2550   0.2443
∆µ_cj              0.0681   0.0678   0.0651   0.0635   0.0645   0.0629
∆µ_jj              0.0708   0.0709   0.0412   0.0407   0.0373   0.0370

Table 4: Summary table of all relevant results presented in this section. The 5- and 6-jet cases are excluded, as these fits create a surplus of bins with low statistics, rendering them useless.

[Figure 22 image: "Heavy Flavour Likelihood Distributions by the Top 4 Highest MV1c weights". Horizontal axis: 1st/2nd/3rd/4th highest jet by MV1c efficiency, with numbered bins (ticks at 5, 10, 15, 20, 25, 30, 35); vertical axis from 0 to 0.25. Legend: ttbb, N_events = 711.9; ttbj, N_events = 1704.1; ttcj, N_events = 15309.6; ttjj, N_events = 20082.9.]

Figure 22: One-dimensional representation of the top four MV1c weights in each tt̄ event, with coarser binning for bins with low statistics. This is a re-binned version of figure 21. The new binning merges neighbouring bins, to avoid combining disjoint regions of the phase space. The error bars indicate the statistical uncertainty on the MC estimate.

10. Machine Learning Classification for Improved Statistical Sensitivity

As seen in chapter 7 and appendices A and B, machine learning algorithms provide a method for combining multiple input variables into lower-dimensional distributions, while still maintaining distinguishing features between likelihood templates. Several steps were taken to optimise the algorithm used, by trying many permutations of the input variables, along with training different choices of likelihood templates against one another.
Parameters related to the structure of the algorithm itself were left mostly at their default values within TMVA [36], as it has been seen elsewhere [35, 38] that any boosted decision tree or neural network of sufficient complexity should allow for reasonable tuning to the data, provided precautions and checks are taken against overtraining. As such, these meta-parameters, such as the number of hidden nodes or the number of trees, will not be discussed further.¹⁹

This chapter lists the kinematic variables identified as potentially helpful in providing separation power between templates, and goes into detail on the motivations behind some of the more relevant variables. It then describes the steps taken to work through different permutations of variables and training templates for both BDTs and NNs, and describes the best results in each case.

10.1. Choice of Kinematic Variables and Training Template Combinations

10.1.1. Identification of Possible Variables

As is typical with any machine learning algorithm [35], the input variables selected for use in training have a large impact on the power of the algorithm. Ideal variables demonstrate differences in the kinematics of the outgoing jets (and lepton) between different heavy-flavour compositions. To this end, 22 variables have been identified for attempted use in the algorithm. In addition to the MV1c weights of the top four jets (ranked by MV1c), 18 variables were found to have some separation power, through examination of the corresponding likelihood distributions.

Many of the variables depend on whether or not a jet is considered b-tagged. For this purpose, jets that pass the MV1c 80% b-tagging working point are considered tagged. It is expected that much of the separation power of variables dependent on b-tagging will originate from the b-tagging mechanism itself, and thus such variables are not expected to provide much benefit over the raw b-tagging algorithms.
Working in conjunction withthe continuous MV1c distributions, the impact made from small differences in a variable not dependent onb-tagging may prove to be more beneficial to a trained algorithm than strongly discriminating variables whichdepend on b-tagging. Thus, whether or not the variable is dependent on b-tagging was a main considerationtaken when selecting which variables were worth investigating.The additional b-jets in ttbb events typically result from gluon splitting to a bb¯ pair. Thus, it is expected thatfor high momentum gluons, the two b-quarks will be relatively close in ∆R, whereas the separation betweenother jets should be higher. As a result of this, many variables attempt to capitalize on these particular19 For clarity, this analysis uses NNs with a single hidden layer with N + 8 nodes, where N is the number of input variables,along with a tanh activation function. The BDTs used in this analysis are bagged and gradient-boosted, with a max depth of two(implying trees are simply a “stump” with a single split) and a forest size of 1000 trees. Further details on the implementation ofBDTs and NNs used here are available in the relevant appendices, A and B.53kinematics by looking at angular distributions or angular separation between objects. Further, in order toproduce bb¯-jet pairs, rather than light jets, higher momentum gluons are required. Therefore, the tt¯ systemwill be a bit more boosted. Therefore many variables related to the pT of jets were also seen to be usefuldiscriminants.The following variables were considered:aplab The 2nd eigenvalue of 3-dimensional linear momentum tensor built using b-tagged jets (the "aplanar-ity"). Events with four or more b-tags will have aplanarities which may reflect differences in overallevent kinematics.avgdr The average ∆R of all possible jet pairs.avgdrbb The average ∆R of all b-tagged jet pairs. 
Assuming the tt¯ system is produced back-to-back, this variable reflects the assumption that additional b-tagged jets will bring down the average angular separation between jet pairs.
bbmaxmass: The invariant mass of the pair of b-tagged jets with the largest invariant mass.
bbmindr: The smallest ∆R between any b-tagged jet pair. This will be smaller when the additional pair of b-jets is boosted.
costhetastar: The scattering angle of the hard process.^20
drbbmaxpt: The ∆R between the two b-tagged jets with the largest vector sum pT.
jets40: The number of jets with pT above 40 GeV.
jjssumpt: The scalar sum of the pT of the pair of non-b-tagged jets closest in ∆R.
mindr2: The smallest ∆R between any pair of jets.
mindrbb: The invariant mass of the pair of b-tagged jets closest in ∆R.
normavgdrbb: The average ∆R between all b-tagged jet pairs, with each entry weighted by the pT scalar sum of the two jets. The overall sum is divided by twice the pT scalar sum of all b-tagged jets.
nwavgdrbb: The average ∆R between all b-tagged jet pairs, with each entry weighted by the pT scalar sum of the two jets. The overall sum is divided by the sum of the energies of all b-tagged jets.
pt5: The pT of the jet with the 5th highest pT.
ssumpt: The scalar sum of the pT of all jets.
sum34pt: The scalar sum of the pT of the jets with the 3rd and 4th highest pT.
wavgdr: The average ∆R between all possible jet pairs, with each entry weighted by the pT scalar sum of the two jets.
wavgdrbb: The average ∆R between all b-tagged jet pairs, with each entry weighted by the pT scalar sum of the two jets.

^20 Take the vector sum of the top two pT-ranked jets, then boost the top pT jet backwards by this sum. Cos(θ∗) is the pT of the jet after boosting.

These distributions are shown in figures 23 and 24.
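As an illustration of the ∆R-based definitions listed above, the following is a minimal Python sketch and not the analysis code. It assumes each jet is a dictionary with hypothetical keys `pt`, `eta` and `phi`, and it takes one possible reading of the wavgdr definition (the mean over all jet pairs of ∆R multiplied by the scalar pT sum of the pair). The b-tagged variants would apply the same function to the subset of jets passing the b-tagging working point.

```python
import math
from itertools import combinations


def delta_r(jet_a, jet_b):
    """Angular separation sqrt(d_eta^2 + d_phi^2), with d_phi wrapped into [-pi, pi)."""
    d_eta = jet_a["eta"] - jet_b["eta"]
    d_phi = (jet_a["phi"] - jet_b["phi"] + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(d_eta, d_phi)


def pair_variables(jets):
    """Return avgdr, mindr2 and wavgdr for a list of jets, or None if there is no pair.

    The key names 'pt', 'eta', 'phi' are assumptions made for this sketch.
    """
    pairs = list(combinations(jets, 2))
    if not pairs:
        return None
    drs = [delta_r(a, b) for a, b in pairs]
    # Weight of each pair: scalar sum of the two jet pT values (assumed reading).
    weights = [a["pt"] + b["pt"] for a, b in pairs]
    return {
        "avgdr": sum(drs) / len(drs),
        "mindr2": min(drs),
        "wavgdr": sum(w * dr for w, dr in zip(weights, drs)) / len(drs),
    }
```

Wrapping the azimuthal difference before taking the quadrature sum matters for jets on either side of φ = ±π, which would otherwise appear far apart.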
A full list of variables investigated for use, but discarded due to insufficient separation power, is available in appendix C.

Running every possible training permutation of including or excluding each of the 22 variables (2^22 > 4 million) is both impractical and undesirable, as a large majority of these variable sets will be poor performers. To reduce the computational load, the performance of algorithms trained using a subset of these permutations was analyzed. With these results, it was possible to systematically determine which variables could be discarded, and which might appear in the final, optimal combination. The metrics used for evaluating the efficacy of each variable are discussed in section 10.2.

10.1.2. Binary Combinations of Templates Used for Training

Another factor affecting the performance of the training is what to consider as signal and what as background. Since our likelihood profile fit involves fitting four different templates, it is not immediately clear which of these templates, or combinations of templates, to use in training. The performance of the algorithm will not be judged simply on its ability to create distinct shapes between the two trained classes, but instead on its ability to create separation power between each of the four templates involved in the fit. As a result, numerous possibilities exist for which templates to use in training. A variety of template combinations were identified for investigation:

Binary Template Training Combinations
1. ttbb vs ttallnotb (= ∑ all templates excluding ttbb)
2. ttbb vs ttlightnotc (= ∑ all templates that exclude any additional b- or c-tagged jets)
3. ttbb vs ttlightjets (= ∑ all templates with ≤ 1 c-tagged jet and ≤ 2 b-tagged jets)
4. ttbb vs ttlightjetsandcc (= ∑ all templates with ≤ 2 b-tagged jets)
5. ttbb vs ttmediumjets (= ∑ all templates with additional c-jets or exactly 3 b-tagged jets)
6. ttbb vs ttcjets (= ∑ all templates with additional c-jets)
7. ttbb vs ttsingleb (= template with exactly 3 b-tagged jets)
8. ttlightnotc vs ttbjets (= ∑ all templates with ≥ 3 b-tagged jets)
9. ttbjets vs ttlightjets
10. ttbjets vs ttlightjetsandcc
11. ttbjets vs ttcjets

[Figure 23: Normalized template distributions for the variables investigated with this analysis. The y-axes are linear or log scaled depending on which best displayed differences between templates. Panels: aplab (Aplanarity of Normalized 3-D Momentum Tensor), avgdr (Average ∆R Between All Jets), avgdrbb (Average ∆R Between All B-Tagged Jets), bbmaxmass (Mass of B-Tagged Jet Pair with Largest Mass, GeV), bbmindr (Smallest ∆R Between Any B-Tagged Pair), costhetastar (Cos θ∗), drbbmaxpt (∆R Between Pair of B-Tagged Jets with Largest ΣpT), jets40 (Number of Jets over 40 GeV), jjssumpt (Scalar ΣpT of Pair of Untagged Jets Closest in ∆R). Legend: ttbb (N = 711.9 events), ttbj (N = 1704.1 events), ttcj (N = 15309.6 events), ttjj (N = 20082.9 events).]

[Figure 24: Normalized template distributions for the variables investigated with this analysis. The y-axes are linear or log scaled depending on which best displayed differences between templates. Panels: mindr2 (Smallest ∆R Between Any Pair of Jets), mindrbb (Mass of B-Tagged Pair Closest in ∆R, GeV), normavgdrbb (Avg. ∆R Between All B-Jets, Weighted by Jet pT, Normalized to ΣpT), nwavgdrbb (Avg. ∆R btwn B-Tagged Jets, weighted by pT / ΣE), pt5 (pT of the Jet with 5th Highest pT, GeV), ssumpt (Scalar Sum of All Jet pT, GeV), sum34pt (Sum of 3rd and 4th Highest Jet pT, GeV), wavgdr (Avg. ∆R Between All Jets, weighted by pT, GeV), wavgdrbb (Avg. ∆R Between All B-Tagged Jets, weighted by pT, GeV). Legend as in figure 23.]

In these combinations, a strong focus is placed on isolating ttbb from the remaining templates, as there is strong physical motivation to measure ttbb as precisely as possible, as explained in chapter 2. Beyond this, given the difficulties associated with optimizing separation power between each of the three remaining templates, it was extremely difficult to predict which remaining templates should be involved in training the algorithm. However, to avoid increasing the combinatorics by an additional factor of 20 (including the multi-class options, listed in section 10.6), the additional task of determining which template subsets to use in the training phase of the algorithm was left until after a reasonable subset of variables had been determined. Therefore, as a start, two approaches were taken.

The first approach assumed that variables with low separation power between two strongly differing templates would also perform poorly when training with similar templates. As a result, ttbb events were trained against ttjj (= ttlightnotc) events while determining which variables to use.
As an alternative, the opposite assumption was made: ttbb was trained against the template most similar to it, ttbj + ttbc (= ttsingleb), in determining an alternative set of variables. An independent optimal variable set was determined for each approach, and both sets were then used in training with each of the above template training combinations.

Initially, multi-class training templates were also considered, as this was predicted to yield better results in the final fit. However, attempting to separate each of the four templates individually in a four-dimensional fit that best represented the final statistical fit was quickly seen to suffer drastically from dimensionality issues. Further, even in the three-class case, it is not immediately clear how best to reduce the binning in the resultant multi-dimensional histogram to something manageable by the profile likelihood fit. As a result of these issues, compounded with the added complexity of determining which variables to use in the final algorithm, only binary classification training templates were considered until a final variable set had been decided upon.

10.2. Variable Selection Metrics and Process

When eliminating variables, it was necessary to determine which were least likely to appear in the final optimal variable set. TMVA provides two key sources of information for identifying successful variables: the correlation matrix of the input variables, and a variable "importance" ranking. Correlation matrices between variables are provided for each trained template. Thus, a separate matrix is provided for the ttbb events (figure 25) and for ttlightnotc events (figure 26). Further, TMVA provides a ranking of the impact each variable has on the final discriminant.
For BDTs, this ranking is created by counting how often the variables are used to split decision tree nodes, and by weighting each split occurrence by the separation gain-squared it has achieved and by the number of events in the node. For NNs, the ranking is given by the sum of the weights-squared of the connections between the variable's neuron in the input layer and the hidden layer [36].

The correlations between variables are independent of the inclusion of other variables, and so the matrices provided in figures 25 and 26 will be referenced often. The TMVA rankings provided for NNs are also independent of which other variables are included, and were seen to be independent of which templates were used in training as well. Thus, only one TMVA ranking is required for NNs, given in table 5.

Rank  Variable      Importance
1     wavgdr        4.0 × 10^8
2     ssumpt        5.5 × 10^7
3     wavgdrbb      2.2 × 10^7
4     bbmaxmass     3.1 × 10^6
5     mindrbb       1.4 × 10^6
6     jjssumpt      8.8 × 10^5
7     sum34pt       7.7 × 10^5
8     costhetastar  5.2 × 10^5
9     pt5           4.1 × 10^4
10    avgdr         1.6 × 10^3
11    MV1c1         1.0 × 10^3
12    MV1c2         6.6 × 10^2
13    avgdrbb       3.9 × 10^2
14    MV1c3         1.7 × 10^2
15    drbbmaxpt     1.5 × 10^2
16    bbmindr       1.1 × 10^2
17    MV1c4         6.9 × 10^1
18    nwavgdrbb     4.7 × 10^1
19    normavgdrbb   4.3 × 10^1
20    mindr2        1.9 × 10^1
21    aplab         2.2 × 10^−2
22    jets40        3.4 × 10^−46

Table 5: The variable rankings as displayed by TMVA for neural networks.
The variable ranking was given in this order regardless of which templates were used for training.

[Figure 25: Correlations between all 22 initial variables within the ttbb sample. Correlation matrix of linear correlation coefficients in %, on a colour scale from -100 to 100.]

[Figure 26: Correlations between all 22 initial variables within the ttlightnotc sample. Correlation matrix of linear correlation coefficients in %, on a colour scale from -100 to 100.]

Unfortunately, the picture is not so clear with BDTs, where variable ranking will change depending on which other variables are included, and also depending on which templates are used in training. This commonly occurs when variables are highly correlated. In the case of two variables that perform similarly, one may be consistently selected as the optimal splitting variable when growing trees, and thus may vastly outrank the other variable, placing many unrelated variables in between the two.
However, upon removal of this top choice, the second variable may be selected much more often, giving it a rank above variables that were previously ranked higher. TMVA rankings that include every variable will still be helpful, as these ranks will still rank variables in similar "groupings" (such as a pairing of two variables as illustrated above, though it may also be a triplet of variables, or more) in a meaningful order. Thus, these rankings will still be used, with the previous caveat in mind. Further, thinking ahead towards selecting which training templates to use, variable rankings for a number of template combinations were examined in order to account for the (milder) differences in rankings that occur when using differing templates. These rankings may be seen in table 6.

If this were a simple (e.g. signal vs background) search, we could simply train the signal template against the background template, look at the variable correlation matrix and TMVA rankings, and remove variables that are both poorly ranked and highly correlated with other variables. Since these binary classification algorithms are, by design, optimal at separating the two trained templates, there is no additional information to consider. However, in this case, the added complexity of using the trained ML algorithm to generate four different likelihood templates for use in a profile likelihood fit is an issue that goes unaddressed when relying on the above metrics.^21 In this case, the classification algorithm must create distinguishing features between multiple templates using a single training. The ability of the algorithm to split between, e.g., ttbb and ttsingleb, when ttsingleb has played no role in the training process, is impossible to judge from the above metrics alone.

To this end, a sensitivity metric was developed for this analysis that takes into account the final, systematic-uncertainty-free fit sensitivities when developing ranks between each variable.
However, as this sensitivity metric is based heavily on the iterative approach taken to remove variables, this approach will be explained first.

As mentioned previously, training and studying all combinations of variables would be prohibitive. Instead, an iterative approach was taken which ensured each trained subset fairly represented each of the variables. In a first iteration, the 3rd and 4th highest MV1c values were assumed to be among the top performers, and so these were automatically included. A training was performed adding each of the (20 choose n) combinations of remaining variables, for each choice subset, n ∈ [1, 4] ∪ [16, 19]. These values for n were selected based on computational limits, while still giving a sufficiently good sense of the importance of a given variable. This first iteration resulted in 12 390 separate trainings, the results of which were sorted and analyzed to determine which variables were the worst performers. From likelihood fits to the BDT trained discriminants, 6 variables appeared reasonable to discard (listed below, in section 10.3.1). A second iteration was performed, this time allowing the possibility of excluding the 3rd and/or 4th highest MV1c values. Each of the (16 choose n) combinations of remaining variables was trained, for each choice subset, n ∈ [1, 5] ∪ [11, 15] (with this range of n again selected based on computational limits). The results from n = 3, 4 and 5 were used to cross-check the results from the previous iteration, as well as to look at algorithms that did not necessarily include both MV1c3 and MV1c4. The information from this second iteration allowed the removal of two more variables, also listed in section 10.3.1. Finally, in a third step, each of the remaining 16 384 combinations of the 14 variables was trained and evaluated as a candidate for further use.

^21 There exists the possibility of using a multi-class classification algorithm, simultaneously training every heavy-flavour template type against every other type. However, for the four templates involved in the fit in this analysis, this method would create a 3-dimensional distribution, and thus requires more statistics than available. This is discussed more fully in chapter 9.

Numerous complications were encountered while trying to generate a bias-free metric based on the resultant sensitivities of an algorithm (fully detailed in appendix D). The final result of this was to assign a score to each variable based on how frequently it was included or excluded in a subset of permutations. This score is:

$$ K \times \frac{a^{var}_{5}}{p_{5}} + 5 \times \frac{a^{var}_{10}}{p_{10}} + 2 \times \frac{a^{var}_{25}}{p_{25}} + \frac{a^{var}_{50}}{p_{50}} $$

where K is 10 for the first iteration and 50 for the second iteration, $a^{var}_{t}$ is the total number of appearances the variable var makes in the top t% of permutations, and

$$ p_{t} = \min\left\{ \left[ \binom{m}{n} \times \frac{t}{100} \right], \binom{m-1}{n-1} \right\} $$

is the maximum number of appearances a variable could possibly make in a subset of permutations. Here, m is the number of potential variables to eliminate; i.e. m = 20 in the first iteration (as MV1c3 and MV1c4 are always included), and m = 16 in the second iteration.

The appearance of a variable in a training with good sensitivities for subsets n < m/2 intuitively implies that that variable should be included in the final training. However, for subsets n > m/2 it should be noted that the opposite is true; in this case, the variable was excluded from the trained algorithm, and thus should be excluded from the final training. As a result, at each iteration, the scores from each choice subset were summed to create an "inclusion sum" (composed of a variable's scores summed for all subsets n < m/2), an "exclusion sum" (sums from n > m/2) and a total sum of all subsets in that iteration. The inclusion sum is intended to give an impression of how frequently a variable was included among algorithms with the best sensitivities, and the exclusion sum is intended to estimate how often a variable was excluded among the best algorithms.
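The counting and scoring described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the analysis: the function names are hypothetical, and the square bracket in the definition of p_t is assumed here to mean rounding up to the next integer, which the text does not specify.

```python
from math import comb

# Subset sizes n used in the first and second iterations, as quoted in the text.
FIRST_ITERATION_SIZES = [1, 2, 3, 4, 16, 17, 18, 19]
SECOND_ITERATION_SIZES = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]


def n_trainings(m, sizes):
    """Number of trainings when choosing n of m variables for every n in sizes."""
    return sum(comb(m, n) for n in sizes)


def max_appearances(m, n, t):
    """p_t: the most times one variable can appear in the top t% of permutations.

    Assumption: the bracket [.] is a ceiling, implemented with integer arithmetic.
    """
    top_t_percent = (comb(m, n) * t + 99) // 100
    return min(top_t_percent, comb(m - 1, n - 1))


def variable_score(appearances, m, n, k):
    """Score for one variable in one choice subset n.

    appearances maps t (5, 10, 25, 50) to the count a_t of appearances in the top t%.
    k is 10 in the first iteration and 50 in the second.
    """
    weights = {5: k, 10: 5, 25: 2, 50: 1}
    return sum(w * appearances[t] / max_appearances(m, n, t) for t, w in weights.items())


print(n_trainings(20, FIRST_ITERATION_SIZES))   # 12390
print(n_trainings(16, SECOND_ITERATION_SIZES))  # 13768
```

Under these assumptions, a variable that appears as often as possible in every top-t% group scores K + 8 for a given subset size, i.e. 18 points per subset in the first iteration.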
The total sum is a balance between these two perspectives.

Lastly, it should be noted that due to differences in how each algorithm uses the input variables, the variable selection for BDTs and NNs was performed separately.

10.3. Boosted Decision Trees Optimization

When choosing which variables to discard, TMVA ranks for all the various possible template training combinations must be considered, as in this preliminary stage it remains unclear which particular template training combination will yield the best results. TMVA ranks for select template training combinations can be seen in table 6.

10.3.1. Removal of Least-Performant Variables for BDTs

As described in sections 10.1.2 and 10.2, variable removal was performed twice, training ttbb against ttlightnotc or ttsingleb, and over two iterations each time, removing six variables after the first iteration, and two variables after the second iteration.

- Variable removal from training ttbb against ttlightnotc -

After training the ∼12K variable combinations in the first iteration, with ttbb trained against ttlightnotc, the correlation matrices (figures 25 and 26), sensitivities point card (table 7) and TMVA variable rankings (table 6) were examined. Based on this, six variables were removed, as described below.

MV1c1 was seen to be ranked very low by all of the TMVA rankings, while also having only moderate performance in the sensitivities, implying it was usually not selected as a splitting variable for trees in BDTs.
Thus its inclusion in the training has little impact overall, which accounts for its reasonably good exclusion sum sensitivities, and thus seemingly reasonable overall performance.

jets40 is also ranked poorly by both TMVA and the sensitivities performance, with only a moderate showing in the inclusion sum sensitivities.

Thus, on the straightforward basis that MV1c1 and jets40 both add little or no benefit to the overall algorithm, they were excluded from further iterations.

The variables costhetastar, pt5, ssumpt, wavgdr and sum34pt were found to have large (> 60%) correlations with one another, in both ttbb and ttlightnotc samples.

wavgdr is seen to be rarely included or excluded in algorithms, via both the low inclusion and exclusion sums (implying little to no impact on overall performance), while also consistently ranked worse than e.g. pt5 and ssumpt in TMVA rankings.

sum34pt ranks average to below average in all TMVA and sensitivity rankings, while also consistently doing worse than pt5 and ssumpt.

Thus, wavgdr and sum34pt were also eliminated in the first iteration.

The variables avgdrbb, mindrbb, wavgdrbb, bbmaxmass and drbbmaxpt were found to have strong correlations with one another. While the overall performance of this group of variables is often seen to be better than others, a large amount of redundancy is expected when including groupings which have high internal correlations.

bbmaxmass and drbbmaxpt are generally seen as the worst performing variables in this correlation group as ranked by both TMVA and sensitivity charts.

Therefore both bbmaxmass and drbbmaxpt were selected to be excluded from further iterations.

After training the ∼14K combinations in the second iteration, the new sensitivities (table 10) and TMVA rankings were examined. Despite extremely high correlation between them, both bbmindr and normavgdrbb were left included.
Based on the high performance in the sensitivities involving these variables, it is likely that one of this pair will be the largest contributor to the final optimal variable set. As a result, it was decided to leave both in during the final iteration, to ensure the absolute best discriminating variable remained in use.

costhetastar and wavgdrbb were removed in this iteration, since they remain highly correlated with the remaining variables in their respective groupings (as explained above), and yet still rank worse.

With just 14 variables, the total number of variable permutations is reduced to 16 384, and it thus becomes technically feasible to attempt every possible remaining variable combination on every template training permutation. For clarity, this list of 14 remaining variables is:

1. aplab
2. avgdr
3. avgdrbb
4. bbmindr
5. jjssumpt
6. mindr2
7. mindrbb
8. MV1c2
9. MV1c3
10. MV1c4
11. normavgdrbb
12. nwavgdrbb
13. pt5
14. ssumpt

- Variable removal from training ttbb against ttsingleb -

The poor results obtained when training ttbb against its most dissimilar template, ttlightnotc, strongly indicated that the opposite should be attempted instead. As a result, the variable selection process was repeated, but training ttbb against ttsingleb instead. The selection process was very similar. Looking at the results from the sensitivities in the first round, given in table 11, along with the variable correlation plots and TMVA rankings given in table 6, six variables were again eliminated in the first round. These were mostly different from those removed in the other case.

mindr2 was seen to have abysmal performance in all sensitivity results. It is rarely included in the inclusion sum ranks, and was often excluded in the exclusion sum ranks, leading to this variable having the worst overall sensitivity results.
On this basis alone, it was removed from further iterations.

nwavgdrbb, similarly to the previous case, was seen to produce poor sensitivities, and to be correlated with avgdrbb and wavgdrbb, which tended to produce better sensitivities. avgdrbb had a worse overall sensitivity point score; however, it was far more often included in the best results, as seen from its inclusion sum score. Since the majority of negative points (18) attributed to avgdrbb result from its exclusion from the best performing algorithm when 19 variables are selected (a highly unlikely scenario for a top performing algorithm), it was decided to leave avgdrbb in for the next iteration.

drbbmaxpt, also similarly to the previous case, has correlations with bbmaxmass, avgdrbb, mindrbb and wavgdrbb, while not producing sensitivities as good as some of these other variables, and so it was also eliminated.

bbmindr was similarly seen to have very high correlations with normavgdrbb while not creating algorithms that were as beneficial, and so it, too, was eliminated from further iterations.

ssumpt and sum34pt were seen to have high (> 60%) correlations with wavgdr, costhetastar and pt5, while performing worse in sensitivities and being consistently ranked worse by TMVA. As a result, these two variables were also eliminated from further iterations.

In the second iteration, with 16 variables remaining, and again training combinations involving up to 5 or down to 11 variables, two additional variables were selected for elimination. The sensitivity performances resulting from this iteration are seen in table 12.

avgdr had low inclusion in and frequent exclusion from the best sensitivities, and so was excluded from the final iteration.

jjssumpt was also excluded, as it is somewhat correlated with costhetastar and wavgdr while doing comparably or worse in the sensitivities and TMVA rankings.

The optimal list of 14 variables from training ttbb against ttsingleb is therefore:

1. aplab
2. avgdrbb
3. bbmaxmass
4. costhetastar
5. jets40
6. mindrbb
7. MV1c1
8. MV1c2
9. MV1c3
10. MV1c4
11. normavgdrbb
12. pt5
13. wavgdr
14. wavgdrbb

As can be immediately seen, this list of remaining variables is not the same as in the previous case. However, in many cases, it was a different variable from the same correlation grouping that was removed.

10.3.2. Results from Boosted Decision Trees

Using both of the above sets of 14 variables separately, it was possible to train all remaining permutations of variables with each of the 11 different template combinations. Interestingly, the only algorithms which provided comparable results to the baseline^22 all occurred when ttbb was trained against ttsingleb, and all of them used both the MV1c3 and MV1c4 variables. In fact, of these top performing BDT algorithms, of which there were just five, all but one relied entirely on the MV1c-based variables.

The best expected statistical sensitivities from any BDT algorithm resulted from training ttbb against ttsingleb using MV1c1, MV1c3 and MV1c4 as the discriminating variables. The results are:

∆µbb = 0.1715
∆µbj = 0.2359
∆µcj = 0.0536
∆µjj = 0.0322

The corresponding distributions for this training can be seen in figure 27.

The remaining three BDT algorithms with sensitivities comparable to this best result used MV1c2 instead of MV1c1 (∆µbb = 0.1722), or used both variables (∆µbb = 0.1733), or neither, relying on just MV1c3 and MV1c4 (∆µbb = 0.1743).

Further, the only top-performing BDT algorithm to use any non-MV1c-based variable instead used the jets40 variable, along with MV1c3 and MV1c4.
The expected statistical sensitivities, from training ttbb against ttsingleb with jets40, MV1c3 and MV1c4 as discriminating variables, were:

∆µbb = 0.1812
∆µbj = 0.2639
∆µcj = 0.0601
∆µjj = 0.0576

The corresponding distributions for this training can be seen in figure 28.

[Figure 27: The best possible BDT distribution, from training ttbb against ttsingleb using MV1c1, MV1c3 and MV1c4. Templates shown: ttbb, ttsingleb, ttcjets, ttlightnotc. Y-axis: arbitrary units.]

[Figure 28: The only top performing BDT distribution to use any non-MV1c variable. Notably, it uses the only other integer-valued variable: jets40, the number of jets with pT over 40 GeV. Templates shown: ttbb, ttsingleb, ttcjets, ttlightnotc. Y-axis: arbitrary units.]

The best sensitivities from BDTs only improve upon the baseline sensitivity by less than 2%, which is not particularly significant. Further, as a result of the complications that arose attempting to determine the best variable and template training combination, in particular how vastly the performance of the algorithm could change if a wrong decision was made, BDTs were not considered any further in this thesis.

Regardless, it is interesting to note how all the top performing algorithms relied entirely on the MV1c values, with only one top algorithm including an alternative variable: the number of jets over 40 GeV. In particular, it is notable that this additional variable still only holds integer values, as is typically advantageous for use in BDTs [35].

^22 Specifically, every other BDT algorithm had ttbb sensitivities worse than 21.5%.

10.4. Neural Network Optimization

The variable selection process for use with neural networks is far more straightforward. Over-trained networks are quickly reached when too many variables are included in relation to the amount of statistics available for training [36].
As a result, NNs were expected (and were seen here) to quickly lose separation power when too many variables were included given the limited MC statistics available, since an over-training cutoff was applied in these circumstances, which prevented the network from reaching the potential it would have with unlimited training statistics. Consequently, the sensitivities examined at the first iteration only covered cases where three to six variables were included (since, as before, MV1c3 and MV1c4 were always included at this iteration). This sensitivity point card is available in table 13. In order to avoid the issues seen with BDTs when choosing variables, this sensitivity table was created for the two cases of ttbb trained against either ttlightnotc or ttsingleb.

Unlike for BDTs, where the variable rankings could change depending on the inclusion or exclusion of other variables, and also depending on which templates were trained against one another, the relative order of variable importance for NNs was static regardless of either of these factors. The neural network TMVA rankings for variables are given in table 5.

Looking at both the TMVA rankings and the sensitivities, a large degree of ambiguity is seen regarding which variables to include. This is most plainly seen by noting that the best four variables as determined by TMVA, wavgdr, ssumpt, wavgdrbb and bbmaxmass, are the worst four variables to include according to the sensitivities when ttbb is trained against ttlightnotc. As a result of these difficulties in confidently identifying useless variables, the same variable sets were selected for removal as in the BDT cases. This permitted investigations into the performance of algorithms which exclude MV1c3 and/or MV1c4.
The sensitivity tables for this second iteration of NN training, using the same variable sets as for the second iteration of BDT training, are given in table 14.

From this second iteration of results, it was determined that all the top performing algorithms include both MV1c3 and MV1c4. Four algorithms trained with ttbb against ttsingleb, and five algorithms trained with ttbb against ttlightnotc, give better than 18% uncertainty on the ttbb template. However, the best algorithms from either template training combination that exclude both MV1c3 and MV1c4 have ttbb sensitivities worse than 24.5%. Notably, MV1c2 is included in the best algorithm that excludes both MV1c3 and MV1c4, as this variable implicitly carries some of the information contained in the lower ranked variables, due to the ordered nature of the MV1c variables (e.g. that MV1c3 ≤ MV1c2).

As a result of MV1c3 and MV1c4 being included in every top performing algorithm, it was decided to include these two variables in every training permutation. Further, since the NNs quickly overtrain if too many variables are included, it was decided to simply train over every template training combination, including up to four additional variables, as in the first variable iteration (thus, three to six variables in total). Then, after examining the results, the three template training combinations which had the best sensitivities were trained including up to a fifth additional variable, to be sure that the inclusion of further variables is not beneficial. These three template combinations were ttbb trained against ttallnotb, ttlightnotc or ttlightjetsandcc.
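The training permutations described above can be enumerated mechanically. The following is a minimal sketch and not the code used in this analysis: the pool of candidate additional variables and the helper name `variable_sets` are assumptions made purely for illustration.

```python
from itertools import combinations

# Variables included in every training permutation, as decided in the text.
FIXED = ("MV1c3", "MV1c4")

# Hypothetical pool of additional candidates (illustration only); the pool
# actually used in the analysis is not reproduced here.
CANDIDATES = ("avgdr", "bbmindr", "MV1c1", "normavgdrbb", "aplab", "drbbmaxpt")


def variable_sets(fixed, candidates, max_additional):
    """Return every set made of `fixed` plus 1..max_additional candidates."""
    sets = []
    for n_extra in range(1, max_additional + 1):
        for extra in combinations(candidates, n_extra):
            sets.append(fixed + extra)
    return sets


# First pass: up to four additional variables (three to six in total).
sets_first_pass = variable_sets(FIXED, CANDIDATES, 4)
# Follow-up pass for the best template combinations: up to a fifth additional.
sets_second_pass = variable_sets(FIXED, CANDIDATES, 5)
print(len(sets_first_pass), len(sets_second_pass))
```

Each variable set would then be trained once per template training combination, so the number of trainings grows combinatorially with the size of the candidate pool.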
The very best ttbb sensitivity from all permutations turned out to include four additional variables, as seen below.

The expected statistical sensitivities from training ttbb against ttallnotb, using avgdr, bbmindr, MV1c1, MV1c3, MV1c4 and normavgdrbb as discriminating variables, were found to be:

∆µbb = 0.1691
∆µbj = 0.2321
∆µcj = 0.0538
∆µjj = 0.0436

The corresponding distributions for these sensitivities can be seen in figure 29.

This is the best ttbb sensitivity achievable with binary classification algorithms, and the only algorithm to break the 17% uncertainty mark.

Figure 29: The best ttbb sensitivity achievable with any binary classification machine learning algorithm. (Panels: verification histogram against overtraining, from the TMVA trainer; finely binned NN template response, from the TMVA reader; histograms used in the likelihood fit, a re-binned version of the finely binned response. Neural network: ttbb trained against ttallnotb. Variables used: avgdr, bbmindr, mv1c1, mv1c3, mv1c4, normavgdrbb. Expected statistical uncertainties: µbb = 1 ± 0.1691, µbj = 1 ± 0.2321, µcj = 1 ± 0.0538, µjj = 1 ± 0.0436.)

Note on avgdrbb and normavgdrbb variables...

The output template distributions of this top algorithm have the peculiar feature of a central region with a high number of events. This central bump is an artifact of the avgdrbb and normavgdrbb variables, as can be seen in figure 30. These variables are heavily dependent on the number of b-tagged jets, taking on a different range of values depending on how many jets are considered tagged. As a result, a peak frequency of events is observed at a different value of these variables for each number of b-tags.
These additional bumps carry into the ML algorithms, creating the observed template distributions.

Figure 30: The top figures show the distributions for avgdrbb and normavgdrbb as used by the trainer for the ttbb and ttlightnotc cases. The middle figure shows the correlations between these variables among ttbb events, which include four or more b-jets at truth level. The bottom figure shows these same correlations, but for ttlightnotc events. As can be seen from these distributions, and from the definitions of the variables themselves, a high degree of dependence is seen between the number of MV1c b-tagged jets and the value of these two variables. In particular, bumps are seen in the distributions of these variables, as a result of an integer increase in the number of b-tagged jets. (The red points in the bottom two rows of plots simply give the average value for events in that bin.) (Panels: input variable avgdrbb; input variable normavgdrbb; mv1c3 versus avgdrbb, mv1c4 versus avgdrbb and normavgdrbb versus avgdrbb, each for ttbb and for ttlightnotc.)

10.5. Comparison of Best BDT and NN Binary Classification Results

The best NN and BDT algorithms give very similar ttbb sensitivities (16.91% vs 17.15%, respectively). They both use three of the same variables: MV1c1, MV1c3 and MV1c4. However, the NN algorithm supplements these three variables with three additional variables.
More interesting, however, is how robust the NN algorithms appear to be compared to the BDT algorithms.

Unlike with BDTs, in the case of NNs there were 123 alternative algorithms which all provided better than 18% uncertainty on the ttbb measurement. Further, at least one algorithm with better than 18% ttbb uncertainty exists for each of the tested template training combinations, a testament to the robust nature of NNs. Five different template training combinations are represented even in just the top ten algorithms: the three mentioned previously, which were trained including up to seven variables in total, in addition to ttbb trained against ttmediumjets and ttbjets trained against ttlightjets.

In an attempt to compare “apples to apples,” NN algorithms which use just MV1c1, MV1c3 and MV1c4 were examined and were all found to have sensitivities of around 18%, regardless of which templates were used in training. While these sensitivities are not quite as good as that of the BDT which trains ttbb against ttsingleb using these three variables, it is still interesting to see how indifferent NNs are to which templates are used in training. This is in stark contrast to BDTs, which were unable to reach sensitivities better than 21.5% without specifically training ttbb against ttsingleb.

10.6. Multi-Class Neural Network Algorithms

Due to the relatively robust nature of the neural network algorithms in comparison to boosted decision trees, it was decided to investigate multi-class algorithms using only NNs. Given that MV1c3 and MV1c4 were always among the top-ranked variables in binary classification, only permutations which included both of these variables along with one to four additional variables were considered. The list of multi-class template training permutations which were investigated is given below.
As a result of statistical issues with dimensionality, as well as severe complications in determining a re-binning procedure for a 3-D histogram, only cases with three separating templates were considered.

Multi-Class Training Combinations

1. ttbb vs ttlightnotc vs ttmediumjets
2. ttbb vs ttlightnotc vs ttcjets
3. ttbb vs ttlightnotc vs ttsingleb
4. ttbb vs ttlightjetsandcc vs ttsinglebandcc (= sum of templates with exactly 3 b-tagged jets and/or ≥ 2 c-tagged jets)
5. ttbb vs ttlightjets vs ttsinglebandcc
6. ttbb vs ttlightjets vs ttsingleb
7. ttbb vs ttlightnotc vs ttcjets vs ttsingleb
8. ttbb vs ttlightjets vs ttsingleb vs ttcc (= template with ≥ 2 c-tagged jets)
9. ttbjets vs ttlightnotc vs ttcjets

10.6.1. Re-Binning Algorithm

For multi-class algorithms training three templates against one another, two independent algorithm readers are produced by TMVA (see footnote 23). This generates a two dimensional histogram, which must be converted to a one dimensional histogram for use in the likelihood fit. This process is not quite as straightforward as with binary classification, where neighbouring bins are simply merged to form a reasonable template. In particular, any re-binning procedure must not merge disjoint regions of phase space, nor isolate regions by encircling bins with high statistics, as was already discussed for the baseline analysis.

To avoid these issues, the following algorithm was used to merge bins. This algorithm assumes the 2D histogram has the ttbb (or, in some template training cases, ttbjets) output on the x-axis, and the “moderate” reader, e.g. ttsingleb or ttcj, on the y-axis. Therefore, an event that is seen as likely ttsingleb but unlikely ttbb would reside in the top-left quadrant of the histogram. The merging algorithm is as follows:

1. Determine the bin with the lowest content. If multiple bins have this content, such as when there exist multiple empty bins, the algorithm selects the right-most bin.
If the right-most column also contains multiple such bins, the top bin in this column is selected.
2. Determine the bin with the least content that neighbours this first bin. If multiple neighbours exist with the same content (such as nil), the preferred neighbour to merge with is chosen in the order: up, down, right, left. This ordering makes the “moderate” reader more likely to have a coarser binning than the ttbb reader.
3. Merge these two bins, then repeat from step 1. In cases where the bin with the least content has already been merged with other bins, the new bin is treated as a rectangular shape and merges with similar groupings. For example, if a bin is merged with the bin below it, and this pairing is again seen to have the least content, with a neighbour to the left which has the least content of all neighbouring bins, then the pair merges with both bins to the left to form a single bin. As a result, the re-binning keeps rectangular shapes from the original 2D histogram.

In this way, the algorithm attempts to eliminate bins with low statistics, while maintaining reasonable shapes in the algorithm phase space. Outlines which indicate the re-binning procedure for a given algorithm are given with each result in this section. In some cases they seem less than ideal, though, as will be seen, it is difficult to judge the success of the re-binning procedure from appearance alone.

23 In practice, three outputs are produced in total, but the sum of all three is equal to 1 by construction, resulting in two independent distributions.

10.6.2. Results

The multi-class algorithms mildly improve on the binary class cases, with 25 configurations giving better than the top binary result of 16.9% uncertainty on the ttbb measurement.
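The bin-merging procedure of section 10.6.1 can be sketched in a few lines. This is a simplified illustration rather than the implementation used in this analysis: it follows the tie-breaking rules of steps 1 and 2, but it does not enforce the rectangular shapes of step 3, and the stopping rule (a minimum bin content, `min_content`) is an assumption, since the text does not state when merging stops. The function name `greedy_rebin` and the toy histogram are likewise hypothetical.

```python
# Neighbour preference from the text: up, down, right, left, as (dx, dy) steps.
DIRECTIONS = ((0, 1), (0, -1), (1, 0), (-1, 0))


def greedy_rebin(hist, min_content):
    """Greedily merge low-content bins of hist[ix][iy].

    ix indexes the ttbb response (x-axis), iy the "moderate" reader (y-axis).
    Returns (labels, content): labels[ix][iy] is the merged-bin id of each
    original cell, and content maps each merged-bin id to its total content.
    """
    nx, ny = len(hist), len(hist[0])
    labels = [[ix * ny + iy for iy in range(ny)] for ix in range(nx)]
    cells = {labels[ix][iy]: [(ix, iy)] for ix in range(nx) for iy in range(ny)}
    content = {labels[ix][iy]: float(hist[ix][iy])
               for ix in range(nx) for iy in range(ny)}

    def step1_key(region):
        # Lowest content first; ties go to the right-most, then top-most cell.
        ix, iy = max(cells[region])
        return (content[region], -ix, -iy)

    while len(content) > 1:
        seed = min(content, key=step1_key)
        if content[seed] >= min_content:  # assumed stopping rule
            break
        # Step 2: lowest-content neighbour, ties broken by direction order.
        best = None
        for rank, (dx, dy) in enumerate(DIRECTIONS):
            for ix, iy in cells[seed]:
                jx, jy = ix + dx, iy + dy
                if 0 <= jx < nx and 0 <= jy < ny and labels[jx][jy] != seed:
                    other = labels[jx][jy]
                    if best is None or (content[other], rank) < best[:2]:
                        best = (content[other], rank, other)
        other = best[2]
        # Step 3 (simplified): merge the neighbour into the seed region.
        for ix, iy in cells[other]:
            labels[ix][iy] = seed
        cells[seed].extend(cells.pop(other))
        content[seed] += content.pop(other)
    return labels, content


toy = [[5, 0],
       [1, 4]]  # toy[ix][iy], hypothetical bin contents
labels, content = greedy_rebin(toy, min_content=3)
print(labels, content)
```

In this toy case the three lowest bins end up in a single L-shaped region, which the rectangular constraint of step 3 would not allow; a faithful implementation would need to track rectangles rather than arbitrary connected regions.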
The best three results from among these algorithms are given below.

Training ttbb vs ttsingleb vs ttlightjetsandcc using input variables drbbmaxpt, MV1c3 and MV1c4 gives expected statistical sensitivities of:

∆µbb = 0.1608
∆µbj = 0.2304
∆µcj = 0.0660
∆µjj = 0.0435

The corresponding output discriminant can be seen in figure 31.

Training ttbb vs ttsingleb vs ttlightnotc using input variables aplab, avgdr, MV1c3 and MV1c4 gives expected statistical sensitivities of:

∆µbb = 0.1621
∆µbj = 0.2306
∆µcj = 0.0525
∆µjj = 0.0286

The corresponding output discriminant can be seen in figure 32.

Training ttbb vs ttsingleb vs ttlightjets using input variables aplab, drbbmaxpt, MV1c1, MV1c3 and MV1c4 gives expected statistical sensitivities of:

∆µbb = 0.1621
∆µbj = 0.2352
∆µcj = 0.0610
∆µjj = 0.0345

The corresponding output discriminant can be seen in figure 33.

As can be seen, the use of a multi-class algorithm allows for some mild improvements on what is achievable with binary classification. However, the non-transparent re-binning procedure adds a further level of difficulty in interpreting the results.

The automatic re-binning procedure applied for these algorithms seems to be less than optimal. An attempt was made to improve on this procedure for just these top three algorithms, by manually selecting which bins to merge.
Counterintuitively, this worsened the results in all three cases.

(Figure 31 shows: verification against overtraining plots for each TMVA reader and class; the 2D output histogram showing the re-binning procedure with total bin occupancy, ttbb response versus ttsingleb response; and the histogram of templates used in the likelihood fit. Neural network: ttbb trained vs ttsingleb vs ttlightjetsandcc. Variables used: drbbmaxpt, mv1c3, mv1c4. Expected statistical uncertainties: µbb = 1 ± 0.1608, µbj = 1 ± 0.2304, µcj = 1 ± 0.0660, µjj = 1 ± 0.0435.)

Figure 31: The best result from any machine learning algorithm. The black boxes in the 2D histogram indicate the re-binning procedure taken with this algorithm. While this does not appear optimal, particularly as it leaves one bin empty, a manual re-binning procedure actually degrades the sensitivity of the measurement.
Note that the empty bin is excluded from the profile likelihood fit.

(Figure 32 shows: verification against overtraining plots for each TMVA reader and class; the 2D output histogram showing the re-binning procedure with total bin occupancy, ttbb response versus ttsingleb response; and the histogram of templates used in the likelihood fit. Neural network: ttbb trained vs ttsingleb vs ttlightnotc. Variables used: aplab, avgdr, mv1c3, mv1c4. Expected statistical uncertainties: µbb = 1 ± 0.1621, µbj = 1 ± 0.2306, µcj = 1 ± 0.0525, µjj = 1 ± 0.0286.)

Figure 32: The second best result from any machine learning algorithm.
The black boxes in the 2D histogram indicate the re-binning procedure taken with this algorithm.

(Figure 33 shows: verification against overtraining plots for each TMVA reader and class; the 2D output histogram showing the re-binning procedure with total bin occupancy, ttbb response versus ttsingleb response; and the histogram of templates used in the likelihood fit. Neural network: ttbb trained vs ttsingleb vs ttlightjets. Variables used: aplab, drbbmaxpt, mv1c1, mv1c3, mv1c4. Expected statistical uncertainties: µbb = 1 ± 0.1621, µbj = 1 ± 0.2352, µcj = 1 ± 0.0610, µjj = 1 ± 0.0345.)

Figure 33: The third best result from any machine learning algorithm.
The black boxes in the 2D histogram indicate the re-binning procedure taken with this algorithm.

Rank | lightnotc | Importance | allnotb | Importance | lightjetsandcc | Importance
1 | MV1c3 | 6.6 × 10−1 | MV1c3 | 6.2 × 10−1 | MV1c3 | 6.3 × 10−1
2 | avgdrbb | 9.3 × 10−2 | MV1c4 | 2.1 × 10−1 | MV1c4 | 1.8 × 10−1
3 | avgdr | 4.7 × 10−2 | MV1c2 | 4.0 × 10−2 | MV1c2 | 5.6 × 10−2
4 | pt5 | 3.5 × 10−2 | avgdr | 3.0 × 10−2 | avgdr | 3.2 × 10−2
5 | MV1c2 | 3.0 × 10−2 | avgdrbb | 2.7 × 10−2 | avgdrbb | 2.2 × 10−2
6 | ssumpt | 2.9 × 10−2 | pt5 | 2.4 × 10−2 | pt5 | 2.9 × 10−2
7 | wavgdr | 1.8 × 10−2 | ssumpt | 1.8 × 10−2 | ssumpt | 1.9 × 10−2
8 | nwavgdrbb | 1.5 × 10−2 | wavgdr | 8.6 × 10−3 | wavgdr | 1.3 × 10−2
9 | normavgdrbb | 1.5 × 10−2 | jjssumpt | 8.5 × 10−3 | jjssumpt | 7.5 × 10−3
10 | jjssumpt | 1.5 × 10−2 | costhetastar | 4.2 × 10−3 | sum34pt | 3.6 × 10−3
11 | bbmindr | 9.7 × 10−3 | sum34pt | 4.1 × 10−3 | nwavgdrbb | 3.0 × 10−3
12 | sum34pt | 9.4 × 10−3 | nwavgdrbb | 4.1 × 10−3 | wavgdrbb | 2.3 × 10−3
13 | costhetastar | 7.9 × 10−3 | bbmindr | 3.5 × 10−3 | mindrbb | 2.0 × 10−3
14 | mindrbb | 5.6 × 10−3 | normavgdrbb | 1.9 × 10−3 | costhetastar | 1.9 × 10−3
15 | wavgdrbb | 5.5 × 10−3 | aplab | 0.0 | aplab | 0.0
16 | bbmaxmass | 2.3 × 10−3 | bbmaxmass | 0.0 | bbmaxmass | 0.0
17 | jets40 | 1.4 × 10−3 | drbbmaxpt | 0.0 | bbmindr | 0.0
18 | aplab | 0.0 | jets40 | 0.0 | drbbmaxpt | 0.0
19 | drbbmaxpt | 0.0 | mindr2 | 0.0 | jets40 | 0.0
20 | mindr2 | 0.0 | mindrbb | 0.0 | mindr2 | 0.0
21 | MV1c1 | 0.0 | MV1c1 | 0.0 | MV1c1 | 0.0
22 | MV1c4 | 0.0 | wavgdrbb | 0.0 | normavgdrbb | 0.0

Rank | mediumjets | Importance | cjets | Importance | singleb | Importance
1 | MV1c3 | 4.5 × 10−1 | MV1c3 | 4.3 × 10−1 | MV1c4 | 1.0 × 10−1
2 | MV1c4 | 2.4 × 10−1 | MV1c4 | 2.2 × 10−1 | MV1c3 | 8.2 × 10−2
3 | avgdrbb | 6.1 × 10−2 | avgdrbb | 9.7 × 10−2 | avgdr | 6.1 × 10−2
4 | MV1c2 | 5.7 × 10−2 | MV1c2 | 7.0 × 10−2 | mindrbb | 6.0 × 10−2
5 | avgdr | 4.0 × 10−2 | avgdr | 4.6 × 10−2 | pt5 | 5.9 × 10−2
6 | pt5 | 2.9 × 10−2 | ssumpt | 2.7 × 10−2 | bbmindr | 5.3 × 10−2
7 | ssumpt | 2.7 × 10−2 | pt5 | 2.5 × 10−2 | sum34pt | 4.9 × 10−2
8 | bbmindr | 1.4 × 10−2 | mindrbb | 1.5 × 10−2 | mindr2 | 4.8 × 10−2
9 | mindrbb | 1.4 × 10−2 | jjssumpt | 1.4 × 10−2 | normavgdrbb | 4.6 × 10−2
10 | jjssumpt | 1.2 × 10−2 | wavgdr | 1.4 × 10−2 | nwavgdrbb | 4.5 × 10−2
11 | wavgdr | 1.1 × 10−2 | costhetastar | 9.8 × 10−3 | avgdrbb | 4.4 × 10−2
12 | costhetastar | 1.0 × 10−2 | bbmindr | 7.2 × 10−3 | ssumpt | 4.2 × 10−2
13 | nwavgdrbb | 7.5 × 10−3 | aplab | 5.4 × 10−3 | drbbmaxpt | 4.2 × 10−2
14 | normavgdrbb | 6.6 × 10−3 | nwavgdrbb | 4.6 × 10−3 | jjssumpt | 3.8 × 10−2
15 | bbmaxmass | 5.3 × 10−3 | mindr2 | 3.7 × 10−3 | MV1c2 | 3.6 × 10−2
16 | sum34pt | 5.2 × 10−3 | bbmaxmass | 3.4 × 10−3 | costhetastar | 3.6 × 10−2
17 | mindr2 | 2.8 × 10−3 | sum34pt | 3.2 × 10−3 | aplab | 3.4 × 10−2
18 | wavgdrbb | 2.4 × 10−3 | wavgdrbb | 3.0 × 10−3 | wavgdr | 3.0 × 10−2
19 | aplab | 1.9 × 10−3 | MV1c1 | 2.5 × 10−3 | wavgdrbb | 2.7 × 10−2
20 | drbbmaxpt | 4.2 × 10−4 | drbbmaxpt | 1.1 × 10−3 | bbmaxmass | 2.5 × 10−2
21 | jets40 | 0.0 | jets40 | 0.0 | jets40 | 2.4 × 10−2
22 | MV1c1 | 0.0 | normavgdrbb | 0.0 | MV1c1 | 1.8 × 10−2

Table 6: The variable rankings as displayed by TMVA for gradient-boosted decision trees, for ttbb trained against each of the templates indicated in the column headers.

Variable | Incl. | Excl. | Total
aplab | 10.7 | -22.8 | -12.1
avgdr | 2.8 | -20.8 | -18.0
avgdrbb | 24.6 | -16.4 | 8.2
bbmaxmass | 10.8 | -6.6 | 4.3
bbmindr | 22.5 | -6.8 | 15.7
costhetastar | 3.5 | -8.2 | -4.7
drbbmaxpt | 9.8 | -4.8 | 5.0
jets40 | 5.8 | -33.3 | -27.4
jjssumpt | 4.2 | -6.7 | -2.5
mindr2 | 3.5 | -9.1 | -5.5
mindrbb | 13.4 | -9.7 | 3.6
normavgdrbb | 62.0 | -14.6 | 47.4
pt5 | 1.5 | -8.0 | -6.5
ssumpt | 3.6 | -6.4 | -2.8
sum34pt | 5.2 | -10.7 | -5.5
wavgdr | 2.6 | -5.4 | -2.9
wavgdrbb | 13.4 | -17.2 | -3.8

Table 7: This table shows the points assigned to variables, based on whether they were included in or excluded from algorithms which had top performing sensitivities, as well as the sum of the two. Higher scores are better, as the exclusion sums were assigned negative values. This table is meant to identify the main scores examined while selecting which variables to eliminate.
A fully detailed point card, which also shows the points assigned for each subset, is available in appendix D, table 9.

11. Conclusions and Outlook

Precisely measuring tt̄ + bb̄ production is strongly motivated by the desire to measure the Higgs-to-top Yukawa coupling. A Higgs boson with a mass of 125 GeV (such as that recently discovered by the LHC experiments ATLAS and CMS) decays primarily to a bottom quark pair. Additionally, due to the large mass of the top quark, the cross section for Higgs production in association with two top quarks is computed to be very large. Thus, in order to precisely measure tt̄ + Higgs production, with the Higgs decaying to bb̄, it is necessary to know precisely how often tt̄ + bb̄ is produced from all possible processes, not just those which include the Higgs. Improving the statistical precision of the knowledge of this dominant background was the goal of this thesis.

This thesis attempted to improve on the statistical sensitivity to the measurement of tt̄ + bb̄ production through the use of multivariate analysis techniques. This measurement was based on expected sensitivities from 20.3 fb−1 of pp collision data at √s = 8 TeV, collected by the ATLAS detector at the Large Hadron Collider in 2012. Using a profile likelihood statistical fit, a comparison was made between the baseline measurement, which uses the fit to the best single variable (the MV1c neural network b-tagger algorithm), and newly produced neural network and boosted decision tree algorithms.

Four templates were used in the likelihood fit: tt̄ events with at least two additional b-jets, in addition to the two b-jets resulting from the tt̄ pair; tt̄ events with just one additional b-jet; tt̄ events with exactly two b-jets and at least one c-jet; and all other tt̄ events. The baseline measurement used a fit of the MV1c b-tagger values of the 3rd and 4th highest jets in each event, as ranked by MV1c.
This resulted in a sensitivity to the ttbb measurement with 19.3% uncertainty.

Multivariate analysis techniques (boosted decision trees and neural networks) were then examined in order to include additional discriminating kinematic variables. In addition to four MV1c-based variables, 18 additional variables were identified as possible discriminants. In order to determine an optimal variable set for use in these algorithms, a novel metric was developed for this analysis which looked directly at the final sensitivities resulting from the trained algorithms. This new metric, along with the variable rankings provided by TMVA and the correlation charts between variables, were all used to determine which variables were reasonable to include in the final, optimal algorithm training.

Since four different templates were used in the likelihood fit, it was not immediately clear which templates to use in training. Including all four templates in training would result in a 3-dimensional histogram, which suffers from complications due to low statistics and a related re-binning process. Therefore, in addition to determining an optimal variable set, 20 different template training combinations were identified for use in training. Eleven of these combinations were binary (e.g. training ttbb against just one other template), resulting in a 1-D output histogram, easily accessible by the likelihood fit. The remaining nine template combinations created multi-class algorithms, which trained three different templates against one another. This multi-class training procedure resulted in a two dimensional output histogram, which was then re-binned for use in the likelihood fit.

Both neural networks (NNs) and boosted decision trees (BDTs) were analyzed under binary classification circumstances, with very similar best sensitivities: 16.9% and 17.2% ttbb uncertainties, respectively.
However, it was found that NNs were relatively stable to small changes, unlike BDTs; for example, NNs were more stable to changes in which variables were included or which templates were used in training. As a result, multi-class algorithms were only investigated for neural networks. The best results from these multi-class neural network algorithms provided further improvement, down to 16.1% ttbb uncertainty.

This improvement, from 19.3% uncertainty with the baseline ttbb measurement to 16.1% ttbb uncertainty with the very best multi-class neural network, represents a relative improvement of almost 20%.

Looking forward, in order to realize this analysis as a complete measurement, further work would be required to ensure these improvements hold once systematic uncertainties are included. This would require looking at how various sources of systematic uncertainty affect the output distributions from the multivariate analysis, and determining how this affects the profile likelihood fit.

References

[1] The ATLAS Collaboration, The ATLAS Experiment at the CERN Large Hadron Collider, J. Instrum. 3 (2008) S08003. 437 p, Also published by CERN Geneva in 2010, url: https://cds.cern.ch/record/1129811.
[2] F. Wilczek, Asymptotic freedom: From paradox to paradigm, Proc. Nat. Acad. Sci. 102 (2005) 8403–8413, [Rev. Mod. Phys. 77, 857 (2005)], arXiv: hep-ph/0502113 [hep-ph].
[3] The ATLAS Collaboration, ATLAS detector and physics performance: Technical Design Report, 1, Technical Design Report ATLAS, Geneva: CERN, 1999, url: https://cds.cern.ch/record/391176.
[4] K. Cranmer et al., ‘HistFactory: A tool for creating statistical models for use with RooFit and RooStats’, tech. rep. CERN-OPEN-2012-016, New York U., 2012, url: https://cds.cern.ch/record/1456844.
[5] The ATLAS Collaboration, Calibration of the performance of b-tagging for c and light-flavour jets in the 2012 ATLAS data, ATLAS-CONF-2014-046, 2014, url: http://cdsweb.cern.ch/record/1741020.
[6] M. Bohm, A. Denner and H.
Joos, Gauge theories of the strong and electroweak interaction, 2001, isbn: 3519230453, 978-3519230458.
[7] S. Dawson, ‘Introduction to the physics of Higgs bosons’, Theoretical Advanced Study Institute in Elementary Particle Physics (TASI 94): CP Violation and the limits of the Standard Model Boulder, Colorado, May 29-June 24, 1994, 1994, arXiv: hep-ph/9411325 [hep-ph], url: http://alice.cern.ch/format/showfull?sysnb=0191888.
[8] S. Dawson, ‘Introduction to electroweak symmetry breaking’, High energy physics and cosmology. Proceedings, Summer School, Trieste, Italy, June 29-July 17, 1998, 1998 1–83, arXiv: hep-ph/9901280 [hep-ph], url: http://alice.cern.ch/format/showfull?sysnb=0301862.
[9] M. E. Peskin and D. V. Schroeder, An Introduction to quantum field theory, 1995, isbn: 9780201503975, 0201503972, url: http://www.slac.stanford.edu/spires/find/books/www?cl=QC174.45%3AP4.
[10] A. Ceccucci, Z. Ligeti and Y. Sakai, The CKM quark-mixing matrix, Phys. Lett. B667 (2008) 145–152, url: http://pdg.lbl.gov/2015/reviews/rpp2015-rev-ckm-matrix.pdf.
[11] S. Bethke, Experimental tests of asymptotic freedom, Prog. Part. Nucl. Phys. 58 (2007) 351–386, arXiv: hep-ex/0606035 [hep-ex].
[12] B. Malaescu and P. Starovoitov, Evaluation of the Strong Coupling Constant αS Using the ATLAS Inclusive Jet Cross-Section Data, Eur. Phys. J. C72 (2012) 2041, arXiv: 1203.5416 [hep-ph].
[13] D. d’Enterria and P. Z. Skands, eds., Proceedings, High-Precision αS Measurements from LHC to FCC-ee, CERN, Geneva: CERN, 2015, arXiv: 1512.05194 [hep-ph], url: https://inspirehep.net/record/1409920/files/arXiv:1512.05194.pdf.
[14] S. Sapeta, QCD and Jets at Hadron Colliders, Prog. Part. Nucl. Phys. 89 (2016) 1–55, arXiv: 1511.09336 [hep-ph].
[15] S. D. Ellis et al., Jets in hadron-hadron collisions, Prog. Part. Nucl. Phys. 60 (2008) 484–551, arXiv: 0712.2447 [hep-ph].
[16] The ATLAS Collaboration, Search for the associated production of the Higgs boson with a top quark pair in multilepton final states with the ATLAS detector, Phys.
Lett. B749 (2015) 519–541, arXiv: 1506.05988 [hep-ex].
[17] The ATLAS Collaboration, Search for the Standard Model Higgs boson produced in association with top quarks and decaying into bb̄ in pp collisions at √s = 8 TeV with the ATLAS detector, Eur. Phys. J. C75.7 (2015) 349, arXiv: 1503.05066 [hep-ex].
[18] S. Alekhin, A. Djouadi and S. Moch, The top quark and Higgs boson masses and the stability of the electroweak vacuum, Phys. Lett. B716 (2012) 214–219, arXiv: 1207.0980 [hep-ph].
[19] S. Dawson et al., Associated Higgs production with top quarks at the large hadron collider: NLO QCD corrections, Phys. Rev. D68 (2003) 034022, arXiv: hep-ph/0305087 [hep-ph].
[20] T. Cornelissen et al., ‘Concepts, Design and Implementation of the ATLAS New Tracking (NEWT)’, tech. rep. ATL-SOFT-PUB-2007-007. ATL-COM-SOFT-2007-002, CERN, 2007, url: http://cds.cern.ch/record/1020106.
[21] The ATLAS Collaboration, The ATLAS Pixel Detector, IEEE Trans. Nucl. Sci. 53 (2006) 1732–1736, arXiv: physics/0412138 [physics.ins-det].
[22] The ATLAS Collaboration, Operation and performance of the ATLAS semiconductor tracker, JINST 9 (2014) P08009, arXiv: 1404.7473 [hep-ex].
[23] The ATLAS Collaboration, ‘The ATLAS transition radiation tracker’, Astroparticle, particle and space physics, detectors and medical physics applications. Proceedings, 8th Conference, ICATPP 2003, Como, Italy, October 6-10, 2003, 2003 497–501, arXiv: hep-ex/0311058 [hep-ex], url: http://weblib.cern.ch/abstract?ATL-CONF-2003-012.
[24] G. F. Knoll, Radiation Detection and Measurement, 3rd ed. New York: John Wiley and Sons, 2000, isbn: 9780471073383, 0471073385, url: http://www.slac.stanford.edu/spires/find/books/www?cl=QCD915:K55:2000.
[25] B. Dolgoshein, Transition radiation detectors, Nucl. Instrum. Meth. A326 (1993) 434–469.
[26] The ATLAS Collaboration, Measurements of fiducial cross-sections for tt̄ production with one or two additional b-jets in pp collisions at √s = 8 TeV using the ATLAS detector, Eur. Phys. J.
C76.1 (2016) 11, arXiv: 1508.06868 [hep-ex].
[27] The ATLAS Collaboration, b-Tagging (2009), url: https://cds.cern.ch/record/1159581.
[28] The ATLAS Collaboration, Commissioning of the ATLAS high performance b-tagging algorithms in the 7 TeV collision data, ATLAS-CONF-2011-102, 2011, url: http://cdsweb.cern.ch/record/1369219.
[29] The ATLAS Collaboration, ‘b-tagging in dense environments’, tech. rep. ATL-PHYS-PUB-2014-014, CERN, 2014, url: https://cds.cern.ch/record/1750682.
[30] G. Piacquadio and C. Weiser, A new inclusive secondary vertex algorithm for b-jet tagging in ATLAS, Journal of Physics: Conference Series 119.3 (2008) 032032, url: http://stacks.iop.org/1742-6596/119/i=3/a=032032.
[31] The ATLAS Collaboration, ‘Performance and Calibration of the JetFitterCharm Algorithm for c-Jet Identification’, tech. rep. ATL-PHYS-PUB-2015-001, CERN, 2015, url: http://cds.cern.ch/record/1980463.
[32] The ATLAS Collaboration, ‘Measurement of the b-tag Efficiency in a Sample of Jets Containing Muons with 5 fb−1 of Data from the ATLAS Detector’, tech. rep. ATLAS-CONF-2012-043, CERN, 2012, url: http://cds.cern.ch/record/1435197.
[33] The ATLAS Collaboration, Calibration of b-tagging using dileptonic top pair events in a combinatorial likelihood approach with the ATLAS experiment (2014), url: http://cdsweb.cern.ch/record/1664335.
[34] M. Cacciari, G. P. Salam and G. Soyez, The Catchment Area of Jets, JHEP 04 (2008) 005, arXiv: 0802.1188 [hep-ph].
[35] J. Friedman, T. Hastie and R. Tibshirani, The Elements of Statistical Learning: Data Mining, Inference, and Prediction In The Elements of Statistical Learning, 2003, isbn: 978-0-387-84858-7.
[36] P. Speckmayer et al., The toolkit for multivariate data analysis, TMVA 4, J. Phys. Conf. Ser. 219 (2010) 032057.
[37] G. Cowan et al., Asymptotic formulae for likelihood-based tests of new physics, Eur. Phys. J. C71 (2011) 1554, [Erratum: Eur. Phys. J. C73, 2501 (2013)], arXiv: 1007.1727 [physics.data-an].
[38] C. M.
Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006, isbn: 0387310738.
[39] Ji Zhu, Hui Zou, Saharon Rosset and Trevor Hastie, Multi-class AdaBoost, Statistics and Its Interface 2 (2009) 349–360, url: https://web.stanford.edu/~hastie/Papers/SII-2-3-A8-Zhu.pdf.
[40] J. Friedman, T. Hastie and R. Tibshirani, Additive logistic regression: a statistical view of boosting (With discussion and a rejoinder by the authors), Ann. Statist. 28.2 (Apr. 2000) 337–407, url: http://dx.doi.org/10.1214/aos/1016218223.
[41] J. H. Friedman, Greedy function approximation: A gradient boosting machine, Ann. Statist. 29.5 (Oct. 2001) 1189–1232, url: http://dx.doi.org/10.1214/aos/1013203451.

Appendices

A. Boosted Decision Trees

Decision trees are simple enough to understand, and the addition of boosting turns them into a powerful discriminator. However, boosting unfortunately eliminates most of the transparency. TMVA implements numerous boosting techniques in its binary classification class; for classification between three or more classes, however, only logistic gradient-boosted trees are used, as AdaBoost is typically only suitable for binary classification [36, 39, 40]. As a result of this, and of the relative stability of gradient boosting over AdaBoost [36], this analysis focuses on logistic gradient boosted decision trees. Bagging is also used to help smooth over statistical instability.

A.1. Decision Trees

Classification Trees

In the case of binary classification trees, each data point or event, $x_i$, has a true response $y_i$ equal to 1 or -1 (see footnote 24). Each tree is grown by selecting $S$ points in each variable range to test as an ideal split point ($S = 20$ in this analysis). By default in TMVA, the splitting variable $j$ and split point $s$ are selected to minimize the sum of the two Gini indices of each new node, though other measures of impurity are available. For K
For $K$ classes, $p_{mk}$ is the fraction of class $k$ events in node $m$. The Gini index of a node $m$ is given by [35]:

$$G_m = \sum_{k \neq k'} p_{mk}\, p_{mk'} = \sum_{k \in K} p_{mk}(1 - p_{mk}) \tag{8}$$

In an "ideal" node, with 100% of events belonging to a certain class $k_o$, $p_{mk} = 1$ for $k = k_o$ and $p_{mk} = 0$ otherwise, which yields a Gini index of 0. The index increases as the sample becomes more mixed; in a fully mixed sample, $p_{mk} = p_{mk'}\ \forall k, k'$, the Gini index reaches its maximum of $1 - \frac{1}{K}$. In binary classification, a single ideal split is all that is required to perfectly separate a trivial signal and background sample. However, for multiple classes, a minimum of $K$ nodes is required, and thus $K - 1$ splits.

Regression Trees

Somewhat counter-intuitively, gradient boosted decision trees are grown using a regression algorithm for both regression and classification tasks. In a regression task, each data point (or event) $x_i$ corresponds to a response $y_i$, where $y_i$ now takes continuous rather than binary values. The predicted response of a regression tree is modelled as a constant $c$ in each terminal node $J$, equal to the average response; $c = \mathrm{ave}(y_i \mid x_i \in J)$ [35]. These trees are grown in essentially the same way; however, the $j$ and $s$ selected now minimize the sum of the squared errors for each node:

$$E = \sum_{x_i \in J} (y_i - c)^2 \tag{9}$$

^24 Classification algorithms with more than two classes will feature a $K$-dimensional vector response, with $y_{ik} = 1$ if $x_i \in k$ and $y_{ik} = 0$ otherwise.

A.2. Boosting

The loss function fully determines the boosting procedure [36]. The algorithm for a generic loss function is given in Figure 34. The operation of this algorithm is most easily illustrated by taking the loss function between some binary class predictor $F_k(x)$ and some true value $y_k$ to be equal to the squared error, $L[y, F(x)] = \frac{1}{2}(y - F(x))^2$. In this case, the algorithm seed $F_0(x)$ is the average response of all data points; $c = \mathrm{ave}(y_i)$. The pseudo response at each iteration $m$ is then simply the residuals, $\tilde{y}_i = y_i - F_{m-1}(x_i)$, for each event $x_i$.
A regression tree is then grown using these residuals as the target response. This tree is viewed as a new classifier $h(x; a_m)$, where the $a_m$ refer to the parameters that determine the $m$th tree's substructure. The algorithm response is then updated by taking $F_m = F_{m-1} + \rho_m h(x; a_m)$, with $\rho_m$ selected to minimize

$$\sum_{i=1}^{N} \left[ y_i - \left( F_{m-1}(x_i) + \rho_m h(x_i; a_m) \right) \right]^2$$

which in this case is a simple optimization problem. Substituting in $\tilde{y}_i = y_i - F_{m-1}(x_i)$ allows us to solve for $\rho_m$ directly, by minimizing

$$\sum_{i=1}^{N} \left[ y_i - F_{m-1}(x_i) - \rho_m h(x_i; a_m) \right]^2 = \sum_{i=1}^{N} \left[ \tilde{y}_i - \rho_m h(x_i; a_m) \right]^2.$$

While illustrative, this choice of loss function does not perform as well as others [41].

Figure 34: Gradient boosting algorithm for a generic loss function $L$ [41].

Adaptive Boosting

The general AdaBoost algorithm is given in figure 36. It uses a loss function $L[y, F(x)] = e^{-F(x) y}$. Taking the negative of the functional derivative of this with respect to $F(x)$ gives the pseudo response at each iteration $m$, $\tilde{y}_i = y_i e^{-F_{m-1}(x_i) y_i}$. Growing a regression tree $h(x; a_m)$ to this pseudo response, we then need to find the scale factor $\rho_m$ that minimizes the loss of the updated response,

$$\sum_{i=1}^{N} \exp\left[ -y_i \left( F_{m-1}(x_i) + \rho_m h(x_i; a_m) \right) \right].$$

Separating the exponential shows we need to find a $\rho_m$ that minimizes

$$\sum_{i=1}^{N} e^{-F_{m-1}(x_i) y_i} \exp\left[ -y_i\, \rho_m h(x_i; a_m) \right]$$

and use it to update $F_m$ according to $F_m = F_{m-1} + \rho_m h(x; a_m)$. It should be noted that the tree splits $a_m$ used to create $h(x; a_m)$ will minimize

$$\sum_{\text{nodes}}^{J} \sum_{x_i \in R_j} (\tilde{y}_i - c_j)^2 = \sum_{\text{nodes}}^{J} \sum_{x_i \in R_j} \left( y_i e^{-F_{m-1}(x_i) y_i} - c_j \right)^2$$

Figure 35: Gradient boosting algorithm for a negative binomial log likelihood loss function in the binary classification case [41].

While at first this may not look like the typical AdaBoost algorithm, with a bit of work it can be seen as equivalent [39]. The exponentials are in fact the AdaBoosted event weights, $w_i^{(m)} = e^{-y_i F_{m-1}(x_i)}$.
Thus, we can rewrite our minimization problem for the split points $a_m$:

$$\sum_{\text{nodes}}^{J} \sum_{x_i \in R_j} (\tilde{y}_i - c_j)^2 = \sum_{\text{nodes}}^{J} \sum_{x_i \in R_j} \left( w_i^{(m)} y_i - c_j \right)^2$$

where, in this form, and comparing to Equation 8 and Equation 9, it is clear $h(x; a_m)$ will minimize the weighted misclassified^25 events, s.t.

$$a_m = \underset{a_m}{\mathrm{argmin}} \sum_{i \in \text{misclassified}} w_i^{(m)}$$

though the separation criterion is different from the one used in classification trees. As detailed in [36] and [35], it can then be shown that

$$\rho_m = \frac{1}{2} \log \frac{1 - err_m}{err_m}$$

where

$$err_m = \frac{\sum_{\text{misclassified events}} w_i}{\sum_{\text{all events}} w_i} = \frac{\sum_{\text{misclassified events}} \exp(-y_i F_{m-1}(x_i))}{\sum_{\text{all events}} \exp(-y_i F_{m-1}(x_i))}$$

and the approximation is updated according to $F_m = F_{m-1} + \rho_m h(x; a_m)$, as expected. It is easily checked that the effective event weights are updated also, as in the next iteration we will have

$$w_i^{(m+1)} = e^{-y_i F_m(x_i)} = e^{-y_i (F_{m-1}(x_i) + \rho_m h(x_i; a_m))},$$

and so forth. AdaBoost is subject to poor performance in the face of outliers and statistical fluctuation, as a result of the strong weightings applied to all misclassified events.

^25 Recalling that $y = 1$ or $y = -1$ for binary classification, and that the average will also be simply 1 or −1 for Discrete AdaBoost, though it takes on a continuum for Real AdaBoost.

Logistic Gradient Boosting

The logistic gradient boosting algorithm employed by the TMVA uses a more robust loss function, $L[y, F(x)] = \ln(1 + e^{-2F(x) y})$. The pseudo response (the negative gradient of the loss) in this case is $\tilde{y}_i = 2y_i / (1 + e^{2 y_i F_{m-1}(x_i)})$. Classification is again performed using the sign of $F(x)$, where

$$F(x) = \frac{1}{2} \log\left[ \frac{Pr(y = 1 \mid x)}{Pr(y = -1 \mid x)} \right]$$

which can be rearranged to give the separated probabilities

$$Pr(y = 1 \mid x) = \frac{1}{1 + e^{-2F(x)}} \quad \text{and} \quad Pr(y = -1 \mid x) = \frac{1}{1 + e^{2F(x)}}.$$

After growing a regression tree $h(x; a_m)$ by choosing the split parameters $a_m$ to best fit the pseudo response $\tilde{y}_i$, it is necessary to find a scale factor for the new tree. Finding a solution $\rho_m$ that minimizes

$$\sum_{i=1}^{N} \ln\left( 1 + \exp\left[ -2y_i \left( F_{m-1}(x_i) + \rho_m h(x_i; a_m) \right) \right] \right)$$

is not possible in closed form, and so separate updates are made for each terminal node $R_{jm}$.
This simplifies the minimization problem to finding a $\gamma_{jm}$ that minimizes

$$\sum_{x_i \in R_{jm}} \ln\left( 1 + \exp\left[ -2y_i (F_{m-1}(x_i) + \gamma_{jm}) \right] \right).$$

This still has no closed form solution, and so a single Newton-Raphson step is performed to approximate $\gamma_{jm}$ [41]. This yields:

$$\gamma_{jm} = \sum_{x_i \in R_{jm}} \tilde{y}_i \Big/ \sum_{x_i \in R_{jm}} |\tilde{y}_i| (2 - |\tilde{y}_i|)$$

which in turn allows updates to the algorithm response, with each event receiving the $\gamma_{jm}$ of the terminal node it falls into:

$$F_m(x) = F_{m-1}(x) + \sum_{\text{nodes}}^{J} \gamma_{jm}\, 1(x \in R_{jm}).$$

For $K > 2$ classes the algorithm runs similarly; however, instead of $y_i \in \{-1, 1\}$ we have $y_{ik} \in \{0, 1\}$, for each class $k$. Further, there are now $K$ response distributions $F_k(x)$, which determine the output probability of class $k$ through

$$p_k(x) = e^{F_k(x)} \Big/ \sum_{l=1}^{K} e^{F_l(x)}. \tag{10}$$

Figure 36: Gradient boosting algorithm for a negative binomial log likelihood loss function with $K \geq 2$ classes. It should be noted that when $K = 2$ this algorithm is equivalent to Algorithm 5 (shown in Figure 34) [41].

Accordingly, there are now $K$ pseudo responses, given by $\tilde{y}_{ik} = y_{ik} - p_{k,m-1}(x_i)$. Thus, $K$ regression trees are grown at each iteration $m$, each fit to the residuals of a different class on the probability scale [41]. Again, no closed form solution exists for the scale factors $\rho_{km}$ of each resultant tree, nor even for the node-by-node factors $\gamma_{jkm}$. Performing a single Newton-Raphson step to approximate the $\gamma_{jkm}$ yields

$$\gamma_{jkm} = \frac{K - 1}{K}\, \frac{\sum_{x_i \in R_{jkm}} \tilde{y}_{ik}}{\sum_{x_i \in R_{jkm}} |\tilde{y}_{ik}| (1 - |\tilde{y}_{ik}|)}.$$

With these, the algorithm response is updated via:

$$F_{km}(x) = F_{k,m-1}(x) + \sum_{\text{nodes}}^{J} \gamma_{jkm}\, 1(x \in R_{jkm}).$$

From this, one should note that each regression tree is grown to fit the probability residuals of one particular class. For example, if there are three classes $k \in \{A, B, C\}$, and a nearly perfect first split is made separating A from B and C, the next split will still attempt to separate any outlying A class events from the rest of the B and C class events, even if there exists a split that is much more successful at separating class B from C.
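The multi-class quantities above can be sketched in a few lines: the class probabilities of Equation 10, the pseudo responses $\tilde{y}_{ik}$, and the single Newton-Raphson estimate of $\gamma_{jkm}$ for one terminal node. The function names are hypothetical, and the zero-denominator guard is an added assumption rather than part of the algorithm in [41].

```python
import math


def softmax(responses):
    """Class probabilities p_k = exp(F_k) / sum_l exp(F_l) (Equation 10)."""
    top = max(responses)  # subtract the maximum for numerical stability
    exps = [math.exp(f - top) for f in responses]
    norm = sum(exps)
    return [e / norm for e in exps]


def pseudo_responses(responses, true_class):
    """Residuals y_ik - p_k for one event whose true class index is true_class."""
    probs = softmax(responses)
    return [(1.0 if k == true_class else 0.0) - probs[k]
            for k in range(len(responses))]


def node_update(residuals, n_classes):
    """Newton-Raphson estimate of gamma_jkm from the class-k residuals in one node."""
    numerator = sum(residuals)
    denominator = sum(abs(r) * (1.0 - abs(r)) for r in residuals)
    if denominator == 0.0:  # assumption: leave the node unchanged in this edge case
        return 0.0
    return (n_classes - 1) / n_classes * numerator / denominator
```

With three classes and all responses at zero, an event of class 0 has pseudo responses (2/3, −1/3, −1/3); a node of the class-0 tree holding two such events receives $\gamma = 2$.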
Because there is a different $F_k(x)$ for each $k \in K$, a forest is grown for each class $k_0$ that trains $k_0$ against all classes $k \neq k_0$. Each forest $F_k(x)$ is then compared to the rest to give a probability as output, defined by Equation 10.

B. Neural Networks

Neural Networks (NNs) or Multi-Layer Perceptrons (MLPs) are more difficult to visualize than a decision tree, but are arguably simpler to understand as a whole than a full forest (due to the opaque nature of boosting). In simple words, neural networks provide a method for producing one or more non-linear responses to multiple input variables. Figure 37 is a graphical representation of the function composition that occurs within a neural network. Each input variable $X_i$ (where $X$ is a data point, and $i$ labels the variable) goes through a non-linear transformation, termed an activation function, which is represented by a connection to a neuron in the hidden layer. The hidden layer is then combined in a linear fashion to the output nodes, which are then transformed a final time through the softmax function, exemplified in Equation 10, to provide positive final estimates that sum to one [35].^26 This simple network can be tuned by adding more or fewer neurons, possibly in additional hidden layers, with an additional non-linear transformation between each layer. A "weight" for each connection is fit by minimizing an error function dependent on the training data and predicted results of the classifier. However, precautions must be taken to prevent reaching the global minimum, as this usually leads to overtraining.^27

This relatively simple method for fitting around non-linearities in data is very diverse in its applicability, and simple to implement through the TMVA. In the case of this analysis, one or two hidden layers were utilized, with $N + 5$ to $N + 10$ neurons total, where $N$ is the number of input variables.
The hyperbolic tangent function was used as the activation function at each neuron,

$$\sigma(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

and a simple sum synapse function was used, leading from the final hidden layer to the output layer.

In succinct equations, this leads to a model where, for $M$ neurons in the first hidden layer, each neuron $Z_m$ for $m \in M$ is given by:

$$Z_m = \sigma(\alpha_{0m} + \alpha_{im} X_i), \tag{11}$$

where $m$ labels the neuron, $\alpha_{0m}$ is the weight connecting the neuron to the constant "1" (represented by a bias node in graphical network representations generated by the TMVA), and $\alpha_m$ is a vector of the weights connecting the neuron to each of the input variables, $X$. The index $i$ is summed over. If there are two layers, the second layer goes through a similar transformation, with the $Z_m$ from the first layer replacing the $X_i$ in Equation 11.

^26 Note that the softmax function slightly over-parameterizes the system, since it provides the constraint that the probabilities must sum to one. In the case of $K = 2$ classes this is avoided by using the logistic function. For just two classes $a$ and $b$, the equivalence can be seen through the following:

$$p(X_i) = \frac{1}{1 + e^{F(X_i)}}, \qquad p_a = \frac{e^{F_a}}{e^{F_a} + e^{F_b}} = \frac{1}{1 + e^{F_b - F_a}}, \qquad p_b = \frac{e^{F_b}}{e^{F_a} + e^{F_b}} = \frac{1}{1 + e^{-(F_b - F_a)}}$$

and further noting that the network is trained to optimize the output nodes $F_k$, such that $F_k$ will be as positively large as possible if the data point belongs to class $k$, and as negatively large as possible if it does not. If two output layer nodes are used for binary classification, this implies one node will be the negative of the other. Thus, one node is typically used to avoid this redundancy, along with the logistic function to provide probabilities that still sum to one.

^27 In machine learning literature, overtraining is used to indicate an overly-tuned system that has too much variance in the bias-variance tradeoff present in any classification or regression system.

Figure 37: A graphical representation of the function used to map input variables to a response in a neural network. In this example, the network has five input variables, along with a bias node representing the "1" used in the non-linear tanh activation function used to map between the input layer and the hidden layer, which has 13 nodes plus an additional bias node. The mapping between the hidden layer and the output layer is a simple sum. More complicated networks may include additional hidden layers, with additional non-linear transformations between each layer, or in the case of multi-class networks, multiple nodes (aka responses) at the output layer. The training process fits for the weights used to connect the various nodes within a network.

For $K$ classes there will be $K$ nodes at the output layer, which for this analysis are given by:

$$F_k = \beta_{0k} + \beta_{mk} Z_m \tag{12}$$

which is then transformed a final time through Equation 10 to give probabilities that sum to one. A sum of squares, or a sum of absolutes, is sometimes used in place of Equation 12, but neither was used here. Each data point $X$ is then run through this series of equations to provide a probability of being a member of class $k$.

What remains is determining the connection weights $\alpha$ and $\beta$. For $N$ variables and $M$ neurons in the first layer, there are $M(N + 1)$ weights to determine between the input variables and the first hidden layer, and for $K$ classes there are $K(M + 1)$ weights between the final hidden layer and the output layer. In the case of multiple hidden layers, the number of weights between layers is determined similarly.
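A minimal sketch of the forward pass through a single hidden layer (Equations 11, 12 and 10), together with the weight count, is given here. The function names and the convention of storing each bias weight as the first entry of its row are assumptions made for illustration, not the TMVA data layout.

```python
import math


def forward(x, alpha, beta):
    """Map one event x to class probabilities through one tanh hidden layer.

    alpha: M rows of N + 1 weights (bias weight first), one row per hidden neuron.
    beta:  K rows of M + 1 weights (bias weight first), one row per output node.
    """
    # Equation 11: Z_m = tanh(alpha_0m + sum_i alpha_im X_i)
    z = [math.tanh(row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
         for row in alpha]
    # Equation 12: F_k = beta_0k + sum_m beta_mk Z_m
    f = [row[0] + sum(w * zm for w, zm in zip(row[1:], z)) for row in beta]
    # Equation 10: softmax, shifted by the maximum for numerical stability
    top = max(f)
    exps = [math.exp(v - top) for v in f]
    norm = sum(exps)
    return [e / norm for e in exps]


def n_weights(n_inputs, n_hidden, n_classes):
    """M(N + 1) input-to-hidden weights plus K(M + 1) hidden-to-output weights."""
    return n_hidden * (n_inputs + 1) + n_classes * (n_hidden + 1)
```

For a network with five inputs and 13 hidden neurons, and assuming a single output node, `n_weights(5, 13, 1)` gives 13 × 6 + 14 = 92 weights to be fit.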
In order to solve for these weights, an algorithm known as back-propagation is used.

Back-propagation^28 operates by updating the set of weights $w$ according to:

$$w^{\rho+1} = w^{\rho} - \eta \nabla_w E \tag{13}$$

where $\rho$ is used to indicate the iteration, and the step is taken against the gradient, in the direction that decreases the error function $E$. Here, $\eta$ is the (tunable) learning rate. A learning rate of 0.02 was used for this analysis, with a decay rate of 0.01. A squared error function was used here;

$$E(X \mid w) = \sum_{k=1}^{K} \frac{1}{2} \left( p_k(X) - y_k(X) \right)^2$$

where $p_k(X)$ is the output from the neural net for the data point $X$ and $y_k(X) = 1$ if $X \in k$ and $y_k(X) = 0$ otherwise. In the forward pass of the algorithm, the weights are held fixed and the network prediction $p_k$ is calculated. In the backward pass the weights are updated by first calculating the corresponding derivative of the error function, then updating the weights according to the above equation. With class label $k$ and hidden layer neuron labeled $m$ (where $m = 0$ implies the connection to the bias node, s.t. $Z_0(X) = 1\ \forall X$), this gives:

$$\frac{\partial E}{\partial \beta_{mk}} = (p_k(X) - y_k(X))\, p'_k[\beta_{mk} Z_m(X)]\, Z_m(X) = \delta_k(X) Z_m(X)$$

where $p'_k[F_k(X)]$ is the derivative of the softmax function with $F_k(X) = \beta_{mk} Z_m(X)$ as its argument. It should be noted that in the case of binary classification, and thus a single output node, this additional term can be safely ignored, as it is equivalent to changing the learning rate and is thus not expected to change the structure of the network. The updates between input variables and hidden layer neurons follow from the chain rule through the hidden neuron $Z_m$, and are given by:

$$\frac{\partial E}{\partial \alpha_{im}} = \sum_{k=1}^{K} (p_k(X) - y_k(X))\, p'_k[\beta_{mk} Z_m(X)]\, \beta_{mk}\, \sigma'(\alpha_{im} X_i)\, X_i = \sigma'(\alpha_{im} X_i)\, X_i \sum_{k=1}^{K} \delta_k(X)\, \beta_{mk}$$

where $i$ still labels the input variable, repeated indices are summed over, and $X$ with or without a label is meant to indicate dependence on the training event being used. $\sigma'(Z_m)$ is used to indicate the derivative of the activation function, with the present input to the $m$th neuron, $\alpha_{im} X_i$, as the argument. $X_0$ is meant to indicate the input bias, s.t. $X_0 = 1\ \forall X$.
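A single online update of Equation 13 for a one-hidden-layer network can be sketched as follows. The function names and the weight layout (bias weight first in each row) are hypothetical. The sketch also assumes the full softmax Jacobian, $\partial p_k / \partial F_l = p_k (1[k = l] - p_l)$, when forming $\delta_k$; this goes slightly beyond the compact notation $p'_k$ used in the text, and it does not model the learning-rate decay.

```python
import math


def forward(x, alpha, beta):
    """Return hidden activations z and softmax outputs p for one event x."""
    z = [math.tanh(a[0] + sum(w * xi for w, xi in zip(a[1:], x))) for a in alpha]
    f = [b[0] + sum(w * zm for w, zm in zip(b[1:], z)) for b in beta]
    top = max(f)  # shift by the maximum for numerical stability
    exps = [math.exp(v - top) for v in f]
    norm = sum(exps)
    return z, [e / norm for e in exps]


def squared_error(p, y):
    """E = sum_k (p_k - y_k)^2 / 2 for one event."""
    return sum(0.5 * (pk - yk) ** 2 for pk, yk in zip(p, y))


def gradients(x, y, alpha, beta):
    """Backward pass: dE/dalpha and dE/dbeta for one event."""
    z, p = forward(x, alpha, beta)
    # delta_k = dE/dF_k, summing over all outputs coupled by the softmax.
    s = sum((pk - yk) * pk for pk, yk in zip(p, y))
    delta = [pk * ((pk - yk) - s) for pk, yk in zip(p, y)]
    grad_beta = [[d] + [d * zm for zm in z] for d in delta]
    grad_alpha = []
    for m, zm in enumerate(z):
        back = sum(d * b[m + 1] for d, b in zip(delta, beta))
        da = (1.0 - zm * zm) * back  # tanh'(a) = 1 - tanh(a)^2
        grad_alpha.append([da] + [da * xi for xi in x])
    return grad_alpha, grad_beta


def online_step(x, y, alpha, beta, eta=0.02):
    """One update w <- w - eta * grad E using a single training event."""
    ga, gb = gradients(x, y, alpha, beta)
    new_alpha = [[w - eta * g for w, g in zip(row, grow)]
                 for row, grow in zip(alpha, ga)]
    new_beta = [[w - eta * g for w, g in zip(row, grow)]
                for row, grow in zip(beta, gb)]
    return new_alpha, new_beta
```

Returning new weight lists rather than modifying them in place keeps the forward and backward passes easy to compare against a finite-difference estimate of the gradient.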
The TMVA employs online learning, which updates the weights at each event, though bulk learning is also sometimes used. Bulk learning would imply an additional sum over training events, and weight updates would occur after going through the complete set.

This algorithm is altered slightly in the Bayesian extension^29 of the MLP, by adding a regulator term to the error function:

$$E(X \mid w) = \lambda |w|^2 + \sum_{k=1}^{K} \frac{1}{2} \left( p_k(X) - y_k(X) \right)^2$$

^28 So named since the network is first held static as a training event is run forward through the network, followed by a backward pass as the calculations to update network weights are performed in a reverse order, as described below.

^29 A full discussion of the equivalence between this application of regulator terms and adding Gaussian priors to the weights and noise in the data is given in [38].

A small value of $\lambda$ leads to a network with large variance (if other overtraining precautions are not taken), whereas a large value of $\lambda$ diminishes the weights more strongly, forcing the network to operate closer to its linear regime and thus increasing its bias. This regulator term thus allows for automatic control over the complexity of the network, as it will shrink unimportant weights to zero. To avoid shifting the problem of selecting the correct model complexity over to selecting the correct value of $\lambda$, $\lambda$ is automatically determined using the evidence approximation.

We can see that the likelihood $p(t \mid X, w, \nu)$, where $t = F_k(X, w) + \epsilon$ (with $\epsilon$ being some additive Gaussian noise in the training data with precision equal to $\nu$), is maximized for some value of the complete set of weights, $w$. This initial set of weights is determined by letting the above back-propagation algorithm run for a sufficient number of iterations to find the minimum squared error.
By introducing the additional set of hyper-parameters $\mu$, which correspond to the precision of the Gaussian priors applied to the weights $w$, we can write the evidence function as^30:

$$p(t \mid \mu, \nu) = \int p(t \mid w, \nu)\, p(w \mid \mu)\, dw = \left( \frac{\nu}{2\pi} \right)^{N/2} \left( \frac{\mu}{2\pi} \right)^{M/2} \int \exp[-E(w)]\, dw$$

where $N$ now refers to the number of training events, $M$ now refers to the total number of weights in the network, and the error function $E(w)$ is equal to the squared error plus the regulator term, up to a proportionality constant. Maximizing this evidence function thus implies minimizing the error function, which is again performed in an iterative fashion. Holding the weights $w$ constant, the hyper-parameters $\nu$ and $\mu$ are updated^31 as given by:

$$\mu = \frac{M}{2 E_W(w)}, \qquad \nu = \frac{N}{2 E_D(w)}$$

where $E_D$ is the portion of the error function related to the data (e.g. the squared error) and $E_W$ corresponds to the error dependent solely on the weights (e.g. the regulator term). The full derivation of these updates is given in [38], but is loosely described as follows. Completing the square over $w$ in the evidence function allows us to solve for the logarithm of the evidence function in terms of $\mu$, $\nu$, the present best-guess for $w$ (aka the mean of the posterior distribution), and the Hessian of the error function. This leads to an implicit solution for $\mu$ and $\nu$ that maximizes this evidence function.
Using the approximation that $N \gg M$ then leads to the updates for $\mu$ and $\nu$ given above.

Holding $\mu$ and $\nu$ constant, the weights $w$ are updated through a similar backward pass as before; however, the regulated error function is now used, with $\lambda = \mu/\nu$, and a sum over all training events is taken for each pass. This yields updates through Equation 13 using the following derivatives:

$$\frac{\partial E}{\partial \beta_{mk}} = 2\lambda \beta_{mk} + \sum_{j=1}^{N} (p_k(X_j) - y_k(X_j))\, p'_k[\beta_{mk} Z_m(X_j)]\, Z_m(X_j) = \sum_{j=1}^{N} \delta_k(X_j) Z_m(X_j) + 2\lambda \beta_{mk}$$

and

$$\frac{\partial E}{\partial \alpha_{im}} = 2\lambda \alpha_{im} + \sum_{k=1}^{K} \sum_{j=1}^{N} \delta_k(X_j)\, \beta_{mk}\, \sigma'(\alpha_{im} X_{ij})\, X_{ij}$$

^30 For the remainder of this discussion, the dependence of various likelihood functions on the data $X$ is suppressed for clarity.

^31 This approximation only applies when the number of data points $N \gg M$, where $M$ is the total number of weights in the network. Note that this is always true for this analysis.

Thus if one is to fit an arbitrarily complex network to arbitrary accuracy using the initial back-propagation algorithm, application of the Bayesian extension allows one to ensure the model is not overfit, as it shrinks weights which do not contribute sufficiently to the full training set. In other words, the regulator term serves to suppress weights which only serve to model outliers, based on training data only. The only cost is the additional computational time required after back-propagation in order to calculate the regulated weights. The iterative process between updating $\mu$ and $\nu$, then updating the weights, is repeated as much as necessary. The TMVA employs on the order of 10,000 passes.

C. Variables Rejected Due to Poor Separation

The following variables were investigated for use with a ML algorithm, but rejected due to low separation power between templates:

• Sum $p_T$ / Sum $E$ for all jets and lepton
• Mass of combo of any two jets with largest vector sum $p_T$
• Mass of combo of any two jets with smallest $\Delta R$
• $\Delta R$ between lepton and tagged jet pair with smallest $\Delta R$
• Mass of combo between tagged jet and any other jet with smallest $\Delta R$
• Mass of combo between tagged jet and any jet with largest vector sum $p_T$
• Mass of combo between untagged jet pair with smallest $\Delta R$
• Mass of jet triplet with largest vector sum $p_T$
• $\Delta R$ of pair of untagged jets that are closest in $\Delta R$
• $p_T$ of combo between untagged jet pair with smallest $\Delta R$
• Mass of combo of b-tagged pair with largest vector sum $p_T$
• 2nd to 5th Fox-Wolfram Moments for all jets
• Other 3-momentum tensor eigenvalues
• Eigenvalues for 3-momentum tensor composed from all jets

D. Variable Selection Metric Based on Fit Sensitivities

The goal of the sensitivity metric is to rate a variable as better performing if combinations involving that particular variable provided better sensitivities after the final profile likelihood fit. The most naive way to do this might be to look at just the very best performing permutations amongst the trained subsets, and rank variables based on the frequency of their appearance in these top performers. However, this ignores a key behaviour of machine learning algorithms: variables tend to work best in specific groupings.

For example, permutations which include four variables may dominate the rankings, as they have the most information available to the algorithm. However, these subsets of four may be dominated by groupings of variables which all aid the algorithm in similar ways.
If there are two well-performing, but differing, sets of four variables, which provide nearly identical overall information to the algorithm (spread between the four variables), then only variables involved in these two sets may be naively seen as effective variables. However, the final "optimal" set of variables may include only a particular selection among these two highly correlated variable subsets, but may also include an uncorrelated pair of variables that do well together, and perhaps a single additional uncorrelated variable. Since the sensitivities were dominated by the variables involved in the optimal 4-variable combinations, neither of the variables involved in the best pair, nor the best lone variable, may get a reasonable initial rating. This issue was avoided by treating each of the choice subsets independently. This way, a sensitivity-based score for each variable is generated for each choice, which can be looked at individually, or normalized before contributing to a final overall score.

Initially, the sum of the ranks of each combination involving a variable was taken as that variable's score (e.g. the variables involved in the top combination are awarded one point, variables in the second top combination are awarded two points, etc.), with a lower score implying a better variable. However, these scores were often very similar, and deemed not very useful as a result. In an effort to create more obvious sensitivity ratings, a weighted point system was employed. If a variable appears more frequently among the best sensitivities, it is awarded more points. In the implementation, a different score was given for each of the choice subsets, awarded as follows.

If a combination is in the top 5% of sensitivities among permutations, each variable involved in the combination was awarded 10 points.
For each appearance in the top 10% of permutations a variable got 5 points, 2 points were awarded for each combination in the top 25% and 1 point for each appearance in the top 50%.^32 In this way, variables are rated better if they are consistently near the top, whereas variables that do reasonably well are still not ignored. This is in order to provide some separation among the lesser-performing (but non-negligible) variables. The particular number of points awarded at each tier was easily adjustable, though the above values were used for this analysis during both iterations, except that 50 points instead of 10 were awarded for appearances in the top 5% tier during the second iteration. This is to better reflect how the variable set is intended to become narrower, and thus much higher weight is placed on the very best combinations.

^32 Points in this way are cumulative, such that the variables involved in the very best permutation were awarded 10 + 5 + 2 + 1 = 18 points.

Unfortunately, simply a score for each choice subset is still not a suitable performance metric since, at each iteration, there are at least 8 choice subsets to consider. To combine these scores, a normalizing factor between scores must be used, since if there are $m$ choose $n$ permutations, each individual variable will appear in $(m - 1)$ choose $(n - 1)$ of these combinations (for $n \leq \frac{m}{2}$). Without normalization, point contributions from choices that have high combinatorics will dominate the final score. To make up for this, the total number of points awarded at each tier (5%, 10%, 25% or 50%) is normalized by the maximum number of appearances each variable could make in the tier for that choice. This maximum number is min[(number of total permutations in tier), (($m - 1$) choose ($n - 1$))].
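The tiered, normalized point system described above can be sketched for a single choice subset as follows. The function name `score_choice` is hypothetical, and rounding each tier size down (with a minimum of one permutation) is an assumption; with that assumption the total points summed over all variables come out as 40, 49.89, 65 and 80 for choices 1 to 4 among 20 variables.

```python
from itertools import combinations
from math import comb

TIERS = (5, 10, 25, 50)  # tier boundaries in percent, best tier first


def score_choice(ranked_combos, m, top_weight=10):
    """Score every variable within one choice subset.

    ranked_combos: all n-variable combinations, ordered best sensitivity first.
    m:             number of candidate variables.
    top_weight:    points for the top 5% tier (10 in the first iteration,
                   50 in the second).
    """
    n = len(ranked_combos[0])
    points = {5: top_weight, 10: 5, 25: 2, 50: 1}
    max_appearances = comb(m - 1, n - 1)
    scores = {}
    for t in TIERS:
        # Assumption: tier size rounded down, but never smaller than one.
        n_in_tier = max(1, len(ranked_combos) * t // 100)
        p_t = min(n_in_tier, max_appearances)
        # Tiers are cumulative: the best combinations collect points in each.
        for combo in ranked_combos[:n_in_tier]:
            for var in combo:
                scores[var] = scores.get(var, 0.0) + points[t] / p_t
    return scores
```

As a usage example, `score_choice(list(combinations(names, 1)), 20)` with 20 hypothetical variable names gives the first-ranked variable 10 + 5 + 2 + 1 = 18 points, the maximum for a single choice in the first iteration.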
This normalization causes the total number of points awarded per tier to be approximately^33 equal across choice subsets.

For example, for choice $n$ among $m$ variables, with:

$$p_t = \min\left\{ \left[ \binom{m}{n} \times \frac{t}{100} \right], \binom{m-1}{n-1} \right\}$$

and labelling $a_t^{var}$ as the total number of appearances the variable $var$ makes in permutations in the top $t$%, a variable is assigned, for the given choice:

$$K \times \frac{a_5^{var}}{p_5} + 5 \times \frac{a_{10}^{var}}{p_{10}} + 2 \times \frac{a_{25}^{var}}{p_{25}} + \frac{a_{50}^{var}}{p_{50}}$$

where $K$ is 10 for the first iteration, and 50 for the second iteration.

Finally, it should be noted that if a variable is selected for choice $\geq \frac{m}{2}$, this implies its exclusion rather than inclusion, and so a high score awarded for these choices implies it is more frequently excluded than other variables. To reflect this, when taking a total sum of the scores from each choice, the scores for choices $\geq \frac{m}{2}$ are taken as negative. However, there may be some further biases present (discussed below) when looking at these highly-inclusive choices, and so when evaluating the sensitivity performance of a variable three metrics are used: the sum from the upper choice scores (the "top-down" approach, referred to as the exclusive sum sensitivities), the sum from the lower choices ("bottom-up," referred to as the inclusive sum sensitivities), and the total score (referred to as the overall sensitivities, or simply sensitivity rankings). The score limit possibilities for this metric are given in table 8. Fully detailed scores for each variable, at each iteration and training template permutation, are also presented below.

Table 8: The top line in each table shows the maximum values a single variable can take for each sensitivity rating. The second line shows the total number of points awarded among all variables for that particular choice subset or sum (column). The top table shows possibilities in the first iteration, the bottom table shows the second iteration limits.

ID - 1st iteration  ch 1   ch 2   ch 3   ch 4   Incl. Sum  ch 16  ch 17  ch 18  ch 19  Excl. Sum  Total Sum
Individual Max      18     18     18     18     72         18     18     18     18     -72        -72 to 72
Column Sum          40     49.89  65     80     234.89     80     65     49.89  40     -234.89    0

ID - 2nd iteration  ch 1   ch 2   ch 3   ch 4   ch 5   Incl. Sum  ch 11  ch 12  ch 13  ch 14  ch 15  Excl. Sum  Total Sum
Individual Max      58     58     58     58     58     290        58     58     58     58     58     -290       -290 to 290
Column Sum          71     126    181    236    293    907        293    236    181    126    71     -907       0

^33 Choices with higher combinatorics still award slightly more points in total; however, this helps overcome the bias given to variables that perform well in choices with fewer combinatorics, as in these cases a variable is more likely to saturate every tier's maximum. This is seen most dramatically in the choice = 1 subset, when the maximum number of appearances is just one, and so the top variable is awarded the absolute maximum 18 points in the first iteration, or 58 points in the second iteration.

D.1. BDT sensitivities with ttbb trained against ttlightnotc

ch 1  ch 2    ch 3    ch 4    Incl. Sum  Incl. Rank    ch 16   ch 17   ch 18   ch 19  Excl. Sum  Excl. Rank    Total Sum  Total Rank
0     0.21    0.376   0.873   1.459      pt5           5.779   4.235   5.273   18     -33.287    jets40        -27.449    jets40
0     0.262   0.893   1.429   2.584      wavgdr        4.53    5.166   5.115   8      -22.811    aplab         -18.016    avgdr
0     0.789   0.785   1.255   2.829      avgdr         7.366   6.101   4.378   3      -20.845    avgdr         -12.131    aplab
0     0.578   0.832   2.053   3.463      costhetastar  4.383   3.98    5.858   3      -17.221    wavgdrbb      -6.542     pt5
0     0.893   1.168   1.487   3.548      mindr2        6.544   6.303   2.531   1      -16.378    avgdrbb       -6.29      mv1c2
0     0.473   1.363   1.789   3.625      ssumpt        5.659   4.51    4.852   0      -15.021    mv1c2         -5.54      mindr2
0     0.368   1.393   2.461   4.222      jjssumpt      4.137   4.554   5.905   0      -14.596    normavgdrbb   -5.458     sum34pt
0     0.736   1.551   2.918   5.205      sum34pt       3.413   3.082   4.168   0      -10.663    sum34pt       -4.745     costhetastar
1     0.736   1.813   2.289   5.838      jets40        3.601   3.65    2.479   0      -9.73      mindrbb       -3.794     wavgdrbb
0     2.215   3.129   3.387   8.731      mv1c2         3.526   2.089   0.473   3      -9.088     mindr2        -2.851     wavgdr
0     2.11    2.67    4.234   9.014      mv1c1         3.356   2.642   1.21    1      -8.208     costhetastar  -2.79      ssumpt
1     2.268   2.702   3.86    9.83       drbbmaxpt     3.445   1.656   1.9     1      -8.001     pt5           -2.499     jjssumpt
1     2.531   3.016   4.133   10.68      aplab         2.818   2.732   0.21    1      -6.76      bbmindr       2.762      mv1c1
1     2.163   3.291   4.36    10.814     bbmaxmass     3.235   2.855   0.631   0      -6.721     jjssumpt      3.632      mindrbb
3     2.209   3.417   4.736   13.362     mindrbb       3.203   1.612   0.736   1      -6.551     bbmaxmass     4.263      bbmaxmass
1     2.636   4.484   5.307   13.427     wavgdrbb      3.232   2.29    0.893   0      -6.415     ssumpt        5          drbbmaxpt
3     3.951   5.054   5.135   17.14      nwavgdrbb     2.741   1.933   1.578   0      -6.252     mv1c1         8.186      avgdrbb
8     3.742   5.249   5.485   22.476     bbmindr       3.031   2.265   0.684   0      -5.98      nwavgdrbb     11.16      nwavgdrbb
3     5.694   7.547   8.323   24.564     avgdrbb       3.09    1.872   0.473   0      -5.435     wavgdr        15.716     bbmindr
18    15.308  14.228  14.442  61.978     normavgdrbb   2.868   1.437   0.525   0      -4.83      drbbmaxpt     47.382     normavgdrbb

Table 9: The BDT sensitivities point card after the first variable test iteration, training ttbb against ttlightnotc. Each column corresponds to a different choice subset, with separate sums and rankings for the lower and upper choices, along with a total rank. Upper choices are subtracted from the lower choices, as an increase in points among the upper choices implies greater occurrence of being excluded (rather than included, as with the lower choices). MV1c3 and MV1c4 are included automatically in every selection at this iteration.

ch 1  ch 2    ch 3    ch 4    ch 5    Incl. Sum  Incl. Rank    ch 11   ch 12   ch 13   ch 14   ch 15  Excl. Sum  Excl. Rank    Total Sum  Total Rank
0     0.266   0.823   1.918   3.284   6.291      pt5           39.17   42.243  37.188  18.232  0      -136.833   mv1c3         -99.735    avgdr
0     0.466   2.514   1.56    4.846   9.386      avgdr         31.821  26.546  19.138  28.616  3      -109.121   avgdr         -59.1      mv1c4
0     0.266   0.634   3.66    6.464   11.024     ssumpt        15.542  12.926  3.24    1.349   58     -91.057    mv1c4         -44.432    pt5
1     0.399   0.65    3.92    7.597   13.566     costhetastar  18.462  16.26   10.968  19.649  1      -66.339    wavgdrbb      -33.481    mindr2
0     0.333   4.489   5.719   7.424   17.965     mindr2        23.655  17.134  10.457  0.2     0      -51.446    mindr2        -32.636    ssumpt
0     0.333   4.215   6.289   11.656  22.493     jjssumpt      20.899  12.974  14.984  0.866   1      -50.723    pt5           -26.055    costhetastar
1     0.399   4.498   9.443   16.617  31.957     mv1c4         16.952  18.215  12.545  1.633   0      -49.345    nwavgdrbb     -24.395    wavgdrbb
0     1.349   5.199   15.294  20.102  41.944     wavgdrbb      10.197  11.695  13.139  11.049  3      -49.08     avgdrbb       -17.477    jjssumpt
1     1.548   9.046   12.528  19.201  43.323     mindrbb       24.801  11.009  6.634   1.216   0      -43.66     ssumpt        3.767      mindrbb
0     10.699  9.488   14.504  17.562  52.253     aplab         15.628  13.178  9.106   1.748   3      -42.66     aplab         9.593      aplab
0     0.666   12.366  20.031  23.77   56.833     mv1c2         14.381  9.311   5.129   10.149  1      -39.97     jjssumpt      11.998     mv1c3
3     11.399  11.284  17.366  19.309  62.358     bbmindr       17.551  12.823  8.581   0.666   0      -39.621    costhetastar  26.402     mv1c2
1     11.382  14.231  25.768  32.98   85.361     avgdrbb       17.318  11.242  10.53   0.466   0      -39.556    mindrbb       31.167     bbmindr
3     20.465  25.175  19.843  19.424  87.907     nwavgdrbb     6.467   8.716   9.309   10.366  1      -35.858    normavgdrbb   36.281     avgdrbb
3     28.265  35.945  39.641  41.98   148.831    mv1c3         11.577  6.693   2.973   9.948   0      -31.191    bbmindr       38.562     nwavgdrbb
58    37.749  40.413  38.483  40.755  215.4      normavgdrbb   8.545   5.006   7.048   9.832   0      -30.431    mv1c2         179.542    normavgdrbb

Table 10: The BDT sensitivities point card after the second variable test iteration, training ttbb against ttlightnotc, with 16 potential variables remaining. Each column corresponds to a different choice subset, with separate sums and rankings for the lower and upper choice sums, along with a total performance rank.

D.2. BDT sensitivities with ttbb trained against ttsingleb

ch 1  ch 2    ch 3    ch 4    Incl. Sum  Incl. Rank    ch 16   ch 17   ch 18   ch 19  Excl. Sum  Excl. Rank    Total Sum  Total Rank
0     0.315   0.543   1.129   1.987      avgdr         10.459  9.975   6.173   1      -27.607    mindr2        -25.379    mindr2
0     0.105   1.013   1.11    2.228      mindr2        2.809   2.352   3.905   18     -27.066    avgdrbb       -12.702    avgdrbb
1     0.367   0.991   1.876   4.234      sum34pt       3.763   3.838   1.893   8      -17.494    wavgdr        -10.937    nwavgdrbb
0     0.21    2.422   2.391   5.023      bbmindr       4.582   3.156   6.647   3      -17.385    nwavgdrbb     -8.352     sum34pt
0     0.999   1.554   2.514   5.067      drbbmaxpt     5.438   5.104   3.59    0      -14.132    jets40        -6.929     bbmindr
1     0.525   1.623   2.712   5.86       jjssumpt      4.555   3.817   3.214   1      -12.586    sum34pt       -5.919     avgdr
1     1.689   1.446   2.313   6.448      nwavgdrbb     4.241   3.122   4.589   0      -11.952    bbmindr       -5.911     wavgdr
0     0.788   2.051   3.877   6.716      normavgdrbb   4.686   2.837   1.157   3      -11.68     ssumpt        -3.694     drbbmaxpt
0     1.315   2.846   4.112   8.273      ssumpt        4.773   4.384   2.11    0      -11.267    aplab         -3.407     ssumpt
0     0.999   3.405   4.769   9.173      wavgdrbb      3.233   2.779   1.525   3      -10.537    bbmaxmass     -2.658     jets40
0     2.157   4.179   4.643   10.979     costhetastar  4.13    2.814   1.368   1      -9.312     wavgdrbb      -1.667     normavgdrbb
3     3.957   2.27    2.247   11.474     jets40        3.479   3.231   2.051   0      -8.761     drbbmaxpt     -1.65      jjssumpt
3     1.368   3.208   4.007   11.583     wavgdr        2.74    2.27    2.373   1      -8.383     normavgdrbb   -0.139     wavgdrbb
0     2.373   4.581   5.268   12.222     bbmaxmass     3.417   2.859   0.63    1      -7.906     avgdr         1.685      bbmaxmass
0     2.209   5.516   6.639   14.364     avgdrbb       3.486   1.837   2.479   0      -7.802     mv1c2         5.019      aplab
1     5.273   4.796   5.217   16.286     aplab         3.08    2.109   2.321   0      -7.51      jjssumpt      5.602      costhetastar
3     1.735   5.075   6.565   16.375     mindrbb       3.843   2.386   0.946   0      -7.175     pt5           12.417     mindrbb
1     5.962   5.978   6.717   19.657     pt5           2.991   3.116   0.736   0      -6.843     mv1c1         12.482     pt5
8     9.133   3.214   5.165   25.512     mv1c2         2.061   1.416   1.9     0      -5.377     costhetastar  17.71      mv1c2
18    8.39    8.256   6.68    41.326     mv1c1         2.136   1.559   0.263   0      -3.958     mindrbb       34.483     mv1c1

Table 11: The BDT sensitivities point card after the first variable test iteration, but with ttbb trained against ttsingleb. Each column corresponds to a different choice subset, with separate sums and rankings for the lower and upper choices, along with a total rank. Upper choices are subtracted from the lower choices, as an increase in points among the upper choices implies greater occurrence of being excluded (rather than included, as with the lower choices). MV1c3 and MV1c4 are included automatically in every selection at this iteration.

ch 1  ch 2    ch 3    ch 4    ch 5    Incl. Sum  Incl. Rank    ch 11   ch 12   ch 13   ch 14   ch 15  Excl. Sum  Excl. Rank    Total Sum  Total Rank
0     0.532   0.965   3.703   6.118   11.318     avgdr         33.157  29.751  24.858  11.183  1      -99.949    jets40        -77.055    avgdr
0     0.399   2.524   7.955   10.52   21.398     wavgdr        33.157  29.751  24.858  11.183  1      -99.949    jjssumpt      -67.425    jets40
0     0.815   8.375   6.371   14.487  30.048     pt5           28.669  26.775  22.297  10.632  0      -88.373    avgdr         -67.425    jjssumpt
1     1.482   5.224   9.362   14.509  31.577     wavgdrbb      27.599  24.92   13.398  19.182  0      -85.099    normavgdrbb   -37.47     wavgdr
0     0.948   6.43    10.446  13.894  31.718     costhetastar  23.558  16.272  20.468  1.066   1      -62.364    aplab         -30.692    pt5
0     9.348   8.301   7.937   6.938   32.524     jets40        18.137  12.65   10.748  18.832  1      -61.367    wavgdrbb      -29.79     wavgdrbb
0     9.348   8.301   7.937   6.938   32.524     jjssumpt      21.333  12.816  14.608  10.983  1      -60.74     pt5           -11.627    aplab
3     1.066   3.909   14.158  22.113  44.246     mindrbb       19.778  17.301  10.74   11.049  0      -58.868    wavgdr        -9.139     costhetastar
0     9.282   9.019   12.681  15.767  46.749     bbmaxmass     20.833  20.304  10.581  1.349   0      -53.067    mv1c1         -6.257     mv1c1
0     0.466   12.45   15.192  18.702  46.81      mv1c1         20.833  20.304  10.581  1.349   0      -53.067    mv1c2         -6.257     mv1c2
0     0.466   12.45   15.192  18.702  46.81      mv1c2         21.803  15.962  6.459   0.666   3      -47.89     bbmaxmass     -1.141     bbmaxmass
3     10.281  10.949  11.009  15.498  50.737     aplab         17.187  12.143  7.112   1.415   3      -40.857    costhetastar  15.92      mv1c3
0     9.348   12.623  15.152  16.784  53.907     mv1c3         15.337  14.065  5.519   0.066   3      -37.987    mv1c3         33.706     mindrbb
3     11.316  11.749  20.836  27.473  74.374     avgdrbb       9.379   6.456   11.154  9.965   0      -36.954    avgdrbb       37.42      avgdrbb
1     10.298  25.597  40.224  43.54   120.659    mv1c4         5.204   2.048   2.406   0.882   0      -10.54     mindrbb       112.943    mv1c4
1     57.032  56.443  47.229  42.552  204.256    normavgdrbb   3.638   2.639   0.84    0.599   0      -7.716     mv1c4         119.157    normavgdrbb

Table 12: The BDT sensitivities point card after the second variable test iteration, but with ttbb trained against ttsingleb, with 16 potential variables remaining. Each column corresponds to a different choice subset, with separate sums and rankings for the lower and upper choice sums, along with a total sensitivity performance rank.

D.3. Neural Network Sensitivities

ch 1  ch 2    ch 3    ch 4    Incl. Sum  Incl. Rank    ch 1  ch 2    ch 3    ch 4    Incl. Sum  Incl. Rank
0     .157    .472    .591    1.220      wavgdr        0     .105    .490    .736    1.331      costhetastar
0     .105    .303    .895    1.303      wavgdrbb      0     .157    .320    .902    1.379      wavgdrbb
0     .210    .467    .678    1.355      ssumpt        0     .421    .291    .685    1.397      bbmaxmass
0     .473    .408    .650    1.531      bbmaxmass     0     .210    .438    .768    1.416      ssumpt
0     .263    .520    .824    1.607      sum34pt       0     .105    .490    .967    1.562      pt5
0     .263    .662    1.069   1.994      mindrbb       0     .263    .514    .793    1.570      wavgdr
0     .105    .549    1.390   2.044      costhetastar  0     .210    .537    .883    1.630      sum34pt
0     .105    .540    1.422   2.067      pt5           0     .315    .554    .851    1.720      mindrbb
0     .789    1.923   3.093   5.805      jjssumpt      0     .789    1.929   3.556   6.274      jjssumpt
1     1.315   3.016   5.350   10.681     avgdr         1     1.157   3.949   5.875   11.981     normavgdrbb
1     1.789   3.276   5.590   11.655     jets40        1     1.683   4.398   6.340   13.421     bbmindr
0     3.478   5.250   6.366   15.094     drbbmaxpt     1     2.209   4.177   6.206   13.592     mindr2
1     3.110   4.799   6.371   15.280     normavgdrbb   0     3.215   5.285   5.689   14.189     avgdr
1     3.478   6.136   6.483   17.097     aplab         1     3.478   5.616   6.344   16.438     drbbmaxpt
3     3.215   4.910   6.598   17.723     nwavgdrbb     1     5.115   5.404   6.182   17.701     avgdrbb
3     4.800   5.907   6.168   19.875     mindr2        3     2.472   5.954   6.804   18.230     nwavgdrbb
1     6.647   6.583   6.925   21.155     avgdrbb       3     6.331   4.926   7.079   21.336
aplab3 7.705 4.639 6.602 21.946 bbmindr 3 6.753 6.258 6.005 22.016 mv1c18 6.331 6.138 6.624 27.093 mv1c2 8 5.858 6.400 6.616 26.874 mv1c218 5.536 8.472 6.271 38.279 mv1c1 18 9.027 7.040 6.670 40.737 jets40Table 13: The NN sensitivity rankings after the first variable iteration. On the left, ttbb has been trained againstttlightnotc, whereas the rankings on the right occur when ttbb is trained against ttsingleb. Each column correspondsto a different choice subset. Choices which include too many variables are excluded, as this was universally seen toprovide very poor performance within a NN algorithm. This is a result of lowMC statistics, as many-variable algorithmsquickly reach an overtrained state, where unequal improvements were seen by the TMVA between the training andtesting samples.99ch 1 ch 2 ch 3 ch 4 ch 5 Incl. Sum Incl. Rank ch 1 ch 2 ch 3 ch 4 ch 5 Incl. Sum Incl. Rank0 .066 .199 .443 1.230 1.938 wavgdrbb 1 .333 .771 .784 .838 3.726 wavgdr0 .066 .295 .509 1.482 2.352 mindrbb 0 .200 .428 .712 4.926 6.266 costhetastar1 0 .552 .502 .711 2.765 ssumpt 1 .400 .323 .840 4.652 7.215 bbmaxmass1 .200 .625 .534 1.718 4.077 pt5 0 .066 .475 1.120 6.249 7.910 mindrbb0 .266 .374 .638 2.846 4.124 costhetastar 0 .133 .475 .857 7.216 8.681 wavgdrbb1 .933 1.352 1.770 3.965 9.020 jjssumpt 1 .532 .504 1.479 14.176 17.691 pt50 1.549 5.570 12.871 25.231 45.221 avgdr 3 1.866 2.047 4.009 26.601 37.523 jjssumpt0 1.615 5.923 17.953 26.228 51.719 aplab 0 1.349 11.160 20.502 23.905 56.916 avgdr0 1.349 10.933 18.759 26.091 57.132 mindr2 0 1.549 11.441 23.188 23.394 59.572 jets401 1.615 14.789 19.474 26.066 62.944 bbmindr 0 1.415 16.845 22.557 24.188 65.005 mv1c10 9.882 14.684 21.746 28.468 74.780 mv1c2 0 10.081 16.692 24.386 22.826 73.985 aplab3 10.632 15.085 22.397 28.184 79.298 normavgdrbb 0 10.716 17.053 21.666 26.426 75.861 mv1c23 11.032 18.929 21.381 29.119 83.461 nwavgdrbb 3 11.049 15.917 29.524 27.771 87.261 normavgdrbb3 19.565 21.278 27.546 29.100 100.489 avgdrbb 1 18.965 20.796 
26.036 26.520 93.317 mv1c30 28.199 31.759 31.900 30.025 121.883 mv1c3 3 20.198 21.293 27.194 25.662 97.347 avgdrbb58 39.016 38.630 37.549 32.508 205.703 mv1c4 58 47.132 44.756 31.117 27.617 208.622 mv1c4Table 14: A ranking of variable sensitivities, based on their performance in NN algorithms. The ranks on the left areachieved from training ttbb against ttlightnotc, whereas the ranks on the right are from training ttbb against ttsingleb.In both cases, it can be seen that MV1c3 and MV1c4 do very well, and are included in nearly all top performingalgorithms. As a result of this, it was decided to keep these two variables in every variable set, and run over all templatetraining combinations with permutations of the remaining 20 variables.100
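The arithmetic behind the "Total Sum" column of the BDT point cards (Tables 11 and 12) can be illustrated with a minimal sketch. It assumes only what the Table 11 caption states: the upper-choice (exclusion) points are subtracted from the lower-choice (inclusion) points. The function and variable names are hypothetical and are not taken from the analysis code. The three example variables and their sums are copied from Table 11, and the ascending sort simply mirrors the order in which the tables list their entries.

```python
# Illustrative sketch only: names are hypothetical, not from the thesis code.
# Inclusion sums and exclusion-sum magnitudes are copied from Table 11
# (the table prints the exclusion sums with a leading minus sign).
incl_sum = {"avgdr": 1.987, "mindr2": 2.228, "sum34pt": 4.234}
excl_sum = {"avgdr": 7.906, "mindr2": 27.607, "sum34pt": 12.586}


def total_points(incl, excl):
    """Subtract each variable's exclusion points from its inclusion points."""
    return {name: round(incl[name] - excl[name], 3) for name in incl}


def rank_variables(points):
    """Return variable names sorted by ascending point total."""
    return sorted(points, key=points.get)


totals = total_points(incl_sum, excl_sum)
ranking = rank_variables(totals)
print(totals)
print(ranking)
```

For these three variables the differences reproduce the "Total Sum" entries printed in Table 11, and their relative order matches the "Total Rank" column.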
Item Metadata
Title | Improvement to the statistical sensitivity of top quark pair production in conjunction with additional heavy flavour jets through multivariate analysis |
Creator | van Rossem, Mackenzie Peter Fulford |
Publisher | University of British Columbia |
Date Issued | 2016 |
Description | With the mass of the discovered Higgs-like boson being 125 GeV, this leads to a primary Higgs decay mode to two bottom (b) jets. A precise measurement of top-pair (tt̄) production in conjunction with two additional b-jets is essential to reduce the background uncertainty on the tt̄ + Higgs production cross-section, a direct probe of the Higgs to Yukawa coupling. This thesis attempts to improve on the statistical sensitivity of tt̄ production in conjunction with two additional heavy-flavour jets, using expected sensitivities from 20.3 fb⁻¹ of pp collision data at √s = 8 TeV, collected by the ATLAS detector at the Large Hadron Collider in 2012. This thesis compares multiple multivariate analysis techniques, boosted decision trees and artificial neural networks, in both binary and multi-class classification cases. An overall improvement in precision was seen, from 19.7% uncertainty on the baseline tt̄ + bb̄ measurement based on a fit to the best single variable, to 16.1% uncertainty with the very best multi-class neural network algorithm. This represents a relative improvement of nearly 20% and could thus reduce luminosity needed for a precision measurement of this process. |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2016-09-08 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0314174 |
URI | http://hdl.handle.net/2429/59118 |
Degree | Master of Science - MSc |
Program | Physics |
Affiliation | Science, Faculty of; Physics and Astronomy, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2016-11 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |