Open Collections

UBC Undergraduate Research

Exploring Machine Learning Models to Improve the Classification of Displaced Hadronic Jets in the ATLAS Calorimeter. de Schaetzen, Rodrigue, 2020-04

Full Text
Exploring Machine Learning Models to Improve the Classification of Displaced Hadronic Jets in the ATLAS Calorimeter

by

Rodrigue de Schaetzen

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF BSc, Combined Honours Computer Science and Physics in THE FACULTY OF SCIENCE (Physics and Astronomy)

The University of British Columbia (Vancouver)

April 2020

© Rodrigue de Schaetzen, 2020

Abstract

The Large Hadron Collider (LHC) has yet to find new physics that could address the Standard Model's (SM) large open questions, such as the composition of Dark Matter and the matter-antimatter asymmetry of the universe. There have been recent searches for Hidden Sector (HS) particles through the investigation of pair-production of neutral long-lived particles (LLPs) in proton-proton collisions. The ATLAS collaboration recently published results using a partial dataset from a search for paired LLP decays that produce displaced hadronic jets in the ATLAS calorimeter. Several classification models have been studied to identify these LLP decays, including boosted decision trees and LSTMs. In this analysis, 1D convolutional layers were added to an existing model architecture, which significantly improved the performance. Following hyperparameter optimization, the proposed model achieved a ROC AUC score of 0.97, a 10% relative improvement over the previous model.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 The Standard Model
  1.2 The ATLAS Experiment
    1.2.1 ATLAS detector sub-systems
    1.2.2 Jets
  1.3 The Central Problem
2 Theory
  2.1 Long Lived Particles
  2.2 Backgrounds
    2.2.1 Quantum Chromodynamics (QCD)
    2.2.2 Beam Induced Background (BIB)
  2.3 Supervised Machine Learning
  2.4 Deep Learning
    2.4.1 Recurrent Neural Network (RNN)
    2.4.2 1D Convolutional Neural Network (CNN)
3 Methods
  3.1 Current Network Configuration
  3.2 Exploring the Ordering of Transverse Momentum
  3.3 Modifying Model Architecture
  3.4 Model Metrics
    3.4.1 Accuracy
    3.4.2 ROC AUC
  3.5 K-Fold Cross Validation
4 Results and Discussion
  4.1 Re-ordering Transverse Momentum
  4.2 Modifying Model Architecture
  4.3 Optimizing Hyperparameters
    4.3.1 Insights from Grid Search
    4.3.2 Possible New Metrics
5 Conclusions
Bibliography
A Supporting Materials
  A.1 Python Code
  A.2 Complete Model Diagrams

List of Tables

Table 3.1: Number of events available per jet class, and whether the events are simulated or real.
Table 3.2: Hyperparameter search space for the grid search.
Table 3.3: Terms in Equations 3.1 and 3.2.
Table 4.1: Table to highlight the improvements achieved by this study. Results indicate the architecture Conv1D + LSTM prior to the grid search provides a 5-6% increase in ROC AUC in comparison to the LSTM model. The proposed model with optimized hyperparameters attains a relative improvement of 10%. Note, the number of epochs was doubled to 200 to ensure all models had converged, in order to record maximum model performance.

List of Figures

Figure 1.1: The Standard Model of particle physics.
Figure 1.2: A diagram of a slice of ATLAS while looking down into the cylinder.
Figure 2.1: The Feynman diagram of the theoretical particle decays that create the signature of interest.
Figure 2.2: A diagram depicting a trackless-displaced jet.
Figure 2.3: Diagram depicting a single 1D convolutional layer with 1 filter of kernel size 3.
Figure 3.1: A simplified diagram depicting the deep learning based LLP tagger developed by the UBC ATLAS collaboration.
Figure 4.1: 5-fold cross validation comparing the impact of pT ordering on the model performance. The distributions of the model metrics are represented as boxplots. The mean is shown by the orange line, the top and bottom of the box represent the 75th and 25th percentiles respectively, and the whiskers represent the maximum and minimum. Models trained with sorted pT data performed better in both accuracy and ROC AUC.
Figure 4.2: 5-fold cross validation comparing model architectures. The combined architecture Conv1D + LSTM outperformed the LSTM and Conv1D models.
Figure 4.3: Grid search results showing the positive correlation between learning rate and model performance. 12 different models were trained for each learning rate value. Prior to the grid search, the learning rate was set to 0.00005, which achieved the second lowest mean performance in this experiment.
Figure 4.4: Average ROC AUC scores plotted against the tested regularization values from the grid search. A curve is displayed for each of the learning rates to highlight the effect of varying the regularization. 3 different models were trained for each regularization + learning rate configuration.
Figure 4.5: Impact on the Conv1D + LSTM architecture when decreasing the number of nodes in the LSTM layers from 150 to 60. A 4-fold CV was performed to evaluate the effect of this modification on the model performance.
Figure 4.6: Diagram of the optimized proposed model architecture.
Figure A.1: Complete diagram of the LSTM model.
Figure A.2: Complete diagram of the Conv1D + LSTM model.
Figure A.3: Complete diagram of the Conv1D model. GlobalAveragePooling1D layers were used to flatten the features.

Glossary

ATLAS: A Toroidal LHC ApparatuS
AUC: Area Under Curve
BIB: Beam Induced Background
CMS: Compact Muon Solenoid
CNN: Convolutional Neural Network
ECAL: Electromagnetic Calorimeter
FPR: False Positive Rate
HCAL: Hadronic Calorimeter
HS: Hidden Sector
ID: Inner Detector
IID: Independent and Identically Distributed
LHC: Large Hadron Collider
LLP: long-lived particle
LSTM: Long Short-Term Memory
QCD: Quantum Chromodynamics
RNN: Recurrent Neural Network
ROC: Receiver Operating Characteristic
SM: Standard Model
CV: Cross Validation
pT: Transverse momentum

Acknowledgments

Thank you to my supervisor Dr. Alison Lister and to Ph.D. student Félix Cormier for their feedback and continued support during this project. I would also like to thank my roommate and my dad for their insightful comments and suggestions on this paper.

Chapter 1: Introduction

This chapter provides a brief introduction to the world of particle physics and the motivation for this study.

1.1 The Standard Model

The Standard Model (SM) of particle physics describes our current understanding of the fundamental particles and their interactions. However, many open questions, such as the matter-antimatter asymmetry of the universe [4] and the composition of Dark Matter [5], cannot be answered by the Standard Model, rendering it an incomplete model. In response, physicists have proposed new models which extend the SM in order to address some of its limitations. One class of such models are Hidden Sector (HS) models [6] [7]. These models predict a new set of particles, only weakly coupled to the SM, resulting in experimental signatures containing particles not charged under the SM (so not visible in the detectors) that decay to SM particles with a measurable lifetime.
HS models could provide answers to the open SM questions presented above, in particular the nature of Dark Matter.

The components of the SM shown in Figure 1.1 are the fundamental building blocks of all ordinary matter. For example, subatomic particles such as protons and neutrons are each composed of three valence quarks held together in a bound state by the strong force, i.e. gluons. This is the definition of a hadron.

Figure 1.1: The Standard Model of particle physics. It describes our current understanding of the fundamental particles and their interactions. The model is considered incomplete due to its inability to answer several major questions about the universe, most notably the composition of Dark Matter. Image taken from [1].

1.2 The ATLAS Experiment

This study will use simulation from A Toroidal LHC ApparatuS (ATLAS), a general-purpose particle detector containing many layers of sub-detectors. It is located at the world's largest particle accelerator, the Large Hadron Collider (LHC), situated along the Swiss-French border. Along with the Compact Muon Solenoid (CMS), another detector around the collider, ATLAS measurements observed the long-predicted Higgs boson in 2012 [8]. Inside the LHC, bunches of protons are accelerated to near light speed in opposite directions around the ring and collide with each other at specific points. Specifically, during Run 2 of the LHC (2015-2018), the centre-of-mass energy was 13 TeV for proton-proton collisions. This tremendous amount of energy, along with a frequency of 10^8 physics-related events per second, generated countless particles that either decayed in the detectors or left the apparatus undetected.

1.2.1 ATLAS detector sub-systems

Due to the variety of particles that arise from proton collisions, the many layers of different detectors combined in ATLAS provide clues to identifying a particle. These clues make up what is known as a particle signature. Figure 1.2 illustrates the sub-systems of ATLAS. Their descriptions are provided in the list below. It should be emphasized that the proton-proton collisions occur at the center of the detector.

1. Inner Detector (ID): The innermost layer of the detector measures the trajectories of electrically-charged particles. From the curvature of the reconstructed trajectories, their momentum can be inferred.

2. Electromagnetic Calorimeter (ECAL): Calorimeters are devices designed to absorb and measure the energy of particles that reach them. The topology of calorimeters consists of clusters of cells. As such, the specific clusters a particular particle hits are of high relevance to particle identification. The ECAL is the first of two calorimeters in ATLAS and is specialized in measuring the energy of photons and electrons.

3. Hadronic Calorimeter (HCAL): The HCal measures the energy of hadrons (e.g. protons, neutrons).

4. Muon Spectrometer: The main purpose of the final layer is to track muons, an elementary particle similar to the electron, as they, along with neutrinos, are not stopped in the calorimeters. The muon spectrometer is also segmented. The detection principle is similar to the ID in that it allows the reconstruction of the trajectories of charged particles in a magnetic field. The parts of the muon spectrometer exhibiting energy deposits (hits) consistent with a charged particle are called muon segments. Note, any charged particle that has not yet decayed in the calorimeters will be detected in this final layer.

Figure 1.2: A diagram of a slice of ATLAS while looking down into the cylinder. The innermost layer (ID) is composed of detectors that track trajectories of charged particles. The next two layers are the electromagnetic calorimeter (ECal) and hadronic calorimeter (HCal). Calorimeters are designed to absorb the energy of particles that pass through them, eventually stopping the particle. Electrons and photons are absorbed in the ECal, while hadrons get absorbed in the HCal. The final layer is the muon spectrometer, which is designed to measure muons and any other charged particle that has not yet been stopped. Figure from [2].

1.2.2 Jets

A major component of hadron collider experiments is the reconstruction and analysis of jets, which are roughly cone-shaped clusters of particles. Typically, jets originate from quarks or gluons (elementary particles) that decay and radiate, then form into hadrons (composite particles). The layers of the ATLAS detector make it possible to piece together the low-level and high-level characteristics of a particular jet. Tracks, calorimeter cluster deposits, and muon segments constitute the low-level constituents of a jet. The high-level jet variables used in this study are given by the four-momentum of the jets, namely:

• Pseudorapidity (η): Describes the angle in relation to the axis of the detector cylinder, and thus the beam axis.
• Transverse momentum (pT): The component of momentum transverse/perpendicular to the axis of the detector.
• Angle (φ): The angle in the transverse plane.

1.3 The Central Problem

To date, no physics outside of the Standard Model has been discovered at the LHC. For this reason, researchers have broadened their search to more complex particle signatures. In particular, a number of studies search for particles that decay to SM particles only after a measurable distance in the detector. Many extensions to the SM theorize the existence of such long-lived particles (LLPs). A paper published by the ATLAS collaboration [9] considers a heavy neutral boson decaying to a pair of neutral LLPs. In their search, the signature of interest is LLPs decaying in the ATLAS calorimeters. A model capturing the complexities of this signal was developed so that it could be differentiated from the highly abundant background. The initial classification model consisted of a Boosted Decision Tree; a relatively simple machine learning algorithm.
An ongoing analysis has revamped this model to leverage recent developments in highly complex machine learning algorithms. This approach builds off the successes of similar complex models applied in other physics analyses [10], [11].

In this analysis, the application of a novel machine learning algorithm to the current classification model is explored. Specifically, this paper proposes a modified architecture consisting of adding 1D convolutional layers. The goal is to further improve the LLP jet classification model. An improvement to the model would increase the discovery potential of Hidden Sector particles, which could answer some of the Standard Model's open questions.

Chapter 2: Theory

2.1 Long Lived Particles

Figure 2.1 displays the Feynman diagram of the theorized HS model generating the signature of interest. Two protons (p) collide to form a heavy neutral boson (Φ). The boson then decays to two long-lived scalar particles (s), which in turn each decay to a fermion-antifermion pair (f, f̄). Both the boson and the long-lived particles are invisible to the detector, indicated by the dotted lines in the diagram. The four-fermion final state is the observable signature inside ATLAS. Due to the long lifetime of s considered in this model, each LLP decaying to a fermion-antifermion pair is postulated to decay to a displaced jet just before or in the first layers of the ATLAS calorimeters, thus depositing most of its energy in the HCal. This restriction can be expressed by a high ratio of energy deposited in the HCal relative to the ECal (Equation 2.1). Other expected characteristics of the signal jet include a lack of tracks and narrow jet widths. A model exhibiting some of these discussed signal jet features is shown in Figure 2.2. It should be emphasized that this study is interested in searching for pairs of these types of jets.

CalRatio = E_{HCal} / E_{ECal}    (2.1)

Figure 2.1: The Feynman diagram of the theoretical particle decays that create the signature of interest. Two protons (p) collide to form a heavy scalar boson (Φ). The boson decays to two neutral long-lived particles (s), which then both decay to a fermion-antifermion pair (f, f̄). The diagram is annotated with the lifetime of s (τs) and the interaction point (IP). Overall this model serves as a benchmark for searching for paired LLPs. Figure taken, with permission, from the ATLAS analysis team.

Figure 2.2: A diagram depicting a trackless-displaced jet. The dotted line again indicates no detection by the detector. Important high-level variables measured at ATLAS are also shown here. Pseudorapidity (η) is a spatial coordinate related to the angle of the particle in relation to the beam axis. φ is the angle of a particle in the transverse plane.

2.2 Backgrounds

Two types of jets which mimic signal are considered in this analysis.

2.2.1 Quantum Chromodynamics (QCD)

Although the least likely to resemble signal, QCD is the most abundant form of background. QCD multi-jets are simply decays to SM particles from proton-proton collisions. A cluster of neutrons, for example, decaying to other hadrons in the HCal could be confused with signal by the classification model. Detector measurement errors could also contribute to a QCD jet resembling signal.

2.2.2 Beam Induced Background (BIB)

BIB stems from muons generated by proton interactions with the collider beam gas or the collimators. This occurs before the protons reach the ATLAS detector. These muons, travelling parallel to the beam pipe, can deposit energy in the calorimeters, creating a trackless jet.

2.3 Supervised Machine Learning

A multi-class classification model is needed to classify jets (this is a jet-by-jet, not event-level, classification) as either signal, QCD, or BIB.
The complexity of the classification problem at hand requires the unprecedented pattern-identification ability of novel machine learning algorithms. In the context of classification, supervised machine learning consists of systematically tuning the weights of a function that describes the relationship between some input and a discrete output by comparing the predicted outputs to ground truth. The variables that describe a particular input are referred to as features, and the discrete classes the model is trying to predict are called labels. In the context of this analysis, the features are all the low-level and high-level variables that describe a particular jet, e.g. track, constituent, muon segment, and pT. The jet types (i.e. Signal, QCD, or BIB) are the labels.

L(y, ŷ) = −Σ_{j=0}^{M} Σ_{i=0}^{N} y_{ij} · log(ŷ_{ij})    (2.2)

A supervised model tunes/learns its weights via a loss function, which determines how close the predictions are to the truth values. The loss function for a typical multi-class classification problem, and the one used in this study, is given by Equation 2.2. It is known as the categorical cross-entropy loss function. ŷ is the predicted value, y is the truth label, M is the number of classes, and N is the number of samples. Discrete outputs with n possible labels are converted to an array of length n. The value 1 is set at the index corresponding to the discrete class and 0s are set everywhere else, a technique called one-hot encoding. Given this description, it is simple to verify that the presented loss function is a sum of separate losses for each class label per observation. Iterative optimization algorithms make it possible for the model loss to be minimized. The optimizer used in this analysis is Nadam [12]. The whole procedure of minimizing the loss function and tuning the weights of the model is referred to as the training phase. It is crucial for the data used for model training to be independent from the data used for testing to avoid a biased model.
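The one-hot encoding and the loss of Equation 2.2 can be sketched in plain Python. This is an illustrative re-implementation following the sum convention of Equation 2.2, not the Keras code used in the analysis (Keras additionally averages over samples):

```python
import math

def one_hot(label, n_classes):
    """Encode a discrete class index as a one-hot vector."""
    vec = [0.0] * n_classes
    vec[label] = 1.0
    return vec

def categorical_cross_entropy(y_true, y_pred):
    """Equation 2.2: sum of -y_ij * log(yhat_ij) over samples and classes.

    y_true: list of one-hot vectors; y_pred: list of predicted
    probability vectors (each summing to 1).
    """
    loss = 0.0
    for truth, pred in zip(y_true, y_pred):
        for y, y_hat in zip(truth, pred):
            if y > 0:  # only the true-class term contributes
                loss -= y * math.log(y_hat)
    return loss

# Three jet classes: 0 = Signal, 1 = QCD, 2 = BIB
y_true = [one_hot(0, 3), one_hot(2, 3)]
y_pred = [[0.7, 0.2, 0.1],   # fairly confident Signal (correct)
          [0.1, 0.1, 0.8]]   # fairly confident BIB (correct)
print(categorical_cross_entropy(y_true, y_pred))  # ≈ 0.580
```

A perfect prediction (probability 1 on the true class) contributes zero loss, while confident wrong predictions are penalized heavily by the logarithm.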
A common metaphor used to illustrate this constraint is a student getting access to the answers of an exam prior to writing it, versus the student getting access to similar previous exams. When a model has excessively tuned its weights to the point that it has learned the complexities of the noise in the training data, the model is said to have overfitted. A standard technique to monitor model performance during training is to regularly evaluate the model on a separate dataset called the validation dataset. This provides insight into how well the model is doing and indicates the possibility of overfitting.

2.4 Deep Learning

Deep learning is an area of machine learning inspired by the complex neural networks of human brains. Networks are composed of interconnected layers of artificial neurons or nodes. In a simple feed-forward neural network, each node has an associated weight and bias, and its output is fed into a non-linear function called an activation function. The correct weights and biases of these nodes to match any given input to its corresponding output are determined during model training. Multiple layers of nodes, and large numbers of nodes in each layer, make it possible for the network to automatically learn highly complex and discriminative features. This capability underlies the dominance of deep learning for complex classification tasks in comparison to other machine learning algorithms.

The following two subsections provide technical details of the specific deep learning algorithms discussed throughout this paper.

2.4.1 Recurrent Neural Network (RNN)

Recurrent Neural Networks are a class of artificial neural networks with a powerful ability to model sequential data. The assumption made in standard neural networks is that the data is Independent and Identically Distributed (IID). In other words, no information/context is lost if samples are randomly selected from a dataset. In many problems, however, this is not the case.
For example, a model that tries to predict the next word in a sentence needs to have information about the parts of the sentence that came before it. RNNs solve this problem by not making the IID assumption and instead retaining information on the inputs the model has seen so far by having loops in the network nodes. The output of the model is therefore dependent on both the current input and the inputs before it. As a result, these models perform exceptionally well at finding patterns in data containing variable-length sequences.

2.4.2 1D Convolutional Neural Network (CNN)

Convolutional Neural Networks are a class of artificial neural networks centered around the idea of convolutions. Series of filters or feature detectors capture localized information as they move across an input, such as the pixels of an image. This operation is called a convolution. The filters themselves are nothing more than matrices with adjustable weights that produce an output when multiplied by portions of an input matrix. The goal is for the network to tune these filters such that high-level features are extracted.

In the case of 1D CNNs, the width of a filter is the width of the input matrix. This implies the filter can only move across rows and not across columns (i.e. 1 dimension). The filter is also referred to as the kernel, and the height of the filter is called the kernel size. The example of a 1D convolutional layer shown in Figure 2.3 has a filter with kernel size 3.

Figure 2.3: Diagram depicting a single 1D convolutional layer with 1 filter of kernel size 3. At each step of the convolution, matrix multiplication is applied between the filter matrix and the portion of the input matrix covered by the filter. This operation is repeated as the filter moves down row by row along the input matrix. Original diagram taken from [3].

Chapter 3: Methods

The Python programming library Keras [13], with the TensorFlow [14] back-end, was used throughout this study to implement and modify networks.

3.1 Current Network Configuration

The current architecture of the LLP tagger developed by the UBC ATLAS group is a deep Recurrent Neural Network. It leverages Long Short-Term Memory (LSTM) networks, specialized RNN layers capable of learning long-term dependencies, an ability standard RNN layers are known to lack [15]. These specialized layers solve the issue of being unable to learn the connections between relevant information and the current inputs when the gaps between the two (i.e. how much data has been fed into the model since a particular set of inputs) are too big. This makes LSTMs highly effective at learning dependencies across arbitrarily sized sequential data.

Each jet fed into the network is truncated at 20 tracks, 30 constituents, and 30 muon segments. Features of these jet components consist of the various possible measurements made by the ATLAS detector, such as transverse momentum, pseudorapidity, angle in the transverse plane, layer fraction, and timing information. Figure 3.1 provides a graphical representation of the current architecture.

Figure 3.1: A simplified diagram depicting the deep learning based LLP tagger developed by the UBC ATLAS collaboration. The model leverages Long Short-Term Memory (LSTM) networks, specialized RNN layers capable of learning long-term dependencies. Consequently, the original network will be referred to as the LSTM model. The full architecture is displayed in Appendix A.1.

In addition to a preconfigured network, preprocessed and transformed data was also available at the start of the project. The number of events and the distinction between simulated and real data are shown in Table 3.1. Details specifying the methods used to generate simulated events are described in [9]. In essence, the pipeline presented above effectively served as the starting point of this study.

Event Type   Simulated/Real   Number of Events
Signal       Simulated        660,134
QCD          Simulated        766,056
BIB          Real             661,176

Table 3.1: The number of events available per jet class, and whether the events are simulated or real. In total, the full dataset consists of 2,087,366 events.

3.2 Exploring the Ordering of Transverse Momentum

LSTMs are particularly well-suited for temporal modeling, where the inputs feeding into the network layers are expected to be ordered in some way. By default, tracks and constituents are sorted by descending pT during the pre-processing phase to take advantage of this fact. Although muon segments also feed into an LSTM, it is not possible to sort them by pT since that feature is missing. It is worth noting that this ordering is somewhat arbitrary and does not translate to any physical meaning. However, since some variables can be more accurately modeled than others, there is likely an optimal ordering of the inputs. Hence, the first study focused on determining whether transverse momentum is an appropriate ordering for the input tracks and constituents. Models were trained with three different pT-ordered datasets: descending, ascending, and random.

3.3 Modifying Model Architecture

The CMS collaboration published a paper [16] on training a deep neural network to classify b jets using proton-proton collision data measured with the CMS detector. The model architecture presented in their paper is a feedforward neural network consisting of CNN, LSTM, and Dense layers. Specifically, the CNN layers are 1D convolution filters with kernel size 1. Although several studies [17], [18] have shown this unified architecture is highly effective in applications that benefit from both temporal and spatial modeling, the former does not apply to the CNN layers in the b jet tagger.
Instead, they perform global feature extraction and dimensionality reduction, without a spatial aspect, since these filters capture a single row at a time. These 1D convolutional layers output highly discriminating and compressed features which feed into the LSTMs.

Inspired by this, in this second study the addition of 1D convolutional layers with kernel size 1 to the current LLP tagger model is explored. Henceforth, the proposed model will be referred to as Conv1D + LSTM. The track, constituent, and muon segment inputs now feed into Conv1D layers before passing through the LSTMs. The numbers of Conv1D layers and filters were initialized to match the configuration outlined in [16]. An initial comparison was made to verify that the addition of Conv1D layers does indeed improve the performance of the network.

Following this step, the hyperparameters of the proposed architecture were optimized through a grid search, an effective yet computationally expensive model optimization technique. The search space consisted of 5 values for learning rate, 4 values for regularization, and 3 values for the number of filters in the final Conv1D layer. The specific values tested are shown in Table 3.2. Note, the learning rate and regularization values used for training the model in the previous study were 0.00005 and 0.001 respectively.

Hyperparameter       Values                                      Count
Learning rate        0.000025, 0.00005, 0.0001, 0.0002, 0.0004   5
Regularization       0.001, 0.0025, 0.005, 0.01                  4
Final Conv1D layer   16, 12, 8 for Constituent and Track;        3
                     8, 6, 4 for Muon Segment

Table 3.2: Hyperparameter search space for the grid search. The Count column displays the number of different values tested per hyperparameter. In total, 5 x 4 x 3 = 60 unique model configurations were trained. Final Conv1D layer represents the number of filters for each of the 3 CNNs in the network, referenced by their input data. Note, in an attempt to reduce the search space (and the computational load), this was treated as a single hyperparameter.
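The enumeration of the Table 3.2 search space is a simple Cartesian product. A sketch is given below; the configuration keys and the `train_and_evaluate` placeholder are illustrative, not the actual analysis code:

```python
from itertools import product

learning_rates = [0.000025, 0.00005, 0.0001, 0.0002, 0.0004]
regularizations = [0.001, 0.0025, 0.005, 0.01]
# Final Conv1D layer filters, treated as one hyperparameter:
# (constituent/track filters, muon-segment filters)
final_conv1d_filters = [(16, 8), (12, 6), (8, 4)]

grid = list(product(learning_rates, regularizations, final_conv1d_filters))
print(len(grid))  # 5 x 4 x 3 = 60 unique model configurations

for lr, reg, (ct_filters, ms_filters) in grid:
    config = {"learning_rate": lr,
              "regularization": reg,
              "track_constit_filters": ct_filters,
              "muon_segment_filters": ms_filters}
    # train_and_evaluate(config)  # placeholder for one training run
```

Treating the two filter counts as a single joint hyperparameter, as done in the study, keeps the grid at 60 configurations rather than the 180 that an independent scan would require.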
The following are short descriptions of each hyperparameter in the search space:

• Learning Rate: Size of the adjustments made to the model weights with respect to the loss gradient. Also known as step size.

• Regularization: An additional term in the loss function which penalizes model complexity to avoid overfitting.

• Number of filters in the final Conv1D layer: Determines the width of the input matrices feeding into the LSTMs. This was part of the grid search to validate the usefulness of dimensionality reduction. For example, the Track input is a 20x13 matrix which reduces to 20x8 if the final Conv1D layer contains 8 filters.

3.4 Model Metrics

The next two subsections describe the two metrics this analysis used to evaluate model performance.

3.4.1 Accuracy

Recall that the final output of the network is a probability for each jet class. A given jet is considered accurately labelled if the ground-truth label matches the jet class with the highest probability. To measure a model's accuracy over a given dataset, model predictions for every jet are gathered. The accuracy is simply calculated as the number of correctly labelled jets divided by the total number of jets classified.

3.4.2 ROC AUC

A useful tool commonly used for binary classifiers is the Receiver Operating Characteristic (ROC) curve. It provides graphical insight into the classification performance across all possible threshold values. The Area Under Curve (AUC) is defined as the total area under a ROC curve and represents the degree of separability. A value of 1 corresponds to a perfect classifier, while 0.5 is equivalent to a network guessing at random.

The following is a description of a specialized ROC curve tailored to the multi-class LLP tagger. This curve is generated by plotting QCD rejection against LLP efficiency.
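The accuracy calculation of Section 3.4.1 reduces to an argmax comparison. A minimal sketch (the per-jet probabilities and labels below are made up for illustration, and the class ordering signal/QCD/BIB is an assumption):

```python
def accuracy(probabilities, labels):
    """Fraction of jets whose highest-probability class matches the truth label."""
    predictions = [max(range(len(p)), key=lambda i: p[i]) for p in probabilities]
    correct = sum(pred == true for pred, true in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical softmax outputs for 4 jets over 3 classes (signal, QCD, BIB)
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]
truth = [0, 1, 2, 0]  # ground-truth class indices
print(accuracy(probs, truth))  # 0.75: three of the four jets are correctly labelled
```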
LLP tagging efficiency is defined as the fraction of times the network correctly tagged a jet as signal (Equation 3.1), while QCD rejection is equivalent to 1 over the False Positive Rate (FPR) (Equation 3.2). Table 3.3 defines the terms in these equations. The final jet class is integrated into the plot via the quoted BIB efficiency, a discrete value corresponding to the proportion of true BIB jets classified as BIB. This value determines the initial cut separating the three jet class distributions prior to generating the ROC curve. For this analysis, the BIB efficiency was set to 0.968 to be consistent with the BIB efficiency achieved in the previous analysis [9].

LLP Tagging Efficiency = TP / (TP + FN)    (3.1)

QCD Rejection = (FP + TN) / FP    (3.2)

                       True Signal           True QCD background
Classified as Signal   True Positive (TP)    False Positive (FP)
Classified as QCD      False Negative (FN)   True Negative (TN)

Table 3.3: Table describing the terms in Equations 3.1 and 3.2.

3.5 K-Fold Cross Validation

K-Fold Cross Validation (CV) is a powerful statistical technique used throughout this study to validate and compare models. It consists of randomly splitting the training data into k partitions and training a model for each possible training-validation pair. This produces k different models, such that each model is trained on k − 1 partitions of the data, and each partition acts as the validation set for exactly one model. This technique yields a less biased, and therefore more accurate, estimate of model skill than a single train-validation split.

Chapter 4

Results and Discussion

4.1 Re-ordering Transverse Momentum

The model trained with randomly pT-ordered data performed poorly in comparison to the models trained with ordered pT data. Results of a 5-Fold cross validation are shown in Figure 4.1. Descending pT appeared slightly better than ascending pT, and as a result the descending pT dataset was used in the subsequent study.

From the results, transverse momentum appears to be a suitable ordering for the track and constituent inputs.
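The descending-pT reordering applied to the track and constituent inputs can be sketched with NumPy. The 20x13 track shape follows the text; placing pT in the first column is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
tracks = rng.random((20, 13))  # one jet: 20 tracks x 13 features, pT assumed in column 0

# Sort the rows so the highest-pT track comes first (descending order)
tracks_desc = tracks[np.argsort(-tracks[:, 0])]

assert tracks_desc.shape == (20, 13)
assert (np.diff(tracks_desc[:, 0]) <= 0).all()  # pT column is non-increasing
```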
It is worth observing that the mean performance was slightly higher in models trained with descending-ordered data compared to ascending. Repeated experiments showed conflicting results, and it is still unclear whether this small difference is significant. Regardless, the improved performance with sorted data compared to unsorted data is enough to justify either direction of ordering.

Figure 4.1: 5-Fold cross validation comparing the impact of pT ordering on the model performance. The distribution of the model metrics is represented as boxplots. The mean is shown by the orange line, the top and bottom of the box represent the 75th and 25th percentile respectively, and the whiskers represent the maximum and minimum. Models trained with sorted pT data performed better in both accuracy and ROC AUC.

Figure 4.2: 5-Fold cross validation comparing model architectures. The combined architecture Conv1D + LSTM outperformed the models LSTM and Conv1D.

4.2 Modifying Model Architecture

A 5-Fold cross validation was performed to evaluate the new proposed architecture with added 1D convolutional layers. In addition to training the original and new model architectures, a third model consisting of Conv1D, GlobalAveragePooling1D, and Dense layers (referred to as the Conv1D model) was also trained. The Conv1D layer configuration, including the number of filters for each layer, was set to closely match the configuration outlined in the DeepJet b tagging algorithm [16] due to the similar input and output shapes of the 1D CNNs. The Conv1D + LSTM architecture outperformed both models in accuracy and ROC AUC.
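Because a Conv1D layer with kernel size 1 applies the same weights to every row independently, its effect can be sketched in plain NumPy as a per-track projection. The shapes follow the Track example from the text (20x13 in, 20x8 out); the random weights stand in for learned filters, and the ReLU activation is an assumption:

```python
import numpy as np

rng = np.random.default_rng(42)
track_input = rng.random((20, 13))  # Track input: 20 tracks x 13 features
kernel = rng.random((13, 8))        # kernel size 1 with 8 filters == a 13x8 weight matrix
bias = np.zeros(8)

# Each row (track) is projected independently: no mixing across rows,
# hence no spatial aspect -- only feature extraction and compression.
compressed = np.maximum(track_input @ kernel + bias, 0.0)  # ReLU assumed
print(compressed.shape)  # (20, 8), the reduced matrix fed to the LSTM
```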
It was also found that the Conv1D model (no LSTM layers) and the LSTM model achieved similar performance. Results of the 5-Fold cross validation are shown in Figure 4.2.

4.3 Optimizing Hyperparameters

Following the completion of the grid search, correlations were calculated between the hyperparameters in the search space and the model evaluation metrics. A strong positive correlation was found between learning rate and model performance. Their relationship is shown in Figure 4.3, which plots the mean ROC AUC.

Though far less significant, a negative correlation was found between regularization and model performance. Figure 4.4 plots the average ROC AUC against regularization for each of the learning rate values in the search space. A curve was plotted for each learning rate to better visualize the effect of changing the regularization. A subtle trend of better performance is seen when decreasing regularization. Finally, the model metrics showed little dependence on the number of filters in the final Conv1D layers.

The final adjustment made to the Conv1D + LSTM architecture was decreasing the number of nodes in the LSTM layers, resulting in fewer trainable parameters. The motivation for this adjustment comes from the general notion that the more parameters there are to train, the more likely the model is to overfit to the training data. This reduction from 150 to 60 nodes in the LSTM decreased the variability of the model performance and increased the mean classification ability. Results from a 4-Fold CV comparing the effect of this adjustment are displayed in Figure 4.5.

To summarize, the best Conv1D + LSTM model was trained with a learning rate of 0.0004, regularization of 0.001, and 60 nodes in the LSTM layers. All the modifications made to the original LSTM network are reflected in the model architecture diagram shown in Figure 4.6. Table 4.1 provides a summary of the relative improvements achieved by the proposed model.
There is a 10% ROC AUC improvement in comparison to the original model, clearly indicating the Conv1D + LSTM architecture is a major improvement to the deep learning LLP tagger.

Model                     ROC AUC   Accuracy
Conv1D + LSTM optimized   0.96      0.97
Conv1D + LSTM             0.92      0.96
LSTM                      0.87      0.94

Table 4.1: Table highlighting the improvements achieved by this study. Results indicate the Conv1D + LSTM architecture prior to the grid search provides a 5-6% increase in ROC AUC in comparison to the LSTM model. The proposed model with optimized hyperparameters attains a relative improvement of 10%. Note, the number of epochs was doubled to 200 to ensure all models had converged, in order to record maximum model performance.

4.3.1 Insights from Grid Search

Learning rate is arguably the most important hyperparameter to tune on a given network, and its optimal value is highly dependent on the model architecture, optimizer, and loss function. Too small a value can result in the loss converging to a local minimum, unable to climb out due to the small step size. As such, dedicated time was spent tuning this value for the Conv1D + LSTM model.

Figure 4.3: Grid search results showing the positive correlation between learning rate and model performance. 12 different models were trained for each learning rate value. Prior to the grid search, the learning rate was set to 0.00005, which achieved the second lowest mean performance in this experiment.

Figure 4.4: Average ROC AUC scores plotted against the tested regularization values from the grid search. A curve is displayed for each of the learning rates to highlight the effect of varying the regularization. 3 different models were trained for each regularization + learning rate configuration.
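The step-size trade-off can be seen on a toy quadratic loss. This is an illustration only, unrelated to the tagger's actual loss surface: too small a learning rate makes negligible progress in a fixed number of steps, while a well-chosen one reaches the minimum.

```python
def gradient_descent(lr, steps=100, x0=5.0):
    """Minimize f(x) = x**2 with a fixed learning rate; returns the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x  # gradient of x**2 is 2x
    return x

print(abs(gradient_descent(lr=0.4)))     # essentially 0: reaches the minimum
print(abs(gradient_descent(lr=0.0001)))  # ~4.9: barely moves from x0 = 5.0
```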
Results from the grid search strongly indicated that the optimal learning rate for the proposed architecture was much larger than the starting value taken from the training configuration of the previous study.

Although the model metrics showed a slight dependence on regularization, additional experiments (via a larger search space) would be required to validate this trend and to determine whether 0.001 is truly the optimal value. A final improvement to the hyperparameters would be finding the optimal number of LSTM nodes, since only one value other than the initial setting was tested. Along with the other improvements discussed above, further hyperparameter optimization could be performed with a randomized search, resulting in drastically lower runtime.

Figure 4.5: Impact on the Conv1D + LSTM architecture of decreasing the number of nodes in the LSTM layers from 150 to 60. A 4-Fold CV was performed to evaluate the effect of this modification on model performance.

Another major result from the parameter search is that adding additional filters to the final Conv1D layer did not seem to improve the model. A layer containing 8 filters seemed to output the same amount of meaningful information as a layer with 16 filters. Therefore, the null correlation found between the tested number of filters and model performance validates the dimensionality reduction ability of the Conv1D layers.
To conclude this part of the discussion, the improvements over the LSTM model with the added sequential Conv1D layers can be attributed to the resulting feature extraction and compression.

Figure 4.6: Diagram of the optimized proposed model architecture.

4.3.2 Possible New Metrics

Aside from ways to further improve the model, extensions to this analysis should consider exploring new metrics that could better quantify model performance or provide more insight when making comparisons. One example would be extracting and optimizing the signal efficiency at a small fixed set of QCD rejection values, rather than calculating the efficiency over all values of rejection (as done for the ROC curve and the accuracy number). Setting fixed thresholds would output more explicit measures of the network's ability to differentiate the three jet classes. The added constraint would also make it easier to explicitly optimize for signal efficiency.

Another useful metric would be the ratio of the signal event count to the square root of the background event count. This metric, often referred to as "significance", is commonly used in particle physics research. Optimizing this ratio implies maximizing the signal event count while minimizing the uncertainty expressed by the denominator. This uncertainty arises because counting signal events in the presence of background obeys a Poisson distribution. The higher the significance, the higher the confidence in rejecting the null hypothesis that the signal is purely a statistical fluctuation of the background. In other words, an increase in significance would directly translate to a higher probability of finding new physics. Though a promising approach, this technique is nontrivial since it requires information on the expected number of signal and background events (i.e.
the relative cross-sections).

Chapter 5

Conclusions

Many extensions to the Standard Model suggest the existence of long-lived particles that only interact with the SM through a weakly-coupled mediator. The extended lifetime of these particles and their weak interaction with the SM would result in displaced hadronic jets in the ATLAS detector. Ongoing and published research by the ATLAS collaboration has presented machine learning models to classify displaced jets in the ATLAS calorimeters. In this analysis, a modified architecture was proposed with the aim of improving model performance.

Prior to exploring new models, an experiment was performed to verify the decision to sort certain input data by transverse momentum, motivated by the fact that LSTM layers expect ordered inputs. The experiment consisted of training models on descending, ascending, and randomly pT-ordered datasets. Models trained on the ordered datasets performed significantly better in both accuracy and ROC AUC.

A deep recurrent neural network developed as part of the ongoing LLP search was established as the benchmark for comparing model performance and developing an improved architecture. This analysis explored the addition of 1D convolutional layers to the deep learning based LLP tagger. A 5-Fold cross validation showed the proposed Conv1D + LSTM model substantially outperformed the original network.

Finally, a grid search was performed to optimize the hyperparameters of the improved model. A larger learning rate, a slightly smaller regularization value, and a decrease in LSTM nodes were found to further enhance the model performance. Overall, the optimized Conv1D + LSTM model achieved a 10% increase in ROC AUC in comparison to the original model.

Extensions to this analysis should consider exploring other metrics for evaluating and comparing models.
Significance and signal efficiency at specific QCD rejection values are two proposed metrics which could offer better insight into the discovery potential.

Bibliography

[1] Wikipedia contributors. Standard Model — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Standard_Model&oldid=943664780, 2020. [Online; accessed 6-March-2020].

[2] Lawrence Lee, Christian Ohm, Abner Soffer, and Tien-Tien Yu. Collider searches for long-lived particles beyond the standard model. Progress in Particle and Nuclear Physics, 106:210–255, May 2019.

[3] Cezanne Camacho. CNNs for text classification - gif: conv maxpooling steps.gif. [Online; accessed 1-April-2020].

[4] Yanou Cui and Brian Shuve. Probing baryogenesis with displaced vertices at the LHC. Journal of High Energy Physics, 2015(2), Feb 2015.

[5] Keith R. Dienes, Shufang Su, and Brooks Thomas. Distinguishing dynamical dark matter at the LHC. Physical Review D, 86(5), Sep 2012.

[6] Yuk Fung Chan, Matthew Low, David E. Morrissey, and Andrew P. Spray. LHC signatures of a minimal supersymmetric hidden valley. Journal of High Energy Physics, 2012(5), May 2012.

[7] Matthew Baumgart, Clifford Cheung, Joshua T. Ruderman, Lian-Tao Wang, and Itay Yavin. Non-abelian dark sectors and their collider signatures. Journal of High Energy Physics, 2009(04):014–014, Apr 2009.

[8] The ATLAS Collaboration. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B, 716(1):1–29, Sep 2012.

[9] The ATLAS Collaboration. Search for long-lived neutral particles in pp collisions at √s = 13 TeV that decay into displaced hadronic jets in the ATLAS calorimeter. The European Physical Journal C, 79(6), Jun 2019.

[10] Shannon Egan, Wojciech Fedorko, Alison Lister, Jannicke Pearkes, and Colin Gay. Long short-term memory (LSTM) networks with jet constituents for boosted top tagging at the LHC, 2017.
[11] Wahid Bhimji, Sasha Farrell, Thorsten Kurth, Michela Paganini, Prabhat, and Evan Racah. Deep neural networks for physics analysis on low-level whole-detector data at the LHC. Journal of Physics: Conference Series, 1085, Nov 2017.

[12] Timothy Dozat. Incorporating Nesterov momentum into Adam. 2015.

[13] François Chollet et al. Keras, 2015.

[14] M. Abadi, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

[15] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[16] The CMS Collaboration. Performance of the DeepJet b tagging algorithm using 41.9/fb of data from proton-proton collisions at 13 TeV with the Phase 1 CMS detector. Nov 2018.

[17] Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description, 2014.

[18] Tara Sainath, Oriol Vinyals, Andrew Senior, and Hasim Sak. Convolutional, long short-term memory, fully connected deep neural networks. Pages 4580–4584, Apr 2015.

Appendix A

Supporting Materials

A.1 Python Code

The GitHub repository rdesc/deep-learning-llp-tagger contains the Python scripts that were used for pre-processing data, generating plots, and for training, evaluating, and testing models.

A.2 Complete Model Diagrams

The following three figures are complete diagrams of the three different architectures discussed in this paper. These diagrams were generated via the plot_model method from keras.utils.

Figure A.1: Complete diagram of the LSTM model.

Figure A.2: Complete diagram of the Conv1D + LSTM model.

Figure A.3: Complete diagram of the Conv1D model. GlobalAveragePooling1D layers were used to flatten the features.
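As a small supplement to the repository scripts, the significance metric proposed in Section 4.3.2 can be sketched as follows (the event counts below are hypothetical):

```python
import math

def significance(n_signal, n_background):
    """Counting-experiment significance S / sqrt(B)."""
    return n_signal / math.sqrt(n_background)

# Hypothetical expected event counts after a selection cut
print(significance(n_signal=50, n_background=100))  # 5.0
```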

