Faster Convolutional Neural Network Training viaApproximate MemoizationbyAmruth SandhupatlaB.Tech., National Institute of Technology Warangal, India, 2014A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of Applied ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Electrical and Computer Engineering)The University of British Columbia(Vancouver)October 2018c© Amruth Sandhupatla, 2018The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:Faster Convolutional Neural Network Training via Approximate Mem-oizationsubmitted by Amruth Sandhupatla in partial fulfillment of the requirements forthe degree of Master of Applied Science in Electrical and Computer Engineer-ing.Examining Committee:Tor M. Aamodt, Electrical and Computer EngineeringSupervisorSudip Shekhar, Electrical and Computer EngineeringSupervisory Committee MemberAlireza Nojeh, Electrical and Computer EngineeringHeads Nominee and Chair MemberiiAbstractDeep Convolutional Neural Networks have found wide application but their train-ing time can be significant. We find that between successive epochs during training,many neurons compute nearly the same output when presented with the same in-put. This presents an opportunity to skip computation in the forward pass on thelater epoch via memoization. This dissertation explores the potential of such anapproach by investigating the correlation of neuron activations between trainingepochs. We develop an implementation of activation memoization that takes intoaccount the lockstep behavior of threads executing together in single-instruction,multiple-thread Graphic Processing Units (GPU). Finally, we discuss the trade-offbetween speedup and accuracy by showing that STAN achieves a 1.3-4× convo-lution speedup over our baseline GPU implementation at the expense of 2.7-7%loss in accuracy. When STAN is applied to Hyperband [66] which can be robustto drop in accuracy, an overall training speedup of 7%-16% can be achieved witha minimal change in test error (±0.2).iiiLay SummaryConvolutional Neural Networks play an essential role in today’s computer visionapplications such as object detection, face recognition, image classification, anddocument analysis. However, to solve such convoluted tasks, one requires hugedata and computational resources. Fortunately, more data is available due to theworldwide usage of the internet but the lack of appropriate processors to computeseveral computations hinders the progress. Graphics Processing Units (GPU) isone such accelerator which can perform several computations in parallel and isideal for applications such as neural networks. One way to increase the complexityof a model is by adding more layers to the network which in turn demands morecomputation and hence more training time.In this thesis, we aim to solve the issue of excessive training time by developinga method which skips a few selected computations and has shown to achieve asubstantial convolution speedup on GPUs.ivPrefaceThis dissertation is based on a research project conducted by myself under thesupervision and guidance of Professor Tor M. Aamodt. I assisted with definingthe problem space and was responsible for deriving mathematical solutions, iden-tifying challenges within this problem space, and designing and optimizing thealgorithm to evaluate the proposed idea. I also conducted the experiments and col-lected all the data represented in this dissertation, except for designing our baselinekernel in section 5.6. The development of this kernel was done by collaboratingwith James Asefa with my assistance under the supervision of Professor Aamodt.Prof. Tor M. Aamodt provided valuable guidance and directions in identifyingthe research problems, developing solution methodologies, and documenting theresults.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . 21.2 GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2.1 Architecture of NVIDIA GPU . . . . . . . . . . . . . . . 41.2.2 Control flow in GPUs . . . . . . . . . . . . . . . . . . . . 51.3 Memoization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.4 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . 61.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.6 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10vi2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.1 Input or Data layer . . . . . . . . . . . . . . . . . . . . . . . . . 132.2 Convolution layer . . . . . . . . . . . . . . . . . . . . . . . . . . 132.3 Activation function . . . . . . . . . . . . . . . . . . . . . . . . . 142.4 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.5 Fully connected and loss layers . . . . . . . . . . . . . . . . . . . 152.6 Enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.6.1 Local Response Normalization . . . . . . . . . . . . . . . 162.6.2 Batch Normalization . . . . . . . . . . . . . . . . . . . . 182.7 CNNs on GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 Approximate Memoization . . . . . . . . . . . . . . . . . . . . . . . 255.1 Neuron Activation Correlation . . . . . . . . . . . . . . . . . . . 255.2 STAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275.3 STAN-corr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295.4 Focusing learning to reduce storage . . . . . . . . . . . . . . . . 315.5 STAN-lite and STAN-last . . . . . . . . . . . . . . . . . . . . . . 335.6 GPU Implementation . . . . . . . . . . . . . . . . . . . . . . . . 346 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . 366.2 Null hypothesis and 2-bit STAN-lite . . . . . . . . . . . . . . . . 366.3 STAN-corr using Mega-batch . . . . . . . . . . . . . . . . . . . . 386.4 Comparison across implementations . . . . . . . . . . . . . . . . 406.5 Discussion: STAN-corr on ResNets . . . . . . . . . . . . . . . . 437 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447.1 meProp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447.2 Hyperband . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447.2.1 Alternatives to Hyperband . . . . . . . . . . . . . . . . . 46vii8 Conclusions and Future work . . . . . . . . . . . . . . . . . . . . . . 48Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50A Supporting Materials . . . . . . . . . . . . . . . . . . . . . . . . . . 61A.1 Proof of Equation 5.1 . . . . . . . . . . . . . . . . . . . . . . . . 61A.2 Proof of Equation 5.2 . . . . . . . . . . . . . . . . . . . . . . . . 62A.3 Hyperband search space . . . . . . . . . . . . . . . . . . . . . . 63viiiList of TablesTable 5.1 Effect on validation error of AlexLRN using CIFAR-10 for 10epochs. Here, Z = 1 and pt = 0 All the memory values reportedare calculated using pmap for CPUs and nvidia-smi commandfor GPU DRAM usage. . . . . . . . . . . . . . . . . . . . . . 35Table 6.1 Comparison of validation loss while using STAN on networkswith varying depths. Here random skip refers to convolutionaldropout [109] ratio of 0.5, “c1 STAN-last” denotes that STAN-last is applied to only first convolution layer and “r1 STAN-lite”refers to STAN-lite applied to only first convolution layer of allresidual blocks. For ResNets, errors were captured at epoch 60and rest at epoch 120. Default values for ResNet are θ = 0, forrest it is θ = 10, Z = 1, pt = 0, and STAN-last/STAN-lite isapplied to all layers. . . . . . . . . . . . . . . . . . . . . . . . 37Table 6.2 Comparison of wall-clock execution time of forward (fwd) andbackward (bwd) propagation for AlexLRN network for an epochaveraged over 10 instances. Here θ=10, Z=1, pt=0 and me-Prop’s k=30. Only the bars containing label “L1 reg” uses L1regularization and rest of them uses L2 regularization. . . . . . 42Table A.1 Hyper-parameter search space used in hyperband for severalnetworks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64ixList of FiguresFigure 1.1 Execution flow of all 32 threads of a warp. Here threadIdx.xdenotes a unique id associated with each thread in x-dimension. 5Figure 1.2 Comparison of auto-correlation of weights, dot-product andneuron activation outputs at time t vs. t−1 for first conv layerin ResNet-18. . . . . . . . . . . . . . . . . . . . . . . . . . . 8Figure 1.3 Layer-wise analysis of percentage of neurons with differentautocorrelation values for the same randomly chosen image.Here LeNet and Alex are trained for 8 and the rest for 120epochs. Low learning rate α refers to decrease from 0.01 to0.001. Here Alex is Caffe’s [51] cifar10 quick network and ckrepresents kth convolution layer (c). LeNet uses MNIST andrest of them used CIFAR-10 dataset. . . . . . . . . . . . . . . 8Figure 2.1 Structure of a typical Convolutional Neural Net. Figure isdrawn using NN-SVG tool [64] . . . . . . . . . . . . . . . . 12Figure 2.2 Convolution operation of an input and a weight tensor resultingin an output tensor. Here convolution operation is performedonly for a patch of an image whose tensor dimensions are samethat of weight dimensions. Figure is drawn using NN-SVGtool [64]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14Figure 2.3 Overview of Local Response Normalization. . . . . . . . . . 17Figure 2.4 Overview of Batch Normalization. . . . . . . . . . . . . . . . 18xFigure 2.5 Mapping the dimensions of input’s data layout format to con-volution process. Here, a patch of input image whose size isH f ×Wf is taken and a dot product and accumulation operationis performed to obtain the output of a single neuron. The figureis drawn using NN-SVG tool [64]. . . . . . . . . . . . . . . . 20Figure 5.1 Activation output difference on successive epochs versus prioractivation correlation for a single neuron-training sample pair.Here, neuron belong to AlexLRN’s convolution layer 2 whichis trained for 40 epochs using the CIFAR-10 dataset. . . . . . 26Figure 5.2 Lag-1 autocorrelation versus epoch for 100 neurons in Layer 2of Alexnet. . . . . . . . . . . . . . . . . . . . . . . . . . . . 28Figure 5.3 Histogram showing frequency with which neuron activationsachieve a given correlation for 100 neurons in Layer 2 of Alexnetduring training. . . . . . . . . . . . . . . . . . . . . . . . . . 28Figure 5.4 Here, let us assume a dataset has 6 mini-batches (from A toF) in total, Mega batch size (ζ ) = 2, skipping interval (Z) = 2.We see that, instead of allocating memory for all the imagesin dataset, STAN allocates memory only for a mega-batch andthereby saving memory. . . . . . . . . . . . . . . . . . . . . 33Figure 6.1 Analysis of STAN-corr on AlexLRN while varying ζ and θ .Here epochs trained = 40, NA is Not Applicable, Z=1 and pt=2. 39Figure 6.2 AlexLRN’s forward pass speedup while varying Z and ζ . Herefp16 STAN-corr is used, epochs trained = 40 and pt=2. . . . . 40Figure 6.3 Breakdown of execution times of several layers of AlexLRNnetwork in Caffe framework. . . . . . . . . . . . . . . . . . . 41Figure 6.4 Histogram of autocorrelation of all gradient updates of AlexLRN’slayer 1 weights during training for 40 epochs. . . . . . . . . . 41xiFigure 7.1 Analysis of Hyperband for CNNs with and without STAN. Un-less specifically stated, default values are pt = 0, Z = 1, θ = 10and meProp top-k = 30. ResNet-18 uses STAN-lite and rest ofthem use STAN-last. For series marked as ?, R = 171 elseR = 81. Here, a single point on graph indicates a bracket andthe total execution time reported is wall clock time which in-cludes forward pass, backward pass and Hyperband’s overhead. 46xiiList of AbbreviationsCNN Convolutional Neural NetworksCPU Central Processing UnitCTA Cooperative Thread ArraysCUDA Compute Unified Device ArchitectureDRAM Dynamic Random Access MemoryFP Floating PointGB GigaByteGPU Graphics Processing UnitINT IntegerKB KiloByteL0 Level ZeroLRN Local Response NormalizationMNIST Modified National Institute of Standards and TechnologyRELU Rectified Linear UnitRESNET Residual NetworksSIMT Single-Instruction, Multiple-ThreadxiiiSM Streaming MultiprocessorSTAN Staunch NeuronsxivAcknowledgmentsFirst and foremost, I am profoundly indebted to my parents for their unconditionallove and blessings. Without their continuous support and encouragement, I couldnot have completed this thesis. Mother, your optimism and love are the primarysources of all my previous, current and future achievements. Father, your conscien-tiousness and pragmatic approach to all the problems has inspired me a lot. Specialthanks to my brother who taught me to be curious, enthusiastic and confident.To my grandmother Katyayani who pampered me a lot and taught me to havefun apart from academics. I am always indebted to my grandfather Shanker Lingam,who taught me to dream big to achieve big.Second, I would like to thank my supervisor Professor Tor M Aamodt for hispatience, knowledge, and generous financial support. I thank him for providing mewith an excellent research atmosphere in his lab. This thesis would not have beenpossible without his support, guidance, and encouragement. I will remember andam deeply thankful for your wisdom and all the lessons you taught me during mythesis.Third, I am very fortunate and grateful to have excellent colleagues who of-fered me genuine and friendly support and assurance to carry out my research.I should particularly thank Francois Demoullin, Md Aamir Raihan, Dave Evans,Tayler Hetherington for their personal and professional support. I enjoyed work-ing with James Asefa and thank him for contributing his wonderful baseline kernelcode for my thesis.Finally, I would like to thank Praneeth Muppidi, Manoj Sharma, Surya KNR,Prithu Banerjee, Brighty Dutta and Rupsa Gupta for ever-lasting memories andwould cherish them forever.xvI also gratefully acknowledge the funding provided by the Natural Sciencesand Engineering Research Council of Canada (NSERC) and Computing Hardwarefor Emerging Intelligent Sensory Applications (COHESA) that made my researchpossible.xviChapter 1IntroductionThe quest for developing intelligent systems began several decades ago and itsfoundation was laid by Alan Turing. He introduced the architecture of moderncomputers and designed a test to measure the intelligence of a machine which ispopularly known as Turing’s test [105]. He proposed an imitation game where aperson studies and analyzes the responses from two agents or humans and must beable to differentiate the computer from that of a human’s responses. If the personcould not come up with the right conclusions, it indicates an intelligent machine.Even though the current generation computers have huge memory capacity andpowerful processors, none of the current machines passes Turing’s test. One no-table event which challenges people to tackle this difficult problem is the LoebnerPrize competition. This competition awards an amount of $100,000 to a personwhose program can pass unrestricted Turing’s test [23]. Nevertheless, some peoplecriticize this competition as a publicity stunt [96]. Designing intelligent systemswhich can pass Turing’s test is still an open and tough problem yet to be solved bymankind.Intelligence demonstrated by machines is popularly called “Artificial Intelli-gence.” One of the popular branches of artificial intelligence includes machinelearning whose aim is to learn a target distribution without being explicitly pro-grammed. Machine learning is further categorized into supervised learning, unsu-pervised learning and reinforcement learning. Supervised learning is a techniquewhere we input a few samples of the input distribution and output the learned dis-1tribution. Ideally, this learned distribution should be close to the true input distri-bution which is also called as “generalization.” In unsupervised learning, the algo-rithm is fed with all the available inputs and we expect it to draw conclusions andkey observations about the properties of the data. Finally, reinforcement learningdoesn’t involve feeding of input data but the machine itself is placed in a special-ized environment. Depending on the actions taken by the machine or the agent, theenvironment provides appropriate rewards. For a given task, the agent will decidewhether the rewards generated by the environment are good or bad and learn theconsequences of its actions accordingly. The same procedure is repeated until theagent has learned to do a given task accurately.AI had a good start in the early 60’s and a lot of research has been put intothis field which led to some of the landmark creations like Ferranti Mark 1 [72] byUniversity of Manchester where Dr. Dietrich Prinz programmed it to play chess.Another notable achievement was made by IBM’s Deep Blue computer [12] whichdefeated Garry Kasparov in 1997. Despite this success of solving problems whichwere considered hard for humans, AI was notorious for performing poorly on taskswhich are perceived as an easy task by humans such as recognizing images andwords [29]. These were few reasons for the cause of AI winter which made severalresearchers abandon it for a while. The boom of AI (especially deep learningand machine learning) has begun in early 2010 and currently, the field has maturedenough to beat humans at the Go game [98] and various other video games [24, 79].Convolutional Neural NetworksDeep Learning belongs to the category of supervised learning which is specializedin finding patterns or properties from the images and languages. Much of it’s in-spiration stem from an urge to replicate the brain. In 1959, Hubel and Wiesel [45]studied the retinas of monkeys and concluded that each portion of the visual cortexis responsible for detecting light, colors, orientations and ocular dominance. Firstcomputer model to mimic the brain cells was introduced as the Perceptron [89].Kunihiko Fukushima proposed the neocognitron in 1980 [27] which is capableof visual pattern recognition and could be regarded as the predecessor of Convo-lutional Neural Networks (CNN) [31]. LeCun et al. [58, 60] published a ground-2breaking paper which introduces multi-layer artificial neural network called LeNet-5 which could classify handwritten digits with the help of the backpropagation al-gorithm [40]. Similarly, backpropagation was also used by Zhang et al. [119] todesign a model named SIANN which can recognize characters from an image irre-spective of shifting and distortion of the images. The main drawback of the abovemodels is the use of smaller datasets consisting of a few thousand images. Oneimpediment for using large datasets must be lack of training data and computingpower at that time, which limits its use for real-world applications.In 2012, Krizhevsky et al. [55] propose a classic Convolutional Neural Net-works (CNN) architecture and show significant improvements upon previous meth-ods on the image classification task. The overall architecture of their method, i.e.,AlexNet is similar to LeNet-5 but with a few enhancements and deeper structure.This work boosted the computer vision field and the task of beating humans in im-age recognition became popular. Notable works in this area include VGGNet [99],ZFNet [118], GoogleNet [103], Residual Networks (RESNET) [38], DenseNet [43],ResneXT [112] and Wide Residual Nets [117].GPUDeep Networks like RESNETs consists of several hundred to thousands of layerswhereas networks like DenseNets have 25.6 million parameters (20×more param-eters than RESNETs) which require huge memory and computational resources.Considering the growth of datasets size (MNIST [60] has 60000 training images tothat of Imagenet [20] having 14 million images), it would take several months totrain these networks on traditional CPUs. This due to the CPU’s inherent natureof sequential execution which makes it impractical for deep learning applications.Special accelerators like Graphics Processing Units (GPUs) would be ideal for suchcomputationally hungry applications.CPUs execute the program sequentially with a lot of emphasis on control logicand cache systems to speed up the execution. In addition to this, they have a lim-ited number of cores or processing units to compute which limits the parallelism.GPUs are complementary to CPUs in such a way that they have several processingunits, fast DRAM memories, and exploits thread level parallelism. Current genera-3tion GPUs like NVIDIA’s Volta has around 21 billion transistors, 5120 processingunits (NVIDIA calls them CUDA cores), can achieve peak single precision per-formance of 15 TFLOPS and uses very fast memory whose bandwidth is around900GB per sec [82]. The CPUs and GPUs work together very well by exploiting thecomputational power of GPUs and control logic of CPUs.Architecture of NVIDIA GPUGPU consists of several computing cores collectively known as streaming multi-processors (SM) and multiple such SMs form a basic building block. Each SMconsists of several floating point, integer and tensor cores which are responsiblefor performing the computations.The programming model of GPUs consists of a host code and device (GPU)code (in CUDA language [81]) which are separated by NVIDIA compiler. Theuser specifies the number of threads to be launched which are arranged as a gridconsisting of several blocks or cooperative thread arrays (CTA). Each CTA consistof threads which can be arranged in 3 dimensions. The basic building block ofthe programming model is called a warp which is a combination of 32 threads.For example, Volta’s SM is partitioned into 4 blocks, each with 16 FP32, 8 FP64,16 INT32, 2 mixed-precision tensor cores, a dedicated L0 instruction cache, 64KBregister file, a dispatch unit, and warp scheduler. Volta can support around 2048threads per SM and 1024 threads per block.When the device code is launched, multiple CTAs are allocated to each SM de-pending on availability and capacity of the GPU. Here, several warps in a CTA sharethe processing units and shared memory. As GPU has several SMs and has multiplethreads to execute, a single instruction in device code is executed by several threadsin parallel leading to instruction and thread level parallelism. This is also popularlyknown as Single Instruction Multiple Threads (SIMT) execution [68].It is to be noted that this dissertation only uses NVIDIA GPUs but all observa-tions and the results of this thesis can be applied to other GPUs as well.4Figure 1.1: Execution flow of all 32 threads of a warp. Here threadIdx.xdenotes a unique id associated with each thread in x-dimension.Control flow in GPUsWhen the device code or a kernel is launched, a single instruction is being executedby all the threads of a warp in parallel. However, if any one of the threads doesn’texecute an instruction, we encounter a problem called warp divergence [68]. Tounderstand it better, we need to look into how a GPU thread executes an instructionin detail. Figure 1.1 shows how GPU deals with branching instructions. For eachconditional instruction, the compiled code has a small flag namely predicate flag(p) which denotes whether that instruction has to be executed or not. From thefigure, we see that as first instruction is not a conditional statement, all 32 threadsof a warp execute it in parallel. All the threads compute the predicate value p inparallel and based on it’s p value, the thread will execute the appropriate instructionand skip the rest. In our example, all the threads whose unique id is less than 17 willtake if branch and all threads whose id is greater than 16 will take else branch. Aspartial number of threads of a warp take if branch and rest take else branch, GPUserializes the execution of if-else branch leading to a slow down. This is calledwarp divergence. In order to achieve the highest performance, it is in the user’sinterest to avoid warp divergence.5MemoizationIf a big problem can be divided into several sub-problems and if each of such sub-problem is repeated several times, instead of computing them again, we can employa table which stores the output of that sub-problem (caching) and read it’s outputwhen the same sub-problem is encountered later. This is called memoization and itcan decrease execution time by caching the output of redundant tasks. One classicexample would be calculating the factorial of a number. For a given number n, itsfactorial is calculated as n!= n∗(n−1)∗(n−2)...2∗1. We can see that ∀ n>1, onewould require calculating the factorial of 1. In the same way, ∀ n>2, one wouldneed the factorial value of 2 and 1 to calculate its factorial. In general, ∀ n>k, weneed all the factorials from 1 to k. Computing these values for all the numberswould slow down the execution. To overcome this problem, memoization employsa table which stores the factorial values of numbers from 1 to k. If we need tocalculate factorial of any number, instead of computing everything from scratch,we retrieve the value of factorials of smaller numbers from the table and continuecalculating output for the rest of the numbers. By skipping the computations in thisway, we can speed up the execution.Outline of the thesisWith recent advances in accuracy, deep neural networks (DNNs) have found prac-tical use in a growing list of applications including self-driving cars, speech recog-nition, text translation, document analysis, and medical imaging. An impedimentto discovering better networks is the large amount of computing resources requiredto train them. To address this concern, several approaches have been introduced:Krizhevsky et al. [55] demonstrated that graphics processing units (GPUs) couldbe used to improve performance versus CPUs. This improvement can range froma speedup of 4× to 50× [26, 63, 78, 101]. Responding to growing demand forGPUs for training machine learning algorithms, NVIDIA recently introduced spe-cialized hardware units called tensor cores along with improved DRAM bandwidthvia High Bandwidth Memory (HBM) and higher bandwidth to remote GPUs viaNVLINK [3].In this dissertation, we explore the potential of employing memoization [76]6to reduce the training time of DNNs. As originally conceived in the Section 1.3,memoization is a software technique in which, if the inputs of a function do notchange, the evaluation of that function can be short-circuited by employing tablelookup. Thus, we need a metric to quantify the repetition of sub-problems (neu-ron outputs in CNN case) such that we can employ memoization table to store theoutput of neurons. Lag-1 autocorrelation is one such metric to measure the repeti-tiveness of neuron outputs during training. We find there are three key challengesto applying memoization in the context of training DNNs. The first is the degreeto which such repetition of function inputs occurs during training of neural net-works. The second is the potentially large storage overheads. The third challengeis exploiting the potential for memoization on high-performance hardware.With respect to the first challenge, we find that the value of a weight parameterw at the end of epoch t, w(t), tends to be similar to the value of the same parameterat the end of the prior epoch, w(t− 1). This correlation results from the gradualmanner in which stochastic gradient decent changes parameters. When presentedwith the same network input, the small changes in parameters tend to result inneuron activations that are correlated. If the neuron outputs are then passed througha nonlinear unit which has slowly changing outputs over some input domain, thenthe output of that neuron will have greatly diminished sensitivity to changes ininput parameters when the input to the nonlinear unit is in that domain. Figure 1.2illustrates these points using the first layer of ResNet-18 [38]. This figure plotsthe lag-1 autocorrelation of the weights, neuron dot product and post non-linearityactivation output using the CIFAR-10 dataset [54]. For a discrete time series x(k)with mean µx, we compute the lag-1 autocorrelation using∑k(x(k)−µx)(x(k−1)−µx)∑k(x(k)−µx)2(1.1)We consider the value of an individual weight or neuron over all epochs andaverage their lag-1 autocorrelations over all weights or neurons in a given layer.The figure shows that at any given epoch, 99% of the weights have lag-1 auto-correlation exceeding 0.5. Similarly, on any given epoch, 61% of the neuron dotproducts and 45% of the neuron output activations after the ReLU non-linearityhave lag-1 autocorrelation exceeding 0.5. Figure 1.3 illustrates that across several7Figure 1.2: Comparison of auto-correlation of weights, dot-product and neu-ron activation outputs at time t vs. t−1 for first conv layer in ResNet-18.Figure 1.3: Layer-wise analysis of percentage of neurons with different au-tocorrelation values for the same randomly chosen image. Here LeNetand Alex are trained for 8 and the rest for 120 epochs. Low learning rateα refers to decrease from 0.01 to 0.001. Here Alex is Caffe’s [51] ci-far10 quick network and ck represents kth convolution layer (c). LeNetuses MNIST and rest of them used CIFAR-10 dataset.8convolution layers and workloads, on average of 62% of neurons have a correlationof at least 0.5 from one epoch to the next. This figure also illustrates that acrossseveral convolution layers, the degree of correlation is impacted by learning rate(low-α means learning rate is reduced by a factor of 10), increased by using batchnormalization [47] (AlexBN) or layer response normalization [55] (AlexLRN) ver-sus the baseline implementation of Alexnet in Caffe (Alex) which does not employnormalization techniques. We believe the correlation is low for LeNet [61] becauseit converges very quickly. We explore the impact of approximating the outputs ofneurons predicated to have less sensitivity to small changes in weights by a mem-oized copy of the neuron output recorded during a recent epoch. We call suchneurons as (STA)unch (N)eurons which are loyal and won’t change their behaviorwhen fed with the same input the next epoch.The second challenge, storage overheads, can be mitigated by storing only ap-proximate correlation information. For example, for the ReLU activation function,if the input is negative and changing slowly across epochs, then the output will bezero. Storing this information requires only one bit per neuron per input image. Apotential extension of our approach would also reduce the overhead of tracking bysubdividing the dataset into large groups of mini-batches and applying memoiza-tion only within a group. We call this group of mini-batches a mega-batch and theapproach to using mega-batches is discussed in Section 5.4.The third challenge, hardware overheads, results from the way CNN frame-works make use of computation resources on GPUs. On modern GPUs, individ-ual scalar threads execute in lock-step on the hardware using a technique knownas single-instruction, multiple thread (SIMT) execution [68]. Libraries such ascuBLAS [1], cuDNN [15] and CUTLASS [2] obtain high throughput on SIMThardware by computing the result of an individual neuron using individual threads.Hence, if we attempt to skip computation of any given individual neuron usingmemoization, there is no guarantee performance will improve. To address this, weexplore the potential of evaluating a neuron output by spreading it’s computationover all threads of a warp.9ContributionsThe contributions of this dissertation are:• It shows that by averaging across several networks, 62% of neurons have lag-1 autocorrelation above 0.5 during training for a randomly selected image.• It introduces a new technique, called STAN, which skips the computation ofa neuron based on that neuron’s prior behavior in previous training epochs.By doing so, STAN achieves a convolution kernel speedup of 1.3-4× overour baseline GPU implementation at the expense of 2.7-7% loss in accuracy.• It shows that STAN combined with meProp [102] achieve an overall trainingspeedup of 2.2×.• Finally, averaging across several networks, it shows that STAN acceleratesHyperband’s wall-clock search time by 13% and when combined with me-Prop, an acceleration of 43% is achieved with ±0.2 change in the validationloss.OrganizationThe rest of this dissertation is organized as follows:• Chapter 2 details the internal workings of a Convolutional Neural Networkand it’s basic building blocks. It then provides a detailed background on howCNNs utilize the power of GPUs to perform their computations.• Chapter 3 discusses the motivation behind the thesis statement.• Chapter 4 discusses related work.• Chapter 5 introduces the required terminology and approximate memoiza-tion techniques.• Chapter 6 evaluates STAN on CPUs and GPUs and reports speedup values.• Chapter 7 discusses about two applications of our work.10• Finally, Chapter 8 discusses directions for future work and concludes thisdissertation.11Chapter 2BackgroundConvolutional Neural Network (CNN) is a supervised learning model which is pop-ularly used for image classification and object detection. It can be seen as a combi-nation of several layers stacked together with few enhancements such as non-linearactivation functions, pooling, and normalization techniques. The general architec-ture of CNN is shown in Figure 2.1. From the figure, we can see that a lot oftransformations are being done across several layers which will help us extractuseful information from the inputs. We shall discuss CNNs and their basic buildingblocks in detail in the below sections.Figure 2.1: Structure of a typical Convolutional Neural Net. Figure is drawnusing NN-SVG tool [64]12Input or Data layerThis is the first layer of a CNN where inputs (can be images, text etc.) are fed.Current datasets include millions of images and can be a time-consuming task ifthey are fed sequentially. To overcome it, a group of images (known as mini-batch)is fed to the network at once. Accelerators such as GPUs can perform severalcomputations in parallel which makes the concept of mini-batches attractive andfeasible.It is often desirable to have as many inputs as possible which make CNN robustto the noise in the dataset. In other words, having more data improves the learn-ing of the network and prevents over-fitting. Popular technique to obtain moredata is data transformation techniques. Some of these include image translationsand horizontal reflection which were found to be helpful in preventing overfitting[55]. Other data augmentation techniques include sampling, mirroring [13], rotat-ing [111], shifting [93] and various other photometric transformations [22, 99].Convolution layerThe fundamental block of a neural net includes a neuron whose inspiration stemsfrom the neurons of a human brain. A neuron output involves few multiplicationsand an accumulation. To be precise, the convolution layer uses a filter which is aset of values arranged in 2-D or 3-D space. Then a patch of input image whosedimensions are similar to that of the filter is taken and an element-wise dot prod-uct operation is performed on filter values (called weights) and the input. Later,an accumulation operation is performed over all such dot products resulting in ascalar value which is the output of convolution operation of a single neuron. Thisoperation is best illustrated in Figure 2.2. This process is repeated several timesuntil we perform the convolution operation on all pixels of the image leading to anoutput tensor. It consists of several neurons each of which is specialized in learningsome representation of the input (called as features).It is to be noted that the same filter is being used for all the input patches andthe distance with which the filter moves from one patch to the next is called stride.In order to preserve the information at the corners of an input image, padding isperformed. Padding is a technique where typically a few zeros are being appended13Figure 2.2: Convolution operation of an input and a weight tensor resultingin an output tensor. Here convolution operation is performed only fora patch of an image whose tensor dimensions are same that of weightdimensions. Figure is drawn using NN-SVG tool [64].across all dimensions of the input tensor. By repeating the same convolution oper-ation across several layers, a neuron can learn complex features. Zeiler et al. [118]have observed that neurons in earlier (or shallow) layers tend to learn simple fea-tures like corners and other edge/color conjunctions and as we go deeper into thelayers, neurons capture complex features like a cat or a dog.Activation functionActivation functions introduce non-linearity in the network which aids the neuronsto learn complex features. While there are many activation functions, the mostpopular one is Rectified Linear Units (RELU). Mathematically, output of RELU canbe defined as max(0,x) where x is input to the RELU. Generally, we called neuronas turned on if the output of RELU is greater than zero and the neuron is turned off ifthe output of RELU is zero. Krizhevsky et al. [55] have observed that deep convolu-tional neural networks with RELUs train several times faster than their equivalentswith tanh units. In addition to it, He et al. [37] propose a generalized version of14the traditional rectified unit which they call as Parametric Rectified Linear Unit(PReLU). They show that it improves model fitting with nearly zero extra compu-tational cost, minimal risk of overfitting and achieved better accuracies. However,this dissertation only uses RELU in all of the experiments.PoolingAfter the input is passed through the convolution layer and an activation function,it is fed to the pooling layers to reduce the number of features. Few typical poolingoperations include average and max pooling. As image dimensions are usuallylarge, we end up with many features which may result in problems like over-fitting.Pooling layer generally takes a patch of features and perform operations such asaverage (Average pooling) or obtain only the maximum value (Max-pooling) ina given input patch. The intuition here is to reduce the number of features ordimensions while retaining all the important features required for learning. Despiteits advantages, recent research has shown that the pooling layers can be replacedby a convolution layer with increased stride without loss of accuracy [100].Fully connected and loss layersConvolution layers use the same filter values to sweep across all input patches.This is called parameter sharing and is very useful to reduce the number of param-eters (weights) in the network. However, fully-connected layers consist of severalneurons where a single neuron is connected to all next layer neurons each with aunique weight. In other words, their structure is similar to a complete bipartitegraph with no parameter sharing. After several convolution and pooling layers,we feed the output to fully-connected layers whose aim is to perform high-levelreasoning [118]. In other words, it is useful to generalize all its input values andcreate some generalized formulae internally (fitting the input data points) which isuseful in prediction.Finally, the last layer of CNN is an output layer which typically uses softmaxoperation for multi-class classification. Softmax calculates the probability withwhich an input belongs to a certain class. For example, consider a K-dimensionalvector z and we apply a softmax operation to produce a new K-dimensional vector15σ(z) of values ranging from 0 to 1. The softmax operation is given byσ(z j) =ez jK∑k=1ezk;∀ j ∈ [1,K]. (2.1)We can take the logarithm of Equation 2.1 to obtain the cross-entropy loss. Ina typical CNN, we try to minimize the loss function value by adjusting the param-eter (weight) values using back-propagation [90–92]. Other popular loss functionsinclude negative log-likelihood and mean squared loss error.EnhancementsIn this section, we discuss a few important techniques which are widely used inCNNs in general and in our thesis. Two of them include Batch Normalization [47]and Local Response Normalization [55].Local Response NormalizationThe neurons in the brain exhibit a property known as lateral inhibition. This isa phenomenon where an excited neuron reduces the activity of its neighboringneurons. This creates a contrast in that particular area and improves the percep-tion. Motivated by above observations, Jarrett et al. [50] introduced local contrastnormalization which normalizes feature values by subtracting with the mean anddividing by the variance of its neighboring features. They argue that by doingsuch operation at a neuron level, a sort of local competition is enforced amongneighboring neurons and thus helps in achieving better accuracies. Later, Alexnetintroduced local response normalization (LRN) and surpassed the state-of-the-artmodel accuracies of that time. The only way local response normalization differfrom local contrast normalization is that the mean subtraction operation is omitted.Mathematically speaking, for a total of N kernels in a layer, LRN is calculated aslrnix,y =aix,y(c1+αmin(N−1,i+n/2)∑k=max(0,i−n/2)(akx,y)2)β(2.2)16ReLU activation functiona(i, x, y)Convolution with filterLocal response Normalization lrn(i, x, y)kernels/filtersNext layer Feature MapsInput Feature MapFigure 2.3: Overview of Local Response Normalization.Here, aix,y denotes the value of a feature map i at coordinates (x,y), lrnix,y de-notes the lrn output value of a feature map i at coordinates (x,y), c1,α,β representsthe constants determined using validation set and n represents number of adjacentkernels over which the lrn operation is done. Pictorial representation of LRN oper-ation can be found in Figure 2.3.Other similar works are as follows: inspired by the properties of simple cellsof the first primate cortical visual processing stage, Pinto et al. [86] observed thatby normalizing the values locally and using Gabor filters, the model outperformedthe contemporary state-of-the-art models. Lyu et al. [71] introduced Divisive Nor-malization which makes the learning robust to noise. This was inspired by thenon-linearities in the behavior of cortical neurons [28, 41].17BNBNBNmini-batch of imagesConvolution with filterkernel/filterReLU activation functionBatch NormalizationFigure 2.4: Overview of Batch Normalization.Batch NormalizationIoffe et al. [47] argue that during training, the learning distribution of neuronschanges continuously. If these outputs are fed as inputs to the next layer, nextlayer neurons have to accommodate those changes in the previous layer distribu-tion. This hampers the learning progress and is one of the cause for difficulties intraining deep neural networks. The authors called this as internal covariate shiftand propose batch normalization to address it. Figure 2.4 depicts the flow of batchnormalization. The technique involves computing the mean and variance of anactivation across all the images in a mini-batch and then normalized as follows:µmb =mb∑i=1ximb(2.3)18σ2mb =mb∑i=1(xi−µmb)2mb(2.4)BN(xi) = γ ∗ xi−µmb√σ2mb+ ε+β (2.5)Here, for a mini-batch size mb and given activation value xi, ∀ i ∈ [1,mb], µmbdenotes the mean across the mini-batch, ε is a very small constant, σ2mb denotes thevariance and BN(xi) is output of batch normalization. γ,β are learnable parame-ters. By applying batch normalization, the authors achieved 4.9% top-5 validationaccuracy and surpassed the humans.CNNs on GPUCNNs are such applications which require billions of multiply and accumulate op-erations [14, 34]. When executed on CPUs, it takes several days to years to trainthese networks as the CPU have limited parallelism. One can observe that opera-tions like convolution involve several dot products all of which are independent ofeach other and can be executed in parallel. As GPUs are better suited for perform-ing several computations in parallel, we can compute these dot products on GPUs.libraries.As GPUs can execute several computations in parallel, we generally feed mul-tiple images at once (named as mini-batch) instead of feeding a single image at atime. This is because all the images in a mini-batch are independent of each otherand don’t depend on other image’s results. Hence, for a set of images whose mini-batch size is N, H as the height of an image, W as the width of an image and C asthe number of channels, 2 most common data layouts used to store a mini-batchare NCHW and NHWC. Assuming a row-major order for storing values in thememory, NCHW represents a matrix whose dimensions are N×CHW where eachrow consists of all the pixel values of an image across all the channels (CHW).Generally, NCHW format is efficient on GPUs because a single row of an inputmatrix consists of all the pixels of a single image which are in contiguous memorylocations and will aid in coalesced memory access.From the Figure 2.5, we see that an element-wise dot product is being per-19Figure 2.5: Mapping the dimensions of input’s data layout format to convo-lution process. Here, a patch of input image whose size is H f ×Wf istaken and a dot product and accumulation operation is performed to ob-tain the output of a single neuron. The figure is drawn using NN-SVGtool [64].formed between filter and input values and the same operation is repeated acrossall the channels of an image and across all the images in a mini-batch in parallel.As computations across images in a mini-batch are independent, we can take theinput tensor of format NCHW and multiply values in a row to that of a column ofa filter matrix stored in CHW×N format. In this way, we multiply all pixel valuesof an image with all the pixel values of the corresponding filter to give a scalarneuron output value. Thus, we can convert the convolution operation into matrixmultiplication operation and execute them efficiently on GPUs.20Chapter 3MotivationModern CNNs have thousands of layers at the expense of huge training time. As thenumber of layers increase, the number of parameters and computations increases.This makes deeper nets rely on GPUs and other accelerators. One extreme examplewould be the work by Shazeer et al. [95] where they have designed a sparsely-gated mixture-of-experts model which has up to 137 billion parameters and tookthem several weeks to train the network on a cluster of GPUs. Other big scale ex-amples include AlphaGo [98] which was trained for several weeks with 50 GPUs tobeat a top-class professional Go player. In addition to it, as we update our trainingset continuously, we need to train our network with these new inputs periodicallywhich is very time-consuming. Hence, accelerating training process is very impor-tant and is the main focus of this dissertation.Mank et al. [73] have studied the chronic recordings of a single-cell activitywhich is crucial for a full understanding of how individual neurons change theirfunctional properties during development and learning or as a consequence of adisease process. To achieve this feat, they have extracted calcium binding proteintroponin C (TnC) from a mice’s skeletal and cardiac muscle, performed mutage-nesis of selected amino acids residues and finally developed a new synthetic dyecalled TN-XXL. This dye has been injected into mice to study the behavior of ori-entation selective neurons and sensory-evoked activity from the same individualneurons in the visual cortex during consecutive imaging sessions over days andweeks. The authors observe that repeated imaging with a particular orientation21did not cause any change in neuron behavior over the course of several days andweeks and have concluded that stimulus selectivity remained highly stable overtime. Similar behavior has been seen in other parts of brain like motor cortex[46, 74, 85], hippocampal region [104] and sensory cortex [17].This biological observation and huge training time overheads have motivated usto pursue in the direction of accelerating training. From our experiments, we haveobserved that some neurons exhibit similar behavior when fed with the same inputduring training. It led us to exploit this property by employing the memoizationtechnique and skip computations. We achieved a substantial convolution speedupduring the CNN training by employing memoization.22Chapter 4Related WorkPruning the network has become one of the most efficient mechanisms to reducethe parameters and compress the network with minimal loss in accuracy [32, 35].Some of them include pruning individual weights [36, 59], group-wise pruning[115], pruning channels [39, 70], feature map, kernel and intra-kernel pruning [7].Using pre-trained models, pruning was done by training a reinforcement learningagent to evaluate the importance of a channel [67] or by removing neurons with lowscores [116]. The above approaches either focus on reducing parameters, transforma trained model to a compressed model, or need retraining for additional epochsafter pruning to maintain accuracy.Others have introduced penalties like group-lasso regularization [107], group-sparsity regularization [57] and penalty on scaling factors [44] to achieve speedup.Accelerating inference for arbitrary sparsity patterns was done in [83] whereasauthors in [42] observed that some neurons are always inactive and pruned them.Our work differs in the following way: first, our focus is on accelerating thetraining process . Second, as the neurons learn continuously during training, wedo not suppress but retain their original connections. Third, we show that STAN isindependent of sparsity patterns, can deal with dense and sparse matrices and alsoorthogonal to sparsity-inducing techniques.Another interesting direction to achieve speedup include quantization wherea reduced precision is used instead of full precision for the network parameters.Notable works in this area include the WAGE technique [110] which uses low23bit-width integers for weights, activations, gradients, and errors. Several machinelearning frameworks have also exploited half-precision computations to achievespeedups. This has been taken to the extreme by using only one bit or binaryweights and activations and have achieved substantial speedups [18, 19]. Binary-Weight-Networks have used binary values for weights whereas XNOR-Networkshave used binary values for both weights and inputs [88]. By doing so, the authorshave claimed a speedup of 58× for convolution operations and 32× memory sav-ings with negligible accuracy loss. Similar works include TTQ [120] and TWN[65] which uses low precision for weights to compress the network.SnaPEA [113] achieves speedup on pre-trained models by reordering all of theneuron’s positive and negative weights. They first compute dot products only forpositive weights and then start computing negative weights only if the resultantaccumulation is positive. At any given time, if the result turns out to be negative,they skip computation for rest of the negative weights and thus achieve consider-able speedup on their custom designed accelerator. Similar work include Cnvlutin[4], where they identify zero valued inputs and skip their computations which re-sulted in a speedup of 1.55× without any loss of accuracy.Recent works have employed dynamic routing techniques where the decisionis taken at the end of each layer to route the computation to the next layers [75].FreezeOut [11] has achieved speedups by introducing a user-defined timer for eachlayer in the network. The learning rate decreases based on user specified sched-ule and once the timer expires, that particular layer is frozen and is used only forinference. Zoneout [56] also implements memoization technique in RNNs whereinstead of turning off some activations, they randomly replace some activationswith activations from previous epochs. This technique acts as a regularizer forRNNs and has shown improvements in model performance. Shomron et al. [97]observed that in pre-trained networks, neighboring neurons tend to have similarvalues (which they call as spatial correlation) and employ value prediction [69]technique to skip the computations.24Chapter 5Approximate MemoizationThis section quantifies the degree to which past correlation is associated with smallchanges in neuron outputs when a network is presented with the same input. Itthen provides an outline of STAN, our algorithm for exploiting this correlation viamemoization.Neuron Activation CorrelationIn this section, we define “activation” to mean the sum of products over the neu-ron’s weights and the post-nonlinearity activation feeding into these weights.The questions we seek to answer are (a) whether past behavior can be used topredict future behavior for neuron activations during the training process and (b)to what extent neurons have sufficient correlation such that prediction may yieldperformance benefits.To help provide insight into the first question, Figure 5.1 is a scatter plot show-ing the change in activation between the current epoch and the prior epoch whenthe network is presented with the same training input (vertical axis) versus the cor-relation in that neuron’s activation up to and including the prior epoch but not thecurrent epoch (horizontal axis). The data are collected for a single neuron in thesecond layer of Alexnet while trained using layer response normalization for 40epochs using the CIFAR-10 data set. The top graph is for a network trained with amomentum of 0.9 while the bottom graph is for a network trained with a momen-25Figure 5.1: Activation output difference on successive epochs versus prioractivation correlation for a single neuron-training sample pair. Here,neuron belong to AlexLRN’s convolution layer 2 which is trained for40 epochs using the CIFAR-10 dataset.tum of 0. The graph shows that when a neuron’s activation is highly correlated,the activation output is indeed closer to the immediately prior value, as expected.For the data plotted in these figures, the median difference in the activation valuesin successive epochs is 1.53 with high momentum and 0.48 with low momentum.For the high momentum case, the average correlation is 73% when the activationdifference is less than the median and 37% when the activation difference is morethan the median. Similarly, for the low momentum case, the average correlationis 81% when the activation difference is less than the median and 51% when theactivation difference is greater than the median. Thus, higher correlation up to thecurrent epoch is associated with a relatively smaller activation difference in the26next epoch.To address the second question, Figure 5.2 plots the lag-1 autocorrelation val-ues for successive activations for 100 neurons from the second layer of Alexnettrained using momentum set to 0.9 and Figure 5.3 presents the same results as ahistogram over correlation values. We find 41% of neuron activations have a cor-relation above 0.7 and 73% have a correlation above 0.5. From the perspective ofoptimizing training time, we see that a non trivial fraction, 41%, of neurons haveactivations with lag-1 correlation above 0.7. Computing a neuron’s activation re-quires a dot product between the weights and the activation outputs from the priorlayer. When combined with the observation above that the average correlation ishigher when the activation difference in the next epoch is small (below the me-dian), one can observe there is potential to improve performance if we can use theold values of the activations instead of computing them.Much research has found that DNN classifications for previously trained net-works can tolerate significant reductions in precision and recent work has exploredreduced precision during training [33, 77]. Motivated by this and the observationsabove, we explore the potential impact of using memoized copies of activationoutputs during training to help avoid the expensive dot product recomputation ofneuron outputs. We propose to do this when their correlation so far suggests thenext output for a given training input will be similar to their current output for thatsame training input.STANIn this section, we describe STAN, our proposed approach to exploiting the rep-etition that occurs in computation between epochs during training of DNNs. Webegin by outlining the basic approach employed by STAN, and then elaborate uponour refinements. So far we have implemented our memoization approach only forthe forward pass, but there may be potential to apply it elsewhere.To exploit the potential for computation savings noted above, for a given train-ing input, STAN records the activation output of all neurons. To determine whetherto use the recorded values, STAN must compute or estimate via some proxy, thecorrelation of past activations.27Figure 5.2: Lag-1 autocorrelation versus epoch for 100 neurons in Layer 2 ofAlexnet.[0.00, 0.05](0.05, 0.09](0.09, 0.14](0.14, 0.18](0.18, 0.23](0.23, 0.28](0.28, 0.32](0.32, 0.37](0.37, 0.41](0.41, 0.46](0.46, 0.51](0.51, 0.55](0.55, 0.60](0.60, 0.64](0.64, 0.69](0.69, 0.74](0.74, 0.78](0.78, 0.83](0.83, 0.87](0.87, 0.92]Lag-1 Activation Correlation Between EpochsFrequencyFigure 5.3: Histogram showing frequency with which neuron activationsachieve a given correlation for 100 neurons in Layer 2 of Alexnet duringtraining.28Algorithm 1 provides a simplified skeleton for the various versions of STAN,called STAN-corr, STAN-last and STAN-lite, that we elaborate upon below. Notshown in this figure is the use of mega batches to reduce memory size, which wedescribe in Section 5.4.During the first pt epochs, STAN trains the network without using any mem-oized values. Thereafter, every Z epochs, we repeat the following process. Onthe first C out of Z epochs, we compute the activation as normal on Line 8 and,for STAN-corr, update our correlation estimate on Line 10. On the last of theseC epochs, we test whether memoization should be applied during the next Z−Cepochs (Lines 13 to 25). For each neuron i in layer l and for image k, these lines seteil(k) equal to θ when it is determined that memoization should not be applied, andset eil(k) to the saved activation value when we determine memoization would bebeneficial–i.e., the neuron is classified as “staunch”. For the next Z−C epochs onLines 27 to 31, we use the memoized value if the neuron was classified as staunch.Here all eil(k) belong to an entry matrix E which serves as a memoization look-uptable.In theory, the speedup achieved while training a network for N epochs withSTAN is given byN(t f + tb)N(t f + tb)− N−ptZ+1 (Zti− te)(5.1)where execution time per epoch for forward pass is t f , backward pass is tb, calcu-lation of E is te and ti is the time saved by STAN. Here speedup is guaranteed onlyif Z ∗ ti > te. Proof can be found in appendix Section A.1.In the following, we elaborate the details of 3 variants of STAN namely STAN-corr, STAN-last, and STAN-lite.STAN-corrSTAN-corr determines whether to skip neuron dot product calculations by comput-ing the lag-1 correlation of the activations for a given neuron and input combinationacross successive epochs. Computing this correlation by applying Equation 1.1 di-rectly would require storing all activations for all epochs. The storage requirementsgrow linearly with the number of epochs which is not feasible for large networks29Algorithm 1 STAN Pseudo-code1: Input: Training set of T images, max epochs, technique used S, threshold (θ ),skipping interval (Z), correlation measurement interval (C), pre-stan interval(pt), L is number of layers and m is number of neurons in layer.2: Initialize epoch=0, u=0, entry matrix E to 0.3: for epoch≤ max epochs do4: while ∀ l ∈[1,L], i ∈[1,m], k ∈[1,T ] do5: if epoch < pt then6: train the CNN normally without STAN.7: else if pt +u∗Z ≤ epoch < pt +u∗Z+C then8: yil(k) = Wil ·X il (k)9: if S = STAN-corr then10: update ACil(k) using Equations 5.2, 5.3 and 5.411: end if12: if epoch = pt +u∗Z+C−1 then13: switch (S)14: case STAN-corr:15: if ACil(k)< θ then16: eil(k) = θ17: else18: eil(k) = yil19: end if20: case STAN-last: eil(k) = min(yil(k), θ )21: case STAN-lite:22: if yil(k)< θ then eil(k) = 0 else eil(k) = 123: end switch24: u=u+125: end if26: else27: if eil(k) == θ then28: yil(k) = Wil ·X il (k)29: else30: yil(k) = eil(k)31: end if32: end if33: Back-propagation: Update all the weights.34: end while35: epoch = epoch+136: end for30combined with large training data sets.We compute the lag-1 autocorrelation in Equation 1.1 by using the mean andvariance of activations across epochs as shown in Equation 5.2:ACn =An−1+A1+A2+A3Sn−1+ n−1n (xn−µn−1)2(5.2)Note that to simplify the discussion, we omit indices for neuron and training in-put. In Equation 5.2, ACn denotes the autocorrelation after n epochs, An denotesnumerator of Equation 1.1, A1 = (n− 2)( xn−µn−1n )2, A2 = (xn−1− µn) ∗ (xn− µn),A3 = xn−µn−1n (xn−1 + x1− 2 ∗ µn−1), µn denotes mean of the first n activations, Sndenotes sum of squared deviations from the mean (Equation 5.4). The detailedderivation of ACn starting from Equation 1.1 is shown in appendix Section A.2.We compute µn and Sn using an online approach proposed by Welford [106]given by:µn =n−1n∗µn−1+ xnn (5.3)Sn =n∑i=1(xi−µn)2 = Sn−1+ n−1n (xn−µn−1)2 (5.4)Instead of storing n past activations for each combination of neuron and input,the above online approach requires storing only two values: µn and Sn for eachcombination of neuron and input. We have found that the precision used to storeµn and Sn can be reduced to reduce the storage overheads due to these components.Specifically, we found little change in the loss when using 16-bit floating-point forstoring µn and Sn which is discussed in Section 6.3.Focusing learning to reduce storageWhile the use of online computation reduces storage overheads, for large datasetsthe storage overheads are still considerable. We next consider how to reduce theseoverheads by adding additional structure to the way training examples are pre-sented to the network. The intuition behind this approach follows from the wayhumans learn in schools, which at least traditionally have been organized around31classes focused on particular subjects. With a given class, a student focuses onlearning a given subject well and for the time they are present in that class, theytend to ignore their other classes. This type of learning is similar to Curriculumlearning [8].We adopt a somewhat similar strategy by dividing up our training set of B mini-batches of examples into M groups each containing ζ = B/M mini-batches. Eachof these groups is called a mega-batch. We apply STAN (Algorithm 1) to the exam-ples in a mega-batch for a small number of “mini-epochs” instead of max epochs.For simplicity, we set the number of mini-epochs to Z. After a mini-epoch, wemove on to the next mega-batch. Between mega-batches, we do not save activa-tions, µ or S values. Once we have iterated through all mega-batches, we startagain at the first mega-batch. A single pass through all the training data usingthis approach uses each training example as many times as Z epochs of traditionaltraining. To account for this, in our experimental evaluation, we hold the total num-ber of times an input is presented during training constant when comparing acrossmethods. The training procedure using mega-batch is best illustrated in figure 5.4.From Figure 5.4, one might argue that the assumption of feeding all uniqueinput images in same order is against stochastic SGD. While it is true, we observedthat all the frameworks which we have worked on (Caffe and Pytorch [84]) im-plement without-replacement sampling techniques instead of true stochastic SGDfor performance and implementation issues [9, 10]. In addition to it, without-replacement technique performs better than other sampling techniques [94]. Thisguarantees that an image will be repeated across epochs and not within an epoch.However if the training process is to be truly randomized and apply STAN, wewould need an identifier to differentiate an image among several images belongingto a single label, which will further complicate and increase memory requirements.The use of mega-batches reduces the overheads of storage by a factor of Mat the possible expense of a small loss in training accuracy (discussed in detail inSection 6.3).32Figure 5.4: Here, let us assume a dataset has 6 mini-batches (from A to F)in total, Mega batch size (ζ ) = 2, skipping interval (Z) = 2. We seethat, instead of allocating memory for all the images in dataset, STANallocates memory only for a mega-batch and thereby saving memory.STAN-lite and STAN-lastThe ReLU operation commonly used in today’s neural networks outputs a zero ifthe input is negative. As neurons tend to have a high correlation in their activationthroughout training, we propose to try to avoid computing the correlation altogetherby simply checking if the activation is a negative value far enough away from zerothat in the next epoch it is unlikely to become positive. We call this approachSTAN-lite as it only requires saving a single bit per neuron and training example:whether to output a zero for the ReLU or recompute the dot product and performthe ReLU operation.We found that to avoid a significant degradation in both training and validation33error, it was necessary to set a large magnitude negative threshold. This, in turn,reduced the fraction of neurons which were classified as “staunch” and reducedthe potential training time benefits. To overcome this, we also explored savingand using the last activation value provided it was less than a positive threshold.This may seem counterintuitive: reusing a positive activation value without ex-plicitly considering its prior lag-1 autocorrelation may predict wildly wrong nextactivation. However, we find in practice this approach works reasonably well at theexpense of having to save past activation values between epochs.In summary, the memory requirements for STAN-last and STAN-lite are givenby sizeo f (datatype)∗Nc ∗Ti for STAN-last case.Nc ∗Ti for STAN-lite case. (5.5)Here Nc is the total number of neurons across all convolution layers, or the totalsize of entry matrix E of the CNN and Ti is total number of images in the dataset.STAN-lite requires less additional memory than STAN-last because the former usesa bitmap whereas the latter typically use float32 as the data type. Table 5.1 com-pares STAN-last, STAN-lite and STAN-corr and we observe that STAN-lite re-quires 100× less memory than STAN-corr (without mega-batches) at the expenseof accuracy.GPU ImplementationIn Algorithm 1, if eil 6= θ , we skip k dot products and an accumulation for the neu-ron nil . Existing GPU libraries for matrix multiplication, such as Cublas1.1 [87],allocate a single thread to compute the output of at most 4 neurons at a time. Inorder to skip the computation of a single thread, we need 4 consecutive zeros in theoutput matrix, which our experiments have found to be unlikely. Additionally, forbranching instructions, all threads of a warp must execute the same control flow.If some threads of a warp follow an “if” branch and the rest execute the corre-sponding “else” branch, current GPU implementations will serialize the two pathsby first executing one then the other. This is called warp-divergence [68] and itcan introduce significant overhead. We wrote our own CUDA kernel to implementSTAN. This kernel avoids warp-divergence when skipping dot product calculations34Table 5.1: Effect on validation error of AlexLRN using CIFAR-10 for 10epochs. Here, Z = 1 and pt = 0 All the memory values reported are cal-culated using pmap for CPUs and nvidia-smi command for GPU DRAMusage.VALIDATION LOSS MEMORY USEDBASELINE 0.89 347MBθ0.9 0.5 0.1NAIVE 0.912 0.913 0.912 THRASHING(>80GB)STAN-CORR 0.9 0.899 0.908 60 GBθ0.5 10STAN-LAST 1.5 1.33 10GBSTAN-LITE 1.4 1.8 620 MBby spreading the computation of a single dot-product over the threads of a singlewarp. Hence, when STAN chooses to skip the computation of a dot product fora single neuron, the entire warp exits the kernel. GPUs are designed in such away that the scheduler issues instructions to the pipeline at warp-level and whena warp is prevented from performing some computation, the scheduler automat-ically schedules another warp. Further optimization potential exists beyond whatwe have implemented to date: warps are grouped into thread blocks which are allo-cated resources together. Typically eight or more thread blocks can share the samehardware pipeline. We leave to future work exploring the advantages of having awarp select a computation for another neuron when skipping its currently assignedneuron.35Chapter 6ResultsExperimental setupAll the experiments were performed using CUDA 8.0, NVIDIA GeForce GTX1080 Ti GPU, Intel(R) Core(TM) i7-4771 3.50GHz CPU, Pytorch [84] for MLPand rest in Caffe. All the networks use L2 regularization and stochastic gradient de-scent except MLP and ResNet-18 which uses ADAM optimizer [52]. MNIST [60]was used by MLP and LeNet and CIFAR-10 by rest of the networks. The BLASlibrary used for CPU is LAPACK [6] and for GPUs it is cuBLAS1.1 and cuBLASlibrary [80] from CUDA 8.0 (we call it CuBLAS 8). For low-level performancereasons, Caffe and Pytorch implement without-replacement sampling techniquesrather than true SGD. Thus, for all experiments, the dataset has a finite number ofimages and the order in which they are fed during training is the same in succes-sive epochs. We perform normalizing, scaling and/or subtracting the data meanfor input images as data pre-processing and set a constant seed to get deterministicresults.Null hypothesis and 2-bit STAN-liteTable 6.1 disproves the null hypothesis by comparing STAN with randomly drop-ping convolution layer neurons. We see that for several networks, STAN has abetter error than dropping neurons randomly. Interestingly, AlexLRN and LeNet’s36Table 6.1: Comparison of validation loss while using STAN on networks withvarying depths. Here random skip refers to convolutional dropout [109]ratio of 0.5, “c1 STAN-last” denotes that STAN-last is applied to onlyfirst convolution layer and “r1 STAN-lite” refers to STAN-lite appliedto only first convolution layer of all residual blocks. For ResNets, er-rors were captured at epoch 60 and rest at epoch 120. Default valuesfor ResNet are θ = 0, for rest it is θ = 10, Z = 1, pt = 0, and STAN-last/STAN-lite is applied to all layers.CNN TYPE OF VALIDATION RELATIVESKIPPING LOSS MEMORYBASELINE 0.0279 1×LENET RANDOM 0.0276STAN-LAST 0.0275 4×BASELINE 0.75 1×ALEXLRN RANDOM 2.17C1 STAN-LAST 0.74STAN-LAST 1.48 6×BASELINE 0.54 1×RESNET-18 RANDOM 3.55STAN-LITE 3.2 1.9×R1 STAN-LITE 1.36 1.45×BASELINE 0.54 1×RESNET-18 RANDOM 3.552-BIT FSM STAN-LITE 2.0 2.6×R1 STAN-LITE 1.14 1.75×error were improved marginally by using STAN-last (Table 6.1).We have observed that as the depth increases, STAN evolves from improv-ing the accuracy to over-fitting the network causing high errors on ResNet. Thisis because as the depth of the network increases, the noise introduced by STANtechniques in all layers is being added and getting amplified. As ResNets havelow auto-correlation (Figure 1.3), it leads to high errors. Therefore, we increasedpt from 0 to 3 and to overcome the over-fitting, we used BN-ReLU-conv wide-dropout structure [117] with high regularization value. To solve the noise ampli-fication problem, STAN-lite is applied only to the first conv layer of all residualblocks. This improved the error values which is evident from Table 6.1.37On the other hand, as autocorrelation is low for ResNets (Figure 1.3), STAN-lite cannot capture such complex behavior of neurons with just 1 bit. To overcomethis problem, we used a 2-bit saturating counter which is similar to 2-bit saturatingcounter branch predictor [62]. As 2-bit fsm has twice the number of bits thanSTAN-lite, it captures complex behavior of neurons in ResNets leading to bettererrors than 1-bit STAN-lite (Table 6.1).STAN-corr using Mega-batchTo reduce the STAN-corr memory requirements, we have used mega-batch and16-bit floating-point (fp16) to store µn and ACn. Figure 6.1 c shows the effect onvalidation accuracy by using fp16 for various ζ values. Compared with the graphplotting relative memory footprint (Figure 6.1 a), we see that fp16 uses less mem-ory than 32-bit floating point (1.6× savings for ζ=90) with 0.4% loss in accuracy.Figure 6.1 b plots convolution kernel and forward pass speedup for STAN-corr rela-tive to a baseline in which one warp executes one neuron but performs no skipping.Comparing this graph with the Figure 6.1 c, we observe that STAN-corr achievesa convolution kernel (green data points) speedup of 1.3-4× or 0-11% forward passspeedup (red data points) with 2.7-7% loss in accuracy.We next turn to the impact on overall forward pass training time. Figure 6.2depicts the total speedup achieved when fp16 STAN-corr is used while varyingZ and ζ . As observed before, STAN-corr can achieve 1.3-4× convolution kernelspeedup but it achieves only 3-5% overall forward pass speedup. This is becauseSTAN only skip computations in our convolution CUDA kernel and it’s speedupis limited by other operations such as im2col and backpropagation which takes asignificant amount of execution time (Figure 6.3). From Figure 6.3 a, we see thatbackpropagation takes 20% and convolution takes 70% of total execution time.However, the CUDA kernel to implement the GEMM operation takes only 21% oftotal execution time whereas the im2col operation consumes around 40% of totalexecution time (Figure 6.3 b).Chapter 8 discusses the potential to extend STAN to faster im2col alternativeslike Winograd [108]. We have also found that weight updates also show highcorrelations (Figure 6.4) during training and applying STAN to accelerate back-38(a) Relative memory requirements. (b) Relative convolution kernel and forward passspeedup.(c) Validation accuracy in %Figure 6.1: Analysis of STAN-corr on AlexLRN while varying ζ and θ . Hereepochs trained = 40, NA is Not Applicable, Z=1 and pt=2.39Figure 6.2: AlexLRN’s forward pass speedup while varying Z and ζ . Herefp16 STAN-corr is used, epochs trained = 40 and pt=2.propagation is our future work. Section 7.2 discusses an application of STAN tohyper-parameter selection that is tolerant of increased validation error.Comparison across implementationsTo prove our claims from Section 5.6, we use cuBLAS1.1 which is the only BLASversion open-sourced by NVIDIA. From Table 6.2, we see that cuBLAS1.1 withSTAN-last is slower than just cuBLAS1.1 which proves that warp-divergence isan impediment to achieve speedup. Unlike GPUs, CPUs do not have warps andthey do not suffer from warp divergence. Hence, it is straightforward to introduce40(a) Training process. (b) Convolution operationFigure 6.3: Breakdown of execution times of several layers of AlexLRN net-work in Caffe framework.Figure 6.4: Histogram of autocorrelation of all gradient updates ofAlexLRN’s layer 1 weights during training for 40 epochs.41Table 6.2: Comparison of wall-clock execution time of forward (fwd) andbackward (bwd) propagation for AlexLRN network for an epoch aver-aged over 10 instances. Here θ=10, Z=1, pt=0 and meProp’s k=30. Onlythe bars containing label “L1 reg” uses L1 regularization and rest of themuses L2 regularization.TIME (SECONDS)FWD PASS BWD PASSCUBLAS 8 8.51 1.19CUBLAS 1.1 35.52 3.064CUBLAS 1.1 WITHSTAN-LAST 37.448 3.064STAN-LAST (L2 REG) 8.38 1.22STAN-LAST (L1 REG) 6.62 1.6MLP 47.548 15.081MLP, MEPROP, STAN-LAST 28.235 0.238LAPACK 695 69.9LAPACK WITH STAN-LAST 43.2 69.9STAN on top of CPU BLAS libraries such as LAPACK and we achieved a for-ward pass speedup of 16.1×. On GPUs, STAN-last is 1.02× faster than cuBLAS8and 4.02× faster than cuBLAS1.1. We observed that our baseline kernel (withoutSTAN-last) is slower than cuBLAS8. We believe this is because cuBLAS8 is notonly optimized at the application level but also optimized at the assembly level.We redirect readers to work in [30] where the authors were able to beat NVIDIAlibraries by doing careful assembly level optimization. Doing such assembly leveloptimization is beyond the scope of this work.Most of the experiments for Table 6.2 involved dense and irregular sparse ma-trices. The speedup was possible because of non-zero θ values which allowed usto capture and classify the neurons with non-zero outputs as STANs. In additionto it, our baseline kernel handles computation of a neuron using one warp whichmakes STAN independent of sparsity patterns. To further maximize the speed up,we have also implemented L1 instead of L2 regularization so that weights tend tobe sparse and yil become low. From Table 6.2, we can see that STAN-last using L1is 1.27× faster than STAN-last with L2 regularization.42Hence, STAN is independent of sparsity patterns, can deal with dense andsparse matrices and also orthogonal to sparsity-inducing techniques.Discussion: STAN-corr on ResNetsWe have observed that when STAN-corr is applied to ResNets, our single GPUdoes not have enough memory to store all the eil values. Hence, we have offloadedthe calculation of Equation 5.2 to CPU which resulted in a slowdown. STAN-corr isfeasible for deeper nets if we employ model parallelism because it distributes neu-rons over several GPUs and thus the entry matrix values, mean and variances of aneuron are also distributed across GPUs. This reduces additional memory require-ments per GPU. Due to lack of enough hardware resources to implement modelparallelism, throughout this dissertation, we have confined ourselves to AlexLRN.This would allow us to calculate mean, variance and autocorrelation on a GPU andachieve speedup.It is to be noted that applying STAN for a deeper net by employing model par-allelism can be feasible by distributing E across all GPUs. As eil is exclusive to aneuron, there is no need to communicate its entry matrix values to other neurons(which may be executed on other GPUs) thus eliminating additional communica-tion overhead among GPUs. The same argument holds true for mean, variance andautocorrelation values of a neuron.43Chapter 7ApplicationsmePropmeProp [102] accelerates backpropagation by computing only top-k elements ofthe gradient vector such that only k rows or columns of weight matrices are up-dated. By doing so, the authors claim to achieve a staggering 69.2× speedup withminimal loss in accuracy. From Equation 5.1, if tb t f , overall speedup may below. Hence, we have examined the possibility of using STAN on top of meProp tomaximize overall speedup. We have ported our code to the Pytorch framework andhave used the same MLP from meProp. We have observed that meProp acceleratedbackward propagation by 63.4× and STAN has a forward pass speedup of 1.69×leading to an overall speedup of 2.2× (entries labeled MLP vs. MLP, meProp,STAN-last in Table 6.2). The meProp code for CNNs is not yet available, henceour analysis was confined only to an MLP network.HyperbandFinding the optimal configuration often involves experimenting with several hyper-parameter sets. One such approach is Hyperband [66] which requires two inputs:the maximum amount of resource R that can be allocated to a single configurationand η , which controls the ratio of configurations to be discarded from the pool.This process of discarding is called successive halving [49] and each such run is44called a bracket. Hyperband begins by exploring the maximum number of config-urations with minimum resources and, after each such iteration, successive halvingis performed to reduce the size of the pool by η times whereas the resource alloca-tion are increased exponentially by η times. The same process is repeated until wecomplete all successive halving in a bracket to output a configuration with the bestvalidation error seen so far.Following the Hyperband’s terminology, we have set initial values as R=81 andη=3. As Hyperband relies on validation error to shortlist hyperparameter sets, wehave examined how STAN affects this selection process by executing Hyperbandon a baseline CNN and compared it with another CNN which uses STAN. Weanswer the question: what if the hyperparameter set (H1) selected at the end of abracket with STAN is different from the hyperparameter set (H2) chosen for thebaseline which does not use STAN? Our solution is to train a new CNN withoutSTAN from scratch using both H1 and H2 and compare their validation loss T 1and T 2 respectively. For simplicity, we have obtained T 1 and T 2 by training a newCNN for a default number of epochs given by the framework for that particularnetwork.Figure 7.1 studies the effect of STAN-last on Hyperband. We see that in somecases, STAN does not influence Hyperband’s decision leading to T 1− T 2 as 0whereas in other cases, we end up with configurations with marginally high T1-T2 value (at most 0.3). For most of the networks, an average speedup of 13%is achieved with a highest of 16% for AlexLRN and lowest of 7% for AlexBN.To maximize the speedup, we have varied Z and used a larger pool (171 config-urations). For example, LeNet’s acceleration rose from 11% (Z=1) to 14% (Z=5case) but with increasing T1-T2 (with ±0.2 change). Hence, as the Z or θ valueincreases, more speedup is achieved but the quality of shortlisted configurationsdegrades. It is to be noted that when combined with meProp, an acceleration of43% is achieved for LeNet.The details on our Hyperband search space can be found in appendix Sec-tion A.3.45Figure 7.1: Analysis of Hyperband for CNNs with and without STAN. Un-less specifically stated, default values are pt = 0, Z = 1, θ = 10 andmeProp top-k = 30. ResNet-18 uses STAN-lite and rest of them useSTAN-last. For series marked as ?, R = 171 else R = 81. Here, a singlepoint on graph indicates a bracket and the total execution time reportedis wall clock time which includes forward pass, backward pass and Hy-perband’s overhead.Alternatives to HyperbandThe authors in [25] have implemented an evolutionary algorithm for hyperpa-rameter selection and achieved better test accuracies. However, to create a newoffspring, it depends on the pool of configurations obtained by the grid searchwhich is computationally expensive. Genetic algorithm was also used for hyper-parameter optimization by discarding the bad configurations based on its fitnessfunction value (test error) [114]. Random search involves training several modelswhereas, other works choose good configurations based on the network’s learningcurve [21, 53]. Another interesting idea is population based training [48] wherethe hyper-parameters of a network can be replaced with best performing hyper-parameter sets of other models in the population. This again depends on the net-work’s learning curve for replacing a given set of hyper-parameters with the best46set. We argue that STAN with reasonable θ , pt and Z will not change the accuracyand learning curve drastically. Hence STAN can be used on top of such techniquesto achieve additional speedups.47Chapter 8Conclusions and Future workAll of our networks use im2col operation which is used to convert higher dimensionimages to 2-D matrices which are used for GEMM operations. Several works haveobserved that it degrades GPU performance and proposed alternate solutions to it[5, 16]. Other choices include Winograd convolution [108] whose procedure foran input vector d, weight vector g and an output vector o is shown below:o = AT ((Gg) (BT d)) (8.1)where the matrices AT ,G & BT are constant matrices for a given filter size and is Hadamard product. We have seen that for a correlation of 0.5, 99% weights and62% of neuron outputs are correlated (Figure 1.2 and 1.3). Hence most of theentries of Gg and BT d may be correlated. In addition to it, if a neuron’s outputhas to be skipped using STAN, the computation of the corresponding row of ATwith the corresponding column of (Gg) (BT d) matrix can be skipped. Hence,there is a possibility of employing memoization to further speed up the Winogradoperation. However, implementing this and validating our claims will be our futurework.To conclude, we found that during training, 62% of neurons have lag-1 auto-correlation above 0.5 and we employed a memoization technique called STAN toexploit it. To prevent warp-divergence, we have designed our baseline CUDA ker-nel where a single warp executes a neuron’s computation. We showed that STAN48could achieve a convolution kernel speedup of 1.3-4× over our baseline GPU im-plementation at the expense of 2.7-7% loss in accuracy and an overall speedup of13% for hyper-parameter search.49Bibliography[1] CUBLAS LIBRARY. Technical report, NVIDIA Corp., December 2017.→ page 9[2] CUTLASS: Fast Linear Algebra in CUDA C++. Technical report, NVIDIACorp., December 2017. → page 9[3] NVIDIA TESLA V100 GPU ARCHITECTURE. Technical report,NVIDIA Corp., August 2017. → page 6[4] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, andA. Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural networkcomputing. In ACM SIGARCH Computer Architecture News, volume 44,pages 1–13. IEEE Press, 2016. → page 24[5] A. Anderson, A. Vasudevan, C. Keane, and D. Gregg. Low-memorygemm-based convolution algorithms for deep neural networks. arXivpreprint arXiv:1709.03395, 2017. → page 48[6] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, J. Dongarra,J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, et al. LAPACKUsers’ guide. SIAM, 1999. → page 36[7] S. Anwar and W. Sung. Coarse pruning of convolutional neural networkswith random masks. 2016. → page 23[8] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning.In Proceedings of the 26th annual international conference on machinelearning, pages 41–48. ACM, 2009. → page 32[9] L. Bottou. Curiously fast convergence of some stochastic gradient descentalgorithms. 2009. → page 32[10] L. Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks ofthe trade, pages 421–436. Springer, 2012. → page 3250[11] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Freezeout: Acceleratetraining by progressively freezing layers. arXiv preprint arXiv:1706.04983,2017. → page 24[12] M. Campbell, A. J. Hoane Jr, and F.-h. Hsu. Deep blue. Artificialintelligence, 134(1-2):57–83, 2002. → page 2[13] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of thedevil in the details: Delving deep into convolutional nets. arXiv preprintarXiv:1405.3531, 2014. → page 13[14] Y.-H. Chen, J. Emer, and V. Sze. Eyeriss: A spatial architecture forenergy-efficient dataflow for convolutional neural networks. In ACMSIGARCH Computer Architecture News, volume 44, pages 367–379. IEEEPress, 2016. → page 19[15] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro,and E. Shelhamer. cudnn: Efficient primitives for deep learning. arXivpreprint arXiv:1410.0759, 2014. → page 9[16] M. Cho and D. Brand. Mec: memory-efficient convolution for deep neuralnetwork. arXiv preprint arXiv:1706.06873, 2017. → page 48[17] C. Clopath, T. Bonhoeffer, M. Hu¨bener, and T. Rose. Variance andinvariance of neuronal long-term representations. Phil. Trans. R. Soc. B,372(1715):20160161, 2017. → page 22[18] M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deepneural networks with binary weights during propagations. In Advances inneural information processing systems, pages 3123–3131, 2015. → page24[19] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio.Binarized neural networks: Training deep neural networks with weightsand activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830,2016. → page 24[20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: Alarge-scale hierarchical image database. In Computer Vision and PatternRecognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee,2009. → page 351[21] T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatichyperparameter optimization of deep neural networks by extrapolation oflearning curves. In IJCAI, volume 15, pages 3460–8, 2015. → page 46[22] D. Eigen and R. Fergus. Predicting depth, surface normals and semanticlabels with a common multi-scale convolutional architecture. InProceedings of the IEEE International Conference on Computer Vision,pages 2650–2658, 2015. → page 13[23] R. Epstein. The quest for the thinking computer. AI magazine, 13(2):81,1992. → page 1[24] V. Firoiu, W. F. Whitney, and J. B. Tenenbaum. Beating the world’s best atsuper smash bros. with deep reinforcement learning. arXiv preprintarXiv:1702.06230, 2017. → page 2[25] F. Friedrichs and C. Igel. Evolutionary tuning of multiple svm parameters.Neurocomputing, 64:107–117, 2005. → page 46[26] N. Fujimoto. Faster matrix-vector multiplication on geforce 8800gtx. InParallel and Distributed Processing, 2008. IPDPS 2008. IEEEInternational Symposium on, pages 1–8. IEEE, 2008. → page 6[27] K. Fukushima and S. Miyake. Neocognitron: A self-organizing neuralnetwork model for a mechanism of visual pattern recognition. InCompetition and cooperation in neural nets, pages 267–285. Springer,1982. → page 2[28] W. S. Geisler and D. G. Albrecht. Cortical neurons: isolation of contrastgain control. Vision research, 32(8):1409–1410, 1992. → page 17[29] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press,2016. http://www.deeplearningbook.org. → page 2[30] S. Gray. Intel Nervana sgemmhttps://github.com/NervanaSystems/maxas/wiki/SGEMM, 2015. URLhttps://github.com/NervanaSystems/maxas/wiki/SGEMM. → page 42[31] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang,L. Wang, G. Wang, et al. Recent advances in convolutional neuralnetworks. arXiv preprint arXiv:1512.07108, 2015. → page 2[32] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient dnns.In Advances In Neural Information Processing Systems, pages 1379–1387,2016. → page 2352[33] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deeplearning with limited numerical precision. In International Conference onMachine Learning, pages 1737–1746, 2015. → page 27[34] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deepneural networks with pruning, trained quantization and huffman coding.arXiv preprint arXiv:1510.00149, 2015. → page 19[35] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights andconnections for efficient neural network. In Advances in neuralinformation processing systems, pages 1135–1143, 2015. → page 23[36] B. Hassibi and D. G. Stork. Second order derivatives for network pruning:Optimal brain surgeon. In Advances in neural information processingsystems, pages 164–171, 1993. → page 23[37] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers:Surpassing human-level performance on imagenet classification. InProceedings of the IEEE international conference on computer vision,pages 1026–1034, 2015. → page 14[38] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for imagerecognition. In Proceedings of the IEEE conference on computer visionand pattern recognition, pages 770–778, 2016. → pages 3, 7[39] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deepneural networks. In International Conference on Computer Vision (ICCV),volume 2, page 6, 2017. → page 23[40] R. Hecht-Nielsen. Theory of the backpropagation neural network. InNeural networks for perception, pages 65–93. Elsevier, 1992. → page 3[41] D. J. Heeger. Normalization of cell responses in cat striate cortex. Visualneuroscience, 9(2):181–197, 1992. → page 17[42] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trimming: Adata-driven neuron pruning approach towards efficient deep architectures.arXiv preprint arXiv:1607.03250, 2016. → page 23[43] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Denselyconnected convolutional networks. → page 3[44] Z. Huang and N. Wang. Data-driven sparse structure selection for deepneural networks. arXiv preprint arXiv:1707.01213, 2017. → page 2353[45] D. H. Hubel and T. N. Wiesel. Receptive fields and functional architectureof monkey striate cortex. The Journal of physiology, 195(1):215–243,1968. → page 2[46] D. Huber, D. Gutnisky, S. Peron, D. Oconnor, J. Wiegert, L. Tian,T. Oertner, L. Looger, and K. Svoboda. Multiple dynamic representationsin the motor cortex during sensorimotor learning. Nature, 484(7395):473,2012. → page 22[47] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep networktraining by reducing internal covariate shift. arXiv preprintarXiv:1502.03167, 2015. → pages 9, 16, 18[48] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue,A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al.Population based training of neural networks. arXiv preprintarXiv:1711.09846, 2017. → page 46[49] K. Jamieson and A. Talwalkar. Non-stochastic best arm identification andhyperparameter optimization. In Artificial Intelligence and Statistics, pages240–248, 2016. → page 44[50] K. Jarrett, K. Kavukcuoglu, Y. LeCun, et al. What is the best multi-stagearchitecture for object recognition? In Computer Vision, 2009 IEEE 12thInternational Conference on, pages 2146–2153. IEEE, 2009. → page 16[51] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fastfeature embedding. In Proceedings of the 22nd ACM internationalconference on Multimedia, pages 675–678. ACM, 2014. → pages x, 8[52] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014. → page 36[53] A. Klein, S. Falkner, J. T. Springenberg, and F. Hutter. Learning curveprediction with bayesian neural networks. 2016. → page 46[54] A. Krizhevsky and G. Hinton. Learning multiple layers of features fromtiny images. 2009. → page 7[55] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification withdeep convolutional neural networks. In Advances in neural informationprocessing systems, pages 1097–1105, 2012. → pages 3, 6, 9, 13, 14, 1654[56] D. Krueger, T. Maharaj, J. Krama´r, M. Pezeshki, N. Ballas, N. R. Ke,A. Goyal, Y. Bengio, A. Courville, and C. Pal. Zoneout: Regularizing rnnsby randomly preserving hidden activations. arXiv preprintarXiv:1606.01305, 2016. → page 24[57] V. Lebedev and V. Lempitsky. Fast convnets using group-wise braindamage. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEEConference on, pages 2554–2564. IEEE, 2016. → page 23[58] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E.Hubbard, and L. D. Jackel. Handwritten digit recognition with aback-propagation network. In D. S. Touretzky, editor, Advances in NeuralInformation Processing Systems 2, pages 396–404. Morgan-Kaufmann,1990. URL http://papers.nips.cc/paper/293-handwritten-digit-recognition-with-a-back-propagation-network.pdf.→ page 2[59] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. InAdvances in neural information processing systems, pages 598–605, 1990.→ page 23[60] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learningapplied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. → pages 2, 3, 36[61] Y. LeCun et al. Lenet-5, convolutional neural networks. → page 9[62] J. K. Lee and A. J. Smith. Branch prediction strategies and branch targetbuffer design. Computer, (1):6–22, 1984. → page 38[63] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen,N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, et al.Debunking the 100x gpu vs. cpu myth: an evaluation of throughputcomputing on cpu and gpu. ACM SIGARCH computer architecture news,38(3):451–460, 2010. → page 6[64] A. LeNail. Nn-svg: Publication-ready neural network architectureschematics. URL http://alexlenail.me/NN-SVG/. → pages x, xi, 12, 14, 20[65] F. Li and B. Liu. Ternary weight networks. CoRR, abs/1605.04711, 2016.URL http://arxiv.org/abs/1605.04711. → page 2455[66] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar.Hyperband: Bandit-based configuration evaluation for hyperparameteroptimization. 2016. → pages iii, 44[67] J. Lin, Y. Rao, J. Lu, and J. Zhou. Runtime neural pruning. In Advances inNeural Information Processing Systems, pages 2178–2188, 2017. → page23[68] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. Nvidia tesla: Aunified graphics and computing architecture. IEEE micro, 28(2), 2008. →pages 4, 5, 9, 34[69] M. H. Lipasti, C. B. Wilkerson, and J. P. Shen. Value locality and loadvalue prediction. ACM SIGPLAN Notices, 31(9):138–147, 1996. → page24[70] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficientconvolutional networks through network slimming. In 2017 IEEEInternational Conference on Computer Vision (ICCV), pages 2755–2763.IEEE, 2017. → page 23[71] S. Lyu and E. P. Simoncelli. Nonlinear image representation using divisivenormalization. In Computer Vision and Pattern Recognition, 2008. CVPR2008. IEEE Conference on, pages 1–8. IEEE, 2008. → page 17[72] E. Manchester University. The ferranti computer”. Digital computernewsletter, 1951. → page 2[73] M. Mank, A. F. Santos, S. Direnberger, T. D. Mrsic-Flogel, S. B. Hofer,V. Stein, T. Hendel, D. F. Reiff, C. Levelt, A. Borst, et al. A geneticallyencoded calcium indicator for chronic in vivo two-photon imaging. Naturemethods, 5(9):805, 2008. → page 21[74] Y. Masamizu, Y. R. Tanaka, Y. H. Tanaka, R. Hira, F. Ohkubo,K. Kitamura, Y. Isomura, T. Okada, and M. Matsuzaki. Two distinctlayer-specific dynamics of cortical ensembles during learning of a motortask. Nature neuroscience, 17(7):987, 2014. → page 22[75] M. McGill and P. Perona. Deciding how to decide: Dynamic routing inartificial neural networks. arXiv preprint arXiv:1703.06217, 2017. → page24[76] D. Michie. ”memo” functions and machine learning. Nature, 218(5136):19, 1968. → page 656[77] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia,B. Ginsburg, M. Houston, O. Kuchaev, G. Venkatesh, et al. Mixedprecision training. In Proceedings of the International Conference onLearning Representations, 2018. → page 27[78] V. Mnih. Cudamat: a cuda-based matrix class for python. 2009. → page 6[79] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcementlearning. arXiv preprint arXiv:1312.5602, 2013. → page 2[80] C. Nvidia. Cublas library. NVIDIA Corporation, Santa Clara, California,15(27):31, 2008. → page 36[81] C. Nvidia. Programming guide, 2010. → page 4[82] T. NVIDIA. V100 gpu architecture. the worlds most advanced data centergpu. NVIDIA. Aug, 2017. → page 4[83] J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey. Fastercnns with direct sparse convolutions and guided pruning. 2016. → page 23[84] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation inpytorch. 2017. → pages 32, 36[85] A. J. Peters, S. X. Chen, and T. Komiyama. Emergence of reproduciblespatiotemporal activity during motor learning. Nature, 510(7504):263,2014. → page 22[86] N. Pinto, D. D. Cox, and J. J. DiCarlo. Why is real-world visual objectrecognition hard? PLoS computational biology, 4(1):e27, 2008. → page 17[87] Pudn. Nvidia cublas 1.1, 2008. URLhttp://en.pudn.com/Download/item/id/584006.html. → page 34[88] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenetclassification using binary convolutional neural networks. In EuropeanConference on Computer Vision, pages 525–542. Springer, 2016. → page24[89] F. Rosenblatt. The perceptron: a probabilistic model for informationstorage and organization in the brain. Psychological review, 65(6):386,1958. → page 257[90] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internalrepresentations by error propagation. Technical report, California Univ SanDiego La Jolla Inst for Cognitive Science, 1985. → page 16[91] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learningrepresentations by back-propagating errors. nature, 323(6088):533, 1986.[92] D. E. Rumelhart, R. Durbin, R. Golden, and Y. Chauvin. Backpropagation:The basic theory. Backpropagation: Theory, architectures andapplications, pages 1–34, 1995. → page 16[93] J. Salamon and J. P. Bello. Deep convolutional neural networks and dataaugmentation for environmental sound classification. IEEE SignalProcessing Letters, 24(3):279–283, 2017. → page 13[94] O. Shamir. Without-replacement sampling for stochastic gradient methods.In Advances in Neural Information Processing Systems, pages 46–54,2016. → page 32[95] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, andJ. Dean. Outrageously large neural networks: The sparsely-gatedmixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017. → page21[96] S. M. Shieber. Lessons from a restricted turing test. arXiv preprintcmp-lg/9404002, 1994. → page 1[97] G. Shomron and U. Weiser. Exploiting spatial correlation in convolutionalneural networks for activation value prediction. arXiv preprintarXiv:1807.10598, 2018. → page 24[98] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. VanDen Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam,M. Lanctot, et al. Mastering the game of go with deep neural networks andtree search. nature, 529(7587):484, 2016. → pages 2, 21[99] K. Simonyan and A. Zisserman. Very deep convolutional networks forlarge-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. →pages 3, 13[100] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Strivingfor simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806,2014. → page 1558[101] D. Strigl, K. Kofler, and S. Podlipnig. Performance and scalability ofgpu-based convolutional neural networks. In Parallel, Distributed andNetwork-Based Processing (PDP), 2010 18th Euromicro InternationalConference on, pages 317–324. IEEE, 2010. → page 6[102] X. Sun, X. Ren, S. Ma, and H. Wang. meprop: Sparsified back propagationfor accelerated deep learning with reduced overfitting. arXiv preprintarXiv:1706.06197, 2017. → pages 10, 44[103] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. InProceedings of the IEEE conference on computer vision and patternrecognition, pages 1–9, 2015. → page 3[104] L. Thompson and P. Best. Long-term stability of the place-field activity ofsingle units recorded from the dorsal hippocampus of freely behaving rats.Brain research, 509(2):299–308, 1990. → page 22[105] A. M. Turing. Computing machinery and intelligence. In Parsing theTuring Test, pages 23–65. Springer, 2009. → page 1[106] B. Welford. Note on a method for calculating corrected sums of squaresand products. Technometrics, 4(3):419–420, 1962. → page 31[107] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsityin deep neural networks. In Advances in Neural Information ProcessingSystems, pages 2074–2082, 2016. → page 23[108] S. Winograd. Arithmetic complexity of computations, volume 33. Siam,1980. → pages 38, 48[109] H. Wu and X. Gu. Towards dropout training for convolutional neuralnetworks. Neural Networks, 71:1–10, 2015. → pages ix, 37[110] S. Wu, G. Li, F. Chen, and L. Shi. Training and inference with integers indeep neural networks. 2018. → page 23[111] S. Xie and Z. Tu. Holistically-nested edge detection. In Proceedings of theIEEE international conference on computer vision, pages 1395–1403,2015. → page 13[112] S. Xie, R. Girshick, P. Dolla´r, Z. Tu, and K. He. Aggregated residualtransformations for deep neural networks. In Computer Vision and PatternRecognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE,2017. → page 359[113] A. Yazdanbakhsh, K. Samadi, and H. Esmaeilzadeh. Snapea: Predictiveearly activation for reducing computation in deep convolutional neuralnetworks. → page 24[114] S. R. Young, D. C. Rose, T. P. Karnowski, S.-H. Lim, and R. M. Patton.Optimizing deep learning hyper-parameters through an evolutionaryalgorithm. In Proceedings of the Workshop on Machine Learning inHigh-Performance Computing Environments, page 4. ACM, 2015. → page46[115] N. Yu, S. Qiu, X. Hu, and J. Li. Accelerating convolutional neuralnetworks by group-wise 2d-filter pruning. In Neural Networks (IJCNN),2017 International Joint Conference on, pages 2502–2509. IEEE, 2017. →page 23[116] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y.Lin, and L. S. Davis. Nisp: Pruning networks using neuron importancescore propagation. arXiv preprint arXiv:1711.05908, 2017. → page 23[117] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprintarXiv:1605.07146, 2016. → pages 3, 37[118] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutionalnetworks. In European conference on computer vision, pages 818–833.Springer, 2014. → pages 3, 14, 15[119] W. Zhang, K. Itoh, J. Tanida, and Y. Ichioka. Parallel distributed processingmodel with local space-invariant interconnections and its opticalarchitecture. Applied optics, 29(32):4790–4797, 1990. → page 3[120] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization.arXiv preprint arXiv:1612.01064, 2016. → page 2460Appendix ASupporting MaterialsProof of Equation 5.1Total execution time without STAN is given byN(t f + tb)Time spent by the CNN when STAN is applied consists of 3 parts:1. time spent in pre-stan interval ispt(t f + tb)2. time spent during calculating E isN− ptZ+1(t f + tb+ te)3. where as the new execution time of CNN after applying STAN is(N− pt) ZZ+1(tb+(t f − ti))So the speedup is given by61N(t f + tb)pt(t f + tb)+N−ptZ+1 (t f + tb+ te)+(N− pt) ZZ+1(tb+(t f − ti))N(t f + tb)N(t f + tb)+N−ptZ+1 te− (N− pt) ZZ+1 tiN(t f + tb)N(t f + tb)− N−ptZ+1 (Zti− te)Proof of Equation 5.2Using the equations 5.4 and 5.3, we derive the autocorrelation as follows:An =n−1∑i=1(xi−µn)∗ (xi+1−µn)=n−2∑i=1(xi−µn)∗ (xi+1−µn)+(xn−1−µn)∗ (xn−µn)=n−2∑i=1(xi−µn−1− xn−µn−1n )∗ (xi+1−µn−1−xn−µn−1n)+(xn−1−µn)∗ (xn−µn)=n−2∑i=1((xi−µn−1)∗ (xi+1−µn−1))+n−2n2(xn−µn−1)2+(xn−1−µn)∗ (xn−µn)− xn−µn−1n( n−2∑i=1(xi−µn−1)+n−2∑i=1(xi+1−µn−1))= An−1+n−2n2(xn−µn−1)2+(xn−1−µn)∗ (xn−µn)− xn−µn−1n( n−2∑i=1(xi−µn−2+ µn−2n−1 −xn−1n−1)+n−2∑i=1(xi+1−µn−1)+ x1− x1+µn−1−µn−1)= An−1+n−2n2(xn−µn−1)2+(xn−1−µn)∗ (xn−µn)−xn−µn−1n( −1n−1n−2∑i=1(xn−1−µn−2)+(−x1+µn−1))= An−1+n−2n2(xn−µn−1)2+(xn−1−µn)∗ (xn−µn)+62xn−µn−1n(n−2n−1(xn−1−µn−2)+(x1−µn−1))= An−1+n−2n2(xn−µn−1)2+(xn−1−µn)∗ (xn−µn)+xn−µn−1n(n−2n−1[xn−1− (n−1)∗µn−1− xn−1n−2]+(x1−µn−1))= An−1+n−2n2(xn−µn−1)2+(xn−1−µn)∗ (xn−µn)+xn−µn−1n(xn−1+ x1−2∗µn−1)Therefore, equation 5.2 can be rewritten asACn =AnSn=An−1+ n−2n2 (xn−µn−1)2+(xn−1−µn)∗ (xn−µn)+ xn−µn−1n (xn−1+ x1−2∗µn−1)Sn−1+ n−1n (xn−µn−1)2Hyperband search spaceThe networks considered in our experiments are MLP, LeNet4, Alex5, AlexLRN7,AlexBN6 and ResNet-188.The details on our hyperband search space for all thesenetworks can be found in table A.1. Few things to be noted include• To make our entry matrix memory management simple, we fixed the batchsize as 100 and thus, it was not included in our search space.• For all the experiments, constant seed was used to get deterministic results.• ResNet-18 and MLP uses ADAM and rest of the networks use SGD.4The model specification is available at https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet train test.prototxt5The model specification is available at https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10 quick train test.prototxt7The model specification is available at https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10 full train test.prototxt6The model specification is available at https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10 full sigmoid train test bn.prototxt8The model specification is available at https://github.com/letr63jd56/STAN/blob/master/stan-cifar10/models/old resnet dropout/trainval bn relu conv.prototxt63• Epochs trained column in table A.1 denotes the number of epochs used toobtain T1 and T2.Table A.1: Hyper-parameter search space used in hyperband for several net-works.NETWORK HYPER-PARAMETER SCALE MIN MAX EPOCHSTRAINEDMLP HIDDEN NEURONS LINEAR 100 8192 20DROPOUT LINEAR 0 0.999TOP-K LINEAR 0 100LENET LEARNING RATE LOG 5E-5 5 20CONV1 L2 PENALTY LOG 5E-5 5CONV2 L2 PENALTY LOG 5E-5 5MOMENTUM LOG 0.001 0.99ALEX LEARNING RATE LOG 5E-5 5 8L2 PENALTYFOR ALL CONV LAYERS LOG 5E-5 5FC1 L2 PENALTY LOG 5E-5 5FC2 L2 PENALTY LOG 5E-5 5ALEXLRN LEARNING RATE LOG 5E-5 5 120L2 PENALTYFOR ALL CONV LAYERS LOG 5E-5 5FC1 L2 PENALTY LOG 5E-5 5LRN SCALE LOG 5E-6 5POWER LINEAR 0.01 3ALEXBN LEARNING RATE LOG 5E-5 5 120MOMENTUM LOG 5E-5 5L2 PENALTYFOR ALL CONV LAYERS LOG 5E-5 5FC1 L2 PENALTY LOG 5E-5 5RESNET-18 L2 PENALTYFOR ALL CONV LAYERS LOG 5E-5 5 60DROPOUT LINEAR 0 0.99964
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Faster convolutional neural network training via approximate...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Faster convolutional neural network training via approximate memoization Sandhupatla, Amruth 2018
pdf
Page Metadata
Item Metadata
Title | Faster convolutional neural network training via approximate memoization |
Creator |
Sandhupatla, Amruth |
Publisher | University of British Columbia |
Date Issued | 2018 |
Description | Deep Convolutional Neural Networks have found wide application but their training time can be significant. We find that between successive epochs during training, many neurons compute nearly the same output when presented with the same input. This presents an opportunity to skip computation in the forward pass on the later epoch via memoization. This dissertation explores the potential of such an approach by investigating the correlation of neuron activations between training epochs. We develop an implementation of activation memoization that takes into account the lockstep behavior of threads executing together in single-instruction, multiple-thread Graphic Processing Units (GPU). Finally, we discuss the trade-off between speedup and accuracy by showing that STAN achieves a 1.3-4X convolution speedup over our baseline GPU implementation at the expense of 2.7-7% loss in accuracy. When STAN is applied to Hyperband which can be robust to drop in accuracy, an overall training speedup of 7%-16% can be achieved with a minimal change in test error (-0.2 to +0.2). |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2018-10-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0372887 |
URI | http://hdl.handle.net/2429/67615 |
Degree |
Master of Applied Science - MASc |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2018-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2018_november_sandhupatla_amruth.pdf [ 1.08MB ]
- Metadata
- JSON: 24-1.0372887.json
- JSON-LD: 24-1.0372887-ld.json
- RDF/XML (Pretty): 24-1.0372887-rdf.xml
- RDF/JSON: 24-1.0372887-rdf.json
- Turtle: 24-1.0372887-turtle.txt
- N-Triples: 24-1.0372887-rdf-ntriples.txt
- Original Record: 24-1.0372887-source.json
- Full Text
- 24-1.0372887-fulltext.txt
- Citation
- 24-1.0372887.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0372887/manifest