ReSprop: Reused Sparsified BackpropagationbyNegar GoliB. Sc, Sharif University of Technology, 2017A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of Applied ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Electrical and Computer Engineering)The University of British Columbia(Vancouver)June 2020© Negar Goli, 2020The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:ReSprop: Reused Sparsified Backpropagationsubmitted by Negar Goli in partial fulfillment of the requirements for the degreeof Master of Applied Science in Electrical and Computer Engineering.Examining Committee:Tor M. Aamodt, Electrical and Computer EngineeringSupervisorLeonid Sigal, Computer ScienceSupervisory Committee MemberiiAbstractThe success of Convolutional Neural Networks (CNNs) in various applications isaccompanied by a significant increase in computation and training time. In thiswork, we focus on accelerating training by observing that about 90% of gradientsare reusable during training. Leveraging this observation, we propose a new algo-rithm, Reuse-Sparse-Backprop (ReSprop), as a method to sparsify gradient vectorsduring CNN training. ReSprop maintains state-of-the-art accuracy on CIFAR-10,CIFAR-100, and ImageNet datasets with less than 1.1% accuracy loss while en-abling a reduction in back-propagation computations by a factor of 10× resultingin a 2.7× overall speedup in training. As the computation reduction introduced byReSprop is accomplished by introducing fine-grained sparsity that reduces compu-tation efficiency on GPUs, we introduce a generic sparse convolution neural net-work accelerator (GSCN), which is designed to accelerate sparse back-propagationconvolutions. When combined with ReSprop, GSCN achieves 8.0× and 7.2×speedup in the backward pass on ResNet34 and VGG16 versus a GTX 1080 TiGPU.iiiLay SummaryConvolutional Neural Networks play an essential role in today’s computer visionapplications. However, to train these networks, one requires massive data and com-putational resources. Fortunately, more data is available due to the worldwide us-age of the internet. But, the lack of efficient algorithms for training to reduce thecomputations hinders the progress. One way to decrease the computation in CNNs(Convolutional Neural Networks) is to produce sparsity in the computations. Thereare several methods, which sparsify the inference. However, due to the complexityof training, few works have studied sparsity in training for reducing the trainingcomputations.In this thesis, we aim to solve the issue of excessive training time by develop-ing a new training algorithm which reuses the gradients to sparsify computations.This method can achieve a substantial speedup on a specific hardware acceleratordesigned for sparse training.ivPrefaceThis dissertation is based on a research project conducted by myself under the su-pervision and guidance of Professor Tor M. Aamodt. The work in this thesis wasalso presented in the paper ReSprop: Reuse Sparsified Backpropagation, acceptedto appear in the oral presenation track at the 2020 IEEE/CVF Conference on Com-puter Vision and Pattern Recognition (CVPR).I assisted with defining the problem space and was responsible for derivingmathematical solutions, identifying challenges within this problem space, and de-signing and optimizing the algorithm to evaluate the proposed idea. Prof. TorM. Aamodt provided valuable guidance and directions in identifying the researchproblems, developing solution methodologies, and documenting the results.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xivDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Sparse Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2 Gradient Reuse . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3 GSCN Accelerator . . . . . . . . . . . . . . . . . . . . . . . . . 51.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.5 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.1 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . 8vi2.1.1 Convolutional Layer . . . . . . . . . . . . . . . . . . . . 92.1.2 Activation Function . . . . . . . . . . . . . . . . . . . . . 102.1.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.1.4 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.1.5 Normalization . . . . . . . . . . . . . . . . . . . . . . . 112.1.6 Fully-connected layer . . . . . . . . . . . . . . . . . . . . 122.2 ResNet, WRN and VGG . . . . . . . . . . . . . . . . . . . . . . 122.2.1 VGG . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.2.2 ResNet and WRN . . . . . . . . . . . . . . . . . . . . . . 132.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.3.1 Training Convolutions and Notation . . . . . . . . . . . . 152.3.2 Mini-batch Training . . . . . . . . . . . . . . . . . . . . 152.3.3 Momentum . . . . . . . . . . . . . . . . . . . . . . . . . 162.4 Sparse Convolution . . . . . . . . . . . . . . . . . . . . . . . . . 162.4.1 Sparse Hardware Acceleretors . . . . . . . . . . . . . . . 173 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.1 Dense to Sparse Networks by Weight Pruning . . . . . . . . . . . 183.1.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 193.1.2 Prunning Criterion . . . . . . . . . . . . . . . . . . . . . 193.1.3 Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . 203.1.4 Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . 203.2 Sparse Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.3 Low precision neural network . . . . . . . . . . . . . . . . . . . 213.4 Reusing Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . 223.5 Hardware Accelerators for Deep Neural Networks . . . . . . . . . 243.5.1 Hardware Accelerator for Sparse Networks . . . . . . . . 264 Gradient Reuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284.2 Approach and Key Insight . . . . . . . . . . . . . . . . . . . . . 294.3 Angle Preservation and comparison with meProp . . . . . . . . . 30vii5 ReSprop Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 335.1 ReSprop: Reuse-Sparse-Backprop . . . . . . . . . . . . . . . . . 335.1.1 Stochastic Output Gradient . . . . . . . . . . . . . . . . . 345.1.2 Warm Up . . . . . . . . . . . . . . . . . . . . . . . . . . 376 GSCN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396.1 Accelerator for Sparse Training . . . . . . . . . . . . . . . . . . . 396.2 Computation of Convolutional Layers . . . . . . . . . . . . . . . 416.3 Background on SCNN . . . . . . . . . . . . . . . . . . . . . . . 416.3.1 SCNN Limitations . . . . . . . . . . . . . . . . . . . . . 446.4 GSCN Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . 476.5 GSCN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 497 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 527.3 Accuracy Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 537.3.1 Accuracy on CIFAR10 and CIFAR100: . . . . . . . . . . 537.3.2 Accuracy on ImageNet: . . . . . . . . . . . . . . . . . . 547.4 Sensitivity Study . . . . . . . . . . . . . . . . . . . . . . . . . . 567.4.1 Deep and Wide Networks . . . . . . . . . . . . . . . . . 567.4.2 Impact of Batch Size: . . . . . . . . . . . . . . . . . . . . 577.4.3 Distribute Training Across Multiple Compute Nodes . . . 577.5 Speedup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587.5.1 Adaptive Thresholding: . . . . . . . . . . . . . . . . . . . 587.5.2 Pre-ReSprop Overhead . . . . . . . . . . . . . . . . . . . 597.5.3 Theoretical Speedup: . . . . . . . . . . . . . . . . . . . . 597.5.4 Accelerator for Sparse Back-propagation: . . . . . . . . . 618 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648.0.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 658.0.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . 66Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67viiiList of TablesTable 4.1 Validation accuracy of meProp and reuse strategy (HG) withdifferent sparsities and reuse percentages, repectively. TrainingResNet-18 on CIFAR-10 for 30 epochs (batch size = 128, lr =0.1 and optimizer = SGD). . . . . . . . . . . . . . . . . . . . . 31Table 5.1 Validation accuracy of full, average and stochastic ReSprop forResNet-18 on the CIFAR-10 dataset for 200 epochs (batch size= 128, lr = 0.1, optimizer = SGD, avg of 3 runs). . . . . . . . 36Table 6.1 The table shows percentage of unused cartesian product overtotal cartesian product in SCNN architecture for gradient ofweights convolution. . . . . . . . . . . . . . . . . . . . . . . . 47Table 7.1 Validation accuracy of ReSprop and W-ReSprop at differentreuse-sparsity constraints on the CIFAR-100. . . . . . . . . . 54Table 7.2 Validation accuracy of ReSprop and W-ReSprop at differentreuse-sparsity constraints on the CIFAR-10. . . . . . . . . . . 55Table 7.3 Top 1 validation accuracy of ReSprop and W-ReSprop algo-rithms at different reuse-sparsity constraints on the ImageNetdataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55Table 7.4 Validation accuracy of ResNet-18, 34 and 50 on the CIFAR-100dataset at 90% reuse-sparsity. . . . . . . . . . . . . . . . . . . 57Table 7.5 Validation accuracy of ResNet-34 on the CIFAR-100 datasetwith different batch sizes of 32, 64 and 128. . . . . . . . . . . 57ixTable 7.6 Top 1 validation accuracy of ResNet-18 on the ImageNet datasettrained on 2, 4 and 8 nodes. . . . . . . . . . . . . . . . . . . . 58Table 7.7 Validation accuracy and train speedup at 90% sparsity com-pared to dense training (CIFAR-10 dataset). . . . . . . . . . . 61Table 7.8 Theoretical and GSCN speedup at backward pass computationswith 90% resue-sparsity (ImageNet). . . . . . . . . . . . . . . 61Table 7.9 GSCN parameters . . . . . . . . . . . . . . . . . . . . . . . . 62xList of FiguresFigure 1.1 Forward and backward propagation through a convolutionallayer during training. . . . . . . . . . . . . . . . . . . . . . . 2Figure 1.2 Percentage of floating point operations during the backwardand forward pass in training different architectures. . . . . . 3Figure 1.3 A simple example of gradient’s movement in a 3D space. Thegradient’s movement in the first and second iterations are markedin red. The changes in the (y) axis is shown in blue dots. . . . 5Figure 2.1 Different layers in CNN architecture. . . . . . . . . . . . . . 9Figure 2.2 An input with C channels convolving with K filters to producean output with K channels. . . . . . . . . . . . . . . . . . . . 10Figure 2.3 Architecure of VGG-16 network. Conv and FC denote convo-lution and fully-connected layers, respectively. . . . . . . . . 13Figure 2.4 A simple residual block with skip connection for double layers. 13Figure 4.1 HG and meProp angles for different reuse percentages andsparsities, respectively. The angle is calculated by finding theaverage angle of all layers while training ResNet-18 on CIFAR-10 for 100 iterations (batch size=128). . . . . . . . . . . . . . 31Figure 5.1 Training with ReSprop for layer l at iteration i. . . . . . . . . 35Figure 5.2 Back-propagation convolutions in stochastic mode comparedto full mode for layer l at iteration i. . . . . . . . . . . . . . . 37xiFigure 6.1 Forward pass and backward pass convolutions for N input sam-ples with C channels and K filters each with C channels. . . . 40Figure 6.2 Figure shows an example of SCNN and GSCN data and work-load distribution while having four PEs. Input and filter as-signment to each PE for SCNN and GSCN are different. . . . 42Figure 6.3 SCNN PE microarchitecture employing SSCN data flow [72]. 43Figure 6.4 When the size of filter is smaller than input size, SCNN archi-tecture has negligible amount of unused products. . . . . . . 45Figure 6.5 When the size of two operands are close in the convolution theSCNN architecture produces many unused products. . . . . . 46Figure 6.6 GSCN PE microarchitecture employing GSCN data flow. TheGSCN PE microarchitecture is built upon SCNN, the unitsadded or changes in GSCN have been shown with yellow color. 49Figure 7.1 Top 1 validation accuracy of ReSprop, W-ReSprop, mePropand W-meProp algorithms for training ResNet-18 on the Im-ageNet dataset. The baseline is trained with no sparsity orreusing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56Figure 7.2 Computation overhead of ReSprop at forward pass (pre-ReSprop)for different batch sizes (ImageNet dataset). . . . . . . . . . 60Figure 7.3 ReSprop training (forward+backward) speedup versus archi-tecture for three reuse-sparsity percentages (ImageNet). . . . 60Figure 7.4 Figure shows the speedup for GSCN compared to SCNN andGTX 1080 Ti GPU while training with ReSprop. . . . . . . . 62Figure 8.1 ReSprop has two parts. Pre-Resprop computations are negli-gible and back-ReSprop are sparse computations in backwardpass. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64xiiList of AbbreviationsCIFAR : Canadian Institute for Advanced ResearchConv : ConvolutionCNN : Convolutional Neural NetworksCUDA : Compute Unified Device ArchitectureDNN : Deep Neural NetworksDRAM : Dynamic Random Access MemoryGPU : Graphics Processing UnitGSCN : Generic Sparse Convolutional Neural NetworkReLU : Rectified Linear UnitReSprop : Reused Sparse Back-propagationResNet : Residual NetworksVGG : Visual Geometry GroupxiiiAcknowledgmentsFirst and foremost, I am profoundly indebted to my parents in Iran for their un-conditional love and blessings. Special thanks to my sister for her support andencouragement throughout my study.Second, I would like to express my deep gratitude to Professor Tor M Aamodt,my research supervisor, for his patient guidance, enthusiastic encouragement, use-ful critiques of this research work, and generous financial support. I thank himfor providing me with an excellent research atmosphere in his lab. The meetingsand conversations were vital in inspiring me to think outside the box, from multi-ple perspectives to form a comprehensive study. I will remember and am deeplythankful for your wisdom and all the life lessons you taught me during my thesis.Third, I am very fortunate and grateful to have excellent colleagues who offeredme genuine and friendly support and assurance to carry out my research. I shouldparticularly thank Md Aamir Raihan for several discussions and feedback on myresearch project. My special thanks to Francois Demoullin and Deval Shah fortheir constant encouragement and motivation, which helped me to get through oneof the tough times of my life. I would also like to thank Dave Evans for his personaland professional support.Finally, I would like to express my very great appreciation to Mohammad Ja-fari, UBC Ph.D. candidate, for his valuable and constructive suggestions during theplanning and development of this research work. His willingness to give his timeso generously has been very much appreciated.I also gratefully acknowledge the funding provided by the Natural Sciencesand Engineering Research Council of Canada (NSERC) and Computing Hardwarefor Emerging Intelligent Sensory Applications (COHESA) that made my researchxivpossible.xvDedicationTo my father Mohammad-Reza Goli, mother Mahin Yaraghi, and sisterLeily Goli.With gratitude for your inspiration, love, and support.xviChapter 1IntroductionWe are witnessing an explosion in the use of Deep Neural Networks (DNNs), withmajor impact on the world’s economic and social activity. At present, there is abun-dant evidence of DNN’s effectiveness in areas such as classification, vision, andspeech [34], [16], [68], [75], [64]. Of particular interest are Convolutional Neu-ral Networks (CNNs), which achieve state-ofthe-art performance in many of theseareas, such as image/temporal action recognition [46], [85] and scene generation[76]. However, state-of-the-art CNNs require extensive computational resourcesand a significant amount of time to be trained. The convolution operation is theprimary computation source in CNNs.Training a neural network consists of a forward pass and a backward pass. Thebackward pass has a higher computational cost compared to the forward pass. Inthe backward pass, the back-propagation algorithm is applied which is a commonmethod for weights adjustment in conjunction with an optimization method such asgradient descent. The basic idea of the back-propagation algorithm is to propagatethe error (the gradient in gradient descent) back along through the network andadjust the weights to correct the error. In brief, the backward pass shares similarcomputation patterns with the forward pass but involves approximately two-foldas many convolution operations for both error propagation and weight update, asshown in Figure 1.1. In this study, we try to reduce the computations in the back-ward pass. To have a better understanding of training, we need to also keep in mindthat in practice, the most commonly used optimization method for CNN is stochas-1***Input ActivationFilter ValuesFilter ValuesFilter GradientInput ActivationInput GradientOutput ActivationOutput GradientFigure 1.1: Forward and backward propagation through a convolutional layerduring training.tic gradient descent (SGD) [10], which randomly chooses a subset of the trainingdata, called a minibatch, and updates parameters based on the average error of eachiteration.Prior work has adopted two main strategies to accelerate CNN training: (1)reducing the number of iterations per compute node required to converge, usingtechniques such as batch normalization reducing internal covariate shift [37], par-allelize training with data or model parallelism [17, 46], and importance samplingto reduce the variance of gradient estimates [44, 45]; (2) reducing the amount ofcomputation per iteration using techniques such as stochastic depth to remove lay-ers during training [35], randomized hashing to reduce the number of multiplica-tions [87], quantization [8, 93, 98] and sparse training [21, 54, 90, 94]. We explorethe second strategy and propose Reuse Sparse Backprop (ReSprop), a novel wayto sparsify convolution computations during training1.1Source code available at https://github.com/negargoli/ReSprop201234ReSProp overhead in forward pass (% ) 28.49%30.76%28.07%30.36%33.39%33.38%71.51%69.24%71.93%69.64%66.61%66.62%0%10%20%30%40%50%60%70%80%90%100%ResNet18 ResNet34 ResNet50 WRN-50-2 VGG16 VGG19FP operations in training (%) Forward BackwardFigure 1.2: Percentage of floating point operations during the backward andforward pass in training different architectures.1.1 Sparse TrainingAt their core, convolutions are parallel dot products and accumulations. Thus,sparse convolutions decrease computational cost by reducing the number of multi-plication and addition operations. Recent related works [33, 58, 59, 61, 66] studydifferent approaches to sparsifying inference, and many [1, 2, 13, 30, 72] pro-pose accelerators to exploit sparsity in inference; however, there is limited work onsparse training [21, 54, 69, 90, 94].In a forward-backward pass of a CNN during training, a convolutional layerrequires three convolutions: one for forward propagation and two for backwardpropagation shown in Figure 1.1. Our measurements are shown in Figure 1.2 indi-cate that back-propagation consumes around 70% of the time during training.MeProp [90, 94] and DSG [54] sparsify the backward pass convolutions us-ing different sparsification methods. Since the sparsity produces by these methodsis fine-grained, it might results in speedup on specialized hardware but not on anNvidia GPU (before newly introduced Ampere GPU), which is the current pre-ferred platform. Thus, the need for a specialized hardware accelerator for sparseinference and training is vital.MeProp reduces the computational cost of training by sparsifying gradients inback-propagation calculations and DSG by graph selection method and dimension-3reduction search, compresses activations with elementwise unstructured sparsity,and accelerates vector-matrix multiplications (VMM). However, we observe thatmeProp fails to converge while training deeper networks or when using large datasets(Sections 4 and 7.1), and DSG loses more accuracy and achieves less trainingspeedup compared to ReSprop.ReSprop reduces the computation overhead of backward pass by reusing gra-dients to sparsify back-propagation convolutions. ReSprop overcomes prior limi-tations and loses less than 1.1% accuracy on large dataset such as ImageNet [82].1.2 Gradient ReuseIn this section, we describe our main insight, which is reusing gradients, and themotivation behind it. We briefly explain prior works on reusing gradients and thenmove to our idea.The stochastic variance reduction gradient (SVRG) method is proposed byJohnson and Zhang [40], and belongs to the class of stochastic methods using theso-called variance reduction technique [4, 19, 40, 62]. The common idea behindthese methods is to use some full gradient of the past to approximate the future forgeneral non-convex problems to reduce variance. Although the reusing proposedby SV methods is different from ours in many aspects such as the reusing strategyand the purpose of reusing, these methods motivate us to think of reusing gradi-ents between iterations. We assume that at each iteration, not all the gradients willconsiderably change, and it would be possible that just a portion of them varies sig-nificantly, and the rest remains almost the same. We try to visualize the gradient’smovement in a simple 3D example. As it is shown in this example, the gradient’smovement in the first and second iterations are slightly different and the amount ofchange is almost the same.Our observations (in Section 4.2) demonstrate that updating a small portionof the gradient components each iteration and replacing the rest with the previousiteration’s gradient component values is sufficient for maintaining state-of-the-artaccuracy. The ReSprop algorithm (Section 5.1) exploits gradient reusability andsparsifies the gradients in the back-propagation convolutions up to 90% with lessthan 1.1% loss of accuracy on the ImageNet dataset. ReSprop has less than 2%4Figure 1.3: A simple example of gradient’s movement in a 3D space. Thegradient’s movement in the first and second iterations are marked inred. The changes in the (y) axis is shown in blue dots.computation overhead and less than 16% memory footprint overhead while train-ing the ImageNet dataset with batch sizes larger than 128. For 90% sparsity, wecalculate ReSprop theoretical speedups between 9.3× and 9.8× for backward passcalculations and, as a consequence of Amdahl’s Law [5], between 2.5× and 2.9×for the overall training process on different architectures.1.3 GSCN AcceleratorIn recent years accelerators, which are dedicated fixed-function peripherals de-signed to perform a single computationally intensive task over and over, have be-come an active area of research. Moreover, accelerators have been designed andadopted into programmable pipelines. Nvidia Tensor Core and ray tracing unit inTuring GPUs are some examples of it. Due to the high computation of convolu-tions, there are many algorithms for sparse convolutions and accelerators which tryto exploits fine-grained sparsity produced by these algorithms. The current avail-able GPUs are not able to exploit fine-grained sparsity, but the newly introducedNVIDIA Ampere GPU supports fine-grained sparsity of weights in the forwardpass convolution.5The fine-grained sparsity introduces by ReSprop, can be accelerated by hard-ware similar to recently proposed inference accelerators [1, 2, 13, 30, 72]. Al-though there are many hardware accelerators for sparse inference, accelerators forsparse training have not been widely explored. Thus, we calculated the speedupachieved by ReSprop on our proposed custom hardware accelerator for sparsetraining.Among available accelerator, Sparse CNN (SCNN) accelerator is an acceler-ator for inference, which exploits both weight and activation sparsity to improveboth performance and power [72]. We propose a Generic Sparse ConvolutionalNeural network (GSCN) accelerator hardware architecture (Section 6) based onSCNN. GSCN is designed to accelerate sparse back-propagation convolutions.GSCN supports sparse back-propagation convolutions and overcomes the deficien-cies of SCNN in cases of underutilization for small inputs and shortcomings causedby closely-sized input and filter. Our results (Section 7.5) show ReSprop on GSCNachieves 8.0× speedups versus a GPU on the backward pass of ResNet34.1.4 ContributionsThe contributions of this dissertation are:• It shows that we can reuse about 90% of the gradients between consecutiveiterations with minimal loss of accuracy.• It introduces a new training algorithm called ReSprop, which by reusing thegradients, makes the backward pass computations sparse. For 90% sparsity,ReSprop accuracy loss is less than 1.1%. It achieves theoretical speedupsbetween 9.3× and 9.8× for backward pass calculations and between 2.5×and 2.9× for the overall training process on different architectures.• It shows that ReSprop on GSCN achieves between 7.2× and 8.6× speedupsversus a GPU on the backward pass on different architectures.1.5 OrganizationThe rest of this dissertation is organized as follows:6• Chapter 2 details the internal workings of a Convolutional Neural Network,and it is basic building blocks. It provides detailed background about train-ing.• Chapter 3 discusses related studies on pruning, sparse training, and reusegradients.• Chapter 4 describes our insight for reusing gradients and our reuse strategyand finally verifies it with experiments and comparison with meProp [90,94].• Chapter 5 proposes the ReSprop algorithm based on the reusing strategy andexplain how reusing leads us to the sparse back-propagation calculations inReSprop.• Chapter 6 explains the GSCN architecture and dataflow.• Chapter 7 demonstrates and analyzes accuracy results for different datasets,computation reduction, ReSprop overhead, and GSCN accelerator speedup.7Chapter 2BackgroundTo properly explain ReSprop and the experiments done to validate ReSprop, itis helpful to first briefly go over convolutional neural networks (CNNs), back-propagation’s calculations in convolution layers and mini-batch training. Thischapter gives an overview of convolutional neural networks (CNNs) and describesthe commonly used CNNs topologies used in our experiments. It then explainstraining a network and briefly describes mini-batch training.2.1 Convolutional Neural NetworkThe efficacy of Convolutional Neural Networks (CNNs) in image recognition isone of the main reasons why the world has woken up to the efficacy of deeplearning. CNNs are powering major advances in computer vision, which has ap-plications for self-driving cars, robotics, drones, security, medical diagnoses, andtreatments for the visually impaired. CNNs are not limited to image recognition;however, they have been applied directly to text analytics and many other fields.A CNN architecture typically consists of convolution, activation, pooling, dropout,batch normalization, and fully-connected layers (Figure 2.1). In this section weexplain the convolution layer, which is the main layer in CNNs in detail and brieflygo over other layers.8Input Image ConvolutionActivation Function (Non-linearity)PoolingNormalizationFeature mapsDropoutFigure 2.1: Different layers in CNN architecture.2.1.1 Convolutional LayerThe predominant layers in CNNs are Convolutional layers. Each convolutionallayer is composed of a 3-dimensional input with C channels (width (W )×height (H)×channels (C)) and K number of 3-dimensional (width (R)×height (S)×channels (C))filters which in turn form a 3-dimensional (W −R+ 1)× (H− S+ 1)×K output.These parameters are visualized in Figure 2.2. The width (W) and height (H) ofan image are easily understood. The channels (C) are necessary because of howcolors are encoded. Red-Green-Blue (RGB) encoding, for example, produces animage three layers deep. Each layer is called a channel, and through convolution,it produces a stack of output feature maps. The height and weight of the filters aresmaller than those of the input volume. Convolution is a specialized kind of linearoperation. Each filter is convolved with the input volume to compute an activa-tion map made of neurons. In more details, the filter slides across the width andheight of the input, and the dot products between the input and filter are computedat every spatial position. The output volume of the convolutional layer is obtainedby stacking the activation maps of all filters along the depth dimension (channels)(Figure 2.2).After a convolutional layer, the input is passed through a nonlinear transformsuch as tanh or rectified linear unit (ReLU), which will squash input values into arange between -1 and 1. One of the key challenges with images is that they arehigh-dimensional, which means they can consume a lot of time and computingpower to process. Convolutional networks include stage designed to reduce the di-mensionality of images. Filter stride is one way to reduce dimensionality; another9FiltersK* =Output InputRSHWH-S+1W-R+1H-S+1Figure 2.2: An input with C channels convolving with K filters to produce anoutput with K channels.way is through downsampling, max-pooling, or subsampling. The pooling layeris almost part of any CNN network, and it mainly helps extract sharp and smoothfeatures. It is also done to reduce variance and computations.2.1.2 Activation FunctionThe activation function decides whether a neuron should be activated or not bycalculating the weighted sum and further adding a bias to it. A network comprisedof only linear activation functions is very easy to train, but cannot learn complexmapping functions. Linear activation functions are still used in the output layer fornetworks that predict a quantity (e.g. regression problems).Increasingly, neural networks use non-linear activation functions, which canhelp the network learn complex data, compute and learn almost any function rep-resenting a question, and provide accurate predictions. Thus, the purpose of theactivation function is to introduce non-linearity into the output of a neuron. Whilethere are many activation functions, the most popular one is Rectified Linear Units(ReLU). Krizhevsky et al [46] have observed that deep convolutional neural net-works with ReLUs train several times faster than their equivalents with tanh units.Moreover, ReLU reduces likelihood of vanishing gradient. Generally, a neuron isturned on if the output of ReLU is greater than zero, and the neuron is turned off ifthe output of ReLU is zero.102.1.3 PoolingA limitation of the feature map output of convolutional layers is that they recordthe precise position of features in the input. This means that small movements inthe position of the feature in the input image will result in a different feature map.This can happen with re-cropping, rotation, shifting, and other minor changes tothe input image.A common approach to addressing this problem from signal processing iscalled downsampling. Thus, a lower resolution version of an input signal is createdthat still contains the large or important structural elements, without the fine detailthat may not be as useful to the task. Downsampling can also be achieved with con-volutional layers by changing the stride of the convolution across the image [88].A more robust and common approach is to use a pooling layer. A pooling layeris added after a nonlinearity (e.g. ReLU). Pooling layer generally takes a patch offeatures and performs operations like average (Average Pooling) of all activationoutput values or maximum value for each patch of the feature map (max-pooling).2.1.4 DropoutDropout is a regularization method that approximates training a large number ofneural networks with different architectures in parallel. During training, somenumber of layer outputs are randomly ignored or “dropped out”. By dropping aunit out, the method temporarily removes it from the network, along with all itsincoming and outgoing connections. Dropout simulates a sparse activation from agiven layer, which interestingly, in turn, encourages the network to actually learna sparse representation as a side-effect. As such, it is used as an alternative forregularizing the network [89].2.1.5 NormalizationBatch normalization is a technique for training very deep neural networks that stan-dardizes the inputs to a layer for each mini-batch. It has the effect of stabilizing thelearning process and dramatically reducing the number of training epochs requiredto train deep networks [37]. We explain mini-batch training later in Section 2.3.11By standardizing the activations of the prior layer, assumptions the subsequentlayer makes about the spread and distribution of inputs during the weight updatewill not change, at least not dramatically. It has the effect of stabilizing and speed-ing up the training process of deep neural networks.2.1.6 Fully-connected layerFully connected layers are an essential component of CNNs, which have beenproven very successful in recognizing and classifying images. The CNN processbegins with convolution and pooling, and the result of this process feeds into a fullyconnected neural network structure that drives the final classification decision. Thefully connected part of the CNN goes through its own backpropagation process todetermine the most accurate weights.2.2 ResNet, WRN and VGGClassic CNN network architectures were comprised simply of stacked convolu-tional layers. Modern architectures explore new and innovative ways of construct-ing convolutional layers in a way that allows for more efficient learning. Almostall of these architectures are based on a repeatable unit that is used throughout thenetwork. In this study we use three common CNN networks called residual net-works (ResNet) [32], wide residual networks (WRN) [96] and Visual GeometryGroup (VGG) [86].2.2.1 VGGVGG is one of the most common deep convolutional network for object recogni-tion. It was developed and trained by Oxford’s renowned Visual Geometry Group(VGG), which achieved very good performance on the ImageNet dataset [86]. Thisnetwork is characterized by its simplicity, using only 3×3 convolutional layersstacked on top of each other in increasing depth. Figure 2.3 shows the architec-ture for VGG-16.123x3 Conv3x3 ConvPooling3x3 Conv3x3 ConvPooling3x3 Conv3x3 ConvPooling3x3 Conv3x3 Conv3x3 ConvPooling3x3 Conv3x3 Conv3x3 ConvPooling3x3 ConvFC FC FCFigure 2.3: Architecure of VGG-16 network. Conv and FC denote convolu-tion and fully-connected layers, respectively.2.2.2 ResNet and WRNA feedforward network with a single layer is sufficient to represent any function.However, the layer might be massive, and the network is prone to overfitting thedata. Therefore, there is a common trend of using deep network architectures.However, increasing network depth does not work by simply stacking layers to-gether. Deep networks are hard to train because of the notorious vanishing gradientproblem [73] as the gradient is back-propagated to earlier layers, repeated multi-plication may make the gradient infinitely small. As a result, as the network goesdeeper, its performance gets saturated or even starts degrading rapidly. ResNetsolves this problem by introducing an identity shortcut connection that skips oneor more layers (residual block) [32]. Typical ResNet models are implemented withdouble or triple layer skips that contain nonlinearities (ReLU) and batch normal-ization in between. Figure 2.4 shows an example of a residual block in the ResNetarchitecture.Weight Layer Weight Layer ReLU+ReLUXXF(X)F(X) +XFigure 2.4: A simple residual block with skip connection for double layers.However, deep residual networks are able to scale up to thousands of layers13and still improve the performance; each fraction of a percent of improved accuracycosts nearly doubling the number of layers. Thus, training very deep residual net-works have a problem of diminishing feature reuse, which makes these networksvery slow to train. To tackle these problems, Zagoruyko et al [96] proposes anovel architecture where they decrease the depth and increase the width of residualnetworks and called it wide residual networks (WRNs).2.3 TrainingIn supervised training, which is our concern in this study, both the inputs and theoutputs are provided. The network processes the inputs and compares its resultingoutputs against the desired outputs (forward pass). Errors are then propagated backthrough the system, causing the system to adjust the weights which control thenetwork (backward pass) [81]. This process occurs over and over as the weightsare continually tweaked. Neural networks are trained using gradient descent, wherethe estimate of the error used to update the weights is calculated based on thetraining dataset. Therefore training can be separated into four distinct sections, theforward pass, the loss function, the backward pass, and the weight update. A lossfunction can be defined in many different ways, but a common one is MSE (meansquared error). Among different parts in training, the backward pass has the mostcomputations and consumes about 70% of the time during training 1.2. In thisstudy we focus on reducing the computations in the backward pass.The goal of supervised training is to get to a point where the predicted label(output of the network) is the same as the training label. In order to get there, theamount of loss (L) needs to be minimized. In other words, the goal of trainingis to find out which weights most directly contributed to the loss (or error) of thenetwork. This is the mathematical equivalent of a ∂L∂w where w are the weights ata particular layer. We perform a backward pass through the network to determinewhich weights contributed most to the loss and finding ways to adjust them sothat the loss decreases. Once this derivative is computed, the weight gets updatedaccordingly. The learning rate (η) is a parameter that is chosen by the programmer.A high learning rate means that bigger steps are taken in the weight updates, andthus, it may take less time for the model to converge on an optimal set of weights14(2.1).wi+1 = wi−η× ∂L∂wi (2.1)2.3.1 Training Convolutions and NotationIn this section we explain the notation we use in the forward and backward passconvolutions. As we know in CNNs, convolution is the predominant operation inthe forward and backward pass. The output of the lth layer in the CNNs’ forward-propagation is obtained by:yl+1 = wl⊗al (2.2)Where al and wl denote activations and weights at layer l, respectively, and ⊗ isthe convolution operation. In back-propagation the lth layer receives the outputgradient of the l+1th layer. The output gradient is the gradient of the loss (L)with respect to the layer’s output ( ∂L∂yl+1 ). The output gradient is used to computethe gradient of input activation ( ∂L∂al ) and the gradient of weights (∂L∂wl). The back-propagation convolutions for calculating gradient of inputs and weights at the lthlayer can be defined as [49]:∂L∂al=∂L∂yl+1⊗wtl (2.3)∂L∂wl=∂L∂yl+1⊗atl (2.4)2.3.2 Mini-batch TrainingMini-batch gradient descent is a variation of the gradient descent algorithm thatsplits the training dataset into small batches that are used to calculate model errorand update model coefficients. Therefore, mini-batch training applies the neededforward and backward pass calculations and updates the model parameters at eachiteration (mini-batch) [11, 27, 50].152.3.3 MomentumA very popular technique that is used along with SGD is called Momentum [91].Instead of using only the gradient of the current step to guide the search, momen-tum also accumulates the gradient of the past steps to determine the direction togo. The equations of gradient descent are revised as follows.zk+1 = β zk +∇ f(wk)wk+1 = wk−αzk+1(2.5)The first equations has two parts. The first term is the gradient that is retainedfrom previous iterations. This retained gradient is multiplied by a value called“Coefficient of Momentum” (β ) which is the percentage of the gradient retainedevery iteration. The change is innocent, and costs almost nothing. When β = 0,we recover gradient descent. But for β = 0.99, this appears to be the boost thenetwork needs.2.4 Sparse ConvolutionAs we described, the convolutional neural network (CNN) technique is built aroundthe sharing of weights. It is influenced by the structural architecture of the humanvisual system. CNNs are based on ideas that utilize local connectivity between neu-rons and hierarchically organized transformation of the input. Nodes form groupsof d-dimensional arrays known as feature maps. Each node in a given map receivesinputs from a certain window area of the previous layer, which is referred to as itsreceptive field. The convolution operation results in a much sparser NN than theMLP.Although CNN has fewer computations compared to MLP, it still has an exten-sive amount of computations. Thus, there are many different methods that sparsifythe inputs and/or weights to make computations less on CNNs. In sparse CNN, dueto the sparsity of the tensors, the multiplications in which one of the operands orboth of them are zeros can be skipped. Many of the initial works on neural networkfocus on removing unimportant connections by sparsifying the weights (pruning).162.4.1 Sparse Hardware AcceleretorsRecent works have examined how to support the processing of sparse weights andinputs in hardware efficiently. A variety of dedicated hardware accelerators forsparse matrix multiplication have been proposed. Most of the studies designeda parallel architecture comprising multiple processing elements using differentdataflows and compression methods.The DNN dataflow in the recent accelerators can be categorized based on thedata handling characteristics:1) Weight stationary: The weight stationary dataflow is designed to minimizethe energy consumption of reading weights by maximizing the accesses of weightsfrom the register file at the PE (processing element).2) Input stationary: The input stationary dataflow is designed to minimize theenergy consumption of reading inputs to PEs.3) Row stationary: The row stationary dataflow assigns the processing of a 1-Drow convolution into each PE for processing. It keeps the row of filter weightsstationary inside the register file of the PE and then streams the input activationsinto the PE.4) No local reuse (NLR): While small register files are efficient in terms ofenergy (pJ/bit), they are inefficient in terms of area. In order to maximize thestorage capacity, and minimize the off-chip memory bandwidth, no local storageis allocated to the PE and instead all that area is allocated to the global buffer toincrease its capacity.SCNN accelerator, which is our focus in this study, supports the processing ofconvolutional layers in a compressed format. It uses a stationary input dataflow todeliver the compressed weights and activations to a multiplier array. We explainSCNN in detail in Section 6.3.17Chapter 3Related WorkWe categorize the related work into four different groups. First, we will discussthe pruning methods, which result in sparse networks. Second, we cover existingsparse training methods. Third, we talk about the reusing of gradients in priortraining algorithms. Finally, we explain the hardware accelerators designed forsparse CNNs.3.1 Dense to Sparse Networks by Weight PruningCreating sparse networks by eliminating the weights has an extensive history. Le-Cun et al [48]; Karnin [43]; Hassibi and Stork [31] present the early work of net-work pruning using second-order derivatives as the pruning criterion. Han et al [28]propose parameter magnitude as the pruning criterion and introduced the pipelinewith three stages. These three stages are: 1) train a large, over-parameterized model(sometimes there are pretrained models available), 2) prune the trained large modelaccording to a certain criterion, and 3) fine-tune the pruned model to regain the lostperformance. The process of pruning and fine-tuning is often iterated several times,gradually reducing the network’s size. Many papers propose slight variations ofthis algorithm. For example, some papers prune periodically during training [25]or even at initialization [52]. Others modify the network to explicitly include ad-ditional parameters that encourage sparsity and serve as a basis for the pruningcriterion after training [67].18Pruning methods vary primarily in their choices regarding sparsity structure,pruning criterion, scheduling (when to prune), and fine-tuning.3.1.1 StructureSome methods prune individual parameters (unstructured pruning). Doing so pro-duces a sparse neural network, which, although smaller in terms of parametercount, may not gain speedups using current available libraries and hardware. Othermethods consider parameters in groups (structured pruning), removing entire neu-rons, filters, or channels to exploit hardware and software optimized for densecomputation [33, 53].3.1.2 Prunning CriterionIt is common to score parameters based on their absolute values, trained importancecoefficients, or contributions to network activations or gradients. Some pruningmethods compare scores locally, pruning a fraction of the parameters with the low-est scores within each structural subcomponent of the network (e.g., layers) [29].Others consider scores globally, comparing scores to one another irrespective of thepart of the network in which the parameter resides [24, 51]. In the same context,Structured Sparsity Learning (SSL) added group sparsity regularization to penalizeunimportant parameters by removing some weights [95]. Li et al [53] proposed aone-shot channel pruning method using the L1 norm of weights for filter selec-tion, provided that those channels with smaller weights always produce weakeractivations. Recently, channel pruning alternatively used LASSO regression basedchannel selection and feature map reconstruction to prune filters [33]. Anwar etal [7]performes structured pruning in convolutional layers by considering stridedsparsity of feature maps and kernels to avoid the need for custom hardware anduses particle filters to decide the importance of connections and paths. In contrastto previous pruning studies for deep deterministic models, Zahng et al [? ]roposeda pruning approach for deep probabilistic models by using the mask of the weights.193.1.3 SchedulingPruning methods differ in the amount of the network to prune at each step. Somemethods prune all desired weights at once in a single step [57]. Others prune afixed fraction of the network iteratively over several steps [29] or vary the rate ofpruning according to a more complex function [25].3.1.4 Fine-tuningFor methods that involve fine-tuning, it is most common to continue to train thenetwork using the trained weights from before pruning. Alternative proposals in-clude rewinding the network to an earlier state [24] and reinitializing the networkentirely [57].All the methods described in this section focus on gaining higher performanceand/or accuracy at inference. These methods often involve a re-training phase,which, contrary to our motivation, increases training time.3.2 Sparse TrainingMore recent studies try to find the sparse network during training through a prune,redistribute, and regrowth cycle. Mainly this group of studies prunes the networkduring training by adding more computation overhead to the training. The principalgoal of most of them is to gain a higher accuracy using their pruned network for theinference. Bellec et al [9]; Mocanu et al [65]; Mostafa and Wang et al [69] proposedifferent regrowth methods for sparsifying the networks through training. Dettmerset al [21] present faster training by sparse momentum, which uses the exponentiallysmoothed gradients as the criterion for pruning and regrowth weights. A differentapproach to accelerate training is sparsifying activations. Liu et al [54] introducesa dynamic sparse graph (DSG) structure, which activates only a small amount ofneurons at each iteration via a dimension-reduction and accelerates forward andbackward passes.Methods that maintain sparse gradients throughout training are most closely re-lated to our work. Sun et al [90] and Wei et al [94] introduce meProp, an algorithmwhich targets computation reduction in training by sparsifying gradients. Meprop20computes an approximate gradient by keeping top-k values of the backward outputgradient and masking the remaining values to 0. The forward propagation is com-puted as usual. However, during back-propagation, only a small subset of the fullgradient is computed to update the model parameters. Thus, by using meProp algo-rithm, the backward pass convolutions explained in Section 2.3.1 can be redefinedas follows:∂L∂al= Topk(∂L∂yl+1)⊗wtl (3.1)∂L∂wl= Topk(∂L∂yl+1)⊗atl (3.2)Both of the above equations are sparse calculations due to the sparsity of outputgradients. The amount of computation reduction and training speedup dependson the sparsity percentage of output gradients. The authors demonstrate mePropconvergence while training a network with two convolutional layers on the MNISTdataset at 95% gradient sparsity. However, they do not analyze larger datasets anddeeper networks.Golub et al [26] proposed Dropout which restricts randomly selected gradientupdates during each training iteration and limits the set of weights that can beupdated throughout the entire training process. This work decreases the number ofweights that must be stored during the training process, but it does not make thetraining computations sparse.3.3 Low precision neural networkReducing data precision or quantization is another viable way to improve the com-puting efficiency of DNN accelerators. The recent TensorRT results show that thewidely used NN models, including AlexNet, VGG, and ResNet, can be quantizedto 8 bit without inference accuracy loss. However, it is difficult for such a unifiedquantization strategy to retain the network’s accuracy when further lower precisionis adopted. Many complex quantization schemes have been proposed, however,significantly increasing the hardware overhead of the quantization encoder/decoderand the workload scheduler in the accelerator design. There are many studies thattried to minimize the hardware overhead while reducing the computation and mem-21ory footprint. For example, one of the recent methods called Gist [38] reduces thememory footprint by encoding schemes of inputs through training, and it has just4% performance overhead.Our work focuses on accelerating training by sparsifying the training compu-tation, and quantization and low periciosn methods are beyond the scope of ourstudy.3.4 Reusing GradientFull gradient ascent [12] with a constant step size achieves a linear convergencerate in the number T of iterations (i.e., parameter updates) [71]. However, eachiteration requires N gradient computations, which can be too expensive for largevalues of N.Stochastic Gradient (SG) ascent [11] overcomes this problem by sampling asingle sample xi per iteration, but a vanishing step size is required to control thevariance introduced by sampling. Starting from SAG, a series of variations to SGhave been proposed to achieve a better trade-off between convergence speed andcost per iteration, they called stochastic variance reduction (SVR) methods. SAG[80], SVRG [40], SAGA [19], Finito [20], and MISO [63] all are SVR methodsproposed for smooth strongly-convex optimization problems. The common idea ofthese methods is to reuse past gradient computations to reduce the variance of thecurrent estimate.We breifly explain the most common SVR algorithms in more details. BothSAGA and SAG can be derived from a variance reduction viewpoint: here X isthe SGD direction sample f ′j(xk), whereas Y is a past stored gradient f ′j(φ kj). InEqs. 3.3, 3.4 and 3.5 you can see the differences between resuing methodologies.SAG is obtained by using α = 1/n (notation used in Eq. 3.3) whereas SAGA is theunbiased version with α = 1 (Eq. 3.4). For the same φ ’s, the variance of the SAGupdate is 1/n2 times the one of SAGA, but at the expense of having a non-zerobias.22(SAG) xk+1 = xk− γ f ′j (xk)− f ′j(φ kj)n+1nn∑i=1f ′i(φ ki) (3.3)[19].(SAGA) xk+1 = xk− γ[f ′j(xk)− f ′j(φ kj)+1nn∑i=1f ′i(φ ki)](3.4)[19].(SVRG) xk+1 = xk− γ[f ′j(xk)− f ′j(x˜)+1nn∑i=1f ′i (x˜)](3.5)[19].The SVRG update (Eq. 3.5) is obtained by using Y = f ′j(x˜) with α = 1. Thevector x˜ is not updated every step, but rather the loop over k appears inside an outerloop, where x˜ is updated at the start of each outer iteration. Essentially SAGA isat the midpoint between SVRG and SAG; it updates the φ j value each time indexj is picked, whereas SVRG updates all of φ ’s as a batch. The usage of SAG vs.SVRG is problem dependent. For example for linear predictors where gradientscan be stored as a reduced vector of dimension p− 1 for p classes, SAGA is pre-ferred over SVRG both theoretically and in practice. For neural networks, whereno theory is available for either method, the storage of gradients is generally moreexpensive than the additional backpropagations, but this is computer architecturedependent. Also having to tune one parameter instead of two is a practical advan-tage for SAGA.Recent works explore the extension of SVR approaches to general non-convexproblems [3, 79]. However, the faster theoretical convergence rate of the SVRmethods is not a guarantee of better empirical performance in deep neural networks[18].ReSprop reuses gradients in a different way than SVR. ReSprop reuses gra-23dients between successive mini-batches to sparsify back-propagation calculations.The goal of ReSprop is reducing computation, not variance. We show that ourmethod reaches state-of-the-art accuracy with minimal loss while having 10× com-putation reduction in back-propagation for different network architectures withvarying widths and depths.3.5 Hardware Accelerators for Deep Neural NetworksIn the early stage of DNN accelerator design, accelerators were designed for theacceleration of approximate programs in general-purpose processing [23], or forsmall Neural Networks (NNs) [56]. Although the functionality and performanceof on-chip accelerators were very limited, they revealed the basic idea of AI-specialized chips. Because of the limitations of general-purpose processing chips,it is of necessity to design specialized chips for AI/DNN applications.On-chip accelerators:The neural processing unit (NPU) [23] is designed to use hardwarelized on-chip NNs to accelerate a segment of a program instead of running all parts ona CPU. The hardware design of the NPU is quite simple. An NPU consists ofeight processing engines (PEs). Each PE performs the computation of a neuron;that is, multiplication, accumulation, and sigmoid. Thus, what the NPU performsis the computation of a multiple layer perceptron (MLP) NN. The idea of usingthe hardwarelized MLP (NPU) to accelerate some program segments was veryinspiring. If a program segment is frequently executed and approximable, and ifthe inputs and outputs are well defined, then that segment can be accelerated by theNPU. To execute a program on the NPU, programmers need to manually annotate aprogram segment that satisfies the above three conditions. Next, the compiler willcompile the program segment into NPU instructions, and the computation tasksare off-loaded from the CPU to the NPU at runtime. Sobel edge detection and fastFourier transform (FFT) are two examples of such program segments. The idea ofthe NPU was the inspiration for many of the later studies.Stand-alone DNN/convolutional neural network accelerator:For broadly used DNN and CNN applications, stand-alone domain-specific ac-celerators have achieved great success in both cloud and edge scenarios. Compared24with general-purpose CPUs and GPUs, these custom architectures offer better per-formance and higher energy efficiency. Custom architectures usually require a deepunderstanding of the target workloads. The dataflow (or data reuse pattern) is care-fully analyzed and utilized in the design to reduce the off-chip memory access andimprove the system efficiency. The DianNao series [14] and the tensor processingunit (TPU) [41], which are academic and industrial examples, respectively, are thetwo most popular stand-alone accelerators which we discuss more in details.The DianNao series includes multiple accelerators with different features. Di-anNao is the first design of the series. It is composed of the following components:(1) A computational block neural functional unit (NFU), which performs computa-tions; (2) An input buffer for input neurons (NBin); (3) An output buffer for outputneurons (NBout); (4) A synapse buffer for synaptic weights (SB); (5) A controlprocessor (CP). The NFU, which includes multipliers, adder trees, and nonlinearfunctional units, is designed as a pipeline. Rather than a normal cache, a scratch-pad memory is used as on-chip storage because it can be controlled by the compilerand can easily explore the data locality. While efficient computing units are impor-tant for a DNN accelerator, inefficient memory transfers can also affect the systemthroughput and energy efficiency. The DianNao series introduces a special designto minimize memory transfer latency and enhance system efficiency.On top of the stand-alone accelerators, a domain-specific instruction set archi-tecture (ISA), called Cambricon [55], was proposed to support a broad range of NNapplications. Cambricon is a load-store architecture that integrates scalar, vector,matrix, logical, data transfer, and control instructions. The ISA design considersdata parallelism, customized vector/matrix instructions, and the use of scratchpadmemory. The successors of the Cambricon series introduce support to sparse NNs.Highlighted with a systolic array, Google published its first TPU paper (tpu1)in 2017. tpu1 focuses on inference tasks and has been deployed in Google’s datacenter since 2015. The structure of the systolic array can be regarded as a special-ized weight-stationary dataflow.A DNN/CNN generally requires a large memory footprint. For large and com-plicated DNN/CNN models, it is unlikely that the whole model can be mapped ontothe chip. Due to the limited off-chip bandwidth, it is of vital importance to increaseon-chip data reuse and reduce the off-chip data transfer in order to improve com-25puting efficiency. During architecture design, dataflow analysis is performed, andspecial consideration needs to be taken. Eyeriss accelerator [15] explored differentNN dataflows, including input-stationary, output-stationary, weight-stationary, andno-local-reuse dataflows, in the context of a spatial architecture and proposed therow-stationary (RS) dataflow to enhance data reuse.The efficiency of DNN accelerators can also be improved by applying efficientNN structures. saprsifying the network, for example, makes the model small yetsparse, thus reducing the off-chip memory access. The NN quantization allowsthe model to operate in a low-precision mode, thus reducing the required storagecapacity and computational cost.3.5.1 Hardware Accelerator for Sparse NetworksAs we discussed previous work shown that a large proportion of NN connectionscan be pruned to zero with or without minimum accuracy loss. Moreover, someof the activations are zero due to the Relu function. Thus, many correspondingcomputing architectures have also been proposed to exploit this sparsity. In par-ticular, these works employ Run Length Encoding (RLE) to compress the inputsor the kernels. The RLE not only compresses the data but also allows to enhancethe throughput by skipping the multiplications of the encoded data. While this ap-proach seems appealing, combining the sparsity exploitation with other techniques(i.e. tiling, layer merging, etc. explained later in this section) makes efficient dataflow management (and tiling) a non-trivial problem. Therefore, the existing ar-chitectures only exploit partial sparsity. Eyeriss [15], EIE [30], Cnvlutin [1], andLaconic [83] are some of the most well-known works that sought to remove multi-plications by zero-valued activations and/or weights.The authors of Cnvlutin achieved this by computing only non-zero inputs andusing an “offset” buffer, alongside the input buffer, to store the indices of eachinput’s corresponding weights after zero-skipping. A hardware controller fills theoffset buffer on the fly such that it does not consume extra bandwidth. To further in-crease acceleration, Cnvlutin prunes near-zero outputs during inference to increasethe sparsity of the next layer’s input buffer. Eyeriss, EIE, and Laconic’s authorsachieved benefits from pruning using similar strategies to those employed by Cn-26vlutin’s. EIE compresses and enhances the throughput only for the classificationlayers. It targets sparsity in both filters and feature maps only in fully-connectedlayers. Eyeriss exploits sparsity of all the layers, however, it cannot enhance thethroughput (by skipping redundant computations), and it compresses only the in-puts.However, the special data format and extra encoder/decoder adopted in thesedesigns introduce additional hardware costs. Some works discuss how to de-sign NN models in a hardware-friendly way, such as by using block sparsity.Techniques that can handle irregular memory access and an unbalanced work-load in sparse NN have also been proposed. For example, Cambricon-X [97] andCambricon-S [99]address the memory access irregularity in sparse NNs through acooperative software/hardware approach. Cambricon-X first marks non-zero neu-rons one by one, filters out zero-valued neurons, and then sends the neurons tothe computational units for processing and eliminates unnecessary computationand weight storage. CNV [1] proposed a new data structure format for storing theinputs and outputs that enables the seamless elimination of most zero operand mul-tiplications. The CNV storage format enables it to move the decisions on whichcomputations to eliminate off the critical path. ReCom [39] proposes a ReRAM-based sparse NN accelerator based on structural weight/activation compression.Some other studies like Bit-pragmatic [2] skip zero-valued bits. Bit-pragmaticexploits activation sparsity in forward pass. In this work, they propose Pragmatic(PRA), a massively data-parallel architecture that eliminates most of the ineffectualcomputations on-the-fly. The idea behind it is using serial-parallel shift-and-addmultiplication while skipping the zero bits of the serial input.Thus, many previous works have studied accelerating the sparse convolutionduring forward propagation, and they focused on different data flows and compres-sion methods during inference. However, none of them focus on an accelerator forsparse back-propagation convolutions.The Sparse CNN (SCNN) accelerator [72] improves performance and energyefficiency by exploiting the zero-valued weights and activations during inference(forward pass). In order to evaluate ReSprop on a custom hardware accelerator, wepropose an accelerator based on SCNN. Our accelerator (GSCN) can be used forboth sparse forward and backward pass convolutions.27Chapter 4Gradient ReuseIn this section, we explain our idea of reusing gradients and the insights behindit. First, we explain CNN convolutions and summarize the notation we will usethroughout the remainder of this thesis. Then, we elaborate our hypothesis aboutgradients undergoing minor changes between consecutive iterations and proposea reuse strategy. Following this argument, we discuss our experiment designedto evaluate our resue strategy. Finally, we end this chapter with a more detailedcomparison between our work and the most related prior work, meProp [90].4.1 PreliminariesIn Section 2.3.1 we explained the convolutions calculated in training. In summary,the output of the lth layer in the CNNs’ forward-propagation (Eq. 4.1), and theback-propagation convolutions for calculating gradient of inputs (Eq. 4.2) andweights (Eq. 4.3) at the lth layer is defined as follows (we use same notation as inSection 2.3.1):yl+1 = wl⊗al (4.1) ∂L∂al =∂L∂yl+1⊗wtl (4.2)∂L∂wl=∂L∂yl+1⊗atl (4.3)In this study, mini-batch training allows us to leverage the correlation amongoutput gradient components of consecutive iterations and facilitates reusing the28output gradient components. We use the term “gradients” to refer to individualcomponents of the gradient vector throughout this study.4.2 Approach and Key InsightOur approach to accelerate CNN training is to modify back-propagation convo-lutions. The output gradient vector and in turn the vectors dependant on it (Eq.4.2 and 4.3) are updated in the backward pass. In essence, ReSprop precalculatesa portion of the output gradient vector, and this, in turn, enables precomputing aportion of the backpropagated values.We conjectured that there are a large number of similar features between train-ing samples, and this motivated us to explore reusing the output gradients amongmini-batches. We focus on the feasibility of reusing a subset of the output gradientsbetween consecutive iterations and measure the inter-iteration similarity of outputgradients. We propose a reuse strategy to leverage precalculated output gradientsfrom the previous iteration while performing computation only for significantlychanged output gradients in the current iteration (mini-batch). We define our reusestrategy as follows: If a component of an output gradient compared to its previousiteration changes more than an adaptive threshold then we use the current (ith) iter-ation value; otherwise, we reuse the value of the previous iteration. We introduce avector we call the hybrid output gradient (HG). We define HG such that it containsx% of the previous iteration’s gradients and (100− x)% of the current iteration’sgradients. Here, x% is called the reuse percentage. The HG for layer l at iterationi is defined as:(HGl)i = (∂L∂yl+1)i−1+T hl[(∂L∂yl+1)i− ( ∂L∂yl+1 )i−1] (4.4)We use the notation (al)i to denote the value of vector a at layer l and iterationi. Each layer has its own adaptively adjusted threshold (Tl), which satisfies thereuse percentage. The function T hl(V ), where T h stands for “Threshold”, at layer29l applied to output gradient vector V is defined as:∀vi ∈V : ui =vi |vi|> Tl0 |vi| ≤ Tl (4.5)where ui represents the elements of output vector T hl(V ) and Tl is a per layeradaptive threshold. In Section 5.1, we explain how to use (HGl)i to sparsify backpropagation using ReSprop.In Section 4.3, we empirically show that HGl is a good approximation to theoriginal output gradient ( ∂L∂yl+1 ), and that it is feasible to train the network with theHG vector. To study the correlation between HG and the original output gradient,we investigate the angle preservation using cosine similarity.4.3 Angle Preservation and comparison with mePropTo study the correlation between HG and the original output gradient, we inves-tigate the angle preservation property of the HG vector. We calculate the cosinesimilarity between HG and the original output gradient to measure the angle be-tween these vectors. According to hyperdimensional computing theory [42], twoindependent isotropic vectors picked randomly from a high dimensional space d,are approximately orthogonal. If there is no correlation between the HG vector andthe original output gradient, they would make an angle of approximately 90◦. Onthe other hand, Anderson et al[6] show that binarizing a random vector in high di-mensional space d (d→ ∞), preserves the vector direction with minimal changes,and a random vector and its binarized version form an angle of around 37◦. Accord-ing to Anderson et al.’s observations, in a high dimensional space 37◦ is a relativelysmall angle between two vectors, so that both vectors have similar directions.Figure 4.1 demonstrates the angle between the original output gradient vectorand both the HG vector (dark green bar) and meProp gradient (light blue bar). Asshown at 1 , the angle between output gradient vectors of consecutive iterationsis close to 90◦. This indicates that successive output gradients are approximatelyorthogonal. However, we observe that reusing a subset of output gradient in con-secutive iterations, via HG reuse strategy, reduces the angle between the original3002040608010030% 50% 60% 70% 80% 90% 100%AngleSparsity percentage (meProp) / Reuse percentage (HG)θ=<HG,Original grad> θ=<meProp,Original grad>θ=90° θ=37°1Figure 4.1: HG and meProp angles for different reuse percentages and spar-sities, respectively. The angle is calculated by finding the average angleof all layers while training ResNet-18 on CIFAR-10 for 100 iterations(batch size=128).Reuse HG Val Acc Sparsity meProp Val Acc50% 84.21±0.09 50% 84.14±0.0860% 84.11±0.06 60% 64.29±0.0770% 83.87±0.10 70% 50.65±0.1380% 78.40±0.14 80% 41.67±0.2590% 73.14±0.17 90% 23.67±0.23Table 4.1: Validation accuracy of meProp and reuse strategy (HG) with dif-ferent sparsities and reuse percentages, repectively. Training ResNet-18on CIFAR-10 for 30 epochs (batch size = 128, lr = 0.1 and optimizer =SGD).output gradients and the HG vector to less than 37◦. We compare this strategy withmeProp [90] by studying the angle preservation property and the validation accu-racy of these algorithms. The meProp algorithm sets output gradients not rankedin the Top-K by magnitude to zero and calculates Eq. 4.3 and 4.2 with the sparseoutput gradient. Figure 4.1 shows the angle between the original output gradientand meProp. Since cosine similarity is undefined for a zero vector, the angle for100% sparse meProp is not presented. We can see HG preserves the original out-put gradient direction far better than meProp’s sparse output gradient. Table 4.131further verifies the network convergence while reusing gradients. This table showsthe validation accuracy of reusing output gradients with small magnitude change(HG Val Acc) compared to setting small magnitude gradients to zero (meProp ValAcc). MeProp has considerably less validation accuracy versus HG after 30 epochsof training. The gap between HG and meProp validation accuracy is more pro-nounced at higher sparsity percentages. Further, Table 5.1 shows the accuracy ofReSprop (using HG) improves further after 200 epochs of training CIFAR-10.32Chapter 5ReSprop Algorithm5.1 ReSprop: Reuse-Sparse-BackpropThis section describes ReSprop, an efficient back-propagation algorithm, whichwe developed to exploit the reusability of gradients. We reformulate the back-propagation convolutions based on the HG vector, which leads to sparse convolu-tions and a training speedup. The HG vector in Eq. 4.4 at iteration i can be splitinto two separated parts: One, ReHG (“Reused HG”) the output gradient of the pre-vious iteration ( ∂L∂yl+1 )i−1 and is computed and stored before the current iteration;two, SpHG (“Sparse HG”) the result of T h[( ∂L∂yl+1 )i− (∂L∂yl+1)i−1]. SpHG is sparsedue to the threshold function. Using these definitions Eq. 4.4 can be rewritten asfollows:(HGl)i = (ReHGl)i+(SpHGl)i (5.1)By replacing the output gradient in Eq. 4.3 and 4.2 with the HG vector defined inEq. 5.1 the back-propagation convolutions can be rewritten as:(∂L∂wl)i = ((ReHGl)i⊗ (atl)i)︸ ︷︷ ︸1 Pre∇wl+((SpHGl)i⊗ (atl)i)︸ ︷︷ ︸2 Sparse∇wl(5.2)(∂L∂al)i = ((ReHGl)i⊗ (wtl)i)︸ ︷︷ ︸1 Pre∇al+((SpHGl)i⊗ (wtl)i)︸ ︷︷ ︸2 Sparse∇al(5.3)33Algorithm 1 ReSprop forward pass for lth convolutional layer at iteration i.1: for l = 1 to Layers do2: Receive: random sample ( ∂L∂yl+1 )i−13: (yl+1)i = (wl)i⊗ (al)i4: (pre∇wl)i = (random ( ∂L∂yl+1 )i−1)⊗Avg(atl)i5: (pre∇al)i = (wtl)i⊗ (random( ∂L∂yl+1 )i−1)6: end forUsing ReHG+ SpHG in the back-propagation convolutions as shown in Eq.5.2 and 5.3 allows us to break calculations into two parts labeled 1 and 2 . Part 1represents the precomputed portion and can be calculated in parallel with forward-propagation, before the current iteration’s backward-propagation starts and part2 is where computation is saved using sparse convolution due to the sparsity ofSpHG. We name the above algorithm ReSprop. We call the process for calculatingpart 1 pre-ReSprop (Alg. 1) and the process for calculating part 2 back-ReSprop(Alg. 2). Varying reuse percentage leads to different levels of sparsity in back-ReSprop. Thus, we name the sparsity generated by our algorithm reuse-sparsity(RS). As shown in Alg. 2 (lines 5 and 7), in ReSprop the back-propagation con-volutions are sparse, and RS percentage is the main factor that defines the amountof computation reduction. In Section 7.1, we analyze the accuracy of ReSprop andshow that at 90% RS, it loses negligible (less than 1.1%) accuracy for differentdatasets and has higher accuracy compared to DSG and meProp sparse trainingalgorithms.5.1.1 Stochastic Output GradientStoring the output gradients for an entire mini-batch at each iteration as implied byEq. 4.4 to 5.3 creates a substantial memory overheads. We define full mode Re-sprop as a variant of ReSprop in which we store the output gradient for all samplesin a minibatch. A simple approach for reducing the memory overheads and de-creasing the computation in pre-ReSprop is to use the average output gradients ofthe previous mini-batch. We call this variant average mode ReSprop. In averagemode, we add the extra step of computing average of gradients over the mini-batch.34(al)iLayer parametersPre-ReSprop Forward convolution(Yl+1)iSave parameters for backward passThreshold( ) 𝜕L 𝜕wl i( ) 𝜕L 𝜕al i(Pre ∇wl)i(Pre ∇a l) i(sparse ∇wl)i(sparse ∇al)iRand( ) 𝜕L 𝜕yl+1 i-1(wl)i(al)i (wl)i-++Back-ReSprop (sparse conv)Forward passBackward passf ∇f (al+1)iLayer L-1Layer L+1Layer L( ) 𝜕L 𝜕al+1 i( ) 𝜕L 𝜕yl+1 iRand( ) 𝜕L 𝜕yl+1 i-1Figure 5.1: Training with ReSprop for layer l at iteration i.To avoid the extra step of averaging, stochastic sampling of the previous iteration’soutput gradient can be used in the ReSprop algorithm. We call this variant stochas-tic mode ReSprop. Our results indicate that using stochastic sampling of the outputgradient does not decrease the accuracy of ReSprop compared to average or fullmode. Table 5.1 shows the validation accuracy results for training ResNet-18 withCIFAR-10 using full, average, and stochastic mode variants of ReSprop after 200epochs. Storing and using the output gradient vector of a random sample at eachiteration significantly reduces the computation and memory cost of the ResProp.Below we use ReSprop as a shorthand for stochastic mode ReSprop.Algorithms 1 and 2 show the forward and backward pass calculations, respec-tively, for ReSprop. The convolutions needed for computing pre∇a and pre∇win the full mode are shown respectively in Figure 5.2a and 5.2c. We decrease thememory and computation overheads needed for convolutions in pre-ReSprop by35Algorithm 2 ReSprop backward pass for lth convolutional layer at iteration i.1: for l = Layers to 1 do2: Receive:( ∂L∂yl )i, random sample (∂L∂yl+1)i−13: Calculate (SpHGl)i4: Receive: (pre∇wl)i from forward pass5: ( ∂L∂wl )i = (pre∇wl)i+(SpHGi,l⊗ (atl)i)6: Receive: (pre∇al)i from forward pass7: ( ∂L∂al )i = (pre∇al)i+((wtl)i⊗ (SpHGl)i)8: Update (wl)i with ( ∂L∂wl )i9: Send ( ∂L∂al )i to previous layer10: end forRS Full (HG) Avg Stochastic50% 94.54±0.04 94.71±0.06 94.69±0.0460% 94.38±0.08 94.58±0.03 94.66±0.0770% 94.36±0.03 94.52±0.04 94.53±0.0980% 93.18±0.16 93.28±0.12 93.51±0.1290% 91.10±0.11 91.82±0.07 91.43±0.11Baseline: 94.42±0.08Table 5.1: Validation accuracy of full, average and stochastic ReSprop forResNet-18 on the CIFAR-10 dataset for 200 epochs (batch size = 128, lr= 0.1, optimizer = SGD, avg of 3 runs).a factor of mini-batch size when we use stochastic or average mode. The con-volutions for stochastic mode is shown in Figure 5.2b and 5.2d. For computingpre∇a in Figure 5.2b, one random output gradient (K×H×W ) out of N samplesis chosen and convolved with weights, producing one sample pre∇a, which thenis replicated N times for all the N samples. Similarly, for computing pre∇w inFigure 5.2d, a random output gradient (K×H ×W ) out of N samples is chosenand reshaped into the desired shape (K× 1×H ×W ). Since in stochastic mode,we use the output gradient of a random sample, the output gradient is the samefor all the convolutions for computing pre∇w. Thus, due to the distributive prop-erty of convolutions, instead of convolving a random sample with all N inputs, we36weightCN (samples)N (samples)=( 𝜕L 𝜕yl+1 )i-1(Pre ∇al)i(a) Compute pre∇a in full mode=( 𝜕L 𝜕yl+1 )i-1weightC(Pre ∇al)iRandom(b) Compute pre∇a in stochastic modeC =N (samples)( 𝜕L 𝜕yl+1 )i-1 (Pre ∇wl)iInputK(c) Compute pre∇w in full modekC =K( 𝜕L 𝜕yl+1 )i-1(Pre ∇wl)iAverage of N InputsRandomAvgAvg(d) Compute pre∇w in stochastic modeFigure 5.2: Back-propagation convolutions in stochastic mode compared tofull mode for layer l at iteration i.can average the inputs and then convolve the average input with a random gradientsample as shown in Eq.5.4 for layer l and visulized in Figure 5.2d.(pre∇wl)i = (random (∂L∂yl+1)i−1)⊗Avg(atl)i (5.4)Figure 5.1 demonstrates the computation flow of ReSprop for the forward andbackward pass. The Back-ReSprop box in the figure represents backward convo-lutions which are sparse (lines 5 and 7 in Alg. 2). The computation overhead ofReSprop for the forward pass computations (pre-ReSprop) is shown in Figure 7.2;this overhead is less than 2% for batch sizes larger than 128.5.1.2 Warm UpNarang et al [70] and Zhu et al [100] show that gradually increasing the spar-sity percentage as training proceeds results in less drop in the model’s final ac-curacy compared to maintaining a constant rate of sparsity during training. Weapply the same approach and gradually increase the reuse-sparsity until reachinga targeted rate of reuse-sparsity. We call this approach warm up ReSprop (W-37ReSprop). In W-ResProp, we increase the sparsity percentage linearly in the firstm (m number o f epochs) epochs until we get to the targeted reuse-sparsity. W-ReSprop helps the model adapt to gradient reuse, and it noticeably increases thenetwork accuracy at high reuse-sparsities compared to base ReSprop. Results forW-ReSprop are shown and compared to base ReSprop in Section 7.1.38Chapter 6GSCNIn recent years, custom hardware designs have been demonstrated to be an effec-tive hardware platform to accelerate CNN inference and training. However, mostexisting architectures focus on sparse CNN inference. The architecture designedfor sparse CNN inference is inefficient when executing sparse training. In thischapter, we explain our accelerator for sparse training, which is designed based onthe SCNN accelerator.6.1 Accelerator for Sparse TrainingSparse backward pass convolutions in ReSprop can be accelerated by exploitingthe zero-valued elements in SpHG vector and zero-valued activations that arisefrom the common ReLU operator.Among available accelerators, the SCNN accelerator is an accelerator for compressed-sparse convolutional neural networks proposed by [72]. It employs an efficientdata flow that enables maintaining the sparsity in a compressed encoding. This ar-chitecture eliminates unnecessary data transfers and reduces storage requirements.Although SCNN improves the performance for sparse forward convolution, it suf-fers from underutilization when the input sizes are small. Moreover, SCNN cannot be used for backward pass convolutions since while calculating the gradientsof weights (Eq. 4.3), SCNN architecture introduces many unused products.We propose a generic sparse convolution accelerator (GSCN) which improves39N (samples)WeightK* = NOutput InputH RSW(a) Forward pass convolutionWeightCGradient of Input N (samples)Output GradientN (samples)* =(b) Backward pass convolution for calculating gradients of inputs Input Gradient of WeightkCK * =Output Gradient(c) Backward pass convolution for calculating gradients of weightsFigure 6.1: Forward pass and backward pass convolutions for N input sam-ples with C channels and K filters each with C channels.SCNN by:• Implementing different data and workload distribution among processing el-ements (PEs) to solve PE underutilization.• Predicting unused computations produced by the SCNN architecture andskip them.406.2 Computation of Convolutional LayersAs discussed in Section 2.1, the core operation in a CNN convolutional layer is a2-dimensional sliding-window convolution of an R×S element filter over a W ×Helement input plane to produce a (W −R+1)× (H−S+1) element output plane.In CNN networks, filter sizes are usually small (3× 3 or 1× 1), and padding isapplied to the input. Thus the output dimensions in many cases are close to theinput size. There can be multiple (C) input planes, which are referred to as inputchannels, and multiple filters (K) can be applied to the same input to produce Koutput channels. In mini-batch training, a mini-batch of length N input planes isapplied to the same filter volume. Figure 6.1a shows N inputs convolving with Kfilters as described in Section 2.1.During back-propagation convolutions, the output gradient is convolved withweights, and input activations (Eq. 4.3 and 4.2). The output gradient has K chan-nels. Thus, to convolve the output gradient with weights and input activations andhave the desired shape, we need to reshape the inputs, weights, and gradients. Asshown in Figure 6.1b the weights’ dimensions are reshaped from (K×C×R× S)to (C×K×R×S). In this way, for each sample, K output gradients are convolvedwith K filters and produce one channel of input gradients. In Figure 6.1c the out-put gradients and inputs are reshaped to (K×N×W ′×H ′) and (C×N×W ×H),respectively. Thus, N output gradients are convolved with N inputs to produceone channel of weight gradients. These reshapings can be done before saving theweights, gradients, and inputs into the GSCN buffers.6.3 Background on SCNNSince GSCN is based on SCNN, in this section, we discuss the SCNN data flowand architecture in detail. The baseline data flow is based on SCNN’s data flow.The SCNN data flow employs an input stationary computation order in which aninput activation is held stationary at the computation units as it is multiplied by allthe filters needed to make all its contributions to each of the K output channels.SCNN also factors the K output channels into K/Kc output-channel groups of sizeKc, and only store weights and outputs for a single output-channel group at a time411 3PE1Convolution 3* =3PE3 Accumulator buffers* =1123InputWeightDRAM23 44DRAKHWRS2343211 3PE1Convolution 3* =1PE1 Accumulator buffers* =4321 1 3PE1Convolution 3* =2PE2 Accumulator buffers* =43211 3PE1Convolution 3* =4PE4 Accumulator buffers* =4321(a) Data and workload distribution over four PEs for first channel in SCNN.1PE1 Accumulator buffer* =HW1123InputWeightDRAM23 44DRAKHWRS232PE2 Accumulator buffer* =HW3PE3 Accumulator buffer* =HW4PE4 Accumulator buffer* =HW1 23 41 23 41 23 41 23 4(b) Data and workload distribution over four PEs for first channel in GSCN.Figure 6.2: Figure shows an example of SCNN and GSCN data and workloaddistribution while having four PEs. Input and filter assignment to eachPE for SCNN and GSCN are different.inside the weight and accumulator buffers. In this section, we try to explain SCNNdataflow and architecture by referring to Figure 6.3 for SCNN architecture, 6.2afor data and workload distribution and Alg. 3 for SCNN dataflow and operations.SCNN dataflow requires buffers for filters and inputs, and an accumulatorbuffer to store the partial sums of the output activations. These buffers are shown inFigure 6.3 1 . SCNN employs a tiling strategy to spread the work across an arrayof PEs (P is the number of available PEs) so that each PE can operate indepen-dently. The W ×H input plane breaks into smaller W×HP element tiles (Wt×Ht)that are distributed across the PEs. Thus each PE would get Wt×Ht portion of theinput. All the Kc weights are broadcast to the PEs, and each PE operates on its ownsubset of the input and output space (Alg.3 line 2). The PEs need to communicatewith each other since they contain incomplete, partial sums at the edge of tiles (ha-los). As shown in 4 Figure 6.3, there is a specific unit for Relu and compressionoperations and this unit also takes care of finding the haloes and communications42IndicesIndicesF,IFxI Multiplier Array (cartesian product)FxIW,H,KIndicesInput BufferFilter BufferIndicesBuffer BankBuffer BankReLUCompress and HalosOutput BufferCompute W,H,K Coordinates FxI – Arbitrated xbar111234Figure 6.3: SCNN PE microarchitecture employing SSCN data flow [72].between PEs. In SCNN from the filter buffer, a vector of F filter-weights is fetched,and with a vector of I inputs fetched from the input activation buffer are deliveredto an array of F× I multipliers to compute a full Cartesian product (CP) of outputpartial-sums; as shown in Figure 6.3 2 and Alg.3 line 11. This all-to-all operationis useful since each fetched weight is reused (via wire-based multicast) over all Iactivations; each activation is reused over all F weights. The dataflow for each PEis shown in Alg. 3. As it is shown in lines 2 and 7, each PE will get a portionof whole input Wt×Ht and all Kc filters. Figure 6.2a further demonstrates dataand workload distribution between PEs. In this figure, we assume an SCNN ar-chitecture with a total of 4 PEs (p = 4). Thus, each PE will get 1/4 of the inputchannel.As explained SCNN uses the cartesian product to exploit the parallelism ofmany multipliers within a PE, and it fetches a vector of F filter-weights from theweight buffer, and a vector of I inputs from the input activation buffer. These valueswill multiply in an array of F × I multipliers to compute a full cartesian productof output partial-sums. SCNN will exploit the same property to make computa-tion efficient on compressed-sparse weights and input (Figure 6.3). Parallel to thenon-zero cartesian product, the coordinates in dense output is computed using theindices from the sparse-compressed filters and inputs as shown in Figure 6.3 3(Alg. 3 lines 12, 13 and 14). The F× I products are delivered to an array of A ac-cumulator banks, indexed by the output coordinates (Alg. 3 line 15). To reduce thecontention among products hashed to the accumulator bank, A is set to be largerthan F× I. SCNN shows that A = 2×F× I sufficiently reduces accumulator bank43Algorithm 3 SCNN dataflow for each PE1: BUFFER W-buf[C][Kc∗R∗S/F ][F ]2: BUFFER In-buf[C][Wt ∗Ht/I][I]3: BUFFER Acc-buf[Kc][Wt+R−1][Ht+S−1]4: BUFFER Out-buf[K/Kc][Kc∗Wt ∗Ht]5: for k′ = 0 to K/Kc−1 do6: for c = 0 to C−1 do7: for a = 0 to (Wt ∗Ht/I)−1 do8: In[0 : I−1] = In-buf[c][a][0 : I−1];9: for w = 0 to (Kc∗R∗S/F)−1 do10: wt[0 : F−1] = W-buf[c][w][0 : F−1];11: for (parallel for)(i = 0 to I−1)× ( f = 0 to F−1) do12: k = Kcoord(w, f );13: x = Xcoord(a, i,w, f );14: y = Y coord(a, i,w, f );15: acc-buf[k][x][y]+ = in[i]∗wt[ f ]16: end for17: end for18: end for19: end for20: Out-buf[k′][0 : Kc∗Wt ∗Ht−1] =Acc-buf[0 : Kc−1][0 : Wt−1][0 : Ht−1]21: end forcontention. The output then goes through ReLU and compression, and the partialsums at the tile edges are computed and sent to other PEs (Figure 6.3 4 ).6.3.1 SCNN LimitationsIn this section, we describe the SCNN limitations. As we described, SCNN dataflowand architecture are computation efficient for sparse inference. However, SCNNhas two main problems:1) SCNN data flow performs well when the input dimension (W ×H) is bigenough to utilize all the PEs. For most of the classification networks, the size ofthe input and accordingly, the size of the output and gradient of output in deeperlayers become small. Thus, tiling (dividing) the small input size across PEs (Wt×44Ht) in SCNN data flow will cause significant underutilization of PEs and loss ofperformance for most of the state of the art classification networks in deep layers.For example, the 16x16 input can not be divided between 64 PEs, and many PEswould be idle.Cartesian product2221102023 41=xOutput gradient with padding14x14Weight2x223 41 24 51 23 41Cartesian Result: 1x21, 2x22, 2x21, 1x22Cycle N in a PE when I = {21,22} and F={1,2}Cartesian productCartesian Result: 3x21, 4x22, 4x21, 3x22`23 41 23 41 23 41 Cycle N+1 in a PE when I = {21,22} and F={4,5}12343421 2223 24`21 2223 24`Figure 6.4: When the size of filter is smaller than input size, SCNN architec-ture has negligible amount of unused products.2) The SCNN cartesian product works well for the forward pass convolutionssince the filter dimensions (R× S) are smaller than input dimensions (W ×H).However, each output product in the cartesian product mostly yields a useful partialsum. However, in gradients of weights convolution (Figure 6.1c) the input windowis almost the same size as the output gradient. Thus many of the output productsfrom the cartesian product would not be used, so we call them unused products. InFigure 6.5, we demonstrate the unused products produces when the size of input(14×14) and gradients (12×12) are close. In this example, the size of F and I is 2.Thus, at each cycle, the cartesian product produces four products. As shown in thefigure after the first cartesian product (cycle N), there is no need to slide the inputwindow further, and we need to skip the rest of the computations (cycle N+1) andother upcoming cycles till we read the next filter row into the F or read the next I45-1-221 2223 24-3-4Cartesian product22211020=xInput14x14Output gradinet 12x12Cartesian Result: 1x21, 2x22, 2x21, 1x22Cycle N in a PE when I = {21,22} and F={-1,-2}Cartesian productCartesian Result: -3x21, -4x22, -4x21, -3x22`Cycle N+1 in a PE when I = {21,22} and F={-3,-4}22211020 -2 -3-1 -5 -6-4 22211020 - -3- -5 -6-4 22211020 -2 -3-1 -5 -6-422211020-2 -3-1 -5 -6-4 22211020-2 -3-1 -5 -6-4 22211020-2 -3-1 -5 -6-4-2 -3-1 -5 -6-4-3-421 2223 24`Figure 6.5: When the size of two operands are close in the convolution theSCNN architecture produces many unused products.input activations.The number of unused products is minor when size of the filter is smaller thanthe input size (Figure 6.4). The forward pass convolutions and one of the con-volutions in the backward pass, which calculated gradient’s of inputs, are in thiscategory. The cartesian product for two cycles convolving gradients (14×14) withfilters (2×2) in SCNN is shown in Figure 6.4 for these cases. Size of F and I aretwo and four parallel products computed in each cartesian product. As shown inthis example, unlike Figure 6.5, the cartesian products are useful in both cycles andare part of the convolution calculations.Table 6.1 shows the percentage of the cartesian product (4×4) that its productsare not useful (all the 16 products are not used) over the total number of cartesianproducts in for gradient of weight convolution using SCNN architecture. I and Fare four, and SCNN has in total 64 PEs.46ResNet18 ResNet34Unused cartesian products (%) 64.91% 69.6%Table 6.1: The table shows percentage of unused cartesian product over totalcartesian product in SCNN architecture for gradient of weights convolu-tion.6.4 GSCN Data FlowWe propose a Generic Sparse Convolutional Neural network (GSCN) with slightdifferences from SCNN data flow to solve the first problem mentioned in Section6.3.1. This data flow employs the input stationary the same as SCNN, but it over-comes the underutilization problem. In GSCN’s data flow, the input is not divided,but instead, the whole input is broadcasted to all the PEs. In this way, overcome theunderutilization problem. In GSCN, instead, a subset of filters is assigned to eachPE as follows. Thus, the size of input and filter buffer is different form of SCNN.The input buffer is P× larger and the filter buffer is P× smaller.Assume the number of PEs is P, and we have K filters which same as SCNNare grouped to Kc filters. In GSCN, each PE gets Ks = ∗Kcp filters, and the partialsums are stored in the accumulator buffers. In GCSN’s data flow, there is no needto communicate between PEs since each PE contains complete partial sums. Alg.4 shows the GSCN dataflow. As it is shown in lines 1, 2, and 4, the size of buffersis different from SCNN. At line 5, GSCN’s loop is over Ks filters instead of Kcfilters, and at line 7, the loop is going over all the input elements. Note that the restof the algorithm is almost the same as SCNN apart from the accumulator buffersize (lines 15 and 20).Figure 6.2b further demonstrates our proposed data flow. In this example, it isassumed that GSCN architecture has 4 PEs. As it is shown, each PE gets wholeinput (W ×H), while filters are divided between PEs.GSCN, like SCNN, uses available compression methods [30, 92] to compresssparse filters and inputs to reduce both arithmetic operations and data movement.The GSCN encoding includes a data vector consisting of the non-zero values and anindex vector that includes the number of non-zero values followed by the number of47Algorithm 4 GSCN dataflow for each PE1: BUFFER W-buf[C][Ks∗R∗S/F ][F ]2: BUFFER In-buf[C][W ∗H/I][I]3: BUFFER Acc-buf[Ks][W +R−1][Ht+S−1]4: BUFFER Out-buf[K/Ks][Ks∗W ∗H]5: for k′ = 0 to K/Kc−1 do6: for c = 0 to C−1 do7: for a = 0 to (W ∗H/I)−1 do8: In[0 : I−1] = In-buf[c][a][0 : I−1];9: for w = 0 to (Ks∗R∗S/F)−1 do10: wt[0 : F−1] = W-buf[c][w][0 : F−1];11: for (parallel for)(i = 0 to I−1)× ( f = 0 to F−1) do12: k = Kcoord(w, f );13: x = Xcoord(a, i,w, f );14: y = Y coord(a, i,w, f );15: acc-buf[k][x][y]+ = in[i]∗wt[ f ]16: end for17: end for18: end for19: end for20: Out-buf[k′][0 : Ks∗W ∗H−1] = Acc-buf[0 : Ks−1][0 : W −1][0 : H−1]21: end forzeros before each value. The 3-dimensional R×S×K filter is effectively linearized,enabling full compression across the dimension transitions. The activations areencoded in a similar fashion but across the H×W ×C dimensions. As the fittersare divided among the PEs, each tile of compressed filters is actually R×S×Ks.The total size of the accumulator buffer for GSCN (∗Kcp × out put ′s size) doesnot differ from SCNN accumulator size (Kc× out put ′s sizep ). Specifically, the inputbuffer size needs to be P times larger than SCNN, while the weight buffer size is Ptimes smaller than the SCNN weight buffer. This is also visible by comparing thefirst four lines of Alg. 3 and Alg. 4.To avoid having a significant input buffer size, GSCN can divide the inputspace into sub-volumes so that the collection of PEs operates on a sub-volume of48the inputs at a time. In this way, DRAM accesses for one temporal portion can behidden by pipelining them in tandem with the computation of another portion. Thesub-volume input space implementation is part of our future works.6.5 GSCN ArchitectureIn this section we discuss the GSCN’s architecure which is based upon SCNN buttackles SCNN second limitation (unused products) and improves the performance.indicesindicesF,IindicesFxI multiplier array (Cartesian product)FxIW,H,Kindices Input bufferFilter bufferAddress Generator UnitindicesPredictor Buffer bankBuffer bankReLUCompressOutput bufferCompute W,H,K coordinates Feedback FxI – A arbitrated xbar123Figure 6.6: GSCN PE microarchitecture employing GSCN data flow. TheGSCN PE microarchitecture is built upon SCNN, the units added orchanges in GSCN have been shown with yellow color.To prevent unused calculation, GSCN proposes a predictor (Figure 6.6 2 ) thatpredicts unused products after loading inputs and filters’ indices into the coordinatecomputer. This predictor sends feedback to the input and filter buffer’s addressgeneration unit 1 . In SCNN and GSCN as well, the inputs are stored in (W →H)and filters in (R→ S→K) sequence. In Alg. 5 the filter storage sequence is shown.The first outer loop is over width, and the last loop is over filters. Thus, the widthis always increasing while reading weights from filter buffer.GSCN fetches a vector of I inputs from the input buffer, performs the cartesianproduct for all the filters in the filter buffer (F × I each time), and then moves tothe next I inputs. Note that input and filter are operands for forward pass convo-lutions, and in the back-propagation convolutions, we have gradient and input’s ofthe previous layer, and gradients and weight.49Algorithm 5 GSCN storage order for filter elements in the buffer for each channel1: Ks = KcNumber o f PEs2: for r = 1 to R do3: for s = 1 to S do4: for k = 1 to Ks do5: FilterBuffer.pushback(filter(k,s,r))6: end for7: end for8: end forAs explained while loading the filters from the filter buffer, the filter’s widthwill never decrease since the outer loop is over width while storing the filters inthe filter buffer (Alg. 5 line 1). The coordinate computer computes the outputcoordinates for a given F × I. Assume the input indices are w and h, and weightindices are s and r. If the calculated output coordinate becomes negative ((w−r)<0 or (h− s) < 0), it means that a particular output product is not going to be usedin the convolution as shown in figure 6.5.Thus, when the width of all output coordinates (w− r) produced by a cartesianproduct (for a given F× I) are negative, it indicates that all next F filter vectors aregoing to produce unused products with that particular I vector.If none of the current F filters vectors are useful, the predictor skips all theupcoming products until we move to the next I input vector. Feedback signal sentto the address generator unit changes the address pointer so that the next I inputvector and the first F filter from filter buffer be fetched.This method will help us to skip many of the unused cartesian products andincrease performance. The PE architecture is pipelined, and the predictor is placedright after loading the inputs; thus, our prediction will be acted upon with one cycledelay. Figure 6.6 demonstrates the microarchitecture of a GSCN PE, including afilter buffer, input/output buffer, address generation unit, a multiplier array, a coor-dinate computation, a predictor, a scatter crossbar, a bank of accumulator buffers,and a post-processing unit for ReLU and compression. The ReLU and compres-sion unit in GSCN is more straightforward than SCNN since GSCN doesn’t needto communicate with other PEs for the tile edges (Figure 6.6 3 ). We expect that50predictor enables us to skip a considerable portion of unused products. Since about60% of products are not useful in the gradients of weight convolution, ideally, wewould get about 2× speedups.51Chapter 7Evaluation7.1 EvaluationIn this section, we present experimental results of the ReSprop and W-ReSprop al-gorithms adapted to different datasets and architectures. Moreover, we quantify thetheoretical computation reduction of ReSprop and simulate the speedup it achieveson a generic hardware accelerator designed to support sparse back-propagation.7.2 Experimental SetupWe implement the ReSprop and W-ReSprop algorithms in PyTorch [74]. We usethe custom C++ and Cuda extensions of pytorch to write our custom convolutionlayer. Note that this code is showing the functionality of ReSprop algorithm, anddue to the fine-grained sparsity to gain the speedup, a hardware accelerator is re-quired. To evaluate our algorithms, we train three different widely used state-of-the-art architectures; ResNet-18, 34, 50 [32], Wide Residual Networks [96], andVGG-16 [86] on three different datasets: CIFAR-10, CIFAR-100 [47] and Ima-geNet ILSVRC2012 [82]. For training, we use the SGD optimizer with momen-tum of 0.9, weight decay of 0.0001, initial learning rate of 0.1 and 5 to 8 warm upepochs for W-ReSprop. The baseline is trained with no sparsity or reuse. CIFAR-10 and CIFAR-100 datasets are trained for 200 epochs on a single GPU with amini-batch size of 128. The learning rate is annealed by a factor of (1/10)th at the5280th and 120th epochs. We run each experiment with three different seeds and usethe average value for all the results. The ImageNet dataset is trained for 90 epochswith a total mini-batch size of 256 samples on 4 GPUs (RTX 2080 Ti GPU). Thelearning rate is reduced by (1/10)th at the 30th and 60th epoch. The above choiceof hyper-parameters follows [32, 36]. For all evaluations in this section, we use theabove setup, except in Section 7.4, where we study batch size impact and effect ofthe number of compute nodes on accuracy.To model the performance of the GSCN and the baseline (SCNN) architecture,we rely primarily on a custom-built DNNsim cycle-level simulator [22]. We ex-tend this simulator to support GSCN. The simulator is driven by the 90% sparsehybrid gradient, sparse input activation and weights extracted from the PyTorchframework [74] while training with ReSporp. The simulator executes each layerof the network one at a time. We demonstrate the results for GSCN, computingback-propagation convolutions for a mini-batch size of 64 compared to GTX 1080Ti GPU. The GPU back-propagation convolutions is measured by isolating theback-propagation convolution function in C++ back-end in PyTorch framework.7.3 Accuracy AnalysisIn this section, we provide a comprehensive analysis of the ReSprop and W-ReSpropalgorithms and evaluate convergence and robustness on a wide range of models.7.3.1 Accuracy on CIFAR10 and CIFAR100:Tables 7.1 and 7.2 show the accuracy of the ReSprop and W-ReSprop algorithmsat different reuse-sparsity percentages on CIFAR-10 and CIFAR-100 datasets. Wecan see that ReSprop and W-Resprop algorithms achieve better accuracy than thebaseline with reuse-sparsities of 50% and 60%, respectively. CIFAR-10, withfewer classification classes, is more robust to reuse gradients, and it suffers onlya slight accuracy loss at 70% reuse-sparsity using the ReSprop algorithm. Whilethe accuracy drop for reuse-sparsities higher than 70% is considerable in the Re-Sprop algorithm, it can be avoided by the addition of a warm up phase. For bothCIFAR-10 and CIFAR-100, on three different architectures, W-ReSprop algorithm53loses less than 0.95% validation accuracy at 90% and less than 0.7% at 80% reuse-sparsity.CIFAR-100RS Algorithm ResNet34 WRN-28-10 VGG-1650%ReSprop 76.02±0.15 81.45±0.17 72.58±0.23W-ReSprop 76.4±0.11 81.78±0.16 72.79±0.2160%ReSprop 75.81±0.15 80.44±0.16 70.89±0.22W-ReSprop 76.01±0.12 81.34±0.15 72.45±0.2270%ReSprop 73.92±0.18 78.34±0.11 69.76±0.19W-ReSprop 75.60±0.13 81.09±0.15 71.98±0.2380%ReSprop 70.76±0.15 76.87±0.13 66.04±0.29W-ReSprop 75.44±0.17 80.87±0.14 71.88±0.2390% ReSprop 69.12±0.13 75.06±0.10 65.32±0.21W-ReSprop 75.14±0.16 80.38±0.17 71.57±0.24Baseline 75.61±0.16 81.29±0.17 72.50±0.21Table 7.1: Validation accuracy of ReSprop and W-ReSprop at different reuse-sparsity constraints on the CIFAR-100.7.3.2 Accuracy on ImageNet:Table 7.3 shows the accuracy obtained by the ReSprop and W-ReSprop on ResNet34,VGG-16 and Wide-Resnet-50-2. The results indicate that unlike CIFAR datasetsfor which W-ReSprop and ReSprop algorithms outperform the baseline at reuse-sparsities lower than 70%, for the ImageNet dataset at 50% resue-sparsity ReSpropand W-ReSprop have less than 0.5% and 0.15% loss of accuracy, respectively. Weobserve that for the CIFAR-100 dataset, the W-ReSprop algorithm has better accu-racy at high reuse-sparsities compared to the base ReSprop; the same trend holdsfor the Imagenet dataset. W-ReSprop at 90% reuse-sparsity has less than 1.1% ac-curacy loss in all three networks. For a fair comparison with W-ReSprop, we eval-uate W-meProp, a variation of the meProp algorithm employing a warm up phase.Figure 7.1 demonstrates the validation curve of ReSprop, W-ReSprop, meProp andW-meProp algorithms on the ImageNet dataset for the Resnet-18 architecture. Thevalidation curve indicates a significant loss of accuracy for meProp and W-meProp.54CIFAR-10RS Algorithm ResNet34 WRN-28-10 VGG-1650%ReSprop 95.85±0.06 96.58±0.09 93.35±0.18W-ReSprop 95.91±0.05 96.93±0.11 93.28±0.1960%ReSprop 95.25±0.04 95.89±0.11 93.18±0.14W-ReSprop 95.41±0.09 96.79±0.07 93.26±0.1570%ReSprop 95.01±0.07 95.68±0.08 92.63±0.16W-ReSprop 95.23±0.09 96.13±0.15 92.91±0.1780%ReSprop 94.17±0.07 93.23±0.08 91.90±0.18W-ReSprop 94.96±0.13 95.93±0.12 92.64±0.1790% ReSprop 91.61±0.09 90.71±0.15 90.01±0.18W-ReSprop 94.36±0.07 95.67±0.11 92.43±0.18Baseline 95.13±0.09 96.30±0.11 93.25±0.15Table 7.2: Validation accuracy of ReSprop and W-ReSprop at different reuse-sparsity constraints on the CIFAR-10.RS Algorithm ResNet34 WRN-50-2 VGG1650%ReSprop 73.08 78.69 70.09W-ReSprop 73.21 78.81 70.4170%ReSprop 67.12 73.34 68.73W-ReSprop 72.73 78.25 70.0190% ReSprop 63.78 67.72 60.76W-ReSprop 72.44 77.93 69.46Baseline 73.34 78.88 70.50Table 7.3: Top 1 validation accuracy of ReSprop and W-ReSprop algorithmsat different reuse-sparsity constraints on the ImageNet dataset.MeProp has validation accuracy of 32.56% at 50% sparsity while the ReSprop vali-dation accuracy at 50% reuse-sparsity is 69.83% which is 0.03% less than the base-line. The W-Resprop algorithm at 50% reuse-sparsity gains 0.08% higher accuracythan the baseline and loses negligible accuracy of 0.7% at 70% reuse-sparsity.55010203040506070800 10 20 30 40 50 60 70 80Validation accuracy -top1 (%)EpochReSprop - 50% RS meProp - 50% sparsityReSprop - 70% RS meProp - 70% sparsityW-ReSprop - 70% RS W-meProp - 70% sparsityBaselineFigure 7.1: Top 1 validation accuracy of ReSprop, W-ReSprop, meProp andW-meProp algorithms for training ResNet-18 on the ImageNet dataset.The baseline is trained with no sparsity or reusing.7.4 Sensitivity StudyIn this section, we analyze the depth and width of the network on ReSprop, wevary the batch size from 32 to 128 while training ResNet-34 on the CIFAR-100dataset on a single GPU, and train ResNet-18 on the ImageNet dataset with 2,4and 8 GPUs.7.4.1 Deep and Wide NetworksPrevious studies have shown that network depth and width affect network conver-gence [60, 77, 84]. Here, we study the effect of depth and width on the ReSpropalgorithms. Table 7.4 shows the accuracy of ResNet-18, 34 and 50 for W-Respropalgorithm on CIFAR100 at 90% reuse percentage. We observe from the results thatW-ReSprop converges to the state-of-the-art accuracy with minimal loss of accu-racy for deep networks. At 90% reuse-sparsity W-ReSprop algorithm has an accu-56racy loss of 0.17% for ResNet-50. Similarly, the results for WRN-28-10 shown inTable 7.1 and 7.2, shows slight accuracy loss on training wide networks with theW-ReSprop algorithm.ResNet18 ResNet34 ResNet50W-ReSprop 90% 73.26 75.15 76.67Baseline 74.84 75.61 76.84Table 7.4: Validation accuracy of ResNet-18, 34 and 50 on the CIFAR-100dataset at 90% reuse-sparsity.7.4.2 Impact of Batch Size:Here, we explore effects of batch size on the accuracy of ReSprop. Table 7.5demonstrates that ReSprop converges with negligible accuracy loss for differentbatch sizes. ReSprop and W-ReSprop algorithms achieve higher accuracy forlarger batch sizes. This behavior may be a result of including more samples in-creasing the likelihood similar features are present resulting in a higher correlationwith the next iteration’s gradients.Batchsize 32 64 128ReSprop 70% 72.94 73.48 73.92W-ReSprop 90% 74.98 75.09 75.14Baseline 76.12 75.88 75.61Table 7.5: Validation accuracy of ResNet-34 on the CIFAR-100 dataset withdifferent batch sizes of 32, 64 and 128.7.4.3 Distribute Training Across Multiple Compute NodesData parallelism is a popular way to accelerate training [46, 78]. To explore theimpact of distributed training on accuracy, and still ignoring speedup, we evaluateReSprop on multiple GPUs to compute gradient updates and then aggregating theselocally computed updates. Below, we focus on training with multiple GPUs on a57single machine by splitting the input across the specified GPUs. The ReSpropalgorithm (Alg. 1 and 2) is applied during the training on each GPU independently.Table 7.6 shows the accuracy results for training ResNet-18 on ImageNet with avaried number of GPUs on a single machine. Since the ReSprop algorithm isapplied to each GPU, the number of GPUs does not affect the model’s accuracytrained with the ReSprop algorithm.# GPUs in total 2 4 8Batchsize in total 128 256 512W-ReSprop 90% 68.73 68.81 68.61Baseline 69.21 69.45 69.47Table 7.6: Top 1 validation accuracy of ResNet-18 on the ImageNet datasettrained on 2, 4 and 8 nodes.7.5 SpeedupIn this section, we quantify the computation reduction, overheads, and the speedupof the ReSprop algorithm. Since we are using 5 to 8 epochs of whole training (90-200 epochs) for the warm up phase, the speedup for W-ReSprop would be the sameorder as the ReSprop algorithm.7.5.1 Adaptive Thresholding:The threshold operation can be implemented with O(n) complexity. For each layer,if the reuse-sparsity of (SpHGl)i becomes less than the targeted reuse-sparsity, wehalve Tl (Eq (4.5)) to force more elements to zero and use the updated value of Tlfor the next iteration. On the other hand, if the sparsity of (SpHGl)i is more thanthe targeted reuse-sparsity, we increase Tl by doubling it. To accelerate the processof moving toward the desired Tl , we chose the initialization value of 10−7 for allthe layers in all the experiments, based on the output gradient’s distribution on theResNet-18, 34 and 50 on CIFAR datasets. We experimentally find that for a givenlayer and a fixed reuse-sparsity, the threshold is almost constant during training.Thus, the threshold can be updated after a specific number of iterations, which58reduces the computation overhead. The total computation overhead of adaptivethresholding, matrix additions, and subtractions in the ReSprop algorithm is lessthan 2.5% for both Imagenet and CIFAR datasets.7.5.2 Pre-ReSprop OverheadAs shown in Section 5.1, the ReSprop algorithm consists of pre-ReSprop and back-ReSprop. Pre-ReSprop can be calculated in parallel with the original forward passconvolution. Figure 7.2 plots computation overhead (measured in terms of floating-point operations) added by ReSprop to the forward-pass at different batch sizes.This overhead is less than 2% for batch sizes larger than 128. We theoreticallyanalyze the memory footprint by calculating ReSprop parameters that need to bestored and fetched. The results of the pre-ReSprop calculations and a randomsample of the previous iteration’s output gradient are stored and used in the back-ReSprop. We compute ReSprop memory footprint overhead by considering theadaptive threshold, pre-ReSprop, and back-ReSprop overheads. For the CIFARand ImageNet datasets for batch sizes larger than 128, ReSprop has less than 16%memory footprint overhead compared to the total model parameters and the inputactivations’ memory footprint for different architectures (ResNet18, 34, 50 andVGG-16).7.5.3 Theoretical Speedup:We evaluate the theoretical improvement in computational cost for forward andbackward passes by comparing the number of floating-point operations with andwithout ReSprop. First row of Table 7.8 shows the theoretical speedup of ReSpropfor the backward pass. Since ReSprop accelerates only the backward pass, the the-oretical training (forward + backward) speedup can be calculated using Amdahl’sLaw [5]. Figure 7.3 shows the total training speedup considering the overheadsof pre-ReSprop and thresholding. This analysis shows that at 90% reuse-sparsity,ImageNet can be trained 2.5× to 3.0× (on average 2.7×) faster using ReSprop.Among sparse training algorithms, DSG sparsifies back-propagation convolutions(Eq. 4.2 and 4.3). Table 7.7 shows the accuracy and speedup of DSG and W-ReSprop. W-ReSprop with the same sparsity percentage achieves higher accuracy590123456789ResNet18 ResNet34 ResNet50 WRN-50-2 VGG16 VGG19ReSProp overhead in forward pass (% ) 3264128256Figure 7.2: Computation overhead of ReSprop at forward pass (pre-ReSprop)for different batch sizes (ImageNet dataset).and speedup. Reducing dimension for sparsifying gradients and inputs is the mainreason for accuracy loss at high sparsities in DSG.00.511.522.533.5ResNet18 ResNet34 ResNet50 WRN-50-2 VGG16 VGG19Speedup 50%70%90%Figure 7.3: ReSprop training (forward+backward) speedup versus architec-ture for three reuse-sparsity percentages (ImageNet).60ResNet-18 WRN-8-2Algorithm Speedup Acc ↓ Speedup Acc ↓DSG 2.2 3.88% 2.3 2.74%W-ReSprop 2.7 0.51% 2.8 0.43%Table 7.7: Validation accuracy and train speedup at 90% sparsity comparedto dense training (CIFAR-10 dataset).ResNet18 ResNet34 VGG-16Theoretical 9.83 9.68 9.34GSCN+Baseline 1.32 1.81 1.27GSCN+ReSprop 8.6 8.01 7.21Table 7.8: Theoretical and GSCN speedup at backward pass computationswith 90% resue-sparsity (ImageNet).7.5.4 Accelerator for Sparse Back-propagation:We modify the SCNN [72] to support back-propagation convolutions and call theresulting architecture a generic sparse convolution accelerator (GSCN). We feedGSCN with sparse convolutions of ReSprop. To model performance of GSCN, werely primarily on the DNNsim cycle-level simulator [22]. We extend this simulatorto support GSCN.Table 7.9 lists the key parameters of the GCN design we explore in this paper.The design employs four tiles of 8×8 array of PEs, each PE with a 4×4 cartesianmultiplier array, and an accumulator buffer with 32 banks. Since we need a genericarchitecture, we set the GSCN’s buffer size in a way that it is sufficient for allforward and backward pass convolutions. The filter and input buffer each carry a4-bit overhead for each 16-bit value to encode the compressed coordinates.To run more than one input sample parallel (mini-batch training) in GSCNdesign GSCN will tile the PEs in which n× n PE array is called a tile. Tiles willfetch the same filters (reuse the filters), but each tile has a different input sample.Since PE tiles are applying convolutions on different input samples, there is noneed of communicate among them.Figure 7.4 shows the speedup on GSCN compared to SCNN and GTX 1080 Ti61Normalized computation time00.250.50.751ResNet18 Gradient of Input convolutionsResNet34 Gradient of input convolutionsResNet18 Gradient of weight convolutionsResNet34 Gradient of weight convolutionsGTX 1080 Ti SCNN GSCN (without predictor) GSCN (with predictor)Figure 7.4: Figure shows the speedup for GSCN compared to SCNN andGTX 1080 Ti GPU while training with ReSprop.Table 7.9: GSCN parametersGSCN Parameters ValuesTiles 4PE/Tile 64PE Parameters ValueMultiplier width 16 bitsAccumulator width 24 bitsMultiply array (F×I) 4×4Accumulator banks 32Accumulator bank entries 32Filter buffer 2KBInput buffer 10KBOutput buffer 10KBGPU. For GPU timing, the execution time for two back-propagation convolutionkernels are measured on a real GTX 1080 Ti GPU. The y-axis shows relative com-putation time based on GPU execution time. This figure demonstrates that addingpredictor has made gradient of weight calculation about 2× faster, as we expected.Table 7.8 shows the speedup we can gain on GSCN accelerator compared toGTX 1080 ti GPU by running standard training (second row) and ReSprop algo-62rithm (third row) on GSCN.63Chapter 8ConclusionThis work proposes Reuse-Sparsified Backpropagation for faster training by reusingthe gradients during training. ReSprop sparsifies backward convolutions whileadding minimal computation overhead to the forward pass. Storing and reusinga random gradient at each iteration enables us to make the pre-calculation over-head about 2% for batch sizes larger than 128. Moreover, the memory overheadbecomes negligible by storing just one random gradient. Thus, pre-ReSprop cal-culations are negligible and can be done in parallel to forward pass. As shown inFigure 8.1 We proposed W-ReSprop as a variant of ReSprop. W-ReSprop enablesForward-pass Backward-pass ReSprop-forwardForward-pass ReSprop-backward (sparse)TimeOriginal TrainingReSprop TrainingFigure 8.1: ReSprop has two parts. Pre-Resprop computations are negligibleand back-ReSprop are sparse computations in backward pass.us to lose less accuracy compared to ReSprop. W-ReSprop helps network to getadapted to the reuse and after first few epochs we continue the training with thetarget reuse percentage.ReSprop and W-ReSprop can be used for training common network architec-64tures and achieves average 2.7× overall speedup in training with negligible lossin model accuracy. The backward pass takes 70% training time, and the speedupgained by ReSprop is due to the 10× faster backward pass. Thus, the overall (the-oretical) training speedup can be calculated from the above using Amdahl’s Lawas:Speedup =1(1−0.7)︸ ︷︷ ︸forward-pass (not accelerated)+0.710︸︷︷︸backward-passOur experiments show that ReSprop and W-ReSprop are robust to the choice ofbatch sizes, architectures, and are applicable to distribute training across multipleGPUs.To find the speedup of ReSprop on a hardware accelerator, we need a sparseaccelerator that supports training. Most previous studies on CNN sparse acceler-ators, however, have paid less attention to the training aspect, despite being morecomputationally demanding than inference. We modify SCNN accelerator to sup-port sparse back-propagation convolutions to measure the speedup of the ReSpropalgorithm on our customized hardware design.This work introduces a generic sparse convolution neural-network accelera-tor (GSCN), which overcomes the shortcomings of previous accelerators in back-propagation convolutions. Thus, GSCN can be used both for sparse training andinference. GSCN achieves 6× and 5.5× speed up on ResNet18 and ResNet34back-propagation, respectively, compared to sparse convolutional neural network(SCNN) accelerator. It also achieves 8.7× and 7.1× speedup compared to the GTX1080 Ti GPU.8.0.1 DiscussionReSprop method can be considered as a new variant of momentum for each indi-vidual output gradient parameter. While momentum has a hyperparameter β forthe gradient of weights, our method has a fixed β = 1 or β = 0 for each outputgradient element. Moreover, ReSprop is an orthogonal approach to the choice ofoptimizers, and it can be used with different optimizers such as Adam or Adagrad.658.0.2 Future WorkThe adaptation of the proposed methodology with a diverse set of CNN architec-tures, such as fully convolutional networks, could be followed in future work. Thefuture work also includes a deeper analysis of reusing gradient in other non-CNNmodels such as fully connected and recurrent neural networks. This thesis has beenmainly focused on the reusing of output gradient, while analysis of the sparsity ofweights combined with the reusing idea could be explored in the future. Also, itcould be interesting to consider the effect of reusing gradient on reusing weightsand sparsifying the weights.Moreover, the GSCN accelerator is an ongoing project and many aspects of itsuch as NOC between PEs, running large batch sizes, and how to gain a higherperformance in sparse calculations is an ongoing project.66Bibliography[1] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, andA. Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural networkcomputing. in 2016 ieee. In ACM/IEEE International Conference onComputer Architecture (ISCA), volume 10, 2016. → pages 3, 6, 26, 27[2] J. Albericio, A. Delma´s, P. Judd, S. Sharify, G. O’Leary, R. Genov, andA. Moshovos. Bit-pragmatic deep neural network computing. InProceedings of the 50th Annual IEEE/ACM International Symposium onMicroarchitecture, pages 382–394. ACM, 2017. → pages 3, 6, 27[3] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convexoptimization. In International Conference on Machine Learning, pages699–707, 2016. → page 23[4] Z. Allen-Zhu and Y. Yuan. Improved svrg for non-strongly-convex orsum-of-non-convex objectives. In International conference on machinelearning, pages 1080–1089, 2016. → page 4[5] G. M. Amdahl. Validity of the single processor approach to achieving largescale computing capabilities. In Proceedings of the April 18-20, 1967,spring joint computer conference, pages 483–485, 1967. → pages 5, 59[6] A. G. Anderson and C. P. Berg. The high-dimensional geometry of binaryneural networks. arXiv preprint arXiv:1705.07199, 2017. → page 30[7] S. Anwar, K. Hwang, and W. Sung. Structured pruning of deepconvolutional neural networks. ACM Journal on Emerging Technologies inComputing Systems (JETC), 13(3):1–18, 2017. → page 19[8] R. Banner, I. Hubara, E. Hoffer, and D. Soudry. Scalable methods for 8-bittraining of neural networks. In Advances in Neural Information ProcessingSystems, pages 5145–5153, 2018. → page 267[9] G. Bellec, D. Kappel, W. Maass, and R. Legenstein. Deep rewiring:Training very sparse deep networks. arXiv preprint arXiv:1711.05136,2017. → page 20[10] L. Bottou. Large-scale machine learning with stochastic gradient descent.In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010. →page 2[11] L. Bottou and Y. L. Cun. Large scale online learning. In Advances in neuralinformation processing systems, pages 217–224, 2004. → pages 15, 22[12] A. Cauchy. Me´thode ge´ne´rale pour la re´solution des systemes d’e´quationssimultane´es. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847. → page 22[13] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu,N. Sun, et al. Dadiannao: A machine-learning supercomputer. InIEEE/ACM International Symposium on Microarchitecture, pages609–622. IEEE Computer Society, 2014. → pages 3, 6[14] Y. Chen, T. Chen, Z. Xu, N. Sun, and O. Temam. Diannao family:energy-efficient hardware accelerators for machine learning.Communications of the ACM, 59(11):105–112, 2016. → page 25[15] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze. Eyeriss: An energy-efficientreconfigurable accelerator for deep convolutional neural networks. IEEEjournal of solid-state circuits, 52(1):127–138, 2016. → page 26[16] D. Ciregan, U. Meier, and J. Schmidhuber. Multi-column deep neuralnetworks for image classification. In 2012 IEEE conference on computervision and pattern recognition, pages 3642–3649. IEEE, 2012. → page 1[17] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato,A. Senior, P. Tucker, K. Yang, et al. Large scale distributed deep networks.In Advances in Neural Information Processing Systems, pages 1223–1231,2012. → page 2[18] A. Defazio and L. Bottou. On the ineffectiveness of variance reducedoptimization for deep learning. arXiv preprint arXiv:1812.04529, 2018. →page 23[19] A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incrementalgradient method with support for non-strongly convex compositeobjectives. In Advances in Neural Information Processing Systems, pages1646–1654, 2014. → pages 4, 22, 2368[20] A. Defazio, J. Domke, et al. Finito: A faster, permutable incrementalgradient method for big data problems. In International Conference onMachine Learning, pages 1125–1133, 2014. → page 22[21] T. Dettmers and L. Zettlemoyer. Sparse networks from scratch: Fastertraining without losing performance. arXiv preprint arXiv:1907.04840,2019. → pages 2, 3, 20[22] I. Edo, O. Awad, A. H. Zadeh, D. M. Stuart, A. D. Lascorz, M. Nikolic´, andA. Moshovos. DNNsim: Deep Learning Accelerators Toolkit.https://github.com/isakedo/DNNsim. → pages 53, 61[23] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Neural accelerationfor general-purpose approximate programs. In 2012 45th AnnualIEEE/ACM International Symposium on Microarchitecture, pages449–460. IEEE, 2012. → page 24[24] J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin. The lottery tickethypothesis at scale. arXiv preprint arXiv:1903.01611, 2019. → pages19, 20[25] T. Gale, E. Elsen, and S. Hooker. The state of sparsity in deep neuralnetworks. arXiv preprint arXiv:1902.09574, 2019. → pages 18, 20[26] M. Golub, G. Lemieux, and M. Lis. Dropback: Continuous pruning duringtraining. arXiv preprint arXiv:1806.06949, 2018. → page 21[27] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press,2016. → page 15[28] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deepneural networks with pruning, trained quantization and huffman coding.arXiv preprint arXiv:1510.00149, 2015. → page 18[29] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights andconnections for efficient neural network. In Advances in neuralinformation processing systems, pages 1135–1143, 2015. → pages 19, 20[30] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally.Eie: efficient inference engine on compressed deep neural network. InACM/IEEE International Symposium on Computer Architecture (ISCA),pages 243–254. IEEE, 2016. → pages 3, 6, 26, 4769[31] B. Hassibi and D. G. Stork. Second order derivatives for network pruning:Optimal brain surgeon. In Advances in Neural Information ProcessingSystems, pages 164–171, 1993. → page 18[32] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for imagerecognition. In IEEE Conference on Computer Vision and PatternRecognition, pages 770–778, 2016. → pages 12, 13, 52, 53[33] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deepneural networks. In IEEE International Conference on Computer Vision,pages 1389–1397, 2017. → pages 3, 19[34] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior,V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. Deep neural networks foracoustic modeling in speech recognition. IEEE Signal processingmagazine, 29, 2012. → page 1[35] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networkswith stochastic depth. In European Conference on Computer Vision, pages646–661. Springer, 2016. → page 2[36] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Denselyconnected convolutional networks. In IEEE Conference on ComputerVision and Pattern Recognition, pages 4700–4708, 2017. → page 53[37] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep networktraining by reducing internal covariate shift. arXiv preprintarXiv:1502.03167, 2015. → pages 2, 11[38] A. Jain, A. Phanishayee, J. Mars, L. Tang, and G. Pekhimenko. Gist:Efficient data encoding for deep neural network training. In 2018ACM/IEEE 45th Annual International Symposium on ComputerArchitecture (ISCA), pages 776–789. IEEE, 2018. → page 22[39] H. Ji, L. Song, L. Jiang, H. H. Li, and Y. Chen. Recom: An efficientresistive accelerator for compressed deep neural networks. In 2018 Design,Automation & Test in Europe Conference & Exhibition (DATE), pages237–240. IEEE, 2018. → page 27[40] R. Johnson and T. Zhang. Accelerating stochastic gradient descent usingpredictive variance reduction. In Advances in Neural InformationProcessing Systems, pages 315–323, 2013. → pages 4, 2270[41] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. In-datacenterperformance analysis of a tensor processing unit. In Proceedings of the44th Annual International Symposium on Computer Architecture, pages1–12, 2017. → page 25[42] P. Kanerva. Hyperdimensional computing: An introduction to computingin distributed representation with high-dimensional random vectors.Cognitive computation, 1(2):139–159, 2009. → page 30[43] E. D. Karnin. A simple procedure for pruning back-propagation trainedneural networks. IEEE Transactions on Neural Networks, 1(2):239–242,1990. → page 18[44] A. Katharopoulos and F. Fleuret. Biased importance sampling for deepneural network training. arXiv preprint arXiv:1706.00043, 2017. → page 2[45] A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deeplearning with importance sampling. arXiv preprint arXiv:1803.00942,2018. → page 2[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification withdeep convolutional neural networks. In Advances in Neural InformationProcessing Systems, pages 1097–1105, 2012. → pages 1, 2, 10, 57[47] A. Krizhevsky et al. Learning multiple layers of features from tiny images.Technical report, Citeseer, 2009. → page 52[48] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. InAdvances in Neural Information Processing Systems, pages 598–605, 1990.→ page 18[49] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learningapplied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. → page 15[50] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Mu¨ller. Efficient backprop.In Neural networks: Tricks of the trade, pages 9–48. Springer, 2012. →page 15[51] N. Lee, T. Ajanthan, and P. H. Torr. Snip: Single-shot network pruningbased on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018.→ page 1971[52] N. Lee, T. Ajanthan, S. Gould, and P. H. Torr. A signal propagationperspective for pruning neural networks at initialization. arXiv preprintarXiv:1906.06307, 2019. → page 18[53] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filtersfor efficient convnets. arXiv preprint arXiv:1608.08710, 2016. → page 19[54] L. Liu, L. Deng, X. Hu, M. Zhu, G. Li, Y. Ding, and Y. Xie. Dynamicsparse graph for efficient deep learning. arXiv preprint arXiv:1810.00859,2018. → pages 2, 3, 20[55] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen.Cambricon: An instruction set architecture for neural networks. In 2016ACM/IEEE 43rd Annual International Symposium on ComputerArchitecture (ISCA), pages 393–405. IEEE, 2016. → page 25[56] X. Liu, M. Mao, B. Liu, H. Li, Y. Chen, B. Li, Y. Wang, H. Jiang,M. Barnell, Q. Wu, et al. Reno: A high-efficient reconfigurableneuromorphic computing accelerator design. In Proceedings of the 52ndAnnual Design Automation Conference, pages 1–6, 2015. → page 24[57] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Rethinking the value ofnetwork pruning. arXiv preprint arXiv:1810.05270, 2018. → page 20[58] C. Louizos, K. Ullrich, and M. Welling. Bayesian compression for deeplearning. In Advances in Neural Information Processing Systems, pages3288–3298, 2017. → page 3[59] C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neuralnetworks through l 0 regularization. arXiv preprint arXiv:1712.01312,2017. → page 3[60] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang. The expressive power of neuralnetworks: A view from the width. In Advances in Neural InformationProcessing Systems, pages 6231–6239, 2017. → page 56[61] J.-H. Luo, J. Wu, and W. Lin. Thinet: A filter level pruning method fordeep neural network compression. In IEEE International Conference onComputer Vision, pages 5058–5066, 2017. → page 3[62] M. Mahdavi, L. Zhang, and R. Jin. Mixed optimization for smoothfunctions. In Advances in neural information processing systems, pages674–682, 2013. → page 472[63] J. Mairal. Incremental majorization-minimization optimization withapplication to large-scale machine learning. SIAM Journal onOptimization, 25(2):829–855, 2015. → page 22[64] H. Malmgren, M. Borga, and L. Niklasson. Artificial Neural Networks inMedicine and Biology: Proceedings of the ANNIMAB-1 Conference,Go¨teborg, Sweden, 13–16 May 2000. Springer Science & Business Media,2012. → page 1[65] D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu, andA. Liotta. Scalable training of artificial neural networks with adaptivesparse connectivity inspired by network science. Nature communications, 9(1):2383, 2018. → page 20[66] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifiesdeep neural networks. In International Conference on Machine Learning,pages 2498–2507. JMLR. org, 2017. → page 3[67] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruningconvolutional neural networks for resource efficient inference. arXivpreprint arXiv:1611.06440, 2016. → page 18[68] J. Morajda. Neural networks and their economic applications. In Artificialintelligence and security in computing systems, pages 53–62. Springer,2003. → page 1[69] H. Mostafa and X. Wang. Parameter efficient training of deepconvolutional neural networks by dynamic sparse reparameterization.arXiv preprint arXiv:1902.05967, 2019. → pages 3, 20[70] S. Narang, E. Elsen, G. Diamos, and S. Sengupta. Exploring sparsity inrecurrent neural networks. arXiv preprint arXiv:1704.05119, 2017. →page 37[71] Y. Nesterov. Introductory lectures on convex optimization: A basic course,volume 87. Springer Science & Business Media, 2013. → page 22[72] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. Scnn: An acceleratorfor compressed-sparse convolutional neural networks. In ACM/IEEE 44thAnnual International Symposium on Computer Architecture (ISCA), pages27–40. IEEE, 2017. → pages xii, 3, 6, 27, 39, 43, 6173[73] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of trainingrecurrent neural networks. In International conference on machinelearning, pages 1310–1318, 2013. → page 13[74] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation inpytorch. 2017. → pages 52, 53[75] J. L. Patel and R. K. Goyal. Applications of artificial neural networks inmedical science. Current clinical pharmacology, 2(3):217–226, 2007. →page 1[76] A. Radford, L. Metz, and S. Chintala. Unsupervised representationlearning with deep convolutional generative adversarial networks. arXivpreprint arXiv:1511.06434, 2015. → page 1[77] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. S. Dickstein. On theexpressive power of deep neural networks. In International Conference onMachine Learning, pages 2847–2854. JMLR. org, 2017. → page 56[78] R. Raina, A. Madhavan, and A. Y. Ng. Large-scale deep unsupervisedlearning using graphics processors. In International Conference onMachine Learning, pages 873–880. ACM, 2009. → page 57[79] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variancereduction for nonconvex optimization. In International Conference onMachine Learning, pages 314–323, 2016. → page 23[80] N. L. Roux, M. Schmidt, and F. R. Bach. A stochastic gradient methodwith an exponential convergence rate for finite training sets. In Advancesin neural information processing systems, pages 2663–2671, 2012. → page22[81] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learningrepresentations by back-propagating errors. nature, 323(6088):533–536,1986. → page 14[82] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visualrecognition challenge. International journal of computer vision, 115(3):211–252, 2015. → pages 4, 5274[83] S. Sharify, A. D. Lascorz, M. Mahmoud, M. Nikolic, K. Siu, D. M. Stuart,Z. Poulos, and A. Moshovos. Laconic deep learning inference acceleration.In Proceedings of the 46th International Symposium on ComputerArchitecture, pages 304–317, 2019. → page 26[84] O. Sharir and A. Shashua. On the expressive power of overlappingoperations of deep networks. arXiv preprint arXiv:1703.02065, 2017. →page 56[85] K. Simonyan and A. Zisserman. Two-stream convolutional networks foraction recognition in videos. In Advances in Neural InformationProcessing Systems, pages 568–576, 2014. → page 1[86] K. Simonyan and A. Zisserman. Very deep convolutional networks forlarge-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. →pages 12, 52[87] R. Spring and A. Shrivastava. Scalable and sustainable deep learning viarandomized hashing. In ACM SIGKDD International Conference onKnowledge Discovery and Data Mining, pages 445–454. ACM, 2017. →page 2[88] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Strivingfor simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806,2014. → page 11[89] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, andR. Salakhutdinov. Dropout: a simple way to prevent neural networks fromoverfitting. The journal of machine learning research, 15(1):1929–1958,2014. → page 11[90] X. Sun, X. Ren, S. Ma, and H. Wang. meprop: Sparsified back propagationfor accelerated deep learning with reduced overfitting. In Proceedings ofthe 34th International Conference on Machine Learning, pages 3299–3308.JMLR. org, 2017. → pages 2, 3, 7, 20, 28, 31[91] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance ofinitialization and momentum in deep learning. In International conferenceon machine learning, pages 1139–1147, 2013. → page 16[92] R. W. Vuduc and J. W. Demmel. Automatic performance tuning of sparsematrix kernels, volume 1. University of California, Berkeley Berkeley, CA,2003. → page 4775[93] N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan. Trainingdeep neural networks with 8-bit floating point numbers. In Advances inNeural Information Processing Systems, pages 7675–7684, 2018. → page 2[94] B. Wei, X. Sun, X. Ren, and J. Xu. Minimal effort back propagation forconvolutional neural networks. arXiv preprint arXiv:1709.05804, 2017. →pages 2, 3, 7, 20[95] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsityin deep neural networks. In Advances in Neural Information ProcessingSystems, pages 2074–2082, 2016. → page 19[96] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprintarXiv:1605.07146, 2016. → pages 12, 14, 52[97] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, andY. Chen. Cambricon-x: An accelerator for sparse neural networks. In 201649th Annual IEEE/ACM International Symposium on Microarchitecture(MICRO), pages 1–12. IEEE, 2016. → page 27[98] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental networkquantization: Towards lossless cnns with low-precision weights. arXivpreprint arXiv:1702.03044, 2017. → page 2[99] X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen,and Y. Chen. Cambricon-s: Addressing irregularity in sparse neuralnetworks through a cooperative software/hardware approach. In 2018 51stAnnual IEEE/ACM International Symposium on Microarchitecture(MICRO), pages 15–28. IEEE, 2018. → page 27[100] M. Zhu and S. Gupta. To prune, or not to prune: exploring the efficacy ofpruning for model compression. arXiv preprint arXiv:1710.01878, 2017.→ page 3776
You may notice some images loading slow across the Open Collections website. Thank you for your patience as we rebuild the cache to make images load faster.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Resprop : reused sparsified backpropagation
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Resprop : reused sparsified backpropagation Goli, Negar 2020
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Resprop : reused sparsified backpropagation |
Creator |
Goli, Negar |
Publisher | University of British Columbia |
Date Issued | 2020 |
Description | The success of Convolutional Neural Networks (CNNs) in various applications is accompanied by a significant increase in computation and training time. In this work, we focus on accelerating training by observing that about 90% of gradients are reusable during training. Leveraging this observation, we propose a new algorithm, Reuse-Sparse-Backprop (ReSprop), as a method to sparsify gradient vectors during CNN training. ReSprop maintains state-of-the-art accuracy on CIFAR-10, CIFAR-100, and ImageNet datasets with less than 1.1% accuracy loss while enabling a reduction in back-propagation computations by a factor of 10x resulting in a 2.7x overall speedup in training. As the computation reduction introduced by ReSprop is accomplished by introducing fine-grained sparsity that reduces computation efficiency on GPUs, we introduce a generic sparse convolution neural network accelerator (GSCN), which is designed to accelerate sparse back-propagation convolutions. When combined with ReSprop, GSCN achieves 8.0x and 7.2x speedup in the backward pass on ResNet34 and VGG16 versus a GTX 1080 Ti GPU. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2020-06-19 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0391971 |
URI | http://hdl.handle.net/2429/74770 |
Degree |
Master of Applied Science - MASc |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2020-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2020_november_goli_negar.pdf [ 2.81MB ]
- Metadata
- JSON: 24-1.0391971.json
- JSON-LD: 24-1.0391971-ld.json
- RDF/XML (Pretty): 24-1.0391971-rdf.xml
- RDF/JSON: 24-1.0391971-rdf.json
- Turtle: 24-1.0391971-turtle.txt
- N-Triples: 24-1.0391971-rdf-ntriples.txt
- Original Record: 24-1.0391971-source.json
- Full Text
- 24-1.0391971-fulltext.txt
- Citation
- 24-1.0391971.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0391971/manifest