DropBack: Continuous Pruning During Deep Neural Network Training

by

Maximilian Golub

B. Electrical Engineering, University of Washington, 2016

A thesis submitted in partial fulfillment of the requirements for the degree of

Master of Applied Science

in

The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

September 2018

© Maximilian Golub, 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

DropBack: Continuous Pruning During Deep Neural Network Training

submitted by Maximilian Golub in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:

Mieszko Lis, Electrical and Computer Engineering (Supervisor)
Guy Lemieux, Electrical and Computer Engineering (Supervisor)
Jane Wang, Electrical and Computer Engineering (Supervisory Committee Member)
Matei Ripeanu, Electrical and Computer Engineering (Supervisory Committee Member)

Abstract

In recent years, neural networks have regained popularity in a variety of fields such as image recognition and speech transcription. As deep neural networks grow more popular for solving everyday tasks, deployment on small embedded devices — such as phones — is becoming increasingly common. Moreover, many applications — such as face recognition or health applications — require personalization, which means that networks must be retrained after they have been deployed.

Because today's state-of-the-art networks are too large to fit on mobile devices and exceed mobile device power envelopes, techniques such as pruning and quantization have been developed to allow pre-trained networks to be shrunk by about an order of magnitude.
However, they all assume that the network is first fully trained off-line on datacenter-class GPUs, then pruned in a post-processing step, and only then deployed to the mobile device.

In this thesis, we introduce DropBack, a technique that significantly reduces the storage and computation required during both inference and training. In contrast to existing pruning schemes, which retain the weights with the largest values and set the rest to zero, DropBack identifies the weights that have changed the most, and recomputes the original initialization values for all other weights. This means that only the most important weights must be stored in off-chip memory during both inference and training, reducing off-chip memory accesses (responsible for a majority of the power usage) by up to 72×.

Crucially, networks pruned using DropBack maintain high accuracy even for challenging network architectures: indeed, on modern, compact network architectures such as Densenet and WRN-28-10, DropBack outperforms the current state-of-the-art pruning techniques in both accuracy and the off-chip memory storage required for weights. On the CIFAR-10 dataset, we observe a 5× reduction in weights on an already 9×-reduced VGG-16 network, which we call VGG-S, and 4.5× on Densenet and WRN-28-10 — all with zero or negligible accuracy loss — or 19×, 27×, and 36×, respectively, with a minor impact on accuracy. When the recomputed initial weights are decayed to zero, the weight memory footprint of WRN-28-10 can be reduced by up to 72×.

Lay Summary

Machine learning models power many consumer-facing features such as Apple's personal assistant Siri and Tesla's self-driving cars. Because current models are too large to store and adapt on mobile devices like smartphones, any model improvements and updates must be done off-line using cloud resources.
In many applications, however, transmitting data to the cloud is unacceptable for privacy and security reasons, and in such use cases the models cannot be retrained after deployment.

In this thesis, we aim to solve this issue by making it possible to train modern machine learning models on these smaller, energy-constrained systems. We develop DropBack, a method for reducing the on-device storage costs needed for creating — i.e., training — large machine learning models. Unlike previous research, which has focused on reducing the costs of using already-created models, DropBack reduces the costs of both creating and using such models. Compared to prior work, DropBack decreases the on-device storage requirements by an order of magnitude, all while retaining best-in-class accuracy results on modern image recognition networks.

Preface

This thesis is original, independent work by the author, Maximilian Golub. Professor Mieszko Lis and Professor Guy Lemieux served as advisors, and helped to edit this work.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures

1 Introduction
  1.1 Energy Consumption of Large Neural Networks
  1.2 DropBack: Pruning While Training
  1.3 Thesis Organization

2 Background
  2.1 Neural Networks
    2.1.1 The Structure of MLPs and CNNs
    2.1.2 Inference
    2.1.3 Training
  2.2 Compression Techniques
    2.2.1 Pruning
    2.2.2 Quantization
    2.2.3 Other Compression Methods
  2.3 xorshift
  2.4 Summary

3 The DropBack Algorithm
  3.1 Algorithm 1
  3.2 Algorithm 2
  3.3 Algorithm 3
  3.4 Differences with Existing Approaches

4 Results and Discussion
  4.1 Methods
    4.1.1 Training Methods
    4.1.2 Networks Used
    4.1.3 Datasets
  4.2 Handwritten Digit Recognition (MNIST)
    4.2.1 Pruning by Weight Magnitude: Algorithm 1
    4.2.2 Pruning by Accumulated Gradient Magnitude: Algorithm 2
    4.2.3 DropBack with Weight Decay: Algorithm 3
  4.3 Fashion Item Icon Recognition (Fashion-MNIST)
    4.3.1 Pruning by Accumulated Gradient Magnitude: Algorithm 2
    4.3.2 DropBack with Weight Decay: Algorithm 3
  4.4 Color Image Classification (CIFAR-10)
    4.4.1 Pruning by Accumulated Gradient Magnitude: Algorithm 2
    4.4.2 DropBack with Weight Decay: Algorithm 3
    4.4.3 Weight Reduction: Algorithm 2 vs. Algorithm 3
  4.5 Discussion
  4.6 Summary

5 Conclusion

Bibliography

List of Tables

Table 4.1: LeNet-300-100 (top) and MLP-100 (bottom) on MNIST, using the naive prune-by-weight-magnitude Algorithm 1. Naive 50K refers to a configuration where 50,000 weights are retained during training, Naive 5K to a configuration with 5,000 retained weights, and so on.
Table 4.2: LeNet-300-100 (top) and MLP-100 (bottom) on the MNIST digit dataset, using Algorithm 2.
Table 4.3: Number of gradients for each layer retained in the final trained MLP-100 network.
Table 4.4: The MNIST digit dataset using LeNet-300-100 (top) and MNIST-100-100, using Algorithm 3. All models were frozen from epoch 25.
Table 4.5: VGG-S on Fashion-MNIST, using Algorithm 2.
Table 4.6: VGG-S on Fashion-MNIST, using Algorithm 3.
Table 4.7: Validation accuracy and weight reduction ratio of several pruned networks on the CIFAR-10 dataset, trained using Algorithm 2.
Table 4.8: Validation accuracy and weight reduction ratio of several pruned networks on CIFAR-10, using Algorithm 3.
Table 4.9: Comparing Algorithm 2 and Algorithm 3 on CIFAR-10.

List of Figures

Figure 1.1: Hypothetical power usage while training WRN-28-10 with a batch size of 60, with iterations taking one second (60 images per second).
Figure 1.2: A basic three-layer multi-layer perceptron (MLP), unpruned in (a) and pruned in (b).
Figure 2.1: A basic two-input, single-output perceptron.
Figure 2.2: A 3x3 convolutional filter calculating the value of a single output. To compute the entire output feature map, the filter would be strided along the input feature map, producing five total outputs in this example.
Figure 3.1: Distribution of accumulated gradients over 100 epochs of standard SGD training on MNIST using a 90,000-weight MLP.
Figure 3.2: Number of weights added/removed to the top-2K gradient set in the first 100 iterations on MNIST.
Figure 3.3: Number of weights added/removed to the top-2K gradient set after the first 100 iterations on MNIST. Note the y-axis scale is 1/10th the scale of Figure 3.2.
Figure 4.1: The VGG-16 network architecture. VGG-16 is an older convolutional neural network and is designed with a single path through the network, with different blocks of 3x3 filters stacked together.
Figure 4.2: The Densenet network architecture. Densenet connects the output of each convolutional layer to the inputs of all downstream layers in order to propagate feature information more readily throughout the network. Figure taken from [27].
Figure 4.3: The types of wide residual network blocks. Wide residual networks are built from these four blocks. Figure taken from [71].
Figure 4.4: The accuracy vs. weight reduction trade-off for LeNet-300-100 and MLP-100 using Algorithm 2.
Figure 4.5: Rate of convergence using Algorithm 2 for LeNet-300-100, for our technique and the baseline model. Note that the y-axis starts at 0.90, and the final accuracies are within 1% of each other.
Figure 4.6: Weight reduction ratio vs. accuracy for VGG-S on Fashion-MNIST.
Figure 4.7: Weight reduction ratio vs. accuracy for VGG-S, using Algorithm 2 and Algorithm 3.
Figure 4.8: Weight reduction ratio vs. accuracy for Densenet, using Algorithm 2 and Algorithm 3.
Figure 4.9: Weight reduction ratio vs. accuracy for WRN-28-10, using Algorithm 2 and Algorithm 3.
Figure 4.10: VGG-S CIFAR-10 epoch vs. validation accuracy for our method at 5M tracked parameters, variational dropout, and the baseline model.
Figure 4.11: Histogram of 3x3 convolutional filters in WRN-28-10 with DropBack 500K Algorithm 3, including filters with all zeros.
Figure 4.12: Histogram of 3x3 convolutional filters in WRN-28-10 with DropBack 500K Algorithm 3, excluding filters with all zeros.
Figure 4.13: Histogram of 3x3 convolutional filters in WRN-28-10 with DropBack 100K Algorithm 3, including filters with all zeros.
Figure 4.14: Histogram of 3x3 convolutional filters in WRN-28-10 with DropBack 100K Algorithm 3, excluding filters with all zeros.
Figure 4.15: Histogram of 3x3 convolutional filters in VGG-S with DropBack 500K Algorithm 3, including filters with all zeros.
Figure 4.16: Histogram of 3x3 convolutional filters in VGG-S with DropBack 500K Algorithm 3, excluding filters with all zeros.
Figure 4.17: Histogram of 3x3 convolutional filters in Densenet with DropBack 500K Algorithm 3, including filters with all zeros.
Figure 4.18: Histogram of 3x3 convolutional filters in Densenet with DropBack 500K Algorithm 3, excluding filters with all zeros.
Figure 4.19: Diffusion (l2) distance vs. training time on MLP-100 (note log time scale).
Figure 4.20: Evolution of weights under SGD projected into 3D space using PCA for DropBack, baseline, magnitude-based pruning, and variational dropout.
Chapter 1

Introduction

Neural networks are becoming an everyday part of modern life; from voice assistants in cellphones to Amazon shopping recommendations, neural networks are ubiquitous. Indeed, some have likened the resurgence of neural networks to the next dotcom-level boom [5, 13, 59]. Typically, these neural networks are deployed in the cloud using datacenter-class GPU compute resources; increasingly, however, they are also being deployed on mobile devices, e.g., for voice and facial recognition [61, 62].

In comparison to top-end GPUs, mobile devices have very limited off-chip DRAM bandwidth and drastically reduced power envelopes. Because modern deep neural networks are large and computationally expensive to use — and even more computationally expensive to train — bringing the recent advances in neural network technology to mobile devices remains a challenge. Consequently, much work in the last three years has focused on reducing pre-trained neural networks to fit on mobile devices so that they can be used for inference [14, 18, 64, 68].

Comparatively little attention, however, has been paid to enabling training within the low power envelopes of mobile devices. On-device training is desirable to (re-)train networks to improve accuracy for individual users' voices and faces [61, 62], as well as in other applications where transmitting training data to the cloud is undesirable for privacy and security reasons.
This presents three key challenges: (a) state-of-the-art networks need to be trained within the mobile device's off-chip DRAM size and off-chip bandwidth restrictions; (b) the power and energy costs of training, which include computation and off-chip memory access, must be reduced to fit within the mobile device's power and energy budgets; and (c) models trained on mobile devices should not lose significant accuracy compared to models trained in the cloud.

In this thesis, we take a step towards enabling on-device training by developing DropBack, an algorithm that lowers the off-chip memory usage for weights, and optionally reduces the computation requirements, during training. We demonstrate that DropBack can reduce the number of weights that must be stored during training by up to 5× with no loss of accuracy, and up to 36× if a small amount of accuracy can be sacrificed. Because retrieving weights from off-chip DRAM and computing on them consumes substantial energy, reducing the memory footprint can in turn reduce the required power envelope; if DropBack is incorporated in a neural network accelerator — a task outside the scope of this thesis — it can potentially reduce the energy required to train deep neural networks on mobile devices by an order of magnitude.

1.1 Energy Consumption of Large Neural Networks

Energy Usage During Inference

Modern deep neural networks can use large amounts of storage and computation: for example, the WRN-28-10 network [71] requires 144MB of weights to be accessed in off-chip DRAM and 5.48 billion floating-point multiply-accumulate operations to classify a single image. In the 45nm technology node, this requires a minimum of 47mJ of energy for accessing DDR2 DRAM (2.6nJ per 64 bits accessed [25]) and a further 27mJ for computation (estimating a single 32-bit floating-point fused multiply-accumulate at approximately 5pJ = 0.9pJ + 4pJ [25]).
In addition, a realistic chip would also have other overheads, such as scheduling, accessing on-chip register files and scratchpads, and so on. To label images at the rate of 60 images per second, WRN-28-10 would consume a minimum of 2.8W [1] of power on memory accesses and 1.6W [2] on computation.

[1] 144MB × 60 s⁻¹ × 2600 pJ / 64b = 2.8W
[2] 5.48×10⁹ ops × 60 s⁻¹ × 5 pJ/op = 1.6W

Reducing Power During Inference

Because accessing DRAM costs ~650× more than a floating-point multiply [25], reducing the number of weights present in the trained network can considerably reduce the energy used for inference.

One simple method of reducing the weight count in neural networks is pruning [14, 18, 38, 44, 46, 60, 64, 65, 74]. Pruning removes some — ideally most — of the weights of a neural network, usually by dropping the weights with the lowest values (we refer to this as magnitude-based pruning). An example of this pruning technique is shown in Figure 1.2: a simple multi-layer perceptron with 13 weights is first trained using the full network (Figure 1.2(a)), and then the weakest connections are dropped by setting the lowest-value weights to zero (Figure 1.2(b)). Pruning away these connections is often effective because neurons essentially compute a weighted sum of their input connections, and connections with very low weight values contribute very little to the neuron's overall output. In this example, approximately half the weights in the network have been removed and the DRAM storage needed for the weights is accordingly halved. When the network is now used to classify an input, the bandwidth required to load the weights from memory, and the associated power cost, are also cut in half.

Power During Training

Training consumes substantially more energy than inference.
A single training iteration using stochastic gradient descent on a single sample (e.g., an image) takes approximately three times the number of weight accesses and operations compared to inference on the same sample, as each weight must be retrieved once during the forward pass, retrieved again during the backward pass, and updated to add the newly computed gradient. In addition, training typically takes many thousands of iterations on batches of tens to hundreds of samples each. For example, training the WRN-28-10 image classification network processing 60 images per second would consume a minimum of 15W of power including activations [3] — far beyond the cooling capacity of the iPhone design, which is 5W [33].

[3] WRN-28-10 has 12 million activations, which cost 15.6mJ to access, and must be accessed twice during training. Weights must be accessed 3 times, and computational requirements are roughly 3 times higher: ((47mJ + 27mJ) × 3 + 15.6mJ × 2) × 60 s⁻¹ = 15.2W

The proportion of memory bandwidth taken up by weights depends on the network being trained, the training library used, and the underlying hardware. For example, in a multi-layer perceptron (MLP), weights are not reused within a single image, and will take up a larger proportion of the bandwidth; on the other hand, in a convolutional network, each weight can be reused multiple times, so, underlying hardware permitting, less bandwidth is needed.

Figure 1.1 shows the power breakdown for two hypothetical libraries training WRN-28-10, a convolutional network (i.e., the worst case for weight pruning). Figure 1.1(b) describes an extreme point where weights are never reused per image in a batch, and so must be accessed and updated for each activation map pulled from memory (this is similar to other network architectures such as MLPs).
Figure 1.1(a) shows the opposite extremum, where the library and underlying hardware allow perfect reuse, and the absolute minimum number of weight accesses occurs in each mini-batch [4].

In the worst case for weight pruning — training a convolutional network, with perfect weight reuse, perfect hardware, and optimal scheduling — activations can take up to 13× more power than weight accesses [5]. However, activations can be readily compressed: solutions such as gradient checkpointing [7] can compress activations by as much as 7× by trading off fewer memory accesses for extra computation. In addition, activations dominate only when many training examples are grouped together in large batches for parallelism, such as in high-end devices like GPUs or TPUs; in smaller devices with fewer compute resources, training with single images is common and effective [47], and weights again dominate as in Figure 1.1(b). For WRN, it is possible to have more weights than activations if the input images are small, in this case 32x32x3.

On-device training therefore requires dramatic improvements in the number of weights stored and accessed during the training process. Unfortunately, as we discuss below, existing pruning techniques cannot be applied to reduce training-time weight counts.

[4] This was calculated using the same equation as the initial example in this section (footnote 3), modified so that weights are only accessed once per mini-batch of 60 images, with one mini-batch being processed per second: (27mJ × 3 + 15.6mJ × 2) × 60 s⁻¹ + 47mJ × 1 s⁻¹ = 6.8W
[5] 1.87 / 0.14 = 13.36
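The back-of-the-envelope power figures in this section follow directly from the per-access and per-operation energies quoted from [25]; a short script (a sketch using those constants, with hypothetical variable names) reproduces them:

```python
# Back-of-the-envelope power estimates for WRN-28-10 (45nm energy numbers).
DRAM_ACCESS_J = 2.6e-9     # 2.6nJ per 64-bit off-chip DRAM access [25]
MAC_J = 5e-12              # ~5pJ per 32-bit floating-point MAC [25]
WEIGHT_BYTES = 144e6       # 144MB of weights
MACS_PER_IMAGE = 5.48e9    # multiply-accumulates per classified image
ACT_ACCESS_J = 15.6e-3     # energy to access all 12M activations once
IMAGES_PER_S = 60

# Inference: every weight read once, every MAC executed once.
weight_energy = WEIGHT_BYTES * 8 / 64 * DRAM_ACCESS_J    # ~47mJ per image
compute_energy = MACS_PER_IMAGE * MAC_J                  # ~27mJ per image
inference_mem_power = weight_energy * IMAGES_PER_S       # ~2.8W
inference_compute_power = compute_energy * IMAGES_PER_S  # ~1.6W

# Training, no weight reuse: 3x weight/compute cost, activations accessed twice.
training_power = ((weight_energy + compute_energy) * 3
                  + ACT_ACCESS_J * 2) * IMAGES_PER_S     # ~15.2W
```

Changing `WEIGHT_BYTES` in this sketch shows directly how a smaller weight footprint shrinks both the inference and training power figures.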
(a) The full-weight-reuse library. (b) The no-weight-reuse library.

Figure 1.1: Hypothetical power usage while training WRN-28-10 with a batch size of 60, with iterations taking one second (60 images per second).

Figure 1.2: A basic three-layer multi-layer perceptron (MLP), unpruned in (a) and pruned in (b).

Limitations of Existing Pruning Techniques

Existing pruning techniques [e.g., 38] require that the entire network with all weights be fully trained before the pruning step can begin. Typically, the lowest-valued weights are then removed from the network (i.e., set to zero), which often results in an accuracy drop; to counteract this, the pruned network is then re-trained to recover some accuracy [21].

Because existing pruning approaches start with a fully-trained unpruned network, they cannot be used directly during training. It is possible, however, to imagine a scheme where this weight-magnitude-based pruning stage is applied during every iteration, and the full set of weights is never stored. We examine this possibility in Section 3.1, where we show that magnitude-based pruning fails to attain acceptable accuracy in nearly all cases. This is because in the beginning stage of training all weights are initialized using random values [37], and a weight with a low initial value can increase dramatically before the training process converges.

Consequently, reducing the weight storage needs and the number of weights accessed during training — and thereby reducing the energy usage and power requirements — calls for a novel approach to weight storage during training.
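For reference, the post-training magnitude pruning step described above can be sketched in a few lines (a simplified illustration under our own naming, not the exact procedure of any cited work):

```python
import numpy as np

def magnitude_prune(weights, keep_ratio=0.5):
    """Zero out the lowest-magnitude weights, keeping roughly `keep_ratio`."""
    k = max(1, int(weights.size * keep_ratio))
    # Threshold at the k-th largest absolute value; ties may keep a few extra.
    threshold = np.sort(np.abs(weights), axis=None)[-k]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)
```

In a post-training pipeline, the pruned network would then be re-trained to recover the accuracy lost when these connections are zeroed.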
We briefly outline our approach below.

1.2 DropBack: Pruning While Training

In this thesis, we develop DropBack, a novel pruning algorithm that (a) can train deep neural networks without accuracy loss while storing up to 4.5× fewer weights during the training process, and (b) produces a pruned network with weight reduction comparable to state-of-the-art post-training pruning techniques on modern networks.

DropBack is based on three key insights: (1) the weights that have accumulated the most gradient updates over time account for most of the learning; (2) the accumulated gradients are predicated on how the initialization values are chosen for the remaining weights; and (3) a pseudo-random number generator can recompute initialization values as required, so they do not need to be stored. Based on these observations, DropBack stores only critical weights during training and recomputes the initial values for the rest of the weights, called untracked weights. Intuitively, recomputing the non-critical weights allows the training algorithm to leverage the scaffolding created by their initial values to improve the learning process for the critical weights. The DropBack algorithm is described as Algorithm 2 in Chapter 3.

The weight reduction substantially reduces both the memory footprint and the memory bandwidth requirements during training. For example, DropBack can reduce the active weight count in the convolutional neural network WRN by over 360×, matching the baseline performance of VGG-S and reducing the required memory bandwidth from 8GB/s to 22MB/s when labeling images at 60 images per second. The network weights then take 0.4MB to store, comfortably fitting in the on-chip SRAM memory of a recent iPhone [33], where accessing 64 bits of on-chip data costs only 5pJ on a 45nm process (instead of 2.6nJ off-chip) [25].
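A minimal sketch of this idea follows (hypothetical function names; the full Algorithm 2 in Chapter 3 handles mini-batches, per-layer shapes, and the tracked-set bookkeeping). The key points are that the untracked weights are regenerated from a seeded PRNG rather than stored, and that "accumulated gradient" reduces to how far a weight has moved from its initial value:

```python
import numpy as np

def dropback_step(weights, grads, init_seed, lr=0.01, k=2):
    # Recompute initial values from a seeded PRNG instead of storing them.
    rng = np.random.default_rng(init_seed)
    init = rng.standard_normal(weights.shape)
    # Candidate SGD update for every weight.
    updated = weights - lr * grads
    # Accumulated gradient = total movement away from the initial value.
    acc = np.abs(updated - init).ravel()
    # Track only the k weights that moved the most; reset the rest to init.
    tracked = np.argsort(acc)[-k:]
    out = init.ravel().copy()
    out[tracked] = updated.ravel()[tracked]
    return out.reshape(weights.shape)
```

Only the k tracked values (and their indices) would need off-chip storage; everything else is recomputable on demand from `init_seed`.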
Interestingly, the DropBack algorithm preferentially removes entire 3×3 convolutional filters (up to 99% of them), even though the algorithm itself has no concept of a convolutional filter.

DropBack differs substantially from prior pruning techniques, which either (a) train an unconstrained network, prune the network, then retrain the network, or (b) add an additional term to the loss function of the network to encourage sparsity that can be used for post-training pruning. DropBack instead prunes the network from the very first iteration, does not require separate pruning and retraining phases, and does not require additional modifications to the loss function of the network.

DropBack outperforms best-in-class pruning methods on network architectures that are already dense and have been found particularly challenging to reduce [41–43]. On Densenet, DropBack achieves 5.86% validation error with 4.5× weight reduction, and on WRN-28-10 DropBack achieves 3.85%–4.20% error with 4.5×–7.3× weight reduction. For both Densenet and WRN-28-10, these weight reduction ratios are state of the art compared to post-training pruning techniques. The memory used during training and inference for weight storage is limited, and no extra steps are required to reduce the network after training.

With the addition of an initial-weight decay parameter, the untracked weights can be reduced to zero, providing both memory and computational sparsity. On Densenet with 500K tracked weights in Section 4.4.2, DropBack with decay can achieve 77.77% sparsity — i.e., 77.77% of the weights are zero and can be removed entirely — without any accuracy loss. On WRN-28-10, DropBack with decay can achieve 86% sparsity (500K tracked weights) without a decrease in accuracy, and 99% sparsity with only a 1% increase in error (100K tracked weights). The weight decay variant of DropBack is described as Algorithm 3 in Section 3.3.

1.3 Thesis Organization

The rest of this thesis is organized as follows.
Chapter 2 contains background information on neural networks and related work, including other methods of reducing the memory footprint and energy consumption of neural networks. Chapter 3 presents the intuition behind DropBack and develops the DropBack algorithm in three phases. Chapter 4 reports and discusses the accuracy and weight reduction ratio of the final two DropBack variants, and presents relevant tradeoffs. Finally, Chapter 5 discusses potential future research directions and concludes this work.

Chapter 2

Background

Understanding DropBack requires a basic knowledge of neural networks, how they are trained, and how they can be compressed using existing techniques. This chapter lays out the two types of neural network used in this thesis — multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). It then discusses how the computational and memory requirements of these networks differ during training and inference, and describes the different compression methods currently in use.

2.1 Neural Networks

Neural networks are constructed of many interconnected neurons, each of which performs a simple mathematical function. In the two variants we consider here, information flows across the network in a feed-forward fashion, with the outputs of neurons in earlier "layers" becoming inputs to neurons in later layers. Neurons organized in regular structures and trained with large datasets have achieved state-of-the-art performance in problems such as classification, regression, and speech understanding. By convention, networks are called "deep" if they contain more than several layers.

Before discussing how to reduce the computational and memory impacts of training, we first give a basic outline of their operation. Section 2.1.1 describes the multi-layer perceptron, one of the simplest forms of neural networks.
Section 2.1.2 describes how neural networks are used once trained, and Section 2.1.3 gives a basic overview of how neural networks are trained.

2.1.1 The Structure of MLPs and CNNs

The simplest neural network consists of one neuron, and is called a perceptron; such a network is shown in Figure 2.1. The perceptron takes two inputs, x and y, and outputs a single resulting value, f(x, y). Internally, the perceptron stores two constant weights, w1 and w2, one for each input. The mathematical operation performed by the perceptron is the dot product, so we have f(x, y) = w1*x + w2*y. Typically the output of a perceptron is then passed through an activation function — originally the Heaviside step function [54], and more recently the sigmoid [37] and rectified linear unit (ReLU) [1] functions. ReLU is typically used in vision-related applications [1], and merely cuts off all negative values at 0:

f(x) = max(0, x)    (2.1)

Perceptrons can be combined into multiple layers to create a multi-layer perceptron (MLP). An example of a simple three-layer MLP is shown earlier in Figure 1.2(a). The input (topmost) layer takes two inputs, x and y, and feeds each to two different neurons (i.e., perceptrons). Each neuron has its own weights for both x and y, labeled w1, w2, w3, and w4. Each neuron has a single output, o1 and o2 respectively, which is the dot product of its inputs with its weights, followed by an activation function. In this example, the activation function is ReLU (Equation 2.1). The outputs of the input layer are the inputs to the hidden layer of three neurons, with each neuron in the hidden layer receiving o1 and o2 as inputs. The hidden layer passes its outputs to the final (output) layer, a single neuron.

Compared to a single perceptron, MLPs are capable of solving non-linearly separable problems and, indeed, any computable function [12].
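As a concrete illustration (a minimal sketch; the function names are ours), the perceptron of Figure 2.1 with a ReLU activation is just a few lines:

```python
def relu(x):
    # Equation 2.1: cut off all negative values at 0.
    return max(0.0, x)

def perceptron(x, y, w1, w2):
    # Dot product of the inputs with the stored weights, then the activation.
    return relu(w1 * x + w2 * y)
```

Stacking such functions, with each layer's outputs fed as inputs to the next, gives the MLP structure described above.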
However, because every neuron has a separate weight for every feature in the input, MLPs quickly grow too large to train even with a modest number of input dimensions. For example, in the commonly used ILSVRC2012 dataset [55], images are typically scaled to 256x256x3, for a total input size of 196,608 pixels; given an MLP with 1,000 neurons in the input layer, just that layer would use approximately 800MB of weights. In order to scale neural networks to problems with many inputs, such as image recognition, alternative structures are therefore used.

The most common alternative to basic MLPs is the convolutional neural network (CNN). CNNs are built out of convolutional layers, pooling layers, and typically one or two "dense layers" — another name for a layer of an MLP — as the final layers. The convolutional layers take advantage of the observation that vision tasks — such as detecting edges — are the same regardless of where in the image they occur, i.e., are invariant under translation. The convolutional layers are therefore built from convolutional "filters" which, when convolved with a set of neighboring pixels, produce one output value. The operation of the filter is shown in Figure 2.2. A filter will only consider pixels in its receptive window — e.g., the filter in Figure 2.2 has a 3x3 receptive window — and pixels outside this window never contribute to the current output value being computed. To produce an entire output layer, the convolutional filter is moved across the input layer (e.g., left to right in Figure 2.2). Typically, each convolutional layer consists of many filters, the outputs of which are stacked to produce many output feature maps as inputs to the next layer. Pooling combines the outputs from a few neurons, usually using the max function, to effectively downsample the previous layer. As with the "dense" MLP layers, convolutional and pooling layers are stacked together to form a network.
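The filter operation can be sketched directly (a naive reference implementation under our own naming: single channel, unit stride, no padding):

```python
import numpy as np

def conv2d_single(feature_map, kernel):
    # Slide a KxK filter over the input; each position yields one output value.
    kh, kw = kernel.shape
    h, w = feature_map.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i:i + kh, j:j + kw]  # receptive window
            out[i, j] = np.sum(window * kernel)
    return out
```

On a 3x7 input with a 3x3 filter, this produces the five outputs of a single row, as in the Figure 2.2 example.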
Figure 4.1 shows a fairly small CNN which we use as one of the evaluation models.

2.1.2 Inference

For the inference task, such as classifying an input image as a cat vs. a dog, the neural network is run forward: inputs are provided to the first layer, which in turn computes the inputs to the second layer, and so on. The outputs of the final layer form the output of the entire network; for example, they might represent confidence that the image belongs to a certain class (e.g., cat or dog).

During a forward pass, the weights of the network are only read and never updated, and storage needed for intermediate results is limited to a single layer.¹ The amount of intermediate-result memory required to perform inference can therefore be minimal, depending on the exact structure of the network and how the computations are scheduled. On the other hand, all of the weights for the entire network must be read during each inference pass.

¹ Or a small, constant number of layers in more advanced networks like residual networks [22].

Figure 2.1: A basic two-input, single-output perceptron computing f(X, Y) = X·W1 + Y·W2.

The amount of computation and memory required for inference tasks with a given network can often be further reduced by modifying the network through pruning, which removes some weights from the network, or quantization, which reduces the number of bits needed to represent each weight. Pruning is described in Section 2.2.1 and quantization in Section 2.2.2.

2.1.3 Training

To train a neural network, the weights of each neuron are first initialized to small random values, usually drawn from a scaled normal distribution [40]. The aim of training is to change these weights to give the correct output for each example in the training set, which contains multiple inputs, each paired with its respective "ground truth" label.
If the training set is drawn from the same distribution as the examples the network is expected to classify, and care is taken to prevent the model from overtraining, a trained network can be deployed and perform with accuracy close to its performance on the training set.

Weights are typically trained using an optimization algorithm called stochastic gradient descent (SGD). First, after one or more inputs (a "batch") are classified using the forward pass, the loss (prediction error) on these inputs is computed; the mean squared error, shown in Equation 2.2, is commonly used as the loss function. Next, the algorithm calculates the gradient of each weight with respect to the loss. Finally, each weight's gradient is scaled by a factor called the learning rate and subtracted from the weight, as shown in Equation 2.3. In effect, each weight is updated based on how much it contributed to the error of the current output compared to the ground truth.

Figure 2.2: A 3×3 convolutional filter calculating the value of a single output. To compute the entire output feature map, the filter would be strided along the input feature map; in this example, it would produce five total outputs.

Applying the chain rule to the problem of computing any gradient in the network yields an algorithm called back-propagation, or backprop [37]. The key observation is shown in Equation 2.4: the partial derivative of the loss with respect to a weight can be chained through the unactivated output of the neuron itself and through the activation function by multiplying the relevant partial derivatives.
Because this can be done in a layer-by-layer fashion, starting from the last layer and ending at the first, the weight-update part of the backprop algorithm is also known as the backward pass.

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \quad (2.2)

\Delta w_i = -\alpha\,\frac{\partial E_{\mathrm{MSE}}}{\partial w_i} \quad (2.3)

\frac{\partial E_{\mathrm{MSE}}}{\partial w_i} = \frac{\partial E_{\mathrm{MSE}}}{\partial a_i}\,\frac{\partial a_i}{\partial \mathrm{net}_i}\,\frac{\partial \mathrm{net}_i}{\partial w_i} \quad (2.4)

The backward pass is far more memory-intensive than the forward pass; the exact factor depends on the implementation details of the deep learning library used and the network being trained, but can range from 2× to 100×. A significant component of this cost comes from the need to store the activations of each layer, which are needed to compute how the loss changes with respect to each weight. For an MLP, this activation storage cost is equivalent to the number of weights in the next layer; for a CNN, on the other hand, the number of activations per weight is much higher, since the same small set of weights (i.e., a filter) produces an entire output map.

Compressing or reducing the number of activations in CNNs and neural networks is therefore an active area of research. Some research has focused on storing activations in larger, slower memories farther away from the computation, such as [52], while other work has focused on recomputing the activations as the backward pass progresses [7]. Activations have also been compressed successfully by omitting zero values [46, 53, 74]. While activation storage is expensive, activations are only ever accessed twice — once to write the activations from the forward pass, and once to read them for the current neuron in the backward pass. In contrast, weights are repeatedly accessed — especially in CNNs — during the forward pass, and during the update from the backward pass.

DropBack complements the work on reducing activation storage at training time by also reducing storage requirements for weights during training.
Chapter 3 describes how DropBack modifies the training process in a way that can reduce the number of weights that need to be stored by an order of magnitude or more.

2.2 Compression Techniques

Research to date has examined two main methods of reducing the memory and computational overhead of deep neural networks: pruning and quantization. Pruning a neural network removes weights, filters, or entire sections of the network deemed unimportant to producing the network's final output. Weights to be removed are selected via a heuristic, e.g., removing the smallest-magnitude weights. Quantization, on the other hand, does not modify the structure of the network itself but instead changes how its weights and activations are represented. Each value is stored using fewer bits than normal (which means that the stored value is approximate), reducing both the memory needed to store the weights and/or activations as well as the computational cost. Other techniques have been proposed, such as HashedNets [8] and Huffman coding [18], but they are less commonly used than pruning or quantization, as they require significant restructuring of neural network libraries.

2.2.1 Pruning

Pruning, or the removal of specific parameters in a neural network, was introduced in 1990 as a technique for improving generalization and reducing the number of examples needed to train a network [38]. Magnitude-based pruning is the most basic approach: it simply removes the lowest-magnitude weights up to a user-specified percentage (e.g., 95% of the weights). Like most pruning methods, magnitude-based pruning requires that the network be trained first, then pruned, and only then retrained to recover accuracy. This method was effective on the smaller datasets used by LeCun et al. [38], but loses significant accuracy compared to more modern methods on larger networks and datasets.
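The magnitude-based heuristic just described amounts to zeroing all but the largest-magnitude weights; a minimal sketch (our own illustration, with a `fraction` parameter standing in for the user-specified percentage):

```python
import numpy as np

def magnitude_prune(weights, fraction):
    # Zero out the lowest-magnitude `fraction` of the weights,
    # leaving the rest unchanged.
    k = int(weights.size * fraction)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)
```

In the post-training workflow described above, a network would be trained, passed through a function like this, and then retrained to recover the lost accuracy.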
While this technique analytically predicts the effect of pruning and prunes the weights with the smallest predicted perturbation [38], the retraining step is still required.

Later work considered using second derivatives to select parameters to prune [21], or training the network itself to learn which connections are important through either ℓ1 or ℓ2 regularization [18]. Computing the second derivatives of all the parameters is, however, computationally and memory intensive, and training with ℓ1 and ℓ2 regularization results in slower training. Alvarez and Salzmann [3] focused on reducing the rank of the parameter matrices during training for later compression, again at the cost of increased training time; this approach is also unable to reduce training-time memory usage compared to standard training.

Overall, pruning networks to enforce sparsity levels of 90%–99% after training has been effective for many networks [14, 18, 38, 44, 46, 60, 64, 65, 74]. In contrast to DropBack, all of these post-training pruning techniques require a retraining step to regain lost accuracy, and both pruning and low-rank constraints still require un-pruned backpropagation during the training phase. In fact, the memory and energy cost of the extra backpropagation step in standard pruning methods increases the barrier to training in low-power embedded systems.

Other work has attempted to prune the network during training, either to improve accuracy or to increase sparsity. Zhu and Gupta [75] gradually increase the number of weights masked from contributing to the network, while Molchanov et al. [49] extend variational dropout [32] with per-parameter dropout rates to increase sparsity. Babaeizadeh et al. [4] inject random noise into a network to find and merge the most correlated neurons. Finally, Langford et al. [35] decay weights every k steps (for a somewhat large value of k), inducing sparsity gradually.
However, unlike DropBack, all of these techniques store the entire unpruned network (plus potentially a 2× overhead for extra state per weight in the network) and therefore require at least as much memory to train as the unpruned network.

2.2.2 Quantization

Quantization can be used to reduce the inference-time and training-time memory and compute requirements by representing the network weights and activations with lower precision than normal. To reduce storage costs, numbers can be quantized after training [9, 14, 17, 18, 64, 68, 72], or during an iterative retraining process [72]. All of these techniques quantize floating-point number representations to fixed-point representations; this can reduce the energy required for computation, as floating-point operations are more energetically expensive than fixed-point operations. Both Zhou et al. [72] and Gysel [17] go further and either encourage or mandate weight values to be powers of two; this allows multiplications to be replaced by single shift operations, saving computation energy. Importantly, while the final product of these methods is quantized, they all rely on the full floating-point weights being available during training.

Quantization can also be performed during training instead of as a post-processing step, as shown by Cai et al. [6], Courbariaux et al. [10, 11], Gupta et al. [16], Holt and Baker [24], Hubara et al. [28], Mishra et al. [48], Rastegari et al. [51], Simard and Graf [57], and Zhou et al. [73]. Out of all of these methods, only Courbariaux et al. [10], Gupta et al. [16], Hubara et al. [28], and Mishra et al.
[48] use reduced precision while training to lower the training-time storage costs; the other methods store the full non-quantized weights during backpropagation, just like the post-training methods. Quantization is orthogonal to pruning (and, in particular, to DropBack), and the two techniques can potentially be combined.

Quantization has also been used to reduce the overhead of gradient broadcasts in distributed training environments. Seide et al. [56] reduce gradients to a single bit upon broadcast, saving 32× the bandwidth compared to standard 32-bit floating-point (FP32) gradient broadcasts. Similar approaches with less quantization than that achieved by Seide et al. [56] have been studied by Alistarh et al. [2], Wangni et al. [66], and Wu et al. [67]. All of these works share a limitation: each local device maintains the full FP32 parameters as well as an additional error term for each gradient to correct the quantization upon each broadcast, so such approaches reduce inter-node communication costs at the cost of local storage. Because DropBack only ever tracks a small subset of gradients, a distributed implementation could also reduce broadcast costs (by the same amount as the weight compression factor) by only syncing the tracked weight sets, and could do so without maintaining a full set of the parameters on each local training node.

2.2.3 Other Compression Methods

Very few techniques have attempted to reduce both the inference-time and training-time memory and compute costs. HashedNets [8] use a hash function to group neuron connections into buckets; within each bucket, all connections use the same weight value. In effect, HashedNets quantize the weights, but use a lookup table to permit full 32-bit floating-point weights to be used during computation.
However, the technique was never tested beyond MNIST, a small and extremely easy dataset, where it achieved decent but not state-of-the-art compression.

Huffman coding [29] has also been used to compress neural networks [18]. As a lossless compression technique, the coding was applied after both pruning and quantization had been performed, and after the network had been trained to completion. This allowed the network to be compressed approximately 2× more than pruning and quantization alone, but required additional computation to decompress the weights on every access, and complicated the memory access pattern. Huffman decoding is far more expensive than the weight regeneration mechanism used to represent non-tracked gradients in DropBack (see Figure 3.2), but is orthogonal to our technique and can be used to compress the tracked weights if the energy cost is warranted.

2.3 xorshift

The xorshift algorithm [45] is a very lightweight method of generating high-quality pseudo-random numbers. As shown in Algorithm 0, the state x is initialized to any non-zero integer and is then shifted and xor-ed three times. Later, in Section 3.2, this xorshift algorithm will be modified to regenerate initial values.

Algorithm 0: The 32-bit xorshift algorithm. The parameter x is a single 32-bit non-zero number; for deterministically regenerating weights, x is the flattened index of a weight plus a constant factor that is different for every training run.
    Initialization: x ∼ Z
    Output: x
    x = x ⊕ (x << 13)
    x = x ⊕ (x >> 17)
    x = x ⊕ (x << 5)

2.4 Summary

DropBack differs from all of these techniques by specifically targeting memory constraints during training, and, unlike all of them, prunes the network from the very first iteration of training. DropBack is described next in Chapter 3.

Chapter 3
The DropBack Algorithm

Uniquely among pruning techniques, DropBack prunes weights from the very first iteration of training; the full set of weights is never stored in or retrieved from memory.
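Avoiding any stored copy of the untracked weights rests on regenerating their values on demand with the xorshift generator of Section 2.3. A Python sketch of Algorithm 0 (the `& 0xFFFFFFFF` masks emulate the 32-bit wraparound that hardware provides for free):

```python
def xorshift32(x):
    # Algorithm 0: three shift-and-xor steps over a 32-bit state.
    assert x != 0, "state must be a non-zero 32-bit integer"
    x ^= (x << 13) & 0xFFFFFFFF
    x ^= x >> 17
    x ^= (x << 5) & 0xFFFFFFFF
    return x & 0xFFFFFFFF
```

Because the output depends only on the input x (e.g., a weight's flattened index plus a per-run constant), the same "random" value can be regenerated identically at any point during training.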
To provide this reduction in memory usage, DropBack continuously omits weights that it determines to be least important to the final output of the network; these are called untracked weights. The final version, Algorithm 3, produces a sparse network where all the untracked weights are zero during inference.

This chapter develops DropBack in three steps, each building on insights from the previous version:

• Algorithm 1, where untracked weights are set to zero.
• Algorithm 2, where untracked weights maintain their initial values, which can be regenerated on the fly without storage.
• Algorithm 3, where untracked weights are decayed from their initial values to zero over time during training.

The rest of this chapter describes each of these in detail.

3.1 Algorithm 1

We first investigated an adaptation of the magnitude-based pruning approach [38, etc.] commonly used for post-training pruning. In this algorithm, each iteration of training tracks only the highest-magnitude weights, while all other weights are set to zero. The details of the computation are shown in Algorithm 1. In the algorithm, Wtrk and Wutrk represent tracked and untracked weights, T and U are tracked and untracked accumulated gradients, S is the set of sorted accumulated gradients, k is the number of gradients to track, λ is the lowest tracked cumulative gradient, and α is the learning rate. The mask is a boolean matrix with the same shape as the weights, and ¬mask denotes its logical inverse.
The 1(·) operator indicates that the mask is boolean.

Although the listing in Algorithm 1 sorts all the weights before pruning for clarity of exposition, it is not necessary to physically store and sort the full weight set. In a practical implementation, the tracked weight set would be stored using a priority queue of size n, in which an incoming gradient higher than the stored minimum evicts the smallest element.

In contrast with common post-training magnitude-based pruning techniques, Algorithm 1 encapsulates the training-pruning-retraining phases in a single training step, which reduces the number of epochs required to converge. In essence, each step in this algorithm is equivalent to a post-pruning retraining step. Ideally, this would move the network towards an optimum while reducing the number of parameters from the very first training step.

We tested Algorithm 1 using both LeNet-300-100 [37] and a smaller multi-layer perceptron with only 100 hidden neurons, which we refer to as MLP-100. The achieved weight reduction ratio was slightly less than 2×, an inferior result compared with other existing pruning techniques (detailed results are discussed in Section 4.2). Below, we investigate why Algorithm 1 performs poorly, and draw insights for creating a better version.

Initial Weights are a Poor Metric for Selection

Proper weight initialization is essential in order to train deep neural networks quickly, and even to learn at all [40]. These initial weights are typically drawn at random from a scaled Gaussian distribution, and updated by the stochastic gradient descent optimization algorithm. During the first iteration of training, weight updates are scaled by a typical learning rate of 0.001 to 0.1 when using stochastic gradient descent with momentum, or higher learning rates of up to 0.5 without momentum, so they move little compared to their initial magnitude.
As a result, pruning away the lowest-magnitude weights after the very first training step leads to a nearly random selection of weights to be dropped. In effect, the only weights that are tracked are those with the highest initial values, and no other weights have the opportunity to learn.

Vanishing Gradients Result in Poor or No Learning

Algorithm 1 suffers from another problem that stems from immediately setting untracked weights to zero. When most of the weights are zero, the activations in much of the network also become zero during the forward pass. In turn, this causes the gradients of the loss with respect to those weights to also become zero during the backward pass, which inhibits learning. In our experiments, for example, setting 90% of the weights of the MLP-100 network to zero (a 10× weight reduction) caused nearly 99% of the gradients to become zero as well, preventing optimization.

This effect is well known in the literature as the vanishing gradient problem [15]. To counteract vanishing gradients, post-training pruning schemes such as those of Zhu and Gupta [75] and Han et al. [18] increase the sparsity of the network very gradually, so that relatively few additional weights are zeroed in each training iteration; although the final sparsity achieved with those methods can be as high as 90%, that level of sparsity is only reached at the very end of training. Because we aim to reduce the memory footprint from the very start of the training process, however, we cannot employ this gradual sparsity technique in DropBack.

3.2 Algorithm 2

The second step in developing DropBack is based on the two key observations from evaluating Algorithm 1: (a) that dropping the lowest-magnitude weights early on inhibits learning, and (b) that resetting weights to zero results in the vanishing gradient problem.
We take advantage of these observations in the three insights discussed below.

Insight 1: Track the Highest Accumulated Gradients

Section 3.1 shows that the naive approach of tracking only the highest-magnitude weights is not effective during the first few training iterations, and so cannot be used to reduce the memory footprint throughout the entire training process. Informally, this is because the initial value of each weight, typically drawn from a scaled normal distribution [40], serves as scaffolding which gradient descent can amplify to train the network.

Algorithm 1: Pruning the Lowest-Magnitude Weights
    Initialization: W(0) with W(0) ∼ N(0, σ)
    Output: W(t)
    while not converged do
        T = { W(0) + Σ_{i=0}^{t−1} α ∂f(W(i−1); x(i−1))/∂w  s.t. w ∈ Wtrk }
        U = { W(0) + α ∂f(W(i−1); x(i−1))/∂w  s.t. w ∈ Wutrk }
        S = sort(T ∪ U)
        λ = S_k
        mask = 1(S > λ)
        W(t) = mask · (W(t−1) − α∇f(W(t−1); x(t−1))) + ¬mask · 0
        t = t + 1
    end

Figure 3.1: Distribution of accumulated gradients over 100 epochs of standard SGD training on MNIST using a 90,000-weight MLP.

Instead, our approach in DropBack is to first observe that each weight is a sum of its initial value and the gradient accumulated during training, and to logically separate the two components. (At first sight, this may appear counterproductive, because now the initial value may need to be stored for each weight; we resolve this problem using Insight 2 below.) Then, to reduce the number of weights tracked, we keep only a fixed number of weights which have learned the most overall — that is, the weights with the highest accumulated gradients.

To validate this hypothesis, we examined the distribution of accumulated gradients during the first 100 training iterations of a 90,000-weight MLP on the MNIST dataset (see Section 4.1 for experimental method details). In the histogram, shown in Figure 3.1, most gradients are very close to zero.
This shows that weights move very little from their initial values, and suggests that only a small fraction of gradients needs to be tracked, provided the remaining weights are kept at their initial values.

Insight 2: Recompute Initialization-time Values for All Untracked Weights

As discussed in Section 3.1, simply setting untracked weights to zero impairs training; this is because the scaffolding provided by the initialization values is critical to the accuracy of the trained network in order to prevent vanishing gradients. If the initialization values can be preserved, we reasoned, we should be able to achieve higher accuracy and better pruning.

To validate this hypothesis, we trained the 90,000-weight MLP on MNIST as in Section 3.1, but this time allowed the untracked weights to retain their initialization-time values. We observed that the tracked weights could be reduced up to 60× if initialization values were preserved, but only 1.8× if untracked weights were zeroed (see Section 4.2.1 for details). This is in line with the observation that zeroing weights causes gradients to vanish (see Section 3.1), and suggests that initialization-time values for all weights should somehow be preserved.

However, storing the initialization-time values would require accessing memory to retrieve them during both the forward and backward passes of training — a costly proposition for large networks, where an off-chip memory access consumes upwards of 2900× more energy than a 32-bit floating-point add.

To avoid storing these initial weights, we observe that in practice the values are initialized using a pseudo-random number source that is seeded with a single value and post-processed to fit a scaled normal distribution [40].
Because each value depends only on the seed value and its index, it can be deterministically regenerated precisely when it is needed for computation, without ever being stored in memory.

While the value might still be very briefly stored in the on-chip register file (in a CPU or GPU), register files are small SRAMs located very close to the processor's functional units, and accessing them costs a fraction of the energy needed to access off-chip DRAM [25], for example. In addition, as the register file only serves to communicate values between individual CPU or GPU instructions, the lifetime of this storage is limited to the few instructions that it takes to generate the value and use it to multiply with a single activation value, so only a very small amount of storage is needed.

To recompute a normally distributed pseudo-random initialization value, we employ the xorshift algorithm described earlier in Section 2.3, modified to output a floating-point number between −1 and 1. This computation, shown in Algorithm 4, requires eight 32-bit integer operations and one 32-bit floating-point operation. Recomputing a weight therefore consumes about 1.5pJ of energy in a 45nm process, which is still 1700× less energy than a single off-chip memory access.

Regenerating untracked parameters to their initial values also works out-of-the-box for layers like Batch Normalization or Parametric ReLU, where the initialization strategy is typically a constant value that is the same for the entire network. In this case, DropBack merely regenerates the constant value rather than using Algorithm 4. As a result, these layers are also pruned by DropBack, a unique feature not found in other pruning methods.

Algorithm 4: The 32-bit xorshift algorithm modified for generating random floating-point numbers between −1 and 1.
The parameter x is a single 32-bit non-zero number; for deterministically regenerating weights, x is the flattened index of a weight plus a constant factor that is different for every training run. The output of this algorithm can then be scaled by the number of inputs to a layer, as suggested by LeCun et al. [40].
    Initialization: x ∼ Z
    Output: x
    x = x ⊕ (x << 13)
    x = x ⊕ (x >> 17)
    x = x ⊕ (x << 5)
    x = (x & 0x007fffff) | 0x40000000
    x = x − 3.0

Insight 3: After a Few Epochs, Freeze Which Weights Are Tracked

During training, it may happen that a "new" gradient for an untracked weight exceeds one of the accumulated gradients tracked by DropBack. This is very common during the initial phase of training, and is an artifact of the optimization algorithm's efforts to select the most productive direction. However, this also means that gradients for all of the untracked weights are still computed in every iteration. Selecting the tracked set costs additional energy as well, as either software or hardware sorting is required. While the energy required for gradient computation and comparison is small compared to retrieving all weights from off-chip DRAM, it can noticeably contribute to the total energy expenditure once only a small fraction of weights is tracked.

In order to sidestep most of these accesses, we reasoned that the weight selection should stabilize after a few epochs. By then, most of the tracked gradients should have accumulated sufficiently large magnitudes that it is unlikely they would be exceeded by a "new" gradient for an untracked weight; indeed, as most accumulated gradients have small magnitudes (see Figure 3.1), we expected most weights to move little.

To validate this intuition, we trained the 90,000-parameter MLP-100 on MNIST using standard SGD while keeping track of the 2,000 weights that had accumulated the highest gradients.
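The bit manipulation in Algorithm 4 can be mirrored in Python (a sketch for illustration; `struct` is used to reinterpret the integer bit pattern as an IEEE-754 float, which a hardware implementation gets for free):

```python
import struct

def xorshift_float(x):
    # The three xorshift steps of Algorithm 0, with 32-bit wraparound emulated.
    x ^= (x << 13) & 0xFFFFFFFF
    x ^= x >> 17
    x ^= (x << 5) & 0xFFFFFFFF
    # Keep the low 23 bits as an IEEE-754 mantissa and force the exponent
    # field to that of 2.0, giving a float uniformly distributed in [2, 4)...
    bits = (x & 0x007FFFFF) | 0x40000000
    value = struct.unpack("<f", struct.pack("<I", bits))[0]
    # ...then shift the result down to [-1, 1).
    return value - 3.0
```

Scaling the result by the layer's number of inputs, as the caption of Algorithm 4 suggests, would follow this call.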
Figure 3.2 and Figure 3.3 show that the set of the 2,000 highest-gradient weights stabilizes in relatively few iterations.

Figure 3.2: Number of weights added/removed to the top-2K gradient set in the first 100 iterations on MNIST.

Figure 3.3: Number of weights added/removed to the top-2K gradient set after the first 100 iterations on MNIST. Note the y-axis scale is 1/10th the scale of Figure 3.2.

During the first iterations, many weights are added to and removed from the set of 2,000 highest accumulated gradients (Figure 3.2). After the first ten iterations, however, the top-2,000 gradient set settles down, with < 0.04% of all weights being added to or removed from the top-gradient set in any iteration (Figure 3.3). This "noise" remains throughout the training process.

This observation allows us to "freeze" which gradients are tracked after a small number of epochs (e.g., 10–20), where an epoch is a full pass through the training dataset (approximately 500 iterations per epoch here). After freezing, no new gradients may replace those in the currently tracked set, so gradients need to be computed only for the weights in the tracked set instead of for all weights.

We investigated several methods of selecting the freeze epoch, including via a hyper-parameter, freezing after some validation accuracy has been reached, and freezing once the number of additions/removals became lower than a small percentage of the tracked weight count (see Section 4.3.1). Because freezing at a constant epoch worked best, we chose to freeze at the best epoch for each test in Chapter 4, determined by sweeping the hyper-parameter space for the freeze epoch.

The Complete Algorithm 2

Algorithm 2 shows the resulting DropBack training process, which incorporates the three insights above.
N(0, σ) values are generated from the xorshift pseudo-RNG shown in Algorithm 4.

Weights are initialized from a scaled normal distribution [40], generated by the xorshift pseudo-random number generator. After a user-defined epoch, the tracked set is "frozen," and gradients are only updated for the weights already tracked; this saves additional computation time and energy. Note that the tracked gradient set T requires no storage: its elements are recomputed when needed from W(t−1) − W(0). As in standard stochastic gradient descent, training ends once the network is considered to have converged.

Algorithm 2: Pruning the Lowest-Magnitude Accumulated Gradients
    Initialization: W(0) with W(0) ∼ N(0, σ)
    Output: W(t)
    while not converged do
        T = { Σ_{i=0}^{t−1} α ∂f(W(i−1); x(i−1))/∂w  s.t. w ∈ Wtrk }
        if not frozen:
            U = { α ∂f(W(i−1); x(i−1))/∂w  s.t. w ∈ Wutrk }
        else:
            U = {}
        S = sort(T ∪ U)
        λ = S_k
        mask = 1(S > λ)
        W(t) = mask · (W(t−1) − α∇f(W(t−1); x(t−1))) + ¬mask · W(0)
        t = t + 1
    end
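Putting the pieces together, a single (unfrozen) iteration of Algorithm 2 can be sketched on a flat weight vector as follows (our own illustration; in the real scheme the initial values w0 would be regenerated by Algorithm 4 rather than passed in, and the tracked set would be kept in a fixed-size priority queue rather than recomputed with a sort):

```python
import numpy as np

def dropback_step(w, w0, grad, k, alpha):
    # One illustrative DropBack (Algorithm 2) update on a flat weight vector:
    # w - current weights, w0 - (regenerable) initial weights,
    # grad - gradient of the loss w.r.t. w, k - number of tracked weights.
    candidate = w - alpha * grad
    # The accumulated gradient of each weight is its distance from its
    # initialization; the k weights that have learned the most are tracked.
    accumulated = np.abs(candidate - w0)
    tracked = np.zeros_like(w, dtype=bool)
    tracked[np.argsort(accumulated)[-k:]] = True
    # Tracked weights take the SGD update; untracked ones revert to w0.
    return np.where(tracked, candidate, w0)
```

Only the k tracked values ever need to reach off-chip memory; every untracked entry equals its (regenerable) initialization value.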
As the initial weight values slowly decay to zero, the tracked weights — the only weights that change during training — should gradually adjust to compensate for the lower values (and, eventually, the absence) of the non-tracked weights.

To test this intuition, we allowed the initial values to decay to zero over time using an exponential decay term controlled by the number of iterations: each iteration scales down the regenerated initial values until they reach 0. Details are shown in Algorithm 3. In all of our experiments, using 32-bit floating-point weights, every initial parameter has decayed to zero by iteration 1,000. Overall, DropBack with the decay modification achieves the best accuracy-to-weight-reduction tradeoff on all of the networks used in this thesis except the MLPs (see Section 4.2.3 for details).

Algorithm 3: Decaying the Initial Parameters
    Initialization: W(0) with W(0) ∼ N(0, σ)
    Output: W(t)
    while not converged do
        T = { Σ_{i=0}^{t−1} α ∂f(W(i−1); x(i−1))/∂w  s.t. w ∈ Wtrk }
        if not frozen:
            U = { α ∂f(W(i−1); x(i−1))/∂w · δ^(t−1)  s.t. w ∈ Wutrk }
        else:
            U = {}
        S = sort(T ∪ U)
        λ = S_k
        mask = 1(S > λ)
        W(t) = mask · (W(t−1) − α∇f(W(t−1); x(t−1))) + ¬mask · W(0) · δ^(t−1)
        t = t + 1
    end

3.4 Differences with Existing Approaches

Most existing pruning techniques (discussed in Section 2.2.1) either require the network to be retrained after pruning is performed, or need additional terms to be applied to the loss function of the network to encourage increased sparsity. Pruning techniques that rely on retraining are well studied [14, 18, 44, 46, 64, 74, etc.], and can achieve close to the original accuracy of the unpruned network. Rather than decrease the memory footprint during training, these algorithms actually increase it, because they typically track the full network plus additional parameters. In addition, they typically increase training time, as an additional retraining or fine-tuning stage is required.
DropBack differs from these because it substantially reduces the memory footprint and memory access counts during training as well as when the pruned network is used for inference.

Adding a term to the loss or additional parameters to encourage sparsity has also been examined in several papers [4, 49, 75, etc.]. While these pruning techniques are as effective as pruning using a retraining stage, they massively increase the time to convergence, the memory usage, or both. These pruning techniques require extra parameters for stochastic gradient descent to update, making the optimization task more difficult and increasing memory usage by several times. In contrast, DropBack is able to reduce the memory footprint of weights during training and converges at a rate equivalent to a non-pruning training algorithm. These alternative pruning techniques could be used to gradually reduce the computation and memory required while training a network, but no current work has done so; additionally, these existing techniques cannot provide an upper bound on memory usage from the very first iteration like DropBack can.

In the next chapter, we evaluate the variants of DropBack on real-world deep neural networks, investigate the reason why DropBack is effective, and analyze the tradeoffs for both algorithms.

Chapter 4

Results and Discussion

DropBack was tested on three different datasets and a total of five different neural networks, achieving between 4× and 72× weight reduction without losing more than 2% in validation accuracy. On the more complicated networks, WRN-28-10 and Densenet, DropBack Algorithm 2 outperformed the current state-of-the-art pruning techniques on CIFAR-10 in accuracy and weight reduction. The DropBack variant that decays weights to zero, Algorithm 3, reduced the number of non-zero parameters by 98.6% on WRN-28-10/CIFAR-10 with an acceptable accuracy loss of 1.4 percentage points.
When the network architecture allows it, this scheme naturally results in zeroing out entire convolutional filters — with 94% of the 3×3 convolutional filters removed entirely — a further reduction in computation.

In the remainder of this chapter, we detail the methods, networks, and datasets used for testing, and discuss the results.

4.1 Methods

DropBack was tested on five networks and three datasets, using standard industry tools. Results are directly comparable to other pruning papers.

4.1.1 Training Methods

DropBack was implemented using the Chainer deep neural network toolkit [63]; all runs, including DropBack runs, were trained on an NVIDIA 1080Ti GPU. We compared DropBack against four training regimes:

• a baseline implementation without any pruning;
• a straightforward magnitude-based pruning implementation where only the highest-magnitude weights are kept after each iteration, i.e., Algorithm 1;
• variational dropout [32], which can progressively prune weights during training but at least doubles the memory footprint; and
• network slimming [42], a modern train-prune-retrain pruning method that achieves state-of-the-art results on modern network architectures.

All DropBack results were optimized using stochastic gradient descent without momentum, as all other optimization strategies cost significant extra memory. All methods without DropBack used the ADAM optimizer [31] with the parameters specified by the original papers.

4.1.2 Networks Used

A total of five deep neural networks are used to test DropBack. The first two are basic MLPs with different numbers of neurons in their hidden layers, the third is a straightforward convolutional neural network, and the final two are state-of-the-art convolutional networks with complex, compact architectures.
They are:

• LeNet-300-100 [39], a common benchmark MLP, with 784 neurons in the input layer, 300 neurons in the first hidden layer, 100 neurons in the second hidden layer, and 10 neurons in the output layer, for a total of 266K weights;
• MLP-100, a smaller variant of LeNet-300-100, with only 100 hidden neurons in the first hidden layer and no second hidden layer, for a total of 90K weights;
• VGG-S [70], shown in Figure 4.1, a smaller variant of VGG-16 [58] that omits one fully connected layer, reduces the other two to 512 neurons instead of 4,096, and adds modern batch normalization layers between its convolution blocks;
• Densenet [27], shown in Figure 4.2, a very densely interconnected variant of VGG aimed at improving training accuracy while reducing the weight count; and
• WRN-28-10 [71], a modern wide residual network with state-of-the-art performance on image tasks, which features a complex architecture with four different block types, shown in Figure 4.3, and is discussed in more detail below.

LeNet-300-100 and the original VGG-16 are widely acknowledged to be very overprovisioned [30], with many more weights than necessary (i.e., later variants with fewer weights perform just as well or better), but are often used to demonstrate pruning effectiveness [14, 44, 46]. To avoid pruning succeeding purely because the network is too large, we focus on the MLP-100 variant of LeNet-300-100, as well as the much smaller VGG-S variant of VGG-16, which reduces the weight count from 138M to 15M.

Densenet and WRN-28-10 are both much newer and more complex convolutional networks with fewer parameters, and were chosen precisely because their complex, more interconnected architectures make them very challenging to prune with existing techniques [41–43]. Densenet is constructed from stacks of convolutional layers, with the output of each stack being fed to all downstream stacks (rather than just the next layer as in a standard convolutional network).
The authors of Densenet argue that this design removes the vanishing gradient problem, strengthens feature propagation, and reduces the number of parameters compared to standard convolutional neural networks. WRN-28-10 is constructed of four types of blocks, shown in Figure 4.3. Block type A has two stacked convolutions but also concatenates its inputs to the outputs of those convolutions, carrying the input feature maps downward without modification. Block type B is similar, but has two bottleneck layers — layers of 1×1 convolutions — surrounding a normal 3×3 convolutional layer. Block type C is the same as A, but with many more filters in each convolutional layer, while block type D adds a dropout layer [60] between pairs of convolutions. Wide residual networks are extremely effective at image classification tasks, but as the number of filters in the wide blocks is increased, the parameter count grows dramatically.

4.1.3 Datasets

DropBack was evaluated on the MNIST digit recognition dataset [36], the Fashion-MNIST small grayscale image dataset [69], and the CIFAR-10 color image recognition dataset [34]. For MNIST, two networks were used: LeNet-300-100 and MLP-100. For Fashion-MNIST, we used VGG-S alone. For CIFAR-10, three networks were used: VGG-S, Densenet [26], and WRN-28-10 [71].

Figure 4.1: The VGG-16 network architecture. VGG-16 is an older convolutional neural network, designed with a single path through the network, with different blocks of 3×3 filters stacked together.

Figure 4.2: The Densenet network architecture. Densenet connects the output of each convolutional layer to the inputs of all downstream layers in order to propagate feature information more readily throughout the network. Figure taken from [27].

Figure 4.3: The types of wide residual network blocks. Wide residual networks are built from these four blocks. Figure taken from [71].

4.2 Handwritten Digit Recognition (MNIST)

DropBack was first evaluated using Algorithm 1, Algorithm 2, and Algorithm 3 on the MNIST handwritten digit dataset using the LeNet-300-100 MLP and MLP-100. Training was allowed for up to 100 epochs, and the initial learning rate of 0.4 was reduced every 25 epochs by a factor of 0.5. Training was stopped after five epochs in which accuracy did not improve over the previous best epoch.

4.2.1 Pruning by Weight Magnitude: Algorithm 1

To demonstrate that standard post-training pruning techniques cannot be used to prune networks at training time, we first examined Algorithm 1, a naive adaptation of post-training pruning (see Section 3.1).

LeNet-300-100   Val. Error   Weight Reduction   Best Epoch   Freeze Epoch
Baseline 266K   1.41%        1×                 65           N/A
Naive 50K       86.56%       5.33×              94           0
Naive 20K       91.1%        13.3×              67           0
Naive 5K        89.89%       53.32×             45           0
Naive 1.5K      90.0%        177.74×            54           0

MLP-100         Val. Error   Weight Reduction   Best Epoch   Freeze Epoch
Baseline 90K    1.70%        1×                 47           N/A
Naive 50K       11.56%       1.8×               31           0
Naive 20K       88.9%        4.5×               87           0
Naive 5K        89.1%        18×                65           0
Naive 1.5K      89.5%        60×                56           0

Table 4.1: LeNet-300-100 (top) and MLP-100 (bottom) on MNIST, using the naive prune-by-weight-magnitude Algorithm 1. Naive 50K refers to a configuration where 50,000 weights are retained during training, Naive 5K to a configuration with 5,000 retained weights, and so on.
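The naive per-iteration prune of Algorithm 1 can be sketched as follows (a NumPy sketch with illustrative names, not the Chainer implementation used in the experiments):

```python
import numpy as np

def naive_prune_step(w, grad, lr, k):
    """Algorithm 1 as a single training iteration: a plain SGD update
    followed by magnitude pruning; all but the k largest-magnitude
    weights are zeroed before the next iteration."""
    w_new = w - lr * grad
    kth = np.partition(np.abs(w_new).ravel(), -k)[-k]   # k-th largest |w|
    return np.where(np.abs(w_new) >= kth, w_new, 0.0)
```

Unlike DropBack, the zeroed weights carry no recoverable information, which is why this scheme destroys learning at training time.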
In this scheme, only the highest-magnitude weights are retained during every iteration, and the remaining weights are set to zero.

Table 4.1 shows the results for the baseline (unpruned) model and four configurations of this naive scheme using Algorithm 1. Neither LeNet-300-100 nor MLP-100 can be effectively pruned using this scheme: a 5× weight reduction ratio results in accuracy only slightly above random (since there are 10 digits, a random guess will be correct 10% of the time). MLP-100 can be reduced by 1.8× at the cost of substantially lower accuracy; unfortunately, even higher weight reductions result in no learning, with the network output remaining random.

Because Algorithm 1 fails even on the simplest dataset used in this thesis, we omit its results from the remainder of this chapter.

4.2.2 Pruning by Accumulated Gradient Magnitude: Algorithm 2

We next examined Algorithm 2, which retains the weights with the highest accumulated gradients and restores the remaining weights to their initialization-time values (see Section 3.2).

Figure 4.4: The accuracy vs. weight reduction trade-off for LeNet-300-100 and MLP-100 on Algorithm 2.

Table 4.2 shows the results on the MNIST digit dataset for the baseline (unpruned) model and four configurations of DropBack, retaining 50,000 weights (1.8× weight reduction), 20,000 weights (4.5× weight reduction), 5,000 weights, and 1,500 weights (60× weight reduction), respectively. Figure 4.4 shows the same data visually.

Accuracy and Weight Reduction

Using MLP-100 with a modest 1.8× reduction in weights, DropBack slightly exceeds the accuracy of the baseline model. This slight accuracy improvement matches the trend reported in prior work such as DSD [20], where a sparse layer that omits 30% to 50% of weights outperforms the baseline dense model with all of the weights retained.
However, DSD first trains the network to convergence on the complete parameter set, and only then prunes some weights and retrains the resulting sparse network, with this process repeated several times (the final network output of DSD is not sparse, and is therefore not included as a comparison in this thesis). In contrast, DropBack achieves this accuracy improvement without ever storing the full dense configuration of the network, and without repeatedly retraining the network to convergence.

LeNet-300-100   Val. Error   Weight Reduction   Best Epoch   Freeze Epoch
Baseline 266K   1.41%        1×                 65           N/A
DropBack 50K    1.51%        5.33×              24           10
DropBack 20K    1.78%        13.3×              33           20
DropBack 5K     2.58%        53.32×             32           20
DropBack 1.5K   3.84%        177.74×            97           40

MLP-100         Val. Error   Weight Reduction   Best Epoch   Freeze Epoch
Baseline 90K    1.70%        1×                 47           N/A
DropBack 50K    1.58%        1.8×               24           5
DropBack 20K    1.70%        4.5×               32           5
DropBack 5K     2.50%        18×                27           20
DropBack 1.5K   3.78%        60×                26           20

Table 4.2: LeNet-300-100 (top) and MLP-100 (bottom) on the MNIST digit dataset, using Algorithm 2.

The larger LeNet-300-100 MLP sees a slight drop in accuracy at 50K retained weights, but its weight reduction of 5.33× is much higher than MLP-100's. It also reaches maximum accuracy in nearly 3× fewer epochs than the baseline. Further reducing the model to 20K weights (4.5× weight reduction) results in nearly the same accuracy as the baseline, and convergence in about 50% fewer epochs than the baseline.

With an extreme weight reduction to 1.5K tracked weights, the error rate roughly doubles. However, the nearly 60× reduction in the weight storage requirement for MLP-100 and 177× for LeNet-300-100 potentially offers an attractive design point for low-power embedded accelerators in future mobile and edge devices.

Freezing Epoch

We also investigated the effect of freezing the tracked gradient selection process (see Section 3.2) early during training.
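Freezing replaces the per-iteration top-k selection with a stored mask; a minimal sketch (the `mask` argument is assumed to have been saved at the freeze epoch, and the names are illustrative):

```python
import numpy as np

def frozen_step(w, w_init, grad, lr, mask):
    """DropBack update after the freeze epoch: the tracked set is fixed,
    so the sort/top-k selection is skipped and the stored boolean mask
    is reused; untracked weights stay at their (recomputable) init values."""
    w_new = w - lr * mask * grad      # untracked gradients are not applied
    return np.where(mask, w_new, w_init)
```

Because the mask no longer changes, gradients for untracked weights need not even be computed, which is the source of the computation and energy savings noted above.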
The freeze epoch was selected by a hyper-parameter sweep.

Figure 4.5: Rate of convergence using Algorithm 2 for LeNet-300-100, for our technique and the baseline model. Note that the y-axis starts at 0.90, and the final accuracies are within 1% of each other.

For both MNIST networks, freezing sooner to reduce the computational overhead results in lower achieved accuracy at very high weight reduction ratios, but for smaller weight reduction ratios freezing early has little effect on the overall accuracy. When the weight reduction ratio is small, the weights being swapped in and out of the tracked set are considerably less important than at high weight reduction ratios; freezing early is therefore less likely to choose a poor tracked parameter set. Figure 4.5 shows the rate of convergence for DropBack and the baseline on the LeNet-300-100 network; despite converging to nearly the same accuracy, DropBack is noisier than the baseline.

Allocation of Retained Weights

Next, we examined which network layers retained the most weights. We reasoned that as the weight reduction factor increases and accuracy drops, the network will reduce the precision of feature detection and put more resources into decision-making. Table 4.3 shows that the number of parameters retained per layer indeed varies depending on the number of tracked weights.

layer           DropBack 1.5K   DropBack 10K   Baseline
fc1 (100×784)   734 (48.9%)     7223 (72.2%)   78500 (87.6%)
fc2 (100×100)   512 (34.1%)     2128 (21.3%)   10100 (11.3%)
fc3 (100×10)    254 (16.9%)     549 (5.5%)     1010 (1.1%)
Total           1500            10000          89610

Table 4.3: Number of gradients for each layer retained in the final trained MLP-100 network.

The smaller DropBack 1.5K network allocates a much higher proportion of its weights to the later layers compared to the DropBack 10K network and the baseline. As later layers make decisions based on the features provided by earlier layers, this result suggests that the more reduced network indeed assigns proportionally more neurons to the critical decision-making tasks.

4.2.3 DropBack with Weight Decay: Algorithm 3

Finally, we investigated whether decaying the untracked weights to zero (as opposed to recomputing their initialization-time values) was effective on these small networks. We examined Algorithm 3 on the MNIST digit dataset, using the same networks as with Algorithm 2 above and an untracked-weight decay rate δ = 0.90. This adds an extra floating-point multiply for every parameter in the network; however, the cost of doing so is low, since the number of weights in all networks is in the millions while the number of operations is in the billions. Additionally, this extra cost is only present for the first 1,000 iterations, before all initial parameters have decayed to zero.

LeNet-300-100   Val. Error Alg. 2   Val. Error Alg. 3   Best Epoch
Baseline 266K   1.41%               1.41%               65
DropBack 50K    1.51%               1.51%               43
DropBack 20K    1.78%               2.10%               50
DropBack 5K     2.58%               3.32%               58
DropBack 1.5K   3.84%               10.70%              81

MLP-100         Val. Error Alg. 2   Val. Error Alg. 3   Best Epoch
Baseline 90K    1.70%               1.70%               47
DropBack 50K    1.58%               1.69%               44
DropBack 20K    1.70%               2.06%               41
DropBack 5K     2.50%               2.06%               70
DropBack 1.5K   3.78%               17.14%              78

Table 4.4: The MNIST digit dataset using LeNet-300-100 (top) and MLP-100 (bottom), using Algorithm 3. All models were frozen from epoch 25.

The results are shown in Table 4.4 and Figure 4.4. Overall, compared with the Algorithm 2 results in Table 4.2, decaying weights resulted in increased validation error rates for a given weight reduction ratio, especially at the highest weight reduction ratios tested. For example, with 1.5K weights retained, the error increases to 10.70% on LeNet-300-100 and 17.14% on MLP-100. With the exception of LeNet-300-100 DropBack 50K, all of the other weight reduction ratios also see an increase in error compared to Algorithm 2.
This appears to be because these relatively small networks rely on the scaffolding provided by the initial parameters; without it, the remaining parameters have trouble retaining generalization accuracy.

At the higher end of the weight reduction ratios, the error is too high to be usable. DropBack 20K offers a reasonable error-to-weight-reduction tradeoff: LeNet-300-100 increases from 1.78% error to 2.10% at a weight reduction ratio of 13.3×, and MLP-100 from 1.70% to 2.06% at 4.5×.

4.3 Fashion Item Icon Recognition (Fashion-MNIST)

We next evaluated DropBack on the VGG-S convolutional network. Because the MNIST dataset is not challenging for deep convolutional networks like VGG, which can achieve 98% accuracy [58], we evaluated VGG-S on Fashion-MNIST, a dataset that is the same size as MNIST (28×28 grayscale images, 10 categories) but much more challenging for modern networks in terms of achieved accuracy [69]. Stochastic gradient descent without momentum was used, with a maximum of 200 epochs and the learning rate held constant for all epochs; no data augmentation was performed.

Both Algorithm 2 and Algorithm 3 were evaluated for a variety of tracking constraints and freezing epochs in order to explore the weight reduction vs. accuracy trade-offs. The results are presented in Table 4.5 for Algorithm 2 and in Table 4.6 for Algorithm 3.

4.3.1 Pruning by Accumulated Gradient Magnitude: Algorithm 2

Accuracy and Weight Reduction

DropBack Algorithm 2 achieved slightly higher accuracy than the baseline while reducing the model 5×, resulting in a network where only 3M parameters out of the approximately 15M original parameters differed from their initial values.
A higher weight reduction ratio of nearly 30× increased the error to 7.4%, likely an acceptable tradeoff for embedded edge devices with very stringent weight storage budgets.

Compared with results shown later on the CIFAR-10 dataset (see Section 4.4 and Table 4.7), the network can be reduced far more (up to ∼30×) with a lower error rate than what is observed for CIFAR-10. This is because Fashion-MNIST is a simpler dataset than CIFAR-10: it consists of 28×28×1 grayscale images, versus 32×32×3 images in full color. Since the images are grayscale, the network does not need to distinguish different features in each color channel, and fewer convolutional filters overall are required.

Freezing Epoch

Like the earlier experiments on MNIST, we also examined the effect of the freezing epoch on accuracy. Unlike in the case of MNIST, where higher reduction ratios resulted in a later optimal freezing epoch, the optimal freezing epoch for the Fashion-MNIST experiments did not appear to depend on the target reduction ratio. The optimal freezing epoch was chosen in the same way as in the MNIST experiments, by hyper-parameter sweep. In all but the DropBack 5M case, the optimal point was 30 epochs. Possibly, parameter selection on Fashion-MNIST is easier due to the makeup of the dataset compared to the standard MNIST dataset.

4.3.2 DropBack with Weight Decay: Algorithm 3

Finally, we investigated the weight-decay variant of DropBack on the Fashion-MNIST task using the VGG-S network. The results are shown in Table 4.6 and Figure 4.6.

Fashion-MNIST         Validation Error   Weight Reduction   Best Epoch   Freeze Epoch
VGG-S Baseline 15M    5.43%              1×                 178          N/A
VGG-S DropBack 5M     5.14%              3.0×               105          25
VGG-S DropBack 3M     5.43%              5.0×               151          30
VGG-S DropBack 500K   7.39%              30.0×              175          30
VGG-S DropBack 200K   13.44%             75.0×              189          30
VGG-S DropBack 100K   19.80%             150.0×             172          30

Table 4.5: VGG-S on Fashion-MNIST, using Algorithm 2.

Fashion-MNIST         Val. Error Alg. 2   Val. Error Alg. 3   Best Epoch
VGG-S Baseline 15M    5.43%               5.43%               178
VGG-S DropBack 5M     5.14%               5.32%               75
VGG-S DropBack 3M     5.43%               5.62%               158
VGG-S DropBack 500K   7.39%               6.50%               192
VGG-S DropBack 200K   13.44%              8.87%               93
VGG-S DropBack 100K   19.80%              16.34%              40

Table 4.6: VGG-S on Fashion-MNIST, using Algorithm 3.

At lower weight reduction ratios, Algorithm 3 results in minimal increases in error compared to Algorithm 2, but in exchange creates useful levels of computational sparsity by zeroing a large proportion of weights. For example, DropBack 3M has 80% of the model parameters decayed to zero by the end of training. During training, this sparsity does not result in computation savings because the weight decay is achieved over time; however, at inference time, many of the weights are zero and computation savings can be significant.

In the higher range of weight reduction ratios, on the other hand, Algorithm 3 actually outperforms Algorithm 2 by 1–3%. At 500K tracked parameters, only 3.3% of the network is non-zero. Using a coordinate-list sparse matrix representation, the sparsified weight matrix of VGG-S trained by DropBack with pruning to 500K weights would use only 3.77MB, including metadata overhead, compared to 57MB for the unpruned baseline in a dense matrix representation.

Figure 4.6: Weight reduction ratio vs. accuracy for VGG-S on Fashion-MNIST.

Compared to Algorithm 2, storing the weights as a standard sparse matrix, without the requirement to regenerate their initial values, is beneficial for further reducing inference-time energy, as weights no longer need to be recomputed using the xorshift RNG.
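These storage figures can be reproduced approximately with a back-of-the-envelope calculation, assuming 32-bit values and one 32-bit flat index per non-zero entry in the coordinate-list format (the exact metadata layout used in the thesis is an assumption here):

```python
def weight_storage_mib(n_weights, value_bytes=4, index_bytes=0):
    """Approximate weight storage in MiB. index_bytes > 0 models a
    coordinate-list (COO) sparse format that stores one flat index
    alongside each non-zero value; index_bytes = 0 models a dense matrix."""
    return n_weights * (value_bytes + index_bytes) / 2**20

dense_mib = weight_storage_mib(15_000_000)               # unpruned VGG-S, fp32: ~57 MiB
sparse_mib = weight_storage_mib(500_000, index_bytes=4)  # 500K tracked weights: ~3.8 MiB
```

Under these assumptions the dense model comes to roughly 57 MiB and the 500K-weight sparse model to roughly 3.8 MiB, in line with the 57MB and 3.77MB figures quoted above.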
An accelerator such as the Efficient Inference Engine [19] can, according to the authors, leverage this level of sparsity to improve energy efficiency by 10× compared to a dense weight matrix representation.

4.4 Color Image Classification (CIFAR-10)

Finally, we evaluated DropBack on the CIFAR-10 dataset. This dataset presents a much more challenging task than MNIST or Fashion-MNIST because (a) the images are larger (32×32 pixels vs. 28×28) and, more importantly, (b) the images are in color, which means that the network must learn to reason about and combine features detected in three separate color channels.

In addition to the fairly straightforward VGG-S network, we also evaluated Densenet and WRN-28-10. These are modern networks with complex architectures and state-of-the-art performance on image classification tasks. We specifically selected them because both networks are compact with very dense connections, and are very challenging for prior pruning methods to optimize [42]. VGG-S was trained for 300 epochs, while Densenet and WRN-28-10 were trained for 500 epochs; the highest-accuracy epoch was selected for the final result. For all three experiments, the learning rate started at 0.4 and decayed 0.5× every 25 epochs; no data augmentation was performed.

Figure 4.7: Weight reduction ratio vs. accuracy for VGG-S, using Algorithm 2 and Algorithm 3.

4.4.1 Pruning by Accumulated Gradient Magnitude: Algorithm 2

Accuracy and Weight Reduction

Table 4.7 and Figures 4.7, 4.8, and 4.9 show how DropBack compares to variational dropout, network slimming, and magnitude-based pruning on VGG-S, Densenet, and WRN-28-10.
Overall, DropBack can achieve comparable (or even slightly improved) accuracy on VGG-S and Densenet with a five-fold weight reduction, and up to 20×–30× weight reduction if some accuracy is sacrificed. On WRN-28-10, DropBack achieves 7× weight reduction with less than 0.5% accuracy loss, while the other techniques all offer at best 4× weight reduction with an over-10% increase in error rate.

CIFAR-10                    Validation Error   Weight Reduction   Best Epoch   Freeze Epoch
VGG-S Baseline 15M          10.08%             1×                 214          N/A
VGG-S DropBack 5M           9.75%              3×                 127          5
VGG-S DropBack 3M           9.90%              5×                 128          20
VGG-S DropBack 750K         13.49%             20×                269          35
VGG-S DropBack 500K         20.85%             30×                201          15
VGG-S DropBack 100K         52.15%             150×               30           15
VGG-S Var. Dropout          13.50%             3.4×               200          N/A
VGG-S Mag. Pruning .80      9.42%              5.0×               182          N/A
VGG-S Slimming              11.08%             3.8×               196          N/A
Densenet Baseline 2.7M      6.48%              1×                 382          N/A
Densenet DropBack 500K      5.86%              4.5×               409          N/A
Densenet DropBack 100K      9.42%              27×                307          N/A
Densenet Var. Dropout       90%                1×                 N/A          N/A
Densenet Mag. Pruning .75   6.41%              4.0×               480          N/A
Densenet Slimming           5.65%              2.9×               495          N/A
WRN-28-10 Baseline 36M      3.75%              1×                 326          N/A
WRN-28-10 DropBack 8M       3.85%              4.5×               384          N/A
WRN-28-10 DropBack 7M       4.02%              5.2×               417          N/A
WRN-28-10 DropBack 5M       4.20%              7.3×               304          N/A
WRN-28-10 DropBack 1M       31.18%             36.48×             460          N/A
WRN-28-10 DropBack 500K     41.02%             72.96×             494          N/A
WRN-28-10 DropBack 100K     83.12%             364.80×            57           N/A
WRN-28-10 Var. Dropout      90%                1×                 N/A          N/A
WRN-28-10 Mag. Pruning .75  26.52%             4×                 109          N/A
WRN-28-10 Slimming .75      16.64%             4×                 173          N/A

Table 4.7: Validation accuracy and weight reduction ratio of several pruned networks on the CIFAR-10 dataset, trained using Algorithm 2.

Figure 4.8: Weight reduction ratio vs. accuracy for Densenet, using Algorithm 2 and Algorithm 3.

WRN-28-10 and Densenet are challenging, as they are already quite dense for the accuracy level they achieve.
Variational dropout works well only on VGG-S and fails to converge — that is, fails to achieve better-than-random accuracy — on Densenet and WRN. Magnitude-based pruning tops out at a worse accuracy than DropBack on WRN-28-10 and Densenet, despite offering less weight reduction when pruning 80% and 75% of the parameters, respectively. Finally, network slimming achieves slightly better top accuracy on Densenet at the cost of 36% less weight reduction, but results in dramatic accuracy loss when applied to WRN-28-10; this is in line with recent work which has shown that WRN is hard to reduce more than about 2× without losing significant accuracy [41–43]. Only DropBack is able to achieve weight reduction on the order of 5× with little to no accuracy loss on WRN-28-10 and Densenet.

Figure 4.9: Weight reduction ratio vs. accuracy for WRN-28-10, using Algorithm 2 and Algorithm 3.

Convergence and Freezing

With the more complicated networks WRN-28-10 and Densenet, freezing early resulted in significant accuracy drops of ∼10% compared to not freezing. This is because these networks are much more difficult to train than VGG-S, as demonstrated by the larger best-epoch values in Table 4.7. Intuitively, with higher weight reduction ratios and more complex networks, freezing early results in a more substantial accuracy drop because parameters that are on the edge of being included in the tracked set are far more critical. With larger tracked sets, weights that are being swapped in and out of the tracked set have smaller gradients and therefore contribute less to the error. With more complex networks, additional noise, such as the noise added by the parameter selection changing each iteration, can be beneficial to training [50]. Therefore, WRN-28-10 and Densenet were run without freezing.

Figure 4.10: VGG-S CIFAR-10 epoch vs. validation accuracy for our method at 5M tracked parameters, variational dropout, and the baseline model.

Figure 4.10 shows that DropBack initially learns slightly more slowly than the uncompressed baseline on VGG-S, but exhibits the same convergence behavior as the baseline after about 20 epochs. Variational dropout, in contrast, learns more quickly initially but converges to a lower accuracy, essentially because the rapid movement results in numerical instability and poor performance [23] (see Section 4.5 for details).

4.4.2 DropBack with Weight Decay: Algorithm 3

Accuracy and Weight Reduction

The results of training networks on CIFAR-10 using Algorithm 3 are shown in Table 4.8. In almost every case, Algorithm 3 improves the weight reduction ratio. The level of weight reduction achievable for an acceptable accuracy loss depends on the network used. The newer WRN-28-10 network is reduced the most, and at high compression levels is still more accurate than any VGG-S result. Densenet has far fewer parameters than the baseline WRN-28-10 or VGG-S networks, but Algorithm 3 is still able to achieve weight reduction while maintaining accuracy.

CIFAR-10                  Validation Error   Best Epoch
VGG-S Baseline 15M        10.08%             214
VGG-S DropBack 5M         9.82%              234
VGG-S DropBack 3M         10.12%             188
VGG-S DropBack 500K       13.41%             227
VGG-S DropBack 100K       51.08%             255
Densenet Baseline 2.7M    6.48%              382
Densenet DropBack 500K    5.83%              479
Densenet DropBack 100K    45.24%             437
WRN-28-10 Baseline 36M    3.75%              326
WRN-28-10 DropBack 8M     3.66%              493
WRN-28-10 DropBack 7M     3.66%              493
WRN-28-10 DropBack 5M     3.87%              478
WRN-28-10 DropBack 1M     4.65%              460
WRN-28-10 DropBack 500K   5.06%              562
WRN-28-10 DropBack 100K   8.74%              362

Table 4.8: Validation accuracy and weight reduction ratio of several pruned networks on CIFAR-10, using Algorithm 3.

Compared to VGG-S, the advantages of the newer, more complex architectures are clear, especially when reduced to 500K and 100K tracked parameters.
Densenet at 500K tracked parameters gains almost 1% in test accuracy, whereas VGG-S loses 3.4%, compared to the baseline accuracy result for the respective network.

Using DropBack with 100K tracked weights, WRN-28-10 has the highest weight reduction ratio and significantly lower validation error than VGG-S or Densenet; Densenet, on the other hand, suffers a colossal increase in error when only 100K weights are tracked. This is likely due to its unique, highly interconnected architecture: in a standard CNN like VGG-S, the outputs of one layer are fed into only the next layer, whereas in Densenet the output of each layer is fed into every layer below. This causes each weight in Densenet to affect more of the output than in either WRN-28-10 or VGG-S.

The more sparsely connected WRN-28-10 network retains much higher accuracy even at high weight reduction ratios. At 100K, the error only roughly doubles compared to the baseline, and WRN-28-10 still outperforms the best (lowest-error) version of VGG-S.

These results suggest (a) that network architecture has a significant impact on the weight reduction possible with DropBack, and (b) that sparsely connected architectures tend to perform better.

Filter-Level Sparsity

Finally, we asked whether the weights that were pruned away correlated with the convolutional filter structure of the network. Because each filter represents a different feature, we reasoned that most of the weight pruning would occur in units of entire filters.

To investigate filter-level sparsity, we asked how many weights remained (i.e., had non-zero values) in each 3×3 convolutional filter at the end of the training process using Algorithm 3. We selected weight reduction levels of 500K and 100K tracked parameters, as these both still outperformed the VGG-S baseline.
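The per-filter counts behind these histograms can be computed with a few lines of NumPy (an illustrative sketch; the function name is ours):

```python
import numpy as np

def filter_nnz_histogram(conv_w):
    """counts[i] = number of 3x3 filters with exactly i non-zero weights.
    conv_w is any convolutional weight tensor whose trailing dimensions
    are the 3x3 kernel, e.g. shape (out_channels, in_channels, 3, 3)."""
    nnz = np.count_nonzero(conv_w.reshape(-1, 9), axis=1)  # one count per filter
    return np.bincount(nnz, minlength=10)                  # buckets for 0..9 non-zeros
```

Index 0 of the result gives the all-zero filters that can be removed from the network entirely; dropping that bucket yields the "excluding all-zero filters" variants of the histograms.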
The histograms show filters with 0 to 9 weights for each network and training configuration in Figures 4.11 to 4.18, inclusive.

For WRN-28-10, the vast majority of 3×3 filters have been decayed to zero completely, allowing them to be removed from the network entirely. This is a significant computational savings, as each filter requires approximately 20 operations per pixel. Figure 4.11 shows the number of non-zero weights per 3×3 convolution for DropBack 500K, while Figure 4.13 shows the same for the case of 100K tracked parameters.

We also investigated the filters that were not entirely removed and asked how many of their values are non-zero. Figures 4.12 and 4.14 show the same histograms as above but without the all-zero filters, again for 500K and 100K, respectively. In both cases, the vast majority of the remaining filters have only a single value. This again results in a substantial reduction in computation effort, as applying a single-value filter requires just a single operation per pixel.

In contrast to WRN-28-10, we observed fewer filters entirely reduced to zero in VGG-S and Densenet, with either 100K or 500K tracked weights. On VGG-S with 500K tracked parameters, only 88% of the filters were set completely to zero, compared with 94% of the filters for WRN-28-10 500K. Figure 4.15 and Figure 4.16 show how dense the VGG-S filters are overall compared to the WRN-28-10 filters in Figure 4.11 and Figure 4.12. Like WRN-28-10, VGG-S had many single-value filters (Figure 4.16), but had relatively more high-complexity filters, whereas WRN-28-10 had essentially no filters with more than four non-zero weights. This suggests that VGG-S requires more complex filters than the WRN-28-10 network, perhaps because the more complex units that comprise the WRN-28-10 architecture pre-encode some of the necessary computation and allow the filters themselves to be less complex.
VGG-S also could not be reduced as far as WRN-28-10 without catastrophic accuracy loss.

Densenet, on the other hand, relies on very complex filters, and did not achieve as much filter-level sparsity as VGG-S or WRN-28-10. The histograms are shown in Figures 4.17 and 4.18. While 64% of the filters are entirely zero, the second most common category was filters with 7 non-zero values.

This result is intuitive, as the Densenet network used in this thesis has 5× fewer parameters than the next smallest network (VGG-S), but no complex building blocks like WRN-28-10's to account for some of the complexity. Therefore, to achieve accuracy similar to or better than the VGG-S network, the filters have to incorporate more information. This denser representation of information also helps explain Densenet's poor performance at higher weight reduction ratios: while some filters can be simplified, there are fewer filters overall to modify. Removing a single value from a filter in Densenet has far more impact than in VGG-S or WRN-28-10.

4.4.3 Weight Reduction: Algorithm 2 vs. Algorithm 3

The choice between DropBack Algorithm 2 and DropBack Algorithm 3 depends greatly on the network, the weight reduction desired, and the energy reduction required.
Figure 4.11: Histogram of 3×3 convolutional filters in WRN-28-10 with DropBack 500K Algorithm 3, including filters with all zeros.
Figure 4.12: Histogram of 3×3 convolutional filters in WRN-28-10 with DropBack 500K Algorithm 3, excluding the 3,760,894 all-zero filters.
Figure 4.13: Histogram of 3×3 convolutional filters in WRN-28-10 with DropBack 100K Algorithm 3, including filters with all zeros.
Figure 4.14: Histogram of 3×3 convolutional filters in WRN-28-10 with DropBack 100K Algorithm 3, excluding the 3,976,145 all-zero filters.
Figure 4.15: Histogram of 3×3 convolutional filters in VGG-S with DropBack 500K Algorithm 3, including filters with all zeros.
Figure 4.16: Histogram of 3×3 convolutional filters in VGG-S with DropBack 500K Algorithm 3, excluding the 1,072,463 all-zero filters.
Figure 4.17: Histogram of 3×3 convolutional filters in Densenet with DropBack 500K Algorithm 3, including filters with all zeros.
Figure 4.18: Histogram of 3×3 convolutional filters in Densenet with DropBack 500K Algorithm 3, excluding the 168,165 all-zero filters.

Table 4.9 shows a side-by-side comparison of Algorithms 2 and 3, using results from Section 4.4.

Table 4.9: Comparing Algorithm 2 and Algorithm 3 on CIFAR-10.

    CIFAR-10                  Alg. 2    Alg. 3
    VGG-S DropBack 5M          9.75%     9.82%
    VGG-S DropBack 500K       20.85%    13.41%
    VGG-S DropBack 100K       90.00%    51.08%
    Densenet DropBack 500K     5.86%     5.83%
    Densenet DropBack 100K     9.42%    45.24%
    WRN-28-10 DropBack 8M      3.85%     3.66%
    WRN-28-10 DropBack 500K   59.89%     5.06%
    WRN-28-10 DropBack 100K   89.86%     8.74%

VGG-S performs better at higher weight reduction levels when using Algorithm 3, with the added benefit of the computational sparsity from the decayed-to-zero initial values. At less extreme amounts of weight reduction (DropBack 5M), Algorithm 2 is very slightly better, although the computational sparsity obtained by Algorithm 3 makes the 0.07% increase in error acceptable.

Densenet, on the other hand, exhibits very different behavior when pruned. At low weight reduction levels, both algorithms perform the same, with Algorithm 3 taking longer to converge while introducing 78% weight sparsity. The weight reduction factor at DropBack 100K for Densenet is less than what WRN-28-10 and VGG-S achieved, but Algorithm 3 suffers a significant increase in validation error. This is because the densely connected architecture of the network causes too many activations to become zero, removing substantial expressive power from the already compact network. Leaving the initial parameters intact with Algorithm 2, on the other hand, preserves more accuracy and outperforms the baseline VGG-S network.

Finally, on WRN-28-10, Algorithm 3 outperforms Algorithm 2 at all levels of weight reduction, and does not lose accuracy as quickly at high levels of weight reduction. Using DropBack 100K Algorithm 3, WRN-28-10 outperforms the best VGG-S network even with the largest weight reduction ratio of 364.80×.
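The mechanical difference between the two variants can be sketched as follows (a simplified single-step sketch of the idea as described in the text, not the thesis implementation; the function name and shapes are illustrative): both variants track only the k weights with the largest accumulated gradient magnitudes; Algorithm 2 resets every untracked weight to its initialization value, while Algorithm 3 decays untracked weights to zero, producing the computational sparsity discussed above.

```python
import numpy as np

def dropback_step(w, w_init, grads, accumulated, lr, k, decay_to_zero):
    """One sketch of a DropBack-style update step.

    Only the k weights with the largest accumulated gradient
    magnitude are tracked (stored and updated). Untracked weights
    are reset to their initialization (Algorithm 2 style) or
    decayed to zero (Algorithm 3 style).
    """
    accumulated += np.abs(grads)
    tracked = np.argsort(accumulated)[-k:]   # indices of top-k weights
    mask = np.zeros_like(w, dtype=bool)
    mask[tracked] = True

    w = w - lr * grads                       # ordinary SGD step
    if decay_to_zero:
        w[~mask] = 0.0                       # Algorithm 3: untracked -> 0
    else:
        w[~mask] = w_init[~mask]             # Algorithm 2: recompute init
    return w, accumulated

rng = np.random.default_rng(1)
w_init = rng.standard_normal(8)
grads = rng.standard_normal(8)
acc = np.zeros(8)
w2, _ = dropback_step(w_init.copy(), w_init, grads, acc.copy(), 0.1, 3, decay_to_zero=False)
w3, _ = dropback_step(w_init.copy(), w_init, grads, acc.copy(), 0.1, 3, decay_to_zero=True)
print(np.count_nonzero(w3))  # at most k=3 weights remain non-zero
```

In the Algorithm 2 variant, only the seed and the k tracked values need to be stored, since untracked weights can be regenerated from the initialization; in the Algorithm 3 variant they vanish entirely, which also removes the corresponding computation.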
While WRN-28-10 is the largest network used for CIFAR-10, and thus should see the highest weight reduction ratios, no other network was able to achieve comparable accuracy at 100K tracked parameters.

Figure 4.19: Diffusion (ℓ2) distance vs. training time on MLP-100 (note log time scale). Curves: Baseline (97.65%), DropBack 1.5K Alg. 2 (94.66%), DropBack 10K Alg. 2 (96.53%), DropBack 1.5K Alg. 3 (79.21%), DropBack 10K Alg. 3 (96.52%), Magnitude Pruning .75 (92.01%), VD Sparse (94.88%).

4.5 Discussion

Finally, we sought to understand how DropBack is able to offer better accuracy at higher weight reduction ratios than prior work across a wide range of deep neural networks.

To investigate this, we applied the training process analysis from Hoffer et al. [23]. Briefly, the authors observe that the average ℓ2 (Euclidean) distance of weights from their initial values is a logarithmic function of training time, i.e., ‖w_t − w_0‖ ∼ log t. They therefore model SGD as a random walk on a random potential surface, which exhibits the same logarithmic distance effect (known as ultra-slow diffusion). The authors demonstrate that SGD configurations that preserve the ultra-slow diffusion effect result in models that generalize well.

We therefore asked whether DropBack follows the same principle. We reasoned that, because DropBack tracks the highest gradients, it should preserve the largest contributors to the ℓ2 diffusion distance of the baseline training scheme. In addition, because most of the remaining gradients are close to zero (see Figure 3.1), we expected the ℓ2 diffusion distance to evolve similarly to that of the baseline unpruned training scheme.

Figure 4.20: Evolution of weights under SGD projected into 3D space using PCA for DropBack, baseline, magnitude-based pruning, and variational dropout.
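The diffusion-distance measurement itself is simple to state in code. The sketch below tracks ‖w_t − w_0‖ over a toy random walk that stands in for actual SGD training (this only illustrates how the statistic is collected, not the thesis's training loop):

```python
import numpy as np

def l2_diffusion_distance(w_t, w_0):
    # Euclidean distance of the current weights from their initialization.
    return float(np.linalg.norm(w_t - w_0))

# Toy stand-in for training: a random walk over 1000 "parameters".
rng = np.random.default_rng(0)
w0 = rng.standard_normal(1000)
w = w0.copy()
distances = []
for t in range(1, 1001):
    w += 0.01 * rng.standard_normal(w.shape)  # one "update"
    distances.append(l2_diffusion_distance(w, w0))

# An unbiased random walk grows roughly like sqrt(t); ultra-slow
# diffusion (as in Hoffer et al.'s model of SGD) grows like log(t).
print(distances[-1] > distances[9])  # True: distance grows with time
```

Plotting `distances` against `t` on a log time axis, as in Figure 4.19, makes the logarithmic growth of genuine ultra-slow diffusion appear as a straight line.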
Finally, because fewer of the largest contributors are preserved when the weight reduction ratio is higher, we expected the average ℓ2 distance to be greater in configurations with a higher weight reduction ratio.

To verify this intuition, we measured the diffusion distance for the baseline uncompressed network, DropBack, variational dropout, and magnitude-based pruning, all on the MLP-100 network. (Network slimming, being a train-prune-retrain technique, is not amenable to this type of analysis.) Figure 4.19 shows that under DropBack Algorithm 2, weights diffuse very similarly to the baseline training scheme. The overall ℓ2 distance is negligibly lower because the untracked weights remain at their initialization values. In contrast, magnitude-based pruning begins with a substantial ℓ2 distance (because many initialization values are immediately zeroed) and does not provide enough scaffolding structure for SGD to train well. Finally, variational dropout drastically alters the loss surface of the network, and so diffuses much faster than the baseline and DropBack. This results in numerical instability (the total variation of the solution is not bounded, and the optimization does not converge), as Hoffer et al. [23] predict; this explains the failure of variational dropout to converge on the denser networks (see Section 4.4).

To visualize how the weight values themselves evolve under DropBack compared to the baseline and the two pruning techniques, we projected the parameter space to three dimensions using Principal Component Analysis (PCA). PCA is a dimensionality reduction technique that projects a coordinate space, in this case the parameters of the MLP-100 network, down to a space with fewer dimensions.

Figure 4.20 shows that under DropBack, the principal components of the trained weight vector stay very close to those of the baseline-trained weight vector, whereas those of magnitude-based pruning and variational dropout diverge significantly.
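The projection behind a plot like Figure 4.20 can be sketched with SVD-based PCA over weight snapshots taken during training (illustrative shapes and a synthetic trajectory; not the thesis's analysis code):

```python
import numpy as np

def project_trajectory(snapshots, dims=3):
    """Project weight snapshots (a T x P matrix: T checkpoints of
    P parameters) onto the top `dims` principal components."""
    X = np.asarray(snapshots, dtype=float)
    X = X - X.mean(axis=0)              # center each parameter
    # Rows of vt are principal directions in parameter space.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:dims].T              # T x dims trajectory coordinates

# Toy trajectory: 50 snapshots of a 200-parameter "network".
rng = np.random.default_rng(0)
traj = np.cumsum(0.1 * rng.standard_normal((50, 200)), axis=0)
coords = project_trajectory(traj, dims=3)
print(coords.shape)  # (50, 3)
```

Plotting the three coordinate columns of `coords` against each other traces the optimization path in the reduced space; trajectories from different training schemes can then be compared directly, as in Figure 4.20.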
If we consider the training path of the baseline configuration to be ideal, DropBack results in a similar near-ideal evolution.

With DropBack Algorithm 3, the diffusion distance is altered more significantly, and the path SGD takes during optimization changes correspondingly. When tracking only 1.5K of the original 90K parameters, the path and diffusion distance are altered drastically compared to the paths taken by Algorithm 2 and the baseline. This explains the results from Table 4.2 and Table 4.4, where the DropBack variant Algorithm 2 performs better on the same dataset (MNIST) and network (MLP-100).

4.6 Summary

On the MNIST dataset, DropBack can reduce the number of weights in both LeNet-300-100 and MLP-100 to 20K tracked weights, with validation errors of 1.78% and 1.70% respectively, compared to 1.41% and 1.70% error for their respective unreduced baselines.

On the Fashion-MNIST dataset with VGG-S, DropBack can reduce the number of weights by 5× with a validation error of 5.43%, matching the baseline accuracy exactly.

On the CIFAR-10 dataset, DropBack is effective at reducing the number of weights used in all networks. On VGG-S, DropBack achieves a 9.90% validation error with a 5× weight reduction. On Densenet, DropBack achieves 5.86% validation error with 4.5× weight reduction, and on WRN-28-10 DropBack achieves 4.20% error with 7.3× weight reduction. For both Densenet and WRN-28-10, these weight reduction ratios are state of the art compared to post-training pruning techniques, without an increase in error.

Chapter 5

Conclusion

Neural networks are becoming an everyday part of modern life. As their popularity grows, they are deployed to smaller and smaller devices.
In order to fit within the memory, energy, and computational bounds of these devices, neural networks must be compressed, which typically results in accuracy loss and longer training times. DropBack can reduce the number of weights in networks by up to 5× with no accuracy loss, and by up to 36× with slight accuracy loss.

DropBack reduces weight storage both during and after training by (a) tracking only the weights with the highest accumulated gradients, and (b) recomputing the remaining weights on the fly. With the addition of a decay term for the untracked weight set, DropBack can be used to create compute sparsity both during and after training, increasing the potential for energy savings even further.

Because DropBack prunes precisely those weights that have learned the least, its weight diffusion profile during training is very close to that of standard (unconstrained) SGD, in contrast to other techniques. This allows DropBack to achieve better accuracy and weight reduction than prior methods on dense modern networks like Densenet and WRN-28-10, which have proven challenging to prune using existing techniques. On Densenet, a 5× weight reduction ratio can be achieved with no accuracy loss. On WRN-28-10, a 7× weight reduction ratio can be achieved with no accuracy loss. With a slight increase in error, WRN-28-10 can be reduced by 36×.

DropBack is the first pruning technique to store only a small subset of weight gradients during the training process. This is a key advantage, as it will make it possible for future work to dramatically reduce the memory footprint and memory bandwidth needed to store and access weights during training. Because of this, future designs that combine DropBack with custom hardware will be able to train networks larger than currently achievable with typical hardware, or to train standard-size networks on small mobile and embedded devices, which is currently not possible with standard training techniques and mobile GPUs.

Bibliography

[1] A. F. Agarap.
Deep learning using rectified linear units (ReLU). CoRR, abs/1803.08375, 2018. URL http://arxiv.org/abs/1803.08375. → page 10

[2] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. URL http://arxiv.org/abs/1610.02132. → page 17

[3] J. M. Alvarez and M. Salzmann. Compression-aware Training of Deep Networks. arXiv:1711.02638 [cs], Nov. 2017. → page 16

[4] M. Babaeizadeh, P. Smaragdis, and R. H. Campbell. NoiseOut: A simple way to prune neural networks. 2016. → pages 16, 31

[5] R. Brynjolfsson and A. Mcafee. The business of artificial intelligence. URL https://hbr.org/2017/07/the-business-of-artificial-intelligence. → page 1

[6] Z. Cai, X. He, J. Sun, and N. Vasconcelos. Deep Learning with Low Precision by Half-wave Gaussian Quantization. arXiv:1702.00953 [cs], Feb. 2017. → page 17

[7] T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174 [cs], Apr. 2016. → pages 4, 14

[8] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. CoRR, abs/1504.04788, 2015. URL http://arxiv.org/abs/1504.04788. → pages 15, 17

[9] Y. Choi, M. El-Khamy, and J. Lee. Towards the Limit of Network Quantization. arXiv:1612.01543 [cs], Dec. 2016. → page 16

[10] M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications. arXiv:1412.7024 [cs], Dec. 2014. → page 17

[11] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv:1602.02830 [cs], Feb. 2016. → page 17

[12] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems (MCSS), 2:303–314, 1989. → page 10

[13] T. Franck.
Machine learning could lead to economic hypergrowth, new research suggests. URL https://www.cnbc.com/2017/10/21/machine-learning-could-lead-to-economic-hypergrowth-new-research-suggests.html. → page 1

[14] S. Ge, Z. Luo, S. Zhao, X. Jin, and X. Y. Zhang. Compressing deep neural networks for efficient visual inference. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 667–672, July 2017. doi:10.1109/ICME.2017.8019465. → pages 1, 3, 16, 30, 34

[15] G. B. Goh, N. O. Hodas, and A. Vishnu. Deep learning for computational chemistry. URL http://arxiv.org/abs/1701.04503. → page 22

[16] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. 2015. → page 17

[17] P. Gysel. Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks. arXiv:1605.06402 [cs], May 2016. → pages 16, 17

[18] S. Han, H. Mao, and W. J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149 [cs], Oct. 2015. → pages 1, 3, 15, 16, 18, 22, 30

[19] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In ACM/IEEE International Symposium on Computer Architecture (ISCA), 2016. → page 45

[20] S. Han, J. Pool, S. Narang, H. Mao, E. Gong, S. Tang, E. Elsen, P. Vajda, M. Paluri, J. Tran, B. Catanzaro, and W. J. Dally. DSD: Dense-Sparse-Dense Training for Deep Neural Networks. In International Conference on Learning Representations (ICLR), 2017. → page 38

[21] B. Hassibi, D. G. Stork, and G. J. Wolff. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, 1993, pages 293–299. IEEE, 1993. → pages 6, 15

[22] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. → page 11

[23] E. Hoffer, I.
Hubara, and D. Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NIPS, 2017. → pages 50, 59, 60

[24] J. L. Holt and T. E. Baker. Back propagation simulations using limited precision calculations. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume ii, pages 121–126. → page 17

[25] M. Horowitz. 1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14. doi:10.1109/ISSCC.2014.6757323. → pages 2, 3, 7, 25

[26] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely Connected Convolutional Networks. arXiv:1608.06993 [cs], Aug. 2016. URL http://arxiv.org/abs/1608.06993. → page 35

[27] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016. URL http://arxiv.org/abs/1608.06993. → pages xi, 33, 36

[28] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. 2016. → page 17

[29] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101. ISSN 0096-8390. doi:10.1109/JRPROC.1952.273898. → page 18

[30] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5MB model size. arXiv:1602.07360, 2016. → page 34

[31] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014. → page 33

[32] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. 2015. → pages 16, 33

[33] A. Kingsley-Hughes. Inside Apple's new A11 Bionic processor. ZDNet, September 2017. → pages 3, 7

[34] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009. → page 34

[35] J. Langford, L. Li, and T. Zhang.
Sparse online learning via truncated gradient. 2008. → page 16

[36] Y. LeCun. The MNIST database of handwritten digits. 1998. → page 34

[37] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):541–551, 1989. → pages 6, 10, 13, 21

[38] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598–605. Morgan-Kaufmann, 1990. → pages 3, 6, 15, 16, 20

[39] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. ISSN 0018-9219. → page 33

[40] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient BackProp. In Neural Networks: Tricks of the Trade, pages 9–50, London, UK, 1998. Springer-Verlag. ISBN 3-540-65311-2. → pages 12, 21, 24, 25, 26, 28

[41] H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein. Training Quantized Nets: A Deeper Understanding. arXiv:1706.02379 [cs, stat], June 2017. URL http://arxiv.org/abs/1706.02379. → pages 7, 34, 48

[42] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning Efficient Convolutional Networks through Network Slimming. arXiv:1708.06519 [cs], Aug. 2017. URL http://arxiv.org/abs/1708.06519. → pages 33, 46

[43] C. Louizos, M. Welling, and D. P. Kingma. Learning Sparse Neural Networks through l0 Regularization. arXiv:1712.01312 [cs, stat], Dec. 2017. URL http://arxiv.org/abs/1712.01312. → pages 7, 34, 48

[44] J.-H. Luo, J. Wu, and W. Lin. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. arXiv:1707.06342 [cs], July 2017. → pages 3, 16, 30, 34

[45] G. Marsaglia. Xorshift RNGs. Journal of Statistical Software, 8(14):1–6, 2003. → page 18

[46] M. Masana, J. van de Weijer, L. Herranz, A. D. Bagdanov, and J.
M. Alvarez. Domain-adaptive deep network compression. arXiv:1709.01041 [cs], Sept. 2017. → pages 3, 14, 16, 30, 34

[47] D. Masters and C. Luschi. Revisiting small batch training for deep neural networks. URL http://arxiv.org/abs/1804.07612. → page 4

[48] A. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr. WRPN: Wide reduced-precision networks. 2017. → page 17

[49] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. 2017. → pages 16, 31

[50] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens. Adding gradient noise improves learning for very deep networks. URL http://arxiv.org/abs/1511.06807. → page 49

[51] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. arXiv:1603.05279 [cs], Mar. 2016. → page 17

[52] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler. Virtualizing deep neural networks for memory-efficient neural network design. CoRR, abs/1602.08124, 2016. URL http://arxiv.org/abs/1602.08124. → page 14

[53] M. Rhu, M. O'Connor, N. Chatterjee, J. Pool, and S. W. Keckler. Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks. arXiv:1705.01626 [cs], May 2017. → page 14

[54] F. Rosenblatt. The Perceptron, a Perceiving and Recognizing Automaton (Project Para). Report, Cornell Aeronautical Laboratory, 1957. URL https://books.google.ca/books?id=P_XGPgAACAAJ. → page 10

[55] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi:10.1007/s11263-015-0816-y. → page 10

[56] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu.
1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014, September 2014. URL https://www.microsoft.com/en-us/research/publication/1-bit-stochastic-gradient-descent-and-application-to-data-parallel-distributed-training-of-speech-dnns/. → page 17

[57] P. Y. Simard and H. P. Graf. Backpropagation without multiplication. In Advances in Neural Information Processing Systems, pages 232–239, 1994. → page 17

[58] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556. → pages 33, 42

[59] A. Snyder. Behind the boom in machine learning. URL https://www.axios.com/behind-the-boom-in-machine-learning-1513388387-e5b2d881-f055-46d9-859c-c7d743b18c04.html. → page 1

[60] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014. → pages 3, 16, 34

[61] C. V. M. L. Team. Personalized Hey Siri - Apple. URL https://machinelearning.apple.com/2018/04/16/personalized-hey-siri.html. → page 1

[62] C. V. M. L. Team. An on-device deep neural network for face detection - Apple. URL https://machinelearning.apple.com/2017/11/16/face-detection.html. → page 1

[63] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in NIPS, volume 5, 2015. → page 32

[64] K. Ullrich, E. Meeds, and M. Welling. Soft Weight-Sharing for Neural Network Compression. arXiv:1702.04008 [cs, stat], Feb. 2017. → pages 1, 3, 16, 30

[65] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using DropConnect. In PMLR, pages 1058–1066. → pages 3, 16

[66] J. Wangni, J. Wang, J. Liu, and T. Zhang.
Gradient sparsification for communication-efficient distributed optimization. URL http://arxiv.org/abs/1710.09854. → page 17

[67] J. Wu, W. Huang, J. Huang, and T. Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. URL http://arxiv.org/abs/1806.08054. → page 17

[68] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized Convolutional Neural Networks for Mobile Devices. arXiv:1512.06473 [cs], Dec. 2015. → pages 1, 16

[69] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. 2017. → pages 34, 42

[70] S. Zagoruyko. Torch | 92.45% on CIFAR-10 in Torch. URL http://torch.ch/blog/2015/07/30/cifar.html. → page 33

[71] S. Zagoruyko and N. Komodakis. Wide Residual Networks. arXiv:1605.07146 [cs], May 2016. URL http://arxiv.org/abs/1605.07146. → pages xi, 2, 33, 35, 36

[72] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights. arXiv:1702.03044 [cs], Feb. 2017. → pages 16, 17

[73] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv:1606.06160 [cs], June 2016. → page 17

[74] J. Zhu, J. Jiang, X. Chen, and C.-Y. Tsui. SparseNN: An Energy-Efficient Neural Network Accelerator Exploiting Input and Output Sparsity. arXiv:1711.01263 [cs], Nov. 2017. → pages 3, 14, 16, 30

[75] M. Zhu and S. Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. 2017. → pages 16, 22, 31
Item Metadata
Title | DropBack: continuous pruning during deep neural network training
Creator | Golub, Maximilian
Publisher | University of British Columbia
Date Issued | 2018
Genre | Thesis/Dissertation
Type | Text
Language | eng
Date Available | 2018-09-05
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0371928
URI | http://hdl.handle.net/2429/67112
Degree | Master of Applied Science - MASc
Program | Electrical and Computer Engineering
Affiliation | Applied Science, Faculty of; Electrical and Computer Engineering, Department of
Degree Grantor | University of British Columbia
Graduation Date | 2018-11
Campus | UBCV
Scholarly Level | Graduate
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository | DSpace