Predicting Parameters in Deep LearningbyBabak ShakibiB.A.Sc., University of Tehran, 2008M.A.Sc., University of Toronto, 2011A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University of British Columbia(Vancouver)October 2014c© Babak Shakibi, 2014AbstractThe recent success of large and deep neural network models has motivated thetraining of even larger and deeper networks with millions of parameters. Train-ing these models usually requires parallel training methods where communicatinglarge number of parameters becomes one of the main bottlenecks. We show thatmany deep learning models are over-parameterized and their learned features canbe predicted given only a small fraction of their parameters. We then propose amethod which exploits this fact during the training to reduce the number of param-eters that need to be learned. Our method is orthogonal to the choice of networkarchitecture and can be applied in a wide variety of neural network architecturesand application areas. We evaluate this technique using various experiments inimage and speech recognition and show that we can only learn a fraction of theparameters (up to 10% in some cases) and predict the rest without a significant lossin the predictive accuracy of the model.iiPrefaceThis work was done in collaboration with Misha Denil, Nando de Freitas, LaurentDinh and Marc’Aurelio Ranzato. The key ideas of this work were mainly from thecollaborations between Nando de Freitas, myself and Misha Denil with suggestionsand contributions from Marc’Aurelio Ranzato. I also did the implementation andevaluation of the technique using the reconstruction ICA (RICA) model on CIFAR-10 and STL-10 datasets. Then, since there were other large experiments in thisproject, they were distributed among myself, Misha Denil, and Laurent Dinh. Theexperiments with multi-layer perceptrons, and speech recognition using TIMIT,were mostly done by Misha Denil. Laurent Dinh was mainly responsible for theexperiment using convolutional networks on CIFAR-10. The result of our workwas presented in the NIPS 2013 conference [8], and some parts of the text fromthis paper are used in this thesis.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 A Brief History of Deep Learning . . . . . . . . . . . . . . . . . 11.2 Contributions of this Thesis . . . . . . . . . . . . . . . . . . . . . 31.3 Structure of this Thesis . . . . . . . . . . . . . . . . . . . . . . . 32 Predicting Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1 Correlation in Learned Parameters . . . . . . . . . . . . . . . . . 62.2 Exploiting Parameter Correlations . . . . . . . . . . . . . . . . . 72.2.1 Choice of Dictionary . . . . . . . . . . . . . . . . . . . . 92.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.3.1 Constructing Dictionaries . . . . . . . . . . . . . . . . . 122.3.2 Columnar Architecture . . . . . . . . . . . . . . . . . . . 132.3.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.3.4 Parallel Training . . . . . . . . . . . . . . . . . . . . . . 142.4 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.4.1 Interpretation as Pooling . . . . . . . . . . . . . . . . . . 162.4.2 Interpretation as Low-Rank Factorization . . . . . . . . . 17iv3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.1 Reconstruction Independent Component Analysis . . . . . . . . . 213.1.1 Constructing the Dictionary . . . . . . . . . . . . . . . . 213.1.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 213.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 223.2 Multi-Layer Perceptrons . . . . . . . . . . . . . . . . . . . . . . 253.2.1 Constructing the Dictionary . . . . . . . . . . . . . . . . 263.2.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.3 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . 273.3.1 Constructing the Dictionary . . . . . . . . . . . . . . . . 283.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.4 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 294 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . 314.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33vList of FiguresFigure 2.1 A schematic of a simple layer of an MLP . . . . . . . . . . . 5Figure 2.2 Features learned with reconstruction ICA using patches fromnatural images. . . . . . . . . . . . . . . . . . . . . . . . . . 6Figure 2.3 (a) A feature learned using Reconstrction ICA from 20,00016×16 patches of natural images. (b) The feature is sampleduniformly at random at 50% of the points. (c) The feature re-constructed form the sample points using a squared exponen-tial kernel and kernel ridge regression. . . . . . . . . . . . . 7Figure 2.4 Reconstructing sub-sampled features. The first column in eachsubfigure shows four sample features. The second columnshows a few parameters sampled uniformly at random fromthe features in the first column. The third column shows thatthis random set can be used to predict the rest of the parame-ters.(a) an MLP trained on MNIST dataset, (b) reconstructionICA trained on STL-10, (c) a convolutional network trainedCIFAR-10, (d) a convolutional network trained on STL-10. . 8Figure 2.5 Interpretation of the dictionary matrix as a pooling layer. (a) Asimple fully-connected layer (W). (b) The layer (W), replacedby a smaller fully-connected layer (Wα ) and a fixed poolinglayer (U). The dashed lines indicate the weights that are fixed. 17Figure 2.6 Pooling layer representation of a dictionary based on a squaredexponential kernel when inputs are 16×16 image patches. . . 18viFigure 3.1 RICA with different proportion of parameters learned. In theleft-most column 100% of the parameters are learned. In theright-most column only 10% of parameters are learned and therest are predicted. The intermediate columns interpolate be-tween these extremes in increments of 10%. . . . . . . . . . 22Figure 3.2 Performance of features learned using RICA on CIFAR-10dataset with and without parameter prediction. . . . . . . . . 23Figure 3.3 Performance of features learned using RICA on STL-10 datasetwith and without parameter prediction. Error bars show 90%confidence intervals. . . . . . . . . . . . . . . . . . . . . . . 24Figure 3.4 Comparison of RICA, and RICA with 50% learned parametersusing the same number of dynamic parameters (RICA-50 hastwice the number of features) . . . . . . . . . . . . . . . . . 25Figure 3.5 Comparison of various dictionary learning methods used foran MLP trained on MNIST. The legend shows the dictionarytype for layer1-layer2. . . . . . . . . . . . . . . . . . . . . . 27Figure 3.6 Performance of a convolutional neural network on CIFAR-10dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28Figure 3.7 Performance on TIMIT core test set using an MLP with twohidden layers. . . . . . . . . . . . . . . . . . . . . . . . . . 29viiChapter 1IntroductionIn this chapter we motivate our work by providing a brief history of neural net-works, and explain why predicting parameters can be useful in training neuralnetworks. We then specify the contributions of this thesis and explain how it isstructured.1.1 A Brief History of Deep LearningNeural networks are a powerful, well-known and relatively old machine learningmethod that has recently gained much attention due to their stellar performance inmany machine learning tasks such as image and speech recognition [2, 6, 15].Neural networks were first popularized in the 1980’s and they were inspiredby our primitive understanding of neural networks of the brain. One of the mainsuccess stories in this field were convolutional networks introduced introduced byFukushima [9]. LeCun et al. [22] developed the convolutional network idea andcreated a model for handwritten digit recognition which at some point was used toread several million checks per day in the United States.Despite early successes of neural networks, they were notorious for being hardto train and eventually went out of favor and were replaced by other methods suchas support vector machines (SVMs), which were simpler to train and in many casesoutperformed neural networks.In a landmark paper in 2006, Hinton et al. [12] developed a new method for1training deep networks. In this paper they proposed an unsupervised layerwisetraining of the network followed by a feedforward backpropagation fine-tuning.They managed to finally end the supremacy of SVMs by achieving better results inthe digit recognition task on the MNIST [20] dataset. Their method reignited theinterest in deep learning models.Recently, with increased computational power and the access to larger datasetswith the advent of big data era, neural networks have been able to achieve stateof the art results on vairous benchmarks in object recognition [2, 15, 17], speechrecognition [6, 25] and many other fields.Although novel tweaks and modifications, such as dropout [13], have con-tributed to the performance of neural networks, much of their recent success canbe attributed to the scaling of the classic network structures to larger and deepernetworks and to the use of GPUs or clusters of CPUs for faster learning.In the supervised learning field Krizhevsky et al. [15] construct a scaled upversion of a convolutional neural network, originally used by [22] for handwrittendigit recognition, and use multiple GPUs to train the model and achieve state-of-the-art results for object recognition in the challenging ImageNet [7] dataset.Cires¸an et al. [2] show that they can achieve state of the art result on the MNISTdataset simply by constructing a very big and deep multi-layer perceptron (MLP)and training it using a GPU.Similarly, in the unsupervised feature learning field Le et al. [19] construct avery large network with about 1 billion parameters and train it using 10 millionimages on 1000 machines and show remarkable results. In another work, Coateset al. [4] analyze the effect of various parameters in the effectiveness of unsuper-vised learning methods and demonstrates that increasing the size of the network(number of features) has the largest impact in the performance of the network.These examples show that by simply increasing the size of networks and usingmore data in our training we can improve the accuracy of deep networks. Thechallenge then becomes how to train these big and deep networks on very largedatasets.One common solution for training large networks is to use parallel comput-ing with multiple GPU’s or multiple CPU-machines. This allows for training oflarger networks on larger databases. However, because of the communication bot-2tlenecks, the speedup gained using parallel computing usually diminishes as weincrease the number of machines. In multi-GPU settings each GPU needs to com-municate the parameter updates to other GPUs or a parameter server which requirescopying from GPU memory to CPU memory or memory of another GPU whichis typically slow. Similarly in multi-CPU settings CPU’s can belong to differentmachines and the parameters will need to be communicated through the networkwhich introduces a significant bottleneck.In this work we propose a method for reducing the number of parameters in aneural network by exploiting the correlations in the learned features. Our methodcan be used in various deep learning architectures to reduce the number of param-eters and potentially decrease the communication bottlenecks in parallel trainingframeworks.We postpone the review of related literature on reducing the number of para-maters and exploiting correlation in deep networks to the final chapter of this thesiswhen we have fully described our method.1.2 Contributions of this ThesisThe first contribution of this thesis is to demonstrate that the parameters of thefeatures learned in many network architectures and applications are correlated, im-plying that these neural networks are over-parameterized.The second contribution is to provide a method to exploit this correlation in or-der to reduce the number of parameters that need to be learned (active parameters)and hence reduce the communication cost in parallel training settings.We also show that by using our method we can train larger networks withoutincreasing the number of active parameters and achieve better accuracy. We testout method in a broad range of network architectures and two application areas.1.3 Structure of this ThesisIn chapter 2 we show the correlation in learned features for various network struc-tures and then develop our method to exploit these correlations using a simple ridgeregression.In chapter 3 we use multiple experiments to show the versatility of our method3and evaluate its performance in various network architectures and application ar-eas. These experiments include supervised and unsupervised learning methods,simple and convolutional network architectures, and applications in image andspeech recognition.Finally, in chapter 4, we conclude our work by going over the related literature,summarizing our contributions and suggesting future directions for continuing thisresearch.4Chapter 2Predicting ParametersDeep networks have a wide range of architectures. In this chapter, in order to ex-plain our method, we focus on multi-layer perceptrons (MLPs), which are neuralnetworks with a very simple architecture. As we will see in the experiments chap-ter, all the discussions here can be easily extended to more complex networks suchas convolutional networks.vh1 = g(vwT1 )w1Figure 2.1: A schematic of a simple layer of an MLPAn MLP consists of several layers and each layer performs a linear transforma-tion of the inputs followed by a non-linear function (usually a sigmoid function).Figure 2.1 shows a schematic of a simple fully-connected layer of an MLP. In thissetting for a single output unit (h) we have h = g(vwT ), where the row-vector v5Figure 2.2: Features learned with reconstruction ICA using patches from nat-ural images.is an nv-dimensional input of this layer, and w is an nv-dimensional weight vector.The weight vector w contains the weights connecting each of the input units tothe output unit h. The w vectors from all the layers in the network constitute theparameters of the network and are learned during the training.In this chapter we show that the learned weights of each w vector are usuallycorrelated. We then propose a method to exploit this correlation to learn only asmall number of parameters for each weight vector w and predict the full w fromthem. Finally, we show how this prediction method works in practice to reducecommunication costs during training.2.1 Correlation in Learned ParametersThe inspiration for this work came from studying learned features in successfuldeep learning networks used for image recognition tasks. Figure 2.2 shows some ofthe learned features using an unsupervised learning method called reconstructionICA (RICA) [18], trained on patches from natural images. The details of howthese features are learned are explained in the experiments chapter, here we areonly interested in visual examination of the learned features. These features aresmooth, and it is easy to see that the parameters of the pixels close to each other are6correlated. These smooth features are not specific to RICA and are very commonin deep networks trained on natural images.One way of demonstrating correlation in the features is to show that we canpredict the parameters using only a subset of the parameters. Figure 2.3 providesa concrete example. In figure 2.3a we see a typical learned featured which in thiscase is learned using RICA. In figure 2.3b we sample 50% of the parameters of thisfeature uniformly at random. Now we need to show how well we can reconstructthe original feature using only the parameters at these sampled pixels.(a) (b) (c)Figure 2.3: (a) A feature learned using Reconstrction ICA from 20,000 16×16 patches of natural images. (b) The feature is sampled uniformlyat random at 50% of the points. (c) The feature reconstructed formthe sample points using a squared exponential kernel and kernel ridgeregression.There are many ways of doing the prediction and we cover them in the nextsection, here we use a simple kernel ridge regression [29], with squared exponen-tial kernel to predict the missing parameters. Figure 2.3c shows the reconstructedfeature. Visually, we can see that we have reconstructed the feature with good ac-curacy. Figure 2.4 shows the result of the same analysis for the features learnedusing various deep learning methods from different image datasets.2.2 Exploiting Parameter CorrelationsIn this section we formalize the parameter prediction problem and show how toexploit the correlation in the features of a deep network to represent them in alower dimensional space.As was explained in the beginning of this chapter the weight vector w is a nv-7(a) (b) (c) (d)Figure 2.4: Reconstructing sub-sampled features. The first column in eachsubfigure shows four sample features. The second column shows a fewparameters sampled uniformly at random from the features in the firstcolumn. The third column shows that this random set can be used topredict the rest of the parameters.(a) an MLP trained on MNIST dataset,(b) reconstruction ICA trained on STL-10, (c) a convolutional networktrained CIFAR-10, (d) a convolutional network trained on STL-10.dimensional vector which contains the weights for a single output unit from each ofthe input units. We can view this weight vector as a function (w : W → R) whichmaps the weight space W to the real numbers R and we estimate values of thisfunction using regression. We model this regression using a linear combination ofbasis functions. Using this method we will havew = UwTα (2.1)where U is a nv×nα dictionary matrix, whose columns are the basis functions andwα is an nα -dimensional vector that contains the weights corresponding to eachbasis function.One could use other methods to approximate the w, however since we will needto reconstruct the weight vector in each pass of the training algorithm, the methodneeds to be fast and efficient. The choice of using a dictionary of basis functions isappropriate, because once we construct the dictionary the prediction of the weightvector w from wα will just be a matter of matrix multiplication.If we have a good base dictionary U we will only need to learn wα since we8can use the dictionary U to reconstruct w from wα . Which means that we wouldeffectively reduce the number of dynamic parameters for this weight vector fromnv to nα . The problem now becomes that of choosing a good base dictionary.2.2.1 Choice of DictionaryThere are many choices for the base dictionary. We can construct simple generaldictionaries such as a dictionary with independent and identically distributed (i.i.d.)Gaussian entries which will do random projections or a dictionary with randomcolumns of zeros and identity which will be equivalent to a layer with randomconnections between wα and w. These general dictionaries however do not exploitthe correlations and structure in the features. A better approach would be to useprior knowledge about the features or learn the dictionary from the data.If we do not have prior knowledge about the structure of the features we canlearn the dictionary from the data. A simple choice would be to train the layeras an unsupervised model (e.g. an autoencoder) and use the learned features as adictionary.When we have prior knowledge about the structure of the features we can usethis knowledge to choose an appropriate base dictionary, for example in the caseof images where we expect smooth features we can use Fourier or wavelet bases.There are many ways for exploiting the structure in the data to learn dictionar-ies. In this work we mainly focus on one way of encoding our knowledge about thecorrelations inside a feature which is to use a function. A kernel function encodesthe covariance of variables, and once an appropriate kernel is chosen for the datawe then use kernel ridge regression to construct a dictionary.Constructing a Dictionary with KernelsIn order to construct a dictionary with kernels we first choose α ∈W to be the setof locations where we represent the feature explicitly. There are many ways thisset can be chosen, we found that choosing α to be a uniformly random subset ofW works well. Figure 2.3b shows an example of uniformly random sampling ofan image patch feature.Next, we define a kernel matrix Kα whose elements are (Kα)i j = k(i, j) where9k is a kernel function that models the covariance between locations i, j ∈ α . Thiskernel encodes the covariance between the values of (wα)i and (wα) j. We intro-duce another kernel matrix kα whose elements are given by (kα)i j = k(i, j) fori ∈ α and j ∈W . We can then use these kernel matrices in a standard kernel ridgepredictor [29] to predict w from wα . We havew = kTα(Kα +λ I)−1wα (2.2)where λ is a ridge regularization coefficient. We can rewrite this equation as:w = UwTαwhere U isU = kTα(Kα +λ I)−1 (2.3)This equation shows how we can construct a base dictionary using kernel ridgeregression. At this point we have reduced the problem of choosing a dictionary forencoding the correlations in the parameters to the choice of an appropriate kernel.We investigate the choice of the kernel for two cases: 1) Smooth features for whichwe expect topological smoothness in the parameters and 2) Unstructured featuresfor which we do not have prior knowledge and we need to learn a kernel from data.Smooth FeaturesSmooth features, similar to the ones in Figure 2.4, usually arise when have a topo-logical structure in the inputs. Topological structure here means that the relativelocation of the input units in the input vector is of significance. For example, whenthe input is a natural image the pixels close to each other will be correlated, there-fore the vectorized image v will have a topological structure.The structure of the input vector v imposes a structure on the elements of thelearned feature w. Once α is chosen we need a kernel to encode the smooth struc-ture in the pixel space. There are many kernels that enforce smoothness, here weuse a simple squared exponential kernel. Let us index the elements of w by i ∈W .In the case of image patches we would have i = (ix, iy) which indicate the location10of the index i in the patch. The squared exponential kernel can be written ask(i, j) = exp(−(ix− jx)2 +(iy− jy)22σ2)(2.4)where σ is a length scale parameter that controls the smoothness imposed by thekernel. In color images and convolutional networks, where we have multiple chan-nels, we assume that the channels are independent and we use a single kernel for allthe channels. Using this kernel and equation 2.3 we can construct the dictionaryand use equation 2.1 to predict w from wα .Unstructured FeaturesWe now turn our attention to the cases where we do not expect a topologicalsmoothness in the feature weight space. These features arise when the input vectorhas no inherent structure. For example this happens in the deep layers of an MLPwhere the hidden units are permutation invariant i.e. the units can be rearrangedwithout changing the model.In these cases we can no longer assume a topological structure and use a smoothkernel. We therefore need to learn a kernel from the data. One way of learning thekernel is to use empirical covariance or empirical squared covariance of the inputsaveraged over the data as our kernel. For example in a hidden layer h we constructthe kernel function k(i, j) for ridge regression ask(i, j) = 1N−1N∑n=1(hi(vn)− h¯i)(h j(vn)− h¯ j) (2.5)where n indexes the training data and h¯i is the value of hi(v) averaged over thedata. The squared covariance is obtained by squaring the inputs before computingthe covariance.2.3 ImplementationIn this section we demonstrate how our method can be implemented in training anMLP. Other network architectures can be trained using a similar technique. Weexamine the details of training other network types in the experiments section.11Figure 2.1 shows a typical MLP with nl hidden layers. We know that for eachoutput unit h we can write h = g(vwT ). We let h denote the nh-dimensional vectorcontaining all the output units and we can write h = g(vW), where W is an nv×nhmatrix whose columns are the weight vectors (w) for each output unit h. We canextend the equation 2.1 to the vector formW = UWα (2.6)where Wα is an nα × nh matrix whose columns are weight vector representationsin a lower dimensional space wα for all output units.One advantage of our method is that it can be used to easily extend currentarchitectures. It does not require changing the architecture and it is easy to imple-ment. Having constructed the dictionaries we can use equation 2.6 to reconstructW from Wα , whenever access to W is required.2.3.1 Constructing DictionariesAs was discussed in section 2.2.1, the way we construct a dictionary for each layerdepends on its input.For the first layer of an MLP, if the input is structured, we can use dictionariesfor the smooth features (section 2.2.1) such as wavelet bases or the dictionariesconstructed from squared exponential kernels. If the input is not an image and doesnot have a topological structure, we use the dictionaries for unstructured features(section 2.2.1) that we learn from data.For the deep layers of an MLP (all the layers except the first layer), hidden unitactivations are permutation invariant and we can no longer use the dictionaries forsmooth features and we need to learn the dictionaries. The problem here is thathidden unit activations in the deep layers depend on the weights in the previouslayers, therefore we can not learn the kernel in this case without first training theprevious layers.In this case, we use the layer-wise pre-training technique described in [12]. Weuse autoencoders to do unsupervised layer-wise pre-training on the previous layersto learn preliminary weights for them. We then use these weights to find the hiddenunit activations required for learning the kernel. Once we have the activations for12this layer we use the hidden unit activations to learn appropriate dictionaries.Once the dictionaries are learned, one can “fine-tune” the dictionaries by con-tinuing to learn them during the actual training of the network. This increases thenumber of dynamic parameters but might increase the performance of the network.In this work the computation of the dictionaries is only done once, before the train-ing starts, and the dictionaries will be fixed throughout the training.2.3.2 Columnar ArchitectureThe prediction process we have described so far assumes that U is the same forall features; however, this can be too restrictive. Continuing with the intuition thatfilters should be smooth local edge detectors we might want to choose α to givehigh resolution in a local area of pixel space while using a sparser representation inthe remainder of the space. Naturally, in this case we would want to choose severaldifferent α’s, each of which concentrates high resolution information in differentregions.It is straightforward to extend feature prediction to this setting. Suppose wehave several different index sets α1, . . . ,αJ corresponding to elements from a dic-tionary U. For each α j we can form the sub-dictionary Uα j and predicted thefeature matrix W j = Uα j Wα j . The full predicted feature matrix is formed by con-catenating each of these matrices blockwise W =[W1, . . . ,WJ]. Each block of thefull predicted feature matrix can be treated completely independently. Blocks Wiand W j share no parameters—even their corresponding dictionaries are different.Each α j can be thought of as defining a column of representation inside thelayer. The input to each column is shared, but the representations computed ineach column are independent. The output of the layer is obtained by concatenatingthe output of each column.Introducing additional columns into the network increases the number of staticparameters but the number of dynamic parameters remains fixed. The increasein static parameters comes from the fact that each column has its own dictionary.The reason that there is not a corresponding increase in the number of dynamicparameters is that for a fixed size hidden layer the hidden units are divided betweenthe columns. The number of dynamic parameters depends only on the number of13hidden units and the size of each dictionary.2.3.3 TrainingDeep networks are usually trained by optimizing an objective function using thegradient descent algorithm. In this method, the gradient of the objective functionwith respect to the parameters (W for all layers) is calculated and is used to com-pute a step vector. This step is then used to update the parameters. The gradient iscomputed for the new parameters and the iteration continues until we converge toan optimum.When using our technique, we simply replace W with UWα and compute thegradients with respect to Wα instead of W. Therefore instead of searching for anoptimum point in the weight space of W we search in the lower-dimensional spaceof Wα . The optimization in this lower-dimensional space can be easier; however,examining this hypothesis requires more experiments and it is outside of the scopeof this work.By replacing W with UWα , we can speed up the calculation of the output vec-tor h = vW, by first multiplying v and U, and then multiplying the result by Wα .In other words, we can write h = g(v(UWα)) = g((vU)Wα). By using this methodwe reduce the computational complexity of calculating h by nα × (nv +nh)/(nv×nh).It should be noted here that this speed up does not simply generalize to the con-volutional layers; since in these layers we have to transform the vectorized featurew into its tensor form and then convolve it with the input. In this case we need toreconstruct the matrix W explicitly which actually introduces more computation inthe problem.The main advantage of our method is that the number of dynamic parameters,the parameters that we update in each iteration of the training, is reduced. Thisreduction in the number of parameters can lead to lower communication costs be-tween nodes in parallel training frameworks.2.3.4 Parallel TrainingIn this section we demonstrate how our method can be integrated into parallel train-ing frameworks. The frameworks that are discussed here apply both to GPU and14CPU networks. There are two general ways of distributing computation across thenodes:• Data ParallelismIn this approach, each node computes the gradients for all the parameters ofthe network; However the nodes use only a subset of the data to compute thegradient. In other words all nodes contain the whole network parameters butthe data is distributed across the nodes.• Model ParallelismIn this approach, the neural network is divided and distributed across thenodes and each node is only responsible for calculating the gradient andupdating its own portion of network parameters. However all the data is fedto each node.Parallel computing frameworks usually use a mixture of both data parallelismand model parallelism for training. In model parallelism each node only calculatesthe gradients and updates its own portion of parameters, hence there is no needto communicate gradients between the nodes. In data parallelism, however, nodesshare model parameters, and each node computes the gradient for all the parame-ters in the model. Therefore, we need to communicate the gradients or parameterupdates between the nodes.Our method for reducing the number of parameters can be used in the dataparallelism approach to reduce the communication cost between the nodes. Let usassume that we have multiple machines that are used in a data parallel framework.Each machine calculates its own gradient and sends it to the parameter server. Byusing our method the model will have less dynamic parameters and this gradientwill become smaller.Our method requires that each machine store additional data (the dictionarymatrix U), however, usually there is not a significant memory constraint on eachmachine and the main bottleneck is in communicating parameters between the ma-chines.152.4 Remarks2.4.1 Interpretation as PoolingAs was described in this chapter, in a layer of a typical neural network, we haveh = g(vW). We also showed that using our method we can write the weight matrixas W = UWα (equation 2.6). Using these two equations we can writeh = g(vUWα). (2.7)Until now we have motivated our approach as a way of reducing the numberof parameters using a base dictionary which means we looked at equation 2.7 withparenthesis around UWα , h = g(v(UWα)). By using matrix product associativitywe can look at this equation as:h = g((vU)Wα) = g(v′Wα) (2.8)where v′= vU is an nα -dimensional vector. Here matrix U is first applied to theinputs and the result v′ is multiplied by the weight matrix Wα . In this setting, wecan think of U as a matrix that linearly transforms an nv-dimensional vector v intoan intermediate nα -dimensional vector v′. In other words U reduces the number ofinputs from nv to nα which is very similar to what a pooling layer does.So we can think of U as a linear pooling layer that pools the input into a smallerhidden layer. Figure 2.5 shows a schematic of this pooling layer.One can argue that what we achieve in our method can be interpreted as alearned pooling layer that appropriately pools the inputs into a smaller layer andhence reduces the number of active parameters in the network.In order to further investigate the pooling interpretation we can look at one ofthe dictionaries that we learned and see what kind of pooling layer it represents.For simplicity we look at a dictionary obtained using squared exponential kerneland kernel ridge regression. For grayscale input patches of 9×9 pixels. Figure 2.6shows the columns of this dictionary which act as a pooling layer on the inputs.16h = g(vW)vW(a)h = g(v′Wα)v v′U Wα(b)Figure 2.5: Interpretation of the dictionary matrix as a pooling layer. (a) Asimple fully-connected layer (W). (b) The layer (W), replaced by asmaller fully-connected layer (Wα ) and a fixed pooling layer (U). Thedashed lines indicate the weights that are fixed.2.4.2 Interpretation as Low-Rank FactorizationSo far we have motivated our method as a way of reducing the number of parame-ters in W by using a base dictionary U. By looking at equation 2.6W = UWα (2.6 revisited)we can see that this equation is factorizing an nv×nh matrix W into two matricesU (nv× nα ) and Wα (nα × nh). Therefore, one can interpret this equation as alow-rank factorization of weight matrix W.Our goal in this work is to exploit the correlations in the features and representthe features in a lower dimensional space that will be less correlated. We achievethis by judiciously constructing a suitable base dictionary for each case. This ap-proach can be interpreted as a low-rank factorization where the first matrix U isfixed.Here one can argue that the same goal might be achieved if we transform thefeatures to a lower dimensional space by simply representing the W as a multipli-17Figure 2.6: Pooling layer representation of a dictionary based on a squaredexponential kernel when inputs are 16×16 image patches.cation of two low-rank matrices U and Wα and concurrently learning the low-rankmatrices.We investigate this in section 3.2 and show that constructing and fixing a suit-able U matrix is required for good performance and that a naı¨ve approach of learn-ing the two matrices simultaneously will not perform well in practice. Further-more, by using a fixed U matrix we reduce the number of dynamic parameters tonα ×nh, whereas if we learn both matrices the number of dynamic parameters willbe nα × (nv +nh).One reason for poor performance of naı¨ve low-rank factorization can be thefact that the solution to the decomposition is not unique. If Q is any invertiblematrix of size nα ×nα we have W = UV = (UQ)(Q−1V) = U˜V˜, therefore U˜ andV˜ will also be acceptable solutions. We remove this redundancy in our approachby fixing the value of U and learning only V. There are other methods for re-moving the redundancy including imposing an orthonormality constraint on U orV. We do not investigate these methods, since they usually require relatively com-plex optimization algorithms and will not easily fit in the simple backpropagation18framework for training deep networks.19Chapter 3ExperimentsIn this section, we show effectiveness and versatility of our parameter predictionmethod by evaluating it in deep networks with various architectures and in differentapplication areas. These include unsupervised learning with RICA and supervisedlearning with MLP and convolutional neural networks with applications in imageand speech recognition.First, we evaluate our method for a single layer, unsupervised learning methodcalled reconstruction ICA on an image recognition task. This experiment serves asa proof of concept and shows that we can learn lower-dimensional weight vectors,predict the full weight vectors effectively, and maintain the performance of thenetwork.Since the training time becomes a major issue in large networks in order toevaluate various dictionaries we use a simple MLP trained on the MNIST dataset,which is a relatively small dataset. This experiment allows us to do comparisonbetween many different dictionary choices.Next, we experiment using more complex network architectures. We showthat we can effectively reduce the number of parameters in a large convolutionalnetwork in an image recognition task on the relatively large CIFAR-10 dataset.Finally, we move to the speech recognition task and demonstrate the effective-ness of our method in reducing the parameters in an MLP on the TIMIT dataset.203.1 Reconstruction Independent Component AnalysisIn this experiment we use RICA which is a well-known unsupervised learningmethod [18]. It was used in the Google landmark paper where they constructed avery large network which consisted of stacked RICA layers [19]. Here, we inves-tigate RICA in its single layer form.RICA was proposed by Le et al. [18] as an extension of Independent Compo-nent Analysis (ICA). ICA requires the weights to be orthonormal which makes theoptimization hard and it is also sensitive to data whitening. Reconstruction ICArelaxes the orthonormality requirement to an extra loss term which goes to zerowhen the weights are orthonormal. The objective function for reconstruction ICAbecomesJ(W) = ηnn∑i=1||v(i)WWT −v(i)||22 +n∑i=1nh∑j=1g(v(i)W j) (3.1)where η is the regularization parameter and n is the number of data points. In ourexperiments in this section we use a soft L1 nonlinearity g(.) = log(cosh(.)).This way of relaxing orthonormality constraint makes RICA very similar toa sparse autoencoder, for more analysis of this method we refer the reader to theoriginal paper [18].3.1.1 Constructing the DictionaryThe experiments in this section are all done using a single layer RICA on datasetsof natural images. Therefore the features are smooth and we can follow the methodin section 2.2.1 to construct our dictionaries.We use uniformly sampled locations as our lower-dimensional space α and usea squared exponential kernel and equation 2.3 to construct the dictionary.3.1.2 TrainingFor training the model we replaced W in equation 3.1 with UWα and optimized theobjective function with respect to Wα using L-BFGS [24] optimization algorithm.213.1.3 ResultsUnsupervised Feature Learning from Natural ImagesIn the first experiment we use a dataset of 11 natural images [14]. We extract20,000 16× 16 patches from these images and train our RICA model on thesepatches.In order to visually evaluate how our method works we learn features withvarying degrees of parameter reduction ratios. We define the parameter reductionratio r to be the ratio of the number of dynamic parameters after using our method(nα × nh) to the number of original dynamic parameters (nv× nh), which will ber = nα/nv.Figure 3.1 shows sample features learned using various parameter reductionvalues. It is evident from this figure that we can learn features that are visuallysimilar to the features learned with full parameters while having a reduction ratioof up to around 10%.Figure 3.1: RICA with different proportion of parameters learned. In the left-most column 100% of the parameters are learned. In the right-most col-umn only 10% of parameters are learned and the rest are predicted. Theintermediate columns interpolate between these extremes in incrementsof 10%.Image ClassificationIn this section, we quantitatively evaluate how reducing the parameters affects theperformance of RICA. RICA is an unsupervised learning method, in order to doimage classification using the features learned by RICA we use a standard sys-220.08 0.28 0.49 0.69 0.9Proportion of parameters learned0.240.310.370.440.5ErrorCIFAR-10nokernelRICAFigure 3.2: Performance of features learned using RICA on CIFAR-10dataset with and without parameter prediction.tem proposed by [4], which is designed to evaluate unsupervised feature learningmethods.In the first step, random patches are extracted from the training set. Thesepatches (without the labels) are then used in the unsupervised learning method(RICA) to learn features. In order to evaluate the quality of these features, weconvolve them with the input images, we divide the feature activations into fourquadrants and do a simple L2 pooling in each quadrant. We then concatenate theresulting four vectors to form our feature vector. These feature vectors, along withthe image labels are then used to train an L2 SVM classifier. See [4] for a moredetailed description of this system.For new images in the test dataset, we follow the same procedure to constructthe feature vectors and use the trained SVM to predict the labels. The accuracy ofthe test set is then used as a quantitative measure for the quality of learned featuresby the unsupervised learning method.We first train our model using the CIFAR-10 dataset. When training the RICAmodel we construct the dictionary as was explained in section 3.1.1 and use ourtechnique to reduce the number of parameters. We vary the proportion of parame-ters learned to evaluate the result of reducing parameters on the quality of learnedfeatures.23Figure 3.2 shows the accuracy of our model as we decrease proportion of pa-rameters learned. At 100% learned parameters we get a competitive accuracy ofaround 77%. As we decrease the proportion of parameters learned we do not see asignificant drop in accuracy until the very low range of the proportion of parame-ters learned. In fact we can learn only 30% of parameters without a significant lossin accuracy.0.08 0.28 0.49 0.69 0.9Proportion of parameters learned0.420.440.460.480.5ErrorSTL-10nokernelRICAFigure 3.3: Performance of features learned using RICA on STL-10 datasetwith and without parameter prediction. Error bars show 90% confidenceintervals.We do a similar experiment with STL-10 dataset. This dataset has a largecollection of unlabelled images and a small labeled set. We perform unsupervisedtraining on random patches of training set. We follow the STL-10 protocol whichrequires training of the classifier on each fold of the training set and finding theaverage accuracy on the full test.Figure 3.3 shows the results for the STL-10 dataset. Since we do the trainingand testing on each of the folds separately we will have multiple accuracy results.We report the average and the 90% confidence intervals in the figure. Similar tothe results from CIFAR-10 dataset we can only learn around 30% of parameterswithout a significant loss in the accuracy.24Increasing Classification PerformanceIn this step, we show that we can use our technique to obtain higher accuracy whileusing the same number of dynamic parameters. We can achieve this by using twicethe number of features but learning only half of their parameters and hence keepingthe number of dynamic parameters constant.Figure 3.4 shows that by using our method we can get improved performancein CIFAR-10 and STL-10 datasets while using the same number of dynamic pa-rameters.1800 15750 29700 43650 57600Number of dynamic parameters0.240.310.370.440.5ErrorCIFAR-10RICARICA-50%(a) CIFAR-105000 23750 42500 61250 80000Number of dynamic parameters0.420.440.460.480.5ErrorSTL-10RICARICA-50%(b) STL-10Figure 3.4: Comparison of RICA, and RICA with 50% learned parametersusing the same number of dynamic parameters (RICA-50 has twice thenumber of features)3.2 Multi-Layer PerceptronsIn this section, we do experiments using a simple MLP on MNIST dataset whichconsists of 60,000 28×28 images of handwritten digits. All the models in this sec-tion have two hidden layers with 784-500-500-10 architecture. The final layer is a10-way softmax which is used to classify the digit. We only predict the parametersin the first two hidden layers the softmax parameters are not predicted. We are alsousing a columnar architecture in both hidden layers with 10 columns. The hiddenunits are divided equally between the columns.253.2.1 Constructing the DictionaryIn this section we are working with a simple MLP and the dataset is relativelysmall which results in relatively short training times. We can therefore use thissetting to evaluate various dictionary constructing methods. We experiment withpermutations of the following dictionaries:• AE dictionary is constructed by training the layer as an autoencoder• SE kernel based dictionary with squared exponential kernel• Emp kernel based dictionary with learned kernel using empirical covariance• Emp2 kernel based dictionary with learned kernel using squared empiricalcovariance• LowRank dictionary is not fixed; both dictionary U and weight matrix Ware learned (simultaneously) during the network training (see section 2.4.2.• RandCon dictionary consists of random columns of zero and identity vec-tors• RandFixU dictionary consists of iid Gaussian bases.For the first layer we can use all of these dictionaries including the squaredexponential based dictionary (SE) since the inputs are images. For the secondlayer however we can use all the dictionaries except for the SE dictionary since thehidden unit activations in the second layer are permutation invariant and there is notopological structure in the weights.Pre-training (as an autoencoder) is used in all the cases to learn the dictionaries.3.2.2 TrainingIn order to keep consistency across various models we pretrain all the models usingautoencoders except for the LowRank model. We did not do pretraining for theLowRank model since we found the autoencoder training to be very unstable inthis case. We construct the models using various permutations of dictionaries forthe first and second hidden-layer. The training is then done using the standardback-propagation algorithm with the weight matrices W replaced by UWα .263.2.3 Results0.2 0.4 0.6 0.8 1.0Proportion of parameters learned0.0150.0200.0250.0300.0350.0400.0450.050ErrorCompare Completion MethodsnokernelLowRankRandCon-RandConRandFixU-RandFixUSE-EmpSE-Emp2SE-AEFigure 3.5: Comparison of various dictionary learning methods used for anMLP trained on MNIST. The legend shows the dictionary type forlayer1-layer2.Figure 3.5 shows the results from the models that we tested. We can see thatthe SE-Emp and SE-Emp2 dictionary learning techniques clearly outperform thenaı¨ve dictionaries RandCon and RandFixU. More importantly, we can see that us-ing these dictionaries we can learn only 10% of the parameters without a significantloss in the model’s predictive accuracy.This figure also shows that the naı¨ve matrix factorization approach describedin section 2.4.2 performs very poorly compared to the cases where we choose andlearn an appropriate dictionary U.3.3 Convolutional Neural NetworksIn this section, we try to show the universality of our method by experimentingon a convolutional neural network. Our network has three convolutional layersfollowed by a fully connected layer and a softmax layer. We train the models onthe CIFAR-10 dataset which contains 32×32×3 RGB images.The inputs to the network are 32× 32× 3 images and the first convolutionallayer has 48 filters of size 8× 8× 3. The second convolutional layer applies 64filters of size 8× 8× 48 to the output of the first layer. The third convolutionallayer applies 64 features of size 5×5×64 to the outputs of the second layer. Theoutput form the third convolutional layer is fed to a fully connected layer with 500hidden units. The convolutional layers have one column and the fully connected27layer has five columns.3.3.1 Constructing the DictionaryThe inputs to the first convolutional layer are images which have a topologicalstructure we can therefore use a squared exponential kernel to construct the dic-tionary for the first layer. convolutional layers, in contrast to simple MLP layers,conserve the topological structure in the data; we can therefore use squared expo-nential kernel based dictionaries for all the convolutional layers. The input to thefully connected layer is the output of a convolutional layer therefore it too has atopological structure. We can use squared exponential kernel based dictionary forthe fully connected layer as well.Training is done using the backpropagation algorithm with weight matrices Wreplaced by UWα .3.3.2 ResultsIf we learn all the parameters in our model we achieve %75 accuracy which is agood result for the CIFAR-10 dataset with a network of this size. We then reducethe proportion of parameters learned. Figure 3.6 shows the results for this model.We can see that we can reduce the proportion of parameters learned to %25 withouta significant loss in accuracy.1.00.50.25 0.75Proportion of parameters learned0.080.160.240.320.4ErrorConvnet CIFAR-10convnetFigure 3.6: Performance of a convolutional neural network on CIFAR-10dataset.283.4 Speech RecognitionAll of our experiments in the previous sections were image recognition tasks, inorder to show the generality of our method in this section we show the results ona speech recognition task. We use TIMIT dataset, which is a widely used datasetin speech recognition community. TIMIT contains broadband recordings of 630speakers of eight major dialects of American english, each reading ten phoneticallyrich sentences.The raw speech data was analyzed using a 25-ms Hamming window with a10-ms fixed frame rate. In all the experiments, we represented the speech using12th-order Mel frequency cepstral coefficients (MFCCs) and energy, along withtheir first and second temporal derivatives. The networks used in this experimentare simple MLPs with two hidden layers with 1024 units. Phone error rate wasmeasured by performing Viterbi decoding the phones in each utterance using a bi-gram language model, and confusions between certain sets of phones were ignoredas described in [23].We use empirical covariance kernel for constructing the dictionary for bothhidden layers.Figure 3.7 shows the Phone Error Rate (PER) for various degrees of reductionin the number of dynamic parameters. We can see that we do not lose performanceeven if we only learned 20% of parameters.0.2 0.4 0.6 0.8 1.0Proportion of Parameters Learned0.00.10.20.30.40.5Phone Error RateTIMITEmp-EmpFigure 3.7: Performance on TIMIT core test set using an MLP with two hid-den layers.29Chapter 4Conclusions4.1 Related WorkSeveral other methods for limiting the number of parameters in a neural networkhave been explored in the literature. An early approach is the technique of “Opti-mal Brain Damage” [21] which uses approximate second derivative information toremove parameters from an already trained network. This technique does not ap-ply in our setting, since we aim to limit the number of parameters before training,rather than after.The most common approach to limiting the number of parameters is to uselocally connected features [4]. The size of the parameterization of locally con-nected networks can be further reduced by using tiled convolutional networks [10]in which groups of feature weights which tile the input space are tied. Convolu-tional neural networks [15] are even more restrictive and force a feature to havetied weights for all receptive fields.Techniques similar to the one in this paper have appeared for shallow mod-els in the computer vision literature. The double sparsity method of Rubinstein etal. [28] involves approximating linear dictionaries with other dictionaries in a sim-ilar manner to how we approximate network features. Rigamonti et al. [27] studyapproximating convolutional filter banks with linear combinations of separable fil-ters. Both of these works focus on shallow single layer models, in contrast to ourfocus on deep networks.30The techniques described in this paper are orthogonal to the parameter reduc-tion achieved by tying weights in a tiled or convolutional pattern. Tying weightseffectively reduces the number of feature maps by constraining features at differentlocations to share parameters. Our approach reduces the number of parameters re-quired to represent each feature and it is straightforward to incorporate into a tiledor convolutional network.Cires¸an et al. [1] control the number of parameters by removing connectionsbetween layers in a convolutional network at random. They achieve state-of-the-art results using these randomly connected layers as part of their network. Ourtechnique subsumes the idea of random connections, as described in Section ??.The idea of regularizing networks through prior knowledge of smoothness isnot new, but it is a delicate process. Lang and Hinton [16] tried imposing explicitsmoothness constraints through regularization but found it to universally reduceperformance. Naı¨vely factoring the weight matrix and learning both factors tendsto reduce performance as well. Although the idea is simple conceptually, execu-tion is difficult. Gu¨lc¸ehre et al. [11] have demonstrated that prior knowledge isextremely important during learning, which highlights the importance of introduc-ing it effectively.Our approach is superficially similar to the factored RBM [26, 30], whose pa-rameters form a 3-tensor. Since the total number of parameters in this model isprohibitively large, the tensor is represented as an outer product of three matrices.Major differences between our technique and the factored RBM include the factthat the factored RBM is a specific model, whereas our technique can be appliedmore broadly—even to factored RBMs. In addition, in a factored RBM all factorsare learned, whereas in our approach the dictionary is fixed judiciously.4.2 Future DirectionsIn this paper we always choose the set α of indices uniformly at random. Thereare a wide variety of other options which could be considered here. Other workshave focused on learning receptive fields directly [3], and would be interesting toincorporate with our technique.In a similar vein, more careful attention to the selection of kernel functions31is appropriate. We have considered some simple examples and shown that theypreform well, but our study is hardly exhaustive. Using different types of kernelsto encode different types of prior knowledge on the weight space, or even learningthe kernel functions directly as part of the optimization procedure as in [31] arepossibilities that deserve exploration.When no natural topology on the weight space is available we infer a topologyfor the dictionary from empirical statistics; however, it may be possible to insteadconstruct the dictionary to induce a desired topology on the weight space directly.This has parallels to other work on inducing topology in representations [10] aswell as work on learning pooling structures in deep networks [5].4.3 ConclusionsIn this work we showed that many neural networks are over-parameterized andlearned parameters are correlated. We then proposed a method that uses this cor-relation to reduce the number of parameters in the network without affecting theperformance.The technique we presented is extremely general, can be applied to a broadrange of models and is orthogonal to the choice of network architecture, activationfunction and optimization method. We tested our method on a variety of networkarchitectures and application areas. In these experiments we showed that using ourmethod we can learn only a fraction of the parameters, as low as 10% in somecases, without a significant loss in the predictive accuracy of the model.32Bibliography[1] D. Cires¸an, U. Meier, and J. Masci. High-performance neural networks forvisual object classification. arXiv:1102.0183, 2011. → pages 31[2] D. C. Cires¸an, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep, big,simple neural nets for handwritten digit recognition. Neural Computation,22(12):3207–3220, 2010. → pages 1, 2[3] A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. InAdvances in Neural Information Processing Systems, pages 2528–2536,2011. → pages 31[4] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks inunsupervised feature learning. In Artificial Intelligence and Statistics, 2011.→ pages 2, 23, 30[5] A. Coates, A. Karpathy, and A. Ng. Emergence of object-selective featuresin unsupervised feature learning. In Advances in Neural InformationProcessing Systems, pages 2690–2698, 2012. → pages 32[6] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-traineddeep neural networks for large-vocabulary speech recognition. Audio,Speech, and Language Processing, IEEE Transactions on, 20(1):30–42,2012. → pages 1, 2[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: Alarge-scale hierarchical image database. In IEEE Computer Vision andPattern Recognition, pages 248–255, 2009. → pages 2[8] M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas. Predictingparameters in deep learning. In C. Burges, L. Bottou, M. Welling,Z. Ghahramani, and K. Weinberger, editors, Advances in Neural InformationProcessing Systems 26, pages 2148–2156. Curran Associates, Inc., 2013. →pages iii33[9] K. Fukushima. Neocognitron: A self-organizing neural network model for amechanism of pattern recognition unaffected by shift in position. Biologicalcybernetics, 36(4):193–202, 1980. → pages 1[10] K. Gregor and Y. LeCun. Emergence of complex-like cells in a temporalproduct network with local receptive fields. arXiv preprint arXiv:1006.0448,2010. → pages 30, 32[11] C. Gu¨lc¸ehre and Y. Bengio. Knowledge matters: Importance of priorinformation for optimization. In International Conference on LearningRepresentations, 2013. → pages 31[12] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deepbelief nets. Neural computation, 18(7):1527–1554, 2006. → pages 1, 12[13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, andR. Salakhutdinov. Improving neural networks by preventing co-adaptationof feature detectors. CoRR, abs/1207.0580, 2012. → pages 2[14] A. Hyva¨rinen, J. Hurri, and P. O. Hoyer. Natural image statistics,volume 39. Springer, 2009. → pages 22[15] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification withdeep convolutional neural networks. In Advances in Neural InformationProcessing Systems, pages 1106–1114, 2012. → pages 1, 2, 30[16] K. Lang and G. Hinton. Dimensionality reduction and prior knowledge ine-set recognition. In Advances in Neural Information Processing Systems,1990. → pages 31[17] Q. V. Le, J. Ngiam, Z. Chen, D. Chia, P. W. Koh, and A. Y. Ng. Tiledconvolutional neural networks. Advances in Neural Information ProcessingSystems, pages 1279–1287, 2010. → pages 2[18] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng. ICA with reconstructioncost for efficient overcomplete feature learning. Advances in NeuralInformation Processing Systems, 24:1017–1025, 2011. → pages 6, 21[19] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean,and A. Ng. Building high-level features using large scale unsupervisedlearning. In International Conference on Machine Learning, 2012. → pages2, 2134[20] Y. LeCun and C. Cortes. The MNIST database of handwritten digits. URL:http://yann. lecun. com/exdb/mnist, 1998. → pages 2[21] Y. LeCun, J. S. Denker, S. Solla, R. E. Howard, and L. D. Jackel. Optimalbrain damage. In Advances in Neural Information Processing Systems,pages 598–605, 1990. → pages 30[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learningapplied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. → pages 1, 2[23] K.-F. Lee and H.-W. Hon. Speaker-independent phone recognition usinghidden markov models. Acoustics, Speech and Signal Processing, IEEETransactions on, 37(11):1641–1648, 1989. → pages 29[24] D. C. Liu and J. Nocedal. On the limited memory bfgs method for largescale optimization. Mathematical programming, 45(1-3):503–528, 1989. →pages 21[25] A.-r. Mohamed, G. E. Dahl, and G. Hinton. Acoustic modeling using deepbelief networks. Audio, Speech, and Language Processing, IEEETransactions on, 20(1):14–22, 2012. → pages 2[26] M. Ranzato, A. Krizhevsky, and G. E. Hinton. Factored 3-way restrictedBoltzmann machines for modeling natural images. In Artificial Intelligenceand Statistics, 2010. → pages 31[27] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua. Learning separable filters. InIEEE Computer Vision and Pattern Recognition, 2013. → pages 30[28] R. Rubinstein, M. Zibulevsky, and M. Elad. Double sparsity: learning sparsedictionaries for sparse signal approximation. IEEE Transactions on SignalProcessing, 58:1553–1564, 2010. → pages 30[29] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis.Cambridge University Press, New York, NY, USA, 2004. → pages 7, 10[30] K. Swersky, M. Ranzato, D. Buchman, B. Marlin, and N. Freitas. Onautoencoders and score matching for energy based models. In InternationalConference on Machine Learning, pages 1201–1208, 2011. → pages 31[31] P. Vincent and Y. Bengio. A neural support vector network architecture withadaptive kernels. In International Joint Conference on Neural Networks,pages 187–192, 2000. → pages 3235
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Predicting parameters in deep learning
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Predicting parameters in deep learning Shakibi, Babak 2014
pdf
Page Metadata
Item Metadata
Title | Predicting parameters in deep learning |
Creator |
Shakibi, Babak |
Publisher | University of British Columbia |
Date Issued | 2014 |
Description | The recent success of large and deep neural network models has motivated the training of even larger and deeper networks with millions of parameters. Training these models usually requires parallel training methods where communicating large number of parameters becomes one of the main bottlenecks. We show that many deep learning models are over-parameterized and their learned features can be predicted given only a small fraction of their parameters. We then propose a method which exploits this fact during the training to reduce the number of parameters that need to be learned. Our method is orthogonal to the choice of network architecture and can be applied in a wide variety of neural network architectures and application areas. We evaluate this technique using various experiments in image and speech recognition and show that we can only learn a fraction of the parameters (up to 10% in some cases) and predict the rest without a significant loss in the predictive accuracy of the model. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2014-11-07 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
IsShownAt | 10.14288/1.0165555 |
URI | http://hdl.handle.net/2429/50999 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2014-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2014_november_shakibi_babak.pdf [ 385.9kB ]
- Metadata
- JSON: 24-1.0165555.json
- JSON-LD: 24-1.0165555-ld.json
- RDF/XML (Pretty): 24-1.0165555-rdf.xml
- RDF/JSON: 24-1.0165555-rdf.json
- Turtle: 24-1.0165555-turtle.txt
- N-Triples: 24-1.0165555-rdf-ntriples.txt
- Original Record: 24-1.0165555-source.json
- Full Text
- 24-1.0165555-fulltext.txt
- Citation
- 24-1.0165555.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0165555/manifest