Deep Learning for Sequence Modelling: Applications in Natural Languagesand Distributed Compressive SensingbyHamid PalangiM.Sc., Electrical Engineering, Sharif University of Technology, 2010B.Sc., Electrical Engineering, Shahid Rajaee University, 2007A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Electrical and Computer Engineering)The University of British Columbia(Vancouver)April 2017c© Hamid Palangi, 2017AbstractThe underlying data in many machine learning tasks have a sequential nature. For example, words generatedby a language model depend on the previously generated words, behaviour of a user in a social networkevolves over different snapshots of the social graph over time, different speech frames in a speech recognitionsystem depend on the previously generated frames, etc. The main question is, how can we leverage thesequential nature of data to extract better features for the target machine learning task? In an effort toaddress this question, this thesis presents three important applications of deep sequence modelling methods.The first application is sentence modelling for web search task where the question addressed is: Howto create a vector representation for a natural language sentence, aimed at a specific task such as websearch? We propose Long Short-Term Memory Deep Structured Semantic Model (LSTM-DSSM), a modelfor information retrieval on click-through data with significant performance gains compared to existing stateof the art baselines. The proposed LSTM-DSSM model sequentially takes each word in a sentence, extractsits relevant information, and embeds it into a semantic vector.The second application involves distributed compressive sensing, where the main questions addressedare: (a) How to relax the joint sparsity constraint? (b) How to exploit the structural dependency of a group ofsparse vectors to reconstruct them better from down-sampled measurements (structures besides sparsity)?(c) How to exploit available offline data during sparse reconstruction at the decoder? We present a deeplearning approach to distributed compressive sensing and show that it addresses the above three questionsand is almost as fast as greedy methods during reconstruction.The third application is related to speech recognition. The question addressed here is: How to builda recurrent acoustic model for the task of phoneme recognition? We present a Recurrent Deep StackingNetwork (R-DSN) architecture for this task. Each module in the R-DSN is initialized with an Echo StateNetwork (ESN), and then all connection weights within the module are fine tuned.iiPrefaceThis thesis presents the research conducted by Hamid Palangi, in collaboration with Prof. Rabab Ward andProf. Li Deng. Below is the list of the scientific papers by Hamid Palangi that were published through thecourse of his studies as a Doctor of Philosophy (PhD) candidate at University of British Columbia (UBC).• J1: Hamid Palangi, Rabab Ward, Li Deng, “Distributed Compressive Sensing: A Deep LearningApproach”, IEEE Transactions on Signal Processing, Volume: 64, Issue: 17, pp. 4504-4518, 2016• J2: Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song,Rabab Ward, “Deep Sentence Embedding Using the Long Short-Term Memory Networks: Analysisand Application to Information Retrieval”, IEEE/ACM Transactions on Audio, Speech, and LanguageProcessing, Volume: 24, Issue: 4, pp. 694-707, 2016• J3: Hamid Palangi, Rabab Ward, Li Deng, “Convolutional Deep Stacking Networks for DistributedCompressive Sensing”, Elsevier Signal Processing, Volume: 131, pp. 181-189, 2016• C1: Hamid Palangi, Rabab Ward, Li Deng, “Exploiting Correlations Among Channels in DistributedCompressive Sensing with Convolutional Deep Stacking Networks”, IEEE International Conferenceon Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 2016• C2: Hamid Palangi, Rabab Ward, Li Deng, “Reconstruction of Sparse Vectors in Compressive Sens-ing with Multiple Measurement Vectors using Bidirectional Long Short-Term Memory”, IEEE GlobalConference on Signal and Information Processing (GlobalSIP), Washington D.C., USA, 2016• C3: Hamid Palangi, Li Deng, Rabab Ward, “Recurrent Deep-Stacking Networks for sequence clas-sification”, IEEE China Summit and International Conference on Signal and Information Processing(ChinaSIP), Xi’an, China, 2014.• C4: Hamid Palangi, Li Deng, Rabab Ward, “Learning Input and Recurrent Weight Matrices in EchoState Networks”, Deep Learning Workshop in Conference on Neural Information Processing Systems(NIPS), Lake Tahoe, Nevada, USA, 2013• C5: Hamid Palangi, Rabab Ward, Li Deng, “Using Deep Stacking Network to Improve StructuredCompressed Sensing with Multiple Measurement Vectors”, IEEE International Conference on Acous-tics, Speech and Signal Processing (ICASSP), Vancouver, Canada, 2013iii• J1, C2: Hamid Palangi (the primary author and the main contributor of the papers) proposed themethods, implemented them, and performed the evaluations. Prof. Ward and Prof. Deng providedextensive technical and editorial feedback throughout writing of these papers.• J3, C1, C5: Hamid Palangi (the primary author and the main contributor of the papers) proposed themethods, implemented them, and performed the evaluations. Prof. Deng provided the codes of hisDeep Stacking Network (DSN) paper [29]. Prof. Ward and Prof. Deng provided extensive technicaland editorial feedback throughout writing of these papers.• J2: Hamid Palangi (the primary author and the main contributor of the paper) derived all the formu-lations required for learning, implemented the methods, and performed the evaluations. Prof. Deng,Prof. Gao and Prof. He provided extensive technical feedback during the project and helped withwriting of the paper. Dr. Shen and Dr. Song provided extensive technical feedbacks during imple-mentation of methods. Dr. Shen and Prof. He helped to run some of the experiments. Dr. Chenhelped with writing of the paper and provided valuable feedback and ideas for visualizations. Prof.Ward provided extensive technical and editorial feedback throughout writing of the paper.• C3, C4: Hamid Palangi (the primary author and the main contributor of the papers) derived all theformulations required for learning, implemented the methods, and performed the evaluations. Prof.Deng provided the codes of his DSN paper [29], extensive technical feedback during design and im-plementation of methods and helped to run the experiments. Prof. Deng and Prof. Ward providedextensive technical and editorial feedback throughout writing of these papers.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Compressive Sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.3 Motivations and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.3.1 Recurrent Deep Stacking Networks for Speech Recognition . . . . . . . . . . . . . 101.3.2 A Sentence Modelling Method for Web Search Task . . . . . . . . . . . . . . . . . 111.3.3 A Deep Learning Approach to Distributed Compressive Sensing . . . . . . . . . . . 111.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 Recurrent Deep Stacking Networks for Speech Recognition . . . . . . . . . . . . . . . . . . . 142.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.2 General Recurrent Networks and Specific Echo State Networks . . . . . . . . . . . . . . . . 152.3 Learning the Input Weight Matrix in ESN . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.3.1 Case 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.3.2 Case 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.4 Learning the Recurrent Weight Matrix (Wrec) in the ESN . . . . . . . . . . . . . . . . . . . 222.5 Learning the Recurrent Deep Stacking Network . . . . . . . . . . . . . . . . . . . . . . . . 232.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25v3 A Sentence Modelling Method for Web Search Task . . . . . . . . . . . . . . . . . . . . . . . 273.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.3 Sentence Embedding Using RNNs with and without LSTM Cells . . . . . . . . . . . . . . . 303.3.1 The Basic Version of RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303.3.2 The RNN with LSTM Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.4 Learning Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333.5 Analysis of the Sentence Embedding Process and Performance Evaluation . . . . . . . . . . 353.5.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373.5.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.6 Expressions for the Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.6.1 RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.6.2 LSTM-RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463.7 A More Clear Figure for Input Gate for “hotels in shanghai” Example . . . . . . . . . . . . 503.8 A Closer Look at RNNs with and without LSTM Cells in Web Document Retrieval Task . . 513.9 Derivation of BPTT for RNN and LSTM-RNN . . . . . . . . . . . . . . . . . . . . . . . . 523.9.1 Derivation of BPTT for RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523.9.2 Derivation of BPTT for LSTM-RNN . . . . . . . . . . . . . . . . . . . . . . . . . 563.10 LSTM-RNN Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623.10.1 LSTM-RNN Semantic Vectors: Another Example . . . . . . . . . . . . . . . . . . 623.10.2 Key Word Extraction: Another Example . . . . . . . . . . . . . . . . . . . . . . . . 643.11 Doc2Vec Similarity Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653.12 Diagram of the Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673.13 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684 A Deep Learning Approach to Distributed Compressive Sensing . . . . . . . . . . . . . . . . 694.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694.1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704.1.2 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714.1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724.2 RNN with LSTM Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754.3.1 High Level Picture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754.3.2 Training Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 774.3.3 Learning Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 774.4 Experimental Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794.4.1 MNIST Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804.4.2 Natural Images Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864.5 Expressions for the Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884.5.1 Output Weights U . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92vi4.5.2 Output Gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924.5.3 Input Gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 934.5.4 Forget Gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 944.5.5 Input without Gating (yg(t)) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 944.5.6 Error Signal Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 954.6 Bidirectional LSTM for MMV Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 954.7 Convolutional Deep Stacking Networks for Distributed Compressive Sensing . . . . . . . . 994.7.1 Experimental Evaluation and Discussion . . . . . . . . . . . . . . . . . . . . . . . 1044.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1065 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1115.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1115.1.1 Recurrent Deep Stacking Networks for Speech Recognition . . . . . . . . . . . . . 1115.1.2 A Sentence Modelling Method for Web Search Task . . . . . . . . . . . . . . . . . 1125.1.3 A Deep Learning Approach to Distributed Compressive Sensing . . . . . . . . . . . 1125.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115viiList of TablesTable 2.1 Frame-level phoneme-state classification error rates for the TIMIT core test set . . . . . . 25Table 2.2 Frame level phoneme classification error rates for the R-DSN on the TIMIT core test setas a function of the number of modules for the fixed 4000 or 10000 neurons in the hiddenlayer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26Table 3.1 Key words for query: “hotels in shanghai” . . . . . . . . . . . . . . . . . . . . . . . . . 39Table 3.2 Key words for document: “shanghai hotels accommodation hotel in shanghai discountand reservation” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40Table 3.3 Keywords assigned to each cell of LSTM-RNN for different queries of two topics, “food”represented by green color and “health” represented by red color . . . . . . . . . . . . . 41Table 3.4 Comparisons of NDCG performance measures (the higher the better) of proposed modelsand a series of baseline models, where nhid refers to the number of hidden units, ncellrefers to number of cells, win refers to window size, and n is the number of negativesamples which is set to 4 unless otherwise stated. Unless stated otherwise, the RNN andLSTM-RNN models are chosen to have the same number of model parameters as theDSSM and CLSM models: 14.4M, where 1M = 106. The boldface numbers are the bestresults. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44Table 3.5 RNNs with & without LSTM cells for the same query: “hotels in shanghai” . . . . . . . 51Table 3.6 RNNs with & without LSTM cells for the same Document: “shanghai hotels accommo-dation hotel in shanghai discount and reservation” . . . . . . . . . . . . . . . . . . . . . 51Table 3.7 RNN versus LSTM-RNN for Query: “how to f ix bath tub wont turn o f f ” . . . . . . . . 52Table 3.8 RNN versus LSTM-RNN for Document: “how do you paint a bathtub and what paintshould . . . ” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52Table 3.9 Keyword extraction for Query: “how to f ix bath tub wont turn o f f ” . . . . . . . . . . . 64Table 3.10 Keyword extraction for Document: “how do you paint a bathtub and what paint should . . .” 65viiiList of FiguresFigure 1.1 A Feed-forward DNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3Figure 1.2 Importance of depth (Figure from [40]). . . . . . . . . . . . . . . . . . . . . . . . . . . 4Figure 1.3 Effect of adding many layers (Figure from [52]). . . . . . . . . . . . . . . . . . . . . . 5Figure 1.4 Adding skip connections to resolve the problem with many layers (Figure from [52]). . . 5Figure 1.5 One layer RNN unfolded over time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6Figure 1.6 General CS framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Figure 2.1 Illustration of a general RNN unfolded over time . . . . . . . . . . . . . . . . . . . . . 16Figure 2.2 Illustration of an R-DSN architecture with three modules shown with different colors . . 23Figure 3.1 The basic architecture of the RNN for sentence embedding, where temporal recurrenceis used to model the contextual information across words in the text string. The hiddenactivation vector corresponding to the last word is the sentence embedding vector (blue). 31Figure 3.2 The basic LSTM architecture used for sentence embedding . . . . . . . . . . . . . . . . 32Figure 3.3 The click-through signal can be used as a (binary) indication of the semantic similaritybetween the sentence on the query side and the sentence on the document side. Thenegative samples are randomly sampled from the training data. . . . . . . . . . . . . . . 34Figure 3.4 Query: “hotels in shanghai”. Since the sentence ends at the third word, all the values tothe right of it are zero (green color). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38Figure 3.5 Document: “shanghai hotels accommodation hotelin shanghai discount and reservation”. Since the sentence ends at the ninth word, allthe values to the right of it are zero (green color). . . . . . . . . . . . . . . . . . . . . . 39Figure 3.6 Activation values, y(t), of 10 most active cells for Query: “hotels in shanghai” andDocument: “shanghai hotels accommodation hotel in shanghai discount and reservation” 40Figure 3.7 LSTM-RNN compared to RNN during training: The vertical axis is logarithmic scale ofthe training cost, L(Λ), in (3.4). Horizontal axis is the number of epochs during training. 45Figure 3.8 Input gate, i(t), for Document: “shanghai hotels accommodation hotel in shanghai dis-count and reservation” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50Figure 3.9 Query: “how to fix bath tub wont turn off ” . . . . . . . . . . . . . . . . . . . . . . . . . 63Figure 3.10 Document: “how do you paint a bathtub and what paint should . . . ” . . . . . . . . . . . 64Figure 3.11 Query: “how to fix bath tub wont turn off ” . . . . . . . . . . . . . . . . . . . . . . . . . 65ixFigure 3.12 Document: “how do you paint a bathtub and what paint should . . . ” . . . . . . . . . . . 66Figure 3.13 Architecture of the proposed method. . . . . . . . . . . . . . . . . . . . . . . . . . . . 67Figure 4.1 Block diagram of the proposed method unfolded over channels. . . . . . . . . . . . . . 72Figure 4.2 Block diagram of the Long Short-Term Memory (LSTM). . . . . . . . . . . . . . . . . 73Figure 4.3 Randomly selected images for test from MNIST dataset. The first channel encodes digitzero, the second channel encodes digit one and so on. . . . . . . . . . . . . . . . . . . . 80Figure 4.4 Comparison of different MMV reconstruction algorithms for MNIST dataset. Bottomfigure is the same as top figure without results of BCS algorithm to make the differenceamong different algorithms more visible. In this experiment M = 72 and N = 144. . . . 82Figure 4.5 Reconstructed images using different MMV reconstruction algorithms for 4 images ofthe MNIST dataset. First row are original images, S, second row are measurement ma-trices, Y, third row are reconstructed images using LS-CS, fourth row are reconstructedimages using SOMP, fifth row using PCSBL-GAMP, sixth row using MT-BCS, seventhrow using T-SBL, eighth row using NWSOMP and the last row are reconstructed imagesusing the proposed LSTM-CS method. . . . . . . . . . . . . . . . . . . . . . . . . . . . 83Figure 4.6 Comparison of different MMV reconstruction algorithms for MNIST dataset. Bottomfigure is the same as top figure without results of BCS algorithm to make the differenceamong different algorithms more visible. In this experiment M = 36 and N = 144. . . . 84Figure 4.7 Reconstruction performance of the methods discussed in this chapter for different noiselevels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85Figure 4.8 Phase transition diagram for different methods on MNIST dataset where 90% perfectrecovery is considered. Assuming a perfect recovery condition of NMSE ≤ 0.6 for thisdataset. “n” is the number of entries in each sparse vector, “m” is the number of randommeasurements and “k” is the number of non-zero entries in each sparse vector. . . . . . 86Figure 4.9 Comparison of different MMV reconstruction algorithms for different number of ran-dom measurements for MNIST dataset. In this experiment n = 144. . . . . . . . . . . . 87Figure 4.10 Randomly selected natural images from three different classes used for test. The firstrow are “buildings”, the second row are “cows” and the third row are “flowers”. . . . . . 88Figure 4.11 Comparison of different MMV reconstruction algorithms for natural image dataset usingDCT transform and just one layer for LSTM model in LSTM-CS. Image classes fromtop to bottom respectively: buildings, cows and flowers. . . . . . . . . . . . . . . . . . 89Figure 4.12 Comparison of different MMV reconstruction algorithms for natural image dataset usingWavelet transform and just one layer for LSTM model in LSTM-CS. Image classes fromtop to bottom respectively: buildings, cows and flowers. . . . . . . . . . . . . . . . . . 90Figure 4.13 CPU time for different MMV reconstruction algorithms. These times are for the experi-ment using DCT transform for 10 test images from the building class. The bottom figureis the same as top figure but without T-SBL and LS-CS to make the difference amongdifferent methods more clear. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91Figure 4.14 Block diagram of the proposed bidirectional LSTM-CS method unfolded over channels. 96xFigure 4.15 Up: Comparative reconstruction performance using DCT transform. Middle: Recon-struction performance using Wavelet transform. Bottom: CPU time. Note that the timeis reported for T-MSBL which is a faster version of T-SBL. . . . . . . . . . . . . . . . . 98Figure 4.16 Proposed Method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100Figure 4.17 The curve of FISTA coefficients mk−1mk+1 in (4.64) with respect to the epoch number. . . . . 103Figure 4.18 High level block diagram of the proposed method. . . . . . . . . . . . . . . . . . . . . 105Figure 4.19 Comparison of different MMV reconstruction algorithms performance for the naturalimage dataset using DCT transform. Image classes from top to bottom are buildings,cows and flowers respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107Figure 4.20 Comparison of different MMV reconstruction algorithms performance for the naturalimage dataset using Wavelet transform. Image classes from top to bottom are buildings,cows and flowers respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108Figure 4.21 The performance of CDSN-CS with different window sizes. This experiment was con-ducted for image class of cows with DCT transform as sparsifying basis. . . . . . . . . . 109Figure 4.22 The performance of CDSN-CS with and without RBM initialization. This experimentwas conducted for image class of cows with Wavelet transform as sparsifying basis. . . . 109Figure 4.23 CPU time of different MMV reconstruction algorithms discussed in this section. Up: allmethods. Bottom: removing T-SBL and BCS to make the difference among remainingmethods more visible. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110xiGlossaryDNN Deep Neural NetworkLSTM-DSSM Long Short-Term Memory Deep Structured Semantic ModelLSTM Long Short-Term MemoryMMV Multiple Measurement VectorsSOMP Simultaneous Orthogonal Matching PursuitR-DSN Recurrent Deep Stacking NetworkDSN Deep Stacking NetworkESN Echo State NetworkDL Deep LearningGPU Graphical Processing UnitCS Compressive SensingMLP Multilayer PerceptronRELU Rectified Linear UnitCNN Convolutional Neural NetworkSVM Support Vector MachineRNN Recurrent Neural NetworkBPTT Backpropagation Through TimeEEG ElectroencephalogramWBAN Wireless Body-Area NetworkCT Computed TomographyxiiMRI Magnetic Resonance ImagingSMV Single Measurement VectorCDSN Convolutional Deep Stacking NetworkMFCC Mel Frequency Cepstral CoefficientsGRU Gated Recurrent UnitNLP Natural Language ProcessingDCNN Dynamic Convolutional Neural NetworkNDCG Normalized Discounted Cumulative GainBCS Bayesian Compressive SensingMT-BCS Multitask Compressive SensingSBL Sparse Bayesian LearningLASSO least absolute shrinkage and selection operatorxiiiAcknowledgmentsI would not have been able to make it to this point without the support of my supervisors Prof. Rabab Wardand Prof. Li Deng.I would like to thank Prof. Rabab Ward for her patience, generous help, and wise suggestions that helpedme a lot during my PhD studies. Other than her scientific guidance throughout my research work, she taughtme a lot about technical writing and technical presentation skills. I would also like to thank Prof. RababWard for her financial support through her Natural Sciences and Engineering Research Council of Canada(NSERC) grants.I would like to thank Prof. Li Deng for giving me a chance to explore new ideas, teaching me howto approach a new challenging problem, and his great help and suggestions throughout my PhD studies.I would also like to thank Prof. Li Deng and Microsoft Research (MSR) to give me 3 amazing researchinternship opportunities. At MSR, I had the honour to work under supervision of Prof. Li Deng with verysmart researchers Prof. Jianfeng Gao, Prof. Xiaodong He, Dr. Yelong Shen, Dr. Jianshu Chen and Dr.Xinying Song. I want to thank all of them for many important lessons that I learned from them.I would like to thank anonymous reviewers of my research papers for their comments which helped meimprove the quality of my works.Each friendship offers something totally unique and irreplaceable. Each friendship ultimately makes uswho we are. I am happy and lucky to have great friends at our lab. Thank you Davood, Hossein, Sima,Fahimeh, Basak, Ali, Maryam, Anahita, and Ilya.All that I am, or hope to be, I owe to my parents and family. Thank you for your unconditional love.Thank you for working hard and for always being there for me.Thank you dad . . . it hurts every single day to think that you are not here anymore. I start this thesis withyour memory . . .xivChapter 1IntroductionEverything we hear is an opinion, not a fact. Everything we see is a perspective, not the truth.— Marcus AureliusDeep Learning (DL) [23, 27, 48, 55, 79, 129, 131] is the foundation for many recent breakthroughs indifferent research areas [20, 37, 51, 58, 76, 91, 109, 125, 128]. Some fundamental ideas in deep learning,e.g., feedforward or convolutional neural networks, back-propagation and recurrent neural networks havebeen known from 1980s. The main reasons for the recent popularity of these ideas are: (a) the fast andefficient computation using Graphical Processing Unit (GPU) which makes it possible to run experiments ina reasonable amount of time, and (b) the availability of very large datasets (90% of world’s data have beengenerated over the last two years[2]).Among the different deep models, sequence modelling methods are of great importance. The mainreason for this is that many real world and machine learning tasks have an inherent sequential nature. Forexample, (a) sentences used in a conversation depend on the previous sentences used in that conversationand the context, i.e., a sequence of sentences, (b) words / characters used in a sentence depend on previouswords / characters, i.e., a sequence of words / characters. This is very important in language modellingtasks. (c) Different blocks in one image and different image frames in a video have strong sequentialdependencies / correlations, i.e., a sequence of images. (d) In an acoustic model of speech recognitionsystem, the probability of which speech frame happens next, depends on the previously observed speechframes, i.e., a sequence of speech frames. (e): In a social network, e.g., Facebook or Linkedin, a socialgraph represents each user by a node and the connections of that user to other users as edges in the graph.This graph quickly evolves over time depending on new interactions among users, e.g., new connections,followers, etc. Different snapshots of this social graph over time have strong correlations / dependencies,i.e., a sequence of social graphs. This is central in the important task of link prediction in social networks,e.g., “people you may know” feature in Facebook or Linkedin. (f) In a machine translation system, e.g.,Google translate or Skype translator, the main task is: given a sentence in a source language (e.g., French),generate its translation in a target language (e.g., English). The state-of-the-art methods for this task arebased on sequence to sequence models that use one sequence model for source language (encoder), andanother sequence model for target language (decoder). The encoder sequence model basically extracts1features from a sequence of words / characters in a source language. The decoder sequence model generateswords / characters in target language given the features from encoder. In other words, the whole system isbased on the sequential nature of words / characters in source and target language.In this thesis, we addressed three important problems using approaches based on deep sequence mod-elling methods:1. Acoustic modelling for speech recognition which resulted in a new model, Recurrent Deep StackingNetwork (R-DSN) for speech recognition.2. Sentence modelling for web search task which resulted in Long Short-Term Memory Deep Struc-tured Semantic Model (LSTM-DSSM). This model can potentially improve the performance of searchengines like Google or Bing and3. Compressive Sensing which resulted in a fundamentally new deep learning approach to distributedcompressive sensing.The rest of this chapter is organized as follows: in the next section an introduction to deep learningand deep sequence modelling methods are presented. In section 1.2, an introduction to Compressive Sens-ing (CS) is described. Section 1.3 explains motivations behind this research work and identifies our contri-butions. The organization of the thesis is summarized in section 1.4.1.1 Deep LearningSince their introduction in 1957 [104], neural networks have been used for finding patterns in input dataand for modelling complicated relationship between input and output data. The first generation of neuralnetworks is the Perceptron [104] that performs a classification task by mapping an input vector of realnumbers to a single binary value (its output). The Perceptron has a simple learning algorithm but it cannot learn complex structures. The second generation of neural networks is the Multilayer Perceptron (MLP)which has multiple layers of neurons (processing units). In MLP, each neuron in a hidden layer has anon-linear activation function, e.g., sigmoid function f (x) = 11+e−x . The learning algorithm used in MLPis Backpropagation [105]. Although MLP performs better than the Perceptron, Backpropagation has twomain problems; it is very slow in networks with multiple hidden layers and it can get stuck in poor localoptima. Due to these problems in the 1990’s many researchers stopped using neural networks with multiplehidden layers [56]. However, in 2006, two seminal papers [60] and [61] proposed more efficient methodsfor training Deep Neural Network (DNN), i.e., neural networks with many hidden layers. Afterwards, DNNswere used successfully in different applications, e.g. speech, audio, image, video and language processing[129].A DNN is a neural network with more than one hidden layer in space (e.g., the multilayer feed-forwardor the convolutional neural networks) or in time (e.g., recurrent neural network). For example, for feed-forward neural networks, each unit j in a hidden layer is typically a non-linear function (e.g., sigmoid, tanh2𝑥1 𝑥2 𝑥𝑖 𝑥𝑛 1 … … ℎ1(1) ℎ2(1) ℎ𝑗(1) ℎ𝑚1(1) 1 … … ℎ1(2) ℎ2(2) ℎ𝑘(2) ℎ𝑚2(2) 1 … … 𝑦1 𝑦2 𝑦𝑙 𝑦𝑐 … … Figure 1.1: A Feed-forward DNN.or Rectified Linear Unit (RELU)) that maps its input to a state h(1)j whereh(1)j = f (∑ixiwi j +b j). (1.1)In (1.1), h(1)j is the output of the j-th unit in the first hidden layer (the superscript (1) in h(1)j indicates thehidden layer number which is 1 in this case), the wi j is the weight for the connection between this unit andthe unit xi (where xi is the unit i in the layer below), and b j is the bias for j-th unit. f (.) is the non-linearactivation function for unit j. For example, the activation function in the case of sigmoid non-linearity isf (x) =11+ e−x. (1.2)A feed-forward DNN model is presented in Fig. 1.1. In this figure, a DNN with n input (visible) unitsrepresented by [x1,x2, . . . ,xn], the m1 hidden units in hidden layer 1 that are feature detectors, the m2 hiddenunits in hidden layer 2 that are also feature detectors and the c output units [y1,y2, . . . ,yc] is represented.As an example, if we want to do multi-class classification using this model, and assuming that we have cclasses, we add a softmax layer on the top of the network (not shown in Fig. 1.1) to get a value between 0and 1 for each class, i.e.,p j =ey j∑leyl, (1.3)where p j in (1.3) can be interpreted as the probability that input data x = [x1,x2, . . . ,xn]T belongs to thej-th class. You can think of the input data x as pixels of an image, or as the Mel Frequency CepstralCoefficients (MFCC) features of a speech frame, or as one hot representation of a word, or a bag of words3Figure 1.2: Importance of depth (Figure from [40]).representation of a sentence, or simply some hand crafted features from the data. We can also use a weightedlayer to define p j in (1.3) and learn those weights as well. In other words, if we define y = [y1,y2, . . . ,yc]Tthenp j =eyT U j∑leyT Ul, (1.4)where U j is j-th row of the weights matrix between layer [y1,y2, . . . ] and layer [p1, p2, . . . ] (not shown in Fig.1.1) which will be learned during training. To learn all parameters in Fig. 1.1 and also U in (1.4), we needto define an appropriate cost function. For example, the cost function in the above classification example isusually the cross entropy between the target labels, t, and the observed probabilities, p,loss =−∑jt jlog(p j). (1.5)In (1.5), target label t j is our given knowledge from training data of the class input data x belongs to. If xactually belongs to class j, then t j is 1, otherwise it is 0. p j, on the other hand, is the DNN’s opinion aboutinput data x. It is the probability that DNN thinks input data x belongs to class j.The main role of hidden layers in Fig. 1.1 is to create high level good features from the input data x. Anatural question here is: how important is depth of the network? In other words, does adding more hiddenlayers continuously improves the performance? We can investigate the importance of depth by inspectingthe different parts of the Krizhevsky’s Convolutional Neural Network (CNN) [76] which has 8 layers and istrained on ImageNet [106] dataset. The architecture of Krizhevsky’s CNN along with the results of applyingthe Support Vector Machine (SVM) classifier on different layers are shown in Fig. 1.2. The main observationfrom Fig. 1.2 is that applying the classifier to features from higher layers results in significant performanceimprovement (e.g., from 44.8% using features from layer 1 to 85.5% using features from layer 7). It isimportant to note that simply adding many more layers does not always improve the performance. A goodexample for this is the results of simply using 20 layers and 56 layers of CNN for CIFAR-10 dataset which4Figure 1.3: Effect of adding many layers (Figure from [52]).Figure 1.4: Adding skip connections to resolve the problem with many layers (Figure from [52]).is presented in Fig. 1.3. Similar phenomena have been observed on the ImageNet dataset which meansthat learning better models is not always equivalent to adding more layers. Note that the above problem isNOT caused by overfitting as it is obvious from the training error curves in Fig. 1.3. One reason is the factthat with deeper networks the error signal during backpropagation is no longer significant enough by thetime it arrives to the lower layers. To resolve this problem, a residual network is proposed in [52] whichsimply adds skip connections in the CNN architecture. One example is shown in Fig. 1.4. Note that the skipconnection is applied before the non-linear activation function.The Recurrent Neural Network (RNN) is a type of deep neural network that is deep in the temporaldimension and has been used extensively in time sequence modelling [13, 28, 36, 49, 84, 85, 103]. Toexplain a simple RNN, assume that we have been observing input data from time t = 1 till time t = T .For example, we can think of the input data as a sentence with T words. Assume the t-th word in thesentence is represented by vector x(t) = [x1(t),x2(t), . . . ,xn(t)]T (T stands for transpose here). x(t) can bea feature vector like word2vec [88] or a one hot representation of t-th word. We continue our explanationof RNN using Fig. 1.5 which represents a one layer recurrent neural network unfolded over time. InFig. 1.5, h(t) = [h1(t),h2(t), . . . ,hm1(t)]T is vector of hidden units’ activations after observing t-th word5𝒙(1) 𝒙(2) … 𝒙(𝑡) … 𝒙(𝑇) 𝒉(1) … … 𝒉(2) 𝒉(𝑡) 𝒉(𝑇) 𝒚(1) … … 𝒚(2) 𝒚(𝑡) 𝒚(𝑇) 𝑾 𝑾 𝑾 𝑾 𝑼 𝑼 𝑼 𝑼 𝑾𝒓𝒆𝒄 𝑾𝒓𝒆𝒄 𝑾𝒓𝒆𝒄 𝑾𝒓𝒆𝒄 𝑾𝒓𝒆𝒄 Figure 1.5: One layer RNN unfolded over time.in the sentence x(t). Since there is just one hidden layer, we have not shown the layer number superscriptfor simplicity. Similarly, y(t) = [y1(t),y2(t), . . . ,yc(t)]T is the vector of the output units’ activations afterobserving t-th word in the sentence. W is the matrix of weights between the input layer and the hidden layer,Wrec is the matrix of weights among the hidden layer activations over time and U is the matrix of weightsbetween the hidden layer and the output layer. During the forward pass, h(t) and y(t) are calculated ash(t) = f (Wx(t)+Wrech(t−1)) (1.6)y(t) = g(Uh(t)), (1.7)where f (.) and g(.) are non-linear activation functions for the hidden layer and the output layer. Back-propagation Through Time (BPTT) is used to train this model. If we want to use this RNN model for theclassification task explained before, we can similarly add a softmax layer on the top and minimize the crossentropy cost function in (1.5). There are of course much more details into the learning process and how tomake it work (which we omit in this introductory section). We just want to emphasize that simple RNNs arenotoriously hard to train [12] due to vanishing and exploding gradients problems. Nevertheless, there aresome specific RNN architectures that have resolved these problems. We will discuss some of them later inthis thesis.A reasonable question to ask here is: Can all problems be mapped to a neural network style y = f (x)?The short answer is No! Example tasks for which the simple y = f (x) fails are: (a) cloze style QA wherethe task is to read and comprehend a text (e.g., book, etc) and then answer questions about it. (b) Given atext, the task is to fill in the blanks and (c) ChatBot [24].Generally a powerful model to address the above challenging tasks needs to remember the externalcontext, given an input, the model needs to know where to look for in the context, what to look for in thecontext, how to reason, using this external context and the model should also handle a changing externalcontext. To address some of the above challenges Memory Networks [112] have been proposed where the6main idea is to separate the controller of the memory from the memory itself. In other words, it combines alarge memory with a learning component that can read and write to the memory.1.2 Compressive SensingCompressive Sensing (CS) is an effective framework for acquiring sparse signals by which both sensing andcompression are performed at the same time [9, 14, 31]. Since there are numerous examples of natural andartificial signals that are sparse in the time domain, spatial domain or a transform domain, CS has foundvarious applications. One example to motivate the essence of compressive sensing is as follows: in a stan-dard image compression like JPEG (or JPEG2000), the first step is to apply a DCT (or Wavelet) transformto the image to create a compact (few non-zeros) representation, and then the very small coefficients arediscarded. Of course there are much more details but this is the basic idea. Consider though that we havespent computational resources to calculate the small coefficients that we later discard. Now the question is:is there any way to just sample (measure) the information that we need? The answer is Yes and this couldbe done by using compressive sensing.As a sample application, compressive sensing can be used for tele-monitoring the scalp Electroen-cephalogram (EEG) signals via Wireless Body-Area Network (WBAN) [16]. In WBAN, a number of sensorscollect and compress the EEG data. The compressed data are then sent to a remote terminal for furtherprocessing. By using this system, patients do not need to go to the hospital frequently. Instead, their EEGdata can be monitored and transmitted to the hospital when needed. In designing a WBAN, there are anumber of constraints such as the sensor’s energy constraint due to its limited battery life [90], accuracyof the compressed and then transmitted physiological signals and a low hardware cost. Conventional datacompression schemes can not overcome all these constraints [81]. Compressive sensing, on the other hand,can form a promising framework to solve the above constraints [6].Besides wearable technology for health telemonitoring discussed above, other example applications ofcompressive sensing are building a single pixel camera [34], Computed Tomography (CT) to reduce theamount of radiation [5], Magnetic Resonance Imaging (MRI) to shorten MRI scanning sessions [80] andmany more applications including geophysical data analysis, computational biology, remote sensing andcommunications [1].A general CS framework with encoder and decoder is represented in Fig. 1.6. CS framework includes anencoder to down-sample the input data and a decoder to reconstruct the input data from its down-sampledversion. In Fig. 1.6, the data is represented by vector x = [x(1),x(2), . . . ,x(N)]T . You can think of x asvectorized version of an image with N pixels. The CS encoder is generally a wide random matrix. This isrepresented by Φ which is an M×N matrix where M < N. The down-sampled version of x is representedby y = [y(1),y(2), . . . ,y(M)]T . The entries of y are called measurements, each of them is a linear randomcombination of all entries of x. The y is referred to as the measurement vector. Since we have just onemeasurement vector in Fig. 1.6, it is also referred to as Single Measurement Vector (SMV) problemy =Φx. (1.8)7Encoder Decoder 𝚽 𝑀 𝑁 𝒙 𝒚 Random Matrix Recovering Sparse Solution 𝚽 𝑀 𝑁 𝑀 < 𝑁 𝚿 𝑁 𝑁 𝒚 Sparsifying Basis 𝒙 = 𝜳𝒔 𝒔 Signal Reconstruction 𝒙 𝚿 𝑁 𝑁 𝒙 = 𝜳𝒔 Figure 1.6: General CS framework.The main task in CS is how to reconstruct the original data x given the measurement vector y and themeasurement matrix Φ. One of the main reasons that this task is challenging is that the linear system ofequations in (1.8) is underdetermined, i.e., it has more unknowns than equations. Therefore, there are infinitenumber of solutions for x. Nevertheless, if x is sparse enough (has few non-zero entries), or if there is atransform domain Ψ where the transformed representation of x is sparse (e.g., using DCT of x), then thetheory of CS guarantees that under some reasonable conditions, the sparsest solution is unique [31]. Thebasis Ψ can be complete, i.e., Ψ ∈ ℜN×N , or over-complete, i.e., Ψ ∈ ℜN×N1 where N < N1 (compressedsensing for over-complete dictionaries is introduced in [15]). For this section, assume a complete basisΨ ∈ℜN×N andx =Ψs, (1.9)where s is K− sparse, i.e., s has at most K non-zero elements. The condition to obtain a unique s given y isusually described based on the spark1 value, i.e., K < spark(ΦΨ)2 [32] or the the mutual coherence2 (µ(ΦΨ))1spark of a given matrix (defined in section 3 of [32]), is the smallest possible number of linearly dependent columns of thatmatrix. It gives a measure of linear dependency in the system modelled by a given matrix.2The coherence µ(A) of a given matrix A is defined as the maximum absolute value of inner products between any two columns8where the condition is K < 12(1+1µ(ΦΨ)) [33]. To find s, the following optimization problem should besolved:sˆ = argmins‖s‖0 s.t. y =ΦΨs, (1.10)where ‖s‖0 is (pseudo) norm 0 of s and represents the number of non-zero entries in s. Since `0 norm makesthe problem non-convex, it is usually relaxed to the `1 norm and the following convex optimization problemis solved instead of (1.10):sˆ = argmins‖s‖1 s.t. y =ΦΨs. (1.11)In distributed compressive sensing , also known as the Multiple Measurement Vectors (MMV) prob-lem, a set of L sparse vectors {si}i=1,2,...,L is to be jointly recovered from a set of L measurement vectors{yi}i=1,2,...,L. Some application areas of MMV include magnetoencephalography, array processing, com-pressed sensing of different Electroencephalogram (EEG) channels on brain, equalization of sparse commu-nication channels and cognitive radio [33].Suppose that the L sparse vectors and the L measurement vectors are arranged as columns of matricesS = [s1,s2, . . . ,sL] and Y = [y1,y2, . . . ,yL] respectively. In the MMV problem, S is to be reconstructed givenY,Y = AS, (1.12)whereA =ΦΨ. (1.13)In (1.12), S is assumed to be jointly sparse, i.e., non-zero entries of each vector occur at the same locationsas those of other vectors, which means that the sparse vectors have the same support. Assume that S isjointly sparse. Then, the necessary and sufficient condition to obtain a unique S given Y is [22]:|supp(S)|< spark(A)−1+ rank(S)2, (1.14)where |supp(S)| is the number of rows in S with non-zero energy. In the SMV problem, no rank informationexists. In the MMV problem, the rank information exists and affects the uniqueness bounds. Generally,solving the MMV problem jointly can lead to better uniqueness guarantees than solving the SMV problemfor each vector independently [35].In the current MMV literature, a jointly sparse matrix is recovered typically by one of the followingmethods: 1) greedy methods [120] like Simultaneous Orthogonal Matching Pursuit (SOMP) which performsnon-optimal subset selection, 2) relaxed mixed norm minimization methods [7, 82, 119, 121, 134], or 3)Bayesian methods like [72, 126, 133] where a posterior density function for the values of S is created,assuming a prior belief, e.g., Y is observed and S should be sparse in basis Ψ. The selection of one of theabove methods depends on the requirements imposed by the specific application.of A (Definition 2 of [33]).91.3 Motivations and ContributionsIn this thesis we study the applications of deep learning methods for sequence modeling, and how to usethem in favor of better performance for the target machine learning task. Our focus is on three importantapplications related to speech recognition, sentence modeling for web search and distributed compressivesensing. For each application, we explain the problem and present our proposed method to address theproblem followed by a list of our contributions.We start the thesis by proposing a novel architecture for sequence learning that is a stack of one layerrecurrent neural networks with linear output units and initialized by echo state networks. We refer to thisarchitecture as Recurrent Deep Stacking Network (R-DSN) and evaluate its performance on a speech recog-nition task. We then study the information retrieval task and propose a sentence embedding method thatuses Long Short-Term Memory (LSTM), an special neural network architecture for sequence modelling,and show that it yields significantly better results than state-of-the-art methods on a real world commercialsearch engine’s click-through data. Finally, we target the problem of distributed Compressive Sensing (CS)and propose a deep learning approach for this problem, i.e., a new sparse reconstruction algorithm (for datacompressed by CS) that is based on deep learning. We show that the proposed method, is almost as fast asgreedy methods, performs better than a number of strong well known methods and effectively addresses theMMV problem when sparse vectors are not jointly sparse.Below we explain the motivation behind each work and briefly list our contributions in each area.1.3.1 Recurrent Deep Stacking Networks for Speech RecognitionEcho State Network (ESN) is a special type of the temporally deep network model, the Recurrent NeuralNetwork (RNN), where the recurrent matrix is carefully designed and both the recurrent and input the ma-trices are fixed. ESN uses a linear activation function for the output units. It uses this linearity to simplifythe learning of the output matrix. The Deep Stacking Network (DSN), on the other hand, is constructedby stacking shallow feed-forward neural networks with linear output units on top of each other using con-catenated features derived from the lower modules of the DSN and the raw input data. DSNs do not haverecurrent connections, making them less effective in modelling and classifying input data with temporaldependencies. Our contributions in the area of sequence modelling methods of neural networks with linearoutput units are as follows:• we devise a special technique that takes advantage of the linearity in the output units of an ESN, tolearn the input and recurrent matrices. This has not been done in earlier ESNs due to their well knowndifficulty in learning these matrices.• Compared to the technique of BPTT in learning general RNNs, our proposed method exploits the lin-earity of the activation function in the output units to formulate the relationships amongst the variousmatrices in an RNN. These relationships result in the gradient of the cost function having an analyti-cal form. This has the advantage of enabling us to compute the gradients faster and more accuratelyinstead of obtaining them by recursion as in BPTT.10• we embed recurrent connections into the DSN, giving rise to what we call an R-DSN. Each module ofthe R-DSN consists of a special form of recurrent neural networks. Generalizing from the earlier DSN,the use of linearity in the output units of the R-DSN enables us to derive a closed form for computingthe gradient of the cost function with respect to all network matrices without backpropagating errors.1.3.2 A Sentence Modelling Method for Web Search TaskLearning a good representation (or features) of input data is an important task in machine learning. In textand language processing, one such problem is the learning of an embedding vector for a sentence; that is, totrain a model that can automatically transform a sentence to a vector that encodes the semantic meaning ofthe sentence. Our contributions in the area of sentence embedding and information retrieval are:• Developing a model that addresses sentence embedding, a hot topic in current natural language pro-cessing research, using an RNN with LSTM cells. The proposed LSTM-RNN model sequentially takeseach word in a sentence, extracts its information, and embeds it into a semantic vector.• Training above LSTM-RNN model in a weakly supervised manner on users’ click-through data loggedby a commercial web search engine.• Performing visualization and analysis to understand how the sentence embedding process works. Themodel is found to automatically attenuate the unimportant words and detects the salient keywords inthe sentence. Furthermore, these detected keywords are found to automatically activate different cellsof the LSTM-RNN, where words belonging to a similar topic activate the same cell.• As a semantic representation of the sentence, the embedding vector can be used in many differentapplications. These automatic keyword detection and topic allocation abilities enabled by the LSTM-RNN allow the network to perform document retrieval, a difficult language processing task. On aweb search task, the LSTM-RNN embedding is shown to significantly outperform several existingstate-of-the-art methods.1.3.3 A Deep Learning Approach to Distributed Compressive SensingVarious studies that address the problem of Multiple Measurement Vectors (MMVs), also known as dis-tributed compressive sensing, have been recently carried. Most of these studies assume the different sparsevectors in such a problem to be jointly sparse. They do not use complicated structures beyond sparsityor block sparsity or tree structures and do not use available offline data during sparse reconstruction. Ourcontributions in the area of compressive sensing are as follows:• We relax the joint sparsity condition. Instead, we assume that these sparse vectors depend on eachother but that this dependency is unknown. We capture this dependency by computing the conditionalprobability of each entry in each vector being non-zero, given the residuals of all previous vectors. Toestimate these probabilities, we propose to use deep sequence modelling methods.11• To reconstruct the sparse vectors at the decoder, we propose a greedy solver that uses the above deepsequence model to estimate the conditional probabilities. By performing extensive experiments ontwo real world datasets, we show that the proposed method significantly outperforms the generalMMV solver (the Simultaneous Orthogonal Matching Pursuit (SOMP)) and a number of the model-based Bayesian methods. The proposed method does not alter or add any complexity to the generalcompressive sensing encoder. The trained model is only used to help the reconstruction algorithm atthe decoder.• We propose a bidirectional version of the above solver and show that it performs better than abovesolver.• We propose a Convolutional architecture for Deep Stacking Networks which results in ConvolutionalDeep Stacking Network (CDSN). As a sample application of this architecture, we use it in combinationwith the above proposed greedy solver and obtain significant performance improvements comparedto SOMP and a number of model-based Bayesian methods.1.4 OrganizationEvery chapter starts with an introduction section where the problem description is stated, followed by areview of the related background. Then our proposed solution for the described problem is explained. Thisis followed by the evaluation and comparison of the proposed solution against the state-of-the-art methods.A summary of each chapter is given in the final section of that chapter. More details about the organizationof each chapter are:• In Chapter 2 the problem of deep sequence modelling for neural networks with linear output units isaddressed. Two models are proposed, the first is a modified ESN whose input and recurrent weightsare learned in a special manner, and the second is the Recurrent Deep Stacking Network (R-DSN).The models are evaluated on the TIMIT dataset for the speech recognition task. The performance iscomparable with state-of-the-art methods.• In Chapter 3 the problem of sentence embedding is addressed. We have proposed and showed howto use a deep sequence modelling method, with Long Short-Term Memory (LSTM) cells, for sentenceembedding. The results were significantly better than those of the state-of-the-art methods on thedifficult task of information retrieval using click-through data. We performed extensive visualizationand analysis and showed that the proposed model can be used for key word extraction and topicmodelling.• In Chapter 4 the problem of distributed compressive sensing is addressed. We proposed a deep learn-ing approach to distributed compressive sensing. We showed how deep sequence modelling methodscan help to (a) relax the joint sparsity condition (b) use structures beyond sparsity during reconstruc-tion (c) use available offline data during reconstruction and (d) obtain the results of sparse reconstruc-tion almost as fast as greedy methods.12• In Chapter 5 we conclude this thesis and provide a summary of our contributions. We also discussfuture work and present potential opportunities for improvements and extensions of the methods pro-posed in this thesis.13Chapter 2Recurrent Deep Stacking Networks forSpeech RecognitionIt is not the knowledge but the act of learning, not possession but the act of getting there, whichgrants the greatest enjoyment.— Carl Friedrich Gauss2.1 IntroductionDeep Neural Network (DNN) have proven to yield excellent performance in many pattern classification andrecognition tasks [30, 55]. To fine tune DNNs, the stochastic gradient descent method is often used. Thismakes it difficult to parallelize the learning across different machines. As one way to overcome this problem,Deep Stacking Networks (DSNs) have been proposed [26, 29]. Motivated by the stacking concept of [127],(where complex functions are learnt by stacking a number of simple functions), a DSN is constructed bystacking feed-forward neural networks with one hidden layer having a non-linear activation function for thehidden units and a linear activation function for the output units[29]. Training each module is performedindependently of other modules, and there is no need to back propagate the error from the output to theinput layers. There are however no temporal connections in each module of a DSN; therefore, the temporaldependencies in the input data are not learnt effectively in DSNs.Recurrent Neural Networks (RNNs) belong to a general type of deep neural networks which are usedto model time sequences and dynamical systems [13, 28, 36, 49, 85, 103]. Echo State Networks (ESNs)belong to the general class of the RNNs [66, 67, 69, 118]. The following properties of an ESN make itdistinct from other types of RNNs. First, both the recurrent and input matrices in an ESN are fixed andnot learned. This is largely due to the difficulty in learning RNN [12, 99]. Second, the number of hiddenneurons in an ESN is typically much larger than those in regular RNNs. The main challenges of trainingRNNs described in [12, 99] is avoided in ESN by not training most of the very large number of difficultnetwork parameters. This also leads to avoiding potential overfitting problems. Third, the output units, alsocalled readout units, in an ESN are linear. This is unlike the typically nonlinear output units in regular RNNs.Given the very large number of hidden neurons of an ESN, the output or readout weight matrix is very large14as well. The use of linear output units allows the output weight matrix to be learned very efficiently and witha simple regularization mechanism based on ridge regression. Fourth, the learning of the ESN parameters(i.e. output matrix) is much simpler than that for regular RNNs. The former uses linear learning withconvex optimization, and the latter, is based typically on Back Propagation Through Time (BPTT) and ishighly nonlinear and non-convex. As a result, learning the ESN parameters can be effectively carried outvia batch training. This greatly facilitates parallel implementation. Learning the general RNN parameterson the other hand, typically requires stochastic gradient descent, and is more difficult for parallelization.The simplicity in ESN learning comes at the cost of not learning some important parameters (includingthe input and the recurrent weight matrices) and of using linear output units. While the special design of therecurrent matrices (see [68]) and the use of a large number of hidden neurons help in reducing the weaknessof using fixed parameters, it is desirable to make these parameters adapt to the data. This is as long as therequired learning remains simpler and more parallelizable than the common BPTT learning method appliedto regular RNNs.This chapter presents this type of learning for the input and recurrent matrices of ESNs. We propose atechnique that makes full use of the linearity in the output units when constructing constraints on all threeinput, output, and recurrent matrices in an ESN. The constraints enable us to compute the gradients as thelearning signal in an analytical form, and this makes the gradient estimate more accurate than when com-puted by recursion as in BPTT. Our preliminary experimental results on phoneme classification are highlypositive. It is demonstrated that learning one or both of the input and recurrent matrices in an ESN givesbetter phoneme classification accuracy than that obtained by a traditional ESN without learning them. Fur-thermore, when longer time steps are used in analytically computing the gradients, the better classificationresults are obtained.To model input data having temporal dependencies more effectively, we also introduce the RecurrentDeep Stacking Network (R-DSN) that combines the strengths of both DSNs and RNNs while overcomingsome of their weaknesses. The proposed R-DSN has recurrent (temporal) connections that are missing in theDSN. In the R-DSN, each module is a single-layer RNN with linear output units. To train each module, wedo not use BPTT. Instead, we initialize the weights using ESN, and then fine tune them using batch-modegradient descent based on a closed-form formulation to compute the gradients. After each epoch of finetuning, the echo state property is also forced to be satisfied.2.2 General Recurrent Networks and Specific Echo State NetworksA general RNN has temporal connections as well as input-to-hidden layer, hidden layer-to-output connec-tions. These connections are mathematically represented by the recurrent weight matrix Wrec, the inputweight matrix W, and the output weight matrix U, respectively. The RNN architecture is illustrated in Fig.2.1. It also includes input-to-output and output-to-hidden (feedback) connections, with the latter denoted byW f b. The sequential sections of Fig. 2.1(a), 2.1(b), 2.1(c), . . . , denote the RNN as it unfolds in time. Notethat all the weight matrices are constrained to be the same (i.e. they are tied) at any discrete point in time.In Figure 1, xi, hi and yi represent the input, hidden and output vectors at discrete time t = i. Again, theconnections between the input (xi) and hidden (hi) layers, the hidden and output (yi) layers, and the output15x1h1Wy1Ux2h2Wy2UWrecx3h3Wy3UWrec WrecxnhnWynUWrec...Wfb Wfb Wfb Wfb(a) (b) (c) ...Figure 2.1: Illustration of a general RNN unfolded over timeand hidden layers are represented by W, U and W f b respectively. The temporal connection between hi andhi+1 is represented by the matrix Wrec. Note that the direct connections from the input to the output layerform a part of the matrix U; i.e., it is equivalent to concatenating the input layer with the hidden layer.There are two standard methods to train RNNs, BPTT and the method based on Extended KalmanFiltering (EKF) [68]. BPTT is a first order method, which extends error backpropagation for feed-forwardnetworks by treating each time step as a new hidden layer ( but ties all the weight matrices across time).Generally, BPTT has slow convergence. Its difficulty in capturing long-term memory due to vanishinggradient and exploding gradient problems has been well known for many years [12]. It is often non-trivialto obtain good results with BPTT; see the tremendous amount of engineering required to make BPTT work[86, 103, 113]. The EKF-based method, on the other hand, has fast convergence properties and belongsto a second-order method. However, the computational requirements of the EKF method are high and itsimplementation is non-trivial [68], especially for large-scale problems.One prominent approach that has been proposed to overcome the difficulty in training RNNs is the ESN[69],[68]. As explained above, an ESN is a special type of RNNs whose recurrent weights (Wrec) and inputto hidden layer weights (W) are fixed and only U ( that represents the hidden layer to output and the inputto output weights ) is trained. The recurrent connections in Wrec are sparse and their values are carefullyfixed in a way that the echo-state property is preserved. ESNs can be trained very fast because the onlyconnections that are trained are the output connections. With good initialization, ESNs have been shown toyield good performance for one dimensional sequences; but for high dimensional data such as speech, thestudies have been relatively limited; please see [118].Since the output units in an ESN have a linear activation function and assuming the hidden units have asigmoid activation function, the formulation of an ESN as a special type of RNNs ( as shown in Fig. 2.1 )can be succinctly described byhi+1 = σ(WT xi+1+Wrechi+W f byi) (2.1)yi+1 = UT hi+1, (2.2)where σ(x) = 11+e−x . As discussed earlier, an ESN has a special designed recurrent matrix, which is endowed16with the echo state property in most existing versions [68, 69, 118]. The echo state property implies thatthe state, or the hidden units’ activities, of the network can be determined uniquely based on the currentand previous inputs and outputs provided the network has been running for a sufficiently long time. Theformal definition of echo states is described in [68]. Assume that the maximum eigenvalue of Wrec is λmax,the activation function of the hidden units is sigmoid and | λmax |> 4, then the network does not have echostates. This is a sufficient condition for the echo states to not exist1 [68]. However, as emphasized in [68], inpractice, when | λmax |< 4 the network has echo states. There is a similar sufficient condition under whichthe exploding gradient problem for recurrent weights would not happen [98].In ESN training, only the output weight matrix, U, is trained. The input and recurrent weight matricesshould be carefully fixed. There are three main steps in training an ESN: constructing a network with theecho state property, computing the network states, and estimating the output weights.To construct a network with the echo state property, the input weight matrix W and the sparse recurrentweight matrix Wrec are randomly generated. Then, the maximum eigenvalue of Wrec is calculated and allentries of Wrec are renormalized as follows:Wrec = λWrecλmax, (2.3)where λ < 4 for sigmoid activation function. λ is also an important parameter which affects the memorylength of the network and should be determined based on the required memory size for the specific task. Asemphasized in [68], the entries of the input weight matrix are also of great importance; large entries causemost hidden units to saturate while small entries lead hidden units to stay in the linear region of the sigmoidfunction. The value of λ should be predetermined and fixed ( based on the desired task ) before the secondstep (that calculates the network states).To find the outputs of the hidden layer units, the hidden states are initialized to zero or another initialstate. Then the network runs freely for itrans time steps where the hidden states of each time step arecalculated using (2.1) with W f b = 0. After itrans time steps, the hidden state vectors are stacked in matrixH, i.e.H = [hitranshitrans+1 . . .hN ], (2.4)where N is the number of time steps.To calculate the output weights U, we stack the desired outputs or targets corresponding to input signalxi as a matrix T; i.e.T = [titranstitrans+1 . . . tN ]. (2.5)Since H is computed using all known quantities (incluidng the fixed input and recurrent matrices) using(2.1) and hence it is known also, U can be obtained by minimizing the following mean-square-error costfunction:E =‖ UT Hc−T ‖2F= tr[(UT Hc−T)(UT Hc−T)T ], (2.6)1Note that in [68] the condition is stated as | λmax |> 1 because the activation function considered is hyperbolic tangent.17where F stands for the Frobenius norm of a matrix, tr(.) is the trace of a matrix andHc = [H X]X = [xitransxitrans+1 . . .xN ].(2.7)Minimizing (2.6), we have the globally optimal estimate of U determined by setting the gradient of theabove cost function to zero and solving it; i.e.∂E∂U= 2Hc(UT Hc−T)T = 0U = (HcHTc )−1HcTT .(2.8)In practical implementation, to prevent inaccurate results when HHT is singular or close to singular, thefollowing solution of “ridge regression” is used for estimating UU = (HcHTc +µI)−1HcTT , (2.9)where I is the identity matrix and µ is a fixed positive number.2.3 Learning the Input Weight Matrix in ESNAssuming the memory of the network extends back to m time steps, we use the following notation to facili-tate the development of the learning method for the input weight matrix W:X1 = [x1 xm+1 x2m+1 . . . ], X2 = [x2 xm+2 x2m+2 . . . ], . . .H1 = [h1 hm+1 h2m+1 . . . ], H2 = [h2 hm+2 h2m+2 . . . ], . . .T1 = [t1 tm+1 t2m+1 . . . ], T2 = [t2 tm+2 t2m+2 . . . ], . . . .(2.10)Therefore, equations (2.1) and (2.2) can be written asHi+1 = σ(WT Xi+1+WrecHi) (2.11)Yi+1 = UT Hi+1. (2.12)To find the gradient of the cost function E with respect to W and learn input weights W we consider twocases in the remainder of this section. In Case 1, we assume that U does not depend on W and in Case 2 wetake into account the dependency between U and W. Note that in both cases we take into account the timedependency among the hidden state vectors in H, i.e., the dependency of hi+1 on hi at every time step i 2.Since Case 2 is a more realistic formulation of the gradient, it is used for learning the input weight matrixin our experimental results presented in Section 5. We derive the gradient for one time step dependency andthen generalize it to n time step dependency, i.e., Hi depends on Hi−1, Hi−2 and so on up to n time steps.2Note that this is one of the main differences with the work presented in [26, 130] where there is no temporal connection in thesingle layer network and hence no time dependency is considered.182.3.1 Case 1The gradient of the cost function with respect to W can be written as∂E∂W=∂∂Wtr[(UT H2−T2)(UT H2−T2)T ]=∂∂Wtr[(UTσ(WrecH1+WT X2)−T2)(UTσ(WrecH1+WT X2)−T2)T ]= [∂∂Wσ(WrecH1+WT X2)][2UT (UT H2−T2)T ].(2.13)Assuming that H1 depends on W, i.e., H1 = σ(WrecH0+WT X1), and denoting the term independent of WbyS = 2UT (UT H2−T2)T , (2.14)then using chain rule of calculus we have∂E∂W= [∂∂W[Wrecσ(WrecH0+WT X1)+WT X2]]HT2 ◦ (1−HT2 )◦S, (2.15)and therefore∂E∂W= X1[HT1 ◦ (1−HT1 )WTrec ◦HT2 ◦ (1−HT2 )◦S]+X2[HT2 ◦ (1−HT2 )◦S], (2.16)where ◦ is element-wise multiplication.2.3.2 Case 2The gradient calculated in Case 1 is not accurate because the dependency between U and W is ignored. Totake this dependency into consideration, the first line of equation (2.13) is rewritten as∂E∂W=∂∂Wtr(UT H2HT2 U︸ ︷︷ ︸a−UT H2TT2︸ ︷︷ ︸b−T2HT2 U+T2TT2 ). (2.17)Substituting U = (H2HT2 )−1H2TT2 we havea = T2HT2 (H2HT2 )−T H2HT2 (H2HT2 )−1︸ ︷︷ ︸IH2TT2= T2HT2 (H2HT2 )−T H2TT2 ,(2.18)andb = T2HT2 (H2HT2 )−T H2TT2 , (2.19)therefore∂E∂W=− ∂∂Wtr(T2HT2 (H2HT2 )−1H2TT2 ). (2.20)19Since tr(AB) = tr(BA) and tr(A) = tr(AT ) we have∂E∂W=− ∂∂Wtr((H2HT2 )−1H2TT2 T2HT2 ). (2.21)Using the chain rule presented in equation (126) of [100], the gradient can be written as∂E∂W=−[ ∂∂WH2︸ ︷︷ ︸B][∂∂HT2tr((H2HT2 )−1H2TT2 T2HT2 )︸ ︷︷ ︸A]. (2.22)Below we first calculate matrix A and then matrix B.For constant values of hidden states H2 we define F, M and S as follows:F = (H2HT2 )−1M = TT2 T2S = H2MHT2 .(2.23)Then, using the chain rule again, we obtainA =∂∂HT2tr(FH2MHT2 )︸ ︷︷ ︸A1+∂∂HT2tr((H2HT2 )−1S)︸ ︷︷ ︸A2, (2.24)considering the fact that tr(AB) = tr(BA), A1 can be written asA1 =∂∂HT2tr(MHT2 FH2). (2.25)From equation (107) of [100] we haveA1 = MT HT2 FT +MHT2 F. (2.26)Since FT = F and MT = M, A1 can be written asA1 = 2MHT2 F. (2.27)Considering equation (114) of [100], A2 will be as followsA2 =−(HT2 (H2HT2 )−1)(S+ST )(H2HT2 )−1, (2.28)substituting S and using M = MT results inA2 =−2HT2 (H2HT2 )−1(H2MHT2 )(H2HT2 )−1. (2.29)20Now substituting M and F in (2.23) results in the final formulation for A as follows:A = 2TT2 T2HT2 (H2HT2 )−1−2HT2 (H2HT2 )−1H2TT2 T2HT2 (H2HT2 )−1. (2.30)To calculate B, we assume that there is just one time step dependency, i.e., H2 depends on H1 and Wand H1 does not depend on H0 but depends on W. The generalization to an arbitrary number of time stepsis straightforward and is presented at the end of this section. From (2.22) and (2.11) we write B asB =∂∂W[σ(Wrecσ(WrecH0+WT X1)+WT X2)], (2.31)which is similar to the term calculated in (2.16) and thereforeB = X1[HT1 ◦ (1−HT1 )WTrec ◦HT2 ◦ (1−HT2 )]+X2[HT2 ◦ (1−HT2 )]. (2.32)By substituting (2.30) and (2.32) in (2.22) we get the gradient formulation for one time step dependency asfollows:∂E∂W=−[X1[HT1 ◦ (1−HT1 )WTrec ◦HT2 ◦ (1−HT2 )◦A]+X2[HT2 ◦ (1−HT2 )◦A]]. (2.33)This gradient formulation can be generalized for an arbitrary number of time steps as follows:∂E∂W=−[n∑i=1XiCi], (2.34)where n is the number of time steps andCi = [HTi ◦ (1−HTi )WTrec]◦Ci+1 , f or i = 1, . . . ,n−1Cn = HTn ◦ (1−HTn )◦A.(2.35)To calculate A using (2.30), Hn and Tn are used.After calculating the gradient of the cost function w.r.t W, the input weights W are updated using thefollowing update equationWi+1 = Wi−α ∂E∂Wi +β (Wi−Wi−1), (2.36)where α is the step size andβ =moldmnewmnew =1+√1+4m2old2,(2.37)and where the initial value for mold and mnew is 1. The third term in (2.36) helps the algorithm to convergefaster and is based on the FISTA algorithm proposed in [11] and used in [130].212.4 Learning the Recurrent Weight Matrix (Wrec) in the ESNTo learn the recurrent weights, the gradient of the cost function w.r.t Wrec should be calculated. We firstderive the formulation for the two time steps dependency, i.e., H2 depends on H1 and H1 depends on H0but no more time dependencies. Then it will be generalized to the arbitrary number of time steps. The samemethod that is used in section 2.3 can be used to get the following formulation for the gradient:∂E∂Wrec=−[ ∂∂WrecH2︸ ︷︷ ︸B][∂∂HT2tr((H2HT2 )−1H2TT2 T2HT2 )︸ ︷︷ ︸A], (2.38)where A will be the same as (2.30). To calculate B we haveB =∂∂Wrec[σ(Wrecσ(WrecH0+WT X1)+WT X2)]= [∂∂Wrec[Wrecσ(WrecH0+WT X1)︸ ︷︷ ︸H1+WT X2]]HT2 ◦ (1−HT2 )= [H1+Wrec H0[HT1 ◦ (1−HT1 )]︸ ︷︷ ︸∂H1∂Wrec]HT2 ◦ (1−HT2 ), (2.39)and therefore the gradient will be:∂E∂Wrec= H1[HT2 ◦ (1−HT2 )◦A]+WrecH0[HT1 ◦ (1−HT1 )◦ (HT2 ◦ (1−HT2 )◦A)]. (2.40)It can be generalized for arbitrary number of time steps as follows:∂E∂Wrec=n∑i=1Wn−irec Hi−1Ci, (2.41)where H0 includes the initial hidden states andCn = HTn ◦ (1−HTn )◦ACi = HTi ◦ (1−HTi )◦Ci+1,(2.42)and A is calculated using (2.30) based on Hn and Tn.Only the non-zero entries of the sparse matrix Wrec are updated using (2.36) and the gradient calculatedin (2.41). To make sure that the network has the echo state property after each epoch, the entries of Wrecare renormalized such that the maximum eigenvalue of Wrec is λ that is predetermined in (2.3). Thisrenormalization also prevents the gradient explosion problem for recurrent weights from happening.A summary of the learning method is as follows:• The echo state network with predetermined maximum eigenvalue of Wrec is constructed based on theexplanations presented in section 2.2.22Module 3 X1H1(1)W(1)U(1)Module 1 Module 2 Wa(2)X2H2(1)W(1)U(1)Wrec(1)Wrec(1) ...XmHm(1)W(1)Ym(1)U(1)Wrec(1)Y1(1)X1H1(2)Wb(2)U(2)Y1(2)Y2(1)X2H2(2)Wb(2)U(2)Y2(2)Wrec(2) Wrec(2) ... Wrec(2)XmHm(2)Wb(2)U(2)Ym(2)W(2) = [Wa(2) Wb(2)]Winit(2) = [Rn W(1)]Wrec,init(2)= Rrec(2)Wa(2) Wa(2)Wa(3)X1H1(3)U(3)Y1(3)H2(3)U(3)Y2(3)Wrec(3) Wrec(3)... Wrec(3)Hm(3)U(3)Ym(3)Y1(1) Y2(1)Ym(1)Wc(3)X2 XmWc(3)Wb(3)Wa(3)Wb(3)Wa(3)Wb(3)Wc(3)W(3) = [Wa(3) Wb(3) Wc(3)]Winit(3) = [Rn Rn Wb(2)]Wrec,init(3)= Rrec(3)Figure 2.2: Illustration of an R-DSN architecture with three modules shown with different colors• Input weights matrix W is updated based on (2.34), (2.35), (2.36) and (2.37).• Non-zero entries of the sparse recurrent weights matrix Wrec are updated based on (2.41), (2.42),(2.36) for Wrec and (2.37).• Updated Wrec is renormalized to have the predetermined maximum eigenvalue λ .• The forward pass is repeated with the updated input and recurrent weights to find the hidden states.The network runs freely for itrans time steps and then the hidden states are recorded as matrix H.• Taking into account the direct connections from the input to output in the network, the output weightmatrix U is calculated using (2.9).To prevent the value of the gradient w.r.t W from exploding, we use a similar approach proposed in [98]where the gradient value is renormalized when it is greater than a threshold.2.5 Learning the Recurrent Deep Stacking NetworkThe RNN described in the preceding section can be stacked into multiple modules. The architecture of aR-DSN with three modules is shown in Fig. 2.2. In the R-DSN, the output of each module is part of theinput of the upper module. Therefore, the dimensionality of the input of the upper modules is more thanthat of the lower modules. In this work, we have not used RBM or temporal RBM [114] to initialize the23weights of the lowest module. Instead, we have used random initialization. The only constraint is that Wrecbe initialized such that echo state property holds; i.e., the maximum eigenvalue of Wrec is kept less than 4.The training method we have implemented for the two modules of architecture in Fig. 2.2 is as follows:• Training the first (i.e the lowest) module:– Input and recurrent weights are initialized using an ESN architecture– Hidden units’ states H(1)i , i = 1, . . . ,m are calculated using (2.11)– The method described in previous sections is used to compute gradients of the cost function w.r.t. W(1) and W(1)rec– W(1) and W(1)rec are updated using (2.36) and (2.37)– Entries of Wrec are renormalized such that the network has the echo state property using (2.3)– U(1) is calculated using (2.8)• Training the second module:– Input weights corresponding to output of the first module W(2)a are initialized randomly (Ra in Fig. 2.2) and inputweights corresponding to the input training data are initialized to the fine tuned input weights from the previousmodule (W(1) in Fig. 2.2)– Recurrent weights W(2)rec are initialized with a random sparse matrix whose sparsity pattern is different from that ofthe previous module(R(2)rec in Fig. 2.2)– Entries of W(2)rec are normalized such that the echo state property holds– All weights are fine tuned using the method described for the first moduleTo resolve the exploding gradient problem for input weights we have used the renormalization methodproposed in [99].All modules higher than two in the R-DSN are trained in a similar way to the second step above. Thenumber of epochs used in training all modules are carefully tuned to provide regularization via the “earlystopping” mechanism.2.6 ExperimentsWe carried out the experiments for frame-level classification of phoneme states on the TIMIT dataset usingan ESN with all parameters learned as discussed so far. The training data includes 1,124,589 frames. Thevalidation set has 122,488 frames from 50 speakers. The results are reported using the core test set consistingof 192 sentences and 57,920 frames. The speech is analysed using the standard Mel Frequency CepstralCoefficients (MFCC). Each feature vector has 39 entries, including first and second derivatives. We haveused 3 states for each of 61 phonemes resulting in a target class vector with 183 entries. phoneme statelabels are extracted using a GMM-HMM system which aligns the frames with their corresponding states.We have used a context window of 3 frames for all experiments resulting in the input vectors with3×39 = 117 entries. The regularization parameter µ used in (2.9) is set to 10−8. The maximum eigenvalueof Wrec is set to 3.9. We have used a step size (α in (2.36)) of 0.07. The task is to classify each frame inthe TIMIT core test set into one of 183 phoneme states. The results are presented in Table 2.1 for differenthidden layer sizes in the ESN, one for each row in the table. The results are also arranged by four different24ways of learning the input and recurrent weigh matrices W and Wrec, where m is the number of time stepsin incorporating matrices’ dependencies in the learning. The column of “ESN” refers to the traditional ESNas in [66, 67, 69, 118] where W and Wrec are not learned.Table 2.1: Frame-level phoneme-state classification error rates for the TIMIT core test setHidden units ESN Learning W Learning W and Wrec Learning W and Wrecwith m = 1 with m = 1 with m = 3100 75.5% 66.7% 64.0 % 63.2%500 70.1% 59.8% 57.5% 56.8%2000 63.8% 54.2% 52.7% 52.1%10000 57.1% 49.5% 48.0% 46.8%30000 53.3% 45.9% 44.5% 43.0%The set of results for the various R-DSNs with the hidden layer sizes of 4000 and 10000 and with thestacking modules up to three are summarized in Table 2.2.The preliminary experimental results shown in Table 2.1 and 2.2 verify that learning input and recurrentweight matrices in the ESN is superior to the ESN with the same structure but without learning the twomatrices. Furthermore, the longer the time steps are incorporated in the learning, the lower error rates are.On the column of “ESN” in Table 2.1, we also observe that the traditional ESN improves its performanceas the number of hidden units increases, consistent with the findings reported in [118]. Also stacking morelayers improves the performance as observed in Table 2.2. Finally, we would like to remark that the resultsobtained so far are very preliminary, and the task is on frame-level classification of 183 phoneme states. Ourfirst step of research is focused on this easiest task since it is a pure and simple machine learning problemand it requires no expertise in speech recognition. The next steps are to move 1) from 183-state classificationto 39-phoneme classification; 2) from frame level to segment level (which requires dynamic programmingover three states of each phoneme); and 3) from classification (with no phoneme insertion and deletionerrors) to recognition (with phoneme insertion and deletion errors).2.7 ConclusionsThe first idea of this chapter is straightforward: the traditional ESN learns only one of three important setsof weight matrices, and we want to learn them all. The key property that characterizes the ESN is the use oflinear output units so that the learning is simple, convex and forms a least-square ridge regression problemwith a global optimum (in learning the output weights). In extending the learning of the output weightsto only learning input and recurrent weights, we make use of the same property of linear output units todevelop and formulate constraints among various sets of ESN weight matrices. Such constraints are thenused to derive analytic forms of the error gradients w.r.t the input and recurrent weights to be learned. Thestandard learning method of BPTT for the general RNN (with typically nonlinear output units) does notadmit analytical forms of gradient computation. BPTT requires recursively propagating the error signalbackward through time, a very different style of computation and learning than what we have developed in25Table 2.2: Frame level phoneme classification error rates for the R-DSN on the TIMIT core test set asa function of the number of modules for the fixed 4000 or 10000 neurons in the hidden layerHidden Modules Learning W and Wrec Learning W and WrecUnits in R-DSN with m = 1 with m = 34000 1 50.5% 50.0%4000 2 49.9% 49.5%4000 3 49.5% 49.1%10000 1 48.0% 46.8%10000 2 47.2% 46.0%10000 3 47.0% 45.7%this work for ESNs.The second idea of this chapter is also straightforward: a novel deep learning architecture, the R-DSN,which extends the earlier RNN and DSN models. The R-DSN constructs multiple modules of the RNNusing stacking, in the same way that the DSN uses stacking to form multiple modules of a simple, non-recurrent feed-forward neural network. Alternatively, the R-DSN can be viewed as a generalization of theDSN, where the generalization lies in embedding recurrent connections in each module that were missingin the earlier DSN. The main technical contribution of the work reported in this part is the development ofclosed-from formulas for the gradient computation based on the special structure of the R-DSN, and thebatch-mode training method for all parameters in the R-DSN capitalizing on these formulas.26Chapter 3A Sentence Modelling Method for WebSearch TaskIn theory, there is no difference between theory and practice. But, in practice, there is!— Jan. L. A. van de Snepscheut3.1 IntroductionLearning a good representation (or features) of input data is an important task in machine learning. In textand language processing, one such problem is learning of an embedding vector for a sentence; that is, to traina model that can automatically transform a sentence to a vector that encodes the semantic meaning of thesentence. While word embedding is learned using a loss function defined on word pairs, sentence embeddingis learned using a loss function defined on sentence pairs. In the sentence embedding usually the relationshipamong words in the sentence, i.e., the context information, is taken into consideration. Therefore, sentenceembedding is more suitable for tasks that require computing semantic similarities between text strings.By mapping texts into a unified semantic representation, the embedding vector can be further used fordifferent language processing applications, such as machine translation [116], sentiment analysis [77], andinformation retrieval [65]. In machine translation, the recurrent neural networks (RNN) with Long Short-Term Memory (LSTM) cells, or the LSTM-RNN, is used to encode an English sentence into a vector, whichcontains the semantic meaning of the input sentence, and then another LSTM-RNN is used to generate aFrench (or another target language) sentence from the vector. The model is trained to best predict the outputsentence. In [77], a paragraph vector is learned in an unsupervised manner as a distributed representationof sentences and documents, which are then used for sentiment analysis. Sentence embedding can also beapplied to information retrieval, where the contextual information are properly represented by the vectors inthe same space for fuzzy text matching [65].In this chapter, we propose to use an RNN to sequentially accept each word in a sentence and recurrentlymap it into a latent space together with the historical information. As the RNN reaches the last word in thesentence, the hidden activations form a natural embedding vector for the contextual information of thesentence. We further incorporate the LSTM cells into the RNN model (i.e. the LSTM-RNN) to address the27difficulty of learning long term memory in RNN. The learning of such a model is performed in a weaklysupervised manner on the click-through data logged by a commercial web search engine. Although manuallylabelled data are insufficient in machine learning, logged data with limited feedback signals are massivelyavailable due to the widely used commercial web search engines. Limited feedback information such asclick-through data provides a weak supervision signal that indicates the semantic similarity between thetext on the query side and the clicked text on the document side. To exploit such a signal, the objectiveof our training is to maximize the similarity between the two vectors mapped by the LSTM-RNN from thequery and the clicked document, respectively. Consequently, the learned embedding vectors of the queryand clicked document are specifically useful for web document retrieval task.An important contribution of this chapter is to analyse the embedding process of the LSTM-RNN byvisualizing the internal activation behaviours in response to different text inputs. We show that the embed-ding process of the learned LSTM-RNN effectively detects the keywords, while attenuating less importantwords, in the sentence automatically by switching on and off the gates within the LSTM-RNN cells. Wefurther show that different cells in the learned model indeed correspond to different topics, and the keywordsassociated with a similar topic activate the same cell unit in the model. As the LSTM-RNN reads to the endof the sentence, the topic activation accumulates and the hidden vector at the last word encodes the richcontextual information of the entire sentence. For this reason, a natural application of sentence embeddingis web search ranking, in which the embedding vector from the query can be used to match the embeddingvectors of the candidate documents according to the maximum cosine similarity rule. Evaluated on a realweb document ranking task, our proposed method significantly outperforms many of the existing state ofthe art methods in NDCG scores. Please note that when we refer to document in this chapter we mean thetitle (headline) of the document.3.2 Related WorkInspired by the word embedding method [87, 89], the authors in [77] proposed an unsupervised learningmethod to learn a paragraph vector as a distributed representation of sentences and documents, which arethen used for sentiment analysis with superior performance. However, the model is not designed to capturethe fine-grained sentence structure. In [75], an unsupervised sentence embedding method is proposed withgreat performance on large corpus of contiguous text corpus, e.g., the BookCorpus [135]. The main ideais to encode the sentence s(t) and then decode previous and next sentences, i.e., s(t − 1) and s(t + 1),using two separate decoders. The encoder and decoders are RNNs with Gated Recurrent Unit (GRU) [18].However, this sentence embedding method is not designed for document retrieval task having a supervisionamong queries and clicked and unclicked documents. In [111], a Semi-Supervised Recursive Autoencoder(RAE) is proposed and used for sentiment prediction. Similar to our proposed method, it does not needany language specific sentiment parsers. A greedy approximation method is proposed to construct a treestructure for the input sentence. It assigns a vector per word. It can become practically problematic for largevocabularies. It also works both on unlabeled data and supervised sentiment data.Similar to the recurrent models in this chapter, the DSSM [65] and CLSM [108] models, developed forinformation retrieval, can also be interpreted as sentence embedding methods. However, DSSM treats the28input sentence as a bag-of-words and does not model word dependencies explicitly. CLSM treats a sentenceas a bag of n-grams, where n is defined by a window, and can capture local word dependencies. Then aMax-pooling layer is used to form a global feature vector. Methods in [19] are also convolutional basednetworks for Natural Language Processing (NLP). These models, by design, cannot capture long distancedependencies, i.e., dependencies among words belonging to non-overlapping n-grams. In [73] a DynamicConvolutional Neural Network (DCNN) is proposed for sentence embedding. Similar to CLSM, DCNN doesnot rely on a parse tree and is easily applicable to any language. However, different from CLSM where aregular max-pooling is used, in DCNN a dynamic k-max-pooling is used. This means that instead of justkeeping the largest entries among word vectors in one vector, k largest entries are kept in k different vectors.DCNN has shown good performance in sentiment prediction and question type classification tasks. In [64],a convolutional neural network architecture is proposed for sentence matching. It has shown great perfor-mance in several matching tasks. In [132], a Bilingually-constrained Recursive Auto-encoders (BRAE) isproposed to create semantic vector representation for phrases. Through experiments it is shown that theproposed method has great performance in two end-to-end SMT tasks.Long short-term memory networks were developed in [62] to address the difficulty of capturing longterm memory in RNN. It has been successfully applied to speech recognition, which achieves state-of-artperformance [50, 107]. In text analysis, LSTM-RNN treats a sentence as a sequence of words with internalstructures, i.e., word dependencies. It encodes a semantic vector of a sentence incrementally which differsfrom DSSM and CLSM. The encoding process is performed left-to-right, word-by-word. At each timestep, a new word is encoded into the semantic vector, and the word dependencies embedded in the vectorare “updated”. When the process reaches the end of the sentence, the semantic vector has embedded all thewords and their dependencies, hence, can be viewed as a feature vector representation of the whole sentence.In the machine translation work [116], an input English sentence is converted into a vector representationusing LSTM-RNN, and then another LSTM-RNN is used to generate an output French sentence. The modelis trained to maximize the probability of predicting the correct output sentence. In [53], there are two maincomposition models, ADD model that is bag of words and BI model that is a summation over bi-gram pairsplus a non-linearity. In our proposed model, instead of simple summation, we have used LSTM model withletter tri-grams which keeps valuable information over long intervals (for long sentences) and throws awayuseless information. In [8], an encoder-decoder approach is proposed to jointly learn to align and translatesentences from English to French using RNNs. The concept of “attention” in the decoder, discussed in thispaper, is closely related to how our proposed model extracts keywords in the document side. For furtherexplanations please see section 3.5.1. In [74] a set of visualizations are presented for RNNs with and withoutLSTM cells and GRUs. Different from our work where the target task is sentence embedding for documentretrieval, the target tasks in [74] were character level sequence modelling for text characters and sourcecodes. Interesting observations about interpretability of some LSTM cells and statistics of gates activationsare presented. In section 3.5.1 we show that some of the results of our visualization are consistent with theobservations reported in [74]. We also present more detailed visualization specific to the document retrievaltask using click-through data. We also present visualizations about how our proposed model can be used forkeyword detection.29Different from the aforementioned studies, the method developed in this chapter trains the model sothat sentences that are paraphrase of each other are close in their semantic embedding vectors — see thedescription in Sec. 3.4 further ahead. Another reason that LSTM-RNN is particularly effective for sentenceembedding, is its robustness to noise. For example, in the web document ranking task, the noise comes fromtwo sources: (i) Not every word in query / document is equally important, and we only want to “remember”salient words using the limited “memory”. (ii) A word or phrase that is important to a document may notbe relevant to a given query, and we only want to “remember” related words that are useful to compute therelevance of the document for a given query. We will illustrate robustness of LSTM-RNN in this chapter. Thestructure of LSTM-RNN will also circumvent the serious limitation of using a fixed window size in CLSM.Our experiments show that this difference leads to significantly better results in web document retrievaltask. Furthermore, it has other advantages. It allows us to capture keywords and key topics effectively. Themodels in this chapter also do not need the extra max-pooling layer, as required by the CLSM, to captureglobal contextual information and they do so more effectively.3.3 Sentence Embedding Using RNNs with and without LSTM CellsIn this section, we introduce the model of recurrent neural networks and its long short-term memory versionfor learning the sentence embedding vectors. We start with the basic RNN and then proceed to LSTM-RNN.3.3.1 The Basic Version of RNNThe RNN is a type of deep neural networks that are “deep” in temporal dimension and it has been usedextensively in time sequence modelling [13, 17, 25, 28, 36, 49, 84, 85, 103]. The main idea of using RNNfor sentence embedding is to find a dense and low dimensional semantic representation by sequentially andrecurrently processing each word in a sentence and mapping it into a low dimensional vector. In this model,the global contextual features of the whole text will be in the semantic representation of the last word in thetext sequence — see Figure 3.1, where x(t) is the t-th word, coded as a 1-hot vector, Wh is a fixed hashingoperator similar to the one used in [65] that converts the word vector to a letter tri-gram vector, W is theinput weight matrix, Wrec is the recurrent weight matrix, y(t) is the hidden activation vector of the RNN,which can be used as a semantic representation of the t-th word, and y(t) associated to the last word x(m) isthe semantic representation vector of the entire sentence. Note that this is very different from the approachin [65] where the bag-of-words representation is used for the whole text and no context information is used.This is also different from [108] where the sliding window of a fixed size (akin to an FIR filter) is used tocapture local features and a max-pooling layer on the top to capture global features. In the RNN there isneither a fixed-sized window nor a max-pooling layer; rather the recurrence is used to capture the contextinformation in the sequence (akin to an IIR filter).The mathematical formulation of the above RNN model for sentence embedding can be expressed asl(t) = Whx(t)y(t) = f(Wl(t)+Wrecy(t−1)+b), (3.1)30… Embedding vector 𝒍(2) 𝒍(1) 𝒍(𝑚) Figure 3.1: The basic architecture of the RNN for sentence embedding, where temporal recurrence isused to model the contextual information across words in the text string. The hidden activationvector corresponding to the last word is the sentence embedding vector (blue).where W and Wrec are the input and recurrent matrices to be learned, Wh is a fixed word hashing operator,b is the bias vector and f (·) is assumed to be tanh(·). Note that the architecture proposed here for sentenceembedding is slightly different from traditional RNN in that there is a word hashing layer that convert thehigh dimensional input into a relatively lower dimensional letter tri-gram representation. There is also noper word supervision during training, instead, the whole sentence has a label. This is explained in moredetail in section 3.4.3.3.2 The RNN with LSTM CellsAlthough RNN performs the transformation from the sentence to a vector in a principled manner, it isgenerally difficult to learn the long term dependency within the sequence due to vanishing gradients problem.One of the effective solutions for this problem in RNNs is using memory cells instead of neurons originallyproposed in [62] as Long Short-Term Memory (LSTM) and completed in [45] and [46] by adding forgetgate and peephole connections to the architecture.We use the architecture of LSTM illustrated in Fig. 3.2 for the proposed sentence embedding method.31LSTM 𝒓(𝑡) 𝑾3 𝒓(𝑡) 𝑾4 𝒓(𝑡) 𝑾2 𝒓(𝑡) 𝑾1 𝑔(. ) 𝜎(. ) 𝜎(. ) Input Gate Output Gate 𝜎(. ) Forget Gate Cell × 𝒚𝑔(𝑡) 𝒊(𝑡) 𝒇(𝑡) 𝒄(𝑡 − 1) × ℎ(. ) × 𝒄(𝑡) 𝒐(𝑡) 𝒗(𝑡) 𝒄(𝑡 − 1) 𝑾𝑝2 𝑾𝑝3 𝑾𝑝1 𝒗(𝑡 − 1) 𝑾𝑟𝑒𝑐4 𝟏 𝒃4 𝒗(𝑡 − 1) 𝟏 𝑾𝑟𝑒𝑐3 𝒃3 𝟏 𝒃1 𝑾𝑟𝑒𝑐1 𝒗(𝑡 − 1) 𝟏 𝒃2 𝒗(𝑡 − 1) 𝑾𝑟𝑒𝑐2 Figure 3.2: The basic LSTM architecture used for sentence embeddingIn this figure, i(t), f(t) ,o(t) ,c(t) are input gate, forget gate, output gate and cell state vector respectively,Wp1, Wp2 and Wp3 are peephole connections, Wi, Wreci and bi, i= 1,2,3,4 are input connections, recurrentconnections and bias values, respectively, g(·) and h(·) are tanh(·) function and σ(·) is the sigmoid function.We use this architecture to find y for each word, then use the y(m) corresponding to the last word in thesentence as the semantic vector for the entire sentence.Considering Fig. 3.2, the forward pass for LSTM-RNN model is as follows:yg(t) = g(W4l(t)+Wrec4y(t−1)+b4)i(t) = σ(W3l(t)+Wrec3y(t−1)+Wp3c(t−1)+b3)f(t) = σ(W2l(t)+Wrec2y(t−1)+Wp2c(t−1)+b2)c(t) = f(t)◦ c(t−1)+ i(t)◦yg(t)o(t) = σ(W1l(t)+Wrec1y(t−1)+Wp1c(t)+b1)y(t) = o(t)◦h(c(t)), (3.2)where ◦ denotes Hadamard (element-wise) product. A diagram of the proposed model with more details ispresented in section 3.12.323.4 Learning MethodTo learn a good semantic representation of the input sentence, our objective is to make the embeddingvectors for sentences of similar meaning as close as possible, and meanwhile, to make sentences of differentmeanings as far apart as possible. This is challenging in practice since it is hard to collect a large amountof manually labelled data that give the semantic similarity signal between different sentences. Nevertheless,the widely used commercial web search engine is able to log massive amount of data with some limited userfeedback signals. For example, given a particular query, the click-through information about the user-clickeddocument among many candidates is usually recorded and can be used as a weak (binary) supervision signalto indicate the semantic similarity between two sentences (on the query side and the document side). In thissection, we explain how to leverage such a weak supervision signal to learn a sentence embedding vectorthat achieves the aforementioned training objective. Please also note that above objective to make sentenceswith similar meaning as close as possible is similar to machine translation tasks where two sentences belongto two different languages with similar meanings and we want to make their semantic representation as closeas possible.We now describe how to train the model to achieve the above objective using the click-through datalogged by a commercial search engine. For a complete description of the click-through data please referto section 2 in [41]. To begin with, we adopt the cosine similarity between the semantic vectors of twosentences as a measure for their similarityR(Q,D) =yQ(TQ)T yD(TD)‖yQ(TQ)‖ · ‖yD(TD)‖ , (3.3)where TQ and TD are the lengths of the sentence Q and sentence D, respectively. In the context of trainingover click-through data, we will use Q and D to denote “query” and “document”, respectively. In Figure3.3, we show the sentence embedding vectors corresponding to the query, yQ(TQ), and all the documents,{yD+(TD+),yD−1 (TD−1 ), . . . ,yD−n (TD−n )}, where the subscript D+ denotes the (clicked) positive sample amongthe documents, and the subscript D−j denotes the j-th (un-clicked) negative sample. All these embeddingvectors are generated by feeding the sentences into the RNN or LSTM-RNN model described in Sec. 3.3and take the y corresponding to the last word — see the blue box in Figure 3.1.We want to maximize the likelihood of the clicked document given query, which can be formulated asthe following optimization problem:L(Λ) = minΛ{− logN∏r=1P(D+r |Qr)}= minΛN∑r=1lr(Λ), (3.4)where Λ denotes the collection of the model parameters; in regular RNN case, it includes Wrec and W inFigure 3.1, and in LSTM-RNN case, it includes W1, W2, W3, W4, Wrec1, Wrec2, Wrec3, Wrec4, Wp1, Wp2,Wp3, b1, b2, b3 and b4 in Figure 3.2. D+r is the clicked document for r-th query, P(D+r |Qr) is the probability33yQ(TQ)yD+(TD+)yD1(TD1)yDn (TDn )Click:'1'…'0'0'Nega/ve'samples'Posi/ve'sample'Figure 3.3: The click-through signal can be used as a (binary) indication of the semantic similaritybetween the sentence on the query side and the sentence on the document side. The negativesamples are randomly sampled from the training data.of clicked document given the r-th query, N is number of query / clicked-document pairs in the corpus andlr(Λ) =− log(eγR(Qr,D+r )eγR(Qr,D+r )+∑ni= j eγR(Qr,D−r, j))= log(1+n∑j=1e−γ·∆r, j), (3.5)where ∆r, j = R(Qr,D+r )−R(Qr,D−r, j), R(·, ·) was defined earlier in (3.3), D−r, j is the j-th negative candidatedocument for r-th query and n denotes the number of negative samples used during training.The expression in (3.5) is a logistic loss over ∆r, j. It upper-bounds the pairwise accuracy, i.e., the 0 - 1loss. Since the similarity measure is the cosine function, ∆r, j ∈ [−2,2]. To have a larger range for ∆r, j, weuse γ for scaling. It helps to penalize the prediction error more. Its value is set empirically by experimentson a held out dataset.To train the RNN and LSTM-RNN, we use Back Propagation Through Time (BPTT). The update equa-tions for parameter Λ at epoch k are as follows:4Λk = Λk−Λk−14Λk = µk−14Λk−1− εk−1∇L(Λk−1+µk−14Λk−1), (3.6)where ∇L(·) is the gradient of the cost function in (3.4), ε is the learning rate and µk is a momentum pa-rameter determined by the scheduling scheme used for training. Above equations are equivalent to Nesterovmethod in [93]. To see why, please refer to appendix A.1 of [115] where Nesterov method is derived as a34momentum method. The gradient of the cost function, ∇L(Λ), is∇L(Λ) =−N∑r=1n∑j=1T∑τ=0αr, j∂∆r, j,τ∂Λ︸ ︷︷ ︸one large update, (3.7)where T is the number of time steps that we unfold the network over time andαr, j =−γe−γ∆r, j1+∑nj=1 e−γ∆r, j. (3.8)∂∆r, j,τ∂Λ in (3.7) and error signals for different parameters of RNN and LSTM-RNN that are necessary fortraining are presented in section 3.6. Full derivation of gradients in both models is presented in section 3.9.To accelerate training by parallelization, we use mini-batch training and one large update instead ofincremental updates during back propagation through time. To resolve the gradient explosion problem weuse gradient re-normalization method described in [85, 98]. To accelerate the convergence, we use Nesterovmethod [93] and found it effective in training both RNN and LSTM-RNN for sentence embedding.We have used a simple yet effective scheduling for µk for both RNN and LSTM-RNN models, in the firstand last 2% of all parameter updates µk = 0.9 and for the other 96% of all parameter updates µk = 0.995.We have used a fixed step size for training RNN and a fixed step size for training LSTM-RNN.A summary of training method for LSTM-RNN is presented in Algorithm 1.3.5 Analysis of the Sentence Embedding Process and PerformanceEvaluationTo understand how the LSTM-RNN performs sentence embedding, we use visualization tools to analyzethe semantic vectors generated by our model. We would like to answer the following questions: (i) How areword dependencies and context information captured? (ii) How does LSTM-RNN attenuate unimportantinformation and detect critical information from the input sentence? Or, how are the keywords embeddedinto the semantic vector? (iii) How are the global topics identified by LSTM-RNN?To answer these questions, we train the RNN with and without LSTM cells on the click-through datasetwhich are logged by a commercial web search engine. The training method has been described in Sec. 3.4.Description of the corpus is as follows. The training set includes 200,000 positive query / document pairswhere only the clicked signal is used as a weak supervision for training LSTM. The relevance judgement set(test set) is constructed as follows. First, the queries are sampled from a year of search engine logs. Adult,spam, and bot queries are all removed. Queries are de-duped so that only unique queries remain. To reflexa natural query distribution, we do not try to control the quality of these queries. For example, in our querysets, there are around 20% misspelled queries, and around 20% navigational queries and 10% transactionalqueries, etc. Second, for each query, we collect Web documents to be judged by issuing the query to severalpopular search engines (e.g., Google, Bing) and fetching top-10 retrieval results from each. Finally, thequery-document pairs are judged by a group of well-trained assessors. In this study all the queries are35Algorithm 1 Training LSTM-RNN for Sentence EmbeddingInputs: Fixed step size “ε”, Scheduling for “µ”, Gradient clip threshold “thG”, Maximum number of Epochs “nE poch”, Total number of query/ clicked-document pairs “N”, Total number of un-clicked (negative) documents for a given query “n”, Maximum sequence length for truncatedBPTT “T ”.Outputs: Two trained models, one in query side “ΛQ”, one in document side “ΛD”.Initialization: Set all parameters in ΛQ and ΛD to small random numbers, i = 0, k = 1.procedure LSTM-RNN(ΛQ,ΛD)while i≤ nE poch dofor “first minibatch”→ “last minibatch” dor← 1while r ≤ N dofor j = 1→ n doCompute αr, j . use (3.8)Compute ∑Tτ=0αr, j∂∆r, j,τ∂Λk,Q. use (3.14) to (4.54) in section 3.6Compute ∑Tτ=0αr, j∂∆r, j,τ∂Λk,D. use (3.14) to (4.54) in section 3.6sum above terms for Q and D over jend forsum above terms for Q and D over rr← r+1end whileCompute ∇L(Λk,Q) . use (3.7)Compute ∇L(Λk,D) . use (3.7)if ‖∇L(Λk,Q)‖> thG then∇L(Λk,Q)← thG · ∇L(Λk,Q)‖∇L(Λk,Q)‖end ifif ‖∇L(Λk,D)‖> thG then∇L(Λk,D)← thG · ∇L(Λk,D)‖∇L(Λk,D)‖end ifCompute4Λk,Q . use (4.13)Compute4Λk,D . use (4.13)Update: Λk,Q←4Λk,Q +Λk−1,QUpdate: Λk,D←4Λk,D +Λk−1,Dk← k+1end fori← i+1end whileend procedurepreprocessed as follows. The text is white-space tokenized and lower-cased, numbers are retained, and nostemming/inflection treatment is performed. Unless stated otherwise, in the experiments we used 4 negativesamples, i.e., n = 4 in Fig. 3.3.We now proceed to perform a comprehensive analysis by visualizing the trained RNN and LSTM-RNNmodels. In particular, we will visualize the on-and-off behaviors of the input gates, output gates, cell states,and the semantic vectors in LSTM-RNN model, which reveals how the model extracts useful informationfrom the input sentence and embeds it properly into the semantic vector according to the topic information.Although giving the full learning formula for all the model parameters in the previous section, we willremove the peephole connections and the forget gate from the LSTM-RNN model in the current task. Thisis because the length of each sequence, i.e., the number of words in a query or a document, is known inadvance, and we set the state of each cell to zero in the beginning of a new sequence. Therefore, forgetgates are not a great help here. Also, as long as the order of words is kept, the precise timing in thesequence is not of great concern. Therefore, peephole connections are not that important as well. Removingpeephole connections and forget gate will also reduce the amount of training time, since a smaller number36of parameters need to be learned.3.5.1 AnalysisIn this section we would like to examine how the information in the input sentence is sequentially extractedand embedded into the semantic vector over time by the LSTM-RNN model.Attenuating Unimportant InformationFirst, we examine the evolution of the semantic vector and how unimportant words are attenuated. Specifi-cally, we feed the following input sentences from the test dataset into the trained LSTM-RNN model:• Query: “hotels in shanghai”• Document: “shanghai hotels accommodation hotel in shanghai discount and reservation”Activations of input gate, output gate, cell state and the embedding vector for each cell for query anddocument are shown in Fig. 3.4 and Fig. 3.5, respectively. The vertical axis is the cell index from 1 to 32,and the horizontal axis is the word index from 1 to 10 numbered from left to right in a sequence of wordsand color codes show activation values. From Figs.3.4–3.5, we make the following observations:• Semantic representation y(t) and cell states c(t) are evolving over time. Valuable context informationis gradually absorbed into c(t) and y(t), so that the information in these two vectors becomes richerover time, and the semantic information of the entire input sentence is embedded into vector y(t),which is obtained by applying output gates to the cell states c(t).• The input gates evolve in such a way that it attenuates the unimportant information and detects theimportant information from the input sentence. For example, in Fig. 3.5(a), most of the input gatevalues corresponding to word 3, word 7 and word 9 have very small values (light green-yellow color)1,which corresponds to the words “accommodation”, “discount” and “reservation”, respectively, inthe document sentence. Interestingly, input gates reduce the effect of these three words in the finalsemantic representation, y(t), such that the semantic similarity between sentences from query anddocument sides are not affected by these words.Keywords ExtractionIn this section, we show how the trained LSTM-RNN extracts the important information, i.e., keywords,from the input sentences. To this end, we backtrack semantic representations, y(t), over time. We focus onthe 10 most active cells in final semantic representation. Whenever there is a large enough change in cellactivation value (y(t)), we assume an important keyword has been detected by the model. We illustrate theresult using the above example (“hotels in shanghai”). The evolution of the 10 most active cells activation,y(t), over time are shown in Fig. 3.6 for the query and the document sentences.2From Fig. 3.6, we also1If this is not clearly visible, please refer to Fig. 3.8. We have adjusted color bar for all figures to have the same range, for thisreason the structure might not be clearly visible. More visualization examples could also be found in section 3.102Likewise, the vertical axis is the cell index and horizontal axis is the word index in the sentence.37 2 4 6 8 1051015202530−0.6−0.4−0.200.20.4(a) i(t) 2 4 6 8 1051015202530−0.6−0.4−0.200.20.4(b) c(t) 2 4 6 8 1051015202530−0.6−0.4−0.200.20.4(c) o(t) 2 4 6 8 1051015202530−0.6−0.4−0.200.20.4(d) y(t)Figure 3.4: Query: “hotels in shanghai”. Since the sentence ends at the third word, all the values tothe right of it are zero (green color).observe that different words activate different cells. In Tables 3.1–3.2, we show the number of cells eachword activates.3 We used Bidirectional LSTM-RNN to get the results of these tables where in the first row,LSTM-RNN reads sentences from left to right and in the second row it reads sentences from right to left.In these tables we labelled a word as a keyword if more than 40% of top 10 active cells in both directionsdeclare it as keyword. The boldface numbers in the table show that the number of cells assigned to that wordis more than 4, i.e., 40% of top 10 active cells. From the tables, we observe that the keywords activate morecells than the unimportant words, meaning that they are selectively embedded into the semantic vector.Topic AllocationNow, we further show that the trained LSTM-RNN model not only detects the keywords, but also allocatesthem properly to different cells according to the topics they belong to. To do this, we go through the test3Note that before presenting the first word of the sequence, activation values are initially zero so that there is always a consid-erable change in the cell states after presenting the first word. For this reason, we have not indicated the number of cells detectingthe first word as a keyword. Moreover, another keyword extraction example can be found in section 3.10.2.38 2 4 6 8 1051015202530−0.6−0.4−0.200.20.4(a) i(t) 2 4 6 8 1051015202530−0.6−0.4−0.200.20.4(b) c(t) 2 4 6 8 1051015202530−0.6−0.4−0.200.20.4(c) o(t) 2 4 6 8 1051015202530−0.6−0.4−0.200.20.4(d) y(t)Figure 3.5: Document: “shanghai hotels accommodation hotelin shanghai discount and reservation”. Since the sentence ends at the ninth word, all the valuesto the right of it are zero (green color).Table 3.1: Key words for query: “hotels in shanghai”Query hotels in shanghaiNumber of assignedcells out of 10Left to Right - 0 7Number of assignedcells out of 10Right to Left 6 0 -dataset using the trained LSTM-RNN model and search for the keywords that are detected by a specific cell.For simplicity, we use the following simple approach: for each given query we look into the keywords thatare extracted by the 5 most active cells of LSTM-RNN and list them in Table 3.3. Interestingly, each cellcollects keywords of a specific topic. For example, cell 26 in Table 3.3 extracts keywords related to the topic“food” and cells 2 and 6 mainly focus on the keywords related to the topic “health”.39 1 2 3246810 00.020.040.060.08(a) y(t) top 10 for query 2 4 6 8246810 00.050.1(b) y(t) top 10 for documentFigure 3.6: Activation values, y(t), of 10 most active cells for Query: “hotels in shanghai” and Docu-ment: “shanghai hotels accommodation hotel in shanghai discount and reservation”Table 3.2: Key words for document: “shanghai hotels accommodation hotel in shanghai discount andreservation”shanghai hotels accommodation hotel in shanghai discount and reservationNumber of assignedcells out of 10Left to Right - 4 3 8 1 8 5 3 4Number of assignedcells out of 10Right to Left 4 6 5 4 5 1 7 5 -3.5.2 Performance EvaluationWeb Document Retrieval TaskIn this section, we apply the proposed sentence embedding method to an important web document retrievaltask for a commercial web search engine. Specifically, the RNN models (with and without LSTM cells)embed the sentences from the query and the document sides into their corresponding semantic vectors, andthen compute the cosine similarity between these vectors to measure the semantic similarity between thequery and candidate documents.Experimental results for this task are shown in Table 3.4 using the standard metric mean NormalizedDiscounted Cumulative Gain (NDCG) [70] (the higher the better) for evaluating the ranking performance ofthe RNN and LSTM-RNN on a standalone human-rated test dataset. We also trained several strong base-lines, such as DSSM [65] and CLSM [108], on the same training dataset and evaluated their performanceon the same task. For fair comparison, our proposed RNN and LSTM-RNN models are trained with thesame number of parameters as the DSSM and CLSM models (14.4M parameters). Besides, we also includein Table 3.4 two well-known information retrieval (IR) models, BM25 and PLSA, for the sake of bench-marking. The BM25 model uses the bag-of-words representation for queries and documents, which is astate-of-the-art document ranking model based on term matching, widely used as a baseline in IR society.40Table 3.3: Keywords assigned to each cell of LSTM-RNN for different queries of two topics, “food”represented by green color and “health” represented by red colorQuery cell 1 cell 2 cell 3 cell 4 cell 5 cell 6 cell 7 cell 8 cell 9 cell 10 cell 11 cell 12 cell 13 cell 14 cell 15 cell 16al yo yo sauce yo sauce sauceatkins diet lasagna dietblender recipescake bakery edinburgh bakerycanning corn beef hash beef, hashtorre de pizzafamous desserts dessertsfried chicken chicken chickensmoked turkey recipesitalian sausage hoagies sausagedo you get allergy allergymuch pain will after total knee replacement pain pain, kneehow to make whiter teeth make, teeth toillini community hospital community, hospital hospital communityimplant infection infection infectionintroductory psychology psychology psychologynarcotics during pregnancy side effects pregnancy pregnancy,effects, during duringfight sinus infections infectionshealth insurance high blood pressure insurance blood high, bloodall antidepressant medications antidepressant, medicationsQuery cell 17 cell 18 cell 19 cell 20 cell 21 cell 22 cell 23 cell 24 cell 25 cell 26 cell 27 cell 28 cell 29 cell 30 cell 31 cell 32al yo yo sauceatkins diet lasagna diet dietblender recipes recipescake bakery edinburgh bakery bakerycanning corn beef hash corn, beeftorre de pizza pizza pizzafamous dessertsfried chicken chickensmoked turkey recipes turkey recipesitalian sausage hoagies hoagies sausage sausagedo you get allergymuch pain will after total knee replacement knee replacementhow to make whiter teeth whiterillini community hospital hospital hospitalimplant infection infectionintroductory psychology psychologynarcotics during pregnancy side effectsfight sinus infections sinus, infections infectionshealth insurance high blood pressure high, pressure insurance,highall antidepressant medications antidepressant medicationsPLSA (Probabilistic Latent Semantic Analysis) is a topic model proposed in [63], which is trained using theMaximum A Posterior estimation [42] on the documents side from the same training dataset. We experi-mented with a varying number of topics from 100 to 500 for PLSA, which gives similar performance, andwe report in Table 3.4 the results of using 500 topics. Results for a language model based method, uni-gramlanguage model (ULM) with Dirichlet smoothing, are also presented in the table.To compare the performance of the proposed method with general sentence embedding methods indocument retrieval task, we also performed experiments using two general sentence embedding methods.1. In the first experiment, we used the method proposed in [77] that generates embedding vectors knownas Paragraph Vectors. It is also known as doc2vec. It maps each word to a vector and then uses thevectors representing all words inside a context window to predict the vector representation of the nextword. The main idea in this method is to use an additional paragraph token from previous sentencesin the document inside the context window. This paragraph token is mapped to vector space usinga different matrix from the one used to map the words. A primary version of this method is knownas word2vec proposed in [88]. The only difference is that word2vec does not include the paragraphtoken.To use doc2vec on our dataset, we first trained doc2vec model on both train set (about 200,000 query-document pairs) and test set (about 900,000 query-document pairs). This gives us an embeddingvector for every query and document in the dataset. We used the following parameters for training:• min-count=1 : minimum number of of words per sentence, sentences with words less than this41will be ignored. We set it to 1 to make sure we do not throw away anything.• window=5 : fixed window size explained in [77]. We used different window sizes, it resulted inabout just 0.4% difference in final NDCG values.• size=100 : feature vector dimension. We used 400 as well but did not get significantly differentNDCG values.• sample=1e-4 : this is the down sampling ratio for the words that are repeated a lot in corpus.• negative=5 : the number of noise words, i.e., words used for negative sampling as explained in[77].• We used 30 epochs of training. We ran an experiment with 100 epochs but did not observe muchdifference in the results.• We used gensim [101] to perform experiments.To make sure that a meaningful model is trained, we used the trained doc2vec model to find themost similar words to two sample words in our dataset, e.g., the words “pizza” and “infection”. Theresulting words and corresponding scores are presented in section 3.11. As it is observed from theresulting words, the trained model is a meaningful model and can recognise semantic similarity.Doc2vec also assigns an embedding vector for each query and document in our test set. We used theseembedding vectors to calculate the cosine similarity score between each query-document pair in thetest set. We used these scores to calculate NDCG values reported in Table 3.4 for the Doc2Vec model.Comparing the results of doc2vec model with our proposed method for document retrieval task showsthat the proposed method in this chapter significantly outperforms doc2vec. One reason for this is thatwe have used a very general sentence embedding method, doc2vec, for document retrieval task. Thisexperiment shows that it is not a good idea to use a general sentence embedding method and using abetter task oriented cost function, like the one proposed in this chapter, is necessary.2. In the second experiment, we used the Skip-Thought vectors proposed in [75]. During training, skip-thought method gets a tuple (s(t − 1),s(t),s(t + 1)) where it encodes the sentence s(t) using oneencoder, and tries to reconstruct the previous and next sentences, i.e., s(t − 1) and s(t + 1), usingtwo separate decoders. The model uses RNNs with Gated Recurrent Unit (GRU) which is shown toperform as good as LSTM. In the paper, authors have emphasized that: ”Our model depends on havinga training corpus of contiguous text”. Therefore, training it on our training set where we barely havemore than one sentence in query or document title is not fair. However, since their model is trainedon 11,038 books from BookCorpus dataset [135] which includes about 74 million sentences, we canuse the trained model as an off-the-shelf sentence embedding method as authors have concluded inthe conclusion of the paper.To do this we downloaded their trained models and word embeddings (its size was more than 2GB)available from “https://github.com/ryankiros/skip-thoughts”. Then we encoded each query and itscorresponding document title in our test set as vector.42We used the combine-skip sentence embedding method, a vector of size 4800×1, where it is concate-nation of a uni-skip, i.e., a unidirectional encoder resulting in a 2400×1 vector, and a bi-skip, i.e., abidirectional encoder resulting in a 1200×1 vector by forward encoder and another 1200×1 vectorby backward encoder. The authors have reported their best results with the combine-skip encoder.Using the 4800× 1 embedding vectors for each query and document we calculated the scores andNDCG for the whole test set which are reported in Table 3.4.The proposed method in this chapter is performing significantly better than the off-the-shelf skip-thought method for document retrieval task. Nevertheless, since we used skip-thought as an off-the-shelf sentence embedding method, its result is good. This result also confirms that learning embeddingvectors using a model and cost function specifically designed for document retrieval task is necessary.As shown in Table 3.4, the LSTM-RNN significantly outperforms all these models, and exceeds thebest baseline model (CLSM) by 1.3% in NDCG@1 score, which is a statistically significant improvement.As we pointed out in Sec. 3.5.1, such an improvement comes from the LSTM-RNN’s ability to embed thecontextual and semantic information of the sentences into a finite dimension vector. In Table 3.4, we havealso presented the results when different number of negative samples, n, is used. Generally, by increasingn we expect the performance to improve. This is because more negative samples results in a more accurateapproximation of the partition function in (3.5). The results of using Bidirectional LSTM-RNN are alsopresented in Table 3.4. In this model, one LSTM-RNN reads queries and documents from left to right, andthe other LSTM-RNN reads queries and documents from right to left. Then the embedding vectors from leftto right and right to left LSTM-RNNs are concatenated to compute the cosine similarity score and NDCGvalues.A comparison between the value of the cost function during training for LSTM-RNN and RNN on theclick-through data is shown in Fig. 3.7. From this figure, we conclude that LSTM-RNN is optimizingthe cost function in (3.4) more effectively. Please note that all parameters of both models are initializedrandomly.3.6 Expressions for the GradientsIn this section we present the final gradient expressions that are necessary to use for training the proposedmodels. Full derivations of these gradients are presented in section 3.9.3.6.1 RNNFor the recurrent parameters, Λ= Wrec (we have ommitted r subscript for simplicity)∂∆ j,τ∂Wrec= [δD+yQ (t− τ)yTQ(t− τ−1)+δD+yD (t− τ)yTD+(t− τ−1)]− [δD−jyQ (t− τ)yTQ(t− τ−1)+δD−jyD (t− τ)yTD−j (t− τ−1)], (3.9)43Table 3.4: Comparisons of NDCG performance measures (the higher the better) of proposed modelsand a series of baseline models, where nhid refers to the number of hidden units, ncell refers tonumber of cells, win refers to window size, and n is the number of negative samples which is set to4 unless otherwise stated. Unless stated otherwise, the RNN and LSTM-RNN models are chosento have the same number of model parameters as the DSSM and CLSM models: 14.4M, where1M = 106. The boldface numbers are the best results.Model NDCG NDCG NDCG@1 @3 @10Skip-Thought 26.9% 29.7% 36.2%off-the-shelfDoc2Vec 29.1% 31.8% 38.4%ULM 30.4% 32.7% 38.5%BM25 30.5% 32.8% 38.8%PLSA (T=500) 30.8% 33.7% 40.2%DSSM (nhid = 288/96) 31.0% 34.4% 41.7%2 LayersCLSM (nhid = 288/96, win=1) 31.8% 35.1% 42.6%2 Layers, 14.4 M parametersCLSM (nhid = 288/96, win=3) 32.1% 35.2% 42.7%2 Layers, 43.2 M parametersCLSM (nhid = 288/96, win=5) 32.0% 35.2% 42.6%2 Layers, 72 M parametersRNN (nhid = 288) 31.7% 35.0% 42.3%1 LayerLSTM-RNN (ncell = 32) 31.9% 35.5% 42.7%1 Layer, 4.8 M parametersLSTM-RNN (ncell = 64) 32.9% 36.3% 43.4%1 Layer, 9.6 M parametersLSTM-RNN (ncell = 96) 32.6% 36.0% 43.4%1 Layer, n = 2LSTM-RNN (ncell = 96) 33.1% 36.5% 43.6%1 Layer, n = 4LSTM-RNN (ncell = 96) 33.1% 36.6% 43.6%1 Layer, n = 6LSTM-RNN (ncell = 96) 33.1% 36.4% 43.7%1 Layer, n = 8Bidirectional LSTM-RNN 33.2% 36.6% 43.6%(ncell = 96), 1 Layerwhere D−j means j-th candidate document that is not clicked andδyQ(t− τ−1) = (1−yQ(t− τ−1))◦(1+yQ(t− τ−1))◦WTrecδyQ(t− τ), (3.10)440 5 10 15 20 25 30 35 40 45 50103104L(Λ),logscaleEpoch LSTM−RNNRNNFigure 3.7: LSTM-RNN compared to RNN during training: The vertical axis is logarithmic scale ofthe training cost, L(Λ), in (3.4). Horizontal axis is the number of epochs during training.and the same as (3.10) for δyD(t− τ−1) with D subscript for document side model. Please also note thatδyQ(TQ) = (1−yQ(TQ))◦ (1+yQ(TQ))◦(b.c.yD(TD)−a.b3.c.yQ(TQ)),δyD(TD) = (1−yD(TD))◦ (1+yD(TD))◦(b.c.yQ(TQ)−a.b.c3.yD(TD)), (3.11)wherea = yQ(t = TQ)T yD(t = TD)b =1‖yQ(t = TQ)‖ , c =1‖yD(t = TD)‖ . (3.12)For the input parameters, Λ= W∂∆ j,τ∂W= [δD+yQ (t− τ)lTQ(t− τ)+δD+yD (t− τ)lTD+(t− τ)]−[δD−jyQ (t− τ)lTQ(t− τ)+δD−jyD (t− τ)lTD−j (t− τ)]. (3.13)A full derivation of BPTT for RNN is presented in section 3.9.453.6.2 LSTM-RNNStarting with the cost function in (3.4), we use the Nesterov method described in (4.13) to update LSTM-RNN model parameters. Here,Λ is one of the weight matrices or bias vectors {W1,W2,W3,W4,Wrec1,Wrec2,Wrec3,Wrec4,Wp1,Wp2,Wp3,b1,b2,b3,b4} in the LSTM-RNN architecture. The general format of the gradient of thecost function, ∇L(Λ), is the same as (3.7). By definition of ∆r, j, we have∂∆r, j∂Λ=∂R(Qr,D+r )∂Λ− ∂R(Qr,Dr, j)∂Λ. (3.14)We omit r and j subscripts for simplicity and present ∂R(Q,D)∂Λ for different parameters of each cell of LSTM-RNN in the following subsections. This will complete the process of calculating ∇L(Λ) in (3.7) and thenwe can use (4.13) to update LSTM-RNN model parameters. In the subsequent subsections vectors vQ andvD are defined asvQ = (b.c.yD(t = TD)−a.b3.c.yQ(t = TQ))vD = (b.c.yQ(t = TQ)−a.b.c3.yD(t = TD)), (3.15)where a, b and c are defined in (3.12). Full derivation of truncated BPTT for LSTM-RNN model is presentedin section 3.9.Output GateFor recurrent connections we have∂R(Q,D)∂Wrec1= δ rec1yQ (t).yQ(t−1)T +δ rec1yD (t).yD(t−1)T , (3.16)whereδ rec1yQ (t) = oQ(t)◦ (1−oQ(t))◦h(cQ(t))◦vQ(t), (3.17)and the same as (4.27) for δ rec1yD (t) with subscript D for document side model. For input connections, W1,and peephole connections, Wp1, we will have∂R(Q,D)∂W1= δ rec1yQ (t).lQ(t)T +δ rec1yD (t).lD(t)T . (3.18)∂R(Q,D)∂Wp1= δ rec1yQ (t).cQ(t)T +δ rec1yD (t).cD(t)T . (3.19)The derivative for output gate bias values will be∂R(Q,D)∂b1= δ rec1yQ (t)+δrec1yD (t). (3.20)46Input GateFor the recurrent connections we have∂R(Q,D)∂Wrec3=diag(δ rec3yQ (t)).∂cQ(t)∂Wrec3+diag(δ rec3yD (t)).∂cD(t)∂Wrec3, (3.21)whereδ rec3yQ (t) = (1−h(cQ(t)))◦ (1+h(cQ(t)))◦oQ(t)◦vQ(t)∂cQ(t)∂Wrec3= diag(fQ(t)).∂cQ(t−1)∂Wrec3+bi,Q(t).yQ(t−1)Tbi,Q(t) = yg,Q(t)◦ iQ(t)◦ (1− iQ(t)). (3.22)In equation (4.31), δ rec3yD (t) and∂cD(t)∂Wrec3 are the same as (4.32) with D subscript. For the input connections wewill have the following:∂R(Q,D)∂W3=diag(δ rec3yQ (t)).∂cQ(t)∂W3+diag(δ rec3yD (t)).∂cD(t)∂W3, (3.23)where∂cQ(t)∂W3= diag(fQ(t)).∂cQ(t−1)∂W3+bi,Q(t).xQ(t)T . (3.24)For the peephole connections we will have∂R(Q,D)∂Wp3=diag(δ rec3yQ (t)).∂cQ(t)∂Wp3+diag(δ rec3yD (t)).∂cD(t)∂Wp3, (3.25)where∂cQ(t)∂Wp3= diag(fQ(t)).∂cQ(t−1)∂Wp3+bi,Q(t).cQ(t−1)T . (3.26)For bias values, b3, we will have∂R(Q,D)∂b3=diag(δ rec3yQ (t)).∂cQ(t)∂b3+diag(δ rec3yD (t)).∂cD(t)∂b3, (3.27)where∂cQ(t)∂b3= diag(fQ(t)).∂cQ(t−1)∂b3+bi,Q(t). (3.28)47Forget GateFor the recurrent connections we will have∂R(Q,D)∂Wrec2=diag(δ rec2yQ (t)).∂cQ(t)∂Wrec2+diag(δ rec2yD (t)).∂cD(t)∂Wrec2, (3.29)whereδ rec2yQ (t) = (1−h(cQ(t)))◦ (1+h(cQ(t)))◦oQ(t)◦vQ(t)∂cQ(t)∂Wrec2= diag(fQ(t)).∂cQ(t−1)∂Wrec2+b f ,Q(t).yQ(t−1)Tb f ,Q(t) = cQ(t−1)◦ fQ(t)◦ (1− fQ(t)). (3.30)For input connections to forget gate we will have∂R(Q,D)∂W2=diag(δ rec2yQ (t)).∂cQ(t)∂W2+diag(δ rec2yD (t)).∂cD(t)∂W2, (3.31)where∂cQ(t)∂W2= diag(fQ(t)).∂cQ(t−1)∂W2+b f ,Q(t).xQ(t)T . (3.32)For peephole connections we have∂R(Q,D)∂Wp2=diag(δ rec2yQ (t)).∂cQ(t)∂Wp2+diag(δ rec2yD (t)).∂cD(t)∂Wp2, (3.33)where∂cQ(t)∂Wp2= diag(fQ(t)).∂cQ(t−1)∂Wp2+b f ,Q(t).cQ(t−1)T . (3.34)For forget gate’s bias values we will have∂R(Q,D)∂b2=diag(δ rec2yQ (t)).∂cQ(t)∂b2+diag(δ rec2yD (t)).∂cD(t)∂b2, (3.35)where∂cQ(t)∂b2= diag(fQ(t)).∂cQ(t−1)∂b3+b f ,Q(t). (3.36)48Input without Gating (yg(t))For recurrent connections we will have∂R(Q,D)∂Wrec4=diag(δ rec4yQ (t)).∂cQ(t)∂Wrec4+diag(δ rec4yD (t)).∂cD(t)∂Wrec4, (3.37)whereδ rec4yQ (t) = (1−h(cQ(t)))◦ (1+h(cQ(t)))◦oQ(t)◦vQ(t)∂cQ(t)∂Wrec4= diag(fQ(t)).∂cQ(t−1)∂Wrec4+bg,Q(t).yQ(t−1)Tbg,Q(t) = iQ(t)◦ (1−yg,Q(t))◦ (1+yg,Q(t)). (3.38)For input connection we have∂R(Q,D)∂W4=diag(δ rec4yQ (t)).∂cQ(t)∂W4+diag(δ rec4yD (t)).∂cD(t)∂W4, (3.39)where∂cQ(t)∂W4= diag(fQ(t)).∂cQ(t−1)∂W4+bg,Q(t).xQ(t)T . (3.40)For bias values we will have∂R(Q,D)∂b4=diag(δ rec4yQ (t)).∂cQ(t)∂b4+diag(δ rec4yD (t)).∂cD(t)∂b4, (3.41)where∂cQ(t)∂b4= diag(fQ(t)).∂cQ(t−1)∂b4+bg,Q(t). (3.42)Error Signal BackpropagationError signals are back propagated through time using following equations:δ rec1Q (t−1) =[oQ(t−1)◦ (1−oQ(t−1))◦h(cQ(t−1))]◦WTrec1.δ rec1Q (t), (3.43)49 2 4 6 8 1051015202530−0.6−0.4−0.200.20.4Figure 3.8: Input gate, i(t), for Document: “shanghai hotels accommodation hotel in shanghai dis-count and reservation”δ reciQ (t−1) = [(1−h(cQ(t−1)))◦ (1+h(cQ(t−1)))◦oQ(t−1)]◦WTreci .δ reciQ (t), f or i ∈ {2,3,4}. (3.44)3.7 A More Clear Figure for Input Gate for “hotels in shanghai” ExampleIn this section we present a more clear figure for part (a) of Fig. 3.5 that shows the structure of the inputgate for document side of “hotels in shanghai” example. As it is clearly visible from Fig. 3.8, the input gatevalues for most of the cells corresponding to word 3, word 7 and word 9 in document side of LSTM-RNNhave very small values (light green-yellow color). These are corresponding to words “accommodation”,“discount” and “reservation” respectively in the document title. Interestingly, input gates are trying toreduce effect of these three words in the final representation (y(t)) because the LSTM-RNN model is trainedto maximize the similarity between query and document if they are a good match.50Table 3.5: RNNs with & without LSTM cells for the same query: “hotels in shanghai”hotels in shanghaiNumber of assignedcells out of 10 (LSTM-RNN) - 0 7Number of assignedneurons out of 10 (RNN) - 2 9Table 3.6: RNNs with & without LSTM cells for the same Document: “shanghai hotels accommoda-tion hotel in shanghai discount and reservation”shanghai hotels accommodation hotel in shanghai discount and reservationNumber of assignedcells out of 10 (LSTM-RNN) - 4 3 8 1 8 5 3 4Number of assignedneurons out of 10 (RNN) - 10 7 9 6 8 3 2 63.8 A Closer Look at RNNs with and without LSTM Cells in WebDocument Retrieval TaskIn this section we further show examples to reveal the advantage of LSTM-RNN sentence embedding com-pared to the RNN sentence embedding.First, we compare the scores assigned by trained RNN and LSTM-RNN to our “hotels in shanghai”example. On average, each query in our test dataset is associated with 15 web documents (URLs). Eachquery / document pair has a relevance label which is human generated. These relevance labels are “Bad”,“Fair”, “Good” and “Excellent”. This example is rated as a “Good” match in the dataset. The score for thispair assigned by RNN is “0.8165” while the score assigned by LSTM-RNN is “0.9161”. Please note thatthe score is between 0 and 1. This means that the score assigned by LSTM-RNN is more correspondentwith the human generated label.Second, we compare the number of assigned neurons and cells to each word by RNN and LSTM-RNNrespectively. To do this, we rely on the 10 most active cells and neurons in the final semantic vectors inboth models. Results are presented in Table 3.5 and Table 3.6 for query and document respectively. Aninteresting observation is that RNN sometimes assigns neurons to unimportant words, e.g., 6 neurons areassigned to the word “in” in Table 3.6.As another example we consider the query, “how to f ix bath tub wont turn o f f ”. This example is ratedas a “Bad” match in the dataset by human. It is good to know that the score for this pair assigned byRNN is “0.7016” while the score assigned by LSTM-RNN is “0.5944”. This shows the score generated byLSTM-RNN is closer to human generated label.Number of assigned neurons and cells to each word by RNN and LSTM-RNN are presented in Table3.7 and Table 3.8 for query and document. This is out of 10 most active neurons and cells in the semanticvector of RNN and LSTM-RNN. Examples of RNN assigning neurons to unimportant words are 3 neuronsto the word “a” and 4 neurons to the word “you” in Table 3.8.51Table 3.7: RNN versus LSTM-RNN for Query: “how to f ix bath tub wont turn o f f ”how to f ix bath tub wont turn o f fNumber of assignedcells out of 10 (LSTM-RNN) - 0 4 7 6 3 5 0Number of assignedneurons out of 10 (RNN) - 1 10 4 6 2 7 1Table 3.8: RNN versus LSTM-RNN for Document: “how do you paint a bathtub and what paintshould . . . ”how do you paint a bathtub and what paint should you . . .Number of assignedcells out of 10(LSTM-RNN) - 1 1 7 0 9 2 3 8 4Number of assignedneurons out of 10(RNN) - 1 4 4 3 7 2 5 4 73.9 Derivation of BPTT for RNN and LSTM-RNNIn this section we present the full derivation of the gradients for RNN and LSTM-RNN.3.9.1 Derivation of BPTT for RNNFrom (4) and (5) we have∂L(Λ)∂Λ=N∑r=1∂ lr(Λ)∂Λ=−N∑r=1n∑j=1αr, j∂∆r, j∂Λ, (3.45)whereαr, j =−γe−γ∆r, j1+∑nj=1 e−γ∆r, j, (3.46)and∆r, j = R(Qr,D+r )−R(Qr,Dr, j). (3.47)We need to find ∂∆r, j∂Λ for input weights and recurrent weights. We omit r subscript for simplicity.Recurrent Weights∂∆ j∂Wrec=∂R(Q,D+)∂Wrec− ∂R(Q,D−j )∂Wrec. (3.48)We divide R(D,Q) into three components:R(Q,D) = yQ(t = TQ)T yD(t = TD)︸ ︷︷ ︸a.1‖yQ(t = TQ)‖︸ ︷︷ ︸b.1‖yD(t = TD)‖︸ ︷︷ ︸c, (3.49)52then∂R(Q,D)∂Wrec=∂a∂Wrec.b.c︸ ︷︷ ︸D+a.∂b∂Wrec.c︸ ︷︷ ︸E+a.b.∂c∂Wrec︸ ︷︷ ︸F. (3.50)We haveD =∂yQ(t = TQ)T yD(t = TD).b.c∂Wrec=∂yQ(t = TQ)T yD(t = TD).b.c∂yQ(t = TQ).∂yQ(t = TQ)∂Wrec+∂yQ(t = TQ)T yD(t = TD).b.c∂yD(t = TD).∂yD(t = TD)∂Wrec= yD(t = TD).b.c.∂yQ(t = TQ)∂Wrec+yQ(t = TQ).(b.c)T︸ ︷︷ ︸b.c.∂yD(t = TD)∂Wrec. (3.51)Since f (.) = tanh(.), using chain rule we have∂yQ(t = TQ)Wrec=[(1−yQ(t = TQ))◦ (1+yQ(t = TQ))]yQ(t−1)T , (3.52)and thereforeD = [b.c.yD(t = TD)◦ (1−yQ(t = TQ))◦(1+yQ(t = TQ))]yQ(t−1)T+[b.c.yQ(t = TQ)◦ (1−yD(t = TD))◦(1+yD(t = TD))]yD(t−1)T . (3.53)To find E we use following basic rule∂∂x‖x−a‖2 = x−a‖x−a‖2 . (3.54)53ThereforeE = a.c.∂∂Wrec(‖yQ(t = TQ)‖)−1 =−a.c.(‖yQ(t = TQ)‖)−2.∂‖yQ(t = TQ)‖∂Wrec=−a.c.(‖yQ(t = TQ)‖)−2. yQ(t = TQ)‖yQ(t = TQ)‖∂yQ(t = TQ)∂Wrec=−[a.c.b3.yQ(t = TQ)◦ (1−yQ(t = TQ))◦(1+yQ(t = TQ))]yQ(t−1), (3.55)where F is calculated similar to (3.55):F =−[a.b.c3.yD(t = TD)◦ (1−yD(t = TD))◦(1+yD(t = TD))]yD(t−1). (3.56)Considering (3.50),(3.53),(3.55) and (3.56) we have∂R(Q,D)∂Wrec= δyQ(t)yQ(t−1)T +δyD(t)yD(t−1)T , (3.57)whereδyQ(t = TQ) = (1−yQ(t = TQ))◦ (1+yQ(t = TQ))◦(b.c.yD(t = TD)−a.b3.c.yQ(t = TQ)),δyD(t = TD) = (1−yD(t = TD))◦ (1+yD(t = TD))◦(b.c.yQ(t = TQ)−a.b.c3.yD(t = TD)). (3.58)Equation (3.58) will just unfold the network one time step, to unfold it over rest of time steps using back-propagation we haveδyQ(t− τ−1) = (1−yQ(t− τ−1))◦(1+yQ(t− τ−1))◦WTrecδyQ(t− τ),δyD(t− τ−1) = (1−yD(t− τ−1))◦(1+yD(t− τ−1))◦WTrecδyD(t− τ), (3.59)54where τ is the number of time steps that we unfold the network over time which is from 0 to TQ and TD forqueries and documents respectively. Now using (3.48) we have∂∆ j,τ∂Wrec= [δD+yQ (t− τ)yTQ(t− τ−1)+δD+yD (t− τ)yTD+(t− τ−1)]− [δD−jyQ (t− τ)yTQ(t− τ−1)+δD−jyD (t− τ)yTD−j (t− τ−1)]. (3.60)To calculate final value of gradient we should fold back the network over time and use (3.45), we will have∂L(Λ)∂Wrec=−N∑r=1n∑j=1T∑τ=0αr, j,TD,Q∂∆r, j,τ∂Wrec︸ ︷︷ ︸one large update. (3.61)Input WeightsUsing a similar procedure we will have the following for input weights:∂R(Q,D)∂W= δyQ(t− τ)lQ(t− τ)T +δyD(t− τ)lD(t− τ)T , (3.62)whereδyQ(t− τ) = (1−yQ(t− τ))◦ (1+yQ(t− τ))◦(b.c.yD(t− τ)−a.b3.c.yQ(t− τ)),δyD(t− τ) = (1−yD(t− τ))◦ (1+yD(t− τ))◦(b.c.yQ(t− τ)−a.b.c3.yD(t− τ)). (3.63)Therefore∂∆ j,τ∂W=[δD+yQ (t− τ)lTQ(t− τ)+δD+yD (t− τ)lTD+(t− τ)]−[δD−jyQ (t− τ)lTQ(t− τ)+δD−jyD (t− τ)lTD−j (t− τ)], (3.64)and therefore∂L(Λ)∂W=−N∑r=1n∑j=1T∑τ=0αr, j∂∆r, j,τ∂W︸ ︷︷ ︸one large update. (3.65)553.9.2 Derivation of BPTT for LSTM-RNNFollowing from (3.50) for every parameter, Λ, in LSTM-RNN architecture we have∂R(Q,D)∂Λ=∂a∂Λ.b.c︸ ︷︷ ︸D+a.∂b∂Λ.c︸ ︷︷ ︸E+a.b.∂c∂Λ︸ ︷︷ ︸F, (3.66)and from (3.51)D = yD(t = TD).b.c.∂yQ(t = TQ)∂Λ+yQ(t = TQ).b.c.∂yD(t = TD)∂Λ. (3.67)From (3.55) and (3.56) we haveE =−a.c.b3.yQ(t = TQ)∂yQ(t = TQ)∂Λ , (3.68)F =−a.b.c3.yD(t = TD)∂yD(t = TD)∂Λ . (3.69)Therefore∂R(Q,D)∂Λ= D+E+F =vQ∂yQ(t = TQ)∂Λ+vD∂yD(t = TD)∂Λ, (3.70)wherevQ = (b.c.yD(t = TD)−a.b3.c.yQ(t = TQ))vD = (b.c.yQ(t = TQ)−a.b.c3.yD(t = TD)). (3.71)Output GateSince α ◦β = diag(α)β = diag(β )α where diag(α) is a diagonal matrix whose main diagonal entries areentries of vector α , we have∂y(t)∂Wrec1=∂∂Wrec1(diag(h(c(t))).o(t))=∂diag(h(c(t)))∂Wrec1︸ ︷︷ ︸zero.o(t)+diag(h(c(t))).∂o(t)∂Wrec1= o(t)◦ (1−o(t))◦h(c(t)).y(t−1)T . (3.72)56Substituting (3.72) in (3.70) we have∂R(Q,D)∂Wrec1= δ rec1yQ (t).yQ(t−1)T +δ rec1yD (t).yD(t−1)T , (3.73)whereδ rec1yQ (t) = oQ(t)◦ (1−oQ(t))◦h(cQ(t))◦vQ(t)δ rec1yD (t) = oD(t)◦ (1−oD(t))◦h(cD(t))◦vD(t), (3.74)with a similar derivation for W1 and Wp1 we get∂R(Q,D)∂W1= δ rec1yQ (t).lQ(t)T +δ rec1yD (t).lD(t)T , (3.75)∂R(Q,D)∂Wp1= δ rec1yQ (t).cQ(t)T +δ rec1yD (t).cD(t)T . (3.76)For output gate bias values we have∂R(Q,D)∂b1= δ rec1yQ (t)+δrec1yD (t). (3.77)Input GateSimilar to output gate we start with∂y(t)∂Wrec3=∂∂Wrec3(diag(o(t)).h(c(t)))=∂diag(o(t))∂Wrec3︸ ︷︷ ︸zero.h(c(t))+diag(o(t)).∂h(c(t))∂Wrec3= diag(o(t)).(1−h(c(t)))◦ (1+h(c(t))) ∂c(t)∂Wrec3. (3.78)To find ∂c(t)∂Wrec3 assuming f(t) = 1 (we derive formulation for f(t) 6= 1 from this simple solution) we havec(0) = 0c(1) = c(0)+ i(1)◦yg(1) = i(1)◦yg(1)c(2) = c(1)+ i(2)◦yg(2). . .c(t) =t∑k=1i(k)◦yg(k) =t∑k=1diag(yg(k)).i(k). (3.79)57Therefore∂c(t)∂Wrec3=t∑k=1[∂diag(yg(k))Wrec3︸ ︷︷ ︸zero.i(k)+diag(yg(k)).∂ i(k)Wrec3]=t∑k=1diag(yg(k)).i(k)◦ (1− i(k)).y(k−1)T (3.80), (3.81)and∂y(t)∂Wrec3=t∑k=1[o(t)◦ (1−h(c(t)))◦ (1+h(c(t)))︸ ︷︷ ︸a(t)◦yg(k)◦ i(k)◦ (1− i(k))︸ ︷︷ ︸b(k)]y(k−1)T . (3.82)But this is expensive to implement, to resolve it we have∂y(t)∂Wrec3=t−1∑k=1[a(t)◦b(k)]y(k−1)T︸ ︷︷ ︸expensive part+[a(t)◦b(t)]y(t−1)T= diag(a(t))t−1∑k=1b(k).y(k−1)T︸ ︷︷ ︸∂c(t−1)∂Wrec3+diag(a(t)).b(t).y(t−1)T . (3.83)Therefore∂y(t)∂Wrec3= [diag(a(t))][∂c(t−1)∂Wrec3+b(t).y(t−1)T ]. (3.84)For f(t) 6= 1 we have∂y(t)∂Wrec3= [diag(a(t))][diag(f(t)).∂c(t−1)∂Wrec3+bi(t).y(t−1)T ], (3.85)wherea(t) = o(t)◦ (1−h(c(t)))◦ (1+h(c(t)))bi(t) = yg(t)◦ i(t)◦ (1− i(t)). (3.86)58Substituting above equation in (3.70) we will have∂R(Q,D)∂Wrec3= diag(δ rec3yQ (t)).∂cQ(t)∂Wrec3+diag(δ rec3yD (t)).∂cD(t)∂Wrec3, (3.87)whereδ rec3yQ (t) = (1−h(cQ(t)))◦ (1+h(cQ(t)))◦oQ(t)◦vQ(t)∂cQ(t)∂Wrec3= diag(fQ(t)).∂cQ(t−1)∂Wrec3+bi,Q(t).yQ(t−1)Tbi,Q(t) = yg,Q(t)◦ iQ(t)◦ (1− iQ(t)). (3.88)In equation (3.87), δ rec3yD (t) and∂cD(t)∂Wrec3 are the same as (3.88) with D subscript. Therefore, update equationsfor Wrec3 are (3.87), (3.88) for Q and D and (6).With a similar procedure for W3 we will have the following:∂R(Q,D)∂W3= diag(δ rec3yQ (t)).∂cQ(t)∂W3+diag(δ rec3yD (t)).∂cD(t)∂W3, (3.89)where∂cQ(t)∂W3= diag(fQ(t)).∂cQ(t−1)∂W3+bi,Q(t).xQ(t)T . (3.90)Therefore, update equations for W3 are (3.89), (3.90) for Q and D and (6).For peephole connections we will have∂R(Q,D)∂Wp3= diag(δ rec3yQ (t)).∂cQ(t)∂Wp3+diag(δ rec3yD (t)).∂cD(t)∂Wp3, (3.91)where∂cQ(t)∂Wp3= diag(fQ(t)).∂cQ(t−1)∂Wp3+bi,Q(t).cQ(t−1)T . (3.92)Hence, update equations for Wp3 are (3.91), (3.92) for Q and D and (6).Following similar derivation for bias values b3 we will have∂R(Q,D)∂b3= diag(δ rec3yQ (t)).∂cQ(t)∂b3+diag(δ rec3yD (t)).∂cD(t)∂b3, (3.93)59where∂cQ(t)∂b3= diag(fQ(t)).∂cQ(t−1)∂b3+bi,Q(t). (3.94)Update equations for b3 are (3.93), (3.94) for Q and D and (6).Forget GateFor forget gate, with a similar derivation to input gate we will have∂y(t)∂Wrec2= [diag(a(t))][diag(f(t)).∂c(t−1)∂Wrec2+b f (t).y(t−1)T ], (3.95)wherea(t) = o(t)◦ (1−h(c(t)))◦ (1+h(c(t)))b f (t) = c(t−1)◦ f(t)◦ (1− f(t)). (3.96)Substituting above equation in (3.70) we will have∂R(Q,D)∂Wrec2= diag(δ rec2yQ (t)).∂cQ(t)∂Wrec2+diag(δ rec2yD (t)).∂cD(t)∂Wrec2, (3.97)whereδ rec2yQ (t) = (1−h(cQ(t)))◦ (1+h(cQ(t)))◦oQ(t)◦vQ(t)∂cQ(t)∂Wrec2= diag(fQ(t)).∂cQ(t−1)∂Wrec2+b f ,Q(t).yQ(t−1)Tb f ,Q(t) = cQ(t−1)◦ fQ(t)◦ (1− fQ(t)). (3.98)Therefore, update equations for Wrec2 are (3.97), (3.98) for Q and D and (6).For input weights to forget gate, W2, we have∂R(Q,D)∂W2= diag(δ rec2yQ (t)).∂cQ(t)∂W2+diag(δ rec2yD (t)).∂cD(t)∂W2, (3.99)where∂cQ(t)∂W2= diag(fQ(t)).∂cQ(t−1)∂W2+b f ,Q(t).xQ(t)T . (3.100)Therefore, update equations for W2 are (3.99), (3.100) for Q and D and (6).60For peephole connections, Wp2, we have∂R(Q,D)∂Wp2= diag(δ rec2yQ (t)).∂cQ(t)∂Wp2+diag(δ rec2yD (t)).∂cD(t)∂Wp2, (3.101)where∂cQ(t)∂Wp2= diag(fQ(t)).∂cQ(t−1)∂Wp2+b f ,Q(t).cQ(t−1)T . (3.102)Therefore, update equations for Wp2 are (3.101), (3.102) for Q and D and (6).Update equations for forget gate bias values, b2, will be following equations and (6):∂R(Q,D)∂b2= diag(δ rec2yQ (t)).∂cQ(t)∂b2+diag(δ rec2yD (t)).∂cD(t)∂b2, (3.103)where∂cQ(t)∂b2= diag(fQ(t)).∂cQ(t−1)∂b3+b f ,Q(t). (3.104)Input without Gating (yg(t))Gradients for yg(t) parameters are as follows:∂y(t)∂Wrec4= [diag(a(t))][diag(f(t)).∂c(t−1)∂Wrec4+bg(t).y(t−1)T ], (3.105)wherea(t) = o(t)◦ (1−h(c(t)))◦ (1+h(c(t)))bg(t) = i(t)◦ (1−yg(t))◦ (1+yg(t)). (3.106)Substituting above equation in (3.70) we will have∂R(Q,D)∂Wrec4= diag(δ rec4yQ (t)).∂cQ(t)∂Wrec4+diag(δ rec4yD (t)).∂cD(t)∂Wrec4, (3.107)61whereδ rec4yQ (t) = (1−h(cQ(t)))◦ (1+h(cQ(t)))◦oQ(t)◦vQ(t)∂cQ(t)∂Wrec4= diag(fQ(t)).∂cQ(t−1)∂Wrec4+bg,Q(t).yQ(t−1)Tbg,Q(t) = iQ(t)◦ (1−yg,Q(t))◦ (1+yg,Q(t)). (3.108)Therefore, update equations for Wrec4 are (3.107), (3.108) for Q and D and (6).For input weight parameters, W4, we have∂R(Q,D)∂W4= diag(δ rec4yQ (t)).∂cQ(t)∂W4+diag(δ rec4yD (t)).∂cD(t)∂W4, (3.109)where∂cQ(t)∂W4= diag(fQ(t)).∂cQ(t−1)∂W4+bg,Q(t).xQ(t)T . (3.110)Therefore, update equations for W4 are (3.109), (3.110) for Q and D and (6).Gradients with respect to bias values, b4, are∂R(Q,D)∂b4= diag(δ rec4yQ (t)).∂cQ(t)∂b4+diag(δ rec4yD (t)).∂cD(t)∂b4, (3.111)where∂cQ(t)∂b4= diag(fQ(t)).∂cQ(t−1)∂b4+bg,Q(t). (3.112)Therefore, update equations for b4 are (3.111), (3.112) for Q and D and (6). There is no peephole connec-tions for yg(t).3.10 LSTM-RNN VisualizationIn this section we present more examples of LSTM-RNN visualization.3.10.1 LSTM-RNN Semantic Vectors: Another ExampleConsider the following example from test dataset:• Query: “how to f ix bath tub wont turn o f f ”• Document: “how do you paint a bathtub and whatpaint should you use yahoo answers︸ ︷︷ ︸treated as one word”62 2 4 6 8 1051015202530−0.3−0.2−0.100.10.20.30.4(a) i(t) 2 4 6 8 1051015202530−0.3−0.2−0.100.10.20.30.4(b) c(t) 2 4 6 8 1051015202530−0.3−0.2−0.100.10.20.30.4(c) o(t) 2 4 6 8 1051015202530−0.3−0.2−0.100.10.20.30.4(d) y(t)Figure 3.9: Query: “how to fix bath tub wont turn off ”Activations of input gate, output gate, cell state and cell output for each cell for query and document arepresented in Fig.3.9 and Fig.3.10 respectively based on a trained LSTM-RNN model.Three interesting observations from Fig.3.9 and Fig.3.10:• Semantic representation y(t) and cell states c(t) are evolving over time.• In part (a) of Fig.3.10, we observe that input gate values for most of the cells corresponding to word 3,word 4, word 7 and word 9 in document side of LSTM-RNN have very small values (light blue color).These are corresponding to words “you”, “paint”, “and” and “paint” respectively in the document ti-tle. Interestingly, input gates are trying to reduce effect of these words in the final representation (y(t))because the LSTM-RNN model is trained to maximize the similarity between query and document ifthey are a good match.• y(t) is used as semantic representation after applying output gate on cell states. Note that valuablecontext information is stored in cell states c(t).63 2 4 6 8 1051015202530 0.10.120.140.160.180.20.22(a) i(t) 2 4 6 8 1051015202530 −0.2−0.100.10.20.3(b) c(t) 2 4 6 8 10510152025300.350.40.450.50.55(c) o(t) 2 4 6 8 1051015202530 −0.1−0.0500.050.1(d) y(t)Figure 3.10: Document: “how do you paint a bathtub and what paint should . . . ”Table 3.9: Keyword extraction for Query: “how to f ix bath tub wont turn o f f ”how to f ix bath tub wont turn o f fNumber of assignedcells out of 10Left to Right - 0 4 7 6 3 5 0Number of assignedcells out of 10Right to Left 4 1 6 7 6 7 7 -3.10.2 Key Word Extraction: Another ExampleEvolution of 10 most active cells over time for the second example are presented in Fig. 3.11 for query andFig. 3.12 for document. Number of assigned cells out of 10 most active cells to each word are presented inTable 3.9 and Table 3.10.64 2 4 6 8246810 00.020.040.060.080.1Figure 3.11: Query: “how to fix bath tub wont turn off ”Table 3.10: Keyword extraction for Document: “how do you paint a bathtub and what paint should . . .”how do you paint a bathtub and what paint should you . . .Number of assignedcells out of 10Left to Right - 1 1 7 0 9 2 3 8 4Number of assignedcells out of 10Right to Left 5 9 5 4 8 4 5 5 9 -3.11 Doc2Vec Similarity TestTo make sure that a meaningful model is trained, we used the trained doc2vec model to find the most similarwords to two sample words in our dataset, the words “pizza” and “infection”. The resulting words andcorresponding scores are as follows:print(model.most-similar(’pizza’)) :[(u’recipes’, 0.9316294193267822),65 2 4 6 8 10246810 00.020.040.060.080.1Figure 3.12: Document: “how do you paint a bathtub and what paint should . . . ”(u’recipe’, 0.9295548796653748),(u’food’, 0.9250608682632446),(u’restaurants’, 0.9223555326461792),(u’bar’, 0.9191627502441406),(u’sabayon’, 0.916868269443512),(u’page’, 0.9160783290863037),(u’restaurant’, 0.9112323522567749),(u’house’, 0.9104640483856201),(u’the’, 0.9103578925132751)]print(model.most-similar(’infection’)):[(u’infections’, 0.9698576927185059),(u’treatment’, 0.9143450856208801),(u’symptoms’, 0.9138627052307129),(u’disease’, 0.9100595712661743),66𝑃(𝐷+|𝑄) Cosine 𝒙(2) 𝑾ℎ 𝒍(2) LSTM 𝒄(1) 𝒚(2) 𝒙(1) 𝑾ℎ 𝒍(1) LSTM 𝒚(1) 𝒙(𝑇𝐷2−) 𝑾ℎ 𝒍(𝑇𝐷2−) LSTM 𝒄(𝑇𝐷2− − 1) 𝒚(𝑇𝐷2−) … … Query LSTM 𝑄 Clicked Document LSTM 𝐷+ Unclicked Document LSTM 𝐷1− … Unclicked Document LSTM 𝐷2− Unclicked Document LSTM 𝐷𝑛− 𝑃(𝐷1−|𝑄) Cosine 𝑃(𝐷2−|𝑄) Cosine 𝑃(𝐷𝑛−|𝑄) Cosine 𝛿𝑄 𝛿𝐷+ 𝛿𝑄 𝛿𝑄 𝛿𝑄 𝛿𝐷1− 𝛿𝐷2− 𝛿𝐷𝑛− 𝛿𝐷2− (𝑇𝐷2−) 𝛿𝐷2− (2) … 𝛿𝐷2− (1) Semantic representation of the document LSTM cells and gates Backpropagated error signal Figure 3.13: Architecture of the proposed method.(u’palpitations’, 0.9083651304244995),(u’pneumonia’, 0.9073051810264587),(u’medical’, 0.9043352603912354),(u’abdomen’, 0.9034136533737183),(u’medlineplus’, 0.9032401442527771),(u’gout’, 0.9027985334396362)]As it is observed from the resulting words, the trained model is a meaningful model and can recognisesemantic similarity.3.12 Diagram of the Proposed ModelTo clarify the difference between the proposed method and the general sentence embedding methods, in thissection we present a diagram illustrating the training procedure of the proposed model. It is presented inFig. 3.13. In this figure n is the number of negative (unclicked) documents. The other parameters in thisfigure are similar to those used in Fig. 2 and Fig. 3 of this chapter.673.13 ConclusionsThis chapter addresses deep sentence embedding. We propose a model based on long short-term memoryto model the long range context information and embed the key information of a sentence in one semanticvector. We show that the semantic vector evolves over time and only takes useful information from any newinput. This has been made possible by input gates that detect useless information and attenuate it. Due togeneral limitation of available human labelled data, we proposed and implemented training the model witha weak supervision signal using user click-through data of a commercial web search engine.By performing a detailed analysis on the model, we showed that: 1) The proposed model is robust tonoise, i.e., it mainly embeds keywords in the final semantic vector representing the whole sentence and 2)In the proposed model, each cell is usually allocated to keywords from a specific topic. These findings havebeen supported using extensive examples. As a concrete sample application of the proposed sentence em-bedding method, we evaluated it on the important language processing task of web document retrieval. Weshowed that, for this task, the proposed method outperforms all existing state of the art methods significantly.This work has been motivated by the earlier successes of deep learning methods in speech [20, 21, 29,57, 129] and in semantic modelling [43, 44, 65, 108], and it adds further evidence for the effectiveness ofthese methods.68Chapter 4A Deep Learning Approach to DistributedCompressive SensingWhen a traveller reaches a fork in the road, the `1-norm tells him to take either one way or theother, but the `2-norm instructs him to head off into the bushes!— John F. Claerbout and Francis Muir4.1 IntroductionCompressive Sensing (CS) [31],[14],[9] is an effective approach for acquiring sparse signals where bothsensing and compression are performed at the same time. Since there are numerous examples of naturaland artificial signals that are sparse in the time, spatial or a transform domain, CS has found numerousapplications. These include medical imaging, geophysical data analysis, computational biology, remotesensing and communications.In the general CS framework, instead of acquiring N samples of a signal x ∈ℜN×1, M random measure-ments are acquired where M < N. This is expressed by the underdetermined system of linear equationsy =Φx, (4.1)where y∈ℜM×1 is the known measured vector andΦ∈ℜM×N is a random measurement matrix. To uniquelyrecover x given y and Φ, x must be sparse in a given basis Ψ. This means thatx =Ψs, (4.2)where s is K−sparse, i.e., s has at most K non-zero elements. The basisΨ can be complete; i.e.,Ψ∈ℜN×N ,or over-complete; i.e., Ψ ∈ ℜN×N1 where N < N1 (compressed sensing for over-complete dictionaries isintroduced in [15]). From (4.1) and (4.2)y = As, (4.3)where A =ΦΨ. Since there is only one measurement vector, the above problem is usually called the Single69Measurement Vector (SMV) problem in compressive sensing.In distributed compressive sensing , also known as the Multiple Measurement Vectors (MMV) prob-lem, a set of L sparse vectors {si}i=1,2,...,L is to be jointly recovered from a set of L measurement vectors{yi}i=1,2,...,L. Some application areas of MMV include magnetoencephalography, array processing, equal-ization of sparse communication channels and cognitive radio [33].Suppose that the L sparse vectors and the L measurement vectors are arranged as columns of matricesS = [s1,s2, . . . ,sL] and Y = [y1,y2, . . . ,yL] respectively. In the MMV problem, S is to be reconstructed givenYY = AS. (4.4)In (4.4), S is assumed to be jointly sparse, i.e., non-zero entries of each vector occur at the same locations asthose of other vectors, which means that the sparse vectors have the same support. Assume that S is jointlysparse. Then, the necessary and sufficient condition to obtain a unique S given Y is [22]|supp(S)|< spark(A)−1+ rank(S)2, (4.5)where |supp(S)| is the number of rows in S with non-zero energy and spark of a given matrix is the smallestpossible number of linearly dependent columns of that matrix. spark gives a measure of linear depen-dency in the system modelled by a given matrix. In the SMV problem, no rank information exists. Inthe MMV problem, the rank information exists and affects the uniqueness bounds. Generally, solving theMMV problem jointly can lead to better uniqueness guarantees than solving the SMV problem for eachvector independently [35].In the current MMV literature, a jointly sparse matrix is recovered typically by one of the followingmethods: 1) greedy methods [120] like Simultaneous Orthogonal Matching Pursuit (SOMP) which per-forms non-optimal subset selection, 2) relaxed mixed norm minimization methods [82, 119, 121, 134], or3) Bayesian methods like [72, 126, 133] where a posterior density function for the values of S is created,assuming a prior belief, e.g., Y is observed and S should be sparse in basis Ψ. The selection of one of theabove methods depends on the requirements imposed by the specific application.4.1.1 Problem StatementThe MMV reconstruction methods stated above do not rely on the use of training data. However, for manyapplications, a large amount of data similar to the data to be compressed by CS is available. Examplesare camera recordings of the same environment, images of the same class (e.g., flowers, buildings, ....),electroencephalogram (EEG) of different parts of the brain, etc. In this chapter, we address the followingquestions in the MMV problem when training data is available:1. Can we learn the structure of the sparse vectors in S by a data driven bottom up approach using thealready available training data? If yes, then how can we exploit this structure in the MMV problem todesign a better reconstruction method?2. Most of the reconstruction algorithms for the MMV problem rely on the joint sparsity of S. However,70in some practical applications, the sparse vectors in S are not exactly jointly sparse. This can be dueto noise or due to sources that create different sparsity patterns. Examples are images of differentscenes captured by different cameras, images of different classes, etc. Although S is not jointlysparse, there may exist a possible dependency among the columns of S, however, due to lack of jointsparsity, the above methods will not give satisfactory performance. The question is, can we designthe aforementioned data driven method in a way that it captures the dependencies among the sparsevectors in S? The type of such dependencies may not be necessarily that of joint sparsity. And thenhow can we use the learned dependency structure in the reconstruction algorithm at the decoder?Please note that we want to address the above questions “without adding any complexity or adaptability”to the encoder. In other words, our aim is not to design an optimal encoder, i.e., optimal sensing matrix Φ orthe sparsifying basis Ψ, for the given training data. The encoder would be as simple and general as possible.This is specially important for applications that use sensors having low power consumption due to a limitedbattery life. However, the decoder in these cases can be much more complex than the encoder. For example,the decoder can be a powerful data processing machine.4.1.2 Proposed MethodTo address the above questions, we propose the use of a two step greedy reconstruction algorithm. In thefirst step, at each iteration of the reconstruction algorithm, and for each column of S represented as si, wefirst find the conditional probability of each entry of si being non-zero, given the residuals of all previoussparse vectors (columns) at that iteration. Then we select the most probable entry and add it to the supportof si. The definition of the residual matrix at the j−th iteration is R j = Y−AS j where S j is the estimateof the sparse matrix S at the j−th iteration. Therefore in the first step, we find the locations of the non-zeroentries. In the second step we find the values of these non-zero entries. This can be done by solving a leastsquares problem that finds si given yi and AΩi . AΩi is a matrix that includes only those atoms (columns) ofA that are members of the support of si.To find the conditional probabilities at each iteration, we propose the use of a Recurrent Neural Network(RNN) with Long Short-Term Memory (LSTM) cells and a softmax layer on top of it. To find the modelparameters, we minimize a cross entropy cost function between the conditional probabilities given by themodel and the known probabilities in the training data. The details on how to generate the training data andthe training data probabilities are explained in subsequent sections. Please note that this training is doneonly once. After that, the resulting model is used in the reconstruction algorithm for any test data that hasnot been observed by the model before. Therefore, the proposed reconstruction algorithm would be almostas fast as the greedy methods. The block diagram of the proposed method is presented in Fig. 4.1 and Fig.4.2. We will explain these figures in detail in subsequent sections.To the best of our knowledge, this is the first model-based method in MMV sparse reconstruction thatis based on a deep learning bottom up approach. Similar to all deep learning methods, it has the importantfeature of learning the structure of S from the raw data automatically. Although it is based on a greedymethod that selects subsets that are not necessarily optimal, we experimentally show that by using a prop-erly trained model and only one layer of LSTM, the proposed method significantly outperforms well known71𝒓2 LSTM 𝒄1 𝒓1 LSTM 𝒓𝐿 LSTM 𝒄𝐿−1 … … LSTM cells and gates 𝒗1 𝒗2 𝒗𝐿 𝑼 𝑼 𝑼 𝒛1 𝒛2 𝒛𝐿 softmax softmax softmax … … … 𝑃(𝒔1|𝒓1) 𝑃(𝒔2|𝒓1, 𝒓2) 𝑃(𝒔𝐿|𝒓1, 𝒓2, … , 𝒓𝐿) … … Ω1 update Ω2 update Ω𝐿 update Updating support of 𝒔𝐿 Least Squares Least Squares Least Squares … 𝒔 1 𝒔 2 𝒔 𝐿 Estimation of L-th sparse vector 𝒓1 = 𝒚1 − 𝑨Ω1𝒔 1 𝒓2 = 𝒚2 − 𝑨Ω2𝒔 2 … 𝒓𝐿 = 𝒚𝐿 − 𝑨Ω𝐿𝒔 𝐿 Residual vector for L-th sparse vector Vector of LSTM cells’ state Figure 4.1: Block diagram of the proposed method unfolded over channels.MMV baselines (e.g., SOMP) as well as the well known Bayesian methods for the MMV problem (e.g.,Multitask Bayesian Compressive Sensing (MT-BCS)[72] and Sparse Bayesian Learning for temporally cor-related sources (T-SBL)[133]). We show this on two real world datasets. At the end of this chapter, we alsopresent bidirectional version of above LSTM based CS reconstruction method along with a method basedon Convolutional Deep Stacking Networks.We emphasize that the computations carried at the encoder mainly include multiplication by a randommatrix. The extra computations are only needed at the decoder. Therefore an important feature of compres-sive sensing (low power encoding) is preserved.4.1.3 Related WorkExploiting data structures besides sparsity for compressive sensing has been extensively studied in the liter-ature [4, 10, 33, 38, 71, 72, 83, 92, 95, 122, 126, 133]. In [10], it has been theoretically shown that usingsignal models that exploit these structures will result in a decrease in the number of measurements. In [33], athorough review on CS methods that exploit the structure present in the sparse signal or in the measurementsis presented. In [71], a Bayesian framework for CS is presented. This framework uses a prior informationabout the sparsity of s to provide a posterior density function for the entries of s (assuming y is observed). It72LSTM 𝒓(𝑡) 𝑾3 𝒓(𝑡) 𝑾4 𝒓(𝑡) 𝑾2 𝒓(𝑡) 𝑾1 𝑔(. ) 𝜎(. ) 𝜎(. ) Input Gate Output Gate 𝜎(. ) Forget Gate Cell × 𝒚𝑔(𝑡) 𝒊(𝑡) 𝒇(𝑡) 𝒄(𝑡 − 1) × ℎ(. ) × 𝒄(𝑡) 𝒐(𝑡) 𝒗(𝑡) 𝒄(𝑡 − 1) 𝑾𝑝2 𝑾𝑝3 𝑾𝑝1 𝒗(𝑡 − 1) 𝑾𝑟𝑒𝑐4 𝟏 𝒃4 𝒗(𝑡 − 1) 𝟏 𝑾𝑟𝑒𝑐3 𝒃3 𝟏 𝒃1 𝑾𝑟𝑒𝑐1 𝒗(𝑡 − 1) 𝟏 𝒃2 𝒗(𝑡 − 1) 𝑾𝑟𝑒𝑐2 Figure 4.2: Block diagram of the Long Short-Term Memory (LSTM).then uses a Relevance Vector Machine (RVM) [117] to estimate the entries of the sparse vector. This methodis called Bayesian Compressive Sensing (BCS). In [72], a Bayesian framework is presented for the MMVproblem. It assumes that the L “tasks” in the MMV problem in (4.4), are not statistically independent. Byimposing a shared prior on the L tasks, an empirical method is presented to estimate the hyperparametersand extensions of RVM are used for the inference step. This method is known as Multitask CompressiveSensing (MT-BCS). In [72], it is experimentally shown that the MT-BCS outperforms the method that appliesOrthogonal Matching Pursuit (OMP) on each task, the Simultaneous Orthogonal Matching Pursuit (SOMP)method which is a straightforward extension of OMP for the MMV problem, and the method that appliesBCS for each task. In [126], the Sparse Bayesian Learning (SBL) [39, 117] is used to solve the MMV prob-lem. It was shown that the global minimum of the proposed method is always the sparsest one. The authorsin [133], address the MMV problem when the entries in each row of S are correlated. An algorithm based onSBL is proposed and it is shown that the proposed algorithm outperforms the mixed norm (`1,2) optimizationas well as the method proposed in [126]. The proposed method is called T-SBL. In [83], a greedy algorithmaided by a neural network is proposed to address the SMV problem in (4.3). The neural network parametersare calculated by solving a regression problem and are used to select the appropriate column of A at eachiteration of OMP. The main modification to OMP is replacing the correlation step with a neural network.73They experimentally show that the proposed method outperforms OMP and `1 optimization. This methodis called Neural Network OMP (NNOMP). In [95], an extension of [83] with a hierarchical Deep StackingNetowork (DSN) [29] is proposed for the MMV problem. “The joint sparsity of S is an important assump-tion in the proposed method”. To train the DSN model, the Restricted Boltzmann Machine (RBM) [59] isused to pre-train DSN and then fine tuning is performed. It has been experimentally shown that this methodoutperforms SOMP and `1,2 in the MMV problem. The proposed methods are called Nonlinear WeightedSOMP (NWSOMP) for the one layer model and DSN-WSOMP for the multilayer model. In [92], a feedfor-ward neural network is used to solve the SMV problem as a regression task. Similar to [95] (if we assumethat we have only one sparse vector in [95]), a pre-training phase followed by a fine tuning is used. Forpre-training, the authors have used Stacked Denoising Auto-encoder (SDA) [124]. Please note that an RBMwith Gaussian visible units and binary hidden units (i.e., the one used in [95]) has the same energy functionas an auto-encoder with sigmoid hidden units and real valued observations [123]. Therefore the extension of[92] to the MMV problem will give similar performance as that of [95]. In [122], a reconstruction methodis proposed for sparse signals whose sparsity patterns change slowly with time. The main idea is to replaceCompressive Sensing (CS) on the observation y with CS on the Least Squares (LS) residuals. LS residualsare calculated using the previous estimation of the support. In [38], a reconstruction method is proposed torecover sparse signals with a sparsity pattern that slowly changes over time. The main idea is to use SparseBayesian Learning (SBL) framework. Similar to SBL, a set of hyperparameters are defined to control thesparsity of signals. The main difference is that the prior for each coefficient also involves the coefficients ofthe adjacent temporal observations. In [4], a CS algorithm is proposed for time-varying sparse signals basedon the least absolute shrinkage and selection operator (LASSO). A dynamic LASSO algorithm is proposedfor the signals with time-varying amplitudes and support.4.2 RNN with LSTM CellsThe RNN is a type of deep neural networks [20, 57] that are “deep” in the temporal dimension. It has beenused extensively in time sequence modelling [13, 28, 36, 49, 84, 85, 94, 96, 103]. If we look at the sparsevectors (columns) in S as a sequence, the main idea of using RNN for the MMV problem is to predict thesparsity patterns over different sparse vectors in S.Although RNN performs sequence modelling in a principled manner, it is generally difficult to learnthe long term dependency within the sequence due to the vanishing gradients problem. One of the effectivesolutions for this problem in RNNs is to employ memory cells instead of neurons that is originally proposedin [62] as Long Short-Term Memory (LSTM). It is further developed in [45] and [46] by adding forget gateand peephole connections to the architecture.We use the architecture of LSTM illustrated in Fig. 4.2 for the proposed sequence modelling methodfor the MMV problem. In this figure, i(t), f(t) ,o(t) ,c(t) are input gate, forget gate, output gate and cellstate vector respectively, Wp1, Wp2 and Wp3 are peephole connections, Wi, Wreci and bi, i = 1,2,3,4 areinput connections, recurrent connections and bias values, respectively, g(·) and h(·) are tanh(·) function andσ(·) is the sigmoid function. We use this architecture to find v for each channel and then use the proposedmethod in Fig. 4.1 to find the entries that have a higher probability of being non-zero. Considering Fig. 4.2,74the forward pass for LSTM model is as follows:yg(t) = g(W4r(t)+Wrec4v(t−1)+b4)i(t) = σ(W3r(t)+Wrec3v(t−1)+Wp3c(t−1)+b3)f(t) = σ(W2r(t)+Wrec2v(t−1)+Wp2c(t−1)+b2)c(t) = f(t)◦ c(t−1)+ i(t)◦yg(t)o(t) = σ(W1r(t)+Wrec1v(t−1)+Wp1c(t)+b1)v(t) = o(t)◦h(c(t)), (4.6)where ◦ denotes the Hadamard (element-wise) product.Summary of notations used in Fig. 4.2 is as follows:• “t”: Stands for the time index in the sequence. For example, if we have 4 residual vectors of fourdifferent channels, we can show them as r(t), t = 1,2,3,4.• “1”: is a scalar• “Wreci, i= 1,2,3,4”: Recurrent weight matrices of dimension ncell×ncell where ncell is the numberof cells in LSTM.• “Wi, i = 1,2,3,4”: Input weight matrices of dimension M×ncell where M is the number of randommeasurements in compressive sensing. These matrices map the residual vectors to feature space.• “bi, i = 1,2,3,4”: Bias vectors of size ncell×1.• “Wpi, i = 1,2,3”: Peephole connections of dimension ncell×ncell.• “v(t), t = 1,2, . . . ,L”: Output of the cells. Vector of size ncell×1. L is the number of channels in theMMV problem.• “i(t),o(t),yg(t), t = 1,2, . . . ,L”: Input gates, output gates and inputs before gating respectively. Vec-tor of size ncell×1.• “g(·) and h(·)”: tanh(·) function.• “σ(·)”: Sigmoid function.4.3 Proposed Method4.3.1 High Level PictureThe summary of the proposed method is presented in Fig. 4.1. We initialize the residual vector, r, for eachchannel by the measurement vector, y, of that channel. These residual vectors serve as the input to the LSTMmodel that captures features of the residual vectors using input weight matrices (W1,W2,W3,W4) as well75as the dependency among the residual vectors using recurrent weight matrices (Wrec1,Wrec2,Wrec3,Wrec4)and the central memory unit shown in Fig. 4.2. A transformation matrix U is then used to transform,v ∈ ℜncell×1, the output of each memory cell after gating, into the sparse vectors space, i.e., z ∈ ℜN×1.“ncell” is the number of cells in the LSTM model. Then a softmax layer is used for each channel to find theprobability of each entry of each sparse vector being non-zero. For example, for channel 1, the j-th outputof the softmax layer isP(s1( j)|r1) = ez( j)∑Nk=1 ez(k). (4.7)Then for each channel, the entry with the maximum probability value is selected and added to the supportset of that channel. After that, given the new support set, the following least squares problem is solved tofind an estimate of the sparse vector for the j-th channelsˆ j = argmins j‖y j−AΩ j s j‖22. (4.8)Using sˆ j, the new residual value for the j-th channel is calculated as follows:r j = y j−AΩ j sˆ j. (4.9)This residual serves as the input to the LSTM model at the next iteration of the algorithm. The stoppingcriteria for the algorithm is when the residual values are small enough or when it has performed N iterationswhere N is the dimension of the sparse vector. Since we have used LSTM cells for the proposed method,we call it LSTM-CS algorithm. The pseudo-code of the proposed method is presented in Algorithm 2.Algorithm 2 Distributed Compressive Sensing using Long Short-Term Memory (LSTM-CS)Inputs: CS measurement matrix A ∈ ℜM×N ; matrix of measurements Y ∈ ℜM×L; minimum `2 norm of residual matrix “resMin” as stoppingcriterion; Trained “lstm” modelOutput: Matrix of sparse vectors Sˆ ∈ℜN×LInitialization: Sˆ = 0; j = 1; i = 1; Ω= /0; R = Y.1: procedure LSTM-CS(A,Y, lstm)2: while i≤ N or ‖R‖2 ≤ resMin do3: i← i+14: for j = 1→ L do5: R(:, j)i← R(:, j)i−1max(|R(:, j)i−1|)6: v j ← lstm(R(:, j)i,v j−1,c j−1) . LSTM7: z j ← Uv j8: c← so f tmax(z j)9: idx← Support(max(c))10: Ωi←Ωi−1 ∪ idx11: SˆΩi (:, j)← (AΩi )†Y(:, j) . Least Squares12: SˆΩCi (:, j)← 013: R(:, j)i← Y(:, j)−AΩi SˆΩi (:, j)14: end for15: end while16: end procedureWe continue by explaining how the training data is prepared from off-line dataset and then we presentthe details of the learning method. Please note that all the computations explained in the subsequent twosections are performed only once and they do not affect the run time of the proposed solver in Fig. 4.1. It is76almost as fast as greedy algorithms in sparse reconstruction.4.3.2 Training Data GenerationThe main idea of the proposed method is to look at the sparse reconstruction problem as a two step task: aclassification as the first step and a subsequent least squares as the second step. In the classification step, theaim is to find the atom of the dictionary, i.e., the column of A, that is most relevant to the given residual ofthe current channel and the residuals of the previous channels. Therefore we need a set of residual vectorsand their corresponding sparse vectors for supervised training. Since the training data and A are given, wecan imitate the steps explained in the previous section to generate the residuals. This means that, givena sparse vector s with k non-zero entries, we calculate y using (4.3). Then we find the entry that has themaximum value in s and set it to zero. Assume that the index of this entry is k0. This gives us a new sparsevector with k−1 non-zero entries. Then we calculate the residual vector fromr = y−ak0s(k0), (4.10)where ak0 is the k0-th column of A and s(k0) is the k0-th entry of s. It is obvious that this residual valueis because of not having the remaining k− 1 non-zero entries of s. From these remaining k− 1 non-zeroentries, the second largest value of s has the main contribution to r in (4.10). Therefore, we use r to predictthe location of the second largest value of s. Assume that the index of the second largest value of s is k1. Wedefine s0 as a one hot vector that has value 1 at k1-th entry and zero at other entries. Therefore, the trainingpair is (r,s0).Now we set the k1-th entry of s to zero. This gives us a new sparse vector with k− 2 non-zero entries.Then we calculate the new residual vector fromr = y− [ak0 ,ak1 ][s(k0),s(k1)]T . (4.11)We use the residual in (4.11) to predict the location of the third largest value in s. Assume that the index ofthe third largest value of s is k2. We define s0 as a one hot vector that has value 1 at k2-th entry and zero atother entries. Therefore, the new training pair is (r,s0).The above procedure is continued upto the point that s does not have any non-zero entry. Then the sameprocedure is used for the next training sample. This gives us training samples for one channel. Then thesame procedure is used for the next channel in S. Since the number of non-zero entries, k, is not known inadvance, we assume a maximum number of non-zero entries per channel for training data generation.4.3.3 Learning MethodTo calculate the parameters of the proposed model, i.e., W1,W2,W3,W4, Wrec1,Wrec2,Wrec3,Wrec4, Wp1,Wp2,Wp3,b1,b2,b3,b4 in Fig. 4.2 and transformation matrix U in Fig.4.1, we minimize a cross entropy cost functionover the training data. Assuming s is the output vector of the softmax layer given by the model in Fig. 4.1(output of the softmax layer is represented as conditional probabilities in Fig. 4.1) and s0 is the one hot77vector explained in the previous section, the following optimization problem is solved:L(Λ) = minΛ{nB∑i=1Bsize∑r=1L∑τ=1N∑j=1Lr,i,τ, j(Λ)}Lr,i,τ, j(Λ) =−s0,r,i,τ( j)log(sr,i,τ( j)), (4.12)where nB is the number of mini-batches in the training data, Bsize is the number of training data pairs,(r,s0), in each mini-batch, L is the number of channels in the MMV problem, i.e., number of columns of S,and N is the length of vector s and s0. Λ denotes the collection of the model parameters that includes W1,W2, W3, W4, Wrec1, Wrec2, Wrec3, Wrec4, Wp1, Wp2, Wp3, b1, b2, b3 and b4 in Fig. 4.2 and U in Fig. 4.1.To solve the optimization problem in (4.12), we use Backpropagation through time (BPTT) with Nes-terov method. The update equations for parameter Λ at epoch k are as follows:4Λk = Λk−Λk−14Λk = µk−14Λk−1− εk−1∇L(Λk−1+µk−14Λk−1), (4.13)where ∇L(·) is the gradient of the cost function in (4.12), ε is the learning rate and µk is a momentum pa-rameter determined by the scheduling scheme used for training. Above equations are equivalent to Nesterovmethod in [93]. To see why, please refer to appendix A.1 of [115] where the Nesterov method is derived asa momentum method. The gradient of the cost function, ∇L(Λ), is∇L(Λ) =nB∑i=1Bsize∑r=1L∑τ=1N∑j=1∂Lr,i,τ, j(Λ)∂Λ︸ ︷︷ ︸one large update. (4.14)As it is obvious from (4.14), since we have unfolded the LSTM over channels in S, we fold it back when wewant to calculate gradients over the whole sequence of channels.∂Lr,i,τ, j(Λ)∂Λ in (4.14) and error signals for different parameters of the proposed model that are necessaryfor training are presented in section 4.5.We have used mini-batch training to accelerate training and one large update instead of incrementalupdates during back propagation through time. To resolve the gradient explosion problem we have usedgradient clipping. To accelerate the convergence, we have used Nesterov method [93] and found it effectivein training the proposed model for the MMV problem.We have used a simple yet effective scheduling for µk in (4.13), in the first and last 10% of all parameterupdates µk = 0.9 and for the other 80% of all parameter updates µk = 0.995. We have used a fixed step sizefor training LSTM. Please note that since we are using mini-batch training, all parameters are updated foreach mini-batch in (4.14).A summary of training method for LSTM-CS is presented in Algorithm 3.Although the training method and derivatives in section 4.5 are presented for all parameters in LSTM, inthe implementation ,we have removed peephole connections and forget gates. Since length of each sequence,78Algorithm 3 Training the proposed model for Distributed Compressive SensingInputs: Fixed step size “ε”, Scheduling for “µ”, Gradient clip threshold “thG”, Maximum number of Epochs “nE poch”, Total number oftraining pairs in each mini-batch “Bsize”, Number of channels for the MMV problem “L”.Outputs: LSTM-CS trained model for distributed compressive sensing “Λ”.Initialization: Set all parameters in Λ to small random numbers, i = 0, k = 1.procedure LSTM-CS(Λ)while i≤ nE poch dofor “first minibatch”→ “last minibatch” dor← 1while r ≤ Bsize doCompute ∑Lτ=1∂Lr,τ∂Λk. use (4.23) to (4.54) in section 4.5r← r+1end whileCompute ∇L(Λk)← “sum above terms over r”if ∇L(Λk)> thG then∇L(Λk)← thG. For each entry of the gradient matrix ∇L(Λk)end ifCompute4Λk . use (4.13)Update: Λk ←4Λk +Λk−1k← k+1end fori← i+1end whileend procedurei.e., the number of columns in S, is known in advance, we set state of each cell to zero in the beginning ofa new sequence. Therefore, forget gates are not a great help here. Also, as long as the order of columns inS is kept, the precise timing in the sequence is not of great concern, therefore, peephole connections are notthat important as well. Removing peephole connections and forget gate will also help to have less trainingtime, i.e., less number of parameters need to be tuned during training.4.4 Experimental Results and DiscussionWe have performed the experiments on two real world datasets, the first is the MNIST dataset of handwrittendigits [78] and the second is three different classes of images from natural image dataset of MicrosoftResearch in Cambridge [102].In this section, we would like to answer the following questions: (i) How is the performance of differentreconstruction algorithms for the MMV problem, including the proposed method, when different channels,i.e., different columns in S, have different sparsity patterns? (ii) Does the proposed method perform wellenough when there is correlation among different sparse vectors? E.g., when sparse vectors are DCT orWavelet transform of different blocks of an image? (iii) How fast is the proposed method compared to otherreconstruction algorithms for the MMV problem? (iv) How robust is the proposed method to noise?For all the results presented in this section, the reconstruction error is defined asNMSE =‖Sˆ−S‖‖S‖ , (4.15)where S is the actual sparse matrix and Sˆ is the recovered sparse matrix from random measurements by thereconstruction algorithm. The machine used to perform the experiments has an Intel(R) Core(TM) i7 CPU79Figure 4.3: Randomly selected images for test from MNIST dataset. The first channel encodes digitzero, the second channel encodes digit one and so on.with clock 2.93 GHz and with 16 GB RAM.4.4.1 MNIST DatasetMNIST is a dataset of handwritten digits where the images of the digits are normalized in size and centredso that we have fixed size images. The task is to simultaneously encode 4 images each of size 24× 24,i.e., we have 4 channels and L = 4 in (4.4). The encoder is a typical compressive sensing encoder, i.e., arandomly generated matrix A. We have normalized each column of A to have unit norm. Since the imagesare already sparse, i.e., have a few number of non-zero pixels, no transform, Ψ in (4.2), is used. To simulatethe measurement noise, we have added a Gaussian noise with standard deviation 0.005 to the measurementmatrix Y in (4.4). This results in measurements with signal to noise ratio (SNR) of approximately 46dB.We have divided each image into four 12× 12 blocks. This means that the length of each sparse vector isN = 144. We have taken 50% random measurements from each sparse vector, i.e., M = 72. After receivingand reconstructing all blocks at the decoder, we compute the reconstruction error defined in (4.15) for thefull image. We have randomly selected 10 images for each digit from the set {0,1,2,3}, i.e., 40 images intotal for the test. This means that the first column of S is an image of digit 0, the second column is an imageof digit 1, the third column is an image of digit 2 and the fourth column is an image of digit 3. Test imagesare represented in Fig. 4.3.We have compared the performance of the proposed reconstruction algorithm (LSTM-CS) with 7 recon-struction methods for the MMV problem. These methods are:• Simultaneous Orthogonal Matching Pursuit (SOMP) which is a well known baseline for the MMV80problem.• Bayesian Compressive Sensing (BCS)[71] applied independently on each channel. For the BCSmethod we set the initial noise variance of i-th channel to the value suggested by the authors, i.e.,std(yi)2/100 where i ∈ {1,2,3,4} and std(.) calculates the standard deviation. We set the thresholdfor stopping the algorithm to 10−8.• Multitask Compressive Sensing (MT-BCS) [72] which takes into account the statistical dependencyof different channels. For MT-BCS we set the parameters of the Gamma prior on noise variance toa = 100/0.1 and b = 1 which are the values suggested by the authors. We set the stopping thresholdto 10−8 as well.• Sparse Bayesian Learning for Temporally correlated sources (T-SBL) [133] which exploits correlationamong different sources in the MMV problem. For T-SBL, we used the default values proposed bythe authors.• Nonlinear Weighted SOMP (NWSOMP) [95] which solves a regression problem to help the SOMPalgorithm with prior knowledge from training data. For NWSOMP, during training, we used one layer,512 neurons and 25 epochs of parameters update.• Compressive Sensing on Least Squares Residual (LSCS) [122] where no explicit joint sparsity as-sumption is made in the design of the method. For LSCS, we used sigma0= cc∗ (1/3)∗ sqrt(Sav/m)suggested by the authors where m is the number of measurements and Sav = 16 as suggested by theauthor. We tried a range of different values of cc and got the best results with cc = 0.1. We also setsigsys = 1, siginit = 3 and lambdap = 4 as suggested by the author.• The method proposed in [38? ] and referred to as PCSBL-GAMP where sparse Bayesian learning isused to design the method and no explicit joint sparsity assumption is made. For PCSBL-GAMP, weused beta= 1, Pattern= 2 because we need the coupling among the sparse vectors, i.e., left and rightcoupling, maximum number of iterations equal to maxiter = 400, and C = 1e0 as suggested by theauthors for the noisy case.For LSTM-CS, during training, we used one layer, 512 cells and 25 epochs of parameter updates. Weused only 200 images for the training set. The training set does not include any of the 40 images used fortest. To monitor and prevent overfitting, we used 3 images per channel as the validation set and we usedearly stopping if necessary. Please note that the images used for validation were not used in the training setor in the test set. Results are presented in Fig. 4.4.In Fig. 4.4, the vertical axis is the NMSE defined in (4.15) and horizontal axis is the number of non-zeroentries in the sparse vector. The number of measurements, M, is fixed to 72. Each point on the curves inFig. 4.4 is the average of NMSE over 40 reconstructed test images at the decoder.For the MNIST dataset, we observe from Fig. 4.4 that LSTM-CS significantly outperforms the recon-struction algorithms for the MMV problem discussed in this chapter. One important reason for this is that8120 40 60 80 100 120 1400.511.522.533.5Number of non-zero entries in the sparse vectorNMSE SOMPLSTM-CS,ncell=512MT-BCST-SBLNWSOMPLSTM-CS,ncell=512,x4 TrDataLSCSPCSBL-GAMPBCS20 40 60 80 100 120 1400.20.40.60.8Number of non-zero entries in the sparse vectorNMSE SOMPLSTM-CS,ncell=512MT-BCST-SBLNWSOMPLSTM-CS,ncell=512,x4 TrDataLSCSPCSBL-GAMPFigure 4.4: Comparison of different MMV reconstruction algorithms for MNIST dataset. Bottom fig-ure is the same as top figure without results of BCS algorithm to make the difference amongdifferent algorithms more visible. In this experiment M = 72 and N = 144.existing MMV solvers rely on the joint sparsity in S, while the proposed method does not rely on this as-sumption. Another reason is that the structure of each sparse vector is effectively captured by LSTM. Thereconstructed images using different MMV reconstruction algorithms for 4 test images are presented in Fig.4.5. An interesting observation from Fig. 4.5 is that the accuracy of reconstruction depends on the complex-ity of the sparsity pattern. For example when the sparsity pattern is simple, e.g., image of digit 1 in Fig. 4.5,all the algorithms perform well. But when the sparsity pattern is more complex, e.g., image of digit 0 in Fig.4.5, then their reconstruction accuracy degrades significantly.82Original50% MeasurementsSOMPMT-BCST-SBLNWSOMPLSTM-CSOriginal50% MeasurementsSOMPMT-BCST-SBLNWSOMPLSTM-CSOriginal50% MeasurementsSOMPMT-BCST-SBLNWSOMPLSTM-CSOriginal50% MeasurementsSOMPMT-BCST-SBLNWSOMPLSTM-CSLS−CSPCSBL−GAMPLS−CSPCSBL−GAMPLS−CSPCSBL−GAMPLS−CSPCSBL−GAMPChannel 2Channel 1 Channel 3 Channel 4Reconstruction Algorithms:Figure 4.5: Reconstructed images using different MMV reconstruction algorithms for 4 images ofthe MNIST dataset. First row are original images, S, second row are measurement matrices,Y, third row are reconstructed images using LS-CS, fourth row are reconstructed images usingSOMP, fifth row using PCSBL-GAMP, sixth row using MT-BCS, seventh row using T-SBL,eighth row using NWSOMP and the last row are reconstructed images using the proposed LSTM-CS method.We have repeated the experiments on the MNIST dataset with 25% random measurements, i.e., M = 36.The results are presented in Fig. 4.6. We trained 4 different LSTM models for this experiment. The first oneis the same model used for previous experiment (m = 72). In the second model, we increased the numberof cells in the LSTM model from 512 to 1024. In the third and fourth models, we used 2 times and 4 timesmore training data respectively. The rest of the experiments’ settings was similar to the settings describedbefore. As observed from these results, by investing more on training a good LSTM model, LSTM-CSmethod performs better.All the results presented so far are for noisy measurements where an additive Gaussian noise with stan-dard deviation 0.005 is used (SNR' 46dB). To evaluate the stability of the proposed LSTM-CS method tonoise, and compare it with other methods discussed in this chapter, an experiment was performed using thefollowing range of noise standard deviations:σ = {0.5,0.2,0.1,0.05,0.01,0.005}, (4.16)where σ is the standard deviation of noise. This approximately corresponds toSNR = {6 dB,14 dB,20 dB,26 dB,40 dB,46 dB}. (4.17)We used the same experimental settings explained above. Results are presented in Fig. 4.7.As observed from the results, in very noisy environment, i.e., SNR = 6 dB, performance of MT-BCS8320 40 60 80 100 120 14011.522.533.544.5Number of non-zero entries in the sparse vectorNMSE SOMPT-SBLLSTM-CSMT-BCSNWSOMPBCSLSTM-CS,ncell=1024LSTM-CS,ncell=1024,x2 TrDataLSTM-CS,ncell=1024,x4 TrDataPCSBL-GAMPLSCS20 40 60 80 100 120 1400.60.70.80.911.11.21.3Number of non-zero entries in the sparse vectorNMSE SOMPT-SBLLSTM-CSMT-BCSNWSOMPLSTM-CS,ncell=1024LSTM-CS,ncell=1024,x2 TrDataLSTM-CS,ncell=1024,x4 TrDataPCSBL-GAMPLSCSFigure 4.6: Comparison of different MMV reconstruction algorithms for MNIST dataset. Bottom fig-ure is the same as top figure without results of BCS algorithm to make the difference amongdifferent algorithms more visible. In this experiment M = 36 and N = 144., LSCS and PCSBL-GAMP degrades significantly while T-SBL , NWSOMP and LSTM-CS (proposed inthis chapter) methods show less severe degradation. In very low noise environment, i.e., SNR = 46 dB,performance of LSTM-CS, trained with just 512 cells and 200 training images, is better than other methods.In medium noise environment, i.e., SNR = 20 dB and SNR = 26 dB, performance of LSTM-CS, T-SBLand PCSBL-GAMP are close (although LSTM-CS is slightly better). Please note that the performance ofLSTM-CS can be further improved by using a better architecture (e.g., more cells, more training data ormore layers) as explained previously.8410 15 20 25 30 35 40 451234567Signal to Noise Ratio (dB)NMSE SOMPNWSOMPLSTM-CSMT-BCST-SBLPCSBL-GAMPLSCSBCS(a) Results for all Methods.10 15 20 25 30 35 40 450.30.50.70.91.11.31.51.7Signal to Noise Ratio (dB)NMSE SOMPNWSOMPLSTM-CSMT-BCST-SBLPCSBL-GAMPLSCS(b) Results without BCS method for a more clear visibility.Figure 4.7: Reconstruction performance of the methods discussed in this chapter for different noiselevels.To present the phase transition diagram of solvers, we used a simple LSTM-CS solver that uses 512 cellsand just 200 training images. The performance was evaluated over the following values of mn where n is thenumber of entries in each sparse vector and m is the number of measurements per channel:mn= {0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50}. (4.18)850.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.50.10.20.30.40.50.60.70.80.91m/nk/m SOMPLSTM-CSPCSBL-GAMPBCST-SBLNWSOMPMT-BCSLSCSFigure 4.8: Phase transition diagram for different methods on MNIST dataset where 90% perfect re-covery is considered. Assuming a perfect recovery condition of NMSE ≤ 0.6 for this dataset.“n” is the number of entries in each sparse vector, “m” is the number of random measurementsand “k” is the number of non-zero entries in each sparse vector.For this experiment, we randomly selected 50 images per channel from MNIST dataset. Since we haveL = 4 channels, and each image is of size 24× 24, and each image has 4 blocks of 12× 12 pixels, in totalwe will have 50× 4× 4 = 800 sparse vectors. Considering Fig. 4.4, NMSE of most solvers is about 0.6.Therefore we set the following as the condition for perfect recovery: if more than 90% of test images arereconstructed with an NMSE of 0.6 or less, count that test image as perfectly recovered. We did this foreach mn in (4.18). Results are presented in Fig. 4.8. Results presented in Fig.4.8 shows the reconstructionperformance improvement when LSTM-CS method is used.We also present the performance of LSTM-CS for different number of random measurements. We usedthe set of random measurements in (4.18) with n = 144. We used an LSTM with 512 cells and 400 trainingimages. The settings for all other methods was similar to the one described before. Results are presentedin Fig. 4.9. As observed from Fig. 4.9, using LSTM-CS method improves the reconstruction performancecompared to other methods discussed in this chapter.4.4.2 Natural Images DatasetFor experiments on natural images we used the MSR Cambridge dataset [102]. Ten randomly selectedtest images belonging to three classes of this dataset are used for experiments. The images are shown inFig. 4.10. We have used 64× 64 images. Each image is divided into 8× 8 blocks. After reconstructingall blocks of an image in the decoder, the NMSE for the reconstructed image is calculated. The task is tosimultaneously encode 4 blocks (L = 4) of an image and reconstruct them in the decoder. This means that Sin (4.4) has 4 columns each one having N = 64 entries. We used 50% measurements, i.e., Y in (4.4) have 48620 30 40 50 60 7012345Number of Random Measurements (m)NMSE SOMPNWSOMPLSTM-CSMT-BCST-SBLLSCSPCSBL-GAMPBCS(a) Results for all Methods.20 30 40 50 60 700.511.5Number of Random Measurements (m)NMSE SOMPNWSOMPLSTM-CSMT-BCST-SBLLSCSPCSBL-GAMP(b) Results without BCS method for a more clear visibility.Figure 4.9: Comparison of different MMV reconstruction algorithms for different number of randommeasurements for MNIST dataset. In this experiment n = 144.columns each one having M = 32 entries.We have compared the performance of the proposed algorithm, LSTM-CS, with SOMP, T-SBL, MT-BCS and NWSOMP. We have not included results of applying BCS per channel due its weak performancecompared to other methods (this is shown in the experiments for MNIST dataset). We have used the samesetting as the settings for the MNIST dataset for different methods which is explained in the previous section.The only differences here are: (i) For each class of images, we have used just 55 images for training set and87Figure 4.10: Randomly selected natural images from three different classes used for test. The first roware “buildings”, the second row are “cows” and the third row are “flowers”.5 images for validation set which do not include any of 10 images used for test. (ii) We have used 15 epochsfor training LSTM-CS which is enough for this dataset, compared to 25 epochs for the MNIST dataset.The experiments were performed for two popular transforms, DCT and Wavelet, for all aforementionedreconstruction algorithms. For the wavelet transform we used Haar wavelet transform with 3 levels ofdecomposition. Results for DCT transform are presented in Fig. 4.19. Results for wavelet transform arepresented in Fig. 4.20.To conclude the experiments section, the CPU time for different reconstruction algorithms for the MMVproblem discussed in this chapter are presented in Fig. 4.23. Each point on the curves in Fig. 4.23 is thetime spent to reconstruct each sparse vector averaged over all the 8×8 blocks in 10 test images. We observefrom this figure that the proposed algorithm is almost as fast as greedy algorithms. Please note that thereis a faster version of T-SBL that is known as TMSBL. It will improve the CPU time of T-SBL but it is stillslower than other reconstruction methods.4.5 Expressions for the GradientsIn this section we present the final gradient expressions that are necessary to use for training the proposedmodel for the MMV problem. Due to lack of space, we omit the presentation of full derivations of thesegradients.Starting with the cost function in (4.12), we use the Nesterov method described in (4.13) to updateLSTM-CS model parameters. Here, Λ is one of the weight matrices or bias vectors {W1,W2,W3,W4,Wrec1,Wrec2,Wrec3,Wrec4,Wp1,Wp2,Wp3,b1,b2,b3,b4} in the LSTM-CS architecture. The general format ofthe gradient of the cost function, ∇L(Λ), is the same as (4.14). To calculate ∂Lr,i,τ (Λ)∂Λ from (4.12) we have∂Lr,i,τ(Λ)∂Λ=−N∑j=1s0,r,i,τ( j)∂ log(sr,i,τ( j))∂Λ. (4.19)8810 20 30 40 50 600.10.150.20.250.30.350.4Number of non-zero entries in the sparse vectorNMSE SOMPMT-BCST-SBLLSTM-CSNWSOMPLSCSPCSBL-GAMP10 20 30 40 50 600.150.20.250.30.350.40.45Number of non-zero entries in the sparse vectorNMSE SOMPMT-BCST-SBLLSTM-CSNWSOMPLSCSPCSBL-GAMP10 20 30 40 50 600.20.250.30.350.40.450.5Number of non-zero entries in the sparse vectorNMSE SOMPMT-BCST-SBLLSTM-CSNWSOMPPCSBL-GAMPLSCSFigure 4.11: Comparison of different MMV reconstruction algorithms for natural image dataset usingDCT transform and just one layer for LSTM model in LSTM-CS. Image classes from top tobottom respectively: buildings, cows and flowers.8910 20 30 40 50 600.10.150.20.250.30.35Number of non-zero entries in the sparse vectorNMSE SOMPMT-BCST-SBLLSTM-CSNWSOMPPCSBL-GAMPLSCS10 20 30 40 50 600.150.20.250.30.350.40.45Number of non-zero entries in the sparse vectorNMSE SOMPMT-BCST-SBLLSTM-CSNWSOMPLSCSPCSBL-GAMP10 20 30 40 50 600.20.250.30.350.40.450.5Number of non-zero entries in the sparse vectorNMSE SOMPMT-BCST-SBLLSTM-CSNWSOMPLSCSPCSBL-GAMPFigure 4.12: Comparison of different MMV reconstruction algorithms for natural image dataset usingWavelet transform and just one layer for LSTM model in LSTM-CS. Image classes from top tobottom respectively: buildings, cows and flowers.9010 20 30 40 50 605101520253035Number of non-zero entries in each sparse vectorCPUTime(sec) SOMPNWSOMPMT-BCST-SBLLSTM-CSLS-CSPCSBL-GAMP10 20 30 40 50 600.050.10.150.20.25Number of non-zero entries in each sparse vectorCPUTime(sec) SOMPNWSOMPMT-BCSLSTM-CSPCSBL-GAMPFigure 4.13: CPU time for different MMV reconstruction algorithms. These times are for the exper-iment using DCT transform for 10 test images from the building class. The bottom figure isthe same as top figure but without T-SBL and LS-CS to make the difference among differentmethods more clear.After a straightforward derivation of derivatives we will have∂Lr,i,τ(Λ)∂Λ= (β sr,i,τ − s0,r,i,τ)∂zτ∂Λ , (4.20)91where zτ is the vector z for τ-th channel in Fig. 4.1 and β is a scalar defined asβ =N∑j=1s0,r,i,τ( j). (4.21)Since during training data generation we have generated one hot vectors for s0, β always equals to 1.Since we are looking at different channels as a sequence, for a more clear presentation we show any vectorcorresponding to t-th channel with (t) instead of index τ . For example, zτ is represented by z(t).Since z(t) = Uv(t) we have∂z(t)∂Λ= UT∂v(t)∂Λ. (4.22)Combining (4.20), (4.21) and (4.22) we will have∂Lr,i,t(Λ)∂Λ= UT (sr,i(t)− s0,r,i(t))∂v(t)∂Λ . (4.23)Starting from “t = L”-th channel, we define e(t) ase(t) = UT (sr,i(t)− s0,r,i(t)). (4.24)The expressions for the gradients for different parameters of LSTM-CS model are presented in the subse-quent sections. We omit the subscripts r and i for simplicity of presentation. Please note that the final valueof the gradient is sum of gradient values over the mini-batch samples and number of channels as representedby summations in (4.14).4.5.1 Output Weights U∂Lt∂U= (s(t)− s0(t)).v(t)T . (4.25)4.5.2 Output GateFor recurrent connections we have∂Lt∂Wrec1= δ rec1(t).v(t−1)T , (4.26)whereδ rec1(t) = o(t)◦ (1−o(t))◦h(c(t))◦ e(t). (4.27)For input connections, W1, and peephole connections, Wp1, we will have∂Lt∂W1= δ rec1(t).r(t)T , (4.28)∂Lt∂Wp1= δ rec1(t).c(t)T . (4.29)92The derivative for output gate bias values will be∂Lt∂b1= δ rec1(t). (4.30)4.5.3 Input GateFor the recurrent connections we have∂Lt∂Wrec3= diag(δ rec3(t)).∂c(t)∂Wrec3, (4.31)whereδ rec3(t) = (1−h(c(t)))◦ (1+h(c(t)))◦o(t)◦ e(t)∂c(t)∂Wrec3= diag(f(t)).∂c(t−1)∂Wrec3+bi(t).v(t−1)Tbi(t) = yg(t)◦ i(t)◦ (1− i(t)). (4.32)For the input connections we will have the following:∂Lt∂W3= diag(δ rec3(t)).∂c(t)∂W3, (4.33)where∂c(t)∂W3= diag(f(t)).∂c(t−1)∂W3+bi(t).r(t)T . (4.34)For the peephole connections we will have∂Lt∂Wp3= diag(δ rec3y (t)).∂c(t)∂Wp3, (4.35)where∂c(t)∂Wp3= diag(f(t)).∂c(t−1)∂Wp3+bi(t).c(t−1)T . (4.36)For bias values, b3, we will have∂Lt∂b3= diag(δ rec3(t)).∂c(t)∂b3, (4.37)where∂c(t)∂b3= diag(f(t)).∂c(t−1)∂b3+bi(t). (4.38)934.5.4 Forget GateFor the recurrent connections we will have∂Lt∂Wrec2= diag(δ rec2(t)).∂c(t)∂Wrec2, (4.39)whereδ rec2(t) = (1−h(c(t)))◦ (1+h(c(t)))◦o(t)◦ e(t)∂c(t)∂Wrec2= diag(f(t)).∂c(t−1)∂Wrec2+b f (t).v(t−1)Tb f (t) = c(t−1)◦ f(t)◦ (1− f(t)). (4.40)For input connections to forget gate we will have∂Lt∂W2= diag(δ rec2(t)).∂c(t)∂W2, (4.41)where∂c(t)∂W2= diag(f(t)).∂c(t−1)∂W2+b f (t).r(t)T . (4.42)For peephole connections we have∂Lt∂Wp2= diag(δ rec2(t)).∂c(t)∂Wp2, (4.43)where∂c(t)∂Wp2= diag(f(t)).∂c(t−1)∂Wp2+b f (t).c(t−1)T . (4.44)For forget gate’s bias values we will have∂Lt∂b2= diag(δ rec2(t)).∂c(t)∂b2, (4.45)where∂c(t)∂b2= diag(f(t)).∂c(t−1)∂b3+b f (t). (4.46)4.5.5 Input without Gating (yg(t))For recurrent connections we will have∂Lt∂Wrec4= diag(δ rec4(t)).∂c(t)∂Wrec4, (4.47)94whereδ rec4(t) = (1−h(c(t)))◦ (1+h(c(t)))◦o(t)◦ e(t)∂c(t)∂Wrec4= diag(f(t)).∂c(t−1)∂Wrec4+bg(t).v(t−1)Tbg(t) = i(t)◦ (1−yg(t))◦ (1+yg(t)). (4.48)For input connections we have∂Lt∂W4= diag(δ rec4(t)).∂c(t)∂W4, (4.49)where∂c(t)∂W4= diag(f(t)).∂c(t−1)∂W4+bg(t).r(t)T . (4.50)For bias values we will have∂Lt∂b4= diag(δ rec4(t)).∂c(t)∂b4, (4.51)where∂c(t)∂b4= diag(f(t)).∂c(t−1)∂b4+bg(t). (4.52)4.5.6 Error Signal BackpropagationError signals are back propagated through time using following equations:δ rec1(t−1) = [o(t−1)◦ (1−o(t−1))◦h(c(t−1))]◦ [WTrec1.δ rec1(t)+ e(t−1)], (4.53)δ reci(t−1) = [(1−h(c(t−1)))◦ (1+h(c(t−1)))◦o(t−1)]◦ [WTreci .δ reci(t)+ e(t−1)],f or i ∈ {2,3,4}. (4.54)4.6 Bidirectional LSTM for MMV ProblemIn this section, we present bidirectional version of LSTM-CS. Given the fact that in the MMV problem,usually all measurement vectors are given, we can use both past and future information about the structureof the sparse vectors in S. This means that we can perform support prediction for a given column of S,based on both previous columns and future columns. This calls for a bidirectional learning architecture. Weuse bidirectional LSTM to address this problem. We experimentally show that the proposed method in thissection outperforms [72, 95, 133] and [97].95𝒓2LSTM𝒄1𝒓1LSTM𝒓𝐿LSTM𝒄𝐿−1……LSTM cells and gates= 𝑼𝒗1𝒗1𝒛1 𝒛2 𝒛𝐿softmax softmax softmax………𝑃(𝒔1|𝒓1) 𝑃(𝒔2|𝒓1, 𝒓2) 𝑃(𝒔𝐿|𝒓1, 𝒓2, … , 𝒓𝐿)……Ω1 update Ω2 update Ω𝐿 updateUpdating support of 𝒔𝐿Least Squares Least Squares Least Squares…ො𝒔1 ො𝒔2 ො𝒔𝐿 Estimation of L-thsparse vector𝒓1 = 𝒚1 − 𝑨Ω1 ො𝒔1𝒓2 = 𝒚2 − 𝑨Ω2 ො𝒔2…𝒓𝐿 = 𝒚𝐿 − 𝑨Ω𝐿 ො𝒔𝐿Residual vector for L-thsparse vectorVector of LSTM cells’ state= 𝑼𝒗2𝒗2= 𝑼𝒗𝐿𝒗𝐿Figure 4.14: Block diagram of the proposed bidirectional LSTM-CS method unfolded over channels.Method presented in this section is similar to LSTM-CS with the difference that for support prediction,we use both left to right and right to left predictions. This is shown in Fig. 4.14. If we detect the classes (thenon-zero entries) one by one, we can use the remaining residuals after finding each class (non-zero entry)as an appropriate input to a deep model for feature extraction. The extracted feature vectors are representedby {−→v1 ,−→v2 , . . . ,−→vL} for left-to-right model and {←−v1 ,←−v2 , . . . ,←−vL} for right-to-left model in Fig. 4.14. Featurevectors from both directions can be concatenated for the next step of the algorithm.Similar to previous sections, we initialize the residual vector, r, for each channel by the measurementvector, y, of that channel. These residual vectors, represented as r1,r2, . . . ,rL in Fig. 4.14, serve as theinputs to the bidirectional LSTM model. The bidirectional LSTM model captures features of the residualvectors using input weight matrices (W1,W2,W3,W4) as well as the dependency among the residual vectorsusing recurrent weight matrices (Wrec1,Wrec2,Wrec3,Wrec4) and the central memory units in left-to-right andright-to-left LSTMs. A transformation matrix U is then used to transform, [−→v ,←−v ]T ∈ℜ2ncell×1, the outputsof each memory cell after gating for both left-to-right and right-to-left models, into the sparse vectors space,i.e., z ∈ ℜN×1. “ncell” is the number of cells in the LSTM model. Then a softmax layer is used for eachchannel to find the probability of each entry of each sparse vector being non-zero. For example, for channel961, the j-th output of the softmax layer isP(s1( j)|r1) = ez( j)∑Nk=1 ez(k). (4.55)Rest of the method is similar to LSTM-CS described in previous sections.To evaluate performance of bidirectional LSTM-CS, we performed experiments on three different classesof images from a natural image dataset described in section 3.5. We used the same NMSE defined in (4.15)to evaluate the performance of bidirectional LSTM-CS.Five randomly selected test images from each class (flowers, buildings, cows) were used for test exper-iments. For each class of images, we used just 55 images for training set and 5 images for validation setwhich do not include any of 5 images used for test. We re-sized images to 128×128 images. Each imagewas divided into 8×8 blocks. After reconstructing all blocks of an image in the decoder, the NMSE for thereconstructed image was calculated. The task was to simultaneously encode 4 blocks (L = 4) of an imageand reconstruct them in the decoder. This meant that S in (4.4) had 4 columns each one having N = 64entries. We used 40% measurements, thus Y in (4.4) had 4 columns each one having M = 25 entries. Theencoder was a typical compressive sensing encoder, i.e., a randomly generated matrix A. We normalizedeach column of A to have unit norm. To simulate the measurement noise, we added a Gaussian noise withstandard deviation 0.005 to the measurement matrix Y in (4.4).We compared the performance of the proposed algorithm, BLSTM-CS, with SOMP [120], MT-BCS[72],T-SBL[133], NWSOMP[95] and LSTM-CS[97]. For MT-BCS we set the parameters of the Gamma prioron noise variance to a = 100/0.1 and b = 1 which are the values suggested by the authors. We set thestopping threshold to 10−8 as well. For T-SBL, we used the default values proposed by the authors. Weused T-MSBL which is a faster version of T-SBL. For NWSOMP, during training, we used one layer, 512neurons and 15 epochs of parameters update. The experiments were performed for two popular transforms,DCT and Wavelet, for all of the above reconstruction algorithms. For the wavelet transform, we used Haarwavelet transform with 3 levels of decomposition. For both LSTM-CS and BLSTM-CS, we used a smallmodel with 16 cells. For NWSOMP we used 3 layers and 512 neurons per layers. We present results for oneclass of images, buildings. The results from the other two classes of images are similar to what is presentedhere. To monitor and prevent overfitting, we used 5 images per channel as the validation set and we usedearly stopping if necessary. Please note that the images used for validation were not used in the training setor in the test set. Results for DCT transform and wavelet transform are shown in Fig. 4.15.As observed in Fig.4.15, BLSTM-CS outperforms the other methods discussed in this section for dif-ferent sparsity levels. To evaluate run time of different methods, considering the fact that all methods areimplemented in MATLAB and run on the same machine, the CPU time shown in Fig.4.15 demonstrates thatthe proposed method is faster than the Bayesian methods discussed in this section and is almost as fast asthe greedy method SOMP.9710 20 30 40 50 600.080.10.120.140.160.18Number of non zero entriesNMSE SOMPMTBCSTMSBLLSTM-UnidirectionalLSTM-BidirectionalNWSOMP10 20 30 40 50 600.120.140.160.18Number of non zero entriesNMSE SOMPNWSOMPMTBCSTMSBLLSTM-UnidirectionalLSTM-Bidirectional10 20 30 40 50 600.20.40.60.811.21.4Number of non zero entriesCPUTime(sec) SOMPNWSOMPMTBCSTMSBLLSTM-UnidirectionalLSTM-BidirectionalFigure 4.15: Up: Comparative reconstruction performance using DCT transform. Middle: Recon-struction performance using Wavelet transform. Bottom: CPU time. Note that the time isreported for T-MSBL which is a faster version of T-SBL.984.7 Convolutional Deep Stacking Networks for Distributed CompressiveSensingIn this section, we propose a method that relies on a Convolutional Deep Stacking Network (CDSN) pro-posed in this section to capture the dependency amongst the different channels. To reconstruct the sparsevectors, the approach is similar to LSTM-CS with the difference that we use CDSN to capture structureinformation of sparse vectors. In CDSN, to capture the dependencies amongst different channels, a slidingconvolution window over the columns of the matrix S is used where each convolution window contains wconsecutive columns of S where w is the size of convolution window.The main contributions of this section are proposing a convolutional version of the Deep Stacking Net-works (DSNs)[29], which we refer to as CDSN, and then using CDSN to capture the dependencies amongdifferent channels in the MMV problem. We then use a similar greedy reconstruction algorithm to LSTM-CS at the decoder to reconstruct S.Please note that in the sparse representation literature, the dictionary learning method [3] uses the avail-able training data to learn the sparsifying basis (Ψ in (4.2)) that can represent the signal as compactly aspossible. The main difference between dictionary learning and our work here is that we assume the sparsi-fying basis as given and there is no need to learn it. In other words, the sparse vectors in S are not necessarilyvery sparse. We expect the performance of our method to improve by combining dictionary learning withour proposed method. Nevertheless, in this section, we focus on the performance improvement obtained byusing the proposed approach only.Block diagrams of the proposed method are presented in Fig. 4.16. Similar to LSTM-CS, the dashedlines in Fig. 4.16 show that the process of reconstructing the sparse vectors repeats for a number of iterations.At each iteration, each column of S is estimated by a separate process. The inputs to this process are theresiduals at each iteration and the outputs are the estimated columns of S.More formally, in the proposed method, before the i-th iteration of reconstructing the j-th column ofS, i non-zero entries of that column are predicted so far. We represent the j-th column of S by s j. At thei-th iteration, the first step predicts the location of the (i+1)-th non-zero entry of s j, using the residuals ofcolumns contained in a sliding convolution window. In Fig. 4.16(a), an example with convolution window ofsize 3 is represented. This sliding convolution window helps in capturing the dependencies among channels.The predicted location of the (i+ 1)-th non-zero entry is then added to the support of s j. This support isrepresented by Ω j in Fig. 4.16(a). The second step finds the updated estimate of s j by solving a linear leastsquares problem that finds s j given y j (the j-th column of Y) and AΩ jsˆ j = argmins j‖y j−AΩ j s j‖22, (4.56)where AΩ j is a matrix that includes only those columns of A that correspond to the support of s j. Thedefinition of the residual matrix at the i−th iteration is Ri = Y−ASi where Si is the estimate of the sparsematrix S at the i−th iteration. Columns of R are represented by r j, j = 1,2, . . . ,L in Fig. 4.16(a).Now the remaining important questions are:(i) how can we find the parameters of CDSN represented in Fig. 4.16(b), i.e., W(1)1 ,W(1)2 ,W(2)1 ,W(2)2 ,W(3)1 ,W(3)2 ?99𝒓2 𝒓1 𝒓𝐿 … 𝑖𝑑𝑥1 𝑖𝑑𝑥2 𝑖𝑑𝑥𝐿 … Ω2 update Updating support of 𝒔𝐿 … 𝒔 1 𝒔 2 𝒔 𝐿 Estimation of L-th sparse vector 𝒓1 = 𝒚1 − 𝑨Ω1𝒔 1 𝒓2 = 𝒚2 − 𝑨Ω2𝒔 2 … 𝒓𝐿 = 𝒚𝐿 − 𝑨Ω𝐿𝒔 𝐿 Residual vector for L-th sparse vector Predicted location of next non-zero entry in L-th sparse vector 𝒓3 𝒛𝒆𝒓𝒐𝒔 𝒛𝒆𝒓𝒐𝒔 CDSN LS … … … 𝑖𝑑𝑥3 𝒔 3 𝒓3 = 𝒚3 − 𝑨Ω3𝒔 3 Convolutional Deep Stacking Network max LS LS LS Least Squares Ω1 update Ω3 update Ω𝐿 update max max max CDSN CDSN CDSN (a) Block diagram of the proposed method with convolution window size 3.𝑾1(1) 𝑾2(1) 𝒗(1) 𝒓2 𝒓1 𝒓3 𝑾1𝑎(2) 𝑾1𝑏(2) 𝑾2(2) 𝑾1(2)= [𝑾1𝑎(2), 𝑾1𝑏(2)] 𝑾1,𝑖𝑛𝑖𝑡(2)= [𝑹𝑛 , 𝑾1(1)] 𝒓2 𝒓1 𝒓3 𝒗(2) 𝒓2 𝒓1 𝒓3 𝑾1𝑏(3) 𝑾1𝑐(3) 𝒗(1) 𝑾1𝑎(3) sigmoid 𝑾2(3) 𝒗(3) 𝑾1(3)= [𝑾1𝑎(3), 𝑾1𝑏(3), 𝑾1𝑐(3)] 𝑾1,𝑖𝑛𝑖𝑡(3)= [𝑹𝑛, 𝑹𝑛 , 𝑾1𝑏(2)] Layer 1 Layer 2 Layer 3 𝑾1,𝑖𝑛𝑖𝑡(1)= 𝑅𝐵𝑀 𝐼𝑛𝑖𝑡𝑖𝑎𝑙𝑖𝑧𝑎𝑡𝑖𝑜𝑛 CDSN sigmoid sigmoid 𝒉(2) 𝒉(3) 𝒉(1) (b) Block diagram of the convoltional deep stacking network with 3 layers. This diagram shows CDSN for channel 2 in Fig.4.16(a), and Rn denotes a random matrix.Figure 4.16: Proposed Method.100Please note that in Fig. 4.16(b), W(l)1 is the matrix of weights from input layer to hidden layer for the l-thlayer of CDSN and W(l)2 is the matrix of weights from hidden layer to output layer for the l-th layer ofCDSN. Residual vectors of channel one, two and three are represented by r1,r2 and r3 respectively in Fig.4.16(b).(ii) how should the training data be represented to find the parameters of CDSN given that at eachiteration of the proposed method the location of one of the non-zero entries is determined? This means thatthe CDSN should observe the non-zero entries in the training data one by one. In the other words, we cannot simply use the given training data (e.g., images), and an appropriate representation of it is necessary.Question (ii) has already been addressed for LSTM-CS method described at the beginning of this chap-ter. We address question (i) by describing the CDSN formally and explaining the training method. Theforward pass for l-th layer of CDSN represented in Fig. 4.16(b) ish(l) =11+ e−W(l)1 z(l)v(l) = [W(l)2 ]T h(l). (4.57)In (4.57), v(l) is the output vector and z(l) is the input vector of l-th layer and is defined as follows:z(l) = [v(1),v(2), . . . ,v(l−1),r]. (4.58)In (4.58), r is the vector formed by the concatenation of all residual vectors in each convolution window. Tofind the CDSN unknown parameters W(l)1 and W(l)2 for each layer, l, a mean squared error cost function isminimized{W(l)1 ,W(l)2 }= argmin{W(l)1 ,W(l)2 }12‖V(l)−T‖22, (4.59)where T is a matrix whose columns are the target vectors in the training set and V(l) is a matrix whosecolumns are the corresponding output vectors from the l-th layer. Each layer of CDSN is a one layer neuralnetwork with a non-linear hidden layer and a linear output layer. In an CDSN, similar to a DSN [29], theoptimization problem in (4.59) is solved for each layer separately. The linearity of the output layer for eachlayer of CDSN makes it possible to find a closed form solution for W(l)2 given W(l)1 and TW(l)2 =[H(l)[H(l)]T]−1H(l)TT , (4.60)where H(l) is a matrix whose columns are h(l) in (4.57) corresponding to different training samples in thetraining set. To prevent overfitting and to have a reliable solution for W(l)2 when H(l) is ill conditioned, usu-ally an `2 regularization term is added to (4.59). In other words, to calculate W(l)2 the following optimization101problem is solved:W(l)2 = argminW(l)212‖[W(l)2 ]T H(l)−T‖22+µ‖W(l)2 ‖22, (4.61)which results inW(l)2 =[µI+H(l)[H(l)]T]−1H(l)TT , (4.62)where I is the identity matrix.To find W(l)1 , for each layer of CDSN we use the stochastic gradient descent method. To calculate thegradient of the cost function with respect to W(l)1 given the fact that W(l)2 and H(l) depend on W(l)1 , it can beshown [130] that the gradient of the cost function in (4.59) with respect to W(l)1 is∂‖V(l)−T‖22∂W(l)1= Z(l)[[H(l)]T ◦ [1−H(l)]T◦[[H(l)]†[H(l)TT ][T[H(l)]†]−TT [T[H(l)]†]]], (4.63)where Z(l) is a matrix whose columns are z(l) in (4.58) corresponding to different training samples in thetraining set and ◦ is the Hadamard product operator. Using the gradient information from past iterationscan help to improve the convergence speed in convex optimization problems [11]. Although the problem in(4.59) is not necessarily convex because of the stack of non-linear hidden layers, but we found out experi-mentally that the gradient information from the past iterations can be helpful here as well. As in [130], weuse the FISTA algorithm to accelerate the fine tuning. Therefore, the update equations for W(l)1 at the k-thiteration are as follow:W(l)1,k = Wˆ(l)1,k−ρ∂‖V(l)−T‖22∂Wˆ(l)1,kmk+1 =12(1+√1+4m2k)Wˆ(l)1,k+1 = Wˆ(l)1,k +mk−1mk+1(W(l)1,k−W(l)1,k−1). (4.64)The curve of the FISTA coefficients mk−1mk+1 with respect to the epoch number is represented Fig. 4.17. Aftercomputing W(l)1 from (4.64), we use the closed form formulation in (4.62) to find W(l)2 .Another important consideration for training the CDSN is that the cost function in (4.59) is not necessar-ily convex, therefore the initialization of W(l)1 before fine tuning plays an important role. For initializationof the first layer of CDSN, we train a Restricted Boltzmann Machine (RBM) [59, 110] with Gaussian visibleunits and binary hidden units. This results in the following energy function between visible units, i.e., entries1025 10 15 20 250.650.70.750.80.850.90.95Epochmk−1 / mk+1Figure 4.17: The curve of FISTA coefficients mk−1mk+1 in (4.64) with respect to the epoch number.of z(1), and hidden units, i.e., entries of h(1)E(z(1),h(1)) =12(z(1)−b1)T (z(1)−b1)−bT2 h(1)− [z(1)]T W(1)1,inith(1), (4.65)where b1 and b2 are vectors of bias values for the visible and hidden units respectively. The goal is to findW(1)1,init from the training input data, i.e., the residual vectors, z(1), in the training data generated as explainedearlier. Then we use W(1)1,init to initialize W(1)1 as shown in Fig. 4.16(b). This approach has been shown to bealso helpful in training neural networks and specifically autoencoders [60]. The parameters of the RBM canbe found by maximizing the log probability that the RBM assigns to the input data, which is a function ofthe energy function in (4.65), using the contrastive divergence method [59]. The details on the general RBMtraining method used in this work can be found at [54]. As shown in the block diagram of CDSN in Fig.4.16(a), to initialize the parameters of the upper layers of CDSN, W(l+1)1 , we use the learned parameters ofthe lower layer, W(l)1 , as initialization. This approach has been shown to be helpful in training DSNs [29]and it was helpful in our task as well. This completes the description of the training method for CDSN andthe answer for question (ii).Given the trained CDSN, a summary of the proposed reconstruction algorithm that finds the sparsestsolution S given Y and A in (4.4) is presented in Algorithm 4. We refer to this algorithm as CDSN-CS sincewe have used a convolutional DSN for distributed compressive sensing. A more high level architecture ofthe proposed method is also presented in Fig. 4.18.103Algorithm 4 Distributed Compressive Sensing using Covolutional Deep Stacking Network (CDSN-CS)Inputs: CS measurement matrix A∈ℜM×N ; matrix of measurements Y∈ℜM×L; minimum `2 norm of residual matrix“resMin” as stopping criterion; Trained “cdsn” model; Convolution window size “w”Output: Matrix of sparse vectors Sˆ ∈ℜN×LInitialization: Sˆ = 0; j = 1; i = 1; Ω= /0; R = Y.1: procedure CDSN-CS(A,Y,cdsn)2: while i≤ N and ‖R‖2 ≤ resMin do3: i← i+14: for j = 1→ L do5: R(:, j)i← R(:, j)i−1max(|R(:, j)i−1|)6: v j←cdsn([R(:, j− w2 )i,R(:, j− w2 +1)i, . . . ,R(:, j+ w2 −1)i,R(:, j+ w2 )i])7: idx← Support(max(v j))8: Ωi←Ωi−1∪ idx9: SˆΩi(:, j)← (AΩi)†Y(:, j) . Least Squares10: SˆΩCi (:, j)← 011: R(:, j)i← Y(:, j)−AΩi SˆΩi(:, j)12: end for13: end while14: end procedure4.7.1 Experimental Evaluation and DiscussionIn this section we experimentally demonstrate: (i) How is the performance of the proposed method in thissection compared to other reconstruction algorithms? (ii) How fast is the proposed method? (iii) What arethe effects of the convolution window size in CDSN-CS? (iv) What are the effects of the RBM initialization?To address the above issues, we performed experiments on the same natural image dataset describedearlier in this chapter. Ten randomly selected test images from each of 3 classes of this dataset were usedfor experiments. The images are shown in Fig. 4.10. The size of each of the used images was 64×64. Eachimage was divided into 8× 8 non-overlapping blocks. After reconstructing all the blocks of an image, thereconstruction error for the reconstructed image was calculated. The reconstruction error is defined similarto (4.15). We encoded 8 blocks (L = 8) of each image simultaneously using a random measurement matrixand reconstructed them at the decoder. Therefore, S in (4.4) had 8 columns and each column had N = 64entries. We used 40% measurements, i.e., Y in (4.4) had 8 columns and each column had M = 25 entries.The encoder was a typical compressive sensing encoder, i.e., A was a randomly generated matrix. Eachcolumn of A was normalized to have unity norm. To simulate the measurement noise, Gaussian noise withstandard deviation 0.005 was added to the measurement matrix Y in (4.4). We used two popular transforms,DCT and Wavelet, as the sparsifying basis Ψ in (4.2). For the wavelet transform we used the Haar wavelettransform with 3 levels of decomposition. We used 55 images for the training set, 5 images for the validationset and 10 images for the test set. The PC used to perform the experiments had an Intel(R) Core(TM) i7CPU with clock 2.93 GHz and with 16 GB RAM.The performance of the proposed reconstruction algorithm (CDSN-CS) was compared with 5 recon-struction methods for the MMV problem. These methods are: 1) Simultaneous Orthogonal MatchingPursuit (SOMP) which is a well known baseline for the MMV problem, 2) Bayesian Compressive Sens-104𝒓2 𝒓1 𝒓𝐿 … 𝑖𝑑𝑥1 𝑖𝑑𝑥2 𝑖𝑑𝑥𝐿 … Ω2 update … 𝒔 1 𝒔 2 𝒔 𝐿 𝒓1 = 𝒚1 − 𝑨Ω1𝒔 1 𝒓2 = 𝒚2 − 𝑨Ω2𝒔 2 … 𝒓𝐿 = 𝒚𝐿 − 𝑨Ω𝐿𝒔 𝐿 𝒓3 𝒛𝒆𝒓𝒐𝒔 𝒛𝒆𝒓𝒐𝒔 CDSN LS … … … 𝑖𝑑𝑥3 𝒔 3 𝒓3 = 𝒚3 − 𝑨Ω3𝒔 3 max LS LS LS Ω1 update Ω3 update Ω𝐿 update max max max CDSN CDSN CDSN 𝒓2 𝒓1 𝒓𝐿 … 𝑖𝑑𝑥1 𝑖𝑑𝑥2 𝑖𝑑𝑥𝐿 … Ω2 update … 𝒔 1 𝒔 2 𝒔 𝐿 𝒓1 = 𝒚1 − 𝑨Ω1𝒔 1 𝒓2 = 𝒚2 − 𝑨Ω2𝒔 2 … 𝒓𝐿 = 𝒚𝐿 − 𝑨Ω𝐿𝒔 𝐿 𝒓3 𝒛𝒆𝒓𝒐𝒔 𝒛𝒆𝒓𝒐𝒔 CDSN LS … … … 𝑖𝑑𝑥3 𝒔 3 𝒓3 = 𝒚3 − 𝑨Ω3𝒔 3 max LS LS LS Ω1 update Ω3 update Ω𝐿 update max max max CDSN CDSN CDSN One non-zero entry detected and calculated per channel All residual vectors are updated iteration 1 iteration 2 One non-zero entry detected and calculated per channel All residual vectors are updated … Figure 4.18: High level block diagram of the proposed method.ing (BCS)[71] applied independently on each channel, 3) Multitask Compressive Sensing (MT-BCS) [72]which takes into account the statistical dependency among the different channels, 4) Sparse Bayesian Learn-ing for Temporally correlated sources (T-SBL) [133] which exploits the correlation among different sourcesin the MMV problem and 5) Nonlinear Weighted SOMP (NWSOMP) [95]. For the BCS method we setthe initial noise variance of i-th channel to the value suggested by the authors, i.e., std(yi)2/100 wherei∈ {1,2,3,4,5,6,7,8} and std(.) calculates the standard deviation. The threshold for stopping the algorithmwas set to 10−8. For MT-BCS we set the parameters of the Gamma prior on noise variance to a = 100/0.1and b = 1 which are the values suggested by the authors. We set the stopping threshold to 10−8 as well. ForT-SBL, we used the default values recommended by the authors. For NWSOMP, during training, we usedthree layers, each layer having 512 neurons and 25 epochs of parameters update. For CDSN-CS, duringtraining, we used three layers, 64 neurons per layer with different window sizes and 25 epochs of parameterupdates. For RBM initialization, we ran 200 epochs of RBM parameter update with step size 0.01. To mon-itor overfitting of the RBM, we used free energy as explained in [54]. For fine tuning CDSN-CS after RBMinitialization we used step size 0.002. The regularization parameter µ in (4.62) was set to 0.01. To monitorand prevent overfitting, we used 5 images per channel as the validation set and we used early stopping ifnecessary. Please note that the images used for validation were different from those used in the training set105or in the test set.The results for the different classes of images are presented in Fig. 4.19 for the DCT transform and inFig. 4.20 for the Wavelet transform. In these figures, the vertical axis is the MSE defined in (4.15) and thehorizontal axis is the number of non-zero entries in each sparse vector. The number of measurements, M,is fixed to 25. Each point on the curves is the average of the MSEs over 10 reconstructed test images atthe decoder. For all results, we used a convolution window of size 5 because it gave the best performancecompared to other window sizes. For image class of flowers (the bottom part of Fig. 4.19 and Fig. 4.20), aconvolution window of size 7 gave better performance. As observed in these figures, CDSN-CS outperformsthe five reconstruction methods SOMP, BCS applied to each channel independently, MT-BCS, T-SBL andNWSOMP. We believe that this improvement in the performance is due to exploiting the dependenciesamong the different channels by CDSN network.To study the effect of the convolution window size, a comparison among the different convolution win-dow sizes in CDSN-CS for the image class “cows” with the DCT transform is presented in Fig. 4.21. Asobserved from this figure, increasing the window size improves the results up to a point after which theresults do not improve any more. Since we use distinct patches from each image, we might assign thisbehaviour to the fact that the residuals of image patches that are far from each other might be less correlatedthan the residuals of image patches that are close to each other.To show that RBM initialization is helpful for our task, we conducted two experiments. In the firstexperiment the CDSN is trained using random initialization. In the second experiment it is trained usingRBM initialization. The results are presented in Fig. 4.22. As observed in this figure, RBM initializationimproves the reconstruction performance.To conclude this section we present the CPU time for the different reconstruction algorithms discussedin this section in Fig. 4.23. Since all methods are run in MATLAB and on the same machine, Fig. 4.23gives a rough idea about how fast the different methods discussed in this section are. As observed in thisfigure, the proposed method is faster than the Bayesian methods discussed and is almost as fast as the greedymethods.4.8 ConclusionsThis chapter presents a method to reconstruct sparse vectors for the MMV problem. The proposed methodlearns the structure of sparse vectors and does not rely on the commonly used joint sparsity assumption.Through experiments on two real world datasets, we showed that the proposed method outperforms thegeneral MMV baseline SOMP as well as a number of Bayesian model based methods for the MMV problem.Please note that we have not used multiple layers of LSTM or the advanced deep learning methods fortraining, e.g., regularization using drop out which can improve the performance of LSTM-CS. This chapteris a proof of concept that deep learning methods and specifically sequence modelling methods, e.g., LSTM,can improve the performance of the MMV solvers significantly. This is specially the case when the sparsitypatterns are more complicated than that of obtained by the DCT or Wavelet transforms. We showed thison the MNIST dataset. We showed that the proposed method is almost as fast as greedy methods. Thegood performance of the proposed method depends on the availability of training data (as is the case in all10610 20 30 40 50 60Number of non-zero entries in the sparse vector0.10.120.140.160.180.20.22MSESOMPCDSN-CSBCSMT-BCST-SBLNWSOMP10 20 30 40 50 60Number of non-zero entries in the sparse vector0.150.20.25MSESOMPCDSN-CSBCSMT-BCST-SBLNWSOMP10 20 30 40 50 60Number of non-zero entries in the sparse vector0.20.250.3MSESOMPBCSMT-BCST-SBLCDSN-CSNWSOMPFigure 4.19: Comparison of different MMV reconstruction algorithms performance for the naturalimage dataset using DCT transform. Image classes from top to bottom are buildings, cows andflowers respectively.10710 20 30 40 50 60Number of non-zero entries in the sparse vector0.120.140.160.180.20.22MSESOMPCDSN-CSBCSMT-BCST-SBLNWSOMP10 20 30 40 50 60Number of non-zero entries in the sparse vector0.140.160.180.20.220.240.26MSESOMPCDSN-CSBCSMT-BCST-SBLNWSOMP10 20 30 40 50 60Number of non-zero entries in the sparse vector0.20.250.30.350.4MSESOMPBCSMT-BCST-SBLNWSOMPCDSN-CSFigure 4.20: Comparison of different MMV reconstruction algorithms performance for the naturalimage dataset using Wavelet transform. Image classes from top to bottom are buildings, cowsand flowers respectively.10810 20 30 40 50 60Number of non-zero entries in the sparse vector0.120.130.140.150.16MSECDSN-CS, Win=7CDSN-CS, Win=1CDSN-CS, Win=3CDSN-CS, Win=5Figure 4.21: The performance of CDSN-CS with different window sizes. This experiment was con-ducted for image class of cows with DCT transform as sparsifying basis.10 20 30 40 50 60Number of non-zero entries in the sparse vector0.150.160.170.180.190.2MSECDSN-CS, random initCDSN-CS, with RBM initFigure 4.22: The performance of CDSN-CS with and without RBM initialization. This experimentwas conducted for image class of cows with Wavelet transform as sparsifying basis.10910 20 30 40 50 605101520253035Number of non-zero entries in the sparse vectorCPUTime(sec) MT-BCST-SBLSOMPCDSN-CSNWSOMPBCS10 20 30 40 50 600.020.040.060.080.10.120.14Number of non-zero entries in the sparse vectorCPUTime(sec) MT−BCSSOMPCDSN−CSNWSOMPFigure 4.23: CPU time of different MMV reconstruction algorithms discussed in this section. Up: allmethods. Bottom: removing T-SBL and BCS to make the difference among remaining methodsmore visible.deep learning methods). In many applications, however, training data is available, e.g., different imagesof the same class or signals with similar sparsity patterns. Please note that if collecting training samplesis expensive or enough training samples are not available, using other sparse reconstruction methods isrecommended. We also presented the bidirectional version of LSTM-CS along with a reconstruction methodbased on Convolutional Deep Stacking Networks model proposed in this chapter.110Chapter 5Conclusions and Future WorkAll glory comes from daring to begin.— Eugene F Ware5.1 ConclusionsIn this thesis, we focused on deep learning methods used for sequence modelling. We approached threeimportant problems: speech recognition, sentence modeling for web search and distributed compressivesensing. For each problem, we explained why the problem have a sequential nature, and then, proposednew methods to exploit this sequential behavior. We showed effectiveness of proposed methods throughextensive experiments on real world challenging datasets. A summary of our findings are listed below:5.1.1 Recurrent Deep Stacking Networks for Speech RecognitionThe traditional Echo State Networks (ESNs) learn only the set of one of three important weight matrices. Inthis work, we wanted to address the problem of how to learn them all. The key property that characterizesthe ESN is the use of linear output units so that the learning is simple, can be formulated as a convex least-square ridge regression problem w.r.t output weights. In extending the learning of the output weights tolearning input and recurrent weights, we make use of the same property of the linear output units to developand formulate constraints for the sets of various ESN weight matrices. Such constraints are then used toderive analytic forms of the error gradients w.r.t the input and recurrent weights to be learned. The standardlearning method of BPTT used for the general RNN (with typically nonlinear output units) does not admitanalytical forms of gradient computation. BPTT requires recursively propagating the error signal backwardthrough time, a very different style of computation and learning than what we have developed in this workfor ESNs.A novel deep learning architecture, the R-DSN, which extends the earlier RNN and DSN models wasalso proposed. The R-DSN constructs multiple modules of the RNN using stacking, in the same way thatthe DSN uses stacking to form multiple modules of a simple, non-recurrent feed-forward neural network.Alternatively, the R-DSN can be viewed as a generalization of the DSN, where the generalization lies inembedding recurrent connections in each module that were missing in the earlier DSN. The main technical111contribution of the work reported in this part is the development of closed-from formulas for the gradi-ent computation based on the special structure of the R-DSN, and the batch-mode training method for allparameters in the R-DSN capitalizing on these formulas.The above methods were evaluated on TIMIT dataset for phoneme recognition task in speech recogni-tion. The results were better than a single layer RNN and also almost as good as the state-of-the-art methods,but not better than them (Tables 2.1 and 2.2).5.1.2 A Sentence Modelling Method for Web Search TaskWe proposed LSTM-DSSM, a new sentence modeling method for web search task. The proposed methodwas based on long short-term memory network to learn the long range context information and embed thekey information of a sentence in one semantic vector. We showed that the semantic vector evolves overtime and only takes useful information from any new input. This has been made possible by input gatesthat detect useless information and attenuate it. One important property of the proposed method was that,it created vector representations for queries and documents in such a way that semantic representations ofqueries and clicked documents are as close as possible, while, those of queries and unclicked documents areas far as possible. This was due to LSTM-DSSM’s specific architecture and cost function.Due to the general limitation of available human labelled data, we proposed and implemented and trainedLSTM-DSSM with a weak supervision signal using users’ click-through data of a commercial web searchengine.By performing a detailed analysis on the model, we showed that: 1) The proposed model is robust tonoise, i.e., it mainly embeds keywords in the final semantic vector that represents the whole sentence and 2)in the proposed model, each cell is usually allocated to keywords from a specific topic. These findings havebeen supported using extensive examples.As a concrete sample application of the proposed sentence embedding method, we evaluated it on theimportant language processing task of web document retrieval. The evaluations were performed on click-through data from a real world commercial search engine. For the task of information retrieval, the proposedmethod outperformed all existing state-of-the-art methods significantly (Table 3.4).5.1.3 A Deep Learning Approach to Distributed Compressive SensingWe presented a new method to reconstruct sparse vectors for the MMV problem based on a deep learningapproach. The proposed method learns the structure of the sparse vectors and does not rely on the commonlyused joint sparsity assumption. More accurately, it addresses the following three important problems in theMMV problem at the same time: (a) How to exploit structures besides sparsity during reconstruction? (b)How to do reconstruction in the MMV problem when sparse vectors are not jointly sparse? (c) How toexploit available offline data for a better reconstruction performance?Through experiments on two real world datasets, we showed that the proposed method outperformsthe general MMV baseline SOMP as well as a number of Bayesian model based methods for the MMVproblem. Please note that we have not used multiple layers of LSTM or the advanced deep learning methodsfor training, e.g., regularization using drop out which can improve the performance of LSTM-CS. This112work was a proof of concept that deep learning methods and specifically sequence modelling methods, e.g.,LSTM, can improve the performance of MMV solvers significantly. This is specially the case when thesparsity patterns are more complicated than that obtained by the DCT or a Wavelet transform. We showedthis on the MNIST dataset. We showed that the proposed method is almost as fast as the greedy methods.The good performance of the proposed method depends on the availability of large amount of trainingdata (as is the case in all deep learning methods). In many applications, enough training data is available,e.g., different images of the same class or different signals with similar sparsity patterns. Please note thatif collecting training samples is expensive or enough training samples are not available, then using othersparse reconstruction methods is recommended.The above methods were evaluated on two real world image datasets, MNIST dataset and differentclasses of a natural image dataset. The results were better than the well known greedy method SimultaneousOrthogonal Matching Pursuit (SOMP) and a number of state-of-the-art model based Bayesian methods (Fig.4.4).5.2 Future WorkThere is still a lot that can be done in the direction of developing deep sequence modelling methods for theproblems discussed in this thesis. Below is a list of reasonably “feasible” future work problems:1. Using the proposed sentence embedding method in chapter 3, LSTM-DSSM, for other importantlanguage processing tasks for which we believe sentence embedding plays a key role. Some examplesare: question answering, machine translation, and sentiment analysis.2. Exploiting the prior information about the structure of the different matrices in Fig. 3.2 in chapter 3to develop more effective cost functions and learning methods.3. Exploiting the attention mechanism [8] in the proposed model in chapter 3 to improve the retrievalperformance and finding out which words in the query are aligned to which words of the document.4. Using the method proposed in chapter 3 for text summarization. For example, given an article or largebody of text, the task is to automatically generate a title for it.5. Building a multi-resolution version of the method proposed in chapter 3. This means that using aConvolutional Neural Network (CNN) or LSTM for character level feature extraction, an LSTM onthe top to extract word / sentence level information, and having a cosine similarity cost function onthe top. This multi-resolution architecture will be learned end to end using a method similar to theone described in chapter 3.6. Extending the proposed method in chapter 4 to non-linear distributed compressive sensing. Thismeans finding the sparsest solution s iny = f (As) (5.1)113where f (.) is a non-linear function. This non-linearity might come from the physical properties of theproblem or we might want to add it intentionally. We should also explore what a non-linear encoderin CS might result in.7. Using the proposed LSTM-CS approach in chapter 4 (to reconstruct data compressed using CS), forthe following important compressive sensing applications:(a) Magnetic Resonance Imaging (MRI) to shorten MRI scanning sessions. There are also oppor-tunities to use LSTM-CS for Dynamic MRI where the correlation among image frames can beeffectively exploited by LSTM-CS.(b) Computed Tomography (CT) where the goal is to reduce the amount of radiation.(c) Video compressive sensing where there is correlation amongst the video frames.(d) Health telemonitoring and specifically compressive sensing of EEG signals where there is cor-relation amongst the different EEG channels.(e) Using a modified version of LSTM-CS for Blind Compressive Sensing (BCS) [47] where boththe sparse vector s and the sparsifying basis Ψ are unknown.(f) Using an optimized implementation of LSTM-CS for Single-Pixel cameras [34].8. Extending the evaluation tasks in chapter 2 to more complex continuous phoneme and word recogni-tion tasks.9. One challenge in compressive sensing research is that there is no unified framework to reliably com-pare all existing reconstruction methods. This is specially the case when it comes to real worlddatasets. An important future work direction is to create a unified and easy to use framework, suchthat, new reconstruction methods can be easily compared with existing methods.114Bibliography[1] Compressive sensing resources. http://dsp.rice.edu/cs , accessed on Decemeber 13, 2016. → pages 7[2] Big data, for better or worse: 90% of world’s data generated over last two years. 2013. URLhttps://www.sciencedaily.com/releases/2013/05/130522085217.htm. accessed on December 13,2016. → pages 1[3] M. Aharon, M. Elad, and A. Bruckstein. k -svd: An algorithm for designing overcompletedictionaries for sparse representation. Signal Processing, IEEE Transactions on, 54(11):4311–4322,Nov 2006. → pages 99[4] D. Angelosante, G. Giannakis, and E. Grossi. Compressed sensing of time-varying signals. In 16thInternational Conference on Digital Signal Processing, pages 1–8, 2009. → pages 72, 74[5] A. Averbuch, J. Weill, O. Barkan, and S. Dekel. Adaptive compressed tomography sensing. IEEEConference on Computer Vision and Pattern Recognition, pages 2195–2202, 2013. → pages 7[6] S. Aviyente. Compressed sensing framework for eeg compression. In Statistical Signal Processing,2007. SSP ’07. IEEE/SP 14th Workshop on, pages 181–184, 2007. → pages 7[7] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Structured sparsity through convex optimization.Statistical Science, 27(4):450–468, 2012. → pages 9[8] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align andtranslate. ICLR2015, 2015. http://arxiv.org/abs/1409.0473. → pages 29, 113[9] R. Baraniuk. Compressive sensing [lecture notes]. IEEE Signal Processing Magazine, 24(4):118–121, july 2007. → pages 7, 69[10] R. Baraniuk, V. Cevher, M. Duarte, and C. Hegde. Model-based compressive sensing. InformationTheory, IEEE Transactions on, 56(4):1982–2001, April 2010. → pages 72[11] A. Beck and M. Teboulle. Gradient-based algorithms with applications to signal-recoveryproblems. Cambridge University Press, 2009. → pages 21, 102[12] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent isdifficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994. → pages 6, 14, 16[13] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. Advances in optimizing recurrentnetworks. In Proc. ICASSP, Vancouver, Canada, May 2013. → pages 5, 14, 30, 74[14] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccuratemeasurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006. →pages 7, 69115[15] E. J. Cands, Y. C. Eldar, D. Needell, and P. Randall. Compressed sensing with coherent andredundant dictionaries. Applied and Computational Harmonic Analysis, 31(1):59 – 73, 2011. →pages 8, 69[16] H. Cao, V. Leung, C. Chow, and H. Chan. Enabling technologies for wireless body area networks: Asurvey and outlook. Communications Magazine, IEEE, 47(12):84–93, 2009. → pages 7[17] J. Chen and L. Deng. A primal-dual method for training recurrent neural networks constrained bythe echo-state property. In Proceedings of the International Conf. on Learning Representations(ICLR), 2014. → pages 30[18] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neuralnetworks on sequence modeling. NIPS Deep Learning Workshop, 2014. → pages 28[19] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neuralnetworks with multitask learning. In International Conference on Machine Learning, ICML, 2008.→ pages 29[20] G. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks forlarge-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactionson, 20(1):30 –42, jan. 2012. ISSN 1558-7916. → pages 1, 68, 74[21] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Large vocabulary continuous speech recognition withcontext-dependent DBN-HMMs. In Proc. IEEE ICASSP, pages 4688–4691, Prague, Czech, May2011. → pages 68[22] M. Davies and Y. Eldar. Rank awareness in joint sparse recovery. IEEE Transactions on InformationTheory, 58(2):1135 –1146, Feb. 2012. → pages 9, 70[23] L. Deng. Achievements and challenges of deep learning. APSIPA Transactions on Signal andInformation Processing, 2015. → pages 1[24] L. Deng. How deep reinforcement learning can help chatbots. Venturebeat, 2016. URLhttp://venturebeat.com/2016/08/01/how-deep-reinforcement-learning-can-help-chatbots/. accessedon December 13, 2016. → pages 6[25] L. Deng and J. Chen. Sequence classification using high-level features extracted from deep neuralnetworks. In Proc. ICASSP, 2014. → pages 30[26] L. Deng and D. Yu. Deep convex networks for speech pattern classification. Proc. Interspeech,2011. → pages 14, 18[27] L. Deng and D. Yu. Deep learning: Methods and applications. NOW Publishers, pages 1–199, 2014.→ pages 1[28] L. Deng, K. Hassanein, and M. Elmasry. Analysis of the correlation structure for a neural predictivemodel with application to speech recognition. Neural Networks, 7(2):331–339, 1994. → pages 5,14, 30, 74[29] L. Deng, D. Yu, and J. Platt. Scalable stacking and learning for building deep architectures. In Proc.ICASSP, pages 2133 –2136, march 2012. → pages iv, 14, 68, 74, 99, 101, 103116[30] L. Deng, O. Abdel-Hamid, and D. Yu. A deep convolutional neural network using heterogeneouspooling for trading acoustic invariance with phonetic confusion. In Proc. IEEE ICASSP, Vancouver,Canada, May 2013. → pages 14[31] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289 –1306,april 2006. → pages 7, 8, 69[32] D. L. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionariesvia `1 minimization. Proceedings of the National Academy of Sciences of the United States ofAmerica, 100(5):2197–2202, 2003.“http://www-stat.stanford.edu/ donoho/Reports/2002/OptSparse.pdf”, accessed on December 13,2016. → pages 8[33] M. Duarte and Y. Eldar. Structured compressed sensing: From theory to applications. IEEETransactions on Signal Processing, 59(9):4053 –4085, sept. 2011. → pages 9, 70, 72[34] M. F. Duarte, M. A. Davenport, D. Takbar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk.Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83–91,2008. → pages 7, 114[35] Y. Eldar and H. Rauhut. Average case analysis of multichannel sparse recovery using convexrelaxation. IEEE Transactions on Information Theory, 56(1):505 –519, Jan. 2010. → pages 9, 70[36] J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990. → pages 5, 14, 30,74[37] H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dolla´r, J. Gao, X. He, M. Mitchell,J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In IEEEConference on Computer Vision and Pattern Recognition, CVPR, Boston, MA, USA, pages1473–1482, 2015. → pages 1[38] J. Fang, Y. Shen, and H. Li. Pattern coupled sparse bayesian learning for recovery of time varyingsparse signals. In 19th International Conference on Digital Signal Processing (DSP), pages705–709, 2014. → pages 72, 74, 81[39] A. C. Faul and M. E. Tipping. Analysis of sparse bayesian learning. In Advances in NeuralInformation Processing Systems (NIPS) 14, pages 383–389. MIT Press, 2001. → pages 73[40] R. Fergus. Introduction to convolutional networks. Presented as a lecture at CIFAR Deep LearningSummer School, Montreal, Canada, 2016. → pages ix, 4[41] J. Gao, W. Yuan, X. Li, K. Deng, and J.-Y. Nie. Smoothing clickthrough data for web search ranking.In Proceedings of the 32Nd International ACM SIGIR Conference on Research and Development inInformation Retrieval, SIGIR ’09, pages 355–362, New York, NY, USA, 2009. ACM. → pages 33[42] J. Gao, K. Toutanova, and W.-t. Yih. Clickthrough-based latent semantic models for web search.SIGIR ’11, pages 675–684. ACM, 2011. → pages 41[43] J. Gao, X. He, W. tau Yih, and L. Deng. Learning continuous phrase representations for translationmodeling. In Proc. ACL, 2014. → pages 68[44] J. Gao, P. Pantel, M. Gamon, X. He, L. Deng, and Y. Shen. Modeling interestingness with deepneural networks. In Proc. EMNLP, 2014. → pages 68117[45] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with lstm.Neural Computation, 12:2451–2471, 1999. → pages 31, 74[46] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber. Learning precise timing with lstm recurrentnetworks. J. Mach. Learn. Res., 3:115–143, Mar. 2003. → pages 31, 74[47] S. Gleichman and Y. C. Eldar. Blind compressed sensing. IEEE Transactions on InformationTheory, 57(10):6958–6975, 2011. → pages 114[48] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. → pages 1[49] A. Graves. Sequence transduction with recurrent neural networks. In Representation LearningWorkshp, ICML, 2012. → pages 5, 14, 30, 74[50] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks.In Proc. ICASSP, Vancouver, Canada, May 2013. → pages 29[51] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwin´ska, S. G.Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols,G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis.Hybrid computing using a neural network with dynamic external memory. Nature, advance onlinepublication, 2016. → pages 1[52] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEEConference on Computer Vision and Pattern Recognition, CVPR, 2016. → pages ix, 5[53] K. M. Hermann and P. Blunsom. Multilingual models for compositional distributed semantics.arXiv preprint arXiv:1404.4641, 2014. accessed on December 13, 2016. → pages 29[54] G. Hinton. A practical guide to training restricted boltzmann machines: Version 1. Technical report,University of Toronto, 2010. → pages 103, 105[55] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks.Science, 313(5786):504 – 507, 2006. → pages 1, 14[56] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen,T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition:The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012.→ pages 2[57] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen,T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition:The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97,November 2012. → pages 68, 74[58] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke,P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speechrecognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012. → pages 1[59] G. E. Hinton. Training products of experts by minimizing contrastive divergence. NeuralComputation, 14(8):1771–1800, Aug. 2002. ISSN 0899-7667. → pages 74, 102, 103118[60] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks.Science, 313(5786):504–507, 2006. → pages 2, 103[61] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. NeuralComput., 18(7):1527–1554, July 2006. ISSN 0899-7667. → pages 2[62] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780,Nov. 1997. → pages 29, 31, 74[63] T. Hofmann. Probabilistic latent semantic analysis. In In Proc. of Uncertainty in ArtificialIntelligence, UAI99, pages 289–296, 1999. → pages 41[64] B. Hu, Z. Lu, H. Li, and Q. Chen. Convolutional neural network architectures for matching naturallanguage sentences. In Advances in Neural Information Processing Systems 27, pages 2042–2050.2014. → pages 29[65] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semanticmodels for web search using clickthrough data. In Proceedings of the 22Nd ACM InternationalConference on Conference on Information & Knowledge Management, CIKM ’13, pages2333–2338. ACM, 2013. → pages 27, 28, 30, 40, 68[66] H. Jaeger. The “echo state” approach to analysing and training recurrent neural networks. GMDReport 148, GMD - German National Research Institute for Computer Science, 2001. → pages 14,25[67] H. Jaeger. Short term memory in echo state networks. GMD Report 152, GMD - German NationalResearch Institute for Computer Science, 2001. → pages 14, 25[68] H. Jaeger. A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and theecho state network approach. Technical report, Fraunhofer Institute for Autonomous IntelligentSystems (AIS) since 2003: International University Bremen, 2005. → pages 15, 16, 17[69] H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy inwireless communication. Science, 304(5667):78–80, 2004. → pages 14, 16, 17, 25[70] K. Ja¨rvelin and J. Keka¨la¨inen. Ir evaluation methods for retrieving highly relevant documents. InProceedings of the 23rd Annual International ACM SIGIR Conference on Research andDevelopment in Information Retrieval, SIGIR, pages 41–48. ACM, 2000. → pages 40[71] S. Ji, Y. Xue, and L. Carin. Bayesian compressive sensing. Signal Processing, IEEE Transactionson, 56(6):2346–2356, 2008. → pages 72, 81, 105[72] S. Ji, D. Dunson, and L. Carin. Multitask compressive sensing. Signal Processing, IEEETransactions on, 57(1):92–106, 2009. → pages 9, 70, 72, 73, 81, 95, 97, 105[73] N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modellingsentences. Proceedings of the 52nd Annual Meeting of the Association for ComputationalLinguistics, June 2014. → pages 29[74] A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing and understanding recurrent networks. arXivpreprint arXiv:1506.02078, 2015. accessed on December 13, 2016. → pages 29[75] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-thoughtvectors. Advances in Neural Information Processing Systems (NIPS), 2015. → pages 28, 42119[76] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutionalneural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105.2012. → pages 1, 4[77] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. Proceedings ofthe 31st International Conference on Machine Learning, pages 1188–1196, 2014. → pages 27, 28,41, 42[78] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to documentrecognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.http://yann.lecun.com/exdb/mnist/ , accessed on December 13, 2016. → pages 79[79] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015. →pages 1[80] M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly. Compressed sensing mri. IEEE SignalProcessing Magazine, 25(2):72–82, 2008. → pages 7[81] H. Mamaghanian, N. Khaled, D. Atienza, and P. Vandergheynst. Compressed sensing for real-timeenergy-efficient ecg compression on wireless body sensor nodes. Biomedical Engineering, IEEETransactions on, 58(9):2456–2466, 2011. → pages 7[82] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensionaldata. The Annals of Statistics, 37:246 – 270, 2009. → pages 9, 70[83] D. Merhej, C. Diab, M. Khalil, and R. Prost. Embedding prior knowledge within compressedsensing by neural networks. IEEE Transactions on Neural Networks, 22(10):1638 –1649, oct. 2011.→ pages 72, 73, 74[84] G. Mesnil, X. He, L. Deng, and Y. Bengio. Investigation of recurrent-neural-network architecturesand learning methods for spoken language understanding. In Proc. INTERSPEECH, Lyon, France,August 2013. → pages 5, 30, 74[85] T. Mikolov, M. Karafia´t, L. Burget, J. Cernocky`, and S. Khudanpur. Recurrent neural network basedlanguage model. In Proc. INTERSPEECH, pages 1045–1048, Makuhari, Japan, September 2010. →pages 5, 14, 30, 35, 74[86] T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur. Extensions of recurrent neuralnetwork language model. In Proc. IEEE ICASSP, pages 5528–5531, Prague, Czech, May 2011. →pages 16[87] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vectorspace. arXiv preprint arXiv:1301.3781, 2013. accessed on December 13, 2016. → pages 28[88] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vectorspace. In International Conference on Learning Representations (ICLR), 2013. URLarXiv:1301.3781. → pages 5, 41[89] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of wordsand phrases and their compositionality. In Proceedings of Advances in Neural InformationProcessing Systems, pages 3111–3119, 2013. → pages 28120[90] A. Milenkovi, C. Otto, and E. Jovanov. Wireless sensor networks for personal health monitoring:Issues and an implementation. Computer Communications (Special issue: Wireless SensorNetworks: Performance, Reliability, Security, and Beyond, 29:2521–2533, 2006. → pages 7[91] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou,H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deepreinforcement learning. Nature, 518:529–533, 2015. → pages 1[92] A. Mousavi, A. B. Patel, and R. G. Baraniuk. A deep learning approach to structured signalrecovery. arXiv, abs/1508.04065, 2015. http://arxiv.org/abs/1508.04065 , accessed on December 13,2016. → pages 72, 74[93] Y. Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2).Soviet Mathematics Doklady, 27:372–376, 1983. → pages 34, 35, 78[94] H. Palangi, L. Deng, and R. Ward. Learning input and recurrent weight matrices in echo statenetworks. In NIPS Workshop on Deep Learning, December 2013.http://research.microsoft.com/apps/pubs/default.aspx?id=204701 , accessed on December 13, 2016.→ pages 74[95] H. Palangi, R. Ward, and L. Deng. Using deep stacking network to improve structured compressedsensing with multiple measurement vectors. In 2013 IEEE International Conference on Acoustics,Speech and Signal Processing, pages 3337–3341, May 2013. → pages 72, 74, 81, 95, 97, 105[96] H. Palangi, L. Deng, and R. Ward. Recurrent deep-stacking networks for sequence classification. InSignal and Information Processing (ChinaSIP), 2014 IEEE China Summit International Conferenceon, pages 510–514, July 2014. → pages 74[97] H. Palangi, R. Ward, and L. Deng. Distributed compressive sensing: A deep learning approach.IEEE Transactions on Signal Processing, 64(17):4504–4518, Sept 2016. → pages 95, 97[98] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. InICML 2013, volume 28 of JMLR Proceedings, pages 1310–1318. JMLR.org, 2013. → pages 17, 23,35[99] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. InProc. ICML, Atlanta, GA, June 2013. → pages 14, 24[100] K. B. Petersen and M. S. Pedersen. The matrix cookbook. Technical report, 2008. → pages 20[101] R. Rˇehu˚rˇek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. InProceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50,Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en , accessed on December13, 2016. → pages 42[102] M. Research. http://research.microsoft.com/en-us/projects/objectclassrecognition. accessed onDecember 13, 2016. → pages 79, 86[103] A. J. Robinson. An application of recurrent nets to phone probability estimation. IEEE Transactionson Neural Networks, 5(2):298–305, August 1994. → pages 5, 14, 16, 30, 74121[104] F. Rosenblatt. The perceptron: A perceiving and recognizing automaton. Technical report, CornellAeronautical Laboratory Report, 1957. → pages 2[105] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagatingerrors. Nature, 323:533–536, 1986. → pages 2[106] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge.International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. → pages 4[107] H. Sak, A. Senior, and F. Beaufays. Long short-term memory recurrent neural network architecturesfor large scale acoustic modeling. In Proceedings of the Annual Conference of International SpeechCommunication Association (INTERSPEECH), 2014. → pages 29[108] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. A latent semantic model with convolutional-poolingstructure for information retrieval. CIKM, November 2014. → pages 28, 30, 40, 68[109] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser,I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner,I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering thegame of go with deep neural networks and tree search. Nature, 529:484–503, 2016. → pages 1[110] P. Smolensky. In D. E. Rumelhart and J. L. McClelland, editors, Parallel distributed processing:explorations in the microstructure of cognition, vol. 1, chapter Information processing in dynamicalsystems: foundations of harmony theory, pages 194–281. MIT Press, Cambridge, MA, USA, 1986.ISBN 0-262-68053-X. → pages 102[111] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. Semi-supervised recursiveautoencoders for predicting sentiment distributions. In Proceedings of the Conference on EmpiricalMethods in Natural Language Processing, EMNLP ’11, pages 151–161, 2011. ISBN978-1-937284-11-4. → pages 28[112] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory networks. In Advances inNeural Information Processing Systems 28, pages 2440–2448. 2015. → pages 6[113] I. Sutskever. Training Recurrent Neural Networks. PhD thesis, University of Toronto, 2013. →pages 16[114] I. Sutskever, G. E. Hinton, and G. W. Taylor. The recurrent temporal restricted boltzmann machine.In NIPS, pages 1601–1608, 2008. → pages 23[115] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization andmomentum in deep learning. In ICML (3)’13, pages 1139–1147, 2013. → pages 34, 78[116] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. InProceedings of Advances in Neural Information Processing Systems, pages 3104–3112, 2014. →pages 27, 29[117] M. E. Tipping. Sparse bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1:211–244, Sept. 2001. → pages 73[118] F. Triefenbach, A. Jalalvand, B. Schrauwen, and J.-P. Martens. Phoneme recognition with largehierarchical reservoirs. Advances in Neural Information Processing Systems, 2009. → pages 14, 16,17, 25122[119] J. Tropp. Algorithms for simultaneous sparse approximation. part II: Convex relaxation. SignalProcessing, 86(3):589 – 602, 2006. → pages 9, 70[120] J. Tropp, A. Gilbert, and M. Strauss. Algorithms for simultaneous sparse approximation. part I:Greedy pursuit. Signal Processing, 86(3):572 – 588, 2006. → pages 9, 70, 97[121] E. van den Berg and M. P. Friedlander. Probing the pareto frontier for basis pursuit solutions. SIAMJournal on Scientific Computing, 31(2):890–912, 2008. → pages 9, 70[122] N. Vaswani. Ls-cs-residual (ls-cs): Compressive sensing on least squares residual. IEEETransactions on Signal Processing, 58(8):4108–4120, 2010. → pages 72, 74, 81[123] P. Vincent. A connection between score matching and denoising autoencoders. Neural Comput., 23(7):1661–1674, July 2011. → pages 74[124] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust featureswith denoising autoencoders. ICML, pages 1096–1103, 2008. → pages 74[125] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator.In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Boston, MA, USA, pages3156–3164, 2015. → pages 1[126] D. P. Wipf and B. D. Rao. An empirical bayesian strategy for solving the simultaneous sparseapproximation problem. Signal Processing, IEEE Transactions on, 55(7):3704–3716, 2007. →pages 9, 70, 72, 73[127] D. H. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992. → pages 14[128] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio.Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the32nd International Conference on Machine Learning, ICML, Lille, France, pages 2048–2057, 2015.→ pages 1[129] D. Yu and L. Deng. Deep learning and its applications to signal and information processing[exploratory dsp]. IEEE Signal Processing Magazine, 28(1):145 –154, jan. 2011. → pages 1, 2, 68[130] D. Yu and L. Deng. Efficient and effective algorithms for training single-hidden-layer neuralnetworks. Pattern Recognition Letters, 33(5):554–558, 2012. → pages 18, 21, 102[131] D. Yu and L. Deng. Automatic Speech Recognition - A Deep Learning Approach. Springer, 2015. →pages 1[132] J. Zhang, S. Liu, M. Li, M. Zhou, and C. Zong. Bilingually-constrained phrase embeddings formachine translation. In Proceedings of the 52nd Annual Meeting of the Association forComputational Linguistics (ACL) (Volume 1: Long Papers), pages 111–121, Baltimore, Maryland,2014. → pages 29[133] Z. Zhang and B. D. Rao. Sparse signal recovery with temporally correlated source vectors usingsparse bayesian learning. Selected Topics in Signal Processing, IEEE Journal of, 5(5):912–926,2011. → pages 9, 70, 72, 73, 81, 95, 97, 105[134] P. Zhao and B. Yu. On model selection consistency of lasso. J. Mach. Learn. Res., 7:2541–2563,2006. ISSN 1532-4435. → pages 9, 70123[135] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligningbooks and movies: Towards story-like visual explanations by watching movies and reading books.arXiv preprint arXiv:1506.06724, 2015. accessed on December 13, 2016. → pages 28, 42124
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Deep learning for sequence modelling : applications...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Deep learning for sequence modelling : applications in natural languages and distributed compressive… Palangi, Hamid 2017
pdf
Page Metadata
Item Metadata
Title | Deep learning for sequence modelling : applications in natural languages and distributed compressive sensing |
Creator |
Palangi, Hamid |
Publisher | University of British Columbia |
Date Issued | 2017 |
Description | The underlying data in many machine learning tasks have a sequential nature. For example, words generated by a language model depend on the previously generated words, behavior of a user in a social network evolves over different snapshots of the social graph over time, different speech frames in a speech recognition system depend on the previously generated frames, etc. The main question is, how can we leverage the sequential nature of data to extract better features for the target machine learning task? In an effort to address this question, this thesis presents three important applications of deep sequence modelling methods. The first application is sentence modelling for web search task where the question addressed is: How to create a vector representation for a natural language sentence, aimed at a specific task such as web search? We propose Long Short-Term Memory Deep Structured Semantic Model (LSTM-DSSM), a model for information retrieval on click-through data with significant performance gains compared to existing state of the art baselines. The proposed LSTM-DSSM model sequentially takes each word in a sentence, extracts its relevant information, and embeds it into a semantic vector. The second application involves distributed compressive sensing, where the main questions addressed are: (a) How to relax the joint sparsity constraint? (b) How to exploit the structural dependency of a group of sparse vectors to reconstruct them better from down-sampled measurements (structures besides sparsity)? (c) How to exploit available offline data during sparse reconstruction at the decoder? We present a deep learning approach to distributed compressive sensing and show that it addresses the above three questions and is almost as fast as greedy methods during reconstruction. The third application is related to speech recognition. The question addressed here is: How to build a recurrent acoustic model for the task of phoneme recognition? We present a Recurrent Deep Stacking Network (R-DSN) architecture for this task. Each module in the R-DSN is initialized with an Echo State Network (ESN), and then all connection weights within the module are fine tuned. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2017-04-10 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0343522 |
URI | http://hdl.handle.net/2429/61157 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2017-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2017_may_palangi_hamid.pdf [ 5.67MB ]
- Metadata
- JSON: 24-1.0343522.json
- JSON-LD: 24-1.0343522-ld.json
- RDF/XML (Pretty): 24-1.0343522-rdf.xml
- RDF/JSON: 24-1.0343522-rdf.json
- Turtle: 24-1.0343522-turtle.txt
- N-Triples: 24-1.0343522-rdf-ntriples.txt
- Original Record: 24-1.0343522-source.json
- Full Text
- 24-1.0343522-fulltext.txt
- Citation
- 24-1.0343522.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0343522/manifest