UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Deep learning with exchangeable tensors Graham, Devon R. 2018

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2018_september_graham_devon.pdf [ 1.64MB ]
JSON: 24-1.0370999.json
JSON-LD: 24-1.0370999-ld.json
RDF/XML (Pretty): 24-1.0370999-rdf.xml
RDF/JSON: 24-1.0370999-rdf.json
Turtle: 24-1.0370999-turtle.txt
N-Triples: 24-1.0370999-rdf-ntriples.txt
Original Record: 24-1.0370999-source.json
Full Text

Full Text

Deep Learning with Exchangeable TensorsbyDevon R. GrahamB. Arts, Bishop’s University, 2010B. Computer Science, The University of British Columbia, 2016A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University of British Columbia(Vancouver)August 2018c© Devon R. Graham, 2018The following individuals certify that they have read, and recommend to theFaculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:Deep Learning with Exchangeable Tensorssubmitted by Devon R. Graham in partial fulfillment of the requirements forthe degree of Master of Sciencein Computer ScienceExamining Committee:Siamak Ravanbakhsh, Computer ScienceSupervisorLaks V.S. Lakshmanan, Computer ScienceSecond ReaderiiAbstractIn this work we use deep learning to model interactions across two or more setsof objects, such as user–movie ratings, protein–drug bindings, or ternary user-item-tag interactions. The canonical representation of such interactions is a matrix(or a higher-dimensional tensor) with an exchangeability property: the encoding’smeaning is not changed by permuting rows or columns. We argue that modelsshould hence be Permutation Equivariant (PE): constrained to make the same pre-dictions across such permutations. We present a parameter-sharing scheme andprove that it could not be made any more expressive without violating PE. Thisscheme yields three benefits. First, we demonstrate state-of-the-art performanceon multiple matrix completion benchmarks. Second, our models require a numberof parameters independent of the numbers of objects, and thus scale well to largedatasets. Third, models can be queried about new objects that were not availableat training time, but for which interactions have since been observed. In experi-ments, our models achieved surprisingly good generalization performance on thismatrix/tensor extrapolation task, both within domains (e.g., new users and newmovies drawn from the same distribution used for training) and even across do-mains (e.g., predicting music ratings after training on movie ratings).iiiLay SummaryWe propose a model for predicting interactions among entities, such as users’ rat-ings of movies. Given a table with rows for users and columns for movies, swap-ping two rows (or columns), does not change the information in the table, andwe should make the same predictions. We prove that our model does this for anyshuffling of the rows/columns, and that any model to do so must have this form.We achieve state-of-the-art results on a number of benchmark tasks. We alsoachieve impressive results when extrapolating to new data. For example, movieratings for a new user, or, surprisingly, using a model that was taught to predictmovie ratings to predict song ratings for a different group of users. We are notrestricted to user-item ratings. Our model can be applied to predict protein-drugbindings, or even values in higher-dimensional tables, such as users tagging friendsin photos.ivPrefaceThe work presented in this thesis was performed in collaboration with Jason Hart-ford, and our respective supervisors Siamak Ravanbakhsh and Kevin Leyton-Brown.It has been accepted for publication at the 35th International Conference on Ma-chine Learning, 2018.Writing of the tensorflow code and running of experiments was done in roughlyequal proportion by Jason and me, starting from a code skeleton provided by Sia-mak. I developed all the proofs of the theoretical guarantees of our model startingfrom an initial proposition by Siamak, and I extended the model’s formulation andproofs to higher dimensional tensors. Writing of the main text of the conferencepaper was done roughly equally by all of us.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Matrix Completion for Recommender Systems . . . . . . . . . . 42.1.1 Inductive and Transductive Analysis . . . . . . . . . . . . 52.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2.1 Feed-Forward Neural Networks . . . . . . . . . . . . . . 62.2.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . 83 Exchangeable Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.1 Exchangeable Matrix Layer . . . . . . . . . . . . . . . . . . . . . 11vi3.1.1 Multiple Input–Output Channels . . . . . . . . . . . . . . 163.1.2 Input Features for Rows and Columns . . . . . . . . . . . 173.1.3 Jointly Exchangeable Matrices . . . . . . . . . . . . . . . 173.1.4 Sparse Inputs . . . . . . . . . . . . . . . . . . . . . . . . 183.2 Extension to General Tensors . . . . . . . . . . . . . . . . . . . . 184 Network Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 234.1 Self-Supervised Exchangeable Model . . . . . . . . . . . . . . . 234.2 Factorized Exchangeable Autoencoder . . . . . . . . . . . . . . . 244.3 Chanel-wise Dropout . . . . . . . . . . . . . . . . . . . . . . . . 254.4 Subsampling in Large Datasets . . . . . . . . . . . . . . . . . . . 254.4.1 Uniform sampling . . . . . . . . . . . . . . . . . . . . . 254.4.2 Conditional sampling . . . . . . . . . . . . . . . . . . . . 264.4.3 Sampling at test time . . . . . . . . . . . . . . . . . . . . 265 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 285.1 Details of Architecture and Training . . . . . . . . . . . . . . . . 295.1.1 Self-Supervised Model. . . . . . . . . . . . . . . . . . . . 295.1.2 Factorized Exchangeable Autoencoder Model. . . . . . . 295.2 Transductive Setting (Matrix Interpolation) . . . . . . . . . . . . 305.3 Inductive Setting (Matrix Extrapolation) . . . . . . . . . . . . . . 315.4 Extrapolation to New Datasets . . . . . . . . . . . . . . . . . . . 325.5 Effects of Sub-Sampling Procedures . . . . . . . . . . . . . . . . 346 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38A Supporting Materials . . . . . . . . . . . . . . . . . . . . . . . . . . 42A.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42A.2 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43A.2.1 Proof of Proposition 3.2.1 . . . . . . . . . . . . . . . . . 43A.2.2 Proof of Proposition 3.2.2 . . . . . . . . . . . . . . . . . 43A.2.3 Proof of Theorem 3.2.1 . . . . . . . . . . . . . . . . . . . 44viiA.2.4 Proof of Theorem 3.1.1 . . . . . . . . . . . . . . . . . . . 46viiiList of TablesTable 5.1 Number of users, items and ratings for the data sets used inour experiments. MovieLens data sets are standard [14], as isNetflix, while for Flixster, Douban and Yahoo Music we usedthe 3000× 3000 submatrix presented by [23] for comparisonpurposes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28Table 5.2 Comparison of RMSE scores for the MovieLens-100k dataset,based on the canonical 80/20 training/test split. Baseline num-bers are taken from [2]. . . . . . . . . . . . . . . . . . . . . . 30Table 5.3 Comparison of RMSE scores for the MovieLens-1M dataset onrandom 90/10 training/test split. Baseline numbers are takenfrom [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31Table 5.4 Evaluation of our model’s ability to generalize across datasets.We trained a factorized model on ML100k and then evaluated iton four new datasets. Results for the smaller datasets are takenfrom [2]. Netflix Baseline shows state of the art prior to theNetflix Challenge. . . . . . . . . . . . . . . . . . . . . . . . . 33ixList of FiguresFigure 3.1 Commutative diagram representing the layer Y = vec−1(σ(Wvec(X))).Here f (·) = vec−1(σ(Wvec(·))) is the layer defined in Section3.1, and pi(·) = G(N) ·G(M) is the permutation of the rows/-columns applied to the input/output matrix (i.e., G(N) ∈{0,1}N×Nand G(M) ∈ {0,1}M×M are permutation matrices). . . . . . . . 12Figure 3.2 Parameter sharing in an exchangeable matrix layer, where theelements of the input and output are scalars. The application ofdifferent tied parameters to input elements is illustrated usingdark blue for w1, green for w2, red for w3, and light blue forw4. The same structure (with the same four parameters) repeatsacross all units of the output layer. . . . . . . . . . . . . . . 14Figure 3.3 Pooling structure implied by the tied weights for matrices (left)and 3D tensors (right). The pink cube highlights one elementof the output. It is calculated as a function of the correspond-ing element from the input (dark blue), pooled aggregationsover the rows and columns of the input (green and yellow),and pooled aggregation over the whole input matrix (red). Inthe tensor case (right), we pool over all sub-tensors (orangesub-matrices, green sub-vectors and red scalar). For clarity,the output connections are not shown in the tensor case. . . . 16xFigure 3.4 Structure of our parameter matrix for the 1D (left), 2D (centre),and 3D (right) input arrays. The 1D case corresponds to thepermutation equivariant layer proposed in Deep-Sets [37] with4 elements. The 2D case corresponds to a 4× 3 input matrix,and the 3D case corresponds to a 4× 3× 2 input tensor. Theparameter-sharing patterns for the weight matrix of the higherdimensional arrays can be produced via the Kronecker productof the weight matrix for the 1D array (i.e., vector). . . . . . . 20Figure 4.1 Factorized exchangeable autoencoder. The encoder maps fromthe input tensor to an embedding layer of row / column factorsvia one or more hidden layers. The decoder attempts to recon-struct the input using the factors via one or more hidden layers. 24Figure 4.2 Uniform sampling (left) selects samples (red) uniformly fromthe non-zero indices of the the matrix X while conditional sam-pling (right) first samples a set of rows (shown in orange) fromthe row marginal distribution (green) and then selects a samplefrom the resulting column conditional distribution. . . . . . . 26Figure 5.1 Evaluation of our model’s ability to generalize. We trained onML-100k and evaluated on a subset of the ML-1M data. Atevaluation time, p% of the ML-1M data was treated as ob-served and the model was required to complete the remaining(1− p)% (with p varied from 5% to 95%). The model out-performs nearest-neighbour approaches for all values of p anddegrades gracefully to the baseline of predicting the mean inthe small data case. . . . . . . . . . . . . . . . . . . . . . . . 32Figure 5.2 Performance difference between sampling methods on the ML-100k. The two sampling methods use minibatches of size 20 000,while the full batch method used all 75 000 training examples.Note that the y-axis does not begin at 0. . . . . . . . . . . . . 34xiAcknowledgementsI would first and foremost like to thank my supervisor, Siamak Ravanbakhsh, forhis support and patience, for many helpful conversations and suggestions, and forkeeping me on track during the course of this project.I also cannot thank Jason Hartford enough, first for his own contributions tothis project, and second for his guidance to me in navigating this new world. I amalso grateful to Kevin Leyton-Brown for many stimulating discussions, and for hisefforts to make this the best project it could possibly be.Last but not least, I would like to thank my family for helping me get to whereI am today. None of this would have been possible without their ceaseless encour-agement.xiiFor SharonxiiiChapter 1IntroductionAs a motivating example, suppose we are given a set of users N= {1, . . . ,N}, a setof movies M = {1, . . . ,M}, and some data describing their interaction in the formof tuples X = 〈n,m,x〉 with n ∈ N, m ∈M and x ∈ R. For example, x may be therating that user n assigns to movie m. Our goal is to learn a function that models thisinteraction. That is, we seek to learn a mapping from N×M to R. The canonicalrepresentation of such a function is a matrix X ∈ RN×M, where we have Xn,m = xfor each 〈n,m,x〉 ∈ X. Learning this function corresponds to matrix completion:using existing patterns inX to predict the values of any missing elements of X. Thebasis of this work is the following intuitive observation: if we were to swap tworows (or two columns) of X, the resulting matrix does not contain fundamentallydifferent information from the original, and we should not expect our predictionsof missing values to be different in the two matrices. Indeed, this swapping simplycorresponds to a re-labeling of some users (or movies), and so should certainly notaffect the values of any predictions made. X is what we will call an exchangeablematrix: any row- and column-wise permutation of X represents the same set ofratings and hence the same matrix completion problem.Exchangeability has a long history in machine learning and statistics. de Finetti’stheorem states that exchangeable sequences are the product of a latent variablemodel. Extensions of this theorem characterize distributions over other exchange-able structures, from matrices to graphs; see Orbanz and Roy [24] for a detailedtreatment. In machine learning, a variety of frameworks formalize exchangeability1in data, from plate notation to statistical relational models [10, 25]. When dealingwith exchangeable arrays (or tensors), a common approach is tensor factorization.In particular, one thread of work leverages tensor decomposition for inference inlatent variable models [1]. However, tensor factorization requires retraining mod-els for each new input matrix, and does not allow for the increased expressivitythat deep models can provide through the use of multiple layers with more fea-ture channels. Not all domains are inherently exchangeable. For example, thereis information contained in the ordering of temporal data. Specifically, we couldhave users’ purchase history over days of the year. In this case the users would beexchangeable but not the days, as purchasing patterns would be expected to changeover time (e.g., to increase in advance of the Christmas season). But exchangeabil-ity is a powerful idea that we believe should be taken advantage of when possible.A natural goal in the setting of exchangeable data is to define a learning al-gorithm that is constrained to make the same predictions across all exchangeableinputs. That is, a learning algorithm which makes the same predictions regardlessof which permuted instance of the input data it is presented with. We call such alearning algorithm Permutation Equivariant (PE) and argue that PE is an impor-tant form of inductive bias in exchangeable domains. It is, however, not trivial toachieve PE. Indeed, all existing parametric factorization and matrix/tensor com-pletion methods associate unique parameters with each row and column, and thusare not PE. In addition to requiring retraining to handle new input matrices, havingunique row and column parameters means such models may make different predic-tions when presented with a row- or column-permuted version of the input. Suchmodels are therefor learning patterns in the ordering of the data that only exist inthe particular sample they are presented with, and are not truly present in the pop-ulation as a whole. Considering PE in this context, we can see that enforcing it isequivalent to augmenting our data with all possible row and column permutationsof the input, and so it is a powerful and fundamental property.So how can we enforce PE? Indeed, a simple and naive approach would beto augment the input data with all N!×M! permutations of its rows and columns.However, this is obviously computationally wasteful and becomes completely in-feasible for all but the smallest N and M. Instead, we show that a simple parameter-sharing scheme suffices to ensure that a deep model can represent only PE func-2tions. The result is analogous to the idea of a convolutional layer as found in Con-vNets: a lower-dimensional effective parameter space enforces a desired equivari-ance property. Indeed, parameter-sharing is a generic and efficient approach forachieving equivariance in deep models [27]. In the case of ConvNets, this equiv-ariance is with respect to translation (e.g., rotation) of the pixels in an image, whilein our setting it is with respect to permutations of the rows and/or columns of theinput matrix.When the matrix models the interaction between the members of the samegroup (such as a user-user friend relation), one could further constrain permuta-tions to be identical across both rows and columns. An example of such a jointlyexchangeable matrix [24], modelling the interaction of the nodes in a graph, is theadjacency matrix employed by graph convolution. Our approach reduces to graphconvolution in the special case of 2D arrays with this additional parameter-sharingconstraint, but also applies to arbitrary matrices and higher dimensional arrays.Finally, we explain connections to some closely related past work. First, a sim-ilar parameter-sharing scheme was introduced in the context of behavioral gametheory by Hartford et al. [15]. In this work, rows and columns represent players’actions and the exchangeable matrix encodes their payoffs under these actions. Ourwork provides a theoretical foundation for these ideas and shows how a similar ar-chitecture can be applied in a much more general setting. Second, our model forexchangeable matrices can be seen as a generalization of the deep sets architec-ture [37], where a set can be seen as a one-dimensional exchangeable array. (i.e.,an exchangeable vector).The rest of this thesis is laid out as follows: Chapter 2 discusses backgroundmaterial and some relevant related work. Chapter 3 introduces our proposed neu-ral network layer for matrix-structured inputs, with extensions to multiple inputchannels, as well as sparse and higher-dimensional inputs. Chapter 4 outlines twonetwork architectures in which our proposed layer can be used. Finally, we discussour experimental results in Chapter 5, and finish with some concluding remarks inChapter 6. All proofs, as well as a list of notation, can be found in the Appendix.3Chapter 2BackgroundThis work is an application of deep learning techniques to the problem of matrixcompletion, which in turn is a natural way of viewing recommender systems. Inthis chapter we discuss relevant works and ideas relating to such matrix comple-tion problems, as well as outline some key concepts from deep learning that areemployed by our model. Section 2.1 surveys some recent approaches to matrixcompletion, and Section 2.2 discusses some basics of deep learning.2.1 Matrix Completion for Recommender SystemsRecommender systems are very widely applied, with many modern applicationssuggesting new items (e.g., movies, friends, restaurants, etc.) to users based onprevious ratings of other items. The core underlying problem is naturally posedas a matrix completion task: each row corresponds to a user and each columncorresponds to an item; the matrix has a value for each rating given to an item by auser, and the goal is to fill in the missing values of this matrix.Broadly speaking, recommender systems fall into one of two categories: content-based models, and collaborative filtering models. Content-based models make useof particular features of users and items (such as a movie’s genre) to predict futureratings. Collaborative filtering approaches make predictions by modelling interac-tions between users and items.The literature in matrix factorization and completion is vast. The classical4approach to solving the matrix completion problem is to find some low rank (orsparse) approximation that minimizes a reconstruction loss for the observed rat-ings [see e.g., 4, 18, 22]. Because these procedures learn embeddings for each userand item to make predictions, they are transductive, meaning they can only makepredictions about users and items observed during training. To our knowledge thisis also true for all recent deep factorization and other collaborative filtering tech-niques [e.g., 7, 9, 21, 28, 31, 35, 38]. An exception is a recent work by Yang et al.[36] that extends factorization-style approaches to the inductive setting (where pre-dictions can be made on unseen users and items). However their method relies onadditional side information to represent users and items. By contrast, our approachis able to make inductive completions on rating matrices that may differ from thatwhich was observed during training without using any side information (thoughour approach can easily incorporate such side information).2.1.1 Inductive and Transductive AnalysisIn the problem of matrix completion, during training we are given a sparse inputmatrix Xtr with observed entries Itr = {(n,m)}. At test time, we are given Xtswith observed entries Its = {(n′,m′)}, and we are interested in predicting (someof) the missing entries of Xts, expressed through I′ts. In the transductive or matrixinterpolation setting, Itr and Its have overlapping rows and/or columns—that is, atleast one of the following is true: {n | (n,m) ∈ Itr} ∩ {n′ | (n′,m′) ∈ Its} 6= /0 or{m | (n,m) ∈ Itr}∩{m′ | (n′,m′) ∈ Its} 6= /0. In fact, often Xtr and Xts are identical.Conversely, in the inductive or matrix extrapolation setting, we are interested inmaking predictions about completely unseen entries: the training and test row/col-umn indices are completely disjoint. We will even consider cases where Xtr and Xtsare completely different datasets—e.g., movie ratings vs music ratings. The samedistinction applies in matrix factorization. Training a model to factorize a particu-lar given matrix is transductive, while factorizing unseen matrices after training isinductive.Matrix completion may also be posed as predicting edge weights in bipartitegraphs [2, 23], where we have a set of nodes N representing users, and anotherset of nodes M representing items. An edge 〈n,m〉 is absent if user n has not yet5interacted with item m, and has one of r possible edges equal to the rating value thatuser n has assigned to item m otherwise [2]. The objective then is to predict ratingsfor the missing edges (i.e., to predict which of the r possible edges is most likely).This approach builds on recent work applying convolutional neural networks tograph-structured data [3, 6, 8, 12, 17, 29]. Parameter sharing in graph convolutionis closely related to parameter sharing in exchangeable matrices, and indeed itcan be seen as a special case of our model where w2 = w3 (see Equation 3.2 ofChapter 3). The mapping between the two methods may be seen by consideringthe adjacency matrix of the graph as the input matrix to our exchangeable matrixlayer (see Section 3.1 of Chapter 3). The key difference in the two approachesis that our exchangeable matrix layer (see Section 3.1) is exactly equivariant toall permutations of rows and columns of the adjacency matrix, while the graphconvolution layer is equivariant to all permutations within the automorphism groupof the graph [27].2.2 Deep LearningDeep learning has enjoyed much success in recent years, particularly in domainsrelated to vision and natural language. For recent surveys of these applications see[19, 30]. The key feature of these models is that they are built up modularly fromsmaller building blocks, and as a result they are able to learn richer representationsof the data through these compositions.2.2.1 Feed-Forward Neural NetworksThe simplest form of deep networks are known as feed-forward networks (or some-times multi-layer perceptrons). These are constructed from individual buildingblocks known as “layers”. Each layer consists of two parts, a linear mapping ofan input vector to a output vector, followed by a non-linear activation function thatis applied element-wise. That is, given an input vector x ∈ RN , a neural networklayer is a function mapping from RN to RM of the form:y = σ(Wx+b) (2.1)6Where W is an M × N matrix containing real parameter values referred to as“weights”, b is a length-M real vector whose entries are referred to as a “biases”,and σ is a non-linear, element-wise activation function. Common choices for σinclude ReLU, σReLU(x) = max(0,x), or sigmoid, σsig(x) = 11+e−x .Deep feed-forward networks are built by composing multiple of these layerstogether, providing the output from one layer as the input to the next. The firstand last layer are known as the “input” and “output” layers, respectively, whileintermediate layers are referred to as “hidden” layers. A K-layer feed-forwardnetwork is thus a function f which maps vectors x from RN to vectors in RM,defined as follows:f (x) = WK σ(WK−1 σ(. . . σ(W2 σ(W1x+b1)+b2) . . .))+bK (2.2)Where each Wk is a weight matrix of appropriate dimension, and each bk is a biasvector of appropriate length. Together, the Wk’s and bk’s are referred to as themodel’s “parameters”. The values of the parameters which produce the best modelcan be found by minimizing some loss function which captures the difference be-tween f (x), the models’s output, and yobs, the observed or expected output. Thisprocess of finding optimal parameter values is referred to as “training”.Feed-forward networks are powerful function approximators. One of theirmost championed features is the so-called universal approximation theorem [5],which states that a simple feed-forward network with just a single hidden layercan approximate any continuous function over compact subsets of RN with arbi-trary accuracy. In fact, feed-forward networks are powerful enough that they canoften simply use their parameters to memorize their training data, and will thengeneralize poorly to new observations after training. For example, a model mayeasily learn through training to identify the objects in some particular image, butcompletely fail to recognize these same objects when presented with a new image.This is referred to as overfitting, and is clearly undesirable. One way to understandoverfitting is that a large enough model will not only learn the patterns in the train-ing data, but will also learn to fit the random noise present in these observations.In this case, the model will not generalize well to new observations, since it willexpect to see the same random noise patterns. The next section discusses some7common methods for combatting overfitting.2.2.2 RegularizationRegularization is a general technique for preventing overfitting. The key idea is toconstrain the model in some way, limiting its ability to simply memorize the inputdata, and force it to learn a more meaningful representation of that data. The hopeis then that the model will perform better when presented with new observationsfrom the data-generating distribution that were unseen at training time. Regular-ization has many benefits. It can help with model interpretability by limiting thevalues parameters can take, as with LASSO regression [33]. Adding L1- or L2- reg-ularization terms to regression models is equivalent to enforcing a Bayesian prioron the parameters, and can help make an ill-formed problem well-formed. How-ever, the prevention of overfitting is the most important benefit of regularizationfor our purposes, and we focus on this feature.DropoutOne of the simplest methods of regularization in deep models is dropout [32].While simple, this method can be extremely effective and has the benefit of beingcomputationally quite cheap. At every training iteration, each entry of the vectory in 2.1 is randomly set to 0 with some probability p before being fed into thesubsequent layer in 2.2. Intuitively this has the effect of producing a more robust,“thinned” model which must learn a richer representation of the data to reducetraining loss. More concretely, dropout may be seen as a form of approximatemodel averaging over all 2S possible thinned models (where S is the sum of thelengths of the input vectors to each layer). Combining multiple models can oftenachieve better performance than any individual model, but training multiple deepneural networks is often prohibitively expensive. In the case of dropout however,this approximate averaging is achieved extremely efficiently in an empirical sense,as the training procedure is equivalent to simultaneously training multiple such“thinned” models, and then taking an approximate average over all of them.8Parameter SharingAnother simple and efficient method of regularization is parameter sharing, wheredistinct elements of the Wk’s in Equation 2.2 are constrained to take the same value(i.e., are “tied” together). The particular choice of which distinct elements are tiedtogether is problem specific, but is always chosen deterministically. A commonand well-known example is the use of convolution layers in ConvNets. Here, theweights are tied together so that the matrix multiplication in 2.1 is equivalent toconvolving each element of the input with the same small parameter matrix. Aswith dropout, parameter sharing has the benefits of being simple to implement andcomputationally very cheap.One approach that is closely related to the method used in this thesis is to letthe group symmetries of the input data inform the parameter sharing scheme of thenetwork [27]. When the goal is to design a model that is equivariant with respect tosome permutation group, then this group can be used to determine the form of theparameter tying in the Wk matrices. For example, if both the input vector x and out-put vector y represent sets of N elements, then the group in question should be thesymmetric group on N elements (since the order of elements in a set is irrelevant),and the weight matrix W must be such that there are two distinct parameters1 usedto produce the elements of the output y: one that interacts with the correspondingelement of the input, and one that interacts with all other elements. This thesis isable to show that in fact such parameter tying schemes are the only way to alwaysachieve the desired permutation equivariance (see Section 3.2).Another highly successful application of parameter sharing that can be seen asrelated to the method in this thesis is convolutional neural networks (CNNs). Thesemodels are responsible for many impressive results in problems related to machinevision [19]. Inspired by the neural structure of the visual cortex of mammals,these models tie parameters over different, overlapping spacial regions of the inputimage. In this way local regions share information with one another, and withenough layers, the model is able to identify ever more complex features, from edgesto shapes, and even faces. These models are equivariant with respect to translationof the input image, while our models are equivariant with respect to permutations1In practice, we would likely have multiple input and output channels, and these distinct param-eters would in fact be matrices.9along the dimensions of the input tensors (e.g. permutations of matrix rows orcolumns).10Chapter 3Exchangeable LayerWe now describe the neural network layer that is the main building block of ourdeep models. We assume we are given an input matrix with exchangeable rowsand columns. That is, the rows and/or columns of our input matrix may be readilyswapped with one another without changing the underlying structure or value ofthe data. For clarity, we start by limiting ourselves to the case of matrices (i.e., 2Darrays), while in Section 3.2 we extend our analysis to arrays of higher dimension(i.e., general tensors).3.1 Exchangeable Matrix LayerLet X ∈ RN×M be our exchangeable input matrix. We will use vec(X) ∈ RNM todenote its vectorized form. That is, vec(X) is the vector created by stacking upthe columns of X. We use vec−1(x) to denote the inverse of this vectorization. Sovec−1(x) reshapes a vector of length NM into an N×M matrix, such that we havevec−1(vec(X)) = X.Now, consider a fully connected layer Y := vec−1(σ(Wvec(X))) where σ isan element-wise nonlinear function such as sigmoid, W ∈ RNM×NM is a weightmatrix, and Y ∈ RN×M is the output matrix. Exchangeablity of X motivates thefollowing desirable property: suppose we permute the rows and columns X usingtwo arbitrary permutation matrices G(N) ∈ {0,1}N×N and G(M) ∈ {0,1}M×M to getX′ := G(N)XG(M). Enforcing Permutation Equivariance (PE) requires that the new11output matrix Y′ := vec−1(σ(W vec(X ′))) experiences the same permutation of itsrows and columns. That is, we require our model to make the same predictionsof missing values in X′ as in X. In other words, we require that Y′ = G(N)YG(M).Figure 3.1 shows a commutative diagram which intuitively represents the commu-tativity of our proposed layer.Figure 3.1: Commutative diagram representing the layerY = vec−1(σ(Wvec(X))). Here f (·) = vec−1(σ(Wvec(·))) isthe layer defined in Section 3.1, and pi(·) = G(N) ·G(M) is the permu-tation of the rows/columns applied to the input/output matrix (i.e.,G(N) ∈ {0,1}N×N and G(M) ∈ {0,1}M×M are permutation matrices).Requiring that our model makes the same predictions of missing values in X′ asin X is a natural restriction to place on it. It is however not quite enough to ensureexactly the exchangeability we desire. To see why, recall that the input to our layeris actually vec(X), the vectorized form of X. Permuting the rows and columns of Xwith G(N) and G(M), respectively, is equivalent to permuting vec(X) with the per-mutation matrix G(NM) =G(M)T⊗G(N) ∈ {0,1}NM×NM, where⊗ denotes the Kro-necker product of matrices. That is, vec(G(N)XG(M)) =(G(M)T⊗G(N))vec(X).Indeed, this is true of any matrices, and simply follows from the definition of theKronecker product. Thus, while we want to enforce Permutation Equivariancewith respect to any permutation that is of the form G(M)⊗G(N) ∈ {0,1}NM×NM,with G(N) ∈ {0,1}N×N and G(M) ∈ {0,1}M×M, we must also ensure a lack of Per-mutation Equivariance with respect to any other permutation in {0,1}NM×NM thatis not of this form. Such a permutation would correspond, for example, to swap-ping the ratings that two different users applied to some particular movie. Makingsuch a swap fundamentally changes our input data (since it corresponds to a differ-ent rating pattern) and we should expect our model to make different predictions in12this case. If we did not include this additional constraint on our layer, then a trivialweight matrix Wn,m = c would satisfy the desired equivariance property. It is theseconsiderations that motivate the following definition.Definition 3.1.1 (exchangeable matrix layer) Given X∈RN×M, a fully connectedlayer σ(Wvec(X)) with W ∈RNM×NM is called an exchangeable matrix layer if,for any permutation matrix G(NM) ∈ {0,1}NM×NM, applying G(NM) to the elementsof vec(X) results in the same permutation being applied to the output iff G(NM)corresponds to a permutation of the rows and columns of X. That isσ(WG(NM) vec(X)) = G(NM)σ(Wvec(X)) (3.1)iff G(NM) = G(M)⊗G(N) for some G(N) ∈ {0,1}N×N and G(M) ∈ {0,1}M×M.This definition heavily constrains the form of the weight matrix W in an ex-changeable matrix layer. Indeed, its number of effective degrees of freedom simplycannot grow with N or M. Theorem 3.1.1 demonstrates that the weight matrix ofany PE layer is forced to have the following, simple form:W(n,m),(n′,m′) :=w1 n = n′, m = m′w2 n = n′, m 6= m′w3 n 6= n′, m = m′w4 n 6= n′, m 6= m′.(3.2)Where each wk ∈ R. Here, we index the rows and columns of W (both oflength NM) using tuples (n,m) whose values correspond to the n-th row and m-thcolumn of our input matrix X. Each output element Yn,m is produced using thefollowing parameters: one connecting it to its counterpart in the input matrix Xn,m(w1); one each connecting it to the inputs of the same row and of the same col-umn (w2 and w3); and one shared by all the other connections (w4). We may alsoinclude a bias parameter w5 ∈ R, so that the exchangeable matrix layer becomesσ(W vec(X)+b), where b is a length-NM vector whose entries are all w5. Includ-ing this parameter does not affect the exchangeable properties of the layer, as it isadded to every observed user-movie interaction. For clarity we will often omit this13parameter in our analysis, but all results hold with its inclusion as well. See Figure3.2 for an illustration of the action of this parameter matrix on an input tensor, andFigure 3.4 for a visual illustration of its composition.Figure 3.2: Parameter sharing in an exchangeable matrix layer, where the el-ements of the input and output are scalars. The application of differenttied parameters to input elements is illustrated using dark blue for w1,green for w2, red for w3, and light blue for w4. The same structure (withthe same four parameters) repeats across all units of the output layer.We can also express an exchangeable matrix layer in a more natural fashion ifwe apply the following reparameterization to the weight matrix in (3.2):w′1 = w1−w2−w3−w4w′2 = w2+w4w′3 = w3+w4w′4 = w4.We can then rewrite the layer Y = vec−1(σ(Wvec(X))) asY = σ(w′1X+w′2(1N1TNX)+w′3(X1M1TM)+w′4(1N1TNX1M1TM)),where 1N is a length-N column vector of all 1’s. Observe that the term 1N1TNX issimply summing each column of X and broadcasting the result over each of the N14rows. Similarly, X1M1TM sums the rows of X and broadcasts the results over the Mcolumns, and 1N1TNX1M1TM sums both rows and columns. In practice, we wouldlikely apply a further reparameterization and take means over rows and columns,rather than simply summing (e.g., w′2→ w′2N ). We employ this approach in Theorem3.1.1 and subsequent sections.Writing the output Y in this form shows us how an exchangeable matrix layercan be implemented efficiently in practice, and in such a way as to take advantageof fast, parallelizable GPU computations. We will use this reparameterized form,and include the bias term, in the statement of Theorem 3.1.1. which formalizesthe requirement on the parameter matrix that any exchangeable matrix layer musthave. All proofs are deferred to the appendix. For technical reasons the activationfunction used in our layer is required to be strictly monotonic, and we provide aformal definition of this property below.Definition 3.1.2 (strictly monotonic function) A function f : X → Y is strictlymonotonic if either(i) for all x1,x2 ∈ X with x1 < x2 we have f (x1)< f (x2) or(ii) for all x1,x2 ∈ X with x1 < x2 we have f (x1)> f (x2).Item (i) is referred to as strictly monotonically increasing, and item (ii) as strictlymonotonically decreasing.Theorem 3.1.1 formalizes the requirement on the parameter matrix that anyexchangeable matrix layer must have. All proofs are deferred to the appendix.Theorem 3.1.1 Given a strictly monotonic function σ , a neural layer σ(Wvec(X)+b)is an exchangeable matrix layer (Definition 3.1.1) iff the elements of the parametermatrix W are tied together such that the resulting fully connected layer simplifiestoY = σ(w′1X+w′2N(1N1TNX)+w′3M(X1M1TM)+w′4NM(1N1TNX1M1TM)+w′51N1TM)(3.3)15where 1N = [1, . . . ,1]︸ ︷︷ ︸length NT and w′1, . . . ,w′5 ∈R.As described above, the parameter sharing scheme implied by 3.2 is equivalentto summing or averaging elements across rows and columns of the input matrix.More generally, PE is preserved by any commutative pooling operation, such asmax or LogSumExp. Moreover, stacking multiple layers with the same equiv-ariance properties preserves equivariance [27]. This allows us to build deep PEmodels.Figure 3.3: Pooling structure implied by the tied weights for matrices (left)and 3D tensors (right). The pink cube highlights one element of theoutput. It is calculated as a function of the corresponding element fromthe input (dark blue), pooled aggregations over the rows and columns ofthe input (green and yellow), and pooled aggregation over the whole in-put matrix (red). In the tensor case (right), we pool over all sub-tensors(orange sub-matrices, green sub-vectors and red scalar). For clarity, theoutput connections are not shown in the tensor case.3.1.1 Multiple Input–Output ChannelsEquation 3.3 describes the layer as though it has single input and output matri-ces. However, this is very limiting and in practice we may want to handle inputdata where the interaction between users and movies is represented by a length-16K vector, rather than a single scalar value. This is particularly true when usingexchangeable matrix layers to build up deep models: we want hidden layers toencode sufficiently rich information, and so we would likely make use of multiplechannels to encode interactions. Thus, as is the case with convolutional layers,our layer may have K input and O output channels. We use superscripts X〈k〉 andY〈o〉 to denote the kth input and oth output matrices over such channels. Cross-channel interactions are fully connected—that is, we have five unique parametersw〈k,o〉1 , . . . ,w〈k,o〉4 ,w〈o〉5 for each combination of input–output channels; note that thebias parameter w5 does not depend on the input. Similar to convolution, the num-ber of channels provides a tuning knob for the expressive power of the model. Inthis setting, the scalar output Y〈o〉n,m is the output for user n and movie m in outputchannel o. This value is given byY〈o〉n,m = σ(K∑k=1(w〈k,o〉1 X〈k〉n,m+w〈k,o〉2N(∑n′X〈k〉n′,m) (3.4)+w〈k,o〉3M(∑m′X〈k〉n,m′)+w〈k,o〉4NM(∑n′,m′X〈k〉n′,m′)+w〈o〉5))3.1.2 Input Features for Rows and ColumnsIn some applications, in addition to the matrix X ∈RN×M×K , where K is the num-ber of input channels/features, we may have additional features for rows R∈RN×K′and/or columns C ∈ RM×K′′ . Such features may correspond, for example, to auser’s age or occupation, or to a movie’s genre or year of release. We can preservePE by broadcasting these row/column features over the whole matrix and treatingthem as additional input channels.3.1.3 Jointly Exchangeable MatricesFor jointly exchangeable matrices, such as user-user interactions, or for the ad-jacency matrix in the case of graph convolution, Equation 3.3 (similarly, 3.4) isconstrained to have N = M and G(N) = G(M). This will in turn constrain the cor-responding parameter-sharing in Equation 3.2 so that w2 = w3, and the resultinglayer will remain equivariant to precisely the desired set of permutations.173.1.4 Sparse InputsReal-world arrays are often extremely sparse. Indeed, matrix and tensor comple-tion is only meaningful with missing entries. Fortunately, the equivariance prop-erties of Theorem 3.1.1 continue to hold when we only consider the nonzero (ob-served) entries. For sparse matrices, we continue to use the same parameter-sharingscheme across rows and columns, with the only difference being that we limit themodel to observed entries. We now make this precise.Let X ∈ RN×M×K be a sparse exchangeable array with K channels, where allthe channels for each row-column pair 〈n,m〉 are either fully observed or com-pletely missing. Let I identify the set of such non-zero indices. Let Rn = {m |(n,m) ∈ I} be the non-zero entries in the nth row of X, and let Cm be the non-zeroentries of its mth column. For this sparse matrix, the terms in the layer of 3.4 areadapted as one would expect:w〈k,o〉2N ∑n′X〈k〉n′,m →w〈k,o〉2|Cm| ∑n′∈CmX〈k〉n′,mw〈k,o〉3M ∑m′X〈k〉n,m′ →w〈k,o〉3|Rn| ∑m′∈RnX〈k〉n,m′w〈k,o〉4NM ∑n′,m′X〈k〉n′,m′ →w〈k,o〉4|I| ∑(n′,m′)∈IX〈k〉n′,m′Writing the layer using the terms above provides an approach for implementingexchangeable matrix layers efficiently, using sparse matrix operations, as we havedone in our Tensorflow implementation.This section has introduced the notion of an exchangeable neural network layerfor matrix-structured data, and showed certain results concerning the structure sucha layer must have. We now extend this definition and analysis to general, higher-dimensional tensors.3.2 Extension to General TensorsIn this section we generalize the exchangeable matrix layer (Definition 3.1.1) tohigher-dimensional arrays (tensors) and formalize its optimal qualities. The def-18inition of our exchangeable matrix layer can be seen as an extension from 1 to 2dimensions of the layer defined in the Deep Sets work [37]. By applying a similarextension of our exchangeable matrix layer from 2 to n dimensions, we can definean exchangeable layer that acts on tensors of arbitrary size and shape in a fairlystraightforward fashion.Suppose now that instead of an input matrix, our data X ∈ RN1×...×ND is a D-dimensional tensor. Again, vec(X) is its vectorized form, where the precise methodof vectorization is irrelevant, provided it is used consistently. We index vec(X) bytuples (n1,n2, ...,nD), corresponding to the original dimensions of X. Let N =∏i Ni and let n = (ni,n−i) be an element of [N1]× ...× [ND] such that ni is thevalue of the i-th entry of n, and n−i the values of the remaining entries (where itis understood that the ordering of the elements of n is unchanged). As before, weseek a layer that is equivariant only to certain, meaningful, permutations of vec(X).This motivates our definition of an exchangeable tensor layer in a manner that iscompletely analogous to Definition 3.1.1 for matrices.For any positive integer N, let S(N) denote the symmetric group of all permu-tations of N objects. Then S(N1)× ...×S(ND) refers to the product group of allpermutations of N1 through ND objects, while S(N) refers to the group of all per-mutations ofN=∏i Ni objects. So S(N1)× ...×S(ND) is a proper subgroup of S(N),having ∏i(Ni!) members, compared to (∏i Ni)! members in S(N). We want a layerthat is equivariant to only those permutations in S(N1)× ...×S(ND), but not to anyother member of S(N).Definition 3.2.1 (exchangeable tensor layer) Given X∈RN1×...×ND , letN=∏i Niand let G(N) ∈ S(N). Then the neural network layer vec−1(σ(Wvec(X))) withW ∈ RN×N is an exchangeable tensor layer if permuting the elements of the in-put along any set of axes results in the same permutation of the output tensor, andmoreover for any other permutation of the elements X (i.e., permutations that arenot along axes), there exists an input X for which this equality does not hold. Thatis:G(N)σ(Wvec(X)) = σ(WG(N) vec(X)), (3.5)iff G(N) ∈ S(N1)×S(N2)× ...×S(ND).19The following theorem asserts that a logical extension of the simple parametertying scheme from Section 3.1 achieves the desired permutation equivariance fortensor-structured data. Note that, while we continue to consider only the single-channel, dense case, the extension to sparse inputs over multiple channels can bemade for higher-dimensional data in the same way as for 2D data in Sections 3.1.1and 3.1.4.Theorem 3.2.1 The layer Y= vec−1(σ(Wvec(X))), where σ is any strictly mono-tonic, element-wise nonlinearity (e.g., sigmoid), is an exchangeable tensor layer iffthe parameter matrix W ∈RN×N is defined as follows.For each S ⊆ {1, . . . ,D}, we define a distinct parameter wS ∈ R, and tie theentries of W as followsWn,n′ := wS s.t. ni = n′i ⇐⇒ i ∈ S. (3.6)That is, the (n,n′)-th element of W is uniquely determined by the set of indices atwhich n and n′ are equal.Figure 3.4: Structure of our parameter matrix for the 1D (left), 2D (centre),and 3D (right) input arrays. The 1D case corresponds to the permutationequivariant layer proposed in Deep-Sets [37] with 4 elements. The 2Dcase corresponds to a 4×3 input matrix, and the 3D case corresponds toa 4×3×2 input tensor. The parameter-sharing patterns for the weightmatrix of the higher dimensional arrays can be produced via the Kro-necker product of the weight matrix for the 1D array (i.e., vector).In the special case where X ∈RN1×N2 is a matrix, the formulation of W abovegives exactly the matrix W described in Section 3.1. Theorem 3.2.1 says that a20layer constructed in this manner is equivariant with respect to only those permu-tations of N objects that correspond to permutations of sub-tensors along the Ddimensions of the input. The proof is in the Appendix. Equivariance with respectto permutations in S(N1)×S(N2)× ...×S(ND) follows as a simple corollary of Theo-rem 2.1 in [27]. Non-equivariance with respect to other permutations follows fromthe following two Propositions.Proposition 3.2.1 Let g(N) ∈ S(N) be an “illegal” permutation of elements of thetensor X – that is g(N) 6∈ S(N1)×S(N2)× ...×S(ND). Then there exists a dimensioni ∈ [D] such that, for some ni,n−i,n′i,n′−i:g(N)((ni,n−i)) = (n′i,n−i), butg(N)((ni,n′−i)) 6= (n′i,n′−i).If we consider the sub-tensor of X obtained by fixing the value of the i-thdimension to ni, we expect a “legal” permutation to move this whole sub-tensor tosome n′i (it could additionally shuffle the elements within this sub-tensor, providedthis shuffle is itself “legal”.) This Proposition asserts that an “illegal” permutationg(N) 6∈ S(N1)×S(N2)× ...×S(ND) is guaranteed to violate this constraint for somedimension/sub-tensor combination.The next proposition asserts that if we can identify such inconsistency in thepermutation of sub-tensors then we can identify an entry in G(N)W that will differfrom WG(N). We can therefore find some input tensor X for which the equivarianceproperty is violated – i.e., some X for which G(N)σ(W vec(X)) 6=σ(WG(N) vec(X)).Proposition 3.2.2 Let g(N) ∈ S(N) with G(N) ∈ {0,1}N×N the corresponding per-mutation matrix. Suppose g(N) 6∈ S(N1)×S(N2)× ...×S(ND), and let W ∈RN×N beas defined above. If there exists an i ∈ [D], and some ni,n−i,n′i,n′−i such thatg(N)((ni,n−i)) = (n′i,n−i), andg(N)((ni,n′−i)) 6= (n′i,n′−i),21then(G(N)W)(n′i,n′−i),(ni,n−i)6= (WG(N))(n′i,n′−i),(ni,n−i)Proposition 3.2.2 makes explicit a particular element at which the productsG(N)W and WG(N) will differ, provided g(N) is not of the desired form.Theorem 3.2.1 shows that this parameter sharing scheme produces a layer thatis equivariant to exactly those permutations we desire, and moreover, it is optimalin the sense that any layer having fewer ties in its parameters (i.e., more parameters)would fail to be equivariant.For the sake of clarity, we will continue to discuss only the matrix (i.e., 2D ten-sor) case in subsequent sections. However, as shown above, all results generalizeto the case of arbitrary dimensions.22Chapter 4Network ArchitecturesWe introduce two architectures in which our layer can be used. The first, presentedin Section 4.1, is what we call a self-supervised model: a simple composition ofexchangeable matrix layers that is trained to produce randomly removed entriesof the observed data matrix. The second model, presented in Section 4.2, is afactorized model that uses an auto-encoding nonlinear factorization scheme. Whilethere are innumerable methods for (nonlinear) matrix factorization and completion,both of these models are the first to generalize to inductive settings while achievingcompetitive performance in transductive settings.Section 4.3, discusses a dropout technique that we found was important forgetting good results with both models, and Section 4.4 discusses two subsamplingtechniques used for making predictions in sparse matrices that are too large to befit into limited GPU memory.4.1 Self-Supervised Exchangeable ModelThe first model is a simple, deep, feed-forward network consisting of a sequenceof stacked exchangeable matrix layers. That is, the function fss : RN×M×K →RN×M×O is simply a composition of exchangeable matrix layers. Given the matrixX with observed entries I, we divide I = Iin∪ Ipr into disjoint input and predictionentries. We then train fss(Xin) to predict the prediction entries Xpr. In this way,the model learns to predict missing entries from the observed interactions of other23users and movies.4.2 Factorized Exchangeable AutoencoderThe Factorized Exchangeable Autoencoder (EAE) model consists of an encoderportion and a decoder portion. The encoder fenc : RN×M×K → RKN×N ×RKM×Mmaps the (sparse) input matrix X ∈ RN×M×K to a row-factor ZN ∈ RKN×N and acolumn-factor ZM ∈ RKM×M. To do so, the encoder uses a composition of ex-changeable matrix layers. The output of the final layer of the encoder Yl ∈RN×M×Klis pooled across rows and columns (and optionally passed through a feed-forwardlayer) to produce latent factors ZN and ZM. The decoder fdec :RKN×N×RKM×M→RN×M×K also uses a composition of exchangeable matrix layers, and reconstructsthe input matrix X from the factors ZN and ZM. The optimization objective is tominimize reconstruction error `(X, fdec( fenc(X))); similar to classical auto-encoders.This procedure is also analogous to classical matrix factorization, with the im-portant distinction that once trained, we can readily factorize new, unseen matriceswithout performing any further optimization. Note that the same architecture triv-ially extends to tensor-factorization, where we instead use exchangeable tensorlayers (see Section 3.2).Figure 4.1: Factorized exchangeable autoencoder. The encoder maps fromthe input tensor to an embedding layer of row / column factors via oneor more hidden layers. The decoder attempts to reconstruct the inputusing the factors via one or more hidden layers.244.3 Chanel-wise DropoutBoth the Factorized EAE and Self-Supervised Exchangeable models are flexibleenough that they are able to simply memorize the training data, making regulariza-tion important for good generalization performance. Dropout [32] can be extendedto apply to exchangeable matrix layers by noticing that each channel in an ex-changeable matrix layer is analogous to a single unit in a standard feed-forwardnetwork. We therefore randomly drop out whole channels during training (as op-posed to dropping out individual elements from these channels). This procedure isequivalent to the SpatialDropout technique used in convolutional networks [34].4.4 Subsampling in Large DatasetsA key practical challenge arising from our approach is that our layer, and thereforeour models, are designed to take the whole data matrix X as input, and will makedifferent predictions if given only a subset of the data matrix. These predictionswill typically be worse than when given the whole matrix, since our model is de-signed to learn from the interactions between observed data points. As datasetsgrow, the model and input data become too large to fit within the limited size ofGPU memory. This is problematic both during training and at evaluation time be-cause our models rely on aggregating shared representations across data points tomake their predictions. To address this, we explore two subsampling procedures,described below.4.4.1 Uniform samplingThe simplest approach is to sample sub-matrices of X by uniformly sampling fromits (typically sparse) elements. This has the advantage that each submatrix is anunbiased sample of the full matrix and that the procedure is computationally cheap,but has the potential to limit the performance of the model because the relationshipsbetween the elements of X are sparsified. For example, two users may rate the sameset of movies, but when we sample, we may choose disjoint subsets of these moviesfor the two users. In this way we lose any information about the similarity of thesetwo users.254.4.2 Conditional samplingRather than sparsifying interactions between all set members, we can pick a subsetof rows and columns and maintain all their interactions; see Figure 4.2. This proce-dure is unbiased as long as each element (n,m) ∈ I has the same probability of be-ing sampled. To achieve this, we first sample a subset of rows N′ ⊆N= {1, . . . ,N}from the marginal P(n) := |Rn||I| , followed by a subsampling of columns using themarginal distribution over the columns, within the selected rows: P(m | N′) =|{(n,m)∈I|n∈N′}||{(n,m′)∈I|n∈N′,m′∈M}| . This gives us a set of columns M′ ⊆M. We consider any ob-servation within N′×M′ as our subsample: Isample := {(n,m) ∈ I | n ∈ N′,m ∈M′}.Unfortunately, this sampling procedure is computationally more expensive thanuniform sampling, as we have to calculate marginal distributions for each set ofsamples.Figure 4.2: Uniform sampling (left) selects samples (red) uniformly from thenon-zero indices of the the matrix X while conditional sampling (right)first samples a set of rows (shown in orange) from the row marginaldistribution (green) and then selects a sample from the resulting columnconditional distribution.4.4.3 Sampling at test timeAt training time all we must ensure is that we sample in an unbiased way; how-ever, at test time we need to ensure that the test indices Its of large test matrix Xts26are all sampled. Fortunately, according to the coupon collectors’ problem, in ex-pectation we only need to repeat random sampling of indices log(|Its|) times more(≈ 10× in practice) than an ideal partitioning of Its, in order to cover all the relevantindices [11].27Chapter 5Experiments and ResultsFor reproducibility we have released a Tensorflow implementation of our model.1Details on the training and architectures appear in Section 5.1. Section 5.2 reportsexperimental results in the standard transductive (matrix interpolation) setting fora number of benchmark datasets. However, more interesting results are reported in5.3, where we test a trained deep model on a completely different dataset. Finally,5.5 compares the model’s performance when using different sampling procedures.The datasets used in our experiments are summarized in Table 5.1.Dataset Users Items Ratings SparsityMovieLens 100K 943 1682 100,000 6.30%MovieLens 1M 6040 3706 1,000,209 4.47%Flixster 3000 3000 26173 0.291%Douban 3000 3000 136891 1.521%Yahoo Music 3000 3000 5335 0.059%Netflix 480,189 17,770 100,480,507 1.178%Table 5.1: Number of users, items and ratings for the data sets used in our ex-periments. MovieLens data sets are standard [14], as is Netflix, while forFlixster, Douban and Yahoo Music we used the 3000× 3000 submatrixpresented by [23] for comparison purposes.1https://github.com/mravanba/deep exchangeable tensors.285.1 Details of Architecture and TrainingBelow we discuss the specific architectures and training procedures that were mostsuccessful for each of our two models.5.1.1 Self-Supervised Model.We trained a simple feed-forward network with 9 exchangeable matrix layers usinga leaky ReLU activation function. Each hidden layer has 256 channels and weapply a channel-wise dropout with probability 0.5 after the first to seventh layers.We found this channel-wise dropout to be crucial to achieving good performance.Before the input layer we mask out a portion of the ratings be setting their valuesto 0 uniformly at random with probability 0.15. We convert the input ratings toone-hot vectors and interpret the model output as a probability distribution overpotential rating levels. We tuned hyper-parameters by training on 75% of the data,evaluating on a 5% validation set. For the MovieLens-100k dataset, we test thismodel using the canonical u1.base/u1.test training/test split, which reserves 20% ofthe ratings for testing. For the MovieLens-1M dataset, we use the same architectureas for ML-100k and trained on 85% of the data, validating on 5%, and reserving10% for testing. The limited size of GPU memory becomes an issue for this largerdataset, so we had to employ conditional sampling for training. At validation timewe used full batch predictions using the CPU in order to avoid memory issues.5.1.2 Factorized Exchangeable Autoencoder Model.For our Factorized EAE model we use three exchangeable matrix layers for theencoder. The first two have 220 channels, and the third layer maps the input to 100features for each entry, with no activation function applied. This is followed bymean pooling along both dimensions of the input. Thus, each user and movie isencoded into a length 100 vector of real-valued latent features. The decoder usesfive similar exchangeable matrix layers, with the final layer having five channels,which we interpret as a distribution over ratings levels. We apply a channel-wisedropout with probability 0.5 after the third and fourth layers, which we again foundto be crucial for good performance. We convert the input ratings to one-hot vectorsand optimize using cross-entropy loss.295.2 Transductive Setting (Matrix Interpolation)Here, we apply our models to some standard datasets, and demonstrate that ex-ploiting the PE structure of the exchangeable matrices allows us to achieve resultscompetitive with state-of-the-art. Moreover, our models are able to achieve thisusing only a constant number of parameters, while the number of parameters in allcompeting methods grows with N and/or M.The first standard dataset we experimented on was MovieLens-100k. In Table5.2 we report that the self-supervised exchangeable model is able to achieve state-of-the-art performance on this dataset, while the factorized model also producescompetitive results.Model ML-100KMC [4] 0.973GMC [16] 0.996GRALS [26] 0.945sRGCNN [23] 0.929Factorized EAE (ours) 0.920GC-MC [2] 0.910Self-Supervised Model (ours) 0.910Table 5.2: Comparison of RMSE scores for the MovieLens-100k dataset,based on the canonical 80/20 training/test split. Baseline numbers aretaken from [2].The next dataset, MovieLens-1M, is large enough that we begin to experiencethe problem of limited GPU memory, described in Section 4.4, while training. Wetherefore had to make use of sampling techniques, and tried both procedures de-scribed in that section. Not surprisingly, we found that the conditional samplingmethod performed better, but at the cost of slower training iterations. Our resultson MovieLens-1M are summarized in Table 5.3. On this larger dataset both mod-els gave comparatively weaker performance than what we observed on the smallerML-100k dataset, as well as in the extrapolation results (see Section 5.3). We sus-pect this is largely due to the constraints on memory. Given a fixed-size memory,there is an obvious trade-off between the size of the model (in terms of both num-ber of layers and units per layer) and the batch size one can train. For our models,we found that both larger batches and deeper and wider models tended to offer bet-30ter performance, but on datasets as large as ML-1M, it is not currently possible tohave both. The results for our two models on this dataset are similar and reportedin the same table.Model ML-1MPMF [22] 0.883Self-Supervised Model (ours) 0.863Factorized EAE (ours) 0.860I-RBM [28] 0.854BiasMF [18] 0.845NNMF [9] 0.843LLORMA-Local [20] 0.833GC-MC [2] 0.832I-AUTOREC [31] 0.831CF-NADE [38] 0.829Table 5.3: Comparison of RMSE scores for the MovieLens-1M dataset onrandom 90/10 training/test split. Baseline numbers are taken from [2].5.3 Inductive Setting (Matrix Extrapolation)Because our model does not rely on any per-user or per-movie parameters, it shouldbe able to generalize to new users and movies that were not present during training.We tested this claim by evaluating the performance of a pre-trained model in mak-ing predictions for new users and movies. Specifically, we trained an exchangeablefactorized autoencoder on the MovieLens-100k dataset and then evaluated it’s per-formance on a subsample of the MovieLens-1M dataset. At test time, the modelwas shown a portion of the new ratings and then made to make predictions on theremaining ratings. We vary the proportion of new ratings shown to the model whenmaking these predictions.Note that the ML-100k data was collected between 1997 and 1998, while allusers in the ML-1M dataset joined MovieLens in 2000 [13]. So the sets of usersfound in the two datasets are clearly disjoint. And while there is overlap in themovies in the two datasets (for example, Toy Story is found in both), the modelremains oblivious to this overlap, essentially treating them as if they were differentmovies, and only relying on their interactions with users to make predictions.Figure 5.1 summarizes these results, where we vary the amount of data shown31to the model for making predictions from 5% of the new ratings up to 95%. Wecompare our results against K-nearest neighbours approaches, as these are the onlyother models we are aware of that are capable of handling new users or movieswithout retraining. Our model significantly outperforms the baselines in this taskand its performance degrades gracefully as we reduce the amount of data used togenerate predictions.Figure 5.1: Evaluation of our model’s ability to generalize. We trained onML-100k and evaluated on a subset of the ML-1M data. At evaluationtime, p% of the ML-1M data was treated as observed and the modelwas required to complete the remaining (1− p)% (with p varied from5% to 95%). The model outperforms nearest-neighbour approaches forall values of p and degrades gracefully to the baseline of predicting themean in the small data case.5.4 Extrapolation to New DatasetsPerhaps most surprisingly, we were able to achieve competitive results when train-ing and testing our models on completely disjoint datasets. For these experimentswe stress-tested our model’s inductive ability by testing how it generalizes to newdatasets without retraining. We used a Factorized Exchangeable Autoencoder thatwas trained to make predictions on the MovieLens-100k dataset and tasked it withmaking predictions on the Flixster, Douban, YahooMusic and Netflix datasets. We32then evaluated the model’s performance against models trained for each of theseindividual datasets. All the datasets involve rating prediction tasks, so they sharesome similar semantics with MovieLens, but they have different user bases and(in the case of YahooMusic) deal with music not movie ratings, so we may expectsome change in the rating distributions and user-item interactions. Furthermore,the Flixster ratings are in 0.5 point increments from 1−5 and YahooMusic allowsratings from 1− 100, while Douban, MovieLens and Netflix ratings are on 1− 5scale. To account for the different rating scales, we simply binned the inputs toour model to a 1−5 range and, where applicable, linearly rescaled the output be-fore comparing it to the true rating2. Despite all of this, Table 5.4 shows that ourmodel achieves results that are very competitive with those of models that werespecifically trained for the task.Model FlixsterDoubanYahooMusicNetflixGRALS [26] 1.313 0.833 38.0 -sRGCNN [23] 1.179 0.801 22.4 -GC-MC [2] 0.941 0.734 20.5 -Factorize EAE (ours) 0.908 0.738 20.0 -Factorize EAE (trained on ML100k) 0.987 0.766 23.3 0.918Netflix Baseline - - - 0.951PMF [22] - - - 0.897LLORMA-Local [20] - - - 0.834I-AUTOREC [31] - - - 0.823CF-NADE [38] - - - 0.803Table 5.4: Evaluation of our model’s ability to generalize across datasets. Wetrained a factorized model on ML100k and then evaluated it on four newdatasets. Results for the smaller datasets are taken from [2]. NetflixBaseline shows state of the art prior to the Netflix Challenge.For the sake of comparison, we also include the performance of a FactorizedEAE that has actually been trained on the respective datasets. As expected, thisimproves performance of our model over one that is trained on ML-100k. Weare able to achieve new state-of-the-art results on the Flixster and YahooMusicdatasets, and very similar performance to Berg et al. [2]’s GC-MC model on the2Because of this binning procedure, our model received input data that is considerably coarser-grained than that which was used for the comparison models.33Douban dataset (current state-of-the-art for that dataset). Interestingly, we see thelargest performance gains over existing approaches on the datasets in which ratingsare sparse (see Table 5.1).5.5 Effects of Sub-Sampling ProceduresThe tradeoff between batch sample size and network size (see Section 4.4) led us toinvestigate the degree to which performance degrades as a result of subsampling inlarge matrices. We evaluated the effect of subsampling the input matrix X on per-formance for the MovieLens-100k dataset. The results are summarized in Figure5.2.Figure 5.2: Performance difference between sampling methods on the ML-100k. The two sampling methods use minibatches of size 20 000, whilethe full batch method used all 75 000 training examples. Note that they-axis does not begin at 0.This figure reveals two key findings. First, even with large batch sizes of 20 000examples, performance for both the uniform and conditional sampling methods isdiminished compared to the full batch case. We suspect that our models’ weakerresults on the larger ML-1M dataset may be partially attributed to the need to sub-sample. Secondly, the conditional sampling method was able to recover some ofthe diminished performance. We believe it is likely that more sophisticated sam-pling schemes that explicitly take into account the dependence between hidden34nodes will lead to better performance, but we leave that to future work.35Chapter 6ConclusionThis work has considered the problem of predicting relationships between the ele-ments of two or more distinct sets of objects, where the data can also be expressedas an exchangeable matrix or tensor. For clarity of exposition, we focused onthe case of user-movie ratings relationships, but our models have potential applica-tions in many other domains, from higher dimensional data, to seemingly unrelatedproblems such as sentiment analysis.We introduced a parameter-tying scheme which enables the application of deepmodels to this type of exchangeable data. We proved that our scheme always pro-duces permutation equivariant models and that no increase in model expressivenessis possible without violating this guarantee.Experimentally, we showed that our models achieve state-of-the-art or com-petitive performance on widely studied matrix completion benchmarks. Notably,and unlike all other approaches that offer strong performance, our models achievedthese results despite having a number of parameters independent of the size of thematrix to be completed. This means that our models can be easily adapted to new,unseen data, without any costly retraining procedure, a benefit that is characteristicof few other methods.Also unlike these other approaches, our models can achieve competitive resultsfor the problem of matrix extrapolation: asking an already-trained model to com-plete a new matrix involving new objects that were unobserved at training time.We showed that when the interactions between these new objects came from the36same distribution that generated the training data, performance was almost unaf-fected. Finally, we observed that our methods were surprisingly strong on varioustransfer learning tasks: extrapolating from MovieLens ratings to Flixter, Douban,YahooMusic, and Netflix ratings. All of these contained different user populationsand different distributions over the objects being rated. Indeed, in the YahooMusicdataset, the underlying objects were not even of the same kind as those rated intraining data.37Bibliography[1] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensordecompositions for learning latent variable models. The Journal of MachineLearning Research, 15(1):2773–2832, 2014. → page 2[2] R. v. d. Berg, T. N. Kipf, and M. Welling. Graph convolutional matrixcompletion. arXiv preprint arXiv:1706.02263, 2017. → pagesix, 5, 6, 30, 31, 33[3] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks andlocally connected networks on graphs. ICLR, 2014. → page 6[4] E. J. Cande`s and B. Recht. Exact matrix completion via convexoptimization. Foundations of Computational mathematics, 2009. → pages5, 30[5] B. C. Csa´ji. Approximation with artificial neural networks. Faculty ofSciences, Etvs Lornd University, Hungary, 24:48, 2001. → page 7[6] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neuralnetworks on graphs with fast localized spectral filtering. In Advances inNeural Information Processing Systems, pages 3844–3852, 2016. → page 6[7] Z. Deng, R. Navarathna, P. Carr, S. Mandt, Y. Yue, I. Matthews, andG. Mori. Factorized variational autoencoders for modeling audiencereactions to movies. → page 5[8] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel,A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs forlearning molecular fingerprints. In Advances in neural informationprocessing systems, 2015. → page 6[9] G. K. Dziugaite and D. M. Roy. Neural network matrix factorization. arXivpreprint arXiv:1511.06443, 2015. → pages 5, 3138[10] L. Getoor and B. Taskar. Introduction to statistical relational learning. MITpress, 2007. → page 2[11] A. Hald, A. de Moivre, and B. McClintock. A. de moivre:’de mensura sortis’or’on the measurement of chance’. International Statistical Review/RevueInternationale de Statistique, pages 229–262, 1984. → page 27[12] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning onlarge graphs. In Advances in Neural Information Processing Systems, 2017.→ page 6[13] F. M. Harper and J. A. Konstan. The movielens datasets: History andcontext. In ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4,Article 19, 2015. → page 31[14] F. M. Harper and J. A. Konstan. The movielens datasets: History andcontext. ACM Transactions on Interactive Intelligent Systems, 2015. →pages ix, 28[15] J. S. Hartford, J. R. Wright, and K. Leyton-Brown. Deep learning forpredicting human strategic behavior. In Advances in Neural InformationProcessing Systems, pages 2424–2432, 2016. → page 3[16] V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst. Matrixcompletion on graphs. arXiv preprint arXiv:1408.1717, 2014. → page 30[17] T. N. Kipf and M. Welling. Semi-supervised classification with graphconvolutional networks. arXiv preprint arXiv:1609.02907, 2016. → page 6[18] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques forrecommender systems. 2009. → pages 5, 31[19] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. In Nature, 2015. →pages 6, 9[20] J. Lee, S. Kim, G. Lebanon, and Y. Singer. Local low-rank matrixapproximation. In Sanjoy Dasgupta and David McAllester, editors,Proceedings of the 30th International Conference on Machine Learning(ICML), volume 28 of Proceedings of Machine Learning Research, pages82–90, 2013. → pages 31, 33[21] S. Li, J. Kawale, and Y. Fu. Deep collaborative filtering via marginalizeddenoising auto-encoder. In Proceedings of the 24th ACM International onConference on Information and Knowledge Management, pages 811–820.ACM, 2015. → page 539[22] A. Mnih and R. R. Salakhutdinov. Probabilistic matrix factorization. InAdvances in neural information processing systems, 2008. → pages 5, 31, 33[23] F. Monti, M. Bronstein, and X. Bresson. Geometric matrix completion withrecurrent multi-graph neural networks. In Advances in Neural InformationProcessing Systems, 2017. → pages ix, 5, 28, 30, 33[24] P. Orbanz and D. M. Roy. Bayesian models of graphs, arrays and otherexchangeable random structures. IEEE transactions on pattern analysis andmachine intelligence, 2015. → pages 1, 3[25] L. D. Raedt, K. Kersting, S. Natarajan, and D. Poole. Statistical relationalartificial intelligence: Logic, probability, and computation. SynthesisLectures on Artificial Intelligence and Machine Learning, 10(2):1–189,2016. → page 2[26] N. Rao, H.-F. Yu, P. K. Ravikumar, and I. S. Dhillon. Collaborative filteringwith graph information: Consistency and scalable methods. In C. Cortes, N.D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances inNeural Informa- tion Processing Systems 28, page 21072115, 2015. →pages 30, 33[27] S. Ravanbakhsh, J. Schneider, and B. Poczos. Equivariance throughparameter-sharing. In Proceedings of the 34th International Conference onMachine Learning, volume 70 of JMLR: WCP, August 2017. → pages3, 6, 9, 16, 21, 44[28] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted boltzmann machinesfor collaborative filtering. In Proceedings of the 24th internationalconference on Machine learning, pages 791–798. ACM, 2007. → pages5, 31[29] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. Thegraph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009. → page 6[30] J. Schmidhuber. Deep learning in neural networks: An overview. CoRR,abs/1404.7828, 2014. URL http://arxiv.org/abs/1404.7828. → page 6[31] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie. Autorec: Autoencodersmeet collaborative filtering. In Proceedings of the 24th InternationalConference on World Wide Web, pages 111–112. ACM, 2015. → pages5, 31, 3340[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov.Dropout: A simple way to prevent neural networks from overfitting. TheJournal of Machine Learning Research, 2014. → pages 8, 25[33] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal ofthe Royal Statistical Society. Series B (Methodological), pages 267–288,1996. → page 8[34] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient objectlocalization using convolutional networks. In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition, pages 648–656,2015. → page 25[35] H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning forrecommender systems. In Proceedings of the 21th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining, pages1235–1244. ACM, 2015. → page 5[36] Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervisedlearning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.→ page 5[37] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, andA. J. Smola. Deep sets. In Advances in Neural Information ProcessingSystems, 2017. → pages xi, 3, 19, 20[38] Y. Zheng, B. Tang, W. Ding, and H. Zhou. A neural autoregressive approachto collaborative filtering. In Proceedings of the 33nd InternationalConference on Machine Learning, pages 764–773, 2016. → pages 5, 31, 3341Appendix ASupporting MaterialsA.1 Notation• X ∈RN1×...×ND , the data tensor• x ∈R∏i Ni , vectorized X, also denoted by vec(X).• [N]: the sequence {n}n=1,...,N = (1,2, ...N)• n = (ni,n−i), an element of N1× ...×ND• N =∏i Ni• S(N), symmetric group of all permutations of N objects• G(N) = {g(N)i }i, a permutation group of N objects• g(N)i or G(N)i , both can refer to the matrix form of the permutation g(N)i ∈G(N).• g(N)i (n) refers to the result of applying g(N)i to n, for any n ∈ [N].• σ , an arbitrary, element-wise, strictly monotonic, nonlinear function such assigmoid.42A.2 ProofsA.2.1 Proof of Proposition 3.2.1Proof 1 Let X ∈RN1×...×ND be the data matrix. We prove the contrapositive by in-duction on D, the dimension of X. Suppose that, for all i∈ [D] and all ni,n′i,n−i,n′−i,we have that g(N)((ni,n−i)) = (n′i,n−i) implies g(N)((ni,n′−i)) = (n′i,n′−i). Thismeans that, for any n = (ni,n−i) = (n1, ...,ni, ...,nD) the action g(N) takes on niis independent of the action g(N) takes on n−i, the remaining dimensions. Thus wecan writeg(N)(n) = g(Ni)(ni)×g(N/Ni)(n−i)Where it is understood that the order of the group product is maintained (this is aslight abuse of notation). If D= 2 (base case) we have g(N)((n1,n2)) = g(N1)(n1)×g(N2)(n2). So g(N) ∈S(N1)×S(N2), and we are done. Otherwise, an inductive argu-ment on g(N/Ni) allows us to write g(N)(n) = g(N1)(n1)×g(N2)(n2)× ...×g(ND)(nD).And so g(N) ∈ S(N1)×S(N2)× ...×S(ND), completing the proof.A.2.2 Proof of Proposition 3.2.2Proof 2 First, observe thatg(N)(n) = n′ ⇐⇒ (G(N))n′,n = 1Now, let i ∈ [D] be such that, for some ni,n′i,n−i,n′−i we haveg(N)((ni,n−i)) = (n′i,n−i), andg(N)((ni,n′−i)) 6= (n′i,n′−i).Then by the observation above we have:(G(N))(n′i,n−i),(ni,n−i)= 1, and(G(N))(n′i,n′−i),(ni,n′−i)6= 1.43And so:(G(N)W)(n′i,n′−i),(ni,n−i)=(G(N))(n′i,n′−i),∗(W)∗,(ni,n−i)= ∑k∈[N1×...×ND](G(N))(n′i,n′−i),k(W)k,(ni,n−i)6= W(ni,n′−i),(ni,n−i)The last line follows from the observation above and the fact that G(N) is a permu-tation matrix and so has only one 1 per row. Similarly,(WG(N))(n′i,n′−i),(ni,n−i)=(W)(n′i,n′−i),∗(G(N))∗,(ni,n−i)= ∑k∈[N1×...×ND](W)(n′i,n′−i),k(G(N))k,(ni,n−i)=(W)(n′i,n′−i),(n′i,n−i)Where again the last line follows from the above observation. Now, considerW(ni,n′−i),(ni,n−i) and W(n′i,n′−i),(n′i,n−i). Observe that (ni,n′−i) differs from (ni,n−i) atexactly the same indices that (n′i,n′−i) differs from (n′i,n−i). Let S ⊆ [D] be the setof indices at which n−i differs from n′−i. We therefore haveW(ni,n′−i),(ni,n−i) = W(n′i,n′−i),(n′i,n−i) = θS,Which completes the proof.A.2.3 Proof of Theorem 3.2.1Proof 3 We will prove both the forward and backward direction:(⇐) Suppose W has the form given by (3.6). We must show the layer is onlyequivariant with respect to permutations in S(N1)× ...×S(ND):• Equivariance: Let g(N) ∈ S(N1) × ...×S(ND), and let G(N) be the corre-sponding permutation matrix. Then a simple extension of Theorem 2.1 in[27] implies G(N)WX = WG(N)X for all X, and thus the layer is equivari-ant. Intuitively, if g(N) ∈ S(N1)× ...×S(ND) we can “decompose” g(N)(n)44into D permutations S(N1)(n1), ...,S(ND)(nD) which act independently on theD dimensions of X.• No equivariance wrt any other permutation: Let g(N) ∈ S(N), with g(N) 6∈S(N1)×·· ·×S(ND), and let G(N) be the corresponding permutation matrix.Using Propositions 3.2.1 and 3.2.2 we have:G(N)W 6= WG(N)So there exists an index at which these two matrices differ, call it (n,n′). Thenif we take vec(X) to be the vector of all 0’s with a single 1 in position n′, wewill have:G(N)Wvec(X) 6= WG(N) vec(X).And since σ is assumed to be strictly monotonic, we have:σ(G(N)Wvec(X)) 6= σ(WG(N) vec(X)).And finally, since G(N) is a permutation matrix and σ is applied element-wise, we have:G(N)σ(Wvec(X)) 6= σ(WG(N) vec(X)).Therefore, the layer σ(Wvec(X)) is not a exchangeable tensor layer, andthe proof is completed.This proves the first direction.(⇒) We prove the contrapositive. Suppose Wn,n′ 6= Wm,m′ for some n,n′,m,m′ ∈N1 × ...×ND with {i : ni = n′i} = {i : mi = m′i}. We want to show that thelayer Y = vec−1(σ(Wvec(X))) is not equivariant to some permutation g(N) ∈45S(N1)× ...×S(ND). We define this permutation as follows:g(N)(ν) =m if ν = nm′ if ν = n′n if ν = mn′ if ν = m′ν otherwiseThat is, g(N) “swaps” n with m and n′ with m′. This is a valid permutation firstsince it acts element-wise, but also since {i : ni = n′i} = {i : mi = m′i} impliesthat n = n′ iff m = m′ (and so g(N) is injective, and thus bijective). So if G(N) is thepermutation matrix of g(N) then we have (G(N))(n′,n) = (G(N))(m′,m) = 1, and:(G(N)W)(m,n′) = ∑k∈[N1×...×ND](G(N))(m,k)W(k,n′)= W(n,n′)6= W(m,m′)= ∑k∈[N1×...×ND]W(m,k)(G(N))(k,n′)= (WG(N))(m,n′)And so G(N)W 6= WG(N) and by an argument identical to the above, with ap-propriate choice of X, this layer does not satisfy the requirements of an exchange-able tensor layer. This completes the proof.A.2.4 Proof of Theorem 3.1.1A simple reparameterization allows us to write the matrix W of (3.6) in the form of(3.3). Thus Theorem 3.1.1 is just the special case of Theorem 3.2.1 where D = 2and the proof follows from that.46


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items