Science, Faculty of
Computer Science, Department of
DSpace
UBCV
Flores Garrido, Marisol
2008-09-02T17:20:12Z
2008
Master of Science - MSc
University of British Columbia
Tensors can be viewed as multilinear arrays or generalizations of the notion of matrices. Tensor decompositions have applications in various fields such as psychometrics, signal processing, numerical linear algebra and data mining. When the data are nonnegative, the nonnegative tensor factorization (NTF) better reflects the underlying structure. With NTF it is possible to extract information from a given dataset and construct lower-dimensional bases that capture the main features of the set and concisely describe the original data.
Nonnegative tensor factorizations are commonly computed as the solution of a nonlinear bound-constrained optimization problem. Some inherent difficulties must be taken into consideration in order to achieve good solutions. Many existing methods for computing NTF optimize over each of the factors separately; the resulting algorithms are often slow to converge and difficult to control. We propose an all-at-once approach to computing NTF. Our method optimizes over all factors simultaneously and combines two regularization strategies to ensure that the factors in the decomposition remain bounded and equilibrated in norm.
We present numerical experiments that illustrate the effectiveness of our approach. We also give an example of digital-inpainting, where our algorithm is employed to construct a basis that can be used to restore digital images.
https://circle.library.ubc.ca/rest/handle/2429/1606?expand=metadata
1130731 bytes
application/pdf
An All-at-Once Approach to Nonnegative Tensor Factorizations by Marisol Flores Garrido B.Sc., Universidad Autónoma de Coahuila, 2005 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science) The University Of British Columbia (Vancouver) August 2008 c Marisol Flores Garrido, 2008 Abstract Tensors can be viewed as multilinear arrays or generalizations of the notion of matrices. Tensor decompositions have applications in various fields such as psychometrics, signal processing, numerical linear algebra and data mining. When the data are nonnegative, the nonnegative tensor factorization (NTF) better reflects the underlying structure. With NTF it is possible to extract information from a given dataset and construct lower-dimensional bases that capture the main features of the set and concisely describe the original data. Nonnegative tensor factorizations are commonly computed as the solution of a nonlinear bound-constrained optimization problem. Some inherent difficulties must be taken into consideration in order to achieve good solutions. Many existing methods for computing NTF optimize over each of the factors separately; the resulting algorithms are often slow to converge and difficult to control. We propose an all-at-once approach to computing NTF. Our method optimizes over all factors simultaneously and combines two regularization strategies to ensure that the factors in the decomposition remain bounded and equilibrated in norm. We present numerical experiments that illustrate the effectiveness of our approach. We also give an example of digital-inpainting, where our algorithm is employed to construct a basis that can be used to restore digital images. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii 1 2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 NTF problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3.1 NMF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3.2 Moving to tensors . . . . . . . . . . . . . . . . . . . . . 9 1.4 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.5 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.1 Tensor’s definition . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.1 Matrix operations . . . . . . . . . . . . . . . . . . . . . . 16 2.2.2 Matricization . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.3 Tensor operations . . . . . . . . . . . . . . . . . . . . . . 19 2.2.4 Special tensors . . . . . . . . . . . . . . . . . . . . . . . 20 iii 3 2.3 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4 Rank and factorizations . . . . . . . . . . . . . . . . . . . . . . . 23 2.4.1 Rank of a tensor . . . . . . . . . . . . . . . . . . . . . . 23 2.4.2 Tucker decomposition . . . . . . . . . . . . . . . . . . . 23 2.4.3 CANDECOMP/PARAFAC decomposition . . . . . . . . 25 2.4.4 Other decompositions . . . . . . . . . . . . . . . . . . . 26 Nonnegative tensor factorizations . . . . . . . . . . . . . . . . . . . 27 3.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.1.1 Lack of uniqueness . . . . . . . . . . . . . . . . . . . . . 28 3.1.2 Degeneracy and best rank-R approximation . . . . . . . . 30 3.1.3 Lack of bounds . . . . . . . . . . . . . . . . . . . . . . . 31 Algorithms for computing the NTF . . . . . . . . . . . . . . . . . 31 3.2.1 Multiplicative update rules . . . . . . . . . . . . . . . . . 31 3.2.2 Alternating least squares . . . . . . . . . . . . . . . . . . 35 3.2.3 All-at-once approach . . . . . . . . . . . . . . . . . . . . 38 All-at-once approach . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.1 Regularization strategy . . . . . . . . . . . . . . . . . . . . . . . 39 4.1.1 Regularization function . . . . . . . . . . . . . . . . . . . 40 4.1.2 Spacer steps . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 Gradient of the function. . . . . . . . . . . . . . . . . . . . . . . 43 4.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.3.1 BFGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.3.2 L-BFGS-B . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.2 4 5 Implementation and numerical experiments 5.1 5.2 . . . . . . . . . . . . . 49 Implementation of the algorithm . . . . . . . . . . . . . . . . . . 49 5.1.1 L-BFGS-B routine . . . . . . . . . . . . . . . . . . . . . 49 5.1.2 Other implementation remarks . . . . . . . . . . . . . . . 50 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.2.1 Experiment settings . . . . . . . . . . . . . . . . . . . . . 52 5.2.2 Performance of the algorithm on the matrix case . . . . . 56 5.2.3 Effects of regularization strategies . . . . . . . . . . . . . 57 iv 5.2.4 6 Comparison with other approaches . . . . . . . . . . . . . An application to image processing 59 . . . . . . . . . . . . . . . . . . 70 6.1 Image basis through nonnegative factorizations . . . . . . . . . . 70 6.2 Inpainting problem . . . . . . . . . . . . . . . . . . . . . . . . . 74 6.3 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . 75 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 A List of routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 B Summary of multilinear operators’ properties . . . . . . . . . . . . 90 v List of Tables 5.1 Comparison between interfaces . . . . . . . . . . . . . . . . . . . 57 5.2 CBCL Dataset, matrix approach . . . . . . . . . . . . . . . . . . 62 5.3 CBCL Dataset, tensor approach . . . . . . . . . . . . . . . . . . 64 5.4 AT&T Dataset, matrix approach . . . . . . . . . . . . . . . . . . 65 5.5 AT&T Dataset, tensor approach . . . . . . . . . . . . . . . . . . 67 5.6 CBCL Dataset, comparison with other algorithms. . . . . . . . . . 68 5.7 AT&T Dataset, comparison with other algorithms. . . . . . . . . . 69 6.1 Experiment with hand-written numbers. . . . . . . . . . . . . . . 73 vi List of Figures 1.1 NMF and clustering . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.2 PCA vs. NMF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3 Tensor slices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1 Tensor fibers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2 Matricization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.3 Third-order special tensors . . . . . . . . . . . . . . . . . . . . . 21 2.4 Tucker decomposition . . . . . . . . . . . . . . . . . . . . . . . . 24 2.5 PARAFAC decomposition . . . . . . . . . . . . . . . . . . . . . 26 5.1 Effects of regularization . . . . . . . . . . . . . . . . . . . . . . . 58 6.1 Basis for CBCL face database with outliers. . . . . . . . . . . . . 71 6.2 Experiment using hand-written digits. . . . . . . . . . . . . . . . 73 6.3 Inpainting problem, test 1; different λ values. . . . . . . . . . . . 76 6.4 Inpainting problem, test 1. . . . . . . . . . . . . . . . . . . . . . 77 6.5 Inpainting problem, test 2. . . . . . . . . . . . . . . . . . . . . . 78 6.6 Images from the AT&T image database. . . . . . . . . . . . . . . 78 6.7 Inpainting problem, test 3. . . . . . . . . . . . . . . . . . . . . . 79 vii Acknowledgements I would like to thank my supervisor, Michael P. Friedlander, for his unbounded support, patience and encouragement. He introduced me to a branch of applied mathematics that I knew little about and he did it showing real passion for the field (there are sparkles on his eyes every time he talks about optimization!). I cannot thank him enough for all he has done for me. Thanks as well to Chen Greif for reading this thesis and providing many helpful comments. In the last two years I have met some wonderful people and made a few excellent friends. Thanks to all my bullpen mates; despite the endless homework I really enjoyed my first year here. A special thanks goes to the Tuesdays’ group: CB, Hoyt, Lloyd, Hao, Seonah, and Shin; some of what I learned at our gatherings are among the most important legacies of this period. I would also like to thank my excellent SCL labmates for their kindness and support. A special thanks to Ewout for always being kind and for his willingness to share his vast knowledge. Thanks to all the Mexican friends at UBC who, with the help of lots of corn and some tequila, relieved my home sickness and made the sunless weather bearable. I want to particularly thank Ignacio, Itzia, and Ramón, for all the laughs, support, and companionship they provided. Thanks to Margaret, an admirable woman and a great friend. I can hardly believe how lucky I was for entering her home. Thanks to my old, good, unconditional friends: Citlali, Esli, Humberto, Javier, Marisol and Mario. They provide solid friendship even at times when I’m not sure who I am. Lastly, but most importantly, I thank my parents, brother, and sister for their viii endless support, faith and love. It would have been impossible to complete this work without them. I could not have asked for a better family. ix To Nates, a.k.a. Lady Melancholy, for being the most beautiful cronopio I know. x Chapter 1 Introduction The notion of a tensor has been in the mathematical record since the 19th century [36] but has gained new strength in recent years with the rise and rapid development of fields like computational-based data mining and machine learning. Tensor’s multilinear structure permits the storage and organization of large amounts of data. Through the use of multiple indexes, they facilitate an efficient and flexible management of information. It is also recognized that the manipulation and use of alternative expressions of a given tensor—like factorizations—can provide an insight into the relevant information in the stored data, find structures that are not evident by simple analysis, and facilitate the extraction of useful information about the set. The higher-order SVD (HOSVD) [47], for example, represents a tool that can be compared with the principal component analysis (PCA) for two-dimensional arrays. There are an extensive number of applications that involve nonnegative data and require nonnegativity in their analysis in order to avoid a misrepresentation of physical realities. This led to the mergence of the basic ideas on tensor factorizations and nonnegative matrix factorizations (NMF)—introduced in 1994 [63]—to create nonnegative tensor factorizations (NTF)—introduced in 1997 [14, 60]—, a tool that extends the scope of tensor factorizations as analysis instruments by showing both usefulness in the treatment of physically meaningful data, and the structure that makes tensors appealing in data management. 1 1.1 Tensors Tensors can be put in the context of formal differential geometry or physics, but the concept that will be used in this work is the one that is found in most literature related to multilinear algebra and scientific computing, where a tensor simply represents a multilinear array or a generalization of the notion of scalar, vector, and matrix. Two ideas are commonly used to describe this concept. The first presents the notion of a Nth-order tensor as a collection of numbers organized in a Ndimensional array, whose elements require N indices to be referred to. The second is the idea of a tensor as a geometrical entity that “can be expressed as a multi-dimensional array relative to a choice of basis of the particular space on which it is defined. The intuition underlying the tensor concept is inherently geometrical: as an object in and of itself, a tensor is independent of any chosen frame of reference” [1]. Both ideas clearly state that a matrix (represented by a table whose elements are addressed by two indices) and a vector (represented by a list of elements that require only one index to be identified) can be understood as 1st-order and 2ndorder tensors, respectively. The order of a tensor is then understood as its number of dimensions (also known as ways or modes). 1.2 NTF problem Given an N-dimensional tensor V ∈ IRI1 ×I2 ×···×IN , the NTF problem consists in finding a core tensor G ∈ IRJ1 ×J2 ×···×JN , and a set of N matrices A(n) ∈ IRIn ×Jn , n = 1, . . . , N, such that V ≈ G ×1 A(1) ×2 A(2) ×3 · · · ×N A(N) , (1.1) where the N + 1 constitutive factors G , A(1) , . . . , A(N) are required to be componentwise nonnegative, i.e., G , A(1) , . . . , A(N) ≥ 0. The meaning of (1.1) will be clarified in Chapter 2, where a description of notation and tensor operations is provided. For now, it is enough to mention that 2 the matrices A(1) , . . . , A(N) are generally understood as components of V along its different dimensions (tensor dimensions are also referred to as ways or modes), whereas G represents a kernel tensor (usually smaller than V , though having the same order) whose entries capture the relation between the matrices A(i) . While previously it was stated that NTF extends the scope of tensor factorizations, an alternate interpretation is that they just generalize NMF. Thus, if V is a 2nd-order tensor—i.e., a matrix—we get the problem of factorizing V ∈ IRI1 ×I2 into matrices G, A(1) , A(2) such that V = G ×1 A(1) ×2 A(2) = A(1) GA(2)T , (1.2) with G, A(1) , A(2) ≥ 0, which is essentially the NMF problem. 1.3 Motivation The motivation for NTF is mainly based on two facts: the proved ability of NMF to extract relevant information from a given dataset, identifying and capturing the main features of the set, and the variety of applications that rely on having a tensor structure to manage the information. In this section a short explanation of both is provided. 1.3.1 NMF NMF were first described by Paatero [63] and later reintroduced by Lee and Seung [51]. Lee and Seung’s science article begins with the words “Is perception of the whole based on perception of its parts? There is psychological and physiological evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations. . . . Here we demonstrate an algorithm called nonnegative matrix factorization that is able to learn parts of faces and semantic features of text.” This statement captures the essential motivation for NMF, which is to analyze and learn about a whole by dividing it into meaningful coherent parts that keep the 3 nonnegative characteristics of the original data. The NMF’s goal is to simplify the interpretation of such parts. In one sense, every matrix factorization—e.g., QR, SVD—provides a better understanding of the original matrix. What sets NMF and the other well-known factorizations in linear algebra apart is the introduction of the nonnegativity constraints over the factors to be constructed. Given a nonnegative matrix V ∈ IRm×n and a positive integer r < min{m, n}, a nonnegative matrix factorization can be formulated as the problem of finding nonnegative matrices W ∈ IRm×r , G ∈ IRr×r , and H ∈ IRn×r that minimize the functional f (W, G, H) = 21 kV −W GH T k2F , (1.3) where k kF represents the Frobenius norm. This norm is the one most frequently used in the literature and, although there are other possibilities [23, 61]), the use of other norms can be complicated and expensive. Two observations must be noted regarding this problem formulation. First, the square matrix G has been included in this equation to highlight the conformity with (1.1 ) and to look at (1.3) as a particular case of the tensor problem. In general, it is assumed that G is the identity matrix and only W and H are computed. Second, the nonnegativity constraints imposed on the factors are flexible and, depending on the application and nature of the data, only one of the factors might be required to be nonnegative. It is also important to point out that the imposition of the nonnegativity constraint and the fact that r is typically smaller than the rank of V , imply that, in general, it is impossible to find an “exact” factorization. Therefore, the product W H T represents an approximation of the original data. The columns of the matrix W represent the basis used to approximate the columns of V , whereas the ith row of H contains the specific set of weights that generates the approximation to the ith column of V . The dimension r of the basis space is fundamental to the problem’s definition but the selection of an appropriate value of r is often not considered part of the problem. We assume that this parameter is given. 4 One more detail worth mentioning is that W , which is generally referred to as a basis, might not be one in the formal mathematical sense. When Lee and Seung described the problem, in the context of images, they called the columns of W “basis images”; apparently, later publications use the same word and interpretation of W —a set of vector elements that can be combined to reconstruct the original data— but the linear independence of the columns of W is not assumed or required in the problem. Such independence is highly likely to be found when r is smaller than m and n, as it happens most of the time, but, in general, the word “basis” is not used with a strict (mathematical) signification. Applications of NMF Applications of NMF include music transcription [68], bioinformatics [16], DNA gene expressions [30], data mining [11, 65], spectral analysis [64] and medical applications [34]. Two of the most popular applications related to clustering and image processing are described below. 1. Clustering. Clustering is an important technique for statistical data analysis and can be defined [25, Chapter 9] as “the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait—often proximity according to some defined distance measure”. NMF has proven useful in data clustering [24] and has thus become a tool in problems such as document classification, text processing and other problems that concern data mining and pattern recognition [25]. To gain some understanding of the relation between clustering and NMF, consider the example 1. In it, NMF is compared against the K-means algorithm [54], which is one of the most popular data-clustering algorithms. In the example, the simplest version of K-means, known as Lloyd’s algorithm, is used, which partitions the input points into k initial sets, calculates the mean point (centroid) of each set, and constructs a new partition by associating each point with the closest centroid. The centroids are then recalculated for the new clusters and the algorithm is repeated by alternate 5 application of these two steps until convergence is achieved, i.e., until the points no longer switch clusters or centroids are no longer changed. Example 1. Let V ∈ IR2×7 be the matrix from equation (1.4), in which each column represents an observation made in IR2 . For this set of seven points it is known that there are three clusters and, by using K-means, we get the means (2.86, 3.23), (−4.40, 2.08) and (−7.45, 9.54). If we apply an NMF algorithm (in this example, the algorithm described in Chapter 4 is used) using r = 3 (the known number of clusters that determines the parameter r is irrelevant in this example) and impose nonnegativity on H only, we find the following factorization: V= 2.81 4.17 −5.25 1.83 ≈ 5.45 4.49 −6.24 1.85 {z | W ! −7.93 −3.02 −4.93 2.92 8.62 3.11 1.29 2.29 ! 0.71 0.00 0.26 0.02 0.23 −8.23 0.06 0.79 0.13 0.19 0.55 9.84 } 0.08 0.03 0.91 0.82 0.10 | {z −6.96 10.45 HT 0.00 0.78 0.00 0.52 0.00 0.00 } (1.4) As can be seen in Figure 1.1 (left axis), each column in W is an approximation of the cluster means. On the other hand, H can be interpreted as a cluster indicator where each row is associated with one column in V , and each column is associated with a cluster. Using hard clustering (where the k-clusters are mutually exclusive), it would be only one element equal to one in each row, indicating to which cluster belongs that row; suppressing the orthogonality requirement (i.e., using soft clustering) we can interpret the element (i, j) in H as a measurement of the probability of point i being in the cluster j (though formally this idea is not valid as H is not a stochastic matrix). “Soft” orthogonalization is reflected when we plot the rows of H (Figure 1.1(b)). With the entries of H we can identify the vectors which have a small angle between them; this makes H effective as an indicator of the cluster membership of each vector. 6 Figure 1.1: The left axis shows the data V (dots) and the columns of the factor W (squares), which approximate the means of each cluster. The right plot shows each row of H, which can be viewed as a cluster indicator. Each line’s colour was determined according to each row’s maximum value. Labels identify the represented row. 2. Image processing. A common application of NMF in the image-processing field is related to the compression of an image data set. When r (see (1.3)) is smaller than n we get a sense of compression, because we are constructing a basis W in a lower dimensional space that captures the “core” characteristics of the set represented by V . If V is a matrix in IR(m1 m2 )×n that contains a database of n images of m1 × m2 pixels each (each column of V is a vectorized representation of an image) we can expect W to capture the most relevant features of the set and, by taking r < n, obtain a set of r images that contains less overlapping information and is enough to recover the original set with certain fidelity, although it is smaller and easier to handle than V . NMF vs. PCA Lee and Seung [51], in the article that roused a wave of research on NMF, use an interesting comparison between PCA and NMF in the context of image processing. Recall that PCA is a data analysis method to reduce the dimension of the original data by projecting the raw data onto a few dominant eigenvectors with large variance (energy) [2, Section 6.3]. PCA can be computed via SVD [9, Chapter 5], 7 (a) PCA (b) NMF Figure 1.2: Sets of basis images with 49 elements. as it is done in this example, by decomposing the covariance matrix of the data matrix and using the left singular eigenvectors to construct the principal components. NMF can be compared with PCA in the sense that both have as a goal the extraction of essential information from the original data, but there are clear differences that make each of them suitable for different models and types of applications. One remarkable difference is that, while both approaches can be used to construct a basis that can be linearly combined to regenerate an original set (or to create new elements), PCA requires that the images in the basis be added and subtracted to recover the original set (the coefficients used to combine the basis elements can be negative), whereas in the NMF case only addition is required (the coefficients, contained in H, used to combine the basis elements are always nonnegative). A consequence of this fact is seen in the basis vectors generated by NMF. As an example, consider a database consisting on 2429 face images and compare the 49-rank basis images obtained through PCA (Figure 1.2a) against the one obtained by NMF using r = 49 (Figure 1.2b). The example illustrates the way in which NMF produces basis images that correspond to parts of the set. Given that in this factorization the only way to reconstruct a face is by adding more basis images, those images are forced to represent 8 parts which, for the used database, are face parts: nose, chin, forehead, hairlines. Thus, we can see one strength of NMF: the allowance of only nonnegative coefficients in linear combinations of the basis leads to more physically intuitive basis elements. Additionally, the claims that our brain learns “by parts” [77] suggest that NMF can be viewed as an analogous representation of the physical process in neurons. It cannot be said that NMF is better than PCA, since PCA is designed to produce optimal part-representations in a certain sense and has important properties (orthogonality, robust computation, unique factorization) that make it the best option for certain models, but it is clear that NMF matches our intuitive idea of parts and represents a good alternative for applications in which using negative coefficients does not make sense. 1.3.2 Moving to tensors The multilinear structure of tensors makes them convenient—even necessary—for certain applications. First, there are problems whose model involves measurements that correspond to a large number of different attributes. In those cases, a high-dimensional structure that represents the data and allows the association of each problem’s feature to a simple index, simplifies the data manipulation and the performance of certain operations, e.g. averaging over one of the attributes. The following are some examples of models in which tensors are used to organize data of different aspects of the problem; we use superindexes to denote the modalities (i.e., aspects to be measured in the problem) considered along the tensor modes and subindexes to indicate the data type. 1. Spectroscopy data, Xstrength ∈ IRBatch number×time×spectra [3]. 2. Web mining, Xclick counts ∈ IRusers×queries×web pages [70]. 3. Image analysis, Ximage intensity ∈ IR people×views×illuminations×expressions×pixels [76]. 4. Semantic differential data Xgrade ∈ IR judges×music pieces×scales [57]. Second, for certain applications it is convenient to preserve the natural structure of the data set. As an example, consider the image-processing example given in 9 section 1.3.1. In order to work with an image database, we must vectorize each image and store it as a column in the matrix V ∈ IRm1 m2 ×n . If a tensor structure is used instead, we have a tensor V ∈ IRm1 ×m2 ×n where the images are simply stacked one behind the other (the depth of the tensor represents the number of images in the set), and thus a more natural arrangement of the data is achieved. The use of such an arrangement benefits the analysis of the set, e.g., in the fields of image processing and computer vision structure tensors (representations of partial derivatives) are used to obtain orientation estimation and image structure analysis, and it has been pointed out [15] that its use has advantages like “allowing the integration of information from a local neighborhood”, and, consequently, “making possible to distinguish areas where structures are oriented uniformly, like in regions with edges, from areas where structures have different orientations, like in corner regions”. Third, the multilinear structure of tensors facilitates the interpretation and analysis of the factors and makes it possible to identify patterns and extract information on the different—independent—modalities of the dataset. When we choose to use a matrix approach to perform the exploration of the set, we have to group or unfold several modalities along one single dimension in order to adapt the data to the model, e.g., in the image-processing example the spatial structure of the images (wide and height) is condensed into single columns. This unfolding process can hinder the interpretation of the factors, but, even more important, by mixing the information of the modalities unfolded together, the opportunity to obtain information on each specific modality is dispelled and the potential data mining on the set is limited. Mørup et al. [55] give an example where nonnegative factorization is used to analyze electroencephalography (EEG) data. They are aware of the impact that collapsing several modalities into a single dimension can have in the exploration performed, and they use both NMF and NTF to fully analyze the dataset and extract different information about the problem: NTF to perform an overall analysis, and NMF to perform a local one (while NMF can examine subject specific activities and search for time-frequency patterns, NTF can effectively extract the most similar activities across subjects and/or conditions). Thus, the use of tensors simplifies the organization of the data and provides an 10 alternate insight into the problem. 1.4 Previous work In general terms, there are two main NTF factorizations: Tucker and Candecomp/PARAFAC (CP) (explained with more detail in Chapter 2). The main algorithms for computing these decompositions can be classified as multiplicative update rules, all-at-once approaches, and alternating least squares (ALS). In brief we can say that the later approach is an iterative algorithm that alternates between phases, each of which is a least-squares procedure. In the NTF problem, each ALS’s phase corresponds to holding all the variables fixed except one and solving the resultant linear least-squares subproblem for the non-fixed variable. In this section we give an overview of the research that has been done on the three general classes of algorithms to compute NTF. Later, in Chapter 3, we revisit each of the approaches and describe them in more detail. In 1997 Paatero [60] presented an all-at-once approach to obtain PARAFAC decompositions. Paatero and Tapper [63] had previously published a work on “positive matrix factorizations”, which is acknowledged as an important paper on NMF’s historical development. It seems that Paatero was interested in extending their ideas to a multilinear model. Paatero implemented a Gauss-Newton approach in solving the problem and handled the nonnegativity by using logarithmic barrier functions. He explored some of the NTF problem’s peculiarities, like degeneracy [62] or occurrence of local minima, and tried to regularize the problem by imposing a fixed norm on each factor’s columns. He did not use a specific data set to measure the performance of the proposed algorithm, and part of his experiments seem focused on 2 × 2 × 2 examples that exhibit degeneracy and the way in which the nonregularity of the problem affects the speed of convergence. However he clearly stated that the main limitation of the new algorithm is its behavior with large models, whose number of required operations could be “on the limit of being impossible on a PC computer”. In the same year, Rasmus Bro and Sijmen De Jong [14] introduced a modification of the standard algorithm for nonnegativity linear least-squares regresssion that 11 is specifically designed for use in both CP and Tucker multilinear factorizations. Their algorithm uses an ALS approach and is presented in the context of chemometrics problems. The numerical experiments involve third order tensors up to size 32 × 4 × 371, which are approximated with a rank-3 tensor (i.e., 1221 variables are computed). They measure the quality of their algorithm by comparing its running time against that of other algorithms and, because their focus is on the algorithm’s time consumption, they do not mention any of the typical challenges related to tensors (like lack of uniqueness of the solution). Nor do they justify the chosen rank of the computed approximations; they merely used PARAFAC models to show that their algorithm can be used on multilinear models, but did not focus on the NTF problem. In 2006 Friedlander and Hatz [33] combined this approach with some sparseness constraints (an `1 -norm penalty function) and a strategy to keep the solution bounded by scaling every iteration, so that the `1 norm of the factors is controlled. Later, in 2007, Kim, Park, and Eldén [41] solved the problem using nonnegative least squares with multiple right-hand sides (NLS/M-RHS), instead of NLS with single right-hand sides (NLS/S-RHS), for estimating each factor, and used their algorithm to solve large-scale problems. In the same year Cichoki, Zdunek, and Amari [20] presented a new ALS algorithm that is robust in the presence of noise, and, therefore, has potential applications in multi-way blind source separation (BSS). They solve a matrix problem for each of the frontal slices of the tensor and incorporate the regularization and penalty terms into the local squared Euclidean norms, achieving sparse and local representations of the solutions. At the same time, there was an exploration of the multiplicative update rules. In 1999 Lee and Seung [51] introduced a multiplicative update rule to compute NMF and started a wave of research on the topic. Over the next several years, numerous researchers, such as Welling and Weber [81] in 2001, attempted to extend these easy and quick-to-implement update rules to the tensor case. Later, in 2005, works that use multiplicative updates are adapted for specific applications. Some examples of researchers that used these updates include FitzGerald, Cranitch, and Coyle [29], who focused on the problem of sound source separation; and Hazan, Polan, and Shashua [38, 66] whose research centered on problems in computer vision. Also included are Mørup, Hansen, Parnas, and Arnfred [55] 12 who concentrated on EEG; and Bader, Berry, and Browne [7, 8] who explored email tracking in Enron data. In every case we find small variations of the multiplicative rule that take advantage of the particular structure of the problem in question. Except for Paatero’s work, most of the literature on NTF focuses on update rules and ALS, as, presumably, computational limitations restricted the exploration of the all-at-once approach. In our work, we propose an all-at-once approach to computing NTF. Our method optimizes over all factors simultaneously and combines two regularization strategies to ensure that the factors in the decomposition remain bounded and equilibrated in norm. 1.5 Notation We follow the notation used in [33] and [42]. Upper-case calligraphic letters represent tensors. All matrices are denoted by capital Roman letters, whereas vectors are denoted by lower-case Roman letters, and scalars by lower-case Greek letters. The Im × Im identity matrix is denoted by IIm . A vector of all ones is denoted by e and its size is implied by its context. To denote sections or subarrays of a tensor, we use M ATLAB notation, i.e., the colon sign is used to pick up elements of the tensor along a specific mode. Thus, if we have a three dimensional array X and we want to consider horizontal, lateral, and frontal two-dimensional slices, we could use Xi:: , X: j: , and X::k respectively; see Figure 1.5. Figure 1.5 has been taken from [43]. 13 Figure 1.3: Slices of a 3-order tensor. (a) Horizontal Xi:: . (b) Lateral X: j: . (c) Frontal X::k . This figure has been taken from [43]. 14 Chapter 2 Tensors Although in simple terms a tensor is a generalized linear “quantity” that can be expressed as a multi-dimensional array, the view of a tensor as a multilinear array oversimplifies the concept. Rather than existing as a mere collection of data, a tensor has a set of operations defined on it. In the following sections a brief description of some operations defined on tensors is provided. Figures used in the following sections are taken from [42], [43] and [48]. 2.1 Tensor’s definition An Nth-order tensor is formally defined as the tensor product of N vector spaces. De Lathauwer [49] provides the following definition. Definition 1. Let V1 , V2 , . . . , VN be N Euclidean vector spaces with finite dimensions I1 , I2 , . . . , IN . Consider N vectors Un ∈ Vn , n = 1, . . . , N. We denote by U1 ◦U2 ◦ · · · ◦UN the multilinear mapping on V1 ×V2 × · · · ×VN , defined by (U1 ◦U2 ◦ · · · ◦UN )(X1 , X2 , . . . , XN ) = hU1 , X1 iV1 hU2 , X2 iV2 · · · hUN , XN iVN , in which hUn , Xn iVn denotes the scalar product in Vn , and Xn is an arbitrary vector in Vn (1 ≤ n ≤ N). The space generated by all the elements U1 ◦ U2 ◦ · · · ◦ UN is called the tensor product space of V1 ,V2 , . . . ,VN . An element of the tensor product space is called an Nth order tensor (over V1 ×V2 × · · · ×VN ). 15 Figure 2.1: Fibers of a 3-order tensor: (a) columns x: jk ; (b) rows xi:k ; (c) tubes xi j: . This figure has been taken from [43]. Tensors of order three or higher are called higher-order tensors [43]. For higher-order tensors, a f iber is an array obtained by fixing all the indices except one. They are comparable to the columns or rows of a matrix; in a thirdorder tensor, for example, we have column, row, and tube fibers denoted by x: jk , xi:k , and xi j: , respectively; see Figure 2.1. Figure 2.1 has been taken from [43]. 2.2 Operations 2.2.1 Matrix operations In order to define some of the tensor operators, it is necessary to recall some matrix products commonly used in multilinear algebra. In this section we present the definition and some properties of the Kronecker, Khatri-Rao and Hadamard products; more properties can be found in B. Kronecker product Following the definition given by Van Loan [75], the Kronecker product of two matrices, A ∈ IRm1 ×n1 and B ∈ IRm2 ×n2 , is denoted by A ⊗ B and given by an m1 × n1 16 block matrix whose (i, j) block is the m2 × n2 matrix ai j B: ··· a1n1 B a21 B a22 B · · · A⊗B = .. .. .. . . . am11 B am12 B · · · a2n1 B .. . . a11 B a12 B am1 n1 B Some of the basic properties of the Kronecker product are: • (A ⊗ B)T = AT ⊗ BT , • (A ⊗ B)(C ⊗ D) = AC ⊗ BD, • A ⊗ (B ⊗C) = (A ⊗ B) ⊗C. We do not mention more properties because they are not needed in the following sections, but an extensive list (that includes properties concerning eigenvalues of a Kronecker product) is found in [75], [35], and [50, Chapter 13]. Khatri-Rao product Given two matrices with the same number of columns, A ∈ IRI1 ×r and B ∈ IRI2 ×r , its Khatri-Rao product is a (I1 I2 ) × r matrix denoted by A A B= h a:1 ⊗ b:1 a:2 ⊗ b:2 · · · B and given by a:r ⊗ b:r i . Note that if a and b are vectors, then the Khatri-Rao and Kronecker products are the same, i.e., a ⊗ b = a b. Hadamard product The Hadamard product is the elementwise matrix product. It can only be performed on matrices that have the same size. Thus, given two matrices, A and B, in IRI1 ×I2 , their Hadamard product, denoted by A ∗ B, is also an I1 × I2 matrix whose elements are given by (A ∗ B)i j = ai j bi j . 17 Figure 2.2: Matricizing a tensor along its first way. This figure has been taken from [43]. 2.2.2 Matricization Matricization is the operation of “flattening” a tensor and expressing it as a matrix. This is achieved by reordering the elements of the tensor according to a specified order along its modes. Formally, to matricize a tensor X ∈ IRI1 ×I2 ×···×IN we require a partition of the modes {1, . . . , N} in two ordered sets, R = {r1 , . . . , rL } and C = {c1 , . . . , cM }, which must be mapped to the rows and the columns of the matrix, respectively, according to the rule X(R×C :IN ) jk = xi1 i2 ...iN , with L " `−1 j = 1 + ∑ (ir` − 1) ∏ Ir`0 `=1 # M and k = 1 + `0 =1 ∑ m=1 " (irm − 1) # m−1 ∏ Ir m . m0 =1 The matrix XR×C :IN is (∏m∈R Im ) × (∏m∈C Im ). The geometrical rearrangement of the elements might not be easy to picture in the general case but the case that commonly arises in the applications is the one in which the set R consists of only the index in ; that is, we simply take the fibers along the n mode and put them, one beside the other, on the matrix X(n); see Figure 2.2.2. The resulting matrix has dimensions In × (∏m6=n Im ). Another particular case that must be pointed out is when the set C consists of only the index in . The geometrical meaning of this case is taking all the fibers along the nth-mode and stacking them, one below the other, on the column vector 18 vec(X ) ∈ IRI1 ×···×In ; thus, this special case of matricization vectorizes a tensor. 2.2.3 Tensor operations Addition and scalar multiplication are performed in the natural way. We can add two arrays of the same size by adding up the corresponding elements; we multiply the array by a scalar by scaling each entry by the given scalar. A less natural operation is the multilinear matrix multiplication, described next. The tensor-matrix product The basic tensor-matrix multiplication is called the n-mode product and consists of multiplying a tensor along its nth-mode by a matrix. An Nth-order tensor X , whose size is I1 × I2 × · · · × IN , can be multiplied along its nth mode only by matrices A(n) of size Jn × In , i.e., matrices whose number of columns coincides with the dimension of the tensor along the nth way. The product, denoted by X ×n A(n) , is of size I1 × · · · × In−1 × Jn × In+1 × · · · × IN . Elementwise we have (X ×n A(n) )i1 ...in−1 jin+1 ...iN = In ∑ xi i ...i 1 1 N a jk , k=1 where 1 ≤ j ≤ Jn . The idea of transforming a tensor by a matrix can be extended to the case where each mode of the tensor is under a linear transformation. Kolda [42] introduces the idea of the Tucker operator as a representation for the multi-mode multiplication, defined as follows. Definition 2. Let Y ∈ IRJ1 ×J2 ×···×JN and N̂ = {1, . . . , N}. Suppose we have matrices A(n) ∈ IRIn ×Jn for each n ∈ N̂. Then the Tucker operator is defined as [[Y ; A(1) , A(2) , . . . , A(N) ]] = Y ×1 A(1) ×2 A(2) ×3 · · · ×N A(N) . The result has the size I1 × I2 × · · · × IN . Two of the tensor-matrix product’s properties worth mentioning are: 19 • If n 6= m, we have X ×m A ×n B = X ×n B ×m A. • If both matrices are multiplying the same mode, we have X ×n A ×n B = X ×n (AB). Other operations Vectorization of tensors can be used to define the tensorial inner product and norm. In this way, the inner product of two given tensors X and Y in IRI1 ×···×IN is defined as hX , Y i = vec(X )T vec(Y ) = I1 IN I2 ∑∑ ··· i1 =1 i2 =1 ∑ xi i ···i 1 2 N yi1 i2 ···iN . iN =1 It follows that a norm (analogous to the matrix Frobenius norm) of the tensor X is defined by kX k2 = hX , X i = vec(X )T vec(X ) = I1 IN I2 ∑∑ i1 =1 i2 =1 ··· ∑ xi2 i ···i 1 2 N . iN =1 2.2.4 Special tensors The specially structured tensors described below are important in the descriptions given in sections 2.3 and 2.4. Diagonal tensor A diagonal tensor X ∈ IRI1 ×I2 ×···×IN is a tensor that has nonzero elements xi1 i2 ...iN only when i = i1 = i2 = · · · = iN , for 1 ≤ i ≤ min{In }. With this idea we can define the identity tensor as the diagonal tensor whose nonzero entries xi1 i2 ...iN are all one; see Figure 2.3a. 20 (a) Identity tensor (b) A rank-one tensor Figure 2.3: Third-order special tensors. Figures taken from [43]. Rank-one tensor An Nth-order tensor X ∈ IRI1 ×···×IN is a rank-one tensor if it can be expressed as the tensor product of N vectors v(1) ∈ IRI1 , v(2) ∈ IRI2 , . . . , v(N) ∈ IRIN , i.e., if X = v(1) ◦ v(2) ◦ · · · ◦ v(N) ; see Figure 2.3b. Elementwise, we have (1) (2) (N) (X )i1 i2 ...iN = vi1 vi2 · · · viN . 2.3 Derivatives Denote by X the multi-mode product [[G ; A(1) , . . . , A(N) ]], and consider the matricization of the result. We know that X(n) = A(n) G(n) (A(N) ⊗ A(n−1) ⊗ A(n+1) ⊗ · · · ⊗ A(N) )T , where ⊗ represents the Kronecker product. From matrix calculus we know that for matrices X, Y , and Z, d[WY Z] = Z T ⊗W ; dY 21 therefore, ∂ X(n) ∂ A(n) = i h A(N) ⊗ · · · ⊗ A(n−1) ⊗ A(n+1) ⊗ · · · ⊗ A(1) GT(n) ⊗ I, where I is the mn × mn identity matrix, and ∂ X(n) (N) = A ⊗ · · · ⊗ A(n−1) ⊗ A(n+1) ⊗ · · · ⊗ A(1) ⊗ A(n) . ∂ G(n) Let us define A⊗ , An⊗ , and An as A⊗ = A(N) ⊗ · · · ⊗ A(1) An⊗ = A(N) ⊗ · · · ⊗ A(n−1) ⊗ A(n+1) ⊗ · · · ⊗ A(1) An = A(N) · · · A(n−1) A(n+1) · · · A(1) , where represents the Khatri-Rao product. Consider the special case when G is a diagonal “square” tensor, i.e., r = r1 = r2 = · · · = rN , and use D to denote the matrix such that Gii···i = Dii with i ∈ {1, . . . , mink (rk )}. Then we have An⊗ GT(n) = An DT , and ∂ X(n) (n) |∂ A {z } ∏mi ×rmn GT(n) ⊗ Imn |{z} |{z} An⊗ |{z} = ∏ mi ×rN−1 i ! (rN−1 ×r) (mn ×mn ) i6=n = (2.1) An D ⊗ Imn , |{z} |{z} !|{z} (r×r) (mn ×mn ) ∏ mi ×r i6=n and, though the dimension of the partial derivative with respect to A(n) does not change, the calculations are remarkably simplified. The derivative of X with respect to a diagonal G is easily found if we consider 22 that the Tucker product can be written using outer products: [[G ; A(1) , . . . , A(N) ]] = r1 rN r2 ∑ ∑ ··· j1 =1 j2 =1 ∑ jN =1 (1) (2) (N) g j1 j2 ... jN a: j1 ◦ a: j2 ◦ · · · a: jN . Since G is diagonal, we know that g j1 j2 ... jN 6= 0 implies j1 = j2 = · · · = jN . Therefore, we are only interested in the elements g j j··· j ; and for those we have ∂X (1) (2) (N) = a: j ◦ a: j ◦ · · · a: j . ∂ g j j... j (2.2) 2.4 Rank and factorizations 2.4.1 Rank of a tensor There are two different concepts associated with the rank of an Nth-order tensor: 1. The rank-(R1 , R2 , . . . , RN ). The n-rank of a tensor, denoted by rankn , is defined as the (column) rank of the matrix X(n) . In contrast to the matrix case, where the column and row rank are always the same, the ranks of a tensor along its different modes can be different. Thus, a tensor X such that rank1 (X ) = R1 , rank2 (X ) = R2 ,. . . , rankN (X ) = RN is called a rank-(R1 , R2 , . . . , RN ) tensor. 2. The rank-R. The rank-R of a tensor X is the minimum number of rank-1 terms needed to generate X by their sum. We are not aware of algorithms that determine the rank-R of a given tensor. 2.4.2 Tucker decomposition In a sense, Tucker factorization represents the opposite of the multi-mode multiplication; given a tensor X ∈ IRI1 ×···×IN we want to decompose it into a core tensor 23 Figure 2.4: A Tucker decomposition of a 3rd-order tensor. This figure has been taken from [43]. G ∈ IRJ1 ×···×JN and N matrices A(i) ∈ IRIi ×Ji with 1 ≤ i ≤ N, such that X = [[G ; A(1) , A(2) , . . . , A(N) ]] = G ×1 A(1) ×2 A(2) ×3 · · · ×N A(N) . The geometry of this decomposition is illustrated in Figure 2.4. The tensor G is called core tensor and is generally full. On the other hand, the matrices that transform the core tensor along each mode, A(1) , . . . , A(N) , are commonly assumed to have orthogonal columns. Changing the requirements over G and A(1) , . . . , A(N) leads to different decompositions. A remarkable case is when the tensor G is required to satisfy the properties of all orthogonality (any two slices along the same mode are orthogonal, for all the modes) and ordering (the norms of the n-mode slices form a descendent sequence, and that holds for all the modes), and the matrices A(1) , . . . , A(N) are required to be orthogonal. In this case we get the higher-order SVD (HOSVD) factorization [47]. Note that in the HOSVD the core tensor G is in general a full tensor; the structure constraint of diagonality of the matrix of singular values in the second-order case is replaced by the all-orthogonality and ordering conditions. Because the HOSVD represents a form of higher-order principal-component analysis, it makes sense to interpret each matrix A(n) as the principal component along the nth mode and the core tensor G as containing the level of interaction between the components. Even without the specific requirements that HOSVD imposes over G , the Tucker factorization is often given this interpretation in the literature: a nucleus containing 24 the “core” information and a group of transformations along each mode. The first method to compute Tucker decompositions was introduced by Tucker [74] in 1966. The method’s basic idea is to find the components that best capture the variation in the nth mode, independent of other modes. The method was later analyzed by Lathauwer, De Moor, and Vandewalle [47] and called HO-SVD. Another popular algorithm is TUCKALS3, which was introduced in 1980 by Kroonenberg and De Leeuw [44] for computing decompositions of third-order tensors. The algorithm uses an ALS approach and it was later extended to Nth-order tensors [40]. Lathauwer et al. [48] called this algorithm the higher-order orthogonal iteration (HOOI), which is the name that commonly appears in the literature. A survey of other algorithms for computing Tucker factorizations can be found in [43]. 2.4.3 CANDECOMP/PARAFAC decomposition The CANDECOMP/PARAFAC (CP) decomposition can be viewed as a special case of Tucker when the core tensor is the identity tensor. As a consequence of having the identity as core tensor, we also require the matrices A(n) to have the same number of columns in order to keep the dimensions involved in the multiplication coherent. The idea behind this decomposition is to express a tensor as the sum of a finite number of rank-one tensors. For example, given a third-order tensor X ∈ IRI1 ×I2 ×I3 , we want to write it as r (1) (2) (3) X ≈ ∑ ai ◦ ai ◦ ai , i=1 (1) where r is a positive integer and ai (2) ∈ IRI1 , ai (3) ∈ IRI2 , and ai ∈ IRI3 ; see Figure 2.4.3. A useful assumption made in [43] is the scaling of the columns of the modematrices A(i) and the “absorption” of the scale into a vector g ∈ IRr , so that for a general Nth-order tensor X ∈ IRI1 ×···×IN , a CP decomposition with rank r would be given by r (1) (N) X ≈ ∑ gi ai ◦ · · · ◦ ai i=1 25 = [[g; A(1) , . . . , A(N) ]]. (2.3) Figure 2.5: A PARAFAC decomposition of a 3rd-order tensor. This figure has been taken from [43]. The main problem that arises in computing CP factorizations is to determine the number of rank-1 components in the decomposition. According to the survey by Kolda and Bader [43], most procedures fit multiple CP decompositions with different numbers of components until one is “good”. Once the number of components has been chosen, the CP decomposition can be computed using different algorithms that include ALS and all-at-once approaches. A description of these methods can be found in the surveys by Faber, Bro, and Hopke [26] and Tomasi and Bro [73]. 2.4.4 Other decompositions There are more tensor decompositions that are mostly particular cases of Tucker and CP decompositions, or proposed models that combine aspects of them. We mention, for example, the INDSCAL (individual differences in scaling), a special case of CP for three-way tensors that are symmetric in two modes or the PARAFAC2, a variant of CP applied to a collection of matrices that are equivalent in only one mode. A list and description of these particular decompositions can be found in [43]. Although they are beyond the scope of this work, they provide a context for NTF, which are a particular case of the CP decomposition and will be described in more detail in the next chapters. 26 Chapter 3 Nonnegative tensor factorizations The nonnegative tensor factorization (NTF) problem consists in finding an Nthorder core tensor G ∈ IRJ1 ×···×JN , and N matrices A(1) ∈ IRI1 ×J1 , . . . , A(N) ∈ IRIN ×JN , whose multimode product approximates a given tensor V ∈ IRI1 ×···×IN , i.e., V ≈ G ×1 A(1) ×2 A(2) · · · ×N A(N) . A common approach to compute these factors [8, 33, 38, 60] is to solve the nonlinear least-squares problem minimize F(G , A(1) , . . . , A(N) ) subject to G , A(1) , . . . , A(N) ≥ 0, G ,A(1) ,...,A(N) (3.1) where F(G , A(1) , . . . , A(N) ) = 12 kG ×1 A(1) ×2 A(2) · · · ×N A(N) − V k2F . (3.2) In the CP model that we consider here, we are seeking an Nth-order diagonal tensor G ∈ IRr×···×r (the core factor), and matrices (that we call matrix factors in posterior sections) that have r columns, i.e., A(n) ∈ IRIn ×r , n ∈ {1, . . . , N}. 27 In the case where N = 2, (3.1) reduces to the NMF problem, minimize (1) (2)T 1 2 kA GA subject to (1) A(1) ,A(2) −V k2F (3.3) G, A , A (2) ≥ 0. Note that G is not needed in this formulation, but we keep it as a factor to maintain the analogy with (3.1). We use this matrix form of the problem to illustrate particular aspects of the more general NTF problem. In this chapter we provide a description of the main difficulties associated with the NTF problem and explain the ideas behind the most popular approaches to solving the problem. 3.1 Challenges The NTF problem usually appears in applications that involve a large number of variables. Consequently, we have to deal with large-scale optimization problems and their implied challenges, including large computer-memory requirements and long computing time. Apart from computational challenges, the problem is ill-posed—this presents another challenge. A description of the main challenges related to NTF is provided next. 3.1.1 Lack of uniqueness Kruskal [45, 46] provides a discussion on the uniqueness of CP decompositions of 3rd-order tensors; in this section we explain some definitions that are found (detailed at length) in his work. The multimode tensor factorization is never unique. If it is assumed that U, V , and W are nonsingular matrices, then [[G ; A(1) , A(2) , A(3) ]] = [[G ×1 U ×2 V ×3 W ; A(1)U −1 , A(2)V −1 , A(3)W −1 ]]. (3.4) Thus, any of the n-mode factors can be modified without affecting the product as long as the inverse modification is applied to the core tensor. This is known as 28 scaling indeterminacy. This fact is more easily seen in the matrix case, where we have −T T W GH T = (W D−1 1 )(D1 GD2 )(HD2 ) . Furthermore, in the case of the CP decomposition where the elements of the diagonal tensor G are fixed and all equal, i.e., λ = g1 = · · · = gr , we also have permutation indeterminacy. We know that in the CP model, [[A(1) , A(2) , . . . , A(N) ]] = [[A(1) P, A(2) P, . . . , A(N) P]] for any r × r permutation matrix P. Therefore, we could arrange the rank-1 factors in a different order, which would mean applying the same permutation P to all the columns of each factor A(n) , n = 1, . . . , N, and still get the same approximation. Note that if G is also considered a variable, this indeterminacy is always present, since it is possible to apply the permutation P to the diagonal elements of G and maintain the product unchanged. These two operations, scaling and permutation, are known as elementary changes. Two rank-r decompositions of a given tensor X are equivalent if one can be obtained from the other by elementary changes. Kruskal defines as unique a decomposition (of a tensor X ) that has rank-r and is equivalent to all other rank-r decompositions of X . However, the lack of uniqueness in the NTF problem is insidious. In his work, Kruskal also proves that 3rd-order decompositions are not unique (according to his definition) if their ranks are large enough, although the expression “large enough” does not have a specific quantification. Despite the absence of a result concerning the maximum rank that guarantees uniqueness (up to elementary changes), it is observed that the indeterminacy increases with the size of the tensor. It has been conjectured, for example, that for an array of dimensions r × r × r, any approximation of order greater than 3r/2 − 1 is not unique [45]. Clearly, the presence of all these indeterminacies make the problem ill-posed. 29 3.1.2 Degeneracy and best rank-R approximation A tensor is degenerate if it can be approximated arbitrarily well by a factorization of lower rank [21, 62]. The property is better understood through the use of an example. Here we outline the one used in [62], a model carefully constructed to illustrate degeneracy in a 3rd-order tensor. Consider the rank-3 tensor X ∗ ∈ IRn×n×n given by X ∗ = [[â, b, c]] + [[a, b̂, c]] + [[a, b, ĉ]] where a, b, c and â, b̂, ĉ are vectors in IRn . The tensor given by the product Xε = [[A, B,C]] ε2 1 ε2 1 ε2 1 ε2 1 a + â − a + â , b + b̂ − b + b̂ , = ε 2 ε 2 ε 2 ε 2 2 2 1 ε ε 1 c + ĉ − c + ĉ ε 2 ε 2 2 2 2 1 ε 1 ε 1 ε a + â ◦ b + b̂ ◦ c + ĉ + = ε 2 ε 2 ε 2 2 2 ε ε ε2 1 1 1 − a + â ◦ − b + b̂ ◦ − c + ĉ ε 2 ε 2 ε 2 6 ε = [[â, b, c]] + [[a, b̂, c]] + [[a, b, ĉ]] + [[â, b̂, ĉ]] 4 (3.5) is a rank-2 tensor. However, we clearly have kXε − X ∗ kF = ε6 [[â, b̂, ĉ]] 4 F which decreases as ε → 0. Thus, X ∗ is a rank-3 tensor that is arbitrarily well approximated by the rank-2 tensor Xε ; in other words, we have a rank-3 tensor for which there is not a best approximation of rank 2. Extensive research has been carried out for particular types of tensors (e.g., I × I × 2 tensors [69]). In general, it has been proven that for any given size there is always a set of tensors that, at least for certain values of r, exhibits degeneracy 30 [22]. This result implies that any given NTF problem may not have a solution. 3.1.3 Lack of bounds Note that the scaling indeterminacy described in section 3.1.1 makes it possible to have extremely unbalanced factors, i.e., some factors have arbitrarily large norms, while others have arbitrarily small norm. For example, for any γ 6= 0, G ×1 A(1) ×2 A(2) ×3 A(3) = G ×1 γA(1) ×2 1γ A(2) ×3 A(3) . Furthermore, this unbalanced variation in the elements of the factors can also be caused by degeneracy. When we try to approximate a degenerate tensor X , we find a “best-fit” solution that gets arbitrarily closer to X as some (or all) its factors approach infinity. This is seen in (3.5), where the first column of A, B and C tends to infinity and the second tends to −∞ as ε → ∞. This lack of bounds has the effect of producing ill-conditioned factors whenever the original tensor is degenerate. 3.2 Algorithms for computing the NTF Algorithms for the NTF problem can be classified into three groups: multiplicative update algorithms, alternative least squares (ALS) and all-at-once approaches. 3.2.1 Multiplicative update rules The multiplicative-update algorithm, proposed by Lee and Seung [51], was one of the first methods available for the NMF problem. This approach gained popularity because of its simplicity. It was extended to the NTF problem by Welling and Weber [81]. Algorithm 1 outlines a basic version of the multiplicative-update algorithm for the NTF problem. The symbol ∗ and the fractions denote componet-wise operations (i.e., ∗ is the Hadamard product defined in Chapter 2). We highlight several of this method’s properties. First, note that the algorithm involves only the computation of the matrices A(1) , . . . , A(N) and assumes that G is the identity tensor, i.e., this update rule is 31 Algorithm 1: Multiplicative update rule Input: Nth-order tensor V ; N initial mode-factors A(i) ≥ 0, i = 1, . . . , N while the residual norm is greater than a tolerance do for i = 1, . . . , N do Form Vi = V(i) ; (i) Compute K = A ; Update A(i) using the rule A(i) ← A(i) ∗ Vi K ; A(i) K T K Scale the factors to keep them bounded end Compute the residual end designed only for CP factorizations. Second, because each update is obtained using only multiplications, if the initial guess consists of nonnegative factors, the subsequent iterates will also be nonnegate. The algorithm thus produces nonnegative factors, as required. Third, it is guaranteed that the algorithm converges to a stationary point. We describe the proof given by Berry et al. [10]. Consider the additive version of the update rule A(i) ∗ (Vi − A(i) K T )K. A(i) K T K This is equivalent to the multiplicative rule. To see this note that A(i) = A(i) + A(i) ∗ (Vi − A(i) K T )K (i) T A K K A(i) = A(i) + (i) T ∗ (Vi K − A(i) K T K) A K K (i) ∗V K A A(i) ∗ A(i) K T K i = A(i) + (i) T − A K K A(i) K T K Vi K = A(i) ∗ (i) T . A K K A(i) = A(i) + 32 (3.6) (Recall that fractions and asterisks indicate element-wise operations and thus we can cancel terms.) In the case where the algorithm convergences to some set of matrices {A(1)∗ , . . . , A(N)∗ }, we would have A(i)∗ ∗ (Vi − A(i)∗ K T )K = 0. A(i)∗ K T K (i)∗ If we assume that A jk > 0, then (i)∗T (Vi K − A(i)∗ K T K) jk = ((V(i) − A(i)∗ A (i)∗ )A ) jk = 0. Thus, ∂F =0 ∂ A(i) for all i, i.e., we are at a stationary point of the function F (the derivative of F is not obvious but details about it are given in section 4.2). This reasoning has two major drawbacks. First, a stationary point is not always an optimum; i.e., it could be just a saddle point (which could be verified if the second-order derivative of the function is available). Second, the guaranteed arrival at a stationary point is based on the assumption that all the elements in A(i) are positive (for all i). Therefore, when the algorithm converges to a limit point in the interior of the feasible region, we can guarantee that the point is a stationary point; but the algorithm might also stop at a point that is not stationary and simply lies on the boundary of the feasible region. It is worth noting the difficulty in proving the convergence of multiplicative updates. The difficulty comes mainly from the nonnegativity constraints. Lin [52] discusses the difficulty of proving convergence in the NMF problem and he proposes slight modifications to existing updates in order to prove their convergence. In the tensor case, we find that, in fact, the work by Welling and Weber lacks a convergence proof. Given the challenges that proving convergence for this approach presents, it is not surprising that the speed of convergence has not been accurately described. However, despite the lack of precise results, it is evident in practice that the ma- 33 jor drawback of the multiplicative rules is their slow convergence [10]. Therefore, after Welling and Weber’s work, modifications to the basic multiplicative update algorithm 1 have been proposed which try to adapt to the tensor problem the improvements made in NMF multiplicative rules. We find, for example, that in the NMF problem, modifications to the initial multiplicative rules lead to a class based on gradient descent methods [19]. In the tensor case, Hazan, Polak, and Shashua [38] propose something analogous, where the update for each column of the factor-matrices has the form (i) (i) (i) a j = a j − µ(a j ) ∂F (i) , ∂aj where we can clearly identify a step size represented by the function µ and a descent direction given by the negative gradient (i) The quantities µ(a j ) are step size parameters that can vary depending on the algorithm; the work by Hazan et al. focuses on 3rd-order tensors and, for a tensor in IRI1 ×I2 ×I3 , uses (i) (i) µ(a jk ) = a jk (i) (n) (n) ∑r`=1 a j` ∏3n=1,n6=i ha` , ak i (3.7) to construct a rank-r approximation, where i = {1, 2, 3}, k = 1, . . . , r, and j depends on the the mode i and in each case goes through the values 1,. . . ,Ii . It is worth mentioning that Hazan et al. use second-order derivatives in the analysis of their update rule. Other modifications include the work by Mørup, Hansen, and Arnfred [56], who use additional constraints to obtain sparsity and encourage uniqueness; and the work by Bader, Harshman, and Kolda [8], who introduce a new algorithm—ASALSAN, for computing three-way DEDICOM (“decomposition into directional components”; a family of decompositions for analyzing intrinsically asymmetric relationships)—and impose nonnegativity constraints on the model (they develop a nonnegative version of ASALSAN). None of these modifications is described with detail since they are not relevant to this research. However, it is important to note that there is a trade-off between the improvements achieved by more sophisticated rules and the complexity and 34 number of operations required, as seen by examining equation (3.7). Despite the refinement in the rules, the multiplicative-update approach has shown poor performance and no guarantee of getting an optimal point. Nevertheless, it continues to be the most popular strategy to compute nonnegative decompositions with an NMF version of the algorithm given in [10] being included in the statistics toolbox of the software M ATLAB v.7.6. 3.2.2 Alternating least squares A way of dealing with optimization problems where the objective function and constraint functions have a partially separable structure with respect to the problem’s variables consists of breaking down the problem into smaller ones that can be solved one-at-a-time in sequence. This is the intended goal of the optimization algorithm known as block-coordinate descent or nonlinear Gauss-Seidel [13, Section 2.7]. The alternating least squares (ALS) algorithm is a particular case of this more general approach. The ALS approach fixes all the factors except one, and optimizes the problem with respect to the free factor. In this way, the nonlinear least-squares problem is decomposed into a sequence of smaller, linear problems. Consider the matrix case and recall that we want to solve the problem minimize 12 kV −W GH T k2F , W,G,H≥0 (3.8) where V ∈ IRm×n , W ∈ IRm×r , G ∈ IRr×r , and H ∈ IRn×r . We have explicitely written the factor G (as it is done in (3.3)) but for now, in the remainder of this section, we can assume that G ≡ I (an assumption that typically appears in the NMF literature; e.g. [51, 81]) . If we follow the ALS approach, the problem given by (3.8) can be decomposed into two least-squares problems, minimize 21 kV −W H T k2F , W ≥0 and minimize 21 kV −W H T k2F . H≥0 (3.9) Note that each of these least squares problems is a collection of linear least-squares problems. For example, denote by v j the columns of the matrix V , and by h j the rows of the matrix H. The first problem in (3.9) can be solved by the sequence of 35 problems minimize 21 kv j −W hTj k22 , h j ≥0 j = 1, . . . , n. (3.10) Each of the subproblems (3.10) can be solved with any optimization algorithm for bound-constrained problems, like gradient-projection methods [13, 2.3], or BCLS [32]. Moreover, these n subproblems can be solved for in parallel. In the tensor case, the general nonlinear least-squares problem is partitioned into the subproblems NT FA(n) (1) (n) (N) 1 ]] − V 2 k[[G ; A , . . . , A , . . . , A minimize A(n) k2F subject to A(n) ≥ 0, for n = 1, . . . , N, and NT FG (1) (N) 1 ]] − V 2 k[[G ; A , . . . , A minimize G k2F subject to G ≥ 0. When solving the subproblem associated with each matrix A(n) , we have the option of matricizing the problem and solving the problem minimize A(n) (n) T A⊗ GT(n) A(n)T −V(n) 1 2 (n) subject to A 2 F (3.11) ≥0 by computing one row of A(n) at a time. In this way we can take advantage of the standard linear least-squares algorithms to solve each subproblem. Another possibility is to vectorize the problem to solve for all the rows of A(n) simultaneously. To do this, recall, from matrix calculus (see, e.g., [27]), that for any three matrices A, X, and B of suitable size, we have vec(AXB) = BT ⊗ A vec(X). 36 (3.12) Thus, (n) (n) vec A⊗ GT(n) A(n)T = vec A⊗ GT(n) A(n)T I(In ) (n) = I(In ) ⊗ A⊗ GT(n) vec(A(n)T ), and (3.11) becomes minimize A(n) (n) T I(In ) ⊗ (A⊗ GT(n) ) vec(A(n)T ) − vec(V(n) ) 1 2 (n) subject to A 2 2 ≥ 0. To solve the subproblem associated with the core tensor G , we need to isolate G(n) in any of the subproblems given by (3.11). We can use (3.12) again to get (n)T (n) vec A(n) G(n) A⊗ = A⊗ ⊗ A(n) vec(G(n) ), which, by choosing n = 1, becomes (1)T vec A(1) G(1) A⊗ = A⊗ vec(G(1) ), and leads to the optimization subproblem minimize G 1 2 A⊗ vec(G(1) ) − vec(V(1) ) 2 2 subject to G ≥ 0. The ALS approach was first used in the tensor case by Bro and De Jong [14] in 1997. Some later modifications include the work of Friedlander and Hatz [33], who incorporated a regularization function and a strategy to keep the variables bounded using the `1 -norm. They also use the vectorized version of the problem and solve simultaneously for all the rows of each matrix factor, increasing the computations’ efficiency. Depending on the implementation, ALS algorithms can be effective and the drawbacks faced by these algorithms are those that are inherent in the NTF problem: no guarantee of convergence to a local minimum (it could be a saddle point), great sensitivity to the initial point, and a considerable increase on the work when 37 additional constraints are imposed. 3.2.3 All-at-once approach The third approach consists in solving directly the nonlinear optimization problem with the boundary constraints. It offers the advantage of keeping the global relation between all the variables, which can benefit the quality of the solution. One of the main challenges of this approach, however, is the effective solution of large problems. This approach is the focus of this research and the details regarding its description and implementation are given in the next chapter. 38 Chapter 4 All-at-once approach The all-at-once approach can result in large-scale optimization problems—where computing the gradient and performing operations can be very expensive— but it can also save computational work by quickly reaching a satisfactory solution. However, prior to exploring this approach, it must be considered that one of the main challenges in solving the NTF problem is the development of a regularization strategy that resolves the problem’s inherent ill-posedness. In this chapter we describe both our regularization strategy and our algorithm to implement the all-at-once approach. 4.1 Regularization strategy The indeterminacies due to scaling and permutation are somehow avoided when we compute sequentially the factors of the approximation, which is done in ALS and multiplicative update rules. In both cases, we choose an initial approximation to the solution and keep all the values fixed while looking for a specific factor, which is determined by the status of the other factors. It follows that both scale and permutation are implicitly chosen together with the initial guess. In the other approaches the ill-conditioning of the problem is manifested by the fact that the algorithm converges to different solutions when it starts with different initial values. However, in the all-at-once approach the ill-posedness can directly affect the success of the algorithm. It is therefore important to establish a way to 39 control the factors we are computing. We use two different strategies to regularize the problem: a classic regularization strategy that takes the form of a penalty function and an explicit bound on the elements of the matrix factors. Both strategies are described next. 4.1.1 Regularization function Tikhonov regularization [37, Section 5.1], which penalizes the two-norm of the solution, is the best-known method to regularize optimization problems. In least-squares problems, the method requires the addition of a regularization term to the objective of the optimization problem, in order to favor a solution with small norm. In the problem we are solving, we use the regularization function N f (γ, A(1) , . . . , A(N) ) = = Jn (n) ∑ γn ∑ ka j k1 n=1 j=1 N Jn In ∑ γn ∑ ∑ n=1 (4.1) (n) |ai j |, j=1 i=1 which penalizes big values in the elements of the matrix factors and, as a consequence, encourages solutions with smaller norms. Note that in (4.1) we allow a different weight, γn , to constrain each matrix factor. The function f is expected to encourage sparsity of the computed factors, since it pushes the small values towards zero and leaves the significant big entries relatively unchanged. The use of a function that involves the `1 -norm and enhances sparseness helps to regularize the optimization problem. It is also remarkable that sparsity is a property coherent with the motivation of the NTF problem; it enriches the solution from an interpretive point of view because it leads to basis factors that represent independent parts of the whole set. Consider, for example, the matrix case where, given a m × n matrix V whose columns represent images, we want to find matrices W, G, H such that V ≈ W GH T . As explained in chapter 1, W ∈ IRm×r represents a basis of images of rank r, whereas H ∈ IRn×r contains the necessary coefficients to express each column of V 40 in terms of the basis W . In this case, we might want to have sparsity in H, meaning that we want the basis images to have strong, independent features so that the least number of them are required to construct each image of the original set. Note that we may wish to allow an arbitrarily dense basis W if that accurately reflects the nature of the original images. By differently weighting the penalization on the factors, we are trying to approach the “ideal case”: if r = n, W would be a full matrix (we would have W = V ), while H would be the identity matrix (very sparse). So, the chosen application will determine the penalty parameter used to control each factor. 4.1.2 Spacer steps In Section 3.1, a potential problem in tensor factorizations was identified. Matrix factors can be arbitrarily scaled and, depending on the initial guess, could become large in norm. To keep all the variables bounded and mantain an energy equilibrium among the matrix factors, we scale the columns of each factor every n iterations and “absorb” the scale into the core diagonal tensor G . This periodic rescaling of the factors is carefully crafted so that the objective value does not increase. This property is crucial to ensure that the overall minimization algorithm continues to converge. The property is confirmed by the following theorem. Theorem 1. (Convergence for Spacer Steps [13, Section 1.2]) Consider a sequence {xk } such that f (xk+1 ) ≤ f (xk ), k = 0, 1, . . . Assume that there exists an infinite set K of integers for which xk+1 = xk + α k d k , ∀k ∈ K , where {d k }K is gradient related and α k is chosen by the minimization rule, or the limited minimization rule, or the Armijo rule. Then every limit point of the subsequence {xk }K is a stationary point. 41 It is then possible to insert a rescaling step into an optimization algorithm without interfering with convergence. Such a strategy can be used to control the norm of the factors. A possible strategy to bound the factors is to scale them in such a way that the `1 -norm of each matrix is equal to a constant λ ; in this case, each factor A(n) is replaced by Ā(n) = λ kA(n) k1 A(n) , and the tensor G is replaced by G¯ = ! N (n) ∏ kA G. k1 /λ n=1 Another possibility consists of individually normalizing each column of the matrix factors. This can be done by multiplying each matrix factor A(n) (where 1 ≤ n ≤ N) by a diagonal matrix D(n) such that (D(n) )ii = 1 (n) . kai k By multiplying the core tensor G by the inverse of each diagonal matrix D(n) , we can guarantee that the value of the objective function remains constant. This is a consequence of the scaling indeterminacy described in (3.4). While the first strategy involves a matrix norm and can be described as imposing the condition kA(n) k1 = λ for all n in {1, . . . , N}, the second strategy does not represent a matrix norm. Nevertheless, it does not contradict the theory established by the spacer-steps theorem, it achieves the objective of keeping the solution bounded, and Paatero [60] suggests that it “would be easy to implement without sacrificing the convergence properties of the algorithm”, although he does not theoretically justify the suggestion. 42 4.2 Gradient of the function. According to our regularization strategy, the objective function of our problem is given by Φγ (G , A(1) , . . . , A(N) ) = F(G , A(1) , . . . , A(N) ) + f (γ, A(1) , . . . , A(N) ) = 12 kRk2F + f (γ, A(1) , . . . , A(N) ), where f (γ, A(1) , . . . , A(N) ) is the regularization function defined in (4.1), and R is defined as the residual R = G ×1 A(1) ×2 A(2) ×3 . . . ×N A(N) − V where V is a given Nth-order tensor in IRm1 ×...×mN , G is a diagonal Nth-order tensor in IRr1 ×...×rN , and A(i) is a matrix with dimensions mi × ri , for i = 1, . . . , N. If Y denotes any of the factors G , A(1) , . . . , A(N) , by using the chain rule we have ∂ Φγ ∂ F ∂ fγ ∂ F ∂ R ∂ fγ = + = + . ∂Y ∂Y ∂Y ∂ R ∂Y ∂Y Note that by defining F(G , A(1) , . . . , A(N) ) as 12 kRk2F , we are abusing notation because, although it could be intuitive that we are referring to the sum of the squares of the entries of R, we should formally have either F = 21 kR(n) k2F = 12 tr RT(n) R(n) , and dF = tr RT(n) dR(n) , dR(n) (4.2) where R(n) is a mn × ∏i6=n mi matrix, or F = 21 kvec((R)k22 = 21 vec(R)T vec(R), where vec(R) has dimensions (∏i mi ) × 1. 43 and dF = vec(R)dR dR (4.3) Using (4.2), (2.1), (2.2), and (4.3) we have ∂ Φγ = (An D ⊗ Imn )T vec(R(n) ) + γn ∂ A(n) = ((An D)T ⊗ Imn )vec(R(n) ) + γn = vec(Imn R(n) An D) + γn = vec(R(n) An D) + γn , and ∂ Φγ (1) (2) (N) = vec(R)T vec(a: j ◦ a: j ◦ · · · a: j ). ∂ g j j... j (4.4) Therefore, for the particular case of matrices, in which F = 12 kRk2F = 21 kW GH T −V k2F , (4.5) (see Section 3.2.2) we would have the derivatives ∂F = vec(RHG) + γW ∂W and ∂F = vec(GW T R) + γH . ∂ HT Note that we work with the transpose of H; see (4.5). Derivatives with respect to G are similar to the derivatives given in (4.4). It is necessary to address the effects of scaling (by spacer steps) on the derivative of the function. If we use the `1 -norm of each factor to bound the solution, we get a final solution in which the core tensor is a scalar multiple of the initial tensor G . Thus, the introduced variation is a scaling of the Kruskal product, independent from G , and the derivatives of the function must simply be scaled. Furthermore, since G is not really subject to variations, we can make it equal to the identity tensor and eliminate it as a problem variable. On the other hand, if we scale each of the columns of each of the factors using a different scalar, we do need to use a diagonal tensor to express the induced variation, and the diagonal elements of G must be considered variables (each’s 44 derivative would be the one given in (4.4)). 4.3 Optimization Now that the regularized objective function and its derivative have been established, it is necessary to choose an optimization method to solve the problem. Any optimization method for nonlinear problems subject to simple bounds can be used to solve the problem, but some modifications are necessary to incorporate the scaling strategy. Formally, using the optimization method, we want to generate the spacer steps to be inserted in a sequence that every n iterations switches the current iterate solution by another equivalent through an elementary change (scaling). We use the L-BFGS-B method to generate such spacer steps. This method, a modification of the BFGS method, is broadly used since its introduction in 1994. Since it is a well-known method, this section will just briefly describe the ideas behind it, and refer to the original publications [17, 83] for details. 4.3.1 BFGS The BFGS (Broyden-Fletcher-Goldfarb-Shanno) method, which belongs to the class of quasi-Newton methods, is an optimization method designed to solve unconstrained problems. Recall that all the Newton-type methods find, at the k iteration, a search direction dk by solving a linear system of equations of the type Bk dk = −gk , (4.6) where gk represents the gradient of the objective function, and Bk is an approximation to the Hessian of the objective function. Quasi-Newton methods belong to the Newton-type family of optimization methods. Their goal is to approximate the curvature of the objective function without evaluating the Hessian. To accomplish this, the methods use a positive definite matrix updated by a low-rank matrix at each iteration, i.e., they solve the equation 45 (4.6) with Bk+1 = Bk +Uk , where Uk has low rank and is defined by the specific quasi-Newton method, but always ensures that the matrix Bk+1 remains positive definite. In particular, BFGS uses the rank-two update Bk+1 = Bk − qk qTk yk yTk + , qTk sk yTk sk yTk sk > 0, where sk represents the change in the variables, yk is the change in the gradient, and qk is the product Bk sk . Notice that the algorithm requires that the user provide the gradient, but not the Hessian, of the objective function. A linesearch that satisfies the Wolf conditions guarantees that the updated matrix remains symmetric positive definite [59, Section 6.1]. BFGS is one of the most popular update schemes in the class of quasi-Newton methods and has been modified to deal with large-scale problems, giving rise to the limited-memory variants, described below. 4.3.2 L-BFGS-B L-BFGS and L-BFGS-B are variants of the BFGS method for solving large-scale optimization problems. The initial “L” in both names represents the words “limited memory” and represents a feature shared by the algorithms. The additional “B” at the end of the second name stands for “bounded” and draws attention to the main difference between the codes: L-BFGS is designed to solve unconstrained problems, whereas L-BFGS-B can handle bounds on the variables. As its name suggests, a “limited memory” algorithm is characterized by its ability to restrain the amount of storage or computation used during its execution. While the conventional BFGS method approximates the Hessian, using a smaller reduced matrix that increases in dimension with each iteration, the L-BFGS method does not keep all the s and y from past iterations, and it updates the Hessian using information only from the previous p iterations (with p provided by the user). When the number of iterations is smaller than p, we get the usual BFGS update, and when it is larger, we have a limited memory BFGS (L-BFGS) update. 46 L-BFGS-B is identical to the L-BFGS algorithm except for the additional property of allowing bounds on the variables. The constraint control is achieved by using a projected-gradient approach to determine an active set, and then solving a quadratic problem over the subspace of free variables, handling the active bounds by Lagrange multipliers. Liu, Nocedal et al. [53] proved that the L-BFGS-B algorithm converges Rlinearly. For our algorithm (with spacer steps) we know that the convergence is guaranteed by the theorem 1 and, although the speed of convergence might deteriorate with the incorporation of the scaling strategy, it is reasonable to expect a linear rate of convergence [17]. The general L-BFGS-B is shown in Algorithm 2 (see the original paper [17] for details). 47 Algorithm 2: L-BFGS-B algorithm. Input: Starting point x0 ; Integer n (number of limited memory corrections stored); Lower and upper bounds (l and u); Tolerance tol k = 0 while TRUE do if kP(xk − gk , l, u) − xk k∞ < tol then STOP end Compute the Cauchy point; Compute the search direction dk by either the direct primal, the conjugate gradient or the dual method; Perform a line search along dk , subject to the bounds on the problem, to compute a steplength and set xk+1 = xk + λk dk ; Compute ∇ f (xk+1 ); if yk satisfies sTk yk > epskyk k2 then Add sk and yk to Sk and Yk ; end if More than n updates are stored then Delete the oldest column from Sk and Yk ; end Update SkT Sk , YkT Yk , and other parameters related with the Bk update; k ← k+1 end 48 Chapter 5 Implementation and numerical experiments We combine the regularization strategies described in section 4.1 with the L-BFGSB method and implement an algorithm with an all-at-once approach. Experiments are then performed, which validate our algorithm. In this chapter we provide details about the algorithm implementation and describe the main results of the numerical experiments, which include observations on the performance of the algorithm for the matrix case, the effects of different settings on regularization parameters, and a comparison of the performance of the all-at-once approach against other approaches. 5.1 Implementation of the algorithm 5.1.1 L-BFGS-B routine We implement our algorithm using the L-BFGS-B [82] F ORTRAN code, which we call from M ATLAB via a MEX interface. We take advantage of the reverse communication scheme on which L-BFGS-B is built, and we treat the main LBFGS-B routine Setulb as a black box that manages the optimization process. It accepts the following inputs: • the current iteration xk ; 49 • a set of parameters and working arrays, P, required by the F ORTRAN routines (a complete description can be found in [83]); • a control variable, task, that indicates the state of the optimization process. The routine generates an output that consists of • a point xk+1 such that fk+1 < fk ; • the updated set of parameters P; • the updated value of task. Every n iterations, we replace the point xk by a scaled one, xˆk , modify the gradient gk , and change the value of some of the parameters in P that depend on gk . The function Setulb uses as stopping criteria the norm of the projected gradient and the relative change on the value of the function. It can also have an “abnormal” termination when the routine terminates without being able to satisfy the termination conditions, or an “error” termination when it detects an error in the input parameters. To these criteria, we added the number of iterations and the amount of elapsed time to control the end of the algorithm execution. 5.1.2 Other implementation remarks We have two implementations of the algorithm, one for NTF and one for NMF. The difference between them lies on the gradient computation and although it is possible to use the tensor code to solve a matrix problem and obtain the same results, it is cheaper to use the matrix interface (more details on this aspect are given in section 5.2.2). The tensor implementation relies on the M ATLAB Tensor Toolbox, version 2.2 [4, 5, 6]. There are two important settings in the algorithm that we leave to the user to choose, as they are considered input parameters: the value of r and the initial point. We briefly comment on each of them in the remainder of this section. 50 Choice of r The value of r is essential in the definition of the objective function as it sets the number of variables involved in the problem. In some sense, it determines the properties that we expect to find in the solution. Furthermore, if a problem has been solved using a matrix approach with rank rm , it is difficult to determine the rank rt that leads to an equivalent restatement of the problem with tensors. To illustrate this point, consider again a problem in image-processing context in which there is a set of n3 images, each with a resolution of n1 × n2 pixels. Using a matrix approach, the set is represented by a (n1 n2 × n3 ) matrix, where each column represents an image. If we find an approximation with rank rm , we have a basis W (n1 n2 × r), set of coefficients H (n3 × r), and diagonal matrix G (we consider only the r diagonal elements). Thus we have (n1 n2 + n3 + 1)rm variables. In the tensor approach, an approximation with the same rank (rt = rm ) has three matrix factors: A(1) (n1 × r), A(2) (n2 × r), and A(3) (n3 × r), and a diagonal core tensor G (r diagonal elements). Thus we have only (n1 + n2 + n3 + 1)rm elements. It is clear that the difference in structural arrangement of the elements, makes it inappropriate to use the same value rm in both approaches. However, it is not clear what value of rt makes the problems “equivalent”. We can choose rt such that the number of variables in both problems is the same, i.e., rt = n1 n2 + n3 + 1 n1 + n2 + n3 + 1 rm , (5.1) but that would diminish the benefits of using a tensor approach. In other words, using the same value of rt = rm is not enough to get a good result, but using (5.1) results in wasted storage and greatly increases the computational time. Initial guess In addition to the solution’s lack of uniqueness, the NTF problem might have local minima and, therefore, we expect great sensitivity of the algorithm to the initial guess. Some authors [10] have suggested the use of a Monte Carlo method to explore different starting points and determine the one that brings the best solution. 51 We do not attempt to solve this aspect of the problem. Our algorithm uses the initial guess given by the users. If no initial point is provided, it simply generates a random starting point with positive elements. 5.2 Experiments We perform experiments that can be grouped into three different types. First, we explore the performance of the algorithm in the matrix case, i.e., when the input in the algorithm is a 2nd-order tensor. Then, we explore the effectiveness of the regularization strategies—penalty function and scaling—using different parameter settings and comparing the obtained solutions. Finally, we choose the parameter values that give the best performance, and compare the all-at-once approach against the alternative approaches described in chapter 3 (a multiplicative update rule and an ALS implementation). Although we focus on the performance of the algorithm with no particular application, all the experiments employ image datasets, and, thus, use an imagerelated context to analyze the solutions (to determine their quality). The initial points are generated using M ATLAB ’ S rand function, and the state of the method’s generator is initialized with the purpose of allowing the computations to be repeated. So, the initial point is not chosen in a special way and does not use any a priori information on the problem, but it is the same every time we compute a solution independently of the algorithm and parameter settings. The experiments have been performed using M ATLAB version 7.6.0 on an Intel Pentium 4 CPU 3.20GHz, 2 GB RAM computer. 5.2.1 Experiment settings Used databases We use two different image databases to perform the experiments. Having more than one set of images increases the number of experiments that can be performed, but this also allows us to assess how the quality of the solutions varies as we modify the characteristics of the data (i.e., lighting, level of variation between elements of the set), and number of variables involved (i.e., images’ resolution). 52 1. CBCL Dataset [71]. This database has been used at MIT’s Center for Biological and Computational Learning (CBCL) and was developed by K. Sung. The database consists on 2429 grayscale face images with dimensions 19 × 19 (361 pixels) each. We assemble 80 images in a 361 × 80 matrix and a 19 × 19 × 80 tensor, for testing the matrix and tensor approach respectively. 2. AT&T Dataset [18]. This database was used in the context of a face recognition project carried out in collaboration with the Speech, Vision and Robotics Group of the Cambridge University Engineering Department. It consists of ten different grayscale images of each of 40 distinct subjects. The size of each image is 92 × 112 pixels, and we work with a set of 80 images randomly chosen from the 400 images in the database. The images are scaled to 30 × 38 size in order to reduce the number of variables, and then we assemble a 1140 × 80 matrix, and a 30 × 38 × 80 tensor for the experiments. Criteria of interest To compare the different algorithms, we use the value of the objective function and, as an optimality measure, the value given by kproj g(x)k∞ . max{1, kxk∞ } (5.2) We know that when this value is zero (kproj g(x)k∞ = 0), a stationary point has already been reached. When kxk∞ < 1, (5.2) represents the infinity-norm of the projected gradient; otherwise, when the norm of x is greater than 1, it is the relative infinity-norm of the gradient what is being considered. The algorithm’s cost per iteration varies greatly depending on the number of variables involved in the problem. We iterate for large problems until convergence is achieved or a predetermined time (90 minutes) elapses. Thus, we do not use a maximum number of iterations (we set the corresponding parameter as infinity), and instead control the termination of the algorithm by time (when no optimum solution is found within the period of time). Each table presents the number of 53 iterations used to achieve each solution and a flag indicates whether the optimality criteria were reached (flag = 1), or the maximum allowed time elapsed (flag = 2). The assessment of the solutions’ quality is an open problem in NTF and generally a specific context is required to establish metrics for the success—or failure— of a particular approach to finding a good solution. Two general approaches for evaluating the quality of a computed solution are the congruence [72] and the peak signal-to-noise ratio (PSNR) [58]. The congruence is defined as the cosine of the angle between the vectorized tensors, i.e., given two matrices, A and B in IRm×n , that represent digital images, we have congr(A, B) = cos(vec(A), vec(B)) = vec(A)T vec(B) . kvec(A)kkvec(B)k If the angle between the vectorized images is small, the congruence is approximately equal to unity. On the other hand, the PSNR represents the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. If we define the mean squared error (MSE) of the images A and B as MSE(A, B) = 1 m n ∑ ∑ kAi j − Bi j k2 , mn i=1 j=1 the PSNR is given by the expression PSNR(A, B) = 10 · log10 maxA2 , MSE(A, B) where maxA is the maximum possible pixel value of the image A. The image quality assessment is an open problem [79] and we use the structural similarity index (SSIM) [80] measure, which is especially adapted to digitalimage datasets. Measurement of solutions’ quality by SSIM The SSIM represents an attempt to quantitatively measure the perceived quality of an image and, because it is constructed so that it takes into account known characteristics of the human visual system, it matches the perceived visual quality 54 on an image better than other commonly used measures, like the mean-squarederror or the peak signal-to-noise ratio. The SSIM uses luminance, contrast, and structure comparison to give an overall similarity measure between two image signals, x, y ∈ IRn , i.e., SSIM(x, y) = f (l(x, y), c(x, y), s(x, y)), where l, c, and s are comparison functions. The simplified form of the SSIM (see the original article for details on the general form) is given by the equation SSIM(x, y) = (2µx µy + κ1 )(2σxy + κ2 ) , (µx2 µy2 + κ1 )(σx2 + σy2 + κ2 ) (5.3) where µx , µy represent the mean intensity of the signals, i.e., µx = 1 n ∑ xi ; n i=1 σx , σy , and σxy represent the standard deviations and correlation of the signals, i.e., σx = 1 n ∑ (xi − µx )2 n − 1 i=1 !1/2 , σxy = 1 n ∑ (xi − µx )(yi − µy ); n − 1 i=1 the constants κ1 , κ2 help to avoid unstable results; typical values are κ1 = 0.01 and κ2 = 0.03 (when κ1 = κ2 = 0 the SSIM becomes the “universal quality index”, UQI [78]). The SSIM index satisfies three important properties: symmetry (SSIM(x, y) = SSIM(y, x)), boundedness (SSIM(x, y) ≤ 1), and unique maximum (SSIM(x, y) = 1 if and only if x = y). In the performed experiments, we measure the SSIM between each of the original images (each column in the matrix approach, and each frontal slice in the tensor case) and its corresponding image in the obtained solution (columns or slices obtained by multiplying the computed factors). In the result tables we show the mean of the SSIMs obtained. 55 5.2.2 Performance of the algorithm on the matrix case We find that approximately 67% of the overall time required by our tensor algorithm is used during the function evaluations, whereas about 30% is spent on the L-BFGS-B routine. There are two operations that consume most of the time spent on function evaluations. First, the computation of the product of the factors, which is needed to obtain the residual. The M ATLAB Tensor Toolbox [5] that we use in the implementation has a data structure called the Kruskal tensor that allows the user to store a CP decomposition efficiently. It does not store the core tensor G , but only its diagonal and the rest of the factors (matrices A(i) ). Unfortunately, every time we evaluate the function we need to convert the Kruskal tensor to a full tensor (i.e., we have to perform the multiplication of the factors) in order to subtract the full tensor V . A significant amount of time (around 40%) is used to evaluate the function and its gradient in this operation. Second, to evaluate the gradient we need to compute the product of the ma(i) tricized residual times the matrices A , i = 1, . . . , N (that represent a Khatri-Rao product). The Tensor Toolbox includes the function mttkrp [6] which computes this product efficiently when the tensor involved in the multiplication is sparse or Kruskal. However, as the residual is a dense tensor, we have to form the Khatri-Rao explicitly and compute the product with the matricized residual. Table 5.1 illustrates the performance of tensor version of the algorithm to solve NMF problems (problems in which the input tensor is a second-order tensor). A set of 80 images of size 19 × 19 was used to perform the experiment. Note that convergence was not achieved in any of the problems. We see that solving the problems with the tensor algorithm requires, in every case, about twice the time required by the matrix algorithm and does not provide a better solution (the value of the objective function is smaller when using the matrix approach). 56 r 10 40 75 Interface NMF NTF NMF NTF NMF NTF f 2.217e+06 2.218e+06 2.931e+05 2.966e+05 1.798e+04 1.987e+04 Optim. 0.13 0.20 0.49 1.30 0.39 0.93 Iter 2000 2000 2000 2000 2000 2000 Time (s) 24.76 59.35 108.67 205.74 148.59 285.21 Table 5.1: CBCL Database; solution obtained solving a NMF problem with the NMF and NTF interfaces. 5.2.3 Effects of regularization strategies In this set of experiments the focus is on the effect of different regularization strategies. We use the regularization parameters γ = 10, 10−1 , 10−2 , and 10−3 , and try different scaling strategies, that consist of making equal to one: the `1 -matrix norm (this strategy is denoted by “1m”), the `2 norm of the column with the biggest norm (“2m”), the `1 norm of each colum (“1c”), or the `2 norm of each column (“2c”). The different problems come from using the two datasets and, also, from varying the value of the approximation’s rank, r. As we said before, we do not consider r as a parameter to be determined. It is a part of the definition of the problem (which depends entirely on the application and user’s interest), and we try several values in this experimentation stage. In every case we use τ = 10−4 as the tolerance for the norm of the projected gradient and the decrement between successive values of f . Tables 5.2 and 5.4 shows the results of solving the problem with our algorithm (all-at-once approach and regularization techniques described in chapter 4) and matrix approach. Tables 5.3 and 5.5 show the same set of regularization parameters applied to tensor structures. In every case, the time shown is measured in hours. Each table includes only four values of r for comparison purposes. In the matrix case r = 10, 20, 40, 75. In the tensor case we use the values of r that make the storage requirements equal to those of the matrix case. From Tables 5.2 to 5.5 we can see that a clear preeminence of a scaling strategy is absent. In the AT&T dataset, the scaling denoted by 2m seems to lead to a 57 Figure 5.1: Effects of regularization. The first row shows the original images. The following rows show (from second to fifth row) the solution obtained with: (1) no regularization; (2) scaling only; (3) `1 penalty function only; (4) both strategies. smaller value of the objective function, although there is not convergence in most of the cases. This makes it difficult to assess the effectivity of each method’s ability to find a good solution (independent of their iterations’ costs). From the experiments run on the other dataset (CBCL)—where convergence is achieved in most of the problems— we can see that the 2-norm scaling strategies— 2m and 2c—require less iterations than those with 1-norm scaling; these also have a slightly better SSIM average. In Figure 5.2.3 we can see that regularization does improve the images constructed with the obtained solutions. When regularization is used we can see better definition in the face features. The images correspond to the solution (parts of it; we show only 5 out of 80 images) of the NTF problem using the AT&T dataset, and r = 109; the scaling strategy is 2m, and γ = 10−3 ; in none of the cases (different parameter settings) convergence is achieved. Additionally, we verify that tensor factorizations have smaller storage requirements to achieve a good solution. For example, the solution that we see in Table 58 5.5 corresponding to r = 163 in the tensor approach (24287 variables) is better than the solution that we see in Table 5.4 corresponding to r = 40 (48840 variables) with the matrix approach (although convergence is not reached in the tensor case within the allowed time). More examples can be found when Table 5.2 is compared to Table 5.3, and Table 5.4 is compared to Table 5.5. It is clear that when we take approximately the same number of variables (r equivalent in storage) in the tensor and matrix approach, we obatain a better solution using tensors. 5.2.4 Comparison with other approaches We compare the performance of the all-at-once approach against an implementation of the multiplicative update rule (MUR) described by Hazan, Polak and Shashua [38], and the lsNTF interface implemented by Friedlander and Hatz [31, 33], which uses ALS with regularization. In the all-at-once approach, we show three different parameter settings. AAonce1 corresponds to the algorithm with no regularization (γ = 0, no scaling); AAonce2 and AAonce-3 use 2m-scaling but the values γ = 10 and γ = 10−3 , respectively. In the later two cases, all of the factors are controlled by the regularization function and scaled every 10 iterations (the frequency of the scaling is an arbitrary choice). As in the previous experiments, we do not set a maximum number of iterations. Instead, we allow a maximum computing time of 90 minutes. The value of the tolerance is τ = 10−5 . Optimality, SSIM, and f lag parameters are as described before. Results are shown in Tables 5.6 and 5.7. The multiplicative update rule seems to incur the least computational effort per iteration as it has the largest number of iterations within the allowed time. However, it does not give satisfactory results in terms of optimality and does not achieve convergence in any case (with the given tolerance value). The lsNTF algorithm gives the worst results in terms of the value of the objective function and, in all the cases, seems to stop at a saddle point (the norm of the projected gradient is nearly zero). In the column corresponding to the number of iterations, we observe small numbers, but it is important to recall that those are the main iterations, and that each of them involves solving a linear problem for every 59 factor, which is done by calling the software BCLS [32]. In other words, each of the main iterations implies a number of inner iterations that can be large. In all the cases the best results are obtained with the all-at-once approach, which gives a smaller value of f and requires a relatively small number of iterations. The cost per iteration, however, is high; each function evaluation requires O(r · N · ∏Ni=1 Ii ) operations, since each involves a multi-mode (Kruskal) product (O(r · N · ∏Ni=1 Ii ) operations) and N Khatri-Rao products (each of them requires O(r ∏Ni=1 Ii ) operations). 60 Reg. Param. γ Sca Computed Solution kRkF Optim. mean SSIM Iter Time Exit r = 3, (1326 variables) 0e+00 – 3.262e+03 4.398e-03 7.091e-01 712 7.8e-04 1 0e+00 1m 3.262e+03 3.508e-05 7.091e-01 900 1.2e-03 1 0e+00 2m 3.263e+03 2.463e-04 7.091e-01 300 4.1e-04 1 0e+00 1c 3.262e+03 2.664e-05 7.091e-01 976 1.3e-03 1 0e+00 2c 3.263e+03 1.287e-04 7.091e-01 393 7.0e-04 1 1e+01 – 3.262e+03 1.770e-03 7.090e-01 1388 1.4e-03 1 1e-01 – 3.262e+03 9.747e-04 7.091e-01 1724 1.7e-03 1 1e-02 – 3.262e+03 2.106e-03 7.091e-01 1448 1.4e-03 1 1e-03 – 3.262e+03 7.962e-04 7.091e-01 768 7.9e-04 1 r = 7, (3094 variables) 0e+00 – 2.467e+03 1.148e-03 8.336e-01 1948 3.5e-03 1 0e+00 1m 2.467e+03 2.555e-05 8.336e-01 5854 1.3e-02 1 0e+00 2m 2.467e+03 2.103e-04 8.336e-01 2458 6.0e-03 1 0e+00 1c 2.467e+03 2.310e-05 8.336e-01 5611 1.6e-02 1 0e+00 2c 2.467e+03 3.541e-04 8.336e-01 2594 7.7e-03 1 1e+01 – 2.467e+03 1.259e-03 8.335e-01 2069 5.1e-03 1 1e-01 – 2.467e+03 7.983e-04 8.336e-01 3767 8.4e-03 1 1e-02 – 2.467e+03 1.519e-03 8.336e-01 5646 1.2e-02 1 1e-03 – 2.467e+03 1.104e-03 8.336e-01 2219 4.2e-03 1 r = 13, (5746 variables) 0e+00 – 1.843e+03 5.719e-04 9.139e-01 5789 2.0e-02 1 0e+00 1m 1.846e+03 4.950e-05 9.137e-01 3592 1.5e-02 1 0e+00 2m 1.845e+03 4.276e-04 9.155e-01 3046 1.3e-02 1 0e+00 1c 1.843e+03 1.904e-05 9.139e-01 10789 4.4e-02 1 0e+00 2c 1.845e+03 1.105e-04 9.135e-01 6264 2.5e-02 1 1e+01 – 1.844e+03 7.175e-04 9.139e-01 4496 1.4e-02 1 1e-01 – 1.843e+03 2.745e-04 9.139e-01 9490 3.2e-02 1 Table 5.2: CBCL Dataset, matrix approach (The table continues on the next page.) 61 1e-02 – 1.843e+03 4.817e-04 9.139e-01 9343 3.0e-02 1 1e-03 – 1.843e+03 7.477e-04 9.139e-01 7335 2.3e-02 1 r = 25, (11050 variables) 0e+00 – 1.196e+03 6.775e-05 9.623e-01 12660 7.6e-02 1 0e+00 1m 1.196e+03 1.680e-05 9.621e-01 44574 3.3e-01 1 0e+00 2m 1.198e+03 3.286e-04 9.623e-01 12107 7.8e-02 1 0e+00 1c 1.196e+03 3.433e-05 9.623e-01 28813 1.8e-01 1 0e+00 2c 1.195e+03 3.880e-04 9.621e-01 12808 8.0e-02 1 1e+01 – 1.198e+03 6.515e-04 9.622e-01 22803 1.2e-01 1 1e-01 – 1.196e+03 4.040e-04 9.623e-01 18970 1.4e-01 1 1e-02 – 1.196e+03 1.733e-04 9.623e-01 27555 2.1e-01 1 1e-03 – 1.196e+03 1.932e-04 9.623e-01 21916 1.3e-01 1 Table 5.2: CBCL Dataset, matrix approach. The optimality value represents the (relative) infinity-norm of the gradient. A SSIM close to 1 indicates good quality of the solution. Time is measured in hours. An exit flag equal to 1 indicates that a convergence criterion was satisfied. Reg. Param. γ Sca Computed Solution kRkF Optim. mean SSIM Iter Time Exit r = 12, (1428 variables) 0e+00 – 2.595e+03 1.166e-02 7.767e-01 1591 1.2e-02 1 0e+00 1m 2.596e+03 9.977e-05 7.800e-01 1740 1.7e-02 1 0e+00 2m 2.602e+03 2.708e-04 7.830e-01 1043 1.1e-02 1 0e+00 1c 2.617e+03 4.162e-05 7.804e-01 1566 1.6e-02 1 0e+00 2c 2.605e+03 1.785e-04 7.796e-01 1111 1.1e-02 1 1e+01 – 2.596e+03 1.489e+01 7.766e-01 1107 1.5e+00 2 1e-01 – 2.595e+03 5.400e-03 7.767e-01 5558 1.1e+00 1 1e-02 – 2.595e+03 2.598e-02 7.767e-01 3839 2.2e-02 1 Table 5.3: CBCL Dataset, tensor approach (the table continues on the next page.) 62 1e-03 – 2.595e+03 1.839e-02 7.767e-01 1545 1.9e-02 1 r = 24, (2856 variables) 0e+00 – 2.021e+03 1.218e-02 8.715e-01 4734 5.4e-02 1 0e+00 1m 2.038e+03 2.060e-05 8.723e-01 7859 1.5e-01 1 0e+00 2m 2.036e+03 2.234e-05 8.779e-01 2840 5.3e-02 1 0e+00 1c 2.037e+03 6.278e-05 8.750e-01 3971 6.8e-02 1 0e+00 2c 2.027e+03 2.097e-04 8.749e-01 3238 5.5e-02 1 1e+01 – 2.046e+03 1.056e-02 8.726e-01 4907 4.3e-02 1 1e-01 – 2.046e+03 9.773e-03 8.726e-01 5844 5.1e-02 1 1e-02 – 2.046e+03 6.493e-03 8.726e-01 7068 6.2e-02 1 1e-03 – 2.048e+03 1.254e-02 8.739e-01 2678 2.3e-02 1 r = 49, (5831 variables) 0e+00 – 1.466e+03 2.015e-03 9.352e-01 15552 3.1e-01 1 0e+00 1m 1.590e+03 1.414e-04 9.242e-01 10047 3.3e-01 1 0e+00 2m 1.571e+03 1.618e-03 9.261e-01 10892 3.7e-01 1 0e+00 1c 1.578e+03 7.276e-06 9.253e-01 15978 5.7e-01 1 0e+00 2c 1.571e+03 2.711e-04 9.249e-01 8618 3.3e-01 1 1e+01 – 1.466e+03 4.410e-03 9.325e-01 20390 3.3e-01 1 1e-01 – 1.472e+03 1.141e-02 9.352e-01 22951 1.2e+00 1 1e-02 – 1.458e+03 2.604e-03 9.337e-01 25416 4.0e-01 1 1e-03 – 1.462e+03 5.359e-03 9.321e-01 11459 1.8e-01 1 r = 92, (10948 variables) 0e+00 – 9.480e+02 3.614e-03 9.702e-01 30290 9.7e-01 1 0e+00 1m 1.171e+03 3.376e+03 9.534e-01 26954 1.5e+00 2 0e+00 2m 1.168e+03 4.413e-01 9.556e-01 25103 1.4e+00 1 0e+00 1c 1.159e+03 1.020e+02 9.565e-01 27036 1.5e+00 2 0e+00 2c 1.161e+03 6.786e-01 9.580e-01 13057 7.2e-01 1 1e+01 – 9.493e+02 9.574e-01 9.707e-01 23699 1.5e+00 2 1e-01 – 2.205e+03 3.010e+02 8.503e-01 122 1.5e+00 2 1e-02 – 1.316e+03 5.096e+01 9.451e-01 542 1.5e+00 2 Table 5.3: CBCL Dataset, tensor approach (the table continues on the next page.) 63 1e-03 – 9.796e+02 2.034e+00 9.687e-01 4886 1.5e+00 2 Table 5.3: CBCL Dataset, tensor approach. The optimality value represents the (relative) infinity-norm of the gradient. A SSIM close to 1 indicates good quality of the solution. Time is measured in hours. An exit flag equal to 1 indicates that a convergence criterion was satisfied. Reg. Param. γ Sca Computed Solution kRkF Optim. mean SSIM Iter Time Exit r = 10, (12210 variables) 0e+00 – 6.246e+03 1.613e-03 6.525e-01 1820 1.3e-02 1 0e+00 1m 3.725e+04 7.220e-05 6.525e-01 6706 5.9e-02 1 0e+00 2m 3.725e+04 1.559e-03 6.525e-01 2548 2.2e-02 1 0e+00 1c 3.725e+04 1.356e-04 6.525e-01 5901 5.5e-02 1 0e+00 2c 3.725e+04 9.131e-04 6.525e-01 2641 2.6e-02 1 1e+01 – 6.246e+03 8.400e-03 6.525e-01 3239 2.3e-02 1 1e-01 – 6.246e+03 1.556e-03 6.525e-01 6429 4.7e-02 1 1e-02 – 6.246e+03 2.527e-03 6.525e-01 3019 2.2e-02 1 1e-03 – 6.246e+03 7.313e-04 6.525e-01 1915 1.4e-02 1 r = 20, (24420 variables) 0e+00 – 4.748e+03 2.303e-03 7.779e-01 8324 1.2e-01 1 0e+00 1m 3.725e+04 2.910e-04 7.776e-01 15319 2.6e-01 1 0e+00 2m 3.725e+04 7.322e-04 7.756e-01 4176 7.1e-02 1 0e+00 1c 3.725e+04 6.577e-05 7.758e-01 13480 2.2e-01 1 0e+00 2c 3.724e+04 2.404e-04 7.755e-01 6873 1.0e-01 1 1e+01 – 4.734e+03 5.935e-04 7.747e-01 12307 1.6e-01 1 1e-01 – 4.748e+03 1.628e-03 7.779e-01 17029 2.2e-01 1 1e-02 – 4.746e+03 3.161e-03 7.754e-01 12157 1.6e-01 1 1e-03 – 4.748e+03 8.558e-04 7.779e-01 10312 1.3e-01 1 Table 5.4: AT&T Dataset, matrix approach (the table continues on the next page.) 64 r = 40, (48840 variables) 0e+00 – 2.950e+03 8.370e-04 9.038e-01 30177 6.7e-01 1 0e+00 1m 3.725e+04 7.462e-03 9.033e-01 34993 1.0e+00 2 0e+00 2m 3.724e+04 2.257e-04 9.036e-01 23870 6.8e-01 1 0e+00 1c 3.725e+04 4.222e-02 9.029e-01 34957 1.0e+00 2 0e+00 2c 3.724e+04 4.183e-04 9.036e-01 21314 6.1e-01 1 1e+01 – 2.956e+03 4.308e-02 9.040e-01 39595 1.0e+00 2 1e-01 – 2.964e+03 5.629e-03 9.031e-01 39688 1.0e+00 2 1e-02 – 2.949e+03 5.231e-03 9.045e-01 39824 1.0e+00 2 1e-03 – 2.966e+03 5.707e-04 9.023e-01 32917 8.2e-01 1 r = 75, (91575 variables) 0e+00 – 5.969e+02 6.860e-02 9.956e-01 23273 1.0e+00 2 0e+00 1m 3.725e+04 4.613e-02 9.922e-01 18048 1.0e+00 2 0e+00 2m 3.724e+04 2.757e-01 9.953e-01 17644 1.0e+00 2 0e+00 1c 3.725e+04 5.022e-02 9.921e-01 17989 1.0e+00 2 0e+00 2c 3.724e+04 6.077e-01 9.954e-01 17670 1.0e+00 2 1e+01 – 6.463e+02 3.397e-02 9.949e-01 20744 1.0e+00 2 1e-01 – 5.996e+02 7.546e-02 9.955e-01 20789 1.0e+00 2 1e-02 – 5.974e+02 9.318e-02 9.956e-01 20786 1.0e+00 2 1e-03 – 5.992e+02 1.006e-01 9.955e-01 20791 1.0e+00 2 Table 5.4: AT&T Dataset, matrix approach. The optimality value represents the (relative) infinity-norm of the gradient. A SSIM close to 1 indicates good quality of the solution. Time is measured in hours. An exit flag equal to 1 indicates that a convergence criterion was satisfied. 65 Reg. Param. γ Sca Computed Solution kRkF Optim. mean SSIM Iter Time Exit r = 81, (12069 variables) 0e+00 – 3.940e+03 1.484e-02 8.277e-01 10704 9.7e-01 1 0e+00 1m 1.104e+04 2.936e+02 3.064e-01 33 9.6e-03 1 0e+00 2m 4.198e+03 1.751e+03 8.040e-01 6929 1.5e+00 2 0e+00 1c 8.201e+03 7.158e+01 4.583e-01 4037 8.8e-01 1 0e+00 2c 4.458e+03 3.464e+03 7.877e-01 7052 1.5e+00 2 1e+01 – 3.935e+03 5.618e-03 8.262e-01 10797 1.0e+00 1 1e-01 – 3.924e+03 5.148e-02 8.262e-01 17818 1.5e+00 2 1e-02 – 3.937e+03 1.050e-01 8.260e-01 16773 1.5e+00 2 1e-03 – 3.919e+03 2.098e-02 8.260e-01 15378 1.3e+00 1 r = 163, (24287 variables) 0e+00 – 2.759e+03 5.681e+00 9.129e-01 10114 1.5e+00 2 0e+00 1m 9.445e+03 4.982e-06 3.971e-01 333 1.3e-01 1 0e+00 2m 3.041e+03 1.089e+00 8.953e-01 3980 1.5e+00 2 0e+00 1c 6.835e+03 3.322e-02 5.712e-01 3908 1.5e+00 2 0e+00 2c 3.211e+03 5.639e+00 8.874e-01 3755 1.5e+00 2 1e+01 – 2.772e+03 3.126e+00 9.142e-01 8673 1.5e+00 2 1e-01 – 2.768e+03 4.187e+00 9.132e-01 9952 1.5e+00 2 1e-02 – 2.761e+03 2.788e+00 9.131e-01 10144 1.5e+00 2 1e-03 – 2.761e+03 3.530e+00 9.115e-01 10109 1.5e+00 2 r = 327, (48723 variables) 0e+00 – 1.780e+03 1.047e+01 9.634e-01 5443 1.5e+00 2 0e+00 1m 1.104e+04 8.933e-07 3.064e-01 31 4.2e-02 1 0e+00 2m 2.334e+03 9.623e-01 9.402e-01 2030 1.5e+00 2 0e+00 1c 7.198e+03 1.541e-01 5.529e-01 2089 1.5e+00 2 0e+00 2c 2.479e+03 4.086e+01 9.319e-01 2027 1.5e+00 2 1e+01 – 1.764e+03 5.278e+00 9.634e-01 4951 1.5e+00 2 1e-01 – 1.796e+03 7.833e+00 9.628e-01 5099 1.5e+00 2 Table 5.5: AT&T Dataset, tensor approach (the table continues on the next page.) 66 1e-02 – 1.801e+03 7.253e+00 9.626e-01 5009 1.5e+00 2 1e-03 – 1.798e+03 7.546e+00 9.628e-01 5119 1.5e+00 2 r = 614, (91486 variables) 0e+00 – 1.206e+03 1.767e+01 9.832e-01 3180 1.5e+00 2 0e+00 1m 6.786e+03 3.828e-02 5.792e-01 1148 1.5e+00 2 0e+00 2m 1.916e+03 9.495e-01 9.600e-01 1131 1.5e+00 2 0e+00 1c 6.554e+03 8.340e-02 6.154e-01 1110 1.5e+00 2 0e+00 2c 2.326e+03 1.378e+01 9.411e-01 1100 1.5e+00 2 1e+01 – 1.219e+03 2.550e+01 9.829e-01 2432 1.5e+00 2 1e-01 – 1.238e+03 6.015e+01 9.824e-01 2879 1.5e+00 2 1e-02 – 1.240e+03 1.111e+01 9.823e-01 2883 1.5e+00 2 1e-03 – 1.239e+03 1.167e+01 9.823e-01 2866 1.5e+00 2 Table 5.5: AT&T Dataset, tensor approach. The optimality value represents the (relative) infinity-norm of the gradient. A SSIM close to 1 indicates good quality of the solution. Time is measured in hours. An exit flag equal to 1 indicates that a convergence criterion was satisfied. Algorithm kRkF Optim. mean SSIM Iter Time Exit r = 12, (1428 variables) MUR 3.370e+06 3.242e+02 7.977e-01 95468 1.50 2 lsNTF 2.206e+08 0.000e+00 8.432e-05 192 1.40 1 AAonce-1 2.595e+03 1.166e-02 7.767e-01 1591 0.01 1 AAonce-2 2.602e+03 2.750e-02 7.805e-01 840 0.01 1 AAonce-3 2.602e+03 7.128e-05 7.809e-01 1030 0.01 1 r = 24, (2856 variables) MUR 2.088e+06 2.779e+02 8.754e-01 78533 1.50 2 lsNTF 2.206e+08 0.000e+00 8.431e-05 169 1.44 1 AAonce-1 2.021e+03 1.218e-02 8.715e-01 4734 0.05 1 Table 5.6: CBCL Dataset, comparison with other algorithms (the table continues on the next page.) 67 AAonce-2 2.033e+03 5.234e-03 8.808e-01 3240 0.06 1 AAonce-3 2.036e+03 1.148e-05 8.779e-01 3117 0.06 1 r = 49, (5831 variables) MUR 1.127e+06 1.036e+02 9.305e-01 61422 1.50 2 lsNTF 2.206e+08 0.000e+00 8.432e-05 120 1.44 1 AAonce-1 1.466e+03 2.015e-03 9.352e-01 15552 0.31 1 AAonce-2 1.579e+03 1.555e-02 9.269e-01 4250 0.18 1 AAonce-3 1.571e+03 9.539e-06 9.261e-01 11840 0.57 1 r = 92, (10948 variables) MUR 5.057e+05 5.910e+01 9.671e-01 40328 1.50 2 lsNTF 2.206e+08 0.000e+00 8.433e-05 73 1.39 1 AAonce-1 9.480e+02 3.614e-03 9.702e-01 30290 0.97 1 AAonce-2 1.170e+03 1.937e-02 9.558e-01 5760 0.39 1 AAonce-3 1.164e+03 1.084e-03 9.562e-01 22311 1.50 2 Table 5.6: CBCL Dataset, comparison with other algorithms. The optimality value represents the (relative) infinity-norm of the gradient. A SSIM close to 1 indicates good quality of the solution. Time is measured in hours. An exit flag equal to 1 indicates that a convergence criterion was satisfied. Algorithm kRkF Optim. mean SSIM Iter Time Exit r = 27, (4023 variables) MUR 1.687e+07 1.445e+03 6.719e-01 29243 1.50 2 lsNTF 6.939e+08 0.000e+00 6.428e-05 47 1.30 1 AAonce-1 5.790e+03 2.052e-02 6.765e-01 2968 0.08 1 AAonce-2 5.787e+03 3.324e-01 6.704e-01 1874 1.50 2 AAonce-3 5.797e+03 2.462e-01 6.769e-01 1455 1.50 2 r = 54, (8046 variables) Table 5.7: AT&T Dataset, comparison with other algorithms (the table continues on the next page.) 68 MUR 1.094e+07 7.560e+02 7.662e-01 21896 1.50 2 lsNTF 6.938e+08 0.000e+00 6.428e-05 42 1.47 1 AAonce-1 4.618e+03 6.344e-03 7.727e-01 9347 0.59 1 AAonce-2 5.033e+03 4.713e-01 7.359e-01 2162 1.50 2 AAonce-3 5.550e+03 5.225e+00 6.965e-01 471 1.50 2 r = 109, (16241 variables) MUR 6.391e+06 3.234e+02 8.603e-01 14641 1.50 2 lsNTF 6.938e+08 0.000e+00 6.432e-05 27 1.53 2 AAonce-1 3.422e+03 6.932e+00 8.670e-01 15213 1.50 2 AAonce-2 4.904e+03 2.376e+01 7.592e-01 323 1.51 2 AAonce-3 5.520e+03 1.205e+01 7.122e-01 195 1.51 2 r = 204, (30396 variables) MUR 3.262e+06 2.006e+02 9.269e-01 8229 1.50 2 lsNTF 6.938e+08 0.000e+00 6.438e-05 18 1.52 2 AAonce-1 2.428e+03 5.869e+00 9.316e-01 7892 1.50 2 AAonce-2 7.371e+03 6.711e+01 5.640e-01 49 1.50 2 AAonce-3 6.550e+03 4.807e+01 6.270e-01 76 1.51 2 Table 5.7: AT&T Dataset, comparison with other algorithms. The optimality value represents the (relative) infinity-norm of the gradient. A SSIM close to 1 indicates good quality of the solution. Time is measured in hours. An exit flag equal to 1 indicates that a convergence criterion was satisfied. 69 Chapter 6 An application to image processing One of the main motivations for nonnegative factorizations is its potential use as a tool to extract information from a dataset. In this chapter we illustrate this aspect of nonnegative factorizations through some examples. We then describe the digital-inpainting problem and explain the way in which our algorithm can be used to help the restoration of digital images that have been damaged by loss of information. 6.1 Image basis through nonnegative factorizations In the NMF problem we can interpret each column of W ∈ IRm×r as a basis element, and the matrix H ∈ IRn×r as the set of coefficients that can be used to approximate the original set (V ∈ IRm×n ) with elements in the basis W . When we use r < n, we expect the basis W to contain the most significant information in V , so that the value of kV −W H T k can be minimized. When V represents an image dataset (each image is represented by a column of V ), we expect the columns of W to represent images that condense the most important features of the original set. Figure 6.1a shows 49 images that represent part of a set formed by 200 images taken from the CBCL image database (described in section 5.2.1), plus two outlier 70 (a) Database sample (b) Nonnegative Basis Figure 6.1: Experiment using CBCL face database with outliers. The basis was obtained via NMF. images—Frankenstein and a dog (shown in the bottom-right corner of the set). By applying the matrix version of the all-at-once algorithm, using r = 15, we get the basis images shown in Figure 6.1b. The inclusion of the outliers makes it easier to appreciate the way in which the basis images combine and resemble the most distinctive characteristics of the elements in the set. The images that have been highlighted in Figure 6.1b (first row of the first column, and third row of the third colum), while not reproductions of the outlier images, clearly merge some of the outliers’ characteristics (e.g, face and nose shape) with other common attributes of the set (e.g, in the basis Frankeinstein’s face seems to smile). Note that for the matrix case, our algorithm works with an approximation of the type V ≈ W GH T , where G is a diagonal matrix that has been included to maintain the analogy with the tensor case. We interpret G as a matrix that contains the scale of the elements in W , and, in the image-processing context, consider as basis images the columns of the product W G. To obtain a set of base images in the tensor case, suppose that the tensor V ∈ 71 IRm1 ×m2 ×m3 contains each of the original images as a frontal slice, and that V = [[G ; A(1) , A(2) , A(3) ]]. By applying matricization, which is described in section 2.2.2, (see also Appendix B for a summary of matrix-operators results) we can unfold the tensor with respect to its third coordinate and—taking into account that G is diagonal and matrices A(i) have the same number of columns—obtain the expression V = (A(1) A(2) )DA(3)T , (6.1) where D is a diagonal matrix such that Giii = Dii with i ∈ {1, . . . , r} (as used in section 2.3), and V is the matrix that we use in the matrix approach, i.e., a matrix whose columns represent the vectorized images of the dataset (using column-wise concatenation). From (6.1) it is clear that we can consider the product (A(1) A(2) )D—which is a m1 m2 × r matrix—as the matrix that contains the basis images, and A(3) as the matrix that represents the coefficients used to construct each column of V , i.e., A(3) is the analogous of H. Since (1) (A(1) (2) (1) (2) A(2) )D = [D11 (a1 ⊗ a1 ), · · · , Drr (ar ⊗ ar )], and (1) D j j (a j (1) (1) (2)T a j ) = D j j vec(a j a j ), j = 1, . . . , r, we see that the ith image in the basis is given by a rank-1 matrix scaled by the ith diagonal element of G . In fact, the outer product of the columns of A(1) and A(2) (scaled by the diagonal elements of G ), can be computed as B = G ×1 A(1) ×2 A(2) . In this way, the ith frontal slice of B represents ith basis image. A second example that shows the use of nonnegative factorizations as a way to extract relevant features in a set is obtained by considering a set formed by 200 images, each with size 28 × 28, that represent hand-written digits. We assembled a matrix V ∈ IR784×200 and a tensor, V ∈ IR28×28×200 , with the images and choose r = 10 in the matrix case (9840 variables), and r = 20 in the tensor case (5120 variables). Basis images obtained through NMF are shown in the 72 (a) Database sample (b) NMF basis. (c) NTF basis. Figure 6.2: Experiment using hand-written digits . kRk NMF NTF 1.786e+08 1.566e+08 Optim. 6.44e-02 3.23e-01 Min SSIM 0.091 0.249 Mean SSIM 0.445 0.521 Iter 791 752 Time (s) 24.17 158.16 Storage Basis Total 7840 1140 9840 5120 Table 6.1: Experiment with hand-written numbers. The SSIM of two images is equal to 1 if and only if the images are identical; thus, in this table a SSIM value close to 1 indicates a good quality of the solution. The last two columns indicate the number of variables (elements in the factors) used in each approach. Figure 6.2b while basis images obtained by NTF are shown in 6.2b. Results are shown in Table 6.1. The basis images in NTF tend to be sparser, which represents a sharper decomposition into part components. Despite the fact that fewer variables are used in the tensor case (we have a higher compression rate: the NMF basis has 7840 73 elements, whereas the NTF basis has only 1140), we obtain higher fidelity in the reconstruction of the original images than we do using the matrix approach. 6.2 Inpainting problem By inpainting we understand the restoration of missing or damaged parts on an image; digital inpainting refers to the case in which the images are digital and the restoration is made with computer support. The work by Bertalmio, et al. [12] is acknowledged as a pioneering work in digital inpainting. They use a third-order PDE-based model and show unified applications in the movie industry, video images, and art restoration. Since their work was published, other approaches have been used to solve the digital inpainting problem, e.g., models based on the Bayesian rationale [67], and expectationminimization (EM) algorithms [28]. We explore the use of NTF basis images in the inpainting problem, using a basic mathematical model of the problem. Suppose that b ∈ IRm represents a signal (an image, in our context) that can be expressed in terms of a basis A ∈ IRm×n using the coefficients contained in the vector x ∈ IRn , as represented by the equation Ax = b, (6.2) It is assumed that x is sparse. If D ∈ IR p×m , with p < m, represents a lossy operator and we apply it to (6.2), we get DAx = Db. (6.3) Inpainting is to recover b from Db, which is an inverse problem. Let us denote the product DA by Â, and Db by b̂. Under the assumption that x is a sparse vector, the coefficients of x can be recovered by solving the regularized least-squares problem given by minimizekÂx − b̂k22 + λ kxk1 , λ 74 (6.4) where λ is a regularization parameter that lies in the interval [0, kAT bk∞ ]. Once we obtain the vector x, we can recover the original signal b by simply multiplying the matrix A times x. We believe that, given an specific set, it is possible to construct a basis through NTF that can later be used to restore elements that have lost some information. The basis elements obtained through NTF—that tend to have sparse components—can be interpreted as a dictionary, i.e., a set of atom images that can be combined to obtain other images (not only the ones contained in the original set V ). So, we can use NTF to obtain the matrix A that appears in (6.2). 6.3 Numerical experiments We designed an experiment to empirically test the use of NTF basis in the inpainting problem. We employ a set of 140 images taken from the AT&T image dataset, which were scaled to a 40 × 50 size, and assemble a 40 × 50 × 140 tensor. We compute an NTF factorization, with r = 593 (the equivalent in storage to having r = 70 in the matrix case), using scaling every 5 iterations, γ = 10−3 in the regularization function, and τ = 10−4 as tolerance value. The problem has 136983 variables and we computed the solution over 72 hours. In this time a total of 20593 iterations were performed. We did not obtain convergence, but the final optimality value was 1.76 × 10−3 . We then computed the basis B = G ×1 A(1) ×2 A(2) , and discard A(3) . Note that, although B is a 40 × 50 × 531 tensor, what we store are the factors G , A(1) , and A(2) , which have a total of 53963 elements. As we are using these factors to reconstruct the whole set (the tensor V , which has 280000 elements), we are working with only 19.27 % of the original information. As a first test, we selected an image that had been originally included in V , and removed 80% of its pixel components (randomly chosen). We did not explicitely form the operator D that appears in (6.3). We solved (6.4) for 30 different (logarithmically spaced) values of λ , and chose the best solution; Figure 6.3 shows ten of the “recovered” images obtained as we varied the value of λ . Figure 6.4 shows the original image (Figure 6.4a), the image obtained through the NTF basis ((Figure 6.4b)), the “noisy” image—that has 75 Figure 6.3: In-painting problem, test 1: Recovered images for different values of the regularization parameter λ . only 20% of the original information—(Figure 6.4c), and the best of the recovered images, obtained with λ = 10−3 (Figure 6.4d). As a second test, we chose as b an image that was not in the original set V , and applied the same procedure that was used in the previous test. Results are shown in Figure 6.5. The person whose image is used in the second test belongs to the group of subjects that were photographed to form the AT&T dataset, but the set of 140 images that we use to generate the basis included only one image of that person. In it he wears glasses and has a different head position and facial expression (the image is shown in the second row of the fifth column of Figure 6.6). Thus, it is remarkable that the basis obtained by NTF approximates well the original image. As a third and final test, we decided to remove a specific area of an image that was not in V . The removed data represents 12.65% of the information in the original image. Figure 6.7 shows the results of the experiment, The reconstructed image does not look like a natural face but the basis does complete the part of the face that is missing. There are studies that focus on the correlation between different face regions [39] and we believe that if a basis is specifically constructed to solve the problem of inpainting of face parts (by con76 (a) (b) (c) (d) Figure 6.4: Inpainting problem, test 1: (a) original image; (b) image constructed using a basis obtained by NTF; (c) image after information loss; (d) image recovered with the best λ . structing a set V with images that include representation of the distinct possible face features), better results can be obtained. 77 (a) (b) (c) (d) Figure 6.5: Inpainting problem, test 2: (a) original image; (b) image constructed using a basis obtained by NTF; (c) image after information loss; (d) image recovered with the best λ . Figure 6.6: Subjects photographed to form the AT&T image database. 78 (a) (b) (c) (d) Figure 6.7: Inpainting problem, test 3: (a) original image; (b) image constructed using a basis obtained by NTF; (c) image after information loss; (d) image recovered with the best λ . 79 Bibliography [1] Glossary of the document “A Space and Solar Physics Data Model from the SPASE Consortium ”, Version 1.2.1, available at http://www.spase-group.org/data/doc/spase-1 2 1.pdf (accessed May 2, 2008). → pages 2 [2] E. Alpaydin. Introduction to Machine Learning. The MIT Press, 2004. → pages 7 [3] C. A. Andersson and R. Bro. Improving the speed of multiway algorithms. part I: Tucker3. Chemometrics and Intelligent Laboratory Systems, 42: 93–103, 1998. → pages 9 [4] B. W. Bader and T. G. Kolda. M ATLAB tensor classes for fast algorithm prototyping. ACM Transactions on Mathematical Software, 32(4), 2006. → pages 50 [5] B. W. Bader and T. G. Kolda. M ATLAB tensor toolbox version 2.2. http://csmr.ca.sandia.gov/∼tgkolda/TensorToolbox/, January 2007. → pages 50, 56 [6] B. W. Bader and T. G. Kolda. Efficient MATLAB computations with sparse and factored tensors. Technical Report SAND2006-7592, Sandia National Laboratories, Albuquerque, NM and Livermore, CA, 2006. → pages 50, 56 [7] B. W. Bader, M. W. Berry, and M. Browne. Discussion tracking in Enron email using PARAFAC. Conference Paper, Text Mining Workshop 2007: Workshop at the SIAM Conf. on Data Mining, 2007. → pages 13 [8] B. W. Bader, R. Harshman, and T. G. Kolda. Temporal analysis of semantinc graphs using ASALSAN. In Proceedings of the 7th IEEE International Conference on Data Mining, 2007. → pages 13, 27, 34 80 [9] D. P. Berrar, W. Dubitzky, and M. Granzow, editors. A Practical Approach to Microarray Data Analysis. Springer, 2002. → pages 7 [10] M. Berry, M. Browne, A. Langville, V. Pauca, and R. Plemmons. Algorithms and applications for approximate nonnegative matrix factorizations. Computational Statistics and Data Analysis, 52(1):155–173, September 2007. → pages 32, 34, 35, 51 [11] M. W. Berry and M. Browne. Email surveillance using nonnegative matrix factorization. Computational and Mathematical Organization Theory, 11: 249–264, 2005. → pages 5 [12] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In SIGGRAPH 2000, volume 34, pages 417–424, 2000. → pages 74 [13] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd. edition, 1999. → pages 35, 36, 41 [14] R. Bro and S. D. Jong. A fast non-negativity-constrained least squares algorithm. Journal of Chemometrics, 11:393–401, 1997. → pages 1, 11, 37 [15] T. Brox, J. Weickert, B. Burgeth, and P. Mrázek. Nonlinear structure tensors. Image and vision computing, 24(1):41–55, 2006. → pages 10 [16] J. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov. Meatagenes and molecular pattern discovery using matrix factorization. volume 101, pages 4164–4169, 2004. → pages 5 [17] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16 (5):1190–1208, 1995. → pages 45, 47 [18] A. L. Cambridge. At&t database of faces. http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html, 1994. → pages 53 [19] M. Chu, F. Diele, R. Plemmons, and S. Ragni. Optimality, computation and interpretation of nonnegative matrix factorizations. Available at citeseer.ist.psu.edu/758183.html. → pages 34 [20] A. Cichocki, R. Zdunek, and S. Amari. Independent Component Analysis and Signal Separation, volume 4666/2007 of Lecture notes in Computer Science, chapter Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization, pages 169–176. Springer Berlin/Heidelberg, 2007. → pages 12 81 [21] V. De Silva and L. Lim. Tensor rank and the ill-posedness of the best low-rank approximation problem. Technical Report 06-06, Stanford University, 2006. → pages 30 [22] V. De Silva and L. H. Lim. Tensor rank and ill-posedness of the best low-rank approximation problem. SIAM Journal on Matrix Analysis and Applications, 2006. To appear. → pages 31 [23] I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with bregman divergences. In Proceeding of the Neural Information Processing Systems (NIPS) Conference, 2005. → pages 4 [24] C. Ding, X. He, H. D. Simon, and R. Jin. On the equivalence of nonnegative matrix factorization and K-means – spectral clustering. In Proceedings of the 5th SIAM International Conference on Data Mining, 2005. → pages 5 [25] L. Eldén. Matrix Methods in Data Mining and Pattern Recognition. SIAM, 2007. → pages 5 [26] N. M. Faber, R. Bro, and P. K. Hopke. Recent developments in CANDECOMP/ PARAFAC algorithms: A critical review. Chemometrics and intelligent laboratory systems, 65(1):119–137, 2003. → pages 26 [27] P. L. Fackler. Notes on matrix calculus. North Carolina State University, September 2005. → pages 36 [28] M. J. Fadili and J. Starck. Em algorithm for sparse representation-based image inpainting. IEEE, 2005. → pages 74 [29] D. FitzGerald, M. Cranitch, and E. Coyle. Nonnegative tensor factorization for sound source separation. Proceedings of the Irish signals and systems conference, 2005. → pages 12 [30] P. Fogel, S. S. Young, D. M. Hawkins, and N. Ledirac. Inferential, robust non-negative matrix factorization analysis of microarray data. Bioinformatics, 23(1):44–49, 2007. → pages 5 [31] M. Friedlander and K. Hatz. M ATLAB software package lsNTF–1.0. http://www.cs.ubc.ca/∼mpf/?n=Papers.Lsntf, 2006. → pages 59 [32] M. P. Friedlander. BCLS: A large-scale solver for bound-constrained least squares. Available at http://www.cs.ubc.ca/∼mpf/bcls/, 2006. → pages 36, 60 82 [33] M. P. Friedlander and K. Hatz. Computing nonnegative tensor factorizations. Technical Report TR-2006-21, University of British Columbia, Dept. of Computer Science, October 2006. → pages 12, 13, 27, 37, 59 [34] Y. Gao and G. Church. Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics, pages 3970–3975, 2005. → pages 5 [35] A. Graham. Kronecker Products and Matrix Calculus With Applications. Halsted Press,John Wiley and Sons, 1981. → pages 17 [36] W. R. Hamilton. On quaternions. In The Mathematical Papers of Sir William Rowan Hamilton, volume 3. Cambridge University Press, 2000. Originally published in the Journal “The London, Edinburgh, and Dublin Philosophical Magazine” in 1846. → pages 1 [37] C. Hansen. Rank-deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion. SIAM, 1998. → pages 40 [38] T. Hazan, S. Polak, and A. Shashua. Sparse image coding using a 3D non-negative tensor factorization. In Proceedings fo the Tenth IEEE International Conference on Computer Vision (ICCV’ 05), 2005. → pages 12, 27, 34, 59 [39] B. Hwang, V. Blanz, and T. Vetter. Face reconstruction from a small number of feature points. In In International Conference on Pattern Recognition (ICPR, pages 842–845, 2000. → pages 76 [40] A. Kapteyn, T. J. Wansbeek, and H. Neudecker. An approach to n-mode components analysis. Psychometrika, 51:269–275, 1986. → pages 25 [41] H. Kim, H. Park, and L. Eldén. Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. pages 1147–1151, 2007. → pages 12 [42] T. Kolda. Multilinear operators for higher-order decompositions. Technical Report SAND2006-2081, Sandia National Laboratories, April 2006. → pages 13, 15, 19, 90 [43] T. Kolda and B. Bader. Tensor decompositions and applications. Technical Report SAND2007-6702, Sandia National Laboratories, November 2007. → pages 13, 14, 15, 16, 18, 21, 24, 25, 26, 90 83 [44] P. M. Kroonenberg and J. De Leeuw. Principal component analysis of three-mode data by means of alternating least squares algorithms. Psychometrika, 45:69–97, 1980. → pages 25 [45] J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Applications, 18:95–138, 1977. → pages 28, 29 [46] J. B. Kruskal. Rank, decomposition and uniqueness for 3-way and n-way arrays. In R. Coppi and S. Bolasco, editors, Multiway Data Analysis, pages 7–18. Elsevier Science Publishers B. V, 1989. → pages 28 [47] L. Lathauwer, B. Moor, and J. Vandewalle. A multilinear sngular value decomposition. SIAM J. Matrix Anal. Appl., 21(4):1253–1278, 2000. → pages 1, 24, 25 [48] L. Lathauwer, B. Moor, and J. Vandewalle. On the best rank-1 and rank-(r1 , r2 , . . . , rn ) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl., 21(4):1324–1342, 2000. → pages 15, 25 [49] L. d. Lathauwer. Signal processing based on multilinear algebra. PhD thesis, Katholieke Universiteit Leuven, September 1997. → pages 15 [50] A. J. Laub. Matrix Analysis for Scientists and Engineers. SIAM, 2004. → pages 17 [51] D. Lee and H. Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401:788–791, 1999. → pages 3, 7, 12, 31, 35 [52] C. J. Lin. On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Transactions on Neural Networks, 18(6):1589–1596, 2007. → pages 33 [53] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503–528, 1989. → pages 47 [54] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symp. Math. Statistics and Probability, volume 1, pages 281–296, 1967. → pages 5 [55] M. Mørup, L. K. Hansen, J. Parnas, and S. M. Arnfred. Decomposing the time-frequency representation of EEG using nonnegative matrix and multiway factorization. Technical report, Technical University of Denmark, 2006. URL http://www2.imm.dtu.dk/pubdb/p.php?4144. → pages 10, 12 84 [56] M. Mørup, L. K. Hansen, and S. M. Arnfred. Algorithms for sparse nonnegative tucker decompositions. Neural Computation, 20:2112–2131, 2008. → pages 34 [57] T. Murakami and P. M. Kroonenberg. Three-mode models and individual differences in semantic differential data. Multivariate Behavioral Research, 38(2):247–283, 2003. → pages 9 [58] A. N. Netravali and B. G. Haskell. Digital Pictures: Representation, Compression, and Standards. Springer, year = 1995, edition = Second,. → pages 54 [59] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, second edition, 2006. → pages 46 [60] P. Paatero. A weighted non-negative least squares algorithm for three way parafac factor analysis. Chemometrics and Intelligent Laboratory Systems, 38:223–242, 1997. → pages 1, 11, 27, 42 [61] P. Paatero. Least squares formulation of robust non-negative factor analysis. Chemometrics and Intelligent Laboratory Systems, 37:23–35, 1997. → pages 4 [62] P. Paatero. Construction and analysis of degenerate parafac models. Journal of Chemometrics, 14:285–299, 2000. → pages 11, 30 [63] P. Paatero and U. Tapper. Positive matrix factorization: a nonnegative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111–126, 1994. → pages 1, 3, 11 [64] V. P. Pauca, J. Piper, and R. Plemmons. Nonnegative matrix factorization for spectral data analysis. Linear Algebra Appl., 416:29–47, 2006. → pages 5 [65] F. Shahnaz, M. Berry, P. Pauca, and R. Plemmons. Document clustering using nonnegative matrix factorization. J. Inform. Proc. Managment, 42: 373–386, 2006. → pages 5 [66] A. Shashua and T. Hazan. Nonnegative tensor factorization with applications to statistics and computer vision. In Machine Learning, Proceedings of the Twenty-Second International Conference, 2005. → pages 12 [67] J. Shen. Inpainting and the fundamental problem of image processing. SIAM News, 36(5), 2003. → pages 74 85 [68] P. Smaragdis and J. Brown. Nonnegative matrix factorization for polyphonic music transcription. In Proceedings IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 177–180, 2003. → pages 5 [69] A. Stegeman. Degeneracy in CANDECOMP/PARAFAC explained for p × p × 2 arrays of rank p + 1 or higher. Psychometrika, 71:483–501, 2006. → pages 30 [70] J. T. Sun, H. Zeng, H. Liu, Y. Lu, and Z. Chen. Cube SVD: A novel approach to personalized web search. In A. Press, editor, 14th International Conference on World Wide Web, pages 652–662, 2005. → pages 9 [71] K.-K. Sung. Learning and Example Selection for Object and Pattern Recognition. PhD thesis, MIT, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Cambridge, MA, 1996. → pages 53 [72] G. Tomasi and R. Bro. PARAFAC and missing values. Chemometrics and Intelligent Laboratory Systems, 75(2):163–180, 2005. → pages 54 [73] G. Tomasi and R. Bro. A comparison of algorithms for fitting the PARAFAC model. Computational Statistics and Data Analysis, 50:1700–1734, 2006. → pages 26 [74] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966. → pages 25 [75] C. F. Van Loan. The ubiquitous Kronecker product. Journal of Computation and Applied Mathematics, 123:85–100, 2000. → pages 16, 17 [76] M. A. O. Vasilescu and D. Terzopoulos. Multilinear image analysis for facial recognition. In Proceedings of the International Conference on Pattern Recognition, volume 3, pages 511–514, 2002. → pages 9 [77] E. Wachsmuth, M. W. Oram, and D. I. Perrett. Recognition of objects and ther component parts: responses of single units in the temporal cortex of the macaque. Cerebral Cortex, 4(5):509–522, 1994. → pages 9 [78] Z. Wang and A. C. Bovik. A universal image quality index. IEEE signal processing letters, 9:81–84, 2002. → pages 55 [79] Z. Wang, A. C. Bovik, and L. Lu. Why is image quality assessment so difficult? In Proceedings of the IEEE Int. Conference Acoust., Speech, and Signal Processing, 2002. → pages 54 86 [80] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assesment: from error measurement to structural similarity. IEEE Transactions on Image Processing, 13(1), January 2004. → pages 54 [81] M. Welling and M. Weber. Positive tensor factorization. Pattern Recognition Letters, 22:1255–1261, 2001. → pages 12, 31, 35 [82] C. Zhu, R. Byrd, and J. Nocedal. L-BFGS-B code. http://www.ece.northwestern.edu/∼nocedal/lbfgsb.html. → pages 49 [83] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization. Technical Report NAM-11, Northwestern University, EECS Department, 1994. → pages 45, 50 87 Appendix A List of routines The following are the main routines used in the implementation of our algorithm. funObj NTFAAonce.m, Compute the objective-function value and gradi- funObj NMFAAonce.m ent in the tensor and matrix cases, respectively. lbfgsbNMF Modification of the L-BFGS-B method that allows the scaling of the iterative points (i.e., allows the insertion of spacer steps). NMF AAonce.m Main NMF routine; computes NMF using an allat-once approach and the L-BFGS-B optimization method NTF AAonce.m Main NTF routine; computes NTF using an allat-once approach and the L-BFGS-B optimization method. scaleiterate Dvar.m Function that scales the current point according with the specified criterion (norm and scaling type). 88 getFactorsNMF.m, Auxiliary functions to reshape the problem vari- struct2vec.m, vec2struct.m ables (both vectorize them and express them as matrix factors). 89 Appendix B Summary of multilinear operators’ properties The following properties are listed in [42] and [43]. n-mode product Let Y ∈ IRJ1 ×J2 ×···×JN be an Nth-order tensor. • Given matrices A ∈ IRIm ×Jm , B ∈ IRIn ×Jn , Y ×m A ×n B = (Y ×m A) ×n B = (Y ×n B) ×m A, • Given matrices A ∈ IRI×Jn , B ∈ IRK×I , Y ×n A ×n B = Y ×n (AB). • If A ∈ IRI×Jn with full rank, then X = Y ×n A ⇒ Y = X ×n A† . • If A ∈ IRI×Jn is orthonormal, then X = Y ×n A ⇒ Y = X ×n AT . 90 (m 6= n). Matricization Let Y ∈ IRJ1 ×J2 ×···×JN be an Nth-order tensor, and N = 1, . . . , N. • If A ∈ IRI×Jn . Then, X = Y ×n A ⇔ X(n) = AY(n) . • Let A(n) ∈ IRIn ×Jn for all n ∈ N . If R = {r1 , . . . , rL } and C = {c1 , . . . , cM } partition N , then X = Y ×1 A(1) ×2 A(2) ×3 · · · ×N A(N) ⇔ T X(R×C :JN ) = A(rL ) ⊗ · · · ⊗ A(r1 ) Y(R×C :IN ) A(cM ) ⊗ · · · ⊗ A(c1 ) . • Consequently, if A(n) ∈ IRIn ×Jn for all n ∈ N , for any specific n ∈ N we have X = Y ×1 A(1) ×2 A(2) ×3 · · · ×N A(N) ⇔ T X(n) = A(n)Y(n) A(N) ⊗ · · · ⊗ A(n+1) ⊗ A(n−1) ⊗ · · · ⊗ A(1) . Tucker operator Given X ∈ IRI1 ×···×IN , G ∈ IRJ1 ×···×JN , and N matrices A(i) ∈ IRIi ×Ji with 1 ≤ i ≤ N, such that X = [[G ; A(1) , A(2) , . . . , A(N) ]], we can express X in the following ways: • Elementwise, J1 xi1 i2 ···iN = J2 JN ∑ ∑ ··· ∑ j1 =1 j2 =1 jN =1 (1) (2) (N) g j1 j2 ··· jN ai1 j1 ◦ ai2 j2 ◦ · · · ◦ aiN jN . 91 • Using outer products, X = J1 JN J2 (1) ∑ ∑ ··· ∑ j1 =1 j2 =1 jN =1 (2) (N) g j1 j2 ··· jN a: j1 ◦ a: j2 ◦ · · · ◦ a: jN . • Using a matricized form, T X(n) = A(n) G(n) A(N) ⊗ · · · ⊗ A(n+1) ⊗ A(n−1) ⊗ · · · ⊗ A(1) . • Using multi-mode multiplication, X = G ×1 A(1) ×2 A(2) ×3 · · · ×N A(N) . Kruskal operator Given X ∈ IRI1 ×···×IN and N matrices A(i) ∈ IRIi ×R with 1 ≤ i ≤ N, such that X = [[A(1) , A(2) , . . . , A(N) ]], we can express X in the following ways: • Elementwise, R xi1 i2 ···iN = (1) (2) 1 2 (N) . Nr ∑ ai r · ai r · · · · · ai r=1 • Sum of outer products, X = R (1) ∑ a:r (2) (N) ◦ a:r ◦ · · · ◦ a:r . r=1 • Using a matricized form, X(n) = A(n) A(N) ··· A(n+1) 92 A(n−1) ··· A(1) T . • Using a vectorized form vec(X ) = A(N) ··· A(1) T 1, where 1 is a ones vector of length R. Aditionally, if X is a third-order tensor in IRI×J×K such that X = [[A, B,C]], we can use slice notation. For example, each frontal slice can be written as X::k = AD(k) BT , k = 1, . . . , K, where D(k) ∈ IRR×R is a diagonal matrix whose diagonal elemens are given by the kth row of C, i.e., D(k) = diag(ck: ). Assume that we have matrices A(n) ∈ IRIn ×R for n = 1, . . . , N. Other properties of the Kruskal operator are the following. • If we have matrices B(Kn ×In ) for n = 1, . . . , N, then ii hh [[A(1) , · · · , A(N) ]]; B(1) , · · · , B(N) = [[B(1) A(1) , · · · , B(N) A(N) ]]. • If R̂ = {r1 , . . . , rL } and Ĉ = {c1 , . . . , cM } represent a partition of the set N̂ = {1, . . . , N}, then X = [[A(1) , A(2) , . . . , A(N) ]] ⇐⇒ X(R̂×Ĉ:I ) = A(rL ) N̂ ··· Ar1 A(cM ) ··· A(c1 ) T An−1 ··· A(1) T . • Thus, for any n ∈ N̂ we have X = [[A(1) , A(2) , . . . , A(N) ]] ⇐⇒ X(n) = A(n) A(N) 93 ··· A(n+1) .
Thesis/Dissertation
2008-11
10.14288/1.0051271
eng
Computer Science
Vancouver : University of British Columbia Library
University of British Columbia
Attribution-NonCommercial-NoDerivatives 4.0 International
http://creativecommons.org/licenses/by-nc-nd/4.0/
Graduate
Nonnegative tensor
Optimization
An all-at-once approach to nonnegative tensor factorizations
Text
http://hdl.handle.net/2429/1606