@prefix vivo: .
@prefix edm: .
@prefix ns0: .
@prefix dcterms: .
@prefix dc: .
@prefix skos: .
vivo:departmentOrSchool "Science, Faculty of"@en, "Computer Science, Department of"@en ;
edm:dataProvider "DSpace"@en ;
ns0:degreeCampus "UBCV"@en ;
dcterms:creator "Flores Garrido, Marisol"@en ;
dcterms:issued "2008-09-02T17:20:12Z"@en, "2008"@en ;
vivo:relatedDegree "Master of Science - MSc"@en ;
ns0:degreeGrantor "University of British Columbia"@en ;
dcterms:description """Tensors can be viewed as multilinear arrays or generalizations of the notion of matrices. Tensor decompositions have applications in various fields such as psychometrics, signal processing, numerical linear algebra and data mining. When the data are nonnegative, the nonnegative tensor factorization (NTF) better reflects the underlying structure. With NTF it is possible to extract information from a given dataset and construct lower-dimensional bases that capture the main features of the set and concisely describe the original data.
Nonnegative tensor factorizations are commonly computed as the solution of a nonlinear bound-constrained optimization problem. Some inherent difficulties must be taken into consideration in order to achieve good solutions. Many existing methods for computing NTF optimize over each of the factors separately; the resulting algorithms are often slow to converge and difficult to control. We propose an all-at-once approach to computing NTF. Our method optimizes over all factors simultaneously and combines two regularization strategies to ensure that the factors in the decomposition remain bounded and equilibrated in norm.
We present numerical experiments that illustrate the effectiveness of our approach. We also give an example of digital-inpainting, where our algorithm is employed to construct a basis that can be used to restore digital images."""@en ;
edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/1606?expand=metadata"@en ;
dcterms:extent "1130731 bytes"@en ;
dc:format "application/pdf"@en ;
skos:note "An All-at-Once Approach to Nonnegative Tensor Factorizations by Marisol Flores Garrido B.Sc., Universidad Autónoma de Coahuila, 2005 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science) The University Of British Columbia (Vancouver) August 2008 c©Marisol Flores Garrido, 2008 Abstract Tensors can be viewed as multilinear arrays or generalizations of the notion of matrices. Tensor decompositions have applications in various fields such as psy- chometrics, signal processing, numerical linear algebra and data mining. When the data are nonnegative, the nonnegative tensor factorization (NTF) better reflects the underlying structure. With NTF it is possible to extract information from a given dataset and construct lower-dimensional bases that capture the main features of the set and concisely describe the original data. Nonnegative tensor factorizations are commonly computed as the solution of a nonlinear bound-constrained optimization problem. Some inherent difficulties must be taken into consideration in order to achieve good solutions. Many exist- ing methods for computing NTF optimize over each of the factors separately; the resulting algorithms are often slow to converge and difficult to control. We pro- pose an all-at-once approach to computing NTF. Our method optimizes over all factors simultaneously and combines two regularization strategies to ensure that the factors in the decomposition remain bounded and equilibrated in norm. We present numerical experiments that illustrate the effectiveness of our ap- proach. We also give an example of digital-inpainting, where our algorithm is employed to construct a basis that can be used to restore digital images. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 NTF problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3.1 NMF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3.2 Moving to tensors . . . . . . . . . . . . . . . . . . . . . 9 1.4 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.5 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2 Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.1 Tensor’s definition . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.1 Matrix operations . . . . . . . . . . . . . . . . . . . . . . 16 2.2.2 Matricization . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.3 Tensor operations . . . . . . . . . . . . . . . . . . . . . . 19 2.2.4 Special tensors . . . . . . . . . . . . . . . . . . . . . . . 20 iii 2.3 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4 Rank and factorizations . . . . . . . . . . . . . . . . . . . . . . . 23 2.4.1 Rank of a tensor . . . . . . . . . . . . . . . . . . . . . . 23 2.4.2 Tucker decomposition . . . . . . . . . . . . . . . . . . . 23 2.4.3 CANDECOMP/PARAFAC decomposition . . . . . . . . 25 2.4.4 Other decompositions . . . . . . . . . . . . . . . . . . . 26 3 Nonnegative tensor factorizations . . . . . . . . . . . . . . . . . . . 27 3.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.1.1 Lack of uniqueness . . . . . . . . . . . . . . . . . . . . . 28 3.1.2 Degeneracy and best rank-R approximation . . . . . . . . 30 3.1.3 Lack of bounds . . . . . . . . . . . . . . . . . . . . . . . 31 3.2 Algorithms for computing the NTF . . . . . . . . . . . . . . . . . 31 3.2.1 Multiplicative update rules . . . . . . . . . . . . . . . . . 31 3.2.2 Alternating least squares . . . . . . . . . . . . . . . . . . 35 3.2.3 All-at-once approach . . . . . . . . . . . . . . . . . . . . 38 4 All-at-once approach . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.1 Regularization strategy . . . . . . . . . . . . . . . . . . . . . . . 39 4.1.1 Regularization function . . . . . . . . . . . . . . . . . . . 40 4.1.2 Spacer steps . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 Gradient of the function. . . . . . . . . . . . . . . . . . . . . . . 43 4.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.3.1 BFGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.3.2 L-BFGS-B . . . . . . . . . . . . . . . . . . . . . . . . . 46 5 Implementation and numerical experiments . . . . . . . . . . . . . 49 5.1 Implementation of the algorithm . . . . . . . . . . . . . . . . . . 49 5.1.1 L-BFGS-B routine . . . . . . . . . . . . . . . . . . . . . 49 5.1.2 Other implementation remarks . . . . . . . . . . . . . . . 50 5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.2.1 Experiment settings . . . . . . . . . . . . . . . . . . . . . 52 5.2.2 Performance of the algorithm on the matrix case . . . . . 56 5.2.3 Effects of regularization strategies . . . . . . . . . . . . . 57 iv 5.2.4 Comparison with other approaches . . . . . . . . . . . . . 59 6 An application to image processing . . . . . . . . . . . . . . . . . . 70 6.1 Image basis through nonnegative factorizations . . . . . . . . . . 70 6.2 Inpainting problem . . . . . . . . . . . . . . . . . . . . . . . . . 74 6.3 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . 75 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 A List of routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 B Summary of multilinear operators’ properties . . . . . . . . . . . . 90 v List of Tables 5.1 Comparison between interfaces . . . . . . . . . . . . . . . . . . . 57 5.2 CBCL Dataset, matrix approach . . . . . . . . . . . . . . . . . . 62 5.3 CBCL Dataset, tensor approach . . . . . . . . . . . . . . . . . . 64 5.4 AT&T Dataset, matrix approach . . . . . . . . . . . . . . . . . . 65 5.5 AT&T Dataset, tensor approach . . . . . . . . . . . . . . . . . . 67 5.6 CBCL Dataset, comparison with other algorithms. . . . . . . . . . 68 5.7 AT&T Dataset, comparison with other algorithms. . . . . . . . . . 69 6.1 Experiment with hand-written numbers. . . . . . . . . . . . . . . 73 vi List of Figures 1.1 NMF and clustering . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.2 PCA vs. NMF . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3 Tensor slices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1 Tensor fibers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2 Matricization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.3 Third-order special tensors . . . . . . . . . . . . . . . . . . . . . 21 2.4 Tucker decomposition . . . . . . . . . . . . . . . . . . . . . . . . 24 2.5 PARAFAC decomposition . . . . . . . . . . . . . . . . . . . . . 26 5.1 Effects of regularization . . . . . . . . . . . . . . . . . . . . . . . 58 6.1 Basis for CBCL face database with outliers. . . . . . . . . . . . . 71 6.2 Experiment using hand-written digits. . . . . . . . . . . . . . . . 73 6.3 Inpainting problem, test 1; different λ values. . . . . . . . . . . . 76 6.4 Inpainting problem, test 1. . . . . . . . . . . . . . . . . . . . . . 77 6.5 Inpainting problem, test 2. . . . . . . . . . . . . . . . . . . . . . 78 6.6 Images from the AT&T image database. . . . . . . . . . . . . . . 78 6.7 Inpainting problem, test 3. . . . . . . . . . . . . . . . . . . . . . 79 vii Acknowledgements I would like to thank my supervisor, Michael P. Friedlander, for his unbounded support, patience and encouragement. He introduced me to a branch of applied mathematics that I knew little about and he did it showing real passion for the field (there are sparkles on his eyes every time he talks about optimization!). I cannot thank him enough for all he has done for me. Thanks as well to Chen Greif for reading this thesis and providing many helpful comments. In the last two years I have met some wonderful people and made a few ex- cellent friends. Thanks to all my bullpen mates; despite the endless homework I really enjoyed my first year here. A special thanks goes to the Tuesdays’ group: CB, Hoyt, Lloyd, Hao, Seonah, and Shin; some of what I learned at our gatherings are among the most important legacies of this period. I would also like to thank my excellent SCL labmates for their kindness and support. A special thanks to Ewout for always being kind and for his willingness to share his vast knowledge. Thanks to all the Mexican friends at UBC who, with the help of lots of corn and some tequila, relieved my home sickness and made the sunless weather bearable. I want to particularly thank Ignacio, Itzia, and Ramón, for all the laughs, support, and companionship they provided. Thanks to Margaret, an admirable woman and a great friend. I can hardly believe how lucky I was for entering her home. Thanks to my old, good, unconditional friends: Citlali, Esli, Humberto, Javier, Marisol and Mario. They provide solid friendship even at times when I’m not sure who I am. Lastly, but most importantly, I thank my parents, brother, and sister for their viii endless support, faith and love. It would have been impossible to complete this work without them. I could not have asked for a better family. ix To Nates, a.k.a. Lady Melancholy, for being the most beautiful cronopio I know. x Chapter 1 Introduction The notion of a tensor has been in the mathematical record since the 19th century [36] but has gained new strength in recent years with the rise and rapid development of fields like computational-based data mining and machine learning. Tensor’s multilinear structure permits the storage and organization of large amounts of data. Through the use of multiple indexes, they facilitate an efficient and flexible management of information. It is also recognized that the manipulation and use of alternative expressions of a given tensor—like factorizations—can provide an insight into the relevant infor- mation in the stored data, find structures that are not evident by simple analysis, and facilitate the extraction of useful information about the set. The higher-order SVD (HOSVD) [47], for example, represents a tool that can be compared with the principal component analysis (PCA) for two-dimensional arrays. There are an extensive number of applications that involve nonnegative data and require nonnegativity in their analysis in order to avoid a misrepresentation of physical realities. This led to the mergence of the basic ideas on tensor factoriza- tions and nonnegative matrix factorizations (NMF)—introduced in 1994 [63]—to create nonnegative tensor factorizations (NTF)—introduced in 1997 [14, 60]—, a tool that extends the scope of tensor factorizations as analysis instruments by showing both usefulness in the treatment of physically meaningful data, and the structure that makes tensors appealing in data management. 1 1.1 Tensors Tensors can be put in the context of formal differential geometry or physics, but the concept that will be used in this work is the one that is found in most literature related to multilinear algebra and scientific computing, where a tensor simply rep- resents a multilinear array or a generalization of the notion of scalar, vector, and matrix. Two ideas are commonly used to describe this concept. The first presents the notion of a Nth-order tensor as a collection of numbers organized in a N- dimensional array, whose elements require N indices to be referred to. The second is the idea of a tensor as a geometrical entity that “can be expressed as a multi-dimensional array relative to a choice of basis of the particular space on which it is defined. The intuition underlying the tensor concept is inherently geometrical: as an object in and of itself, a tensor is independent of any chosen frame of reference” [1]. Both ideas clearly state that a matrix (represented by a table whose elements are addressed by two indices) and a vector (represented by a list of elements that require only one index to be identified) can be understood as 1st-order and 2nd- order tensors, respectively. The order of a tensor is then understood as its number of dimensions (also known as ways or modes). 1.2 NTF problem Given an N-dimensional tensor V ∈ IRI1×I2×···×IN , the NTF problem consists in finding a core tensor G ∈ IRJ1×J2×···×JN , and a set of N matrices A(n) ∈ IRIn×Jn , n= 1, . . . ,N, such that V ≈ G ×1 A(1)×2 A(2)×3 · · ·×N A(N), (1.1) where the N+1 constitutive factors G ,A(1), . . . ,A(N) are required to be componen- twise nonnegative, i.e., G ,A(1), . . . ,A(N) ≥ 0. The meaning of (1.1) will be clarified in Chapter 2, where a description of notation and tensor operations is provided. For now, it is enough to mention that 2 the matrices A(1), . . . ,A(N) are generally understood as components of V along its different dimensions (tensor dimensions are also referred to as ways or modes), whereas G represents a kernel tensor (usually smaller than V , though having the same order) whose entries capture the relation between the matrices A(i). While previously it was stated that NTF extends the scope of tensor factoriza- tions, an alternate interpretation is that they just generalize NMF. Thus, if V is a 2nd-order tensor—i.e., a matrix—we get the problem of factorizingV ∈ IRI1×I2 into matrices G,A(1),A(2) such that V = G×1 A(1)×2 A(2) = A(1)GA(2)T , (1.2) with G,A(1),A(2) ≥ 0, which is essentially the NMF problem. 1.3 Motivation The motivation for NTF is mainly based on two facts: the proved ability of NMF to extract relevant information from a given dataset, identifying and capturing the main features of the set, and the variety of applications that rely on having a tensor structure to manage the information. In this section a short explanation of both is provided. 1.3.1 NMF NMF were first described by Paatero [63] and later reintroduced by Lee and Seung [51]. Lee and Seung’s science article begins with the words “Is perception of the whole based on perception of its parts? There is psychological and physiological evidence for parts-based representa- tions in the brain, and certain computational theories of object recog- nition rely on such representations. . . . Here we demonstrate an al- gorithm called nonnegative matrix factorization that is able to learn parts of faces and semantic features of text.” This statement captures the essential motivation for NMF, which is to analyze and learn about a whole by dividing it into meaningful coherent parts that keep the 3 nonnegative characteristics of the original data. The NMF’s goal is to simplify the interpretation of such parts. In one sense, every matrix factorization—e.g., QR, SVD—provides a better understanding of the original matrix. What sets NMF and the other well-known factorizations in linear algebra apart is the introduction of the nonnegativity con- straints over the factors to be constructed. Given a nonnegative matrix V ∈ IRm×n and a positive integer r 0, then (ViK−A(i)∗KTK) jk = ((V(i)−A(i)∗A(i)∗T )A(i)∗ ) jk = 0. Thus, ∂F ∂A(i) = 0 for all i, i.e., we are at a stationary point of the function F (the derivative of F is not obvious but details about it are given in section 4.2). This reasoning has two major drawbacks. First, a stationary point is not always an optimum; i.e., it could be just a saddle point (which could be verified if the second-order derivative of the function is available). Second, the guaranteed arrival at a stationary point is based on the assumption that all the elements in A(i) are positive (for all i). Therefore, when the algorithm converges to a limit point in the interior of the feasible region, we can guarantee that the point is a stationary point; but the algorithm might also stop at a point that is not stationary and simply lies on the boundary of the feasible region. It is worth noting the difficulty in proving the convergence of multiplicative updates. The difficulty comes mainly from the nonnegativity constraints. Lin [52] discusses the difficulty of proving convergence in the NMF problem and he pro- poses slight modifications to existing updates in order to prove their convergence. In the tensor case, we find that, in fact, the work by Welling and Weber lacks a convergence proof. Given the challenges that proving convergence for this approach presents, it is not surprising that the speed of convergence has not been accurately described. However, despite the lack of precise results, it is evident in practice that the ma- 33 jor drawback of the multiplicative rules is their slow convergence [10]. Therefore, after Welling and Weber’s work, modifications to the basic multiplicative update algorithm 1 have been proposed which try to adapt to the tensor problem the im- provements made in NMF multiplicative rules. We find, for example, that in the NMF problem, modifications to the initial multiplicative rules lead to a class based on gradient descent methods [19]. In the tensor case, Hazan, Polak, and Shashua [38] propose something analogous, where the update for each column of the factor-matrices has the form a(i)j = a (i) j −µ(a(i)j ) ∂F ∂a(i)j , where we can clearly identify a step size represented by the function µ and a de- scent direction given by the negative gradient The quantities µ(a(i)j ) are step size parameters that can vary depending on the algorithm; the work by Hazan et al. focuses on 3rd-order tensors and, for a tensor in IRI1×I2×I3 , uses µ(a(i)jk ) = a(i)jk ∑r`=1 ( a(i)j` ∏ 3 n=1,n 6=i〈a(n)` ,a(n)k 〉 ) (3.7) to construct a rank-r approximation, where i= {1,2,3}, k= 1, . . . ,r, and j depends on the the mode i and in each case goes through the values 1,. . . ,Ii. It is worth mentioning that Hazan et al. use second-order derivatives in the analysis of their update rule. Other modifications include the work by Mørup, Hansen, and Arnfred [56], who use additional constraints to obtain sparsity and encourage uniqueness; and the work by Bader, Harshman, and Kolda [8], who in- troduce a new algorithm—ASALSAN, for computing three-way DEDICOM (“de- composition into directional components”; a family of decompositions for analyz- ing intrinsically asymmetric relationships)—and impose nonnegativity constraints on the model (they develop a nonnegative version of ASALSAN). None of these modifications is described with detail since they are not relevant to this research. However, it is important to note that there is a trade-off between the improvements achieved by more sophisticated rules and the complexity and 34 number of operations required, as seen by examining equation (3.7). Despite the refinement in the rules, the multiplicative-update approach has shown poor performance and no guarantee of getting an optimal point. Never- theless, it continues to be the most popular strategy to compute nonnegative de- compositions with an NMF version of the algorithm given in [10] being included in the statistics toolbox of the software MATLAB v.7.6. 3.2.2 Alternating least squares A way of dealing with optimization problems where the objective function and constraint functions have a partially separable structure with respect to the prob- lem’s variables consists of breaking down the problem into smaller ones that can be solved one-at-a-time in sequence. This is the intended goal of the optimization algorithm known as block-coordinate descent or nonlinear Gauss-Seidel [13, Sec- tion 2.7]. The alternating least squares (ALS) algorithm is a particular case of this more general approach. The ALS approach fixes all the factors except one, and optimizes the problem with respect to the free factor. In this way, the nonlinear least-squares problem is decomposed into a sequence of smaller, linear problems. Consider the matrix case and recall that we want to solve the problem minimize W,G,H≥0 1 2‖V −WGHT‖2F , (3.8) whereV ∈ IRm×n,W ∈ IRm×r, G∈ IRr×r, andH ∈ IRn×r. We have explicitely written the factor G (as it is done in (3.3)) but for now, in the remainder of this section, we can assume that G≡ I (an assumption that typically appears in the NMF literature; e.g. [51, 81]) . If we follow the ALS approach, the problem given by (3.8) can be decomposed into two least-squares problems, minimize W≥0 1 2‖V −WHT‖2F , and minimizeH≥0 1 2‖V −WHT‖2F . (3.9) Note that each of these least squares problems is a collection of linear least-squares problems. For example, denote by v j the columns of the matrix V , and by h j the rows of the matrix H. The first problem in (3.9) can be solved by the sequence of 35 problems minimize h j≥0 1 2‖v j−WhTj ‖22, j = 1, . . . ,n. (3.10) Each of the subproblems (3.10) can be solved with any optimization algo- rithm for bound-constrained problems, like gradient-projection methods [13, 2.3], or BCLS [32]. Moreover, these n subproblems can be solved for in parallel. In the tensor case, the general nonlinear least-squares problem is partitioned into the subproblems NTFA(n) minimize A(n) 1 2‖[[G ;A(1), . . . ,A(n), . . . ,A(N)]]−V ‖2F subject to A(n) ≥ 0, for n= 1, . . . ,N, and NTFG minimize G 1 2‖[[G ;A(1), . . . ,A(N)]]−V ‖2F subject to G ≥ 0. When solving the subproblem associated with each matrix A(n), we have the option of matricizing the problem and solving the problem minimize A(n) 1 2 ∥∥∥A(n)⊗ GT(n)A(n)T −V T(n)∥∥∥2F subject to A(n) ≥ 0 (3.11) by computing one row of A(n) at a time. In this way we can take advantage of the standard linear least-squares algorithms to solve each subproblem. Another possibility is to vectorize the problem to solve for all the rows of A(n) simultaneously. To do this, recall, from matrix calculus (see, e.g., [27]), that for any three matrices A, X , and B of suitable size, we have vec(AXB) = ( BT ⊗A)vec(X). (3.12) 36 Thus, vec ( A(n)⊗ G T (n)A (n)T ) = vec (( A(n)⊗ G T (n) ) A(n)T I(In) ) = ( I(In)⊗A(n)⊗ GT(n) ) vec(A(n)T ), and (3.11) becomes minimize A(n) 1 2 ∥∥∥(I(In)⊗ (A(n)⊗ GT(n)))vec(A(n)T )−vec(V T(n))∥∥∥22 subject to A(n) ≥ 0. To solve the subproblem associated with the core tensor G , we need to isolate G(n) in any of the subproblems given by (3.11). We can use (3.12) again to get vec ( A(n)G(n)A (n)T ⊗ ) = ( A(n)⊗ ⊗A(n) ) vec(G(n)), which, by choosing n= 1, becomes vec ( A(1)G(1)A (1)T ⊗ ) = A⊗vec(G(1)), and leads to the optimization subproblem minimize G 1 2 ∥∥A⊗vec(G(1))−vec(V(1))∥∥22 subject to G ≥ 0. The ALS approach was first used in the tensor case by Bro and De Jong [14] in 1997. Some later modifications include the work of Friedlander and Hatz [33], who incorporated a regularization function and a strategy to keep the variables bounded using the `1-norm. They also use the vectorized version of the problem and solve simultaneously for all the rows of each matrix factor, increasing the computations’ efficiency. Depending on the implementation, ALS algorithms can be effective and the drawbacks faced by these algorithms are those that are inherent in the NTF prob- lem: no guarantee of convergence to a local minimum (it could be a saddle point), great sensitivity to the initial point, and a considerable increase on the work when 37 additional constraints are imposed. 3.2.3 All-at-once approach The third approach consists in solving directly the nonlinear optimization problem with the boundary constraints. It offers the advantage of keeping the global relation between all the variables, which can benefit the quality of the solution. One of the main challenges of this approach, however, is the effective solution of large problems. This approach is the focus of this research and the details regarding its descrip- tion and implementation are given in the next chapter. 38 Chapter 4 All-at-once approach The all-at-once approach can result in large-scale optimization problems—where computing the gradient and performing operations can be very expensive— but it can also save computational work by quickly reaching a satisfactory solution. However, prior to exploring this approach, it must be considered that one of the main challenges in solving the NTF problem is the development of a regularization strategy that resolves the problem’s inherent ill-posedness. In this chapter we describe both our regularization strategy and our algorithm to implement the all-at-once approach. 4.1 Regularization strategy The indeterminacies due to scaling and permutation are somehow avoided when we compute sequentially the factors of the approximation, which is done in ALS and multiplicative update rules. In both cases, we choose an initial approximation to the solution and keep all the values fixed while looking for a specific factor, which is determined by the status of the other factors. It follows that both scale and permutation are implicitly chosen together with the initial guess. In the other approaches the ill-conditioning of the problem is manifested by the fact that the algorithm converges to different solutions when it starts with different initial values. However, in the all-at-once approach the ill-posedness can directly affect the success of the algorithm. It is therefore important to establish a way to 39 control the factors we are computing. We use two different strategies to regularize the problem: a classic regulariza- tion strategy that takes the form of a penalty function and an explicit bound on the elements of the matrix factors. Both strategies are described next. 4.1.1 Regularization function Tikhonov regularization [37, Section 5.1], which penalizes the two-norm of the solution, is the best-known method to regularize optimization problems. In least-squares problems, the method requires the addition of a regularization term to the objective of the optimization problem, in order to favor a solution with small norm. In the problem we are solving, we use the regularization function f (γ,A(1), . . . ,A(N)) = N ∑ n=1 γn Jn ∑ j=1 ‖a(n)j ‖1 = N ∑ n=1 γn Jn ∑ j=1 In ∑ i=1 |a(n)i j |, (4.1) which penalizes big values in the elements of the matrix factors and, as a conse- quence, encourages solutions with smaller norms. Note that in (4.1) we allow a different weight, γn, to constrain each matrix factor. The function f is expected to encourage sparsity of the computed factors, since it pushes the small values towards zero and leaves the significant big entries rela- tively unchanged. The use of a function that involves the `1-norm and enhances sparseness helps to regularize the optimization problem. It is also remarkable that sparsity is a property coherent with the motivation of the NTF problem; it enriches the solution from an interpretive point of view because it leads to basis factors that represent independent parts of the whole set. Consider, for example, the matrix case where, given a m× n matrix V whose columns represent images, we want to find matricesW,G,H such thatV ≈WGHT . As explained in chapter 1, W ∈ IRm×r represents a basis of images of rank r, whereas H ∈ IRn×r contains the necessary coefficients to express each column ofV 40 in terms of the basisW . In this case, we might want to have sparsity in H, meaning that we want the basis images to have strong, independent features so that the least number of them are required to construct each image of the original set. Note that we may wish to allow an arbitrarily dense basis W if that accurately reflects the nature of the original images. By differently weighting the penalization on the factors, we are trying to ap- proach the “ideal case”: if r= n,W would be a full matrix (we would haveW =V ), while H would be the identity matrix (very sparse). So, the chosen application will determine the penalty parameter used to control each factor. 4.1.2 Spacer steps In Section 3.1, a potential problem in tensor factorizations was identified. Matrix factors can be arbitrarily scaled and, depending on the initial guess, could become large in norm. To keep all the variables bounded and mantain an energy equilibrium among the matrix factors, we scale the columns of each factor every n iterations and “ab- sorb” the scale into the core diagonal tensor G . This periodic rescaling of the factors is carefully crafted so that the objective value does not increase. This property is crucial to ensure that the overall minimiza- tion algorithm continues to converge. The property is confirmed by the following theorem. Theorem 1. (Convergence for Spacer Steps [13, Section 1.2]) Consider a sequence {xk} such that f (xk+1)≤ f (xk), k = 0,1, . . . Assume that there exists an infinite setK of integers for which xk+1 = xk+αkdk, ∀k ∈K , where {dk}K is gradient related and αk is chosen by the minimization rule, or the limited minimization rule, or the Armijo rule. Then every limit point of the subsequence {xk}K is a stationary point. 41 It is then possible to insert a rescaling step into an optimization algorithm with- out interfering with convergence. Such a strategy can be used to control the norm of the factors. A possible strategy to bound the factors is to scale them in such a way that the `1-norm of each matrix is equal to a constant λ ; in this case, each factor A(n) is replaced by Ā(n) = ( λ ‖A(n)‖1 ) A(n), and the tensor G is replaced by Ḡ = ( N ∏ n=1 ‖A(n)‖1/λ ) G . Another possibility consists of individually normalizing each column of the matrix factors. This can be done by multiplying each matrix factor A(n) (where 1≤ n≤ N) by a diagonal matrix D(n) such that (D(n))ii = 1 ‖a(n)i ‖ . By multiplying the core tensor G by the inverse of each diagonal matrix D(n), we can guarantee that the value of the objective function remains constant. This is a consequence of the scaling indeterminacy described in (3.4). While the first strategy involves a matrix norm and can be described as im- posing the condition ‖A(n)‖1 = λ for all n in {1, . . . ,N}, the second strategy does not represent a matrix norm. Nevertheless, it does not contradict the theory estab- lished by the spacer-steps theorem, it achieves the objective of keeping the solution bounded, and Paatero [60] suggests that it “would be easy to implement without sacrificing the convergence properties of the algorithm”, although he does not the- oretically justify the suggestion. 42 4.2 Gradient of the function. According to our regularization strategy, the objective function of our problem is given by Φγ(G ,A(1), . . . ,A(N)) = F(G ,A(1), . . . ,A(N))+ f (γ,A(1), . . . ,A(N)) = 12‖R‖2F + f (γ,A(1), . . . ,A(N)), where f (γ,A(1), . . . ,A(N)) is the regularization function defined in (4.1), and R is defined as the residual R = G ×1 A(1)×2 A(2)×3 . . .×N A(N)−V where V is a given Nth-order tensor in IRm1×...×mN , G is a diagonal Nth-order tensor in IRr1×...×rN , and A(i) is a matrix with dimensions mi× ri, for i= 1, . . . ,N. IfY denotes any of the factors G ,A(1), . . . ,A(N), by using the chain rule we have ∂Φγ ∂Y = ∂F ∂Y + ∂ fγ ∂Y = ∂F ∂R ∂R ∂Y + ∂ fγ ∂Y . Note that by defining F(G ,A(1), . . . ,A(N)) as 12‖R‖2F , we are abusing nota- tion because, although it could be intuitive that we are referring to the sum of the squares of the entries ofR, we should formally have either F = 12‖R(n)‖2F = 12 tr ( RT(n)R(n) ) , and dF dR(n) = tr ( RT(n)dR(n) ) , (4.2) where R(n) is a mn×∏i6=nmi matrix, or F = 12‖vec((R)‖22 = 12vec(R)Tvec(R), and dF dR = vec(R)dR (4.3) where vec(R) has dimensions (∏imi)×1. 43 Using (4.2), (2.1), (2.2), and (4.3) we have ∂Φγ ∂A(n) = (AnD⊗ Imn)Tvec(R(n))+ γn = ((AnD) T ⊗ Imn)vec(R(n))+ γn = vec(ImnR(n)A n D)+ γn = vec(R(n)A n D)+ γn, and ∂Φγ ∂g j j... j = vec(R)Tvec(a(1): j ◦a(2): j ◦ · · ·a(N): j ). (4.4) Therefore, for the particular case of matrices, in which F = 12‖R‖2F = 12‖WGHT −V‖2F , (4.5) (see Section 3.2.2) we would have the derivatives ∂F ∂W = vec(RHG)+ γW and ∂F ∂HT = vec(GW TR)+ γH . Note that we work with the transpose of H; see (4.5). Derivatives with respect to G are similar to the derivatives given in (4.4). It is necessary to address the effects of scaling (by spacer steps) on the deriva- tive of the function. If we use the `1-norm of each factor to bound the solution, we get a final solu- tion in which the core tensor is a scalar multiple of the initial tensor G . Thus, the introduced variation is a scaling of the Kruskal product, independent from G , and the derivatives of the function must simply be scaled. Furthermore, since G is not really subject to variations, we can make it equal to the identity tensor and eliminate it as a problem variable. On the other hand, if we scale each of the columns of each of the factors us- ing a different scalar, we do need to use a diagonal tensor to express the induced variation, and the diagonal elements of G must be considered variables (each’s 44 derivative would be the one given in (4.4)). 4.3 Optimization Now that the regularized objective function and its derivative have been estab- lished, it is necessary to choose an optimization method to solve the problem. Any optimization method for nonlinear problems subject to simple bounds can be used to solve the problem, but some modifications are necessary to incorporate the scaling strategy. Formally, using the optimization method, we want to generate the spacer steps to be inserted in a sequence that every n iterations switches the current iterate solution by another equivalent through an elementary change (scaling). We use the L-BFGS-B method to generate such spacer steps. This method, a modification of the BFGS method, is broadly used since its introduction in 1994. Since it is a well-known method, this section will just briefly describe the ideas behind it, and refer to the original publications [17, 83] for details. 4.3.1 BFGS The BFGS (Broyden-Fletcher-Goldfarb-Shanno) method, which belongs to the class of quasi-Newton methods, is an optimization method designed to solve un- constrained problems. Recall that all the Newton-type methods find, at the k iteration, a search direc- tion dk by solving a linear system of equations of the type Bkdk =−gk, (4.6) where gk represents the gradient of the objective function, and Bk is an approxima- tion to the Hessian of the objective function. Quasi-Newton methods belong to the Newton-type family of optimization meth- ods. Their goal is to approximate the curvature of the objective function without evaluating the Hessian. To accomplish this, the methods use a positive definite matrix updated by a low-rank matrix at each iteration, i.e., they solve the equation 45 (4.6) with Bk+1 = Bk+Uk, where Uk has low rank and is defined by the specific quasi-Newton method, but always ensures that the matrix Bk+1 remains positive definite. In particular, BFGS uses the rank-two update Bk+1 = Bk− qkq T k qTk sk + ykyTk yTk sk , yTk sk > 0, where sk represents the change in the variables, yk is the change in the gradient, and qk is the product Bksk. Notice that the algorithm requires that the user provide the gradient, but not the Hessian, of the objective function. A linesearch that satisfies the Wolf conditions guarantees that the updated matrix remains symmetric positive definite [59, Section 6.1]. BFGS is one of the most popular update schemes in the class of quasi-Newton methods and has been modified to deal with large-scale problems, giving rise to the limited-memory variants, described below. 4.3.2 L-BFGS-B L-BFGS and L-BFGS-B are variants of the BFGS method for solving large-scale optimization problems. The initial “L” in both names represents the words “limited memory” and represents a feature shared by the algorithms. The additional “B” at the end of the second name stands for “bounded” and draws attention to the main difference between the codes: L-BFGS is designed to solve unconstrained problems, whereas L-BFGS-B can handle bounds on the variables. As its name suggests, a “limited memory” algorithm is characterized by its ability to restrain the amount of storage or computation used during its execution. While the conventional BFGS method approximates the Hessian, using a smaller reduced matrix that increases in dimension with each iteration, the L-BFGSmethod does not keep all the s and y from past iterations, and it updates the Hessian using information only from the previous p iterations (with p provided by the user). When the number of iterations is smaller than p, we get the usual BFGS update, and when it is larger, we have a limited memory BFGS (L-BFGS) update. 46 L-BFGS-B is identical to the L-BFGS algorithm except for the additional prop- erty of allowing bounds on the variables. The constraint control is achieved by using a projected-gradient approach to determine an active set, and then solving a quadratic problem over the subspace of free variables, handling the active bounds by Lagrange multipliers. Liu, Nocedal et al. [53] proved that the L-BFGS-B algorithm converges R- linearly. For our algorithm (with spacer steps) we know that the convergence is guaranteed by the theorem 1 and, although the speed of convergence might dete- riorate with the incorporation of the scaling strategy, it is reasonable to expect a linear rate of convergence [17]. The general L-BFGS-B is shown in Algorithm 2 (see the original paper [17] for details). 47 Algorithm 2: L-BFGS-B algorithm. Input: Starting point x0; Integer n (number of limited memory corrections stored); Lower and upper bounds (l and u); Tolerance tol k = 0 while TRUE do if ‖P(xk−gk, l,u)− xk‖∞ < tol then STOP end Compute the Cauchy point; Compute the search direction dk by either the direct primal, the conjugate gradient or the dual method; Perform a line search along dk, subject to the bounds on the problem, to compute a steplength and set xk+1 = xk+λkdk; Compute ∇ f (xk+1); if yk satisfies sTk yk > eps‖yk‖2 then Add sk and yk to Sk and Yk; end if More than n updates are stored then Delete the oldest column from Sk and Yk; end Update STk Sk, Y T k Yk, and other parameters related with the Bk update; k← k+1 end 48 Chapter 5 Implementation and numerical experiments We combine the regularization strategies described in section 4.1 with the L-BFGS- B method and implement an algorithm with an all-at-once approach. Experiments are then performed, which validate our algorithm. In this chapter we provide details about the algorithm implementation and de- scribe the main results of the numerical experiments, which include observations on the performance of the algorithm for the matrix case, the effects of different settings on regularization parameters, and a comparison of the performance of the all-at-once approach against other approaches. 5.1 Implementation of the algorithm 5.1.1 L-BFGS-B routine We implement our algorithm using the L-BFGS-B [82] FORTRAN code, which we call from MATLAB via a MEX interface. We take advantage of the reverse communication scheme on which L-BFGS-B is built, and we treat the main L- BFGS-B routine Setulb as a black box that manages the optimization process. It accepts the following inputs: • the current iteration xk; 49 • a set of parameters and working arrays,P , required by the FORTRAN rou- tines (a complete description can be found in [83]); • a control variable, task, that indicates the state of the optimization process. The routine generates an output that consists of • a point xk+1 such that fk+1 < fk; • the updated set of parametersP; • the updated value of task. Every n iterations, we replace the point xk by a scaled one, x̂k, modify the gradient gk, and change the value of some of the parameters inP that depend on gk. The function Setulb uses as stopping criteria the norm of the projected gra- dient and the relative change on the value of the function. It can also have an “abnormal” termination when the routine terminates without being able to satisfy the termination conditions, or an “error” termination when it detects an error in the input parameters. To these criteria, we added the number of iterations and the amount of elapsed time to control the end of the algorithm execution. 5.1.2 Other implementation remarks We have two implementations of the algorithm, one for NTF and one for NMF. The difference between them lies on the gradient computation and although it is possible to use the tensor code to solve a matrix problem and obtain the same results, it is cheaper to use the matrix interface (more details on this aspect are given in section 5.2.2). The tensor implementation relies on the MATLAB Tensor Toolbox, version 2.2 [4, 5, 6]. There are two important settings in the algorithm that we leave to the user to choose, as they are considered input parameters: the value of r and the initial point. We briefly comment on each of them in the remainder of this section. 50 Choice of r The value of r is essential in the definition of the objective function as it sets the number of variables involved in the problem. In some sense, it determines the properties that we expect to find in the solution. Furthermore, if a problem has been solved using a matrix approach with rank rm, it is difficult to determine the rank rt that leads to an equivalent restatement of the problem with tensors. To illustrate this point, consider again a problem in image-processing context in which there is a set of n3 images, each with a resolution of n1×n2 pixels. Using a matrix approach, the set is represented by a (n1n2×n3) matrix, where each column represents an image. If we find an approximation with rank rm, we have a basisW (n1n2× r), set of coefficients H (n3× r), and diagonal matrix G (we consider only the r diagonal elements). Thus we have (n1n2+n3+1)rm variables. In the tensor approach, an approximation with the same rank (rt = rm) has three matrix factors: A(1) (n1× r), A(2) (n2× r), and A(3) (n3× r), and a diagonal core tensor G (r diagonal elements). Thus we have only (n1+n2+n3+1)rm elements. It is clear that the difference in structural arrangement of the elements, makes it inappropriate to use the same value rm in both approaches. However, it is not clear what value of rt makes the problems “equivalent”. We can choose rt such that the number of variables in both problems is the same, i.e., rt = ( n1n2+n3+1 n1+n2+n3+1 ) rm, (5.1) but that would diminish the benefits of using a tensor approach. In other words, using the same value of rt = rm is not enough to get a good result, but using (5.1) results in wasted storage and greatly increases the computational time. Initial guess In addition to the solution’s lack of uniqueness, the NTF problem might have local minima and, therefore, we expect great sensitivity of the algorithm to the initial guess. Some authors [10] have suggested the use of a Monte Carlo method to explore different starting points and determine the one that brings the best solution. 51 We do not attempt to solve this aspect of the problem. Our algorithm uses the initial guess given by the users. If no initial point is provided, it simply generates a random starting point with positive elements. 5.2 Experiments We perform experiments that can be grouped into three different types. First, we explore the performance of the algorithm in the matrix case, i.e., when the input in the algorithm is a 2nd-order tensor. Then, we explore the effectiveness of the regularization strategies—penalty function and scaling—using different parameter settings and comparing the obtained solutions. Finally, we choose the parameter values that give the best performance, and compare the all-at-once approach against the alternative approaches described in chapter 3 (a multiplicative update rule and an ALS implementation). Although we focus on the performance of the algorithm with no particular application, all the experiments employ image datasets, and, thus, use an image- related context to analyze the solutions (to determine their quality). The initial points are generated using MATLAB’S rand function, and the state of the method’s generator is initialized with the purpose of allowing the computations to be repeated. So, the initial point is not chosen in a special way and does not use any a priori information on the problem, but it is the same every time we compute a solution independently of the algorithm and parameter settings. The experiments have been performed using MATLAB version 7.6.0 on an Intel Pentium 4 CPU 3.20GHz, 2 GB RAM computer. 5.2.1 Experiment settings Used databases We use two different image databases to perform the experiments. Having more than one set of images increases the number of experiments that can be performed, but this also allows us to assess how the quality of the solutions varies as we modify the characteristics of the data (i.e., lighting, level of variation between elements of the set), and number of variables involved (i.e., images’ resolution). 52 1. CBCL Dataset [71]. This database has been used at MIT’s Center for Bio- logical and Computational Learning (CBCL) and was developed by K. Sung. The database consists on 2429 grayscale face images with dimensions 19× 19 (361 pixels) each. We assemble 80 images in a 361× 80 matrix and a 19×19×80 tensor, for testing the matrix and tensor approach respectively. 2. AT&T Dataset [18]. This database was used in the context of a face recogni- tion project carried out in collaboration with the Speech, Vision and Robotics Group of the Cambridge University Engineering Department. It consists of ten different grayscale images of each of 40 distinct subjects. The size of each image is 92× 112 pixels, and we work with a set of 80 images randomly chosen from the 400 images in the database. The images are scaled to 30× 38 size in order to reduce the number of variables, and then we assemble a 1140× 80 matrix, and a 30× 38× 80 tensor for the experiments. Criteria of interest To compare the different algorithms, we use the value of the objective function and, as an optimality measure, the value given by ‖proj g(x)‖∞ max{1,‖x‖∞} . (5.2) We know that when this value is zero (‖proj g(x)‖∞ = 0), a stationary point has already been reached. When ‖x‖∞ < 1, (5.2) represents the infinity-norm of the projected gradient; otherwise, when the norm of x is greater than 1, it is the relative infinity-norm of the gradient what is being considered. The algorithm’s cost per iteration varies greatly depending on the number of variables involved in the problem. We iterate for large problems until convergence is achieved or a predetermined time (90 minutes) elapses. Thus, we do not use a maximum number of iterations (we set the corresponding parameter as infinity), and instead control the termination of the algorithm by time (when no optimum solution is found within the period of time). Each table presents the number of 53 iterations used to achieve each solution and a flag indicates whether the optimality criteria were reached (flag = 1), or the maximum allowed time elapsed (flag = 2). The assessment of the solutions’ quality is an open problem in NTF and gener- ally a specific context is required to establish metrics for the success—or failure— of a particular approach to finding a good solution. Two general approaches for evaluating the quality of a computed solution are the congruence [72] and the peak signal-to-noise ratio (PSNR) [58]. The congru- ence is defined as the cosine of the angle between the vectorized tensors, i.e., given two matrices, A and B in IRm×n, that represent digital images, we have congr(A,B) = cos(vec(A),vec(B)) = vec(A)Tvec(B) ‖vec(A)‖‖vec(B)‖ . If the angle between the vectorized images is small, the congruence is approxi- mately equal to unity. On the other hand, the PSNR represents the ratio between the maximum possi- ble power of a signal and the power of corrupting noise that affects the fidelity of its representation. If we define the mean squared error (MSE) of the images A and B as MSE(A,B) = 1 mn m ∑ i=1 n ∑ j=1 ‖Ai j−Bi j‖2, the PSNR is given by the expression PSNR(A,B) = 10 · log10 ( max2A MSE(A,B) ) , where maxA is the maximum possible pixel value of the image A. The image quality assessment is an open problem [79] and we use the struc- tural similarity index (SSIM) [80] measure, which is especially adapted to digital- image datasets. Measurement of solutions’ quality by SSIM The SSIM represents an attempt to quantitatively measure the perceived quality of an image and, because it is constructed so that it takes into account known characteristics of the human visual system, it matches the perceived visual quality 54 on an image better than other commonly used measures, like the mean-squared- error or the peak signal-to-noise ratio. The SSIM uses luminance, contrast, and structure comparison to give an overall similarity measure between two image signals, x,y ∈ IRn, i.e., SSIM(x,y) = f (l(x,y),c(x,y),s(x,y)), where l,c, and s are comparison functions. The simplified form of the SSIM (see the original article for details on the general form) is given by the equation SSIM(x,y) = (2µxµy+κ1)(2σxy+κ2) (µ2x µ2y +κ1)(σ2x +σ2y +κ2) , (5.3) where µx,µy represent the mean intensity of the signals, i.e., µx = 1 n n ∑ i=1 xi; σx,σy, and σxy represent the standard deviations and correlation of the signals, i.e., σx = ( 1 n−1 n ∑ i=1 (xi−µx)2 )1/2 , σxy = 1 n−1 n ∑ i=1 (xi−µx)(yi−µy); the constants κ1,κ2 help to avoid unstable results; typical values are κ1 = 0.01 and κ2 = 0.03 (when κ1 = κ2 = 0 the SSIM becomes the “universal quality index”, UQI [78]). The SSIM index satisfies three important properties: symmetry (SSIM(x,y) = SSIM(y,x)), boundedness (SSIM(x,y)≤ 1), and unique maximum (SSIM(x,y) = 1 if and only if x= y). In the performed experiments, we measure the SSIM between each of the origi- nal images (each column in the matrix approach, and each frontal slice in the tensor case) and its corresponding image in the obtained solution (columns or slices ob- tained by multiplying the computed factors). In the result tables we show the mean of the SSIMs obtained. 55 5.2.2 Performance of the algorithm on the matrix case We find that approximately 67% of the overall time required by our tensor algo- rithm is used during the function evaluations, whereas about 30% is spent on the L-BFGS-B routine. There are two operations that consume most of the time spent on function eval- uations. First, the computation of the product of the factors, which is needed to obtain the residual. The MATLAB Tensor Toolbox [5] that we use in the implementation has a data structure called the Kruskal tensor that allows the user to store a CP decomposition efficiently. It does not store the core tensor G , but only its diagonal and the rest of the factors (matrices A(i)). Unfortunately, every time we evaluate the function we need to convert the Kruskal tensor to a full tensor (i.e., we have to perform the multiplication of the factors) in order to subtract the full tensor V . A significant amount of time (around 40%) is used to evaluate the function and its gradient in this operation. Second, to evaluate the gradient we need to compute the product of the ma- tricized residual times the matrices A(i) , i = 1, . . . ,N (that represent a Khatri-Rao product). The Tensor Toolbox includes the function mttkrp [6] which computes this product efficiently when the tensor involved in the multiplication is sparse or Kruskal. However, as the residual is a dense tensor, we have to form the Khatri-Rao explicitly and compute the product with the matricized residual. Table 5.1 illustrates the performance of tensor version of the algorithm to solve NMF problems (problems in which the input tensor is a second-order tensor). A set of 80 images of size 19× 19 was used to perform the experiment. Note that convergence was not achieved in any of the problems. We see that solving the problems with the tensor algorithm requires, in every case, about twice the time required by the matrix algorithm and does not provide a better solution (the value of the objective function is smaller when using the matrix approach). 56 r Interface f Optim. Iter Time (s) 10 NMF 2.217e+06 0.13 2000 24.76 NTF 2.218e+06 0.20 2000 59.35 40 NMF 2.931e+05 0.49 2000 108.67 NTF 2.966e+05 1.30 2000 205.74 75 NMF 1.798e+04 0.39 2000 148.59 NTF 1.987e+04 0.93 2000 285.21 Table 5.1: CBCL Database; solution obtained solving a NMF problem with the NMF and NTF interfaces. 5.2.3 Effects of regularization strategies In this set of experiments the focus is on the effect of different regularization strate- gies. We use the regularization parameters γ = 10, 10−1, 10−2, and 10−3, and try different scaling strategies, that consist of making equal to one: the `1-matrix norm (this strategy is denoted by “1m”), the `2 norm of the column with the biggest norm (“2m”), the `1 norm of each colum (“1c”), or the `2 norm of each column (“2c”). The different problems come from using the two datasets and, also, from vary- ing the value of the approximation’s rank, r. As we said before, we do not consider r as a parameter to be determined. It is a part of the definition of the problem (which depends entirely on the application and user’s interest), and we try several values in this experimentation stage. In every case we use τ = 10−4 as the tolerance for the norm of the projected gradient and the decrement between successive values of f . Tables 5.2 and 5.4 shows the results of solving the problem with our algorithm (all-at-once approach and regularization techniques described in chapter 4) and matrix approach. Tables 5.3 and 5.5 show the same set of regularization parameters applied to tensor structures. In every case, the time shown is measured in hours. Each table includes only four values of r for comparison purposes. In the matrix case r = 10,20,40,75. In the tensor case we use the values of r that make the storage requirements equal to those of the matrix case. From Tables 5.2 to 5.5 we can see that a clear preeminence of a scaling strategy is absent. In the AT&T dataset, the scaling denoted by 2m seems to lead to a 57 Figure 5.1: Effects of regularization. The first row shows the original images. The following rows show (from second to fifth row) the solution obtained with: (1) no regularization; (2) scaling only; (3) `1 penalty function only; (4) both strategies. smaller value of the objective function, although there is not convergence in most of the cases. This makes it difficult to assess the effectivity of each method’s ability to find a good solution (independent of their iterations’ costs). From the experiments run on the other dataset (CBCL)—where convergence is achieved in most of the problems—we can see that the 2-norm scaling strategies— 2m and 2c—require less iterations than those with 1-norm scaling; these also have a slightly better SSIM average. In Figure 5.2.3 we can see that regularization does improve the images con- structed with the obtained solutions. When regularization is used we can see better definition in the face features. The images correspond to the solution (parts of it; we show only 5 out of 80 images) of the NTF problem using the AT&T dataset, and r = 109; the scaling strategy is 2m, and γ = 10−3; in none of the cases (different parameter settings) convergence is achieved. Additionally, we verify that tensor factorizations have smaller storage require- ments to achieve a good solution. For example, the solution that we see in Table 58 5.5 corresponding to r= 163 in the tensor approach (24287 variables) is better than the solution that we see in Table 5.4 corresponding to r= 40 (48840 variables) with the matrix approach (although convergence is not reached in the tensor case within the allowed time). More examples can be found when Table 5.2 is compared to Table 5.3, and Table 5.4 is compared to Table 5.5. It is clear that when we take approximately the same number of variables (r equivalent in storage) in the tensor and matrix approach, we obatain a better solution using tensors. 5.2.4 Comparison with other approaches We compare the performance of the all-at-once approach against an implemen- tation of the multiplicative update rule (MUR) described by Hazan, Polak and Shashua [38], and the lsNTF interface implemented by Friedlander and Hatz [31, 33], which uses ALS with regularization. In the all-at-once approach, we show three different parameter settings. AAonce- 1 corresponds to the algorithm with no regularization (γ = 0, no scaling); AAonce- 2 and AAonce-3 use 2m-scaling but the values γ = 10 and γ = 10−3, respectively. In the later two cases, all of the factors are controlled by the regularization function and scaled every 10 iterations (the frequency of the scaling is an arbitrary choice). As in the previous experiments, we do not set a maximum number of iterations. Instead, we allow a maximum computing time of 90 minutes. The value of the tolerance is τ = 10−5. Optimality, SSIM, and f lag parameters are as described before. Results are shown in Tables 5.6 and 5.7. The multiplicative update rule seems to incur the least computational effort per iteration as it has the largest number of iterations within the allowed time. However, it does not give satisfactory results in terms of optimality and does not achieve convergence in any case (with the given tolerance value). The lsNTF algorithm gives the worst results in terms of the value of the objec- tive function and, in all the cases, seems to stop at a saddle point (the norm of the projected gradient is nearly zero). In the column corresponding to the number of iterations, we observe small numbers, but it is important to recall that those are the main iterations, and that each of them involves solving a linear problem for every 59 factor, which is done by calling the software BCLS [32]. In other words, each of the main iterations implies a number of inner iterations that can be large. In all the cases the best results are obtained with the all-at-once approach, which gives a smaller value of f and requires a relatively small number of iter- ations. The cost per iteration, however, is high; each function evaluation requires O(r ·N ·∏Ni=1 Ii) operations, since each involves a multi-mode (Kruskal) product (O(r ·N ·∏Ni=1 Ii) operations) and N Khatri-Rao products (each of them requires O(r∏Ni=1 Ii) operations). 60 Reg. Param. Computed Solution γ Sca ‖R‖F Optim. mean SSIM Iter Time Exit r = 3, (1326 variables) 0e+00 – 3.262e+03 4.398e-03 7.091e-01 712 7.8e-04 1 0e+00 1m 3.262e+03 3.508e-05 7.091e-01 900 1.2e-03 1 0e+00 2m 3.263e+03 2.463e-04 7.091e-01 300 4.1e-04 1 0e+00 1c 3.262e+03 2.664e-05 7.091e-01 976 1.3e-03 1 0e+00 2c 3.263e+03 1.287e-04 7.091e-01 393 7.0e-04 1 1e+01 – 3.262e+03 1.770e-03 7.090e-01 1388 1.4e-03 1 1e-01 – 3.262e+03 9.747e-04 7.091e-01 1724 1.7e-03 1 1e-02 – 3.262e+03 2.106e-03 7.091e-01 1448 1.4e-03 1 1e-03 – 3.262e+03 7.962e-04 7.091e-01 768 7.9e-04 1 r = 7, (3094 variables) 0e+00 – 2.467e+03 1.148e-03 8.336e-01 1948 3.5e-03 1 0e+00 1m 2.467e+03 2.555e-05 8.336e-01 5854 1.3e-02 1 0e+00 2m 2.467e+03 2.103e-04 8.336e-01 2458 6.0e-03 1 0e+00 1c 2.467e+03 2.310e-05 8.336e-01 5611 1.6e-02 1 0e+00 2c 2.467e+03 3.541e-04 8.336e-01 2594 7.7e-03 1 1e+01 – 2.467e+03 1.259e-03 8.335e-01 2069 5.1e-03 1 1e-01 – 2.467e+03 7.983e-04 8.336e-01 3767 8.4e-03 1 1e-02 – 2.467e+03 1.519e-03 8.336e-01 5646 1.2e-02 1 1e-03 – 2.467e+03 1.104e-03 8.336e-01 2219 4.2e-03 1 r = 13, (5746 variables) 0e+00 – 1.843e+03 5.719e-04 9.139e-01 5789 2.0e-02 1 0e+00 1m 1.846e+03 4.950e-05 9.137e-01 3592 1.5e-02 1 0e+00 2m 1.845e+03 4.276e-04 9.155e-01 3046 1.3e-02 1 0e+00 1c 1.843e+03 1.904e-05 9.139e-01 10789 4.4e-02 1 0e+00 2c 1.845e+03 1.105e-04 9.135e-01 6264 2.5e-02 1 1e+01 – 1.844e+03 7.175e-04 9.139e-01 4496 1.4e-02 1 1e-01 – 1.843e+03 2.745e-04 9.139e-01 9490 3.2e-02 1 Table 5.2: CBCL Dataset, matrix approach (The table continues on the next page.) 61 1e-02 – 1.843e+03 4.817e-04 9.139e-01 9343 3.0e-02 1 1e-03 – 1.843e+03 7.477e-04 9.139e-01 7335 2.3e-02 1 r = 25, (11050 variables) 0e+00 – 1.196e+03 6.775e-05 9.623e-01 12660 7.6e-02 1 0e+00 1m 1.196e+03 1.680e-05 9.621e-01 44574 3.3e-01 1 0e+00 2m 1.198e+03 3.286e-04 9.623e-01 12107 7.8e-02 1 0e+00 1c 1.196e+03 3.433e-05 9.623e-01 28813 1.8e-01 1 0e+00 2c 1.195e+03 3.880e-04 9.621e-01 12808 8.0e-02 1 1e+01 – 1.198e+03 6.515e-04 9.622e-01 22803 1.2e-01 1 1e-01 – 1.196e+03 4.040e-04 9.623e-01 18970 1.4e-01 1 1e-02 – 1.196e+03 1.733e-04 9.623e-01 27555 2.1e-01 1 1e-03 – 1.196e+03 1.932e-04 9.623e-01 21916 1.3e-01 1 Table 5.2: CBCL Dataset, matrix approach. The optimality value represents the (relative) infinity-norm of the gradient. A SSIM close to 1 indicates good quality of the solution. Time is measured in hours. An exit flag equal to 1 indicates that a convergence criterion was satisfied. Reg. Param. Computed Solution γ Sca ‖R‖F Optim. mean SSIM Iter Time Exit r = 12, (1428 variables) 0e+00 – 2.595e+03 1.166e-02 7.767e-01 1591 1.2e-02 1 0e+00 1m 2.596e+03 9.977e-05 7.800e-01 1740 1.7e-02 1 0e+00 2m 2.602e+03 2.708e-04 7.830e-01 1043 1.1e-02 1 0e+00 1c 2.617e+03 4.162e-05 7.804e-01 1566 1.6e-02 1 0e+00 2c 2.605e+03 1.785e-04 7.796e-01 1111 1.1e-02 1 1e+01 – 2.596e+03 1.489e+01 7.766e-01 1107 1.5e+00 2 1e-01 – 2.595e+03 5.400e-03 7.767e-01 5558 1.1e+00 1 1e-02 – 2.595e+03 2.598e-02 7.767e-01 3839 2.2e-02 1 Table 5.3: CBCL Dataset, tensor approach (the table continues on the next page.) 62 1e-03 – 2.595e+03 1.839e-02 7.767e-01 1545 1.9e-02 1 r = 24, (2856 variables) 0e+00 – 2.021e+03 1.218e-02 8.715e-01 4734 5.4e-02 1 0e+00 1m 2.038e+03 2.060e-05 8.723e-01 7859 1.5e-01 1 0e+00 2m 2.036e+03 2.234e-05 8.779e-01 2840 5.3e-02 1 0e+00 1c 2.037e+03 6.278e-05 8.750e-01 3971 6.8e-02 1 0e+00 2c 2.027e+03 2.097e-04 8.749e-01 3238 5.5e-02 1 1e+01 – 2.046e+03 1.056e-02 8.726e-01 4907 4.3e-02 1 1e-01 – 2.046e+03 9.773e-03 8.726e-01 5844 5.1e-02 1 1e-02 – 2.046e+03 6.493e-03 8.726e-01 7068 6.2e-02 1 1e-03 – 2.048e+03 1.254e-02 8.739e-01 2678 2.3e-02 1 r = 49, (5831 variables) 0e+00 – 1.466e+03 2.015e-03 9.352e-01 15552 3.1e-01 1 0e+00 1m 1.590e+03 1.414e-04 9.242e-01 10047 3.3e-01 1 0e+00 2m 1.571e+03 1.618e-03 9.261e-01 10892 3.7e-01 1 0e+00 1c 1.578e+03 7.276e-06 9.253e-01 15978 5.7e-01 1 0e+00 2c 1.571e+03 2.711e-04 9.249e-01 8618 3.3e-01 1 1e+01 – 1.466e+03 4.410e-03 9.325e-01 20390 3.3e-01 1 1e-01 – 1.472e+03 1.141e-02 9.352e-01 22951 1.2e+00 1 1e-02 – 1.458e+03 2.604e-03 9.337e-01 25416 4.0e-01 1 1e-03 – 1.462e+03 5.359e-03 9.321e-01 11459 1.8e-01 1 r = 92, (10948 variables) 0e+00 – 9.480e+02 3.614e-03 9.702e-01 30290 9.7e-01 1 0e+00 1m 1.171e+03 3.376e+03 9.534e-01 26954 1.5e+00 2 0e+00 2m 1.168e+03 4.413e-01 9.556e-01 25103 1.4e+00 1 0e+00 1c 1.159e+03 1.020e+02 9.565e-01 27036 1.5e+00 2 0e+00 2c 1.161e+03 6.786e-01 9.580e-01 13057 7.2e-01 1 1e+01 – 9.493e+02 9.574e-01 9.707e-01 23699 1.5e+00 2 1e-01 – 2.205e+03 3.010e+02 8.503e-01 122 1.5e+00 2 1e-02 – 1.316e+03 5.096e+01 9.451e-01 542 1.5e+00 2 Table 5.3: CBCL Dataset, tensor approach (the table continues on the next page.) 63 1e-03 – 9.796e+02 2.034e+00 9.687e-01 4886 1.5e+00 2 Table 5.3: CBCL Dataset, tensor approach. The optimality value represents the (relative) infinity-norm of the gradient. A SSIM close to 1 indicates good quality of the solution. Time is measured in hours. An exit flag equal to 1 indicates that a convergence criterion was satisfied. Reg. Param. Computed Solution γ Sca ‖R‖F Optim. mean SSIM Iter Time Exit r = 10, (12210 variables) 0e+00 – 6.246e+03 1.613e-03 6.525e-01 1820 1.3e-02 1 0e+00 1m 3.725e+04 7.220e-05 6.525e-01 6706 5.9e-02 1 0e+00 2m 3.725e+04 1.559e-03 6.525e-01 2548 2.2e-02 1 0e+00 1c 3.725e+04 1.356e-04 6.525e-01 5901 5.5e-02 1 0e+00 2c 3.725e+04 9.131e-04 6.525e-01 2641 2.6e-02 1 1e+01 – 6.246e+03 8.400e-03 6.525e-01 3239 2.3e-02 1 1e-01 – 6.246e+03 1.556e-03 6.525e-01 6429 4.7e-02 1 1e-02 – 6.246e+03 2.527e-03 6.525e-01 3019 2.2e-02 1 1e-03 – 6.246e+03 7.313e-04 6.525e-01 1915 1.4e-02 1 r = 20, (24420 variables) 0e+00 – 4.748e+03 2.303e-03 7.779e-01 8324 1.2e-01 1 0e+00 1m 3.725e+04 2.910e-04 7.776e-01 15319 2.6e-01 1 0e+00 2m 3.725e+04 7.322e-04 7.756e-01 4176 7.1e-02 1 0e+00 1c 3.725e+04 6.577e-05 7.758e-01 13480 2.2e-01 1 0e+00 2c 3.724e+04 2.404e-04 7.755e-01 6873 1.0e-01 1 1e+01 – 4.734e+03 5.935e-04 7.747e-01 12307 1.6e-01 1 1e-01 – 4.748e+03 1.628e-03 7.779e-01 17029 2.2e-01 1 1e-02 – 4.746e+03 3.161e-03 7.754e-01 12157 1.6e-01 1 1e-03 – 4.748e+03 8.558e-04 7.779e-01 10312 1.3e-01 1 Table 5.4: AT&T Dataset, matrix approach (the table continues on the next page.) 64 r = 40, (48840 variables) 0e+00 – 2.950e+03 8.370e-04 9.038e-01 30177 6.7e-01 1 0e+00 1m 3.725e+04 7.462e-03 9.033e-01 34993 1.0e+00 2 0e+00 2m 3.724e+04 2.257e-04 9.036e-01 23870 6.8e-01 1 0e+00 1c 3.725e+04 4.222e-02 9.029e-01 34957 1.0e+00 2 0e+00 2c 3.724e+04 4.183e-04 9.036e-01 21314 6.1e-01 1 1e+01 – 2.956e+03 4.308e-02 9.040e-01 39595 1.0e+00 2 1e-01 – 2.964e+03 5.629e-03 9.031e-01 39688 1.0e+00 2 1e-02 – 2.949e+03 5.231e-03 9.045e-01 39824 1.0e+00 2 1e-03 – 2.966e+03 5.707e-04 9.023e-01 32917 8.2e-01 1 r = 75, (91575 variables) 0e+00 – 5.969e+02 6.860e-02 9.956e-01 23273 1.0e+00 2 0e+00 1m 3.725e+04 4.613e-02 9.922e-01 18048 1.0e+00 2 0e+00 2m 3.724e+04 2.757e-01 9.953e-01 17644 1.0e+00 2 0e+00 1c 3.725e+04 5.022e-02 9.921e-01 17989 1.0e+00 2 0e+00 2c 3.724e+04 6.077e-01 9.954e-01 17670 1.0e+00 2 1e+01 – 6.463e+02 3.397e-02 9.949e-01 20744 1.0e+00 2 1e-01 – 5.996e+02 7.546e-02 9.955e-01 20789 1.0e+00 2 1e-02 – 5.974e+02 9.318e-02 9.956e-01 20786 1.0e+00 2 1e-03 – 5.992e+02 1.006e-01 9.955e-01 20791 1.0e+00 2 Table 5.4: AT&T Dataset, matrix approach. The optimality value represents the (relative) infinity-norm of the gradient. A SSIM close to 1 indicates good quality of the solution. Time is measured in hours. An exit flag equal to 1 indicates that a convergence criterion was satisfied. 65 Reg. Param. Computed Solution γ Sca ‖R‖F Optim. mean SSIM Iter Time Exit r = 81, (12069 variables) 0e+00 – 3.940e+03 1.484e-02 8.277e-01 10704 9.7e-01 1 0e+00 1m 1.104e+04 2.936e+02 3.064e-01 33 9.6e-03 1 0e+00 2m 4.198e+03 1.751e+03 8.040e-01 6929 1.5e+00 2 0e+00 1c 8.201e+03 7.158e+01 4.583e-01 4037 8.8e-01 1 0e+00 2c 4.458e+03 3.464e+03 7.877e-01 7052 1.5e+00 2 1e+01 – 3.935e+03 5.618e-03 8.262e-01 10797 1.0e+00 1 1e-01 – 3.924e+03 5.148e-02 8.262e-01 17818 1.5e+00 2 1e-02 – 3.937e+03 1.050e-01 8.260e-01 16773 1.5e+00 2 1e-03 – 3.919e+03 2.098e-02 8.260e-01 15378 1.3e+00 1 r = 163, (24287 variables) 0e+00 – 2.759e+03 5.681e+00 9.129e-01 10114 1.5e+00 2 0e+00 1m 9.445e+03 4.982e-06 3.971e-01 333 1.3e-01 1 0e+00 2m 3.041e+03 1.089e+00 8.953e-01 3980 1.5e+00 2 0e+00 1c 6.835e+03 3.322e-02 5.712e-01 3908 1.5e+00 2 0e+00 2c 3.211e+03 5.639e+00 8.874e-01 3755 1.5e+00 2 1e+01 – 2.772e+03 3.126e+00 9.142e-01 8673 1.5e+00 2 1e-01 – 2.768e+03 4.187e+00 9.132e-01 9952 1.5e+00 2 1e-02 – 2.761e+03 2.788e+00 9.131e-01 10144 1.5e+00 2 1e-03 – 2.761e+03 3.530e+00 9.115e-01 10109 1.5e+00 2 r = 327, (48723 variables) 0e+00 – 1.780e+03 1.047e+01 9.634e-01 5443 1.5e+00 2 0e+00 1m 1.104e+04 8.933e-07 3.064e-01 31 4.2e-02 1 0e+00 2m 2.334e+03 9.623e-01 9.402e-01 2030 1.5e+00 2 0e+00 1c 7.198e+03 1.541e-01 5.529e-01 2089 1.5e+00 2 0e+00 2c 2.479e+03 4.086e+01 9.319e-01 2027 1.5e+00 2 1e+01 – 1.764e+03 5.278e+00 9.634e-01 4951 1.5e+00 2 1e-01 – 1.796e+03 7.833e+00 9.628e-01 5099 1.5e+00 2 Table 5.5: AT&T Dataset, tensor approach (the table continues on the next page.) 66 1e-02 – 1.801e+03 7.253e+00 9.626e-01 5009 1.5e+00 2 1e-03 – 1.798e+03 7.546e+00 9.628e-01 5119 1.5e+00 2 r = 614, (91486 variables) 0e+00 – 1.206e+03 1.767e+01 9.832e-01 3180 1.5e+00 2 0e+00 1m 6.786e+03 3.828e-02 5.792e-01 1148 1.5e+00 2 0e+00 2m 1.916e+03 9.495e-01 9.600e-01 1131 1.5e+00 2 0e+00 1c 6.554e+03 8.340e-02 6.154e-01 1110 1.5e+00 2 0e+00 2c 2.326e+03 1.378e+01 9.411e-01 1100 1.5e+00 2 1e+01 – 1.219e+03 2.550e+01 9.829e-01 2432 1.5e+00 2 1e-01 – 1.238e+03 6.015e+01 9.824e-01 2879 1.5e+00 2 1e-02 – 1.240e+03 1.111e+01 9.823e-01 2883 1.5e+00 2 1e-03 – 1.239e+03 1.167e+01 9.823e-01 2866 1.5e+00 2 Table 5.5: AT&T Dataset, tensor approach. The optimality value represents the (relative) infinity-norm of the gradient. A SSIM close to 1 indicates good quality of the solution. Time is measured in hours. An exit flag equal to 1 indicates that a convergence criterion was satisfied. Algorithm ‖R‖F Optim. mean SSIM Iter Time Exit r = 12, (1428 variables) MUR 3.370e+06 3.242e+02 7.977e-01 95468 1.50 2 lsNTF 2.206e+08 0.000e+00 8.432e-05 192 1.40 1 AAonce-1 2.595e+03 1.166e-02 7.767e-01 1591 0.01 1 AAonce-2 2.602e+03 2.750e-02 7.805e-01 840 0.01 1 AAonce-3 2.602e+03 7.128e-05 7.809e-01 1030 0.01 1 r = 24, (2856 variables) MUR 2.088e+06 2.779e+02 8.754e-01 78533 1.50 2 lsNTF 2.206e+08 0.000e+00 8.431e-05 169 1.44 1 AAonce-1 2.021e+03 1.218e-02 8.715e-01 4734 0.05 1 Table 5.6: CBCL Dataset, comparison with other algorithms (the table con- tinues on the next page.) 67 AAonce-2 2.033e+03 5.234e-03 8.808e-01 3240 0.06 1 AAonce-3 2.036e+03 1.148e-05 8.779e-01 3117 0.06 1 r = 49, (5831 variables) MUR 1.127e+06 1.036e+02 9.305e-01 61422 1.50 2 lsNTF 2.206e+08 0.000e+00 8.432e-05 120 1.44 1 AAonce-1 1.466e+03 2.015e-03 9.352e-01 15552 0.31 1 AAonce-2 1.579e+03 1.555e-02 9.269e-01 4250 0.18 1 AAonce-3 1.571e+03 9.539e-06 9.261e-01 11840 0.57 1 r = 92, (10948 variables) MUR 5.057e+05 5.910e+01 9.671e-01 40328 1.50 2 lsNTF 2.206e+08 0.000e+00 8.433e-05 73 1.39 1 AAonce-1 9.480e+02 3.614e-03 9.702e-01 30290 0.97 1 AAonce-2 1.170e+03 1.937e-02 9.558e-01 5760 0.39 1 AAonce-3 1.164e+03 1.084e-03 9.562e-01 22311 1.50 2 Table 5.6: CBCL Dataset, comparison with other algorithms. The optimal- ity value represents the (relative) infinity-norm of the gradient. A SSIM close to 1 indicates good quality of the solution. Time is measured in hours. An exit flag equal to 1 indicates that a convergence criterion was satisfied. Algorithm ‖R‖F Optim. mean SSIM Iter Time Exit r = 27, (4023 variables) MUR 1.687e+07 1.445e+03 6.719e-01 29243 1.50 2 lsNTF 6.939e+08 0.000e+00 6.428e-05 47 1.30 1 AAonce-1 5.790e+03 2.052e-02 6.765e-01 2968 0.08 1 AAonce-2 5.787e+03 3.324e-01 6.704e-01 1874 1.50 2 AAonce-3 5.797e+03 2.462e-01 6.769e-01 1455 1.50 2 r = 54, (8046 variables) Table 5.7: AT&T Dataset, comparison with other algorithms (the table con- tinues on the next page.) 68 MUR 1.094e+07 7.560e+02 7.662e-01 21896 1.50 2 lsNTF 6.938e+08 0.000e+00 6.428e-05 42 1.47 1 AAonce-1 4.618e+03 6.344e-03 7.727e-01 9347 0.59 1 AAonce-2 5.033e+03 4.713e-01 7.359e-01 2162 1.50 2 AAonce-3 5.550e+03 5.225e+00 6.965e-01 471 1.50 2 r = 109, (16241 variables) MUR 6.391e+06 3.234e+02 8.603e-01 14641 1.50 2 lsNTF 6.938e+08 0.000e+00 6.432e-05 27 1.53 2 AAonce-1 3.422e+03 6.932e+00 8.670e-01 15213 1.50 2 AAonce-2 4.904e+03 2.376e+01 7.592e-01 323 1.51 2 AAonce-3 5.520e+03 1.205e+01 7.122e-01 195 1.51 2 r = 204, (30396 variables) MUR 3.262e+06 2.006e+02 9.269e-01 8229 1.50 2 lsNTF 6.938e+08 0.000e+00 6.438e-05 18 1.52 2 AAonce-1 2.428e+03 5.869e+00 9.316e-01 7892 1.50 2 AAonce-2 7.371e+03 6.711e+01 5.640e-01 49 1.50 2 AAonce-3 6.550e+03 4.807e+01 6.270e-01 76 1.51 2 Table 5.7: AT&T Dataset, comparison with other algorithms. The optimal- ity value represents the (relative) infinity-norm of the gradient. A SSIM close to 1 indicates good quality of the solution. Time is measured in hours. An exit flag equal to 1 indicates that a convergence criterion was satisfied. 69 Chapter 6 An application to image processing One of the main motivations for nonnegative factorizations is its potential use as a tool to extract information from a dataset. In this chapter we illustrate this aspect of nonnegative factorizations through some examples. We then describe the digital-inpainting problem and explain the way in which our algorithm can be used to help the restoration of digital images that have been damaged by loss of information. 6.1 Image basis through nonnegative factorizations In the NMF problem we can interpret each column ofW ∈ IRm×r as a basis element, and the matrix H ∈ IRn×r as the set of coefficients that can be used to approximate the original set (V ∈ IRm×n) with elements in the basisW . When we use r < n, we expect the basis W to contain the most significant information in V , so that the value of ‖V −WHT‖ can be minimized. When V represents an image dataset (each image is represented by a column of V ), we expect the columns of W to represent images that condense the most important features of the original set. Figure 6.1a shows 49 images that represent part of a set formed by 200 images taken from the CBCL image database (described in section 5.2.1), plus two outlier 70 (a) Database sample (b) Nonnegative Basis Figure 6.1: Experiment using CBCL face database with outliers. The basis was obtained via NMF. images—Frankenstein and a dog (shown in the bottom-right corner of the set). By applying the matrix version of the all-at-once algorithm, using r = 15, we get the basis images shown in Figure 6.1b. The inclusion of the outliers makes it easier to appreciate the way in which the basis images combine and resemble the most distinctive characteristics of the elements in the set. The images that have been highlighted in Figure 6.1b (first row of the first column, and third row of the third colum), while not reproductions of the outlier images, clearly merge some of the outliers’ characteristics (e.g, face and nose shape) with other common attributes of the set (e.g, in the basis Franke- instein’s face seems to smile). Note that for the matrix case, our algorithm works with an approximation of the type V ≈WGHT , where G is a diagonal matrix that has been included to maintain the analogy with the tensor case. We interpret G as a matrix that contains the scale of the elements inW , and, in the image-processing context, consider as basis images the columns of the productWG. To obtain a set of base images in the tensor case, suppose that the tensor V ∈ 71 IRm1×m2×m3 contains each of the original images as a frontal slice, and that V = [[G ;A(1),A(2),A(3)]]. By applying matricization, which is described in section 2.2.2, (see also Ap- pendix B for a summary of matrix-operators results) we can unfold the tensor with respect to its third coordinate and—taking into account that G is diagonal and ma- trices A(i) have the same number of columns—obtain the expression V = (A(1)A(2))DA(3)T , (6.1) where D is a diagonal matrix such that Giii = Dii with i ∈ {1, . . . ,r} (as used in section 2.3), and V is the matrix that we use in the matrix approach, i.e., a matrix whose columns represent the vectorized images of the dataset (using column-wise concatenation). From (6.1) it is clear that we can consider the product (A(1)A(2))D—which is a m1m2× r matrix—as the matrix that contains the basis images, and A(3) as the matrix that represents the coefficients used to construct each column ofV , i.e., A(3) is the analogous of H. Since (A(1)A(2))D= [D11(a(1)1 ⊗a(2)1 ), · · · ,Drr(a(1)r ⊗a(2)r )], and D j j(a (1) j a(1)j ) = D j jvec(a(1)j a(2)Tj ), j = 1, . . . ,r, we see that the ith image in the basis is given by a rank-1 matrix scaled by the ith diagonal element of G . In fact, the outer product of the columns of A(1) and A(2) (scaled by the diagonal elements of G ), can be computed asB= G ×1A(1)×2A(2). In this way, the ith frontal slice ofB represents ith basis image. A second example that shows the use of nonnegative factorizations as a way to extract relevant features in a set is obtained by considering a set formed by 200 images, each with size 28×28, that represent hand-written digits. We assembled a matrix V ∈ IR784×200 and a tensor, V ∈ IR28×28×200, with the images and choose r = 10 in the matrix case (9840 variables), and r = 20 in the tensor case (5120 variables). Basis images obtained through NMF are shown in the 72 (a) Database sample (b) NMF basis. (c) NTF basis. Figure 6.2: Experiment using hand-written digits . ‖R‖ Optim. Min Mean Iter Time Storage SSIM SSIM (s) Basis Total NMF 1.786e+08 6.44e-02 0.091 0.445 791 24.17 7840 9840 NTF 1.566e+08 3.23e-01 0.249 0.521 752 158.16 1140 5120 Table 6.1: Experiment with hand-written numbers. The SSIM of two images is equal to 1 if and only if the images are identical; thus, in this table a SSIM value close to 1 indicates a good quality of the solution. The last two columns indicate the number of variables (elements in the factors) used in each approach. Figure 6.2b while basis images obtained by NTF are shown in 6.2b. Results are shown in Table 6.1. The basis images in NTF tend to be sparser, which represents a sharper de- composition into part components. Despite the fact that fewer variables are used in the tensor case (we have a higher compression rate: the NMF basis has 7840 73 elements, whereas the NTF basis has only 1140), we obtain higher fidelity in the reconstruction of the original images than we do using the matrix approach. 6.2 Inpainting problem By inpainting we understand the restoration of missing or damaged parts on an image; digital inpainting refers to the case in which the images are digital and the restoration is made with computer support. The work by Bertalmio, et al. [12] is acknowledged as a pioneering work in digital inpainting. They use a third-order PDE-based model and show unified applications in the movie industry, video images, and art restoration. Since their work was published, other approaches have been used to solve the digital inpaint- ing problem, e.g., models based on the Bayesian rationale [67], and expectation- minimization (EM) algorithms [28]. We explore the use of NTF basis images in the inpainting problem, using a basic mathematical model of the problem. Suppose that b ∈ IRm represents a signal (an image, in our context) that can be expressed in terms of a basis A ∈ IRm×n using the coefficients contained in the vector x ∈ IRn, as represented by the equation Ax= b, (6.2) It is assumed that x is sparse. If D ∈ IRp×m, with p< m, represents a lossy operator and we apply it to (6.2), we get DAx= Db. (6.3) Inpainting is to recover b from Db, which is an inverse problem. Let us denote the product DA by Â, and Db by b̂. Under the assumption that x is a sparse vector, the coefficients of x can be recovered by solving the regularized least-squares problem given by minimize λ ‖Âx− b̂‖22+λ‖x‖1, (6.4) 74 where λ is a regularization parameter that lies in the interval [0,‖ATb‖∞]. Once we obtain the vector x, we can recover the original signal b by simply multiplying the matrix A times x. We believe that, given an specific set, it is possible to construct a basis through NTF that can later be used to restore elements that have lost some information. The basis elements obtained through NTF—that tend to have sparse compo- nents—can be interpreted as a dictionary, i.e., a set of atom images that can be combined to obtain other images (not only the ones contained in the original set V ). So, we can use NTF to obtain the matrix A that appears in (6.2). 6.3 Numerical experiments We designed an experiment to empirically test the use of NTF basis in the inpaint- ing problem. We employ a set of 140 images taken from the AT&T image dataset, which were scaled to a 40×50 size, and assemble a 40×50×140 tensor. We compute an NTF factorization, with r = 593 (the equivalent in storage to having r = 70 in the matrix case), using scaling every 5 iterations, γ = 10−3 in the regularization function, and τ = 10−4 as tolerance value. The problem has 136983 variables and we computed the solution over 72 hours. In this time a total of 20593 iterations were performed. We did not obtain convergence, but the final optimality value was 1.76×10−3. We then computed the basis B = G ×1 A(1)×2 A(2), and discard A(3). Note that, although B is a 40× 50× 531 tensor, what we store are the factors G , A(1), and A(2), which have a total of 53963 elements. As we are using these factors to reconstruct the whole set (the tensor V , which has 280000 elements), we are working with only 19.27 % of the original information. As a first test, we selected an image that had been originally included in V , and removed 80% of its pixel components (randomly chosen). We did not explicitely form the operator D that appears in (6.3). We solved (6.4) for 30 different (logarithmically spaced) values of λ , and chose the best solution; Figure 6.3 shows ten of the “recovered” images obtained as we varied the value of λ . Figure 6.4 shows the original image (Figure 6.4a), the im- age obtained through the NTF basis ((Figure 6.4b)), the “noisy” image—that has 75 Figure 6.3: In-painting problem, test 1: Recovered images for different val- ues of the regularization parameter λ . only 20% of the original information—(Figure 6.4c), and the best of the recovered images, obtained with λ = 10−3 (Figure 6.4d). As a second test, we chose as b an image that was not in the original set V , and applied the same procedure that was used in the previous test. Results are shown in Figure 6.5. The person whose image is used in the second test belongs to the group of subjects that were photographed to form the AT&T dataset, but the set of 140 images that we use to generate the basis included only one image of that person. In it he wears glasses and has a different head position and facial expression (the image is shown in the second row of the fifth column of Figure 6.6). Thus, it is remarkable that the basis obtained by NTF approximates well the original image. As a third and final test, we decided to remove a specific area of an image that was not in V . The removed data represents 12.65% of the information in the original image. Figure 6.7 shows the results of the experiment, The reconstructed image does not look like a natural face but the basis does complete the part of the face that is missing. There are studies that focus on the correlation between different face regions [39] and we believe that if a basis is specifically constructed to solve the problem of inpainting of face parts (by con- 76 (a) (b) (c) (d) Figure 6.4: Inpainting problem, test 1: (a) original image; (b) image con- structed using a basis obtained by NTF; (c) image after information loss; (d) image recovered with the best λ . structing a set V with images that include representation of the distinct possible face features), better results can be obtained. 77 (a) (b) (c) (d) Figure 6.5: Inpainting problem, test 2: (a) original image; (b) image con- structed using a basis obtained by NTF; (c) image after information loss; (d) image recovered with the best λ . Figure 6.6: Subjects photographed to form the AT&T image database. 78 (a) (b) (c) (d) Figure 6.7: Inpainting problem, test 3: (a) original image; (b) image con- structed using a basis obtained by NTF; (c) image after information loss; (d) image recovered with the best λ . 79 Bibliography [1] Glossary of the document “A Space and Solar Physics Data Model from the SPASE Consortium ”, Version 1.2.1, available at http://www.spase-group.org/data/doc/spase-1 2 1.pdf (accessed May 2, 2008). → pages 2 [2] E. Alpaydin. Introduction to Machine Learning. The MIT Press, 2004. → pages 7 [3] C. A. Andersson and R. Bro. Improving the speed of multiway algorithms. part I: Tucker3. Chemometrics and Intelligent Laboratory Systems, 42: 93–103, 1998. → pages 9 [4] B. W. Bader and T. G. Kolda. MATLAB tensor classes for fast algorithm prototyping. ACM Transactions on Mathematical Software, 32(4), 2006. → pages 50 [5] B. W. Bader and T. G. Kolda. MATLAB tensor toolbox version 2.2. http://csmr.ca.sandia.gov/∼tgkolda/TensorToolbox/, January 2007. → pages 50, 56 [6] B. W. Bader and T. G. Kolda. Efficient MATLAB computations with sparse and factored tensors. Technical Report SAND2006-7592, Sandia National Laboratories, Albuquerque, NM and Livermore, CA, 2006. → pages 50, 56 [7] B. W. Bader, M. W. Berry, and M. Browne. Discussion tracking in Enron email using PARAFAC. Conference Paper, Text Mining Workshop 2007: Workshop at the SIAM Conf. on Data Mining, 2007. → pages 13 [8] B. W. Bader, R. Harshman, and T. G. Kolda. Temporal analysis of semantinc graphs using ASALSAN. In Proceedings of the 7th IEEE International Conference on Data Mining, 2007. → pages 13, 27, 34 80 [9] D. P. Berrar, W. Dubitzky, and M. Granzow, editors. A Practical Approach to Microarray Data Analysis. Springer, 2002. → pages 7 [10] M. Berry, M. Browne, A. Langville, V. Pauca, and R. Plemmons. Algorithms and applications for approximate nonnegative matrix factorizations. Computational Statistics and Data Analysis, 52(1):155–173, September 2007. → pages 32, 34, 35, 51 [11] M. W. Berry and M. Browne. Email surveillance using nonnegative matrix factorization. Computational and Mathematical Organization Theory, 11: 249–264, 2005. → pages 5 [12] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In SIGGRAPH 2000, volume 34, pages 417–424, 2000. → pages 74 [13] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd. edition, 1999. → pages 35, 36, 41 [14] R. Bro and S. D. Jong. A fast non-negativity-constrained least squares algorithm. Journal of Chemometrics, 11:393–401, 1997. → pages 1, 11, 37 [15] T. Brox, J. Weickert, B. Burgeth, and P. Mrázek. Nonlinear structure tensors. Image and vision computing, 24(1):41–55, 2006. → pages 10 [16] J. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov. Meatagenes and molecular pattern discovery using matrix factorization. volume 101, pages 4164–4169, 2004. → pages 5 [17] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16 (5):1190–1208, 1995. → pages 45, 47 [18] A. L. Cambridge. At&t database of faces. http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html, 1994. → pages 53 [19] M. Chu, F. Diele, R. Plemmons, and S. Ragni. Optimality, computation and interpretation of nonnegative matrix factorizations. Available at citeseer.ist.psu.edu/758183.html. → pages 34 [20] A. Cichocki, R. Zdunek, and S. Amari. Independent Component Analysis and Signal Separation, volume 4666/2007 of Lecture notes in Computer Science, chapter Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization, pages 169–176. Springer Berlin/Heidelberg, 2007. → pages 12 81 [21] V. De Silva and L. Lim. Tensor rank and the ill-posedness of the best low-rank approximation problem. Technical Report 06-06, Stanford University, 2006. → pages 30 [22] V. De Silva and L. H. Lim. Tensor rank and ill-posedness of the best low-rank approximation problem. SIAM Journal on Matrix Analysis and Applications, 2006. To appear. → pages 31 [23] I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with bregman divergences. In Proceeding of the Neural Information Processing Systems (NIPS) Conference, 2005. → pages 4 [24] C. Ding, X. He, H. D. Simon, and R. Jin. On the equivalence of nonnegative matrix factorization and K-means – spectral clustering. In Proceedings of the 5th SIAM International Conference on Data Mining, 2005. → pages 5 [25] L. Eldén. Matrix Methods in Data Mining and Pattern Recognition. SIAM, 2007. → pages 5 [26] N. M. Faber, R. Bro, and P. K. Hopke. Recent developments in CANDECOMP/ PARAFAC algorithms: A critical review. Chemometrics and intelligent laboratory systems, 65(1):119–137, 2003. → pages 26 [27] P. L. Fackler. Notes on matrix calculus. North Carolina State University, September 2005. → pages 36 [28] M. J. Fadili and J. Starck. Em algorithm for sparse representation-based image inpainting. IEEE, 2005. → pages 74 [29] D. FitzGerald, M. Cranitch, and E. Coyle. Nonnegative tensor factorization for sound source separation. Proceedings of the Irish signals and systems conference, 2005. → pages 12 [30] P. Fogel, S. S. Young, D. M. Hawkins, and N. Ledirac. Inferential, robust non-negative matrix factorization analysis of microarray data. Bioinformatics, 23(1):44–49, 2007. → pages 5 [31] M. Friedlander and K. Hatz. MATLAB software package lsNTF–1.0. http://www.cs.ubc.ca/∼mpf/?n=Papers.Lsntf, 2006. → pages 59 [32] M. P. Friedlander. BCLS: A large-scale solver for bound-constrained least squares. Available at http://www.cs.ubc.ca/∼mpf/bcls/, 2006. → pages 36, 60 82 [33] M. P. Friedlander and K. Hatz. Computing nonnegative tensor factorizations. Technical Report TR-2006-21, University of British Columbia, Dept. of Computer Science, October 2006. → pages 12, 13, 27, 37, 59 [34] Y. Gao and G. Church. Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics, pages 3970–3975, 2005. → pages 5 [35] A. Graham. Kronecker Products and Matrix Calculus With Applications. Halsted Press,John Wiley and Sons, 1981. → pages 17 [36] W. R. Hamilton. On quaternions. In The Mathematical Papers of Sir William Rowan Hamilton, volume 3. Cambridge University Press, 2000. Originally published in the Journal “The London, Edinburgh, and Dublin Philosophical Magazine” in 1846. → pages 1 [37] C. Hansen. Rank-deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion. SIAM, 1998. → pages 40 [38] T. Hazan, S. Polak, and A. Shashua. Sparse image coding using a 3D non-negative tensor factorization. In Proceedings fo the Tenth IEEE International Conference on Computer Vision (ICCV’ 05), 2005. → pages 12, 27, 34, 59 [39] B. Hwang, V. Blanz, and T. Vetter. Face reconstruction from a small number of feature points. In In International Conference on Pattern Recognition (ICPR, pages 842–845, 2000. → pages 76 [40] A. Kapteyn, T. J. Wansbeek, and H. Neudecker. An approach to n-mode components analysis. Psychometrika, 51:269–275, 1986. → pages 25 [41] H. Kim, H. Park, and L. Eldén. Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. pages 1147–1151, 2007. → pages 12 [42] T. Kolda. Multilinear operators for higher-order decompositions. Technical Report SAND2006-2081, Sandia National Laboratories, April 2006. → pages 13, 15, 19, 90 [43] T. Kolda and B. Bader. Tensor decompositions and applications. Technical Report SAND2007-6702, Sandia National Laboratories, November 2007. → pages 13, 14, 15, 16, 18, 21, 24, 25, 26, 90 83 [44] P. M. Kroonenberg and J. De Leeuw. Principal component analysis of three-mode data by means of alternating least squares algorithms. Psychometrika, 45:69–97, 1980. → pages 25 [45] J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Applications, 18:95–138, 1977. → pages 28, 29 [46] J. B. Kruskal. Rank, decomposition and uniqueness for 3-way and n-way arrays. In R. Coppi and S. Bolasco, editors,Multiway Data Analysis, pages 7–18. Elsevier Science Publishers B. V, 1989. → pages 28 [47] L. Lathauwer, B. Moor, and J. Vandewalle. A multilinear sngular value decomposition. SIAM J. Matrix Anal. Appl., 21(4):1253–1278, 2000. → pages 1, 24, 25 [48] L. Lathauwer, B. Moor, and J. Vandewalle. On the best rank-1 and rank-(r1,r2, . . . ,rn) approximation of higher-order tensors. SIAM J. Matrix Anal. Appl., 21(4):1324–1342, 2000. → pages 15, 25 [49] L. d. Lathauwer. Signal processing based on multilinear algebra. PhD thesis, Katholieke Universiteit Leuven, September 1997. → pages 15 [50] A. J. Laub. Matrix Analysis for Scientists and Engineers. SIAM, 2004. → pages 17 [51] D. Lee and H. Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401:788–791, 1999. → pages 3, 7, 12, 31, 35 [52] C. J. Lin. On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Transactions on Neural Networks, 18(6):1589–1596, 2007. → pages 33 [53] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503–528, 1989. → pages 47 [54] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symp. Math. Statistics and Probability, volume 1, pages 281–296, 1967. → pages 5 [55] M. Mørup, L. K. Hansen, J. Parnas, and S. M. Arnfred. Decomposing the time-frequency representation of EEG using nonnegative matrix and multiway factorization. Technical report, Technical University of Denmark, 2006. URL http://www2.imm.dtu.dk/pubdb/p.php?4144. → pages 10, 12 84 [56] M. Mørup, L. K. Hansen, and S. M. Arnfred. Algorithms for sparse nonnegative tucker decompositions. Neural Computation, 20:2112–2131, 2008. → pages 34 [57] T. Murakami and P. M. Kroonenberg. Three-mode models and individual differences in semantic differential data. Multivariate Behavioral Research, 38(2):247–283, 2003. → pages 9 [58] A. N. Netravali and B. G. Haskell. Digital Pictures: Representation, Compression, and Standards. Springer, year = 1995, edition = Second,. → pages 54 [59] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, second edition, 2006. → pages 46 [60] P. Paatero. A weighted non-negative least squares algorithm for three way parafac factor analysis. Chemometrics and Intelligent Laboratory Systems, 38:223–242, 1997. → pages 1, 11, 27, 42 [61] P. Paatero. Least squares formulation of robust non-negative factor analysis. Chemometrics and Intelligent Laboratory Systems, 37:23–35, 1997. → pages 4 [62] P. Paatero. Construction and analysis of degenerate parafac models. Journal of Chemometrics, 14:285–299, 2000. → pages 11, 30 [63] P. Paatero and U. Tapper. Positive matrix factorization: a nonnegative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111–126, 1994. → pages 1, 3, 11 [64] V. P. Pauca, J. Piper, and R. Plemmons. Nonnegative matrix factorization for spectral data analysis. Linear Algebra Appl., 416:29–47, 2006. → pages 5 [65] F. Shahnaz, M. Berry, P. Pauca, and R. Plemmons. Document clustering using nonnegative matrix factorization. J. Inform. Proc. Managment, 42: 373–386, 2006. → pages 5 [66] A. Shashua and T. Hazan. Nonnegative tensor factorization with applications to statistics and computer vision. InMachine Learning, Proceedings of the Twenty-Second International Conference, 2005. → pages 12 [67] J. Shen. Inpainting and the fundamental problem of image processing. SIAM News, 36(5), 2003. → pages 74 85 [68] P. Smaragdis and J. Brown. Nonnegative matrix factorization for polyphonic music transcription. In Proceedings IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 177–180, 2003. → pages 5 [69] A. Stegeman. Degeneracy in CANDECOMP/PARAFAC explained for p× p×2 arrays of rank p+1 or higher. Psychometrika, 71:483–501, 2006. → pages 30 [70] J. T. Sun, H. Zeng, H. Liu, Y. Lu, and Z. Chen. Cube SVD: A novel approach to personalized web search. In A. Press, editor, 14th International Conference on World Wide Web, pages 652–662, 2005. → pages 9 [71] K.-K. Sung. Learning and Example Selection for Object and Pattern Recognition. PhD thesis, MIT, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Cambridge, MA, 1996. → pages 53 [72] G. Tomasi and R. Bro. PARAFAC and missing values. Chemometrics and Intelligent Laboratory Systems, 75(2):163–180, 2005. → pages 54 [73] G. Tomasi and R. Bro. A comparison of algorithms for fitting the PARAFAC model. Computational Statistics and Data Analysis, 50:1700–1734, 2006. → pages 26 [74] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966. → pages 25 [75] C. F. Van Loan. The ubiquitous Kronecker product. Journal of Computation and Applied Mathematics, 123:85–100, 2000. → pages 16, 17 [76] M. A. O. Vasilescu and D. Terzopoulos. Multilinear image analysis for facial recognition. In Proceedings of the International Conference on Pattern Recognition, volume 3, pages 511–514, 2002. → pages 9 [77] E. Wachsmuth, M. W. Oram, and D. I. Perrett. Recognition of objects and ther component parts: responses of single units in the temporal cortex of the macaque. Cerebral Cortex, 4(5):509–522, 1994. → pages 9 [78] Z. Wang and A. C. Bovik. A universal image quality index. IEEE signal processing letters, 9:81–84, 2002. → pages 55 [79] Z. Wang, A. C. Bovik, and L. Lu. Why is image quality assessment so difficult? In Proceedings of the IEEE Int. Conference Acoust., Speech, and Signal Processing, 2002. → pages 54 86 [80] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assesment: from error measurement to structural similarity. IEEE Transactions on Image Processing, 13(1), January 2004. → pages 54 [81] M. Welling and M. Weber. Positive tensor factorization. Pattern Recognition Letters, 22:1255–1261, 2001. → pages 12, 31, 35 [82] C. Zhu, R. Byrd, and J. Nocedal. L-BFGS-B code. http://www.ece.northwestern.edu/∼nocedal/lbfgsb.html. → pages 49 [83] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization. Technical Report NAM-11, Northwestern University, EECS Department, 1994. → pages 45, 50 87 Appendix A List of routines The following are the main routines used in the implementation of our algorithm. funObj NTFAAonce.m, funObj NMFAAonce.m Compute the objective-function value and gradi- ent in the tensor and matrix cases, respectively. lbfgsbNMF Modification of the L-BFGS-B method that al- lows the scaling of the iterative points (i.e., allows the insertion of spacer steps). NMF AAonce.m Main NMF routine; computes NMF using an all- at-once approach and the L-BFGS-B optimization method NTF AAonce.m Main NTF routine; computes NTF using an all- at-once approach and the L-BFGS-B optimization method. scaleiterate Dvar.m Function that scales the current point according with the specified criterion (norm and scaling type). 88 getFactorsNMF.m, struct2vec.m, vec2struct.m Auxiliary functions to reshape the problem vari- ables (both vectorize them and express them as matrix factors). 89 Appendix B Summary of multilinear operators’ properties The following properties are listed in [42] and [43]. n-mode product Let Y ∈ IRJ1×J2×···×JN be an Nth-order tensor. • Given matrices A ∈ IRIm×Jm , B ∈ IRIn×Jn , Y ×m A×n B= (Y ×m A)×n B= (Y ×n B)×m A, (m 6= n). • Given matrices A ∈ IRI×Jn , B ∈ IRK×I , Y ×n A×n B= Y ×n (AB). • If A ∈ IRI×Jn with full rank, then X = Y ×n A⇒ Y =X ×n A†. • If A ∈ IRI×Jn is orthonormal, then X = Y ×n A⇒ Y =X ×n AT . 90 Matricization Let Y ∈ IRJ1×J2×···×JN be an Nth-order tensor, andN = 1, . . . ,N. • If A ∈ IRI×Jn . Then, X = Y ×n A⇔ X(n) = AY(n). • Let A(n) ∈ IRIn×Jn for all n ∈N . If R = {r1, . . . ,rL} and C = {c1, . . . ,cM} partitionN , then X = Y ×1 A(1)×2 A(2)×3 · · ·×N A(N)⇔ X(R×C :JN ) = ( A(rL)⊗·· ·⊗A(r1) ) Y(R×C :IN ) ( A(cM)⊗·· ·⊗A(c1) )T . • Consequently, if A(n) ∈ IRIn×Jn for all n ∈ N , for any specific n ∈ N we have X = Y ×1 A(1)×2 A(2)×3 · · ·×N A(N)⇔ X(n) = A (n)Y(n) ( A(N)⊗·· ·⊗A(n+1)⊗A(n−1)⊗·· ·⊗A(1) )T . Tucker operator GivenX ∈ IRI1×···×IN , G ∈ IRJ1×···×JN , and N matrices A(i) ∈ IRIi×Ji with 1≤ i≤ N, such that X = [[G ;A(1),A(2), . . . ,A(N)]], we can express X in the following ways: • Elementwise, xi1i2···iN = J1 ∑ j1=1 J2 ∑ j2=1 · · · JN ∑ jN=1 g j1 j2··· jNa (1) i1 j1 ◦a (2) i2 j2 ◦ · · · ◦a (N) iN jN . 91 • Using outer products, X = J1 ∑ j1=1 J2 ∑ j2=1 · · · JN ∑ jN=1 g j1 j2··· jNa (1) : j1 ◦a (2) : j2 ◦ · · · ◦a (N) : jN . • Using a matricized form, X(n) = A (n)G(n) ( A(N)⊗·· ·⊗A(n+1)⊗A(n−1)⊗·· ·⊗A(1) )T . • Using multi-mode multiplication, X = G ×1 A(1)×2 A(2)×3 · · ·×N A(N). Kruskal operator GivenX ∈ IRI1×···×IN and N matrices A(i) ∈ IRIi×R with 1≤ i≤ N, such that X = [[A(1),A(2), . . . ,A(N)]], we can express X in the following ways: • Elementwise, xi1i2···iN = R ∑ r=1 a(1)i1r ·a (2) i2r · · · · ·a (N) iNr . • Sum of outer products, X = R ∑ r=1 a(1):r ◦a(2):r ◦ · · · ◦a(N):r . • Using a matricized form, X(n) = A (n) ( A(N)·· ·A(n+1)A(n−1)·· ·A(1) )T . 92 • Using a vectorized form vec(X ) = ( A(N)·· ·A(1) )T 1, where 1 is a ones vector of length R. Aditionally, if X is a third-order tensor in IRI×J×K such that X = [[A,B,C]], we can use slice notation. For example, each frontal slice can be written as X::k = AD(k)BT , k = 1, . . . ,K, where D(k) ∈ IRR×R is a diagonal matrix whose diagonal elemens are given by the kth row ofC, i.e., D(k) = diag(ck:). Assume that we have matrices A(n) ∈ IRIn×R for n= 1, . . . ,N. Other properties of the Kruskal operator are the following. • If we have matrices B(Kn×In) for n= 1, . . . ,N, then[[ [[A(1), · · · ,A(N)]];B(1), · · · ,B(N) ]] = [[B(1)A(1), · · · ,B(N)A(N)]]. • If R̂= {r1, . . . ,rL} and Ĉ = {c1, . . . ,cM} represent a partition of the set N̂ = {1, . . . ,N}, then X = [[A(1),A(2), . . . ,A(N)]]⇐⇒ X(R̂×Ĉ:IN̂) = ( A(rL)·· ·Ar1 )( A(cM)·· ·A(c1) )T . • Thus, for any n ∈ N̂ we have X = [[A(1),A(2), . . . ,A(N)]]⇐⇒ X(n) = A (n) ( A(N)·· ·A(n+1)An−1·· ·A(1) )T . 93"@en ;
edm:hasType "Thesis/Dissertation"@en ;
vivo:dateIssued "2008-11"@en ;
edm:isShownAt "10.14288/1.0051271"@en ;
dcterms:language "eng"@en ;
ns0:degreeDiscipline "Computer Science"@en ;
edm:provider "Vancouver : University of British Columbia Library"@en ;
dcterms:publisher "University of British Columbia"@en ;
dcterms:rights "Attribution-NonCommercial-NoDerivatives 4.0 International"@en ;
ns0:rightsURI "http://creativecommons.org/licenses/by-nc-nd/4.0/"@en ;
ns0:scholarLevel "Graduate"@en ;
dcterms:subject "Nonnegative tensor"@en, "Optimization"@en ;
dcterms:title "An all-at-once approach to nonnegative tensor factorizations"@en ;
dcterms:type "Text"@en ;
ns0:identifierURI "http://hdl.handle.net/2429/1606"@en .