Near-optimal sample complexity for noisy or 1-bit tensorcompletionbyNavid GhadermarzyA THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Mathematics)The University of British Columbia(Vancouver)August 2018c© Navid Ghadermarzy, 2018The following individuals certify that they have read, and recommend to the Faculty of Gradu-ate and Postdoctoral Studies for acceptance, a thesis/dissertation entitled:NEAR-OPTIMAL SAMPLE COMPLEXITY FOR NOISY OR 1-BIT TENSOR COMPLETIONsubmitted by NAVID GHADERMARZY in partial fulfilment of the requirements of the degree ofDoctor of Philosophy in MathematicsExamining Committee:Ozgur Yilmaz, Faculty of ScienceSupervisorYaniv Plan, Faculty of ScienceSupervisorFelix Herrmann, Faculty of ScienceSupervisory Committee MemberChen Greif, Faculty of ScienceUniversity ExaminerBrian Wetton, Faculty of ScienceUniversity ExaminerRichard Lockhart, Simon Fraser UniversityExternal ExamineriiAbstractTensor completion is the problem of recovering a low-rank tensor from a partial subset of itsentries. Assuming a rank-r, order-d tensor in RN×N×···N , the best sampling complexity achievedis O(rNd2 ) which can be obtained by a tensor nuclear-norm minimization problem. This boundis significantly larger than O(rdN), the number of free variables in a rank-r tensor. In this thesiswe prove that when r = O(1), we can achieve optimal sample complexity by constraining eitherone of two proxies for tensor rank, the convex M-norm or the non-convex max-qnorm. The max-qnorm is the generalization of matrix max-norm to tensors which is non-convex. The M-norm onthe other hand is a convex atomic norm whose atoms are rank-1 sign tensors. We prove that bothmax-qnorm and M-norm of a bounded rank-r tensor are bounded by quantities that are indepen-dent of N. We also prove that the unit balls of both these norms have small Rademacher complexity.We analyze max-qnorm and M-norm constrained least squares minimization problems for ten-sor completion, proving that when r = O(1), m = O(Nd) measurements are sufficient for efficientestimation of the original tensor. We also use an information theoretic technique to prove that thedependence on N is optimal. Moreover, we design an alternating method for approximating thesolution of max-qnorm tensor completion and do a thorough investigation of its performance onsynthetic and real-world data.We also generalize the 1-bit matrix completion problem to higher-order tensors. We provethat when r = O(1) a bounded rank-r, order-d tensor T in RN ×RN ×·· ·×RN can be estimatedefficiently by only m = O(Nd) binary measurements by regularizing either its max-qnorm or itsM-norm. We prove that the sample complexity of recovering a low-rank tensor from 1-bit mea-surements of a subset of its entries is the same as recovering it from unquantized measurements.Moreover, we show the advantage of using 1-bit tensor completion over matricization both theo-retically and numerically. Specifically, we show how the 1-bit measurement model can be used forcontext-aware recommender systems.iiiLay summaryMany real-world data sets can be arranged as multi-dimensional arrays or tensors. Tensor com-pletion is an important problem which is applicable whenever the data has missing or corruptedentries. Missing or corrupted entries can be the result of a faulty sensor or when taking measure-ments is too expensive. Another important application of tensor completion is for recommendationsystems such as the movie recommendation used in Netflix or Hulu. Due to the rich understandingof matrices and powerful tools available for matrices, a common approach for completing missingdata (tensor completion) is to first rearrange them as a 2-dimensional matrix and then predict themissing entries (matricizing). In this thesis, we prove the advantage of tensor completion overmatricizing when we have access to noisy measurements of a subset of the entries of a tensor andalso when we have access to partial 1-bit measurements (0 or 1 measurements) of the tensor.ivPrefaceThis thesis consists of my original research, conducted at the Department of Mathematics at theUniversity of British Columbia, Vancouver, Canada, under the supervision of Dr. Ozgur Yilmazand Dr. Yaniv Plan. The following chapters contain previously published or submitted work forwhich I was the principal investigator and author.Chapter 3 and some parts of Chapter 2 are published in [45] which is joint work with OzgurYilmaz and Yaniv Plan and has been posted on arXiv and is submitted for publication.Chapter 4 and some parts of Chapters 2 and 5 are published in [46] which is joint work withOzgur Yilmaz and Yaniv Plan and has been posted on arXiv and is submitted for publication.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Matrix completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 1-bit matrix completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.3 Rademacher complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.4 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.5 Fibers, slices and matricization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.6 Rank of a tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.7 Outline of the results and organization . . . . . . . . . . . . . . . . . . . . . . . . 81.7.1 Simplified results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.7.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 M-norm and max-qnorm of tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.1 Notions of Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15vi2.2 Frobenius and nuclear norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.3 Max-qnorm and atomic M-norm . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.3.1 Matrix max-norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.3.2 Tensor max-qnorm and atomic M-norm . . . . . . . . . . . . . . . . . . . 182.3.3 Unit max-qnorm ball of tensors . . . . . . . . . . . . . . . . . . . . . . . . 192.3.4 Max-qnorm and M-norm of bounded low-rank tensors . . . . . . . . . . . 202.4 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.4.1 Proof of Lemma 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.4.2 Proof of Theorem 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.4.3 Proof of Theorem 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222.4.4 Proof of Lemma 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242.4.5 Proof of Lemma 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252.4.6 Proof of Theorem 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263 Tensor completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.1 Introduction and past results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.2 Observation model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363.3 M-norm constrained least squares estimation . . . . . . . . . . . . . . . . . . . . . 373.4 Comparison to past results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.5 Information-theoretic lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . 413.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423.6.1 Algorithms for max-qnorm constrained least squares estimation . . . . . . 433.6.2 Experiment on max-qnorm of tensors . . . . . . . . . . . . . . . . . . . . 453.6.3 Automatic algorithm for choosing optimal max-qnorm bound R . . . . . . 463.6.4 Numerical results of low-rank tensor completion . . . . . . . . . . . . . . 493.7 Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543.7.1 Proof of Theorem 24 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543.7.2 Proof of Theorem 32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583.7.3 Proof of Theorem 33 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594 1-bit tensor completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624.1 Introduction and past results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624.2 Problem formulation and observation model . . . . . . . . . . . . . . . . . . . . . 654.2.1 Discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664.3 Max-qnorm and M-norm constrained 1-bit tensor completion . . . . . . . . . . . . 67vii4.4 Nuclear norm constrained 1-bit tensor completion . . . . . . . . . . . . . . . . . . 704.5 1-bit tensor completion under exact rank constraint . . . . . . . . . . . . . . . . . 724.6 Information-theoretic lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . 744.7 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754.7.1 Max-qnorm constrained 1-bit tensor completion . . . . . . . . . . . . . . . 754.7.2 1-bit tensor completion with exact rank constraints . . . . . . . . . . . . . 784.7.3 Synthetic experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794.8 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844.8.1 Proof of Theorem 39 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844.8.2 Proof of Theorem 47 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 854.8.3 Proof of Theorem 51 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 894.8.4 Proof of Theorem 54 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 944.8.5 Proof of Lemma 65 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 955 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 985.1 Applications of tensor completion . . . . . . . . . . . . . . . . . . . . . . . . . . 985.1.1 Color image inpainting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 985.1.2 Video completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1025.1.3 Hyperspectral imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1045.1.4 MRI data recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1055.2 Applications of 1-bit tensor completion . . . . . . . . . . . . . . . . . . . . . . . . 1075.2.1 Score prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1085.2.2 In-car music recommender system . . . . . . . . . . . . . . . . . . . . . . 1095.2.3 A Restaurant Context-Aware Recommender System . . . . . . . . . . . . . 1116 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117viiiList of TablesTable 5.1 Results of a comparison between 1-bit and multi-bit matrix and tensor completion al-gorithms on incar-music data [6] for predicting whether the unobserved ratings wereabove or below average. The multi-bit methods use the original ratings from 0 to 5and the context-aware methods include the context information such as time of the day,weather, location of driving and mood. . . . . . . . . . . . . . . . . . . . . . . . . 111Table 5.2 Results of a comparison between 1-bit and multi-bit matrix tensor completion algo-rithms on Tijuana restaurant data [102] for predicting whether the unobserved ratingswere above or below average. TF refers to using tensor factorization without the max-qnorm regularization. For the 1-bit results we use f (x) =Φ( xσ ). . . . . . . . . . . . . 113Table 5.3 Results of a comparison between 1-bit and multi-bit matrix tensor completion algo-rithms on Tijuana restaurant data [102] showing the mean absolute error. . . . . . . 113ixList of FiguresFigure 1.1 The figure in left shows a third order tensor and its slices and the figure in theright shows its mode-1 matricization. . . . . . . . . . . . . . . . . . . . . . . . 6Figure 3.1 Log-log plot of average max-qnorm of 3 and 4-dimensional low-rank tensorsobtained by Algorithm 1, for various rank and sizes, averaged over 15 drawsfor each rank and size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46Figure 3.2 All the possible situations in the five-point search algorithm. The leftmost andthe rightmost red dots are the previous lower bound and upper bound on themax-qnorm R, respectively. The green intervals show the new lower bound andupper bound based on the RMSE. . . . . . . . . . . . . . . . . . . . . . . . . 48Figure 3.3 3-dimensions, no noise Average relative recovery error ‖Trecovered−T]‖2F‖T ]‖2Ffor 3-dimensional tensor T ∈ R50×50×50 and different ranks and samples. . . . . . . 50Figure 3.4 3-dimensions, 10-dB noise Average relative recovery error ‖Trecovered−T]‖2F‖T ]‖2Ffor3-dimensional tensor T ∈ R50×50×50 and different ranks and samples. . . . . . 51Figure 3.5 4-dimensions, no noise Average relative recovery error ‖Trecovered−T]‖2F‖T ]‖2Ffor 4-dimensional tensor T ∈ R20×20×20×20 and different ranks and samples. Theplot on the left shows the results for the 400×400 matricized case. . . . . . . . 53Figure 3.6 4-dimensions, 10-dB noise Average relative recovery error ‖Trecovered−T]‖2F‖T ]‖2Ffor4-dimensional tensor T ∈ R20×20×20×20 and different ranks and samples. Theplot on the left shows the results for the 400×400 matricized case. . . . . . . . 53Figure 3.7 3-dimensions, no noise Average relative recovery error ‖Trecovered−T]‖2F‖T ]‖2Ffor e-dimensional tensors with fixed number of measurements in the left and fixedrank in the right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54xFigure 4.1 Average relative squared error, ‖Trecovered−T]‖2F‖T ]‖2F, obtained from matricized 1bit-TC, max-qnorm, factor-frob norm, and CGD over a range of different valueof σ . From left to right the original tensor is 2 ,3 and 4-dimensional. Resultsshow that the noise variance should not be too small or too big. . . . . . . . . . 80Figure 4.2 Average relative squared errors, ‖Trecovered−T]‖2F‖T ]‖2F, for recovering a 200× 200matrix with various ranks, using m 1-bit measurements sample for a range ofdifferent value of m. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81Figure 4.3 Average relative squared errors, ‖Trecovered−T]‖2F‖T ]‖2F, for recovering a 30×30×30tensor with various ranks, using m 1-bit measurements for a range of differentvalue of m. Top left is the result of matricization, top right is CGD for exact-rank constraints, bottom left is the results using the factor-frob function as anestimate for nuclear-norm constraint and bottom right is the result of max-qnorm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82Figure 4.4 Average relative squared errors, ‖Trecovered−T]‖2F‖T ]‖2F, for recovering a 15× 15×15× 15 tensor with various ranks, using m 1-bit measurements for a range ofdifferent value of m. Left is the result of matricization, center is the result ofusing the factor-frob function as an estimate for nuclear-norm constraint andright is the result of max-qnorm. . . . . . . . . . . . . . . . . . . . . . . . . . 83Figure 5.1 The Baboon image and the RSE of best rank-r approximations of the 128×128×3 tensor and its 128×384 matricization. . . . . . . . . . . . . . . . . . 99(a) Original Baboon image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99(b) RSE of best rank-r approximations . . . . . . . . . . . . . . . . . . . . . . 99Figure 5.2 Reconstructed Baboon images with 50% observed entries. The first row showsthe results with missing pixels and the second row shows the results with miss-ing random entries. The results of matricization are worse than using tensorsand the results of missing pixels are worse than missing random entries of thetensor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100Figure 5.3 Results for the Peppers image with only 20% sampling rate. . . . . . . . . . . 101Figure 5.4 PSNR results for a larger natural image in R321×481×3. The second row is theresults of LRTC [83], TCTF [134] and t-SVD [132] . . . . . . . . . . . . . . . 102Figure 5.5 Completing a black and white video from 20% and 5% of the entries. . . . . . 103Figure 5.6 Frames 50 and 100 of the tomato video [83]. The subsampling mask includesblocks of pixels as well. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103xiFigure 5.7 An example of a hyperspectral image with 90% missing entries. In the secondrow, we show the result with added white noise. . . . . . . . . . . . . . . . . . 104Figure 5.8 The recovery results for four slices of a brain MR image. The columns areimages of different slices and the rows show the original image and differentsampling rates. First row shows the original images. Each image in the figureshows the sampled image (top) and the recovered image (bottom). . . . . . . . 106Figure 5.9 The left figure shows the average relative error of recovering a rank-5, 30×30×30 tensor by max-qnorm constrained 1-bit tensor completion with differ-ent values of σ when the 1-bit measurements were obtained by using σ = 0.15.The right figure shows the percentage of correct sign predictions. . . . . . . . . 109Figure 5.10 Results of applying 1-bit matrix and tensor completion to partially observedverbal scores tensor [87] to determine whether unobserved scores are above orbelow average. The left figure shows the percentage of correct predictions andthe right figure shows the absolute mean error. The scores are in the range [-4040] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110xiiList of Algorithms1 Estimating max-qnorm of a tensor T . . . . . . . . . . . . . . . . . . . . . . . . . 452 Tensor completion, with cross-validation . . . . . . . . . . . . . . . . . . . . . . . 473 1-bit TC with rank-constraints via a Conditional Gradient Descent (CGD) . . . . . 79xiiiAcknowledgmentsFirst and foremost, I would like to thank my supervisor Dr. Ozgur Yilmaz for his guidanceduring my PhD. I have been very fortunate to have your valuable mentorship in my mathematicaland academic life. I will be forever grateful for teaching me compressed sensing. I appreciate allthe time you spent explaining concepts and sharing your insightful suggestions. I am also gratefulfor your careful evaluation of all my writings and I am sorry about all the grammatical and spellingerrors during the years, especially the hundreds of commas you added and removed from my writ-ings. I couldn’t have asked for a better supervisor and thank you for all the support and feedbackthorough the years.This thesis wouldn’t have been possible without Dr. Yaniv Plan. I started working on tensorcompletion when we had a discussion about it in our first meeting and since then I have learnednew concepts or encountered new questions every time we have had a meeting which is more thananyone can ask from a supervisor. Your knowledge and enthusiasm for math in general has alwaysmotivated me. Thank you for you guidance during my PhD.I should also thank Dr. Felix Herrmann and all the members of the SLIM group who I havebugged during my PhD. It was a huge opportunity to learn from a world-known expert in seismicimaging. I also want to thank the mathematics community at UBC for teaching me and helping mewith my academic life.Finally I want to thank all my friends who are always supportive and kind to me. EspeciallyI want to thank Rebeca for her love and support. Thank you for understanding, supporting, andbelieving in me. Last but not least I would want to thank my family who I owe where I am rightnow to them.xivChapter 1IntroductionRepresenting data as multi-dimensional arrays, i.e., tensors, arises naturally in many modernapplications such as collaborative filtering [65], image inpainting [51], interpolating large-scaleseismic data [29, 71], medical images [88], computer vision [58], data mining [1], image compres-sion [83, 109], hyperspectral image analysis [79], and radar signal processing [94].There are many reasons where one may want to work with a subset of the tensor entries; (i)these data sets are usually large, and we wish to store only a small number of the entries (compres-sion). For example, color videos can be arranged in 4-dimensional tensors that can be compressedsignificantly; (ii) in some applications, the acquisition of each entry can be expensive, e.g., eachentry may be obtained by solving a large PDE [123]. One example of such applications is seismicimaging where data are usually 3 to 6 dimensions; (iii) some of the entries might get lost due tophysical constraints while gathering them. For example, parts of Hyperspectral images taken bysatellites can be blocked by clouds. These restrictions result in situations where one has accessonly to a subset of the tensor entries. The problem of tensor completion entails recovering a tensorfrom a subset of its entries.The general framework of the tensor completion problems can be formulated as the following.Given a d-dimensional tensor T ] ∈RN1×...×Nd we have access to measurements taken from a subsetof its entries. In particular, assuming Ω ⊂ [N1]× [N2]×·· ·× [Nd] to be a subset of the indices ofthe tensor and f :R→R to be a real function, the most basic form of a tensor completion problemisfind X ∈ RN1×...×Nd s.t. f (X(ω)) = f (T ](ω)) ∀ ω ∈Ω.It is easy to observe that if the number of observations, m := |Ω|, is smaller than the dimension of1the tensor, Πdi=1Ni, the above problem does not have a unique result. Hence, the tensor completionproblem focuses on two main goals; (i) Given the measurement function f , what assumptions onT ] andΩ can guarantee efficient recovery of T ] from { f (T ](ω)}ω∈Ω; and (ii) designing stable andtractable methods that recover the tensor from measurements taken from a subset of its entries.1.1 Matrix completionRecovering an object with a number of measurements that is smaller than its ambient dimen-sion is generally impossible without some form of prior information about the object. One exampleof using such prior information is in Compressed Sensing (CS) which has revolutionized the fieldof signal acquisition and reconstruction [20, 36]. The main idea of CS is based on using inherentproperties of signals, such as sparsity in a specific domain, to decrease the number of necessarymeasurements for recovery.In recent years, the framework of recovering an object using measurements far fewer than itsambient dimension by exploiting its inherent properties has been extended in a few directions. Oneof these extensions is the problem of matrix completion which was popularized by the famousNetflix prize contest. Given a rank-r matrix A] ∈ RN×N and a random subset of its entries Ω ={ω1,ω2, · · · ,ωm}, ωi ∈ [N]× [N], we observe m noisy samples {Yωt}mt=1 of the matrix A], Yωt =A]ωt +σξt , where ξt is a zero-mean noise. The purpose of matrix completion is to recover A] fromthe measurements Y . This problem has been extensively studied in the literature [22, 38, 66, 118].The straightforward solution is finding the matrix with the lowest rank that is consistent with themeasurements. In the noiseless case, this translates toargminrank(M) s.t. MΩ = A]Ω,where MΩ(ω) = A](ω) for ω ∈ Ω and zero otherwise. Rank minimization is NP-hard; therefore,extensive research has been done to find tractable alternatives. The most common approach is usingnuclear norm. In [23], the authors showed that solving the nuclear-norm minimization problemargmin‖M‖∗ s.t. MΩ = A]Ω,would recover a rank-r, N×N matrix from only O(rNpolylog(N)) samples under mild incoherenceconditions on the matrix. Extensive research has been done in analyzing variants of nuclear-norm2minimization and designing efficient algorithms to solve it as in [17, 21, 23, 126].The nuclear-norm of a matrix is the sum of its singular values and is the best convex ap-proximation for the rank function. An alternative interpretation of the rank and the nuclear-norm of a matrix is based on the minimum number of columns of its factorizations. In par-ticular, the rank of a matrix M is the minimum number of columns of the factors U, V whereM = UV ′; the nuclear norm is the minimum product of the Frobenius norms of the factors, i.e.,‖M‖∗:= min‖U‖F‖V‖F subject to M = UV ′ [118]. An alternative proxy for the rank of a ma-trix is its max-norm defined as ‖M‖max:= min‖U‖2,∞‖V‖2,∞ subject to M = UV ′ [118]. Themax-norm bounds the norm of the rows of the factors U and V , and was used for matrix com-pletion in [40]. There, the authors studied both max-norm and trace-norm matrix completion byanalyzing the Rademacher complexity of the unit balls of these norms. They proved that underuniformly random sampling, either with or without replacement, m = O( rNε log3(1ε )) samples aresufficient for achieving mean squared recovery error ε using max-norm constrained estimation andm = O( rN log(N)ε log3(1ε )) samples are sufficient for achieving mean squared recovery error ε usingnuclear-norm constrained estimation.1.2 1-bit matrix completionIn some applications, the observations are highly quantized. For example, in some recommen-dation systems, the available data are the result of yes/no questions asked from the users, e.g.,whether they think an advertisement is relevant to them or not. Here the problem is recovering alow-rank tensor from 1-bit measurements of a subset of its entries. To be precise in tensor comple-tion we have access to (possibly noisy) samples of the tensor, which we also refer to as multi-bittensor completion and in 1-bit tensor completion we have access to 1-bit information of a subsetof the entries of the tensor.1-bit compressed sensing [16, 59, 99] investigates the possibility of recovering a sparse signalfrom its 1-bit-quantized measurements and it is proved that robust recovery is possible with m =O(s log(n)2)measurements which is surprisingly the same sample complexity one can achieve withunquantized measurements. The generalization of 1-bit CS to matrix recovery has been studiedextensively in the literature. Here, the 1-bit samples are obtained by dithering a subset of theentries of the matrix and taking their sign. Given a rank-r matrix M] and a random subset of its3entries Ω, observations can be modeled asYω =+1 with probability f (M]ω),−1 with probability 1− f (M]ω)for ω ∈Ω. (1.1)Here f : R→ [0,1] is a (previously known) monotone differentiable function which can be inter-preted as the cumulative distribution of the dithering noise function. Given these measurementsthe negative log-likelihood of observing Y from a matrix M isLΩ,Y (M) = ∑ω∈Ω(1[Yω=1] log(1f (Mω))+1[Yω=−1] log(11− f (Mω))).Considering a constraint set D that is preferably low-dimensional such that M] ∈D, we can find anestimate for M] byargminLΩ,Y (M) s.t. M ∈ D.Using exact-rank constrains has been analyzed in [11, 93] and convex constraints such as nuclearnorm [31] and max-norm [18] have been analyzed as well and prove that with high probability,m = O(rN log(N)) measurements are sufficient for efficient recovery of a rank-r matrix in RN×N .1.3 Rademacher complexityAn important technical tool used in this thesis is treating tensors as functions from indices tovalues. Using this, we can analyze the complexity of sets of tensors as sets of functions. To beprecise, we regard a tensor T as a function f : [N1]× ·· · [Nd]→ R defined as f (ω1, · · · ,ωd) :=T (ω1, · · · ,ωd).For a function class F , the empirical Rademacher complexity of F with respect to a sampleset Ω is defined asRˆΩ(F ) =2|Ω|Eε [supf∈F| ∑ω∈Ωεi f (ω)|],where εi is a Rademacher random variable. The Rademacher complexity expresses how well afunction class correlates with random noise on average and is a powerful tool for proving gen-eralization bounds. In chapter 2, we prove that the set of bounded low-rank tensors has lowRademacher complexity.41.4 NotationsIn this section, we introduce some notations that will be used throughout this thesis. We adoptthe notation of Kolda and Bader’s review on tensor decompositions [70]. Below, λ , σ , and α areused to denote scalars, and C and c are used to denote universal constants. Vectors are denoted bylower case letters, e.g., u, and v. Both matrices and tensors are represented by upper case letters,usually using A and M for matrices, and T and X for tensors. Tensors are a generalization of ma-trices to higher order, also called multi-dimensional arrays. For example, a first-order tensor is avector and a second-order tensor is a matrix. X ∈⊗di=1RNi is a d-th order tensor whose i-th sizeis Ni. We also denote⊗di=1RN as RNd. Elements of a tensor are either specified as Xi1,i2,···,id orX(i1, i2, · · · , id), where 1≤ i j ≤ N j for 1≤ j ≤ d. We also use Xω as a shorthand to refer the indexω of a tensor, where ω = (i1, i2, · · · , id) is an n-tuple determining the index X(i1, i2, · · · , id).Inner products are denoted by 〈·, ·〉. The symbol ◦ represents both matrix and vector outerproducts where T = U1 ◦U2 ◦ · · · ◦Ud means T (i1, i2, · · · , id) = ∑k U1(i1,k)U2(i2,k) · · ·Ud(id,k),where k ranges over the columns of the factors. In the special case of vectors, T = u1 ◦u2 ◦ · · · ◦udmeans T (i1, i2, · · · , id) = u1(i1)u2(i2) · · ·ud(id). Finally [N] := {1, · · · ,N} is the shorthand notationwe use for the set of integers from 1 to N.1.5 Fibers, slices and matricizationMatricization is a fundamental step in many algorithms that have been designed for tensor and1-bit tensor completion. This method rearranges the tensor as a matrix by assigning one dimensionof the tensor as the rows and all the other dimensions as the columns. Fibers are the higher-orderanalogue of matrix rows and columns. For a tensor X ∈⊗di=1RNi , mode-i fibers of the tensor areΠ j 6=iN j vectors obtained by fixing all indices of X except for the i-th one. For example mode-1 fibers of a third order tensor X is denoted by X(:, j,k) for fixed j and k. Slices of a tensor,on the other hand, are obtained by fixing all the indices of a tensor except for two. The mode-imatricization of X , denoted by X(i) ∈ RNi×Π j 6=iN j is obtained by arranging all the mode-i fibers ofX along columns of the matrix X(i). More precisely, X(i)(ii, j) = X(i1, i2, · · · , id), wherej = 1+d∑k=1,k 6=i(ik−1)Jk with Jk =Πk−1m=1,m6=iNm.A detailed illustration of these definitions can be found in [69, 70]. Figure 1.1 shows an exampleof mode-1 matricization of an order-3 tensor.5Figure 1.1: The figure in left shows a third order tensor and its slices and the figure in theright shows its mode-1 matricization.A generalization of these unfoldings was proposed by [90] that rearranges X(1) into a more bal-anced matrix: For j ∈ {1, · · · ,d}, X[ j] is obtained by arranging the first j dimensions along the rowsand the rest along the columns. In particular, using Matlab notations X[ j]= reshape(X(1),Πi= ji=1Ni,Πi=di= j+1Ni).1.6 Rank of a tensorA unit tensor is a tensor U ∈⊗dj=1RN j that can be written asU = u(1) ◦u(2) ◦ · · · ◦u(d), (1.2)where u( j) ∈RN j is a unit-norm vector and ◦ denotes the vector outer product. The vectors u( j) arecalled the components of U . A rank-1 tensor is a scalar multiple of a unit tensor. Define Ud to bethe set of unit tensors of order d.The rank of a tensor T , denoted by rank(T ), is defined as the smallest number of rank-1 tensorsthat generate T as their sum, i.e.,T =r∑i=1λiUi =r∑i=1λiu(1)i ◦u(2)i ◦ · · · ◦u(d)i ,where Ui ∈ Ud is a unit tensor. This low-rank decomposition is also known as CANDECOM-P/PARAFAC (CP) decomposition [24, 53]. In this thesis, we concentrate on CP decomposi-tions which generalizes the outer product machinery of matrix SVD. However, unlike matrix6SVD, there is no restriction on orthogonality of the set of vectors {u( j)i }N ji=1. More importantly,for a rank-r tensor T = ∑ri=1λiu(1)i ◦ u(1)i ◦ · · · ◦ u(d)i , mode-i matricization is T( j) = ∑ri=1λiu( j)i ◦(u(1)i ⊗·· ·⊗u( j−1)i ⊗u( j+1)i ⊗·· ·⊗u(d)i ) and the balanced matricization introduced in [90] is T[ j] =∑ri=1λi(u(1)i ⊗u(2)i ⊗·· ·⊗u( j)i )◦(u( j+1)i ⊗·· ·⊗u(d)i ). Here, the symbol⊗, represents the Kroneckerproduct. The rank of all matricizations defined above are less than or equal to the rank of the tensor.Even though tensor-rank is a direct generalization of matrix-rank it does not have a lot ofproperties of matrix rank. For example, tensor-rank can be different over R and C [28]. Moreover,there is no straightforward algorithm to compute the rank of a tensor [70]. Furthermore, the setof rank-r tensors is not necessarily closed, e.g., there exists a sequence of rank-2 tensors thatconverge to a rank-3 tensor. An example of such tensors is presented in [97]. Consider a rank-3tensor defined asX = a1 ◦b1 ◦b2+a1 ◦b2 ◦ c2+a2 ◦b1 ◦ c1,where the matrices A = [a1a2] , B = [b1b2], and C = [c1c2] have linearly independent columns.This tensor can be approximated arbitrarily closely by a rank-two tensor of the following form:Xt = t(a1+1ta2)◦ (b1+ 1t b2)◦ (c1+1tc2)− ta1 ◦b1 ◦ c1.Theoretical and numerical difficulties of approximating a tensor by its CP decomposition hasresulted in defining other decompositions that are used in the literature. The Tucker decomposition[122] is a form of higher-order PCA, which decomposes a tensor into a core tensor that gets mul-tiplied by a matrix in each mode [70]. The core tensor can be thought of as a multi-dimensionaltensor of singular values. For a tensor T , define ri to be the rank of its mode-i matricization, i.e.,assume rank(T(i)) = ri. ThenT = G×1 A1×2 A2 · · ·×d Adis its Tucker decomposition. Here G ∈ Rr1×···rd is the core tensor, Ai ∈ RNi×ri is a factor of thedecomposition and (r1, · · · ,rd) is its multi-linear rank. The notation G×i Ai is the multiplicationalong mode-i which is the result of multiplication of Ai with the matricization of G along its i-thmode. The generalization of rank to multi-rank removes some of the difficulties associated withCP-decomposition. Most notable is that the set of tensors with multi-linear rank (r1, · · · ,rd) isclosed. As a result, every tensor has a best multi-rank (r1, · · · ,rd)−approximation which can beobtained by using HOSVD [33]. Notice that the CP decomposition is a special case of the Tuckerdecomposition when G is diagonal. For a tensor with multi-linear rank (r, · · · ,r), the number of free7variables is exponential in r. This has motivated the use of hierarchical tensor decompositions suchas Hierarchical Tucker decomposition[48, 52], its subclass tensor train decomposition [27, 96] andmany more variants whose number of free variables is linear with r. In this thesis, we concentrateon CP-decomposition because of its simplicity and similarity to a certain nuclear decompositionof low max-qnorm tensors that we will define later.1.7 Outline of the results and organizationIn this thesis, we study the generalization of matrix completion and 1-bit matrix completion tohigher-order tensors, i.e., considering an order-d tensor T with size N1×N2×·· ·×Nd , we wantto reconstruct T by sampling a subset of its entries (or some function of the samples). Similar toits lower-dimensional counterparts, without assuming further structure on the underlying tensor,there is no hope of recovering the missing entries as they are independent of the observed entries.Therefore, here (and in many applications) tensors of interest are the ones that can be expressedapproximately as a lower-dimensional object. In particular, we consider tensors that have low CP-rank [24, 53]. The low-rank assumption makes tensor completion a feasible problem. For example,an order-d, rank-r tensor, which has size N1×N2×·· ·Nd where Ni = O(N), has O(rNd) free vari-ables, which is much smaller than Nd , the ambient dimension of the tensor.Generalizing the techniques and machinery used in matrix completion to the tensor completionproblem is generally hard. Maybe the most important complication is that tensors do not have aunique generalization of SVD and most extensions manage to capture just some of the propertiesof the matrix SVD. For example, generalizing the outer product machinery of SVD leads to the CP-decomposition, which does not have orthogonality condition of the matrices. Therefore, despiteall the powerful tools and algorithms developed for matrix completion, tensor completion problemis still fairly open and not as well understood. For instance, there is a large gap between theo-retical guarantees and what is observed in numerical simulations. This is mainly due to the lackof efficient orthogonal decompositions, low-rank approximations, and limited knowledge of thestructure of low-rank tensors compared to matrices. These limitations have resulted in numerousworks which are based on connecting the general tensor completion problem to matrix completionby rearranging the tensor as a matrix, including the sum of nuclear norms (SNN) model that mini-mizes the sum of the nuclear norm of matricizations of the tensor along all its dimensions, leadingto sufficient recovery with m = O(rNd−1) samples [44, 83]. More balanced matricizations, suchas the one introduced in [90], can result in a better bound of m = O(rNdd2 e) samples.8Once we move from matrices to higher-order tensors, many of the well-known facts of matrixalgebra cease to be true. For example, even a best rank-k approximation may not exist for sometensors, illustrated in [70, Section 3.3], showing that the space of tensors with rank at most 2 is notclosed. Interestingly there is a paper titled “Most tensor problems are NP-hard” [55] which provesthat many common algebraic tasks are NP-hard for tensors with d ≥ 3, including computing therank, spectral norm, and nuclear norm. The computational complexity of directly solving tensorcompletion and the inferior results of matricization make tensor completion challenging.Having all these complications in mind, on the theoretical side, a low-rank tensor has O(rdN)free variables, but an upper-bound of O(rNdd2 e) on the sample complexity. When d > 2, the poly-nomial dependence on N seems to have a lot of room for improvement. Moreover, it is well-knownthat empirical recovery results are much better when the tensor is not rearranged as a matrix, eventhough these results are attempting to solve an NP-hard problem. This has resulted in efforts to-wards narrowing this gap, including heuristic algorithms [9, 83]. In spite of good empirical resultsand reasonable justifications, a theoretical study filling this gap was not presented in these cases.In the second part of the thesis, we do an extensive investigation of the recovery guaranteesone can obtain for the 1-bit tensor completion problem. Unlike the tensor completion problem, the1-bit tensor completion has not been well studied in the literature. Matricizing the tensor alongdifferent modes was suggested in [2]. Our analysis is based on considering three constraint setsfor a low-rank tensor. The set of low-rank tensors (fixed rank) and two convex approximations ofthis set: the set of low nuclear-norm tensor and the set of low M-norm tensors. Our investigationis based on generalizing three different results on 1-bit matrix completion. Along the way, we dis-cuss the differences between these methods and the complications of generalizing the 1-bit matrixcompletion methods to tensors which answers some of the open questions and highlights someother open problems in the field.In this thesis, we analyze both 1-bit and multi-bit tensor completion and provide near-optimalsample complexity guarantees for both of them. Nuclear norm has been extended to tensors in[35, 43, 129]. In an effort to obtain linear dependence on N, we analyze 1-bit and multi-bit tensorcompletion using a max-qnorm (max-quasi-norm) constrained algorithm where the max-qnorm isa direct generalization of the matrix max-norm to the case of tensors. Unfortunately, max-qnormis non-convex. However, analyzing the unit-ball of the dual of the dual of the max-qnorm (whichis a convex norm) led us to define and analyze a convex atomic-norm (which we call M-norm)9constrained least squares problem, where we obtained optimal recovery bounds on the size of thetensor.1.7.1 Simplified resultsWithout going into details, we briefly state the upper bounds we establish (in Section 3.3)on the recovery errors associated with M-norm and max-qnorm constrained tensor completion.Given a rank-r, order-d tensor T ] ∈⊗di=1RN , and a random subset of its entries with indices inS = {ω1,ω2, · · · ,ωm}, ωi ∈ [N]d , we observe m noisy entries {Yωt}mt=1 of {T ](ωt)}mt=1, where eachobservation is perturbed by i.i.d. noise with mean zero and variance σ2. To give a simple version ofthe result we assume that indices in S are drawn independently at random with the same probabilityfor each observation, i.e., we assume uniform sampling to give a simple theorem.Theorem 1. Consider a rank-r, order-d tensor T ] ∈⊗di=1RN with ‖T ]‖∞≤ α . Assume that weare given a collection of noisy observationsYωt = T](ωt)+σξt , t = 1, · · · ,m,where the noise sequence ξt are i.i.d. standard normal random variables and each index ωt ischosen uniformly random over all the indices of the tensor. Then, there exists a constant C < 20such that the solution ofTˆM = argminX1mm∑t=1(Xωt −Yωt )2 subject to ‖X‖∞≤ α, ‖X‖M≤ (r√r)d−1α, (1.3)satisfies‖T ]− TˆM‖2FNd≤C(α+σ)α(r√r)d−1√dNm.with probability greater than 1− e −Nln(N) − e− dN2 . Moreover, the solution ofTˆmax = argminX1mm∑t=1(Xωt −Yωt )2 subject to ‖X‖∞≤ α, ‖X‖max≤ (√rd2−d)α, (1.4)satisfies‖T ]− Tˆmax‖2FNd≤Cd(α+σ)α√rd2−d√dNm,with probability greater than 1− e −Nln(N) − e− dN2 .10Above, ‖X‖M is the M-norm of tensor X which is an atomic norm whose atoms are rank-1 signtensors defined in Section 2.3.2 (2.12) and ‖X‖max is max-qnorm defined in (2.11).Without going into details, the following theorem summarizes the results in Chapter 4 on re-covering a rank-r tensor from partial 1-bit observations.Theorem 2. Consider a rank-r, order-d tensor T ] ∈⊗di=1RN with ‖T ]‖∞≤α and a random subsetof its indices Ω⊂ [N]d where |Ω|= m. Suppose that we only observe 1-bit measurements of T ] onΩ according toYω =+1 with probability f (T]ω),−1 with probability 1− f (T ]ω)for ω ∈Ω. (1.5)Here f : R→ [0,1] is a previously-known well-defined monotone differentiable function. Definethe negative log-likelihood functionLΩ,Y (X) = ∑ω∈Ω(1[Yω=1] log(1f (Xω))+1[Yω=−1] log(11− f (Xω))).Then the minimizer of the M-norm constrained optimization problemTˆM = arg minXLΩ,Y (X) subject to ‖T‖∞≤ α,‖T‖M≤ (r√r)d−1α,satisfies‖T ]− TˆM‖2FNd≤Cβα{Lα(r√r)d−1α√dNm+Uα√log( 4δ )m},with probability greater than 1− δ . Here Lα , βα and Uα are constants that just depend on α .Moreover, the minimizer of the max-qnorm constrained optimization problemTˆmax = arg minXLΩ,Y (X) subject to ‖T‖∞≤ α,‖T‖max≤√rd2−dα,satisfies‖T ]− Tˆmax‖2FNd≤Cdβα{Lα√rd2−2α√dNm+Uα√log( 4δ )m},with probability greater than 1−δ . Furthermore, the minimizer of the nuclear-norm constrained11optimizationTˆnuc = arg minXLΩ,Y (X) subject to ‖X‖∞≤ α,‖X‖∗≤√rd−1Ndα,satisfies‖T ]− Tˆnuc‖2FNd≤Cd,α√rd−1(logd(N)√N√m+log(N)√Ndm),with probability at least 1− CNd−1 and the minimizer of the exact-rank constrained optimizationTˆr = arg minXLΩ,Y (X) subject to rank(X)≤ r,‖X‖∞≤ α,satisfies‖T ]− Tˆr‖2FNd≤ max(4α2Ndm,Cα,γαα2(2r)3d2 − 32√Ndm).with probability at least 1− exp(−cNd).1.7.2 OrganizationNext, we briefly explain the contents of each chapter. To explain the main contributions weconsider an order-d tensor T ] ∈ RN1×···×Nd where Ni = O(N) for 1≤ i≤ d.Chapter 2In this chapter, we review some of the basic definitions related to tensors. Specifically, we finda bound on the nuclear norm of bounded rank-r tensors which is tight in N by connecting the CP-decomposition with an orthogonal decomposition. Next, we define the M-norm and max-qnorm oftensors as a robust proxy for the rank of a tensor. We prove that both M-norm and max-qnorm ofa bounded low-rank tensor are bounded above by quantities that just depend on the tensor’s rankand its infinity norm and are independent of N. We use a generalization of Grothendieck’s theoremto connect the max-qnorm of a tensor to its nuclear decomposition with unit infinity-norm factors.Using this, we bound the Rademacher complexity of the set of bounded tensors with low max-qnorm. This also establishes a theoretical framework for further investigation of low max-qnormtensors.12Chapter 3In this section, we investigate the problem of tensor completion. In particular, we prove that,with high probability, m = O(r3d2 dN) (or m = O(R2N) if M-norm is bounded by R) samples aresufficient to estimate a rank-r bounded tensor using a convex least squares algorithm. Moreover,we derive an information-theoretic lower bound that proves m = O(R2N) measurements are nec-essary for recovering tensors with M-norm less than R. This proves that our bound is optimal bothin its dependence on N and the M-norm bound R. It is worth mentioning though, that the boundwe prove in this chapter is not necessarily optimal in r, the rank of the tensor.We also propose an alternating method to approximate the solution of the max-qnorm tensorcompletion. In an effort to get a better understanding of the behavior of the max-qnorm, we use thisalgorithm to investigate the max-qnorm of low-rank tensors whose factors are random Gaussianand random binary values. We also do a thorough investigation of the performance of the algorithmfor different values of rank, r, the maximum length of the tensor dimensions, N, and the numberof measurements, m. Through synthetic numerical examples, we illustrate the advantage of usingalgorithms designed for max-qnorm constrained tensor completion instead of matricization. Thesealgorithms significantly outperform algorithms based on matricization and alternating least squares(ALS).Chapter 4In this section, we investigate the problem of 1-bit tensor completion. First, we analyze 1-bittensor completion using M-norm constraints on the underlying tensor (this is a convex constraint).We prove that, with high probability, the mean squared error (MSE) of recovering a rank-r tensorT ] from m 1-bit measurements by solving an M-norm constrained optimization is O(√r3d−3√Ndm ).Moreover, we analyze a related non-convex function, max-qnorm, and prove that MSE of optimiz-ing a log-likelihood function constrained by max-qnorm is O(√rd2−d√Ndm ).Next, we use nuclear norm as a convex constraint. The theoretical sample complexity we ob-tain by using nuclear norm is not as tight as using M-norm and we show that with the tools we usehere (or in the matrix-case) there is no hope of getting tighter results. We believe that achieving atighter upper bound is only possible if we use nuclear norm as a regularizer with carefully chosenregularization parameters (and not as a constraint). We also analyze 1-bit tensor completion usingexact-rank constraints. The estimation error relies on bounding the M-norm of rank-r tensors andon upper bound on the dual-M-norm of a random tensor which might be of independent interest.13We also derive an information-theoretic lower bound that proves the MSE of any arbitrary al-gorithm is at leastΩ(r√Nm) for a rank-r tensor andΩ(R√Nm) for a tensor with M-norm less than R.This proves that our upper bound is optimal in N and the M-norm bound R (but not necessarily in r).Finally, we propose a numerical method to approximate the solution of max-qnorm constrained1-bit tensor completion and show its advantage over 1-bit matrix completion using synthetic data.Chapter 5In this chapter, we show some of the real-world applications of multi-bit and 1-bit tensor com-pletion. We first show the application of the tensor completion algorithm in Chapter 3 to imageinpainting, video completion, hyperspectral imaging and completing MRI images.In the second part of the chapter, we show some applications of the 1-bit tensor completionalgorithm. In particular, we concentrate on context-aware recommendation systems.14Chapter 2M-norm and max-qnorm of tensorsIn this chapter, we generalize some of the well-known norms defined for matrices to tensors.Some of the definitions are straightforward generalizations of matrix-norms. First, we explainsome notions of orthogonality in tensors and then we define Frobenius and nuclear norm of ten-sors. The rest of the chapter is devoted to defining and analyzing the max-qnorm and atomicM-norm which is an important pillar of the analysis in the rest of the thesis.2.1 Notions of OrthogonalityIn this section, we define some notions of orthogonality of tensors which we later use to boundnuclear norm of a bounded low-rank tensor. Before that though, notice that for X ,T ∈⊗dj=1RN j ,the inner product of X and T is defined as:〈X ,T 〉 :=N1∑i1=1N2∑i2=1· · ·Nd∑id=1X(i1, i2, · · · , id)T (i1, i2, · · · , id). (2.1)which is a direct generalization of matrix inner product. For the sake of simplicity, in the rest ofthis section, we assume N j = N for 1≤ j ≤ d.Definition 3. Two tensors X ,T ∈ RNd are orthogonal if 〈X ,T 〉= 0.Consider U := u(1) ◦ u(2) ◦ · · · ◦ u(d) and V := v(1) ◦ v(2) ◦ · · · ◦ v(d). In this case, 〈U,V 〉 =Πdj=1〈u( j),v( j)〉. In particular, two rank-1 tensors are orthogonal if and only if any two of theircorresponding components are orthogonal.15Definition 4. U and V are completely orthogonal, U ⊥c V , if u j ⊥ v j for all j ∈ {1, · · · ,d}. More-over, U and V are strongly orthogonal, U ⊥s V , if U ⊥V and for all 1≤ j ≤ d, either u( j) =±v( j)or u( j) ⊥ v( j).Definition 5. [Orthogonal rank decomposition [68]] Define rank⊥(T ) to be the minimal r, suchthat T can be written as the sum of r orthogonal unit tensors as follows:T =r∑i=1λiUi, 〈Ui,U j〉= 0 for i 6= j (2.2)where Ui ∈ RNd are unit tensors.Lemma 6. For any rank-r tensor T = ∑ri=1λiu(1)i ◦u(2)i ◦ · · · ◦u(d)i , we have rank⊥(T )≤ (rd−1)Proof of this lemma is provided in Section 2.4. In the next section we use the orthogonaldecomposition to bound the nuclear and Frobenius norm of a tensor whose rank⊥(T )≤ (rd−1).2.2 Frobenius and nuclear normThe Frobenius norm of a tensor is defined as‖T‖2F := 〈T,T 〉=N1∑i1=1N2∑i2=1· · ·Nd∑id=1T 2i1,i2,···,id . (2.3)Remember that a unit tensor U ∈⊗dj=1RN is a tensor that can be written as an outer productof d unit vectors. Using this definition, we define the spectral norm of a tensor as‖T‖:= maxU∈Ud〈T,U〉, (2.4)where Ud is the set of all unit tensors in RNd. The nuclear norm of a tensor, defined as the dual ofthe spectral norm, was originally formulated in [50, 107] and has been revisited in greater depth inthe past few years, e.g., in [35, 42, 57, 81]. To be precise,‖T‖∗:= max‖X‖≤1〈T,X〉. (2.5)In the matrix case, the nuclear norm is equal to the sum of the singular values of the matrix.However, a similar singular value decomposition is not available for tensors in general. What we16know so far is that calculating the nuclear norm of d-tensors with d > 2 is NP-hard [43] and similarto the matrix case, we can prove that‖T‖∗:= min{∑λi : T =∑λiu(1)i ◦u(2)i ◦ · · · ◦u(d)i , ‖ui‖= 1, ∀ i}. (2.6)Using the `1-norm and `2-norm of the vector of singular values of a matrix, we can prove thatfor any rank-r matrix M, ‖M‖∗≤√r‖M‖F . Using a combinatorial orthogonal decomposition, wecan find a similar bound for a general rank-r tensor.Theorem 7. For any rank-r tensor T , ‖T‖F≤√rd−1‖T‖ and ‖T‖∗≤√rd−1‖T‖F .We prove this theorem by connecting the CP-decomposition with the orthogonal rank decom-position. The details of the proof are in Section 2.4.2.2.3 Max-qnorm and atomic M-normIn this section, we introduce the max-qnorm and M-norm of tensors and characterize the unitball of these norms as tensors that have a specific decomposition with bounded factors. This thenhelps us to prove a bound on the max-qnorm and M-norm of low-rank tensors that is independentof N. We give an overview of properties of max-qnorm and M-norm and prove them in Section2.4.2.3.1 Matrix max-normFirst, we define the max-norm of matrices which was first defined in [82] as γ2 norm. Wealso mention some of the properties of the matrix max-norm which we generalize later on in thissection. Recall that the max-norm of a matrix is defined as‖M‖max= minM=U◦V{‖U‖2,∞‖V‖2,∞}, (2.7)where ‖U‖2,∞= sup‖x‖2=1‖Ux‖∞ is the maximum `2-norm of the rows of U [82, 116].Considering all the possible factorizations of a matrix M =U ◦V , the rank of M is the minimumnumber of columns in the factors and nuclear norm of M is the minimum product of the Frobeniusnorms of the factors. The max-norm, on the other hand, finds the factors with the smallest row-17norm as ‖U‖2,∞ is the maximum `2-norm of the rows of the matrix U . Furthermore, it was provedin [78] that max-norm is comparable with nuclear norm in the following sense:‖M‖max≈ inf{∑j|σ j|: M =∑jσ ju jvTj ,‖u j‖∞= ‖v j‖∞= 1}. (2.8)Here, the factor of equivalence is the Grothendieck’s constant KG ∈ (1.67,1.79). To be precise,inf∑ j|σ j|KG≤ ‖M‖max≤ inf∑j|σ j|, (2.9)where the infimum is taken over all nuclear decompositions M = ∑ jσ ju jvTj ,‖u j‖∞= ‖v j‖∞= 1.Moreover, in connection with element-wise `∞ norm we have:‖M‖∞≤ ‖M‖max≤√rank(M)‖M‖1,∞≤√rank(M)‖M‖∞. (2.10)This is an interesting result that shows that we can bound the max-norm of a low-rank matrix byan upper bound that is independent of N.2.3.2 Tensor max-qnorm and atomic M-normWe generalize the definition of max-norm to tensors as follows. Let T be an order-d tensor.Then‖T‖max:= minT=U (1)◦U (2)◦···◦U (d){d∏j=1‖U ( j)‖2,∞}. (2.11)Notice that this definition agrees with the definition of max-norm for matrices when d = 2. As inthe matrix case, the rank of the tensor is the minimum possible number of columns in the low-rankfactorization of T =U (1) ◦U (2) ◦ · · · ◦U (d) and the max-qnorm is the minimum product of the rownorms of the factors over all such decompositions.Theorem 8. For d ≥ 3, the max-qnorm (2.11) does not satisfy the triangle inequality. However, itsatisfies a quasi-triangle inequality‖X +T‖max≤ 2 d2−1(‖X‖max+‖T‖max),and, therefore, is a quasi-norm.The proof of this theorem is in Section 2.4.3. Later on, in Section 3.3, we prove that certain18max-qnorm constrained minimization problems, break the O(Nd2 ) limitation on the number ofmeasurements required for tensor completion and 1-bit tensor completion mainly because of twomain properties:• Max-qnorm of a bounded low-rank tensor does not depend on the size of the tensor.• Defining T± := {T ∈{±1}N1×N2×···×Nd | rank(T )= 1}, the unit ball of the tensor max-qnormis a subset of Cdconv(T±) which is a convex combination of 2Nd rank-1 sign tensors. HereCd is a constant that only depends on d and conv(S) is the convex envelope of the set S.However, the max-qnorm in non-convex. To obtain a convex alternative that still satisfies theproperties mentioned above, we consider the norm induced by the set T± directly; this is an atomicnorm as discussed in [25]. The atomic M-norm of a tensor T is then defined as the gauge of T±[105] given by‖T‖M:= inf{t > 0 : T ∈ t conv(T±)}. (2.12)As T± is centrally symmetric around the origin and spans⊗dj=1RN j , this atomic norm is a convexnorm and the gauge function can be rewritten as‖T‖M= inf{ ∑X∈T±cX : T = ∑X∈T±cX X ,cX ≥ 0,X ∈ T±}. (2.13)2.3.3 Unit max-qnorm ball of tensorsIn the next lemma, we prove that, similar to the matrix case, the tensor unit max-qnorm ball iscomparable to the set T±. First define BTmax(1) := {T ∈ RN1×···×Nd | ‖T‖max≤ 1} and BM(1) :={T : ‖T‖M≤ 1}.Lemma 9. The unit ball of the max-qnorm, unit ball of atomic M-norm, and conv(T±) satisfy thefollowing:• BM(1) = conv(T±),• BTmax(1) ⊂ c1cd2 conv(T±).Here c1 and c2 are derived from the generalized Grothendieck theorem [15, 121] which is ex-plained thoroughly in Section 2.4.3.19Using Lemma 9, it is easy to analyze the Rademacher complexity of the unit ball of thesetwo norms. In fact, noticing that T± is a finite class with |T±|< 2dN and some basic properties ofRademacher complexity we can prove the following lemma. Below, RˆS(X) denotes the empiri-cal Rademacher complexity of X . To keep this section simple, we refer to Section 2.4.5 for thedefinition of Rademacher complexity and proof of Lemma 10.Lemma 10. The empirical Rademacher complexities of the M-norm and max-qnorm unit balls arebounded by• supS:|S|=mRˆS(BM(1))< 6√dNm ,• supS:|S|=mRˆS(BTmax(1))< 6c1cd2√dNm .2.3.4 Max-qnorm and M-norm of bounded low-rank tensorsNext, we bound the max-qnorm and M-norm of a rank-r tensor whose (entry-wise) infinitynorm is less than α . First, we bound the max-qnorm and a similar proof can be used to obtain abound on the M-norm as well which we explain in the Section 2.4.6. As mentioned before, ford = 2, i.e., the matrix case, an interesting inequality has been proved which does not depend on thesize of the matrix, i.e., ‖M‖max≤√rank(M) α [116]. In what follows, we bound the max-qnormand M-norm of a rank-r tensor T with ‖T‖∞≤ α .Theorem 11. Assume T ∈ RN1×···Nd is a rank-r tensor with ‖T‖∞= α . Then• α ≤ ‖T‖M≤ (r√r)d−1α.• α ≤ ‖T‖max≤√rd2−dα.The proofs of these two bounds are similar and both of them can be found in Section 2.4.6. Noticethe discrepancy of Theorem 11 when d = 2. This is an artifact of the proof which hints at the factthat Theorem 11 might be not optimal in r for general d as well.Remark 12. When d = 2, the atoms of the M-norm are rank-1 sign matrices. Therefore, flatlow-rank matrices have small M-norm. Moreover, M-norm and max-norm are equivalent and theM-norm of a matrix M with rank(M) = r and ‖M‖∞≤ α is bounded by KGα√r, where KG ∈{1.671.79} is the Grothendiecks constant [78].202.4 Proofs2.4.1 Proof of Lemma 6To find an orthogonal rank decomposition we find an orthonormal basis for the sets spannedby corresponding components of U1, · · · ,Ur. In particular, define S j := {u( j)1 ,u( j)2 , · · · ,u( j)r } for1≤ j ≤ d. Next, define B j := {b( j)1 ,b( j)2 , · · · ,b( j)r } to be an orthonormal basis for the set span{S j}.Hence, for 1≤ i≤ r and 1≤ j≤ d, we can write u( j)i =∑rk=1σ ( j)i,k b( j)k . Now by expanding each u( j)iin T =∑ri=1λiu(1)i ◦u(2)i ◦· · ·◦u(d)i , we can get T =∑ri1=1∑ri2=1 · · ·∑rid=1σi1,···,id b(1)i1 ◦b(2)i2 ◦· · ·◦b(d)idwhere σi1,···,id =∑ri=1λiσ(1)i,i1 σ(2)i,i2 · · ·σ(d)i,id . Hence we can write T =∑rdi=1λiTi where Ti’s are stronglyorthogonal.Notice that finding an orthogonal decomposition suffices and we can improve the result slightlyby dropping the strong orthogonality of the sub-tensors. To do this notice that we can combine allthe Ti’s that only differ in the first factor, i.e., defining X j2,···, jd = (∑ri1=1σi1, j2···, jd b(1)i1 )◦b(2)j2 ◦ · · · ◦b(d)jd , we can observe thatT =r∑j2=1r∑j3=1· · ·r∑jd=1X j2,···, jd =rd−1∑i=1λiXi,such that any two different Xi and X j differ at least in one of the factors 2 to d and therefore, areorthogonal to each other.2.4.2 Proof of Theorem 7To prove this theorem, we first bound the Frobenius norm and nuclear norm of a tensor withrank⊥(T ) = r and then use Lemma 6 to finish the proof.Lemma 13. For a tensor T with rank⊥(T ) = r, we have ‖T‖F≤√r‖T‖ and ‖T‖∗≤√r‖T‖F .Proof: Assume that T has the strongly orthogonal decomposition T = ∑ri=1λiTi. Notice thatfor 1 ≤ i ≤ r, Ti is a rank-1 unit tensor with ‖Ti‖= ‖Ti‖F= ‖Ti‖∗= 1. Tensors Ti are mutuallyorthogonal. Therefore, 〈T,Ti〉= λi and we can conclude that‖T‖≥maxri=1|λi|. (2.14)On the other hand, ‖T‖2F= 〈T,T 〉= 〈∑ri=1λiTi,∑ri=1λiTi〉 and as 〈Ti,Tj〉= 0 for i 6= j we con-21clude that‖T‖F=√r∑i=1λ 2i . (2.15)Next, we bound the nuclear norm. Notice that ‖T‖∗= ‖∑ri=1λiTi‖∗≤ ∑ri=1|λi|‖Ti‖∗ whichmeans that‖T‖∗≤r∑i=1|λi|. (2.16)The lemma can be verified using (2.14), (2.15), and (2.16).Combining Lemma 6 and Lemma 13 finishes the proof.2.4.3 Proof of Theorem 8Notice that for any tensors T and X , there exist decompositions T = ©i=di=1U (i) and X =©i=di=1V (i), where ‖U ( j)‖2,∞≤ (‖T‖max)1d and ‖V ( j)‖2,∞≤ (‖X‖max) 1d . Moreover, one way to fac-torize the tensor X +T is by concatenating the factors of X and T as X +T =©i=di=1[U (i),V (i)] andtherefore,‖X +T‖max ≤Πdj=1‖[U ( j),V ( j)]‖2,∞≤ (√‖X‖2dmax+‖T‖2dmax)d ≤ 2 d2−1(‖X‖max+‖T‖max),which proves that max-qnorm is a quasi-norm. Notice that the last inequality follows from theinequality |a+b|p≤ 2p−1(|a|p+|b|p) for p> 1. It is easy to check that the max-qnorm satisfies thetriangle inequality, when d = 2. However, this is not true for d > 2. Next, we prove this for d = 3and higher-order cases can be proven similarly.The main challenge in proving that the max-qnorm does not satisfy the triangle-inequality whend = 3 is that the size of the factors is not fixed. However, it can be observed from the followingsimple counter-example. Let T = T1+T2, where T1 =[10]◦[10]◦[10], and T2 =[11]◦[11]◦[11],and note that T is a rank-2, 2×2×2 tensor. Here, T1 and T2 are rank-1 tensors with ‖T1‖max= 1 and‖T2‖max= 1 (notice that for any T , ‖T‖max≥ ‖T‖∞). Therefore, if max-qnorm satisfies triangle-inequality, then ‖T‖max cannot exceed 2. In what follows we prove that this is not possible. Noticethat even though T is a small tensor with only 8 elements we don’t have a straightforward way tocalculate its max-qnorm. However, we can bound it from above and below. If ‖T‖max≤ 2, thenthere exists a decomposition T = U (1) ◦U (2) ◦U (3) such that ‖T‖max= ∏3j=1‖U ( j)‖2,∞≤ 2, and22with a simple rescaling of the factors,‖U (1)‖2,∞≤√2, ‖U (2)‖2,∞≤√2, ‖U (3)‖2,∞≤ 1. (2.17)First, notice that T is an all-ones tensor except for one entry where T (1,1,1) = 2. Defining thegeneralized inner product as〈x1, · · · ,xd〉 :=k∑i=1d∏j=1x j(i), (2.18)this means that〈U (1)(1, :),U (2)(1, :),U (3)(1, :)〉= 2. (2.19)Using Cauchy-Schwarz〈U (1)(1, :),U (2)(1, :),U (3)(1, :)〉 ≤ ‖U (1)(1, :)‖ ‖U (2)(1, :)‖ ‖U (3)(1, :)‖∞. (2.20)Combining (2.17), (2.19), and (2.20), we get2≤ ‖U (1)(1, :)‖ ‖U (2)(1, :)‖≤ ‖U (1)‖2,∞ ‖U (2)‖2,∞≤ 2,which together with (2.17) proves that‖U (1)(1, :)‖=√2, and ‖U (2)(1, :)‖=√2. (2.21)Moreover, similarly2≤ 2‖U (3)(1, :)‖∞≤ 2 ⇒ ‖U (3)(1, :)‖∞= 1.Notice that ‖U (3)(1, :)‖≤ 1, and ‖U (3)(1, :)‖∞= 1 which proves that U (3)(1, :) is an all zeros vectorwith a single non-zero entry of one. Remember that the number of columns of U (3) is arbitrary.Without loss of generality, we can assumeU (3)(1, :) = (1,0, · · · ,0). (2.22)Combining this with (2.19), and (2.21) we can also prove thatU (1)(1, :) =U (2)(1, :) = (√2,0, · · · ,0). (2.23)23Now from T (1,1,2) = 1 and the above two equations we have to have U (3)(2,1) = 12 and similarlyU (2)(2,1) = 1√2. Finally T (1,2,2) = U (1)(1,1) U (2)(2,1) U (3)(2,1) =√2 1√212 =12 which is acontradiction.2.4.4 Proof of Lemma 9Characterization of the unit ball of the atomic M-norm follows directly from (2.12). By defini-tion, any tensor T with ‖T‖M≤ 1 is a convex combination of the atoms of T±, T =∑X∈T± cX X ,cX >0 with ∑X∈T± cX = 1. This proves that BM(1) = conv(T±).To characterize the unit ball of max-qnorm, we use a generalization of Grothendieck’s the-orem to higher-order tensors [15, 121]. First, we generalize the matrix ‖·‖∞,1 norm (‖M‖∞,1:=sup‖x‖∞=1‖Mx‖1) as:Definition 14. ‖T‖∞,1:= sup‖x1‖∞,···,‖xd‖∞≤1|∑Ni1=1 · · ·∑Nid=1 T (i1, · · · , id)x1(i1) · · ·xd(id)|.Theorem 15 (Generalized Grothendieck theorem). Let T be an order-d tensor such that‖T‖∞,1 = sup‖x1‖∞,···,‖xd‖∞≤1|N∑i1=1· · ·N∑id=1T (i1, · · · , id)x1(i1) · · ·xd(id)|≤ 1,and let u ji j ∈ Rk,1≤ j ≤ d,1≤ i j ≤ N be N×d vectors such that ‖uji j‖≤ 1. Then|N∑i1=1· · ·N∑id=1T (i1, · · · , id)〈u1i1,u2i2, · · · ,udid〉|≤ c1cd2, (2.24)where 〈u1i1 ,u2i2, · · · ,udid〉 is the generalized inner product of u1i1,u2i2, · · · ,udid as defined in (2.18). Here,c1 ≤ KG5 and c2 ≤ 2.83.Now we use Theorem 15 to prove Lemma 9.Proof of Lemma 9: The dual norm of the max-qnorm is‖T‖∗max= max‖U‖max≤1〈T,U〉= max‖u1i1‖,···,‖udid ‖≤1N∑i1=1· · ·N∑id=1T (i1, · · · , id)〈u1i1 ,u2i2, · · · ,udid〉. (2.25)Above, the length of the vectors u1i1, · · · ,udid is not constrained (to clarify, notice that the length ofthese vectors are larger or equal to the rank of U). Using Theorem 15, ‖T‖∗max≤ c1cd2‖T‖∞,1. On24the other hand, in the special case when u1i1 , · · · ,udid ∈ R the right-hand side of (2.25) is equal to‖T‖∞,1. Therefore, ‖T‖∞,1≤ ‖T‖∗max. Taking the dual:‖T‖∗∞,1c1cd2≤ (‖T‖∗max)∗ ≤ ‖T‖∗∞,1 (2.26)Notice that the max-qnorm, defined in (2.11) is a quasi-norm and therefore, (‖T‖∗max)∗ is not equalto ‖T‖max. However, notice that the max-qnorm is absolutely homogeneous and therefore,(‖T‖∗max)∗ = max‖Z‖∗max≤1〈T,Z〉 ≤ ‖T‖max.which implies that‖T‖∗∞,1c1cd2≤ ‖T‖max. (2.27)To calculate the unit ball of ‖.‖∗∞,1, notice that the argument of the supremum in Definition 14 islinear in each variable x j(i j) and as −1≤ x j(i j)≤ 1, the supremum is achieved when x j(i j) =±1which means that ‖T‖∞,1= supU∈T±|〈T,U〉|. Therefore, conv(T±) is the unit ball of ‖.‖∗∞,1 and Lemma9 (ii) follows from (2.27).2.4.5 Proof of Lemma 10A technical tool that we use in the proof of our main results involves data-dependent estimatesof the Rademacher and Gaussian complexities of a function class. We refer to [8, 116] for a de-tailed introduction of these concepts.Two important properties that will be used in the following lemma is: First, if F⊂G, thenRˆS(F)≤RˆS(G) and second is RˆS(F) = RˆS(conv(F)).Lemma 16. supS:|S|=mRˆS(BM(1))< 6√dNmProof: By definition, BM(1) = conv(T±), and T± is a finite class with |T±|< 2dN . Therefore,RˆS(T±)<√72dN+log|S||S| [114] which concludes the proof.Lemma 17. supS:|S|=mRˆS(BTmax(1))< 6c1cd2√dNmProof: By Lemma 9, BTmax(1) ⊂ c1cd2 conv(T±) and we have RˆS(T±) <√72dN+log|S||S| . Taking25the convex hull of this class and using |S|= m ≤ Nd and scaling by c1cd2 we get RˆS(BTmax(1)) ≤6c1cd2√dNm .2.4.6 Proof of Theorem 11In order to prove the tensor max-qnorm bound, we first sketch the proof of [103] for the matrixcase. That is, assuming that M ia s matrix with rank(M) = r and ‖M‖∞≤ α , we show that thereexists a decomposition M =U ◦V where U ∈RN1×r,V ∈RN2×r and ‖U‖2,∞≤√r,‖V‖2,∞≤ α . Toprove this, we first state a version of the John’s theorem [103].Theorem 18 (John’s theorem [64]). For any full-dimensional symmetric convex set K⊆Rr (dim(K)=r)and any ellipsoid E ⊆Rr (defined with respect to `2 norm) that is centered at the origin, there existsan invertible linear map S so that E ⊆ S(K)⊆√rE.Theorem 19. [103, Corollary 2.2] For any rank-r matrix M ∈ RN1×N2 with ‖M‖∞≤ α there existvectors u1, · · · ,uN1,v1, · · · ,vN2 ∈ Rr such that 〈ui,v j〉= Mi, j and ‖ui‖≤√r and ‖v j‖≤ α .The proof is based on considering any rank-r decomposition of M = X ◦Y where X ∈ RN1×rand Y ∈ RN2×r and Mi, j = 〈xi,y j〉. Defining K to be the convex hull of the set {±xi : i ∈ [N1]}.Then using the linear map S in John’s Theorem for the set K with the ellipsoid E = Br := {x ∈Rr : ‖x‖2≤ 1}, the decomposition M = (XS)◦(Y S−1) satisfies the conditions of Theorem 19 [103].The following lemma proves the existence of a nuclear decomposition for bounded rank-rtensors, which can be used directly to bound the M-norm of a bounded rank-r tensor.Lemma 20. Any order-d, rank-r tensor T , with ‖T‖∞≤ α can be decomposed into rd−1 rank-onetensors whose components have unit infinity norm such thatT =rd−1∑j=1σ ju1j ◦u2j ◦ · · · ◦udj , ‖u1j‖∞, · · · ,‖udj‖∞≤ 1, withrd−1∑j=1|σ j|≤ (r√r)d−1α. (2.28)Proof: We prove this lemma by induction. The proof for d = 2 follows directly from applyingJohn’s theorem to a rank-r decomposition of T , i.e., T = XS ◦Y S−1 where T = X ◦Y [103]. Thisis summarized in Theorem 19 above as well.Now assume an order-d tensor which can be written as T =∑rj=1λ jv1j ◦v2j ◦· · ·◦vdj and ‖T‖∞≤α . Matricizing along the first dimension results in T[1] = ∑ri=1(λiu(1)i ) ◦ (u(2)i ⊗·· ·⊗ u(d)i ). Using26Matlab notations we can write T[1] =U ◦V where U(:, i) = λiu(1)i ∈ RN1 , and V (:, i) = u(2)i ⊗·· ·⊗u(d)i ∈ RΠk=dk=2Nk .Using John’s theorem, there exists an S ∈ Rr×r where T[1] = X ◦Y where X =US, Y = V S−1,‖X‖∞≤ ‖X‖2,∞≤√r, and ‖Y‖∞≤ ‖Y‖2,∞≤ α . Furthermore, each column of Y is a linear combi-nation of the columns of V , i.e., there exist ζ1, · · ·ζr such that Y (:, i) = ∑rj=1 ζ j(u(2)j ⊗·· ·⊗ u(d)j ).Therefore, unfolding i-th column of Y into a (d− 1)-dimensional tensor Ei ∈ RN2×···×Nd wouldresult in a rank-r, (d− 1)-dimensional tensor with ‖Ei‖∞≤ ‖Y‖∞≤ α . By induction, Ei can bedecomposed into rd−2 rank-one tensors with bounded factors, i.e.,Ei =rd−2∑j=1σi, jv1i, j ◦ v2i, j ◦ · · · ◦ vdi, j, ‖vi, j‖∞≤ 1, ∑|σi, j|≤ (r√r)d−2α.Going back to the original tensor, as T[1]= X ◦Y , we also have T =∑ri=1 X(:, i)◦(∑rd−2j=1 σi, jv1i, j ◦v2i, j ◦ · · · ◦ vdi, j). Notice that ‖X(:, i)‖∞≤√r. Therefore, we can rewriteT =r∑i=1rd−2∑j=1(σi, j‖X(:, i)‖∞) X(:, i)‖X(:, i)‖∞ ◦ v1i, j ◦ v2i, j ◦ · · · ◦ vdi, j.By rearranging, we get T = ∑rd−1k=1 σku1K ◦u2k ◦ · · · ◦udk , ‖u1k‖∞, · · · ,‖udk‖∞≤ 1 andrd−1∑k=1|σk|=r∑i=1rd−2∑j=1σi, j‖X(:, i)‖∞≤r∑i=1√rrd−2∑j=1σi, j ≤r∑i=1√r((r√r)d−2α)= (r√r)d−1α,which concludes the proof of Lemma 20. This lemma can be used directly to bound the M-normof a bounded rank-r tensor.Next, we bound the max-qnorm of a bounded rank-r tensor. The following lemma proves theexistence of a nuclear decomposition for bounded rank-r tensors, which can be used directly tobound their max-qnorm. As the max norm is one-homogeneous, without loss of generality weassume ‖T‖∞≤ 1.Lemma 21. Any order-d, rank-r tensor T ∈⊗di=1RNi , with ‖T‖∞≤ 1 can be decomposed into27rd−1 rank-one tensors, T = ∑rd−1j=1 u1j ◦u2j ◦ · · · ◦udj , where:rd−1∑j=1(ukj(t))2 ≤ rd−1 for any 1≤ k ≤ d, 1≤ t ≤ Nk. (2.29)Notice that√∑rd−1j=1 (ukj(t))2 is the spectral norm of j-th row of k-th factor of T , i.e.,∑rd−1j=1 (ukj(t))2≤rd−1 means that the two-infinity norm of the factors is bounded by√rd−1.Remark 22. [Proof by Lemma 20] At the end of this subsection, we provide a proof for thelemma as stated above. However, using the decomposition obtained in Lemma 20, we can finda decomposition with ∑rd−1j=1 (ukj(t))2 ≤ rd . To do this notice that by Lemma 20, and defining~σ := {σ1, · · · ,σrd−1} we can writeT =rd−1∑j=1σ jv1j ◦ v2j ◦ · · · ◦ vdj , ‖v1j‖∞, · · · ,‖vdj‖∞≤ 1, with ‖~σ‖1≤ (r√r)d−1.Now defineukj := (σ j)1d vkj for any 1≤ k ≤ d, 1≤ t ≤ Nk.It is easy to check that T = ∑rd−1j=1 u1j ◦u2j ◦ · · · ◦udj andrd−1∑j=1(ukj(t))2 =rd−1∑j=1σ2dj (vkj(t))2 ≤rd−1∑j=1σ2dj = ‖~σ‖2d2d.Using Holder’s inequality, when d ≥ 2 we haverd−1∑j=1(ukj(t))2 = ‖~σ‖2d2d≤ ‖~σ‖2d1 (rd−1)1−2d ≤ r 3d−3d r (d−1)(d−2)d = r (d−1)(d+1)d ≤ rdThis proves an upper bound which is close to the one in the lemma. To get a more optimal upperbound (the one stated in the Lemma 21) we need to go over the induction steps as explained below.Proof of Lemma 21: We prove this lemma by induction. The proof for d = 2 follows directly fromapplying John’s theorem to a rank-r decomposition of T , i.e., T =XS◦Y S−1 where T =X ◦Y . Nowassume an order-d tensor which can be written as T = ∑rj=1 u1j ◦u2j ◦ · · · ◦udj and ‖T‖∞≤ 1. Matri-cizing along the first dimension results in T[1] = ∑ri=1(u(1)i ) ◦ (u(2)i ⊗·· ·⊗ u(d)i ). Using matrix no-tation we can write T[1]=U ◦V where U(:, i) = u(1)i ∈RN1 , and V (:, i) = u(2)i ⊗·· ·⊗u(d)i ∈RΠk=dk=2Nk .28Using John’s theorem, there exists an S ∈ Rr×r where T[1] = X ◦Y where X =US, Y = V S−1,‖X‖2,∞≤√r, and ‖Y‖∞≤ ‖Y‖2,∞≤ 1. More importantly, each column of Y is a linear combinationof the columns of V . More precisely, there exist ζ1, · · ·ζr such that Y (:, i) = ∑rj=1 ζ j(u(2)j ⊗·· ·⊗u(d)j ). Therefore, unfolding i-th column of Y into a (d− 1)-dimensional tensor Ei ∈ RN2×···×Ndwould result in a rank-r, (d−1)-dimensional tensor with ‖Ei‖∞≤ ‖Y‖∞≤ 1. By induction, Ei canbe decomposed into rd−2 rank-one tensors where Ei =∑rd−2j=1 v2i, j◦v3i, j◦· · ·◦vdi, j, where∑rd−2j=1 (vki, j(t))2≤rd−2 for any 2≤ k≤ d and any 1≤ t ≤ Nk. Notice that the factors start from v2i, j to emphasize thatE is generated from the indices in the dimensions 2 to d.Going back to the original tensor, as T[1] = X ◦Y , we can writeT =r∑i=1X(:, i)◦ (rd−2∑j=1v2i, j ◦ v3i, j ◦ · · · ◦ vdi, j).By distributing the outer product we get T = ∑ri=1∑rd−2j=1 X(:, i)◦ v2i, j ◦ v3i, j ◦ · · · ◦ vdi, j. Renamingthe vectors in the factors we getT =rd−1∑k=1u1k ◦u2k ◦ · · · ◦udk .Now we bound the max norm of T using this decomposition by considering each factor sepa-rately using the information we have about X and Eis.Starting from the first factor, notice that ‖X‖2,∞≤√r or more precisely ∑ri=1 X(t, i)2 ≤ r forany 1≤ t ≤ N1. By careful examining of the two decompositions of T stated above, we getu1k = X(:,mod(k−1,r)+1)and thereforerd−1∑k=1(u1k(t))2 = rd−2r∑i=1X(t, i)2 ≤ rd−2r = rd−1, for any 1≤ t ≤ N1, (2.30)which proves the lemma for the vectors in the first dimension of the decomposition.29For the second dimension, define j := mod(k−1,rd−2)+1, and j := k− jrd−2 +1. Thenu2k = vki, j,and therefore,rd−1∑k=1(u2k(t))2 =r∑i=1rd−2∑j=1(v2i, j(t))2 ≤r∑i=1rd−2 ≤ rd−1, for any 1≤ t ≤ N2, (2.31)which finishes the proof of the lemma for the vectors in the second dimension. All the other di-mensions can be bounded in an exactly similar way to the second dimension.The bound on the max-qnorm of a bounded rank-r tensor follows directly from Lemma 21 anddefinition of tensor max-qnorm.Remark 23. In both lemmas 20 and 21, we start by decomposing a tensor T =U1 ◦U2 ◦ · · · ◦Udinto T[1] =U1 ◦V and generating K (in the John’s theorem) by the rows of the factor U1. Noticethat John’s theorem requires the set K to be full-dimensional. This condition is satisfied in thematrix case as the low-rank decomposition of a matrix (with the smallest rank) is full-dimensional.However, this is not necessarily the case for tensors. In other words, the matricization along adimension might have smaller rank than the original tensor. To take care of this issue, considera factor Uadd with the same size of U1 such that U1 +Uadd is full-dimensional. Now the tensorTε = T + εUadd ◦U2 ◦ · · · ◦Udε satisfies the conditions of the John’s theorem and by taking ε tozero we can prove that ‖T‖M= ‖Tε‖M and ‖T‖max= ‖Tε‖max. Notice that M-norm is convex andmax-qnorm satisfies ‖X +T‖max≤ (√‖X‖2dmax+‖T‖2dmax)d .30Chapter 3Tensor completionIn many applications, acquiring all the indices of a tensor is either impossible or too expen-sive, which results in having access to just a subset of the entries of the tensor. In this section,we consider the problem of tensor completion from noisy measurements of a subset of its entries.As explained before, we assume that the indices of the entries that are measured are drawn inde-pendently at random with replacement. Also, the tensor of interest is low-rank and has boundedentries. Instead of constraining the problem to the set of low-rank bounded tensors, we consider amore general case and consider the set of bounded tensors with bounded M-norm which includesthe set of low-rank bounded tensors. We minimize a constrained least squares (LS) problem givenin (3.5) below. Similar results can be obtained for a max-qnorm constrained LS. We only pro-vide the final result of the max-qnorm constrained problem in Theorem 29 as the steps are exactlysimilar to the M-norm constrained one. When d = 2, i.e., the matrix case, max-norm constrainedmatrix completion has been thoroughly studied in [19], so we will not discuss the lemmas andtheorems that can be directly used in the tensor case; see [19] for more details.3.1 Introduction and past resultsThe problem of matrix completion, i.e., recovering a matrix from partial noisy measurements ofa subset of its entries arises in a wide variety of practical applications, including collaborative fil-tering [47], sensor localization [14, 113], system identification [84] and seismic data interpolation[3]. The low-rank structure of a matrix makes it possible to complete it by subsampling a numberof entries which is much smaller than the ambient dimension of the matrix. Matrix completion isuseful for applications where acquiring the full data exactly is either expensive or impossible dueto physical limitations.31Assuming that we have access to m random entries of a matrix M], the set of indices of whichis denoted by Ω, where |Ω|= m, the matrix completion problem entails finding a matrix Mˆ withsmallest rank that agrees with the m obtained samples, i.e., defining MΩ to be the projection of Monto the set of matrices whose all non-zero entries are in Ω,Mˆ := arg min rank(M) s.t. MΩ = M]Ω.This problem has been extensively studied in the literature [17, 22, 23, 66]. In general, rankminimization is NP-hard. Therefore, one of the most common approaches is minimizing convexsurrogates of the rank, such as nuclear norm [17, 21, 23], and max-norm [19, 40]. Both theseconvex regularizers can be used to recover the underlying matrix with as few as O(rN log(N))measurements which is close to the number of free variables of a rank-r matrix in RN×N which isO(Nr).Using max-norm for learning low-rank matrices was pioneered in [118] where max-norm wasused for collaborative prediction. In this chapter, we use max-qnorm for tensor completion whichis a generalization of a recent result on matrix completion using max-norm constrained optimiza-tion [19]. First, we review some of the results which are related to M-norm and max-qnorm tensorcompletion. In particular, we first go over some of the matrix completion results including usingnuclear norm and max-norm and then review some of the results on tensor completion.Inspired by the result of [38], which proved that the nuclear norm is the convex envelope ofthe rank function, most of the research on matrix completion has focused on using nuclear-normminimization. Assuming M] to be a rank-r, N×N matrix, and M]Ω to be the set of m independentsamples of this matrix, in [22, 114], it was proved that solvingMˆ := argmin ‖M‖∗ subject to MΩ = M]Ω, (3.1)recovers the matrix M] exactly if |Ω|>CN1.2r log(N), provided that the row and column space ofthe matrix is “incoherent”. This result was later improved in [66] to |Ω|=O(Nr log(N)). There hasbeen significant research in this area since then, either in sharpening the theoretical bound, e.g.,[12, 23, 104] or designing efficient algorithms to solve (3.1), e.g., [17, 61].More relevant to noisy tensor completion are the results of [19, 21, 67] which consider recov-32ering M] from measurements YΩ, where Y = M]+Z, and |Ω|= m; here Z is a noise matrix. It wasproved in [21] that if ‖ZΩ‖F≤ δ , by solving the nuclear-norm minimization problemargmin ‖M‖∗ subject to ‖(M−Y )Ω‖F≤ δ ,we can recover Mˆ where1N‖M]− Mˆ‖F≤C√Nmδ +2δN,provided that there are sufficiently many measurements for perfect recovery in the noiseless case.Another approach was taken by [67] where the authors assume that ‖M]‖∞≤ α , and Z is azero-mean random matrix whose entries are i.i.d. with subgaussian-norm σ . They then suggestinitializing the left and right-hand singular vectors (L and R) from the observations YΩ and provethat by solvingminL,S,R12‖M]−LSR′‖2F subject to L′L = Ir,R′R = Ir,one can recover a rank-r matrix Mˆ where1N‖M]− Mˆ‖F≤Cα√Nrm+C′σ√Nrα log(N)m.Inspired by promising results of using max-norm for collaborative filtering [118], a max-normconstrained optimization was employed in [40] to solve the noisy matrix completion problem underthe uniform sampling assumption. Nuclear norm minimization has been proven to be rate-optimalfor matrix completion. However, it is not entirely clear if it is the best approach for non-uniformsampling. In many applications, such as collaborative filtering, the uniform sampling assumptionis not a reasonable assumption. For example, in the Netflix problem, some movies get much moreattention and therefore have more chance of being rated compared to others. To tackle the issue ofnon-uniform samples, using a weighted nuclear norm was suggested in [91], imposing probabilitydistributions on samples belonging to each row or column. Due to similar considerations, [19]generalized the max-norm matrix completion to the case of non-uniform sampling and provedthat, with high probability, m = O(Nrε log3(1ε )) samples are sufficient for achieving mean squaredrecovery error ε , where the mean squared error is dependent on the distribution of the observations.To be more precise, in their error bound, indices that have higher probability of being observed arerecovered more accurately compared to the entries that have less probability of being observed. Inparticular, [19] assumed a general sampling distribution as explained in Section 3.2 (when d = 2)33that includes both uniform and non-uniform sampling. Assuming that each entry of the noisematrix is an independent zero-mean Gaussian random variable with variance σ , and ‖M]‖∞≤ α ,they proved that the solution Mˆmax ofmin‖M‖max≤√rα‖(M]−M)Ω‖2F ,assuming each entry gets observed with probability larger than 1µN2 on average, satisfies1N2‖Mˆmax−M]‖2F≤Cµ(α+σ)α√rNn,with probability greater than 1− 2e−dN . This section is a generalization of the above result totensor completion.Finally, we explain some of the past results on tensor completion. To our knowledge, thischapter provides the first result that proves linear dependence of the sufficient number of randomsamples on N when the measurements are taken independently at random. It is worth mentioning,though, that [73] proves that O(Nrd−0.5d log(r)) adaptively chosen samples is sufficient for exactrecovery of tensors. However, the result is heavily dependent on the samples being adaptive.There is a long list of heuristic algorithms that attempt to solve the tensor completion problemby using different decompositions or matricizations which, in spite of showing good empiricalresults, do not do a theoretical comparison of tensor completion and matrix completion [49, 83].The most popular approach is minimizing the sum of nuclear norms of all the matricizations of thetensor along all modes. To be precise one solvesminXd∑i=1βi‖X(i)‖∗ subject to XΩ = T ]Ω, (3.2)where X(i) is the mode-i matricization of the tensor (see [44, 83, 111, 120]). The result obtainedby solving (3.2) is highly sensitive on the choice of the weights βi and an exact recovery result isnot available. At least in the special case of tensor sensing, where the measurements of the tensorare its inner products with random Gaussian tensors, [90] proves that m = O(rNd−1) is necessaryfor (3.2), whereas a more balanced matricization such as X[b d2 c] (as explained in Section 1.5) canachieve successful recovery with m = O(rbd2 cNdd2 e) Gaussian measurements.34Assuming T ] is symmetric and has an orthogonal CP decomposition, it was proved in [60] thatwhen d = 3, an alternating minimization algorithm can achieve exact recovery from O(r5N32 log(N)4)random samples. However, the empirical results of this work show good results for non-symmetrictensors as well if a good initial point can be found.Decompositions such as Tucker [122], Hierarchical Tucker [48], higher-order svd [32] and ten-sor train [95] which have more expressive power (i.e., more free variable) than CP decompositionhas been considered in various settings and has been applied to various applications. In [72], ten-sors with fixed multi-linear rank are estimated by performing Riemannian optimization techniqueson the manifold of tensors of fixed multi-linear rank. An algorithm based on Tucker decompositionwas proposed in [39] which is numerically shown to perform well even when the multi-linear ranksare unknown. In [131], a generalization of the singular value decomposition for tensors, called t-SVD, is used to prove that a third order tensor (d = 3) can be recovered from O(rN2 log(N)2)measurements, provided that the tensor satisfies some incoherence conditions called tensor inco-herence conditions. In [5], a heuristic algorithm was proposed that can find an approximationto the tensor in the hierarchical Tucker format by inspecting O(dr3 + d log(d)rN) entries of thetensor adaptively. An efficient optimization framework for problems whose solutions are well-approximated by Hierarchical Tucker (HT) tensors was developed in [29]. Finally, the tensor traindecomposition was used for tensor completion in [56, 106]In short, any method that is based on naive matricization can not improve the O(Ndd2 e) requiredmeasurements bottleneck. The large gap between the theoretical guarantees and the number offree variables of the set of low-rank tensors (O(rNd)) motivated a lot of researchers to considerformulations that are not based on matricization (including the work in this chapter).A noteworthy paper is an interesting theoretical result that minimizes the nuclear norm of atensor defined as the dual of the spectral norm and avoids any kind of matricization [129]. Theyshow that using the nuclear norm, the sample size requirement for a tensor with low coherence us-ing nuclear norm is m=O(√rNd log(N)) which is still far away from the number of free variables.Comparing our result with the result of [129], an important question that needs to be investigated iswhether max-qnorm is a better measure of the complexity of low-rank tensors compared to nuclearnorm or whether the difference is just an artifact of the proofs. We investigate this question in moredetail in Chapter 4. Another difficulty of using tensor nuclear norm is the lack of sophisticated oreven approximate algorithms that can minimize the nuclear norm of a tensor. We propose a factor-35frob function in Chapter 4 which can be considered as a proxy for nuclear norm but needs moreinvestigation.Using polynomial-time algorithms was considered in [7] with two important results. First,for third-order tensors they show that if m = O(N32 r) then there is a polynomial time algorithmfor completing it. Further, they show a connection between noisy tensor completion and stronglyrefuting random 3-SAT and 3-XOR formulas which hints that polynomial-time algorithms mightnot be able to achieve a sample complexity better than O(N32 ). We believe that max-qnorm andM-norm can not be calculated in polynomial-time but the results in this section might shed somelight on the above claim.Finally, the sum of squares method has been considered for exact tensor completion in [101]and has better sample complexity (O(rNd2 )) than matricization (for odd values of d). Similar to thesum of squares algorithm in [130], this algorithm does not scale well to large problem instances.Therefore spectral algorithms were considered in [89] where they obtain sample complexity ofO(rN32 ) (when d = 3) which although not proven in [89] or in this thesis, seems to be optimalwhen r N.3.2 Observation modelGiven an order-d tensor T ] ∈ RNd and a random subset of indices S = {ω1,ω2, · · · ,ωm}, ωi ∈[N]d , we observe m noisy entries {Yωt}mt=1:Yωt = T](ωt)+σξt , t = 1, · · · ,m, (3.3)for some σ > 0. The variables ξt are zero-mean i.i.d. random variables withE(ξ 2t )= 1. The indicesin S are drawn randomly with replacement from a predefined probability distributionΠ= {piω}, forω ∈ [N]d , such that ∑ω piω = 1. Obviously, maxpiω ≥ 1Nd . Although it is not a necessary conditionfor our proof, it is natural to assume that there exists µ ≥ 1 such thatpiω ≥ 1µNd ∀ω ∈ [N]d,which ensures that each entry is observed with some positive probability. This observation modelincludes both uniform and non-uniform sampling and is a better fit than uniform sampling in many36practical applications.3.3 M-norm constrained least squares estimationGiven a collection of noisy observations {Yωt}mt=1 of a low-rank tensor T ], following the obser-vation model (3.3), we solve a least squares problem to find an estimate of T ]. Consider the set ofbounded M-norm tensors with bounded infinity normKTM(α,R) := {T ∈ RNd: ‖T‖∞≤ α,‖T‖M≤ R}.Notice that assuming that T ] has rank r and ‖T ]‖∞≤ α , Theorem 11 ensures that a choice ofR = (r√r)d−1α is sufficient to include T ] in KTM(α,R). DefiningLˆm(X ,Y ) :=1mm∑t=1(Xωt −Yωt )2, (3.4)we bound the recovery error for the estimate TˆM obtained by solving the optimization problemTˆM = argminXLˆm(X ,Y ) subject to X ∈ KTM(α,R). (3.5)In words, TˆM is a tensor with entries bounded by α and M-norm less than R that is closest to thesampled tensor in Frobenius norm. Moreover, as for any tensor T , ‖T‖M and ‖T‖max are greaterthan or equal to ‖T‖∞, we assume R≥ α .We now state the main result on the performance of M-norm constrained tensor completion asin (3.5) for recovering a bounded low-rank tensor.Theorem 24. Consider an order-d tensor T ] ∈⊗di=1RN with ‖T ]‖∞≤ α and ‖T ]‖M≤ R. Givena collection of noisy observations {Yωt}mt=1 following the observation model (3.3) where the noisesequence ξt are i.i.d. standard normal random variables, there exists a constant C < 20 such thatthe minimizer TˆM of (3.5) satisfies:‖TˆM−T ]‖2Π:=∑ωpiω(TˆM(ω)−T ](ω))2 ≤C (σ(R+α)+Rα)√dNm, (3.6)with probability greater than 1− e −Nln(N) − e− dN2 .37Corollary 25. If we assume each entry of the tensor is sampled with some positive probability,piω ≥ 1µNd ∀ω ∈ [N]d , then1Nd‖TˆM−T ]‖2F≤Cµ(α+σ)R√dNm. (3.7)with probability greater than 1− e −Nln(N) − e− dN2 .Remark 26. In Section 1.7.1, we presented a simplified version of the above theorem when µ = 1and T ] is a rank-r tensor which uses the bound ‖T ]‖M< (r√r)d−1α proved in Theorem 11.Remark 27. The upper bound (3.6) is general and does not impose any restrictions on the samplingdistribution pi . However, the recovery error depends on the distribution. In particular, the entriesthat have a bigger probability of being sampled have a better recovery guarantee compared to theones that are sampled with a smaller probability (notice that in Corollary 25 we assume piω ≥1µNd ∀ω ∈ [N]d).Corollary 28. Under the same assumptions as in Theorem 24 but assuming instead that ξt areindependent sub-exponential random variables with sub-exponential norm K such thatmaxn=1,···,nE[exp(|ξt |K)]≤ e,then1Nd‖TˆM−T ]‖2F≤Cµ(α+σK)R√dNm. (3.8)with probability greater than 1−2e −Nln(N) .Although equation (3.8) proves linear dependence of sample complexity with N, we are notaware of a polynomial-time method for estimating (or even attempting to estimate) the solution of(3.5). However, we later propose an algorithm that is inspired by max-qnorm constrained tensorcompletion and illustrate its efficiency numerically. Therefore, now we analyze the error boundof max-qnorm constrained tensor completion which is very similar to the error bound of (3.5). Tothis end, we define the set of low max-qnorm tensors asKTmax(α,R) := {T ∈ RNd: ‖T‖∞≤ α,‖T‖max≤ R}.Note that, Theorem 11 ensures that a choice of R=√rd2−dα is sufficient to include T ] in KTmax(α,R).The following theorem provides the bound on max-qnorm constrained LS estimation.38Theorem 29. Consider an order-d tensor T ] ∈⊗di=1RN with ‖T ]‖∞≤ α and ‖T ]‖max≤ R. Givena collection of noisy observations {Yωt}mt=1 following the observation model (3.3) where the noisesequence ξt are i.i.d. standard normal random variables, defineTˆmax = argminXLˆm(X ,Y ) subject to X ∈ KTmax(α,R). (3.9)Then there exists a constant Cd such that the minimizer TˆM of (3.5) satisfies:‖Tˆmax−T ]‖2Π=∑ωpiω(Tˆmax(ω)−T ](ω))2 ≤Cd (σ(R+α)+Rα)√dNm, (3.10)with probability greater than 1− e −Nln(N) − e− dN2 .Corollary 30. Moreover, if we assume each entry of the tensor is sampled with some positiveprobability, piω ≥ 1µNd ∀ω ∈ [N]d , then1Nd‖Tˆmax−T ]‖2F≤Cdµ(α+σ)R√dNm. (3.11)with probability greater than 1− e −Nln(N) − e− dN2 .Remark 31. In Section 1.7.1, we presented a simplified version of the above theorem when µ = 1and T ] is a rank-r tensor which uses the bound ‖T ]‖max< α√rd2−d proved in Theorem 11.The proof of this theorem is very similar to the proof of Theorem 24. The only differences are:(i) the max-qnorm is a quasi-norm and therefore the max-qnorm of the error tensor (Tˆmax−T ]) isbounded by 2d−1R; (ii) the unit ball of max-qnorm is larger than the unit ball of M-norm. Thedetails of these differences are provided in Remark 36 in Section 3.7.1.In Theorem 24, the mean squared error decays like O( 1√m). Although this is similar to the resultobtained by [19, Theorem 3.1] (when d = 2), other results such as [19, Theorem 3.2] and [40] provemean squared error bounds that decay like O( 1m) when σ = 0. This difference is due to the noise.In fact in Section 3.5, we prove that it is not possible to achieve a better decay provided that σ > 0.Moreover, following the steps in [19, Theorem 3.2], we can prove the following theorem.Theorem 32. Assume the same conditions as the ones in Theorem 24. Moreover, assume σ ≤ α .39Then with probability 1− 2m over the choice of sample set,|TˆM−T ]‖2Π≤C[σ(√log(m)3R2Nm+√log(m)32α2m)+ log(m)3R2Nm+ log(m)32α2m]. (3.12)Theorem 32 proves decay rate O( 1m) when σ = 0. The proof is similar to the proof in [19,Theorem 6.2] which is based on a general theorem in [119] and is included in Section 3.7.2 forcompleteness.3.4 Comparison to past resultsAs discussed in Section 3.1, there are several works that have considered max-norm for matrixcompletion [37, 41, 78, 110, 118]. However, the closest work to our result is [19], where the authorsstudy max-norm constrained matrix completion, which is a special case of max-qnorm constrainedtensor completion with d = 2. Here, we have generalized the framework of [19] to the problem oftensor completion. Although the main ideas of the proof are similar, the new ingredients includebuilding a machinery for analyzing the max-qnorm and M-norm of low-rank tensors, as explainedin Section 2.3. As expected, our result reduces to the one in the [19] when d = 2. More interest-ingly when d > 2, compared to the matrix error bound, the only values in upper bound (3.6) thatchange is the upper bound on the max-qnorm of the d-th order tensor (which is independent of N)and the order d, which changes the constants slightly.As can be seen from Theorem 11, for a rank-r tensor T with ‖T‖∞≤ α , we have ‖T‖M≤(r√r)d−1α . Therefore, assuming α = O(1), to obtain an error bound of 1Nd ‖TˆM−T ]‖2F≤ ε , it issufficient to have m>C (r√r)d−1dNε2 samples. Similarly, using the max-qnorm, for an approximationerror bounded by ε , it is sufficient to obtain m > Cd rd2−ddNε2 samples. In contrast, the sufficientnumber of measurements with the best possible matricization is m>C rNd d2 eε2 , which is significantlylarger for higher-order tensors.Tensor completion using nuclear norm gives significantly inferior bounds as well. In particular,fixing r, and d, compared to latest results on tensor completion using nuclear norm [130], usingM-norm lowers the theoretical sufficient number of measurements from O(Nd2 ) to O(dN).403.5 Information-theoretic lower boundTo prove a lower bound on the performance of (3.5), we employ a classical information-theoretic technique to establish a minimax lower bound for non-uniform sampling of random tensorcompletion on the max-qnorm ball. A similar strategy in the matrix case has been used in [19, 31].In order to derive a lower bound on the performance of (3.5), we find a set of tensors in the set KTMthat are sufficiently far away from each other. Fano’s inequality implies that with the finite amountof information that we have, there is no method that can differentiate between all the elements of aset with too many elements and therefore any method will fail to recover at least one of them witha large probability. The main idea and the techniques closely follow [19, Section 6.2]; thereforewe only explain the main steps we take to generalize this approach from matrices to tensors.Similar to the upper bound case, we analyze a general restriction on the max-qnorm of thetensors instead of concentrating on low-rank tensors. Plugging the upper bound of the max-qnormof low-rank tensors as a special case provides a lower bound for low-rank tensors as well.Restating the set of bounded low M-norm tensors given byKTM(α,R) := {T ∈ RNd: ‖T‖∞≤ α,‖T‖M≤ R}, (3.13)We will find a lower bound on the recovery error of any method that takes {Yωt}mt=1 as input andoutputs an estimate Tˆ . This includes TˆM that is obtained byTˆM = argminXLˆm(X ,Y ) subject to X ∈ KTM(α,R). (3.14)In particular, we show that when the sampling distribution satisfiesµNd≤minωpiω ≤maxω piω ≤LNd,the M-norm constrained least squares estimator is rate optimal on KTM(α,R). The following theo-rem is proved in Section 3.7.3.Theorem 33. Assume that the noise sequence ξt are i.i.d. standard normal random variables andthe sampling distribution Π satisfies maxω piω ≤ LNd . Fix α , R, N, and m such thatR2 ≥ 48α2K2GN, (3.15)41where KG is the Grothendieck’s constant. Then the minimax recovery error is lower bounded byin fTˆMsupT∈KTM(α,R)1NdE‖TˆM−T‖2F≥min{α216,σR128√2KG√NmL}. (3.16)Remark 34. Comparing the above theorem with (3.8), we observe that as long as σR128√2KG√NmL <α216 , M-norm constrained tensor completion is optimal in both N and R.3.6 ExperimentsIn this section, we present algorithms that we use to solve (3.9) and experiments concern-ing max-qnorm of specific classes of tensors and max-qnorm constrained tensor completion. Asmentioned before most of the typical procedures such as calculating the nuclear norm or even cal-culating the rank of a tensor are NP-hard. The situation seems even more hopeless if we considerthe results of [7] which connects 3-dimensional tensor completion with refuting 3-SATs, which hasa long line of research behind it. In short, if we assume that either max-qnorm or M-norm is com-putable in polynomial time, a conjecture of [30] for refuting 3-SATs will be disproved. All thesebeing said, the current chapter is the first work that considers max-qnorm for tensor completionand the preliminary results we show in this section are promising, outperforming matricization inevery experiment we ran, and even outperforming the TenALS algorithm of [60].In this section, we concentrate on (3.9) instead of (3.5) as we are not aware of any algorithmthat can even attempt to solve (3.5) and simple heuristic algorithms we designed for (3.9) givepromising results even though we do not know of any algorithm that is known to converge due tothe non-convexity of the optimization problem (3.9).There are two questions that need to be answered while solving (3.9). First, how to choosethe max-qnorm bound R, and second, how to solve the least squares problem once R is fixed. Weaddress both these question in the next sections. We also run some experiments to estimate thetensor max-qnorm of some specific classes of tensors to get an idea of the dependency of the max-qnorm of a tensor on its size and rank. Finally, we compare the results of max-qnorm constrainedtensor completion with TenALS and matricizing.423.6.1 Algorithms for max-qnorm constrained least squares estimationIn this section, we introduce a few algorithms that attempt to solve (or approximate the solutionof) (3.9). Defining f (V1, · · · ,Vd,Y ) := Lˆm((V1 ◦ · · · ◦Vd),Y ), we minimizemin f (V1, · · · ,Vd,Y ) subject to maxi(‖Vi‖2,∞)≤ d√R, (3.17)where R is the max-qnorm constraint. In the definition of the max-qnorm, there is no limitation onthe column size of the factors Vi.In the experiments we run in this section, we limit the factor sizes to N× 2N. Although thisis an arbitrary value and we haven’t derived an error bound in the max-qnorm of tensors with thislimitation, we believe (and our experiments also confirm) that this choice is large enough whenr << N. We defer the exact details of the effect of this choice on the error bounds to future work.All the algorithms mentioned in this section are first-order methods that are scalable for higherdimensions and just require access to the first derivative of the loss function.Projected gradientThe first algorithm is the projected gradient algorithm that for each factor, fixes all the otherfactors and takes a step according to the gradient of the loss function. Next, we project back all thefactors on the set C := {X |‖X‖2,∞≤ d√R}. To be precise, for each factor Vi, define the matricizationof T = V1 ◦ · · · ◦Vd along the i-th dimension, Ti, to be Ti = Vi ◦Ri (notice that Ri is obtained fromthe Kronecker product of the factors V1, · · · ,Vi−1,Vi+1, · · · ,Vd) and define fi(X) := Lˆ ((X ◦Ri),Yi),where Yi is the matricization of Y along its i-th dimension. Fixing a step size γ , the algorithmupdates all the factors in parallel via[Vi]← PC([Vi− γ5 ( fi)Ri]). (3.18)where PC simply projects the factor onto the set of matrices with `2-infinity norm less thand√R.This projection looks at each row of the matrix and if the norm of a row is bigger than d√R, it scalesthat row back down to d√R and leaves other rows unchanged.This algorithm is a well-known algorithm with a lot of efficient implementations and modi-fications. Furthermore, using Armijo line search rule to guarantee sufficient decrease of the lossfunction, it is guaranteed to find a stationary point of (3.17).43Projected quasi-NewtonStacking all the factors in a matrix X ,X =V1V2...Vdand defining f (X) := Lˆm((V1 ◦ · · · ◦Vd),Y ), this algorithm uses BFGS quasi-Newton method toform a quadratic approximation to the function at the current estimate and then uses spectral pro-jected gradient (SPG) method to minimize this quadratic function, constrained to X ∈C. We usethe implementation of [108] which uses limited memory BFGS and uses a Barzilai-Borwein scal-ing of the gradient, and use a non-monotone Armijo line search along the feasible direction to findthe next iterate in the SPG step.Stochastic gradientThe loss functionLˆm(X ,Y ) =1mm∑t=1(Xωt −Yωt )2,is decomposable into the sum of m loss functions, each concerning one observed entry. Thismakes it very easy to use stochastic gradient methods that at each iteration take one or more ofthe entries, and find the feasible direction according to this subset of observations. In particular, ateach iteration, we take a subset of the m entries, S⊂Ω, and minimize the loss functionLˆS(X ,Y ) =1|S| ∑ωt∈S(Xωt −Yωt )2.This approach is useful when we are dealing with very high dimension sizes and accessingall the measurements at once is not an option or very costly. There has been plenty of researchon the efficiency of this method and its recovery guarantees [74]. The projection part is done asbefore with the advantage that we just need to project the rows in the factors that correspond to thesubset of entries chosen in this iteration and not necessarily all of them which saves time in largeapplications.443.6.2 Experiment on max-qnorm of tensorsIn this section, we run an experiment to find the dependency of the max-qnorm on its rank andsize. To this end, we consider tensors whose low-rank factors come from Gaussian distribution.We also mention the results for tensors coming from random sign factors. Although in comparisonwith other ways of generating low-rank tensors, these specific classes of tensors do not necessarilyrepresent tensors with highest possible max-qnorm, they can be helpful in giving us an idea of howdoes the max-qnorm scale with size and rank.In order to estimate the max-qnorm of a tensor, we employ a max-qnorm constrained tensorcompletion while accessing all the entries of the tensor and find the smallest constraint that suc-cessfully recovers the tensor. Using the bisection method to estimate the max-qnorm of the tensor,starting from a lower bound and an upper bound for the max-qnorm of the tensor we first check ifthe tensor can be recovered with max-qnorm bound equal to the average of the upper bound andthe lower bound. Next, we increase the lower bound if the max-qnorm constraint is too small forfull recovery and reduce the upper bound if the max-qnorm bound is large enough. Algorithm 1explains this algorithm in more details. For small ranks, we get to the approximate max-qnormvery fast, for example, in less than log(rankdim−1)+ k iterations we can estimate the max-qnormwith an error less than 2−k. Moreover, we assume successful recovery is achieved once the rootmeans squared error (RMSE) is less than a small predefined value. This algorithm becomes fasterafter the first iteration, as we use the factors found in the previous iteration as a good initial pointfor the next iteration.Algorithm 1 Estimating max-qnorm of a tensor T1: Input T , Ω= [N1]× [N2]×·· ·× [Nd], lowerbound, upperbound2: Output ‖T‖max with an estimation error of at most 0.013: for iteration =1 to dlog2(upperbound− lowerbound)e+6 do4: Tˆ = argminX∈RN1×···×RNd ‖XΩ−TΩ‖2F subject to ‖X‖max≤ lowerbound+upperbound25: end for6: RMSE = ‖X−T‖F√∏i=di=1 Ni7: if RMSE ≤ 1e−3 then8: upperbound = lowerbound+upperbound29: else10: lowerbound = lowerbound+upperbound211: end if12: return lowerbound+upperbound245Figure 3.1 shows the results for both 3 and 4-dimensional tensors when low-rank factors aredrawn either from Gaussian distribution 3.1.a, and 3.1.b or from Bernoulli distribution 3.1.c. Inboth cases, we have considered r ∈ {1,2, · · · ,10}, and N ∈ {5,10,15,20} for the 3-dimensionalcase and N ∈ {5,10} for the 4-dimensional case. In all the cases the average max-qnorm is similarfor different value of N when rank and order is fixed. These results confirm that max-qnorm of atensor just depends on its rank and its order and is independent of the size of the tensor.The results are averaged over 15 experiments. Because of the linear effect of infinity norm ofa tensor on its max-qnorm, we rescale all tensors to have‖T‖∞= 1 before estimating their max-qnorm. Comparing Figure 3.1.a and 3.1.b shows that the max-qnorm is around√r for d = 3, andr for d = 4. These values are attained exactly when the factors are drawn from Bernoulli randomvariables. Moreover, the dependence is constant when d = 2. This suggests a multiplicative in-crease of√r when the order is increased by one. However, whether or not the actual bound ingeneral case is O(√rd2−d) is an interesting open question.1 2 3 4 5 6 7 8 9 10rank1234Max normaverage max norm of N*N*N tensorsN= 5N= 10N= 15N= 20(a) 3-dimensional, Gaussian fac-tors1 2 3 4 5 6 7 8 9 10rank13578Max normaverage max norm of N*N*N*N tensorsN= 5N= 10(b) 4-dimensional, Gaussian fac-tors1 2 3 4 5 6 7 8 9 10rank1357910Max normaverage max norm of tensors with sign factors3d results independent of N4d results independent of N(c) 3 and 4-dimensional, sign fac-torsFigure 3.1: Log-log plot of average max-qnorm of 3 and 4-dimensional low-rank tensorsobtained by Algorithm 1, for various rank and sizes, averaged over 15 draws for eachrank and size.3.6.3 Automatic algorithm for choosing optimal max-qnorm bound RAs explained, before, other than designing algorithms for solving constrained max-qnorm min-imization, we need to design a procedure to find good bounds on the max-qnorm. The theoreticalbounds found in this thesis might not be tight and even if they are proven to be tight, such the-oretical upper bounds usually capture the worst-case scenarios which might not be optimal for a46general problem. This issue is very important as the result of the tensor completion is very depen-dent on choosing the right upper bound. Other than this in many practical applications, we don’thave access to the actual rank of the underlying tensor which shows the importance of finding theupper bound automatically and not as an input to the optimization problem.Algorithm 2 Tensor completion, with cross-validation1: Input possibly noisy measurements YΩ = TΩ+σN(0,1), observed entries Ω, lowerbound,upperbound2: Output Tˆ3: Divide the observations into Ωtrain, and Ωvalidate4: for iteration =1 to dlog2(upperbound− lowerbound)e+6 do5: for itercheck=0 to 4 do6: bound(itercheck) =itercheck4 ∗upperbound+ 4−itercheck4 ∗upperbounds7: Tˆ (itercheck) = argminX∈RN1×···×RNd ‖XΩ−YΩ‖2F s.t. ‖X‖max≤ bound(itercheck)8: RMSE(itercheck) =‖(Tˆ (itercheck)Ωvalidate−(Y )Ωvalidate‖F√|Ωvalidate|9: end for10: minindex = argmin RMSE11: lowerbound = bound(minindex−1)12: upperbound = bound(minindex+1)13: Tˆ = Tˆ (minindex)14: end for15: return TˆThe first approach is modifying algorithm 1 to find a good upper bound. There are two com-plications with generalizing this approach. First, in tensor completion, we don’t have access tothe full tensor and we have to estimate the recovery error on the indices we have not observedas otherwise choosing a large upper bound can result in over-training, i.e, fitting the observationsexactly and losing the low-rank (and low max-qnorm) structure of the tensor. The other complica-tion, is that in algorithm 1, we use RMSE < 1e− 3 as an approximation for full recovery. Doingsuch a thing in a noisy problem is not possible and using the RMSE of the noise is an independentproblem that depends on the noise-type and is usually not optimal in a general case. Other thanthis, even if we know the RMSE of the noise, we won’t be bale to decide if a bound is larger thanthe optimal max-qnorm bound or smaller due to the possibility of over-fitting.To address these two issues, we propose algorithm 2 for tensor completion that uses cross-validation for estimating the RMSE and uses a five-point search for the optimal max-qnorm bound.47RMSEmax−qnorm bound RRMSEmax−qnorm bound RRMSEmax−qnorm bound RFigure 3.2: All the possible situations in the five-point search algorithm. The leftmost and therightmost red dots are the previous lower bound and upper bound on the max-qnorm R,respectively. The green intervals show the new lower bound and upper bound based onthe RMSE.As explained above we need to find a way to estimate the true RMSE to avoid over-fitting andalso as a measure of how good the approximation is. In order to do this, before starting the opti-mization process, we randomly divide the observed samples, Ω, into two sets: Ωtrain and Ωvalidateand we solve each max-qnorm constrained sub-problem just using the samples in Ωtrain whichcontains 80% of the total samples and use the reserved samples Ωvalidate (20% of total samples)to estimate the RMSE. We are aware of more sophisticated cross-validation (such as k-fold crossvalidation) algorithms that try to remove the bias in the samples as much as possible. However,considering that each max-qnorm constrained tensor completion problem is expensive, our numer-ical results show that the simple cross-validation used in algorithm 2 is good enough for finding anapproximately optimal upper bound.Now we explain algorithm 2 more thoroughly. To find the optimal upper bound we input a largeenough upper bound and a small enough lower bound that are bigger and smaller than the optimalmax-qnorm bound and iteratively refine these bounds until the two bounds become close to eachother. To determine the next upper and lower bounds, checking the middle point is not enoughbecause unlike algorithm 1, the RMSE is not going to be zero for bounds bigger than the optimalbound. Therefore, other than the lower bound and the upper bound we calculate the RMSE usingthree points in the interval between the lower bound and the upper bound as well and consider the48best bound among those three points to be the center of the new upper bound and the new lowerbound. The derivation of this approach is in Algorithm 2. A hand-wavy reasoning behind this isassuming that there is an optimal bound R] that gives the best RMSE, any bound larger than R]results in over-training and any bound smaller than R] results in under training. This over-trainingor under-training becomes more severe when the bound is further from the optimal max-qnormbound and therefore the problem becomes finding the minimum of a function where we don’tknow the derivative of the function. Deriving a provably exact algorithm to find the optimal upperbound in such a situation is an interesting question that we postpone to future work. Assuming allthe justification above is roughly correct, Figure 3.2 shows that the five-point method explained inAlgorithm 2 finds the optimal upper bound. Considering all the assumptions above to be true, thefigure plots the RMSE against the max-qnorm upper bound R, and shows all the possible situationsthat the five points (red stars) can have in such a curve. The green lines show the new intervalbounded by the new lower and upper bound. Notice that the optimal value always stays in themiddle of these two bounds. Our numerical experiments in the next section show that althoughthis justification is not mathematically rigorous, the final outcome is very promising. However,providing a better explanation or designing a more rigorous method is an interesting future work.3.6.4 Numerical results of low-rank tensor completionIn this section, we present the results of max-qnorm constrained tensor completion, and com-pare it with those of matricization and alternating least squares (tenALS) [60]. As explained atthe beginning of Section 3.6, we pick the low-rank factors to have size N × 2N. Although thechoice of 2N is arbitrary, we believe it is large enough for small ranks and does not result in largeerrors. This also has the additional benefit of not requiring the knowledge of the exact rank of thetensors. As explained above, we just assume the knowledge of an upper bound and a lower boundon the max-qnorm of the tensor and use cross-validation to the find the optimal max-qnorm bound.Obviously, the algorithm becomes faster if these bounds are closer to the actual max-qnorm of thetensor.Figures 3.3 and 3.4 show the results of completing a 50×50×50 tensor with m random sam-ples of it taken uniformly at random (without and with 10-dB noise respectively). The results areaveraged over 15 experiments, with various ranks ranging from 3 to 30 and sampling rate rangingfrom 0.01 to 0.1. The row i and column j of each subplot shows the average squared relativerecovery error ‖T−Tˆ‖2F‖T‖2Ffor a random tensor with rank i, where different columns represent differentnumber of samples observed according to (3.3). We rescale the true tensors to have infinity normequal to 1 before adding the noise and the RMSE of the noise is around 0.1. As expected in all49Matricized TC using FPCA algorithm0.02 0.04 0.06 0.08 0.151015202530rank0.80.850.90.951MNC TC with N*2N factors0.02 0.04 0.06 0.08 0.151015202530rank00.20.40.60.81MNC TC with N*r factors0.02 0.04 0.06 0.08 0.151015202530rank00.20.40.60.81TC using ALS0.02 0.04 0.06 0.08 0.151015202530rank00.20.40.60.81Figure 3.3: 3-dimensions, no noise Average relative recovery error ‖Trecovered−T]‖2F‖T ]‖2Ffor 3-dimensional tensor T ∈ R50×50×50 and different ranks and samples.experiments we get better results with higher number of measurements m, and smaller rank r. Thematricized results are obtained by applying the Fixed Point Continuation with Approximate SVDalgorithm (FPCA), introduced in [85] to flattened 900×30 matrices (This algorithm results in thebest outcomes in our experiments, compared to other noisy matrix completion algorithms). Next,for a more fair comparison we have included the results of tensor completion using alternatingleast squares (ALS) where the code is provided online [60]. The two plots in the second row showthe max-qnorm constrained (MNC) results in two scenarios. One with exact low-rank factors andsecond, as the theory suggests, with factors with larger number of columns. Notice that the exactmax-qnorm formulation does not put any limitation on the number of columns of the factors andwe chose 2N to balance the computational cost and the accuracy of the computed max-qnorms.The results unanimously show the advantage of using N×2N factors instead of N× r factors. Al-though we are dealing with higher-dimensional factors and this makes the algorithm a little slower,50Matricized TC using FPCA algorithm0.02 0.04 0.06 0.08 0.151015202530rank0.90.951MNC TC with N*2N factors0.02 0.04 0.06 0.08 0.151015202530rank0.20.40.60.81MNC TC with N*r factors0.02 0.04 0.06 0.08 0.151015202530rank0.20.40.60.81TC using ALS0.02 0.04 0.06 0.08 0.151015202530rank0.20.40.60.81Figure 3.4: 3-dimensions, 10-dB noise Average relative recovery error ‖Trecovered−T]‖2F‖T ]‖2Ffor3-dimensional tensor T ∈ R50×50×50 and different ranks and samples.using larger factors has the additional benefit of not getting stuck in local minima compared toexact low-rank factors. The ALS algorithm, show some discrepancies in the results, i.e., there arecases that the average error increases with smaller rank or higher sampling rate. We believe this isbecause of high non-convexity in this algorithm. We expected to see similar discrepancies in themax-qnorm results as well but at least for these dimensions and setup, our algorithm seems to beable to run away from local minima which is surprising. We can see the importance of this issuewhen we compare the results of using N× r factors and N× 2N factors. It is worth mentioninghere that the ALS algorithm uses a number of initial vectors to use in the initialization step andlarger values give a better estimate at the expense of longer processing time. We use 50 vectorsso that the time spent by the algorithms is comparable to each other. However, the result can beslightly improved if we use more initial vectors.51The results of matricization is always inferior to those of tensor completion with both ALS andMNC which is expected, especially as here the tensors have an odd order. The difference betweenmatrix completion and MNC with N × 2N factors is significant. For example, when mNd = 0.1,Max-qnorm constrained TC (MNC) recovers all the tensor with rank less 10, whereas matrix com-pletion starts to fails for ranks bigger than 1.In Figures 3.5 and 3.6 we have included the results for completing a 4-dimensional 20×20×20×20 tensor. The results are similar to the 3-dimensional case and therefore we are not includingthe exact low-rank factor results which are inferior to using N× 2N factors. The available onlinecode of ALS algorithm is not suitable for 4-dimensional cases as well. Similar to the 3-dimensionalcase MNC TC always beats matrix completion. Notice that because of the even dimension, the ma-tricization can be done in a more balanced way and therefore, results of matrix completion are alittle better compared to the 3-dimensional case but still we can see the huge advantage of usingtensor completion instead of matricizing.It is worth mentioning that although the algorithms take reasonable time for small dimensionsused here (less than 15 seconds in each case), it is still slower than matricizing. In other wordsusing tensors enjoys better sample complexity and can be used to save very large data but it comesat the cost of computational complexity. Deriving scalable and efficient algorithms is an interestingand necessary future work for practical applications.In Figure 3.7, we do a thorough investigation of the recovery error of our algorithm with respectto r, N, and the number of measurements m when we use max-qnorm constrained 1-bit TC (3.9).In the figures, dark blue shows full recovery. In the left image in Figure 3.7 we fix m = 1500 andshow the average squared error for varying values of N and r. As expected, we observe an increasein the average error for larger values of N and r. In the right image of figure 3.7 we fix r = 5 andshow the results for varying N and m. The low-rank tensors are generated by random sign factors.In Figure 3.1 we showed that for such tensors the max-qnorm is equal to√r and therefore, theerror bound (in the noiseless case) is O(log(m)32 rNm ). By observing the rows of the left image, wecan approximately see the dependence on r is linear. However, for m and N the dependence of oferror is much more complicated than what we expect theoretically. We think this is due to a fewfactors. First, our algorithm is not exactly solving the max-qnorm constrained problem due to thelimited number of columns in the factors and inexact infinity-norm projections. Second and moreimportant, the theoretical bound is a worst-case error bound that is true with high probability over52Matricized TC using FPCA algorithm0.02 0.04 0.06 0.08 0.12468101214161820rank00.10.20.30.40.50.60.70.80.91MNC TC with N*2N factors0.02 0.04 0.06 0.08 0.12468101214161820rank00.10.20.30.40.50.60.70.80.91Figure 3.5: 4-dimensions, no noise Average relative recovery error ‖Trecovered−T]‖2F‖T ]‖2Ffor 4-dimensional tensor T ∈R20×20×20×20 and different ranks and samples. The plot on theleft shows the results for the 400×400 matricized case.Matricized TC using FPCA algorithm0.02 0.04 0.06 0.08 0.12468101214161820rank0.10.20.30.40.50.60.70.80.91MNC TC with N*2N factors0.02 0.04 0.06 0.08 0.12468101214161820rank0.10.20.30.40.50.60.70.80.91Figure 3.6: 4-dimensions, 10-dB noise Average relative recovery error ‖Trecovered−T]‖2F‖T ]‖2Ffor4-dimensional tensor T ∈ R20×20×20×20 and different ranks and samples. The plot onthe left shows the results for the 400×400 matricized case.the sampling scheme. Whereas, the figures show the average error.53RSE, m=1500, d=3 2 4 6 8 10 12rank151821242730NRSE, r=5, d=3 200 400 600 8001000m151821242730N0.20.40.60.81Figure 3.7: 3-dimensions, no noise Average relative recovery error ‖Trecovered−T]‖2F‖T ]‖2Ffor e-dimensional tensors with fixed number of measurements in the left and fixed rank inthe right.3.7 Proof3.7.1 Proof of Theorem 24In this section, we prove Theorem 24. We make use of Lemma 9, Lemma 10 and Theorem 11repeatedly. However, some parts of the other calculations are simple manipulations of the proof in[19, Section 6].For ease of notation define Tˆ := TˆM. Notice that T ] is feasible for (3.5) and therefore,1mm∑t=1(Tˆωt −Yωt )2 ≤1mm∑t=1(T ]ωt −Yωt )2.Plugging in Yωt = T](ωt)+σξt and defining ∆= Tˆ −T ] ∈ KTM(2α,2R) we get1mm∑t=1∆(ωt)2 ≤ 2σmm∑t=1ξt∆(ωt). (3.19)The proof is based on a lower bound on the left-hand side of (3.19) and an upper bound on itsright-hand side.54Upper bound on right-hand side of (3.19)First, we bound Rˆm(α,R) := sup∆∈KTM(α,R)| 1m ∑mt=1 ξt∆(ωt)| where ξt is a sequence of N (0,1)random variables. With probability at least 1− δ over ξ = {ξt}, we can relate this value to aGaussian maxima as follows [98, Theorem 4.7]:sup∆∈KTM(α,R)| 1mm∑t=1ξt∆(ωt)| ≤ Eξ [ sup∆∈KTM(α,R)| 1mm∑t=1ξt∆(ωt)|]+piα√log( 1δ )2m≤ R Eξ [ supT∈T±| 1mm∑t=1ξtT (ωt)|]+piα√log( 1δ )2m,where T ∈ T± is the set of rank-one sign tensors with |T±|< 2Nd . Notice that in the second inequal-ity we use BM(1) = conv(T±) (Lemma 9) and remove the convex hull from the supremum.Since for each T , ∑mt=1 ξtT (ωt) is a Gaussian with mean zero and variance m, the expectedmaxima is bounded by√2m log(|T±|). Gathering all the above information, we end up with thefollowing upper bound with probability larger than 1−δ :supT∈KTM(α,R)| 1mm∑t=1ξtT (ωt)|≤ R√2log(2)Ndm+piα√log( 1δ )2m. (3.20)Choosing δ = e−Nd2 , we get that with probability at least 1− e−Nd2supT∈KTM(α,R)| 1mm∑t=1ξtT (ωt)|≤ 2(R+α)√Ndm. (3.21)Lower bound on left-hand side of (3.19)In this section, we prove that with high probability, 1m ∑mt=1∆(ωt)2 does not deviate much fromits expectation ‖∆‖2Π. For ease of notation, define TS = (T (ω1),T (ω2), · · · ,T (ωm)) to be the set ofchosen samples drawn from Π where‖T‖2Π=1mES∼Π‖TS‖22=∑ωpiωT (ω)2.55We prove that with high probability over the samples,1m‖TS‖22≥ ‖T‖2Π−Cβ√Ndm, (3.22)holds uniformly for all tensors T ∈ KTM(1,β ).Lemma 35. Defining δ (S) := supT∈KTM(1,β )| 1m‖TS‖22−‖T‖2Π|, and assuming N,d > 2 and m ≤ Nd ,there exists a constant C > 20 such thatP(δ (S)>Cβ√Ndm)≤ e −Nln(N) .To prove this lemma, we show that we can bound the t-th moment of δ (S) asES∼Π[δ (S)t ]≤(8β√Nd+ t ln(m)√m)t. (3.23)Before stating the proof of this bound, we show how we can use it to prove Lemma 35 by usingMarkov’s inequality.P(δ (S)>Cβ√Ndm) = P((δ (S))t > (Cβ√Ndm)t)≤ ES∼Π[δ (S)t ](Cβ√Ndm )t. (3.24)Using (3.23) and simplifying we getP(δ (S)>Cβ√Ndm)≤(4√Nd+ tln(m)C√Nd)t. Taking t = Ndln(m) and for C > 12,P(δ (S)>Cβ√Ndm)≤ e −Ndln(m) ≤ e −Nln(N) .Now we prove (3.23) by using some basic techniques of probability in Banach space, includ-ing symmetrization and contraction inequality [31, 77]. Regarding the tensor T ∈⊗di=1RN as afunction from [N]d → R, we define fT (ω1,ω2, · · · ,ωd) := T (ω1,ω2, · · · ,ωd)2. We are interested in56bounding δ (S) := supfT :T∈KTM(1,β )| 1m ∑mi=1 fT (ωi)−E( fT (ωi))|. A standard symmetrization argumentand using contraction principle yieldsES∼Π[δ (S)t ]≤ ES∼Π{2Eε [ supT∈KTM(1,β )| 1mm∑i=1εiT (ωi)2|]}t ≤ ES∼Π{4Eε [ supT∈KTM(1,β )| 1mm∑i=1εiT (ωi)|]}t ,where εi’s are Rademacher random variables. Notice that if TM ≤ β , then T ∈ β conv(T±) andthereforeES∼Π[δ (S)t ]≤ ES∼Π{4βEε [ supT∈T±| 1mm∑i=1εiT (ωi)|]}t = β tES∼Π{Eε [ supT∈T±| 4mm∑i=1εi|]}t .To bound the right-hand side above notice that for any α > 0 [114, Theorem 36]Pε(4mm∑i=1εi ≥ α√m) = P(Binom(m,12)≥ m2+α√m8)≤ e−α216 .Taking a union bound over T±, where |T±|≤ 2Nd we getPε [ supT∈T±(| 4mm∑i=1εiT (ωi)|)≥ α√m ]≤ 2Nd+1e−α216 .Combining the above result and using Jensen’s inequality, when t > 1β tES∼Π{Eε [ supT∈T±| 4mm∑i=1εi|]}t ≤ β tES∼Π{Eε [ supT∈T±| 4mm∑i=1εi|]t} ≤ β t((α√m)t +4t2Nd+1e−α216)Choosing α =√16ln(4×2Nd+1)+4t ln(m) and simplifying proves (3.23).Gathering the results of (3.19), (3.21), and (4.49)Now we combine the upper and lower bounds in the last two sections to prove Theorem 24.On one hand from (3.21), as ∆ ∈ KT (2α,2R), we get1m‖∆S‖22≤ 8σ(R+α)√Ndm,57with probability greater than 1− e−Nd2 . On the other hand, using Lemma 35 and rescaling, we get‖∆‖2Π≤1m‖∆S‖22+CRα√Ndm,with probability greater than 1− e −Nln(N) . The above two inequalities finishes the proof of Theorem24.Remark 36. There are only two differences in the proof of Theorem 24 and Theorem 29. First isan extra constant, c1cd2 , which shows up in Rademacher complexity of unit max-qnorm ball whichchanges the constant C in Theorem 24 to Cc1cd2 in Theorem 29 and second is the max-qnorm ofthe error tensor ∆ = Tˆ −T ] (refer to equation (3.19)) which belongs to KTmax(2α,2d−1R) insteadof KTmax(2α,2R).3.7.2 Proof of Theorem 32The proof is based on [119] which establishes excess risk bounds of risk minimization withsmooth loss functions and a hypothesis class with bounded empirical Rademacher complexity. Tothat end, remember that Y (ωt) = T ](ωt)+σξt . We define the loss functionL (X) := Eωt∼Π,ξt∼N (0,1)(X(ωt)−Y (ωt))2 = ‖X−T ]‖2Π+σ2and its corresponding empirical loss functionLˆm(X ,Y ) :=1mm∑t=1(Xωt −Yωt )2,where {Yωt}mt=1 is i.i.d. noisy samples generated according to Π. Our goal is to bound ‖TˆM−T ]‖2Πwhere TˆM := argminX∈KT (α,R) Lˆm(X ,Y ). First step is proving that with high probability the noiseis bounded. In particular, with probability greater than 1− 1m [19, equation (6.16)],max1≤t≤m|ξt |< c√log(m).58Using this and [119, Theorem 1], we get than with probability greater than 1−δL (TˆM)− minX∈KT (α,R)L (X)≤Cσlog(m) 32 Rm(KT (α,R))+√b log( 1δ )m+ log(m)3 Rm(KT (α,R))2+ b log( 1δ )m , (3.25)where b = 5α2+4ασ√log(m), and Rm(KT (α,R)) is bounded by 6√dNm (Lemma 10). Moreover,note that minX∈KT (α,R)L (X) =L (T]) = σ2 andL (TˆM) = ‖TˆM−T ]‖2Π+σ2. The proof is finishedby plugging these two into the left-hand side of (3.25).3.7.3 Proof of Theorem 33Packing set constructionIn this section, we construct a packing for the set KTM(α,R).Lemma 37. Let r = b( RαKG )2c and let KTM(α,R) be defined as in (3.13) and let γ ≤ 1 be such thatrγ2 is an integer and supposerγ2 ≤ N. Then there exists a set χT ⊂ KTM(α,R) with|χT |≥ exp( rN16γ2)such that• For T ∈ χT , |T (ω)|= αγ for ω ∈ [N]d .• For any T (i),T ( j) ∈ χT , T (i) 6= T ( j)‖T (i)−T ( j)‖2F≥α2γ2Nd2.Proof: This packing is a tensor version of the packing set generated in [31, Lemma 3] with similarproperties and our construction is based on the packing set generated there for low-rank matriceswith bounded entries. In particular, we know that there exists a set χ ⊂ {M ∈ RN×N : ‖M‖∞≤α, rank(M) = r} with |χ|≥ exp( rN16γ2 ) and for any M(i),M( j) ∈ χ , ‖M(i)−M( j)‖2F≥α2γ2N22 wheni 6= j.59Take any M(k) ∈ χ . M(k) is a rank-r matrix with ‖M(k)‖∞= αγ ≤ α and therefore ‖M(k)‖max≤√rα . By using (2.9), there exists a nuclear decomposition of M(k) with bounded singular vectorsM(k) = ∑iσiui ◦ vi, ‖ui‖∞,‖vi‖∞≤ 1, such that ∑i=1|σi|≤ KG√rα . Define T (k) = ∑iσiui ◦ vi ◦~1 · · · ◦~1 where~1 ∈RN is the vector of all ones. Notice that ‖ui‖∞,‖vi‖∞,‖~1‖∞≤ 1 and therefore bydefinition, ‖T (k)‖M≤ KG√rα ≤ R by Lemma 9. The tensor is basically generated by stacking thematrix M(k) along all the other d− 2 dimensions and therefore |T (k)(ω)|= αγ for ω ∈ [N]d , and‖T (k)‖∞≤ α . Hence we build χT from χ by taking the outer product of the matrices in χ by thevector~1 along all the other dimensions. Obviously, |χT |= |χ|≥ exp( rN16γ2 ). It just remains to prove‖T (i)−T ( j)‖2F≥α2γ2Nd2.Assuming that T (i) is generated from M(i) and T ( j) is generated from M( j), since T (i)(i1, i2, · · · , id)=M(i)(i1, i2), ‖T (i)−T ( j)‖2F= ∑Ni1=1∑Ni2=1 · · ·∑Nid=1(T (i)(i1, i2, · · · , id)−T ( j)(i1, i2, · · · , id))2 =Nd−2∑Ni1=1∑Ni2=1(M(i)(i1, i2)−M( j)(i1, i2))2 =Nd−2‖M(i)−M( j)‖2F≥ α2γ2Nd2 which concludes proofof the lemma.Proof of Theorem 33Now we use the construction in Lemma 66 to obtain a δ -packing set, χT of KTM, with δ =αγ√Nd2 . For the lower bound, we assume that the sampling distribution satisfiesmaxωpiω ≤ LNd . (3.26)The proof is based on the proof in [19, Section 6.2] which we will rewrite the main parts and referto [19] for more details. The proof is based on two main arguments. First is a lower bound on the‖·‖F -risk in terms of the error in a multi-way hypothesis testing problem [127]in fTˆsupT∈KTM(α,R)ET‖Tˆ −T‖2F≥δ 24minT˜P(T˜ 6= T ]),where T ] is uniformly distributed over the pacing set χT . The left had side above is the maximumexpected error obtained by the best estimator in terms of Frobenius norm. Second argument is avariant of the Fano’s inequality which conditioned on the observations S = {ω1, · · · ,ωm}, gives the60lower boundP(T˜ 6= T ]|S)≥ 1− ((|χT |2))−1∑k 6= j K(T k||T j)+ log(2)log|χT | , (3.27)where K(T k||T j) is the Kullback-Leibler divergence between distributions (YS|T k) and (YS|T j).Here, (YS|T k) is the probability of observing YS given that the measurements are taken from T k.For our observation model with i.i.d. Gaussian noise, we haveK(T k||T j) = 12σ2m∑t=1(T k(ωt)−T j(ωt))2,and therefore,ES[K(T k||T j)] = m2σ2‖Tk−T j‖2Π.From the first property of the packing set generated in Lemma 66, ‖T k−T j‖2F≤ 4γ2Nd . Thiscombined (3.26) and (3.27) yieldsP(T˜ 6= T ])≥ 1−32Lγ4α2mσ2 +12γ2rN≥ 1− 32Lγ4α2mσ2rN− 12γ2rN≥ 34− 32Lγ4α2mσ2rN,provided rN > 48. Now if γ4 ≤ σ2rN128Lα2m ,in fTˆsupT∈KTM(α,R)1NdE‖Tˆ −T‖2F≥α2γ216.Therefore, if σ2rN128Lα2m ≥ 1 choosing γ = 1 finishes the proof. Otherwise choosing γ2 = σ8√2α√rNLmresults inin fTˆsupT∈KTM(α,R)1NdE‖Tˆ −T‖2F≥σα128√2√rNLm≥ σR128√2KG√NLm.61Chapter 41-bit tensor completionIn this chapter, we analyze the problem of recovering a low-rank tensor from 1-bit measure-ments of a subset of its entries. Namely, we provide recovery guarantees using max-qnorm andM-norm constrained ML estimation in Section 4.3, nuclear-norm constrained ML estimation insection 4.4, and finally exact-rank-constrained ML estimation in Section 4.5. All the proofs arepresented separately in Section 4.8. We explain the max-qnorm constrained analysis first, as someof the concepts in its proof will be used in the other two sections. Before that, we explain theobservation model and main goal of the problem more concretely.4.1 Introduction and past resultsCompressed sensing, as the name suggests, analyzes the possibility of simultaneously measur-ing (sensing) a signal and compressing it at the acquisition level. Sometimes hardware limitationsor the nature of the measurements force these measurements to be highly quantized which adds anextra complication to this process. 1-bit compressed sensing is a technique that was developed toaddress this issue [16], where we examine the extreme case where we just store 1-bit measurementsof a signal. It is interesting to note that when the signal to noise ratio is low, numerical experimentsdemonstrate that such extreme quantization can be optimal [76].Generalizing from vectors to matrices, a framework for 1-bit matrix completion was developedin [31] where the indices of a low-rank matrix is sampled after dithering them by a known noisefunction and then storing the sign of these dithered samples. Interestingly they showed that thesample complexity of 1-bit matrix completion is similar to the case where the measurements arenot quantized.62The extreme quantization framework is especially interesting for recommender systems. Thereare two reasons for this. First, users’ ratings have to be quantized so that they have qualitativeinterpretations. The most extreme case is yes/no questions in surveys. Moreover, user ratings areinfluenced by many factors which can be modeled by considering an additive noise. In [31] it wasshown that 1-bit matrix completion is more successful than multi-bit matrix completion in estimat-ing the movie-ratings.In the 1-bit matrix completion problem, the measurements are in the form of Y (ω)= sign(M](ω)+Z(ω)) for ω ∈Ω, where Z is a noise matrix. Notice that without the noise, there will be no differ-ence in the 1-bit measurements of M] and αM] for any positive α . However, if we assume that theadditive noise comes from a log-concave distribution, the problem can be solved by minimizinga regularized negative log-likelihood function given the measurements. In other words, the noisematrix has a dithering effect which is essential for unique recovery of the matrix. Under these as-sumptions, a nuclear-norm constrained maximum likelihood (ML) optimization was used in [31]to recover M] and it was proved that it is minimax rate-optimal under the uniform sampling model.In particular, in order to recover a rank-r matrix M] with ‖M]‖∞≤α from 1-bit measurements, theyconsidered the weaker assumption that the matrix belongs to the set {A| ‖A‖∗≤α√rN2,‖A‖∞≤α}and proved1N2‖M]−Mrecovered‖2F≤Cα√rNm,provided that m >CN log(N). The authors also show that these estimates are near-optimal.A line of research followed this paper and concentrated on analyzing the recovery of low-rank matrices from 1-bit measurements [11, 18, 93]. Max-norm constrained ML estimation wasconsidered in [18]. The max norm, an alternative convex proxy of the rank of a matrix, bounds thespectral norm of the rows of the low-rank factors of M], say U and V . It is defined as ‖M]‖max:=min‖U‖2,∞‖V‖2,∞ s.t. M] = UV ′ [117]. The max norm has been extensively investigated in themachine learning community after the work of [116] and shown to be empirically superior tonuclear-norm minimization for collaborative filtering problems. A max-norm constrained MLestimation was analyzed in [18] and it was shown to be near optimal. To be precise, it was provedin [18] that under some weak assumptions on the sampling distribution, with probability greaterthan 1− 4N ,1N2‖M]−Mrecovered‖2F≤Cα(α√rNm+Uα√log(N)m).63Notice that m = O(rN) is sufficient for estimating the true matrix efficiently.Finally using exact low-rank constraints for 1-bit matrix completion is considered in [11, 93].Here, they consider a rank-constrained ML estimator and analyze it using a second-order Taylorexpansion. The sampling model of [11] corresponds to a d-regular bipartite graph and their estima-tion error bound, O(N3√r3Nm2 ), is inferior to that of [93] which is O(√rNm ). Although the low-rankconstraints are non-convex, an efficient conditional gradient descent algorithm was proposed andanalyzed in [93] which will be discussed later.Other generalizations of this problem have been considered in the literature. Learning from afinite alphabet was considered in [75]. Notice that finite alphabet is different than what we callmulti-bit where multi-bit can be thought of as infinite-bit as well. Subspace learning from bits wasconsidered in [26] and more general formulations of the problem were considered in [100, 125].In this chapter we generalize 1-bit matrix completion to multi-dimensional tensors and analyzerecovery guarantees of 1-bit measurements obtained from a rank-r bounded tensor. Similar to themulti-bit tensor completion, we show the advantage of using M-norm and max-qnorm constrainedcompletion instead of matricizing both theoretically and numerically when r = O(1).Generalizing 1-bit matrix completion to 1-bit tensor completion was considered in [2] as well.There they solve the 1-bit tensor completion problem by a weighted sum of the 1-bit matrix com-pletion problem resulting from matricization of the tensor along each one of the d dimensions.Although this method can be successful at finding the best matricization, its performance is stillbounded by the best performance one can get using matricization which inferior to direct tensorcompletion when r N.The framework of 1-bit tensor completion has not been analyzed before. However, tensorfactorization for context-aware recommender systems has been investigated in [54, 65, 128, 133]with two distinct differences. First, analyses in these works are different from the 1-bit machineryintroduced in [31] and second is the lack of theoretical analysis. However, these works show thepower of using tensor factorization for context-aware recommender systems.644.2 Problem formulation and observation modelIn this section, we explain the 1-bit tensor completion problem. Given a d-dimensional tensorT ] ∈⊗di=1RN (for simplicity, we assume Ni = N) and a random subset of its indices Ω ⊂ [N]×[N]×·· · [N], suppose that we observe 1-bit measurements of the indices in the subset Ω accordingto the following rule:Yω =+1 with probability f (T]ω),−1 with probability 1− f (T ]ω)for ω ∈Ω. (4.1)Here f :R→ [0,1] is a monotone differentiable function which can be interpreted as the cumulativedistribution of some noise function Zω where the observation then becomes [31]:Yω =+1 with probability Tω +Zω ≥ 0,−1 with probability Tω +Zω < 0 for ω ∈Ω.Suppose that we observe 1-bit measurements of a random subset of the entries of the tensor. Astandard assumption is that observed indices are chosen uniformly in random, i.e., we randomlypick |Ω| indices out of all the Nd indices and then take 1-bit measurements from this set. Here,Ω is chosen at random such that P(ω ∈Ω) = mNd . Alternatively, the measurements may be pickeduniformly at random with replacement so each index is sampled independently of the other mea-surements. We take a generalization of the later model, i.e., we assume a general sampling distri-bution Π= {piω} for ω ∈ [N]× [N]×·· ·× [N] such that ∑ω piω = 1 and then each index is sampledusing the distribution Π. Although in this method it is possible to choose a single index multipletimes, it has been proven in various contexts that sampling with replacement can be as powerfultheoretically as sampling without replacement[18, 19]. Other than theoretical considerations, as-suming non-uniform sampling is a better model for various applications of 1-bit matrix and tensorcompletion, including the Netflix problem. The challenges and benefits of using non-uniform sam-pling are thoroughly discussed in [18, 115].We assume that f is such that the following quantities are well defined:Lα := sup|x|≤α| f ′(x)|f (x)(1− f (x)) ,βα := sup|x|≤αf (x)(1− f (x))( f ′(x))2,(4.2)65where α is the upper bound of the absolute values of entries of T . Here, Lα controls the steepnessof f , βα controls its flatness. Furthermore, we assume f and f ′ are non-zero in [−α,α] and thequantity:Uα := sup|x|≤αlog(1f (x)(1− f (x))),is well-defined. A few well-known examples are [31]:• Logistic regression/Logistic noise defined as f (x) = ex1+ex ,Lα = 1, βα = (1+eα )2eα , and Uα = 2log(eα2 + e−α2 ).• Probit regression/Gaussian noise defined as f (x) =Φ( xσ ),Lα ≤ 4σ (ασ +1), βα ≤ piσ2 exp( α22σ2 ), and Uα ≤ (ασ +1)2.For the exact-rank constrained proof we use quadratic lower and upper bounds on the secondorder Taylor expansion of f . Therefore we also define γα , and µα to beγα := min(inf|x|≤α{ f′2(x)f 2(x)− f′′(x)f (x)}, inf|x|≤α{ f′2(x)(1− f (x))2 +f ′′(x)1− f (x)}),µα := max(sup|x|≤α{ f′2(x)f 2(x)− f′′(x)f (x)}, sup|x|≤α{ f′2(x)(1− f (x))2 +f ′′(x)1− f (x)}).(4.3)γα and µα are fixed constants that do just depend on α . For example for the Logistic regressionmodel, we have γα = eα(1+eα )2 and µα =14 .The problem of 1-bit tensor completion entails recovering T ] from the measurements Yω giventhe function f . To do this, similar to [31], we minimize the negative log-likelihood functionLΩ,Y (X) = ∑ω∈Ω(1[Yω=1] log(1f (Xω))+1[Yω=−1] log(11− f (Xω))),given our observations and subject to certain constraints to promote low-rank solutions.4.2.1 DiscrepancyIn this section, we briefly describe some more notation which are direct generalizations ofcorresponding definitions for matrix distances. First, the Hellinger distance for two scalars 0 ≤p,q≤ 1 is defined by:d2H(p,q) = (√p−√q)2+((√1− p−√1−q)2,66which gives a standard notation on the distance between two probability distributions. We gener-alize this definition to define Hellinger between two tensors T,U ∈ [0,1]N1×·Nd as:d2H(T,U) =1N1 · · ·Nd ∑i1,···,idd2H(S(i1, · · · , id),U(i1, · · · , id)).The Kullback-Leibler divergence between two tensors T,U ∈ [0,1]N1×·Nd is generalized as:K(T ||U) := 1N1 · · ·Nd ∑i1,···,idK(S(i1, · · · , id)||U(i1, · · · , id)),whereK(p||q) := p log( pq )+(1− p) log(1−p1−q ), for 0≤ p,q≤ 1. It is easy to prove that for any twoscalars 0≤ p,q≤ 1, d2H(p,q)≤K(p||q) and hence:d2H(T,U)≤K(T ||U). (4.4)4.3 Max-qnorm and M-norm constrained 1-bit tensorcompletionIn this section, we analyze the problem of 1-bit tensor recovery using max-qnorm constrainedML estimation as well as M-norm constrained ML estimation. These two optimization problemsare closely related, and our analysis and results are similar. Therefore, we explain the corre-sponding results together. To be precise, we recover T ] from the 1-bit measurements YΩ, acquiredfollowing the model (4.1), via minimizing the negative log-likelihood functionLΩ,Y (X) = ∑ω∈Ω(1[Yω=1] log(1f (Xω))+1[Yω=−1] log(11− f (Xω))), (4.5)subject to X having low max-qnorm or M-norm and small infinity-norm. DefiningKTmax(α,R) : = {T ∈ RN1×N2×···×Nd : ‖T‖∞≤ α,‖T‖max≤ R},KTM(α,R) : = {T ∈ RN1×N2×···×Nd : ‖T‖∞≤ α,‖T‖M≤ R},67we analyze the following optimization problems:Tˆmax = arg minXLΩ,Y (X) subject to X ∈ KTmax(α,Rmax). (4.6)TˆM = arg minXLΩ,Y (X) subject to X ∈ KTM(α,RM). (4.7)Remark 38. Flat low-rank tensors are contained in KTmax(α,Rmax) and KTM(α,RM) (Rmax and RM arequantities that just depend on α and the rank of the tensor) [45, Theorem 7]. M-norm, max-qnormand infinity norm are all continuous functions and therefore, both KTmax(α,Rmax) and KTM(α,RM)contain approximately low-rank tensors. We assume the data is generated from one of these setsof approximate low-rank tensors.Theorem 39. Suppose that we have m 1-bit measurements of a subset of the entries of a tensorT ] ∈ RNd following the probability distribution explained above where f is chosen such that Lα ,βα , and Uα are well defined. Moreover, the indices of the tensor are sampled according to aprobability distribution Π. Then assuming ‖T ]‖max≤ Rmax, there exist absolute constants Cmaxand CM such that for a sample size 2 ≤ m ≤ Nd and for any δ > 0 the maximizer Tˆmax of (4.6)satisfies:‖T ]− Tˆmax‖2Π:=∑ωpiω(Tˆmax(ω)−T ](ω))2 ≤Cmaxcd2βαLαRmax√dNm +Uα√log( 4δ )m (4.8)with probability at least 1−δ . Similarly, assuming ‖T ]‖M≤ RM, for any δ > 0, the maximizer TˆMof (4.7) satisfies:‖T ]− TˆM‖2Π=∑ωpiω(TˆM(ω)−T ](ω))2 ≤CMβαLαRM√dNm +Uα√log( 4δ )m (4.9)with probability at least 1−δ .Corollary 40. A rank-r tensor T ]r with ‖T ]r ‖∞≤ α satisfies ‖T ]r ‖max≤√rd2−2α and therefore, themaximizer Tˆmax of (4.6) with Rmax =√rd2−2α satisfies‖T ]r − Tˆmax‖2Π≤Cmaxcd2βα{Lα√rd2−2α√dNm+Uα√log( 4δ )m},with probability greater than 1−δ . Moreover, ‖T ]r ‖M≤ (r√r)d−1α and the maximizer TˆM of (4.7)68with RM = (r√r)d−1α satisfies‖T ]r − TˆM‖2Π≤CMβα{Lα(r√r)d−1√dNm+Uα√log( 4δ )m},with probability greater than 1−δ .Remark 41. There are two differences in the right-hand sides of (4.8) and (4.9). First is the differ-ence in Rmax and RM which is due to the different upper bounds on max-qnorm and M-norm of abounded rank-r tensor. The second difference is in the constants Cmaxcd2 and CM where CM <Cmaxc1where c1 and c2 are small constants derived from the generalized Grothendieck’s theorem [45, The-orem 13],[121] (c1 < 0.9 and c2 <√2). Note for a tensor T , ‖T‖M≤ c1cd2‖T‖max and CM < Cmaxc1which implies that (4.9) is tighter than (4.8). Moreover, for d ≥ 2, the M-norm bound of a low-ranktensor is smaller than its max-qnorm bound [45].Remark 42. Assuming that all entries have a nonzero substantial probability of being observed,i.e., in the sense that there exists a constant η ≥ 1 such that piω ≥ 1ηNd , for all ω ∈ [N1]×·· ·× [Nd],we can simplify (4.8) as1Nd‖T ]− Tˆmax‖2F≤Cηβα{LαRmax√dNm+Uα√log(dN)m}.A similar bound can be obtained for M-norm constrained ML estimation (4.9) as well.Remark 43. The terms r3d−3 and rd2−d in the upper bounds come from the M-norm and max-qnorm of low-rank tensors. We believe this is sub-optimal in r. Indeed, when d = 2, we know thatboth these upper bounds can be improved to√r instead of r32 and r2 .Remark 44. For a rank-r tensor T ] with ‖T ]‖∞≤ α , it is sufficient to choose Rmax = r d2−d2 α in(4.6) and RM = r3d−32 α in (4.7) for T ] to be feasible [45]. This proves that efficient recovery can beachieved when m = O(rd2−dN) using (4.6) and m = O(r3d−3N) using (4.7). These are significantimprovements over matricizing, which would require Ω(rNd2 ) measurements when r N.Remark 45. 1-bit tensor completion can recover the magnitude of the tensor because of the dither-ing effect of the noise we add before sampling. If the SNR is too low, we might end up estimatingthe noise by a low-rank tensor and if the SNR is too high, we risk losing magnitude information.In Section 4.7, we conduct some experiments and discuss the optimal noise level using syntheticdata.Remark 46. 1-bit matrix completion using max-norm is analyzed in [18]. A direct generalizationresulted in the error bound (4.8) which recovers their result when d = 2. However, the direct69generalization of max-norm to tensors is a quasi-norm and the resulting constraint is non-convex.The closely related upper bound using M-norm (4.9) is governed by a convex optimization andgives similar results when d = 2. In this case, the only difference is the Grothendieck’s constant1.67 < c1c22 < 1.79 [62].4.4 Nuclear norm constrained 1-bit tensor completionIn this section, we analyze tensor recovery from 1-bit measurements under nuclear-norm con-straints. Instead of considering the set of tensors with bounded rank, we consider the set of tensorswith bounded nuclear norm. This constraint is more relaxed compared to the non-convex set oflow-rank tensors. Using Theorem 7, the nuclear norm of a rank-r tensor T with ‖T‖∞≤ α isbounded by ‖T‖∗≤ α√rd−1Nd . To be consistent with the results obtained in [31], in this section,we assume uniform sampling, i.e., we assume P(ω ∈Ω) = mNd (the exact same results are also truefor (4.1) with piω = mNd ). Considering all these assumptions, in this section, we use the followingoptimization problem to recover T ].Tˆnuc = arg maxXLΩ,Y (X) subject to ‖X‖∗≤√rd−1Ndα. (4.10)Theorem 47. Assume that ‖T ]‖∗≤√rd−1Ndα and ‖T ]‖∞≤ α and suppose the sample set Ω ischosen at random such that P(ω ∈ Ω) = mNd and 1-bit measurements of these samples are takenvia (4.1). If Tˆnuc is the solution to (4.10), Then with probability at least 1− CNd−1‖T ]− Tˆnuc‖2FNd≤Cd,α√rd−1(logd(N)√N√m+log(N)√Ndm).Remark 48. Theorem 47 is a direct generalization of the 1-bit matrix completion using nuclear-norm constrained ML in [31] and the results are the same when d = 2.Remark 49. Comparing Theorem 39 and Theorem 47 shows that the nuclear-norm constrainedalgorithm gives much weaker results compared to the M-norm constrained one. In this remark, weattempt to clarify this difference. The main step in the proof of Theorem 47 which results in theO(Nd2 ) requirement in the number of measurements, is the second inequality of (4.28) below whichis similar to the Rademacher complexity of low nuclear-norm tensors. In the following remark weshow that with the assumptions in Theorem 47 we can’t hope to get a better result and if one hopesto achieve tighter upper bounds, further assumptions should be considered. For simplicity, we as-sume α = 1.70• Define ~1N to be the vector of all ones with length N. Then for the d-dimensional tensorT = ~1N ◦ ~1N ◦ · · · ◦ ~1N , rank(T ) = 1, ‖T‖∞= 1, and ‖T‖∗=√Nd . This shows that the boundon the nuclear norm of bounded rank-r tensors is tight (at least) in N.• Considering the set G = {X ∈ ⊗di=1RN : ‖X‖∗≤ √rd−1Nd}, the empirical Rademachercomplexity of set G is defined as E[supX∈G|< ∆Ω ∗E,X >|], where ∗ denotes the Hadamardproduct, ∆Ω is the sampling operator which is one on Ω and zero otherwise and E is atensor of Rademacher random variables. This expression shows up in (4.28) and is atleast√rd−1Nd which can be achieved by considering a tensor T ∈ G which is zero ev-erywhere except for one single entry with the value of√rd−1Nd . For such a tensor T ,‖T‖∗=√rd−1Nd and obviously, for any random tensor ∆Ω ∗E there exists such a tensor thatmakes[supX∈G|< ∆Ω ∗E,X >|]≥√rd−1Nd . Therefore, it seems that imposing an explicitconstraint on the infinity norm is necessary if one hopes to get a better error bound.• Even considering the set G= {X ∈⊗di=1RN : ‖X‖∗≤√rd−1Nd,‖X‖∞≤ 1} doesn’t result inbetter results. For any sampling set Ω and tensor ∆Ω, consider the tensor T withT (ω) =min(1,√rd−1Nd|Ω| ) · sign(∆Ω) for ω ∈Ω,0 for ω /∈ΩThen ‖T‖∞≤ 1, ‖T‖∗≤min(|Ω|,√rd−1Nd) and more importantly |< ∆Ω ◦E,X >|=min(|Ω|,√rd−1Nd).Tracking back the steps of the proof we get1Nd‖T − Tˆnuc‖2F≤Cd,α(logd(N)min(1Nd−1,√rd−1N√m)+log(N)min(m,√rd−1Nd)m),which is essentially the same result as before and the theoretical bound is meaningful whenm >√rd−1Nd .It is worth mentioning that the above remark just proves that with the approach taken in ourproof (or any nuclear-norm constrained approach that depends on the Rademacher complexity oflow nuclear-norm tensors) we can not expect to get a better result. We believe that without someextra incoherence assumptions on the tensor, it’s impossible to get better results. The specifics of71such assumptions depend on the considered applications. An example of such assumptions whichis widely used in matrix completion literature is an incoherence condition on the low-rank factors.In particular, assume a rank-r tensor T = U1 ◦U2 ◦ ◦· · · ◦Ud , where U j := [u( j)1 u( j)2 · · · u( j)r ] isthe low-rank factor whose columns are the low-rank vectors corresponding to the j-th dimension.Then assuming PU j to be the orthogonal projection onto the columns of U j, the coherence of U j isdefined asµ(U j) =N jrmax1≤i≤N j‖PU jei‖22.The incoherence assumption max j(µ(U j))≤ µ0, for some positive constant µ0 forces the elementsof the low-rank factors to be of the same order. Once we make this assumption, we can observethat we can bound the entries of the tensor by (using Matlab notations below for the rows):|T (i1, · · · , id)|= | Σ1≤k≤rU1(i1,k)U2(i2,k) · · ·Ud(id,k)|≤ ‖U1(i1, :)‖2‖U2(i2, :)‖2‖U3(i3, :)‖∞· · ·‖Ud(id, :)‖∞≤ (µ0rN )d,which removes the Nd factor in the upper bound of Theorem 47 with the expense of an extra rd .However, one can argue that such an argument is implicitly assuming that the ‖T‖∞∼O( 1Nd ) whichis much stronger than the assumptions in Theorem 47. We believe that applicability of such extraassumptions depends on specific applications and needs to be explored separately. We keep theresults as they are to be consistent with the results of 1-bit matrix completion in the literature.Remark 50. Remark 49 studies the nuclear-norm constrained ML. We do not know if bringing thenuclear norm into the optimization as a regularizer, i.e., minimizingLΩ,Y (X)+λ‖X‖∗ can achievebetter sample complexity for a suitable λ . We postpone the study of this problem to future work.It is worth mentioning though that finding the tensor with the least nuclear norm that satisfiesmeasurements have been studied in [129] for the problem of tensor completion; they prove therequired number of samples for exact recovery is O(rN32 ) when d = 3 which is similar to our resultfor the 1-bit case.4.5 1-bit tensor completion under exact rank constraintNext, we consider recovery with exact rank constraints. Such formulations in the matrix casewere first proposed and analyzed in [11], where the authors considered a rank-constrained MLestimation for recovering a matrix where the sampling is drawn from a p-regular bipartite graph.Here, p is the number of indices observed in each row. Even though this observation model is72restrictive, for an N×N matrix, their estimation error bound, O(N3√r3N|Ω|2 ), is not minimax optimal.Later [93] improved this result to O( rN log(N)|Ω| ) using the fact that under mild assumptions, low-rankmatrices follow a restricted strong convexity condition [91].The exact rank constraint adds some complications to the problem of 1-bit tensor completion.First, unlike the M-norm and nuclear-norm constraints, the rank constraint is non-convex and sec-ond, we need to know the rank of the true matrix (or an accurate estimation of the rank). Forthe sake of simplicity we assume that the samples are drawn randomly and independently suchthat P(ω ∈ Ω) = mNd . Notice that although this is different than sampling with replacement, thetwo models are closely related to each other and with high probability the analysis of one can beapplied to the other. Given such a random sample set Ω and 1-bit measurements of these samplesYΩ we employ a rank-constrained ML estimation to recover the underlying tensor T ]. In partic-ular, we assume T ] is a d-dimensional rank-r tensor with ‖T ]‖∞≤ α and we study the followingoptimization problem:Tˆr = arg minXLΩ,Y (X) subject to rank(X)≤ r,‖X‖∞≤ α. (4.11)The main idea of the proof is the same as the ones on [11, 93], i.e., using a second order Taylorexpansion onLΩ,Y and some concentration inequalities to get a bound on the error.Theorem 51. Assume that T ] is a d-dimensional tensor, where rank (T ])≤ r and ‖T ]‖∞≤ α andsuppose Ω is chosen at random (with replacement) according to the probability distribution Π ={piω} such that piω = 1Nd . If Tˆr is the solution to (4.11), then with probability at least 1−exp(−cNd)1Nd‖T ]− Tˆr‖2F≤max(4α2Ndm,Cα,γαα2(2r)3d2 − 32√Ndm). (4.12)Remark 52. Theorem 51, which is a generalization of the results in [11, 93], considers ML estima-tion using exact rank constraints. The result in [11] is not as good as the one in [93] and therefore,here we compare this theorem with the latter (assuming d = 2 and taking M] to be the originalmatrix). Theorem 3.3 of [93] proves the upper bound on the recovery error‖Mˆ−M]‖2FNd≤C max(1,α2)rN log(N)m.Comparing this result with (4.12), we make the following remarks.73• Our result does not have the log(N); which is the result of using the strong convexity of theset of low max-norm tensors (refer to Section 4.8.3 for more details).• Fixing p = mN their error is O(log(N)p ) whereas our error is O(√1p). This means that if wefix N our error bound is weaker in terms of m. The reason behind these differences is thatthey use the strong convexity of the set of low-rank matrices whereas we consider the set oflow M-norm matrices. Notice that information-theoretic lower bounds have been developedfor both max-norm and nuclear-norm constrained 1-bit matrix completion in [18, 31]; thesebounds are O(√1p) as well. Considering low M-norm tensors is essential for our proof whend > 2.Remark 53. Comparing exact rank constraints with M-norm and nuclear-norm constraints, noticethat regardless of the proof method the upper bound obtained with exact rank constraints can not beworse as the set of low-rank tensors is a subset of the low M-norm and low nuclear-norm tensorsconsidered above. However, we use M-norm and its dual in one of the steps of the proof (see(4.42)) and this is the reason we get similar bounds as using M-norm. One way to solve this issueis proving the restricted strong convexity of the set of low-rank tensors. When d = 2, this has beenstudied in [91]; we postpone generalizing this result to d > 2 to future work.4.6 Information-theoretic lower boundIn this section, we provide a lower bound for the reconstruction error when recovering tensorsfrom 1-bit random measurements. In particular, by using an information-theoretic argument simi-lar to the one in [31], we prove that on a particular set of tensors, any arbitrary algorithm will notbe able to recover a close enough approximation of all the tensors in that set. The same approachwas taken in both [18, 31] and therefore, we will not go into the details of the proofs and refer tothose papers for a more detailed explanation of these type of methods. However, to be more pre-cise, we use a classical information-theoretic technique which shows that with limited amount ofinformation we can distinguish only a limited number of tensors from each other. To that end, wefirst construct a set of tensors χ such that for any two distinct members X i and X j of χ ‖X i−X j‖Fis large. Therefore, we should be able to distinguish the tensors in this set if the recovery error ofan arbitrary algorithm is small enough. However, Fano’s inequality will imply that the probabilityof choosing the right tensor among this set is going to be small and therefore force a lower boundon the performance of any arbitrary recovery algorithm.74The lower bound is achieved using Lemma 18 of [45], which constructs a packing set for theset of low M-norm tensors given byKTM(α,RM) := {T ∈ RN1×N2×···×Nd : ‖T‖∞≤ α,‖T‖M≤ RM}.We include Lemma 18 of [45] and a reiteration of the proof in [31] in Section 4.8.4. In the following1.68 < KG := c1c22 < 1.79 is Grothendieck’s constant.Theorem 54. Fix α , r := b( RMαKG )2c, RM, and N such that α ≥ 1, α2rN ≥C0, r≥ 4 and r≤O(Nα2 ).Let βα be defined as in (4.2). Suppose f′(x) is decreasing for x > 0 and for any tensor T inthe set KTM(α,RM), assume we obtained m measurements (denoted by YΩ) from a random subsetΩ of the entries following the model (4.1). Considering any arbitrary algorithm that takes thesemeasurements and returns an approximation Tˆ (YΩ), then there exists a tensor T ∈KTM(α,RM) suchthatP(1Nd‖T − Tˆ (YΩ)‖2F≥min(C1,C2√β 3α4RMKG√Nm))≥ 34Remark 55. Theorem 54 proves that the sample complexity achieved by (4.7) in Theorem 39 isoptimal in both N and RM. Evidence in the matrix case suggests that a similar lower bound for theset of low-rank tensors (instead of low M-norm) should be Nm instead of√Nm provided that m islarge enough. We postpone an upper bound for exact low-rank tensors to future work.4.7 Numerical resultsIn this section, we compare some numerical results of using the above algorithms for 1-bittensor completion. The optimization problem we want to solve is in the form ofarg minXLΩ,Y (X) subject to X ∈ C, (4.13)where LΩ,Y (X) is the negative log-likelihood function defined in (4.5). Similar to the matrixcompletion problem LΩ,Y is a smooth function from RNd → R. The main difference betweenthe three different methods is in the choice of C. Before, showing the results, we mention a fewpractical considerations for each one of the above algorithms.4.7.1 Max-qnorm constrained 1-bit tensor completionWe are not aware of any algorithm that is able to solve the M-norm constrained problem (4.7).Therefore, instead, we present the results of constraining the max-qnorm of the tensor. To this75end, we employ a similar approach to the one in [45] and use the low-rank factors to estimate themax-qnorm of a tensor. In particular, defining f (V1, · · · ,Vd) := LΩ,Y (V1 ◦ · · · ◦Vd) the problembecomesmin f (V1, · · · ,Vd) subject to maxi(‖Vi‖2,∞)≤ R1dmax ,‖V1 ◦ · · · ◦Vd‖∞≤ α. (4.14)In the definition of max-qnorm (2.11), there is no limit on the size of the low-rank factors. Dueto computational restrictions, we limit the factor size by twice the dimension size, i.e., Vi ∈RN×2N .Although we end up approximating the solution of (4.14), our numerical results shows that, forsynthetic tensors with rank smaller than N, 2N is a large enough factor size and the results are notvery sensitive to the use of larger factors (this is true for our experiments where N < 100). Anotherpractical benefit of (4.14) is that we can use alternating minimization on the small factors insteadof the whole tensor which is crucial in applications where saving the whole tensor is too expensive.The next practical problem that needs to be addressed is in choosing Rmax. Although theorysuggests the upper-bound of rd2−d2 α [45], in practice this bound can be much higher than the ac-tual max-qnorm of a rank-r tensor. Moreover, in many practical applications, we do not have anaccurate upper bound on the rank of the tensor (or the rank of a good approximation of a tensor).Therefore, similar to [45], we use cross-validation to estimate the optimal value of Rmax.In summary, to approximate the solution of (4.14), in our experiments we use 10% of the avail-able data for cross-validating the choice of Rmax and optimize over the factors V1 to Vd alternativelywhile at each iteration we solve the simpler sub-problemminVif (V1, · · · ,Vd) subject to ‖Vi‖2,∞≤ R1dmax, (4.15)while fixing all the other factors Vj, j 6= i. We solve each sub-problem (4.15) by employing aprojected gradient algorithm. The algorithm updates all the factors[Vi]← PC([Vi− γ5 ( fi)Ri]). (4.16)where PC simply projects the factor onto the set of matrices with `2,∞ norm less than R1d . Thisprojection looks at each row of the matrix and if the norm of a row is bigger than R1d , it scales thatrow back down to R1d and leaves other rows unchanged.76Remark 56. The constraint on the boundedness of the tensor (‖V1 ◦ · · · ◦Vd‖∞≤ α) can also beincorporated into (4.15), which introduces new challenges to the algorithm. First, the exact pro-jection onto the set of {Vi|‖V1 ◦ · · · ◦Vd‖∞≤ α} is not as straightforward as the projection onto{Vi|‖Vi‖2,∞≤ R1dmax}. An approximate projection by rescaling the factor viaVi =α‖V1 ◦ · · · ◦Vd‖∞Vi, if ‖V1 ◦ · · · ◦Vd‖∞> αwas introduced in [18]. On the other hand, the exact projection can be formulated as a quadraticlinear program, which is very computationally expensive. The second complication is that addingthis constraint makes the constraint set C the intersection of two convex sets, which again makesexact projection expensive.In our synthetic experiments, we noticed that adding this constraint does not change the resultssignificantly, especially for low-rank tensors. Furthermore, in our applications, we concentrate onthe performance of the algorithm in recovering the sign of the tensor which reduces the importanceof projecting onto the set of bounded tensors and therefore, in this section, we just report the resultsobtained by (approximately) solving (4.15).Remark 57. The optimization problem (4.15) can be solved in parallel over the rows of Vi as boththe objective function and the regularizer (‖.‖2,∞) are decomposable in the rows.Nuclear-norm constrained 1-bit tensor completionThe nuclear norm of tensors has been thoroughly studied in [35, 57, 130]. Calculating the nu-clear norm of a tensor is NP-hard [43] and similarly projections onto the set of low nuclear-normtensors are NP-hard. Furthermore, we are not aware of any algorithms that can approximate thesolution to nuclear-norm constrained (or regularized) tensor completion.In the matrix case, i.e., when d = 2 we know that the nuclear norm of a matrix can be expressedas [117]‖M‖∗= min{B,C}12(‖B‖2F+‖C‖2F) s.t. X = BCT = B◦C. (4.17)A generalization of this represtation to tensors was used as a regularizer in [9], which has veryinteresting relations with the CP decomposition of the tensor. Notice that the following represen-tation of nuclear-norm is equivalent to the one in (4.17).‖M‖∗= min{B,C}(‖B‖F‖C‖F) s.t. X = BCT = B◦C. (4.18)77When d = 2, there is an interesting relation between the M-norm (2.12) and max-norm (2.7) onone hand, and the two representations of the nuclear norm, i.e., (2.5) and (4.18) on the other hand.This motivated us to consider the following norm, which we call factor-frob norm as a surrogatefor tensor nuclear-norm.‖T‖:= minT=U (1)◦U (1)◦···◦U (d){d∏j=1‖U ( j)‖F}. (4.19)The relations of this norm with the nuclear norm and the sample complexity of using this norm fortensor completion is an interesting question that we refer to future work. An advantage of (4.19) isthat similar to max-qnorm, we can use an alternating algorithm accompanied with cross-validationto approximate the solution to the following problemTˆ = arg minXLΩ,Y (X) subject to ‖X‖≤ R, ‖X‖∞≤ α. (4.20)We postpone analyzing the sample complexity of (4.20) to future work.4.7.2 1-bit tensor completion with exact rank constraintsThe optimization problem (4.11) is non-convex and very hard to approximate due to local min-ima. Similar to [93], we use a conditional gradient descent (CGD) algorithm for rank-constrainedlog-likelihood estimation. In particular, at each iteration t, we first find a descent direction fromthe gradient of LΩ,Y , and then use a tensor rank-1 approximation of the gradient ∇LΩ,Y (Xt) andupdate the solution by doing a line search. Notice that at the t-th iteration of the CGD, the outputis a rank-t tensor. If we know the exact rank, we stop the algorithm when t = rank(T ]). Otherwise,we find the best rank by the estimation error obtained from a validation set of samples.For the sake of completeness, the details of the algorithm is shown in Algorithm 3. At eachiteration of the algorithm, we find the gradient tensor, that has zero values in the non-observedindices and find the best rank-1 approximation to this tensor. For the rank-1 approximations (line8 in Algorithm 3) we can use the tensor toolbox version 2.5 available in [4]. However, our exper-iments show that using a tensor completion algorithm with alternating rank-1 factors (e.g., usingAlgorithm 2) give much better results. This is due to the fact that the values of the gradient tensoron the indices that are not observed is not available and therefore, solving a tensor completionalgorithm gives better results than fitting the gradient tensor (with all the zeros of non-observedentries). These approximations are very likely to get stuck in local minima. Therefore, we need to78Algorithm 3 1-bit TC with rank-constraints via a Conditional Gradient Descent (CGD)1: Input Ω, Y ,LΩ,Y , rup, α2: Output Tˆ using the CGD algorithm3: Divide Ω in to Ωtrain (90%) and Ωvalidate (10%)4: X0 = 05: ˆˆT = 06: validationerror =LΩvalidate,Y7: for t =0 to rup do8: V t ← argminrank(V )=1〈V,5LΩtrain,Y (X t)〉9: (lt ,qt)← argminlt≥0,qt≥0LΩtrain,Y (ltX t +qtV t), s.t. ‖ltX t +qtV t‖∞≤ α10: X t+1← ltX t +qtV t11: ifLΩvalidate,Y > validationerror then12: break13: else14: validationerror =LΩvalidate,Y (Xt+1)Tˆ = X t+115: end if16: end for17: return Tˆrun this multiple times with various initial points to make sure that we do not find a local minimumwhich is far away from the global minimum. This makes the algorithm much slower compared tothe other algorithms used in this thesis. Moreover, as we show later, in general, the algorithm per-formance deteriorates when the order of the tensor increases which is due to higher non-convexityof the problem. We discuss the results in more detail in the next section.4.7.3 Synthetic experimentsIn this section, we present extensive numerical results on synthetic experiments to compare theperformance of 1-bit tensor completion with max-qnorm, factor-frob, and exact rank constraints.An important component of the measurements which affects the results is the choice of the func-tion f . In particular, we investigate the optimal choice of σ when we use the Gaussian noise, i.e.,f (x) =Φ( xσ ) in the log-likelihood function (4.5). Figure 4.1 shows the results for d = 2,3,4, whereN = 200,30,15 respectively. In all the figures we average the recovery error over 20 realizationswhen r = 5 and σ varies from 0.001 to 10. Moreover, in all cases, m = |Ω|= Nd2 .Notice that the same experiment can be done with various values of r, and m. However, findingthe optimal value of σ for different cases is very time consuming and unnecessary as for most79-3 -2 -1 0 100.20.40.60.81200x200, r=5MC-NuclearMax-qnormFactor-frobCGD -3 -2 -1 0 100.20.40.60.8130x30x30, r=5-3 -2 -1 0 100.20.40.60.8115x15x15x15, r=5Figure 4.1: Average relative squared error, ‖Trecovered−T]‖2F‖T ]‖2F, obtained from matricized 1bit-TC, max-qnorm, factor-frob norm, and CGD over a range of different value of σ . Fromleft to right the original tensor is 2 ,3 and 4-dimensional. Results show that the noisevariance should not be too small or too big.cases σ = 0.1 seems to be close to the optimal value at least when ‖T ]‖∞= 1.In Figure 4.1, we show the results for matricized nuclear-norm constrained 1-bit matrix com-pletion (MC-Nuclear), Max-qnorm, Factor-frob, and CGD (with exact rank) as explained above.The underlying low-rank tensor is generated from low-rank N× r factors with entries drawn i.i.d.from a uniform distribution on [−1,1]. The tensor is then scaled to have ‖T ]‖∞= 1. The MC-Nuclear results show the case where we matricize the tensor first and then solve a 1-bit matrixcompletion problem using the algorithm in [31] but with added cross-validation for the optimalnuclear-norm bound. In Figure 4.1, the left figure shows the result for T ] ∈ R100×100. In this case,MC-Nuclear and Factor-frob are solving the same problem and as expected we get similar results.Moreover, as expected for the matrix case, the results are close to each other. The CGD algorithmbenefits from the extra information of the rank of the tensor. This low-rank structure gives a smalladvantage to CGD compared to the other algorithms. The figure in the middle shows the results forT ] ∈R30×30×30 and as expected from theorems proved above, we see a significant improvement byusing 1-bit tensor completion instead of matricizing. Moreover, the performances of max-qnormand factor-frob are similar. Finally, the figure in the right shows the results for T ] ∈ R30×30×30.The performance of CGD becomes less and less predictable when we increase the order. One ofthe reasons behind this is the rank-1 projections of the gradient tensor which is an NP-hard taskand becomes less and less accurate when the problem gets harder.In short, we decided to use σ = 0.1 as it seems to give the best performance among all the800 0.5 100.20.40.60.81Matricized 1bit-MCr=3r=5r=100 0.5 100.20.40.60.81max-qnorm 200x200r=3r=5r=100 0.5 100.20.40.60.81factor-frob 200x200r=3r=5r=100 0.5 100.20.40.60.81CGD 200x200r=3r=5r=10Figure 4.2: Average relative squared errors, ‖Trecovered−T]‖2F‖T ]‖2F, for recovering a 200×200 ma-trix with various ranks, using m 1-bit measurements sample for a range of differentvalue of m.choices of σ . Similar to the matrix case [31], using big noise deteriorated the results as we end upmeasuring the noise which in the 1-bit regime looks like a coin toss. Choosing small noise on theother hand limits the dithering effect which is essential for successful recovery.Once we fix the choice of σ we move on to comparing results for different ranks and tensorsizes. In particular, we present results for various rank r ∈ {3,5,10} and various sample sizesmNd ∈ {0.1,0.2, · · · ,1}. Figure 4.2 shows the results of a 2-dimensional case where T ] ∈ R200×200.The results are average relative error obtained from 20 runs of the same sample size m and rank r.As expected the results are very similar to each other. In all the cases we can naturally observe thatresults are better when the underlying tensor has a smaller rank or when the percentage of observed810 0.2 0.4 0.6 0.8 100.20.40.60.811bit-MC 30x900r=3r=5r=100 0.2 0.4 0.6 0.8 100.20.40.60.81max-qnorm 30x30x30r=3r=5r=100 0.2 0.4 0.6 0.8 100.20.40.60.81factor-frob 30x30x30r=3r=5r=100 0.2 0.4 0.6 0.8 100.20.40.60.81CGD 30x30x30r=3r=5r=10Figure 4.3: Average relative squared errors, ‖Trecovered−T]‖2F‖T ]‖2F, for recovering a 30× 30× 30tensor with various ranks, using m 1-bit measurements for a range of different value ofm. Top left is the result of matricization, top right is CGD for exact-rank constraints,bottom left is the results using the factor-frob function as an estimate for nuclear-normconstraint and bottom right is the result of max-qnorm.entries ( mNd ) increases. The CGD algorithm exploits the low-rank structure of the matrix instead ofa convex surrogate of the rank and therefore gives a slightly better result when the sample size islarge.Next, we show the results for 3-dimensional tensors T ] ∈R30×30×30. A balanced matricizationis not possible in this case and therefore, tensor completion has an added advantage when the ten-sor has an odd order. We can see this advantage very clearly in Figure 4.3. For smaller samplingrates, e.g., when mND = 0.3 the average error of 1-bit tensor completion is three to four times better820 0.5 100.20.40.60.811bit-MC 225x225r=3r=5r=100 0.5 100.20.40.60.81max-qnorm 15x15x15x15r=3r=5r=100 0.5 100.20.40.60.81factor-frob 15x15x15x15r=3r=5r=10Figure 4.4: Average relative squared errors, ‖Trecovered−T]‖2F‖T ]‖2F, for recovering a 15×15×15×15 tensor with various ranks, using m 1-bit measurements for a range of different valueof m. Left is the result of matricization, center is the result of using the factor-frobfunction as an estimate for nuclear-norm constraint and right is the result of max-qnorm.than matricizing.The results of max-qnorm and factor-frob constrained completion are very similar. This sug-gests that similar sample complexity can be obtained by analyzing the factor-frob constrainedtensor completion which we postpone to future work. Note that the way we choose the low-rankfactors is an important factor that should be considered in any numerical studies. Interestingly, inboth these cases, there is not much difference between observing 1-bit samples of the full tensorsand observing half of them. This might be due to the fact that the ranks are small for observing thetrue power of 1-bit tensor completion. The CGD results start deteriorating once we move to higherorders. In particular, when the sample percentage is high, the CGD recovery shows some signs ofover-training. In the case of max-qnorm and factor-frob, we used 10% of the samples to preventover-fitting. However, a similar strategy can’t be used for the CGD unless we change the algorithmsignificantly, for example using cross-validation for the line-search part. Such modifications arebeyond the scope of this chapter.Figure 4.4 shows the results for T ] ∈ R15×15×15×15. We removed the CGD results as in allcases they get trapped in local minima and the results are meaningless. It is worth mentioning thatalthough tensor completion results are significantly better than matricizing, the results show someirregular behavior when r is larger.834.8 Proofs4.8.1 Proof of Theorem 39Our proof closely follows the one in [18, Section 7.1]. Therefore, we just briefly explain thesteps. In what follows we prove the max-qnorm constrained ML estimation error. The proof for theM-norm constraint case (4.9) follows the exact same steps where the only difference is a constantdifference in the Rademacher complexity of the unit balls of these two norms. Define the lossfunction g(x;y) : R×{±1}→ R as:g(x;y) = 1[y=1] log(1f (x))+1[y=−1] log(11− f (x)).Regarding the tensor completion problem as a prediction problem where we consider the tensorT ∈ RN1×N2×···×Nd as a function from [N1]× [N2]×·· · [Nd]→ R where the value of the function at(i1, · · · , id) is the corresponding entry, the proof is based on a general excess risk bound developedin [8].For a subset S= {ω1,ω2, · · · ,ωm} of the set [N1]×·· ·× [Nd] and a tensor T , DS(T ;Y ) is the averageempirical loss according to g. Precisely, DS(T ;Y ) = 1m ∑mi=1 g(T (ωi);Y (ωi)). When the sample setS is drawn i.i.d. according to Π (with replacement), we define:DΠ(T ;Y ) := ES∼Π[g(T (ωi);Y (ωi))] =m∑i=1piωig(T (ωi),Y (ωi)).Notice that since Tˆmax is the optimal solution of optimization problem (4.6) and T ] is feasible forthis problem we have:DS(Tˆmax;Y )≤ DS(T ];Y ) = 1mm∑i=1g(T ](ωi);Y (ωi)) (4.21)Therefore, we haveEY [DΠ(Tˆmax;Y )−DΠ(T ];Y )]= EY [DΠ(Tˆmax;Y )−DS(T ];Y )]+EY [DS(T ];Y )−DΠ(T ];Y )]≤ EY [DΠ(Tˆmax;Y )−DS(Tˆmax;Y )]+EY [DS(T ];Y )−DΠ(T ];Y )]≤ supT∈KTmax(α,Rmax){EY [DΠ(T ;Y )]−EY [DS(T ;Y )]}+EY [DS(T ];Y )−DΠ(T ];Y )].(4.22)84Notice that the left-hand side of (4.22) is equivalent to weighted Kullback-Leibler divergence be-tween f (T ]) and f (Tˆmax).Now we focus on bounding the right-hand side of (4.22). Using Hoeffding’s inequality on the ran-dom variable Zω := g(T ](ω);Y (ω))−E[g(T ](ω);Y (ω))] we conclude that with probability 1−δover choosing the sampling subset S:DS(T ];Y )−DΠ(T ];Y )≤Uα√log( 1δ )2m. (4.23)Moreover, a combination of Theorem 8, (4) of Theorem 12 from [8] and the upper bound (10) andnoting that g(,˙y) is an Lα -Lipschitz function yields:supT∈KTmax(α,Rmax){EY [DΠ(T ;Y )]−EY [DS(T ;Y )]} ≤ 12LαRmaxc1cd2√dNm+Uα√8log( 2δ )m. (4.24)Gathering (4.22), 4.23, and (4.24) we get:KΠ( f (T ])|| f (Tˆmax))≤ 12LαRmaxc1cd2√dNm+Uα√8log( 2δ )m+Uα√log( 1δ )2m. (4.25)This, together with (4.4) and [31, Lemma 2] proves (4.8). The upper-bound (4.9) can be proved byfollowing the exact same arguments with the only difference being in the right-hand side of (4.24)where the Rademacher complexity of unit M-norm ball does not have the constant c1cd2 and theupper bound RM is different than Rmax.4.8.2 Proof of Theorem 47To prove this theorem it is more convenient to work with the functionL¯Ω,Y (X) =LΩ,Y (X)−LΩ,Y (0).Lemma 58. Let G⊂⊗di=1RN beG = {X ∈d⊗i=1RN : ‖X‖∗≤ α√rdNd},85for some r < N and α ≥ 0. ThenP(supX∈G∣∣L¯Ω,Y (X)−EL¯Ω,Y (X)∣∣>CdαLγ√rd(logd(N)√N√m+log(N)√Ndm))≤ 1Nd−1,where the probability is both over Ω and Y .Before proving the lemma, note that by assumption, T ] ∈ G and from the definition of Tˆnuc,Tˆnuc ∈ G and L¯Ω,Y (Tˆnuc)≥ L¯Ω,Y (T ]). Therefore,0≤ L¯Ω,Y (Tˆnuc)− L¯Ω,Y (T ])= E[L¯Ω,Y (Tˆnuc)− L¯Ω,Y (T ])]+(L¯Ω,Y (Tˆnuc)−E[L¯Ω,Y (T ])])−(L¯Ω,Y (T ])−E[L¯Ω,Y (Tˆnuc)])≤ E[L¯Ω,Y (Tˆnuc)− L¯Ω,Y (T ])]+2supX∈G∣∣L¯Ω,Y (X)−E[L¯Ω,Y (X)]∣∣=−mD( f (T ])|| f (Tˆnuc))+2supX∈G∣∣L¯Ω,Y (X)−E[L¯Ω,Y (X)]∣∣ .,And therefore with probability bigger than 1− 1Nd−1D( f (T ])|| f (Tˆnuc))≤ 2CdαLγ√rd(logd(N)√N√m+log(N)√Ndm)(4.26)Proof of Lemma 58: This proof closely follows the proof of [31, Lemma 1]. Therefore we omit thesteps that are exactly similar. Using Markov’s inequality for any h > 0P(supX∈G∣∣L¯Ω,Y (X)−EL¯Ω,Y (X)∣∣≥CdLγα√r(logd(N)√mN+ log(N)√Nd))≤E[supX∈G∣∣L¯Ω,Y (X)−EL¯Ω,Y (X)∣∣h](CdLγα√r(logd(N)√mN+ log(N)√Nd))h(4.27)Define E to be a tensor whose entries are i.i.d. Rademacher random variables and ∆Ω to be the86indicator tensor for Ω. Then we have [31]E[supX∈G∣∣L¯Ω,Y (X)−EL¯Ω,Y (X)∣∣h]≤ (4Lγ)hE[supX∈G|< ∆Ω ∗E ∗Y,X >|h]=(4Lγ)hE[supX∈G|< E ∗∆Ω,X >|h]≤ (4Lγ)hE [‖E ◦∆Ω‖‖X‖∗]h≤ (4Lγ)h(√rd√Ndα)hE‖E ◦∆Ω‖h. (4.28)Next, we need to bound the spectral norm of E ◦∆Ω whose entries are i.i.d. zero-mean randomvariables. For this we define B := E ◦∆Ω and use [92, Theorem 3] to bound ‖B‖:Theorem 59. Let B ∈⊗di=1RN be a random order-d tensor, whose entries are independent zero-mean, random variables. For any λ ≤ 164 , assume 1≤ h≤ 2dλNln(5eλ ). Then,E(‖B‖h) 1h ≤ cd√ln(1λ)[log2( 1λ )d−1](d∑j=1EBαhj) 1h+√λN(EBβ h)1h , (4.29)where α2j := maxi1,···,i j−1,i j+1,···,id(∑Ni j=1 B2i1,···,i j−1,i j,i j+1,···,id), β := maxi1···id|Bi1···id | and cd ≤ c8d√2d.Next, for fixed i1, · · · , i j−1, i j+1, · · · , id , we define δ j := B2i1,···,i j−1,i j,i j+1,···,id , where Eδ j = mNd and|δ j|≤ 1. Using a Bernstein’s inequality for t > 6mNd ,P(∣∣∣∣∣ N∑i j=1(δ j− mNd )∣∣∣∣∣> t)≤ 2exp(-t) = 2P(Wj > t), (4.30)where W1, · · · ,WN are i.i.d. exponential variables.Following the same line of argument in [31], using Eq=∫ ∞0 P(q≥ t) for positive random variables87q we get(E[maxi1,···,i j−1,i j+1,···,id(N∑j=1δ j)h2]) 1h≤√mNNd+E maxi1,···,i j−1,i j+1,···,id∣∣∣∣∣ N∑i j=1(δ j− mNd )∣∣∣∣∣h21h≤√mNd−1+( 6mNd−1)h+∫ ∞( 6mNd−1 )hP maxi1,···,i j−1,i j+1,···,id∣∣∣∣∣ N∑j=1(δ j− mNd )∣∣∣∣∣h≥ tdt 12h≤√mNd−1+((6mNd−1)h+2∫ ∞( 6mNd−1 )hP(maxi1,···,i j−1,i j+1,···,idW hj ≥ t)dt) 12h≤√mNd−1+((6mNd−1)h+2E[( maxi1,···,i j−1,i j+1,···,idWj)h]) 12h≤ (1+√6)√mNd−1+212h(√log(Nd−1)+212h√h).Above, in the first line we have used triangle inequality, in the second line we have used Jensen’sinequality, in the third line, we have used (4.30), and in the fifth line, we have used the followinginequality for the exponential random variables Wj:E[maxi1,···,i j−1,i j+1,···,idW hj]≤E[∣∣∣∣ maxi1,···,i j−1,i j+1,···,idWj− log(Nd−1)∣∣∣∣h]+ logh(Nd−1)≤ 2h!+logh(Nd−1)Now choosing h = dlog(Nd−1)e> 1 we can boundEBαhj =(E[maxi1,···,i j−1,i j+1,···,id(N∑j=1δ j)h2]) 1h≤ (1+√6)√mNd−1+(2+√2)(√(d−1)log(N)).Plugging above into (4.29) and choosing λ = CλN we get:(E‖B‖h) 1h ≤ cd√ln(N)([log2(N)d−1]∗d((1+√6)√mNd−1+(2+√2)(√(d−1)log(N)))+Cλ),(4.31)where we used the fact that for p< 1, (a+b)p < ap+bp for a,b> 0. Moreover, Cλ is chosen suchthat CλN <164 .88By simplifying the constant we can write(E‖B‖h) 1h ≤C′dlogd(N)√mNd−1+C′′dlog(N), (4.32)where max(C′d,C′′d)≤ cd8d√2((2+√2)d2+Cλ ). Plugging (4.32) into (4.28) we get:E[supX∈G∣∣L¯Ω,Y (X)−EL¯Ω,Y (X)∣∣h]≤ (4Lγα√rd (C′dlogd(N)√mN+C′′dlog(N)√Nd))h .Plugging this into (4.27), we concludeP(supX∈G∣∣L¯Ω,Y (X)−EL¯Ω,Y (X)∣∣≤CdLγα√rd (logd(N)√mN+ log(N)√Nd)) , (4.33)with probability bigger than 1− 1Nd−1 provided that Cd > 12 max(C′d,C′′d).The proof of Theorem 47 immediately follows (4.26) and the following lemma [31, Lemma 2].Lemma 60. Let ‖T‖∞,‖Tˆ‖∞≤ α . ThenD( f (T )|| f (Tˆ ))≥ inf|ξ |≤α( f˙ (ξ ))28 f (ξ )(1− f (ξ ))‖T − Tˆ‖2FNd. (4.34)4.8.3 Proof of Theorem 51Define FΩ,Y (X) :=−LΩ,Y (X) =−∑ω∈Ω(I[Yω=1]log( f (Xω))+ I[Yω=−1]log(1− f (Xω))). Fur-ther, assume ‖T‖∞≤ α and remember thatγα ≤min(inf|x|≤α{f˙ 2(x)f 2(x)− f¨ (x)f (x)}, inf|x|≤α{f˙ 2(x)1− f (x))2 +f¨ (x)f (x)(1− f (x))})Lα ≥ sup|x|≤α{ | f˙ (x)|f (x)(1− f (x)}.(4.35)Next define Crank(r,α) := {T ∈⊗di=1RN : ‖T‖∞≤ α, rank(T )≤ r}.Our proof extends the proof of [11, Theorem 3.1], and [93, Theorem 3.3] to the case of multi-dimensional arrays and the main methodology is similar. In what follows ω = {i1, · · · , id}, deter-mines an index of the tensor where 1 ≤ i j ≤ N. For any Ω, Y , F , and f as defined above, we89have∂FΩ,Y (X)∂Tw=(− f˙ (Xω)f (Xω)I(Yω=1)+f˙ (Xω)1− f (Xω)I(Yω=−1))I(ω∈Ω), (4.36)∂ 2FΩ,Y (X)∂X2ω=[(f˙ 2(Xω)f 2(Xω)− f¨ (Xω)f (Xω))I(Yω=1)+(f¨ (Xω)1− f (Xω) −f˙ 2(Xω)(1− f (Xω))2)I(Yω=−1)]I(ω∈Ω),(4.37)and ∂2FΩ,Y (X)∂Xω1∂Xω2= 0 for ω1 6= ω2.Let θ ] = vec(T ]) ∈ RNd and F˜Ω,Y (θ ]) = FΩ,Y (T ]). The objective function FΩ,Y (T ) is contin-uous in T . However, unlike the matrix case the set Crank(r,α), defined above, is not necessarilycompact. In particular, there exists a sequence of rank-r tensors that converge to a tensor withrank larger than r for any r > 1. This phenomenon raises the definition of the border rank of atensor which has been extensively studied in the literature. A tensor T is border-rank r if it isrank-r-approximable [13, 34]. Define Sr := {X ∈⊗di=1RN |rank(X) = r}.Definition 61. A tensor T is border-rank-r if T ∈ Sr and T /∈ Sr−1, where Sr is the closure of Sr.Now consider the set Crank(r,α). This set is compact and therefore, there exists a tensor in theset Crank(r,α) that minimizes F . This tensor is either rank-r or border-rank-r. In either case forany small ε there exists a rank-r tensor Tˆ , where F˜Ω,Y (θˆ)≤ F˜Ω,Y (θ ])+ ε , where θˆ = vec(Tˆ ).By the second-order Taylor’s theorem, for any θ ∈ RNd , we have:F˜Ω,Y (θ) = F˜Ω,Y (θ ])+ 〈∇θ F˜Ω,Y (θ ]),θ −θ ]〉+ 12〈θ −θ],(∇2θθ F˜Ω,Y (θ˜))(θ −θ ])〉, (4.38)where θ˜ = θ ]+ γ(θ −θ ]) for some 0≤ γ ≤ 1. Plugging θˆ in (4.38), we getε ≥ 〈∇θ F˜Ω,Y (θ ]), θˆ −θ ]〉+ 12〈θˆ −θ],(∇2θθ F˜Ω,Y (θ˜))(θˆ −θ ])〉. (4.39)Next, we study the two terms in the right-hand side and use them to bound ‖θ ]− θˆ‖F . The firstone is 〈∇θ F˜Ω,Y (θ ]), θˆ − θ ]〉. Define X to be the tensor where vec(X) = θˆ and let ζ := vec(X −T ]) = θˆ −θ ]. Then〈∇θ F˜Ω,Y (θ ]),ζ 〉= 〈Z,X−T ]〉, (4.40)90where Z := ∇FΩ,Y (T ]). Using (4.36) we haveZω =(− f˙ (T]ω)f (T ]ω)I(Yω=1)+f˙ (T ]ω)1− f (T ]ω)I(Yω=−1))I(ω∈Ω).Notice that the entries of Z are independent random variables with |Zω |≤ Lα . In both [11, 93],the authors have used the inequality 〈Z,X − T ]〉 ≤ ‖Z‖ ‖X − T ]‖∗ to bound this inner product.This bound is not optimal in the tensor case for the same reasons we explained in Remark 49.Therefore, here we use the M-norm and its dual to bound this inner product. Notice that here wecan use max-qnorm as well with the expense of the constant c1cd2 and larger bounds on the normof the low-rank tensor. First notice that〈Z,X−T ]〉 ≤ ‖Z‖∗M‖X−T ]‖M,where ‖Z‖∗M is the dual norm of ‖Z‖M, i.e., ‖Z‖∗max= sup‖U‖M≤1|〈Z,U〉|= supU∈T±|〈Z,U〉|. Thereforeassuming Ω= {ω1, · · · ,ωm},‖Z‖∗M≤ [ sup‖U‖M≤1| ∑ω∈ΩZωU(ω)|] = [ supU∈T±|i=m∑i=1Z(ωi)U(ωi)|], (4.41)where Z(ωi) is a sub-exponential random variable with sub-exponential constant Lα , i.e., maxi=1,···,mE[exp( |Z(ωi)|Lα )]≤e. Notice that |T±|≤ 2Nd . Using Bernstein-type inequality for sub-exponential random variables[124], yieldsP{| 1mm∑i=1Z(ωi)U(ωi)| ≥ t}≤ 2exp{− c ·min(mt2L2α,mtLα)}.Using t =√Ndm for m > Nd and a union bound on all U ∈ T± we getsupU∈T±|i=m∑i=1Z(ωi)U(ωi)|≤ cLα√mNd,with probability bigger than 1− exp(−Nd) and for some absolute constant c > 0. Plugging thisinto (4.41) we get‖Z‖∗M≤ cLα√mNd.The following lemma summarizes the result above and might be of independent interest.91Lemma 62. Consider a d-dimensional random tensor Z ∈ ⊗di=1RN with m non-zero entries,whose entries are independent sub-exponential random variables taking values in [−1,1], withsub-exponential constant K. Then‖Z‖∗M≤ c√mNd,with probability bigger than 1− exp(−Nd). Moreover, by the same probability, the dual of themax-qnorm can be bounded in a similar manner but with an order-dependent constant:‖Z‖∗max≤Cd√mNd,And finally going back to bounding the first term of (4.38)〈∇θ F˜Ω,Y (θ ]),ζ 〉= 〈Z,X−T ]〉 ≤ ‖Z‖∗M‖X−T ]‖M≤ cLα√mNd(2r)3d2 − 32 2α, (4.42)where in the last step we used Theorem 11 along with the fact that rank(X −T ]) ≤ 2r and ‖X −T ]‖∞≤ 2α .Remark 63. As explained above, both [11] and [93] use nuclear norm and its dual, spectral norm tobound the inner product 〈Z,X −T 〉. However, in the tensor case, using nuclear norm doesn’t giveus optimal bounds. Ignoring the log powers for now, in (4.31) we bounded the spectral norm ofa random matrix roughly by c1 log(N)+ c2√mNd−1 . In the optimal setting of m∼ N log(N), whend = 2, log(N) and√mNd−1 have the same order, but when we move to d > 2 the spectral normis dominated by the log factor which prevents the factor√mNd−1 from cancelling the√Nd termthat comes up in the nuclear bound and this is the main reason that an exact generalization of themethod in [93] is not applicable here.Next, we bound the second term of (4.38), i.e., 12〈θ −θ ],(∇2θθ F˜Ω,Y (θ˜))(θ −θ ])〉. A straight-forward generalization of [11, Lemma 6.3] results in the following lemma.Lemma 64. Let ζ = vec(X −T ]) = θˆ − θ ]. Then for θ˜ = θ ]+ γ(θˆ − θ ]), where 0 ≤ γ ≤ 1, wehave〈ζ ,[∇2θθ F˜Ω,Y (θ˜)]ζ 〉 ≥ γα‖(X−T ])Ω‖2F .Here, [(X)Ω]ω = Xω if ω ∈Ω and zero otherwise.Now we need to prove that a fixed low-rank tensor is close to its sampled versions. By matri-cizing X −T ], we end up with a N d2 ×N d2 matrix whose rank is less than 2r, and then we can use92the result of [91] which proves that the set of low-rank matrices has the strong convexity propertywith high probability, i.e.,‖(X−T ])Ω‖F√|Ω| ≥C‖X−T ]‖F√Nd . (4.43)However, it also forces the additional condition that |Ω|≥ CN d2 log(N) which is not optimal.Therefore, we need a similar upper bound on ‖(X−T ])Ω‖F that does not force the extra samplingcondition. In order to do this, notice that rank(X −T ]) ≤ 2r and ‖X −T ]‖∞≤ 2α , and therefore,‖X − T ]‖M≤ (2r) 3d2 − 32 2α . Next, we show that with high probability, tensors with low M-normhave a similar property to the one in (4.43). The following lemma is a direct generalization of thelower bound step used in [19, Section 6.1]; for completeness we include the proof in Section 4.8.5.Lemma 65. DefiningC(β ,δ ) = {X ∈d⊗i=1RN : ‖X‖∞≤ 1,‖X‖M≤ β , ‖X‖2FNd≥ δ},for a tensor X ∈C(β ,δ ), with probability bigger than 1−2exp(−c0mδ )1m‖XΩ‖2F≥12Nd‖X‖2F−C1β√Ndmfor all X ∈C(β ,δ ). (4.44)provided m > log2c0δ .Applying (4.52) to X−T ] with appropriate rescaling, we get1m‖(X−T ])Ω‖2F≥12Nd‖X−T ]‖2F−C1r3d2 − 32 (2α)2√Ndm, if‖X−T ]‖2FNd≥ (2α)2δwith probability 1−2exp(−c0mδ ). Picking δ = log(2t )c0m, this means that with probability 1−t either‖X−T ]‖2FNd≤ (2α)2 log(2t )c0m, (4.45)or1m‖(X−T ])Ω‖2F≥12Nd‖X−T ]‖2F−C1(2r)3d2 − 32 (2α)2√Ndm. (4.46)In the second case, gathering the results of (4.38), (4.42), Lemma 64, and (4.46) we getε ≥−cLα√mNd(2r)3d2 − 32 2α+12γα(m2Nd‖X−T ]‖2F−C1(2r)3d2 − 32 (2α)2√mNd).93As ε is arbitrary we can simplify the above result as‖X−T ]‖2FNd≤ cαLα(2r)3d2 − 32γα√Ndm+C1(2r)3d2 − 32 (2α)2√Ndm. (4.47)Combining (4.45) and (4.47) with t = 2exp(−Nd) finishes the proof.4.8.4 Proof of Theorem 54Lemma 66 ([45, Lemma 18]). Let r = b( RαKG )2c and let KTM(α,R) be defined as in Section 4.6and let γ ≤ 1 be such that rγ2 is an integer and suppose rγ2 ≤ N. Then the there exists a setχT(α,γ) ⊂ KTM(α,R) with|χT(α,γ)|≥ exprN16γ2such that• For T ∈ χT(α,γ), |T (ω)|= αγ for ω ∈ {[N]× [N] · · · [N]}.• For any T (i), T ( j) ∈ χT(α,γ), T (i) 6= T ( j)‖T (i)−T ( j)‖2F≥α2γ2Nd2Next choosing γ in a way that rγ2 is an integer and4√2εα≤ γ ≤ 8εαwe transform χT(α,γ) so that the entries of each element of the packing set come from the set{α,α ′ := (1− γ)α)} by definingχ := {X +α(1− γ2)1 : X ∈ χT(α2 ,γ)}Notice that for any X ∈ χ , ‖X‖∞≤ α and ‖X‖M≤ RM and therefore, we can use χ as a packing setfor KTM. Once the packing sets are generated, using the exact same proof as [31, Section A.3], wecan prove an upper bound. Choose ε to beε2 = min(11024,C2α√β 3α4√rNm). (4.48)94And assume there is an algorithm that, using 1-bit measurements YΩ of X , returns Xˆ such that1Nd‖X− Xˆ‖2F≤ ε2,with probability at least 14 . Notice that this means that with probability at least14 , we can distinguishthe elements of the packing set as for any X (i),X ( j) ∈ χ , 1Nd ‖X (i)−X ( j)‖2F≥ 4ε2. Hence, definingX∗ to be the closest element of χ to Xˆ , if we show that P(X 6=X∗)> 34 the proof is done. Otherwise,using Fano’s inequality [31] and following the steps in Section A.3.2 of [31] we have14≤ 1−P(X 6= X∗)≤ 16γ2 64mε2βα ′ +1rN≤ 1024ε2 64mε2βα ′ +1α2rN ,where α ′ := (1− γ)α . Comparing 64mε2 and βα ′ , either14<2048ε2α2rN,which is a contradiction when C0 > 8 orε2 >α√βα ′512√2√rNm,which is a contradiction when C2 ≤ 1512√2 .4.8.5 Proof of Lemma 65In this section, we prove that with high probability, 1m ∑mt=1 T (ωt)2 doesn’t deviate much fromits expectation ‖T‖2Π. For ease of notation, define TS = (T (ω1),T (ω2), · · · ,T (ωm)) to be the set ofchosen samples drawn from Π where‖T‖2Π=1mES∼Π‖TS‖22=∑ωpiωT (ω)2.Following [19], we prove that with high probability over the samples,1m‖TS‖22≥12‖T‖2Π− fβ (m,N,d), (4.49)holds for all all tensors T ∈C(β ,δ ) := {T ∈ KTM(1,β ) : ‖T‖2Π≥ δ} uniformly for some β ≥ 1 and95δ > 0. As suggested in [19], we prove the stronger result that| 1m‖TS‖22−‖T‖2Π|≤12‖T‖2Π− fβ (m,N,d),holds for all T ∈C(β ,δ ) with high probability. Following the peeling argument of [91], for ` =1,2, · · · and θ = 32 , define a sequence of subsetsC`(β ,δ ) := {T ∈C(β ,α) : θ `−1δ ≤ ‖T‖2Π≤ θ `δ},and for any radius D > 0, setB(D) := {T ∈C(β ,δ ) : ‖T‖2Π≤ D}.If there exist some T ∈C(β ,δ ) such that | 1m‖TS‖22−‖T‖2Π|> 12‖T‖2Π+ fβ (m,N,d) then there existsa corresponding ` such that T ∈C`(β ,δ )⊂ B(θ `δ ) and| 1m‖TS‖22−‖T‖2Π|>13θ `δ + fβ (m,N,d). (4.50)And therefore, the main task becomes proving that the probability of the event (4.50) happeningis small. To this end, we prove the following lemma that shows that the empirical mean 1m‖TS‖22doesn’t deviate much from its expectation.Lemma 67. Defining ∆D(S) := supT∈B(D)| 1m‖TS‖22−‖T‖2Π|, there exists a constant C1 such that forany D > 0,P{∆D(S)> D3 +C1β√Ndm} ≤ e−mD10 .Proof: The proof of this lemma follows the proof in [19, Section 6.1.1] which is based on somebasic techniques of probability in Banach spaces, including symmetrization, contraction inequalityand Bousquet’s version of Talagrand concentration inequality [77].Regarding the tensor T as a function in [N]× ·· · × [N]→ R from its indices to R, we areinterested in bounding the following empirical process:supfT :T∈B(D)| 1mm∑t=1fT (ωt)−E[ fT (ωt)]|, fT (·) = {T (·)}2.96Bounding the variance fT for T ∈ B(D) and using a standard symmetrization argumentES∼Π[∆D(S)]≤ 2ES∼Π{Eε [ supT∈B(D)| 1mm∑t=1εtT (ωt)2| |S]},where {εt} is an i.i.d. Rademacher sequence, independent of S. Next, noticing that |T (ωt)|≤ 1 andusing Ledoux-Talagrand contraction inequality [77]Eε [ supT∈B(D)| 1mm∑t=1εtT (ωt)2| |S]≤ 4Eε [ sup‖T‖max≤β| 1mm∑t=1εtT (ωt)| |S]≤ 24β√dNm.Notice that in the last step we have used Lemma 10 which is a uniform bound on worst-case Rademacher complexity of low M-norm tensors. Hence, ES∼Π[∆D(S)] ≤ 48β√dNm and us-ing Bousquest’s version of Talagrand concentration inequality for empirical processes indexed bybounded functions completes the proof of the Lemma [? ].Define fβ (m,N,d) := 48β√Ndm and consider the sequence of eventsε` = {∆θ `δ >13θ `δ + fβ (m,N,d)} for `= 1,2, · · · .Since C(β ,δ ) = ∪`≥1C`(β ,δ ), using the union bound we have:P{∃T ∈C(β ,δ ),subject to | 1m‖TS‖22−‖T‖2Π|>12‖T‖2Π+ fβ (m,N,d)}≤ ∑`≥1P{∃T ∈C`(β ,δ ),subject to | 1m‖TS‖22−‖T‖2Π|>12‖T‖2Π+ fβ (m,N,d)}≤∞∑`=1P(ε`)≤ exp(−c0mδ )1− exp(−c0mδ ) ,(4.51)with c0 =log( 32 )10 . Consequently provided m >log2c0δwe obtain1m‖TS‖22≥12‖T‖2Π−C1β√Ndmfor all T ∈C(β ,δ ) (4.52)with probability greater than 1−2exp(−c0mδ ) which finishes the proof of lower bound.97Chapter 5ApplicationsAs explained before, in many applications data are naturally multi-dimensional and the inter-dimensional dependencies are best explained when we use tensors to represent them. In this chap-ter, we show the results of applying tensor completion and 1-bit tensor completion to variousapplications ranging from imaging to recommender systems. First, we show applications of tensorcompletion in Section 5.1 and then we show some applications of 1-bit tensor completion in Sec-tion 5.2.In the results presented in this chapter, we concentrate on recovery errors and do not reporttheir time complexity. Tensor completion algorithms are generally slower than matrix completionalgorithms, which is the price we pay for the better sample complexity obtained by not matricizing.Generally our algorithms are about 5 to 10 times slower than their matrix completion counterparts.The algorithms presented in Chapters 3 and 4 are prototypes to show the advantage of using max-qnorm instead of matricizing and they are not optimized to be as fast as they can be. therefore, inthis chapter, we just report the recovery errors and do not do a study of the time complexity of thealgorithms.5.1 Applications of tensor completion5.1.1 Color image inpaintingColor images are usually represented by three-dimensional tensors. The first two dimensionsdetermine the pixel location and the third dimension is the rgb (red, green and blue) values. Al-though the small length of the third dimension means that the gain over matricization will be small,98(a) Original Baboon image0 5 10 15 20rank00.020.040.060.080.10.12RSEBest (approximate) rank-r approximaitonsMatricized3-dimensional(b) RSE of best rank-r approximationsFigure 5.1: The Baboon image and the RSE of best rank-r approximations of the 128×128×3 tensor and its 128×384 matricization.our experiments give a nice demonstration of the advantage of using tensors instead of matriciz-ing. To this end, we first consider the Baboon image in Figure 5.1 which is represented by a128×128×3 tensor. In Figure 5.1, we also show the relative squared error (RSE) of the best rank-r approximation of the tensor (approximate) and the best rank-r approximation of its matricizationalong the first dimension (which is exact). For a tensor T ], the RSE of an estimate Tˆ is defined asRSE :=‖T ]− Tˆ‖2F‖T ]‖2F.Remember that any rank-r tensor approximation is also a rank-r matrix approximation of the ma-tricized tensor and therefore, as expected, the RSE of rank-r matrix approximations are better thanthe rank-r tensor approximation. It is also worth noting that a tensorized low-rank matrix does notnecessarily have a small tensor-rank.Figure 5.2 shows the results of applying tensor and matrix completion algorithms to the sub-sampled images in the first column which has 50% sampling rate. We show the Peak signal-to-noise ratio (PSNR) which is a measure of error for image recovery and is defined asPSNR := 10log(N1N2N3‖T ]‖2∞‖Tˆ −T ]‖2F).There are two methods of subsampling which significantly affect the recovery performance. Oneis missing pixels, i.e., missing the all the third dimension information (rgb values) of a pixel of99Observation, missing pixels Missing pixels, tensor, PSNR= 25.0204 Missing pixels, matricized, PSNR= 24.6317Observation, missing indices Missing indices, tensor, PSNR= 26.9471 Missing indices, matricized, PSNR= 25.661Figure 5.2: Reconstructed Baboon images with 50% observed entries. The first row showsthe results with missing pixels and the second row shows the results with missing ran-dom entries. The results of matricization are worse than using tensors and the resultsof missing pixels are worse than missing random entries of the tensor.the image. The first row of Figure 5.2 shows this case. The second method is randomly removingindices of the tensor which is illustrated in the second row of 5.2. In the first row, when we miss apixel, we miss the interdependencies of the values in the third dimension and therefore, the resultsof using tensors and matricization are very similar. On the other hand, in the second row, the algo-rithm learns the relation between the rgb valued and used that information for the missing indicesand therefore, the results are better. Moreover, the improvement using tensors is more significantin this case.Comparing the results of Figures 5.1 and 5.2 shows the main reason behind the success of ten-sor completion compared to matrix completion. Although the matricized version is more low-rank(or is better approximated by low-rank objects), because it has larger dimension we see that thecompletion results are better when we don’t matricize.It is worth mentioning that tensor completion has been used for color image inpainting in100original imageObservation, missing pixels Missing pixels, tensor, PSNR= 17.8495 Missing pixels, matricized, PSNR= 17.209Observation, missing indices Tensor, PSNR= 19.4503 Matricized, PSNR= 17.7895Figure 5.3: Results for the Peppers image with only 20% sampling rate.[10, 80, 83, 134]. Our results are comparable (or better) than the most of the algorithms in theliterature except for the algorithm in [10] which is a sophisticated image and video completionalgorithm. We have not tried implementing those ideas here and postpone it to future work.In Figure 5.3, we show the results for the Peppers1 image (which is a 128 by 128 pixels image)with the lower observation sampling rate p= 0.2. The results are similar to the ones of the Baboonimage.In Figure 5.4 we show the results for a larger image which is obtained from the Berkeley Seg-mentation database 2 [86]. In the second row we show the results of some of the state of the artalgorithms including LRTC [83], TCTF [134] and t-SVD [132].1http://sipi.usc.edu/database/database.php?volume=misc2https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/101Original image Subsampled image, p=0.5 Max-qnorm 29.5571LRTC 28.0516 TCTF 26.475 t-SVD 30.9393Figure 5.4: PSNR results for a larger natural image in R321×481×3. The second row is theresults of LRTC [83], TCTF [134] and t-SVD [132]5.1.2 Video completionVideos can be naturally represented by a tensor where the frames of the video are arranged inthe first two (black and white videos) or three (color videos) dimensions and the last dimensiondetermines the frame number. In addition to the fact that each frame is low rank, the video doesnot change drastically from one frame to the other which makes the video tensor approximatelylow-rank. This is especially true for videos where objects in the video are more static. An exampleof such a video is the bridge video3 where we show one of its frames in Figure 5.5. We consider100 frames of the video and remove 80% and 95% of the samples randomly. The SNR of therecoveries with p = 0.2 and p = 0.05 is respectively 69 and 57.In Figure 5.6 we have considered another video which is the rotting tomato video used in [83].We randomly sample 30% of the pixels where the missing pixels contain blocks of pixels as well(see Figure 5.6). This video is a little harder to reconstruct compared to the bridge video in Figure5.5 because the tomato gets smaller and smaller as it gets rotten and also we are missing blocks ofpixels. Therefore, the results are a little inferior in this case (SNR of the recovery is around 30 dB)but still very good.3Available at http://trace.eas.asu.edu/yuv102Original frame sampled frame, 80% missing Recovered frame SNR=69.3788Original frame sampled frame, 95% missing Recovered frame SNR=57.8699Figure 5.5: Completing a black and white video from 20% and 5% of the entries.original frame 50original frame 100subsampled frame 50subsampled frame 100Recovered frame 50Recovered frame 100Figure 5.6: Frames 50 and 100 of the tomato video [83]. The subsampling mask includesblocks of pixels as well.1035.1.3 Hyperspectral imagingHyperspectral data consist of collections of measurements sensed from contiguous spectralbands [112]. Unlike the human eye that generally sees light in three bands (long wavelengths -perceived as red, medium wavelengths - perceived as green, and short wavelengths - perceivedas blue), hyperspectral imaging records images of an object from a wide range of bands (wave-lengths). Considering the object as a 2-dimensional object the resulting image can be stored into a3-dimensional tensor whose bands (third dimension slices) are highly correlated. There are multi-ple reasons that make a hyperspectral image contaminated. For example, images taken by satelliteshave missing information due to faulty sensors and the environment situation (such as cloudy sky)[63, 79]. In figure 5.7 we consider the hyperspectral image of [83]4, where 90% of its samples areremoved. The missing entries include random pixels and also blocks of pixels that model the effectof a cloud. An example of the observed data is shown in Figure 5.7. In the second row, we addrandom white noise with SNR = 20dB to the image and show the reconstruction of the 30th band.Notice that in this case, we can improve the results by denoising as well (around 1dB in this case).Original band sampled band, 90% missing Recovered band SNR=33.0625Noisy band, SNR=19.9572 sampled noisy band, 90% missing Recovered band, SNR=23.1412Figure 5.7: An example of a hyperspectral image with 90% missing entries. In the secondrow, we show the result with added white noise.4Available at http://www.cs.rochester.edu/u/jliu/publications.html1045.1.4 MRI data recoveryThe next application we consider here is recovering dynamic MR images. Figure 5.8 showsa few slices of the original 3D brain MRI data which is in R191×400×190. We randomly remover95%, 80%, and 50% of the entries and show the sampled and recovered slices in each case. ThePSNR for p = 0.05, p = 0.2 and p = 0.5 is 29,37 and 76 respectively. The results with p = 0.5,i.e., the last row is practically full recovery of the image (SNR ∼ 80). For p = 0.2, it also doesa good job of recovering the image except for small blurriness in the corner of the brain image.The main difficulty arises in the final slices where the brain images get smaller; because we onlysample 20% of the entries our algorithm does not get enough information to recover those slices.105slice 25Original image slice 40 slice 55 slice 7095% missing20% missing50% missingFigure 5.8: The recovery results for four slices of a brain MR image. The columns are imagesof different slices and the rows show the original image and different sampling rates.First row shows the original images. Each image in the figure shows the sampled image(top) and the recovered image (bottom).1065.2 Applications of 1-bit tensor completionIn this section, we test the performance of 1-bit tensor completion on real-world data, in par-ticular, for predicting exam scores and user ratings. We show the improvements gained by solving1-bit tensor completion instead of matricizing. We transform the ratings (or scores) to 1-bit databy replacing them with whether they are above or below some approximate mean-rating. Next, weinvestigate the ability of 1-bit tensor completion to predict if an observed rating is above or belowaverage. In all the applications we know the maximum value of the ratings and use this value torescale the data to have unit infinity norm. This helps us in choosing the appropriate function fbased on our synthetic experiments in Section 4.7, i.e., Gaussian noise with σ = 0.1. In Remark68 (below) we explain the consequences of this choice in more details.To be more precise, we explain the general application setup we use in this section briefly.Assume T ] is the true tensor and we have access to a subset of its entries for training (T ]Ωtrain) andanother subset of the entries for testing (T ]Ωtest). Ωtrain and Ωtest are two disjoint subsets of theentries. Below is a brief recipe for the experiments done in this section.• Scale the observed entries to have unit infinity norm.• Using an approximate mean-value η , take sign(T ]Ωtrain −η) to be the 1-bit measurements.Notice that with synthetic low-rank tensors we add a dithering noise using some function fbefore taking the sign (see (4.1)). However, our experiments showed that assuming the noiseto be intrinsic in the data gives slightly better results than dithering. To be precise instead ofdithering the observation tensor T ]Ωtrain by a noise tensor Z and taking the 1-bit measurementsto be sign(T ]Ωtrain +ZΩtrain −η), we assume the noise is hidden in the original ratings. Weexplain the complications of not knowing the exact noise in Remark 68. The same approachwas taken in [31].• Use a 1-bit tensor completion algorithm to get an initial estimate Tˆinit.• Add the approximate mean-value η and scale the resulting tensor back to get the final Tˆ ,which is an approximation of T ].• Evaluate Tˆ by comparing Tˆ and T ] on the test indices.107Remark 68. In the applications, we empirically observe that we get better predictions if we donot dither the original tensor, i.e., if we assume the noise to be implicit in the data. Moreover, asexplained in Remark 56, in the applications section we also ignore the infinity-norm constraint.Therefore, in theory, changing the value of σ should not change the results of predicting the signof the tensor. For example, if we consider recovering the original tensor with two noise functionsf1(x) =Φ( xσ1 ) and f2(x) =Φ(xσ2 ), we would haveΦ(T (ω)σ1) =Φ(σ2σ1 T (ω)σ2),which shows that the recovered tensors achieved by f1 and f2 should be scalar multiples of eachother and the sign predictions should be the same. In Figure 5.9 we illustrate this by showingthe results of applying noise functions with different values of σ for recovering a rank-5 tensor inR30×30×30 whose 1-bit measurement have been obtained with σ = 0.15. The plot in the left showsthe average RSE of 1-bit max-qnorm 1-bit tensor completion and as expected the best recovery ofthe original tensor is obtained when we use the true noise function that we originally used to getthe observations. However, the right plot shows that the percentage of correct sign predictions isvery robust to the choice of σ and all the results are very close to each other which supports theabove discussion.5.2.1 Score predictionIn this section, we apply our algorithm to the data on pupil attainments in schools in Scot-land which contains the information of 3,435 children who attended 148 primary schools and 19secondary schools in Scotland5. We generate a 5-dimensional tensor in R148×19×2×4×10 whichincludes the information of primary school, secondary school, gender, social class and attainmentscore of the students and estimate the verbal reasoning score of the students based on varying num-ber of 1-bit information (from 230 samples to 2100 samples). The scores are ranged from -40 to 40and we take the 1-bit information to be the sign of the score. We use 10% of the observations forcross-validating the choice of parameters and another 10% of the total scores as a test set. Figure5.10.a shows the percentage of correct sign predictions and Figure 5.10.b shows the mean absoluteerrorMAE :=∑ω∈Ωt |T ](ω)− Tˆ (ω)||Ωt |5Available at http://www.bristol.ac.uk/cmm/learning/mmsoftware/data-rev.html1080.05 0.1 0.15 0.2 0.2500.10.20.30.40.50.6RSE0.05 0.1 0.15 0.2 0.250246810% of correct predictionsFigure 5.9: The left figure shows the average relative error of recovering a rank-5, 30×30×30 tensor by max-qnorm constrained 1-bit tensor completion with different values of σwhen the 1-bit measurements were obtained by using σ = 0.15. The right figure showsthe percentage of correct sign predictions.on the test set. Notice that the scores are in the range of -40 to 40 and the Mean absolute error is inthe range of 8 to 10. The matrix completion results refer to matricizing the tensor to a 592×380matrix by putting the first and fourth dimension in the rows and the rest in the columns. Thismatricization results in the most balanced rearrangement which is the recommended way for betterresults [90]. In both figures we can see that using tensor completion outperforms matrix completionsignificantly.5.2.2 In-car music recommender systemIn this section, we apply our algorithm to an in-car music recommender system [6] which con-tains 4000 ratings from 42 users and 140 music tracks. The ratings are acquired via a mobileapplication and each rating is accompanied with one of the 26 context information which rangesfrom the landscape to the weather condition or the driver’s mood. The resulting tensor is of size42×140×26 and we use 3200 ratings for the minimization, 400 ratings for validating the choiceof max-qnorm bound and the other 400 ratings as test data. Table 5.1 shows the results using tensorcompletion while considering the context information and matrix completion. The table shows thecorrect 1-bit predictions made by each algorithm considering their original ratings in the first sixcolumns and the total average in the last row. Bringing context into the reconstruction results in1090 500 1000 1500 2000a)number of observations304050607080percentage of test samplesCorrect prediction of score signMatrix completionTC max-qnormTC factor-frob0 500 1000 1500 2000b)number of observations678910111213mean errorMean-squared errorMatrix completionTC max-qnormTC factor-frobFigure 5.10: Results of applying 1-bit matrix and tensor completion to partially observedverbal scores tensor [87] to determine whether unobserved scores are above or belowaverage. The left figure shows the percentage of correct predictions and the rightfigure shows the absolute mean error. The scores are in the range [-40 40]an impressive improvement of at least 17% and moreover, using 1-bit information does not changethe results too much compared to using the full information. Note that the results are averagedover 10 different random training and test sets.Remark 69. Although the best results are obtained by the conditional gradient descent (Algorithm3) used for exact-rank constraints, it is worth mentioning that the exact rank constraint is highlysensitive to the algorithm parameters and the results change drastically from one sampling patternto another. As explained before, to reduce these artifacts, we had to run the code with multipleinitial points which results in the algorithm being around 10 times slower than its counterparts. Allthis being said, in theory, the CGD should give the best results if the original tensor is low-rank.Remark 70. The results of using the factor-frob algorithm (4.20) are also reported in the last twolines of the table which is better than matricization but a little worse than using max-qnorm. Theclose outcome of using max-qnorm and factor-frob can be a result of the fact that the underlyingrank is not too small in this case. However, we do not have a precise explanation of the recoveryerror of the factor-frob algorithm.Remark 71. In [31], a similar experiment was done with movie ratings which showed a significantimprovement in using 1-bit data instead of original (multi-bit) data. However, here we just seea small improvement. We believe this is due to the very careful way the in-car music data wasgathered. To be precise 1-bit matrix (and tensor) completion seems to be working better when110Original rating 0 1 2 3 4 5 Overall1-bit matrix completion 75% 65% 54% 51% 70% 65% 60%multi-bit matrix completion 76% 62% 50% 54% 53% 61% 57%1-bit TC, max-qnorm 80% 89% 58% 65% 78% 85% 77%multi-bit TC, max-qnorm 80% 86% 60% 64% 77% 90% 76%1-bit TC, factor-frob 82% 85% 57% 69% 71% 80% 75%multi-bit TC, factor-frob 84% 78% 57% 63% 67% 85% 72%1-bit TC, CGD 82% 89% 60% 63% 80% 88% 78%Table 5.1: Results of a comparison between 1-bit and multi-bit matrix and tensor completion algo-rithms on incar-music data [6] for predicting whether the unobserved ratings were above or belowaverage. The multi-bit methods use the original ratings from 0 to 5 and the context-aware meth-ods include the context information such as time of the day, weather, location of driving andmood.there is an implicit noise in the ratings given by the users which can’t be accounted for when wefit the data exactly.Remark 72. It is generally true that the prediction should be more accurate when the original ratingis further away from the average. This trend can still be seen in Table 5.1 as well except for onecase which is due to the very few instances of 0-rating in the test data (generally 3 to 5 ratings).5.2.3 A Restaurant Context-Aware Recommender SystemNext, we apply our algorithm to a restaurant recommender system [102]. The data was gatheredby asking 50 users who went to 40 restaurants in the city of Tijuana and rated their experiencethrough a Web-based platform. The contextual information added is the day of the week (midweekand weekend) and the place (school, home and work). The resulting tensor is in R40×50×6 and theratings are ranged from 1 to 5 with an average of 3.40. Similar to the In-car music recommendersystem we take the 1-bit information to be whether or not a rating is higher than the averageor not. In Table 5.2 we report the percentage of correct above-or-below-average predictions fordifferent ratings in the test set which we will refer to as 1-bit predictions. To be precise we showwhat percentage of the test ratings have been correctly predicted to be above or below average. InTable 5.3 we show the mean absolute error of the recovered tensor on the test set, i.e., taking Ωtto be the test set, we show MAE := ∑ω∈Ωt |T](ω)−Tˆ (ω)||Ωt | . For this restaurant-ratings data set, a naivematricization results in an overall rating of 60% and a more complicated matrix completion versionwould have repeated user-restaurant ratings (with different context information) and therefore,comparing tensor and matrix completion results is not possible here. Instead, we focus on a fewimportant questions regarding 1-bit tensor completion.111• The first question is the importance of using max-qnorm or factor-frob regularizer. In the firsttwo rows of Tables 5.2 and 5.3, we show the results of factorization (without any regularizer)with r = 10, i.e., by doing alternating minimization over Ni× 10 factors (r = 10 is the bestrank found by numerous empirical experiments). Both results of 1-bit prediction and themean absolute error is generally worse than using max-qnorm and factor-frob results wherethe difference is significant in the 1-bit predictions. It is worth mentioning though that thetensor factorization is a non-convex problem and a more complex algorithm might get betterresults. For example the more sophisticated CGD algorithm gives better results but still usinga regularizer is better improves the results in this question.• The second question we investigate is the effect of choosing f (in 4.5) on the results. The4th and 5th row of Tables 5.2 and 5.3 shows the results for two different values of σ . Inshort, when we use max-qnorm, smaller noise (not too small though) result in better 1-bit predictions and larger noise (not too large though) results in better prediction of theactual score. This behavior can be explained by examining the likelihood function 5.3 moreclosely. When σ is small the dithering function f is spiky around X(ω) = 0 and thereforeit is more sensitive to whether or not X(ω) is positive or negative rather not how large it is.Therefore, using larger values of σ does a better job of recovering the original rating buthas less sensitivity to the sign of X(ω). This can be seen in Tables 5.2 and 5.3 where werecover the sign of the ratings better when we use σ = 0.1 but do worse in terms of meanabsolute error. This is more evident in Table 5.3 where using smaller σ does a better job ofrecovering the ratings that are close to the average and worse in the ratings that are furtherfrom the average. Remark 68 investigates these differences in more details.As expected the best results for recovering the original ratings (MAE in Table 5.3) is obtained byexact tensor completion. However, notice that 1-bit TC outperform exact tensor completion forratings that are around the mean significantly and struggles for the correct scale of the ratings thatare far away from the mean.112Original rating 1 2 3 4 5 Overallmulti-bit TC (TF) 73% 58% 63% 43% 73% 74%1-bit TC (TF) 70% 67% 74% 57% 79% 77%multi-bit TC (max-qnorm) 85% 72% 69% 63% 89% 81%1-bit TC (max-qnorm, σ = 0.1) 80% 72% 72% 69% 92% 84%1-bit TC (max-qnorm, σ = 0.5) 79% 69% 69% 66% 91% 82%multi-bit TC (factor-frob,) 87% 72% 66% 64% 90% 83%1-bit TC (factor-frob, σ = 0.1) 86% 72% 74% 63% 89% 82%1-bit TC (factor-frob, σ = 0.5) 82% 72% 75% 62% 89% 83%1-bit exact-rank (CGD) 83% 75% 74% 66% 86% 80%Table 5.2: Results of a comparison between 1-bit and multi-bit matrix tensor completion algorithmson Tijuana restaurant data [102] for predicting whether the unobserved ratings were above orbelow average. TF refers to using tensor factorization without the max-qnorm regularization.For the 1-bit results we use f (x) =Φ( xσ ).Original rating 1 2 3 4 5 Overallmulti-bit TC (TF) 1.80 1.22 0.74 0.88 1.10 0.971-bit TC (TF) 1.90 1.02 0.62 0.85 0.96 0.97multi-bit TC (max-qnorm) 1.5 1.01 0.54 0.64 0.68 0.761-bit TC (max-qnorm, σ = 0.1) 2.20 1.25 0.29 0.46 1.25 1.111-bit TC (max-qnorm, σ = 0.5) 1.52 1.12 1.08 1.14 0.84 1.02multi-bit TC (factor-frob,) 1.41 1.02 0.53 0.70 0.75 0.751-bit TC (factor-frob, σ = 0.1) 2.20 1.23 0.27 0.51 1.33 1.161-bit TC (factor-frob, σ = 0.5) 1.49 1.03 0.95 1.00 0.82 0.961-bit exact-rank (CGD) 1.95 1.07 0.41 0.53 1.07 1.00Table 5.3: Results of a comparison between 1-bit and multi-bit matrix tensor completion algorithmson Tijuana restaurant data [102] showing the mean absolute error.113Chapter 6ConclusionIn this thesis, we studied the problem of tensor completion from noisy or 1-bit measurements.Below is a list of our contributions in this thesis and some of the open questions related to them.• We define the M-norm and max-qnorm of tensors as robust proxies for the rank of a ten-sor. We prove that both M-norm and max-qnorm of a bounded low-rank tensor are upper-bounded by quantities that depend only on the rank and infinity norm of the tensor and areindependent of N.• We use a generalization of Grothendieck’s theorem to connect the max-qnorm of a tensor toits nuclear decomposition with unit infinity-norm factors. Using this, we prove that the unitball of the max-quasi-norm is contained in a constant expansion of the set of rank-1 signtensors (the unit ball of M-norm) and therefore, the sets of low max-qnorm and low M-normhave small Rademacher complexity. This also establishes a theoretical framework for furtherinvestigation of low max-qnorm tensors.• We prove that with high probability, m=O(r 3d2 dN) (or m=O(R2dN) if M-norm is boundedby R) samples are sufficient to estimate a rank-r bounded tensor using a convex least squaresalgorithm. Moreover, we derive an information-theoretic lower bound that proves m =O(R2N) measurements are necessary for recovery of tensors with M-norm less than R. Thisproves that our bound is optimal both in its dependence on N and the M-norm bound R. It isworth mentioning though, that the bounds we prove in this thesis are not necessarily optimalin r, the rank of the tensor, which in an interesting open problem.• Through synthetic numerical examples, we illustrate the advantage of using algorithms de-signed for max-qnorm constrained tensor completion instead of algorithms using matriciza-114tion. These algorithms significantly improve algorithms based on matricization and alter-nating least squares (ALS). It is worth mentioning that computing the nuclear norm of ageneral tensor is known to be NP-hard. Although it is not known whether computing theM-norm or max-qnorm of a tensor is NP-hard or not, our numerical results for max-qnormconstrained least squares, using a simple projected quasi-Newton algorithm give promisingresults. There are two important future directions related to our algorithm. First, is design-ing methods to approximate the solution of the M-norm constrained tensor completion andsecond is designing faster algorithms for approximating the max-qnorm constrained algo-rithms. One straight-forward future direction is solving the max-qnorm constrained tensorcompletion in parallel (the max-qnorm constrained problems are embarrassingly parallel).• We formulate and analyze 1-bit tensor completion using M-norm constraints on the underly-ing tensor. We prove that, with high probability, the mean squared error (MSE) of recoveringa rank-r tensor T ] from m 1-bit measurements by solving an M-norm constrained optimiza-tion is O(√r3d−3√Ndm ). Moreover, we analyze a related non-convex function, max-qnorm,and prove that MSE of optimizing a log-likelihood function constrained by max-qnorm isO(√rd2−d√Ndm ).• We consider using nuclear norm as a convex constraint for the 1-bit tensor completion prob-lem. The theoretical sample complexity we obtain by using nuclear norm is not as tight asusing M-norm and we show that with the tools we have used here (or in the matrix-case)there is no hope of getting tighter results. We believe that achieving a tighter upper boundis only possible if we impose some incoherence conditions on the problem. Studying thedifference between the tensor nuclear-norm and M-norm is an interesting future direction.• We also analyze 1-bit tensor completion using exact-rank constraints. The estimation errorrelies on bounding the M-norm of rank-r tensors and an upper bound on the dual-M-normof a random tensor which might be of independent interest. However, we think achievingan optimal lower bound for this case depends on proving a restricted strong convexity typeinequality for the set of low-rank tensors which is important future work. Notice that [91]studies the matrix case.• We derive an information-theoretic lower bound on the recovery error of M-norm con-strained 1-bit TC that proves the MSE of any arbitrary algorithm is at least Ω(r√Nm) fora rank-r tensor and Ω(R√Nm) for a tensor with M-norm less than R. This proves that ourupper bound is optimal in N and the M-norm bound R (but not necessarily in r).115• We propose a numerical method to approximate the solution of max-qnorm constrained 1-bittensor completion and show its advantage over 1-bit matrix completion using synthetic andreal-world data. Specifically, we show the huge improvement one can get by applying the1-bit tensor completion to context-aware recommender systems.116Bibliography[1] E. Acar, S. A. C¸amtepe, M. S. Krishnamoorthy, and B. Yener. Modeling and multiwayanalysis of chatroom tensors. In International Conference on Intelligence and SecurityInformatics, pages 256–268. Springer, 2005. → pages 1[2] A. Aidini, G. Tsagkatakis, and P. Tsakalides. 1-bit tensor completion. In Proc. 2018 IS&TInternational Symposium on Electronic Imaging, Image Processing: Algorithms andSystems, Burlingame, CA, 2018. → pages 9, 64[3] A. Aravkin, R. Kumar, H. Mansour, B. Recht, and F. J. Herrmann. Fast methods fordenoising matrix completion formulations, with applications to robust seismic datainterpolation. SIAM Journal on Scientific Computing, 36(5):S237–S266, 2014. → pages 31[4] B. W. Bader, T. G. Kolda, et al. Matlab tensor toolbox version 2.6. Available online,February 2015. URL http://www.sandia.gov/∼tgkolda/TensorToolbox/. → pages 78[5] J. Ballani, L. Grasedyck, and M. Kluge. Black box approximation of tensors in hierarchicaltucker format. Linear algebra and its applications, 438(2):639–657, 2013. → pages 35[6] L. Baltrunas, M. Kaminskas, B. Ludwig, O. Moling, F. Ricci, A. Aydin, K.-H. Lu¨ke, andR. Schwaiger. Incarmusic: Context-aware music recommendations in a car. E-Commerceand web technologies, pages 89–100, 2011. → pages ix, 109, 111[7] B. Barak and A. Moitra. Noisy tensor completion via the sum-of-squares hierarchy. arXivpreprint arXiv:1501.06521, 2015. → pages 36, 42[8] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds andstructural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002. →pages 25, 84, 85[9] J. A. Bazerque, G. Mateos, and G. Giannakis. Rank regularization and bayesian inferencefor tensor completion and extrapolation. IEEE Transactions on Signal Processing, 61(22):5689–5703, 2013. → pages 9, 77[10] J. A. Bengua, H. N. Phien, H. D. Tuan, and M. N. Do. Efficient tensor completion for colorimage and video recovery: Low-rank tensor train. IEEE Transactions on ImageProcessing, 26(5):2466–2479, 2017. → pages 101117[11] S. A. Bhaskar and A. Javanmard. 1-bit matrix completion under exact low-rank constraint.In Information Sciences and Systems (CISS), 2015 49th Annual Conference on, pages 1–6.IEEE, 2015. → pages 4, 63, 64, 72, 73, 89, 91, 92[12] S. Bhojanapalli and P. Jain. Universal matrix completion. arXiv preprint arXiv:1402.2324,2014. → pages 32[13] D. Bini. Border rank of ap× q× 2 tensor and the optimal approximation of a pair ofbilinear forms. In International Colloquium on Automata, Languages, and Programming,pages 98–108. Springer, 1980. → pages 90[14] P. Biswas, T.-C. Lian, T.-C. Wang, and Y. Ye. Semidefinite programming based algorithmsfor sensor network localization. ACM Transactions on Sensor Networks (TOSN), 2(2):188–220, 2006. → pages 31[15] R. C. Blei. Multidimensional extensions of the grothendieck inequality and applications.Arkiv fo¨r Matematik, 17(1):51–68, 1979. → pages 19, 24[16] P. T. Boufounos and R. G. Baraniuk. 1-bit compressive sensing. In Information Sciencesand Systems, 2008. CISS 2008. 42nd Annual Conference on, pages 16–21. IEEE, 2008. →pages 3, 62[17] J.-F. Cai, E. J. Cande`s, and Z. Shen. A singular value thresholding algorithm for matrixcompletion. SIAM Journal on Optimization, 20(4):1956–1982, 2010. → pages 3, 32[18] T. Cai and W.-X. Zhou. A max-norm constrained minimization approach to 1-bit matrixcompletion. Journal of Machine Learning Research, 14(1):3619–3647, 2013. → pages 4,63, 65, 69, 74, 77, 84[19] T. T. Cai, W.-X. Zhou, et al. Matrix completion via max-norm constrained optimization.Electronic Journal of Statistics, 10(1):1493–1525, 2016. → pages 31, 32, 33, 39, 40, 41,54, 58, 60, 65, 93, 95, 96[20] E. J. Candes. The restricted isometry property and its implications for compressed sensing.Comptes rendus mathematique, 346(9-10):589–592, 2008. → pages 2[21] E. J. Candes and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010. → pages 3, 32, 33[22] E. J. Cande`s and B. Recht. Exact matrix completion via convex optimization. Foundationsof Computational mathematics, 9(6):717–772, 2009. → pages 2, 32[23] E. J. Cande`s and T. Tao. The power of convex relaxation: Near-optimal matrix completion.IEEE Transactions on Information Theory, 56(5):2053–2080, 2010. → pages 2, 3, 32[24] J. D. Carroll and J.-J. Chang. Analysis of individual differences in multidimensionalscaling via an n-way generalization of ?eckart-young? decomposition. Psychometrika, 35(3):283–319, 1970. → pages 6, 8118[25] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry oflinear inverse problems. Foundations of Computational mathematics, 12(6):805–849,2012. → pages 19[26] Y. Chi and H. Fu. Subspace learning from bits. IEEE Transactions on Signal Processing,65(17):4429–4442, 2017. → pages 64[27] A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. Phan.Tensor decompositions for signal processing applications: From two-way to multiwaycomponent analysis. IEEE Signal Processing Magazine, 32(2):145–163, 2015. → pages 8[28] R. Coppi and S. Bolasco. Rank, decomposition, and uniqueness for 3-way and iv-wayarrays. 1989. → pages 7[29] C. Da Silva and F. J. Herrmann. Optimization on the hierarchical tuckermanifold–applications to tensor completion. Linear Algebra and its Applications, 481:131–173, 2015. → pages 1, 35[30] A. Daniely, N. Linial, and S. Shalev-Shwartz. More data speeds up training time inlearning halfspaces over sparse vectors. In Advances in Neural Information ProcessingSystems, pages 145–153, 2013. → pages 42[31] M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters. 1-bit matrix completion.Information and Inference, 3(3):189–223, 2014. → pages 4, 41, 56, 59, 62, 63, 64, 65, 66,70, 74, 75, 80, 81, 85, 86, 87, 89, 94, 95, 107, 110[32] L. De Lathauwer, B. De Moor, and J. Vandewalle. On the best rank-1 and rank-(r 1, r 2,...,rn) approximation of higher-order tensors. SIAM journal on Matrix Analysis andApplications, 21(4):1324–1342, 2000. → pages 35[33] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular valuedecomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253–1278,2000. → pages 7[34] V. De Silva and L.-H. Lim. Tensor rank and the ill-posedness of the best low-rankapproximation problem. SIAM Journal on Matrix Analysis and Applications, 30(3):1084–1127, 2008. → pages 90[35] H. Derksen. On the nuclear norm and the singular value decomposition of tensors.Foundations of Computational Mathematics, pages 1–33, 2013. → pages 9, 16, 77[36] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006. → pages 2[37] E. X. Fang, H. Liu, K.-C. Toh, and W.-X. Zhou. Max-norm optimization for robust matrixrecovery. Mathematical Programming, pages 1–31, 2015. → pages 40119[38] M. Fazel. Matrix rank minimization with applications. PhD thesis, PhD thesis, StanfordUniversity, 2002. → pages 2, 32[39] M. Filipovic´ and A. Jukic´. Tucker factorization with missing data with application to low-n n-rank tensor completion. Multidimensional systems and signal processing, 26(3):677–692, 2015. → pages 35[40] R. Foygel and N. Srebro. Concentration-based guarantees for low-rank matrixreconstruction. In COLT, pages 315–340, 2011. → pages 3, 32, 33, 39[41] R. Foygel, N. Srebro, and R. R. Salakhutdinov. Matrix reconstruction with the local maxnorm. In Advances in Neural Information Processing Systems, pages 935–943, 2012. →pages 40[42] S. Friedland. Variation of tensor powers and spectrat. Linear and Multilinear algebra, 12(2):81–98, 1982. → pages 16[43] S. Friedland and L.-H. Lim. Nuclear norm of higher-order tensors. arXiv preprintarXiv:1410.6072, 2014. → pages 9, 17, 77[44] S. Gandy, B. Recht, and I. Yamada. Tensor completion and low-n-rank tensor recovery viaconvex optimization. Inverse Problems, 27(2):025010, 2011. → pages 8, 34[45] N. Ghadermarzy, Y. Plan, and O¨. Yılmaz. Near-optimal sample complexity for convextensor completion. arXiv preprint arXiv:1711.04965, 2017. → pages v, 68, 69, 75, 76, 94[46] N. Ghadermarzy, Y. Plan, and O. Yilmaz. Learning tensors from partial binarymeasurements. arXiv preprint arXiv:1804.00108, 2018. → pages v[47] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weavean information tapestry. Communications of the ACM, 35(12):61–70, 1992. → pages 31[48] L. Grasedyck. Hierarchical singular value decomposition of tensors. SIAM Journal onMatrix Analysis and Applications, 31(4):2029–2054, 2010. → pages 8, 35[49] L. Grasedyck, D. Kressner, and C. Tobler. A literature survey of low-rank tensorapproximation techniques. GAMM-Mitteilungen, 36(1):53–78, 2013. → pages 34[50] A. Grothendieck. Produits tensoriels topologiques et espaces nucle´aires. Se´minaireBourbaki, 2:193–200, 1955. → pages 16[51] C. Guillemot and O. Le Meur. Image inpainting: Overview and recent advances. IEEEsignal processing magazine, 31(1):127–144, 2014. → pages 1[52] W. Hackbusch and S. Ku¨hn. A new scheme for the tensor representation. Journal ofFourier analysis and applications, 15(5):706–722, 2009. → pages 8120[53] R. A. Harshman. Foundations of the parafac procedure: Models and conditions for an”explanatory” multi-modal factor analysis. 1970. → pages 6, 8[54] B. Hidasi and D. Tikk. Fast als-based tensor factorization for context-awarerecommendation from implicit feedback. Machine Learning and Knowledge Discovery inDatabases, pages 67–82, 2012. → pages 64[55] C. J. Hillar and L.-H. Lim. Most tensor problems are np-hard. Journal of the ACM(JACM), 60(6):45, 2013. → pages 9[56] S. Holtz, T. Rohwedder, and R. Schneider. The alternating linear scheme for tensoroptimization in the tensor train format. SIAM Journal on Scientific Computing, 34(2):A683–A713, 2012. → pages 35[57] S. Hu. Relations of the nuclear norm of a tensor and its matrix flattenings. Linear Algebraand its Applications, 478:188–199, 2015. → pages 16, 77[58] W. Hu, D. Tao, W. Zhang, Y. Xie, and Y. Yang. A new low-rank tensor model for videocompletion. arXiv preprint arXiv:1509.02027, 2015. → pages 1[59] L. Jacques, J. N. Laska, P. T. Boufounos, and R. G. Baraniuk. Robust 1-bit compressivesensing via binary stable embeddings of sparse vectors. IEEE Transactions on InformationTheory, 59(4):2082–2102, 2013. → pages 3[60] P. Jain and S. Oh. Provable tensor factorization with missing data. In Advances in NeuralInformation Processing Systems, pages 1431–1439, 2014. → pages 35, 42, 49, 50[61] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternatingminimization. In Proceedings of the forty-fifth annual ACM symposium on Theory ofcomputing, pages 665–674. ACM, 2013. → pages 32[62] G. J. O. Jameson. Summing and nuclear norms in Banach space theory, volume 8.Cambridge University Press, 1987. → pages 70[63] T.-Y. Ji, N. Yokoya, X. X. Zhu, and T.-Z. Huang. Nonlocal tensor completion formultitemporal remotely sensed images’ inpainting. IEEE Transactions on Geoscience andRemote Sensing, 2018. → pages 104[64] F. John. Extremum problems with inequalities as subsidiary conditions. In Traces andemergence of nonlinear programming, pages 197–215. Springer, 2014. → pages 26[65] A. Karatzoglou, X. Amatriain, L. Baltrunas, and N. Oliver. Multiverse recommendation:n-dimensional tensor factorization for context-aware collaborative filtering. In Proceedingsof the fourth ACM conference on Recommender systems, pages 79–86. ACM, 2010. →pages 1, 64121[66] R. H. Keshavan, S. Oh, and A. Montanari. Matrix completion from a few entries. In 2009IEEE International Symposium on Information Theory, pages 324–328. IEEE, 2009. →pages 2, 32[67] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. Journalof Machine Learning Research, 11(Jul):2057–2078, 2010. → pages 32, 33[68] T. G. Kolda. Orthogonal tensor decompositions. SIAM Journal on Matrix Analysis andApplications, 23(1):243–255, 2001. → pages 16[69] T. G. Kolda. Multilinear operators for higher-order decompositions. United States.Department of Energy, 2006. → pages 5[70] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009. → pages 5, 7, 9[71] N. Kreimer, A. Stanton, and M. D. Sacchi. Tensor completion based on nuclear normminimization for 5d seismic data reconstruction. Geophysics, 78(6):V273–V284, 2013. →pages 1[72] D. Kressner, M. Steinlechner, and B. Vandereycken. Low-rank tensor completion byriemannian optimization. BIT Numerical Mathematics, 54(2):447–468, 2014. → pages 35[73] A. Krishnamurthy and A. Singh. Low-rank matrix and tensor completion via adaptivesampling. In Advances in Neural Information Processing Systems, pages 836–844, 2013.→ pages 34[74] H. J. Kushner and D. S. Clark. Stochastic approximation methods for constrained andunconstrained systems, volume 26. Springer Science & Business Media, 2012. → pages 44[75] J. Lafond, O. Klopp, E. Moulines, and J. Salmon. Probabilistic low-rank matrixcompletion on finite alphabets. In Advances in Neural Information Processing Systems,pages 1727–1735, 2014. → pages 64[76] J. N. Laska and R. G. Baraniuk. Regime change: Bit-depth versus measurement-rate incompressive sensing. IEEE Transactions on Signal Processing, 60(7):3496–3505, 2012.→ pages 62[77] M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes.Springer Science & Business Media, 2013. → pages 56, 96, 97[78] J. D. Lee, B. Recht, N. Srebro, J. Tropp, and R. R. Salakhutdinov. Practical large-scaleoptimization for max-norm regularization. In Advances in Neural Information ProcessingSystems, pages 1297–1305, 2010. → pages 18, 20, 40[79] N. Li and B. Li. Tensor completion for on-board compression of hyperspectral images. InImage Processing (ICIP), 2010 17th IEEE International Conference on, pages 517–520.IEEE, 2010. → pages 1, 104122[80] X. Li, Y. Ye, and X. Xu. Low-rank tensor completion with total variation for visual datainpainting. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. → pages 101[81] L.-H. Lim and P. Comon. Multiarray signal processing: Tensor decomposition meetscompressed sensing. Comptes Rendus Mecanique, 338(6):311–320, 2010. → pages 16[82] N. Linial, S. Mendelson, G. Schechtman, and A. Shraibman. Complexity measures of signmatrices. Combinatorica, 27(4):439–463, 2007. → pages 17[83] J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing valuesin visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220, 2013. → pages xi, 1, 8, 9, 34, 101, 102, 103, 104[84] Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation withapplication to system identification. SIAM Journal on Matrix Analysis and Applications,31(3):1235–1256, 2009. → pages 31[85] S. Ma, D. Goldfarb, and L. Chen. Fixed point and bregman iterative methods for matrixrank minimization. Mathematical Programming, 128(1-2):321–353, 2011. → pages 50[86] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented naturalimages and its application to evaluating segmentation algorithms and measuring ecologicalstatistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, pages 416–423, July 2001.→ pages 101[87] K. McGrath and J. Waterton. British social attitudes 1983-1986 panel survey, 1986. →pages xii, 110[88] J. Mocks. Topographic components model for event-related potentials and somebiophysical considerations. IEEE Transactions on Biomedical Engineering, 6(35):482–484, 1988. → pages 1[89] A. Montanari and N. Sun. Spectral algorithms for tensor completion. arXiv preprintarXiv:1612.07866, 2016. → pages 36[90] C. Mu, B. Huang, J. Wright, and D. Goldfarb. Square deal: Lower bounds and improvedrelaxations for tensor recovery. In Proceedings of the 31st International Conference onMachine Learning (ICML-14), pages 73–81, 2014. → pages 6, 7, 8, 34, 109[91] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrixcompletion: Optimal bounds with noise. Journal of Machine Learning Research, 13(May):1665–1697, 2012. → pages 33, 73, 74, 93, 96, 115[92] N. H. Nguyen, P. Drineas, and T. D. Tran. Tensor sparsification via a bound on the spectralnorm of random tensors. Information and Inference, page iav004, 2015. → pages 87123[93] R. Ni and Q. Gu. Optimal statistical and computational rates for one bit matrix completion.In Proceedings of the 19th International Conference on Artificial Intelligence andStatistics, pages 426–434, 2016. → pages 4, 63, 64, 73, 78, 89, 91, 92[94] D. Nion and N. D. Sidiropoulos. Tensor algebra and multidimensional harmonic retrievalin signal processing for mimo radar. IEEE Transactions on Signal Processing, 58(11):5693–5705, 2010. → pages 1[95] I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011. → pages 35[96] I. V. Oseledets and E. E. Tyrtyshnikov. Breaking the curse of dimensionality, or how to usesvd in many dimensions. SIAM Journal on Scientific Computing, 31(5):3744–3759, 2009.→ pages 8[97] P. Paatero. Construction and analysis of degenerate parafac models. Journal ofChemometrics: A Journal of the Chemometrics Society, 14(3):285–299, 2000. → pages 7[98] G. Pisier. The volume of convex bodies and Banach space geometry, volume 94.Cambridge University Press, 1999. → pages 55[99] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression:A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2013. → pages 3[100] Y. Plan, R. Vershynin, and E. Yudovina. High-dimensional estimation with geometricconstraints. Information and Inference: A Journal of the IMA, 6(1):1–40, 2017. → pages64[101] A. Potechin and D. Steurer. Exact tensor completion with sum-of-squares. arXiv preprintarXiv:1702.06237, 2017. → pages 36[102] X. Ramirez-Garcia and M. Garcı´a-Valdez. Post-filtering for a restaurant context-awarerecommender system. In Recent Advances on Hybrid Approaches for Designing IntelligentSystems, pages 695–707. Springer, 2014. → pages ix, 111, 113[103] C. Rashtchian. Bounded matrix rigidity and john’s theorem. In Electronic Colloquium onComputational Complexity (ECCC), volume 23, page 93, 2016. → pages 26[104] B. Recht. A simpler approach to matrix completion. Journal of Machine LearningResearch, 12(Dec):3413–3430, 2011. → pages 32[105] R. T. Rockafellar. Convex analysis. Princeton university press, 2015. → pages 19[106] T. Rohwedder and A. Uschmajew. On local convergence of alternating schemes foroptimization of convex problems in the tensor train format. SIAM Journal on NumericalAnalysis, 51(2):1134–1162, 2013. → pages 35124[107] R. Schatten. A theory of cross-spaces. Number 26. Princeton University Press, 1985. →pages 16[108] M. W. Schmidt, E. Van Den Berg, M. P. Friedlander, and K. P. Murphy. Optimizing costlyfunctions with simple constraints: A limited-memory projected quasi-newton algorithm. InAISTATS, volume 5, page 2009, 2009. → pages 44[109] A. Shashua and A. Levin. Linear image coding for regression and classification using thetensor-rank principle. In Computer Vision and Pattern Recognition, 2001. CVPR 2001.Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–42.IEEE, 2001. → pages 1[110] J. Shen, H. Xu, and P. Li. Online optimization for max-norm regularization. In Advancesin Neural Information Processing Systems, pages 1718–1726, 2014. → pages 40[111] M. Signoretto, L. De Lathauwer, and J. A. Suykens. Nuclear norms for tensors and theiruse for convex multilinear estimation. Submitted to Linear Algebra and Its Applications,43, 2010. → pages 34[112] M. Signoretto, R. Van de Plas, B. De Moor, and J. A. Suykens. Tensor versus matrixcompletion: a comparison with application to spectral data. IEEE Signal ProcessingLetters, 18(7):403–406, 2011. → pages 104[113] A. Singer and M. Cucuringu. Uniqueness of low-rank matrix completion by rigidity theory.SIAM Journal on Matrix Analysis and Applications, 31(4):1621–1641, 2010. → pages 31[114] N. Srebro. Learning with matrix factorizations. PhD thesis, Citeseer, 2004. → pages 25,32, 57[115] N. Srebro and R. R. Salakhutdinov. Collaborative filtering in a non-uniform world:Learning with the weighted trace norm. In Advances in Neural Information ProcessingSystems, pages 2056–2064, 2010. → pages 65[116] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In InternationalConference on Computational Learning Theory, pages 545–560. Springer, 2005. → pages17, 20, 25, 63[117] N. Srebro, J. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. InAdvances in Neural Information Processing Systems, pages 1329–1336, 2004. → pages63, 77[118] N. Srebro, J. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. InAdvances in Neural Information Processing Systems, pages 1329–1336, 2005. → pages 2,3, 32, 33, 40[119] N. Srebro, K. Sridharan, and A. Tewari. Optimistic rates for learning with a smooth loss.arXiv preprint arXiv:1009.3896, 2010. → pages 40, 58, 59125[120] R. Tomioka, K. Hayashi, and H. Kashima. Estimation of low-rank tensors via convexoptimization. arXiv preprint arXiv:1010.0789, 2010. → pages 34[121] A. Tonge. The von neumann inequality for polynomials in several hilbert-schmidtoperators. Journal of the London Mathematical Society, 2(3):519–526, 1978. → pages 19,24, 69[122] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966. → pages 7, 35[123] T. van Leeuwen and F. J. Herrmann. Fast waveform inversion without source-encoding.Geophysical Prospecting, 61(s1):10–19, 2013. → pages 1[124] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXivpreprint arXiv:1011.3027, 2010. → pages 91[125] L. Wang, X. Zhang, and Q. Gu. A unified computational and statistical framework fornonconvex low-rank matrix estimation. arXiv preprint arXiv:1610.05275, 2016. → pages64[126] Y. Xu, W. Yin, Z. Wen, and Y. Zhang. An alternating direction algorithm for matrixcompletion with nonnegative factors. Frontiers of Mathematics in China, 7(2):365–384,2012. → pages 3[127] Y. Yang and A. Barron. Information-theoretic determination of minimax rates ofconvergence. Annals of Statistics, pages 1564–1599, 1999. → pages 60[128] L. Yao, Q. Z. Sheng, Y. Qin, X. Wang, A. Shemshadi, and Q. He. Context-awarepoint-of-interest recommendation using tensor factorization with social regularization. InProceedings of the 38th international ACM SIGIR conference on research and developmentin information retrieval, pages 1007–1010. ACM, 2015. → pages 64[129] M. Yuan and C.-H. Zhang. On tensor completion via nuclear norm minimization.Foundations of Computational Mathematics, pages 1–38, 2015. → pages 9, 35, 72[130] M. Yuan and C.-H. Zhang. On tensor completion via nuclear norm minimization.Foundations of Computational Mathematics, 16(4):1031–1068, 2016. → pages 36, 40, 77[131] Z. Zhang and S. Aeron. Exact tensor completion using t-svd. arXiv preprintarXiv:1502.04689, 2015. → pages 35[132] Z. Zhang, G. Ely, S. Aeron, N. Hao, and M. Kilmer. Novel methods for multilinear datacompletion and de-noising based on tensor-svd. In Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition, pages 3842–3849, 2014. → pages xi, 101, 102[133] Y. Zheng, R. Burke, and B. Mobasher. Optimal feature selection for context-awarerecommendation using differential relaxation. Acm Recsys, 12, 2012. → pages 64126[134] P. Zhou, C. Lu, Z. Lin, and C. Zhang. Tensor factorization for low-rank tensor completion.IEEE Transactions on Image Processing, 27(3):1152–1163, 2018. → pages xi, 101, 102127
You may notice some images loading slow across the Open Collections website. Thank you for your patience as we rebuild the cache to make images load faster.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Near-optimal sample complexity for noisy or 1-bit tensor...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Near-optimal sample complexity for noisy or 1-bit tensor completion Ghadermarzy, Navid 2018
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Near-optimal sample complexity for noisy or 1-bit tensor completion |
Creator |
Ghadermarzy, Navid |
Publisher | University of British Columbia |
Date Issued | 2018 |
Description | Tensor completion is the problem of recovering a low-rank tensor from a partial subset of its entries. Assuming a rank-r, order-d tensor in ℝ^{NxNx...N}, the best sampling complexity achieved is O(rN^{d/2}) which can be obtained by a tensor nuclear-norm minimization problem. This bound is significantly larger than O(rdN), the number of free variables in a rank-r tensor. In this thesis, we prove that when r=O(1), we can achieve optimal sample complexity by constraining either one of two proxies for tensor rank, the convex M-norm or the non-convex max-qnorm. The max-qnorm is the generalization of matrix max-norm to tensors which is non-convex. The M-norm, on the other hand, is a convex atomic norm whose atoms are rank-1 sign tensors. We prove that both max-qnorm and M-norm of a bounded rank-r tensor are bounded by quantities that are independent of N. We also prove that the unit balls of both these norms have small Rademacher complexity. We analyze max-qnorm and M-norm constrained least squares minimization problems for tensor completion, proving that when r=O(1), m=O(Nd) measurements are sufficient for efficient estimation of the original tensor. We also use an information theoretic technique to prove that the dependence on N is optimal. Moreover, we design an alternating method for approximating the solution of max-qnorm tensor completion and do a thorough investigation of its performance on synthetic and real-world data. We also generalize the 1-bit matrix completion problem to higher-order tensors. We prove that when r=O(1) a bounded rank-r, order-d tensor T in ℝ^N x ℝ^N x ... x ℝ^N can be estimated efficiently by only m=O(Nd) binary measurements by regularizing either its max-qnorm or its M-norm. We prove that the sample complexity of recovering a low-rank tensor from 1-bit measurements of a subset of its entries is the same as recovering it from unquantized measurements. Moreover, we show the advantage of using 1-bit tensor completion over matricization both theoretically and numerically. Specifically, we show how the 1-bit measurement model can be used for context-aware recommender systems. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2018-08-21 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0371162 |
URI | http://hdl.handle.net/2429/66856 |
Degree |
Doctor of Philosophy - PhD |
Program |
Mathematics |
Affiliation |
Science, Faculty of Mathematics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2018-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2018_september_ghadermarzy_navid.pdf [ 2.33MB ]
- Metadata
- JSON: 24-1.0371162.json
- JSON-LD: 24-1.0371162-ld.json
- RDF/XML (Pretty): 24-1.0371162-rdf.xml
- RDF/JSON: 24-1.0371162-rdf.json
- Turtle: 24-1.0371162-turtle.txt
- N-Triples: 24-1.0371162-rdf-ntriples.txt
- Original Record: 24-1.0371162-source.json
- Full Text
- 24-1.0371162-fulltext.txt
- Citation
- 24-1.0371162.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0371162/manifest