Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Dimension reduction using Independent Component Analysis with an application in business psychology Shchurenkova, Elena 2017

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2017_may_shchurenkova_elena.pdf [ 652.98kB ]
JSON: 24-1.0343288.json
JSON-LD: 24-1.0343288-ld.json
RDF/XML (Pretty): 24-1.0343288-rdf.xml
RDF/JSON: 24-1.0343288-rdf.json
Turtle: 24-1.0343288-turtle.txt
N-Triples: 24-1.0343288-rdf-ntriples.txt
Original Record: 24-1.0343288-source.json
Full Text

Full Text

Dimension reduction usingIndependent Component Analysis withan application in business psychologybyElena ShchurenkovaSpecialist, Plekhanov Russian University of Economics, 2011A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Statistics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)March 2017c© Elena Shchurenkova 2017AbstractIndependent component analysis (ICA) is used for separating a set of mixedsignals into statistically independent additive subcomponents. The method-ology extracts as many independent components as there are dimensions orfeatures in the original dataset. Since not all of these components may be ofimportance, a few solutions have been proposed to reduce the dimension ofthe data using ICA. However, most of these solutions rely on prior knowl-edge or estimation of the number of independent components that are tobe used in the model. This work proposes a methodology that addressesthe problem of selecting fewer components than the original dimension ofthe data that best approximate the original dataset without prior knowl-edge or estimation of their number. The trade off between the number ofindependent components retained in the model and the loss of informationis explored. This work presents mathematical foundations of the proposedmethodology as well as the results of its application to a business psychologydataset.iiPrefaceThis work is the result of a collaboration with VIVO Team Development.The project was funded by the Natural Sciences and Engineering Councilof Canada through an ENGAGE grant, and was aimed at exploring thedata collected by VIVO Team Development as part of their business. Thestatistical research described in this thesis is an original unpublished workperformed by the author under the advice and supervision of Prof. MatiasSalibian-Barrera. The collaborators from VIVO Team Development pro-vided subject-matter expertise in business psychology. The dataset used inChapter 4 of this thesis is property of VIVO Team Development.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . ixDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Data description . . . . . . . . . . . . . . . . . . . . . . . . . 32 Independent Component analysis . . . . . . . . . . . . . . . . 52.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.2 Definition and mathematical notation . . . . . . . . . . . . . 72.3 Pre-processing data . . . . . . . . . . . . . . . . . . . . . . . 102.4 ICA with negentropy maximization . . . . . . . . . . . . . . 122.5 FastICA algorithm . . . . . . . . . . . . . . . . . . . . . . . . 172.6 Other approaches to Independent Component Analysis . . . 203 Component selection . . . . . . . . . . . . . . . . . . . . . . . . 233.1 Existing approaches . . . . . . . . . . . . . . . . . . . . . . . 23ivTable of Contents3.2 Proposed approach . . . . . . . . . . . . . . . . . . . . . . . 253.2.1 Similarity to multivariate multiple regression . . . . . 354 Component selection implementation and results . . . . . 364.1 Some exploratory analysis results . . . . . . . . . . . . . . . 364.2 Principal Component Analysis and Factor Analysis . . . . . 384.3 Choosing a suitable negentropy approximation . . . . . . . . 424.4 Implementation results . . . . . . . . . . . . . . . . . . . . . 454.5 Observations about the retained components . . . . . . . . . 515 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58AppendixA Table of loss values and lowest kurtosis values for variousinitial guesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61vList of Tables4.1 The standard deviations, the proportion of variance and thecumulative proportions of variance associated with the prin-cipal components. . . . . . . . . . . . . . . . . . . . . . . . . . 394.2 The sums of squares, the proportions of variance and thecumulative proportions of variance associated with the factors. 414.3 The comparison of the minimum kurtosis, the MSE incurredwhen retaining only 2 independent components out of theoriginal 18 and the average MSE across the possible numberof retained components for seeds 2020 and 2029. . . . . . . . 53A.1 The comparison of the minimum kurtosis, the MSE incurredwhen retaining only 2 independent components out of theoriginal 18 and the average MSE across the possible numberof retained components for all 29 seeds. . . . . . . . . . . . . 62viList of Figures2.1 The comparison of different approaches to negentropy approx-imation: kurtosis (based on the 4th degree) is given by thedashed line, logcosh approach with α = 1 is represented bythe solid red line and exponential – by the dash-dotted bluecurve. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164.1 Estimated densities of the original variables (centered andscaled). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374.2 Pearson Correlation coefficients between the original vari-ables. The colors indicate the strength of linear correlations. . 384.3 The loadings associated with the first three principal compo-nents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404.4 The loadings associated with the first six factors extractedusing factor analysis. . . . . . . . . . . . . . . . . . . . . . . . 414.5 Percentage of the converged cases of the FastICA algorithmin application to the VIVO data obtained on 500 runs start-ing from random initial values. The first point shows thepercentage of converged cases for exponential negentropy ap-proximation, all other — logcosh approximation with varyingvalues of the parameter α. . . . . . . . . . . . . . . . . . . . . 444.6 The minimum of the loss function across all possible retainedindependent components for each number of independent com-ponents retained in the model. . . . . . . . . . . . . . . . . . 47viiList of Figures4.7 The minimum MSE across all possible combinations of theretained independent components for each number of inde-pendent components retained in the model. Each value inthe original dataset is on a scale from 3 to 30. . . . . . . . . . 484.8 The minimum MSE values for each number of the retainedcomponents achieved on various starting points for the 29 seeds. 494.9 The minimum MSE values for each number of componentsretained achieved on various starting points for the 29 seedsshown by the seed number. . . . . . . . . . . . . . . . . . . . 494.10 The comparison of the mean squared errors obtained by themethod discussed in this thesis (boxplots) and MSEs obtainedusing an existing approach of first applying PCA. In case ofPCA application the losses do not differ depending on thestarting point. . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.11 Estimated densities of the 18 independent component drawnfrom the VIVO data, seed 2029. . . . . . . . . . . . . . . . . . 524.12 Estimated densities of the 18 independent component drawnfrom the VIVO data, seed 2020. . . . . . . . . . . . . . . . . . 534.13 An example of a supergassian component S13 (left) and asubgaussian component S2 (right) in the seed 2020. . . . . . . 544.14 Mean squared errors when 2 components are retained out ofthe original 18 plotted against the minimum kurtosis of theindependent components for each seed number. The pointsare labeled with the corresponding seed numbers. . . . . . . . 554.15 The p-values of 2-sample t tests comparing the total squareloss incurred for the subgaussian and supergaussian cases av-eraged across the number of components retained in the model. 56viiiAcknowledgementsI would like to acknowledge Professor Matias Salibian-Barrera for his adviceand supervision, which has been an immense help throughout my degreeand the work presented here. I am also acknowledging the team of VIVOTeam Development for providing the dataset that was used in my work andfor their business insights. I owe a special thanks to all the faculty, staff,my colleagues and friends in the Statistics Department of The University ofBritish Columbia. Another thanks goes to Natural Sciences and Engineer-ing Council of Canada for funding my research on this project through anEngage grant.Additionally, I offer gratitude for my colleagues in Raiffeisenbank, espe-cially to Vladimir Snorkin for providing me with additional motivation topursue Master’s level studies. A special thanks goes to Garnik Kakosyanfor his help and insistence when it came to education abroad. Last but notleast, I owe a great thanks to Andres Sanchez-Ordonez for his inexhaustibleencouragement and support.ixDedicationI dedicate this work to my parents Alexander Shchurenkov and TamaraPlyushch with immense gratitude for their support and patience. This workwould not be possible without their encouragement and contribution to myeducation and love for science.xChapter 1OverviewThis chapter contains a brief overview of the Independent Component Anal-ysis methodology used in this work and a detailed description of the datasetused for the application.1.1 IntroductionOne of the fundamental problems in multivariate data analysis is findinga convenient representation of random vectors for a particular application.It is customary to use linear transformations of the original data for thereasons of simplicity and computational efficiency [13]. Examples of suchapproaches are Principal Component Analysis, Factor Analysis and projec-tion pursuit. My research is focused on Independent Component Analysis(ICA), a recently-developed tool [11], the goal of which is to find a linear rep-resentation of the non-gaussian data such that the resulting components arestatistically independent, or as statistically independent as possible. Notethat uncorrelated components are not necessarily statistically independentunless gaussian distribution is assumed.Albeit being a relatively new methodology, ICA has been used in nu-merous disciplines and applications, including feature extraction and signalprocessing. A non-extensive list of applications includes:• Blind source separation with applications to watermarking and au-dio sources [23] [16], ECG (Bell & Sejnowski [3]; Barros et al. [1],McSharry et al. [20]), EEG (Mackeig et al. [19] [14]);• Signal and image denoising (Hyva¨rinen [9]), application in fMRI (Hansen[7])11.1. Introduction• Modeling of the hippocampus and visual cortex (Lorincz, Hyvarinen[18]);• Feature extraction and clustering, (Marni Bartlett, Girolami, Kolenda[2]);• Compression and redundancy reduction (Girolami, Kolenda, Ben-Shalom[4]).ICA is particularly useful in a setup where the data can be assumed tobe generated by several independent processes, such as in “cocktail partyproblem”. This problem constitutes identifying a particular source signal ina mixture of signals, and, possibly, noise, where little or no prior informationis available about the signals forming the mixture or the mixing process. Awidely used example of such problem is trying to separate one person’sspeech in a noisy setting, in other words, identifying a particular speechsignal in a mixture of several speakers and some background noise. In mythesis, I apply the ICA methodology to the business psychology survey datain order to identify the independent components that drive patterns in thesurvey responses.Applying the ICA methodology to the business survey data is motivatedby the fact that since the entire survey is tailored to assess team productivity,all the questions will have highly correlated responses. This happens becauseall of the survey questions represent different approaches to measuring thesame indicator. It is important to identify the driving forces behind theteam’s productivity, in particular, to assess how many independent sourcesaffect the team’s performance. Independent Component Analysis is a nat-ural tool to use for such inquiries. Since the survey contains quite a fewquestions, we are additionally interested in finding out whether it is possibleto reduce the dimensionality of the data with as little information loss aspossible. Thus, my research focuses on the dimension reduction using ICAwith an application to the business psychology survey data.Some methodologies used for finding a convenient representation of mul-tivariate data, along with Independent Component Analysis (ICA), include21.2. Data descriptionPrincipal Component Analysis (PCA) and Factor Analysis (FA) . The keydifference of ICA from PCA and FA is that ICA aims for the statisticalindependence of the resulting components, while the components obtainedusing PCA and FA are only linearly independent (uncorrelated), which ingeneral does not imply statistical independence. There are other technicaldifferences in the assumptions made in each case, which I will cover in detailin the following chapter.1.2 Data descriptionThe dataset used in this work consists of accumulated survey responses gath-ered by VIVO Team Development over the course of their business. VIVOTeam Development is an organization that performs consulting services forother companies aimed at improving the companies’ staff productivity. Inparticular, VIVO offers a data-driven approach to identifying productivityflaws as well as training for personnel to repair dysfunctional and improvemoderately-functional teams. Productivity assessment is done via analyz-ing the team’s responses to a survey composed by VIVO and calculatingthe six key team performance indicators. If the training is done to improvethe clients’ team’s performance, subsequent surveys are conducted to assessthe progress and track changes in the team productivity with training. Thissection contains details on the survey as well as the productivity indicatorsthat are measured using the survey.The Vital Statistics Survey designed by VIVO team development isaimed at a comprehensive assessment of a team’s productivity level. Theinitial survey is conducted before any training takes place to identify theteam’s strengths and areas for improvement. The survey is taken by eachmember of the team as well as the team’s leader. In case training is doneby the team, the subsequent surveys follow after 3 months in training andat conclusion of the training, which can take from 6 months to a year. Allthe team members take the initial survey as well as the subsequent ones.Some of the VIVO’s clients choose to not do training, hence only the initialsurvey results are recorded for these companies.31.2. Data descriptionThe survey consists of over 50 questions, each of which is aimed at as-sessing the team productivity. Each question asks the responder to choosea degree to which they agree or disagree with a statement based on a scalefrom 1 to 10, where 1 indicates complete disagreement and 10 – full agree-ment. Therefore, on each test occasion the response for each team memberis a vector with the same length as the number of questions, each entry beingan integer from 1 to 10. The responses to each question are then combinedto calculate the six key productivity indicators and the three sub-indicatorsof each key indicator, forming eighteen values. These values represent eachteam member’s perception of the team productivity. These 18 indicators ofproductivity will be used in this work as the variables of interest. By thenature of the data, all the 18 variables of interest are measured in the sameunits.The team members’ responses are summarized team-wise to form theaverage perception of productivity and these results are reported to theclient and used for choosing the training plan in case of the initial surveyor for tracking the team’s progress in case of the subsequent surveys. Forthe purposes of the independent components extraction, individual responseswere used as opposed to the aggregated team-wise data. Also, no distinctionwas made between the initial and the subsequent surveys.The dataset used for the analysis in this project contains the individualresponses to the Vital Statistics Survey gathered by VIVO Team Devel-opment over the course of their business. The latest survey version wasintroduced in the middle of 2015 and since then has been used for morethan 15 companies to assess and improve the performance of over 30 teams,which overall amounts to over 270 individual responses.4Chapter 2Independent ComponentanalysisThis chapter focuses on the methodology of Independent Component Anal-ysis and the approaches to defining ICA proposed in the existing literature.First, the motivation and intuition behind the method is explained, followedby the rigorous definition and mathematical properties.2.1 MotivationIndependent Component Analysis is closely related to what is known asa “blind source separation” problem in signal processing [10]. This prob-lem deals with the separation of the original (source) signals from a mixeddata with little or no information about the source signals or the mixingprocess. The problem can be formulated as follows: a set of original unob-served signals recorded in time s(t) = (s1(t), . . . , sp(t))t is mixed into theobserved signals x(t) = (x1(t), . . . xk(t))t with an unknown mixing matrixA = {aij}k×p:x(t) = As(t). (2.1)Note that the number of the original signals k is usually assumed to beequal to the number of the observed signals p. In case k > p linear unmixingcan be used, while if p > k, one must use the non-linear methods to derivethe source signals from the mixtures. Blind source separation aims to findthe unmixing matrix W = {wij}p×k to recover (or approximate) the source52.1. Motivationsignals from the mixture, asy(t) = Wx(t), (2.2)where y(t) is an approximation of s(t). The methods for blind source sep-aration aim at estimating the unmixing matrix utilizing the assumptionsmade about the source signals or the mixing process.An example of blind source separation problem is a well-known “cocktail-party problem”. I will be using this example to motivate ICA, followingHyva¨rinen and Oja [11]. Assume that there are two speakers speaking si-multaneously in a room as well as two microphones recording both speakers.The locations of the microphones are unknown and each microphone recordsa mixture of the speakers’ speech signals. If we denote the recorded signalsas x1(t) and x2(t), where t is the time index, and s1(t) and s2(t) as therespective speakers’ speech signals, we can write a linear equation:x1(t) = a11s1(t) + a12s2(t)x2(t) = a21s1(t) + a22s2(t)Blind source separation in this case refers to estimating the originalspeech signals s1(t) and s2(t) using only the recorded signals x1(t) and x2(t),assuming that the mixing parameters a11, a12, a21 and a22 are also unknown.ICA attempts to solve this problem by estimating the mixing coefficientsusing statistical properties of the signals si(t). In particular, one can esti-mate the parameters aij using the assumption of statistical independence ofthe original signals si(t) and their non-gaussian distributions. Other meth-ods rely on different assumptions about the signals or the mixing process.For example, if one is using Principal Component Analysis to identify theindependent signals, one should assume that the original signals, and hencethe mixtures, follow multivariate gaussian distribution. Such assumptionneeds to be made because the components extracted by PCA are linearlyindependent (uncorrelated), which in general does not imply statistical in-62.2. Definition and mathematical notationdependence unless the variables follow a multivariate gaussian distribution.Hence, if one aims for extraction of statistically independent signals, a mul-tivariate gaussian distribution must be assumed. In a wide range of applica-tions the signals are not necessarily gaussian, which provides an additionalmotivation for ICA.In my work I use ICA for separating the independent components re-lated to team productivity using the key productivity indicators as mixedsignals. It is not unreasonable to assume that the team productivity isdriven by several independent sources or components, which are measuredin the productivity survey, but not necessarily separated from one another.Since all the survey questions are intended to measure productivity, one canassume that the answers to all questions are driven by combinations of theseindependent sources.In the following section I state the rigorous definitions of ICA given inthe literature and also describe some ambiguities related to this method,which will be of importance in this research.2.2 Definition and mathematical notationIn order to define ICA more rigorously we use a statistical “latent vari-ables” model. Assume we are observing k linear mixtures of k independentcomponents. Dropping the time index t we now have:xi = ai1s1 + ai2s2 + · · ·+ aiksk, i = 1, . . . , k.We assume that each component sj is a random variable , j = 1, . . . , k.Hence, each mixture xi is also a random variable i = 1, . . . , k. Also, with-out loss of generality we assume that all the mixtures as well as all theindependent components have zero mean.For convenience, vector-matrix notation is often used. Let x be a randomvector of the observed mixtures, thus x = (x1, . . . , xk)t and s = (s1, . . . , sk)t– a vector of independent components. Additionally, define A as a k × k72.2. Definition and mathematical notationmixing matrix with elements {aij}. Now we can write the ICA model asx = As. (2.3)This model is generative, which means that it describes how the observeddata is generated via mixing the independent components s, which are thenon-observed latent variables. Our goal is to estimate both the independentcomponents s and the mixing matrix A using the observed values of x underas general assumptions as possible.Note that the definition of independent components in equation (2.3)implies two ambiguities of the method [11]:1. The variances of the independent components cannot be determined.This is because since both A and s are unknown, we can scale any ele-ment of s by a constant and divide the corresponding column of A bythe same constant to achieve the same multiplication result as beforescaling. Thus, in practice we fix the magnitude of the components tobe 1 (V ar(sj) = 1 for all j = 1, . . . , k) and adjust the matrix A accord-ingly. This still leaves the ambiguity of the sign, which fortunately isnot a big problem in our application.2. The order of the independent components cannot be determined. Wecannot determine the order of the components because for any (invert-ible) permutation matrix P we can write As = AP−1P s, where P s arethe same independent components, just in a different order. This am-biguity will be important later since each run of the ICA algorithmproduces independent components in a different order.The starting point of ICA is assuming that all components of s are sta-tistically independent. This means that their joint densities are factorisableinto a product of k densities. Note that statistical independence implieslinear independence for random variables with finite second moments, thusthe independent components produced by ICA will be uncorrelated. Fur-thermore, since uncorrelated variables are not independent unless gaussiandistribution is assumed, the principal components obtained using PCA will82.2. Definition and mathematical notationnot necessarily be statistically independent if the original data is not multi-variate gaussian.This assumption also implies a property of independent random vari-ables, namely that if we have functions h1, . . . , hk thenE[h1(s1)× · · · × hk(sk)] = E[h1(s1)]× · · · × E[hk(sk)]Note that if the functions h1(·), . . . , hk(·) are linear in s1, . . . , sk, thisproperty holds for uncorrelated or linearly independent variables s1, . . . , sk.For statistically independent variables this property will hold for a moregeneral class for functions h1(·), . . . , hk(·), such as power functions of theform h(s) = sp, where p ∈ R, polynomials and others.For ICA we must assume that the independent components s are non-gaussian. The reason behind the nongaussianity assumption is that any or-thogonal transformation of a gaussian random vector has exactly the samedistribution as the original random vector (taking into account the zero-mean constraint), therefore in case of gaussian components we can esti-mate the matrix A only up to an orthogonal transformation, which makesthe matrix unidentifiable. In case only one of the components is gaussian,the independent components could still be estimated [11], which makes themethodology useful for a mixture of several signals and a gaussian noise.However, we do not assume the distributions of the components known.For simplicity, we also assume that the matrix A is square, though thisassumption is non-essential and can be relaxed. In Chapter 3 I will describethe methods useful for dealing with non-square mixing matrices.Note that in most applications of ICA the algorithms first estimate the“inverse” of the mixing matrix A, denoted as W , which can then be used tocompute the extracted independent components s as:s = Wx. (2.4)The formula above raises the question of whether the assumption thatthe mixing matrix A is invertible needs to be made. Here we discuss the pre-92.3. Pre-processing dataprocessing steps usually performed on the dataset before the ICA methodol-ogy is applied to it, that, among other benefits, provide the invertible mixingmatrix without making additional assumptions.2.3 Pre-processing dataCustomarily, several pre-processing steps are performed on the data beforethe ICA methodology is applied to it. These steps include centering andwhitening (or sphering) of the original data and could be performed usingmultiple approaches. Here a few methods are outlined and the benefits ofthe preprocessing procedures are explained.1. Centering. This is a simple procedure aimed at making x a zero-mean vector, thus, if we denote E(x) = m setting x′ = x−m. In practice,where one is working with a dataset consisting of n realizations of the randomvector x, thus a data matrix X = (xij)n×k, the sample means are subtracted.Hence, we obtain X ′ = (x′ij)n×k, wherex′ij = xij −1nn∑l=1xlj .This step is made solely for simplification of the ICA algorithm, as itmakes both the original data and, by relation (2.4), the independent com-ponents, centered. This does not imply that the mean of s cannot be esti-mated. In fact, after the mixing matrix is estimated with the centered data,one can translate the centered estimates of s to the original scale by addingWm, which, because of (2.4) serves as the estimate of the mean of s.2. Whitening (or sphering). Whitening is a linear transformationof the centered data x′, which makes the resulting data z have an iden-tity covariance matrix. Thus, each component of z is uncorrelated with therest and has unit variance. There are numerous ways to perform a whiten-ing transform, among which most researchers use a PCA-type transformthrough eigenvalue decomposition (or singular value decomposition) of thecovariance matrix of the centered data E(x′(x′)t). Note that as the covari-102.3. Pre-processing dataance matrix is by definition symmetric, the eigenvalue and the singular valuedecompositions provide the same result. In particular, we decompose thecovariance matrix as E(x′(x′)t) = V DV t, were V is the k × k orthonormalmatrix, the columns of which are the eigenvectors of E(x′(x′)t) and D is thediagonal matrix of its k eigenvalues. We assignx˜ = V D−12V tx′.It is easy to see that this transformation indeed “whitens” the data asfollows:E(x˜x˜t) = V D−12V tE(x′(x′)t)V D−12V t= V D−12V tV DV tV D−12V t= V D−12DD−12V t= V V t = I.This step is performed for reducing the number of parameters to beestimated as well as for simplification of the ICA algorithm. In case of thewhitened original data, the mixing matrix A will always be orthogonal. Tosee this, we consider:E(x˜x˜t) = AE(sst)At = AIAt = I.The primary goal of this step is reducing the number of the parameters tobe estimated from k2 to k(k−1) as the degrees of freedom of the orthogonalsquare matrix A are k(k − 1). Additionally, as A is now orthogonal, it isimplicitly invertible with At being the inverse.An additional step performed in some applications is discarding theeigenvalues of the covariance matrix that are too small and working with thereduced dimension dataset. This step, similar to PCA, offers the benefitsof reducing noise and preventing over-learning. In this work, I discuss the112.4. ICA with negentropy maximizationdimension reduction after the ICA algorithm has been performed, thus I donot discard the low eigenvalues at this step.It can be added that whitening techniques are simple and plentiful and awhitening transformation can be found for any data. Thus, it is advisable toperform the data whitening before any ICA algorithm to reduce the problemcomplexity.2.4 ICA with negentropy maximizationMost of existing algorithms for ICA aim to maximize the non-gaussianityof the independent components, that is, maximize the departure of the in-dividual independent component distributions from a gaussian distribution.To see how this principle works first assume we have a vector x distributedas in equation (2.3). We would like to estimate one of the independentcomponents as a linear combination of the observed variables with some co-efficients y = wtx. Ideally, we would like wt to be equal to one of the rowsof W , the inverse of A.Making a change of variables z = Atw we have (since x = As (2.3)) :y = wtAs = ztsHence y is a linear combination of sj with coefficients zj , j = 1, . . . k. Forsimplicity assume that all the sj have the same distribution. Since by CLTany linear combination of independent variables is going to be “more gaus-sian” than the original variables, then by maximizing the non-gaussianity ofy we arrive at the solution with only one non-zero zj , corresponding to oneof the sj , j = 1, . . . k (since identical distributions of the sj are assumed).The vector w such that the non-gaussianity of y = wtx is maximized willprovide us with one of the independent components.The optimization landscape for the non-gaussianity of the k-dimensionalspace of vectors w has 2k local maxima, each corresponding to one of the sjor −sj (recall the ambiguities of ICA). Finding all of these local maxima isnot very challenging since the independent components are uncorrelated. A122.4. ICA with negentropy maximizationstepwise procedure can be devised, where to estimate the next component,or the next w, we restrict the space to all vectors that are orthogonal tothe vectors w which provided the independent components in the previoussteps. This othogonalization step will further explained in section 2.5.We determined that to estimate the independent components we needto maximize the non-gaussianity of y = wtx. Now we review the existingmeasures of non-gaussianity and choose the one most useful for our purposes.There are several measures of non-gaussianity, such as kurtosis. Kurtosisof a random variable y is a measure of the “tailedness” of its distribution, many extreme observations does the distribution of y produce. Kurtosisis defined as:Kurt(y) =E(y − E(y))4E[(y − E(y))2]2In general, the further the kurtosis is from the value of 3, which is thekurtosis value for any univariate gaussian distribution, the further the distri-bution is from a gaussian distribution. The distributions with the kurtosisvalue greater than 3 (supergaussian) tend to produce more extreme observa-tions than gaussian distributions, while the kurtosis values less than 3 (sub-gaussian) generally indicate less heavy tails than in gaussian case. Strictlyspeaking, a kurtosis value of 3 is not unique to the gaussian distributions,other distributions with specific parameters can have such a kurtosis value aswell. An example is the binomial distribution with p = 12±√112 . However, itis difficult to find such distributions and such specific values of parameters.Gaussian distributions have kurtosis value of 3 regardless of the parametervalues, which is why kurtosis in customarily used to assess the gaussianityof a distribution. However, lately kurtosis is mostly not used in applicationsbecause it may not be defined for heavy-tailed distributions [15]. Amongall other possible non-gaussainity measures most of the ICA methods usenegentropy.Entropy is one of the basic concepts of information theory and can beinterpreted as a measure of deviation from uniformity of a distribution of a132.4. ICA with negentropy maximizationrandom variable. For continuous random variables and vectors we calculatedifferential entropy. For example, a random vector y with a density fy(t)has entropy:Hy = −∫· · ·∫fy(t) log fy(t)dtIt is useful to note that if all the components of y are mutually inde-pendent then we can calculate the entropy for this vector as a sum of theentropies of its components. This can be seen as follows:Hy = −∫fy(t) log fy(t)dt= −∫· · ·∫fy(t1, . . . , tk) log fy(t1, . . . , tk)dt1 . . . dtk= −∫· · ·∫fy1(t1) . . . fyk(tk)k∑j=1log fyj (tj)dt1 . . . dtk= −k∑j=1∫· · ·∫fy1(t1) . . . fyk(tk) log fyj (tj)dt1 . . . dtk= −k∑j=1∫fyj (tj) log(fyj (tj))dtj=k∑j=1Hyj .One of the fundamental results of information theory is that the gaussiandistribution parametrized as N(µ, σ2) has the maximum entropy among alldistributions with mean µ and variance σ2 [11],[6].The concept of negentropy (or negative entropy) can be introduced andused to measure the departure from gaussianity. Negentropy for a randomvector y can be calculated as142.4. ICA with negentropy maximizationJy = Hygauss −Hy (2.5)Here ygauss denotes a gaussian random vector with the same mean andcovariance matrix as the vector y. Maximizing negentropy to achieve themaximum departure from gaussianity is well-justified by statistical theoryand negentropy is considered to be an optimal measure of non-gaussianityas far as statistical properties are concerned [11]. Alternatively, one canminimize the entropy Hy to achieve the same results.Unfortunately, entropy is hard to calculate due to the need to numericallyapproximate the density. Therefore, several less computationally expensiveapproximations for negentropy as a distance measure from a guassian distri-bution have been proposed in the literature. Among others I will mentionlogcosh and exponential approaches shown in [8], since these are the mostwidely used approximations for ICA implemented in most software packages(R, Matlab, etc.).In general, both logcosh and exponential approximations have the fol-lowing form:Jy ≈k∑j=1[E(G(yj))− E(G(ygaussj ))]2 (2.6)Here ygaussj is a gaussian random variable of the same mean and varianceas yj and G(·) is some non-quadratic function. Since we assume that allindependent components have zero mean and unit variance, all ygaussj havea standard normal distribution. Since the second addendum in the equationabove is constant, in order to maximize negentropy, one must maximizeE(G(yj)). Logcosh and exponential approaches correspond respectively to152.4. ICA with negentropy maximizationthe following choices of the function G:Glogcosh(u) =1alog cosh au (2.7)Gexp(u) = − exp(−u22) (2.8)Here a is some parameter between 1 and 2.There are several existing algorithms for performing ICA on a givendataset. Most efficient and widely-used algorithm is FastICA. FastICA usesnegentropy to measure non-gaussianity and offers a choice of logcosh andexponential approaches to the approximation of negentropy. Both of theseapproaches represent more robust estimations than kurtosis, since kurtosisis very sensitive to extreme observations because of the fourth moment usedin its calculation. The comparison of the three approaches is presented inFigure 2.1.Figure 2.1: The comparison of different approaches to negentropy approx-imation: kurtosis (based on the 4th degree) is given by the dashed line,logcosh approach with α = 1 is represented by the solid red line and expo-nential – by the dash-dotted blue curve.162.5. FastICA algorithm2.5 FastICA algorithmHere I present the FastICA algorithm, which is used to extract the indepen-dent components in this work.FastICA focuses on finding a matrix W , or a set of k vectors w1, . . . ,wkthat correspond to the k independent components by maximizing the non-gaussianity of wtjx, j = 1, . . . , k. This section contains a summary of themathematical justification and derivation of this algorithm, for further de-tails please refer to [11]. Assume that we are working with the pre-processeddata x, thus E(x) = 0 and E(xxt) = I. For details on the pre-processingsteps refer to section 2.3.Our goal is to maximize negentropy of wjx for all j. Independent ofwhich approximation for negentropy is used, we want to maximize an ex-pression of form:k∑j=1E(G(yj)) =k∑j=1E(G(wtjx)),where wtj is the j-th row vector of the matrix W . Now we write this maxi-mization problem for a particularwj dropping the index, in form of Lagrangemultipliers, denoting the objective function with O:O(w) = E(G(wtx))− β2(wtw − 1).The second term is needed because for the pre-whitened data the rowsand the columns of matrix W , the inverse of the mixing matrix A, arenormalized. We take a derivative of the objective function and set it tozero, getting:D(w) =∂O(w)∂w= E(xg(wtx))− βw = 0. (2.9)172.5. FastICA algorithmHere g(·) is a non-linear function, which is the derivative ofG(·) discussedin the previous section. Hence for the logcosh and exponential approachesrespectively:glogcosh(u) = tanh(au), (2.10)gexp(u) = u exp(−u22). (2.11)The system of equations in (2.9) can be iteratively solved by the Newton-Raphson method, setting each next estimate of w asw← w − J−1D (w)D(w),where the Jacobian function has the form:JD(w) = E(xxtg′(wtx))− βI.We can simplify the first term by approximating E(xxtg′(wtx)) ≈E(xxt)E(g′(wtx)) = IE(g′(wtx)) because of the pre-whitening step. Withthis approximation the Jabobian matrix becomes diagonal, thus can be eas-ily inverted to obtain the following update formula:w← w − 1E(g′(wtx))− β (E(xg(wtx))− βw)Multiplying both sides by a scalar β−E(g′(wtx)), we get a simple updatestep:w← E{xg(wtx)− E{g′(wtx)w}} (2.12)Note that w on the left side is multiplied by a scalar, thus renormal-ization is needed at every step of the iterative algorithm. Also, since the182.5. FastICA algorithmindependent components need to be uncorrelated, when estimating multiplew1, . . . ,wk one needs to perform orthogonalization of the estimated matrixW at every step, otherwise some wj ’s risk to converge to the same value.The orthogonalization of W can be performed using matrix square roots:W ← (WW t)−1/2W. (2.13)The matrix square root shown in equation (2.13) is defined using eigen-value decomposition of WW t. Let WW t = UDU t, where U is the orthonor-mal matrix of the eigenvectors of WW t and D is the diagonal matrix of itseigenvalues. We define (WW t)−1/2 = UD−1/2U t, where the inverse and thesquare root of the diagonal matrix D is applied element-wise to the eigen-values on the main diagonal. The orthogonalization shown in (2.13) willindeed make W orthonormal as:WW t = ((WW t)−1/2W )((WW t)−1/2W )t= (WW t)−1/2WW t(WW t)−1/2= UD−1/2U tUDU tUD−1/2U t= UD−1/2DD−1/2U t= UU t = I.Notice that the procedure described above (Gram-Schmidt orthogonal-ization) is not the only way orthogonalization can be performed. Alter-natively, Cholesky decomposition, Householder transformation and Givensrotation could be used. A few approximate methods also exist for high-dimensional problems.With all the required steps outlined above, we are ready to write theFastICA algorithm.192.6. Other approaches to Independent Component AnalysisData: The initial data x, the number of independent components k,initial guess for W as W0 = (w1, . . . ,wk)tWhiten x as described in section 2.2;repeatfor j = 1 to k doUpdate wj ← E{xg(wtjx)− E{g′(wtjx)wj}}, where g(·) is ofform (2.10) or (2.11);endPerform the symmetric orthogonalization of W = (w1, . . . ,wk)tby (2.13);until convergence;Algorithm 1: Pseudo-code for the parallel FastICA algorithm.Note that the algorithm above could be easily extended to a datasetconsisting of n realizations of the random vector x, thus a data matrixX = (xij)n×p. In this case, the expected values in the algorithm aboveare calculated as sample means (the averages of the rows of the resultingmatrices). To simplify the algorithm and decrease computation time whenworking with large datasets, one can take sample means of randomly chosensubsets of the data.The convergence here is defined by two subsequent estimates of W beingsufficiently close together. Hence, no computation of negentropy or approx-imations of densities are needed. This is one of the reasons the FastICAalgorithm is very efficient.2.6 Other approaches to IndependentComponent AnalysisA few other approaches for Independent Component Analysis have beenproposed in the literature. Likelihood may seem like a natural tool to usefor ICA, especially since deriving the likelihood for the model is not verymathematically challenging. However, there are a few drawbacks to usingthis approach. First of all, it requires the estimation of the densities of theindependent components, which is a difficult task. The solutions proposed202.6. Other approaches to Independent Component Analysisinclude either using prior information about the independent componentdensities or restricting the densities of all components to a finite-parameterfamily. This option leads to another potential drawback: usually the densi-ties of the subgaussian and supergaussian independent components need tobe modeled separately because they belong to different distribution families.Another approach is the minimization of mutual information. Mutualinformation of a random vector is defined asI(y1, . . . , yk) =k∑j=1H(yj)−H(y)and is a natural measure of dependence between the components of thevector. A small difference of this approach with maximizing negentropyis that by maximizing non-gaussianity one restricts the components to beuncorrelated, while in case of mutual information minimization this doesnot necessarily have to be the case.Another approach to estimating the unmixing matrix in ICA discussedby K. Nordhausen et al. in [22] relies on scatter matrices with the so-called“independent property”. The essence of this method relies on using scattermatrices other than the sample covariance matrix to whiten the data asdescribed in section 2.3. A matrix-valued functional of a random vector xwith CDF Fx is called a scatter matrix if it is positive-definite, symmetricand affine invariant. A sample covariance matrix is an example of a scattermatrix, however there are numerous alternative techniques to constructingthose, for example M-functionals, S-functionals and etc. A scatter matrixis said to have the independent components property if S(Fx) is a diagonalmatrix for all random vectors x with mutually independent components. Acovariance matrix is an example of scatter matrix with this property. M-functionals and S-functionals do not generally have this property but can besymmetrized to possess the independent components property. Using two ofscatter matrices with the independent components property yields an esti-mate of an unmixing matrix that will depend on which scatter matrices areused to construct it. Nordhausen et al. [22] recommend using robust options212.6. Other approaches to Independent Component Analysisfor both scatter matrices, which yields favorable estimates of the unmixingmatrix in case of symmetrically distributed independent components. Theprocedure is however computationally costly and known to produce poorerresults for asymmetric independent components.A few other methods include tensorial methods proposed by L. De Lath-auwer [17], nonlinear ICA reviewed in [5] and Bayesian approach to Inde-pendent Component Analysis recently revisited by Mohammad-Djafari [21].22Chapter 3Component selectionOne of the assumptions made in ICA is that the mixing matrix A is square,therefore, there are as many independent components as there are the recordedmixed signals. In order to relax this assumption, a few methods have beenproposed, most of which require the researcher to choose or estimate thenumber of components to be extracted prior to the application. For exam-ple, one can use a Factor Analysis-type model of formx = As+ ,where A is a k-by-p real matrix, x is a k-dimensional random vector of theoriginal data and s is a p-dimensional vector of the independent components.For noiseless ICA  is assumed to be a k-dimensional deterministic vector ofzeros.My research is concerned with using fewer independent components thanthe dimensionality of the original data. That is, finding a way to choose themost “useful” components out of the ones extracted in case the number ofcomponents is not specified prior to the application. In this chapter I willfirst briefly introduce some existing approaches to extracting less compo-nents than the number of features in the original dataset and then discussthe proposed approach and its mathematical basis.3.1 Existing approachesThere is little literature available on the topic of dimensionality reductionusing ICA because ICA was not originally developed for that purpose [24].233.1. Existing approachesOne method for producing fewer independent components than the di-mensionality of the original data proposed in [11] suggests to reduce thedimension of the problem while performing the pre-processing (whitening)of the data. Recall, data whitening is often performed with a PrincipalComponents-type procedure, where the eigenvector decomposition of thecovariance matrix is used as follows: E(xxt) = V DV t, where D is the diag-onal matrix of eigenvalues of the covariance matrix and V is the orthonormalmatrix, the columns of which are the eigenvectors. The whitened data arethen computed as:z = V D−12V tx′,where x′ is the centered version of x and the inverse and the square rootare applied element-wise to the main diagonal of D. It is possible to re-duce dimensions to a pre-determined value p by discarding the smallestk − p eigenvalues from D and the corresponding eigenvectors from V . Thewhitened data with the reduced dimensions is then fed to the ICA model,which produces p independent components. This assumes prior knowledgeof the number of independent components that should be retained in themodel. Furthermore, the independent components will be estimated on thewhitened dataset, with some features excluded when dropping the smalleigenvalues, which may be undesirable and complicate the interpretabilityof the results.Another recently proposed approach in [24] uses virtual dimensionalityto determine the number of components to be retained in the model. Virtualdimensionality is an approach to estimating the number of sources presentin a signal/image processing context. This estimate in convolution with aprioritization mechanism is then utilized to choose which components areto be retained. The prioritization mechanism mentioned above is based onthe sample skewness and kurtosis of the extracted independent components.Another approach proposed in the same paper suggests running ICA mul-tiple times starting at various initial values and comparing the components243.2. Proposed approachreceived on each run. The independent components that are output bythe algorithm often for various starting points are then deemed statisticallyimportant and retained in the analysis, while the components that appearrarely are discarded. The last two approaches were developed in applicationto hyperspectral image analysis.3.2 Proposed approachThe approach developed in course of my research focuses on a differentway to address the dimension reduction using ICA. When there is no priorknowledge about the number of independent components that should beretained, it may be prudent to perform ICA keeping the original dimensionsof the data. After obtaining the independent components, one can choose thecomponents that should be retained and discard the unneeded components.This choice could be made by comparing the predictive power of differentcombinations and numbers of the retained components or in another way assuits the application goals. An example of a situation, where this methodcould be applicable, is a multivariate dataset, where each variable is a mixof several signals as smooth functions of time, and noise, or several sourcesof noise. As long as the condition of independence of the noise from thesignal is met, ICA will be able to separate the source signals, among whichthe noise will be easily identifiable. Once the sources are separated, onecan proceed with choosing the components that one is interested in. Myresearch focuses on whether or not there is a meaningful way to choose thecomponents to be retained as well as exploring the trade-off between theretained number of components and the amount of information lost whenreducing dimensions in such way.Assume we are working with a dataset, where each observation has kfeatures and there are n observations overall. Assume here that n > k. Letus denote the vector of reduced (or truncated) independent components assi, where out of original k components, p have been retained, i = 1, . . . , n. Ofcourse, in practice, for each p there are(kp)possibilities of si (the choice of253.2. Proposed approachthe retained components is assumed to be consistent across the i = 1, . . . , n).Ideally, we would like to solve the equations:xi = A∗si, i = 1, . . . , n (3.1)for the mixing matrix A∗k×p in order to fully reconstruct the original datawithout any information loss using only p out of the k independent compo-nents.As explained in section 2.3, customarily the data is pre-processed, i.e.centered, scaled and whitened, before the application of ICA. Denote thepre-processed data zi, where zi for i = 1, . . . , n can be written as:zi = V D− 12V t(xi − µ) (3.2)Here µ is the vector of expected values of xi, D is the diagonal matrix ofthe eigenvalues of the covariance matrix 1n∑ni=1 xixti and V is the matrix,whose columns are the eigenvectors. Hence, E(z) = 0 and 1n∑ni=1 zizti =Ik×k.We assume that independent components si have mean zero and covari-ance matrix Ip×p. Therefore, for (3.1) to be satisfied we need:Ik×k =1nn∑i=1zizti =1nn∑i=1A∗sisti(A∗)t = A∗Ip×p(A∗)t = A∗(A∗)tThus, we would need the rank of A∗ to be k, but since A∗ is a k× p matrix,where p < k, this situation cannot occur.We have shown that an exact reconstruction of the data is not possibleusing fewer than the original number of components. Now, consider the casethat we want to get the best possible approximation to the original data byusing the reduced independent components extracted from that data. Wecan measure the quality of approximation by the square of the L2 norm ofthe difference between the original (or the whitened) data and the estimatedvalues, obtained by reconstruction using the reduced independent compo-nents. For now we are working with the whitened data zi, i = 1, . . . , n. In263.2. Proposed approachorder to minimize the information loss, thus getting the best approximationof the original data, we need to obtain a mixing matrix which minimizes thefollowing target function:F(A) =n∑i=1‖zi −Asi‖22 (3.3)Note the si vectors here are obtained from the whitened data as describedin Chapter 2. Additionally, please notice that our target function is a squareof Euclidean norm. We need to find:minAF(A) = minAn∑i=1k∑j=1(zij −p∑l=1Ajl × sil)2(3.4)This is a problem of a function minimization on a manifold of non-squarematrices. In order to solve the problem in (3.4) we compute the Frechetderivative of the L2-norm in (3.3) as a function of a k × p matrix A.Frechet derivative is a generalization of a derivative of real-valued func-tion of a single variable to (possibly) vector-valued function of multiple realvariables. It is defined as follows. Let W and V be Banach spaces andU ⊂ V be an open subset of V . A function F : U → W is called Frechetdifferentiable at A ⊂ U if there exists a bounded linear operator Df(A,H)such thatlim‖H‖V→0‖f(A+H)− f(A)−Df(A,H)‖W‖H‖V = 0 (3.5)Frechet derivative is defined on Banach spaces, which include matrixmanifolds. In our case, space W is the real line and space V is a matrixmanifold of p × k real-valued matrices, while H is a p × k real matrix (thedirection matrix). Hence the function F : Rp×k → R defined in (3.3) couldbe Frechet differentiable. The limit in the equation above is a usual limitof a function defined on a metric space, thus the expression in (3.5) is afunction of H in V , hence the first-order expansion holds. We attempt tofind the linear operator DF(A,H) using this first-order expansion:273.2. Proposed approachF(A+H) = F(A) +DF(A,H) + o(‖H‖) (3.6)We apply the definition given in equation (3.6) to the minimization prob-lem in (3.4) to get the expression for the Frechet derivative. H(1) denotesthe direction matrix for the first order derivative.F(A+H(1)) =n∑i=1k∑j=1(zij −p∑l=1(Ajl +H(1)jl )× sil)2=n∑i=1k∑j=1((zij −p∑l=1Ajl × sil)−p∑l=1H(1)jl × sil)2=n∑i=1k∑j=1((zij −p∑l=1Ajl × sil)2 − 2(zij −p∑l=1Ajl × sil)(p∑l=1H(1)jl × sil)+ [p∑l=1H(1)jl × sil]2)=n∑i=1k∑j=1(zij −p∑l=1Ajl × sil)2 − 2n∑i=1k∑j=1(zij −p∑l=1Ajl × sil)(p∑l=1H(1)jl × sil)+n∑i=1k∑j=1[p∑l=1H(1)jl × sil]2= F(A) +DF(A,H(1)) + o(‖H(1)‖)Thereby, the Frechet derivative of the cost function in (3.3) has the form:DF(A,H(1)) = −2n∑i=1p∑j=1(zij −k∑l=1Ajl × sil)(k∑l=1H(1)jl × sil)In order for A0 to solve the minimization problem in (3.4) we need twoconditions to hold:1. DF(A,H(1)) = 0 for all H(1);283.2. Proposed approach2. There exists an a > 0 such that D2F(A,H(2), H(2))) ≥ a‖H(2)‖ for allH(2).We would like to find such A0 that both conditions above are satisfied.We propose A0 as:A0 =1nn∑i=1zi(si)t. (3.7)This is a k× p matrix as zi is a k-dimensional vector of the original data forall i = 1, . . . n and si is a p-dimensional truncated vector of the independentcomponents. We need to show that this choice of A0 satisfies the condition1 above, or DF(A0, H(1)) = 0 for all choices of H(1). First, we notice that:DF(A,H(1)) = −2n∑i=1k∑j=1(zij − (Asi)j)(H(1)si)j= 2n∑i=1k∑j=1(zi −Asi)j(H(1)si)j= −2n∑i=1(zi −Asi)tH(1)si.Further, rewriting the product above as a trace and using the propertiesof independent components we have:293.2. Proposed approachDF(A0, H(1)) = −2n∑i=1(zi −A0si)tH(1)si= −2n∑i=1tr((zi −A0si)tH(1)si)= −2n∑i=1tr(si(zi −A0si)tH(1))using the cyclic invariance of the trace= −2n∑i=1tr((sizti − sistiAt0)H(1))using the transposition properties= −2n× tr((1nn∑i=1sizti − [1nn∑i=1sisti]At0)H(1))= −2n× tr((At0 − Ip×pAt0)H(1))= 0 using the independence of si.We have shown that A0 satisfies the condition 1 above. Now we wouldlike to show that the condition 2 also holds for our choice of A0. In fact, thecondition is a definition of a strongly convex (or α-convex) function. Recall,that our target function is a square of L2 norm (Euclidean norm). We canprove the strong convexity of our function by first noticing that F can berewritten as:F(A) =n∑i=1k∑j=1‖zij −Ajsi‖22, (3.8)where Aj denotes the j-th row of A, Aj ∈ Rp, j = 1, . . . k. As F can bewritten as a sum of nk squares of L2 norms of affine functions of Aj , it isenough to show that the square of L2 norm of a vector is a strongly convexfunction and then use the property that the sum of strongly convex functionsis still strongly convex.One of the forms of definition of a strongly convex function f with pa-rameter α is that for all k-dimensional real-valued vectors x and y and fort ∈ [0, 1] is:303.2. Proposed approachf(tx+ (1− t)y)− tf(x)− (1− t)f(y) ≤ −12αt(1− t)‖x− y‖22 (3.9)In our case f(x) = ‖x‖22. Let x and y be p-dimensional vectors, let 〈x,y〉denote the inner product of x and y. We expand the definition above toget:f(tx+ (1− t)y)− tf(x)− (1− t)f(y)= t2‖x‖22 + 2t(1− t)〈x,y〉+ (1− t)2‖y‖22 − t‖x‖22 − (1− t)‖y‖22= t(t− 1)‖x‖22 − t(1− t)‖y‖22 + 2t(1− t)〈x,y〉= −t(1− t) (‖x‖22 − 2〈x,y〉+ ‖y‖22)= −t(1− t)‖x− y‖22≤ −12t(1− t)‖x− y‖22 as t(1− t)‖x− y‖22 ≥ 0We have shown that a square of an L2-norm of a vector is strongly convexwith α = 1. As the function F defined in (3.3) is a sum of nk squares ofL2 norms of affine functions of p-dimensional vectors, hence it is stronglyconvex with the same parameter α = 1.We have shown that our choice of A0 satisfies the necessary and thesufficient conditions for the global minimum, solving the problem in (3.4).Let us write zi as rows of matrix Zn×k and si as rows of matrix Sn×p.Thus, we get:A0 =1nn∑i=1zisti =1nZtS (3.10)It turns out that the matrix A0 chosen as in (3.10) satisfies At0A0 = Ip×p.In order to show this, first we assume without loss of generality that theindependent components retained are the first p out of the k independentcomponents. Hence, we can write Sn×p = S˜n×kPk×p, where S˜ is the full313.2. Proposed approachmatrix of independent components, thus S˜n×k, and P is a matrix that allowsus to choose first p components of S˜, which can be written as:P =(Ip×p0(k−p)×p), (3.11)where 0(k−p)×p is a (k − p)× p matrix of zeros. Therefore, P tP = Ip×p. Weneed to remember that si are the independent components that satisfy Z =SA for a k × k mixing matrix A, furthermore, Ik×k = 1nZtZ = 1nAtStSA =AtA, therefore, by the normality of orthogonal matrices, AAt = Ik×k. Now,At0A0 = (1nZtSP )t(1nZtSP )=1n2P tStZZtSP=1n2P tStSAAtStSP= P t1nStSAAt1nStSP= P tIk×kAAtIk×kP = Ip×p.Hence, At0 is the left inverse to A0, and tr(At0A0) = p. Notice that we canrewrite the cost function defined in (3.3) as:323.2. Proposed approachF(A) =n∑i=1p∑j=1(zij −k∑l=1Ajl ∗ sil)2=n∑i=1p∑j=1(zij − (Asi)j)(zij − (Asi)j)=n∑i=1p∑j=1(zi − (Asi))j(zi − (Asi))j=n∑i=1(zi − (Asi))t(zi − (Asi))=n∑i=1tr((zi −Asi)t(zi −Asi)).Now, we evaluate the loss function (3.3) at A0:F(A0) =n∑i=1tr((zi −A0si)t(zi −A0si))=n∑i=1tr(zti − stiAt0)(zi −A0si))=n∑i=1tr(ztizi − stiAt0zi − ztiA0si + stiAtoA0si)= n(tr(1nn∑i=1ztizi)− tr(1nn∑i=1stiAt0zi)− tr(1nn∑i=1ztiA0si) + tr(1nn∑i=1stiAt0A0si))= n(tr(1nn∑i=1ztizi)− tr(At01nn∑i=1stizi)− tr(A01nn∑i=1sizti) + tr(At0A01nn∑i=1sisti))= n(k − tr(At0A0)), recall that tr(At0A0) = p= n(k − p).This result holds no matter which p independent components out ofthe original k are retained in the truncated vectors s. So, working on thewhitened data, the information loss depends only on the number of compo-333.2. Proposed approachnents that are excluded from the model and not on which components theseare.Fortunately, when we return to the original scale of the data, i.e. do thereverse transformation to (3.2), we see that which components are selecteddoes make a difference.For the minimization results above to hold the data does not have tobe whitened, it can just be scaled. Denote yi as the original data that hasbeen centered and scaled, but not whitened. i.e yi = K−1(xi − µ), whereK is a diagonal matrix of the standard deviations of each components ofxi. The covariance matrix of yi here will have ones on the main diagonalbut does not have to have zero off-diagonal elements (the correlations of thecomponents of x remain unchanged). In the exact same way it can be shownthat in order to minimize the loss functionF(A) =n∑i=1‖yi −Asi‖22 (3.12)One should choose A as A0 =1n∑ni=1 yisti. The results for the mini-mization will still hold, but A0 will no longer be semi-orthogonal and left-invertible, also the value of F(A0) will change.Overall, we conclude that given si as the truncated independent compo-nents and xi as the original data, one can find a matrix A0 that will minimizethe loss function F(A) in equation (3.12) as A0 = 1n∑ni=1 yisti, where yi arethe scaled non-whitened original data. After the optimal weighting matrix isfound for the particular (truncated) subset of the independent components,the subset can be updated and the process can be repeated. Also, we foundout that which components we choose does not change the loss in case of thewhitened data, but does change it if the data is only centered and scaled.The methodology shown here allows to perform the dimensionality reduc-tion and components selection without any prior knowledge of the numberof independent components to be retained. First, we extract the full set ofindependent components. Then, for every possible combination and number343.2. Proposed approachof components to be retained, we can easily compute the information loss asdefined in (3.12), using the optimal non-square weighting matrix A0. Oncethis is done, the researcher can choose the number and the combination ofthe independent components to be retained in the model in accordance tohis or her tolerance of the information loss, thus modeling the data withfewer variables than are present in the original dataset.3.2.1 Similarity to multivariate multiple regressionOne can view the minimization problem in (3.4) as finding a Least Squaressolution to a multivariate multiple linear regression problem, that can bewritten as:Z = S ×A+  (3.13)Here Z is the n × k matrix of the whitened data, S is the n × p matrix ofthe reduced independent components, A is the k × p non-square weightingmatrix and  is the k-dimensional vector of errors. Or, similarly, replacingZ with Y , where Y is the matrix of centered and scaled, but not whiteneddata. The Least Squares solution to this problem is given as [12]:Aˆ = (StS)−1StZ =1nStZ = At0. (3.14)Similarly, for the correlated data, the least squares estimate is:Aˆ = (StS)−1StY =1nStY. (3.15)This fact provides additional confidence that our chosen solution in factminimizes the loss function in (3.3).The rest of this thesis focuses on the application of the results above tothe data obtained from VIVO Team Development and observing how thechoice of the number of components retained in the model influences theamount of loss when reconstructing the original data.35Chapter 4Component selectionimplementation and resultsThis chapter focuses on the implementation of the methodology describedin the previous chapter. The techniques are applied to the VIVO TeamDevelopment data, described in Chapter 1. First, some exploratory analysisresults are presented. Then, some standard dimension reduction techniquesare applied to the dataset. Next, some issues with the implementation ofthe existing ICA algorithms are discussed. Following this, I show the resultsof implementation of the findings from Chapter 3 and the comparison ofthis method to an existing method of dimensionality reduction using ICA.Finally, I present some findings about the components that tend to be chosenby the algorithm and contrast them with some practices that are currentlyemployed.4.1 Some exploratory analysis resultsBefore proceeding with implementation of the method discussed in Chapter3, I present some results of the exploratory analysis of the data. The datasetX ∈ Rn×k for the ICA algorithm is assumed to contain n independent andidentically distributed (i.i.d.) observations of the random vector x ∈ Rk. Incase of the VIVO data, the number of features is k = 18, while the numberof observations is n = 278. As mentioned in Chapter 1, each of the valuesin the dataset is measured on the same scale from 3 to 30, as each of the18 variables is calculated as a sum of the scores for three survey questions,where the score for every question is an integer between 1 and 10.In this section I will be working with the centered and scaled, but not364.1. Some exploratory analysis resultswhitened data. It is obtained from the original dataset as follows:Y = (X −m)K−1Here m is the vector of sample means of the columns of X and K is thematrix, the diagonal elements of which are the sample standard deviationsof the columns of X and the off-diagonal elements are zero.First, I present the estimated densities of the components of Y . In Fig-ure 4.1 we can see that the original densities of the 18 variables are unimodaland mostly slightly left-skewed. The distributions exhibit no extreme obser-vations, which is due to the nature of the data.Figure 4.1: Estimated densities of the original variables (centered andscaled).A correlation matrix is an important part of descriptive analysis of mul-tivariate data. The correlation matrix of Pearson pairwise correlations forthe VIVO data is presented in Figure 4.2. All the variables in the dataset arehighly correlated, which is often a feature of survey datasets. In our case, allthe survey questions are aimed at evaluating different sides of productivity,which are interrelated.In the next section, we will apply Principal Component Analysis and374.2. Principal Component Analysis and Factor AnalysisFigure 4.2: Pearson Correlation coefficients between the original variables.The colors indicate the strength of linear correlations.Factor Analysis, widely used dimension reduction methods to the VIVOdataset and briefly discuss the results and the number of components eachmethod indicates fit for the data.4.2 Principal Component Analysis and FactorAnalysisIn this section I show the application of “traditional”, in a sense of popularand widely used, dimensionality reduction techniques.First, we apply Principal Component Analysis to the VIVO data. Thistechnique uses an orthogonal rotation to convert the correlated multivariatedata into a set of uncorrelated principal components. The first principalcomponent is chosen in such way that it has the highest possible variance,the subsequent components have the highest variance under the constraint ofbeing orthogonal to the previously computed components. Dimensionalityreduction using PCA refers to keeping fewer principal components than theoriginal dimension of the data, which in our case is 18. The number ofcomponents to be kept can be decided, among other options, based on the384.2. Principal Component Analysis and Factor Analysiseigenvalues of the sample correlation matrix of the data or on how muchvariance of the data the selected number of components captures.Table 4.1 shows the variance percentage attributable to each of the prin-cipal components (up to the 9th principal component). In our case, the firstcomponent holds 69% of the variance and clearly dominates over the othercomponents, that contain less than 5% each.Component Standard dev Proportion of Var Cumulative ProportionPC1 3.51 0.69 0.69PC2 0.89 0.04 0.73PC3 0.81 0.04 0.77PC4 0.79 0.03 0.80PC5 0.72 0.03 0.83PC6 0.70 0.03 0.86PC7 0.64 0.02 0.88PC8 0.54 0.02 0.90PC9 0.52 0.01 0.91Table 4.1: The standard deviations, the proportion of variance and the cu-mulative proportions of variance associated with the principal components.Figure 4.3 shows the loadings associated with the first three principalcomponents. It can be observed that the first principal component hasapproximately equal loadings associated with each of the original variables,thus can be interpreted as a total questionnaire score. The signs of theprincipal components are non-unique, hence the negative loadings could beviewed as positive ones. The second and the third principal components havelittle variance associated with them and have distinct loadings for severalof the original variables. They can be interpreted as contrasts, which isalways the case when the first principal component includes all the originalvariables with the same sign.Overall, Principal Component Analysis shows one dominating compo-nent, which includes all the original variables with approximately equalweights and accounts for 69% of the variance of the data.After applying PCA, the natural extension is to apply Factor Analysisto the VIVO data. Factor analysis is a method that attempts to describe a394.2. Principal Component Analysis and Factor AnalysisFigure 4.3: The loadings associated with the first three principal compo-nents.multivariate dataset of correlated variables in terms of a set of unobservedorthogonal factors, the number of which is potentially smaller than the di-mension of the original dataset. Exploratory Factor Analysis differs fromPCA in the sense that it includes error term in the model and thus regardsthe PCA eigenvalues as inflated by errors. Unlike PCA, estimating the fac-tors is an optimization procedure, aiming for the highest joint variance ofthe factors. Factor analysis aims to find a generative model for the data,while PCA simply chooses an orthogonal rotation that maximizes each ofthe principal components variances one after another.Under the framework of factor analysis, Chi-square test is performed totest for ideal model fit. In other words, it tests the null hypothesis thatthe number of factors included in the model is sufficient. In case of theVIVO data, we fail to reject the ideal fit hypothesis starting from 7 factors.Together these 7 factors capture 80.5% of the variance of the data. Table4.2 illustrates the percentage of variance associated with each of the factors.Another option to determine the sufficient number of factors is usingfit statistics to compare the observed correlation matrix and the structuredcorrelation matrix based on p factors, p = 1, 2, . . . , k.We see that the first 5 factors each account for over 10% of the data404.2. Principal Component Analysis and Factor AnalysisFactor SS loadings Proportion of Var Cumulative Proportion1 2.69 0.15 0.152 2.60 0.15 0.303 2.25 0.13 0.424 2.03 0.11 0.535 2.02 0.11 0.646 1.59 0.09 0.737 1.31 0.07 0.81Table 4.2: The sums of squares, the proportions of variance and the cumu-lative proportions of variance associated with the factors.variance and the percentages do not decrease as drastically as they do incase of PCA. The individual loadings of the first six factors are displayedin Figure 4.4. All of the factors include all of the original variables withnon-zero weights.Figure 4.4: The loadings associated with the first six factors extractedusing factor analysis.In this section we applied the standard dimension reduction tools to theVIVO dataset, namely Principal Component Analysis and Factor Analysis.PCA indicates the presence of one dominating component, which includesall of the original variables with approximately equal weights. FA, in turn,414.3. Choosing a suitable negentropy approximationidentifies that the data may be a contaminated mixture of at least 7 orthog-onal factors, each of which include all of the original variables with non-zero(but non-equal) weights.In the remainder of the chapter, I will apply dimensionality reductionusing ICA, the method described in Chapter 3 of this thesis.4.3 Choosing a suitable negentropyapproximationBefore performing the component selection and dimension reduction, oneneeds to run the ICA algorithm to extract the independent componentsfrom the dataset.The dataset X ∈ Rn×k for the ICA algorithm is assumed to contain ni.i.d. realizations of the random vector x ∈ Rk. The data is assumed to bepreprocessed as described in section 2.3 to form the input matrix Z as:Z = X ′V D−1/2V t, (4.1)where X ′ = (x′ij)n×k and x′ij = xij − 1n∑ni=1 xij is the centered data ma-trix. D in equation (4.1) is the diagonal matrix of the eigenvalues of thesample covariance matrix computed as 1n(X′)tX ′ and V is the matrix of itseigenvectors.As discussed in the previous chapter, for now we retain as many com-ponents as there are features in the original dataset, thus, by making thereverse transformation Z = S˜At we are able to reconstruct the original dataperfectly. Here S˜ is the full matrix of the estimated independent components, S˜ ∈ Rn×k and A is the square independent components loadings matrix,A ∈ Rk×k.I used the FastICA algorithm to extract the independent componentsfrom the VIVO survey data. This algorithm looks for an orthogonal ro-tation of the data that maximizes statistical independence of the resultingcomponents. Note that maximizing statistical independence here in achieved424.3. Choosing a suitable negentropy approximationby maximizing non-gaussianity as described in Chapter 2, i.e. the algorithmmaximizes a measure of non-gaussianity for each of the resulting compo-nents. As mentioned in Chapter 2, this algorithm provides two choices ofnegentropy approximation as measured of non-gaussianity, exponential andlogcosh. This section discusses which approximation is better suited forthe VIVO dataset, and, additionally, reviews some of the properties of theFastICA algorithm, that came to light while running this analysis.The FastICA algorithm works by updating the orthogonal rotations ofthe data, so that the resulting components show the highest negentropy.That is, the algorithm iteratively estimates the matrix W in (2.4), whichis the inverse of A in equation (2.3), the independent components loadingsmatrix. Given the iterative procedure of the FastICA algorithm, an initialpoint for W needs to be specified. Usually the the starting W is chosen atrandom (in this work, rnorm was used to generate the initial values). Noticethat the initial guess for W does not have to be orthogonal as Gram-Schmidtorthogonalization is performed at every step of the FastICA algorithm, asdescribed in section 2.5.If two subsequent values for W are close enough, then the algorithmconverges and uses the output estimate of W to compute the full matrix ofindependent components as:S˜ = ZW t. (4.2)The algorithm implemented in the R fastIca package by default con-tains no convergence checks at the output, which means that an inverseloadings matrix will be output every time the algorithm is run. The algo-rithm stops in the following cases:• If two subsequent values of W are close enough, i.e. if the norm oftheir difference is below the pre-specified tolerance level. In this casethe algorithm converges and the latest value of W is output;• If the number of iterations reached the pre-specified maximum numberof iterations provided by the user. In this case, the algorithm outputs434.3. Choosing a suitable negentropy approximationW value from the last iteration.I built in a simple convergence indicator into the existing fastICA function,which checks whether the output value of W was produced on the maximumiteration or prior to that. The algorithm is assumed to converge if the outputvalue of W was achieved before the maximum iteration. Then I investigatedwhich negentropy approximation provided convergence in most cases for theVIVO dataset. It can be seen in Figure 4.5, in the case of the VIVO data,logcosh algorithm with the parameter α = 1 showed the best convergenceresults.Figure 4.5: Percentage of the converged cases of the FastICA algorithm inapplication to the VIVO data obtained on 500 runs starting from randominitial values. The first point shows the percentage of converged cases forexponential negentropy approximation, all other — logcosh approximationwith varying values of the parameter α.The FastICA algorithm is very sensitive to the starting point, whichmeans that the returned estimates of the matrix W will vary dependingon the initial guess. This holds true even when the algorithm convergeswith extremely low tolerance levels. Since different initial values providedifferent output estimates of W , the estimated S˜ will vary (4.2). When allthe independent components are retained, this fact is not important, becausethe variations in S˜ will correspond to the variations in A and Z = S˜At444.4. Implementation resultswill still hold. This happens because, as both A and S˜ are estimated, anychange in the estimate of S˜ will be compensated by a change in the estimateof A. However, the loss incurred when retaining only several independentcomponents will vary depending on the initial guess. The following sectionwill show, among other results the magnitude of this variance.4.4 Implementation resultsThis section states the main results of implementation of the method dis-cussed in Chapter 3. The dataset obtained from VIVO Team Develop-ment contains 18 variables measured 278 times. We denote this datasetas X ∈ R278×18. The implementation of the methodology consists of thefollowing steps:1. Center the dataset, i.e. compute X ′ = (x′ij)278×18 by subtracting thesample means of each column of X as x′ij = xij − 1n∑nl=1 xlj . Recordthe column means of X as m;2. Standardize the dataset, i.e. compute Y = X ′K−1, where K is thediagonal matrix of sample standard deviations of the columns of X ′.Note that this step is not compulsory for the FastICA procedure, butthe output matrix Y is used in step 6;3. Whiten the dataset, computing Z = X ′V D−12V t, where D is the di-agonal matrix of the eigenvalues of the sample covariance matrix com-puted as 1n(X′)tX ′ and V is the matrix, the columns of which are theeigenvectors of the sample covariance matrix;4. Run the FastICA algorithm on Z, keeping all the 18 variables. Theinitial guess for the matrix W is 18×18 matrix of independent randomdraws from a standard normal distribution;5. Check whether the fastICA algorithm converged and, if so, use theoutput matrix W to compute an estimate of the full matrix of inde-pendent components as S˜ = ZW t, thereby obtaining S˜ ∈ R278×18, theestimates of the 18 independent components;454.4. Implementation results6. For each p from 2 to 18:• Compute(18p)possible reduced (truncated) matrices S, and,consequently A0 =1nYtS for each of the possible S matrices;• Compute the estimates of Yˆ based on all possible truncated ma-trices S as Yˆ = SAt0;• Return to the original scale of the data by Xˆ = Yˆ K+ µˆ and com-pute the loss values F = ∑ni=1 ‖Xˆi −Xi‖22, where Xˆi representsthe i-th row of Xˆ and Xi stands for the i-th row of X. For eachp,(18p)of the loss values should be computed;• Record the combination of the independent components that pro-vides the lowest value of loss and the value of loss at that combi-nation, then proceed for the next p;7. As output, for each possible number of the retained components, dis-play the combination of the independent components that providedlowest value of loss and the value of the loss function with only theretained components included.The steps above produce a vector of minimum losses across the combina-tions of components when retaining any number from 2 to 18 independentcomponents for each random starting point. The value of loss decreases asthe amount of independent components retained increases and the loss iszero when all of the 18 components are retained in the model. Such vectorobtained for a single run of the procedure above can be seen in Figure 4.6.Figure 4.7 shows mean squared error (MSE) for the dataset reconstructedusing different numbers of retained components corresponding to the lossesshown in Figure 4.6. The mean squared error is computed as:MSE =1n1kn∑i=1k∑j=1‖xij − xˆij‖22. (4.3)464.4. Implementation resultsFigure 4.6: The minimum of the loss function across all possible retainedindependent components for each number of independent components re-tained in the model.Here xij is the j-th element of the i-th row of the original dataset and xˆijis the j-th element of the i-th row of the reconstructed dataset obtained asdescribed in the point 6 of the list above.Figure 4.7 provides the reader with a chance to access the magnitude ofMSE compared to the sale of the original data. Each value in the originaldataset is on a scale from 3 to 30.As was mentioned in the previous section, the FastICA methodologyis sensitive to the initial value for W . Therefore, different starting pointsprovide different loss and MSE values. Figure 4.8 displays the boxplots ofMSE values based on varying initial values (in case of my implementation,varying seeds). I used pseudo-random number generator seeds from 2001 to2030. On the seed 2003 the FastICA algorithm failed to converge, thereforethis point has been excluded. The variation of the mean squared errors ishigh when only a few components are included and decreases as the numberof components to be retained increases.Additionally, Figure 4.9 displays the lines representing the mean squared474.4. Implementation resultsFigure 4.7: The minimum MSE across all possible combinations of theretained independent components for each number of independent compo-nents retained in the model. Each value in the original dataset is on a scalefrom 3 to 30.errors for each starting point. This visual gives the reader a chance toassess whether some starting points provide the independent componentsestimates that are consistently better in terms of data reconstruction, nomatter how many components are retained in the model. For example,seed 2029 seems to consistently produce the highest MSE values, while seed2020 often corresponds to the lowest MSE. The detailed comparison of thecomponents extracted starting from these two seeds is discussed in the nextsection.As mentioned in Chapter 3, an existing method of dimensionality reduc-tion includes first using Principal Component Analysis to select a number oflinear combinations of the original variables and then implementing ICA onthe retained principal components as described in section 2.3. The resultsof this method are shown in Figure 4.10 compared to the method exploredin my research. The comparability of these approaches may be hindered bythe fact that, while in my approach ICA is run once for each seed and theindependent components from this run are then used to evaluate the losses,484.4. Implementation resultsFigure 4.8: The minimum MSE values for each number of the retainedcomponents achieved on various starting points for the 29 seeds.Figure 4.9: The minimum MSE values for each number of componentsretained achieved on various starting points for the 29 seeds shown by theseed case of PCA prior to ICA, ICA has to be run for each number of retained494.4. Implementation resultscomponents separately and there is no guarantee that each of these runs willconverge. To obtain a more comparable plot, I used the same seeds as forthe method discussed in this work. It is clear from Figure 4.10 that the useof PCA before applying ICA to the data reduces the losses incurred whenreconstructing the data by approximately a half for any number of retainedcomponents (except when all the components are retained). This may beexplained by the fact that, while discarding several principal componentscorresponding to the smallest eigenvalues generates some information loss,building a full ICA model on the retained principal components means thatwe can reconstruct those principal components perfectly with the inversetransformation (recall, no information is lost when the dimensionality ofthe components matches the dimensionality of the data). Hence, the entireamount of information lost in case of using PCA corresponds to discardingthe lowest eigenvalues. PCA by definition gives the best L2 approximationon linear transformations from the original data. This is also the reasonbehind the fact that the losses for PCA prior to ICA method do not varydepending on the starting point. Any change in the estimated values ofindependent components is balanced with the change in the correspondingelements of the weighting matrix A, which means we can reconstruct theprincipal components perfectly, thus the losses in this method only includethe deterministic portion that appears when discarding the small eigenvaluesat the PCA stage.To summarize the application outcomes, the results obtained when us-ing the method developed in course of my research highly depend on thestarting point and provide reasonable loss values that decrease faster thanthe number of independent components retained in the model increases.Because the results differ so much based on the starting point, I concen-trated on the foundations and conceptual applications of the methodologyand not the interpretation relative to the VIVO dataset. In principle, theinverse loadings of the independent components that are found to producethe smallest loss when retained in the model could be interpreted as theloadings of the 18 measurements derived from the VIVO questionnaire.504.5. Observations about the retained componentsFigure 4.10: The comparison of the mean squared errors obtained by themethod discussed in this thesis (boxplots) and MSEs obtained using anexisting approach of first applying PCA. In case of PCA application thelosses do not differ depending on the starting point.4.5 Observations about the retained componentsThis section includes some observations about the components retained inthe model by using the method discussed in this thesis. I will not go intodetail concerning these observations, but they may be relevant to possiblefuture research on the ICA methodology.As mentioned in the previous sections, the estimates of the indepen-dent components vary with the initial guess of the inverse loadings matrix.Thereby, different random starts produce different estimates of the indepen-dent components. The densities of the extracted components starting atdifferent points are compared in the figures below. The plots display theextreme cases, that is, the random starts with the highest and the lowestaverage MSE across the number of components retained in the model.Figure 4.11 shows the estimated densities of the independent componentsfor seed number 2029, which provided the highest loss estimates for mostnumbers of components retained in the model across the 29 seeds that thealgorithm was run for.514.5. Observations about the retained componentsFigure 4.11: Estimated densities of the 18 independent component drawnfrom the VIVO data, seed 2029.Figure 4.12 shows the estimated densities for components for the seed2020, which provided the lowest loss values for most numbers of compo-nents retained in the model. As mentioned in Chapter 2, the order of thecomponents is somewhat ambiguous, thus changes with every initial guess.However, in Figure 4.12 we see that the distributions of some of the inde-pendent components have no analog in Figure 4.11. For example, the com-ponent S2 seems to have no counterpart. The ambiguities of ICA describedin Chapter 2 state that one cannot determine the order of the componentsand their sign. None of these ambiguities should affect the distribution ofthe components.As discussed in Chapter 2, the FastICA algorithm aims to maximizethe statistical independence of the output components by maximizing theirnon-gaussianity. One measure of non-gaussianity is kurtosis, which is non-robust in general, but can nevertheless serve as an indicator. Since neitherthe initial data (because of the nature of survey data) nor the independentcomponents exhibit extreme observations, we can use kurtosis, defined insection 2.4, to compare the components obtained starting from random seeds524.5. Observations about the retained componentsFigure 4.12: Estimated densities of the 18 independent component drawnfrom the VIVO data, seed 2020.2020 and 2029, as shown in Table 4.3, which is an extract from Table A.1given in Appendix A, showing the loss and kurtosis values for all seeds:Seed MinimumKurtosisMSE 2componentsAverage MSE2020 2.16 9.66 3.772029 4.74 15.19 5.67Table 4.3: The comparison of the minimum kurtosis, the MSE incurredwhen retaining only 2 independent components out of the original 18 andthe average MSE across the possible number of retained components forseeds 2020 and 2029.If we consider the minimum sample kurtosis among the extracted com-ponents (among the columns of S˜) for each of the 29 seed used in my ap-plication, shown in Appendix A, we see that roughly in half of the casesthe minimum kurtosis value is below 3, which means that at least one ofthe components is subgaussian (we will refer to these cases as subgaussian),while in other cases, the minimum kurtosis value is above 3, which meansall the extracted components are supergaussian (these cases are referred tolater as supergaussian). Note that subgaussian distributions are character-534.5. Observations about the retained componentsized by a low peak compared to gaussian distributions and a strong taildecay property, which means their tails decay at least as fast as the tails ofgaussian distributions. Supergaussian distributions, on the contrary, havehigher density at the peak compared to gaussian family and often heaviertails. Density plots depicting examples of both types of distributions incomparison to a gaussian distribution are shown in Figure 4.13.Figure 4.13: An example of a supergassian component S13 (left) and asubgaussian component S2 (right) in the seed 2020.If we plot the mean squared error obtained when reconstructing thedataset retaining only 2 of the 18 components against the minimum kurto-sis value of the independent components, we notice an interesting tendency.The runs where at least one of the independent components is subgaussianproduce lower MSE values than those where all the components are super-gaussian, as shown in Figure 4.14. I do not have a clear explanation forwhy this tendency is taking place or whether it is specific to the dataset Iam using or persists across different datasets, this may be a potential areaof future research.Another observation is that the subgaussian cases produce significantlylower loss values for the numbers of components retained in the model up to14, after which the differences are no longer statistically significant. I per-544.5. Observations about the retained componentsFigure 4.14: Mean squared errors when 2 components are retained out ofthe original 18 plotted against the minimum kurtosis of the independentcomponents for each seed number. The points are labeled with the corre-sponding seed numbers.formed two-sample t tests comparing the mean losses for the subgaussiancases and the supergaussian cases across the number of retained components.The p-values of the tests can be seen in Figure 4.15. The plot shows thecutoff value at 0.05, which does not take into account the multiple compar-isons issue, but even with the conservative Bonferroni correction, the resultsare significant up to 12 components retained in the model.The observations described in this section persist even when the ICAalgorithm converges at a very low tolerance level. Hence, these observationsmay be indicating a tendency that is characteristic to the component se-lection method described in this thesis. Possibly, the ICA solution is notunique in case of the VIVO data. As mentioned in Chapter 2 the solutionsare not unique for multivariate Gaussian initial data, thus it may be possiblethat the VIVO data resembles a multivariate gaussian distribution enoughfor multiple solutions to exist.Another possible conclusion from these observations is that while largerkurtosis indicates more departure from gaussianity, which is interesting inmany applications [11], this approach does not provide the best selection of554.5. Observations about the retained componentsFigure 4.15: The p-values of 2-sample t tests comparing the total squareloss incurred for the subgaussian and supergaussian cases averaged acrossthe number of components retained in the model.the independent components in terms of recreating the original data whenretaining fewer components than the original dimensions of the data. Someother criteria may need to be used for the purpose of dimensionality re-duction with ICA. One possible criteria is choosing the components thatminimize information loss described in this thesis.56Chapter 5ConclusionThis document provides the mathematical foundation to an approach to thecomponent selection and dimension reduction using Independent Compo-nent Analysis. This method is applicable when the number of componentsto be retained in the model is not known prior to implementation and allowsto choose the number of components to be retained based on the tolerancelevel for the information loss. The method focuses on choosing fewer com-ponents than the original dimension of the data that best approximate theinitial dataset in terms of a squared Euclidean norm. Also the method offersa closed-from solution for estimating the optimal weighting matrix when notall of the independent components are retained in the model.The methodology was applied to a business psychology dataset and theresults have been provided and discussed. The results of proposed methodol-ogy have been compared to an existing method, the benefits and drawbacksof each method have been reviewed. Additionally, a few observations havebeen made about the independent components extracted from the data us-ing different initial values. These observations may serve as a starting pointfor further research and development of theoretical and practical sides todimension reduction using Independent Component Analysis.57Bibliography[1] Barros, A. K., Mansour, A., and Ohnishi, N. (1998). Adaptive blindelimination of artifacts in ecg signals. Proceedings of I and ANN.[2] Bartlett, M. S., Movellan, J. R., and Sejnowski, T. J. (2002). Facerecognition by independent component analysis. IEEE Transactions onneural networks, 13(6):1450–1464.[3] Bell, A. J. and Sejnowski, T. J. (1995). An information-maximization ap-proach to blind separation and blind deconvolution. Neural computation,7(6):1129–1159.[4] Ben-Shalom, A., Werman, M., and Dubnov, S. (2003). Improved low bit-rate audio compression using reduced rank ica instead of psychoacousticmodeling. In Acoustics, Speech, and Signal Processing, 2003. Proceed-ings.(ICASSP’03). 2003 IEEE International Conference on, volume 5,pages V–461. IEEE.[5] Burel, G. (1992). Blind separation of sources: A nonlinear neural algo-rithm. Neural networks, 5(6):937–947.[6] Cover, T. M. and Thomas, J. A. (1991). Entropy, relative entropy andmutual information. Elements of Information Theory, 2:1–55.[7] Hansen, L. K. (2003). Ica if fmri based on a convolutive mixture model.In Ninth Annual Meeting of the Organization for Human Brain Map-ping,(hbm), New York, June.[8] Hyva¨rinen, A. (1998). New approximations of differential entropy forindependent component analysis and projection pursuit. In Jordan, M. I.,58BibliographyKearns, M. J., and Solla, S. A., editors, Advances in Neural InformationProcessing Systems 10, pages 273–279. MIT Press.[9] Hyva¨rinen, A. (1999). Sparse code shrinkage: Denoising of nongaussiandata by maximum likelihood estimation. Neural computation, 11(7):1739–1768.[10] Hyva¨rinen, A., Karhunen, J., and Oja, E. (2004). Independent Com-ponent Analysis. Adaptive and Cognitive Dynamic Systems: Signal Pro-cessing, Learning, Communications and Control. Wiley.[11] Hyva¨rinen, A. and Oja, E. (2000). Independent component analysis:algorithms and applications. Neural networks, 13(4):411–430.[12] Johnson, R. A. and Wichern, D. W. (2014). Applied multivariate sta-tistical analysis. Pearson Education Limited Essex.[13] Jung, A. (2002). An introduction to a new data analysis tool: Inde-pendent component analysis.[14] Jung, T.-P., Humphries, C., Lee, T.-W., Makeig, S., McKeown, M. J.,Iragui, V., Sejnowski, T. J., et al. (1998). Extended ica removes artifactsfrom electroencephalographic recordings. Advances in neural informationprocessing systems, pages 894–900.[15] Kim, T.-H. and White, H. (2004). On more robust estimation of skew-ness and kurtosis. Finance Research Letters, 1(1):56–73.[16] Kwon, O.-W. and Lee, T.-W. (2004). Phoneme recognition usingica-based feature extraction and transformation. Signal Processing,84(6):1005–1019.[17] Lathauwer, L. D. (1997). Signal Processing by Multilinear Algebra. PhDthesis, Faculty of Engineering, K.U. Leuven. unpublished thesis.[18] Lo¨rincz, A., Po´czos, B., Szirtes, G., and Taka´cs, B. (2002). Ockham’srazor at work: Modeling of the“homunculus”. Brain and Mind, 3(2):187–220.59[19] Makeig, S., Bell, A. J., Jung, T.-P., Sejnowski, T. J., et al. (1996). In-dependent component analysis of electroencephalographic data. Advancesin neural information processing systems, pages 145–151.[20] McSharry, P. E. and Clifford, G. D. (2004). A comparison of nonlin-ear noise reduction and independent component analysis using a realisticdynamical model of the electrocardiogram. In Second International Sym-posium on Fluctuations and Noise, pages 78–88. International Society forOptics and Photonics.[21] Mohammad-Djafari, A. (2000). A bayesian approach to source separa-tion. arXiv preprint math-ph/0008025.[22] Nordhausen, K., Oja, H., and Ollila, E. (2016). Robust independentcomponent analysis based on two scatter matrices. Austrian Journal ofStatistics, 37(1):91–100.[23] Toch, B., Lowe, D., and Saad, D. (2003). Watermarking of audio signalsusing independent component analysis. In Web Delivering of Music, 2003.2003 WEDELMUSIC. Proceedings. Third International Conference on,pages 71–74. IEEE.[24] Wang, J. and Chang, C.-I. (2006). Independent component analysis-based dimensionality reduction with applications in hyperspectral im-age analysis. IEEE transactions on geoscience and remote sensing,44(6):1586–1600.60Appendix ATable of loss values andlowest kurtosis values forvarious initial guessesSeed MinimumKurtosisMSE 2componentsAverage MSE2001 2.16 10.67 3.962002 2.21 11.87 4.412004 2.19 11.78 4.272005 2.29 12.73 4.862006 2.17 10.64 3.952007 4.78 15.01 5.562008 4.59 15.28 5.292009 2.17 10.65 3.962010 2.17 10.70 4.002011 4.51 14.28 5.182012 4.54 14.55 5.062013 2.29 12.76 4.892014 2.16 11.22 4.172015 2.17 10.63 3.962016 2.28 12.49 4.682017 2.25 13.29 4.662018 2.17 10.64 3.962019 4.74 14.89 5.532020 2.16 9.66 3.7761Appendix A. Table of loss values and lowest kurtosis values for various initial guessesSeed MinimumKurtosisMSE 2componentsAverage MSE2021 2.28 11.68 4.082022 2.16 10.71 4.012023 2.20 10.02 4.062024 2.16 9.85 3.812025 2.16 10.78 4.012026 2.22 11.87 4.412027 2.17 10.71 4.022028 4.74 14.89 5.532029 4.74 15.19 5.672030 4.67 14.94 5.27Table A.1: The comparison of the minimum kurtosis, theMSE incurred when retaining only 2 independent componentsout of the original 18 and the average MSE across the possiblenumber of retained components for all 29 seeds.62


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items