UBC Faculty Research and Publications

Expanding the boundaries of local similarity analysis Durno, W E; Hanson, Niels W; Konwar, Kishori M; Hallam, Steven J Jan 21, 2013

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
52383-12864_2013_Article_4618.pdf [ 1.06MB ]
Metadata
JSON: 52383-1.0167811.json
JSON-LD: 52383-1.0167811-ld.json
RDF/XML (Pretty): 52383-1.0167811-rdf.xml
RDF/JSON: 52383-1.0167811-rdf.json
Turtle: 52383-1.0167811-turtle.txt
N-Triples: 52383-1.0167811-rdf-ntriples.txt
Original Record: 52383-1.0167811-source.json
Full Text
52383-1.0167811-fulltext.txt
Citation
52383-1.0167811.ris

Full Text

PROCEEDINGS Open AccessExpanding the boundaries of local similarityanalysisW Evan Durno1, Niels W Hanson2, Kishori M Konwar1, Steven J Hallam1*From The Eleventh Asia Pacific Bioinformatics Conference (APBC 2013)Vancouver, Canada. 21-24 January 2013AbstractBackground: Pairwise comparison of time series data for both local and time-lagged relationships is acomputationally challenging problem relevant to many fields of inquiry. The Local Similarity Analysis (LSA) statisticidentifies the existence of local and lagged relationships, but determining significance through a p-value has beenalgorithmically cumbersome due to an intensive permutation test, shuffling rows and columns and repeatedlycalculating the statistic. Furthermore, this p-value is calculated with the assumption of normality – a statisticalluxury dissociated from most real world datasets.Results: To improve the performance of LSA on big datasets, an asymptotic upper bound on the p-valuecalculation was derived without the assumption of normality. This change in the bound calculation markedlyimproved computational speed from O(pm2n) to O(m2n), where p is the number of permutations in a permutationtest, m is the number of time series, and n is the length of each time series. The bounding process is implementedas a computationally efficient software package, FASTLSA, written in C and optimized for threading on multi-corecomputers, improving its practical computation time. We computationally compare our approach to previousimplementations of LSA, demonstrate broad applicability by analyzing time series data from public health,microbial ecology, and social media, and visualize resulting networks using the Cytoscape software.Conclusions: The FASTLSA software package expands the boundaries of LSA allowing analysis on datasets withmillions of co-varying time series. Mapping metadata onto force-directed graphs derived from FASTLSA allowsinvestigators to view correlated cliques and explore previously unrecognized network relationships. The software isfreely available for download at: http://www.cmde.science.ubc.ca/hallam/fastLSA/.BackgroundThe exponential increase and ubiquitous use of computa-tional technology has given rise to an era of “Big Data” thatpushes the limits of conventional data analysis [1-3]. Tech-niques for analyzing big datasets often proceed by identify-ing patterns of co-occurrence or correlation throughprincipal component analysis (PCA) [4], multidimensionalscaling (MDS) [5], etc. However, many of these methodsrequire significant data reduction or smoothing whichmakes them difficult to interpret [6]. Other methods suchas multiple linear regression or Pearson’s correlationcoefficient (PCC) are easy to interpret as they operate ondata in their native data space, without any kind of largedata transformation or dimensionality reduction, but arelimited in the structure that they can detect.Though PCC is a classic and powerful technique forfinding linear relationships between two variables, it is notdesigned for capturing lead-lag relationships seen in timeseries data. Local similarity analysis (LSA) [6] extends cor-relation calculations to include the time variable, enablingidentification of local correlates. Furthermore, Ruan et al.have presented a graphical network framework in whichto visualize the structure of significant LSA correlations.Unfortunately, the current implementation of LSArequires multiple runs on permuted data and a MonteCarlo statistical method known as a permutation test to* Correspondence: shallam@mail.ubc.ca1Department of Microbiology & Immunology, University of British Columbia,Vancouver, BC, CanadaFull list of author information is available at the end of the articleDurno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3© 2013 Durno et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative CommonsAttribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction inany medium, provided the original work is properly cited.evaluate a null distribution and obtain a p-value determin-ing significance. Each iteration of this procedure has acomputational complexity of O(pm2n), where p is thenumber of permutations, m is the number of covariatetime series, and n is their length. Due to the number ofpair-wise calculations needed, extant LSA is computation-ally onerous when m is large, limiting its use to datasetswhere the number of observed variables at each timepoint is small (< 100). Though there has been someimprovement to its performance [7], assumptions of nor-mality and implementation issues continue to stymie prac-tical application of LSA on big datasets.Here we describe a novel asymptotic upper bound onthe calculation of the LSA statistic’s p-value, resulting inan exponentially converging calculation to bound andcheck for significance of computed LSA statistics with-out a computationally intensive permutation test. Thisbound does not require a rank-vector normal transfor-mation, promoting application to any distribution thathas finite variance. As a result, this implementation ofLSA can navigate big datasets with millions of co-variatetime series. We demonstrate this using time-series data-sets from public health [8], microbial ecology [9], andsocial media [10]. The implemented algorithm, namedFASTLSA, is written in C and optimized for threadingon multi-core computers.Interpreting the LSA statisticLSA concerns itself with pairs of time series data. TheLSA Statistic can be interpreted in a manner similar toPCC when no lag window exists between two time series.However, LSA is also capable of capturing localized cor-relation that is staggered or lagged. A large positive ornegative LSA value indicates a correspondingly strongPCC correlation or a correlation at a time displacementwithin the lag window (Figure 1). Note that if both posi-tive and negative correlations exist between two time ser-ies, LSA will only report the strongest of the two.LSA is advantageous on large datasets containing manytime series. Results can be visualized as a graphical net-work where nodes represent the individual time seriesand the edges represent their LSA correlation statistic.When displayed using a force-directed layout in Cytos-cape [11], closely related time series cluster together,visually isolating clusters of local similarity. Metadatarelated to experimental or environmental conditions canthen be applied to the nodes, shedding insight into hier-archical network structure.Figure 1 A lagged correlation between two time series. An example of two set time series that contain a lead-lag correlation.Durno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3Page 2 of 14ImplementationDescription of the LSA algorithmIn this section we reproduce the algorithm from [6] tocompute LSA statistics and their corresponding p-valuesbetween pairs of time series in a dataset. We assume asinput a set of time series vectors of equal length. Let usdenote the number of time series by m and their length asn. Let us denote the time series dataset as X whereXij denotes the jth element of the ith time series, with i =1, 2, ..., m and j = 1, 2, ..., n, and assume that the Xij arereal numbers. We also assume that there are no missingvalues in the dataset X, and realize that practical use willrequire interpolation or filtering.In Figure 2 we present the algorithm for computingthe LSA statistic for a pair of time series, X = {Xi}n1 andY = {Yi}n1 , where the length of the time series isassumed to be equally spaced in time. We have modifiedthe presentation of the LSA algorithm by [6] to high-light our analysis and derivation of a bound on the taildistribution of the LSA statistic. Specifically we calculatethe LSA statistic for a pair of time series, X and Y.Two-dimensional arrays Pi, j and Ni, j are used to storethe positive and negative partial sums (truncated if lessthan 0) of the pairwise product of time series values.We also assume a suitable bound on the maximum timelag considered while computing the LSA statistic,denoted by D.The algorithm first initializes the arrays Pj,0, Nj, 0, P0, i,and N0,i for all i, j = 1, ..., n, with a maximum absolute dif-ference of D. Next it considers the time series pairs foreach possible lag, up to a maximum of D, and then com-putes the progressive sum of the pair-wise products of thetime-series values from the low to high index of the arrays.During the computation, the progression of the partial sumis reset to 0 if the sum is below 0. After partial sums havebeen computed, the values of N̂ and P̂ are calculated bytaking the maximum of the corresponding values of thearrays N and P. Finally, the LSA statistic is estimated assign(̂P − N̂) max {̂P, N̂}n.Calculating the upper boundIn this section we derive the asymptotic upper bound onthe p-value for the cumulative probability distribution ofthe LSA statistic without the need of a normality assump-tion. Our derivation is based on distributional results ofthe maximum cumulative sum of independent randomvariables known in the literature from probability theory[12-15]. We begin by stating our assumptions about thedataset, isolate target calculations from the LSA algo-rithm, and from our referenced mathematical results,derive and prove important lemmas. These lemmas willserve as the building blocks as we logically construct atheorem which will form the basis of our LSA p-valueupper bound.We begin by making certain assumptions about theprobability model used to derive the bounds. First, eachPi, j or Ni, j is considered individually. We assume thatthe time series values Xi, Yj for i, j = 1, ..., n are indepen-dent of one another. This assumption can be made whenFigure 2 The LSA algorithm . Algorithm for computing the LSA for a pair of time series X and Y. D denotes the set{(i, j): i, j ∈ N+, either i = 0 or j = 0 and ∣∣i − j∣∣≤ D} and N+ denotes the set of positive integers.Durno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3Page 3 of 14weak dependence exists because it is near the truth andeffective, much like the Naive Bayes assumption. Thisassumption is also enabling, as it allows us to invoke thedistributions of partial sums of independent random vari-ables and continue in a mathematically straightforwardway. Further, we assume independence between eachtime time series as a null hypothesis, and as it is subjectto rejection upon obtaining a statistically significant LSAvalue.Consider lines 5 and 7 of the LSA algorithm (Figure 2),Pi+k+1,j+k+1 ¬ max{0, Pi+k, j+k + Xi+k * Yj+k} andP̂ ← max{(i,j):|i−j|≤D}{Pi,j}. For any pair of i and j let usdefine the sequence random variables as Zk = Xi+kYj+k fork = 0, ..., min{n - i, n - j} - 1, and the sequence of randomvariables ζk = Z1 + ... + Zk for k = 0, ..., min{n - i, n - j} - 1supposing ζ0 = 0. Using the above ζk’s, we define randomvariables η∗k as η∗k = max {ζ1, ζ2, · · · , ζk} for the samevalues of k = 0, ..., min{n - i, n - j} - 1.We also define the set of random variables h1, h2, ..., hkby the recurrence formula hk+1 = max{0, hk + Zk+1}. Notethat the random variables Pi+k, j+k and the hk have thesame distribution. It is shown in [12,13] that the randomvariables η∗k and hk also have the same distribution. As aresult, now we can analyze the cumulative distribution ofPi+k, j+k as a distribution for η∗k , and use the results byNevzorov and Petrov [14] on Pi+k, j+k to derive tail prob-ability bounds. We also assume that the random variablesZk have the first two moments, although such assumptionsare not required for the results of [14], we use them toderive simpler bounds.We now consider a few useful lemmas that we will useto construct our p-value upper bound. The first step is tosimplify the tail event (which we will later connect to p-value) into simpler terms. The following lemma expressesthe tail event for LSA {|LSA| >x} and any x ∈ R in termsof the tail events of {Pi, j >x} and {Ni, j >x}, the positiveand negative LSA calculations for the same x, the boundon our test statistic (the target p-value).Lemma 1 For any x ∈ Rwe have {|LSA| >x} = {(∪ij{Pij>xn}) ∪ (∪ij{Nij >xn})}.Proof. The result is clear from the following:{|LSA| > x} = {max {̂P, N̂} > xn} = {̂P ≤ xn ∩ N̂≤ xn}c = {maxij {Pij} ≤ xn ∩ maxij {Nij} ≤ xn}c ={(∩ij {Pij ≤ xn}) ∩ (∩ij {Nij ≤ xn})}c = {(∪ij {Pij > xn}) ∪ (∪ij {Nij > xn})} □In the LSA algorithm, we have maximums Pij = max{0, Pi-1, j-1 + Xi-1Yj-1} and Nij = max{0, Pi-1,j-1 - Xi-1, j-1},which complicates their theoretical analysis. Fortunately,equivalence have been demonstrated in the literature[12], and we restate these in the following lemma forclarity: the similarity of the distributions of hk and η∗k ,for k = 1, ..., min{n - i, n - j} - 1. This will help us derivethe bounds for the events {Pij >xn} and {Nij >xn}, thesimpler terms we derived in the previous lemma.Lemma 2 Let Zi be mutually independent randomvariables and let us denote by Sk =∑ki=1 Ziwhere S0 = 0,and qk+1 = max{0, qk + Zk} with q0 = 0, then P(qk ≤ x) =P (max{S0, ..., Sk-1} ≤ x) for x ∈ R .In order to get a simple formula for the bound on thecumulative tail probabilities for Pi, j and Ni, j we reproducebelow the results on partial sums of random variables dueto Nevzorov and Petrov [14]. For our sequence of indepen-dent and identically distributed (iid) random variablesunder consideration {Xn} it follows that Lindeberg’s condi-tion holds [15]. A property showing the variance of a dis-tribution stabilizes as more variables are added, pinningthe tails of it down. Thinking about this in terms of timeseries, as a series gets larger, the upper bound of the distri-bution becomes more defined and calculable.Now to build theorems upon which we will derive aformulaic p-value bound.Theorem 3 If the random variables {Xn} havezero expectation and finite variances and if Linde-berg’s condition holds: Λn(ε) ® 0 as n ® ∞ ∀ε > 0where n (ε) =1q2n∑nk=1 ∫{|x|>εqn}x2dVk (x) and q2n =∑nk=1 E(X2k)and G (x) =√2π∫x0 e−t2/2dt i f x ≥ 0 and 0 i f x < 0 ,then we have supx|P(Sn < qnx)− G (x) | → 0 whereSn = max1≤k≤n∑kj=1 Xj and Vk(x) = P(Xk ≤ x)In order to apply the above theorem to get a simpleformulaic approximation, we assume some randomvariables {Zi}m1 , each with the variance s2 andSk =∑ki=1 Zi . Then by applying the above theorem, weget the following uniform convergence of distributionto that of the one-sided standard normal assupx|P(maxk∈{1,...,m}Sk ≤√mσx)− G (x) | → 0 as m ® ∞.Now we use the above results to get the probability esti-mates for our simple event terms {Pij >xn} and {Nij >xn}.The following theorem provides us with the p-value’s tailbound for LSA for any x ∈ R .Theorem 4 For G, the one-sided normal distribution,defined above P (|LSA| > x) ≤ 2 (n2 − (n − D − 1) (n − D)) (1 − G(x√n/Vαr (X1Y1))) .Proof. By applying Lemma 2 we haveP(Pij > xn)= P(max{0, Pi−1,j−1 + Xi−1Yj−1}> xn)= 1 − P (max {0, Pi−1,j−1 + Xi−1Yj−1} ≤ xn)= 1 − P(max1≤k≤min{i−1,j−1}{k∑l=0XlYl}≤ xn),and by Theorem 3, replacing x with y, we havesupy|P(maxk∈{1,...,m}Sk ≤√mσ y)− G (y) | → 0 as m → ∞.Durno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3Page 4 of 14Notice that∑kl=0XlYl satisfies the definition of Sk, soreplacing Sk,√m , and s with∑kl=0XlYl , mini-1,j-1, and√Var (X1Y1) , respectively,supy∣∣∣∣∣P(maxk∈{1,...,√min{i−1,j−1}}k∑l=0XlYl ≤√min{i − 1, j − 1}Var (X1Y1)y)− G (y)∣∣∣∣∣ → 0,as n ® ∞, and by change of variables to get our equa-tion into the appropriate formxn =√min{i − 1, j − 1}Var (X1Y1)y ⇒ y = xn/√min {i − 1, j − 1}Var (X1Y1)⇒ supx∣∣∣∣∣P(maxkε{1,...,√min{i−1,j−1}}k∑l=0XlYl ≤ xn)− G(xn/√min{i − 1, j − 1}Var (X1Y1))∣∣∣∣∣ → 0,as n ® ∞, thusP(Pij > xn) ∼= 1 − G(xn/√min {i − 1, j − 1}Var (X1Y1)) .It follows from Boole’s inequality and Lemma 1 thatP (|LSA| > x) = P ((∪ij {Pij > xn}) ∪ (∪ij {Nij > xn}))≤∑ijP(Pij > xn)+∑ijP(Nij > xn)= 2∑ij(1 − G(xn/√min{i − 1, j − 1}Var (X1Y1))) .Finally, we have the following tail probability boundP (|LSA| > x) ≤ 2∑ij(1 − G(xn/√min{i − 1, j − 1}Var (X1Y1)))≤ 2∑ij(1 − G(xn/√nVar (X1Y1)))= 2(n2 − (n − D − 1) (n − D)) (1 − G(x√n/Var (X1Y1))) ,standardizing with a mean of zero and a variance ofone= 2(n2 − (n − D − 1) (n − D)) (1 − G (x√n)) .□Note that this last result is asymptotic. Thus, n must besubstantially large for this p-value bound to be relevant(Figure 3 and Table 1). Similar to the normal distributionas an approximation to Student’s t-distribution, thisimplementation of LSA requires at least 30 time pointsto promote confidence. Though this convergence canvary from dataset to dataset, the bound is conservative,and will not easily produce false positives if run onshorter time series.ResultsTo validate versatility and effectiveness of the derivedupper bound (Theorem 4), we applied the algorithm tofour datasets, two sourced from biology, one from socialnetworking, and a randomly generated control dataset.These include the Moving Pictures of the Human Micro-biome [8] (MPH), the largest human microbial time ser-ies to date, a microarray hybridization dataset identifyingcell cycle regulated genes in the yeast Saccharomyces cer-evisiae [9] (CDC), and an online social media dataset ofthe volumes of the top 1000 Memetracker phrases andtop 1000 twitter hash tags over an eight month periodfrom September 2008 to August 2009 [10]. Missing datavalues were interpolated by averaging the two nearesttemporal data points, and all analysis was performed on aMac Pro desktop computer running Mac OSX 10.6.8with a 2 × 2.4 Ghz Quad-Core Intel Xeon processors and16 GB of 1066 Mhz DDR3 RAM.Computational complexityThe algorithm calculates in O(m2n) time, where m is thenumber of time series and n is the length of each timeseries. To get an idea of how long calculations take, wefixed n = 50, d = 3 and plotted log-calculation timeagainst log-m (Figure 4). It can be seen that LSA testswith p-values calculated by the permutation test areabout 10,000 times slower than calculating p-values for-mulaically. Compared to direct formulaic calculation,random number generation is slow, making a repetitionof 10,000 permutations for each time series pair a com-putationally intense operation (Table 2). The permuta-tion test may be able to calculate statistically significant(a = 0.001) pairs confidently, but applying a multiple testcorrection (Bonferroni) will require exponentially morepermutations to reach the same level of confidence forthe entire dataset. Pairwise comparisons for big datasetsare computationally infeasible to sufficiently estimate p-values with enough accuracy to protect against false posi-tives. In contrast, FASTLSA directly calculates a conser-vative upper bound approximating the p-value, makingpermutation unnecessary and protecting against falsepositives.Moving pictures of the human microbiome (MPH)The MPH time series dataset [8] investigates temporalvariations in the microbial community structure of twohealthy human subjects, one male, one female. Sampleswere collected from three body parts, the gut (feces),mouth, and skin (left and right palms) daily for 15months (male) and six months (female) with taxonomybeing determined by the amplified V4 region of the smallsubunit ribosomal RNA (SSU or 16S rRNA) gene. Themale and female samples were concatenated togetherresulting in a profile of 14105 taxa for 390 time pointswith missing values being interpolated by the average ofthe two nearest time points.For a given time series, if more than 25% of time stepswere zero it was removed from the analysis. Analysis took58 minutes (7.5 minutes on 16 threads) without includingoutput writing time which is variable. Significant (a <0.001) LSA results revealed clusters of local similarity thatcorresponded well when nodes were colored by sampleDurno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3Page 5 of 14source (Figure 5). The low level of mixing between localclusters reflects the large differences in taxonomic profilesacross the different body environments [8].Microarray hybridization detection cell cycle-regulatedgenes in yeast Saccharomyces cerevisae (CDC)In the CDC data set [9], we focused on the cdc15 tem-perature sensitive mutant and the profile of 6178 genesover 24 time steps, representing gene expression forapproximately three cell cycles. Analysis took 3.25 min-utes (22 seconds on 16 threads) without including out-put writing time (Figure 6). Applying the asymptoticbound with the small number of time steps resulted insome rather large bounds (≥ 1).However, LSA was capable of detecting lead-lag corre-lation despite the periodicity of the data, demonstratingits capacity to find long correlate pairs with a largenumber of covariate time series. Only 800 of the 6178gene nodes could be classified from [9] to one of thefive defined cell cycle phases (G1, G2/M, S, S/G2,M/G1) so only two clusters could be inferred upon withany confidence (Figure 7).Social media: top 1000 Twitter and Memetracker phrases(Twitter)The data from [10] contains the volume of the top 1000Twitter and Memetracker phrase counts over 130 timesteps from September 2008 to August 2009, a spacing ofapproximately 2-3 days per observation. Analysis tookapproximately six seconds (one second on 16 threads)without including write out time. Two major clusters ofrelated times series nodes emerged. However, attempts tolabel the series using existing metadata of general contentor time granularity (day of the week, working hours, sea-sonality, etc.) did not elucidate its structure (Figure 8). Weconjecture that this difference is geographical (East-WestNorth America) or socially structured, however, additionalmetadata on geolocation or social connectivity associationsof the nodes would be needed to better elucidate networkstructure.Figure 3 Asymptotic p-value upper bounds converge on the LSA density. Notice that the p-value upper bound (red) converges in the tailto the approximate LSA density (blue), an attractive quality. As the number of time steps (n) increase, both the density and the p-value upperbound push up against zero. This is similar to the asymptotic behaviour of PCC.Durno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3Page 6 of 14Table 1 Empirical p-value (Emp) & the FASTLSA p-value bound (Fas) with n = 30, 50, & 100 time steps.x1 n30Emp n30Fas n50Emp n50Fas n100Emp n100Fas0.05 1 1.000 1 1.000 1 1.0000.07 1 1.000 1 1.000 0.997 1.0000.09 1 1.000 0.999 1.000 0.953 1.0000.11 0.999 1.000 0.984 1.000 0.819 1.0000.13 0.989 1.000 0.928 1.000 0.627 1.0000.15 0.958 1.000 0.823 1.000 0.441 1.0000.17 0.896 1.000 0.687 1.000 0.292 1.0000.19 0.803 1.000 0.545 1.000 0.184 1.0000.21 0.694 1.000 0.417 1.000 0.111 1.0000.23 0.58 1.000 0.309 1.000 0.064 1.0000.25 0.472 1.000 0.224 1.000 0.036 1.0000.27 0.376 1.000 0.158 1.000 0.019 0.6930.29 0.294 1.000 0.109 1.000 0.009 0.3730.31 0.227 1.000 0.073 1.000 0.005 0.1940.33 0.172 1.000 0.048 0.981 0.002 0.0970.35 0.128 1.000 0.031 0.666 0.001 0.0470.37 0.094 1.000 0.019 0.444 < 0.001 0.0220.39 0.067 0.98 0.012 0.291 < 0.001 0.010.41 0.048 0.742 0.007 0.187 < 0.001 0.0040.43 0.033 0.555 0.004 0.118 < 0.001 0.0020.45 0.023 0.411 0.002 0.073 < 0.001 0.0010.47 0.015 0.301 0.001 0.044 < 0.001 < 0.0010.49 0.01 0.218 0.001 0.027 < 0.001 < 0.0010.51 0.006 0.156 < 0.001 0.016 < 0.001 < 0.0010.53 0.004 0.111 < 0.001 0.009 < 0.001 < 0.0010.55 0.002 0.078 < 0.001 0.005 < 0.001 < 0.0010.57 0.001 0.054 < 0.001 0.003 < 0.001 < 0.0010.59 0.001 0.037 < 0.001 0.002 < 0.001 < 0.0010.61 < 0.001 0.025 < 0.001 0.001 < 0.001 < 0.0010.63 < 0.001 0.017 < 0.001 < 0.001 < 0.001 < 0.0010.65 < 0.001 0.011 < 0.001 < 0.001 < 0.001 < 0.0010.67 < 0.001 0.007 < 0.001 < 0.001 < 0.001 < 0.0010.69 < 0.001 0.005 < 0.001 < 0.001 < 0.001 < 0.0010.71 < 0.001 0.003 < 0.001 < 0.001 < 0.001 < 0.0010.73 < 0.001 0.002 < 0.001 < 0.001 < 0.001 < 0.0010.75 < 0.001 0.001 < 0.001 < 0.001 < 0.001 < 0.0010.77 < 0.001 0.001 < 0.001 < 0.001 < 0.001 < 0.0010.79 < 0.001 < 0.001 < 0.001 < 0.001 < 0.001 < 0.0010.81 < 0.001 < 0.001 < 0.001 < 0.001 < 0.001 < 0.0010.83 < 0.001 < 0.001 < 0.001 < 0.001 < 0.001 < 0.0010.85 < 0.001 < 0.001 < 0.001 < 0.001 < 0.001 < 0.0010.87 < 0.001 < 0.001 < 0.001 < 0.001 < 0.001 < 0.0010.89 < 0.001 < 0.001 < 0.001 < 0.001 < 0.001 < 0.0010.91 < 0.001 < 0.001 < 0.001 < 0.001 < 0.001 < 0.0010.93 < 0.001 < 0.001 < 0.001 < 0.001 < 0.001 < 0.0010.95 < 0.001 < 0.001 < 0.001 < 0.001 < 0.001 < 0.0010.97 < 0.001 < 0.001 < 0.001 < 0.001 < 0.001 < 0.0010.99 < 0.001 < 0.001 < 0.001 < 0.001 < 0.001 0.001Durno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3Page 7 of 14Null hypothesis simulated dataFinally, to identify throughput limits of FASTLSA and tosimulate a large iid dataset without time dependence,three data matrices were randomly generated: (1) one hun-dred thousand measurements across 100 time steps, (2)one million measurements across 30 time steps, and (3)one million measurements across 100 time steps. Datawere generated by random sampling from a uniform dis-tribution. Running FASTLSA on 16 threads, the first data-set (100, 000 × 100) took one hour 54 minutes, the second(1, 000, 000 × 30) 2 days and 3 hours, and the third (1,000, 000 × 100) had an ETA of 7 days and 23 hours with-out including writeout time. The asymptotic bound isconservative for shorter datasets (n ≤ 30) (Figure 3, Table 1),the second data having 30 time points found zero falsepositives, despite having a Bonferroni correction ofα/(n2)= 10−13 . This is likely because the software’sp-value is an upper bound to the real p-value, and so is theBonferroni correction. An inspection of a uniform randomgraph (a = 0.05, |LSA| < 0.4) of 1,000 random time serieswith 100 time steps did not generate any cliques, but onlyshort (4-8) length chains of nodes, serving as a warning tothose wanting to interpret relevant structure (Figure 9).Given appropriate thresholds on LSA values, cliques donot seem to occur randomly.Figure 4 LSA calculation time as a function of the number of time series. On this log-log plot notice that because of its lack of apermutation test FASTLSA (red) is consistently faster than the older implementation of LSA (black) [6].Table 2 Empirical running time for LSA calculation for data sets of different sizeTime series Time points fastLSA (single thread) fastLSA (16 threads)Twitter 1,000 130 6 sec 1 secCDC 6,178 24 3.24 min 2.2 secMPH 14,105 390 58 min 7.5 minFirst Null 100,000 100 - 54 minSecond Null 1,000,000 30 - 2 days 3 hrsThird Null 1,000,000 100 - 7 days 23 hrsDurno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3Page 8 of 14Figure 5 MPH local similarity graph. A local similarity graph of the MPH dataset showing significant LSA values as defined by the asymptoticupper bound (a = 0.001). Local clusters defined by LSA were revealed and the mapping of samples sources (feces, mouth, and skin) to thenodes revealed underlying network structure.Durno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3Page 9 of 14DiscussionFASTLSA uses a novel asymptotic upper bound algorithmfor calculating the LSA p-value. This is done without anynormality assumption, extending implementation tountransformed data and data in violation of normalityassumptions such as time series containing many zeroentries. Moreover, FASTLSA replaces a computationallyintensive permutation test that was previously required tocalculate significance of LSA statistics with a dramaticincrease on the size of datasets that can be analyzed on asingle desktop machine. However, like all asymptoticbounds, a significant number of observations need to beobtained for their application. From theoretical simulation,we estimate this to be greater than 30 time points for mostdatasets. This is supported by our experience on the CDCand MPH datasets having 24 and 390 time series, respec-tively. Despite this potential operating constraint, FAS-TLSA expands the boundaries of LSA allowing time seriesanalysis on datasets with millions of co-variate time series.The algorithm is implemented as a computationally effi-cient software package, FASTLSA, written in C andoptimized for threading on multi-core computers usingPOSIX threads. Finally, we demonstrated the utility andversatility of FASTLSA using real-world and simulatedtime series datasets from different fields of inquiry, visua-lizing the resulting clusters of local similarity using theCytoscape software.LSA statistics have been demonstrated to capture rele-vant local similarity structure for a number of biologicaldatasets [16,17]. However, previous implementations werelimited to relatively small datasets. FASTLSA improvesthe computational efficiency and statistical robustness ofLSA, a necessary step in using the statistic to explore nextgeneration time series datasets. Despite the currentimprovements, the structure captured by LSA is less thanideal and could be further improved. Given two vectors oftime series, LSA reports the strongest statistic. However, itis unclear where this significant time window occurs, or ifthere are multiple small windows with large LSA valuesthat are not reported. An inspection of time series tracesin question is often required to visually check exactly howthe two are similar. Another hazard is that LSA does notFigure 6 Comparison of LSA values: fastLSA and Original LSA. A comparison of calculated LSA statistics between FASTLSA and Original LSAimplemented by [6] and calculated on the CDC dataset. There is an almost one-to-one correspondence between calculated values. The one outlyingvalue was likely due to the transform that the original LSA applies, causing a disagreement between positive and negative values. For a single valuefastLSA picked the negative value (-0.4) and original LSA picked positive (0.4).Durno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3Page 10 of 14handle missing data effectively, and so a continuous ver-sion of the statistic would be desirable for exploratoryexperiments where sampling conditions could change tosmall degrees and analysis could be performed withoutimputation. Furthermore, LSA is asymmetric in nature,meaning that time reversal has the potential to producediffering LSA values. We anticipate even better perfor-mance from the statistic if these issues were addressed,perhaps through a modified version of PCC that isolatesoptimal windows of similarity.Figure 7 CDC local similarity graph. A local similarity graph of the CDC dataset showing significant LSA values as defined by the upper boundcutoff and the additional constraint of absolute LSA values greater than 0.85 (a = 0.001, |LSA| ≥ 0.85). Clusters of local structure are observedwith some example correlates of expression shown in graphs a, b, and c, indicated by solid ellipses. However, because only 800 of theapproximately 6,000 genes could be classified to a cell cycle position (G1, G2/M, S, S/G2, M/G1) we could only guess at two clusters’ functionalcharacteristics. Cluster x has many G1 genes that, according to the Saccharomyces Genome Database http://www.yeastgenome.org [18], havesome functionality relative to helicases and telomeres. The gene, YKR077W (*), accelerates the cell cycle initiation stage (G1) when abundant [19].Cluster y is a set of genes encoding histone proteins [19]. Histone development is an essential part of genome replication [20] adequatelydescribing all of cluster y as S phase genes.Durno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3Page 11 of 14ConclusionsLSA is a local similarity statistic that has recently beenused to capture relevant local structure in time series data-sets, particularly within the biological community. How-ever, its use has been limited to smaller datasets due to anintensive permutation test used to calculate significance.Our derivation and direct calculation of an asymptoticupper bound using FASTLSA replaces this onerous calcu-lation without a normality assumption, enabling LSA ontime series datasets containing millions of co-variate timeFigure 8 Twitter local similarity graph. A local similarity graph of the Twitter dataset showing significant LSA values with an additionalthreshold absolute LSA values greater than 0.98 (a = 0.001,|LSA ≥ 0.98). Two primary clusters of local similarity were found, however, none ofthe attempted metadata mappings could classify the clusters by time (hour of day, day of week, season, etc.) or general message content(political, personal, media, etc.).Durno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3Page 12 of 14series. We demonstrate the utility and versatility of FAS-TLSA by analyzing time series data from public health,microbial ecology, and social media and compare theseresults to the previous implementation of LSA, obtainingsimilar results with orders of magnitude increase inthroughput.Project name: fastLSAProject home page: http://www.cmde.science.ubc.ca/hallam/fastLSA/Operating system(s): OS X, Linux, or WindowsProgramming Languages: C /C++Other requirements: 1 GB RAMLicense: GPLv3Non-academic restrictions: NoneFigure 9 Uniform random local similarity graph. A local similarity graph representing purposeful false positives, 1000 time series with 100time steps randomly generated from a uniform distribution. Notice how no cliques form in the random data generated from a uniformdistribution.Durno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3Page 13 of 14List of abbreviationsLSA: Local Similarity Analysis; PCC: Pearson’s Correlation Coefficient; PCA:Principal Component Analysis; MDS: Multidimensional Scaling; DFA:Discriminant Fraction Analysis; MPH: Moving Pictures of the HumanMicrobiome; CDC: Centre of Disease Control.AcknowledgementsWe would like to acknowledge Dr.Fengzhu Sun and Dr.Jed Fuhrman at theUniversity of Southern California for their support.Author details1Department of Microbiology & Immunology, University of British Columbia,Vancouver, BC, Canada. 2Graduate Program in Bioinformatics, University ofBritish Columbia, Vancouver, BC, Canada.Authors’ contributionsWED derived the p-value upper bound and was the primary programmer ofthe LSA statistics. NH assisted in the derivation of the upper bound, theanalysis of the MPH, CDC, and Twitter datasets, and was the primarymanuscript writer and editor. KK validated upper bound result and assistedin the multi-threaded implementation of the software. SH oversaw theresearch and managed the group.DeclarationsThe publication costs for this article were funded by Genome BritishColumbia and Genome Canada.This article has been published as part of BMC Genomics Volume 14Supplement 1, 2013: Selected articles from the Eleventh Asia PacificBioinformatics Conference (APBC 2013): Genomics. The full contents of thesupplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S1.Competing interestsThe authors declare that they have no competing interests.Published: 21 January 2013References1. Lynch C: Big data: How do your data grow? Nature 2008, 455(7209):28-29.2. Bell G, Hey T, Szalay A: Computer science. Beyond the data deluge.Science 2009, 323(5919):1297-1298.3. Schadt EE, Linderman MD, Sorenson J, Lee L, Nolan GP: Computationalsolutions to large-scale data management and analysis. Nature ReviewsGenetics 2010, 11(9):647-657.4. Ranjard L, Poly F, Lata JC, Mougel C, Thioulouse J, Nazaret S:Characterization of bacterial and fungal soil communities by automatedribosomal intergenic spacer analysis fingerprints: biological andmethodological variability. Applied and Environmental Microbiology 2001,67(10):4479-4487.5. Mooy BASV, Devol AH, Keil RG: Relationship between bacterialcommunity structure, light, and carbon cycling in the eastern subarcticNorth Pacific. Limnology and Oceanography 2004, 1056-1062.6. Ruan Q, Dutta D, Schwalbach MS, Steele JA, Fuhrman JA, Sun F: Localsimilarity analysis reveals unique associations among marinebacterioplankton species and environmental factors. Bioinformatics 2006,22(20):2532-2538.7. Xia LC, Steele JA, Cram JA, Cardon ZG, Simmons SL, Vallino JJ, Fuhrman JA,Sun F: Extended local similarity analysis (eLSA) of microbial communityand other time series data with replicates. BMC Syst Biol 2011, 5(Suppl 2):S15.8. Caporaso JG, Lauber CL, Costello EK, Berg-Lyons D, Gonzalez A,Stombaugh J, Knights D, Gajer P, Ravel J, Fierer N, Gordon JI, Knight R:Moving pictures of the human microbiome. Genome Biol 2011, 12:R50.9. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO,Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarrayhybridization. Molecular Biology of the Cell 1998, 9(12):3273-3297.10. Yang J, Leskovec J: Patterns of temporal variation in online media.Proceedings of the Fourth ACM International Conference on Web Search andData Mining 2011, 177-186.11. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N,Schwikowski B, Ideker T: Cytoscape: a software environment forintegrated models of biomolecular interaction networks. GenomeResearch 2003, 13(11):2498-2504.12. Takacs L: On the distribution of the maximum of sums of mutuallyindependent and identically distributed random variables. Advances inApplied Probability 1970, 2:344-354.13. Wald A: On the distribution of the maximum of successive cumulativesum of independent but not identically distributed chance variables.Bulletin of the American Mathematical Society 1948, 54:422-430.14. Nevzorov VB, Petrov VV: On the distribution of the maximum cumulativesum of independent random variables. Theory of Probability and itsApplications 1969, 14(4):682-687.15. Lindeberg J: Eine neue Herleitung des Exponentialgesetzes in derWahrscheinlichkeitsrechnung. Mathematische Zeitschrift 1922, 15:211-225.16. Fuhrman JA, Steele JA: Community structure of marine bacterioplankton:patterns, networks, and relationships to function. Aquatic MicrobialEcology 2008, 53:69-81.17. Steele JA, Countway PD, Xia L, Vigil PD, Beman JM, Kim DY, Chow CET,Sachdeva R, Jones AC, Schwalbach MS, Rose JM, Hewson I, Patel A, Sun F,Caron DA, Fuhrman JA: Marine bacterial, archaeal and protistanassociation networks reveal ecological linkages. The ISME Journal 2011,5(9):1414-1425.18. Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET,Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hirschman JE,Hitz BC, Karra K, Krieger CJ, Miyasato SR, Nash RS, Park J, Skrzypek MS,Simison M, Weng S, Wong ED: Saccharomyces Genome Database: thegenomics resource of budding yeast. Nucleic Acids Res 2012, 40:D700-D705.19. Ashe M, deBruin RA, Kalashnikova T, McDonald WJ, Yates JR III,Wittenberg C: The SBF- and MBF-associated protein Msa1 is required forproper timing of G1-specific transcription in Saccharomyces cerevisae.Journal of Biological Chemistry 2007, 283:6040-6049.20. Ewen ME: Where the cell cycle and histones meet. Genes Dev 2000,14:2265-2270.doi:10.1186/1471-2164-14-S1-S3Cite this article as: Durno et al.: Expanding the boundaries of localsimilarity analysis. BMC Genomics 2013 14(Suppl 1):S3.Submit your next manuscript to BioMed Centraland take full advantage of: • Convenient online submission• Thorough peer review• No space constraints or color figure charges• Immediate publication on acceptance• Inclusion in PubMed, CAS, Scopus and Google Scholar• Research which is freely available for redistributionSubmit your manuscript at www.biomedcentral.com/submitDurno et al. BMC Genomics 2013, 14(Suppl 1):S3http://www.biomedcentral.com/1471-2164/14/S1/S3Page 14 of 14

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.52383.1-0167811/manifest

Comment

Related Items