Statistical Solutions For and From Signal Processing

by

Luke Bornn

B.Sc., University of the Fraser Valley, 2003

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Statistics)

The University of British Columbia (Vancouver)

August, 2008

© Luke Bornn 2008

Abstract

With the wide range of fields engaging in signal processing research, many methods do not receive adequate dissemination across disciplines due to differences in jargon, notation, and level of rigor. In this thesis, I attempt to bridge this gap by applying two statistical techniques originating in signal processing to fields for which they were not originally intended. Firstly, I employ particle filters, a tool used for state estimation in the physics signal processing world, for the task of prior sensitivity analysis and cross-validation in Bayesian statistics. Secondly, I demonstrate the application of support vector forecasters, a tool used for forecasting in the machine learning signal processing world, to the field of structural health monitoring.

Table of Contents

Abstract
Table of Contents
List of Figures
Acknowledgements
Statement of Co-Authorship
1 Introduction
  1.1 Particle Filtering
  1.2 Support Vector Forecasters
  1.3 References
2 An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation
  2.1 Introduction and Motivation
  2.2 Sequential Monte Carlo Algorithms
    2.2.1 An Efficient SMC Algorithm
  2.3 Prior Sensitivity
    2.3.1 Regularization Path Plots
    2.3.2 Variable Selection Using g-Priors
  2.4 Cross-Validation
    2.4.1 Application to Bayesian Penalized Regression
  2.5 Extensions and Conclusions
  2.6 References
3 Structural Health Monitoring with Autoregressive Support Vector Machines
  3.1 Introduction
  3.2 SVM-based SHM
    3.2.1 Example: Simulated Damage
  3.3 Joint Online SHM
    3.3.1 Example: Experimental Data
  3.4 Conclusion
  3.5 References
4 Conclusion
5 Technical Appendix

List of Figures

2.1 Regularization Path Plots: The gold standard
2.2 Regularization Path Plots: Plots using MCMC and SMC for fixed computational time of 5 minutes
2.3 Exact Marginal and Model probabilities for variable selection using g-Priors as a function of log(g)
2.4 Approximate Marginal and Model probabilities for variable selection using g-Priors as a function of log(g)
2.5 Diagram of cross-validation process
2.6 Plots of cross-validation error as a function of log(λ)
3.1 Illustration of linear support vector regression fit
3.2 Illustration of mapping to an alternate space to induce linearity
3.3 Raw simulated data with highlighted artificial damage
3.4 Autocorrelation plot of simulated data
3.5 SVM (top) and linear AR models (bottom) fit to subset of data
3.6 Residuals from SVM (top) and linear AR models (bottom) applied to simulated data
3.7 Diagram of experimental structure
3.8 Q-Q plot of residuals from SVM model
3.9 Residuals from 4 sensors for t = 7000, ..., 9384
3.10 Density estimate of combined residual (black) vs. chi-squared distribution (red)
3.11 Combined residuals from all 4 sensors

Acknowledgements

I am indebted to my mentors, Dr. Arnaud Doucet and Dr. Raphael Gottardo, for the wealth of opportunities and immense support they have provided me. I am also grateful to the many people I have had a chance to interact with at both the University of British Columbia and Los Alamos National Labs.
In particular I thank Dr. Dave Higdon and Dr. Todd Graves for guiding me down a fruitful path.

Statement of Co-Authorship

Excluding chapters 2 and 3, all material in this thesis was solely authored. While I identified and carried out the research in chapter 2, Arnaud Doucet and Raphael Gottardo provided much guidance, criticism, and feedback.

The data in chapter 3 was produced by Gyuhae Park and Kevin Farinholt. Chuck Farrar assisted in writing some of the details pertaining to the structural health monitoring field. In addition, his guidance and feedback contributed immensely to the development of the work.

Chapter 1

Introduction

The research area of signal processing is concerned with analyzing signals including sound, video, and radar. There are many components to this task, including storage and compression, removing noise, and extracting features of interest. As an example, we might have a noisy recording of a telephone conversation for which we want to store the signal, remove the noise, and identify the speakers. These signals can take many forms, either digital or analog. We focus on statistical signal processing, which is concerned with studying signals based on their statistical properties. We begin by describing two statistical methods employed for signal processing, the first being particle filtering, and the latter being support vector forecasters. Because chapter 3 contains a detailed development of support vector forecasters, we forego these details here. Later chapters then extend these methodologies to fields for which they were not intended, specifically prior sensitivity analysis and cross-validation as well as structural health monitoring.

1.1 Particle Filtering

One of the crucial areas of study in signal processing is filtering, which is concerned with estimating a dynamic system's true state from a series of noisy measurements. Specifically, we assume that the system dynamics are known up to some parameter(s). The underlying state-space model may be written as

x_t \mid x_{t-1} \sim p_{x,t}(x_t \mid x_{t-1}), \qquad y_t \mid x_t \sim p_{y,t}(y_t \mid x_t)

where x_t and y_t denote the unobserved state and observation at time t, respectively; p_{x,t} and p_{y,t} are the state transition and measurement models, respectively. Also, we assume a prior distribution p(x_0) on x_0. In the case of linearly additive noise, we may write this state-space model as

x_t = f(x_{t-1} \mid \theta) + \eta_t, \qquad y_t = h(x_t) + \epsilon_t.   (1.1)

Here both the stochastic noise η_t and the measurement noise ε_t are mutually independent and identically distributed sequences with known density functions. In addition, f(x_{t-1} | θ) and h(x_t) are known functions up to some parameters θ.

In order to build the framework on which to describe the filtering methodologies, we first frame the above state-space model as a recursive Bayesian estimation problem. Specifically, we are interested in obtaining the posterior distribution

p(x_{0:t} \mid y_{1:t})   (1.2)

where x_{0:t} = {x_0, x_1, ..., x_t} and y_{1:t} = {y_1, y_2, ..., y_t}. Often we don't require the entire posterior distribution, but merely one of its marginals. For instance, we are often interested in the estimate of state given all observations up to that point; we call this distribution the filtering density and denote it as

p(x_t \mid y_{1:t}).   (1.3)

By knowing this density, we are able to make estimates about the system's state, including measures of uncertainty such as confidence intervals.

If the functions f and h are linear and both η_t and ε_t are Gaussian, Kalman filtering is able to obtain the filtering distribution in analytic form. In fact it can be seen that all of the distributions of interest are Gaussian with means and covariances that can be simply calculated.
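As a concrete illustration of the linear-Gaussian case just described, the following is a minimal sketch (not taken from the thesis) of one predict/update step of the Kalman filter in Python; the matrices F, H, Q, R and the function name are our assumptions, corresponding to equation (1.1) with f and h linear.

```python
import numpy as np

def kalman_step(m, P, y, F, H, Q, R):
    """One Kalman filter iteration for the linear-Gaussian model
    x_t = F x_{t-1} + eta_t,  y_t = H x_t + eps_t  (cf. equation (1.1)).

    m, P : filtering mean and covariance at time t-1
    y    : new observation y_t
    Q, R : covariances of the state noise eta and measurement noise eps
    """
    # Predict: push the filtering density through the linear dynamics
    m_pred = F @ m
    P_pred = F @ P @ F.T + Q
    # Update: condition on the new observation y_t
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new  # parameters of the Gaussian filtering density (1.3)
```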
However, when the dynamics are non-linear or the noise non-Gaussian, alternative methods must be used.

In the case of non-linear dynamics with Gaussian noise, the standard methodology is the extended Kalman filter, which may be considered as a nonlinear Kalman filter which linearizes around the current mean and covariance. However, as a result of this linearization, the filter may diverge if the initial state estimate is wrong or the process is incorrectly modeled. In addition, the calculation of the Jacobian in the extended Kalman filter can become tedious in high-dimensional problems. One attempted solution to this problem has been the unscented Kalman filter (Wan and van der Merwe, 2001), which approximates the nonlinearity by transforming a random variable instead of through a Taylor expansion, as the extended Kalman filter does. By employing a deterministic sampling technique known as the unscented transform (Julier and Uhlmann, 1997), the UKF selects a minimal set of sample points around the mean which are then propagated through the non-linear functions while recovering the covariance matrix.

When either the stochastic or measurement noise is non-Gaussian, Monte Carlo methods must be employed, in particular particle filters. This Monte Carlo based filtering method relies on a large set of samples, called particles, which are evolved through the system dynamics with potentially non-Gaussian noise using importance sampling and bootstrap techniques. At each time step the empirical distribution of these particles is used to approximate the distribution of interest and its associated features. By sampling from some proposal distribution q(x_{0:t} | y_{1:t}) in order to approximate (1.2), we may use importance sampling with corresponding unnormalized weights

w_t = \frac{p(y_{1:t} \mid x_{0:t}) \, p(x_{0:t})}{q(x_{0:t} \mid y_{1:t})}.

However, we typically wish to perform this estimation sequentially, and hence we can take advantage of the Markov nature of the state and measurement process along with proposal distributions of the form q(x_{0:t} | y_{1:t}) = q(x_{0:t-1} | y_{1:t-1}) q(x_t | x_{0:t-1}, y_{1:t}). From this we obtain the recursive weight formula

w_t = w_{t-1} \frac{p(y_t \mid x_t) \, p(x_t \mid x_{t-1})}{q(x_t \mid x_{0:t-1}, y_{1:t})}.

This equation allows for the sequential updating of importance weights given an appropriate choice of proposal distribution q(x_t | x_{0:t-1}, y_{1:t}), as well as simple calculation of the filtering density (1.3). Since we can sample from this proposal distribution and evaluate the likelihood and transition probabilities, the particle filter simply involves generating a prior set of samples, evolving these samples forward with the proposal distribution, and subsequently calculating the importance weights. In addition, to prevent particle degeneracy, we employ a resampling step to remove particles with low weight and multiply those with high weight (Douc et al., 2005).

The choice of proposal distribution has a significant effect on the rate of degeneracy. The standard (and simplest) choice is the prior distribution q(x_t | x_{0:t-1}, y_{1:t}) = p(x_t | x_{t-1}), since the weights simplify to a calculation of the likelihood. However, if the likelihood is not near the prior, this choice will lead to large variance in the importance weights, and hence we would like to employ a proposal distribution which uses the data to provide a better estimate of the posterior distribution. One such possibility is to use a Gaussian approximation of the posterior as the proposal distribution (van der Merwe et al., 2001).
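The following is a minimal sketch of the bootstrap particle filter described above, using the prior p(x_t | x_{t-1}) as proposal so that each incremental weight reduces to the likelihood p(y_t | x_t). The functions sample_transition and log_likelihood stand in for the model densities p_{x,t} and p_{y,t}; these names, like the rest of the code, are our assumptions rather than the thesis's implementation.

```python
import numpy as np

def bootstrap_particle_filter(y, x0, sample_transition, log_likelihood, rng):
    """Bootstrap particle filter: prior proposal, multinomial resampling.

    y                 : observations y_1, ..., y_T
    x0                : (N,) or (N, d) array of particles drawn from p(x_0)
    sample_transition : draws x_t ~ p(x_t | x_{t-1}) for an array of particles
    log_likelihood    : evaluates log p(y_t | x_t) for an array of particles
    """
    x = x0.copy()
    N = x.shape[0]
    means = []
    for y_t in y:
        # Evolve particles through the (possibly non-linear) dynamics
        x = sample_transition(x, rng)
        # With the prior proposal, the incremental weight is the likelihood
        logw = log_likelihood(y_t, x)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Weighted estimate of the filtering mean under (1.3)
        means.append(np.sum(w * x.T, axis=-1))
        # Resample to prevent weight degeneracy
        idx = rng.choice(N, size=N, p=w)
        x = x[idx]
    return np.array(means)
```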
Often the problem of filtering isn't restricted to the estimation of state, but is also concerned with estimating some parameters θ of the dynamic model f(x|θ). Further complicating matters, the only information we have about the state and the model parameters is the noisy measurements {y_t}, t ≥ 1. While there are several approaches for solving this problem, we focus on dual estimation, namely the use of parallel filters to estimate state and model parameters (Wan et al., 2000). Specifically, a state-space representation is used for both the state and parameter estimation problems. While the state-space representation for the state is given in equation (1.1), the representation of the model parameters is given by

\theta_t = \theta_{t-1} + \nu_t, \qquad y_t = f(x_{t-1} \mid \theta_t) + \eta_t + \epsilon_t.

Here η_t and ε_t are as in (1.1), while ν_t is an additional iid noise term. Thus we can run two parallel filters for both state and parameters. At each time step the current state estimate is used in the parameter filter and the current parameter estimate is used in the state filter. The situation is complicated in the particle filter setting, due to the well-known problem of degenerate weights (Casarin and Marin, 2008).

Through this filtering methodology we are able to estimate the state of a dynamic system from noisy measurements, as well as the associated uncertainty of these estimates. In addition, the dual framework provides a mechanism for estimating model parameters along with the state. These filtering tools approximate a sequence of distributions of increasing dimension. In later chapters, we show how the particle filtering methodology may be adapted for situations involving distributions of equal dimension, and subsequently build an algorithm for efficiently performing prior sensitivity analysis and cross-validation.

1.2 Support Vector Forecasters

While particle filtering and other filtering methods rely on knowledge of the underlying process to de-noise the signal, support vector regression forecasters have a slightly different purpose. Specifically, they use a training data set to build a model of the signal, which is then used to predict subsequent pieces of the signal. The previous section contains a thorough description of filtering since the associated manuscript of chapter 2 bypasses this development. However, chapter 3 provides a thorough development of support vector forecasters, and hence we forego this development here, instead simply providing some useful references. The recent work of Steinwart and Christmann (2008) provides thorough details on both theoretical and applied aspects of support vector machines, while Schölkopf and Smola (2001) contains details on kernel-based learning.

1.3 References

Casarin, R. and Marin, J.-M. (2008). "Online data processing: comparison of Bayesian regularized particle filters." arXiv:0806.4242v1.

Douc, R., Cappé, O., Moulines, E. (2005). "Comparison of resampling schemes for particle filtering." Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis. pp. 64-69.

Julier, S. and Uhlmann, J. (1997). "A new extension of the Kalman filter to nonlinear systems." Proceedings of AeroSense: The 11th Symposium on Aerospace/Defence Sensing, Simulation and Controls.

Schölkopf, B. and Smola, A.J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge.

Steinwart, I. and Christmann, A. (2008).
Support Vector Machines. Springer, New York.

van der Merwe, R., Doucet, A., de Freitas, N., Wan, E. (2001). "The unscented particle filter." Advances in Neural Information Processing Systems. 13:584-590.

Wan, E. and van der Merwe, R. (2001). "The unscented Kalman filter." Kalman Filtering and Neural Networks. Ch. 7. Ed. S. Haykin.

Wan, E., van der Merwe, R., Nelson, A. (2000). "Dual estimation and the unscented transformation." Advances in Neural Information Processing Systems. 12:666-672.

Chapter 2

An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation

2.1 Introduction and Motivation¹

¹A version of this chapter has been submitted for publication. Bornn, L., Doucet, A., Gottardo, R. "An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation."

An important step in any Bayesian analysis is to assess the prior distribution's influence on the final inference. In order to check prior sensitivity, the posterior distribution must be studied using a variety of prior distributions. If these posteriors are not available analytically, they are usually approximated using Markov chain Monte Carlo (MCMC). Since obtaining the posterior distribution for one given prior can be very expensive computationally, repeating the process for a large range of prior distributions is often prohibitive. Importance sampling has been implemented as an attempted solution (Besag et al., 1995), but the potential of infinite variance importance weights makes this technique useless if the posterior distribution changes more than a trivial amount as the prior is altered. Additionally, this importance weight degeneracy increases with the dimension of the parameter space.

One such prior sensitivity problem is the creation of regularization path plots, a commonly used tool when performing penalized regression. In these situations there is typically a tuning parameter which controls the amount of shrinkage on the regression coefficients; regularization path plots graphically display this shrinkage as a function of the tuning parameter. For instance, in the LASSO shrinkage and variable selection method of Tibshirani (1996), the LARS algorithm (Efron et al., 2004) may be employed to quickly produce these plots. In the Bayesian version (Vidakovic, 1998; Park and Casella, 2008), however, we may want to plot the posterior means of the regression coefficients β ∈ R^p for a range of the tuning (or penalty) parameter λ. The corresponding posterior distributions are proportional to

\exp\left( -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) - \lambda \|\beta\|_1 \right)   (2.1)

where the response y is assumed to come from a normal distribution with mean Xβ and variance σ² for a model matrix X. Since approximating (2.1) using MCMC at one level of λ can take upwards of an hour depending on the precision required, producing this plot by repeating MCMC hundreds of times for different λ is impractical.
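To make (2.1) concrete, here is a small Python helper evaluating the unnormalized log posterior of the Bayesian LASSO. It is a sketch written for this exposition (the function name and the choice to work on the log scale are ours, and σ² is treated as fixed), not code from the paper.

```python
import numpy as np

def lasso_log_posterior(beta, y, X, lam, sigma2):
    """Unnormalized log posterior of equation (2.1):
    -(1/(2*sigma2)) * ||y - X beta||^2 - lam * ||beta||_1."""
    resid = y - X @ beta
    return -0.5 * resid @ resid / sigma2 - lam * np.abs(beta).sum()
```

An MCMC run targets this density for a single λ; the point of the chapter is to avoid re-running such a chain for every λ on the schedule.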
Another tool requiring repeated posterior approximations is cross-validation, which has two primary statistical purposes. The first is finding the value of a given parameter (for instance, the penalty parameter in penalized regression) which minimizes prediction error. The second is comparing different models' or methodologies' prediction performance. In both situations the data is split into a training set, which is used to fit the model, and a testing set, which is used to gauge the prediction performance of the trained model. A typical example would involve fitting a model on the training set for a range of values of some model parameter, then setting this parameter to the value that results in the lowest prediction error rate on the testing set. For example, we might wish to select the value of λ in (2.1) to minimize prediction error. From a computational standpoint, cross-validation is similar to prior sensitivity in both structure and complexity. Further, the entire process is usually repeated for a variety of different training and testing sets and the results are then combined. Although importance sampling has been applied to cross-validation (for example, Alqallaf and Gustafson, 2001), the problem of infinite variance importance weights remains (Peruggia, 1997).

In this paper, we begin by motivating and developing sequential Monte Carlo (SMC) methods, then subsequently apply them to prior sensitivity analysis and cross-validation. In Section 2 we develop an efficient algorithm for sampling from a sequence of potentially quite similar probability distributions defined on a common space. Section 3 demonstrates the algorithm in a prior sensitivity setting and applies it to the creation of regularization path plots and the sensitivity of the tuning parameters when performing variable selection using g-Priors. Cross-validation with application to Bayesian penalized regression is developed in Section 4. We close with extensions and concluding remarks in Section 5.

2.2 Sequential Monte Carlo Algorithms

SMC methods are often used in the analysis of dynamic systems where we are interested in approximating a sequence of probability distributions π_t(θ_t), where t = 1, 2, 3, ..., T. The variable θ_t can be of evolving or static dimension as t changes; note that t is simply an index variable and need not be real time. Most work in the SMC literature is interested in the evolving dimension case, with applications to state-space models (Doucet et al., 2000) and target tracking (Liu and Chen, 1998) among others. The static case, where each π_t lies in a common space, has received less attention (Chopin, 2002; Del Moral et al., 2006). The goal of SMC methods is to sample from the distributions {π_t} sequentially, i.e. first from π_1, then π_2, up to π_T. In some situations we are concerned with each intermediate distribution, whereas in others only the final distribution π_T is of interest (for example, Neal, 2001). For further reading, the edited volume of Doucet et al. (2001) covers a range of developments in SMC theory and application.

The situation where the sequence of distributions lies in a common space arises in several applications. For instance, the number of observations in some experiments can make MCMC prohibitive. In this case π_t might be the posterior distribution of a parameter given the observations 1 through t. Moving through the data with a sequential strategy in this way may decrease computational complexity. Another application is transitioning from a simple distribution π_1 to a more complex distribution of interest π_T. Alternatively we could consider situations analogous to simulated annealing (Kirkpatrick et al., 1983), where π_t(θ) ∝ [π(θ)]^{φ_t} for an increasing sequence {φ_t}, t = 1, 2, 3, ..., T.
In all of these examples the bridging distributions {π_t} are only used to reach the final distribution of interest π_T.

When we are interested in a certain feature of each π_t, SMC will typically be computationally cheaper than MCMC even if we can successfully sample from each π_t using MCMC. This is because SMC borrows information from adjacent distributions, using the samples from earlier distributions to help in approximating later distributions. Often the difficulty in using SMC is constructing this sequence of distributions; both prior sensitivity and cross-validation are situations where there exists a natural sequence upon which SMC may be applied. From here forward we assume the distributions to have a common support.

For all times t, we seek to obtain a collection of N weighted samples (called particles) {W_t^{(i)}, θ_t^{(i)}}, i = 1, ..., N, approximating π_t, where the weights are positive and normalized to sum to 1. We may estimate expected values with these particles using

\hat{E}_{\pi_t}(g(\theta)) = \sum_{i=1}^{N} W_t^{(i)} g(\theta_t^{(i)}).

One technique used in SMC is importance sampling, where particles {W_{t-1}^{(i)}, θ_{t-1}^{(i)}} distributed as π_{t-1} may be reused, reweighting them (before normalization) according to

W_t^{(i)} \propto W_{t-1}^{(i)} \frac{\pi_t(\theta_{t-1}^{(i)})}{\pi_{t-1}(\theta_{t-1}^{(i)})}   (2.2)

in order to obtain an approximation of π_t. Thus we obtain the current weights by multiplying the previous weights by an incremental weight. In an attempt to prevent these weights from becoming overly non-uniform, we may move each particle θ_{t-1}^{(i)} (currently distributed according to π_{t-1}) with a Markov kernel K_t(θ, θ') to a new position θ_t^{(i)}, then subsequently reweight the moved particles to be distributed according to π_t. Although the kernel K_t(θ, θ') = π_t(θ') minimizes the variance of the importance weights, it is typically impossible to sample from; thus it has been proposed to use Markov kernels with invariant distribution π_t (Gilks and Berzuini, 2001). A direct application of this strategy suffers from a major flaw, however, as the importance distribution given by

\int \pi_1(\theta_1) \prod_{k=2}^{t} K_k(\theta_{k-1}, \theta_k) \, d\theta_{1:t-1}

is usually impossible to compute and therefore we cannot calculate the necessary importance weights. Additionally, this assumes we are able to sample from π_1(θ_1), which is not always the case. Alternatives attempt to approximate π_t pointwise when possible, but the computation of these algorithms is in O(N²) (Del Moral et al., 2006).

The central idea of SMC samplers (Del Moral et al., 2006) is to employ an auxiliary backward kernel with density L_{t-1}(θ_t, θ_{t-1}) to get around this intractable integral. This backward kernel relates to a time-reversed SMC sampler giving the same marginal distribution as the forward SMC sampler induced by K_t. The backward kernel is essentially arbitrary, but should be optimized to minimize the variance of the importance weights. Del Moral et al. (2006) prove that the sequence of backward kernels minimizing the variance of the importance weights is, for any t,

L_{t-1}^{opt}(\theta_t, \theta_{t-1}) = \pi_{t-1}(\theta_{t-1}) K_t(\theta_{t-1}, \theta_t) / \pi_t(\theta_t).

However, it is typically impossible to use this optimal kernel since it relies on intractable marginals. Thus, we should select a backward kernel that approximates this optimal kernel. Del Moral et al. (2006) give two suboptimal backward kernels to approximate L_{t-1}^{opt}, which result in incremental weights

w_t(\theta_{t-1}, \theta_t) = \frac{\pi_t(\theta_t)}{\int \pi_{t-1}(\theta_{t-1}) K_t(\theta_{t-1}, \theta_t) \, d\theta_{t-1}}   (2.3a)

w_t(\theta_{t-1}, \theta_t) = \frac{\pi_t(\theta_{t-1})}{\pi_{t-1}(\theta_{t-1})}.   (2.3b)

These incremental weights are then multiplied by the weights at the previous time and normalized to sum to 1.
We note that the suboptimal kernel resulting in (2.3b) is actually an approximation of that resulting in (2.3a), and coincidentally has the same form as (2.2), the reweighting mechanism for importance sampling. In this manner the first kernel should perform better, particularly when successive distributions are considerably different (Del Moral et al., 2006). Although the weights (2.3a) are a better approximation of the optimal backward kernel weights, the second kernel is convenient since the resulting incremental weights (2.3b) do not depend on the position of the moved particles θ_t, and hence we are able to reweight the particles prior to moving them. We include the incremental weight (2.3a) because, when K_t is a Gibbs kernel moving one component at a time, it simplifies to π_t(θ_{t-1,-k}) / π_{t-1}(θ_{t-1,-k}), where k is the index of the component being moved by the Gibbs sampler and θ_{t-1,-k} is the particle excluding the kth component. By a simple Rao-Blackwell argument it can be seen that this choice, by conditioning on the variable being moved, results in reduced variance of the importance weights compared to (2.3b).

2.2.1 An Efficient SMC Algorithm

Now that we have described some components of SMC methodology, we proceed to develop an efficient algorithm for performing prior sensitivity and cross-validation. The basic idea of our algorithm is to first reweight the particles {W_{t-1}^{(i)}, θ_{t-1}^{(i)}}, i = 1, ..., N, such that they are approximately distributed as π_t. If the variance of the weights is large, we then resample the particles with probabilities proportional to their weights, giving us a set of N equally weighted particles (including some duplicates). After resampling we move the particles with a kernel of invariant distribution π_t, which creates diversity in the particles. Our algorithm relates closely to resample-move algorithms (Gilks and Berzuini, 2001; Chopin, 2002), although our formulation is more general and allows for the use of a variety of suboptimal backward kernels and corresponding weights.

Moving the particles at each time step is not particularly efficient. For example, if two successive distributions in the sequence are identical, we are wasting our time by moving the particles. If successive distributions are similar but not necessarily identical, to save computational time we can simply copy forward the particles at time t-1 and reweight them with the importance sampling weights (2.2). Deciding when to move particles may be done dynamically or deterministically. A dynamic scheme would move the particles whenever the variance of the weights becomes too large, usually measured by the effective sample size (ESS), (∑_{i=1}^{N} (W_t^{(i)})²)^{-1}, whereas a deterministic scheme would move the particles every kth time step for some integer k. Since the sequence of distributions will likely not change at a constant rate, it is better to use a dynamic scheme as this allows for little particle movement during parts of the sequence with little change and more movement in parts of the sequence where successive distributions vary more.

When the ESS drops below a specified threshold, we reweight the particles at time t-1 to be approximately distributed as π_t prior to moving them. The weights (2.3b) only depend on the particles at time t-1, so we can easily do this. In the case of a one-at-a-time Gibbs sampler, we can also use the weights (2.3a).
Because the unweighted particles at time t are not distributed according to π_t, we cannot simply move the particles without first taking their weights into consideration. Thus prior to moving the particles we must resample them such that W_t^{(i)} = 1/N for i = 1, ..., N and the particles' unweighted distribution is π_t. Resampling methods duplicate particles with large weights and remove particles with low weights. Specifically, we copy the particle θ_{t-1}^{(i)} N_t^{(i)} times, such that ∑_{i=1}^{N} N_t^{(i)} = N and E(N_t^{(i)}) = N W_t^{(i)}, where W_t^{(i)} are the normalized importance weights. Lastly, all of the resampled particles are assigned equal weight. The simplest unbiased resampling method consists of sampling {N_t^{(i)}} from a multinomial distribution with parameters (N, {W_t^{(i)}}). It should be noted that more sophisticated resampling schemes, such as residual resampling (Liu, 2001) and stratified resampling (Kitagawa, 1996), exist, resulting in reduced variance of N_t^{(i)} relative to multinomial resampling. After the particles are resampled, we can move them with the kernel K_t.

An efficient SMC algorithm which may be used to perform prior sensitivity and cross-validation is therefore:

for t = 1 do
    Obtain N weighted samples {W_1^{(i)}, θ_1^{(i)}} from π_1 (directly, MCMC, etc.)
end
for t = 2, ..., T do
    Copy θ_{t-1}^{(i)} to θ_t^{(i)} and calculate weights W_t^{(i)} according to (2.2)
    if ESS(θ_t) > c then
        Copy (θ_{t-1}^{(i)}, W_{t-1}^{(i)}) to (θ_t^{(i)}, W_t^{(i)})
    else
        Reweight: Calculate weights according to W_t^{(i)} ∝ W_{t-1}^{(i)} × w_t(θ_{t-1}, θ_t), where w_t(θ_{t-1}, θ_t) is given by either (2.3a) or (2.3b)
        Resample: Resample particles according to above weights. Set all weights to 1/N
        Move: Move particles with Markov kernel of invariant distribution π_t
    end
end

note 1: If a backward kernel is chosen such that the incremental weights depend on the position of the moved particle θ_t, the reweight step comes after the move step and resampling is performed with the weights W_{t-1}^{(i)}.
note 2: c is a user-specified threshold on the effective sample size.

2.3 Prior Sensitivity

In the case of prior sensitivity we are interested in approximating the posterior distribution of some variable(s) θ given the data D, symbolically notated as π(θ|D) ∝ f(D|θ) ν(θ), where f(D|θ) and ν(θ) are the likelihood and the prior distribution of θ, respectively. Here the notation ν(θ) is used to differentiate the prior from the posterior distribution π_t(θ), allowing for the omittance of dependencies. This prior sensitivity framework has been studied in a closed-form setting (Gustafson and Wasserman, 1995; Gustafson, 1996), but situations requiring Monte Carlo methods have received less attention. It is worth noting that only the prior distribution changes between successive distributions (the likelihood remains the same). Thus when we reweight particles to approximate the sequence of posterior distributions for θ, the weights (2.2) depend solely on the prior distribution,

W_t^{(i)} \propto W_{t-1}^{(i)} \frac{f(D \mid \theta_{t-1}^{(i)}) \, \nu_t(\theta_{t-1}^{(i)})}{f(D \mid \theta_{t-1}^{(i)}) \, \nu_{t-1}(\theta_{t-1}^{(i)})} \propto W_{t-1}^{(i)} \frac{\nu_t(\theta_{t-1}^{(i)})}{\nu_{t-1}(\theta_{t-1}^{(i)})}   (2.4)

where θ_{t-1}^{(i)} is the ith particle sampled at time t-1 and ν_t(θ) is the tth prior distribution evaluated at the point θ_{t-1}^{(i)}. If the ESS falls below a given threshold at time t (notated as c in the algorithm pseudocode), we resample and move, otherwise we simply reweight. Conveniently, resampling and moving using (2.3b) and reweighting using (2.2) both result in the same weight mechanism (2.4). In a later example we will also employ the weights (2.3a), which have reduced variance relative to (2.3b).
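As a concrete companion to the pseudocode above and the prior-sensitivity weights (2.4), the sketch below gives one possible Python skeleton of the reweight/resample/move loop. It is a sketch under our own naming assumptions: log_prior(theta, t) evaluates ν_t, and mcmc_move(theta, t, rng) is any user-supplied kernel with invariant distribution π_t. Neither is code from the paper.

```python
import numpy as np

def smc_prior_sensitivity(theta, log_prior, mcmc_move, T, c, rng):
    """SMC over posteriors sharing a likelihood: weights follow (2.4).

    theta     : (N, ...) particles approximating pi_1
    log_prior : log nu_t(theta) for prior index t (the likelihood cancels)
    mcmc_move : one sweep of a Markov kernel with invariant distribution pi_t
    c         : ESS threshold triggering resample-and-move
    """
    N = theta.shape[0]
    logw = np.full(N, -np.log(N))          # uniform initial weights
    estimates = []
    for t in range(2, T + 1):
        # Reweight: incremental weight is the prior ratio (2.4)
        logw += log_prior(theta, t) - log_prior(theta, t - 1)
        w = np.exp(logw - logw.max()); w /= w.sum()
        if 1.0 / np.sum(w ** 2) < c:       # ESS below threshold
            idx = rng.choice(N, size=N, p=w)   # multinomial resampling
            theta = theta[idx]
            logw = np.full(N, -np.log(N))
            theta = mcmc_move(theta, t, rng)   # diversify the duplicates
            w = np.full(N, 1.0 / N)
        estimates.append(np.average(theta, axis=0, weights=w))
    return estimates
```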
2.3.1 Regularization Path Plots

Consider the regression model with response vector y = (y_1, ..., y_n)^T and model matrix X = (x_1, ..., x_p), where x_j = (x_{1j}, ..., x_{nj})^T, j = 1, ..., p, are the column vectors of predictors (including the unit intercept vector). For clarity of presentation we present the model with a continuous response; however, it is simple to extend to binary responses (Albert and Chib, 1993). We use the prostate data of Stamey et al. (1989), which has eight predictors and a response (logarithm of prostate-specific antigen), with likelihood

y \mid X, \beta, \sigma^2 \sim N_n(X\beta, \sigma^2 I_n).   (2.5)

Using a double exponential prior distribution with parameter λ on the regression coefficients β = (β_1, ..., β_p), the corresponding posterior distribution is proportional to (2.1). We see from the form of this posterior distribution that if λ = 0 the MAP estimate of β will correspond to the least squares solution. However, as λ increases there will be shrinkage on β, which may be displayed using a regularization path plot. Because the shrinkage as λ varies is nonlinear, we set a schedule λ = e^{t/20}, t = 1, ..., 100. We create a "gold standard" Bayesian Lasso regularization path plot for this data by running MCMC with a Markov chain of length 50,000 at each level of λ and plotting the posterior mean of the resulting regression coefficients (Figure 2.1). It should be noted that the creation of this plot took over 5 hours.

[Figure 2.1: Regularization Path Plots: The gold standard. The plot is of standardized coefficients β_j vs. ‖β‖₁/max(‖β‖₁).]

Since the idea is to create these plots quickly for exploratory analysis, we will compare our SMC-based method to MCMC with both constrained to work in 5 minutes (+/- 5 seconds), and both using the same Markov kernel. In order to perform MCMC in our time frame of 5 minutes, the Markov chain had a length of 1200 for each of the 100 levels of λ. The mean of each resulting posterior distribution was used to plot the regularization path plots in Figure 2.2(a). In comparison, to run in 5 minutes our SMC algorithm used N = 4200 particles and resampled and moved the particles when the ESS dropped below c = 2N/3 = 2800 (Figure 2.2(b)). For the sake of time comparisons, all computations here and later were performed on a Power Mac G5 with dual 2.7 GHz PowerPC processors. We see from these plots that both methods capture the key features of the regularization path plot as shown by the gold standard: every one of the variables has the correct path. The methods vary, however, in the amount of noise. We see much more variability using MCMC compared to SMC. This is due to SMC being able to use many more particles since it is able to save time by borrowing information from previous distributions. To be specific, the SMC algorithm in this context had to resample and move the particles only 25 times in the entire sequence of 100 distributions. The remainder of the time our algorithm simply reweighted the particles, which is computationally inexpensive. It is worth noting that, because of this, adding more incremental distributions in the sequence will have little effect on the computational time of the SMC algorithm, unlike MCMC-based strategies, which would approximate each new distribution with a new Markov chain. In addition, we attempted to make these plots using importance sampling, reweighting (and not moving) particles from π_1 to approximate later distributions. However, the weights became degenerate: all but one of the weights had values near zero, and the one particle with positive weight had standardized L1 norm of 0.8. Thus importance sampling was only able to create roughly 1/5 of the entire plot, and hence is clearly not a candidate methodology for creating these plots. We will see later that in many such situations importance sampling fails, even with large numbers of particles.

[Figure 2.2: Regularization Path Plots: Plots using MCMC and SMC for fixed computational time of 5 minutes. (a) MCMC with 1200 samples (5 minutes); (b) SMC with 4200 samples (5 minutes). The plots are of standardized coefficients β_j vs. ‖β‖₁/max(‖β‖₁).]
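For illustration, the snippet below shows how the λ schedule above would drive the SMC skeleton sketched earlier. The Gibbs kernel gibbs_lasso_move is a placeholder name for a sampler targeting (2.1) and is our assumption, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
lambdas = np.exp(np.arange(1, 101) / 20.0)   # schedule lambda = e^{t/20}

# Prior ratio for (2.4): only the double-exponential prior changes with t
def log_prior(beta, t):
    return -lambdas[t - 1] * np.abs(beta).sum(axis=1)

# posterior_means = smc_prior_sensitivity(
#     beta_particles, log_prior, gibbs_lasso_move, T=100, c=2 * N / 3, rng=rng)
# Each entry of posterior_means supplies one point per coefficient on the path plot.
```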
2.3.2 Variable Selection Using g-Priors

Consider the normal likelihood set-up (2.5). Now, however, with an eye towards model selection, we introduce the binary indicator variable γ ∈ {0,1}^p, where γ_j = 1 means the variable x_j is included in the model. Thus γ can describe all of the 2^p possible models. Following the notation of Marin and Robert (2007), we use q_γ = 1^T γ as a counter for the number of variables in the model. If X_γ is the model matrix which excludes each x_j for which γ_j = 0, we can employ the following prior distributions for β_γ and σ² (Zellner, 1986; Marin and Robert, 2007):

\beta_\gamma \mid \sigma^2, \gamma \sim N\!\left(0, \; g\sigma^2 (X_\gamma^T X_\gamma)^{-1}\right), \qquad \pi(\sigma^2) \propto \sigma^{-2}.

From this it is straightforward to show that the posterior density for γ is thus

\pi(\gamma \mid y, X) \propto (g+1)^{-(q_\gamma+1)/2} \left[ y^T y - \frac{g}{g+1} \, y^T X_\gamma (X_\gamma^T X_\gamma)^{-1} X_\gamma^T y \right]^{-n/2}.   (2.6)

We perform model selection on the pollution data set of McDonald and Schwing (1973), in which mortality rate is compared against 15 pollution-related variables in 60 metropolitan areas. The 15 independent variables include, for instance, mean annual precipitation, population per household, and average annual relative humidity. The response variable y is the age-adjusted mortality rate in the given metropolitan area. We seek to perform variable selection to narrow down the number of independent variables which best predict the response. With 15 variables, calculating the posterior probabilities of the over 30,000 models exactly is possible but time-consuming. We have chosen this size of data set to allow for a benchmark against which we can compare MCMC to SMC.

Our goal is to see how the explanatory variables change as we vary the prior distribution parameter g; in other words, we are interested in seeing how robust the variable selection method is to changes in the setting of g. We perform the variable selection for 100 levels of g on the schedule g = e^{t/10}, t = 1, ..., 100. We use a Gibbs sampler strategy to compare the SMC-based algorithm to brute-force MCMC, benchmarked against the exact solution obtained from (2.6), in which β_γ and σ² are integrated out. Specifically, we update γ one component at a time. The incremental weight ratio (2.3b) will be the ratio of the posterior distribution (2.6) evaluated on the complete data at successive levels of g. In addition, we are able to use the weights (2.3a), which correspond to the ratio of the posterior distribution (2.6) evaluated on all of the data, excluding the variable that is being moved by the Gibbs sampler.
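Since (2.6) is available in closed form, the exact benchmark reduces to evaluating it over all 2^15 models. Below is a hedged sketch of that evaluation; the function name and argument conventions are ours.

```python
import numpy as np
from itertools import product

def log_post_gamma(gamma, y, X, g):
    """Log of (2.6) up to a constant: the marginal posterior of the model
    indicator gamma under Zellner's g-prior."""
    n = len(y)
    q = gamma.sum()
    Xg = X[:, gamma.astype(bool)]
    yty = y @ y
    if q > 0:
        b = Xg.T @ y
        fit = b @ np.linalg.solve(Xg.T @ Xg, b)   # y' X_g (X_g'X_g)^{-1} X_g' y
    else:
        fit = 0.0
    return (-(q + 1) / 2) * np.log(g + 1) - (n / 2) * np.log(yty - g / (g + 1) * fit)

# Exact enumeration over all 2^p models (feasible for p = 15):
# logps = [log_post_gamma(np.array(gam), y, X, g) for gam in product([0, 1], repeat=15)]
```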
In order to see our desired result, we use (2.6) to plot the exact marginal probabilities as well as some sample model probabilities for various levels of g (Figures 2.3(a) and 2.3(b)). This process took slightly over 8 hours, and hence we would like to find a faster method. We constrain both stochastic algorithms to run in 30 minutes (+/- 1 minute). As a result the MCMC algorithm uses a Markov chain of length 10,000 and the SMC algorithm uses 18,000 particles. We plot the resulting posterior marginal probabilities for each algorithm in Figures 2.4(a) and 2.4(b), respectively. First impression shows that the plot created using MCMC has much more variability. However, the smoothness in the SMC algorithm is not a result of perfect accuracy of the method, but rather only smoothness of the reweighting mechanism (2.2). Because of this, if the SMC does poorly during times of particle movement, the subsequent reweighted approximations will also be inaccurate. To ensure this is not the case and verify that SMC is indeed outperforming MCMC, we look at the average absolute error of the marginal probabilities (at 100 levels of g and for 15 variables). We find the average absolute error in the marginal probabilities using MCMC is 0.0292, whereas with SMC it is only 0.0187. Their respective maximum absolute errors were 0.24 and 0.08. In fact 30 runs of the algorithms gave similar results, with SMC consistently outperforming MCMC. From this we see that SMC is indeed providing a better approximation of the true marginal probabilities.

[Figure 2.3: Exact Marginal and Model probabilities for variable selection using g-Priors as a function of log(g). (a) Posterior marginal probabilities (green = X1, blue = X2, yellow = X6, red = X9, purple = X14), highlighting several variables which show high marginal probabilities of inclusion. (b) Posterior model probabilities (purple = {1,2,4,5,9}, red = {1,2,4,5}, blue = {1,9}, green = {9}, yellow = {6}), showing the posterior probabilities of 5 models chosen to highlight the effect of g on model size.]

[Figure 2.4: Approximate Marginal and Model probabilities for variable selection using g-Priors as a function of log(g). (a) Posterior marginal probabilities: MCMC with 10,000 samples (30 minutes). (b) Posterior marginal probabilities: SMC with 18,000 samples (30 minutes). Plots (a) and (b) compare MCMC to SMC's performance.]

What then may be taken from these marginal probability plots? When performing simple forward selection regression, the variables 1, 2, 6, 9, and 14 are chosen. Slightly different results come from doing backward selection; in particular variables 1 and 14 are replaced by variables 12 and 13. The LASSO solution (using 5-fold cross-validation) is the same as the forward solution with the additional variables 7 and 8. In addition, the LASSO solution contains some shrinkage on the regression coefficients (see example 2.3.1). Using g-Priors, the variables that clearly stand out (see Figure 2.3(a)) are 1, 2, 6, 9, and 14. Thus the g-Prior solution taken from the plot corresponds to the forward selection model. Also, for a given value of g, the plot obtained with SMC shows the correct top 4 variables for inclusion, whereas the variability in the MCMC-based plot makes it impossible to do so.
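The one-component-at-a-time Gibbs kernel used above has a simple form: each γ_j is redrawn from its conditional, which depends only on the ratio of (2.6) at γ_j = 1 versus γ_j = 0. A minimal sketch, building on the hypothetical log_post_gamma helper above:

```python
import numpy as np

def gibbs_sweep_gamma(gamma, y, X, g, rng):
    """One systematic-scan Gibbs sweep over the inclusion indicators,
    targeting the model posterior (2.6)."""
    gamma = gamma.copy()
    for j in range(len(gamma)):
        lp = np.empty(2)
        for v in (0, 1):
            gamma[j] = v
            lp[v] = log_post_gamma(gamma, y, X, g)
        p1 = 1.0 / (1.0 + np.exp(lp[0] - lp[1]))  # P(gamma_j = 1 | rest)
        gamma[j] = int(rng.random() < p1)
    return gamma
```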
2.4 Cross-Validation

We focus on leave-s-out cross-validation, which is the case when the testing set consists of s observations. Continuing in the linear regression framework, let X_{\S} and y_{\S} be the model matrix and response vector excluding the subset S of observations (of size s). We are interested in a collection of T model parameter (typically prior distribution parameter) settings resulting in posterior densities π_t(θ | X_{\S}, y_{\S}) for t = 1, ..., T. Once we have approximations of all T posterior densities, we select the model parameter settings which result in the best prediction of y_S using X_S. To find the sequence of distributions π_t(θ | X_{\S}, y_{\S}), t = 1, ..., T, the same SMC-based algorithm proposed for prior sensitivity is applicable. Specifically, once we have obtained a Monte Carlo approximation of π_1(θ | X_{\S}, y_{\S}), we can transition to the remainder of the distributions π_t(θ | X_{\S}, y_{\S}), t = 2, ..., T, using SMC.

In addition to quickly evaluating the model for a variety of settings on the training set, SMC also provides a tool for switching the training/testing set without fully re-approximating the posterior densities. Specifically, suppose we have a testing set S_1, and using SMC we find approximations of π_t(θ | X_{\S_1}, y_{\S_1}), t = 1, ..., T, each of which is tested for prediction performance on the subset S_1. However, typically we are interested in performing cross-validation for a variety of different splits of the data into training and testing sets. Thus, we will now want a new testing set S_2 and find approximations of π_t(θ | X_{\S_2}, y_{\S_2}), t = 1, ..., T. The obvious way to accomplish this is to start fresh by approximating π_1(θ | X_{\S_2}, y_{\S_2}) with MCMC and proceeding to approximate the remainder of the distributions using SMC. However, we can be a bit more clever than this, recognizing that π_1(θ | X_{\S_1}, y_{\S_1}) and π_1(θ | X_{\S_2}, y_{\S_2}) are related (Alqallaf and Gustafson, 2001; Bhattacharya and Haslett, 2007).

Successive splits of the data into training and testing sets should give similar model settings. Therefore, we first build the model for a given parameter setting on the full data set using SMC, resulting in an approximation of π_1(θ | X, y). Then instead of using MCMC to get approximations of π_1(θ | X_{\S}, y_{\S}) for different S ∈ {S_1, ..., S_max}, we can build a sequence of distributions

(\pi_1(\theta \mid X, y))^{1-\gamma} \, (\pi_1(\theta \mid X_{\setminus S}, y_{\setminus S}))^{\gamma}

for an increasing temperature γ = 0, ε, 2ε, ..., 1-ε, 1, which will allow us to transition to the case-deletion posteriors. The process is illustrated in Figure 2.5. The case of γ = 0, 1 with no movement step corresponds to basic case-deletion importance sampling as developed in Peruggia (1997). Although case-deletion importance sampling has been demonstrated to achieve up to 90% cost savings in some circumstances (Alqallaf and Gustafson, 2001), the problem of degeneracy still makes importance sampling fail in many situations (Peruggia, 1997; Epifani et al., 2005).

[Figure 2.5: Diagram of cross-validation process. From π_1(θ|X, y), SMC transitions to each case-deletion posterior π_1(θ|X_{\S_1}, y_{\S_1}), ..., π_1(θ|X_{\S_max}, y_{\S_max}), and from each of these down through π_2, ..., π_T. Each arrow represents transitioning using SMC.]
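The tempered bridge between the full-data posterior and a case-deletion posterior is straightforward to express on the log scale. The sketch below is our illustration (names ours), with log_q_full and log_q_del standing for log q(θ) and log q_{\S}(θ) defined in the next paragraph.

```python
def log_q_bridge(theta, gamma_temp, log_q_full, log_q_del):
    """Log density (up to a constant) of the geometric bridge
    q_gamma = q^{1-gamma} * q_del^{gamma} between the full-data
    posterior (gamma = 0) and the case-deletion posterior (gamma = 1)."""
    return (1.0 - gamma_temp) * log_q_full(theta) + gamma_temp * log_q_del(theta)

# A successive SMC step gamma -> gamma + eps then uses incremental weight
# q_{gamma+eps}(theta) / q_gamma(theta) = (q_del(theta) / q_full(theta))**eps.
```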
Let θ = (β, σ²). The posterior distribution π(θ) of θ is proportional to q(θ) = f(y | β, σ²) × π(β) × π(σ²). Assume we collect samples from the distribution π(θ). We are interested in reweighting these samples such that they come from the distribution attained by removing the set S. The modified likelihood and posterior for this case-deletion scenario are, respectively,

f_{\setminus S}(y \mid \beta, \sigma^2) = (2\pi\sigma^2)^{-(n-s)/2} \exp\left\{ -\frac{1}{2\sigma^2} \left[ (y - X\beta)^T (y - X\beta) - (y_S - X_S\beta)^T (y_S - X_S\beta) \right] \right\}

q_{\setminus S}(\theta) = f_{\setminus S}(y \mid \beta, \sigma^2) \times \pi(\beta) \times \pi(\sigma^2).

We assume that the prior distributions for β and σ² are proper and independent. Epifani et al. (2005) show that if the weights w_{\S}(θ) = q_{\S}(θ)/q(θ) are used to move to the case-deletion posterior directly, then the rth moment of these weights is finite if and only if all of the following conditions hold:

a) λ_H < 1/r
b) n - rs > 1
c) RSS*_{\S}(r) > 0

where λ_H is the largest eigenvalue of the matrix H_S = X_S (X^T X)^{-1} X_S^T, and RSS*_{\S}(r) = RSS - r e_S^T (I - r H_S)^{-1} e_S, where e_S = y_S - X_S (X^T X)^{-1} X^T y and RSS denotes the residual sum of squares of the least squares fit of the full data set. This result should not be taken lightly: as Geweke (1989) points out, if the 2nd moment does not exist, the importance sampling estimator will follow neither a n^{1/2} asymptotic nor a central limit theorem. (a) states that if the leverage of the deleted observations is too large, then the importance weights will have infinite variance. (b) gives a condition relating sample size to the allowable test set size s. (c) says that if the influence of the deleted observations is large relative to RSS, then the importance weights will have infinite variance. We show here how using a sequence of artificial intermediate distributions with SMC can help to mitigate this problem.

We introduce a sequence of distributions

q_\gamma(\theta) \propto (q(\theta))^{1-\gamma} (q_{\setminus S}(\theta))^{\gamma}, \qquad \gamma = 0, \epsilon, 2\epsilon, \ldots, 1-\epsilon, 1,

to move from q(θ) = q_0(θ) to q_{\S}(θ) = q_1(θ). At a given step γ = γ* in the sequence, the successive importance weights appearing in the SMC algorithm to move to the next step γ* + ε are

w_{\setminus S, \gamma^*}(\theta) = \frac{(q(\theta))^{1-\gamma^*-\epsilon} (q_{\setminus S}(\theta))^{\gamma^*+\epsilon}}{(q(\theta))^{1-\gamma^*} (q_{\setminus S}(\theta))^{\gamma^*}} = \left( \frac{q_{\setminus S}(\theta)}{q(\theta)} \right)^{\epsilon}.

Theorem 1. Provided that RSS*_{\S}(1) > 0 and the prior distributions for β and σ² are proper and independent, a sequence of distributions proportional to {(q(θ))^{1-γ} (q_{\S}(θ))^{γ}; γ = 0, ε, 2ε, ..., 1-ε, 1} may be constructed to move from q(θ) to q_{\S}(θ) such that the importance weights w_{\S,γ}(θ) for each successive step have a finite second moment under q_γ(θ) provided

\epsilon < c - 1   (2.7)

where c > 1 is chosen to satisfy

λ_H < 1/c   (2.8a)
n - cs > 2   (2.8b)
RSS*_{\S}(c) > 0.   (2.8c)

The proof may be found in the appendix. The provision that RSS*_{\S}(1) > 0 is very reasonable, and states that the least squares fit of the full data must not fit the training set perfectly. Note also that we find c for each subset S. Thus we may use the largest allowable step size ε in (2.7) for each subset S, maximizing the algorithm's efficiency. While this result is not sufficient to establish that the variance of SMC estimates is finite for a finite number N of particles, it can be used to upper bound the asymptotic variance of SMC estimates under additional mild regularity mixing conditions on the MCMC kernels; see Chopin (2004) and Jasra and Doucet (2008) for similar ideas.
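Theorem 1 is easy to operationalize: for each held-out set S one computes λ_H and RSS*_{\S}(c), searches for the largest admissible c, and sets the step size from (2.7). The sketch below does this over a grid of candidate c values; it is our illustration of the conditions, not the authors' code.

```python
import numpy as np

def max_step_size(X, y, S, c_grid=np.linspace(1.001, 5.0, 4000)):
    """Largest c satisfying (2.8a)-(2.8c) for held-out rows S, and the
    resulting bound epsilon < c - 1 on the SMC step size (2.7)."""
    n = X.shape[0]
    s = len(S)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    RSS = np.sum((y - X @ beta_hat) ** 2)
    H_S = X[S] @ XtX_inv @ X[S].T
    lam_H = np.linalg.eigvalsh(H_S).max()   # largest eigenvalue of H_S
    e_S = y[S] - X[S] @ beta_hat
    best_c = None
    for c in c_grid:
        if lam_H >= 1.0 / c or n - c * s <= 2:          # (2.8a), (2.8b)
            continue
        M = np.eye(s) - c * H_S
        RSS_star = RSS - c * e_S @ np.linalg.solve(M, e_S)
        if RSS_star > 0:                                 # (2.8c)
            best_c = c
    return best_c, (best_c - 1 if best_c else None)      # epsilon bound
```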
2.4.1 Application to Bayesian Penalized Regression

To demonstrate the strength of SMC applied to cross-validation, we use it to select the parameter λ of the Bayesian Lasso (2.1). For brevity, we reuse the pollution data set (McDonald and Schwing, 1973) of section 2.3.2, selecting the parameter λ using leave-one-out cross-validation. Firstly, it is worth pointing out that importance sampling will fail in this situation, as λ_H > 1/2 on 6 of the 60 observations in this data set, and hence the sufficient conditions to ensure finite variance are not satisfied. Using a sequence of intermediate distributions, we find that the largest c satisfying (2.8) equals 1.103, or, to ensure a finite second moment, ε < c - 1 = .103. Thus, the largest sequence of distributions was of length 10. For most observations c > 2, which for r = 2 is equivalent to importance sampling. Thus SMC does not waste time transitioning to case-deleted posteriors if importance sampling will suffice.

We use a Gibbs sampler to approximate the posterior distribution of (β, σ²) for λ = e^5 on the full data set and then use SMC to move to the case-deletion posterior distributions by creating a sequence of auxiliary distributions as described above. For each different case-deletion we then use SMC to find approximations of the posterior for the schedule λ = e^{t/5}, t = -25, ..., 25. Plotting the cross-validation errors as a function of λ using MCMC with a Markov chain of length 10,000 (not shown), we observe that the average squared loss ∑_k (y_k - X_k β)² is a smooth function of λ with a small bump at λ = e² and minimum near e^{3/2}. Thus to minimize prediction error (at least in terms of the squared loss) we should set λ ≈ e^{3/2}. To perform this task in a time-restricted manner we constrained both MCMC and SMC algorithms to work in 10 minutes (+/- 30 seconds). Figures 2.6(a) and 2.6(b) are the resulting plots. The reduced variability of the SMC-based plot allows us to make more accurate conclusions. For instance, it is clear in the plot obtained with SMC (Figure 2.6(b)) that the minimum error lies somewhere around λ = e^{3/2}, whereas from the MCMC plot (Figure 2.6(a)) it could be anywhere between 1 and e².

[Figure 2.6: Plots of cross-validation error as a function of log(λ). (a) Cross-validation error using MCMC with 60 samples (10 minutes); (b) cross-validation error using SMC with 100 samples (10 minutes).]

2.5 Extensions and Conclusions

In our presentation of the algorithm, a fixed sequence of distributions π_t(θ), t = 1, 2, 3, ..., T, is used. However, it is also possible to determine the sequence of distributions automatically such that successive distributions are a fixed distance apart, as measured by ESS. For instance, assume we are interested in π_t(θ) = π(θ | λ_t), where λ is a scalar parameter, and we have a Monte Carlo approximation of π(θ | λ_{t-1}) for an arbitrary t, namely {W_{t-1}^{(i)}, θ_{t-1}^{(i)}}, i = 1, ..., N. We may set λ_t to ensure that ESS = c for a constant c by solving

\left( \sum_{i=1}^{N} (W_t^{(i)})^2 \right)^{-1} = c,

where W_t^{(i)} is given by (2.2). This may be solved numerically or in closed form, if possible. This technique would be beneficial in situations where little or nothing is known about the sequence of distributions, and hence it would be nice to automatically create the sequence; a small sketch of this appears below.

All our examples have considered a sequence of distributions parameterized by a scalar parameter for which the definition of the sequence of target distributions is very intuitive. If we are interested in dealing with multivariate parameters then the algorithm may be adapted by, for instance, creating a grid (or hyper-grid) of distributions. SMC may be used to work across each dimension in succession. It is worth noting that the complexity of the algorithm scales exponentially with dimension, although MCMC does as well.
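A minimal sketch of that adaptive choice: given particles at λ_{t-1}, bisect on λ_t until the ESS of the reweighted particles (2.2) hits the target c. The helper names, and the assumptions that the current weights are uniform (e.g. just after a resample-move step) and that ESS decreases monotonically as λ moves away from λ_{t-1}, are ours.

```python
import numpy as np

def next_lambda(theta, log_target, lam_prev, c, lam_hi, tol=1e-6):
    """Find lambda_t so that reweighting the current particles from
    pi(.|lam_prev) to pi(.|lambda_t) via (2.2) gives ESS = c."""
    base = log_target(theta, lam_prev)

    def ess(lam):
        logw = log_target(theta, lam) - base
        w = np.exp(logw - logw.max()); w /= w.sum()
        return 1.0 / np.sum(w ** 2)

    lo, hi = lam_prev, lam_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess(mid) > c:
            lo = mid            # ESS still above target: take a bigger step
        else:
            hi = mid
    return lo
```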
While we have given two choices of incremental weights, (2.3a) and (2.3b), many other choices are available (Del Moral et al., 2006). In situations where the weights are dependent on the position of the moved particle, such as with (2.3a), auxiliary particle techniques may be used (Pitt and Shephard, 1999; Johansen and Whiteley, 2008). Specifically, we reweight the particles with an approximation of the weight of interest (for instance, (2.3a)) which is only dependent on the particles at time t-1, using W_temp^{(i)} ∝ W_{t-1}^{(i)} × Ŵ_t^{(i)}, where Ŵ_t^{(i)} is the approximation of the incremental weight. After we have resampled and moved the particles, we then compensate for this approximation using W_t^{(i)} ∝ (w_t^{(i)} / Ŵ_t^{(i)}) × W_temp^{(i)}.

We have seen that by adapting importance sampling to move particles between successive distributions, SMC drastically limits the problem of importance sampling degeneracy. By using a resample-move type algorithm, we are able to perform prior sensitivity and cross-validation in a computationally feasible manner while avoiding the aforementioned pitfalls of importance sampling. We have shown the SMC algorithm to be considerably more efficient than existing methods based on reiterative MCMC approximations. In this way regularization path plots and other sensitivity analysis problems can be studied in the context of the full posterior distribution instead of a few summary statistics. In addition, SMC provides a tool for naturally performing cross-validation, and in fact guarantees finite case-deletion weights under much less stringent conditions than importance sampling. Finally, through the importance weights, SMC provides a measure of the distance between distributions, and hence gives a way to select a subset of distributions of interest for exploratory or other purposes.

2.6 References

Albert, J.H. and Chib, S. (1993). "Bayesian Analysis of Binary and Polytomous Response Data." Journal of the American Statistical Association. 88:669-679.

Alqallaf, F. and Gustafson, P. (2001). "On Cross-validation of Bayesian Models." Canadian Journal of Statistics. 29:333-340.

Besag, J., Green, P., Higdon, D., Mengersen, K. (1995). "Bayesian Computation and Stochastic Systems (with discussion)." Journal of the American Statistical Association. 95:1127-1142.

Bhattacharya, S. and Haslett, J. (2007). "Importance Re-sampling MCMC for Cross-Validation in Inverse Problems." Bayesian Analysis. 2:385-408.

Chopin, N. (2002). "A Sequential Particle Filter Method for Static Models." Biometrika. 89:539-552.

Chopin, N. (2004). "Central Limit Theorem for Sequential Monte Carlo Methods and Its Application to Bayesian Inference." Annals of Statistics. 32:2385-2411.

Del Moral, P., Doucet, A., Jasra, A. (2006). "Sequential Monte Carlo Samplers." Journal of the Royal Statistical Society: Series B. 68:411-436.

Doucet, A., Godsill, S., Andrieu, C. (2000). "On Sequential Monte Carlo Sampling Methods for Bayesian Filtering." Statistics and Computing. 10:197-208.

Doucet, A., de Freitas, N., Gordon, N.J. eds. (2001).
Sequential Monte Carlo Methods in Practice. New York: Springer.

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004). "Least Angle Regression." Annals of Statistics. 32:407-499.

Epifani, I., MacEachern, S., Peruggia, M. (2005). "Case-Deletion Importance Sampling Estimators: Central Limit Theorems and Related Results." Technical Report No. 720, Department of Statistics, Ohio State University.

Geweke, J. (1989). "Bayesian Inference in Econometric Models Using Monte Carlo Integration." Journal of the American Statistical Association. 88:881-889.

Gilks, W.R. and Berzuini, C. (2001). "Following a Moving Target: Monte Carlo Inference for Dynamic Bayesian Models." Journal of the Royal Statistical Society: Series B. 63:127-146.

Gustafson, P. and Wasserman, L. (1995). "Local Sensitivity Diagnostics for Bayesian Inference." Annals of Statistics. 23:2153-2167.

Gustafson, P. (1996). "Local Sensitivity of Inferences to Prior Marginals." Journal of the American Statistical Association. 91:774-781.

Jasra, A. and Doucet, A. "Stability of Sequential Monte Carlo Samplers via the Foster-Lyapunov Condition." Statistics and Probability Letters. To appear.

Johansen, A. and Whiteley, N. (2008). "A Modern Perspective on Auxiliary Particle Filters." Proceedings of Workshop on Inference and Estimation in Probabilistic Time Series Models. Isaac Newton Institute, June.

Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P. (1983). "Optimization by Simulated Annealing." Science. 220:671-680.

Kitagawa, G. (1996). "Monte Carlo Filter and Smoother for Non-Gaussian, Non-linear State Space Models." Journal of Computational and Graphical Statistics. 5:1-25.

Liu, J.S. and Chen, R. (1998). "Sequential Monte Carlo Methods for Dynamic Systems." Journal of the American Statistical Association. 93:1032-1044.

Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing (2nd ed.). New York: Springer.

McDonald, G. and Schwing, R. (1973). "Instabilities of Regression Estimates Relating Air Pollution to Mortality." Technometrics. 15:463-481.

Neal, R. (2001). "Annealed Importance Sampling." Statistical Computing. 11:125-139.

Park, T. and Casella, G. (2008). "The Bayesian Lasso." Journal of the American Statistical Association. 103:681-686.

Peruggia, M. (1997). "On the Variability of Case-Deletion Importance Sampling Weights in the Bayesian Linear Model." Journal of the American Statistical Association. 92:199-207.

Pitt, M.K. and Shephard, N. (1999). "Filtering Via Simulation: Auxiliary Particle Filters." Journal of the American Statistical Association. 94:590-591.

Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society: Series B. 58:267-288.

Vidakovic, B. (1998). "Wavelet-Based Nonparametric Bayes Methods." In Practical Nonparametric and Semiparametric Bayesian Statistics. D. Dey, P. Müller, and D. Sinha (eds.). New York: Springer, 133-256.

Zellner, A. (1986). "On Assessing Prior Distributions and Bayesian Regression Analysis with G-prior Distributions." Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. pp. 233-243.

Chapter 3

Structural Health Monitoring with Autoregressive Support Vector Machines

3.1 Introduction²

²A version of this chapter has been accepted for publication. Bornn, L., Farrar, C.R., Park, G., Farinholt, K. (2008). "Structural Health Monitoring with Autoregressive Support Vector Machines." Journal of Vibration and Acoustics.

The extensive literature on structural health monitoring (SHM) has documented the critical importance of detecting damage in aerospace, civil, and mechanical engineering systems at the earliest possible time.
For instance, airlines may be interested in maximizing the lifespan and reliability of their jet engines, or governmental authorities might like to monitor the condition of bridges and other civil infrastructure in an effort to develop cost-effective lifecycle maintenance strategies. These examples indicate that the ability to efficiently and accurately monitor all types of structural systems is crucial for both economic and life-safety reasons. One such monitoring technique is vibration-based damage detection, which is based on the principle that damage in a structure, such as a loosened connection or a crack, will alter the dynamic response of that structure. There has been much recent work in this area; in particular, Doebling et al. (1998) and Sohn et al. (2004) present detailed reviews of vibration-based SHM. Because of random and systematic variability in experimentally measured dynamic response data, statistical approaches are necessary to ensure that changes in a structure's measured dynamic response are a result of damage and not caused by operational and environmental variability. Although much of the vibration-based SHM literature focuses on deterministic methods for identifying damage from changes in dynamic system response, we will focus on approaches that follow a statistical pattern recognition paradigm for SHM (Farrar and Worden, 2007). This paradigm consists of four steps: 1. operational evaluation, 2. data acquisition, 3. feature extraction, and 4. statistical classification of features. The work presented herein focuses on steps 3 and 4 of this paradigm.

One approach for performing SHM is to fit a time series predictive model such as an autoregressive (AR) model to each sensor output using data known to be acquired from the structure in its undamaged state. These models are then used to predict subsequent measured data, and the residuals (the difference between the model's prediction and the observed value) are the damage-sensitive feature that is used to check for anomalies. This process provides many estimates (one at each time step) of a single-dimension feature, which is advantageous for subsequent statistical classification. The logic behind this approach is that if the model fit to the undamaged sensor data no longer predicts the data subsequently obtained from the system (and hence the residuals are large and/or correlated), there has been some sort of change in the process underlying the generation of the data. This change is assumed to be caused by damage to the system. Such linear time series models have been used in damage detection processes spanning a wide range of structures and associated damage scenarios, including cracking in concrete columns (Fugate et al., 2001; Sohn et al., 2000), loose connections in a bolted metallic frame structure (Allen et al., 2002), and damage to insulation on wiring (Clark, 2008). However, the linear nature of this modeling approach limits the scope of application and the ability to accurately assess the condition of systems that exhibit nonlinearity in their undamaged state.
In this paper, we demonstrate how support vector machines (SVMs) may be used to create a non-linear time series model that provides an alternative to these linear AR models.

Once a model has been chosen and the predictions from this model have been compared to actual sensor data, there are several statistical methods for analyzing the resulting residuals. Sequential hypothesis tests, such as the sequential probability ratio test (Allen et al., 2002), may be used to test for changes in the residuals. Alternatively, statistical process control procedures, typically in the form of control charts, may be used to indicate abnormalities in the residuals (Fugate et al., 2001). In addition, sliding window approaches examine the features of successive subsets of data to detect anomalies (e.g., Clark, 2008). For example, the sliding window approach of Ma and Perkins (2003) sets thresholds for the residuals such that the probability of an undamaged residual exceeding the threshold is 5%. A subset of n consecutive data points is then checked, and large values of the number g of points exceeding the threshold indicate damage, where g has a binomial distribution (i.e., g ~ Bin(n, 0.05)).

To date, most of these time series modeling approaches analyze data from one sensor at a time, and typically some scheme is used to determine how many sensors need to indicate damage in order to trigger a system check (e.g., Herzog et al., 2005). As an alternative, in this paper we examine a statistically based method for combining the output of multiple sensors. From this combined output analysis, we can establish the existence of damage and also determine which sensors are contributing to the anomalous readings, in an effort to locate the damage within the sensor network's spatial distribution. Previously, Sohn et al. (2000) used principal component analysis to combine data from an array of sensors, but that study only examined the combined data in an effort to establish the existence of damage.

We first present a summary of the SVM approach to nonlinear time series modeling. This procedure is illustrated on numerically generated data with artificial anomalies added to the baseline signal in an effort to simulate damage. This time series modeling approach is then compared to linear AR models. Next, the SVM method is coupled with a statistical analysis procedure that combines modeling results from multiple sensors in an effort to establish both the existence and the location of damage. This procedure is applied to data from a laboratory test structure with damage that results in local nonlinear system response.

3.2 SVM-based SHM

Existing methods for performing damage detection extract damage-sensitive features from data acquired on the undamaged system, and then use changes in those features as an indicator of damage. An AR model can be fit to the undamaged sensor output, and the residuals from predictions of subsequent data using this baseline model are then monitored for statistically significant changes that are assumed to be caused by damage. Specifically, an AR model with p autoregressive terms, AR(p), applied to sensor k may be written as

x_t^k = \sum_{j=1}^{p} \phi_j^k x_{t-j}^k + e_t^k    (3.1)

where x_t^k is the measured signal at discrete time t from the kth sensor, the \phi_j^k are the AR coefficients or model parameters, and e_t^k is an unobservable noise term. Thus an AR model works by fitting a simple linear model to each point, with the previous p observed points as the independent variables.
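As a concrete illustration of Equation (3.1), a least-squares AR(p) fit and its one-step-ahead residuals can be computed in a few lines. This Python sketch is ours, with illustrative names; the Yule-Walker solution mentioned below is an equally standard alternative.

```python
import numpy as np

def lag_matrix(x, p):
    """Rows are (x_{t-1}, ..., x_{t-p}) for t = p, ..., n-1 (0-indexed)."""
    return np.column_stack([x[p - j:len(x) - j] for j in range(1, p + 1)])

def fit_ar(x, p):
    """Least-squares AR(p) fit: x_t ~ phi_1 x_{t-1} + ... + phi_p x_{t-p}."""
    phi, *_ = np.linalg.lstsq(lag_matrix(x, p), x[p:], rcond=None)
    return phi

def ar_residuals(x, phi):
    """One-step-ahead prediction residuals under a fitted AR model."""
    p = len(phi)
    return x[p:] - lag_matrix(x, p) @ phi
```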
Note that an n-point time series will yield n - p equations that can be used to generate a least-squares estimate of the AR coefficients, or the Yule-Walker method can be used to solve for the coefficients (Brockwell and Davis, 1991). Autoregressive models work particularly well when modeling the response of linear, time-invariant systems. If the undamaged system is nonlinear, the AR process gives the best linear fit to the measured response, but there is no guarantee that this model will accurately predict responses obtained when the system is subjected to other inputs.

Because of the broad array of structural health monitoring problems, employing a linear model confines the scope of problems for which the AR methodology is appropriate. We thus seek to extend the fidelity of this general damage detection approach by employing a non-linear AR-type model based upon SVMs, which have seen widespread use in the machine learning and statistical classification fields. To simplify future development, we denote the vector (x_{t-p}^k, ..., x_{t-1}^k) as x_{t-p:t-1}^k. SVMs have many features that make them an appropriate choice for SHM based on time series analysis: with the right settings and appropriate training they are able to model any nonlinear relationship between the current time point, x_t, and the p previous time points, x_{t-p:t-1}; they are well suited for high-dimensional problems; and the methodology is easily generalized and highly adaptable. Although SVMs have been used for SHM before (e.g., Worden and Manson, 2007; Shimada et al., 2006; Bulut et al., 2005; Worden and Lane, 2001; Chattopadhyay et al., 2007), these approaches predominantly focus on one- and two-class SVMs, which are used for outlier detection and group classification, respectively. Our approach is unique in its combination of support vector regression, autoregressive techniques, and residual error analysis. Thus, while earlier approaches classify sections of the time-series response as damaged or undamaged directly (the dependent variable being a binary indicator), our methodology works by using support vector regression to model the raw time-series data, then subsequently predicting damage by monitoring the residuals of the model. We follow the development of SVMs for regression of Smola and Schölkopf (2004) and Ma and Perkins (2003).

First, assume we have data from a set of K sensors and we have measurements without damage for time t = 1, ..., t_0 (i.e., if there is damage, it occurs after time t_0). Next we must decide the order p of our model. There are many methods for selecting p, such as partial autocorrelation or the Akaike Information Criterion (AIC), which are discussed in more detail in Fugate et al. (2001). In general, we seek the lowest-order model that captures the underlying physical process and hence will generalize to other data sets. As with linear AR modeling, we create the training set on which to build our SVM-based model by using each observation as the dependent variable and the previous p observations as independent variables. Our training samples are thus

\{(x_{t-p:t-1}^k, x_t^k),\; t = p+1, ..., t_0\}.

Ideally we would like to find a function f such that f(x_{t-p:t-1}^k) = x_t^k for all k and t \le t_0. However, the form of f is often restricted to the class of linear functions (as is the case for AR models),

f(x_{t-p:t-1}) = \langle w, x_{t-p:t-1} \rangle    (3.2)

where \langle \cdot, \cdot \rangle denotes the dot (or inner) product and w is a vector of model parameters.
This restricted form makes a perfect fit of the data impossible in most scenarios. As a result, we allow predictions from f to have an error bounded by \epsilon, and find w under this constraint. With the recent advances in penalized regression methods such as ridge regression and the lasso, the improved prediction performance of shrunken (or smoothed) models is now well understood (Copas, 1997; Fu, 1998). Thus, in order to provide a model that maximizes prediction performance, we seek to incorporate shrinkage on the model parameters w. Such a shrunken w may be found by minimizing the Euclidean norm subject to the error constraint, namely

minimize \tfrac{1}{2}\|w\|^2    (3.3)
subject to x_t - \langle w, x_{t-p:t-1} \rangle \le \epsilon
           \langle w, x_{t-p:t-1} \rangle - x_t \le \epsilon

This model relies on the assumption that a linear model is able to fit the data to within precision \epsilon. Typically, however, such a linear model does not exist, even for moderate settings of \epsilon. As such, we introduce the slack variables \xi_t and \xi_t^* to allow for deviations beyond \epsilon. The resulting formulation is

minimize \tfrac{1}{2}\|w\|^2 + C \sum_{t=p+1}^{t_0} (\xi_t + \xi_t^*)    (3.4)
subject to x_t - \langle w, x_{t-p:t-1} \rangle \le \epsilon + \xi_t
           \langle w, x_{t-p:t-1} \rangle - x_t \le \epsilon + \xi_t^*
           \xi_t, \xi_t^* \ge 0

The constant C controls the tradeoff between keeping w small and penalizing deviations larger than \epsilon. In this form we see that only points that lie outside the bound \epsilon have an effect on w. Figure 3.1 illustrates the process graphically.

Figure 3.1: Illustration of linear support vector regression fit

Although this optimization problem is straightforward to carry out, the extension to non-linearity is revealed by the dual formulation. We thus proceed by constructing a Lagrange function of the above by introducing a set of dual variables:

L := \tfrac{1}{2}\|w\|^2 + C \sum_{t=p+1}^{t_0} (\xi_t + \xi_t^*) - \sum_{t=p+1}^{t_0} \alpha_t (\epsilon + \xi_t - x_t + \langle w, x_{t-p:t-1} \rangle) - \sum_{t=p+1}^{t_0} \alpha_t^* (\epsilon + \xi_t^* + x_t - \langle w, x_{t-p:t-1} \rangle) - \sum_{t=p+1}^{t_0} (\eta_t \xi_t + \eta_t^* \xi_t^*)    (3.5)

where the dual variables \alpha_t, \alpha_t^*, \eta_t, \eta_t^* are understood to be non-negative. It can be shown that this function has a saddle point at the optimal solution, and hence

\partial_w L = w - \sum_{t=p+1}^{t_0} (\alpha_t - \alpha_t^*) x_{t-p:t-1} = 0    (3.6)
\partial_{\xi_t} L = C - \alpha_t - \eta_t = 0
\partial_{\xi_t^*} L = C - \alpha_t^* - \eta_t^* = 0

Plugging these saddle-point constraints into L yields the following dual optimization problem:

maximize -\tfrac{1}{2} \sum_{t,t'=p+1}^{t_0} (\alpha_t - \alpha_t^*)(\alpha_{t'} - \alpha_{t'}^*) \langle x_{t-p:t-1}, x_{t'-p:t'-1} \rangle - \epsilon \sum_{t=p+1}^{t_0} (\alpha_t + \alpha_t^*) + \sum_{t=p+1}^{t_0} x_t (\alpha_t - \alpha_t^*)    (3.7)
subject to \alpha_t, \alpha_t^* \in [0, C]

Notice that by the saddle-point constraint w = \sum_{t'=p+1}^{t_0} (\alpha_{t'} - \alpha_{t'}^*) x_{t'-p:t'-1} we may write f as

f(x_{t-p:t-1}) = \sum_{t'=p+1}^{t_0} (\alpha_{t'} - \alpha_{t'}^*) \langle x_{t'-p:t'-1}, x_{t-p:t-1} \rangle    (3.8)

In this way w may be viewed as a linear combination of the training points x_{t'-p:t'-1}. Note also that in this formulation both f and the corresponding optimization can be described entirely in terms of dot products between the data. We can therefore transform the data using a function \Phi: \mathbb{R}^p \to F and compute the dot products in the transformed space. Such mappings allow us to extend beyond the linear framework presented above. Specifically, the mapping allows us to fit linear functions in F which, when converted back to \mathbb{R}^p, are nonlinear. A toy example of this process is illustrated in Figure 3.2 for a mapping \mathbb{R}^2 \to \mathbb{R}^3, namely \Phi(x, y) = (x^2, xy, y); here the data are generated using the relationship y = x^2, which becomes linear in the transformed space.

Figure 3.2: Illustration of mapping to an alternate space to induce linearity

To make use of this transformed space, we replace the dot product term with

\langle \Phi(x_{t'-p:t'-1}), \Phi(x_{t-p:t-1}) \rangle    (3.9)

If F is of high dimension, then the above dot product will be extremely expensive to compute. In some cases, however, there is a corresponding kernel that is simple to compute.
For example, the kernel k(x, y) = (x \cdot y)^d corresponds to a map \Phi into the space spanned by all products of exactly d dimensions of \mathbb{R}^p. When d = 2 and p = 2, for instance, we have

(x \cdot y)^2 = ((x_1, x_2) \cdot (y_1, y_2))^2 = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2) \cdot (y_1^2, \sqrt{2}\,y_1 y_2, y_2^2) = \langle \Phi(x), \Phi(y) \rangle    (3.10)

defining \Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2). More generally, it has been shown that every kernel giving rise to a positive semi-definite matrix (k(x_i, x_j))_{ij} has a corresponding map \Phi (Smola and Schölkopf, 2004). One family of kernels we focus on is the radial basis function (RBF) kernels, which have the form

k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))    (3.11)

where \sigma^2 is the kernel variance. This parameter controls fit, with large values leading to smoother functions and small values leading to a closer fit. In practice, moderate values are preferred as a trade-off between model fit and prediction performance.

Whereas a traditional AR(p) model employs a linear model that is a function of the previous p time points, the SVM model compares the previous p time points to all groups of p successive data points from the training sample. Specifically, the model has the form

f(x_{t-p:t-1}^k) = \sum_{j=p+1}^{t_0} \beta_j \, k(x_{j-p:j-1}^k, x_{t-p:t-1}^k)    (3.12)

Typically only a small fraction of the coefficients \beta_j are non-zero. The corresponding samples x_{j-p:j-1}^k are called the support vectors of the regression function, because only these select samples are used in the formulation of the model. Once we have trained the model, we use it to predict each future observation. We then take the residuals and use them as an indicator of structural change. For our purposes we employ a control chart to monitor whether the system generating the data has changed. In this discussion the control chart is created by constructing 99% control lines that correspond to 99% confidence intervals for the residuals of the model fit to the undamaged data, assuming the residuals are normally distributed. This normality assumption is further discussed in the experimental results below. These control lines are then extended through the remaining (potentially damaged) data, and damage is indicated when a statistically significant number of residuals, in this case more than 1%, lie outside these lines. Note that damage can also be indicated when the residuals no longer have a random distribution, even if they do not lie outside the control lines.

RBF neural networks, which have the same form as Equation (3.12), have previously been used to perform SHM (e.g., Rytter and Kirkegaard, 1997). However, fitting these networks requires much more user input, such as selecting which \beta_j are non-zero as well as selecting the corresponding training points. In addition, fitting the neural network model is a rather complicated nonlinear optimization process relative to the simple quadratic optimization used in the support vector framework. Although the SVM models are more easily developed, Schölkopf et al. (1997) have demonstrated that SVMs still predict the data more accurately than RBF neural networks despite their simplicity.
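To show how Equations (3.11) and (3.12) come together in practice, here is a hedged Python sketch using scikit-learn's SVR (one of many SVM implementations; the experiments in this chapter use libSVM). The arrays x_undamaged and x_new are assumed inputs, and scikit-learn's gamma parameter corresponds to 1/(2\sigma^2) in Equation (3.11).

```python
import numpy as np
from sklearn.svm import SVR

def embed(x, p):
    """Arrange a signal as (previous p points -> current point) pairs."""
    X = np.column_stack([x[p - j:len(x) - j] for j in range(1, p + 1)])
    return X, x[p:]

p = 5
X_train, y_train = embed(x_undamaged, p)   # x_undamaged: assumed 1-D array
# RBF kernel as in (3.11); sklearn's gamma = 1/(2*sigma^2), so sigma^2 = 1 here
model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5)
model.fit(X_train, y_train)

X_new, y_new = embed(x_new, p)             # x_new: assumed test signal
residuals = y_new - model.predict(X_new)   # the damage-sensitive feature
```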
3.2.1 Example: Simulated Damage

We now compare the performance of the SVM-based damage detection method to a traditional AR model with coefficients estimated by the Yule-Walker method (see, for example, Brockwell and Davis, 1991). The data are generated as follows for discrete time points t = 1, ..., 1200:

x_t = \sin^3(400\pi t/1200) + \sin^2(400\pi t/1200) + \sin(200\pi t/1200) + \sin(100\pi t/1200) + \Psi_t + \epsilon_t    (3.13)

where \epsilon_t is Gaussian random noise with mean 0 and standard deviation 0.1, and \Psi_t is a damage term. Three different damage cases are added to this time series at various times, as defined by

\Psi_t = \epsilon_{1,t}            for t = 600, ..., 650
\Psi_t = \sin(1000\pi t/1200)      for t = 800, ..., 850
\Psi_t = -\epsilon_{2,t}           for t = 1000, ..., 1050
\Psi_t = 0                         otherwise

where \epsilon_1 and \epsilon_2 are Gaussian random noise with means 0 and 1 and standard deviations 0.5 and 0.2, respectively. Through the use of \Psi we attempt to simulate several different types of damage and compare each model's performance in handling them. The raw signal is plotted in Figure 3.3, where it can be seen that the changes caused by the damage are somewhat subtle.

Figure 3.3: Raw simulated data with highlighted artificial damage

Figure 3.4: Autocorrelation plot of simulated data

The order p for both models was set at 5, as determined from the autocorrelation plot in Figure 3.4. This plot measures the correlation between time points as a function of the time lag between them. We see from the plot that after a lag of 5 the correlation is quite small, and hence little information is gained by including a longer past history p. This is a standard method for determining model order for traditional AR models, and as such should maximize that method's performance, ensuring the SVM-based model isn't afforded an unfair advantage.

The results of applying both the SVM model and a traditional AR model to the undamaged portion of the signal between time points 400 and 600 are shown in Figure 3.5, where the signals predicted by these models are overlaid on the actual signal. A qualitative visual assessment of Figure 3.5 shows that the SVM more accurately predicts this signal. A quantitative assessment is made by examining the distribution of the residual errors obtained with each model. The standard deviation of the residual errors from the SVM model is 0.26, while for the traditional AR model it is 0.71, again indicating that the SVM more accurately predicts the undamaged portion of this time series.

Figure 3.5: SVM (top) and linear AR models (bottom) fit to a subset of the data

In order for a model to excel at detecting damage, it must fit the undamaged data well (i.e., small and randomly distributed residual errors) while fitting the damaged data poorly, as identified by increased residual errors with possibly non-random distributions. In other words, the model must be sensitive to distributional changes in the data that result from damage. To quantify such changes, a control chart is developed based on the undamaged portion of the time series to establish statistically based thresholds for the damage detection process. As mentioned earlier, this control chart is calculated based on the fit to the undamaged data; specifically, 99% confidence lines are drawn based on the undamaged residual errors and carried forward for comparison on the potentially damaged data. It is in this part of the process that the SVM's ability to more accurately represent the data enhances damage detection. The 99% confidence lines for the SVM are much closer to the mean value of the residual errors and, hence, will more readily identify small perturbations to the underlying system that produce changes in the residual error distribution. In addition, the traditional AR model shows a trend in the residuals, indicating lack of model fit, even in the undamaged case.
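A minimal sketch of the control-chart step just described, assuming approximately Gaussian residuals stored in NumPy arrays (2.576 is the 99.5% normal quantile, giving two-sided 99% lines); the names are illustrative.

```python
import numpy as np

def control_limits(resid_undamaged, z=2.576):
    """99% control lines estimated from residuals on known-undamaged data."""
    mu, sd = resid_undamaged.mean(), resid_undamaged.std()
    return mu - z * sd, mu + z * sd

def exceedance_rate(resid_new, lo, hi):
    """Fraction of new residuals outside the lines; rates well above the
    nominal 1% suggest a change in the system generating the data."""
    return np.mean((resid_new < lo) | (resid_new > hi))
```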
We see that during the times of damage the residuals for the SVM-based model exceed the control limits more often than do the residuals from the traditional AR model. In fact, the latter method would likely miss the damage between time points 1000 and 1050, where only one point exceeds the threshold, versus over 10 for the SVM-based model. This result can be seen in Figure 3.6.

Figure 3.6: Residuals from SVM (top) and linear AR models (bottom) applied to simulated data. The 99% control lines based on the residuals from the undamaged portion of the signal are shown in red.

Since each method performs differently for different sources of damage, it is of interest to determine when each method will be successful in indicating damage. Since the traditional AR model fits a single model to the entire data set, model fit will be very poor if the data are non-stationary (for instance, if the excitation is in the form of hammer impacts). Additionally, since the traditional AR model as presented above does not contain a moving average term, it will continue to fit when damage takes the form of a shift up or down in the raw time series (as demonstrated by the third damage scenario above). Conversely, the SVM-based method works by comparing each length-p segment of data to all corresponding sets in the training set. Thus, if a similar sequence exists in the training set, we can expect the fit to be quite good. We see two scenarios in which the SVM-based method will perform poorly. Firstly, if there is some damage in the nominally undamaged training data, and similar damage occurs in the testing set, the model will likely fit this portion quite well. Secondly, if damage manifests itself in such a way that the time-series data are extremely similar to the undamaged time series, the SVM methodology will be unable to detect it. However, we should emphasize that other methods, including the AR model, will suffer in such scenarios as well. As an attempted solution when the sensitivity of the method to a given type of damage is unknown and simulation tests are impossible, various damage detection methods could potentially be combined to boost detection power.

3.3 Joint Online SHM

In the undamaged state, a Gaussian distribution can often approximate the residuals from fitted models, or control charts can be developed to invoke the central limit theorem and force some function of the residual errors to have a Gaussian distribution, such as with an x-bar control chart (Fugate et al., 2001). If we have K sensors, each of whose residuals are Gaussian distributed, we would like a way of combining these residuals to produce a damage detection method that examines all K sensors. Noticing that the sum of K squared standard Gaussian random variables is distributed as a chi-squared random variable with K degrees of freedom, we square the residuals from each sensor (after they are normalized to have mean 0 and variance 1 based on the undamaged data) and add them together to create a new combined residual. These combined residuals follow a chi-squared distribution, and hence we can make probabilistic statements about the residuals being typical or not (the latter indicative of damage). Specifically, consider the combined residual at some time point t,

\sum_{k=1}^{K} (r_t^k)^2    (3.15)

where r_t^k is the normalized residual at time t for sensor k.
Assuming the original residuals are Gaussian distributed, this random variable will have a chi-squared distribution with K degrees of freedom. Note that even when the original residuals are not approximately Gaussian, we may still employ a control chart on the combined residuals to give probabilistic statements regarding damage. For instance, when the residual errors from the fitted model have thicker tails than the Gaussian, control charts must be employed to make probabilistic statements about the combined residual. However, as we'll see in the following example, the residual errors are often very close to Gaussian.

In addition to allowing us to make statements regarding damage from multiple sensor outputs, these combined residuals also provide a mechanism for determining which sensors are most influenced by the damage. This latter property is of particular importance for damage location. If the combined residual is large, and hence we determine that there is damage, we can look at the values (r_t^k)^2 for each sensor and from their magnitudes determine which sensors contributed the most to this large combined residual. If we detect damage over a range of time points, we may average (r_t^k)^2 over this range for each sensor to determine how much each sensor is contributing to the anomalous reading.
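A sketch of Equation (3.15) and the per-sensor attribution just described; the chi-squared threshold assumes approximately Gaussian residuals, and all names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def combined_residuals(R, mu, sd):
    """Equation (3.15): normalize each sensor's residuals by its undamaged
    mean and standard deviation, square, and sum across the K sensors.
    R is a (T, K) array of residuals; mu and sd are length-K arrays."""
    Z = (R - mu) / sd
    return np.sum(Z ** 2, axis=1), Z

def sensor_contributions(Z, flagged):
    """Average squared normalized residual per sensor over flagged times,
    used to rank which sensors drive an alarm."""
    return (Z[flagged] ** 2).mean(axis=0)

# Under approximate normality the combined residual is chi-squared with
# K degrees of freedom, so a 99% threshold for K = 4 sensors is:
threshold = chi2.ppf(0.99, df=4)
```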
3.3.1 Example: Experimental Data

We examine joint online SHM using SVMs on experimental data from a structure designed to produce non-linear response when it is damaged. The structure is a three-story building (Figure 3.7) consisting of aluminum columns and plates with bolted joints and a rigid base that is constrained to slide horizontally on two rails when excited by an electro-dynamic shaker. Each floor is a 30.5 x 30.5 x 2.5 cm plate and is separated from adjacent floors by four 17.7 x 2.5 x 0.6 cm columns. To induce non-linear behavior, a 15.0 x 2.5 x 2.5 cm column is suspended from the top floor and a bumper is placed on the second floor. The contact of this suspended column with the bumper results in non-linear effects. The initial gap between the suspended column and the bumper is adjusted to simulate different levels of damage. In our test data we employ the case where the column is set 0.05 mm away from the bumper. The undamaged data are obtained when the bumper and suspended column do not contact each other. The structure is subjected to a random base excitation from the shaker in both its undamaged and damaged conditions. Accelerometers mounted on each floor record the response of the structure to these base excitations. A more detailed description of the test structure and the data obtained can be found at www.lanl.gov/projects/ei.

Figure 3.7: Diagram of experimental structure

We first concatenate the undamaged data with the damaged data to demonstrate that the proposed methodology adequately detects the damage. The SVM time series models are developed for each of the accelerometer measurements from the undamaged data as follows (a sketch of the sliding-window test in step 5 follows this list):

1. Select the number of time lags that will be used in the time series models. In this case eight time lags were used, based on the AIC. Note that the number of time lags is analogous to the order of an AR model.

2. Select the parameters of the SVM model, including the kernel type and corresponding parameters as well as C and \epsilon, which control model fit as described earlier. In our case we used a Gaussian kernel with variance 1 and set C = 1 and \epsilon = 0.1. We have found the methodology to be robust to choices of variance ranging over an order of magnitude. In addition, C could be increased to force fitting of extreme values, and \epsilon could be lowered to enforce a closer fit to the training data.

3. Pass the data (arranged with each observation as the dependent variable and the previous p points as independent variables) to the optimization described by Equation (3.7). In this case we use the first 6000 undamaged points as training data. This step is handled by the wide variety of support vector machine software available across multiple computing environments, including MATLAB and R. In particular, we employ the libSVM library with its accompanying MATLAB interface (Chang and Lin, 2001).

4. Once the SVM model is trained (i.e., the \beta_j in Equation (3.12) are selected) in step 3, make predictions based on the new test data from the structure in its undamaged or damaged condition. Next, calculate the residuals between the measured data and the output of the time series prediction.

5. Square and add the residuals from each sensor as described by Equation (3.15). Build a control chart for these combined residuals to detect damage (perhaps in conjunction with statistical tests such as a sliding window approach).

Note that steps 1 through 4 of this process are applied to each time series recorded by the four accelerometers shown in Figure 3.7.
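The sliding-window test referenced in step 5 might look as follows; this is our sketch of a Ma and Perkins (2003) style detector, not the original MATLAB implementation.

```python
import numpy as np

def sliding_window_flags(combined, threshold, window=6, min_hits=1):
    """Flag a window when at least min_hits of its points exceed the 99%
    threshold (window=6, min_hits=1 matches the setting used below)."""
    hits = combined > threshold
    return np.array([hits[i:i + window].sum() >= min_hits
                     for i in range(len(hits) - window + 1)])
```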
First we revisit the normality assumption that was made in constructing the control chart. Figure 3.8 shows the Q-Q plot for the residuals from the SVM model fit to sensor 4 data obtained with the structure in its undamaged state. The Q-Q plot compares the sample quantiles of the residuals to theoretical quantiles of a Gaussian distribution. We see in this figure that the sample quantiles fall very close to the theoretical line, and hence our residuals are approximately Gaussian.

Figure 3.8: Q-Q plot of residuals from SVM model

Figure 3.9 shows the residual errors from the SVM fit to each of the accelerometer readings and the corresponding 99% control limits, which are based on the first 6000 points from the undamaged portion of each signal. There are 8192 undamaged points and 8192 damaged ones; thus, when we concatenate the data, the damage occurs at time point 8193 of 16384. Figure 3.10 shows the density of the normalized residual errors from all the sensors, combined according to Equation (3.15). We see that the distribution is very nearly chi-squared. In situations where the original residuals aren't normal, this result won't hold, and hence probabilistic statements regarding the presence of damage must be made based on control charts.

Figure 3.9: Residuals from 4 sensors for t = 7000, ..., 9384. The 99% control lines are shown in red.

Figure 3.10: Density estimate of combined residual (black) vs. chi-squared distribution (red)

Figure 3.11 shows the combined residuals as a function of time. The blue points in the plot show damage indication using the sliding window approach of Ma and Perkins (2003), as described in the introduction and based on the 99% control lines. Specifically, we use a window size of 6 which, when combined with the 99% control limit, detects damage whenever 1 or more of the 6 points in the window exceeds the control line (equivalent to a binomial probability of 0.05).

Figure 3.11: Combined residuals from all 4 sensors. The 99% control line is shown in red; the sliding window damage indicator is shown in blue.

We see from Figure 3.9 that sensors 3 and 4 are most influenced by the damage. This result is expected, as the bumper is mounted between these two sensors. In fact, if we look at the average values of (r_t^k)^2 (the individual squared residuals for sensor k) over the damaged section for each sensor, we see that the first two sensors have values 0.96 and 1.24, whereas the second two sensors have values 59.80 and 38.2, respectively. Thus from this numerical rating we can see that sensors 3 and 4 are most influenced by the damage, which agrees with the result shown in Figure 3.9.

From this analysis it is evident that we can use the combined residuals to establish the presence of damage in a statistically rigorous manner and then examine the individual sensor residuals in an effort to locate the sensors most influenced by the damage. This latter information can be used to help locate the damage, assuming that the damage is confined to a discrete location such as the formation of a crack in a welded connection. Further investigation is needed to assess how this procedure could be used to locate more distributed damage such as that associated with corrosion.

3.4 Conclusion

Although the application of statistical techniques to structural health monitoring has been investigated in the past, these techniques have predominantly been limited to identifying damage-sensitive features derived from linear models fit to the output of individual sensors. As such, they are typically limited to identifying only that damage has occurred. In general, these methods are not able to identify which sensors are associated with the damage in an effort to locate the damage within the resolution of the sensor array. To improve upon this approach to damage detection, we have applied support vector machines to model sensor output time histories and have shown that such nonlinear regression models more accurately predict the time series when compared to linear autoregressive models. Here the metric for this comparison is the residual error between the measured response data and the predictions of the time series model.

The support vector machine autoregressive method is superior to traditional linear AR both in its ability to handle nonlinear dynamics and in the structure of the model. Specifically, the support vector approach compares each new testing point to the entire training set, whereas the traditional AR model finds a simple linear relationship to best describe the entire training set, which is then used on the testing data. For example, when dealing with transient impact data, the AR model will fail in trying to fit the entire time domain with a simple linear model.
Whereas in the past RBF neural networks have been used to tackle this problem, these networks require significant user input and complex methods for fitting the model to the training data; hence the simpler support vector framework is preferred.

Furthermore, we have shown how the residuals from the SVM prediction of each sensor time history may be combined in a statistically rigorous manner to provide probabilistic statements regarding the presence of damage as assessed from the amalgamation of all available sensors. In addition, this methodology allows us to pinpoint the sensors that are contributing most to the anomalous readings and therefore to locate the damage within the sensor network's spatial resolution. The process was demonstrated on a test structure where damage was simulated by introducing an impact type of nonlinearity between the measured degrees of freedom. The authors acknowledge that the approach has only been demonstrated on a structure that was tested in a well-controlled laboratory setting. This approach will have to be extended to structures subjected to real-world operational and environmental variability before it can be used in practice. However, the approach has the ability to adapt to such changes through the analysis of appropriate training data that span these conditions. Therefore, follow-on studies will focus on applying this approach to systems with operational and environmental variability as well as systems that exhibit nonlinear response in their undamaged state.

3.5 References

Allen, D., Sohn, H., Worden, K., and Farrar, C. (2002). "Utilizing the Sequential Probability Ratio Test for Building Joint Monitoring." Proc. of SPIE Smart Structures Conference. San Diego, March 2002.
Brockwell, P. and Davis, R. (1991). Time Series Analysis: Forecasting and Control. Prentice-Hall.
Bulut, A., Singh, A.K., Shin, P., Fountain, T., Jasso, H., Yan, L., and Elgamal, A. (2005). "Real-time Nondestructive Structural Health Monitoring Using Support Vector Machines and Wavelets." Proc. SPIE. 5770:180-189.
Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: A Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chattopadhyay, A., Das, S., and Coelho, C.K. (2007). "Damage Diagnosis Using a Kernel-based Method." Insight - Non-Destructive Testing and Condition Monitoring. 49:451-458.
Clark, G. (2008). "Cable Damage Detection Using Time Domain Reflectometry and Model-Based Algorithms." Lawrence Livermore National Laboratory document LLNL-CONF-402567.
Copas, J.B. (1997). "Using Regression Models for Prediction: Shrinkage and Regression to the Mean." Statistical Methods in Medical Research. 6:167-183.
Doebling, S., Farrar, C., Prime, M., and Shevitz, D. (1998). "A Review of Damage Identification Methods that Examine Changes in Dynamic Properties." Shock and Vibration Digest. 30:91-105.
Farrar, C.R. and Worden, K. (2007). "An Introduction to Structural Health Monitoring." Philosophical Transactions of the Royal Society A. 365:303-315.
Fu, W.J. (1998). "Penalized Regressions: The Bridge Versus the Lasso." Journal of Computational and Graphical Statistics. 7:397-416.
Fugate, M., Sohn, H., and Farrar, C.R. (2001). "Vibration-Based Damage Detection Using Statistical Process Control." Mechanical Systems and Signal Processing. 15:707-721.
Herzog, J., Hanlin, J., Wegerich, S., and Wilks, A. (2005). "High Performance Condition Monitoring of Aircraft Engines." Proc. of GT2005 ASME Turbo Expo. June 6-9, 2005.
Ma, J. and Perkins, S. (2003). "Online Novelty Detection on Temporal Sequences." Proc. of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 613-618.
Rytter, A. and Kirkegaard, P. (1997). "Vibration Based Inspection Using Neural Networks." Structural Damage Assessment Using Advanced Signal Processing Procedures, Proceedings of DAMAS 97. University of Sheffield, UK. pp. 97-108.
Schölkopf, B., Sung, K.K., Burges, C.J.C., Girosi, F., Niyogi, P., Poggio, T., and Vapnik, V. (1997). "Comparing Support Vector Machines with Gaussian Kernels to Radial Basis Function Classifiers." IEEE Transactions on Signal Processing. 45:2758-2765.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., and Williamson, R. (2001). "Estimating the Support of a High-Dimensional Distribution." Neural Computation. 13:1443-1471.
Shimada, M., Mita, A., and Feng, M.Q. (2006). "Damage Detection of Structures Using Support Vector Machines Under Various Boundary Conditions." Proc. SPIE. 6174:61742-61742.
Smola, A.J. and Schölkopf, B. (2004). "A Tutorial on Support Vector Regression." Statistics and Computing. 14:199-222.
Sohn, H., Czarnecki, J., and Farrar, C.R. (2000). "Structural Health Monitoring Using Statistical Process Control." Journal of Structural Engineering. 126:1356-1363.
Sohn, H., Farrar, C.R., Hemez, F.M., Shunk, D.S., Stinemates, D.W., Nadler, B.R., and Czarnecki, J.J. (2004). "A Review of Structural Health Monitoring Literature from 1996-2001." Los Alamos National Laboratory report LA-13976-MS.
Worden, K. and Lane, A.J. (2001). "Damage Identification Using Support Vector Machines." Smart Materials and Structures. 10:540-547.
Worden, K. and Manson, G. (2007). "The Application of Machine Learning to Structural Health Monitoring." Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 365:515-537.

Chapter 4

Conclusion

Because research in signal processing is being undertaken by physicists, computer scientists, statisticians, and engineers, among others, many tools developed by one group aren't fully adopted by others. This is partially due to differences in jargon, but also because of each group's different focus and goals. However, this thesis shows that methods developed by one group for a given purpose may often be employed quite successfully by another group for an entirely different problem.

With the state of the art in particle filtering focusing on limiting degeneracy of the algorithm, it is likely that future research in the area might be applied to the material in Chapter 2 to extend its scope of application. In addition, the development of support vector machines is moving toward implementing the method quickly and online, while minimizing storage requirements. These advances might extend the structural health monitoring approach discussed in Chapter 3 to long time series for which storage and computation become difficult.

While this thesis successfully implements two separate statistical methods, each is developed in a fairly specific setting, when in fact the scope of application is much more general and may extend to problems not covered in this work. As future research, prior sensitivity analysis and cross-validation need to be studied with the goal of easing implementation for multi-dimensional parameters. Since existing methods, including the one presented, have computational complexity which scales exponentially with dimension, alternative methods must be found.
In regard to structural health monitoring, more attention must be paid to jointly modeling all sensors simultaneously, taking their correlation into account. In addition, more studies must be undertaken to understand the effect of varying environmental conditions, as well as the case where the initial system is slightly damaged, and hence nonlinear. Whether the solutions to these problems come from the world of signal processing remains to be seen.

Both of the ideas presented in this thesis have been greeted with enthusiasm by researchers at Los Alamos National Laboratory, who daily analyze complex and computationally expensive systems. In particular, the use of sequential Monte Carlo for prior sensitivity analysis and cross-validation has the potential to reduce the computational time of building models for understanding complex systems such as those present in biological and weapons research. In addition, the power gained from using SVMs for structural health monitoring will allow for earlier detection of damage, and hence help ensure a structure's economic viability as well as the safety of its operators.

Chapter 5

Technical Appendix

Proof of Theorem 1. (Following along the lines of Peruggia (1997) and Epifani et al. (2005).) We seek to show that the r-th moment of successive importance weights is finite, so we need to find the conditions under which \int \phi(\theta)\, d\theta is finite, where

\phi(\theta) = (q(\theta))^{1-\gamma} \, (q^{\setminus S}(\theta))^{\gamma} \, (w^{\setminus S,\gamma}(\theta))^{r}.

We expand and simplify \phi(\theta) to obtain

\phi(\theta) \propto \pi(\beta)\, \pi(\sigma^2)\, (\sigma^2)^{-\frac{n - s(\gamma + r\epsilon)}{2}} \exp\Big\{ -\frac{1}{2\sigma^2} \big[ (y - X\beta)^T (y - X\beta) - (\gamma + r\epsilon)(y_S - X_S\beta)^T (y_S - X_S\beta) \big] \Big\} = \phi_1(\theta)\, \phi_2(\theta),

where

\phi_1(\theta) = \pi(\beta) \exp\Big\{ -\frac{1}{2\sigma^2} (\beta - \tilde\beta)^T \big[ X^T X - (\gamma + r\epsilon) X_S^T X_S \big] (\beta - \tilde\beta) \Big\},

\phi_2(\theta) = \pi(\sigma^2)\, (\sigma^2)^{-\frac{n - s(\gamma + r\epsilon)}{2}} \exp\Big\{ -\frac{1}{2\sigma^2} \big[ y^T y - (\gamma + r\epsilon) y_S^T y_S - \tilde\beta^T [X^T X - (\gamma + r\epsilon) X_S^T X_S] \tilde\beta \big] \Big\},

and \tilde\beta = [X^T X - (\gamma + r\epsilon) X_S^T X_S]^{-1} [X^T y - (\gamma + r\epsilon) X_S^T y_S]. We will show momentarily that [X^T X - (\gamma + r\epsilon) X_S^T X_S] is positive definite, and hence invertible. Note that \phi_1(\theta) is proportional to a proper density for \beta when [X^T X - (\gamma + r\epsilon) X_S^T X_S] is positive definite; in this case \phi_1(\theta) is upper bounded. Now \phi_2(\theta) is proportional to an inverse gamma distribution provided that both

n - s(\gamma + r\epsilon) > 2

and

y^T y - (\gamma + r\epsilon) y_S^T y_S - \tilde\beta^T [X^T X - (\gamma + r\epsilon) X_S^T X_S] \tilde\beta > 0.

Thus, aside from showing conditions under which [X^T X - (\gamma + r\epsilon) X_S^T X_S] is positive definite, we also need to find conditions guaranteeing the above two inequalities. We first show that [X^T X - (\gamma + r\epsilon) X_S^T X_S] is positive definite. Using the Woodbury matrix identity, we see that [X^T X - (\gamma + r\epsilon) X_S^T X_S]^{-1} may be written as

(X^T X)^{-1} + (X^T X)^{-1} (\gamma + r\epsilon) X_S^T \big( I - (\gamma + r\epsilon) X_S (X^T X)^{-1} X_S^T \big)^{-1} X_S (X^T X)^{-1}.

Now if (I - (\gamma + r\epsilon) X_S (X^T X)^{-1} X_S^T) is positive definite, the second term in the above sum is positive semi-definite. This is the case when all the eigenvalues of X_S (X^T X)^{-1} X_S^T are less than 1/(\gamma + r\epsilon). Then [X^T X - (\gamma + r\epsilon) X_S^T X_S]^{-1} may be written as the sum of a positive definite and a positive semi-definite matrix, and hence [X^T X - (\gamma + r\epsilon) X_S^T X_S] is positive definite.

Now we proceed to find conditions ensuring

y^T y - (\gamma + r\epsilon) y_S^T y_S - \tilde\beta^T [X^T X - (\gamma + r\epsilon) X_S^T X_S] \tilde\beta > 0.

Simple but tedious algebra gives the following expression:

y^T y - (\gamma + r\epsilon) y_S^T y_S - \tilde\beta^T [X^T X - (\gamma + r\epsilon) X_S^T X_S] \tilde\beta = RSS - (\gamma + r\epsilon)\, e_S^T \big( I - (\gamma + r\epsilon) H_S \big)^{-1} e_S = RSS^{*\setminus S}(\gamma + r\epsilon),

which, by the theorem's conditions, is greater than 0 for argument value 1, and since RSS^{*\setminus S} is a smoothly decreasing function in its argument, it is also positive for some positive argument value less than 1. Now, we choose \epsilon < (c - \gamma)/r, which implies c > \gamma + r\epsilon. With c satisfying (2.8), the conditions outlined in the proof hold, namely: a) the eigenvalues of H_S are less than 1/(\gamma + r\epsilon), since these eigenvalues are upper bounded by 1; b) n - s(\gamma + r\epsilon) > 2; and c) RSS^{*\setminus S}(\gamma + r\epsilon) > 0. □
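Condition (a) of the proof is straightforward to check numerically for a given design matrix; the following sketch is illustrative only (the function and its arguments are ours, not part of the thesis).

```python
import numpy as np

def eigenvalue_condition(X, S, gamma, r, eps):
    """Check condition (a): all eigenvalues of X_S (X^T X)^{-1} X_S^T
    must be below 1/(gamma + r*eps) for the deleted-case rows S."""
    XS = X[S]
    H_S = XS @ np.linalg.solve(X.T @ X, XS.T)  # X_S (X^T X)^{-1} X_S^T
    return np.linalg.eigvalsh(H_S).max() < 1.0 / (gamma + r * eps)
```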
