UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Statistical solutions for and from signal processing 2008

You don't seem to have a PDF reader installed, try download the pdf

Item Metadata


ubc_2008_fall_bornn_luke.pdf [ 1.27MB ]
JSON: 1.0066974.json
JSON-LD: 1.0066974+ld.json
RDF/XML (Pretty): 1.0066974.xml
RDF/JSON: 1.0066974+rdf.json
Turtle: 1.0066974+rdf-turtle.txt
N-Triples: 1.0066974+rdf-ntriples.txt

Full Text

Statistical Solutions For and From Signal Processing by Luke Bornn B.Sc., University of the Fraser Valley, 2003 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Statistics) The University Of British Columbia (Vancouver) August, 2008 © Luke Bornn 2008 Abstract With the wide range of fields engaging in signal processing research, many methods do not receive adequate dissemination across disciplines due to differences in jargon, notation, and level of rigor. In this thesis, I attempt to bridge this gap by applying two statistical techniques originating in signal processing to fields for which they were not originally intended. Firstly, I employ particle filters, a tool used for state estimation in the physics signal processing world, for the task of prior sensitivity analysis and cross validation in Bayesian statistics. Secondly, I demonstrate the application of support vector forecasters, a tool used for forecasting in the machine learning signal processing world, to the field of structural health monitoring. 11 Table of Contents Abstract. Table of Contents iii List of Figures v Acknowledgements vi Statement of Co-Authorship vii 1 Introduction 1.1 Particle Filtering 1.2 Support Vector Forecasters 1.3 References An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation 2.1 Introduction and Motivation 2.2 Sequential Monte Carlo Algorithms 2.2.1 An Efficient SMC Algorithm 2.3 Prior Sensitivity 2.3.1 Regularization Path Plots 2.3.2 Variable Selection Using g-Priors . . . 15 2.4 Cross-Validation 2.4.1 Application to Bayesian Penalized Regression 2.5 Extensions and Conclusions 2.6 References 3 Structural Health Monitoring with Autoregressive Support Vector Machines 3.1 Introduction 3.2 SVM-based SHM 2 1 1 4 5 6 6 8 10 12 13 • 18 23 • 25 • . 26 29 29 31 111 Table of Contents 3.2.1 Example: Simulated Damage 3.3 Joint Online SHM 3.3.1 Example: Experimental Data 3.4 Conclusion 3.5 References 37 41 42 46 48 4 Conclusion 50 5 Technical Appendix 52 iv List of Figures 2.1 Regularization Path Plots: The gold standard 14 2.2 Regularization Path Plots: Plots using MCMC and SMC for fixed computational time of 5 minutes 16 2.3 Exact Marginal and Model probabilities for variable selection using g-Priors as a function of log(g) 19 2.4 Approximate Marginal and Model probabilities for variable selection using g-Priors as a function of log(g) 20 2.5 Diagram of cross-validation process 21 2.6 Plots of cross-validation error as a function of log) 24 3.1 Illustration of linear support vector regression fit 34 3.2 Illustration of mapping to an alternate space to induce linearity 35 3.3 Raw simulated data with highlighted artificial damage . . . 38 3.4 Autocorrelation plot of simulated data 38 3.5 SVM (top) and linear AR models (bottom) fit to subset of data 39 3.6 Residuals from SVM (top) and linear AR models (bottom) applied to simulated data 40 3.7 Diagram of experimental structure 42 3.8 Q-Q Plot of residuals from SVM model 44 3.9 Residuals from 4 sensors for t = 7000, ... , 9384 45 3.10 Density estimate of combined residual (black) vs. chi-squared distribution(red) 45 3.11 Combined residuals from all 4 sensors 46 v Acknowledgements I am indebted to my mentors, Dr. Arnaud Doucet and Dr. Raphael Got tardo, for the wealth of opportunities and immense support they have pro vided me. I am also grateful to the many people I have had a chance to interact with at both the University of British Columbia and Los Alamos National Labs. In particular I thank Dr. Dave Higdon and Dr. Todd Graves for guiding me down a fruitful path. vi Statement of Co-Authorship Excluding chapters 2 and 3, all material in this thesis was solely authored. While I identified and carried out the research in chapter 2, Arnaud Doucet and Raphael Gottardo provided much guidance, criticism, and feed back. The data in chapter 3 was produced by Gyuhae Park and Kevin Farm holt. Chuck Farrar assisted in writing some of the details pertaining to the structural health monitoring field. In addition, his guidance and feedback contributed immensely to the development of the work. vii Chapter 1 Introduction The research area of signal processing is concerned with analyzing signals including sound, video, and radar. There are many components to this task, including storage and compression, removing noise, and extracting features of interest. As an example, we might have a noisy recording of a telephone conversation for which we want to store the signal, remove the noise, and identify the speakers. These signals can take many forms, either digital or analog. We focus on statistical signal processing, which is concerned with studying signals based on their statistical properties. We begin by describ ing two statistical methods employed for signal processing, the first being particle filtering, and the latter being support vector forecasters. Because chapter 3 contains a detailed development of support vector forecasters, we forego these details here. Later chapters then extend these methodologies to fields for which they were not intended, specifically prior sensitivity analysis and cross validation as well as structural health monitoring 1.1 Particle Filtering One of the crucial areas of study in signal processing is filtering, which is concerned with estimating a dynamic system’s true state from a series of noisy measurements. Specifically, we assume that the system dynamics are known up to some parameter(s). The underlying state-space model may be written as xtIxt_1 ‘-‘- ytlxt py,t(yIxt) where Xt and Yt denote the unobserved state and observation at time t, respectively; Px,t and py,t are the state transition and measurement models, respectively. Also, we assume a prior distribution p(xo) on xo. In the case of linearly additive noise, we may write this state-space model as = f(xt_16) + r Yt = h(xt) + t. (1.1) 1 Chapter 1. Introduction Here both the stochastic noise r and the measurement noise c are mutu ally independent and identically distributed sequences with known density functions. In addition, f(xt_iI8) and h(xt) are known functions up to some parameters 8. In order to build the framework on which to describe the filtering method ologies, we first frame the above state-space model as a recursive Bayesian estimation problem. Specifically, we are interested in obtaining the posterior distribution P(xoty1t) (1.2) where XO:t = {xo,x1, .. . ,Xt} and Y1:t = {yl,y2, . . . ,yt}. Often we don’t re quire the entire posterior distribution, but merely one of its marginals. For instance, we are often interested in the estimate of state given all obser vations up to that point; we call this distribution the filtering density and denote it as p(xtlyit). (1.3) By knowing this density, we are able to make estimates about the system’s state, including measures of uncertainty such as confidence intervals. If the functions f and h are linear and both rj and Ct are Gaussian, Kalman filtering is able to obtain the filtering distribution in analytic form. In fact it can be seen that all of the distributions of interest are Gaussian with means and covariances that can be simply calculated. However, when the dynamics are non-linear or the noise non-Gaussian, alternative methods must be used. In the case of non-linear dynamics with Gaussian noise, the standard methodology is the extended Kalman filter, which may be considered as a nonlinear Kalman filter which linearizes around the current mean and co variance. However, as a result of this linearization, the filter may diverge if the initial state estimate is wrong or the process is incorrectly modeled. In addition, the calculation of the Jacobian in the extended Kalman filter can become tedious in high-dimension problems. One attempted solution to this problem has been the unscented Kalman filter (Wan and van der Merwe, 2001), which approximates the nonlinearity by transforming a random vari able instead of through a Taylor expansion, as the extended Kalman filter does. By employing a deterministic sampling technique known as the Un scented transform (Julier and Uhlmann, 1997), UKF selects a minimal set of sample points around the mean which are then propagated through the non-linear functions while recovering the covariance matrix. When either the stochastic or measurement noise is non-Gaussian, Monte Carlo methods must be employed, in particular particle filters. This Monte 2 Chapter 1. Introduction Carlo based filtering method relies on a large set of samples, called parti cles, which are evolved through the system dynamics with potentially non- Gaussian noise using importance sampling and bootstrap techniques. At each time step the empirical distribution of these particles is used to approx imate the distribution of interest and its associated features. By sampling from some proposal distribution q(xo:tyi:t) in order to approximate (1.2), we may use importance sampling with corresponding unnormalized weights — P(yi:tlxo.t)P(xot) Wt — q(xot IYi:t) However, we typically wish to perform this estimation sequentially, and hence we can take advantage of the Markov nature of the state and mea surement process along with proposal distributions of the form q(xotIyi:t) = q(xot_iIyit_i)q(xtxot_i,yit). From this we obtain the recursive weight formula P(ytlxt)P(xtlxt_i) Wt=Wt_1 q(xtxo:t_i, Yi:t) This equation allows for the sequential updating of importance weights given an appropriate choice of proposal distribution q(xtxot1, yi:t), as well as simple calculation of the filtering density (1.3). Since we can sample from this proposal distribution and evaluate the likelihood and transition probabilities, the particle filter simply involves generating a prior set of samples, evolving these samples forward with the proposal distribution, and subsequently calculating the importance weights. In addition, to prevent particle degeneracy, we employ a resampling step to remove particles with low weight and multiply those with high weight (Douc et al., 2005). The choice of proposal distribution has a significant effect on the rate of degeneracy. The standard (and simplest) choice is the prior distribution q(xtxot_i, yi:t) = P(xjlxt_i) since the weights simplify to a calculation of the likelihood. However, if the likelihood is not near the prior, this choice will lead to large variance in the importance weights, and hence we would like to employ a proposal distribution which uses the data to provide a better estimate of the posterior distribution. One such possibility is to use a Gaussian approximation of the posterior as the proposal distribution (van der Merwe et al., 2001). Often the problem of filtering isn’t restricted to the estimation of state, but is also concerned with estimating some parameters 0 of the dynamic model f(xIO). Further complicating matters, the only information we have about the state and the model parameters is the noisy measurements {yt}ti. While there are several approaches for solving this problem, we focus on dual 3 Chapter 1. Introduction estimation, namely the use of parallel filters to estimate state and model pa rameters (Wan et al., 2000). Specifically, a state-space representation is used for both the state and parameter estimate problems. While the state-space representation for the state is given in equation (1.1), the representation of the model parameters is given by = Ot—1 + V Yt = f(xt_iI6t)+ r + Here nt and t are as in (1.1), while iJ is an additional iid noise term. Thus we can run two parallel filters for both state and parameters. At each time step the current state estimate is used in the parameter filter and the current parameter estimate is used in the state filter. The situation is complicated in the particle filter situation, due to the well-known problem of degenerate weights (Casarin and Mann, 2008). Through this filtering methodology we are able to estimate the state of a dynamic system from noisy measurements, as well as the associated uncertainty of these estimates. In addition, the dual framework provides a mechanism for estimating model parameters along with the state. These filtering tools approximate a sequence of distributions of increasing dimen sion. In later chapters, we show how the particle filtering methodology may be adapted for situations involving distributions of equal dimension, and subsequently build an algorithm for efficiently performing prior sensitivity analysis and cross-validation. 1.2 Support Vector Forecasters While particle filtering and other filtering methods rely on knowledge of the underlying process to de-noise the signal, support vector regression forecast ers have a slightly different purpose. Specifically, they use a training data set to build a model of the signal, which is then used to predict subsequent pieces of the signal. The previous section contains a thorough description of filtering since the associated manuscript of chapter 2 bypasses this devel opment. However, chapter 3 provides a thorough development of support vector forecasters, and hence we forego this development here, instead sim ply providing some useful references. The recent work of Steinwart and Christmann (2008) provides thorough details on both theoretical and ap plied aspects of support vector machines, while Schlkopf and Smola (2001) contains details, on kernel-based learning. 4 Chapter 1. Introduction 1.3 References Casarin, R., Mann, J-M. (2008). “Online data processing: comparison of Bayesian regularized particle filters.” arXiv:0806.4242v1. Douc, R., Cappe, 0., Moulines, E. (2005). “Comparison of resampling schemes for particle filtering.” Proceedings of the th International Sympo sium on Image and Signal Processing and Analysis. pp 6469. Julier, S. and Uhlmann, J. (2007). “A new extension of the kalman iter to nonlinear systems.” Proceedings of AeroSense: The 11th Symposium on Aerospace/Defence Sensing, Simulation and Controls. Scholkopf, B. and Smola, A.J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge. Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer, New York. van der Merwe, R., Doucet, A., de Freitas, N., Wan, E. (2001). “The un scented particle iter.” Advances in Neural Information Processing Systems. 13:584590. Wan, E. and van der Merwe, R. (2001). “The unscented kalman iter.” Kalman Filtering and Neural Networks. Ch. 7. Ed. S. Haykin. Wan, E., van der Merwe, R., Nelson, A. (2000). “Dual estimation and the unscented transformation.” Advances in Neural Information Processing Systems. 12:666672. 5 S Chapter 2 An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation 2.1 Introduction and Motivation ‘An important step in any Bayesian analysis is to assess the prior distri bution’s influence on the final inference. In order to check prior sensitivity, the posterior distribution must be studied using a variety of prior distri butions. If these posteriors are not available analytically, they are usually approximated using Markov chain Monte Carlo (MCMC). Since obtaining the posterior distribution for one given prior can be very expensive corn putationally, repeating the. process for a large range of prior distributions is often prohibitive. Importance sampling has been implemented as an at tempted solution (Besag et al., 1995), but the potential of infinite variance importance weights makes this technique useless if the posterior distribution changes more than a trivial amount as the prior is altered. Additionally, this importance weight degeneracy increases with the dimension of the parameter space. One such prior sensitivity problem is the creation of regularization path plots — a commonly used tool when performing penalized regression. In these situations there is typically a tuning parameter which controls the amount of shrinkage on the regression coefficients; regularization path plots graphically display this shrinkage as a function of the tuning parameter. For instance, in the LASSO shrinkage and variable selection method of Tibshirani (1996), the LARS algorithm (Efron et al., 2004) may be employed to quickly produce version of this chapter has been submitted for publication. Bornn, L., Doucet, A., Gottardo, R. “An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation.” 6 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation these plots. In the Bayesian version (Vidakovic, 1998; Park and Casella, 2008), however, we may want to plot the posterior means of the regression coefficients /3 E lilY for a range of the tuning (or penalty) parameter A. The corresponding posterior distributions are proportional to exp (_ [y - X/3)T(y- X/3) - A iH) (2.1) where the response y is assumed to come from a normal distribution with mean X/3 and variance a2 for a model matrix X. Since approximating (2.1) using MCMC at one level of A can take upwards of an hour depending on the precision required, producing this plot by repeating MCMC hundreds of times for different A is impractical. Another tool requiring repeated posterior approximations is cross-validation, which has two primary statistical purposes. The first is finding the value of a given parameter (for instance, the penalty parameter in penalized regres sion) which minimizes prediction error. The second is comparing different models’ or methodologies’ prediction performance. In both situations the data is split into a training set, which is used to fit the model, and a testing set, which is used to gauge the prediction performance of the trained model. A typical example would involve fitting a model on the training set for a range of values of some model parameter, then setting this parameter to the value that results in the lowest prediction error rate on the testing set. For example, we might wish to select the value of A in (2.1) to minimize pre diction error. From a computational standpoint, cross-validation is similar to prior sensitivity in both structure and complexity. Further, the entire process is usually repeated for a variety of different training and testing sets and the results are then combined. Although importance sampling has been applied to cross-validation (for example, Alqallaf and Gustafson, 2001), the problem of infinite variance importance weights remains (Peruggia, 1997). In this paper, we begin by motivating and developing sequential Monte Carlo (SMC) methods, then subsequently apply them to prior sensitivity analysis and cross-validation. In Section 2 we develop an efficient algorithm for sampling from a sequence of potentially quite similar probability distri butions defined on a common space. Section 3 demonstrates the algorithm in a prior sensitivity setting and applies it to the creation of regularization path plots and the sensitivity of the tuning parameters when performing variable selection using g-Priors. Cross-validation with application to Bayesian pe nalized regression is developed in Section 4. We close with extensions and concluding remarks in Section 5. 7 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation 2.2 Sequential Monte Carlo Algorithms SMC methods are often used in the analysis of dynamic systems where we are interested in approximating a sequence of probability distributions ir(O) where t = 1, 2, 3, .., T. The variable Ot can be of evolving or static dimension as t changes; note that t is simply an index variable and need not be real time. Most work in the SMC literature is interested in the evolving dimension case, with applications to state-space models (Doucet et al., 2000) and target tracking (Liu and Chen, 1998) among others. The static case, where each lrt lies in a common space, has received less attention (Chopin, 2002; Del Moral et al., 2006). The goal of SMC methods is to sample from the distributions {7tt} sequentially, i.e. first from in, then 7n2, up to ir. In some situations we are concerned with each intermediate distribution, whereas in others only the final distribution ‘ItT is of interest (for example, Neal, 2001). For further reading, the edited volume of Doucet et al. (2001) covers a range of developments in SMC theory and application. The situation where the sequence of distributions lie in a common space arises in several applications. For instance, the number of observations in some experiments can make MCMC prohibitive. In this case mt might be the posterior distribution of a parameter given the observations 1 through t. Moving through the data with a sequential strategy in this way may decrease computational complexity. Another application is transitioning from a simple distribution in1 to a more complex distribution of interest ‘in. Alternatively we could consider situations analogous to simulated annealing (Kirkpatrick et al., 1983), where in(O) cx [mr(O)]t for an increasing sequence {t} t = 1, 2, 3, ..., T. In all of these examples the bridging distributions mn’j4 are only used to reach the final distribution of interest inT. When we are interested in a certain feature of each itt, SMC will typically be computationally cheaper than MCMC even if we can successfully sample from each ‘Itt using MCMC. This is because SMC borrows information from adjacent distributions, using the samples from earlier distributions to help in approximating later distributions. Often the difficulty in using SMC is constructing this sequence of distributions; both prior sensitivity and cross- validation are situations where there exists a natural sequence upon which SMC may be applied. From here forward we assume the distributions to have a common support. For all times t, we seek to obtain a collection of N weighted sam ples (called particles) i = 1, ..., N approximating itt where the weights are positive and normalized to sum to 1. We may estimate expected values with these particles using E,.(g(O)) = . g(O). One tech- 8 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation nique used in SMC is importance sampling, where particles {W?1,o?} distributed as rt may be reused, reweighting them (before normalization) according to W(i) n(O) (2.2) in order to obtain an approximation of lrt. Thus we obtain the current weights by multiplying the previous weights by an incremental weight. In an attempt to prevent these weights from becoming overly non-uniform, we may move each particle o?1 (currently distributed according to ntj) with a Markov kernel K(O, 0’) to a new position 0(i)’, then subsequently reweight the moved particles to be distributed according to lrt. Although the kernel K(0, 0’) = lrt(O’) minimizes the variance of the importance weights, it is typically impossible to sample from; thus it has been proposed to use Markov kernels with invariant distribution lrt (Gilks and Berzuini, 2001). A direct application of this strategy suffers from a major flaw, however, as the importance distribution given by Jni (0i)flKt(0t_i,8t)d01:T_i is usually impossible to compute and therefore we can not calculate the necessary importance weights. Additionally, this assumes we are able to sample from ir1 (Oi) which is not always the case. Alternatives attempt to approximate Tit pointwise when possible, but the computation of these algorithms is in 0(N2) (Del Moral et al., 2006). The central idea of SMC Samplers (Del Moral et al., 2006) is to em ploy an auxiliary backward kernel with density L_1(Ot, Ot—i) to get around this intractible integral. This backward kernel relates to a time-reversed SMC sampler giving the same marginal distribution as the forward SMC sampler induced by K. The backward kernel is essentially arbitrary, but should be optimized to minimize the variance of the importance weights. Del Moral et al. (2006) prove that the sequence of backward kernels mini mizing the variance of the importance weights is, for any t, L?(0, Ot—1) = iit_i(Ot_i)Kt(Ot_i, Ot)/’it(Ot). However, it is typically impossible to use this optimal kernel since it relies on intractable marginals. Thus, we should se lect a backward kernel that approximates this optimal kernel. Del Moral opt et al. (2006) give two suboptimal backward kernels to approximate L1 9 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation which result in incremental weights a lrt(9t) = ft_i(Ot_i)Kt(Ot_i,Ot)d6t_i (2.3a) w(9t_j,Ot)= lrt(Ot_1) (2.3b) lrt_1(8t—1) These incremental weights are then multiplied by the weights at the previ ous time and normalized to sum to 1. We note that the suboptimal kernel resulting in (2.3b) is actually an approximation of that resulting in (2.3a), and coincidentally has the same form as (2.2), the reweighting mechanism for importance sampling. In this manner the first kernel should perform better, particularly when successive distributions are considerably different (Del Moral et al., 2006). Although the weights (2.3a) are a better ap proximation of the optimal backward kernel weights, the second kernel is convenient since the resulting incremental weights (2.3b) do not depend on the position of the moved particles & and hence we are able to reweight the particles prior to moving them. We include the incremental weight (2.3a) because, when K is a Gibbs kernel moving one component at a time, it sim plifies to rrt(Ot1,k)/rt1(Ot1,k) where k is the index of the component being moved by the Gibbs sampler and0t—1,—k is the particle excluding the kth component. By a simple Rao-Blackwell argument it can be seen that this choice, by conditioning on the variable being moved, results in reduced variance of the importance weights compared to (2.3b). 2.2.1 An Efficient SMC Algorithm Now that we have described some components of SMC methodology, we proceed to develop an efficient algorithm for performing prior sensitivity and cross-validation. The basic idea of our algorithm is to first reweight (i) (i)the particles {W_1, i = 1, ..., N such that they are approximately distributed as If the variance of the weights is large, we then resample the particles with probabilities proportional to their weights, giving us a set of N equally weighted particles (including some duplicates). After resam pling we move the particles with a kernel of invariant distribution lrt, which creates diversity in the particles. Our algorithm relates closely to resample move algorithms (Gilks and Berzuini, 2001; Chopin, 2002), although our formulation is more general and allows for the use of a variety of suboptimal backward kernels and corresponding weights. Moving the particles at each time step is not particularly efficient. For example, if two successive distributions in the sequence are identical, we 10 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross- Validation are wasting our time by moving the particles. If successive distributions are similar but not necessarily identical, to save computational time we can simply copy forward the particles at time t — 1. and reweight them with the importance sampling weights (2.2). Deciding when to move particles may be done dynamically or deterministically. A dynamic scheme would move the particles whenever the variance of the weights becomes too large (usually measured by the effective sample size (ESS), (Z1W)2)_1), whereas a deterministic scheme would move the particles every kth time step for some integer k. Since the sequence of distributions will likely not change at a constant rate, it is better to use a dynamic scheme as this allows for little particle movement during parts of the sequence with little change and more movement in parts of the sequence where successive distributions vary more. When the ESS drops below a specified threshold, we reweight the par ticles at time t — 1 to be approximately distributed as lrt prior to moving them. The weights (2.3b) only depend on the particles at time t — 1, 50 we can easily do this. In the case of a one at a time Gibbs sampler, we can also use the weights (2.3a). Because the unweighted particles at time t are not distributed according to lrt, we cannot simply move the particles without first taking their weights into consideration. Thus prior to moving the particles we must resample them such that = 1/N for i = 1,. .. , N and the particles’ unweighted distribution is lrt. Resampling methods du plicate particles with large weights and remove particles with low weights. Specifically, we copy the particle times such that N = N and E(N) = NW where are the normalized importance weights. Lastly, all of the resampled particles are assigned equal weight. The simplest unbiased resampling method consists of sampling N4 from a multinomial (i)distribution with parameter (N, {W }). It should be noted that more so phisticated resampling schemes, such as residual resampling (Liu, 2001) and stratified resampling (Kitigawa, 1996) exist, resulting in reduced variance of relative to multinomial resampling. After the particles are resampled, we can move them with the kernel K. An efficient SMC algorithm which may be used to perform prior sensi tivity and cross-validation is therefore: 11 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation fort=ldo Obtain N weighted samples 6 from rj (directly, MCMC, etc.) end fort=2,...,Tdo (i) (i) . (i)Copy 0_ to 61,, and calculate weights W according to (2.2) if ESS(O) > c then Copy (oi), w)) to (or), w)) else Reweight: Calculate weights according to W cc W_1 x wt(bt_i,6t)where Wt(t_1,8t) is either given by (3a) or (3b) Resample: Resample particles according to above weights. Set all weights to 1/N Move: Move particles with Markov kernel of invariant distribution ‘lrt end end note 1: If a backward kernel is chosen such that the incremental weights depend on the position of the moved particle O, the reweight step comes after the move step and resampling is performed with the (i) weights note 2: c is a user-specified threshold on the effective sample size. 2.3 Prior Sensitivity In the case of prior sensitivity we are interested in approximating the poste rior distribution of some variable(s) 0 given the data D, symbolically notated as ir(OID) cc f(DI0) . v(9) where f(DIO) and v(O) are the likelihood and the prior distribution of 0, respectively. Here the notation v(O) is used to dif ferentiate the prior from the posterior distribution nt(O), allowing for the omittance of dependencies. This prior sensitivity framework has been stud ied in a closed-form setting (Gustafon and Wasserman, 1995; Gustafson, 1996), but situations requiring Monte Carlo methods have received less at tention. It is worth noting that only the prior distribution changes between successive distributions (the likelihood remains the same). Thus when we reweight particles to approximate the sequence of posterior distributions for 12 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation 0, the weights (2.2) depend solely on the prior distribution, w cx w1 x (De1) . (o) — f (Do2) t—i (o1) • Vt (ei?) cx W1 x (2.4) Vt_i (et4) where is the th particle sampled at time t and V(O) is the ttI prior (i)distribution evaluated at the point O . If the ESS falls below a given threshold at time t (notated as c in algorithm pseudocode), we resample and move, otherwise we simply reweight. Conveniently, resampling and moving using (2.3b) and reweighting using (2.2) both result in the same weight mechanism (2.4). In a later example we will also employ the weights (2.3a), which have reduced variance relative to (2.3b). 2.3.1 Regularization Path Plots Consider the regression model with response vector y = (yi,.. . , y)” and model matrix X = (xi, ... , x) where x (xii, . . . , x )T, j 1,. . . , p are the column vectors of predictors (including the unit intercept vector). For clarity of presentation we present the model with a continuous response; however, it is simple to extend to binary responses (Albert and Chib, 1993). We use the prostate data of Stamey et al. (1989) which has eight predictors and a response (logarithm of prostate-specific antigen) with likelihood yIpx,/3,u2 Nn(Xi3,o21n). (2.5) Using a double exponential prior distribution with parameter A on the regression coefficients /3 (/3i, . . . , the corresponding posterior distri bution is proportional to (2.1). We see from the form of this posterior dis tribution that if A = 0 the MAP estimate of /3 will correspond to the least squares solution. However, as A increases there will be shrinkage on /3 which may be displayed using a regularization path plot. Because the shrinkage as A varies is nonlinear, we set a schedule A = et/20, t = 1,.. . ,100. We create a “gold standard” Bayesian Lasso regularization path plot for this data by running MCMC with a Markov chain of length 50,000 at each level of A and plotting the posterior mean of the resulting regression coefficients (Figure 2.1). It should be noted that the creation of this plot took over 5 hours. 13 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation 0 0.0 0.2 0.4 0.6 0.8 1.0 Figure 2.1: Regularization Path Plots: The gold standard The plot is of standardized coefficients 13j vs. /3i/max(I/311). Since the idea is to create these plots quickly for exploratory analysis, we will compare our SMC-based method to MCMC with both constrained to work in 5 minutes (+/— 5 seconds), and both using the same Markov kernel. In order to perform MCMC in our time frame of 5 minutes, the Markov chain had a length of 1200 for each of the 100 levels of ). The mean of each resulting posterior distribution was used to plot the regularization path plots in Figure 2.2(a). In comparison, to run in 5 minutes our SMC algorithm used N = 4200 particles and resampled and moved the particles when the ESS dropped below c = = 2800 (Figure 2.2(b)). For the sake of time comparisons, all computations here and later were performed on a Power Mac G5 with dual 2.7 GHz PowerPC processors. We see from these plots that both methods capture the key features of the regularization path plot as shown by the gold standard: every one of the variables has the cor rect path. The methods vary, however, in the amount of noise. We see much more variability using MCMC compared to SMC. This is due to SMC being able to use many more particles since it is able to save time by borrowing in formation from previous distributions. To be specific, the SMC algorithm in this context had to resample and move the particles only 25 times in the en tire sequence of 100 distributions. The remainder of the time our algorithm 0 0 0 0 14 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation simply reweighted the particles, which is computationally inexpensive. It is worth noting that, because of this, adding more incremental distributions in the sequence will have little effect on the computational time of the SMC algorithm, unlike MCMC-based strategies, which would approximate each new distribution with a new Markov chain. In addition, we attempted to make these plots using importance sampling, reweighting (and not moving) particles from 7t1 to approximate later distributions. However, the weights became degenerate, with all of the weights eventually focussing on one par ticle with standardized h-i norm of 0.8. Specifically, all but one of the weights had values near zero, and the one particle with positive weight had standardized L-1 norm of 0.8. Thus importance sampling was only able to create roughly 1/5 of the entire plot, and hence is clearly not a candidate methodology for creating these plots. We will see later that in many such situations importance sampling fails, even with large amounts of particles. 2.3.2 Variable Selection Using g-Priors Consider the normal likelihood set-up (2.5). Now, however, with an eye towards model selection, we introduce the binary indicator variable 7 E {0, 1}P, where 7j = 1 means the variable x3 is included in the model. Thus 7 can describe all of the 2 possible models. Following the notation of Mann and Robert (2007), we use q7 = 1y as a counter for the number of variables in the model. If X is the model matrix which excludes all xj’s if = 0, we can employ the following prior distributions for /3 and 2 (Zellner, 1986; Mann and Robert, 2007): (g)_(7+l)/_lexp From this it is straightforward to show that the posterior density for 7 is thus —n/2 (7Iy,X) (g+1) +1)/2 [yTy_g1YTX(XX)_1X7] (2.6) We perform model selection on the pollution data set of McDonald and Schwing (1973), in which mortality rate is compared against 15 pollution related variables in 60 metropolitan areas. The 15 independent variables include, for instance, mean annual precipitation, population per household, and average annual relative humidity. The response variable y is the age adjusted mortality rate in the given metropolitan area. We seek to perform 15 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross- Validation a (a) MCMC with 1200 samples (5 minutes) 0 0 0.0 0.2 0.4 0.6 0.6 1.0 (b) SMC with 4200 samples (5 minutes) Figure 2.2: Regularization Path Plots: Plots using MCMC and SMC for fixed computational time of 5 minutes The plots are of standardized coefficients /3 vs. I/31i/max(I/31i). 16 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation variable selection to narrow down the number of independent variables which best predict the response. With 15 variables, calculating the posterior prob abilities of the over 30,000 models exactly is possible but time-consilming. We have chosen this size of data set to allow for a benchmark from which we can compare MCMC to SMC. Our goal is to see how the explanatory variables change as we vary the prior distribution parameter g. In other words, we are interested in seeing how robust the variable selection method is to changes in the setting of g. Our goal is to perform the variable selection for 100 levels of g for schedule g = et/10, t = 1,.. . , 100. We use a Gibbs sampler strategy to compare the SMC-based algorithm to brute-force MCMC, benchmarked against the exact solution obtained from (2.6), in which /37 and o2 are integrated out. Specifically, we update y one component at a time. The incremental weight ratio (2.3b) will be the ratio of the posterior distribution (2.6) evaluated on the complete data at successive levels of g. In addition, we are able to use the weights (2.3a), which corresponds to the ratio of the posterior distribution (2.6) evaluated on all of the data, excluding the variable that is being moved by the Gibbs sampler. In order to see our desired result, we use (2.6) to plot the exact marginal probabilities as well as some sample model probabilities for various levels of g (Figures 2.3(a) and 2.3(b)). This process took slightly over 8 hours, and hence we would like to find a faster method. We constrain both stochastic algorithms to run in 30 minutes (+/- 1 minute). As a result the MCMC algorithm uses a Markov chain of length 10,000 and the SMC algorithm uses 18,000 particles. We plot the resulting posterior marginal probabil ities for each algorithm in Figures 2.4(a) and 2.4(b), respectively. First impression shows that the plot created using MCMC has much more vari ability. However, the smoothness in the SMC algorithm is not a result of perfect accuracy of the method, but rather only smoothness of the reweight ing mechanism (2.2). Because of this, if the SMC does poorly during times of particle movement, the subsequent reweighted approximations will also be inaccurate. To ensure this is not the case and verify that SMC is indeed outperforming MCMC, we look at the average absolute error of the marginal probabilities (at 100 levels of ) and for 15 variables). We find the average absolute error in the marginal probabilities using MCMC is 0.0292 whereas with SMC it is only 0.0187. In addition, their respective maximum absolute errors were 0.24 and 0.08, respectively. In fact 30 runs of the algorithms resulted in similar results, with SMC consistently outperforming MCMC. From this we see that SMC is indeed providing a better approximation of the true marginal probabilities. 17 Chapter 2. An Efficient Computational Approach ibr Prior Sensitivity Analysis and Cross-Validation What then may be taken from these marginal probability plots? When performing simple forward selection regression, the variables 1, 2, 6, 9, and 14 are chosen. Slightly different results come from doing backward selection; in particular variables 1 and 14 are replaced by variables 12 and 13. The LASSO solution (using 5-fold cross-validation) is the same as the forward solution with the additional variables 7 and 8. In addition, the LASSO solution contains some shrinkage on the regression coefficients (see exam ple 2.3.1). Using g-Priors the variables that clearly stand out (see Figure 2.3(a)) are 1, 2, 6, 9, and 14. Thus the g-Prior solution taken from the plot corresponds to the forward selection model. Also, for a given g, say g = the plot obtained with SMC shows the correct top 4 variables for inclusion, whereas the variability from the MCMC-based plot makes it impossible to do so. 2.4 Cross-Validation We focus on leave-s-out cross-validation, which is the case when. the testing set consists of s observations. Continuing in the linear regression frame work, let X\s and Y\s be the model matrix and response vector excluding the subset S of observations (of size s). We are interesting in a collection ofT model parameter (typically prior distribution parameter) settings resulting in posterior densities Trt(91X\s, Y\s) for t 1, . . . , T. Once we have approx imations of all T posterior densities, we select the model parameter settings which result in the best prediction of y using X. To find the sequence of distributions lrt(OIX\s, Y\s) t = 1, ... , T, the same SMC-based algorithm proposed for prior sensitivity is applicable. Specifically, once we have ob tained a Monte Carlo approximation of lrl(OIX\s, Y\s) we can transition to the remainder of the distributions 7rt(OIX\s, Y\s), t = 2,... , T using SMC. In addition to quickly evaluating the model for a variety of settings on the training set, SMC also provides a tool for switching the training/testing set without fully re-approximating the posterior densities. Specifically, sup pose we have a testing set Si, and using SMC we find approximations of rt(OjX\s1 Y\s1) t = 1, .. . , T, each of which are tested for prediction perfor mance on the subset S. However, typically we are interested in performing cross-validation for a variety of different splits of the data into training and testing sets. Thus, we will now want a new testing set 82 and find approxima tions of Kt(O IX\s2,Y\s2)’ = 1,. .. , T. The obvious way to accomplish this is to start fresh by approximating lrl(eIX\s2Y\s2) with MCMC and proceed ing to approximate the remainder of the distributions using SMC. However, 18 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation oF 0 (a) Posterior Marginal Probabilities: GreenrrXi, BlueX2,Yellow=X6,Red=X9,Purple=X14 C 0 d a 0 C 0 0 2 4 6 8 10 (b) Posterior Model probabilties: Purp1e1,2,4,5,9, Red=1,2,4,5, Blue1,9, Green9, Ye11ow=6 Figure 2.3: Exact Marginal and Model probabilities for variable selection using g-Priors as a function of log(g) Plot (a) highlights several variables (X1,X2,X6,X9,X14) which show high marginal probabilities of inclusion. Plot (b) shows the posterior probabilities of 5 models chosen to highlight the effect of g on model size19 0 2 4 6 8 10 / \ / Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation 0 0 0 d c’J 0 (a) Posterior Marginal Probabilities: 10000 samples (30 minutes) MCMC with 0 0 0 cJ 0• 0 0 0 2 4 6 8 10 (b) Posterior Marginal Probabilities: samples (30 minutes) SMC with 18000 Figure 2.4: Approximate Marginal and Model probabilities for variable Se lection using g-Priors as a function of log(g) Plots (a) and (b) compare MCMC to SMC’s performance. 0 2 4 6 8 10 20 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation Tr(OIX, y) ir (81X\s Y\S) 7tl(8IX\Smax Y\Smax) I I 7t2(IX\si ‘ Y\s) t2(8IX\S, Y\Smax) 71T(OIX\si, Y\S1) —-. 7rT(&JX\Sma, Y\Smcr) Figure 2.5: Diagram of cross-validation process Each arrow represents transitioning using SMC. we can be a bit more clever than this, recognizing that lrl(0IX\s1 Y\s) and rl(OPX\s2,Y\s2) are related (Alqallaf and Gustafson, 2001; Bhattacharya and Haslett, 2007). Successive splits of the data into training and testing sets should give similar model settings. Therefore, we first build the model for a given pa rameter setting on the full data set using SMC, resulting in an approxima tion of 7r1(OIX, y). Then instead of using MCMC to get approximations of lrl(81X\S, Y\s) for different S E {S, . . . , Sm&c}, we can build a sequence of distributions (lrl(OIX,y))’7(7r1(OIX\s,y\s))7 for an increasing temper ature ‘y = 0, , 2,. . ., 1 — e, 1 which will allow us to transition to the case- deletion posteriors. The process is illustrated in Figure 2.5. The case of = 0, 1 with no movement step corresponds to basic case-deletion impor tance sampling as developed in Peruggia (1997). Although case-deletion importance sampling has been demonstrated to achieve up to 90% cost sav ings in some circumstances (Aiqallaf and Gustafson, 2001), the problem of degeneracy still makes importance sampling fail in many situations (Perug gia, 1997; Epifani et al., 2005). Let 9 = (3, u2). The posterior distribution ir(9) of 9 is proportional to q(9) = f(y113, 2) x ii-(/3) x ir(o-2). Assume we collect samples from the distri bution ir(9). We are interested in reweighting these samples such that they come from the distribution attained by removing the set S. The modified 21 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation likelihood and posterior for this case-deletion scenario are, respectively f\s(YI3, 2) = (2)_(ns)/2 exp {--- [(y - X/3)T(y- Xj3) - (Ys - X5)T((ys - Xi3))] } q\s(e) f\s(YII3u2)x ir(3) x ir(u2) We assume that the prior distributions for 3 and cr2 are proper and indepen dent. Epifani et al. (2005) show that if the weights w\S(e) q\s(e)/q(e) are used to move to the case-deletion posterior directly, then the moment of these weights is finite if and only if all of the following conditions hold: a) <1/r b) n—rs >1 c) RSS*\s(r) > 0 where )H is the largest eigenvalue of the matrix H X(XTX)_lXs and RSS*\s(r) = RSS — re(I — rHs)’es where es = ys — X(XTX)_1XTy and RSS denotes the residual sum of squares of the least squares fit of the full data set. This result should not be taken lightly: as Geweke (1989) points out, if the 2nd moment does not exist, the importance sampling estimator will follow neither a n’2 asymptotic nor a central limit theorem. (a) states that if the leverage of the deleted observations is too large, then the importance weights will have infinite variance. (b) gives a condition relating sample size to the allowable test set size s. (c) says that if the influence of the deleted observation is large relative to RSS, then the importance weights will have infinite variance. We show here how using a sequence of artificial intermediate distributions with SMC can help to mitigate this problem. We introduce a sequence of distributions q7(e) cc where y = 0,e,2e,.. ., 1—e, ito move from q(9) = qo(O) to q\5(e) =q1(e). At a given step ‘y = ‘y in the sequence, the successive importance weights appearing in the SMC algorithm to move to the next step + e are e — __________________ W\S,7*( ) — (q(e))l_7*(q\s(e))7* - \\q(9)) 22 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation Theorem 1.. Provided that RSS*\s(1) > 0 and the prior distributions for/3 and a2 are proper and independent, a sequence of distributions proportional to {(q(e))’-7(q\5(e) ;‘y = 0, e, 2e, . . . , 1—c, 1} may be constructed to move from q(f3) to q\s(e) such that the importance weights W\S7(e) for each successive step have a finite moment underq7(9) provided (2.7) where > 1 is chosen to satisfy ‘H < 1/Q (2.8a) n—cs>2 (2.8b) RSS*\s(c) > 0 (2.8c) The proof may be found in the appendix. The provision that RSS*\s(1)> o is very reasonable, and states that the least squares fit of the full data must not fit the training set perfectly. Note also that we find for each subset S. Thus we may use the largest allowable step size c in (2.7) for each subset 5, maximizing the algorithm’s efficiency. While this result is not sufficient to establish that the variance of SMC estimates are finite for a finite number N of particles, it can be used to upper bound the asymptotic variance of SMC estimates under additional mild regularity mixing conditions on the MCMC kernels; see (Chopin, 2004) and (Jasra and Doucet, 2008) for similar ideas. 2.4.1 Application to Bayesian Penalized Regression To demonstrate the strength of SMC applied to cross-validation, we use it to select the parameter ) of the Bayesian Lasso (2.1). For brevity, we reuse the pollution data set (McDonald and Schwing, 1973) of section 2.3.2, selecting the parameter ) using leave-one-out cross-validation. Firstly, it is worth pointing out that importance sampling will fail in this situation, as AH > 1/2 on 6 of the 60 observations in this data set, and hence the sufficient conditions to ensure finite variance are not satisfied. Using a sequence of intermediate distributions, we find that the largest c satisfying (8) equals 1.103, or, to ensure a finite second moment, e < = .103. Thus, the largest sequence of distributions was of length 10. For most variables c > 2, which for r = 2 is equivalent to importance sampling. Thus SMC does not waste time transitioning to case-deleted posteriors if importance sampling will suffice. We use a Gibbs sampler to approximate the posterior distribution of (3, a2) for A = e5 on the full data set and then use SMC to move to 23 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation 0 o 0 0 C) 0 C 0 0 0 0 0 C) o0 0 -4 -2 02 4 (a) Cross-validation error as a function of log(A) using MCMC with 60 samples (10 minutes) 0 0 0 a 1’- 0 0 0 000 - (b) Cross-validation error as a function of log(A) using SMC with 100 samples (10 minutes) Figure 2.6: Plots of cross-validation error as a function of 1og(\) 24 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation the case-deletion posterior distributions by creating a sequence of auxiliary distributions as described above. For each different case-deletion we then use SMC to find approximations of the posterior for schedule A = et/5, t = —25,. . . , 25. Plotting the cross-validation errors as a function of A using MCMC with a Markov chain of length 10,000 (not shown) we observe that the average squared loss (Yk —Xk/3)2 is a smooth function in A with a small bump at A = e2 and minimum near e312. Thus to minimize prediction error (at least in terms of the squared loss) we should set A — e312. To perform this task in a time-restricted manner we constrained both MCMC and SMC algorithms to work in 10 minutes (+/- 30 seconds). Figures 2.6(a) and 2.6(b) are the resulting plots. The reduced variability of the SMC-based plot allows us to make more accurate conclusions. For instance, it is clear in the plot obtained with SMC (Figure 2.6(b)) that the minimum error lies somewhere around A = e312, whereas from the MCMC plot (Figure 2.6(a)) it could be anywhere between 1 and e2. 2.5 Extensions and Conclusions In our presentation of the algorithm, a fixed sequence of distributions Trt(O), t = 1, 2, 3, .., T is used. However, it is also possible to determine the sequence of distributions automatically such that successive distributions are a fixed distance apart, as measured by ESS. For instance, assume we are interested in 7rt(8) = ir(8IAt) where A is a scalar parameter and we have a Monte Carlo approximation of ir(OIA_i) for an arbitrary t, namely {W, 8}, i = 1,. ..,N. We may set A to ensure that ESS = c for a constant c by solving where is given by (2.2). This may be solved numerically or in closed- form, if possible. This technique would be beneficial in situations where little or nothing is known about the sequence of distributions, and hence it would be nice to automatically create the sequence. All our examples have considered a sequence of distributions parameter ized by a scalar parameter for which the definition of the sequence of target distributions is very intuitive. If we are interested in dealing with multivari ate parameters then the algorithm may be adapted by, for instance, creating a grid (or hyper-grid) of distributions. SMC may be used to work across 25 Chapter 2. An Efficient Computational Approach ftr Prior Sensitivity Analysis and Cross-Validation each dimension in succession. It is worth noting that the complexity of the algorithm scales exponentially with dimension, although MCMC does as well. While we have given two choices of incremental weights, (2.3a) and (2.3b), many other choices are available (Del Moral et al., 2006). In sit uations where the weights are dependent on the position of the moved particle, such as with (2.3a), auxiliary particle techniques may be used (Pitt and Shephard, 1999; Johanseri and Whiteley, 2008). Specifically, we reweight the particles with an approximation of the weight of interest (for instance, (2.3a)) which is oniy dependent on the particles at time t — 1, (i) (i) (i) (i) . using Wtemp W_1 x W. where W is the approximation of the incre mental weight. After we have resampled and moved the particles we then I47’ compensate for this approximation using W’ = tie x Wmp. We have seen that by adapting importance sampling to move particles between successive distributions, SMC drastically limits the problem of im portance sampling degeneracy. By using a resample-move type algorithm, we are able to perform prior sensitivity and cross-validation in a computa tionally feasible manner while avoiding the fore-mentioned pitfalls of impor tance sampling. We have shown the SMC algorithm to be considerably more efficient than existing methods based on reiterative MCMC approximations. In this way regularization path plots and other sensitivity analysis problems can be studied in the context of the full posterior distribution instead of a few summary statistics. In addition, SMC provides a tool for naturally per forming cross-validation, and in fact guarantees finite case-deletion weights under much less stringent conditions than importance sampling. In addition, through the importance weights, SMC provides a measure of the distance between distributions, and hence gives a way to select a subset of distribu tions of interest for exploratory or other purposes. 2.6 References Albert, J.H. and Chib, S. (1993). “Bayesian Analysis of Binary and Polytomous Response Data.” Journal of the American Statistical Associa tion. 88:669-679. Alqallaf, F. and Gustafson, P. (2001). “On Cross-validation of Bayesian Models.” Canadian Journal of Statistics. 29:333-340. Besag, J., Green, P., Higdon, D., Mengersen, K. (1995) “Bayesian Com putation and Stochastic Systems (with discussion) .“ Journal of the Amen 26 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross- Validation can Statistical Association. 95:1127-1142. Bhattacharya, S., and Haslett, J. (2007). “Importance Re-sampling MCMC for Cross-Validation in Inverse Problems.” Bayesian Analysis. 2:385- 408. Chopin, N. (2002). “A Sequential Particle Filter Method for Static Mod els.” Biometrika. 89:539-552. Chopin, N. (2004). “Central Limit Theorem for Sequential Monte Carlo Methods and Its Application to Bayesian Inference.” Annals of Statistics. 32:2385-2411. Del Moral, P., Doucet, A., Jasra, A. (2006). “Sequential Monte Carlo Samplers.” Journal of the Royal Statistical Society: Series B. 68:411-436. Doucet, A., Godsill, S., Andrieu, C. (2000). “on Sequential Monte Carlo Sampling Methods for Bayesian Filtering.” Statistics and Computing. 10:197-208 Doucet, A., de Freitas, N., Gordon, N.J. eds. (2001). Sequential Monte Carlo Methods in Practice. New York: Springer. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004). “Least Angle Regression.” Annals of Statistics. 32:407-499. Epifani, I., MacEachern, S., Peruggia, M. (2005). “Case-Deletion Impor tance Sampling Estimators: Central Limit Theorems and Related Results.” Technical Report No. 720, Department of Statistics, Ohio State University. Geweke, J. (1989). “Bayesian Inference in Econometric Models Using Monte Carlo Integration.” Journal of the American Statistical Association, 88:881-889. Gilks, W.R. and Berzuini, C. (2001). “Following a Moving Target: Monte Carlo Inference for Dynamic Bayesian Models.” Journal of the Royal Statistical Society: Series B. 63:127-146. Gustafson, P. and Wasserman, L. (1995). “Local Sensitivity Diagnostics for Bayesian Inference.” Annals of Statistics. 23:2153-2167. Gustafson, p. (1996). “Local Sensitivity of Inferences to Prior Marginals.” Journal of the American Statistical Association. 91:774-781. Jasra, A. and Doucet, A. “Stability of Sequential Monte Carlo Samplers via the Foster-Lyapunov Condition”, Statistics and Probability Letters. to appear. Johansen, A., and Whiteley, N. (2008). “A Modern Perspective on Auxil iary Particle Filters.” Proceedings of Workshop on Inference and Estimation in Probabilistic Time Series Models. Isaac Newton Institute, June Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P. (1983). “Optimization by Simulated Annealing.” Science. 220:671-680. 27 Chapter 2. An Efficient Computational Approach for Prior Sensitivity Analysis and Cross-Validation Kitagawa, G. (1996). “Monte Carlo Filter and Smoother for Non-Gaussian, Non-linear State Space Models.” Journal of Computational and Graphical Statistics. 5:1-25. Liu, J.S. and Chen, R. (1998). “Sequential Monte Carlo Methods for Dy namic Systems.” Journal of the American Statistical Association. 93:1032- 1044. Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing (2nd ed.), New York: Springer. McDonald, G. and Schwing, R. (1973). “Instabilities of Regression Es timates Relating Air Pollution to Mortality.” Technometrics. 15: 463-481. Neal, R. (2001). “Annealed Importance Sampling.” Statistical Comput ing. 11:125-139. Park, T. and Casella, G. (2008). “The Bayesian Lasso.” Journal of the American Statistical Association. 103:681-686. Peruggia, M. (1997). “On the Variability of Case-Deletion Importance Sampling Weights in the Bayesian Linear Model.” Journal of the American Statistical Association. 92:199-207. Pitt, M.K. and Shephard, N. (1999). “Filtering Via Simulation: Aux iliary Particle Filters.” Journal of the American Statistical Association. 94:590-591. Tibshirani, R. (1996). “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B. 58:267-288. Vidakovic, B. (1998). “Wavelet-Based Nonparametric Bayes Methods.” in Practical Nonparametric and Semiparametric Bayesian Statistics. D. Dey, P. Muller, and D Sinha (eds.). New York: Springer, 133-256. Zellner, A. (1986). “On Assessing Prior Distributions and Bayesian Re gression Analysis with G-prior Distributions.” Bayesian Inference and De cision Techniques: Essays in Honor of Bruno de Finetti. pp. 233-243. 28 Chapter 3 Structural Health Monitoring with Autoregressive Support Vector Machines 3.1 Introduction 2The extensive literature on structural health monitoring (SHM) has docu mented the critical importance of detecting damage in aerospace, civil, and mechanical engineering systems at the earliest possible time. For instance, airlines may be interested in maximizing the lifespan and reliability of their jet engines or governmental authorities might like to monitor the condition of bridges and other civil infrastructure in an effort to develop cost-effective lifecycle maintenance strategies. These examples indicate that the ability to efficiently and accurately monitor all types of structural systems is crucial for both economic and life-safety issues. One such monitoring technique is vibration-based damage detection, which is based on the principal that damage in a structure, such as a loosened connection or crack, will alter the dynamic response of that structure. There has been much recent work in this area; in particular, Doebling et al. (1998) and Sohn et al. (2004) present de tailed reviews of vibration-based SHM. Because of random and systematic variability in experimentally measured dynamic response data, statistical approaches are necessary to ensure that changes in a structures measured dynamic response are a result of damage and not caused by operational and environmental variability. Although much of the vibration-based SHM liter ature focuses on deterministic methods for identifying damage from changes in dynamic system response, we will focus on approaches that follow a sta 2A version of this chapter has been accepted for publication. Bornn, L., Farrar, C.R., Park, G., Farinholt, K. (2008). “Structural Health Monitoring with Autoregressive Sup port Vector Machines.” Journal of Vibration and Acoustics. 29 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines tistical pattern recognition paradigm for SHM (Farrar and Worden, 2008). This paradigm consists of the four steps of 1. Operational evaluation, 2. Data acquisition, 3. Feature extraction, and 4. Statistical classification of features. The work presented herein focus on steps 3 and 4 of this paradigm. One approach for performing SHM is to fit a time series predictive model such as an autoregressive (AR) model to each sensor output using data known to be acquired from the structure in its undamaged state. These models are then used to predict subsequent measured data, and the residuals (the difference between the model’s prediction and the observed value) are the damage-sensitive feature that is used to check for anomalies. This pro cess provides many estimates (one at each time step) of a single-dimension feature, which is advantageous for subsequent statistical classification. The logic behind this approach is that if the model fit to the undamaged sen sor data no longer predicts the data subsequently obtained from the system (and hence the residuals are large and/or correlated), there has been some sort of change in the process underlying the generation of the data. This change is assumed to be caused by damage to the system. These linear time series models have been used in such a damage detection process that include applications to a wide range of structures and associated damage scenarios including cracking in concrete columns (Fugate et al., 2001; Sohn et al., 2000), loose connections in a bolted metallic frame structure (Allen et al., 2002) and damage to insulation on wiring (Clark, 2008). However, the linear nature of this modeling approach limits the scope of application and the ability to accurately assess the condition of systems that exhibit nonlinearity in their undamaged state. In this paper, we demonstrate how support vector machines (SVM) may be used to create a non-linear time series model that provides an alternative to these linear AR models. Once a model has been chosen and the predictions from this model have been compared to actual sensor data, there are several statistical methods for analyzing the resulting residuals. Sequential hypothesis tests, such as the sequential probability ratio test (Allen et al., 2002), may be used to test for changes in the residuals. Alternatively, statistical process control procedures, typically in the form of control charts, may be used to indicate abnormalities in the residuals (Fugate et al., 2001). In addition, sliding window approaches look at the features of successive subsets of data to detect anomalies (e.g. Clark, 2008). For example, the sliding window approach of Ma and Perkins (2003) looks at thresholds for the residuals such that the probability of an undamaged residual exceeding this threshold is 5%. A subset of n consecutive data points are then checked, and large values of the number g of points exceeding the threshold indicate damage, where g has a 30 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines binomial distribution (i.e. g Bin(n, .05)). To date, most of these time series modeling approaches analyze data from one sensor at a time, and typically some sort of scheme is used to determine how many sensors need to indicate damage in order to trigger a system check (e.g. Herzog et al., 2005). As an alternative, in this paper we look at a statistically based method for combining multiple sensor output. From this combined output analysis, we can establish the existence of damage and also determine which sensors are contributing to the anomalous readings in an effort to locate the damage within the sensor network’s spatial distribution. Previously Sohn et al. (2000) have used principal component analysis to combine data from an array of sensors, but this study only examined these combined data in an effort to establish the existence of damage. We first present a summary of the SVM approach to nonlinear time se ries modeling. This procedure is illustrated on numerically generated data with artificial anomalies added to the baseline signal in an effort to simulate damage. This time series modeling approach is then compared to linear AR models. Next the SVM method is coupled with a statistical analysis pro cedure that combines modeling results from multiple sensors in an effort to both establish the existence and the location of the damage. This procedure is applied to data from a laboratory test structure with damage that results in local nonlinear system response. 3.2 SVM-based SHM Existing methods for performing damage detection extract damage-sensitive features from data acquired on the undamaged system, and then use changes in those features as an indicator of damage. An AR model can be fit to the undamaged sensor output and the residuals from predictions of subsequent data using this baseline model are then monitored for statistically significant changes that are assumed to be caused by damage. Specifically, an AR model with p autoregressive terms, AR(p), applied to sensor k may be written as (3.1) where x is the representation of the measured signal at discrete times t from the kth sensor, are the AR coefficients or model parameters, and e is an unobservable noise term. Thus an AR model works by fitting a simple linear model to each point with the previous p observed points as dependent 31 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines variables. Note that an n point time series will yield n — p equations that can be used to generate a least square estimate of the AR coefficients or the Yule-Walker Method can be used to solve for the coefficients (Brockwell and Davis, 2001). Auto-regressive models work particularly well when modeling the response of linear, time-invariant systems. If the undamaged system is nonlinear, the AR process gives the best linear fit to the measured response, but there is no guarantee that this model will accurately predict responses obtained when the system is subjected to other inputs. Because of the broad array of structural health monitoring problems, employing a linear model confines the scope of problems for which the AR methodology is appropriate. We thus seek to extend the fidelity of this gen eral damage detection approach by employing a non-linear AR-type model based upon SVMs, which have seen widespread use in machine learning and statistical classification fields. To simplify future development, we denote the vector . . , as SVMs have many features that make them a more appropriate choice for SHM based on time series analysis. With the right settings and appropriate training they are able to model any non linear relationship between the current time point, x, and the p previous time points, they are well suited for high-dimensional problems, and the methodology is easily generalized and highly adaptable. Although SVMs have been used for SHM before (e.g. Worden and Manson, 2007; Shimada et al., 2006; Bulut et al., 2005, Worden and Lane, 2001; Chat topadhyay et al., 2007), these approaches predominantly focus on one and two class SVMs, which are used for outlier detection and group classification, respectively. Our approach is unique in its combination of support vector re gression, autoregressive techniques, and residual error analysis. Thus while earlier approaches look at classifying sections of the time-series response as damaged or undamaged directly (the dependent variable being a binary indicator), our methodology works by using support vector regression to model the raw time-series data, then subsequently predicting damage by monitoring the residuals of the model. We follow the development of SVMs for regression of Smola and Schlkopf (2004) and Ma and Perkins (2003). First, assume we have data from a set of K sensors and we have measure ments without damage for time t = 1, . . . ,t0 (i.e. if there is damage, it occurs after time to). Next we must decide the order p of our model. There are many methods for selecting p, such as partial autocorrelation or the Akaike Information Criterion (AIC), which are discussed in more detail in Fugate et al. (2001). In general, we seek the lowest order model that captures the underlying physical process and hence will generalize to other data sets. As with linear AR modeling, we create the training set on which to build our 32 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines SVM-based model by using each observation as the dependent variable and the previous p observations as independent variables. Our training samples are thus {(X_p:t_i, x), t p + 1, . . . , t0}. Ideally we would like to find a function f such that f(x_.t_1)= x for all k and t to. However, the form of f is often restricted to the class of linear functions (as is the case for AR models), f(x_p:t_i) = Kw, X_p:t_i) (3.2) where (,) denotes the dot (or inner) product and w is a vector of model parameters. This restricted form makes perfect fit of the data impossible in most scenarios. As a result, we allow prediction using f to have an error bounded by e, and find w under this constraint. With the recent advances in penalized regression methods such as ridge regression and lasso, the improved prediction performance of shrunken (or smoothed) models is now well-understood (Copas, 1997; Fu, 1998). Thus in order to provide a model that maximizes prediction performance, we seek to incorporate shrinkage on the model paramaters w. Such shrunken w may be found by minimizing the Euclidean norm subject to the error constraint , namely minimize IIwII2 (3.3) Ixk_Kw,xk subject to t—p.t—1 —1Kw, X_p:t_i) — This model relies on the assumption that a linear model is able to fit the data to within precision e. However, typically such a linear model does not exist, even for moderate settings of e. As such, we introduce the slack variables ,, to allow for deviations beyond €. The resulting formulation is minimize IIwW2+CZ+1(+) (3.4) subject to — Kw, xp:t_i) E + 1Kw,x_p:t_i) _x The constant C controls the tradeoff between giving small w and penalizing deviations larger than e. In this form we see that only points that lie out side of the bound e have an effect on w. Figure 3.1 illustrates the process graphically. 33 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines Figure 3.1: Illustration of linear support vector regression fit Although this optimization problem is straightforward to carry out, the extension to non-linearity is revealed by the dual formulation. We thus proceed by constructing a Lagrange function of the above by introducing a set of dual variables. L : IIwII2 +C ( +) - (+ X + (w,xp:ti)) t=p+1 t=p+1 (3.5) (c + - + W, xp:ti)) - + h7i) t=p+1 tP+l where the dual variables a, a, are understood to be non-negative. It can be shown that this function has a saddle point at the optimal solution, and hence = w — (at + a) Xp:t_l = 0 (3.6) t=p+1 - =0 Plugging these saddlepoint constraints into L yields the following dual 34 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines l1. “a.. 70. 0 7 1 9 0 Figure 3.2: Illustration of mapping to an alternate space to induce linearity optimization problem: maximize tp+i (t + c) ( ±) (tiSp:t’-i) (37) Zt=p+i (c ±c ) +Z+1xt (ct —c ) subject to fx — (W,X_p:t_i) (W,X_p:t_i) C+ Notice that by the saddlepoint constraint w = Zp+1 (t +) Xp:t_l we may write f as to f(x_p:t_i) = (€ — ü) (xi_p:t_i,x_p:t_i) (3.8) t’=p+1 In this way w may be viewed as a linear combination of the training points X_p.t_1. Note also that in this formation both f and the correspond ing optimization can be described in terms of dot products between the data. In this way, we can transform the data using the function : —* F, and compute the dot products in the transformed space. Such mappings allow us to extend beyond the linear framework presented above. Specifically, the mapping allows us to fit linear functions in F which, when converted back to IR, are nonlinear. A toy example of this process is illustrated for a mapping R2 —* R3, namely (x,y) =(x2,x/y),y), in Figure 3.1. Here the data is generated using the relationship y = x2. To make use of this transformed space, we replace the dot product term with ( (x_pto_i) , (x_p:t_)) (3.9) 35 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines If F is of high dimension, then the above dot product will be extremely expensive to compute. In some cases, however, there is a corresponding kernel that is simple to compute. For example, the kernel k(x, y) = (x . corresponds to a map F into the space spanned by all products of exactly d dimensions in IRP. When d, p = 2, for instance, we have (x . = ((xi, X2) . (yi, y2)) (3.10) = ((x,v’xix2, ) (Y?V’Y1Y2Y)) = ((x),(y)) defining I’(x) = (xi, \/x1x2, x). More generally, it has been shown that every kernel that gives a positive matrix (k(x, y)) has a corresponding map (x) (Smola and Schlkopf, 2004). One such family of kernels we focus on is Radial Basis Function (RBF) kernels, which have the form k(x, y) = exp (—lix — y112/(2u)) (3.11) where o2 is the kernel variance. This parameter controls fit, with large values leading to smoother functions and small values leading to better fit. In practice moderate values are preferred as a trade-off between model fit and prediction performance. Whereas a traditional AR(p) model employs a linear model that is a function of the previous p time points, the SVM model looks at the previous p time points compared to all groups of p successive data points from the training sample. Specifically, the model has the form to :r k iii k k J iXj_p:i_i) — /JjnAXj_p:j_, Xt_p:t_1 jp+1 Typically only a small fraction of the coefficients /3 are non-zero. The corresponding samples X_p.j_1 are called support vectors of the regression function because only these select samples are used in the formulation of the model. Once we have trained our model above, we use it to predict each future observation. We then take the residuals and use them as an indicator of structural change. For our purposes we employ a control chart to monitor if the system generating the data has changed. In this discussion the control chart is created by constructing 99% control lines that correspond to 99% confidence intervals for the residuals of the model fit to the undamaged data assuming the residuals are normally distributed. This normality assumption is further discussed in the experimental results below. These control lines 36 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines are then extended through the remaining (potentially damaged) data and damage is indicated when a statistically significant number of residuals, in this case more than 1%, lie outside these lines. Note damage can also be indicated when the residuals no longer have a random distribution even though they may not lie outside the control lines. RBF neural networks, which have the same form as Equation (3.12), have previously been used to perform SHM (e.g. Rytter and Kirkegaard (1997)). However, fitting these networks requires much more user input such as selecting which /3 are non-zero as well as selecting the corresponding training points. In addition, the fitting of the neural network model is a rather complicated nonlinear optimization process relative to the simple quadratic optimization used in the support vector framework. Although the SVM models are more easily developed, Schlkopf et al. (1997) have demonstrated that SVMs still more accurately predict the data than the RBF neural networks despite their simplicity. 3.2.1 Example: Simulated Damage We now compare the performance of the SVM-based damage detection method to a traditional AR model with coefficients estimated by the Yule- Walker method (see, for example, Brockwell and Davis (1991)). The data is generated as follows for discrete time points, t = 1,..., 1200: =sin3(400 rt/1200) +sin2(400 rt/1200) + sin(200rrt/1200) (3.13) +sin(lOOirt/1200) + ‘I’ + e (3.14) where c is Gaussian random noise with mean 0 and standard deviation 0.1 and ‘I’ is a damage term. Three different damage cases are added to this time series at various times as defined by ci for t=600,...,650 sin(1000irt/1200) for t=800,...,850 — for t = 1000, ... , 1050 0 otherwise where El and 2 are Gaussian random noise with mean 0 and 1, and stan dard deviation 0.5 and 0.2, respectively. Through the use of Jf we attempt to simulate several different types of damaged to compare the models per formance handling each. This raw signal is plotted in Figure 3.3 where it can be seen that the changes caused by the damage are somewhat subtle. 37 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines 0 200 400 600 800 1000 1200 time Figure 3.3: Raw simulated data with highlighted artificial damage Figure 3.4: Autocorrelation plot of simulated data The order p for both models was set at 5 as determined from the autocor relation plot in Figure 3.4. This plot is the measure of correlation between successive time points for a given time lag. We see from the plot that after a lag of 5, the correlation is quite small, and hence little information is gained by including a longer past history p. This is a standard method for deter mining model order for traditional AR models, and as such should maximize this methods performance, ensuring the SVM-based model isnt afforded an unfair advantage. The results of applying both the SVM model and a traditional AR model to the undamaged portion of the signal between time points 400 and 600 are shown in Figure 3.5 where the signals predicted by these models are overlaid on the actual signal. A qualitative visual assessment of Figure 3.5 shows that the SVM more accurately predicts this signal. A quantitative assessment is made by examining the distribution of the residual errors obtained with each model. The standard deviation of the residual errors from the SVM model is 0.26 while for the traditional AR it is 0.71, again indicating that the SVM is more accurately predicting the undamaged portion of this time series. In order for a model to excel at detecting damage, it must fit the un damaged data well (i.e small and randomly distributed residual errors) while fitting the damaged data poorly as identified by increased residual errors 0 5 15 Lg 38 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines 400 450 500 550 600 time Figure 3.5: SVM (top) and linear AR models (bottom) fit to subset of data with possibly non-random distributions. In other words, the model must be sensitive to distributional changes in the data that result from damage. To quantify such changes a control chart is developed based on the undamaged portion of the time series to establish statistically based thresholds for the damage detection process. As mentioned earlier, this control chart is calcu lated based on the fit to the undamaged data, specifically 99% confidence lines are drawn based on the undamaged residual error data and carried forward for comparison on the potentially damaged data. It is in this part of the process that the SVM’s ability to more accurately represent the data enhances the damage detection process. The 99% confidence lines for the SVM are much closer to the mean value of the residual errors and, hence, will more readily identify small perturbations to the underlying system that produce changes in the residual error distribution. In addition, the tradi tional AR model shows a trend in the residuals, indicating lack of model fit, even in the undamaged case. We see that during the times of damage the residuals for the SVM-based model exceed the control limits more than occurs with the residuals from the traditional AR model. In fact, the latter method would likely miss the damage between time points 1000 and 1050, where only one point exceeds the threshold versus over 10 for the SVM-based model. This result can be seen in Figure 3.6. Since each method performs differently for different sources of damage, it is of interest to determine when each method will be successful in indicating 39 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines *I 400 600 800 1000 1200 time Figure 3.6: Residuals from SVM (top) and linear AR models (bottom) ap plied to simulated data The 99% control lines based on the residuals from the undamaged portion of the signal are shown in red. damage. Since the traditional AR model fits a single model to the entire data, model fit will be very poor if the data is non-stationary (for instance if the excitation is in the form of hammer impacts). Additionally, since the traditional AR model as presented above does not contain a moving average term, it will continue to fit when damage is in the form of a shift up or down in the raw time series (as demonstrated by the third damage scenario above). Conversely, the SVM-based method works by comparing each length of p data to all corresponding sets in the training set. Thus, if a similar sequence exists in the training set, we can expect the fit to be quite good. We see two scenarios in which the SVM-based method will perform poorly. Firstly, if there is some damage in the undamaged scenario, and similar damage occurs in the testing set, the model will likely fit this portion quite well. Secondly, if damage manifests itself in such a way that the time-series data is extremely similar to the undamaged time-series, the SVM methodology will be unable to detect it. However, we should emphasize that other methods, including the AR model, will suffer in such scenarios as well. As an attempted solution when the sensitivity of the method to a given type of damage is unknown and simulation tests are impossible, various damage detection methods could potentially be combined to boost detection power. a) —-D S ‘i a) 0, a) V —. S -(a 40 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines 3.3 Joint Online SHM In the undamaged state, a Gaussian distribution can often approximate the residuals from fitted models or control charts can be developed to invoke the central limit theorem and force some function of the residual errors to have a Gaussian distribution such as with an x-bar control chart (Fugate, et al, 2001). If we have K sensors, each of whose residuals are Gaussian distributed, we would like a way of combining these residuals to come up with a damage detection method that examines all K sensors. Noticing that the sum of K squared standard Gaussian random variables is distributed as a chi-squared random variable with K degrees of freedom, we square the residuals from each sensor (after they are normalized to have mean 0 and variance 1 based on the undamaged data) and add them together to create a new combined residual. These new combined residuals follow a chi-squared distribution, and hence we can make probabilistic statements about the residuals being typical or not (indicative of damage). Specifically, consider the combined residuals at some time point t: K (3.15) k= 1 where r is the normalized residual at time t for sensor k. Assuming the original residuals are Gaussian distributed, this random variable will have a chi-squared distribution with K degrees of freedom. Note that even when the original residuals are not approximately Gaussian, we may still employ a control chart on the combined residuals to give probabilistic statements regarding damage. For instance, when the residual errors from the fitted model have thicker tails than Gaussian, control charts must be employed to make probabilistic statements of the combined residual. However, as well see in the following example, the residual errors are often very close to Gaussian. In addition to these combined residuals allowing us to make statements regarding damage from multiple sensor output, they also provide us with a mechanism for determining which sensors are most influenced by the dam age. This latter property is of particular importance for damage location. If this combined residual is large, and hence we determine that there is damage, we can look at the values (r)2 for each sensor and from their magnitudes determine which sensors contributed the most to this large combined resid ual. If we detect damage over a range of values, we may average (r)2 over this range for each sensor to determine how much each sensor is contributing 41 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines D A 3rd Floor 3rd Floor *otor4 A .Cthrnfl 0.77 3 2nd Floor 2nd Floor; I 1stFloor 1st Floor et.r2 - i. Shkor nfl ri r. ___ Pn.iero7ote11 074 ___________________ L--. J Figure 3.7: Diagram of experimental structure to the anomalous reading. 3.3.1 Example: Experimental Data We look at joint online SHM using SVMs on experimental data from a struc ture designed to produce non-linear response when it is damaged. The struc ture is a three-story building (Figure 3.7) consisting of aluminum columns and plates with bolted joints and a rigid base that is constrained to slide hor izontally on two rails when excited by an electro-dynamic shaker. Each floor is a 30.5 x 30.5 x 2.5cm plate and is separated from adjacent floors by four 17.7x 2.5x0.6cm columns. To induce non-linear behavior, a 15.Ox 2.5x2.5cm column is suspended from the top floor and a bumper is placed on the sec ond floor. The contact of this suspended column with the bumper results in non-linear effects. The initial gap between the suspended column and the bumper is adjusted to simulate different levels of damage. In our test data we employ the case where the column is set 0.05mm away from the bumper. The undamaged data is obtained when the bumper and suspended column do not contact each other. The structure is subjected to a random base excitation from the shaker in both its undamaged and damaged condition. Accelerometers mounted on each floor record the response of the structure to these base excitations. A more detailed description of the test structure and the data obtained can be found at www.lanl.gov/projects/ei. We first concatenate the undamaged data with the damaged data to demonstrate that the proposed methodology adequately detects the damage. 42 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines The SVM time series models are developed for each of the accelerometer measurements from the undamaged data as follows: 1. Select the number of time lags that will be used in the time series models. In this case eight time lags were used based on the AIC. Note the number of time lags is analogous to the order of an AR model. 2. Select the parameters of the SVM model, including the kernel type and corresponding parameters as well as C and E, which control model fit as described earlier. In our case we used a Gaussian kernel with variance 1 and set C 1 and e = 0.1. We have found the methodology to be robust to choices of variance ranging over an order of magnitude. In addition, C could be increased to force fitting of extreme values, and could be lowered to enforce a closer fit to the training data. 3. Pass the data (arranged as dependent variable and previous p points as independent variables) to the optimization described by Equation (3.7). In this case we use the first 6000 undamaged points as train ing data. This step is handled by the wide variety of support vector machine software available covering multiple computing environments including MATLAB and R. In particular, we employ the libSVM li brary with accompanying MATLAB interface (Chang and Lin, 2001). 4. Once the SVM model is trained (i.e. the /3j in Equation (3.12) are selected) in step 3, make predictions based on the new test data from the structure in its undamaged or damaged condition. Next, calculate the residual between the measured data and the output of the time series prediction. 5. Square and add the residuals from each sensor as described by Equa tion (3.15). Build a control chart for these combined residuals to detect damage (perhaps in conjunction with statistical tests such as a sliding window approach). Note that steps 1 through 4 of this process are applied to each time series recorded by the four accelerometers shown in Figure 37. First we will revisit the normality assumption that was made in con structing the control chart. Figure 3.8 shows the resulting Q-Q plot for the residuals from the SVM model fit to sensor 4 data obtained with the struc ture in its undamaged state. The Q-Q plot compares the sample quantiles of the residuals to theoretical quantiles of a Gaussian distribution. We see 43 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines Cl) a, 0 G) E Cu C,) Theoretical Quantiles Figure 3.8: Q-Q Plot of residuals from SVM model in this figure that the sample quantiles fall very close to the theoretical line, and hence our residuals are approximately Gaussian. Figure 3.9 shows the residual errors from the SVM fit to each of the ac celerometer readings, respectively, and the corresponding 99% control limits that are based on the first 6000 points from the undamaged portion of each signal. There are 8192 undamaged points and 8192 damaged ones. Thus when we concatenate the data the damage occurs at time point 8193 of 16384. Figure 3.10 shows the density of the normalized residual errors from all the sensors that have been combined according to Equation (3.15). We see that the distribution is very nearly chi-squared. In situations where the original residuals arent normal, this result wont be true, and hence proba bilistic statements regarding the presence of damage must be made based on control charts. Figure 3.11 shows the combined residuals as a function of time. The blue points in the plot show damage indication using the sliding window approach of Ma and Perkins (2003) as described in the introduction and based on the 99% control lines. Specifically we use a window size of 6 which, when combined with the 99% control limit, detects damage whenever 1 or more of the 6 points in the window exceeds the control line (equivalent to binomial probability of 0.05). We see from Figure 3.9 that sensors 3 and C C’J —4 —2 0 2 4 44 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines 0 In 0 a q II) 0 fit 0 0 0 o q In 7000 7500 8000 8500 9000 time sensor 4 Figure 3.9: Residuals from 4 sensors for t = 7000, ... , 9384 The 99% control lines are shown in red 5 10 15 20 Figure 3.10: Density estimate of combined residual (black) vs. chi-squared distribution (red) sensor 1 a V a E a a In 0 0 0 In 0 q In a V a a It sensor 2 I II I I I 7000 7500 8000 8500 9000 time sensor 3 I ill .9 In 0 In 0 8 ‘Is In 0 II Jib b11 II If 7000 7500 8000 8500 9000 time 7000 7500 8000 8500 9000 time 45 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines 7000 7500 8000 8500 9000 Figure 3.11: Combined residuals from all 4 sensors The 99% control line shown in red. Sliding window damage indicator shown in blue. 4 are most influenced by damage. This result is expected as the bumper is mounted between these two sensors. In fact, if we look at the average values of (r)2 (which are the individual squared residuals for sensor k) over the damaged section for each sensor, we see that the first two sensors have values 0.96 and 1.24, whereas the second two sensors have values 59.80 and 38.2, respectively. Thus from this numerical rating we can see that sensors 3 and 4 are most influenced by the damage, which agrees with the result shown in Figure 3.9. From this analysis it is evident that we can use the combined residuals to establish the presence of damage in a statistically rigorous manner and then examine the individual sensor residuals in an effort to locate the sensors most influenced by the damage. This latter information can be used to help locate the damage assuming that the damage is confined to a discrete location such as the formation of a crack in a welded connection. Further investigation is needed to assess how this procedure could be used to locate damage for more distributed damage such as that associated with corrosion. 3.4 Conclusion Although the application of statistical techniques to structural health mon itoring has been investigated in the past, these techniques have predomi nantly been limited to identifying damage-sensitive features derived from linear models fit to the output from individual sensors. As such, they are typically limited to identifying only that damage has occurred. In general, 46 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines these methods are not able to identify which sensors are associated with the damage in an effort to locate the damage within the resolution of the sensor array. To improve upon this approach to damage detection, we have applied support vector machines to model sensor output time histories and have shown that such nonlinear regression models more accurately predict the time series when compared to linear autoregressive models. Here the metric for this comparison is the residual errors between the measured response data and predictions of the time series model. The support vector machine autoregressive method is superior to tra ditional linear AR in both its ability to handle nonlinear dynamics as well as the structure of the model. Specifically, the support vector approach compares each new testing point to the entire training set whereas the tra ditional AR model finds a simple linear relationship to best describe the entire training set, which is then used on the testing data. For example, when dealing with transient impact data, the AR model will fail in try ing to fit the entire time domain with a simple linear model. Whereas in the past RBF neural networks have been used to tackle this problem, these networks require significant user input and complex methods for fitting the model to the training data, and hence the simple support vector framework is preferred. Furthermore, we have also shown how the residuals from the SVM predic tion of each sensor time history may be combined in a statistically rigorous manner to provide probabilistic statements regarding the presence of dam age as assessed from the amalgamation of all available sensors. In addition, this methodology allows us to pinpoint the sensors that are contributing most to the anomalous readings and therefore locate the damage within the sensor networks spatial resolution. The process was demonstrated on a test structure where damage was simulated by introducing an impact type of nonlinearity between the measured degrees of freedom. The authors ac knowledge that the approach has only been demonstrated on a structure that was tested in a well-controlled laboratory setting. This approach will have to be extended to structures subjected to real-world operational and environmental variability before it can be used in practice. However, the approach has the ability to adapt to such changes through the analysis of appropriate training data that span these conditions. Therefore, follow-on studies will focus on applying this approach to systems with operational and environmental variability as well as systems that exhibit nonlinear response in their undamaged state. 47 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines 3.5 References Allen, D., Sohn, H., Worden, K., and Farrar, C. (2002). “Utilizing the Sequential Probability Ratio Test for Building Joint Monitoring.” Proc of SPIE Smart Structures Conference. San Diego, March 2002. Bulut, A. and Singh, A.K. and Shin, P. and Fountain, T. and Jasso, H. and Yan, L. and Elgamal, A. (2005). “Real-time Nondestructive Structural Health Monitoring Using Support Vector Machines and Wavelets.” Proc. SPIE. 5770:180-189. Brockwell, P., and Davis, R. (1991). Time Series Analysis: Forecasting and Control. Prentice-Hall. Chattopadhyay, A., Das, S., and Coelho, CK (2007). “Damage Diag nosis Using a Kernel-based Method.” Insight-Non-Destructive Testing and Condition Monitoring. 49:451-458. Chang, C-J., Lin, C-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/ cjlin/libsvm. Clark, G. (2008) “Cable Damage Detection Using Time Domain Re flectometry and Model-Based Algorithms.” Lawrence Livermore National Laboratory document LLNL- CONF-4 02567. Copas, J.B. (1997) “Using Regression Models for Prediction: Shrink age and Regression to the Mean.” Statistical Methods in Medical Research. 6:167-183. Doebling, S., Farrar, C., Prime, M., Shevitz, D. (1998) “A Review of Damage Identification Methods that Examine Changes in Dynamic Proper ties.” Shock and Vibration Digest. 30:91-105. Farrar, C.R., Worden, K. (2007). “An Introduction to Structural Health Monitoring.” Philosophical Transactions of the Royal Society A. 365:303- 315. Fu, W.J. (1998). “Penalized Regressions: The Bridge Versus The Lasso.” Journal of Computational and Graphical Statistics. 7:397-416 Fugate, M., Sohn, H., and Farrar, C.R. (2001). “Vibration-Based Dam age Detection Using Statistical Process Control.” Mechanical Systems and Signal Processing. 15:707-721. Herzog, J., Hanlin, J., Wegerich, S., Wilks, A. (2005). “High Perfor mance Condition Monitoring of Aircraft Engines.” Proc of GT2005 ASME Turbo Expo. June 6-9, 2005. Ma, J., and Perkins, S. (2003). “Online Novelty Detection on Tempo ral Sequences.” Proc of ninth ACM SIGKDD international conference on knowledge discovery and data mining. 613-618. Rytter, A., and Kirkegaard, P. (1997) “Vibration Based Inspection Using 48 Chapter 3. Structural Health Monitoring with Autoregressive Support Vector Machines Neural Networks,” Structural Damage Assessment Using Advanced Signal Processing Procedures, Proceedings of DAMAS 97. University of Sheffield, UK. pp. 97108. Scholkopf, B., Sung, K.K., Burges, CJC, Girosi, F., Niyogi, P., Poggio, T. and Vapnik, V. (1997) “Comparing Support Vector Machines with Gaussian Kernels to Radial Basis Function Classifiers.” IEEE Transactions on Signal Processing. 45:2758-2765. Schlkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., and Williamson, R. (2001). “Estimating the Support of a High-Dimensional Distribution.” Neural Computation. 13:1443-1471. Shimada, M. and Mita, A. and Feng, M.Q. (2006) “Damage detection of structures using support vector machines under various boundary condi tions.” Proc. SPIE. 6174:61742-61742 Smola, A.J., and Schlkopf, B. (2004). “A tutorial on support vector regression.” Statistics and Computing. 14:199-222. Sohn, H., Czarnecki, J., and Farrar, C.R. (2000). “Structural Health Monitoring Using Statistical Process Control.” Journal of Structural Engi neering. 126:1356-1363. Sohn, H., Farrar, C.R., Hemez, F.M., Shunk, D.S., Stinemates, D.W., Nadler, B.R., and Czarnecki, J.J. (2004). “A Review of Structural Health Monitoring Literature from 1996-2001.” Los Alamos National Laboratory report LA -13976-MS. Worden, K. And Lane, A. J. (2001) “Damage Identification Using Sup port Vector Machines,” Smart Materials and Structures. 10:540-547. Worden, K. and Manson, G. (2007). “The Application of Machine Learn ing to Structural Health Monitoring.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Science. 365:515- 537. 49 Chapter 4 Conclusion Because research in signal processing is being undertaken by physicists, com puter scientists, statisticians, and engineers among others, many tools de veloped by one group aren’t fully adopted by others. This is partially due to differences in jargon, but also because of each group’s different focus and goals. However, this thesis shows that methods developed by one group for a given purpose may often be employed quite successfully by another group for an entirely different problem. With the state of the art in particle filtering focussing on limiting degen eracy of the algorithm, it is likely that future research in the area might be applied to the material in chapter 2 to extend the scope of application. In addition, the development of support vector machines is moving toward im plementing the method quickly and online, while minimizing space require ments. These advances might increase the ability of performing structural health monitoring as discussed in chapter 3 to long time series for which storage and computation becomes difficult. While this thesis successfully implements two separate statistical meth ods, each is developed in a fairly specific nature, when in fact the scope of application is much more general, and may apply to problems not covered in this work. As future research, prior sensitivity and cross-validation need to be studied with the goal of easing implementation for multi-dimensional pa rameters. Since existing methods, including the one presented, have compu tational complexity which scales exponentially with dimension, alternative methods must be found. In regards to structural health monitoring, more attention must be paid to jointly modeling all sensors simultaneously, taking their correlation into effect. In addition, more studies must be undertaken to understand the effect of varying environmental conditions as well as if the initial system is slightly damaged, and hence nonlinear. Whether the solutions to these problems come from the world of signal processing is to be seen. Both of the ideas presented in this thesis have been greeted with en thusiasm from researchers at Los Alamos National Laboratories, who daily analyze complex and computationally expensive systems. In particular, the 50 Chapter 4. Conclusion use of sequential Monte Carlo for prior sensitivity and cross-validation has potential to reduce the computational time of building models for under standing complex systems such as those present in biological and weapons research. In addition, the power gained from using SVM’s for structural health monitoring will allow for earlier detection of damage, and hence en sure the structure’s economic viability as well as the safety of operators. 51 Chapter 5 Technical Appendix Proof of Theorem 1. (following along the lines of Peruggia (1997) and Epi fani et al. (2005)) We seek to show that the rtk moment of successive importance weights is finite. So we need to find the conditions under which f(e)de is finite, where (O) = (q(9))l7(q\(9))7 x (w\S79))r. We expand and simplify (9) to obtain (O)=f1(yIj3,u2x f(yI,u2)x r(3) x ir(a2) x (w\S,7(9))r =f(yI,u2)x [w\s,7(e)] () x n—s(7+re) 2 X Tr(/3) x ir(u2) x exp{- [(y - X)T(y- X) - ( + re)(ys -X)T(y- X)]} =q(9) X where i(9) =) x (2) x exp {_± [( - )T [XT- (+ rE)XXs] ( - 2(9) —1)—i x exp {— [T — (7+ re)yys — T [XTX — (7+ re)XsX] ] } and /3 = [XTX — (7+ re)XsX]’ [yTX — (7+ re)ysX]. We will show momentarily that [XTX — (‘y + re)XsX] is positive definite, and hence invertible. Note that q (9) is proportional to a proper density for 9 when [XTX — (‘y + re)XsXj is positive definite. In this case q5’(9) is upper bounded. Now (9) is proportional to an inverse gamma distribution pro vided that both n—s(7+re) 1 — (7+ re)yys — 1T [XTX — (7+ re)XXs] > 0 52 Chapter 5. Technical Appendix Thus, aside from showing conditions under which [XTX — (7+ rc)XsX] is positive definite, we also need to find conditions guaranteeing the above 2 inequalities. We first shOw that [XTX — (y + r€)XXs] is positive definite. Using the Woodbury matrix identity, we see that [XTX — (-y + rc)XXs] —1 may be written as (XTX)_l + (XTX)_l(y + r)Xs (I — (7 + re)X’(XTX)_1Xs) X(XTX)_l Now if (I — ( + re)X(XTX) is positive definite, the second term in the above sum is positive semi-definite. This is the case when all the eigen values of X(XTX)_lXs are less than [XTX — (y + rc)XXs] —‘ may then be written as the sum of a positive definite and a positive semi- definite matrix, and hence [XTX — ( + re)XXs] is positive definite. Now we proceed to find conditions ensuring — (7+ re)yys — T [XTX — (7+ re)XsX] > 0. Simple but tedious algebra gives the following expression: — (7+ r)yys — T [XTX — (7+ rc)XsX] =RSS — (7+ re)e (I — (7+ re)Hs) es =RSS*\s(7 + rc) which, by the theorem’s conditions, is greater than 0 for argument value 1, and since RSS*\s is a smoothly decreasing function in its argument, it is also positive for some positive argument value less than 1. Now, we choose e < which implies c > 7 + re. By c satisfying (2.8), the conditions outlined in the proof hold. Namely, a) ‘H < 1/(7 + re), since these eigenvalues are upper bounded by 1, b) n — 8(7 + re) > 2, and c) RSS*\s(7 + r) > 0. E 53


Citation Scheme:


Usage Statistics

Country Views Downloads
China 1 0
City Views Downloads
Beijing 1 0

{[{ mDataHeader[type] }]} {[{ month[type] }]} {[{ tData[type] }]}


Share to:


Related Items