UBC Theses and Dissertations


Air-quality model evaluation through the analysis of spatial-temporal ozone features Shi, Tianji 2015

Full Text

Air Quality Model Evaluation Through the Analysis of Spatial-Temporal Ozone Features

by

Tianji Shi

B.Sc., University of Massachusetts Amherst, 2006
M.Sc., University of Massachusetts Amherst, 2008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies
(Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

August 2015

© Tianji Shi 2015

Abstract

Legislative actions regarding ozone pollution use air quality models (AQMs) such as the Community Multiscale Air Quality (CMAQ) model for scientific guidance, hence the evaluation of AQM is an important subject. Traditional point-to-point comparisons between AQM outputs and physical observations can be uninformative or even misleading since the two datasets are generated by discrepant stochastic spatial processes. I propose an alternative model evaluation approach that is based on the comparison of spatial-temporal ozone features, where I compare the dominant space-time structures between AQM ozone and observations. To successfully implement feature-based AQM evaluation, I further developed a statistical framework of analyzing and modelling space-time ozone using ozone features. Rather than working directly with raw data, I analyze the spatial-temporal variability of ozone fields by extracting data features using Principal Component Analysis (PCA). These features are then modelled as Gaussian Processes (GPs) driven by various atmospheric conditions and chemical precursor pollution. My method is implemented on CMAQ outputs during several ozone episodes in the Lower Fraser Valley (LFV), BC. I found that the feature-based ozone model is an efficient way of emulating and forecasting a complex space-time ozone field. The framework of ozone feature analysis is then applied to evaluate CMAQ outputs against the observations.
Here, I found that CMAQ persistently over-estimates the observed spatial ozone pollution. Through the modelling of feature differences, I identified their associations with the computer model’s estimates of ozone precursor emissions, and this CMAQ deficiency is focused on LFV regions where the pollution process transitions from NOx-sensitive to VOC-sensitive. Through the comparison of dynamic ozone features, I found that CMAQ’s over-prediction is also connected to the model producing a higher than observed ozone plume in the daytime. However, the computer model did capture the observed pattern of diurnal ozone advection across LFV. Lastly, individual modelling of CMAQ and observed ozone features revealed that even under the same atmospheric conditions, CMAQ tends to significantly over-estimate the ozone pollution during the early morning. In the end, I demonstrated that the AQM evaluation methods developed in this thesis can provide informative assessments of an AQM’s capability.

Preface

The statistical methods of air quality model (AQM) evaluation, as well as the particular framework of space-time ozone modelling presented in this thesis, are the products of my original ideas, with extensive guidance and motivation from my supervisors: Prof. Douw G. Steyn and Prof. William J. Welch. The statistical AQM evaluation methods and the ozone modelling framework are essentially a collection of mostly existing statistical methodologies combined and applied in a novel way. The sources of existing methodologies and results are cited and discussed at the appropriate places in the main text.

The entire research is originally motivated by a scientific question posed by Prof. Steyn. The data are provided by Prof. Steyn, Dr. Bruce Ainslie from Environment Canada, and Metro Vancouver.
All computer codes, unless noted in the thesis, are programmed by me.

Selected materials in Chapters 3, 4 and Appendix B.1 were summarized into a 30-minute presentation at the 33rd International Technical Meeting (ITM) on Air Pollution Modelling and its Application, which took place August 26th to 30th, 2013 in Miami, Florida. The presented materials were further prepared for a chapter in the book “Air Pollution Modelling and its Application, XXIII” (Copyright 2014, Springer). Additional manuscripts based on the research in this thesis are in preparation for submission to peer-reviewed journals.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Notations and Abbreviations
Acknowledgements
Dedication
1 Introduction
1.1 Ozone Process and the CMAQ System
1.1.1 Tropospheric Ozone Formation Processes
1.1.2 The CMAQ Modelling System
1.2 The Problems with “Usual” Means of AQM Evaluation
1.3 Research Topics and Objectives
1.3.1 Topics of Ozone Feature Analysis and Modelling
1.3.2 Relation to Existing Model Evaluation Projects
1.4 Literature Review
1.4.1 PCA and Extraction of Data Features
1.4.2 Emulation of Non-linear Computer Models and Physical Processes
1.4.3 Feature-based AQM Evaluation
1.5 Novelty of Proposed AQM Evaluation Methods
1.6 Thesis Structure
2 Data
2.1 Air-quality Model Output and Physical Observations
2.1.1 Data from Computer Models
2.1.2 Observation Data
2.2 Data for CMAQ Evaluation in Chapters 5 and 6
2.2.1 Interpolated CMAQ Data
2.2.2 Missing Observations and Measurement Errors
2.3 Summary of Data Used in the Thesis
3 Principal Component Analysis of Space-time Ozone
3.1 LFV Ozone during an Episode
3.2 PCA Methods and Related Topics
3.2.1 Definitions of EOFs and PCs
3.2.2 Mathematics of PCA
3.2.3 Relevant PCA-Related Topics
3.3 The Number of Useful Ozone Features
3.3.1 Recovering Data Variations
3.3.2 Order of Ozone Feature Degeneracy
3.4 Ozone Features of LFV Ozone Episodes
3.4.1 Common Ozone Features of All Episodes
3.4.2 $P_1 E_1^T$
3.4.3 $P_2 E_2^T$
3.4.4 Higher-order Ozone Features
3.4.5 Sampling Stability of Ozone Features
3.5 Chapter Conclusion
4 A Statistical Model of Space-Time Ozone Features
4.1 Ozone Features and Gaussian Process Models
4.1.1 Gaussian Process Model for an EOF
4.1.2 Gaussian Process Model for a PC
4.1.3 Modelling a Complete Space-time Ozone Process
4.1.4 Use of GPs for Modelling Ozone Features
4.2 Background on Gaussian Process Models
4.2.1 Best Linear Unbiased Predictor (BLUP)
4.2.2 BLUP and Gaussian Distribution
4.2.3 Fitting the GP Models
4.3 Variable and Covariate Selection
4.3.1 Model Variables
4.3.2 Selection of Model Covariates
4.3.3 Goodness-of-fit Statistics
4.4 The Framework of Feature-based Ozone Modelling
4.4.1 Training Set and Predictive Set
4.4.2 PCA of the “Rectangular” CMAQ Output
4.5 Covariate Selection
4.5.1 Implementation Details
4.5.2 Selection Results
4.6 Modelling and Forecasting Ozone Features
4.6.1 Modelling and Forecasting the EOFs
4.6.2 Modelling and Forecasting the PCs
4.7 Forecast of Space-Time Ozone Fields
4.8 Model Fits from other Episodes
4.9 Chapter Conclusion
5 AQM Evaluation I: Comparison of Ozone Features and Modelling of Feature Differences
5.1 Evaluation Methods and Strategy
5.1.1 Model of Feature Differences $\tilde{E}^d_j$ and $P^d_j$
5.1.2 PCA of CMAQ Outputs and Observation Data
5.1.3 Evaluation Strategy
5.1.4 Discussion of Evaluation Methods
5.2 Comparison of the Mean Fields, $\tilde{E}_1$ and $E_1$
5.2.1 General Features of $\tilde{E}^d_1$ and $E^d_1$
5.2.2 Covariate Selection for $\tilde{E}^d_1$
5.2.3 Detailed Analyses of $\tilde{E}^d_1$
5.3 Comparison of $P_1$: Hourly LFV Mean Ozone
5.3.1 Modelling $P^d_1$
5.4 Comparison of Higher-order Features
5.5 Chapter Conclusion
6 AQM Evaluation II: Comparison of AQM and Observations as Stochastic Ozone Processes
6.1 Pre-analysis Comments
6.2 Comparing the Space-time Ozone Processes
6.3 Comparison of $\hat{P}^c_1$ and $\hat{P}^o_1$
6.4 Chapter Conclusion
7 Conclusion
7.1 Main Contributions
7.1.1 Recap of Evaluation Results
7.1.2 Conclusion on AQM Evaluation
7.2 Additional Contributions
7.2.1 Understanding the Features of LFV Ozone
7.2.2 Ozone Feature Models
7.3 Future Work
7.3.1 Application for Other Air Pollution Data
7.3.2 Further Works on Ozone Feature Models
Bibliography
Appendices
A Appendix Related to Chapter 2
A.1 Details on Simulated Ozone Data
A.1.1 Simulated True Ozone Data
A.1.2 Simulated CMAQ Output and Observation
B Appendix Related to Chapter 3
B.1 Analyses of Simulated (Synthetic) Ozone Field
B.2 Plots of $E_j$ from Other PCA Methods
C Appendix Related to Chapter 4
C.1 PCA of GP Model Variables
C.2 Prediction Bias of Feature-Based Ozone Model

List of Tables

2.1 CMAQ modelled ozone episodes: years and the episode durations.
2.2 Station names and coordinates of current LFV monitoring network.
3.1 The daily wind regime for the middle 3 full days of each episode.
3.2 For the CMAQ outputs of 5 episodes: the $O_{t \times n}$ reconstruction RMSEs in units ppb at $p = 1, \ldots, 8$.
3.3 For the 5 ozone episodes: the proportion of data variation explained by the first 5 EOF/PC sets.
4.1 Covariate selection results from the iterative improvement algorithm.
4.2 Training data Cross-validation RMSE of the models chosen by the iterative improvement method.
4.3 Prediction RMSE of the EOF models.
4.4 Prediction RMSEs of the PC models.
4.5 Prediction RMSE and MPE for the ozone feature models.
4.6 RMSE and MPE of cross-validation predictions made on complete ozone fields.
4.7 Training data Cross-validation RMSE of the ozone feature models of 4 episodes.
4.8 Training data Cross-validation RMSE of the ozone feature models of 4 episodes.
5.1 Proportion of data variation explained by ozone features of orders $j = 1, 2, 3, 4, 5, \ldots, 8$.
5.2 Data reconstruction RMSE at $p = 1, 2, 3, 4, 5, \ldots, 8$. The units are ppb.
5.3 The types of ozone features separability of both CMAQ and observations from all episodes.
5.4 Angles between $\tilde{E}^c_1$ and $\tilde{E}^o_1$.
5.5 Result of covariate selection for $\tilde{E}^d_1$.
5.6 Station names and coordinates of the 2001 LFV monitoring network.
5.7 Angles between $P^c_1$ and $P^o_1$.
5.8 Result of covariate selection for $P^d_1$. The listed covariates are those in addition to “hour of the day”.
5.9 Angles from the joint comparison of the leading 3 ozone features from CMAQ output and physical measurements.
6.1 Covariates used for CMAQ evaluation within the stochastic component of the ozone feature (GP) models.
A.1 Parameter values used for generating additive errors in simulated CMAQ and observation.
B.1 Table of estimated 95% confidence intervals for the means of $\mathrm{RMSE}_j - \mathrm{RMSE}_{j+1}$’s.

List of Figures

1.1 Diurnal trend during summer months of 2004–2008 at Chilliwack.
1.2 Day-time mean ozone level against day of the year at Chilliwack.
1.3 Time-series of observed ozone concentrations at 3 close locations during the time-period 1100PST, June 23rd to 1000PST, June 27th of 2006.
1.4 Time-series of observed ozone concentrations at 3 well-separated locations during the time-period 1100PST, June 23rd to 1000PST, June 27th of 2006.
2.1 Locations of current measuring stations and the corners of the complete rectangular LFV region.
2.2 For 8 selected monitoring stations, the 4 CMAQ neighbours to be used for interpolation.
2.3 From the 2006 CMAQ ozone output: the spatial plot of the 96-hour ozone means overlaid with the LFV shoreline.
2.4 Hourly ozone observations from Pitt Meadow and Rocky Point Park during the 2006 ozone episode.
3.1 For selected hours during the 1985 ozone episode, 3-dimensional spatial plots of the hourly ozone field.
3.2 For selected hours during the 2006 ozone episode, 3-dimensional spatial plots of the hourly ozone field.
3.3 Locations of current measuring stations and the corners of the complete rectangular LFV region.
3.4 The four types of LFV wind regime during an ozone episode as described by Ainslie and Steyn (2007).
3.5 Hourly RMSE (units ppb) of the $O_{t \times n}$ reconstruction for the 1985, 1995 and 1998 episodes.
3.6 Hourly RMSE (units ppb) of the $O_{t \times n}$ reconstruction for the 2001 and 2006 episodes.
3.7 For selected hours, the spatial field of the 2006 CMAQ output and corresponding feature-based data reconstruction using $p = 4$: $\sum_{j=1}^{4} P_j E_j^T$.
3.8 The eigenspectra of $\lambda_j$, $j = 2, \ldots, 8$, decomposed from the 1985 CMAQ output under type IV regime and 2006 outputs under type I regime.
3.9 Plots of the mean fields and $E_1$’s of the 5 ozone episodes.
3.10 Plots of spatial field of temporal ozone standard deviations and $E_2$’s of the 5 ozone episodes.
3.11 Time series plots of hourly spatial (LFV) ozone means and $P_1$’s of the 5 ozone episodes.
3.12 Time series plots of hourly LFV ozone standard deviations and $P_2$’s of the 5 ozone episodes.
3.13 From the 2006 ozone episode under the type I regime: spatial plots of $P_1 E_1^T$ (units ppb) at selected times shown in plot headers.
3.14 From the 2006 ozone episode under the type I regime: spatial plots of $P_2 E_2^T$ (units ppb) at selected hours.
3.15 From the 2006 episode under type I and III wind regime: $P_2 E_2^T$ (units ppb) from the same selected hours.
3.16 From the 2001 episode under type II wind regime: $P_2 E_2^T$ (units ppb) from selected hours.
3.17 From the 1985 episode under type IV regime: dynamic spatial plots of $P_2 E_2^T$ and $P_3 E_3^T$ (in units ppb).
3.18 From the 1985 episode under type IV regime: dynamic spatial plot of joint ozone features $P_2 E_2^T + P_3 E_3^T$ (units ppb).
3.19 From the 2006 episode under type I regime: $P_3 E_3^T$ and $P_4 E_4^T$ (units ppb) from selected hours.
3.20 Sampling stability of PCA for $P_1$.
3.21 Sampling stability of PCA for $P_2$.
3.22 Sampling stability of PCA for $P_3$.
4.1 Figure showing the three neighbours used in arcsin weighting.
4.2 Map of the complete “rectangular” LFV domain.
4.3 From the CMAQ training set, plots of temporal ozone means, standard deviations and the first 4 EOFs.
4.4 From the CMAQ training set, time series of spatial ozone means, standard deviation and the first four PCs.
4.5 Eigenspectrum from the PCA of “full” 2006 CMAQ output.
4.6 Plots of cross-validation MPE vs. RMSE for the model fits of $E_1$.
4.7 Standard normal QQ-plots of the fitted EOF-CII models.
4.8 Spatial plots of true $E_1$ to be predicted and its GP model predictions (all unitless).
4.9 Spatial plots of true $E_2$ to be predicted and its GP model predictions (all unitless).
4.10 Spatial plots of true $E_3$ to be predicted and its GP model predictions (all unitless).
4.11 Standard normal QQ-plots of the fitted PC-CII models.
4.12 Time-series plots of the true temporal ozone features in the predictive set, their predictions using temporal ozone models PC-CII and PC-VM.
4.13 Temporal plots of hourly spatial ozone mean, standard deviation and the 1st 4 PCs over the course of the entire episode.
4.14 For hours 0100, 0700 and 1200 of June 26th, 2006 (the predictive set): the scatter plots of predictions from the CII model and the VM model versus the true CMAQ output.
4.15 For hours 1400, 1600 and 2000 of June 26th, 2006 (the predictive set): the scatter plots of the true CMAQ output vs. predictions from the CII model and the VM model.
4.16 Hour 0100 and 0700 of June 26th, 2006: ozone fields of the true CMAQ output and its predictions.
4.17 Hour 1000 and 1200 of June 26th, 2006: ozone fields of the true CMAQ output and its predictions.
4.18 Hour 1400 and 1600 of June 26th, 2006: ozone fields of the true CMAQ output and its predictions.
4.19 Hour 2000 and 2200 of June 26th, 2006: ozone fields of the true CMAQ output and its predictions.
5.1 Schematics of the idea behind the “AQM/CMAQ Evaluation II” and the “traditional” point-to-point approach.
5.2 Mean fields of CMAQ outputs and observation data of the 1985, 1995 and 1998 episodes.
5.3 Mean fields of CMAQ outputs and observation data of the 2001 and 2006 episodes.
5.4 For the 2001 episode: plots of $E^c_1$, $E^o_1$ and $E^d_1$.
5.5 Plots of $E^d_1$ from the 1985, 1995, 1998 and 2006 episodes.
5.6 For the 2001 episode: plots of $\tilde{E}^c_1$, $\tilde{E}^o_1$ and $\tilde{E}^d_1$.
5.7 Plots of $\tilde{E}^d_1$ from the 1985, 1995, 1998 and 2006 episodes.
5.8 Sensitivity or univariate effect plot of $\tilde{E}^d_1$ against mean VOC emission rate.
5.9 The temporal mean VOC emission rate of LFV.
5.10 Sensitivity or univariate effect plot of $\tilde{E}^d_1$ against VOC emission rate at 5 LFV locations.
5.11 Time-series of $P^c_1$ and $P^o_1$ for the 1985, 1995 and 1998 episodes.
5.12 Time-series of $P^c_1$ and $P^o_1$ for the 2001 and 2006 episodes.
5.13 Time-series of hourly LFV mean ozone of CMAQ output and observations from the 1985, 1995 and 1998 episodes.
5.14 Time-series of hourly LFV mean ozone of CMAQ output and observations from the 2001 and 2006 episodes.
5.15 Comparing dynamic spatial ozone features.
6.1 Spatial fields of ozone means produced by the statistical ozone models of CMAQ and observation.
6.2 Hourly time series of ozone means produced by the CMAQ and observation ozone models.
6.3 Time series plots of $\hat{P}^c_1$ and $\hat{P}^o_1$, scatter plot of $\hat{P}^o_1$ vs. $\hat{P}^c_1$.
6.4 Univariate covariate-effect of $\hat{P}^c_1$ and $\hat{P}^o_1$ against temperature.
A.1 Simulated diurnal temperature profile.
A.2 Simulated diurnal profile of $c(h)$.
A.3 Simulated true ozone fields at selected hours (I).
A.4 Simulated true ozone fields at selected hours (II).
A.5 CMAQ output vs. observation for June 26th, 2006.
A.6 $f_s[\delta(x, y, h)]$ vs. $\delta(x, y, h)$.
A.7 Scatter plots for assessing “similarities” between the real data and simulated data.
A.8 Synthetic CMAQ without using the scaling function.
B.1 Histograms of $\mathrm{RMSE}_i - \mathrm{RMSE}_{i+1}$ from simulation.
B.2 Comparison plots between the $E_j$ from the PCA of original $O_{t \times n}$ and column-centered ozone data.
B.3 From the PCA of centered ozone data: plots of hourly ozone mean, standard deviation and $P_j$, $j = 1, \ldots, 4$.
B.4 From PCA of original ozone data: plots of hourly ozone mean, standard deviation and $P_j$, $j = 1, \ldots, 4$.
B.5 Comparison plots between the original and VARIMAX rotated $E_j$, $j = 1, \ldots, 4$.
C.1 Spatial and temporal feature plots of NOx emission rates associated with the 2006 CMAQ output.
C.2 Spatial and temporal feature plots of temperature associated with the 2006 CMAQ output.
C.3 Spatial and temporal feature plots of the wind speed associated with the 2006 CMAQ output.
C.4 Spatial and temporal feature plots of the boundary-layer (BL) height associated with the 2006 CMAQ output.
C.5 Spatial and temporal feature plots of the antecedent NOx concentration data associated with the 2006 CMAQ output.
C.6 Plots comparing bias-corrected VM model to un-corrected VM models.

List of Mathematical Notations, Acronyms and Abbreviations

Notations and Representations

$Y$    Generic notation for a matrix or a vector of a random response
$X$    A matrix or a vector of model covariates
$O$    The general notation for a matrix of space-time ozone data
$O^c$    A data matrix of CMAQ outputs
$O^o$    A data matrix of ozone physical observations
tO    Simulated true ozone field
cO    Simulated CMAQ output
oO    Simulated ozone observations
$t$    The number of time points (usually in hourly intervals) in space-time data
$n$    The number of locations in space-time data
$E$    A matrix of Empirical Orthogonal Functions extracted from space-time data, where each column of $E$ usually captures a spatial ozone feature
$P$    A matrix of Principal Components, where each column of $P$ usually captures a temporal ozone feature
$i, j$    Row and column indexes of $O$, $E$ and $P$
$E_j$    The general notation of the multivariate random process representing the $j$-th column of $E$
$P_j$    The general notation of the multivariate random process representing the $j$-th column of $P$
$E^c_j$    $E_j$ from a CMAQ modelled ozone process
$P^c_j$    $P_j$ from a CMAQ modelled ozone process
$E^o_j$    $E_j$ from an observed ozone process
$P^o_j$    $P_j$ from an observed ozone process
$E^d_j$    The $j$-th order feature difference of the EOF
$P^d_j$    The $j$-th order feature difference of the PC
$x_{E_j}$    Model covariates of $E_j$
$x_{P_j}$    Model covariates of $P_j$
$p$    The number of spatio-temporal ozone features, or the number of EOF-PC components used to model space-time ozone processes $O$
$D$    An ozone “decenter” matrix containing spatial means of $O$
$R$    The correlation matrix of a Gaussian Process
$Z$    The zero-mean stochastic component of a Gaussian Process
$f$    A vector of Gaussian Process regression functions
$F$    The regression design matrix of a Gaussian Process
$\beta$    A vector of regression coefficients
$\sigma$    Model standard deviation
$\theta$    A vector of Gaussian Process correlation parameters
$\alpha$    A vector of Gaussian Process smoothness parameters
$\xi$    A vector of GP model parameters $\beta$, $\sigma$, $\theta$ and $\alpha$
$h$    Variable denoting the hour of the day
$\{x, y\}$    Specifically for synthetic ozone data, the horizontal and vertical location index within a 2-D geographical coordinate system

Acronyms and Abbreviations

GP    Gaussian Process
PCA    Principal Component Analysis
EOF    Empirical Orthogonal Function
PC    Principal Component
BLUP    Best Linear Unbiased Predictor
CMAQ    Community Multiscale Air Quality
WRF    Weather Research and Forecasting
SMOKE    Sparse Matrix Operator Kernel Emissions
MCIP    Meteorology-Chemistry Interface Processor
AQM    Air Quality Model
VM model    Variable Mean model: An ozone feature model whose covariates are the spatial or temporal means of the model variables
CII model    Covariate Iterative Improvement model: An ozone feature model whose covariates are the PCA outputs of the model variables selected through the Iterative Improvement algorithm
CSC model    Covariate Subset Combination model: An ozone feature model whose covariates are the PCA outputs of the model variables selected through the Subset Combination procedure
GASP    The computer program used in this thesis to optimize the GP models
Temp    Temperature measured in Kelvin
Wind    Wind speed measured in meters per second
BL    Planetary boundary layer height in meters
NOX-lag    The antecedent or lagged NOx concentration
VOC    Volatile Organic Compound
VOC-lag    The antecedent or lagged VOC concentration

Acknowledgements

I would like to thank Prof. Welch and Prof. Steyn for years of patient guidance, support and allowing an uncountable number of meetings. This research project and thesis would not be possible without you.

A hearty thank you to Dr. Bruce Ainslie for teaching me the complicated science behind ozone modelling, and helping me understand all the data that he made available to me. My research would not be the same without your help.

I would like to thank NSERC for funding the research grants of Prof. Welch and Prof. Steyn, some of which provided me with financial support over the years.

I would also like to thank Metro Vancouver for providing me with even more data, and for making sure Vancouver stays a great city for living and studying.

Thank you to the department of statistics: especially Prof. Zidek and Prof. Joe for years of interesting conversations and inspirations, the department head and staff for making our department a cozy supportive family.

Last but not least, I would like to thank my friends for all the support, both emotionally and financially.

Dedication

Dedicated to my parents and the rest of my wonderful family.

Chapter 1

Introduction

Ozone is an oxygen compound with chemical composition O3. In the gaseous form at ground (surface) level, ozone is considered a harmful pollutant, especially for the lung function of people with respiratory conditions. In more severe instances, prolonged excessive exposure to ozone is linked to asthma, heart attack and premature death (WHO, 2003; Lippmann, 1989). Over the years, government agencies around the world have drafted and instituted standards defining the maximum ozone threshold deemed to be harmful to humans. In Canada, it is called the Canada Wide Standard (CWS): the fourth highest annual ozone measurement should not exceed the level of 65 parts-per-billion (ppb) over an “8-hour averaging time” (CCME, 2000). An ozone standard is enforced through the continuous monitoring and analysis of surface-level ozone (and other air pollutant) concentrations (JAICC, 2005; Reuten et al., 2012).

Ozone formation and destruction are part of a complex system of interlinked photochemical reactions. The precursor chemicals are the photochemical compounds that, once released into the atmosphere, trigger a new chain of ozone reactions.
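
To make the Canada Wide Standard criterion quoted earlier in this chapter concrete, the following Python sketch checks one year of hourly ozone data against the 65 ppb level. It is only an illustration under stated assumptions and not the official CCME procedure: the text does not say how the “8-hour averaging time” is reduced to one value per day, so the sketch assumes the daily maximum of the 8-hour running mean. All function names and the example data are hypothetical.

```python
from typing import List, Sequence

CWS_THRESHOLD_PPB = 65.0  # level quoted in the text (CCME, 2000)


def running_8h_means(hourly_ppb: Sequence[float]) -> List[float]:
    """Mean of every complete 8-hour window within one day of hourly data."""
    window = 8
    return [
        sum(hourly_ppb[i:i + window]) / window
        for i in range(len(hourly_ppb) - window + 1)
    ]


def daily_max_8h(hourly_ppb: Sequence[float]) -> float:
    """ASSUMED daily summary: the largest 8-hour running mean of the day."""
    return max(running_8h_means(hourly_ppb))


def fourth_highest(daily_values: Sequence[float]) -> float:
    """Fourth highest value among the daily summaries of one year."""
    if len(daily_values) < 4:
        raise ValueError("need at least four daily values")
    return sorted(daily_values, reverse=True)[3]


def meets_standard(days: Sequence[Sequence[float]]) -> bool:
    """True if the fourth highest daily 8-hour maximum is at most 65 ppb."""
    return fourth_highest([daily_max_8h(d) for d in days]) <= CWS_THRESHOLD_PPB


if __name__ == "__main__":
    # Hypothetical data: three polluted days (70 ppb all day) and ten clean days.
    polluted = [70.0] * 24
    clean = [40.0] * 24
    print(meets_standard([polluted] * 3 + [clean] * 10))  # three exceedances only
    print(meets_standard([polluted] * 4 + [clean] * 10))  # a fourth exceedance
```

Using the fourth highest value rather than the annual maximum means that up to three extreme days do not by themselves decide the outcome. Any further averaging across years that the official standard may require is not modelled here.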
Precursor chemicals are introduced into the atmosphere through human activities; examples of such compounds include NOx (the generic term for NO and NO2) and Volatile Organic Compounds (VOC) (Boubel et al., 1994, Chapter 12). Therefore, an ozone standard is implemented through the reduction of emission levels (Reuten et al., 2012). From the perspective of the regulatory agencies, it is important to study and understand the effect of precursor emissions on ambient ozone concentration. The Community Multiscale Air Quality (CMAQ) modelling system is a useful scientific tool for this very purpose.

CMAQ is a process-based Air Quality Model (AQM) used to model the spatial-temporal ozone distribution under given meteorological and emission conditions (Byun and Schere, 2006). CMAQ enables atmospheric researchers and managers to model ozone variation over a range of background conditions, making forecasts or conducting retrospective analyses. In other words, CMAQ is useful for estimating the effect of changing weather and precursor emissions on surface-level ozone. Reuten et al. (2012), Steyn et al. (2013) and Ainslie et al. (2013) are some of the most recent, extensive implementations of these types of analyses.

As with any modelling system that aims to simulate a real-life event, especially one as complex as the ozone process, CMAQ users need to have an informed, “big picture” view of its modelling capability. Hence the evaluation of the CMAQ model (and AQMs in general) is an inherently important research subject (Dennis et al., 2010; Galmarini and Steyn, 2010).
The topic of CMAQ evaluation is the initial scientific motivation that started the statistical research in this thesis.

1.1 Overview of Ozone Process and CMAQ Modelling System

Before discussing CMAQ evaluation, it is useful to describe in detail the science behind the ozone process and the CMAQ modelling system.

1.1.1 Tropospheric Ozone Formation Processes

When pollutants are released into the atmosphere, chemical reactions subsequently occur to form new pollutants, one of which is ozone. In the simplest terms, the formation of ozone can be defined as a function of the existing hydrocarbon mixture and the concentration of NOx, plus the intensity of solar radiation and temperature. The hydrocarbon mixture could comprise many types of hydrocarbon compounds. For example, about 43 hydrocarbon compounds were identified in the air of St. Petersburg, Florida in the 1970s (Boubel et al., 1994, Section 12.3). In addition, the precursors for the creation of ozone undergo various chemical transformations of their own, including reactions with ozone, resulting in a highly complex system of atmospheric chemical processes.

The process is further complicated by the fact that these gases are being transported through the atmosphere. Thus, an important aspect of an ozone modelling system is a meteorology model that forecasts the atmospheric circulation over a regional scale relevant to the transport of pollutants. Also, the meteorology model provides information on ambient conditions like temperature and humidity, which are in turn necessary to model the chemical reactions.

In a nutshell, meteorological and chemical reaction models are essentially complex systems of dynamic functions that work jointly to model the creation and transport of air pollution.
The mechanisms can be summarized in two fundamental steps: (1) pollutants are released into the atmosphere; (2) subsequent atmospheric chemical transformations occur, both near the pollution source and over a wide geographical region, as the pollutants are transported by atmospheric circulation, mixing and reacting with ambient gases along the way.

Space-time Aspects of Ozone Processes

An ozone process has important spatial and temporal structure. Human activities determine the types and intensities of pollutants entering the atmosphere (Steyn et al., 2013; Ainslie et al., 2013). Solar radiation and ambient weather determine the conditions under which the atmospheric reactions occur. Since these factors differ across geographical locations and time periods (hour of the day, season, etc.), pollution concentration naturally varies across space and time. Furthermore, ozone and other air pollution processes do not simply occur independently over space and time. Because of atmospheric transport, the pollution level at one location depends on the pollution at other locations and their proximity to this location. In other words, the pollutant concentration at location s at time t could significantly influence the pollution at a nearby location s′ at some future time t′ (Le and Zidek, 2006). This dynamic aspect of the space-time correlation structure requires modelling in our statistical methods.

Figure 1.1: Diurnal trend during summer months of 2004–2008 at Chilliwack (BC, Canada). Each hourly concentration is averaged over the 5 years and the days of the month.

For instance, Figure 1.1 shows the diurnal (daily) trend for the four summer months in Chilliwack, BC. Each hourly ozone value is the average ozone measurement for that hour in the entire month over the years 2004–2008. The ozone data are recorded in units of parts-per-billion (ppb). Missing hourly data are filled with the average value of available data for that particular hour, e.g.
the missing 2 p.m. value of August 3rd, 2005 is filled with the average of the available 2 p.m. values of August 3rd from other years. Data for any particular hour on a specific date are available for at least 2 years among 2004 to 2008. The same plot based on data from 1984–2008 shows a very similar diurnal trend. As another example, Figure 1.2 shows the daily day-time (8 a.m. to 8 p.m.) mean ozone level for the years 2004–2008 and the average of these 5 years. One may notice that for Chilliwack, the annual day-time average peaks during the spring in some years, but when averaging the daily values over the 5 years, the diurnal fluctuations are smoothed out and the annual peak is evident during summer.

Figure 1.2: Day-time mean ozone level against day of the year at Chilliwack. The mean is averaged over values from 8 a.m. to 8 p.m.

Figures 1.3 and 1.4 are time series plots of ozone concentrations during an ozone episode in the summer of 2006. The locations in Figure 1.3 are contained within an area of 8 km radius, while the locations in Figure 1.4 are much further apart. This type of plot is useful to visually assess how closely correlated the ozone concentration levels are at different locations.

Figure 1.3: Time-series of observed ozone concentrations (ppb) at 3 close locations (Robson Square, North Vancouver, Mahon Park) during the time-period 1100PST, June 23rd to 1000PST, June 27th of 2006. The vertical dashed-lines indicate the hour 0000PST of each day.

Figure 1.4: Time-series of observed ozone concentrations (ppb) at 3 well-separated locations (Robson Square, Richmond South, Central Abbotsford) during the time-period 1100PST, June 23rd to 1000PST, June 27th of 2006. The vertical dashed-lines indicate the hour 0000PST of each day.

1.1.2 The CMAQ Modelling System

This section describes, in simple and general terms, the inner workings of the CMAQ modelling system. In Chapter 2, I will provide a more detailed description of the conditions and settings of CMAQ runs that are relevant to this research.

As a numerical model, CMAQ is essentially a system of differential equations which are integrated given initial and boundary conditions. Hence, to implement a CMAQ model run, one requires various inputs for these initial and boundary conditions of the pollution process. One important input is provided by the emission model Sparse Matrix Operator Kernel Emissions (SMOKE) (SMOKE v2.5, University of North Carolina, 2012). Given an annual total emission for a geographical region, the SMOKE modelling system distributes this emission figure into spatial grids and time periods: it provides estimates of the types and amounts of pollutants (reaction precursors) released into the atmosphere, and this information is listed for each relevant geographical grid cell, at varying height, over given time periods.

For example, when we wish to estimate the pollutant types and amounts released by household heating during winter, we would first apply appropriate sampling methods to obtain the number of residences within a geographical region. This is our pollution source; the follow-up procedure is described in Boubel et al. (1994, Section 6.4):

1. Identify what gases are produced from home heating, typically CO, CO2, NOx and CH4.
2. Collect household fuel consumption figures from dealers, utility providers and so forth. Without such data, one may apply established distribution models of fuel type and consumption amount to produce an estimate.
3. Examine the reference literature to decide on relevant emission factors for the given consumption figure, such as weight of pollutant per volume of fuel burned. A popular reference on emission factors is “Compilation of Air Pollution Emission Factors” by the U.S. Environmental Protection Agency (http://www.epa.gov/otaq/ap42.htm).
4. There exist models that simulate consumption behaviour over time, allowing us to estimate the time of emission.
5. Calculate the total emission produced within this region at a certain time period.

The CMAQ system has modelling uncertainties from various sources; the emission inventories provided by SMOKE account for a large portion of said uncertainties. This is perhaps unavoidable given its task of estimating results of human activities.

Other inputs feeding into the CMAQ model include land surface information and meteorological conditions such as wind speed and direction, ambient temperature, pressure and solar radiation intensity. These input data are produced by the Weather Research and Forecasting (WRF) model (WRF v3.1, Skamarock et al. (2008)).

Finally, the role of the “chemical component” in the air pollution model is to simulate atmospheric chemical reactions. The emission model tells us the types and amounts of gases entering into the chemical reactions. Using the given inputs and working in conjunction with the meteorological model, the chemical reaction model simulates the systems of chemical transformations in the ozone process. At the end, a CMAQ model run produces output in the form of an average value over a geographical grid cell at a point in time. Thus, hourly averages need to be computed, usually by simple averaging of a series of outputs. Moreover, the model typically requires input every 10 minutes (other time intervals are also possible).
Initial and boundary conditions may be updated every hour, and inputs for shorter time intervals would be interpolated within the model during simulation.

In summary, CMAQ is a complex machine comprised of purpose-specific models (WRF, SMOKE, etc.) that operate interactively to model an atmospheric air pollution process. Given its complexity, model uncertainties and errors are an unfortunate but inherent reality of CMAQ. In view of the use of CMAQ as a “reference guide” to the design of ozone-related policies (introductory paragraphs of the chapter), it is imperative to evaluate the modelling capabilities of CMAQ.

The topic of AQM evaluation has garnered substantial attention over the past few years. Dennis et al. (2010) and Galmarini and Steyn (2010) contain extensive overviews and discussions of this topic. The book series “Air Pollution Modelling and its Application” (Springer Books) provides up-to-date summaries of new research relating to all aspects of air-quality modelling every 18 months. One research area covered in this book series is the topic of AQM evaluation.

1.2 The Problems with “Usual” Means of AQM Evaluation

The most popular means of AQM evaluation is to analyze the statistics of point-to-point differences between the AQM output and corresponding (in space and time) physical observations. Dennis et al. (2010) listed common evaluation statistics such as Mean Bias Error, Root Mean Squared Error (RMSE) and Correlation. Willmot et al. (1985) suggested that bootstrap methods can be applied to obtain the confidence interval and assess the significance of observation-model difference statistics such as RMSE. In the context of climate model evaluation, Preisendorfer and Barnett (1983) compared model outputs and physical data as two “swarms” of data points in a common Euclidean space. Geometric properties such as the distance between the two swarms’ centroids and the differences in their radial scales and space-time evolutions are proposed as viable measures of a model’s accuracy. The authors then outlined sampling procedures to obtain the statistical significance and power of the proposed model accuracy measures.

However, Dennis et al. (2010) pointed out that while direct data comparisons can be useful to a certain degree, they “provide little insight” into the deficiency and behaviour of AQM simulated pollution fields. There are further problems stemming from the point-to-point comparisons.

AQM output and ozone observations are products of different processes driven by their own space-time mechanisms. AQM ozone is driven by a system of differential equations that describe specific ozone-related photochemical and atmospheric processes. In the case of CMAQ, these equations are integrated over the inputs from WRF (meteorology) and SMOKE (emission). Ozone observations are measurements of the real-life ozone processes that occur during observed meteorological and chemical precursor conditions (Dennis et al., 2010). The physical monitoring stations are sparsely and irregularly located across a large spatial domain (details in Chapter 2). Hence observations do not necessarily represent initial and boundary conditions in the same way as an AQM modelled process.

The AQM ozone and observations are further differentiated by the fact that the computer model outputs are spatially averaged concentrations (discussed in Section 1.1), whereas the physical observations are ozone measurements taken at point locations. In other words, AQM output and observations each capture the ozone process on a different spatial scale.

The quality of physical observations is invariably degraded by stochastic measurement errors. The errors in AQM outputs stem from inadequacies in model inputs (WRF and SMOKE outputs) and from the model’s deficiencies in emulating the behaviour of key atmospheric processes.
This implies that AQM modelled ozone and observations have different stochastic structures in relation to the underlying true ozone. For example, let s and t denote a location and time, let x denote a set of atmospheric conditions at s and t, and let the unknown true ozone at (s, t, x) be denoted by $O_t(s, t, x)$. Then the observed ozone is $O_t(s, t, x) + \varepsilon$: the true underlying ozone plus random measurement error, whereas the AQM modelled ozone $O_c(s, t, x)$ relates to true ozone by $O_t(s, t, x) = O_c(s, t, x) + \delta(s, t, x)$, where $\delta(s, t, x)$ is a random process representing the AQM modelling deficiency and is often a nonlinear function of (s, t, x). Such a statistical formulation of the AQM ozone and observations is based on the more general model-measurement relationship proposed by Kennedy and O’Hagan (2001), which is applied in later works such as Fuentes and Raftery (2005) and Higdon et al. (2008).

Given the above formulation, the observation-model difference is a random non-linear process $\delta(s, t, x) + \varepsilon$. Hence, an informative model evaluation should tell us something about the pattern and behaviour of random processes like $\delta(s, t, x) + \varepsilon$; this is information that point-to-point comparison summaries such as RMSE cannot provide.

In summary, AQM outputs and observational data are generated by two space-time ozone processes with discrepant physical and stochastic structures. The point-to-point comparison of two datasets only serves to inform the deviations in their output values, not their difference as individual ozone processes. Without a more insightful process-level understanding of the two ozone processes, close agreement based on point-to-point comparison should be deemed “fortuitous” (Dennis et al., 2010).

This research is motivated by the concerns and topics proposed by the Air Quality Model Evaluation International Initiative (AQMEII, Galmarini and Steyn (2010)) and Dennis et al. (2010).
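As a side illustration of why summaries such as RMSE say little about the pattern of the observation-model difference, the following minimal sketch compares two synthetic difference series that contain exactly the same values. One has a systematic diurnal structure and the other is a random shuffle of it. All data, names and the lag-1 autocorrelation summary are assumptions made for illustration; none of it comes from the thesis data or from CMAQ.

```python
import numpy as np


def mean_bias_error(model, obs):
    """Mean of the point-to-point differences (model minus observation)."""
    return float(np.mean(model - obs))


def rmse(model, obs):
    """Root mean squared error of the point-to-point differences."""
    return float(np.sqrt(np.mean((model - obs) ** 2)))


def lag1_autocorrelation(series):
    """Lag-1 autocorrelation, used here as a crude summary of temporal structure."""
    centred = series - series.mean()
    return float(np.sum(centred[1:] * centred[:-1]) / np.sum(centred**2))


rng = np.random.default_rng(1)
hours = np.arange(96)

# Hypothetical hourly observations (ppb) with a diurnal cycle.
obs = 30.0 + 15.0 * np.sin(2 * np.pi * (hours - 9) / 24)

# Two hypothetical difference series built from the same values:
# a systematic diurnal pattern, and the same values in shuffled order.
structured_diff = 5.0 * np.sin(2 * np.pi * hours / 24)
shuffled_diff = rng.permutation(structured_diff)

model_a = obs + structured_diff
model_b = obs + shuffled_diff

for name, model in [("structured", model_a), ("shuffled", model_b)]:
    print(
        name,
        "MBE:", round(mean_bias_error(model, obs), 3),
        "RMSE:", round(rmse(model, obs), 3),
        "lag-1 autocorrelation of difference:",
        round(lag1_autocorrelation(model - obs), 3),
    )
```

Because the two difference series are permutations of each other, their Mean Bias Error and RMSE agree up to rounding, while the temporal structure of the differences is very different. Only a summary that looks at the pattern of the differences can separate the two cases.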
The literature discussed extensively the aforementioned problems of direct output-observation comparison, and highlighted the importance of evaluating the ability of a computer model such as CMAQ to emulate the interacting atmospheric processes within a space-time air pollution system.

1.3 Research Topics and Objectives

In the most general terms, my research objective is to develop statistical methods of AQM evaluation that are more informative than direct observation-model comparison. These “informative” methods should be able to provide useful insights into the way AQM and physical observations differ as ozone processes, and to identify possible sources of AQM deficiency. In addition, the evaluation methods should correct or at least account for the fact that model output and observation are generated from individual processes with discrepant physical and statistical properties. The statistical analyses in this thesis are based on data (model output and physical observations) related to ozone pollution and the CMAQ system. However, the general ideas and statistical methods are intended to be applicable to other air pollution problems and AQM evaluations.

The AQM evaluation methods proposed in this thesis are based on the analysis and modelling of spatial-temporal ozone features. An ozone feature is a data component/mode that captures certain space-time structure of the underlying ozone process and/or recovers a non-trivial amount of data variation. One example of an ozone feature is the spatial or temporal ozone mean. Suppose there is a space-time ozone dataset $O$ of dimension $t \times n$, $t$ and $n$ being the number of hours and locations in the dataset. The column means of $O_{t \times n}$ are the spatial field of temporal ozone means: ozone averaged across time at each location.
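As a minimal sketch of these mean features, assuming a small synthetic array (the values and the variable names below are illustrative, not thesis data), the column means of a t × n array give one temporal mean per location, and the analogous row means give one spatial mean per hour:

```python
import numpy as np

# Hypothetical ozone matrix: t = 4 hours (rows) by n = 3 locations (columns), in ppb.
O = np.array([
    [10.0, 20.0, 30.0],
    [20.0, 30.0, 40.0],
    [30.0, 40.0, 50.0],
    [40.0, 50.0, 60.0],
])

# Column means: ozone averaged across time at each location (length n).
temporal_mean_field = O.mean(axis=0)

# Row means: ozone averaged across space at each hour (length t).
spatial_mean_series = O.mean(axis=1)

print(temporal_mean_field)
print(spatial_mean_series)
```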
The row means of $O_{t \times n}$ are the time series of spatial ozone means: ozone averaged across space at each hour.

Analyses in later chapters will show that ozone features allow for statistical or real-world interpretability. In addition to the space-time ozone mean, an ozone feature may also capture some dynamic patterns of ozone advection (atmospheric ozone transport).

In this thesis, I propose and implement two general methods of feature-based evaluation of AQM against the observations:

1. The first approach is to compare the ozone features between AQM outputs and physical measurements, and analyze how the two ozone datasets differ in their underlying space-time structures. As mentioned, an ozone feature captures either the mean structure or some dynamic patterns of ozone advection. The advection patterns reveal the most fundamental mechanisms of the underlying atmospheric process, hence the comparison of such advection features is a means of process-level AQM evaluation.

I also propose to model the ozone feature differences to identify and analyze the statistical associations between specific AQM inputs and feature differences. Here, the AQM inputs are variables representing the background meteorology, ozone precursor emission rates and atmospheric concentrations. This is a way of understanding how the components of an AQM (weather, emission, etc.) are associated with its deficiencies in capturing the physical field.

2. The second approach is to build statistical ozone feature models for both AQM and observations, then compare the two statistical models as another means of process-level AQM evaluation. For instance, one can use the fitted models to produce the AQM features and the observed features given the same regional meteorology and ozone precursor pollution. These features can then be compared and checked for the significance of their differences in space and time.
Another evaluation can analyze how two ozone features react to the same variations in background atmospheric conditions. This is a process-level comparison of the stochastic properties of AQM ozone and the physical process.

This thesis will present a coherent set of statistical analyses that argue the following claim: an informative and “big-picture” AQM evaluation can be achieved through the analysis and modelling of ozone features.

1.3.1 Topics of Ozone Feature Analysis and Modelling

To successfully implement the proposed evaluation approaches, I will need to develop the necessary statistical tools and framework that: (1) extract ozone features from space-time ozone data, and (2) model individual ozone features and model the complete space-time ozone field through these features. These are my other research topics in addition to the statistical AQM evaluation.

In this research, methods of Principal Component Analysis (PCA) are used to decompose space-time ozone data into ozone features. One PCA-related topic is to determine the number of ozone features that are meaningful for statistical analysis. A “meaningful” ozone feature should be a data mode that either can be interpreted statistically, or captures the underlying physical process that created the ozone field. At the very least, a “meaningful” feature should recover a non-trivial amount of data variation. By determining the number of meaningful ozone features, one is able to answer whether complex space-time ozone data can be analyzed using a few ozone features, i.e., simpler data components.

Furthermore, there are several PCA-related complications that should be addressed to ensure convincing implementation of the proposed feature-based AQM evaluation. One such complication is the ozone feature “degeneracy” (North et al., 1982), where a feature’s order of extraction and/or its mode of variation are “mixed” with other features.
Failure to address this and other PCA-related complications may result in one feature from the AQM being evaluated against an entirely different feature from the observations, leading to erroneous conclusions about the AQM performance. Thus, it is important to determine a specific PCA procedure that ensures, or at least maximizes, feature correspondence during model evaluation.

The aforementioned PCA-related topics will be studied in Chapter 3 using both the CMAQ outputs and synthetic ozone data. These topics are all part of the first step in the framework of ozone feature analysis and modelling: the extraction of ozone features. The second step is to develop statistical models for individual spatial-temporal ozone features, and this will be done in Chapter 4.

The proposed ozone feature models are variations of Gaussian Processes (GPs) driven by the background atmospheric and chemical precursor conditions. As the reader will see, the structures of ozone features can be highly non-linear, making the task of model estimation an interesting and challenging one. One such challenge is to determine the ideal design and composition of model covariates; a significant portion of my modelling effort is focused on this topic. As is typical in statistical research, once the model is estimated, one needs to implement the proposed model using appropriate data, which in this case are CMAQ-WRF-SMOKE outputs. The purpose is to carefully scrutinize the modelling and forecasting capabilities of individual ozone feature models through a series of goodness-of-fit tests, model diagnostics and exercises in ozone forecasting.

This framework of ozone feature extraction, analysis and modelling forms the core statistical methods on which my proposed AQM evaluation is based. Although ozone feature models are developed as a means to AQM evaluation, the developed methodology is potentially a useful contribution in itself: it is a novel and computationally efficient means of modelling a complex space-time air pollution field.

1.3.2 Relation to Existing Model Evaluation Projects

As discussed earlier in this section, Galmarini and Steyn (2010) and Dennis et al. (2010) pointed out the importance of an air quality modelling system being able to emulate the interacting atmospheric processes within a space-time air pollution system. They term this type of model evaluation Diagnostic Evaluation. The authors also categorized three more types of AQM evaluation. Dynamic Evaluation analyzes an AQM’s ability to model the changes in pollution concentration due to the fluctuations in meteorological conditions and emissions. Operational Evaluation refers to “generating statistics of the deviations” between the AQM outputs and corresponding observations, and examination of the results based on a few “selected criteria”. Probabilistic Evaluation proposes to model the AQM outputs and/or observations as random processes following certain probability density functions (pdfs); the estimated pdfs are then used to carry out various evaluations of the AQM against observations. My proposed AQM evaluation approach may be viewed as a mixture of the above categories of model evaluation, done through a probabilistic framework.

This research is also closely related to the works of Steyn et al. (2011) and Steyn et al. (2013). In these works, the authors used CMAQ to simulate the space-time ozone fields during ozone episodes in the Lower Fraser Valley (LFV), British Columbia (BC).
Among other things, the authors carried out point-to-point comparison of CMAQ outputs with available observations, and identified the precursor sensitivities of local ozone fields within the LFV. The statistical analyses in this thesis deal exclusively with ozone data during LFV ozone episodes, and all my datasets are those used in Steyn et al. (2013). More detailed discussions of Steyn et al. (2011), Steyn et al. (2013) and other related LFV ozone papers appear throughout the thesis.

In the end, my goal is to generate useful statistical methodologies and analyses that can be incorporated into the overall body of work in the fields of AQM evaluation and LFV regional air quality study.

1.4 Literature Review

The literature review is organized according to the main steps of the thesis research: (1) PCA of space-time processes, (2) methods of modelling the non-linear processes that produce both computer model outputs and physical observations, and (3) existing AQM evaluations based on the comparison of data features.

1.4.1 PCA and Extraction of Data Features

Principal Component Analysis (PCA) was originally introduced by Pearson (1902). One of the earliest applications of PCA in the field of atmospheric and climate science can be traced to Lorenz (1956), where sea level pressure (SLP) data were decomposed into Empirical Orthogonal Functions (EOFs) in space and time. The term “EOF” used in that paper is widely adopted in atmospheric science.

The application of PCA/EOF has since gained prominence for the purpose of data decomposition, where the central idea is to decompose a non-linear climate system (sea level pressure, surface temperature, etc.) into independent physical modes of variation.
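To make the idea of a PCA/EOF decomposition concrete, here is a minimal sketch that decomposes a synthetic (time × location) field with a singular value decomposition. The synthetic field, the function name `eof_decompose` and the choice to centre each location on its temporal mean are assumptions made for illustration; this is not the specific PCA procedure developed in Chapter 3.

```python
import numpy as np


def eof_decompose(data):
    """PCA/EOF decomposition of a (time x location) array via SVD.

    Returns the spatial patterns (EOFs, one per row), the PC time series
    (one per column) and the fraction of variance explained by each mode.
    """
    anomalies = data - data.mean(axis=0)   # remove the temporal mean at each location
    u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
    eofs = vt                              # rows are orthonormal spatial patterns
    pcs = u * s                            # columns are mutually orthogonal PC time series
    explained = s**2 / np.sum(s**2)
    return eofs, pcs, explained


# Hypothetical field: 96 hours at 10 locations, one diurnal mode plus noise.
rng = np.random.default_rng(0)
hours = np.arange(96)
locations = np.linspace(0.0, 1.0, 10)
diurnal = np.sin(2 * np.pi * hours / 24)   # temporal structure
pattern = 1.0 + locations                  # spatial structure
field = 30.0 + 15.0 * np.outer(diurnal, pattern) + rng.normal(0.0, 1.0, (96, 10))

eofs, pcs, explained = eof_decompose(field)
print("variance explained by the leading modes:", np.round(explained[:3], 3))
```

In this constructed example a single mode dominates because the field was built from one spatial pattern and one time series. The product `pcs @ eofs` recovers the centred data exactly, so truncating to the leading modes gives a low-dimensional approximation of the field.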
A rich body of literature explores various topics stemming from the PCA of space-time physical processes. The general topics relevant to this research include (1) the interpretation and selection of data features (Richman, 1986; Preisendorfer, 1988), and (2) the estimation error associated with EOF decomposition, which relates to the separability and identifiability of data features (North et al., 1982; Monahan et al., 2009). These two topics and further references will be extensively discussed in Chapter 3 on PCA. Furthermore, useful summaries of PCA/EOF decomposition, its related issues, extended methodologies, and applications in atmospheric science can be found in the books by Storch and Zwiers (1999, Chapter 13) and Jolliffe (2002), and in the review articles by Bjornsson and Venegas (1997) and Hannachi et al. (2007).

More recently, methods of PCA have been applied to study the spatial and temporal patterns of ozone processes. These papers deal with an ozone process at a global or continental scale. Orsolini and Doblas-Reyes (2003) studied the spatial pattern of leading EOFs decomposed from monthly column ozone data observed by the Total Ozone Mapping Spectrometer (TOMS) during the spring months between 1948-2000. The spatial field covers the Euro-Atlantic sector (20° to 90° latitude and 60° to -90° longitude) and the data are measured at 500mb geopotential height. From the patterns of leading EOFs, the authors were able to identify the pressure system that is associated with the pattern of each EOF. They found that the most important pressure system is the North Atlantic Oscillation (NAO), and other leading weather patterns are the Scandinavian, east Atlantic and European Block. Principal Component (PC) time-series were also studied for long-term trends. The purpose of this paper is to study the link between ozone EOFs and known climate patterns.

Jrrar et al.
(2006) carried out a similar analysis using outputs from the Chemical Transport Model (CTM) SLIMCAT and identified 5 climate patterns from the spatial plots of 5 leading ozone EOFs. Camp et al. (2003) implemented PCA on column ozone data across the tropics, which are measured by both TOMS and Solar Backscatter Ultraviolet (SBUV) instruments. The key tropical oscillation patterns are identified from the ozone EOFs. The above-mentioned ozone PCA literature has been extensively cited over the past 10 years.

Similar ideas have been used in other fields. For example, Liu et al. (2003) implemented a modified Singular Value Decomposition (SVD) technique on Affymetrix microarray (gene expression) data. By analyzing the decomposition, they isolated a vector of data features that enabled them to order gene types according to the level of data variation attributed to each genome.

There are certainly other ways of decomposing and visualizing high-dimensional data: spectral decompositions (Fourier transform, etc.), neural network techniques like self-organizing maps (Kohonen, 1982, 1990), and multidimensional scaling (Borg and Groenen, 2005). Methods of PCA or EOF decomposition are used here because they are widely adopted in both statistics and atmospheric science. The aforementioned works are simply well-known examples among an extensive collection of works that demonstrate the use of PCA for extracting structure/dynamic modes from high-dimensional datasets.
Furthermore, the usefulness of PCA methods is also reiterated in this particular research.

1.4.2 Emulation of Non-linear Computer Models and Physical Processes

My proposed statistical analysis involves the modelling of AQM outputs. AQMs such as CMAQ are referred to as “numerical models” or “computer models” in the sense that the mathematics involves numerical solutions of governing atmospheric dynamic equations and chemical kinetic equations, and only the input conditions are required to implement a model run that produces a deterministic output. In contrast, a “statistical model” is built around the idea of modelling randomness with probability functions, and sample data are needed to fit a model to output. On the surface, the nature of CMAQ makes statistics inapplicable; after all, statistics requires randomness.

For a complex computer model, a model run at every possible input value is impractical at best. When we implement a model run for a given input, the output is deterministic, but outputs at untried inputs remain unknown. In this sense the numerical model output follows a stochastic process because of these uncertainties in the output. This is the historical reasoning behind treating model output as a realization of a stochastic process (Sacks et al., 1989). Over the years, statistical methods have been developed based on such a formulation, for the purpose of using computationally cheap statistical models to emulate the computer model outputs, and the real processes they try to estimate.

As mentioned, Sacks et al. (1989) were among the earliest authors to treat a deterministic model output as a realization of a random process. They suggested that the deterministic model output $Y(x)$ given covariate set $x$ is a realization from a random function such as

$$Y(x) = \sum_{j=1}^{k} \beta_j f_j(x) + Z(x). \tag{1.1}$$

The random process $Z(x)$ is assumed to have zero mean and covariance $\sigma^2 R(x, x')$, where $R(x, x')$ is the correlation measure between computer “input sites” $x$ and $x'$.
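A single realization from a model of the form (1.1) can be sketched as follows. The sketch assumes a one-dimensional input, the basis f_1(x) = 1 and f_2(x) = x, a Gaussian correlation function R(x, x′) = exp(−θ(x − x′)²), and a Gaussian distribution for Z. These choices, together with all names and parameter values, are illustrative assumptions; the text above specifies only the zero mean and the covariance σ²R(x, x′).

```python
import numpy as np


def gaussian_correlation(x, theta):
    """R(x, x') = exp(-theta * (x - x')^2); an assumed correlation family."""
    diff = x[:, None] - x[None, :]
    return np.exp(-theta * diff**2)


def simulate_model_output(x, beta, sigma2, theta, rng):
    """Draw one realization of Y(x) = beta_1 * 1 + beta_2 * x + Z(x)."""
    basis = np.column_stack([np.ones_like(x), x])   # f_1(x) = 1, f_2(x) = x
    trend = basis @ beta                            # regression part of (1.1)
    cov = sigma2 * gaussian_correlation(x, theta)   # covariance of Z at the input sites
    # A small diagonal jitter keeps the Cholesky factorization numerically stable.
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(len(x)))
    z = chol @ rng.standard_normal(len(x))          # zero-mean Gaussian deviation
    return trend + z, trend, cov


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)                        # hypothetical input sites
y, trend, cov = simulate_model_output(
    x, beta=np.array([1.0, 2.0]), sigma2=0.5, theta=25.0, rng=rng
)
print(np.round(y, 3))
```

The parameter θ controls how quickly the correlation decays with distance between input sites: a larger θ gives rougher realizations, a smaller θ gives smoother ones.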
In such a function, $Z(x)$ models the random deviation from the regression function $\beta^T f(x)$, where $\beta = (\beta_1, \ldots, \beta_k)^T$ and $f(x) = (f_1(x), \ldots, f_k(x))^T$. They further put forth the idea that the random process $Z(x)$ (hence $Y(x)$) is governed by a Gaussian Process (GP). As the name implies, the assumption is that the random process follows a Normal distribution.

This fundamental approach has since gained prominence, and it has been studied and refined for a wide range of applications. Kennedy and O’Hagan (2001) introduced the idea of a joint-GP that combines deterministic outputs and physical data in a method called Bayesian Melding. The application is to calibrate the computer output using its correlation structure with the observed physical data. This idea of “melding” computer model output and physical observations has been adapted into more complex forms to tackle the problems of air pollution as spatial processes (Fuentes and Raftery, 2005; Liu, 2007). In these two works, the “grid-cell average” computer output is set to equal a weighted sum of latent point processes, where the weights are defined by exponential kernel functions of the distances between the locations of the latent process and the grid-cell centroid. Such a function “downscales” the spatial scale of the computer output to match that of the observations.

Berrocal et al. (2009) and Zidek et al. (2012) ventured one step further by implementing particular forms of hierarchical regression model whose coefficients follow GPs and are allowed to vary in both space and time, thus modelling the air pollution as spatial-temporal processes. All the above-mentioned literature implemented their model formulation using variations of a Hierarchical Bayesian Algorithm. Lindstrom et al.
(2014) proposed another hierarchical regression model that contains a temporal basis function; this function is either assumed to be known or decomposed from data, and it is scaled by a spatial coefficient that follows a GP.

Berrocal et al. (2012) evaluated the temperature outputs from the Regional Climate Model (RCM), where the data are quarterly mean temperatures from 1962 to 2002 in South Central Sweden. In this paper, information from observation data is scaled, using statistical models, onto the “grid-box level” for comparison with RCM outputs. The spatial scaling is done using both the space-time downscaler model in Berrocal et al. (2009) and an upscaler model in Craigmile and Guttorp (2011). The statistical model outputs are Bayes estimates of spatially scaled quarterly mean temperatures.

Using the methodologies of Sacks et al. (1989), Gao et al. (1996) showed that a GP model can be applied to model observational ozone concentrations. They analyzed daily ozone data from Chicago for the period 1981 to 1991, and found that a properly designed GP-based model can be a capable modeller of process-driven temporal ozone processes. The main purpose of the paper is to correct the trend over years for changing meteorology, and to assess the impact of regulatory initiatives. Dou et al. (2010) also applied GP methodologies in modelling temporal ozone patterns: the authors used a form of time-series model whose coefficients are modelled as GPs. Cooley et al. (2007) modelled the observed “extreme precipitation return” in Colorado using a Pareto-based distribution. To account for spatial non-stationarity, the Pareto parameters are modelled as spatial Gaussian Processes. The latter two works implemented their models using variations of a Hierarchical Bayesian Algorithm.

The aforementioned literature is only a selection from a rich collection of theories and applications relating to GPs. The popularity and diversity of the applications are a testament to the GP’s capability to interpolate and extrapolate systems of complex non-linear functions. Progressive refinements made to the basic framework of a GP model also point to its flexibility, which allows for extensive fine-tuning and elaboration.

Unlike the aforementioned literature, I do not model ozone data directly. Instead, the spatial and temporal ozone features are modelled as multivariate GPs. More specifically, they are modelled as GPs driven by the background processes conducive to a space-time ozone process, e.g., meteorological conditions, chemical precursor emission rates and ambient concentrations.

1.4.3 Feature-based AQM Evaluation

My literature review reveals that although there are earlier precedents, PCA-based model evaluation is not a widely and frequently adopted technique. This sentiment is also reflected in Eder et al. (2014).

Preisendorfer and Barnett (1983) is among the earliest works that proposed the idea of computer model evaluation based on data decomposition. As mentioned in Section 1.2, Preisendorfer and Barnett (1983) proposed a few statistical summaries of point-to-point data differences. The authors also mentioned a model evaluation approach where the data of observation-model differences are decomposed, and the leading EOF and PC are visually assessed to extract useful information about model deficiency. The paper then evaluated General Circulation Model (GCM; see footnote 2) outputs for the January sea-level pressure field. However, this PCA-based model evaluation is mentioned only briefly as a possible complementary analysis to point-to-point data comparison.

Cohn and Dennis (1994) and Li et al. (1994) later proposed PCA-based AQM evaluations with more extensive implementations and discussions than the one presented in Preisendorfer and Barnett (1983). Cohn and Dennis (1994) used PCA to evaluate the capability of Regional Acid Deposition Models (RADMs). 
The model output for various aerosol species is evaluated against corresponding observations collected over the Eastern United States at high altitude (1000-1500 meters) during August-September, 1988. The data are averaged over space: each column is a time series of one aerosol species. The RADM outputs are further separated into a sulfur system (O3, SO2, H2O2 and $\mathrm{SO_4^{2-}}$) and a nitrogen system (O3, NO, NO2, etc.). The two groups of outputs and observations are then decomposed through PCA and their PCA results compared. Comparison metrics include the percent variation explained by leading PC loadings, and angles between the PC spaces. The results indicated systematic RADM deficiencies in the nitrogen system. Through scatter plots of O3 outputs against NO2 outputs, the authors pointed out the possibility that RADM produces fewer molecules of O3 per NO2 photochemical cycle.

Footnote 2: An early version from National Centre for Atmospheric Research.

Li et al. (1994) used PCA to evaluate the Eulerian Acid Deposition and Oxidation Model (ADOM) against observations collected from the Eulerian Model Evaluation Field Study (EMEFS). Through PCA, the modelled chemical process is decomposed into three distinct components simulating the processes of chemical aging/transport, diurnal cycle and area emission. The resulting PC scores, which are in the form of time series, are compared between ADOM and EMEFS to identify specifics of the computer model’s temporal bias.

Fiore et al. (2003) evaluated AQMs’ abilities to model space-time ozone processes. In this paper, two models are evaluated: the Multiscale Air Quality Simulation Platform (MAQSP) and the global GEOS-CHEM model, the latter at two spatial resolutions. The ozone EOFs from these AQMs are compared against corresponding EOFs of observations, both visually and through correlation statistics such as linear slope and $R^2$.
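As a rough illustration of the comparison metrics mentioned above (percent variation explained by the leading PC, and the angle between leading PC directions of model output and observations), a minimal sketch follows. It is not the procedure of any cited study: it handles only the first PC, uses a simple power iteration in place of a full eigen-decomposition, and the function names and toy numbers are hypothetical.

```python
import math


def covariance(data):
    """Sample covariance matrix of data given as rows = times, columns = variables."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in data]
    return [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(p)]
            for a in range(p)]


def leading_pc(cov, iterations=500):
    """Leading eigenvalue and unit eigenvector of cov, found by power iteration."""
    p = len(cov)
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-symmetric start vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("power iteration collapsed; data may have no variance")
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    return eigenvalue, v


def compare_leading_features(model, obs):
    """Percent variance explained by PC1 of each dataset, and the angle between loadings."""
    summary, loadings = {}, []
    for name, data in (("model", model), ("obs", obs)):
        cov = covariance(data)
        eigenvalue, vector = leading_pc(cov)
        total_variance = sum(cov[a][a] for a in range(len(cov)))
        summary[name + "_pct_var"] = 100.0 * eigenvalue / total_variance
        loadings.append(vector)
    # The sign of a loading vector is arbitrary, so compare directions via |cosine|.
    cosine = abs(sum(a * b for a, b in zip(loadings[0], loadings[1])))
    summary["angle_deg"] = math.degrees(math.acos(min(1.0, cosine)))
    return summary


# Made-up toy series (3 times x 2 variables), not data from any cited study.
model = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
obs = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
result = compare_leading_features(model, obs)
```

A small angle indicates that the model and the observations share the same dominant direction of variation, while similar percentages of explained variance indicate that this feature is equally dominant in both datasets.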
Furthermore, from the spatial variations of the leading EOFs, the authors discussed the possible wind patterns responsible for ozone transport across the Eastern U.S. The evaluation revealed that all three model configurations (MAQSP and GEOS-CHEM at its two resolutions) captured a similar east-west spatial feature shown in the observed EOF, while both resolutions of GEOS-CHEM misplaced a midwest-northeast EOF. All three were also shown to capture the general patterns of the leading temporal features from the observations.

Dennis et al. (2010) also contains a review of AQM evaluations based on data decomposition. For example, Hogrefe et al. (2000) and Porter et al. (2010) applied spectral decomposition to long-term time-series data of CMAQ output and O3 observations, then compared the decomposed spectral bands of varying frequencies, e.g., diurnal, seasonal and long-term fluctuations.

A more recent work is Eder et al. (2014), published during the writing of this thesis. In this paper, PCA methods are used to evaluate CMAQ outputs for $\mathrm{SO_4^{2-}}$ and NH4 against weekly observations from the Clean Air Status and Trend Network (CASTNet) over 2001-2006. Moreover, PCA is implemented on the difference between CMAQ outputs and observations, not on the individual datasets. The authors identified some systematic features of CMAQ-observation differences. For example, the third $\mathrm{SO_4^{2-}}$ spatial feature highlighted five high-elevation locations in the eastern U.S., and the corresponding temporal feature indicated a seasonal cycle where CMAQ under-predicted the concentrations during late summer months and over-predicted for the rest of the year. The authors found that the Meso-scale Model (MM5, the weather model used with CMAQ) under-predicted relative humidity and over-predicted solar radiation at high elevations. These weather model deficiencies are shown to be significant by Mann-Whitney nonparametric tests between MM5 outputs and weather observations. 
In the end, the authors suggested that the aforementioned CMAQ deficiency in high-elevation $\mathrm{SO_4^{2-}}$ modelling is caused by issues with the parameterization of clouds in CMAQ.

Another recent example of a related CMAQ evaluation can be found in Marmur et al. (2009). Instead of PCA, this paper applied Positive Matrix Factorization (PMF) to the CMAQ output for numerous chemical species and the associated observations. As with earlier literature, comparison metrics such as “percent data variation explained” and time series of PC scores were compared between CMAQ outputs and observations.

There is also a general method of climate model evaluation called optimal fingerprinting. Linear regression is used to compare climate observations (the responses) with model outputs under some external climate forcing (the regressors). Usually the responses and regressors are data features obtained from decomposition (Hasselmann, 1993; Allen and Tett, 1999). These data features, representing significant departures from normal climate variations, are called “fingerprints”. The extensive literature on optimal fingerprinting includes Hobbs et al. (2015). Here, the authors evaluated climate models of Antarctic ice coverage (a proxy for climate change), where the above-mentioned linear regression is used to analyze the seasonal features/fingerprints of observation-model deviations.

Other works used combined analyses of PCA and cluster analysis on meteorological data to study measurement-model discrepancy. Beaver et al. (2010) used EOFs to categorize daily physical observations and corresponding model outputs into clusters of weather patterns. They then assessed how often the outputs and observations match in this categorization. The clustering is based on hourly wind speed data, recorded in vector components u and v, from measurement stations covering the Bay Area of California. 
The hourly wind data are first concatenated into daily data, then the PCA-compressed data are clustered based on the criterion of minimum sum of squared errors. The defining wind pattern within each cluster is visually interpreted and defined. In the end, the authors recorded the number of days where observation and model output were assigned to different clusters, and analyzed the meteorological patterns during the days that are mis-categorized. The article concluded that the instances of observed mis-categorization (model inadequacy) are consistent with what the authors already knew from experience regarding the behaviour of the models being analyzed.

Ainslie and Steyn (2007) ventured one step further. The authors implemented EOF decomposition and cluster analysis of mesoscale wind data for regions around LFV. They then defined four types of synoptic wind patterns associated with LFV’s regional ozone exceedance, where the “ozone threshold” is defined by the CWS mentioned in the introductory paragraphs (page 1). Reuten et al. (2012) further applied results from Ainslie and Steyn (2007) to forecast the frequencies and types of regional ozone exceedance for the future time period 2046 to 2065. In this thesis, the four LFV wind patterns identified in Ainslie and Steyn (2007) will provide an important reference point in the ozone feature analysis and CMAQ evaluations to be presented in Chapters 3 and 5. The key results and conclusions of that paper will also be discussed in detail later in this thesis.

1.5 Novelty of Proposed AQM Evaluation Methods

In this section, I will discuss, in the most general terms, the novelties and potential contributions of my proposed AQM evaluation approaches. 
The goal of this thesis is then to present a coherent set of statistical analyses that demonstrate the usefulness of my proposed evaluation methods.

In the existing literature, the differences in data features (in the form of PC scores and loadings) are compared using correlation measures, RMSE or linear regression (optimal fingerprinting). To the best of my knowledge, no previous attempt has been made to model the feature differences using GPs or other non-linear functions, which I propose here. As this thesis will demonstrate, ozone features have highly non-linear structures, which makes my proposed non-linear modelling a practical improvement over existing evaluations.

The aim of modelling data features is to identify any statistical association between the observation-AQM feature differences and specific input conditions of an AQM run. Using the feature difference models, one can further analyze how the feature differences change with variations in AQM inputs. I will show later in this thesis that the modelling of feature differences can reveal useful and specific insights into an AQM’s capability.

My second proposed method (Section 1.3) is also a novel means of feature-based AQM evaluation. In this method, statistical ozone feature models are used to predict the AQM feature and the observation feature under the same background conditions, e.g., weather and precursor pollution. These “same-condition” features are then compared in space and time, and the significance of their differences is assessed. In essence, this proposed evaluation compares the stochastic properties of two air pollution processes.

The complexities of an AQM such as CMAQ dictate that one cannot run AQMs under the same real-world conditions that generated the observations. This is evident from literature regarding the CMAQ modelling of LFV air pollution, such as Reuten et al. (2012), Steyn et al. (2013) and Ainslie et al. (2013). 
Therefore, the second proposed evaluation can provide useful statistical answers to the question: “can an AQM produce results that are statistically similar to observations after correcting for deviations in basic background conditions?”

1.6 Thesis Structure

Chapter 2 will describe the data used in this thesis. Specifically, they are model outputs from CMAQ, WRF and SMOKE, as well as physical observations on air pollution and accompanying meteorology. I will also discuss the specific set-up of the CMAQ modelling runs that produced the available data.

As discussed in Section 1.3, I will develop the necessary statistical tools for AQM evaluation. In Chapter 3, I will study the PCA-related topics that are crucial for an informative and defensible AQM evaluation. In Chapter 4, I will develop statistical models for individual ozone features. Specifically, I will estimate the exact formulation of the models, diagnose relevant statistical assumptions, and evaluate the prediction capability of these ozone feature models. Furthermore, I will analyze whether a complete space-time ozone field can be modelled using combinations of ozone features.

Chapters 5 and 6 bring everything back to my original research motivation: the statistical evaluation of AQMs, which in this case means CMAQ. The two general AQM evaluation approaches proposed in the beginning of Section 1.3 will be developed and implemented individually. Combined insights and modelling methodologies developed in Chapters 3 and 4 will be applied while evaluating CMAQ output against the observations.

Chapter 2
Data

Space-time ozone fields are either produced by the CMAQ modelling system or physically observed at monitoring locations. 
In addition to ozone, this research also uses space-time data (either from computer models or observations) of variables representing meteorological conditions, ozone precursor emission rates and surface-level concentrations.

Section 2.1 introduces the sources of data and provides some background. Section 2.2 presents the way CMAQ outputs and observation data are processed for the CMAQ evaluation in Chapters 5 and 6. I refer to data presented in Sections 2.1 and 2.2 as the real data to differentiate them from the simulated or synthetic data that are discussed in Appendix A.1 and used in Appendix B.1. Simulated ozone processes are useful for answering statistical questions that are difficult, or impossible, to answer unequivocally using the real data.

2.1 Air-quality Model Output and Physical Observations

Data undergo extensive numerical processing during statistical modelling. The details behind each data-processing procedure will be discussed at appropriate stages of the statistical analyses. Section 2.1 simply informs the reader of the original source of all my data. The computer model outputs used in this thesis (see footnote 3) are part of those used in Steyn et al. (2011) and Steyn et al. (2013).

Footnote 3: All available CMAQ outputs are calculated using the BORA server at the University of British Columbia.

2.1.1 Data from Computer Models

CMAQ models ozone on a regular grid system, in which outputs are presented in the form of hourly grid-cell averaged ozone concentrations in units of parts per billion (ppb) (Byun and Schere, 2006). The geographical size of a grid cell, commonly referred to as the CMAQ resolution, is defined by users. The coarsest resolution available for this study is a grid cell size of 36km × 36km, with 93 cells in the east-west direction and 95 in the north-south direction. 
Smaller grid cell sizes available are 12km × 12km and 4km × 4km, with grids of 70 × 89 (E-W × N-S) cells and 172 × 103 cells respectively. In this regional ozone study, I use CMAQ outputs at 4km × 4km resolution.

In addition to ozone, CMAQ outputs the concentrations of 124 photochemical compounds. Furthermore, CMAQ models the air-pollution fields at 48 different atmospheric heights. My analysis focuses on surface-level ozone.

Chemical precursor data came from the CMAQ and the SMOKE models. Each model output represents one major source of photochemical precursors: precursors already present in the atmosphere (CMAQ output) and precursors “newly” emitted into the atmosphere (SMOKE output).

The statistical analysis in this thesis will use data on the emission rates and antecedent concentrations of NOx (oxides of nitrogen) and VOC (volatile organic compounds). NOx is the sum of NO and NO2 data, while VOC data are created by adding the scaled values of 16 families of volatile organic compounds. The reactivity scale for each compound is calculated as the ratio between the Carbon-Bond 5 (CB5) reaction rate of that compound and the median of the 16 reaction rates (Yarwood et al., 2005).

The antecedent concentrations represent the hourly atmospheric NOx and VOC concentrations (in ppb) associated with CMAQ ozone modelling at each grid cell. These data are generated using lagged NOx and VOC outputs from CMAQ, spatially weighted according to the surrounding wind flow for each grid cell. The detailed method of generating antecedent concentration data will be discussed in Section 4.3.

I will also use space-time data on temperature, wind direction and speed, and planetary boundary layer height. These meteorology data are WRF model outputs, which are post-processed by the Meteorology-Chemistry Interface Processor (MCIP) into a format usable for CMAQ, and convenient for statistical analysis using R. 
Moreover, WRF output has accompanying geographical coordinate data and the topography of the modelled region, which are linked to the CMAQ output. Within the CMAQ system, each grid cell is spatially indexed by the longitude and latitude of its centre.

The WRF and SMOKE outputs are produced on the same spatial domain and grid system used by the CMAQ model. Their outputs at each location and hour are presented as the numerical result from the spatial interpolation and temporal averaging of each respective grid cell’s initial and boundary conditions. The details of the aforementioned model variables, as well as the reasoning behind their selection, will be discussed extensively in Section 4.3.

The setup of CMAQ modelling runs

In Chapter 1, I described a generalized picture of the way the CMAQ-WRF-SMOKE modelling system is run interactively to model space-time ozone. This subsection summarizes the specific setups of the CMAQ, SMOKE and WRF modelling runs that produced the data in this thesis.

The SMOKE emission model generates an emission inventory and distributes the emissions into spatial grids and time periods at varying degrees of resolution and interval (SMOKE v2.5, University of North Carolina, 2012). The emission inventory generated in this study is further adjusted for both the amount and the source location to reflect the change in LFV’s “population density and economic activity” over the years (Steyn et al., 2011, 2013). The overall annual emission rates of NOx, VOC and other pollutants are obtained from the Metro Vancouver forecast and backcast emission inventories reported by Greater Vancouver Regional District (GVRD) in 2007 (see footnote 4).

Footnote 4: Prepared by Metro Vancouver.

The SMOKE output used here is the sum from 10 types of emission source: light- and heavy-duty vehicles, off-road vehicles, railroads, aircraft, marine, other emission sources, biogenic emissions, and point and area sources. The LFV mobile (vehicle) emission rates are modelled by the MOBILE6.2 and MOBILE6.2C models (US Environmental Protection Agency, 2010) using backcast emission totals from the above-mentioned GVRD inventory. The regional ozone modelling in this thesis is analyzed at 4km × 4km spatial resolution; biogenic emission modelling at this level of detail is handled by MEGAN version 2.04 (Guenther et al., 2006).

WRF (v3.1; Skamarock et al., 2008) produces the 3-dimensional meteorological fields for ozone modelling. The meteorological conditions were simulated at 48 vertical levels to model air pollution at all elevations. As with CMAQ and SMOKE, WRF can simulate a space-time process at varying spatial resolutions. The Kain-Fritsch convective parameterization is applied to model “unresolved cloud updraft and downdraft”, and the Asymmetric Convective Model (ACM, version 2) accounts for the “unresolved Planetary Boundary Layer” process (Steyn et al., 2013). Data from the Moderate Resolution Imaging Spectroradiometer (MODIS), either weekly average values or weekly values around the episode dates, were used to initiate the simulations of sea-surface temperatures (Steyn et al., 2013).

The CMAQ (EPA Model-3) modelling system, version 4.7.1, models the overall photochemical process. For each ozone episode, the modelling is done over a period of 96 hours with a 13-hour “spin-up” period (Steyn et al., 2011, 2013). This means that a full 96-hour ozone episode dataset contains the 13-hour spin-up on the first day, three full days of ozone simulation, and an additional 11 hours on the last day. This spin-up period is determined based on the past experience of our CMAQ data providers. 
Due to the recirculation of pollutants within LFV (Seagram et al., 2013), the pollutants from the previous day remain in LFV as the initial pollutants of a new diurnal ozone process.

Furthermore, background ozone, CO and NOx concentrations were provided by Re-analysis of TROpospheric chemical composition (RETRO) monthly average outputs, which in turn are simulated jointly by a general circulation model and a chemical and aerosol model called ECHAM5-MOZ: European Centre Hamburg Model-Model for Ozone and Related chemical Tracers (Steyn et al., 2013). The Carbon-bond 5 (CB05) gas-phase chemical mechanism with chlorine (Yarwood et al., 2005) was used together with the Aerosol Energetics (AE-5) aerosol module.

The computation time of CMAQ depends on the available computing power. In our case, it took CMAQ approximately one day to simulate one day of ozone at 4km × 4km resolution (see footnote 5).

Episodes and spatial domain of our study

At 4km × 4km resolution, the CMAQ modelling region covers an area in the Pacific Northwest that includes parts of Washington state in the U.S., Alberta in the east and northern BC mountains. I focus my modelling around the Lower Fraser Valley (LFV): a region of British Columbia that encompasses Greater Vancouver Regional District (Metro Vancouver) and Fraser Valley Regional District. The Fraser Valley Regional District spans from Abbotsford to Hope in the east.

Specific to this thesis, the “full” region under analysis is a rectangular approximation to the valley floor of the LFV. The large “pins” in Figure 2.1 indicate the corners that define my rectangular LFV region. This rectangular LFV includes a small portion of the north shore mountains immediately adjacent to some urban areas, and excludes the area around Hope (to the east of Chilliwack). 
This modelling region comprises 229 CMAQ (hence WRF and SMOKE) grid cells, whereas a complete CMAQ model domain contains 17716 grid cells at 4km × 4km resolution.

Each CMAQ run is used to model a summer-time ozone episode, which typically lasts 96 hours spanning 5 days: a 13-hour spin-up period on the first day, 3 full days in the middle and 11 hours on the last day. In all, model outputs for 5 episodes are used in this study; they took place in the years 1985, 1995, 1998, 2001 and 2006 (Steyn et al., 2013; Ainslie et al., 2013). Table 2.1 shows the start and end time of each episode.

Footnote 5: From Bruce Ainslie, who produced all the CMAQ outputs used in this thesis.

Year   Time span
1985   July 18th, 1100PST to July 22nd, 1000PST
1995   July 16th, 1100PST to July 20th, 1000PST
1998   July 24th, 1100PST to July 28th, 1000PST
2001   August 9th, 1100PST to August 13th, 1000PST
2006   June 23rd, 1100PST to June 27th, 1000PST

Table 2.1: CMAQ modelled ozone episodes: years and the episode durations.

2.1.2 Observational Data from Air-quality Monitoring Sites

The observed data on ozone and other variables are collected at air-quality monitoring locations across the lower mainland of BC (Metro Vancouver, 2012). At each location, there is an instrument that draws in ambient air and measures the pollutant concentrations in the air sample. This procedure can be done every few seconds, and such rapid-response data collection allows for various averaging times depending on the type of data analysis (Metro Vancouver, 2013). The data currently available are based on hourly averages.

Besides air-quality data on ambient ozone and NOx concentrations, each monitoring location also collects accompanying weather data on temperature, wind direction and speed. Although weather data are recorded multiple times per hour, available data are hourly averages. 
The VOC concentrations are measured only at 4 of the 17 ozone monitoring stations, and they are usually available as daily values. The physical measurements of LFV planetary boundary layer height are not available for our study (Steyn et al., 2011).

I make use of observation data collected from monitoring sites located within my rectangular LFV region. The small “red pins” in Figure 2.1 show the 17 monitoring locations that recorded the data for 2001 and 2006 (Ainslie et al., 2009), and Table 2.2 shows the associated station coordinates and names. The number and locations (longitude and latitude) of available monitoring stations vary by episode. One reason is that between 1985 and 2006, some monitoring stations were retired from service while new locations were established. Moreover, for certain years some of the locations contain a large number of missing measurements, and data collected from such locations are discarded from the analysis. The 1985, 1995, 1998, 2001 and 2006 episodes have observation data available from 11, 16, 16, 17 and 17 monitoring sites, respectively.

Figure 2.1: Locations of current measuring stations (small pins) and the corners of my self-defined rectangular LFV region (large pins). The station coordinates and names associated with the numbers 1 to 17 are in Table 2.2.

Number  Longitude  Latitude  Name
1       -123.16    49.26     Kitsilano
2       -123.15    49.19     YVR
3       -123.12    49.28     Robson Square
4       -123.11    49.14     Richmond South
5       -123.08    49.32     Mahon Park
6       -123.02    49.30     North Vancouver
7       -122.99    49.22     Burnaby South
8       -122.97    49.28     Kensington Park
9       -122.90    49.16     North Delta
10      -122.85    49.28     Rocky Point Park
11      -122.79    49.29     Coquitlam
12      -122.71    49.25     Pitt Meadows
13      -122.69    49.13     Surrey East
14      -122.58    49.22     Maple Ridge
15      -122.57    49.10     Langley Central
16      -122.31    49.04     Central Abbotsford
17      -121.94    49.16     Chilliwack

Table 2.2: The station names and coordinates of the numbers 1 to 17 in Figure 2.1: the map of the LFV monitoring network.

2.2 Data for CMAQ Evaluation in Chapters 5 and 6

This section describes the various numerical processing of CMAQ output and observation data used for CMAQ evaluations in Chapters 5 and 6.

2.2.1 Interpolated CMAQ Data

A proper implementation of my proposed CMAQ evaluation approach requires that CMAQ outputs and observations be matched on a spatial-temporal domain. Being both hourly data, CMAQ outputs and observations are matched in time. However, this is not the case with space: CMAQ outputs are air pollution data on a regular spatial grid across the entire LFV, while observations cover an irregular and sparse set of locations (Section 2.1).

The CMAQ evaluation analyses in Chapters 5 and 6 will be based on observation data from the $n_{obs}$ monitoring sites and CMAQ-WRF-SMOKE outputs spatially interpolated onto these $n_{obs}$ locations. Therefore, the spatial domain of the statistical CMAQ evaluation is defined by the “ozone monitoring space”, not the “CMAQ modelling space”. As discussed in Section 2.1, the monitoring locations vary by episode, so the CMAQ interpolation is done episode by episode.

Each computer model output is spatially indexed by the longitude-latitude of the corresponding grid cell’s centre point. 
While the example below shows how spatial interpolation is done for CMAQ ozone output, the method is the same for other variables.

1. For each ozone monitoring station, choose 4 neighbouring CMAQ grid cells using the combined criteria of closeness in Euclidean distance and good coverage around the point location of the monitoring site. Figure 2.2 shows, for selected observation stations, the 4 chosen CMAQ neighbours that were used for interpolation.

Figure 2.2: For 8 selected monitoring stations, the 4 CMAQ neighbours to be used for interpolation. The red dot is the location of a monitoring station and the squares are the centres of the 4km × 4km CMAQ grid cells. (Panels, each plotted as latitude against longitude: Kensington Park, North Vancouver, Rocky Point Park, Chilliwack, North Delta, Burnaby South, Mahon Park, YVR.)

2. For each monitoring location, interpolate the 4 CMAQ outputs via Inverse Squared Weighting, i.e., weights inversely proportional to the squared distance:

$$w_s = \frac{1}{d_s^2}\left(\sum_{s'=1}^{4}\frac{1}{d_{s'}^2}\right)^{-1}, \qquad O^{c}_{int} = \sum_{s=1}^{4} w_s \cdot O^{c}_{s},$$

where $O^{c}_{int}$ denotes “interpolated CMAQ”, $O^{c}_{s}$ is the CMAQ output at grid cell $s$, $d_s$ is the Euclidean distance between $s$ and the monitoring site, and $s = 1, \ldots, 4$ for each monitoring site.

In the end, one obtains space-time data of interpolated CMAQ outputs, whose locations are matched to the longitude-latitude of the observations.

Here, the interpolation at each observation location is based on only the four nearest CMAQ grid cells. For such a local field, inverse squared weighting is an adequate interpolation method. 
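The two interpolation steps above can be sketched as follows. This is a minimal illustration rather than the code used in the thesis: the function names are hypothetical, the 4 neighbouring cells are assumed to have been chosen already, and distances are computed as plain Euclidean distances on whatever coordinate pairs are supplied (treating longitude-latitude as planar is an assumption of the sketch).

```python
def inverse_squared_weights(site, cell_centres):
    """Weights w_s proportional to 1 / d_s^2, normalized so that they sum to one."""
    squared = [(site[0] - c[0]) ** 2 + (site[1] - c[1]) ** 2 for c in cell_centres]
    hits = squared.count(0.0)
    if hits:
        # The site coincides with a cell centre: use that cell's value directly.
        return [1.0 / hits if d2 == 0.0 else 0.0 for d2 in squared]
    inverse = [1.0 / d2 for d2 in squared]
    total = sum(inverse)
    return [value / total for value in inverse]


def interpolate_to_site(site, cell_centres, cell_values):
    """Interpolated CMAQ value at the site: the weighted sum of the neighbouring cells."""
    weights = inverse_squared_weights(site, cell_centres)
    return sum(w * v for w, v in zip(weights, cell_values))


# Hypothetical example: a site at the centre of four cell centres on a unit square,
# with made-up ozone values in ppb.
centres = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ozone = [40.0, 44.0, 48.0, 52.0]
value = interpolate_to_site((0.5, 0.5), centres, ozone)
```

Because the site in this example is equidistant from all four centres, the weights are equal and the interpolated value reduces to the simple average of the four cells; a site closer to one centre pulls the result towards that cell's value.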
For interpolations based on larger and more complex fields, one might need to adjust the power of the inverse weighting according to the dataset’s “coefficients of spatial variations” (Gotway et al., 1996). Alternatively, one may interpolate CMAQ outputs at the observation locations using Kriging methods (Matheron, 1963; Cressie, 1990).

2.2.2 Missing Observations and Measurement Errors

For an episode, the percentage of missing data is typically ≤ 5% for all variables. In addition, data are usually missing for just an hour or, occasionally, a few hours. There are also instances where the observations are completely unavailable for one or more stations, e.g., the temperature data during the 2006 episode are not available for the Robson Square and North Vancouver stations, and the percentage of missing data reached ≈ 12% in total (across all 17 stations and 96 hours).

When the data are missing for 3 or fewer consecutive hours, I interpolate the missing observations by simple linear regression: “ozone concentration” is the response variable and “hour” the regressor along with an intercept term, and the regression is fitted with two available observations at either side of the missing period. Otherwise, a “proxy station” is chosen for each location, and the missing observations at this location are filled in with data from the “proxy station”. The proxy station is selected based on the visual criteria of spatial-temporal homogeneity.

For example, at Pitt Meadows, the ozone observations are missing for 4 hours on June 26th, 2006. These missing data are filled with the same-hour observations from Rocky Point Park. Figure 2.3 shows the spatial plot of the 96-hour ozone mean produced from the 2006 CMAQ output: it helps to assess the spatial homogeneity between locations. The contour of the LFV shoreline and the locations of the Pitt Meadows and Rocky Point Park monitoring sites are also shown. 
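The short-gap rule described in this subsection (regression on hour for gaps of 3 or fewer consecutive hours) can be sketched as below. This is an illustration under assumptions, not the thesis code: the function names are hypothetical, missing values are represented by `None`, and the regression is assumed to use one available observation on each side of the gap, in which case the fitted line passes through both points. Longer gaps are left untouched, since they are handled with a proxy station.

```python
def fit_line(hours, values):
    """Ordinary least squares fit of value = intercept + slope * hour."""
    n = len(hours)
    mean_h = sum(hours) / n
    mean_v = sum(values) / n
    sxx = sum((h - mean_h) ** 2 for h in hours)
    sxy = sum((h - mean_h) * (v - mean_v) for h, v in zip(hours, values))
    slope = sxy / sxx
    return mean_v - slope * mean_h, slope


def fill_short_gaps(series, max_gap=3):
    """Fill runs of None of length <= max_gap; leave longer runs for proxy filling."""
    filled = list(series)
    n = len(series)
    i = 0
    while i < n:
        if series[i] is not None:
            i += 1
            continue
        start = i
        while i < n and series[i] is None:
            i += 1
        end = i  # index of the first available value after the gap, or n
        # Only fill gaps that are short and have an observation on both sides.
        if end - start <= max_gap and start > 0 and end < n:
            intercept, slope = fit_line([start - 1, end],
                                        [series[start - 1], series[end]])
            for hour in range(start, end):
                filled[hour] = intercept + slope * hour
    return filled


# Made-up hourly ozone values (ppb): a 2-hour gap, a 4-hour gap and a trailing gap.
series = [10.0, None, None, 16.0, None, None, None, None, 30.0, None]
filled = fill_short_gaps(series)
```

In this example only the 2-hour gap is filled; the 4-hour gap exceeds the threshold and the trailing gap has no observation after it, so both are left as missing.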
The plot of the mean field indicates that the 96-hour ozone means are similar between the two locations. Figure 2.4 plots the hourly observations from the two locations; it helps to compare their temporal patterns. As shown, the observations from the two locations closely track each other by the hour, except that the ozone peak is higher for Pitt Meadows on the 3rd day. However, the data are missing during the pre-noon hours on the 4th day, and observations from the previous days show that during these hours the ozone levels are similar between the two locations.

Figure 2.3: From the 2006 CMAQ ozone output: the spatial plot of the 96-hour ozone means overlaid with the LFV shoreline. Note that Longitude is expressed differently from the other plots: it is the usual longitude angle plus 360°, simply used to overlay the available map data.

Figure 2.4: Hourly ozone observations from Pitt Meadows and Rocky Point Park during the 2006 ozone episode. The dashed lines indicate hour 0000 of each day. Notice the effect of nocturnal down mixing at 0400PST on the 25th at Rocky Point Park.

When a station has ozone observations that are missing or flagged as incorrect readings for 8 or more consecutive hours, the data from this station are not used for analysis. This is because observations are used as “reference data” for CMAQ evaluation, and an excessive amount of interpolated/estimated inputs would introduce unwanted bias into the statistical analysis.

As Figure 2.4 shows, there is an hour-long spike in the early morning for Rocky Point Park. This is the result of what is known as Down Mixing (Salmond and McKendry, 2002). It is a nocturnal process in the boundary layer of LFV, which creates vertical mixing and downward transport of pollutants from the atmosphere above. The natural consequence is a sudden spike of ozone and precursor concentrations in particular locations around LFV, such as this example at Rocky Point Park.
This phenomenon is beyond the scope of my current analysis. I chose to replace the down-mixing affected data points using the average measurements from adjacent hours at the same location.

2.3 Summary of Data Used in the Thesis

I also generated synthetic/simulated ozone data that emulate the space-time structure of the real LFV ozone field. I then implemented a few analyses using these synthetic data to complement the ozone PCA in Chapter 3. Since these synthetic data are analyzed in Appendix B.1, the details regarding the design and creation of such data are described in Appendix A.1 instead of this chapter.

In summary, the types of data used in this thesis are:

• CMAQ ozone output and associated data on meteorology, chemical precursor emission rates and antecedent concentrations (processed from WRF, SMOKE and CMAQ output).

• Physical observations. These are data recorded at monitoring stations across LFV. Observation data include ozone and NOx concentrations, temperature, wind speed and direction.

• The above-mentioned synthetic ozone data.

In Section 2.1, I mentioned that the full spatial domain of my analysis is the rectangular LFV shown in Figure 2.1. In the upcoming statistical analyses in Chapters 3 to 6, I will analyze data based on either this “full” rectangular LFV or subregions of it. The decision is based on the specific goal of the analysis at hand. The list below gives a quick overview of the regions analyzed:

1. In Chapter 3, I will analyze CMAQ ozone outputs across the part of LFV where the elevation is below 150 meters. This region represents the triangular “valley floor” of LFV, where most of the ozone activities (chemical reactions and atmospheric transportation) occur. I will simply refer to this region as “LFV”.

2. In Chapter 4, I will use CMAQ-WRF-SMOKE data across the full rectangular LFV. I refer to this region as “rectangular LFV” to differentiate it from the “LFV” defined above.

3. The CMAQ evaluation analyses in Chapters 5 and 6 will be based on the area defined by the $n_{obs}$ monitoring locations. As discussed, the computer model outputs will be spatially interpolated onto the observation locations.

Chapter 3

Principal Component Analysis of Space-time Ozone

My research explores methods of AQM evaluation based on the analysis and modelling of ozone features. The term ozone features describes the dominant spatial-temporal structures of a space-time ozone field. In this research, methods of PCA are used to decompose space-time ozone data into spatial and temporal features. Since data features capture the key space-time variation within a dataset, one may argue conceptually that they are the most informative portion of the data for statistical analysis.

In this chapter, I use Principal Component Analysis (PCA) of CMAQ outputs to address important topics related to the statistical analysis of ozone features. Specifically, I aim to accomplish three things:

1. Determine the most appropriate PCA method for feature-based AQM evaluation. The exact PCA method will then be applied consistently in this study.

2. Analyze whether a space-time ozone field can be understood and analyzed through a small number of ozone features.

3. Interpret the ozone features: either as statistical summaries of data, or as space-time structures that explain important underlying mechanisms of the ozone process.

The following discussion explains the purposes of these analyses.

An important topic relating to the comparison of ozone features is that of “feature correspondence” between ozone data. In order to compare ozone features between AQM modelled ozone and physical observations, it is crucial that we understand whether the same types of features are being evaluated.
Otherwise, one type of ozone feature from AQM will be judged against an entirely different feature from observations, leading to wrong conclusions about the computer model’s performance. One way of addressing the above concern is to understand what types of ozone features dominate the ozone field being analyzed. If one is able to interpret the extracted features from both datasets, subsequent feature-to-feature evaluations can be concluded in ways that are both logically justifiable and informative.

Each spatial ozone feature is numerically represented by an Empirical Orthogonal Function (EOF) vector, and each temporal feature is represented by a Principal Component (PC) vector. Hence, PCA is used to extract EOFs from an ozone dataset, and these EOFs are the estimates (from the sample dataset) of the unknown true EOFs of the ozone field under analysis. The following are two main PCA-related complications due to uncertainty in estimating EOFs, or sampling uncertainty. One needs to address them in order to assure feature correspondence during AQM evaluation.

• All pairs of EOF vectors are orthogonal. If the 1st EOFs of two datasets capture different, thus incomparable, types of features, then this feature discordance will carry over to higher-order features due to the aforementioned orthogonality constraint (Cohn and Dennis, 1994). In addition, if the 1st EOF of an analyzed ozone field is estimated incorrectly, perhaps due to the sparseness of sample data, then the orthogonality requirement of EOFs results in the propagation of EOF estimate errors towards higher-order features (Cohn and Dennis, 1994; Monahan et al., 2009).

• Identifiability or separability of ozone features: whether extracted ozone features are individual space-time fields separable from the rest, or whether multiple features form an inseparable couplet or multiplet. This is a data-specific statistical property defined by EOF estimate errors, and it informs us which features can be compared individually or jointly.

In this chapter, the above-mentioned topics will be analyzed using CMAQ ozone outputs for the LFV during the 5 episodes in 1985, 1995, 1998, 2001 and 2006 (Section 2.1).

Through ozone PCA in this chapter, I will establish the exact PCA procedure that will be implemented in all subsequent statistical analyses. This procedure is meant to address the aforementioned PCA-related complications and to maximize the interpretability of ozone features.

I will further examine the number of spatial and temporal ozone features that are meaningful for statistical analyses. First, this question is addressed by analyzing the amount of data variation as well as the importance of the space-time structure each ozone feature can capture. Secondly, I will implement a simulation-based approach to construct a statistical test that determines the number of meaningful ozone features. These analyses address the question of whether a complex space-time air pollution field can be understood and analyzed through simpler data features.

As discussed in Section 2.1, CMAQ outputs are produced on a regular grid with a high spatial resolution of 4km×4km, while observations are irregularly placed at, at most, 17 locations in LFV.
This means that, compared to observations, it will be easier to interpret the spatial ozone features of CMAQ. Therefore in this chapter, for the purpose of learning about ozone PCA and LFV ozone features, I will use CMAQ outputs instead of observation data. However, later in Chapter 5, I will implement PCA on ozone observations to compare the CMAQ features to the observed features.

Before proceeding, I need to define a few important recurring terms. Let $O_{t\times n}$ be an ozone dataset of dimension $t \times n$, $t$ being the number of hours and $n$ the number of locations. The term spatial field of temporal ozone means refers to the column means of $O_{t\times n}$: ozone averaged across $t$ hours, resulting in a spatial field of ozone means. The short descriptions “mean field” or “field of means” will also be used to describe the temporal ozone means. The term hourly spatial ozone means refers to the row means of $O_{t\times n}$: a time series of ozone averaged across $n$ locations at each hour. “Hourly LFV mean ozone” will mostly be used to describe this time series.

Similarly, temporal ozone standard deviation is calculated across time, i.e., the column standard deviation of $O_{t\times n}$. Spatial ozone standard deviation is then the standard deviation calculated across space.

Section 3.1 summarizes what is already known about LFV’s ozone episodes. Section 3.2 reviews the PCA or EOF-decomposition methods relevant to this research, and determines the PCA procedure to be used in this thesis. In Section 3.3, I will discuss the number of ozone features useful for further analysis. In Section 3.4, I will implement PCA on the space-time CMAQ outputs from the 5 episodes (Table 2.1), where the goal is to formulate understandings and interpretations of extracted ozone features that are useful for the upcoming AQM evaluations.
Section 3.5 summarizes the findings in Chapter 3.

3.1 An Overview of LFV Ozone Field during an Episode

The following discussion will use CMAQ outputs to highlight the distinct space-time structures of LFV ozone fields during an episode. Later in Section 3.4, I will determine whether the visible qualitative ozone patterns can be extracted through PCA.

Figures 3.1 and 3.2 show, for selected hours during the 1985 and the 2006 episodes, the 3-dimensional spatial ozone fields output by CMAQ. During a summer-time ozone episode, an ozone plume forms in the morning across the west of LFV. This plume keeps building in concentration before the early afternoon peak while slowly travelling east (Ainslie and Steyn, 2007; Steyn et al., 2013). This eastward movement, driven by daytime westerly winds, carries the ozone plume inland throughout the day. As night falls, the intensity of the ozone process (its creation and destruction) decreases dramatically due to the absence of UV radiation. Hence during the night and the following morning, there exists low-level background ozone across the triangular valley floor of LFV (Salmond and McKendry, 2002).

Figure 3.1: For selected hours during the 1985 ozone episode, 3-dimensional spatial plots of the hourly ozone field. The spatial domain is the “rectangular LFV” including the north shore mountains.

Figure 3.2: For selected hours during the 2006 ozone episode, 3-dimensional spatial plots of the hourly ozone field. The spatial domain is the “rectangular LFV” including the north shore mountains.

Furthermore, as the map in Figure 3.3 shows, the LFV is surrounded by mountains from the northeast and the southwest. Combined with a low boundary layer height, these surrounding mountains act as a physical barrier to the eastward horizontal advection of pollutants that channels them along the valley (Taylor, 1991; Steyn et al., 1997), and subsequently creates a “bottleneck effect” (Robeson and Steyn, 1990) that traps the pollution within LFV’s mountainous barrier. Therefore, during the night time, as ozone within LFV’s valley floor reverts to background level, the surrounding mountains and parts of eastern LFV generally suffer from high ozone pollution due to the accumulation of the plume produced earlier in LFV. This is noticeable from the ozone fields at 2100PST and 0600PST in both Figures 3.1 and 3.2. The high ozone pollution along the mountains remains for the entire duration of an episode.

Figure 3.3: Locations of current measuring stations (small pins) and the corners of the complete rectangular LFV region (large pins).

Each episode is also defined by slight deviations in its space-time structure from the generalized LFV ozone structure I just described. The 1985 episode has a pronounced ozone plume around the city of Vancouver, and the plume is transported eastward in a discernible wavelike pattern along the northern part of LFV (1300PST and 1700PST plots of Figure 3.1). The 2006 episode (Figure 3.2) has an ozone plume that forms in the middle of LFV and is transported eastward as a “block” that is evenly distributed from north to south. Perhaps more significant is the difference in the night time ozone fields between 1985 and 2006. For 2006, the night time spatial ozone variation is consistent with the generalized LFV night time pattern, and the maximum ozone level is around 20 ppb at the valley floor (2100PST plot of Figure 3.2). For 1985, there is a heavy accumulation of ozone pollution around the southwest corner of LFV by the evening, where the maximum ozone level exceeds 60 ppb (2100PST plot of Figure 3.1).
This is in addition to the high ozone levels along the north shore mountains.

Therefore, despite the fact that every ozone episode happens under a narrow and specific set of weather conditions and regional pollution (Steyn et al., 2013), differences do exist. One needs to consider the ever-changing local emission standards that can shift the spatial patterns of air pollution over the years (Steyn et al., 2011). Furthermore, the between-episode differences in ozone structures are partly influenced by the prevailing regional wind patterns.

Each ozone episode is defined by its unique wind regime, i.e., combination of mesoscale wind direction and speed. Using wind observations at YVR, Ainslie and Steyn (2007) identified four types of possible mesoscale wind regimes during an LFV ozone episode, and characterized the dominant wind patterns of each episode. Figure 3.4 shows the hodographs of the four wind regimes from Ainslie and Steyn (2007). A point in the plot indicates the direction from which the wind blows towards the origin, the distance from the origin shows the wind speed, and the number above each point indicates the hour of the day. As shown, the daytime atmospheric circulation is usually defined by a westerly wind system. Regime IV also displays easterly winds in the evening. Table 3.1 summarizes the wind regime of the middle 3 full days of each episode (remember that the first and last days are half days).

Year    Dates              Wind Regime
1985    July 19-20-21      I-IV-IV
1995    July 17-18-19      III-III-III
1998    July 25-26-27      II-III-II
2001    August 10-11-12    II-II-II
2006    June 24-25-26      I-I-III

Table 3.1: The daily wind regime for the middle 3 full days of each episode. Figure 3.4 shows what circulation types I-IV look like.

Figure 3.4: The four types of LFV wind regime during an ozone episode as described by Ainslie and Steyn (2007).
The wind regimes are presented as hodographs: the number on top of a point indicates the hour of the day, the point’s position indicates the direction from which the wind blows towards the origin, and the distance from the origin indicates the speed.

In Section 3.4, the LFV ozone structure and space-time patterns described in this section will be used as physical references for interpreting the ozone features. Furthermore, PCA will be done episode-by-episode to identify any changes (man-made or weather-driven) in LFV’s ozone structure over the years. PCA will also be done on ozone fields under separate wind regime types. This helps to analyze the possible influence of wind regime on the dominant pattern of LFV ozone advection.

3.2 PCA Methods and Related Topics

This section defines the PCA notations and methods used throughout the thesis. I will also (1) review the mathematical properties and identities of PCA that are useful for subsequent statistical analyses, and (2) discuss the PCA-related complications (briefly described in the beginning of this chapter) that require attention during feature-based AQM evaluations in Chapters 5 and 6.

3.2.1 Definitions of EOFs and PCs

Let $O$ be a $t \times n$ matrix of ozone data, where $t$ is the number of time points and $n$ the number of locations (longitude and latitude concatenated). In a general sense, a column of $O$ can be viewed as containing $t$ observations on one of $n$ variables, and a row of $O$ is one set of observations on $n$ variables.

PCA is typically implemented on centered or standardized data. In centered data, each ozone value in $O_{t\times n}$ is “centered” by subtracting its location-specific mean over the $t$ hours. Denote the column-centered data as $\tilde{O}_{t\times n}$, where the $j$-th column of $\tilde{O}$ is

\[
\tilde{O}_{t\times n}[\,, j] = O[\,, j] - \frac{1}{t}\sum_{i=1}^{t} O[i, j].
\]

Hence $(1/t)\,\tilde{O}^T\tilde{O}$ is an $n \times n$ sample covariance matrix, since the mean of each column (corresponding to one ozone location) is 0.
Standardized data are subsequently obtained by dividing each column of $\tilde{O}$ by its column standard deviation. Column standardization maps the origins of data vectors onto a common origin in data space, and the scales are unitless and comparable.

However, I believe data centering/standardization is unnecessary given the purpose of my analyses. In this research, decompositions are performed on space-time data of single photochemical or meteorological quantities, so all data elements have the same units and scale. The space-time data produced by computer models are well behaved and, except for rare occasions of equipment malfunctioning, observed data are well behaved as well. Hence influential data outliers are not a pressing concern. More importantly, given my goal of feature-based AQM evaluation and space-time ozone modelling, it is reasonable to keep the mean structure intact during PCA. This point will be elaborated at the end of this section. First, I will present the PCA procedures and properties that are relevant to this research. The discussions in this chapter are focused on the data matrix $O_{t\times n}$, but the presented algebraic properties are also applicable to $\tilde{O}_{t\times n}$ (and any data in general).

In PCA, the matrix $O^TO$ undergoes the eigen-decomposition $O^TO = E\Lambda E^T$, giving the matrix $E_{n\times n}$ whose columns are $n$-length eigenvectors of $O^TO$, and a diagonal matrix of $n$ eigenvalues $\Lambda$. Here, each of the $n$ eigenvectors is referred to as an EOF, and PCA eigen-decomposition is interchangeably referred to as EOF decomposition. There is an eigenvalue corresponding to each eigenvector. Denoting each eigenvalue as $\lambda_j$, $j = 1, \dots, n$, we have $\Lambda = \mathrm{diag}\{\lambda_1, \dots, \lambda_n\}$. Multiplying $O_{t\times n}$ by $E_{n\times n}$ gives a matrix of PCs, $P = OE$, which has the same dimension as the original data $O_{t\times n}$. The combined analyses of EOF decomposition and PC calculations are referred to here as PCA.

$P$ is an orthogonal basis for $O$, and each column of $P$ consists of $t$ weighted row-sums of $O$.
In other words, each value in a PC is the location-weighted sum of the ozone values at the corresponding time point. Therefore, the values in an EOF can be seen as the spatial weights, or the “importance”, of each location over the time period $t$ captured by $O$. Here, each column of $E$, denoted as $E_j$, $j = 1, \dots, n$, is a normalized eigenvector. Hence the EOF values are unitless. Each column of $P$ is denoted as $P_j$, and the PC values have the ozone unit ppb.

As discussed above, each element in an $E_j$ represents a certain type of spatial weight of the corresponding location. To state it explicitly, each $E_j$ contains $n$ “spatial variables” that can be defined on the spatial domain of $O_{t\times n}$: each element of $E_j$ can be located on the spatial domain by the location of its corresponding column in $O_{t\times n}$. This leads to the fact that all $E_j$’s are, in essence, spatial processes that can be conveniently plotted over the spatial domain of the ozone field. The same can be said for the $P_j$’s: each of the $t$ elements in a $P_j$ is the weighted row-sum of $O_{t\times n}$ at hour $i$, where $i = 1, \dots, t$. Hence each $P_j$ can be plotted as a time series of length $t$. The nature of the $E_j$ and $P_j$ will be illustrated by the ozone feature analysis in the following sections.

3.2.2 Mathematics of PCA

The data in this thesis can have $t > n$ (observations and interpolated CMAQ output) or $t < n$ (CMAQ output for LFV). Although the following discussion on the mathematical properties of PCA is presented for the case of $t > n$, equivalent results hold for $t < n$.

An important property of $E$ is its orthogonality, i.e., any pair of columns is orthogonal. Moreover, I will show that column orthogonality is also a property of the principal component matrix $P_{t\times n}$.
Taking into account the orthogonality of $E$ and $P$, one may arrive at some useful definitions and relationships.

Interpretation of Eigenvalues

Starting with the definition of eigenvectors and eigenvalues, we have:

\[
(O^TO)E = E\Lambda, \tag{3.1}
\]

where $\Lambda$ is an $n \times n$ diagonal matrix of eigenvalues. Left-multiplying each side by $E^T$, we arrive at $P^TP = \Lambda$:

\[
E^TO^TOE = E^TE\Lambda \;\Rightarrow\; E^TO^TP = E^TE\Lambda \;\Rightarrow\; (OE)^TP = \Lambda \;\Rightarrow\; P^TP = \Lambda.
\]

Hence $P$ is column-orthogonal, and the eigenvalues correspond to the second moments of the variables represented by the columns of $P$. Furthermore, right-multiplying (3.1) by $E^T$ gives $O^TO = E\Lambda E^T$. It can be easily shown that the trace (diagonal sum) of $O^TO$ is equal to the trace of $\Lambda$:

\[
\mathrm{tr}(O^TO) = \mathrm{tr}(E\Lambda E^T) = \mathrm{tr}(E(\Lambda E^T)) = \mathrm{tr}((\Lambda E^T)E) = \mathrm{tr}(\Lambda).
\]

Note that the diagonal elements of $O^TO$ correspond to the second moments of the variables in $O$. One may conclude that the sums of the second moments of $O$ and $P$ are equal. Using this important relationship, we can estimate the proportion of data variation explained by each EOF-PC pair:

\[
\text{proportion of data variation accounted for by } E_j \text{ and } P_j = \frac{\lambda_j}{\sum_{k=1}^{n}\lambda_k},
\]

where $\lambda_j$ is the $j$-th diagonal element of $\Lambda$.

Since $P$ is column-orthogonal, any cross-product (uncorrected for the means) between the variables in $P$ is 0. Therefore, $E$ maps $O$, a column-wise correlated matrix (correlated variables), into a new data matrix $P$ whose variables are orthogonal.

In this thesis, I rank the $E_j$’s and $P_j$’s, $j = 1, \dots, n$, according to the amount of data variation the $j$-th EOF-PC pair explains. Thus, the “1st EOF” and “1st PC” together account for the most data variation (equivalent to having the largest $\lambda_j$), and so forth for the successively higher-order EOFs and PCs. Here, the “order” of an EOF-PC pair corresponds to its rank $j$: compared to $E_1$ and $P_1$, the $E_j$’s and $P_j$’s of $j \geq 2$ are referred to as “higher order” data features.
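The identities above can be checked numerically. The following NumPy sketch is an illustration only: the data are synthetic random numbers (not LFV ozone), the dimensions are arbitrary, and the variable names are my own. It decomposes the uncentered cross-product matrix, forms the PCs, and verifies that $P^TP = \Lambda$ and that the traces agree.

```python
import numpy as np

# Synthetic, strictly positive "ozone-like" data (hypothetical values):
# t = 12 hours (rows) by n = 5 locations (columns), so t > n as in the text.
rng = np.random.default_rng(0)
t, n = 12, 5
O = rng.uniform(10.0, 60.0, size=(t, n))

# Eigen-decomposition of the uncentered matrix O^T O, as in equation (3.1).
# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so both outputs are reordered to put the largest eigenvalue first.
eigenvalues, eigenvectors = np.linalg.eigh(O.T @ O)
order = np.argsort(eigenvalues)[::-1]
lam = eigenvalues[order]        # lambda_1 >= ... >= lambda_n
E = eigenvectors[:, order]      # columns are the unit-length EOFs E_j
P = O @ E                       # columns are the PCs P_j

# Numerical checks of the identities derived in the text.
tolerance = 1e-8 * lam[0]
ptp_is_diagonal = np.allclose(P.T @ P, np.diag(lam), atol=tolerance)
traces_match = np.isclose(np.trace(O.T @ O), lam.sum())
proportion = lam / lam.sum()    # share of variation per EOF-PC pair

print("P^T P equals Lambda:", ptp_is_diagonal)
print("tr(O^T O) equals tr(Lambda):", traces_match)
print("proportion per EOF-PC pair:", np.round(proportion, 4))
```

Because the data are not centered, the proportions refer to the total sum of squares of $O$ rather than to variance about the column means.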
Data Reconstruction via EOFs and PCs

The data matrix $O$ can be reconstructed as follows. Starting from $P = OE$ and right-multiplying each side by $E^T$,

\[
PE^T = OEE^T \;\Rightarrow\; O = PE^T.
\]

The equality $O = PE^T$ can be explicitly expanded into a sum of $n$ EOF-PC terms:

\[
P_{t\times n}E^T_{n\times n} =
\begin{bmatrix}
P_{11}E_{11} & P_{11}E_{21} & \dots & P_{11}E_{n1}\\
P_{21}E_{11} & P_{21}E_{21} & \dots & P_{21}E_{n1}\\
\vdots & \vdots & & \vdots\\
P_{t1}E_{11} & P_{t1}E_{21} & \dots & P_{t1}E_{n1}
\end{bmatrix}
+
\begin{bmatrix}
P_{12}E_{12} & \dots & P_{12}E_{n2}\\
P_{22}E_{12} & \dots & P_{22}E_{n2}\\
\vdots & & \vdots\\
P_{t2}E_{12} & \dots & P_{t2}E_{n2}
\end{bmatrix}
+ \dots +
\begin{bmatrix}
P_{1n}E_{1n} & \dots & P_{1n}E_{nn}\\
P_{2n}E_{1n} & \dots & P_{2n}E_{nn}\\
\vdots & & \vdots\\
P_{tn}E_{1n} & \dots & P_{tn}E_{nn}
\end{bmatrix}
\]

\[
= \begin{bmatrix} P_1 & \dots & P_n \end{bmatrix}
\begin{bmatrix} E_1^T \\ \vdots \\ E_n^T \end{bmatrix}
= P_1E_1^T + P_2E_2^T + \dots + P_nE_n^T
= \sum_{j=1}^{n} P_jE_j^T. \tag{3.2}
\]

$P_j$ and $E_j$ are vectors of length $t$ and $n$ respectively.

The $t \times n$ matrix $P_jE_j^T$ is the $j$-th order space-time component of the data $O_{t\times n}$ that captures the space-time interaction between $E_j$ and $P_j$. In other words, $P_jE_j^T$ is the $j$-th spatial-temporal ozone feature. As equation (3.2) shows, a complete ozone dataset can be recovered or constructed by summing $n$ spatial-temporal ozone features.

The PCA of CMAQ outputs in the following sections will show that the eigenvalues decrease rapidly as the order of the EOF/PC increases. The implication here is that the amount of data variation explained by higher-order space-time ozone features decreases rapidly. Indeed, the traditional statistical purpose of PCA is to reduce the dimensionality of a large dataset, using the fewest possible components to explain the most possible data variation (Hardle and Simar, 2012). If the first $p$ features capture most of the variation in $O$, instead of (3.2), we write

\[
O \approx \sum_{j=1}^{p} P_jE_j^T, \qquad p \ll n. \tag{3.3}
\]

The values of $p$ will be discussed in Section 3.3.

Relationship Between PCA and SVD

PCA is related to Singular Value Decomposition (SVD). In general terms, SVD decomposes data as $O = UMV^T$, where $U^T = U^{-1}$ and $V^T = V^{-1}$. When $O$ is a non-square matrix, $M$ is a rectangular-diagonal matrix of nonnegative real numbers.
It follows that:

\[
O^TO = (UMV^T)^TUMV^T = VM^TU^TUMV^T = VM^TMV^T.
\]

In addition, $V$ is an eigenvector matrix of $O^TO$ (i.e., $V = E$), and $U$ is an eigenvector matrix of $OO^T$. Utilizing the aforementioned equality $O^TO = E\Lambda E^T$, we have

\[
E\Lambda E^T = VM^TMV^T.
\]

In other words, $E \equiv V$ and $\Lambda \equiv M^TM$.

When $t < n$, there will be $t = \min(t, n)$ $E_j$’s and $\lambda_j$’s. The upcoming PCA in this chapter and the ozone feature modelling in Chapter 4 will be implemented using CMAQ outputs with $t < n$. However, to avoid confusion in notation, I will still denote the dimension of $E$ and $\Lambda$ by $n \times n$. Written as such, $\lambda_j$ values of orders $j > t$ will be 0, and $E_j$’s of $j > t$ will be structured such that the associated $P_j$ is approximately a zero vector. In other words, data components $P_jE_j^T$ of orders $j > t$ will recover no additional data variation.

3.2.3 Relevant PCA-Related Topics

The feature-based AQM evaluation approach requires that the following PCA-related complications be addressed: (1) the ozone features may be incomparable between AQM and observations, (2) propagation of EOF estimate errors from $E_1$ towards higher-order features, and (3) the ozone features may be inseparable or arbitrarily ordered. In addition, it is important to discuss PCA methods that may enhance the interpretability of ozone features.

Complications from Orthogonality Constraint and Data Column-Centering

If $E_1$ from the AQM modelled ozone captures a different type of spatial ozone feature from the observed $E_1$, then these $E_1$’s are incomparable. Due to the orthogonality requirement of $E_j$, this problem of feature discordance will be carried over to higher-order features, making the practice of feature-based AQM evaluation problematic. This problem is mentioned in the context of Acid Deposition Model evaluation in Cohn and Dennis (1994).

Moreover, the $E_j$ calculated from the PCA of $O_{t\times n}$ are estimates of the true EOFs of the underlying ozone process. This is because $O_{t\times n}$ can be regarded as sample data, i.e., one realization of the process.
Suppose the estimated 1st-order feature $E_1$ is incorrect or different from the true EOF; then the EOF orthogonality will cause the $E_j$’s of order $j \geq 2$ to be incorrect as well. This is the problem of “error propagation” between $E_j$’s, and the implication here is that AQM outputs will be evaluated against observations based on incomparable sets of features.

The PCA of the original un-centered data $O_{t\times n}$ helps to maximize feature-correspondence during CMAQ/AQM evaluation. As the PCA in subsequent sections and chapters will show, for an LFV ozone field, the spatial-temporal mean dominates the ozone variation and it is reliably captured by $E_1$ and $P_1$. Therefore, by keeping the mean structure of $O_{t\times n}$ intact, i.e., no column-centering, we may use the actual data of spatial and temporal means as reference points to assess (1) whether $E_1$ and $P_1$ are comparable between AQM and observations, and (2) how “close to correct” the extracted $E_1$ and $P_1$ are. Without the mean structure in $O_{t\times n}$, the 1st-order features may capture dynamic modes of ozone variations that cannot be properly interpreted or assessed for their EOF estimate errors.

Furthermore, except for the feature representing ozone means, PCA of the column-centered data $\tilde{O}_{t\times n}$ gave the same set of individual ozone features as those from the PCA of the original data (Appendix B.2). The differences lie in the orders of said features. Therefore, for our particular LFV ozone data, no additional ozone features or dynamic patterns are uncovered by subtracting out the mean structure.

There are other practical reasons for not centering data. For feature-based AQM evaluation, the space-time mean structure is the most fundamental and important feature that can be compared. Column-centering prior to PCA will remove this key feature from the evaluation. Furthermore, in Chapter 4, I will use equation (3.3) to model the space-time ozone process using individual ozone features.
PCA of $\tilde{O}_{t\times n}$ means that ozone features can only be constructed to model an ozone process without the mean structure. Earlier in this section, it was also pointed out that all PCA in this research is done on individual datasets with uniform within-data units and scale. Thus, the usual statistical reasoning behind data centering/standardization does not apply.

Considering the aforementioned results, the ozone PCA in this thesis will be implemented on the original un-centered data $O_{t\times n}$.

“Degeneracy” of EOFs and Eigenspectrum

North et al. (1982) explained the concept of “effective degeneracy” of EOFs in terms of the sampling error of eigenvalues. As is the case with EOFs, the eigenvalues $\lambda_j$ calculated by the PCA of $O_{t\times n}$ are estimates of some true eigenvalues of the underlying physical process. If the error associated with a certain $\lambda_j$ is close to its distance to an adjacent $\lambda_{j+1}$, then $E_j$ and $E_{j+1}$ form a “degenerate multiplet” where their estimated orders and captured features become arbitrary.

The implication for this research is that “degenerate” ozone features are ordered and separated arbitrarily, making the same-order feature comparison of AQM and observations difficult to analyze. For instance, if $E_1$ and $E_2$ from AQM are inseparable, then the equivalent features from observations may be captured by EOFs of different orders, i.e., the $E_1$ from AQM and observations may capture different, thus incomparable, ozone features. Furthermore, if $E_1$ and $E_2$ (from either AQM or observations) form a degenerate couplet, then the ozone features they capture may be “mixed” or “contaminated” into each other (Storch and Zwiers, 1999; Bjornsson and Venegas, 1997). One should note that the last problem is not always the case, as I will show in Section 3.4 with the PCA of LFV ozone data.

A generally applied rule of thumb to assess the separation between adjacent eigenvalues is summarized by North et al. (1982) based on asymptotic results.
It states that a “confidence interval” of an eigenvalue $\lambda_j$ is

\[
\hat{\lambda}_j \cdot \left(1 \pm \sqrt{\frac{2}{t}}\right), \tag{3.4}
\]

where $\hat{\lambda}_j$ is the eigenvalue estimate, and $t$ is the number of hours in $O_{t\times n}$ or the number of data observations. Due to correlations between observations, one may use the “effective sample size”, denoted as $n_e$, $n_e \leq t$, in place of $t$ (Hannachi et al., 2007). Estimates of $n_e$ based on time-series autocorrelation measures have been introduced by Thiebaux and Zwiers (1984) and Preisendorfer (1988), among others. There does not seem to be a definitive approach to estimating $n_e$, and some existing literature (North et al., 1982; Hannachi et al., 2007) simply applied the actual sample size. One example of an estimating equation from Thiebaux and Zwiers (1984), also referenced in the EOF review paper by Hannachi et al. (2007), has the form (using this thesis’s notation):

\[
n_e = t\left(1 + 2\sum_{k=1}^{t-1}\left(1 - \frac{k}{t}\right)\rho(k)\right)^{-1},
\]

where $\rho(k)$ is the autocorrelation function of order $k$ and $t$ is the number of hours in $O_{t\times n}$.

Another means of estimating the sampling error of an eigenvalue is Monte Carlo (MC) based sampling (Bjornsson and Venegas, 1997). One sampling approach, applied in Section 3.4, randomly samples locations from predetermined LFV subregions to form subsamples of space-time ozone data, and PCA is then applied to each subsample. Repeated realizations of subsample PCA yield an MC sampling distribution of the $E_j$’s, $j = 1, \dots, n$, and the eigenspectrum can be constructed. However, the sampling analysis in Section 3.4 is done for purposes other than estimating the eigenspectrum.

Rotation of EOFs

By design, PCA extracts data features $E_j$ that are orthogonal to each other, and this constraint may introduce difficulties in interpreting data features. One reason is that, for a real-world physical process, the features are not independent, hence the orthogonality of features should not be expected (Monahan et al., 2009).
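Returning briefly to the eigenvalue separation rule (3.4) and the effective sample size given earlier in this section, the Python sketch below shows how the two could be combined. It is a hedged illustration rather than the procedure used in the thesis: the function names are hypothetical, the autocorrelation function is supplied by the user (an AR(1)-type form with coefficient 0.5 is assumed purely for the example), the eigenvalues are made-up numbers sorted in decreasing order, and flagging a pair when the sampling error is at least as large as the gap is only one possible way to operationalize “close to”.

```python
import math


def effective_sample_size(t, rho):
    """n_e = t / (1 + 2 * sum_{k=1}^{t-1} (1 - k/t) * rho(k))."""
    weighted_sum = sum((1.0 - k / t) * rho(k) for k in range(1, t))
    return t / (1.0 + 2.0 * weighted_sum)


def north_interval(lam_hat, sample_size):
    """Rule-of-thumb interval lam_hat * (1 -/+ sqrt(2 / sample_size))."""
    half_width = lam_hat * math.sqrt(2.0 / sample_size)
    return lam_hat - half_width, lam_hat + half_width


def possibly_degenerate_pairs(eigenvalues, sample_size):
    """Return 1-based orders (j, j+1) whose eigenvalue gap does not exceed
    the rule-of-thumb sampling error of the j-th eigenvalue.
    Assumes eigenvalues are sorted in decreasing order."""
    pairs = []
    for j in range(len(eigenvalues) - 1):
        sampling_error = eigenvalues[j] * math.sqrt(2.0 / sample_size)
        gap = eigenvalues[j] - eigenvalues[j + 1]
        if sampling_error >= gap:
            pairs.append((j + 1, j + 2))
    return pairs


t = 96                                               # hours in one episode
n_e = effective_sample_size(t, lambda k: 0.5 ** k)   # assumed AR(1)-type rho
eigs = [100.0, 20.0, 18.0, 5.0]                      # made-up eigenvalues
print(round(n_e, 1))
print(possibly_degenerate_pairs(eigs, t))
print(possibly_degenerate_pairs(eigs, n_e))
```

With these made-up eigenvalues, only the 2nd and 3rd are flagged as a possible couplet, whether $t$ or $n_e$ is used; a smaller effective sample size widens every interval and can only add pairs to the list.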
Richman (1986) provided an extensive discussion on the complications facing PCA of spatial processes, two of which are related to our study: “domain shape dependence” and “subdomain stability” of PCA.

Domain shape dependence suggests that the shapes of the extracted $E_j$’s are influenced by the topography of the studied region, so that important underlying physical processes may not be properly captured. Subdomain stability concerns the case where data from only a portion of the domain is used in PCA to draw conclusions about the dominant modes of a physical process. For example, if an $E_j$ decomposed from the complete LFV ozone data differs from a same-order $E_j$ decomposed from a subregion’s data in the lower valley, which $E_j$ captures the true underlying feature?

In our case, one may account for the problem of domain shape dependence by implementing PCA on a part of LFV with similar topography, e.g., locations with low elevation. Such ozone PCA will be implemented in Section 3.4, where the region of interest will be the part of the rectangular LFV (Figure 3.3) where elevations are below 150 meters. In Section 3.4, I will also use a method of spatial sampling to analyze LFV ozone’s subdomain stability.

There are also purely mathematical approaches to alleviate the aforementioned concerns of regular PCA, and rotation of EOFs is “perhaps the most widely used” method for such a purpose (Hannachi et al., 2007). This procedure rotates the estimated EOFs by maximizing the spatial weights within the $E_j$’s toward specific regions, potentially highlighting underlying spatial variation/structure and aiding EOF interpretation. There are various approaches to the rotation of EOFs, but the general idea is the same. Suppose $M^r$ is an EOF rotation matrix of dimension $p \times p$, where $p \ll \min(n, t)$ is the number of dominant data features.
Then the rotated EOF $E^r$ of dimension $n \times p$ is

$$E^r_{n \times p} = E[\,, 1{:}p]\, M^r_{p \times p}.$$

The matrix $M^r$ is found by solving the maximization problem $\max_{M^r} f(E[\,, 1{:}p]\, M^r_{p \times p})$, where $f(\cdot)$ represents a function defining a specific rotation method.

A review of the literature suggests that VARIMAX rotation is the most commonly used method, e.g., see the recent work by Eder et al. (2014). Hannachi et al. (2007) also mentioned VARIMAX as being the “most well-known and used” rotation method. In VARIMAX rotation, the maximization of $f(E[\,, 1{:}p]\, M^r_{p \times p})$ is implemented under the constraint $M^r (M^r)^T = (M^r)^T M^r = I_{p \times p}$, i.e., the rotation matrix is orthogonal. The function $f(E[\,, 1{:}p]\, M^r_{p \times p}) = f(E^r_{n \times p})$ has the form

$$f(E^r_{n \times p}) = \sum_{k=1}^{p} \left[ n \sum_{j=1}^{n} E^r[j, k]^4 - \left( \sum_{j=1}^{n} E^r[j, k]^2 \right)^2 \right].$$

A numerical feature of VARIMAX is that the elements, or individual weights, in the rotated $E_j$’s are shifted towards either 0 or ±1, thus revealing a more focused and simpler spatial pattern for interpretation.

Appendix B.2 shows the comparison plots between the normal $E_j$’s and the $p = 4$ VARIMAX rotated $E_j$’s, where the $O_{t \times n}$ data is the CMAQ output for the 2006 episode. The results indicate that the utility of EOF rotation is not obvious for the LFV ozone data: the pattern shifting and focusing of the $E_j$’s spatial weights do not result in easier-to-interpret ozone features. This is also true for higher-order features with closely spaced eigenvalues, i.e., features that form degenerate multiplets. In the end, I found no reason to apply EOF rotation given our data, and the upcoming feature-based ozone analyses will proceed without further considering EOF rotation.

Poorly conditioned covariance matrix

Ledoit and Wolf (2004) pointed out that large-dimensional sample covariance matrices are often non-invertible and ill-conditioned, and decomposing such a matrix will give a more “dispersed” set of eigenvalues than the underlying truth.
The authors proposed an optimal covariance matrix estimator that is a linear combination of the sample covariance and scaled identity matrices, where the scales are estimable from the sample and proved to be asymptotically consistent. The idea is to “shrink” the sample covariance towards the identity matrix so that it is well-conditioned.

Eigen-decomposition does not require a matrix to be full-rank. One can still decompose $O_{t \times n}$ into $E_j$’s and $P_j$’s when the sample covariance matrix is ill-conditioned. The main problem is that the sample $E_j$ may be poorly estimated and not allow meaningful further analysis. However, as I will show in Section 3.4, this is not the case here. Moreover, it can be argued that highly dispersed $\lambda_j$ are preferable in our application. First, it means that a very small number of ozone features can capture most of the data structure. Secondly, by North’s rule of thumb, well-separated $\lambda_j$ alleviate the problem of feature degeneracy, thus allowing for individual analysis of the leading features.

Thus the method of sample covariance correction, although a viable alternative, is not attempted.

3.3 The Number of Useful Ozone Features

Before delving into the analysis of ozone features, it is useful to first define criteria for the number $p$ of ozone features that are useful for further analysis. A useful ozone feature should, at the very least, recover a non-trivial amount of space-time ozone data variation. Ideally, it should also capture interpretable ozone features that enhance our big-picture understanding of an ozone process, i.e., interpretability makes subsequent feature-based ozone modelling (Chapter 4) and CMAQ evaluation (Chapters 5 and 6) more informative.

The number of “useful” EOFs is often determined by the number required to explain a large portion of data variation, so that these EOFs are sufficient to represent the original data. What constitutes “a large portion of variation” is a rather arbitrary decision.
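As a rough illustration of how a “portion of data variation explained” can be computed for un-centered data, the following sketch uses the singular values of a data matrix. The function names, the synthetic field, and the cutoff value passed to `smallest_p` are assumptions made for this example only; they are not taken from the thesis code.

```python
import numpy as np

def explained_variation(O):
    """Share of total un-centered variation carried by each EOF/PC pair."""
    s = np.linalg.svd(O, compute_uv=False)   # singular values, descending
    lam = s ** 2                              # eigenvalues of O^T O
    return lam / lam.sum()

def smallest_p(O, threshold):
    """Smallest p whose leading features jointly reach `threshold`."""
    cumulative = np.cumsum(explained_variation(O))
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical 96-hour x 50-site field: one dominant structure plus noise.
rng = np.random.default_rng(0)
hours = np.arange(96)
diurnal = 40.0 + 20.0 * np.sin(2 * np.pi * hours / 24)   # time profile
sites = rng.uniform(0.5, 1.5, size=50)                    # spatial weights
O = np.outer(diurnal, sites) + rng.normal(0.0, 1.0, (96, 50))

props = explained_variation(O)   # proportions, one per feature order
p80 = smallest_p(O, 0.80)        # illustrative cutoff, an assumption
```

Because the data are not centered, the leading proportion includes the mean structure, which is why it can be very large for a field of strictly positive concentrations.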
Typically the chosen EOFs should combine to explain ≥ 80% of data variation (Higdon et al., 2008; Hannachi et al., 2007).

3.3.1 Recovering Data Variations

Figures 3.5 and 3.6 show the hourly RMSEs of the $O_{t \times n}$ reconstruction for all five episodes. Here, $O_{t \times n}$ is the CMAQ output for each episode over the entire 96 hours, with the spatial domain containing the locations at elevation ≤ 150 meters. The plots show the RMSE by hour between $O_{t \times n}$ and the reconstructions from $\sum_{j=1}^{p} P_j E_j^T$ done for $p = 1, \ldots, 4$. Note that there is no modelling of $E_j$’s and $P_j$’s involved in this result. I simply recovered the CMAQ outputs using increasing numbers of space-time components, used these reconstructions as ozone estimates and calculated RMSEs at each hour.

As shown, the first two space-time features recover most of the daytime ozone variation, with notable exceptions during the afternoons of 1998, where the addition of the 3rd ozone component improves the RMSE noticeably. Similar improvements are also observed to a lesser extent during the last afternoons of 2001 and 2006. Otherwise, the addition of the 3rd feature recovers variations between late evening and morning hours. Subsequent ozone features capture mostly ozone variations during nocturnal hours of diminished ozone activity. This result supports the findings in the following section that successively higher-order features capture increasingly localized ozone variations in space and time. As I will elaborate in the next section, the description “localized variation” means that the recovered space-time variations are confined to a narrow window of time and a narrow geographical region. Furthermore, these features capture space-time variations that are generally episode specific, and they are neither influential nor structured enough to allow for definite interpretation.

Figure 3.5: Hourly RMSE (units ppb) of the $O_{t \times n}$ reconstruction ($\sum_{j=1}^{p} P_j E_j^T$) using an increasing number of data components up to $p = 4$. The $O_{t \times n}$ are the 1985, 1995 and 1998 CMAQ outputs across the LFV at elevation ≤ 150 m. Panels: the 1985 episode (0000 PST July 19th to July 22nd), the 1995 episode (0000 PST July 17th to July 20th) and the 1998 episode (0000 PST July 25th to July 28th); x-axis: time; y-axis: RMSE (ppb); curves: EOF/PC 1 and 2, EOF/PC 1 to 3, EOF/PC 1 to 4.

Figure 3.6: Hourly RMSE (units ppb) of the $O_{t \times n}$ reconstruction ($\sum_{j=1}^{p} P_j E_j^T$) using an increasing number of data components up to $p = 4$. The $O_{t \times n}$ are the 2001 and 2006 CMAQ outputs across the LFV at elevation ≤ 150 m. Panels: the 2001 episode (0000 PST Aug. 10th to Aug. 13th) and the 2006 episode (0000 PST June 24th to June 27th); x-axis: time; y-axis: RMSE (ppb); curves: EOF/PC 1 and 2, EOF/PC 1 to 3, EOF/PC 1 to 4.

Figures 3.7a and 3.7b show, for select hours during day 3 of the 2006 episode, the spatial ozone fields of the CMAQ output (original data) and the corresponding approximation with $p = 4$: $\sum_{j=1}^{4} P_j E_j^T$.
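The reconstruction and hourly RMSE described in this subsection can be sketched as follows. This is a minimal illustration assuming the un-centered decomposition is obtained from a singular value decomposition, with $P_j$ carrying the singular value; the function names and the random stand-in data are hypothetical and do not reproduce any CMAQ output.

```python
import numpy as np

def reconstruct(O, p):
    """Sum of the leading p space-time features, sum_{j<=p} P_j E_j^T."""
    U, s, Vt = np.linalg.svd(O, full_matrices=False)
    P = U[:, :p] * s[:p]        # temporal features (PCs), t x p
    E = Vt[:p].T                # spatial features (EOFs), n x p
    return P @ E.T

def hourly_rmse(O, p):
    """RMSE across locations at each hour, between O and its rank-p reconstruction."""
    resid = O - reconstruct(O, p)
    return np.sqrt(np.mean(resid ** 2, axis=1))

# Stand-in data only: 96 hours x 40 locations of positive values (ppb).
rng = np.random.default_rng(1)
O = rng.uniform(5.0, 80.0, size=(96, 40))

rmse_by_p = {p: hourly_rmse(O, p) for p in (1, 2, 3, 4)}
# Overall RMSE (all hours and locations) for each p, in the same order.
overall = [float(np.sqrt(np.mean(r ** 2))) for r in rmse_by_p.values()]
```

Each entry of `rmse_by_p` is a length-$t$ series that could be plotted against time in the manner of Figures 3.5 and 3.6, while `overall` corresponds to the kind of all-hours, all-locations average reported per episode.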
The hourly spatial plots show that the sum of the leading $p = 4$ ozone features appears to recover the defining space-time structure of the originating ozone data.

Figure 3.7: For selected hours, the spatial field of the 2006 CMAQ output and the corresponding feature-based data reconstruction using $p = 4$: $\sum_{j=1}^{4} P_j E_j^T$. (a) Original ozone fields. (b) Ozone fields recovered with the leading $p = 4$ ozone features. The colour scales are held consistent for all plots.

Table 3.2 shows the $O_{t \times n}$ reconstruction RMSEs averaged across all locations and hours for all five episodes, and Table 3.3 shows the proportion of data variation explained by the leading ozone features. The main results are consistent across episodes: (1) $E_1$ and $P_1$ jointly recover ≥ 90% of data variation, and the proportion of recovered variation quickly decreases to ≈ 0% at order $j = 5$ (Table 3.3), and (2) the improvement in RMSE decreases to ≤ 0.6 ppb from $p = 5$ onward (Table 3.2).

           p: number of features used for reconstruction
Episode    1      2      3      4      5      6      7      8
1985       10.4   7.57   6.01   5.15   4.78   4.34   4.04   3.93
1995       10.8   7.82   6.32   5.17   4.72   4.28   3.87   3.63
1998       13.2   8.94   6.37   5.69   5.25   4.71   4.25   4.04
2001       11.1   7.84   6.23   5.16   4.72   4.35   3.92   3.77
2006       7.66   5.80   4.07   3.27   2.83   2.62   2.39   2.29

Table 3.2: For the CMAQ outputs of 5 episodes: the $O_{t \times n}$ reconstruction RMSEs in units ppb (averaged across all locations and hours) at $p = 1, \ldots, 8$. The decompositions are for the entire episode, $t = 96$ hours.

           EOF/PC order j
Episode    1      2      3      4      5
1985       0.91   0.04   0.02   0.01   0.00
1995       0.91   0.04   0.01   0.01   0.00
1998       0.93   0.04   0.02   0.01   0.00
2001       0.90   0.05   0.02   0.01   0.00
2006       0.95   0.03   0.02   0.01   0.00

Table 3.3: For the 5 ozone episodes: the proportion of data variation explained by the leading 5 EOFs/PCs. The decompositions are done for the entire episode, $t = 96$ hours.

Furthermore, ozone feature analysis in Chapters 4 to 6 will model ozone features as spatial or temporal processes driven by background meteorology, chemical precursor emissions and antecedent concentrations. It will be shown that statistical models begin to lose their predictive capability with high-order ozone features $j \ge 3$, suggesting diminishing associations between these features and process-driving background conditions.

In summary, for real CMAQ outputs of most episodes, $P_1 E_1^T$ and $P_2 E_2^T$ are sufficient for the analysis of the daytime ozone process in terms of recovering variations in the data. Features of orders $j = 3, 4$ recover the remainder of daytime ozone variations and most of the variations outside of daytime ozone peak hours. The ozone features of orders $j \ge 5$ do not recover significant amounts of space-time variation.

I also implemented a simulation-based analysis (Appendix B.1) where the ozone feature sum $\sum_{j=1}^{p} P_j E_j^T$ from one synthetic dataset is used to predict another realization of a space-time ozone field; the procedure is repeated a large number of times for various values of $p$. I found that the predictive accuracy starts to deteriorate at order $p = 3$. This indicates a lack of structure, i.e., domination of data noise, within the ozone features of orders $j \ge 3$. Since the synthetic ozone field emulates the structures of the daytime LFV ozone field, the simulation-based statistical test implies that there is no practical reason for using more than two ozone features to model a complete daytime process. This result is consistent with the aforementioned results using real CMAQ outputs.

The simulation-based approach just described combines elements of the PC selection methods proposed in Chapter 5 of Preisendorfer (1988), which are categorized into three general selection rules.
In the “Dominant Variance” selection rule, synthetic data are generated from assumed Gaussian processes, and these data are then decomposed to obtain a sample of eigenvalues. An EOF/PC pair from the real data is retained if its associated eigenvalue is above the 95th percentile of the sample eigenvalues (called Rule N). The “Time History” selection rule uses statistical tests of spectral whiteness of serial correlation to assess the noisiness of PCs, which determines the PCs to keep. Lastly, in the “Space-map” rule, the $E_j$ decomposed from data are compared to a set of well-understood modes of variation, such as those from a General Circulation Model. Any $E_j$ from the real data that has a close match (based on tests of conical direction angles) to a dynamic mode is retained.

3.3.2 Order of Ozone Feature Degeneracy

The number of useful ozone features may also be determined through the spectra of PCA eigenvalues. From the eigenspectra, one may judge the order $j$ at which the $E_j$’s begin to lose their separability, thus making individual feature analysis questionable.

Figure 3.8 shows, for the 1985 and 2006 episodes, the eigenspectra composed using North’s rule of thumb described in Section 3.2. Here, the 1st eigenvalue is not shown because its high value makes the higher-order eigenspectrum difficult to visualize. The PCA is based on different datasets: for 1985, the dataset is from its last two full days, which are dominated by the type IV wind regime, and for 2006, the dataset is the first two full days, which are driven by the type I regime.

It was mentioned in Section 3.2 that various estimates of the effective sample size $n_e$ exist, most of which are based on autocorrelation measures calculated from the data. I tried the methods described in Thiebaux and Zwiers (1984) and Preisendorfer (1988), and found that all estimated $n_e$ resulted in eigenspectra that give the same overall conclusion regarding the separability of the leading features.
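As a rough illustration of how an effective sample size feeds into the rule-of-thumb interval of equation (3.4), the sketch below implements the Thiebaux and Zwiers (1984) form quoted in Section 3.2 together with a simple separability check. Several choices here are assumptions of this example rather than details from the thesis: the autocorrelation is estimated from a single hypothetical series `x`, the optional `max_lag` truncation and the clipping of $n_e$ to $[1, t]$ are added for numerical stability, and “degenerate” is taken to mean that the sampling error of $\lambda_j$ is at least as large as its gap to $\lambda_{j+1}$.

```python
import numpy as np

def effective_sample_size(x, max_lag=None):
    """n_e = t / (1 + 2 * sum_k (1 - k/t) * rho(k)), clipped to [1, t]."""
    x = np.asarray(x, dtype=float)
    t = x.size
    max_lag = t - 1 if max_lag is None else min(max_lag, t - 1)
    d = x - x.mean()
    denom = np.dot(d, d)
    k = np.arange(1, max_lag + 1)
    # Sample autocorrelation at lags 1..max_lag.
    rho = np.array([np.dot(d[:-lag], d[lag:]) / denom for lag in k])
    n_e = t / (1.0 + 2.0 * np.sum((1.0 - k / t) * rho))
    return float(np.clip(n_e, 1.0, t))

def north_interval(lam, n_e):
    """Rule-of-thumb interval lam * (1 -/+ sqrt(2 / n_e)) for each eigenvalue."""
    lam = np.asarray(lam, dtype=float)
    half_width = lam * np.sqrt(2.0 / n_e)
    return lam - half_width, lam + half_width

def degenerate_with_next(lam, n_e):
    """True at position j when the error of lam[j] is >= its gap to lam[j+1]."""
    lam = np.asarray(lam, dtype=float)
    error = lam * np.sqrt(2.0 / n_e)
    gap = lam[:-1] - lam[1:]
    return error[:-1] >= gap

# Toy eigenvalues (not thesis results): the last two are closely spaced.
lam = np.array([4.0, 1.0, 0.9])
lower, upper = north_interval(lam, n_e=8)
flags = degenerate_with_next(lam, n_e=8)

# Hypothetical autocorrelated hourly series of length 96.
rng = np.random.default_rng(2)
x = np.zeros(96)
for i in range(1, 96):
    x[i] = 0.8 * x[i - 1] + rng.normal()
n_e = effective_sample_size(x, max_lag=24)
```

A smaller $n_e$ widens every interval, so an analysis that remains separable under a deliberately low $n_e$ is conservative with respect to larger estimates.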
Both plots are made using $n_e = 18$, which is lower than all estimated $n_e$. Hence, based on equation (3.4), the width of the confidence band under any of the estimated $n_e$ can only be narrower, i.e., the order of EOF degeneracy can only get higher than those shown.

The 1985 eigenspectrum shows that for ozone fields dominated by the type IV regime, although the eigenvalues $\lambda_2$ and $\lambda_3$ are significantly higher than the rest, they cannot be separated from each other. For the 2006 episode, the first 2 ozone features are separable from the rest, while the 3rd and 4th-order ozone features form a couplet. This latter result is the case for all episodes not dominated by the type IV wind regime.

Figure 3.8: The eigenspectra of $\lambda_j$, $j = 2, \ldots, 10$, decomposed from the 1985 (top) CMAQ output under the type IV regime and the 2006 (bottom) output under the type I regime. The spectrum is based on the rule of thumb of North et al. (1982). The dashed lines indicate the orders of ozone features that can be separated from the rest, whether individually or as degenerate sets. Panels: “Plot of eigenvalue spectrum: 1985 during type IV regime” and “Plot of eigenvalue spectrum: 2006 during type I regime”; x-axis: the rank of eigenvalue; y-axis: eigenvalues.

These results from the PCA eigenspectrum are crucial when analyzing individual ozone features, and Figure 3.8 will be revisited in the next section. The evidence from the eigenspectra suggests that for ozone processes driven by wind regime types I, II and III, the first two ozone features can be studied independently. For episodes dominated by the type IV regime, the 2nd and 3rd-order ozone features should be interpreted jointly, or each analyzed with consideration of the other.

The various analyses presented in this section and Appendix B.1 are designed to help answer the question: how many ozone features are useful? The answer is that ozone features of order $j = 1, 2, 3, 4$ should be the focus.
However, given the specific contexts and foci of the upcoming statistical analyses, the number of useful ozone features will be smaller. The reasoning behind the use and analysis of any particular feature will be discussed at the appropriate places in the following chapters.

3.4 Ozone Features of LFV Ozone Episodes

In this section, PCA will be implemented on CMAQ ozone outputs to identify and understand what types of ozone features define the LFV ozone field during an episode. Such an exercise also determines if interpretable ozone features can be obtained through the PCA of space-time ozone data. Here, I am not attempting to interpret physical modes of variation. By “interpretable”, I am referring to an ozone feature that either represents statistical summaries of data, or captures recognizable LFV ozone structures and behaviour, e.g., general patterns of diurnal ozone advection across LFV. Furthermore, the ozone feature analysis is done by accounting for the non-separability or “degeneracy” of multiple ozone features. Lastly, the end of this section contains a study of LFV’s PCA subdomain stability.

PCA is implemented on CMAQ outputs by episode, and on CMAQ outputs where days are separated into wind regime types. It should be noted that I do not combine datasets from different episodes under the same wind regime. PCA of each wind regime type is instead analyzed episode by episode. PCA of a specific regime within the same episode helps to analyze the effect of the regional wind pattern on the dominant ozone features. Same-episode PCA such as this controls for the effect of changing emission source distribution over the years (Section 2.2). PCA based on wind regime is also a means of CMAQ evaluation without comparison with observations. It evaluates how well CMAQ can capture the effect of regional wind flow in its modelling of LFV’s ozone process.

The PCA results to be shown are implemented on ozone fields at elevations ≤ 150 meters.
Such datasets capture the ozone field across a roughly triangular region covering the lower valley of the “rectangular” LFV (Figure 3.3), with Chilliwack on the eastern edge. I will simply refer to this lower valley region as “LFV”. LFV is where the physical ozone monitoring stations are located, and it is the region of interest for CMAQ model evaluation.

In this study, I also implemented PCA on data of dimension $n \times t$. The resulting ozone features (not shown) are equivalent to the upcoming results, where the spatial and temporal features are captured by $P_j$ and $E_j$ instead.

3.4.1 Common Ozone Features of All Episodes

Figure 3.9 shows the spatial plots of temporal ozone means (the mean field) summarized from the 96-hour CMAQ outputs. The same figure also shows the spatial plots of the 1st-order features $E_1$ extracted from the same CMAQ outputs. Figure 3.9 shows that the spatial patterns of the mean fields are similar between episodes, and the pattern of each mean field is accurately and reliably captured by the corresponding $E_1$. Eastern LFV has distinctly higher temporal means than the west. This result indicates that the eastward ozone advection (in combination with low boundary layer heights, Section 3.1) causes the eastern LFV to experience a continuously high level of ozone pollution, and thus higher temporal means. The same advection pattern also gives the western LFV, the main source of ozone formation, a diurnal cycle of increase-peak-decrease. Thus, western LFV locations have smaller temporal means.

Figure 3.10 shows the spatial plots of ozone standard deviations (calculated across time) and the 2nd-order spatial features $E_2$ of each episode. As shown, $E_2$ captures the spatial pattern of ozone standard deviation, and its values range from negative to positive.
The spatial variation of ozone standard deviation can be explained by the aforementioned behaviour of LFV’s ozone process: pronounced ozone fluctuation in the west and continuously high pollution in the east. Hence, once summarized over time, the western LFV locations have higher ozone standard deviations than the eastern locations.

As I will elaborate in Section 3.4.3, $E_2$ can be interpreted as the spatial ozone contrast between the area of ozone plume formation (western LFV) and the area to which most of the ozone moves (eastern LFV). More importantly, the interactive space-time feature $P_2 E_2^T$ will be shown to capture a specific pattern of eastward ozone advection, as well as the magnitudes of ozone formation and destruction across LFV.

Figure 3.11 shows the hourly spatial ozone means (averaged across space) and the $P_1$ decomposed from the 5 ozone episodes. As in Figures 3.9 and 3.10, the PCA is based on the entire 96-hour period of each episode. Hence these $P_1$ are the temporal ozone features associated with the spatial features $E_1$ in Figure 3.9. As shown, $P_1$ captures the temporal pattern of the hourly LFV mean ozone, and the temporal patterns of the 1st-order ozone features are nearly identical between episodes. One exception is the last afternoon of 1985, where we see a bimodal ozone peak not evident in other episodes, and this pattern is captured by the 1985 $P_1$.

Lastly, Figure 3.12 shows the temporal patterns of hourly LFV ozone standard deviations and $P_2$ from the 5 ozone episodes. These time series indicate that the standard deviation of ozone across space varies temporally in a less consistent way from one episode to another. They also show that each 2nd-order temporal ozone feature $P_2$ captures a pattern of temporal ozone contrasts that cycles diurnally. The temporal trend of $P_2$ further corresponds to a smoothed and inverted curve of the hourly LFV standard deviation over the 96 hours.

Figure 3.9: Plots of the mean fields and $E_1$’s of the 5 ozone episodes. The PCA is implemented on the ozone episodes in their entirety (96 hours).

Figure 3.10: Plots of the spatial field of temporal ozone standard deviations (calculated across time) and $E_2$’s of the 5 ozone episodes. The PCA is implemented on the ozone episodes in their entirety (96 hours).

Figure 3.11: Time series plots of hourly spatial (LFV) ozone means and $P_1$’s of the 5 ozone episodes. The number in each PC plot heading is the proportion of data variation explained. All plotted data have units ppb.

Figure 3.12: Time series plots of hourly LFV ozone standard deviations (calculated across space) and $P_2$’s of the 5 ozone episodes. The number in each PC plot heading is the proportion of data variation explained. All plotted data have units ppb.

Figures 3.9 to 3.12 illustrate an important point: LFV ozone is consistently dominated by the same systematic ozone structures during every episode, and they can be reliably captured by the first 2 EOFs and PCs. These results highlight the stable, recurring nature of the dominant LFV ozone features, despite the episodic variations in emissions and wind regimes. In the remainder of this section, I will interpret ozone features in terms of their space-time interactions $P_j E_j^T$, i.e., I will interpret the spatial-temporal ozone features.

3.4.2 $P_1 E_1^T$: Structure of Space-Time Ozone Mean

Figure 3.13 gives the dynamic spatial plots of the space-time feature $P_1 E_1^T$ of the 2006 episode. Each space-time ozone feature $P_j E_j^T$ is a data matrix of dimension $t \times n$, where each row $i$, $i = 1, \ldots, t$, relates to the spatial ozone feature of that particular hour $i$.
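The construction of a space-time feature as a $t \times n$ matrix can be sketched as below. This is an illustrative sketch assuming the un-centered decomposition is computed by a singular value decomposition, with the singular value absorbed into $P_j$; the function name and the random stand-in data are hypothetical.

```python
import numpy as np

def space_time_feature(O, j):
    """Return P_j E_j^T (t x n) for the j-th EOF/PC pair, j = 1, 2, ..."""
    U, s, Vt = np.linalg.svd(O, full_matrices=False)
    P_j = U[:, j - 1] * s[j - 1]     # temporal feature, length t
    E_j = Vt[j - 1]                  # spatial feature, length n
    return np.outer(P_j, E_j)

# Stand-in data only: strictly positive values for 96 hours x 30 locations.
rng = np.random.default_rng(3)
O = rng.uniform(10.0, 80.0, size=(96, 30))

F1 = space_time_feature(O, 1)
F2 = space_time_feature(O, 2)
hour_map = F2[13]    # one row: the spatial map of the feature at that hour

# Summing the features over all orders recovers the original data.
total = sum(space_time_feature(O, j) for j in range(1, 31))
```

For strictly positive un-centered data, the first feature is positive everywhere, whereas higher-order features take both signs, so the product $P_j E_j^T$ does not depend on the arbitrary sign of the individual singular vectors.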
The presented dynamic spatial plots are for selected hours from the 3rd day of 2006.

Figure 3.13: From the 2006 ozone episode under the type I regime: spatial plots of $P_1 E_1^T$ (units ppb) at selected times shown in plot headers.

When multiplied, $P_1$ and $E_1$ capture the space-time interaction between the spatial and temporal ozone means: a spatial-temporal feature that represents the underlying mean structure of the data $O_{t \times n}$. As the episode progresses (the hour changes and different rows of $P_1 E_1^T$ are selected), the hourly pattern of ozone variation in $P_1 E_1^T$ remains as defined by $E_1$; only the spatial values are scaled by $P_1$ at each hour. The $P_1 E_1^T$’s from other episodes capture a dynamic structure very similar to that illustrated in Figure 3.13. The aforementioned general pattern of the 1st-order ozone feature (as well as the ozone means) remains consistent when ozone PCA is done based on wind regime type (not shown).

Lastly, without subtracting the column means from $O_{t \times n}$, the space-time mean structure $P_1 E_1^T$ dominates LFV’s regional ozone variation. It accounts for > 90% of the data variation of any ozone episode or of ozone data under any wind regime. The exact proportions were shown in the header of each $P_1$ plot and summarized in Table 3.3 from Section 3.3.

3.4.3 $P_2 E_2^T$: Dominant Patterns of Ozone Advection

In the following discussion, the joint spatial-temporal ozone features of orders $j \ge 2$ are extracted from ozone data under individual wind regime types. I should note again that I do not combine datasets from different episodes under the same regime.
PCA of each regime type is done episode by episode: each decomposed dataset is the part of an episode under one specific regime. Similar analysis can still be made based on PCA of entire episodes, but discussing ozone features in the context of the background wind regime provides a clearer picture of the way CMAQ models ozone advection across LFV.

Figure 3.14 shows the dynamic spatial plot of $P_2 E_2^T$ for the 2006 CMAQ output under the type I wind regime (first 2 full days). The result reveals that the second ozone feature captures the dynamic evolution of the spatial ozone contrast between the west and the east of LFV. More specifically, $E_2$ captures the spatial contrast between the area of ozone plume formation (western LFV) and the area that is the destination of ozone advection. The term “contrast” is also used in Jin et al. (2011) to highlight two ozone regions with contrasting signs.

During the afternoon ozone-peak hours, the spatial contrast is positive in the west and negative in the east. The spatial contrast then reverses sign from 2000PST onward, and this diurnal evolution of ozone contrast is repeated throughout the episode. This dynamic alternation of contrast sign is the result of $E_2$ and $P_2$ representing corresponding spatial and temporal ozone contrasts (Figures 3.10 and 3.12). As discussed, $E_2$ has positive contrast in the west and negative contrast in the east. Regardless of the episode, $P_2$ is a dipole wave that alternates between (+) during daytime and (−) during night time to early morning. Thus, the interaction of $E_2$ and $P_2$ results in the type of dynamic spatial contrast shown.

Figure 3.14: From the 2006 ozone episode under the type I regime: spatial plots of $P_2 E_2^T$ (units ppb) at selected hours.

Interpretation of $P_2 E_2^T$: the dominance of westerly wind flow

The first ozone feature $P_1 E_1^T$ contains only positive values, and it captures the underlying structure of the space-time ozone mean.
Each element in $P_1 E_1^T$ represents the base ozone concentration at a particular location and time. Higher-order ($j = 2, 3, \ldots$) ozone features have values ranging from negative to positive; they are ozone correction or adjustment terms that, in successive order of $j = 2, 3, \ldots$, subtract or add ozone values that are specific to each location and time.

During the afternoon peak hours (1300PST, Figure 3.14), the contrast is positive around western LFV, indicating the formation of an ozone plume in the west. By evening (2100PST, Figure 3.14), the $P_2 E_2^T$ values change to negative in the west, indicating that the ozone plume is transported out of this region between 1300PST and 2100PST. During the same hours, the contrast in eastern LFV (Chilliwack, to be specific) changes from (−) to (+): the ozone from the west is transported to the east. Therefore, this $P_2 E_2^T$ captures a process of west-to-east ozone advection driven by a general westerly wind system. Its order of decomposition further indicates that this westerly wind is the most dominant flow regime of LFV ozone.

The magnitudes of the positive and negative ozone contrasts further reveal the numerical amount of ozone formation or destruction relative to the ozone means $P_1 E_1^T$. For example, Figure 3.14 shows that at 1300PST, the center of ozone formation (western LFV) creates about 10 ppb of ozone in addition to the underlying ozone means. At 2100PST, the contrast there is generally at −10 ppb, indicating that ozone is lost by this amount due to both advection and local photochemical reactions. At 2100PST in Chilliwack (eastern tip of the map), the positive contrasts show that this area gained around 10-15 ppb of ozone due to both the transport of the pollution system from the west and local ozone creation.

The hourly values of $P_2 E_2^T$ are near 0 ppb during the “transitional” hours when the positive contrast switches from west to east.
This implies that during these hours the spatial ozone pattern resembles the underlying mean.

$P_2 E_2^T$ of Episodes dominated by Type I, II and III Wind Regime

All 2nd-order ozone features $P_2 E_2^T$ from episodes dominated by type I, II and III wind regimes capture the same general structure of dynamic east-west ozone contrast (Figure 3.14). As I will now show, although differences in spatial patterns do exist between the $P_2 E_2^T$’s, they are subtle. In the following discussions I will use “type I ozone feature” as short for “the feature of ozone fields dominated by the type I wind regime”, and so forth.

The 2006 episode is dominated by wind regime type I on the first 2 full days and type III on the 3rd day, hence the comparison of $P_2 E_2^T$ between wind regime types I and III is done through the PCA of the two subsets of the 2006 CMAQ output. Selected dynamic spatial plots of $P_2 E_2^T$ under regime types I and III are shown in Figure 3.15. When compared to the feature under the type III regime, the type I feature has positive contrast covering a smaller area of LFV during the hour of the daily ozone peak (1300PST). At night time the negative contrast is also larger in magnitude for the type I feature.

Figure 3.15: From the 2006 episode under type I (top) and III (bottom) wind regime: $P_2 E_2^T$ (units ppb) from the same selected hours.

Figure 3.16 shows the $P_2 E_2^T$ from the 2001 CMAQ output, which is dominated by a type II wind regime. As shown in Figure 3.4, the directional flow under the type II regime stays close to 290° throughout the day, so it is defined by a stable northwesterly flow regime. The PCA results show that the $P_2 E_2^T$ of the 2001 episode (all type II) and the 1998 episode (type II for two days) also captured the same form of dynamic east-west ozone contrast as seen in the type I and III features.
One pattern is unique to the year 2001: the eastern ozone contrast extends beyond Chilliwack to include a large portion of Abbotsford, whereas the areas of eastern contrast from other wind regime types are focused solely around Chilliwack.

Figure 3.16: From the 2001 episode under type II wind regime: $P_2 E_2^T$ (units ppb) from selected hours.

In summary, any variations between the space-time patterns of $P_2 E_2^T$ under regime types I, II and III are slight, and they do not affect the overall conclusion that $P_2 E_2^T$ under these wind regimes represents the same form of space-time ozone contrast. Interpreted as an atmospheric process, the preceding analyses reveal that eastward ozone advection, driven by a westerly wind system, is the most dominant advection mechanism of the LFV ozone process under the type I, II and III wind regimes.

Ozone features of the 1985 episode: type IV wind regime

From Figure 3.4, one can see that the type IV regime is dominated by easterly winds for most of the day, and westerly wind is only observed during a few afternoon hours. Figure 3.1 showed the 3-dimensional ozone fields from selected hours of the 1985 episode under the type IV wind regime. One main feature of this episode is the heavy accumulation of ozone in southwest LFV during night time, a feature that is not seen in the 2006 episode (Figure 3.2).

The following results are based on the PCA of the CMAQ ozone output for the 3rd and 4th full days of 1985: the days dominated by the type IV wind regime. Figure 3.17a shows $P_2 E_2^T$ from selected hours. As shown, the 2nd-order ozone feature still captured a form of dynamic east-west ozone contrast. However, it differs from the preceding results (features of type I to III regimes) in that the east-west alternation of spatial contrast takes place around 2300PST to midnight, not in the evening.
Furthermore, this feature did not capture the aforementioned night-time ozone pollution around southwest LFV.

Figure 3.17b shows the dynamic spatial plots of $P_3E_3^T$ at selected hours. This feature captures a diurnal evolution of spatial ozone contrast between the area around Vancouver’s city core (northwest LFV) and the southwest of LFV. More specifically, the contrast in northwest LFV transitions from positive in the morning to negative in the evening, while the opposite is true for southwestern LFV. This pattern of dynamic contrast shows that $P_3E_3^T$ captures the advection of the ozone plume from northwest LFV to the southwest. Moreover, $P_3E_3^T$ captures spatial contrasts between morning and evening, unlike the 2nd-order feature, which contrasts afternoon peak hours with late night.

(a) Hourly plots of $P_2E_2^T$ (in units ppb).
(b) Hourly plots of $P_3E_3^T$ (in units ppb).
Figure 3.17: From the 1985 episode under type IV regime: dynamic spatial plots of $P_2E_2^T$ and $P_3E_3^T$ (in units ppb).

Figure 3.18 shows the dynamic spatial plot of the joint ozone feature $P_2E_2^T + P_3E_3^T$. During the afternoon, instead of an eastward advection, the ozone plume circulates from northwest LFV towards the southwest. This is evident from the movement of positive ozone contrast from the northwest to the southwest between 1300 PST and 2000 PST. The positive contrast then transitions to the eastern tip of LFV at the conclusion of a diurnal cycle. Therefore, $P_2E_2^T$ and $P_3E_3^T$ jointly capture the continuous space-time mechanism of LFV ozone driven by the unique flow pattern of the type IV regime. Here, the space-time structure is defined by an advection pattern of northwest→southwest→east.
This complex dynamic is expressed via east-west spatial contrast captured by $P_2E_2^T$ and north-south spatial contrast captured by $P_3E_3^T$.

Figure 3.18: From the 1985 episode under type IV regime: dynamic spatial plot of joint ozone features $P_2E_2^T + P_3E_3^T$ (units ppb).

As shown in the eigenspectrum analysis (Figure 3.8) of Section 3.3, for the two days of 1985 under type IV regime, the ozone features of orders j = 2, 3 form a degenerate couplet. The main implications are: (1) the extracted patterns of these features are possibly mixed into each other, making individual interpretations difficult, and/or (2) the orders of feature extraction are arbitrary. The results in this section do not support the first implication. As shown in Figures 3.10 and 3.17a, the $E_2$ from 1985 captured the same general contrast pattern as other episodes. This means that $E_2$ from 1985 still managed to clearly capture the defining east-west contrast of LFV, and that there is no significant evidence of EOF-estimate error or mixing of ozone features between $E_2$ and $E_3$. Therefore, the above mentioned feature degeneracy indicates that for an ozone field under the type IV regime, $P_2E_2^T$ and $P_3E_3^T$ capture well defined individual features that are equally important and should be analyzed jointly.

Summary discussion of Section 3.4.3

The general structure of $P_2E_2^T$ remained consistent across episodes dominated by wind regime types I, II and III, all of which are defined by westerly wind flow. The conclusion here is that $P_2E_2^T$ captures the space-time process of eastward ozone advection driven by a westerly wind system, and it is the most important advection pattern of LFV ozone. The ozone field under type IV regime has more complex structure.
The PCA results show that the ozone advection towards southwest LFV represents an equally important dynamic behaviour as the eastward advection.

The hodograph in Figure 3.4 (from Ainslie and Steyn (2007)) showed that the type IV regime is dominated by easterly flow for most of the day. However, the ozone features from the 1985 episode revealed a more complex pattern of ozone advection that cannot be clearly explained by this prior knowledge. The wind regime hodograph was made from averages of observed wind speed and direction at YVR. Hence, the preceding ozone feature analyses show that the type IV wind regime should be defined by a regional-scale flow pattern that cannot be properly captured by point-based information at YVR. Steyn et al. (2011) raised a similar point, that the wind data from YVR sometimes may not capture the complexity of LFV’s regional wind. However, Ainslie and Steyn (2007) did correctly categorize the 1985 episode as driven by its own unique wind regime, which was shown (in this section) to create dynamic ozone features that are different from the features under regime types I to III.

3.4.4 Higher-order Ozone Features

Figure 3.19 shows the dynamic spatial plots of $P_3E_3^T$ and $P_4E_4^T$ from the 2006 episode under the type I regime. As shown, the higher-order features $P_3E_3^T$ and $P_4E_4^T$ are “active” mainly between late evening and early morning. The term “active” indicates that an hourly $P_jE_j^T$ spatial field contains moderately non-zero values, i.e., the feature $P_jE_j^T$ recovers non-trivial ozone variation from the data. Therefore, the 3rd and 4th-order data features may be viewed as nocturnal ozone features, in that they recover ozone variations during hours with minimal ozone formation and destruction.
Ozone features of orders j ≥ 3 under regime types II and III, as well as type IV features of orders j ≥ 4, also capture nocturnal ozone features.

Figure 3.19: From the 2006 episode under type I regime: $P_3E_3^T$ and $P_4E_4^T$ (units ppb) from selected hours.

Overall, the higher-order ozone features do not generate easily interpretable spatial or temporal patterns, although they certainly capture well-defined spatial structures, especially $P_3E_3^T$. This lack of interpretability can be attributed mainly to the fact that wind regimes I to III do not deviate significantly from a general westerly flow system. This wind system creates a relatively simple ozone process where the ozone plume forms in the west and accumulates in the east due to eastward advection. As discussed earlier in this section, these features can be sufficiently captured by $P_1E_1^T$ and $P_2E_2^T$. Therefore, the higher-order features are left to recover location and time-specific ozone corrections or adjustments. For ozone fields dominated by the type IV regime, the leading 3 ozone features capture a slightly more complex ozone structure, hence ozone features of orders j ≥ 4 are left to recover non-systematic data variations.

In summary, ozone features of orders j ≥ 3 (or j ≥ 4 under the type IV regime) recover episode-specific and localized ozone variations in space and time. This assertion is also supported by the data reconstruction RMSEs shown in Section 3.3 (Table 3.2, Figures 3.5 and 3.6). These higher-order features can be understood as the representations of particular deviations from the general LFV ozone structure.

3.4.5 Sampling Stability of Ozone Features

As discussed by Richman (1986), the “correctness” of EOFs estimated from space-time data can be affected by the quality of spatial sampling of that data. This is an important point to consider when one uses the EOF from a subsample of space-time data to estimate the unknown EOF of the whole domain.
The $E_j$ and associated $P_jE_j^T$ shown earlier in this chapter should not suffer from sampling errors. This is because the decomposed data are high-resolution CMAQ output on 4 km-by-4 km grids, and any defining features of CMAQ modelled ozone should be sufficiently captured.

However, when CMAQ modelled ozone features are evaluated against the observed features, one needs to consider the subdomain sampling stability of LFV ozone. As discussed in Chapter 2, the spatial domain of the evaluated CMAQ outputs is matched to observations by spatially interpolating CMAQ outputs onto the locations of the $n_{obs}$ irregularly placed monitoring sites. This means that I will be implementing PCA on a sub-sample of the “full LFV” ozone fields.

Therefore, it is useful to analyze whether the sub-domain PCA gives the same ozone features we have seen in this chapter. One way of doing so is to (1) randomly sample $n_{obs}$ locations from the space-time LFV data, (2) decompose the sample data into ozone features, and (3) repeat these two steps to obtain a sample of ozone features. To assess the sampling stability of PCA, one may compare the patterns of the sample $P_j$, $j = 1, \ldots, n_{obs}$, to the $P_j$ decomposed from the full LFV data.

The preceding results have shown that the $P_jE_j^T$ capture space-time features that represent the interactions between $E_j$ and $P_j$. If the type of temporal feature captured by the subsample $P_j$ remains stable, then the associated sub-space $E_j$ may be interpreted similarly to the j-th spatial feature of the full field. For instance, if the sub-domain $P_2$ captures day-night contrast as before (Section 3.4.3), then $E_2$ still captures the path of diurnal ozone advection. As a result, the ozone feature interpretations formulated in the preceding sections are applicable to later CMAQ evaluations.

PCA sample plots shown in Figures 3.20, 3.21 and 3.22 are designed to analyze the above mentioned PCA sampling stability of LFV ozone.
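Before looking at these plots in detail, the resampling steps (1) to (3) above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch, not the code used to produce the thesis figures: the data matrix is synthetic, the function names `pca_uncentered` and `subsample_pcs` are hypothetical, and agreement between subsample and full-domain PCs is summarized here by a simple correlation rather than by visual comparison. The PCA is applied to the raw (uncentered) data with PCs computed as $P = OE$.

```python
import numpy as np

def pca_uncentered(O):
    """EOFs and PCs of a raw (uncentered) t x n data matrix O, with P = O E."""
    _, _, Vt = np.linalg.svd(O, full_matrices=False)
    E = Vt.T      # columns: EOFs (spatial weights)
    P = O @ E     # columns: PCs (hourly weighted sums of O)
    return P, E

def subsample_pcs(O, n_obs, n_rep, order, rng):
    """PC of a given order (1-based) from n_rep random subsets of n_obs locations."""
    P_full, _ = pca_uncentered(O)
    ref = P_full[:, order - 1]
    samples = []
    for _ in range(n_rep):
        # Step (1): sample n_obs locations (columns) without replacement.
        cols = rng.choice(O.shape[1], size=n_obs, replace=False)
        # Step (2): decompose the sub-domain data.
        P_sub, _ = pca_uncentered(O[:, cols])
        pc = P_sub[:, order - 1]
        # The sign of a PC is arbitrary; flip it to agree with the full-domain PC.
        if np.dot(pc, ref) < 0:
            pc = -pc
        samples.append(pc)   # Step (3): collect the repeated samples.
    return ref, np.array(samples)

# Synthetic stand-in for a t x n ozone matrix (an assumption, not CMAQ output):
# a diurnal mean cycle plus a day-night east-west contrast plus noise.
rng = np.random.default_rng(0)
t, n = 72, 200
hours = np.arange(t)
lon = np.linspace(0.0, 1.0, n)
diurnal = 30.0 + 20.0 * np.sin(2.0 * np.pi * (hours - 7) / 24.0)
contrast = 10.0 * np.cos(2.0 * np.pi * (hours - 13) / 24.0)
O = diurnal[:, None] * (1.0 + 0.5 * lon[None, :])
O = O + contrast[:, None] * (lon[None, :] - 0.5)
O = O + rng.normal(scale=1.0, size=(t, n))

ref, samples = subsample_pcs(O, n_obs=17, n_rep=50, order=1, rng=rng)
corr = np.array([np.corrcoef(ref, pc)[0, 1] for pc in samples])
```

In this sketch, each row of `samples` plays the role of one grey curve and `ref` the role of the red curve. Because each PC is a weighted sum over the locations included in the decomposition, the subsample PCs are smaller in magnitude than the full-domain PC even when their temporal patterns agree.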
In each $P_j$ sample plot, the red-coloured time series is the $P_j$ of the full LFV data during the middle 3 days of the 2006 CMAQ output. Here, the “full” LFV is the same triangular lower valley region analyzed thus far in this chapter. Each grey-coloured curve is the $P_j$ decomposed from one LFV subset with $n_{obs} = 17$ randomly sampled locations, and this subset covers the same time period as the full data. The $P_j$’s of order j = 1, 2, 3 are each sampled 50 times to create Figures 3.20, 3.21 and 3.22.

As discussed in Section 3.2, the PCs in this thesis are calculated as $P_{t\times n} = O_{t\times n}E_{n\times n}$: each $P_j$ contains hourly weighted sums of $O_{t\times n}$ where the elements of $E_j$ are the spatial weights. This explains the difference in scale between the $P_1$’s from the full LFV data and its subsamples. As shown, the patterns of the sampled $P_1$, $P_2$ and $P_3$ remained stable throughout the episode, the exceptions being $P_2$ and $P_3$ during a few nocturnal hours. The same PCA sampling analyses of different ozone episodes gave the same results. In the end, there is no reason to believe that PCA sampling error or feature instability is an overriding concern for the LFV ozone field.

[Plot: “Sampling of PC_1 (year 2006)”; horizontal axis: Time, with tick labels 0000PST, June 25th and 0000PST, June 26th; vertical axis: PC value (ppb)]
Figure 3.20: Sampling stability of PCA for $P_1$. The full dataset is the 2006 CMAQ output within the LFV region with elevation ≤ 150 meters (the “full LFV” analyzed thus far in this chapter). The red curve is $P_1$ of full LFV and each grey curve is $P_1$ decomposed from one sub-data with n = 17 randomly sampled LFV locations.

[Plot: “Sampling of PC_2 (year 2006)”; same axes]
Figure 3.21: Sampling stability of PCA for $P_2$. The full dataset is the 2006 CMAQ output within the LFV region with elevation ≤ 150 meters (the “full LFV” analyzed thus far in this chapter). The red curve is $P_2$ of full LFV and each grey curve is $P_2$ decomposed from one sub-data with n = 17 randomly sampled LFV locations.

[Plot: “Sampling of PC_3 (year 2006)”; same axes]
Figure 3.22: Sampling stability of PCA for $P_3$. The full dataset is the 2006 CMAQ output within the LFV region with elevation ≤ 150 meters (the “full LFV” analyzed thus far in this chapter). The red curve is $P_3$ of full LFV and each grey curve is $P_3$ decomposed from one sub-data with n = 17 randomly sampled LFV locations.

3.5 Chapter Conclusion

This chapter addressed PCA-related topics that are crucial for feature-based AQM evaluation and ozone modelling in the following chapters. In the chapter introduction, I raised the point of feature correspondence between AQM output and observations. This allows for direct comparison of ozone features and subsequent statistical modelling of feature differences.

First, it was determined that the ozone PCA in this thesis will be done on the original ozone data without the use of column-centering/standardization (Section 3.2). The worry here is that if the 1st-order features $E_1$ are incomparable, then the orthogonality of the EOFs will make all higher-order feature comparisons questionable. The analyses in this chapter have shown that $E_1$ reliably captures the pattern of the spatial field of temporal ozone means, and this mean field can be calculated directly from the data and plotted. This physical reference can then be used to assess the “correctness” of the extracted $E_1$ and determine whether the $E_1$’s from AQM and observations represent comparable features. This decision is further supported by the results where the PCA of anomaly data did not reveal additional features of LFV regional ozone (Appendix B.2).

Another means of ensuring defensible and informative feature comparison is to formulate an understanding of the ozone features being evaluated.
Here are the key features of the LFV ozone modelled by CMAQ:

• All episodes have the same general structure of space-time ozone means, and this feature is reliably captured by $P_1E_1^T$. This result reveals the highly consistent nature of the LFV ozone process during an episode. Over the period of 1985 to 2006, the changes in emission standards and differences in weather conditions did not significantly alter the space-time distribution of LFV ozone.

• For episodes driven by wind regime types I, II and III, the 2nd-order feature $P_2E_2^T$ consistently captured the same form of dynamic east-west ozone contrast. This dynamic contrast revealed the most dominant pattern of ozone advection across LFV: the eastward horizontal transport of the ozone plume, which is driven by a westerly wind system. Wind regime types I to III are defined mostly by a westerly wind system with little deviation in direction.

The ozone feature under the type IV wind regime is different in that $P_2E_2^T$ and $P_3E_3^T$ jointly capture a more complex advection pattern, where the ozone plume moves from the northwest to the southwest during the afternoon, and then advects eastward by the end of a diurnal cycle. This advection feature is only present on days dominated by the type IV wind regime.

• Higher order ozone features are less interpretable, and they primarily recover localized space-time ozone variations rather than systematic space-time features. Here, “higher order” means j ≥ 3 for episodes under regime types I, II and III, and j ≥ 4 for episodes under regime type IV.

These results also fill an existing knowledge gap regarding the regional-scale features of LFV ozone.

Furthermore, existing works on ozone PCA (Section 1.4) rely on visual analyses of $E_j$ to interpret possible modes of variation. This is useful when the pollution field covers a large spatial domain where one can identify the underlying climate systems, e.g., the North Atlantic Oscillation.
However, such a large scale process is not readily identifiable when analyzing a smaller, but complex regional pollution field such as the LFV ozone. Instead of analyzing $E_j$ on its own, it was found that the analysis of the space-time interaction feature $P_jE_j^T$ provides a more informative, big-picture understanding of the underlying dynamic process. In summary, a $P_jE_j^T$ at j ≥ 2 is a dynamic ozone contrast that captures a certain space-time advection process. Three pieces of information can be interpreted from $P_jE_j^T$ plots: (1) the direction of ozone advection, (2) the time period (within an episode) during which the advection took place, and (3) the order j, i.e., the importance of this advection to the overall ozone process.

The study in this chapter also showed that one may analyze a space-time ozone field using a few leading ozone features; high-dimensional data can be analyzed through simpler data components $E_j$ and $P_j$. As discussed, these few leading features capture the systematic structure and behaviour of LFV ozone. Therefore, through feature-based AQM evaluation, one may draw up a “big picture” of how AQM modelled ozone differs from the real-world physical field. As mentioned, the actual evaluation will be implemented in Chapters 5 and 6. In the next chapter, I will propose a framework for modelling individual ozone features, as well as the complete space-time ozone process defined by these features.

Chapter 4

A Statistical Model of Space-Time Ozone Features

In the previous chapter, I presented the PCA methods used to extract ozone features $E_j$’s and $P_j$’s and discussed topics related to ozone feature analysis. In this chapter, I propose methods of modelling individual ozone features as random processes driven by variables capturing atmospheric conditions, ozone precursor emission rates and antecedent concentrations. I will also analyze a method of modelling a complete space-time ozone process through its features.
The statistical ozone feature models are intended to apply to both AQM ozone and physical observations. The purpose of this chapter is to estimate the details of ozone feature models and assess their modelling capability through goodness-of-fit analyses and exercises in space-time ozone forecasting.

I am placing significant effort on identifying the Gaussian Process covariates that can be used to model the process behind each ozone feature. This model development exercise is driven by my prior experience that spatial and temporal variables such as longitude, latitude and “hour of the day” are not sufficient for modelling ozone fields. The complexity of the AQM modelling system (Chapters 1 and 2) also means that it does not have the clearly definable input structure typical of a computer model. Hence, careful analysis is needed to formulate a set of AQM input conditions useful for statistical modelling.

All modelling analyses in this chapter will be done using CMAQ-WRF-SMOKE outputs (Section 2.1, Table 2.1). Due to the richness of information the computer models can provide, their outputs are used instead of observation data to estimate the ozone feature models. The computer models produce data for a comprehensive list of variables representing regional meteorology and particulate pollution (more details in Section 4.3), whereas observations provide data on a few basic weather and air pollution measurements. Moreover, computer model outputs are produced on a dense and regular spatial grid covering a large geographic domain and they, unlike the observations, do not suffer from missing or erroneous data.

The statistical ozone feature model developed in this chapter will then be applied in Chapters 5 and 6 to implement two distinct types of feature-based AQM evaluation.
As discussed in the introductory chapter, for one evaluation, I will model the difference in ozone features between AQM and observations as functions of AQM input conditions. In another evaluation, I will estimate statistical ozone feature models for both AQM ozone and physical observations, then compare the statistical properties of the two ozone processes under the same conditions.

Besides serving as a means to my main objective of AQM evaluation, the methodologies developed in this chapter are a novel and efficient approach for modelling space-time air pollution processes (not limited to ozone). The bulk of existing statistical air pollution models deal only with either a spatial process or a time series (temporal process). In a spatial model, the random process is the spatial air pollution field whose values are usually a temporal summary, e.g., summer time means (Fuentes and Raftery, 2005; Liu, 2007). In a temporal model, the random process is expressed by a time series of air pollution whose values are either based on a point location, or a spatial average (Gao et al., 1996). Spatial-temporal pollution modelling has received more attention in recent years, and much of it is based on Gaussian Processes and Kriging (Berrocal et al., 2009; Conti and O’Hagan, 2010; Zidek et al., 2012). Usually, Gaussian Process based models are designed to handle data in their original form, which I call the “raw data”.

Before proceeding, a reminder on the terminology. “Spatial ozone means” describe the ozone averaged across space, and “temporal means” describe averages across time. Hence, a “spatial field of temporal ozone means”, or “mean field” and “field of means” for short, describes a spatial field of mean values obtained by averaging ozone across time.
Similarly, “time series of spatial means” or “hourly LFV mean ozone” describe a time series obtained by averaging ozone across space for each hour.

Similarly, “spatial ozone standard deviation” and “temporal ozone standard deviation” are summarized, respectively, across space and time. The same way of describing spatial and temporal summaries is also used for other variables like temperature, NOx and VOC emission rates, etc.

Section 4.1 presents the general model formulations for ozone features $E_j$’s and $P_j$’s. Section 4.2 summarizes the kriging-based methods for predicting unobserved spatial/temporal processes and the approach to estimating GP model parameters. Section 4.3 introduces the covariates used to model spatial-temporal ozone features and methods of covariate selection. Section 4.4 outlines the framework of implementing feature-based ozone modelling using methods described in Sections 4.1 to 4.3. Data analyses are in Sections 4.4-4.7, where the ozone feature models will be estimated and their modelling capability will be assessed. The data used in this chapter are 2006 CMAQ output over the entire “rectangular” LFV domain (more details in Section 4.4).

4.1 Ozone Features and Gaussian Process Models

Be it a CMAQ output or a physical measurement, ozone values are influenced by a number of background meteorological processes, chemical emissions and reactions. As discussed in Section 1.4, without physically measuring or making an estimate using CMAQ, the ozone value given a set of background conditions is unknown, hence it is reasonable to treat this unknown ozone value as a random variable. Since the EOFs and PCs are extracted from data treated as random ozone values, both the EOFs and the PCs are, by the same logic, random vectors (multivariate data).

Denote the extracted EOFs as $E_j$, $j = 1, \ldots, p$, where p is the number of EOFs used for modelling. Similarly, $P_j$ ($j = 1, \ldots, p$) denotes the PC vectors.
As before, I define the dimension of space-time ozone data as t × n, where t and n are respectively the number of time points and locations. As such, each EOF vector is multivariate random spatial data of size n, and each PC vector is multivariate random temporal data of length t.

I showed in the last chapter that the $E_j$’s and $P_j$’s capture strong spatial and temporal structures. This implies the presence of spatial and temporal correlations, i.e., the spatial and temporal features are not dominated by white noise. Hence the ozone feature models should be derived from multivariate distributions that explicitly account for internal correlations.

In this study, Gaussian Process models are used. As its name suggests, this model is built on the idea that a vector of random variables can be modelled by a Multivariate Normal (MVN) distribution (Sacks et al., 1989; Kennedy and O’Hagan, 2001). In the introductory chapter, I discussed its well-documented proficiency in modelling both computer model outputs and physical observations, and the initial reasoning behind my application of GP models for this research. The theoretical and practical appropriateness of GP models will be examined during data analyses.

4.1.1 Gaussian Process Model for an EOF

Let the n × k matrix $X_{E_j}$ denote the model covariate set for the n × 1 response vector $E_j$, where k is the number of covariates. These covariates will be selected in Section 4.5.

I propose a GP-based model for $E_j$:

$$E_j \mid X_{E_j} = F(X_{E_j})\,\beta_{E_j} + Z_{E_j}. \tag{4.1}$$

The regression function $F(X_{E_j})\beta_{E_j}$ is the mean of the GP model, where $\beta_{E_j}$ is the regression coefficient vector. I define the regression design matrix as $F(X_{E_j}) = \{f(x_{j,1}), \ldots, f(x_{j,n})\}^T$, where $f(x_{j,1}), \ldots, f(x_{j,n})$ are column vectors of functions of the covariates, and $x_{j,1}^T, \ldots, x_{j,n}^T$ are the rows of $X_{E_j}$. $Z_{E_j}$ is a zero-mean Gaussian Process:

$$Z_{E_j} \sim \mathrm{MVN}(0, \sigma^2_{E_j} R_{E_j}), \quad \text{where}$$

$$R_{E_j} = \begin{pmatrix}
1 & R(E_{j,1}, E_{j,2}) & R(E_{j,1}, E_{j,3}) & \cdots & R(E_{j,1}, E_{j,n}) \\
R(E_{j,2}, E_{j,1}) & 1 & R(E_{j,2}, E_{j,3}) & \cdots & R(E_{j,2}, E_{j,n}) \\
R(E_{j,3}, E_{j,1}) & R(E_{j,3}, E_{j,2}) & 1 & \cdots & R(E_{j,3}, E_{j,n}) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
R(E_{j,n}, E_{j,1}) & R(E_{j,n}, E_{j,2}) & R(E_{j,n}, E_{j,3}) & \cdots & 1
\end{pmatrix}.$$

The correlation between two elements $E_{j,i}$ and $E_{j,i'}$ of $E_j$ is quantified by a function of covariate distances: $R(E_{j,i}, E_{j,i'}) = f_R(x_{j,i} - x_{j,i'})$, where $x_{j,i}$ and $x_{j,i'}$ are the respective covariate sets for $E_{j,i}$ and $E_{j,i'}$. Multiplying $R_{E_j}$ by the constant variance $\sigma^2_{E_j}$, one obtains the model covariance.

The mean represents the fixed component in (4.1), and models the simple association between $E_j$ and $X_{E_j}$ as a regression function. The random component $Z_{E_j}$ models the stochastic behaviour of $E_j$ based on $E_j$’s correlation structure (a function of $X_{E_j}$). In my experience, it is often reasonable to simply specify the fixed regression term to be a constant scalar mean (though I do introduce more general regression terms during model implementation later in the chapter). The true focus of the GP model lies with the modelling of the stochastic process $Z_{E_j}$.

The distribution of $E_j$ is given by:

$$E_j \mid X_{E_j} = \begin{pmatrix} E_{j,1} \mid x_{j,1} \\ \vdots \\ E_{j,n} \mid x_{j,n} \end{pmatrix} \sim \mathrm{MVN}\left( \begin{pmatrix} f^T(x_{j,1})\beta_{E_j} \\ \vdots \\ f^T(x_{j,n})\beta_{E_j} \end{pmatrix},\; \sigma^2_{E_j} R_{E_j} \right).$$

Here $E_{j,1}, \ldots, E_{j,n}$ are the n elements of $E_j$, while $f^T(x_{j,1})\beta_{E_j}, \ldots, f^T(x_{j,n})\beta_{E_j}$ and $x_{j,1}, \ldots, x_{j,n}$ are the means and covariate sets of each random variable in $E_j$.

4.1.2 Gaussian Process Model for a PC

If $X_{P_j}$ is the covariate matrix of $P_j$, a Principal Component vector is modelled similarly by

$$P_j \mid X_{P_j} = F(X_{P_j})\,\beta_{P_j} + Z_{P_j}, \tag{4.2}$$

where $Z_{P_j}$ is a zero-mean Gaussian Process:

$$Z_{P_j} \sim \mathrm{MVN}(0, \sigma^2_{P_j} R_{P_j}), \quad \text{where}$$

$$R_{P_j} = \begin{pmatrix}
1 & R(P_{j,1}, P_{j,2}) & R(P_{j,1}, P_{j,3}) & \cdots & R(P_{j,1}, P_{j,t}) \\
R(P_{j,2}, P_{j,1}) & 1 & R(P_{j,2}, P_{j,3}) & \cdots & R(P_{j,2}, P_{j,t}) \\
R(P_{j,3}, P_{j,1}) & R(P_{j,3}, P_{j,2}) & 1 & \cdots & R(P_{j,3}, P_{j,t}) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
R(P_{j,t}, P_{j,1}) & R(P_{j,t}, P_{j,2}) & R(P_{j,t}, P_{j,3}) & \cdots & 1
\end{pmatrix},$$

which has dimension t × t since $P_j$ has length t.

4.1.3 Modelling a Complete Space-time Ozone Process

To model the ozone process in its original space-time field and value scale (ppb), I apply the constructive relationship between space-time ozone and its defining ozone features:

$$O_{t\times n} \mid X_E, X_P \approx \sum_{j=1}^{p} (P_j \mid X_{P_j})(E_j \mid X_{E_j})^T, \qquad p \ll \min(t, n). \tag{4.3}$$

In essence, I propose to model the spatial and temporal ozone features as GPs, then the complete space-time ozone process as the joint sum of outer products of its defining features. This is a departure from the traditional statistical practice of directly modelling given data. I name this approach the feature-based ozone model to emphasize this central idea.

4.1.4 Use of GPs for Modelling Ozone Features

I showed in Chapter 3 that the structures of individual spatial and temporal ozone features are highly non-linear, implying that the usual linear regression models are not sufficient for modelling these features. Bloomfield et al. (1996) and Thompson et al. (2001) provided good examples of the deficiencies of linear regression when modelling non-linear air pollution processes.

Given the GP assumption of individual ozone features, the resultant space-time ozone process has a complicated distribution. It is common to transform space-time data, e.g., a square-root transformation, before applying the GP model (Le and Zidek, 2006). So it is reasonable to place the assumption of non-normality on a complex process such as space-time ozone. Examples also exist where the GP assumption is imposed on the original data, e.g., Gao et al. (1996) and Fuentes and Raftery (2005) (in the context of SO2 modelling). However, the research in this thesis is focused on the analysis and statistical modelling of ozone features rather than the original ozone field.
Given this particular focus, it is sensible to place any statistical assumption on the ozone features, i.e., the random process being analyzed directly.

Furthermore, by modelling each $E_j$ (and each $P_j$) as a GP, I am treating a particular ozone feature of a particular episode as one realization of a GP. The purpose of the GP model is to predict the outcome of this particular realization for unobserved values of the covariates. This analysis will be done in Chapters 5 and 6 to evaluate CMAQ.

In practice, the use of a GP is mainly for convenience. It is also a decision informed by both my past experience and existing literature (Section 1.4), where GP-based models were shown to be proficient in emulating complex non-linear processes. Analysis later in this chapter will further validate the appropriateness, and thereby the usefulness, of GPs for modelling the ozone features.

Lastly, I should note that the general concept of “modelling the data components as GPs” is a recent one. Higdon et al. (2008) built a statistical emulator of a computerized “implosion simulator”. This is a high-dimensional output computer model in that it produces multiple outputs for every model run. The authors applied SVD to decompose a data matrix of model outputs into loading components (in PCA terminology) that are regarded as invariant basis vectors. The model output data is a matrix of multiple model runs, where each row contains the multi-dimensional output from one model setting. The high-dimensional model output is then statistically modelled as a scaled sum of said invariant basis vectors, and each “scale” is modelled as a GP. Kleiber et al. (2014) implemented this approach to emulate the output of a “geomagnetic storm simulator”. The “scales” in Higdon et al. (2008) are expanded into vectors of Principal Components and modelled as GPs, while the loading components are still regarded as invariant.
These works are both motivated by the objective of “computer model calibration”, while data decomposition is applied to reduce the dimensionality of the model output in order to increase the efficiency of their calibration algorithms, which are both based on Bayesian methodologies.

One might view the feature-based ozone modelling framework as an elaboration of the above mentioned modelling ideas, but driven by a different focus and intended application. My focus is to develop individual ozone feature models as a means to feature-based CMAQ evaluation, whereas the common focus of Higdon et al. (2008) and Kleiber et al. (2014) is to reduce the computational load during computer model calibration; attention was not placed on detailed modelling of data components. The computer models in Higdon et al. (2008) and Kleiber et al. (2014) are simple enough to have a definitive set of model covariates, which is not the case in my air pollution modelling. Furthermore, there is no invariant basis in my modelling framework; all data features $E_j$’s and $P_j$’s are modelled, whereas the aforementioned references both regarded their equivalent of the E’s as vectors of an invariant basis.

4.2 Background on Gaussian Process Models

In this section, I first present a non-parametric (without assumptions about a specific probability distribution) prediction method and discuss its connection with Gaussian Processes. I will then present the general method of estimating Gaussian Process models. For ease of discussion, I use the generic notation of Y as a random response and X as process covariates.

4.2.1 Best Linear Unbiased Predictor (BLUP)

Let $y = (y_1, \ldots, y_n)^T$ be realizations of a random vector $Y_{n\times 1}$. Each $y_i$, $i = 1, \ldots, n$, is realized given a set of k covariates $x_i = (x_{i1}, \ldots, x_{ik})^T$. In more general notation, the GP model for $Y(x)$ has the formulation

$$Y(x) = F\beta + Z(x). \tag{4.4}$$

$F = (f(x_1), \ldots, f(x_n))^T$ stacks the n covariate function vectors as rows, where the i-th covariate function is $f(x_i) = (f_1(x_i), \ldots, f_k(x_i))^T$, making F an n × k design matrix. $\beta = (\beta_1, \ldots, \beta_k)^T$ is a regression coefficient vector. $Z(x)$ is a zero mean random process with covariance matrix $\sigma^2 R$. The elements of R quantify the correlation between random variables in $Z(x)$, and subsequently $Y(x)$ ($Z(x)$ is the random component of $Y(x)$). Also as mentioned, the correlation between any pair $Y_i(x_i)$ and $Y_j(x_j)$ is a function of some distance measure between the relevant covariate pair $(x_i, x_j)$.

One way to predict the response at an “unobserved” covariate setting $x_0$ is called universal kriging (Matheron, 1963; Cressie, 1990), named after the South African mining engineer Danie G. Krige. Sacks et al. (1989) adapted the same mathematics to problems in computer experiments with higher-dimensional covariates. Given n “observed covariate inputs” $X_S = (x_1, \ldots, x_n)^T$, we have corresponding output/data $y_S = (y_1(x_1), \ldots, y_n(x_n))^T$, and the universal kriging predictor has the form $\hat{y}(x_0) = w^T(x_0)\,y_S$ with $w(x_0)$ being an n × 1 vector. It is essentially a weighted average of the data. From the frequentist perspective, $y_S$ and $y(x_0)$ are realizations of equation (4.4). The Best Linear Unbiased Predictor (BLUP) is defined by the $w(x_0)$ that minimizes the Mean Squared Error (MSE)

$$\mathrm{MSE}[\hat{y}(x_0)] = E\left[w^T(x_0)\,Y_S - Y(x_0)\right]^2 \tag{4.5}$$

subject to the unbiasedness constraint $E[w^T(x_0)Y_S] = E[Y(x_0)]$, i.e., $w^T(x_0)F\beta = \beta^T f(x_0)$ for all $\beta$.

When one takes the usual Bayesian approach, $\hat{y}(x_0)$ is the posterior mean $E[Y(x_0) \mid y_S]$. Currin et al. (1991) suggested modelling $Z(x)$ as a Gaussian Bayesian prior on the unknown function, and Handcock and Stein (1993) applied Bayesian methodologies in estimating model parameters.

In addition to R, a matrix of correlations between Y’s at “observed” settings x, we have the correlation between Y at an “unobserved” $x_0$ and the Y’s at x (Sacks et al., 1989).
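This correlation is represented by the $n \times 1$ vector $\mathbf{r}(\mathbf{x}_0) = [R(\mathbf{x}_1, \mathbf{x}_0), \ldots, R(\mathbf{x}_n, \mathbf{x}_0)]^T$.

Without accounting for the aforementioned unbiasedness constraint, the MSE in equation (4.5) can be expanded to arrive at the expression

$$\mathrm{MSE}(\hat{y}(\mathbf{x}_0)) = \left[(\mathbf{w}^T(\mathbf{x}_0)\mathbf{F} - \mathbf{f}^T(\mathbf{x}_0))\boldsymbol{\beta}\right]^2 + \sigma^2\left[1 + \mathbf{w}^T(\mathbf{x}_0)\mathbf{R}\mathbf{w}(\mathbf{x}_0) - 2\mathbf{w}^T(\mathbf{x}_0)\mathbf{r}(\mathbf{x}_0)\right], \tag{4.6}$$

where the first term in (4.6) is the squared prediction bias. Incorporating the unbiasedness constraint $(\mathbf{w}^T(\mathbf{x}_0)\mathbf{F} - \mathbf{f}^T(\mathbf{x}_0))\boldsymbol{\beta} = 0$, one is left to minimize the second term in (4.6). This term is generally referred to as the “unbiased MSE equation”. To obtain the $\mathbf{w}(\mathbf{x}_0)$ that minimizes the term $1 + \mathbf{w}^T(\mathbf{x}_0)\mathbf{R}\mathbf{w}(\mathbf{x}_0) - 2\mathbf{w}^T(\mathbf{x}_0)\mathbf{r}(\mathbf{x}_0)$ under said constraint, one would add the Lagrangian term $\boldsymbol{\lambda}^T(\mathbf{x}_0)[\mathbf{F}^T\mathbf{w}(\mathbf{x}_0) - \mathbf{f}(\mathbf{x}_0)]$, with $\boldsymbol{\lambda}(\mathbf{x}_0)$ a $k \times 1$ vector of multipliers, to the unbiased MSE equation, differentiate, and (after absorbing a constant factor into $\boldsymbol{\lambda}$) find the $\mathbf{w}(\mathbf{x}_0)$ defining the BLUP from the equation

$$\begin{pmatrix} \mathbf{0} & \mathbf{F}^T \\ \mathbf{F} & \mathbf{R} \end{pmatrix} \begin{pmatrix} \boldsymbol{\lambda}(\mathbf{x}_0) \\ \mathbf{w}(\mathbf{x}_0) \end{pmatrix} = \begin{pmatrix} \mathbf{f}(\mathbf{x}_0) \\ \mathbf{r}(\mathbf{x}_0) \end{pmatrix}.$$

Making use of the following result regarding the inversion of a partitioned matrix:

$$\begin{pmatrix} \mathbf{0} & \mathbf{F}^T \\ \mathbf{F} & \mathbf{R} \end{pmatrix}^{-1} = \begin{pmatrix} -\mathbf{K}^{-1} & \mathbf{K}^{-1}\mathbf{F}^T\mathbf{R}^{-1} \\ \mathbf{R}^{-1}\mathbf{F}\mathbf{K}^{-1} & \mathbf{R}^{-1} - \mathbf{R}^{-1}\mathbf{F}\mathbf{K}^{-1}\mathbf{F}^T\mathbf{R}^{-1} \end{pmatrix},$$

where $\mathbf{K} = \mathbf{F}^T\mathbf{R}^{-1}\mathbf{F}$, one may solve for $\mathbf{w}(\mathbf{x}_0)$. This results in the predictor

$$\hat{y}(\mathbf{x}_0) = \mathbf{f}^T(\mathbf{x}_0)\hat{\boldsymbol{\beta}} + \mathbf{r}^T(\mathbf{x}_0)\mathbf{R}^{-1}(\mathbf{y}_S - \mathbf{F}\hat{\boldsymbol{\beta}}), \tag{4.7}$$

with $\hat{\boldsymbol{\beta}}$ being the generalized least squares estimate.

Substituting the optimized $\mathbf{w}(\mathbf{x}_0)$ into the unbiased MSE equation, one obtains the prediction standard error

$$\mathrm{SE}(\hat{y}(\mathbf{x}_0)) = \sigma\left[1 - \mathbf{r}^T(\mathbf{x}_0)\mathbf{R}^{-1}\mathbf{r}(\mathbf{x}_0) + (\mathbf{f}(\mathbf{x}_0) - \mathbf{F}^T\mathbf{R}^{-1}\mathbf{r}(\mathbf{x}_0))^T\mathbf{K}^{-1}(\mathbf{f}(\mathbf{x}_0) - \mathbf{F}^T\mathbf{R}^{-1}\mathbf{r}(\mathbf{x}_0))\right]^{1/2}. \tag{4.8}$$

The preceding derivations are based on a single-point prediction at $\mathbf{x}_0$. For a multivariate prediction (simultaneous predictions of multiple responses), the prediction equation for each individual response is still (4.7). However, due to the correlation between unobserved responses, we now have a prediction covariance matrix in place of a single prediction standard error (or variance) (Bastos and O’Hagan, 2009). Let $\mathbf{Y}(\mathbf{X}_0)$ be a vector of $m$ “unobserved” responses, where $\mathbf{X}_0$ is the $m \times k$ matrix of the covariates. Further denote $\mathbf{F}_0$ as the corresponding $m \times k$ design matrix of $\mathbf{X}_0$.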
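The prediction covariance matrix is

$$\boldsymbol{\Sigma}(\hat{\mathbf{y}}(\mathbf{X}_0)) = \sigma^2\left[\mathbf{R}_m - \mathbf{r}_m^T(\mathbf{X}_0)\mathbf{R}^{-1}\mathbf{r}_m(\mathbf{X}_0) + (\mathbf{F}_0^T - \mathbf{F}^T\mathbf{R}^{-1}\mathbf{r}_m(\mathbf{X}_0))^T\mathbf{K}^{-1}(\mathbf{F}_0^T - \mathbf{F}^T\mathbf{R}^{-1}\mathbf{r}_m(\mathbf{X}_0))\right], \tag{4.9}$$

where $\boldsymbol{\Sigma}(\hat{\mathbf{y}}(\mathbf{X}_0))$ has dimension $m \times m$ and $\mathbf{R}_m$ is the $m \times m$ correlation matrix of $\mathbf{Y}(\mathbf{X}_0)$. The correlation matrix $\mathbf{r}_m(\mathbf{X}_0)$ has dimension $n \times m$; each of its columns is an $n$-length vector of correlations between an element of $\mathbf{Y}(\mathbf{X}_0)$ and $\mathbf{Y}_S$. ($\mathbf{F}_0$ enters transposed so that both terms in each bracket are $k \times m$.) As one can see, multivariate prediction does not change the components involving training data; it simply increases the dimensions of terms that are functions of $\mathbf{X}_0$. Moreover, the diagonal elements of $\mathbf{R}_m$ are 1, hence the individual prediction standard errors are still calculated as in (4.8).

4.2.2 Connecting the BLUP to a Gaussian Distribution

Suppose the random response $\mathbf{Y}(\mathbf{x})$ follows a multivariate normal (Gaussian) distribution, and let $\mathbf{Y}_J = (\mathbf{Y}_S^T, Y(\mathbf{x}_0))^T$ denote the $(n+1) \times 1$ vector of the training responses and the response to be predicted. Then $\mathbf{Y}_J$ is also multivariate normal (MVN) with density

$$f_{\mathbf{Y}_J}(\mathbf{y}_J \mid \boldsymbol{\beta}, \sigma, \mathbf{R}_{n+1}) = \frac{1}{(2\pi\sigma^2)^{(n+1)/2}\det(\mathbf{R}_{n+1})^{1/2}} \exp\left(-\frac{(\mathbf{y}_J - \mathbf{F}_J\boldsymbol{\beta})^T\mathbf{R}_{n+1}^{-1}(\mathbf{y}_J - \mathbf{F}_J\boldsymbol{\beta})}{2\sigma^2}\right),$$

where $\mathbf{F}_J = (\mathbf{F}^T, \mathbf{f}(\mathbf{x}_0))^T$ and

$$\mathbf{R}_{n+1} = \begin{pmatrix} \mathbf{R} & \mathbf{r}(\mathbf{x}_0) \\ \mathbf{r}^T(\mathbf{x}_0) & 1 \end{pmatrix}.$$

The multivariate distribution of $\mathbf{Y}_S$ can be written out in the same fashion. The conditional distribution of $Y(\mathbf{x}_0)$ given $\mathbf{Y}_S$ is

$$f_{Y(\mathbf{x}_0) \mid \mathbf{Y}_S}(y(\mathbf{x}_0) \mid \mathbf{y}_S) = \frac{f_{\mathbf{Y}_J}(\mathbf{y}_J \mid \boldsymbol{\beta}, \sigma, \mathbf{R}_{n+1})}{f_{\mathbf{Y}_S}(\mathbf{y}_S \mid \boldsymbol{\beta}, \sigma, \mathbf{R})}. \tag{4.10}$$

It can be shown, through rather tedious matrix algebra, that the conditional distribution (4.10) is normal (Gaussian) with mean and variance

$$E(Y(\mathbf{x}_0) \mid \mathbf{Y}_S) = \mathbf{f}^T(\mathbf{x}_0)\boldsymbol{\beta} + \mathbf{r}^T(\mathbf{x}_0)\mathbf{R}^{-1}(\mathbf{y}_S - \mathbf{F}\boldsymbol{\beta}),$$

$$\mathrm{Var}(Y(\mathbf{x}_0) \mid \mathbf{Y}_S) = \sigma^2\left[1 - \mathbf{r}^T(\mathbf{x}_0)\mathbf{R}^{-1}\mathbf{r}(\mathbf{x}_0)\right].$$

One may notice that the conditional mean has exactly the same expression as the BLUP (4.7), while the conditional variance is the squared BLUP standard error (4.8) without the third term inside the bracket. This is because the above expressions for $Y(\mathbf{x}_0) \mid \mathbf{Y}_S$ are derived under the assumption that $\boldsymbol{\beta}$ is known, whereas the BLUP is derived using the Generalized Least Squares (GLS) estimator $\hat{\boldsymbol{\beta}}$.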
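The relationship between (4.7), (4.8) and the conditional variance can be checked numerically. The following is a minimal sketch, not the implementation used in this thesis: the function name `blup`, the Gaussian-shaped correlation and the toy data are all assumptions made for illustration. It returns the predictor together with the bracketed factor of the conditional variance and the full bracket of (4.8), so the two can be compared.

```python
import numpy as np


def blup(F, R, y, f0, r0):
    """Universal-kriging (BLUP) prediction at a single new point x0.

    F  : (n, k) design matrix        R  : (n, n) correlation matrix
    y  : (n,)   training responses   f0 : (k,)   regression functions at x0
    r0 : (n,)   correlations between Y(x0) and the training responses
    Returns the predictor (4.7), the known-beta variance factor, and the
    full bracket of (4.8); both factors omit sigma^2.
    """
    Ri_F = np.linalg.solve(R, F)                  # R^{-1} F
    Ri_y = np.linalg.solve(R, y)                  # R^{-1} y
    Ri_r = np.linalg.solve(R, r0)                 # R^{-1} r(x0)
    K = F.T @ Ri_F                                # K = F^T R^{-1} F
    beta_hat = np.linalg.solve(K, F.T @ Ri_y)     # GLS estimate of beta
    y_hat = f0 @ beta_hat + r0 @ np.linalg.solve(R, y - F @ beta_hat)
    known_beta = 1.0 - r0 @ Ri_r                  # conditional-variance factor
    u = f0 - F.T @ Ri_r
    full = known_beta + u @ np.linalg.solve(K, u) # bracket in (4.8)
    return y_hat, known_beta, full


# Toy 1-D example; every number here is made up for illustration.
x = np.array([0.0, 0.3, 0.7, 1.0])
y = np.sin(2 * np.pi * x)
corr = lambda a, b: np.exp(-5.0 * (a - b) ** 2)   # assumed correlation function
R = corr(x[:, None], x[None, :])
F = np.ones((x.size, 1))                          # constant-mean model, k = 1

x0 = 0.5
y_hat, known_beta, full = blup(F, R, y, np.ones(1), corr(x, x0))
```

In this sketch `full` can never be smaller than `known_beta`, because the extra quadratic form in $\mathbf{K}^{-1}$ is non-negative. Predicting at one of the training points returns the training value with a zero bracket, which reflects the interpolating nature of the predictor.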
The third bracketed term in (4.8) simply represents the extra error resulting from not knowing $\boldsymbol{\beta}$.

4.2.3 Estimating GP Parameters: Fitting the GP Models

A decision needs to be made regarding the form of the correlation function $R(\mathbf{x}, \mathbf{x}')$. The requirement here is that the resultant covariance matrix $\sigma^2 R(\mathbf{x}, \mathbf{x}')$ is positive-definite (Nychka et al., 2002; Fuentes and Raftery, 2005). One possible specification of $R(\mathbf{x}, \mathbf{x}')$ is the power-exponential correlation function, which has the form

$$R(\mathbf{x}, \mathbf{x}') = \exp\left(-\sum_{j=1}^{k}\theta_j |x_j - x'_j|^{\alpha_j}\right), \quad \theta_j > 0 \text{ and } 1 \le \alpha_j \le 2, \tag{4.11}$$

where $x_j$ and $x'_j$ are the $j$-th of the $k$ covariates in $\mathbf{x} = (x_1, \ldots, x_k)^T$ and $\mathbf{x}' = (x'_1, \ldots, x'_k)^T$ respectively. Such a correlation function is utilized repeatedly in the literature: Sacks et al. (1989), Gao et al. (1996) and Kennedy and O’Hagan (2001), for example.

This formulation is attractive in its ease of interpretation. A small normalized distance between $\mathbf{x}$ and $\mathbf{x}'$ gives a high correlation that tends to 1 as the distance goes to 0; conversely, the correlation approaches 0 when the distance becomes large. A small value of $\theta_j$ implies that $Y$ as a function of $x_j$ is relatively insensitive to fluctuations in $x_j$. In other words, for any fixed distance $|x_j - x'_j|$, the numerical contribution of covariate $x_j$ to (4.11) tends to 0 when $\theta_j$ is small, and a zero value inside the exponential function gives a perfect correlation of 1 between $Y(\mathbf{x})$ and $Y(\mathbf{x}')$ in dimension $j$. Thus $\theta_j$ can be viewed as a measure of the correlation strength or “activity” of the associated covariate $x_j$. The parameter $\alpha_j$ controls the smoothness of the correlation function, where $\alpha_j = 2$ gives a smooth (infinitely differentiable) surface. Furthermore, such a correlation function makes the GP model scale-invariant to the covariate inputs. Notice in (4.11) that when a covariate $x_j$ is rescaled to $\tilde{s}x_j$ ($\tilde{s}$ being the scaling factor), the corresponding correlation parameter is simply rescaled to $\theta_j / \tilde{s}^{\alpha_j}$, which keeps $R(\mathbf{x}, \mathbf{x}')$ unchanged.
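The interpretation of $\theta_j$ and the scale-invariance property can both be seen in a few lines of code. This is a small sketch of (4.11) only; the function name `power_exp_corr` and all parameter values are illustrative assumptions, not estimates from this thesis.

```python
import math


def power_exp_corr(x, x_prime, theta, alpha):
    """Power-exponential correlation (4.11) between two covariate vectors."""
    total = 0.0
    for xj, xpj, th, al in zip(x, x_prime, theta, alpha):
        total += th * abs(xj - xpj) ** al
    return math.exp(-total)


# Illustrative values only, with k = 2 covariates.
theta = [2.0, 0.01]     # covariate 1 is "active", covariate 2 is nearly inert
alpha = [2.0, 1.5]
a, b = [0.2, 10.0], [0.5, 14.0]

r_ab = power_exp_corr(a, b, theta, alpha)

# Rescale covariate 1 by s and theta_1 by 1 / s**alpha_1:
# the correlation should be unchanged.
s = 100.0
theta_scaled = [theta[0] / s ** alpha[0], theta[1]]
r_scaled = power_exp_corr([s * a[0], a[1]], [s * b[0], b[1]],
                          theta_scaled, alpha)
```

Here the second covariate differs by 4 units yet contributes little to the exponent because its $\theta$ is small, while the first covariate differs by only 0.3 units and contributes more.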
Discussions of the interpretation of $\boldsymbol{\theta}$ and $\boldsymbol{\alpha}$ can be found in Welch et al. (1992) and Jones et al. (1998).

Assuming $Y(\mathbf{x})$ follows a Gaussian process (GP), one may write the likelihood based on $\boldsymbol{\beta}$, $\sigma^2$ and the correlation parameters in $R(\mathbf{x}, \mathbf{x}')$ as

$$L(\boldsymbol{\beta}, \sigma, \boldsymbol{\theta}, \boldsymbol{\alpha} \mid \mathbf{y}_S, \mathbf{F}) = \frac{1}{(2\pi\sigma^2)^{n/2}|\mathbf{R}|^{1/2}} \exp\left[-\frac{(\mathbf{y}_S - \mathbf{F}\boldsymbol{\beta})^T\mathbf{R}^{-1}(\mathbf{y}_S - \mathbf{F}\boldsymbol{\beta})}{2\sigma^2}\right]. \tag{4.12}$$

Given the correlation parameters, the Maximum Likelihood Estimator (MLE) of the variance $\sigma^2$ has the expression

$$\hat{\sigma}^2 = \frac{1}{n}(\mathbf{y}_S - \mathbf{F}\hat{\boldsymbol{\beta}})^T\mathbf{R}^{-1}(\mathbf{y}_S - \mathbf{F}\hat{\boldsymbol{\beta}}),$$

and the generalized least squares estimator of $\boldsymbol{\beta}$ is

$$\hat{\boldsymbol{\beta}} = (\mathbf{F}^T\mathbf{R}^{-1}\mathbf{F})^{-1}\mathbf{F}^T\mathbf{R}^{-1}\mathbf{y}_S.$$

Placing these two expressions back into the likelihood function, one is left with a profile likelihood that is a function of the correlation parameters only (Sacks et al., 1989; Jones et al., 1998). This profile likelihood is then maximized over the correlation parameters $\boldsymbol{\theta} = \{\theta_1, \ldots, \theta_k\}$ and $\boldsymbol{\alpha} = \{\alpha_1, \ldots, \alpha_k\}$.

In this thesis, all GP model parameter estimates are MLEs. I will use the common statistical expression “fit the model” to describe the procedure of estimating GP model parameters given a dataset.

Strictly speaking, the BLUP results in Section 4.2.1 assume that the covariance parameters $\boldsymbol{\theta}$ and $\boldsymbol{\alpha}$ are known. In practice, however, they are usually estimated by the methods described here, or by fitting a certain variogram model, which is often used in geostatistics instead of the correlation function. The variogram is

$$2\gamma(\mathbf{x}, \mathbf{x}') = \mathrm{Var}[Y(\mathbf{x}) - Y(\mathbf{x}')] = \sigma^2 R(\mathbf{x}, \mathbf{x}) + \sigma^2 R(\mathbf{x}', \mathbf{x}') - 2\sigma^2 R(\mathbf{x}, \mathbf{x}') = 2\sigma^2[1 - R(\mathbf{x}, \mathbf{x}')],$$

and $\gamma(\mathbf{x}, \mathbf{x}')$ is called the semi-variogram. Le and Zidek (2006) described popular variogram models, such as the exponential and spherical variograms. Cressie (1990) also derived the predictive function based on the semi-variogram under the weighting condition $\sum \mathbf{w}(\mathbf{x}) = 1$ (i.e., Ordinary Kriging).

4.3 Variable and Covariate Selection

In this thesis, the term model variable refers to one of the following: temperature, wind speed, planetary boundary layer height, and the emission rates and ambient concentrations of NOx and VOC. The term model covariate refers to a function of a variable used in the statistical model. For instance, “temperature” is a variable, while the form of the temperature values entered into the model, such as the 24-hour mean temperature, is a covariate. The term design matrix denotes a matrix of model covariates.

This section discusses one of the most important aspects of model development: selecting model covariates. Naturally, an ozone model with a well-chosen set of covariates should properly model the spatial-temporal features of an ozone process. This desired attribute is expressed numerically through a combination of a high likelihood value and a low prediction error. Likelihood-based statistical analyses are theoretical assessments of model quality, while prediction-error-based analyses are practical measurements of how well a model performs.

Proper variable and covariate selection also makes the GP models more parsimonious, i.e., devoid of unnecessary model covariates. This is particularly useful for my application. When modelling physical observations, the accompanying variable measurements are not always available. Hence, from a practical standpoint, a model containing fewer variables and covariates is easier to implement.

From the following discussion, readers will notice that my variable and covariate selection process is a balancing act between statistical analysis and scientific reasoning.

4.3.1 Model Variables

In this subsection, I discuss and analyze the usefulness of various variables for the ozone model. Below is a list of scientifically relevant variables:

• Location and time variables: I use longitude, latitude, elevation and hour of the day.
As an alternative to longitude and latitude, one may also use unitless values that index point locations on a 2-D x-y Cartesian grid. Considering the aforementioned scale-invariance property of the GP model, the scale of a location variable is arbitrary. I chose longitude, latitude and elevation to allow identification of real-life locations in modelling, an important feature for the upcoming CMAQ model evaluation against observations.

• Meteorological variables: These are wind speed and direction, temperature, pressure, humidity and boundary-layer height. Boundary-layer height is the depth of the atmospheric layer in which the surface-bound photochemical processes are contained (Stull, 1988). Taylor (1991) and Salmond and McKendry (2002) are examples of works discussing the associations between the LFV’s boundary-layer height and its ozone process.

• Chemical precursor information: Important precursors to ozone are NOx (oxides of nitrogen) and VOC (volatile organic compounds). I use two types of precursor variables: the incoming rate of new precursor molecules (rate of emissions) in units of moles per second, and the concentrations of precursors (in ppb) already present in the atmosphere, which are referred to here as antecedent precursors. As discussed in Section 2.1, NOx is the sum of NO and NO2 data, while VOC data are created by adding the scaled values of 16 families of volatile organic compounds.

As mentioned in Chapter 2, the meteorological variables are outputs from WRF (with MCIP post-processing), and precursor emission rates are outputs from SMOKE. These variables are created over the same spatial domain and time period as CMAQ. The antecedent precursor concentrations are processed from CMAQ outputs; the processing methods are discussed in a later subsection on antecedent precursor concentrations. As for GP models of observed ozone features, the model variable data are the corresponding (in space and time) observations.
Related details of observation-based modelling are presented in Chapter 6; the discussion here is focused on the statistical modelling of the CMAQ ozone features.

Selecting Meteorological Variables

On the matter of variable selection (not covariate selection), the decisions are based on available scientific advice and references. In previous chapters, I discussed the defining space-time behaviours of an LFV ozone process and its relationship with the dominant wind regime. The following paragraphs complete the picture of the meteorological conditions that are conducive to an ozone episode, and they provide the reasoning and justification behind my science-based variable selection approach.

The start of an ozone episode requires meteorological conditions such as high temperatures, clear sky and low wind speed (Robeson and Steyn, 1990; Taylor, 1991; Ainslie and Steyn, 2007). All these conditions can be triggered by a meso-scale high pressure system. High pressure near the ground creates a low pressure system in the atmospheric layer above; the cold (thus dense) air from above moves downward and warms due to adiabatic heating. This results in a clear sky that allows unobstructed UV radiation, directly driving the photochemical process. A high pressure system also results in low wind speed: fast enough for localized chemical mixing, but not fast enough to transport the ongoing photochemical process out of the LFV quickly. In short, pressure is negatively correlated with wind speed and positively correlated with temperature (Taylor, 1992; Ainslie and Steyn, 2007).

However, an ozone model based solely on pressure without temperature or wind speed is hard to justify.
Although high pressure is indirectly essential to the formation of an ozone episode in the most generalized way, temperature and wind speed are the meteorological forces that drive the more detailed photochemical and atmospheric transportation processes; temperature and wind are simply more relevant and useful at the geographical scale of my ozone modelling (Beaver et al., 2010; Jin et al., 2011; Reuten et al., 2012). Furthermore, my statistical analysis has shown that the addition of pressure along with temperature and wind as model variables tends to degrade the goodness-of-fit and forecasting capability of my feature-based ozone model (see footnote 6). This is perhaps partially the result of covariate confounding introduced by the addition of pressure. Therefore, pressure is not used in my ozone modelling.

The intensity of ultraviolet radiation, measured by the UV index, is another important variable. As discussed, an ozone episode takes place during days with clear sky, and thus unobstructed UV radiation. This variable is not used in modelling for two reasons. First, it is collinear with temperature: during summer days, intense UV radiation results in higher regional temperature. Second, the UV radiation is nearly uniformly distributed in space (at least within the LFV). Thus, the practical utility of including the UV index in statistical ozone modelling is questionable.

In the end, I decided to incorporate 3 meteorological variables: temperature (in kelvin), wind speed (in metres per second) and boundary layer height (in metres). Boundary layer height is particularly important given the topography of the LFV. The mountains surrounding the LFV act as a physical “barrier” to the horizontal advection of pollutants that channels them along the valley (Steyn et al., 1997), and a shallow boundary layer would trap the pollutants within the LFV’s barrier (Robeson and Steyn, 1990; Taylor, 1991).
An ozone episode may be initiated by this accumulation of air pollutants in conjunction with the meteorology conducive to photochemical reactions: high temperature, light wind and strong UV (Boubel et al., 1994, Chapter 17).

The use of wind direction data and generating data for antecedent precursor concentrations

Although wind direction is an integral part of CMAQ that dictates the direction of ozone plume transport (as I have shown in Chapter 3), its usefulness in ozone feature modelling is questionable. This is because its values are measured as the angle clockwise from north. Given an angle, the direction of the vector (from the origin) is the direction from which the wind blows, e.g., a wind direction value of 270° represents a westerly wind (blowing directly from the west). As I will discuss in more detail, the covariate data, which are space-time in nature, will be numerically summarized before being incorporated into my model. Sets of wind direction data with vastly different temporal or spatial profiles may end up being similar in value when summarized. This lack of identifiability would make the modelling effect of wind direction difficult to interpret, thus wind direction is not incorporated directly as a model variable. Instead, the influence of wind direction (hence wind regime) on spatial-temporal ozone patterns is expressed through the creation of a new variable: antecedent chemical precursor concentrations.

Footnote 6: The types of “statistical analyses” mentioned here are presented in Sections 4.5 to 4.7.

For grid cell $s$ at hour $h$, CMAQ produces concentrations for O3, NOx and VOC (among others). CMAQ integrates a system of partial differential equations given the initial and boundary conditions for the grid cell. For each $s$, the concentration at $h$ is integrated over the time period between $h - 1$ and $h$, where $h$ is indexed in hours. So the output at hour 1300 is an averaged concentration during 1201 to 1300.
Within this hour, CMAQ produces air pollution estimates in smaller time intervals. I denote this small-interval time variable as $\tau$, where $\Delta\tau \ll \Delta h$ and $\Delta$ represents the incremental time scale of $\tau$ and $h$. Therefore, from the input-output perspective of a computer model, precursor concentrations from $\tau - \Delta\tau$ are the “inputs” to the ozone output at $\tau$: they are the corresponding antecedent precursor information. As is the case with ozone output, hourly CMAQ outputs for NOx and VOC are the $\Delta\tau$-interval concentrations integrated over an hour.

NOx and VOC concentrations at $h$ are used as antecedent precursor data associated with O3 at $h$. This is because atmospheric photochemical reactions happen at a rate more appropriately indexed by $\tau$, and lagged concentrations at $h - 1$ (the previous hour) are too far back in time considering both the reaction rates and the mesoscale wind speed. This point can be illustrated through simple arithmetic: under a light wind of 3 m s$^{-1}$, which is typical during an ozone episode within the LFV (Section 2.2), the per-hour distance of ozone-plume advection is 3 m s$^{-1}$ × 3600 s = 10800 m = 10.8 km. This is much larger than the CMAQ grid-cell size of 4 km × 4 km.

Furthermore, due to the presence of atmospheric circulation, each grid cell $s$ has two sources of antecedent precursor chemicals: its own grid cell and all neighbouring grid cells (Kalenderski and Steyn, 2011). In a regular grid system, each $s$ has 8 immediate neighbours.

At time $h$, the neighbouring precursor concentrations are part of the initial and boundary conditions for $s$, and the wind direction of grid cell $s$ at time $h$ determines how neighbouring cell concentrations affect its ozone production process. I weight lagged neighbouring precursor concentrations using a scheme called Arcsin Weighting. The entire range of wind direction (0° to 360°) is partitioned into four intervals: > 315° and ≤ 45°, between 45° and 135°, between 135° and 225°, and between 225° and 315°.
Thewind direction value of s at time h will fall into one of the intervals. Fig-ure 4.1 illustrates how the arcsin weighting is done when the wind directionis within the range 225◦ and 315◦. Within any interval, only three neigh-bours are used for weighting: the lagged concentration of “directly upwindneighbour” (U in Figure 4.1) is multiplied by 1/(1 + 2/√2), while the twoneighbours adjacent to the upwind one (A in Figure 4.1) are each multipliedby (1/√2)/(1 + 2/√2). The antecedent concentrations at time h for each sis the sum of the concentration from s and arcsin weighted concentrationsfrom its neighbours.In total, there are 4 types of chemical precursor variables: the NOx andVOC emission rates, and antecedent (or lagged) NOx and VOC concentra-tions.4.3.2 Selection of Model CovariatesRemember that a space-time ozone field is decomposed into spatial and tem-poral components Ej ’s and Pj ’s, j = 1, . . . , p. Since the meteorological andchemical variables are also space-time in nature, a function is needed totransform model variables into numerical expressions appropriate for mod-elling ozone features. Several designs of transform function are explored and1114.3. Variable and Covariate SelectionFigure 4.1: Illustration of how the arcsin weighting is done when the winddirection is 270◦. There are 8 neighbours to the grid cell s, the NOx and VOCconcentrations from the upwind neighbour “U” and two adjacent neighbours“A” will be used. The precursor concentration in “U” is scaled by 1/(1 +2/√2) and the concentrations in “A” is each scaled by (1/√2)/(1 + 2/√2).Their weight sum is then calculated.the two best functions (in terms of modelling capability) are presented inthis section.Model I: Covariates are Spatial or Temporal Means of VariablesWith random response variables representing either spatial or temporalozone patterns, their corresponding covariates can be spatial and tempo-ral means of the model variables. 
A $t \times n$ space-time ozone dataset has meteorological and chemical data from the same spatial-temporal domain. An $n$-length vector of temporal variable means is obtained by averaging the $t \times n$ variable data by column (across time): such a vector represents the mean field of each variable. A $t$-length vector of spatial variable means is formed by averaging the variable data by row (across space): this is the hourly time series of LFV variable means.

The $n$-length vectors of temporal variable means (variable mean fields) are the model covariates of the $E_j$'s, i.e., the spatial ozone features. The $t$-length vectors of LFV variable means are the covariates of the $P_j$'s, i.e., the temporal ozone features. Covariate selection is not necessary with such a formulation: within any ozone feature model, each variable is represented by one covariate only. Moreover, all spatial feature models $E_j$ have the same set of covariates, and likewise for the temporal feature models $P_j$.

Model II: Covariates are EOFs or PCs of Variables

As with the decomposition of ozone data, I can use the PCA of variable data to extract spatial and temporal features of model variables. EOFs of model variables are the covariates for spatial ozone feature models, while PCs of model variables are the covariates for temporal ozone feature models.

To better understand how variable EOFs and PCs can be incorporated into GP models, one needs to first analyze the PCA results of model variables. The results from the PCA of model variable data are presented in Appendix C.1. The data are CMAQ-WRF-SMOKE outputs from 2006, the spatial domain is the “rectangular” LFV (Section 2.1, Figure 2.1) and the data contain all 96 hours of output. For all model variables, the first 3 EOF-PC pairs capture over 90%, or in certain cases close to 100%, of the data variation; hence the full model covariate set consists of the first 3 EOFs or PCs extracted from each of the 7 variables.
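The two covariate constructions can be sketched for a single model variable as follows. This is an illustration under stated assumptions rather than the procedure used to produce the results in Appendix C.1: the function names are hypothetical, the data are synthetic, and computing the PCA through an SVD of time-centred data (with singular values absorbed into the PCs) is one common convention that may differ from the one adopted in Chapter 3.

```python
import numpy as np


def mean_covariates(V):
    """Model I: temporal means (length n) and spatial means (length t)."""
    return V.mean(axis=0), V.mean(axis=1)


def eof_pc_covariates(V, n_keep=3):
    """Model II: leading EOFs (n x n_keep) and PCs (t x n_keep) of V.

    Assumption: PCA is computed by an SVD of the time-centred data, with
    the singular values absorbed into the PCs.
    """
    anomalies = V - V.mean(axis=0)                 # remove each cell's time mean
    U, s, Vt = np.linalg.svd(anomalies, full_matrices=False)
    eofs = Vt[:n_keep].T                           # spatial patterns
    pcs = U[:, :n_keep] * s[:n_keep]               # temporal amplitudes
    explained = s[:n_keep] ** 2 / np.sum(s ** 2)   # variance fraction per mode
    return eofs, pcs, explained


# Synthetic stand-in for one variable: t = 48 hours, n = 229 grid cells.
rng = np.random.default_rng(0)
t, n = 48, 229
diurnal = np.sin(2 * np.pi * np.arange(t) / 24.0)
pattern = rng.normal(size=n)
V = 290.0 + 5.0 * np.outer(diurnal, pattern) + 0.1 * rng.normal(size=(t, n))

temporal_means, spatial_means = mean_covariates(V)
eofs, pcs, explained = eof_pc_covariates(V)
```

Under Model I each variable contributes exactly one covariate per feature model, whereas under Model II it contributes up to `n_keep` candidates, which is why a selection step is needed.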
The purpose of the model variable PCA is not to closely analyze the decomposition of model variables. The objective is to confirm that the model variable EOFs and PCs do indeed capture noticeable space-time structures and variations; interpretability is not a priority.

The next step of analysis is covariate selection: determining the number of useful EOFs and PCs from each variable. Intuitively, one may view covariate selection as a procedure in which I determine the “useful number” of data features from each variable for the modelling of ozone features. After extensive analysis and experimentation with a variety of covariate selection schemes, I decided on a forward selection method which I refer to as Iterative Improvement. Relative to other selection methods I explored, iterative improvement delivered the best combination of model goodness-of-fit and ozone feature forecasting capability. This iterative approach is based on the method of maximum likelihood (4.12). Taking the GP model of any ozone EOF as an example, the iterative procedure proceeds as follows:

Step 1: The starting covariate set contains longitude, latitude, elevation and the first EOF ($E_1$) of each of the 7 meteorological and precursor variables. As a reminder, these are temperature, wind speed, boundary layer height, NOx and VOC emission rates, and NOx and VOC antecedent concentrations.

Fit the GP model containing the starting covariate set and record the maximized likelihood. Denote this starting likelihood value as $\ell_{start}$. The likelihood maximization method (and thus the model estimation method) is described in Section 4.2, and the detailed implementation will be described during data analysis in Sections 4.5 to 4.6.

All first-order model-variable features are used since they capture the most fundamental features of each variable. The iterative improvement procedure is, in essence, a means of incorporating useful additional covariate information into each GP model.

Step 2: The candidate set contains the remaining $Q$ covariates.
Initially, $Q = 14$ (the $E_2$'s and $E_3$'s from the 7 meteorological and precursor variables). Fit $Q$ GP models, where model $q$, $q = 1, \ldots, Q$, is the GP model with the $q$-th covariate added to the starting covariate set from the previous step. Denote each maximized likelihood as $\ell_q$, the largest one as $\ell_{max}$, and its corresponding covariate as $x_{max}$.

Step 3: If the likelihood test statistic $\delta_{max} = 2(\ell_{max} - \ell_{start})$ is larger than the critical value $\chi^2_{0.95,\, df=2}$, then I say that the addition of covariate $x_{max}$ is a significant improvement over the starting model. Update the starting covariate set by incorporating $x_{max}$, and update the candidate set by removing $x_{max}$. Note that the $\chi^2$ degrees of freedom is 2 because there is one correlation parameter and one power parameter attached to each additional covariate in the GP model for an ozone EOF.

Step 4: Repeat steps 1 to 3 until the likelihood test statistic in step 3 is smaller than the critical value. This is the point where the addition of more candidate covariates fails to improve the GP model fit in a statistically significant way. If the initial step 3 fails to add any covariates, then I conclude that the original starting set is the best.

The alternative to iterative improvement is its backward selection counterpart: start with a full covariate set with all possible model covariates and iteratively remove covariates one by one until further omission of covariates results in a statistically significant drop in the log-likelihood.

The iterative covariate selection for ozone PC models proceeds in the same manner, using the PCs of model variables as candidate covariates.

The iterative improvement algorithm is built upon rejecting the null hypothesis (type I error), whereas the backward selection is based on not rejecting the null (type II error). Intuitively, the iterative improvement algorithm should arrive at the optimized covariate set quickly if the set is small.
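The four steps of iterative improvement can be sketched as a short loop. This is a schematic under assumptions, not the code used for the analyses: `max_loglik` is a hypothetical callable standing in for the full GP fit, and the toy log-likelihood gains are invented so that the loop has something to select from. The critical value uses the fact that, for 2 degrees of freedom, the upper 5% point of the $\chi^2$ distribution equals $-2\ln(0.05)$.

```python
import math

# Upper 5% point of chi-square with 2 df; for df = 2 it is -2 * ln(0.05).
CHI2_95_DF2 = -2.0 * math.log(0.05)   # about 5.99


def iterative_improvement(start, candidates, max_loglik):
    """Forward covariate selection by repeated likelihood-ratio tests.

    start      : covariate names that are always kept (Step 1)
    candidates : covariate names that may be added (Step 2)
    max_loglik : hypothetical callable; given a list of covariate names it
                 fits the GP model and returns the maximized log-likelihood
    """
    selected = list(start)
    remaining = list(candidates)
    l_start = max_loglik(selected)
    while remaining:
        # Step 2: try each remaining candidate on top of the current set.
        fits = {c: max_loglik(selected + [c]) for c in remaining}
        best = max(fits, key=fits.get)
        # Step 3: likelihood-ratio statistic against the current model.
        if 2.0 * (fits[best] - l_start) <= CHI2_95_DF2:
            break                      # Step 4: no significant improvement
        selected.append(best)
        remaining.remove(best)
        l_start = fits[best]
    return selected


# Toy stand-in for GP fitting: each covariate adds a fixed log-likelihood gain.
gains = {"E1_temp": 40.0, "E2_temp": 9.0, "E2_wind": 3.5, "E3_nox": 0.4}
toy_loglik = lambda covs: sum(gains[c] for c in covs)

chosen = iterative_improvement(["E1_temp"],
                               ["E2_temp", "E2_wind", "E3_nox"],
                               toy_loglik)
```

With these invented gains the loop adds `E2_temp` and `E2_wind` and then stops, since the last candidate raises the test statistic by only 0.8. Each pass refits one model per remaining candidate, so the cost grows with the number of covariates eventually selected.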
The backward selection approach, by contrast, is recommended if the final covariate set is expected to be large: it requires fewer iterations to delete a small number of candidate covariates than to add a large number of them.

The likelihood test is especially useful since I have multivariate (correlated) data, for which many statistical tests are inapplicable. A likelihood-based test takes into account the inherent correlation structure within the data. Furthermore, it can be shown that maximizing the likelihood is the same as “minimizing the expected predictive deficiency” (Currin et al., 1991). Therefore, one may interpret my proposed covariate selection algorithm as a statistical procedure that searches for a model that is expected to deliver the best prediction.

4.3.3 Goodness-of-fit Statistics

Summary statistics based on cross-validation errors (or residuals) are helpful for assessing model quality from a practical, goodness-of-prediction perspective. As usual, let $\mathbf{Y}_{n \times 1}$ be a vector of $n$ responses and $\mathbf{X}_{n \times k}$ its corresponding covariate matrix. Denote an entry of the response-covariate set as $y_i$ and $\mathbf{x}_i$, $i = 1, \ldots, n$. Let $\mathbf{Y}_{-i}$ and $\mathbf{X}_{-i}$ be the response and covariate data without the $i$-th entry. In cross-validation, I fit the GP model using $\mathbf{Y}_{-i}$ and $\mathbf{X}_{-i}$, and predict $y_i \mid \mathbf{x}_i$. Repeating for all entries in the data, one obtains $n$ cross-validation predictions $\hat{y}_i \mid \{\mathbf{x}_i, \mathbf{Y}_{-i}, \mathbf{X}_{-i}\}$, $i = 1, \ldots, n$, each with an accompanying prediction error. The cross-validation root mean squared error, or CVRMSE, is calculated as

$$\mathrm{CVRMSE} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i \mid \{\mathbf{x}_i, \mathbf{Y}_{-i}, \mathbf{X}_{-i}\}\right)^2}{n}}.$$

I also use the mean percentage error (MPE), which during cross-validation is calculated as

$$\mathrm{MPE} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i \mid \{\mathbf{x}_i, \mathbf{Y}_{-i}, \mathbf{X}_{-i}\}}{y_i} \times 100\%\right).$$

In the MPE calculation, the positive and negative errors can offset each other, so it can be applied as a measure of prediction bias.
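The two statistics can be computed from leave-one-out predictions as in the sketch below. The helper names and the data are made up for illustration, and the placeholder predictor is a simple mean of the held-in responses, not a GP; in practice `fit_predict` would refit the GP model on $\mathbf{Y}_{-i}$, $\mathbf{X}_{-i}$ and predict at $\mathbf{x}_i$.

```python
import math


def loo_predictions(y, X, fit_predict):
    """Leave-one-out predictions.

    fit_predict(y_train, X_train, x_new) -> float is supplied by the caller.
    """
    preds = []
    for i in range(len(y)):
        y_rest = y[:i] + y[i + 1:]
        X_rest = X[:i] + X[i + 1:]
        preds.append(fit_predict(y_rest, X_rest, X[i]))
    return preds


def cvrmse(y, preds):
    """Root mean squared cross-validation error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, preds)) / len(y))


def mpe(y, preds):
    """Mean percentage error; signed, so opposite errors can cancel."""
    return 100.0 * sum((a - p) / a for a, p in zip(y, preds)) / len(y)


# Placeholder predictor (NOT a GP): the mean of the held-in responses.
mean_predictor = lambda y_train, X_train, x_new: sum(y_train) / len(y_train)

y = [40.0, 50.0, 60.0]        # made-up responses
X = [[0.0], [1.0], [2.0]]     # made-up covariates
preds = loo_predictions(y, X, mean_predictor)

rmse_value = cvrmse(y, preds)
mpe_value = mpe(y, preds)
```

In this toy case the residuals are -15, 0 and 15, so the CVRMSE is about 12.2 while the MPE is only about -4.2%, which shows how errors of opposite sign partly cancel in the MPE.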
When the CV residuals are taken as absolute values, MPE becomes the Mean Absolute Percentage Error (MAPE).

Furthermore, a prediction $\hat{y}_i \mid \{\mathbf{x}_i, \mathbf{Y}_{-i}, \mathbf{X}_{-i}\}$ and its prediction error are estimates of the conditional mean and variance of a normal distribution $y_i \mid \mathbf{Y}_{-i}$. In theory (Gao et al., 1996; Bastos and O’Hagan, 2009), I can check the assumptions of a Gaussian Process model by analyzing plots of the standardized cross-validation residuals $(y_i - \hat{y}_i)/\mathrm{se}(\hat{y}_i)$ (to be implemented in Section 4.6).

4.4 The Framework of Feature-based Ozone Modelling

In the remainder of this chapter, I will estimate individual spatial and temporal ozone feature models and assess their goodness-of-fit and predictive capabilities. It is worth re-emphasizing that the ozone models developed in this chapter are intended to serve both as the basis for air quality model evaluation and as an efficient means of modelling a large space-time air pollution process.

The framework below details the steps for (1) estimating individual spatial and temporal ozone feature models, and (2) applying said models to forecast LFV ozone features and subsequent space-time ozone fields.

• Step 1: Partition a complete space-time ozone dataset into the training set and the predictive set. The data can be separated by spatial locations, time periods, or both. In my analyses, I partition ozone data by the time periods within an episode.

The training dataset is used to fit the ozone feature models and carry out model diagnostics. The predictive set contains variable data future to the training set. It provides statistical model inputs to make forecasts of ozone features and space-time ozone fields.
This helps us to further assess the capabilities of ozone feature models.

In this thesis, the term “prediction” is used to describe the procedure of estimating unobserved values of an ozone process, whereas the term “forecast” is specific to the prediction of a future ozone process in relation to the training set.

• Step 2: Model selection. In Section 4.3, I proposed a type of ozone feature model where the covariates are selected from the leading 3 EOFs or PCs of model variables. The covariate selection is done using an iterative improvement procedure. This model selection method will be implemented using the ozone feature data and the model covariate data obtained from the PCA of the training dataset.

• Step 3: With the model covariates for each GP determined, fit individual GP models (estimate model parameters) for $E_j$ and $P_j$, $j = 1, 2, 3$, using their respective training data. Forecasts of regional ozone features can then be made from the fitted models using the EOFs and PCs of model variables from the predictive dataset.

The predictive data contain information on the LFV’s atmospheric and pollution conditions during the time period following the training set. Hence, I am making ozone-feature forecasts for the LFV region. Kriging-based prediction methods were discussed in Section 4.2.

• Step 4: Combine EOF and PC forecasts via (4.13) below to forecast the hourly LFV ozone fields. Denote the predictive set covariate data by $\mathbf{x}_{E,j}$ and $\mathbf{x}_{P,j}$, $j = 1, \ldots, p$, and the estimated model parameters by $\hat{\xi}_{E,j} = \{\hat{\theta}_{E,j}, \hat{\alpha}_{E,j}, \hat{\beta}_{E,j}, \hat{\sigma}_{E,j}\}$ and $\hat{\xi}_{P,j} = \{\hat{\theta}_{P,j}, \hat{\alpha}_{P,j}, \hat{\beta}_{P,j}, \hat{\sigma}_{P,j}\}$ (parameter notation from Section 4.2). The predictive equation for a space-time ozone field is the “prediction-based” form of (4.3):

$$\hat{\mathbf{O}}_{pred} \mid \mathbf{x}_{pred} = \sum_{j=1}^{p}\left(\hat{\mathbf{P}}_{j,pred} \mid \mathbf{x}_{P,j}, \hat{\xi}_{P,j}\right)\left(\hat{\mathbf{E}}_{j,pred} \mid \mathbf{x}_{E,j}, \hat{\xi}_{E,j}\right)^T. \tag{4.13}$$

The data $\mathbf{x}_{pred}$ are the variable data of a predictive set: the data from which $\mathbf{x}_E$ and $\mathbf{x}_P$ are extracted.
  $O_{\mathrm{pred}}$ is the prediction target, i.e., the space-time ozone field of the predictive set. Furthermore, the training set and predictive set can have different data dimensions (different numbers of locations and time points).

In its entirety, my proposed ozone modelling scheme follows the path: decompose ozone training data into separate ozone features ⇒ select model covariates and fit the ozone feature models ⇒ apply the covariate inputs from the predictive data to forecast ozone features ⇒ combine predicted features to forecast the complete space-time ozone field.

4.4.1 Details of Training Set and Predictive Set

The data analysis in this chapter uses the 2006 CMAQ output across the entire “rectangular” LFV domain shown in Figure 4.2 (originally from Figure 2.1 in Chapter 2). This LFV domain includes the north shore mountains as well as the valley floor analyzed in the previous chapter. In total, this region contains $n = 229$ CMAQ grid cells. The training data are the first 2 full days of the 2006 episode (June 24th and 25th), and the prediction is made for the remaining full day (June 26th), so $t_{\mathrm{train}} = 48$ and $t_{\mathrm{pred}} = 24$. The training ozone field is dominated by a type I wind regime and the predictive day is driven by a type III regime (Table 3.1, Ainslie and Steyn (2007)).

As such, I am using ozone feature models estimated for an ozone field under one wind regime type to forecast an ozone field under a different regime. In addition, the modelling is implemented on a complex pollution field that includes a large metropolitan area, farm lands (Abbotsford), and surrounding mountains with elevations sometimes exceeding 500 m. A

Figure 4.2: Map of the complete “rectangular” LFV domain being modelled in this Chapter.
The red dots indicate the corners of the modelled domain; the coordinates of each corner are also shown.

simpler modelling exercise can be based on the low-elevation LFV field analyzed in Chapter 3, and have both training and predictive sets under the same wind regime type, e.g., the 1995 or 2001 episodes. Therefore, the analyses in this chapter are a rather tough assessment of the modelling capability of the proposed ozone feature models.

One also needs to produce the antecedent NOx and VOC data for the predictive set. As discussed in Section 4.3, the antecedent precursor data are processed from CMAQ outputs. The implication here is that one cannot obtain the actual antecedent precursor data without running CMAQ, or in the case of predicting physical observations, making measurements during the prediction hours. In practice, it is redundant to run CMAQ or take measurements to obtain NOx and VOC data before predicting ozone fields, because the real ozone values are also obtained. One straightforward solution is to use the previous-day antecedent precursor data as a proxy.

However, this is not an issue for the analyses in this thesis. Here, the predictions/forecasts are done for the sole purpose of evaluating the capability of ozone feature models. In fact, to properly evaluate these statistical models, it is essential to apply the actual NOx and VOC antecedent concentrations from the day of prediction. Hence, all forecasts in this chapter are done using NOx and VOC data from June 26th, 2006, the day for which the ozone forecasts are made.

4.4.2 PCA of CMAQ Output over the “Rectangular” LFV

Figures 4.3 and 4.4 show that the types of spatial-temporal ozone features over the “rectangular LFV” domain are more or less consistent with what was learned in Chapter 3, where the analyses are done for the low-elevation “valley floor” of LFV.
However, the 2nd EOF now captures the dynamic spatial contrast between the lower valley and the mountains, whereas in Chapter 3 for the valley floor, the spatial contrast was between eastern and western LFV. The leading temporal features are still interpreted the same: P1 is the hourly spatial mean, and P2 is a daytime-nighttime temporal contrast that interacts with E2 to capture a diurnal eastward ozone transport.

Figure 4.5 shows the eigenspectra from the ozone PCA of the training dataset. The ozone features form degenerate multiplets starting at $j = 4$. Based on the eigenspectrum interpretations discussed in Sections 3.2 and 3.3, one may conclude that for the “rectangular” LFV ozone field, the leading 3 ozone features are identifiable as individual features and separable from the others.

4.5 Covariate Selection

This section presents the results from covariate selection using the training data. A covariate is denoted using the abbreviation of the variable name, with subscripts indicating the order of EOF or PC. For example, Temp_{E,2} and NOx-lag_{P,1} denote, respectively, the 2nd EOF of temperature and the 1st PC of antecedent (lagged) NOx concentration.

Figure 4.3: From the training set, day 2 and 3 of the 2006 Ozone episode: Spatial plots of temporal ozone means and standard deviations (calculated across time), and the first 4 EOFs. The ozone mean and standard deviation have units ppb, while E1, . . . , E4 are unitless.

Figure 4.4: From the training set, day 2 and 3 of the 2006 Ozone episode: Time series of hourly spatial ozone means, standard deviation and the first four PCs. The number in each PC plot heading is the “proportion of data variation explained”. All plotted data have units ppb.
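The decomposition behind Figures 4.3 and 4.4 can be illustrated on a toy field. The sketch below is not the PCA routine used in this thesis: it extracts the leading EOFs by power iteration with deflation, uses only the Python standard library, and assumes an uncentred $t \times n$ ozone matrix (an assumption made here so that the first PC tracks the hourly spatial mean, as P1 is interpreted above). The field `ozone`, the `contrast` pattern and all function names are hypothetical.

```python
import math
import random


def matvec(a, v):
    """Multiply a nested-list matrix by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def leading_eigenpair(s, rng, iters=500):
    """Power iteration for the dominant eigenpair of a symmetric PSD matrix."""
    v = [rng.uniform(-1.0, 1.0) for _ in s]
    for _ in range(iters):
        w = matvec(s, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # start vector orthogonal to every remaining feature
            raise ValueError("power iteration collapsed; try another seed")
        v = [x / norm for x in w]
    lam = sum(x * y for x, y in zip(v, matvec(s, v)))  # Rayleigh quotient
    return lam, v


def leading_features(o, p, seed=0):
    """Return [(eigenvalue, share of variation, EOF, PC)] for the first p features."""
    rng = random.Random(seed)
    t, n = len(o), len(o[0])
    # n x n cross-product matrix O^T O; no centring (assumption, see lead-in).
    s = [[sum(o[k][i] * o[k][j] for k in range(t)) for j in range(n)]
         for i in range(n)]
    total = sum(s[i][i] for i in range(n))  # trace = sum of all eigenvalues
    feats = []
    for _ in range(p):
        lam, e = leading_eigenpair(s, rng)
        pc = matvec(o, e)  # P_j = O E_j, one value per hour
        feats.append((lam, lam / total, e, pc))
        # Deflate so the next pass finds the next-largest eigenvalue.
        s = [[s[i][j] - lam * e[i] * e[j] for j in range(n)] for i in range(n)]
    return feats


# Toy field: 6 hours x 4 cells, a diurnal mean plus a two-sided spatial contrast.
contrast = [1.0, 1.0, -1.0, -1.0]
ozone = [[30.0 + 10.0 * math.sin(2 * math.pi * k / 6)
          + 5.0 * math.cos(2 * math.pi * k / 6) * c for c in contrast]
         for k in range(6)]

feats = leading_features(ozone, p=2)
# Rebuild the field from the two features: O = sum_j P_j E_j^T.
recon = [[sum(pc[k] * e[i] for _, _, e, pc in feats) for i in range(4)]
         for k in range(6)]
for j, (lam, share, _, _) in enumerate(feats, start=1):
    print(f"feature {j}: eigenvalue {lam:.1f}, share of variation {share:.4f}")
```

Because the toy field has exactly two underlying patterns, two features reproduce it; for real CMAQ output the share of variation left outside the leading features is what the eigenspectrum in Figure 4.5 summarizes.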
[Figure 4.5 plot: “Plot of eigenvalue spectrum: ozone training data”; axes: the rank of eigenvalue, eigenvalues]

Figure 4.5: Eigenspectrum from the PCA of the training data: 2006 CMAQ output for June 24th and 25th, over the rectangular LFV domain (area including the mountains). The dashed line indicates the ozone feature at which feature degeneracy occurs. No spectrum is shown for $\lambda_1$, which has a much larger eigenvalue that is distinct from those shown.

4.5.1 Implementation Details

As discussed in Section 4.2, the multivariate normal based profile likelihood (4.12) is maximized to find estimates for the GP model parameters, and the GP correlation functions are power exponential. I use the program GASP written by William J. Welch to optimize all my Gaussian-based profile likelihoods. This program iterates through different non-linear optimization approaches like the Nelder-Mead method to maximize a likelihood function. It is a reliable GP optimizer that has been applied in many published works (Jones et al., 1998; Aslett et al., 1998).

GASP outputs a number of optimization summaries, one of which is called the Condition Number. It shows the number of significant figures lost to numerical error in the maximum likelihood results. In other words, it indicates the precision and reliability of optimization: the higher the Condition Number, the smaller the number of accurate significant digits, which in turn indicates a lower optimization quality. As such, it is desirable to have a small condition number. In practice, a condition number of $> 1 \times 10^{6}$ gives cause for concern. All the optimizations presented in this thesis have smaller condition numbers.

The regression part of the $E_j$ GP models (4.1) contains longitude, latitude, elevation and an intercept term: $f(X_E) = (1, \mathrm{lon}, \mathrm{lat}, \mathrm{elev})^T$. The covariates in the regression term are fixed.
I only select covariates for the stochastic process $Z_E$, i.e., covariates that define spatial correlations within each $E_j$. As mentioned in Section 4.1, the regression term is often treated as a constant. Based on my past experience working with spatial Gaussian Processes and existing literature (Gao et al., 1996; Jones et al., 1998), the inclusion of spatial linear regression may slightly improve the models’ predictive qualities. Hence, I chose to use the 4-term regression function, keep its form fixed, and focus my covariate selection analyses on the stochastic components of GPs.

The starting covariate set contains longitude, latitude, elevation and the 1st EOF of all 7 variables: NOx and VOC emission rates, temperature, wind speed, boundary layer height, NOx and VOC antecedent concentrations. Once again, at any time point $t$, the antecedent values (NOx and VOC) of each grid cell are the sum of antecedent concentration from that cell and neighboring cells weighted via Arcsine Weighting (Section 4.3). Given the starting covariate set, I initiate the iterative improvement algorithm outlined in Section 4.3. This algorithm is stopped once no statistically significant improvement can be made by introducing more covariates into the model, and the covariate set at termination is my covariate set of choice. This iterative operation is implemented for the ozone feature models $E_j$, $j = 1, 2, 3$.

For PC covariate selection, I fix the regression term at $f(X_P) = (1, \mathrm{hour})^T$. I experimented with different functions of hour, e.g., $\mathrm{hour}^2$ and auto-regressive time series of lag 1. However, the added model complexity came with no noticeable improvement in the models’ predictive qualities. I decided that a simple 2-term regression is sufficient for the $P_j$ GP models (4.2).

The focus here is, once again, selecting covariates for the stochastic process $Z_P$. For selection of parameters in $Z_{P_1}$, $Z_{P_2}$ and $Z_{P_3}$, the starting covariates are respectively Temp_{P,1}, VOC_{P,1} and NOx-lag_{P,3}.
They are selected because single-covariate models of $Z_{P_1}$, $Z_{P_2}$ and $Z_{P_3}$ with these covariates have the highest model-fit likelihood. Iterative improvement is then implemented to identify additional model covariates.

4.5.2 Selection Results

Table 4.1 shows the results of covariate selection from iterative improvement using the 2006 training data mentioned in Section 4.4. The covariates shown are in addition to longitude, latitude and elevation for the $E_j$ models. As shown, for both $E_j$ and $P_j$ models, the variable temperature consistently has multiple EOFs and PCs selected, reflecting its expected importance in an ozone process. Moreover, for $E_j$ models the chemical precursor variables are more likely than meteorological variables to have covariates included.

Table 4.2 shows the cross-validation RMSEs from the models selected by iterative improvement and the standard deviations of their respective training data. The RMSEs can be compared with those of a null model, where the prediction is just the mean of the training data, i.e., the training data standard deviations also shown in Table 4.2. As shown, the fitted ozone feature models have much lower CVRMSE than the data standard deviation.

Figure 4.6 plots the goodness-of-fit results from various model fits for E1. Performances are compared from four covariate sets:

• Covariates selected through iterative improvement.
• The 1st EOFs of the model variables.
• All candidate model covariates.
• Only longitude, latitude and elevation.

A circle in the plot indicates that, according to a likelihood-ratio test at a significance level of 0.05, there is no significant difference between a simpler model and the full model containing all possible covariates. An “x” indicates otherwise.
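The null-model comparison described above for Table 4.2 rests on a simple identity: if every value is predicted by the mean, the resulting RMSE equals the standard deviation of the data (with divisor $n$; a divisor of $n-1$ changes it by a factor of $\sqrt{(n-1)/n}$). The following minimal sketch checks that identity on hypothetical numbers and then expresses the Table 4.2 CVRMSEs as fractions of the corresponding standard deviations. The list `feature` is invented for illustration; only the `table_4_2` entries come from this thesis.

```python
import math
import statistics


def rmse(truth, pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(truth, pred)) / len(truth))


# Hypothetical feature values, for illustration only.
feature = [0.012, -0.004, 0.021, 0.009, -0.015, 0.003]

# Null model: predict every value by the overall mean.
null_pred = [statistics.mean(feature)] * len(feature)
null_rmse = rmse(feature, null_pred)
pop_sd = statistics.pstdev(feature)  # population SD (divisor n)

# (CVRMSE, standard deviation) pairs copied from Table 4.2.
table_4_2 = {
    "EOF 1": (0.0006, 0.015),
    "EOF 2": (0.0039, 0.065),
    "EOF 3": (0.0064, 0.066),
    "PC 1": (5.60, 142.6),
    "PC 2": (5.92, 117.23),
    "PC 3": (3.58, 48.98),
}
ratios = {name: cv / sd for name, (cv, sd) in table_4_2.items()}
for name, r in ratios.items():
    print(f"{name}: CVRMSE is {100 * r:.1f}% of the null-model RMSE")
```

Read this way, each fitted feature model in Table 4.2 has a cross-validation error below one tenth of what the mean-only predictor would give.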
Model   Covariates
EOF 1   NOx_{E,1}, NOx_{E,2}, VOC_{E,1}, Temp_{E,1}, Temp_{E,2}, Wind_{E,1}, Wind_{E,2}, BL_{E,1}, NOx-lag_{E,1}, VOC-lag_{E,1}, VOC-lag_{E,2}
EOF 2   NOx_{E,1}, VOC_{E,1}, Temp_{E,1}, Temp_{E,2}, Wind_{E,1}, BL_{E,1}, NOx-lag_{E,1}, VOC-lag_{E,1}
EOF 3   NOx_{E,1}, VOC_{E,1}, Temp_{E,1}, Temp_{E,3}, Wind_{E,1}, BL_{E,1}, NOx-lag_{E,1}, NOx-lag_{E,2}, NOx-lag_{E,3}, VOC-lag_{E,1}
PC 1    Temp_{P,1}, Temp_{P,3}, Wind_{P,2}, BL_{P,1}, VOC-lag_{P,1}
PC 2    VOC_{P,1}, Temp_{P,3}, Wind_{P,1}, Wind_{P,3}, NOx-lag_{P,1}
PC 3    NOx-lag_{P,3}, BL_{P,2}, NOx-lag_{P,1}, Temp_{P,2}, Temp_{P,3}, Wind_{P,1}

Table 4.1: Covariate selection results from the iterative improvement algorithm. The data are the 2006 training data mentioned in Section 4.4. “BL” is the boundary layer, “NOx-lag” and “VOC-lag” indicate antecedent precursor concentration. For the EOF models, the covariates shown are those used in addition to longitude, latitude and elevation.

                                   EOF 1     EOF 2     EOF 3
CVRMSE of iterative improvement    0.0006    0.0039    0.0064
Standard deviation                 0.015     0.065     0.066

                                   PC 1      PC 2      PC 3
CVRMSE of iterative improvement    5.60      5.92      3.58
Standard deviation                 142.6     117.23    48.98

Table 4.2: Training data cross-validation RMSE of the models chosen by the iterative improvement method and the standard deviation of the ozone feature data. The units of $P_j$ model CVRMSE and standard deviation are ppb. The CVRMSE and standard deviation are unitless for the $E_j$’s.

Figure 4.6 shows that the model with only the location variables fits noticeably worse than models containing meteorological and precursor variables. These results highlight that, due to the complex non-linear structures of ozone features, longitude and latitude are not sufficient for ozone feature modelling. Meteorological and ozone precursor variables are statistically important for ozone modelling here.

[Figure 4.6 plot, panel “EOF 1”; axes: CV-RMSE (unitless), CV-MAPE (in %); legend: all components, Iterative Improvement, eof 1, lon, lat, elev]

Figure 4.6: Cross-validation MPE versus cross-validation RMSE for four models with different covariate sets.
A circle indicates, via a likelihood-ratio test, that there is no significant difference between a simpler model and the full model.

4.6 Modelling and Forecasting Ozone Features

This section will present results including diagnostic tests of model assumptions and evaluation of the predictive quality of the ozone feature models. Here, “prediction” refers to the forecasts of future ozone-feature patterns and space-time ozone fields for the predictive set. This is not to be confused with the cross-validation done in the previous section, which is prediction within the training set. Forecasting is a more stringent test of a model’s capabilities than cross-validation.

This section presents the results from two types of ozone feature models discussed in Section 4.3:

1. Models where the covariates are spatial and temporal means of model variables, which I refer to as the Variable Mean (VM) models. For VM models of $E_j$, the covariates are model variables’ temporal means (averaged across time): these are spatial fields of variable means. For the VM models of $P_j$, the covariates are model variables’ spatial means (averaged across space): these are time series of variable means.

2. Models where covariates are variable data decomposed into EOFs/PCs and selected via the iterative improvement procedure, the results of which are shown in the previous section. I refer to such ozone feature models as Covariate Iterative Improvement (CII) models.

A further note on terminology: the ozone EOF models based on formulations VM and CII are referred to as EOF-VM and EOF-CII models. Similarly, the ozone PC models are PC-VM and PC-CII.

In Section 4.2, I presented the Best Linear Unbiased Predictor (BLUP) for the unobserved response $Y_0$ given observations $y$.
Its two main properties are:

• Let $\hat{y}_0$ denote the BLUP of $Y_0$; then by the unbiasedness property $E(\hat{y}_0) = E(Y_0)$, and the mean of the random variable $Y_0 - \hat{y}_0$ is 0.
• The predictor $\hat{y}_0$ is also the mean of the conditional Normal (Gaussian) distribution $f(Y_0 \mid y)$.

Summarizing the above properties, one would expect that, if $Y_0$ (and $Y$) indeed follow a Normal (or MVN) distribution, then the random variable $(Y_0 - \hat{y}_0)/\mathrm{SE}(\hat{y}_0)$ would follow a standard normal distribution (ignoring estimation of the parameters). Therefore, the assumption of Gaussian Processes for EOFs and PCs can be tested by analyzing how closely the distribution of $(Y_0 - \hat{y}_0)/\mathrm{SE}(\hat{y}_0)$ resembles a standard normal. By running cross-validations on the training data and obtaining standardized Cross Validation (CV) errors, I obtain samples of $(Y_0 - \hat{y}_0)/\mathrm{SE}(\hat{y}_0)$. I then plot and examine the standard normal Q-Q plots of these samples to assess the appropriateness of the Gaussian Process assumption (Gao et al., 1996; Jones et al., 1998).

4.6.1 Modelling and Forecasting of Spatial Ozone Features: EOFs

Figure 4.7 shows, for the 3 $E_j$ models based on CII (EOF-CII), standard normal QQ-plots of the standardized cross-validation errors. Given how closely the points of “theoretical quantile vs. sample quantile” lie along the line $y = x$, I conclude that based on this quantile-to-quantile criterion, the assumption of a Gaussian Process is appropriate. The model fitting and diagnostic results are similar for the EOF-VM models (not shown).

Figure 4.7: For the three EOF-CII models: standard normal QQ-plots of standardized CV-errors. The x-axis is the theoretical quantiles of a standard normal distribution and the y-axis is the sample quantiles from cross-validation.

An important model fitting result is that the fitted correlation smoothness (power) parameters $\hat{\alpha}$ are either equal or very close to 2.
A value $\alpha = 2$ indicates an infinitely differentiable and smooth correlation function. In addition, at $\alpha = 2$, a power-exponential function becomes a Gaussian correlation.

Forecasting the Spatial Features

With GP model parameters estimated from the two days of training data, the corresponding covariate data from the predictive set are used in the multivariate form of BLUP (4.7) to make forecasts of spatial ozone features for the LFV on June 26th, 2006.

Table 4.3 contains the forecast RMSEs calculated as

$$\mathrm{RMSE}_j = \sqrt{\frac{\sum_{i=1}^{n} \left(E_{ij} - \hat{E}_{ij}\right)^2}{n}}, \quad j = 1, 2, 3. \qquad (4.14)$$

Here, $\hat{E}_{ij}$ denotes the forecast made at location $i$ for the $j$-th ozone feature, and $n = 229$ is the number of forecast locations. When $\hat{E}_j = \bar{E}_j$ in (4.14), one obtains the standard deviations of the true EOFs being predicted (also in Table 4.3). The true EOFs are decompositions of the ozone data from the predictive set, i.e., they are the real-life spatial ozone features we try to forecast. These true standard deviations provide useful reference points when assessing the scale of RMSE. Note that the response variables are the ozone EOFs, which are unitless.

When compared to the standard deviations of the true ozone EOFs, the estimated ozone feature models delivered reasonably low prediction RMSEs. However, it is worth noting that the RMSEs of the two E3 models are roughly 71% (CII) and 77% (VM) of the true EOF’s standard deviation. As discussed in Chapter 3, the 3rd-order and 4th-order ozone features capture the space-time behaviour of the nocturnal ozone process, which is not as influenced by the model variables as the daytime ozone. In other words, they may not be process-driven enough to be adequately modelled as ozone features. This translates to the difficulties we see in the modelling of E3.
Table 4.3 shows, however, that E1 and E2 have smaller forecasting error relative to the standard deviations of the true EOFs.

                           EOF 1    EOF 2    EOF 3
RMSE of EOF-VM models      0.009    0.020    0.051
RMSE of EOF-CII models     0.008    0.016    0.047
S.D. of the true EOFs      0.020    0.065    0.066

Table 4.3: Prediction RMSE of the EOF models. The last row contains the standard deviations of the true EOFs being predicted. The $E_j$’s are unitless.

Figures 4.8, 4.9 and 4.10 display the spatial patterns of the true EOFs and their predictions. For reference, see Figure 4.2 in Section 4.4 for the map of the modelled LFV domain. As shown, the model forecasts captured the true EOFs’ gross regional-scale spatial patterns as well as some of the finer-scale spatial variations. This type of visual test informs us how closely the statistical models can predict future spatial variations of the ozone features, which is especially useful in this application of ozone feature modelling. The results show that despite moderate RMSEs, the ozone feature models are capable of forecasting the complex non-linear structure of the leading ozone features.

Figure 4.8: Spatial plots of true E1 to be predicted and its GP model predictions (all unitless). The colour scales are the same between plots.

Figure 4.9: Spatial plots of true E2 to be predicted and its GP model predictions (all unitless). The colour scales are the same between plots.

Figure 4.10: Spatial plots of true E3 to be predicted and its GP model predictions (all unitless). The colour scales are the same between plots.

4.6.2 Modelling and Forecasting of Temporal Ozone Features: PCs

Figure 4.11 shows, for the three ozone PC-CII models (the covariate formulations shown in Table 4.1), the standard normal QQ-plots of standardized CV-errors.
While the centres of the sample distributions correspond closely to that of a standard normal, the lower tails of P2 and P3 are both higher than expected for a standard normal. There are also 3 upper-tail standardized errors that deviate noticeably from the standard normal assumption. Therefore, except for their deficiency in modelling the extremities of higher-order PCs, models with a Gaussian assumption do a satisfactory job of describing the distribution of temporal ozone features. The QQ-plots from the PC-VM model fits (not shown) delivered similar results. In the end, I found little reason to doubt the appropriateness of the Gaussian assumption when modelling ozone features.

Figure 4.11: For the three PC-CII models: standard normal QQ-plots of standardized CV-errors. The x-axis is the theoretical quantiles of a standard normal distribution and the y-axis is the sample quantiles from cross-validation.

Forecasts of the Temporal Features

Table 4.4 shows the prediction RMSEs along with the standard deviations of the true PCs for reference. It should be noted again that the PCs are the weighted row-sums of $O_{t \times n}$ (Section 3.2), which explains the high RMSEs and standard deviations shown. The space-time ozone field $\hat{O} = \hat{P}\hat{E}^T$ will contain appropriate ozone values once scaled by $E$. Relative to the standard deviations of the true PCs, both P1 and P2 predictions have lower RMSEs. Again, the prediction of a higher-order ozone feature, P3, is relatively less accurate, with RMSEs between about 72% and 75% of the standard deviation of true P3.

                           PC 1          PC 2          PC 3
RMSE of PC-CII models      92.21 ppb     45.95 ppb     42.22 ppb
RMSE of PC-VM models       74.29 ppb     49.46 ppb     40.70 ppb
S.D. of the true PCs       177.23 ppb    113.60 ppb    56.64 ppb

Table 4.4: Prediction RMSEs of the PC models, where predictions are made using the “real” predictive set on antecedent precursors.
The standard deviations of the true PCs are shown for comparison.

Figure 4.12 plots the temporal patterns of predicted PCs overlaid with the true PCs. As shown, the temporal patterns of the forecasts reflect the general trends of true PCs. Figure 4.12 does show a few exceptions: my models for P1 over-predict the LFV ozone means in the early morning and after 1900PST; there are also slight discrepancies between the true P2 and predictions during early morning and late night. However, my models forecasted the day-time temporal ozone features very well, which is the most important conclusion drawn.

Figure 4.12: Time-series plots of the true temporal ozone features in the predictive set (black), their predictions using ozone feature models PC-CII (blue) and PC-VM (red).

Figure 4.13 shows, from the CMAQ 2006 ozone output, the temporal plots of hourly spatial mean, standard deviation (both summarized across space) and the first 4 PCs over the course of the whole ozone episode (including both the training and predictive days). There is a trend of decreasing night-time hourly LFV means through the episode, and by the 4th day (predictive set), the hourly LFV mean experiences a dramatic decline from which it barely recovers as this episode concludes. This daily trend is naturally reflected in the temporal pattern of P1. Recall that days 2 and 3 are used as the training set, in which this “sudden decline” in night-time ozone is not observed. Therefore, such within-episode variation of the ozone process is not “learned” when estimating the P1 models, and P1’s numerical relations with available covariates are not sensitive enough to forecast this night-time feature. In short, this is a problem of extrapolation.

It is worth noting that the magnitude of P1 is noticeably larger than the others (Figure 4.12).
From the ozone prediction function (4.13), one can deduce that the E1 prediction receives a larger weighting in the final ozone modelling/prediction, an expected result given the importance of the spatial/temporal mean to an ozone (or any air pollution) process. The P3 values are smallest in magnitude, hence assigning the smallest weight towards the E3 prediction. As a result, the effect of any prediction errors in the 3rd EOF-PC models is subsequently alleviated.

Figure 4.13: Temporal plots of hourly spatial ozone mean, standard deviation and the first 4 PCs over the course of the entire episode. Notice the sharp night-time decline of P1 during the 4th day (predictive set), as highlighted by a red circle. The vertical dashed lines indicate the hour 0000, and the value in the heading of each PC plot is the “proportion of data variation explained”.

4.7 Forecast of Space-Time Ozone Fields

With the ozone features forecasted, (4.13) is used to forecast the regional ozone fields for the last day of the ozone episode: June 26th, 2006. Once again, the ozone field being modelled is the CMAQ produced output, not physical observations.

Figures 4.14 and 4.15 show, for selected hours, the scatter plots of the model forecasts against the true CMAQ output. The feature-based model gave good forecasts during the afternoon peak hours: the points are scattered near the line $x = y$. As expected from the over-prediction of P1 in the last section, nighttime forecasts are higher than the true ozone level at a number of locations.

Figures 4.16 to 4.19 present forecasts for selected hours as regional ozone fields visualized through 3-dimensional plots.
There are four types of ozone field: true CMAQ output, true CMAQ output constructed from only the leading 3 EOFs/PCs, and my predictions using the CII model (the covariates are variable EOFs and PCs selected via iterative improvement) and the VM model (covariates are variables’ spatial and temporal means).

There are two sources of prediction (or forecast) error inherent to the feature-based ozone model: (1) the error from directly predicting ozone features, and (2) the error from using $p = 3$, or in general, $p \ll \min(t, n)$ ozone features to predict a complete space-time ozone field. The second error source is extensively discussed in Chapter 3. Hence, I believe it is useful to present the patterns of the true regional ozone reconstructed with only 3 EOFs and PCs. Comparison between forecasted hourly ozone fields (bottom plots in each set of four plots in Figures 4.16 to 4.19) and the corresponding “true CMAQ with first 3 EOFs/PCs” (upper right of a set) evaluates the feature-based ozone model’s capability from the sole perspective of error source (1) mentioned above.

[Figure 4.14 plot: panels for Hours 0100, 0700 and 1200, Model CII and Model VM; axes: True CMAQ output (ppb), Predicted CMAQ (ppb)]

Figure 4.14: For hours 0100, 0700 and 1200 of June 26th, 2006 (the predictive set): the scatter plots of predictions from the CII model and the VM model versus the true CMAQ output. The three lines are $y = x$, $y = 2x$ and $y = \frac{1}{2}x$.

[Figure 4.15 plot: panels for Hours 1400, 1600 and 2000, Model CII and Model VM; axes: True CMAQ output (ppb), Predicted CMAQ (ppb)]

Figure 4.15: For hours 1400, 1600 and 2000 of June 26th, 2006 (the predictive set): the scatter plots of predictions from the CII model and the VM model versus the true CMAQ output. The three lines are $y = x$, $y = 2x$ and $y = \frac{1}{2}x$.
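The two error sources described above can be separated numerically. The sketch below is an illustration under stated assumptions, not the computation used in this thesis: it builds a toy “true” field from four hand-picked orthonormal EOFs, truncates it to $p = 3$ features, and forms a forecast in the form of (4.13) from deliberately perturbed features. All numbers and names (`eofs`, `pcs`, `pcs_hat`, `eofs_hat`) are hypothetical.

```python
import math


def outer_sum(pcs, eofs):
    """Sum_j P_j E_j^T as a t x n nested list (the form of equation 4.13)."""
    t, n = len(pcs[0]), len(eofs[0])
    return [[sum(p[k] * e[i] for p, e in zip(pcs, eofs)) for i in range(n)]
            for k in range(t)]


def field_rmse(a, b):
    """RMSE over every hour and grid cell of two t x n fields."""
    sq = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return math.sqrt(sum(sq) / len(sq))


# Hypothetical features: 4 orthonormal EOFs over n = 4 cells, PCs over t = 3 hours.
eofs = [[0.5, 0.5, 0.5, 0.5],
        [0.5, 0.5, -0.5, -0.5],
        [0.5, -0.5, 0.5, -0.5],
        [0.5, -0.5, -0.5, 0.5]]
pcs = [[80.0, 120.0, 100.0],
       [-20.0, 30.0, 10.0],
       [6.0, -4.0, 5.0],
       [2.0, -1.0, 1.0]]
p = 3

true_field = outer_sum(pcs, eofs)          # all features
truncated = outer_sum(pcs[:p], eofs[:p])   # true features, first p only

# Stand-ins for GP forecasts: the true features with small artificial errors.
pcs_hat = [[1.05 * v for v in pc] for pc in pcs[:p]]
eofs_hat = [[v + 0.01 for v in e] for e in eofs[:p]]
forecast = outer_sum(pcs_hat, eofs_hat)

trunc_rmse = field_rmse(true_field, truncated)    # error source (2)
feature_rmse = field_rmse(truncated, forecast)    # error source (1)
total_rmse = field_rmse(true_field, forecast)     # what a user of the forecast sees
print(f"truncation {trunc_rmse:.3f}, features {feature_rmse:.3f}, total {total_rmse:.3f}")
```

Since the total error field is the sum of the two component error fields, the total RMSE can never exceed the sum of the two component RMSEs; comparing a forecast with the truncated field, as in the upper-right panels of Figures 4.16 to 4.19, isolates source (1).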
Figure 4.16: Hour 0100 and 0700 of June 26th, 2006 (the predictive set): 3-D spatial ozone fields of true CMAQ output (upper-left), true CMAQ output with only the first 3 EOFs and PCs (upper-right), forecasts using CII model (lower-left) and VM model (lower-right).

Figure 4.17: Hour 1000 and 1200 of June 26th, 2006 (the predictive set): 3-D spatial ozone fields of true CMAQ output (upper-left), true CMAQ output with only the first 3 EOFs and PCs (upper-right), forecasts using CII model (lower-left) and VM model (lower-right).

Figure 4.18: Hour 1400 and 1600 of June 26th, 2006 (the predictive set): 3-D spatial ozone fields of true CMAQ output (upper-left), true CMAQ output with only the first 3 EOFs and PCs (upper-right), forecasts using CII model (lower-left) and VM model (lower-right).

Figure 4.19: Hour 2000 and 2200 of June 26th, 2006 (the predictive set): 3-D spatial ozone fields of true CMAQ output (upper-left), true CMAQ output with only the first 3 EOFs and PCs (upper-right), forecasts using CII model (lower-left) and VM model (lower-right).

At 0100PST and 0700PST, the spatial patterns of ozone fields resemble those of the regional topography: near background ozone level across the valley, and a high-level ozone plume blanketing the north shore mountains. During these hours my predictions effectively capture this reality, including higher-resolution spatial details such as the “peaks-and-troughs” along the north shore mountains. As one might expect from the forecasts of P1 (previous section), my CII model over-predicts the southwest LFV ozone fields during the night-time period of 2000PST-2300PST. This over-prediction can be seen from Figure 4.19.
Although the VM model also over-predicted the 2000PST CMAQ ozone field, its predictive quality improved noticeably in the following hours, as seen in the prediction for 2200PST (Figure 4.19). Both statistical models’ over-prediction at 2000PST is also evident from the scatter plots in Figure 4.15: some of the true CMAQ output are near 0 ppb while their corresponding predictions are noticeably higher.

However, day-time is the most important period for ozone modelling. Much ozone research focuses on the mean 8-hour daily maximum: the average ozone levels during the highest 8-hour window of each day, and government policies and regulations are based on compliance with this statistic (CCME, 2000; Yarwood et al., 2005; Reuten et al., 2012). My feature-based ozone models delivered good predictions during the day-time, as evident from both the 3-D spatial plots (Figures 4.17 and 4.18) and scatter plots of true CMAQ vs. prediction (hours 1200PST, 1400PST and 1600PST in Figures 4.14 and 4.15).

Table 4.5 shows, for the CII and VM models, the hourly prediction RMSEs and MPEs at 3 mid-day hours and the RMSE/MPE summarized over the 8-hour maximum. The table also shows the hourly LFV standard deviations of the true ozone fields being predicted: these values help to put in context the scale of the CII/VM model forecasting accuracy. Hourly LFV standard deviation is also mathematically the same as hourly RMSE of predictions made by the true ozone mean of that hour (averaged across space). Using RMSE as the reference, both models displayed similar prediction accuracy during the afternoon ozone peak hours. However, the CII model gave predictions with noticeably smaller MPE throughout the 8-hour daily maximum as well as the entire 24 forecasting hours (not shown). Furthermore, the hourly LFV standard deviations of the true ozone are more than double the hourly RMSEs of ozone feature models.
The prediction MPE is near 0% at 1300PST for the CII model, and the MPEs of the VM model are consistently higher than the CII model in magnitude throughout the diurnal cycle.

                               Hour (PST)                  Hours from
                               1200     1300     1400      8-hour maximum
RMSE of CII (in ppb)           4.08     4.24     7.19      7.60
RMSE of VM (in ppb)            4.34     5.24     6.88      7.50
Std. deviation of true data    14.78    15.59    16.24     16.84
MPE of CII (in %)              -2.08    0.33     1.18      -4.95
MPE of VM (in %)               -6.52    -3.09    -1.28     -7.90

Table 4.5: Prediction RMSE and MPE from the two feature-based ozone models and the standard deviation of the true ozone field. The prediction statistics are presented as hourly value for hours 1200PST, 1300PST, 1400PST and summarized across the hours during the 8-hour ozone maximum. The forecasting day is June 26th, 2006 and the spatial domain is the rectangular LFV field (Figure 4.2).

             RMSE in ppb         MPE in %
             day 1    day 2      day 1    day 2
CII model    2.84     2.80       -0.85    -0.63
VM model     3.45     3.47       -0.82    -0.47

Table 4.6: RMSE and MPE of cross-validation predictions made on complete ozone fields. The statistics are summarized over the 8-hour ozone maximum of June 24th and 25th, 2006 (day 1 and 2 from training data). The spatial domain is the rectangular LFV field (Figure 4.2).

Overall, the ozone feature models delivered space-time ozone forecasts with reasonably low error statistics, and their predictions manage to capture the complex spatial structures of LFV’s hourly ozone fields through a diurnal cycle. One caveat is their over-predictions during night-time hours (especially for the CII model). This is the result of over-predicting the tail of P1, which, as discussed in the previous section, is due to extrapolation. However, this over-prediction is limited to a particular area of LFV (the southwest) and the hours of 2000PST and 2100PST.

Table 4.6 shows the RMSE and MPE of cross-validation predictions of complete ozone fields for the two days of training data.
This is analogous to Table 4.2, where CV-RMSE and CV-MPE of individual ozone feature models are shown. The RMSEs and MPEs are summarized over the 8-hour maximum of each day. The cross-validation statistics are described in Section 4.3, but are now applied to reconstructed ozone rather than the features. As shown, compared to the predictions done for ozone fields outside of the training data (Table 4.5), both RMSE and MPE are noticeably smaller for the cross-validation. This is especially true for the percentage of mean prediction error, where the MPEs are below 1% in magnitude for cross-validation compared to −4.95% and −7.90% for ozone forecasting (Table 4.5).

The accuracy of cross-validation RMSE and MPE may be viewed as a goodness-of-fit statistic for the ozone feature models: CV-RMSE and CV-MPE show how well a statistical model can emulate the space-time structure of LFV ozone in the training set. As results from cross-validations have shown, both types of ozone feature models are capable of modelling a complex weather and pollution driven regional ozone process.

4.8 Model Fits from other Episodes

Ozone feature models like those developed in this chapter will be applied to AQM evaluations in Chapters 5 and 6. As I will discuss in Section 5.1, the evaluations to be presented in this thesis are done for individual episodes. Hence, the ozone feature models were fitted per-episode in this chapter, and the 2006 model in particular was analyzed in detail in Sections 4.4 to 4.7. Tables 4.7 and 4.8 show the cross-validation RMSEs of the fitted models for the other four episodes along with the standard deviations of the training data. As shown, all fitted models have CVRMSEs that are much lower than the standard deviations, and these CVRMSEs are comparable
across episodes between 1985 and 2001.

                      Episode
          1985       1995       1998       2001
$E_1$    0.0008     0.0007     0.001      0.0008
        (0.016)    (0.011)    (0.011)    (0.016)
$E_2$    0.0033     0.0026     0.0038     0.0037
        (0.065)    (0.066)    (0.065)    (0.065)
$E_3$    0.0071     0.0062     0.0071     0.005
        (0.066)    (0.066)    (0.066)    (0.066)

Table 4.7: Cross-validation RMSE of the fitted $E_1$, $E_2$ and $E_3$ models, whose covariates are selected by iterative improvement. The numbers in parentheses are standard deviations of the training data. The CVRMSE and standard deviation are unitless for the $E_j$'s.

                           Episode
                1985       1995       1998       2001
$P_1$ (ppb)    17.46      17.29      19.62      15.9
             (216.02)   (252.87)   (335.12)   (230.63)
$P_2$ (ppb)    17.91      20.33      10.67      15.86
             (159.49)   (153.31)   (188.21)   (138.16)
$P_3$ (ppb)    10.77      15.15      10.37      11.06
              (71.31)    (75.10)    (91.27)    (76.51)

Table 4.8: Cross-validation RMSE of the fitted $P_1$, $P_2$ and $P_3$ models, whose covariates are selected by iterative improvement. The numbers in parentheses are standard deviations of the training data. The CVRMSE and standard deviation have units ppb.

An alternative model-fitting procedure is to merge all CMAQ outputs and fit ozone feature models describing all episodes. This is a reasonable approach if the objective is to estimate a statistical emulator of the CMAQ process. However, as mentioned, the ozone feature models in this chapter were estimated for per-episode CMAQ evaluations. Therefore, I decided not to pursue the “merged-data” approach to model estimation.

4.9 Chapter Conclusion

In this chapter, statistical models of spatial-temporal ozone features are developed. These ozone feature models displayed notable capability in modelling the non-linear structures of the ozone features.

Individual features are modelled as GPs driven by a set of variables describing background temperature, wind speed, planetary boundary layer height, ozone precursor emission rates and ambient concentrations.
The covariates of each feature model are selected through a forward selection algorithm based on a combination of goodness-of-fit statistics. The models are then fitted by maximizing the GP profile-likelihood. The fits and predictive capabilities of individual ozone feature models are evaluated through diagnostic tests, cross-validation and feature forecasting. Here, the forecasts are made for the 4th day of the 2006 CMAQ output across a complex spatial domain including LFV and surrounding mountains.

The ozone feature models proved their capability in forecasting the complex non-linear structures of the spatial ozone features, where both the regional-scale patterns and localized details of the true features are captured by model forecasts with good numerical accuracy (Section 4.6). Temporal ozone feature models displayed appropriate goodness-of-fit through low values of cross-validation RMSE (Section 4.5). With the exception of the night-time forecast of $P_1$, the temporal ozone feature models satisfactorily forecasted the temporal patterns and values of the true features (Section 4.6).

By combining the predicted ozone features via equation 4.13, forecasts were also made for the complete space-time ozone field. The feature-based ozone model is able to forecast the hourly LFV ozone fields at high spatial resolution: compared to the true ozone fields being forecasted, the statistical model predictions captured the detailed local ozone variations both in the lower-valley region and across the north shore mountains. The forecasting accuracy was especially good during the important daily ozone peak hours, with low RMSEs of about 4 ppb to 7 ppb and near-zero prediction biases.
Use of ozone feature models in Chapters 5 and 6

Compared to the spatial domain of the ozone field modelled in this chapter, the CMAQ evaluation analyses in the following chapters involve the modelling of a much smaller LFV sub-domain: the area within the boundary of Metro Vancouver monitoring stations. Hence, the ozone feature models developed here should be well qualified to model a simpler ozone field, thus serving their original intended purpose of CMAQ evaluation.

An efficient and capable model of space-time ozone process

The analyses in this chapter showed that a complex space-time ozone field can be modelled through a few ozone features, i.e., data components with simpler structures. This feature-based ozone model combines the methods of PCA and GP, and it is a novel approach for modelling a space-time air pollution process. Furthermore, several variables are identified to be useful for modelling ozone features, and data on wind direction can be used to create a useful new variable representing the space-time field of antecedent ozone precursor concentrations.

In addition to the already established modelling capability, the feature-based ozone model is also a computationally efficient means of modelling a large air quality dataset. Let $N$ be the size of the data used in modelling; the ozone feature models have $N = n$ or $N = t$, while the direct modelling of “raw” data has $N = n \times t$. The computational cost of GP modelling is $O(N^3)$ (Sacks et al., 1989), which means that the computational load grows with the cube of $N$. Hence, the computational efficiency of the statistical models is sensitive to data size. In this case, we are comparing the computation loads involving $N = 229$ or $N = 48$ with $N = 229 \times 48 = 10992$. This notable computational efficiency will prove useful when emulating an AQM process, which typically generates large datasets.

Lastly, statistical theory suggests that feature-based predictions of ozone via reconstruction may be biased.
As discussed, the predictions of individual ozone features are unbiased in the sense that they are based on BLUPs. However, the equation (4.13) for the ozone field is not a BLUP. This topic is also explored in my research and I found that the problem of prediction bias is not an overriding concern here. Appendix C.2 presents statistical analysis of prediction bias.

Chapter 5
AQM Evaluation I: Comparison of Ozone Features and Modelling of Feature Differences

The conventional way of evaluating air quality models is to compare model outputs and observations at a point location and time (Dennis et al., 2010). The output-observation differences are then summarized by statistics such as RMSE and MPE (mean percentage error). Preisendorfer and Barnett (1983) and Willmot et al. (1985) further used sampling methods to estimate the statistical confidence and significance of error statistics. However, while point-based comparison can be useful “up to a point”, without a process-level understanding of the compared air pollution fields at hand, any observation-model agreement (or disagreement) should be deemed “fortuitous” (Dennis et al., 2010). The reasoning behind this assertion is discussed extensively in the introduction (Section 1.2).

As mentioned in the introduction, this research is motivated by the need for a more informative means of air quality model evaluation (Galmarini and Steyn, 2010). This thesis proposes two general AQM evaluation approaches, both based on ozone features. These methods are then implemented to evaluate CMAQ outputs for LFV ozone episodes. Both methods apply the statistical tools formulated in Chapters 3 and 4: methods for analyzing and modelling ozone features.

Evaluation I: Ozone Feature Comparison and Feature Difference Model

First, I propose to compare individual spatial and temporal ozone features between AQM output and observations.
In Chapter 3, I interpreted the types of spatial-temporal ozone features that define an LFV ozone process. In addition to capturing the structure of space-time ozone mean relationships, I also identified the most dominant pattern(s) of ozone advection (movement of the ozone plume) across the LFV.

The comparison of features between AQM and observations is a means of evaluation that addresses the need for a more insightful model evaluation. Feature-based observation-AQM comparison allows for evaluations of underlying space-time structures and dynamic processes, e.g., evaluating whether AQM can capture observed patterns of ozone advection, or model the east-west variation of ozone means caused by the westerly wind regime (typical of an ozone episode in LFV).

Secondly, I propose to statistically model the ozone feature differences using the GP model and covariates determined in Chapter 4. Feature difference between AQM and observation can be summarized into error statistics such as RMSE and other tests for significance. However, such analysis only provides one summarized value of observation-AQM distance without providing insight into the pattern and behaviour of the observation-AQM difference.

By analyzing the statistical association between observation-AQM feature difference and various conditions of an AQM run, one may (1) identify the specific AQM input(s) most responsible for its modelling deficiencies, and (2) associate the observation-AQM difference with the spatial or temporal variations of said model input(s). This is another means of informative evaluation of AQM.

AQMs such as CMAQ model the grid-cell ozone average calculated from a set of initial and boundary conditions, whereas the observations are recordings of air pollution levels at points in space and time (Section 1.2). In other words, AQM outputs and physical observation are defined by discrepant spatial scales and processes (Dennis et al., 2010), which can make direct
observation-model comparisons questionable.

The spatial discrepancy between computer models and observations is not an obvious problem when comparing spatial ozone features $E_j$. This is because the values in an $E_j$ are no longer indexed by either grid-cell average or point location: the $E_j$ are spatial weights that describe specific patterns of ozone variation in space. We are comparing data structures or “summaries” rather than data values, thereby avoiding the problems stemming from direct data comparison. This is an important point that I have not seen raised by existing literature in PCA-based AQM evaluation (those reviewed in Section 1.4).

In this chapter, the proposed “AQM Evaluation I” will be described in detail and implemented to evaluate the CMAQ performance against physical observations for 5 ozone episodes in 1985, 1995, 1998, 2001 and 2006. Although the chapter focuses on the first method, the second method is also outlined here to give an overview.

Evaluation II: Comparison of AQM and Observation as Stochastic Ozone Processes

Another proposed AQM evaluation approach aims to answer the question: given the same basic conditions in background weather and precursor pollution, will AQM and the physical process produce similar ozone features?

This evaluation is implemented by first building GP-based ozone feature models for both AQM ozone and physical observation. Comparisons are then made between the ozone features produced by the two processes (model predictions) under the same covariate settings that represent various background conditions.

At a given covariate setting, a GP model can produce the estimated process mean and standard deviation at this particular condition. By comparing the outputs of two GP models under the same setting, one compares the statistical properties of the two ozone processes that generated the AQM outputs and observation data. This point will be discussed more extensively in Chapter 6.
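To make the idea of "comparing two GP models under the same covariate setting" concrete, the following is a minimal sketch under strong simplifying assumptions, not the models fitted in this thesis. It uses a zero-mean GP with a squared-exponential kernel and fixed, hand-picked hyperparameters (the thesis estimates its models by profile likelihood); the covariates and feature values are synthetic, and the names sq_exp_kernel and gp_predict are illustrative.

```python
import numpy as np


def sq_exp_kernel(a, b, length=1.0, variance=1.0):
    """Squared-exponential covariance between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return variance * np.exp(-0.5 * d2 / length ** 2)


def gp_predict(x_train, y_train, x_new, noise=1e-2, length=1.0, variance=1.0):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k_tt = sq_exp_kernel(x_train, x_train, length, variance)
    k_tt = k_tt + noise * np.eye(len(x_train))
    k_nt = sq_exp_kernel(x_new, x_train, length, variance)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_nt.T)
    mean = k_nt @ alpha
    var = np.clip(variance - (v ** 2).sum(axis=0), 0.0, None)
    return mean, np.sqrt(var)


# Synthetic stand-ins: two standardized covariates (e.g. temperature, wind speed)
# and one feature value per covariate setting for the "model" and the "observations".
rng = np.random.default_rng(0)
covariates = rng.uniform(-1.0, 1.0, size=(30, 2))
feature_model = np.sin(2.0 * covariates[:, 0]) + 0.5 * covariates[:, 1]
feature_obs = feature_model + 0.3  # hypothetical systematic offset

# Evaluate both fitted processes at the same background condition.
setting = np.array([[0.2, -0.1]])
mean_c, sd_c = gp_predict(covariates, feature_model, setting)
mean_o, sd_o = gp_predict(covariates, feature_obs, setting)
mean_gap = float(mean_o[0] - mean_c[0])
print(mean_gap, float(sd_c[0]), float(sd_o[0]))
```

Because the hyperparameters are fixed and shared here, the two predictive standard deviations coincide; with separately estimated hyperparameters, as in a real application, they would generally differ and become part of the comparison.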
Figure 5.1 shows in diagrammatic form the central idea behind the proposed AQM evaluation (in the context of CMAQ) and its departure from the conventional point-to-point AQM evaluation. As I discussed in Section 1.2, AQM output and observation data are generated by different air pollution processes, and direct data comparison only serves to inform the deviation in their values, not the difference in their behaviour as air pollution processes.

Figure 5.1: Schematics of the idea behind the “AQM/CMAQ Evaluation II” and the “traditional” point-to-point approach.

My second proposed evaluation method provides the type of insights not obtainable from either point-to-point evaluations or mere comparison of air pollution features. The advantage of such analysis is that it evaluates the AQM simulated ozone field against observations as comparable stochastic processes: both are described by GPs that are governed by the same sets of background conditions.

Feature Correspondence during AQM Evaluation

To successfully implement my proposed AQM evaluation, it is crucial to ensure feature correspondence: the evaluated AQM and observed ozone features are indeed comparable. This point was discussed in Chapter 3, and the analyses in that chapter formulated a set of PCA procedures and interpretations of ozone features that address the topic of feature comparability. These analyses are means of addressing the complications from EOF/PC sampling uncertainty. The following is a recap.

First, I found that PCA of the original uncentered $O_{t \times n}$ will extract the spatial and temporal means as the first feature. Specifically, $E_1$ will capture the structure of the mean field (temporal means) and $P_1$ represents the time-series of hourly spatial ozone means.
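The behaviour of the first feature under uncentered PCA can be illustrated with a small numerical sketch. It assumes the decomposition $O = P E^T$ with the columns of $E$ taken as unit-norm eigenvectors of $O^T O$ and $P = O E$; the ozone field is synthetic, the function name uncentered_pca is illustrative, and the sign convention applied to the eigenvectors is my own assumption rather than a step taken from this thesis.

```python
import numpy as np


def uncentered_pca(o):
    """Decompose a t-by-n matrix as O = P E^T without column-centering."""
    o = np.asarray(o, dtype=float)
    # Right singular vectors of O are the unit-norm eigenvectors of O^T O.
    _, sing, vt = np.linalg.svd(o, full_matrices=False)
    e = vt.T
    # Eigenvectors are defined only up to sign; make each column sum non-negative.
    signs = np.where(e.sum(axis=0) < 0, -1.0, 1.0)
    e = e * signs
    p = o @ e  # temporal features carry the magnitude of the data
    return e, p, sing ** 2


# Synthetic field: 48 hours by 6 locations with an east-west gradient in the
# mean and a diurnal cycle whose amplitude also varies from west to east.
rng = np.random.default_rng(1)
t, n = 48, 6
east = np.linspace(0.0, 1.0, n)
diurnal = 20.0 * np.sin(2.0 * np.pi * np.arange(t) / 24.0)
ozone = (40.0 + 15.0 * east)[None, :] + diurnal[:, None] * (0.5 + east)[None, :]
ozone = ozone + rng.normal(scale=1.0, size=(t, n))

e, p, eigvals = uncentered_pca(ozone)

# E_1 against the mean field (temporal means at each location).
mean_field = ozone.mean(axis=0)
cos_e1 = float(e[:, 0] @ mean_field / np.linalg.norm(mean_field))
# P_1 against the time series of hourly spatial means.
corr_p1 = float(np.corrcoef(p[:, 0], ozone.mean(axis=1))[0, 1])
explained = eigvals / eigvals.sum()
print(cos_e1, corr_p1, explained[0])
```

In this toy example the first spatial feature is almost parallel to the mean field and the first temporal feature tracks the hourly spatial means, which is the property that makes the first features of two datasets directly comparable.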
Later analyses will also show that the observed $E_1$ and $P_1$ also capture the mean structures of observation data. Therefore, the PCA of uncentered $O_{t \times n}$ ensures that the first and the most important ozone features are indeed comparable, and that they are well-estimated.

Secondly, I interpreted the other leading features as dynamic data modes that capture the space-time patterns of diurnal ozone advection, as well as the area and magnitude of ozone plume formation. These dynamic structures are interactions between spatial and temporal ozone contrasts ($E_j$ and $P_j$). I also found that certain leading features capture more localized ozone variations during the less important nocturnal hours. Moreover, PCA sampling stability analysis indicated that the aforementioned LFV feature interpretations are applicable to smaller sub-domains. Detailed understanding of ozone features allows for an informative and defensible evaluation of AQM features: the evaluated features are comparable because we already understand what they are.

Lastly, the test of ozone feature degeneracy can inform on which features can be analyzed individually or jointly. The analysis of feature degeneracy is a means of measuring the extent of EOF estimation error during AQM evaluation.

The Purpose of the Proposed Evaluation Methods

In summary, my proposed AQM evaluation approaches are designed to address the shortcomings of conventional methods: lack of informative comparison due to observation-AQM discrepancies in physical scale, ozone-producing conditions and the underlying stochastic process. The proposed methods in this thesis also aim to add to the existing “statistical toolset” for PCA-based AQM evaluations, via rigorous statistical analysis and modelling of air pollution features.

Usually, simulation or sampling-based approaches are used to assess the usefulness of a statistical model evaluation method.
However, such an exercise is mainly beneficial when developing a statistical measure that summarizes point-to-point differences between two datasets.

The proposed CMAQ evaluation involves the modelling of CMAQ and observation ozone features, as well as the feature differences, as processes driven by background meteorology and atmospheric pollution. Although there are ways of simulating LFV ozone fields driven by temperature and wind (Appendix A.1), it is difficult to build an ozone simulation that accurately describes the complex interactions between LFV emission, meteorology and surrounding topography. In fact, the best ozone simulation is from CMAQ itself.

Therefore in this study, the means of assessing the usefulness of an evaluation method will be based on whether this method can provide results and insights into observation-CMAQ differences that are sensible and explainable by existing knowledge of the LFV pollution process.

5.1 Evaluation Methods and Strategy

I denote the CMAQ ozone output as $O^c_{t \times n}$ and observed ozone data as $O^o_{t \times n}$. Suppose either dataset is decomposed into $E_{n \times n}$ and $P_{t \times n}$; $E_j$ and $P_j$ then represent spatial and temporal features. Let $E^c_j$ and $P^c_j$ denote the features of CMAQ, and $E^o_j$ and $P^o_j$ the features of observations. Ozone feature differences will be denoted by $E^d_j = E^o_j - E^c_j$ and $P^d_j = P^o_j - P^c_j$ for any $j$. Furthermore, PCA will be implemented on the original data without column-centering, and rotation of $E_j$ is not applied (Section 3.2).

In the previous 2 chapters, $E_j$ is analyzed and modelled as unitless (normalized) eigenvectors of $O^T O$. In the context of CMAQ evaluation, the comparison of $E^c_1$ and $E^o_1$ is the unitless comparison of spatial variations of temporal ozone means.
That is, the spatial weights in the $E_1$'s are compared without considering the magnitude of ozone values from both data.

In order to evaluate CMAQ's capability in capturing both the spatial patterns and the numerical values of the observed mean fields, one may multiply $E^c_1$ and $E^o_1$ by the square roots of their respective PCA eigenvalues, $\sqrt{\lambda^c_1}$ and $\sqrt{\lambda^o_1}$. This way, the magnitudes of data values are incorporated into the spatial ozone features. I denote this scaled $E_1$ as $\tilde{E}_1 = E_1 \sqrt{\lambda_1}$, and this can be done for all $j \ge 2$ ozone features under evaluation. The observation-CMAQ difference of scaled-EOF is denoted as

$$\tilde{E}^d_j = E^o_j \sqrt{\lambda^o_j} - E^c_j \sqrt{\lambda^c_j}, \quad j = 1, \ldots, p.$$

It is worth noting that $\sqrt{\lambda_1}$-scaling is done for the purpose of this AQM evaluation. The eigenvalues are always incorporated into $P_j$ for my specific data decomposition (Section 3.2), so the reconstruction of $O_{t \times n}$ still requires the use of $E_1$.

5.1.1 Model of Feature Differences $\tilde{E}^d_j$ and $P^d_j$

Kennedy and O'Hagan (2001) formulated an often used mathematical relationship between computer model outputs and their associated physical data. Using the ozone feature notation of this thesis, for scaled-EOF and PC the formulas are:

$$\tilde{E}^o_j = \tilde{E}^c_j \mid X_{\tilde{E}^c_j} + \tilde{E}^d_j \mid X_{\tilde{E}^d_j} + \varepsilon_{E_j}, \quad \text{and} \tag{5.1}$$

$$P^o_j = P^c_j \mid X_{P^c_j} + P^d_j \mid X_{P^d_j} + \varepsilon_{P_j}. \tag{5.2}$$

Here, $\tilde{E}^d_j$ and $P^d_j$ are the spatial and temporal ozone feature differences: they are multivariate random processes representing the modelling deficiencies of CMAQ. Equation 5.1 is also applicable to $E_j$, but the following CMAQ evaluation will focus on $\tilde{E}_j$ due to its easier interpretability.

A few more comments regarding the observation-CMAQ relationship formulated by equations (5.1) and (5.2):

• $X_{\tilde{E}^c_j}$ and $X_{P^c_j}$ are the covariates representing CMAQ ozone features.
• $X_{\tilde{E}^d_j}$ and $X_{P^d_j}$ are covariates of the random processes representing CMAQ modelling deficiency; they are either equal to $X_{\tilde{E}^c_j}$ and $X_{P^c_j}$ or are subsets of them.
  The latter case indicates that CMAQ inadequacies are influenced by a few particular CMAQ inputs.
• $\varepsilon_{E_j}$ and $\varepsilon_{P_j}$ are random observation errors, assumed to be i.i.d. $N(0, \sigma_{E_j})$ and $N(0, \sigma_{P_j})$.
• The above formulation implies that $\tilde{E}^c_j + \tilde{E}^d_j$ and $P^c_j + P^d_j$ are the true underlying mean processes representing the $j$-th ozone feature. Physical observation is the sum of the true ozone feature and random error.
• The $\tilde{E}^d_j$ (similarly for $P^d_j$ and $E^d_j$) is defined as $\tilde{E}^o_j - \tilde{E}^c_j$: a negative difference is interpreted as a CMAQ over-estimate of the observed feature and vice versa.

The model diagnostic and goodness-of-fit assessments in Sections 4.5 and 4.6 supported the GP assumption for individual ozone features of orders $j = 1, 2, 3$. If one regards $E_j$ and $P_j$ as Gaussian Processes, then it is reasonable to model the $\tilde{E}^d_j$'s and $P^d_j$'s as GPs.

For temporal computer models, Guttorp and Walden (1987) further proposed that the model deficiency be represented by two terms: one for inadequacy in describing the physical system, one for not tracking the extreme observations due to model outputs being temporal averages. The second source of model deficiency should not be a concern here: the CMAQ runs are specifically used to simulate an extreme event (ozone episode), and the outputs are generated on a high-resolution spatial grid during the same time period as the physical process. Hence, a single term AQM deficiency process ($\tilde{E}^d_j$ or $P^d_j$) is sufficient for my evaluation.

By analyzing the importance of each covariate in $X_{\tilde{E}^d_j}$ and $X_{P^d_j}$ for the processes $\tilde{E}^d_j$ and $P^d_j$, one may formulate insights into the underlying association between the magnitude of ozone feature difference and a specific covariate/input of the AQM modelling run.
It provides the CMAQ modeller with a statistical reference for calibrating and interpreting CMAQ outputs.

With feature differences $\tilde{E}^d_j$ and $P^d_j$ modelled as Gaussian Processes, their covariates $X_{\tilde{E}^d_j}$ and $X_{P^d_j}$ are selected through the iterative improvement algorithm proposed in Chapter 4. This algorithm adds the model covariates one by one (from a candidate set of covariates) into a GP model until no more statistically significant covariates are left. In addition to finding the most parsimonious form of model formulation, this model selection method is also designed to rank the model covariates in terms of their statistical importance in modelling a GP. It has shown proficiency in finding the appropriate forms of GP models for ozone EOFs and PCs (Sections 4.5 to 4.7). In this AQM evaluation, the candidate covariate sets for $X_{\tilde{E}^d_j}$ and $X_{P^d_j}$ are, respectively, the temporal and spatial means of CMAQ model variables: NOx and VOC emission rates and antecedent concentrations, temperature, wind speed and planetary boundary layer height.

This iterative improvement procedure can also be viewed as significance tests of the statistical associations between individual AQM model inputs and feature differences.

5.1.2 PCA of CMAQ Outputs and Observation Data

The PCA is implemented individually for CMAQ output and observation data, then $E_j$ and $P_j$ of the same order between CMAQ and observations are compared. Here, CMAQ-WRF-SMOKE outputs for the 5 episodes (Table 2.1) are interpolated onto $n = n_{obs}$ locations, where $n_{obs}$ is the number of LFV observation locations available for a given year. The interpolation is done for both the ozone data and model variable data: temperature, wind speed, boundary layer height, NOx and VOC emission rates and ambient concentration. This way, the original computer model outputs will be placed on a comparable space and time domain as the physical observation.
Readers may refer back to Section 2.2 for the way data are processed for evaluation use. Furthermore, for both CMAQ and observation of all episodes, I used the data from the middle 3 full days (complete diurnal cycles) of the episode. When I use the term “episode mean”, I am referring to the ozone averaged over this 3 day period.

Table 5.1 shows the portion of data variation explained by the first 8 ozone features of $O^c_{t \times n}$ and $O^o_{t \times n}$. As the table shows, the underlying mean structure ($E_1$) dominates the data variation, and the amount of variation explained by features of orders $j \ge 2$ decreases rapidly to $\approx 0$ from $j = 5$ onward. CMAQ and observed features of the same order explain similar proportions of their data variations.

                     The order j of $E_j$ and $P_j$
Episode         1      2      3      4      5     ...     8
1985 CMAQ     0.94   0.03   0.02   0.01   0.00    ...   < 0.00
1985 Obs.     0.94   0.02   0.01   0.01   0.00    ...   < 0.00
1995 CMAQ     0.94   0.03   0.02   0.01   0.00    ...   < 0.00
1995 Obs.     0.94   0.02   0.01   0.01   0.00    ...   < 0.00
1998 CMAQ     0.94   0.03   0.02   0.01   0.00    ...   < 0.00
1998 Obs.     0.95   0.02   0.01   0.01   0.00    ...   < 0.00
2001 CMAQ     0.93   0.03   0.01   0.01   0.00    ...   < 0.00
2001 Obs.     0.94   0.02   0.01   0.01   0.01    ...   < 0.00
2006 CMAQ     0.96   0.02   0.01   0.01   0.00    ...   < 0.00
2006 Obs.     0.93   0.02   0.01   0.01   0.01    ...   < 0.00

Table 5.1: Proportion of data variation explained by ozone features of orders $j = 1, 2, 3, 4, 5, \ldots, 8$.

Table 5.2 shows the RMSEs from data reconstruction $\sum_{j=1}^{p} P_j E_j^T$ for increasing value of $p$. These results show from a prediction perspective the amount of data variation that can be recovered by successive ozone features. Here, the improvement in RMSE gradually decreases between $p = 2$ and $p = 4$, and from $p = 4$ onward the improvement in RMSE becomes $\le 1$ ppb for both CMAQ outputs and observations of all episodes. At $p = 1$ and $p = 2$,
the RMSE of observation is smaller than CMAQ for 3 out of 5 episodes. However, starting from $p = 4$, the reconstruction RMSEs of CMAQ become smaller than the RMSEs of observation for all episodes except for 1995.

                 The number p used for data reconstruction
Episode          1      2      3      4      5     ...     8
1985 CMAQ      8.62   6.07   4.13   2.76   2.01    ...   0.60
1985 Obs.      9.84   7.73   6.02   4.94   4.12    ...   2.10
1995 CMAQ      8.88   6.62   4.56   3.73   3.14    ...   1.53
1995 Obs.      5.34   4.21   3.39   3.01   2.61    ...   1.77
1998 CMAQ     10.14   7.71   5.51   3.79   3.14    ...   1.87
1998 Obs.      7.55   5.98   4.86   4.19   3.68    ...   2.54
2001 CMAQ      9.42   6.59   5.10   3.92   3.29    ...   1.95
2001 Obs.      7.53   6.41   5.45   4.78   4.21    ...   2.91
2006 CMAQ      6.59   4.70   3.77   2.92   1.95    ...   1.16
2006 Obs.      7.01   5.87   4.89   4.09   3.49    ...   2.27

Table 5.2: Data reconstruction RMSE at $p = 1, 2, 3, 4, 5, \ldots, 8$. The units are ppb.

As discussed in Chapter 3, each $E_j$ has an associated eigenvalue $\lambda_j$. If $\lambda_j$ is not statistically significantly different from $\lambda_{j+1}$ or even higher-order eigenvalues, then $E_j$ forms a degeneracy set with higher-order EOF(s). The consequence is that these “degenerate EOFs” may suffer mixing of data features/patterns, making them difficult to analyze (North et al., 1982; Monahan et al., 2009), and the order of these EOFs may also be arbitrary (Cohn and Dennis, 1994; Hannachi et al., 2007).

This question of “ozone feature separability” is important for feature-based CMAQ evaluation. Suppose $E^c_2$ from CMAQ is clearly distinguishable from the rest, but the observed $E^o_2$ and $E^o_3$ form a degeneracy set; then it is not clear whether $E^c_2$ should be compared to $E^o_2$ or $E^o_3$. This matter is made worse by the orthogonality constraint of EOFs, because any mismatch between $E^c_j$ and $E^o_j$ may be carried over onto higher-order features.

The eigenspectrum (North et al., 1982) is often used to assess the level of EOF degeneracy of a given dataset. The idea and methodology of the eigenspectrum are discussed in Section 3.2 and implemented in Section 3.3 to assess
the orders of feature degeneracy of LFV ozone. For CMAQ evaluation, I also produced eigenspectra for the $n = n_{obs}$ interpolated CMAQ output and observed ozone data. The results are summarized in Table 5.3. As shown, ozone feature separability can be categorized into two groups: one that defines the 1985, 1995 and 1998 episodes, one that defines the 2001 and 2006 episodes.

Episode (wind regime)    Orders of ozone feature separability
All episodes             $E_1$ is separable from higher-order features.
1985 (I-IV-IV)           For both CMAQ and observations: features of orders
                         j = 2, 3 form a couplet, and the feature of order
                         j = 4 is separable from the rest.
1995 (III-III-III)       The same as 1985.
1998 (II-III-II)         The same as 1985.
2001 (II-II-II)          j = 3, 4 features form a couplet for CMAQ; feature
                         inseparability starts at j = 2 for observations.
2006 (I-I-III)           The same as 2001.

Table 5.3: The types of ozone feature separability of both CMAQ and observations of all episodes; the parentheses show the wind regime type(s) of each episode. The conclusions are drawn based on eigenspectra obtained from the PCA of individual episodes.

5.1.3 Evaluation Strategy

It was concluded in Section 3.3 that ozone features of order $j \le 4$ should be the focus of analysis. In Chapter 4, I further built ozone feature models for $E_j$ and $P_j$ of orders $j = 1, 2, 3$. There, the $j = 4$ feature is not modelled due to degeneracy with $j > 4$ features. The modelling and forecasting exercises in Sections 4.6 and 4.7 showed that by modelling only the first 3 spatial and temporal ozone features, one can closely model a complex space-time ozone process.

Considering the aforementioned results, the CMAQ evaluation will be focused on $E_j$ and $P_j$ at $j = 1, 2, 3$. Based on Table 5.3, the proposed feature-to-feature evaluation between CMAQ and observation is implemented in the
following individual analyses:

• Both original and $\sqrt{\lambda_1}$-scaled $E_1$ are compared directly between CMAQ and observations for all episodes. The 1st-order spatial ozone feature differences $E^d_1 = E^o_1 - E^c_1$ and $\tilde{E}^d_1 = E^o_1 \sqrt{\lambda^o_1} - E^c_1 \sqrt{\lambda^c_1}$ will be modelled as Gaussian Processes driven by CMAQ input conditions.
• The same evaluation will be done for the 1st-order temporal feature $P_1$.
• For the 1985, 1995 and 1998 episodes, the 2nd and 3rd-order features will be compared jointly. The methods in Krzanowski (1979) and Cohn and Dennis (1994) will be used to calculate the “joint distance measures” between CMAQ modelled and observed features. I will also compare the joint spatial-temporal features $P_2 E_2^T + P_3 E_3^T$ to examine any difference in their ozone advection patterns and other underlying dynamic processes.
• For the two most recent episodes during 2001 and 2006, I will refrain from feature-to-feature comparison at orders j ≥

Discussion of Evaluation Methods

The combined results in this section have revealed that the regional ozone field of LFV is simple enough to be dominated by one leading ozone feature, and have shown high levels of ozone feature inseparability that make higher-order feature-based CMAQ evaluation difficult. However, this should not detract from the main purpose of this chapter as well as Chapter 6, which is the development of AQM evaluation methods. Furthermore, if an ozone process is simple enough to be dominated by one leading feature, then one may say that the CMAQ evaluation based on this feature alone is a near complete evaluation of CMAQ.

In the following sections, ozone feature comparison statistics will be shown for all episodes. A more detailed ozone feature comparison and modelling of feature differences will be focused on either 1995 or 2001: one episode from each type of feature degeneracy shown in Table 5.3.
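For the joint comparison of a degenerate couplet outlined in the evaluation strategy above, one generic way to compare two sets of features as subspaces is through principal angles. The sketch below is offered only as an illustration of that idea under my own assumptions: it is not claimed to reproduce the exact statistics of Krzanowski (1979) or Cohn and Dennis (1994), the function name principal_angles_deg is illustrative, and the two couplets are toy vectors.

```python
import numpy as np


def principal_angles_deg(basis_a, basis_b):
    """Principal angles (in degrees) between the column spaces of two matrices."""
    qa, _ = np.linalg.qr(np.asarray(basis_a, dtype=float))
    qb, _ = np.linalg.qr(np.asarray(basis_b, dtype=float))
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))


# Toy couplets in a 4-location domain: the "observed" pair shares one direction
# with the "model" pair and has its second direction tilted by 30 degrees.
theta = np.radians(30.0)
pair_model = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [0.0, 0.0],
                       [0.0, 0.0]])
pair_obs = np.array([[1.0, 0.0],
                     [0.0, np.cos(theta)],
                     [0.0, np.sin(theta)],
                     [0.0, 0.0]])

angles = principal_angles_deg(pair_model, pair_obs)
# Swapping the order of the features within a couplet leaves the angles unchanged.
angles_swapped = principal_angles_deg(pair_model, pair_obs[:, ::-1])
print(angles, angles_swapped)
```

The invariance to column order is the reason a subspace-level measure is attractive when the ordering of features inside a degeneracy set may be arbitrary.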
The CMAQ evaluations in Chapters 5 and 6 are done individually for each episode; an alternative is to use combined CMAQ outputs and observations from all available episodes. Due to the highly consistent nature of LFV's dominant features (Section 3.4), per-episode analysis still allows for a systematic evaluation of CMAQ features against the observed features. Per-episode evaluation further has the potential to uncover episode-specific deficiencies in the CMAQ system: each CMAQ run is done under episode-specific atmospheric conditions, emission levels and spatial patterns (Steyn et al., 2013).

5.2 Comparison of the Mean Fields, $\tilde{E}_1$ and $E_1$

Figures 5.2 and 5.3 show the spatial plots of temporal ozone means (the mean fields) from CMAQ output and observation data. The spatially continuous plots are created by applying a cubic-spline smoothing that interpolates the irregular spatial data into a smooth spatial field within the longitude-latitude bound of the $n_{obs}$ locations. The between-episode difference in spatial domain is due to the different locations of available observations. The colour scale is set to be the same for each episode, because the focus of comparison here is between CMAQ output and observations from the same episode, not across episodes. As Figures 5.2 and 5.3 show, with the exception of 1985, the episode means (ozone averaged across time) of CMAQ are near uniformly higher than observed throughout LFV.

5.2.1 General Features of $\tilde{E}^d_1$ and $E^d_1$

Figure 5.4 compares $E^c_1$ and $E^o_1$ for the 2001 episode, where the bottom plot shows the spatial feature difference $E^d_1 = E^o_1 - E^c_1$. Figure 5.5 shows $E^d_1$ from the other 4 episodes. As shown for all episodes, the observed $E_1$ varies over a wider range of values than the CMAQ feature: higher maximum in the east and lower minimum in the west. However, both features exhibited similar patterns of east-west variation.
These results imply that CMAQ modelling is able to capture a spatial variation of ozone means similar to that of the physical observations. However, compared to CMAQ modelled ozone, the observed ozone process is governed by a more pronounced space-time variation: $E_1^o$ has a larger range and variance than $E_1^c$.

Figure 5.2: Spatial plots of temporal ozone means (the mean fields) of CMAQ outputs and observation data of the 1985, 1995 and 1998 episodes. For the same episode, the colour scale is the same in order to aid comparison.

Figure 5.3: Spatial plots of temporal ozone means (the mean fields) of CMAQ outputs and observation data of the 2001 and 2006 episodes. For the same episode, the colour scale is the same in order to aid comparison.

Figure 5.4: For the 2001 episode: plots of $E_1^c$ (top), $E_1^o$ (middle) and $E_1^d$ (bottom).

The scaled EOF $\tilde{E}_1 = E_1\sqrt{\lambda_1}$ captures the spatial variation of the mean field while taking the magnitude of $O_{t \times n}$ into account. Figure 5.6 compares $\tilde{E}_1^c$ and $\tilde{E}_1^o$ from 2001, and Figure 5.7 shows the $\tilde{E}_1^d$ of all other episodes. As shown, the comparison of $\tilde{E}_1$ is analogous to the comparison of mean fields calculated from the data (Figures 5.2 and 5.3). Hence, the comparison between $\tilde{E}_1^c$ and $\tilde{E}_1^o$ is a means to evaluate CMAQ's capability in capturing not only the spatial variations of temporal ozone means in LFV, but also their magnitudes. This is a spatial ozone feature evaluation using data information summarized across the hours of the episode.

Figures 5.6 and 5.7 show that the $\tilde{E}_1^d$ values are all negative for the 1995, 1998 and 2001 episodes, nearly all negative for 2006, and mostly positive for 1985. Hence, when evaluated against observations, CMAQ almost systematically over-estimated the temporal ozone means throughout the triangular LFV region. In other words, the mean fields produced by CMAQ modelling
tend to have uniformly higher values. Moreover, the magnitude of the CMAQ over-estimate is more pronounced in the west, around the city of Vancouver and its suburbs, than in the eastern LFV.

Figure 5.5: Plots of $E_1^d$ from the 1985, 1995, 1998 and 2006 episodes.

Table 5.4 shows the angles between $\tilde{E}_1^c$ and $\tilde{E}_1^o$ for all 5 episodes. The angle between two $E_j$ vectors is a distance measure between the CMAQ modelled feature and the corresponding observed feature. Cohn and Dennis (1994) used the vector angle to quantify the closeness between the EOFs of acid deposition model outputs and observations. The authors regarded angles ≤ 15° as reasonably low, indicating good observation-model agreement. The values in Table 5.4 show that the $E_1$ angles are 8 to 11° for all episodes. Hence, despite the CMAQ over-estimates just discussed, there is consistently close correspondence between the $\tilde{E}_1$'s (the mean fields) of CMAQ and observations.

                                  Episode
                                  1985     1995     1998     2001     2006
Angle between $\tilde{E}_1$'s:    10.73°   10.21°   9.01°    7.98°    10.29°
Table 5.4: Angles between $\tilde{E}_1^c$ and $\tilde{E}_1^o$.

In summary, CMAQ tends to over-estimate the observed episode means across LFV, and this difference is captured by $\tilde{E}_1$ for each episode. In addition, $\tilde{E}_1$ has well-defined spatial structures that are consistent between most of the episodes evaluated.

Figure 5.6: For the 2001 episode: plots of $\tilde{E}_1^c$ (top), $\tilde{E}_1^o$ (middle) and $\tilde{E}_1^d$ (bottom).

Figure 5.7: Plots of $\tilde{E}_1^d$ from the 1985, 1995, 1998 and 2006 episodes.

5.2.2 Covariate Selection for $\tilde{E}_1^d$: the Difference in the Mean Fields
A follow-up to the preceding ozone feature comparison is to analyze the background factors that influence the feature differences between CMAQ modelled ozone and physical observation. These "background factors" are the model covariates of $\tilde{E}_1^d$, selected from a set of candidate model covariates using the iterative improvement algorithm (Section 5.1).
The covariate set $X_{E_j}$ = (longitude, latitude) is used to start the algorithm. Given the random observation error in (5.1), the optimization of GP functions will explicitly include a nugget term that captures the stochastic variation at a fixed point location or time.

Table 5.5 shows the selected model covariates in addition to longitude and latitude. The results reveal that for the 1985, 2001 and 2006 CMAQ modelling runs, the 1st-order observation-CMAQ feature difference is influenced solely by the mean VOC emission rates of the episodes, i.e., the spatial fields of VOC emission rates averaged across time. The 1995 and 1998 feature differences are statistically associated with either the episode mean of NOx emission or antecedent NOx concentrations.

                        Episode year
                        1985   1995      1998   2001   2006
$X_{\tilde{E}_1^d}$     VOC    NOx-lag   NOx    VOC    VOC
Table 5.5: Result of covariate selection for $\tilde{E}_1^d$. The listed covariates are those in addition to longitude and latitude. The notation "-lag" represents the atmospheric (antecedent/lagged) concentration of the precursor (notation defined in Section 4.5).

These model selection results indicate that the deviations in $\tilde{E}_1$, or the mean fields, are heavily influenced by the spatial distributions of mean precursor emission rates or antecedent concentrations. On the other hand, no meteorological variable was determined to be statistically significant through iterative likelihood testing.

For all episodes, my proposed method of CMAQ evaluation identifies SMOKE/CMAQ modelling deficiencies associated with emission inputs and chemical reaction modelling. As mentioned in Chapter 2, Steyn et al. (2011) and Steyn et al. (2013) described the efforts that went into producing the CMAQ-WRF-SMOKE data used in this thesis. The papers outlined the methods of estimating the space-time distributions of NOx and VOC emission across LFV, especially the way of estimating the year-specific spatial shift in emission sources.
The task of estimating localized emission patterns is a difficult one; this is evident from the descriptions in the aforementioned papers as well as from the complexities of SMOKE operations in general (overview in Section 1.1). Moreover, detailed space-time emission is unobservable (unlike the weather), thus CMAQ users are unable to tune SMOKE outputs against observations.

In addition to CMAQ input uncertainties, there are further uncertainties when modelling the atmospheric chemical precursor concentrations. Uncertainties in the chemical model within CMAQ come from the fact that one is typically unable to know and model every chemical reaction occurring. Rather, reactions involving one molecule serve as a proxy model for reactions of similar molecules (Finlayson-Pitts and Pitts Jr, 1999), thus modelling deficiencies unavoidably follow.

5.2.3 Detailed Analyses of $\tilde{E}_1^d$ vs. VOC Emission for the 2001 Episode
In this section, I will evaluate the 2001 ozone episode by analyzing the sensitivity of $\tilde{E}_1^d$ to the spatial variation of the episode-mean VOC emission rate. This analysis will show how, over the course of an episode, the mean VOC emission influences the difference between the mean fields of CMAQ and observations. The 2001 data are used to demonstrate that, by modelling $\tilde{E}_1^d$, one can extract insightful information regarding CMAQ's modelling deficiency. A similar analysis can be done on any other ozone feature from other episodes.

An estimated sensitivity or univariate effect plot of $\tilde{E}_1^d$ against VOC emission rate is produced using the method described in Schonlau and Welch (2006). I first fitted the GP model of $\tilde{E}_1^d$ using CMAQ-SMOKE outputs of 2001, where the covariates are longitude, latitude and the temporal mean VOC emission rates.
I then produced $\tilde{E}_1^d$ outputs at a range of VOC emission rates while integrating out the two location variables from the GP model. Thus, the univariate effect of mean VOC emission on $\tilde{E}_1^d$ can be analyzed.

Figure 5.8 shows the estimated univariate effect of $\tilde{E}_1^d = \tilde{E}_1^o - \tilde{E}_1^c$ against a range of mean VOC emission rates. The dots are $\tilde{E}_1^d$ outputs over VOC emission rates of 0.2 to 1.9 moles·sec$^{-1}$. The dashed lines are the 95% confidence interval calculated using the analytical expression derived in Schonlau and Welch (2006). The training data (2001 SMOKE output) have temporal mean VOC emission varying between 0.5 and 1.0 moles·sec$^{-1}$, plus one data point at 1.8 moles·sec$^{-1}$. This distribution of values partially explains the large standard error (wide confidence interval) associated with $\tilde{E}_1^d$ outputs between 1.3 and 1.8 moles·sec$^{-1}$.

The confidence interval indicates the statistical significance of the mean feature difference at each VOC emission rate. Given a VOC emission rate, if the confidence interval for the univariate effect on $\tilde{E}_1^d$ contains 0, then we fail to reject, at the 5% significance level, the hypothesis that the mean feature difference is 0. In other words, if the confidence interval lies entirely above or below the line $\tilde{E}_1^d = 0$ at a given VOC emission rate, then we conclude that the estimated univariate effect of VOC emission on $\tilde{E}_1^d$ is statistically significant.

We see from the sensitivity plot that:
• There is a negative trough at VOC = 0.82 moles·sec$^{-1}$, i.e., a CMAQ over-estimate of the temporal ozone mean in space. The confidence interval is below 0, indicating the statistical significance of this ozone feature difference.
• There is a positive peak at VOC = 1.18 moles·sec$^{-1}$. However, this peak value is predicted by the $\tilde{E}_1^d$ model with a lower confidence limit near 0. Thus, the statistical significance of this ozone feature difference may be in question.
As mentioned, a positive $\tilde{E}_1^d$ indicates a CMAQ under-estimate of the observed feature.

Figure 5.8: Sensitivity or univariate effect plot of $\tilde{E}_1^d$ against episode mean VOC emission rate. The blue dotted line is the estimate of $\tilde{E}_1^d$ averaged over locations for VOC emission rates of 0.2 to 1.9 moles·sec$^{-1}$, and the red triangle lines are point-wise 95% confidence intervals. The GP model of $\tilde{E}_1^d$ is fitted using 2001 CMAQ-SMOKE data.

Figure 5.9 shows the spatial field of mean VOC emission (averaged across time) from the 2001 episode, where the dataset is SMOKE output. In the same figure, the map of the LFV observation network from 2001 is also provided. Table 5.6 shows the station name associated with each location number in the network map. From Figure 5.8, the largest feature differences occur for VOC ≈ {0.7, 1.2} moles·sec$^{-1}$. These two values can be used to define three areas of LFV: (1) the area north of Abbotsford and west of Chilliwack, which has temporal (or episode) mean VOC emission of ≈ 0.7 to 0.8 moles·sec$^{-1}$, (2) the suburbs of eastern Metro Vancouver (Burnaby, etc.), which have temporal mean VOC emission of ≈ 0.8 moles·sec$^{-1}$, and (3) the small area surrounding Vancouver's city core, which has temporal mean VOC emission of ≈ 1.2 moles·sec$^{-1}$.

Number  Longitude  Latitude  Name
1       -123.16    49.26     Kitsilano
2       -123.15    49.19     YVR
3       -123.12    49.28     Robson square
4       -123.11    49.14     Richmond south
5       -123.08    49.32     Mahon park
6       -123.02    49.30     North Vancouver
7       -122.99    49.22     Burnaby south
8       -122.97    49.28     Kensington park
9       -122.90    49.16     North Delta
10      -122.85    49.28     Rocky Point Park
11      -122.79    49.29     Coquitlam
12      -122.71    49.25     Pitt Meadows
13      -122.69    49.13     Surrey east
14      -122.58    49.22     Maple Ridge
15      -122.57    49.10     Langley central
16      -122.31    49.04     Central Abbotsford
17      -121.94    49.16     Chilliwack
Table 5.6: Station names and coordinates of numbers 1 to 17 in Figure 5.9: the map of the 2001 LFV monitoring network.

A high observation-CMAQ feature difference is mostly associated with VOC emissions at the
aforementioned three areas of LFV. Two areas of interest, the suburbs of Metro Vancouver and the locations around Vancouver's city core, are also areas where the daily ozone plume forms (Section 3.1). Hence, the evidence suggests that the CMAQ over-estimation of the observed ozone field is attributable to the production of a higher-than-observed initial ozone plume by CMAQ.

Figure 5.9: For the 2001 episode: the spatial plot of episode mean VOC emission rate within the LFV region defined by $n_{obs}$ = 17 monitoring sites (top), and the map of the LFV monitoring network (bottom). The station names and coordinates associated with the numbers 1 to 17 are in Table 5.6.

Furthermore, Steyn et al. (2011) and Steyn et al. (2013) mentioned that the city of Vancouver is a "VOC sensitive" region: NOx is the dominating ozone precursor and its concentration is near saturation, hence any variation in VOC causes a noticeable change in O3 concentrations. On the other hand, the eastern LFV (Abbotsford and Chilliwack) is a "NOx sensitive" area, i.e., it has a high concentration of VOC, making O3 pollution sensitive to variation in NOx. The combined results from the preceding analyses indicate that eastern Metro Vancouver, where the ozone process begins to transition from VOC sensitive to NOx sensitive, is the area of interest: the temporal mean VOC emission produced by SMOKE for this region showed a strong statistical association with the observation-CMAQ difference in the mean ozone fields.

The univariate plot in Figure 5.8 is obtained by averaging out the effect of location. To uncover any bivariate or interaction effect of location and VOC emission on $\tilde{E}_1^d$, one can produce "$\tilde{E}_1^d$ versus VOC emission" plots at multiple locations across LFV.
In each plot, the location covariates in the $\tilde{E}_1^d$ model are fixed at a longitude-latitude setting, and $\tilde{E}_1^d$ is estimated over a range of VOC emission rates appropriate for this location.

Figure 5.10 shows the $\tilde{E}_1^d$ versus VOC emission plot at 5 locations across LFV. As shown, a negative trough at VOC ≈ 0.82 moles·sec$^{-1}$ is a common feature for locations across LFV. This result is representative of other LFV locations not shown. Figure 5.10 shows that there is little, if any, interaction effect of VOC and location on the CMAQ over-estimate (negative $\tilde{E}_1^d$) of the episode mean. The over-estimate is most noticeable when the mean VOC emission is around 0.82 moles·sec$^{-1}$, and this is a feature of CMAQ modelling deficiency that is common across LFV locations.

Figure 5.10: Sensitivity or univariate effect plot of $\tilde{E}_1^d$ against VOC emission rate at 5 LFV locations. The GP model of $\tilde{E}_1^d$ is fitted using 2001 CMAQ-SMOKE data.

5.3 Comparison of $P_1$: Hourly LFV Mean Ozone
Figure 5.11 compares $P_1^c$ and $P_1^o$ for ozone episodes 1985, 1995 and 1998, while Figure 5.12 does the same for 2001 and 2006. Figures 5.13 and 5.14 show the time series of LFV mean ozone of CMAQ output and observation data. Comparison with the $P_1^c$ and $P_1^o$ time series reveals that the 1st-order temporal features of both ozone datasets capture, in near-exact detail, the temporal patterns of their respective ozone means and the scale of the difference between the two ozone means. Therefore, the comparison of $P_1$ is equivalent to the comparison of hourly LFV mean ozone. It is worth repeating that the PCs are weighted row sums of $O_{t \times n}$, hence the high values (in units of ppb).

As shown in Figure 5.11, for the 1985 episode, the CMAQ modelled hourly LFV mean ozone corresponded closely with the observations. For the other 4 episodes, the pattern of observation-CMAQ differences can be described as a CMAQ over-estimate of observed hourly LFV ozone during both the early morning and the afternoon peak hours.
The relatively close correspondence of the 1985 temporal features is noticeable from Table 5.7, which shows the angles between component vectors $P_1^c$ and $P_1^o$. The 1985 episode has a slightly smaller angle than the other episodes, while the 2001 episode has the largest angle. However, the angles are low enough (Cohn and Dennis, 1994) that there is generally good agreement between the 1st-order temporal features of CMAQ and observations.

                         Episode
                         1985    1995     1998     2001     2006
Angle between $P_1$'s:   9.77°   10.30°   12.09°   13.02°   10.52°
Table 5.7: Angles between $P_1^c$ and $P_1^o$.

Figure 5.11: Time-series of $P_1^c$ (blue) and $P_1^o$ (red) for the 1985, 1995 and 1998 episodes.

Figure 5.12: Time-series of $P_1^c$ (blue) and $P_1^o$ (red) for the 2001 and 2006 episodes.

5.3.1 Modelling $P_1^d$
Covariate selection for $P_j^d$ is initiated by a GP model with "hour of the day" as the starting covariate, and the iterative improvement procedure is applied as before. The GP model optimizations are done by including the stochastic error term $\varepsilon_{P_j}$ in (5.2).

Table 5.8 shows the statistically significant covariates associated with $P_1^d$, i.e., the observation-CMAQ difference in hourly LFV mean.

Episode year   $X_{P_1^d}$
1985           Temp, Wind, BL, VOC
1995           BL, NOx-lag
1998           Temp, Wind, NOx-lag
2001           Temp, Wind, BL, NOx-lag
2006           Temp, Wind, NOx-lag
Table 5.8: Result of covariate selection for $P_1^d$. The listed covariates are those in addition to "hour of the day".

The descriptions "NOx-lag" and "VOC-lag" represent the hourly LFV means of NOx and VOC antecedent concentrations, and "BL" represents the hourly LFV means of boundary layer height. The iterative improvement algorithm delivered a mixture of meteorological and chemical precursor variables. Unlike the modelling of $\tilde{E}_1^d$ (Table 5.5), there are no clearly definable CMAQ inputs responsible for the difference in temporal ozone features between CMAQ
and observations. The forward selection method indeed found statistically significant covariates based on the Gaussian log-likelihood test (the selection criteria of iterative improvement, Section 4.3), but the selected covariates do not allow a clear explanation.

Figure 5.13: Time-series of hourly LFV mean ozone (averaged across space) of CMAQ output (blue) and observations (red) from the 1985, 1995 and 1998 episodes.

Figure 5.14: Time-series of hourly LFV mean ozone (averaged across space) of CMAQ output (blue) and observations (red) from the 2001 and 2006 episodes.

As the analysis in the next chapter will show, there is reason to believe that the difference in temporal ozone features is, to a large degree, caused by CMAQ not modelling certain real-world ozone processes. In Section 2.2, I mentioned the phenomenon of nocturnal ozone down-mixing: during the nocturnal hours, vertical atmospheric mixing draws the upper-layer pollution downward, causing a short-term spike in the ground level ozone at some locations in the LFV. The follow-up ozone process is that ozone is consumed by NOx at the ground level, making surface-level ozone concentrations approximately 0 ppb. However, as Figures 5.11 and 5.12 showed, CMAQ does not seem to capture the reality of NOx-initiated ozone reduction the way observations do, and it tends to over-estimate the ozone levels between 0000PST and 0400PST.
Furthermore, CMAQ and WRF do not account for the process of nocturnal ozone down-mixing.

The analysis of temporal ozone feature differences will arrive at some form of conclusion in the next chapter, where the statistical properties of ozone features are compared.

5.4 Comparison of Higher-order Features
As described in Section 5.1 and Table 5.3, for the 1985, 1995 and 1998 episodes, $E_2$ and $E_3$ have eigenvalues close enough to make the same-order feature comparison between CMAQ and observations questionable. For the 2001 and 2006 episodes, the degeneracy of the observed features starts from $j = 2$, which makes the feature-by-feature comparison even harder.

Krzanowski (1979) proposed a method of jointly comparing PCA components that was later applied by Cohn and Dennis (1994) to evaluate acid deposition models. Let $E^c_{n \times p}$ and $E^o_{n \times p}$ be matrices whose columns are the $p$ leading EOFs of CMAQ and observations, where $p = 3$ and $n = n_{obs}$ in this evaluation. Also consider a form of joint-covariance matrix $M = (E^c)^T E^o (E^o)^T E^c$, whose eigenvectors are $e_j$ and eigenvalues are $\lambda_{e_j}$, $j = 1, 2, 3$. It was shown in Krzanowski (1979) that the vectors $e_j^c = E^c e_j$ form an orthogonal basis for the subspace spanned by $E^c$, and that $e_j^o = E^o (E^o)^T e_j^c$ are their projections onto the subspace spanned by $E^o$. Furthermore, $e_1^c$ and $e_1^o$ are the closest vectors between the subspaces defined by ozone features $E^c$ and $E^o$, and their angle is calculated as $\cos^{-1}(\sqrt{\lambda_{e_1}})$. The subsequent higher-order vectors are further apart, with angles $\cos^{-1}(\sqrt{\lambda_{e_j}})$.

Table 5.9 shows, for the episodes 1985, 1995 and 1998, the angles between the vectors in $E^c_{n \times 3}$ and $E^o_{n \times 3}$ calculated using the joint-comparison method just described. These angles indicate the observation-CMAQ difference when the leading $p = 3$ ozone features are compared jointly. As shown, the angles between $e_1^c$ and $e_1^o$ are smaller than 10° for all three episodes, indicating close agreement.
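The joint-comparison calculation described above can be illustrated with a minimal sketch. It assumes that the two EOF matrices have orthonormal columns, and it uses synthetic matrices rather than the thesis data; the function name `krzanowski_angles` and all numerical settings are hypothetical choices made only for illustration.

```python
import numpy as np

def krzanowski_angles(E_c, E_o):
    """Angles (degrees) from the eigenvalues of M = E_c^T E_o E_o^T E_c.

    Assumes both inputs are n x p matrices with orthonormal columns.
    Returns the angles in ascending order (closest pair of vectors first).
    """
    M = E_c.T @ E_o @ E_o.T @ E_c
    eigvals = np.linalg.eigvalsh(M)[::-1]      # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, 1.0)       # guard against rounding error
    return np.degrees(np.arccos(np.sqrt(eigvals)))

# Synthetic example (not thesis data): n = 17 sites, p = 3 leading features.
rng = np.random.default_rng(0)
n, p = 17, 3
E_c, _ = np.linalg.qr(rng.normal(size=(n, p)))               # "model" EOFs
E_o, _ = np.linalg.qr(E_c + 0.2 * rng.normal(size=(n, p)))   # perturbed "observed" EOFs

angles = krzanowski_angles(E_c, E_o)
print(np.round(angles, 2))
```

The same angles can be obtained from the singular values of $(E^c)^T E^o$, which are the cosines of the angles; this gives a convenient cross-check on the eigenvalue route.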
While the angles between the 2nd vector sets are reasonably low, the angles between $e_3^c$ and $e_3^o$ increase significantly, to nearly 45° for the 1998 episode. These results reveal that, when compared alone, the 1st-order ozone features have good agreement between CMAQ and observations, but when the leading 3 features are compared jointly, the good agreement quickly disappears. This implies a noticeable discordance of higher-order features.

               Angles between $e_j^c$ and $e_j^o$
               j=1     j=2      j=3
Episode 1985   6.76°   10.2°    37.34°
Episode 1995   9.84°   15.0°    28.43°
Episode 1998   4.43°   15.83°   44.19°
Table 5.9: Angles from the joint comparison of the leading 3 ozone features from CMAQ output and physical measurements.

In Chapter 3, I have shown that some ozone features of orders $j \geq 2$ individually or jointly capture the dynamic patterns of ozone advection across LFV. However, difference statistics such as the ones in Table 5.9 give only one value summarizing the observation-CMAQ difference. A more informative approach is needed to compare dynamic ozone features between CMAQ and observations.

As discussed in Section 5.1, the results from the eigenspectra pointed out the closeness between the 2nd and 3rd eigenvalues of both CMAQ and observations. This implies the possibility of feature degeneracy, which requires that $P_2E_2^T$ and $P_3E_3^T$ be analyzed jointly. Therefore, one way of performing the observation-CMAQ comparison of advection patterns is to compare the sum of ozone features $P_2E_2^T + P_3E_3^T$ between CMAQ and observations.

Figures 5.15a and 5.15b compare $P_2^c{E_2^c}^T + P_3^c{E_3^c}^T$ and $P_2^o{E_2^o}^T + P_3^o{E_3^o}^T$ at selected hours on the 3rd day of 1995. For ease of comparison, all contour plots have the same range of values.
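As a rough illustration of how a joint feature such as $P_2E_2^T + P_3E_3^T$ can be formed and inspected at a given hour, the sketch below applies an SVD-based PCA to a synthetic hours-by-stations matrix. The data, the use of the uncentred matrix, and all variable names are assumptions made for illustration only; in an actual comparison the same steps would be applied separately to the CMAQ output and to the observations.

```python
import numpy as np

# Synthetic stand-in for an ozone matrix O (t hours x n stations); not thesis data.
rng = np.random.default_rng(42)
t, n = 96, 17
hours = np.arange(t)
diurnal = 30 + 20 * np.sin(2 * np.pi * (hours - 9) / 24)   # shared daily cycle (ppb)
east_west = np.linspace(-1.0, 1.0, n)                       # west (-1) to east (+1)
contrast = 5 * np.cos(2 * np.pi * (hours - 15) / 24)        # time-varying east-west contrast
O = (diurnal[:, None]
     + contrast[:, None] * east_west[None, :]
     + rng.normal(0.0, 1.0, size=(t, n)))

# PCA via SVD of the uncentred matrix (an assumption): O = U S V^T.
U, s, Vt = np.linalg.svd(O, full_matrices=False)
E = Vt.T      # columns are spatial features E_j
P = U * s     # columns are temporal features P_j = O E_j

# Joint feature P_2 E_2^T + P_3 E_3^T (zero-based columns 1 and 2).
joint = P[:, 1:3] @ E[:, 1:3].T

# Spatial contrast field at 1300 on the third day (zero-based hour index).
hour = 2 * 24 + 13
field_at_hour = joint[hour]
print(np.round(field_at_hour, 2))
```

Plotting `field_at_hour` for both datasets on a common colour scale would correspond to the kind of side-by-side contour comparison described in the text.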
In the morning hours between 0600PST and 0900PST, both CMAQ and observations captured ozone contrasts that are similar in both pattern and magnitude: the contrast is slightly > 0 ppb at the eastern and western edges of LFV and slightly < 0 ppb in the middle.

The observation-CMAQ difference emerges during the afternoon ozone peaks, where the east-west ozone contrast is more pronounced for CMAQ. At 1300PST the CMAQ ozone feature has contrast values ranging from −12 ppb to 6 ppb, whereas the observed feature has contrast values ranging from −5 ppb to 3 ppb. At 1400PST the CMAQ feature still has a noticeable ozone contrast, while nearly no spatial ozone contrast is noticeable in the observed feature. Moreover, this contrast pattern of "positive in the west and negative in the east" lasted from 1100PST to 1500PST for CMAQ and from 1100PST to 1300PST for observations. As discussed in Chapter 3, such a dynamic east-west ozone contrast captures the formation of the daytime ozone plume in LFV. Given the above results, one may conclude that CMAQ generates a higher-than-observed level of ozone plume in western LFV between mid-day and early afternoon.

The night-time dynamic ozone contrasts of CMAQ are also more pronounced than those of the observations. At 2100PST, the positive contrast in the eastern LFV is up to +25 ppb for CMAQ and +8 ppb for observations, while the negative contrast in the western LFV is down to −10 ppb for CMAQ and −4 ppb for observations. These results imply that the aforementioned CMAQ "overproduction" of daytime ozone caused a higher-than-observed level of night-time ozone in the east.

Figure 5.15: From PCA of the 1995 CMAQ output (top) and ozone observations (bottom): dynamic spatial plots of the joint ozone feature $P_2E_2^T + P_3E_3^T$ at hours 0900PST, 1300PST, 1500PST and 2100PST on the 3rd day of 1995. (a) Hours 0900PST and 1300PST. (b) Hours 1500PST and 2100PST.

Earlier analysis in Section 5.2 showed that, compared to physical observations, CMAQ persistently over-estimated the temporal ozone means throughout LFV. The comparison of dynamic ozone contrasts in this section revealed that the problem lies primarily in the fact that the CMAQ modelled process generated (and thus transported) a higher-than-observed level of ozone plume in the western LFV. This CMAQ over-production of ozone may further explain the daytime pattern of the feature difference $P_1^d$ seen in the last section (Figures 5.11 and 5.12). Since CMAQ produces more ozone than the physical process during the daytime, the spatial means of CMAQ would be higher than those of the observations, i.e., $P_1^c > P_1^o$ during some daytime hours.

Furthermore, the joint comparison of the 2nd and 3rd-order ozone features also reveals that the computer model is able to capture the overall pattern of east-to-west ozone advection that is observed physically. This is an important result that highlights WRF's capability of accurately modelling the wind patterns across LFV.

5.5 Chapter Conclusion
In this chapter, I developed and implemented means of CMAQ evaluation by combining methods of ozone PCA (Chapter 3) and ozone feature modelling (Chapter 4). Although the statistical analyses are done to evaluate CMAQ's capability to model space-time ozone, the overall methodology should apply to the evaluation of other AQMs. The central idea behind the proposed AQM evaluation is the observation-model comparison of data features (space-time structures of an air pollution field), together with statistical modelling of the feature differences.
The specific purpose of this chapter is to (1) develop the exact methods of feature-based AQM evaluation, and (2) implement these methods using CMAQ-WRF-SMOKE outputs and observation data to show the usefulness and advantages of feature-based model evaluation.

Implementation of my proposed evaluation methods revealed a few "big picture" similarities and differences between CMAQ ozone and observations. Compared to physical measurements, CMAQ tends to over-estimate episode means (averages across the hours of the episode) throughout LFV. This is observed for 4 out of the 5 episodes analyzed. However, the pattern of ozone variation across LFV is similar between CMAQ output and observations: the mean ozone levels are highest in the east and gradually decrease towards the west. Comparison of ozone features and GP modelling of feature differences identified two "sources" of this feature discrepancy:
• For all episodes, the difference in temporal means in LFV is statistically associated with the episode means of either the emission rates or the antecedent concentrations of one ozone precursor (NOx or VOC). A detailed evaluation of the 2001 episode showed that the main source of discrepancy lies in the area of LFV where the episode-averaged VOC emission rates are between 0.7 and 1.2 moles·sec$^{-1}$. This corresponds to the middle LFV, especially eastern Metro Vancouver. This region is where much of the daily ozone plume forms, and it is also an area where the local ozone process transitions from VOC-sensitive to NOx-sensitive. Furthermore, the CMAQ over-estimation of the observed episode mean is expected to be most pronounced when VOC ≈ 0.82 moles·sec$^{-1}$. This is a feature of CMAQ deficiency expected at all LFV locations.
• Certain ozone features are what I refer to as "dynamic ozone contrasts" (Section 3.4); they capture the most dominant patterns of ozone plume advection across LFV and the magnitude (in ppb) of ozone formation/destruction.
Comparison of these features has shown that CMAQ tends to produce a higher-than-observed level of ozone pollution around Metro Vancouver during the ozone formation stage of a diurnal cycle, and thus transports a "bigger" ozone plume eastward across LFV.
However, my analyses have also shown that WRF (the weather component of CMAQ) is able to simulate close-to-observed patterns of diurnal ozone transport across LFV.

In the end, the available evidence suggests that the source of the observation-CMAQ difference lies primarily in the computer models' deficiencies in simulating the processes of ozone precursor emission and photochemical reactions.

Furthermore, the ozone feature analyses and CMAQ evaluations are done for five LFV ozone episodes spanning two decades. I found that the LFV ozone process is dominated by a few recurring spatial-temporal ozone features, and the episode-by-episode CMAQ evaluation resulted in similar, i.e., systematic, sets of conclusions.

These model evaluation results are made possible by the statistical comparison and analysis of ozone features. This highlights the important point that CMAQ (or any AQM) evaluation based on ozone features is more informative than a direct observation-model comparison of data values. By deconstructing CMAQ output and observation data into informative ozone features, I was able to evaluate (1) how closely CMAQ can emulate the observed structure of space-time ozone means, and (2) how realistically the CMAQ-WRF-SMOKE system can model the defining patterns of ozone advection, as well as the magnitude of ozone creation and destruction across LFV. The combination of ozone PCA and ozone feature comparison is a means to extract "maximum information" out of the two compared ozone datasets. With the point-to-point comparison of data values, many important data structures are simply "hidden" from analysis.

Secondly, I proposed to model the ozone feature differences.
As I have already summarized, this analysis not only revealed a definitive statistical association between CMAQ deficiency and SMOKE output, it also quantified the non-linear structure of this association. In turn, I was able to highlight a few specific areas in LFV to which future SMOKE modelling efforts should pay close attention. This description of SMOKE's spatial deficiency is especially useful for AQM modellers. In practice, due to the scarce availability of detailed emission measurements, one cannot simply analyze SMOKE-observation comparison data (Steyn et al., 2013). Therefore, the type of detailed CMAQ/SMOKE evaluation presented in this chapter is a unique outcome of my proposed AQM evaluation approaches.

Lastly, in the existing literature of PCA-based AQM evaluation, informative discussions of AQM capabilities are based on the authors' prior knowledge of the AQMs. The analyses in this chapter showed that, using combined methods of PCA and GP modelling, one can also achieve an informative and systematic evaluation of an AQM.

Chapter 6
AQM Evaluation II: Comparison of AQM and Observations as Stochastic Ozone Processes

In this chapter, I will implement the second proposed AQM evaluation method that was briefly explained at the beginning of Chapter 5. A more detailed description of this method, written in the context of CMAQ evaluation, follows:
1. Using the methods from Chapter 4, fit CMAQ ozone feature models using the CMAQ-WRF-SMOKE outputs, and fit separate observation ozone feature models using data from physical measurements.
2. Use the fitted ozone feature models to produce GP model outputs (make predictions) under common covariate settings. These model outputs are the estimated CMAQ and observation ozone features under the same covariate settings, which capture the basic conditions of background weather and precursor pollution.
3. The estimated "common background" ozone features of CMAQ and observations are then compared.
Figure 5.1 from Chapter 5 showed in diagrammatic form the central idea behind my CMAQ evaluation.

To discuss the purpose of the above evaluation approach, I will use the CMAQ evaluation of the 1st-order feature $E_1$ as an example. Let $E_1^c$ and $E_1^o$ be the random processes representing the features of CMAQ ozone and physical observations. Further, let $x^c$ and $x^o$ denote the covariate sets that represent the individual background conditions under which the two ozone processes occur. I propose to model $E_1^c$ and $E_1^o$ as GPs with $x^c$ and $x^o$ as model covariates: $E_1^c(x^c)$ and $E_1^o(x^o)$. After fitting the GP models, one can produce model outputs at a new common covariate setting $x_0$: $\hat{E}_1^c(x_0)$ and $\hat{E}_1^o(x_0)$. Statistically, these GP model outputs are the process means (or expected values) estimated at $x_0$ given $E_1^c(x^c)$ and $E_1^o(x^o)$. They are the ozone features that are statistically expected to be produced by the two GPs.

Because the same input $x_0$ is applied, the outputs $\hat{E}_1^c(x_0)$ and $\hat{E}_1^o(x_0)$ differ only because of the parameters, i.e., the stochastic structures of $E_1^c(x^c)$ and $E_1^o(x^o)$. Hence, comparison of $\hat{E}_1^c(x_0)$ and $\hat{E}_1^o(x_0)$ is a means of comparing the overall statistical properties of CMAQ ozone and the physical process under the same conditions. Moreover, every point of model output has an associated standard error, which allows one to assess the significance of the feature difference in space and time.

My proposed evaluation approach is a statistical means of addressing the need for a "process level understanding" between AQM and observations (Dennis et al., 2010; Galmarini and Steyn, 2010), and it is related to the "Probabilistic Evaluation" approach they mentioned. It should be noted that I am not evaluating the underlying physical or chemical processes governing an air pollution system, such as the chemical kinetics of specific reactions.
Such detailed AQM evaluations are beyond the scope of my thesis.

The parameters in a Gaussian Process model quantify the influence of the model covariate(s) on the random response variable. As described in Chapter 4, my GP model is the sum of a fixed regression component and a stochastic Gaussian Process. The regression coefficients model the linear association between each spatial/temporal ozone feature and variables such as longitude, latitude and hour of the day. The GP correlation parameters model the spatial or temporal behaviour of each ozone feature (pattern) as a non-linear function of the model covariates. Hence, by comparing the correlation and regression parameters between GP models $E^c_j$ and $E^o_j$, or $P^c_j$ and $P^o_j$, one can analyze how the ozone features of CMAQ and observation behave differently (or similarly) across a range of meteorological conditions and precursor pollution.

Such a comparison can be purely numerical, using statistical testing. However, as I shall demonstrate in the following analyses, the comparison of statistical models is more informative when done as I proposed.

6.1 Pre-analysis Comments

In Chapter 4, I estimated the CMAQ ozone feature models as Gaussian Processes, performed model diagnostics and assessed the models’ performance in modelling both individual spatial/temporal ozone patterns and complete space-time ozone fields. Each ozone feature model is able to emulate the behaviour of the corresponding spatial or temporal process. I also concluded that, once combined using equation (4.13), the ozone feature models are well-suited for forecasting hourly ozone across the entire spatial domain of the “rectangular” LFV, especially during the important period of the daily 8-hour maximum. I believe the feature-based ozone model is an appropriate foundation upon which to implement my proposed CMAQ evaluation.

This chapter will use the same notation as the previous chapters. Suppose CMAQ or observational data of dimensions $t \times n$ are decomposed into $E_{n \times n}$ and $P_{t \times n}$, and GP models (4.1) and (4.2) are fitted to the decompositions. Let $E^c_j$ and $P^c_j$ denote the feature-based GPs of CMAQ, and $E^o_j$ and $P^o_j$ denote the corresponding features of ozone observations. I further denote the statistical ozone models of CMAQ and physical observation as $O^c$ and $O^o$, hence

\[
O^c \approx \sum_{j=1}^{3} P^c_j {E^c_j}^{T} \quad \text{and} \quad O^o \approx \sum_{j=1}^{3} P^o_j {E^o_j}^{T}. \tag{6.1}
\]

I will implement the proposed CMAQ evaluation using the 2001 and 2006 data. Specifically, the middle 3 days (72 hours) of the interpolated $n = 17$ CMAQ-related data and observation data from 2006 will be used as the “training dataset” to estimate the ozone feature models. The 96-hour 2001 CMAQ-WRF-SMOKE outputs at the same 17 locations will be used as the “common model covariate inputs” to produce the ozone feature model outputs. The 2006 data were already used as the training dataset in Chapter 4 to fit ozone feature models, so they are used here again for the same purpose. The 2001 CMAQ data captured a set of meteorological and precursor pollution conditions similar to those of 2006, so they are used as model inputs to minimize uncertainties due to model extrapolation.

It is important to note that in the following data analysis, the ozone feature models are fitted using 2006 data. This means that the CMAQ evaluation is done for the 2006 episode, because the compared stochastic models describe the 2006 ozone processes. The 2001 CMAQ data are simply used as common model inputs; the evaluation is not done for the year 2001. Moreover, the 2001 CMAQ data are used to provide model inputs based on realistic combinations of weather conditions and precursor pollution. One may also use the observed meteorology and precursor data, but as discussed in Chapter 2, observational data may suffer from missing observations or measurement error.

As an alternative, one may input into the CMAQ ozone feature models the covariates from the corresponding observation data.
The output from the CMAQ statistical model can then be compared to the observed ozone features. In such a test, the physical observations are regarded as the “benchmark” against which CMAQ is evaluated. One might argue that this is closer to the usual concept of model evaluation. My proposed CMAQ evaluation framework is designed not to have a preconceived notion that the observation is the benchmark, or even the truth. I regard CMAQ and the observations as individual ozone processes whose behaviour is being compared.

Guttorp and Walden (1987) discussed using bootstrapping to analyze the variation, and thus the statistical significance, of data differences. Compared to the method proposed here, their approach is non-parametric and thus more general. However, bootstrapping results can be difficult to interpret when the sample is not independently and identically distributed. My proposed method, based on the comparison of GP model predictions, explicitly takes the process correlation structures into account. It also has the advantage of identifying specific conditions behind statistically significant feature differences.

Model Covariates

In the previous chapter, I modelled the statistical associations between the CMAQ model covariates and the ozone feature differences. In this chapter, the ozone feature models of observations are fitted using the data from physical measurements, which, as mentioned in Chapter 2, only contain data on wind speed, temperature and ambient (antecedent) NOx concentration. This evaluation requires that both the CMAQ and observation ozone feature models have the same types of covariates, e.g., temperature, wind speed and ambient NOx concentrations. Hence, I am constrained to use fewer model covariates than those listed in Table 4.1.

The regression covariates of the GP models are the same as I described in Section 4.5.
The stochastic-term covariates are those in Table 4.1 that are available from both CMAQ outputs and observation data; they are summarized in Table 6.1 for each ozone feature model.

Model                  Covariates
$E^c_1$ and $E^o_1$    Longitude, Latitude, Elevation, $\mathrm{Temp}_{E,1}$, $\mathrm{Temp}_{E,2}$, $\mathrm{Wind}_{E,1}$, $\mathrm{Wind}_{E,2}$, $\mathrm{NOx\text{-}lag}_{E,1}$
$E^c_2$ and $E^o_2$    Longitude, Latitude, Elevation, $\mathrm{Temp}_{E,1}$, $\mathrm{Temp}_{E,2}$, $\mathrm{Wind}_{E,1}$, $\mathrm{NOx\text{-}lag}_{E,1}$
$E^c_3$ and $E^o_3$    Longitude, Latitude, Elevation, $\mathrm{Temp}_{E,1}$, $\mathrm{Temp}_{E,3}$, $\mathrm{Wind}_{E,1}$, $\mathrm{NOx\text{-}lag}_{E,1}$, $\mathrm{NOx\text{-}lag}_{E,2}$, $\mathrm{NOx\text{-}lag}_{E,3}$
$P^c_1$ and $P^o_1$    $\mathrm{Temp}_{P,1}$, $\mathrm{Temp}_{P,3}$, $\mathrm{Wind}_{P,2}$
$P^c_2$ and $P^o_2$    $\mathrm{Temp}_{P,1}$, $\mathrm{Temp}_{P,3}$, $\mathrm{Wind}_{P,3}$
$P^c_3$ and $P^o_3$    $\mathrm{Temp}_{P,1}$, $\mathrm{Temp}_{P,3}$

Table 6.1: Covariates used for CMAQ evaluation within the stochastic component of the ozone feature (GP) models. The covariate sets are shortened (as compared to Table 4.1) due to the constraint imposed by unavailable observations. The acronym “NOx-lag” indicates antecedent or ambient NOx concentrations.

6.2 Comparing the Space-time Ozone Processes

One intuitive way of CMAQ evaluation is to compare the space-time ozone fields produced by the CMAQ-based and the observation-based statistical ozone models. The detailed procedure is as follows:

1. Fit statistical ozone feature models $E^c_j$, $E^o_j$, $P^c_j$ and $P^o_j$, $j = 1, \ldots, 3$. That is, build GP models for the spatial and temporal ozone features of CMAQ and observations.

   The training data is the 2006 ozone episode. The statistical ozone models for CMAQ and observations are fitted using the corresponding CMAQ and observed data, and the covariates of both models are those listed in Table 6.1. I did not perform covariate selection for the observation models due to a dearth of available observation data. Here, I assume that the observed ozone process is driven by the same background variables as the CMAQ model. The difference is how these two random processes behave under the same sets of covariates, and this “difference in behaviour” is the focus.

2. Apply a common set of covariate inputs to the fitted models.
The input covariates are decomposed from the 96-hour, 2001 CMAQ-WRF-SMOKE outputs. This model input data have dimension $17 \times 96$.

3. Combine the GP model outputs $\hat{E}^c_j$’s, $\hat{P}^c_j$’s, $\hat{E}^o_j$’s and $\hat{P}^o_j$’s into space-time ozone fields $\hat{O}^c$ and $\hat{O}^o$ via (6.1).

I should reiterate that the purpose here is not to make predictions, but rather to produce outputs from the statistical CMAQ and observation models given the same covariates representing background atmospheric conditions. Since $\hat{O}^c$ and $\hat{O}^o$ are ozone fields produced from exactly the same input, the comparisons between $\hat{O}^c$ and $\hat{O}^o$ can be viewed as an analysis of the way the statistical CMAQ model and observation model differ as space-time processes. Furthermore, the input data are associated with an actual ozone episode, hence we are evaluating how CMAQ and observed ozone processes behave under conditions that are conducive to a real-world ozone episode.

Figures 6.1a and 6.1b show the mean fields of $\hat{O}^c$ and $\hat{O}^o$ produced by the two statistical ozone models given the same sets of covariate inputs (outputs from the 2001 episode). Both ozone model outputs $\hat{O}^c$ and $\hat{O}^o$ have dimension $96 \times 17$. The matrices are averaged across the 96 hours to obtain the spatial field of ozone means at the $n = 17$ locations, i.e., the $\hat{O}_{t \times n}$ data are averaged by the columns. Cubic-spline smoothing is then applied to interpolate the $n = 17$ spatial data within the longitude-latitude boundary of the measurement locations. Both ozone fields are plotted over the same colour scale for easy comparison. Figure 6.2 shows the equivalent hourly time series of mean outputs (averaged across locations) produced by the statistical CMAQ and observation models.

Both the spatial (Figures 6.1a and 6.1b) and temporal (Figure 6.2) plots show that, even under the same background conditions, the CMAQ process tends to produce higher levels of ozone than the physical observations.
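The combination and averaging steps described above can be sketched in a few lines. This is a minimal illustration only, assuming the feature outputs are stored as NumPy arrays with one column per feature; the function names are hypothetical, and random numbers stand in for the GP model outputs, so nothing here reproduces the thesis results.

```python
import numpy as np

def reconstruct_field(P, E):
    """Combine temporal features P (t x k) and spatial features E (n x k)
    into a t x n space-time field, O = sum_j P_j E_j^T, as in (6.1)."""
    return P @ E.T

def spatial_mean_field(O):
    """Average a t x n field over time (down each column): one mean per location."""
    return O.mean(axis=0)

def hourly_mean_series(O):
    """Average a t x n field over locations (along each row): one mean per hour."""
    return O.mean(axis=1)

# Sizes taken from the text: t = 96 hours, n = 17 locations, k = 3 features.
# The values are random placeholders, NOT real CMAQ or observation features.
rng = np.random.default_rng(0)
t, n, k = 96, 17, 3
P_c, E_c = rng.normal(size=(t, k)), rng.normal(size=(n, k))
P_o, E_o = rng.normal(size=(t, k)), rng.normal(size=(n, k))

O_c = reconstruct_field(P_c, E_c)   # stand-in for the CMAQ-based field
O_o = reconstruct_field(P_o, E_o)   # stand-in for the observation-based field

diff_space = spatial_mean_field(O_c) - spatial_mean_field(O_o)  # length n
diff_time = hourly_mean_series(O_c) - hourly_mean_series(O_o)   # length t
```

Writing the sum of outer products as a single matrix product keeps the sketch short; the two averaging helpers correspond to the spatial mean fields and the hourly mean series that the figures compare.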
Spatially, this result implies that, location by location in LFV, the episode or temporal ozone means produced by CMAQ are uniformly higher than the observations.

From the plots of hourly LFV mean ozone (Figure 6.2), we see that, using ozone observations as reference, the CMAQ model tends to over-predict the spatial means during the hours between 0000 PST and 0800 PST, especially on the second day, where the CMAQ spatial means can be more than twice the observed (shaded area in Figure 6.2). CMAQ also tends to produce higher LFV means during a few afternoon peak hours, but the magnitude of over-prediction is not as noticeable as in the morning.

In summary, after controlling for differences in background conditions, the CMAQ modelled ozone fields still showed the same space-time patterns of over-prediction noticed from the comparison of the original data (refer back to Figures 5.2 and 5.3 for spatial differences, and Figures 5.13 and 5.14 for temporal differences). Therefore, the preceding evaluation indicates that CMAQ is statistically expected to produce higher temporal/episode ozone means throughout LFV, and higher hourly LFV mean ozone during the early morning and afternoon.

(a) The mean field (in ppb) produced by the statistical ozone model of CMAQ.
(b) The mean field (in ppb) produced by the statistical ozone model of observations.
Figure 6.1: Spatial fields of ozone means produced by the statistical ozone models of CMAQ and observation. Statistical ozone feature models are fitted using the middle 3 full days of 2006 CMAQ and observation data (72 hours). The common model inputs are the entire 96 hours of 2001 CMAQ/WRF/SMOKE output.

Figure 6.2: Hourly time series of LFV mean ozone (averaged across space by the hour) produced by the CMAQ and observation ozone models. The feature-based statistical ozone models are fitted using the 2006 CMAQ and observation data. The common input is the 2001 CMAQ data.
The dashed lines indicate hour 0000 of each day, and the shaded region shows the hours when CMAQ over-prediction is the largest during the 96 hours.

In the last chapter, I modelled the covariates for $P^d_1$, the feature that represents the hourly differences in LFV mean ozone between CMAQ and observation. That analysis did not clearly identify one specific input of the CMAQ run that is driving the ozone feature difference. I further raised the issue that CMAQ does not model certain nocturnal (or early morning) pollution processes that occur around LFV. The preceding analyses point to an observation-CMAQ difference of $P_1$ at the process level.

6.3 Comparison of $\hat{P}^c_1$ and $\hat{P}^o_1$

In this section, I will perform a statistical comparison of $\hat{P}^c_1$ and $\hat{P}^o_1$, features estimated under the same weather and pollution settings. Before further discussion, it is worth repeating that the difference between $P^c_1$ and $P^o_1$ mirrors the scale and temporal pattern of the difference between the hourly LFV means of CMAQ and observations. This fact allows one to interpret the following comparison of $P_1$ as a comparison of hourly LFV mean ozone. One may refer back to Section 5.3 for the discussion of this particular point.

As discussed in Section 4.2, the GP model output from each set of covariate inputs is the conditional mean of the process at that setting, and this estimate of the mean has a standard error. Overall, there are 96 $\hat{P}_1$’s and associated standard errors.

Figure 6.3a shows the temporal patterns of $\hat{P}^c_1$ and $\hat{P}^o_1$ estimated under the same covariate sets (weather and precursor pollution). I also plotted the “error bars” whose magnitudes indicate the median $\hat{P}_1$ standard errors for the two statistical models, and the arrows indicate the hours where the 95% prediction intervals of $P^c_1$ and $P^o_1$ do not overlap.
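The non-overlap criterion can be illustrated with a short sketch. It assumes symmetric normal-approximation intervals of the form mean ± 1.96 × standard error, which may differ from the exact interval construction used for the GP models; the function name and all numbers are hypothetical and chosen only to show the mechanics.

```python
import numpy as np

Z_95 = 1.959964  # 97.5% quantile of the standard normal distribution

def intervals_disjoint(mean_a, se_a, mean_b, se_b, z=Z_95):
    """Flag entries where the intervals mean +/- z * se of a and b do not overlap."""
    mean_a, se_a = np.asarray(mean_a, float), np.asarray(se_a, float)
    mean_b, se_b = np.asarray(mean_b, float), np.asarray(se_b, float)
    lo_a, hi_a = mean_a - z * se_a, mean_a + z * se_a
    lo_b, hi_b = mean_b - z * se_b, mean_b + z * se_b
    # Two intervals are disjoint when one starts above the end of the other.
    return (lo_a > hi_b) | (lo_b > hi_a)

# Made-up hourly feature estimates (ppb) and standard errors for three hours.
p_c = np.array([60.0, 45.0, 30.0])   # stand-in for the CMAQ-based estimates
se_c = np.array([5.0, 5.0, 2.0])
p_o = np.array([25.0, 40.0, 28.0])   # stand-in for the observation-based estimates
se_o = np.array([5.0, 5.0, 2.0])

flags = intervals_disjoint(p_c, se_c, p_o, se_o)
print(flags)  # [ True False False]
```

Only the first hour is flagged here, because its two intervals are separated by a gap. Requiring disjoint intervals is a conservative criterion: two estimates can differ significantly in a direct test of their difference while their individual 95% intervals still overlap.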
As shown, the 1st-order temporal feature difference between CMAQ and observation (or the difference in hourly LFV means) is statistically significant for a few early morning hours and one daytime hour on the 3rd day of the 2006 episode.

Figure 6.3b further shows the scatter plot of $\hat{P}^o_1$ vs. $\hat{P}^c_1$. We see that during early morning hours (when $\hat{P}^o_1 \le 50$ ppb), the CMAQ ozone process tends to over-predict the observed hourly LFV means by more than 100%, whereas the daytime correspondence between the CMAQ and observed ozone processes is much better.

Figure 6.4 shows the same $\hat{P}^c_1$ and $\hat{P}^o_1$ plotted as functions of $\mathrm{Temp}_{P,1}$, a temperature covariate in both models (Table 6.1). In these plots, the $\hat{P}^c_1$’s and $\hat{P}^o_1$’s at the same hour are averaged across the days of the episode, and the same is done for the covariate $\mathrm{Temp}_{P,1}$. This averaging is done to smooth the daily variations of $\hat{P}_1$ at similar $\mathrm{Temp}_{P,1}$ values. $\mathrm{Temp}_{P,1}$ captures the temporal features of the hourly mean LFV temperature (Section 4.3 and Appendix C.1). So, to aid interpretation, the x-axis in Figure 6.4 shows the mean temperature values that correspond to $\mathrm{Temp}_{P,1}$.

The CMAQ model produced higher-than-observed $\hat{P}_1$ values across the temperature range typical of LFV ozone episodes. The small exception occurs between 26.0°C and 27.0°C, where the relationship alternates between $\hat{P}^c_1 < \hat{P}^o_1$ and $\hat{P}^c_1 > \hat{P}^o_1$. This pattern translates to CMAQ over-predictions of hourly LFV means during most of the day, but especially during periods of low temperature such as the hours between late night and morning.

(a) Time series plots of $\hat{P}^c_1$ and $\hat{P}^o_1$.
(b) Scatter plot of $\hat{P}^o_1$ vs. $\hat{P}^c_1$.
Figure 6.3: Time series plots of $\hat{P}^c_1$ and $\hat{P}^o_1$ and scatter plot of $\hat{P}^o_1$ vs. $\hat{P}^c_1$. The time-series plot at the top shows the “error bars” whose magnitudes indicate the median $\hat{P}_1$ standard errors from the two ozone feature models, and the arrows indicate the hours where the difference between $\hat{P}^c_1$ and $\hat{P}^o_1$ is significant at type-I error = 0.05.
The shaded area shows the hours when the differences between $\hat{P}^c_1$ and $\hat{P}^o_1$ are the largest. The lines in the scatter plot are $y = x$, $y = 2x$ and $y = \tfrac{1}{2}x$.

Figure 6.4: Univariate covariate effect of $\hat{P}^c_1$ and $\hat{P}^o_1$ against the temperature input $\mathrm{Temp}_{P,1}$, a feature that represents mean LFV temperature. The models for $P^c_1$ and $P^o_1$ are fitted using the 2006 CMAQ and observation data. The input is processed from the CMAQ data of 2001.

6.4 Chapter Conclusion

During the CMAQ evaluation in Chapter 5, I found that the 1st-order temporal feature of CMAQ, $P^c_1$, tends to have noticeably higher values than the observed feature $P^o_1$ during the morning hours between 0000 PST and 0800 PST, and during a few hours of the afternoon peak. Compared to observations, this result corresponds to CMAQ’s over-prediction of hourly LFV mean ozone during the morning and afternoon. The feature-based evaluations in this chapter revealed certain process-level differences between the CMAQ ozone process and reality (physical observation).

I applied the ozone feature models from Chapter 4 and implemented statistical analyses that answer the question: “would CMAQ produce higher-than-observed ozone under the same atmospheric conditions?” The statistical comparisons are based on the assumption that the CMAQ and observation features follow Gaussian Processes in the specific forms estimated in Chapter 4. The reasonableness of this assumption was extensively analyzed in Chapter 4.

The analyses have shown that, given the same background conditions in temperature, wind and ozone precursor concentrations, CMAQ is statistically expected to produce the aforementioned temporal patterns of ozone over-prediction. During some morning hours, these hourly over-predictions are significant in the sense that the prediction intervals of $P^c_1$ and $P^o_1$ do not overlap. Plots of $P^c_1$ and $P^o_1$ against the temperature covariate further revealed that CMAQ is expected to produce higher-than-observed hourly LFV mean ozone at temperatures below 25°C, i.e., outside of afternoon peaks.

Similar analyses of spatial features showed that CMAQ will also produce higher temporal/episode ozone means than the physical observations throughout LFV. In other words, the spatial CMAQ over-predictions shown in Section 5.2 are expected to be present under the same general weather conditions and precursor pollution.

Chapter 7

Conclusion

The traditional method of AQM evaluation directly compares the model outputs against observation data and summarizes the deviation values into error statistics such as RMSE and MBE. However, as Dennis et al. (2010) pointed out (summarized in Section 1.2), AQM outputs and observation data are generated by discrepant physical processes, and without a deeper understanding of the complexities of the air pollution system at hand, any direct observation-model comparisons are “fortuitous”.

More informative and “big picture” approaches to AQM evaluation have been proposed over the years. These evaluations compare the data features obtained from the decompositions of AQM outputs and observation data. However, the differences in the data features are still summarized into statistical measures like correlations, angles between vectors, etc. The systematic observation-AQM differences are visually interpreted using the authors’ prior knowledge, as atmospheric scientists, about the inner workings of AQMs and physical processes.

The goal of this research is to develop novel statistical methods of AQM evaluation that (1) are more informative than the point-to-point data comparison, and (2) further the existing methods of feature-based AQM evaluation.

This chapter summarizes the novel contributions made in this thesis, both in the fields of statistical AQM evaluation and modelling of space-time ozone processes.
I will then finish the conclusion by proposing future work. Since the evaluations in this thesis are done for CMAQ modelling of LFV ozone, the following discussion will be written mainly in this context.

7.1 Main Contributions and the Novelty of the Evaluation Methods

Using combined methods of PCA and Gaussian Process modelling, I proposed and implemented means of AQM evaluation that provide a process-level understanding of the way AQM simulated ozone differs from physical observations. Here, the “process-level” evaluation refers to either the feature comparison of AQM ozone and observations as stochastic processes, or the comparison of their dynamic features, e.g., the dominant pattern of ozone advection, and the region and magnitude of ozone formation. A more detailed process-level evaluation, such as of the chemical kinetics of certain reactions, is beyond the scope of this thesis.

In addition to the structural comparison of ozone features, I proposed two approaches to AQM evaluation that incorporate the methods of non-linear spatial and temporal modelling. They are:

1. Statistical modelling of ozone feature differences as Gaussian Processes driven by AQM inputs representing atmospheric conditions, ozone precursor emission rates and antecedent pollution. This method evaluates the statistical associations between the inputs and particular conditions of an AQM run and its modelling capability.

2. Comparison of the statistical properties of AQM ozone and physical processes. This is done by estimating the ozone feature models for AQM and observations, then comparing their ozone features predicted under the same sets of background weather conditions and precursor pollution. This method also assesses the statistical significance of any feature difference in both space and time.

The developed statistical methods are implemented to model and evaluate CMAQ ozone.
I will now provide a simple recap of the evaluation results; more detailed summaries are at the ends of Chapters 5 and 6. These results serve to demonstrate the usefulness of my proposed evaluation methods, thereby highlighting the contribution of this thesis.

7.1.1 Recap of Evaluation Results

The ozone feature analyses and CMAQ evaluations are done for five LFV ozone episodes spanning two decades. I found that the LFV ozone process is dominated by a few recurring spatial-temporal ozone features, and the episode-by-episode CMAQ evaluation resulted in similar, i.e., systematic, sets of conclusions. The following is an evaluation recap:

• Comparison of the 1st-order spatial features showed that CMAQ tends to over-estimate the observed local ozone means (averaged across time) almost uniformly across LFV. The over-estimate is observed for 4 out of 5 episodes. However, the east-west patterns of spatial ozone variation are similar between CMAQ outputs and observations.

• GP modelling of feature differences showed that the above-mentioned differences in the mean field are statistically associated with the episode mean emission rates or the antecedent concentrations of either NOx or VOC.

  A detailed feature difference modelling was performed for the 2001 episode. I found that the main sources of feature difference lie in the areas of LFV where the VOC emission rates are around 0.82 and 1.18 moles·sec$^{-1}$. These correspond to the middle of LFV in general, and eastern Metro Vancouver in particular. This region is where much of the daily ozone plume forms, and where the ozone process transitions from being VOC-sensitive to NOx-sensitive.

• Certain ozone features capture the most dominant pattern of ozone advection across LFV, as well as the area and the magnitude (in ppb) of ozone formation/destruction.
Comparison of these features has shown that CMAQ tends to produce higher-than-observed levels of ozone pollution around eastern Metro Vancouver during the ozone formation stage of a diurnal cycle. Subsequently, the CMAQ ozone process transports a “bigger” ozone plume eastward across LFV.

  However, the same feature comparison also showed that WRF (the weather component of CMAQ) is able to simulate close-to-observed general patterns of eastward ozone advection.

The above results suggest that the source of observation-CMAQ deviations in spatial ozone lies mainly in the CMAQ/SMOKE deficiencies in simulating LFV’s spatial precursor emissions and their subsequent atmospheric reactions. Specifically, the suburbs of eastern Metro Vancouver are identified as areas where SMOKE modelling showed its deficiency. Hence, any future effort in air quality modelling should pay close attention to the accuracy of emission modelling at the aforementioned locations.

The evaluation results relating to SMOKE modelling are especially useful. Due to the scarce availability of detailed emission observations (Steyn et al., 2013), it is difficult in practice to evaluate SMOKE outputs and make meaningful associations with CMAQ deficiency. My proposed evaluation method provided a way of addressing this concern.

The second proposed evaluation method analyzes whether, under the same atmospheric conditions, the ozone features differ significantly in space and time. This method uses the statistical ozone feature models to compare the stochastic structures of CMAQ ozone and the physical process. The results showed that, even under the same background conditions, CMAQ is expected to produce significantly higher-than-observed hourly LFV mean ozone (averaged across space) during the morning. Analyses also showed that CMAQ tends to over-estimate the hourly LFV mean at temperatures below 25°C, which are the temperatures outside of the afternoon peak hours.
In addition, CMAQ is statistically expected to produce a spatial ozone field with values that are uniformly higher than observed.

7.1.2 Conclusion on AQM Evaluation

In the end, air-quality model evaluation is not a well-defined science; there is no “one right way” of evaluation. What separates different model evaluation techniques and approaches is their level of informativeness, judged by the amount of insight and knowledge into the model behaviour that an evaluation method can provide.

My proposed AQM evaluation methods provided the types of insight into the modelling capability of CMAQ that are not possible with direct observation-model data comparison. As the preceding evaluation recap and the detailed analyses in Chapters 5 and 6 have shown, the proposed AQM evaluations delivered an informative and coherent set of results that highlighted both the “big picture” and the detailed input-level capability of the CMAQ-SMOKE-WRF system.

My proposed AQM evaluation methods, which are based on the framework of statistical analysis and modelling of ozone features, constitute novel contributions to the existing works on AQM evaluation (AQMEII, Section 1.3). In particular, the methods proposed in this thesis add to the knowledge of feature-based AQM evaluation (existing works summarized in Section 1.4).

7.2 Additional Contributions

Although the analyses in Chapters 3 and 4 are designed to provide the necessary tools for the subsequent CMAQ evaluation, these works should also be considered useful contributions on their own. The combined analyses in Chapters 3 and 4 drew the important conclusion that a complex space-time pollution process can be conceptually understood and modelled statistically using a few leading features.
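The idea that a few leading features can carry most of a space-time field can be illustrated with a small PCA sketch based on the singular value decomposition. This is an illustration under stated assumptions, not the thesis's procedure: the data matrix is not centred (so the leading feature absorbs the mean structure), the function name is hypothetical, and the synthetic field is a made-up diurnal cycle times a spatial gradient plus noise.

```python
import numpy as np

def pca_features(O, k):
    """Decompose a t x n field O into k temporal features P (t x k) and
    k spatial features E (n x k) via the SVD, so that O is approximately P @ E.T.
    Also returns the fraction of total sum of squares captured by the k features."""
    U, s, Vt = np.linalg.svd(O, full_matrices=False)
    P = U[:, :k] * s[:k]   # temporal features, scaled by the singular values
    E = Vt[:k].T           # spatial features with orthonormal columns
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()
    return P, E, explained

# Synthetic example: 72 hours at 17 locations (sizes borrowed from the text).
rng = np.random.default_rng(1)
t, n = 72, 17
hours = np.arange(t)
diurnal = 40 + 20 * np.sin(2 * np.pi * (hours - 9) / 24)  # made-up ppb-like cycle
gradient = np.linspace(0.8, 1.2, n)                       # made-up spatial gradient
O = np.outer(diurnal, gradient) + rng.normal(scale=1.0, size=(t, n))

P, E, frac = pca_features(O, k=3)
O_hat = P @ E.T   # rank-3 approximation of the field
```

Because the synthetic field is dominated by one space-time pattern, `frac` is close to 1 here. Real ozone fields would need their own eigenspectrum analysis before deciding how many features to keep.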
The ozone feature models developed in Chapter 4 are an especially novel and efficient means of modelling a complex space-time air pollution process.

7.2.1 Understanding the Features of LFV Ozone

In Chapter 3, I formulated a detailed understanding of LFV ozone features during a summer-time ozone episode. I identified spatial-temporal ozone features that consistently appeared and dominated the ozone processes during the years 1985-2006. The effects of different wind flow patterns on the resultant ozone features are also studied.

During an LFV ozone episode dominated by wind regime types I, II and III, the most important dynamic process is the eastward advection of the ozone plume. This transport is driven by a prevailing westerly wind flow. Under a type IV wind regime, there are two defining ozone circulation patterns: the daytime northwest-to-southwest ozone advection and the nighttime west-to-east advection. These dynamic processes, in addition to the space-time structures of the ozone mean, account for over 90% of space-time ozone variation. Further analysis of eigenspectra indicates that these few features are statistically separable from other features.

7.2.2 Ozone Feature Models

I further developed statistical models for individual ozone features, and a framework where a complete space-time ozone field can be modelled through its features. Individual features are modelled as GPs driven by a set of variables describing background meteorology, ozone precursor emission rates and antecedent concentrations. Each ozone feature model is estimated through a forward selection algorithm based on statistical goodness-of-fit measures.

Forecasts of ozone features and the resultant space-time ozone fields are made for the 4th day of the 2006 CMAQ output across a complex rectangular domain including LFV and the surrounding mountains. The ozone feature models displayed good capability in emulating the complex non-linear structures of the respective features.
The predicted spatial features captured both the regional-scale patterns and the localized details of the true spatial features with good numerical accuracy. By combining the predicted features, forecasts were made for the hourly spatial ozone fields. The proposed feature-based ozone model is able to forecast the LFV’s ozone fields at fine spatial resolution, where the hourly forecasts captured the spatial details of local ozone both in the lower-valley region and across the north shore mountains. The forecasting accuracy was especially good during the important daily ozone peak hours, with low RMSEs and near-zero prediction bias.

Given the complexities of running CMAQ (Sections 1.1 and 2.1), the developed ozone feature model can be useful as a statistical CMAQ emulator. Furthermore, compared to the traditional statistical approach of directly modelling the “raw” space-time data, it is more computationally efficient to model the data through its features.

7.3 Future Work

Proposed future work mainly includes applying the presented statistical methods to other data and refining the ozone feature models.

7.3.1 Application for Other Air Pollution Data

For the regional ozone fields analyzed in this thesis, one would perform AQM evaluation based on very few (sometimes only one) spatial-temporal features. This is because the observed ozone suffers from noticeable feature degeneracy, making high-order feature-by-feature comparison difficult to justify. The existing works on PCA-based AQM evaluation analyzed data over a large spatial domain and a longer time period. Fiore et al. (2003) and Eder et al. (2014) performed model evaluations based on the air pollution field over the eastern United States. Orsolini and Doblas-Reyes (2003) and Camp et al. (2003) performed ozone PCAs (not for AQM evaluation) over the Euro-Atlantic sector (20° to 90° latitude and 60° to -90° longitude) and the entire global tropic region.
These works showed that for a large spatial air pollution field, there are at least 3 clearly structured and interpretable data features in addition to the mean.

The relative simplicity (compared to a continental air pollution field) of the LFV ozone means that the space-time ozone means and dynamic contrasts (features capturing patterns of advection) are the only important features. Moreover, these features consistently dominated all episodes during the two decades between 1985 and 2006. Hence, one could justifiably evaluate CMAQ based on very few (sometimes only one) leading ozone features and construct a systematic view of the capability of CMAQ-WRF-SMOKE to interactively model the LFV’s air pollution.

The disadvantage of the relative simplicity of LFV ozone is that I could not, as is evident in Chapters 5 and 6, compare and analyze many ozone features in order to demonstrate more fully the utility of my AQM evaluation method. The aforementioned works on continental/global scale PCA did not consider the problem of feature degeneracy, so I do not know whether such large-scale AQM evaluation allows for more orders of feature comparison. Therefore, my foremost future work will focus on applying the developed evaluation methods to a much larger air pollution field, to study whether more non-degenerate features can be extracted and put through feature-based AQM evaluation and statistical modelling.

This thesis analyzed hourly spatial ozone data during episode days. Alternatively, one may apply the presented statistical methods to (1) study other space-time pollution processes, such as hourly PM2.5 exposure, and (2) analyze long-term ozone data and/or daily maximum data.

7.3.2 Further Works on Ozone Feature Models

My preliminary analysis has shown that an LFV ozone process may be separated into three sub-regions, i.e., the LFV ozone field can also be modelled as spatially heterogeneous processes.
This result was revealed when I built “local” ozone models through spatial sampling, where the models are based on the “raw” ozone data; they are not feature models. Application of k-means clustering identified three LFV areas with “similar” estimated model parameters: the area across the foot of the mountains, and the western and eastern LFV separated by Surrey. The ozone feature models developed in Chapter 4 did not account for possible spatial heterogeneity. Hence, another direction for future work is to improve the existing ozone feature models by allowing the GP parameters to vary in space to account for spatial heterogeneity.

Besides AQM evaluation, the ozone feature models developed here may serve as a computationally efficient emulator of an AQM. Hence, it may be practical to collect all relevant code for PCA and GP modelling into an R package for fast forecasting of AQM output.

Bibliography

Ainslie, B., Reuten, C., Steyn, D. G., Le, N. D., and Zidek, J. V. (2009). Application of an entropy-based Bayesian optimization technique to the redesign of an existing monitoring network for single air pollutants. Journal of Environmental Management, 90:2715–2729.
Ainslie, B. and Steyn, D. G. (2007). Spatiotemporal trends in episodic ozone pollution in the Lower Fraser Valley, British Columbia, in relation to mesoscale atmospheric circulation patterns and emissions. Journal of Applied Meteorology and Climatology, 46:1631–1644.
Ainslie, B., Steyn, D. G., Reuten, C., and Jackson, P. L. (2013). A retrospective analysis of ozone formation in the Lower Fraser Valley, BC, Canada. Part II: Influence of emission reduction on ozone formation. Atmosphere Ocean, 51:170–186.
Allen, M. R. and Tett, S. F. B. (1999). Checking for model consistency in optimal fingerprinting. Climate Dynamics, 15:419–434.
Aslett, R., Buck, R. J., Duvall, S. G., Sacks, J., and Welch, W. J. (1998). Circuit optimization via sequential computer experiments: design of an output buffer.
Journal of the Royal Statistical Society, Series C, 47:31–48.
Bastos, L. S. and O’Hagan, A. (2009). Diagnostics for Gaussian process emulators. Technometrics, 51:425–438.
Beaver, S., Tanrikulu, S., Palazoglu, A., Singh, A., Soong, S. T., Jia, Y., Tran, C., Ainslie, B., and Steyn, D. G. (2010). Pattern-based evaluation of coupled meteorological and air quality models. Journal of Applied Meteorology and Climatology, 49:2077–2091.
Berrocal, V. J., Craigmile, P. F., and Guttorp, P. (2012). Regional climate model assessment using statistical upscaling and downscaling techniques. Environmetrics, 23:482–492.
Berrocal, V. J., Gelfand, A. E., and Holland, D. M. (2009). A spatio-temporal downscaler for output from numerical models. Journal of Agricultural, Biological, and Environmental Statistics, 15:176–197.
Bjornsson, H. and Venegas, S. A. (1997). A manual for EOF and SVD analyses of climate data. Technical Report CCGCR No. 97-1, McGill University.
Bloomfield, P., Royle, A. J., Steinberg, L. J., and Yang, Q. (1996). Accounting for meteorological effects in measuring urban ozone levels and trends. Atmospheric Environment, 30:3067–3077.
Borg, I. and Groenen, P. (2005). Modern Multidimensional Scaling: theory and applications. Springer-Verlag.
Boubel, R. W., Fox, D. L., Turner, D. B., and Stern, A. C. (1994). Fundamentals of Air Pollution. Academic Press.
Byun, D. and Schere, K. L. (2006). Review of the governing equations, computational algorithms, and other components of the Models-3 Community Multiscale Air Quality (CMAQ) modeling system. Applied Mechanics Review, 59:51–77.
Camp, C. D., Roulston, M. S., and Yung, Y. L. (2003). Temporal and spatial patterns of the interannual variability of total ozone in the tropics. Journal of Geophysical Research, 108:4643–4660.
CCME (2000). Canada wide standard for particulate matter (PM) and ozone. Technical report, Canadian Council of Ministers of the Environment.
Available at http://www.ccme.ca/ourwork/air.html?category_id=99, accessed 2013-09-12.
Cohn, R. D. and Dennis, R. L. (1994). Evaluation of acid-deposition model using principal component spaces. Atmospheric Environment, 28:2531–2543.
Conti, S. and O’Hagan, A. (2010). Bayesian emulation of complex multi-output and dynamic computer models. Journal of Statistical Planning and Inference, 140:640–651.
Cooley, D., Nychka, D., and Naveau, P. (2007). Bayesian spatial modeling of extreme precipitation return levels. Journal of American Statistical Association, 102:824–840.
Craigmile, P. F. and Guttorp, P. (2011). Space-time modelling of trends in temperature series. Journal of Time-series Analysis, 32:378–395.
Cressie, N. (1990). The origin of kriging. Mathematical Geology, 22:239–252.
Currin, C., Mitchell, T., Morris, M., and Ylvisaker, D. (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. Journal of American Statistical Association, 86:953–963.
Dennis, R., Fox, T., Fuentes, M., Gilliland, A., Hanna, S., Hogrefe, C., Irwin, J., Rao, S. T., Scheffe, R., Schere, K., Steyn, D. G., and Venkatram, A. (2010). A framework for evaluation of regional-scale numerical photochemical modeling system. Environmental Fluid Mechanics, 10:471–489.
Dou, Y., Le, N. D., and Zidek, J. V. (2010). Modelling hourly ozone concentration fields. The Annals of Applied Statistics, 4:1183–1213.
Eder, B., Bash, J., Foley, K., and Pleim, J. (2014). Incorporating principal component analysis into air quality model evaluation. Atmospheric Environment, 82:307–315.
Finlayson-Pitts, B. J. and Pitts Jr, J. N. (1999). Upper and Lower Atmosphere. Academic Press.
Fiore, A. M., Jacob, D. J., Mathur, R., and Martin, R. V. (2003). Application of empirical orthogonal functions to evaluate ozone simulations with regional and global models. Journal of Geophysical Research, 108:4431–4445.
Fuentes, M. and Raftery, A. E. (2005).
Model evaluation and spatial interpolation by Bayesian combination of observations with outputs from numerical models. Biometrics, 61:36–45.
Galmarini, S. and Steyn, D. G. (2010). Advancing approaches to the evaluation of regional scale air quality modeling system. Technical report, Air Quality Model Evaluation International Initiative. Available at http://publications.jrc.ec.europa.eu/repository/handle/111111111/13563, accessed 2010-06-12.
Gao, F., Sacks, J., and Welch, W. J. (1996). Predicting urban ozone levels and trends with semiparametric modeling. Journal of Agricultural, Biological and Environmental Statistics, 1:404–425.
Gotway, C. A., Ferguson, R. B., Herbert, G. W., and Peterson, T. A. (1996). Comparison of kriging and inverse-distance methods for mapping soil parameters. Soil Science Society of America Journal, 60:1237–1247.
Guenther, A., Karl, T., Harley, P., Wiedinmyer, C., Palmer, P. I., and Geron, C. (2006). Estimates of global terrestrial isoprene emissions using MEGAN (Model of Emissions of Gases and Aerosols from Nature). Atmospheric Chemistry and Physics, 6:3181–3210.
Guttorp, P. and Walden, A. (1987). On the evaluation of geophysical models. Geophysical Journal of the Royal Astronomical Society, 91:201–210.
Handcock, M. S. and Stein, M. L. (1993). A Bayesian analysis of kriging. Technometrics, 35:403–410.
Hannachi, A., Jolliffe, I. T., and Stephenson, D. B. (2007). Empirical orthogonal functions and related techniques in atmospheric science: A review. International Journal of Climatology, 27:1119–1152.
Hardle, W. K. and Simar, L. (2012). Applied Multivariate Statistical Analysis, chapter 9. Springer.
Hasselmann, K. (1993). Optimal fingerprints for the detection of time-dependent climate change. Journal of Climate, 6:1957–1971.
Higdon, D., Gattiker, J., Williams, B., and Rightley, M. (2008). Computer calibration using high-dimensional output. Journal of American Statistical Association, 103:570–583.
Hobbs, W. R., Bindoff, N. L., and Raphael, M.
N. (2015). New perspectives on the observed and simulated Antarctica sea ice extent trend using optimal fingerprinting techniques. Journal of Climate, 28:1543–1560.
Hogrefe, C., Rao, S. T., Zurbenko, I. G., and Porter, P. S. (2000). Interpreting information in time series of ozone observations and model predictions relevant to regulatory policies in the eastern United States. Bulletin of the American Meteorological Society, 81:2083–2106.
JAICC (2005). A report to CCME: An update in the support of Canada wide standard for particulate matter (PM) and ozone. Technical report, Joint Action Implementation Coordinating Committee. Available at http://www.ccme.ca/ourwork/air.html?category_id=99, accessed 2013-09-12.
Jin, L., Harley, R. A., and Brown, N. J. (2011). Ozone pollution regimes modeled for a summer season in California’s San Joaquin Valley: A cluster analysis. Atmospheric Environment, 45:4707–4718.
Jolliffe, I. T. (2002). Principal Component Analysis, 2nd ed. Springer.
Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492.
Jrrar, A., Braesicke, P., Hadjinicolaou, P., and Pyle, J. A. (2006). Trend analysis of CTM-derived northern hemisphere winter total ozone using self-consistent proxies: How well can we explain dynamically induced trends? Quarterly Journal of Royal Meteorological Society, 132:1969–1983.
Kalenderski, S. and Steyn, D. G. (2011). Mixed deterministic statistical modelling of regional ozone air pollution. Environmetrics, 22:572–586.
Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models (with discussion). Journal of the Royal Statistical Society, Series B, 63:425–464.
Kleiber, W., Sain, S., Heaton, M., Wiltberger, M., Reese, C., and Bingham, D. (2014). Parameter tuning for a multi-fidelity dynamical model of the magnetosphere. Annals of Applied Statistics, 7:1286–1310.
Kohonen, T. (1982).
Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59–69.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78:1464–1480.
Krzanowski, W. J. (1979). Between group comparison of principal components. Journal of American Statistical Association, 74:703–707.
Le, N. D. and Zidek, J. V. (2006). Statistical Analysis of Environmental Space-Time Process. Springer.
Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88:365–411.
Li, S., Anlauf, K., Weibe, H., Bottenheim, J., and Pucket, K. (1994). Evaluation of a comprehensive Eulerian air-quality model with multiple chemical species measurement using principal component analysis. Atmospheric Environment, 28:3449–3461.
Lindstrom, J., Szpiro, A. A., Sampson, P. D., Oron, A. P., Richards, M., Larson, T. V., and Sheppard, L. (2014). A flexible spatio-temporal model for air pollution with spatial and spatio-temporal covariates. Environmental and Ecological Statistics, 21:411–433.
Lippmann, M. (1989). Health effects of ozone. A critical review. Journal of Air Pollution Control Association, 39:672–695.
Liu, L., Hawkins, D. M., Ghosh, S., and Young, S. S. (2003). Robust singular value decomposition analysis of microarray data. Proceedings of the National Academy of Sciences, 100:13167–13172.
Liu, Z. (2007). Combining Deterministic and Statistical Methods in Modeling Environmental Processes. PhD thesis, UBC.
Lorenz, E. N. (1956). Empirical orthogonal functions and statistical weather prediction. Technical report, Statistical Forecast Project Report 1, Dept. of Meteorology, M.I.T.
Marmur, A., Liu, W., Wang, Y. H., Russell, A., and Edgerton, E. S. (2009). Evaluation of model simulated atmospheric constituents with observations in the factor projected space: CMAQ simulations of SEARCH measurements. Atmospheric Environment, 43:1839–1849.
Matheron, G. (1963). Principles of geostatistics.
Economic Geology, 58:1246–1266.
Metro Vancouver (2012). Station information: Lower Fraser Valley air-quality monitoring network. Technical report, Metro Vancouver. Available at http://www.metrovancouver.org/services/air-quality/emissions-monitoring/monitoring/network/Pages/default.aspx, accessed 2013-06-23.
Metro Vancouver (2013). Lower Fraser Valley air-quality report. Technical report, Metro Vancouver. Available at http://www.metrovancouver.org/services/air-quality/emissions-monitoring/monitoring/reports/Pages/default.aspx, accessed 2013-06-23.
Monahan, A. H., Fyfe, J. C., Ambaum, M. H. P., Stephenson, D. B., and North, G. R. (2009). Empirical orthogonal functions: the medium is the message. Journal of Climate, 22:6501–6514.
North, G. R., Bell, T. L., Cahalan, R. F., and Moeng, F. J. (1982). Sampling errors in the estimations of empirical orthogonal functions. Monthly Weather Review, 110:699–706.
Nychka, D., Wikle, C., and Royle, A. J. (2002). Multiresolution models for nonstationary spatial covariance functions. Statistical Modelling: an International Journal, 2:315–331.
Orsolini, Y. J. and Doblas-Reyes, F. J. (2003). Ozone signatures of climate patterns over the Euro-Atlantic sector in the spring. Quarterly Journal of Royal Meteorological Society, 129:3251–3263.
Pearson, K. (1902). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2:559–572.
Porter, P. S., Hogrefe, C., Gego, E., Foley, K., Goodwitch, J. M., and Rao, S. T. (2010). Application of wavelet filters in an evaluation of photochemical model performance. In Steyn, D. G. and Rao, S. T., editors, Air Pollution Modeling and its Application XX, chapter 4, pages 415–420. Springer.
Preisendorfer, R. W. (1988). Principal Component Analysis in Meteorology and Oceanography. Elsevier.
Preisendorfer, R. W. and Barnett, T. P. (1983). Numerical model-reality intercomparison tests using small-sample statistics.
Journal of the Atmospheric Sciences, 40:1884–1896.
Reuten, C., Ainslie, B., Steyn, D. G., Jackson, P. L., and McKendry, I. (2012). The impact of climate change on ozone pollution in the Lower Fraser Valley, BC. Atmosphere Ocean, 50:42–53.
Richman, M. B. (1986). Review article: Rotation of principal components. Journal of Climatology, 6:293–335.
Robeson, S. M. and Steyn, D. G. (1990). Evaluation and comparison of statistical forecast models for daily maximum ozone concentrations. Atmospheric Environment, 24B:303–312.
Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. (1989). Design and analysis of computer experiments (with discussion). Statistical Science, 4:409–423.
Salmond, J. A. and McKendry, I. G. (2002). Secondary ozone maxima in a very stable nocturnal boundary layer: observations from the Lower Fraser Valley, BC. Atmospheric Environment, 36:5771–5782.
Schonlau, M. and Welch, W. J. (2006). Methods for Experimentation in Industry, Drug Discovery, and Genetics, chapter 14. Springer. Book edited by Dean, A. and Lewis, S.
Seagram, A., Steyn, D. G., and Ainslie, B. (2013). Modelled recirculation of pollutants during ozone episodes in the Lower Fraser Valley, B.C. In Steyn, D. G. and Timmermans, R., editors, Air Pollution Modeling and its Application XXII. Springer.
Skamarock, W. C., Klemp, J. B., Dudhia, J., Gill, D. O., Barker, D. M., Duda, M. G., and Powers, J. G. (2008). Description of advanced research WRF version 3. Technical Report NCAR/TN-475+STR, National Centre for Atmospheric Research.
Steyn, D. G., Ainslie, B., Reuten, C., and Jackson, P. L. (2011). A retrospective analysis of ozone formation in the Lower Fraser Valley, BC, Canada. Available at https://circle.ubc.ca/handle/2429/36587, accessed 2014-10-15.
Steyn, D. G., Ainslie, B., Reuten, C., and Jackson, P. L. (2013). A retrospective analysis of ozone formation in the Lower Fraser Valley, BC, Canada. Part I: Dynamical model evaluation. Atmosphere Ocean, 51:153–169.
Steyn, D.
G., Bottenheim, J. W., and Thomson, R. B. (1997). Overview of the tropospheric ozone in the Lower Fraser Valley, and the Pacific ’93 field study. Atmospheric Environment, 31:2025–2035.
Storch, H. V. and Zwiers, F. W. (1999). Statistical Analysis in Climate Research. Cambridge University Press.
Stull, R. B. (1988). An Introduction to Boundary Layer Meteorology. Springer Books.
Taylor, B. (1992). The relationship between ground-level ozone concentrations, surface pressure gradients, and 850mb temperatures in the Lower Fraser Valley of British Columbia. Technical Report PAES-92-3, Atmospheric Issues and Service Branch, Pacific Region, Environment Canada.
Taylor, E. (1991). Forecasting ground-level ozone in Vancouver and the Lower Fraser Valley of British Columbia. Technical Report PAES-91-3, Scientific Service Division, Pacific Region, Environment Canada.
Thiebaux, H. J. and Zwiers, F. W. (1984). The interpretation and estimation of effective sample size. Journal of Climate and Applied Meteorology, 23:800–811.
Thompson, M. L., Reynolds, J., Cox, L. H., Guttorp, P., and Sampson, P. D. (2001). A review of statistical methods for the meteorological adjustment of tropospheric ozone. Atmospheric Environment, 35:617–630.
US Environmental Protection Agency (2010). MOBILE6 Vehicle Emission Modelling Software. Technical report, U.S.E.P.A. Available at http://www.epa.gov/otaq/m6.htm, accessed 2013-11-09.
Welch, W. J., Buck, R. J., Sacks, J., Wynn, H. P., Mitchell, T. J., and Morris, M. D. (1992). Screening, predicting, and computer experiments. Technometrics, 34:15–25.
WHO (2003). Health aspects of air pollution with particulate matter, ozone and nitrogen dioxide. Technical report, WHO Regional Office for Europe. Available at http://www.euro.who.int/en/health-topics/environment-and-health/air-quality/publications/pre2009/health-aspects-of-air-pollution-with-particulate-matter,-ozone-and-nitrogen-dioxide, accessed 2013-09-13.
Willmot, C. J., Ackleson, S. G., Davis, R. E., Feddema, J.
S., Klink, K. E., Legates, D. R., O’Donnell, J., and Rowe, C. M. (1985). Statistics for the evaluation and comparison of models. Journal of Geophysical Research, 90:8995–9005.
Yarwood, G., Rao, S., Yocke, M., and Whitten, G. Z. (2005). The carbon bond mechanism: CB05. Technical report, U.S. Environmental Protection Agency.
Zidek, J., Le, N. D., and Liu, Z. (2012). Combining data and simulated data for space-time fields: application to ozone. Environmental and Ecological Statistics, 19:37–56.

Appendix A

Appendix Related to Chapter 2

A.1 Details on Simulated Ozone Data

The aim is to emulate, in a simplified form, the space-time feature of west-to-east ozone advection in the LFV (Sections 3.1 and 3.4). This simulation does not include the ozone process over the mountains. Simulation provides space-time ozone data whose underlying structures and statistical properties are known exactly; it also allows for repeated realizations of the same ozone process. The usefulness of simulated data is demonstrated in Appendix B.1.

I simulate 3 types of space-time ozone data:

1. Simulated true ozone. This is the underlying true space-time process, which in reality is never known.
2. Simulated CMAQ output. This is created based on the simulated true ozone. Moreover, I simulate CMAQ output on a complete and regular spatial grid as well as on an irregular grid at the real-life observation locations shown in Figure 2.1.
3. Simulated physical observations. These are simulated at the locations where real-life ozone monitoring stations are situated.

Simulated CMAQ output and physical observations are both generated by adding error functions to the simulated true ozone. Temporally, the simulations are created for a 24-hour period of 0000-2300 during an episode.

In this section, I will first present the method of generating the true LFV ozone field, followed by the methods of generating synthetic CMAQ outputs
and physical observations.

A.1.1 Simulated True Ozone Data

Before developing the simulated data, I first define a few key variables:

• The usual geographical coordinate system of longitude-latitude is not used. Although frequently used to identify locations on a 2-D plane, longitude and latitude actually measure the angles of a location relative to the meridian and the equator, and they are only necessary in large-domain studies where the curvature of the earth makes the horizontal and vertical coordinates non-Cartesian. To avoid misunderstanding, the LFV is mapped onto a self-defined 2-D Cartesian coordinate system. The Southwest corner of the LFV is (x, y) = (0, 0) and the Northeast corner is (x, y) = (100, 100). My formulas for generating simulated data are based on this (x, y) Cartesian coordinate system, with x = 0, . . . , 100 and y = 0, . . . , 100.

• The ozone generating functions are also time-dependent. Since the simulation runs from 0000 to 2300, I define an hourly time variable h, h = 0, . . . , 23, and the transform functions $f(h)$ and $f_T(h)$:

\[ f(h) = \begin{cases} 1.714 \cdot h - 10.286, & h = 6, \ldots, 20, \\ 0, & h = 0, \ldots, 5 \text{ and } 21, \ldots, 23, \end{cases} \]

\[ f_T(h) = \begin{cases} 0.333 \cdot h - 1.667, & h = 5, \ldots, 20, \\ 0, & h = 0, \ldots, 4 \text{ and } 21, \ldots, 23. \end{cases} \]

The units of $f(h)$ and $f_T(h)$ are both hour. This particular formulation is by no means the only possibility; the piecewise linear functions $f(h)$ and $f_T(h)$ are used to transform the daily hour values h = 0, . . . , 23 into a series of values (found by trial and error) convenient for simulating daily ozone and temperature.

Weather Variables

There are two weather variables built into the simulation: wind and temperature. Wind represents the driving force behind the phenomenon of ozone transportation.
Here, I define a westerly wind system that transports the ozone plume eastward (Taylor, 1991; Ainslie and Steyn, 2007) with a constant speed of $3\ \mathrm{m \cdot s^{-1}}$, a speed below which the wind is considered light (Ainslie and Steyn, 2007).

Temperature is an important factor controlling the rate of photochemical reactions (Robeson and Steyn, 1990; Taylor, 1992; Reuten et al., 2012). During a summer-time ozone episode, it has a diurnal profile similar to Figure A.1. This curve is created using the equation

\[ T(h) = \{ -[f_T(h) \cdot U_h]^2 + 5 \cdot f_T(h) \cdot U_h + 17 \} \cdot U_T, \quad h = 0, \ldots, 23, \]

where $U_h = 1 \cdot \mathrm{hour}^{-1}$ makes the hourly function terms unitless and $U_T = 1^{\circ}\mathrm{C}$ gives $T(\cdot)$ an appropriate temperature unit. This function allows for a season-appropriate minimum daily temperature of 17°C during the early morning and the evening. The 2nd order term in turn creates a concave downward function during the daytime, with a daily maximum of 23°C occurring between noon and 1300.

Figure A.1: Simulated diurnal temperature profile.

The Ozone Field

Let x and y range from 0 to 100, and h vary from 0 to 23. Equation (A.1) produces an eastward-moving “Gaussian hill” that simulates daytime ozone,

\[ tO(x, y, h) = c(h) \cdot \exp\left[ -\left( \frac{x - \mu_x(h)}{\sigma_x(h)} \right)^2 - \left( \frac{y - \mu_y(h)}{\sigma_y(h)} \right)^2 \right] + 15\,\mathrm{ppb}. \tag{A.1} \]

As I will show, the term $c(h)$ has unit ppb, while $(x - \mu_x(h))/\sigma_x(h)$ and $(y - \mu_y(h))/\sigma_y(h)$ are unitless. Together with the addition of the constant 15 ppb, (A.1) produces ozone data with unit ppb.

It should be noted that the Gaussian function (A.1) is used only for the purpose of generating the simulation. It is a simplistic description of the more complex real-life ozone process, and the use of a Gaussian spatial profile is a convenient approximation.

Equation (A.1) contains a collection of time-dependent functions: $c(h)$, $\mu_x(h)$, $\mu_y(h)$, $\sigma_x(h)$ and $\sigma_y(h)$. The aforementioned wind and temperature variables are incorporated into these functions, thus influencing the Gaussian ozone function.
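As a minimal sketch of the definitions given so far (not the code used to produce the thesis figures), the hour transforms, the temperature profile and the form of (A.1) can be written in Python. The function names are hypothetical, and the amplitude, centre and spread passed to `gaussian_hill` in the last lines are arbitrary illustrative values; in the simulation they would come from the time-dependent functions of (A.1).

```python
import math


def f(h):
    """Hour transform for ozone: h = 6..20 maps linearly onto roughly 0..24."""
    return 1.714 * h - 10.286 if 6 <= h <= 20 else 0.0


def f_T(h):
    """Hour transform for temperature: h = 5..20 maps onto roughly 0..5."""
    return 0.333 * h - 1.667 if 5 <= h <= 20 else 0.0


def temperature(h):
    """Diurnal temperature T(h) in degrees C (U_h and U_T are unit constants)."""
    ft = f_T(h)
    return -ft ** 2 + 5.0 * ft + 17.0


def gaussian_hill(x, y, c, mu_x, mu_y, sigma_x, sigma_y, background=15.0):
    """Evaluate the form of (A.1) at (x, y) for given hourly parameter values."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    return c * math.exp(-zx ** 2 - zy ** 2) + background


temps = [temperature(h) for h in range(24)]
peak_hour = max(range(24), key=lambda h: temps[h])
print(f"night-time temperature: {temps[0]:.1f} C")
print(f"daily maximum: {max(temps):.2f} C at hour {peak_hour}")

# Illustrative parameter values only (assumed, not taken from the thesis).
centre_value = gaussian_hill(50, 50, c=72.0, mu_x=50, mu_y=50,
                             sigma_x=72.0, sigma_y=48.0)
print(f"ozone at the hill centre: {centre_value:.1f} ppb")
```

At the centre of the hill the exponential equals 1, so the value there is simply the amplitude plus the 15 ppb background; away from the centre the field decays towards the background at a rate set by the two spreads.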
Here are the details needed to construct (A.1).

• $c(h)$ incorporates the diurnal temperature and ozone pattern, and acts as a time-dependent scaling function in (A.1). It is described by the function:

\[ c(h) = \begin{cases} c_0(h), & T(h) \le 20^{\circ}\mathrm{C}, \\ \dfrac{1}{20^{\circ}\mathrm{C}} \cdot T(h) \cdot c_0(h), & T(h) > 20^{\circ}\mathrm{C}. \end{cases} \]

$T(h)$ is the aforementioned diurnal temperature profile. The stepwise function scales the daily temperature values to 1 at $T(h) \le 20^{\circ}\mathrm{C}$ and to values above 1 at $T(h) > 20^{\circ}\mathrm{C}$ using the factor $(20^{\circ}\mathrm{C})^{-1}$. The scaled hourly temperatures are then applied to (A.1).

$c_0(h)$ depends on time as:

\[ c_0(h) = 24\,\mathrm{ppb} \cdot \left\{ 1 + \frac{-[f(h) \cdot U_h]^2 + 24 f(h) \cdot U_h - 48}{48} \right\}, \]

where $f(h)$ and $U_h$ are the already defined hourly function and its unit constant. The constant 24 ppb gives the scaling function $c(h)$ a desired diurnal profile and a unit of ppb. By definition, $c_0(h)$ is 0 when $f(h) = 0$. In turn, $c(h)$ becomes 0 and equation (A.1) takes a minimum background ozone of 15 ppb. This takes place between late evening and early morning. $c(h)$ takes positive values during 0600-2000. The diurnal profile of $c(h)$, h = 0, . . . , 23, is shown in Figure A.2.

Figure A.2: Simulated diurnal profile of c(h), the time-dependent scaling factor in (A.1).

• $\mu_x(h)$ and $\mu_y(h)$ are the locations of the maximum of the Gaussian hill along the x-axis and y-axis, i.e., the hourly location of the highest ozone concentration. Making these vary by the hour enables the ozone field to travel in any direction as the day goes by. In order to mimic the commonly observed ozone movement in the LFV, $\mu_y(h)$ is fixed at the middle of the y-range: $\mu_y = 50$ for h = 0, . . . , 23. As the map in Figure 2.1 shows, the LFV’s vertical extent is less than its horizontal extent. Because I specified both location variables x and y to range from 0 to 100, it is not sensible to let both have the same unit. Here, I let x have unit kilometre (km) and y have an arbitrary distance unit $U_y$. The units of x and y are not critical since both the $(x - \mu_x(h))/\sigma_x(h)$ and $(y - \mu_y(h))/\sigma_y(h)$ terms are unitless.
Moreover, letting $\mu_x(h)$ increase during daytime simulates an eastward ozone movement. Conversely, a decreasing $\mu_x(h)$ translates to a westward ozone movement. $\mu_x(h)$ has the form

\[ \mu_x(h) = \mathrm{Wind} \cdot U_w \cdot f(h) + 24\,\mathrm{km}. \]

As discussed, the parameter Wind is fixed at 3 m/s; together with a positive sign, it corresponds to an eastward ozone movement at a constant speed of 3 m/s. The constant $U_w = 3.6\ (\mathrm{km \cdot s})/(\mathrm{hour \cdot m})$ transforms the wind parameter into unit kilometres per hour ($\mathrm{km \cdot hour^{-1}}$). The larger the parameter Wind, the faster the simulated ozone plume travels. Multiplied by $f(h)$, an hourly variable with unit hour, I obtain a distance unit km for $\mu_x(h)$.

An additive term of 24 km results in $\mu_x(7) \approx 30$ km, i.e., the centre of ozone formation at 0700 takes place near x = 30 km, the approximate centre of Vancouver city in this simulated LFV.

• $\sigma_x(h)$ and $\sigma_y(h)$ are time-varying spatial values that control the “spread” of the Gaussian field in the x and y orientations. Observations and modelling show that the spatial variation is larger from East to West than from North to South, so $\sigma_x(h) > \sigma_y(h)$ for h = 0, . . . , 23. I define them as time-dependent 2nd order functions:

\[ \sigma_y(h) = 24 \cdot \left\{ 1 + \frac{24 f(h) U_h - [f(h) U_h]^2}{144} \right\} \cdot U_y \quad \text{and} \]

\[ \sigma_x(h) = (1.5\,\mathrm{km}) \cdot (\sigma_y(h)\, U_y^{-1}). \]

A quick explanation of the unit constants: $U_y$ is the aforementioned distance unit of y, and the constant 1.5 km in the second equation gives $\sigma_x$ a unit of km.

In summary, the defining characteristics of the simulated true ozone data are easily summarized by understanding the temporal functions $c(h)$, $\mu_x(h)$, $\mu_y(h)$, $\sigma_x(h)$ and $\sigma_y(h)$. The scaling function $c(h)$ incorporates the diurnal temperature and ozone trend to increase or decrease the spatial ozone levels at
the appropriate hours. The means $\mu_x(h)$ and $\mu_y(h)$ incorporate the wind information to transport the simulated ozone field with the appropriate direction and speed (ozone advection). The mean functions also define the hourly locations of high ozone concentrations. Finally, the spatial deviation functions $\sigma_x(h)$ and $\sigma_y(h)$ control how widely the hourly ozone field is spread along the East-West and North-South orientations.

All functional coefficients and scale values were determined following extensive fine-tuning. Figures A.3 and A.4 show the spatial ozone field of the simulated true data for selected hours. The simulated data are produced over a spatial grid of 51 × 51 (horizontal × vertical) cells, hence the ozone data shown have dimension 24 × 2601.

A.1.2 Simulated CMAQ Output and Observation

The CMAQ model output and physical observation are commonly regarded as functions of the true underlying ozone level $tO(x, y, h)$ with an additive random error (Kennedy and O’Hagan, 2001; Fuentes and Raftery, 2005). If $cO(x, y, h)$ and $oO(x, y, h)$ are the simulated CMAQ output and observation respectively at location (x, y) and time h, they are expressed by the formulas:

\[ cO(x, y, h) = f_{cmaq}[tO(x, y, h)] \quad \text{and} \tag{A.2} \]

\[ oO(x, y, h) = tO(x, y, h) + \varepsilon_o, \]

where $\varepsilon_o$ is the independent random error of observation. Although the observed value is often treated as the “true” ozone level when judging against air-quality model output, random measurement error ($\varepsilon_o$) is in reality unavoidable. The task of ozone monitoring may further be complicated by a host of factors. However, in general it is sufficient to regard a physical measurement as the sum of the true value and an additive error (Dennis et al., 2010). Observations are simulated at the actual observation locations: I transformed the longitude-latitude of the ozone monitoring sites onto the 2-D (x, y) coordinate system of the simulated true ozone field.

The procedure for simulating a space-time CMAQ output is more involved. First, one needs to conceptualize the real-life relationship between
observation and the corresponding (in space and time) CMAQ model output. Figure A.5 plots observations against CMAQ using data from June 26th, 2006, the 4th day of the 2006 ozone episode, at all observation locations. The middle diagonal line is x = y; the other two lines are x = 0.5 · y and x = 2 · y. CMAQ outputs are interpolated onto the locations of the physical measurements as described in Section 2.2.

(a) Hours 0400 and 0700. (b) Hours 1000 and 1300.
Figure A.3: Simulated true ozone fields at hours 0400, 0700, 1000 and 1300.

Figure A.4: Simulated true ozone fields at hours 1600 and 1900.

The goal is to simulate the slightly downward concave pattern seen in Figure A.5. After some trial and error, the detailed formulation of $cO(x, y, h)$ in (A.2) is determined to be

\[ cO(x, y, h) = f_s[\delta(x, y, h)] \cdot \delta(x, y, h), \quad \text{where} \tag{A.3} \]

\[ \delta(x, y, h) = |tO(x, y, h) + \varepsilon_c|, \quad \varepsilon_c \sim N(\mu_c, \sigma_c^2), \quad \text{and} \]

\[ f_s[\delta(x, y, h)] = a_{f_s}\, \delta(x, y, h)^2 + b_{f_s}\, \delta(x, y, h) + 1.02, \]

where $a_{f_s} = -1.25 \cdot 10^{-4}\ \mathrm{ppb^{-2}}$ and $b_{f_s} = 6.74 \cdot 10^{-3}\ \mathrm{ppb^{-1}}$.

The function $\delta(x, y, h) \ge 0$ ppb is the absolute value of the sum of the true simulated ozone and a random error. The term $\varepsilon_c$ is the random additive error of CMAQ; it has unit ppb.

In order to capture the aforementioned CMAQ behaviour in the simulated data, I use the multiplicative scaling function $f_s[\delta(x, y, h)]$ in (A.3), which produces scaling factors for $cO(x, y, h)$. A plot of the scaling factors against a range of $\delta(x, y, h)$ is shown in Figure A.6. Values of $f_s[\delta(x, y, h)]$ start above 1 at $\delta(x, y, h) = 0$ and drop below 1 when $\delta(x, y, h)$ reaches higher ozone levels. Such a scaling profile helps to simulate the concave pattern in Figure A.5. Note that the constants $a_{f_s}$ and $b_{f_s}$ in the scaling function have units

[Scatter plot titled “June 26th, 2006”. x-axis: Observed ozone (ppb); y-axis: CMAQ output (ppb).]
Figure A.5: CMAQ output vs.
observation for June 26th, 2006, the 4th day of the 2006 ozone episode.

that make $f_s[\delta(x, y, h)]$ unitless, which gives $cO(x, y, h)$ a unit of ppb when multiplied by $\delta(x, y, h)$.

Figure A.6: $f_s[\delta(x, y, h)]$ (scaling term in (A.3)) plotted against $\delta(x, y, h)$ (ppb).

Lastly, one needs to define the parameters of the random errors: $\varepsilon_o \sim N(0, \sigma_o^2)$ and $\varepsilon_c \sim N(\mu_c, \sigma_c^2)$. Table A.1 lists the parameter values, which are based on the grand mean of the simulated true data:

\[ \bar{Z} = \frac{1}{24 \cdot 100 \cdot 100} \sum_{h=0}^{23} \sum_{x=1}^{100} \sum_{y=1}^{100} tO(x, y, h) = 31.5\,\mathrm{ppb}. \]

In keeping with the knowledge of CMAQ’s deficiency in low-concentration ozone modelling, the mean and variance of the random errors increase when $tO(x, y, h) \le 30$ ppb.

Parameter      Condition                  Value
σ²_o           (all)                      0.20 · Z̄
µ_c            tO(x, y, h) > 30 ppb       0.1 · Z̄
σ²_c           tO(x, y, h) > 30 ppb       0.25 · Z̄
µ_c            tO(x, y, h) ≤ 30 ppb       0.1 · Z̄ + 2 ppb
σ²_c           tO(x, y, h) ≤ 30 ppb       0.25 · Z̄ + 3 ppb

Table A.1: Parameter values used for generating additive errors in simulated CMAQ and observation.

Figure A.7b plots the simulated CMAQ against the simulated observations; the same plot using the real data is shown again for convenience in Figure A.7a. The percentage of points lying within the bounds is around 70% for both the simulation and the real data. As seen from the “scatter patterns”, the simulations captured the main characteristics of the “real” CMAQ outputs and observations. Figure A.8 shows the same scatter plot of the simulated CMAQ against observations, where the CMAQ is simulated without the scaling function $f_s[\delta(x, y, h)]$ in (A.3). It is noticeable that without the scaling factors, the simulated CMAQ output no longer captures the important “concave relationship” with the observations, indicating the importance of $f_s[\delta(x, y, h)]$ as a part of the simulation procedure.

The simulated ozone observations are not used in ozone PCA.
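The error model in (A.2), (A.3) and Table A.1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code behind Figures A.7 and A.8: the function names are hypothetical, the table entries are read as variances (as labelled), and the true ozone values 15 ppb and 80 ppb used at the end are arbitrary examples.

```python
import math
import random

Z_BAR = 31.5      # grand mean of the simulated true ozone (ppb)
A_FS = -1.25e-4   # quadratic coefficient of f_s (ppb^-2)
B_FS = 6.74e-3    # linear coefficient of f_s (ppb^-1)


def scaling_factor(delta):
    """f_s[delta] from (A.3): above 1 near zero, below 1 at high ozone."""
    return A_FS * delta ** 2 + B_FS * delta + 1.02


def simulate_observation(true_ozone, rng):
    """oO = tO + eps_o with eps_o ~ N(0, 0.20 * Z_bar)."""
    return true_ozone + rng.gauss(0.0, math.sqrt(0.20 * Z_BAR))


def simulate_cmaq(true_ozone, rng, use_scaling=True):
    """cO = f_s[delta] * delta with delta = |tO + eps_c| (Table A.1 errors)."""
    if true_ozone > 30.0:
        mu_c, var_c = 0.1 * Z_BAR, 0.25 * Z_BAR
    else:
        mu_c, var_c = 0.1 * Z_BAR + 2.0, 0.25 * Z_BAR + 3.0
    delta = abs(true_ozone + rng.gauss(mu_c, math.sqrt(var_c)))
    return scaling_factor(delta) * delta if use_scaling else delta


rng = random.Random(1)
n_draws = 2000
mean_low = sum(simulate_cmaq(15.0, rng) for _ in range(n_draws)) / n_draws
mean_high = sum(simulate_cmaq(80.0, rng) for _ in range(n_draws)) / n_draws
print(f"mean simulated CMAQ at tO = 15 ppb: {mean_low:.1f} ppb")
print(f"mean simulated CMAQ at tO = 80 ppb: {mean_high:.1f} ppb")
```

Under these settings the simulated CMAQ output sits above the true value at low ozone and below it at high ozone, which is the concave relationship the scaling function is meant to reproduce; passing `use_scaling=False` removes that curvature, as in Figure A.8.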
Owing to their straightforward relationship with the true ozone, observations are simulated to provide reference points for the formulation of (A.3), the method of simulating CMAQ output.

Figure A.7: Scatter plots for assessing "similarities" between the real data and simulated data. (a) CMAQ output vs. observation for June 26th, 2006, the 4th day of the 2006 ozone episode (repeated from Figure A.5). (b) Simulated CMAQ vs. simulated observation. Axes: observed ozone (ppb) against CMAQ ozone (ppb).

Figure A.8: Plot of simulated CMAQ vs. simulated observation, where the CMAQ data are simulated without the use of the scaling function $f_s[\delta(x, y, h)]$.

Appendix B
Appendix Related to Chapter 3

B.1 Analyses of Simulated (Synthetic) Ozone Field

This appendix section presents the simulation-based analysis that determines the number of meaningful ozone features; this analysis was briefly summarized in Section 3.3. The analysis is based on the repeated generation of the simulated (synthetic) CMAQ data described in Appendix A.1, and it proceeds as follows:

1. Generate two sets of synthetic space-time CMAQ outputs: one dataset is used as the "modelling set" and the other as the "testing set", which I denote as $O_{model}$ and $O_{test}$.

2. The "modelling set" is subjected to PCA, and the output EOFs and PCs are used to build ozone predictions for $O_{test}$ via (3.3): $\sum_{j=1}^{p} P_j E_j^T$ for $p = 1$, $p = 2$ and so forth. That is, make predictions for $O_{test}$ using an increasing number of $E_j$'s and $P_j$'s decomposed from $O_{model}$. Moreover, these simulations are generated with dimension $n \times t$. This step also shortens the simulation time: $O_{model}^T O_{model}$ has dimension $t \times t$ instead of $n \times n$, where usually $n > t$ in synthetic data.

3. Predictions with $p = 1, \ldots, 6$ are evaluated against the simulated "testing set" $O_{test}$ and their Root Mean Squared Errors (RMSEs) are calculated.

4. Repeat the above 3 steps. I chose a repetition size of 500.

This simulation exercise is designed around the idea that ozone data simulated using exactly the same simulation parameters are multiple realizations of the same process. In other words, these simulated CMAQ fields are driven by a common ozone formation-circulation mechanism, and any differences between simulations are purely due to random noise. Adding more EOFs to (3.3) increases the prediction quality for the modelling set, as more underlying ozone features and patterns (data components) are used. However, the EOFs from one simulation are used in (3.3) to predict another simulation, and it can be argued that after a certain point, adding EOFs into (3.3) will cease to be beneficial and the prediction quality for the testing set will deteriorate. Since EOFs are noise after a few spatially or temporally "meaningful" ones, adding further EOFs is tantamount to using one set of noise to predict another set of noise, and the modelling quality subsequently suffers. The transition from "beneficial" to "detrimental" is the $p$ value to choose in (3.3), as it signals that additional EOFs no longer represent useful ozone features.

Each simulation produces 6 RMSEs, denoted as $RMSE_j$, $j = 1, \ldots, 6$. With a simulation size of 500, there are 500 sets of $(RMSE_1, \ldots, RMSE_6)$, and subsequently of the differences $(RMSE_1 - RMSE_2, \ldots, RMSE_5 - RMSE_6)$. I use these $RMSE_j - RMSE_{j+1}$ samples to determine the number of "useful" EOFs/PCs.
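The four steps above can be sketched as follows. This is an illustrative toy, not the thesis code: the rank-3 synthetic field, its amplitudes, the noise level and the reduced repetition count are all assumptions chosen here so that the example runs quickly, and the simulated CMAQ fields of Appendix A.1 are not reproduced. The PCA is done through a singular value decomposition of the $t \times n$ modelling set.

```python
import numpy as np


def synthetic_field(rng, n_hours=24, n_sites=100, noise_sd=1.0):
    """One noisy realization of a hypothetical rank-3 space-time field (t x n)."""
    hours = np.arange(n_hours)
    sites = np.arange(n_sites)
    signal = np.zeros((n_hours, n_sites))
    for k, amplitude in enumerate((30.0, 10.0, 5.0), start=1):
        temporal = np.cos(2 * np.pi * k * hours / n_hours)
        spatial = np.sin(np.pi * k * (sites + 0.5) / n_sites)
        signal += amplitude * np.outer(temporal, spatial)
    return signal + rng.normal(scale=noise_sd, size=signal.shape)


def rmse_by_order(o_model, o_test, max_p=6):
    """RMSE against o_test of the rank-p reconstruction of o_model, p = 1..max_p."""
    # SVD of the t x n modelling set: PCs are u * s, EOFs are the rows of vt.
    u, s, vt = np.linalg.svd(o_model, full_matrices=False)
    rmses = []
    for p in range(1, max_p + 1):
        prediction = (u[:, :p] * s[:p]) @ vt[:p, :]
        rmses.append(np.sqrt(np.mean((prediction - o_test) ** 2)))
    return np.array(rmses)


rng = np.random.default_rng(0)
n_reps = 50  # the thesis uses 500; reduced here to keep the sketch fast
rmse = np.array([
    rmse_by_order(synthetic_field(rng), synthetic_field(rng))
    for _ in range(n_reps)
])
diffs = rmse[:, :-1] - rmse[:, 1:]   # RMSE_j - RMSE_{j+1}, j = 1..5
mean_diffs = diffs.mean(axis=0)
```

Because the toy signal is built from exactly three components, the mean differences are positive up to $RMSE_2 - RMSE_3$ and negative afterwards. That cut-off is a property of the toy construction and should not be read as a reproduction of the thesis results.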
Figure B.1 shows the histograms of each $RMSE_j - RMSE_{j+1}$, and Table B.1 shows the 95% confidence intervals of the means of the $RMSE_j - RMSE_{j+1}$'s, which are calculated as

\[ \text{Sample Mean}(RMSE_j - RMSE_{j+1}) \pm 1.96 \cdot \frac{\text{Sample Std. deviation}(RMSE_j - RMSE_{j+1})}{\sqrt{n}}. \]

The histograms in Figure B.1 are approximately Normally distributed, hence I used the multiplier $\pm 1.96$ that defines a 95% confidence region of a Normal distribution. As shown, the $RMSE_j - RMSE_{j+1}$ start to become negative at $j = 3$. The range of $RMSE_j - RMSE_{j+1}$ is completely negative at $p = 4$, implying that a model with 4 EOFs is worse than one with 3 EOFs.

Using our simulated CMAQ fields as the reference data, analyses in this subsection show that at $p = 3$, EOFs/PCs begin to capture less-structured patterns. From $p = 4$ onward, the EOFs/PCs extracted from our simulated CMAQ fields begin to be dominated by random noise.

Difference            Lower bound    Upper bound
$RMSE_1 - RMSE_2$      0.09           0.24
$RMSE_2 - RMSE_3$     -0.03           0.11
$RMSE_3 - RMSE_4$     -0.32          -0.19
$RMSE_4 - RMSE_5$     -0.31          -0.18
$RMSE_5 - RMSE_6$     -0.28          -0.18

Table B.1: Estimated 95% confidence intervals for the means of the $RMSE_j - RMSE_{j+1}$'s.

Figure B.1: Histograms of $RMSE_i - RMSE_{i+1}$ from simulation.

B.2 Plots of $E_j$ from Column-centered Ozone Data and Rotated-$E_j$

Figure B.2 compares the $E_j$ (left), $j = 2, 3, 4$, decomposed from $O_{t \times n}$ to the $E_j^{centre}$ from column-centered ozone data of orders $j = 1, 2, 3$. Figure 3.9 in Section 3.4 showed that $E_1$ from the original data captures the spatial variation of the LFV's temporal ozone mean. Subtracting the mean from each location results in $E_1^{centre}$ no longer capturing said mean feature. As Figure B.2 shows, the spatial features captured by $E_j^{centre}$ are similar to those of $E_{j+1}$. Hence, the PCA of column-centered data extracted similar features to the PCA of the original data, albeit with the feature rank $j$ one order lower.

Figure B.2: Comparison plots between the $E_j$ from the PCA of the original $O_{t \times n}$ (left) and column-centered ozone data (right). The EOF orders are $j = 2, 3, 4$ for the original data and $j = 1, 2, 3$ for the centered data. The ozone data is the CMAQ output for the entire 96 hours of the 2006 episode.

Figure B.3 shows the time-series of hourly LFV ozone means and standard deviations of the 2006 CMAQ output, along with $P_j^{centre}$, $j = 1, \ldots, 4$, obtained from the PCA of column-centered data (the 2006 CMAQ output). Figure B.4 shows the same-ordered $P_j$ from the original data. The 1st-order temporal feature captures the time-series pattern of the hourly LFV mean, with PC values ranging from negative to positive. This shows that the subtraction of the mean field will retain the structure of the hourly spatial ozone mean. The space-time feature $P_1^{centre} (E_1^{centre})^T$ is not the dynamic east-west contrast captured by $P_2 E_2^T$ (Section 3.4): it shows the pattern of $E_1^{centre}$ scaled positive during the daytime and negative at night, where the magnitudes of the ozone values (in both the negative and positive directions) are higher for the western LFV. The dynamic patterns of $P_j^{centre} (E_j^{centre})^T$, $j \geq 2$, reflect those of $P_j E_j^T$ at $j \geq 3$.

Figure B.3: From the PCA of centered ozone data: plots of hourly LFV ozone mean, standard deviation and $P_j$, $j = 1, \ldots, 4$. The number in each PC plot heading indicates the proportion of data variation the feature recovers: PC 1 (0.83), PC 2 (0.08), PC 3 (0.03), PC 4 (0.02). Axes: time (0000PST June 24th onward) against O3 level or PC value (ppb). The ozone data is the CMAQ output for the entire 96 hours of the 2006 episode.

Figure B.4: From the PCA of original ozone data: plots of hourly LFV ozone mean, standard deviation and $P_j$, $j = 1, \ldots, 4$. The number in each PC plot heading indicates the proportion of data variation the feature recovers: PC 1 (0.95), PC 2 (0.03), PC 3 (0.01), PC 4 (0). Axes: time (0000PST June 24th onward) against O3 level or PC value (ppb). The ozone data is the CMAQ output for the entire 96 hours of the 2006 episode.

Figure B.5 compares the original $E_j$, $j = 1, \ldots, 4$, to VARIMAX rotated EOFs of the same order. The EOF rotation is done for $E_j$, $j = 1, \ldots, 4$, simultaneously; the $E_j$'s of orders $j \geq 5$ remain unrotated. As shown, the rotated-$E_1$ captures a spatial ozone pattern with a region of positive spatial weights around the middle LFV, with a maximum at Maple Ridge, and areas of negative spatial weights at the two edges of the LFV. When multiplied by the corresponding $P_1$, which is strictly positive, the dynamic spatial pattern (not shown) captures a daily peak around the middle of the LFV during the afternoon (1400PST to 1600PST) and a negative peak at the edges of the LFV during the same afternoon hours. After rotation, EOFs of orders $j = 2, 3, 4$ revealed similar spatial features to the un-rotated $E_j$'s (Figure B.5).
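For readers unfamiliar with the rotation, the following is a minimal sketch of a common SVD-based VARIMAX iteration applied to the first four EOFs. It is an assumption-laden illustration: the thesis does not state which implementation or normalization was used, the function name `varimax` and the random stand-in field are hypothetical, and the EOFs are rotated here at unit length without any rescaling.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate the columns of `loadings` (n x k) with the VARIMAX criterion.

    Returns the rotated loadings and the k x k orthogonal rotation matrix.
    """
    n, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the VARIMAX criterion with respect to the rotation.
        gradient = loadings.T @ (
            rotated**3 - rotated @ np.diag(np.sum(rotated**2, axis=0)) / n
        )
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break  # no meaningful improvement: stop iterating
        criterion = new_criterion
    return loadings @ rotation, rotation


rng = np.random.default_rng(1)
field = rng.normal(size=(96, 229))      # hypothetical t x n stand-in data
_, _, vt = np.linalg.svd(field, full_matrices=False)
eofs = vt[:4].T                         # first four EOFs as columns (n x 4)
rotated_eofs, rotation = varimax(eofs)
```

Since the rotation matrix is orthogonal, the four rotated EOFs span the same subspace as the original four, so the combined reconstruction from orders 1 to 4 is unchanged; only the way the variation is split among the four patterns differs.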
We also experimented with data from other episodes, data with differently sized spatial domains and various orders of VARIMAX rotation (how many $E_j$ to rotate). It was found that for the particular LFV ozone data under analysis, rotated EOFs did not provide a clear advantage in interpretability compared to regular EOFs.

Figure B.5: Comparison plots between the original (left) and VARIMAX rotated (right) $E_j$, $j = 1, \ldots, 4$. The VARIMAX rotation is done for the first 4 $E_j$'s only. The dataset is the CMAQ output for the entire 96 hours of the 2006 episode.

Appendix C
Appendix Related to Chapter 4

C.1 PCA of Meteorological and Chemical Precursor Variables

This section shows the spatial-temporal ozone means and standard deviations, as well as the PCA outputs $E_j$'s and $P_j$'s of the ozone model variables. They are space-time data of: temperature, wind speed, planetary boundary layer height, NOx and VOC emission rates and antecedent concentrations.

• NOx: The 1st EOF captures the structures and variations of both the field of means and the field of standard deviations of NOx emission (Figure C.1a). The 3rd EOF shows a well-defined spatial structure, albeit with less interpretability. The 2nd and 4th EOFs both have spatial fields that show little variation. The 1st PC reflects the temporal patterns of both the hourly spatial means and standard deviations (Figure C.1b). Note the double peak of NOx production in the morning and afternoon, illustrating how emissions are distributed inside SMOKE. The first PC alone accounts for about 98% of data variation, although the higher-order PCs still represent noticeable temporal patterns.

• Temperature: The 1st and 2nd EOFs capture the spatial patterns of the episodic means and standard deviations (averaged across the 96 hours of the episode, Figure C.2a). The 3rd EOF shows a spatial pattern reflecting the topography of my self-defined LFV: it captures the spatial contrast in temperature between the "valley floor" and the mountainous region. The first 3 PCs show distinctive diurnal hourly patterns. It seems that the 2nd and 3rd PCs each capture a specific feature of the hourly spatial standard deviation of temperature (Figure C.2b).

• Wind speed: The 1st and 2nd EOFs have near-identical spatial distributions to the data means and standard deviations (Figure C.3a). As with temperature, the 3rd EOF shows a spatial pattern akin to the topography of the LFV. The 1st PC, as usual, reflects the time-series of the hourly wind speed averaged across space (Figure C.3b). When examined closely, the 2nd PC is inversely related to the hourly pattern of the spatial standard deviation. The later PCs also capture clear temporal structures. Together, the first 3 EOFs/PCs are responsible for over 95% of data variation.

• Boundary layer height: The spatial fields of means and standard deviations have very similar patterns, both of which are captured by the 1st EOF (Figure C.4a). The diurnal patterns of the 1st PC and the hourly spatial means (Figure C.4b) closely resemble the 1st-order temperature feature (Figure C.2b). Higher-order EOFs and PCs also exhibit discernible spatial and temporal patterns.

• Antecedent NOx: As with NOx emission, the spatial fields of episodic means and standard deviations have a similar pattern, and it is captured by the 1st EOF (Figure C.5a). The diurnal patterns of hourly spatial means and standard deviations are similar: they peak early in the morning, then decrease significantly during the daytime before recovering late in the afternoon (Figure C.5b). This indicates that the peak of photochemical reaction, and thus of NOx consumption, takes place during daytime, whereas morning and night-time are times for NOx deposition. As with all aforementioned model variables, the higher-order data features decomposed from the antecedent NOx data display strong spatial and temporal structures.

• VOC emission rate and antecedent VOC: The conclusions are similar to those for NOx emission and antecedent NOx concentration (figures not shown).

As one might expect, without centering (or standardizing) the space-time data, the spatial/temporal mean and standard deviation become the dominant features. Once again, I do not perform centering or standardization on the data because my goal is to analyze the most important data features. For all model variables, the first 3 EOF-PC pairs capture over 90%, or in certain cases close to 100%, of data variation. Hence the full covariate set of the ozone feature models contains the first 3 EOFs/PCs of all 7 variables.

In Figures C.1 to C.5, panel (a) plots the spatial fields of means, standard deviations and the first four EOFs, and panel (b) plots the hourly spatial means, standard deviations and the first four PCs. The value in each PC plot heading indicates the proportion of data variation explained, and the dashed lines indicate hour 0000 of each day.

Figure C.1: Spatial and temporal feature plots of NOx emission rates associated with the 2006 CMAQ output (SMOKE output).

Figure C.2: Spatial and temporal feature plots of temperature associated with the 2006 CMAQ output (WRF output).

Figure C.3: Spatial and temporal feature plots of the wind speed associated with the 2006 CMAQ output (WRF output).

Figure C.4: Spatial and temporal feature plots of the boundary-layer (BL) height associated with the 2006 CMAQ output (WRF output).

Figure C.5: Spatial and temporal feature plots of the antecedent NOx concentration data associated with the 2006 CMAQ output.

C.2 Prediction Bias of Feature-Based Ozone Model

Table 4.5 showed Mean Percentage Errors (MPEs).
Because the MPE does not take the absolute values of the prediction residuals, positive errors offset negative errors. Hence the MPE can be viewed as a rough measurement of prediction bias. The MPE results show that the ozone predictions are close to unbiased during the day-time, but the bias becomes more pronounced towards the night-time. At 1900PST, the mean prediction residual is $-5$ ppb for the CII model and $-7$ ppb for the VM model. The residual means for both models gradually increase as the night progresses, while they are close to 0 during earlier hours.

In addition to the discussed "lack of accuracy in night-time PC predictions", I suspect there is also the issue of ozone prediction bias. As discussed, the predictions of each ozone feature are unbiased owing to the application of BLUP. However, as I will now show, the use of equation (4.13) entails that the ozone field prediction is biased.

Let $E_{i,j}$ denote the $i$-th element of $E_j$ and $P_{h,j}$ the $h$-th element of $P_j$; hence the ozone prediction at the $i$-th location and $h$-th hour of my self-defined modelling region is

\[ \hat{O}_{h,i} = \sum_{j=1}^{p} \hat{P}_{h,j} \hat{E}_{i,j}, \quad \text{where } p = 3,\ h = 1, \ldots, 24 \text{ and } i = 1, \ldots, 229. \]

Here, the "hat" notation indicates an estimate. The issue is that $\hat{O}_{h,i}$ is a statistically biased prediction of $O_{h,i}$. This conclusion can be verified by expanding the covariance equation between $E_{i,j}$ and $P_{h,j}$, writing $\mathrm{E}(\cdot)$ for expectation:

\[ \mathrm{Cov}(P_{h,j}, E_{i,j}) = \mathrm{E}\left[(P_{h,j} - \mathrm{E}(P_{h,j}))(E_{i,j} - \mathrm{E}(E_{i,j}))\right] = \mathrm{E}(P_{h,j} E_{i,j}) - \mathrm{E}(P_{h,j})\,\mathrm{E}(E_{i,j}) \]
\[ \Rightarrow \quad \mathrm{E}(P_{h,j} E_{i,j}) = \mathrm{E}(P_{h,j})\,\mathrm{E}(E_{i,j}) + \mathrm{Cov}(P_{h,j}, E_{i,j}), \quad \text{and similarly} \]
\[ \mathrm{E}(\hat{P}_{h,j} \hat{E}_{i,j}) = \mathrm{E}(\hat{P}_{h,j})\,\mathrm{E}(\hat{E}_{i,j}) + \mathrm{Cov}(\hat{P}_{h,j}, \hat{E}_{i,j}). \]

I have established the unbiasedness of the ozone feature predictions, $\mathrm{E}(\hat{E}_{i,j}) = \mathrm{E}(E_{i,j})$ and $\mathrm{E}(\hat{P}_{h,j}) = \mathrm{E}(P_{h,j})$. Therefore, the space-time ozone predictions $\hat{P}_j \hat{E}_j^T$ become unbiased predictors by adding a correction term:

\[ \mathrm{Cov}(P_{h,j}, E_{i,j}) - \mathrm{Cov}(\hat{P}_{h,j}, \hat{E}_{i,j}). \]

That is, for every EOF-PC pair, the covariance between the elements of $E_j$ and $P_j$ minus the covariance between their corresponding predictions.
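The identity behind this correction can be checked numerically. The sketch below is a toy illustration under stated assumptions, not the thesis procedure: the samples `p_true`, `e_true`, `p_hat` and `e_hat` are hypothetical scalar stand-ins for one $(P_{h,j}, E_{i,j})$ pair and its prediction, the predictions are forced to share the sample means of the true values (mimicking unbiased feature predictions), and population (divide-by-$m$) covariances are used so that the identity holds exactly for sample moments.

```python
import numpy as np


def pop_cov(x, y):
    """Population (divide-by-m) covariance of two equally long samples."""
    return np.mean((x - x.mean()) * (y - y.mean()))


rng = np.random.default_rng(0)
m = 10_000

# Hypothetical "true" pair with positive covariance through a shared term.
shared = rng.normal(size=m)
p_true = 5.0 + shared + rng.normal(size=m)
e_true = 2.0 + 0.5 * shared + rng.normal(size=m)

# Hypothetical predictions: same sample means, but (nearly) uncorrelated.
p_hat = rng.normal(size=m)
p_hat += p_true.mean() - p_hat.mean()
e_hat = rng.normal(size=m)
e_hat += e_true.mean() - e_hat.mean()

naive = p_hat * e_hat
corrected = naive - pop_cov(p_hat, e_hat) + pop_cov(p_true, e_true)

target = np.mean(p_true * e_true)
bias_naive = naive.mean() - target          # roughly Cov(hat) - Cov(true)
bias_corrected = corrected.mean() - target  # zero up to rounding error
```

In this toy setting the uncorrected product misses the target mean by about the difference of the two covariances, while the corrected product matches it to within floating-point error.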
The resultant unbiased ozone prediction function is

\[ \{\hat{O}_{h,i}\}_{unbiased} = \sum_{j=1}^{p} \left\{ \hat{P}_{h,j} \hat{E}_{i,j} - \mathrm{Cov}(\hat{P}_{h,j}, \hat{E}_{i,j}) + \mathrm{Cov}(P_{h,j}, E_{i,j}) \right\}, \tag{C.1} \]

and it can be shown that for each $j$,

\[ \mathrm{E}\left[(\hat{P}_{h,j} \hat{E}_{i,j})_{unbiased}\right] = \mathrm{E}(\hat{P}_{h,j} \hat{E}_{i,j}) - \mathrm{Cov}(\hat{P}_{h,j}, \hat{E}_{i,j}) + \mathrm{Cov}(P_{h,j}, E_{i,j}) \]
\[ = \mathrm{E}(\hat{P}_{h,j})\,\mathrm{E}(\hat{E}_{i,j}) + \mathrm{Cov}(P_{h,j}, E_{i,j}) = \mathrm{E}(P_{h,j} E_{i,j}). \]

For any $j$-th EOF/PC, $\mathrm{Cov}(P_{h,j}, E_{i,j})$ is not the covariance between location $i$ and time $h$. Rather, it is the covariance between the $i$-th element of an EOF and the $h$-th element of the corresponding PC: it measures the covariance between the spatial feature at a particular location, $E_{i,j}$, and the temporal feature at a particular hour, $P_{h,j}$. A similar interpretation can be made for the covariance $\mathrm{Cov}(\hat{P}_{h,j}, \hat{E}_{i,j})$. Using the data at hand, I devised a way of estimating the covariances between all $24 \times 229 \times 3 = 16488$ pairs of $\{P_{h,j}, E_{i,j}\}$ and the corresponding $\{\hat{P}_{h,j}, \hat{E}_{i,j}\}$.

Re-sampling Method

Given ozone data of dimension $t \times n$, one can only extract one pair of $\{P_{h,j}, E_{i,j}\}$. To obtain a sample for any pair of $\{P_{h,j}, E_{i,j}\}$, one may repeatedly sample from the complete $t \times n$ data to construct subsamples. For each subsample, decompose and extract $\{P_{h,j}, E_{i,j}\}$. The detailed implementation proceeds as follows:

1. Create a subsample from the complete $t \times n$ ozone data. To create a sample of any particular pair $\{P_{h,j}, E_{i,j}\}$, include the data point at location $i$ and hour $h$ in every subsample. Remember that after the EOF-decomposition of $O_{t \times n}$, row $i$ in $E$ corresponds to the EOFs (summarized spatial data) for location $i$, and row $h$ of $P$ is the PC at hour $h$. By decomposing a subsample, I can acquire $E_{i,j}$ and $P_{h,j}$, for all $j = 1, \ldots, 3$, from the appropriate rows of $E_{subsample}$ and $P_{subsample}$. Hence from one subsample, I obtain exactly one EOF-PC pair $\{P_{h,j}, E_{i,j}\}$ between location $i$ and hour $h$ at $j = 1, 2, 3$.

   Repeating the above for multiple subsamples that include the data point at location $i$ and hour $h$, I obtain a sample of $\{P_{h,j}, E_{i,j}\}$. The sample covariance is then used for estimating all corresponding $\mathrm{Cov}(P_{h,j}, E_{i,j})$'s at $j = 1, 2, 3$.

2. Repeat step 1 for every $i = 1, \ldots, n$ and $h = 1, \ldots, t$. In the end, I have the sample covariances of all $(P_{h,j}, E_{i,j})$'s, $i = 1, \ldots, n$, $h = 1, \ldots, t$ and $j = 1, 2, 3$. Here, I let $n = 229$ (defining the LFV modelling region) and $t = 24$, based on the assumption that EOF-PC correlation is a type of diurnal behaviour.

The dimension of each resampled dataset and the number of repetitions are determined through experimentation.

The $24 \times 229$ sampled ozone data are obtained by separating the 48-hour ozone training set into two daily $24 \times 229$ datasets, then averaging them across the same hour of the day $h$ and location $i$.

Following the procedure, I can collect all $3 \times 229 \times 24$ covariance estimates into three $24 \times 229$ "EOF-PC covariance matrices" (one for each $j$), where element $(h, i)$ in the $j$-th matrix contains the sample estimate of $\mathrm{Cov}(P_{h,j}, E_{i,j})$.

For estimating the covariance $\mathrm{Cov}(\hat{P}_{h,j}, \hat{E}_{i,j})$, I propose to implement the aforementioned sampling method on the $24 \times 229$ data of ozone forecasts, $\sum_j \hat{P}_j \hat{E}_j^T$. Subsequently, one may collect the estimated covariances into three corresponding covariance matrices of $\mathrm{Cov}(\hat{P}_{h,j}, \hat{E}_{i,j})$, where $i = 1, \ldots, 229$ and $h = 1, \ldots, 24$.

In the end, by applying the two types of covariance matrices to the matching $\hat{P}_j \hat{E}_j^T$ via (C.1), one obtains sample-based covariance-corrected ozone predictions.

Results of Correcting Ozone Prediction Bias

As the analyses below show, biased prediction is not an issue during the all-important day-time ozone modelling. During night-time modelling, covariance-corrected predictions do induce a level of "localized spatial variations" at the right locations. However, at a regional level, these fine spatial variations introduce a fair amount of noise into the modelled ozone field.

First, I made space-time ozone forecasts using only the first two EOFs and PCs: $\hat{O}_{t \times n} = \hat{P}_1 \hat{E}_1^T + \hat{P}_2 \hat{E}_2^T$.
As the results at 1400PST in Figure C.6 show, with only 2 predicted space-time features and no bias correction, the daytime ozone fields are still well forecasted. The bottom plots in Figure C.6 show that a noticeable lack of prediction quality emerges at night-time. This is because $P_1 E_1^T$ and $P_2 E_2^T$ capture the underlying spatial-temporal mean structures and "daytime" ozone features.

Also shown in Figure C.6 are predictions from two types of improvement scheme: (1) adding the 3rd space-time component $\hat{P}_3 \hat{E}_3^T$, and (2) applying the bias-correction covariance matrices. The first method gives the model I had in Section 4.7. As for the bias-correction method, there is evidence that correcting for prediction bias improves the night-time prediction quality by adding detailed spatial ozone variations at appropriate locations. However, the improvement is not as drastic as that of method (1), which uses the additional ozone feature forecasts $\hat{P}_3 \hat{E}_3^T$ (Figure C.6). Furthermore, I am inclined to conclude that, since little prediction bias exists during the daytime, there is no practical reason for the application of covariance-based correction terms.

In summary, the addition of $\hat{P}_3 \hat{E}_3^T$ into the space-time ozone forecast $\hat{O}_{t \times n}$ improves the quality of night-time prediction without inducing unnecessary "local spatial noise" into the modelled ozone fields. These results indicate that the statistical ozone model's issue with night-time modelling is attributable more to not modelling higher-order ozone features and less to the presence of prediction bias. In the end, I believe the issue of prediction bias should be approached given one's particular research focus. In my case, the importance of day-time ozone (or the daily 8-hour maximum), coupled with the relevant modelling results, informs my decision to forgo the application of EOF-PC covariances for bias-correction.

Figure C.6: Hours 1400 and 2100 of June 26th, 2006 (the predictive set): 3-D spatial ozone fields of the true CMAQ output (upper-left) and predictions made by the VM model with 2 and 3 sets of predicted EOFs/PCs (upper-right and lower-left); the lower-right plot shows the bias-corrected version of the VM model prediction made with 2 sets of EOFs/PCs.

