Short-term hydro-meteorological forecasting with extremelearning machinesbyAranildo R. LimaB.Sc., Catholic University of Pernambuco, Brazil, 2009M.Sc., Federal Rural University of Pernambuco, Brazil, 2011A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Atmospheric Science)The University of British Columbia(Vancouver)June 2016c© Aranildo R. Lima, 2016AbstractIn machine learning (ML), the extreme learning machine (ELM), a feedforwardneural network model which assigns random weights in the single hidden layerand optimizes only the weights in the output layer, has the fully nonlinear mod-elling capability of the traditional artificial neural network (ANN) model but issolved via linear least squares, as in multiple linear regression (MLR). Chapter2 evaluated ELM against MLR and three nonlinear ML methods (ANN, supportvector regression and random forest) on nine environmental regression problems.ELM was then developed for short-term forecasting of hydro-meteorologicalvariables. In situations where new data arrive continually, the need to make fre-quent model updates often renders ANN impractical. An online learning algorithm– the online sequential extreme learning machine (OSELM) – is automatically up-dated inexpensively as new data arrive. In Chapter 3, OSELM was applied toforecast daily streamflow at two small watersheds in British Columbia, Canada,at lead times of 1–3 days. Predictors used were weather forecast data generatedby the NOAA Global Ensemble Forecasting System (GEFS), and local hydro-meteorological observations. OSELM forecasts were tested with daily, monthly oryearly model updates, with the nonlinear OSELM easily outperforming the bench-mark, the online sequential MLR (OSMLR).A major limitation of OSELM is that the number of hidden nodes (HN), whichcontrols the model complexity, remains the same as in the initial model, even whenthe arrival of new data renders the fixed number of HN sub-optimal. A new variablecomplexity online sequential extreme learning machine (VC-OSELM), proposedin Chapter 4, automatically adds or removes HN as online learning proceeds, sothe model complexity self-adapts to the new data. For streamflow predictions atiilead time of one day, VC-OSELM outperformed OSELM when the initial numberof HN turned out to be smaller or larger than optimal.In summary, by using linear least squares instead of nonlinear optimization,ELM offers a major advantage over a traditional method like ANN. In situationswhere new data arrive continually, OSELM and VC-OSELM were shown in thisthesis to be more useful than ANN and OSMLR.iiiPrefaceThis thesis results from a collection of five journal papers. Three have been pub-lished and two are currently under review. Submitted and published papers havebeen reformatted to meet dissertation formatting requirements. Minor editing chan-ges may have been made, but the content is otherwise unaltered.The papers are:(A) Aranildo R. Lima, Alex J. Cannon, WilliamW. Hsieh, Nonlinear regressionin environmental sciences by support vector machines combined with evolutionarystrategy. Computers & Geosciences, Volume 50, January 2013, Pages 136-144,ISSN 0098-3004, http://dx.doi.org/10.1016/j.cageo.2012.06.023.(B) Aranildo R. Lima, Alex J. Cannon, William W. Hsieh, Nonlinear regres-sion in environmental sciences using extreme learning machines: A comparativeevaluation. Environmental Modelling & Software, Volume 73, November 2015,Pages 175-188, ISSN 1364-8152, http://dx.doi.org/10.1016/j.envsoft.2015.08.002.(C) Aranildo R. Lima, Alex J. Cannon, William W. Hsieh, Forecasting dailystreamflow using online sequential extreme learning machines. Journal of Hydrol-ogy, Volume 537, June 2016, Pages 431-443, ISSN 0022-1694, http://dx.doi.org/10.1016/j.jhydrol.2016.03.017.(D) Huiping Peng, Aranildo R. Lima, Andrew Teakles, Jian Jin, Alex J. Can-non, William W. Hsieh, Evaluating hourly air quality forecasting in Canada withnonlinear updatable machine learning methods. Air Quality, Atmosphere and Health,In press, http://dx.doi.org/10.1007/s11869-016-0414-3.(E) Aranildo R. Lima, William W. Hsieh, Alex J. Cannon, Variable complex-ity online sequential extreme learning machine. Submitted for professional peerreview on 10 March 2016ivChapter 1, the Introduction, is adapted from papers (B), (C) and (D).Chapter 2 is basically paper (B), where I was responsible for the implementa-tion and analysis of all the experiments, and the writing. Prof. William W. Hsiehand Dr. Alex J. Cannon provided extensive feedback with the used methodologyand writing. Dr. Cannon kindly provided the code of the Artificial Neural Networkused in this chapter.Chapter 3 is basically paper (C), where I was responsible for the implemen-tation and analysis of all the experiments, and the writing. Prof. Hsieh and Dr.Cannon provided extensive feedback with the used methodology and writing.Chapter 4 is basically paper (E), where I was responsible for the implementa-tion and analysis of all the experiments, and the writing. Prof. Hsieh developedthe equations used in the method. Prof. Hsieh and Dr. Cannon provided extensivefeedback with the used methodology and writing.Appendix B is basically paper (A), where I was responsible for the implemen-tation and analysis of all the experiments, and the writing. Prof. Hsieh and Dr.Cannon provided extensive feedback with the used methodology and writing. Dr.Cannon kindly provided the code of the Artificial Neural Network used in thischapter.As some sections of Chapters 2, 3 and 4 repeat material already presentedearlier, they are marked by a footnote following the section title, so the reader canskip these sections.Funding for this research was provided by the Natural Sciences and Engineer-ing Research Council of Canada via a Discovery Grant and an Engage Grant toProf. Hsieh.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiGlossary of Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . xixAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Multiple Linear Regression (MLR) . . . . . . . . . . . . . . . . . 21.3 Online-Sequential Multiple Linear Regression (OSMLR) . . . . . 31.4 Multi-layer Perceptron Artificial Neural Network . . . . . . . . . 51.5 Extreme Learning Machine (ELM) . . . . . . . . . . . . . . . . . 71.6 Online-Sequential Extreme Learning Machine (OSELM) . . . . . 91.7 Research Goals and Contributions . . . . . . . . . . . . . . . . . 92 Nonlinear regression in environmental sciences using extreme learn-ing machines: A comparative evaluation . . . . . . . . . . . . . . . . 122.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12vi2.2 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . 142.3 Extreme Learning Machines . . . . . . . . . . . . . . . . . . . . 162.3.1 Ensemble of ELM . . . . . . . . . . . . . . . . . . . . . 172.3.2 Range of Weights and ELM Predictions . . . . . . . . . . 182.3.3 Network Structure . . . . . . . . . . . . . . . . . . . . . 192.4 Hill Climbing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.5 Benchmark Models . . . . . . . . . . . . . . . . . . . . . . . . . 222.5.1 Artificial Neural Network . . . . . . . . . . . . . . . . . 222.5.2 Random Forest . . . . . . . . . . . . . . . . . . . . . . . 222.5.3 Support Vector Regression with Evolutionary Strategy . . 232.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 232.6.1 Automated Procedures . . . . . . . . . . . . . . . . . . . 242.6.2 1-D Search . . . . . . . . . . . . . . . . . . . . . . . . . 362.6.3 ELM-S vs. ANN . . . . . . . . . . . . . . . . . . . . . . 402.7 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . 403 Forecasting daily streamflow using online sequential extreme learn-ing machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.2 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . 453.2.1 Englishman River . . . . . . . . . . . . . . . . . . . . . . 463.2.2 Stave River . . . . . . . . . . . . . . . . . . . . . . . . . 463.2.3 Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . 483.3 Online Sequential Extreme Learning Machine . . . . . . . . . . . 503.3.1 Range of Weights and ELM Predictions . . . . . . . . . . 523.3.2 Network Structure . . . . . . . . . . . . . . . . . . . . . 533.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 543.4.1 Predictor Selection . . . . . . . . . . . . . . . . . . . . . 563.4.2 Frequency of Updates . . . . . . . . . . . . . . . . . . . 573.4.3 Training Size . . . . . . . . . . . . . . . . . . . . . . . . 593.5 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . 67vii4 Variable complexity online sequential extreme learning machine . . 734.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734.2 Extreme Learning Machines . . . . . . . . . . . . . . . . . . . . 764.3 VC-OSELM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794.3.1 Adding HN . . . . . . . . . . . . . . . . . . . . . . . . . 794.3.2 Deleting HN . . . . . . . . . . . . . . . . . . . . . . . . 804.3.3 Buffer Set . . . . . . . . . . . . . . . . . . . . . . . . . . 824.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 824.4.1 Data Description . . . . . . . . . . . . . . . . . . . . . . 844.4.2 Change Number of HN by Fixed Steps . . . . . . . . . . 854.4.3 Determining Automatically the Number of HN . . . . . . 924.5 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . 1005 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1015.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1015.2 Batch Learning ELM . . . . . . . . . . . . . . . . . . . . . . . . 1015.3 Online Sequential Learning ELM . . . . . . . . . . . . . . . . . . 1025.4 Variable Complexity Online Sequential Learning . . . . . . . . . 1035.5 Overall Recommendations and Potential Applications . . . . . . . 103Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106A Description of Datasets . . . . . . . . . . . . . . . . . . . . . . . . . 120A.1 ENSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120A.2 SFBAY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121A.3 WIND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121A.4 FRASER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122A.5 YVR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122A.6 STAVE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123A.7 PRECIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124A.8 TEMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124A.9 SO2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124viiiB Nonlinear regression in environmental sciences by support vectormachines combined with evolutionary strategy . . . . . . . . . . . . 126B.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126B.2 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . 128B.3 Support Vector Regression . . . . . . . . . . . . . . . . . . . . . 130B.3.1 Evolutionary Strategies . . . . . . . . . . . . . . . . . . . 131B.3.2 Hyper-parameter Optimization and SVR-ES . . . . . . . . 133B.4 Artificial Neural Network and Regression Trees . . . . . . . . . . 135B.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 136B.6 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . 148C Additional performance statistics for OSELM . . . . . . . . . . . . 151D HC algorithm with VC-OSELM . . . . . . . . . . . . . . . . . . . . 159ixList of TablesTable 2.1 Specification of the tested datasets. . . . . . . . . . . . . . . . 15Table 2.2 Mean and standard deviation (in parenthesis) of MAE (over 30trials) from the test set using RF, SVR-ES, ANN, and ELM-S.The model with the lowest MAE of each dataset is shown inboldface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27Table 2.3 Mean and standard deviation (in parenthesis) of RMSE (over 30trials) from the test set using RF, SVR-ES, ANN, and ELM-S.The model with the lowest RMSE of each dataset case is shownin boldface. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27Table 2.4 Mean and standard deviation (in parenthesis) of total computertime (in seconds) used for building (including testing) a model.The model with the lowest computer time of each dataset isshown in boldface. The number of HN is also shown. . . . . . 31Table 2.5 Comparison of MAE, RMSE and time (in seconds) from thetraining and testing sets using the average number of HN se-lected by the automated procedures (Table 2.4). The lowestcomputer time (in seconds), test MAE and test RMSE of eachdataset are shown in boldface. . . . . . . . . . . . . . . . . . . 32Table 2.6 Ranks based on the MAE and RMSE (over 30 trials) for the testdata, with rank 1 being the best. The bottom row gives the meanfor each method. . . . . . . . . . . . . . . . . . . . . . . . . . 35xTable 3.1 List of predictors. These are of two types, i.e. local observa-tions and GEFS outputs. The GEFS outputs were selected at0900 and 2100 UTC for day 1, 3300 and 4500 UTC for day 2,and 5700 and 6900 UTC for day 3, hence the total number ofpotential predictors is 34. (The soil moisture content and soiltemperature are for 0.0–0.1-m depth). . . . . . . . . . . . . . 49Table 3.2 Statistics of the observed streamflow (m3=s) in the training andtesting datasets. . . . . . . . . . . . . . . . . . . . . . . . . . 66Table 4.1 Statistics of the observed streamflow (m3=s) in the training andtesting datasets. . . . . . . . . . . . . . . . . . . . . . . . . . 87Table 4.2 Comparison of RMSE and MAE from the test set, with the ini-tial number of HN selected by HC using one year of initial train-ing data and the number of HN increased by a fixed step of 6%after S data points. The values are the mean over the 30 trialsand the parentheses give the standard deviation. The lowest testRMSE and test MAE of each row are shown in boldface. . . . 89Table 4.3 Comparison of RMSE and MAE from the test set, with the ini-tialization using a large number of HN and one year of trainingdata and the number of HN decreased by 1% after S data points.The values are the mean over the 30 trials and the parenthesesgive the standard deviation. The lowest test RMSE and testMAE of each row are shown in boldface. . . . . . . . . . . . . 91Table 4.4 The final number of HN, the RMSE and MAE from the testset for OSELM and VC-OSELM (with HC). The values are themean over the 30 trials and the parentheses give the standarddeviation. The lowest test RMSE and test MAE of each row areshown in boldface. . . . . . . . . . . . . . . . . . . . . . . . . 96Table 4.5 Comparison of RMSE and MAE from the test set, with a largeinitial number of HN and one year of training data. The val-ues are the mean over the 30 trials and the parentheses give thestandard deviation. The lowest test RMSE and test MAE ofeach row are shown in boldface. . . . . . . . . . . . . . . . . . 99xiTable B.1 Comparison of MAE and MSE values from the test set usingSVRwith the range of γ suggested by Cherkassky andMa (2004)and by Lin and Lin (2003) for the PRECIP, TEMP, and SO2datasets, using either all predictors (ALL) or predictors selectedby SLR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138Table C.1 MAE, RMSE (m3=s) and NSE calculated over 1990-2011 (testset) for Englishman and Stave. For OSELM, the values are themean over the 30 trials and the parentheses give the standarddeviation. Day 1, day 2 and day 3 are the forecast lead times.The five models are OSMLR using stepwise (OSMLR-S), OS-MLR using all predictors (OSMLR-A), OSELM using stepwise(OSELM-S), OSELM using Boruta (OSELM-B) and OSELMusing all predictors (OSELM-A). The linear models of OSMLRare from single run instead of ensemble runs. . . . . . . . . . 152Table C.2 MAE, RMSE (m3=s) and NSE calculated over 1990-2011 (testset) for Englishman and Stave. For OSELM, the values are themean over the 30 trials and the parentheses give the standarddeviation. Day 1, day 2 and day 3 are the forecast lead times.The six models are OSMLR updated daily (OSMLR-Daily),monthly (OSMLR-Monthly), and yearly (OSMLR-Yearly), andthe OSELM updated daily (OSELM-Daily), monthly (OSELM-Monthly) and yearly (OSELM-Yearly). . . . . . . . . . . . . 153Table C.3 Mean and standard deviation (in parenthesis) of theMAE, RMSE(m3=s) and NSE of the 30 trials calculated over 1990-2011 (testset) for Englishman and Stave. Day 1, day 2 and day 3 are theforecast lead times, and 1 year-OSELM, 3 years-OSELM and 5years-OSELM are the OSELM runs initialized with 1, 3 and 5years of data, respectively. . . . . . . . . . . . . . . . . . . . 154xiiList of FiguresFigure 2.1 Comparison between different ELM setups, with the “waist-line” of the boxplot showing the median SS of the 30 trials forthe nine datasets. The 25th and 75th percentiles are shown,respectively, as the bottom and top of each box. Distance be-tween the 25th and 75th percentiles is the interquartile range(IQR). The upper whisker extends from the upper hinge tothe highest value that is within 1.5 IQR of the hinge. Thelower whisker extends from the lower hinge to the lowest valuewithin 1.5 IQR of the hinge. Data beyond the end of the whis-kers are outliers and are plotted as points. The four models arethe original single ELM (ELM-O), the ensemble of 30 models(ELM-E), the ensemble with output bias parameter (ELM-B)and the ELM ensemble with scaled weights (ELM-S). . . . . 28Figure 2.2 Boxplot of the MAE SS of the 30 trials. . . . . . . . . . . . . 29Figure 2.3 Boxplot of the RMSE SS of the 30 trials. . . . . . . . . . . . 29Figure 2.4 The total computer time used for building (including testing) amodel (RF, ELM-S, ANN or SVR-ES), plotted on a log scale(base 10). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30Figure 2.5 Boxplot of the number of HN (plotted on a log scale) usedin the ANN and ELM-S models as chosen by the automatedprocedures (30 trials). . . . . . . . . . . . . . . . . . . . . . 36xiiiFigure 2.6 RMSE of the training set for (a) TEMP and (b) PRECIP usinga 1-D search over the number of HN. The number of HN usedby the 1-D search for ELM-E/ELM-S is provided along thebottom axis and for ANN, along the top axis. . . . . . . . . . 37Figure 2.7 Computational time (in seconds) versus the number of hiddennodes (HN) for (a) TEMP and (b) PRECIP plotted on a logscale. The number of HN used by the 1-D search for ELM-E/ELM-S is provided along the bottom axis and for ANN alongthe top axis. The ELM-E and ELM-S curves essentially over-lapped in (a) and (b). . . . . . . . . . . . . . . . . . . . . . . 38Figure 3.1 Study Area. The Englishman River Basin in Vancouver Island(red) and the Stave River Basin in southern British Columbia(green). The solid circles show the hydrometric stations run byWater Survey of Canada, and the squares the hydro-meteoro-logical stations. For Stave River (on the right) the circle (thehydrometric station) overlaps the square (the hydro-meteoro-logical station). . . . . . . . . . . . . . . . . . . . . . . . . . 47Figure 3.2 ELM architecture (shown with a single output variable). . . . 52xivFigure 3.3 Comparison between prescreening techniques at (a) English-man and (b) Stave, with “waistline” of the boxplot showingthe median MAE of the 30 trials calculated over 1990-2011for each dataset and forecast lead time. The 25th and 75thpercentiles are shown, respectively, as the bottom and top ofeach box. Distance between the 25th and 75th percentiles isthe interquartile range (IQR). The upper whisker extends fromthe upper hinge to the highest value that is within 1.5 IQR ofthe hinge. The lower whisker extends from the lower hingeto the lowest value within 1.5 IQR of the hinge. Data beyondthe end of the whiskers are considered outliers and are plot-ted as points. Day 1, day 2 and day 3 are the forecast leadtimes. The five models are OSMLR using stepwise (OSMLR-S), OSMLR using all predictors (OSMLR-A), OSELM usingstepwise (OSELM-S), OSELM using Boruta (OSELM-B) andOSELM using all predictors (OSELM-A). The boxes for thetwo OSMLR models collapsed to two horizontal lines as mul-tiple trials are not needed. . . . . . . . . . . . . . . . . . . . 61Figure 3.4 Comparison between different frequencies of model updatesfor (a) Englishman and (b) Stave. The boxplot displays theMAE for each dataset and forecast lead time. The six mod-els are OSMLR updated daily (OSMLR-Daily), monthly (OS-MLR-Monthly), and yearly (OSMLR-Yearly), and the OSELMupdated daily (OSELM-Daily), monthly (OSELM-Monthly)and yearly (OSELM-Yearly). . . . . . . . . . . . . . . . . . . 62Figure 3.5 Residual times series from (a) an OSELM-Daily model, (b) anOSELM-Yearly model, and (c) their difference. . . . . . . . . 63xvFigure 3.6 Comparison of (a) MAE and (b) RMSE for values above the90th percentile for different frequency updates at Englishman.Boxplots for each forecast lead time are shown for six modelsOSMLR updated daily (OSMLR-Daily), monthly (OSMLR-Monthly) and yearly (OSMLR-Yearly), and the OSELM up-dated daily (OSELM-Daily), monthly (OSELM-Monthly) andyearly (OSELM-Yearly). . . . . . . . . . . . . . . . . . . . . 64Figure 3.7 Times series (training and test set) of (a) ENG and (b) STA.The blue line marks the beginning of 1987, the green linemarks the beginning of 1989 and the red line marks the be-ginning of 1990. . . . . . . . . . . . . . . . . . . . . . . . . 65Figure 3.8 Comparison of the MAE between different initial training datasizes for (a) Englishman and (b) Stave. The boxplot showsthe MAE from the 30 trials for each dataset and forecast leadtime. The three models are OSELM-Daily trained during theinitialization phase with 1, 3 and 5 years of data respectively. . 68Figure 3.9 Boxplot of the number of HN selected in the 30 trials at (a)Englishman and (b) Stave. Due to discrete nature of the num-ber of HN, the median could overlap with the upper or lowerquantile. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69Figure 3.10 Comparison of MAE using a fixed number of HN for differentinitial training sizes at (a) Englishman and (b) Stave. . . . . . 70Figure 4.1 Mean accumulated precipitation (bars) and mean accumulatedstreamflow (black line) for each of the 12 calendar months overthe studied period for (a) Englishman and (b) Stave. . . . . . . 85Figure 4.2 Time series of the number of HN for Stave when (a) S = 180and (b) S = 365. Each dashed line shows the number of HNfor one trial, the solid line the mean over the 30 trials and thehorizontal dashed line the value of S (below which scheme Awas used instead of scheme B for adding HN). . . . . . . . . 95xviFigure 4.3 Time series of the number of HN for (a) Englishman and (b)Stave when starting the initial model with L = 350. Eachdashed line shows the number of HN for one trial, the solidline the mean over the 30 trials and the horizontal dashed linethe value of S. . . . . . . . . . . . . . . . . . . . . . . . . . . 98Figure B.1 Skill score of MAE for the PRECIP test dataset with all pre-dictors used (dark bar) and with predictors selected by SLR(light bar). The reference forecast, MLR with all predictors,has simply 0 for the skill score. . . . . . . . . . . . . . . . . . 140Figure B.2 Skill score of MSE for the PRECIP test dataset. . . . . . . . . 141Figure B.3 Skill score of MAE for the TEMP test dataset. . . . . . . . . . 142Figure B.4 Skill score of MSE for the TEMP test dataset. . . . . . . . . . 143Figure B.5 Skill score of MAE for the SO2 test dataset. . . . . . . . . . . 144Figure B.6 Skill score of MSE for the SO2 test dataset. . . . . . . . . . . 145Figure B.7 Skill score of MSE (dark bar) and MAE (light bar) for (a)PRECIP and (b) SO2 test datasets, where f1 denotes SVR-ESwith f1=(MSE)−1 as fitness function and f2 denotes SVR-ESwith f2=(MAE)−1. . . . . . . . . . . . . . . . . . . . . . . . 146Figure B.8 Results of MAE as γ varies between [2−10;24], for the PRECIPtest dataset with all predictors used (solid line with circles),with predictors selected by SLR (dashed line with circles), forthe TEMP test dataset with all predictors used (solid line withtriangles) and with predictors selected by SLR (dashed linewith triangles), for the SO2 test dataset with all predictors used(solid line with x) and with predictors selected by SLR (dashedline with x). . . . . . . . . . . . . . . . . . . . . . . . . . . . 147Figure B.9 Results of MSE as γ varies between [2−10;24]. . . . . . . . . 148xviiFigure B.10 Results of MSE skill score (left side) and MAE skill score(right side) for the PRECIP test dataset using the last 500 pointsof the training set and all the predictors to train the models. Themodels are respectively: the modified GA proposed by Leunget al. (2003) with different parameters settings (GA1, GA2 andGA3), SVR with the procedure recommended by Cherkasskyand Ma (2004) and the extended range γ = [2−10;24] suggestedby Lin and Lin (2003), SVR-ES with 200 generations and 3-D grid search with the hyper-parameters’ range recommendedby Fan et al. (2005). . . . . . . . . . . . . . . . . . . . . . . 149Figure C.1 Scatter plot of the observed (y) and predicted (yˆ) streamflowvalues for Englishman. The three models are the OSELM up-dated daily (OSELM-Daily), monthly (OSELM-Monthly) andyearly (OSELM-Yearly). The black line is the linear regres-sion between yˆ and y, while the dashed line is when y= yˆ. . . 155Figure C.2 Scatter plot of the observed (y) and predicted (yˆ) streamflowvalues for Stave. The three models are the OSELM updateddaily (OSELM-Daily), monthly (OSELM-Monthly) and yearly(OSELM-Yearly). The black line is the linear regression be-tween yˆ and y, while the dashed line is when y= yˆ. . . . . . 156Figure C.3 Scatter plot of the observed (y) and predicted (yˆ) streamflowvalues for Englishman. The three models are OSELM-Dailytrained during the initialization phase with 1, 3 and 5 yearsof data respectively. The black line is the linear regressionbetween yˆ and y, while the dashed line is when y= yˆ. . . . . 157Figure C.4 Scatter plot of the observed (y) and predicted (yˆ) streamflowvalues for Stave. The three models are OSELM-Daily trainedduring the initialization phase with 1, 3 and 5 years of datarespectively. The black line is the linear regression between yˆand y, while the dashed line is when y= yˆ. . . . . . . . . . . 158xviiiGlossary of AcronymsAIC Akaike information criterionAG-ELM ELM with adaptive growth of HNANN Artificial neural networksCART Classification and regression treeEM-ELM Error-minimized ELMELM Extreme learning machinesELM-B ELM ensemble with output bias parameterELM-O Original single ELMELM-S ELM ensemble with scaled weightsEOSELM Ensemble of OSELMESN Echo state networksGEFS Global ensemble forecasting systemHC Hill climbingHN Hidden nodesIQR Interquartile rangeI-ELM Incremental ELMxixMAE Mean absolute errorML Machine learningMLR Multiple linear regressionMOS Model output statisticsMSE Mean squared errorMLP Multilayer perceptronNSE Nash-Sutcliffe model efficiency coefficientOSELM Online-sequential ELMOSELM-A OSELM using all predictorsOSELM-B OSELM using Boruta for predictor selectionOSELM-S OSELM using stepwise selection of predictorsOSMLR Online-sequential MLROSMLR-A OSMLR using all predictorsOSMLR-S OSMLR using stepwise selection of predictorsRBF Radial basis functionRF Random forestsRMSE Root mean squared errorRNN Recurrent neural networksRVFL Random vector functional linkSLFN Single-hidden layer feedforward neural networksSLR Stepwise linear regressionSS Skill scorexxSSCP Sums-of-squares-and-cross-productsSVR Support vector regressionSVR-ES SVR with evolutionary strategyVC-OSELM Variable complexity OSELMVC-OSELM-A VC-OSELM using scheme AVC-OSELM-B VC-OSELM using scheme BxxiAcknowledgmentsFirst, I would like to thank my research advisor, Professor William Hsieh. I wouldalso like to thank Dr. Alex Cannon, my co-supervisor, Professor Roland Stull andall my UBC colleagues. Finally, I would like to thank my family, in special Atina,Ariel and Fernanda. You made this journey more exciting.xxiiChapter 1Introduction1.1 OverviewLinear statistical models are simple, fast, and often provide adequate and inter-pretable descriptions of how the predictors affect the outputs. In particular, forprediction purposes they sometimes outperform fancier nonlinear models (Hastieet al., 2009). However, environmental problems are generally complex with manycomponents, such as trends, seasonality, interactions between predictors, and non-linear relationships (Kuligowski and Barros, 1998; Cawley et al., 2007; McCollorand Stull, 2008). Predictions by linear models may have accuracy limitations dueto the generalization (and extrapolation) of the environmental problems by linearfunctions.To overcome the limitations of linear models, nonlinear machine learning (ML)methods have been successfully used in environmental problems (Cherkassky et al.,2006; Hsieh, 2009; Haupt et al., 2009). The main idea of ML is that computer al-gorithms are capable of automatically distilling knowledge from data. From thisknowledge they can construct models capable of making predictions from noveldata in the future.Artificial neural networks (ANN), support vector regression (SVR) and ran-dom forests (RF) are examples of ML models that have been used to solve prob-lems such as: downscaling precipitation (Chen et al., 2010; Cannon, 2011; Ibarra-Berastegi et al., 2011), downscaling precipitation for climate change scenarios1(Tripathi et al., 2006), forecasting summertime surface-level ozone concentrations(Cannon and Lord, 2000), predicting potential habitat of oaks (Pino-Mejı´as et al.,2010), forecast of pacific sea surface temperatures (Aguilar-Martinez and Hsieh,2009), forecasting wind (Kusiak et al., 2009; Zeng et al., 2011; Foley et al., 2012),prediction of air pollution (Kurt et al., 2008), and prediction of water resource vari-ables (e.g. flow, water level, nitrate, salinity and suspended sediment concentra-tion) (Maier et al., 2010; Shamseldin, 2010; Cannon, 2012b; Rasouli et al., 2012;Hong, 2012; Wu et al., 2014; Chang et al., 2014).Short lead time hydro-meteorological forecasts aim to provide an estimate offuture conditions, based on recent and forecast hydro-meteorological conditionsand observations. In practice the provided lead time (ranging from a few hours todays) is constrained by factors such as system response times, and the accuracy ofmeteorological forecasts at the required lead times. Forecasting techniques basedon statistical modelling are used when simplicity and robustness of the model im-plementation are more important than an accurate description of the various inter-nal sub-processes. They are especially useful for operational purposes, e.g. as anearly warning system for floods, especially for mesoscale basins, which are char-acterized by rapid system response time (Potter and Colman, 2003; Bu¨rger et al.,2009; Sene, 2010; Shamseldin, 2010; Fleming et al., 2015).This thesis is focused on developing and applying ML models to environmen-tal prediction, specially with hydro-meteorological variables. In particular an MLmodel called the extreme learning machines (ELM) is intensively studied.1.2 Multiple Linear Regression (MLR)The objective of MLR is to determine how multiple predictor variables xl (l =1; : : : ;d); with N samples, is linearly related to the predictand or response variabley.The MLR output variable yˆ can be written in terms of the input predictor vari-ables asyˆ= β0+β1x1+β2x2+ : : :+βdxd ; (1.1)where βl (l = 0; : : : ;d) are the regression coefficients or parameters. The values ofthe model regression coefficients are found by minimizing the mean squared error2(MSE) between the model output (yˆ) and the target data (y) using a linear leastsquares algorithm (Wilks, 2005).For the more general situation where the predictand y is a vector, MLR findsthe least squares solution of the linear system XB=Y, where the input data matrixX, the regression coefficient matrix B and the target data matrixY are, respectively,X=1 x11 · · · x1d....... . ....1 xN1 · · · xNd ; B=β T0β T1...β Td and Y=yT1yT2...yTN ; (1.2)where T denotes the transpose. The least squares estimate of the regression coeffi-cients is given by,Bˆ= PXTY; (1.3)where P = (XTX)−1 and XTX is the sums-of-squares-and-cross-products (SSCP)matrix (also called the data covariance matrix).1.3 Online-Sequential Multiple Linear Regression(OSMLR)MLR uses a batch-learning type of training method, i.e., whenever new data arereceived, batch learning uses the past data together with the new data to retrainthe model. In many real world applications, learning has to be an ongoing processsince a complete set of data is not readily available. In addition, by having to useall past data, a large amount of training data is often required to train batch learningmethods (Liang et al., 2006; Hong, 2012). For example, in streamflow forecasting,the accuracy of each forecast can be evaluated as soon as the observed flow for thattime step becomes available. Therefore, this information should be useful for ad-justing model parameters to maximize the accuracy of the next forecast (Potter andColman, 2003; Sene, 2010). Thus, in some practical applications online sequen-tial learning algorithms are preferred over batch learning algorithms as sequentiallearning algorithms do not require retraining using all past data whenever new dataare received (Liang et al., 2006).3To facilitate rapid and frequent updating of large number of equations, OSMLRmodels are commonly developed using the SSCP matrix. The idea of the updatingis to do part of the necessary recalculation of regression coefficients in near-realtime by updating the SSCP matrix and storing the data in that form rather than asraw observations. With new data arriving sequentially, the online updating algo-rithm can learn the new data individually or in chunks of fixed or varying length,and discard the data which have already been learned.The OSMLR algorithm can be divided into two phases: an initialization phase(involving a small batch MLR run) and an online sequential learning phase. In theinitialization phase, starting with a training set with N0 data points {X0;Y0}, thelinear regression solution isBˆ0 = P0XT0Y0; (1.4)where P0 = (XT0X0)−1.Next, the online sequential learning phase using the recursive least square al-gorithm commences. Incorporating a new chunk of data containing N1 data points{X1;Y1} in the least squares solution (1.3) gives the new regression coefficientsBˆ1 = P1[X0X1]T[Y0Y1]; with P1 =[ X0X1]T[X0X1]−1 : (1.5)After some algebra as in Liang et al. (2006)Bˆ1 = Bˆ0+P1XT1 (Y1−X1Bˆ0); (1.6)whereP1 = P0−P0XT1 (I+X1P0XT1 )−1X1P0; (1.7)with I denoting the identity matrix.Generalizing the recursive algorithm to incorporate the k+1 new chunk of datacontaining Nk+1 data points {Xk+1;Yk+1}, the least squares solution givesPk+1 = Pk−PkXTk+1(I+Xk+1PkXTk+1)−1Xk+1Pk; (1.8)Bˆk+1 = Bˆk+Pk+1XTk+1(Yk+1−Xk+1Bˆk): (1.9)4The steps for the OSMLR algorithm are shown in Algorithm 1.1 begin2 Initialization Phase:3 // k = 0;4 Given a training set {X0;Y0}5 begin6 Calculate the matrix P0 = (XT0X0)−1;7 Calculate the output weights Bˆ0 = P0XT0Y0;8 end9 Online Sequential Learning Phase:10 begin11 (k+1)th chunk of new data {Xk+1;Yk+1} arrives;12 Calculate the matrixPk+1 = Pk−PkXTk+1(I+Xk+1PkXTk+1)−1Xk+1Pk;13 Calculate the output weightsBˆk+1 = Bˆk+Pk+1XTk+1(Yk+1−Xk+1Bˆk);14 Set k = k+1 and iterate;15 end16 endAlgorithm 1: OSMLR1.4 Multi-layer Perceptron Artificial Neural NetworkAn artificial neural network (ANN) is a biologically inspired mathematical modelwhich is composed of a large number of highly interconnected nodes divided inlayers (input, hidden, output), but working in union to solve a problem. There aremany types of ANNs, but the multilayer perceptron (MLP) architecture is probablythe most popular type (Haykin, 1998; Maier et al., 2010; Abrahart et al., 2012) andconsists of at least one hidden layer of nodes or “neurons” sandwiched between theinput and output layers.An MLP with one hidden layer can be defined as:yˆ j =L∑i=1β ih(wi ·x j+bi)+β 0; ( j = 1; : : : ;N); (1.10)5where x j ∈ Rd and yˆ j ∈ Rm are the model input and output, respectively, N thenumber of cases, wi and bi the weight and bias parameters in the hidden layer(with L hidden nodes), β i and β 0 are the weight and bias parameters in the out-put layer, and h the activation function. One can think of MLR as a special caseof ANN where the activation functions are linear and there are no hidden layersbetween the inputs and output(s). In MLR with d predictors and one predictand(output) there are (d+1) parameters. For corresponding nonlinear regression withan MLP with L number of hidden nodes (HN) there are (d+1)L+(L+1) param-eters, usually greatly exceeding the number of parameters in the MLR. Training ofthe ANN model involves adjusting the parameters iteratively so the MSE betweenthe model output and target data is minimized. The backpropagation algorithm, agradient-based optimization method, is often used to train the ANNmodels (Hsieh,2009; Maier et al., 2010). However, nonlinear optimization by gradient descent-based learning methods is computationally expensive and may easily converge tolocal minima. Due to multiple local minima, ANN models are usually used in anensemble (or committee) setting, where each ensemble member is an ANN trainedwith different random initial weights, and the output of the ensemble ANN modelis simply the averaged output from all the individuals members. As nonlinearmodels can easily overfit (i.e. fit to the noise in the data), ANN is usually run withsome form of regularization (e.g. weight penalty) (Hsieh, 2009). Hence for ANNmodels, the number of HN, the regularization parameter (i.e. weight penalty pa-rameter), and the algorithm for network training all need to be specified in advance(Haykin, 1998; Hsieh, 2009).The outputs from a numerical weather prediction model are post-processedby methods like model output statistics (MOS) (Wilson and Valle´e, 2002) to cor-rect for model bias and to extend the predictions to variables not computed by thenumerical model. So far, only linear statistical methods are used in these computa-tionally intensive updatable MOS systems, as nonlinear methods like ANN are notpractical to use in high dimensional systems with new data arriving continually.Unlike MLR where the simple linear least squares algorithm easily generalizes toonline-sequential MLR by recursive least squares, the complicated nonlinear opti-mization of the MLP ANN does not easily lead to online-sequential ANN. Yuvaland Hsieh (2003) proposed an adaptive nonlinear MOS scheme for ANN, where6the ANN model can be updated by new data without having to retrain using allpast data. However, unlike OSMLR, this scheme cannot be used for updating theANN model with a single new datum at a station, but requires more data for eachupdate, and nonlinear optimization is still needed. In short, the lack of a simpleonline-sequential nonlinear ML method has hindered the development of nonlin-ear updatable MOS for numerical weather prediction.1.5 Extreme Learning Machine (ELM)Nonlinear optimization by gradient descent-based learning methods is computa-tionally expensive and may easily converge to local minima. To overcome some ofthese problems, randomized neural networks have been used. Randomized neuralnetworks randomly chooses the weights in the hidden nodes of an MLP networkwith one hidden layer, and analytically determines the weights at the output layer.Broomhead and Lowe (1988) suggested some generalizations to the conventionalnetwork model based on the radial basis function (RBF) approach to curve fitting.The centers of the RBFs can be selected from the training data or randomly gen-erated. Adopting a minimum norm least square method (e.g. the Moore-Penrosepseudo-inverse), the solution is unique and in this sense the network may be said tohave a guaranteed learning algorithm (training the network is equivalent to solvinga linear matrix equation). It combines a linear dependence on the variable weightswith an ability to model explicitly nonlinear relationships. Also they suggested thatthe RBF method can be viewed as a layered network. Schmidt et al. (1992) inves-tigated the performance of an MLP where the weights between the input layer andhidden layer are randomly generated and kept fixed. They proposed that only thehidden layer and the output layer needed to be tuned (once they were properly ini-tialized) using a single layer learning rule or a pseudo-inverse technique. Anotherrandomized approach is the random vector functional link (RVFL) neural networkmodel (Pao et al., 1994) which is based on the functional link neural networks (Paoand Takefuji, 1992). In the RVFL the weights between the input layer and hiddenlayer are also randomly generated and kept fixed, however the architecture of theRVFL is slight different than the MLP. The RVFL has extra direct connections be-tween the inputs and the outputs called enhancement nodes. The addition of the7enhancement nodes can lead to better performance by acting as a form of regu-larization (Zhang and Suganthan, 2015). Huang et al. (2006b) proposed the ELMwhich is the same as the model proposed by Schmidt et al. (1992) but without thebias parameter in the output layer, and is also very similar to RVFL but without thedirect links between the inputs and the outputs.The wi and bi parameters from equation (1.10) can be randomly assigned(Schmidt et al., 1992) if the activation function is infinitely differentiable. Con-sequently, only the β parameters need to be optimized when minimizing the MSEbetween the model output and the target data. In addition, the β 0 term is not nec-essary for ELM (Huang, 2014). Thus, in the ELM approach, training an MLP withone hidden layer is equivalent to simply finding the least-squares solution of thelinear system (Huang et al., 2011), HB= Y, where the hidden layer output matrixH of dimension N×L isH=h(w1 ·x1+b1) · · · h(wL ·x1+bL).... . ....h(w1 ·xN +b1) · · · h(wL ·xN +bL) ; (1.11)the parameter matrix B of dimension L×m and the target data matrix Y of dimen-sion N×m areB=β T1...β TL ; and Y=yT1...yTN : (1.12)Algebraically, the linear system for B is solved via the Moore-Penrose generalizedinverse H† (Huang et al., 2006b), i.e.Bˆ=H†Y with H† = PHT and P= (HTH)−1; (1.13)with P being the inverse of the covariance matrix of the hidden layer output. Theseequations are analogous to Eqs. (1.2) and (1.3) for MLR, but with H replacing X.Figure 3.2 shows a schematic of the ELM architecture, which is analogous tothe MLP architecture, except that MLP does not randomly assign wi and bi.81.6 Online-Sequential Extreme Learning Machine(OSELM)To take advantage of real time information, the ELM has to be retrained using allpast data as new data arrive. Liang et al. (2006) proposed the OSELM, whichupdates the weights of the ELM automatically as new data arrive. The OSELMstrongly resembles OSMLR in Sec 1.3, being divided in similar two phases: ini-tialization phase and online sequential learning phase. OSELM can also learn thenew data individually or chunk by chunk (with fixed or varying length) and discardthe data for which the training has already been done.In the initialization phase, the appropriate matricesH0, P0 and Bˆ0 are computedbased on the initial training dataset {X0;Y0} = {x j;y j}N0j=1, where the number ofdata points in this phase (N0) should be at least equal to the number of HN (L).Following the initialization phase, the online sequential learning phase com-mences. At the (k+ 1)th update a new chunk of data {Xk+1;Yk+1} is presented,and the new matrices Hk+1, Pk+1 and Bˆk+1 are calculated. The steps for the OS-ELM algorithm are shown in Algorithm 2. Again, Algorithm 2 is analogous toAlgorithm 1 (OSMLR) but with H replacing X. For more details, see Liang et al.(2006) and Huang et al. (2011).1.7 Research Goals and ContributionsThe first objective of this dissertation is to test whether the claims of ELM hav-ing good generalization performance, requiring less computational time than othercommon nonlinear ML methods and having only one parameter (the number ofHN) to adjust, are applicable in the context of environmental sciences. Cherkasskyet al. (2006) grouped the environmental sciences applications in three domains: cli-mate, Earth and ocean (which is closely related to climate because global processeson the Earth and in the ocean directly influence the climate), and hydrology. TheELM is applied in Chapter 2 to nine different datasets from the three domains, withfour of them in climate (ENSO, YVR, PRECIP and TEMP), two in Earth and ocean(WIND and SO2), and three in hydrology (SFBAY, FRASER and STAVE) (see Ap-pendix A for details). Environmental datasets span a broad range in terms of size,number of predictors, degree of nonlinearity, signal-to-noise ratio, etc. While the91 begin2 Initialization Phase:3 // k = 0;4 Given a training set {X0;Y0} and L hidden nodes5 begin6 Randomly assign wi and bi, i= 1; · · · ;L;7 Calculate the hidden layer output matrix H0;8 Calculate the matrix P0 = (HT0H0)−1;9 Calculate the output weights Bˆ0 = P0HT0Y0;10 end11 Online Sequential Learning Phase:12 begin13 (k+1)th chunk of new data {Xk+1;Yk+1} arrives;14 Calculate Hk+1 using only the new data;15 Calculate the matrixPk+1 = Pk−PkHTk+1(I+Hk+1PkHTk+1)−1Hk+1Pk;16 Calculate the output weightsBˆk+1 = Bˆk+Pk+1HTk+1(Yk+1−Hk+1Bˆk);17 Set k = k+1 and iterate;18 end19 endAlgorithm 2: OSELMselected datasets by no means span the whole range of environmental datasets, theydo provide better insight into the advantages and disadvantages of various methodsthan a single dataset. In addition, simple automated procedures are proposed tofind the parameters and build the models automatically in a reasonable amount oftime.One would use an adequate amount of training data containing enough statis-tical information for building the initial model. However, it is not always possibleto get a large amount of data in advance to train the model – often data arrivecontinually after the initial model has been trained. In the fields of hydrology andatmospheric sciences, the addition of new observational stations and changes in thenumerical weather prediction (NWP) model are examples of situations where it isdifficult (or even impossible) to obtain large amounts of data in advance. Contrary10to the ANN and ELM approaches that require a long training data record and thenbehave in a static manner, the OSELM can be automatically updated as new dataarrive.The second objective of this dissertation is to test whether OSELM is suitablefor hydro-meteorological applications where data arrive continually. Thus, the OS-ELM is used in Chapter 3 as an alternative solution to forecast daily streamflowof two hydrometric stations in southern British Columbia, Canada. Local hydro-meteorological observations and NOAA’s weather reforecast dataset generated bythe Global Ensemble Forecasting System (GEFS) model (Hamill et al., 2013) areused as predictors.One disadvantage of OSELM is that after the initialization phase, the numberof HN is fixed. As new data arrive, and information on longer-term variability(e.g. interannual variability) becomes available, the fixed number of HN will betoo few to model the longer term variability and the model becomes sub-optimal.The third objective of this thesis, is to develop an algorithm to add or remove HNof the OSELM according to their significance to network performance (Chapter 4).Therefore, not only the weights can be adjusted but also the complexity (numberof HN) can be self-adapted simultaneously.11Chapter 2Nonlinear regression inenvironmental sciences usingextreme learning machines: Acomparative evaluation2.1 IntroductionLinear models are simple, fast, and often provide adequate and interpretable de-scriptions of how the predictors affect the outputs. In particular, for prediction pur-poses they sometimes outperform fancier nonlinear models (Hastie et al., 2009).However, environmental problems are generally complex with many components,such as trends, seasonality, interactions between predictors, and nonlinear relation-ships (Kuligowski and Barros, 1998; Cawley et al., 2007). Predictions by linearmodels may have accuracy limitations due to the generalization (and extrapola-tion) of the environmental problems by linear functions.To overcome the limitations of linear models, nonlinear MLmethods have beensuccessfully used in environmental problems (Cherkassky et al., 2006). ANN, SVRand RF have been used to solve hydrological problems, such as the prediction of10-day reservoir inflows, downscaling precipitation, and prediction of water re-12source variables (e.g. flow, water level, nitrate, salinity and suspended sedimentconcentration) (Tripathi et al., 2006; Chen et al., 2010; Maier et al., 2010; Can-non, 2012b; Rasouli et al., 2012). Due to the large amount of research activity inhydrology using ANNmodels, good review papers (Maier and Dandy, 2000; Abra-hart et al., 2012; Maier et al., 2010) are available in the hydrological literature. Forhydrological problems, gradient-based optimization methods are widely used totrain ANN models (Maier et al., 2010). This is also true for environmental prob-lems such as forecasting wind power (Kusiak et al., 2009) and equatorial Pacificsea surface temperatures (Aguilar-Martinez and Hsieh, 2009). However, nonlin-ear optimization by gradient descent-based learning methods is computationallyexpensive and may easily converge to local minima. Nature inspired evolutionarycomputation algorithms have been successfully applied to ANN training (Leunget al., 2003; Chen and Chang, 2009), however there is no consensus on their su-perior skills or convergence speed (Solomatine and Ostfeld, 2008; Piotrowski andNapiorkowski, 2011). Consequently, there is still a need for better and faster ANNtraining algorithms.The ELM algorithm for SLFNs was proposed by Huang et al. (2006b). TheELM algorithm implements a SLFN similar in structure to a traditional ANNmodel, but the ELM randomly chooses the weights leading to the HN and ana-lytically determines the weights at the output layer (Schmidt et al., 1992). Oncethe activation function has been chosen, the only parameter to be tuned in the ELMis the number of HN. ELM has been used in different research areas (Sun et al.,2008; Huang et al., 2011) and has been found to produce good generalization per-formance with learning times that are thousands of times faster than traditionalgradient-based ANN training algorithms.Correct adjustment of the model parameters is mandatory for building a suc-cessful predictive model. For example, in ANN models, the number of HN, theregularization parameter (i.e. weight penalty parameter), and the algorithm for net-work training all need to be specified (Haykin, 1998; Hsieh, 2009). Unfortunately(with very few exceptions), there is no unique formula to estimate the model pa-rameters before training starts, so they are usually determined by time-consumingtrial and error. In addition, many published articles do not reveal the details ofparameter estimation, hiding pitfalls and misuses of the technique employed in the13studies (Zhang, 2007; Maier et al., 2010; Wu et al., 2014). For ELM the correctadjustment of the parameter is also crucial. Similar to ANN, the optimal numberof HN is problem dependent and unknown in advance (Huang et al., 2006b; Fenget al., 2009; Liu and Wang, 2010). Thus, the process to find the optimal numberof HN is also by the trial and error procedure, necessitating multiple ELM runs(Parviainen and Riihimaki, 2013).The objective of this study is to test whether the claims of ELM having goodgeneralization performance, requiring less computational time than other commonnonlinear ML methods and having only one parameter (the number of HN) to ad-just, are applicable in the context of environmental sciences. We applied ELM tonine very different datasets. MLR (with stepwise selection of predictors) was usedas the benchmark to compare accuracy (based on the root mean squared error andmean absolute error) and computational time. We also used three nonlinear tech-niques that have previously been successfully applied to environmental problemsas ML benchmarks, namely gradient-based ANN (Cannon and Lord, 2000; Schoofand Pryor, 2001; Krasnopolsky, 2007), SVR (Tripathi et al., 2006; Lima et al.,2013), and RF (Ibarra-Berastegi et al., 2011). We combined the ML techniqueswith simple automated procedures to search for optimal parameters.Section 2.2 describes the datasets used in our study. The regression methods arepresented in Sections 2.3, 2.4 and 2.5. Results and discussion of the experimentsare given in Section 2.6, followed by summary and conclusion in Section 2.7.2.2 Data DescriptionIn our present peer-reviewed publication system, publications biased in favour ofnew methods are common because authors tend to present their new methods inthe best possible light to enhance their chances of passing the peer review. For in-stance, an author might test their new method against a traditional method on twodifferent datasets. If the newmethod fails against a traditional method on one of thedatasets, the author could ignore the failure and write a journal paper describing thenew method using only the successful case (Hsieh, 2009; Zhang, 2007; Elshorbagyet al., 2010a). Environmental datasets span a broad range in terms of size, num-ber of predictors, degree of nonlinearity, signal-to-noise ratio, etc. Cherkassky14Table 2.1: Specification of the tested datasets.Datasets # Observations # Predictors PredictandMean Std. dev.Train Test Train Test Train TestENSO 264 120 8 -0.049 -0.025 0.902 0.800SFBAY 475 157 6 1.177 1.145 1.226 1.359WIND 279 100 160 36952 40655 28610 29165FRASER 2557 1095 4 1.779 1.693 0.547 0.528YVR 3286 1694 3 1.451 1.444 0.457 0.460STAVE 5258 1260 25 3.134 3.088 0.909 0.919PRECIP 4213 2074 106 0.984 0.981 0.342 0.341TEMP 7117 3558 106 -0.002 -0.010 1.010 0.983SO2 15110 7533 27 3.125 3.147 0.929 0.924et al. (2006) grouped the environmental sciences applications in three domains:climate, Earth and ocean (which is closely related to climate because global pro-cesses on the Earth and in the ocean directly influence the climate), and hydrology.Thus, to explore the applicability of the ELM as a prediction tool in environmen-tal sciences, we selected nine different datasets from the three domains, with fourof them in climate (ENSO, YVR, PRECIP and TEMP), two in Earth and ocean(WIND and SO2), and three in hydrology (SFBAY, FRASER and STAVE) (seeAppendix A for details). While the selected datasets by no means span the wholerange of environmental datasets, they do provide better insight into the advantagesand disadvantages of various methods than a single dataset.As part of theML development process, the available data are generally dividedinto training and testing subsets. The training set is used to build the ML modelsand the testing set is used to determine the generalization ability of the trainedmodels. Table 2.1 shows the specifications of the datasets and the basic descriptivestatistics of the predictand (i.e. response variable) of the training and testing sets.One of the most important steps in the ML modelling process is the determi-nation of an appropriate set of predictors (Cherkassky et al., 2006; Solomatine andOstfeld, 2008; Maier et al., 2010). Usually, they are determined on an ad hoc basisor by using a priori system knowledge (Maier and Dandy, 2000). Since the focusof this research is the comparison of ML models, emphasis was not given to the15identification of the optimal predictors for each particular dataset.2.3 Extreme Learning Machines1The key element of an ANN is its distributed, nonlinear structure. An ANN iscomposed of a large number of highly interconnected neurons (or perceptrons) di-vided in layers (input, hidden, output), but working in concert to solve a predictionproblem (Haykin, 1998).The MLP architecture is probably the most popular type of ANN (Kuligowskiand Barros, 1998; Cawley et al., 2003; Haylock et al., 2006) and consists of at leastone hidden layer sandwiched between the input and output layers. Training ofthe ANN model involves adjusting the parameters iteratively so the error betweenthe model output yˆ and the predictand data y is minimized. The backpropaga-tion algorithm is often used to calculate the gradient of the error function, with agradient-descent approach used to reduce the mean squared error iteratively, whichcould be time consuming. There are also others issues to consider such as the num-ber of learning epochs, learning rate, stopping criteria, regularization and/or localminima. To overcome some of these issues, an algorithm called ELM for SLFNscan be used.The SLFN can be defined as:yˆ j =L∑i=1β is(wi ·x j+bi)+β 0; ( j = 1; : : : ;N) (2.1)where x j ∈ Rd and yˆ j ∈ Rm are the model input and output, respectively, N thenumber of cases, wi and bi the weight and bias parameters in the hidden layer(with L hidden nodes), β i and β 0 the weight and bias parameters in the outputlayer, and the sigmoidal activation function s iss(x) =11+ exp(−x) : (2.2)Huang et al. (2006b) proved that the wi and bi parameters can be randomlyassigned (Schmidt et al., 1992) if the activation function is infinitely differentiable,1The material in the first part of this section, before Sec. 2.3.1, is already covered in Sec. 1.5.16so only the β parameters need to be optimized when minimizing the mean squarederror between the model output yˆ and the target data y. Thus, in the ELM approach,training a SLFN is equivalent to simply finding the least-squares solution of thelinear system (Huang et al., 2011), HB= Y, where the hidden layer output matrixH of dimension N× (L+1) isH=1 s(w1 ·x1+b1) · · · s(wL ·x1+bL)....... . ....1 s(w1 ·xN +b1) · · · s(wL ·xN +bL) ; (2.3)the B parameter matrix of dimension (L+ 1)×m and the target data matrix Y ofdimension N×m areB=β T0β T1...β TL ; and Y=yT1...yTN : (2.4)Algebraically, the linear system for B is solved via the Moore-Penrose generalizedinverse H† (Huang et al., 2011),Bˆ=H†Y: (2.5)The bias parameter β 0 is not essential for ELM, and we tested ELM both withand without β 0 (where the first column of ones inH and βT0 in Eqs. (2.3) and (2.4)are deleted, as in Huang 2014).2.3.1 Ensemble of ELMLike many other learning algorithms the stochastic nature of ELM means that dif-ferent trials of simulations may yield different results. The random assignmentof weight and bias parameters in the hidden layer makes each ELM distinct, sowe used an ensemble (or committee) of ELMs. We repeatedly ran the ELM forE times with the same dataset and network structure (number of HN). The outputof the ELM ensemble is the averaged output of the individual ensemble members.17As shown in Sun et al. (2008) and Liu and Wang (2010), output from an ELMensemble is more stable than from a single ELM.2.3.2 Range of Weights and ELM PredictionsProper random initialization of the weights is one of the most important prerequi-sites for fast convergence of feedforward ANNmodels like MLP. The effectivenessof random weight initialization depends on the initial weight distribution. A bal-ance must exist to ensure that the sigmoidal function does not remain completelylinear nor become saturated near the asymptotic limits of 0 and 1. If the weight dis-tribution is too narrowly centered about the origin, both activation and error signalswill die out on their way through the network. If it is too broad, the saturated ac-tivation function will block the backpropagated error signals from passing throughthe node.Often a uniform distribution, spread over a fixed interval [−r;r], is used for theweights. Thimm and Fiesler (1997) gave a review of random weight initializationmethods for MLP and found that r should be of the formr = a ·F−0:5; (2.6)where F is the fan-in of the node (i.e. the number of incoming weights). Usinghyperbolic tangent as the activation function, they found a ≈ 1 to be a reasonablevalue.Parviainen and Riihimaki (2013) raised questions about meaningfulness ofchoosing model complexity based on HN only, as is traditionally done with ELM.Usually, only the number of HN is carefully selected, and a fixed variance or in-terval parameter is used for the distribution from which the random weights aredrawn. The width of the interval is seen as simply a constant instead of a modelparameter. As weights are random variables, diversity (i.e. not having similar val-ues for all weights) is an important factor because model complexity is affectedby the variance of the distribution. Small variance means small weights and linearmodels, while large variance can produce larger weights and nonlinear models.Also, depending on how the random weights and bias parameters are selectedfor the hidden nodes, the learning time for ELM can decrease dramatically (Liu18and Wang, 2010). Parviainen and Riihimaki (2013) found that using a differentvariance can improve performance, achieving a similar effect as regularizers intraditional ANN models. Thus they concluded that the weight variance should alsobe a parameter of ELM.Huang et al. (2012) proposed an ELM using a regularization parameter (i.e.weight penalty), as the only parameter to be adjusted. They set the number of HNto 1000 and searched 50 different values (e.g. [2−24;2−23; : : : ;224;225]) to find thebest weight penalty parameter. They found that the generalization performancewas not sensitive to the number of HN as long as it was set large enough (≥ 1000for their experiments). Two considerations: First, the intervals and ranges of the1-dimensional (1-D) optimization have to be chosen carefully to avoid bad gen-eralization performance (if the 1-D is too coarse) or high computational cost (ifthe 1-D is too refined). Second, if the number of HN is set large then the 1-Dsearch will perform operations on large matrices. Thus there is no guarantee thatthe computational cost will be less than a simple sequential search of the numberof HN.Model selection with two parameters is slower than with only one. To avoidthis drawback, we adopted the approach used by Thimm and Fiesler (1997) todefine a reasonable range for weight initialization. Because we used the logisticsigmoidal function, we chose a= 2 in Eq. (2.6) (equivalent to choosing a= 1 withthe hyperbolic tangent function). Similarly, the random bias parameter of the HNin our ELM is uniformly distributed within [−2;2].2.3.3 Network StructureFinding an appropriate network structure for ANNs involves the selection of a suit-able number of HN. The optimal network structure is a balance between general-ization ability and network complexity. If network complexity is too low (i.e. toofew HN), the network might be unable to capture the true nonlinear relationship. Ifnetwork complexity is too high, the network might have decreased generalizationability due to overfitting (i.e. fitting to noise in the data) and increased trainingtime. However, the choice of the optimal number of HN is problem dependent,i.e., different phenomena will require distinct number of HN to build a successful19regression model. In general, the number of HN is selected empirically (Huanget al., 2006b; Sun et al., 2008; Feng et al., 2009; Liu and Wang, 2010).For an ELM, the range in which the optimal number of HN can be found isfar wider (sometimes by orders of magnitude) than for an ANN using iterativenonlinear optimization. Therefore, vigilant care must be taken with the automatedprocedure used to select the number of HN. Automatic procedures generally usea predefined step size to find a solution for a given problem. However, by start-ing with small step sizes the algorithm could spend a considerable amount of timesearching for an optimal solution. Starting with large step sizes, the algorithmcould miss parsimonious models (models with fewer HN). Parsimonious modelsare desirable because increasing the number of HN leads to increasing computa-tional time and potential to overfit. In other words, the appropriate fixed step sizefor a sequential/1-D search is hard to determine. We used the hill climbing algo-rithm to find the number of HN of the ELM.2.4 Hill ClimbingHill climbing is a simple mathematical optimization technique that starts with anarbitrary solution to a problem, then attempts to find a better solution (in our casesmaller RMSE) by incrementally changing a single element of the solution (in ourcase the number of HN). If the change produces a better solution, an incrementalchange (often by a defined step size) is made towards the new solution. Iterationscontinue until no further improvements can be found. In the hill climbing algorithmused here, the step size is automatically adjusted by the algorithm. Thus it shrinkswhen the probes do poorly and it grows when the probes do well, helping thealgorithm to be more efficient and robust (Yuret, 1994). The necessary steps forimplementing the hill climbing algorithm are shown in Algorithm 3.For each candidate solution a k-fold cross-validation within the training set isperformed to avoid overfitting. In the k-fold cross-validation, the training set issplit into k smaller sets. Of the k sets, a single set is retained as the validation datafor validating the model, and the remaining k− 1 sets are used as training data.This procedure is then repeated k times so every fold has been used for validation.Thus, the final skills of the candidate is based on validation using all k folds.201 begin2 // Variable Initialization;3 acce // Defines the factor of expansion and contraction ofthe search space;4 step.size← initialStepSize // Defines the expansion andcontraction of the search space;5 ensemble.number // number of ensemble members;6 ensemble.number.training // smaller than ensemble.number;7 candidates← [−1=acce;0;1=acce;acce] // vector containing thesolution candidates;8 best.HN← initialnumberHN // best number of hidden neurons;9 best.rmse← ∞ // best RMSE;10 // Searching the best number of HN;11 while Best model = False do12 for j=1 to 4 do13 for e = 1 to ensemble.number.training do14 HN.candidate← round(best.HN + step.size*candidates[j]) // numberof HN according to the candidate and the stepsize;15 for n = 1 to Number of folds used for cross-validation do16 Evaluate(ELM using HN.candidate) using Eq. (2.5);17 end18 end19 end20 best.solution← smaller RMSE among the cross-validated candidates;21 best.j← index j of the smallest RMSE among the cross-validated candidates;22 if best.rmse < best.solution then23 // Checking if the best candidate is the same as inthe previous step;24 if (best.HN = round(best.HN + step.size*candidates[best.j])) then25 Best model = True26 else27 // Adjusted according with the best candidate;28 best.rmse← best.solution;29 best.HN← round(best.HN + step.size*candidates[best.j]);30 step.size← maximum(1,round(step.size*candidates[best.j]));31 end32 else33 Best model = True34 end35 end36 // Building the final model with the best number of HN;37 for e = 1 to ensemble.number do38 Evaluate(ELM with best.HN) using Eq. (2.5);39 end40 endAlgorithm 3: Hill climbing algorithm used to identify the optimum numberof HN in the ELM models.212.5 Benchmark ModelsFor the sake of comparison we used four different prediction models (one linear andthree nonlinear): MLR, bootstrap-aggregated ensemble ANN with an automatedprocedure to find the number of HN (henceforth just called ANN for brevity), RF,and support vector regression with evolutionary strategy (SVR-ES).2.5.1 Artificial Neural NetworkIn this work the architecture used consists of a MLP neural network with one hid-den layer and the analytical calculation of the gradient is made via backpropaga-tion.Bootstrap aggregation (bagging) (Breiman, 1996) is an ensemble method whichallows greater model stability and more efficient use of training data relative toearly stopping methods (methods which stop when the validation error is at a min-imum instead of continuing optimization until errors on the training data are min-imized) (Cannon and Whitfield, 2002). Bagging generates different training datasets by bootstrapping (i.e. resampling with replacement from the available cases).The resampled dataset are used to train the models in the ensemble. The out-of-bag cases (i.e. cases not selected in a bootstrap sample) can be used as a validationdataset, e.g. for determining when to stop training with the training dataset.As the ANN with backpropagation requires fewer HN than ELM, a sequential,small 1-D search is often used to select the optimal number. The best number ofHN is determined by the ensemble of ANNmodels based on the out-of-bag RMSE.To find the optimal number of HN we sequentially increased the number of HN oneby one, until the maximum number of HN was achieved or consecutive incrementswere without improvements.2.5.2 Random ForestConceptually simple yet powerful, decision tree models can be applied to datasetswith large numbers of cases and predictors and are extremely resistant to out-liers. Among the decision trees used in ML, the CART model, first introducedby Breiman et al. (1984), is the most commonly used. A regression tree growsvia binary recursive partitioning using the sum of squares as the minimization cri-22terion. A RF (Breiman, 2001) is a bootstrap ensemble of many decision trees(CART). Each tree is grown over a bootstrap sample from the training dataset us-ing a randomly selected subset of the available predictors at each decision branch.The average output of all trees is the output of the RF model.2.5.3 Support Vector Regression with Evolutionary StrategySVR, a kernel method, maps the input data x into a higher dimensional featurespace using a nonlinear mapping function Φ. The nonlinear regression problembetween x and the predictand y can thus be converted to a linear regression problembetweenΦ(x) and y. The global minimum solution of the linear regression problemis achieved without nonlinear optimization, hence there are no local minima in theobjective function.The performance of an SVR model is highly dependent on the choice of thekernel function and the parameters (Hastie et al., 2009; Hsieh, 2009). The use ofa suitable nonlinear kernel function in SVR allows it to be fully nonlinear, whilethe use of a linear kernel function restricts SVR to a linear model. To find theoptimal SVR parameters, Lima et al. (2013) (Appendix B) proposed the SVR-ESwhich was faster than a grid search. It combines some of the parameters estimationproposed by Cherkassky and Ma (2004) and a simple evolutionary strategy called“uncorrelated mutation with p step sizes” (Eiben and Smith, 2003). In this study,we used a nonlinear kernel, the RBF (i.e. the Gaussian kernel), in the SVR-ES.More information about the SVR-ES can be found in Appendix B.2.6 Experimental ResultsTo demonstrate the practical use of the ELM as a prediction method for environ-mental sciences, experiments were performed on the datasets described in the Ap-pendix. The experiments were divided into three groups. First, a comparison wasmade between the nonlinear methods using the automated procedures to find theparameters (Section 2.6.1). Second, to check if the automated procedures used tofind the number of HN performed satisfactorily, a separate set of experiments in-volving a 1-D search was performed to determine approximately the optimal num-ber of HN for ELM-E, ELM-S and ANN (Section 2.6.2). Third, to compare the23ELM-S and ANN while excluding the time used to find the number of HN, anotherset of experiments was performed given the number of HN (Section 2.6.3).All the prediction models can be built using the free R software environmentfor statistical computing (R Core Team, 2014). For RF we used the R packagerandomForest (Liaw and Wiener, 2002). We developed the SVR-ES usingthe R package e1071 (Meyer et al., 2014) and R native libraries. ANN uses themonmlp package (Cannon, 2012a). MLR, from the R package stats, uses stepwiseselection of predictors (Venables and Ripley, 2002) based on the Akaike infor-mation criterion (AIC). We also tested MLR using Bayesian information criterion(BIC) and a 5-fold cross-validation. However we obtained similar results withminor differences in the tested cases. We developed the ELM using the R pack-age MASS (Venables and Ripley, 2002) and R native libraries. SVR-ES and ELMcode are freely available from the project website: http://forai.r-forge.r-project.org/.The experiments were realized using a R version 3.0.3 environment running in ai5-2500, 3.30GHz CPU and 8 GB-RAM computer.To measure the performance of nonlinear methods relative to the MLR, wecalculated the skill score (SS) of the MAE and of the RMSE. The SS is defined by:SS=A−ArefAperfect−Aref ; (2.7)where A is a particular measure of accuracy, Aperfect is the value of A for a set ofperfect forecasts (zero for MAE or RMSE), and Aref is the value of A computedover the set of reference forecasts (in our case the MAE or RMSE value of theMLR model). Positives SS means better performance than the reference modeland negative values, worse performance.2.6.1 Automated ProceduresThe experiments were designed to be fully automated, i.e., no human interventionwas necessary to tune the model parameters. As the potential advantage claimedby the ELM practitioners is the powerful combination of better generalization withlow training/prediction time, we also measured the cpu time necessary to train andtest the models. For model timing, we took into account the time used to tune theparameters automatically, to train the model with the estimated parameters, and to24generate predictions for the test data using the trained model.Data were standardized (zero mean and unit variance) and divided into a train-ing set and an independent test set to test the trained model. The test set waskept completely unseen during the training process. The RF and ANN performedtheir own split-sample validation via the out-of-bootstrap samples in the bootstrapaggregation. The SVR-ES and the ELM with hill climbing performed their own 5-fold cross-validation when choosing the best parameters (number of HN for ELMand C, ε and γ for SVR , which control the regularization, the insensitivity of theerror norm and the width of the Gaussian kernel function, respectively, in SVR).Thirty different trials were performed with different seeds for the random num-ber generator. At each trial an ensemble of 30 members were used for the ELM,ANN, and SVR-ES and the averaged output of the individual members used asthe output of the ensemble. The ANN setup was as follows: the range of initialuniformly-distributed random weights was [−0:5;0:5], and the maximum numberof iterations was 5000. The stopping criterion for the automatic choice of the num-ber of HN was three consecutive increments without improvements in the objectivefunction or the number of HN reached 50. For RF, we used 500 as the numberof generated trees, 1 as the minimum size of terminal nodes, and a search in therandomForest package was applied to find the optimal number of predictorvariables to be randomly selected at each split (as determined by out-of-bag errorestimates). For the SRV-ES, C and ε were initialized by the Cherkassky and Ma(2004) guidelines and γ by 0:001, and the stopping criterion for the evolutionaryalgorithm was 50 consecutive generations without improvements or the number ofgenerations reached 500. After the three parameters were tuned, we further per-formed bagging (30 times) to improve the SVR final results.To check some issues raised in Section 2.3, we performed a preliminary exper-iment with ELM. First we tested the original setup used by Huang et al. (2006b)in which the predictors were normalized to the range [0;1] and the predictand to[−1;1]. Although Huang et al. (2006b) did not mention the range of the weightsused, the paper provides a link to a downloadable ELM code where the weights arerandomly assigned from a uniform distribution of [−1;1] and the bias parametersfrom a uniform distribution of [0;1]. Also the output node is linear and without abias parameter. We averaged the results of 30 trials of simulations and we used hill25climbing to find the optimal number of HN with E = 1 (i.e. a single ELM for eachtrial). The RMSE SS from this original ELM is represented in Figure 2.1 underELM-O. Then we reran with E = 30 (i.e. 30 ensemble members for each trial),with results indicated in Figure 2.1 by ELM-E. With E = 30 and having added thebias parameter in the output layer we ran ELM-B. For the final model based onSection 2.3.2, predictors and predictand were standardized to zero mean and unitstandard deviation, the bias parameters in the HN were randomly chosen from auniform distribution of [−2;2], and the weights in the HN were randomly assignedfrom a uniform distribution of [−r;r], with r scaled by Eq. (2.6) (with a= 2). ThisELM-S model was run with E = 30 and with a bias parameter in the output node(Figure 2.1). As expected, ELM-O was unstable (with a few exceptions), hencethe hill climbing algorithm was not able to find the proper number of HN. Con-sequently ELM-O performed poorly compared to the others and in some cases toMLR. Huang et al. (2006b) gradually increased the number of HN by increments of5 (using cross-validation) to find the optimal number of HN for ELM. ELM-E im-proved the stability of ELM-O and the hill climbing performed satisfactorily. Theonly noticeable improvement between ELM-E and ELM-B is in SFBAY. Hence thebias parameter in the output layer was retained in ELM-S as the SS might improveat almost no cost. Overall ELM-S using Eq. (2.6) to scale the range of weightstended to have better SS than the other three setups.Next we compare the ELM-S model performance over test data to those of theother models (RF, ANN and SVR-ES) in terms of the MAE and RMSE SS (Figures2.2 and 2.3) and the computer time (Figure 2.4) over the nine datasets. In addition,Tables 2.2, 2.3, and 2.4 show respectively MAE, RMSE, and computer time overthe nine datasets.26Table 2.2: Mean and standard deviation (in parenthesis) of MAE (over 30trials) from the test set using RF, SVR-ES, ANN, and ELM-S. The modelwith the lowest MAE of each dataset is shown in boldface.MAEMLR RF SVR-ES ANN ELM-SENSO 0.437 0.416(0.002) 0.429(0.005) 0.423(0.005) 0.428(0.003)SFBAY 0.652 0.569(0.004) 0.603(0.013) 0.568(0.006) 0.571(0.008)WIND 22848 17742(112) 17070(164) 17016(188) 17730(645)FRASER 0.240 0.201(0.000) 0.191(0.011) 0.182(0.001) 0.184(0.002)YVR 0.363 0.366(0.000) 0.354(0.001) 0.353(0.000) 0.353(0.001)STAVE 0.201 0.162(0.002) 0.144(0.002) 0.153(0.001) 0.153(0.001)PRECIP 0.235 0.233(0.000) 0.225(0.000) 0.231(0.000) 0.225(0.000)TEMP 0.213 0.211(0.000) 0.196(0.001) 0.203(0.001) 0.195(0.001)SO2 0.629 0.497(0.001) 0.572(0.012) 0.586(0.001) 0.580(0.001)Table 2.3: Mean and standard deviation (in parenthesis) of RMSE (over 30trials) from the test set using RF, SVR-ES, ANN, and ELM-S. The modelwith the lowest RMSE of each dataset case is shown in boldface.RMSEMLR RF SVR-ES ANN ELM-SENSO 0.550 0.531(0.003) 0.528(0.003) 0.527(0.005) 0.527(0.004)SFBAY 1.050 0.903(0.006) 0.904(0.011) 0.845(0.011) 0.878(0.014)WIND 27772 22109(116) 21586(200) 21226(244) 21477(414)FRASER 0.306 0.264(0.001) 0.255(0.011) 0.242(0.001) 0.248(0.002)YVR 0.439 0.447(0.001) 0.428(0.001) 0.427(0.000) 0.427(0.000)STAVE 0.294 0.259(0.001) 0.242(0.002) 0.243(0.001) 0.241(0.001)PRECIP 0.289 0.285(0.000) 0.277(0.001) 0.284(0.001) 0.277(0.000)TEMP 0.275 0.270(0.000) 0.252(0.001) 0.263(0.001) 0.251(0.001)SO2 0.802 0.644(0.001) 0.734(0.013) 0.751(0.001) 0.741(0.001)27−100102030ENSO SFBAY WIND FRASER YVR STAVE PRECIP TEMP SO2Skill Score (%) of RMSEELM−OELM−EELM−BELM−SFigure 2.1: Comparison between different ELM setups, with the “waistline”of the boxplot showing the median SS of the 30 trials for the ninedatasets. The 25th and 75th percentiles are shown, respectively, asthe bottom and top of each box. Distance between the 25th and 75thpercentiles is the interquartile range (IQR). The upper whisker extendsfrom the upper hinge to the highest value that is within 1.5 IQR of thehinge. The lower whisker extends from the lower hinge to the lowestvalue within 1.5 IQR of the hinge. Data beyond the end of the whiskersare outliers and are plotted as points. The four models are the originalsingle ELM (ELM-O), the ensemble of 30 models (ELM-E), the ensem-ble with output bias parameter (ELM-B) and the ELM ensemble withscaled weights (ELM-S).280102030ENSO SFBAY WIND FRASER YVR STAVE PRECIP TEMP SO2Skill Score (%) of MAERFSVR−ESANNELM−SFigure 2.2: Boxplot of the MAE SS of the 30 trials.01020ENSO SFBAY WIND FRASER YVR STAVE PRECIP TEMP SO2Skill Score (%) of RMSERFSVR−ESANNELM−SFigure 2.3: Boxplot of the RMSE SS of the 30 trials.291e+001e+021e+041e+06ENSO SFBAY WIND FRASER YVR STAVE PRECIP TEMP SO2Time(s)RFSVR−ESANNELM−SFigure 2.4: The total computer time used for building (including testing) amodel (RF, ELM-S, ANN or SVR-ES), plotted on a log scale (base 10).30Table 2.4: Mean and standard deviation (in parenthesis) of total computer time (in seconds) used for building (includingtesting) a model. The model with the lowest computer time of each dataset is shown in boldface. The number ofHN is also shown.RF SVR-ES ANN ELM-STIME TIME TIME HN TIME HNENSO 1(0) 4(2) 32(12) 6(2) < 1(0) 11(3)SFBAY 2(0) 14(9) 38(11) 6(2) 1(0) 17(5)WIND 13(2) 22(9) 290(346) 2(2) 2(1) 24(7)FRASER 25(4) 404(244) 465(270) 15(6) 12(4) 31(7)YVR 30(1) 532(337) 114(49) 3(2) 9(3) 19(4)STAVE 375(46) 8019(6750) 2050(772) 16(3) 2533(1217) 415(49)PRECIP 811(107) 6774(5083) 1327(569) 3(1) 930(349) 282(41)TEMP 1971(164) 30147(18311) 13845(7429) 14(3) 14846(7386) 771(138)SO2 2475(199) 446950(689120) 5006(1673) 13(3) 94768(34676) 1440(162)31Table 2.5: Comparison of MAE, RMSE and time (in seconds) from the training and testing sets using the averagenumber of HN selected by the automated procedures (Table 2.4). The lowest computer time (in seconds), testMAE and test RMSE of each dataset are shown in boldface.MAE RMSE TIMEANN ELM-S ANN ELM-S ANN ELM-STRAIN TEST TRAIN TEST TRAIN TEST TRAIN TEST TRAIN TEST TRAIN TESTENSO 0.317 0.419 0.336 0.429 0.423 0.523 0.444 0.530 3.334 0.004 0.023 0.005SFBAY 0.471 0.563 0.480 0.571 0.659 0.838 0.674 0.870 4.903 0.006 0.079 0.006WIND 14067 17160 15109 17556 17834 21354 18423 21372 12.679 0.005 0.138 0.016FRASER 0.162 0.184 0.169 0.185 0.224 0.243 0.231 0.249 33.339 0.026 0.573 0.194YVR 0.351 0.353 0.351 0.353 0.425 0.427 0.426 0.427 11.821 0.014 0.431 0.057STAVE 0.150 0.153 0.144 0.153 0.250 0.243 0.239 0.240 107.639 0.040 93.440 1.077PRECIP 0.226 0.231 0.210 0.225 0.279 0.284 0.260 0.276 85.886 0.047 35.757 1.867TEMP 0.193 0.203 0.174 0.195 0.248 0.263 0.225 0.251 557.260 0.197 434.336 9.577SO2 0.580 0.589 0.523 0.581 0.742 0.754 0.672 0.741 275.208 0.202 3469.477 26.96432For the ENSO dataset, all the nonlinear methods performed better than MLR(SS=0). All the nonlinear models had similar RMSE SS (Figure 2.3). Overall RFtended to perform better than the other nonlinear models based on the MAE SS(Figure 2.2). For the mean time spent to build the models (Figure 2.4 and Table2.4), all the models were fast and the ELM-S ran at 2.18x, 62.3x and 7.45x thespeed of RF, ANN and SVR-ES respectively.For the SFBAY dataset, the nonlinear models all had SS advantages over MLR.ANN had the best RMSE SS followed by ELM-S, while SVR-ES and RF had sim-ilar RMSE SS. RF, ANN and ELM-S had similar MAE SS. Overall ANN tendedto perform better than the other nonlinear models. ELM-S ran at 1.90x, 32.7x and12.2x the speed of RF, ANN and SVR-ES, respectively.For the WIND dataset, even the worst nonlinear model (RF) had a much bet-ter SS than the MLR model, and ANN slightly led the other nonlinear models inthe MAE and RMSE SS. Overall ANN tended to perform better than the othernonlinear models. WIND has a relatively large number of predictors (160) whencompared to the number of training cases (279). This could explain the large stan-dard deviation in the ANN’s computing time (Figure 2.4 and Table 2.4), as addingone extra HN means 162 more weights to adjust in the ANN. The ELM-S ran at5.34x, 118x and 8.86x the speed of RF, ANN and SVR-ES, respectively.For the FRASER dataset, again all nonlinear models had a large advantage overMLR based on the SS, with ANN leading in the SS. This dataset has more observa-tions than the ones mentioned before (Table 2.1), but it has only four predictors soANN had less variability. The ELM-S ran at 2.03x, 37.8x and 32.8x the speed ofRF, ANN and SVR-ES, respectively. Overall ANN tended to perform better thanthe others techniques for the small datasets (ENSO, SFBAY,WIND and FRASER).The powerful combination of ANN and bagging is likely responsible for the betterperformance as bagging is helpful when the number of cases is not large.For the YVR dataset, RF performed worse than MLR in SS while the othernonlinear methods all outperformed MLR. Overall SVR-ES, ANN and ELM-S hadsimilar skills. As expected for precipitation problems (similarly in PRECIP), theskills were not as high as in others problems due to the complexity and difficultyof forecasting the phenomenon itself (Kuligowski and Barros, 1998; Cawley et al.,2007). ELM-S kept its speed performance, running at 3.38x, 13.0x and 60.8x the33speed of RF, ANN and SVR-ES respectively.For the STAVE dataset, SVR-ES tended to perform best overall based on theMAE SS, though ELM-S was slightly ahead in the RMSE SS. RF was the fastest(but with the lowest SS among the nonlinear models), ANN and ELM-S had similarcomputational time and SVR-ES was the slowest.For PRECIP, ELM-S and SRV-ES attained the highest SS. However the nonlin-ear models had only minor improvements over the linear model. The computationaltime was highest for SVR-ES, followed by ANN, ELM-S and RF.For TEMP, ELM-S had the best SS around 10% (Fig. 2.2 and 2.3) followedclosely by SVR-ES. ANN and ELM-S had comparable times but RF was again thefastest and SVR-ES the slowest.For SO2, RF had much better SS (Figs. 2.2 and 2.3) than the other three non-linear methods, all of which performed better than MLR. In SS, SVR-ES showedlarge scatter, while ELM-S did better than ANN. However, for computational time,ELM-S performed worse than ANN and RF (Fig. 2.4). This occurred because thenumber of HN needed by ELM-S was large (Fig. 2.5). RF, the fastest model, ranat 2.02x the speed of ANN, which was the second fastest. Thus RF is clearly thebest method for this dataset.Table 2.6 summarizes the results by showing the average performance (rank)over the tested datasets. However, one must avoid trying to identify a single “win-ner” method as the datasets are very different from each other and they came fromdifferent domains in the environmental sciences.The number of steps used to find the number of HN by the ANN sequential pro-cedure and the ELM-S hill climbing procedure are similar. Although hill climbinguses adaptive steps, the computational cost necessary to solve Eq. (2.5) is still highwhen using a large number of HN. Figure 2.4 shows ELM-S losing its time advan-tage when large numbers of HN (Fig. 2.5 and Table 2.4) have been used.Also associated with the size of the training set (including number of casesand number of predictors) is the number of HN used by the ELM-S (Table 2.4)which is larger than the number of HN used by ANN. In particular, ELM needsmore HN than ANN to compensate for the randomness of the parameters wi and biin Eq. (2.1). However as mentioned in Section 2.3 this drawback is compensatedby ELM having to find the least-squares solution of a linear system, which does34Table 2.6: Ranks based on the MAE and RMSE (over 30 trials) for the testdata, with rank 1 being the best. The bottom row gives the mean for eachmethod.MAE Rank RMSE RankRF SVR-ES ANN ELM-S RF SVR-ES ANN ELM-SENSO 1 4 2 3 4 3 1 1SFBAY 2 4 1 3 3 4 1 2WIND 4 2 1 3 4 3 1 2FRASER 4 3 1 2 4 3 1 2YVR 4 3 1 1 4 3 1 1STAVE 4 1 2 2 4 2 3 1PRECIP 4 1 3 1 4 1 3 1TEMP 4 2 3 1 4 2 3 1SO2 1 2 4 3 1 2 4 3Average 3.1 2.4 2.0 2.1 3.5 2.6 2.0 1.6not have multiple minima and is faster than nonlinear optimization. Even if ELMuses far more HN than ANN, it does not imply ELM uses more adjustable weightand bias parameters than ANN. If the feedforward neural network has d inputs,one layer of L HN and m outputs, the number of adjustable parameters is (L+1)mfor ELM and (d+1)L+(L+1)m for ANN. For TEMP, with d = 106 and m = 1,L= 771 for ELM and L= 14 for ANN (Table 2.4) give 772 adjustable parametersfor ELM and 1513 for ANN.Liu et al. (2012) pointed out that ELM has superior computational time com-pared to the SVR, and this superiority increases drastically as the size of trainingset grows. Similar to Liu et al. (2012), Figure 2.4 and Table 2.4 show that ELM-S isfaster than SVR-ES in all studied cases and SVR-ES starts to lose speed advantageto ANN when the size of the training set starts to grow. Figures 2.2 and 2.3 andTables 2.2 and 2.3 show that the SS of ELM-S were generally as good as (and insome cases better than) the SS of the SVR-ES. However ELM-S and SVR-ES onlystart to overcome the ANN SS when the size of the dataset is large (i.e. PRECIP,TEMP and SO2).35101000ENSO SFBAY WIND FRASER YVR STAVE PRECIP TEMP SO2Number of HNANNELM−SFigure 2.5: Boxplot of the number of HN (plotted on a log scale) used in theANN and ELM-S models as chosen by the automated procedures (30trials).2.6.2 1-D SearchTo check if the automated procedure used to find the number of HN performedsatisfactorily, a separate set of experiments involving a 1-D search using all thepredictors was performed to determine approximately the optimal number of HNfor ELM-E, ELM-S and ANN. The tested cases were 1, 5, 10, 20, 50, 100, 200,500, and 1000 HN for the ELM-E/ELM-S and 1, 2, 5, 10, 20, 30, 50, 80, and 100HN for the ANN, with 30 ensemble members used in all three models.As shown in Figs. 2.6 and 2.7, large differences in RMSE were present amongthe various choices for the number of HN. Also, the weight range for ELM hada noticeable effect, as seen in the difference in the RMSE between ELM-S andELM-E. To achieve similar RMSE, ELM-E needs more HN than ELM-S. Similarto Parviainen and Riihimaki (2013), our results show that if the weight range wasnot carefully chosen a large number of HN could be necessary, with the increasednetwork size (consequently network complexity) slowing down the computation.36Number of Hidden NeuronsRMSE1 5 10 20 50 100 200 500 10000.40.60.81.01 2 5 10 20 30 50 80 100ELM−EELM−SANN(a) TEMPNumber of Hidden NeuronsRMSE1 5 10 20 50 100 200 500 10000.280.290.300.310.320.330.341 2 5 10 20 30 50 80 100ELM−EELM−SANN(b) PRECIPFigure 2.6: RMSE of the training set for (a) TEMP and (b) PRECIP usinga 1-D search over the number of HN. The number of HN used by the1-D search for ELM-E/ELM-S is provided along the bottom axis andfor ANN, along the top axis.Thus, it is possible to save computational time and improve the RMSE by adjustingthe range of the weights in advance using Eq. (2.6). However, as shown in Fig. 2.7,the computational time taken to train the models (ELM-S and ELM-E) is similar ifthey use the same number of HN.For the ENSO dataset, the 1-D search found the optimal number of HN to be10 for the ANN and between 10 and 20 for the ELM-S. Hence for ENSO, ELM-Swas indeed faster than ANN using backpropagation and it had good generaliza-tion. The first impression is that the automated procedure used for the ANN wasunderestimating the number of HN (mean of 7 over 30 trials in Fig. 2.5), but forthe ANN (using the 1-D search) the RMSE difference between 5 HN and 10 HNwas only 0:1%. Since the difference in RMSE in this plateau region was small,the automated procedure for ANN worked well selecting the parsimonious mod-els with good accuracy. A plateau is a flat region of the search space where thevalue returned by the target function is indistinguishable from the value returnedfor nearby regions. It can be a flat local maximum (or minimum), from which nouphill (downhill) exit exists, or a shoulder, from which progress is possible (Russelland Norvig, 2003). As outlined in Cannon and Whitfield (2002) the ANN plateauis due to stability from bagging, thus there is little penalty for selecting more HN37Number of Hidden NeuronsTime (s)1 5 10 20 50 100 200 500 10001101001000100001 2 5 10 20 30 50 80 100ELM−EELM−SANN(a) TEMPNumber of Hidden NeuronsTime (s)1 5 10 20 50 100 200 500 10001101001000100001 2 5 10 20 30 50 80 100ELM−EELM−SANN(b) PRECIPFigure 2.7: Computational time (in seconds) versus the number of hiddennodes (HN) for (a) TEMP and (b) PRECIP plotted on a log scale. Thenumber of HN used by the 1-D search for ELM-E/ELM-S is providedalong the bottom axis and for ANN along the top axis. The ELM-E andELM-S curves essentially overlapped in (a) and (b).than are strictly necessary. Fig. 2.5 and Table 2.4 also show the number of HNchosen by the hill climbing procedure for ELM-S, where the average number ofHN over the 30 trials was 11, close to the best range of values (10-20) found by the1-D search. Thus the hill climbing for ELM-S also worked well.For SFBAY the 1-D search found the optimal number of HN to be 5 for theANN and 20 for the ELM. For automated procedures the average number of HNwas 6 for ANN and 17 for ELM. For WIND the 1-D search found as the optimalnumber of HN 2 for the ANN and 20 for the ELM, while the automated proceduresselected 3 and 26 for the ANN and ELM, respectively. For FRASER the 1-D searchfound the optimal number of HN 10 for the ANN and 50 for the ELM, while theautomated procedures selected 15 and 31 for the ANN and ELM, respectively. ForYVR the 1-D search found as the optimal number of HN 10 for the ANN and 10for the ELM, while the automated procedures selected 3 and 19 for the ANN andELM, respectively. Again the impression of underestimation/overestimation is dueto the plateau region. For STAVE, the 1-D search found the optimal number ofHN to be 30 for the ANN and 500 for the ELM-S, versus 16 for ANN and 415 forELM-S from the automated procedures.38For PRECIP (Fig. 2.6b), the 1-D search found the optimal number of HN to bearound 5 for the ANN and identified a plateau between 200 and 500 for the ELM-S.Thus the automated procedures, which selected 3 and 282 HN for ANN and ELM-S, respectively, worked properly. For TEMP (Fig. 2.6a) the 1-D search found theoptimal number of HN to be 30 for the ANN. We extended the search until 2000HN (not shown in the figure) for the ELM-S and the wide region between 500and 1000 was where the optimal solution was located. The automated proceduresselected 14 and 771 HN for ANN and ELM-S.For SO2 1000 HN was not enough to cover the necessary range of HN usedby ELM-S, thus we extended the search until 2000. The difference between ELM-S using 1000 and 2000 was around 1%, again a plateau region. Basically with alarge number of HN the susceptibility of the model to the increment of HN willdecrease. Consequently, adding a few HN in models with a large number of HNwill not make any difference because the variability will not be larger than thatfrom using different initial random weights. As expected when the number of HNis high (Fig. 2.4, Table 2.4) the time necessary to train and predict with the modelsis also high. For this dataset ELM-S needed a complex structure with many HN toget the best result. For ANN the best number of HN found by the 1-D search was30 (versus 13 from automated procedure), however the RMSE difference between10 HN and 30 was less than 1%. Since the difference in RMSE in this plateauregion was small, the automated procedure for ANN worked well, though none ofthe ANN (nor the ELM-S) results were able to beat the outstanding RF results.Hill climbing algorithms have the advantage that they are often quick to iden-tify a good solution to the problem, which is sometimes all that is required in prac-tical applications. However, the downside is that if the problem exhibits numerouslocal optima significantly worse than the global optimum, no guarantees can beoffered for the quality of solution found (Eiben and Smith, 2003). However, asthe adjustment of the number of HN in the ELM-S often does not show numerouslocal optima (Fig. 2.6), the hill climbing algorithm is a practical heuristic to solvethe problem.392.6.3 ELM-S vs. ANNIn order to compare the ELM-S and ANN excluding the time used to find the opti-mal number of HN, a separate set of experiments was performed with the numberof HN set to equal the average number of HN found earlier by the automated pro-cedures (Table 2.4).Similar to Section 2.6.1, data were standardized (zero mean and unit variance)and divided into a training set and an independent test set to test the trained model.Only one trial was performed, but an ensemble of 30 members was used for theELM-S and ANN, and the averaged output of the individual members used as theoutput of the ensemble.Table 2.5 shows the results which in general are similar to Section 2.6.1. Over-all ANN has better MAE and RMSE for small datasets (ENSO, SFBAY,WIND andFRASER) and ELM-S better for large datasets (PRECIP, TEMP, SO2). Except forSO2, where ELM-S used a large number of HN (1440), the training of ELM-S wasfaster than the training of ANN. For testing, ANN was faster than ELM-S, due tothe larger number of HN required by the ELM-S (1440 HN for SO2 and 771 HNfor TEMP). In practice, the testing time is insignificant compared to the trainingtime, but the computer time needed for finding the optimal number of HN shouldnot be ignored.2.7 Summary and ConclusionIn summary, we used nine environmental datasets to test four nonlinear predictionmethods. As expected, no single method was best for every problem in everysituation (Zhang, 2007; Elshorbagy et al., 2010b). All the tested nonlinear methods(RF, ELM-S, ANN, and SVR-ES) were suitable for the problems presented here,outperforming in most of the cases the linear method (MLR). In particular forthe SFBAY, WIND, FRASER, STAVE and SO2 datasets, the nonlinear methodsperformed much better than MLR.It is possible to do a fine adjustment of the parameters to achieve better predic-tion skills for all the nonlinear methods presented. As mentioned in Zhang (2007),however, it is important to take into consideration the computational time requiredto identify the optimal parameters, especially if time consuming trial and error40methods are used.We used simple automated procedures to find the parameters and build themodels. The proposed automated procedures performed satisfactorily and wereable to find suitable parameters in a reasonable amount of time.Excluding the large datasets, ELM-S tended to be the fastest among the non-linear models, followed by RF. For the large datasets, RF tended to be the fastest.ELM-S tended to have higher skill scores than RF - e.g. for the RMSE SS (Fig. 2.3),ELM-S was surpassing RF except for the SO2 dataset.For smaller datasets, SVR-ES tended to be faster than ANN, but when thenumber of data points (cases) became large, the computing time needed to find theoptimal parameters and to train the model exceeded the time taken by ANN.Among the four nonlinear models, in terms of skill scores, ANN tended to bethe best for the smaller datasets and ELM-S for the larger datasets. This could bedue to the fact that ANN used bagging which gave it an advantage when overfittingis more of a concern for datasets with relatively few training data. For datasetswith many training cases, overfitting is not an issue, and using bootstrap samplingin bagging meant ANN ensemble members were trained with fewer data pointsthan the ELM-S ensemble members which did not use bootstrapping to leave outdata during training.The ELM algorithm is indeed a promising approach for a range of predictionproblems in the environmental sciences. The model is relatively simple to im-plement and can be very fast to train on small to medium sized datasets. Also,performance in terms of SS is comparable to the other nonlinear models. For smallto medium datasets, ELM-S requires a moderate number of HN. However, it startsto lose the speed advantage when the datasets are large, with numerous data pointsand predictors. For large datasets, the optimal number of HN tends to be largeand finding the optimal number of HN becomes costly computationally. However,ELM-S appeared to run efficiently for datasets with large number of data points butfew predictors (YVR) or large number of predictors but few data points (WIND).Compared individually against ANN, ELM-S requires a larger number of HN tocompensate for the randomness of the weights used in the HN.As suggested by Parviainen and Riihimaki (2013) the range of the randomweights in ELM should also be considered a parameter for achieving the best re-41sults. Scaling the weights by Eq. (2.6) in ELM-S, we were able to improve theSS of ELM-E in seven of the nine datasets (Fig. 2.1). Also, computational costmay be lowered by scaling the range of the weights in advance, which may lead tomore parsimonious models (fewer HN required). As we chose a constant value ofa= 2 in Eq. (2.6), further work on how to optimally scale the weights for ELM iswarranted.For future work, we would investigate the OSELM (Liang et al., 2006), whichis more suitable for real-time applications. Additionally, datasets with differentnumbers of predictors and data points, procedures to screen out irrelevant predic-tors, and other automated procedures to find parameters are worth further study.Using bagging with ELM-S should also be considered to improve stability andaccuracy.42Chapter 3Forecasting daily streamflowusing online sequential extremelearning machines3.1 IntroductionForecasting techniques based on statistical modelling are used when simplicity androbustness of the model implementation are more important than an accurate de-scription of the various internal sub-processes. They are especially useful for op-erational purposes, e.g. as an early warning system for floods, especially for mes-oscale basins, which are characterized by rapid response time (Potter and Colman,2003; Bu¨rger et al., 2009; Sene, 2010; Shamseldin, 2010; Fleming et al., 2015).Short lead time streamflow forecasts aim to provide an estimate of future catch-ment conditions, based on recent and forecast meteorological conditions, and hy-drological observations. In practice the provided lead time (ranging from a fewhours to days) is constrained by factors such as catchment response times, and theaccuracy of meteorological forecasts at the required lead times.ML algorithms operate by building a model from example inputs in order tomake data-driven predictions. They have been widely studied and used in variousenvironmental sciences (Cherkassky et al., 2006; Hsieh, 2009; Lima et al., 2015),43including hydrology and water resources (Shamseldin, 2010; Hong, 2012; Rasouliet al., 2012; Abrahart et al., 2012; Maier et al., 2010; Wu et al., 2014; Chang et al.,2014). ML training methods are usually of batch-learning type, i.e., whenever newdata are received, batch learning uses the past data together with the new data toretrain the model. The computer time needed for retraining could be substantial,thereby reducing drastically the advantage of receiving real-time data. Because theaccuracy of each forecast can be evaluated as soon as the observed flow for thattime step becomes available, this information should, in principle, be useful foradjusting model parameters to maximize the accuracy of the next forecast (Wil-son and Valle´e, 2002; Potter and Colman, 2003; Sene, 2010). In addition, a largeamount of training data is often required to train batch learning methods (Lianget al., 2006; Hong, 2012). This often represents a constraint in practical applica-tions of ML as the overall model response will be less accurate when developedwith smaller training datasets. Changes in the numerical weather prediction model(Wilson and Valle´e, 2002), paucity and associated costs needed to obtain the data(Shamseldin, 2010), or addition of new stations are examples of situations where itis difficult (or even impossible) to obtain large amounts of data in advance. Thus,in practical applications, online sequential learning algorithms are preferred overbatch learning algorithms as sequential learning algorithms do not require expen-sive retraining whenever new data are received (Liang et al., 2006).The ELM algorithm implements a single-hidden layer feedforward neural net-work, that is similar in structure to a traditional MLP ANN model. The ELM isa nonlinear model that randomly chooses the weights leading to the HN and an-alytically determines the weights at the output layer (Schmidt et al., 1992). Thisallows the ELM model to be solved by simple linear least squares optimization in-stead of the nonlinear optimization needed for ANN. Since its introduction, ELMhas been successfully used in different research areas (Huang et al., 2011; Siqueiraet al., 2012; Huang et al., 2015), producing good generalization performance withlearning times that are thousands of times faster than traditional gradient-basedANN training algorithms (Huang et al., 2006b). Siqueira et al. (2012) comparedthe performance of ELM, ANN trained by a gradient-descent algorithm and echostate networks for the prediction of monthly seasonal streamflow of three Brazilianhydroelectric plants. They found that ELM had better skills than ANN. Similarly,44Chapter 2 tested the ELM on nine environmental regression problems (includingstreamflow forecasting). They evaluated the prediction accuracy and computationalspeed of the ensemble ELM against ANN trained by a gradient-descent algorithm.They found that ANN and ELM had similar skills, but ELM was much faster thanANN except for large datasets with large number of predictors.The sequential learning algorithm called OSELM (Liang et al., 2006) is a spe-cific case of ELM. Contrary to an ANN approach that requires a long trainingperiod and then behaves in a static manner, the OSELM can be automatically up-dated as new data arrive, either with a single datum or a chunk of data. In addition,the data already used by the model can be discarded. The OSELM requires onlyone parameter to be adjusted (the number of HN). To the best of our knowledge,the use of the OSELM has not been explored to date in the context of hydrology(e.g. streamflow forecasting). In this paper, we use the OSELM as an alterna-tive solution to forecast daily streamflow of two hydrometric stations in southernBritish Columbia, Canada. As predictors, we used local hydro-meteorological ob-servations and NOAA’s weather reforecast dataset generated by the GEFS model(Hamill et al., 2013). We used the OSMLR as a benchmark.Section 3.2 describes the data set used in our study. The forecasting methodis presented in Section 3.3. Results and discussion of the experiments are given inSection 3.4, followed by summary and conclusion in Section 3.5.3.2 Data DescriptionStreamflow is sensitive to the space-time variability of precipitation, land surfacecharacteristics and climate variability. In addition there is a large number of non-linear interactions related to the transformation of snow and rainfall into soil waterand runoff (Potter and Colman, 2003). As a case of study, we used the GEFS re-forecast output and local hydro-meteorological observations as predictors to fore-cast the daily mean streamflow at two hydrological stations at lead times of 1-3days during 1985-2011.The time-series of the stations tested here were obtained from Water Survey ofCanada at http://wateroffice.ec.gc.ca/. The Englishman River station (hydrometricstation 08HB002, 49◦ 19′ 0′′N, 124◦ 16′ 58′′W) is located on Vancouver Island,45near the town of Parksville, British Columbia. The Stave River station (hydromet-ric station 08MH147, 49◦ 33′ 21′′N, 122◦ 19′ 24′′W) is located above Stave Lake,southern British Columbia. Both stations have been used in research studies before(Fleming et al., 2007; Rasouli et al., 2012; Fleming et al., 2015; Lima et al., 2015).3.2.1 Englishman RiverThe headwaters of the Englishman River lie on the eastern slopes of the BeaufortRange and the watershed drains into Strait of Georgia (Salish Sea). It providescommunity water to the residents of the town of Parksville and surrounding area(where residential flooding is a common occurrence) and is a source of abundantsalmonid habitat. As in many coastal regions, the main part of inflow results fromseasonal storms and spring snowmelt. The heavy precipitation (mostly originat-ing from multi-day, synoptic-scale systems) occurs in winter (November throughMarch). The typical (but not exclusive) cause is the warm moist air aloft com-ing from North Pacific frontal flows. Summer has generally dry conditions withprecipitation events mainly from weak fronts or convective storms.This basin has a nival-supported pluvial system (Fleming et al., 2007), wheremost of the watershed area lies at relatively low elevation. Thus, it is predominantlyrainfall fed with a wintertime freshet corresponding to peak seasonal precipitation.However, a second minor springtime snowmelt freshet is not unusual in high-snowyears (e.g. La Nin˜a years) (Fleming et al., 2007; Fleming and Whitfield, 2010).The catchment drainage area is about 319 km2 (Fig. 3.1) and the measurementrecord used in our study is from 1985=01=28 to 2011=12=31. The mean annualdischarge is 13 m3=s.3.2.2 Stave RiverThe Stave River, a tributary of the Fraser River, starts in the Garibaldi ProvincialPark. It has two dams and streamflow forecasts of the studied hydrometric sta-tion (close to the head of Stave Lake) are important to BC Hydro (a provincialCrown corporation which serves as the primary generator and distributor of elec-trical power in British Columbia) for planning hydroelectric power scheduling andflood mitigation.46125°W 124.5°W 124°W 123.5°W 123°W 122.5°W 122°W49°N49.2°N49.4°N49.6°N49.8°N50°NStrait of GeorgiaVancouver IslandBritish Columbia(Canada)Figure 3.1: Study Area. The Englishman River Basin in Vancouver Island(red) and the Stave River Basin in southern British Columbia (green).The solid circles show the hydrometric stations run by Water Survey ofCanada, and the squares the hydro-meteorological stations. For StaveRiver (on the right) the circle (the hydrometric station) overlaps thesquare (the hydro-meteorological station).The basin is a mixed pluvial-nival system, exhibiting an annual snowmelt-driven streamflow maximum in early summer and rainfall-driven peaks in winter.Similar to Englishman River, precipitation events in the summer are from weakfronts or convective storms. The major source of precipitation in the Stave basincomes during November-February from southwesterly frontal flows. In addition,due to the abrupt rise of the Coast Mountains above the flat or rolling topographyof the Lower Fraser Valley, these frontal flows cause on average 53% of the annualprecipitation to fall during this 4-month period (Rasouli et al., 2012). Precipitationevents in the summer are generally from weak fronts or convective storms.The catchment drainage area is about 290 km2 (Fig. 3.1) and the measurementrecord used in our study is from 1985=01=01 to 2011=12=31. The mean annualdischarge is 34 m3=s.473.2.3 PredictorsThe meteorological data used for the Englishman River is from the Qualicum RFish Research station (Fig. 3.1) (ID 1026565, 49◦ 23′ 37′′N, 124◦ 37′ 02′′W, el-evation 7.60m), which can be found at the Historical Canadian Climate Database(http://climate.weather.gc.ca/index e.html). The annual means of the daily maxi-mum temperature and daily minimum temperature are 13.7◦C and 5.7◦C, respec-tively. The meteorological data used for Stave River is from the hydro-meteoro-logical station of B.C. Hydro. The station used was the Stave River above Lake(Fig. 3.1) (Code STA, 49◦ 33′ 22”N, 122◦ 19′ 19”W, elevation 330m). The annualmeans of the daily maximum temperature and daily minimum temperatures are11.8◦C and 4.2◦C, respectively.Reforecasts (or hindcasts), are retrospective numerical weather prediction fore-casts for many dates in the past, ideally conducted using the same forecast mod-els and same assimilation system used operationally. They can be used to studyatmospheric predictability and to distinguish between random and model errors.NOAA’s Second-Generation GEFS Reforecast dataset (Hamill et al., 2013) is gen-erated once daily, at 0000 UTC, with 10 perturbed forecast members and one con-trol forecast. The data is saved at 3-hourly intervals from 0 to 72h and every 6hthereafter. During the first 8 days of operational GEFS reforecast, the model is runat T254L42 resolution, which uses a quadratic Gaussian transform grid with anequivalent grid spacing of approximately 40 km at 40◦ latitude, and 42 levels. Thepool of 34 potential predictors for streamflow forecasting is listed in Table 3.1. Allpredictors are ensemble averages over the 11 forecast members.48Table 3.1: List of predictors. These are of two types, i.e. local observations and GEFS outputs. The GEFS outputswere selected at 0900 and 2100 UTC for day 1, 3300 and 4500 UTC for day 2, and 5700 and 6900 UTC for day3, hence the total number of potential predictors is 34. (The soil moisture content and soil temperature are for0.0–0.1-m depth).Variable Unit TypePrecipitation (t, t−1 and t−2) mm ObsMaximum temperature (t, t−1 and t−2) ◦C ObsMinimum temperature (t, t−1 and t−2) ◦C ObsStreamflow at day (t, t−1 and t−2) m3=s ObsMoving average of Precipitation mm (in last 7 days period) ObsMoving average of Maximum temperature ◦C (in last 7 days period) ObsMoving average of Minimum temperature ◦C (in last 7 days period) ObsMean sea level pressure Pa GEFSPrecipitable water mm GEFS2-m specific humidity kg/kg dry air GEFS2-m temperature ◦C GEFSVolumetric soil moisture content fraction between wilting and saturation GEFSSoil temperature ◦C GEFSWater runoff mm GEFSWater equivalent of accumulated snow depth mm GEFSWind speed m/s GEFSTotal precipitation mm (in last 24-h period) GEFS493.3 Online Sequential Extreme Learning Machine1The MLP neural network architecture is probably the most popular type of ANN(Maier et al., 2010; Abrahart et al., 2012; Wu et al., 2014) and consists of at leastone hidden layer sandwiched between the input and output layers. An MLP withone hidden layer can be defined as:yˆ j =L∑i=1β ih(wi ·x j+bi)+β 0; ( j = 1; : : : ;N) (3.1)where x j ∈ Rd and yˆ j ∈ Rm are the model input and output, respectively, N thenumber of cases, wi and bi the weight and bias parameters in the hidden layer(with L hidden nodes), β i and β 0 are the weight and bias parameters in the outputlayer, and h the activation function (we used the hyperbolic tangent function).Training of the ANN model involves adjusting the parameters iteratively so theerror between the model output yˆ and the predictand or target data y is minimized.The backpropagation algorithm is often used to calculate the gradient of the errorfunction, with a gradient-descent approach (often time consuming) used to reducethe mean squared error iteratively.Schmidt et al. (1992) proposed and Huang et al. (2006b) proved that the wiand bi parameters can be randomly assigned if the activation function is infinitelydifferentiable, so only the β parameters need to be optimized when minimizing themean squared error between the model output yˆ and the target data y. The β 0 termis not necessary for ELM (Huang, 2014). Thus, in the ELM approach, trainingan MLP with one hidden layer is equivalent to simply finding the least-squaressolution of the linear system (Huang et al., 2011), HB=Y, where the hidden layeroutput matrix H of dimension N×L isH=h(w1 ·x1+b1) · · · h(wL ·x1+bL).... . ....h(w1 ·xN +b1) · · · h(wL ·xN +bL) ; (3.2)the B parameter matrix of dimension L×m and the target data matrix Y of dimen-1The material in the first part of this section, before Sec. 3.3.1, is already covered in Sec. 1.5 and1.6.50sion N×m areB=β T1...β TL ; and Y=yT1...yTN ; (3.3)where superscript T denotes the transpose. Algebraically, the linear system for Bis solved via the Moore-Penrose generalized inverse H† (Huang et al., 2006b), i.e.Bˆ=H†Y with H† = PHT and P= (HTH)−1; (3.4)with P being the inverse of the covariance matrix of the hidden layer output.The batch ELM previously described assumes that all the training data (sam-ples) are available for training. To take advantage of real time information, theELM has to be retrained as new data arrive. Liang et al. (2006) proposed the OS-ELM to use the advantages of the ELM and update the weights automatically. TheOSELM can learn the new data chunk by chunk (with fixed or varying length) anddiscard the data for which the training has already been done. The OSELM al-gorithm can be divided into two phases: initialization phase (a small batch ELMdescribed in Algorithm 4) and online sequential learning phase.The necessary steps for implementing the initialization phase (k = 0) of theOSELM algorithm are shown in Algorithm 4, where the initial training dataset is{X0;Y0}= {x j;y j}N0j=1.1 Given a training set {X0;Y0} and L hidden nodes2 begin3 // k = 0;4 Randomly assign wi and bi, i= 1; · · · ;L;5 Calculate the hidden layer output matrix H0;6 Calculate the matrix P0 = (HT0H0)−1;7 Calculate the output weights Bˆ0 = P0HT0Y0;8 endAlgorithm 4: The initialization phase of the OSELMIn the initialization phase, the appropriate matrices H0, P0 and Bˆ0 are com-puted. The number of data points (N0) in this phase should be at least equal to51Figure 3.2: ELM architecture (shown with a single output variable).L, the number of HN. It is important to guarantee that the initial dataset {X0;Y0}contains enough information to infer the optimal number of HN. Figure 3.2 showsa schematic of the ELM architecture, which is analogous to the MLP architecture,except that MLP does not randomly assign wi and bi.Following the initialization phase, the online sequential learning phase com-mences. At the (k+ 1)th update a new chunk of data {Xk+1;Yk+1} is presented,and the new matrices Hk+1, Pk+1 and Bˆk+1 are calculated. The steps for the OS-ELM algorithm are shown in Algorithm 5, with I being the identity matrix. Formore details, see Liang et al. (2006) and Huang et al. (2011).3.3.1 Range of Weights and ELM PredictionsUsually the ELM’s model complexity is based on the number of HN only. Thus,only the number of HN is carefully selected, and a fixed variance or interval pa-rameter is used for the distribution from which the random weights are drawn. The521 begin2 Initialization Phase: ;3 Perform Algorithm 4;4 Online Sequential Learning Phase:;5 (k+1)th chunk of new data {Xk+1;Yk+1} arrives;6 Calculate Hk+1 using only the new data;7 Calculate the matrix Pk+1 = Pk−PkHTk+1(I+Hk+1PkHTk+1)−1Hk+1Pk;8 Calculate the output weights Bˆk+1 = Bˆk+Pk+1HTk+1(Yk+1−Hk+1Bˆk);9 Set k = k+1 and iterate;10 endAlgorithm 5: OSELMwidth of the interval is traditionally treated as simply a constant instead of a modelparameter. However, as weights are random variables, diversity (i.e. not havingsimilar values for all weights) is an important factor because model complexity isaffected by the variance of the distribution. Small variance means small weightsand linear models, while large variance can produce larger weights and nonlinearmodels. Following Chapter 2, we used a uniform distribution, spread over a fixedinterval determined by the number of predictors.3.3.2 Network StructureA balance must exist between generalization ability and network complexity. Find-ing an appropriate network structure for ANNs involves the selection of a suitablenumber of HN. Compared individually against ANN trained by backpropagationalgorithms, ELM requires a larger number of HN to compensate for the random-ness of the weights used in the HN (Chapter 2). However, even if ELM uses farmore HN than ANN trained by backpropagation, it does not imply ELM uses moreadjustable parameters than ANN. If a feedforward neural network has d inputs, onelayer of L HN and m outputs, the number of adjustable parameters is L for ELMand (d+1)L+(L+1)m for the MLP ANN. For example, with d = 34 and m= 1,L = 340 (Fig. 3.9) for ELM and L = 10 for ANN give 340 adjustable parametersfor ELM and 361 for ANN. The choice of the optimal number of HN is problemdependent, i.e., different phenomena will require distinct number of HN to build53a successful model. In general, the number of HN is selected empirically (Huanget al., 2006b; Feng et al., 2009). Thus, to estimate the number of HN of the OS-ELM, we adopted the approach used by Chapter 2 which automatically choosesthe number of HN by the hill climbing algorithm.3.4 Experimental ResultsTo demonstrate the practical use of the OSELM as an online prediction method,experiments were performed on the datasets described in Sec. 3.2. The fact thatML can adapt to different situations is important, but the potential demand for suchmodelling complexities remains an open question (Abrahart et al., 2008). Thus thecomparison of the results against a simple linear model or persistence is helpful forestablishing some minimum standards (Abrahart et al., 2012). We used the onlinesequential multiple linear regression (OSMLR) as a benchmark. Since ELM solvesHB= Y, while MLR solves XB= Y (with B containing a β 0 term), our OSMLRalgortihm is basically the same as our OSELM except that H is replaced by X inAlgorithms 4 and 5.The experiments were divided into three parts. First, a comparison was madebetween the models using two prescreening techniques (one linear and one non-linear) to check if it is possible to make the modelling process more accurate byremoving predictors that are irrelevant or redundant. Second, to compare how ad-vantageous is the use of observations up to the time of the forecast, an experimentinvolving different frequency of updates (daily, monthly, yearly) of the models wasperformed. Finally, to check how dependent the online sequential learning phaseis on the initialization phase (including the automated procedure used to find thenumber of HN) and if the models would have similar skills when developed withsmaller datasets in the initialization phase, a separate set of experiments involvingthree different sizes of the initial dataset was performed.Applying more than one diagnostic metric is also helpful to establish minimumstandards for model inter-comparison (Abrahart et al., 2012). To measure the per-formance of OSELM relative to the OSMLR, we used the MAE, the RMSE andthe Nash-Sutcliffe model efficiency coefficient (NSE).For flood forecasting purposes, ML solutions and generic packages offering54practical advantages related to operational costs and socio-economic resources(less computational requirements than numerical models, rapid development pro-cess and open source code) would be of potential interest in developing countries(Abrahart et al., 2008; Shamseldin, 2010). All our prediction models were built us-ing the free R software environment for statistical computing (R Core Team, 2014).OSMLR, using the R package stats and R native libraries, did stepwise selec-tion of predictors (Venables and Ripley, 2002) based on the Akaike informationcriterion (AIC). OSELM uses the R package MASS (Venables and Ripley, 2002)and R native libraries. Our OSMLR and OSELM codes are freely available fromthe project website: http://forai.r-forge.r-project.org/. For Boruta (Section 3.4.1) weused the R package Boruta (Kursa and Rudnicki, 2010).As the predictand has a right-skewed distribution, the logarithm transforma-tion was applied first to improve the regression results (Cawley et al., 2007). Someresults of the OSMLR were unstable due the non-normality of some predictors,hence, for OSMLR, the logarithm transformation was also applied to some of thepredictors (e.g. precipitation, water runoff, water equivalent of accumulated snowdepth, and total precipitation) to normalize their distribution, which stabilized theresults. Only days not containing missing values were considered (9540 days forEnglishman and 9619 days for Stave). As part of the ML development process, theavailable data, standardized to zero mean and unit variance, were generally dividedinto training and testing subsets. The training set was used to build the ML modelsand the testing set to determine the generalization ability of the trained models.To emulate a scenario of real-time data, we used the initial five years (1985-1989)for the initialization phase model development (including finding appropriate pa-rameters). The experiments were designed to be fully automated, i.e., no humanintervention was necessary to tune the model parameters. The OSELM with hillclimbing performed its own 5-fold cross-validation when choosing the best param-eter (number of HN) (Chapter 2). The test set was kept completely unseen bythe model during the training process (initialization phase). The sliding retroactivevalidation was then performed (1990-2011) as an emulation of real-time data avail-ability, i.e. we updated the model daily (except in Section 3.4.2 where the modelswere also updated monthly and yearly) then used the model forecasts for testing(i.e. verification). Thirty different trials were performed with different seeds for55the random number generator. As an individual OSELM may not adapt well tothe new data, Lan et al. (2009) proposed an ensemble of OSELM (EOS-ELM) toimprove the stability of OSELM. The ensemble was generated by running the OS-ELM multiple times (with different random wi and bi) and the outputs from theindividual ensemble members were averaged. For each trial, we used an ensembleof 30 members and the averaged output of the 30 individual members became theoutput for that trial – i.e. a total of 30 trials × 30 members = 900 OSELM runs.For the linear model (OSMLR) there is no need for multiple trials and multipleensemble members.3.4.1 Predictor SelectionOne of the most important steps in the ML modelling process is the determinationof an appropriate set of predictors (Cherkassky et al., 2006; Solomatine and Ost-feld, 2008; Maier et al., 2010). ML algorithms may produce less accurate and lessunderstandable results if the data are inadequate or contain irrelevant or redundantinformation (Hall and Smith, 1996). Maier et al. (2010) pointed that the majorityof works using ANNs (which are generally used in preference to linear modellingapproaches), used linear methods for selecting predictors. In cases where input-output relationships are suspected to be highly non-linear (which is often the casein hydrology problems), nonlinear approaches should be more appropriate for de-termining inputs to ANN models.To briefly compare the linear versus nonlinear approaches for input selection,we tested one linear approach, namely stepwise linear regression (SLR) (Hocking,1976) and one nonlinear approach, namely Boruta (Kursa and Rudnicki, 2010).The Boruta approach is based on the well known ML method RF (Breiman, 2001).Each technique used the training data to select the relevant predictors from the fullset of predictors listed in Table 3.1. The selected predictors were used to train theOSMLR and the OSELM. The three models are OSMLR using stepwise selec-tion of predictors (OSMLR-S), OSELM using stepwise (OSELM-S), and OSELMusing Boruta (OSELM-B). The development of these models would help in de-termining the effects of input information on model performance. As benchmarksthe OSMLR using all predictors (OSMLR-A) and the OSELM using all predictors56(OSELM-A) were also tested. Figure 3.3 shows the boxplot of the MAE of the30 trials for each dataset and forecast lead times. In addition, Table C.1 (in theAppendix) shows the mean and standard deviation of MAE, RMSE, and NSE foreach dataset.For Englishman at forecast lead times of 1, 2 and 3 days the SLR selected24, 21 and 21 predictors respectively. The Boruta procedure always selected thecomplete set of 34 predictors in all trials for all tested forecast lead times. ForStave at forecast lead times of 1, 2 and 3 days the SLR selected 25, 24 and 20predictors respectively, while Boruta selected between 32 and 34 predictors for allthe tested forecast lead times (the number varies with the trial). In all cases thenonlinear model had better skills than the linear model. The objective of SLR is tofind redundant predictors and select the smallest set of predictors giving the bestregression results. The objective of Boruta is to identify all predictors which arein some circumstances relevant for the regression. Thus, Boruta tends to find allrelevant predictors, instead of only the non-redundant ones. Therefore the largenumber of predictors selected by Boruta was not a surprise.The aggressive reduction of predictors by SLR affected the MAE of the OS-MLR-S, where it lost in most of the cases (although sometimes not by much)against the OSMLR-A (Fig. 3.3). The use of the predictors selected by SLR alsodid not help OSELM-S; in some cases, mainly in day 3, its performance was down-graded relative to OSELM-A. There is little difference between OSELM-B andOSELM-A because Boruta most of the time selected all the predictors. Thus, theuse of all predictors seemed appropriate for leading the models to good results.Since the focus of this research is not on the comparison between prescreeningtechniques, we will not give emphasis in the identification of the optimal predic-tors for each particular dataset and forecast lead time. From now on we will simplyuse all 34 predictors for OSMLR and OSELM.3.4.2 Frequency of UpdatesIn forecasting applications, it is possible to improve the model outputs by usingobservations up to the time of the forecast to update the model (Sene, 2010). Wetested different frequencies of updates of the model to check how the use of the57new observations affects the results of the OSELM forecast. From 1990 to 2011,we used sliding retroactive validation as an emulation of real-data availability, i.e.we updated the models daily, monthly or yearly, then used the model forecasts fortesting (i.e. verification).Figure 3.4 shows the boxplot of the MAE of the 30 trials for each datasetand forecast lead times. Table C.2 in the Appendix shows the mean and standarddeviation of MAE, RMSE, and NSE for each dataset. Supplementary Figs. C.1 andC.2 show the scatter plot of the observed and predicted streamflows at Englishmanand Stave respectively.Again, all nonlinear models had better MAE, RMSE and NSE than the linearmodels. As expected, there was improvement in the OSELM forecasts if the modelwas updated frequently with the new information. However for the OSMLR theimprovement was not clear. For 1-day lead time in Englishman River (Table C.2)the OSMLR-Daily had slightly worse RMSE than the OSMLR-Yearly. Empiricalmodels perform better if all events in the test data are contained within the range ofthe training data. Thus they may fail to adequately predict streamflow in extremeor unusual years. It is not unusual for the OSMLR-Daily to heavily update itsweights according to a particular new event that has never happened before, whichcan lead to overfitting if the event is very atypical. OSMLR-Monthly and OSMLR-Yearly both have more data to update its weights than the OSMLR-Daily, so theimportance of the new event will be smoothed out and the predicted values moreconservative. As MAE is more resistant to outliers than RMSE, OSMLR-Daily didnot perform worse than OSMLR-Yearly in Figure 3.4.For OSELM, shorter duration updates were more skilled than longer ones onall tested cases for both datasets. The skill differences are more apparent whencomparing daily update with yearly update. Figure 3.5 shows the time series resid-uals from an OSELM-Daily model and an OSELM-Yearly model. Both modelshave the same number of HN and same initial weights in the training phase, hencethe only difference between them is the frequency of updates. Figs. 3.5a and 3.5bappeared very similar, however, plotting their difference (Fig. 3.5c) reveals notabledifferences between them, especially near the beginning of the time series. Largerdifferences between OSELM-Daily and OSELM-Yearly tended to occur when theresidual values were large. After 10 to 11 years, novel events (events not seen be-58fore) became rare, hence the difference between updating daily and updating yearlybecame small (except at the end of the time series).ANNs, like all empirical models, perform best when they are not used to ex-trapolate beyond the range of the predictors in the training data, To check if shorterfrequency updating has any advantage when forecasting extreme events, we calcu-lated theMAE and RMSE using only the observed values above the 90th percentile.Figure 3.6 shows that for Englishman the OSELM-Daily has better MAE in fore-casting extremes than OSELM-Monthly and OSELM-Yearly. Even for RMSE,which is more sensitive to outliers and extremes than MAE, OSELM-Daily outper-formed OSELM-Yearly. Similarly, results were found at Stave.3.4.3 Training SizeML algorithms can be less accurate when they are developed with smaller train-ing datasets. Dawson (2008) showed that limiting the available data can have anadverse effect on the accuracy of ANNs. However, as explained in Section 3.1, itis not always possible to get a large amount of data in advance to train the model- often in reality data arrive continually after the model has been trained. Conse-quently, a general goal is to use an adequate amount of training data containingenough statistical information for building the initial model. For the OSELM, oncethe number of HN has been chosen in the initialization phase, it cannot be changed,and is fixed until the end. Thus the training data in the initialization phase shouldcontain enough statistical information to select the appropriate number of HN. Foran operational weather prediction model combined with model output statisticsone would have to wait at least two years after an implementation/change of themodel to obtain statistically stable equations using variables from the new numer-ical weather model (Wilson and Valle´e, 2002). In the previous experiments onlythe online sequential learning phase was considered as we used a fixed amount ofinitialization data (5 years) to train the models and to choose the parameters of themodels. In some practical situations, one year of data or less may be all that isavailable for the initial training.To check how dependent OSELM is on the initial data and to emulate a scenarioof real-time data, we used three different sizes of the initial training set to find the59model parameters and weights - 1 year (1989), 3 years (1987-1989), and 5 years(1985-1989). The test set (1990-2011) was the same for all three different initialtraining datasets. Figure 3.7a and 3.7b shows the time series of Englishman andStave respectively. The color lines mark the division between the different trainingsets and the test set. Table 3.2 shows the specifications of the datasets and the basicdescriptive statistics of the streamflow in the training and testing sets.All the models were updated daily using the sliding retroactive approach. Inmost cases, more data in the initial training set generated better prediction scores(Fig. 3.8 and Table C.3), though there was no significant difference between ini-tializing with 1 and 3 years of data for Englishman’s day 3 forecasts and for Stave’sday 2 and day 3 forecasts. Supplementary Figs. C.3 and C.4 show the scatter plotof the observed and predicted streamflows at Englishman and Stave respectively.603456Day 1 Day 2 Day 3MAE (m3s)OSMLR−SOSMLR−AOSELM−SOSELM−BOSELM−A(a) Englishman681012Day 1 Day 2 Day 3MAE (m3s)OSMLR−SOSMLR−AOSELM−SOSELM−BOSELM−A(b) StaveFigure 3.3: Comparison between prescreening techniques at (a) Englishmanand (b) Stave, with “waistline” of the boxplot showing the median MAEof the 30 trials calculated over 1990-2011 for each dataset and forecastlead time. The 25th and 75th percentiles are shown, respectively, asthe bottom and top of each box. Distance between the 25th and 75thpercentiles is the interquartile range (IQR). The upper whisker extendsfrom the upper hinge to the highest value that is within 1.5 IQR of thehinge. The lower whisker extends from the lower hinge to the lowestvalue within 1.5 IQR of the hinge. Data beyond the end of the whiskersare considered outliers and are plotted as points. Day 1, day 2 and day 3are the forecast lead times. The five models are OSMLR using stepwise(OSMLR-S), OSMLR using all predictors (OSMLR-A), OSELM usingstepwise (OSELM-S), OSELM using Boruta (OSELM-B) and OSELMusing all predictors (OSELM-A). The boxes for the two OSMLRmodelscollapsed to two horizontal lines as multiple trials are not needed.613456Day 1 Day 2 Day 3MAE (m3s)OSMLR−DailyOSMLR−MonthlyOSMLR−YearlyOSELM−DailyOSELM−MonthlyOSELM−Yearly(a) Englishman681012Day 1 Day 2 Day 3MAE (m3s)OSMLR−DailyOSMLR−MonthlyOSMLR−YearlyOSELM−DailyOSELM−MonthlyOSELM−Yearly(b) StaveFigure 3.4: Comparison between different frequencies of model updates for(a) Englishman and (b) Stave. The boxplot displays the MAE foreach dataset and forecast lead time. The six models are OSMLR up-dated daily (OSMLR-Daily), monthly (OSMLR-Monthly), and yearly(OSMLR-Yearly), and the OSELM updated daily (OSELM-Daily),monthly (OSELM-Monthly) and yearly (OSELM-Yearly).621990 1995 2000 2005 2010−100050150(a) Daily Update Residual − (MAE:2.34)m3s1990 1995 2000 2005 2010−100050150(b) Yearly Update Residual − (MAE: 2.41 )m3s1990 1995 2000 2005 2010−30−10010(c) Daily Update Residual−Yearly Update Residualm3sFigure 3.5: Residual times series from (a) an OSELM-Daily model, (b) anOSELM-Yearly model, and (c) their difference.631520253035Day 1 Day 2 Day 3MAE (m3s)OSMLR−DailyOSMLR−MonthlyOSMLR−YearlyOSELM−DailyOSELM−MonthlyOSELM−Yearly(a)MAE304050Day 1 Day 2 Day 3RMSE (m3s)OSMLR−DailyOSMLR−MonthlyOSMLR−YearlyOSELM−DailyOSELM−MonthlyOSELM−Yearly(b) RMSEFigure 3.6: Comparison of (a) MAE and (b) RMSE for values above the90th percentile for different frequency updates at Englishman. Box-plots for each forecast lead time are shown for six models OSMLR up-dated daily (OSMLR-Daily), monthly (OSMLR-Monthly) and yearly(OSMLR-Yearly), and the OSELM updated daily (OSELM-Daily),monthly (OSELM-Monthly) and yearly (OSELM-Yearly).641985 1990 1995 2000 2005 201001 002 50(a) EnglishmanF lo w ( m3s )1985 1990 1995 2000 2005 201002 004 00(b) StaveF lo w ( m3s )Figure 3.7: Times series (training and test set) of (a) ENG and (b) STA. Theblue line marks the beginning of 1987, the green line marks the begin-ning of 1989 and the red line marks the beginning of 1990.65Table 3.2: Statistics of the observed streamflow (m3=s) in the training and testing datasets.EnglishmanpercentilesMean Std. dev. Median Min Max 5th 10th 90th 95thTrain - 1 Year 9.9 12.8 6.1 0.3 119 0.3 0.6 22.0 32.6Train - 3 Year 10.3 14.3 6.4 0.3 160 0.3 0.4 24.4 35.1Train - 5 Year 10.9 18.2 5.9 0.3 259 0.3 0.4 25.1 37.9Test 13.7 23.7 7.2 0.1 310 0.6 1.1 29.6 50.4StavepercentilesMean Std. dev. Median Min Max 5th 10th 90th 95thTrain - 1 Year 31.6 26.5 25.4 4.3 177 5.4 6.5 67.6 86.7Train - 3 Year 32.4 30.0 23.1 3.2 262 4.8 5.9 71.7 87.8Train - 5 Year 32.2 31.1 22.4 1.8 264 4.5 5.9 71.7 88.2Test 34.2 34.3 23.2 2.6 553 5.4 6.8 73.7 91.866Figure 3.9 shows the number of HN selected by the hill climbing. At English-man, when the MAE scores were similar between two different initial training datasizes (Fig. 3.8a), the numbers of HN were also similar (Fig. 3.9a) – as seen for day2 forecasts between 3 and 5 years of initial data and for day 3 forecasts between 1and 3 years of data. The same situation occurs at Stave, where for day 2 and day3 forecasts using 1 and 3 years of initial data yielded similar MAE (Fig. 3.8b) andsimilar number of HN (Fig. 3.9b).To check the relation between the number of HN and the initial training size,a final experiment was made using a fixed number of HN with the Englishmandataset. Instead of using hill climbing to select the number of HN automatically foreach training size, we used the same number of HN in the runs with three differentsizes of the initial training set. We chose the number of HN to be the averaged valueover 30 trials selected by the hill climbing when initiated with 5 years of trainingdata. At Englishman the numbers of HN chosen were 218, 178, and 248 for fore-cast lead times of 1-3 days respectively. Figure 3.10 shows the MAE boxplots fromthe runs with the fixed number of HN, where for Englishman (Fig. 3.10a), the 5year-OSELM had the best results followed by 3 year-OSELM and 1 year-OSELM.Comparing Fig. 3.10a with Fig. 3.8a, there were some improvements in the 1 year-OSELM and 3 years-OSELM results. For the 1 year-OSELM the number of HNchosen by hill climbing was much lower (Fig. 3.9a) than the fixed number of 218HN used in Fig. 3.10. With only one year of initialization data, the model wouldnot choose enough HN to learn the structure of interannual variability, which waspresent when using 5 years of initialization data. Since the number of HN wasfixed during the online sequential learning phase, not having enough HN to modelthe structure of interannual variability could lead to sub-optimal forecasts over thetest data. However for Stave (Fig. 3.10b), using 3 or 5 years of initialization datadid not improve on using only 1 year of data – this may be because 3 and 5 yearsare still short to learn the interannual variability accurately.3.5 Summary and ConclusionNonlinear machine methods such as ANN have been widely used in environmentalforecasting. One disadvantage of ANN relative to a linear method such as MLR67345Day 1 Day 2 Day 3MAE (m3s)1 year−OSELM3 years−OSELM5 years−OSELM(a) Englishman5678910Day 1 Day 2 Day 3MAE (m3s)1 year−OSELM3 years−OSELM5 years−OSELM(b) StaveFigure 3.8: Comparison of the MAE between different initial training datasizes for (a) Englishman and (b) Stave. The boxplot shows the MAEfrom the 30 trials for each dataset and forecast lead time. The threemodels are OSELM-Daily trained during the initialization phase with 1,3 and 5 years of data respectively.is that its solution requires computationally expensive nonlinear optimization, incontrast to the simple linear least squares solution in MLR. The nonlinear ELMmethod, by assigning random weights in the hidden layer of a single hidden layerfeedforward neural network, reduces the problem to linear least squares as in MLR.After testing ELM and ANN over nine different environmental datasets using batchtraining, Chapter 2 concluded the two methods had similar prediction skills, butELMwas generally much faster than ANN, except when dealing with large datasetscontaining many predictors.In many real world situations, new data arrive continually. To have the best68100150200250300Day 1 Day 2 Day 3HN1 year−OSELM3 years−OSELM5 years−OSELM(a) Englishman100200300Day 1 Day 2 Day 3HN1 year−OSELM3 years−OSELM5 years−OSELM(b) StaveFigure 3.9: Boxplot of the number of HN selected in the 30 trials at (a) En-glishman and (b) Stave. Due to discrete nature of the number of HN,the median could overlap with the upper or lower quantile.forecast skills, it is necessary to update the model frequently using the new data.For ANN, frequent updating can be cumbersome and computationally costly, mak-ing it impractical to use ANN on applications such as post-processing the verylarge number of variables from numerical weather prediction models. In contrast,updating MLR via online sequential learning is straightforward. Similarly, onlinesequential updating of ELM (OSELM) is quite straightforward as ELM is basicallysolved like MLR. With OSELM, the updates involve only small datasets contain-ing the newly arrived data, so the case of ELM running slow (when dealing withlarge datasets with large number of predictors) found in batch training (Chapter 2)does not arise for OSELM.6934Day 1 Day 2 Day 3MAE (m3s)1 year−OSELM3 years−OSELM5 years−OSELM(a) Englishman56789Day 1 Day 2 Day 3MAE (m3s)1 year−OSELM3 years−OSELM5 years−OSELM(b) StaveFigure 3.10: Comparison of MAE using a fixed number of HN for differentinitial training sizes at (a) Englishman and (b) Stave.In this paper, we compared the two online sequential learning methods, OS-ELM and OSMLR, in forecasting the daily streamflow at two hydrometric stationsin British Columbia, Canada, at forecast lead times of one to three days. As pre-dictors, we used local hydro-meteorological observations and the reforecast datasetgenerated by the GEFS from NOAA. The nonlinear model (OSELM) had superiorforecast skills compared to the linear model (OSMLR). Hence, if the underlyingpredictor-predictand relation is nonlinear, and frequent model updates are neededto use the continually arriving new data, OSELM is an excellent prediction methodnot only for hydrological systems but also for other environmental sciences.We compared the results from updating daily, monthly and yearly. As ex-pected, updating OSELM frequently with new observations improved the forecast.70However as the number of novel events became rare after 10 years, the differencebetween the models updating daily and yearly eventually became small. For highstreamflow events (in the 90th percentile), higher frequency updating in OSELMgave better forecast skills.Generally, having more data in the initial training dataset tended to give betterprediction. Larger initial datasets helped to find better parameters (number of HN)for the OSELM. Based on Parviainen and Riihimaki (2013) and Chapter 2, therange of the random weights in ELM could also be considered a parameter forachieving the best results, so further work on how to optimally scale the weightrange for ELM is warranted.The OSELM model does not get feedback from the previous errors becauseit has a feedforward architecture. Recurrent neural networks (RNNs), where theoutput depends on previous model states, have been found to perform better thanfeedforward networks for streamflow forecasting (Kumar et al., 2004). Echo statenetworks (ESN) are a recent addition to the field of RNNs (Lukoeviius and Jaeger,2009). The HN in an ESN architecture features recurrent connections, giving ita short-term memory, whereas ELMs, being a feedforward architecture, have noshort-term memory. As in ELM, only the output weights are trained in the ESNmodel. Therefore, ELM can be viewed as a kind of memoryless ESN with fewerparameters to be adjusted. Butcher et al. (2013) combined an ESN with two ELM-inspired hidden layers. They tested the novel approach against a time delay ELMwhich consists of a memoryless feedforward ANN, where only the output weightsare trained, coupled with a moving window of the input data giving it a memoryas long as the length of the window. They found that the ELM had better accuracywhen compared to the ESN approaches. However, it is still unclear in what charac-teristics of the network and/or problem domains they perform well. Thus, furtherwork is still required as a recurrent approach is expected to outperform an ELMwhen applied to data that require a long fading memory.One disadvantage of OSELM is that after the initialization phase, the numberof HN is fixed. As new data arrive, and information on longer-term variability (e.g.interannual variability) becomes available, the fixed number of HN will be too fewto model the longer term variability and the model becomes sub-optimal. Futureresearch needs to allow the model complexity (i.e. the number of HN) to change71during the online sequential learning phase.72Chapter 4Variable complexity onlinesequential extreme learningmachine4.1 IntroductionANN have been widely studied and used in various fields (Haykin, 1998; Bishop,2006; Hsieh, 2009). The MLP architecture is probably the most popular type ofANN (Haykin, 1998; Maier et al., 2010; Abrahart et al., 2012) and consists of atleast one hidden layer of nodes sandwiched between the input and output layers.Training of the MLP model involves adjusting the weights or parameters iterativelyso the error between the model output and the target (or predictand) data is mini-mized. The backpropagation algorithm is often used to calculate the gradient of theerror function, with a gradient-descent approach used to reduce the mean squarederror iteratively.The ELM randomly chooses the weights in the hidden nodes or neurons (HN)of an MLP network with one hidden layer, and analytically determines the weightsat the output layer (Schmidt et al., 1992; Huang et al., 2006b). This allows the ELMmodel to be solved by simple linear least squares optimization instead of the non-linear optimization needed for MLP. Since its introduction, ELM has been success-73fully used in different research areas (Huang et al., 2011; Lima et al., 2015; Huanget al., 2015; Lima et al., 2016), achieving good generalization performance withlearning times that are often thousands of times faster than traditional gradient-based MLP training algorithms.Correct adjustment of the model parameters is mandatory for building a suc-cessful predictive model. Based on the arguments from Huang et al. (2006b), forELM the only parameter that needs to be specified is the number of HN (Parvi-ainen and Riihimaki (2013) and Chapter 2 suggested that the range of the randomweights in ELM should also be considered a parameter). The optimal ELM’s net-work structure should have a balance between generalization ability and networkcomplexity. If network complexity is too low (i.e. too few HN), the network mightbe unable to capture the true nonlinear relationship. If network complexity is toohigh, the network might have decreased generalization ability due to overfitting(i.e. fitting to noise in the data) and increased training time. Compared individu-ally against ANN trained by backpropagation algorithms, ELM requires a largernumber of HN to compensate for the randomness of the weights used in the HN.The optimal number of HN is problem dependent and unknown in advance. Theyare usually determined by time-consuming trial and error. This approach, althoughstraightforward, does not guarantee that the selected number of HN will be closeto optimal or will generalize well (Zhang et al., 2013).To overcome this problem some approaches (referred to as constructive algo-rithms) such as incremental ELM (I-ELM) (Huang et al., 2006a), convex I-ELM(CI-ELM) (Huang and Chen, 2007), and enhanced I-ELM (EI-ELM) (Huang andChen, 2008) have been proposed. The main idea of constructive algorithms is tostart with a small initial network and gradually add new HN until a satisfactorysolution is found. Unfortunately, the HN are added only one by one. To solvethis limitation, Feng et al. (2009) proposed the error-minimized ELM (EM-ELM)which adds random HN to the existing ELM one by one or group by group (withvarying size). It also updates incrementally the output weights during the networkgrowth, reducing the computational complexity when compared with ELM and I-ELM. However, in the implementation of all the above referred constructive ELMs,the number of HN monotonically increases with the learning progress.Zhang et al. (2012) proposed an ELM with adaptive growth of HN (AG-ELM)74to determine the network architecture (number of HN) in an adaptive way. Existingnetworks may be replaced by newly generated networks which have fewer HN andbetter generalization performance rather than always keeping those existing onesin other I-ELMs. Later, Zhang et al. (2013) proposed the dynamic ELM (D-ELM)which is the combination of the AG-ELM and EM-ELM. It can be viewed as animprovement of AG-ELM because the output weights are updated incrementallyby the error-minimization-based method rather than randomly generated.ELM and its above improved algorithms use a batch-learning training approach,i.e., whenever new data are received, batch learning uses the past data together withthe new data to retrain the model. Consequently, the computer time needed for re-training could be substantial (Liang et al., 2006). In many real world applications,learning has to be an ongoing process since new data arrive continually. Thus,in practical applications, online sequential learning algorithms are preferred overbatch learning algorithms as online sequential learning algorithms do not requireexpensive retraining using the complete dataset whenever new data are received(Liang et al., 2006).ELM has an online sequential version called OSELM (Liang et al., 2006). TheOSELM can be automatically updated as new data arrive, either with a single da-tum or a chunk of data. In addition, the data already used by the model can bediscarded. Lan et al. (2009) proposed an ensemble of OSELM (EOSELM) to im-prove the stability of OSELM. The EOSELM averages the output of multiple OS-ELM with the same number of HN and the same activation function for each HN.However, the number of HN used in the initial model cannot be changed as onlinelearning proceeds in OSELM.One would like to use an adequate amount of training data containing enoughstatistical information for building the initial model. However, it is not always pos-sible to get a large amount of data in advance to train the model – often data arrivecontinually after the initial model has been trained. In the fields of hydrology andatmospheric sciences, the addition of new observational stations and changes in thenumerical weather prediction (NWP) model are examples of situations where it isdifficult (or even impossible) to obtain large amounts of data in advance. For ex-ample, in streamflow prediction using a machine learning model, it is common touse NWP model forecasts as predictors (inputs). If the NWP model is changed, the75statistical relation among the predictors is also likely to change. If a new station isadded, depending on the studied phenomena, the measured hydro-meteorologicalvariables could not have stable statistical relationships in years or even decades.For temperature and precipitation, normally at least two seasons of homogeneousdata (about 300 cases or so) are required for stable statistical relationships to de-velop. Thus, traditionally, one would then have to wait at least two years to acquirethe two seasons of homogeneous data (Wilson and Valle´e, 2002). Also, in the hy-drological context, as new data arrive, and information on longer-term variability(e.g. interannual variability) becomes available, the fixed number of HN will betoo few to model the longer term variability and the model becomes sub-optimal.Therefore, there is a need for algorithms allowing model complexity (i.e. the num-ber of HN) to change during the online sequential learning phase.In this paper, we proposed an algorithm called variable complexity online se-quential extreme learning machine (VC-OSELM). The VC-OSELM dynamicallyadds or removes the hidden nodes in the OSELM according to their significance tonetwork performance. Therefore, not only the weights can be adjusted but also thecomplexity (number of HN) can be self-adapted simultaneously.The rest of this paper is organized as follows. Section 4.2 gives a brief reviewof ELM and OSELM. The proposed algorithm is described in Section 4.3. Resultsand discussion of the experiments are given in Section 4.4, followed by summaryand conclusion in Section 4.5.4.2 Extreme Learning Machines1An MLP with one hidden layer is typically described by the nonlinear mappingyˆ j =L∑i=1β ih(wi ·x j+bi)+β 0; ( j = 1; : : : ;N) (4.1)where x j ∈ Rd and yˆ j ∈ Rm are the model input and output, respectively, N thenumber of cases or observations, wi and bi the weight and bias parameters in thehidden layer (with L HNs), β i and β 0 the weight and bias parameters in the output1The material in the first part of this section, up to Eq. (4.6), is already covered in Sec. 1.5 and1.6.76layer, and h the activation function (with h chosen to be the hyperbolic tangentfunction in this paper).Nonlinear optimization is needed in MLP to find the model parameters thatminimize the MSE between yˆ (output) and y (target). Nonlinear optimization iscomputationally expensive and has multiple local minima.Schmidt et al. (1992) proposed and Huang et al. (2006b) proved that the wiand bi parameters can be randomly assigned if the activation function is infinitelydifferentiable, so only the β parameters need to be optimized when minimizing theMSE. Hence, in the ELM approach, training an MLP with one hidden layer givenby Eq. (4.1) is equivalent to simply finding the least-squares solution of the linearsystem HB = Y (Huang et al., 2011), where the hidden layer output matrix H ofdimension N×L isH=h(w1 ·x1+b1) · · · h(wL ·x1+bL).... . ....h(w1 ·xN +b1) · · · h(wL ·xN +bL) ; (4.2)and the B parameter matrix of dimension L×m and the target data matrix Y ofdimension N×m areB=β T1...β TL ; and Y=yT1...yTN ; (4.3)with superscript T denoting the transpose. The β 0 term is omitted here as it notnecessary for ELM according to Huang (2014). The linear system for B is solvedvia the Moore-Penrose generalized inverse H† (Huang et al., 2006b), i.e.Bˆ=H†Y; with H† = PHT and P= (HTH)−1; (4.4)with P being the inverse covariance matrix of the hidden layer outputs. This is thebatch version of the ELM. Thus the MLP model requiring nonlinear optimizationis simplified in ELM to just solving a linear least squares problem as in multiplelinear regression.77In many applications, data arrive continually, so Liang et al. (2006) proposedthe OSELM, which inexpensively updates the model weights using the newly ar-rived data. The OSELM algorithm can be divided into two phases: an initializationphase (a batch ELM trained with some initial data) followed by the online sequen-tial learning phase where the model is frequently updated with the newly arriveddata.In the initialization phase, the initial training dataset {X0;Y0}= {x j;y j}Nj=1 isused to generate H0, B0 and P0 using Eqs. (4.2) and (4.4). As the number of HNis limited to no more than N, the number of data points available during the initial-ization phase, in the absence of regularization this limits the model complexity ofthe subsequent OSELM, even when more data become available later.During step k+1 of the online sequential learning phase (k= 0;1; : : :), the (k+1)th chunk of new data {Xk+1;Yk+1} becomes available, and Hk+1 is calculatedusing only the new data. From Liang et al. (2006),Pk+1 = Pk−PkHTk+1(I+Hk+1PkHTk+1)−1Hk+1Pk; and (4.5)Bˆk+1 = Bˆk+Pk+1HTk+1(Yk+1−Hk+1Bˆk); (4.6)with I being the identity matrix.The EOSELM proposed by Lan et al. (2009) builds an ensemble of OSELMnetworks with the same number of HN and the same activation function for eachHN,yˆ(e)j =L∑i=1β (e)i h(w(e)i ·x j+b(e)i); ( j = 1; : : : ;N); (e= 1; : : : ;E); (4.7)where superscript e indicates the particular ensemble member. The final output ofthe EOSELM is given by:yˆ j =1EE∑e=1yˆ(e)j : (4.8)When E = 1 the EOSELM is the same as the OSELM proposed by Liang et al.(2006).784.3 VC-OSELMSince OSELM is limited to having a fixed model complexity determined by the ini-tialization data, we propose the variable complexity OSELM (VC-OSELM) whichallows the model complexity to vary as new data arrive. To achieve this, the modelneeds to be able to add or subtract HN.4.3.1 Adding HNTo add δL HN with random weights, we follow Feng et al. (2009), where the newenlarged H′ matrix is composed of the original H matrix plus δH (containing δLadditional columns from the newly added HN), i.e.H′ =[H δH]; (4.9)and the new P′ matrix is given byP′ =(H′TH′)−1=([HTδHT][H δH])−1: (4.10)Following Feng et al. (2009),P′ =[A11 A12A21 A22]; where (4.11)A11 = P+PHTδHR−1δHTHP;A12 = −PHTδHR−1;A21 = −R−1δHTHP;A22 = R−1; (4.12)with R = δHTδH−δHTHPHTδHT. After new HN have been added, we simplyreplace P and H on the right hand side of Eqs. (4.5) and (4.6) by P′ and H′, addδL rows of zeros to Bˆk in Eq. (4.6), and continue running OSELM. Using (4.12) tocalculate P′ will be referred to as scheme A.79Some approximations can be made to simplify scheme A. Assuming HTδHnegligible (i.e. orthogonality between the column vectors in H and δH) simplifiesP′ toP′ =[P 00(δHTδH)−1]: (4.13)Further assuming orthogonality among the vectors in δH renders(δHTδH)−1 di-agonal in (4.13). Even when the orthogonality assumptions are invalid, we candefine a scheme B to approximate P′ byP′ =[P 00 diag((δHTδH)−1)]; (4.14)where diag(...) means the off-diagonal elements are set to 0. While scheme B isa crude approximation of scheme A, we found it to be numerically more stablewhen we are updating the model with relatively small numbers of new data points(compared to the number of HN).4.3.2 Deleting HNNext, we need to be able to delete HN. To simplify the notation, instead of multipleoutput variables in Eq. (4.1), we will consider for now only a single model outputyˆ=L∑i=1βihi; (4.15)where hi = h(wi · x+ bi). If one HN is to be deleted, we will choose the leasteffective one based on some criterion. If the input variables are normalized orstandardized, we can delete the lth HN where its corresponding βl has the smallestmagnitude in {βi}. To adjust the P matrix, we compute the covariance matrixK = P−1, where K = HTH, then delete the lth row and lth column from K, andfinally taking inverse to get the new P matrix. Unfortunately when we continuedrunning OSELM after the deletion, the model became unstable.To improve model stability after deleting an HN, we project the signal from thelth HN to the remaining HNs, to reduce the shock to the system from deleting the80lth HN. We can think of Eq. (4.15) as a linear regression problem with hi being thepredictors. After hl is deleted, redoing the regression givesyˆ=∑i ̸=lβ ′i hi; (4.16)with the new regression coefficient β ′i incorporating some of the signal from thedeleted βl coefficient.To see how the signal from the lth predictor projects to the other predictors, wewriteβlhl =∑i ̸=lαihi+ ε; (4.17)with ε representing what cannot be projected to the other predictors.Writing the data in vector form, i.e. the elements of hl are the values of hl atvarious times (up to time of deletion),βlhl =∑i ̸=lαihi+ ε : (4.18)Taking dot product with h j ( j ̸= l), we haveβlhl ·h j =∑i̸=lαihi ·h j+ ε ·h j: (4.19)As ε ·h j = 0, hl ·h j = Kl j = K jl and hi ·h j = Ki j = K ji, where K is the covariancematrix computed from all data just before the deletion, we obtain∑i ̸=lαiKi j = βlKl j: (4.20)This is a system of linear algebraic equations, where the number of αi variablesequals the number of equations, hence αi can be solved exactly. Therefore, afterdeleting hl , the new regression relation (4.16) is solved by havingβ ′i = βi+αi; (4.21)where αi adds back some of the deleted signal from βlhl .81Next, when deleting more than a single HN at a time, one needs to sum over{l}, where {l} is the set of indices for the deleted HNs, and (4.18) is replaced by∑l∈{l}βlhl = ∑i ̸∈{l}αihi+ ε ; (4.22)Eq.(4.20) becomes∑i̸∈{l}αiKi j = ∑l∈{l}βlKl j; j ̸∈ {l}: (4.23)Again αi is solved exactly, and β ′i = βi+αi.4.3.3 Buffer SetIn OSELM, the data already used by the model can be discarded. However, VC-OSELM needs to save a chunk of the data already used to calculate δH (and H ifusing scheme A). Thus, a small buffer set X′;Y′ containing recent past informationis needed. The size S (number of cases) to be used in the buffer set is problemdependent. After adding HN in the model, the information in the current bufferset is erased. The necessary steps for implementing the VC-OSELM algorithm areshown in Algorithm 6. Note that when HNs are being added, scheme A is usedonly when the number of HN is less than the number of cases in Y′, otherwisescheme B is used.4.4 Experimental ResultsTo demonstrate the practical use of the VC-OSELM as an online prediction method,experiments were performed on two hydro-meteorological datasets. The experi-ments were divided into two groups. First, a comparison was made between theOSELM and VC-OSELM using fixed steps to increase or decrease the number ofHN (Section 4.4.2). Second, an automated procedure was used to find the num-ber of HN in an adaptive way so HNs could be recruited or deleted dynamicallyaccording to their significance to network performance (Section 4.4.3). All ourprediction models were built using the free R software environment for statisticalcomputing (R Core Team, 2014). OSELM and VC-OSELM uses the R packageMASS (Venables and Ripley, 2002) and R native libraries. Our OSELM is freely821 Step I) Initialization Phase begin2 // k = 0;3 Based on Chapter 2, randomly assign wi and bi, i= 1; · · · ;L;4 Calculate the hidden layer output matrix H0;5 Calculate P0 = (HT0H0)−1;6 Calculate the output weights Bˆ0 = P0HT0Y0;7 Buffer {X′;Y′}=∅;8 end9 Step II) Online Sequential Learning Phase begin10 (k+1)th chunk of new data {Xk+1;Yk+1} arrives;11 Calculate Hk+1 using only the new data;12 Calculate Pk+1 = Pk−PkHTk+1(I+Hk+1PkHTk+1)−1Hk+1Pk;13 Calculate Bˆk+1 = Bˆk+Pk+1HTk+1(Yk+1−Hk+1Bˆk);14 Set k = k+1;15 if Add or Remove HN then16 Go to Step III);17 else18 {X′;Y′}= {X′;Y′}∪{Xk+1;Yk+1};19 Go to Step II);20 end21 end22 Step III) Add or Remove HN begin23 if Add HN then24 Randomly assign wi and bi, i= 1; · · · ;δL;25 Using X′, calculate the hidden layer output matrix δH (and H if using scheme A);26 if Total number of HN (L+δL) < number of cases in Y′ then27 Calculate P′ using Eq. (4.12) (scheme A);28 else29 Calculate P′ using Eq. (4.14) (scheme B);30 end31 Add δL rows of zeros to Bˆk in Eq. (4.6);32 {X′;Y′}=∅;33 else34 Calculate αi using Eq. (4.23);35 Calculate the new β ′i using Eq. (4.21);36 end37 Go to Step II);38 endAlgorithm 6: VC-OSELM83available from the project website: http://forai.r-forge.r-project.org/. To measurethe performance of VC-OSELM relative to the OSELM, we used the MAE and theRMSE.4.4.1 Data DescriptionAs a case of study, we used the GEFS reforecast2 output (Hamill et al., 2013) andlocal hydro-meteorological observations as predictors to forecast the daily meanstreamflow at two hydrological stations: the Englishman River station – hydro-metric station 08HB002, (49◦ 19′ 0′′N, 124◦ 16′ 58′′W) – and the Stave Riverstation – hydrometric station 08MH147, (49◦ 33′ 21′′N, 122◦ 19′ 24′′W) – at leadtimes of one day during 1985-2011. The time-series of the stations tested herewere obtained from Water Survey of Canada at http://wateroffice.ec.gc.ca/. Bothstations have been used in research studies before (Rasouli et al., 2012; Fleminget al., 2015; Lima et al., 2015). The meteorological data used for the English-man River were from the Qualicum R Fish Research station (ID 1026565), whichcan be found at the Historical Canadian Climate Database (http://climate.weather.gc.ca/index e.html). The meteorological data used for Stave River were from thehydro-meteorological station of B.C. Hydro (Code STA). The pool of 34 predictors(inputs) for streamflow forecasting is shown in 3.2 (Table 3.1).Streamflow is sensitive to the space-time variability of precipitation, land sur-face characteristics and climate variability. In addition there is a large number ofnonlinear interactions related to the transformation of snow and rainfall into soilwater and runoff (Potter and Colman, 2003). For Englishman (Fig. 4.1a) in themajority of the cases, the months with high accumulated precipitation also havehigh accumulated streamflow. However, for Stave (Fig. 4.1b) the month with thehighest accumulated streamflow is June when the accumulated precipitation is low,as melting snow is the main source for the streamflow.Only days not containing missing values were considered (9540 days for En-glishman and 9619 days for Stave). The available data, standardized to zero meanand unit variance, were divided into training and testing subsets. Similarly to Chap-ter 2, the training set was used to build the initial OSELM models (Step I of Al-2Reforecasts (or hindcasts), are retrospective NWP forecasts for many years in the past, ideallyconducted using the same forecast models and same assimilation system used operationally.84J F M A M J J A S O N D(a) EnglishmanP re ci pi t at i on ( mm )051 01 52 001 02 03 0S tr ea mf l ow ( m3s )J F M A M J J A S O N D(b) StaveP re ci pi t at i on ( mm )051 01 52 002 04 06 08 0S tr ea mf l ow ( m3s )Figure 4.1: Mean accumulated precipitation (bars) and mean accumulatedstreamflow (black line) for each of the 12 calendar months over the stud-ied period for (a) Englishman and (b) Stave.gorithm 6) and the testing set to assess the model performance during the onlinelearning phase.4.4.2 Change Number of HN by Fixed StepsBecause the number of HN cannot be changed for the OSELM, the training data inthe initialization phase should contain enough statistical information to select theappropriate number of HN. As mentioned in Section 4.1, it is not always possibleto get a large amount of data in advance to train the model.The first set of experiments involved increasing the number of HN by fixedsteps as more data become available. The one year of initial training data (1989)was followed by test data (1990-2011) for the online learning phase (Table 4.1).Sliding retroactive validation was performed (1990-2011) as an emulation of real-85time data availability, i.e. we updated the model daily then used the model forecastsfor testing (i.e. verification). Thirty different trials were performed with differentseeds for the random number generator. At each trial, E = 25 (an ensemble of 25members) was generated and the averaged output of the 25 individual membersbecame the output for that trial (Eq. 4.8)– i.e. a total of 30 trials × 25 members =750 OSELM runs.86Table 4.1: Statistics of the observed streamflow (m3=s) in the training and testing datasets.EnglishmanpercentilesMean Std. dev. Median Min Max 5th 10th 90th 95thTrain - 1 Year 9.9 12.8 6.1 0.3 119 0.3 0.6 22.0 32.6Train - 3 Year 10.3 14.3 6.4 0.3 160 0.3 0.4 24.4 35.1Train - 5 Year 10.9 18.2 5.9 0.3 259 0.3 0.4 25.1 37.9Test 13.7 23.7 7.2 0.1 310 0.6 1.1 29.6 50.4StavepercentilesMean Std. dev. Median Min Max 5th 10th 90th 95thTrain - 1 Year 31.6 26.5 25.4 4.3 177 5.4 6.5 67.6 86.7Train - 3 Year 32.4 30.0 23.1 3.2 262 4.8 5.9 71.7 87.8Train - 5 Year 32.2 31.1 22.4 1.8 264 4.5 5.9 71.7 88.2Test 34.2 34.3 23.2 2.6 553 5.4 6.8 73.7 91.887We started with OSELM as in Chapter 2 and 3, where in the initialization phase(Step I of Algorithm 6), the OSELM was trained using the initialization data, withthe number of HNs determined by a hill climbing (HC) algorithm (Yuret, 1994) (wechose the number of HN to be the averaged value over 30 trials selected by HC).At Englishman (ENG) the number of HN chosen was 92 when using 1 year oftraining data. At Stave (STA) the number of HN chosen was 96 when using 1 yearof training data. In each trial the same OSELM initial models were used with theVC-OSELM (Steps II and III of Algorithm 6). We tried to avoid variability relatedto random weight initialization by having the same initial models for OSELM andVC-OSELM.During the online learning phase, the VC-OSELM updated its weights everyday with the new data, which were saved in the buffer dataset. After S days (withS = 365), the number of HN was increased by 6% (δL = 0:06L) and the infor-mation in the current buffer was erased. This procedure was repeated until theend of the test data. The fixed step of 6% was chosen to ensure that at the endof the experiment, L remained less than S, so both schemes A and B could beused. Table 4.2 shows the initial and final number of HN, RMSE and MAE ofOSELM, VC-OSELM using scheme A (VC-OSELM-A), and VC-OSELM usingscheme B (VC-OSELM-B). For S = 365, VC-OSELM-B performed better (bothin RMSE and MAE) than OSELM, showing that adding HN was beneficial. How-ever, VC-OSELM-A performed worse than OSELM. To check if the generalizationof scheme A could be improved, using the same initial models as in S= 365, we in-creased the size of the buffer to S= 730 and S= 1095 leading to fewer updates andconsequently fewer number of final HN. In both cases VC-OSELM-A had betterresults than OSELM and VC-OSELM-B, showing that scheme A would performbetter with larger buffer size.88Table 4.2: Comparison of RMSE and MAE from the test set, with the initial number of HN selected by HC using oneyear of initial training data and the number of HN increased by a fixed step of 6% after S data points. The valuesare the mean over the 30 trials and the parentheses give the standard deviation. The lowest test RMSE and testMAE of each row are shown in boldface.#HN RMSE MAEInitial Final S OSELM VC-OSELM-A VC-OSELM-B OSELM VC-OSELM-A VC-OSELM-B93 312 365 9.11(0.16) 10.13(0.25) 9.01(0.16) 2.65(0.03) 2.77(0.03) 2.57(0.03)ENG 93 166 730 9.11(0.16) 8.97(0.12) 9.03(0.15) 2.65(0.03) 2.54(0.02) 2.61(0.03)93 140 1095 9.11(0.16) 8.94(0.16) 9.08(0.16) 2.65(0.03) 2.57(0.03) 2.64(0.03)96 326 365 14.29(0.11) 16.14(0.33) 14.15(0.10) 5.61(0.03) 5.84(0.04) 5.52(0.03)STA 96 172 730 14.29(0.11) 14.12(0.11) 14.27(0.11) 5.61(0.03) 5.45(0.04) 5.58(0.03)96 144 1095 14.29(0.11) 13.97(0.11) 14.30(0.11) 5.61(0.03) 5.48(0.03) 5.62(0.03)89The second set of experiments involved checking if the deletion of some HNwould be beneficial when the model had more HN than necessary. With L = 350and one year of initial training data, the initial ensemble members of the OSELMwere generated and these were also used as the initial VC-OSELM. In the onlinelearning phase, the VC-OSELM updated its weights every day. After S days (S =365), the number of HN were decreased by 1% (δL= 0:01L).Table 4.3 shows the initial and final number of HN, RMSE and MAE of OS-ELM and VC-OSELM. The RMSE and MAE are higher than in Table 4.2 as theinitial use of an excessive large number of HN led to overfitting and poor general-ization, with both models over-predicting when the observed streamflow had highvalues.90Table 4.3: Comparison of RMSE and MAE from the test set, with the initialization using a large number of HN andone year of training data and the number of HN decreased by 1% after S data points. The values are the mean overthe 30 trials and the parentheses give the standard deviation. The lowest test RMSE and test MAE of each row areshown in boldface.#HN RMSE MAEInitial Final S OSELM VC-OSELM OSELM VC-OSELMENG 350 286 365 18.49(6.84) 14.85(6.20) 3.54(0.25) 2.88(0.17)STA 350 286 365 37.19(38.62) 29.35(40.00) 7.92(0.68) 5.80(0.52)91In summary, these experiments with VC-OSELM increasing or decreasing thenumber of HN by fixed steps confirmed that (1) as more data became available,increasing the number of HN allowed VC-OSELM to outperform OSELM (Table4.2), and (2) if the initial number of HN was too large, decreasing the number ofHN allowed VC-OSELM to reduce the overfitting in OSELM (Table 4.3). Spec-ifying a fixed step in advance is of course unrealistic, so in Sec. 4.4.3, the VC-OSELM is capable of automatically increasing or decreasing the number of HNduring online learning.4.4.3 Determining Automatically the Number of HNAdjusting the number of HN (model complexity) by a fixed step is not the best way,as using a small fixed step may not change the number of HN fast enough, while alarge fixed step may choose too many HN, leading to overfitting. To dynamicallyfind the number of HN we used an algorithm based on the hill climbing. HC is asimple mathematical optimization technique that starts with an arbitrary solutionto a problem, then attempts to find a better solution (in our case smaller RMSE) byincrementally changing a single element of the solution (in our case increasing ordecreasing the number of HN). If the change produces a better solution, an incre-mental change (usually by a defined step size) is made towards the new solution.In the HC algorithm used here, the step size is automatically adjusted by the algo-rithm, i.e., it shrinks when the probes do poorly and it grows when the probes dowell, helping the algorithm to be more efficient and robust (Yuret, 1994).The combination of HC and VC-OSELM provides an approach for determiningthe model complexity in an adaptive way. In this approach, first various models aregenerated in the candidate reservoir. The prediction accuracy (RMSE) of the VC-OSELM is monitored over a validation period S. Among the U choices of HN(in our case U = 4), the one with the lowest RMSE over the validation periodis chosen as the optimal choice. HC chooses U new choices, adding or deletingHNs if needed, and continues running the VC-OSELM. Then, based on whichcandidate is the optimal choice, the HC shrinks or grows the step size. As thenumber of HN varies gradually, the model complexity of OSELM varies. Using ashorter validation period (S) will allow the model complexity to vary more quickly.92The necessary steps for implementing the combination of HC and VC-OSELMalgorithm are shown in Appendix D (Algorithm 8). The values of acce and theinitial stepsize were defined as 1:2 and round(L ∗ 0:3), respectively, with L beingthe averaged number of HN over 30 trials selected in the initialization phase ofSection 4.4.2.The following experiments were designed to be fully automated, i.e., no humanintervention was necessary to tune the model parameters. We used three differentsizes of the initial training set to find the initial number of HN and to train the initialmodel (Step I of Algorithm 6) - 1 year (1989), 3 years (1987-1989) and 5 years(1985-1989). The test set (1990-2011) remained the same (Table 4.1). Similar toSection 4.4.2, we used sliding retroactive validation. Thirty different trials wereperformed with different seeds for the random number generator. In the HC usedhere, a candidate reservoir with four different number of HN (U = 4) was used.Each candidate had an ensemble of 25 members (E = 25). All the networks inthe candidate reservoir were evaluated and compared, however the “operational”(final) forecast came only from the networks contained in the candidate reservoirwith zero acceleration ( j = 2 in Algorithm 8). Thus, the errors measured in thisexperiment came from the averaged output of the 25 individual members of the“operational” forecast.Table 4.4 shows the initial and final number of HN, RMSE and MAE of OS-ELM and VC-OSELM using HC. When using one year of training data, the combi-nation of VC-OSELM and HC had better results than the OSELM for all differenttested values of S for ENG and STA. The best results of the VC-OSELM with HCfor both datasets were found using S = 365, where scheme A was used most ofthe time. Scheme A was only used when L< S, hence the less accurate scheme Bwas used much more often when S = 90 and 180. Also when S = 365, HC usedthe entirely year (including wet and dry seasons) in the validation set. In contrast,with S = 90 or S = 180 a validation set would not be long enough to contain bothwet and dry seasons, which could cause VC-OSELM to converge to the incorrectnumber of HN, as the number of HN required for modelling wet and dry seasonsare different. The time evolution of the number of HN for Stave using S = 180(Fig. 4.2a) and S = 365 (Fig. 4.2b) both showed an increase in the number of HNas more data became available, but there was much more spread in the number of93HN when S= 180.As three years of initial training data provide more information than one year oftraining data, Table 4.1 shows the basic statistics of the training data being closer tothe test data when using more training data. Thus, it is expected that the number ofHN chosen using three years of training data were closer to the optimal than usingone year of training data. With three years of training data, HC chose a largernumber of initial HN than with one year of training data for both ENG and STA,thereby reducing the difference between the initial and final number of HN and theadvantage of using VC-OSELM. With three years of training data, VC-OSELMwith HC still tended to improve over OSELM when S= 365 (Table 4.4).Table 4.4 also shows the results using 5 years of training data and S = 365.Having more data in the initial training dataset tended to give better prediction,as the RMSE and MAE for both datasets (ENG and STA) were the lowest so far.However, for STA VC-OSELM with HC underperformed OSELM. It is possibleto improve the results of the VC-OSELM with HC by doing some adjustments inthe algorithm, such as increasing U , changing acce according to S, changing theinitial stepsize, etc. Since the focus of this research is the VC-OSELM, emphasiswas not given to the identification of the optimal setup for the HC.941990 1995 2000 2005 2010100200300400500(a) S = 180YearNumber of hidden nodes1990 1995 2000 2005 2010100200300400500(b) S = 365YearNumber of hidden nodesFigure 4.2: Time series of the number of HN for Stave when (a) S= 180 and(b) S = 365. Each dashed line shows the number of HN for one trial,the solid line the mean over the 30 trials and the horizontal dashed linethe value of S (below which scheme A was used instead of scheme Bfor adding HN). 95Table 4.4: The final number of HN, the RMSE and MAE from the test set for OSELM and VC-OSELM (with HC).The values are the mean over the 30 trials and the parentheses give the standard deviation. The lowest test RMSEand test MAE of each row are shown in boldface.1 year of training data#HN RMSE MAEInitial Final S OSELM VC-OSELM OSELM VC-OSELM93 270(30) 90 9.11(0.16) 9.01(0.22) 2.65(0.03) 2.50(0.03)ENG 93 121(53) 180 9.11(0.16) 9.06(0.18) 2.65(0.03) 2.58(0.05)93 172(55) 365 9.11(0.16) 8.83(0.15) 2.65(0.03) 2.49(0.04)96 288(70) 90 14.29(0.11) 14.07(0.47) 5.61(0.03) 5.51(0.18)STA 96 172(106) 180 14.29(0.110) 13.98(0.31) 5.61(0.03) 5.42(0.12)96 223(66) 365 14.29(0.11) 13.69(0.19) 5.61(0.03) 5.26(0.07)3 years of training data181 357(49) 90 8.38(0.11) 8.53(0.16) 2.40(0.01) 2.40(0.03)ENG 181 330(40) 180 8.38(0.11) 8.51(0.13) 2.40(0.01) 2.41(0.02)181 182(53) 365 8.38(0.11) 8.35(0.11) 2.40(0.01) 2.40(0.02)168 320(42) 90 13.42(0.10) 13.50(0.16) 5.22(0.02) 5.23(0.06)STA 168 302(98) 180 13.42(0.10) 13.40(0.21) 5.22(0.02) 5.17(0.08)168 293(143) 365 13.42(0.10) 13.32(0.18) 5.22(0.02) 5.10(0.07)5 years of training dataENG 207 200(96) 365 8.24(0.11) 8.16(0.11) 2.35(0.01) 2.34(0.02)STA 286 394(155) 365 12.72(0.11) 13.00(0.21) 4.96(0.02) 4.98(0.07)96Next, we checked if VC-OSELM with HC could adjust automatically froman excessive number of HN by starting the initial model with L = 350. The VC-OSELM with HC was able to adjust automatically the initial overestimated num-ber of HN and outperformed OSELM for both stations. For ENG, VC-OSELMwith HC reduced the final average number of HN relative to the initial number ofHN. However, for STA, the final average number of HN was larger than the initialnumber. Time evolution of the number of HN revealed that for ENG, the aver-age number of HN dropped to a minimum around year 1998 then gently increased(Fig. 4.3a), whereas for STA, after dropping to a minimum around year 1995, theaverage number of HN increased to surpass the initial number (Fig. 4.3b). ThusVC-OSELM with HC was able to recover from the excessive number of initialHN, then generally increased the number of HN as more data became availableand higher model complexity was appropriate for modelling the more complexrelations contained in the longer data record.971990 1995 2000 2005 2010200400600800(a) EnglishmanYearNumber of hidden nodes1990 1995 2000 2005 2010200400600800(B) StaveYearNumber of hidden nodesFigure 4.3: Time series of the number of HN for (a) Englishman and (b) Stavewhen starting the initial model with L = 350. Each dashed line showsthe number of HN for one trial, the solid line the mean over the 30 trialsand the horizontal dashed line the value of S.98Table 4.5: Comparison of RMSE and MAE from the test set, with a large initial number of HN and one year of trainingdata. The values are the mean over the 30 trials and the parentheses give the standard deviation. The lowest testRMSE and test MAE of each row are shown in boldface.#HN RMSE MAEInitial Final S OSELM VC-OSELM OSELM VC-OSELMENG 350 234(119) 365 18.49(6.84) 13.67(6.44) 3.54(0.25) 2.66(0.15)STA 350 393(186) 365 37.19(38.62) 28.51(40.15) 7.92(0.68) 5.45(0.52)994.5 Summary and ConclusionIn summary, we proposed the VC-OSELM algorithm to dynamically add or re-move hidden nodes (HN) in the OSELM, thereby allowing the model complexityto vary as the online learning proceeds. To test the performance of VC-OSELM,we used the GEFS reforecast output and local hydro-meteorological observationsas predictors to forecast the daily mean streamflow at two hydrological stations.We tested two schemes for adding HN. First, both schemes were tested us-ing fixed steps to increment the number of HN as online learning proceeds. VC-OSELM improved the results of OSELM when using a relatively small dataset(one year) for the initial training data. With less data in the initial training set, thenumber of HN were generally fewer than optimal, hence the ability to add moreHN by VC-OSELM online learning was advantageous. When using 3 or 5 yearsof data in the initial training set the number of initial HN chosen by the HC werelarger and closer to optimal than using 1 year. As expected, more data in the ini-tial training set generated better prediction scores. As scheme A was only stablewhen L (the number of HN) < S (the number of data points) our proposed schemeB was used when L ≥ S. It is possible to stabilize scheme A for L ≥ S by addinga regularization parameter which should be considered in future works. We alsoproposed an approach to remove HN which improved the results of OSELM whenthe number of HN were larger than optimal.In a real-world situation, it would not be possible to tune the incremental/decre-mental fixed step in advance, thus a fully automated approach using the HC algo-rithm to dynamically adjust the model complexity was tested. From a candidatereservoir, the algorithm incrementally changes the number of HN and adjusts au-tomatically the future step size. The proposed algorithm was able to adjust thenumber of HN achieving better RMSE and MAE than OSELM in most of thetested scenarios. However, VC-OSELM with HC has a higher computational cost,as it uses U ×E models (with U = 4) compared to the ensemble of E models inOSELM.100Chapter 5Conclusions5.1 SummaryIn this thesis the use of advanced machine learning (ML) methods in environmentalproblems was explored. More precisely, the method called extreme learning ma-chines (ELM) was applied to forecast short-term hydro-meteorological variables.The online sequential learning version of the ELM (OSELM) was also tested inproblems where online sequential learning is preferred over batch learning algo-rithms. Also, an algorithm called variable complexity online sequential extremelearning machine (VC-OSELM) was proposed to dynamically add or remove theHN of the OSELM. Multiple linear regression (MLR), online sequential multiplelinear regression (OSMLR), bootstrap-aggregated ensemble artificial neural net-work (ANN), support vector regression with evolutionary strategy (SVR-ES), andrandom forest (RF) were compared against the various forms of the ELM. A briefsummary of each form of the ELM, focusing on strengths and weaknesses of theproposed approaches, contributions made to the state of the art, potential applica-tions of the research findings, and future research directions, follows.5.2 Batch Learning ELMIn Chapter 2 the ELM was tested on nine environmental regression problems. Theprediction accuracy and computational speed of the ensemble ELMwere evaluated101against MLR and three nonlinear ML techniques – ANN, SVR-ES and RF. Gener-ally, the tested ML techniques outperformed MLR, but no single method was bestfor all the nine datasets. In particular for the SFBAY, WIND, FRASER, STAVEand SO2 datasets, the nonlinear methods performed much better than MLR.Simple automated algorithms were used to estimate the parameters (e.g. num-ber of HN) needed for model training. The proposed automated procedures per-formed satisfactorily and were able to find suitable parameters in a reasonableamount of time. ELM tended to be the fastest among the nonlinear models butfor large datasets RF tended to be the fastest. ANN and ELM had similar skills,but ELM was much faster than ANN except for large datasets.The ELM algorithm is indeed a promising approach for a range of predictionproblems in the environmental sciences. The model is relatively simple to im-plement and can be very fast to train on small to medium sized datasets. Also,performance in terms of skill score is comparable to the other nonlinear models.Compared individually against ANN, ELM requires a larger number of HN to com-pensate for the randomness of the weights used in the HN.Scaling the range of the random weights in ELM improved its performance.Also, computational cost may be lowered by scaling the range of the weights inadvance, which may lead to more parsimonious models (fewer HN required).5.3 Online Sequential Learning ELMIn Chapter 3 the OSELM was applied to forecast daily streamflow at two smallwatersheds in British Columbia, Canada, at lead times of 1-3 days. Predictors usedwere weather forecast data generated by the NOAA Global Ensemble Forecast-ing System (GEFS), and local hydro-meteorological observations. The OSELMis quite straightforward as ELM is basically solved like MLR. With OSELM, theupdates involve only small datasets containing the newly arrived data and the newdata can then be discarded after the update.OSELM forecasts were tested with daily, monthly or yearly model updates. Asexpected, more frequent updating gave smaller forecast errors, including errors fordata above the 90th percentile (high streamflow events). However as the numberof novel events became rare after 10 years, the difference between the models up-102dating daily and yearly eventually became small. Larger datasets used in the initialtraining of OSELM helped to find better parameters (number of HN) for the model,yielding better predictions.The OSMLR was used as benchmark and OSELM is an attractive approach asit easily outperformed OSMLR in forecast accuracy.5.4 Variable Complexity Online Sequential LearningIn Chapter 4, a new variable complexity online sequential extreme learning ma-chine (VC-OSELM) was proposed. The VC-OSELM automatically adds or re-moves HN as the online learning proceeds, so the model complexity self-adaptsto the new data. The performance of VC-OSELM was compared with OSELMin daily streamflow predictions, showing that VC-OSELM outperformed OSELMwhen the number of initial HN was smaller or larger than optimal.Two schemes for adding HN were tested. First, both schemes were tested usingfixed steps to increment the number of HN as online learning proceeded. VC-OSELM improved the results of OSELM when using a relatively small dataset(one year) for the initial training data. The initial number of HN were generallyfewer than optimal, hence the ability to add more HN by VC-OSELM during onlinelearning was advantageous. Also, an approach to remove HN improved the resultsof OSELM when the initial number of HN was larger than optimal.A fully automated approach using the hill climbing (HC) algorithm to dynam-ically adjust the model complexity was tested. From a candidate reservoir, thealgorithm incrementally changes the number of HN and adjusts automatically thefuture step size. The proposed algorithm was able to adjust the number of HNachieving better RMSE and MAE than OSELM in most of the tested scenarios.5.5 Overall Recommendations and PotentialApplicationsThe experiments made here used simple automated algorithms to estimate the pa-rameters (e.g. number of HN) needed for model training. By fine tuning the param-eters, it is possible to achieve better prediction skills for all the nonlinear methodspresented. However, it is important to take into consideration the computational103time required to identify the optimal parameters, especially if time consuming trialand error methods are used (Zhang, 2007).As suggested by Parviainen and Riihimaki (2013) and in Chapter 2, the rangeof the random weights in ELM should also be considered a parameter for achievingthe best results. In this thesis, a constant value of a = 2 was used when using thesigmoidal activation function and a= 1 when using the hyperbolic tangent functionin Eq. (2.6). Further work on how to optimally scale the weights for ELM iswarranted. Also, using bagging with ELM could be explored to improve stabilityand accuracy.As ELM requires a larger number of HN than ANN, approaches to reducethe number of HN in ELM such as the orthogonal incremental extreme learningmachine (Ying, 2014), is worth further studying.The OSELM is more suitable for many real-time applications than ELM. If theunderlying predictor-predictand relation is nonlinear, and frequent model updatesare needed to use the continually arriving new data, OSELM is an excellent pre-diction method not only for hydrological systems but also for other environmentalsciences. However, unlike echo state networks (ESN), OSELM does not get feed-back from previous errors since it has a feedforward architecture. For data with along fading memory, a recurrent approach (ESN) may outperform a feedforwardapproach.For the VC-OSELM, it may be possible to stabilize scheme A for L ≥ S byadding a regularization parameter, which should be considered in future work. Theresults of the VC-OSELM with HC could be improved by adjusting parameters inthe HC algorithm, such as increasingU , changing acce according to S, changingthe initial stepsize, etc. Therefore, identification of the optimal setup for theHC and other algorithms to decide whenever the HN should be added or removedare worth further studying.Additionally, datasets with different predictors and sizes, procedures to selectpredictors, and other automated procedures to find parameters are all worthy offurther exploration.Perhaps the most important future application to follow from this thesis is non-linear updatable model output statistics (MOS). Post-processing numerical weatherprediction model output by MOS is needed to correct for model bias and to extend104the predictions to variables not computed by the numerical model. Linear statis-tical methods are often used in these computationally intensive updatable MOSsystems, as nonlinear methods like ANN are not practical to use in high dimen-sional systems with new data arriving continually. OSELM and VC-OSELM willlikely advance operational numerical weather prediction through the deploymentof nonlinear updatable MOS systems.105BibliographyAbrahart, R., See, L., and Dawson, C. (2008). Neural Network Hydroinformatics:Maintaining Scientific Rigour. In Abrahart, R., See, L., and Solomatine, D.,editors, Practical Hydroinformatics, volume 68 of Water Science andTechnology Library, pages 33–47. Springer Berlin Heidelberg. → pages 54, 55Abrahart, R. J., Anctil, F., Coulibaly, P., Dawson, C. W., Mount, N. J., See, L. M.,Shamseldin, A. Y., Solomatine, D. P., Toth, E., and Wilby, R. L. (2012). Twodecades of anarchy? Emerging themes and outstanding challenges for neuralnetwork river forecasting. Progress in Physical Geography, 36(4):480–513. →pages 5, 13, 44, 50, 54, 73Aguilar-Martinez, S. and Hsieh, W. W. (2009). Forecasts of tropical Pacific seasurface temperatures by neural networks and support vector regression.International Journal of Oceanography, 2009:13. Article ID 167239. → pages2, 13Armstrong, J. S. and Collopy, F. (1992). Error measures for generalizing aboutforecasting methods: Empirical comparisons. International Journal ofForecasting, 8(1):69–80. → pages 142Bishop, C. M. (2006). Pattern Recognition and Machine Learning (InformationScience and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.→ pages 73, 131Breiman, L. (1996). Bagging predictors. Machine Learning, 24:123–140. →pages 22, 135Breiman, L. (2001). Random Forests. Machine Learning, 45:5–32. → pages 23,56, 128, 136Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification andRegression Trees. Wadsworth Inc. → pages 22, 136106Broomhead, D. S. and Lowe, D. (1988). Multivariable Functional Interpolationand Adaptive Networks. Complex Systems, 2(3):321–355. → pages 7Bu¨rger, G., Reusser, D., and Kneis, D. (2009). Early flood warnings fromempirical (expanded) downscaling of the full ECMWF Ensemble PredictionSystem. Water Resources Research, 45(10). → pages 2, 43Butcher, J., Verstraeten, D., Schrauwen, B., Day, C., and Haycock, P. (2013).Reservoir computing and extreme learning machines for non-linear time-seriesdata analysis. Neural Networks, 38:76 – 89. → pages 71Cannon, A. J. (2011). Quantile regression neural networks: Implementation in Rand application to precipitation downscaling. Computers & Geosciences,37(9):1277–1284. → pages 1, 122Cannon, A. J. (2012a). monmlp: Monotone multi-layer perceptron neuralnetwork. http://CRAN.R-project.org/package=monmlp. R package version1.1.2. → pages 24Cannon, A. J. (2012b). Neural networks for probabilistic environmentalprediction: Conditional Density Estimation Network Creation and Evaluation(CaDENCE) in R. Computers & Geosciences, 41(0):126–135. → pages 2, 13,122Cannon, A. J. and Lord, E. R. (2000). Forecasting summertime surface-levelozone concentrations in the lower Fraser Valley of British Columbia: Anensemble neural network approach. Journal of the Air & Waste ManagementAssociation, 50(3):322–339. → pages 2, 14, 128, 135Cannon, A. J. and Whitfield, P. H. (2002). Downscaling recent streamflowconditions in British Columbia, Canada using ensemble neural network models.Journal of Hydrology, 259(1-4):136 – 151. → pages 22, 37Cawley, G. C., Haylock, M., Dorling, S. R., Goodess, C., and Jones, P. D. (2003).Statistical downscaling with artificial neural networks. In Proceedings of theEuropean Symposium on Artificial Neural Networks (ESANN-2003), pages167–172. → pages 16Cawley, G. C., Janacek, G. J., Haylock, M. R., and Dorling, S. R. (2007).Predictive uncertainty in environmental modelling. Neural Networks,20(4):537–549. → pages 1, 12, 33, 55, 121, 123, 124, 126, 128107Chang, F.-J., Chen, P.-A., Lu, Y.-R., Huang, E., and Chang, K.-Y. (2014).Real-time multi-step-ahead water level forecasting by recurrent neural networksfor urban flood control. Journal of Hydrology, 517:836 – 846. → pages 2, 44Chen, S.-T., Yu, P.-S., and Tang, Y.-H. (2010). Statistical downscaling of dailyprecipitation using support vector machines and multivariate analysis. Journalof Hydrology, 385(1-4):13–22. → pages 1, 13, 122, 123Chen, Y. and Chang, F. J. (2009). Evolutionary artificial neural networks forhydrological systems forecasting. Journal of Hydrology, 367(1-2):125–137. →pages 13Cherkassky, V., Krasnopolsky, V., Solomatine, D., and Valdes, J. (2006).Computational intelligence in earth sciences and environmental applications:Issues and challenges. Neural Networks, 19(2):113 – 121. → pages 1, 9, 12,14, 15, 43, 56Cherkassky, V. and Ma, Y. (2004). Practical selection of SVM parameters andnoise estimation for SVM regression. Neural Networks, 17(1):113–126. →pages xii, xviii, 23, 25, 127, 128, 133, 138, 145, 149Dawson, C. (2008). Neural Network Solutions to Flood Estimation at UngaugedSites. In Abrahart, R., See, L., and Solomatine, D., editors, PracticalHydroinformatics, volume 68 ofWater Science and Technology Library, pages49–57. Springer Berlin Heidelberg. → pages 59Draper, N. R. and Smith, H. (1998). Applied Regression Analysis (Wiley Series inProbability and Statistics). Wiley series in probability and mathematicalstatistics. Applied probability and statistics. John Wiley & Sons Inc, 2 subedition. → pages 128Dugdale, R. C., Wilkerson, F. P., Hogue, V. E., and Marchi, A. (2007). The role ofammonium and nitrate in spring bloom development in San Francisco Bay.Estuarine, Coastal and Shelf Science, 73(1-2):17–29. → pages 121Eiben, A. E. and Schoenauer, M. (2002). Evolutionary computing. InformationProcessing Letters, 82(1):1–6. → pages 134Eiben, A. E. and Smith, J. E. (2003). Introduction to Evolutionary Computing.Natural Computing Series. Springer, Berlin. → pages 23, 39, 128, 131, 133,134, 150108Elshorbagy, A., Corzo, G., Srinivasulu, S., and Solomatine, D. P. (2010a).Experimental investigation of the predictive capabilities of data drivenmodeling techniques in hydrology - Part 1: Concepts and methodology.Hydrology and Earth System Sciences, 14(10):1931–1941. → pages 14Elshorbagy, A., Corzo, G., Srinivasulu, S., and Solomatine, D. P. (2010b).Experimental investigation of the predictive capabilities of data drivenmodeling techniques in hydrology - Part 2: Application. Hydrology and EarthSystem Sciences, 14(10):1943–1961. → pages 40Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working set selection using secondorder information for training support vector machines. J. Mach. Learn. Res.,6:1889–1918. → pages xviii, 127, 143, 146, 149Faraway, J. and Chatfield, C. (1998). Time series forecasting with neuralnetworks: a comparative study using the air line data. Journal of the RoyalStatistical Society: Series C (Applied Statistics), 47:123–140.10.1111/1467-9876.00109. → pages 135Feng, G., Huang, G.-B., Lin, Q., and Gay, R. (2009). Error minimized extremelearning machine with growth of hidden nodes and incremental learning. IEEETransactions on Neural Networks, 20(8):1352–1357. → pages 14, 20, 54, 74,79Fleming, S. W., Bourdin, D. R., Campbell, D., Stull, R. B., and Gardner, T.(2015). Development and Operational Testing of a Super-Ensemble ArtificialIntelligence Flood-Forecast Model for a Pacific Northwest River. JAWRAJournal of the American Water Resources Association, 51(2):502–512. →pages 2, 43, 46, 84Fleming, S. W. and Whitfield, P. H. (2010). Spatiotemporal mapping of ENSOand PDO surface meteorological signals in British Columbia, Yukon, andsoutheast Alaska. Atmosphere-Ocean, 48(2):122–131. → pages 46Fleming, S. W., Whitfield, P. H., Moore, R. D., and Quilty, E. J. (2007).Regime-dependent streamflow sensitivities to Pacific climate modes cross theGeorgia-Puget transboundary ecoregion. Hydrological Processes,21(24):3264–3287. → pages 46Foley, A. M., Leahy, P. G., Marvuglia, A., and McKeogh, E. J. (2012). Currentmethods and advances in forecasting of wind power generation. RenewableEnergy, 37(1):1–8. → pages 2, 121109Friedrichs, F. and Igel, C. (2005). Evolutionary tuning of multiple svmparameters. Neurocomputing, 64:107–117. Trends in Neurocomputing: 12thEuropean Symposium on Artificial Neural Networks 2004. → pages 127, 150Fritsch, J. M., Houze, R. A., Adler, R., Bluestein, H., Bosart, L., Brown, J., Carr,F., Davis, C., Johnson, R. H., Junker, N., and et al. (1998). Quantitativeprecipitation forecasting: Report of the eighth prospectus development team, usweather research program. Bulletin of the American Meteorological Society,79(3):285–299. → pages 129Gobena, A. K., Weber, F. A., and Fleming, S. W. (2013). The role of large-scaleclimate modes in regional streamflow variability and implications for watersupply forecasting: a case study of the Canadian Columbia River Basin.Atmosphere-Ocean, 51(4):380–391. → pages 120, 123Hall, M. and Smith, L. (1996). Practical feature subset selection for machinelearning. In McDonald, C., editor, Computer Science ’98 Proceedings of the21st Australasian Computer Science Conference ACSC’98, pages 181–191.Springer. → pages 56, 127Hamill, T. M., Bates, G. T., Whitaker, J. S., Murray, D. R., Fiorino, M.,Galarneau, T. J., Zhu, Y., and Lapenta, W. (2013). NOAA’s Second-GenerationGlobal Medium-Range Ensemble Reforecast Dataset. Bulletin of the AmericanMeteorological Society, 94(10):1553–1565. → pages 11, 45, 48, 84Hashmi, M. Z., Shamseldin, A. Y., and Melville, B. W. (2009). Statisticaldownscaling of precipitation: state-of-the-art and application of Bayesianmulti-model approach for uncertainty assessment. Hydrology and Earth SystemSciences Discussions, 6(5):6535–6579. → pages 129Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of StatisticalLearning: Data Mining, Inference, and Prediction, Second Edition. SpringerSeries in Statistics. Springer. → pages 1, 12, 23, 130, 133, 136Haupt, S., Pasini, A., and Marzban, C. (2009). Artificial intelligence methods inthe environmental sciences. Springer. → pages 1, 128, 131Haykin, S. (1998). Neural Networks - A Comprehensive Foundation. PearsonEducation, Second edition. → pages 5, 6, 13, 16, 73, 127Haylock, M. R., Cawley, G. C., Harpham, C., Wilby, R. L., and Goodess, C. M.(2006). Downscaling heavy precipitation over the United Kingdom: acomparison of dynamical and statistical methods and their future scenarios.International Journal of Climatology, 26(10):1397–1415. → pages 16, 122110Hocking, R. R. (1976). The Analysis and Selection of Variables in LinearRegression. Biometrics, 32(1):pp. 1–49. → pages 56, 130Hong, Y.-S. T. (2012). Dynamic nonlinear state-space model with a neuralnetwork via improved sequential learning algorithm for an online real-timehydrological modeling. Journal of Hydrology, 468-469(0):11–21. → pages 2,3, 44Hordijk, W. (1996). A Measure of Landscapes. Evolutionary Computation,4(4):335–360. → pages 134Hornik, K., Buchta, C., and Zeileis, A. (2009). Open-source machine learning: Rmeets Weka. Computational Statistics, 24(2):225–232. → pages 136Hsieh, W. W. (2009). Machine Learning Methods in the Environmental Sciences:Neural Networks and Kernels. Cambridge University Press, New York, NY,USA. → pages 1, 6, 13, 14, 23, 43, 73, 127, 133, 139Hsieh, W. W. and Tang, B. (2001). Interannual variability of accumulated snow inthe Columbia Basin, British Columbia. Water Resources Research,37(6):1753–1759. → pages 120Huang, C.-L. and Wang, C.-J. (2006). A GA-based feature selection andparameters optimization for support vector machines. Expert Systems withApplications, 31(2):231–240. → pages 127Huang, G., Huang, G.-B., Song, S., and You, K. (2015). Trends in extremelearning machines: A review. Neural Networks, 61(0):3248. → pages 44, 74Huang, G.-B. (2014). An Insight into Extreme Learning Machines: RandomNeurons, Random Features and Kernels. Cognitive Computation,6(3):376–390. → pages 8, 50, 77Huang, G.-B. and Chen, L. (2007). Convex incremental extreme learningmachine. Neurocomputing, 70(16–18):3056–3062. Neural NetworkApplications in Electrical Engineering Selected papers from the 3rdInternational Work-Conference on Artificial Neural Networks (IWANN 2005).→ pages 74Huang, G.-B. and Chen, L. (2008). Enhanced random search based incrementalextreme learning machine. Neurocomputing, 71(16–18):34603468. Advancesin Neural Information Processing (ICONIP 2006) / Brazilian Symposium onNeural Networks (SBRN 2006). → pages 74111Huang, G.-B., Chen, L., and Siew, C.-K. (2006a). Universal approximation usingincremental constructive feedforward networks with random hidden nodes.IEEE Transactions on Neural Networks, 17(4):879–892. → pages 74Huang, G.-B., Wang, D., and Lan, Y. (2011). Extreme learning machines: asurvey. International Journal of Machine Learning and Cybernetics,2(2):107–122. → pages 8, 9, 13, 17, 44, 50, 52, 74, 77Huang, G.-B., Zhou, H., Ding, X., and Zhang, R. (2012). Extreme LearningMachine for Regression and Multiclass Classification. Systems, Man, andCybernetics, Part B: IEEE Transactions on Cybernetics, 42(2):513–529. →pages 19Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. (2006b). Extreme learning machine:Theory and applications. Neurocomputing, 70(1-3):489 – 501. NeuralNetworks: Selected Papers from the 7th Brazilian Symposium on NeuralNetworks (SBRN ’04). → pages 8, 13, 14, 16, 20, 25, 26, 44, 50, 51, 54, 73,74, 77Ibarra-Berastegi, G., Sae´nz, J., Ezcurra, A., Elı´as, A., Diaz de Argandon˜a, J., andErrasti, I. (2011). Downscaling of surface moisture flux and precipitation in theEbro Valley (Spain) using analogues and analogues followed by random forestsand multiple linear regression. Hydrology and Earth System SciencesDiscussions, 8(1):1951–1985. → pages 1, 14Jachner, S., van den Boogaart, K. G., and Petzoldt, T. (2007). Statistical methodsfor the qualitative assessment of dynamic models with time delay (r packagequalv). Journal of Statistical Software, 22(8):1–30. → pages 142Jassby, A. D. and Cloern, J. E. (2014). wq: Exploring water quality monitoringdata. http://CRAN.R-project.org/package=wq. R package version 0.4-1. →pages 121Kalnay, E., Kanamitsu, M., Kistler, R., Collins, W., Deaven, D., Gandin, L.,Iredell, M., Saha, S., White, G., Woollen, J., Zhu, Y., Leetmaa, A., Reynolds,R., Chelliah, M., Ebisuzaki, W., Higgins, W., Janowiak, J., Mo, K. C.,Ropelewski, C., Wang, J., Jenne, R., and Joseph, D. (1996). The NCEP/NCAR40-Year Reanalysis Project. Bulletin of the American Meteorological Society,77(3):437–471. → pages 123Khan, M. S., Coulibaly, P., and Dibike, Y. (2006). Uncertainty analysis ofstatistical downscaling methods. Journal of Hydrology, 319(1-4):357 – 382. →pages 123112Kramer, O., Gloger, B., and Goebels, A. (2007). An experimental analysis ofevolution strategies and particle swarm optimizers using design of experiments.In Proceedings of the 9th annual conference on Genetic and evolutionarycomputation, GECCO ’07, pages 674–681, New York, NY, USA. ACM. →pages 127, 146Krasnopolsky, V. M. (2007). Neural network emulations for complexmultidimensional geophysical mappings: Applications of neural networktechniques to atmospheric and oceanic satellite retrievals and numericalmodeling. Reviews of Geophysics, 45(3). → pages 14, 128Kuligowski, R. J. and Barros, A. P. (1998). Localized Precipitation Forecasts froma Numerical Weather Prediction Model Using Artificial Neural Networks.Weather Forecasting, 13:1194–1204. → pages 1, 12, 16, 33, 122, 129Kumar, D. N., Raju, K. S., and Sathish, T. (2004). River Flow Forecasting usingRecurrent Neural Networks. Water Resources Management, 18(2):143–161. →pages 71Kursa, M. B. and Rudnicki, W. R. (2010). Feature selection with the Borutapackage. Journal of Statistical Software, 36(11):1–13. → pages 55, 56Kurt, A., Gulbagci, B., Karaca, F., and Alagha, O. (2008). An online air pollutionforecasting system using neural networks. Environment International,34(5):592–598. → pages 2, 128, 130Kusiak, A., Zheng, H., and Song, Z. (2009). Wind farm power prediction: adata-mining approach. Wind Energy, 12(3):275–293. → pages 2, 13Lan, Y., Soh, Y. C., and Huang, G.-B. (2009). Ensemble of online sequentialextreme learning machine. Neurocomputing, 72(13):3391–3395. → pages 56,75, 78Leung, F., Lam, H., Ling, S., and Tam, P. (2003). Tuning of the structure andparameters of a neural network using an improved genetic algorithm. IEEETransactions on Neural Networks, 12(1):79–88. → pages xviii, 13, 145, 149Liang, N.-Y., Huang, G.-B., Saratchandran, P., and Sundararajan, N. (2006). AFast and Accurate Online Sequential Learning Algorithm for FeedforwardNetworks. IEEE Transactions on Neural Networks, 17(6):1411–1423. → pages3, 4, 9, 42, 44, 45, 51, 52, 75, 78Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest.R News, 2(3):18–22. → pages 24, 136113Lima, A. R., Cannon, A. J., and Hsieh, W. W. (2013). Nonlinear regression inenvironmental sciences by support vector machines combined withevolutionary strategy. Computers & Geosciences, 50(0):136–144. → pages 14,23Lima, A. R., Cannon, A. J., and Hsieh, W. W. (2015). Nonlinear regression inenvironmental sciences using extreme learning machines: A comparativeevaluation. Environmental Modelling & Software, 73:175–188. → pages 43,46, 74, 84Lima, A. R., Cannon, A. J., and Hsieh, W. W. (2016). Forecasting dailystreamflow using online sequential extreme learning machines. Journal ofHydrology, 537:431 – 443. → pages 74Lima, A. R., Silva, D. A., Mattos Neto, P. S., and Ferreira, T. A. (2010). Anexperimental study of fitness function and time series forecasting usingartificial neural networks. In Proceedings of the 12th annual conferencecompanion on Genetic and evolutionary computation, GECCO ’10, pages2015–2018, New York, NY, USA. ACM. → pages 142Lin, K.-M. and Lin, C.-J. (2003). A study on reduced support vector machines.Neural Networks, IEEE Transactions on, 14(6):1449 – 1459. → pages xii,xviii, 137, 138, 145, 149Lin, S.-W., Ying, K.-C., Chen, S.-C., and Lee, Z.-J. (2008). Particle swarmoptimization for parameter determination and feature selection of supportvector machines. Expert Systems with Applications, 35(4):1817–1824. →pages 127Liu, N. and Wang, H. (2010). Ensemble Based Extreme Learning Machine.Signal Processing Letters, IEEE, 17(8):754–757. → pages 14, 18, 20Liu, X., Gao, C., and Li, P. (2012). A comparative analysis of support vectormachines and extreme learning machines. Neural Networks, 33(0):58 – 66. →pages 35Lukoeviius, M. and Jaeger, H. (2009). Reservoir computing approaches torecurrent neural network training. Computer Science Review, 3(3):127–149. →pages 71Maier, H. R. and Dandy, G. C. (2000). Neural networks for the prediction andforecasting of water resources variables: a review of modelling issues andapplications. Environmental Modelling & Software, 15(1):101–124. → pages13, 15114Maier, H. R., Jain, A., Dandy, G. C., and Sudheer, K. (2010). Methods used forthe development of neural networks for the prediction of water resourcevariables in river systems: Current status and future directions. EnvironmentalModelling & Software, 25(8):891–909. → pages 2, 5, 6, 13, 14, 15, 44, 50, 56,73, 121Mathew, S. (2006). Wind Energy: Fundamentals, Resource Analysis andEconomics. Springer. → pages 122McCollor, D. and Stull, R. (2008). Hydrometeorological Accuracy Enhancementvia Postprocessing of Numerical Weather Forecasts in Complex Terrain.Weather and Forecasting, 23(1):131–144. → pages 1Merryfield, W. J., Lee, W.-S., Boer, G. J., Kharin, V. V., Scinocca, J. F., Flato,G. M., Ajayamohan, R. S., Fyfe, J. C., Tang, Y., and Polavarapu, S. (2013). TheCanadian seasonal to interannual prediction system. Part I: Models andinitialization. Monthly Weather Review, 141:2910–2945. → pages 120Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., and Leisch, F. (2014).e1071: Misc Functions of the Department of Statistics (e1071), TU Wien.http://CRAN.R-project.org/package=e1071. R package version 1.6-4. → pages24, 136Muller, K., Smola, A., Ratsch, G., Scholkopf, B., Kohlmorgen, J., and Vapnik, V.(1997). Predicting time series with support vector machines. In Gerstner, W.,Germond, A., Hasler, M., and Nicoud, J.-D., editors, Artificial Neural Networks- ICANN’97, volume 1327 of Lecture Notes in Computer Science, pages999–1004. Springer Berlin / Heidelberg. 10.1007/BFb0020283. → pages 130Ortiz-Garcı´a, E. G., Salcedo-Sanz, S., Pe´rez-Bellido, A. M., and Portilla-Figueras,J. A. (2009). Improving the training time of support vector regressionalgorithms through novel hyper-parameters search space reductions.Neurocomputing, 72:3683–3691. → pages 127Pai, P.-F. and Hong, W.-C. (2005). Forecasting regional electricity load based onrecurrent support vector machines with genetic algorithms. Electric PowerSystems Research, 74(3):417–425. → pages 127Pao, Y.-H., Park, G.-H., and Sobajic, D. J. (1994). Learning and generalizationcharacteristics of the random vector functional-link net. Neurocomputing,6(2):163 – 180. Backpropagation, Part {IV}. → pages 7Pao, Y. H. and Takefuji, Y. (1992). Functional-link net computing: theory, systemarchitecture, and functionalities. Computer, 25(5):76–79. → pages 7115Parviainen, E. and Riihimaki, J. (2013). A connection between extreme learningmachine and neural network kernel. In Fred, A., Dietz, J., Liu, K., and Filipe,J., editors, Knowledge Discovery, Knowledge Engineering and KnowledgeManagement, volume 272 of Communications in Computer and InformationScience, pages 122–135. Springer Berlin Heidelberg. → pages 14, 18, 19, 36,41, 71, 74, 104Pino-Mejı´as, R., de-la Vega, M. D. C., Anaya-Romero, M., Pascual-Acosta, A.,Jorda´n-Lo´pez, A., and Bellinfante-Crocci, N. (2010). Predicting the potentialhabitat of oaks with data mining models and the R system. EnvironmentalModelling & Software, 25(7):826–836. → pages 2, 127, 128, 130Piotrowski, A. P. and Napiorkowski, J. J. (2011). Optimizing neural networks forriver flow forecasting - evolutionary computation methods versus theLevenberg-Marquardt approach . Journal of Hydrology, 407(1-4):12 – 27. →pages 13Potter, T. and Colman, B. (2003). Handbook of weather, climate, and water:atmospheric chemistry, hydrology, and societal impacts. Wiley-Interscience. →pages 2, 3, 43, 44, 45, 84Quinlan, J. R. (1992). Learning with continuous classes. In Adams, A. andSterling, L., editors, Proceedings of the 5th Australian joint Conference onArtificial Intelligence, pages 343–348. World Scientific. → pages 128, 136R Core Team (2014). R: A Language and Environment for Statistical Computing.http://www.R-project.org/. → pages 24, 55, 82, 136Rasouli, K., Hsieh, W. W., and Cannon, A. J. (2012). Daily streamflowforecasting by machine learning methods with weather and climate inputs.Journal of Hydrology, 414-415(0):284–293. → pages 2, 13, 44, 46, 47, 84, 123Russell, S. J. and Norvig, P. (2003). Artificial Intelligence: A Modern Approach.Pearson Education. → pages 37Schmidt, W. F., Kraaijveld, M., and Duin, R. P. (1992). Feedforward neuralnetworks with random weights. In Pattern Recognition, 1992. Vol. II.Conference B: Pattern Recognition Methodology and Systems, Proceedings.,11th IAPR International Conference on, pages 1–4. IEEE. → pages 7, 8, 13,16, 44, 50, 73, 77Schoof, J. and Pryor, S. (2001). Downscaling temperature and precipitation: acomparison of regression-based methods and artificial neural networks.International Journal of Climatology, 21(7):773–790. → pages 14, 128116Sene, K. (2010). Hydrometeorology: Forecasting and Applications. Springer. →pages 2, 3, 43, 44, 57, 123Shamseldin, A. Y. (2010). Artificial neural network model for river flowforecasting in a developing country. Journal of Hydroinformatics, 12(1):22–35.→ pages 2, 43, 44, 55Siqueira, H., Boccato, L., Attux, R., and Lyra, C. (2012). Echo State Networksand Extreme Learning Machines: A Comparative Study on SeasonalStreamflow Series Prediction. In Huang, T., Zeng, Z., Li, C., and Leung, C.,editors, Neural Information Processing, volume 7664 of Lecture Notes inComputer Science, pages 491–500. Springer Berlin Heidelberg. → pages 44Smit, S. and Eiben, A. (2009). Comparing parameter tuning methods forevolutionary algorithms. In Evolutionary Computation, 2009. CEC ’09. IEEECongress on, pages 399 –406. → pages 127Solomatine, D. P. and Ostfeld, A. (2008). Data-driven modelling: some pastexperiences and new approaches. Journal of Hydroinformatics, 10(1):322. →pages 13, 15, 56Solomatine, D. P. and Xue, Y. (2004). M5 model trees and neural networks:Application to flood forecasting in the upper reach of the huai river in china.9(6):491–501. → pages 128Strangeways, I. (2007). Precipitation: theory, measurement and distribution.Cambridge University Press. → pages 129Sun, Z.-L., Choi, T.-M., Au, K.-F., and Yu, Y. (2008). Sales forecasting usingextreme learning machine with applications in fashion retailing. DecisionSupport Systems, 46(1):411–419. → pages 13, 18, 20Thimm, G. and Fiesler, E. (1997). High-order and multilayer perceptroninitialization. IEEE Transactions on Neural Networks, 8(2):349–359. → pages18, 19Trenberth, K. E. (1997). The Definition of El Nin˜o. Bulletin of the AmericanMeteorological Society, 78(12):2771–2777. → pages 120Tripathi, S., Srinivas, V., and Nanjundiah, R. S. (2006). Downscaling ofprecipitation for climate change scenarios: A support vector machine approach.Journal of Hydrology, 330(3-4):621–640. → pages 2, 13, 14, 127, 130117Venables, W. and Ripley, B. (2002). Modern applied statistics with S. Statisticsand Computing. Springer. → pages 24, 55, 82, 130Verhoeven, M. and Aarts, E. (1995). Parallel local search. Journal of Heuristics,1:43–65. 10.1007/BF02430365. → pages 150Von Storch, H. and Zwiers, F. (2001). Statistical Analysis in Climate Research.Cambridge University Press. → pages 129Wallace, J. and Hobbs, P. (2006). Atmospheric Science: An Introductory Survey.International Geophysics Series. Elsevier Academic Press. → pages 129Wilks, D. S. (2005). Statistical Methods in the Atmospheric Sciences. AcademicPr Inc, 2 edition. → pages 3Wilson, L. J. and Valle´e, M. (2002). The Canadian Updateable Model OutputStatistics (UMOS) System: Design and Development Tests. Weather andForecasting, 17(2):206–222. → pages 6, 44, 59, 76Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining: Practical MachineLearning Tools and Techniques. Morgan Kaufmann, Burlington, MA, 3 edition.→ pages 142, 148Wu, W., Dandy, G. C., and Maier, H. R. (2014). Protocol for developing ANNmodels and its application to the assessment of the quality of the ANN modeldevelopment process in drinking water quality modelling. EnvironmentalModelling & Software, 54(0):108–127. → pages 2, 14, 44, 50, 121Ying, L. (2014). Orthogonal incremental extreme learning machine for regressionand multiclass classification. Neural Computing and Applications,27(1):111–120. → pages 104Yuret, D. (1994). From genetic algorithms to efficient optimization. Master’sthesis, Massachusetts Institute of Technology. → pages 20, 88, 92Yuval and Hsieh, W. W. (2003). An Adaptive Nonlinear MOS Scheme forPrecipitation Forecasts Using Neural Networks. Weather and Forecasting,18(2):303–310. → pages 6Zeng, Z., Hsieh, W. W., Burrows, W. R., Giles, A., and Shabbar, A. (2011).Surface wind speed prediction in the canadian arctic using non-linear machinelearning methods. Atmosphere-Ocean, 49(1):22–31. → pages 2, 127118Zhang, G. P. (2007). Avoiding Pitfalls in Neural Network Research. IEEETransactions on Systems, Man, and Cybernetics, Part C, 37(1):3–16. → pages14, 40, 104, 126Zhang, L. and Suganthan, P. (2015). A comprehensive evaluation of randomvector functional link networks. Information Sciences. → pages 8Zhang, R., Lan, Y., Huang, G.-B., and Xu, Z.-B. (2012). UniversalApproximation of Extreme Learning Machine With Adaptive Growth ofHidden Nodes. IEEE Transactions on Neural Networks and Learning Systems,23(2):365–371. → pages 74Zhang, R., Lan, Y., Huang, G.-B., Xu, Z.-B., and Soh, Y. C. (2013). DynamicExtreme Learning Machine and Its Approximation Capability. Cybernetics,IEEE Transactions on, 43(6):2054–2065. → pages 74, 75119Appendix ADescription of DatasetsA.1 ENSOThe El Nin˜o-Southern Oscillation (ENSO) is the dominant mode of inter-annualocean-atmosphere variability in the equatorial Pacific. The global influence ofENSO plays a substantial role in shaping seasonal climate anomalies, e.g. in snow-pack (Hsieh and Tang, 2001) and streamflow (Gobena et al., 2013) in westernCanada. Four regional ENSO indexes have been defined based on sea-surfacetemperature (SST) anomalies in the equatorial Pacific: Nin˜o 1+2 (0◦-10◦S, 80◦W-90◦W), Nin˜o 3 (5◦N-5◦S, 90◦W-150◦W), Nin˜o 3.4 (5◦N-5◦S, 120◦W-170◦W) andNin˜o 4 (5◦N-5◦S, 150◦W-160◦E). There is no universal single definition for ElNin˜o events but it is suggested that El Nin˜o occurs if 5-month running means ofSST anomalies in the Nin˜o 3.4 region exceed 0:4◦C for 6 months or more (Tren-berth, 1997).The Canadian Seasonal to Interannual Prediction System (CanSIPS), the real-time extension of the Coupled Historical Forecast Project (CHFP2), has been op-erational at Environment Canada’s Canadian Meteorological Centre (CMC) sinceDecember 2011. It combines 12-month ensemble forecasts from two versions ofthe Canadian Climate Modelling and Analysis (CCCma) coupled global climatemodel (CanCM3 and the CanCM4) (Merryfield et al., 2013). We used eight CHFP2predictors, namely the ensemble mean and the ensemble spread of Nin˜o 1+2, 3,3.4, and 4, to make seasonal ENSO forecasts. Starting from January of 1979, the120predictand is the observed Nin˜o 3.4 index at 5-month lead time.A.2 SFBAYIn reviews of prediction and forecasting in water resources and environmental en-gineering using ANNs, Maier et al. (2010) and Wu et al. (2014) found that the vastmajority of studies focused on flow prediction with very few applications to waterquality. Ammonium concentration is an important component in the delicate estu-arine ecosystems. Disturbances on its concentrations could inhibit other chemicalreactions, affecting directly the ecosystems (Dugdale et al., 2007).The SFBAY dataset consists of selected observations and variables (1993-2004)from U.S. Geological Survey water quality stations in south San Francisco Bay(Jassby and Cloern, 2014). In this dataset, only the cases not containing missingvalues were considered (632 of 23207 cases). The predictand was ammonium con-centration and the six predictors were depth, chlorophyll, dissolved oxygen, sus-pended particulate matter, salinity and water temperature, all at zero lead time. Asthe predictand has a right-skewed distribution, the logarithm transformation wasapplied first to improve the regression results (Cawley et al., 2007).A.3 WINDInterest in wind-generated electricity has grown during the last few years as windpower is considered a source of clean energy. In some electricity markets, wind hasbecome a significant source of energy. Wind power forecasting is an important toolfor efficiently operating power systems, as it helps in electricity trading, planningmaintenance of the wind farms, and energy storage operations (Foley et al., 2012).The data used here come from a wind farm consisting of 53 turbines located innorthern Colorado. This dataset was originally used in the American Meteorologi-cal Society’s 2012 Wind Power Prediction Contest and the objective was to predictthe total wind power from the 53 turbines of a wind farm (predictand).The predictors were forecast variables from three different numerical weatherprediction (NWP) models: Global Forecast System (GFS), North American Mes-oscale (NAM) and Weather Research & Forecasting (WRF). Due to differences inthe resolution of the NWP models, forecasts were provided at four adjacent grid121points for NAM and WRF, and at one grid point for GFS. The forecast variableswere temperature, pressure, relative humidity, and the u and v components of windvelocity. These variables were provided at different levels in the vertical, with someat or near ground level, some at levels adjacent to the wind turbine hub height of 80m above ground, and others provided as average quantities throughout the verticallayer surrounding the turbine hub.As wind power is mainly proportional to the cube of the wind speed (Mathew,2006), the u and v components were replaced by the cube of the wind speed. Thegoal was to make a 24-hour forecast, which was done four times per day.A.4 FRASERThe dataset FRASER, consists of daily observations of suspended sediment con-centration (SSC) (mgL−1) and stream discharge (Q) (m3s−1) for the years 1970-1979 at the Fraser River at Hope station in British Columbia, Canada. The rela-tionship between SSC and Q is nonlinear and time dependent (Cannon, 2012b).To reduce the non-Gaussian nature of the distribution, the log10 transformationwas applied to the predictand (the SSC at zero lead time) and the four predictorswere logQ, dQ5, dQ30, and dQ90 (5-, 30-, and 90-day moving averages of thedaily changes in Q).A.5 YVRThe limited temporal and spatial scales of NWP models (Kuligowski and Barros,1998) is one of the factors that contributes to the difficulty of modelling precipi-tation. To forecast precipitation at local scales, general circulation model (GCM)outputs cannot be used directly due to the coarse spatial resolution, so statisticaldownscaling is often performed (Cannon, 2011; Chen et al., 2010; Haylock et al.,2006).The YVR dataset is a simple real-world precipitation downscaling dataset de-scribed by Cannon (2011). The predictand data consists of thirty years of dailyprecipitation (1971-2000) from the Vancouver International Airport YVR (WMOstation 71892, 49◦ 12:0’N, 123◦ 10.8’W).Precipitation is an example of mixed distribution where the predictand is partly122discrete (days with or without precipitation) and partly continuous (the amountof precipitation observed). Therefore precipitation is often modelled separatelyby occurrence (usually by a logistic regression model) and by amount (regressionmodel fitted to the amount of precipitation on “wet” days) (Cawley et al., 2007;Chen et al., 2010). As our focus is to compare regression models, we used aspredictand the precipitation amount on wet days (defined as days with precipitation> 0:001mm). However the nonlinear models presented here can also be used forclassification (to separate wet and dry days). As the predictand was not normallydistributed, we instead predicted the fourth root of the predictand, as commonlydone for precipitation amounts (Khan et al., 2006).The three predictors were sea-level pressure, 700-hPa specific humidity, and500-hPa geopotential height from the nearest National Centers for EnvironmentalPrediction/National Center for Atmospheric Research (NCEP/NCAR) Reanalysis(Kalnay et al., 1996) grid point at 50◦N, 122:5◦W.A.6 STAVEShort lead time streamflow forecasts based on recent and forecasted meteorologi-cal conditions, and hydrological observations are important for flood warning, hy-droelectric power scheduling and regulating reservoir storage for optimum use ofavailable water resources (Sene, 2010; Gobena et al., 2013).This dataset was originally used by Rasouli et al. (2012). They applied GFSreforecast output, climate indexes and local meteo-hydrological observations toforecast daily streamflow (predictand) for the Stave River above Stave Lake insouthern British Columbia, Canada at lead times of 1-7 days during 1983-2001.They found that local observations plus the GFS NWP output were generally bestas predictors at shorter lead times (1-4 days), while local observations plus climateindexes were best at longer lead times (5-7 days).For 1 day ahead prediction, we used 25 predictors, maximum temperature,temperature range, precipitation, streamflow at day t, streamflow at day t− 1, ac-cumulated precipitation, heating, air pressure at 500 m, mean sea level pressure,precipitable water, relative humidity, 2 m air temperature, temperature, wind am-plitude and wind phase at 10 m), same as Rasouli et al. (2012) but excluding the123seasonal and climate indexes (i.e. we used only the local observations and the GFSoutputs). We applied the logarithm transformation to the streamflow as it has aright-skewed distribution.A.7 PRECIPThe next three datasets were originally used at the WCCI-2006 Predictive Un-certainty in Environmental Modeling Challenge (Cawley et al., 2007) and theyare freely available from the challenge website (http://theoval.cmp.uea.ac.uk/∼gcc/competition/). As the datasets were provided as part of a prediction contest, de-tailed descriptions of the predictors were not provided to entrants and are henceunknown.In the PRECIP dataset the predictors supply large-scale circulation informa-tion, such as might be obtained from a GCM, and the predictand is the daily pre-cipitation amount recorded at Newton Rigg, a relatively wet station located in thenorthwest of the United Kingdom. As above for YVR, we used only data from thewet days for the predictand and applied the fourth root transform to alleviate thenon-Gaussian distribution. The PRECIP dataset contains 106 predictors.A.8 TEMPWith similar large-scale circulation features to those used for the PRECIP dataset,the TEMP dataset, also with 106 predictors, is another statistical downscaling prob-lem, but the predictand is the more normally distributed daily maximum tempera-ture at the Writtle station in the southeast of the United Kingdom.A.9 SO2Local emissions and meteorological conditions such as high atmospheric pressureand temperature inversions could lead to poor dispersion of atmospheric pollutantsdue to the stagnant conditions. The predictand is the concentration of sulphur diox-ide (SO2), an atmospheric pollutant, in urban Belfast. The SO2 concentration is aright skewed variable with a heavy tail, thus the logarithm transformation was ap-plied. Also only the days with SO2 concentration> 0:001 µgm−3 were considered124as predictand. To predict the SO2 concentration 24 hours in advance, meteorologi-cal conditions and current SO2 levels were supplied as 27 predictors.125Appendix BNonlinear regression inenvironmental sciences bysupport vector machinescombined with evolutionarystrategyB.1 IntroductionThe main idea of machine learning (ML) is that computer algorithms are capable ofautomatically distilling knowledge from data. From this knowledge they can con-struct models capable of making predictions from novel data in the future. How-ever, environmental modeling problems are typically very noisy (Cawley et al.,2007), hence it is not easy to build successful predictive models.Due to data modeling complexity, even for a particular ML method, there isoften more than one way to build the model (Zhang, 2007). To build a successfulpredictive model, the correct adjustment of the model hyper-parameters is nec-essary. For example, in artificial neural network (ANN) models, the number ofhidden processing units, the choice of activation functions and the regularization126parameter all need to be specified (Haykin, 1998; Hsieh, 2009).Similarly, in support vector machines (SVM) for regression (SVR), typicallytwo or three hyper-parameters have to be tuned, such as the cost of constraint vi-olation (C), the insensitive-loss (ε) and, if the Gaussian function is used as thekernel function, the width of the Gaussian (γ). In theory, the establishment of thesehyper-parameters requires an optimal search in full state space.Much effort has been spent on improving the efficiency of the SVR hyper-parameter search (Cherkassky and Ma, 2004; Friedrichs and Igel, 2005; Fan et al.,2005; Huang and Wang, 2006; Lin et al., 2008). The most common approach is thesimple grid search (Fan et al., 2005; Ortiz-Garcı´a et al., 2009; Pino-Mejı´as et al.,2010; Zeng et al., 2011), an exhaustive approach that is computationally expensive.Furthermore, the hyper-parameters are varied by fixed step-sizes through a widerange of values, limiting the search to discrete values.Some of the most promising approaches in tuning the SVM hyper-parametersare based on evolutionary algorithms (EA), e.g. genetic algorithms (GA) (Pai andHong, 2005; Huang and Wang, 2006; Tripathi et al., 2006), particle swarm opti-mization (PSO) (Lin et al., 2008), and evolutionary strategies (ES) (Friedrichs andIgel, 2005). However, these approaches need a set of adjustable parameters calledEA parameters (Smit and Eiben, 2009) and typically there are more EA parametersto adjust than the number of hyper-parameters. For example, GA has populationsize, mutation rate, crossover rate, number of generations, etc., and PSO has pop-ulation size, acceleration coefficients, inertia weight, number of generations, etc.Kramer et al. (2007) showed that the choice of the EA parameters has a decisiveimpact in the final results and can undervalue the algorithm performance. In otherwords, to use a GA it is necessary to adjust at least four parameters, which is morethan the three SVR hyper-parameters when using a Gaussian kernel.Another important issue is the data quality upon which machine learning meth-ods operate. Indeed, machine learning algorithms may produce less accurate andless understandable results if the data are inadequate or contain extraneous andirrelevant information (Hall and Smith, 1996). It is also possible to make the mod-eling process less time consuming and sometimes more accurate by removing pre-dictors that are irrelevant or redundant with respect to the task to be learned.In this paper, our main goals are: (i) reduce the number of hyper-parameters127which require estimation; (ii) use an accurate initialization of the hyper-parametersearch; and (iii) discard irrelevant and redundant predictors. We propose a hybridalgorithm called SVR-ES which uses a simple evolutionary strategy called “uncor-related mutation with p step sizes” (Eiben and Smith, 2003) to find the optimalSVR hyper-parameters. We also combine the SVR-ES with stepwise linear regres-sion (SLR) (Draper and Smith, 1998) to screen out irrelevant predictors.Three environmental forecast problems used in the WCCI-2006 contest – sur-face air temperature (TEMP), precipitation (PRECIP) and sulphur dioxide concen-tration (SO2) – are tested (Cawley et al., 2007). These three datasets contain differ-ent amounts of nonlinearity and noise. Several other machine learning techniquessuccessfully used in the environmental forecast problems are considered, includ-ing bootstrap-aggregated ensemble ANN (Cannon and Lord, 2000; Krasnopolsky,2007), SVR using the Cherkassky-Ma hyper-parameter estimates (Cherkassky andMa, 2004), the M5 regression tree (Quinlan, 1992; Solomatine and Xue, 2004;Haupt et al., 2009) and random forest (RF) (Breiman, 2001; Pino-Mejı´as et al.,2010). We also use SLR with these techniques to prescreen and reduce the numberof predictors.Section B.2 describes the data sets used in our study. The forecasting methodsare presented in Sections B.3 and B.4. Results and discussion of the experimentsare given in Section B.5, followed by summary and conclusion in Section B.6.B.2 Data DescriptionEnvironmental data normally contain properties that are difficult to model by com-mon regression techniques. For example, response variables may be strictly non-negative, highly skewed, non-Gaussian distributed and heteroscedastic (Cawleyet al., 2007). Some examples are the modeling of SO2 air pollution (Kurt et al.,2008) or statistical downscaling of temperature and precipitation (Schoof and Pryor,2001).We used as benchmark three datasets which were originally used at the WCCI-2006 Predictive Uncertainty in Environmental Modeling Challenge. These bench-marks, characterized by a non-Gaussian, heteroscedastic variance structure (Caw-ley et al., 2007), are freely available from the challenge website (http://theoval.128cmp.uea.ac.uk/∼gcc/competition/).Predicting precipitation accurately is still one of the most difficult tasks in me-teorology (Fritsch et al., 1998; Kuligowski and Barros, 1998; Strangeways, 2007).Factors responsible for the difficulty in predicting precipitation are e.g. the chaoticnature of the atmosphere and the complexity of the processes involved in its cre-ation (Fritsch et al., 1998), seasonal variations (Wallace and Hobbs, 2006), non-stationary statistical behavior (Von Storch and Zwiers, 2001), difficulties in pre-cipitation measurements including problems with rain gauges, radar and satellites(Strangeways, 2007) and the limited temporal and spatial scales of global circula-tion models (GCMs) (Kuligowski and Barros, 1998).To forecast precipitation at regional scale, GCM outputs cannot be used directlydue to coarse spatial resolution. For example, GCMs do not provide informationon the spatial structure of temperature and precipitation in areas of complex topog-raphy and land use distribution. To use the output of a GCM for this task, statisticaldownscaling models are often used (Hashmi et al., 2009).In the PRECIP benchmark the input variables are large-scale circulation infor-mation, such as might be obtained from a GCM, and the target is daily precipitationdata recorded at Newton Rigg, a relatively wet station located in the northwest ofthe United Kingdom. This benchmark is an example where the response variable isskewed to the right. The PRECIP dataset contains 106 predictors and 10,546 dailypatterns.With similar large-scale circulation features to those used for the PRECIPbenchmark, the TEMP benchmark is another statistical downscaling problem, inthis case one in which the target is the more normally distributed daily maximumtemperature at the Writtle station in the southeast of the United Kingdom. TheTEMP dataset has 106 predictors and 10,675 daily patterns.Air pollution is a major environmental problem in cities. Local emissionsand topographic factors, allied with meteorological conditions such as high atmo-spheric pressure and temperature inversions, cause poor dispersion of atmosphericpollutants due the stagnant conditions. Human health problems (both short termand chronic problems) can occur when pollutant concentrations exceed the allow-able limits. Therefore predicting the concentration of pollutants such as sulfurdioxide (SO2) is crucial to providing proper actions and control strategies in ex-129treme situations (Kurt et al., 2008).The target of the SO2 dataset is the SO2 concentration in urban Belfast. In orderto forecast the SO2 concentration twenty-four hours in advance, meteorologicalconditions and current SO2 levels were used as input variables. Similar to PRECIP,SO2 is a right skewed variable with a heavy tail. The SO2 dataset has 27 predictorsand 22,956 hourly patterns, more than double the number of patterns in PRECIPand TEMP.Irrelevant predictors can unnecessarily increase the time needed for learning asufficiently accurate forecast model and in many cases also increase the size of thesearch space. To reduce the number of predictor variables, we use the well-knownprescreening technique of SLR (Hocking, 1976; Venables and Ripley, 2002; Hastieet al., 2009).B.3 Support Vector RegressionSupport vector machines (SVM) have been widely used in the environmental sci-ences for classification and regression problems (Pino-Mejı´as et al., 2010; Tripathiet al., 2006). SVM were originally designed for nonlinear classification problems,then extended to nonlinear regression problems (SVR) (Muller et al., 1997).Suppose we are given the training data {(x1;y1); : : : ;(xn;yn)} with n patterns.After mapping the input pattern x into a higher dimensional feature space using anonlinear mapping function Φ, the nonlinear regression problem between x and ycan be converted to a linear regression problem between Φ(x) and y, i.e.f (x;w) = 〈w;Φ(x)〉+b; (B.1)where 〈;〉 denotes the inner product, and w and b are the regression coefficientsobtained by minimizing the error between f and the observed values of y. Insteadof the commonly used mean squared error (MSE) norm, SVR uses the ε-insensitiveerror norm to measure the error between f and y,| f (x;w)− y|ε ={0; if | f (x;w)− y|< ε| f (x;w)− y|− ε; otherwise, (B.2)130i.e., small errors (| f − y|< ε) are ignored, whereas for large errors, the error normapproximates the mean absolute error (MAE). A key issue is that an error normbased on MAE is more robust to outliers in the data than the MSE. Using patterndata (xi, yi), the w and b coefficients are estimated by minimizing the objectivefunction:J =Cnn∑i=1| f (xi;w)− yi|ε +12‖w‖2 ; (B.3)whereC (which controls the regularization) and ε are the hyper-parameters.The global minimum solution to the linear regression problem (B.1) can beachieved without iterative nonlinear optimization, hence local minima in the ob-jective function are not a problem. However, the linear regression problem can bevery expensive or impossible to compute without truncating infinite dimensionalvectors. This occurs because Φ(x) may be a very high (or even infinite) dimen-sional vector. To counter this drawback a kernel trick is used in which the innerproduct 〈Φ(x);Φ(x′)〉 in the solution algorithm is replaced by a kernel functionK(x;x′), which does not involve the difficult handling of Φ(x). The minimizationof (B.3) uses the method of Lagrange multipliers, and the final regression estimatecan be expressed in the form (Bishop, 2006):f (x) =∑iK(x;xi)+b; (B.4)where the summation is only over a subset of the given data xi called the supportvectors.B.3.1 Evolutionary StrategiesEvolutionary algorithms (EA) mimic nature’s way of evolving successful organ-isms (individuals) (Haupt et al., 2009). An Evolutionary Strategy (ES) is a par-ticular class of EA that is normally used for continuous parameter optimization.An attractive point of the ES is the self-adaptation of some strategy parameters.The strategy parameters are the set of adjustable parameters used by the algorithm.Self-adaptivity means that some EA parameters are varied during a run in a specificmanner: the parameters are included in the chromosomes and co-evolve with thesolutions (Eiben and Smith, 2003), i.e., the algorithm is capable of adapting itself131autonomously.Mutation and Self-adaptationMutation is the name given to a genetic operation which uses only one parent andcreates one offspring by applying some type of randomized change to the repre-sentation. Let G be a chromosome defined by,G= (g1;g2; : : : ;gp); (B.5)where gi (i = 1;2; : : : ; p) are the solution parameters. Mutations are realized byadding some ∆gi to each gi, where ∆gi are values from a Gaussian distributionN(0;σ), with zero mean and standard deviation σ , i.e.,g′i = gi+N(0;σ); i= 1;2; : : : ; p: (B.6)The self-adaptation consists of including the step size σ in the chromosomesso it also undergoes variation and selection, i.e. mutations are realized by replacing(g1;g2; : : : ;gp;σ) by (g′1;g′2; : : : ;g′p;σ ′), where σ ′ is the mutated value of σ , alsocalled the mutation step size.Uncorrelated Mutation with p Step SizesAn attractive feature of the “mutation with p step sizes” method is its ability to treateach dimension differently, i.e., using different step sizes for different dimensions.The chromosome given in (B.5) is extended to p step sizes, resulting inG= (g1;g2; : : : ;gp;σ1;σ2; : : : ;σp); (B.7)and the mutation rules are given by:σ ′i = σieτ′N(0;1)+τNi(0;1); (B.8)g′i = gi+σiNi(0;1); i= 1;2; : : : ; p; (B.9)132where τ ′ ∝ 1=√2p, and τ ∝ 1=√2√p. The common base mutation eτ′N(0;1) allowsan overall change of the mutability. The flexibility to use different mutation strate-gies in different directions is provided by the coordinate-specific eτNi(0;1) (Eibenand Smith, 2003).B.3.2 Hyper-parameter Optimization and SVR-ESThe performance of an SVR model is highly dependent on the choice of the kernelfunction and the hyper-parameters (Hastie et al., 2009; Hsieh, 2009). The use ofa suitable nonlinear kernel function in SVR allows it to be fully nonlinear, whilethe use of a linear kernel function restricts SVR to a linear model. The SVR withthe linear kernel is a linear regression model but with the robust ε-insensitive errornorm instead of the non-robust MSE norm as used in multiple linear regression(MLR). In this study, we used the radial basis function (RBF) kernel (also calledthe Gaussian kernel) given by K(x;xi) = exp[−‖x−xi‖2=(2σ2K)], with the hyper-parameter γ = 1=(2σ2K) controlling the width σK of the Gaussian function.Cherkassky and Ma (2004) proposed how to estimate values for the hyper-parametersC and ε in SVR. The estimated regularization parameter isC =max(|y+3σy|; |y−3σy|); (B.10)where y and σy are the mean and the standard deviation of the y values of thetraining data; and ε ∼ 3 σN√lnn=n, where n is the size of data set and σN is thestandard deviation of the noise. The noise level is estimated by using the k-nearest-neighbors regression method, where yˆ, the value of a new point, is estimated fromthe average of the k nearest points (Hastie et al., 2009). Hence the estimated noiseis:σˆ2N =n1=5kn(n1=5k−1)n∑m=1(ym− yˆm)2; (B.11)where ym and yˆm are the observed and estimated output values, respectively, forcase m.Cherkassky and Ma (2004) suggested that with d predictor variables the RBFwidth σK should behave as σdK ∼ (0:1− 0:5). Hence, no precise estimate of γwas given. From now on we will call the SVR with the Cherkassky-Ma hyper-133parameters just SVR for brevity.In SVR-ES, the ES is initialized from a starting solution h containing theCherkassky-Ma hyper-parameters. The local search algorithm searches the candi-date solutions in N(h), a set of neighbors, for an h′ that has a better fitness functionthan h. Basically, the fitness function assigns a fitness value to each point in theparameter space (adaptive landscape (Eiben and Schoenauer, 2002)), where thisvalue can be seen as a measure of how good a solution, represented by that pointin the landscape, is to the given problem (Hordijk, 1996). In this way, if a solutionexists (i.e. has better fitness function) it is accepted as the new incumbent solution,and the search proceeds by examining the candidate solution in N(h′). In our par-ticular case, we minimize the MSE ( f (h) = (MSE)−1). The necessary steps forimplementing the SVR-ES algorithm are shown in Algorithm 7.1 begin2 τ → 0; // τ: Iteration number;3 if Screenout irrelevant Predictors = TRUE then4 Perform stepwise regression;5 end6 Initialize G(τ); // G(τ): Chromosome according toEquation (B.7);7 Evaluate f (G(τ)); // f (G(τ)): Fitness function value ;8 while not termination condition do9 τ → τ+1 ;10 Perform mutation operation according to Eqns. (B.8) and (B.9) togenerate new chromosome G′ ;11 Evaluate f (G′(τ));12 if f (G′)> f (G) then13 f (G)← f (G′);14 end15 end16 endAlgorithm 7: Procedure of the SVR-ESEventually, this process will lead to the identification of a local optimum (Eibenand Smith, 2003). However, as this was done for a single individual instead of apopulation (which is necessary for a global search), this approach is highly depen-134dent on its initial condition (initial individual). Assuming that the initial hyper-parameters from Cherkassky-Ma are reasonable, our local search with ES is ade-quate.In particular, the use of SVR-ES is attractive because it can be implemented inan almost automatic way, uses variable steps to find the optimal hyper-parameters,is self-adaptive, and, by using τ ′ = 1=√2p and τ = 1=√2√p, the only EA algo-rithm parameter that remains to be adjusted is the number of iterations.B.4 Artificial Neural Network and Regression TreesFor comparison with the SVR and SVR-ES models, an artificial neural network(ANN) and two regression tree algorithms are also applied to the WCCI-2006datasets.An ANN is a biologically inspired mathematical model which is composedof a large number of highly interconnected perceptrons divided in layers (input,hidden, output), but working in union to solve a problem. Training of the ANNmodel involves adjusting the parameters iteratively so the MSE between the modeloutput yˆ and target y is minimized. Overfitting is a known problem with ANN. Ifp, the number of model parameters, is too large, the ANN may overfit the data,producing a spuriously good fit that does not lead to better predictions for newcases. This motivates the use of model comparison criteria, such as the correctedAkaike information criterion (AICc), which penalizes the MSE as a function p.The AICc, a bias-corrected version of the original AIC, is given byAICc = n ln(MSE)+2p+2(p+1)(p+2)=(n− p−2); (B.12)where n is the number of effective observations (Faraway and Chatfield, 1998).Models with increasing numbers of hidden neurons L are trained and the modelthat minimizes AICc is chosen as the optimum model. The ANN architecture andtraining algorithm used in this study are based on Cannon and Lord (2000), withone hidden layer, early stopping to avoid overfitting, an ensemble (or committee)of ANN models to deal with local minima in the objective function, and with boot-strap aggregation (bagging) (Breiman, 1996).Conceptually simple yet powerful, regression tree analysis, applicable to data135sets with both a large number of patterns and a large number of variables, is ex-tremely resistant to outliers. Among the decision trees used in machine learning,the classification and regression tree (CART) model, first introduced by Breimanet al. (1984), is the most commonly used. A random forest (RF) (Breiman, 2001)is a bootstrap ensemble of many decision trees (CART). Each tree is grown over abootstrap sample from the training data set using a randomly selected subset of theavailable predictors at each decision branch.Like CART, M5 builds a tree-based model, however the tree constructed byM5 can have multivariate linear models at its leaves - the model trees are thusanalogous to piecewise linear functions. The advantage of M5 over CART is thatM5 model trees are generally much smaller and more accurate in some problemdomains (Quinlan, 1992).B.5 Experimental ResultsTo demonstrate the practical use of the forecasting methods for environmentalproblems, experiments were performed on the datasets outlined in Section B.2.In order to compare linear versus non-linear approaches, we also performed exper-iments with MLR. Linear models are simple, fast, and often provide adequate andinterpretable descriptions of how the inputs affect the output. In particular for pre-diction purposes they can sometimes outperform fancier nonlinear models (Hastieet al., 2009). All the forecast models can be built using the free R software en-vironment for statistical computing (R Core Team, 2014). For RF we used the Rpackage randomForest (Liaw and Wiener, 2002). For SVR we used the R packagee1071 (Meyer et al., 2014). We developed the SVR-ES using the R package e1071and the R native libraries; SVR-ES code is freely available from the project web-site: http://forai.r-forge.r-project.org/. For M5 we used the package RWeka (Horniket al., 2009). ANN uses the monmlp package. For MLR and SLR we used thepackage stats.Data were standardized (zero mean and unit variance) and divided into a train-ing set (75% of the data) and an independent test set (last 25% of the data) whichis used to test the trained model. For the SVR we used a 5-fold cross validationwithin the training set to train and validate the model. For SVR-ES we subdivided136the training set into two parts leaving the final 20% to validate the model. TheRF and ANN perform their own split-sample validation via the out-of-bootstrapsamples in the bootstrap aggregation.The ANN setup was as follows: the range of initial random weights was be-tween [−0:5 : 0:5], 30 ensemble members were used for bagging, and 5000 wasused as the maximum number of iterations. For RF, we used 500 as the numberof generated trees and 5, the default value, as the minimum size of terminal nodes.For M5 we used the default values of the package RWeka. For the SRV-ES, C andε were initialized by the Cherkassky-Ma guidelines and γ by 0:001.For SVR, the C and ε values used were from Cherkassky-Ma, but for γ we dida grid search using the range suggested by Cherkassky-Ma (see section B.3.2) andthe extended range [2−10;24] suggested by Lin and Lin (2003). Table B.1 showsthat using γ from Cherkassky-Ma yielded poorer results than those from Lin andLin. Henceforth, SVR will refer to the model using a γ search following Lin andLin.137Table B.1: Comparison of MAE and MSE values from the test set using SVR with the range of γ suggestedby Cherkassky and Ma (2004) and by Lin and Lin (2003) for the PRECIP, TEMP, and SO2 datasets, using ei-ther all predictors (ALL) or predictors selected by SLR.MAE MSECherkassky-Ma Lin and Lin Cherkassky-Ma Lin and LinALL SLR ALL SLR ALL SLR ALL SLRPRECIP 0.0534 0.0530 0.0379 0.0380 0.00951 0.00944 0.00572 0.00572TEMP 0.2520 0.2415 0.0622 0.0618 0.09199 0.08543 0.00642 0.00630SO2 0.0277 0.0263 0.0244 0.0244 0.00386 0.00371 0.00325 0.00326138In order to compute the relative accuracy between the MLR and the non-linearmethods we calculated the skill score (SS) (Hsieh, 2009) of the MAE and MSE.The SS is defined by:SS=A−ArefAperfect−Aref ; (B.13)where A is a particular measure of accuracy, Aperfect is the value of A for a set ofperfect forecasts (in our case MAE and MSE equal to zero), and Aref is the value ofA computed over the set of reference forecasts (in our case the MAE or MSE valueof the MLR model using all predictors). Positives SS means better performancethan the reference model and negative values, worse performance.According to the skill scores for PRECIP, the SVR (with the extended γ range)achieved the best MAE results (Figure B.1) and the SVR-ES the best MSE results(Figure B.2). That the two SVR methods perform better than the other methodsrelative to the MAE could be expected due to the ε-insensitive error norm of theSVR being more similar to the MAE.The SLR reduced the number of predictors from 106 to 59, cutting approxi-mately 45% of the predictors. Basically, the SLR did not have an impact on theMLR accuracy. For the ANN, the MAE SS is also essentially the same with andwithout prescreening by SLR. However, an improvement is noted in the MSE whenthe SLR was applied. The combination of SLR with M5 and RF tended to dimin-ish the skill. As M5 and RF do their own variable selection, it makes sense thatSLR is not helpful. On other hand, SVR-ES had a minor improvement with fewerpredictors.All the non-linear methods had better MAE results than the MLR (Figure B.1).However theM5 had poorMSE performance when compared with theMLR (around10% worse) (Figure B.2). The SVR-ES had good results when compared withMLR, with skill scores exceeding 20% in MAE and over 10% in MSE.Figures B.3 and B.4 show that for TEMP the SVR-ES achieved the best re-sults in both MAE and MSE. Excluding M5, all the non-linear methods again hadbetter MAE SS than the MLR. The SLR reduced the 106 predictors to 60, cuttingapproximately 44% of the predictors. Again the strongest improvement was ANNcombined with SLR. SLR was also beneficial to SVR-ES, SVR, and RF but, as inPRECIP, detrimental to M5.139MLR ANN SVR SVR−ES M5 RFALLSLRPRECIPS ki l l S co re o f MA E0 .0 00 .0 50 .1 00 .1 50 .2 00 .2 50 .3 0Figure B.1: Skill score of MAE for the PRECIP test dataset with all predic-tors used (dark bar) and with predictors selected by SLR (light bar).The reference forecast, MLR with all predictors, has simply 0 for theskill score.For the SO2 dataset, Figures B.5 and B.6 show RF as the best performer. How-ever, SVR-ES kept its good performance in both MSE and MAE. Figure B.5 showsagain that MLR is the worst in terms of MAE performance. SVR-ES has MAESS performance around 10% better than the MLR, and RF and SVR around 15%better. SLR reduced the number of predictors for SO2 from 27 to 21, cutting ap-proximately 23% of the predictors. Although SLR did not have major impact onthe forecast accuracy for the SO2 dataset, the results are still desirable since thereduction of predictors reduces computing time.To clarify why SVR performs better than SVR-ES in Figures B.1 and B.5, weperformed a small experiment on the PRECIP and SO2 datasets (with all predictors140MLR ANN SVR SVR−ES M5 RFALLSLRPRECIPS ki l l S co re o f MS E−0 .1 0−0 .0 50 .0 00 .0 50 .1 0Figure B.2: Skill score of MSE for the PRECIP test dataset.used) by varying the fitness function in ES. Instead of maximizing only (MSE)−1,we also tested using (MAE)−1 as fitness function, and calculated the skill scores us-ing the MLR values as reference. Figure B.7 compares the skill score of MAE andMSE with f1= (MSE)−1 as fitness function, and f2 = (MAE)−1 as fitness function.For the PRECIP dataset (Figure B.7a), with f1 as fitness function, SVR-ES hasworse MAE SS than SVR, but the two are comparable when f2 is used instead.However, for the MSE SS, SVR-ES did better than SVR, regardless of whetherf1 or f2 was used. Similarly, for the SO2 dataset (Figure B.7b), SVR-ES withf1 has worse MAE SS than SVR, but slightly better MAE SS than SVR with f2.For the MSE SS, SVR-ES (with either f1 or f2) did better than SVR, thought thedifference between f1 and f2 is more pronounced than in the PRECIP dataset.141MLR ANN SVR SVR−ES M5 RFALLSLRTEMPS ki l l S co re o f MA E−0 .0 20 .0 00 .0 20 .0 40 .0 60 .0 80 .1 0Figure B.3: Skill score of MAE for the TEMP test dataset.More guidelines on why an error measure can be minimized while a different errormeasure can remain unchanged or diminish can be found at Jachner et al. (2007)and Lima et al. (2010). In general, it is not possible for a single error measure toperform best on all criteria (e.g. cost, reliability, sensitivity, resistance to outliers,relationship to decision making, etc.) (Armstrong and Collopy, 1992), hence thechoice of the fitness function is dependent on the desired goal.The heart of the learning problem is generalization (Witten et al., 2011), i.e.,whether the methods can retain satisfactory performance on new datasets. Basedon the previous results (Figures B.2, B.3 and B.4), M5 is not recommended. On theother hand, ANN, SVR, SVR-ES, and RF have better performance than the MLRwhen considering both MAE and MSE over the three tested datasets.It is difficult to evaluate the computing time of two methods under every pa-142MLR ANN SVR SVR−ES M5 RFALLSLRTEMPS ki l l S co re o f MS E0 .0 00 .0 50 .1 00 .1 5Figure B.4: Skill score of MSE for the TEMP test dataset.rameter setting because different values of the SVR hyper-parameters affect thetraining time (Fan et al., 2005). To illustrate the time reduction provided by SLR,a small experiment using the PRECIP dataset was performed. Varying the size ofthe training dataset, we measured the cpu time used to train the SVR model (withγ fixed at 0:001) and forecast the test set, using (i) all predictors and (ii) only thepredictors selected by SLR. For 1000, 3000, 5000 and 7000 points in the trainingset, using all predictors required cpu times of 10, 135, 371 and 727 s, respectively,while using only the predictors selected by SLR required 5, 60, 212 and 469 s,respectively, thus confirming that screening of predictors by SLR saves computingtime.To provide some guidelines and explanation for SVR results, Figures B.8 andB.9 show changes related to the range of the γ value, with γ varying between143MLR ANN SVR SVR−ES M5 RFALLSLRSO2S ki l l S co re o f MA E0 .0 00 .0 50 .1 00 .1 50 .2 0Figure B.5: Skill score of MAE for the SO2 test dataset.[2−10;24]. The MAE and MSE values were rescaled to range between [0;1] in or-der to compare the sensitivity between the different datasets. First, as discussed inCherkassky-Ma, indeed there is dependency related to the γ values and the datasetdimensionality. However, this dependency is not present over the full tested range.For example, in Figures B.8 and B.9, the solid lines with triangles correspond tothe full dimensionality (106 predictors) and the dashed lines with triangles cor-respond to a reduced dimensionality (60 predictors) of the TEMP dataset; withinthe range of [2−10;2−6] there is no difference between the dimensionality and thevalues of MAE/MSE. The same behavior occurs in the SO2 dataset over the rangeof [2−10;2−4] (solid and dashed lines with x symbols). Another interesting pointis that the best γ values are independent of the dimensionality, i.e., solid lines anddashed lines converge to the same minimal point. This means that the dependency144MLR ANN SVR SVR−ES M5 RFALLSLRSO2S ki l l S co re o f MS E0 .00 .10 .20 .3Figure B.6: Skill score of MSE for the SO2 test dataset.of γ is not totally related to the dataset dimensionality but with the dataset charac-teristics, as can be seen in the range of [2−10;2−6]. However, in this range, theMAEand MSE results for the PRECIP dataset had different optimal γ values, though thisdifference is only one unit (log2γ) as the grid search used discrete steps of this stepsize.Finally, to show the differences between various approaches to setting SVRhyper-parameters, we performed an experiment on the PRECIP dataset again, us-ing all predictors and the same test set as before, but only the last 500 points of thetraining set. We tested the modified GA proposed by Leung et al. (2003) with dif-ferent parameters settings (GA1, GA2 and GA3), SVR using the procedure recom-mended by Cherkassky andMa (2004) but with the extended range γ = [2−10;24] assuggested by Lin and Lin (2003), SVR-ES with 200 generations (which is less than145f1 f2 SVRMSE Skill ScoreMAE Skill Score(a) PRECIPS ki l l S co re0 .0 00 .0 50 .1 00 .1 50 .2 00 .2 50 .3 0f1 f2 SVRMSE Skill ScoreMAE Skill Score(b) SO2S ki l l S co re0 .0 00 .0 50 .1 00 .1 50 .2 00 .2 50 .3 0Figure B.7: Skill score of MSE (dark bar) and MAE (light bar) for (a)PRECIP and (b) SO2 test datasets, where f1 denotes SVR-ESwith f1=(MSE)−1 as fitness function and f2 denotes SVR-ES withf2=(MAE)−1.the 500 used in GA1 and GA2) and SVR using 3-D grid search with the range ofthe three hyper-parameters recommended by Fan et al. (2005). For GA1, we used500 as the number of generations, 0.1 as the mutation rate, 0.9 as the crossoverweight and 50 as the size of the population (which was initialized randomly). ForGA2, the corresponding values were 500, 0.2, 0.8 and 20, respectively, while forGA3, the values were 200, 0.3, 0.5 and 100, respectively. The reference model wasagain the MLR (trained on the 500 points).SVR-ES and SVR had the best results according to the MSE and MAE SS(Figure B.10). As shown by Kramer et al. (2007), changing the GA parameterscan improve/worsen the final results. Similar to Figure B.1, SVR had better perfor-146−10 −8 −6 −4 −2 0 2 40 .00 .20 .40 .60 .81 .0log2γMA EPRECIPPRECIP+SLRTEMPTEMP+SLRSO2SO2+SLRFigure B.8: Results of MAE as γ varies between [2−10;24], for the PRECIPtest dataset with all predictors used (solid line with circles), with pre-dictors selected by SLR (dashed line with circles), for the TEMP testdataset with all predictors used (solid line with triangles) and with pre-dictors selected by SLR (dashed line with triangles), for the SO2 testdataset with all predictors used (solid line with x) and with predictorsselected by SLR (dashed line with x).mance in the MAE SS than SVR-ES, but this can be reversed by using (MAE)−1as the fitness function (Figure B.7). The 3-D grid search had similar performanceas GA1. In terms of computing efforts, SVR required 15 evaluations (for the 15different γ values used in the search), SVR-ES 200 evaluations, 3-D grid searchabout 1500, and the three GA models 1400–3500 evaluations.147−10 −8 −6 −4 −2 0 2 40 .00 .20 .40 .60 .81 .0log2γMS EPRECIPPRECIP+SLRTEMPTEMP+SLRSO2SO2+SLRFigure B.9: Results of MSE as γ varies between [2−10;24].B.6 Summary and ConclusionIn summary, we used three environmental datasets to test five non-linear forecast-ing methods (four well-known and one that we proposed). Except for M5, thenonlinear methods (ANN, SVR, SVR-ES and RF) generally outperformed the lin-ear method (MLR). Prescreening of predictors by SLR is generally beneficial forthe nonlinear models (except for M5), as it reduces the computing time and may in-crease the forecast skills (especially for ANN). As explained inWitten et al. (2011),there is no universal best learning method. SVR-ES had very good accuracy whenthe datasets were TEMP and PRECIP, while RF had the best accuracy when thedataset was SO2. During pollution events, the SO2 concentration spikes muchmore dramatically than the high values observed in the temperature and precipi-tation data, hence the architecture of RF may be particularly suited for predicting148GA1GA2GA3SVRSVR−ESGRIDPRECIPMSE MAESkill Score0.00.10.20.30.40.50.60.7Figure B.10: Results of MSE skill score (left side) andMAE skill score (rightside) for the PRECIP test dataset using the last 500 points of the train-ing set and all the predictors to train the models. The models arerespectively: the modified GA proposed by Leung et al. (2003) withdifferent parameters settings (GA1, GA2 and GA3), SVR with theprocedure recommended by Cherkassky and Ma (2004) and the ex-tended range γ = [2−10;24] suggested by Lin and Lin (2003), SVR-ESwith 200 generations and 3-D grid search with the hyper-parameters’range recommended by Fan et al. (2005).variables with a heavy-tailed distribution.The best overall method tends to be SVR-ES. SVR (with the Cherkassky-Maestimates for C and ε and an extended grid search in γ) worked well in terms ofthe MAE skill score, and provided satisfactory performance in terms of the MSE. Itused a relatively modest amount of computing time for the hyper-parameter search.When using the fitness function of (MSE)−1 in the ES, SVR-ES may sometimesunderperform SVR in terms of the MAE skill score, but changing the fitness func-149tion to (MAE)−1 appears to eliminate this problem.As recommended by Eiben and Smith (2003), a common approach is to startwith simple models; if the results are not of good enough quality, then move tomore complex models. The SVR-ES accuracy can be improved using more com-plex ES such as correlated mutations (Eiben and Smith, 2003) or covariance ma-trix adaptation evolution strategy (CMA-ES) (Friedrichs and Igel, 2005). Multi-processors can also be used in the computation, i.e. given a chromosome G (equa-tion B.7), k independent mutations can be made among ρ processors (Verhoevenand Aarts, 1995).150Appendix CAdditional performance statisticsfor OSELM151Table C.1: MAE, RMSE (m3=s) and NSE calculated over 1990-2011 (test set) for Englishman and Stave. For OSELM,the values are the mean over the 30 trials and the parentheses give the standard deviation. Day 1, day 2 and day 3are the forecast lead times. The five models are OSMLR using stepwise (OSMLR-S), OSMLR using all predictors(OSMLR-A), OSELM using stepwise (OSELM-S), OSELM using Boruta (OSELM-B) and OSELM using allpredictors (OSELM-A). The linear models of OSMLR are from single run instead of ensemble runs.EnglishmanMAE RMSE NSEday 1 day 2 day 3 day 1 day 2 day 3 day 1 day 2 day 3OSMLR-S 3.73 5.12 5.89 13.19 15.81 17.14 0.69 0.55 0.48OSMLR-A 3.71 5.12 5.86 13.13 15.78 17.06 0.69 0.56 0.48OSELM-S 2.38(0.03) 3.76(0.02) 4.66(0.02) 8.33(0.14) 12.34(0.10) 14.02(0.11) 0.88(0.00) 0.73(0.00) 0.65(0.01)OSELM-B 2.36(0.03) 3.76(0.04) 4.59(0.03) 8.32(0.13) 12.32(0.15) 13.62(0.12) 0.88(0.00) 0.73(0.01) 0.67(0.01)OSELM-A 2.34(0.03) 3.77(0.04) 4.59(0.03) 8.24(0.12) 12.30(0.15) 13.61(0.12) 0.88(0.00) 0.73(0.01) 0.67(0.01)StaveMAE RMSE NSEday 1 day 2 day 3 day 1 day 2 day 3 day 1 day 2 day 3OSMLR-S 7.53 10.43 12.04 18.62 23.22 25.38 0.71 0.54 0.45OSMLR-A 7.52 10.36 11.75 18.60 23.05 25.02 0.71 0.55 0.47OSELM-S 5.06(0.06) 7.89(0.07) 9.35(0.08) 12.90(0.15) 17.86(0.15) 20.53(0.16) 0.86(0.00) 0.73(0.00) 0.64(0.01)OSELM-B 4.96(0.04) 7.70(0.11) 9.01(0.08) 12.76(0.13) 17.64(0.21) 20.11(0.12) 0.86(0.00) 0.74(0.01) 0.66(0.00)OSELM-A 4.97(0.05) 7.75(0.08) 9.01(0.07) 12.79(0.13) 17.72(0.16) 20.12(0.15) 0.86(0.00) 0.73(0.01) 0.66(0.01)152Table C.2: MAE, RMSE (m3=s) and NSE calculated over 1990-2011 (test set) for Englishman and Stave. For OSELM,the values are the mean over the 30 trials and the parentheses give the standard deviation. Day 1, day 2 and day3 are the forecast lead times. The six models are OSMLR updated daily (OSMLR-Daily), monthly (OSMLR-Monthly), and yearly (OSMLR-Yearly), and the OSELM updated daily (OSELM-Daily), monthly (OSELM-Monthly) and yearly (OSELM-Yearly).EnglishmanMAE RMSE NSEDay 1 Day 2 Day 3 Day 1 Day 2 Day 3 Day 1 Day 2 Day 3OSMLR-Daily 3.71 5.12 5.86 13.13 15.78 17.06 0.69 0.56 0.48OSMLR-Monthly 3.71 5.13 5.88 13.10 15.81 17.09 0.69 0.55 0.48OSMLR-Yearly 3.71 5.13 5.89 13.06 15.77 17.08 0.70 0.56 0.48OSELM-Daily 2.34(0.03) 3.77(0.04) 4.59(0.03) 8.24(0.12) 12.30(0.15) 13.61(0.12) 0.88(0.00) 0.73(0.01) 0.67(0.01)OSELM-Monthly 2.38(0.03) 3.82(0.03) 4.66(0.03) 8.36(0.11) 12.43(0.14) 13.79(0.11) 0.88(0.00) 0.72(0.01) 0.66(0.01)OSELM-Yearly 2.41(0.02) 3.85(0.03) 4.71(0.02) 8.40(0.10) 12.51(0.13) 13.96(0.13) 0.87(0.00) 0.72(0.01) 0.65(0.01)StaveMAE RMSE NSEDay 1 Day 2 Day 3 Day 1 Day 2 Day 3 Day 1 Day 2 Day 3OSMLR-Daily 7.52 10.36 11.75 18.60 23.05 25.02 0.71 0.55 0.47OSMLR-Monthly 7.53 10.38 11.79 18.60 23.09 25.07 0.71 0.55 0.47OSMLR-Yearly 7.54 10.39 11.80 18.59 23.08 25.06 0.71 0.55 0.47OSELM-Daily 4.97(0.05) 7.75(0.08) 9.01(0.07) 12.79(0.13) 17.72(0.16) 20.12(0.15) 0.86(0.00) 0.73(0.00) 0.66(0.00)OSELM-Monthly 5.04(0.05) 7.85(0.08) 9.15(0.06) 12.95(0.12) 17.86(0.16) 20.41(0.16) 0.86(0.00) 0.73(0.00) 0.65(0.01)OSELM-Yearly 5.12(0.04) 7.95(0.06) 9.22(0.05) 13.13(0.13) 17.96(0.13) 20.47(0.18) 0.85(0.00) 0.73(0.00) 0.64(0.01)153Table C.3: Mean and standard deviation (in parenthesis) of the MAE, RMSE (m3=s) and NSE of the 30 trials calculatedover 1990-2011 (test set) for Englishman and Stave. Day 1, day 2 and day 3 are the forecast lead times, and 1year-OSELM, 3 years-OSELM and 5 years-OSELM are the OSELM runs initialized with 1, 3 and 5 years of data,respectively.EnglishmanMAE RMSE NSEDay 1 Day 2 Day 3 Day 1 Day 2 Day 3 Day 1 Day 2 Day 31 year-OSELM 2.59(0.07) 3.95(0.06) 4.83(0.08) 9.03(0.18) 12.68(0.17) 14.29(0.25) 0.85(0.00) 0.71(0.01) 0.64(0.01)3 years-OSELM 2.41(0.03) 3.83(0.03) 4.80(0.05) 8.39(0.14) 12.36(0.13) 14.30(0.17) 0.87(0.00) 0.73(0.01) 0.64(0.01)5 years-OSELM 2.34(0.03) 3.77(0.04) 4.59(0.03) 8.24(0.12) 12.30(0.15) 13.61(0.12) 0.88(0.00) 0.73(0.01) 0.67(0.01)StaveMAE RMSE NSEDay 1 Day 2 Day 3 Day 1 Day 2 Day 3 Day 1 Day 2 Day 31 year-OSELM 5.44(0.12) 8.28(0.23) 9.62(0.30) 13.87(0.29) 18.85(0.57) 21.28(0.63) 0.84(0.01) 0.70(0.02) 0.62(0.02)3 years-OSELM 5.27(0.10) 8.19(0.10) 9.55(0.12) 13.54(0.25) 18.80(0.22) 21.22(0.27) 0.84(0.01) 0.70(0.01) 0.62(0.01)5 years-OSELM 4.97(0.05) 7.75(0.08) 9.01(0.07) 12.79(0.13) 17.72(0.16) 20.12(0.15) 0.86(0.00) 0.73(0.00) 0.66(0.00)1540 100 200 300 40001 002 003 004 00Day 1 − OSELM − Dailyy^y0 100 200 300 40001 002 003 004 00Day 2 − OSELM − Dailyy^y0 100 200 300 40001 002 003 004 00Day 3 − OSELM − Dailyy^y0 100 200 300 40001 002 003 004 00Day 1 − OSELM − Monthlyy^y0 100 200 300 40001 002 003 004 00Day 2 − OSELM − Monthlyy^y0 100 200 300 40001 002 003 004 00Day 3 − OSELM − Monthlyy^y0 100 200 300 40001 002 003 004 00Day 1 − OSELM − Yearlyy^y0 100 200 300 40001 002 003 004 00Day 2 − OSELM − Yearlyy^y0 100 200 300 40001 002 003 004 00Day 3 − OSELM − Yearlyy^yFigure C.1: Scatter plot of the observed (y) and predicted (yˆ) streamflow val-ues for Englishman. The three models are the OSELM updated daily(OSELM-Daily), monthly (OSELM-Monthly) and yearly (OSELM-Yearly). The black line is the linear regression between yˆ and y, whilethe dashed line is when y= yˆ.1550 100 200 300 40001 002 003 004 00Day 1 − OSELM − Dailyy^y0 100 200 300 40001 002 003 004 00Day 2 − OSELM − Dailyy^y0 100 200 300 40001 002 003 004 00Day 3 − OSELM − Dailyy^y0 100 200 300 40001 002 003 004 00Day 1 − OSELM − Monthlyy^y0 100 200 300 40001 002 003 004 00Day 2 − OSELM − Monthlyy^y0 100 200 300 40001 002 003 004 00Day 3 − OSELM − Monthlyy^y0 100 200 300 40001 002 003 004 00Day 1 − OSELM − Yearlyy^y0 100 200 300 40001 002 003 004 00Day 2 − OSELM − Yearlyy^y0 100 200 300 40001 002 003 004 00Day 3 − OSELM − Yearlyy^yFigure C.2: Scatter plot of the observed (y) and predicted (yˆ) streamflowvalues for Stave. The three models are the OSELM updated daily(OSELM-Daily), monthly (OSELM-Monthly) and yearly (OSELM-Yearly). The black line is the linear regression between yˆ and y, whilethe dashed line is when y= yˆ.1560 100 200 300 40001 002 003 004 00Day 1 − OSELM − 1 yeary^y0 100 200 300 40001 002 003 004 00Day 2 − OSELM − 1 yeary^y0 100 200 300 40001 002 003 004 00Day 3 − OSELM − 1 yeary^y0 100 200 300 40001 002 003 004 00Day 1 − OSELM − 3 yearsy^y0 100 200 300 40001 002 003 004 00Day 2 − OSELM − 3 yearsy^y0 100 200 300 40001 002 003 004 00Day 3 − OSELM − 3 yearsy^y0 100 200 300 40001 002 003 004 00Day 1 − OSELM − 5 yearsy^y0 100 200 300 40001 002 003 004 00Day 2 − OSELM − 5 yearsy^y0 100 200 300 40001 002 003 004 00Day 3 − OSELM − 5 yearsy^yFigure C.3: Scatter plot of the observed (y) and predicted (yˆ) streamflow val-ues for Englishman. The three models are OSELM-Daily trained dur-ing the initialization phase with 1, 3 and 5 years of data respectively.The black line is the linear regression between yˆ and y, while the dashedline is when y= yˆ.1570 100 200 300 40001 002 003 004 00Day 1 − OSELM − 1 yeary^y0 100 200 300 40001 002 003 004 00Day 2 − OSELM − 1 yeary^y0 100 200 300 40001 002 003 004 00Day 3 − OSELM − 1 yeary^y0 100 200 300 40001 002 003 004 00Day 1 − OSELM − 3 yearsy^y0 100 200 300 40001 002 003 004 00Day 2 − OSELM − 3 yearsy^y0 100 200 300 40001 002 003 004 00Day 3 − OSELM − 3 yearsy^y0 100 200 300 40001 002 003 004 00Day 1 − OSELM − 5 yearsy^y0 100 200 300 40001 002 003 004 00Day 2 − OSELM − 5 yearsy^y0 100 200 300 40001 002 003 004 00Day 3 − OSELM − 5 yearsy^yFigure C.4: Scatter plot of the observed (y) and predicted (yˆ) streamflow val-ues for Stave. The three models are OSELM-Daily trained during theinitialization phase with 1, 3 and 5 years of data respectively. The blackline is the linear regression between yˆ and y, while the dashed line iswhen y= yˆ.158Appendix DHC algorithm with VC-OSELM1591 begin2 acce //Defines the factor of expansion and contraction of thesearch space;3 stepsize //Defines the expansion and contraction of the searchspace;4 candidates← [−1=acce;0;1=2∗acce;acce] //Vector containing U candidatessolutions;5 L //Initial number of HN of the operational forecast;6 validation:set //Set containing Yˆ of the last S days of all candidates(dimension: U×S);7 candidates:res //Candidate reservoir;8 for j=1 toU do9 // number of HN according to the candidate and the step size;10 HN:candidate← round(L+ stepsize∗ candidates[ j]);11 candidates:res[ j]← Generate the candidate using HN:candidate (Step I of Algorithm 6);12 end13 Set k = 0 and s= 0;14 while End of set testing set = False do15 (k+1)th chunk of new data {Xk+1;Yk+1} arrives;16 for j=1 toU do17 Using candidates:res[ j] and Xk+1, calculate Yˆk+1;18 Using candidates:res[ j] and {Xk+1;Yk+1}, calculate: Hk+1, Pk+1; Bˆk+1;19 candidates:res[ j]← Hk+1, Pk+1; Bˆk+1;20 validation:set[ j;s] = Yˆk+1;21 end22 {X′;Y′}= {X′;Y′}∪{Xk+1;Yk+1} ;23 Set k = k+1 and s= s+1;24 if s= S then25 Calculate RMSE between validation:set and Y′;26 best.solution← index of the smallest RMSE among all the candidates;27 // Checking if the best candidate is the same as in theprevious step;28 if best.solution is not 2 then29 L← number of HN of the best.solution;30 candidates:res[2] = candidates:res[best:solution];31 for j=3 toU do32 δL← round(L+ stepsize∗ candidates[ j])−L;33 candidates:res[ j]← add δL in candidates:res[2; j] (Step III of Algorithm 6);34 end35 δL← L− round(L+ stepsize∗ candidates[ j]);36 candidates:res[1]← remove δL in candidates:res[2] (Step III of Algorithm 6);37 stepsize← maximum(1;round(stepsize∗ candidates[best:solution]));38 Buffer {X′;Y′}=∅ ;39 Set s= 0;40 end41 end42 end43 endAlgorithm 8: Hill climbing algorithm used to identify the optimum numberof HN in the VC-OSELM models.160
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Short-term hydro-meteorological forecasting with extreme...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Short-term hydro-meteorological forecasting with extreme learning machines Lima, Aranildo R. 2016
pdf
Page Metadata
Item Metadata
Title | Short-term hydro-meteorological forecasting with extreme learning machines |
Creator |
Lima, Aranildo R. |
Publisher | University of British Columbia |
Date Issued | 2016 |
Description | In machine learning (ML), the extreme learning machine (ELM), a feedforward neural network model which assigns random weights in the single hidden layer and optimizes only the weights in the output layer, has the fully nonlinear modelling capability of the traditional artificial neural network (ANN) model but is solved via linear least squares, as in multiple linear regression (MLR). Chapter 2 evaluated ELM against MLR and three nonlinear ML methods (ANN, support vector regression and random forest) on nine environmental regression problems. ELM was then developed for short-term forecasting of hydro-meteorological variables. In situations where new data arrive continually, the need to make frequent model updates often renders ANN impractical. An online learning algorithm – the online sequential extreme learning machine (OSELM) – is automatically updated inexpensively as new data arrive. In Chapter 3, OSELM was applied to forecast daily streamflow at two small watersheds in British Columbia, Canada, at lead times of 1–3 days. Predictors used were weather forecast data generated by the NOAA Global Ensemble Forecasting System (GEFS), and local hydro-meteorological observations. OSELM forecasts were tested with daily, monthly or yearly model updates, with the nonlinear OSELM easily outperforming the benchmark, the online sequential MLR (OSMLR). A major limitation of OSELM is that the number of hidden nodes (HN), which controls the model complexity, remains the same as in the initial model, even when the arrival of new data renders the fixed number of HN sub-optimal. A new variable complexity online sequential extreme learning machine (VC-OSELM), proposed in Chapter 4, automatically adds or removes HN as online learning proceeds, so the model complexity self-adapts to the new data. For streamflow predictions at lead time of one day, VC-OSELM outperformed OSELM when the initial number of HN turned out to be smaller or larger than optimal. In summary, by using linear least squares instead of nonlinear optimization, ELM offers a major advantage over a traditional method like ANN. In situations where new data arrive continually, OSELM and VC-OSELM were shown in this thesis to be more useful than ANN and OSMLR. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2016-07-01 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0305711 |
URI | http://hdl.handle.net/2429/58366 |
Degree |
Doctor of Philosophy - PhD |
Program |
Atmospheric Science |
Affiliation |
Science, Faculty of Earth, Ocean and Atmospheric Sciences, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2015-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2016_september_lima_aranildo.pdf [ 15.74MB ]
- Metadata
- JSON: 24-1.0305711.json
- JSON-LD: 24-1.0305711-ld.json
- RDF/XML (Pretty): 24-1.0305711-rdf.xml
- RDF/JSON: 24-1.0305711-rdf.json
- Turtle: 24-1.0305711-turtle.txt
- N-Triples: 24-1.0305711-rdf-ntriples.txt
- Original Record: 24-1.0305711-source.json
- Full Text
- 24-1.0305711-fulltext.txt
- Citation
- 24-1.0305711.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0305711/manifest