UBC Faculty Research and Publications

Evaluation of Probabilistic Medium-Range Temperature Forecasts from the North American Ensemble Forecast System. McCollor, Doug; Stull, Roland B. 28 Feb 2009
Evaluation of Probabilistic Medium-Range Temperature Forecasts from the North American Ensemble Forecast System

DOUG MCCOLLOR, University of British Columbia, and BC Hydro Corporation, Vancouver, British Columbia, Canada
ROLAND STULL, University of British Columbia, Vancouver, British Columbia, Canada

(Manuscript received 18 February 2008, in final form 19 August 2008)

ABSTRACT

Ensemble temperature forecasts from the North American Ensemble Forecast System were assessed for quality against observations for 10 cities in western North America, for a 7-month period beginning in February 2007. Medium-range probabilistic temperature forecasts can provide information for those economic sectors exposed to temperature-related business risk, such as agriculture, energy, transportation, and retail sales. The raw ensemble forecasts were postprocessed, incorporating a 14-day moving-average forecast–observation difference, for each ensemble member. This postprocessing reduced the mean error in the sample to 0.6°C or less. It is important to note that the North American Ensemble Forecast System available to the public provides bias-corrected maximum and minimum temperature forecasts. Root-mean-square-error and Pearson correlation skill scores, applied to the ensemble-average forecast, indicate positive, but diminishing, forecast skill (compared to climatology) from 1 to 9 days into the future. The probabilistic forecasts were evaluated using the continuous ranked probability skill score, the relative operating characteristic skill score, and a value assessment incorporating cost–loss determination. The full suite of ensemble members provided skillful forecasts 10–12 days into the future. A rank histogram analysis was performed to test ensemble spread relative to the observations. Forecasts are underdispersive early in the forecast period, for forecast days 1 and 2. Dispersion improves rapidly but remains somewhat underdispersive through forecast day 6.
The forecasts show little or no underdispersion beyond forecast day 6. A new skill versus spread diagram is presented that shows the trade-off between higher skill but low spread early in the forecast period and lower skill but better spread later in the forecast period.

1. Introduction

Accurate weather forecasts help mitigate risk to those economic sectors exposed to weather-related business losses. Astute business operators can also incorporate accurate weather forecasts for financial gain for their organization. Probabilistic forecasts are especially useful in making economic decisions because the expected gain or loss arising from a particular weather-related event can be measured directly against the probability of that event actually occurring. For the example of hydroelectric systems, temperature affects the demand for electricity (from electric heaters and air conditioners) and the resistance (therefore maximum electric load capacity) of transmission lines. In high-latitude countries river-ice formation affects hydroelectric generation planning during winter. Temperature effects on hydroelectric systems also include the height above the ground of transmission lines (lines sag closer to the ground as temperatures rise) and precipitation type (rain versus snow), an important factor determining the inflow response in watersheds.

Of course, weather forecast accuracy diminishes as the forecast lead time increases; this is true for probabilistic forecasts as well as deterministic forecasts. This paper aims to analyze a set of midrange probabilistic temperature forecasts and measure the quality and value of the forecasts as a function of lead time.

Corresponding author address: Doug McCollor, Dept. of Earth and Ocean Sciences, University of British Columbia, 6339 Stores Rd., Vancouver, BC V6T 1Z4, Canada. E-mail: doug.mccollor@bchydro.com

Weather and Forecasting, Volume 24, February 2009. DOI: 10.1175/2008WAF2222130.1. © 2009 American Meteorological Society
Decision makers using this forecast system will then have a valid estimate of how far into the future these particular forecasts provide valuable information.

Probability estimates of future weather are most often derived from an ensemble of individual forecasts. Many short-range ensemble forecast (SREF) systems, designed to provide forecasts in the day 1 and day 2 time frame, provide high-quality probabilistic forecasts by varying the model physics to derive their ensemble members (Stensrud et al. 2000; Eckel and Mass 2005; Jones et al. 2007; McCollor and Stull 2008b). Medium-range ensemble forecast (MREF) systems, necessary for probabilistic forecasts beyond day 2, can be designed to represent the errors inherent in synoptic flow patterns in the medium-range period (Buizza et al. 2005; Roulin and Vannitsem 2005; Tennant et al. 2007).

The idea of combining ensemble forecasts from different national meteorological centers, thus taking advantage of different modes of ensemble member generation and greatly increasing the overall ensemble size, resulted in the formation of the North American Ensemble Forecast System (NAEFS; Toth et al. 2006). We investigate the quality and value of NAEFS temperature forecasts for 10 cities in western North America.

Section 2 describes the forecasts and observations analyzed in the MREF study. Section 3 describes the verification metrics used to evaluate the probabilistic forecasts and discusses the results, and section 4 summarizes the findings and concludes the paper.

2. Methodology

a. Ensemble forecasts

Raw (unprocessed) temperature forecasts were obtained and stored in real time from the North American Ensemble Forecast System. The NAEFS is a joint project involving the Meteorological Service of Canada (MSC), the U.S. National Weather Service (NWS), and the National Meteorological Service of Mexico (NMSM). NAEFS was officially launched in November 2004 and combines state-of-the-art ensembles developed at MSC and NWS.
It is important to note that the NAEFS available to the public provides bias-corrected maximum and minimum temperature forecasts. The grand ensemble, or superensemble, provides weather forecast guidance for the 1–16-day period, allowing for the inclusion of ensemble members from different generating schemes. The original configuration of the NWS-produced ensembles consisted of one control run plus 14 perturbed ensemble members produced by a bred-vector perturbation method (Buizza et al. 2005; Toth and Kalnay 1993). This initial-condition perturbation method assumed that fast-growing errors develop naturally in a data assimilation cycle and that errors will continue to grow through the medium-range forecast cycle. The original MSC ensemble prediction system (EPS) configuration consisted of one control run plus 16 perturbed ensemble members. The perturbation approach developed at MSC (Houtekamer et al. 1996; Buizza et al. 2005) generated initial conditions by assimilating randomly perturbed observations. This multimodel approach ran different model versions (involving variable physical parameterizations) incorporating a number of independent data assimilation cycles. At the time of this study, no forecasts were available from the NMSM.

In addition, it must be noted that the ensemble system's design, including strategies to sample observation error, initial-condition error, and model error, as well as addressing the issue of member resolution versus ensemble size, is continually evolving. Recent information describing current national ensemble prediction systems can be found in a report from a November 2007 workshop on ensemble prediction (available online at http://www.ecmwf.int/newsevents/meetings/workshops/2007/ensemble_prediction/wg1.pdf; accessed 12 April 2008).
Current details on improvements to the MSC EPS are also available online (at http://www.ecmwf.int/newsevents/meetings/workshops/2007/MOS_11/presentations_files/Gagnon.pdf; accessed 14 April 2008). In May 2006 the NWS EPS changed its initial perturbation generation technique to the ensemble transform with rescaling (ETR) technique, where the initial perturbations are restrained by the best available analysis variance from the operational data assimilation system (Wei et al. 2008). In March 2007 the NWS ensemble increased to 20 members integrated from an 80-member ETR-based ensemble. In July 2007 MSC increased its ensemble size from 16 to 20 members (information online at http://www.smc-msc.ec.gc.ca/cmc/op_systems/doc_opchanges/genot_20070707_e_f.pdf; accessed 17 April 2008) to conform to the NWS ensemble size. MSC also increased the horizontal resolution of the members and extended the physical parameterization package. The complete current status of the MSC EPS can be found online (http://www.ecmwf.int/newsevents/meetings/workshops/2007/ensemble_prediction/presentations/houtekamer.pdf; accessed 17 April 2008).

b. Observations

Unprocessed forecasts for days 1–15 were retrieved from the online NAEFS over the period 15 February–1 October 2007. An ensemble size of 32 members was maintained throughout this period, to maintain consistency in the process, despite operational changes at the national centers as described above. Ten city locations (five in Canada and five in the United States) were chosen to represent different climate regimes in far western North America (see Table 1 for the list of North American cities in the study and Fig. 1 for a reference map). For each day and each city, the forecast valid at 1600 local time (0000 UTC) was verified against the observed maximum temperature for the day; the forecast valid at 0400 local time (1200 UTC) was verified against the observed minimum temperature for the day.
Temperatures are accurate to within ±0.6°C, determined from the Instrument Requirements and Standards for the NWS Surface Observing Programs (information online at http://www.weather.gov/directives/sym/pd01013002curr.pdf; accessed 3 June 2008).

c. Method used

Forecast system performance can be verified against actual point observations or against model-based gridded analyses. Gridded analyses directly from the models generally offer more data for investigation. However, gridded analyses are subject to model-based data estimation errors and smoothing errors. Even though point observations are subject to instrumentation error and site representativeness error, it is preferable to verify forecasts against observations as opposed to model analyses (Hou et al. 2001).

The mean error of the maximum and minimum temperature forecasts for each ensemble member for each forecast day was evaluated from the sample. Random errors caused by initial-condition and model-parameterization errors can be reduced by using an ensemble of forecasts. Bias in the forecasts is largely attributable to local terrain and topographic effects being poorly represented in the forecast models. Bias is evaluated by calculating the mean error in the forecasts and can be reduced via postprocessing of the direct model output (DMO).

Regarding postprocessing, the authors previously found (McCollor and Stull 2008a) that a 14-day moving-average filter was effective in reducing the bias in a medium-range maximum and minimum temperature forecast sample. Therefore, a 14-day moving-average filter is applied here to each of the ensemble member raw forecasts (for each station location and each forecast lead time) to minimize the bias in the DMO forecasts. Care was taken in this study to ensure the method is operationally achievable. That is, only observations that would be available in a real-time setting for each forecast period were used in the moving-average filter.
For example, all day 5 forecasts included a moving-average filter incorporating forecast–observation pairs from 19 days prior to the forecast valid time through 6 days prior to the forecast time to achieve a 14-day moving-average bias correction. This method of error reduction for ensemble prediction is known as the bias-corrected ensemble (BCE) approach (Yussouf and Stensrud 2007). The scheme responds quickly to any new station locations, new models, or model upgrades. The results of the filter application are presented in Fig. 2. The calculations are derived over all forecast–observation pairs between 1 March and 1 October 2007 (the first 14 days of the sample were reserved to begin the moving-average postprocessing filter). Mean error in the DMO ensemble-average daily maximum temperature forecasts, ranging from −3.6°C (day 1 forecasts) to −2.6°C (day 15 forecasts), was reduced to a range from 0.0°C (day 1 forecasts) to +0.6°C (day 15 forecasts) in the postprocessed forecasts. Similarly, mean error in the DMO ensemble-average daily minimum temperature forecasts, near −1.5°C throughout the 15-day forecast cycle, was reduced to the 0.0°C to +0.3°C range in the postprocessed forecasts. These bias-corrected forecasts are used in the subsequent ensemble evaluations in this paper.

The final phase in designing an ensemble prediction system lies in choosing a weighting scheme for the individual ensemble members in the computation of the ensemble average. In addition to applying equal weighting to all ensemble members by simply averaging all members, Yussouf and Stensrud (2007) explored performance-based weighted BCE schemes wherein unequal weights, based on an individual member's past forecast accuracy, are assigned to each ensemble member prior to calculating the ensemble mean forecast.
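The moving-average bias correction described above can be sketched in a few lines of Python. This is a minimal illustration for a single member at a single station and lead time, not the authors' operational code; the function names and array layout are our assumptions:

```python
import numpy as np

def moving_average_bias(past_fcsts, past_obs, window=14):
    """Mean (forecast - observation) error over the most recent `window` pairs."""
    f = np.asarray(past_fcsts[-window:], dtype=float)
    o = np.asarray(past_obs[-window:], dtype=float)
    return float(np.mean(f - o))

def bce_correct(raw_forecast, past_fcsts, past_obs, window=14):
    """Bias-corrected ensemble (BCE) member: subtract the recent mean error.

    The caller must pass only forecast-observation pairs already available in
    real time (e.g., for a day 5 forecast, pairs ending 6 days before the
    valid time, as in the example above)."""
    return raw_forecast - moving_average_bias(past_fcsts, past_obs, window)
```

With a constant +2°C warm bias in the trailing 14 pairs, a raw 15.0°C forecast is corrected to 13.0°C.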
However, results from the Yussouf and Stensrud (2007) study indicate there is no significant improvement over the original 2-m temperature BCE forecasts when performance-based weighting schemes are applied to the EPS individual members. Therefore, a BCE approach for computing the ensemble average from equally weighted ensemble members is chosen for our study of medium-range temperature forecasts.

TABLE 1. Ten cities for which forecasts and associated verifying observations were evaluated.

City               ICAO* identifier   Lat (°N)   Lon (°W)
High Level, AB     CYOJ               58.6       117.2
Peace River, AB    CYPE               56.2       117.4
Fort St. John, BC  CYXJ               56.2       120.7
Victoria, BC       CYYJ               48.6       123.4
Vancouver, BC      CYVR               49.2       123.2
Seattle, WA        KSEA               47.4       122.3
Spokane, WA        KGEG               47.6       117.5
Portland, OR       KPDX               45.6       122.6
Sacramento, CA     KSAC               38.5       121.5
Bakersfield, CA    KBFL               35.4       119.1

* ICAO = International Civil Aviation Organization.

3. Results and discussion

A single verification score is generally inadequate for evaluating all of the desired information about the performance of an ensemble prediction system (Murphy and Winkler 1987; Murphy 1991). Different measures, emphasizing different aspects and attributes of forecast performance, should be employed to assess the statistical reliability, resolution, and discrimination of an EPS. A standardized set of evaluation methods, scores, and diagrams for interpreting ensemble forecasts is provided in Hamill et al. (2000) and is incorporated here to assess a MREF system for daily maximum and minimum temperature forecasts. The reader is referred to Wilks (2006) or Toth et al. (2003) for comprehensive descriptions of verification measures.

a. Skill scores

Initial assessment of the forecasts utilized a root-mean-square-error skill score (RMSESS) and a Pearson correlation skill score (CORRSS) evaluated for the ensemble average.
A general skill score definition is given by

    skill score = (score − score_ref) / (score_perfect − score_ref),    (1)

where score_ref is the score for the reference forecasts. In this case the reference forecasts are the climatological mean values of the maximum and minimum temperatures for each individual city location, for each day, averaged over the 30-yr normal period 1971–2000. Here, score_perfect is the score the forecasts would have received if they were perfect. Skill scores range from zero (for no skill, or skill equivalent to the reference forecasts) to 1 (for perfect forecasts). Negative skill score values indicate the forecasts are worse than the reference forecasts.

RMSESS and CORRSS were evaluated for the average of the 32 ensemble members for each of the forecast days from 1 through 15. The results are shown in Fig. 3. Both the RMSESS and CORRSS for the ensemble-average forecast indicate similar results in terms of forecast skill. Forecast skill, as expected, is highest for the day 1 forecast and gradually diminishes. By day 9, forecast skill is negligible compared to the climatological reference forecasts. By day 10 the forecasts show no skill compared to the climatological reference forecasts, and for days 11–15 the NAEFS forecasts have worse skill than do the climatological forecasts.

Figure 3 also compares the ensemble-average forecast with the deterministic 5-day forecast from the MSC operational Global Environmental Multiscale (GEM) model (Côté et al. 1998a,b). Forecast skill is comparable between the ensemble average and deterministic forecast early in the forecast period, but the ensemble average outperforms the deterministic forecast in terms of RMSESS and CORRSS as the forecast period progresses. A summary of the equations used for the meteorological statistical analysis of the MREF is provided in the appendix.
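Equation (1) specializes directly to the RMSE skill score: for RMSE, a perfect forecast scores 0, so RMSESS = 1 − RMSE/RMSE_ref. A short Python sketch (our illustrative function names, with hypothetical toy arrays):

```python
import numpy as np

def skill_score(score, score_ref, score_perfect):
    """Generic skill score, Eq. (1)."""
    return (score - score_ref) / (score_perfect - score_ref)

def rmse(pred, obs):
    """Root-mean-square error of a forecast series against observations."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def rmsess(ens_mean, climatology, obs):
    """RMSE skill score against a climatological reference (perfect RMSE = 0)."""
    return skill_score(rmse(ens_mean, obs), rmse(climatology, obs), 0.0)
```

A forecast matching the observations exactly yields RMSESS = 1; a forecast identical to climatology yields 0, matching the score's stated range.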
b. Continuous ranked probability score

The continuous ranked probability score (CRPS; Hersbach 2000) is a complete generalization of the Brier score. The Brier score (Brier 1950; Atger 2003; McCollor and Stull 2008b) is a verification measure designed to quantify the performance of probabilistic forecasts of dichotomous event classes. The discrete ranked probability score (RPS; Murphy 1969; Toth et al. 2003) extends the range of the Brier score to more event classes. The CRPS extends the limit of the RPS to an infinite number of classes.

[Fig. 1: Reference map for the western North American cities analyzed in this study.]

The CRPS has several distinct advantages over the Brier score and the RPS. First, the CRPS is sensitive to the entire permissible range of the parameter. In addition, the CRPS does not require a subjective threshold parameter value (as does the Brier score) or the introduction of a number of predefined classes (as does the RPS). Different choices of threshold value (Brier score) or number and position of classes (RPS) result in different outcomes, a weakness in these scores not shared with the CRPS. The CRPS can be interpreted as an integral over all possible Brier scores.

A major advantage of the Brier score is that it can be decomposed into a reliability component, a resolution component, and an uncertainty component (Murphy 1973; Toth et al. 2003). In a similar fashion, Hersbach (2000) showed how the CRPS can be decomposed into the same three components:

    CRPS = Reli − Resol + Unc.    (2)

The CRPS components can be converted to a positively oriented skill score (CRPSS) in the same manner that the Brier score components are converted to a skill score (BSS; McCollor and Stull 2008b):

    CRPSS = Resol/Unc − Reli/Unc
          = relative resolution − relative reliability
          = CRPS_RelResol − CRPS_RelReli.

CRPSS results are shown in Fig. 4.
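For a finite ensemble, the CRPS of a single forecast can be computed directly from the members using the standard empirical-distribution estimator, CRPS = mean_i|x_i − y| − ½·mean_{i,j}|x_i − x_j|. This estimator comes from the general verification literature rather than from this paper; the sketch below is our illustration:

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of one ensemble forecast against a single scalar observation,
    via the empirical-CDF form:
    CRPS = mean_i |x_i - y| - 0.5 * mean_{i,j} |x_i - x_j|."""
    x = np.asarray(members, dtype=float)
    term_obs = np.mean(np.abs(x - obs))                     # spread about the obs
    term_pairs = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))  # internal spread
    return float(term_obs - term_pairs)

def crpss(crps_forecast, crps_reference):
    """Skill score form of Eq. (1) with score_perfect = 0 for the CRPS."""
    return 1.0 - crps_forecast / crps_reference
```

For a one-member "ensemble" the CRPS reduces to the absolute error, and for members [0, 2] with an observation of 1 it equals 0.5, which can be verified by integrating the squared difference between the ensemble step-function CDF and the observation's Heaviside function.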
The CRPSS was calculated on forecast and observation anomalies, defined as the difference between the forecast or the observation and the climatological value for that city location on a particular date. The forecasts are reliable throughout the forecast period, as indicated by the CRPS_RelReli term being near zero for all forecasts. The CRPS_RelResol term, however, and hence the CRPSS, is highest for the day 1 forecast and exhibits diminishing skill as the forecast period extends through the day 12 forecast. For forecast days 13–15, both maximum and minimum daily temperature forecasts exhibit negligible or no skill compared to the reference term.

[Fig. 2: Mean error for ensemble-average DMO forecasts (squares) and ensemble-average postprocessed forecasts (circles) for (top) daily maximum and (bottom) daily minimum temperatures for forecast days 1–15.]

[Fig. 3: (top) RMSE and (bottom) correlation skill scores for daily maximum (squares) and daily minimum (circles) temperature forecasts for forecast days 1–15. The ensemble-average forecast (solid line) is compared with a deterministic operational forecast (dashed line).]

c. ROC area

The relative operating characteristic (ROC) is a verification measure based on signal detection theory (Mason 1982; Mason and Graham 1999; Gallus et al. 2007) that provides a method of discrimination analysis for an EPS. The ROC curve is a graph of hit rate versus false alarm rate for a particular variable threshold (see the appendix for definitions of these terms and Table 2 for the contingency table basis for this analysis). Perfect discrimination is represented by a ROC curve that rises from (0,0) along the y axis to (0,1) and then horizontally to the upper-right corner (1,1). The diagonal line from (0,0) to (1,1) represents zero skill, meaning the forecasts show no discrimination among events. The area under the ROC curve can then be converted to a skill measure as in Eq.
(1), where score_ref is the zero-skill diagonal line (ROC_area = 0.5) and score_perfect is the area under the full ROC curve, equal to 1. The ROC area skill score is then ROCSS = 2 × ROC_area − 1.

The hit rate and false alarm rate contingency table was evaluated on forecast and observation anomaly thresholds. The ROC daily maximum temperature anomaly threshold was defined as 5°C over the climatological value for the day (+5°C anomaly threshold), and the daily minimum temperature anomaly threshold was 5°C below the climatological value for the day (−5°C anomaly threshold). A 5°C anomaly threshold was chosen because this value produces a significant deviation from the normal electrical load in cities and is therefore an important factor for planning generation requirements and power import–export agreements. A 10°C anomaly threshold would produce an even greater impact on generation planning and power import–export, but there were not enough data available in our study to produce reliable results. In a future study we would like to acquire enough data to examine forecast skill for these temperature extremes.

Results for the ROCSS for the 5°C temperature anomaly threshold for the full range of the 15-day forecast period are shown in Fig. 5. Actual ROC area curves for the daily maximum forecasts for a subset of the forecast period are shown in Fig. 6. The ROCSS indicates that both maximum and minimum daily forecasts exhibit consistent and very good discrimination through forecast day 6. Both maximum and minimum daily forecasts exhibit diminishing, but measurable, discrimination skill from forecast day 7 through forecast day 12. The daily minimum temperature forecasts show slightly higher skill than the daily maximum temperature forecasts through this period. For forecast days 13–15, the ROCSS is close to zero and somewhat noisy, indicating no skill in the forecasts.
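The ROC construction above can be sketched in Python: sweep a set of probability thresholds, fill the 2×2 contingency table at each threshold, and integrate the resulting (false alarm rate, hit rate) points. The threshold set and trapezoidal integration are our choices for illustration, not details specified in the paper:

```python
import numpy as np

def roc_points(probs, events, thresholds):
    """(false alarm rate, hit rate) points from the 2x2 contingency table
    evaluated at each forecast-probability threshold."""
    probs = np.asarray(probs, dtype=float)
    events = np.asarray(events, dtype=bool)
    pts = [(0.0, 0.0), (1.0, 1.0)]             # guaranteed ROC endpoints
    for t in thresholds:
        warn = probs >= t                      # event forecast at this threshold
        a = int(np.sum(warn & events))         # hits
        b = int(np.sum(warn & ~events))        # false alarms
        c = int(np.sum(~warn & events))        # misses
        d = int(np.sum(~warn & ~events))       # correct rejections
        hr = a / (a + c) if (a + c) else 0.0   # hit rate H = a/(a+c)
        far = b / (b + d) if (b + d) else 0.0  # false alarm rate F = b/(b+d)
        pts.append((far, hr))
    return sorted(pts)

def rocss(probs, events, thresholds=None):
    """ROC area by trapezoidal integration, converted to ROCSS = 2*A - 1."""
    if thresholds is None:
        thresholds = [i / 10.0 for i in range(11)]
    pts = roc_points(probs, events, thresholds)
    area = sum((f2 - f1) * (h1 + h2) / 2.0
               for (f1, h1), (f2, h2) in zip(pts[:-1], pts[1:]))
    return 2.0 * area - 1.0
```

Perfectly discriminating probabilities give ROCSS = 1, while a constant probability (no discrimination) collapses the curve onto the diagonal and gives ROCSS = 0.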
d. Economic value

The potential economic value for the cost–loss decision model (Murphy 1977; Katz and Murphy 1997; Richardson 2000; Zhu et al. 2002; Richardson 2003; McCollor and Stull 2008c) is uniquely determined from ROC curves. The cost–loss ratio (C/L) model shows how probabilistic forecasts can be used in the decision-making process, and provides a measure of the benefit of probabilistic forecasts over single ensemble-average forecasts.

Figure 7 shows forecast value (Richardson 2003) as a function of C/L value for daily maximum temperature forecasts with a +5°C anomaly threshold. Maximum value occurs at the climate frequency for the event, and users with a C/L value equal to the climate frequency derive the most value from the forecasts.

Figure 7 also shows the value calculated for the ensemble-average forecasts. The maximum value for the ensemble-average forecasts is less than the maximum value for the forecasts incorporating the full suite of ensemble members. The discrepancy between the maximum value for the ensemble forecasts and the maximum value for the ensemble-average forecasts widens with projected forecast period. Additionally, the cast of users (described by different C/L ratios) that gain benefit (show positive value) from the forecasts is wider for the full ensemble forecasts than for the ensemble-average forecasts. A graph depicting the decay of forecast maximum value with projected forecast period is shown in Fig. 8.

[Fig. 4: Continuous ranked probability skill scores for ensemble (top) daily maximum and (bottom) daily minimum temperature anomaly forecasts. The CRP skill score (circles) equals the relative resolution (x markers) less the relative reliability (asterisks).]

TABLE 2. Contingency table for hit rate and false alarm rate calculations. The counts a, b, c, and d are the numbers of events in each category, out of N total events.

                Observed   Not observed
Forecast        a          b
Not forecast    c          d
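The cost–loss value underlying Figs. 7 and 8 can be sketched from the contingency-table rates. With hit rate H, false alarm rate F, climatological base rate s, and a user's cost–loss ratio C/L, relative value compares the forecast user's expected expense with the climatological and perfect-forecast expenses. This is a standard formulation in the spirit of Richardson (2000); the function below is our illustration, not the authors' code:

```python
def relative_value(hit_rate, false_alarm_rate, base_rate, cost_loss):
    """Relative economic value in the cost-loss model (expenses in units of L).

    V = 1 for a perfect forecast; V <= 0 for a system that offers a user
    no advantage over the climatological always/never-protect strategy."""
    e_climate = min(cost_loss, base_rate)      # cheaper of always/never protecting
    e_perfect = base_rate * cost_loss          # protect exactly when events occur
    e_forecast = (false_alarm_rate * (1 - base_rate) * cost_loss  # protect, no event
                  + hit_rate * base_rate * cost_loss              # protect, event
                  + (1 - hit_rate) * base_rate)                   # miss: pay the loss
    return (e_climate - e_forecast) / (e_climate - e_perfect)
```

Evaluating V over a range of C/L values for the ROC point at each probability threshold, and keeping the envelope of the curves, reproduces the kind of value-versus-C/L plot shown in Fig. 7, with the peak near C/L = s as the text describes.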
The maximum value is slightly higher for minimum temperature forecasts than for maximum temperature forecasts, and the forecasts lose very little of their initial value through forecast day 6. The maximum value lessens but remains positive through forecast day 11. Forecasts for days 12–15 indicate that the value of these forecasts is little better than climatology.

Plotting forecast value as a function of probability threshold pt for different C/L values (Fig. 9) is a way of showing how different forecast users can benefit individually from the same set of ensemble forecasts. The probability threshold pt is the point at which a decision maker should take action on the basis of the forecast to derive the most value from the forecasts. Low C/L ratio forecast users (C/L = 0.2 in Fig. 9) can take advantage of the ensemble forecasts by taking action at very low forecast probabilities through forecast day 8. High C/L forecast users (C/L = 0.8) must wait until the forecast probability is much higher, and can gain positive value from the forecasts only through forecast day 4.

e. Equal likelihood

In a perfectly realistic and reliable EPS, the range of the ensemble forecast members represents the full spread of the probability distribution of observations, and each member of the ensemble exhibits an equal likelihood of occurrence. Actual EPSs are rarely that perfect. The rank histogram, also known as a Talagrand diagram (Anderson 1996; Talagrand et al. 1998), is a useful measure of equal likelihood and of reliability (Hou et al. 2001; Candille and Talagrand 2005) of an EPS. A perfect EPS in which each ensemble member is equally likely to verify for any particular forecast exhibits a flat rank histogram. Ensemble forecasts with insufficient spread to adequately represent the probability distribution of observations exhibit a U-shaped histogram.
Rank histograms derived from a particular EPS can also exhibit an inverted-U shape, indicative of too much spread in the ensemble members compared to the observations. Rank histograms can also be L shaped (warm bias) or J shaped (cool bias). Operational ensemble weather forecasting systems in the medium lead-time range (3–10 days ahead) exhibit U-shaped analysis rank histograms, implying the ensemble forecasts underestimate the true uncertainty in the forecasts (Toth et al. 2003).

A subset of rank histogram plots for the EPS daily maximum temperature forecasts is provided in Fig. 10, for forecast days 2, 4, 6, 8, 10, and 12. These plots show that early in the forecast period, especially through forecast day 4, the rank histograms do exhibit a U shape, with excess values in the outermost bins. This means the EPS is underdispersive through forecast day 4; the spread in the EPS is not indicative of the spread in the actual observations. However, the rank histogram is symmetric, confirming that the bias-corrected temperature forecasts are indeed unbiased.

Forecast days 6–12 display a progressively flatter histogram, indicating the EPS spread increases with forecast lead time and the EPS spread essentially matches the spread of the observations beyond forecast day 6. Similar results (Fig. 11) were obtained from a rank histogram analysis of the EPS daily minimum temperature forecasts.

[Fig. 5: ROC area skill score for ensemble daily maximum temperature (squares) forecasts with a +5°C anomaly threshold and daily minimum temperature (circles) with a −5°C anomaly threshold.]

[Fig. 6: ROC area plots for a subset of the forecast period for a daily maximum temperature anomaly +5°C threshold. The plots range from an ROC curve for day 2 (dotted line), 4 (dot–dashed line), 6 (solid line), 8 (heavy dotted line), 10 (heavy dot–dashed line), and 12 (heavy solid line). The dashed line is the zero-skill line.]
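A rank histogram like those in Figs. 10 and 11 is built by counting, for each forecast case, how many ensemble members fall below the verifying observation. A minimal sketch (array layout is our assumption; tie handling is simplified):

```python
import numpy as np

def rank_histogram(ens_forecasts, observations):
    """Talagrand diagram: counts of the observation's rank within each
    sorted ensemble.

    ens_forecasts: (N, M) array of N cases with M members each.
    observations:  (N,) array of verifying observations.
    Returns counts over the M + 1 possible ranks."""
    ens = np.asarray(ens_forecasts, dtype=float)
    obs = np.asarray(observations, dtype=float)
    n_cases, n_members = ens.shape
    ranks = np.sum(ens < obs[:, None], axis=1)   # rank 0..M per case
    return np.bincount(ranks, minlength=n_members + 1)
```

Observations that persistently land in the outermost bins (ranks 0 and M) produce the U shape that signals underdispersion, as described above.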
However, this better spread is associated with worse skill, as shown previously in sections 3a and 3b.

A measure of the deviation of the rank histogram from flatness is given in Candille and Talagrand (2005) for an EPS with M members and N available forecast–observation pairs. The number of values in the ith interval of the histogram is given by s_i. For a flat histogram generated from a reliable EPS, s_i = N/(M + 1) for each interval i. The quantity

    D = Σ_{i=1}^{M+1} [s_i − N/(M + 1)]²    (3)

measures the deviation of the histogram from a horizontal line. For a reliable system, the base value is D_0 = NM/(M + 1). The ratio

    d = D/D_0    (4)

is evaluated as the overall measure of the flatness of an ensemble prediction system rank histogram.

[Fig. 7: Forecast value as a function of C/L ratio for ensemble daily maximum forecasts (+5°C anomaly threshold), comparing the value of the full ensemble to the ensemble mean. The solid line represents the value of the full ensemble and the dotted line represents the value of the ensemble mean.]

A value of d that is significantly larger than 1 indicates the system does not reflect the equal likelihood of ensemble members. A value of d that is closer to 1 indicates that the ensemble forecasts are tending toward an equal likelihood of occurrence and, therefore, are more reliable.

Figure 12 provides an analysis of the rank histograms for the MREFs studied here. Early in the forecast period, for forecast days 1–3, d is much greater than 10. Therefore, the MREF system for forecast days 1–3 exhibits insufficient, but successively increasing, spread to adequately represent the variability in the observations. For forecast days 4 and 5, the MREF exhibits successively smaller d values, but still greater than 5, so that evidence of slight underdispersion remains. For forecast days beyond day 5, the d values derived from this MREF system (values between 1 and 5), as indicated in Figs.
10–12, indicate that the ensemble spread is very slightly underdispersive, though nearly sufficient spread is realized in the ensembles to represent the observations in the analysis. Spread does improve as the forecast period increases; however, this improvement in spread is negated by the deterioration in skill at the longer forecast periods. This skill versus spread relationship is demonstrated in a trade-off diagram, described in the following section.

SKILL–SPREAD TRADE-OFF DIAGRAM

We define a new spread score as

    spread score = 1/d.    (5)

A plot of the skill score on the ordinate and this spread score on the abscissa (as the forecast projection progresses from day 1 through day 15) shows the trade-off between higher skill but lower spread early in the forecast period and degraded skill but better spread later in the forecast period. A skill–spread trade-off diagram with the CRPSS versus the spread score is shown in Fig. 13, and a similar trade-off diagram displaying the ROC skill score (for the 5°C temperature anomaly threshold) is shown in Fig. 14. Both Figs. 13 and 14 display the changes in skill and spread with forecast projection, for both daily minimum and daily maximum temperature forecasts. Beyond forecast day 11 or 12 (indicated by a dotted line in these figures), the skill becomes negligible and the spread stops improving, or is even reduced in some cases.

The authors suggest that this relationship between skill and spread could be explored further with the aid of this skill–spread trade-off diagram, with different parameters, more datasets, and other skill measures. One interesting application of this skill–spread diagram would be to analyze the variation of skill with spread as the number of individual members in an ensemble prediction system is varied.
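Equations (3)–(5) translate directly into code. Given the rank-histogram bin counts s_i, the flatness ratio d = D/D_0 and the spread score 1/d follow (our sketch; note that an exactly flat histogram gives d = 0, so the spread score is only meaningful for histograms with some deviation from flatness, as real samples always have):

```python
import numpy as np

def flatness_ratio(counts):
    """d = D / D0 from Eqs. (3)-(4).

    counts: the M + 1 rank-histogram bin counts s_i (summing to N).
    d near 1 indicates a reliable, near-flat histogram; d >> 1 indicates
    strongly unequal likelihood among members."""
    s = np.asarray(counts, dtype=float)
    m_plus_1 = s.size                       # M + 1 bins
    n = s.sum()                             # N forecast-observation pairs
    expected = n / m_plus_1                 # flat-histogram count per bin
    deviation = np.sum((s - expected) ** 2)           # D, Eq. (3)
    base = n * (m_plus_1 - 1) / m_plus_1              # D0 = N*M/(M+1)
    return deviation / base

def spread_score(counts):
    """Spread score = 1/d, Eq. (5)."""
    return 1.0 / flatness_ratio(counts)
```

For example, a strongly U-shaped 4-bin histogram [2, 0, 0, 2] (M = 3, N = 4) gives D = 4 and D_0 = 3, hence d = 4/3 and a spread score of 0.75.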
The diagram could be used, for example, to find the fewest members needed to reach a specific level of skill, for a specific forecast projection, in order to optimize the use of computer resources in ensemble system design.

4. Summary and conclusions

The skill and value of medium-range NAEFS temperature forecasts are evaluated here for the 7-month period of 1 March–1 October 2007. However, caution must be used when objectively assessing the quality of ensemble forecasts. Ensemble forecast systems must be verified over many cases. As a result, the scoring metrics described here are susceptible to several sources of noise (Hamill et al. 2000). For example, improper estimates of probabilities will arise from small-sized ensembles. Also, an insufficient variety and number of cases will lead to statistical misrepresentation. Finally, imperfect observations make true forecast evaluation impossible.

Two standard verification skill score measures designed for continuous deterministic forecast variables, an RMSE skill score and a Pearson correlation skill score, were applied to the mean of the bias-corrected ensemble forecasts. These measures both showed that the ensemble means of these city temperature forecasts from the NAEFS possessed skill (compared to climatology) through forecast day 9, though for forecast days 8 and 9 the forecast skill was marginal. These skill scores also indicated that daily maximum temperature forecasts were slightly more skillful than daily minimum temperature forecasts for the period analyzed.

This study then incorporated verification measures specifically designed for probabilistic forecasts generated from the bias-corrected set of all 32 ensemble members.

Fig. 8. C/L ratio of maximum value for ensemble daily maximum temperatures (squares) with a +5°C anomaly threshold and minimum temperatures (circles) with a −5°C anomaly threshold.

The continuous ranked probability score (CRPS),
particularly useful for evaluating a continuous ensemble variable, indicated skill in the EPS through day 12. The CRPS also supported the previous finding for the ensemble mean that daily maximum temperatures are slightly more skillful than the daily minimum forecasts.

The ROC area skill score, designed to indicate the ability of the EPS to discriminate among different events, showed that the EPS forecasts were indeed skillful (for 5°C temperature anomaly thresholds) in this regard through day 12. ROC evaluation is characterized by stratification by observation, and discrimination is based on the probabilities of the forecasts conditioned on different observed values, or likelihood–base rate factorization in the parlance of the Murphy and Winkler general framework for forecast verification (Murphy and Winkler 1987; Potts 2003).

Fig. 9. Value as a function of probability threshold p_t for C/L ratios of 0.2 (solid), 0.4 (dash), 0.6 (dot), and 0.8 (dot–dash). The forecasts are daily maximum temperature forecasts with a +5°C anomaly threshold for (top to bottom) days 2, 4, 6, 8, 10, and 12.

FEBRUARY 2009 MCCOLLOR AND STULL

ROC analysis leads to economic evaluation through a cost–loss measure of value. Cost–loss analysis indicated that, overall, the EPS provides some economic value through day 12. However, the maximum value, assigned to the user whose C/L value is equal to the climatological base rate for the event in question, diminishes markedly through this forecast range. Additionally, the range of users, defined by different C/L ratios, diminishes considerably beyond forecast day 6. The C/L analysis also provides proof that the full suite of ensemble forecasts provides value to a greater range of users than the single ensemble average forecasts.
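The cost–loss evaluation above can be sketched as follows. This uses the standard relative-economic-value formulation of Richardson (2000), with the hit rate H and false alarm rate F as defined in the paper's appendix; the function names and the worked numbers are ours, not taken from the paper:

```python
def hit_false_alarm(a, b, c, d):
    """Hit rate H = a/(a+c) and false alarm rate F = b/(b+d) from a 2x2
    contingency table (a hits, b false alarms, c misses, d correct rejections)."""
    return a / (a + c), b / (b + d)

def relative_value(h, f, base_rate, cost_loss):
    """Relative economic value V (Richardson 2000): the expense saved by the
    forecast system relative to climatology, scaled by the saving a perfect
    forecast would give.  V = 1 is perfect; V <= 0 means no value for that user.
    """
    s, alpha = base_rate, cost_loss
    e_climate = min(alpha, s)       # cheaper of always protecting / never protecting
    e_perfect = s * alpha           # protect only when the event occurs
    # forecast-system expense: false alarms cost alpha, hits cost alpha, misses cost the loss
    e_forecast = f * (1 - s) * alpha + h * s * alpha + (1 - h) * s
    return (e_climate - e_forecast) / (e_climate - e_perfect)
```

Evaluating `relative_value` over a range of C/L ratios for each forecast day reproduces the shape of the value curves in Figs. 7–9: value peaks where C/L equals the climatological base rate and falls to zero at the edges of the user range.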
Rank histograms provide a method to test whether the dispersion, or spread, of the ensemble forecasts represents the dispersion in the actual observations, a vital component of a reliable EPS. In the NAEFS forecast sample examined here, the forecasts were decreasingly underdispersive early in the forecast cycle, for forecast days 1–5. The forecasts exhibited marginal underdispersion for forecast days 6 and beyond. A new trade-off diagram is proposed to show the interplay between skill and spread.

Fig. 10. Rank histograms for the ensemble of daily maximum temperatures for (top to bottom) days 2, 4, 6, 8, 10, and 12.

One aspect of interest uncovered in this study is the comparison of forecast skill between daily maximum and daily minimum temperature forecasts. The skill scores evaluated for the ensemble mean forecast (RMSESS and CORRSS) indicated that daily maximum temperature forecasts were somewhat more skillful than daily minimum temperature forecasts through the first 9 days of the forecast period (Fig. 3). These skill scores indicate no skill in the forecasts, for either daily maximum or daily minimum forecasts, for forecast day 10 (compared to the climatological reference forecasts). Beyond day 10, the MREF forecasts fare worse than climatological forecasts, with daily maximum temperature forecasts now proving worse than daily minimum temperature forecasts.

Fig. 11. As in Fig. 10, but for minimum temperatures.

Daily maximum temperature forecasts also prove slightly better than daily minimum temperature forecasts when compared using the CRPSS (Fig. 4), throughout the 15-day range of MREF forecasts. The ROCSS, on the other hand, indicates that daily minimum temperature forecasts show better discrimination than daily maximum temperature forecasts (Fig. 5).
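The deterministic skill comparisons above (RMSESS, CORRSS) rest on the verification statistics defined in the paper's appendix (Eqs. A1–A4). A minimal Python sketch follows; the skill-score convention SS = 1 − RMSE_fcst/RMSE_clim is a common assumption on our part, since the paper's exact skill-score formula is not reproduced in this section:

```python
import math

def mean_error(f, y):
    """ME = mean(f) - mean(y)  (Eq. A1)."""
    return sum(f) / len(f) - sum(y) / len(y)

def rmse(f, y):
    """Root-mean-square error (Eqs. A2-A3)."""
    return math.sqrt(sum((fk - yk) ** 2 for fk, yk in zip(f, y)) / len(f))

def pearson_r(f, y):
    """Pearson correlation r (Eq. A4)."""
    fm, ym = sum(f) / len(f), sum(y) / len(y)
    num = sum((fk - fm) * (yk - ym) for fk, yk in zip(f, y))
    den = math.sqrt(sum((fk - fm) ** 2 for fk in f) *
                    sum((yk - ym) ** 2 for yk in y))
    return num / den

def rmse_skill_score(f, y, clim):
    """Assumed convention: SS = 1 - RMSE(forecast)/RMSE(climatology).
    1 is perfect, 0 no better than climatology, negative is worse."""
    return 1.0 - rmse(f, y) / rmse(clim, y)
```

Applied to the bias-corrected ensemble mean against a climatological reference, a positive `rmse_skill_score` corresponds to the "skill through forecast day 9" finding, and a negative value to the beyond-day-10 regime where the MREF fares worse than climatology.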
In terms of forecast spread, both daily maximum and daily minimum temperature forecasts possess similar spread for all forecasts beyond day 1 (Fig. 12). We do not have enough information to determine whether the difference in skill between the daily maximum and daily minimum temperature forecasts is an artifact of the scoring metric, the models in the MREF, or the fact that the 0000 and 1200 UTC forecasts do not necessarily correspond to the times of maximum and minimum daily temperatures. Investigation of this difference in skill between maximum and minimum temperature forecasts is an avenue for future research.

In summary, the NAEFS forecasts analyzed in this study provide skill in the deterministic ensemble average through forecast day 9. Employing the full range of 32 ensemble members to provide probabilistic forecasts of daily maximum and minimum temperature extended the skill of the NAEFS through an additional three forecast days, through day 12. The authors suggest that an investigation of the skill and value of the current 42-member NAEFS configuration of bias-corrected minimum and maximum temperatures would also prove beneficial to users.

Summarizing the findings for the ensemble average forecasts, and in comparing MREF forecasts in a continuous mode, daily maximum temperature forecasts were slightly more skillful than daily minimum temperature forecasts. Shifting to discriminating ability (using ±5°C anomaly thresholds), daily minimum temperature forecasts provided slightly better discriminating ability than daily maximum temperature forecasts.

Many sectors of the economy, such as hydroelectric energy generation, transmission, and distribution, are susceptible to financial risk driven by changing temperature regimes. This study has shown that business managers savvy enough to incorporate ensemble temperature forecasts provided by the NAEFS can mitigate that risk quite far into the future (potentially 12 days, depending on user requirements).

Fig. 13.
A trade-off diagram of CRPSS vs spread score ($1/\delta$) for daily maximum temperature (squares) and daily minimum temperature (circles) forecast anomalies.

Fig. 12. Rank histogram $\delta$ scores for ensemble daily maximum (squares) and daily minimum (circles) temperature forecasts.

Fig. 14. As in Fig. 13, but for ROCSS vs spread score ($1/\delta$).

Acknowledgments. The authors thank the Meteorological Service of Canada and the National Weather Service for making available the temperature observations necessary to perform this study. The authors would also like to thank the national weather agencies of Canada (the Meteorological Service of Canada), the United States (the National Weather Service), and Mexico (the National Meteorological Service of Mexico) for providing public access to forecasts from the North American Ensemble Forecast System. We thank Uwe Gramman of Mountain Weather Services of Smithers, British Columbia, for writing the scripts to access the NAEFS forecasts. Additional support was provided by the Canadian Natural Science and Engineering Research Council.

APPENDIX

Equations for Meteorological Statistical Analysis

Given $f_k$ as forecast values, $y_k$ as observed values, $\bar{f}$ as the mean forecast value, $\bar{y}$ as the mean observed value, and $N$ as the number of forecast–observation pairs, the following definitions apply: mean error (ME),

$$\mathrm{ME} = \bar{f} - \bar{y}; \qquad (A1)$$

mean square error (MSE),

$$\mathrm{MSE} = \frac{1}{N} \sum_{k=1}^{N} (f_k - y_k)^2; \qquad (A2)$$

root-mean-square error (RMSE),

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}; \qquad (A3)$$

and Pearson correlation ($r$),

$$r = \frac{\sum_{k=1}^{N} (f_k - \bar{f})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{N} (f_k - \bar{f})^2 \, \sum_{k=1}^{N} (y_k - \bar{y})^2}}. \qquad (A4)$$

Contingency table analysis equations for Table 2 are given below: hit rate ($H$),

$$H = \frac{a}{a+c}; \qquad (A5)$$

and false alarm rate ($F$),

$$F = \frac{b}{b+d}. \qquad (A6)$$

REFERENCES

Anderson, J.
L., 1996: A method for producing and evaluating probabilistic precipitation forecasts from ensemble model integrations. J. Climate, 9, 1518–1529.

Atger, F., 2003: Spatial and interannual variability of the reliability of ensemble-based probabilistic forecasts: Consequences for calibration. Mon. Wea. Rev., 131, 1509–1523.

Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.

Buizza, R., P. L. Houtekamer, Z. Toth, G. Pellerin, M. Wei, and Y. Zhu, 2005: A comparison of the ECMWF, MSC, and NCEP global ensemble prediction systems. Mon. Wea. Rev., 133, 1076–1097.

Candille, G., and O. Talagrand, 2005: Evaluation of probabilistic prediction systems for a scalar variable. Quart. J. Roy. Meteor. Soc., 131, 2131–2150.

Côté, J., J.-G. Desmarais, S. Gravel, A. Méthot, A. Patoine, M. Roch, and A. Staniforth, 1998a: The operational CMC–MRB Global Environmental Multiscale (GEM) model. Part I: Design considerations and formulation. Mon. Wea. Rev., 126, 1373–1395.

——, ——, ——, ——, ——, ——, and ——, 1998b: The operational CMC–MRB Global Environmental Multiscale (GEM) model. Part II: Results. Mon. Wea. Rev., 126, 1397–1418.

Eckel, F. A., and C. F. Mass, 2005: Aspects of effective mesoscale, short-range ensemble forecasting. Wea. Forecasting, 20, 328–350.

Gallus, W. A., Jr., M. E. Baldwin, and K. L. Elmore, 2007: Evaluation of probabilistic precipitation forecasts determined from Eta and AVN forecasted amounts. Wea. Forecasting, 22, 207–215.

Hamill, T. M., S. L. Mullen, C. Snyder, Z. Toth, and D. P. Baumhefner, 2000: Ensemble forecasting in the short to medium range: Report from a workshop. Bull. Amer. Meteor. Soc., 81, 2653–2664.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559–570.

Hou, D., E. Kalnay, and K. K. Droegemeier, 2001: Objective verification of the SAMEX '98 ensemble forecasts. Mon. Wea. Rev., 129, 73–91.

Houtekamer, P. L., L.
Lefaivre, J. Derome, H. Ritchie, and H. L. Mitchell, 1996: A system simulation approach to ensemble prediction. Mon. Wea. Rev., 124, 1225–1242.

Jones, M. S., B. A. Colle, and J. S. Tongue, 2007: Evaluation of a mesoscale short-range ensemble forecast system over the northeast United States. Wea. Forecasting, 22, 36–55.

Katz, R. W., and A. H. Murphy, 1997: Economic Value of Weather and Climate Forecasts. Cambridge University Press, 222 pp.

Mason, I., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303.

——, and N. E. Graham, 1999: Conditional probabilities, relative operating characteristics, and relative operating levels. Wea. Forecasting, 14, 713–725.

McCollor, D., and R. Stull, 2008a: Hydrometeorological accuracy enhancement via postprocessing of numerical weather forecasts in complex terrain. Wea. Forecasting, 23, 131–144.

——, and ——, 2008b: Hydrometeorological short-range ensemble forecasts in complex terrain. Part I: Meteorological evaluation. Wea. Forecasting, 23, 533–556.

——, and ——, 2008c: Hydrometeorological short-range ensemble forecasts in complex terrain. Part II: Economic evaluation. Wea. Forecasting, 23, 557–574.

Murphy, A. H., 1969: On the "Ranked probability score." J. Appl. Meteor., 8, 988–989.

——, 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.

——, 1977: The value of climatological, categorical, and probabilistic forecasts in the cost–loss ratio situation. Mon. Wea. Rev., 105, 803–816.

——, 1991: Forecast verification: Its complexity and dimensionality. Mon. Wea. Rev., 119, 1590–1601.

——, and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.

Potts, J. M., 2003: Basic concepts. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. Jolliffe and D. B. Stephenson, Eds., Wiley, 13–36.

Richardson, D.
S., 2000: Skill and relative economic value of the ECMWF Ensemble Prediction System. Quart. J. Roy. Meteor. Soc., 126, 649–667.

——, 2003: Economic value and skill. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. Jolliffe and D. B. Stephenson, Eds., Wiley, 164–187.

Roulin, E., and S. Vannitsem, 2005: Skill of medium-range hydrological ensemble predictions. J. Hydrometeor., 6, 729–744.

Stensrud, D. J., J. Bao, and T. T. Warner, 2000: Using initial conditions and model physics perturbations in short-range ensemble simulations of mesoscale convective systems. Mon. Wea. Rev., 128, 2077–2107.

Talagrand, O., R. Vautard, and B. Strauss, 1998: Evaluation of probabilistic prediction systems. Proc. Workshop on Predictability, Reading, United Kingdom, ECMWF, 1–25.

Tennant, W. J., Z. Toth, and K. J. Rae, 2007: Application of the NCEP Ensemble Prediction System to medium-range forecasting in South Africa: New products, benefits, and challenges. Wea. Forecasting, 22, 18–35.

Toth, Z., and E. Kalnay, 1993: Ensemble forecasting at NMC: The generation of perturbations. Bull. Amer. Meteor. Soc., 74, 2317–2330.

——, O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and ensemble forecasts. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. Jolliffe and D. B. Stephenson, Eds., Wiley, 137–163.

——, and Coauthors, 2006: The North American Ensemble Forecast System (NAEFS). Preprints, 18th Conf. on Probability and Statistics in the Atmospheric Sciences, Atlanta, GA, Amer. Meteor. Soc., 4.1. [Available online at http://ams.confex.com/ams/pdfpapers/102588.pdf.]

Wei, M., Z. Toth, R. Wobus, and Y. Zhu, 2008: Initial perturbations based on the ensemble transform (ET) technique in the NCEP Global Operational Forecast System. Tellus, 60A, 62–79.

Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2nd ed. Academic Press, 627 pp.

Yussouf, N., and D. J.
Stensrud, 2007: Bias-corrected short-range ensemble forecasts of near-surface variables during the 2005/06 cool season. Wea. Forecasting, 22, 1274–1286.

Zhu, Y., Z. Toth, R. Wobus, D. Richardson, and K. Mylne, 2002: The economic value of ensemble-based weather forecasts. Bull. Amer. Meteor. Soc., 83, 73–83.

