Evaluation of Probabilistic Medium-Range Temperature Forecasts from the North American Ensemble Forecast… McCollor, Doug; Stull, Roland B. Feb 28, 2009

WEATHER AND FORECASTING, VOLUME 24, FEBRUARY 2009

Evaluation of Probabilistic Medium-Range Temperature Forecasts from the North American Ensemble Forecast System

DOUG MCCOLLOR, University of British Columbia, and BC Hydro Corporation, Vancouver, British Columbia, Canada
ROLAND STULL, University of British Columbia, Vancouver, British Columbia, Canada

(Manuscript received 18 February 2008, in final form 19 August 2008)

ABSTRACT

Ensemble temperature forecasts from the North American Ensemble Forecast System were assessed for quality against observations for 10 cities in western North America, for a 7-month period beginning in February 2007. Medium-range probabilistic temperature forecasts can provide information for those economic sectors exposed to temperature-related business risk, such as agriculture, energy, transportation, and retail sales. The raw ensemble forecasts were postprocessed, incorporating a 14-day moving-average forecast–observation difference, for each ensemble member. This postprocessing reduced the mean error in the sample to 0.6°C or less. It is important to note that the North American Ensemble Forecast System available to the public provides bias-corrected maximum and minimum temperature forecasts. Root-mean-square-error and Pearson correlation skill scores, applied to the ensemble average forecast, indicate positive, but diminishing, forecast skill (compared to climatology) from 1 to 9 days into the future. The probabilistic forecasts were evaluated using the continuous ranked probability skill score, the relative operating characteristic skill score, and a value assessment incorporating cost–loss determination. The full suite of ensemble members provided skillful forecasts 10–12 days into the future. A rank histogram analysis was performed to test ensemble spread relative to the observations. Forecasts are underdispersive early in the forecast period, for forecast days 1 and 2.
Dispersion improves rapidly but remains somewhat underdispersive through forecast day 6. The forecasts exhibit little or no underdispersion beyond forecast day 6. A new skill versus spread diagram is presented that shows the trade-off between higher skill but low spread early in the forecast period and lower skill but better spread later in the forecast period.

Corresponding author address: Doug McCollor, Dept. of Earth and Ocean Sciences, University of British Columbia, 6339 Stores Rd., Vancouver, BC V6T 1Z4, Canada. E-mail: doug.mccollor@bchydro.com

DOI: 10.1175/2008WAF2222130.1

© 2009 American Meteorological Society

1. Introduction

Accurate weather forecasts help mitigate risk to those economic sectors exposed to weather-related business losses. Astute business operators can also incorporate accurate weather forecasts for financial gain for their organization. Probabilistic forecasts are especially useful in making economic decisions because the expected gain or loss arising from a particular weather-related event can be measured directly against the probability of that event actually occurring. For the example of hydroelectric systems, temperature affects the demand for electricity (from electric heaters and air conditioners) and the resistance (and therefore the maximum electric load capacity) of transmission lines. In high-latitude countries, river-ice formation affects hydroelectric generation planning during winter. Temperature effects on hydroelectric systems also include the height above the ground of transmission lines (lines sag closer to the ground as temperatures rise) and precipitation type (rain versus snow), an important factor determining the inflow response in watersheds.

Of course, weather forecast accuracy diminishes as the forecast lead time increases; this is true for probabilistic forecasts as well as deterministic forecasts. This paper aims to analyze a set of medium-range probabilistic temperature forecasts and measure the quality and value of the forecasts as a function of lead time. Decision makers using this forecast system will then have a valid estimate of how far into the future these particular forecasts provide valuable information.

Probability estimates of future weather are most often derived from an ensemble of individual forecasts. Many short-range ensemble forecast (SREF) systems, designed to provide forecasts in the day 1 and day 2 time frame, provide high-quality probabilistic forecasts by varying the model physics to derive their ensemble members (Stensrud et al. 2000; Eckel and Mass 2005; Jones et al. 2007; McCollor and Stull 2008b). Medium-range ensemble forecast (MREF) systems, necessary for probabilistic forecasts beyond day 2, can be designed to represent the errors inherent in synoptic flow patterns in the medium-range period (Buizza et al. 2005; Roulin and Vannitsem 2005; Tennant et al. 2007). The idea of combining ensemble forecasts from different national meteorological centers, thus taking advantage of different modes of ensemble member generation and greatly increasing the overall ensemble size, resulted in the formation of the North American Ensemble Forecast System (NAEFS; Toth et al. 2006). We investigate the quality and value of NAEFS temperature forecasts for 10 cities in western North America. Section 2 describes the forecasts and observations analyzed in the MREF study. Section 3 describes the verification metrics used to evaluate the probabilistic forecasts and discusses the results, and section 4 summarizes the findings and concludes the paper.

2. Methodology

a. Ensemble forecasts

Raw (unprocessed) temperature forecasts were obtained and stored in real time from the North American Ensemble Forecast System. The NAEFS is a joint project involving the Meteorological Service of Canada (MSC), the U.S.
National Weather Service (NWS), and the National Meteorological Service of Mexico (NMSM). NAEFS was officially launched in November 2004 and combines state-of-the-art ensembles developed at MSC and NWS. It is important to note that the NAEFS available to the public provides bias-corrected maximum and minimum temperature forecasts. The grand ensemble, or superensemble, provides weather forecast guidance for the 1–16-day period, allowing for the inclusion of ensemble members from different generating schemes. The original configuration of the NWS-produced ensembles consisted of one control run plus 14 perturbed ensemble members produced by a bred-vector perturbation method (Buizza et al. 2005; Toth and Kalnay 1993). This initial-condition perturbation method assumed that fast-growing errors develop naturally in a data assimilation cycle and that errors will continue to grow through the medium-range forecast cycle. The original MSC ensemble prediction system (EPS) configuration consisted of one control run plus 16 perturbed ensemble members. The perturbation approach developed at MSC (Houtekamer et al. 1996; Buizza et al. 2005) generated initial conditions by assimilating randomly perturbed observations. This multimodel approach ran different model versions (involving variable physical parameterizations) incorporating a number of independent data assimilation cycles. At the time of this study, no forecasts were available from the NMSM. In addition, it must be noted that the ensemble system's design, including strategies to sample observation error, initial condition error, and model error, as well as addressing the issue of member resolution versus ensemble size, is continually evolving. Recent information describing current national ensemble prediction systems can be found in a report from a November 2007 workshop on ensemble prediction (available online at http://www.ecmwf.int/newsevents/meetings/workshops/2007/ensemble_prediction/wg1.pdf; accessed 12 April 2008).
Current details on improvements to the MSC EPS are also available online (at http://www.ecmwf.int/newsevents/meetings/workshops/2007/MOS_11/presentations_files/Gagnon.pdf; accessed 14 April 2008). In May 2006 the NWS EPS changed its initial perturbation generation technique to the ensemble transform with rescaling (ETR) technique, where the initial perturbations are restrained by the best available analysis variance from the operational data assimilation system (Wei et al. 2008). In March 2007 the NWS ensemble increased to 20 members integrated from an 80-member ETR-based ensemble. In July 2007 MSC increased its ensemble size from 16 to 20 members (information online at http://www.smc-msc.ec.gc.ca/cmc/op_systems/doc_opchanges/genot_20070707_e_f.pdf; accessed 17 April 2008) to conform to the NWS ensemble size. MSC also increased the horizontal resolution of the members and extended the physical parameterization package. The complete current status of the MSC EPS can be found online (http://www.ecmwf.int/newsevents/meetings/workshops/2007/ensemble_prediction/presentations/houtekamer.pdf; accessed 17 April 2008).

b. Observations

Unprocessed forecasts for days 1–15 were retrieved from the online NAEFS over the period 15 February–1 October 2007. An ensemble size of 32 members was maintained throughout this period, for consistency in the process, despite the operational changes at the national centers described above. Ten city locations (five in Canada and five in the United States) were chosen to represent different climate regimes in far western North America (see Table 1 for the list of North American cities in the study and Fig. 1 for a reference map). For each day and each city, the forecast valid at 1600 local time (0000 UTC) was verified against the observed maximum temperature for the day; the forecast valid at 0400 local time (1200 UTC) was verified against the observed minimum temperature for the day.
Temperatures are accurate to within ±0.6°C, determined from the Instrument Requirements and Standards for the NWS Surface Observing Programs (information online at http://www.weather.gov/directives/sym/pd01013002curr.pdf; accessed 3 June 2008).

c. Method used

Forecast system performance can be verified against actual point observations or against model-based gridded analyses. Gridded analyses directly from the models generally offer more data for investigation. However, gridded analyses are subject to model-based data estimation errors and smoothing errors. Even though point observations are subject to instrumentation error and site representativeness error, it is preferable to verify forecasts against observations as opposed to model analyses (Hou et al. 2001).

The mean error of the maximum and minimum temperature forecasts for each ensemble member for each forecast day was evaluated from the sample. Random errors caused by initial-condition and model-parameterization errors can be reduced by using an ensemble of forecasts. Bias in the forecasts is largely attributable to local terrain and topographic effects being poorly represented in the forecast models. Bias is evaluated by calculating the mean error in the forecasts and can be reduced via postprocessing of the direct model output (DMO). Regarding postprocessing, the authors previously found (McCollor and Stull 2008a) that a 14-day moving-average filter was effective in reducing the bias in a medium-range maximum and minimum temperature forecast sample. Therefore, a 14-day moving-average filter is applied here to each of the ensemble member raw forecasts (for each station location and each forecast lead time) to minimize the bias in the DMO forecasts. Care was taken in this study to ensure the method is operationally achievable. That is, only observations that would be available in a real-time setting for each forecast period were used in the moving-average filter.
For example, all day 5 forecasts included a moving-average filter incorporating forecast–observation pairs from 19 days prior to the forecast valid time through 6 days prior to the forecast time to achieve a 14-day moving-average bias correction. This method of error reduction for ensemble prediction is known as the bias-corrected ensemble (BCE) approach (Yussouf and Stensrud 2007). The scheme responds quickly to any new station locations, new models, or model upgrades.

TABLE 1. Ten cities for which forecasts and associated verifying observations were evaluated.

City                ICAO* identifier   Lat (°N)   Lon (°W)
High Level, AB      CYOJ               58.6       117.2
Peace River, AB     CYPE               56.2       117.4
Fort St. John, BC   CYXJ               56.2       120.7
Victoria, BC        CYYJ               48.6       123.4
Vancouver, BC       CYVR               49.2       123.2
Seattle, WA         KSEA               47.4       122.3
Spokane, WA         KGEG               47.6       117.5
Portland, OR        KPDX               45.6       122.6
Sacramento, CA      KSAC               38.5       121.5
Bakersfield, CA     KBFL               35.4       119.1

* ICAO = International Civil Aviation Organization.

The results of the filter application are presented in Fig. 2. The calculations are derived over all forecast–observation pairs between 1 March and 1 October 2007 (the first 14 days of the sample were reserved to initialize the moving-average postprocessing filter). Mean error in the DMO ensemble average daily maximum temperature forecasts, ranging from −3.6°C (day 1 forecasts) to −2.6°C (day 15 forecasts), was reduced to a range from 0.0°C (day 1 forecasts) to +0.6°C (day 15 forecasts) in the postprocessed forecasts. Similarly, mean error in the DMO ensemble average daily minimum temperature forecasts, near −1.5°C throughout the 15-day forecast cycle, was reduced to the 0.0°C to +0.3°C range in the postprocessed forecasts. These bias-corrected forecasts are used in the subsequent ensemble evaluations in this paper. The final phase in designing an ensemble prediction system lies in choosing a weighting scheme for the individual ensemble members in the computation of the ensemble average.
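For illustration, the moving-average bias correction described above can be sketched in a few lines. This is our own minimal sketch, not code from the paper: the function name `bias_correct`, the array layout, and the use of the most recent `window` verified pairs (operationally, usable pairs lag by the forecast lead time, as in the day 5 example above) are all assumptions.

```python
import numpy as np

def bias_correct(forecasts, observations, window=14):
    """Remove a trailing moving-average forecast-minus-observation
    bias from each ensemble member's raw (DMO) forecast.

    forecasts:    array of shape (days, members), one station and
                  one forecast lead time
    observations: array of shape (days,), the verifying temperatures
    The first `window` days are returned uncorrected because no bias
    estimate is available for them yet (the operational constraint
    noted in the text).
    """
    fcst = np.asarray(forecasts, dtype=float)
    obs = np.asarray(observations, dtype=float)
    corrected = fcst.copy()
    for day in range(window, fcst.shape[0]):
        # Use only forecast-observation pairs already verifiable in
        # real time: days (day - window) .. (day - 1), per member.
        bias = (fcst[day - window:day] - obs[day - window:day, None]).mean(axis=0)
        corrected[day] = fcst[day] - bias
    return corrected
```

With a constant per-member bias, the corrected forecasts collapse onto the observations once the filter has spun up, mirroring the near-zero mean errors reported in Fig. 2.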
In addition to applying equal weighting to all ensemble members by simply averaging all members, Yussouf and Stensrud (2007) explored performance-based weighted BCE schemes wherein unequal weights, based on each individual member's past forecast accuracy, are assigned to each ensemble member prior to calculating the ensemble mean forecast. However, results from the Yussouf and Stensrud (2007) study indicate there is no significant improvement over the original 2-m temperature BCE forecasts when performance-based weighting schemes are applied to the EPS individual members. Therefore, a BCE approach for computing the ensemble average from equally weighted ensemble members is chosen for our study of medium-range temperature forecasts.

FIG. 1. Reference map for the western North American cities analyzed in this study.

3. Results and discussion

A single verification score is generally inadequate for evaluating all of the desired information about the performance of an ensemble prediction system (Murphy and Winkler 1987; Murphy 1991). Different measures, emphasizing different aspects and attributes of forecast performance, should be employed to assess the statistical reliability, resolution, and discrimination of an EPS. A standardized set of evaluation methods, scores, and diagrams for interpreting ensemble forecasts is provided in Hamill et al. (2000) and is incorporated here to assess a MREF system for daily maximum and minimum temperature forecasts. The reader is referred to Wilks (2006) or Toth et al. (2003) for comprehensive descriptions of verification measures.

a. Skill scores

Initial assessment of the forecasts utilized a root-mean-square-error skill score (RMSESS) and a Pearson correlation skill score (CORRSS) evaluated for the ensemble average. A general skill score definition is given by

skill score = (score − score_ref) / (score_perfect − score_ref),   (1)

where score_ref is the score for the reference forecasts.
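Equation (1) is generic: any positively or negatively oriented score can be plugged in. A one-line helper (ours, for illustration only) makes the orientation explicit:

```python
def skill_score(score, score_ref, score_perfect):
    """Generic skill score, Eq. (1): 1 for a perfect forecast, 0 for
    skill equal to the reference, negative when worse than it."""
    return (score - score_ref) / (score_perfect - score_ref)

# For RMSE the perfect score is 0, so an ensemble-average RMSE of
# 1.5 degC against a climatological RMSE of 3.0 degC gives an
# RMSESS of 0.5.
```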
In this case the reference forecasts are the climatological mean values of the maximum and minimum temperatures for each individual city location, for each day, averaged over the 30-yr normal period 1971–2000. Here, score_perfect is the score the forecasts would have received if they were perfect. Skill scores range from zero (for no skill, or skill equivalent to the reference forecasts) to 1 (for perfect forecasts). Negative skill score values indicate the forecasts are worse than the reference forecasts.

RMSESS and CORRSS were evaluated for the average of the 32 ensemble members for each of the forecast days from 1 through 15. The results are shown in Fig. 3. Both the RMSESS and CORRSS for the ensemble average forecast indicate similar results in terms of forecast skill. Forecast skill, as expected, is highest for the day 1 forecast and gradually diminishes. By day 9, forecast skill is negligible compared to the climatological reference forecasts. By day 10 the forecasts show no skill compared to the climatological reference forecasts, and for days 11–15 the NAEFS forecasts have worse skill than do the climatological forecasts. Figure 3 also compares the ensemble average forecast with the deterministic 5-day forecast from the MSC operational Global Environmental Multiscale (GEM) model (Côté et al. 1998a,b). Forecast skill is comparable between the ensemble average and the deterministic forecast early in the forecast period, but the ensemble average outperforms the deterministic forecast in terms of RMSESS and CORRSS as the forecast period progresses. A summary of the equations used for the meteorological statistical analysis of the MREF is provided in the appendix.

FIG. 2. Mean error for ensemble average DMO forecasts (squares) and ensemble average postprocessed forecasts (circles) for (top) daily maximum and (bottom) daily minimum temperatures for forecast days 1–15.

FIG. 3. (top) RMSE and (bottom) correlation skill scores for daily maximum (squares) and daily minimum (circles) temperature forecasts for forecast days 1–15. The ensemble average forecast (solid line) is compared with a deterministic operational forecast (dashed line).

b. Continuous ranked probability score

The continuous ranked probability score (CRPS; Hersbach 2000) is a complete generalization of the Brier score. The Brier score (Brier 1950; Atger 2003; McCollor and Stull 2008b) is a verification measure designed to quantify the performance of probabilistic forecasts of dichotomous event classes. The discrete ranked probability score (RPS; Murphy 1969; Toth et al. 2003) extends the range of the Brier score to more event classes. The CRPS extends the limit of the RPS to an infinite number of classes. The CRPS has several distinct advantages over the Brier score and the RPS. First, the CRPS is sensitive to the entire permissible range of the parameter. In addition, the CRPS does not require a subjective threshold parameter value (as does the Brier score) or the introduction of a number of predefined classes (as does the RPS). Different choices of threshold value (Brier score) or number and position of classes (RPS) result in different outcomes, a weakness in these scores not shared with the CRPS. The CRPS can be interpreted as an integral over all possible Brier scores. A major advantage of the Brier score is that it can be decomposed into a reliability component, a resolution component, and an uncertainty component (Murphy 1973; Toth et al. 2003). In a similar fashion, Hersbach (2000) showed how the CRPS can be decomposed into the same three components:

CRPS = Reli − Resol + Unc.   (2)

The CRPS components can be converted to a positively oriented skill score (CRPSS) in the same manner that the Brier score components are converted to a skill score (BSS; McCollor and Stull 2008b):
CRPSS = Resol/Unc − Reli/Unc = relative resolution − relative reliability = CRPS_RelResol − CRPS_RelReli.

CRPSS results are shown in Fig. 4. The CRPSS was calculated on forecast and observation anomalies, defined as the difference between the forecast or the observation and the climatological value for that city location on a particular date. The forecasts are reliable throughout the forecast period, as indicated by the CRPS_RelReli term being near zero for all forecasts. The CRPS_RelResol term, and hence the CRPSS, is highest for the day 1 forecast and exhibits diminishing skill as the forecast period extends through the day 12 forecast. For forecast days 13–15, both maximum and minimum daily temperature forecasts exhibit negligible or no skill compared to the reference term.

FIG. 4. Continuous ranked probability skill scores for ensemble (top) daily maximum and (bottom) daily minimum temperature anomaly forecasts. The CRP skill score (circles) equals the relative resolution (x markers) less the relative reliability (asterisks).

c. ROC area

The relative operating characteristic (ROC) is a verification measure based on signal detection theory (Mason 1982; Mason and Graham 1999; Gallus et al. 2007) that provides a method of discrimination analysis for an EPS. The ROC curve is a graph of hit rate versus false alarm rate for a particular variable threshold (see the appendix for definitions of these terms and Table 2 for the contingency table basis for this analysis). Perfect discrimination is represented by a ROC curve that rises from (0,0) along the y axis to (0,1) and then horizontally to the upper-right corner (1,1). The diagonal line from (0,0) to (1,1) represents zero skill, meaning the forecasts show no discrimination among events.

TABLE 2. Contingency table for hit rate and false alarm rate calculations. The counts a, b, c, and d are the numbers of events in each category, out of N total events.

                Observed   Not observed
Forecast           a            b
Not forecast       c            d
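From the counts in Table 2, the hit rate H and false alarm rate F that define each ROC point follow the conventional definitions (the paper's appendix spells these out); the helper names below are our own:

```python
def hit_rate(a, b, c, d):
    """H = hits / total observed events (stratified on observations)."""
    return a / (a + c)

def false_alarm_rate(a, b, c, d):
    """F = false alarms / total observed non-events."""
    return b / (b + d)

# Example: a = 8 hits, b = 2 false alarms, c = 2 misses,
# d = 8 correct rejections -> H = 0.8, F = 0.2.
```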
The area under the ROC curve can then be converted to a skill measure as in Eq. (1), where score_ref is the area under the zero-skill diagonal line (ROC_Area = 0.5) and score_perfect is the area under the full ROC curve, equal to 1. The ROC area skill score is then ROCSS = 2 ROC_Area − 1.

The hit rate and false alarm rate contingency table was evaluated on forecast and observation anomaly thresholds. The ROC daily maximum temperature anomaly threshold was defined as 5°C over the climatological value for the day (+5°C anomaly threshold), and the daily minimum temperature anomaly threshold was 5°C below the climatological value for the day (−5°C anomaly threshold). A 5°C anomaly threshold was chosen because this value produces a significant deviation from the normal electrical load in cities and is therefore an important factor for planning generation requirements and power import–export agreements. A 10°C anomaly threshold would produce an even greater impact on generation planning and power import–export, but there was not enough data available in our study to produce reliable results. In a future study we would like to acquire enough data to examine forecast skill for these temperature extremes.

Results for the ROCSS for the 5°C temperature anomaly threshold for the full range of the 15-day forecast period are shown in Fig. 5. Actual ROC area curves for the daily maximum forecasts for a subset of the forecast period are shown in Fig. 6. The ROCSS indicates that both maximum and minimum daily forecasts exhibit consistent and very good discrimination through forecast day 6. Both maximum and minimum daily forecasts exhibit diminishing, but measurable, discrimination skill from forecast day 7 through forecast day 12. The daily minimum temperature forecasts show slightly higher skill than the daily maximum temperature forecasts through this period.
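A minimal sketch of how the ROC area and ROCSS = 2 ROC_Area − 1 can be computed from ensemble-derived event probabilities; the threshold sweep and trapezoidal integration here are standard practice assumed for illustration, not code taken from the paper:

```python
def roc_skill_score(probs, observed):
    """Sweep the decision threshold over the forecast probabilities,
    collect (false alarm rate, hit rate) points, integrate the area
    under the curve by the trapezoidal rule, and rescale so that the
    zero-skill diagonal (area 0.5) maps to 0 and a perfect curve
    (area 1) maps to 1."""
    n_event = sum(1 for o in observed if o)
    n_non = len(observed) - n_event
    points = [(0.0, 0.0)]
    for t in sorted(set(probs), reverse=True):
        hits = sum(1 for p, o in zip(probs, observed) if p >= t and o)
        fas = sum(1 for p, o in zip(probs, observed) if p >= t and not o)
        points.append((fas / max(n_non, 1), hits / max(n_event, 1)))
    points.append((1.0, 1.0))
    area = sum((f2 - f1) * (h1 + h2) / 2.0
               for (f1, h1), (f2, h2) in zip(points, points[1:]))
    return 2.0 * area - 1.0
```

Perfectly separated probabilities give ROCSS = 1; a constant forecast collapses the curve onto the diagonal and gives 0.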
For forecast days 13–15, the ROCSS is close to zero and somewhat noisy, indicating no skill in the forecasts.

d. Economic value

The potential economic value for the cost–loss decision model (Murphy 1977; Katz and Murphy 1997; Richardson 2000; Zhu et al. 2002; Richardson 2003; McCollor and Stull 2008c) is uniquely determined from ROC curves. The cost–loss ratio (C/L) model shows how probabilistic forecasts can be used in the decision-making process, and provides a measure of the benefit of probabilistic forecasts over single ensemble average forecasts.

Figure 7 shows forecast value (Richardson 2003) as a function of C/L value for daily maximum temperature forecasts with a +5°C anomaly threshold. Maximum value occurs at the climate frequency for the event, and users with a C/L value equal to the climate frequency derive the most value from the forecasts. Figure 7 also shows the value calculated for the ensemble average forecasts. The maximum value for the ensemble average forecasts is less than the maximum value for the forecasts incorporating the full suite of ensemble members. The discrepancy between the maximum value for the ensemble forecasts and the maximum value for the ensemble average forecasts widens with projected forecast period. Additionally, the range of users (described by different C/L ratios) that gain benefit (show positive value) from the forecasts is wider for the full ensemble forecasts than for the ensemble average forecasts. A graph depicting the decay of forecast maximum value with projected forecast period is shown in Fig. 8. The maximum value is slightly higher for minimum temperature forecasts than for maximum temperature forecasts, and the forecasts lose very little of their initial value through forecast day 6.

FIG. 5. ROC area skill score for ensemble daily maximum temperature (squares) forecasts with a +5°C anomaly threshold and daily minimum temperature (circles) with a −5°C anomaly threshold.
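The potential economic value curve underlying Figs. 7 and 8 can be written directly from one ROC point and the climatological base rate s. The formula below is the standard relative-value expression from the cost–loss literature cited above (Richardson 2000); the function name and example numbers are our own assumptions:

```python
def relative_value(hit, far, base_rate, cost_loss):
    """Potential economic value V for a user with cost-loss ratio
    C/L, given a hit rate, a false alarm rate, and event base rate s.
    V = 1 for perfect forecasts; V = 0 for the better of the two
    climatological strategies (always protect / never protect)."""
    a, s = cost_loss, base_rate
    numer = min(a, s) - far * a * (1.0 - s) + hit * s * (1.0 - a) - s
    denom = min(a, s) - s * a
    return numer / denom

# A perfect forecast system (hit = 1, far = 0) has V = 1 for every
# C/L; sweeping C/L for fixed (hit, far) traces one curve of Fig. 7,
# peaking where C/L equals the climatological base rate.
```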
The maximum value lessens but remains positive through forecast day 11. Forecasts for days 12–15 indicate that the value of these forecasts is little better than climatology.

Plotting forecast value as a function of probability threshold pt for different C/L values (Fig. 9) is a way of showing how different forecast users can benefit individually from the same set of ensemble forecasts. The probability threshold pt is the point at which a decision maker should take action on the basis of the forecast to derive the most value from the forecasts. Low C/L ratio forecast users (C/L = 0.2 in Fig. 9) can take advantage of the ensemble forecasts by taking action at very low forecast probabilities through forecast day 8. High C/L forecast users (C/L = 0.8) must wait until the forecast probability is much higher, and can gain positive value from the forecasts only through forecast day 4.

FIG. 6. ROC area plots for a subset of the forecast period for a daily maximum temperature anomaly +5°C threshold. The plots range from an ROC curve for days 2 (dotted line), 4 (dot–dashed line), 6 (solid line), 8 (heavy dotted line), 10 (heavy dot–dashed line), and 12 (heavy solid line). The dashed line is the zero-skill line.

e. Equal likelihood

In a perfectly realistic and reliable EPS, the range of the ensemble forecast members represents the full spread of the probability distribution of observations, and each member of the ensemble exhibits an equal likelihood of occurrence. Actual EPSs are rarely that perfect. The rank histogram, also known as a Talagrand diagram (Anderson 1996; Talagrand et al. 1998), is a useful measure of equal likelihood and of reliability (Hou et al. 2001; Candille and Talagrand 2005) of an EPS. A perfect EPS in which each ensemble member is equally likely to verify for any particular forecast exhibits a flat rank histogram. Ensemble forecasts with insufficient spread to adequately represent the probability distribution of observations exhibit a U-shaped histogram. Rank histograms derived from a particular EPS can also exhibit an inverted-U shape, indicative of too much spread in the ensemble members compared to the observations. Rank histograms can also be L shaped (warm bias) or J shaped (cool bias). Operational ensemble weather forecasting systems in the medium lead-time range (3–10 days ahead) exhibit U-shaped analysis rank histograms, implying the ensemble forecasts underestimate the true uncertainty in the forecasts (Toth et al. 2003).

A subset of rank histogram plots for the EPS daily maximum temperature forecasts is provided in Fig. 10, for forecast days 2, 4, 6, 8, 10, and 12. These plots show that early in the forecast period, especially through forecast day 4, the rank histograms do exhibit a U shape, with excess values in the outermost bins. This means the EPS is underdispersive through forecast day 4; the spread in the EPS is not indicative of the spread in the actual observations. However, the rank histogram is symmetric, confirming that the bias-corrected temperature forecasts are indeed unbiased. Forecast days 6–12 display a progressively flatter histogram, indicating that the EPS spread increases with forecast lead time and essentially matches the spread of the observations beyond forecast day 6. Similar results (Fig. 11) were obtained from a rank histogram analysis of the EPS daily minimum temperature forecasts. However, this better spread is associated with worse skill, as shown previously in sections 3a and 3b.

FIG. 7. Forecast value as a function of C/L ratio for ensemble daily maximum forecasts (+5°C anomaly threshold), comparing the value of the full ensemble to the ensemble mean. The solid line represents the value of the full ensemble and the dotted line represents the value of the ensemble mean.
A measure of the deviation of the rank histogram from flatness is given in Candille and Talagrand (2005) for an EPS with M members and N available forecast–observation pairs. The number of values in the ith interval of the histogram is given by s_i. For a flat histogram generated from a reliable EPS, s_i = N/(M + 1) for each interval i. The quantity

D = sum from i = 1 to M + 1 of [s_i − N/(M + 1)]^2   (3)

measures the deviation of the histogram from a horizontal line. For a reliable system, the base value is D0 = NM/(M + 1). The ratio

d = D/D0   (4)

is evaluated as the overall measure of the flatness of an ensemble prediction system rank histogram. A value of d that is significantly larger than 1 indicates the system does not reflect the equal likelihood of ensemble members. A value of d that is closer to 1 indicates that the ensemble forecasts are tending toward an equal likelihood of occurrence and, therefore, are more reliable.

Figure 12 provides an analysis of the rank histograms for the MREFs studied here. Early in the forecast period, for forecast days 1–3, d is much greater than 10. Therefore, the MREF system for forecast days 1–3 exhibits insufficient, but successively increasing, spread to adequately represent the variability in the observations. For forecast days 4 and 5, the MREF exhibits successively smaller d values, but still greater than 5, so that evidence of slight underdispersion remains. For forecast days beyond day 5, the d values derived from this MREF system (values between 1 and 5), as indicated in Figs. 10–12, indicate that the ensemble spread is very slightly underdispersive, though nearly sufficient spread is realized in the ensembles to represent the observations in the analysis.

FIG. 8. C/L ratio maximum value for ensemble daily maximum temperatures (squares) with a +5°C anomaly threshold and minimum temperatures (circles) with a −5°C anomaly threshold.
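The rank histogram and the flatness measure of Eqs. (3) and (4) can be sketched as follows; the function names and toy inputs are our own illustrations under the definitions above:

```python
import numpy as np

def rank_histogram(ens_forecasts, observations):
    """Count the rank of each observation within its sorted ensemble;
    M members yield M + 1 bins (ranks)."""
    ens = np.asarray(ens_forecasts, dtype=float)   # (cases, members)
    obs = np.asarray(observations, dtype=float)    # (cases,)
    counts = np.zeros(ens.shape[1] + 1, dtype=int)
    for members, o in zip(ens, obs):
        counts[np.searchsorted(np.sort(members), o)] += 1
    return counts

def flatness_ratio(counts):
    """d = D/D0 of Eqs. (3)-(4): D is the squared deviation of the
    histogram from flatness, and D0 = N*M/(M + 1) is its expectation
    for a reliable EPS. The reciprocal 1/d is used later in the text
    as a spread score."""
    counts = np.asarray(counts, dtype=float)
    n, m = counts.sum(), counts.size - 1
    dev = ((counts - n / (m + 1)) ** 2).sum()
    return dev / (n * m / (m + 1))
```

A U-shaped histogram (observations piling into the outermost bins) drives d well above 1, matching the underdispersion seen for the early forecast days.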
Spread does improve as the forecast period increases; however, this improvement in spread is negated by the deterioration in skill at the longer forecast periods. This skill versus spread relationship is demonstrated in a trade-off diagram, described in the following section.  SKILL–SPREAD  TRADE-OFF DIAGRAM  We define a new spread score as spread score 5 1/d.  (5)  A plot of the skill score on the ordinate and this spread score on the abscissa (as the forecast projection progresses from day 1 through day 15) shows the tradeoff between higher skill but lower spread early in the forecast period and degraded skill but better spread later in the forecast period. A skill–spread trade-off  11  diagram with the CRPSS versus the spread score is shown in Fig. 13, and a similar trade-off diagram displaying the ROC skill score (for the 58C temperature anomaly threshold) is shown in Fig. 14. Both Figs. 13 and 14 display the changes in skill and spread with forecast projection, for both daily minimum and daily maximum temperature forecasts. Beyond forecast day 11 or 12 (indicated by a dotted line in these figures), the skill becomes negligible and the spread stops improving, or is even reduced in some cases. The authors suggest that this relationship between skill and spread could be explored further with the aid of this skill– spread trade-off diagram, with different parameters, more datasets, and other skill measures. One interesting application of this skill–spread diagram would analyze the variation of skill with spread as the number of individual members in an ensemble prediction system is varied. The diagram could be used, for example, to find the fewest number of members needed to reach a specific level of skill, for a specific forecast projection, in order to optimize the use of computer resources in ensemble system design.  4. 
4. Summary and conclusions

The skill and value of medium-range NAEFS temperature forecasts are evaluated here for the 7-month period of 1 March–1 October 2007. However, caution must be used when objectively assessing the quality of ensemble forecasts. Ensemble forecast systems must be verified over many cases. As a result, the scoring metrics described here are susceptible to several sources of noise (Hamill et al. 2000). For example, improper estimates of probabilities will arise from small-sized ensembles. Also, insufficient variety and number of cases will lead to statistical misrepresentation. Finally, imperfect observations make true forecast evaluation impossible.

Two standard verification skill score measures designed for continuous deterministic forecast variables, an RMSE skill score and a Pearson correlation skill score, were applied to the mean of the bias-corrected ensemble forecasts. These measures both showed that the ensemble means of these city temperature forecasts from the NAEFS possessed skill (compared to climatology) through forecast day 9, though for forecast days 8 and 9 the forecast skill was marginal. These skill scores also indicated that daily maximum temperature forecasts were slightly more skillful than daily minimum temperature forecasts for the period analyzed. This study then incorporated verification measures specifically designed for probabilistic forecasts generated from the bias-corrected set of all 32 ensemble members.

FIG. 9. Value as a function of probability threshold p_t for C/L ratios of 0.2 (solid), 0.4 (dash), 0.6 (dot), and 0.8 (dot–dash). The forecasts are daily maximum temperature forecasts with a +5°C anomaly threshold for (top to bottom) days 2, 4, 6, 8, 10, and 12.

The continuous ranked probability score (CRPS), particularly useful for evaluating a continuous ensemble variable, indicated skill in the EPS through day 12.
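For a single ensemble forecast, the CRPS can be computed directly from the members via the kernel form CRPS = E|X − y| − ½E|X − X′|, which is equivalent to the integral of the squared difference between the forecast and observed cumulative distribution functions (cf. Hersbach 2000). The sketch below is illustrative only, not the verification code used in this study:

```python
def crps_ensemble(members, y):
    """CRPS for one M-member ensemble forecast and verifying observation y,
    using the kernel form CRPS = E|X - y| - 0.5 * E|X - X'|."""
    M = len(members)
    term1 = sum(abs(m - y) for m in members) / M          # mean distance to the observation
    term2 = sum(abs(a - b) for a in members
                for b in members) / (2.0 * M * M)          # half the mean pairwise member distance
    return term1 - term2

def crpss(crps_forecast, crps_reference):
    """Skill score of the forecast CRPS against a reference (e.g., climatology)."""
    return 1.0 - crps_forecast / crps_reference
```

Averaging `crps_ensemble` over all forecast–observation pairs and comparing against a climatological reference via `crpss` yields the skill score plotted against forecast day.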
The CRPS also supported the previous finding for the ensemble mean that daily maximum temperature forecasts are slightly more skillful than the daily minimum forecasts.

The ROC area skill score, designed to indicate the ability of the EPS to discriminate among different events, showed that the EPS forecasts were indeed skillful (for 5°C temperature anomaly thresholds) in this regard through day 12.

FIG. 10. Rank histograms for the ensemble of daily maximum temperatures for (top to bottom) days 2, 4, 6, 8, 10, and 12.

ROC evaluation is characterized by stratification by observation, and discrimination is based on the probabilities of the forecasts conditioned on different observed values, or likelihood-base rate factorization in the parlance of the Murphy and Winkler general framework for forecast verification (Murphy and Winkler 1987; Potts 2003). ROC analysis leads to economic evaluation through a cost–loss measure of value. Cost–loss analysis indicated that, overall, the EPS provides some economic value through day 12. However, the maximum value, assigned to the user whose C/L value is equal to the climatological base rate for the event in question, diminishes markedly through this forecast range. Additionally, the range of users, defined by different C/L ratios, diminishes considerably beyond forecast day 6. The C/L analysis also demonstrates that the full suite of ensemble forecasts provides value to a greater range of users than the single ensemble-average forecast.

Rank histograms provide a method to test if the dispersion, or spread, of the ensemble forecasts represents the dispersion in the actual observations, a vital component in a reliable EPS.

FIG. 11. As in Fig. 10, but for minimum temperatures.

In the NAEFS forecast sample examined here, the forecasts were decreasingly underdispersive early in the forecast cycle, for forecast days 1–5.
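The cost–loss evaluation can be illustrated with the standard relative-value formulation of the static cost–loss model (cf. Richardson 2000). The function below is a sketch under that model, not the authors' implementation; H and F are the hit and false alarm rates defined in the appendix, s is the climatological base rate, and alpha is the user's cost–loss ratio C/L:

```python
def relative_value(H, F, s, alpha):
    """Relative economic value in the static cost-loss model.
    Expenses are expressed per unit loss L; 0 < alpha < 1."""
    e_climate = min(alpha, s)   # cheaper of always protecting vs. never protecting
    e_perfect = s * alpha       # protect exactly when the event occurs
    # Acting on the forecast: protection cost on hits and false alarms,
    # full loss on misses.
    e_forecast = alpha * (H * s + F * (1.0 - s)) + (1.0 - H) * s
    # 1 = perfect forecast; 0 = no better than climatology; negative = worse.
    return (e_climate - e_forecast) / (e_climate - e_perfect)
```

Consistent with the discussion above, this formulation gives its maximum value to the user whose C/L ratio equals the base rate s, and the value falls off for users with ratios far from s.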
The forecasts exhibited marginal underdispersion for forecast days 6 and beyond. A new trade-off diagram is proposed to show the interplay between skill and spread.

One aspect of interest uncovered in this study is the comparison of forecast skill between daily maximum and daily minimum temperature forecasts. The skill scores evaluated for the ensemble mean forecast (RMSESS and CORRSS) indicated that daily maximum temperature forecasts were somewhat more skillful than daily minimum temperature forecasts through the first 9 days of the forecast period (Fig. 3). These skill scores indicate no skill in the forecasts, for either daily maximum or daily minimum forecasts, for forecast day 10 (compared to the climatological reference forecasts). Beyond day 10, the MREF forecasts fare worse than climatological forecasts, with daily maximum temperature forecasts now proving worse than daily minimum temperature forecasts. Daily maximum temperature forecasts also prove slightly better than daily minimum temperature forecasts

FIG. 12. Rank histogram δ scores for ensemble daily maximum (squares) and daily minimum (circles) temperature forecasts.

when compared using the CRPSS (Fig. 4), throughout the 15-day range of MREF forecasts. The ROCSS, on the other hand, indicates daily minimum temperature forecasts show better discrimination than daily maximum temperature forecasts (Fig. 5). In terms of forecast spread, both daily maximum and daily minimum temperature forecasts possess similar spread for all forecasts beyond day 1 (Fig. 12). We do not have enough information to determine if the difference in skill between the daily maximum and daily minimum temperature forecasts is an artifact of the scoring metric, the models in the MREF, or the fact that the 0000 and 1200 UTC

FIG. 13. A trade-off diagram of CRPSS vs spread score (1/δ) for daily maximum temperature (squares) and daily minimum temperature (circles) forecast anomalies.

FIG. 14.
As in Fig. 13, but for ROCSS vs spread score (1/δ).

forecasts do not necessarily correspond to the times of maximum and minimum daily temperatures. Investigation of this difference in skill between maximum and minimum temperature forecasts is an avenue for future research.

In summary, the NAEFS forecasts analyzed in this study provide skill in the deterministic ensemble average through forecast day 9. Employing the full range of 32 ensemble members to provide probabilistic forecasts of daily maximum and minimum temperature extended the skill of the NAEFS through an additional three forecast days, through day 12. The authors suggest that an investigation of the skill and value of the current 42-member NAEFS configuration of bias-corrected minimum and maximum temperatures would also prove beneficial to users. Summarizing the findings for the ensemble average forecasts, and in comparing MREF forecasts in a continuous mode, daily maximum temperature forecasts were slightly more skillful than daily minimum temperature forecasts. Shifting to discriminating ability (using ±5°C anomaly thresholds), daily minimum temperature forecasts provided slightly better discriminating ability than daily maximum temperature forecasts.

Many sectors of the economy, such as hydroelectric energy generation, transmission, and distribution, are susceptible to financial risk driven by changing temperature regimes. This study has shown that business managers savvy enough to incorporate ensemble temperature forecasts provided by the NAEFS can mitigate that risk quite far into the future (potentially 12 days, depending on user requirements).

Acknowledgments. The authors thank the Meteorological Service of Canada and the National Weather Service for making available the temperature observations necessary to perform this study.
The authors would also like to thank the national weather agencies of Canada (the Meteorological Service of Canada), the United States (the National Weather Service), and Mexico (the National Meteorological Service of Mexico) for providing public access to forecasts from the North American Ensemble Forecast System. We thank Uwe Gramman of Mountain Weather Services of Smithers, British Columbia, for writing the scripts to access the NAEFS forecasts. Additional support was provided by the Natural Sciences and Engineering Research Council of Canada.

APPENDIX

Equations for Meteorological Statistical Analysis

Given f_k as forecast values, y_k as observed values, f̄ as the mean forecast value, ȳ as the mean observed value, and N as the number of forecast–observation pairs, the following definitions apply:

mean error (ME),

    ME = f̄ − ȳ;    (A1)

mean square error (MSE),

    MSE = (1/N) Σ_{k=1}^{N} (f_k − y_k)²;    (A2)

root-mean-square error (RMSE),

    RMSE = √MSE;    (A3)

and Pearson correlation (r),

    r = Σ_{k=1}^{N} (f_k − f̄)(y_k − ȳ) / √[ Σ_{k=1}^{N} (f_k − f̄)² · Σ_{k=1}^{N} (y_k − ȳ)² ].    (A4)

Contingency table analysis equations for Table 2 are given below:

hit rate (H),

    H = a/(a + c);    (A5)

and false alarm rate (F),

    F = b/(b + d).    (A6)

REFERENCES

Anderson, J. L., 1996: A method for producing and evaluating probabilistic precipitation forecasts from ensemble model integrations. J. Climate, 9, 1518–1529.
Atger, F., 2003: Spatial and interannual variability of the reliability of ensemble-based probabilistic forecasts: Consequences for calibration. Mon. Wea. Rev., 131, 1509–1523.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1–3.
Buizza, R., P. L. Houtekamer, Z. Toth, G. Pellerin, M. Wei, and Y.
Zhu, 2005: A comparison of the ECMWF, MSC, and NCEP global ensemble prediction systems. Mon. Wea. Rev., 133, 1076–1097.
Candille, G., and O. Talagrand, 2005: Evaluation of probabilistic prediction systems for a scalar variable. Quart. J. Roy. Meteor. Soc., 131, 2131–2150.
Côté, J., J.-G. Desmarais, S. Gravel, A. Méthot, A. Patoine, M. Roch, and A. Staniforth, 1998a: The operational CMC–MRB Global Environmental Multiscale (GEM) model. Part I: Design considerations and formulation. Mon. Wea. Rev., 126, 1373–1395.
——, ——, ——, ——, ——, ——, and ——, 1998b: The operational CMC–MRB Global Environmental Multiscale (GEM) model. Part II: Results. Mon. Wea. Rev., 126, 1397–1418.
Eckel, F. A., and C. F. Mass, 2005: Aspects of effective mesoscale, short-range ensemble forecasting. Wea. Forecasting, 20, 328–350.
Gallus, W. A., Jr., M. E. Baldwin, and K. L. Elmore, 2007: Evaluation of probabilistic precipitation forecasts determined from Eta and AVN forecasted amounts. Wea. Forecasting, 22, 207–215.
Hamill, T. M., S. L. Mullen, C. Snyder, Z. Toth, and D. P. Baumhefner, 2000: Ensemble forecasting in the short to medium range: Report from a workshop. Bull. Amer. Meteor. Soc., 81, 2653–2664.
Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559–570.
Hou, D., E. Kalnay, and K. K. Droegemeier, 2001: Objective verification of the SAMEX '98 ensemble forecasts. Mon. Wea. Rev., 129, 73–91.
Houtekamer, P. L., L. Lefaivre, J. Derome, H. Ritchie, and H. L. Mitchell, 1996: A system simulation approach to ensemble prediction. Mon. Wea. Rev., 124, 1225–1242.
Jones, M. S., B. A. Colle, and J. S. Tongue, 2007: Evaluation of a mesoscale short-range ensemble forecast system over the northeast United States. Wea. Forecasting, 22, 36–55.
Katz, R. W., and A. H. Murphy, 1997: Economic Value of Weather and Climate Forecasts. Cambridge University Press, 222 pp.
Mason, I., 1982: A model for assessment of weather forecasts.
Aust. Meteor. Mag., 30, 291–303.
——, and N. E. Graham, 1999: Conditional probabilities, relative operating characteristics, and relative operating levels. Wea. Forecasting, 14, 713–725.
McCollor, D., and R. Stull, 2008a: Hydrometeorological accuracy enhancement via postprocessing of numerical weather forecasts in complex terrain. Wea. Forecasting, 23, 131–144.
——, and ——, 2008b: Hydrometeorological short-range ensemble forecasts in complex terrain. Part I: Meteorological evaluation. Wea. Forecasting, 23, 533–556.
——, and ——, 2008c: Hydrometeorological short-range ensemble forecasts in complex terrain. Part II: Economic evaluation. Wea. Forecasting, 23, 557–574.
Murphy, A. H., 1969: On the "ranked probability score." J. Appl. Meteor., 8, 988–989.
——, 1973: A new vector partition of the probability score. J. Appl. Meteor., 12, 595–600.
——, 1977: The value of climatological, categorical, and probabilistic forecasts in the cost–loss ratio situation. Mon. Wea. Rev., 105, 803–816.
——, 1991: Forecast verification: Its complexity and dimensionality. Mon. Wea. Rev., 119, 1590–1601.
——, and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330–1338.
Potts, J. M., 2003: Basic concepts. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. Jolliffe and D. B. Stephenson, Eds., Wiley, 13–36.
Richardson, D. S., 2000: Skill and relative economic value of the ECMWF Ensemble Prediction System. Quart. J. Roy. Meteor. Soc., 126, 649–667.
——, 2003: Economic value and skill. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. Jolliffe and D. B. Stephenson, Eds., Wiley, 164–187.
Roulin, E., and S. Vannitsem, 2005: Skill of medium-range hydrological ensemble predictions. J. Hydrometeor., 6, 729–744.
Stensrud, D. J., J. Bao, and T. T. Warner, 2000: Using initial conditions and model physics perturbations in short-range ensemble simulations of mesoscale convective systems.
Mon. Wea. Rev., 128, 2077–2107.
Talagrand, O., R. Vautard, and B. Strauss, 1998: Evaluation of probabilistic prediction systems. Proc. Workshop on Predictability, Reading, United Kingdom, ECMWF, 1–25.
Tennant, W. J., Z. Toth, and K. J. Rae, 2007: Application of the NCEP Ensemble Prediction System to medium-range forecasting in South Africa: New products, benefits, and challenges. Wea. Forecasting, 22, 18–35.
Toth, Z., and E. Kalnay, 1993: Ensemble forecasting at NMC: The generation of perturbations. Bull. Amer. Meteor. Soc., 74, 2317–2330.
——, O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and ensemble forecasts. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. Jolliffe and D. B. Stephenson, Eds., Wiley, 137–163.
——, and Coauthors, 2006: The North American Ensemble Forecast System (NAEFS). Preprints, 18th Conf. on Probability and Statistics in the Atmospheric Sciences, Atlanta, GA, Amer. Meteor. Soc., 4.1. [Available online at http://ams.confex.com/ams/pdfpapers/102588.pdf.]
Wei, M., Z. Toth, R. Wobus, and Y. Zhu, 2008: Initial perturbations based on the ensemble transform (ET) technique in the NCEP Global Operational Forecast System. Tellus, 60A, 62–79.
Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2nd ed. Academic Press, 627 pp.
Yussouf, N., and D. J. Stensrud, 2007: Bias-corrected short-range ensemble forecasts of near-surface variables during the 2005/06 cool season. Wea. Forecasting, 22, 1274–1286.
Zhu, Y., Z. Toth, R. Wobus, D. Richardson, and K. Mylne, 2002: The economic value of ensemble-based weather forecasts. Bull. Amer. Meteor. Soc., 83, 73–83.
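For readers who wish to reproduce the deterministic verification measures defined in the appendix, the formulas in Eqs. (A1)–(A6) translate directly into code. The following is a minimal sketch with hypothetical function names:

```python
import math

def verification_stats(f, y):
    """ME, MSE, RMSE, and Pearson r for paired forecasts f and
    observations y, following Eqs. (A1)-(A4)."""
    N = len(f)
    fbar = sum(f) / N
    ybar = sum(y) / N
    me = fbar - ybar                                        # Eq. (A1)
    mse = sum((fk - yk) ** 2 for fk, yk in zip(f, y)) / N   # Eq. (A2)
    rmse = math.sqrt(mse)                                   # Eq. (A3)
    num = sum((fk - fbar) * (yk - ybar) for fk, yk in zip(f, y))
    den = math.sqrt(sum((fk - fbar) ** 2 for fk in f)
                    * sum((yk - ybar) ** 2 for yk in y))
    r = num / den                                           # Eq. (A4)
    return me, mse, rmse, r

def hit_and_false_alarm_rates(a, b, c, d):
    """H and F from a 2x2 contingency table with a hits, b false alarms,
    c misses, and d correct rejections, following Eqs. (A5)-(A6)."""
    return a / (a + c), b / (b + d)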

